The Problem

"The WAN Show" podcast by LMG has always been my favourite podcast to listen to. So much so that whenever I hear about something new in tech, I just wait for Linus and Luke to talk about it on the WAN Show. This was the motivation behind WANsearch: if I could search for a topic discussed on the show and, better yet, get the exact time within the video, it would be great.

WANsearch

The obvious next question is "why not just use YouTube search?" Well, there are a couple of problems. The first is that it does not index what is said within the video. The second is that it doesn't show where a certain phrase or word was said. WANsearch solves both of these problems.

Technology behind WANsearch

The engine is written in Go, with the Gin framework providing the API, because Go is fast and easy to work with. The frontend is built with Svelte. The data lives in a single-file SQLite database.

How search engines work

Search engines are programs that help users find information in a collection of documents. They work by indexing these documents and creating a large database of words and phrases. When a user enters a query, the search engine searches its index and retrieves the most relevant documents. In the case of WANsearch, these documents are YouTube subtitles from WAN Show episodes, downloaded using yt-dlp. A stripped-down version of these files (just the words) was used to create the database. WANsearch uses the following techniques to find and retrieve relevant videos.

Reverse indexing

Reverse indexing is a fundamental technique used by search engines. It involves creating an inverted index, which is a data structure that maps words and phrases to the documents (in this case, YouTube video subtitles) where they appear. This allows the search engine to quickly find all the videos that contain a specific word or phrase.
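
To make the idea concrete, here is a minimal in-memory sketch in Go. It is not WANsearch's actual code: the type and function names (InvertedIndex, BuildIndex, Lookup) and the sample video IDs are made up for illustration, and the real index lives in SQLite rather than in a map.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// InvertedIndex maps each word to the set of document IDs that contain it.
// (Hypothetical type, for illustration only.)
type InvertedIndex map[string]map[string]struct{}

// BuildIndex lowercases each document, splits it on whitespace, and records
// which documents every word appears in.
func BuildIndex(docs map[string]string) InvertedIndex {
	index := make(InvertedIndex)
	for id, text := range docs {
		for _, word := range strings.Fields(strings.ToLower(text)) {
			if index[word] == nil {
				index[word] = make(map[string]struct{})
			}
			index[word][id] = struct{}{}
		}
	}
	return index
}

// Lookup returns the sorted IDs of all documents that contain the word.
func (idx InvertedIndex) Lookup(word string) []string {
	matches := idx[strings.ToLower(word)]
	ids := make([]string, 0, len(matches))
	for id := range matches {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	// Made-up video IDs and subtitle text.
	docs := map[string]string{
		"video1": "linus talks about the new gpu launch",
		"video2": "luke reviews a new laptop",
	}
	idx := BuildIndex(docs)
	fmt.Println(idx.Lookup("new"))    // [video1 video2]
	fmt.Println(idx.Lookup("laptop")) // [video2]
}
```

The point of the structure is that a query only touches the entry for the searched word instead of scanning every subtitle file.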

Tf-idf

Tf-idf (term frequency-inverse document frequency) is an algorithm used by search engines to weigh the importance of a word in a document. It considers two factors:

  • Term Frequency (tf): How often a word appears in a document.
  • Inverse Document Frequency (idf): How rare the word is across all documents in the collection (i.e., all YouTube video subtitles). Words that appear in fewer documents get a higher weight.

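Tf-idf helps WANsearch prioritize documents: a video ranks higher when the searched word appears often in its subtitles but is rare across the other episodes.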
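
The two factors above can be sketched in a few lines of Go. This uses one common textbook variant (tf as a relative count, idf as log(N / df)); I am not claiming it is the exact weighting WANsearch uses, and the function names and sample sentences are invented for the example.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// TermFrequency returns how often term occurs in doc, relative to the
// document's length.
func TermFrequency(term string, doc []string) float64 {
	if len(doc) == 0 {
		return 0
	}
	count := 0
	for _, w := range doc {
		if w == term {
			count++
		}
	}
	return float64(count) / float64(len(doc))
}

// InverseDocumentFrequency returns log(N / df), where N is the number of
// documents and df is the number of documents containing term.
// It returns 0 when no document contains the term.
func InverseDocumentFrequency(term string, docs [][]string) float64 {
	df := 0
	for _, doc := range docs {
		for _, w := range doc {
			if w == term {
				df++
				break
			}
		}
	}
	if df == 0 {
		return 0
	}
	return math.Log(float64(len(docs)) / float64(df))
}

// TfIdf scores term for one document within the whole collection.
func TfIdf(term string, doc []string, docs [][]string) float64 {
	return TermFrequency(term, doc) * InverseDocumentFrequency(term, docs)
}

func main() {
	// Made-up subtitle snippets, already lowercased.
	docs := [][]string{
		strings.Fields("the new gpu is fast"),
		strings.Fields("the new laptop is thin"),
		strings.Fields("the gpu shortage continues"),
	}
	// "the" appears in every document, so its idf (and score) is zero.
	// "laptop" appears in only one document, so it scores higher there.
	fmt.Printf("the:    %.3f\n", TfIdf("the", docs[1], docs))
	fmt.Printf("laptop: %.3f\n", TfIdf("laptop", docs[1], docs))
}
```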

Quoted searches on WANsearch

Quoted search, that is, searching for an exact phrase, is relatively easy. In the backend, a simple SQL LIKE statement is executed to find the relevant video(s).
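
As a rough sketch of what such a query could look like from Go: the table and column names (subtitles, video_id, content) are assumptions, not the real WANsearch schema, and the helper LikePattern is hypothetical. The one detail worth showing is escaping, since % and _ are wildcards inside a LIKE pattern.

```go
package main

import (
	"fmt"
	"strings"
)

// phraseQuery is a parameterized query against an assumed table
// subtitles(video_id, content). The ESCAPE clause makes backslash the
// escape character so literal % and _ can be matched.
const phraseQuery = `SELECT video_id FROM subtitles WHERE content LIKE ? ESCAPE '\'`

// LikePattern escapes LIKE wildcards in the phrase and wraps it in %
// so the phrase can match anywhere inside the subtitle text.
func LikePattern(phrase string) string {
	replacer := strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`)
	return "%" + replacer.Replace(phrase) + "%"
}

func main() {
	fmt.Println(phraseQuery)
	fmt.Println(LikePattern("framework laptop")) // %framework laptop%
	fmt.Println(LikePattern("100% cpu_usage"))   // %100\% cpu\_usage%
}
```

With database/sql and any SQLite driver, the pattern would be passed as the query argument, e.g. db.Query(phraseQuery, LikePattern(phrase)), so the user's input is never spliced into the SQL string itself.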

Timestamps

The second feature WANsearch provides is a link to the specific time within a video at which a word or phrase is said.

This was done with something similar to reverse indexing. Using the subtitle files, a table was created that records every timestamp at which a word was said within a video. This was pretty easy to do with Python.
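
The original script was written in Python and isn't shown here; to stay consistent with the rest of the backend, the sketch below illustrates the same idea in Go. The Cue and Occurrence types, the function names, and the sample video ID are all hypothetical, and I assume the subtitles have already been parsed into (start time, text) pairs.

```go
package main

import (
	"fmt"
	"strings"
)

// Cue is one parsed subtitle entry: when it starts and what is said.
type Cue struct {
	StartSeconds int
	Text         string
}

// Occurrence records one place where a word was spoken.
type Occurrence struct {
	VideoID      string
	StartSeconds int
}

// BuildTimestampIndex maps each lowercased word to every (video, time)
// pair at which it appears in the subtitles.
func BuildTimestampIndex(videos map[string][]Cue) map[string][]Occurrence {
	index := make(map[string][]Occurrence)
	for videoID, cues := range videos {
		for _, cue := range cues {
			for _, word := range strings.Fields(strings.ToLower(cue.Text)) {
				index[word] = append(index[word], Occurrence{videoID, cue.StartSeconds})
			}
		}
	}
	return index
}

// TimestampLink builds a YouTube URL that starts playback at the
// occurrence's start time.
func TimestampLink(o Occurrence) string {
	return fmt.Sprintf("https://www.youtube.com/watch?v=%s&t=%ds", o.VideoID, o.StartSeconds)
}

func main() {
	// Made-up video ID and cues.
	videos := map[string][]Cue{
		"abc123": {
			{StartSeconds: 12, Text: "welcome to the show"},
			{StartSeconds: 95, Text: "the new gpu"},
		},
	}
	index := BuildTimestampIndex(videos)
	for _, o := range index["gpu"] {
		fmt.Println(TimestampLink(o)) // https://www.youtube.com/watch?v=abc123&t=95s
	}
}
```

In WANsearch these rows go into a database table rather than a map, but the shape is the same: word, video, timestamp.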

Deployment

When choosing a deployment strategy for WANsearch, the main concerns were speed and target audience. Considering that WANsearch is a relatively low-traffic site and the WAN Show audience is mostly English-speaking and located in the Americas, I chose to deploy the backend on a VPS on Google Cloud. I made sure to attach an SSD to the VPS, since the SQLite database file lives on the same machine. I decided to serve the Svelte frontend on Cloudflare Pages because I wanted the initial page load to be fast, and I knew that Cloudflare Pages had very low cold start times.