"The WAN Show" podcast by LMG has always been my favourite podcast to listen to. So much so that whenever I hear about something new in tech, I just wait for Linus and Luke to talk about it on the WAN Show. This was the motivation behind WANsearch: it would be great to search for a topic discussed on the show and, better yet, get the exact time within the video where it comes up.
The obvious next question is "why not just use YouTube search?" Well, there are a couple of problems. The first is that it does not index what's said within the video. The second is that it doesn't show where a certain phrase or word was said. WANsearch solves both of these problems.
The engine was written in Go, with the Gin framework used for the API, because Go is fast and easy to work with. The frontend was made with Svelte. All the data is stored in a single-file SQLite database.
Search engines are programs that help users find information in a collection of documents. They work by indexing these documents and creating a massive database of words and phrases. When a user enters a query, the search engine searches its index and retrieves the most relevant documents. In the case of WANsearch, these documents are YouTube subtitles from WAN Show episodes, downloaded using yt-dlp. A stripped-down version of these files (just the words) was used to create the database. WANsearch uses the following techniques to find and retrieve relevant videos.
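Tf-idf (term frequency-inverse document frequency) is an algorithm used by search engines to weigh the importance of a word in a document. It considers two factors: term frequency, which is how often the word appears in that document, and inverse document frequency, which is how rare the word is across all documents.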
Tf-idf helps WANsearch prioritize documents: a word that appears often in one episode but rarely in the rest scores highly for that episode.
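To make the scoring concrete, here is a minimal sketch of the textbook tf-idf formula for a single query word. It is not the actual WANsearch code (which is written in Go), and the exact weighting variant WANsearch uses is not stated here; the function name `tf_idf_scores`, the episode ids, and the sample transcripts are all made up for illustration.

```python
import math
from collections import Counter

def tf_idf_scores(query_word, documents):
    """Score each document for one query word using tf * idf."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    # Document frequency: how many documents contain the word at least once.
    df = sum(1 for words in tokenized.values() if query_word in words)
    if df == 0:
        return {}
    # Inverse document frequency: rarer words get a larger weight.
    idf = math.log(len(tokenized) / df)
    scores = {}
    for doc_id, words in tokenized.items():
        if not words:
            continue
        # Term frequency: share of the document's words that are the query word.
        tf = Counter(words)[query_word] / len(words)
        if tf > 0:
            scores[doc_id] = tf * idf
    return scores

# Hypothetical stripped-down transcripts keyed by made-up episode ids.
documents = {
    "ep1": "gpu gpu launch news",
    "ep2": "browser news update today",
    "ep3": "gpu news roundup today",
}

gpu_scores = tf_idf_scores("gpu", documents)
# Print episodes from most to least relevant for the word "gpu".
for doc_id in sorted(gpu_scores, key=gpu_scores.get, reverse=True):
    print(doc_id, round(gpu_scores[doc_id], 4))
```

In this toy data "ep1" ranks above "ep3" for "gpu" because the word makes up a larger share of its transcript, while a word like "news" that appears in every episode gets an idf of zero and so does nothing to separate them.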
Quoted search, or searching for a specific phrase, is relatively easy. In the backend, a simple SQL LIKE statement is executed to find the relevant video(s).
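As a rough sketch of what such a query could look like, the snippet below runs a LIKE search against an in-memory SQLite database. The table name `transcripts`, its columns, and the sample rows are assumptions made for this example, not the real WANsearch schema, and the real backend issues its query from Go rather than Python.

```python
import sqlite3

# In-memory database with a hypothetical schema: one row per episode transcript.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transcripts (video_id TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO transcripts VALUES (?, ?)",
    [
        ("ep1", "we talk about the new gpu launch today"),
        ("ep2", "luke covers a browser update"),
    ],
)

def phrase_search(conn, phrase):
    """Return ids of videos whose transcript contains the exact phrase."""
    # Escape LIKE wildcards so the user's input is matched literally.
    escaped = phrase.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT video_id FROM transcripts "
        "WHERE body LIKE ? ESCAPE '\\' ORDER BY video_id",
        (f"%{escaped}%",),
    )
    return [row[0] for row in rows]

print(phrase_search(conn, "gpu launch"))  # ['ep1']
print(phrase_search(conn, "launch gpu"))  # []
```

Passing the phrase as a bound parameter keeps user input out of the SQL text. SQLite's LIKE is case-insensitive for ASCII letters by default, so "GPU Launch" would match the same row.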
The second feature WANsearch provides is a link to the specific time within a video at which a word or phrase is said.
This was done with something similar to a reverse index, a mapping from each word back to the places where it occurs. Using the subtitle files, a table was created that records every timestamp at which a word was said within a video. This was pretty easy to do with Python.
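The sketch below shows one way such a table could be built. It assumes a simplified WebVTT-style subtitle file in which each cue is a "start --> end" line followed by its text, and it records every word against the start time of its cue. Real YouTube subtitle files carry more markup than this, and the table name `word_times`, the helper names, and the placeholder id `VIDEO_ID` are all hypothetical.

```python
import re
import sqlite3

# Matches the start time of a cue line such as "00:12:30.500 --> 00:12:34.000".
CUE_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.\d{3} --> ")

def word_timestamps(vtt_text):
    """Yield (word, start_seconds) for every word in every cue."""
    start = None
    for line in vtt_text.splitlines():
        match = CUE_TIME.match(line)
        if match:
            hours, minutes, seconds = map(int, match.groups())
            start = hours * 3600 + minutes * 60 + seconds
        elif not line.strip():
            start = None  # a blank line ends the current cue
        elif start is not None:
            for word in re.findall(r"[a-z0-9']+", line.lower()):
                yield word, start

# Made-up subtitle text in the simplified format described above.
sample = """WEBVTT

00:00:05.000 --> 00:00:08.000
welcome to the wan show

00:12:30.500 --> 00:12:34.000
the new gpu launch
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_times (video_id TEXT, word TEXT, seconds INTEGER)")
conn.executemany(
    "INSERT INTO word_times VALUES ('VIDEO_ID', ?, ?)",
    word_timestamps(sample),
)

def timestamp_links(conn, word):
    """Return YouTube links that jump to each moment the word was said."""
    rows = conn.execute(
        "SELECT video_id, seconds FROM word_times WHERE word = ? ORDER BY seconds",
        (word.lower(),),
    )
    return [f"https://www.youtube.com/watch?v={vid}&t={secs}s" for vid, secs in rows]

print(timestamp_links(conn, "gpu"))
```

Storing whole seconds is enough here because the `t` parameter in a YouTube link takes a time offset in seconds.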
When choosing a deployment strategy for WANsearch, the main concerns were speed and target audience. Considering that WANsearch is a relatively low-traffic site and the WAN Show audience is mostly English-speaking and located in the Americas, I chose to deploy the backend on a VPS on Google Cloud. I made sure to attach an SSD to the VPS, since the SQLite database file lives on the same box. I decided to serve the Svelte frontend from Cloudflare Pages because I wanted the initial page load to be fast, and I knew that Cloudflare Pages had very low cold start times.