For many projects FTS is now a requirement. SQLite's fts5 extension provides full-text search (fts) functionality that's simple and fast. Here I document a solution for serverless applications that can run SQLite indexes in memory for fts.
My data is books. A 20 byte ID, a title, an array of authors and the publish year ~ 50-100 bytes per book. Data size per item is a very important optimisation as it will determine the minimum size of your indexes. The smaller the better.
#Building the index
Queries can now be run returning the book ID's title and their ranking:
150K books can be stored in a changeset file of around 22 MB.
The usual way to use these indexes would be to store the data on disk and have SQLite query the disk directly. Going to serverless we lose the option for disk IO and have to use the cloud provided storage solutions. We can therefore run the SQLite index in-memory, fetching the data when the table is needed.
In my case I'm running a go application under google cloud run. Here I store the changesets under google's cloud storage and apply them as a changeset.
The library I used for go is crawshaw.io/sqlite, which importantly supports sqlite's session extensions. Below is a snippet on how to apply the sqlite index to an in memory database:
We can then apply the changeset when a search query is run. The first query will be slow, downloading the index and applying it. But subsequent queries will run very quickly on the in memory indexes.
The above worked great on a small dataset of around 150K books at 22 MB.
However, going to a million books upped the index size to 192 MB.
A bit on the heavy side for cloud run's RAM usuage.
To get a reasonable larger dataset we can compress, or more accurately throw away, more of the data.
fts5 tables have a
content option. This allows you to specify the backing table for the search index.
content='' creates a contentless table, the columns won't be available under SQLite.
Importantly for contentless fts tables we now need to set the
rowid on inserts.
We must have an explicit
rowid INTEGER PRIMARY KEY.
This is to avoid SQLite VACUUM rearrange the rowids and corrupting what the search index data points to.
Queries can now be run returning the book ID's and their ranking:
Content isn't available via the index so all information must be fetchable based
1 Million books can now be stored in 91 MB. Going further we can compress the index on disk with gzip for a 47 MB gzipped file.
Alternatively serverless solutions like Algolia are great and allow serverless applications to remain stateless without worrying about the implementation. SQLite also has some interesting solutions on running it client side. This is done by compiling SQLite to wasm and dynamically fetching parts of the database from storage when queries are run (write up). For small and simple application running the SQLite index as part of the application was exciting to see how easy it is to get a simple workable solution.