Search indexer architecture

dmonad · May 15, 2021, 11:57am

@YousefED Eventually, I’d like to move to a server that no longer needs to keep the document in memory. This is why I added an “Alternative updates” API that doesn’t require the server to hold the document state in memory. Instead, differences can be computed on the fly.

In the discussion below, I described an algorithm that allows the server to synchronize millions of documents efficiently. This has been implemented at least once. To keep the search index current, you could re-index a changed file once a debounced timeout has elapsed after the change. If you want to build an offline-ready application, I like the idea of performing indexing locally instead of on the server.
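
To make the debounced re-indexing idea concrete, here is a minimal sketch. Everything in it is hypothetical (`ReindexScheduler`, `notifyChange`, the `Timers` interface); none of it is part of Yjs. It assumes you already have some `reindex(docId)` function, and the timer functions are injectable only so that the behaviour is easy to test.

```typescript
// Hypothetical sketch: debounce re-indexing per document.
// None of these names come from Yjs; `reindex` is whatever your indexer exposes.

interface Timers {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

// Default timers backed by the platform's setTimeout/clearTimeout.
const realTimers: Timers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

class ReindexScheduler {
  // One pending timer handle per document that changed recently.
  private pending = new Map<string, unknown>();
  private reindex: (docId: string) => void;
  private delayMs: number;
  private timers: Timers;

  constructor(
    reindex: (docId: string) => void,
    delayMs: number,
    timers: Timers = realTimers,
  ) {
    this.reindex = reindex;
    this.delayMs = delayMs;
    this.timers = timers;
  }

  // Call this on every change; it restarts the quiet period for that document.
  notifyChange(docId: string): void {
    if (this.pending.has(docId)) {
      this.timers.clear(this.pending.get(docId));
    }
    const handle = this.timers.set(() => {
      this.pending.delete(docId);
      this.reindex(docId); // runs once per burst of changes
    }, this.delayMs);
    this.pending.set(docId, handle);
  }
}
```

You would call `notifyChange(docId)` from whatever hook fires when a document is updated; a burst of edits to the same document then triggers a single re-index after `delayMs` of quiet. The same scheduler could run on the client if you index locally.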

Since you are designing for a central system, you can use a numeric clock instead of a state-vector. The state-vector tracks the state of all peers and, at 100-2000 bytes per document, is too large to sync millions of documents efficiently.
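
As a rough illustration of the numeric-clock idea, here is one way it could look. This is my own sketch under stated assumptions, not the algorithm from the linked discussion: it assumes the central server stores each document's updates in arrival order and uses the number of stored updates as that document's clock. The names `CentralUpdateLog` and `staleDocs` are hypothetical, and updates are treated as opaque byte arrays.

```typescript
// Hypothetical sketch: a single integer per document replaces the state-vector.
// Assumption: the server is the only writer of the log, so arrival order is total.

class CentralUpdateLog {
  // Ordered update log per document; the log length is the document's clock.
  private logs = new Map<string, Uint8Array[]>();

  // Store an update and return the document's new clock.
  append(docId: string, update: Uint8Array): number {
    let log = this.logs.get(docId);
    if (log === undefined) {
      log = [];
      this.logs.set(docId, log);
    }
    log.push(update);
    return log.length;
  }

  // Current clock of a document (0 if the server has never seen it).
  clock(docId: string): number {
    return this.logs.get(docId)?.length ?? 0;
  }

  // Everything the client is missing, given the one integer it remembered.
  updatesSince(
    docId: string,
    clientClock: number,
  ): { updates: Uint8Array[]; clock: number } {
    const log = this.logs.get(docId) ?? [];
    return { updates: log.slice(clientClock), clock: log.length };
  }
}

// Given the clocks a client remembers, list the documents that are out of date.
function staleDocs(
  server: CentralUpdateLog,
  clientClocks: Map<string, number>,
): string[] {
  const stale: string[] = [];
  for (const [docId, clientClock] of clientClocks) {
    if (server.clock(docId) > clientClock) stale.push(docId);
  }
  return stale;
}
```

A client then only needs to remember one number per document. After calling `updatesSince`, it would apply the returned updates to its local document and store the returned `clock` for the next sync. The stale list is also a natural input for deciding which documents to re-index.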

I hope that gives a new perspective on the design of an efficient backend.