Search indexer architecture

I’m considering the following architecture for a local search indexer. Given:

  • multiple documents can be updated, e.g. there can be 100 documents a user is “following”
  • documents can be updated by different users
  • I don’t want to keep 100 documents in memory, subscribe to all 100 documents, and index every change
  • I would like to be able to search through all 100 documents. I’d like to index the documents client-side and not rely on a central server

This would require a method to get notified of doc changes without observing all docs. I’m considering:

  1. have a Y.Doc “feed”, containing a Y.Map “documentUpdates”
  2. whenever a user edits a document, call documentUpdates.set(docId, stateVector)
  3. the Indexer subscribes to documentUpdates
  4. if there’s a new/changed documentUpdates entry that hasn’t been synced yet, load the corresponding Y.Doc and wait until its state includes the stateVector. Then index the document and unload it.
  5. the Indexer keeps track of which (documentId, stateVector) combination has been indexed. The Indexer always reindexes complete documents

(Later on, we could optimize this further and index incremental updates; for now, it would be OK to always reindex complete documents.)
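For step 4, I imagine a hand-rolled check along these lines (function names are mine, not a Yjs API; this assumes a state vector decodes to a clientID → clock map, which is what Y.decodeStateVector should give you):

```javascript
// A decoded state vector maps clientID -> clock. A target state vector is
// "included" in a document's state when, for every client, the document has
// seen at least as many operations as the target requires.
function stateVectorIncluded(target, current) {
  for (const [client, clock] of target) {
    if ((current.get(client) || 0) < clock) return false;
  }
  return true;
}

// Step 5's bookkeeping: only reindex when the feed advertises a state the
// indexer hasn't fully covered yet.
function needsReindex(indexedSV, feedSV) {
  return indexedSV === undefined || !stateVectorIncluded(feedSV, indexedSV);
}
```

The indexer would then persist the decoded state vector per docId and run needsReindex against each changed feed entry.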

a) Step 4 would require determining whether a stateVector is included in a document’s state. Does such an API exist?
b) This design is probably not completely fool-proof, e.g. in “split-brain” situations. However, I think it’s relatively simple and should cover most scenarios.
c) I understood that deletions are not included in state vectors, so if there are no insertions/changes, documentUpdates won’t be updated and delete-only changes won’t trigger reindexing. Maybe my overall concept is flawed, and there’s a better approach for this?

Note that there are two different categories of CRDTs: state-based and op-based. Both serve the same purpose, but work in different ways and come with their own design trade-offs. In this series, I’m mostly going to focus on state-based CRDTs.
Lars Hupel, An introduction to Conflict-Free Replicated Data Types

Is what you’re proposing a sort of event stream of document operations, i.e. an “op-based” CRDT scheme? Correct me if I’m wrong, I don’t claim to be an expert.

@YousefED Eventually, I’d like to move to a server that doesn’t need to keep the document in memory anymore. This is why I added an “Alternative updates” API that doesn’t require the server to keep the state in memory. Instead, differences can be computed on the fly.

In the discussion below, I described an algorithm that allows the server to synchronize millions of documents efficiently. This has been implemented at least once. After a debounced timeout following a change, you could re-index the changed document. If you want to build an offline-ready application, I like the idea of performing indexing locally instead of on the server.
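The debounced part could be as simple as the following sketch (names are illustrative, not a Yjs API; flush is exposed only to make the batching explicit):

```javascript
// Collects changed docIds and reindexes them in one batch after a quiet
// period, so rapid edits to the same document trigger a single reindex.
function createReindexScheduler(reindex, delayMs = 1000) {
  const pending = new Set();
  let timer = null;

  function flush() {
    if (timer !== null) {
      clearTimeout(timer);
      timer = null;
    }
    const batch = [...pending];
    pending.clear();
    for (const docId of batch) reindex(docId);
  }

  function schedule(docId) {
    pending.add(docId);
    if (timer === null) timer = setTimeout(flush, delayMs);
  }

  return { schedule, flush };
}
```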

Since you are designing for a central system, you can use a numeric clock instead of a state vector. The state vector tracks the state of all peers and is too large to sync millions of documents efficiently (100-2000 bytes per document).
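A sketch of the numeric-clock variant (plain Maps stand in for the feed and the indexer’s bookkeeping; names are illustrative):

```javascript
// Each feed entry is just a monotonically increasing counter per document,
// bumped whenever that document changes.
function bumpClock(feed, docId) {
  feed.set(docId, (feed.get(docId) || 0) + 1);
}

// The indexer remembers the last clock it indexed per document and reindexes
// whenever the feed is ahead.
function clockNeedsReindex(indexedClocks, feed, docId) {
  return (feed.get(docId) || 0) > (indexedClocks.get(docId) || 0);
}
```

After indexing, the indexer stores the clock value it just covered, making each entry a few bytes instead of a full state vector.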

I hope that gives a new perspective on the design of an efficient backend.

Thanks @dmonad!

Perhaps I wasn’t clear about this, but my question was regarding a local indexer. I’m first trying to design my application to run fully decentralized, i.e. this would run completely in-browser using a search engine like fuse.js. That way, it would also work in the future for e2e-encrypted documents (as there cannot be a central cloud-based search engine in that scenario).

Do you think it’s a good design to keep a local search index up to date?

(I think I’ve got the basics working, but indeed deletes are not picked up yet.)

Hey YousefED,

The approach you described is what I implemented for my product, and it worked well up to a point. I think it will work fine depending on the scale of the feed. In my case, projects in my application had tens of thousands of documents, and the centralized update feed stopped being feasible: parsing it began to take 5-10 seconds in some cases. It was manageable up to around 200-500 updatedAt entries, at which point the parsing became noticeable. You could push this threshold higher if 1) you put the task on a background thread (which doesn’t make it faster, but makes it more pleasant), and 2) you don’t tie the first-time app load to the feed parsing, or your requirements can tolerate the loading time. I did #1 but could not avoid #2.

In the end, I implemented the system described in the thread dmonad linked to, and used mechanisms outside of YJS for update notifications. Whenever we received a notification that a doc had been updated, we’d load it into memory, attach a “SearchProvider” that subscribes to doc updates and translates them into search index actions, and then sync the document.

Wow, thanks @braden for your response. Very exciting to see the depth of knowledge in this community. Curious what project this was (if you’re able to share).

As I don’t expect thousands of documents, I think it should be feasible (also, #1 and #2 are possible for me, as the search is not critical; it’s fine if search results become available asynchronously with a delay).

Good to learn how / when you switched to a centralized notification system. For now, I’ll try to stick to a decentralized approach as long as possible.

Did you have a similar issue with state vectors and deletes? How did you resolve that, or did your implementation not build on state vectors?

I didn’t use state vectors for the search indexing task, but rather did a brute-force reindex of the document whenever a change was made to a field I cared about. I.e., on document update, if the field “name” or “tags” changed, enqueue the doc for indexing, etc.
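Roughly like this (field names and the helper are just illustrative):

```javascript
// Only these fields feed the search index, so only changes to them should
// enqueue a reindex.
const WATCHED_FIELDS = ['name', 'tags'];

// Compares the watched fields of the last-indexed snapshot with the current
// document state; JSON.stringify handles nested values like tag arrays.
function shouldEnqueue(lastIndexed, current) {
  return WATCHED_FIELDS.some(
    f => JSON.stringify(lastIndexed?.[f]) !== JSON.stringify(current?.[f])
  );
}
```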

Deletes have been a little trickier; for now I soft-delete via a “deletedAt” timestamp, and if the indexer ever sees this value go from null to non-null, it deletes the corresponding entry from the search index. Of course, this has some issues: soft-deletes only work if everything else in your system is also soft-deleting, which means old data is left lying around. Maybe your documentUpdates feed could incorporate some kind of tombstone-based approach so you can fully delete data.
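The transition check itself is simple enough (the index API and function name are illustrative):

```javascript
// On each observed change, remove the entry from the search index when
// deletedAt flips from null to a timestamp; otherwise keep the live
// document's index entry up to date.
function applySoftDelete(index, docId, prev, next) {
  if (prev.deletedAt == null && next.deletedAt != null) {
    index.remove(docId);
  } else if (next.deletedAt == null) {
    index.upsert(docId, next);
  }
}
```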