How is y-leveldb coming along?

Hey @canadaduane,

Just to show you something because you have been waiting for this for quite some time. (Sorry if not everything here makes sense. It was a long day and I know I don’t always make sense when I’m tired. I still wanted to get this out for you).

I published what I have here: https://github.com/yjs/y-leveldb

What I’m currently working on: an interface for providers that allows exchanging document updates without creating a Yjs instance. This will allow the server to process and sync large documents with constant memory consumption (just the size of the update buffers + x bytes of working memory).

Why: Creating a Yjs instance is rather costly for a server. A large Y.Doc consumes about 40 MB of memory. With a 1 GB server instance you can only handle about 20 large documents at the same time, because the server needs to keep the Yjs documents in memory. Hence the idea to compute document updates without actually creating a Y.Doc.

Constraints: I’m looking for a generic, scalable provider approach that will work in all db environments.

Solution: Currently you can transform the Yjs document to a single document update. Furthermore, you can compute a diff using state vectors (see here https://github.com/yjs/yjs#Document-Updates). In order to allow the server to sync Yjs document updates without creating Y.Docs, we need to be able to merge document updates (const singleUpdate = Y.mergeUpdates([update1, update2, ..])), and compute diffs on document updates instead of the Yjs document (const missingChanges = Y.diffUpdate(latestUpdate, stateVector)).
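To make the two operations above a bit more concrete, here is a minimal sketch on a toy model. It does not use Yjs or its binary update encoding: an update is assumed to be a plain array of ops identified by (client, clock), and a state vector is assumed to be a Map from client id to the next expected clock. The function names mirror the ones mentioned above, but the bodies are only illustrative.

```javascript
// Toy model, NOT Yjs's binary encoding: an update is an array of ops,
// each identified by (client, clock). A state vector maps a client id
// to the next clock that is expected from that client.

// Merge several updates into one, dropping duplicate ops.
function mergeUpdates (updates) {
  const seen = new Set()
  const merged = []
  for (const update of updates) {
    for (const op of update) {
      const id = `${op.client}:${op.clock}`
      if (!seen.has(id)) {
        seen.add(id)
        merged.push(op)
      }
    }
  }
  return merged
}

// Compute the state vector described by an update.
function stateVectorFromUpdate (update) {
  const sv = new Map()
  for (const { client, clock } of update) {
    sv.set(client, Math.max(sv.get(client) ?? 0, clock + 1))
  }
  return sv
}

// Keep only the ops that the holder of `stateVector` has not seen yet.
function diffUpdate (update, stateVector) {
  return update.filter(op => op.clock >= (stateVector.get(op.client) ?? 0))
}

const update1 = [{ client: 1, clock: 0, content: 'a' }]
const update2 = [
  { client: 1, clock: 1, content: 'b' },
  { client: 2, clock: 0, content: 'x' }
]

const singleUpdate = mergeUpdates([update1, update2])
// Pretend the client has only received update1 so far.
const clientStateVector = stateVectorFromUpdate(update1)
const missingChanges = diffUpdate(singleUpdate, clientStateVector)
```

The point of the sketch is that neither function needs a document instance: both work directly on the update data, so memory use is bounded by the size of the updates being processed.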

For this, I have completely reworked Yjs’s update encoding and I have worked on an interface to find the required document updates in leveldb without constructing the whole Y.Doc every time a client wants to sync.

The current database provider approach is to store all incremental document updates in a list. Optionally, you can merge all document updates into a single entry. This is very database friendly because you don’t have to write the whole document every time something changes (large writes are costly, especially in leveldb).

In the new approach, we will still store document updates in a list in causal order: [update1, update2, update3]. But we will also keep a separate list of state vectors that point to updates in the database.
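One way to picture this layout is the in-memory sketch below. It is not the y-leveldb schema. The class name UpdateLog, the field names, and the choice to record the state vector after each appended update are all assumptions made for illustration, and updates use a toy format (arrays of {client, clock} ops) rather than Yjs's encoding.

```javascript
// Hypothetical in-memory stand-in for the database; all names are illustrative.
class UpdateLog {
  constructor () {
    this.updates = []        // document updates, in causal order
    this.stateVectors = []   // entries of { updateIndex, stateVector }
    this.current = new Map() // running state vector: client -> next expected clock
  }

  // Append an update and record the state vector reached after applying it.
  append (update) {
    for (const { client, clock } of update) {
      this.current.set(client, Math.max(this.current.get(client) ?? 0, clock + 1))
    }
    this.updates.push(update)
    this.stateVectors.push({
      updateIndex: this.updates.length - 1, // "pointer" to the stored update
      stateVector: new Map(this.current)    // snapshot, not a shared reference
    })
  }
}

const log = new UpdateLog()
log.append([{ client: 1, clock: 0 }])                         // update1
log.append([{ client: 1, clock: 1 }, { client: 2, clock: 0 }]) // update2
log.append([{ client: 2, clock: 1 }])                         // update3
```

Each append is a small write: one update plus one state vector entry, which is where the extra disk space comes from.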

When a client wants to sync, it will send its state vector to the server. The server will query for the first missing document update using the state vector (e.g. update2). It will then grab all missing document updates ([update2, update3]), merge them into a single document update (updateMerged23 = Y.mergeUpdates([update2, update3])), and then perform the usual diff using the state vector (missingUpdates = Y.diffUpdate(updateMerged23, stateVector)).
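The sync flow described above could look roughly like this. Again, this is only a sketch on toy data, not the real implementation: updates are arrays of {client, clock} ops, stateVectors[i] is assumed to hold the document state after updates[0..i], the helper names covers and sync are hypothetical, and the "merge" is a plain concatenation standing in for Y.mergeUpdates.

```javascript
// Toy data: ops are { client, clock }; a state vector maps client -> next expected clock.
const updates = [
  [{ client: 1, clock: 0 }],                          // update1
  [{ client: 1, clock: 1 }, { client: 2, clock: 0 }], // update2
  [{ client: 2, clock: 1 }]                           // update3
]
// stateVectors[i] is the document state after updates[0..i] (assumed index layout).
const stateVectors = [
  new Map([[1, 1]]),
  new Map([[1, 2], [2, 1]]),
  new Map([[1, 2], [2, 2]])
]

// True if `sv` already contains everything described by `other`.
const covers = (sv, other) =>
  [...other].every(([client, clock]) => (sv.get(client) ?? 0) >= clock)

function sync (clientStateVector) {
  // 1. Find the first stored update that the client does not fully have.
  const first = stateVectors.findIndex(sv => !covers(clientStateVector, sv))
  if (first === -1) return [] // client is up to date
  // 2. Merge the tail of the log into a single update (toy merge: concatenate).
  const merged = updates.slice(first).flat()
  // 3. Diff against the client's state vector to drop ops it already has.
  return merged.filter(op => op.clock >= (clientStateVector.get(op.client) ?? 0))
}

// A client that has only update1 is sent the ops of update2 and update3.
const missing = sync(new Map([[1, 1]]))
// A client that also got client 2's first op from a peer still needs the final diff step.
const partial = sync(new Map([[1, 1], [2, 1]]))
```

The linear scan in step 1 is only for readability. A database-backed version would presumably use an indexed lookup or range query instead, and only the tail of the log is ever read.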

The advantages:

  • constant memory consumption
  • cheap syncs because in most cases updates just need to be merged
  • less db load because only a subset of the database will be queried
  • Uses state vectors instead of some kind of server-clock approach. State vectors are just much easier to work with and allow for very efficient syncs.

The disadvantages:

  • Need to maintain a list of state vectors + more disk space, but this can be optimized
  • Rework of the encoding/decoding protocols

It is actually quite tricky to get all of this right. At first I envisioned a much simpler approach, but I think this is where I want to go with this.

Nevertheless, y-leveldb can already be used as a database provider using the current y-websocket implementation. All the necessary methods (storeUpdate, getYDoc) are already available. You should be able to upgrade to the new approach when everything is ready.

I will publish a new y-leveldb release tomorrow and I will post documentation on how to set it up here.

For everyone who doesn’t know: @canadaduane is sponsoring the y-leveldb work. Thanks again for this! Who knows when I would have found the motivation to finally tackle server load.
