How is y-leveldb coming along?

canadaduane · June 24, 2020, 2:38am

I’m curious how it’s progressing. I’m looking forward to being able to persist worlds in https://relm.us

dmonad · June 25, 2020, 8:47pm

Persisting worlds, that just sounds awesome

I’ll publish what I have on the weekend. I’ve been thinking a lot about optimizing how multiple rooms can be stored in a single file.

canadaduane · July 6, 2020, 6:41pm

Any more progress to report?

dmonad · July 7, 2020, 11:38pm

Hey @canadaduane,

Just to show you something because you have been waiting for this for quite some time. (Sorry if not everything here makes sense. It was a long day and I know I don’t always make sense when I’m tired. I still wanted to get this out for you).

I published what I have here: https://github.com/yjs/y-leveldb

What I’m currently working on: an interface for providers that allows exchanging document updates without creating a Yjs instance. This will allow the server to process and sync large documents with constant memory consumption (just the size of the update buffers + x bytes as working memory).

Why: Creating a Yjs instance is rather costly for a server. A large Y.Doc consumes about 40 MB of memory. With a 1GB server instance you can only handle about 20 large documents at the same time on a server because the server needs to keep the Yjs documents in memory. Hence the idea to compute document updates without actually creating a Y.Doc.

Constraints: I’m looking for a generic, scalable provider approach that will work in all db environments.

Solution: Currently you can transform the Yjs document to a single document update. Furthermore, you can compute a diff using state vectors (see here https://github.com/yjs/yjs#Document-Updates). In order to allow the server to sync Yjs document updates without creating Y.Docs, we need to be able to merge document updates (const singleUpdate = Y.mergeUpdates([update1, update2, ..])), and compute diffs on document updates instead of the Yjs document (const missingChanges = Y.diffUpdate(latestUpdate, stateVector)).

For this, I have completely reworked Yjs’s update encoding and I have worked on an interface to find the required document updates in leveldb without constructing the whole Y.Doc every time a client wants to sync.

The current database provider approach is to store all incremental document updates in a list. Optionally, you can merge all document updates to a single entry. This is very database friendly because you don’t have to write the whole document every time something changes (large writes are costly especially in leveldb).

In the new approach, we will still store document updates in a list in causal order. [update1, update2, update3] But we will have a separate list of state vectors that point to updates in the database.

When a client wants to sync, it will send its state vector to the server, and the server will query for the first missing document update using the state vector (e.g. update2). The server will then grab all missing document updates ([update2, update3]), merge them to a single document update updateMerged23 = Y.mergeUpdates([update2, update3]), and then perform the usual diff using the stateVector (missingUpdates = Y.diffUpdate(updateMerged23, stateVector)).
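To make the lookup step more concrete, here is a toy model in plain JavaScript. It is not Yjs's binary format and not the y-leveldb schema: the update shape { client, clock, length }, the UpdateLog class, and the covers helper are all made up for illustration. It only shows how a list of state vectors kept next to the update list lets the server find the first update a peer is missing.

```javascript
// Toy model: an update { client, clock, length } means
// "client wrote `length` operations starting at `clock`".
// A state vector maps client -> next expected clock.

// True if a peer with `peerSv` has already seen everything covered by `entrySv`.
function covers (peerSv, entrySv) {
  return Object.entries(entrySv).every(
    ([client, clock]) => (peerSv[client] ?? 0) >= clock
  )
}

class UpdateLog {
  constructor () {
    this.updates = [] // updates in causal order
    this.stateVectors = [] // stateVectors[i] = state after applying updates[0..i]
  }

  append (update) {
    const prev = this.stateVectors[this.stateVectors.length - 1] ?? {}
    const next = { ...prev }
    next[update.client] = Math.max(next[update.client] ?? 0, update.clock + update.length)
    this.updates.push(update)
    this.stateVectors.push(next)
  }

  // All updates starting at the first one the peer has not fully seen.
  // This may include updates the peer already has; the final diff removes them.
  missingFor (peerSv) {
    const first = this.stateVectors.findIndex(sv => !covers(peerSv, sv))
    return first === -1 ? [] : this.updates.slice(first)
  }
}

const log = new UpdateLog()
const update1 = { client: 1, clock: 0, length: 3 }
const update2 = { client: 2, clock: 0, length: 2 }
const update3 = { client: 1, clock: 3, length: 1 }
for (const u of [update1, update2, update3]) log.append(u)

// A peer that has only seen update1 is handed update2 and update3.
const missing = log.missingFor({ 1: 3 })
console.log(missing)
```

In the real design the server would then merge the returned updates and diff the result against the peer's state vector, as described above.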

The advantages:

constant memory consumption
cheap syncs because in most cases updates just need to be merged
less db load because only a subset of the database will be queried
Uses state vectors instead of some kind of server-clock approach. State vectors are just much easier to work with and allow for very efficient syncs.

The disadvantages:

Need to maintain a list of state vectors + more disk space, but this can be optimized
Rework of the encoding/decoding protocols

It is actually quite tricky to get all of this right. First I envisioned a much easier approach, but I think this is where I want to go with this.

Nevertheless, y-leveldb can already be used as a database provider using the current y-websocket implementation. All the necessary methods (storeUpdate, getYDoc) are already available. You should be able to upgrade to the new approach when everything is ready.

I will publish a new y-leveldb release tomorrow and I will post documentation on how to set it up here.

To everyone who doesn’t know: @canadaduane is sponsoring the y-leveldb work. Thanks again for this! Who knows when I would have found the motivation to finally tackle server load.

canadaduane · July 8, 2020, 3:19am

Great work, Kevin! Thanks for the update. I’m excited for your very thorough solution.

Out of curiosity, does this open a path towards squashing history (like we talked about before, e.g. “clear out history”)?

dmonad · July 8, 2020, 3:23pm

I don’t think that this will ever be part of Yjs directly. There might be some benefit in implementing this in y-websocket / y-protocols. An event that signals the clients to migrate all data to a new instance.

For now you can use one of the discussed methods.

dmonad · July 10, 2020, 1:43am

I reimplemented y-leveldb and also added persistence support in y-websocket.

As described in the y-websocket docs, you just need to set an environment variable in order to allow persistence using y-leveldb:

YPERSISTENCE=./dbDir y-websocket-server

You can also set persistence using the setPersistence function that is exported by y-websocket/bin/utils.js:

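  const Y = require('yjs')
  const { setPersistence } = require('y-websocket/bin/utils.js')
  const LeveldbPersistence = require('y-leveldb').LeveldbPersistence

  const ldb = new LeveldbPersistence('./my-storage')
  setPersistence({
    // Called once after the document is created.
    bindState: async (docName, ydoc) => {
      const persistedYdoc = await ldb.getYDoc(docName)
      // Store whatever the in-memory doc already contains (may be empty).
      const newUpdates = Y.encodeStateAsUpdate(ydoc)
      ldb.storeUpdate(docName, newUpdates)
      // Load the persisted state into the in-memory doc.
      Y.applyUpdate(ydoc, Y.encodeStateAsUpdate(persistedYdoc))
      // From now on, persist every incremental update.
      ydoc.on('update', update => {
        ldb.storeUpdate(docName, update)
      })
    },
    // Nothing to do here: incremental updates are already stored.
    writeState: async (docName, ydoc) => {}
  })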

y-leveldb only stores incremental updates and therefore we don’t need the writeState method.

In case you want to rework the persistence approach:

writeState is called once after all clients left the room and just before the instance can be destroyed.
bindState is called once after the document is created. You can use it to start listening to document updates.

As I mentioned in the previous posts, I’m currently in the process of reworking how y-websocket uses persistence. The y-leveldb interface is pretty stable and you probably won’t have to migrate any data when the persistence layer changes. I might even keep the old interface functional.

canadaduane · July 10, 2020, 3:18am

Fantastic! I’ll give this a test drive soon. Thanks!

tobiasandersen · September 15, 2020, 9:59am

Since this is my first comment around here, I’d just like to start off by thanking you for your hard work on Yjs, as well as for keeping such a friendly and welcoming tone!

Now for my question — what’s the reasoning behind writing the initial update in bindState()? I’m referring to the following lines from your example above:

const newUpdates = Y.encodeStateAsUpdate(ydoc)
ldb.storeUpdate(docName, newUpdates)

Why is it not enough to write only from the update handler?

Thanks again!

dmonad · September 15, 2020, 11:00am

Welcome @tobiasandersen!

It might be enough. I don’t make any assumptions about how y-websocket is used. If the user initialized some content (or used another database adapter), then the initial document might not be empty before the update event is registered. In the worst case, you write a tiny or empty update to the database. One improvement might be to check beforehand whether the update is empty. On the downside, this introduces a special case that adds cognitive load without really providing any performance gain. I like to simplify things like that.
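If you did want that check, it could look roughly like the sketch below. It rests on an assumption: with the default (v1) update encoding, an update that carries no structs and no deletions is the two bytes [0, 0]. The helper names isEmptyUpdate and storeIfNotEmpty are hypothetical, and fakeStore only stands in for something like the ldb instance so the snippet runs without a database.

```javascript
// Assumption: an empty v1 update is exactly the two bytes [0, 0].
function isEmptyUpdate (update) {
  return update.length === 2 && update[0] === 0 && update[1] === 0
}

// Hypothetical wrapper around any store that exposes storeUpdate(docName, update).
function storeIfNotEmpty (store, docName, update) {
  if (isEmptyUpdate(update)) return false
  store.storeUpdate(docName, update)
  return true
}

// Stand-in store so the sketch runs without a database.
const written = []
const fakeStore = {
  storeUpdate: (docName, update) => { written.push([docName, update]) }
}

// Skipped, because the update is empty.
storeIfNotEmpty(fakeStore, 'room-1', new Uint8Array([0, 0]))
// Stored. The bytes are arbitrary and not a valid Yjs update.
storeIfNotEmpty(fakeStore, 'room-1', new Uint8Array([1, 1, 2, 3]))

console.log(written.length)
```

As noted above, the saving is a single tiny write per document load, so the unconditional version is arguably the simpler choice.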

tobiasandersen · September 15, 2020, 11:48am

Ah, that makes a lot of sense. Thanks!

csbenjamin · November 3, 2020, 7:18pm

@dmonad First I want to say that Yjs is amazing. I am building a notebook app, and Yjs lets me implement offline and collaborative capabilities with little effort. I’m looking forward to y-leveldb being able to deliver diffs without building a Y.Doc instance of the whole document. Is there an ETA for this feature? As soon as I learn more about Yjs internals, I will happily contribute some code.

dmonad · November 3, 2020, 10:56pm

Hi @csbenjamin,

I’m currently finishing up other work. But I’ll keep in mind that you need this feature as well.

I fear the implementation of the diff approach is quite complicated and not a good place to start contributing to Yjs. It requires deep knowledge of the CRDT algorithm, and the document update format (including the binary compression approach). You can find out more about it here if you are still interested: https://docs.yjs.dev/api/internals

The next important thing I want to finish up is the documentation, and cleaning up the demo section.

Thanks for your sponsorship, btw. I usually follow up with a mail, but I didn’t find yours.

csbenjamin · November 4, 2020, 10:09pm

I have a (maybe silly) question. Until I heard about y-leveldb, I didn’t know about LevelDB. It turns out that I love it, and I want to use it to store other things unrelated to Yjs. Is it recommended to keep a separate LevelDB database for storing Yjs documents, or can I use y-leveldb with an existing LevelDB database?

canadaduane · November 5, 2020, 1:54am

I was able to store data in leveldb alongside the yjs data and it did not affect functionality or performance. It’s possible you will want to split it out for performance’s sake if you run a very large site. But other than that, I don’t think it will be a problem.

csbenjamin · November 5, 2020, 2:03am

My concern was about the keys. But looking into the source code of y-leveldb, I could see that there is no chance of a conflict. It is good to know that you already use it this way without problems. Thanks!

german-jablo · May 6, 2024, 11:10am

dmonad:

The current database provider approach is to store all incremental document updates in a list. Optionally, you can merge all document updates to a single entry. This is very database friendly because you don’t have to write the whole document every time something changes (large writes are costly especially in leveldb).

In the new approach, we will still store document updates in a list in causal order. [update1, update2, update3] But we will have a separate list of state vectors that point to updates in the database.

When a client wants to sync, it will send its state vector to the server, and the server will query for the first missing document update using the state vector (e.g. update2). The server will then grab all missing document updates ([update2, update3]), merge them to a single document update updateMerged23 = Y.mergeUpdates([update2, update3]), and then perform the usual diff using the stateVector (missingUpdates = Y.diffUpdate(updateMerged23, stateVector)).

@dmonad, If I am understanding correctly, this new approach would imply that for each update in the YDoc a row is created in the DB. Is that correct?

When you say that “Optionally, you can merge all document updates to a single entry”, that means that you can no longer synchronize the state without loading the entire document into memory. Am I right here too?

dmonad · May 10, 2024, 9:21am

This ticket is quite old.

My current recommendation is to implement something like y-redis:

it stores incremental updates in a redis stream
the same stream is used to subscribe to changes on a document
in regular intervals, a worker merges the changes from the stream and stores them in a single row in a database.
The server component doesn’t load the document to memory. It only uses the “alternative update API”, which is quite efficient.
The document is only loaded to memory when the worker process persists the data to a database. This is necessary to garbage-collect some information that we are not interested in anymore.
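The division of labor in that list can be sketched with in-memory stand-ins. This is a toy and not y-redis code: stream plays the role of the redis stream, db the role of the database, and updates are plain arrays of operations so that "merging" is just concatenation. Real Yjs updates are binary and would be merged with the update API, and the real worker additionally loads the document to garbage-collect, which the toy skips. All function names are made up for illustration.

```javascript
const stream = new Map() // docName -> array of pending updates
const db = new Map() // docName -> single merged update

// Toy merge: concatenate operation lists.
const mergeToy = updates => updates.flat()

// Server: append an incoming update to the stream. No document is loaded.
function pushUpdate (docName, update) {
  if (!stream.has(docName)) stream.set(docName, [])
  stream.get(docName).push(update)
}

// Server: a newly connected client needs the persisted row plus pending stream entries.
function readDocument (docName) {
  return mergeToy([db.get(docName) ?? [], ...(stream.get(docName) ?? [])])
}

// Worker: runs at regular intervals and folds the stream into the single database row.
function compact (docName) {
  const pending = stream.get(docName) ?? []
  if (pending.length === 0) return
  db.set(docName, mergeToy([db.get(docName) ?? [], ...pending]))
  stream.set(docName, [])
}

pushUpdate('notes', ['insert a'])
pushUpdate('notes', ['insert b', 'delete a'])
const before = readDocument('notes')
compact('notes')
const after = readDocument('notes')

// Compaction moves data from the stream to the database without changing what a client reads.
console.log(before, after, stream.get('notes').length)
```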

It is desirable to avoid keeping the Yjs document (new Y.Doc()) in memory while the websocket connection is active, which is still what the default y-websocket server does. y-redis is an alternative, more efficient backend. With it, no component maintains the document in memory anymore. However, when syncing with a peer, we need to pull the whole document to compute the differences, not just the missing updates. With the alternative update API, we can compute the differences directly from the encoded update, without loading a Y.Doc.

It probably doesn’t make sense to optimize this further (although you could). This is already quite efficient.