Question regarding updates and state vectors in y-leveldb

Hey there! I’m digging around in y-leveldb and had a couple questions.

One, I noticed that the implementation included in y-websocket server reads from LDB, then immediately writes back to it. Each time I start my server, this of course results in the update list growing by one, even though the updates have identical information in them. What’s the intention of this behavior? (Answered this one myself upon re-reading; it’s using the in-memory ydoc here, which could have new stuff in it. Oops!)
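For anyone else who trips over the same thing, this is roughly the bindState logic I was looking at (paraphrased from memory; check bin/utils.js in y-websocket for the exact code):

```js
import * as Y from 'yjs'
import { LeveldbPersistence } from 'y-leveldb'

const ldb = new LeveldbPersistence('./storage-location')

const bindState = async (docName, ydoc) => {
  // Load what is already persisted for this document.
  const persistedYdoc = await ldb.getYDoc(docName)
  // Store the current in-memory ydoc as a new update row; this is the write-back
  // I was asking about, since the in-memory doc may already contain new content.
  ldb.storeUpdate(docName, Y.encodeStateAsUpdate(ydoc))
  // Apply the persisted state to the in-memory doc and persist every future update.
  Y.applyUpdate(ydoc, Y.encodeStateAsUpdate(persistedYdoc))
  ydoc.on('update', update => ldb.storeUpdate(docName, update))
}
```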

In that same area, it seems like I’d be able to use Y.diffUpdate and the stored state vector to avoid writing updates I already have. The state vector, however, is only updated on flushDocument and on state vector creation. Is the intention to go ahead and write updates I already have and let the trimming take care of compaction? Or do I need to update the state vector on every update? Is there a reason y-leveldb doesn’t do this?
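To make the idea concrete, here’s roughly what I had in mind; getStoredStateVector and storeUpdate are hypothetical placeholders for the persistence layer, and the size check is only a heuristic:

```js
import * as Y from 'yjs'

// An "empty" update, used as a rough baseline for the size check below.
const emptyUpdate = Y.encodeStateAsUpdate(new Y.Doc())

const storeIfNew = async (docName, update) => {
  const sv = await getStoredStateVector(docName) // hypothetical helper
  // Keep only the part of `update` that the stored state vector doesn't cover.
  const unseen = Y.diffUpdate(update, sv)
  // Heuristic: skip the write when the diff carries no more than an empty update.
  if (unseen.byteLength > emptyUpdate.byteLength) {
    await storeUpdate(docName, unseen) // hypothetical helper
  }
}
```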

Anyways, just a few questions I had while digging around. Really enjoying playing with the new Y.diffUpdate and Y.mergeUpdate; they are super cool. :smiley:

Hey @braden! Happy you like the diffUpdates feature :slight_smile:

The getStateVector method in y-leveldb is currently incomplete. Once it is complete, it could save database queries (e.g. computing the state-vector for sync-step-1 could be avoided), even if maintaining it adds a bit of overhead.

The state-vector database-field should be interpreted as “the database has at least the information of the defined state-vector” or “the compacted state has the information of the defined state-vector”. Updating the state-vector on every update is potentially expensive as it doubles the number of writes. While leveldb is really fast, other databases that are supported by levelup have a lot more overhead.

With the diff-update feature, we can update the state vector in the database by simply merging all updates that come after the compacted state and then extracting the state vector from the result. I.e. if we have [compactedState, update1, update2], stateVector: sv0 in the database, we first compact update1-2, compactedUpdates = Y.mergeUpdates([update1, update2]), extract the state vector, sv12 = Y.getStateVectorFromUpdate(compactedUpdates), and write everything back to the database:

[compactedState, compactedUpdates], stateVector: mergeStateVectors(sv0, sv12)
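For illustration, a rough sketch of that compaction step; getStoredUpdates and writeBack are placeholder helpers, and I’m using Y.encodeStateVectorFromUpdate, which is the exported name I’m aware of for extracting a state vector from an update:

```js
import * as Y from 'yjs'

const compactAndRefreshStateVector = async (docName) => {
  // e.g. [compactedState, update1, update2]
  const updates = await getStoredUpdates(docName) // placeholder helper
  // Merge everything into a single compacted update.
  const compacted = Y.mergeUpdates(updates)
  // The state vector of the merged update already covers sv0 and sv12,
  // so no separate state-vector merge is needed in this variant.
  const sv = Y.encodeStateVectorFromUpdate(compacted)
  await writeBack(docName, compacted, sv) // placeholder helper
}
```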

I plan to update the getStateVector method in y-leveldb to always retrieve the latest state vector. If the state is not compacted and the state-vector in the database is not up to date, the state-vector will be updated using the above approach.

Regarding your other question: I think it should be possible to avoid duplicate writes of updates in many cases, but I’m unsure it makes sense to check, for every update, whether it should be written. So I would favor letting the compaction handle removing the duplicate updates.

Awesome, that all makes sense. Thank you for the reply!

Another question: Does diffUpdate always converge to an empty diff after applying mutual updates to documents? I’m experimenting with sending the smallest amount of information possible between two nodes, and I’m finding I can’t get the diff to converge to an empty Uint8Array. I imagine I just have a bug somewhere, but I wanted to confirm my assumption about convergence. I have side effects based on the results of a sync, and I’d like to be able to opt out of those side effects if there are no meaningful changes in an update.

Hey @braden,

Does diffUpdate always converge to an empty diff after applying mutual updates to documents?

It does not.

Basically, Yjs has a struct store (containing all insertions) and a delete store (containing an optimized representation of all ids that have been deleted). We can compute a minimal diff for the struct store (using Lamport timestamps). But the delete store is always sent to the other client when syncing.

Other CRDTs don’t make that distinction and store everything in an “operation store”. In Yjs, we achieve better compression and faster parsing of the encoded document by separating the struct store and the delete store.
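A small example of the consequence: even a fully synced peer receives a non-empty diff, because the delete set is always encoded.

```js
import * as Y from 'yjs'

const doc1 = new Y.Doc()
const doc2 = new Y.Doc()
doc1.getText('t').insert(0, 'hello')
doc1.getText('t').delete(0, 1) // create at least one deletion

// Bring doc2 fully up to date.
Y.applyUpdate(doc2, Y.encodeStateAsUpdate(doc1))

// The diff doc1 would send to doc2 is still not "empty":
const diff = Y.encodeStateAsUpdate(doc1, Y.encodeStateVector(doc2))
console.log(diff.byteLength) // larger than an empty update: it still encodes the delete set
```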

Opting out based on whether an update contains meaningful changes might be hard to achieve. If you opt out of syncing some changes, you might not be able to apply other updates (that depend on the changes that you didn’t apply).

Still pondering over minimizing DB writes/network exchange between a client and server using differential updates.

Would it be productive to include a “delete vector” with SyncStep1? Let’s say that, in addition to the state vector’s (clientId, clock)[] entries, I include a delete vector containing all known clientIds and the size of the corresponding DeleteSet.

Could I then use that delete vector to determine convergence when responding to SyncStep1 with SyncStep2? I’m trying to avoid introducing realtime timestamps to my 1000+ document synchronization process.

You could send a snapshot instead. A snapshot consists of a state vector and a delete set. In theory, if the snapshot of doc1 equals the snapshot of doc2, then the documents are in sync. I think there are some edge cases where the delete sets might not match, because the order in which they are written is non-deterministic (a state vector writes client-ids in decreasing order; a delete set writes client-ids in any order, but we can fix that).
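Something along these lines; Y.snapshot, Y.equalSnapshots and Y.encodeSnapshot should be available in recent Yjs versions:

```js
import * as Y from 'yjs'

const doc1 = new Y.Doc()
const doc2 = new Y.Doc()

// A snapshot bundles the state vector and the delete set of a document.
const snap1 = Y.snapshot(doc1)
const snap2 = Y.snapshot(doc2)

if (Y.equalSnapshots(snap1, snap2)) {
  // The documents should be in sync (modulo the delete-set ordering caveat above).
}

// Snapshots can also be encoded to send them to the other peer.
const encoded = Y.encodeSnapshot(snap1)
const received = Y.decodeSnapshot(encoded)
```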

If you don’t want to use timestamps (based on UTC time), then you might be better off computing hashes on the encoded document. Hashes are smaller than snapshots and are faster to generate.

Syncing thousands of documents is gonna be pretty inefficient if you still need to compute sync steps. My approach would be to store some hash/timestamp alongside the shared document. If the hashes match, then there is no need to initiate a sync.
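For the hash variant, a sketch (assuming Node.js; whether two peers produce byte-identical encodings is an assumption you would want to verify, otherwise use the hash only to detect local changes between syncs):

```js
import { createHash } from 'node:crypto'
import * as Y from 'yjs'

// Fingerprint of the encoded document, stored alongside the document itself.
const docHash = (doc) =>
  createHash('sha256').update(Y.encodeStateAsUpdate(doc)).digest('hex')

// If the stored hash still equals docHash(doc), nothing changed since it was
// written, so there is no need to initiate a sync for this document.
```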

I wanna share an idea that has been lurking in the depths of my mind for quite some time now. If I were writing a note-taking application, I might expect that some of my users have access to millions of documents (e.g. because they are part of a large company). In this case, even the above approach would be too inefficient because the client would need to send millions of hashes to the server to check if there are any updates. Instead, I would work with a last-modified timestamp (based on UTC time). When I initially sync with a server, I’m gonna ask the server what changed since the beginning of time. The server will reply with all document-names that currently exist. The client is gonna sync with each of them individually (pulling the complete state in the background). The next time the client syncs with the server, it asks “what happened since yesterday”. The server is gonna reply with all documents that changed since yesterday. Again, the client is gonna sync with each of them individually in the background. There are probably only going to be a few documents that have been changed, so we get close to optimal performance without accessing cold documents (which might even live in cheap “cold storage” such as AWS S3 Glacier).
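A tiny sketch of the server-side bookkeeping; lastModified, touch and changedSince are made-up names, not an existing API:

```js
// Map of docName -> last-modified UTC timestamp (ms).
const lastModified = new Map()

// Call this whenever a document receives an update.
const touch = (docName) => lastModified.set(docName, Date.now())

// Answer "what happened since <time>?" with the names of changed documents.
const changedSince = (since) =>
  [...lastModified.entries()]
    .filter(([, t]) => t > since)
    .map(([docName]) => docName)

// First sync: changedSince(-Infinity) returns every known document.
// Later syncs: changedSince(lastSyncTime) returns only the recently touched ones.
```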

I wanted to implement the above approach in Ydb. Ydb could be run as a server for a company, or as a personal database on your own computer. Clients should also be able to sync with each other individually, sharing content based on some permission system. How do we efficiently synchronize millions of documents amongst many clients? I think it should be possible to generalize the above approach by allowing many “sync targets”. So whenever we sync with another database, we are going to remember the time we initialized that sync with that particular instance. Subsequent syncs are going to be much faster, because eventually we will have synced with all Ydb instances.

Nice! The approach you just listed is precisely what I implemented yesterday! It works fine, though I’ve run into a bit of a hiccup. When an empty client initiates a “give me all changes since the beginning of time” request, the server sends the “dirty list”, and naturally the client sends SyncStep1, which the server follows up with SyncStep2+SyncStep1. The sync step 1 prompts the client to send Step 2 updates, many of which contain no meaningful changes, but the server can’t tell… Then I get stuck in an update loop, as now everything has been marked as “dirty” again.

It seems I’ll need to add an option to skip the server Sync Step 1 response for the “server to client” sync, and let the client handle the “client to server” sync more selectively. I’m still not sure of the most elegant approach to achieve date convergence. Maybe I’ll keep the “single sync” protocol the same, and add message types for a “bulk sync” protocol that behaves differently w.r.t. its side effects.

Then I get stuck in an update loop, as now everything has been marked as “dirty” again.

Makes sense. I think in this case you don’t really need to sync, you just want to pull the data. A sync is only necessary when the data changed locally as well. But it makes sense to solve the underlying problem first.

I assume you get into an infinite loop because the timestamp updates all the time? You might get around that by listening to the “update” event. If you are using differential updates: You could check if the old document equals the new document after applying sync step 2 (if no change was applied, then the resulting merge equals the initial document). So only update the timestamp when a change was applied.
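If you are storing documents as updates, a sketch of that check; equalBytes is a small helper, and the assumption (per the suggestion above) is that merging in an already-known update reproduces the stored encoding byte for byte:

```js
import * as Y from 'yjs'

const equalBytes = (a, b) =>
  a.byteLength === b.byteLength && a.every((value, i) => value === b[i])

const applyIfChanged = (storedUpdate, incomingUpdate) => {
  const merged = Y.mergeUpdates([storedUpdate, incomingUpdate])
  const changed = !equalBytes(merged, storedUpdate)
  // Only bump the last-modified timestamp / dirty flag when `changed` is true.
  return { merged, changed }
}
```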

If you follow the generic Ydb pattern: for each document that changed locally (whether server or client), you could ask the other peer to send a sync step 1 if it changed locally (introducing sync step 0). You should absolutely work around the normal sync protocol when syncing millions of documents. For live collaborating on a single document, it is great. But in your case you might focus on the big picture instead and optimize later. You could even request the complete document (no sync steps) without a big performance hit (This is how every other note editing app works). It would increase network traffic, but it would reduce database access.

The described bulk sync process has proven to work nicely and minimize data transfer.

0a. For every change on the client, the client writes a “dirty” record identifying a changed document.
0b. For every doc on the server, the server keeps the last update time.
1. On connection established, the client sends a Sync Step 0 for each dirty record.
2. The server responds with Sync Step 1.
3. The client responds with Sync Step 2 and clears the corresponding dirty record.
4. The client sends “Get” Step 1 with the date of the last GS1, or -Inf.
5. The server responds with GS2, the list of docs that have changed since the provided date.
6. The client sends SS1 for each doc.
7. The server responds with SS2 for each doc.
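For illustration, here’s a minimal sketch of the client side of steps 1 and 4; the message type constants and shapes are made up and not part of y-protocols:

```js
// Made-up message tags for the bulk-sync protocol described above.
const MSG_SYNC_STEP_0 = 100 // "doc X is dirty on my side, please send SS1"
const MSG_GET_STEP_1 = 101  // "which docs changed since <timestamp>?"

// On connection established (steps 1 and 4).
const onConnect = (send, dirtyDocNames, lastGS1Time) => {
  for (const docName of dirtyDocNames) {
    send({ type: MSG_SYNC_STEP_0, docName })
  }
  send({ type: MSG_GET_STEP_1, since: lastGS1Time ?? -Infinity })
}
```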

If your connection is live, you could technically not write a dirty record on client updates, since the receiver probably got the update, but I think the redundancy provided by not clearing the dirty record until you receive an SS1 for it is cozy and doesn’t result in much duplicate work in practice.

Like you said, for P2P it seems like you’d just do the same bookkeeping as above, but for each known peer.

The approach also has a nice signal-to-noise ratio with regard to knowing when to update secondary stores, like search indexes or a relational DB that follows a Yjs doc.

Neat, this is super valuable @braden. Thanks for sharing!

Just to clarify this for other people who are looking to implement something similar:

Sync Step 1/2 are defined in the y-protocols package. SS1 is simply a state vector. SS2 contains the computed differences. https://docs.yjs.dev/api/document-updates#update-api
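For reference, producing and answering these messages with y-protocols looks roughly like this (as far as I remember the helper names; double-check against the package):

```js
import * as Y from 'yjs'
import * as encoding from 'lib0/encoding'
import * as decoding from 'lib0/decoding'
import * as syncProtocol from 'y-protocols/sync'

const doc = new Y.Doc()

// SS1: encode our state vector and send it to the other peer.
const encoderSS1 = encoding.createEncoder()
syncProtocol.writeSyncStep1(encoderSS1, doc)
const ss1Message = encoding.toUint8Array(encoderSS1)

// Receiving side: read the sync message and write the reply (SS2) into an encoder.
const decoder = decoding.createDecoder(ss1Message)
const replyEncoder = encoding.createEncoder()
syncProtocol.readSyncMessage(decoder, replyEncoder, doc, null)
const ss2Message = encoding.toUint8Array(replyEncoder)
```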

“Get Step” is described earlier in this thread:

GS1: A message containing a timestamp of the date when all documents were last synced.
GS2: A message containing all documents that changed since GS1.

Just a note: you have to be careful not to rely on every message arriving. The connection could break before you have synced all documents. This is why you should update the “last synced”/GS1 timestamp only once all documents in GS2 have been synced.
