Question regarding updates and state vectors in y-leveldb

dmonad · March 8, 2021, 5:15pm

You could send a snapshot instead. A snapshot consists of a state vector and a delete set. In theory, if the snapshot of doc1 equals the snapshot of doc2, then the documents are in sync. I think there are some edge cases where the encoded delete sets might not match, because the order in which they are written is non-deterministic (a state vector writes client-ids in decreasing order; a delete set writes client-ids in any order, but we can fix that).

If you don’t want to use timestamps (based on UTC time), then you might be better off computing hashes on the encoded document. Hashes are smaller than snapshots and are faster to generate.

Syncing thousands of documents is gonna be pretty inefficient if you still need to compute sync steps. My approach would be to store some hash/timestamp alongside the shared document. If the hashes match, then there is no need to initiate a sync.
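To make the hash idea a bit more concrete, here is a minimal sketch. Everything in it is hypothetical: `StoredDoc`, `storeDoc` and `docsNeedingSync` are names made up for illustration, and FNV-1a is only a dependency-free stand-in for whatever hash you would really use (probably something like SHA-256). It also assumes that both sides produce byte-identical encodings for the same document state, which may not hold as long as the delete set is written in a non-deterministic order.

```typescript
// FNV-1a (32-bit): a stand-in hash so the sketch needs no imports.
function fnv1a(bytes: Uint8Array): number {
  let hash = 0x811c9dc5;
  bytes.forEach((b) => {
    hash ^= b;
    hash = Math.imul(hash, 0x01000193) >>> 0;
  });
  return hash >>> 0;
}

// Hypothetical record: the encoded document plus the hash stored alongside it.
interface StoredDoc {
  encoded: Uint8Array;
  hash: number;
}

// Recompute the hash whenever the encoded document is written.
function storeDoc(store: Map<string, StoredDoc>, name: string, encoded: Uint8Array): void {
  store.set(name, { encoded, hash: fnv1a(encoded) });
}

// Return the names of documents whose hashes differ or that exist on one side only.
// Only these documents need an actual sync.
function docsNeedingSync(local: Map<string, StoredDoc>, remoteHashes: Map<string, number>): string[] {
  const names: string[] = [];
  local.forEach((doc, name) => {
    if (remoteHashes.get(name) !== doc.hash) names.push(name);
  });
  remoteHashes.forEach((_hash, name) => {
    if (!local.has(name)) names.push(name);
  });
  return names;
}
```

Documents with matching hashes are skipped entirely, so the expensive sync steps only run for the ones that actually differ.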

I wanna share an idea that has been lurking in the back of my mind for quite some time now. If I were to write a note-taking application, I might expect that some of my users have access to millions of documents (e.g. because they are part of a large company). In this case, even the above approach would be too inefficient, because the client would need to send millions of hashes to the server to check if there are any updates.

Instead, I would work with a last-modified timestamp (based on UTC time). When I initially sync with a server, I’m gonna ask the server what changed since the beginning of time. The server will reply with all document-names that currently exist. The client is gonna sync with each of them individually (pulling the complete state in the background). The next time the client syncs with the server, it asks “what happened since yesterday”. The server is gonna reply with all documents that changed since yesterday. Again, the client is gonna sync with each of them individually in the background. There are probably only going to be a few documents that have changed, so we get close to optimal performance without accessing cold documents (which might even live in cheap “cold storage” such as AWS S3 Glacier).
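The exchange described above could look roughly like this. It is only a sketch under my own assumptions: `ModifiedIndex`, `SyncClient`, `touch`, `changedSince` and `syncWith` are hypothetical names, timestamps are plain UTC milliseconds passed in by the caller, and the per-document sync itself is left as a callback.

```typescript
// Server side: remembers when each document was last modified (UTC milliseconds).
class ModifiedIndex {
  private lastModified = new Map<string, number>();

  // Record that a document changed at the given time.
  touch(name: string, nowMs: number): void {
    this.lastModified.set(name, nowMs);
  }

  // Answer "what changed since X?" with the matching document names.
  // ">=" means a change at exactly the sync time is re-sent rather than missed.
  changedSince(sinceMs: number): string[] {
    const names: string[] = [];
    this.lastModified.forEach((modifiedMs, name) => {
      if (modifiedMs >= sinceMs) names.push(name);
    });
    return names;
  }
}

// Client side: remembers when it last asked the server.
class SyncClient {
  private lastSyncMs = 0; // 0 stands for "the beginning of time"

  // Ask for changes since the previous sync and sync each document individually.
  syncWith(server: ModifiedIndex, nowMs: number, syncDoc: (name: string) => void): string[] {
    const changed = server.changedSince(this.lastSyncMs);
    changed.forEach((name) => syncDoc(name));
    this.lastSyncMs = nowMs;
    return changed;
  }
}
```

The first call returns every document the server knows about. Later calls only return what changed in the meantime, so cold documents are never touched.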

I wanted to implement the above approach in Ydb. Ydb could be run as a server for a company, or as a personal database on your own computer. Clients should also be able to sync with each other individually, sharing content based on some permission system. How do we efficiently synchronize millions of documents amongst many clients? I think it should be possible to generalize the above approach by allowing many “sync-targets”. So whenever we sync with another database, we are going to remember the time we initiated the sync with that particular instance. Subsequent syncs are going to be much faster, because eventually, we will have synced with all Ydb instances.
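Here is a rough sketch of the bookkeeping for many sync-targets. None of this is Ydb's actual API: `SyncTarget`, `SyncLedger` and `changedSince` are hypothetical names, and the sketch assumes that every instance can answer a "what changed since X?" query and that clocks are comparable UTC timestamps.

```typescript
// Hypothetical view of another database instance we can sync with.
interface SyncTarget {
  id: string;
  changedSince(sinceMs: number): string[];
}

// Remembers, per sync-target, when we last initiated a sync with it.
class SyncLedger {
  private lastSyncByTarget = new Map<string, number>();

  // Ask the target only for changes since our previous sync with that target.
  // A target we have never seen is asked for everything (since time 0).
  sync(target: SyncTarget, nowMs: number): string[] {
    const sinceMs = this.lastSyncByTarget.get(target.id) ?? 0;
    const changed = target.changedSince(sinceMs);
    this.lastSyncByTarget.set(target.id, nowMs);
    return changed;
  }
}
```

Each target gets its own timestamp, so a full exchange only happens the first time two instances meet.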