Using y.js for distributed storage

janesconference · January 29, 2021, 5:36pm

Hi,
I’m a newbie at y.js, but looking at the leveldb thread, I understand that it is possible to distribute and save data on leveldb without necessarily materializing the doc (and consequently keeping a constant memory footprint). I was wondering whether this concept could be applied to distributed redundant storage: many nodes, each one with an instance of leveldb and a constant memory footprint, eventually converging to the same data in all the dbs.
Would that be feasible? Or am I just misunderstanding how it works? If it is feasible, what data would be in the databases: a list of deltas or the latest snapshot of the doc?

dmonad · February 2, 2021, 7:35pm

Hi @janesconference,

Sure, that would be possible. I completed the groundwork for this feature: https://github.com/yjs/yjs/pull/274

y-websocket does not yet sync without loading the Yjs state to memory. This is what I’m working on next.

The y-leveldb database contains a list of small incremental updates. When a client syncs with y-leveldb, all contained updates are merged into a single document state (using the new Y.mergeUpdates function). Then we sync with the client using Y.diffUpdate(mergedUpdate, stateVector).

When the database contains too many updates (~100-1000 entries), we simply merge all contained updates into a single entry. This reduces overhead when querying the database. The same approach is used by y-indexeddb.

janesconference · April 13, 2021, 11:17am

Sorry, for some reason I wasn’t notified of the reply. Thanks for answering!

To clarify, what I would like to do is a cluster-like collection of servers, each backed by leveldb. The clients would just use this cluster as a key-value db.

What I see is a dynamic number of servers that are connected via y-websocket. They are resilient and eventually consistent. If one or more servers go down, the overall cluster won’t go down. If we add one or more servers, the cluster “grows” for free.

My questions would be:

How well would this solution scale in practice (for example, how does the number of messages grow as the number of key-value pairs grows)?
Would it be possible for a client to do a get(key) on one of the servers in that cluster, and for the server to get the value for that key without materializing the whole doc?

dmonad · April 13, 2021, 12:20pm

Hi @janesconference,

I think you should try that out. To my knowledge, nobody has done that before. You could maybe learn from other (decentralized / eventually consistent) databases. I’ve never built a database and have little knowledge to share.

Currently, all serious applications that use Yjs eventually store the Yjs document in a central database like Postgres. My goal is really to make it easy to integrate Yjs into existing applications, instead of providing an ultimate data storage solution. I always hoped that, eventually, we would have a highly scalable database specifically for Yjs. At the moment, you have to integrate Yjs into an existing highly-scalable database (which is fairly easy).