Transactions + Nested Subdocuments

raine · January 5, 2023, 8:54pm

I have been reading more about subdocuments and have arrived at a quandary.

Subdocuments seem to be widely recommended for more complex data modeling, in particular granular access control and lazy loading.

Yet because subdocuments are independent (correct me if I’m wrong), you lose all atomicity across document boundaries. The CRDT will not resolve conflicts across document boundaries, e.g. subdocuments nested within a Y.Map, as opposed to plain nested Y.Maps.

Am I missing something? Is this a limitation of YJS? Granular access control is a requirement on my project, and lazy loading is a necessity because I expect 1 million+ nodes in a user’s tree. At the same time, I need conflicts to be handled by the CRDT across different levels of the tree, just like a nested Y.Map.

Thanks so much for your input!

P.S. RxDB is an offline-first database that does not support transactions and instead offers revisions and conflict handling. Not sure if the points made there also apply to YJS, at least in a general sense.

jarone · January 6, 2023, 4:05am

On the problem of excessive data, I think there are two directions to try:

- Reduce the weight of the tree
- Reduce the weight of each node in the tree

I don’t have a good idea about the first point.
But on the second point, I think we can do this:

- The whole tree is a ydoc; each node is also an independent ydoc
- Each node in the tree stores only a guid
- The business logic controls whether any given node is rendered. When rendering is required, create a new connection and download the corresponding ydoc content

Therefore, we need a custom connection provider.
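The guid-per-node idea above can be sketched without any Yjs code at all. What follows is a minimal in-memory model, assuming a hypothetical `Loader` function that stands in for whatever a custom provider would do to fetch one node's document; `LazyTree`, `NodeContent`, and `fetchCount` are illustrative names, not part of Yjs or any provider.

```typescript
// Hypothetical model: none of these names come from Yjs or an existing provider.
type Guid = string;

interface NodeContent {
  guid: Guid;
  text: string;
  childGuids: Guid[]; // children are referenced by guid only
}

// Stand-in for the network or disk fetch a custom provider would perform.
type Loader = (guid: Guid) => NodeContent;

function indent(depth: number): string {
  return new Array(depth + 1).join("  ");
}

class LazyTree {
  private loaded: Record<Guid, NodeContent> = {};
  public fetchCount = 0;

  constructor(private rootGuid: Guid, private load: Loader) {}

  // Return a node's content, fetching it only on first access.
  get(guid: Guid): NodeContent {
    const cached = this.loaded[guid];
    if (cached !== undefined) return cached;
    this.fetchCount += 1;
    const content = this.load(guid);
    this.loaded[guid] = content;
    return content;
  }

  // Render down to maxDepth; nodes below that depth are never fetched.
  render(maxDepth: number, guid: Guid = this.rootGuid, depth: number = 0): string[] {
    const node = this.get(guid);
    let lines: string[] = [indent(depth) + node.text];
    if (depth < maxDepth) {
      for (const child of node.childGuids) {
        lines = lines.concat(this.render(maxDepth, child, depth + 1));
      }
    }
    return lines;
  }
}

// Fake backing store standing in for the server.
const store: Record<Guid, NodeContent> = {
  a: { guid: "a", text: "a", childGuids: ["b"] },
  b: { guid: "b", text: "b", childGuids: ["c", "d"] },
  c: { guid: "c", text: "c", childGuids: [] },
  d: { guid: "d", text: "d", childGuids: [] },
};

const lazyTree = new LazyTree("a", (guid) => store[guid]);
const shallow = lazyTree.render(1); // fetches only a and b
const fetchesAfterShallow = lazyTree.fetchCount;
const deep = lazyTree.render(2); // now also fetches c and d
```

The point of the sketch is only that the parent holds references while content arrives on demand. It says nothing about how updates to separately loaded nodes stay consistent with each other.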

raine · January 6, 2023, 4:24pm

Yes, assuming you mean that the root node of the tree is a ydoc (rather than storing the whole tree in a single ydoc).

Yes, seems necessary for lazy loading.

Right, this would be a necessary extension to the provider(s) in order to handle subdocuments.

However, the main question remains: Given that a variety of data models require or benefit from subdocuments, what consistency guarantees exist across subdocument boundaries? Do we not lose the consistency of the CRDT for inter-document data structures? My use case of granular subgraphs is one instance of this problem.

I say this without having thought through the CRDT operations and state vectors that might be unique to a lazy-loaded graph. I was hoping that I could find a CRDT that has already abstracted those low-level details. I could go with a “graph-first” library like GunDB… but I like YJS and have been doing everything I can to avoid Gun due to its infamously bad codebase, fragmented documentation, and reports of consistency issues.

chrysalis · January 7, 2023, 2:51am

@raine I have recently started exploring subdocuments, and I learned a few things after playing around with them. Subdocuments are being touted as a game-changer, but they are not quite the turn-key, out-of-the-box solution one might expect after reading about them.

- Not all providers support them today, so you will likely have to implement your own (which should be relatively easy).
- You can’t really deepObserve subdocuments. Listening for changes requires doing everything that you would do for an ordinary y document (e.g. observe the changes to shared types or update events on the ydoc).
- I find lazy loading as the default behavior a bit annoying.

Having said that, I believe there is a lot of potential to improve them in the future.

I am curious about the use case where you need atomicity of updates across multiple subdocuments. Can you tell me a bit about what exactly you are trying to do?

raine · January 7, 2023, 4:19am

Yes, that’s been my experience. Seems like they have a lot of potential, but a bit undercooked.

There is some good starter code here: How to sync thousands of documents and have local persistent store? - #6 by nokola. (Not sure why this hasn’t made it into a PR in two years. I know, we’re all busy.)

Fair enough. Hopefully better support for autoLoad: true will make this easier.

I have a tree with 1 million+ nodes per user (personal knowledge management app). It’s far too much to load into memory at once, so I need lazy loading, hence I need subdocuments.

A delete operation needs to delete all descendants atomically across nodes (i.e. subdocuments).

a
- b
  - c
  - d

Deleting b must also delete c and d. If this were a nested Y.Map I could use transact, but I can’t do this when they are separate subdocuments.
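To make the requirement concrete, here is a minimal sketch of an all-or-nothing subtree delete on a plain in-memory tree. It is not Yjs code and does not solve the cross-subdocument problem; it only shows the behavior a single `transact` gives you for free when everything lives in one document. `TreeNode`, `collectSubtree`, and `deleteSubtree` are hypothetical names.

```typescript
// Plain in-memory model; not the Yjs API.
type Id = string;

interface TreeNode {
  id: Id;
  parent: Id | null;
  children: Id[];
}

type Tree = Record<Id, TreeNode>;

// Gather a node and all of its descendants, depth first.
function collectSubtree(tree: Tree, id: Id): Id[] {
  const node = tree[id];
  if (node === undefined) throw new Error("unknown node: " + id);
  let ids: Id[] = [id];
  for (const child of node.children) {
    ids = ids.concat(collectSubtree(tree, child));
  }
  return ids;
}

// Build a new tree without the subtree. The input is never mutated, so a
// failure part-way through cannot leave a half-deleted tree behind.
function deleteSubtree(tree: Tree, id: Id): Tree {
  const doomed = collectSubtree(tree, id);
  const next: Tree = {};
  for (const key of Object.keys(tree)) {
    if (doomed.indexOf(key) !== -1) continue;
    const n = tree[key];
    next[key] = {
      id: n.id,
      parent: n.parent,
      children: n.children.filter((c) => c !== id),
    };
  }
  return next;
}

const before: Tree = {
  a: { id: "a", parent: null, children: ["b"] },
  b: { id: "b", parent: "a", children: ["c", "d"] },
  c: { id: "c", parent: "b", children: [] },
  d: { id: "d", parent: "b", children: [] },
};

const afterDelete = deleteSubtree(before, "b");
```

When b, c, and d are separate subdocuments, there is no single object to swap like this, which is the gap being described.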

Now for this example you might suggest using “tombstoning” or another technique to clean up orphaned nodes, but this is just one example of many multi-node operations that require atomicity, so I don’t believe any kind of solution involving post hoc clean up is sufficient.

For example, consider a command called collapse which deletes a node and moves all its children up a level. If activated on b above, it should result in:

a
- c
- d

Moving multiple nodes at once should be atomic.
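The collapse command can be sketched the same way, again on a plain in-memory structure rather than Yjs types. `Outline`, `OutlineNode`, and `collapse` are hypothetical names, and the sketch assumes the collapsed node has a parent. All reparenting happens on a copy that is returned in one step, which is the atomicity the command needs.

```typescript
// Plain in-memory model; not the Yjs API.
type NodeId = string;

interface OutlineNode {
  parent: NodeId | null;
  children: NodeId[];
}

type Outline = Record<NodeId, OutlineNode>;

// Delete `id` and move its children up into its place under its parent.
function collapse(outline: Outline, id: NodeId): Outline {
  const target = outline[id];
  if (target === undefined || target.parent === null) {
    throw new Error("cannot collapse " + id);
  }
  const parentId: NodeId = target.parent;

  // Copy every node except the collapsed one, reparenting its children.
  const next: Outline = {};
  for (const key of Object.keys(outline)) {
    if (key === id) continue;
    const n = outline[key];
    next[key] = {
      parent: n.parent === id ? parentId : n.parent,
      children: n.children.slice(),
    };
  }

  // Replace the collapsed node with its children at the same position.
  const siblings = next[parentId].children;
  const at = siblings.indexOf(id);
  next[parentId].children = siblings
    .slice(0, at)
    .concat(target.children, siblings.slice(at + 1));
  return next;
}

const outline: Outline = {
  a: { parent: null, children: ["b"] },
  b: { parent: "a", children: ["c", "d"] },
  c: { parent: "b", children: [] },
  d: { parent: "b", children: [] },
};

const collapsed = collapse(outline, "b");
```

Here one operation touches the parent, the collapsed node, and every child, so with one subdocument per node it spans at least four documents.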

I’m sure I will learn a lot more once I start writing some code, but hoping to avoid putting the effort into a custom provider if there are known (or implied) limitations to consistency when building trees or graphs with subdocuments.

Which goes back to @dmonad’s quote in the OP. What exactly are the limitations of subdocuments re: atomicity? What is the alternative without losing lazy loading and granular access control?

Thanks for reading this far

chrysalis · January 7, 2023, 6:54am

@raine Thanks for explaining your use case. Knowing what I know of yjs and subdocs, I’d try to look for a solution to atomicity elsewhere. I don’t believe yjs itself has any mechanisms inherent to the library for this. Have you considered storing individual ydocs in a graph db (e.g. neo4j) that can maintain the relations between various nodes (and presumably modify those relations atomically) — and depend on yjs only for the collaborative editing of individual nodes?

raine · January 7, 2023, 3:57pm

I’ve considered a dedicated graph database, though I’m not sure how easy it will be to squeeze offline-first functionality out of them.

Using YJS only for collaborative editing would work if both users are online at the same time and working on proximal subgraphs. However, if the users are working offline on distal subgraphs, I have the original problem again: distributed changes across a larger-than-in-memory space need to be synchronized through a server using a CRDT.

I might take a stab at extending WebsocketProvider and see how far I get. Nested subdocuments do at least have the Doc guids managed by the CRDT, and the only real invariant is that children updates from different devices get merged instead of overwritten. I just don’t have the knowledge of CRDTs or YJS internals to project further into the problem.
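The "merged instead of overwritten" invariant can be illustrated with a toy comparison. This is an assumption-laden sketch, not how Yjs merges shared types: it treats a node's children as a grow-only set of guids, ignores deletes and ordering, and uses the hypothetical names `overwrite` and `mergeChildren`.

```typescript
// Toy illustration only; Yjs uses its own CRDT algorithm for shared types.
type ChildGuids = string[];

// Last-writer-wins: whichever replica syncs last replaces the other's list.
function overwrite(_mine: ChildGuids, theirs: ChildGuids): ChildGuids {
  return theirs.slice();
}

// Grow-only union: keep every guid either replica has seen. The result is
// sorted so both replicas converge on the same list regardless of merge order.
function mergeChildren(mine: ChildGuids, theirs: ChildGuids): ChildGuids {
  const seen: Record<string, true> = {};
  const out: string[] = [];
  for (const guid of mine.concat(theirs)) {
    if (!seen[guid]) {
      seen[guid] = true;
      out.push(guid);
    }
  }
  return out.sort();
}

// Two devices start from children [c, d] and each add one child while offline.
const deviceA: ChildGuids = ["c", "d", "e"];
const deviceB: ChildGuids = ["c", "d", "f"];

const lost = overwrite(deviceA, deviceB); // e is dropped
const mergedAB = mergeChildren(deviceA, deviceB); // e and f both survive
const mergedBA = mergeChildren(deviceB, deviceA); // same result in the other order
```

Sorting discards the user's ordering and a grow-only set cannot express deletes, so a real tree would need more than this.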

joakim · January 9, 2023, 1:43pm

That’s exactly what I’m thinking of doing in my project (also a personal knowledge management app). Best tool for the job, etc. I haven’t written a line of code yet, so I’m sure there will be dragons. This is uncharted territory for me. Would be interesting to look at how other similar projects have solved this.

joakim · January 12, 2023, 1:59pm

@raine Here’s an offline-first graph database that works with IndexedDB: GitHub - levelgraph/levelgraph: Graph database JS style for Node.js and the Browser. Built upon LevelUp and LevelDB.

Though it doesn’t seem to be actively developed, and it depends on levelup which is to be superseded by abstract-level.

raine · January 12, 2023, 4:23pm

Thanks for the suggestion. I don’t believe that will work with yjs since yjs writes state vectors in binary to disk, but it looks interesting.

raine · January 12, 2023, 4:58pm

I started work on a lazy-loaded graph type based on subdocuments:

But I decided to go a different direction on my project and create a separate database for each node, which fits better with my existing architecture. We’ll see :).