One large Y.Doc or many smaller Y.Doc?

cmmartin · December 31, 2021, 5:24pm

Hi,

I’m learning yjs and very impressed so far. I have a couple lingering questions.

I’m considering building a collaborative note taking application. A single project can have many pages of notes. I’m trying to decide whether I should have one Y.Doc per project, with each “page” essentially being a Y.Map with nested sub-children, or whether each individual “page” should be its own Y.Doc instance. I’m curious about the pros and cons of each approach. A user can only edit one page at a time, but may edit many pages in a single session. Is there any reason to favor one approach over the other?
I’m trying to determine a strategy for long-term document storage in Postgres. Is there a maximum size for the binary format? In other words, if I call Y.encodeStateAsUpdate with a large document, can I make any assumptions about the size of the Uint8Array that is returned? Will it grow infinitely with the size of my document? What kind of database column are people typically using for storing these updates?

ViktorQvarfordt · January 2, 2022, 12:56am

Structuring data in YDocs

One basically needs to decide on the following:

To use one or multiple YDocs for an entity or set of entities in your application.
How to structure the data within a YDoc.

When reasoning about how to structure data in Yjs, I recommend considering these aspects:

The flow of data for common use cases: It can be good to group data that is often used together. In contrast, it may not be practical to load hundreds of YDocs at once or load new YDocs very frequently.
Read/write permissions: Permissions cannot be practically enforced within a YDoc so you need to split data into multiple YDocs if you need different permissions for different parts of the data.
Size is very rarely a practical problem as long as you deal with human-entered text input. (See benchmarks.)
Separate structure and data: In some cases it can be practical to have one YDoc that holds only the id references across entities (eg. pages) and one YDoc per entity for its data. This is particularly relevant if you need different permission levels for different entities. If you have no need for granular control, a split like this may be unnecessarily complex.
History and undo: At what level is it natural to track edit history and perform undo? It is much easier to perform history tracking within a single YDoc rather than spread across multiple YDocs.
Consider using a single top-level YMap: Top-level shared types cannot be deleted, so you may want to structure all your data in a single top-level YMap, eg. yDoc.getMap('data').get('page-1').
Subdocuments: You may also consider using subdocuments. However, it gets a bit more complex and your provider may not support it.
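
As a minimal sketch of the single top-level YMap layout mentioned above: the key names ('data', 'title', 'body') and the addPage helper are made up for illustration, not a convention from Yjs itself.

```typescript
import * as Y from 'yjs'

// One Y.Doc per project; every page lives under a single top-level Y.Map.
const projectDoc = new Y.Doc()
const data = projectDoc.getMap<Y.Map<unknown>>('data')

// Hypothetical helper: a page is a nested Y.Map with a title and a Y.Text body.
function addPage(id: string, title: string): Y.Map<unknown> {
  const page = new Y.Map<unknown>()
  page.set('title', title)
  page.set('body', new Y.Text())
  data.set(id, page)
  return page
}

addPage('page-1', 'Meeting notes')
addPage('page-2', 'Ideas')

// Edit the body of one page.
const body = data.get('page-1')!.get('body') as Y.Text
body.insert(0, 'Hello')

// Nested entries can be deleted, unlike top-level shared types.
data.delete('page-2')
```

Because pages are entries of the 'data' map rather than top-level types, removing a page is an ordinary map deletion.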

Storing YDocs in a database

The return type of Y.encodeStateAsUpdate is a byte array (Uint8Array). Postgres has a data type for binary data just like this, called BYTEA. Other SQL databases call this BLOB or BINARY LARGE OBJECT.
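
A rough sketch of the save/load round trip, assuming a table shaped like the one in the comment. The in-memory Map below is only a stand-in for that table so the example stays self-contained; with a real Postgres driver you would bind the bytes to the BYTEA column instead.

```typescript
import * as Y from 'yjs'

// Hypothetical schema (an assumption, not from the thread):
//   CREATE TABLE ydocs (id TEXT PRIMARY KEY, state BYTEA NOT NULL);
// Stand-in for that table: id -> stored bytes.
const fakeTable = new Map<string, Uint8Array>()

// Save: encode the full document state and store the bytes.
const doc = new Y.Doc()
doc.getText('body').insert(0, 'Hello from page 1')
fakeTable.set('page-1', Y.encodeStateAsUpdate(doc))

// Load: apply the stored bytes to a fresh Y.Doc.
const restored = new Y.Doc()
const stored = fakeTable.get('page-1')
if (stored !== undefined) {
  Y.applyUpdate(restored, stored)
}
```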

Estimating the size of YDocs

Generally speaking, the size of the byte array representation given by Y.encodeStateAsUpdate will grow as you apply edit operations to your document. Yjs does apply garbage collection, but some traces of past edits cannot be fully garbage collected in order to maintain the properties of a CRDT. The advice I can give on this is to 1) use the V2 update format, which provides much better compression, and 2) run some experiments where you simulate scenarios that will be common for your application and see how your YDocs grow in size.
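
One way to run such an experiment is sketched below. The editing pattern is an arbitrary assumption; replace it with edits that resemble your application. The sizes are only printed, since the numbers depend on the edit pattern.

```typescript
import * as Y from 'yjs'

const doc = new Y.Doc()
const text = doc.getText('body')

// Simulated session (an assumption): prepend a word 500 times and
// delete one character after every tenth insert.
for (let i = 0; i < 500; i++) {
  text.insert(0, `word${i} `)
  if (i % 10 === 9) text.delete(0, 1)
}

// Encode the same document state in both update formats and compare sizes.
const v1: Uint8Array = Y.encodeStateAsUpdate(doc)
const v2: Uint8Array = Y.encodeStateAsUpdateV2(doc)
console.log(`V1: ${v1.byteLength} bytes, V2: ${v2.byteLength} bytes`)

// A V2 update must be applied with the matching V2 function.
const restored = new Y.Doc()
Y.applyUpdateV2(restored, v2)
```

Note that the two formats are not interchangeable: bytes produced by Y.encodeStateAsUpdateV2 need to be read back with Y.applyUpdateV2.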

cmmartin · January 2, 2022, 6:42pm

Thanks a lot @ViktorQvarfordt! This is extremely helpful.

dmonad · January 7, 2022, 2:51pm

Thank you so much for the summary @ViktorQvarfordt ! Would you mind if I copy that to the documentation?

ViktorQvarfordt · January 7, 2022, 5:19pm

I’m glad it was useful. Feel free to copy and reuse in any way!

folencao · November 26, 2022, 5:44am

Hi @cmmartin - About your first question, I have exactly the same situation. May I ask what your final solution was?

whether I should have one Y.Doc per project, with each “page” essentially being a Y.Map with nested sub-children.

With this approach, it is hard to use the ydoc.on('update',xx) event to record the updates for each page, because the event delivers updates from the entire ydoc, which contains many pages.
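
One possible way to at least see which pages a transaction touched is a deep observer on the top-level map. This is a sketch that assumes the pages are nested Y.Maps under a 'data' map. It gives page ids, not per-page binary updates.

```typescript
import * as Y from 'yjs'

const doc = new Y.Doc()
const data = doc.getMap<Y.Map<unknown>>('data')

// Page ids touched by each transaction, in order (illustrative in-memory log).
const touchedLog: string[][] = []

data.observeDeep((events) => {
  const touched = new Set<string>()
  for (const event of events) {
    if (event.path.length > 0) {
      // Change inside a page: the first path segment is the page's key in `data`.
      touched.add(String(event.path[0]))
    } else {
      // Change on `data` itself: pages were added or removed.
      event.changes.keys.forEach((_change, key) => touched.add(key))
    }
  }
  touchedLog.push(Array.from(touched))
})

const page1 = new Y.Map<unknown>()
page1.set('body', new Y.Text())
data.set('page-1', page1)

const page2 = new Y.Map<unknown>()
page2.set('body', new Y.Text())
data.set('page-2', page2)

// Editing the body of page-1 is reported under the key 'page-1'.
const page1Body = page1.get('body') as Y.Text
page1Body.insert(0, 'Hello')
```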

Or whether each individual “page” should be its own Y.Doc instance.

With this approach, I think the downside is that we would have to establish a new WebSocket connection with a new ydoc instance every time the user switches between pages, as a websocket connection is only bound to one ydoc instance.

Thanks for providing suggestions.

MentalGear · February 17, 2024, 7:42pm

Yes, I’m wondering as well about the performance impact of using one Y.Doc per page. How many websocket connections are ok for a client/server to have at one time?

raine · February 17, 2024, 11:22pm

If I recall correctly, I started to see degraded performance with 100+ websocket connections.

FYI hocuspocus multiplexes all messages through a single websocket connection for any number of Docs.

folencao · February 19, 2024, 5:39am

Currently we use one connection for each client and use a sub doc to handle each page, so for one user, there is one connection but multiple sub docs.

MentalGear · February 19, 2024, 12:13pm

Thanks for sharing your XPs! I found that partykit’s hibernation mode allows for thousands of simultaneous connections. Whether that’s something that’s really needed is another thing.

PS: @folencao How’s your experience with subdocuments so far? I heard they are rather finicky.

folencao · February 19, 2024, 1:10pm

Our sub-doc is built on the code from “Extend y-websocket provider to support sub docs synchronization in one websocket connection”, but we made some changes and fixes to fit our features. I’m not sure about your app’s requirements, but you can check whether the yjs subdocument is enough for you: https://docs.yjs.dev/api/subdocuments
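
For reference, a minimal sketch of the plain Yjs subdocument API with one subdocument per page. The 'pages' and 'body' names are assumptions, and the manual applyUpdate calls stand in for whatever your provider does over the wire.

```typescript
import * as Y from 'yjs'

// Root doc for the project; each page is its own Y.Doc stored as a subdocument.
const rootDoc = new Y.Doc()
const pages = rootDoc.getMap<Y.Doc>('pages')

const page1 = new Y.Doc()
page1.getText('body').insert(0, 'Hello')
pages.set('page-1', page1)

// A second peer receives the root doc. It learns that the subdocument exists
// (including its guid), but the subdocument's content is not part of the root update.
const remoteRoot = new Y.Doc()
const loadedGuids: string[] = []
remoteRoot.on('subdocs', ({ loaded }: { loaded: Set<Y.Doc> }) => {
  // A provider would start syncing each newly loaded subdocument here.
  loaded.forEach((subdoc) => loadedGuids.push(subdoc.guid))
})
Y.applyUpdate(remoteRoot, Y.encodeStateAsUpdate(rootDoc))

const remotePage = remoteRoot.getMap<Y.Doc>('pages').get('page-1')!
remotePage.load() // request loading; triggers the 'subdocs' event on remoteRoot

// Stand-in for the provider: sync the subdocument content separately.
Y.applyUpdate(remotePage, Y.encodeStateAsUpdate(page1))
```

The subdocument's content travels separately from the root doc, which is why the provider has to support subdocuments explicitly.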