Optimizing initial load of a document receiving a lot of updates

erwan · October 30, 2023, 10:34am

Hi Everyone,

I’ve been digging into various threads on efficient document storage but haven’t landed on a solid approach yet. I’m reaching out to see if anyone can provide some clarity or suggestions.

My setup is as follows:

Frontend: prosemirror + yjs (+ custom provider using websockets and offline cache)
Backend: golang (websockets)

At the moment, a new document is initialized when created by a user, with updates stored individually. When a page is loaded, a refresh request fetches the document along with all updates via the websocket server. The document gets loaded, updates are applied and everything runs smoothly.

However, the hiccup arises when, say, 10 users are collaborating on a lengthy document, generating a multitude of updates rapidly. If a new user joins in, the initial load can get quite cumbersome as it involves loading all updates made since the document’s creation.

I’m considering two main solutions:

1- On the client side, have regular calls to Y.encodeStateAsUpdate(doc) to save it as the initial document entity, while removing all preceding updates. But, I suspect this might disrupt the collaboration amongst users editing the document concurrently. How and who should trigger this action is also a concern.

2- On the server side, load the document with updates, run Y.encodeStateAsUpdate(doc) to update the document entity, and clear all previous updates. This seems to only affect new users, keeping the experience intact for current users, unless there are unsynced offline updates. This solution, though, necessitates a new Node service, veering from our current all-Golang backend which isn’t ideal.

I’m very open to any insights or suggestions on these methods or any other efficient way to tackle this issue.

I’ve also looked into these threads:

Appreciate your help!

raine · October 30, 2023, 4:43pm

When performing the initial sync, the client sends a state vector to the server so that it only gets the missing updates. If you compress them all into a single update, wouldn’t that make the initial load slower?

erwan · October 30, 2023, 5:01pm

Thank you for your response. When you mention compressing them, are you referring to using Y.mergeUpdates to combine all updates into one?

If so, since my backend is built with Golang, this could pose a problem as I can’t do it on the fly. However, I could periodically group updates together using a Node.js micro-service to reduce the update count, for example by combining every 50 updates. Would this be effective?

Additionally, about combining all updates into the document from the backend, do you anticipate any possible problems?

raine · October 30, 2023, 5:53pm

I was referring to Y.encodeStateAsUpdate(doc). It generates a single update as far as I know. The problem I anticipate is that it would force users to download the entire Doc rather than just the missing portion.

At the risk of stating something you already know, I would spend some time to pinpoint the problem before attempting a solution. Is it overhead from the messaging format? Network latency? Number of updates? Total download size? The answer to that should give you a clearer picture of where to optimize.

erwan · October 30, 2023, 6:00pm

Ah, I misunderstood your earlier response, my mistake.

Currently, when a new user loads the document, they need to fetch the document itself (which is quite small since I don’t merge updates now) plus all the past updates to the document.

I tested with 10 users and noticed the update count rises rapidly if they are crafting a lengthy document together (like a few pages). This slows down the initial call since the database has to retrieve all updates, send them, and then the client needs to load it all.

I figured that regularly merging the updates into the document using Y.encodeStateAsUpdate(doc) could speed things up. This way, I wouldn’t need to fetch all earlier updates from the database. I’m also guessing that consolidating the updates into the document might reduce the response size by eliminating the back-and-forth actions (like adding and then removing a character), right?

raine · October 30, 2023, 8:36pm

I’m not sure I understand the distinction between the “document itself” and the “past updates”. A Doc is no more than the list of all its updates. Regardless, yes, the user would need to load all past updates to be able to view the Doc.

This was sort of my point. Which of those three things you list is most significantly contributing to the load time?

It should be pretty easy to test that hypothesis in an isolated example.

I’m not sure I understand what you mean here. The entire history of the Doc needs to be loaded into memory for the user to view it, whether they are encoded in one blob or many. A single update will have some reduction in space (due to less overhead), but I don’t know how significant that is. It would be worth measuring that to determine if that will create the performance gains you desire.

I don’t think that reverse actions are consolidated, as there is always the possibility of a different client having concurrent offline changes. CRDTs keep the full history around in order to merge without conflict (again, whether that is as a single update or many… the underlying Items are still all there). There is undoubtedly some overhead in maintaining separate updates, but I haven’t measured it. However, I’m still concerned that combining all the updates will eliminate the performance gains from the state vector that informs SyncStep2.

True performance gains would come from throwing away the history (i.e. replacing the Doc with a snapshot at regular intervals) or splitting the content into multiple Docs that can be loaded independently.

erwan · October 30, 2023, 8:54pm

I meant having a consolidated snapshot document vs loading updates one by one.

Loading from the database + formatting the data would take some time as we store them individually in my case.

Sorry if that wasn’t clear, but I was actually suggesting throwing away the history and creating a snapshot that would replace the initial document update as a starting point. Does that make more sense?

raine · October 30, 2023, 11:24pm

I see. So you’re wanting to find a way to truncate the history without disrupting concurrent activity.

I’ve considered this problem before and haven’t come up with a good solution. Let me know if you figure out anything.

erwan · October 31, 2023, 5:50am

Ok, thank you again for your answers.
Usually, how are users dealing with update storage? Are they just returning all of them even if there are a million of them?

If I decide to merge batches of 100 updates together regularly, would it potentially mess anything up?

raine · October 31, 2023, 12:58pm

I wanted to see for myself how the number of updates affected the total size, so I made a little demo. It creates 100 * 3 transactions and then compares the size of separate updates, merged updates, single update, and a snapshot applied to a new doc.

(The reason for setting three different keys is to avoid an optimization that Yjs makes when encoding contiguous changes to the same key, which distorted the results.)


import * as Y from 'yjs'

// doc1 and saved updates
const doc1 = new Y.Doc()
const map1 = doc1.getMap()
const updates1 = []
doc1.on('update', update => updates1.push(update))

// populate doc1
for (let i=0; i<100; i++) {
    map1.set('a', i)
    map1.set('b', i * 2)
    map1.set('c', i * i)
}

// copy doc1 json to doc2
const doc2 = new Y.Doc()
const map2 = doc2.getMap()
const json = doc1.toJSON()['']
for (let key in json) {
    map2.set(key, json[key])
}

const sumLength = (accum, update) => accum + update.length

console.log('separate updates', updates1.reduce(sumLength, 0))
console.log('merged updates', Y.mergeUpdates(updates1).length)
console.log('single update', Y.encodeStateAsUpdate(doc1).length)
console.log('new doc', Y.encodeStateAsUpdate(doc2).length)

Summary:

separate updates: 8785 bytes
merged updates: 3387 bytes
single update: 2592 bytes
new doc: 36 bytes

raine · October 31, 2023, 1:05pm

y-websocket sends a single update containing all changes since the given state vector. (Called from the server here.)

Note that the size of a single update will still far exceed the size of a newly created Doc populated with a snapshot of the same content, as can be seen in my demo above.

erwan · October 31, 2023, 1:46pm

Thank you for sharing the data, it’s really helpful!

From this, I gather that it might be wise to merge large batches of updates to save on both database queries and bandwidth over time.

I’ll try to create a new Node micro-server and implement this logic in my system.