Encode state as update and self state vector return big updates

baolin · June 4, 2023, 8:40am

Hi all, I’m new to Yjs. I use encodeStateAsUpdateV2 to get minimal updates on the server side.
When I call encodeStateAsUpdateV2 with the document’s current state vector, it returns a big update, almost a full update (close to what I get when I don’t pass a state vector at all). According to the API documentation, only the missing differences should be written to the update message.

Is there something wrong with my way of coding, or my understanding of the API?

const Y = require('yjs')
const { readFileSync } = require('fs')

// Load a previously stored V2 update from disk
const updateNeedApply = readFileSync('./tmp/1412294364815704065_1665017489422151681_current')

const ydoc = new Y.Doc()

Y.applyUpdateV2(ydoc, updateNeedApply)
// Diff the document against its own state vector
const sv = Y.encodeStateVector(ydoc)
const updates = Y.encodeStateAsUpdateV2(ydoc, sv)

console.log(`sv from update: ${Y.encodeStateVectorFromUpdateV2(updates)}`) // 0
console.log(`updates len: ${updates.length}`) // 7970726

raine · June 4, 2023, 12:02pm

I haven’t worked with the Updates API directly myself, but I am seeing that you are using the state vector on the same Doc it was generated from. I think it only works when using it on a different Doc (i.e. to synchronize them).

In the documentation, under Example: Sync two clients by computing the differences, notice how diff1 is using stateVector2, and diff2 is using stateVector1.

jarone · June 5, 2023, 12:58am

baolin · June 5, 2023, 2:19am

@jarone @raine

Thanks for your reply!

Actually, I changed my test code to the version below, and it gives the same result.

const updateNeedApply = readFileSync('./tmp/1412294364815704065_1665017489422151681_current')

const ydoc1 = new Y.Doc({})
const ydoc2 = new Y.Doc({})

Y.applyUpdateV2(ydoc1, updateNeedApply)
Y.applyUpdateV2(ydoc2, updateNeedApply)

const sv1 = Y.encodeStateVector(ydoc1)
const sv2 = Y.encodeStateVector(ydoc2)

console.log(`sv1 == sv2: ${sv1.toString()==sv2.toString()}`) // sv1 == sv2: true

const diff1 = Y.encodeStateAsUpdateV2(ydoc1, sv2) 
const diff2 = Y.encodeStateAsUpdateV2(ydoc2, sv1)

console.log(`diff1: ${Y.encodeStateVectorFromUpdateV2(diff1)}  len: ${diff1.length}`) // diff1: 0  len: 7970726
console.log(`diff2: ${Y.encodeStateVectorFromUpdateV2(diff2)}  len: ${diff2.length}`) // diff2: 0  len: 7970726

When I call parseUpdateMetaV2 on diff1 and diff2, it returns:

{"from":{},"to":{}}

According to the API documentation, shouldn’t the diff be very tiny? Right now the diff is about 7 MB, while the updateNeedApply update itself is about 10 MB.

jarone · June 6, 2023, 12:31pm

hi, @baolin

Hope this demo can help you:

const Y = require('yjs')

const guid = 'id'

const ydoc1 = new Y.Doc({ guid })
const ydoc2 = new Y.Doc({ guid })

const ytext1 = ydoc1.getText('text')
const ytext2 = ydoc2.getText('text')

ytext1.insert(0, 'a')
ytext2.insert(0, 'b')

const stateVector1 = Y.encodeStateVector(ydoc1)
const stateVector2 = Y.encodeStateVector(ydoc2)

const diff1 = Y.encodeStateAsUpdate(ydoc1, stateVector2)
const diff2 = Y.encodeStateAsUpdate(ydoc2, stateVector1)

Y.applyUpdate(ydoc1, diff2)
Y.applyUpdate(ydoc2, diff1)

console.log({
  doc1: ydoc1.toJSON(),
  doc2: ydoc2.toJSON(),
})

// { doc1: { text: 'ba' }, doc2: { text: 'ba' } }

baolin · June 9, 2023, 12:29pm

Hi, thanks for your explanation.

After reading and debugging the applyUpdateV2 source code, I found that some updates were missing, so part of the data was stuck in pendingStructs (or one of the other pending fields). That is why the update generated from the state vector was so huge. It was my mistake.

Appreciate your help again!

Gin-Quin · September 7, 2024, 6:49pm

I’ve just encountered the same issue.

It’s crazy that this:

const diff = Y.encodeStateAsUpdateV2(doc, Y.encodeStateVector(doc))

…does not return an empty update.

Why would you apply updates to a document that already has them, even if they are in a pending state?

This actually causes serious problems, because the same “useless” updates are applied over and over on every re-sync. Since I’m not merging all updates together but storing them in “small stacks”, I end up writing the same useless updates to my database endlessly.

And pending updates are not an excuse, because yeah, you can have pending updates when dealing with shared documents, that’s something that happens.

I’d be curious to know how you handled this @baolin, because I’m feeling very confused.

dmonad · September 9, 2024, 12:51pm

And pending updates are not an excuse, because yeah, you can have pending updates when dealing with shared documents, that’s something that happens.

It really shouldn’t happen. Especially not if you implement a client-server sync. Pending updates are something that should only be populated if you build a peer-to-peer sync system.

If you have pending updates, it probably means that your sync implementation lost some messages, which is not good and should be fixed. Yjs requires a reliable network protocol (as all sync engines do).

The recommended sync flow is to exchange state vectors initially to compute the differences. After the initial sync, you should only exchange updates that are emitted from the update event. That way you avoid syncing pending structs etc. for every single update.

Gin-Quin · September 9, 2024, 1:23pm

Indeed, pending updates happen when there are sync errors (or peer-to-peer, or any other reason, but it can happen). In my opinion, that’s why CRDTs exist: to fix sync issues, and resync different documents in different states.

The problem with the current behavior is that once one of my databases has sync issues, I have to resend the same big update message every time, even after I have resynced it with other documents that have all the data.

This is what happens step by step:

1. We sync.
2. We realize updates are missing (because of previous sync errors).
3. We diff the missing updates.
4. We send the missing updates.
5. The database now has all the updates.

This works fine the first time. But the next time the client syncs…

1. We sync again.
2. There are no missing updates, but the client thinks there are because it was once desynced (or for some other reason I’m not aware of), so it does not return an empty update.
3. So we resend the missing updates, which are actually not missing anymore.
4. We store these missing updates once more in the database.

This repeats on every resync, making the database grow endlessly.

I found a workaround, but it’s not ideal:

- I’m comparing snapshots to check if the documents are the same.
- Sometimes the snapshot comparison says “yes, they are the same”, even though diffing returns a non-empty update. That’s great, because this is a reliable way for me to check whether the server needs an update from the client.
- But sometimes the snapshot comparison says “no, they are not the same”, even though we applied the missing updates to make sure they are the same.
- So I had to add another object comparison check to verify by hand whether the objects are the same, to prevent endless unnecessary updates to the database.

This last point is particularly dangerous, because you can have equal objects with completely different update histories. But because snapshots also fail sometimes, I didn’t find any other way to solve this.

I’m ready to help debug or work on this.

The best solution would be for update diffing to only return the necessary parts, whatever the state of the document. Is that feasible?

I would also be fine with a snapshot comparison that works 100% of the time. When it failed for documents that were supposed to be the same, the two snapshots had very tiny differences, like one more deletion and one more entry in the state vector of one of the snapshots.

dmonad · September 10, 2024, 1:43pm

Indeed, pending updates happen when there are sync errors (or peer-to-peer, or any other reason, but it can happen). In my opinion, that’s why CRDTs exist: to fix sync issues, and resync different documents in different states.

If you assume that your network is unreliable, you need a way to detect if updates are missing before they get lost forever. CRDTs can’t detect if an update is missing (although if you have “pending” structs, you can be sure that updates are missing).

Hence CRDTs require a reliable network protocol. Updates must not get lost.

In a peer-to-peer network (even if reliable), a pending update can exist while we are waiting for a message from a third client (update B from client 2 depends on update A, but we are waiting for client 1 to send us update A). This can’t happen in a client-server network.

I highly recommend making your network reliable (this may require an additional protocol on-top of your database / network stack).

NataliaMolchanova · October 31, 2024, 8:40am

Could someone explain why diff is not empty in this case?

const diff = Y.encodeStateAsUpdate(doc, Y.encodeStateVector(doc))

Logically it feels as it has to be empty, because all updates already exist in this document

Any information is extremely appreciated

dmonad · November 22, 2024, 5:29pm

@NataliaMolchanova

This is gonna be a complex answer…

Deletions are stored separately from insertions / “structs”.

In order to insert a character (or any other content), you create a new “struct”. Structs make up the structure of the Yjs document / Yjs data types. They have a Lamport timestamp that allows us to uniquely identify them and find them (e.g. using relative positions).

In order to delete a struct, you add its Lamport timestamp to the list of deletions. Deletions don’t have a Lamport timestamp of their own. Hence they are not addressable.

Both structs and deletions are highly compressed. We can use a state vector (a list of Lamport timestamps) to find missing structs. But deletions are not addressable, hence you always have to sync the full set of deletions.

The set of deletions is usually very small compared to the whole document when encoded. Whether to make deletions addressable is a tradeoff (Yjs is the only CRDT that I know of that doesn’t make them addressable). This approach allows us to compress the encoded document much more efficiently than CRDTs that have addressable deletions. The cost is that we always have to sync the full set of deletions, even when there is no content to sync.