Splitting emojis in YText can lead to a broken document

YText.applyDelta can split a UTF-16 surrogate pair and leave a lone surrogate in the stored string.
Such a string cannot be encoded with encodeURIComponent, so the document can no longer be encoded for storage or transfer.

Code example:

const yjs = require('yjs')

const doc = new yjs.Doc()
const yText = doc.getText()

yText.applyDelta([{insert: '😀'}])
yText.applyDelta([{delete: 1}])

yjs.encodeStateAsUpdate(doc)

Throws: URIError: URI malformed in encodeURIComponent
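
The same failure can be reproduced without Yjs. The sketch below relies only on standard JavaScript string behavior; canEncode is a hypothetical helper introduced for this illustration.

```javascript
// '😀' is U+1F600, stored in a JavaScript string as two UTF-16 code units.
const emoji = '😀'
console.log(emoji.length) // 2

// Dropping the first code unit leaves a lone low surrogate (0xDE00).
const broken = emoji.slice(1)

// Hypothetical helper: reports whether a string survives encodeURIComponent.
function canEncode (str) {
  try {
    encodeURIComponent(str)
    return true
  } catch (err) {
    if (err.name === 'URIError') return false
    throw err
  }
}

console.log(canEncode(emoji)) // true
console.log(canEncode(broken)) // false
```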

Possible solutions might be:

  • Client code finds and deletes broken characters on every delete. This should probably be stated in the Yjs documentation.
  • Replace encodeURIComponent with something that encodes every byte regardless of its value. Splitting an emoji is rare, but if it happens
    the document will still work and the user will just see a replacement character (�) where the broken character is.

Currently, Yjs relies on the assumption that users don’t split surrogate pairs, which is not great…

Option 1 seems pretty expensive to implement. Also it might lead to inconsistencies when a user is restoring an old document state. Item.content should not be manipulated, that would really increase complexity.

I like option 2. Are you aware of alternatives? I want to test if the same issue happens when using TextEncoder or when using the V2 update format.
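
As a quick check of how the standard TextEncoder treats a lone surrogate, here is a minimal sketch. It only exercises the platform API and says nothing about what Yjs does internally.

```javascript
const encoder = new TextEncoder()
const decoder = new TextDecoder()

// Lone low surrogate left over after removing the first half of '😀'.
const lone = '😀'.slice(1)

// TextEncoder does not throw; it encodes the lone surrogate as U+FFFD.
const bytes = encoder.encode(lone)
console.log(Array.from(bytes)) // [239, 191, 189], the UTF-8 bytes of U+FFFD

// Decoding gives the replacement character, not the original code unit.
const roundTripped = decoder.decode(bytes)
console.log(roundTripped === '\uFFFD') // true
```

So the encoding step no longer fails, but the broken half of the pair is lost and shows up as �.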

Another option would be to throw an error when the user does something like that. Or simply delete the pair although only a length of 1 was specified. In any case, the user probably did something unintended.

I am not aware of any alternative implementations of encodeURIComponent. The native implementation is probably faster anyway, and encoding speed is critical for Yjs.

I like your suggestion about deleting the remaining part of the pair if the user removes one part. Probably we can check somewhere here
https://github.com/yjs/yjs/blob/e2c9eb7f0116c8bb0772161b8915617003449f8f/src/types/YText.js#L444-L471 if a pair is split during a deletion and delete the remaining part, which must be either before or after the deleted substring. Do you think we can do something like that?

Exactly, this is where we need to check if we are deleting a surrogate pair.

If I understand correctly, then a surrogate pair starts with [\uD800-\uDBFF] and ends with [\uDC00-\uDFFF]. So yes, we simply need to check if the last character is in [\uD800-\uDBFF] and then increase the deletion length by one.
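
A minimal sketch of that check on a plain string. adjustDeleteLength and the two predicates are hypothetical names for illustration, not part of Yjs, and the sketch only covers a deletion that ends inside a pair (a deletion that starts inside one would need a similar check on the other side).

```javascript
const isHighSurrogate = code => code >= 0xD800 && code <= 0xDBFF
const isLowSurrogate = code => code >= 0xDC00 && code <= 0xDFFF

// Hypothetical helper: if the last deleted code unit is a high surrogate
// and the next one is its low surrogate, extend the deletion by one.
function adjustDeleteLength (str, index, length) {
  const end = index + length
  if (
    length > 0 &&
    end < str.length &&
    isHighSurrogate(str.charCodeAt(end - 1)) &&
    isLowSurrogate(str.charCodeAt(end))
  ) {
    return length + 1
  }
  return length
}

console.log(adjustDeleteLength('😀', 0, 1)) // 2
console.log(adjustDeleteLength('abc', 0, 1)) // 1
```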

I also want to log an error message when the user tries to split a surrogate pair. This should be properly documented.

In some cases (e.g. when using the delta format) this might lead to unexpected behavior. [{delete: 1}, {insert: '\uD800'}] is technically valid.

With [{delete: 1}, {retain: 5}, {insert: 'x'}] the user expects that ‘x’ will be inserted at the 5th position, but because of this new addition, we will insert it at position 4 instead. This is completely fine, but an error message will help to detect the issue.

For now, I think it is a fair limitation to expect that users must not split surrogate pairs. The alternative would be to work with byte-arrays instead and only construct text objects when necessary. Interestingly enough this is exactly what the Ywasm port currently does (still in development). So I’m happy that we will eventually fix this issue.

I can implement the change on the weekend. I opened a ticket to track the issue: https://github.com/yjs/yjs/issues/248

The issue is fixed in yjs@13.4.2.

I went with the approach of replacing invalid pairs with � characters. I’m worried that the other approach might break existing applications. Yjs shouldn’t magically adjust positions. This might break editor bindings that expect positions to stay as specified.
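
To illustrate the idea on a plain string, here is a rough sketch. replaceLoneSurrogates is a hypothetical helper and not the actual Yjs implementation; it swaps every unpaired surrogate for U+FFFD while keeping the string length unchanged, so positions do not shift.

```javascript
const REPLACEMENT = '\uFFFD'

// Hypothetical helper: keeps valid surrogate pairs, replaces lone halves.
function replaceLoneSurrogates (str) {
  let out = ''
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i)
    if (code >= 0xD800 && code <= 0xDBFF) {
      // High surrogate: valid only if a low surrogate follows.
      const next = str.charCodeAt(i + 1) // NaN at the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out += str[i] + str[i + 1]
        i++
      } else {
        out += REPLACEMENT
      }
    } else if (code >= 0xDC00 && code <= 0xDFFF) {
      // Low surrogate that was not consumed as part of a pair.
      out += REPLACEMENT
    } else {
      out += str[i]
    }
  }
  return out
}

const fixed = replaceLoneSurrogates('a' + '😀'.slice(1) + 'b')
console.log(fixed === 'a\uFFFDb') // true
console.log(encodeURIComponent(fixed)) // no longer throws
```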

There are some completely valid use-cases for splitting surrogates:

// deleting within a surrogate pair
ytext.insert(0, '👾👾')
ytext.delete(1, 2)

// applying the minimal diff to transform one emoji to a different one
ytext.insert(0, '👾')
ytext.delete(1, 1)
ytext.insert(1, '😱'[1]) // only replace the right surrogate
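
Why these two cases end up with valid text can be checked on plain strings, without Yjs. The variable names below are only for illustration.

```javascript
const alien = '👾' // U+1F47E, code units 0xD83D 0xDC7E
const scream = '😱' // U+1F631, code units 0xD83D 0xDE31

// Case 1: removing code units 1 and 2 of '👾👾' leaves the high surrogate
// of the first emoji next to the low surrogate of the second one.
const doubled = alien + alien
const merged = doubled.slice(0, 1) + doubled.slice(3)
console.log(merged === alien) // true

// Case 2: both emojis share the same high surrogate, so swapping only the
// low surrogate turns one into the other.
const sameHigh = alien.charCodeAt(0) === scream.charCodeAt(0)
const swapped = alien[0] + scream[1]
console.log(sameHigh, swapped === scream) // true true
```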

Nevertheless, I feel pretty confident about the current solution now. Splitting surrogates can’t be supported in collaborative environments. Concurrent changes might always end up with a dangling part of a surrogate. There is no method to prevent that. Therefore, it is fair to simply use replacement characters when a user tries to do such a thing.