UndoManager with Grouped Text

Hi, I am currently developing a new Text Editor, and at the moment I am finding an interesting situation with undo/redo entries. In order to understand the challenge, I will explain first how the editor is built, generally.

The editor has a list of deltas, where a delta is a list of insert operations that take the form:

{ insert: string | { block: BlockType }, glyphs: Array<string>, attributes: { key: value, ... } }

The glyphs parse the insert string and organize each position into a matching position in the array of strings. This is done, as some values are greater than length 1 in actuality, but visually look like a single character, such as an emoji. For example,

string: ‘Hello​:family_man_man_girl_girl:World’, glyphs: [ ‘H’, ‘e’, ‘l’, ‘l’, ‘o’, ‘:family_man_man_girl_girl:’, ‘W’, ‘o’, ‘r’, ‘l’, ‘d’ ],

This allows for the cursor to move around with simple mathematics, rather than constantly checking the length of a glyph within the character string.

That said all works well on insert, retain, and delete operations. The issue however arises when I utilize the UndoManager with YText, where it returns the undo/redo statement with a length that is equal to the glyph and not 1. In the above example, the emoji would have length 11.

I am currently using the ytext.applyDelta function to insert text into Yjs.

Does anyone have any suggestions to solve this issue?

Thank you!

Storing each glyph separately in an array seems like quite an overhead. This seriously might cripple the garbage-collector as soon as you insert a million characters. Also this at least quaduples your memory-requirements.

Yjs (rather Y.Text) internally manages text-chunks that are utf16 encoded (which is the default encoding of JavaScript). Emojis are usually composed of surrogate pairs (look it up) meaning they consist of two 16-bit characters. It would be quite an overhead to split-up all strings into glyphs. There are no proper data structures in JavaScript to handle this behavior. So rather I just use the default-behavior and address positions in utf16-offsets. While this behavior is quite misleading for newcomers, it is also a very well understood problem.

Can you please describe this further? What is a undo/redo statement and how does it have a length? Also I don’t understand how you measure that the emoji has a length of 11. I assume you mean that it has a length of two?

Thank you for the reply.

I don’t store the array of glyphs in Yjs, I use the same delta format as Quill so I can take advantage of all the YText logic. The glyphs are locally cached in order to switch the positioning of 2 or 11 into a space of 1, for easier processing. But that said, how I calculate 11 as a length is by using the length property of a string. Type ‘:family_man_man_girl_girl:’.length and you will get a value of 11. This is because some emojis are composites of other emojis, like in my example.

Why I need that calculation, is that when I receive the deltas from Yjs, I process them on the fly, constantly connecting them or splitting them apart so that they can be processed and passed to the logic of the editor. Actually Yjs has zero knowledge of any of this, and I completely leave your logic as intended.

I could however be making a mistake on how I calculate the length of strings, and should use an alternative method. I am completely open to suggestions.

I will work on this, I was looking for a better understanding of how YText may already accommodate this. I found that using a length calculation like so […‘:family_man_man_girl_girl:’].length actually returns the correct length calculation of 7, rather than 11, which comes from ‘:family_man_man_girl_girl:’.length. I will see if my algorithm supports this change and fixes the issue I was investigating.

Personally I would love to drop the glyphs solution. But I was after testing my cursor algorithm, so it was a good fix for the time. Now I am optimizing and reworking.

If this question is about utf16 I’m not sure how this is related to the Undo Manager :thinking:

If ‘:family_man_man_girl_girl:’ has a length of 11, then it is a composed glyph. You probably want to look for a good algorithm for splitting up strings into grapheme clusters. There are a lot of tutorials about utf16 tutorials that explain how you could accomplish that.

As said, Yjs doesn’t accommodate for this and simply uses utf16 encoding for improved performance.

Most programming languages have different packages to look at strings differently. I think the Rust documentation explained the need for different “string” packages well. Defining length by the number of graphemes (printable characters) incurs different tradeoffs. For example, you can’t simply start reading from a certain offset because each grapheme may have a different byte-length. You always need to consume the complete array to determine the size of a string.

Unless you find a good & performant algorithm to split up strings into grapheme clusters, you shoulds probably stay with good ol’ JavaScript strings.

I used this library grapheme-splitter to handle the grapheme split. I think with the new numbers, I can have the string be recalculated and actually fit the YText results. Basically, our offset values in retain and delete were mismatched because of the 2x chars for emoji characters…etc… I haven’t spent the time you have on strings, as your library demands an incredible attention of detail to the topic. I am enjoying the exploration, and the demand for growth in this area for myself as I build out this editor using your amazing library. If you like, you can delete this thread, or I can reply with my findings and help another person along the way.

I’m sure someone will profit from this thread, so I’d like to keep it. Building collaborative applications is what this forum is about. So this is definitely on-topic.

1 Like

I found a solution to my issue that overall reduces all complexity. I am sharing it to help anyone else who reads this discussion.

I had removed the entire glyphs concept, and eliminated all the extra overhead in that area. Rather than try and map lengths of cursors to single sequential positions by reducing the text to a glyph, I have now moved to the reverse relationship. Cursor positions now have knowledge of how many characters it takes to get to that cursor from its previous character sibling. This allows me to delete that length or calculate segments of lengths as desired, thus giving more overall flexibility, and reducing overhead.

My solution now allows the mathematics to sync with YText, and have a different coordinate system for various calculations I am running outside of the YText library.

I hope this helps anyone else, and I am happy to hear any feedback and suggestions for a better solution. Thank you!

Great, thanks for sharing!

1 Like