It seems like right now YJS forces client ids to be uint32. I suppose these ids are also part of the encoded document updates. I wonder how one would avoid collisions if there’s no single authority that assigns client ids?
I would like to use hash of the public part of an asymmetric key pair as a client id, or something similar that can be generated without coordination.
I’m probably missing something, and would appreciate if someone could bring some light into this.
Yjs specifies variable-length unsigned integers as the source for random client-ids. This allows for arbitrary numbers (up to 53 bit currently) to represent client-ids. But you are right, we currently only generate 32 random bits. This has historical reasons because I want to keep compatibility to the 32 bit approach. Soon, I plan to increase that to 53 bit uints once the upgraded library has been rolled out completely.
BigInts might at some point allow us to use arbitrary precision for client-ids. I’m not sure about the performance overhead of using them, but they will allow us to increase entropy even further by using larger numbers.
With 32 bits, the chance of two client-ids colliding is one in 4 billion. If, for any chance, two users get assigned the same client-id, there are mechanisms in place to reassign them once this gets detected. It is generally not a problem if there are two clients with the same id. They just must not create operations concurrently.
Database systems often like to use UUIDv4 which allow 122 bits of entropy. But keep in mind that they assign a random GUID to every single item. Yjs only generates random ids for clients, so we don’t need as much entropy.
Just in case you are developing something like a huge database with Yjs, you can increase that the precision to 64 bit, or even 128 bit. That will be possible with BigInts.
@dmonad I think I found the place where client IDs gets reassigned in Yjs, so forget about my previous question
I’d like to understand though how much overhead (at least on the wire) there would be if client IDs were arbitrary strings for example. I was trying to find more details about the document update binary encoding layout, but I couldn’t find any. So I was trying to figure it out from code reading thought the v2 format. Maybe asking you would be faster .
The application I’m working on doesn’t need real-time collaboration (at least for now), but we really need to make sure there’re no collisions of client IDs because published document updates would be cryptographically signed and immutable.
Now that I’m thinking about it, it’s probably close to what y-dat would do.
In the video, we shortly go through the encoding approach. But I really didn’t describe it in detail. It is rather complicated and probably deserves a dedicated article. I developed a dedicated library for encoding/decoding of document updates that contains different RLE encoders and an opinionated approach to work with data (https://github.com/dmonad/lib0/ - encoding.js/decoding.js).
Yeah, in any case, we need to ensure this - even in Yjs. 53 bit should be more than enough, especially with a relatively low number of collaborators.
Thanks for the detailed answer! I’ve watched the video and all the documentation about the internals. Really helpful!
Your reasoning about using numbers for client ids is crystal clear! I guess some of the performance concerns could be solved as soon as Yrs is in place.
This client ID limitation is possibly the only thing why I can’t use Yjs at the moment. But maybe it’s just not a good fit for us. We don’t need real-time collaboration, we mostly just want to store all the versions in a compact form and have a CRDT merging semantics with automatic conflict resolution. We do really need larger client ids, because everything is going to be immutable and cryptographically signed.
To be honest, I don’t understand the argument that you can’t use numbers because you need to cryptographically sign … document updates?
With BigInts, you’ll have arbitrary precision for numbers. So you could even use a precision of 1024 bit or 4096. I guess you want your client-id to be some kind of certificate, not just a GUID. If you convert your certificate to a Uint8Array, you can use that as a source for the BigInt client-id.
Seph is mapping from string to a number to improve the hashing and the referencing process. He built this concept as a convenient method to reference client-ids (which are strings in his case). In Rust you don’t want to manage pointers, so you often come up with tricks like this. In Yjs I would use references to locations in memory (with BigInt, a client-id would be a reference/pointer to a location in memory). In Yrs, I would do something similar as Seph.
I don’t know about your application. But I don’t think that you ever want to map user identifiers (which are strings/certificates in your case) to client-ids. A couple of reasons:
A user may connect to the same document several times. A user also sometimes makes concurrent changes (e.g. if they point two browsers to the same document and then make changes)
Although I call them client-ids, they are really session identifiers. They are not designed to be used for anything else.
If you want to preserve the editing history and associate it to a user, you should use PermanentUserData instead (it helps you to map users to specific edits on the document).
You should not reuse client-ids. You should start with a fresh client-id for every session. Reusing client-ids is error-prone and may lead to unintended side-effects if not done correctly: https://docs.yjs.dev/api/faq
Deletes are not associated to client-ids (which may be counterintuitive for you).
The only minuscule advantage of reusing client-ids is that you might decrease the size of the Yjs document by a bit. But by using strings as client-ids, you will actually increase the size of the document.
By using strings instead of numbers, you will decrease performance quite heavily. Lookups should be reduced as much as possible. Especially in Yjs, the process of looking up two adjacent structs by number is ~90% of the work when applying document updates. In these scenarios, memory-locality is very important.
My recommendation is just to ignore the concept of client-ids in Yjs - it should be kept internal to Yjs. If you still want to, you can map from certificate to the Set of all owned client-ids to support the use-case of having several concurrent Yjs-sessions with a single user.
If you select ids randomly from 32 bit space, you get 50% change for collision after only 2^16 ids because of birthday paradox. Once you have total of 65000 ids it’s pretty sure that you have so least one collision.
Also, comparing strings is well defined if you use NFC normalized UTF-8. I think that’s well supported in browsers including built-in normalization methods.
@mtrantalainen, It is very unlikely that you will ever have 65000 collaborators that concurrently change the document. As I said before, it is fine to reuse client-ids as long as no two clients with the same client-id perform a change concurrently. Duplicate client-ids will be detected on the initial connection anyway (if the awareness module is used).
The only scenario when a duplicate client-id leads to issues is when two users with the same client-id edit the document before syncing with the other peers. Once they are synced, the duplicate id will be detected.
Yjs is a framework for building collaborative applications. It was not meant as a database backend that is manipulated by tens-of-thousands of users concurrently. That is why I chose the 32bit approach to optimize for document-size.
But I kept this use-case in mind and I left room for optimization. client-ids are variable-length fields that I interpret as integers. You can put anything into a variable-length field (even strings if that would be more convenient).
We might agree on a fixed alphabet (i.e. without surrogates) and only perform lexicographic comparisons. But I don’t see how it is helpful to me to use strings instead of numbers. That aside, strings are not an ideal data structure for comparisons. I didn’t say that it wouldn’t work. Numbers are just a more appropriate choice for this use-case.
I said before that you will be able to increase the size of client-ids indefinitely with BigInts (even to 1024 characters).
OK, I had misunderstood that the client-id must be unique per document over its whole history. Is it still possible to figure out who wrote what in a document with a long history if client-ids get reused by different users over time? I’m thinking about use case where a document is often opened in edit state by different users but not actually modifed that much. I could see situations where more than 65000 different users have visited the edit view over time but not simultanously.
The reason I mentioned strings in NFC format was that if you used UUID v1 as the client-id there would never be any collisions. UUID v1 as the identifier would only use subset of ASCII compatible characters and for that part the string handling should be stable on all possible environments as long you interpret all strings as UTF-8. Of course, if you can use 128 bit ints as client id, UUID is an 128 bit int. And yes, I agree that comparing ints instead of strings is much better computationally.
Associating edits to users is handled by PermanentUserData. The approach is not yet stable and is not recommended for production. It associates users with client-ids. A better approach associates ranges of edits to users instead. But I’m not too worried about the current approach: The probability of reusing client-ids is still very low because we reassign client-ids when we detect that the generated client-id is (or was) already used. We always start with a fresh client-id.
Yes, we were thinking about signing delta document updates, and make them immutable.
This would totally satisfy the use case. I think I didn’t make myself clear that my argument was about making client IDs arbitrarily large, not necessarily strings. A byte/uint8 array definitely solves the problem.
This really helped the understanding. I was not taking into account users with multiple tabs. Although a user rarely would make a concurrent change in both tabs at the same time, right?
If we get a new client ID every time the same user will open a document, then the chance of collisions with int32 is even greater right?
Conflicts in general happen rarely. But it is indeed possible that a single user performs concurrent changes. For example because of a timeout that triggers concurrently in two windows. Or because one of the browser windows freezes and reacts slowly to user-input. This problem can’t be ruled out.
As I mentioned above, it is fine to reuse client-ids. Collisions are only dangerous if two clients perform changes with the same client-id before syncing with the network.