Without real clocks, how can the order of offline changes be known?

StuStu · October 22, 2023, 10:36am

Many parts of y.js are still magic to me; slowly trying to de-mistify this by reading the docs, including the magnificent blogs on YATA by Bartosz Sypytkowski. However, I am punching a little bit outside my weight class here, please bear with me…

My current bottlenecks in my understanding:

There are no real/wall clocks involved, at all, only monotonically increased counters, per peer; correct?
Are there any other clocks involved besides this clock?
I know of the problems with real clocks (drift, etc., and know that these cannot be relied on)
If there are no real clocks, and no global clock, then I am not really understanding how a fool-proof causal order can be determined (maybe it can’t?). Scenario below.

In my offline-first application I actually don’t deal with text very much. In fact in this stage I am not even implementing collaborative text editing. I plan to use y.js mostly for keeping docs with mostly objects (y.map) in sync between peers.

Consider these chronological events:

Alice @ counter 1; { myBoolean: null → true }
Document syncs with Bob. Bob then goes offline.
Bob @ counter 1; { myBoolean: true → false }
Alice @ counter 2; { myBoolean: true → false }
Alice @ counter 3; { myBoolean: false → true }
Bob comes back online, and receives the document (myBoolean=true).

When Bob does an applyUpdate of his change (false) to the document.
How can y.js know that the result should be true and not false? As all it has are counter-clocks.

I have not tried out this scenario (just started playing with y.js, at this point not yet ready to test these kinds of scenario’s). What would the outcome actually be?

raine · October 22, 2023, 8:16pm

Each client has their own clock, which as you pointed out is basically a simple counter (technically a lamport clock). Thus an update from Alice at clock 3 is guaranteed to have occurred after an update from Alice at clock 2.

As for how to tell whether Bob or Alice’s update comes first, in the chronological sense, we don’t really know, since they have their own clocks, different latencies, different network conditions, and may or may not be online when they make the change. In this case the algorithm picks some arbitrary but deterministic comparison that does not depend on when the updates arrive to determine in which order the updates are applied, e.g. a lexicographic comparison of the hash. I’m not sure what YJS uses, but that’s kind of the point. All that matters is that the clients will eventually converge. That is, once they are both online again and see the same updates, they will eventually arrive at the same state.

StuStu · October 23, 2023, 7:45am

Thanks for your reply. To summarize,

An update from Alice at alice:clock 3 is guaranteed to have taken place after the update at alice:clock 2.
It cannot be known if the update by Bob (to false) at bob:clock 1 has taken place before, after, or in between alice:clock 2 to alice:clock 3.
When Bob comes back online, and a document sync can be done, the document converges on a deterministically determined state. This state is deterministic, but arbitrary.

Looking at the timeline of the scenario, and given that we want the last change in time to be the ultimate truth, the desired outcome would be that the final value of myBoolean is true. Just to make sure I understand, this is not guaranteed?

Out of curiosity, is this typical of YATA (Y.js) vs. other algorithms (Lsq, and RGA, …)? Are there any algorithms that can actually determine the “correct”* outcome? I suspect that the exact (eg. sub-millisecond) order is unknowable due to clock drift, network latency, etc, but a best guess order could technically be achievable; imagine that each of the steps in the timeline happens a day apart, one could dismiss network latency and clock drift.

* “correct” is, of course, in the eye of the beholder; correct in this context is last update wins.

raine · October 23, 2023, 12:53pm

Yes, correct.

Unfortunately this is a fundamental limit of decentralized systems. To achieve a global chronology, you would need a single clock and a single server.

In practice, the lamport clocks work quite well. If both clients are online, updates will generally be merged in chronological order. It’s only when there are concurrent changes that the final result is less predictable (albeit reliable).

clibu · October 26, 2023, 9:26pm

An excellent series of articles folks here may be interested in is by Jake Lazaroff, An Interactive Intro to CRDTs at An Interactive Intro to CRDTs | jakelazaroff.com

StuStu · October 27, 2023, 9:33am

Many thanks. However, I’m not convinced on this being a fundamental limit of decentralized systems, at least not always (but please correct me if it is). Being very new to this, I could be wrong, but it seems that the “bigger brother” of the Lamport Clock, the “vector clock” could (mostly/always?) determine causality between clients.

I do recognize that the vector clock has drawbacks as well (more complex, and requires more metadata to be exchanged between clients). It may have other trade-offs that I am yet still blisfully ignorant of. Because of these could be a more suitable fit for Y.js.

However, since my application is offline-first (and will have offline changes), and also does not have many participants (nodes/clients) per document (realistically 10 max, but mostly between 2-5), the vector-clock seems more suitable to my application at this point. Also it’s, at least at this point, not used for text, but mostly for arrays, maps, and scalars/primitives.

To remain on topic.

Would a vector clock actually be able to determine causal order (or is my assumption wrong) – not technically a Yjs question, I know.
Does Y.js has an option for using vector clocks? (I have the impression it has not)
Initially I thought that adding a vector-clock to Yjs would not be too complex. But I think that’s was naive. Appending the vector clock itself would be fairly simple, but the actual implemention, the “merge”, is probably not build-in into Y.js. Any thoughts of doing a vector-clock instead of a Lamport-clock in Y.js?

As a newcomer I found it very important to learn about the limitations of the implementation, in order to make the correct decisions in a project early on. When you first hear of CRDTs they may sound like they will magically solve all your sync problems, which they might, but also may not depending on your requirements. Not to say that Y.js is bad, or misleading… at all, it seems to be one of the few JS implementations of CRDT’s, and seems to be very good at what it does.

For anybody reading the thread. In addition to clibu, and the already mentions blogs by Bartosz Sypytkowski. I also found the following very educative when it comes to timekeeping challenges, Lamport and Vector clocks:

Kavya Joshi; Keeping time in Real Systems (introduction to timekeeping issues)

Martin Kleppmann; Logical Time (explains Lamport and Vector Clocks)

raine · October 27, 2023, 11:33am

I’ve never even heard of vector clocks, so you’re a few steps ahead of me! Thanks for the resources.

raine · October 27, 2023, 12:22pm

From https://cs.stackexchange.com/q/101496/163893:

version vectors can distinguish whether two operations are concurrent or one is causally dependent on the other

I think the distinction between concurrency and causal dependency is worth noting here. The Alice and Bob example you originally used demonstrates concurrency, not causality. Alice’s offline change is not causally dependent on Bob’s offline change.

The intractability of concurrency is also clarified through an example of actual concurrency. Imagine that Alice and Bob are offline and make a change to the same field in the same nanosecond. In this case (call it Example 2), it’s obvious that there is no causal dependency and thus no proper ordering of the events. Now compare Example 2 to Example 1. What is the difference? Nothing as far as the computers are concerned! The clocks are the same, and the vectors would be the same. Unless I’m missing something, no logical clock can give you global chronological ordering, even in principle.

I found this comment especially enlightening (from Quora of all places):

So Lamport Clocks are great for making sure we know when there’s a causal dependency between records. But there is a problem. Because Lamport Clocks induce a total ordering over all records, they actually imply more dependencies than truly exist. If two records are not causally related at all, and were produced completely independently by separate nodes that did not communicate, they will still have Lamport Clock timestamps which imply that one is ordered before the other, which is a false positive.

So it’s not clear how YJS would benefit from being able to detect those false positives, i.e. detect additional updates that are concurrent but not causally dependent. Vector clocks cannot capture more causality in an already totally-ordered system (to the extent that is logically feasible without a global block).

StuStu · October 27, 2023, 12:24pm

Side note on the demo featured at the beginning of the Jake Lazaroff blog post; this example ARBITRARILY (but deterministically) chooses a winner. You can see this by going into offline mode. Put a colored pixel in the top-right corner of the right client. Then put a different colored pixel in the same top-right corner of the left client. Now go back online. Even though the you put the pixel in the left-most client LAST, it will still be overwritten by the FIRST pixel event of the right client.

Though this is deterministic (the ID of the client determines the winner), it’s arbitrary, and not suitable for all applications (but might for some). The application I am currently working on, I rather have the actual value of the LAST change. As I understand it so far, Y.JS also chooses an arbitrary winner.

raine · October 27, 2023, 12:29pm

Yes, because FIRST and LAST are unknown when both clients are offline. There is not enough information in a logical clock to determine their ordering. You’re only able to distinguish first and last because you’re outside the system :).

It sounds like you want a hybrid logical + chronological clock that falls back to a global clock or the local clocks of each client when there is no causal (logical) dependency.

StuStu · October 27, 2023, 1:35pm

It took me a while to recognise that even though events happen at different “real” times, they are actually concurrent in logical time. In a logical clock, time 2 and 3 for Alice are concurrent with time 1 for Bob, even though they happened at different “real” times (I hope I stated that correctly). This is better illustrated (yet functionally the same) if Alice was offline as well for her events at time 2 and 3.

So you are right to recognise my given example is indeed actually a concurreny conflict. However, the goal is still to figure out which one came first in this concurrent conflict, as in real time they did very much not happen concurrently.

My hope that a vector clock could determine the causal order in such logical concurrency was then unfortunatly wrong. It only helps to detect “false positives” (something I will still need to dive into deeper, as I don’t fully grasp that yet). As far as I understand it, I also don’t see any benefit in detecting these false positives in my application.

A further clarified example, mostly for completeness sake

Alice is online the whole time (which is not relevant to the example. Bob goes offline, does a change (before Alice does hers), and later comes back online. If time was perfectly synchronised between the clients, and the updates all had timestamps, we could infer that Alice’s change takes precedence over Bob’s earlier change. However, if Bob was late with his change, and changed after Alice, his change should take precedence.

Alice and Bob are online.
Alice @ counter 1; { myBoolean: null → true }
Document syncs with Bob (is actually not relevant).
Bob then goes offline. Alice remains online (also not really relevant).
Bob @ counter 1; { myBoolean: true → false }
Alice @ counter 2; { myBoolean: true → false }
Alice @ counter 3; { myBoolean: false → true }
Bob comes back online, and receives the latest version of the document, this version has Alice’s updates, and thus is: { myBoolean:true}.
Bob now executes a merge between the received document, and his document. This merge must try to converge on “the truth”.

That Quora comment (yes, of all places) is actually a quite good and understandable explanation of Lamport vs. Vector clocks, and total vs. partial ordering.

And yes, a hybrid logical + chronological clock sounds like it might be a viable solution. I wonder if this could be accomplished by instructing Yjs to break a concurrency tie by using a timestamp instead of a client ID for example? (and this is then still deterministic, I recon – actually, this sounds soo much better than using a client ID, why don’t we always do this? – he said naively…).

raine · October 27, 2023, 1:50pm

Yes, you’ve got it :).

Breaking ties with a timestamp would be nice. We just need a way to detect concurrent updates without any false positives… oh wait, that’s what a vector clock does

StuStu · October 27, 2023, 1:53pm

Lol; indeed. Thanks for the wonderful discussion. I think the question is thoroughly answered, and I learned a lot. Also I now have a few more questions to figure out. But I guess that is science…

jonstuebe · January 11, 2024, 8:42pm

So, I had the same questions as @StuStu. After doing some research, I discovered Hybrid Logical Clocks (HLCs). I believe this gives you the exact feature you’re looking for. They combine the Lamport Logical Clock with a client-based timestamp. I, too, am punching a little bit above my weight class, so bear with me, but it seems that with an HLC all of these conflict problems go away. Yes, you do have clock drift, but the data I’m working with isn’t a document where individual keystrokes are happening at a great rate, potentially creating thousands of changes. I think, for apps that are not “document-based” and are instead traditional CRUD apps, HLCs should solve the offline clients issue you mentioned. I’m sure it’s not as simple as I’d like it to be, but I wonder if it’s possible to have some sort of plugin mechanism for y.js to override and customize the CRDT internals behavior so that we can swap out the underlying implementation.

micrology · January 18, 2024, 9:51pm

This is a really helpful discussion. PRSM uses Y.Maps for the nodes and edges of a collaboratively constructed network. I had users distressed because, for instance, users 1 and 2 created a network, then both went offline, user 2 made changes offline, user 1 came online, saw the network as they had left it, then user 2 came online and a lot of the network mysteriously (to the users) disappeared. The reason initially puzzled me, but eventually I realised that it was the arbitrary but deterministic CRDT algorithm that was the cause. I have now prevented users making changes offline - not what I wanted, but I can see no other way round the problem (in fact, even when users are online, there still could be unsynchronised clocks, but syncing happens frequently and if it does, the users are there to deal with it). The issue appears to be less dramatic when editing text, but I’ve experimented and it happens even with software as well used as Google Docs. Perhaps this is a fundamental flaw with the idea of offline-first?

jonstuebe · January 20, 2024, 4:56pm

yaa I’ve been continuing to do more research and finding out that simply using a Lamport Clock only gets you so far. It’s still an edge case but if you have two users going through and doing actions in a certain sequence you’ll end up with strange results (or at least not what a human would expect). I’m a little surprised that Y.js and Automerge haven’t moved to using Vector Clocks as it seems like the next best. thing compared to a “wall clock” implementation like a HLC.

Like I mentioned before, it really seems like there is a need for a plugin system to be able to hook into how the clocks are created and how the conflict system is run.

raine · February 12, 2024, 1:40pm

Here’s a short article from Jared Forsyth that shows a HLC implementation:

FYI His whole series on local-first dbs is excellent reading, and 4 years later still the best survey of the ecosystem.

MentalGear · February 15, 2024, 1:50pm

Very interesting discussion, especially from @StuStu & @raine on how to improve consistency.

This might be another piece to the puzzle.