We recently faced some load issues on our sync servers (Hocuspocus, multiple instances synced via Redis) and noticed an interesting behavior: during an incident, users reported seeing “ghost cursors” from their past selves, “replaying” changes they had made some time before.
We have not been able to reproduce this, but my understanding of what happens is:
- user is connected to a sync server instance (S1) and sending awareness updates (from client id C1)
- the servers face heavy load, and there’s a lag in consuming awareness updates from other instances (S2, S3, …)
- user reconnects to a different server, say S2 (e.g. because they reloaded the app), and now has client id C2
- server S2 finally consumes the updates that originated from C1 (synced via S1) and sends them to C2
- C2 “replays” old awareness states
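To make the suspected sequence concrete, here is a minimal sketch using a simplified, hypothetical model of a clock-based awareness store. `AwarenessModel` and its methods are made up for illustration and are not the y-protocols implementation; the only assumption is that a receiver accepts an update when it has no newer clock recorded for that client id.

```typescript
// Simplified, hypothetical model of a per-client awareness store.
// This is NOT the y-protocols code, just an illustration of the scenario.
type AwarenessState = Record<string, unknown> | null;
type AwarenessUpdate = { clientId: number; clock: number; state: AwarenessState };

class AwarenessModel {
  private entries = new Map<number, { clock: number; state: AwarenessState }>();

  // Accept an update only if it is newer than what we hold for that client id.
  apply(update: AwarenessUpdate): boolean {
    const known = this.entries.get(update.clientId);
    if (known !== undefined && known.clock >= update.clock) {
      return false; // stale relative to what we already know
    }
    this.entries.set(update.clientId, { clock: update.clock, state: update.state });
    return true;
  }

  // Client ids that currently have a non-null state (i.e. a visible cursor).
  visibleClientIds(): number[] {
    const ids: number[] = [];
    this.entries.forEach((entry, clientId) => {
      if (entry.state !== null) ids.push(clientId);
    });
    return ids;
  }
}

const C1 = 1; // client id before the reload
const C2 = 2; // client id after the reload

// The reloaded client starts with an empty view and publishes its own state.
const viewOfC2 = new AwarenessModel();
viewOfC2.apply({ clientId: C2, clock: 1, state: { cursor: 10 } });

// S2 finally forwards a delayed update that originated from C1.
// C2 has never heard of C1, so there is no newer clock to compare against.
const accepted = viewOfC2.apply({ clientId: C1, clock: 7, state: { cursor: 3 } });
const visible = viewOfC2.visibleClientIds();
```

In this model the delayed update is accepted and C1 shows up next to C2, which would match the “ghost cursor” reports. The clock only orders updates from the same client id, so it cannot tell that C1 belongs to a session that no longer exists.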
We also received reports of this happening in collaborative sessions (with multiple users), so some naive app-level user id filtering wouldn’t suffice.
Given the scenario above, and without any knowledge of the awareness protocol, my first instinct would be to tag awareness updates with timestamps and have the receiving client drop any update older than a configurable grace period. However, it seems that the awareness protocol actually uses a state-based CRDT, and, at least with the current implementation, that doesn’t seem achievable at the protocol level.
Is that understanding correct? Could there be an alternative solution?
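For reference, the timestamp idea could be sketched at the app level rather than inside the protocol. This assumes each client writes a hypothetical `updatedAt` field (wall-clock milliseconds) into its own awareness state, and that the receiver filters a map of client id to state before rendering cursors. `dropStaleStates` and `gracePeriodMs` are illustrative names, and the approach depends on client clocks being roughly in sync.

```typescript
// Hypothetical shape: each client stamps its own state with a wall-clock time.
type TimestampedState = { updatedAt: number; cursor?: number };

// Keep only states whose timestamp falls within the grace period.
function dropStaleStates(
  states: Map<number, TimestampedState>,
  nowMs: number,
  gracePeriodMs: number
): Map<number, TimestampedState> {
  const fresh = new Map<number, TimestampedState>();
  states.forEach((state, clientId) => {
    if (nowMs - state.updatedAt <= gracePeriodMs) {
      fresh.set(clientId, state);
    }
  });
  return fresh;
}

// Example: client 1 is a delayed "ghost", client 2 is live.
const nowMs = 100_000;
const received = new Map<number, TimestampedState>([
  [1, { updatedAt: nowMs - 45_000, cursor: 3 }], // 45 s old
  [2, { updatedAt: nowMs - 500, cursor: 10 }],   // 0.5 s old
]);
const freshStates = dropStaleStates(received, nowMs, 10_000);
```

With a 10-second grace period, only client 2 survives the filter. One caveat of this sketch: an idle but still connected user would also be filtered out unless their client refreshes `updatedAt` periodically.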