Dynamic chunking using old snapshot

abhinavdayal · April 27, 2023, 10:29am

I am using the slate.js editor, which uses Yjs for syncing updates. I need to chunk the document dynamically into chunks of, say, 100 words, probably with a little overlap, in order to run some downstream NLP tasks. I would like to keep the chunks as consistent as possible, so I am trying a divide-and-conquer strategy: rather than rechunking the entire document for every small change, I would grow or shrink the affected chunks, which leads to splits and merges.
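To make the chunking idea concrete, here is a minimal sketch of splitting plain text into word-based chunks with overlap while recording character offsets. The function name `chunkByWords` and its parameters are made up for illustration, and treating any run of non-whitespace as a word is an assumption.

```javascript
// Split text into word-based chunks and record character offsets.
// chunkSize and overlap are counted in words; the defaults are illustrative.
function chunkByWords(text, chunkSize = 100, overlap = 10) {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  // Collect the character span of every word (assumed: a run of non-whitespace).
  const words = [];
  const wordPattern = /\S+/g;
  let match;
  while ((match = wordPattern.exec(text)) !== null) {
    words.push({ start: match.index, end: match.index + match[0].length });
  }

  const chunks = [];
  const step = chunkSize - overlap; // how many words each chunk advances by
  for (let first = 0; first < words.length; first += step) {
    const last = Math.min(first + chunkSize, words.length) - 1;
    const start = words[first].start;
    const end = words[last].end; // exclusive character offset
    chunks.push({ start, end, text: text.slice(start, end) });
    if (last === words.length - 1) break; // the final word is covered
  }
  return chunks;
}

// Small demonstration with 3-word chunks that overlap by 1 word.
const sample = "one two three four five six seven";
console.log(chunkByWords(sample, 3, 1).map((c) => [c.start, c.end, c.text]));
```

Storing `start` and `end` per chunk is what would later let an edit be mapped back to the chunks it touches.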

I store the character indexes of the start and end positions of each chunk. I am storing the previous state of the document as

const docSnapshot = Y.encodeStateVector(yDoc)

This is then saved into the database.
The slate.js bindings allow me to convert the document to slate nodes, from which I get the full text of the document. This text is cleaned and split into chunks for the given state.
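For reference, a minimal sketch of the text-extraction step. It assumes the usual Slate shape where leaves look like `{ text: "..." }` and elements carry a `children` array; the helper names `nodeToText` and `documentToText` are hypothetical (Slate's own `Node.string` helper does a similar concatenation for a single node).

```javascript
// Recursively collect the text of a Slate-like node.
// Assumed shape: leaves are { text: "..." }, elements are { children: [...] }.
function nodeToText(node) {
  if (typeof node.text === "string") return node.text;
  return (node.children || []).map(nodeToText).join("");
}

// Join top-level blocks with a newline so paragraph boundaries survive.
function documentToText(nodes) {
  return nodes.map(nodeToText).join("\n");
}

const nodes = [
  { type: "paragraph", children: [{ text: "Hello " }, { text: "world", bold: true }] },
  { type: "paragraph", children: [{ text: "Bye" }] },
];
console.log(JSON.stringify(documentToText(nodes)));
```

One caveat with this approach: if the cleaning step changes the length of the text (for example by collapsing whitespace), the stored character indexes no longer line up with positions in the editor, so either the uncleaned text is chunked or an offset mapping has to be kept.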

Now, when the document changes, I want to locate the positions of the changes relative to the older snapshot, so that I can determine which chunks were affected.
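What I have in mind for this step is roughly the following sketch. It assumes the change is available as a delta, i.e. the array of `retain` / `insert` / `delete` operations that a Y.Text observer reports, and that its offsets are already plain-text character offsets. With slate-yjs the shared type is nested, so that translation is an open assumption and is not shown. The names `changedRanges` and `affectedChunks` are hypothetical.

```javascript
// Convert a delta (retain / insert / delete ops) into ranges expressed in
// OLD-document character offsets. An insert is a zero-length range at its
// position. Formatting-only retains are treated as unchanged text.
function changedRanges(delta) {
  const ranges = [];
  let pos = 0; // cursor in the old document
  for (const op of delta) {
    if (op.retain !== undefined) {
      pos += op.retain;
    } else if (op.delete !== undefined) {
      ranges.push({ start: pos, end: pos + op.delete, lengthChange: -op.delete });
      pos += op.delete;
    } else if (op.insert !== undefined) {
      // Non-string inserts (embeds) are counted as one position.
      const length = typeof op.insert === "string" ? op.insert.length : 1;
      ranges.push({ start: pos, end: pos, lengthChange: length });
    }
  }
  return ranges;
}

// Indexes of chunks ({ start, end }, end exclusive) touched by any range.
// Touching a boundary counts, so an insert at a chunk's end grows that chunk.
function affectedChunks(chunks, ranges) {
  const hit = new Set();
  chunks.forEach((chunk, i) => {
    for (const r of ranges) {
      if (r.start <= chunk.end && r.end >= chunk.start) hit.add(i);
    }
  });
  return [...hit];
}

const storedChunks = [
  { start: 0, end: 13 },
  { start: 8, end: 23 },
  { start: 19, end: 33 },
];
const ranges = changedRanges([{ retain: 15 }, { insert: "new " }]);
console.log(ranges, affectedChunks(storedChunks, ranges));
```

Only the chunks returned by `affectedChunks` would need to grow, shrink, split, or merge; chunks that start after a change would just have their offsets shifted by the accumulated `lengthChange`.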

1. Get the old state:
const oldStateVector = stringToUint8Array(sourceSnapshot.yJsState)
2. Construct the slate.js editor from the old state:
??? Not sure how to do this
3. Get the diff from the old state to the current one:
const diff2 = Y.encodeStateAsUpdate(yDoc, oldStateVector)
4. Somehow use this diff with the slate-yjs bindings to identify exactly which locations in the document changed, and how:
??? Could not find a way to do this.

Can anyone help me with steps 2 and 4? I could not find appropriate documentation.