Best ways to compress the size of a Document/Update in the DB

Hi, I am currently trying to build a collaboration backend for my TipTap-based editor but can't figure out a way to reduce the stored size. From my understanding, some overhead is normal since the whole history needs to be kept. But even without considering the history, my document currently grows very fast, which means I am probably making a beginner mistake. In my defence, it's hard to find best-practice guidance on storing this data compactly. So maybe someone can list potential approaches here so that others who stumble on this will get help.

For 10k characters the document in my DB is already 788 KB, while the JSON from ProseMirror is 15 KB; for 100k characters it's already 3.54 MB stored in the DB.

My current code is the following:

import mongoose from "mongoose";
import { Hocuspocus } from "@hocuspocus/server";
import { Doc } from "yjs";
import { encodeStateAsUpdate, applyUpdate } from "yjs";

const changeSchema = new mongoose.Schema({
  documentName: String,
  version: Number,
  snapshot: [Number], // full document state, stored as a plain array of numbers
  updates: [[Number]], // each stored update, also as an array of numbers
});

// Encode the whole Yjs document as a single update and convert it to a plain number array
function serializeYDoc(doc) {
  return Array.from(encodeStateAsUpdate(doc));
}

// Rebuild a Yjs document by applying each stored update in order
function deserializeToYDoc(updates) {
  const doc = new Doc();
  for (let update of updates) {
    applyUpdate(doc, new Uint8Array(update));
  }
  return doc;
}

const ChangeModel = mongoose.model("Change", changeSchema);

// Connect to MongoDB
mongoose
  .connect(
    "mongouri",
    {
      useNewUrlParser: true,
      useUnifiedTopology: true,
    }
  )
  .then(() => console.log("Connected to MongoDB."))
  .catch((error) => console.error("Error connecting to MongoDB:", error));

function createInitialDocTemplate() {
  return new Doc();
}


async function saveToDatabase(document, documentName) {
  console.log("Save to Database called");
  try {
    const update = serializeYDoc(document);
    const existingDocument = await ChangeModel.findOne({ documentName });

    if (existingDocument) {
      // Append the latest full-state update
      existingDocument.updates.push(update);

      // Every 10 updates, replace the snapshot and reset the update list
      if (existingDocument.updates.length >= 10) {
        existingDocument.snapshot = serializeYDoc(document);
        existingDocument.updates = []; // Reset the updates since we have a new snapshot
      }

      existingDocument.version += 1;
      await existingDocument.save();
    } else {

      const newDocument = new ChangeModel({
        documentName,
        snapshot: update,
        updates: [],
        version: 1,
      });
      await newDocument.save();
    }
  } catch (error) {
    console.error("Error saving document:", error);
  }
}

async function loadFromDatabase(documentName) {
  try {
    const existingDocument = await ChangeModel.findOne({ documentName });

    if (existingDocument) {
      const doc = new Doc();

      // Apply the stored snapshot first, then each stored update on top
      applyUpdate(doc, new Uint8Array(existingDocument.snapshot));

      for (let update of existingDocument.updates) {
        applyUpdate(doc, new Uint8Array(update));
      }

      return doc;
    }

    return createInitialDocTemplate();
  } catch (error) {
    console.error("Error loading document:", error);
    return null;
  }
}


const server = new Hocuspocus({
  port: 1234,
});

server.configure({
  async onStoreDocument(data) {
    await saveToDatabase(data.document, data.documentName);
  },

  async onLoadDocument(data) {
    const document = await loadFromDatabase(data.documentName);
    return document;
  },
});

server.listen();

Thank you

Ideally you are only supposed to store the updates and then construct the Yjs document like you are doing. No need to store snapshots, as far as I know.

The way I have achieved it is by creating a cache class:

export class UpdateCache {
  private _updates: Uint8Array[] = [];

  constructor() {}

  push(update: Uint8Array) {
    this._updates.push(update);
  }

  clear() {
    this._updates = [];
  }

  get updates() {
    return this._updates;
  }
}

And then in the onChange hook for Hocuspocus, push these updates:

onChange: async ({ update }: onChangePayload) => {
  // Skip empty updates (an update with no changes encodes to 2 bytes) and cache the rest
  if (update.length <= 2) return true;
  updateCache.push(update);

  return true;
},
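Here updateCache is just an instance of the class above. With several documents on one server you would probably want one cache per documentName; something like this Map-based helper is what I have in mind (my own assumption, not tested against the code above):

// Assumed helper: keep one UpdateCache per document, keyed by its name.
const caches = new Map();

function getCache(documentName) {
  let cache = caches.get(documentName);
  if (!cache) {
    cache = new UpdateCache();
    caches.set(documentName, cache);
  }
  return cache;
}

// e.g. in the hooks: getCache(documentName).push(update);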

and in onStoreDocument:

import { mergeUpdates } from "yjs";

// Merge the cached incremental updates into one compact update
const megaUpdate = mergeUpdates(updateCache.updates);

// Save the merged update in the DB

// Then clear the cache for a new set of updates.
updateCache.clear();

The stored merged updates are much smaller compared to snapshots.
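For what it's worth, here is a rough sketch of how that store step could look with Mongoose, keeping the merged update as a single Buffer rather than arrays of numbers (the schema and model names are placeholders I made up):

import mongoose from "mongoose";
import { mergeUpdates } from "yjs";

// Hypothetical schema: one row per stored (merged) update.
const updateSchema = new mongoose.Schema({
  documentName: String,
  data: Buffer,
});
const UpdateModel = mongoose.model("YUpdate", updateSchema);

async function storeMergedUpdate(documentName, updateCache) {
  if (updateCache.updates.length === 0) return;

  // Merge all cached incremental updates into one compact update.
  const megaUpdate = mergeUpdates(updateCache.updates);

  // Store the raw bytes as a BSON Buffer instead of a [[Number]] array.
  await UpdateModel.create({ documentName, data: Buffer.from(megaUpdate) });

  // Start a fresh batch for the next onStoreDocument.
  updateCache.clear();
}

On load you would then apply every stored row for that documentName in insertion order, or merge them again with mergeUpdates before applying.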

Unfortunately no, same problem.

I have absolutely no idea if this is best practice (I actually fear it is the worst you can do), but it works like this:

import mongoose from "mongoose";
import { Hocuspocus } from "@hocuspocus/server";
import { Doc } from "yjs";
import { encodeStateAsUpdate, applyUpdate, mergeUpdates } from "yjs";

const changeSchema = new mongoose.Schema({
  documentName: String,
  version: Number,
  snapshot: Buffer, // Most recent snapshot of the document
});

function serializeYDoc(doc) {
  return Buffer.from(encodeStateAsUpdate(doc));
}

function deserializeToYDoc(snapshot, updates) {
  const doc = new Doc();
  // Node Buffers are already Uint8Array views, so copy the bytes directly.
  // Using snapshot.buffer would hand Yjs the whole underlying ArrayBuffer,
  // which can contain unrelated pooled data.
  applyUpdate(doc, new Uint8Array(snapshot));

  for (let update of updates) {
    applyUpdate(doc, new Uint8Array(update));
  }
  return doc;
}

const ChangeModel = mongoose.model("Wusel", changeSchema);

// Connect to MongoDB
mongoose
  .connect(
    "uri",
    {
      useNewUrlParser: true,
      useUnifiedTopology: true,
    }
  )
  .then(() => console.log("Connected to MongoDB."))
  .catch((error) => console.error("Error connecting to MongoDB:", error));

function createInitialDocTemplate() {
  return new Doc();
}

async function saveToDatabase(document, documentName) {
  console.log("Save to Database called");
  try {
    const update = serializeYDoc(document);
    const existingDocument = await ChangeModel.findOne({ documentName });

    if (existingDocument) {
      // Replace the snapshot
      existingDocument.snapshot = update;
      existingDocument.version += 1;
      await existingDocument.save();
    } else {
      const newDocument = new ChangeModel({
        documentName,
        snapshot: update,
        version: 1,
      });
      await newDocument.save();
    }
  } catch (error) {
    console.error("Error saving document:", error);
  }
}

async function loadFromDatabase(documentName) {
  try {
    const existingDocument = await ChangeModel.findOne({ documentName });

    if (existingDocument) {
      const doc = new Doc();
      applyUpdate(doc, new Uint8Array(existingDocument.snapshot));
      return doc;
    }

    return createInitialDocTemplate();
  } catch (error) {
    console.error("Error loading document:", error);
    return null;
  }
}


const server = new Hocuspocus({
  port: 1234,
});

server.configure({
  async onStoreDocument(data) {
    await saveToDatabase(data.document, data.documentName);
  },

  async onLoadDocument(data) {
    const document = await loadFromDatabase(data.documentName);
    return document;
  },
});

server.listen();

But apparently the binary format is so optimized that the overhead is minimal. I have now edited the document up to 1 million characters a couple of times, changed and deleted content, and it is never more than 1 MB, and if I delete everything it goes back to 54 KB.
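If you want to double-check where the bytes actually go, you can log the encoded size before it ever reaches Mongo (just a quick sanity check, not part of the server code):

import { encodeStateAsUpdate } from "yjs";

// Log how large the encoded Yjs state is, in bytes, before storing it.
function logEncodedSize(doc) {
  const update = encodeStateAsUpdate(doc);
  console.log(`Encoded Yjs state: ${update.byteLength} bytes`);
}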

But honestly, I have no idea why.

@lupaci Are you sure that existingDocument.updates = [] actually deletes the updates from MongoDB? I took a look at the Mongoose documentation, and I’m not convinced that that syntax triggers the ORM’s change tracking. You might need something like existingDocument.markModified('updates').

Mongoose has a known issue with setting array indexes directly. For example, if you set doc.tags[0] , Mongoose change tracking won’t pick up that change.

To work around this caveat, you need to inform Mongoose’s change tracking of the change, either using the markModified() method or by explicitly calling MongooseArray#set() on the array element as shown below.
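So something along these lines should make sure the cleared array is actually persisted (untested sketch, reusing the fields from the first post):

// Tell Mongoose explicitly that the array was modified before saving.
existingDocument.updates = [];
existingDocument.markModified("updates");
await existingDocument.save();

// Or, when replacing a single element, use MongooseArray#set() so
// change tracking picks it up:
// existingDocument.updates.set(0, newUpdate);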