Currently, no efficient tooling exists for storing blocks persistently with both fast indexing and fast distribution. For example, even an empty indexer running on Oura still takes ~300s/epoch.
We introduce a new thread-safe, memory-efficient persistent storage with indexing capabilities. As a single-file solution, it adds no extra processing overhead for download and synchronization.
There are two main problems we are trying to solve:
For example, StreamingFast has technology for storing blocks. However, they try to solve both fork resolution and data storage at the same time, spreading the chain across a large number of files with 100 blocks per file. This doesn't allow quick distribution and efficient indexing at the same time. In the solution we propose, fork resolution is handled through a multiverse, and only confirmed blocks go into the actual storage.
Moreover, the solution we propose can be used by other projects in the Cardano ecosystem. The storage can serve as a source for Oura or Carp, so instead of spending 4-5 days synchronizing with Cardano you just download one large file. If a person doesn't trust third-party backups, they can synchronize their own node, create their own backup, and reuse it afterwards.
Let’s dive into the potential implementation in a little more detail:
We can map the file into memory, so changes sync to disk easily and data access carries no extra overhead. This is also cache-efficient, since reads of consecutive data are plain memory reads. On Linux, the mmap syscall provides exactly this. As we append more data we allocate new chunks, and several records can live in the same chunk. If the storage is closed and reopened later, we create an mmap over the existing records (the data structure doesn’t depend on how the memory was allocated).
For every record we need service bytes (e.g. 8 bytes to store the offset of the next record) plus the serialized block bytes. The architecture is serialization-generic, so any format can be used (e.g. CBOR).
As long as we never modify existing records, the only things that need careful handling are the end of the file (the last mmap) and the mmap structure (in case rebalancing is needed). Thanks to the immutability of existing records, access can be made lock-free.
This tool will allow much faster access to block data, which will make indexers in Cardano more responsive, unlock use-cases that require quick re-indexing, and improve developer agility. It can then be used to unlock the next generation of Cardano products & tools that depend on these properties.
No significant risks beyond standard engineering risks (delays, budget overruns, etc.)
We are confident this project will meet the indexing needs of Milkomeda and that it's generic enough to be useful for many different use-cases, but there may be other approaches that other projects need, such as HTTP/3-based block fetching.
Q3: implement the solution, open source it and integrate it as part of Milkomeda. We expect the project to take ~2.5 months
All funds will go towards engineering effort to implement the solution
1 Rust engineer at Milkomeda
1 Project lead (shared resource between this and other Milkomeda projects)
No plans for this specific project. Depending on interest, we could also develop alternatives to cover different use-cases, such as an HTTP/3-based block-syncing solution.
Development progress on the project itself, followed by its integration into Milkomeda
Project is successfully open sourced, integrated into Milkomeda and usable by projects such as Carp that need fast re-indexing of data
Entirely new proposal, but inspired by our indexing work for Carp and Milkomeda
Members of the dcSpark & Milkomeda team have written many indexers for Cardano, including Carp, contributions to Pallas/Oura, contributions to db-sync, and custom indexers for projects like Milkomeda.