Currently, no efficient tooling exists for storing blocks persistently with both fast indexing and fast distribution. For example, even an empty indexer running on Oura still takes ~300s/epoch.
We introduce a new thread-safe, memory-efficient persistent storage with indexing capabilities. As a single-file solution, it adds no extra processing overhead for download and synchronization.
There are two main problems we are trying to solve:
For example, StreamingFast has technology for storing blocks. However, they try to solve both fork resolution and data storage at the same time, spreading the chain across a large number of files with 100 blocks per file. This doesn't allow quick distribution and efficient indexing at the same time. In the solution we propose, fork resolution is handled through a multiverse, and only confirmed blocks go into the actual storage.
Moreover, the solution we propose can be used by other projects in the Cardano ecosystem. The storage can serve as a source for Oura or Carp, so instead of spending 4-5 days synchronizing with Cardano you just download one large file. If a person doesn't trust third-party backups, they can synchronize their own node, create their own backup, and reuse it afterwards.
Let’s dive into the potential implementation in a little more detail:
We can map the file into memory, so changes sync to disk easily and data access carries no extra overhead. This is also cache-efficient, since reads of consecutive data are plain memory reads. On Linux, the mmap syscall provides exactly this. As we append more data we allocate new chunks, and several records can live in the same chunk. If the storage is closed and reopened later, we create an mmap over the existing records (the data structure doesn’t depend on how the memory was allocated).
For every record we need service bytes (e.g. 8 bytes to store the offset of the next record) plus the serialized block bytes. The architecture is serialization-generic, so any format can be used (e.g. CBOR).
As long as we never modify existing records, the only things that need careful handling are the end of the file (the last mmap) and the mmap structure (in case rebalancing is needed). Thanks to the immutability of existing records, access can be made lock-free.
This tool will allow much faster access to block data, which will make indexers in Cardano more responsive, unlock use-cases that require quick re-indexing, and improve developer agility. It can then be used to unlock the next generation of Cardano products & tools that depend on these properties.
No significant risks beyond standard engineering risks (delays, budget overruns, etc.)
We are confident this project will meet the indexing needs of Milkomeda and that it's generic enough to be useful for many different use-cases, but there may be other approaches that other projects need, such as HTTP/3-based block fetching.
Q3: implement the solution, open source it and integrate it as part of Milkomeda. We expect the project to take ~2.5 months
All funds will go towards engineering effort to implement the solution
1 Rust engineer at Milkomeda
1 Project lead (shared resource between this and other Milkomeda projects)
No plans for this specific project. Depending on interest, we could also develop alternatives to cover different use-cases, such as an HTTP/3-based block-syncing solution.
Development progress on the project itself, followed by its integration into Milkomeda
Project is successfully open sourced, integrated into Milkomeda and usable by projects such as Carp that need fast re-indexing of data
Entirely new proposal, but inspired by our indexing work for Carp and Milkomeda
Members of the dcSpark & Milkomeda team have written many indexers for Cardano, including Carp, contributions to Pallas/Oura, contributions to db-sync, and custom indexers for projects like Milkomeda.