## Problem
Collection and democratization of data in the Cardano ecosystem will be critical for feeding new ideas and measuring critical project KPIs.
## Solution
Cardano ETL will support transformation of Cardano blockchain data into convenient formats like CSV, JSON Newline, GCP PubSub, and relational databases.
Initially we will support exporting to CSV, JSON Newline Delimited, and GCP PubSub. We will also operate a passive stake pool that streams blockchain data in real time, via a PubSub/Dataflow/BigQuery pipeline, into a public BigQuery dataset for everyone to use.
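As a rough sketch of what the JSON Newline Delimited output looks like, here is a minimal exporter in Python. The field names (`block_hash`, `slot`, `tx_count`) are illustrative only, not the final Cardano ETL schema, and a real run would read blocks from a node rather than a hard-coded list:

```python
import json
import os
import tempfile

# Hypothetical block records; the real exporter would decode these from
# the chain (field names here are illustrative, not the final schema).
blocks = [
    {"block_hash": "abc123", "slot": 4492800, "tx_count": 3},
    {"block_hash": "def456", "slot": 4492820, "tx_count": 1},
]

def export_jsonl(records, path):
    """Write one JSON object per line (JSON Newline Delimited)."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record, sort_keys=True) + "\n")

path = os.path.join(tempfile.gettempdir(), "blocks.jsonl")
export_jsonl(blocks, path)

# Read it back: each line is an independent JSON document, which is why
# this format loads cleanly into BigQuery, Athena, jq, etc.
with open(path) as f:
    lines = f.read().splitlines()
```

The same records could be serialized once and fanned out to CSV or a PubSub topic; JSON Newline is just the simplest target to show.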
We will expand into additional convenient formats in later phases and milestones, likely depending on the success of this Fund3 effort (both the technology and the vlog lessons/AMAs/deep dives).
If you want instant Cardano data and would rather NOT export the blockchain yourself using `cardanoetl`, please check out the quickstart below for our realtime public BigQuery data!
Proposal Details (Github): https://github.com/floydcraft/cardano-etl
Proposal Overview (YouTube): https://www.youtube.com/watch?v=QeFCzwNBR5U
Proposal BigQuery use case Examples (YouTube): https://youtu.be/0LtND_PDfQU
## Target Impact
## Auditability
## Feasibility
### Approach
Well, great news! IOHK already has a model for how to sync the blockchain to a SQL database: https://github.com/input-output-hk/cardano-db-sync.
Initially I'll work quickly to see whether a Haskell-only solution works, much like the one the cardano-db-sync repo currently uses. I suspect it will not scale to many target databases/formats or allow for a clean, useful implementation, but I need to follow up on this as my first action item.
So, in case I can't use pure Haskell, I'll use the serialization lib to load the data from disk into Python, which will allow the Cardano ETL project to target just about any destination (BigQuery, Athena, JSON, …). This is more or less the Ethereum ETL approach: https://github.com/blockchain-etl/ethereum-etl. This might require work to make the serialization lib available in Python (TODO).
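The appeal of the Python route is a small core that decodes blocks once and fans them out to pluggable targets. A minimal sketch of that shape follows; the class and method names are my own illustration, not Ethereum ETL's or Cardano ETL's actual interfaces:

```python
import json

class JsonLinesTarget:
    """Collects records as JSON Newline Delimited strings."""
    def __init__(self):
        self.lines = []

    def write(self, record):
        self.lines.append(json.dumps(record, sort_keys=True))

class RowTarget:
    """Collects records as tuples, the way a relational loader might."""
    def __init__(self, columns):
        self.columns = columns
        self.rows = []

    def write(self, record):
        self.rows.append(tuple(record[c] for c in self.columns))

def export(records, targets):
    # One pass over the decoded blocks, fanned out to every target.
    # Adding a new destination (BigQuery, Athena, ...) means adding
    # one more class with a write() method, not touching the core.
    for record in records:
        for target in targets:
            target.write(record)

blocks = [
    {"block_hash": "abc", "slot": 1},
    {"block_hash": "def", "slot": 2},
]
jsonl = JsonLinesTarget()
rows = RowTarget(columns=["block_hash", "slot"])
export(blocks, [jsonl, rows])
```

This is the property that makes the Python fallback attractive: the decode step is written once, and each output format is an independent, testable target.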
Worst case, I'll end up needing to improve or add features to the serialization lib to enable the second approach.
All of this will be open source and I welcome contributions / feedback along the way.
One note on the Cardano ETL CLI: the idea is that it supports exporting to all likely formats/streams, but it could be that a limited set is supported via a Haskell CLI and the full set via a Python CLI (e.g. PubSub streaming).
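To make the CLI idea concrete, here is one possible shape for the Python side using `argparse`. The subcommand and flag names (`export_blocks`, `--start-slot`, `--output-format`) are hypothetical placeholders, not a committed interface:

```python
import argparse

def build_parser():
    """A sketch of a `cardanoetl` CLI with one export subcommand.

    Subcommand and flag names are illustrative only.
    """
    parser = argparse.ArgumentParser(prog="cardanoetl")
    sub = parser.add_subparsers(dest="command", required=True)

    export = sub.add_parser("export_blocks",
                            help="export a slot range to a sink")
    export.add_argument("--start-slot", type=int, required=True)
    export.add_argument("--end-slot", type=int, required=True)
    # The Python CLI could expose the full target set, including
    # streaming sinks like PubSub that a Haskell CLI might omit at first.
    export.add_argument("--output-format",
                        choices=["csv", "jsonl", "pubsub"],
                        default="jsonl")
    return parser

args = build_parser().parse_args(
    ["export_blocks", "--start-slot", "0", "--end-slot", "100",
     "--output-format", "pubsub"])
```

Keeping the two CLIs flag-compatible for the shared subset would let users switch between them without rewriting scripts.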
### Applicable Skills
### Estimates
### Resourcing
## Future Funding (for both operations and development)
I've been working in the mobile gaming space for 8 years, creating client and server architectures at petabyte scale with both GCP and AWS.