[DISCUSSION] RFP-003: Long-Term Block Data Storage


Note

Use this space to discuss and ask questions about the RFP. We want this thread to facilitate two-way communication between the community and the team regarding this RFP.

Context/Background

Avail is a blockchain focused on “data availability”, which means that it is designed to order and publish data (usually transaction data from other blockchains) and provide a high degree of confidence that the data was indeed published correctly.

Note that the focus is on ensuring that block data was published properly, not that it is stored long-term. The Avail validators/nodes currently do store all data generated by the network, because it is still new. Eventually, we plan to begin pruning block data older than some cutoff. To that end, Avail will benefit from a complementary solution designed specifically for data storage.

Project Description

A long-term data storage solution tailored to work with Avail would ideally meet the following requirements:

  • [P0] Observe the Avail network and detect when new blocks are finalized
  • [P0] Automatically archive block data for finalized blocks
  • [P0] Provide documentation for strong guarantees of continued storage
  • [P0] Provide a method of retrieving the archived data
    • [P1] Programmatic retrieval built into the Avail node, so that a node can be synced from genesis even after old blocks have been pruned from the other nodes on the network (via a command-line option/config)
  • [P1] Provide a method for checking if archived data is still available
    • [P2] Surface this info on a web-based status page or dashboard
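
As an illustration of the "is archived data still available" check, here is a minimal sketch that scans a block range and reports contiguous gaps, which is the kind of summary a status page could display. The names `BlockRange`, `findMissingRanges`, and the `isAvailable` probe are hypothetical and not part of Avail or avail-js; a real probe would query whatever storage backend is chosen.

```typescript
// Hypothetical types and names, for illustration only.
interface BlockRange {
  from: number;
  to: number;
}

// Scans block numbers first..last (inclusive) and returns the contiguous
// ranges for which the probe reports that the archived block is missing.
function findMissingRanges(
  first: number,
  last: number,
  isAvailable: (blockNumber: number) => boolean,
): BlockRange[] {
  const missing: BlockRange[] = [];
  let open: BlockRange | null = null;
  for (let n = first; n <= last; n++) {
    if (isAvailable(n)) {
      open = null; // a present block closes any gap in progress
    } else if (open !== null) {
      open.to = n; // extend the current gap
    } else {
      open = { from: n, to: n }; // start a new gap
      missing.push(open);
    }
  }
  return missing;
}
```

An empty result means every block in the range was reported as retrievable; anything else is a list of ranges to re-archive or alert on.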

Additional Information

Avail is based on Substrate, and much of the tooling available for Substrate/Polkadot will work with Avail with some tweaking. We have a library called avail-js that wraps polkadot.js and provides a way to access most chain functionality.

The block production and finality algorithms are GRANDPA/BABE. This is relevant because it makes sense to only store blocks once they have been finalized. The GRANDPA algorithm can be used to determine which blocks have been finalized, rather than using a trailing number of blocks or some other heuristic. For an overview, see here.
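
To make the "archive only finalized blocks" idea concrete, here is a minimal sketch of the bookkeeping an archiver might do. It assumes something upstream reports each new finalized block number (with polkadot.js this is typically done through `api.rpc.chain.subscribeFinalizedHeads`). The `BlockSource`, `ArchiveSink`, and `FinalizedArchiver` names are hypothetical. Because finality can advance by more than one block at a time, the sketch archives every block between the last archived one and the newly finalized head rather than only the head itself.

```typescript
// Hypothetical interfaces; none of these names come from avail-js or polkadot.js.
interface BlockSource {
  // Returns the raw bytes of a block, e.g. fetched from a node over RPC.
  fetchBlock(blockNumber: number): Uint8Array;
}

interface ArchiveSink {
  // Persists a block in long-term storage.
  store(blockNumber: number, data: Uint8Array): void;
}

class FinalizedArchiver {
  private source: BlockSource;
  private sink: ArchiveSink;
  private lastArchived: number;

  // startAfter = -1 means nothing is archived yet, so genesis (block 0) is next.
  constructor(source: BlockSource, sink: ArchiveSink, startAfter: number = -1) {
    this.source = source;
    this.sink = sink;
    this.lastArchived = startAfter;
  }

  // Call whenever a new finalized head is reported.
  // Archives every not-yet-archived block up to and including that head
  // and returns the block numbers archived by this call.
  onFinalizedHead(finalizedNumber: number): number[] {
    const archived: number[] = [];
    for (let n = this.lastArchived + 1; n <= finalizedNumber; n++) {
      this.sink.store(n, this.source.fetchBlock(n));
      archived.push(n);
    }
    if (finalizedNumber > this.lastArchived) {
      this.lastArchived = finalizedNumber;
    }
    return archived;
  }
}
```

Repeated or stale notifications are harmless in this sketch: a head at or below the last archived block archives nothing.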

References

Read the entire RFP, Funding Milestones, and Application Instructions here.


Hey!

A few questions regarding the RFP:

  1. In the context of long-term data archiving, what level of granularity do you envision for the solution? Should it support archiving complete block history, filtered data subsets (e.g., specific rollup data), or a flexible combination of both?
  2. Is tight integration with the Avail node itself a requirement, or would an off-chain service architecture offering efficient communication with the node be acceptable?
  3. In anticipation of future Avail network pruning, will there be a dedicated archival mode (as in Substrate) for Avail nodes to maintain complete block history?

Hi Kamil, thanks for the interest!

  1. Archival of the complete block history is the core requirement of this RFP. However, it is possible (even likely) that individual rollups will be interested in archiving their own history, so there may be a use case for building this tooling flexibly enough that rollups can pay for the service for their own subset of the data. It’s not a requirement, but it could be a smart way to design it.
  2. Tight integration with the Avail node is not a requirement, merely a suggestion as one way to tap into the block data as it is published. As noted in the requirements, an integration with the node such that it is able to query the long-term storage to sync historical data is not a blocker, but very desirable (P1).
  3. The current Avail node already supports an archival mode, and this will remain the case. Currently the default configuration for all nodes on the chain is to operate in archival mode, but we expect to change this setting during the next year and begin pruning. Anyone who wishes to continue operating an archival node will be able to continue to do so, of course.

Hey, Avail team.

Is the application period still open? I tried to contact exploration@availproject.org, but the address turned out to be invalid. I’m wondering whether it’s okay for an individual to apply for this, and whether it’s too late at this point.

Hey! Yes, we’re still accepting applications for this RFP. You can still apply even if you’re an individual.
Thanks for flagging the email issue, will get that sorted.


Hey Avail team, I’ve got a few more questions about this one!

  1. What are the performance expectations for data retrieval from the long-term storage solution, particularly concerning latency and throughput? Are there specific benchmarks or performance metrics Avail aims to meet?
  2. How does Avail envision the growth of the blockchain and the corresponding data storage needs over the next 2-3 years? Are there any specific scalability features or future-proofing mechanisms you would like the storage solution to incorporate? Any budgetary constraints or preferred cost models for long-term usage of storage?
  3. How do you imagine retrieval will look? Are there any specific tools or APIs that the solution must integrate with (seamlessly)?

Thanks!

Just a bump here, would love to submit something!

Hi Andrew, apologies for the super delayed response – lost track of this thread with all of the mainnet launch/post-launch activity.

This is meant to be long-term storage, so we don’t have any benchmarks on latency. Total throughput could be a concern, though; see the next point below.

Currently the chain publishes up to 2MB of data every 20 seconds. The timing will remain at 20 seconds, but we do expect that max size to increase over time. How quickly is hard to predict, since it will be demand driven, but we have done testing with up to 128MB of data, and even larger is possible.

So it’s good to understand how the solution might need to adapt as Avail grows block sizes.
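
For a rough sense of scale, here is a back-of-the-envelope calculation based on the figures above. It is an upper bound only: it assumes every block is completely full and treats MB as 10^6 bytes, and both are assumptions rather than statements about real network usage. The function name `maxArchiveGrowth` is just for illustration.

```typescript
const BLOCK_INTERVAL_SECONDS = 20; // block time stated above
const SECONDS_PER_DAY = 24 * 60 * 60;
const BYTES_PER_MB = 1e6; // assumption: decimal megabytes

// Upper bound on archive growth if every block carried maxBlockMb of data.
function maxArchiveGrowth(maxBlockMb: number) {
  const blocksPerDay = SECONDS_PER_DAY / BLOCK_INTERVAL_SECONDS; // 4320
  const bytesPerDay = blocksPerDay * maxBlockMb * BYTES_PER_MB;
  return {
    blocksPerDay,
    gbPerDay: bytesPerDay / 1e9,
    tbPerYear: (bytesPerDay * 365) / 1e12,
  };
}

const current = maxArchiveGrowth(2); // about 8.64 GB/day, about 3.15 TB/year
const tested = maxArchiveGrowth(128); // about 553 GB/day, about 202 TB/year
```

Under these assumptions, going from 2 MB to 128 MB blocks moves the worst case from a few terabytes per year to roughly two hundred, so it is worth stating early how a proposed solution's cost and throughput scale with block size.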

The doc above mentions integrated retrieval into Avail nodes as a desirable but not MVP feature. That would allow spinning up a new node and syncing historical data, even after the Avail network begins pruning data. This integration could be in the node itself (e.g. a command-line config) or a special kind of archive node that is able to serve data via the normal RPC API but is pulling data from long term storage behind the scenes.

Not mentioned above is another interesting use case that we have been noodling on: allowing rollups to retrieve only their own data from the archive.

Avail blocks store data from other applications (rollups, etc). Each Avail block has an index, so that it is possible to identify just one app’s data. If you imagine 128MB+ size blocks it becomes obvious why this is important: as a participant of a rollup (say, a validator, or anyone really), you want to be able to query just your rollup’s data and nothing else.

The same is also true for long term storage: it is easy to imagine that a rollup would want to be able to retrieve only historical data for its specific app, without having to retrieve entire Avail blocks. So for example, a chain that submits 50kB of data once an hour shouldn’t have to download gigabytes per day of historical data out of the archive.

The challenge is that the data would have to be stored in a way that facilitates retrieval of both entire blocks (for syncing Avail nodes) and partial blocks (for syncing rollup/app nodes).
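
One way to picture a layout that serves both cases is to store each block as a single blob plus a small per-app index of byte ranges. The sketch below is purely illustrative: it is in-memory, all names (`AppExtent`, `IndexedArchive`, and so on) are hypothetical, and it does not reflect how Avail's own block index is actually encoded.

```typescript
// Hypothetical layout: one blob per block plus byte ranges per app.
interface AppExtent {
  appId: number;
  offset: number; // start of this app's data within the block blob
  length: number; // number of bytes belonging to this app
}

interface ArchivedBlock {
  data: Uint8Array;
  index: AppExtent[];
}

class IndexedArchive {
  private blocks = new Map<number, ArchivedBlock>();

  // Concatenates the per-app parts into one blob and records where each part sits.
  putBlock(blockNumber: number, parts: { appId: number; data: Uint8Array }[]): void {
    const total = parts.reduce((sum, p) => sum + p.data.length, 0);
    const data = new Uint8Array(total);
    const index: AppExtent[] = [];
    let offset = 0;
    for (const part of parts) {
      data.set(part.data, offset);
      index.push({ appId: part.appId, offset, length: part.data.length });
      offset += part.data.length;
    }
    this.blocks.set(blockNumber, { data, index });
  }

  // Whole-block retrieval, e.g. for syncing an Avail node.
  getBlock(blockNumber: number): Uint8Array | undefined {
    return this.blocks.get(blockNumber)?.data;
  }

  // Partial retrieval: only the byte ranges that belong to one app.
  getAppData(blockNumber: number, appId: number): Uint8Array[] {
    const block = this.blocks.get(blockNumber);
    if (!block) return [];
    return block.index
      .filter((e) => e.appId === appId)
      .map((e) => block.data.subarray(e.offset, e.offset + e.length));
  }
}
```

With a remote storage backend, `getAppData` would presumably turn each extent into a byte-range read, so a rollup that posts 50kB an hour only ever transfers its own bytes plus the small index.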

Let me know if you have any follow-up questions. Thank you!