“Data availability” and the “data availability problem” refer to a problem faced by some blockchain scaling solutions. Specifically, when a new block is produced, how can nodes be sure that all of the data in the block has actually been published to the network? The difficulty is that if the block producer does not publish all of the data in the block, no one can detect whether a malicious transaction is hidden in it.
In this article, I will delve into the importance of data availability and related solutions.
How do blockchain nodes work?
Each block on the blockchain consists of two parts:
The block header, that is, the block's metadata. It contains basic information about the block's contents, including the Merkle root of the transactions.
The transaction data, which makes up the bulk of the block and consists of the actual transactions.
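To make the relationship between the two parts concrete, here is a minimal sketch of how a header can commit to the transaction data through a Merkle root. The hashing scheme (SHA-256, duplicating the last node on odd levels) and the names `merkle_root` and `BlockHeader` are illustrative assumptions, not the format of any particular blockchain.

```python
import hashlib
from dataclasses import dataclass


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(transactions: list[bytes]) -> bytes:
    """Hash the transactions pairwise, level by level, down to a single root."""
    if not transactions:
        return sha256(b"")
    level = [sha256(tx) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: duplicate the last node when a level has odd length.
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


@dataclass(frozen=True)
class BlockHeader:
    # Hypothetical, simplified header: real headers carry more fields.
    height: int
    prev_hash: bytes
    tx_root: bytes  # Merkle root committing to the transaction data


txs = [b"alice->bob:5", b"bob->carol:2", b"carol->dave:1"]
header = BlockHeader(height=1, prev_hash=b"\x00" * 32, tx_root=merkle_root(txs))

# Changing any transaction changes the root, so the header no longer matches.
tampered = [b"alice->bob:500", b"bob->carol:2", b"carol->dave:1"]
print(merkle_root(tampered) == header.tx_root)  # False
```

The header is small and fixed in size no matter how many transactions the block contains, which is why a node can download headers alone.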
There are two main types of nodes in the blockchain network:
Full node (also known as a fully validating node). This type of node downloads every transaction in the blockchain and verifies its validity. This requires a lot of resources and hundreds of gigabytes of disk space, but these nodes are the most secure because they will not accept blocks containing invalid transactions.
Light client. If your computer does not have enough resources to run a full node, you can run a light client. A light client does not download or verify any transactions. It only downloads block headers and assumes that the transactions contained in each block are valid. Therefore, light clients are less secure than full nodes.
Fortunately, there is a way for light clients to indirectly check whether all transactions in a block are valid. Instead of checking the validity of transactions itself, a light client can rely on full nodes to send it a fraud proof whenever a block contains an invalid transaction. A fraud proof is a small proof that a certain transaction in a block is invalid.
There is only one problem here: if a full node wants to generate a fraud proof for a block, it needs to know the transaction data of that block. If the block producer publishes only the block header and withholds the transaction data, full nodes cannot check whether the transactions are valid, and so cannot generate a fraud proof for any invalid transaction. Block producers must therefore be required to publish all of the data in a block, but we need a way to enforce this.
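The gap described above can be shown with a toy model. This is only a sketch: `Block`, `is_valid_tx`, `full_node_fraud_proof` and `light_client_accepts` are hypothetical names, transactions are modelled as plain integers, and the validity rule (amounts must be non-negative) is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    header: str
    # None models a block producer that published the header but withheld the data.
    transactions: Optional[list[int]]


def is_valid_tx(amount: int) -> bool:
    # Toy validity rule, assumed for illustration only.
    return amount >= 0


def full_node_fraud_proof(block: Block) -> Optional[int]:
    """Return the index of an invalid transaction, or None if none can be shown."""
    if block.transactions is None:
        return None  # data withheld: nothing to check, so no proof is possible
    for index, tx in enumerate(block.transactions):
        if not is_valid_tx(tx):
            return index
    return None


def light_client_accepts(block: Block) -> bool:
    # The light client only reacts to fraud proofs; it never sees the transactions.
    return full_node_fraud_proof(block) is None


honest = Block(header="h1", transactions=[5, 2])
invalid_published = Block(header="h2", transactions=[5, -100])
invalid_withheld = Block(header="h3", transactions=None)

print(light_client_accepts(honest))             # True
print(light_client_accepts(invalid_published))  # False
print(light_client_accepts(invalid_withheld))   # True: the attack goes unnoticed
```

In the third case the light client cannot tell "no fraud proof because the block is valid" apart from "no fraud proof because nobody could see the data", which is exactly the data availability problem.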
To solve this problem, light clients need a way to check that the transaction data of a block has actually been published to the network, so that full nodes can verify it. However, we want to avoid making light clients download the entire block, because that would defeat the purpose of a light client.
How can we solve this problem? First, let’s look at where the data availability problem is relevant, and then at how it can be solved.
Which scaling solutions does the data availability problem affect?
In the previous section, we introduced the issue of data availability. Let’s discuss its importance to scalability solutions.
Increase block size
In blockchains such as Bitcoin, most ordinary laptops can run a full node and verify the entire chain, because there is an artificial upper limit on block size that prevents the blockchain from growing too large.
But what if we want to increase the block size limit? Then fewer people can afford to run full nodes and independently verify the blockchain, and most people will run less secure light clients. This is bad for decentralization, because it makes it easier for block producers to change the protocol rules and insert invalid transactions that light clients will accept. Therefore, it is important to support fraud proofs for light clients, but as we have already discussed, light clients then need a way to verify that all of the data in a block has been published to the network.
Sharding
One way to increase the throughput of a blockchain is to split it into multiple chains, called shards. Each shard has its own block producers, and shards can communicate with each other to transfer tokens between them. The point of sharding is to split up the block producers in the network so that no single block producer has to process every transaction. Instead, each shard processes only a portion of the transactions.
Generally speaking, validators on a sharded blockchain only run full nodes for one or a few shards and run light clients for the others. After all, if every validator had to run a full node for every shard, sharding would fail at its purpose, which is to spread the network's load across different nodes.
However, this approach has a drawback. What if the block producers on a shard turn malicious and start accepting invalid transactions? This is more likely to happen in a sharded system than in a non-sharded one, because each shard has only a small number of block producers and is therefore easier to attack. Remember that the block producers are divided among the different shards.
To make it possible to detect a shard that accepts invalid transactions, we must ensure that all of the data in that shard is publicly available, so that a fraud proof can be produced for any invalid transaction.
Rollup
Optimistic rollups are a new scaling strategy based on sidechains called rollups, which are similar to shards. These sidechains have their own dedicated block producers and can transfer assets to and from other sidechains.
However, what if a malicious block producer includes invalid transactions in a block and steals the funds of every user on the sidechain? Fraud proofs can be used to detect this, but the old problem remains: sidechain users need some way to ensure that the data of every block on the sidechain is publicly available, so that invalid transactions can be detected. To address this, rollups on Ethereum publish all of their blocks to the Ethereum blockchain and rely on Ethereum for data availability. In other words, they use Ethereum as a data availability layer.
ZK-rollups are similar to optimistic rollups. The difference is that instead of using fraud proofs to detect invalid blocks, they use validity proofs to prove that blocks are valid. A validity proof does not itself require data availability. In general, however, ZK-rollups still need data availability: if a block producer creates a valid block and generates a validity proof for it but does not publish the block data, users cannot know the state of the chain or their own balances, and so cannot interact with the chain.
Explore further
Rollups are designed to use the blockchain as a data availability layer for storing transactions, while the actual transaction processing and computation happen on the rollup itself. This leads to an interesting insight: the blockchain does not actually need to perform any computation. At a minimum, it only needs to order transactions into blocks and guarantee the availability of the transaction data.
This is also the design idea behind LazyLedger, a “lazy” blockchain that only performs the two core tasks of a blockchain, ordering transactions and making their data available, and does so in a scalable way. This makes LazyLedger a minimal “pluggable” component for systems such as rollups.
Solutions to the data availability problem
Download all data
As discussed above, the most direct way to solve the data availability problem is to require everyone, including light clients, to download all of the data. Obviously, this does not scale well. It is nevertheless what most blockchains, such as Bitcoin and Ethereum, currently do.
Data availability proofs
Data availability proofs are a new technique that lets a client check that all of the data in a block has been published by downloading only a small part of the block.
Data availability proofs rely on a mathematical primitive called erasure coding. Erasure codes are widely used in information technology, from CD-ROMs to satellite communications to QR codes. With erasure coding, an original 1 MB block can be extended to 2 MB, where the extra 1 MB is special data called the erasure code. If any bytes of the block go missing, the erasure code lets you recover them. Even if up to 1 MB of the extended block is lost, you can still recover the entire block. This is the same technique that lets your computer read a CD-ROM even when it is scratched. (Translator’s note: erasure codes do not save bandwidth. If 1 MB of data is extended to 2 MB, you still need to obtain at least 1 MB of it to recover the original data, although that 1 MB does not have to be contiguous.)
This means that for 100% of a block to be available, the block producer only needs to publish 50% of the extended block to the network. If a malicious block producer wants to hide even 1% of the block, it must withhold more than 50% of it, because otherwise that 1% can be recovered from the remaining data. (Translator’s note: the first sentence of this paragraph is doubtful.)
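The recovery property can be illustrated with a toy Reed-Solomon-style code. This is a sketch under stated assumptions, not the construction used by any particular system: the data is treated as evaluations of a polynomial over the small prime field of size 257, the extension is computed by Lagrange interpolation, and the names `extend` and `recover` are hypothetical.

```python
P = 257  # small prime modulus, assumed for illustration (every byte value fits below it)


def lagrange_eval(points: list[tuple[int, int]], x: int) -> int:
    """Evaluate at x the unique polynomial of degree < len(points) through the points, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P  # pow(.., -1, P): modular inverse
    return total


def extend(data: list[int]) -> list[int]:
    """Extend k symbols to 2k symbols; the first k are the original data."""
    k = len(data)
    points = list(enumerate(data))  # data[i] is the polynomial's value at x = i
    return data + [lagrange_eval(points, x) for x in range(k, 2 * k)]


def recover(shares: dict[int, int], k: int) -> list[int]:
    """Rebuild the k original symbols from any k (position, value) shares."""
    if len(shares) < k:
        raise ValueError("need at least k shares to recover the data")
    points = list(shares.items())[:k]
    return [lagrange_eval(points, x) for x in range(k)]


data = [104, 101, 108, 108]          # four original symbols
extended = extend(data)              # eight symbols in total
# Pretend half of the extended data was withheld: keep positions 1, 4, 6 and 7 only.
surviving = {i: extended[i] for i in (1, 4, 6, 7)}
print(recover(surviving, k=4) == data)  # True
```

Any four of the eight symbols are enough here, so hiding a single original symbol requires withholding more than half of the extended data.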
With this knowledge, clients can make sure that no part of a block is being hidden. A client can try to download a few randomly chosen chunks of the block. If it fails to download one of them (that is, the chunk is among the 50% that a malicious block producer did not publish), it refuses to accept that the block is available. After trying to download one random chunk, the client has a 50% chance of detecting that the block is unavailable. After two chunks the chance is 75%, after three chunks it is 87.5%, and so on, until after seven chunks it exceeds 99%. In this way, the client only needs to download a small part of the block to check, with high probability, that the entire block is available.
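The probabilities above follow from a simple formula: if half of the data is withheld and each sample is chosen independently, the chance that every one of n samples happens to land on published data is 0.5 to the power n. A minimal sketch, where the function names are hypothetical and independent sampling is an assumption:

```python
import math


def detection_probability(samples: int, hidden_fraction: float = 0.5) -> float:
    """Chance that at least one of `samples` independent random requests hits withheld data."""
    return 1 - (1 - hidden_fraction) ** samples


def samples_needed(target: float, hidden_fraction: float = 0.5) -> int:
    """Smallest number of samples whose detection probability reaches `target`."""
    return math.ceil(math.log(1 - target) / math.log(1 - hidden_fraction))


for n in (1, 2, 3, 7):
    print(n, detection_probability(n))
# 1 0.5
# 2 0.75
# 3 0.875
# 7 0.9921875

print(samples_needed(0.99))  # 7
```

The number of samples needed grows only logarithmically with the desired confidence, which is why a handful of downloads is enough.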
The full details of data availability proofs are more complicated and rely on additional assumptions. For example, there must be at least a certain minimum number of light clients in the network, so that together they request enough chunks of data to recover the entire block. If you want to learn more, see the paper on data availability proofs.