This post analyzes the root cause of the Prysm client incident on Ethereum 2.0 and lists the issues that Ethereum 2.0 stakers need to pay attention to.
Original title: “Review of the Ethereum 2.0 Mainnet Incident”
Written by: Raul Jordan, co-founder of Prysmatic Labs
Summary of the incident
Starting from epoch 32302, the beacon chain missed a large number of block proposals. Because Prysm has the largest share of users among Eth2 clients, the problem most likely lay in Prysm. After some time, we reproduced the error locally. It turned out to be a known issue related to Eth1 data voting and validator deposits. Someone had reported this issue to us before, but we were unable to reproduce the bug at the time and treated it as an isolated incident. The problem had never spread widely on any testnet or on the mainnet; this was the first time it caused block proposals to fail.
Over the following 18 epochs, almost no Prysm beacon node could produce new blocks. From epoch 32320 the chain operated normally again, and it was generally believed that the incident was over. About 24 hours later, however, it happened again with a similar impact.
A formal post-mortem report on this incident has been published; please follow the link to read it.
The post-mortem details the timeline of the incident, analyzes the root cause, and lists the issues that Eth2 stakers and participants need to pay attention to.
Preliminary data indicate that in the first incident each affected validator lost an average of 122,950 gwei (about $0.30 at the price at the time of writing). The incident recurred within 24 hours, and the second time each affected validator lost approximately $0.22.
Some key facts:
- No validators were slashed
- Finalization of the beacon chain was not affected
- The participation rate remained very high (84.8% at its lowest point) (Editor’s note: this figure differs from the one in the latest issue of “Eth2 Progress Update” written by Ben Edgington.)
- Most validators missed 2 to 3 attestations, regardless of client
- This does not appear to have been a malicious or deliberate attack
After about 30 hours of hard work by the entire team, we diagnosed the root cause and released a fixed version for all Prysm nodes at 6 am UTC on April 25th. Before nodes had fully upgraded, a similar incident occurred one last time. Once node operators had been given enough time to upgrade their clients, the incident did not recur, and the evidence indicates that the problem has been completely resolved.
Will this incident weaken everyone’s confidence in Eth2?
No. The incident did not cause a consensus failure, and its impact was very small relative to the size of the Eth2 mainnet (in the first incident, each affected validator lost an average of about $0.30). Since its launch, Eth2 has been very robust, with a very high validator participation rate, and every epoch has been finalized. From our point of view, the fact that the network returned to a perfectly functioning state once the fault was resolved strengthens the community’s confidence in the resilience of Ethereum.
Will this incident weaken everyone’s confidence in the Prysmatic Labs team?
Our response to this incident was completely different from how we handled failures on the Eth2 testnet in the past. After the incident, our team immediately dispelled misinformation, quantified the impact, and, while the solution was pending, gave validators clear steps to follow. We asked everyone to upgrade their client only after we had fully settled on the solution. It is worth noting that because Prysm is the client with the largest share of users on the Ethereum 2.0 network, any bug in it can cause more serious problems.
For core developers, the key to the job is to “bound complexity”. Distributed systems such as Eth2 have a great many variables, and every team makes every effort to reduce the likelihood of bugs. Bugs are nonetheless inevitable in software of this kind, and we acknowledge that Prysmatic Labs made a mistake. We hope to have shown our team’s motivation and ability to solve problems while balancing speed against accuracy for validators.
Summary of the root cause of the incident
The Eth2 and Eth1 chains are loosely coupled: Eth2 needs Eth1 only to verify validator deposits. In other words, even if validators vote on junk data, the Eth2 PoS chain can continue to operate. The only effect is that new validator deposits cannot be added until the PoS chain votes on correct Eth1 data again. This voting takes place over a “voting period”, which is currently set to 64 epochs (approximately 6.8 hours) on the mainnet.
Voting follows a simple “majority” rule, and the Eth2 validator specification explains how it works. Unfortunately, Prysm omitted some validation when implementing this majority voting. In this incident, a bug in Prysm led a block proposer to create a completely invalid Eth1 deposit tree root, and other Prysm nodes saw that block proposal first. They then voted for the same root, because the Prysm client followed a simple majority-vote rule without explicit verification.
All Prysm nodes then “snowballed” into voting for the invalid data, leaving block proposers unable to include deposits in their blocks. The deposits could not be verified against the node’s own Eth1 deposit tree root, so the block proposal failed. The problem resolves itself once the voting period ends, but unless the bug is fixed it will reappear.
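The difference between following the leading vote blindly and counting only votes that a node can verify for itself can be sketched as follows. This is a simplified illustration in Go, not Prysm’s actual code: the `Eth1Vote` type, the `mostVoted` helper, the `locallyValid` set, and the root strings are all hypothetical.

```go
package main

import "fmt"

// Eth1Vote is a simplified stand-in for the Eth1 data carried in a block.
type Eth1Vote struct {
	DepositRoot string
}

// mostVoted returns the deposit root with the most counted votes in the
// current voting period. Only votes accepted by the filter are counted.
// On a tie, the root that first reaches the highest count wins.
func mostVoted(votes []Eth1Vote, accept func(Eth1Vote) bool) (string, bool) {
	counts := make(map[string]int)
	best, bestCount := "", 0
	for _, v := range votes {
		if !accept(v) {
			continue
		}
		counts[v.DepositRoot]++
		if counts[v.DepositRoot] > bestCount {
			best, bestCount = v.DepositRoot, counts[v.DepositRoot]
		}
	}
	return best, bestCount > 0
}

func main() {
	// Two nodes have already voted for a corrupted root, one for the right one.
	votes := []Eth1Vote{
		{DepositRoot: "0xbad"},
		{DepositRoot: "0xbad"},
		{DepositRoot: "0xgood"},
	}
	// Roots this node can reproduce from its own view of the Eth1 chain.
	locallyValid := map[string]bool{"0xgood": true}

	blind, _ := mostVoted(votes, func(Eth1Vote) bool { return true })
	checked, _ := mostVoted(votes, func(v Eth1Vote) bool { return locallyValid[v.DepositRoot] })

	fmt.Println("without validation:", blind)
	fmt.Println("with validation:", checked)
}
```

Without the filter, the node joins the majority behind the corrupted root and adds to the snowball. With the filter, votes for roots it cannot verify are ignored.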
The invalid Eth1 deposit tree root itself came from a bug in the initialization of the deposit cache, which affected only some of the beacon nodes running the Prysm client. Those nodes produced an incorrect deposit tree root, other Prysm nodes voted for it, and that caused the incident.
Note that the technical details follow! You can skip to the next part to read about the solution and the lessons learned from the incident.
Block proposals fail
Missed block proposals began at epoch 32302.
Nishant notified the team and called an all-hands meeting. We then reproduced the incident on a local mainnet beacon node and started investigating.
Investigation reveals Prysm voting for strange, incorrect Eth1 deposit tree roots
We noticed that Prysm nodes were voting for a strange tree root, the root that the PoS chain uses to verify the integrity of deposits made to the validator deposit contract. After reviewing the history of the original block proposer on a public explorer (we are not disclosing the validator’s identity, in order to protect them), we concluded that this was not an attack.
Our initial suspicion concerned how Prysm handles Eth1 data voting in the validator’s block proposal code path. In particular, we tried to rule out several problems:
- Is there a problem with how deposits are included in blocks?
- Is the retrieval of deposit logs and Eth1 information inconsistent or nondeterministic?
- Is there a problem with our deposit Merkle tree?
For the next 16 hours or so, we worked together to diagnose potential problems. We combed through the code line by line, tried to reproduce the failure in unit tests, and tried various other approaches. Although we already had a potential fix, we were reluctant to release it because we were not yet confident in it.
A more plausible root cause
We learned some lessons from dealing with bugs on the Eth2 testnet. Being fairly confident in the root cause is not enough. In high-stakes situations, we need to be 100% confident before releasing a fix to users. 28 hours after the incident began, we sat down and asked ourselves: “What do we still not know? What other questions can we ask to get closer to the root cause of the failure?” Here is what we knew:
- Our sparse Merkle tree implementation has no serious bugs: fed with deposits from the mainnet and the Prater testnet, it matches Lighthouse’s implementation and Protolambda’s Eth2 zrnt implementation.
- The code path we use to retrieve Eth1 data from the Eth1 node has no bugs, nor does it return incorrect data.
What we did not know:
- How the invalid deposit tree roots came about
- Why the problem could be reproduced on some nodes but not on others
- Why Prysm nodes made an “off-by-one” error when determining the number of deposits in a block
Fixing the problem
To answer these questions, we looked at the code path that initializes our deposit tree. It turns out that a caching layer was added early on so that stakers would not have to download all validator deposit records every time they start their node. In addition, we had added a new feature: Prysm can be started from a genesis state embedded in the client. When the cache was populated, an incorrect assumption about our deposit tree corrupted the data.
Root of the problem
It turns out that if our deposit tree is empty, len(items) still returns 1. This means that where we should have set lastReceivedMerkleIndex to -1, we set it to 0. As a result, some Prysm nodes on this code path skipped inserting the 0th deposit into the tree. The rest of our code base accounts for this quirk of our deposit tree implementation, but this code path did not.
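The off-by-one can be illustrated with a minimal Go sketch. It is not the Prysm source: the `depositTree` type, its `items` method, and the helper functions are hypothetical stand-ins that only mimic the behaviour described above, namely an empty tree that still reports one item.

```go
package main

import "fmt"

// depositTree is a hypothetical, heavily simplified deposit tree.
type depositTree struct {
	leaves [][]byte
}

// items mimics the quirk described in the text: an empty tree still
// reports a single placeholder item, so len(items()) is never 0.
func (t *depositTree) items() [][]byte {
	if len(t.leaves) == 0 {
		return [][]byte{make([]byte, 32)}
	}
	return t.leaves
}

// buggyLastIndex derives the last received index from len(items()).
// For an empty tree this yields 0 instead of -1.
func buggyLastIndex(t *depositTree) int64 {
	return int64(len(t.items())) - 1
}

// fixedLastIndex uses the real number of leaves, so an empty tree yields -1.
func fixedLastIndex(t *depositTree) int64 {
	return int64(len(t.leaves)) - 1
}

// shouldInsert reports whether the deposit at index still has to be
// inserted, given the index of the last deposit already in the tree.
func shouldInsert(index, lastReceived int64) bool {
	return index > lastReceived
}

func main() {
	empty := &depositTree{}
	fmt.Println("buggy last index:", buggyLastIndex(empty))
	fmt.Println("fixed last index:", fixedLastIndex(empty))
	fmt.Println("insert deposit 0 (buggy):", shouldInsert(0, buggyLastIndex(empty)))
	fmt.Println("insert deposit 0 (fixed):", shouldInsert(0, fixedLastIndex(empty)))
}
```

With the buggy index, deposit 0 looks as if it were already in the tree and is skipped. With the corrected index of -1 it is inserted as expected.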
To test this hypothesis, we replicated the code path as closely as possible using the test fixtures provided by Protolambda, deliberately skipping the insertion of the 0th deposit into the deposit tree. Sure enough, a repeatable test produced the problematic deposit tree root that had caused the entire incident. We then added checks around the code path to prevent the condition from recurring and prepared to release the final fixed version.
Summary of the root cause
- Prysm saves Eth1 data on disk so that users do not have to re-request the validator deposit contract logs every time the process restarts.
- If a node restarts with Eth1 data saved on disk, we initialize our deposit cache from that data, but the way our sparse Merkle tree (SMT) helper package works is at odds with the code path that initializes this cache from disk. We skipped inserting the 0th deposit into the deposit tree, which produced an invalid deposit tree root. This code path only affects nodes that have not had a database since genesis and were later repaired.
- Prysm nodes implement the Eth1 data voting algorithm from the official specification, which follows the “majority” rule, but Prysm did not fully implement some of the algorithm’s validity conditions. Prysm nodes vote with the majority on Eth1 data. As long as the voted data refers to an existing block root, a Prysm node may vote for a deposit tree root produced by a faulty deposit tree, because that root is never verified.
- Because most nodes on the network run Prysm, the snowball effect of majority voting for the faulty deposit root grew into a serious problem: Prysm nodes could not produce blocks on the mainnet for a period of time.
- Once the Eth1 data voting period resets, Prysm nodes can propose blocks correctly again, until the bug is next triggered.
At 13:00 on Sunday, April 25th, Beijing time, after many hours of uncertainty, we released a fix for this problem. We have full confidence in the fix and are very confident that the problem will not recur on Eth2 once nodes have upgraded.
Lessons learned
During an incident, confidence in our fix and careful communication with the outside world are essential
The Medalla testnet incident taught us an important lesson about the value of good communication. The precise wording of every public statement can seriously affect the outcome of an incident. In the testnet incident, we thought the immediate remedy was to tell everyone through public channels to “restart your node”. That hasty decision took most of the nodes on the network offline, after which they scrambled to find a good peer among a crowd of bad ones in order to sync with the chain. We also quickly released a hotfix that we were not 100% confident would solve the problem. This added confusion to the system and caused node operators to doubt the fix.
In contrast, throughout this mainnet incident we focused on careful and precise communication. We also did not release a hotfix until we were 100% confident in both the root cause and the fix.
Patience and calm help solve the problem
Our team has been building Eth2 for the past few years and has learned how to stay calm in the face of adversity. We believe that while solving a problem it is very important to stay calm, exchange status reports frequently, and make sure the team feels supported and receives positive feedback. By taking the time to collect as much evidence as possible and working closely with our users, we were able to solve this problem. More importantly, we took the time at the outset to quantify the impact of the incident, which eased stakers’ worries and filled the information gap. This lesson matters a great deal when working under high stress and with little sleep. Slow down, solve the problem properly, and avoid making it worse at all costs.
Eth2 testnets are not the mainnet
For the Prysm client, we conducted extensive testing and monitoring of Prysm release candidates on the public Eth2 testnets. Both the Prater and Pyrmont testnets are good tools for users to test their setups before joining the Eth2 mainnet. However, these testnets assume that the four production-grade Eth2 clients have roughly equal shares, that is, that no client holds a clear majority of validators. Unfortunately, this can miss bugs that only surface when one client is used by most participants. In the future, Prysmatic Labs will test on an internal testnet that is closer to the mainnet environment, or in an environment where 50% of the nodes run Prysm.
We also recommend that other clients include such an environment in their own testing, so that they too can understand the problems their client might cause if it became the majority client.
What stakers should consider
Why use the Prysm client for staking
People choose to run Prysm because from the beginning our team has focused on making it easier for them to take part in Ethereum staking. I have spoken with our users many times. Many of them choose a client not because of micro-optimizations or relatively small differences in returns compared with other clients, but because we make their experience simpler: good documentation and consistent, meaningful help for every community member. Eth2 is intimidating for newcomers, and staking is full of uncertainty and risk. Our team’s mission is to let users know that we are by their side and that they will get our support no matter how small their problem is. In particular, we have always paid attention to ordinary stakers who may not be familiar with the command line and UNIX operating systems.
In the future, you can expect the following from our team:
- We will implement the specification’s conditions more precisely, and make sure that assumptions and validity conditions are fully reviewed and questioned before any code is written.
- We do not merely want to improve this experience; we will redouble our efforts to make Prysm many times better than it is today and to make it easier for stakers who use our client to participate in the network, including improvements to the web interface.
- Prysm will redouble its research and development efforts to deliver key features and improvements ahead of the Eth1 <> Eth2 merge.
- We believe that healthy competition creates a strong incentive to make Ethereum proof of stake more accessible, and therefore more secure, as all client teams keep improving their software.
- Our team is committed to solving the problems stakers may encounter with the highest professional standards. We believe we do a good job of handling the problems we meet along the way, and we assure our community that the staker experience will be our highest priority.
- Finally, we believe there are many important features that can make Prysm a more attractive piece of software for participating in Eth2, and we will keep iterating toward that goal.
- Prysm has some advanced optimizations for validator returns that are not yet enabled by default for all stakers. We believe that once these features are released, Prysm stakers will see the highest level of returns.
Revisiting the client diversity conversation
Since the launch of Eth2, a theme we have heard again and again is client diversity. Eth2 is a distributed system in which people from all over the world participate as validators, and different people use different software to take part in the consensus of the blockchain. If one piece of software has a serious problem, the impact is smaller when the clients running the network are evenly distributed.
Leonardo Bautista-Gomez published a data analysis as early as January showing that Prysm nodes accounted for 65% of the network. This incident also showed that Prysm validators still account for the majority today.
We recommend that you look at each client objectively, at its software, its community, and its resilience, and then decide which software and which team behind it best suit your needs. If an Eth2 client lacks something that is important to you and that is why you do not choose it, we strongly recommend that you submit a feature request. Prysmatic Labs will continue to focus on helping you participate in the Ethereum network and on pushing the boundaries of blockchain software.
If you have questions about this article or want to discuss it, please join our Discord.