What can go wrong when staking on Ethereum 2.0, and how should you respond?

December 18, 2020

Ethereum 2.0's penalties depend on the overall state of the network, so validators should consider how a failure would affect other validators rather than viewing their setup in isolation.

Original title: “Eth2 Staking Series #6: Perfect Is the Enemy of the Good”
Written by: Carl Beekhuizen

The beacon chain has a variety of incentives that govern validator behavior, and all of them depend on the current state of the network. Therefore, when deciding how to secure your nodes, it is important to consider the circumstances in which other validators may be failing at the same time as you.

An active validator's balance is always either going up or going down; it never simply stays put. Minimizing risk is therefore the way to maximize returns. There are three main ways the beacon chain can reduce a validator's balance:

Penalties: given when a validator misses one of its duties (for example, because it is offline).
Inactivity leaks: given to validators that miss their duties while the network is failing to finalize, i.e. when your being offline is highly correlated with other validators being offline.
Slashing: given to validators that produce contradictory blocks or attestations, which could be used in an attack.

Note: Averaged over time, a single validator's balance may look roughly flat, but as long as it is participating, it is constantly being rewarded or penalized.

Correlation

When most of the network is running well, a single validator being offline or being penalized has little impact, so its penalties are small. Conversely, if many validators are offline at the same time, the balances of those offline validators fall much faster.

Similarly, if a large number of validators are slashed at the same time, the beacon chain cannot distinguish this from an attack, and 100% of those validators' stake is destroyed.

Because of these “anti-correlation” incentives, validators should worry more about failures that could affect many others at the same time than about isolated, individual problems.
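To make the slashing correlation concrete, here is a simplified Python model loosely based on the `process_slashings` function in the beacon chain specification. It assumes a proportional-slashing multiplier of 3 (this parameter has taken different values across spec versions) and it ignores the separate initial slashing penalty and the timing of when the penalty is applied, so treat it as an illustration rather than the exact rule.

```python
# Simplified model of the beacon chain's correlated slashing penalty.
# Assumption: PROPORTIONAL_SLASHING_MULTIPLIER = 3; the real spec value has
# changed between versions, and the initial slashing penalty is omitted here.

GWEI_PER_ETH = 10**9
EFFECTIVE_BALANCE_INCREMENT = GWEI_PER_ETH  # 1 ETH, in Gwei
PROPORTIONAL_SLASHING_MULTIPLIER = 3


def correlation_penalty(effective_balance: int,
                        total_slashed_balance: int,
                        total_active_balance: int) -> int:
    """Return the correlation penalty (in Gwei) for one slashed validator.

    The penalty scales with the share of all stake slashed in the same
    window, and is capped at the validator's whole effective balance.
    """
    adjusted = min(total_slashed_balance * PROPORTIONAL_SLASHING_MULTIPLIER,
                   total_active_balance)
    increment = EFFECTIVE_BALANCE_INCREMENT
    numerator = effective_balance // increment * adjusted
    # Integer division rounds down to whole increments, as in the spec.
    return numerator // total_active_balance * increment


if __name__ == "__main__":
    stake = 32 * GWEI_PER_ETH
    total = 300_000 * stake  # hypothetical network of 300,000 validators
    # One validator slashed alone: the correlation penalty rounds down to 0.
    print(correlation_penalty(stake, stake, total) // GWEI_PER_ETH)
    # One third of all stake slashed together: the entire 32 ETH is lost.
    print(correlation_penalty(stake, total // 3, total) // GWEI_PER_ETH)
```

Under these assumptions, one third of the stake being slashed at once is enough to wipe out each slashed validator's full balance, while an isolated slashing adds essentially nothing beyond the initial penalty.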

Causes and likelihood of failure

Let’s look at some failure scenarios, considering how many other validators are likely to be affected at the same time and how severely your validators would be penalized.

I disagree with @econoar here: these are moderate-severity problems. A home UPS and dual WAN connections address failures that are uncorrelated with other users, so they should rank much lower on your list.

🌍 Network/power failure

If you are validating from home, you are very likely to run into one of these failures at some point. Residential internet and power connections do not come with uptime guarantees. When the internet goes down or the power is cut, the outage usually affects your whole area and may last several hours.

Unless your internet or power is very unreliable, it is probably not worth paying to guard against these failures. You will incur a few hours of penalties, but since the rest of the network is running normally, your penalties will be roughly equal to the rewards you would have earned over the same period. In other words, a k-hour failure sets your validator's balance back to roughly where it was k hours before the failure, and after k more hours online, your balance is back to its pre-failure level.

Validator #12661's balance recovers at about the same rate as it fell while offline (source: Beaconcha.in)
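The arithmetic of the previous paragraph can be checked with a toy model. It assumes the hourly penalty while offline exactly equals the hourly reward while online, which is only roughly true when the rest of the network is healthy. The function name and the reward figure are hypothetical.

```python
# Toy model of a validator going offline while the rest of the network is
# healthy. Assumption: the hourly penalty while offline equals the hourly
# reward while online; real values vary with total stake and participation.

def balance_timeline(start_gwei: int, reward_per_hour_gwei: int,
                     hours_before: int, outage_hours: int,
                     hours_after: int) -> list[int]:
    """Return the balance (in Gwei) at the end of each simulated hour."""
    balance = start_gwei
    history = []
    total_hours = hours_before + outage_hours + hours_after
    for hour in range(total_hours):
        offline = hours_before <= hour < hours_before + outage_hours
        balance += -reward_per_hour_gwei if offline else reward_per_hour_gwei
        history.append(balance)
    return history


if __name__ == "__main__":
    k = 5                # a 5-hour outage
    reward = 100_000     # hypothetical reward per hour, in Gwei
    h = balance_timeline(32 * 10**9, reward, hours_before=10,
                         outage_hours=k, hours_after=k)
    before_outage = h[9]       # balance just before going offline
    after_outage = h[9 + k]    # balance when coming back online
    recovered = h[9 + 2 * k]   # balance k hours after coming back
    print(after_outage == h[9 - k])      # True: set back by k hours
    print(recovered == before_outage)    # True: recovered after k more hours
```

In this model, the net cost compared with staying online the whole time is about 2k hours of rewards: k hours of penalties plus k hours of rewards that were never earned.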

🛠 Hardware failure

Like network problems, hardware failures occur at random, and when one does, your node may be offline for several days. It is worth weighing the expected rewards over the validator's entire lifetime against the cost of spare hardware. Is the expected cost of a failure (the offline penalties multiplied by the probability of failure) greater than the cost of spare hardware?

Personally, I find the chance of failure low and the cost of spare hardware high, so it is not worth it for me. But then again, I am not a whale 🐳. As with every failure scenario, you need to evaluate this against your own situation.
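The comparison posed in that question amounts to a one-line calculation. The following back-of-the-envelope sketch uses hypothetical placeholder numbers (in ETH) for the failure probability, downtime, per-validator daily loss and spare-hardware cost; substitute your own estimates.

```python
# Back-of-the-envelope check: buy spare hardware only if the expected loss
# from a failure exceeds what the spare costs. All figures are hypothetical
# placeholders expressed in ETH.

def expected_failure_loss(prob_failure: float, days_offline: float,
                          loss_per_validator_day: float,
                          num_validators: int) -> float:
    """Probability of failure times the ETH lost while the node is down."""
    return prob_failure * days_offline * loss_per_validator_day * num_validators


def spare_hardware_worth_it(prob_failure: float, days_offline: float,
                            loss_per_validator_day: float,
                            num_validators: int, spare_cost: float) -> bool:
    """True if the expected loss is larger than the cost of the spare."""
    loss = expected_failure_loss(prob_failure, days_offline,
                                 loss_per_validator_day, num_validators)
    return loss > spare_cost


if __name__ == "__main__":
    # Same hypothetical failure odds, very different stake sizes.
    print(spare_hardware_worth_it(0.1, 5, 0.01, 1, spare_cost=0.5))    # False
    print(spare_hardware_worth_it(0.1, 5, 0.01, 500, spare_cost=0.5))  # True
```

Because the expected loss scales with the number of validators on the machine, the same failure odds can lead a solo staker and a whale to opposite conclusions.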

☁️ Cloud service failure

Perhaps you choose to use a cloud provider to avoid hardware and network failures altogether. By doing so, you introduce the correlation risks described above. How many other validators use the same cloud provider as you?

A week before genesis, Amazon's AWS suffered a prolonged outage that affected a large part of the internet. If a similar event happened now and took a large number of validators offline at the same time, an inactivity leak could be triggered.

Even worse, if a cloud provider were to duplicate the virtual machine running your node and accidentally leave both the old and the new node running at the same time, you could be slashed (and the penalties would be especially large if this also happened to many other validators).

If you insist on relying on a cloud provider, consider switching to a smaller one. Doing so may end up reducing your losses.
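Correlated downtime hurts more because of how finality and the inactivity leak interact. The sketch below is a simplified model, not the exact spec logic. It assumes the chain can only justify checkpoints when at least 2/3 of the active stake attests, and that an offline validator loses its effective balance times the number of epochs since finality, divided by a quotient, in every epoch. The quotient is taken as 2**26, the phase 0 value (later upgrades changed it). The grace period before the leak starts and the ordinary attestation penalties are omitted.

```python
# Simplified inactivity-leak model (not the exact spec logic).
# Assumptions: finality needs >= 2/3 of active stake attesting, and an offline
# validator loses effective_balance * epochs_since_finality / QUOTIENT per epoch.

INACTIVITY_PENALTY_QUOTIENT = 2**26  # phase 0 value; later forks changed it
GWEI_PER_ETH = 10**9
EPOCHS_PER_DAY = 225  # 32 slots * 12 s = 6.4-minute epochs


def can_finalize(attesting_balance: int, total_active_balance: int) -> bool:
    """True if enough stake is attesting for checkpoints to be justified."""
    return attesting_balance * 3 >= total_active_balance * 2


def cumulative_leak(effective_balance: int, epochs_without_finality: int) -> int:
    """Total Gwei an offline validator leaks over a non-finalizing period."""
    return sum(effective_balance * delay // INACTIVITY_PENALTY_QUOTIENT
               for delay in range(1, epochs_without_finality + 1))


if __name__ == "__main__":
    print(can_finalize(70, 100))  # True: 70% of stake online
    print(can_finalize(60, 100))  # False: 40% offline, e.g. a cloud outage
    stake = 32 * GWEI_PER_ETH
    for days in (1, 2, 4):
        leaked = cumulative_leak(stake, days * EPOCHS_PER_DAY)
        print(days, leaked / GWEI_PER_ETH)  # ETH leaked after `days` days
```

Because the per-epoch penalty grows with the time since finality, doubling the length of a non-finalizing period roughly quadruples the leak. A long, widespread outage at a large provider is therefore far more costly than an isolated one.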
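Two copies of the same validator are dangerous because they will eventually sign conflicting messages. The beacon chain specification defines which pairs of attestations are slashable in a function named `is_slashable_attestation_data`. The sketch below reproduces that check in plain Python. The `Checkpoint` and `AttestationData` classes here are hypothetical minimal stand-ins, not the spec's SSZ containers, and the roots are placeholder strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    root: str


@dataclass(frozen=True)
class AttestationData:
    slot: int
    beacon_block_root: str
    source: Checkpoint
    target: Checkpoint


def is_slashable_attestation_data(data_1: AttestationData,
                                  data_2: AttestationData) -> bool:
    """Mirror of the spec check: a double vote or a surround vote is slashable."""
    # Double vote: two different attestations for the same target epoch.
    double_vote = data_1 != data_2 and data_1.target.epoch == data_2.target.epoch
    # Surround vote: data_1's source-target span strictly surrounds data_2's.
    surround_vote = (data_1.source.epoch < data_2.source.epoch
                     and data_2.target.epoch < data_1.target.epoch)
    return double_vote or surround_vote


if __name__ == "__main__":
    # Old and new VMs both attest for target epoch 10 but see different heads.
    old_vm = AttestationData(320, "0xaaa", Checkpoint(9, "0x01"), Checkpoint(10, "0x02"))
    new_vm = AttestationData(321, "0xbbb", Checkpoint(9, "0x01"), Checkpoint(10, "0x02"))
    print(is_slashable_attestation_data(old_vm, new_vm))  # True: double vote
```

Nothing in this check cares whether the conflict was malicious. Two honest nodes sharing one key are slashed exactly as an attacker would be.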

🥩 Staking services

There are currently several staking services on mainnet with varying degrees of decentralization, but entrusting your ETH to one of them increases your correlation risk. These services are undoubtedly an essential part of the eth2 ecosystem, especially for users with less than 32 ETH or without the technical knowledge needed to stake. But they are built by humans, so they will have flaws.

If staking pools eventually grow as large as eth1 mining pools, a single bug could expose their users to mass slashing or inactivity penalties.

🔗 Infura failure

Last month, Infura was down for six hours, causing outages across the Ethereum ecosystem. It is easy to see how a similar event poses a correlated risk for eth2 validators.

In addition, third-party eth1 API providers necessarily rate-limit calls to their services. In the past, this has prevented validators from producing valid blocks (on the Medalla testnet).

The best solution is to run your own eth1 node: you will not hit rate limits, you reduce your correlation risk, and you help decentralize the network as a whole.

Eth2 clients have also started adding support for specifying multiple eth1 nodes, so the client can easily fall back to a backup endpoint if the primary one fails (Lighthouse: --eth1-endpoints, Prysm: PR#8062; Nimbus and Teku will likely add support later).

I highly recommend adding low-cost or free backup API options (EthereumNodes.com tracks free and paid API endpoints and their current status). This is worthwhile whether or not you run your own eth1 node.
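Conceptually, configuring several eth1 endpoints gives the client an ordered list to fall back through. The sketch below illustrates only that idea and is not how Lighthouse, Prysm or any other client implements it. The endpoint URLs and the `probe` health-check callable are hypothetical; a real check might, for example, query the node's sync status over JSON-RPC.

```python
from typing import Callable, Optional, Sequence


def pick_eth1_endpoint(endpoints: Sequence[str],
                       probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint, in priority order, whose probe succeeds."""
    for url in endpoints:
        try:
            if probe(url):
                return url
        except Exception:
            # Treat connection errors, timeouts, etc. as an unhealthy endpoint.
            continue
    return None  # no usable endpoint: the client cannot follow the eth1 chain


if __name__ == "__main__":
    # Hypothetical endpoints: your own node first, then two third-party APIs.
    endpoints = ["http://localhost:8545",
                 "https://provider-a.example/v3/KEY",
                 "https://provider-b.example/KEY"]

    def fake_probe(url: str) -> bool:
        if "localhost" in url:
            raise ConnectionError("own node is offline")
        if "provider-a" in url:
            return False  # e.g. rate-limited
        return True

    print(pick_eth1_endpoint(endpoints, fake_probe))  # the provider-b URL
```

Putting your own node first and third-party APIs after it keeps the rate limits and correlation risk of shared providers out of the picture except when they are actually needed.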

🦏 An eth2 client fails

Despite all the code review, audits and testing, eth2 clients have bugs hiding somewhere. Most of them are minor and will be caught before they reach production, but there is always a chance that the client you choose goes offline or causes you to be slashed. If this happens, you will not want to be running a client used by more than 1/3 of the network.

You have to make a trade-off between the client you think is best and how popular it is. Consider reading through the documentation of another client so that, if something goes wrong with your node, you know how to install and configure a different client.

If you stake a large amount of ETH, it is probably worth running several clients, each with a portion of your ETH, to avoid putting all your eggs in one basket. Vouch provides infrastructure for multi-node staking, and Secret Shared Validators are also making rapid progress.
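The 1/3 threshold can be checked directly: if a client's validators all drop out, can the rest still reach the 2/3 needed for finality? The sketch below assumes a failing client takes all of its validators offline and that the shares cover the whole active stake. The client names and share figures are made up for illustration and are not real client-diversity data.

```python
from fractions import Fraction

# Finality needs at least 2/3 of the stake attesting, so a client holding
# more than 1/3 of the stake can halt finality if its validators go offline.
TWO_THIRDS = Fraction(2, 3)


def finality_survives(shares: dict[str, Fraction], failing_client: str) -> bool:
    """True if the chain can still finalize when one client's validators drop out."""
    online = 1 - shares.get(failing_client, Fraction(0))
    return online >= TWO_THIRDS


def clients_that_could_halt_finality(shares: dict[str, Fraction]) -> list[str]:
    """Names of clients whose failure alone would stop finality."""
    return sorted(name for name in shares if not finality_survives(shares, name))


if __name__ == "__main__":
    # Illustrative shares only, not real measurements.
    shares = {"client_a": Fraction(55, 100), "client_b": Fraction(25, 100),
              "client_c": Fraction(15, 100), "client_d": Fraction(5, 100)}
    print(clients_that_could_halt_finality(shares))  # ['client_a']
```

Exact fractions are used so that a share of exactly 1/3 is handled correctly: losing precisely one third of the stake still leaves the 2/3 needed.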

🦢 Black swans

Of course, there are also many unlikely, unforeseeable but high-impact risks that have nothing to do with your staking setup or decisions. Examples include hardware-level vulnerabilities such as Spectre and Meltdown, and kernel bugs such as BleedingTooth, which hint at the dangers lurking throughout the entire hardware and software stack. By definition, we cannot fully predict or avoid these problems; we can only respond after they occur.

What should I worry about?

Ultimately, it comes down to calculating the expected cost E(X) of a given failure: the probability of the event multiplied by its cost. Because correlation can have a considerable effect on penalties, it is important to consider these failures in the context of the rest of the eth2 network. Compare the expected cost of a failure with the cost of mitigating it, and you will have a reasonable answer as to whether the mitigation is worth it.

No one knows all the ways a node can fail, nor how likely each failure is, but if individual stakers each estimate the probabilities of the failures they face and mitigate the biggest risks, the “wisdom of the crowd” will prevail. Moreover, because each validator faces different risks and estimates them differently, the risks you overlooked are likely to be accounted for by others, which reduces correlation. The power of decentralization!
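One way to write this rule down is the following formalization. It is a sketch, not a formula from the original post. For each failure mode i, let p_i be its probability and C_i its cost, where the cost depends on the fraction f_i of the network failing at the same time. The dependence reflects the anti-correlation penalties, so C_i grows as f_i grows. Let M_i be the cost of a mitigation that is assumed to prevent failure i entirely.

```latex
E[X] = \sum_i p_i \, C_i(f_i),
\qquad
\text{mitigate failure } i \text{ if } \; p_i \, C_i(f_i) > M_i
```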

📕 Don’t panic

Finally, if something goes wrong with your node, don’t panic! Even during an inactivity leak, penalties are small over short periods. Take a moment to work out what happened and why, then make a plan to fix the problem. Take a deep breath before you start! Five extra minutes of thinking is better than a hasty, wrong decision that gets you penalized.

Top priority: 🚨 Never run two nodes with the same validator keys! 🚨

Penalties incurred by running more than one validator with the same keys (source: Beaconcha.in)

Thanks to Danny Ryan, Joseph Schweitzer and Sacha Yves Saint-Leger for reviewing this article

Source link: blog.ethereum.org