Reflecting on the worst Ethereum accident since The DAO: Is Infura a “single point of failure”?


Opinions are divided on whether Infura has become a single point of failure, but the solution to the underlying problem, statelessness, still depends on the wisdom of researchers and developers.

Original title: “View | How should we view the collapse of Infura’s service and its impact?”
Written by: A Jian

On the afternoon of November 11, 2020, Beijing time, Infura, a well-known node service in the Ethereum community, suffered an API service failure. As a result, many services built on Infura went down or displayed incorrect data in their front ends.

Infura itself can be understood as a public Ethereum node. The node receives requests and provides certain services in return, such as forwarding transactions, checking whether a given transaction is on the chain, or reporting the state of an account. In fact, anyone who deploys an Ethereum node can provide the same services as Infura. What sets Infura apart is that most of its services are free, so many services (including exchanges) have chosen to rely on Infura to report the state of the Ethereum blockchain to them, sparing themselves the trouble of deploying their own nodes.
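The three kinds of request described above correspond to standard Ethereum JSON-RPC methods. The following is a minimal sketch that only builds the request bodies and sends nothing over the network. The transaction bytes, transaction hash, and account address are made-up placeholders, and the helper name `rpc_request` is an assumption for illustration.

```python
import json
from itertools import count

_ids = count(1)  # simple counter for JSON-RPC request ids


def rpc_request(method, params):
    """Build a JSON-RPC 2.0 request body for an Ethereum node."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}


# Hypothetical placeholder values, for illustration only.
SIGNED_TX = "0x" + "00" * 4
TX_HASH = "0x" + "11" * 32
ACCOUNT = "0x" + "22" * 20

requests_to_send = [
    # Forward (broadcast) an already-signed transaction.
    rpc_request("eth_sendRawTransaction", [SIGNED_TX]),
    # Check whether a transaction has been included in a block.
    rpc_request("eth_getTransactionReceipt", [TX_HASH]),
    # Read one piece of account state: the balance at the latest block.
    rpc_request("eth_getBalance", [ACCOUNT, "latest"]),
]

for body in requests_to_send:
    print(json.dumps(body))
```

A self-hosted node and a hosted service accept the same request bodies; only the URL they are posted to differs.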

It is precisely because of this that a failure at Infura can, in theory, spread widely. As the incident unfolded, some people even claimed that “Ethereum will fork (or is undergoing a fork).” Their evidence was that two block explorers (Etherscan and Blockchair) displayed two different blocks at the same block height (although the blocks after that height were the same on both explorers).

But obviously, Ethereum did not fork at all. The subsequent blocks displayed by the two block explorers were the same, which means that the miners producing blocks (at least most of them) did not continue mining on two different parent blocks, and that no blocks were being mutually rejected. In theory, a fork is possible only if block-producing nodes follow different consensus rules (and therefore reject each other’s blocks) and each side holds a meaningful share of the hash power.
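The reasoning above can be sketched as a comparison of what two explorers display. This is only a heuristic over displayed data, not a consensus check, and the heights and hashes below are invented for illustration. The function names are assumptions.

```python
def diverging_heights(view_a, view_b):
    """Heights present in both views where the displayed block hashes differ."""
    shared = sorted(set(view_a) & set(view_b))
    return [h for h in shared if view_a[h] != view_b[h]]


def looks_like_persistent_split(view_a, view_b):
    """True only if the two views still disagree at the highest shared height."""
    shared = set(view_a) & set(view_b)
    if not shared:
        return False
    tip = max(shared)
    return view_a[tip] != view_b[tip]


# Invented data: the views differ at height 101 but agree again at 102.
explorer_a = {100: "0xaa", 101: "0xb1", 102: "0xcc"}
explorer_b = {100: "0xaa", 101: "0xb2", 102: "0xcc"}

print(diverging_heights(explorer_a, explorer_b))            # [101]
print(looks_like_persistent_split(explorer_a, explorer_b))  # False
```

A disagreement at one height that does not persist to the tip is what the two explorers showed in the situation described here.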

In fact, people quickly discovered the cause: Infura was not running the latest version of the Geth client, and certain unusual transactions triggered a bug in the older version, causing it to crash. The same was true of Blockchair. Soon people were urging everyone to upgrade their Geth clients as soon as possible.

By 18:00 Beijing time on the 11th, Nikita Zhavoronkov (@nikzh) of the Blockchair team had posted a tweet explaining the cause and effect of the incident:

  1. An Ethereum developer made a code change that led to a split of the Ethereum blockchain that day, starting at block height 11234873;
  2. Service providers that had not updated their clients, including Blockchair and Infura, suffered as a result and were left on a minority chain (which produced 30 blocks within 2 hours);
  3. Technically, this means that an “unannounced hard fork” occurred;
  4. The fix is to upgrade the geth client and run debug.setHead(11234872).
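The rewind target in step 4 is simply the block just before the split height from step 1. As a minimal sketch, assuming the node exposes Geth's `debug` namespace over JSON-RPC (where the console call `debug.setHead` corresponds to the method `debug_setHead` with a hex-encoded block number), the request body could be built like this. The helper name `set_head_request` is hypothetical, and nothing is sent to a node.

```python
import json

SPLIT_HEIGHT = 11234873           # first diverging block, per the tweet
REWIND_TARGET = SPLIT_HEIGHT - 1  # 11234872, the last block both chains share


def set_head_request(block_number, request_id=1):
    """Build a debug_setHead request body; the block number is hex-encoded."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "debug_setHead",
        "params": [hex(block_number)],
    }


print(json.dumps(set_head_request(REWIND_TARGET)))
```

Rewinding only helps after the client has been upgraded; an old client would re-import the same blocks and hit the same bug again.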

He also said that this incident should not be underestimated and should be considered the most serious accident on the Ethereum blockchain since The DAO incident.

This is odd: why would a bug make older versions of the software crash at a particular moment while the current version keeps running? Does it mean that different versions of the Geth client actually follow different consensus rules, that is, that a non-backward-compatible change to the consensus rules (a “hard fork”) took place at some point? In addition, a single Infura outage caused widespread service errors. Does that mean Infura has become a “single point of failure”?

The cause

Péter Szilágyi (@peter_szilagyi), the leader of the Geth client team, responded to both questions.

  • Technically speaking, it can indeed be said that an “undisclosed hard fork” occurred, but only because the developers fixed a bug that had lain dormant for more than two years. They were worried that disclosing the bug publicly would expose Ethereum to attack, so they chose to fix it silently.
  • People should not look down on Infura for not running the latest Geth client. From an operator’s point of view, it is rational not to chase the latest version of the software. As for those who rely on Infura’s service, they handed over this responsibility themselves; nobody banned them from running their own nodes, so there is nothing to complain about.

Peter’s response drew mixed reactions. A member of the Monero community said that in 2017 they too chose to fix a bug silently because of the same concerns. Others think that fixing silently was the right choice, but that the large infrastructure providers should at least have been notified; had they been contacted, the damage caused by this bug could have been greatly reduced.

At 5:34 am Beijing time on the 12th, Peter released the “Aftermath Report on the Destruction Caused by the Geth v1.9.17 Client”, which located the source of the problem: Geth v1.9.7, released on November 7, 2019, implemented EIP-211 incorrectly. John Youngseok Yang reported the issue on July 15, 2020, and the Geth team fixed it in v1.9.17, released on July 20. The fix made the Geth client consistent with other Ethereum clients (such as Besu and Nethermind) when executing transactions that touch the relevant rules, but it also made v1.9.17 inconsistent with earlier versions of Geth.
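Why a bug fix in transaction execution behaves like a consensus change can be shown with a toy model. The sketch below is not the actual EIP-211 bug: the “rare case” rule, the function names, and the state layout are all made up for illustration. It only shows that two versions agree on ordinary transactions and compute different resulting states once the rare case appears, at which point each rejects the other’s blocks.

```python
import hashlib


def state_root(state):
    """Toy stand-in for a state root: a hash of the sorted key/value pairs."""
    encoded = repr(sorted(state.items())).encode()
    return hashlib.sha256(encoded).hexdigest()


def execute_old(state, tx):
    """Hypothetical old client: mishandles one rare kind of transaction."""
    new_state = dict(state)
    value = tx["value"]
    if tx.get("rare_case"):
        value += 1  # made-up bug, triggered only by the rare case
    new_state[tx["to"]] = new_state.get(tx["to"], 0) + value
    return new_state


def execute_fixed(state, tx):
    """Hypothetical fixed client: treats every transaction the same way."""
    new_state = dict(state)
    new_state[tx["to"]] = new_state.get(tx["to"], 0) + tx["value"]
    return new_state


genesis = {"alice": 10, "bob": 5}
ordinary_tx = {"to": "bob", "value": 3}
rare_tx = {"to": "bob", "value": 3, "rare_case": True}

same = state_root(execute_old(genesis, ordinary_tx)) == state_root(execute_fixed(genesis, ordinary_tx))
split = state_root(execute_old(genesis, rare_tx)) != state_root(execute_fixed(genesis, rare_tx))
print(same, split)  # True True
```

As long as nobody sends the rare transaction, the two versions are indistinguishable, which is how such a difference can stay unnoticed for months.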

As Peter pointed out, none of this was an attempt to introduce a consensus rule that the Ethereum community did not know about or did not agree with. A bug had been written, so it had to be fixed. Unless writing a bug counts as a “hard fork,” there is no reason to call fixing one a “hard fork” (Nikita clearly disagrees; he said that a hard fork happened twice, not once).

Secondly, deciding how to release such a fix is not simple. Coordinating an Ethereum hard fork takes a long time. If a seriously dangerous bug is disclosed, it is hard to guarantee that nobody will try to attack while nodes are still upgrading. As a client developer, Peter weighs the security of the Ethereum network as a whole above the security of any single service. Moreover, the team does not fix every bug silently; many are fixed in public.

At 7:11 am on the 12th, Jing (@jinglanW) of the Optimism team disclosed more information. Six months earlier, the team had copied the Geth code base in order to research and develop the Optimistic Virtual Machine. In the process, they found a mysterious bug and fixed it, but they were never able to locate its source. They had always assumed that the bug might be related to the team’s own customizations, but on the 11th they began to suspect that it came from the old version of the geth client rather than from their changes. So they looked at the node distribution shown on ethernodes.org (and found that most nodes had been upgraded), and decided to test the bug on mainnet. What happened next is the incident described above.

So, in fact, the Optimism team discovered a bug and rashly decided to test on mainnet whether it still existed. Combined with the Geth team’s earlier decision to fix the bug silently, this caused nodes that had not been upgraded in time to fail.

How should we understand and treat this matter?

As for the cause, the client team chose to silently fix a bug that had lain dormant for a long time. Although many people think that the geth team could have reduced the damage by contacting infrastructure providers, I still believe that we should give client developers more trust and respect. I believe the Geth client team had reasons for doing this. They know that most nodes run their software, and they also took into account how long the bug had been dormant, so they chose to fix it silently. With hindsight, of course, it would have been better to notify the major infrastructure providers in advance, and the damage would have been smaller. But is it reasonable to be so picky? Why don’t services that rely on Infura assume that Infura might crash?

I admit that I am not being even-handed here, but plenty of people have already made the even-handed points. I just want to express my respect for the geth client team. I am willing to give them the benefit of the doubt because they have shown plenty of proof of work in the past. They deserve everyone’s respect.

There is certainly room for improvement in how silent fixes are carried out, and we should learn from other communities, including Monero and Bitcoin. But merely condemning the geth team, or even speculating about them with conspiracy theories, would be an even greater injustice.

As for “whether Infura has become a single point of failure,” there is a simple answer and a complicated one. The simple answer is no: as Peter said, nobody has ever forbidden you from deploying your own nodes; many providers simply choose to outsource. Infura is not a single point that everything must pass through by design. For various reasons, it has simply become the largest node service provider.
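Because a hosted service and a self-run node answer the same requests, a service can be written so that no one provider is a mandatory path. The following is a minimal sketch of that idea under stated assumptions: the providers are stub functions rather than real network clients, and the names `call_with_fallback`, `hosted_service`, and `own_node` are hypothetical.

```python
def call_with_fallback(providers, method, params):
    """Try each (name, send) provider in order; return the first success."""
    errors = []
    for name, send in providers:
        try:
            return name, send(method, params)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")


def hosted_service(method, params):
    """Stub for a hosted node service that is currently down."""
    raise ConnectionError("service unavailable")


def own_node(method, params):
    """Stub for a self-run node; returns a made-up hex block number."""
    return "0x10"


providers = [("hosted_service", hosted_service), ("own_node", own_node)]
result = call_with_fallback(providers, "eth_blockNumber", [])
print(result)  # ('own_node', '0x10')
```

Falling back only covers outages. It does not detect a provider that answers with data from the wrong chain, which would require comparing answers from several sources.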

The complicated answer is that Ethereum nodes consume a relatively large amount of resources, and this is indeed an underestimated problem. The Ethereum protocol requires every node to fully execute the transactions contained in each block. Executing a transaction means reading data from the state and writing the results back when execution completes, a process that involves a large number of random disk reads and writes. Moreover, as the state grows, the demands on read and write performance grow with it. The “state bloat” problem that was hotly debated in previous years has still not been solved on today’s Ethereum. The threshold for running a node is high, so the number of nodes is naturally small. Seen charitably, if the threshold for running an Ethereum node were lowered, I believe more people would run their own nodes (after all, it is safer) instead of relying on Infura.

But solving this problem also depends on the wisdom of Ethereum client developers and researchers. Statelessness can be regarded as the ultimate solution to state bloat. Until that ultimate solution becomes feasible, we still need client developers to give us more efficient clients.

So something did happen, and it did expose some problems and point out a direction for us to learn and improve. But solving these problems requires understanding and respect for the different groups in the community. Stay away from conspiracy theories and from malicious, clever-sounding ridicule; work out the source of the problem, and think about its essence and how to improve. What we do determines who we are.

Source link: mp.weixin.qq.com