Post-mortem Report for Mainnet Upgrade v8.1.0
Background: The Zilliqa protocol underwent a mainnet upgrade on Aug 30, 2021. This upgrade introduced important features to enable ZilBridge such as:
- Merkle Patricia tree data structure for contract storage
- Contract proof API needed for PolyNet relayers
It also included several stability, maintenance and bug fixes such as:
- Several memory clean up improvements
- Improve network stability for JSON RPC API service
- Improve PoW submission handling
- Node syncing improvement
The upgrade took a few hours. Once it was completed, miners started to join the network and process transactions, and some exchanges, such as Binance, reopened deposits and withdrawals.
Issue reported: At this point, one of the community developers from the Duck team reached out with an observation about their NFT token minting contract. They noticed that two transactions, one with a TXID starting with 0xef and another starting with 0x02, had minted the same token_id 1351. The first transaction was made before the upgrade started, and the second after the upgrade.
For some reason, the transaction made before the upgrade, and therefore the state change it caused, was ignored. The network after the upgrade was not aware that token_id 1351 had already been sold, so it was resold to a new buyer.
Early investigation and first action taken: As soon as we were informed about this, we started to look into the code of the Duck NFT minting contract. The contract code looked correct. After investigating further, we suspected that the mainnet had started with a wrong global state after the upgrade and that, as a result, the transaction made before the upgrade was treated as non-existent from the network’s perspective, even though it did happen. Since the global state that the nodes started with post-upgrade was somehow corrupted, we suspected the impact to extend beyond the Duck NFT contract.
This took us a few hours, and by that time 4,478 transactions had been processed by the network post-upgrade. Many of these transactions had been processed on top of the corrupted initial global state. To limit the impact, we decided to disable the RPC endpoints that allow users to send transactions. By doing so, we prevented any further transactions from being processed under a wrong global state.
Investigating the root cause: After making sure that we had restricted the impact by disabling transactions, we started to investigate a bit deeper into what had gone wrong. We found that when the network was upgraded to v8.1.0 at block number 1,394,088, the final global state after the upgrade was not the same as the global state before the upgrade had started. In other words, the global state after block 1,394,088 was not the same as the state before block 1,394,089.
The main reason behind this was that the state changes for the last 88 blocks (i.e., all the blocks processed in the last, incomplete epoch; each epoch consists of 100 blocks) were not fully taken into account in the global state post-upgrade. The state changes for these 88 blocks were stored in what’s called a state delta, as these 88 blocks had not yet formed a full 100-block epoch. Think of a state delta as a temporary change to the global state that takes effect at the end of the epoch. Nodes keep these state deltas separately until the epoch is complete. At the end of the epoch (i.e., every 100 blocks), the state delta is merged into the global state.
To summarise, the issue was that while all the merged state deltas were considered part of the global state (i.e., all transactions until block 1,394,000), transactions made in the last 88 blocks such as the DUCK NFT minting transaction were ignored. As a result, even though those transactions were processed by the network, the state after the upgrade did not consider the state change that they created. It was as if those transactions never occurred.
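To make the mechanism concrete, here is a minimal sketch of a merged global state plus a pending state delta. It is an illustration only and not the Zilliqa node code: `ToyNode`, its methods, the token keys, the buyer names and the block chosen for the mint are all hypothetical. Only the 100-block epoch length and the upgrade block number come from this report.

```python
EPOCH_LENGTH = 100  # blocks per epoch, as stated in the report


class ToyNode:
    """Hypothetical node model: a merged global state plus a pending state delta."""

    def __init__(self):
        self.global_state = {}  # changes merged at epoch boundaries
        self.state_delta = {}   # changes from the current, unfinished epoch

    def apply_block(self, block_number, changes):
        # Every block's changes first land in the pending delta.
        self.state_delta.update(changes)
        # At the end of an epoch the delta is merged into the global state.
        if block_number % EPOCH_LENGTH == 0:
            self.global_state.update(self.state_delta)
            self.state_delta.clear()

    def full_state(self):
        # What users actually observe: merged state plus pending delta.
        return {**self.global_state, **self.state_delta}


UPGRADE_BLOCK = 1_394_088
pending_blocks = UPGRADE_BLOCK % EPOCH_LENGTH       # blocks not yet merged
last_merged_block = UPGRADE_BLOCK - pending_blocks  # last epoch boundary

node = ToyNode()
node.apply_block(last_merged_block, {"token_1350": "earlier_buyer"})
# A mint inside the unfinished epoch (the exact block is made up).
node.apply_block(last_merged_block + 50, {"token_1351": "first_buyer"})

faulty_start = dict(node.global_state)  # pending delta dropped, as in the incident
correct_start = node.full_state()       # pending delta included
```

In this toy model, `faulty_start` has no record of token 1351 while `correct_start` does, which mirrors how the post-upgrade network could sell the same token a second time.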
Proposed fix: Once we had identified the root cause, we knew that the fix would involve the following steps:
- Roll back the network to block number 1,394,088: As the transactions made post upgrade were effectuated on an incorrect global state, we had to first nullify all the state changes that happened post upgrade.
- Reconstruct the global state at block number 1,394,088: We had to reconstruct the correct global state by taking into account the state deltas for the last 88 blocks.
- Replay as many as possible of the 4,478 transactions: At this stage, we would be ready to recreate the effect of the transactions that happened post-upgrade. As users had sent their transactions, these were stored with the nodes and hence could be replayed. While all of those transactions could be resubmitted, it was apparent that some of them might fail this time. For example, in the DUCK NFT minting contract, the transaction from the second user who tried to buy token_id 1351 would fail.
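The three steps above can be sketched with a toy minting contract. This is a simplified illustration under stated assumptions, not the scripts used for the actual fix: `mint`, `replay`, the sender names and token 1352 are hypothetical, and state is modelled as a plain dictionary from token id to owner.

```python
def mint(state, sender, token_id):
    """Toy mint rule: a token can only be minted once."""
    if token_id in state:
        return False  # already owned, so the transaction fails
    state[token_id] = sender
    return True


def replay(state, transactions):
    """Re-apply transactions in order and report which succeed or fail."""
    succeeded, failed = [], []
    for sender, token_id in transactions:
        if mint(state, sender, token_id):
            succeeded.append((sender, token_id))
        else:
            failed.append((sender, token_id))
    return succeeded, failed


# State merged at the last epoch boundary, and the unmerged state delta.
merged_state = {1350: "earlier_buyer"}
state_delta = {1351: "first_buyer"}

# Transactions sent after the upgrade (hypothetical examples).
post_upgrade_txs = [("second_buyer", 1351), ("third_buyer", 1352)]

# Step 1: roll back by discarding all post-upgrade state changes,
# which here means starting again from the pre-upgrade data.
# Step 2: reconstruct the correct state by including the state delta.
reconstructed = {**merged_state, **state_delta}

# Step 3: replay the post-upgrade transactions on the corrected state.
succeeded, failed = replay(reconstructed, post_upgrade_txs)
```

With the state delta included, the second mint of token 1351 ends up in `failed`, while the unrelated mint of token 1352 still succeeds.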
Impact of the fix: While we are working on implementing the fix, we would like to highlight that the rollback will have an impact on some users, exchanges, miners and dapps. We are looking at each of the 4,478 transactions affected by the fix. These transactions are of varied types: 1,230 were to smart contracts (staking, ZILSwap, token transfers, etc.), and the remaining 3,248 were simple ZIL transfers, many of which seem to be mining payouts.
Rolling back these transactions and then replaying them may not yield the same result as explained earlier with the DUCK NFT example. A replay of that transaction is expected to fail and therefore the user would have paid gas fees for a failed transaction.
Another example is ZILSwap transactions. The replay of a ZILSwap trade may fail due to slippage, as the liquidity after the rollback could be very different, or because the expiration block that each trade includes has already passed.
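The two failure modes can be illustrated with a small sketch. It assumes a fee-free constant-product pool, which is a simplification and may not match ZILSwap's actual pricing. The function names, parameters, reserve figures and block numbers are all hypothetical.

```python
def swap_output(reserve_in, reserve_out, amount_in):
    """Output of a fee-free constant-product swap (an assumed pricing model)."""
    return (amount_in * reserve_out) // (reserve_in + amount_in)


def try_swap(reserve_in, reserve_out, amount_in, min_out, deadline_block, current_block):
    """Return 'ok', 'expired' or 'slippage' for a trade with user-set limits."""
    if current_block > deadline_block:
        return "expired"  # the trade's expiration block has passed
    if swap_output(reserve_in, reserve_out, amount_in) < min_out:
        return "slippage"  # the pool now pays less than the user accepted
    return "ok"


# Original execution: the pool reserves give the user at least min_out.
original = try_swap(1000, 1000, 100, min_out=89, deadline_block=120, current_block=110)

# Replay with different liquidity: the same trade now breaches min_out.
replay_liquidity = try_swap(2000, 1000, 100, min_out=89, deadline_block=120, current_block=110)

# Replay after the expiration block: the trade is rejected outright.
replay_late = try_swap(1000, 1000, 100, min_out=89, deadline_block=120, current_block=130)
```

The trade and its limits are identical in all three calls. Only the pool reserves or the current block differ, which is why a replayed trade can fail even though the original succeeded.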
As for miners, who have been mining the network, the rollback would cancel any mining reward that they may have received.
Compensating users: We are looking into ways to compensate all users, miners and other parties impacted by the fix by airdropping the appropriate number of ZIL to cover their losses. We sincerely apologise for the inconvenience that this may have caused to users, dapp developers, miners and any other relevant stakeholders. More information on the compensation will be provided in a separate post.
Status of fix: We are in the process of implementing all three steps needed for the fix and have prepared all the scripts to replay the transactions. The rollback (Step 1) and the reconstruction of the correct global state (Step 2) are done. We are now replaying the transactions and checking the impact of each one (Step 3), which is still in progress.
We sincerely apologise for the inconvenience and we are all hands on deck to enable transaction processing as soon as possible. Stay tuned for the compensation post.
The Zilliqa Team