Post-incident report on the staking issue witnessed on October 16, 2020

The launch of seed node staking had been much anticipated by our global community of over 150,000 Zilliqans. It was therefore no surprise to see tremendous participation following the Phase 1 launch of non-custodial seed node staking on October 14, 2020. Within a matter of hours, over a billion ZIL were staked in the contract.

Seed Node Staking Background

For those who are unfamiliar with the underlying concept, seed nodes are nodes (comparable to Infura on Ethereum) that store the transaction history of the Zilliqa blockchain. They also serve as an entry point to the network, allowing users and dApps to send transactions. The goal of the seed node staking project was to open up an incentivised network of nodes providing this service.

The seed node staking design was simple. Seed nodes would get 40% of the block rewards as an incentive for their service. In order to become a seed node, the node operator must have at least 10 million ZIL staked with it. The reward would be distributed among the operators in proportion to their stake and the quality of their service (to be determined by a verifier). Any stake delegated by a token holder to a given operator would count towards the 10 million ZIL. In other words, the operators were allowed to seek stake from ZIL holders.

The seed node staking programme was launched in two phases. Phase 0 was custodial: delegators had to send their stake to an address controlled by the operator. Phase 1 introduced a non-custodial staking mechanism via a smart contract, so delegators could deposit their stake directly in the contract. Phase 1 staking was launched on October 14, 2020.

October 14–16, 2020

After the launch on October 14, 2020, the staking contract distributed its first reward to all the delegators on October 15, 2020. Everything worked as expected, and the community was elated to see the first returns on their stake. It was a great feeling to see the community rally around staking and particularly around gZIL, the governance tokens issued alongside ZIL rewards.

However, the distribution of rewards for the second reward cycle on October 16, 2020 threw an error. All the funds were safe in the contract, but the reward distribution was not processed. The team quickly assembled to identify the issue. A preliminary investigation indicated that the reward distribution had failed because of an integer overflow.

There were two options available:

1) Go with a quick intermediate fix that would solve the issue at hand and with minimal disruption.

2) Go with a proper long-term fix that would involve a few days of disruption.

We decided to go with 1) and agreed to look into the issue more deeply to understand the disruption that a long-term fix would cause, and then follow 1) with 2). Within the next hour or so, we announced the issue across our social media channels and pushed a quick immediate fix as per 1) to resume the staking programme.

As there was no risk of any loss of funds, we decided to take our time to better understand the issue and come up with a plan for 2) over the next few days. This writeup is a post-incident transparency report to inform the broader community of our findings and the steps that we took as part of the long-term fix.

During our investigations, we found two issues. We describe them below, along with the fixes that we later pushed.

Incident #1: Integer overflow bug within the smart contract

  1. On Oct 16 2020, at epoch 830801, the verifier initiated the assigned stake rewards operation to reward all SSN operators. The transaction failed after the Scilla interpreter detected an integer overflow and halted the transaction with an exception.
  2. The error occurred during the computation of rewards inside the UpdateStakeReward procedure. The contract computes each SSN's share of the cycle reward in proportion to its stake. The cycle_reward parameter is passed by the verifier and takes into account the performance of the given SSN.

The implementation is as follows:

new_rewards_tmp = builtin mul stake_amt cycle_reward;
new_rewards = builtin div new_rewards_tmp total_stake;

All the above values are stored in 128-bit unsigned integers. All are represented in Qa, where 1 ZIL = 1e12 Qa.

3. The integer overflow happens when “stake_amt” multiplied by “cycle_reward” exceeds the largest 128-bit unsigned value, 2¹²⁸ − 1.

As the number of delegators increases, “stake_amt” at an SSN increases as well. With a full total cycle reward of 1.98M $ZIL (1980000000000000000 Qa), an SSN with more than ~171.86 million $ZIL staked pushes the product past the limit of a 128-bit unsigned integer, which halts the transaction execution. Halting the transaction is a security feature of the Scilla interpreter: it prevents overflowing values from wrapping around to 0, which would leave the contract state inconsistent.
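The ~171.86 million figure can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not the contract code: `checked_mul_u128` is a hypothetical helper that only mimics the behaviour described above (raise instead of wrapping), and Python's unbounded integers stand in for Scilla's fixed-width types.

```python
UINT128_MAX = 2**128 - 1
QA_PER_ZIL = 10**12  # 1 ZIL = 1e12 Qa, as stated above


def checked_mul_u128(a: int, b: int) -> int:
    """Hypothetical helper: multiply, but fail if the result needs more than 128 bits."""
    product = a * b
    if product > UINT128_MAX:
        raise OverflowError("product does not fit in a 128-bit unsigned integer")
    return product


# Full cycle reward quoted in the report: 1.98M ZIL expressed in Qa.
cycle_reward_qa = 1_980_000 * QA_PER_ZIL

# Largest stake_amt for which stake_amt * cycle_reward still fits in 128 bits.
max_safe_stake_qa = UINT128_MAX // cycle_reward_qa
max_safe_stake_zil = max_safe_stake_qa / QA_PER_ZIL

print(f"overflow threshold: about {max_safe_stake_zil / 1e6:.2f} million ZIL")
```

One Qa above `max_safe_stake_qa`, the multiplication no longer fits and the helper raises, which corresponds to the halted transaction.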

4. Upon realising this issue, the following interim measures were put in place:

  • We manually triggered the distribution of the rewards from block 830801 into 6 batches to avoid encountering the integer overflow issue.
  • We changed the rewarding cycle to be once every 300 blocks instead of 1800 blocks with prorated rewards.
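To see why a shorter cycle with prorated rewards sidesteps the overflow, the sketch below reruns the threshold arithmetic. It assumes that the 1.98M ZIL figure refers to a full 1800-block cycle and that the reward is split evenly across the shorter cycles; neither assumption is spelled out in this report, and the variable names are illustrative only.

```python
UINT128_MAX = 2**128 - 1
QA_PER_ZIL = 10**12  # 1 ZIL = 1e12 Qa

FULL_CYCLE_BLOCKS = 1800
SHORT_CYCLE_BLOCKS = 300
full_cycle_reward_qa = 1_980_000 * QA_PER_ZIL  # assumed to cover 1800 blocks

# Number of short cycles that make up one full cycle.
batches = FULL_CYCLE_BLOCKS // SHORT_CYCLE_BLOCKS

# Reward per short cycle under an even (prorated) split.
prorated_reward_qa = full_cycle_reward_qa * SHORT_CYCLE_BLOCKS // FULL_CYCLE_BLOCKS

# Largest SSN stake that keeps stake_amt * cycle_reward within 128 bits.
max_safe_stake_zil = (UINT128_MAX // prorated_reward_qa) / QA_PER_ZIL

print(batches, prorated_reward_qa // QA_PER_ZIL, f"{max_safe_stake_zil / 1e9:.2f} billion ZIL")
```

Under these assumptions, a smaller `cycle_reward` raises the per-SSN overflow threshold by the same factor, which buys headroom but does not remove the underlying bug.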

5. The 6 transactions were confirmed on the network at blocks 830858, 830862, 830866, 830870, 830874 and 830878 with the following transaction hashes:

  • 0xdcb55bdae94fab2d1515cdd9dd1f3d72ae8f9399f6cd657c32c35515d4272afa
  • 0x6ba94a8dfa8cd1ff37576afc1c274e46af66b9e05e8b955c5aadb82a93eec4d5
  • 0x84a15de342176451a1d0668ee5e4be123b512d01649fa658cf0c25fd4e287a25
  • 0x3883c3ad05ad46cb8a3168ef81fa4b83a5d66d863fbcf0f4abbf98a99f9ae246
  • 0x3a992697074b26094d4ff60bd6f4e89c66382afedae608d0f6d2ec2990ba6b46
  • 0x209fd3e15ac2b54ff8f1d4c43f42150317867b394c3d1fe4d57afb0f4163ecdc

6. With the interim mitigation in place, the team conducted a deeper investigation with the help of PwC Switzerland. We also found a second code snippet where an integer overflow could happen. The code is inside “CalcStakeRewards”:

reward_tmp = builtin mul total_rewards staking_of_deleg;
reward = builtin div reward_tmp total_staking;

We have determined that this issue was not triggered during this incident.

7. To fix the issue, we changed the implementation of “ssnlist.scilla”. Notably, we created a new function named “muldiv”. This function takes all 3 values, upcasts them to 256-bit unsigned integers, performs the multiplication and then the division, and finally downcasts the result back to a 128-bit unsigned integer. Because the product of two 128-bit values always fits in 256 bits, the intermediate multiplication can no longer overflow. The fix is implemented at
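The idea behind “muldiv” can be illustrated outside Scilla. The following Python sketch is not the contract's implementation; it only emulates the widths involved, with explicit range checks standing in for the upcast and downcast steps, and the error handling is an assumption made for illustration.

```python
UINT128_MAX = 2**128 - 1
UINT256_MAX = 2**256 - 1


def muldiv(x: int, y: int, z: int) -> int:
    """Compute x * y // z with a 256-bit intermediate and a 128-bit result."""
    for value in (x, y, z):
        if not 0 <= value <= UINT128_MAX:
            raise ValueError("inputs must be 128-bit unsigned integers")
    if z == 0:
        raise ZeroDivisionError("division by zero")

    # "Upcast": two values below 2**128 multiply to less than 2**256,
    # so the intermediate always fits in 256 bits.
    product = x * y
    assert product <= UINT256_MAX

    quotient = product // z

    # "Downcast": the final result must fit back into 128 bits.
    if quotient > UINT128_MAX:
        raise OverflowError("result does not fit in a 128-bit unsigned integer")
    return quotient


QA_PER_ZIL = 10**12
stake_amt = 200_000_000 * QA_PER_ZIL      # above the ~171.86M ZIL threshold
cycle_reward = 1_980_000 * QA_PER_ZIL
total_stake = 1_000_000_000 * QA_PER_ZIL  # illustrative total, not real data

reward = muldiv(stake_amt, cycle_reward, total_stake)
print(reward // QA_PER_ZIL, "ZIL")
```

With these illustrative inputs, the plain 128-bit multiplication would have overflowed, while the widened intermediate yields the reward without error. The downcast can still fail in principle, but only when the true quotient itself exceeds 128 bits.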

Incident #2: Contract states not fully reverted after an exception

  1. On Oct 18, our contract monitoring script detected an anomaly. Rewards for 13 delegators suddenly became unclaimable. All these delegators were expecting some rewards but those rewards could not be claimed and the contract upon being called gave 0 ZILs as rewards.
  2. It was discovered that the affected delegators had previously sent a failed transaction, for example a transaction that ran out of gas. We also observed that these delegators were performing the “claim rewards” operation in the failed transactions.
  3. A closer investigation of the contract’s code and its state led us to the following code snippet.

delete deleg_stake_per_cycle[deleg][ssn_operator][last_reward_cycle];
delete direct_deposit_deleg[deleg][ssn_operator][last_reward_cycle];
delete buff_deposit_deleg[deleg][ssn_operator][last2_reward_cycle];
staking_of_deleg = match comb_opt with
| Some stake => builtin add last_amt stake
| None => last_amt
end;
deleg_stake_per_cycle[deleg][ssn_operator][reward_cycle] := staking_of_deleg;

In case of a successful reward claim, the code removes map entries from “deleg_stake_per_cycle”, “direct_deposit_deleg” and “buff_deposit_deleg”. Those values are then stored as an entry in the “deleg_stake_per_cycle” map. These changes were supposed to happen only when the transaction was successfully processed.

4. However, we noticed that for the affected accounts, the map deletions were executed even when the transaction failed. When a transaction runs out of gas, every state change it made should be reverted.

5. Further investigation uncovered a core protocol bug, where a permanent container was used instead of a temporary container. As a result, the map deletions were not reverted.

6. We realised that for the issue to arise, there must be a successful change in the contract state together with a failed transaction (with map delete being executed) in the same block.
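The class of bug described in steps 4 to 6 can be pictured with a toy model. The Python sketch below is purely illustrative and is not the Zilliqa core code: `ToyState`, its methods and the map key are hypothetical names, and the model ignores the same-block condition from step 6. It only shows why a deletion applied to permanent storage survives a revert, while one staged in a temporary container does not.

```python
class ToyState:
    """Hypothetical contract state: committed data plus per-transaction staged deletes."""

    def __init__(self, committed):
        self.committed = dict(committed)  # "permanent" container
        self.staged_deletes = set()       # "temporary" container

    def delete_buggy(self, key):
        # Bug pattern: the delete is applied to permanent storage immediately.
        self.committed.pop(key, None)

    def delete_fixed(self, key):
        # Fixed pattern: the delete is only staged until the transaction succeeds.
        self.staged_deletes.add(key)

    def commit(self):
        for key in self.staged_deletes:
            self.committed.pop(key, None)
        self.staged_deletes.clear()

    def revert(self):
        # Discard staged changes; permanent storage is left untouched.
        self.staged_deletes.clear()


def failing_claim(state, delete):
    """Simulate a claim that deletes a map entry and then runs out of gas."""
    try:
        delete("deleg_stake_per_cycle/deleg_1")
        raise RuntimeError("out of gas")
    except RuntimeError:
        state.revert()


initial = {"deleg_stake_per_cycle/deleg_1": 500}

buggy = ToyState(initial)
failing_claim(buggy, buggy.delete_buggy)

fixed = ToyState(initial)
failing_claim(fixed, fixed.delete_fixed)

print("entry survives (buggy):", "deleg_stake_per_cycle/deleg_1" in buggy.committed)
print("entry survives (fixed):", "deleg_stake_per_cycle/deleg_1" in fixed.committed)
```

In the buggy variant the entry is gone even though the transaction failed, which mirrors the symptom of rewards becoming unclaimable after a failed claim.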

7. A fix was implemented and pushed to our GitHub code repository on 19 Oct 2020. A new release, v6.4.2, was published on the same day.

Actions Taken Post-Investigation

Since the second issue was a protocol-level bug, we had to conduct a network upgrade to push the fix. We therefore announced an unscheduled mainnet upgrade on 19 Oct 2020, to be followed by a contract upgrade that could take up to 48 hours. An hour before the mainnet upgrade (at epoch 836973), we paused the staking contract so that users could not modify the contract state once the mainnet upgrade was finished.

The network upgrade commenced on 20 Oct 2020 at 05:00hrs (UTC) and finished at 07:40hrs (UTC) on the same day. This prepared the ground to upgrade the contract and undo the deletions that had happened. We used this opportunity to also push the integer overflow fix. As part of the upgrade, we patched the contract state so that the unclaimable stake rewards were claimable again. On 22 Oct 2020 at 08:42hrs (UTC), the smart contract was upgraded and unpaused, and staking activity resumed.

The second issue was difficult to catch, as it was a corner case at the protocol level, but we should have caught the integer overflow issue. We apologize for having missed it during contract development. Having said that, we are continuously running monitoring scripts so that we can continue to quickly identify and mitigate issues, and ensure a smooth network experience for all users in the ecosystem.