A Retrospective: Understanding The Network Outage & Our Next Steps
As many of you already know, the Zilliqa mainnet went down on 29th July at 22:59:14 UTC. Thanks to the combined efforts of our technical teams, mainnet was recovered in under 24 hours.
This experience encapsulates the spirit and vision of Zilliqa 2.0. Our focus on building up technical capabilities and expanding the team enabled us to emerge from this episode stronger and more prepared than ever, making it a significant moment for our collective growth.
The details of the outage have been published in an earlier post here, but we recognise that its technical nature and terminology may make it difficult for some readers to follow.
If you’re one of them, we hope this post explains the topic in a way that’s easier to understand!
Part 1: The cause
It was discovered that the Scilla function ecdsa_recover_pk had received an out-of-range parameter input. This input was not handled properly by secp256k1, the external cryptographic library that ecdsa_recover_pk depends on.
Let’s break things down. For starters, functions operate on a set of parameters. An analogy makes the situation clearer. You can think of:
- the ecdsa_recover_pk function as a credit card reader;
- secp256k1 as the credit card reader software; and
- the cards it accepts (Visa, MasterCard, and American Express) as the valid parameters.
What happened was that the credit card reader was presented with a Citibank card that the software did not recognise.
Keep this analogy in mind as we will return to it shortly!
Part 2: The effect
The valid values for the parameter passed to secp256k1 are 0, 1, 2 and 3. In this case, it received an input of 28.
Here, secp256k1 did not check whether the parameter was within its bounds and went on to run unsafe code with the value it received. This caused a segmentation fault, which terminated the Scilla server process on the mining nodes and effectively brought the network down.
Now let’s return to our analogy. The credit card reader software did not first check whether the card it detected was accepted, and attempted to process the payment anyway. This triggered an internal fail-safe mechanism that caused the credit card reader to shut down.
Part 3: The fix
Upon identifying the cause of mainnet going down, our team rebuilt the Scilla binaries to pick up the latest version (0.4.4) of secp256k1’s OCaml wrapper library. This version handles the out-of-range parameter correctly, which prevents a recurrence of the same issue.
To put the above in simpler terms, let’s assume that the credit card reader software can be updated and re-deployed from a desktop application. This desktop application essentially represents the Scilla binaries.
As you may have already guessed, the credit card reader software was updated through the application, and then re-deployed to all credit card readers. This update ensures that every reader now displays an error message (instead of shutting down) when it identifies cards that are not accepted.
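For readers who prefer code to credit cards, the idea behind the fix can be sketched in a few lines of Python. This is only an illustration, not the actual change made in the OCaml wrapper: the names recover_pk_checked, _native_recover and OutOfRangeError are hypothetical, and the native call is replaced by a stub. The point is simply that the value is checked against the valid range (0 to 3) before anything unsafe runs, so a bad input produces an error instead of a crash.

```python
VALID_VALUES = (0, 1, 2, 3)  # the valid range quoted in this post


class OutOfRangeError(ValueError):
    """Raised instead of letting an invalid value reach native code."""


def _native_recover(value: int) -> str:
    # Hypothetical stand-in for the call into the native library.
    # In the real system this is where unsafe code would run.
    return f"public-key-for-{value}"


def recover_pk_checked(value: int) -> str:
    """Validate the parameter first, then hand it to the native routine."""
    if value not in VALID_VALUES:
        # The "error message instead of shutting down" behaviour.
        raise OutOfRangeError(
            f"parameter must be one of {VALID_VALUES}, got {value}"
        )
    return _native_recover(value)
```

With this guard in place, calling recover_pk_checked(28) raises OutOfRangeError, which the caller can report, rather than taking the whole process down.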
Strengthening network resilience: Learnings for the road ahead
There are several positive aspects to this series of events, in particular the team’s success in resolving the network issues quickly. The episode has also helped us to pinpoint areas for improvement.
We will start by writing comprehensive happy-path and out-of-bounds (unhappy-path) unit tests for all Scilla built-in functions.
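To give a feel for what such tests look like, here is a minimal sketch using Python's standard unittest module. It is not taken from the Scilla test suite: builtin_under_test is a hypothetical stand-in for a built-in that accepts only the values 0 to 3, and the out-of-range samples (including 28) are chosen for illustration.

```python
import unittest

VALID_VALUES = (0, 1, 2, 3)


def builtin_under_test(value: int) -> int:
    """Hypothetical built-in that accepts only VALID_VALUES."""
    if value not in VALID_VALUES:
        raise ValueError(f"value out of range: {value}")
    return value


class BuiltinRangeTests(unittest.TestCase):
    def test_happy_path_accepts_every_valid_value(self):
        # Happy path: every in-range value is processed normally.
        for v in VALID_VALUES:
            with self.subTest(value=v):
                self.assertEqual(builtin_under_test(v), v)

    def test_unhappy_path_rejects_out_of_range_values(self):
        # Unhappy path: values outside the range raise a clean error.
        for v in (-1, 4, 28, 2**32):
            with self.subTest(value=v):
                with self.assertRaises(ValueError):
                    builtin_under_test(v)


def run_tests() -> unittest.TestResult:
    """Load and run the test case, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BuiltinRangeTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_tests() executes both cases. The unhappy-path test is the one that would have flagged an input such as 28 before it ever reached a live network.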
Furthermore, the outage highlighted the need for a way to update individual binaries on the nodes (the Scilla server in this case) instead of updating the whole node. Once implemented, new binaries will be picked up by an overarching monitor process and replaced seamlessly.
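The design of this monitor process has not been finalised, so the following is only a sketch of one possible building block, under our own assumptions: detecting that a binary on disk has changed by comparing content hashes. The function names file_digest and detect_new_binary are hypothetical, and a real monitor would also need to restart the affected process safely.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used as a change fingerprint."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def detect_new_binary(path: Path, last_digest: Optional[str]) -> Optional[str]:
    """Return the new digest if the binary differs from last_digest.

    Returns None when the file is unchanged, so the caller knows
    there is nothing to replace.
    """
    current = file_digest(path)
    return current if current != last_digest else None
```

A monitor loop could call detect_new_binary periodically for each binary it tracks and swap in only the one that changed, leaving the rest of the node untouched.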
Being able to independently update binaries will also help us make progress towards rolling upgrades in the future. At the same time, strengthening binary resilience of the nodes goes a long way in ensuring our network remains robust.
Above all, this has been a learning experience for Zilliqa and it reaffirms our commitment to constantly building and innovating. We will soon share more details about our plans to strengthen our DevOps practices, processes and capabilities.