ZilBridge back online following extended downtime

Following a period of extended disruption, full functionality has been restored to ZilBridge.

Following the upgrade of the Zilliqa network to v9.3.0 on January 3, 2024, the ZilBridge platform experienced what would become an extended disruption, leaving several transactions unconfirmed and users unable to make use of the platform.

This issue was resolved on March 27, 2024, when full functionality was restored to the platform.

ZilBridge is an Ethereum-Zilliqa bridge powered by Carbon and Poly Network that enables the easy bridging of ZRC-2 fungible tokens between Zilliqa and Ethereum.

The Zilliqa technical team has conducted a root cause analysis of this disruption, which provides a detailed breakdown of the issue and how it was resolved.

The disruption to ZilBridge, labelled as PIR-219, was caused principally by the ZilBridge relay infrastructure not being shut down when the v9.3.0 network upgrade was implemented, which led it to relay incorrect block headers through the bridge infrastructure.

Factors which contributed to the delays in resolving this issue include the manner in which Zilliqa network upgrades are rolled out, the characteristics of how PolyNetwork validates transaction blocks, the discovery of bugs in the relayer program, and the time required to build a new genesis block and sync historical transactions with PolyNetwork.

As of March 27, these obstacles were overcome and full functionality was restored to the platform. ZilBridge is now back online and all previously stuck transactions have been synchronised and confirmed.

Root Cause Analysis - ZilBridge Disruption

ZilBridge relies in part on a relayer program that relays suitable transactions and block headers to PolyNetwork, which then forwards the requests to Carbon and on to other chains for delivery.

When a mainnet upgrade is implemented on the Zilliqa network, the following occurs:

  • The old network is made inaccessible
  • A new network is created from the persistence of the old network
  • The new network replaces the old.
  • The new network is made accessible.

The Zilliqa team notifies partners ahead of a scheduled network upgrade so they are able to pause their infrastructure while the old network is made inaccessible. This imperfect process will be improved and made more dynamic and flexible with the rollout of Zilliqa 2.0.

In the case of the Zilliqa v9.3.0 upgrade, ZilBridge infrastructure was not paused during this process. It continued to accrue headers from the empty blocks still being produced by the old network and relayed these to PolyNetwork.
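The role of the pause can be illustrated with a toy model. The sketch below is not the ZilBridge relayer; the `Relayer` class, its methods, and the header strings are all hypothetical names chosen for illustration. It only shows that a relayer which is paused for the upgrade window forwards nothing from the old network while it winds down.

```python
from dataclasses import dataclass, field


@dataclass
class Relayer:
    """Toy relayer (hypothetical): forwards headers unless it is paused."""
    paused: bool = False
    relayed: list = field(default_factory=list)

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    def relay(self, header: str) -> bool:
        # Return True if the header was forwarded, False if it was dropped.
        if self.paused:
            return False
        self.relayed.append(header)
        return True


relayer = Relayer()
relayer.relay("old-net-block-100")   # normal operation before the upgrade
relayer.pause()                      # old network is made inaccessible
relayer.relay("old-net-empty-101")   # dropped: empty block from the old network
relayer.resume()                     # upgraded network is made accessible
relayer.relay("new-net-block-101")
```

In this model, skipping the `pause()` call is the analogue of what happened during the v9.3.0 upgrade: the header from the old network's empty block would have been forwarded as well.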

This meant that as the network came back up with v9.3.0, PolyNetwork found itself with a forked DS (Directory Service) committee membership and refused to sync with the new network.

PolyNetwork checks whether a transaction block is correctly signed by reconstructing the DS committee from the DS block headers reported by the Zilliqa blockchain. Only the latest DS committee is stored by PolyNetwork, and it is not possible to work backwards to the committee membership for earlier DS blocks.
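A simplified model may help show why relaying headers from the old network left PolyNetwork unable to sync. The sketch below is not PolyNetwork's code. It assumes, purely for illustration, that each DS block header names only the incoming committee members, that the same number of oldest members rotate out, and that a transaction block is accepted when more than two thirds of the stored committee have signed it. The `LightClient` class and member names are hypothetical.

```python
from collections import deque


class LightClient:
    """Toy verifier (hypothetical) that keeps only the latest DS committee."""

    def __init__(self, genesis_committee, genesis_number):
        self.committee = deque(genesis_committee)  # oldest member first
        self.ds_number = genesis_number

    def apply_ds_header(self, number, new_members):
        # Headers must arrive in sequence; earlier committees are not kept.
        if number != self.ds_number + 1:
            raise ValueError("DS header out of sequence")
        for member in new_members:
            self.committee.popleft()       # oldest member rotates out
            self.committee.append(member)  # incoming member joins
        self.ds_number = number

    def accepts_tx_block(self, signers):
        # Assumed rule: more than 2/3 of the stored committee must have signed.
        valid = set(signers) & set(self.committee)
        return 3 * len(valid) > 2 * len(self.committee)


client = LightClient(["A", "B", "C", "D"], genesis_number=10)

# Headers relayed from the old network while it was being replaced.
client.apply_ds_header(11, ["X"])
client.apply_ds_header(12, ["Y"])

# The upgraded network's real committee is C, D, E, F in this example.
accepted = client.accepts_tx_block({"C", "D", "E", "F"})

# The real header for DS block 11 can no longer be applied.
try:
    client.apply_ds_header(11, ["E"])
    resynced = True
except ValueError:
    resynced = False
```

Here the client ends up holding a committee that the upgraded network never had, rejects blocks signed by the real committee, and cannot rewind to apply the correct headers, which is why a new genesis was needed.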

This meant that we needed to regenerate the genesis block for PolyNetwork, a time-intensive process that had to start from a point earlier than the current DS block.

As it is not possible to calculate DS committee memberships for previous blocks, the Zilliqa team created a tool that took saved persistence and worked forwards to reconstruct a genesis block at any point in the chain. This was then used to generate and sign a block at a block height just after the network upgrade.
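The idea of working forwards from saved state can be sketched as follows. This is not the Zilliqa team's tool; `replay_committee` and its parameters are hypothetical, and the rotation rule (each DS block names only incoming members, and the same number of oldest members leave) is a simplifying assumption. Under that assumption the members that left are not recorded, so the committee can only be rebuilt by replaying from an earlier snapshot.

```python
from collections import deque


def replay_committee(snapshot_number, snapshot_committee, incoming_by_block, target):
    """Rebuild the DS committee at `target` by replaying forwards from a snapshot.

    incoming_by_block maps a DS block number to the members joining at that block.
    """
    if target < snapshot_number:
        # Departed members are not recorded, so replay only goes forwards.
        raise ValueError("target is earlier than the snapshot")
    committee = deque(snapshot_committee)  # oldest member first
    for number in range(snapshot_number + 1, target + 1):
        for member in incoming_by_block[number]:  # KeyError if history is missing
            committee.popleft()
            committee.append(member)
    return list(committee)


incoming = {11: ["E"], 12: ["F"], 13: ["G"]}
committee_at_12 = replay_committee(10, ["A", "B", "C", "D"], incoming, target=12)
```

With a snapshot at DS block 10, any later block can be chosen as the new starting point, which mirrors the choice of a block height just after the network upgrade.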

Several genesis syncs had to be completed to account for the change in guard nodes between network versions, and bugs were then discovered in the relayer program that resulted in transaction blocks not being synced to PolyNetwork.

These bugs were fixed, and we then encountered an issue stemming from the fact that PolyNetwork is unable to store the DS committee membership for a DS block whose hash it has already stored.

This caused the relayer to stop working and necessitated the creation of a genesis block between the last DS block PolyNetwork had seen and the first DS block with an outstanding bridge transaction.

The relayer was optimised to accelerate this synchronisation. Once it was complete, PolyNetwork was aligned with the Zilliqa network and ZilBridge functionality was fully restored on March 27, 2024.