Learning by Breaking - A LayerZero Case Study - Part One

Or C.
Feb 29, 2024

Updated: Mar 1, 2024

Cross-chain messaging has always fascinated us at Trust Security, so we jumped at the opportunity when LayerZero announced their crazy $15M bounty program last year. Through this three-part adventure, we'll look into the anatomy of the LZ architecture, study how it safeguards key security properties, and finally show some ways we've found to break them. As usual, we'll also provide a completely transparent review of the project's approach to white-hat disclosures.

Intro

LayerZero needs no introduction. It's by far the most popular and commonly integrated cross-chain messaging primitive, and it has casually hosted a $15M bounty program on Immunefi since May 2023. Its baby project is Stargate, developed by the same team and, as we'll see, coupled in many sneaky ways to its parent. SG leverages LZ messaging to manage liquidity pools across the "multichain".

Properties

The main responsibility of cross-chain messaging is simple - it just needs to pick up a send() call on the source chain and drop off a receive() call on the dest chain. That's usually enough to build full-fledged cross-chain applications for different use cases. Intuitively we can come up with a bunch of properties that have to hold:

A valid message must be deliverable to the destination
A message should not be delivered more than once
A message that was never sent should not be receivable

Since our scope is the EVM ecosystem, gas is a thing, therefore there's also:

The amount of gas paid for in the send() TX must be provided in the receive() TX
Unused gas should be refunded to a specified address

Since blockchains are devoid of meaning without consensus, let's mention:

A deliverable message is one that has received sufficient confirmations from the accepted channel oracle
A message that hasn't received the threshold confirmation count should not be deliverable

Additionally, LZ has been designed with message sequencing ON by default. The docs state every message is assigned an increasing nonce, which is unique to a (src address, src chain, dst chain) tuple. On delivery, the incoming message nonce must match ++nonce. In non-technical words:

Messages cannot arrive out-of-order in messaging channels.
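To make the ordering rule concrete, here is a minimal Python sketch of a single blocking channel. It is a toy model and not LayerZero's code: the class name BlockingChannel and its fields are made up for illustration, and the only behavior taken from the text is that an incoming nonce must equal the stored nonce plus one.

```python
from dataclasses import dataclass, field


@dataclass
class BlockingChannel:
    """Toy model of one ordered inbound channel (hypothetical, not LZ code)."""
    inbound_nonce: int = 0
    delivered: list = field(default_factory=list)

    def receive(self, nonce: int, payload: str) -> None:
        # Models the "incoming nonce must match ++nonce" rule.
        expected = self.inbound_nonce + 1
        if nonce != expected:
            raise ValueError(f"wrong nonce: got {nonce}, expected {expected}")
        self.inbound_nonce = expected
        self.delivered.append(payload)


channel = BlockingChannel()
channel.receive(1, "first")

try:
    channel.receive(3, "third")  # skips nonce 2, so it is rejected
    out_of_order_accepted = True
except ValueError:
    out_of_order_accepted = False

try:
    channel.receive(1, "first")  # replay of an already delivered nonce
    replay_accepted = True
except ValueError:
    replay_accepted = False

channel.receive(2, "second")  # the channel only advances one nonce at a time
```

Note that the same check covers two of the earlier properties at once: a message cannot be delivered twice, and a later message cannot overtake an earlier one. The flip side is that a nonce which can never be delivered blocks everything queued behind it.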

The LZ integration library allows developers to easily convert the "blocking" channel behavior to non-blocking, as we'll see later.

Anatomy

Readers are referred to the docs and source code for the full picture. Here we mostly review the key interfaces and implementation details.

The point of contact for clients integrating with LZ is the Endpoint contract. However, it contains almost no logic and defers complexity to the messaging library. It only validates ordering, makes the receive() call, and allows clients to configure some messaging properties.

ULN (UltraLightNode) is what LZ calls the messaging library. On the outbound path, it notifies the client's chosen Relayer and Oracle of the request (and charges their fees). On the inbound path, it receives the delivery request from a Relayer, verifies consensus with the Oracle and Proof modules, and forwards the message to the Endpoint if everything checks out.

The Relayer and Oracle are the off-chain duo responsible for liveness and consensus respectively. On the consensus side, the LZ dedicated network updates the ULN with trusted Merkle roots for incoming transactions with updateHash(). The Relayer performs delivery through the validateTransactionProof() entry point. Both are supplied by LZ by default, but clients can specify their own addresses for either role.

Here's a brief call stack for send():

Application code
- Endpoint.send()
  - ULNv2.send()
    - ULNv2.handleRelayer()
      - Relayer.assignJob()
    - ULNv2.handleOracle()
      - Oracle.assignJob()

For receivePayload():

Relayer EOA
- RelayerV2.validateTransactionProofV2()
  - ULNv2.validateTransactionProofV2()
    - Inline Merkle hash consensus check
    - Proof library validateProof() - to tie the hash to the incoming payload
    - Endpoint.receivePayload()
      - application's lzReceive()
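The inbound path can be sketched as a small Python model. This is a simplification under stated assumptions, not the real contracts: ToyULN, ToyEndpoint and packet_hash are hypothetical names, a single SHA-256 over the packet stands in for the Merkle root and the proof library, and the channel identifier is treated as an opaque value.

```python
import hashlib


def packet_hash(path, nonce, payload):
    # Stand-in for "Merkle root + proof": one hash over the whole packet.
    return hashlib.sha256(repr((path, nonce, payload)).encode()).hexdigest()


class ToyEndpoint:
    def __init__(self):
        self.inbound_nonce = {}  # (src_chain, path) -> last delivered nonce
        self.inbox = []          # stands in for the application's lzReceive()

    def receive_payload(self, src_chain, path, nonce, payload):
        key = (src_chain, path)
        expected = self.inbound_nonce.get(key, 0) + 1
        if nonce != expected:
            raise ValueError("wrong nonce")
        self.inbound_nonce[key] = expected
        self.inbox.append(payload)


class ToyULN:
    def __init__(self, endpoint, required_confirmations):
        self.endpoint = endpoint
        self.required = required_confirmations
        self.hash_lookup = {}  # (oracle, src_chain, block) -> (confirmations, root)

    def update_hash(self, oracle, src_chain, block, confirmations, root):
        # Oracle side: publish a trusted root for a source-chain block.
        self.hash_lookup[(oracle, src_chain, block)] = (confirmations, root)

    def validate_transaction_proof(self, oracle, src_chain, block, path, nonce, payload):
        # Relayer side: consensus check, then proof check, then delivery.
        entry = self.hash_lookup.get((oracle, src_chain, block))
        if entry is None or entry[0] < self.required:
            raise ValueError("not enough confirmations")
        if packet_hash(path, nonce, payload) != entry[1]:
            raise ValueError("proof does not match payload")
        self.endpoint.receive_payload(src_chain, path, nonce, payload)


endpoint = ToyEndpoint()
uln = ToyULN(endpoint, required_confirmations=15)
root = packet_hash("channel-a", 1, b"hello")

uln.update_hash("oracle", 101, "0xblock", 5, root)
try:
    uln.validate_transaction_proof("oracle", 101, "0xblock", "channel-a", 1, b"hello")
    delivered_early = True
except ValueError:
    delivered_early = False  # 5 confirmations are below the threshold of 15

uln.update_hash("oracle", 101, "0xblock", 15, root)
uln.validate_transaction_proof("oracle", 101, "0xblock", "channel-a", 1, b"hello")
```

The point to take away is the trust split: the ULN believes whatever root the configured Oracle published, and the Endpoint believes whatever the ULN forwards.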

The schematic understanding above is enough to start getting into more interesting discussions.

The ULNv1 Exploit

Back in Sept 2022, LayerZero disclosed that they received a report about "potential griefing of applications by blocking messaging chain paths". Subsequently ULNv1 was deprecated. To the best of our knowledge, the details were never discussed in public. It's interesting to take a look at the key changes and bring them to light.

ULNv1:

ULNv2:

The code above is at the end of ULN's validateTransactionProof(), after it has verified an incoming transaction. Note that the second parameter to receivePayload() used to be (srcAddress) and is now (srcAddress, dstAddress). Below we show how it is used:

The expected nonce is pulled from the mapping and accessed via [_srcChainId][_srcAddress], where _srcAddress is now both the src and dst address. It has to equal the incoming packet's nonce. So why do nonces have to be unique for every src / dest address pair?

Recall that a client contract can pick its own Relayer/Oracle pair. Hence, a malicious client can forge messages from other chains that were never sent (as long as the destination is the client contract itself). Its Oracle will confirm consensus on a fake Merkle root, and its Relayer will call validateTransactionProof(). But since the nonce was mapped only by srcAddress, the increment caused by delivery to the malicious destination would invalidate delivery to the real destination! Essentially it's a slot collision.
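The collision is easy to reproduce in a toy model. The Python sketch below is an illustration only: NonceEndpoint and the key_by_destination flag are hypothetical, the addresses are placeholders, and the signature checks are omitted because the attacker controls their own Relayer and Oracle anyway. The one detail taken from the text is the indexing: by (srcChainId, srcAddress) in the old scheme, and by (srcChainId, srcAddress + dstAddress) in the new one.

```python
class NonceEndpoint:
    """Toy inbound nonce tracker (hypothetical, not the LZ Endpoint)."""

    def __init__(self, key_by_destination: bool):
        self.key_by_destination = key_by_destination
        self.inbound_nonce = {}

    def _key(self, src_chain, src_addr, dst_addr):
        if self.key_by_destination:
            return (src_chain, src_addr, dst_addr)  # ULNv2-style path
        return (src_chain, src_addr)                # ULNv1-style path

    def receive_payload(self, src_chain, src_addr, dst_addr, nonce) -> bool:
        key = self._key(src_chain, src_addr, dst_addr)
        expected = self.inbound_nonce.get(key, 0) + 1
        if nonce != expected:
            return False  # the real contract would revert here
        self.inbound_nonce[key] = expected
        return True


def run_attack(endpoint: NonceEndpoint):
    # 1. The attacker's contract "receives" a forged message that claims to
    #    come from the victim's source application, using nonce 1.
    forged = endpoint.receive_payload(101, "0xVictimSrc", "0xAttackerDst", 1)
    # 2. The genuine message with nonce 1 arrives for the real destination.
    genuine = endpoint.receive_payload(101, "0xVictimSrc", "0xVictimDst", 1)
    return forged, genuine


v1_result = run_attack(NonceEndpoint(key_by_destination=False))
v2_result = run_attack(NonceEndpoint(key_by_destination=True))
```

In the v1-style run the forged delivery consumes the nonce slot shared by every destination of 0xVictimSrc, so the genuine message is rejected. In the v2-style run the two destinations have separate slots and both deliveries succeed.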

The impact is extreme - anyone can prevent the delivery of a specific message forever, since the nonce assigned to the message will never equal inboundNonce. LayerZero wrote:

Within hours, we concluded that while the issue was legitimate, no funds were at risk; a malicious actor could potentially grief applications, but applications would still be able to fully recover their state.

The only way we see applications being able to re-deliver the message is by updating the ULN to a version with a backdoor of "burnt" transactions, which can be replayed again. In truly decentralized applications which renounce the owner role, this would not be possible, meaning the loss of funds would be permanent.

It seems like the effort to hush up details of the issue came at the expense of accurate documentation. As mentioned before, the docs state the ordering is from a src address to all destinations in a single chain, which no longer matches the per-destination behavior described above. It is worth updating the docs, since the discrepancy could lead to incorrect integrations.

LayerZero paid a generous bounty of $250k to the party that disclosed the issue. It turns out that samczsun reported it shortly after them, and LZ decided to also award him a $50k goodwill bounty. The team deserves applause for that.

Bug theory

Logical slot collisions are one of the bug classes that are next to impossible to detect in testing. The two main ways to detect such issues are:

Working backwards from an attack surface analysis - being able to set a custom Oracle means being able to deliver any TX to your contract. What are all the side effects of that surface?
Audit checklist - go over all mappings and ask - is there more than one way to index a certain item when there shouldn't be?

That's all for today. Join us for part 2 which will introduce Stargate and discuss two high severity issues we've found in it.
