Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fma: FMA for MT/64-bit Cannon #123

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
146 changes: 146 additions & 0 deletions security/fma-multi-threaded-64-bit-cannon.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,146 @@
# Multithreaded Cannon: Failure Modes and Recovery Path Analysis

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Introduction](#introduction)
- [Failure Modes and Recovery Paths](#failure-modes-and-recovery-paths)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any hard forks or contract upgrades involved? If so, let's make sure to consider any FMA-specific mitigations for the failure modes listed in fma-generic-hardfork.md and fma-generic-contracts.md

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a contract upgrade. The onchain operation for this change requires changing the DisputeGameFactory implementation of cannon and permissioned_cannon. I have a section at the bottom that covers this operation. The other failure modes in the linked documents don't apply here.

- [Incorrect Linux/MIPS emulation](#incorrect-linuxmips-emulation)
- [Unimplemented syscalls or opcodes needed by `op-program`](#unimplemented-syscalls-or-opcodes-needed-by-op-program)
- [Insufficient memory in the program](#insufficient-memory-in-the-program)
- [Compromised Control Flow in the program](#compromised-control-flow-in-the-program)
- [Failure to run correct VM based on prestate input](#failure-to-run-correct-vm-based-on-prestate-input)
- [Mismatch between on-chain and off-chain execution](#mismatch-between-on-chain-and-off-chain-execution)
- [Livelocks in the fault proof](#livelocks-in-the-fault-proof)
- [Execution traces too long for the fault proof](#execution-traces-too-long-for-the-fault-proof)
- [Invalid `DisputeGameFactory.setImplementation` execution](#invalid-disputegamefactorysetimplementation-execution)
- [Action Items](#action-items)
Inphi marked this conversation as resolved.
Show resolved Hide resolved
- [Audit requirements](#audit-requirements)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

| | |
|--------|--------------|
| Author | Paul Dowman, Mofi Taiwo |
| Created at | *2024-10-09* |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this was originally written in october, is this still up to date, i.e. has the design changed at all? One notable difference is we now know we'll use OPCM, which is something we should add to the generic contract failure modes

| Initial Reviewers | *Reviewer Name 1, Reviewer Name 2* |
| Need Approval From | *Security Reviewer Name* |
Comment on lines +26 to +27
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| Initial Reviewers | *Reviewer Name 1, Reviewer Name 2* |
| Need Approval From | *Security Reviewer Name* |
| Initial Reviewers | Matt Solomon |
| Need Approval From | Tom Assas |

| Status | In Review |

> [!NOTE]
> 📢 Remember:
>
> - The single approver in the “Need Approval From” must be from the Security team.
> - Maintain the “Status” property accordingly. An FMA document can have the following statuses:
> - **Draft 📝:** Doc is created but not yet ready for review.
> - **In Review 🔎:** Security is reviewing, and Engineering is iterating on the design. A checklist of action items will be created during this phase.
> - **Implementing Actions 🛫:** Security has signed off on the content of the document, including the resulting action items. Engineering is responsible for implementing the the action items, and updating the checklist.
> - **Final 👍:** Security will transition the status of the document to Final once all action items are completed.

> [!TIP]
> Guidelines for writing a good analysis, and what the reviewer will look for:
>
> - Show your work: Include steps and tools for each conclusion.
> - Completeness of risks considered.
> - Include both implementation and operational failure modes
> - Provide references to support the reviewer.
> - The size of the document will likely be proportional to the project's complexity.
> - The ultimate goal of this document is to identify action items to improve the security of the project. The FMA review process can be accelerated by proactively identifying action items during the writing process.

## Introduction

This document covers the conversion of the [Cannon Fault Proof VM](https://docs.optimism.io/stack/protocol/fault-proofs/cannon) to support multi-threading and 64-bit architecture. These changes increase addressable memory and support better memory management by unlocking garbage collection in the op-program.

The multi-threaded Fault Proof VM is specified [here](https://github.com/ethereum-optimism/specs/blob/3abc17a68727e22c31a7a113be935943f717ee63/specs/experimental/cannon-fault-proof-vm-mt.md).

## Failure Modes and Recovery Paths

### Incorrect Linux/MIPS emulation

- **Description:** An incorrectly implemented FPVM could result in an invalid fault proof. This can be caused by bugs in the thread scheduler, incorrect emulation of MIPS64 instructions, and so on.
- **Risk Assessment:** High severity, Low likelihood.
- **Mitigations:** Comprehensive testing. This includes full test coverage of every supported MIPS instruction, threading semantics, and verifying op-program execution on live chain data.
mds1 marked this conversation as resolved.
Show resolved Hide resolved
This includes [unit and fuzz](https://github.com/ethereum-optimism/optimism/tree/eabf70498f68f321f5de003f1d443d3e3c8100b8/cannon/mipsevm/tests) testing of MIPS instructions and Linux syscalls. It also includes [testing](https://github.com/ethereum-optimism/optimism/blob/eabf70498f68f321f5de003f1d443d3e3c8100b8/cannon/mipsevm/multithreaded/state_test.go) of multithreaded specific functionality.
- **Detection:** op-dispute-mon forecasts and alerts on undesirable game resolutions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's link to dispute mon here

- **Recovery Path(s)**: See [Fault Proof Recovery](https://www.notion.so/oplabs/RB-000-Fault-Proofs-Recovery-Runbook-8dad0f1e6d4644c281b0e946c89f345f).

### Unimplemented syscalls or opcodes needed by `op-program`

- **Description:** We only aim to implement syscalls and opcodes that are required by `op-program` so there are some unimplemented. The risk is that there is some previously untested code path that uses an opcode or syscall that we haven't implemented and this code path ends up being exercised by an input condition some time in the future.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for not implementing unused syscalls and opcodes to mitigate this failure mode? I'm guessing either (1) there are so many that it'd not feasible, or (2) implementation and verification of each is time consuming? It would be good to expand either the description, or mitigation, to answer this

- **Risk Assessment:** High severity, low likelihood.
- **Mitigations:** We periodically use Cannon to execute the op-program using inputs from op-mainnet and op-sepolia. This periodic cannon runner (vm-runner) runs on oplabs infrastructure.
Furthermore, we [sanitize](https://github.com/ethereum-optimism/optimism/blob/eabf70498f68f321f5de003f1d443d3e3c8100b8/cannon/Makefile#L51) the op-program [in CI](https://github.com/ethereum-optimism/optimism/blob/eabf70498f68f321f5de003f1d443d3e3c8100b8/.circleci/config.yml#L928C1-L929C111) for unsupported opcodes.
- **Detection:** Alerting is setup to notify the proofs team whenever the vm-runner fails to complete a cannon run. And the CI check provides an early warning against unsupported opcodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, it seems that the only early detection is coming from the CI.
This is making me wondering about the potential hole there:

  1. We maintain the CI for a long period of time.
  2. As no one introduce new opcodes this CI test will never be generating fail.
  3. After a certain time, we decide to clean the CI tests because there a taking time to run (and this test is not matching often).
  4. Then a invalid Opcode is introduced.

For me, seems bit light to only rely on the CI here.
Happy to discuss about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. Added clarification on the vm-runner being used to detect issues outside of CI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should link to the alerting to verify this setup. Also, is this a page to oncall because a failure means unsupported opcode (which is good since there's a clear DRI), or is failure just a slack alert that could get missed since it has no DRI?

- **Recovery Path(s)**: See [Fault Proof Recovery](https://www.notion.so/oplabs/RB-000-Fault-Proofs-Recovery-Runbook-8dad0f1e6d4644c281b0e946c89f345f).

### Insufficient memory in the program

- **Description:** The op-program may run out of memory, causing it to crash.
- **Risk Assessment:** High severity, low likelihood.
- **Mitigations:** The 64-bit address space virtually eliminates memory exhaustion risks. Go's concurrent garbage collector automatically manages memory through scheduled background goroutines.
- **Detection:** op-dispute-mon forecasts and alerts on undesirable game resolutions that would result due to a program crash.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about adding op-dispute-mon link

- **Recovery Path(s)**: See [Fault Proof Recovery](https://www.notion.so/oplabs/RB-000-Fault-Proofs-Recovery-Runbook-8dad0f1e6d4644c281b0e946c89f345f).

### Compromised Control Flow in the program

- **Description:** This could theoretically occur when the op-program runs out of memory in a way that lets the attacker reuse code to subvert execution.
- **Risk Assessment:** High severity, low likelihood.
- Low likelihood: This requires an attacker to craft inputs that not only induce high memory usage, but also corrupt or spray the heap in a way that either produces invalid fault proofs or prevents valid fault proofs from being generated.
- **Mitigations:** As with [Insufficient memory in the program](#insufficient-memory-in-the-program), the 64-bit address space effectively prevents this from occurring. Furthermore, the Go runtime checks memory allocations against heap corruption. However, such memory protections may not hold due to bugs in the Go runtime.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Maybe add that not only vulnerability inside Go can cause this behavior but also the usage of unsafe in Go can lead to unexpected behavior.
PS: We should also ensure that no unsafe package is imported and used incorrectly in the current codebase.

- **Detection:** op-dispute-mon forecasts and alerts on undesirable game resolutions that would result due to honest claims being disputed at the bottom of the game tree.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about adding op-dispute-mon link

- **Recovery Path(s)**: See [Fault Proof Recovery](https://www.notion.so/oplabs/RB-000-Fault-Proofs-Recovery-Runbook-8dad0f1e6d4644c281b0e946c89f345f).

### Failure to run correct VM based on prestate input

- **Description:** The off-chain Cannon [attempts to run the correct VM version based on the prestate input](https://github.com/ethereum-optimism/design-docs/blob/0034943e42b8ab5f9dd9ded2ef2b6b55359c922c/cannon-state-versioning.md). If it doesn't work correctly the on-chain steps would not match.
- **Risk Assessment:** Medium severity, low likelihood.
- **Mitigations:** Multicannon mitigates this issue by embedding a variety of cannon STFs into a single binary. This shifts the concern of ensuring the correct VM selection to multicannon. We also run multicannon on oplabs infra via the vm-runner, to assert the multicannon binary was built correctly.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we details more about the "STFs" meaning here?
And why the multicannon is ensuring the program will not occured here?

Also, for the case of the an invalid prestate is deployed on L1 by a mistake (For example: by updating the prestate with the ASR in case of Incident response with an incorrect game that is blacklisted etc..) is this case also make sense here or
should we add a new failure or this is part of the failure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a new failure mode for invalid prestates.

Copy link
Contributor

@mds1 mds1 Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is multicannon the same as MT-Cannon? If yes let's just change this to be consistent (also has to be changed above and in a few other places)

Suggested change
- **Mitigations:** Multicannon mitigates this issue by embedding a variety of cannon STFs into a single binary. This shifts the concern of ensuring the correct VM selection to multicannon. We also run multicannon on oplabs infra via the vm-runner, to assert the multicannon binary was built correctly.
- **Mitigations:** MT-Cannon mitigates this issue by embedding a variety of cannon STFs into a single binary. This shifts the concern of ensuring the correct VM selection to MT-Cannon. We also run MT-Cannon on oplabs infra via the vm-runner, to assert the MT-Cannon binary was built correctly.

- **Detection:** This can be detected by manual review. Failing that, it would only be detected when malicious activity occurs and an honest op-challenger fails to generate a fault proof.
- **Recovery Path(s)**: Fix the op-challenger multicannon configuration.

### Mismatch between on-chain and off-chain execution

- **Description:** There could be bugs in the implementation of either the Solidity or Go versions that make them incompatible with each other.
- **Risk Assessment:** High severity, low likelihood.
- **Mitigations:** [Diffeerential testing](https://github.com/ethereum-optimism/optimism/tree/eabf70498f68f321f5de003f1d443d3e3c8100b8/cannon/mipsevm/tests) asserts identical on-chain and off-chain execution.
- **Detection:** An op-challenger fails to fault prove an invalid claim using a witness generated offchain.
- **Recovery Path(s)**: Depends on the specifics. If the onchain VM implementation is "more correct", then fixing this can be done solely offchain. Otherwise, a governance vote will be needed. As usual, the [Fault Proof Recovery](https://www.notion.so/oplabs/RB-000-Fault-Proofs-Recovery-Runbook-8dad0f1e6d4644c281b0e946c89f345f) provides the best guidance on this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "more correct" mean? Is this covered in the runbook?


### Livelocks in the fault proof

- **Description:** A livelocked execution prevents an honest challenger from generating a fault proof.
- **Risk Assessment:** High severity, low likelihood.
- **Mitigations:** Manual review of the op-program and a quick review of Go runtime internals. The op-program uses 3 threads, and only one of those threads is used by the mutator main function. This makes livelocks very unlikely. This [issue](https://github.com/ethereum-optimism/optimism/issues/11979) looks into the livelock problem with possible solutions. The proposed solutions are deferred for future work as the risk of a livelock is considered too low to be addressed immediately.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue is now closed, but it says "To future-proof the guest program, we should revisit this". Where are we tracking revisiting this issue/mitigation?

- **Detection:** This would manifest as an execution that runs forever. Eventually, but well before the dispute period ends, op-dispute-mon will indicate that a game is forecasted to resolve incorrectly.
- **Recovery Path(s)**: See [Fault Proof Recovery](https://www.notion.so/oplabs/RB-000-Fault-Proofs-Recovery-Runbook-8dad0f1e6d4644c281b0e946c89f345f).

### Execution traces too long for the fault proof

- **Description:** It's possible that introducing multi-threading/gc greatly increases the execution time of the op-program.
- **Risk Assessment:** Medium severity, low likelihood.
- **Mitigations:** Based on [vm-runner executions of 64-bit Cannon and 32-bit single-threaded Cannon](https://optimistic.grafana.net/goto/IlD3Tu4Hg?orgId=1), the 64-bit VM executes the op-program much faster than the 32-bit VM. As such, we do not expect worse performance over the existing VM, which was already found to be adequate. However, we can always use CPUs with better single-core performance to mitigate.
- **Detection:** op-dispute-mon provides an early forecast (see [this alerting rule](https://github.com/ethereum-optimism/k8s/blob/6416658dd9602f83c6be093c5c57af45877258a3/grafana-cloud/rules/dispute-mon.yaml#L154)) and triggers an alert if the op-challenger stops interacting with a game.
- **Recovery Path(s)**: This can be mitigated by migrating the op-challenger to a more powerful CPU. [Fault Proof Recovery](https://www.notion.so/oplabs/RB-000-Fault-Proofs-Recovery-Runbook-8dad0f1e6d4644c281b0e946c89f345f) provides guidance on when it'll be appropriate to do so.

### Invalid `DisputeGameFactory.setImplementation` execution

- Description: This occurs when either the call to the DisputeGameFactory could not be made due to grossly unfavorable base fees on L1, an invalidly approved safe nonce, or a successful execution to a misconfigured dispute game implementation.
Copy link
Contributor

@Ethnical Ethnical Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will invalid the withdrawals that are currently into the current windows right?
If yes, we should inform the reader here imo.
To make sure, this will also invalid some withdrawals of users. They will required to resubmit them. However, that can be expensive in case of gas spike.

Or this won't call setRespectedGame? and this only adding a new implementation into the mapping?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added context clarifying this.

- **Risk Assessment:** Low severity, low likelihood.
- Low Likelihood: The low likelihood is a result of tenderly simulation testing of safe transactions, code review of the upgrade playbook, and manual review of the dispute game implementations (which are deployed on mainnet and specified in the governance proposal so they may be reviewed).
- Low severity: Fault Proofs continues to use the existing single-threaded FPVM. This carries a reputational risk, but it doesn't diminish the security of the system. Withdrawals will continue to work against outputs secured by the single-threaded FPVM.
- **Mitigations:** No immediate action is needed other than to retry the safe transaction. This may require another signing ceremony. Note that the op-challenger does not need to be rolled back, as multicannon is backwards compatible with older FPVM state transition functions.
- **Detection:** An un-executed safe transaction is easily detectable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An un-executed safe transaction is easily detectable.

Agreed in the case of the revert transaction as the UI of the execution from superchain-ops will show the issue.

or a successful execution to a misconfigured dispute game implementation.

However, the a successful execution to a misconfigured dispute game implementation. How to detect a misconfigured dispute game deployed?
I think we should elaborate more on it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the "low likelihood" bullet point, I note the various ways this would be detected. Since the game implementations are pre-deployed prior to the upgrade, a reviewer can check and detect any invalid configuration. Does that address your comment?

- **Recovery Path(s)**: Reschedule upgrade, possibly releasing new binary though without immediate urgency.


## Action Items
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mds1 - any immidiate FMA action items that we should add to the list, after your initial pass for the MT cannon FMA?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Below is what needs to be done before launch to reduce the chances of the above failure modes occurring, and to ensure they can be detected and recovered from:

- [ ] Third-party audit the offchain and onchain VM implementation and specification (Assignee: @inphi)

## Audit requirements

An audit of the multithreaded VM is not required per the [OP Labs Audit Framework](https://gov.optimism.io/t/op-labs-audit-framework-when-to-get-external-security-review-and-how-to-prepare-for-it/6864).
A failure in the new Cannon VM and thus dispute games is mitigated by an airgap in finalized withdrawals. Furthermore, there's a window whereby the Security Council can override the results of invalid games.
Nonetheless, we will be auditing the new VM.
Copy link
Contributor

@BlocksOnAChain BlocksOnAChain Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Inphi - we should add audit report links here, once we have the final report from our external audits

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BlocksOnAChain Out of curiosity when you will have the results for the audit of the VM? Did the PR will be merged before the results here?

Copy link
Contributor

@BlocksOnAChain BlocksOnAChain Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ethnical - We will have the results in January, likely mid January since we will be on a collective pause + auditors need time to generate the reports.
We are still doing reviews with the Auditors, so all of this brings us to January as our target date.
I started adding the findings from the audit to this label, feel free to review what we have, for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the prior text of "an audit is not needed given the audit framework" it would be good to explain why we chose to get an audit