# Root Cause Analysis: Turing node memory leak
The OAK team observed significant RAM growth on our nodes (collator and boot nodes). We believe it was caused by the upgrade to Polkadot client 0.9.23, specifically https://github.com/paritytech/substrate/issues/11604.
Timeline (all times in PST, UTC-7):
0900: Began upgrading nodes from the 1.4.0 binary to 1.5.0
1015: Runtime upgrade started
1118: Runtime upgrade occurred for the Turing Network
1120: Completed upgrading nodes from the 1.4.0 binary to 1.5.0
1236: First alert regarding RAM usage
...alerts continued for other nodes
1400: OAK devs huddle began; the issue was flagged as P0 (highest-priority bug)
1420: OAK devs started rolling nodes back to 1.4.0 after discussing several mitigation options
1426: Collators were informed to keep the 1.4.0 node version/binary
1445: Rollback of OAK nodes to 1.4.0 completed
We did not observe the same issue in Turing Staging (which runs against Rococo). We noticed that Rococo was on Polkadot client 0.9.26, while Turing Staging and Turing had both recently upgraded to client 0.9.23. Looking into the delta between the two versions, we found this issue: https://github.com/paritytech/substrate/issues/11604
From that issue:

> After finished syncing with Polkadot, parachain node memory usage starts constantly growing until reaches maximum and then relay chain part of node stops working (only parachain logs are visible, no relaychain logs and no connection to relay chain).
This matches what we observed in our node dashboards, and the metrics align with the timeline above. RAM usage has stayed flat since the downgrade to 1.4.0.
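As a reference for how this kind of regression shows up in monitoring, here is a minimal sketch of a growth check against a Prometheus server scraping the nodes. The server URL and job label are hypothetical, not our actual infrastructure; `process_resident_memory_bytes` is the standard process metric exposed by Prometheus-instrumented services, including Substrate-based nodes.

```python
# Minimal sketch: flag sustained RSS growth on a node via the Prometheus
# HTTP API. The server URL and job label below are hypothetical.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.example.com:9090"  # hypothetical host

def rss_growth_bytes_per_sec(job: str, window: str = "6h") -> float:
    """Average per-second growth of resident memory over `window`."""
    # deriv() over a long window approximates steady growth; a healthy
    # node should hover near zero once it has finished syncing.
    promql = f'deriv(process_resident_memory_bytes{{job="{job}"}}[{window}])'
    params = urllib.parse.urlencode({"query": promql})
    url = f"{PROMETHEUS}/api/v1/query?{params}"
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    growth = rss_growth_bytes_per_sec("turing-collator")  # hypothetical job
    print(f"RSS growth: {growth / 1024 / 1024:.3f} MiB/s")
    if growth > 0:
        print("warning: memory usage is trending upward")
```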
We also noticed that the Cumulus release for 0.9.24 had BEEFY disabled: https://github.com/paritytech/cumulus/pull/1382/files
Mitigation options we discussed:

- Roll nodes back to 1.4.0 and observe whether the memory leak persists.
- Upgrade OAK-blockchain to 0.9.26. The problem with this option is our dependency on Moonbeam's parachain-staking pallet; specifically, we are waiting on this PR: https://github.com/PureStake/moonbeam/pull/1699
- Cherry-pick the client changes related to disabling BEEFY. The dependency graph proved too vast for this to be practical.
Resolution: roll nodes back to 1.4.0; collators must remain on 1.4.0. This appears to mitigate the leak, and we are not seeing much RAM growth, but the OAK team will continue to monitor.
Next steps:

- Kusama's upgrade to the 0.9.26 client might mitigate this issue: https://kusama.polkassembly.io/referendum/216. Per governance, this will happen on Tuesday, July 26. Once the proposal has passed, we will test the 1.5.0 Turing client to evaluate whether the memory leak is still an issue. It is unclear whether this will fix the problem; however, the issue was not seen on Rococo, which is already on the 0.9.26 client.
- We will upgrade the Turing client to use polkadot-v0.9.26. This is more complicated to expedite given our dependencies on other projects, specifically Moonbeam's parachain-staking (see the sketch after this list).
- Figure out ways to decouple dependencies so that we can ship hotfixes for issues such as this.
- Figure out how to test and deploy while watching for issues in Turing Staging first.
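The polkadot-v0.9.26 bump touches every git dependency pinned to the old release branch. As a rough way to scope that work, here is a minimal sketch that walks a Cargo workspace and lists the pinned dependencies; it assumes the common Substrate convention of `branch = "polkadot-v0.9.23"` git pins and is illustrative, not part of our tooling.

```python
# Sketch: list git dependencies pinned to an old polkadot release branch
# across a Cargo workspace, to scope a bump such as 0.9.23 -> 0.9.26.
import sys
import tomllib  # standard library in Python 3.11+
from pathlib import Path

OLD_BRANCH = "polkadot-v0.9.23"

def pinned_deps(workspace: Path):
    """Yield (manifest, dependency name, git URL) for pinned git deps."""
    for manifest in workspace.rglob("Cargo.toml"):
        with manifest.open("rb") as f:
            data = tomllib.load(f)
        for table in ("dependencies", "dev-dependencies", "build-dependencies"):
            for name, spec in data.get(table, {}).items():
                # Git dependencies look like:
                # name = { git = "https://github.com/...", branch = "polkadot-v0.9.23" }
                if isinstance(spec, dict) and spec.get("branch") == OLD_BRANCH:
                    yield manifest, name, spec.get("git", "?")

if __name__ == "__main__":
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for manifest, name, repo in pinned_deps(root):
        print(f"{manifest}: {name} <- {repo}")
```

Each upstream repository in the resulting list needs a matching polkadot-v0.9.26 branch before the workspace compiles again, which is why the Moonbeam PR above blocks us.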
Related:

- Moonbeam cherry-picked this: https://github.com/PureStake/moonbeam/pull/1679
- Issue flagged on Polkadot: https://github.com/paritytech/polkadot/issues/5804
- According to a collator, Shiden also encountered this issue and had to roll back their nodes.