Validator Guides

uprendis edited this page Mar 25, 2021 · 1 revision

Doublesign and its avoidance

A doublesign is the creation of at least one pair of fork events in the DAG. See the fork rules.

The creator of fork events is called a cheater.

A fork does no harm to other users unless validators holding 1/3W (one third of the total validator weight) cheat simultaneously.

Doublesign punishment

Punishment of a detected cheater is fully controlled by the SFC contract. Note that the SFC contract can be upgraded by the governance contract at any time, without a hard fork.

The SFC slashes 100% of the stake of a cheating validator and of his delegators, unless overridden by the governance contract.

Note that exactly the same security rules apply to delegators, without exceptions. This is necessary because delegators increase the validator's weight in the consensus algorithm. If delegators did not face the same punishment, a validator could delegate to himself to minimize his punishment.

Doublesign on migration or node update

A fork is created when a validator creates an event whose self-parent is not his last event.

This means that a validator must always have his previous event in the DB, which leaves two ways to doublesign by accident:

  1. Two or more instances of the same validator are running simultaneously (and hence may create events simultaneously), and all heuristics failed to detect that another instance is running.
  2. After re-downloading or a loss of data, the validator hasn't downloaded all of his previous events, and all heuristics failed to detect that the node hasn't synced up.

Migration guide

On migration or upgrade, validators should ensure that the new node has downloaded all previous events of the previous node.

When migrating to another server, follow these steps:

  1. Stop the previous node.
  2. Ensure the node has stopped and wasn't restarted!
  3. Copy the datadir to the new server (optional, but it ensures that the previous events are in the DB).
  4. Back up the keystore (copy the datadir/keystore directory). Erase the keystore and password from the old server!
  5. Follow the steps from How to ensure previous events are downloaded.
  6. Start the node as a non-validator on the new server. Ensure that it connects to the network and that everything works as expected.
  7. Restart the node as a validator on the new server (only one instance!).
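Steps 3 and 4 above can be sketched in shell. This is only an illustration: `backup_keystore` is a hypothetical helper, and the paths, user, and host name in the usage comments are placeholders. The copy is verified with `diff -r` so that nothing is erased from the old server before the backup is known to match.

```shell
#!/bin/sh
# Hypothetical helper: copy <datadir>/keystore into <backup_dir> and
# verify that the copy matches the original.
backup_keystore() {
    datadir=$1
    backup_dir=$2
    if [ ! -d "$datadir/keystore" ]; then
        echo "no keystore directory in $datadir" >&2
        return 1
    fi
    mkdir -p "$backup_dir" || return 1
    cp -R "$datadir/keystore" "$backup_dir/" || return 1
    # Compare the copy with the original; non-zero status means a mismatch.
    diff -r "$datadir/keystore" "$backup_dir/keystore" >/dev/null
}

# Usage (placeholders, run only after the old node has stopped):
#   backup_keystore /path-to-datadir /path-to-backup
#   rsync -a /path-to-datadir/ user@new-server:/path-to-datadir/
```

Erase the keystore and password from the old server only after the helper returns success and the backup is stored somewhere safe.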

Node upgrade guide

When upgrading the node, follow these steps to prevent a doublesign:

  1. Stop the node
  2. Ensure the node has stopped and wasn't restarted!
  3. Backup keystore (copy datadir/keystore directory)
  4. Rebuild or replace the go-opera executable (follow the Node update instructions)
  5. Follow steps from How to ensure previous events are downloaded
  6. Start the node as a non-validator. Ensure that it connects to the network and that everything works as expected.
  7. Restart node as validator (only one instance!)

How to ensure previous events are downloaded

Ensure the node has synced up (do any two of the following):

  • Launch a non-validator node first. Check that its last block is >= the last block shown in the explorer.
  • Launch a non-validator node first with the --exitwhensynced flag. Wait until it has stopped.
  • Find the ID of your validator's last event by grepping the logs of the previous node. Launch a non-validator node first and wait until it has synced up. Ensure that it has received that last event by grepping its logs. Alternatively, you may request events by ID using the API (enable --rpc before that).
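The first check (local last block >= explorer last block) could be scripted roughly as follows. This is a sketch under assumptions: it supposes the node's JSON-RPC endpoint is enabled and answers the standard `eth_blockNumber` call, `LOCAL_RPC` is a placeholder URL, and the explorer height is typed in by hand. The helper names are hypothetical.

```shell
#!/bin/sh
# Convert a 0x-prefixed hex quantity (as returned by JSON-RPC) to decimal.
hex_to_dec() {
    printf '%d\n' "$1"
}

# Extract the string value of the "result" field from a JSON-RPC
# response read from stdin.
rpc_result() {
    sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeed if the local height ($1) is >= the explorer height ($2).
is_caught_up() {
    [ "$1" -ge "$2" ]
}

# Usage (assumed endpoint; adjust to your node's RPC settings):
#   LOCAL_RPC=http://localhost:18545
#   explorer_height=12345678   # copied from the explorer by hand
#   resp=$(curl -s -X POST -H 'Content-Type: application/json' \
#       --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
#       "$LOCAL_RPC")
#   local_height=$(hex_to_dec "$(printf '%s' "$resp" | rpc_result)")
#   is_caught_up "$local_height" "$explorer_height" && echo "synced"
```

A passing check counts as only one of the two required confirmations; it does not by itself prove that the validator's own last event was received.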

Ensure that there's no parallel instance of the same validator:

  • After the previous node has stopped, wait 60-90 minutes. Wait until the explorer shows that your validator has a downtime of >= 40 minutes.
  • Launch a non-validator node. Search its logs for events from your validator with ID=X: grep "by=X " log.txt. Wait 40 minutes and ensure that no new events are emitted.
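The log check in the second bullet can be sketched as two samples taken 40 minutes apart. The helper names are hypothetical, and the sketch assumes, as the grep pattern above does, that log lines for events from validator X contain `by=X ` followed by a space.

```shell
#!/bin/sh
# Count the lines in log file $1 that mention events created by
# validator ID $2. The trailing space in the pattern keeps "by=1 "
# from matching "by=12 ". grep exits with 1 when nothing matches,
# which is not an error here.
count_validator_events() {
    grep -c -F -- "by=$2 " "$1" || [ $? -eq 1 ]
}

# Succeed if the event count did not change between two samples.
no_new_events() {
    [ "$1" -eq "$2" ]
}

# Usage (X is your validator ID, log.txt the non-validator node's log):
#   before=$(count_validator_events log.txt X)
#   sleep 2400   # 40 minutes
#   after=$(count_validator_events log.txt X)
#   no_new_events "$before" "$after" \
#       || echo "validator X is still emitting events" >&2
```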

The node uses multiple heuristics to detect previous events that were not downloaded, or a parallel validator instance. These heuristics are probabilistic: it's impossible to detect those conditions with full accuracy in a decentralized network. Validators should therefore follow the steps above manually and not rely on the heuristics.

Data loss

DB data may be lost for the following reasons:

  1. The node was terminated using kill -9, the node crashed, or the server stopped. After this, the node will either refuse to start (if the DB is corrupted) or start from the previously flushed state (data is flushed every epoch, every 10 minutes, and when the DB buffer is exceeded).
  2. Disk failure. A disk failure may lead to a corrupted DB state, in which case the node will refuse to start.

Node update

Stop the node. Ensure the node has stopped and wasn't restarted!

killall opera
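killall only sends the termination signal, so it's worth confirming that the process is really gone and stays gone. A minimal sketch, assuming the process is named `opera` as in the command above and that `pgrep` is available; `confirm_stopped` is a hypothetical helper:

```shell
#!/bin/sh
# Succeed only if no process with exactly the name $1 is seen in any
# of $2 checks, spaced $3 seconds apart. Repeating the check helps to
# catch a supervisor that restarts the node automatically.
confirm_stopped() {
    name=$1
    checks=$2
    interval=$3
    i=0
    while [ "$i" -lt "$checks" ]; do
        if pgrep -x "$name" >/dev/null 2>&1; then
            return 1
        fi
        i=$((i + 1))
        if [ "$i" -lt "$checks" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Usage: six checks, ten seconds apart.
#   confirm_stopped opera 6 10 || echo "opera is still running" >&2
```

This only covers the local machine; it can't detect an instance of the same validator running on another server.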

If needed, back up and erase the previous DB files (the node won't start if the new version isn't compatible with the previous DB):

rm -r /path-to-datadir/chaindata
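The command above erases the DB without keeping a copy. One possible way to cover the backup part is to move the directory aside instead of deleting it. This is a sketch only: `archive_chaindata` is a hypothetical helper and the paths are placeholders.

```shell
#!/bin/sh
# Move <datadir>/chaindata to <backup_root>/chaindata-<timestamp> and
# print the new location. Nothing is deleted.
archive_chaindata() {
    datadir=$1
    backup_root=$2
    if [ ! -d "$datadir/chaindata" ]; then
        echo "no chaindata directory in $datadir" >&2
        return 1
    fi
    dest="$backup_root/chaindata-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$backup_root" || return 1
    mv "$datadir/chaindata" "$dest" || return 1
    printf '%s\n' "$dest"
}

# Usage (run only after the node has stopped):
#   archive_chaindata /path-to-datadir /path-to-backups
```

After the DB has been moved or erased, the node has to download its previous events again, so the checks in How to ensure previous events are downloaded apply before it is restarted as a validator.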

Update and build the latest version:

git pull origin master
make build

The build output is found in ./build/
