This troubleshooting guide lists methods for detecting and handling issues with the mirror node.
Whenever there is a disruption on the mirrornet, first determine whether the failover is healthy. If it is, switch DNS to the failover immediately, then troubleshoot the problem.
Similarly, whenever the mirror node needs to be restarted, switch DNS to the failover first if it is healthy.
The following is a list of error messages and how to begin handling each one when it is encountered.
-
Encountered unknown transaction type
Ideally, this should not happen because new transaction types are released only after the mirror node has been updated to handle them. An alert is raised only when a significant number of transactions are affected.
Actions:
- There is no immediate fix. Bring to team's attention immediately (during reasonable hours, otherwise next morning).
-
Error closing connection
If this happens, the database may eventually run out of open connections.
Actions:
- Check the Cloud SQL console to see whether the connection limit is being reached. If so, restarting the importer is one way to fix it temporarily.
- Make sure the connection limit does not cause a service outage; restart as needed.
-
Error parsing record file
previous hash is null
Hash mismatch for file
Previous file hash not available
Unable to extract hash and signature from file
Unknown file delimiter
Unknown record file delimiter
All of the above errors happen when the data inside stream files is not as expected. The cause can be a new bug introduced on the importer side or a mainnet node publishing bad data.
Actions:
- Notify devops immediately
- Check if any recent changes were made to the code related to the error
-
Error saving file in database
Unable to connect to database
Unable to fetch entity types
Unable to prepare SQL statements
Unable to set connection to not auto commit
All of the above errors have some SQLException as the root cause.
Actions:
- Check that the Cloud SQL instance is up and running correctly (by executing some SQL queries). If the problem seems to be on the Cloud SQL side, escalate to devops.
-
ERRORS processing account balances file
Can be caused either by bad data in the account balances stream or by SQL exceptions. In either case, the exception is logged and the action is retried.
Actions:
- If the cause is bad data in the stream, escalate to devops.
- If the cause is a SQL exception, check whether the next try succeeds. If the error continues, investigate.
-
Exception
Actions:
- See exception details for more info (attempt to diagnose, including trying to restart the service, but escalate anyway)
-
Failed downloading
These errors happen even in a perfectly running mirror node, but are infrequent (about one every few minutes). Failed download attempts are retried rapidly. If they happen more often than that, investigate. There are many possible causes: S3 may be down, or there may be other connection issues.
Actions:
- Try downloading failing files locally (checks S3 is up)
- Check socket usage, packet loss, etc on importer instance
-
Failed to parse NodeAddressBook from
Actions:
- There is no immediate fix. Bring to team's attention immediately (during reasonable hours, otherwise next morning).
-
Insufficient downloaded signature file count, requires at least
This can happen if:
- Some mainnet nodes are still in the process of uploading their signatures for the latest file (benign case). The logging rate will be at most 20/min.
- Some mainnet nodes publish bad signatures, which halts the downloader's progress. The logging rate in this case can reach 100/min.
Effect: Bad signatures halt system progress.
Actions:
- If the error is caused by bad signatures, escalate to devops.
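As a rough aid for telling the two cases apart, the logging rates quoted above can be checked against the timestamps of the matching log entries. The sketch below is illustrative only: the helper names `peak_entries_per_minute` and `looks_benign` are hypothetical, and treating anything above 20 entries per minute as worth investigating is an assumption drawn from the rates in this section, not a rule from the importer.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable

# From the guide: the benign case logs at most 20 entries per minute.
BENIGN_MAX_PER_MINUTE = 20


def peak_entries_per_minute(timestamps: Iterable[datetime]) -> int:
    """Return the highest number of log entries seen in any calendar minute."""
    # Truncate each timestamp to its minute and count entries per minute.
    per_minute = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return max(per_minute.values(), default=0)


def looks_benign(timestamps: Iterable[datetime]) -> bool:
    """True when the peak rate stays within the benign bound (assumed cutoff)."""
    return peak_entries_per_minute(timestamps) <= BENIGN_MAX_PER_MINUTE
```

A rate above the benign bound does not prove that signatures are bad; it only suggests the entries deserve a closer look before escalating.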
-
Long overflow when converting time to nanos timestamp
The importer assumes all timestamps can be converted into nanos-since-epoch and stored as a Long. This error will halt the progress of the parser.
Actions:
- There is no immediate fix. Bring to team's attention immediately (during reasonable hours, otherwise next morning).
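To make the overflow condition concrete, here is a minimal sketch of the range check implied by storing nanos-since-epoch in a signed 64-bit value. The function `to_nanos` and its error text are hypothetical illustrations and are not taken from the importer's code; the only assumption is that "Long" means a signed 64-bit integer.

```python
LONG_MAX = 2**63 - 1  # largest signed 64-bit value
LONG_MIN = -(2**63)   # smallest signed 64-bit value
NANOS_PER_SECOND = 1_000_000_000


def to_nanos(seconds: int, nanos: int) -> int:
    """Convert (seconds, nanos) since the epoch to nanos-since-epoch.

    Raises OverflowError when the result would not fit in a signed 64-bit value.
    """
    # Python integers are arbitrary precision, so the product itself cannot
    # overflow; the range check below stands in for the 64-bit limit.
    total = seconds * NANOS_PER_SECOND + nanos
    if not LONG_MIN <= total <= LONG_MAX:
        raise OverflowError(f"timestamp {seconds}s {nanos}ns does not fit in 64 bits")
    return total
```

Under this assumption, the largest representable timestamp is 9223372036 seconds plus 854775807 nanoseconds after the epoch, which falls in the year 2262; any transaction timestamp beyond that would trigger the error.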
-
Unable to copy address book from
Actions:
- For emergency fix, manually copy known good address book to the destination.
-
Unable to guess correct transaction type since there's not exactly one
Ideally, this should not happen because new transaction types are released only after the mirror node has been updated to handle them. When this error does occur, the parser keeps retrying and never makes progress.
Actions:
- There is no immediate fix. Bring to team's attention immediately (during reasonable hours, otherwise next morning).
This section lists alerts that detect when a mirror node might not be functioning normally, along with guiding rules to help prioritize new alerts.
The priorities below are based on PagerDuty's Alert Priorities and warrant the same responses described there.
If a message signals any of the following, then it qualifies as high priority:
- Service is down
- A scenario which will certainly halt system progress
For example, the parser encounters a badly formatted file, which will certainly halt the parser's progress.
- Data loading delayed more than accepted SLA
- Anything that adversely impacts many transactions
For example, timestamp overflow affecting many transactions
Alerts: High-Priority PagerDuty Alert 24/7/365
Response: Requires immediate human action
If a message signals any of the following, then it qualifies as medium priority:
- Service is lagging
- A scenario which may halt system progress
For example, the downloader encounters many badly formatted files and it is possible that progress may halt.
Alerts: High-Priority PagerDuty Alert during business hours only
Response: Requires human action within 24 hours.
If a message signals any of the following, then it qualifies as low priority:
- An unexpected scenario which, if it continues for long enough, can eventually lead to medium/high priority scenarios
- Non-critical system assumptions are broken, but with no real impact or very limited impact (say, a few transactions)
For example, a required field is missing in a transaction/receipt, which only impacts that transaction. For instance, topicId missing in an update/delete topic transaction.
For example, invalid or missing signatures that do not halt progress because they come from only one of the many nodes.
Alerts: Low-Priority PagerDuty Alert during business hours only
Response: Requires human action at some point.
Log Message | Default Priority | Conditional Priority |
---|---|---|
Error parsing record file | HIGH | |
Error starting watch service | HIGH | |
ERRORS processing account balances file | HIGH | |
previous hash is null | HIGH | |
Failed to parse NodeAddressBook from | HIGH | |
Hash mismatch for file | HIGH | |
Long overflow when converting time to nanos timestamp | HIGH | |
Previous file hash not available | HIGH | |
Unable to extract hash and signature from file | HIGH | |
Unable to guess correct transaction type since there's not exactly one | HIGH | |
Unknown file delimiter | HIGH | |
Unknown record file delimiter | HIGH | |
Error processing balances files after | MEDIUM | |
Exception processing account balances file | MEDIUM | |
Encountered unknown transaction type | LOW | HIGH (if 10 entries over 10 min) |
Error closing connection | LOW | HIGH (if 10 entries over 10 min) |
Account balance dataset timestamp mismatch! | LOW | |
Error decoding hex string | LOW | |
Failed to verify | LOW | |
Input parameter is not a folder | LOW | |
Failed to verify signature with public key | LOW | |
Missing signature for file | LOW | |
Error saving file in database | NONE | HIGH (if 30 entries in 1 min) |
Failed downloading | NONE | HIGH (if 30 entries in 1 min) |
Insufficient downloaded signature file count, requires at least | NONE | HIGH (if 30 entries in 1 min) |
Signature verification failed | NONE | HIGH (if 30 entries in 1 min) |
Unable to connect to database | NONE | HIGH (if 30 entries in 1 min) |
Unable to fetch entity types | NONE | HIGH (if 30 entries in 1 min) |
Unable to prepare SQL statements | NONE | HIGH (if 30 entries in 1 min) |
Unable to set connection to not auto commit | NONE | HIGH (if 30 entries in 1 min) |
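The conditional priorities in the table amount to a count of matching log entries within a time window. The following sketch shows one way such a rule could be evaluated. It is not the actual alerting configuration: the names `Rule`, `RULES`, and `effective_priority` are hypothetical, only three rows of the table are encoded, and reading "if N entries" as "at least N entries in the window ending now" is an assumption.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional, Sequence


class Rule(NamedTuple):
    default: str                        # default priority from the table
    threshold: Optional[int] = None     # entry count that escalates to HIGH
    window: Optional[timedelta] = None  # period over which entries are counted


# A few rows from the table, encoded by hand for illustration.
RULES = {
    "Error parsing record file": Rule("HIGH"),
    "Encountered unknown transaction type": Rule("LOW", 10, timedelta(minutes=10)),
    "Failed downloading": Rule("NONE", 30, timedelta(minutes=1)),
}


def effective_priority(message: str, timestamps: Sequence[datetime], now: datetime) -> str:
    """Return the priority for a message given when its entries were logged."""
    rule = RULES[message]
    if rule.threshold is None or rule.window is None:
        return rule.default
    # Count entries that fall inside the window ending at `now`.
    recent = sum(1 for ts in timestamps if now - rule.window < ts <= now)
    return "HIGH" if recent >= rule.threshold else rule.default
```

For example, 30 `Failed downloading` entries logged within the last minute would evaluate to HIGH, while 29 would stay at NONE.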
Anything that wakes up a human in the middle of the night should be immediately actionable. For all HIGH priority alerts, there should be a section in the guide above listing immediately actionable steps someone can take to reduce the issue's severity or to fix it.