This is the all-in-one document about improvements that mitigate the performance impact of slow TiKV nodes, including restarts and slow TiKV instances (disk IO jitter, hangs, overload, etc.). From this issue, you can track all related problems, bug fixes, and tasks for improvement and enhancement.
Background
Challenges in stability: when one or more TiKV instances encounter issues or slow down in a large-scale cluster, what impact does that have on the overall performance of the TiDB cluster?
Suppose we have 100 TiKV nodes and one of them encounters an issue. We typically assume that the overall performance impact is no more than 1/100, but in reality this is often far from the case.
Many production incidents have shown that as cluster size increases, the performance impact on the entire cluster when a single TiKV node encounters problems far exceeds that assumption. The reasons for the outsized impact can be categorized as follows:
Bugs, such as the schema cache not caching historical schema versions as expected, resulting in requests penetrating through to TiKV.
Reaching the capacity boundaries of the existing implementation, such as the overall performance impact when a TiKV node hosting meta regions fails.
Constraints of the architecture.
Therefore, improving the overall stability and resilience of TiDB essentially requires:
Improving quality and addressing bugs.
Optimizing implementations to bring TiDB's resilience capabilities closer to the upper limits of architectural design constraints.
This tracking issue focuses on and consolidates the second point.
From an end-to-end perspective, the speed of failover when some TiKV nodes fail depends on the critical paths in both the kv-client and TiKV. Below we examine the current state of, and improvements to, each component related to the kv-client and TiKV from a top-down perspective, combined with the known user issues and pain points encountered so far.
Region Cache Related
Problems:
Region information is not updated in time, causing unexpected cross-AZ traffic
Region reload puts significant pressure on PD (see the sketch after this list)
The interface has multiple usage patterns and is tightly coupled with surrounding modules, making it difficult to maintain
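The first two problems suggest decoupling cache refresh from the request path: serve routing information from the cache and refresh stale entries asynchronously, while coalescing reloads so that stale hits do not turn into a PD request storm. Below is a minimal sketch of the idea; the types and names are hypothetical and this is not the actual client-go region cache.

```go
// Hypothetical sketch: serve possibly-stale region entries from memory and
// refresh them asynchronously, coalescing concurrent reloads so a burst of
// stale hits does not become a burst of PD RPCs.
package regioncache

import (
	"context"
	"sync"
	"time"
)

// Region is a minimal stand-in for the cached routing information.
type Region struct {
	ID     uint64
	Leader string // address of the leader store
}

type entry struct {
	region   Region
	loadedAt time.Time
}

type Cache struct {
	mu        sync.Mutex
	entries   map[uint64]*entry
	reloading map[uint64]struct{} // regions with an in-flight reload
	ttl       time.Duration
	// loadFromPD is the single place that talks to PD.
	loadFromPD func(ctx context.Context, id uint64) (Region, error)
}

func NewCache(ttl time.Duration, load func(ctx context.Context, id uint64) (Region, error)) *Cache {
	return &Cache{
		entries:    make(map[uint64]*entry),
		reloading:  make(map[uint64]struct{}),
		ttl:        ttl,
		loadFromPD: load,
	}
}

// Get returns the cached region immediately; if the entry is older than the
// TTL, it triggers at most one background reload for that region, keeping
// the refresh off the request path and the PD load bounded.
func (c *Cache) Get(id uint64) (Region, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[id]
	if !ok {
		return Region{}, false
	}
	if time.Since(e.loadedAt) > c.ttl {
		if _, busy := c.reloading[id]; !busy {
			c.reloading[id] = struct{}{}
			go c.reload(id) // refresh in the background
		}
	}
	return e.region, true
}

func (c *Cache) reload(id uint64) {
	r, err := c.loadFromPD(context.Background(), id)
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.reloading, id)
	if err == nil {
		c.entries[id] = &entry{region: r, loadedAt: time.Now()}
	}
}
```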
Building A Unified Health Controller And Feedback Mechanism
Problems:
Store slowness information cannot be detected by the kv-client; exposing it would help the kv-client make better peer-selection decisions and avoid wasting resources (see the sketch after this list)
TiKV nodes can receive requests before they have warmed up, causing latency spikes (issue). The PD store heartbeat could instead be sent only after log applying and warm-up have finished on the restarted TiKV node.
TiKV nodes can be busy applying raft logs after a network partition; scheduling leader peers to a node that has just started may cause high write latency because of apply wait (issue)
The leader cannot advance write request processing (applying logs) even though the logs have already been committed by a majority of the replicas, which significantly impacts write latency when a single EBS volume experiences IO jitter.
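To illustrate the first problem in the list above: if TiKV (or a unified health controller) fed per-store slowness back to the kv-client, replica selection could deprioritize slow stores instead of repeatedly sending requests to them. The sketch below is hypothetical; the names and the scoring scheme are assumptions, not the actual implementation.

```go
// Hypothetical sketch: the kv-client keeps a per-store slow score updated
// from feedback (e.g. piggybacked on responses) and prefers healthier
// replicas when choosing a peer for a request.
package replicasel

import "sync"

type StoreHealth struct {
	mu     sync.Mutex
	scores map[uint64]int // store ID -> slow score; higher means slower
}

func NewStoreHealth() *StoreHealth {
	return &StoreHealth{scores: make(map[uint64]int)}
}

// Feedback records the slow score reported for a store.
func (h *StoreHealth) Feedback(storeID uint64, slowScore int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.scores[storeID] = slowScore
}

// Pick chooses the candidate replica on the store with the lowest slow
// score, so requests drain away from a store that reports itself as slow.
// It assumes at least one candidate.
func (h *StoreHealth) Pick(candidates []uint64) uint64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	best := candidates[0]
	for _, id := range candidates[1:] {
		if h.scores[id] < h.scores[best] {
			best = id
		}
	}
	return best
}
```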
Replica selection & TiKV Error Handling & Retry Related
Problems:
Tracking issue:
Enabling TiKV Slow Score By Default
Problems:
Tracking issue: Enable the evict-slow-store scheduler by default. tikv/pd#7564
Warmup Before PD Heartbeat And Leader Movement
Problems:
Tracking issue:
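As described in the problem list earlier, a restarted TiKV node can receive traffic before it has applied pending raft logs and warmed up its caches. One way to express the proposed ordering is to gate the first PD store heartbeat (and therefore leader scheduling) on warm-up completion. The sketch below is a hypothetical startup sequence; the interface and function names are assumptions.

```go
// Hypothetical sketch of a restart sequence that only announces the store to
// PD after pending raft logs are applied and caches are warmed, so PD does
// not schedule leaders onto a cold node.
package startup

import "context"

type Store interface {
	ApplyPendingRaftLogs(ctx context.Context) error // catch up on committed logs
	WarmUpCaches(ctx context.Context) error         // e.g. block cache warm-up
	StartPDHeartbeat(ctx context.Context) error     // makes the store schedulable
	ServeTraffic(ctx context.Context) error
}

func Restart(ctx context.Context, s Store) error {
	// 1. Catch up and warm up first; the store stays invisible to the scheduler.
	if err := s.ApplyPendingRaftLogs(ctx); err != nil {
		return err
	}
	if err := s.WarmUpCaches(ctx); err != nil {
		return err
	}
	// 2. Only now report to PD, so leader transfers target a ready node.
	if err := s.StartPDHeartbeat(ctx); err != nil {
		return err
	}
	// 3. Finally serve client traffic.
	return s.ServeTraffic(ctx)
}
```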
Enabling Async-IO By Default
Problems:
Tracking issue: Enable `async-io` by default by changing `raftstore.store-io-pool-size` from 0 to 1. tikv/tikv#16614
Allow Log Apply When The Quorum Has Formed
Problems:
Tracking issue:
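The problem listed earlier (a leader blocked on its own slow disk even though a majority of replicas have persisted the log) comes down to how the commit index is computed: in raft, an index is committed once any quorum of voters has durably persisted it, and the leader's own disk write is only one vote. Below is a minimal sketch of that rule with a hypothetical helper; it is not TiKV's raftstore code (which is written in Rust), and the term check required by raft is omitted for brevity.

```go
// Hypothetical sketch: compute the raft commit index as the highest index
// persisted by a quorum of voters. If a majority of followers have acked an
// index, it is committed even while the leader's own disk write (e.g. on a
// jittery EBS volume) is still in flight, so applying need not wait for it.
package quorum

import "sort"

// commitIndex takes, for each voter (including the leader), the last log
// index it has durably persisted and acknowledged. It assumes at least one
// voter and omits raft's current-term check for brevity.
func commitIndex(persistedIndex map[uint64]uint64) uint64 {
	idx := make([]uint64, 0, len(persistedIndex))
	for _, p := range persistedIndex {
		idx = append(idx, p)
	}
	// Sort descending; the quorum-th largest value is persisted by a quorum.
	sort.Slice(idx, func(i, j int) bool { return idx[i] > idx[j] })
	quorumSize := len(idx)/2 + 1
	return idx[quorumSize-1]
}
```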
Avoid IO operations in store loop
Problems:
Tracking issue:
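This item is closely related to the async-io change above: `raftstore.store-io-pool-size` controls how many dedicated threads handle raft IO, and with a value of 0 the writes happen inside the store loop itself, so any disk stall blocks message processing. The general pattern named by this section's title is offloading blocking IO from the event loop to dedicated workers. The sketch below illustrates that generic pattern only; it is not TiKV's actual raftstore (which is written in Rust), and all names are assumptions.

```go
// Generic sketch of keeping blocking IO out of an event loop: the loop hands
// write batches to dedicated IO workers over a channel and keeps processing
// messages, so a slow disk stalls only the IO workers, not the whole loop.
package storeloop

type WriteBatch struct {
	Entries [][]byte
}

type onPersisted func(WriteBatch)

// StartIOWorkers launches n workers that perform the (potentially slow)
// persistence and then invoke a callback, e.g. to advance raft state.
func StartIOWorkers(n int, persist func(WriteBatch) error, done onPersisted) chan<- WriteBatch {
	tasks := make(chan WriteBatch, 1024)
	for i := 0; i < n; i++ {
		go func() {
			for wb := range tasks {
				if err := persist(wb); err == nil {
					done(wb)
				}
			}
		}()
	}
	return tasks
}

// storeLoop only enqueues writes; it never blocks on the disk itself
// (aside from back-pressure when the task channel is full).
func storeLoop(msgs <-chan WriteBatch, ioTasks chan<- WriteBatch) {
	for m := range msgs {
		ioTasks <- m
	}
}
```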