Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Reindexing fails with indexing backpressure #417

Open
dokterbob opened this issue Mar 8, 2023 · 1 comment
Open

[BUG] Reindexing fails with indexing backpressure #417

dokterbob opened this issue Mar 8, 2023 · 1 comment
Labels
🐛 bug Something isn't working

Comments

@dokterbob
Copy link

What is the bug?

When shard indexing back-pressure is enabled and cluster load exceeds the threshold, 429's generated during reindexing crashes the reindexing operation. For some reason, 429's (as opposed to other intermittent errors) do not lead to a retry but crash the reindex process instead.

How can one reproduce the bug?

  1. Enable shard indexing back-pressure.
  2. Max. out server load.
  3. Perform reindexing.

What is the expected behavior?

Retries, ideally with exponential back-off.

What is your host/environment?

OpenSearch 2.6.0, Ubuntu 20.04.3 LTS, deployment from Debian repo, 75-node cluster.

Do you have any additional context?

Example output from reindexing operation:

{
	"completed": true,
	"task": {
		"node": "D8yBWHY2SkqsB0RO4EGOdA",
		"id": 828447209,
		"type": "transport",
		"action": "indices:data/write/reindex",
		"status": {
			"total": 11450121,
			"updated": 7170428,
			"created": 3535836,
			"deleted": 0,
			"batches": 107064,
			"version_conflicts": 0,
			"noops": 0,
			"retries": {
				"bulk": 0,
				"search": 0
			},
			"throttled_millis": 0,
			"requests_per_second": -1.0,
			"throttled_until_millis": 0,
			"slices": [
				{
					"slice_id": 0,
					"total": 1430025,
					"updated": 1274600,
					"created": 0,
					"deleted": 0,
					"batches": 12746,
					"version_conflicts": 0,
					"noops": 0,
					"retries": {
						"bulk": 0,
						"search": 0
					},
					"throttled_millis": 0,
					"requests_per_second": -1.0,
					"throttled_until_millis": 0
				},
				{
					"slice_id": 1,
					"total": 1430133,
					"updated": 764654,
					"created": 520046,
					"deleted": 0,
					"batches": 12847,
					"version_conflicts": 0,
					"noops": 0,
					"retries": {
						"bulk": 0,
						"search": 0
					},
					"throttled_millis": 0,
					"requests_per_second": -1.0,
					"throttled_until_millis": 0
				},
				{
					"slice_id": 2,
					"total": 1432099,
					"updated": 774538,
					"created": 657561,
					"deleted": 0,
					"batches": 14321,
					"version_conflicts": 0,
					"noops": 0,
					"retries": {
						"bulk": 0,
						"search": 0
					},
					"throttled_millis": 0,
					"requests_per_second": -1.0,
					"throttled_until_millis": 0
				},
				{
					"slice_id": 3,
					"total": 1432157,
					"updated": 764625,
					"created": 667532,
					"deleted": 0,
					"batches": 14322,
					"version_conflicts": 0,
					"noops": 0,
					"retries": {
						"bulk": 0,
						"search": 0
					},
					"throttled_millis": 0,
					"requests_per_second": -1.0,
					"throttled_until_millis": 0
				},
				{
					"slice_id": 4,
					"total": 1431793,
					"updated": 1280800,
					"created": 0,
					"deleted": 0,
					"batches": 12808,
					"version_conflicts": 0,
					"noops": 0,
					"retries": {
						"bulk": 0,
						"search": 0
					},
					"throttled_millis": 0,
					"requests_per_second": -1.0,
					"throttled_until_millis": 0
				},
				{
					"slice_id": 5,
					"total": 1431473,
					"updated": 764947,
					"created": 516753,
					"deleted": 0,
					"batches": 12817,
					"version_conflicts": 0,
					"noops": 0,
					"retries": {
						"bulk": 0,
						"search": 0
					},
					"throttled_millis": 0,
					"requests_per_second": -1.0,
					"throttled_until_millis": 0
				},
				{
					"slice_id": 6,
					"total": 1430508,
					"updated": 764718,
					"created": 665790,
					"deleted": 0,
					"batches": 14306,
					"version_conflicts": 0,
					"noops": 0,
					"retries": {
						"bulk": 0,
						"search": 0
					},
					"throttled_millis": 0,
					"requests_per_second": -1.0,
					"throttled_until_millis": 0
				},
				{
					"slice_id": 7,
					"total": 1431933,
					"updated": 781546,
					"created": 508154,
					"deleted": 0,
					"batches": 12897,
					"version_conflicts": 0,
					"noops": 0,
					"retries": {
						"bulk": 0,
						"search": 0
					},
					"throttled_millis": 0,
					"requests_per_second": -1.0,
					"throttled_until_millis": 0
				}
			]
		},
		"description": "reindex from [ipfs_files_v9] to [ipfs_other_v1]",
		"start_time_in_millis": 1678232080365,
		"running_time_in_nanos": 53075245689662,
		"cancellable": true,
		"cancelled": false,
		"headers": {}
	},
	"response": {
		"took": 53075191,
		"timed_out": false,
		"total": 11450121,
		"updated": 7170428,
		"created": 3535836,
		"deleted": 0,
		"batches": 107064,
		"version_conflicts": 0,
		"noops": 0,
		"retries": {
			"bulk": 0,
			"search": 0
		},
		"throttled": "0s",
		"throttled_millis": 0,
		"requests_per_second": -1.0,
		"throttled_until": "0s",
		"throttled_until_millis": 0,
		"slices": [
			{
				"slice_id": 0,
				"total": 1430025,
				"updated": 1274600,
				"created": 0,
				"deleted": 0,
				"batches": 12746,
				"version_conflicts": 0,
				"noops": 0,
				"retries": {
					"bulk": 0,
					"search": 0
				},
				"throttled": "0s",
				"throttled_millis": 0,
				"requests_per_second": -1.0,
				"throttled_until": "0s",
				"throttled_until_millis": 0
			},
			{
				"slice_id": 1,
				"total": 1430133,
				"updated": 764654,
				"created": 520046,
				"deleted": 0,
				"batches": 12847,
				"version_conflicts": 0,
				"noops": 0,
				"retries": {
					"bulk": 0,
					"search": 0
				},
				"throttled": "0s",
				"throttled_millis": 0,
				"requests_per_second": -1.0,
				"throttled_until": "0s",
				"throttled_until_millis": 0
			},
			{
				"slice_id": 2,
				"total": 1432099,
				"updated": 774538,
				"created": 657561,
				"deleted": 0,
				"batches": 14321,
				"version_conflicts": 0,
				"noops": 0,
				"retries": {
					"bulk": 0,
					"search": 0
				},
				"throttled": "0s",
				"throttled_millis": 0,
				"requests_per_second": -1.0,
				"throttled_until": "0s",
				"throttled_until_millis": 0
			},
			{
				"slice_id": 3,
				"total": 1432157,
				"updated": 764625,
				"created": 667532,
				"deleted": 0,
				"batches": 14322,
				"version_conflicts": 0,
				"noops": 0,
				"retries": {
					"bulk": 0,
					"search": 0
				},
				"throttled": "0s",
				"throttled_millis": 0,
				"requests_per_second": -1.0,
				"throttled_until": "0s",
				"throttled_until_millis": 0
			},
			{
				"slice_id": 4,
				"total": 1431793,
				"updated": 1280800,
				"created": 0,
				"deleted": 0,
				"batches": 12808,
				"version_conflicts": 0,
				"noops": 0,
				"retries": {
					"bulk": 0,
					"search": 0
				},
				"throttled": "0s",
				"throttled_millis": 0,
				"requests_per_second": -1.0,
				"throttled_until": "0s",
				"throttled_until_millis": 0
			},
			{
				"slice_id": 5,
				"total": 1431473,
				"updated": 764947,
				"created": 516753,
				"deleted": 0,
				"batches": 12817,
				"version_conflicts": 0,
				"noops": 0,
				"retries": {
					"bulk": 0,
					"search": 0
				},
				"throttled": "0s",
				"throttled_millis": 0,
				"requests_per_second": -1.0,
				"throttled_until": "0s",
				"throttled_until_millis": 0
			},
			{
				"slice_id": 6,
				"total": 1430508,
				"updated": 764718,
				"created": 665790,
				"deleted": 0,
				"batches": 14306,
				"version_conflicts": 0,
				"noops": 0,
				"retries": {
					"bulk": 0,
					"search": 0
				},
				"throttled": "0s",
				"throttled_millis": 0,
				"requests_per_second": -1.0,
				"throttled_until": "0s",
				"throttled_until_millis": 0
			},
			{
				"slice_id": 7,
				"total": 1431933,
				"updated": 781546,
				"created": 508154,
				"deleted": 0,
				"batches": 12897,
				"version_conflicts": 0,
				"noops": 0,
				"retries": {
					"bulk": 0,
					"search": 0
				},
				"throttled": "0s",
				"throttled_millis": 0,
				"requests_per_second": -1.0,
				"throttled_until": "0s",
				"throttled_until_millis": 0
			}
		],
		"failures": [
			{
				"index": "ipfs_files_v9",
				"shard": 624,
				"node": "lPxMyNf_TgCoN96mczN8RQ",
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@e99343b on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 6.4ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1662, completed tasks = 272789524]]"
				}
			},
			{
				"index": "ipfs_files_v9",
				"shard": 905,
				"node": "lPxMyNf_TgCoN96mczN8RQ",
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@5e6cb1df on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 15.3ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1675, completed tasks = 272789511]]"
				}
			},
			{
				"index": "ipfs_files_v9",
				"shard": 524,
				"node": "lPxMyNf_TgCoN96mczN8RQ",
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@1c969a1c on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 7.8ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1649, completed tasks = 272789537]]"
				}
			},
			{
				"index": "ipfs_files_v9",
				"shard": 276,
				"node": "lPxMyNf_TgCoN96mczN8RQ",
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@e4bfb6f on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 7.8ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1649, completed tasks = 272789537]]"
				}
			},
			{
				"index": "ipfs_files_v9",
				"shard": 821,
				"node": "lPxMyNf_TgCoN96mczN8RQ",
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@33098fce on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 5.7ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1615, completed tasks = 272789631]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@2cb9f42b on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 5.8ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1647, completed tasks = 272789539]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@27e1566b on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 3.8ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1645, completed tasks = 272789541]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@46fcfeb9 on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 3.8ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1645, completed tasks = 272789541]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@4e55205c on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 3.8ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1645, completed tasks = 272789541]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@7a935d1c on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 3.8ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1645, completed tasks = 272789541]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@3e88e732 on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 3.2ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1644, completed tasks = 272789542]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@792e7c53 on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 3.2ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1644, completed tasks = 272789542]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@5aca9948 on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 3.8ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1645, completed tasks = 272789541]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@232a2758 on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 2.8ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1643, completed tasks = 272789543]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@68f906ba on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 3.2ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1644, completed tasks = 272789542]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@4b448d4f on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 2.8ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1643, completed tasks = 272789543]]"
				}
			},
			{
				"shard": -1,
				"status": 429,
				"reason": {
					"type": "rejected_execution_exception",
					"reason": "rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@433c74a1 on QueueResizingOpenSearchThreadPoolExecutor[name = furber/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 3.2ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@786a3ad5[Running, pool size = 19, active threads = 19, queued tasks = 1644, completed tasks = 272789542]]"
				}
			}
		]
	}
}
@dblock
Copy link
Member

dblock commented Mar 8, 2023

Want to try to implement a unit test for the response above similar to the one here?

@wbeckler wbeckler added the 🐛 bug Something isn't working label Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐛 bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants