improvement(gemini): run gemini from scylladb/gemini #8953

Merged
merged 1 commit into from
Jan 14, 2025

Conversation

CodeLieutenant
Contributor

@CodeLieutenant CodeLieutenant commented Oct 10, 2024

The Dockerfile has been moved to the gemini project, and SCT should now use that image.

Changes:

  • Use image from scylladb/gemini
  • Remove the Dockerfile for gemini and point to the image location in the README
  • Add CQL Statement Logging to gemini output
  • Forward outputs from docker to $HOME/*.log
  • Run default gemini flags from gemini_thread.py
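
To illustrate the last item, here is a minimal sketch of how default flags could be merged into a user-supplied gemini command. It assumes the defaults live in a plain dict and are rendered as `--key=value` unless the user's command already sets the same flag. The three keys are taken from the `gemini_thread.py` diff quoted further down in this conversation; `DEFAULT_FLAGS` and `build_gemini_flags` are hypothetical names, not the actual SCT code.

```python
import shlex

# Hypothetical defaults: the three keys appear in the gemini_thread.py diff
# quoted in this PR; the surrounding structure is an assumption.
DEFAULT_FLAGS = {
    "partition-key-distribution": "normal",
    "token-range-slices": 512,
    "partition-key-buffer-reuse-size": 100,
}


def build_gemini_flags(user_cmd: str, defaults: dict = DEFAULT_FLAGS) -> str:
    """Append default flags that the user command does not already set."""
    tokens = shlex.split(user_cmd)
    # Collect flag names the user passed, for both "--k=v" and "--k v" forms.
    user_flags = {t[2:].split("=", 1)[0] for t in tokens if t.startswith("--")}
    extra = [
        f"--{name}={shlex.quote(str(value))}"
        for name, value in defaults.items()
        if name not in user_flags
    ]
    return " ".join([user_cmd.strip(), *extra])


print(build_gemini_flags("--duration 3h --token-range-slices 1024"))
```

Skipping defaults the user already set avoids passing the same flag twice, which is visible for `--non-interactive` and `--replication-strategy` in the command logged later in this thread.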

Testing

@CodeLieutenant CodeLieutenant added the backport/none (Backport is not required) and test-integration (Enable running the integration tests suite) labels Oct 10, 2024
@CodeLieutenant CodeLieutenant self-assigned this Oct 10, 2024
@fruch
Contributor

fruch commented Oct 10, 2024

@CodeLieutenant

I want to see a 10h run with the statement logs, so we can see how bad it is (in size).

@CodeLieutenant
Contributor Author

I'm starting a new run right now; I'll reference it once it has started.
The branch name change made the other run invalid.

@fruch
Contributor

fruch commented Oct 13, 2024

@CodeLieutenant

From the 10h run (that lasted 12min),
it seems some parameters are not passed down correctly to the gemini command:

09:46:31  < t:2024-10-11 06:46:31,119 f:gemini_thread.py l:171  c:sdcm.gemini_thread   p:ERROR > Error: gemini encountered errors, exiting with non zero status
09:46:31  < t:2024-10-11 06:46:31,119 f:gemini_thread.py l:171  c:sdcm.gemini_thread   p:ERROR > /bin/sh: --warmup: not found
09:46:31  < t:2024-10-11 06:46:31,119 f:gemini_thread.py l:171  c:sdcm.gemini_thread   p:ERROR > /bin/sh: --concurrency: not found
09:46:31  < t:2024-10-11 06:46:31,119 f:gemini_thread.py l:171  c:sdcm.gemini_thread   p:ERROR > /bin/sh: --mode: not found

@CodeLieutenant
Contributor Author

Fixed with the last commit. Gemini now looks stable for 43m, but after that the same thing happens: it fails on validation. Judging by the statement inside the select, the generated statement is invalid for the schema, yet gemini just continues. Issue 431
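
For context on the `/bin/sh: --warmup: not found` errors: that is what a shell prints when it tries to run `--warmup` as a command of its own, which typically happens when the command string reaches `sh -c` with line breaks between the flags. A minimal sketch of one way to guard against that, assuming the command is assembled as a multi-line string. The helper name `flatten_cmd` is hypothetical, and this is not necessarily the fix applied in this PR.

```python
import shlex


def flatten_cmd(cmd: str) -> str:
    """Collapse a multi-line command into one shell-safe line.

    shlex.split treats newlines as ordinary whitespace, so flags that sat
    on their own lines end up as arguments of the same command.
    """
    return shlex.join(shlex.split(cmd))


multi_line = """gemini --duration 3h
    --warmup 30m
    --concurrency 50
    --mode mixed"""

print(flatten_cmd(multi_line))
```

`shlex.join` (Python 3.8+) re-quotes each argument, so values containing spaces or quotes, such as the replication strategy, stay a single argument after flattening.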

@CodeLieutenant CodeLieutenant force-pushed the feat--gemini-from-dockerhub branch from a0e0cff to 7e38bab Compare October 21, 2024 13:49
@CodeLieutenant CodeLieutenant force-pushed the feat--gemini-from-dockerhub branch 2 times, most recently from 22d8710 to f20b322 Compare November 6, 2024 11:21
@fruch
Contributor

fruch commented Nov 12, 2024

This is waiting for fixes in gemini to stabilize it; we can't use gemini in its current state.

@CodeLieutenant CodeLieutenant force-pushed the feat--gemini-from-dockerhub branch 9 times, most recently from b2a8f41 to 733b25b Compare November 25, 2024 23:41
@CodeLieutenant CodeLieutenant force-pushed the feat--gemini-from-dockerhub branch 4 times, most recently from e6f5847 to 5f16c38 Compare November 28, 2024 22:56
@fruch
Contributor

fruch commented Dec 2, 2024

@CodeLieutenant
What's the status of this one?
Can you fix the links to the jobs? (They don't point at anything.)

defaults/test_default.yaml (outdated review thread, resolved)
defaults/test_default.yaml (outdated review thread, resolved)
Contributor

@fruch fruch left a comment

2 last pending issues,
and also fix the pre-commit autopep8 issues.

@yarongilor
Contributor

@roydahan, @fruch, I tried the new Gemini but it seems to fail after warmup or something.
It got this error:

2025-01-06 15:09:09.391: (GeminiStressEvent Severity.ERROR) period_type=end event_id=e117e030-303e-45b3-94e5-6ed5b237fd30 during_nemesis=GrowShrinkCluster duration=31m44s: node=Node gemini-with-grow-shrink-gemini-w-loader-node-2a990123-1 [3.239.160.42 | 10.12.0.38] gemini_cmd=gemini                 --non-interactive                 --oracle-cluster="10.12.0.225"                 --test-cluster="10.12.2.215,10.12.2.198,10.12.2.3,10.12.3.84"                 --seed=79                 --schema-seed=79                 --profiling-port=6060                 --bind=0.0.0.0:2121                 --outfile=/gemini_result_82d6fc08-69de-4bd6-9583-71dd977d5c81.log                 --replication-strategy="{'class': 'NetworkTopologyStrategy', 'replication_factor': '3'}"                 --oracle-replication-strategy="{'class': 'NetworkTopologyStrategy', 'replication_factor': '1'}" --level=info --request-timeout=60s --connect-timeout=60s --consistency=QUORUM --dataset-size=large --oracle-host-selection-policy=token-aware --test-host-selection-policy=token-aware --drop-schema=true --fail-fast=true --materialized-views=false --use-server-timestamps=true --use-lwt=false --use-counters=false --max-tables=1 --max-columns=16 --min-columns=8 --max-partition-keys=6 --min-partition-keys=2 --max-clustering-keys=4 --min-clustering-keys=2 --partition-key-distribution=normal --token-range-slices=512 --partition-key-buffer-reuse-size=100 --duration 3h --warmup 30m --concurrency 50 --mode mixed -f --non-interactive --cql-features normal --max-mutation-retries 10 --max-mutation-retries-backoff 500ms --async-objects-stabilization-attempts 5 --async-objects-stabilization-backoff 500ms --replication-strategy "{'class': 'NetworkTopologyStrategy', 'replication_factor': '3'}" --oracle-replication-strategy "{'class': 'NetworkTopologyStrategy', 'replication_factor': '1'}"
result=Exit code: 137
Command output: ['{"L":"INFO","T":"2025-01-06T15:07:29.957Z","N":"work cycle.validation_job","M":"starting validation loop"}', '{"L":"INFO","T":"2025-01-06T15:07:29.951Z","N":"work cycle.mutation_job","M":"starting mutation loop"}']

@fruch
Contributor

fruch commented Jan 6, 2025

It died because of OOM:
scylladb/gemini#420

It's a known issue; the root cause is still unknown.
@CodeLieutenant please add --seed=79 to that issue
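
The `Exit code: 137` in the event fits that diagnosis. By shell convention, codes above 128 mean the process was terminated by signal `code - 128`, and 137 - 128 = 9 is `SIGKILL`, which is what the Linux OOM killer sends. A small sketch of that decoding (`describe_exit_code` is a hypothetical helper, not part of SCT):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a shell-style exit code to a human-readable description."""
    if code > 128:
        try:
            # Codes above 128 encode "terminated by signal (code - 128)".
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exit code {code} (unknown signal {code - 128})"
    return f"exit code {code}"


print(describe_exit_code(137))
```

A `SIGKILL` alone does not prove an OOM kill, since anything can send that signal; the kernel log on the loader node is the place to confirm it.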

@fruch
Contributor

fruch commented Jan 6, 2025

@CodeLieutenant

precommit is failing on this:

16:15:41  diff --git a/sdcm/gemini_thread.py b/sdcm/gemini_thread.py
16:15:41  index 81352b88..6faff390 100644
16:15:41  --- a/sdcm/gemini_thread.py
16:15:41  +++ b/sdcm/gemini_thread.py
16:15:41  @@ -98,7 +98,7 @@ class GeminiStressThread(DockerBasedStressThread):  # pylint: disable=too-many-i
16:15:41               'partition-key-distribution': 'normal',  # Distribution for hitting the partition
16:15:41               # These two are used to control the memory usage of Gemini
16:15:41               'token-range-slices': 512,  # Number of partitions
16:15:41  -            'partition-key-buffer-reuse-size': 100, # Internal Channel Size per parittion value generation
16:15:41  +            'partition-key-buffer-reuse-size': 100,  # Internal Channel Size per parittion value generation
16:15:41           }
16:15:41   
16:15:41           self.gemini_oracle_statements_file = f"gemini_oracle_statements_{self.unique_id}.log"
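
The diff above is the pycodestyle rule E261 (at least two spaces before an inline comment), which autopep8 fixes automatically. Here is a rough sketch of that check using only the standard library `tokenize` module. It illustrates the rule and is not how autopep8 or pre-commit implement it; the function name is made up.

```python
import io
import tokenize

# Token types that carry no code, so a comment following them is not "inline".
_NON_CODE = (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT)


def inline_comments_missing_two_spaces(source: str) -> list[int]:
    """Return line numbers where an inline comment has fewer than 2 spaces before it."""
    bad_lines = []
    prev = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and prev is not None:
            same_line = prev.end[0] == tok.start[0]
            follows_code = prev.type not in _NON_CODE
            # Gap between the end of the previous token and the '#'.
            if same_line and follows_code and tok.start[1] - prev.end[1] < 2:
                bad_lines.append(tok.start[0])
        prev = tok
    return bad_lines


sample = (
    "opts = {\n"
    "    'token-range-slices': 512,  # two spaces, fine\n"
    "    'partition-key-buffer-reuse-size': 100, # one space, flagged\n"
    "}\n"
)
print(inline_comments_missing_two_spaces(sample))
```

Running the real hooks locally with `pre-commit run --all-files` before pushing catches this kind of failure earlier than CI does.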

@CodeLieutenant CodeLieutenant force-pushed the feat--gemini-from-dockerhub branch 3 times, most recently from ff5de8d to d17442e Compare January 10, 2025 08:16
@CodeLieutenant CodeLieutenant requested a review from fruch January 10, 2025 13:16
@CodeLieutenant CodeLieutenant force-pushed the feat--gemini-from-dockerhub branch from d17442e to 0822faa Compare January 10, 2025 13:21
Dockerfile has been moved to gemini project, and now
SCT should use that image.

Changes:
- Use image from scylladb/gemini
- Remove dockerfile for gemini and point in readme location for the
  images
- Add CQL Statement Logging to gemini output
- Forward outputs from docker to $HOME/*.log
- Run default gemini flags from gemini_thread.py

Signed-off-by: Dusan Malusev <[email protected]>
@CodeLieutenant CodeLieutenant force-pushed the feat--gemini-from-dockerhub branch from 0822faa to 30ddba1 Compare January 10, 2025 13:23
Contributor

@fruch fruch left a comment

LGTM

@fruch fruch merged commit 193c79b into scylladb:master Jan 14, 2025
8 checks passed
Labels
backport/none (Backport is not required), promoted-to-master, test-integration (Enable running the integration tests suite)
4 participants