test(elasticity): Added a test of 90% utilization with a lot of small tables #9785
base: master
Conversation
Keeping this PR as a draft since I'm not sure it's worth merging.
Could you elaborate more on why you need to make so many changes to decorators?
It's an ad hoc variation of a latency decorator. It's not a must either and can be removed.
If we need to make only small to no changes to the decorator, I think this is worth merging.
If we remove this new decorator, we still get all the needed latency results via Grafana.
I think for this test case it will be enough (even though I was pushing "let's use the decorator" before); it is a weird test case that we are not yet prepared for, and we cannot call it a performance test regardless, due to methodology changes. I'd add a configuration file and document the shortcomings of this test case in it, and then we can merge safely.
Force-pushed from ffb43a1 to 58d3702
Any reason why it is still in draft?
Force-pushed from 58d3702 to 245e5d2
Ready for review now.
Force-pushed from 7fa68dc to 2f70af1
Force-pushed from e601f9a to 958a02c
@pehala, all checks passed and comments addressed.
The test in Argus is failing: the timeout is reached, and from the graphs I see more than 90% disk utilization. Is this a Scylla issue? If yes, please add a reference to it.
Can you clarify what is tested here: the possibility of creating 500 tables? Or that each table has 1 tablet (since it's small)?
This was a debug run; this branch wasn't originally meant to be merged. I started another test to see that it runs as expected.
@pehala, the test basically accomplished the task of running a 500-table workload with 90% utilization.
They for sure need to be resolved.
… tables: Test splitting the 90% utilization among a lot of small tables.
Force-pushed from 6025504 to a5c93b9
Force-pushed from a5c93b9 to 348ad37
Force-pushed from 348ad37 to 06e0b4e
@vponomaryov, you recently optimized the 5000-tables-in-one-keyspace case that uses the machinery touched here. Can you please review it?
Your referenced test job has 8 batches with 62 stress commands, but its config has batch_size: 125.
So, you have some bug in this scope.
longevity_test.py (outdated)
user_profile_table_count = self.params.get('user_profile_table_count')  # pylint: disable=invalid-name
# Creating extra table during batch only for a high number of tables (>500).
create_extra_tables = True if int(user_profile_table_count) > 500 else False
This approach looks strange.
Why exactly 500? What if later we have 400 and 800 cases?
So, we need to either make it configurable or describe the reasoning for it here.
Also, this particular condition can be simplified to the following:
create_extra_tables = int(user_profile_table_count) > 500
Well, I don't really know the idea behind adding an extra table, so I didn't want to impact the existing code flow.
@fruch, can you please advise why an extra table is needed? Or perhaps it's not needed anymore and can be removed in a follow-up fix? Would adding a parameter for it be reasonable, or is it overkill?
In the original 5000-table case, we created all 5000 tables upfront.
We wanted a few to be created at run time, via c-s, as well.
In each cycle we added 4 of them; I don't know how it became just 1.
And yes, if you want to take it out for some tests, it should be configurable.
No one's going to remember what this logic you did here means in two weeks.
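For illustration only, a minimal sketch of how such a configuration-driven switch could look, assuming a dict-like SCT params object; the option name 'add_cs_user_profiles_extra_tables' is a hypothetical placeholder, not necessarily the option introduced in this PR:

# Hedged sketch: gate the extra-table creation on an explicit config option
# instead of the hard-coded ">500" heuristic.
def should_create_extra_tables(params) -> bool:
    """Decide whether extra tables are created during each batch.

    `params` is assumed to behave like a dict-style config (supports .get()).
    """
    explicit = params.get('add_cs_user_profiles_extra_tables')
    if explicit is not None:
        return bool(explicit)
    # Fall back to the previous behaviour when the option is not set.
    return int(params.get('user_profile_table_count') or 0) > 500

# Example usage with plain dicts standing in for the SCT configuration:
assert should_create_extra_tables({'add_cs_user_profiles_extra_tables': True}) is True
assert should_create_extra_tables({'user_profile_table_count': 500}) is False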
OK, @fruch, fixed by:
@vponomaryov, I don't see any bug according to the log:
Force-pushed from 06e0b4e to 47da586
Argus link from the PR description: https://argus.scylladb.com/tests/scylla-cluster-tests/a22f9c23-6289-497d-8202-db2b4ba85d2f
Part of the log:
Force-pushed from 47da586 to 7449ba7
@vponomaryov, the loader memory doesn't care how many tables there are; it cares about the c-s threads' memory consumption. So there are 1000 stresses in 8 cycles of 125 threads each, i.e. 2 c-s stresses per table.
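For reference, a small sketch of the batch arithmetic described in this thread (the constants mirror the numbers above; tables_per_batch is my interpretation of where the reviewer's "62" comes from):

import math

TABLE_COUNT = 500        # user-profile tables in the test
STRESSES_PER_TABLE = 2   # one c-s write and one c-s read command per table
BATCH_SIZE = 125         # stress commands (threads) started per batch cycle

total_stresses = TABLE_COUNT * STRESSES_PER_TABLE       # 1000
batch_count = math.ceil(total_stresses / BATCH_SIZE)    # 8 cycles
tables_per_batch = BATCH_SIZE // STRESSES_PER_TABLE     # 62

print(total_stresses, batch_count, tables_per_batch)    # 1000 8 62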
Force-pushed from f5a2b5e to d613b1a
Force-pushed from d613b1a to e377881
…% utilization scenario: The test_user_batch_custom_time test is fixed and improved in order to run both the 5000-tables scenario and a scenario like 500 tables for testing 90% utilization with many small tables.
Force-pushed from e377881 to 459b90b
@yarongilor new branch
@scylladb/qa-maintainers, all comments are resolved, please review. cc: @pehala
This is a test case for having 90% disk utilization with a lot of small tables.
The data is split equally among 500 tables.
The dataset size is aligned with 'i4i.xlarge'.
It uses a c-s user-profile template for all 500 tables.
It runs 4 batches of 125 tables each.
On each batch cycle, 125 tables are created, then a load is generated for all of these tables.
When all the 125 stress writes/reads are done, it continues with the next batch until the stress for all 500 tables is completed (after 4 cycles).
Each one of the 500 tables has both write and read load.
Closes #9309
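Below is a minimal, hedged sketch of the batch flow described in this PR; create_tables() and run_write_read_stress() are placeholder names for illustration, not real SCT helpers:

# Hedged sketch of the described test flow: 4 cycles, 125 tables per cycle.
TOTAL_TABLES = 500
TABLES_PER_BATCH = 125

def create_tables(names):
    # Placeholder: in the real test the tables come from a c-s user profile.
    print(f"creating {len(names)} tables: {names[0]} .. {names[-1]}")

def run_write_read_stress(names):
    # Placeholder: each table gets both a write and a read c-s command,
    # and the cycle waits for all of them to finish before continuing.
    print(f"running {2 * len(names)} stress commands")

all_tables = [f"table_{i}" for i in range(1, TOTAL_TABLES + 1)]
for start in range(0, TOTAL_TABLES, TABLES_PER_BATCH):   # 4 batch cycles
    batch = all_tables[start:start + TABLES_PER_BATCH]
    create_tables(batch)
    run_write_read_stress(batch)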