Long running stress test never terminates #136

Open
paymog opened this issue May 19, 2021 · 2 comments

paymog commented May 19, 2021

I'm using tlp-stress to validate that Cassandra running on AWS Graviton instances behaves as I expect. I've run 5-minute tests against a few Cassandra clusters on different instance types, and those tests went well.

Yesterday I kicked off multi-hour tests, and it looks like I've hit an edge case in tlp-stress. Here's the command I used to start them: tlp-stress run KeyValue --compaction lcs --deleterate 0 --host cassandra-paymahn-testing-amd-chill.cassandra-paymahn-testing-amd-chill --csv /results/2021-05-18 15:20:46.582214/keyvalue-lcs.csv --duration 3hr.

When I view the csv file I see entries starting at 2021-05-18T18:20:56 and going until just a few minutes ago, 2021-05-19T12:49:26.
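
For reference, the gap between those two timestamps can be compared with the requested duration. This is a minimal sketch in Python, assuming only the two timestamps quoted above and a 3 hour target; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Timestamps copied from the first and last CSV entries quoted above.
first_entry = datetime.fromisoformat("2021-05-18T18:20:56")
last_entry = datetime.fromisoformat("2021-05-19T12:49:26")

requested = timedelta(hours=3)      # what --duration 3hr is expected to mean
elapsed = last_entry - first_entry  # how long the CSV kept receiving rows
overrun = elapsed - requested

print(f"elapsed: {elapsed}")  # elapsed: 18:28:30
print(f"overrun: {overrun}")  # overrun: 15:28:30
```

So the CSV kept growing for about 18.5 hours on a run that was asked to stop after 3.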

I started the above command in the background on a node in k8s using &. When I run ps aux, I see that the process is still running:

root@cassandra-paymahn-load-testing2-toolbox-7df9f98df4-6nzqz:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   5476   528 ?        Ss   May18   0:00 sleep infinity
root          96  0.0  0.0   7236  4060 pts/0    Ss+  May18   0:00 /bin/bash
root         149  0.0  0.0  17792 11676 pts/0    S    May18   0:00 python3.8 /opt/pycasstoolbox/scripts/load-test run --duration 3hr --extended-testing cassandra-paymahn-testing-amd-chill.cassandra-paymahn-testing-amd-chill
root         180  0.0  0.0   6972  3620 pts/0    S    May18   0:00 /bin/bash /opt/tlp-stress/bin/tlp-stress run KeyValue --compaction lcs --deleterate 0 --host cassandra-paymahn-testing-amd-chill.cassandra-paymahn-testing-amd-chill --csv /results/2021-05-18 15:20:46.582214/keyvalue-lcs.csv --duration 3hr
root         186 15.5  2.3 6577400 367664 pts/0  Sl   May18 170:39 java -jar build/libs/tlp-stress-4.0.0-all.jar run KeyValue --compaction lcs --deleterate 0 --host cassandra-paymahn-testing-amd-chill.cassandra-paymahn-testing-amd-chill --csv /results/2021-05-18 15:20:46.582214/keyvalue-lcs.csv --duration 3hr
root         252  1.0  0.0   7236  4100 pts/1    Ss   12:39   0:00 /bin/bash
root         292  0.0  0.0   8892  3304 pts/1    R+   12:39   0:00 ps aux

It seems like there might be a bug in tlp-stress with long-running tests.

EDIT: note that the tlp-stress process was started on May 18 for a duration of 3 hours (and it was started early in the day) and now I'm posting this on May 19.


paymog commented May 19, 2021

Ah, looks like 3hr might not parse correctly:

I wonder why tlp-stress started at all. And when I look at one of the logs of a test invocation I see that 3hr does seem to get parsed correctly:

root@cassandra-paymahn-load-testing2-toolbox-7df9f98df4-6nzqz:/results# load-test run --duration  3hr --extended-testing cassandra-paymahn-testing-amd-chill.cassandra-paymahn-testing-amd-chill &
[1] 149
root@cassandra-paymahn-load-testing2-toolbox-7df9f98df4-6nzqz:/results#

-------------------- Running KeyValue stress test with stcs compaction
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
Creating tlp_stress:
CREATE KEYSPACE
 IF NOT EXISTS tlp_stress
 WITH replication = {'class': 'SimpleStrategy', 'replication_factor':3 }

Creating schema
Executing 0 operations with consistency level LOCAL_ONE
Connected
Creating Tables
CREATE TABLE IF NOT EXISTS keyvalue (
                        key text PRIMARY KEY,
                        value text
                        ) WITH compaction = {
  'class' : 'SizeTieredCompactionStrategy'
} AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} AND default_time_to_live = 0
Preparing queries
Initializing metrics
Connecting
Creating generator random
Preparing statements.
1 threads prepared.
Starting main runner
Running
[Thread 0]: Running the profile for 180min...
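
The 180min in the last log line is what 3hr comes to when the duration string is converted to minutes. The sketch below is not tlp-stress's actual parser. It is a hypothetical Python illustration of that conversion: the function name, the regex, and the accepted unit spellings are all assumptions. It raises an error on anything it does not understand.

```python
import re

# Hypothetical unit table: these spellings are assumptions for illustration,
# not the set of units tlp-stress actually accepts.
_MINUTES_PER_UNIT = {
    "d": 24 * 60, "day": 24 * 60,
    "h": 60, "hr": 60, "hour": 60,
    "m": 1, "min": 1,
}

# One token is a whole number followed by a unit, e.g. "3hr" or "30 m".
_TOKEN = re.compile(r"\s*(\d+)\s*([a-z]+)", re.IGNORECASE)


def parse_duration_minutes(text: str) -> int:
    """Convert a string such as '3hr' or '1h 30m' to whole minutes."""
    stripped = text.strip()
    if not stripped:
        raise ValueError("empty duration")
    total = 0
    pos = 0
    while pos < len(stripped):
        match = _TOKEN.match(stripped, pos)
        if match is None:
            raise ValueError(f"cannot parse duration: {text!r}")
        amount, unit = match.group(1), match.group(2).lower()
        if unit not in _MINUTES_PER_UNIT:
            raise ValueError(f"unknown unit {unit!r} in {text!r}")
        total += int(amount) * _MINUTES_PER_UNIT[unit]
        pos = match.end()
    return total


print(parse_duration_minutes("3hr"))  # 180
```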


paymog commented May 19, 2021

Also, when I look at the csv output I see that many of the data points at the end are identical:

2021-05-19T13:54:51.445Z,81585,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:54:54.445Z,81588,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:54:57.445Z,81591,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:55:00.445Z,81594,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:55:03.445Z,81597,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:55:06.445Z,81600,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:55:09.445Z,81603,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:55:12.445Z,81606,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:55:15.445Z,81609,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:55:18.445Z,81612,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:55:21.445Z,81615,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0

And when I look at the dashboards in Datadog, I see that CPU usage dropped off after the 3hr mark.
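
One way to find where the CSV flatlines is to walk backwards from the last row while the metric columns stay identical. The sketch below assumes the first two columns are the timestamp and an elapsed-seconds counter (a guess from the sample rows, since the header row is not shown), and the helper name is hypothetical.

```python
import csv
import io
from typing import List, Optional


def first_flatlined_row(rows: List[List[str]], time_columns: int = 2) -> Optional[int]:
    """Return the index where the trailing run of identical metric columns
    begins, or None if the last two rows differ."""
    if len(rows) < 2:
        return None
    last_metrics = rows[-1][time_columns:]
    index = len(rows) - 1
    # Walk backwards while the metric columns match the final row.
    while index > 0 and rows[index - 1][time_columns:] == last_metrics:
        index -= 1
    return index if index < len(rows) - 1 else None


# Three rows copied from the CSV excerpt above.
sample = """\
2021-05-19T13:54:51.445Z,81585,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:54:54.445Z,81588,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
2021-05-19T13:54:57.445Z,81591,177225696,66.524784,2.964393875E-314,177222644,75.35003999999999,2.964393875E-314,0,0.0,0.0,0,0.0
"""
rows = list(csv.reader(io.StringIO(sample)))
print(first_flatlined_row(rows))  # 0: the whole excerpt is already flat
```

Run over the full file, the returned index points at the row where the counters froze, and that row's timestamp can be compared with the expected end of the run.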
