Implement YAML+Jinja2 support, code re-org, remove out-dated functions#264

Open
hagertnl wants to merge 30 commits into
olcf:develfrom
hagertnl:nick-issue219-yaml-jinja2

@hagertnl
Contributor

A few interesting changes in this PR:

  • Implement YAML rgt_test_input.yaml, plus .template.j2 job template
  • Remove all the PYTHONPATH stuff from the modulefile in favor of a single sys.path addition in the core user-facing binaries (e.g., runtests.py). This is much better practice than polluting the heck out of the environment.
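For reference, the single sys.path addition could look roughly like the sketch below. The helper name and the `levels_up` default are hypothetical; the default assumes a layout like `<root>/harness/bin/runtests.py`, where the directory to make importable is the parent of `bin`. The actual entry points in this PR may do it differently.

```python
import sys
from pathlib import Path


def add_harness_root_to_sys_path(entry_point, levels_up=1):
    """Prepend an ancestor directory of the entry point to sys.path.

    levels_up=1 is an assumption: for <root>/harness/bin/runtests.py it
    selects <root>/harness. Adjust it to the real package layout.
    """
    root = str(Path(entry_point).resolve().parents[levels_up])
    # Avoid stacking duplicate entries if several entry points call this.
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

An entry point would call `add_harness_root_to_sys_path(__file__)` before its harness imports. The change only affects that Python process, so nothing leaks into the user's shell environment the way a PYTHONPATH export in a modulefile does.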

Planned changes before merging:

  • Restructure machine_types to split out the schedulers, and move rgt_tests.py over to libraries, where it belongs
  • Possibly other modernization, as needed

@hagertnl
Contributor Author

Added a lot more cleanup here now:

  • broke out schedulers into their own directory
  • removed old RunTimeEnvironment section of .ini file (never used, probably broken)
  • removed get_new_environment functionality; will re-implement later if needed, but the scheduler has most of this type of functionality nowadays
  • removed unnecessarily-spec'd replacements: walltime, total_processes, processes_per_node, executable_path. The harness doesn't need to know any of these; the user should be free to define whatever fields they want, e.g., to provide a time to their batch job. The primary example is if you want to provide Slurm & LSF job scripts to a single test: you can't use a single walltime field, because they use different time formats
  • removed IBM_POWER9 "machine_type", since it can be fully represented by linux_x86_64. There is nothing specific to IBM POWER9 that requires a different machine_type
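To make the time-format point concrete, here is a small sketch of why one walltime value can't serve both schedulers: `sbatch --time` accepts hours:minutes:seconds (among other forms), while `bsub -W` takes `[hour:]minute`. The helper names and the replacement keys (`slurm_time`, `lsf_time`) are hypothetical and not part of the harness.

```python
def slurm_walltime(minutes):
    """Format minutes as HH:MM:SS, one of the forms `sbatch --time` accepts."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}:00"


def lsf_walltime(minutes):
    """Format minutes as H:MM, matching the `[hour:]minute` form of `bsub -W`."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    hours, mins = divmod(minutes, 60)
    return f"{hours}:{mins:02d}"


def user_replacements(minutes):
    """Hypothetical user-defined fields, one per scheduler's job script."""
    return {
        "slurm_time": slurm_walltime(minutes),
        "lsf_time": lsf_walltime(minutes),
    }
```

With user-defined fields, a test that ships both a Slurm and an LSF job script can give each one a time string in its own format, or simply hard-code the value in the script.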

@hagertnl hagertnl marked this pull request as ready for review April 30, 2026 20:26
@hagertnl hagertnl changed the title Implement YAML+Jinja2 support, code re-org Implement YAML+Jinja2 support, code re-org, remove out-dated functions Apr 30, 2026
@hagertnl hagertnl linked an issue Apr 30, 2026 that may be closed by this pull request
@hagertnl
Contributor Author

hagertnl commented May 4, 2026

I just finished implementing better CTRL+C handling as well, to bundle into this PR:

$ ./run_tests.sh
Failed to import Kafka backend: No module named 'confluent_kafka'
Failed to import Kafka backend: No module named 'confluent_kafka'
Starting tasks for harness_unit_tests.test_long_build_long_run: ['start_tests', 'stop_tests']
Using machine config: borg.ini
Failed to import Kafka backend: No module named 'confluent_kafka'
Failed to import Kafka backend: No module named 'confluent_kafka'
Path to Source: /autofs/nccs-svm1_proj/stf243/hagertnl/harness/unit_tests/harness_unit_tests/Source
Path to Build: /lustre/orion/proj-shared/stf243/hagertnl/harness_sspace/borg/05.04.26-10.58/harness_unit_tests/test_long_build_long_run/1777906730.3121712/build_directory
Path to Run_Archive: /autofs/nccs-svm1_proj/stf243/hagertnl/harness/unit_tests/harness_unit_tests/test_long_build_long_run/Run_Archive/1777906730.3121712
^CDetected CTRL+C, aborting build.
No submit action due to prior failed build.
The command 'test_harness_driver.py -r -l borg_test/hagertnl@2026-05-04T10:58:50.10 --loglevel WARNING' has exited with a failure.
The exit return value is 1.

Starting tasks for harness_unit_tests.test_long_build_long_run: ['start_tests', 'stop_tests']
Test harness_unit_tests.test_long_build_long_run failed to launch.


Using machine config: borg.ini
Failed to import Kafka backend: No module named 'confluent_kafka'
Failed to import Kafka backend: No module named 'confluent_kafka'
Path to Source: /autofs/nccs-svm1_proj/stf243/hagertnl/harness/unit_tests/harness_unit_tests/Source
Path to Build: /lustre/orion/proj-shared/stf243/hagertnl/harness_sspace/borg/05.04.26-10.58/harness_unit_tests/test_long_build_long_run/1777906733.029087/build_directory
Path to Run_Archive: /autofs/nccs-svm1_proj/stf243/hagertnl/harness/unit_tests/harness_unit_tests/test_long_build_long_run/Run_Archive/1777906733.029087
SLURM jobID = 600345
Test harness_unit_tests.test_long_build_long_run is launched.


Launched 1 tests, failed to launch 1 tests.
Failed tests:
	harness_unit_tests.test_long_build_long_run

A CTRL+C now cancels the currently-running build and lets that build log a failed build_end event before exiting, so it leaves appropriate breadcrumbs behind. The main thread effectively ignores any CTRL+C. In the future, we may want an "if I get 2 CTRL+Cs within 3 seconds, cancel everything" type of functionality, but I have a feeling we may move away from this method of multithreading test submissions as we push to modernize, so I don't think it's worth the development time now.
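The build-side behavior described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `run_build`, `build_step`, and `log_event` are hypothetical names, and only the build_end event name and the printed message are taken from the comment and output above.

```python
def run_build(build_step, log_event):
    """Run one build; on CTRL+C, record a failed build_end and return nonzero.

    build_step: hypothetical zero-argument callable that performs the build.
    log_event:  hypothetical callable taking (event_name, exit_status).
    """
    try:
        build_step()
    except KeyboardInterrupt:
        # Python raises KeyboardInterrupt in the main thread on SIGINT.
        print("Detected CTRL+C, aborting build.")
        log_event("build_end", 1)
        return 1
    log_event("build_end", 0)
    return 0
```

A nonzero return lets the caller skip the submit step, which matches the "No submit action due to prior failed build." line in the output. How the parent process ignores the signal is not shown here; other exceptions from the build step are deliberately left unhandled in this sketch.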

@hagertnl
Contributor Author

hagertnl commented May 5, 2026

Full list of changes that I think are in this PR:

  • Added YAML input file & Jinja2 template support (must be used together)
  • Removed all the PYTHONPATH modifications from the modulefile in favor of a single sys.path addition in the user-facing entry points (e.g., runtests.py, test_harness_driver.py, etc.). This is better practice than polluting the heck out of the user's global Python environment.
  • Moved schedulers into their own directory instead of being under "machine_types"
  • Removed old RunTimeEnvironment section of .ini file (never used, probably doesn't even work, not a feature I'd advise folks to use)
  • Removed the now-unused get_new_environment functionality (was part of RunTimeEnvironment); will re-implement later if needed, but the scheduler has most of this type of functionality nowadays
  • Removed unnecessarily-spec'd replacements: walltime, total_processes, processes_per_node, executable_path. The harness doesn't need to know any of these; the user should be free to define whatever fields they want, e.g., to provide a time to their batch job. The primary example is if you want to provide Slurm & LSF job scripts to a single test: you can't use a single walltime field, because they use different time formats. So the exact field should be up to the user. If they want to hard-code a 10-minute wall time in the batch script, more power to them.
  • Removed IBM_POWER9 "machine_type", since it can be fully represented by linux_x86_64. There is nothing specific to IBM POWER9 that requires a different machine_type
  • Added graceful handling of CTRL+C. The new behavior is to cancel the currently-running build steps, but NOT the parent process that launched them.

Comment thread harness/bin/check_executable_driver.py
Comment thread harness/bin/log_binary_execution_time.py
Comment thread harness/bin/runtests.py
Comment thread harness/bin/test_harness_driver.py Outdated
Comment thread harness/libraries/layout_of_apps_directory.py Outdated
Comment thread harness/utilities/add_comment_to_databases.py Outdated
Comment thread harness/utilities/check_utility.py Outdated
Comment thread harness/utilities/report_to_databases.py Outdated
Comment thread harness/utilities/rgt_archive_tests.py Outdated
Comment thread harness/utilities/update_databases.py Outdated
@hagertnl
Contributor Author

@ddietz89 @AcerP-py, docs have been updated to include YAML+Jinja2 descriptions & examples; let me know if there are any more changes you need! I'm currently evaluating how safe it is to switch /sw/acceptance/olcf-test-harness-dev to this branch for in-production testing now. I'm almost certain it's backwards compatible, at least for Frontier's context (obviously, power9 was removed, so it's not 100% backwards compatible).

2 participants