
Add comprehensive CI documentation for research software #1

Open
srghosh56 wants to merge 16 commits into main from add-ci-documentation

Conversation

@srghosh56
Owner

Overview

This PR adds two comprehensive documentation pages for advanced CI/CD practices in research software development. These guides address common challenges faced by research software projects that require specialized testing infrastructure and complex validation matrices.

Files Added

1. pages/your_tasks/local_gitlab_ci_infra_for_github_projectV4.md

Title: "Using local GitLab CI infrastructure for your GitHub project"

This guide addresses the challenge of leveraging organizational GitLab CI infrastructure with specialized hardware (GPUs, HPC systems, alternative architectures) for GitHub-hosted research projects.

Key topics covered:

  • Repository mirroring infrastructure setup
  • Fork integration systems for external contributors
  • Bidirectional CI status reporting between platforms
  • Security and access management for hybrid workflows
  • Real-world implementation examples from HPC research environments

Target audience: Research software developers whose projects need specialized CI resources beyond GitHub's free tier limitations.

2. pages/your_tasks/ci_testing_matrix_for_research_softwareV4.md

Title: "Managing complex CI testing matrices for research software"

This guide tackles the combinatorial explosion problem in testing research software across multiple compilers, libraries, architectures, and platforms while managing resource constraints.

Key topics covered:

  • Pairwise testing algorithms to reduce job combinations from thousands to dozens
  • Optimization strategies (e.g., 2800 combinations → ~60-100 jobs)
  • Container optimization and wave scheduling for resource management
  • Performance testing integration and regression detection
  • Dynamic job generation and infrastructure management

Target audience: Developers of performance-portable libraries, simulation codes, and research software requiring extensive cross-platform validation.

Content Quality and Sources

Both documents are based on real-world implementations from:

  • Helmholtz-Zentrum Dresden-Rossendorf research infrastructure
  • Alpaka performance-portability library project
  • PIConGPU particle-in-cell simulation code
  • Published research: "Continuous Integration in Complex Research Software - Handling Complexity" (Zenodo: 14643958)

The guides include:

  • Code examples and configuration snippets
  • Analysis of optimization strategies
  • Resource requirement tables and performance metrics
  • Implementation guidance
  • Comprehensive reference lists and external resources

Formatting and Structure

Both files follow the RSQKit template structure:

  • Problem-focused question headings
  • Description → Considerations → Solutions format
  • Proper YAML frontmatter with metadata
  • Related pages and training resource links
  • Tool references using RSQKit's tagging system

Request for Review

These documents represent comprehensive guides for advanced CI/CD scenarios in research software development.

Please review for:

  • Content accuracy and completeness
  • Writing clarity and organization
  • Any missing considerations or alternative approaches

Thank you for taking the time to review these materials!


@tobiashuste tobiashuste left a comment


I went through the first document and added comments. Thanks for creating it.

- 4 CMake versions
- 7 Boost library versions

This results in **2,800 potential combinations**, requiring approximately **280 hours** of compute time at 6 minutes per job, even with 30 parallel runners.

I think ~9 hours should be correct. (2800/30*6 minutes).

Suggested change
This results in **2,800 potential combinations**, requiring approximately **280 hours** of compute time at 6 minutes per job, even with 30 parallel runners.
This results in **2,800 potential combinations**, requiring approximately **9 hours** of compute time at 6 minutes per job, even with 30 parallel runners.
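The arithmetic behind the corrected figure can be checked in a couple of lines, using only the job count, per-job duration, and runner count quoted above:

```python
# Total serial compute time: 2,800 jobs at 6 minutes each.
total_minutes = 2800 * 6              # 16,800 minutes of compute
# Wall-clock time when 30 runners work in parallel.
wall_clock_hours = total_minutes / 30 / 60
print(round(wall_clock_hours, 1))     # prints 9.3
```

This also explains where the original number came from: 16,800 minutes is 280 hours of *serial* compute time; dividing by the 30 parallel runners gives roughly 9.3 hours of wall-clock time.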

# Each combination of compiler + CUDA version appears at least once
```

- **Develop domain-specific combination rules**: Create libraries that encode your project's specific compatibility requirements and testing priorities, such as the Alpaka Job Matrix Library approach.

What is the Alpaka Job Matrix Library approach referring to? This probably needs a reference.
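For readers unfamiliar with the technique, a greedy pairwise generator can be sketched as below. The parameter values are illustrative placeholders, not Alpaka's actual support matrix, and the Alpaka Job Matrix Library differs in its details:

```python
from itertools import combinations, product

# Illustrative parameter space (values are examples only).
parameters = {
    "compiler": ["gcc-11", "gcc-12", "clang-15", "clang-16"],
    "cuda": ["11.8", "12.0", "12.2"],
    "cmake": ["3.22", "3.26"],
    "boost": ["1.78", "1.80", "1.82"],
}
names = list(parameters)

# All 2-way interactions that a pairwise suite must cover.
remaining = {
    ((a, va), (b, vb))
    for a, b in combinations(range(len(names)), 2)
    for va in parameters[names[a]]
    for vb in parameters[names[b]]
}

def uncovered_pairs(combo, remaining):
    """Count parameter-value pairs in `combo` not yet covered."""
    return sum(
        1
        for a, b in combinations(range(len(names)), 2)
        if ((a, combo[a]), (b, combo[b])) in remaining
    )

all_combos = list(product(*parameters.values()))
suite = []
# Greedy: repeatedly pick the combination covering the most new pairs.
while remaining:
    best = max(all_combos, key=lambda c: uncovered_pairs(c, remaining))
    suite.append(best)
    for a, b in combinations(range(len(names)), 2):
        remaining.discard(((a, best[a]), (b, best[b])))

print(f"{len(all_combos)} full combinations -> {len(suite)} pairwise jobs")
```

The suite is typically an order of magnitude smaller than the full cross product while still exercising every 2-way interaction at least once, which is the property the quoted comment refers to.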


### Description

Research software, particularly performance-portable libraries and simulation codes, often needs to support extensive combinations of compilers, library versions, target architectures, and runtime environments. For example, accelerator abstraction libraries like Alpaka require testing across multiple GCC versions, Clang versions, CUDA SDK versions, CMake versions, and Boost versions. A naive approach testing all combinations can create thousands of test jobs, making CI pipelines impractically long and resource-intensive.

Reference Alpaka

exclusions:
  - cuda_version: "11.0"
    gcc_version: "gcc-11"  # Incompatible combination
  - architecture: "arm64"

As far as I know, CUDA is available on ARM. I suggest choosing another example.

| Pairwise testing | ~60–100 | ~20–30 minutes | All 2-way interactions |
| Random sampling | ~200 | ~40 minutes | Statistical coverage |

- **Use dynamic child pipelines**: Leverage CI systems that support programmatically generated pipeline configurations based on computed test matrices, enabling runtime optimization based on available resources.

This is a prerequisite for implementing all of the above. Only by using dynamic child pipelines can they generate the dynamic CI definition. I would expect this further up, including some comments on why it is necessary.
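As a rough illustration of the mechanism: a generator job computes the reduced matrix and emits a child pipeline definition as an artifact, which a `trigger` job then includes. The sketch below writes the definition as JSON, which is valid YAML; the job names, image naming scheme, and `suite` contents are hypothetical:

```python
import json

# Suppose `suite` is the reduced job list computed elsewhere
# (e.g. by a pairwise generator). Each entry becomes one CI job.
suite = [
    {"compiler": "gcc-12", "cuda": "12.0"},
    {"compiler": "clang-16", "cuda": "11.8"},
]

jobs = {
    f"test-{c['compiler']}-cuda{c['cuda']}": {
        "image": f"builder:{c['compiler']}",  # hypothetical image naming
        "script": [
            "cmake -B build",
            "cmake --build build",
            "ctest --test-dir build",
        ],
        "variables": {"CUDA_VERSION": c["cuda"]},
    }
    for c in suite
}

# JSON is a subset of YAML, so the generated file is a valid child
# pipeline definition that a `trigger` job can include as an artifact.
with open("generated-pipeline.yml", "w") as f:
    json.dump(jobs, f, indent=2)
```

In GitLab terms, a `generate` job would run this script and expose `generated-pipeline.yml` as an artifact, and a second job would use `trigger: include: artifact:` to launch the child pipeline.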

TEST_FILTER: "cpu"
```

- **Implement job filtering**: Allow developers to run a subset of the CI pipeline during development using {% tool "git" %} commit message filters, to avoid running the full pipeline for targeted development work:

What is the difference between both bullets? They describe the same thing, don't they?


### Solutions

#### **Performance Testing Integration**

Do we need bold text in headlines? Headlines should automatically be styled. I would remove it from all sections.


- **Configure performance thresholds**: Establish automated performance regression detection with configurable thresholds for different hardware configurations and algorithm implementations.

#### **Dynamic Job Generation**

This point was already mentioned in the text.

Comment on lines +263 to +266
| Job Duration | <10 minutes average | Pipeline analytics |
| Queue Time | <5 minutes | Runner utilization metrics |
| Failure Rate | <5% for stable configurations | Historical trend analysis |
| Resource Utilization | 70-90% of capacity | Real-time monitoring |

I am missing a hint about where to get these metrics from.
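One place to get these numbers is GitLab's REST API: the pipeline detail endpoint exposes `duration`, `queued_duration`, and `status` fields from which the table's metrics can be derived. A minimal sketch; the `base_url`, `project_id`, and `token` arguments are placeholders for your instance:

```python
import json
from statistics import mean
from urllib.request import Request, urlopen

def summarize(pipelines):
    """Compute the table's metrics from GitLab pipeline detail records.

    Each record is expected to carry the `duration`, `queued_duration`,
    and `status` fields of the pipeline detail endpoint.
    """
    finished = [p for p in pipelines if p.get("duration") is not None]
    return {
        "avg_duration_min": mean(p["duration"] for p in finished) / 60,
        "avg_queue_min": mean(p.get("queued_duration", 0) for p in finished) / 60,
        "failure_rate": sum(p["status"] == "failed" for p in pipelines) / len(pipelines),
    }

def fetch_pipelines(base_url, project_id, token, per_page=50):
    """Fetch recent pipeline details via the GitLab REST API."""
    def get(path):
        req = Request(f"{base_url}/api/v4{path}",
                      headers={"PRIVATE-TOKEN": token})
        with urlopen(req) as resp:
            return json.load(resp)

    listing = get(f"/projects/{project_id}/pipelines?per_page={per_page}")
    # The listing omits durations; fetch each pipeline's detail record.
    return [get(f"/projects/{project_id}/pipelines/{p['id']}") for p in listing]
```

Runner utilization is not exposed this way; it would come from runner-level monitoring (e.g. the GitLab Runner Prometheus metrics endpoint) rather than the pipelines API.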

- **CI integration**: Use GitLab dynamic child pipelines or GitHub Actions matrix strategies

- **Document testing rationale**: Maintain clear documentation explaining testing parameter choices and exclusion rules to facilitate maintenance and onboarding.


I would also add that people should think about resource usage. Even though we could test all combinations, this requires resources and wastes energy. So deciding on the size of the test matrix and when to run which jobs should also take that into account.

@srghosh56 srghosh56 requested a review from tobiashuste June 24, 2025 12:24
@srghosh56
Owner Author

Resolved everything you mentioned. Can you review both files?

@juckel

juckel commented Jun 25, 2025

The test matrix stuff is really great, thanks! For the GitHub-GitLab part, I would have expected more links to documentation on how to implement the various suggested solutions.

@srghosh56
Owner Author

Added relevant links for the GitLab-GitHub integration file.


@tobiashuste tobiashuste left a comment


I read through the second part. There's a lot of information in it, thanks.

What I am missing in general is the practical setup. When reading this as somebody new to the topic, I get a long list of ideas and abstract things to consider, but in the end I would have nearly no idea how to set things up. Also, I miss what is most important to make it work and what is optional for the best reliability.
I would expect the document to be a practical guideline for someone who would like to build something similar. Should this be the idea in the RSQKit?


Many research software projects are hosted on {% tool "github" %} to benefit from its large open-source community and collaboration features. However, GitHub's free CI resources may be insufficient for complex research software that requires specialized hardware (GPUs, specific CPU architectures), extensive testing matrices, or simply more computational resources than the free tier provides. Organizations often have local {% tool "gitlab" %} instances with powerful runners and specialized hardware that could address these limitations.

Research projects like PIConGPU and Alpaka demonstrate this challenge perfectly. These projects require testing across multiple hardware configurations and extensive parameter combinations that exceed GitHub's free tier capabilities, yet benefit from GitHub's collaborative ecosystem for open-source development.

Add links for both PIConGPU and Alpaka here?


#### CI Status Integration

- Configure bidirectional status reporting: Implement a system that sends GitLab pipeline status back to GitHub using commit hashes for identification. Use [GitHub's commit status API](https://docs.github.com/en/rest/commits/statuses) to report build statuses from external CI systems and [GitLab CI/CD pipeline events](https://docs.gitlab.com/ee/user/project/integrations/webhook_events.html#pipeline-events). This ensures pull request status checks are properly updated regardless of the execution platform.

It's probably a good idea to link to https://docs.gitlab.com/ci/ci_cd_for_external_repos/github_integration/ as well.
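A minimal status-reporting hook along the lines of the quoted bullet could look as follows. It uses GitHub's documented commit status endpoint; the `owner`, `repo`, `token`, and `pipeline_url` parameters and the GitLab-to-GitHub state mapping are illustrative choices, not a fixed convention:

```python
import json
from urllib.request import Request, urlopen

# One reasonable mapping of GitLab pipeline statuses onto GitHub's
# four commit states (error / failure / pending / success).
STATE_MAP = {
    "pending": "pending", "running": "pending",
    "success": "success", "failed": "failure", "canceled": "error",
}

def report_status(owner, repo, sha, gitlab_status, pipeline_url, token):
    """POST a commit status to GitHub for the mirrored commit `sha`."""
    payload = {
        "state": STATE_MAP.get(gitlab_status, "error"),
        "target_url": pipeline_url,   # lets reviewers jump to the GitLab logs
        "context": "gitlab/pipeline",
        "description": f"GitLab pipeline: {gitlab_status}",
    }
    req = Request(
        f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.load(resp)
```

A GitLab pipeline-event webhook (or a final pipeline job) would call `report_status` with the pipeline's commit SHA, closing the loop back to the pull request's status checks.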


#### Access Management and Security

- Establish guest access procedures: Create documented procedures for external contributors to access GitLab CI logs and results, including temporary guest access workflows that maintain security boundaries. Configure using [GitLab project members management](https://docs.gitlab.com/ee/user/project/members/) and [guest permissions](https://docs.gitlab.com/ee/user/permissions.html#project-members-permissions) for setting up access control.

Probably one could mention that this is only relevant for non-public projects.


- Configure permission mapping: Establish clear mapping between GitHub repository permissions and GitLab project access levels to ensure appropriate access control.

- Implement audit logging: Maintain comprehensive logs of all cross-platform CI activities for security monitoring and troubleshooting.

If I were an RSE, I would not know what I should do here or how to configure that.


- Establish guest access procedures: Create documented procedures for external contributors to access GitLab CI logs and results, including temporary guest access workflows that maintain security boundaries. Configure using [GitLab project members management](https://docs.gitlab.com/ee/user/project/members/) and [guest permissions](https://docs.gitlab.com/ee/user/permissions.html#project-members-permissions) for setting up access control.

- Configure permission mapping: Establish clear mapping between GitHub repository permissions and GitLab project access levels to ensure appropriate access control.

How to implement that practically? That's what I would ask myself as an RSE.


- Configure runner tagging: Implement comprehensive tagging systems that allow jobs to target specific hardware configurations while maintaining flexibility for resource allocation. See [GitLab Runner tags documentation](https://docs.gitlab.com/ee/ci/runners/configure_runners.html#use-tags-to-control-which-jobs-a-runner-can-run).

- Set up runner pools: Organize runners into pools based on hardware capabilities and project requirements to ensure fair resource distribution across multiple projects. For more information, see [GitLab Runner Tags](https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-section) and [Shared vs Specific Runners](https://docs.gitlab.com/ee/ci/runners/).

What does "pools" refer to? I don't know that term with regard to GitLab CI.

Also, the linked documentation does not mention shared vs. specific runners. To me this is not practical enough without specific examples.


- Set up runner pools: Organize runners into pools based on hardware capabilities and project requirements to ensure fair resource distribution across multiple projects. For more information, see [GitLab Runner Tags](https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-section) and [Shared vs Specific Runners](https://docs.gitlab.com/ee/ci/runners/).

#### Monitoring and Maintenance

Who is the target audience for this document? This all might be a huge challenge for an RSE if there are no specific instructions.


### Solutions

- Implement robust webhook processing: Design webhook handlers with comprehensive error handling, retry logic, and dead letter queues to ensure no GitHub events are lost due to temporary failures.

The webhook is sent out from GitHub. How can I implement retry logic? I thought I could only rely on GitHub's implementation.
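Since GitHub delivers each webhook once and does not retry failed deliveries on its own, the retry logic has to live on the receiving side: acknowledge immediately, then process from a queue with retries and a dead-letter list. A minimal sketch; the `action` callable stands in for whatever triggers the GitLab pipeline:

```python
from collections import deque

class WebhookProcessor:
    """Receiver-side retry for GitHub webhook events.

    The HTTP endpoint calls `receive()` and returns 200 right away;
    `drain()` does the actual work, re-queueing transient failures and
    parking repeated failures in a dead-letter list for manual replay.
    """

    def __init__(self, action, max_retries=3):
        self.action = action          # e.g. a function calling the GitLab API
        self.max_retries = max_retries
        self.queue = deque()          # persist to disk/DB in a real setup
        self.dead_letter = []

    def receive(self, event):
        """Called from the HTTP handler before acknowledging GitHub."""
        self.queue.append((event, 0))

    def drain(self):
        """Process currently queued events, retrying transient failures."""
        for _ in range(len(self.queue)):
            event, attempts = self.queue.popleft()
            try:
                self.action(event)
            except Exception:
                if attempts + 1 >= self.max_retries:
                    self.dead_letter.append(event)  # inspect and replay manually
                else:
                    self.queue.append((event, attempts + 1))
```

In a production setup `drain()` would run in a background worker with backoff between attempts; the point is that the receiver, not GitHub, owns the retries once the event has been accepted.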


- Implement robust webhook processing: Design webhook handlers with comprehensive error handling, retry logic, and dead letter queues to ensure no GitHub events are lost due to temporary failures.

- Configure webhook redundancy: Set up multiple webhook endpoints with failover mechanisms to ensure continuous operation even during maintenance or unexpected outages.

How to do this?


- Configure webhook redundancy: Set up multiple webhook endpoints with failover mechanisms to ensure continuous operation even during maintenance or unexpected outages.

- Establish authentication token rotation: Implement automated token rotation for both GitHub and GitLab APIs to maintain long-term reliability without manual intervention. Configure using [GitHub personal access tokens](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) and [GitLab access tokens](https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html).

This also requires the webhook to be updated automatically, right?

@srghosh56
Owner Author

What I am missing in general is the practical setup. When reading this as somebody new to the topic, I get a long list of ideas and abstract things to consider, but in the end I would have nearly no idea how to set things up. Also, I miss what is most important to make it work and what is optional for the best reliability. I would expect the document to be a practical guideline for someone who would like to build something similar.

I added implementation examples and code snippets wherever possible. Please take a re-look.

Should this be the idea in the RSQKit?

You can check the existing task pages here; this is the template I am following. The current task pages only cover basic CI/CD pipelines, task automation using GitHub Actions and GitLab CI/CD, etc., and they are more on the theoretical side rather than practical step-by-step tutorials. I also understand that the local_gitlab_ci_infra_for_github_project and ci_testing_matrix_for_rs task pages are still quite high-level and abstract, but there is a huge gap between the difficulty level of the task pages that already exist in RSQKit and that of the ones I am trying to create. I tried to incorporate more code snippets and implementation examples to turn them into practical guides and mitigate both problems. They might need more revisions before finally being pushed to RSQKit, which is also subject to the scrutiny of the Editorial Board members. Please let me know if you think any other changes should be made.
