Add comprehensive CI documentation for research software #1
Conversation
- Add guide for using local GitLab CI infrastructure with GitHub projects
- Add guide for managing CI testing matrices
Replace outdated Docker multi-stage builds URL with current documentation link
tobiashuste left a comment
I went through the first document and added comments. Thanks for creating it.
- 4 CMake versions
- 7 Boost library versions
This results in **2,800 potential combinations**, requiring approximately **280 hours** of compute time at 6 minutes per job, even with 30 parallel runners.
I think ~9 hours should be correct. (2800/30*6 minutes).
- This results in **2,800 potential combinations**, requiring approximately **280 hours** of compute time at 6 minutes per job, even with 30 parallel runners.
+ This results in **2,800 potential combinations**, requiring approximately **9 hours** of compute time at 6 minutes per job, even with 30 parallel runners.
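A few lines of arithmetic reconcile the two numbers: 2,800 jobs at 6 minutes each is 280 hours of *total compute time*, but the *wall-clock time* with 30 parallel runners is what the suggestion corrects to roughly 9 hours:

```python
# Checking the suggested correction: 2,800 jobs at 6 minutes each,
# spread over 30 parallel runners.
jobs = 2800
minutes_per_job = 6
runners = 30

total_compute_hours = jobs * minutes_per_job / 60       # 280.0
wall_clock_hours = total_compute_hours / runners        # ~9.3
print(f"{wall_clock_hours:.1f} hours")  # prints "9.3 hours"
```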
# Each combination of compiler + CUDA version appears at least once
- **Develop domain-specific combination rules**: Create libraries that encode your project's specific compatibility requirements and testing priorities, such as the Alpaka Job Matrix Library approach.
What is the Alpaka Job Matrix Library approach referring to? This probably needs a reference.
### Description
Research software, particularly performance-portable libraries and simulation codes, often needs to support extensive combinations of compilers, library versions, target architectures, and runtime environments. For example, accelerator abstraction libraries like Alpaka require testing across multiple GCC versions, Clang versions, CUDA SDK versions, CMake versions, and Boost versions. A naive approach testing all combinations can create thousands of test jobs, making CI pipelines impractically long and resource-intensive.
exclusions:
  - cuda_version: "11.0"
    gcc_version: "gcc-11"  # Incompatible combination
  - architecture: "arm64"
As far as I know, CUDA is available on ARM. I suggest choosing another example.
| Pairwise testing | ~60–100 | ~20–30 minutes | All 2-way interactions |
| Random sampling | ~200 | ~40 minutes | Statistical coverage |
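The reduction strategies above boil down to generating the full Cartesian product and then pruning it. A minimal sketch of exclusion filtering, with illustrative version lists (not the real Alpaka matrix):

```python
from itertools import product

# Illustrative version lists; a real project would read these from config.
gcc_versions = ["gcc-9", "gcc-10", "gcc-11"]
cuda_versions = ["11.0", "11.2", "12.0"]

# Exclusion rule mirroring the YAML snippet: gcc-11 with CUDA 11.0
# is treated as a known-incompatible combination.
def excluded(combo):
    return combo["gcc_version"] == "gcc-11" and combo["cuda_version"] == "11.0"

matrix = [{"gcc_version": g, "cuda_version": c}
          for g, c in product(gcc_versions, cuda_versions)]
kept = [c for c in matrix if not excluded(c)]
print(len(matrix), len(kept))  # prints "9 8"
```

Pairwise or random sampling would then select a subset of `kept` rather than running it in full.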
- **Use dynamic child pipelines**: Leverage CI systems that support programmatically generated pipeline configurations based on computed test matrices, enabling runtime optimization based on available resources.
This is a prerequisite for implementing all of the above. Only by using dynamic child pipelines are they able to generate the dynamic CI definition. I would expect this further up, including some comments on why it is necessary.
TEST_FILTER: "cpu"
- **Implement job filtering**: Allow developers to run a subset of the CI pipeline during development using {% tool "git" %} commit message filters, to avoid running the full pipeline for targeted development work:
What is the difference between both bullets? They describe the same thing, don't they?
### Solutions
#### **Performance Testing Integration**
Do we need bold text in headlines? Headlines should automatically be styled. I would remove it from all sections.
- **Configure performance thresholds**: Establish automated performance regression detection with configurable thresholds for different hardware configurations and algorithm implementations.
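A minimal sketch of such a regression gate; the 10% tolerance and the timings are illustrative, and real thresholds would be stored per hardware configuration:

```python
# Performance regression gate: compare a measured runtime against a
# stored baseline with a configurable relative tolerance.
def regressed(baseline_s: float, measured_s: float, threshold: float = 0.10) -> bool:
    """True if the measured time exceeds the baseline by more than `threshold`."""
    return measured_s > baseline_s * (1.0 + threshold)

assert not regressed(12.0, 12.5)  # within the 10% tolerance: job passes
assert regressed(12.0, 13.5)      # more than 10% slower: fail the CI job
```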
#### **Dynamic Job Generation**
This point was already mentioned in the text.
| Job Duration | <10 minutes average | Pipeline analytics |
| Queue Time | <5 minutes | Runner utilization metrics |
| Failure Rate | <5% for stable configurations | Historical trend analysis |
| Resource Utilization | 70-90% of capacity | Real-time monitoring |
I am missing a hint on where to get these metrics from.
- **CI integration**: Use GitLab dynamic child pipelines or GitHub Actions matrix strategies
- **Document testing rationale**: Maintain clear documentation explaining testing parameter choices and exclusion rules to facilitate maintenance and onboarding.
I would also add that people should think about resource usage. Even though we could test all combinations, doing so consumes resources and wastes energy. So deciding on the size of the test matrix, and when to run which jobs, should also take that into account.
Resolved everything you said. Can you review both files?
The test matrix stuff is really great, thanks! For the GitHub-GitLab one, I would have expected more links to documentation on how to implement the various suggested solutions.
Added relevant links for the GitLab-GitHub integration file. |
tobiashuste left a comment
I read through the second part. There's a lot of information inside, thanks.
What I am missing in general is the practical setup. Reading this as somebody new to the topic, I get a long list of ideas and abstract things to consider, but in the end I would have nearly no idea how to set things up. I would also miss what is essential to make it work and what is optional for the best reliability.
I would expect the document to be some practical guideline for someone who would like to build something similar. Should this be the idea in the RSQKit?
Many research software projects are hosted on {% tool "github" %} to benefit from its large open-source community and collaboration features. However, GitHub's free CI resources may be insufficient for complex research software that requires specialized hardware (GPUs, specific CPU architectures), extensive testing matrices, or simply more computational resources than the free tier provides. Organizations often have local {% tool "gitlab" %} instances with powerful runners and specialized hardware that could address these limitations.
Research projects like PIConGPU and Alpaka demonstrate this challenge perfectly. These projects require testing across multiple hardware configurations and extensive parameter combinations that exceed GitHub's free tier capabilities, yet benefit from GitHub's collaborative ecosystem for open-source development.
Add links for both picongpu and alpaka here?
#### CI Status Integration
- Configure bidirectional status reporting: Implement a system that sends GitLab pipeline status back to GitHub, using commit hashes for identification. Use [GitHub's commit status API](https://docs.github.com/en/rest/commits/statuses) to report build statuses from external CI systems, and [GitLab CI/CD pipeline events](https://docs.gitlab.com/ee/user/project/integrations/webhook_events.html#pipeline-events) to trigger the updates. This ensures pull request status checks are properly updated regardless of the execution platform.
It's probably a good idea to link to https://docs.gitlab.com/ci/ci_cd_for_external_repos/github_integration/ as well.
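The GitHub side of the reporting is a single REST call, `POST /repos/{owner}/{repo}/statuses/{sha}`, as documented in the commit status API linked above. A sketch of the payload; the owner, repo, sha, and pipeline URL are placeholders, and the actual send is only commented:

```python
import json

# Payload shape for GitHub's commit status API.
def status_payload(state: str, pipeline_url: str) -> dict:
    return {
        "state": state,                # "success", "failure", "pending", or "error"
        "target_url": pipeline_url,    # deep link back to the GitLab pipeline
        "description": "GitLab CI pipeline",
        "context": "ci/gitlab",        # name shown in the PR checks list
    }

payload = status_payload(
    "success", "https://gitlab.example.org/group/project/-/pipelines/123")
print(json.dumps(payload))
# Sending it would look roughly like (token needs repo:status scope):
#   requests.post(f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}",
#                 headers={"Authorization": f"Bearer {token}"}, json=payload)
```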
#### Access Management and Security
- Establish guest access procedures: Create documented procedures for external contributors to access GitLab CI logs and results, including temporary guest access workflows that maintain security boundaries. Configure using [GitLab project members management](https://docs.gitlab.com/ee/user/project/members/) and [guest permissions](https://docs.gitlab.com/ee/user/permissions.html#project-members-permissions) for setting up access control.
Probably one could mention that this is only relevant for non-public projects.
- Configure permission mapping: Establish clear mapping between GitHub repository permissions and GitLab project access levels to ensure appropriate access control.
- Implement audit logging: Maintain comprehensive logs of all cross-platform CI activities for security monitoring and troubleshooting.
If I were an RSE I would not know what I should do here or how to configure that.
- Configure permission mapping: Establish clear mapping between GitHub repository permissions and GitLab project access levels to ensure appropriate access control.
How to implement that practically? That's what I would ask myself as an RSE.
- Configure runner tagging: Implement comprehensive tagging systems that allow jobs to target specific hardware configurations while maintaining flexibility for resource allocation. See [GitLab Runner tags documentation](https://docs.gitlab.com/ee/ci/runners/configure_runners.html#use-tags-to-control-which-jobs-a-runner-can-run).
- Set up runner pools: Organize runners into pools based on hardware capabilities and project requirements to ensure fair resource distribution across multiple projects. For more information, see [GitLab Runner Tags](https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-section) and [Shared vs Specific Runners](https://docs.gitlab.com/ee/ci/runners/).
What are "pools" referring to? I don't know that term with regard to GitLab CI.
Also, the linked documentation does not mention shared vs. specific runners. To me this is not practical enough without specific examples.
#### Monitoring and Maintenance
Who is the target audience for this document? This all might be a huge challenge for an RSE if there are no specific instructions.
### Solutions
- Implement robust webhook processing: Design webhook handlers with comprehensive error handling, retry logic, and dead letter queues to ensure no GitHub events are lost due to temporary failures.
The webhook is sent out from GitHub. How can I implement retry logic? I thought I can only rely on GitHub's implementation.
- Configure webhook redundancy: Set up multiple webhook endpoints with failover mechanisms to ensure continuous operation even during maintenance or unexpected outages.
- Establish authentication token rotation: Implement automated token rotation for both GitHub and GitLab APIs to maintain long-term reliability without manual intervention. Configure using [GitHub personal access tokens](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) and [GitLab access tokens](https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html).
This also requires the webhook to be updated automatically, right?
I added implementation examples and code snippets wherever possible. Please take another look.
You can check the existing task pages here. This is the template I am following. The current task pages only go up to basic CI/CD pipelines, task automation using GitHub Actions and GitLab CI/CD, etc., and they are more on the theoretical side rather than practical step-by-step tutorials. I understand that the local_gitlab_ci_infra_for_github_project and ci_testing_matrix_for_rs task pages are still quite high-level and abstract, but there is also a huge gap between the difficulty level of the task pages that already exist in RSQKit and that of the ones I am trying to create. I tried to incorporate more code snippets and implementation examples to make them practical guides and mitigate both problems. They might need more revisions before finally being merged into RSQKit, which is also subject to scrutiny by the Editorial Board members. But please let me know if you think any other changes are needed.
Overview
This PR adds two comprehensive documentation pages for advanced CI/CD practices in research software development. These guides address common challenges faced by research software projects that require specialized testing infrastructure and complex validation matrices.
Files Added
1. `pages/your_tasks/local_gitlab_ci_infra_for_github_projectV4.md`
   Title: "Using local GitLab CI infrastructure for your GitHub project"
   This guide addresses the challenge of leveraging organizational GitLab CI infrastructure with specialized hardware (GPUs, HPC systems, alternative architectures) for GitHub-hosted research projects.
Key topics covered:
Target audience: Research software developers whose projects need specialized CI resources beyond GitHub's free tier limitations.
2. `pages/your_tasks/ci_testing_matrix_for_research_softwareV4.md`
   Title: "Managing complex CI testing matrices for research software"
   This guide tackles the combinatorial explosion problem in testing research software across multiple compilers, libraries, architectures, and platforms while managing resource constraints.
Key topics covered:
Target audience: Developers of performance-portable libraries, simulation codes, and research software requiring extensive cross-platform validation.
Content Quality and Sources
Both documents are based on real-world implementations from:
The guides include:
Formatting and Structure
Both files follow the RSQKit template structure:
Request for Review
These documents represent comprehensive guides for advanced CI/CD scenarios in research software development.
Please review for:
Thank you for taking the time to review these materials!