
Add comprehensive CI documentation for research software #1

Open
srghosh56 wants to merge 16 commits into main from add-ci-documentation

Conversation

@srghosh56
Owner

Overview

This PR adds two comprehensive documentation pages for advanced CI/CD practices in research software development. These guides address common challenges faced by research software projects that require specialized testing infrastructure and complex validation matrices.

Files Added

1. pages/your_tasks/local_gitlab_ci_infra_for_github_projectV4.md

Title: "Using local GitLab CI infrastructure for your GitHub project"

This guide addresses the challenge of leveraging organizational GitLab CI infrastructure with specialized hardware (GPUs, HPC systems, alternative architectures) for GitHub-hosted research projects.

Key topics covered:

  • Repository mirroring infrastructure setup
  • Fork integration systems for external contributors
  • Bidirectional CI status reporting between platforms
  • Security and access management for hybrid workflows
  • Real-world implementation examples from HPC research environments

Target audience: Research software developers whose projects need specialized CI resources beyond GitHub's free tier limitations.

2. pages/your_tasks/ci_testing_matrix_for_research_softwareV4.md

Title: "Managing complex CI testing matrices for research software"

This guide tackles the combinatorial explosion problem in testing research software across multiple compilers, libraries, architectures, and platforms while managing resource constraints.

Key topics covered:

  • Pairwise testing algorithms to reduce job combinations from thousands to dozens
  • Optimization strategies (e.g., 2800 combinations → ~60-100 jobs)
  • Container optimization and wave scheduling for resource management
  • Performance testing integration and regression detection
  • Dynamic job generation and infrastructure management

Target audience: Developers of performance-portable libraries, simulation codes, and research software requiring extensive cross-platform validation.

Content Quality and Sources

Both documents are based on real-world implementations from:

  • Helmholtz-Zentrum Dresden-Rossendorf research infrastructure
  • Alpaka performance-portability library project
  • PIConGPU particle-in-cell simulation code
  • Published research: "Continuous Integration in Complex Research Software - Handling Complexity" (Zenodo: 14643958)

The guides include:

  • Code examples and configuration snippets
  • Analysis of optimization strategies
  • Resource requirement tables and performance metrics
  • Implementation guidance
  • Comprehensive reference lists and external resources

Formatting and Structure

Both files follow the RSQKit template structure:

  • Problem-focused question headings
  • Description → Considerations → Solutions format
  • Proper YAML frontmatter with metadata
  • Related pages and training resource links
  • Tool references using RSQKit's tagging system

Request for Review

These documents represent comprehensive guides for advanced CI/CD scenarios in research software development.

Please review for:

  • Content accuracy and completeness
  • Writing clarity and organization
  • Any missing considerations or alternative approaches

Thank you for taking the time to review these materials!


@tobiashuste tobiashuste left a comment


I went through the first document and added comments. Thanks for creating it.

- 4 CMake versions
- 7 Boost library versions

This results in **2,800 potential combinations**, requiring approximately **280 hours** of compute time at 6 minutes per job, even with 30 parallel runners.

I think ~9 hours should be correct. (2800/30*6 minutes).

Suggested change
This results in **2,800 potential combinations**, requiring approximately **280 hours** of compute time at 6 minutes per job, even with 30 parallel runners.
This results in **2,800 potential combinations**, requiring approximately **9 hours** of compute time at 6 minutes per job, even with 30 parallel runners.
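The arithmetic behind the corrected figure can be checked in a couple of lines, using only the job count, per-job duration, and runner count quoted above:

```python
# Total serial compute time: 2,800 jobs at 6 minutes each.
total_minutes = 2800 * 6              # 16,800 minutes of compute
# Wall-clock time when 30 runners work in parallel.
wall_clock_hours = total_minutes / 30 / 60
print(round(wall_clock_hours, 1))     # prints 9.3
```

This also explains where the original number came from: 16,800 minutes is 280 hours of *serial* compute time; dividing by the 30 parallel runners gives roughly 9.3 hours of wall-clock time.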

# Each combination of compiler + CUDA version appears at least once
```

- **Develop domain-specific combination rules**: Create libraries that encode your project's specific compatibility requirements and testing priorities, such as the Alpaka Job Matrix Library approach.

What is the Alpaka Job Matrix Library approach referring to? This probably needs a reference.
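For readers unfamiliar with the technique, a greedy pairwise generator can be sketched as below. The parameter values are illustrative placeholders, not Alpaka's actual support matrix, and the Alpaka Job Matrix Library differs in its details:

```python
from itertools import combinations, product

# Illustrative parameter space (values are examples only).
parameters = {
    "compiler": ["gcc-11", "gcc-12", "clang-15", "clang-16"],
    "cuda": ["11.8", "12.0", "12.2"],
    "cmake": ["3.22", "3.26"],
    "boost": ["1.78", "1.80", "1.82"],
}
names = list(parameters)

# All 2-way interactions that a pairwise suite must cover.
remaining = {
    ((a, va), (b, vb))
    for a, b in combinations(range(len(names)), 2)
    for va in parameters[names[a]]
    for vb in parameters[names[b]]
}

def uncovered_pairs(combo, remaining):
    """Count parameter-value pairs in `combo` not yet covered."""
    return sum(
        1
        for a, b in combinations(range(len(names)), 2)
        if ((a, combo[a]), (b, combo[b])) in remaining
    )

all_combos = list(product(*parameters.values()))
suite = []
# Greedy: repeatedly pick the combination covering the most new pairs.
while remaining:
    best = max(all_combos, key=lambda c: uncovered_pairs(c, remaining))
    suite.append(best)
    for a, b in combinations(range(len(names)), 2):
        remaining.discard(((a, best[a]), (b, best[b])))

print(f"{len(all_combos)} full combinations -> {len(suite)} pairwise jobs")
```

The suite is typically an order of magnitude smaller than the full cross product while still exercising every 2-way interaction at least once, which is the property the quoted comment refers to.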


### Description

Research software, particularly performance-portable libraries and simulation codes, often needs to support extensive combinations of compilers, library versions, target architectures, and runtime environments. For example, accelerator abstraction libraries like Alpaka require testing across multiple GCC versions, Clang versions, CUDA SDK versions, CMake versions, and Boost versions. A naive approach testing all combinations can create thousands of test jobs, making CI pipelines impractically long and resource-intensive.

Reference Alpaka

exclusions:
  - cuda_version: "11.0"
    gcc_version: "gcc-11"  # Incompatible combination
  - architecture: "arm64"

As far as I know, CUDA is available on ARM. I suggest choosing another example.

| Pairwise testing | ~60–100 | ~20–30 minutes | All 2-way interactions |
| Random sampling | ~200 | ~40 minutes | Statistical coverage |

- **Use dynamic child pipelines**: Leverage CI systems that support programmatically generated pipeline configurations based on computed test matrices, enabling runtime optimization based on available resources.

This is a prerequisite for implementing all of the above. Only by using dynamic child pipelines can they generate the dynamic CI definition. I would expect this further up, including some comments on why it is necessary.
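As a rough illustration of the mechanism: a generator job computes the reduced matrix and emits a child pipeline definition as an artifact, which a `trigger` job then includes. The sketch below writes the definition as JSON, which is valid YAML; the job names, image naming scheme, and `suite` contents are hypothetical:

```python
import json

# Suppose `suite` is the reduced job list computed elsewhere
# (e.g. by a pairwise generator). Each entry becomes one CI job.
suite = [
    {"compiler": "gcc-12", "cuda": "12.0"},
    {"compiler": "clang-16", "cuda": "11.8"},
]

jobs = {
    f"test-{c['compiler']}-cuda{c['cuda']}": {
        "image": f"builder:{c['compiler']}",  # hypothetical image naming
        "script": [
            "cmake -B build",
            "cmake --build build",
            "ctest --test-dir build",
        ],
        "variables": {"CUDA_VERSION": c["cuda"]},
    }
    for c in suite
}

# JSON is a subset of YAML, so the generated file is a valid child
# pipeline definition that a `trigger` job can include as an artifact.
with open("generated-pipeline.yml", "w") as f:
    json.dump(jobs, f, indent=2)
```

In GitLab terms, a `generate` job would run this script and expose `generated-pipeline.yml` as an artifact, and a second job would use `trigger: include: artifact:` to launch the child pipeline.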

TEST_FILTER: "cpu"
```

- **Implement job filtering**: Allow developers to run a subset of the CI pipeline during development using {% tool "git" %} commit message filters, to avoid running the full pipeline for targeted development work:

What is the difference between both bullets? They describe the same thing, don't they?


### Solutions

#### **Performance Testing Integration**

Do we need bold text in headlines? Headlines should automatically be styled. I would remove it from all sections.


- **Configure performance thresholds**: Establish automated performance regression detection with configurable thresholds for different hardware configurations and algorithm implementations.

#### **Dynamic Job Generation**

This point was already mentioned in the text.

Comment on lines +263 to +266
| Job Duration | <10 minutes average | Pipeline analytics |
| Queue Time | <5 minutes | Runner utilization metrics |
| Failure Rate | <5% for stable configurations | Historical trend analysis |
| Resource Utilization | 70-90% of capacity | Real-time monitoring |

I am missing a hint about where to get these metrics from.
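One place to get these numbers is GitLab's REST API: the pipeline detail endpoint exposes `duration`, `queued_duration`, and `status` fields from which the table's metrics can be derived. A minimal sketch; the `base_url`, `project_id`, and `token` arguments are placeholders for your instance:

```python
import json
from statistics import mean
from urllib.request import Request, urlopen

def summarize(pipelines):
    """Compute the table's metrics from GitLab pipeline detail records.

    Each record is expected to carry the `duration`, `queued_duration`,
    and `status` fields of the pipeline detail endpoint.
    """
    finished = [p for p in pipelines if p.get("duration") is not None]
    return {
        "avg_duration_min": mean(p["duration"] for p in finished) / 60,
        "avg_queue_min": mean(p.get("queued_duration", 0) for p in finished) / 60,
        "failure_rate": sum(p["status"] == "failed" for p in pipelines) / len(pipelines),
    }

def fetch_pipelines(base_url, project_id, token, per_page=50):
    """Fetch recent pipeline details via the GitLab REST API."""
    def get(path):
        req = Request(f"{base_url}/api/v4{path}",
                      headers={"PRIVATE-TOKEN": token})
        with urlopen(req) as resp:
            return json.load(resp)

    listing = get(f"/projects/{project_id}/pipelines?per_page={per_page}")
    # The listing omits durations; fetch each pipeline's detail record.
    return [get(f"/projects/{project_id}/pipelines/{p['id']}") for p in listing]
```

Runner utilization is not exposed this way; it would come from runner-level monitoring (e.g. the GitLab Runner Prometheus metrics endpoint) rather than the pipelines API.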

- **CI integration**: Use GitLab dynamic child pipelines or GitHub Actions matrix strategies

- **Document testing rationale**: Maintain clear documentation explaining testing parameter choices and exclusion rules to facilitate maintenance and onboarding.


I would also add that people should think about resource usage. Even though we could test all combinations, this requires resources and wastes energy. So deciding on the size of the test matrix and when to run which jobs should also take that into account.

@srghosh56 srghosh56 requested a review from tobiashuste June 24, 2025 12:24
@srghosh56
Owner Author

Resolved everything you mentioned. Can you review both files?

@juckel

juckel commented Jun 25, 2025

The test matrix stuff is really great, thanks! For the GitHub-GitLab part, I would have expected more links to documentation on how to implement the various suggested solutions.

@srghosh56
Owner Author

Added relevant links for the GitLab-GitHub integration file.


@tobiashuste tobiashuste left a comment


I read through the second part. There's a lot of information in it, thanks.

What I am missing in general is the practical setup. When reading this as somebody new to the topic, I get a long list of ideas and abstract things to consider, but in the end I would have nearly no idea how to set things up. Also, I miss what is most important to make it work and what is optional for the best reliability.
I would expect the document to be a practical guideline for someone who would like to build something similar. Should this be the idea in the RSQKit?


Many research software projects are hosted on {% tool "github" %} to benefit from its large open-source community and collaboration features. However, GitHub's free CI resources may be insufficient for complex research software that requires specialized hardware (GPUs, specific CPU architectures), extensive testing matrices, or simply more computational resources than the free tier provides. Organizations often have local {% tool "gitlab" %} instances with powerful runners and specialized hardware that could address these limitations.

Research projects like PIConGPU and Alpaka demonstrate this challenge perfectly. These projects require testing across multiple hardware configurations and extensive parameter combinations that exceed GitHub's free tier capabilities, yet benefit from GitHub's collaborative ecosystem for open-source development.

Add links for both PIConGPU and Alpaka here?


#### CI Status Integration

- Configure bidirectional status reporting: Implement a system that sends GitLab pipeline status back to GitHub using commit hashes for identification. Use [GitHub's commit status API](https://docs.github.com/en/rest/commits/statuses) to report build statuses from external CI systems and [GitLab CI/CD pipeline events](https://docs.gitlab.com/ee/user/project/integrations/webhook_events.html#pipeline-events). This ensures pull request status checks are properly updated regardless of the execution platform.

It's probably a good idea to link to https://docs.gitlab.com/ci/ci_cd_for_external_repos/github_integration/ as well.
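A minimal status-reporting hook along the lines of the quoted bullet could look as follows. It uses GitHub's documented commit status endpoint; the `owner`, `repo`, `token`, and `pipeline_url` parameters and the GitLab-to-GitHub state mapping are illustrative choices, not a fixed convention:

```python
import json
from urllib.request import Request, urlopen

# One reasonable mapping of GitLab pipeline statuses onto GitHub's
# four commit states (error / failure / pending / success).
STATE_MAP = {
    "pending": "pending", "running": "pending",
    "success": "success", "failed": "failure", "canceled": "error",
}

def report_status(owner, repo, sha, gitlab_status, pipeline_url, token):
    """POST a commit status to GitHub for the mirrored commit `sha`."""
    payload = {
        "state": STATE_MAP.get(gitlab_status, "error"),
        "target_url": pipeline_url,   # lets reviewers jump to the GitLab logs
        "context": "gitlab/pipeline",
        "description": f"GitLab pipeline: {gitlab_status}",
    }
    req = Request(
        f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.load(resp)
```

A GitLab pipeline-event webhook (or a final pipeline job) would call `report_status` with the pipeline's commit SHA, closing the loop back to the pull request's status checks.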


#### Access Management and Security

- Establish guest access procedures: Create documented procedures for external contributors to access GitLab CI logs and results, including temporary guest access workflows that maintain security boundaries. Configure using [GitLab project members management](https://docs.gitlab.com/ee/user/project/members/) and [guest permissions](https://docs.gitlab.com/ee/user/permissions.html#project-members-permissions) for setting up access control.

Probably one could mention that this is only relevant for non-public projects.


- Configure permission mapping: Establish clear mapping between GitHub repository permissions and GitLab project access levels to ensure appropriate access control.

- Implement audit logging: Maintain comprehensive logs of all cross-platform CI activities for security monitoring and troubleshooting.

If I were an RSE, I would not know what I should do here or how to configure that.


- Establish guest access procedures: Create documented procedures for external contributors to access GitLab CI logs and results, including temporary guest access workflows that maintain security boundaries. Configure using [GitLab project members management](https://docs.gitlab.com/ee/user/project/members/) and [guest permissions](https://docs.gitlab.com/ee/user/permissions.html#project-members-permissions) for setting up access control.

- Configure permission mapping: Establish clear mapping between GitHub repository permissions and GitLab project access levels to ensure appropriate access control.

How to implement that practically? That's what I would ask myself as an RSE.


- Configure runner tagging: Implement comprehensive tagging systems that allow jobs to target specific hardware configurations while maintaining flexibility for resource allocation. See [GitLab Runner tags documentation](https://docs.gitlab.com/ee/ci/runners/configure_runners.html#use-tags-to-control-which-jobs-a-runner-can-run).

- Set up runner pools: Organize runners into pools based on hardware capabilities and project requirements to ensure fair resource distribution across multiple projects. For more information, see [GitLab Runner Tags](https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-section) and [Shared vs Specific Runners](https://docs.gitlab.com/ee/ci/runners/).

What does "pools" refer to? I don't know that term with regard to GitLab CI.

Also, the linked documentation does not mention shared vs. specific runners. To me this is not practical enough without specific examples.


- Set up runner pools: Organize runners into pools based on hardware capabilities and project requirements to ensure fair resource distribution across multiple projects. For more information, see [GitLab Runner Tags](https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runners-section) and [Shared vs Specific Runners](https://docs.gitlab.com/ee/ci/runners/).

#### Monitoring and Maintenance

Who is the target audience for this document? This all might be a huge challenge for an RSE if there are no specific instructions.


### Solutions

- Implement robust webhook processing: Design webhook handlers with comprehensive error handling, retry logic, and dead letter queues to ensure no GitHub events are lost due to temporary failures.

The webhook is sent out from GitHub. How can I implement retry logic? I thought I could only rely on GitHub's implementation.
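Since GitHub delivers each webhook once and does not retry failed deliveries on its own, the retry logic has to live on the receiving side: acknowledge immediately, then process from a queue with retries and a dead-letter list. A minimal sketch; the `action` callable stands in for whatever triggers the GitLab pipeline:

```python
from collections import deque

class WebhookProcessor:
    """Receiver-side retry for GitHub webhook events.

    The HTTP endpoint calls `receive()` and returns 200 right away;
    `drain()` does the actual work, re-queueing transient failures and
    parking repeated failures in a dead-letter list for manual replay.
    """

    def __init__(self, action, max_retries=3):
        self.action = action          # e.g. a function calling the GitLab API
        self.max_retries = max_retries
        self.queue = deque()          # persist to disk/DB in a real setup
        self.dead_letter = []

    def receive(self, event):
        """Called from the HTTP handler before acknowledging GitHub."""
        self.queue.append((event, 0))

    def drain(self):
        """Process currently queued events, retrying transient failures."""
        for _ in range(len(self.queue)):
            event, attempts = self.queue.popleft()
            try:
                self.action(event)
            except Exception:
                if attempts + 1 >= self.max_retries:
                    self.dead_letter.append(event)  # inspect and replay manually
                else:
                    self.queue.append((event, attempts + 1))
```

In a production setup `drain()` would run in a background worker with backoff between attempts; the point is that the receiver, not GitHub, owns the retries once the event has been accepted.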


- Implement robust webhook processing: Design webhook handlers with comprehensive error handling, retry logic, and dead letter queues to ensure no GitHub events are lost due to temporary failures.

- Configure webhook redundancy: Set up multiple webhook endpoints with failover mechanisms to ensure continuous operation even during maintenance or unexpected outages.

How to do this?


- Configure webhook redundancy: Set up multiple webhook endpoints with failover mechanisms to ensure continuous operation even during maintenance or unexpected outages.

- Establish authentication token rotation: Implement automated token rotation for both GitHub and GitLab APIs to maintain long-term reliability without manual intervention. Configure using [GitHub personal access tokens](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) and [GitLab access tokens](https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html).

This also requires the webhook to be updated automatically, right?

@srghosh56
Owner Author

What I am missing in general is the practical setup. When reading this as somebody new to the topic, I get a long list of ideas and abstract things to consider, but in the end I would have nearly no idea how to set things up. Also, I miss what is most important to make it work and what is optional for the best reliability. I would expect the document to be a practical guideline for someone who would like to build something similar.

I added implementation examples and code snippets wherever possible. Please take a re-look.

Should this be the idea in the RSQKit?

You can check the existing task pages here; this is the template I am following. The current task pages only cover basic CI/CD pipelines, task automation using GitHub Actions and GitLab CI/CD, etc., and they are more on the theoretical side rather than practical step-by-step tutorials. I also understand that the local_gitlab_ci_infra_for_github_project and ci_testing_matrix_for_rs task pages are still quite high-level and abstract, but there is a huge gap between the difficulty level of the task pages that already exist in RSQKit and that of the ones I am trying to create. I tried to incorporate more code snippets and implementation examples to turn them into practical guides and mitigate both problems. They might need more revisions before finally being pushed to RSQKit, which is also subject to the scrutiny of the Editorial Board members. Please let me know if you think any other changes should be made.
