A Nextflow pipeline for processing study assay results with a focus on RNA-Seq analysis workflows. This pipeline provides a flexible, containerized framework for executing multi-step data processing workflows defined in XML configuration files.
- Flexible workflow definition via XML configuration files
- Containerized execution with Docker or Singularity support
- Multi-step processing with sequential execution of analysis steps
- RNA-Seq specialized mode with TPM (Transcripts Per Million) support
- Extensible architecture for adding new analysis types
- Nextflow (version compatible with DSL2 and recursion)
- Docker or Singularity
nextflow run main.nf \
--analysisConfigFile data/test1/analysisConfig.xml \
--finalDir data/test1 \
--outputDirectory output \
--technologyType rnaseq
nextflow run main.nf \
-c conf/singularity.config \
--analysisConfigFile data/test1/analysisConfig.xml \
--finalDir data/test1 \
--outputDirectory output \
--technologyType rnaseq
Parameter | Description |
---|---|
--analysisConfigFile |
Path to XML configuration file defining the analysis workflow |
--finalDir |
Path to directory containing input data files |
--outputDirectory |
Path where output files will be published |
--technologyType |
Technology type (e.g., rnaseq , microarray ) |
Parameter | Description |
---|---|
--inputFile |
Path to specific input file (can also be defined in XML config) |
--tpmDir |
Path to TPM directory (required for RNASeq mode) |
--pseudogenesFile |
Path to pseudogenes file for filtering |
The pipeline automatically switches between two modes based on whether --tpmDir
is provided:
Used for most assay types. Input files from finalDir
are copied into an analysis_output
subdirectory.
nextflow run main.nf \
--analysisConfigFile config.xml \
--finalDir input_data/ \
--outputDirectory results/ \
--technologyType microarray
Used when TPM data needs to be processed alongside standard results. Requires both finalDir
and tpmDir
.
nextflow run main.nf \
--analysisConfigFile config.xml \
--finalDir input_data/final \
--tpmDir input_data/TPM \
--outputDirectory results/ \
--technologyType rnaseq
Workflows are defined using XML configuration files. Here's an example structure:
<xml>
<globalReferencable>
<!-- Define reusable variables -->
<property name="profileSetName" value="My Analysis"/>
<property name="samples">
<value>group1|sample1</value>
<value>group1|sample2</value>
<value>group2|sample3</value>
</property>
</globalReferencable>
<!-- Define analysis steps -->
<step class="ApiCommonData::Load::RnaSeqAnalysisEbi">
<property name="profileSetName" isReference="1"
value="$globalReferencable->{profileSetName}" />
<property name="samples" isReference="1"
value="$globalReferencable->{samples}" />
<property name="isStrandSpecific" value="1"/>
</step>
<step class="ApiCommonData::Load::IterativeWGCNAResults">
<property name="profileSetName" value="WGCNA Analysis"/>
<property name="inputFile" value="profiles.genes.htseq-union.firststrand.tpm"/>
<property name="softThresholdPower" value="10"/>
<property name="organism" value="Plasmodium falciparum 3D7"/>
</step>
</xml>
- globalReferencable: Define variables that can be referenced across multiple steps
- isReference: Set to "1" to evaluate property values as Perl expressions
- Multiple steps: Chain analysis steps that execute sequentially
Example configurations can be found in the data/
directory.
The pipeline supports various analysis step types including:
- RNA-Seq Analysis:
ApiCommonData::Load::RnaSeqAnalysisEbi
,ApiCommonData::Load::RNASeqProfiles
- Differential Expression:
ApiCommonData::Load::DeseqAnalysis
,ApiCommonData::Load::DEGseqAnalysis
- Network Analysis:
ApiCommonData::Load::WGCNA
,ApiCommonData::Load::IterativeWGCNAResults
- Normalization: Various methods in
CBIL::StudyAssayResults::DataMunger::Normalization::*
- And many more in the
lib/perl/
directory
.
├── main.nf # Main Nextflow workflow
├── nextflow.config # Pipeline configuration
├── conf/
│ ├── docker.config # Docker-specific settings
│ └── singularity.config # Singularity-specific settings
├── bin/
│ ├── doStep.pl # Execute individual analysis steps
│ └── nextStepFromJsonFile.pl # Manage step queue
├── lib/
│ ├── perl/ # Perl analysis modules
│ │ ├── ApiCommonData/Load/ # Analysis step implementations
│ │ └── CBIL/StudyAssayResults/ # Core framework
│ └── R/StudyAssayResults/ # R utility functions
└── data/ # Example configurations and test data
- Parse Configuration: The XML config is parsed into a JSON array of analysis steps
- Setup Working Directory: Input files are organized based on the operating mode
- Execute Steps: Each step is executed sequentially in its appropriate container
- Publish Results: Final outputs are published to the specified output directory
The pipeline uses Nextflow's recursion feature to iterate through analysis steps, ensuring they execute in the correct order with proper data dependencies.
The pipeline uses pre-built container images from VEuPathDB:
veupathdb/gusenv:latest
- Main analysis environmentveupathdb/iterativewgcna:latest
- WGCNA analysisveupathdb/bioperl:latest
- Perl utilitiesveupathdb/alpine_bash:latest
- Lightweight utilities
- Create a new Perl module in
lib/perl/ApiCommonData/Load/
orlib/perl/CBIL/StudyAssayResults/DataMunger/
- Extend the
CBIL::StudyAssayResults::DataMunger
base class - Implement the
munge()
method with your analysis logic - If your step requires a specific container, update the container mapping in
main.nf
Example:
package ApiCommonData::Load::MyNewAnalysis;
use base qw(CBIL::StudyAssayResults::DataMunger);
sub munge {
my ($self) = @_;
# Your analysis logic here
my $inputFile = $self->getInputFile();
my $outputFile = $self->getOutputFile();
# Process data...
}
1;
Test configurations are available in the data/
directory:
# Test with example data
nextflow run main.nf \
--analysisConfigFile data/test1/analysisConfig.xml \
--finalDir data/test1 \
--outputDirectory test_output \
--technologyType rnaseq
Container permissions: The Docker configuration runs containers with your user ID to avoid permission issues. If you encounter permission errors, check the docker.runOptions
in conf/docker.config
.
Library paths: The pipeline automatically sets PERL5LIB
and MY_R_LIB
environment variables. If modules aren't being found, verify these are correctly configured in your container config.
Step failures: Check the Nextflow work directory for detailed logs from failed steps. Each step's stdout/stderr is captured in its work directory.
See LICENSE file for details.