SVHIP

This repository contains svhip.py, a script for predicting functional RNA elements from multiple sequence alignments. It supports generating training data, training machine learning models, slicing alignments into windows, and predicting classes (coding, non-coding, other) for alignment windows.

Usage

python svhip.py [Task] [Options]
Tasks: data, train, windows, predict, hexcalibrate

Global options (available for all tasks)

--threads (int, default: max(CPU_COUNT-1, 1)): Number of threads to allocate.
--seed (int, default: random integer at startup): Seed controlling randomized behavior (e.g. shuffling).
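The two defaults above can be illustrated with a minimal sketch. This is not the code from svhip.py; the function names build_global_parser and resolve_seed are hypothetical, and only the option names and default rules are taken from the description above.

```python
import argparse
import os
import random


def build_global_parser():
    """Hypothetical parser covering only the two global options."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    parser = argparse.ArgumentParser(add_help=False)
    # Leave one core free, but never go below one thread.
    parser.add_argument("--threads", type=int, default=max(cpu_count - 1, 1))
    parser.add_argument("--seed", type=int, default=None)
    return parser


def resolve_seed(seed):
    """Use the given seed, or draw a random one if none was supplied."""
    if seed is None:
        seed = random.SystemRandom().randint(0, 2**32 - 1)
    random.seed(seed)
    return seed


args = build_global_parser().parse_args(["--seed", "42"])
seed = resolve_seed(args.seed)
first = [random.random() for _ in range(3)]

# Re-seeding with the same value reproduces the same random draws.
resolve_seed(42)
second = [random.random() for _ in range(3)]
```

Passing the same --seed value on two runs should therefore make shuffling-based steps repeatable, provided all randomness flows through the seeded generator.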

Task overview:

data → Generate training data from multiple sequence alignments.
train → Train a prediction model on data generated with the 'data' task.
windows → Cut an alignment into overlapping windows in preparation for prediction.
predict → Run a model prediction on windows generated with the 'windows' task.
hexcalibrate → Train a hexamer frequency model which can be used for coding potential assessment.

svhip data

Purpose

Generate training data from coding and/or noncoding input sequences.
Align sequences (requires Clustal Omega), optionally generate a negative set (SISSIz if available, otherwise column shuffling), slice alignments into windows, and compute features into a TSV.

Behavior

Requires at least one of --noncoding or --coding.
Checks for clustalo availability; checks SISSIz availability unless --shuffle-control is given.
Seeds RNG with --seed.
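The availability checks described above could look roughly like the following sketch. It is an illustration only: the function name choose_negative_set_method is hypothetical, and the executable names "clustalo" and "SISSIz" are assumed to match what is installed in PATH.

```python
import shutil


def choose_negative_set_method(shuffle_control=False, which=shutil.which):
    """Return "sissiz" or "shuffle" depending on flags and installed tools.

    `which` is injectable so the logic can be tested without the tools.
    """
    # Clustal Omega is required in every case.
    if which("clustalo") is None:
        raise RuntimeError("clustalo not found in PATH")
    # --shuffle-control skips the SISSIz check entirely.
    if shuffle_control:
        return "shuffle"
    # ASSUMPTION: the SISSIz executable is named "SISSIz".
    if which("SISSIz") is None:
        return "shuffle"
    return "sissiz"
```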

Options

--noncoding (string): Input directory with FASTA file(s) of noncoding sequences (Requires at least 1 of 'noncoding', 'coding').
--coding (string): Input directory with FASTA file(s) of coding sequences (Requires at least 1 of 'noncoding', 'coding').
--other (string): Input directory with FASTA file(s) of random (intergenic) sequences (will be auto-generated via randomization if not supplied).
-o, --outfile (string): Name for the output file (Required).
-N, --negative (string): Path to a specific negative dataset; if empty, a negative set is auto-generated.
-d, --max-id (float, default: 0.95): Remove sequences whose identity exceeds this threshold during preprocessing. The value is interpreted as a proportion between 0 and 1, although the help text mentions percent.
-n, --num-sequences (int, default: 100): Target number of sequences that input alignments are optimized towards.
-l, --window-length (int, default: 120): Window length for slicing alignments into overlapping windows.
-w, --windowslide (int, default: 40): Slide step size controlling window overlap.
-s, --samples (int, default: 10): Number of sampling runs per alignment/sequence count.
-a, --sample-attempts (int, default: 1000): Number of sampling attempts per alignment.
-c, --shuffle-control (store_true, default: False): Use simpler column-based shuffling instead of SISSIz.
-H, --hexamer-model (string, default: hexamer_models/Human_hexamer.tsv): Path to the statistical hexamer model to use.
-S, --no-structural-filter (string, default: False): Set to True to disable filtering windows by statistical significance of structure. Note: the option is defined with action="store" (string), so it expects a value (e.g. -S True) even though it is conceptually a boolean toggle.
-T, --tree (string, default: None): Path to a Newick-formatted species tree for the alignment. If None, a tree may be estimated.
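For readers unfamiliar with the column-based shuffling selected by --shuffle-control, here is a minimal sketch of the general technique. It is not taken from svhip.py; the function name shuffle_columns and the list-of-strings alignment representation are assumptions for illustration.

```python
import random


def shuffle_columns(alignment, rng):
    """Permute the columns of an alignment given as equal-length strings."""
    columns = list(zip(*alignment))  # one tuple of characters per column
    rng.shuffle(columns)             # reorder whole columns, in place
    return ["".join(row) for row in zip(*columns)]


alignment = [
    "GGCA-UGCC",
    "GGCAAUGCC",
    "GACA-UGUC",
]
shuffled = shuffle_columns(alignment, random.Random(7))
```

Because whole columns are moved, each sequence keeps its base composition and each column keeps its conservation and gap pattern, while positional dependencies between columns (such as compensatory base pairs) are broken up.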

Example

python svhip.py data --coding CodingDir --noncoding NoncodingDir -o features.tsv -n 200 -l 120 -w 40

svhip train

Purpose

Train a machine-learning model (RF, SVM, or LR) on features generated by the data task.
Supports optional hyperparameter optimization for SVM and RF.

Options

-i, --input (string): Input features file generated with data (Required).
-o, --outfile (string): Prefix for output model files (Required).
-M, --model (string, default: RF): Model type. One of RF (Random Forest), SVM (Support Vector Machine), LR (Logistic Regression).
--optimize-hyperparameters (store_true, default: False): Perform hyperparameter optimization.
--optimizer (string, default: randomwalk): Hyperparameter search strategy: gridsearch (exhaustive) or randomwalk (faster).

SVM hyperparameters (when model=SVM and optimization enabled)

--low-c (int, default: 1): Lowest C value to try.
--high-c (int, default: 100): Highest C value to try.
--low-gamma (int, default: 1): Lowest gamma value to try.
--high-gamma (int, default: 100): Highest gamma value to try.
--hyperparameter-steps (int, default: 10): Number of values per hyperparameter (evenly spaced).
--logscale (store_true, default: False): Use logarithmic scaling for the parameter grid.
--logbase (int, default: 2): Logarithmic base if --logscale is set.
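As a rough illustration of how these bounds translate into a search grid, consider the sketch below. The function parameter_values is hypothetical. In particular, the reading of --logscale used here (bounds treated as exponents of --logbase) is an assumption, since the option descriptions do not specify it.

```python
import itertools


def parameter_values(low, high, steps, logscale=False, logbase=2):
    """Return `steps` evenly spaced values from low to high (inclusive)."""
    if steps < 2:
        points = [float(low)]
    else:
        step = (high - low) / (steps - 1)
        points = [low + i * step for i in range(steps)]
    if logscale:
        # ASSUMPTION: with log scaling, the evenly spaced points are used
        # as exponents, so the values are logbase ** point.
        return [float(logbase) ** p for p in points]
    return points


# Defaults from the option list: 10 values each for C and gamma.
c_values = parameter_values(1, 100, 10)
gamma_values = parameter_values(1, 100, 10)
grid = list(itertools.product(c_values, gamma_values))

# Under the assumed log reading: exponents 0..4 with base 2.
log_values = parameter_values(0, 4, 5, logscale=True, logbase=2)
```

With the default settings an exhaustive grid search would evaluate 10 x 10 = 100 combinations of C and gamma, which is why the number of steps has a large effect on runtime.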

Random Forest hyperparameters (when model=RF and optimization enabled)

--min-trees (int, default: 100): Minimum number of trees (n_estimators) to consider.
--max-trees (int, default: 500): Maximum number of trees (n_estimators) to consider.
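--min-samples-split (int, default: 2): Lowest value to consider for the minimum number of samples required to split an internal node.
--max-samples-split (int, default: 16): Highest value to consider for the minimum number of samples required to split an internal node.
--min-samples-leaf (int, default: 1): Lowest value to consider for the minimum number of samples required at a leaf node.
--max-samples-leaf (int, default: 16): Highest value to consider for the minimum number of samples required at a leaf node.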
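The difference between the two --optimizer strategies can be sketched generically. The code below is not how svhip.py implements them: the step sizes inside each range, the toy_score function (a stand-in for something like cross-validated accuracy), and the walk rules are all assumptions chosen only to show why a random walk needs fewer evaluations than an exhaustive grid.

```python
import itertools
import random

# Bounds follow the documented defaults; the values in between are assumed.
SPACE = {
    "n_estimators": [100, 200, 300, 400, 500],
    "min_samples_split": [2, 4, 8, 16],
    "min_samples_leaf": [1, 2, 4, 8, 16],
}
KEYS = list(SPACE)


def toy_score(params):
    """Hypothetical score with a single optimum at (300, 4, 2)."""
    return -(abs(params["n_estimators"] - 300)
             + abs(params["min_samples_split"] - 4)
             + abs(params["min_samples_leaf"] - 2))


def grid_search(score):
    """Evaluate every combination and return the best one."""
    combos = [dict(zip(KEYS, c))
              for c in itertools.product(*(SPACE[k] for k in KEYS))]
    return max(combos, key=score), len(combos)


def random_walk(score, steps, rng):
    """Move one parameter by one grid position per step; keep non-worse moves."""
    position = {k: rng.randrange(len(SPACE[k])) for k in KEYS}
    as_params = lambda pos: {k: SPACE[k][pos[k]] for k in KEYS}
    best = score(as_params(position))
    evaluations = 1
    for _ in range(steps):
        key = rng.choice(KEYS)
        candidate = dict(position)
        moved = position[key] + rng.choice((-1, 1))
        candidate[key] = min(max(moved, 0), len(SPACE[key]) - 1)
        candidate_score = score(as_params(candidate))
        evaluations += 1
        if candidate_score >= best:
            position, best = candidate, candidate_score
    return as_params(position), evaluations


grid_best, grid_evaluations = grid_search(toy_score)
walk_best, walk_evaluations = random_walk(toy_score, 30, random.Random(0))
```

In this toy setup the grid search always costs 100 evaluations, while the walk costs 31; the trade-off is that a walk is not guaranteed to reach the best combination.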

Example

python svhip.py train -i features.tsv -o RF_classifier -M RF --optimize-hyperparameters --optimizer randomwalk

svhip windows

Purpose

Slice an existing alignment into overlapping windows, filtering sequences by identity and gaps.

Options

-i, --input (string): Input alignment file (Required).
-o, --outfile (string): Output alignment file for windows (Required).
-l, --length (int, default: 120): Window length.
-s, --slide (int, default: 80): Slide step size for overlap.
--min-id (float, default: 0.5): Minimum pairwise identity of sequences to keep.
--max-id (float, default: 0.95): Maximum pairwise identity of sequences to keep.
--opt-id (float, default: 0.8): Target identity to optimize sequence selection.
-n, --num-seqs (int, default: 6): Maximum number of sequences per window.
-g, --max-gaps (float, default: 0.75): Maximum fraction of gaps in the reference sequence.

Example

python svhip.py windows -i input.aln -o WINDOWS.aln -l 120 -s 80 --min-id 0.5 --opt-id 0.8 -n 6
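How -l and -s interact can be shown with a short sketch of sliding-window slicing. The function names are hypothetical, and the handling of the alignment tail (here: a trailing partial window is dropped, and an alignment shorter than one window yields a single window) is an assumption rather than documented behavior.

```python
def window_starts(alignment_length, window_length=120, slide=80):
    """Start columns of all full-length windows."""
    if alignment_length <= window_length:
        return [0]  # ASSUMPTION: short alignments give one window
    return list(range(0, alignment_length - window_length + 1, slide))


def slice_alignment(alignment, window_length=120, slide=80):
    """Cut a list of equal-length aligned strings into windows."""
    length = len(alignment[0])
    return [
        [row[start:start + window_length] for row in alignment]
        for start in window_starts(length, window_length, slide)
    ]


# A 300-column toy alignment with two sequences.
toy = ["A" * 300, "C" * 300]
starts = window_starts(300)       # windows begin at columns 0, 80, 160
windows = slice_alignment(toy)
```

Consecutive windows overlap by window length minus slide, so the defaults here (120 and 80) give a 40-column overlap, while the data task defaults (120 and 40) give an 80-column overlap.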

svhip predict

Purpose

Predict class labels (coding, non-coding, other) for windows cut from an input alignment using a trained model and hexamer model.
Supports MAF or Clustal input; when input ends with .maf, genome coordinates are preserved and can be exported as BED.
Processes windows in blocks for efficiency; can scan both strands.

Options

-i, --input (string): Input alignment file, MAF or Clustal (Required).
-o, --outfile (string): Output TSV file (Required).
-M, --model-path (string, default: ""): Path to the trained model file (Required).
-T, --tree (string, default: None): Path to a Newick-formatted species tree; if None, one may be estimated.
-H, --hexamer-model (string, default: hexamer_models/Human_hexamer.tsv): Path to the hexamer score model.
--both-strands (store_true, default: False): Screen both forward and reverse strands.
--bed (store_true, default: False): Merge overlapping annotations and write a BED file. IMPORTANT: Requires MAF input for genomic coordinates.
--windows-per-block (int, default: 50): Number of windows processed per block before writing results.
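The block-wise processing controlled by --windows-per-block can be pictured as simple chunking of the window stream. The generator below is a hypothetical sketch of that idea, not the implementation in svhip.py.

```python
from itertools import islice


def blocks(windows, windows_per_block=50):
    """Yield successive lists of at most `windows_per_block` windows."""
    iterator = iter(windows)
    while True:
        block = list(islice(iterator, windows_per_block))
        if not block:
            return
        yield block


# 120 placeholder windows are handled as blocks of 50, 50 and 20.
block_sizes = [len(block) for block in blocks(range(120))]
```

Writing results after each block keeps memory use bounded and means partial output is available if a long run is interrupted.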

Example

python svhip.py predict -i query.maf -o predictions.tsv -M RF_classifier.model -H hexamer_models/Human_hexamer.tsv --both-strands --bed
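To illustrate what merging overlapping annotations for --bed involves, here is a small sketch. The record layout, the function names, and the rule that only windows with the same chromosome, strand and label are merged are assumptions; the six output columns follow the standard BED layout (chrom, start, end, name, score, strand), and the score of 0 is a placeholder.

```python
def merge_annotations(records):
    """Merge overlapping (chrom, start, end, label, strand) records."""
    merged = []
    # Group candidates by chromosome, strand and label, ordered by position.
    ordered = sorted(records, key=lambda r: (r[0], r[4], r[3], r[1], r[2]))
    for chrom, start, end, label, strand in ordered:
        last = merged[-1] if merged else None
        same_group = last is not None and (last[0], last[3], last[4]) == (chrom, label, strand)
        if same_group and start < last[2]:  # half-open intervals overlap
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end, label, strand])
    merged.sort(key=lambda r: (r[0], r[1], r[2]))  # report in genomic order
    return merged


def to_bed_line(record):
    """Format one merged record as a six-column BED line."""
    chrom, start, end, label, strand = record
    return f"{chrom}\t{start}\t{end}\t{label}\t0\t{strand}"


predictions = [
    ("chr1", 0, 120, "coding", "+"),
    ("chr1", 80, 200, "coding", "+"),
    ("chr1", 160, 280, "non-coding", "+"),
    ("chr1", 400, 520, "coding", "+"),
]
merged = merge_annotations(predictions)
bed_lines = [to_bed_line(record) for record in merged]
```

The sketch also shows why MAF input is needed: without genomic start and end coordinates for each window there is nothing meaningful to put in the second and third BED columns.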

svhip hexcalibrate

Purpose

Calibrate a hexamer model from coding and noncoding sequences; writes a tab-delimited model file.

Options

-c, --coding (string): FASTA file of coding transcripts (must be in-frame).
-n, --noncoding (string): FASTA file of noncoding sequences.
-o, --outfile (string): Output TSV file for the calibrated hexamer model.

Example

python svhip.py hexcalibrate -c coding.fa -n noncoding.fa -o Human_hexamer.tsv
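The requirement that coding transcripts be in-frame makes sense once hexamers are counted codon by codon. The sketch below shows one common way to build such a table; it is not the svhip.py implementation. The step sizes (3 for coding, 1 for noncoding), the function names, and the three-column layout of the output are assumptions.

```python
from collections import Counter


def hexamer_frequencies(sequences, step):
    """Relative frequency of each hexamer, sampled every `step` bases."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 5, step):
            counts[seq[i:i + 6]] += 1
    total = sum(counts.values())
    return {hexamer: n / total for hexamer, n in counts.items()} if total else {}


def model_lines(coding, noncoding):
    """Tab-delimited rows: hexamer, coding frequency, noncoding frequency."""
    # ASSUMPTION: coding hexamers are read in frame (step 3),
    # noncoding hexamers at every position (step 1).
    cod = hexamer_frequencies(coding, step=3)
    non = hexamer_frequencies(noncoding, step=1)
    lines = ["hexamer\tcoding\tnoncoding"]
    for hexamer in sorted(set(cod) | set(non)):
        lines.append(f"{hexamer}\t{cod.get(hexamer, 0):.6g}\t{non.get(hexamer, 0):.6g}")
    return lines


lines = model_lines(coding=["ATGAAATTT"], noncoding=["ATGAAAT"])
```

If a coding sequence were shifted by one or two bases, the step-3 sampling would count out-of-frame hexamers and blur the difference between the two columns.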

External tools and notes

Clustal Omega (clustalo) must be available in PATH for data generation and alignment steps.
SISSIz is used for negative control generation when available; if not present or if --shuffle-control is set, a simpler column-shuffling approach is used instead.
Randomization is controlled by --seed. If not provided, a random seed is generated at start.
When using predict with --bed, ensure the input is MAF to include genomic coordinates.

chrisBioInf/svhip