This is the open-source software HiCMC. Through sophisticated biological modeling we enable highly efficient compression of Hi-C contact matrices.
For a smooth quick start, we provide a test file that can be downloaded and extracted.
We have tested this software on Ubuntu with conda.
First, clone the repository and enter the directory:
git clone https://github.com/sXperfect/hicmc
cd hicmc

Create a virtual environment using conda and install the necessary libraries:
conda create -y -n hicmc python=3.11
conda activate hicmc
conda install -y -c conda-forge cmake gxx_linux-64 gcc_linux-64 zlib curl

Install the Python libraries:
pip install -r requirements.txt
pip install hic2cool cooltools
pip install --pre bitstream

Note: At the time of writing, the bitstream library has a bug that is fixed in the pre-release. Future versions of bitstream may not require installation with the --pre option.
Run the setup script setup.sh:
bash setup.sh

Create a data folder and download the domain information, which is based on the insulation score:
mkdir -p data && cd data
wget https://www.tnt.uni-hannover.de/staff/adhisant/hicmc/domain_info.tar.gz
tar xzvf domain_info.tar.gz

Note: The insulation score can be computed using cooltools.
Download hic data from GEO:
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE63nnn/GSE63525/suppl/GSE63525%5FGM12878%5Finsitu%5Fprimary%2Ehic

Convert the hic data to mcool:
hic2cool convert GSE63525_GM12878_insitu_primary.hic GSE63525_GM12878_insitu_primary.mcool

Note: This step is necessary because HiCMC currently supports only the cooler format as input. Support could be extended by integrating parsers or readers for other formats, in particular the hic format via straw.
Go back to the root directory
cd ..
Encode the mcool data at 250kb with HiCMC:
python -m hicmc ENCODE --insulation-file data/GM12828-insitu_primary/250000/insulation.tsv --insulation-window 1000000 --weights-precision 12 --domain-values-precision 18 --distance-table-precision 10 --domain-mask-threshold 45 --balancing KR data/GSE63525_GM12878_insitu_primary.mcool 250000 results/GM12878-insitu_primary-250kb

Note: The value of --insulation-window is a multiple of the resolution. In the paper we report the multiplier rather than the exact window size; in this command, a window of 1000000 at a resolution of 250000 corresponds to a multiplier of 4.
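The relation between the multiplier and the value passed to --insulation-window can be sketched in a few lines of Python. The helper names `insulation_window` and `window_multiplier` are hypothetical and not part of HiCMC; the numbers are the ones used in the command above.

```python
def insulation_window(resolution: int, multiplier: int) -> int:
    """Return the window size in base pairs for a bin size and a multiplier."""
    if resolution <= 0 or multiplier <= 0:
        raise ValueError("resolution and multiplier must be positive integers")
    return resolution * multiplier


def window_multiplier(resolution: int, window: int) -> int:
    """Recover the multiplier from an exact window size; reject non-multiples."""
    if resolution <= 0 or window <= 0:
        raise ValueError("resolution and window must be positive integers")
    if window % resolution != 0:
        raise ValueError("window must be a whole multiple of the resolution")
    return window // resolution


# Values from the quick-start command: 250 kb bins and a 1 Mb window.
print(insulation_window(250_000, 4))
print(window_multiplier(250_000, 1_000_000))
```

The second helper is useful when converting a window size from a command line back to the multiplier convention.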
The open-source HiCMC codec is made available before scientific publication.
This pre-publication software is preliminary and may contain errors. The software is provided in good faith, but without any express or implied warranties. We refer the reader to our license.
The goal of our policy is that early release should enable the progress of science. We kindly ask you to refrain from publishing analyses conducted using this software while its development is in progress.
Python 3.8 or higher is required.
It is recommended that you create a virtual environment using conda.
For conda users, cmake, gcc, gxx, zlib, and curl are required and can be installed with:
conda install -y -c conda-forge cmake gxx_linux-64 gcc_linux-64 zlib curl

See requirements.txt for the list of required Python libraries.
Our tool accepts mcool data as input.
For hic data, transcoding to mcool with the hic2cool tool is necessary:
hic2cool convert <hic_file> <mcool_file>

Before encoding with our tool, domain information from a TAD caller (in this case, the insulation score) is required. Please refer to this link on how to generate the domain file.
To run our tool, use the following command from the repository root directory:
python -m hicmc <mode>

where mode is either ENCODE or DECODE.
Use --help to show help.
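For scripted runs, the invocation can be assembled as an argument list and handed to `subprocess.run`. The following is a minimal sketch that assumes only the `python -m hicmc <mode>` entry point described above; `build_hicmc_command` is a hypothetical helper and not part of HiCMC.

```python
import shlex
import sys
from typing import List

VALID_MODES = ("ENCODE", "DECODE")


def build_hicmc_command(mode: str, *args: str) -> List[str]:
    """Build the argv list for `python -m hicmc <mode> ...` without running it."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    # sys.executable keeps the call inside the active (conda) environment.
    return [sys.executable, "-m", "hicmc", mode, *args]


cmd = build_hicmc_command("ENCODE", "--help")
print(shlex.join(cmd))
# To execute the command from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list instead of a single shell string avoids quoting problems when file paths contain spaces.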
ENCODE Compress a cooler file with a specific resolution
usage: HiCMC ENCODE [-h] [--check-result] [--insulation-file INSULATION_FILE] [--insulation-window INSULATION_WINDOW] [--weights-precision WEIGHTS_PRECISION] [--domain-mask-statistic {average,sparsity,deviation}] [--domain-mask-threshold DOMAIN_MASK_THRESHOLD] [--domain-values-precision DOMAIN_VALUES_PRECISION] [--distance-table-precision DISTANCE_TABLE_PRECISION]
[--balancing BALANCING]
input_file resolution output_directory
positional arguments:
input_file input file path (.cool or .mcool)
resolution
output_directory
options:
-h, --help show this help message and exit
--check-result Check the decoded contact matrix equals the original matrix
--insulation-file INSULATION_FILE
--insulation-window INSULATION_WINDOW
--weights-precision WEIGHTS_PRECISION
--domain-mask-statistic {average,sparsity,deviation}
--domain-mask-threshold DOMAIN_MASK_THRESHOLD
--domain-values-precision DOMAIN_VALUES_PRECISION
Number of bits used for floating-point compression
--distance-table-precision DISTANCE_TABLE_PRECISION
Number of bits used for floating-point compression
--balancing BALANCING
Select a balancing method, default: KR

DECODE Decompress HiCMC encoded payload
usage: HiCMC DECODE [-h] input output
positional arguments:
input Path to the HiCMC encoded payload
output Output directory
options:
-h, --help show this help message and exit

Currently, HiCMC supports only the cooler format as input. Support could be extended by integrating parsers or readers for other formats, in particular the hic format via straw.
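A multi-resolution .mcool file stores one contact matrix per bin size, and the cooler library addresses each of them with a URI of the form `path::/resolutions/<binsize>`. The sketch below shows how a file path and a resolution could be combined into such a URI. Whether HiCMC builds the URI this way internally is an assumption, and `mcool_uri` is a hypothetical helper.

```python
def mcool_uri(path: str, resolution: int) -> str:
    """Return the cooler URI of one resolution inside a multi-resolution file."""
    if resolution <= 0:
        raise ValueError("resolution must be a positive bin size in base pairs")
    return f"{path}::/resolutions/{resolution}"


uri = mcool_uri("data/GSE63525_GM12878_insitu_primary.mcool", 250_000)
print(uri)
# With the cooler package installed, the matrix could then be opened with:
#   import cooler
#   clr = cooler.Cooler(uri)
```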
The data can be downloaded from NCBI.
| NCBI Accession Number | Cell line | Filename |
|---|---|---|
| GSE63525 | CH12 | GSE63525_CH12-LX_combined.hic |
| GSE63525 | GM12878 (Insitu-DpnII) | GSE63525_GM12878_insitu_DpnII_combined.hic |
| GSE63525 | GM12878 (Primary) | GSE63525_GM12878_insitu_primary.hic |
| GSE63525 | GM12878 (Replicate) | GSE63525_GM12878_insitu_replicate.hic |
| GSE63525 | HMEC | GSE63525_HMEC_combined.hic |
| GSE63525 | HUVEC | GSE63525_HUVEC_combined.hic |
| GSE63525 | IMR90 | GSE63525_IMR90_combined.hic |
| GSE63525 | K562 | GSE63525_K562_combined.hic |
| GSE63525 | KBM7 | GSE63525_KBM7_combined.hic |
| GSE63525 | NHEK | GSE63525_NHEK_combined.hic |
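The quick-start URL percent-encodes `_` as `%5F` and `.` as `%2E`; once decoded, it is the GEO supplementary directory followed by the file name. The sketch below builds download URLs from the file names in the table. It assumes, without verification here, that all listed files sit in the same `suppl` directory as the primary GM12878 file; `geo_url` is a hypothetical helper.

```python
from urllib.parse import quote, unquote

# Directory part of the quick-start download URL.
GEO_SUPPL = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE63nnn/GSE63525/suppl/"


def geo_url(filename: str) -> str:
    """Return the download URL for a file name from the table (assumed layout)."""
    return GEO_SUPPL + quote(filename)


# The quick-start URL decodes to the same address that geo_url produces.
quick_start = GEO_SUPPL + "GSE63525%5FGM12878%5Finsitu%5Fprimary%2Ehic"
assert unquote(quick_start) == geo_url("GSE63525_GM12878_insitu_primary.hic")

print(geo_url("GSE63525_K562_combined.hic"))
```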
Yeremia Gunawan Adhisantoso <adhisant@tnt.uni-hannover.de>
Fabian Müntefering <muenteferi@tnt.uni-hannover.de>
Jan Voges <voges@tnt.uni-hannover.de>