*** Please execute the course content under work/$user ***
mv
is a command-line tool used for moving or renaming files and directories, widely used in Linux and macOS systems. The mv command helps you move files or directories from one location to another, or rename files or directories.
- Move a file:
mv <source_file> <destination_directory>/
- Rename a file:
mv <old_filename> <new_filename>
- Move and rename a file:
mv <source_file> /<new_directory>/<new_filename>
- Move the directory:
mv <source_directory>/ <destination_directory>/
- Use BWA for alignment
- Use Picard to mark duplicates
- Use ThinLinc to open IGV and view the alignment/MarkDuplicates results Therefore, you must download ThinLinc beforehand. For more details, refer to this link.(https://hackmd.io/speUUZSNRZe0n_EAIl2lig)
- If time permits, compare the results from this week with last week's.
Reminder: This session will mainly review the previous content, with the additional step of marking duplicates.
ℹ️
- Picard is a genomic data analysis toolkit designed specifically for handling high-throughput sequencing data. It provides a wide range of powerful tools to assist users in performing various operations during the analysis process, such as MarkDuplicates, manipulating read groups, Sorting and indexing files, data-cleanup operations, statistical analysis, and format conversion. Picard is widely used in workflows for variant detection and genome analysis.
- In genomics and next-generation sequencing (NGS), duplicate reads refer to multiple reads originating from the same original DNA molecule during the sequencing process. These duplicate reads are typically produced due to the PCR amplification process, where DNA polymerase replicates the template DNA during each amplification cycle, doubling the amount of DNA with each cycle. In theory, this should produce a large number of identical DNA fragments. However, certain DNA fragments are amplified more efficiently than others during PCR, which can affect the representativeness and accuracy of the final sequencing data. This can result from factors such as primer design, GC content of the DNA sequence, secondary structures of the DNA (e.g., hairpins), PCR temperature and time, and the efficiency of the DNA polymerase.
- Thus, we use MarkDuplicates to identify and mark duplicate reads. This process typically occurs after alignment and is intended to prevent errors in subsequent analyses caused by duplicate reads originating from the same DNA sequence.
- Log in to the NCHC (For those who forgot how to log in, please refer to this link(https://hackmd.io/jcvG9iIiRW6DTUysi8AKug)).
- Enter the "alignment" folder.
cd /work/username/alignment
- Copy the executable files needed for the class.
rsync -avz /work/u2499286/alignment/bwa_markdup.sh /work/username/alignment
- Enter bwa_markdup.sh
vim bwa_markdup.sh
- Please press "i" to modify the following code:
The following serves as an example based on the files in the bwa_markdup.sh folder (format should follow the examples provided within, do not include the file extension).
(1) #SBATCH -A ACD113120 #Account name/project number
(2) #SBATCH -J alignment ###Job name:可修改
(3) #SBATCH -p ngscourse ###Partition Name:等同PBS裡面的 -q Queue name
(4) #SBATCH -c 2 #使用的core數:請參考Queue資源設定
(5) #SBATCH --mem=13g #使用的記憶體量 請參考Queue資源設定
(6) #SBATCH -o out_almark.log ###Path to the standard output file:可修改
(7) #SBATCH -e err_almark.log ###Path to the standard error ouput file:可修改
(8) #SBATCH --mail-user= ###e-mail:可修改
(9) #SBATCH --mail-type=END ###指定送出email時機:可為NONE, BEGIN, END, FAIL, REQUEUE, ALL
- Enter your account in any location marked as username.
- Enter
:wq
to save and exit.
:wq
- Execute the script (1) Enter the following command to submit the edited draft as an sbatch job:
sbatch bwa_markdup.sh
(2) If submitted successfully, the following message will appear (after the bwa_markdup.sh file completes running, an alignmentRM folder will be automatically created under the alignment directory to store the results):
(3) You can use the following command to check the status of the job execution:
sacct
- Check the results: The "alignmentRM" folder will contain
sam
andbam
files. Confirm the integrity of the files with the detailed steps listed below. (1) Open the "alignmentRM" folder: You can use either a relative or absolute path.
cd alignmentRM # Use relative path
cd /work/username/alignment/alignmentRM # Or use absolute path
(2) Confirm that the files exist:
ls
(3) Check the integrity of the files:
less SRR13076392_S14_L002_.sam
(4) Use Shift
+ g
to view the bottom of the file.
(5) Exit:
q
Since the alignment/MarkDuplicates step in bwa takes a long time, the results used in the following steps are those already generated by the TA. Please first copy the results from the TA into the alignment folder (both files are required).
rsync -avz /work/u2499286/alignment/alignmentRM/SRR13076392_S14_L002_.sorted.markdup.bam ./
rsync -avz /work/u2499286/alignment/alignmentRM/SRR13076392_S14_L002_.sorted.markdup.bai ./
- Use ThinLinc to open a terminal.
- In the terminal, use the
sh
command to open the IGV software:
sh /opt/ohpc/Taiwania3/pkg/biology/IGV/IGV_v2.10.3/igv.sh
(1) Select "More..." from the dropdown menu in the upper left corner.
(2)Search for hg38 and download Human hg38.
(3)Use File → Load from file in the upper left corner to import SAM and BAM files (using BAM files as an example). The files are located at the following path:
(4)In the upper left corner, you can select the chromosome and range to view (blue box), while in the upper right corner (red box), you can select the view size (you may need to zoom in to a sufficient scale to see the results).
For example, using chr16:
If you want to understand what each read's color represents in IGV, you can refer to the following link: https://igv.org/doc/desktop/# [User Guide > Tracks and Data Types > Alignments > Paired-end alignments > Detecting structral variants]
If you want to know th details about bwa :https://bio-bwa.sourceforge.net/bwa.shtml
-
Overrepresented sequences
*** 上課內容請在work/$user
之下執行 ***
mv
是一個用於移動或重新命名文件和資料夾的命令行工具,在 Linux 和 macOS 系統中廣泛使用。mv
命令可以幫助你將文件或目錄從一個位置移動到另一個位置,或者將文件或目錄重新命名。- 移動文件:
mv <source_file> <destination_directory>/
- 重新命名:
mv <old_filename> <new_filename>
- 移動並重新命名:
mv <source_file> /<new_directory>/<new_filename>
- 移動目錄:
mv <source_directory>/ <destination_directory>/
- 利用BWA做alignment
- 利用picard做mark duplicates
- 用thinlinc打開IGV查看alignment/ MarkDuplicates後的結果 所以在這之前必須要先下載好thinlinc,詳細連結可見此
- 若有時間可以比較本週與上週的結果
ℹ️
Picard 是一套genomic data analysis,專為處理高通量測序數據設計,提供了一系列功能強大的工具,幫助用戶在分析過程中進行各種操作,提供如 MarkDuplicates、調整讀數群組、重新排序、數據清理、統計分析和格式轉換等功能,廣泛應用於變異檢測和基因體分析的工作流程中。
- 在Genomics和次世代定序(NGS)中,重複讀數(duplicate reads)是指在定序的過程中由同一原始DNA分子產生的多個讀數。這些讀數的出現通常是由於PCR amplification的過程造成的,在每個擴增循環中,DNA polymerase會複製template DNA,使得每個循環後的DNA量都會成倍增加。理論上,這應該會產生大量相同的DNA片段,但因PCR的過程中,某些DNA片段的擴增效率比其他片段高,會影響最終測序數據的代表性和準確性。這種影響可能源於幾個因素:Primer的設計、DNA sequence的GC含量、DNA的二級結構(hairpin)、PCR的溫度時間及DNA polymerase的效率等。
- 因此我們利用MarkDuplicates來辨識並標記 duplicate reads。這個過程通常在比對alignment之後進行,主要目的是防止來自同一個DNA序列因為重複讀數在後續分析中引起錯誤結果。
- 登入國網(忘記怎麼登入的人請參見連結)
- 進入alignment資料夾
cd /work/username/alignment
- 複製上課所需執行檔
rsync -avz /work/u2499286/bwa.sh /work/username/alignment
- 進入bwa_markdup.sh
vim bwa_markdup.sh
- 請輸入
i
更改以下程式碼:
以下以bwa_markdup.sh資料夾中的做為示範 (格式請依照裡面給你的範例,副檔名不用寫進去)
(1) #SBATCH -A ACD113120 #Account name/project number
(2) #SBATCH -J alignment ###Job name:可修改
(3) #SBATCH -p ngscourse ###Partition Name:等同PBS裡面的 -q Queue name
(4) #SBATCH -c 2 #使用的core數:請參考Queue資源設定
(5) #SBATCH --mem=13g #使用的記憶體量 請參考Queue資源設定
(6) #SBATCH -o out_almark.log ###Path to the standard output file:可修改
(7) #SBATCH -e err_almark.log ###Path to the standard error ouput file:可修改
(8) #SBATCH --mail-user= ###e-mail:可修改
(9) #SBATCH --mail-type=END ###指定送出email時機:可為NONE, BEGIN, END, FAIL, REQUEUE, ALL
- 在任何username的位子輸入自己的主機帳號
- 輸入
:wq
儲存離開
:wq
- 執行script
(1) 輸入以下指令,來以sbatch job的方式送出編輯完成的草稿
sbatch bwa.sh
(2) 若送出成功將會出現以下文字 (bwa.sh
的檔案跑完後會自動在alignmnet資料下建立一個alignmentRM資料夾,將結果放在裡面)
(3) 可使用以下指令查看工作執行情況
sacct
- 查看結果:在
alignmentR
資料夾中會有sam
,bam
檔,並確認檔案完整性,詳細步驟逐條列在下面
(1) 開啟alignmentRM資料夾:可使用相對路徑或絕對路徑
cd alignmnetRM #可使用相對路徑
cd /work/username/alignment/alignmentRM #或使用絕對路徑
(2) 確認檔案存在:
ls
(3) 確認檔案完整性:
less SRR13076392_S14_L002_.sam
(4) 利用 shift+g 查看檔案最底部
(5) 退出:
q
由於bwa在執行alignment/ MarkDuplicates的時間較長,所以執行以下步驟時使用的都是助教已經跑出的結果,請先複製助教的結果到alignment資料夾底下(兩個檔案都要)
rsync -avz /work/u2499286/S14_HC_result/SRR13076392_S14_L002.sorted.bam ./
rsync -avz /work/u2499286/S14_HC_result/SRR13076392_S14_L002.sorted.bam.bai ./
- 使用thinlinc、開啟 'Xfce terminal'
- 在terminal利用
sh
指令開啟IGV軟體
sh /opt/ohpc/Taiwania3/pkg/biology/IGV/IGV_v2.10.3/igv.sh
(3) 透過左上角的File → Load from file可匯入sam檔及bam檔(在這以bam檔為範例),檔案位於以下路徑:
- bam file:
/work/username/alignment/SRR13076392_S14_L002_sorted.markdup.bam
(4) 左上角可選取要看的染色體以及範圍(藍色框),右上角(紅色框)可選取要看的大小(需要放大到足夠的級距才能看到結果)
以chr16為例:
若你想要了解在 IGV 中每個 read 的顏色所代表的意義,可以參考以下連結(https://igv.org/doc/desktop/#) [User Guide > Tracks and Data Types > Alignments > Paired-end alignments > Detecting structral variants]
如果想知道bwa的詳細內容:https://bio-bwa.sourceforge.net/bwa.shtml