20240911 Fantastic Genomic Biomarkers and Where to Find Them Practical Course (Part II)

Main Content of This Course

  1. Copy course files to your folder on the National Center for High-Performance Computing (NCHC) system
  2. Sample QC: a rapid quality control check on raw sequencing data. In high-throughput sequencing experiments, quality control is the process of inspecting and evaluating raw sequencing data to ensure its integrity and reliability. It lets researchers identify potential problems in the data and decide whether filtering or correction is needed.

Copying Course Files

Note: Because the files are copied to NCHC, run every command in the steps below under your own "/work/{your_username}" path on the remote host (please make sure you are in your own path!!!)

Step 1: Create a Path on NCHC

  1. Log in to NCHC (for those who forgot how to log in, please refer to this link).
  2. Enter the work directory and type cd /work/username, then type mkdir result to create a folder named result in the current location on the NCHC system. This will be the folder where you store your course files.
cd /work/username
mkdir result
  3. Next, run cd result, then create a folder with mkdir fastqc to store the downloaded samples.
cd result
mkdir fastqc

ℹ️

Command Basics 101

mkdir

  • Short for "make directory"; creates a new directory (folder) at the specified location.
  • Usage of mkdir: mkdir [options] directory_name # Create a new directory (folder)
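
As a hedged illustration of the mkdir usage above, the sketch below builds the same result/fastqc layout with a single command. WORK_DIR is a hypothetical stand-in (a throwaway temporary directory) for /work/username, so the example can be tried anywhere without touching your course folder.

```shell
# WORK_DIR is a hypothetical stand-in for /work/username (a throwaway temp directory).
WORK_DIR="$(mktemp -d)"

# -p creates any missing parent directories (result, then fastqc)
# and does not complain if the directory already exists.
mkdir -p "$WORK_DIR/result/fastqc"
mkdir -p "$WORK_DIR/result/fastqc"   # running it again is harmless

# Confirm that the folder is there.
ls -d "$WORK_DIR/result/fastqc"
```

Without -p, mkdir stops with an error when the parent folder is missing or the target already exists, which is why the course steps create result first and fastqc second.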
  4. Use the rsync command to copy the required script (a bash file) to your own path. This step teaches you how to copy files from someone else's folder on NCHC to your own folder on NCHC.
rsync -avz /work/u2499286/fastqc.sh ./
  5. Additionally, download the data required for the analysis (six read files covering three samples, so make sure you download all six files!!).
Download password: NGS112-2
  6. Upload them to your own fastqc folder. This step teaches you how to upload files from your local machine to NCHC.
Upload files: rsync -azrvh . <your_account>@t3-c4.nchc.org.tw:/work/<your_account>/result/fastqc
Unzip a file: unzip <filename_to_unzip>
Rename a file: mv <original_filename> <new_filename>
  • Folder description (strongly recommended to understand): fastqc: This folder stores the downloaded files (the FASTQ files).
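
One possible way to confirm that the upload is complete is to count the FASTQ files in the target folder. The sketch below is only an illustration: FASTQ_DIR is a hypothetical stand-in for /work/username/result/fastqc, and the file names (three samples with two read files each) are made up, since the real names come from the course download.

```shell
# FASTQ_DIR is a hypothetical stand-in for /work/username/result/fastqc.
FASTQ_DIR="$(mktemp -d)"

# Create six empty placeholder files (made-up names: 3 samples x 2 read files).
for sample in sampleA sampleB sampleC; do
  touch "$FASTQ_DIR/${sample}_R1.fastq" "$FASTQ_DIR/${sample}_R2.fastq"
done

# Count the FASTQ files and compare with the expected number.
n_files=$(find "$FASTQ_DIR" -maxdepth 1 -type f -name '*.fastq' | wc -l | tr -d ' ')
if [ "$n_files" -eq 6 ]; then
  echo "OK: found all 6 FASTQ files"
else
  echo "Expected 6 FASTQ files, found $n_files"
fi
```

On the cluster you would skip the placeholder loop and point FASTQ_DIR at your real fastqc folder, adjusting the '*.fastq' pattern if your files carry a different extension.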

ℹ️

File Format Introduction

  • fastq: FASTQ is a file format for storing DNA or RNA sequence data generated by high-throughput sequencing technologies. Each record holds both the nucleotide sequence and the corresponding quality scores.
  • fastqc.sh: The shell script that runs the analysis.
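
To make the FASTQ layout concrete, here is a minimal sketch that writes a tiny file with two made-up reads and counts them. It assumes the common unwrapped layout in which every record occupies exactly four lines; the read names, sequences, and quality strings are invented for illustration.

```shell
# Write a tiny two-read FASTQ file (made-up reads) to a temporary location.
TMP_DIR="$(mktemp -d)"
cat > "$TMP_DIR/example.fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGCAAGC
+
FFFFFFFF
EOF

# Each record is 4 lines: @identifier, sequence, "+" separator, quality string.
# The quality string has one character per base, so it matches the sequence length.
lines=$(wc -l < "$TMP_DIR/example.fastq" | tr -d ' ')
reads=$((lines / 4))
echo "lines: $lines, reads: $reads"   # prints "lines: 8, reads: 2"
```

The same line-count-divided-by-four check gives a quick sanity estimate of how many reads a real, uncompressed FASTQ file contains.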

Sample QC

This section uses one file, fastqc.sh, which can be downloaded from here.

Step 1: Create a Shell Script

ℹ️

What is a Shell Script?

A shell script is a program written using the functionality of the shell. It is a plain text file containing shell syntax and commands (including external commands), combined with regular expressions, pipelines, and data redirection, to achieve the desired processing. Writing and using shell scripts improves work efficiency and reduces repetitive typing of commands. Shell scripts run in the shell environments of Unix, Linux, and other Unix-like systems. Common shell scripting languages include Bash (Bourne Again Shell), Sh (Bourne Shell), and Zsh (Z Shell).

Reference: https://linux.vbird.org/linux_basic/centos7/0340bashshell-scripts.php#script
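
The description above can be illustrated with a very small script. This is only a sketch: the script content and the second sample name are made up, and the temporary files stand in for files you would normally keep in your own folder. It shows a loop over arguments, data redirection into a log file, and a pipeline.

```shell
# Write a small script to a temporary file (the script content is a made-up example).
SCRIPT="$(mktemp)"
cat > "$SCRIPT" <<'EOF'
#!/bin/bash
# Loop over the sample names given as arguments and report each one.
for name in "$@"; do
  echo "processing $name"
done
EOF

# Data redirection: send the script's output into a log file.
LOG="$(mktemp)"
bash "$SCRIPT" SEA sampleB > "$LOG"

# Pipeline: feed the script's output into wc to count the lines it printed.
n_lines=$(bash "$SCRIPT" SEA sampleB | wc -l | tr -d ' ')
echo "the script printed $n_lines lines"
```

Saving the commands in a file means the same steps can be rerun on new samples by changing only the arguments.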

  1. Enter the result directory with
cd /work/username/result
  2. Create a shell script by typing
vim fastqc.sh

ℹ️

vim fastqc.sh

  • In Unix and Linux systems, the command vim fastqc.sh opens the file named fastqc.sh in the Vim editor, creating it if necessary, so that you can edit the shell script. The command has two parts:
  • vim: The name of the command-line text editor Vim, a powerful editor widely used for program code and scripts.
  • fastqc.sh: The name of the file to edit. The .sh extension conventionally marks a shell script.
  • Behavior after executing this command: If fastqc.sh already exists, Vim opens it so you can view and edit its contents. If fastqc.sh does not exist, Vim opens an empty buffer under that name, and the file is created on disk when you first save.

⚠️
3. Press the "i" key to enter insert mode (you will see "-- INSERT --" at the bottom) and paste the shell script you downloaded into fastqc.sh.

Step 2: Modify the Analysis Script

  1. Change the following code. The example below uses SEA from the fastq directory (follow the format of the provided example and omit the file extension).
    (1) SLURM Scheduling Settings

ℹ️

What is SLURM?

SLURM (Simple Linux Utility for Resource Management) is an open-source resource manager and workload scheduler for large-scale computing clusters. It is primarily used in high-performance computing (HPC) environments to manage and schedule computing resources such as CPUs, memory, and compute nodes. SLURM is widely used in large supercomputing centers, research institutions, and enterprises.

  • Modify this block according to the instructions (see the explanation below):
#SBATCH -A ACD113120              # Account name/project number
#SBATCH -J fastqc                 # Job name
#SBATCH -p ngscourse              # Partition Name (equivalent to PBS's -q Queue name)
#SBATCH -c 2                      # Number of cores used (refer to Queue resource settings)
#SBATCH --mem=13g                 # Amount of memory used (refer to Queue resource settings)
#SBATCH -o out.log                # Path to the standard output file
#SBATCH -e err.log                # Path to the standard error output file
#SBATCH --mail-user=<your_email_address>    # Email
#SBATCH --mail-type=END           # Specifies when to send email; can be NONE, BEGIN, END, FAIL, REQUEUE, ALL
# For NCHC usage

⚠️
(2) Press Esc to exit insert mode.
(3) Type :wq and press Enter to save and exit.
If you see "E45: 'readonly' option is set (add ! to override)", type :wq! to save.
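
For orientation, the sketch below shows how a #SBATCH header and the commands it schedules might sit together in one file. The header values are copied from the block above; the body underneath (the qc_reports folder and the fastqc call on fastqc/*.fastq) is an assumption for illustration and is not the content of the course's fastqc.sh. The sketch only writes the file and checks its syntax, so it can be tried without a cluster.

```shell
# Write a sketch of a job script to a temporary file.
# The #SBATCH values come from the tutorial; the body below them is an assumption.
JOB="$(mktemp)"
cat > "$JOB" <<'EOF'
#!/bin/bash
#SBATCH -A ACD113120
#SBATCH -J fastqc
#SBATCH -p ngscourse
#SBATCH -c 2
#SBATCH --mem=13g
#SBATCH -o out.log
#SBATCH -e err.log

# Hypothetical body: run FastQC with 2 threads on every FASTQ file in ./fastqc.
mkdir -p qc_reports
fastqc -t 2 -o qc_reports fastqc/*.fastq
EOF

# Bash treats the #SBATCH lines as comments; only sbatch interprets them.
# "bash -n" checks the syntax without running anything.
bash -n "$JOB" && echo "syntax OK"
n_directives=$(grep -c '^#SBATCH' "$JOB")
echo "found $n_directives #SBATCH directives"
```

Keeping the thread count passed to the tool in line with the -c value avoids requesting cores that the job never uses.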

ℹ️

Command Basics 101

  1. :wq: Save and exit. This Vim command saves the current file and quits the editor. Press Esc to return to normal mode, then type :wq and press Enter. "w" stands for write (save) and "q" stands for quit (exit).

  2. :wq!: Force save and exit. If the file is read-only or has other restrictions, use :wq! to force the save and exit. "!" means force.

  3. To save without exiting, use :w. To exit without saving, use :q when there are no unsaved changes, or :q! to discard your changes and force the exit.

  2. Execute the Script
    (1) Submit the edited script as a SLURM job with the following command:

sbatch fastqc.sh

ℹ️

Command Basics 101

sbatch is a command-line tool for submitting job scripts to the SLURM job scheduling system. These scripts typically include SLURM directives and commands to be executed.

(2) If submission is successful, you will see:

Submitted batch job ___

(3) Use the following command to check the job status:

sacct

ℹ️

Command Basics 101

sacct lists the status of the jobs or job arrays associated with your account, such as running, terminated, or completed. It is the most basic command for viewing job status and can display each job's ID, user, state, and resource usage, which helps you track and analyze your jobs.
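
When several jobs are listed, it can help to query just the one you submitted. The sketch below is a hedged illustration that runs without a cluster: the job ID 123456 and the sample sacct line are made up, and the real commands are shown in comments. It relies on the "Submitted batch job" message format and on the sacct options -j, -X, -n, -P, and --format.

```shell
# Example text in the shape sbatch prints; 123456 is a made-up job ID.
# On the cluster you would use: submit_output=$(sbatch fastqc.sh)
submit_output="Submitted batch job 123456"

# The job ID is the last word of the message.
job_id="${submit_output##* }"
echo "job id: $job_id"

# On the cluster, this would query only that job, one line per allocation:
#   sacct -j "$job_id" -X -n -P --format=JobID,JobName,State,Elapsed
# Made-up line in the pipe-separated shape that -P (--parsable2) produces:
sacct_line="123456|fastqc|COMPLETED|00:03:10"

# Field 3 is the State column in the --format list above.
state=$(printf '%s\n' "$sacct_line" | cut -d'|' -f3)
echo "state: $state"
```

A state of COMPLETED means the job finished normally; states such as PENDING, RUNNING, or FAILED tell you whether to wait or to look at err.log.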

  3. View the results. Typing ls lists all files in the directory, where you should see out.log, err.log, and the HTML files.
  • out.log and err.log hold the standard output and standard error of this script. If an error occurred during execution, check err.log.
  • The script generates an HTML file, which you can download and open to view the FastQC Report. Download details are available at the link.
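
A quick check of the job's output might look like the sketch below. Everything in it is a stand-in: RUN_DIR is a temporary directory playing the role of the folder that holds your logs, the log content is invented, and the report name SEA_fastqc.html is an assumption about how the report is named.

```shell
# RUN_DIR is a hypothetical stand-in for the folder holding the job's output.
RUN_DIR="$(mktemp -d)"
echo "Analysis complete" > "$RUN_DIR/out.log"   # made-up log content
: > "$RUN_DIR/err.log"                          # empty err.log
touch "$RUN_DIR/SEA_fastqc.html"                # placeholder report (name is an assumption)

# -s is true when a file exists and is not empty.
if [ -s "$RUN_DIR/err.log" ]; then
  status="err.log has content"
  cat "$RUN_DIR/err.log"
else
  status="err.log is empty"
fi
echo "$status"

# Count the HTML reports that are ready to download.
n_reports=$(find "$RUN_DIR" -maxdepth 1 -type f -name '*.html' | wc -l | tr -d ' ')
echo "HTML reports: $n_reports"
```

A non-empty err.log is not always a failure, because some tools write progress messages to standard error, so read its content before concluding that the job went wrong.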

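Biomarkers and Where to Find Them Practical Course (Part II)

Main Content of This Course

  1. Copy the course files to your own folder on NCHC
  2. Sample QC: a rapid quality control check on raw sequencing data. Quality control lets researchers identify potential problems in the data and decide whether to filter or correct it. In high-throughput sequencing experiments, quality control is the process of inspecting and evaluating raw sequencing data to ensure its integrity and reliability.

Copying Course Files

⚠️
Note: Because the files are copied to NCHC, every command in the steps below must be run under the "/work/{your_username}" path on your own remote host (please make sure you are in your own path!!!)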

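Step 1: Create a Path on NCHC

  1. Log in to NCHC (if you forgot how to log in, see the link).
  2. Enter the work folder with cd /work/username, then type mkdir result to create a folder named result at your current location on the NCHC host. This folder stores the files for this assignment.
cd /work/username
mkdir result
  3. Next, run cd result and create another folder with mkdir fastqc to store the downloaded samples.
cd result
mkdir fastqc

ℹ️

Command Basics 101

  • mkdir: make directory, used to create a new directory (folder) at the specified location.
  • Usage of mkdir: mkdir [options] directory_name # Create a new directory (folder)
  4. In the terminal, use the rsync command to copy the script needed for the analysis (a bash file) to your own path. This step teaches you how to copy files from someone else's folder on NCHC to your own folder on NCHC.
rsync -avz /work/u2499286/fastqc.sh ./
  5. Also download the data needed for the analysis (six read files covering three samples, so you need to download all six files!!).
# Download password: NGS112-2
  6. Upload the files directly to fastqc. This step teaches you how to upload files from your local machine to NCHC.
Upload files: rsync -azrvh . <your_account>@t3-c4.nchc.org.tw:/work/<your_account>/result/fastqc
Unzip a file: unzip <filename_to_unzip>
Rename a file: mv <original_filename> <new_filename>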

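⚠️
fastqc: This folder stores the downloaded files (the FASTQ files).

ℹ️

File Format Introduction

  • fastq: FASTQ is a file format for storing DNA or RNA sequence data generated by high-throughput sequencing technologies. It contains both the nucleotide sequence and the corresponding quality scores.
  • fastqc.sh: The shell script needed to run the analysis.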

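Sample QC

Step 1: Create a Shell Script

ℹ️

What is a Shell Script?

Shell script

  • Put simply, a shell script is a "program" written with the functionality of the shell. It is a plain text file containing shell syntax and commands (including external commands), combined with regular expressions, pipelines, and data redirection, to achieve the processing we want. Writing and using shell scripts improves work efficiency and reduces repetitive typing of commands.
  • Shell scripts run in the shell environments of Unix, Linux, and other Unix-like systems. The most common shell scripting languages include Bash (Bourne Again Shell), Sh (Bourne Shell), and Zsh (Z Shell).

Reference: https://linux.vbird.org/linux_basic/centos7/0340bashshell-scripts.php#script

  1. Enter the result folder by typing cd /work/username/result
  2. Open the shell script by typing vim fastqc.sh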

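ℹ️

Command Basics 101

  • In Unix and Linux systems, the command vim fastqc.sh opens the file named fastqc.sh in the Vim editor, creating it if necessary. It is a simple command for entering Vim to edit the specified shell script.
  • What each part means:
    • vim: The name of the command-line text editor Vim, a powerful editor widely used for program code and scripts.
    • fastqc.sh: The name of the file to edit. The .sh extension conventionally marks a shell script.
  • Behavior after running the command:
    • If fastqc.sh already exists, vim fastqc.sh opens it so you can view and edit its contents.
    • If fastqc.sh does not exist, vim fastqc.sh opens an empty buffer under that name so you can start writing the script, and the file is created on disk when you first save.

⚠️
3. Press the "i" key to enter insert mode ("-- INSERT --" appears at the bottom), and paste the shell script downloaded in the previous step into fastqc.sh.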

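Step 2: Modify the Analysis Script

  1. Change the following code:

The demonstration below uses SEA from the fastq folder as the example (follow the format of the example provided in the script and omit the file extension).

(1) Slurm Scheduling Settings

ℹ️

What is Slurm?

SLURM (Simple Linux Utility for Resource Management) is an open-source resource manager and workload manager for large-scale computing clusters. It is mainly used in high-performance computing (HPC) environments to manage and schedule computing resources such as CPUs, memory, and compute nodes. SLURM is widely used in large supercomputing centers, research institutions, and enterprises.

  • Next, modify this block as instructed (see the two notes below):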

    #SBATCH -A ACD113120              # Account name/project number
    #SBATCH -J fastqc                 # Job name
    #SBATCH -p ngscourse              # Partition name (equivalent to PBS's -q queue name)
    #SBATCH -c 2                      # Number of cores used (refer to the queue resource settings)
    #SBATCH --mem=13g                 # Amount of memory used (refer to the queue resource settings)
    #SBATCH -o out.log                # Path to the standard output file
    #SBATCH -e err.log                # Path to the standard error output file
    #SBATCH --mail-user=<your_email_address>    # Email
    #SBATCH --mail-type=END           # When to send email; can be NONE, BEGIN, END, FAIL, REQUEUE, ALL
    # For NCHC usage
    
    

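⚠️
(2) Press Esc to leave insert mode.

(3) Type :wq and press Enter to save (if "E45: 'readonly' option is set (add ! to override)" appears, type :wq! to save).

ℹ️

Command Basics 101

  • :wq: Save and exit
    The Vim command for saving the file and quitting the editor. Press Esc first to return to normal mode, then type :wq and press Enter. This saves the changes to the current file and quits Vim. w stands for write (save) and q stands for quit (exit).
  • :wq!: Force save and exit
    If the file is read-only or otherwise restricted, use :wq! to force the save and exit. ! means force.
  • To save without exiting, use :w. To exit without saving, use :q when there are no unsaved changes, or :q! to discard your changes and force the exit.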
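  2. Run the script

(1) Type the following command to submit the edited script as an sbatch job:

sbatch fastqc.sh

ℹ️

Command Basics 101

sbatch is a SLURM command-line tool for submitting job scripts to the SLURM job scheduling system. These scripts usually contain SLURM directives and the commands to execute.

(2) If the submission succeeds, the following text appears (the output path is already set to the result folder):

Submitted batch job ___

(3) Use the following command to check how the job is running:

sacct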

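ℹ️

Command Basics 101

sacct lists the status of the jobs or job arrays associated with your account, such as running, terminated, or completed. It is the most basic command for viewing jobs. It can display information such as ID, user, state, and resources used, which is very useful for tracking and analyzing how your jobs run.

  3. View the results
    out.log and err.log are the standard output and standard error of this script. If an error occurred during the run, check err.log (the files are under /work/username/).
  • This script generates an HTML file; download it and open it to view the FastQC Report.
  • For download details, see the link.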