🌟

【シングルセル解析】cellranger countをWSLで実行

2023/04/06に公開

Bash

10X Genomicsのプラットフォームでシングルセルライブラリーの調製 -> fastqファイルまで完了すると、10X Genomicsが提供するcellrangerというツールで参照配列へのマッピングや細胞dropletの検出などを行う。

使用したWSL環境は以下のように作成したものである。

PCスペック

cellrangerを行うには8コア, 64GBメモリが最低限のスペックとして10X Genomicsには書かれている。推奨は16コア、128GBメモリとのこと。

コア数確認

lscpuコマンドでコア数が確認できる。私のPCでは16コア使用可能とのこと。

lscpu

memory確認

free -hでメモリが確認できる。

free -h

私のPCは本来64GBメモリあるのだが、WSLからは31GBメモリしか確認できない。
これはWSLに割り当てられているメモリがPCメモリの半分となっているからだ。
10X Genomicsが求める最低スペックにはなっていないがこのまま進める。

Cell Rangerのインストール

10X Genomicsのページからインストールコマンドが公開されている。(2023年4月ではversion 7.1.0が最新)

インストーラーのダウンロード

まずはページ内のdownloads pageというリンクからソースファイルのダウンロードを行う。
ダウンロードにはアカウント登録が必要となる。
ダウンロードページではcurlコマンドかwgetコマンドでのダウンロードコマンドがコピペできるので好きな方を使用。

10X Genomics　Cell Rangerのダウンロードページ

Ubuntuを立ち上げてダウンロードコマンドを張り付けてEnter。カレントディレクトリにcellranger-7.1.0.tar.gzという圧縮ファイルがダウンロードされる。

インストーラーの解凍

紹介されている解凍コマンドをコピペで実行。

tar -xzvf cellranger-7.1.0.tar.gz

カレントディレクトリにcellranger-7.1.0というフォルダが作成される。

パスを通す

cellranger-7.1.0フォルダにあるcellrangerというコマンドを使うために、このフォルダへのパスを通す必要がある。

cellrangerコマンドの場所

export PATH=コマンドがあるディレクトリへのパス:$PATHでパスを通すことができる。
新たなツールをダウンロードする際によく使うので覚えておくとよい。
(sudo apt installやcondaコマンドでインストールしたツールは、既にパスが通った先にインストールするため新たにパスを通す必要はない。)

whichコマンドを使うとパスが通ったコマンドファイルの場所を表示させることができる。

# cellrangerコマンドを探す --> まだパスが通ってないので見つからず
which cellranger

# パスを通す
export PATH=~/cellranger-7.1.0:$PATH

# もう一度cellrangerコマンドを探してみる --> cellrangerファイルの場所が表示
which cellranger

echo $PATHと打つとパスが通っている場所の一覧を表示できる。

cellranger-7.1.0フォルダへのパスが追記されているのがわかる。

永続パスを通す

export PATH=~/cellranger-7.1.0:$PATHで通したパスはUbuntuを立ち上げなおすと消えている。
毎回パスを通しなおすのは面倒なので、WSLの設定ファイル (.profile/.bash_profile, .bashrc)にexport PATH=~/cellranger-7.1.0:$PATHを書き込んでおく。

以下は.profileに追記する例である。>>は追記を意味する。>は上書きになるので間違えないように注意。

echo 'export PATH=~/cellranger-7.1.0:$PATH' >> .profile

catコマンドで.profileを確認。末尾に追記されているのがわかる。

cat .profile

.profileの中身をcatで表示
これでUbuntu起動時に常にcellranger-7.1.0フォルダにパスが通った状態になる。

なお、Notepad++などのテキストエディタで直接打ち込んでも構わない。

参照ゲノム配列

10X Genomicsではcellrangerで使用する用のreference genomeファイルも公開している。ヒト、マウス、ヒト+マウスの3種類のみでそれ以外の生物種の場合は自身でreference genomeファイルを作成する。
reference genomeにGFPなどのレポーター遺伝子を組み込む場合にも自身で作成する必要がある。
cellranger mkrefコマンドで行えるが、詳細は別途調べてほしい。

こちらのページに各referenceファイルのダウンロードコマンドが用意されているのでコピペでダウンロードする。
https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest?

今回はヒトのものを使用する。

curl -O https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz

こちらも解凍しておく。

tar -xzvf refdata-gex-GRCh38-2020-A.tar.gz

デモデータ

10X Genomicsがデモデータを沢山公開してくれている。今回は10k Human PBMCs, 3' v3.1, Chromium Xを使用する。

fastqファイルのダウンロード

ページ内のFASTQsというリンクからfastqファイルをダウンロード。ブラウザのダウンロードを使ってもいいが、ここではリンクの箇所で右クリック > リンクをコピーでURLを取得し、wgetコマンドでダウンロードした。

wget https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_Controller/10k_PBMC_3p_nextgem_Chromium_Controller_fastqs.tar

34GBもあるので注意。

md5sum

md5sumコマンドでmd5sum値が公開ページ通りか確認しておく。この値が異なる場合はダウンロードがうまくできていない。やり直し。

md5sum 10k_PBMC_3p_nextgem_Chromium_Controller_fastqs.tar

md5sumコマンドの出力

解凍

複数のfastqファイルが入ったフォルダがtar圧縮されている。まずは解凍。

tar -xvf 10k_PBMC_3p_nextgem_Chromium_Controller_fastqs.tar

解凍されたフォルダには16個のfastqファイルが入っていた。(デモデータでは同じサンプルを複数回シークエンスしているためファイル数が多くなっている。)

fastqファイルの内訳

デモデータの説明ではPaired endで配列を読み、indexは2つ使用しているとのこと。

Read 1: 28 cycles (16塩基のbarcode配列 + 12塩基のUMI配列)
Read 2: 90 cycles (transcript)
i5 index: 10 cycles (sample index)
i7 index: 10 cycles (sample index)

Read1

①細胞1つ1つを識別するための細胞バーコード配列と ②mRNA分子 1コピーずつを識別するためのUMI配列で構成される合計28塩基のfastqファイル。

一部を表示

zcat 10k_PBMC_3p_nextgem_Chromium_Controller_fastqs/10k_PBMC_3p_nextgem_Chromium_Controller_S2_L001_R1_001.fastq.gz | head

Read2

mRNA配列をシークエンスした90塩基のfastqファイル。

一部を表示

zcat 10k_PBMC_3p_nextgem_Chromium_Controller_fastqs/10k_PBMC_3p_nextgem_Chromium_Controller_S2_L001_R2_001.fastq.gz | head

index

複数のサンプルを同時にシークエンサーに入れる際に後でどのサンプル由来のリードかを識別できるようにする配列。
1サンプルだけでシークエンサーに入れるなら無くてもよい。この例ではindex配列を2つ使用したとのこと。

一部を表示

zcat 10k_PBMC_3p_nextgem_Chromium_Controller_fastqs/10k_PBMC_3p_nextgem_Chromium_Controller_S2_L001_I1_001.fastq.gz | head

fastqファイル名のルール

[Sample Name]_S[Sample Number]_L00[Lane Number]_[Read Type]_001.fastq.gz というファイル名である必要がある。このルールを守らないとcellrangerコマンドで認識できない。

デモデータでは4つRead_Typeのものが4セットで合計16ファイルあるが、Sample_Nameの箇所が同じでLane Numberが異なるので、1つのサンプルをシークエンサーの複数レーンで読み込んだということだろう。
cellrangerコマンドではSample Nameの箇所が同一であれば同じサンプルとしてまとめて処理してくれる。
Read TypeはR1, R2, I1, I2が想定される。

S[Sample Number] の箇所はS1でもS2でも認識する。Lane Numberも厳密でなくてよい。
「どのfastqファイルが同一サンプル由来か」「どのfastqファイルがR1/R2なのか」が重要。

cellranger count

cellranger countは1サンプル分の遺伝子発現解析を実行する。

help

まずはcellrander countのhelpを確認しておく。

cellranger count --help

cellranger-count
Count gene expression (targeted or whole-transcriptome) and/or feature barcode
reads from a single sample and GEM well

USAGE:
    cellranger count [OPTIONS] --id <ID> --transcriptome <PATH>

OPTIONS:
        --id <ID>
            A unique run id and output folder name [a-zA-Z0-9_-]+

        --description <TEXT>
            Sample description to embed in output files [default: ]

        --transcriptome <PATH>
            Path of folder containing 10x-compatible transcriptome reference

        --fastqs <PATH>
            Path to input FASTQ data

        --project <TEXT>
            Name of the project folder within a mkfastq or bcl2fastq-generated
            folder from which to pick FASTQs

        --sample <PREFIX>
            Prefix of the filenames of FASTQs to select

        --lanes <NUMS>
            Only use FASTQs from selected lanes

        --libraries <CSV>
            CSV file declaring input library data sources

        --feature-ref <CSV>
            Feature reference CSV file, declaring Feature Barcode constructs and            associated barcodes

        --target-panel <CSV>
            The target panel CSV file declaring the target panel used, if any.
            Default analysis will exclude intronic mapped reads, which is the
            recommended mode for targeted assay. Use include-introns=true to
            include intronic mapped reads in analysis

        --expect-cells <NUM>
            Expected number of recovered cells, used as input to cell calling
            algorithm

        --force-cells <NUM>
            Force pipeline to use this number of cells, bypassing cell calling
            algorithm. [MINIMUM: 10]

        --no-bam
            Set --no-bam to not generate the BAM file. This will reduce the
            total computation time for the pipestance and the size of the output            directory. If unsure, we recommend not to use this option. BAM file
            could be useful for troubleshooting and downstream analysis

        --nosecondary
            Disable secondary analysis, e.g. clustering. Optional

        --r1-length <NUM>
            Hard trim the input Read 1 to this length before analysis

        --r2-length <NUM>
            Hard trim the input Read 2 to this length before analysis

        --include-introns <true|false>
            Include intronic reads in count (default=true unless --target-panel
            is specified in which case default=false)

        --chemistry <CHEM>
            Assay configuration. NOTE: by default the assay configuration is
            detected automatically, which is the recommended mode. You usually
            will not need to specify a chemistry. Options are: 'auto' for
            autodetection, 'threeprime' for Single Cell 3', 'fiveprime' for
            Single Cell 5', 'SC3Pv1' or 'SC3Pv2' or 'SC3Pv3' for Single Cell 3'
            v1/v2/v3, 'SC3Pv3LT' for Single Cell 3' v3 LT, 'SC3Pv3HT' for Single            Cell 3' v3 HT, 'SC5P-PE' or 'SC5P-R2' for Single Cell 5',
            paired-end/R2-only, 'SC-FB' for Single Cell Antibody-only 3' v2 or
            5'. To analyze the GEX portion of multiome data, chemistry must be
            set to 'ARC-v1'; 'ARC-v1' chemistry cannot be autodetected [default:            auto]

        --no-libraries
            Proceed with processing using a --feature-ref but no Feature Barcode            libraries specified with the 'libraries' flag

        --check-library-compatibility <true|false>
            Whether to check for barcode compatibility between libraries.
            [default: true]

        --no-target-umi-filter
            Turn off the target UMI filtering subpipeline. Only applies when
            --target-panel is used

        --dry
            Do not execute the pipeline. Generate a pipeline invocation (.mro)
            file and stop

        --jobmode <MODE>
            Job manager to use. Valid options: local (default), sge, lsf, slurm
            or path to a .template file. Search for help on "Cluster Mode" at
            support.10xgenomics.com for more details on configuring the pipeline            to use a compute cluster [default: local]

        --localcores <NUM>
            Set max cores the pipeline may request at one time. Only applies to
            local jobs

        --localmem <NUM>
            Set max GB the pipeline may request at one time. Only applies to
            local jobs

        --localvmem <NUM>
            Set max virtual address space in GB for the pipeline. Only applies
            to local jobs

        --mempercore <NUM>
            Reserve enough threads for each job to ensure enough memory will be
            available, assuming each core on your cluster has at least this much            memory available. Only applies to cluster jobmodes

        --maxjobs <NUM>
            Set max jobs submitted to cluster at one time. Only applies to
            cluster jobmodes

        --jobinterval <NUM>
            Set delay between submitting jobs to cluster, in ms. Only applies to            cluster jobmodes

        --overrides <PATH>
            The path to a JSON file that specifies stage-level overrides for
            cores and memory. Finer-grained than --localcores, --mempercore and
            --localmem. Consult https://support.10xgenomics.com/ for an example
            override file

        --uiport <PORT>
            Serve web UI at http://localhost:PORT

        --disable-ui
            Do not serve the web UI

        --noexit
            Keep web UI running after pipestance completes or fails

        --nopreflight
            Skip preflight checks

    -h, --help
            Print help information

これが基本的なコマンドである。

cellranger count [OPTIONS] --id <ID> --transcriptome <PATH>

引数は色々とあるが、重要なものを以下に挙げる。

--id:
任意のサンプルIDを入れる。この名前のフォルダが作成され、そこに出力が保存される。既に同じ名前のフォルダがあるとエラーが出るので注意。英数字とハイフン、アンダーバーのみ使用可能。
--transcriptome:
参照ゲノムファイルが入っているフォルダへのパスを入れる。
--fastqs:
サンプルのfastqファイルが入っているフォルダへのパスを入れる。
--sample:
fastqファイル名のSample nameの箇所を入れる。
--expect-cells:
予想される細胞数を入れる。このデモデータでは--expect-cells=10000で解析したと書いてあった。
--force-cells:
cellrangerに細胞か非細胞かの判定をさせずに、ここに指定した細胞数を細胞と判定させる。mRNA検出コピー数の降順で決めているみたい。
--include-introns:
イントロンにマッピングされたリードも発現量に入れるかどうか。デフォルトでtrue。古いversionのcellrangerではデフォルトがfalseだった。
--no-bam:
bamファイルを書き出さないようにするオプション。bamファイルは容量が大きいわりに必ずしも解析に必要ではないので、--no-bamオプションをつけてもよい。velocity解析の際にはbamファイルが必要となる。
--localcores/--localmem:
使用コア数/使用メモリに上限を設けたければ指定すればよい。

実行

実際に以下のパス、オプションでcellranger countを実行してみる。(かなり時間がかかるので1晩放置ぐらいのつもりで)

cellranger count \
--id test \
--transcriptome ~/refdata-gex-GRCh38-2020-A \
--fastqs ~/10k_PBMC_3p_nextgem_Chromium_Controller_fastqs/ \
--sample 10k_PBMC_3p_nextgem_Chromium_Controller \
--expect-cells 10000 \
--no-bam

以下のように解析が始まる。

開始直後の画面

今回は2時間半程度で終了していた。

終了画面

出力フォルダ

出力フォルダ

実際に使用するファイルを幾つか紹介する。

filtered_feature_bc_matrix:
cellranger countで細胞として認識されたdropletのデータ。
3つのファイルが入っているフォルダはR Seuratでデータを読み込む際によく使用する。
filtered_feature_bc_matrix.h5は3つのファイルが1つのファイルに収められたもので、R Seuratでこちらを使ってデータ読み込みすることも可能。
raw_feature_bc_matrix:
全dropletのデータ。cellranger countの細胞認識を使わずに、自身でフィルタリングしていく場合はこちらを使用する。raw_feature_bc_matrix.h5も同様。
cloupe.cloupe:
10X Genomics社配布のビューアーソフト(loupe browser)で可視化するために使用するファイル。次元削減やクラスタリングされたデータの確認や変動遺伝子解析も行える。
web_summary.html:
データのqualityなどのsummaryファイル。まずはこちらを参照して、実験自体がうまくいっているのか確認するとよい。

web_summary.html

割り当てメモリの変更

WSL2ではデフォルトではPCのメモリの半分しか割り当てられていないが、これを変更することも可能。
以下に例を示すが、実際にこれをしても今回の例ではcellranger countの速度が上がる訳ではなかった。

CドライブのUserフォルダに.wslconfigファイルを作成し、その中にWSLの割り当てメモリ数を記載するとのこと。
テキストエディタで行ってもよいが、せっかくなのでUbuntu内で行ってみる。

cd /mnt/c/Users/rchiji/
touch .wslconfig
echo "[wsl2]" >> .wslconfig
echo "memory=64GB" >> .wslconfig

catで.wslconfigファイルの中身を表示

次にPCを再起動もしくはWindows PowershellでWSLをシャットダウンしてからUbuntuを再起動することで反映される。

PowershellからWSLをシャットダウン

free -h

totalメモリが62GBと表示されるようになった。

PCスペック

コア数確認

memory確認

Cell Rangerのインストール

インストーラーのダウンロード

インストーラーの解凍

パスを通す

永続パスを通す

参照ゲノム配列

デモデータ

fastqファイルのダウンロード

md5sum

解凍

fastqファイルの内訳

Read1

Read2

index

fastqファイル名のルール

cellranger count

help

実行

出力フォルダ

割り当てメモリの変更

Discussion