🧬

graphMB: アセンブリグラフを学習モデルに組み込んだビングツール

2024/02/09に公開

ビニング

メタゲノム

tech

はじめに

配列決定技術とアセンブリ法の進歩にもかかわらず、メタゲノムサンプルから高品質の微生物ゲノムを取得することは依然としてコストのかかる作業です。

現在のメタゲノムビニングツールはアセンブリグラフを最大限に活用できておらず、また、ロングリードアセンブリ向けに最適化されたものも少ない状況にあります。

ディープグラフ学習アルゴリズムは、複雑なグラフデータ構造を扱うための技術として、他の分野でも利用されています。

アセンブリステップで生成されるグラフ構造のコンティグの特徴を統合した深層学習を、ビニングのステップに適用することで優れたビンを取得することができます。

GraphMB アブスト参照

GraphMB

GraphMBはアセンブリグラフをコンティグの特徴学習プロセスに組み込み、カバレッジのより高いエッジを重要視するようにニューラルネットワークに学習させます。

学習には、グラフノードの表現を学習できる深層学習モデルの一種であるGNNを使用します。

Install

mamba install -c conda-forge make cmake libgcc python=3.8 pip tensorflow numpy
conda activate graphmb
pip install graphmb

USAGE GraphMB

usage: graphmb [-h] [--assembly ASSEMBLY] [--assembly_name ASSEMBLY_NAME] [--graph_file GRAPH_FILE]
               [--edge_threshold EDGE_THRESHOLD] [--depth DEPTH] [--features FEATURES] [--labels LABELS] [--embs EMBS]
               [--model_name MODEL_NAME] [--activation ACTIVATION] [--layers_vae LAYERS_VAE] [--layers_gnn LAYERS_GNN]
               [--hidden_gnn HIDDEN_GNN] [--hidden_vae HIDDEN_VAE] [--embsize_gnn EMBSIZE_GNN] [--embsize_vae EMBSIZE_VAE]
               [--batchsize BATCHSIZE] [--batchtype BATCHTYPE] [--dropout_gnn DROPOUT_GNN] [--dropout_vae DROPOUT_VAE]
               [--lr_gnn LR_GNN] [--lr_vae LR_VAE] [--graph_alpha GRAPH_ALPHA] [--kld_alpha KLD_ALPHA] [--ae_alpha AE_ALPHA]
               [--scg_alpha SCG_ALPHA] [--clusteringalgo CLUSTERINGALGO] [--kclusters KCLUSTERS] [--aggtype AGGTYPE]
               [--decoder_input DECODER_INPUT] [--vaepretrain VAEPRETRAIN] [--ae_only] [--negatives NEGATIVES] [--quick]
               [--classify] [--fanout FANOUT] [--epoch EPOCH] [--print PRINT] [--evalepochs EVALEPOCHS] [--evalskip EVALSKIP]
               [--eval_split EVAL_SPLIT] [--kmer KMER] [--rawfeatures] [--clusteringloss] [--targetmetric TARGETMETRIC]
               [--concatfeatures] [--no_loss_weights] [--no_sample_weights] [--early_stopping EARLY_STOPPING] [--nruns NRUNS]
               [--mincontig MINCONTIG] [--minbin MINBIN] [--mincomp MINCOMP] [--randomize] [--labelgraph] [--binarize] [--noedges]
               [--read_embs] [--reload] [--markers MARKERS] [--post POST] [--writebins] [--skip_preclustering] [--outname OUTNAME]
               [--cuda] [--noise] [--savemodel] [--tsne] [--numcores NUMCORES] [--outdir OUTDIR] [--assembly_type ASSEMBLY_TYPE]
               [--contignodes] [--seed SEED] [--quiet] [--read_cache] [--version] [--loglevel LOGLEVEL]

Train graph embedding model

options:
  -h, --help            show this help message and exit
  --assembly ASSEMBLY   Assembly base path
  --assembly_name ASSEMBLY_NAME
                        File name with contigs
  --graph_file GRAPH_FILE
                        File name with graph
  --edge_threshold EDGE_THRESHOLD
                        Remove edges with weight lower than this (keep only >=)
  --depth DEPTH         Depth file from jgi
  --features FEATURES   Features file mapping contig name to features
  --labels LABELS       File mapping contig to label
  --embs EMBS           No train, load embs
  --model_name MODEL_NAME
                        One of the implemented models: gcn, gat, sage, sage_lstm, _ccvae variation
  --activation ACTIVATION
                        Activation function to use(relu, prelu, sigmoid, tanh)
  --layers_vae LAYERS_VAE
                        Number of layers of the VAE
  --layers_gnn LAYERS_GNN
                        Number of layers of the GNN
  --hidden_gnn HIDDEN_GNN
                        Dimension of hidden layers of GNN
  --hidden_vae HIDDEN_VAE
                        Dimension of hidden layers of VAE
  --embsize_gnn EMBSIZE_GNN, --zg EMBSIZE_GNN
                        Output embedding dimension of GNN
  --embsize_vae EMBSIZE_VAE, --zl EMBSIZE_VAE
                        Output embedding dimension of VAE
  --batchsize BATCHSIZE
                        batchsize to train the VAE
  --batchtype BATCHTYPE
                        Batch type, nodes or edges
  --dropout_gnn DROPOUT_GNN
                        dropout of the GNN
  --dropout_vae DROPOUT_VAE
                        dropout of the VAE
  --lr_gnn LR_GNN       learning rate
  --lr_vae LR_VAE       learning rate
  --graph_alpha GRAPH_ALPHA
                        Coeficient for graph loss
  --kld_alpha KLD_ALPHA
                        Coeficient for KLD loss
  --ae_alpha AE_ALPHA   Coeficient for AE loss
  --scg_alpha SCG_ALPHA
                        Coeficient for SCG loss
  --clusteringalgo CLUSTERINGALGO
                        clustering algorithm: vamb, kmeans
  --kclusters KCLUSTERS
                        Number of clusters (only for some clustering methods)
  --aggtype AGGTYPE     Aggregation type for GraphSAGE (mean, pool, lstm, gcn)
  --decoder_input DECODER_INPUT
                        What to use for input to the decoder
  --vaepretrain VAEPRETRAIN
                        How many epochs to pretrain VAE
  --ae_only             Do not use GNN (ae model must be used and decoder input must be ae
  --negatives NEGATIVES
                        Number of negatives to train GraphSAGE
  --quick               Reduce number of nodes to run quicker
  --classify            Run classification instead of clustering
  --fanout FANOUT       Fan out, number of positive neighbors sampled at each level
  --epoch EPOCH         Number of epochs to train model
  --print PRINT         Print interval during training
  --evalepochs EVALEPOCHS
                        Epoch interval to run eval
  --evalskip EVALSKIP   Skip eval of these epochs
  --eval_split EVAL_SPLIT
                        Percentage of dataset to use for eval
  --kmer KMER
  --rawfeatures         Use raw features
  --clusteringloss      Train with clustering loss
  --targetmetric TARGETMETRIC
                        Metric to pick best epoch
  --concatfeatures      Concat learned and original features before clustering
  --no_loss_weights     Using edge weights for loss (positive only)
  --no_sample_weights   Using edge weights to sample negatives
  --early_stopping EARLY_STOPPING
                        Stop training if delta between last two losses is less than this
  --nruns NRUNS         Number of runs
  --mincontig MINCONTIG
                        Minimum size of input contigs
  --minbin MINBIN       Minimum size of clusters in bp
  --mincomp MINCOMP     Minimum size of connected components
  --randomize           Randomize graph
  --labelgraph          Create graph based on labels (ignore assembly graph)
  --binarize            Binarize adj matrix
  --noedges             Remove all but self edges from adj matrix
  --read_embs           Read embeddings from file
  --reload              Reload data
  --markers MARKERS     File with precomputed checkm results to eval. If not found, it will assume it does not exist.
  --post POST           Output options
  --writebins           Write bins to fasta files
  --skip_preclustering  Use precomputed checkm results to eval
  --outname OUTNAME, --outputname OUTNAME
                        Output (experiment) name
  --cuda                Use gpu
  --noise               Use noise generator
  --savemodel           Save best model to disk
  --tsne                Plot tsne at checkpoints
  --numcores NUMCORES   Number of cores to use
  --outdir OUTDIR, --outputdir OUTDIR
                        Output dir (same as input assembly dir if not defined
  --assembly_type ASSEMBLY_TYPE
                        flye or spades
  --contignodes         Use contigs as nodes instead of edges
  --seed SEED           Set seed
  --quiet, -q           Do not output epoch progress
  --read_cache          Do not check assembly files, read cached files only
  --version, -v         Print version and exit
  --loglevel LOGLEVEL, -l LOGLEVEL
                        Log level

アセンブリ後のデータを使う場合は、デモデータが用意されていたのでこちらからダウンロードしてください。

以降に記載するmetaflyeによるアセンブルからjgi_summarize_bam_contig_depthsでdepthの計算の工程を済ませたONTのメタゲノムのデータになります。

https://zenodo.org/records/6122610

ダウンロードしたデータを使ってgrahmbの実行

graphmb --assembly ./ --outdir results/ --cuda --markers marker_gene_stats.tsv --seed 1234

↓標準出力↓

logging to results/20240209-003659graphmb_output.log
Running GraphMB 0.2.5
using cuda: True
setting seed to 1234
setting tf seed
Reading cache from
==============================
DATASET STATS:
number of sequences: 1150
assembly length: 0.171 Gb
assembly N50: 1.754 Mb
assembly average length (Mb): 0.149 max: 9.224 min: 0.001
coverage samples: 1
Graph file found and read
graph edges: 1270
contig paths: 1076
total ref markers sets: 58
total ref markers: 104
contigs with one or more markers: 215/1150
max SCGs on one contig: 104, average(excluding 0): 19.926
candidate k0s [44, 45, 46, 47, 48, 49, 50]
SCG contig count min: 29 contigs
edges with overlapping scgs (max=20): [(1, 8), (7, 2), (3, 2), (4, 2)]
==============================
loading features from ./features.tsv
loaded 1150 features/ 1150 nodes from tsv
RUN 0
setting tf seed
edges with overlapping scgs (max=20): [(1, 8), (7, 2), (3, 2), (4, 2)]
deleted 14 edges with same SCGs
**** Num of edges: 2378
logging to mlflow
******* Running model: gcn **********
***** using edge weights: True ******
***** using disconnected: True ******
***** concat features: True *****
***** cluster markers only: False *****
***** threshold adj matrix: False *****
***** self edges only: False *****
***** Using raw kmer+abund features: False
***** SCG neg pairs: (8799, 2)
***** input features dimension: 32
***** Nodes used for clustering: 1150
>>> Pre train stats: {'precision': 1.0, 'recall': 0.07478260869565218, 'f1': 0.13915857605177992, 'ari': 0, 'hq': 31, 'mq': 31, 'n_clusters': 544, 'unresolved_mags': 8, 'hq_comp': 98.16108807766206, 'hq_cont': 0.27808676307007785, 'unresolved_contigs': 935, 'unresolved_contigs_with_scgs': 120}
*** Model input dim 32, GNN input dim 32
*** output clustering dim 32
[graphmb 3l] L=0.668 D=0.702 HQ=17  Besthq=23  Best Epoch=59 GPU=53.9MB: 100%|█████████████████████| 500/500 [01:36<00:00,  5.19it/s]
>>> best epoch all contigs: 20 : {'precision': 1.0, 'recall': 0.12173913043478261, 'f1': 0.2170542635658915, 'ari': 0, 'hq': 23, 'mq': 25, 'n_clusters': 347, 'unresolved_mags': 16, 'hq_comp': 98.1586479487529, 'hq_cont': 0.22488755622188905, 'unresolved_contigs': 943, 'unresolved_contigs_with_scgs': 148, 'epoch': 499} <<<
>>> best epoch: 20 : {'precision': 1.0, 'recall': 0.12173913043478261, 'f1': 0.2170542635658915, 'ari': 0, 'hq': 23, 'mq': 25, 'n_clusters': 347, 'unresolved_mags': 16, 'hq_comp': 98.1586479487529, 'hq_cont': 0.22488755622188905, 'unresolved_contigs': 943, 'unresolved_contigs_with_scgs': 148, 'epoch': 59} <<<
### writing best and last embs to results/

ここでPdb(デバッガ)が起動していしまいました。謎。。

一応、全く別のデータ使ってアセンブルから実行してみる。

アセンブリグラフとコンティグにマップされるリードのdepthを計算する時に使用するツールの仮想環境の作成

以下のツールをひとまとめにしたcondaの仮想環境を作成します。

flye (github)
CheckM (github)
samtools (github)
minimap2 (github)
MetaBat2のjgi_summarize_bam_contig_depths (MetaBat2 HP)

mamba create -n other_graphmb_env python=3.9 numpy matplotlib pysam hmmer prodigal pplacer checkm-genome flye samtools minimap2 metabat2

USAGE CheckM

usage: checkm {data,tree,tree_qa,lineage_set,taxon_list,taxon_set,analyze,qa,lineage_wf,taxonomy_wf,gc_plot,coding_plot,tetra_plot,dist_plot,gc_bias_plot,nx_plot,len_hist,marker_plot,unbinned,coverage,tetra,profile,ssu_finder,merge,outliers,modify,unique,test} ...
checkm: error: unrecognized arguments: --help

usage: checkm taxonomy_wf [-h] [--ali] [--nt] [-g] [--individual_markers] [--skip_adj_correction] [--skip_pseudogene_correction] [--aai_strain AAI_STRAIN] [-a ALIGNMENT_FILE] [--ignore_thresholds] [-e E_VALUE] [-l LENGTH] [-c COVERAGE_FILE] [-f FILE] [--tab_table]
                          [-x EXTENSION] [-t THREADS] [-q] [--tmpdir TMPDIR]
                          {life,domain,phylum,class,order,family,genus,species} taxon bin_input output_dir

Runs taxon_set, analyze, qa

positional arguments:
  {life,domain,phylum,class,order,family,genus,species}
                        taxonomic rank
  taxon                 taxon of interest
  bin_input             directory containing bins (fasta format) or path to file describing genomes/genes - tab separated in 2 or 3 columns [genome ID, genome fna, genome translation file (pep)]
  output_dir            directory to write output files

optional arguments:
  -h, --help            show this help message and exit
  --ali                 generate HMMER alignment file for each bin
  --nt                  generate nucleotide gene sequences for each bin
  -g, --genes           bins contain genes as amino acids instead of nucleotide contigs
  --individual_markers  treat marker as independent (i.e., ignore co-located set structure)
  --skip_adj_correction
                        do not exclude adjacent marker genes when estimating contamination
  --skip_pseudogene_correction
                        skip identification and filtering of pseudogenes
  --aai_strain AAI_STRAIN
                        AAI threshold used to identify strain heterogeneity (default: 0.9)
  -a, --alignment_file ALIGNMENT_FILE
                        produce file showing alignment of multi-copy genes and their AAI identity
  --ignore_thresholds   ignore model-specific score thresholds
  -e, --e_value E_VALUE
                        e-value cut off (default: 1e-10)
  -l, --length LENGTH   percent overlap between target and query (default: 0.7)
  -c, --coverage_file COVERAGE_FILE
                        file containing coverage of each sequence; coverage information added to table type 2 (see coverage command)
  -f, --file FILE       print results to file (default: stdout)
  --tab_table           print tab-separated values table
  -x, --extension EXTENSION
                        extension of bins (other files in directory are ignored) (default: fna)
  -t, --threads THREADS
                        number of threads (default: 1)
  -q, --quiet           suppress console output
  --tmpdir TMPDIR       specify an alternative directory for temporary files

Example: checkm taxonomy_wf domain Bacteria ./bins ./output

usage: checkm qa [-h] [-o {1,2,3,4,5,6,7,8,9}] [--exclude_markers EXCLUDE_MARKERS] [--individual_markers] [--skip_adj_correction] [--skip_pseudogene_correction] [--aai_strain AAI_STRAIN] [-a ALIGNMENT_FILE] [--ignore_thresholds] [-e E_VALUE] [-l LENGTH] [-c COVERAGE_FILE]
                 [-f FILE] [--tab_table] [-t THREADS] [-q] [--tmpdir TMPDIR]
                 marker_file analyze_dir

Assess bins for contamination and completeness.

positional arguments:
  marker_file           marker file specified during analyze command
  analyze_dir           directory specified during analyze command

optional arguments:
  -h, --help            show this help message and exit
  -o, --out_format {1,2,3,4,5,6,7,8,9}
                        desired output: (default: 1)
                          1. summary of bin completeness and contamination
                          2. extended summary of bin statistics (includes GC, genome size, ...)
                          3. summary of bin quality for increasingly basal lineage-specific marker sets
                          4. list of marker genes and their counts
                          5. list of bin id, marker gene id, gene id
                          6. list of marker genes present multiple times in a bin
                          7. list of marker genes present multiple times on the same scaffold
                          8. list indicating position of each marker gene within a bin
                          9. FASTA file of marker genes identified in each bin
  --exclude_markers EXCLUDE_MARKERS
                        file specifying markers to exclude from marker sets
  --individual_markers  treat marker as independent (i.e., ignore co-located set structure)
  --skip_adj_correction
                        do not exclude adjacent marker genes when estimating contamination
  --skip_pseudogene_correction
                        skip identification and filtering of pseudogenes
  --aai_strain AAI_STRAIN
                        AAI threshold used to identify strain heterogeneity (default: 0.9)
  -a, --alignment_file ALIGNMENT_FILE
                        produce file showing alignment of multi-copy genes and their AAI identity
  --ignore_thresholds   ignore model-specific score thresholds
  -e, --e_value E_VALUE
                        e-value cut off (default: 1e-10)
  -l, --length LENGTH   percent overlap between target and query (default: 0.7)
  -c, --coverage_file COVERAGE_FILE
                        file containing coverage of each sequence; coverage information added to table type 2 (see coverage command)
  -f, --file FILE       print results to file (default: stdout)
  --tab_table           print tab-separated values table
  -t, --threads THREADS
                        number of threads (default: 1)
  -q, --quiet           suppress console output
  --tmpdir TMPDIR       specify an alternative directory for temporary files

Note: lineage_wf and taxonomy_wf produce a marker file in the specified output directory. The
        lineage workflow produced a marker file called lineage.ms, while the taxonomy workflow
        produces a marker file called <taxon>.ms (e.g. Bacteria.ms).

Example: checkm qa ./output/lineage.ms ./output

USAGE flye

usage: flye (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw |
             --nano-corr | --nano-hq ) file1 [file_2 ...]
             --out-dir PATH

             [--genome-size SIZE] [--threads int] [--iterations int]
             [--meta] [--polish-target] [--min-overlap SIZE]
             [--keep-haplotypes] [--debug] [--version] [--help] 
             [--scaffold] [--resume] [--resume-from] [--stop-after] 
             [--read-error float] [--extra-params] 
             [--deterministic]

Assembly of long reads with repeat graphs

optional arguments:
  -h, --help            show this help message and exit
  --pacbio-raw path [path ...]
                        PacBio regular CLR reads (<20% error)
  --pacbio-corr path [path ...]
                        PacBio reads that were corrected with other methods (<3% error)
  --pacbio-hifi path [path ...]
                        PacBio HiFi reads (<1% error)
  --nano-raw path [path ...]
                        ONT regular reads, pre-Guppy5 (<20% error)
  --nano-corr path [path ...]
                        ONT reads that were corrected with other methods (<3% error)
  --nano-hq path [path ...]
                        ONT high-quality reads: Guppy5+ SUP or Q20 (<5% error)
  --subassemblies path [path ...]
                        [deprecated] high-quality contigs input
  -g size, --genome-size size
                        estimated genome size (for example, 5m or 2.6g)
  -o path, --out-dir path
                        Output directory
  -t int, --threads int
                        number of parallel threads [1]
  -i int, --iterations int
                        number of polishing iterations [1]
  -m int, --min-overlap int
                        minimum overlap between reads [auto]
  --asm-coverage int    reduced coverage for initial disjointig assembly [not set]
  --hifi-error float    [deprecated] same as --read-error
  --read-error float    adjust parameters for given read error rate (as fraction e.g. 0.03)
  --extra-params extra_params
                        extra configuration parameters list (comma-separated)
  --plasmids            unused (retained for backward compatibility)
  --meta                metagenome / uneven coverage mode
  --keep-haplotypes     do not collapse alternative haplotypes
  --no-alt-contigs      do not output contigs representing alternative haplotypes
  --scaffold            enable scaffolding using graph [disabled by default]
  --trestle             [deprecated] enable Trestle [disabled by default]
  --polish-target path  run polisher on the target sequence
  --resume              resume from the last completed stage
  --resume-from stage_name
                        resume from a custom stage
  --stop-after stage_name
                        stop after the specified stage completed
  --debug               enable debug output
  -v, --version         show program's version number and exit
  --deterministic       perform disjointig assembly single-threaded

Input reads can be in FASTA or FASTQ format, uncompressed
or compressed with gz. Currently, PacBio (CLR, HiFi, corrected)
and ONT reads (regular, HQ, corrected) are supported. Expected error rates are
<15% for PB CLR/regular ONT; <5% for ONT HQ, <3% for corrected, and <1% for HiFi. Note that Flye
was primarily developed to run on uncorrected reads. You may specify multiple
files with reads (separated by spaces). Mixing different read
types is not yet supported. The --meta option enables the mode
for metagenome/uneven coverage assembly.

To reduce memory consumption for large genome assemblies,
you can use a subset of the longest reads for initial disjointig
assembly by specifying --asm-coverage and --genome-size options. Typically,
40x coverage is enough to produce good disjointigs.

You can run Flye polisher as a standalone tool using
--polish-target option.

USAGE samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 1.19.2 (using htslib 1.19.1)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     fqidx          index/extract FASTQ
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates
     ampliconclip   clip oligos from the end of reads

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     consensus      produce a consensus Pileup/FASTA/FASTQ
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA
     import         Converts FASTA or FASTQ files to SAM/BAM/CRAM
     reference      Generates a reference from aligned data
     reset          Reverts aligner changes in reads

  -- Statistics
     bedcov         read depth per BED region
     coverage       alignment depth and percent coverage
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     cram-size      list CRAM Content-ID and Data-Series sizes
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)
     ampliconstats  generate amplicon specific stats

  -- Viewing
     flags          explain BAM flags
     head           header viewer
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM
     samples        list the samples in a set of SAM/BAM/CRAM files

  -- Misc
     help [cmd]     display this help message or help for [cmd]
     version        detailed version information

USAGE minimap2

Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]
Options:
  Indexing:
    -H           use homopolymer-compressed k-mer (preferrable for PacBio)
    -k INT       k-mer size (no larger than 28) [15]
    -w INT       minimizer window size [10]
    -I NUM       split index for every ~NUM input bases [8G]
    -d FILE      dump index to FILE []
  Mapping:
    -f FLOAT     filter out top FLOAT fraction of repetitive minimizers [0.0002]
    -g NUM       stop chain enlongation if there are no minimizers in INT-bp [5000]
    -G NUM       max intron length (effective with -xsplice; changing -r) [200k]
    -F NUM       max fragment length (effective with -xsr or in the fragment mode) [800]
    -r NUM[,NUM] chaining/alignment bandwidth and long-join bandwidth [500,20000]
    -n INT       minimal number of minimizers on a chain [3]
    -m INT       minimal chaining score (matching bases minus log gap penalty) [40]
    -X           skip self and dual mappings (for the all-vs-all mode)
    -p FLOAT     min secondary-to-primary score ratio [0.8]
    -N INT       retain at most INT secondary alignments [5]
  Alignment:
    -A INT       matching score [2]
    -B INT       mismatch penalty (larger value for lower divergence) [4]
    -O INT[,INT] gap open penalty [4,24]
    -E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]
    -z INT[,INT] Z-drop score and inversion Z-drop score [400,200]
    -s INT       minimal peak DP alignment score [80]
    -u CHAR      how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]
    -J INT       splice mode. 0: original minimap2 model; 1: miniprot model [1]
  Input/Output:
    -a           output in the SAM format (PAF by default)
    -o FILE      output alignments to FILE [stdout]
    -L           write CIGAR with >65535 ops at the CG tag
    -R STR       SAM read group line in a format like '@RG\tID:foo\tSM:bar' []
    -c           output CIGAR in PAF
    --cs[=STR]   output the cs tag; STR is 'short' (if absent) or 'long' [none]
    --MD         output the MD tag
    --eqx        write =/X CIGAR operators
    -Y           use soft clipping for supplementary alignments
    -t INT       number of threads [3]
    -K NUM       minibatch size for mapping [500M]
    --version    show version number
  Preset:
    -x STR       preset (always applied before other options; see minimap2.1 for details) []
                 - map-pb/map-ont - PacBio CLR/Nanopore vs reference mapping
                 - map-hifi - PacBio HiFi reads vs reference mapping
                 - ava-pb/ava-ont - PacBio/Nanopore read overlap
                 - asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence
                 - splice/splice:hq - long-read/Pacbio-CCS spliced alignment
                 - sr - genomic short-read mapping

See `man ./minimap2.1' for detailed description of these and other advanced command-line options.

USAGE MaxBin2のjgi_summarize_bam_contig_depths

jgi_summarize_bam_contig_depths 2.15 (Bioconda) 2020-07-03T13:02:15
Usage: jgi_summarize_bam_contig_depths <options> sortedBam1 [ sortedBam2 ...]
where options include:
        --outputDepth       arg  The file to put the contig by bam depth matrix (default: STDOUT)
        --percentIdentity   arg  The minimum end-to-end % identity of qualifying reads (default: 97)
        --pairedContigs     arg  The file to output the sparse matrix of contigs which paired reads span (default: none)
        --unmappedFastq     arg  The prefix to output unmapped reads from each bam file suffixed by 'bamfile.bam.fastq.gz'
        --noIntraDepthVariance   Do not include variance from mean depth along the contig
        --showDepth              Output a .depth file per bam for each contig base
        --minMapQual        arg  The minimum mapping quality necessary to count the read as mapped (default: 0)
        --weightMapQual     arg  Weight per-base depth based on the MQ of the read (i.e uniqueness) (default: 0.0 (disabled))
        --includeEdgeBases       When calculating depth & variance, include the 1-readlength edges (off by default)
        --maxEdgeBases           When calculating depth & variance, and not --includeEdgeBases, the maximum length (default:75)
        --referenceFasta    arg  The reference file.  (It must be the same fasta that bams used)

Options that require a --referenceFasta
        --outputGC          arg  The file to print the gc coverage histogram
        --gcWindow          arg  The sliding window size for GC calculations
        --outputReadStats   arg  The file to print the per read statistics
        --outputKmers       arg  The file to print the perfect kmer counts

Options to control shredding contigs that are under represented by the reads
        --shredLength       arg  The maximum length of the shreds
        --shredDepth        arg  The depth to generate overlapping shreds
        --minContigLength   arg  The mimimum length of contig to include for mapping and shredding
        --minContigDepth    arg  The minimum depth along contig at which to break the contig

CheckMの準備

CheckMにはデータベースが必要です。データベースをダウンロードします。

#ダウンロード
aria2c -x 16 -ctrue https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz

#解凍
mkdir checkm_data_2015_01_16
tar -xvf checkm_data_2015_01_16.tar.gz -C checkm_data_2015_01_16

#データベースの場所の指定
checkm data setRoot checkm_data_2015_01_16

確認

checkm

解析に使用するデータの準備

PacBioのデータを解析対象とします。クラウドサーバーからダウロード。

aria2c -x 16 -ctrue https://downloads.pacbcloud.com/public/dataset/Sequel-IIe-202104/metagenomics/m64011_210225_094432.hifi_reads.fastq.tar.gz

解凍します。

tar -xvf m64011_210225_094432.hifi_reads.fastq.tar.gz

metaflyeによるde novo assembly

metaflyeの実行

flye \
--pacbio-hifi m64011_210224_000525.hifi_reads.fastq \
-o 02metaFlye/ \
--meta \
-t 32

いくつかのファイルが出力されますが、.gfaがアセンブリグラフファイルです。

アセンブリグラフファイルからfastaを生成する前に、metaflyeから出力されたassembly.fastaをcontig.fastaにリネーム

cd 02metaFlye/
mv assembly.fasta contigs.fasta

アセンブリグラフファイルからfastaを生成

awk '/^S/{print ">"$2"\n"$3}' assembly_graph.gfa | fold > assembly.fasta

各contig配列を別ファイルに出力

mkdir edges
cd edges; cat ../assembly.fasta | awk '{ if (substr($0, 1, 1)==">") {filename=(substr($0,2) ".fa")} print $0 > filename }'; cd ..

エラーが出たらulimit -n 100000を試してみてください。同時に表示できるファイル数に制限がかかっている場合があります。

スペースをアンダーバーに置換

find edges/ -name "* *" -type f | rename 's/ /_/g'

checkmの実行

# evaluate edges
checkm taxonomy_wf -x fa domain Bacteria edges/ checkm_edges/ --threads 32

checkm qa checkm_edges/Bacteria.ms checkm_edges/ -f checkm_edges_polished_results.txt --tab_table -o 2

Contigへのread mappingとdepthの算出

depthを算出するためにcontigにリードをmappingする

minimap2 -I 64GB -d assembly.mmi assembly.fasta # make index
minimap2 -I 64GB -ax map-ont assembly.mmi ../../LRBinner_20240203/m64011_210224_000525.hifi_reads.fastq > assembly.sam

depthを算出

samtools sort -@ 32 assembly.sam -o assembly.bam
jgi_summarize_bam_contig_depths --outputDepth assembly_depth.txt assembly.bam

GraphMBでビニングする準備が整いました。

GraphMBビニング

cuda core搭載のNVIDIAのGPUがある場合--cudaをつけて実行する

graphmb --assembly ./ --outdir results/ --cuda --markers checkm_edges/storage/marker_gene_stats.tsv  --seed 1234

エラー発生。

TypeError: object of type 'numpy.float64' has no len()

cudaを使用しなくても同様のエラーで終了する。numpyのversionをいくつか試して見たが解決せず。。

一向に解決しないので終了。

Infomation

自作PC1