GraphMB: a binning tool that incorporates the assembly graph into its learning model
Introduction
Despite advances in sequencing technology and assembly methods, recovering high-quality microbial genomes from metagenomic samples remains a costly task.
Current metagenomic binning tools do not make full use of the assembly graph, and few of them are optimized for long-read assemblies.
Deep graph learning algorithms are used in other fields as a technique for handling complex graph-structured data.
Applying deep learning that integrates contig features with the graph structure produced in the assembly step to the binning step makes it possible to obtain better bins.
(See the GraphMB abstract.)
GraphMB
GraphMB incorporates the assembly graph into the contig feature-learning process and trains a neural network to give more weight to edges with higher coverage.
For training it uses a GNN (graph neural network), a type of deep learning model that can learn representations of graph nodes.
Install
mamba create -n graphmb -c conda-forge make cmake libgcc python=3.8 pip tensorflow numpy
conda activate graphmb
pip install graphmb
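To quickly confirm that the installation worked, the --version flag listed in the usage output below can be used; this is just a sanity check, not part of the binning workflow.
graphmb --version   # should print the installed version (0.2.5 in the run below) and exit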
USAGE GraphMB
usage: graphmb [-h] [--assembly ASSEMBLY] [--assembly_name ASSEMBLY_NAME] [--graph_file GRAPH_FILE]
[--edge_threshold EDGE_THRESHOLD] [--depth DEPTH] [--features FEATURES] [--labels LABELS] [--embs EMBS]
[--model_name MODEL_NAME] [--activation ACTIVATION] [--layers_vae LAYERS_VAE] [--layers_gnn LAYERS_GNN]
[--hidden_gnn HIDDEN_GNN] [--hidden_vae HIDDEN_VAE] [--embsize_gnn EMBSIZE_GNN] [--embsize_vae EMBSIZE_VAE]
[--batchsize BATCHSIZE] [--batchtype BATCHTYPE] [--dropout_gnn DROPOUT_GNN] [--dropout_vae DROPOUT_VAE]
[--lr_gnn LR_GNN] [--lr_vae LR_VAE] [--graph_alpha GRAPH_ALPHA] [--kld_alpha KLD_ALPHA] [--ae_alpha AE_ALPHA]
[--scg_alpha SCG_ALPHA] [--clusteringalgo CLUSTERINGALGO] [--kclusters KCLUSTERS] [--aggtype AGGTYPE]
[--decoder_input DECODER_INPUT] [--vaepretrain VAEPRETRAIN] [--ae_only] [--negatives NEGATIVES] [--quick]
[--classify] [--fanout FANOUT] [--epoch EPOCH] [--print PRINT] [--evalepochs EVALEPOCHS] [--evalskip EVALSKIP]
[--eval_split EVAL_SPLIT] [--kmer KMER] [--rawfeatures] [--clusteringloss] [--targetmetric TARGETMETRIC]
[--concatfeatures] [--no_loss_weights] [--no_sample_weights] [--early_stopping EARLY_STOPPING] [--nruns NRUNS]
[--mincontig MINCONTIG] [--minbin MINBIN] [--mincomp MINCOMP] [--randomize] [--labelgraph] [--binarize] [--noedges]
[--read_embs] [--reload] [--markers MARKERS] [--post POST] [--writebins] [--skip_preclustering] [--outname OUTNAME]
[--cuda] [--noise] [--savemodel] [--tsne] [--numcores NUMCORES] [--outdir OUTDIR] [--assembly_type ASSEMBLY_TYPE]
[--contignodes] [--seed SEED] [--quiet] [--read_cache] [--version] [--loglevel LOGLEVEL]
Train graph embedding model
options:
-h, --help show this help message and exit
--assembly ASSEMBLY Assembly base path
--assembly_name ASSEMBLY_NAME
File name with contigs
--graph_file GRAPH_FILE
File name with graph
--edge_threshold EDGE_THRESHOLD
Remove edges with weight lower than this (keep only >=)
--depth DEPTH Depth file from jgi
--features FEATURES Features file mapping contig name to features
--labels LABELS File mapping contig to label
--embs EMBS No train, load embs
--model_name MODEL_NAME
One of the implemented models: gcn, gat, sage, sage_lstm, _ccvae variation
--activation ACTIVATION
Activation function to use(relu, prelu, sigmoid, tanh)
--layers_vae LAYERS_VAE
Number of layers of the VAE
--layers_gnn LAYERS_GNN
Number of layers of the GNN
--hidden_gnn HIDDEN_GNN
Dimension of hidden layers of GNN
--hidden_vae HIDDEN_VAE
Dimension of hidden layers of VAE
--embsize_gnn EMBSIZE_GNN, --zg EMBSIZE_GNN
Output embedding dimension of GNN
--embsize_vae EMBSIZE_VAE, --zl EMBSIZE_VAE
Output embedding dimension of VAE
--batchsize BATCHSIZE
batchsize to train the VAE
--batchtype BATCHTYPE
Batch type, nodes or edges
--dropout_gnn DROPOUT_GNN
dropout of the GNN
--dropout_vae DROPOUT_VAE
dropout of the VAE
--lr_gnn LR_GNN learning rate
--lr_vae LR_VAE learning rate
--graph_alpha GRAPH_ALPHA
Coeficient for graph loss
--kld_alpha KLD_ALPHA
Coeficient for KLD loss
--ae_alpha AE_ALPHA Coeficient for AE loss
--scg_alpha SCG_ALPHA
Coeficient for SCG loss
--clusteringalgo CLUSTERINGALGO
clustering algorithm: vamb, kmeans
--kclusters KCLUSTERS
Number of clusters (only for some clustering methods)
--aggtype AGGTYPE Aggregation type for GraphSAGE (mean, pool, lstm, gcn)
--decoder_input DECODER_INPUT
What to use for input to the decoder
--vaepretrain VAEPRETRAIN
How many epochs to pretrain VAE
--ae_only Do not use GNN (ae model must be used and decoder input must be ae
--negatives NEGATIVES
Number of negatives to train GraphSAGE
--quick Reduce number of nodes to run quicker
--classify Run classification instead of clustering
--fanout FANOUT Fan out, number of positive neighbors sampled at each level
--epoch EPOCH Number of epochs to train model
--print PRINT Print interval during training
--evalepochs EVALEPOCHS
Epoch interval to run eval
--evalskip EVALSKIP Skip eval of these epochs
--eval_split EVAL_SPLIT
Percentage of dataset to use for eval
--kmer KMER
--rawfeatures Use raw features
--clusteringloss Train with clustering loss
--targetmetric TARGETMETRIC
Metric to pick best epoch
--concatfeatures Concat learned and original features before clustering
--no_loss_weights Using edge weights for loss (positive only)
--no_sample_weights Using edge weights to sample negatives
--early_stopping EARLY_STOPPING
Stop training if delta between last two losses is less than this
--nruns NRUNS Number of runs
--mincontig MINCONTIG
Minimum size of input contigs
--minbin MINBIN Minimum size of clusters in bp
--mincomp MINCOMP Minimum size of connected components
--randomize Randomize graph
--labelgraph Create graph based on labels (ignore assembly graph)
--binarize Binarize adj matrix
--noedges Remove all but self edges from adj matrix
--read_embs Read embeddings from file
--reload Reload data
--markers MARKERS File with precomputed checkm results to eval. If not found, it will assume it does not exist.
--post POST Output options
--writebins Write bins to fasta files
--skip_preclustering Use precomputed checkm results to eval
--outname OUTNAME, --outputname OUTNAME
Output (experiment) name
--cuda Use gpu
--noise Use noise generator
--savemodel Save best model to disk
--tsne Plot tsne at checkpoints
--numcores NUMCORES Number of cores to use
--outdir OUTDIR, --outputdir OUTDIR
Output dir (same as input assembly dir if not defined
--assembly_type ASSEMBLY_TYPE
flye or spades
--contignodes Use contigs as nodes instead of edges
--seed SEED Set seed
--quiet, -q Do not output epoch progress
--read_cache Do not check assembly files, read cached files only
--version, -v Print version and exit
--loglevel LOGLEVEL, -l LOGLEVEL
Log level
If you want to start from post-assembly data, demo data is provided, so download it from the link here.
It is an ONT metagenome dataset for which the steps described later in this post, from assembly with metaFlye through depth calculation with jgi_summarize_bam_contig_depths, have already been completed.
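Before running GraphMB it is worth glancing at what the demo directory contains. The file names in the comments are assumptions based on the Flye workflow described later and on the log output below (features.tsv, marker_gene_stats.tsv), so adjust them to whatever the download actually provides.
ls -lh ./
# expected (assumption): an assembly fasta plus assembly_graph.gfa from metaFlye,
# a depth table from jgi_summarize_bam_contig_depths, features.tsv, and
# marker_gene_stats.tsv (the precomputed CheckM markers passed to --markers below)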
Run GraphMB on the downloaded data:
graphmb --assembly ./ --outdir results/ --cuda --markers marker_gene_stats.tsv --seed 1234
Standard output:
logging to results/20240209-003659graphmb_output.log
Running GraphMB 0.2.5
using cuda: True
setting seed to 1234
setting tf seed
Reading cache from
==============================
DATASET STATS:
number of sequences: 1150
assembly length: 0.171 Gb
assembly N50: 1.754 Mb
assembly average length (Mb): 0.149 max: 9.224 min: 0.001
coverage samples: 1
Graph file found and read
graph edges: 1270
contig paths: 1076
total ref markers sets: 58
total ref markers: 104
contigs with one or more markers: 215/1150
max SCGs on one contig: 104, average(excluding 0): 19.926
candidate k0s [44, 45, 46, 47, 48, 49, 50]
SCG contig count min: 29 contigs
edges with overlapping scgs (max=20): [(1, 8), (7, 2), (3, 2), (4, 2)]
==============================
loading features from ./features.tsv
loaded 1150 features/ 1150 nodes from tsv
RUN 0
setting tf seed
edges with overlapping scgs (max=20): [(1, 8), (7, 2), (3, 2), (4, 2)]
deleted 14 edges with same SCGs
**** Num of edges: 2378
logging to mlflow
******* Running model: gcn **********
***** using edge weights: True ******
***** using disconnected: True ******
***** concat features: True *****
***** cluster markers only: False *****
***** threshold adj matrix: False *****
***** self edges only: False *****
***** Using raw kmer+abund features: False
***** SCG neg pairs: (8799, 2)
***** input features dimension: 32
***** Nodes used for clustering: 1150
>>> Pre train stats: {'precision': 1.0, 'recall': 0.07478260869565218, 'f1': 0.13915857605177992, 'ari': 0, 'hq': 31, 'mq': 31, 'n_clusters': 544, 'unresolved_mags': 8, 'hq_comp': 98.16108807766206, 'hq_cont': 0.27808676307007785, 'unresolved_contigs': 935, 'unresolved_contigs_with_scgs': 120}
*** Model input dim 32, GNN input dim 32
*** output clustering dim 32
[graphmb 3l] L=0.668 D=0.702 HQ=17 Besthq=23 Best Epoch=59 GPU=53.9MB: 100%|█████████████████████| 500/500 [01:36<00:00, 5.19it/s]
>>> best epoch all contigs: 20 : {'precision': 1.0, 'recall': 0.12173913043478261, 'f1': 0.2170542635658915, 'ari': 0, 'hq': 23, 'mq': 25, 'n_clusters': 347, 'unresolved_mags': 16, 'hq_comp': 98.1586479487529, 'hq_cont': 0.22488755622188905, 'unresolved_contigs': 943, 'unresolved_contigs_with_scgs': 148, 'epoch': 499} <<<
>>> best epoch: 20 : {'precision': 1.0, 'recall': 0.12173913043478261, 'f1': 0.2170542635658915, 'ari': 0, 'hq': 23, 'mq': 25, 'n_clusters': 347, 'unresolved_mags': 16, 'hq_comp': 98.1586479487529, 'hq_cont': 0.22488755622188905, 'unresolved_contigs': 943, 'unresolved_contigs_with_scgs': 148, 'epoch': 59} <<<
### writing best and last embs to results/
At this point Pdb (the Python debugger) started up for some reason. A mystery.
Just in case, I tried running the whole workflow, starting from assembly, on a completely different dataset.
Creating a conda environment for the tools used to build the assembly graph and to compute the depth of reads mapped to the contigs
Create a single conda environment that bundles the following tools:
- flye (github)
- CheckM (github)
- samtools (github)
- minimap2 (github)
- jgi_summarize_bam_contig_depths from MetaBAT 2 (MetaBAT 2 homepage)
mamba create -n other_graphmb_env python=3.9 numpy matplotlib pysam hmmer prodigal pplacer checkm-genome flye samtools minimap2 metabat2
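A minimal sanity check of the new environment, assuming each conda package puts its tool on PATH; the version banners are the ones each tool prints on its own (jgi_summarize_bam_contig_depths prints its version line when run without arguments, as in the usage dump below).
conda activate other_graphmb_env
flye --version
samtools --version | head -n 1
minimap2 --version
checkm 2>&1 | head -n 3                            # top of the subcommand list
jgi_summarize_bam_contig_depths 2>&1 | head -n 1   # version banner, then usage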
USAGE CheckM
usage: checkm {data,tree,tree_qa,lineage_set,taxon_list,taxon_set,analyze,qa,lineage_wf,taxonomy_wf,gc_plot,coding_plot,tetra_plot,dist_plot,gc_bias_plot,nx_plot,len_hist,marker_plot,unbinned,coverage,tetra,profile,ssu_finder,merge,outliers,modify,unique,test} ...
checkm: error: unrecognized arguments: --help
usage: checkm taxonomy_wf [-h] [--ali] [--nt] [-g] [--individual_markers] [--skip_adj_correction] [--skip_pseudogene_correction] [--aai_strain AAI_STRAIN] [-a ALIGNMENT_FILE] [--ignore_thresholds] [-e E_VALUE] [-l LENGTH] [-c COVERAGE_FILE] [-f FILE] [--tab_table]
[-x EXTENSION] [-t THREADS] [-q] [--tmpdir TMPDIR]
{life,domain,phylum,class,order,family,genus,species} taxon bin_input output_dir
Runs taxon_set, analyze, qa
positional arguments:
{life,domain,phylum,class,order,family,genus,species}
taxonomic rank
taxon taxon of interest
bin_input directory containing bins (fasta format) or path to file describing genomes/genes - tab separated in 2 or 3 columns [genome ID, genome fna, genome translation file (pep)]
output_dir directory to write output files
optional arguments:
-h, --help show this help message and exit
--ali generate HMMER alignment file for each bin
--nt generate nucleotide gene sequences for each bin
-g, --genes bins contain genes as amino acids instead of nucleotide contigs
--individual_markers treat marker as independent (i.e., ignore co-located set structure)
--skip_adj_correction
do not exclude adjacent marker genes when estimating contamination
--skip_pseudogene_correction
skip identification and filtering of pseudogenes
--aai_strain AAI_STRAIN
AAI threshold used to identify strain heterogeneity (default: 0.9)
-a, --alignment_file ALIGNMENT_FILE
produce file showing alignment of multi-copy genes and their AAI identity
--ignore_thresholds ignore model-specific score thresholds
-e, --e_value E_VALUE
e-value cut off (default: 1e-10)
-l, --length LENGTH percent overlap between target and query (default: 0.7)
-c, --coverage_file COVERAGE_FILE
file containing coverage of each sequence; coverage information added to table type 2 (see coverage command)
-f, --file FILE print results to file (default: stdout)
--tab_table print tab-separated values table
-x, --extension EXTENSION
extension of bins (other files in directory are ignored) (default: fna)
-t, --threads THREADS
number of threads (default: 1)
-q, --quiet suppress console output
--tmpdir TMPDIR specify an alternative directory for temporary files
Example: checkm taxonomy_wf domain Bacteria ./bins ./output
usage: checkm qa [-h] [-o {1,2,3,4,5,6,7,8,9}] [--exclude_markers EXCLUDE_MARKERS] [--individual_markers] [--skip_adj_correction] [--skip_pseudogene_correction] [--aai_strain AAI_STRAIN] [-a ALIGNMENT_FILE] [--ignore_thresholds] [-e E_VALUE] [-l LENGTH] [-c COVERAGE_FILE]
[-f FILE] [--tab_table] [-t THREADS] [-q] [--tmpdir TMPDIR]
marker_file analyze_dir
Assess bins for contamination and completeness.
positional arguments:
marker_file marker file specified during analyze command
analyze_dir directory specified during analyze command
optional arguments:
-h, --help show this help message and exit
-o, --out_format {1,2,3,4,5,6,7,8,9}
desired output: (default: 1)
1. summary of bin completeness and contamination
2. extended summary of bin statistics (includes GC, genome size, ...)
3. summary of bin quality for increasingly basal lineage-specific marker sets
4. list of marker genes and their counts
5. list of bin id, marker gene id, gene id
6. list of marker genes present multiple times in a bin
7. list of marker genes present multiple times on the same scaffold
8. list indicating position of each marker gene within a bin
9. FASTA file of marker genes identified in each bin
--exclude_markers EXCLUDE_MARKERS
file specifying markers to exclude from marker sets
--individual_markers treat marker as independent (i.e., ignore co-located set structure)
--skip_adj_correction
do not exclude adjacent marker genes when estimating contamination
--skip_pseudogene_correction
skip identification and filtering of pseudogenes
--aai_strain AAI_STRAIN
AAI threshold used to identify strain heterogeneity (default: 0.9)
-a, --alignment_file ALIGNMENT_FILE
produce file showing alignment of multi-copy genes and their AAI identity
--ignore_thresholds ignore model-specific score thresholds
-e, --e_value E_VALUE
e-value cut off (default: 1e-10)
-l, --length LENGTH percent overlap between target and query (default: 0.7)
-c, --coverage_file COVERAGE_FILE
file containing coverage of each sequence; coverage information added to table type 2 (see coverage command)
-f, --file FILE print results to file (default: stdout)
--tab_table print tab-separated values table
-t, --threads THREADS
number of threads (default: 1)
-q, --quiet suppress console output
--tmpdir TMPDIR specify an alternative directory for temporary files
Note: lineage_wf and taxonomy_wf produce a marker file in the specified output directory. The
lineage workflow produced a marker file called lineage.ms, while the taxonomy workflow
produces a marker file called <taxon>.ms (e.g. Bacteria.ms).
Example: checkm qa ./output/lineage.ms ./output
USAGE flye
usage: flye (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw |
--nano-corr | --nano-hq ) file1 [file_2 ...]
--out-dir PATH
[--genome-size SIZE] [--threads int] [--iterations int]
[--meta] [--polish-target] [--min-overlap SIZE]
[--keep-haplotypes] [--debug] [--version] [--help]
[--scaffold] [--resume] [--resume-from] [--stop-after]
[--read-error float] [--extra-params]
[--deterministic]
Assembly of long reads with repeat graphs
optional arguments:
-h, --help show this help message and exit
--pacbio-raw path [path ...]
PacBio regular CLR reads (<20% error)
--pacbio-corr path [path ...]
PacBio reads that were corrected with other methods (<3% error)
--pacbio-hifi path [path ...]
PacBio HiFi reads (<1% error)
--nano-raw path [path ...]
ONT regular reads, pre-Guppy5 (<20% error)
--nano-corr path [path ...]
ONT reads that were corrected with other methods (<3% error)
--nano-hq path [path ...]
ONT high-quality reads: Guppy5+ SUP or Q20 (<5% error)
--subassemblies path [path ...]
[deprecated] high-quality contigs input
-g size, --genome-size size
estimated genome size (for example, 5m or 2.6g)
-o path, --out-dir path
Output directory
-t int, --threads int
number of parallel threads [1]
-i int, --iterations int
number of polishing iterations [1]
-m int, --min-overlap int
minimum overlap between reads [auto]
--asm-coverage int reduced coverage for initial disjointig assembly [not set]
--hifi-error float [deprecated] same as --read-error
--read-error float adjust parameters for given read error rate (as fraction e.g. 0.03)
--extra-params extra_params
extra configuration parameters list (comma-separated)
--plasmids unused (retained for backward compatibility)
--meta metagenome / uneven coverage mode
--keep-haplotypes do not collapse alternative haplotypes
--no-alt-contigs do not output contigs representing alternative haplotypes
--scaffold enable scaffolding using graph [disabled by default]
--trestle [deprecated] enable Trestle [disabled by default]
--polish-target path run polisher on the target sequence
--resume resume from the last completed stage
--resume-from stage_name
resume from a custom stage
--stop-after stage_name
stop after the specified stage completed
--debug enable debug output
-v, --version show program's version number and exit
--deterministic perform disjointig assembly single-threaded
Input reads can be in FASTA or FASTQ format, uncompressed
or compressed with gz. Currently, PacBio (CLR, HiFi, corrected)
and ONT reads (regular, HQ, corrected) are supported. Expected error rates are
<15% for PB CLR/regular ONT; <5% for ONT HQ, <3% for corrected, and <1% for HiFi. Note that Flye
was primarily developed to run on uncorrected reads. You may specify multiple
files with reads (separated by spaces). Mixing different read
types is not yet supported. The --meta option enables the mode
for metagenome/uneven coverage assembly.
To reduce memory consumption for large genome assemblies,
you can use a subset of the longest reads for initial disjointig
assembly by specifying --asm-coverage and --genome-size options. Typically,
40x coverage is enough to produce good disjointigs.
You can run Flye polisher as a standalone tool using
--polish-target option.
USAGE samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.19.2 (using htslib 1.19.1)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
fqidx index/extract FASTQ
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
markdup mark duplicates
ampliconclip clip oligos from the end of reads
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
consensus produce a consensus Pileup/FASTA/FASTQ
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
import Converts FASTA or FASTQ files to SAM/BAM/CRAM
reference Generates a reference from aligned data
reset Reverts aligner changes in reads
-- Statistics
bedcov read depth per BED region
coverage alignment depth and percent coverage
depth compute the depth
flagstat simple stats
idxstats BAM index stats
cram-size list CRAM Content-ID and Data-Series sizes
phase phase heterozygotes
stats generate stats (former bamcheck)
ampliconstats generate amplicon specific stats
-- Viewing
flags explain BAM flags
head header viewer
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
samples list the samples in a set of SAM/BAM/CRAM files
-- Misc
help [cmd] display this help message or help for [cmd]
version detailed version information
USAGE minimap2
Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]
Options:
Indexing:
-H use homopolymer-compressed k-mer (preferrable for PacBio)
-k INT k-mer size (no larger than 28) [15]
-w INT minimizer window size [10]
-I NUM split index for every ~NUM input bases [8G]
-d FILE dump index to FILE []
Mapping:
-f FLOAT filter out top FLOAT fraction of repetitive minimizers [0.0002]
-g NUM stop chain enlongation if there are no minimizers in INT-bp [5000]
-G NUM max intron length (effective with -xsplice; changing -r) [200k]
-F NUM max fragment length (effective with -xsr or in the fragment mode) [800]
-r NUM[,NUM] chaining/alignment bandwidth and long-join bandwidth [500,20000]
-n INT minimal number of minimizers on a chain [3]
-m INT minimal chaining score (matching bases minus log gap penalty) [40]
-X skip self and dual mappings (for the all-vs-all mode)
-p FLOAT min secondary-to-primary score ratio [0.8]
-N INT retain at most INT secondary alignments [5]
Alignment:
-A INT matching score [2]
-B INT mismatch penalty (larger value for lower divergence) [4]
-O INT[,INT] gap open penalty [4,24]
-E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]
-z INT[,INT] Z-drop score and inversion Z-drop score [400,200]
-s INT minimal peak DP alignment score [80]
-u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]
-J INT splice mode. 0: original minimap2 model; 1: miniprot model [1]
Input/Output:
-a output in the SAM format (PAF by default)
-o FILE output alignments to FILE [stdout]
-L write CIGAR with >65535 ops at the CG tag
-R STR SAM read group line in a format like '@RG\tID:foo\tSM:bar' []
-c output CIGAR in PAF
--cs[=STR] output the cs tag; STR is 'short' (if absent) or 'long' [none]
--MD output the MD tag
--eqx write =/X CIGAR operators
-Y use soft clipping for supplementary alignments
-t INT number of threads [3]
-K NUM minibatch size for mapping [500M]
--version show version number
Preset:
-x STR preset (always applied before other options; see minimap2.1 for details) []
- map-pb/map-ont - PacBio CLR/Nanopore vs reference mapping
- map-hifi - PacBio HiFi reads vs reference mapping
- ava-pb/ava-ont - PacBio/Nanopore read overlap
- asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence
- splice/splice:hq - long-read/Pacbio-CCS spliced alignment
- sr - genomic short-read mapping
See `man ./minimap2.1' for detailed description of these and other advanced command-line options.
USAGE jgi_summarize_bam_contig_depths (MetaBAT 2)
jgi_summarize_bam_contig_depths 2.15 (Bioconda) 2020-07-03T13:02:15
Usage: jgi_summarize_bam_contig_depths <options> sortedBam1 [ sortedBam2 ...]
where options include:
--outputDepth arg The file to put the contig by bam depth matrix (default: STDOUT)
--percentIdentity arg The minimum end-to-end % identity of qualifying reads (default: 97)
--pairedContigs arg The file to output the sparse matrix of contigs which paired reads span (default: none)
--unmappedFastq arg The prefix to output unmapped reads from each bam file suffixed by 'bamfile.bam.fastq.gz'
--noIntraDepthVariance Do not include variance from mean depth along the contig
--showDepth Output a .depth file per bam for each contig base
--minMapQual arg The minimum mapping quality necessary to count the read as mapped (default: 0)
--weightMapQual arg Weight per-base depth based on the MQ of the read (i.e uniqueness) (default: 0.0 (disabled))
--includeEdgeBases When calculating depth & variance, include the 1-readlength edges (off by default)
--maxEdgeBases When calculating depth & variance, and not --includeEdgeBases, the maximum length (default:75)
--referenceFasta arg The reference file. (It must be the same fasta that bams used)
Options that require a --referenceFasta
--outputGC arg The file to print the gc coverage histogram
--gcWindow arg The sliding window size for GC calculations
--outputReadStats arg The file to print the per read statistics
--outputKmers arg The file to print the perfect kmer counts
Options to control shredding contigs that are under represented by the reads
--shredLength arg The maximum length of the shreds
--shredDepth arg The depth to generate overlapping shreds
--minContigLength arg The mimimum length of contig to include for mapping and shredding
--minContigDepth arg The minimum depth along contig at which to break the contig
Setting up CheckM
CheckM requires a reference database, so download it.
# download
aria2c -x 16 -ctrue https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
# extract
mkdir checkm_data_2015_01_16
tar -xvf checkm_data_2015_01_16.tar.gz -C checkm_data_2015_01_16
# set the database location
checkm data setRoot checkm_data_2015_01_16
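Alternatively, recent CheckM versions also read the database location from the CHECKM_DATA_PATH environment variable (assumption: your CheckM build honours it); either method should work.
export CHECKM_DATA_PATH=$(pwd)/checkm_data_2015_01_16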
Check that it works:
checkm
Preparing the data for the analysis
We will analyze PacBio data here. Download it from PacBio's public cloud server:
aria2c -x 16 -ctrue https://downloads.pacbcloud.com/public/dataset/Sequel-IIe-202104/metagenomics/m64011_210225_094432.hifi_reads.fastq.tar.gz
Extract it:
tar -xvf m64011_210225_094432.hifi_reads.fastq.tar.gz
De novo assembly with metaFlye
Run metaFlye:
flye \
--pacbio-hifi m64011_210224_000525.hifi_reads.fastq \
-o 02metaFlye/ \
--meta \
-t 32
Several files are produced; the .gfa file is the assembly graph.
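The relevant outputs can be listed directly; assembly.fasta, assembly_graph.gfa and assembly_info.txt are Flye's standard output names, so this is only a convenience check.
ls 02metaFlye/
# assembly.fasta       polished contigs (renamed to contigs.fasta below)
# assembly_graph.gfa   assembly graph, used below to extract edge sequences
# assembly_info.txt    per-contig stats (length, coverage, circular/repeat flags)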
Before generating a fasta from the assembly graph file, rename the assembly.fasta produced by metaFlye to contigs.fasta:
cd 02metaFlye/
mv assembly.fasta contigs.fasta
Generate a fasta from the assembly graph file:
awk '/^S/{print ">"$2"\n"$3}' assembly_graph.gfa | fold > assembly.fasta
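As a quick consistency check, the number of sequences in the generated fasta should match the number of S (segment) lines in the GFA, since the awk one-liner above emits one header per S line:
grep -c '^S' assembly_graph.gfa
grep -c '^>' assembly.fasta   # the two counts should be identical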
Write each contig sequence to a separate file:
mkdir edges
cd edges; cat ../assembly.fasta | awk '{ if (substr($0, 1, 1)==">") {filename=(substr($0,2) ".fa")} print $0 > filename }'; cd ..
If you get an error, try ulimit -n 100000; the number of files that can be open at the same time may be limited.
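To confirm that the split produced one file per sequence, compare the file count against the header count (plain shell, nothing tool-specific):
ls edges | wc -l
grep -c '^>' assembly.fasta   # should equal the number of files in edges/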
Replace spaces in the file names with underscores:
find edges/ -name "* *" -type f | rename 's/ /_/g'
Run CheckM
# evaluate edges
checkm taxonomy_wf -x fa domain Bacteria edges/ checkm_edges/ --threads 32
checkm qa checkm_edges/Bacteria.ms checkm_edges/ -f checkm_edges_polished_results.txt --tab_table -o 2
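GraphMB's --markers option in the run further below points at CheckM's marker_gene_stats.tsv, which the taxonomy_wf/analyze step writes into the storage/ subdirectory of its output folder; it is worth confirming the file exists before moving on (the path is the one used in the GraphMB command below):
ls -lh checkm_edges/storage/marker_gene_stats.tsv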
Read mapping to contigs and depth calculation
Map the reads to the contigs in order to calculate depth:
minimap2 -I 64GB -d assembly.mmi assembly.fasta # make index
minimap2 -I 64GB -ax map-ont assembly.mmi ../../LRBinner_20240203/m64011_210224_000525.hifi_reads.fastq > assembly.sam
Calculate the depth:
samtools sort -@ 32 assembly.sam -o assembly.bam
jgi_summarize_bam_contig_depths --outputDepth assembly_depth.txt assembly.bam
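The depth table can be spot-checked before handing it to GraphMB; jgi_summarize_bam_contig_depths writes contigName, contigLen and totalAvgDepth, followed by per-BAM depth (and variance) columns:
head -n 3 assembly_depth.txt | cut -f 1-5
Also, since the reads here are PacBio HiFi, the map-hifi preset listed in the minimap2 help above is arguably a better fit than map-ont, and minimap2 can be piped straight into samtools sort to skip the intermediate SAM. A sketch of that variant (an assumption on my part, not what was actually run above):
minimap2 -I 64GB -ax map-hifi -t 32 assembly.fasta ../../LRBinner_20240203/m64011_210224_000525.hifi_reads.fastq | samtools sort -@ 32 -o assembly.bam -
jgi_summarize_bam_contig_depths --outputDepth assembly_depth.txt assembly.bam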
Everything is now ready for binning with GraphMB.
Binning with GraphMB
If you have an NVIDIA GPU with CUDA cores, run with the --cuda flag:
graphmb --assembly ./ --outdir results/ --cuda --markers checkm_edges/storage/marker_gene_stats.tsv --seed 1234
An error occurred:
TypeError: object of type 'numpy.float64' has no len()
The run ends with the same error even without CUDA. I tried several numpy versions, but that did not fix it.
Since it showed no sign of being resolved, I stopped here.
Information
- Self-built PC #1