ゲノムのリピート領域をマスクする
はじめに
ゲノム配列を決定したらそれをリファレンスにして、SNPの解析や変異解析を実施することができます。
Pool-seqやlcWGR(Low coverage whole genome resequencing)などのリファレンス配列とのSNPの違いに着目する集団ゲノミクスではリピート配列はマスク (Nや小文字に置き換えること) してしまうことが多いです。
WGSのデータを使って解析をする予定があるのでゲノム配列のマスク処理をしてみたいと思います。
ゲノム散在リピート(Genomic interspersed repeats)
簡単にリピート配列についてまとめます。
ゲノム散在リピート(Genomic interspersed repeats)は、ゲノムにある反復的なDNA配列のことで、その位置や繰り返しの数により、機能や病原性、表現型への影響が異なります。
RepeatMaskerのHP[1]に記載されている情報を参考にすると以下のとおりです。
-
Simple Repeats
- AやCA、CGGなど、通常1~5bpの繰り返しで構成されるDNA領域
-
Tandem Repeats
- Simple repeatより複雑な100~200bpの繰り返しで構成されるDNA領域
- 染色体のセントロメア (くびれ部分) とテロメア (末端部分) で主に見られる
-
Segmental Duplications
- ゲノムの別の領域にコピーされた 10~300kbpのDNA領域
-
Interspersed Repeats
- SINES: 逆転写酵素の助けを借りてゲノムに再組み込まれた、RNAの機能しない遺伝子コピー (300bp程度)
- DNAトランスポゾン
- レトロウイルス、レトロトランスポゾン[2]
- LINES : 非LTRのレトロトランスポゾン (700bp程度)
それぞれ、目的に応じて意味のある要素となりますが、ゲノムの構造変異解析などでなければMaskするほうが効率的に解析を進めることができます。
マスク対象のリファレンスゲノムの用意
まずは公共データベースからダウンロードします (BROE01.1.fsa_nt.gz)
mkdir ref && cd ref
aria2c -x 16 -ctrue https://sra-download.ncbi.nlm.nih.gov/traces/wgs01/wgs_aux/BR/OE/BROE01/BROE01.1.fsa_nt.gz
解答する。
pigz -d BROE01.1.fsa_nt.gz
配列名の確認
grep -i ">" BROE01.1.fsa_nt
><番号> Plecoglossus altivelis DNA, <chr or sccaffold>, whole genome shotgun sequence
上記のような感じの命名になっていたので、><chr or sccaffold>
の形式にしておきます。
sed -E 's/BROE[0-9]+.[0-9] Plecoglossus altivelis DNA, ([a-z]+[0-9]+), whole.*/\1/g' BROE01.1.fsa_nt > Paa.fasta
確認
grep -i ">" Paa.fasta | head -n 30
多分、上記作業で>chr[0-9]+かscaffold[0-9]+
の形式になっているかと思います。
RepeatModer2とRepeatMaskerによるリピート領域のハードマスク
マッピング精度の向上と時短のために、リファレンスゲノムのリピート領域や複雑領域をNでマスクします。RepeatModerとRepeatMaskerはリピート領域の推定とMaskingが可能なツールです。
De novoでRepeat領域の識別
RepeatModeler2を使ってDe novoでRepeat領域の識別を行います。
RepeatModerはv2からLTR (Likelihood ratio test: 尤度比検定) が実施できるらしいのやってみる。実行にはNINJAが必要と書いてあったので導入。
git clone https://github.com/TravisWheelerLab/NINJA.git
cd NINJA
make build
export PATH=$PATH:$PWD/NINJA
USAGE Ninja
Ninja - Version 0.97-cluster_only
./Ninja --in file.fa --out file.out
Arguments:
--help (or -h) to display this help
--in (or -i) filename
--out (or -o) filename
--in_type type [a | d] (default a)
--out_type type [d | c] (default c)
--corr_type type [n | j | k | s | m]
--cluster_cutoff dist_cutoff (default 0.03)
--threads (or -T) num_threads
--version (or -v) print the software version
For more information, check the README file.
# データベースファイルの作成
BuildDatabase -name ref/Paa -engine ncbi ref/Paa.fasta
Building database ref/Paa:
Reading ref/Paa.fasta...
Number of sequences (bp) added to database: 4119 ( 448375639 bp )
RepeatModelerの実行
RepeatModeler -LTRStruct -threads 32 -engine ncbi -database ref/Paa
repeatmodeler2では使用スレッド数の指定が、-pa
から-threads
になったようです。32スレッド使って実行時間は28時間程度。
入力ファイルと同じフォルダに出力されたのは以下の3ファイルです。
-
Paa-families.fa
- fasta形式のファイルで、Headerの#以降にリピートの分類情報が記載。
-
Paa-families.stk
- リピートファミリーのアライメント情報
-
Paa-rmod.log
:- RepeatModelerの解析プロセス全体のログファイル
RepeatModerの出力はRM_実行日時*
の形式のフォルダに出力されます。
次にRepeatMaskerでリピート領域のアノテーションとマスキングを実施しますが、使用するのは入力ファイルと同じフォルダに出力されている*-families.fa
です。
RepeatMaskerによるリピート領域のマスキング
USAGE RepeatMasker
NAME
RepeatMasker - Mask repetitive DNA
SYNOPSIS
RepeatMasker [-options] <seqfiles(s) in fasta format>
DESCRIPTION
The options are:
-h(elp)
Detailed help
Default settings are for masking all type of repeats in a primate
sequence.
-e(ngine) [crossmatch|wublast|abblast|ncbi|rmblast|hmmer]
Use an alternate search engine to the default. Note: 'ncbi' and
'rmblast' are both aliases for the rmblastn search engine engine.
The generic NCBI blastn program is not sensitive enough for use with
RepeatMasker at this time.
-pa(rallel) [number]
The number of sequence batch jobs [50kb minimum] to run in parallel.
RepeatMasker will fork off this number of parallel jobs, each
running the search engine specified. For each search engine
invocation ( where applicable ) a fixed the number of cores/threads
is used:
RMBlast 4 cores
ABBlast 4 cores
nhmmer 2 cores
crossmatch 1 core
To estimate the number of cores a RepeatMasker run will use simply
multiply the -pa value by the number of cores the particular search
engine will use.
-s Slow search; 0-5% more sensitive, 2-3 times slower than default
-q Quick search; 5-10% less sensitive, 2-5 times faster than default
-qq Rush job; about 10% less sensitive, 4->10 times faster than default
(quick searches are fine under most circumstances) repeat options
-nolow
Does not mask low_complexity DNA or simple repeats
NOTE: This is an important step in RepeatMasker, the identification
of low-divergence simple repeats early in RepeatMasker's search
phase lowers the overall false-positive rate for TE annotations
considerably. To simply remove simple repeats from the final output
of RepeatMasker use postprocessing tools such as:
egrep -v "Simple|Satellite" my_data.out > filtered.out
To remove these annotations from the final output. The -nolow option
should only be used when there is a specific reason to avoid
pre/post masking tandem/simple/low-complexity sequences.
-noint
Only masks low complex/simple repeats (no interspersed repeats)
-norna
Does not mask small RNA (pseudo) genes
-alu
Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)
-div [number]
Masks only those repeats with [number] percent diverged from
consensus
-lib [filename]
Allows use of a custom library (e.g. from another species)
-cutoff [number]
Sets cutoff score for masking repeats when using -lib (default 225)
-species <query species>
Specify the species or clade of the input sequence. The species name
must be a valid NCBI Taxonomy Database species name and be contained
in the RepeatMasker repeat database. Some examples are:
-species human
-species mouse
-species rattus
-species "ciona savignyi"
-species arabidopsis
Other commonly used species:
mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,
danio, "ciona intestinalis" drosophila, anopheles, worm, diatoaea,
artiodactyl, arabidopsis, rice, wheat, and maize
-uncurated
Use uncurated and curated families from FamDB rather than curated
only (default).
The latest version of RepeatMasker is designed to work with the new
FamDB partitioned data format (v1.0). This format supports storing
both curated and uncurated TE data in smaller (optionally present)
partitions. Rather than controlling the type of data used (curated/
uncurated) by downloading the corresponding database, both
curated/uncurated data will now be in the same files and use of that
data can be controlled using this new flag.
Contamination options
-is_only
Only clips E coli insertion elements out of fasta and .qual files
-is_clip
Clips IS elements before analysis (default: IS only reported)
-no_is
Skips bacterial insertion element check
Running options
-gc [number]
Use matrices calculated for 'number' percentage background GC level
-gccalc
RepeatMasker calculates the GC content even for batch files/small
seqs
-frag [number]
Maximum sequence length masked without fragmenting (default 60000)
-nocut
Skips the steps in which repeats are excised
-noisy
Prints search engine progress report to screen (defaults to .stderr
file)
-nopost
Do not postprocess the results of the run ( i.e. call ProcessRepeats
). NOTE: This options should only be used when ProcessRepeats will
be run manually on the results.
output options
-dir [directory name]
Writes output to this directory (default is query file directory,
"-dir ." will write to current directory).
-a(lignments)
Writes alignments in .align output file
-inv
Alignments are presented in the orientation of the repeat (with
option -a)
-lcambig
Outputs ambiguous DNA transposon fragments using a lower case name.
All other repeats are listed in upper case. Ambiguous fragments
match multiple repeat elements and can only be called based on
flanking repeat information.
-small
Returns complete .masked sequence in lower case
-xsmall
Returns repetitive regions in lowercase (rest capitals) rather than
masked
-x Returns repetitive regions masked with Xs rather than Ns
-poly
Reports simple repeats that may be polymorphic (in file.poly)
-source
Includes for each annotation the HSP "evidence". Currently this
option is only available with the "-html" output format listed
below.
-html
Creates an additional output file in xhtml format.
-ace
Creates an additional output file in ACeDB format
-gff
Creates an additional Gene Feature Finding format output
-u Creates an additional annotation file not processed by
ProcessRepeats
-xm Creates an additional output file in cross_match format (for
parsing)
-no_id
Leaves out final column with unique ID for each element (was
default)
-e(xcln)
Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in
the query
CONFIGURATION OVERRIDES
-crossmatch_dir <string>
The path Phil Green's cross_match program ( phrap program suite ).
-libdir <string>
Path to the RepeatMasker libraries directory.
-hmmer_dir <string>
The path to the HMMER profile HMM search software.
-rmblast_dir <string>
The path to the installation of the RMBLAST sequence alignment
program.
-trf_prgm <string>
The full path including the name for the TRF program.
-default_search_engine <string>
The default search engine to use
-abblast_dir <string>
The path to the installation of the ABBLAST sequence alignment
program.
SEE ALSO
Crossmatch, ProcessRepeats
COPYRIGHT
2002-2024 Copyright (C) Institute for Systems Biology Developed by Arian
Smit and Robert Hubley.
2000-2001 Copyright (C) Arian Smit.
1996-1999 Copyright (C) University of Washington, Developed by Arian
Smit, Philip Green and Colin Wilson of the University of Washington
Department of Genomics.
AUTHORS
Arian Smit <asmit@systemsbiology.org>
Robert Hubley <rhubley@systemsbiology.org>
では、RepeatMaskerでRepeat領域をマスクします。
RepeatMasker \
-lib ref/Paa-families.fa \
ref/Paa.fasta \
-dir LTR_repeat \
-html -gff -xsmall \
--pa 8
-
-lib
: RepeatModelerで推定されてたRepeat情報fastaを指定 -
-dir
: 結果出力フォルダを指定 -
-html
: html形式の結果ファイルを指定 -
-gff
: gff形式の結果ファイルを指定 -
-xsmall
: 検出されたリピート利用域を小文字化するソフトマスクを指定 -
-pa
: 並列実行するJob数を指定。以下のコア数が1jobに割り当てられるので、PCのスレッド数に合わせる。(32 threadsだと8を指定すると最大32threads使用することになる)- RMBlast 4 cores
- ABBlast 4 cores
- nhmmer 2 cores
- crossmatch 1 core
ログファイル (rmod.log) を見てみるとLTR Structural Analysis: Enabled ( GenomeTools 1.6.4, LTR_Retriever ,Ninja 0.97-cluster_only, MAFFT 7.520, CD-HIT 4.8.1 )
と記載されているので、有効化したオプションは効いているようです。
rmod.log
RepeatModeler Version 2.0.5
===========================
Search Engine = rmblast 2.14.1+
Threads = 32
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.7
LTR Structural Analysis: Enabled ( GenomeTools 1.6.2, LTR_Retriever ,
Ninja 0.97-cluster_only, MAFFT 7.525,
CD-HIT 4.8.1 )
Random Number Seed: 1727589178
Database = ref/Paa - Sequences = 4119
- Bases = 448375639
- N50 = 17524823
- Contig Histogram:
Size(bp) Count
-----------------------------------------------------------------------
20914945-22408856 | [ 1 ]
19421035-20914945 | [ 4 ]
17927124-19421034 | [ 4 ]
16433214-17927124 | [ 5 ]
14939304-16433214 | [ 3 ]
13445393-14939303 | [ 2 ]
11951483-13445393 | [ 2 ]
10457572-11951482 | [ 3 ]
8963662-10457572 | [ 4 ]
7469752-8963662 | [ ]
5975841-7469751 | [ ]
4481931-5975841 | [ ]
2988020-4481930 | [ ]
1494110-2988020 | [ ]
200-1494110 |************************************************** [ 4091 ]
Storage Throughput = excellent ( 1252.36 MB/s )
RepeatModeler Round # 1
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 40000000 bp
- Final Sample Size = 40738614 bp ( 40012192 non ambiguous )
- Num Contigs Represented = 404
- Sequence extraction : 00:00:07 (hh:mm:ss) Elapsed Time
-- Running RepeatScout on the sequences...
- RepeatScout: 00:05:29 (hh:mm:ss) Elapsed Time
Round Time: 00:06:12 (hh:mm:ss) Elapsed Time : 80 families discovered.
RepeatModeler Round # 2
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 10000000 bp
- Sequence extraction : 00:00:02 (hh:mm:ss) Elapsed Time
-- Running TRFMask on the sequence...
- TRFMask time 00:00:42 (hh:mm:ss) Elapsed Time
-- Masking repeats from the previous rounds...
1337 repeats masked totaling 271677 bp(s).
- TE Masking time 00:00:02 (hh:mm:ss) Elapsed Time
-- Sample Stats:
Sample Size 10195226 bp
Num Contigs Represented = 131
Non ambiguous bp:
Initial: 10029941 bp
After Masking: 8569184 bp
Masked: 14.56 %
-- Input Database Coverage: 10195226 bp out of 448375639 bp ( 2.27 % )
Sampling Time: 00:00:46 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
- Total Comparisons = 61425
Comparison Time: 00:02:51 (hh:mm:ss) Elapsed Time, 4973 HSPs Collected
Number of families returned by RECON: 1237
Round Time: 00:03:51 (hh:mm:ss) Elapsed Time : 8 families discovered.
RepeatModeler Round # 3
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 30000000 bp
- Sequence extraction : 00:00:06 (hh:mm:ss) Elapsed Time
-- Running TRFMask on the sequence...
- TRFMask time 00:02:11 (hh:mm:ss) Elapsed Time
-- Masking repeats from the previous rounds...
4702 repeats masked totaling 888344 bp(s).
- TE Masking time 00:00:07 (hh:mm:ss) Elapsed Time
-- Sample Stats:
Sample Size 30583388 bp
Num Contigs Represented = 301
Non ambiguous bp:
Initial: 30021443 bp
After Masking: 25659283 bp
Masked: 14.53 %
-- Input Database Coverage: 40778614 bp out of 448375639 bp ( 9.09 % )
Sampling Time: 00:02:25 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
- Total Comparisons = 516636
Comparison Time: 00:19:40 (hh:mm:ss) Elapsed Time, 23320 HSPs Collected
Number of families returned by RECON: 4438
Round Time: 00:22:40 (hh:mm:ss) Elapsed Time : 37 families discovered.
RepeatModeler Round # 4
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 90000000 bp
- Sequence extraction : 00:00:17 (hh:mm:ss) Elapsed Time
-- Running TRFMask on the sequence...
- TRFMask time 00:06:19 (hh:mm:ss) Elapsed Time
-- Masking repeats from the previous rounds...
20043 repeats masked totaling 3794688 bp(s).
- TE Masking time 00:00:33 (hh:mm:ss) Elapsed Time
-- Sample Stats:
Sample Size 91901378 bp
Num Contigs Represented = 831
Non ambiguous bp:
Initial: 90026687 bp
After Masking: 75999598 bp
Masked: 15.58 %
-- Input Database Coverage: 132679992 bp out of 448375639 bp ( 29.59 % )
Sampling Time: 00:07:12 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
- Total Comparisons = 4613203
Comparison Time: 02:38:01 (hh:mm:ss) Elapsed Time, 147522 HSPs Collected
Number of families returned by RECON: 20277
Round Time: 02:49:47 (hh:mm:ss) Elapsed Time : 142 families discovered.
RepeatModeler Round # 5
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 270000000 bp
- Sequence extraction : 00:00:50 (hh:mm:ss) Elapsed Time
-- Running TRFMask on the sequence...
- TRFMask time 00:18:15 (hh:mm:ss) Elapsed Time
-- Masking repeats from the previous rounds...
80178 repeats masked totaling 14953789 bp(s).
- TE Masking time 00:02:50 (hh:mm:ss) Elapsed Time
-- Sample Stats:
Sample Size 275772821 bp
Num Contigs Represented = 2605
Non ambiguous bp:
Initial: 270025198 bp
After Masking: 224480321 bp
Masked: 16.87 %
-- Input Database Coverage: 408452813 bp out of 448375639 bp ( 91.10 % )
Sampling Time: 00:22:05 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
- Total Comparisons = 43100970
Comparison Time: 22:36:22 (hh:mm:ss) Elapsed Time, 709565 HSPs Collected
Number of families returned by RECON: 93024
Round Time: 24:00:13 (hh:mm:ss) Elapsed Time : 402 families discovered.
RepeatScout/RECON discovery complete: 669 families found
LTR Structural Analysis
=======================
LTRPipeline Time: 00:10:31 (hh:mm:ss) Elapsed Time
Classification Time: 00:05:50 (hh:mm:ss) Elapsed Time
Program Time: 27:39:04 (hh:mm:ss) Elapsed Time
出力ファイルに指定したLTR_repeat
にあるファイルを確認します。
Paa.fasta.tbl
RepeatMaskerのサマリーファイル。
==================================================
file name: Paa.fasta
sequences: 4119
total length: 448375639 bp (439282816 bp excl N/X-runs)
GC level: 46.84 %
bases masked: 77879223 bp ( 17.37 %)
==================================================
number of length percentage
elements* occupied of sequence
--------------------------------------------------
Retroelements 12967 5740794 bp 1.28 %
SINEs: 408 74708 bp 0.02 %
Penelope: 0 0 bp 0.00 %
LINEs: 11995 5375619 bp 1.20 %
CRE/SLACS 0 0 bp 0.00 %
L2/CR1/Rex 7880 2741510 bp 0.61 %
R1/LOA/Jockey 90 68046 bp 0.02 %
R2/R4/NeSL 379 135141 bp 0.03 %
RTE/Bov-B 3321 2309635 bp 0.52 %
L1/CIN4 0 0 bp 0.00 %
LTR elements: 564 290467 bp 0.06 %
BEL/Pao 0 0 bp 0.00 %
Ty1/Copia 0 0 bp 0.00 %
Gypsy/DIRS1 422 142033 bp 0.03 %
Retroviral 125 125387 bp 0.03 %
DNA transposons 8125 2505481 bp 0.56 %
hobo-Activator 551 208168 bp 0.05 %
Tc1-IS630-Pogo 5370 1682357 bp 0.38 %
En-Spm 0 0 bp 0.00 %
MULE-MuDR 0 0 bp 0.00 %
PiggyBac 101 27358 bp 0.01 %
Tourist/Harbinger 727 215294 bp 0.05 %
Other (Mirage, 41 25512 bp 0.01 %
P-element, Transib)
Rolling-circles 0 0 bp 0.00 %
Unclassified: 192061 35584608 bp 7.94 %
Total interspersed repeats: 43830883 bp 9.78 %
Small RNA: 301 65223 bp 0.01 %
Satellites: 0 0 bp 0.00 %
Simple repeats: 370906 26063844 bp 5.81 %
Low complexity: 69149 7919273 bp 1.77 %
==================================================
* most repeats fragmented by insertions or deletions
have been counted as one element
RepeatMasker version 4.1.7-p1 , default mode
run with rmblastn version 2.14.1+
The query was compared to classified sequences in ".../Paa-families.fa"
FamDB:
全体の17.37%がRepeatModerとRepeatMaskerによってNに変換されてたようです。
Scafforldの状態でアセンブルされた原著論文[3]では、Repbaseのフグリピートデータベースを使用すると56.78 Mb (アユゲノムアセンブリの12.61%)、ゼブラフィッシュデータベースを使用すると65.15 Mb (14.47%)が、TE (Transpose Element) だったとの記載があります。
ただ実際に使用したリファレンスゲノムはHi-Cで染色体レベルにまで向上させたもの[4]なので、RepeatModelerによるde novoでのRpeat領域の推定とリファレンスゲノムの状態がこの違いを生んでいるのかもしれません。
Paa.fasta.out
Maskされた領域の開始点や終了点、どういったリピート配列の区分に当たるのかなどが記載されたファイル。
SW perc perc perc query position in query matching repeat position in repeat
score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID
12 21.7 0.0 0.0 chr1 1541 1572 (16525992) + (AGCT)n Simple_repeat 1 32 (0) 1
2895 2.7 0.0 1.5 chr1 3871 4211 (16523353) + rnd-1_family-5 Unknown 1 336 (0) 2
411 22.0 0.0 2.0 chr1 12820 12921 (16514643) + rnd-4_family-9207 Unknown 13 112 (2) 3
1100 9.6 4.8 0.6 chr1 18840 19006 (16508558) C rnd-1_family-8 Unknown (6) 197 24 4
16 9.6 0.0 0.0 chr1 20341 20363 (16507201) + (TC)n Simple_repeat 1 23 (0) 5
1215 5.2 13.9 0.0 chr1 22271 22442 (16505122) + rnd-1_family-11 Unknown 48 243 (1) 6
12 8.9 4.2 0.0 chr1 23082 23105 (16504459) + (AGT)n Simple_repeat 1 25 (0) 7
12 18.5 2.4 5.0 chr1 23759 23799 (16503765) + (TTTA)n Simple_repeat 1 40 (0) 8
24 18.4 1.9 1.9 chr1 26608 26660 (16500904) + (CT)n Simple_repeat 1 53 (0) 9
611 7.2 17.2 3.8 chr1 27102 27217 (16500347) + rnd-4_family-3478 DNA/hAT-Charlie 43 173 (696) 10
長鎖の散在反復配列 (LINE) でだけ取り出してみる
grep -i LINE LTR_repeat/Paa.fasta.out | head -n 5
6223 8.6 7.2 0.1 chr1 35144 36436 (16491128) + rnd-1_family-9 LINE/RTE-BovB 33 1416 (95) 20
1655 10.3 4.7 1.5 chr1 36460 36716 (16490848) C rnd-1_family-12 LINE/RTE-BovB (706) 808 544 21
9172 9.2 2.8 0.0 chr1 36729 37989 (16489575) + rnd-1_family-4 LINE/RTE-BovB 1 1296 (228) 22
11822 7.0 0.4 0.0 chr1 40502 42019 (16485545) C rnd-1_family-4 LINE/RTE-BovB (0) 1524 1 29
11762 7.2 0.7 0.0 chr1 41552 43055 (16484509) + rnd-1_family-12 LINE/RTE-BovB 1 1514 (0) 30 *
Paa.fasta.masked
softmaskされた配列データです。.outに記載されているLINE/RTE-BovBとされた領域を見てみる。両端1bpずつ伸ばして表示
seqkit grep -nrp chr1$ LTR_repeat/Paa.fasta.masked | seqkit subseq -r 35143:36437
>chr1
Cagggatttgagtgcagggatgatcataggagctatgctgtcaggggcaataagcccctg
gtagggtctcccaagtcaaactggtcctaggtgactagccagactaagcgcggttcacca
aacccctatgatcacaagactaaagagattcactccgtcgcctggagtggcgttacctgg
ggccccaccctggagccaggcctggagtgggtgctcgcaggcgagcgcctagtgaccagg
tctttgcccatgggaccggccggatatagcccgaaaaggcaatgtggggacactttcctg
caggcccaccacccacaggaaggaccgcagggggcgggtgtagtatgttttaggcggcag
tcaggggcaggtgcctggacgactgaggccatgcacacagtctctagcgctggggatatg
gaatgtcacctggcggggaaggagcccgcgcacagcttgggttccggatccactatcctc
gagagaggctggactctacatttctctggagttgcccgcgagaagaggcggcgagctggc
gtgggcttgcttatagccccgtggctcagccgccacgtgttggcgtttaccccagtgaat
gagaaggtcgcgtccctgcgccttcaagccggggacaggtctctcactgttgtattggcc
tatgggccgaagaacagtgcagagtagccggccttcttggggtccctgggaggggtactg
gacagtgcaccgaccgaggactccattgttctgttgggggacttaaacactcaagtgggc
aacgacagtgacacctggagaggagtgattgagaagaacagcccccctaatcagaaccca
agcggggttttgtgctagtcacagtttgtccataacaaacaccatgttcacaagggtgtc
catcagtgcacgtggcaccaggacaccctaagccgaaggtcaatgctcgactttgttgtc
gtgtcctctgaccagcggccacgtgtctcggataaagagaggggcggagttgtcaattga
ccaccacctggtcgtgagctggatccgctggtggaggaggaagccgaacagacctggcag
gcccaagcatatcgtgagggtctgttgggaacatctagtggagccctctgtcagcagggt
cttcaaccctgcacctccgggaaagtttctcatagatccctggggggactggggacattg
ggtctgagtggaccatgttctcagcctctgttgtcaacatggcagcctgtaactgcggtc
gtaaggtctccagttgtggcggcaatccctgaacT
Blastnかけてみたが当たり前のように魚類にヒットする。そういうものなのか今の知識だとピンとこないがリピートマスク処理の練習になったので良し。
Soft MaskをHard Maskに変換
NCBIのGenome情報が保存されているFTPにあるREADMEに記載のコマンドフレーズを使用します。
awk '{if(/^[^>]/)gsub(/[a-z]/,"N");print $0}' LTR_repeat/Paa.fasta.masked > ref/Paa.fasta.hardmask
先程確認した領域を表示
seqkit grep -nrp chr1$ ref/Paa.fasta.hardmask | seqkit subseq -r 35143:36437
>chr1
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNT
無事Hard Mask処理されたゲノム配列が作成できた。
おわり。
-
ゲノム上のある場所から別の場所へと移動する転移因子をトランスポゾンといい、その中でも、自身をRNAに転写後、自身がコードする逆転や酵素によって自らのRNAをcDNAへと変換して、ゲノムへ挿入を繰り返すことで数を増やすことの出来るものをレトロトランスポゾンと言う。 ↩︎
-
Y-linked anti-Müllerian hormone type-II receptor is the sex-determining gene in ayu, Plecoglossus altivelis ↩︎
-
Improvement of the ayu (Plecoglossus altivelis) draft genome using Hi-C sequencing ↩︎
Discussion