🐥

ゲノムのリピート領域をマスクする

2024/10/01に公開

はじめに

ゲノム配列を決定したらそれをリファレンスにして、SNPの解析や変異解析を実施することができます。

Pool-seqやlcWGR(Low coverage whole genome resequencing)などのリファレンス配列とのSNPの違いに着目する集団ゲノミクスではリピート配列はマスク (Nや小文字に置き換えること) してしまうことが多いです。

WGSのデータを使って解析をする予定があるのでゲノム配列のマスク処理をしてみたいと思います。

ゲノム散在リピート(Genomic interspersed repeats)

簡単にリピート配列についてまとめます。

ゲノム散在リピート(Genomic interspersed repeats)は、ゲノムにある反復的なDNA配列のことで、その位置や繰り返しの数により、機能や病原性、表現型への影響が異なります。

RepeatMaskerのHP[1]に記載されている情報を参考にすると以下のとおりです。

  1. Simple Repeats
    • AやCA、CGGなど、通常1~5bpの繰り返しで構成されるDNA領域
  2. Tandem Repeats
    • Simple repeatより複雑な100~200bpの繰り返しで構成されるDNA領域
    • 染色体のセントロメア (くびれ部分) とテロメア (末端部分) で主に見られる
  3. Segmental Duplications
    • ゲノムの別の領域にコピーされた 10~300kbpのDNA領域
  4. Interspersed Repeats
    1. SINES: 逆転写酵素の助けを借りてゲノムに再組み込まれた、RNAの機能しない遺伝子コピー (300bp程度)
    2. DNAトランスポゾン
    3. レトロウイルス、レトロトランスポゾン[2]
    4. LINES : 非LTRのレトロトランスポゾン (700bp程度)

それぞれ、目的に応じて意味のある要素となりますが、ゲノムの構造変異解析などでなければMaskするほうが効率的に解析を進めることができます。

マスク対象のリファレンスゲノムの用意

まずは公共データベースからダウンロードします (BROE01.1.fsa_nt.gz)

https://www.ncbi.nlm.nih.gov/Traces/wgs/BROE01?display=download

mkdir ref && cd ref
aria2c -x 16 -ctrue https://sra-download.ncbi.nlm.nih.gov/traces/wgs01/wgs_aux/BR/OE/BROE01/BROE01.1.fsa_nt.gz

解答する。

pigz -d BROE01.1.fsa_nt.gz

配列名の確認

grep -i ">" BROE01.1.fsa_nt
><番号> Plecoglossus altivelis DNA, <chr or sccaffold>, whole genome shotgun sequence

上記のような感じの命名になっていたので、><chr or sccaffold>の形式にしておきます。

sed -E 's/BROE[0-9]+.[0-9] Plecoglossus altivelis DNA, ([a-z]+[0-9]+), whole.*/\1/g' BROE01.1.fsa_nt >  Paa.fasta

確認

grep -i ">" Paa.fasta | head -n 30

多分、上記作業で>chr[0-9]+かscaffold[0-9]+の形式になっているかと思います。

RepeatModer2RepeatMaskerによるリピート領域のハードマスク

マッピング精度の向上と時短のために、リファレンスゲノムのリピート領域や複雑領域をNでマスクします。RepeatModerとRepeatMaskerはリピート領域の推定とMaskingが可能なツールです。

De novoでRepeat領域の識別

RepeatModeler2を使ってDe novoでRepeat領域の識別を行います。

RepeatModerはv2からLTR (Likelihood ratio test: 尤度比検定) が実施できるらしいのやってみる。実行にはNINJAが必要と書いてあったので導入。

https://github.com/TravisWheelerLab/NINJA

git clone https://github.com/TravisWheelerLab/NINJA.git
cd NINJA

make build
export PATH=$PATH:$PWD/NINJA
USAGE Ninja
Ninja - Version 0.97-cluster_only

  ./Ninja --in file.fa --out file.out

Arguments: 
--help (or -h) to display this help
--in (or -i) filename
--out (or -o) filename
--in_type type [a | d] (default a)
--out_type type [d | c] (default c)
--corr_type type [n | j | k | s | m]
--cluster_cutoff dist_cutoff (default 0.03)
--threads (or -T) num_threads
--version (or -v) print the software version
For more information, check the README file.
# データベースファイルの作成
BuildDatabase -name ref/Paa -engine ncbi ref/Paa.fasta
Building database ref/Paa:
  Reading ref/Paa.fasta...
Number of sequences (bp) added to database: 4119 ( 448375639 bp )

RepeatModelerの実行

RepeatModeler -LTRStruct -threads 32 -engine ncbi -database ref/Paa

repeatmodeler2では使用スレッド数の指定が、-paから-threadsになったようです。32スレッド使って実行時間は28時間程度。

入力ファイルと同じフォルダに出力されたのは以下の3ファイルです。

  • Paa-families.fa

    • fasta形式のファイルで、Headerの#以降にリピートの分類情報が記載。
  • Paa-families.stk

    • リピートファミリーのアライメント情報
  • Paa-rmod.log :

    • RepeatModelerの解析プロセス全体のログファイル

RepeatModerの出力はRM_実行日時*の形式のフォルダに出力されます。

次にRepeatMaskerでリピート領域のアノテーションとマスキングを実施しますが、使用するのは入力ファイルと同じフォルダに出力されている*-families.faです。

RepeatMaskerによるリピート領域のマスキング

USAGE RepeatMasker
NAME
    RepeatMasker - Mask repetitive DNA

SYNOPSIS
      RepeatMasker [-options] <seqfiles(s) in fasta format>

DESCRIPTION
    The options are:

    -h(elp)
        Detailed help

    Default settings are for masking all type of repeats in a primate
    sequence.

    -e(ngine) [crossmatch|wublast|abblast|ncbi|rmblast|hmmer]
        Use an alternate search engine to the default. Note: 'ncbi' and
        'rmblast' are both aliases for the rmblastn search engine engine.
        The generic NCBI blastn program is not sensitive enough for use with
        RepeatMasker at this time.

    -pa(rallel) [number]
        The number of sequence batch jobs [50kb minimum] to run in parallel.
        RepeatMasker will fork off this number of parallel jobs, each
        running the search engine specified. For each search engine
        invocation ( where applicable ) a fixed the number of cores/threads
        is used:

          RMBlast     4 cores
          ABBlast     4 cores
          nhmmer      2 cores
          crossmatch  1 core

        To estimate the number of cores a RepeatMasker run will use simply
        multiply the -pa value by the number of cores the particular search
        engine will use.

    -s  Slow search; 0-5% more sensitive, 2-3 times slower than default

    -q  Quick search; 5-10% less sensitive, 2-5 times faster than default

    -qq Rush job; about 10% less sensitive, 4->10 times faster than default
        (quick searches are fine under most circumstances) repeat options

    -nolow
        Does not mask low_complexity DNA or simple repeats

        NOTE: This is an important step in RepeatMasker, the identification
        of low-divergence simple repeats early in RepeatMasker's search
        phase lowers the overall false-positive rate for TE annotations
        considerably. To simply remove simple repeats from the final output
        of RepeatMasker use postprocessing tools such as:

         egrep -v "Simple|Satellite" my_data.out > filtered.out

        To remove these annotations from the final output. The -nolow option
        should only be used when there is a specific reason to avoid
        pre/post masking tandem/simple/low-complexity sequences.

    -noint
        Only masks low complex/simple repeats (no interspersed repeats)

    -norna
        Does not mask small RNA (pseudo) genes

    -alu
        Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)

    -div [number]
        Masks only those repeats with [number] percent diverged from
        consensus

    -lib [filename]
        Allows use of a custom library (e.g. from another species)

    -cutoff [number]
        Sets cutoff score for masking repeats when using -lib (default 225)

    -species <query species>
        Specify the species or clade of the input sequence. The species name
        must be a valid NCBI Taxonomy Database species name and be contained
        in the RepeatMasker repeat database. Some examples are:

          -species human
          -species mouse
          -species rattus
          -species "ciona savignyi"
          -species arabidopsis

        Other commonly used species:

        mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,
        danio, "ciona intestinalis" drosophila, anopheles, worm, diatoaea,
        artiodactyl, arabidopsis, rice, wheat, and maize

    -uncurated
        Use uncurated and curated families from FamDB rather than curated
        only (default).

        The latest version of RepeatMasker is designed to work with the new
        FamDB partitioned data format (v1.0). This format supports storing
        both curated and uncurated TE data in smaller (optionally present)
        partitions. Rather than controlling the type of data used (curated/
        uncurated) by downloading the corresponding database, both
        curated/uncurated data will now be in the same files and use of that
        data can be controlled using this new flag.

    Contamination options

    -is_only
        Only clips E coli insertion elements out of fasta and .qual files

    -is_clip
        Clips IS elements before analysis (default: IS only reported)

    -no_is
        Skips bacterial insertion element check

    Running options

    -gc [number]
        Use matrices calculated for 'number' percentage background GC level

    -gccalc
        RepeatMasker calculates the GC content even for batch files/small
        seqs

    -frag [number]
        Maximum sequence length masked without fragmenting (default 60000)

    -nocut
        Skips the steps in which repeats are excised

    -noisy
        Prints search engine progress report to screen (defaults to .stderr
        file)

    -nopost
        Do not postprocess the results of the run ( i.e. call ProcessRepeats
        ). NOTE: This options should only be used when ProcessRepeats will
        be run manually on the results.

    output options

    -dir [directory name]
        Writes output to this directory (default is query file directory,
        "-dir ." will write to current directory).

    -a(lignments)
        Writes alignments in .align output file

    -inv
        Alignments are presented in the orientation of the repeat (with
        option -a)

    -lcambig
        Outputs ambiguous DNA transposon fragments using a lower case name.
        All other repeats are listed in upper case. Ambiguous fragments
        match multiple repeat elements and can only be called based on
        flanking repeat information.

    -small
        Returns complete .masked sequence in lower case

    -xsmall
        Returns repetitive regions in lowercase (rest capitals) rather than
        masked

    -x  Returns repetitive regions masked with Xs rather than Ns

    -poly
        Reports simple repeats that may be polymorphic (in file.poly)

    -source
        Includes for each annotation the HSP "evidence". Currently this
        option is only available with the "-html" output format listed
        below.

    -html
        Creates an additional output file in xhtml format.

    -ace
        Creates an additional output file in ACeDB format

    -gff
        Creates an additional Gene Feature Finding format output

    -u  Creates an additional annotation file not processed by
        ProcessRepeats

    -xm Creates an additional output file in cross_match format (for
        parsing)

    -no_id
        Leaves out final column with unique ID for each element (was
        default)

    -e(xcln)
        Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in
        the query

CONFIGURATION OVERRIDES
    -crossmatch_dir <string>
        The path Phil Green's cross_match program ( phrap program suite ).

    -libdir <string>
        Path to the RepeatMasker libraries directory.

    -hmmer_dir <string>
        The path to the HMMER profile HMM search software.

    -rmblast_dir <string>
        The path to the installation of the RMBLAST sequence alignment
        program.

    -trf_prgm <string>
        The full path including the name for the TRF program.

    -default_search_engine <string>
        The default search engine to use

    -abblast_dir <string>
        The path to the installation of the ABBLAST sequence alignment
        program.

SEE ALSO
        Crossmatch, ProcessRepeats

COPYRIGHT
    2002-2024 Copyright (C) Institute for Systems Biology Developed by Arian
    Smit and Robert Hubley.

    2000-2001 Copyright (C) Arian Smit.

    1996-1999 Copyright (C) University of Washington, Developed by Arian
    Smit, Philip Green and Colin Wilson of the University of Washington
    Department of Genomics.

AUTHORS
    Arian Smit <asmit@systemsbiology.org>

    Robert Hubley <rhubley@systemsbiology.org>

では、RepeatMaskerでRepeat領域をマスクします。

RepeatMasker \
-lib ref/Paa-families.fa \
ref/Paa.fasta \
-dir LTR_repeat \
-html -gff -xsmall \
--pa 8
  • -lib: RepeatModelerで推定されてたRepeat情報fastaを指定
  • -dir: 結果出力フォルダを指定
  • -html: html形式の結果ファイルを指定
  • -gff: gff形式の結果ファイルを指定
  • -xsmall: 検出されたリピート利用域を小文字化するソフトマスクを指定
  • -pa: 並列実行するJob数を指定。以下のコア数が1jobに割り当てられるので、PCのスレッド数に合わせる。(32 threadsだと8を指定すると最大32threads使用することになる)
    • RMBlast 4 cores
    • ABBlast 4 cores
    • nhmmer 2 cores
    • crossmatch 1 core

ログファイル (rmod.log) を見てみるとLTR Structural Analysis: Enabled ( GenomeTools 1.6.4, LTR_Retriever ,Ninja 0.97-cluster_only, MAFFT 7.520, CD-HIT 4.8.1 )と記載されているので、有効化したオプションは効いているようです。

rmod.log
RepeatModeler Version 2.0.5
===========================
Search Engine = rmblast 2.14.1+
Threads = 32
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.7
LTR Structural Analysis: Enabled ( GenomeTools 1.6.2, LTR_Retriever ,
                                   Ninja 0.97-cluster_only, MAFFT 7.525,
                                   CD-HIT 4.8.1 )
Random Number Seed: 1727589178
Database = ref/Paa   - Sequences = 4119
  - Bases = 448375639
  - N50 = 17524823
  - Contig Histogram:
  Size(bp)                                                        Count
  -----------------------------------------------------------------------
  20914945-22408856 |                                                   [ 1 ]
  19421035-20914945 |                                                   [ 4 ]
  17927124-19421034 |                                                   [ 4 ]
  16433214-17927124 |                                                   [ 5 ]
  14939304-16433214 |                                                   [ 3 ]
  13445393-14939303 |                                                   [ 2 ]
  11951483-13445393 |                                                   [ 2 ]
  10457572-11951482 |                                                   [ 3 ]
  8963662-10457572  |                                                   [ 4 ]
  7469752-8963662   |                                                   [  ]
  5975841-7469751   |                                                   [  ]
  4481931-5975841   |                                                   [  ]
  2988020-4481930   |                                                   [  ]
  1494110-2988020   |                                                   [  ]
  200-1494110       |************************************************** [ 4091 ]

Storage Throughput = excellent ( 1252.36 MB/s )


RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 40738614 bp ( 40012192 non ambiguous )
   - Num Contigs Represented = 404
   - Sequence extraction : 00:00:07 (hh:mm:ss) Elapsed Time
 -- Running RepeatScout on the sequences...
   - RepeatScout: 00:05:29 (hh:mm:ss) Elapsed Time
Round Time: 00:06:12 (hh:mm:ss) Elapsed Time : 80 families discovered.


RepeatModeler Round # 2
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 10000000 bp
   - Sequence extraction : 00:00:02 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:00:42 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
       1337 repeats masked totaling 271677 bp(s).
   - TE Masking time 00:00:02 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 10195226 bp
       Num Contigs Represented = 131
       Non ambiguous bp:
             Initial: 10029941 bp
             After Masking: 8569184 bp
             Masked: 14.56 % 
 -- Input Database Coverage: 10195226 bp out of 448375639 bp ( 2.27 % )
Sampling Time: 00:00:46 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
  - Total Comparisons = 61425
Comparison Time: 00:02:51 (hh:mm:ss) Elapsed Time, 4973 HSPs Collected
Number of families returned by RECON: 1237
Round Time: 00:03:51 (hh:mm:ss) Elapsed Time : 8 families discovered.


RepeatModeler Round # 3
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 30000000 bp
   - Sequence extraction : 00:00:06 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:02:11 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
       4702 repeats masked totaling 888344 bp(s).
   - TE Masking time 00:00:07 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 30583388 bp
       Num Contigs Represented = 301
       Non ambiguous bp:
             Initial: 30021443 bp
             After Masking: 25659283 bp
             Masked: 14.53 % 
 -- Input Database Coverage: 40778614 bp out of 448375639 bp ( 9.09 % )
Sampling Time: 00:02:25 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
  - Total Comparisons = 516636
Comparison Time: 00:19:40 (hh:mm:ss) Elapsed Time, 23320 HSPs Collected
Number of families returned by RECON: 4438
Round Time: 00:22:40 (hh:mm:ss) Elapsed Time : 37 families discovered.


RepeatModeler Round # 4
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 90000000 bp
   - Sequence extraction : 00:00:17 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:06:19 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
       20043 repeats masked totaling 3794688 bp(s).
   - TE Masking time 00:00:33 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 91901378 bp
       Num Contigs Represented = 831
       Non ambiguous bp:
             Initial: 90026687 bp
             After Masking: 75999598 bp
             Masked: 15.58 % 
 -- Input Database Coverage: 132679992 bp out of 448375639 bp ( 29.59 % )
Sampling Time: 00:07:12 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
  - Total Comparisons = 4613203
Comparison Time: 02:38:01 (hh:mm:ss) Elapsed Time, 147522 HSPs Collected
Number of families returned by RECON: 20277
Round Time: 02:49:47 (hh:mm:ss) Elapsed Time : 142 families discovered.


RepeatModeler Round # 5
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 270000000 bp
   - Sequence extraction : 00:00:50 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
   - TRFMask time 00:18:15 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
       80178 repeats masked totaling 14953789 bp(s).
   - TE Masking time 00:02:50 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 275772821 bp
       Num Contigs Represented = 2605
       Non ambiguous bp:
             Initial: 270025198 bp
             After Masking: 224480321 bp
             Masked: 16.87 % 
 -- Input Database Coverage: 408452813 bp out of 448375639 bp ( 91.10 % )
Sampling Time: 00:22:05 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
  - Total Comparisons = 43100970
Comparison Time: 22:36:22 (hh:mm:ss) Elapsed Time, 709565 HSPs Collected
Number of families returned by RECON: 93024
Round Time: 24:00:13 (hh:mm:ss) Elapsed Time : 402 families discovered.

RepeatScout/RECON discovery complete: 669 families found


LTR Structural Analysis
=======================
LTRPipeline Time: 00:10:31 (hh:mm:ss) Elapsed Time


Classification Time: 00:05:50 (hh:mm:ss) Elapsed Time


Program Time: 27:39:04 (hh:mm:ss) Elapsed Time

出力ファイルに指定したLTR_repeatにあるファイルを確認します。

Paa.fasta.tbl

RepeatMaskerのサマリーファイル。

==================================================
file name: Paa.fasta                
sequences:          4119
total length:  448375639 bp  (439282816 bp excl N/X-runs)
GC level:         46.84 %
bases masked:   77879223 bp ( 17.37 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        12967      5740794 bp    1.28 %
   SINEs:              408        74708 bp    0.02 %
   Penelope:             0            0 bp    0.00 %
   LINEs:            11995      5375619 bp    1.20 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex       7880      2741510 bp    0.61 %
     R1/LOA/Jockey      90        68046 bp    0.02 %
     R2/R4/NeSL        379       135141 bp    0.03 %
     RTE/Bov-B        3321      2309635 bp    0.52 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:       564       290467 bp    0.06 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1       422       142033 bp    0.03 %
       Retroviral      125       125387 bp    0.03 %

DNA transposons       8125      2505481 bp    0.56 %
   hobo-Activator      551       208168 bp    0.05 %
   Tc1-IS630-Pogo     5370      1682357 bp    0.38 %
   En-Spm                0            0 bp    0.00 %
   MULE-MuDR             0            0 bp    0.00 %
   PiggyBac            101        27358 bp    0.01 %
   Tourist/Harbinger   727       215294 bp    0.05 %
   Other (Mirage,       41        25512 bp    0.01 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       192061     35584608 bp    7.94 %

Total interspersed repeats:    43830883 bp    9.78 %


Small RNA:             301        65223 bp    0.01 %

Satellites:              0            0 bp    0.00 %
Simple repeats:     370906     26063844 bp    5.81 %
Low complexity:      69149      7919273 bp    1.77 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
                                                      

RepeatMasker version 4.1.7-p1 , default mode
                                     
run with rmblastn version 2.14.1+
The query was compared to classified sequences in ".../Paa-families.fa"
FamDB: 

全体の17.37%がRepeatModerとRepeatMaskerによってNに変換されてたようです。

Scafforldの状態でアセンブルされた原著論文[3]では、Repbaseのフグリピートデータベースを使用すると56.78 Mb (アユゲノムアセンブリの12.61%)、ゼブラフィッシュデータベースを使用すると65.15 Mb (14.47%)が、TE (Transpose Element) だったとの記載があります。

ただ実際に使用したリファレンスゲノムはHi-Cで染色体レベルにまで向上させたもの[4]なので、RepeatModelerによるde novoでのRpeat領域の推定とリファレンスゲノムの状態がこの違いを生んでいるのかもしれません。

Paa.fasta.out

Maskされた領域の開始点や終了点、どういったリピート配列の区分に当たるのかなどが記載されたファイル。

   SW   perc perc perc  query         position in query              matching           repeat              position in repeat
score   div. del. ins.  sequence      begin    end          (left)   repeat             class/family      begin  end    (left)     ID

   12   21.7  0.0  0.0  chr1              1541     1572 (16525992) + (AGCT)n            Simple_repeat          1     32    (0)      1  
 2895    2.7  0.0  1.5  chr1              3871     4211 (16523353) + rnd-1_family-5     Unknown                1    336    (0)      2  
  411   22.0  0.0  2.0  chr1             12820    12921 (16514643) + rnd-4_family-9207  Unknown               13    112    (2)      3  
 1100    9.6  4.8  0.6  chr1             18840    19006 (16508558) C rnd-1_family-8     Unknown              (6)    197     24      4  
   16    9.6  0.0  0.0  chr1             20341    20363 (16507201) + (TC)n              Simple_repeat          1     23    (0)      5  
 1215    5.2 13.9  0.0  chr1             22271    22442 (16505122) + rnd-1_family-11    Unknown               48    243    (1)      6  
   12    8.9  4.2  0.0  chr1             23082    23105 (16504459) + (AGT)n             Simple_repeat          1     25    (0)      7  
   12   18.5  2.4  5.0  chr1             23759    23799 (16503765) + (TTTA)n            Simple_repeat          1     40    (0)      8  
   24   18.4  1.9  1.9  chr1             26608    26660 (16500904) + (CT)n              Simple_repeat          1     53    (0)      9  
  611    7.2 17.2  3.8  chr1             27102    27217 (16500347) + rnd-4_family-3478  DNA/hAT-Charlie       43    173  (696)     10 

長鎖の散在反復配列 (LINE) でだけ取り出してみる

grep -i LINE LTR_repeat/Paa.fasta.out | head -n 5
 6223    8.6  7.2  0.1  chr1             35144    36436 (16491128) + rnd-1_family-9     LINE/RTE-BovB         33   1416   (95)     20  
 1655   10.3  4.7  1.5  chr1             36460    36716 (16490848) C rnd-1_family-12    LINE/RTE-BovB      (706)    808    544     21  
 9172    9.2  2.8  0.0  chr1             36729    37989 (16489575) + rnd-1_family-4     LINE/RTE-BovB          1   1296  (228)     22  
11822    7.0  0.4  0.0  chr1             40502    42019 (16485545) C rnd-1_family-4     LINE/RTE-BovB        (0)   1524      1     29  
11762    7.2  0.7  0.0  chr1             41552    43055 (16484509) + rnd-1_family-12    LINE/RTE-BovB          1   1514    (0)     30 *

Paa.fasta.masked

softmaskされた配列データです。.outに記載されているLINE/RTE-BovBとされた領域を見てみる。両端1bpずつ伸ばして表示

seqkit grep -nrp chr1$ LTR_repeat/Paa.fasta.masked | seqkit subseq -r 35143:36437
>chr1
Cagggatttgagtgcagggatgatcataggagctatgctgtcaggggcaataagcccctg
gtagggtctcccaagtcaaactggtcctaggtgactagccagactaagcgcggttcacca
aacccctatgatcacaagactaaagagattcactccgtcgcctggagtggcgttacctgg
ggccccaccctggagccaggcctggagtgggtgctcgcaggcgagcgcctagtgaccagg
tctttgcccatgggaccggccggatatagcccgaaaaggcaatgtggggacactttcctg
caggcccaccacccacaggaaggaccgcagggggcgggtgtagtatgttttaggcggcag
tcaggggcaggtgcctggacgactgaggccatgcacacagtctctagcgctggggatatg
gaatgtcacctggcggggaaggagcccgcgcacagcttgggttccggatccactatcctc
gagagaggctggactctacatttctctggagttgcccgcgagaagaggcggcgagctggc
gtgggcttgcttatagccccgtggctcagccgccacgtgttggcgtttaccccagtgaat
gagaaggtcgcgtccctgcgccttcaagccggggacaggtctctcactgttgtattggcc
tatgggccgaagaacagtgcagagtagccggccttcttggggtccctgggaggggtactg
gacagtgcaccgaccgaggactccattgttctgttgggggacttaaacactcaagtgggc
aacgacagtgacacctggagaggagtgattgagaagaacagcccccctaatcagaaccca
agcggggttttgtgctagtcacagtttgtccataacaaacaccatgttcacaagggtgtc
catcagtgcacgtggcaccaggacaccctaagccgaaggtcaatgctcgactttgttgtc
gtgtcctctgaccagcggccacgtgtctcggataaagagaggggcggagttgtcaattga
ccaccacctggtcgtgagctggatccgctggtggaggaggaagccgaacagacctggcag
gcccaagcatatcgtgagggtctgttgggaacatctagtggagccctctgtcagcagggt
cttcaaccctgcacctccgggaaagtttctcatagatccctggggggactggggacattg
ggtctgagtggaccatgttctcagcctctgttgtcaacatggcagcctgtaactgcggtc
gtaaggtctccagttgtggcggcaatccctgaacT

Blastnかけてみたが当たり前のように魚類にヒットする。そういうものなのか今の知識だとピンとこないがリピートマスク処理の練習になったので良し。

Soft MaskをHard Maskに変換

NCBIのGenome情報が保存されているFTPにあるREADMEに記載のコマンドフレーズを使用します。

awk '{if(/^[^>]/)gsub(/[a-z]/,"N");print $0}' LTR_repeat/Paa.fasta.masked > ref/Paa.fasta.hardmask

先程確認した領域を表示

seqkit grep -nrp chr1$ ref/Paa.fasta.hardmask | seqkit subseq -r 35143:36437
>chr1
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNT

無事Hard Mask処理されたゲノム配列が作成できた。

おわり。

脚注
  1. https://www.repeatmasker.org/faq.html ↩︎

  2. ゲノム上のある場所から別の場所へと移動する転移因子をトランスポゾンといい、その中でも、自身をRNAに転写後、自身がコードする逆転や酵素によって自らのRNAをcDNAへと変換して、ゲノムへ挿入を繰り返すことで数を増やすことの出来るものをレトロトランスポゾンと言う。 ↩︎

  3. Y-linked anti-Müllerian hormone type-II receptor is the sex-determining gene in ayu, Plecoglossus altivelis ↩︎

  4. Improvement of the ayu (Plecoglossus altivelis) draft genome using Hi-C sequencing ↩︎

Discussion