iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
📖

Genome Analysis: Calculating Nucleotide Ratios at a Single Coordinate from BAM Files

に公開

When viewing a BAM file obtained by mapping NGS reads to a reference sequence in IGV, you can check the count and ratio of ATGC for each individual coordinate.

IGV screen opening a BAM file

Since it is impractical to perform this calculation manually for every file and every coordinate, I wanted to run it via CUI. However, finding the right tool was difficult, so I am documenting it here. In the end, I found that bam-readcount or Rsamtools could be used.

The BAM file used in this article was created by mapping human cell WES (whole exome sequence) reads to the hg19 reference genome using bwa mem, with PCR duplicates removed using Picard tools.

Methods that failed

bedtools intersect

bedtools intersect

[Reference]
https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

bedtools intersect returns the intersection of two inputs. Use a BED file containing the bases you want to search for and a BAM file of the mapping results.

BED file

You specify two files for -a and -b, and it accepts them in either order. It is probably best to set the reference source to -a and the query file to -b.

The output of bedtools intersect follows the file format specified in -a. (If you specify a BAM file for -a, the output will also be a BAM file.)
Include the -bed option to specify output in BED file format.

bedtools intersect -a path_to_bam_file path_to_bed_file  -bed > out.bed


out.bed

The output consisted of information about the NGS reads spanning the coordinates of interest (read ID, strand, start/stop coordinates, etc.), and the specific bases at those coordinates could not be determined.

bcftools mpileup

bcftools mpileup

[Reference]
https://www.biostars.org/p/425139/
https://github.com/samtools/bcftools/issues/992
https://www.biostars.org/p/47945/#47951

When specifying a single-base region in the BED file used for bcftools mpileup, an error occurs if the 2nd and 3rd columns have the same value.
If you want the base at position 2444414, set the 2nd column to 2444413 and the 3rd column to 2444414.

bcftools mpileup \
--regions-file path_to_bed_file \
path_to_bam_file \
--fasta-ref path_to_reference_genome_fasta_file \
--annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
> out.txt

After more than 100 lines of headers, you obtain information such as the reference sequence and variant sequence for the coordinates.

out.txt

Since bcftools is a tool for handling genetic mutations, it provides a lot of redundant information if you simply want to know the base ratio at specific coordinates.
Also, the information obtained can be difficult to interpret if you are not used to it.

Passing the output to bcftools query makes it somewhat easier to read, but it is still not intuitive.

bcftools mpileup \
--regions-file path_to_bed_file \
path_to_bam_file \
--fasta-ref path_to_reference_genome_fasta_file \
--annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
| bcftools query --format '%CHROM\t%POS\t%REF\t%ALT[\t%AD]\n' \
> out.txt
igvtools count

igvtools count

Since this data is visible in IGV, another method is to use igvtools, which is used for manipulating IGV data.
igvtools count allows you to aggregate base information for a specified window size, so setting the window size to 1 provides information for each individual base.
However, there was no way to specify a particular region, so it ends up aggregating all bases in the BAM file. This takes a long time and the output is massive, making it impractical for processing multiple samples.

[Reference]
https://software.broadinstitute.org/software/igv/igvtools
https://ncrna.jp/blog/item/65-igv-toolkit

igvtools count \
-w 1 \
--bases \
path_to_bam_file \
path_to_output_file \
path_to_reference_genome_fasta_file

[bam-readcount]

https://github.com/genome/bam-readcount

A tool to obtain measurement results on a per-base basis from BAM files.

Installation

The GitHub page introduces methods for cloning the repository to build from source code and using Docker.

Here, we will build it from source code and use it. If this doesn't work for you, please consider using the Docker method.

Clone the GitHub repository
git clone https://github.com/genome/bam-readcount 
Build from source code
# Change directory
cd bam-readcount

# Create and enter the build directory
mkdir build
cd build

# Compile the source files
cmake ..
make

The executable file is located at ~/bam-readcount/build/bin/bam-readcount. Since it is inconvenient to use as is, you should either add this to your PATH or create a symbolic link in a directory that is already in your PATH.
In my case, I have a tools folder for these kinds of tools and have added it to my PATH, so I created a symbolic link there.

ln -s ~/bam-readcount/build/bin/bam-readcount ~/tools/

If the path to bam-readcount is displayed by the which command, it was successful.

which bam-readcount

Execution

Run the base count for specific coordinates using the reference genome FASTA file, BAM file, and a query BED file.

The --site-list argument specifies the BED file. If you only want a single coordinate, you can set the values in the 2nd and 3rd columns to be the same.

BED file

Running bam-readcount
bam-readcount \
--reference-fasta path_to_reference_genome_fasta_file \
path_to_BAM_file \
--site-list path_to_BED_file \
--max-warnings 0  \ 
> out.tsv

Since the output is tab-delimited, it is a good idea to redirect it to a file with an extension like .txt or .tsv.
While there is a lot of information besides the base count, you can see that there are 49 reads spanning the 2,444,414th base of chromosome 1, with 43 being A and 6 being G.

out.tsv

How to interpret the results
https://kazumaxneo.hatenablog.com/entry/2018/07/05/230927

[Rsamtools]

Rsamtools allows you to manipulate BAM files in R.

https://bioconductor.org/packages/release/bioc/html/Rsamtools.html

# BiocManager::install("Rsamtools")
library(Rsamtools)

Registering a BAM file

First, register the BAM file you want to reference using the BamFile() function. If a .bai file exists in the same folder, it will be recognized automatically.

bamfile <- BamFile(file = "path_to_BAM_file")

Creating scan parameters

Use the scanBam() function to retrieve only the query information from this BAM file. Prepare the query using the ScanBamParam() function.
In ScanBamParam(), specify the target coordinates in the which= argument and the items of read information to be retrieved in the what= argument.

Coordinate information

Create the coordinate information to be specified in the which= argument using IRanges() and GRanges(). Chromosome numbers and base numbers are used, but since we are targeting only one base this time, the start and end of IRanges() use the same value.

chr <- "1"
pos <- 2444414

which <- GRanges(seqnames = chr, ranges = IRanges(start = pos, end = pos))


Contents of which

Measurement items

You can check the measurement items that can be specified in the what= argument using the scanBamWhat() function.


Item names that can be specified in what=

Here, we retrieve the mapping position information of the reads spanning the target coordinates and the base sequence information of the reads.

what <- c("pos", "seq")

scanBam()

Using the scan parameters defined above, retrieve the reads spanning the coordinates and their information.

reads <- scanBam(file = bamfile,
                 param = ScanBamParam(which = which, what = what)
                 )

Information for 49 reads spanning 1:2444414-2444414 is obtained as a list. pos contains the 5' coordinate of each read, and seq contains the base sequence of the read spanning the coordinate.

Contents of reads

The reads are stored in ascending order of their 5' coordinates.

The storage order of reads as seen in IGV

Extracting bases at the target coordinate

Extract only the base information at the target coordinate from the seq in the reads list.

# Calculate which position in each read corresponds to the target coordinate
correct_pos <- pos - reads$`1:2444414-2444414`$pos + 1

# Extract only the sequence information
sequences <- reads$`1:2444414-2444414`$seq

# Vector to store results
base <- c()
for( i in seq(length(sequences))){
  # Vectorize the sequence
  sequence <- as.vector(sequences[i])
  # Split into individual characters as separate elements
  sequence <- unlist(strsplit(as.vector(sequences[i]), ""))
  # Get the element at the target coordinate and append to base
  base <- c(base, sequence[correct_pos[i]])
}

Discussion