SAMtools is a set of utilities for processing short sequenced DNA fragments (reads) stored in the SAM or BAM format. SAMtools was written by the Chinese bioinformatician Heng Li, who is also the author of the SAM and BAM format specifications. Currently, the lead developers of SAMtools are Petr Daněček [3] [4] and John Marshall [5].
| SAMtools | |
|---|---|
| Type | Bioinformatics |
| Author | Heng Li |
| Developers | John Marshall, Petr Daněček |
| Written in | C [1] |
| Operating system | Unix |
| Initial release | 2009 |
| Latest version | 1.9 (2018-07-19) |
| Readable File Formats | SAM, BAM, and CRAM |
| Generated File Formats | |
| License | MIT |
| Website | htslib.org |
Creation Background
With the advent of new sequencing technologies such as Illumina/Solexa, AB/SOLiD, and Roche/454, many new alignment tools were developed to map reads efficiently onto large reference sequences, including the human genome. However, these tools produce alignments in different formats, which complicates downstream processing. A common alignment format that supports all sequence types and aligners provides a well-defined interface between alignment and downstream analyses, including variant detection, genotyping, and genome assembly [6].
The Sequence Alignment/Map (SAM) format was designed to achieve this. It supports single-end and paired-end reads and can combine reads of different types, including reads in the AB/SOLiD color space. The format is intended to scale to alignment sets of $10^{11}$ or more base pairs, which is typical for the deep sequencing of one individual [6].
SAMtools is designed specifically for working with alignments in the SAM/BAM format. It can convert from other alignment formats, sort and merge alignments, remove PCR duplicates, generate per-position information in the pileup format, call SNPs and short indel variants, and display alignments in a text-based viewer [6].
Principle of Operation
SAMtools is designed to work with data streams. Each program is invoked as a separate command, can take its input from the standard input stream (stdin), and writes its result to the standard output stream (stdout). Warnings and error messages are written to the standard error stream (stderr). SAMtools commands can therefore be combined into pipelines with other Unix commands [7].
By default, the output is printed to the terminal. Because this output can be large and hard to read, it is usually redirected to a file (> and >>) or passed to the next command in a pipeline (|) [8].
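The stream-based design described above can be sketched in Python with the standard `subprocess` module. This is only an illustration of how stdout of one stage is connected to stdin of the next; the helper name `run_pipeline` is hypothetical, and the two stages used here are stand-in Python one-liners rather than real SAMtools commands, so that the sketch runs without SAMtools installed.

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run commands as a pipeline: each stdout feeds the next command's stdin.

    Returns the bytes written to stdout by the last command.
    """
    processes = []
    upstream = None  # stdout of the previous stage; None means "inherit stdin"
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if upstream is not None:
            # Close the parent's copy so the pipe is owned only by the two stages.
            upstream.close()
        upstream = proc.stdout
        processes.append(proc)

    # Read everything the last stage prints, then wait for the earlier stages.
    output, _ = processes[-1].communicate()
    for proc in processes[:-1]:
        proc.wait()
    for argv, proc in zip(commands, processes):
        if proc.returncode != 0:
            raise subprocess.CalledProcessError(proc.returncode, argv)
    return output


# Stand-in stages: the first prints two lines, the second sorts its stdin.
emit = [sys.executable, "-c", "print('chr2'); print('chr1')"]
sort = [sys.executable, "-c",
        "import sys; sys.stdout.writelines(sorted(sys.stdin))"]
result = run_pipeline([emit, sort])
```

With SAMtools installed, the same helper could in principle be given SAMtools stages instead, for example `["samtools", "view", "-bS", "sample.sam"]` followed by `["samtools", "sort", "-"]`; these command lines are an assumption for illustration and should be checked against the installed version.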
SAMtools can also open remote BAM files (but not SAM files) over FTP or HTTP [7].
SAMtools is written in C and can be used as a library through its API. Wrappers exist for other programming languages:
- pysam for Python [9],
- Bio-samtools for Ruby [10],
- Bio-SamTools for Perl [11],
- samtools for Haskell [12].
Independent programs for working with the SAM and BAM formats have also been written in other languages:
- BamTools for C++ [13],
- Picard for Java [14],
- cl-sam for Common Lisp [15].
SAM, BAM, and CRAM Formats
The BAM (Binary Alignment Map) format is the binary equivalent of SAM. BAM files take up less space and can be processed faster than SAM files, but only SAM files can be read as plain text. SAMtools makes it possible to work with BAM efficiently and to extract the required information in human-readable form [7] [16].
CRAM files are even more space-efficient than BAM files. A CRAM file stores the differences between the reads and the reference sequence, so working with it requires a file containing the reference genome. The format specification [17] was developed at the European Bioinformatics Institute. SAMtools can convert between the SAM, BAM, and CRAM formats [7].
The SAM (Sequence Alignment Map) format is a text format for storing biological sequences aligned to a reference sequence, also called simply the reference. It is widely used for storing data such as nucleotide sequence fragments (reads) produced by next-generation sequencing technologies. Most often, a SAM file is produced by mapping reads from a FASTQ file onto a reference genome sequence. The format supports short and long reads (up to 128 Mbp) and may contain one or more alignments. One alignment consists of several lines, each of which describes the alignment of one fragment [16].
A SAM file may contain a header, whose lines always begin with the "@" character followed by a two-letter header record type code. Each header line is tab-delimited and, except for @CO lines, each data field follows the format TAG:VALUE, where TAG is a two-character string that defines the format and content of VALUE. The header record types that can be used in a file are briefly described below [16].
| Header type | Description |
|---|---|
| @HD | Header line. If present, it must be the first line. It must contain the format version and may contain information about how the alignments in the file are sorted and grouped. |
| @SQ | Reference sequence dictionary. The order of the @SQ lines defines the alignment sort order. It must contain the name of the reference sequence and its length. It may also include alternative sequence names, a description, sequence properties, and so on. |
| @RG | Read group. Multiple unordered @RG lines are allowed. It must contain a unique read group identifier. It may contain a description, the date the group was produced, the programs used to produce it, the name of the sequenced sample, etc. |
| @PG | Program. It must contain a unique program record identifier. It may contain the program name, the command line, a description, and the program version. |
| @CO | One-line text comment. Multiple unordered @CO lines are allowed. |
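The header layout described above (a record type after "@", then tab-separated TAG:VALUE fields, with @CO holding free text) can be sketched as a small parser. The function name `parse_header_line` and the sample header are made up for illustration; the tags VN, SO, SN, and LN are taken from the SAM specification rather than from this article.

```python
def parse_header_line(line):
    """Split one SAM header line into its record type and a dict of fields."""
    if not line.startswith("@"):
        raise ValueError("header lines must start with '@'")
    parts = line.rstrip("\n").split("\t")
    record_type = parts[0][1:]  # e.g. "HD", "SQ", "RG", "PG", "CO"
    if record_type == "CO":
        # Comment lines carry free text rather than TAG:VALUE fields.
        return record_type, {"text": "\t".join(parts[1:])}
    fields = {}
    for item in parts[1:]:
        # Split at the first colon only, because values may contain colons.
        tag, _, value = item.partition(":")
        fields[tag] = value
    return record_type, fields


# Hypothetical three-line header used only to exercise the parser.
sample_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:ref\tLN:45\n"
    "@CO\tillustrative comment\n"
)
parsed_header = [parse_header_line(line) for line in sample_header.splitlines()]
```

Values are kept as strings here; a fuller parser would validate mandatory tags per record type, which this sketch deliberately leaves out.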
Below the header is the alignment section. Each alignment line has 11 mandatory fields containing information such as the alignment position and quality, the read orientation, mate information, etc. In addition, any number of optional fields in the format TAG:TYPE:VALUE may follow [16] [18].
The SAM format specification [16] can be found in the SAMtools repository [19] or in the official format documentation [16]. Below is an excerpt of an alignment in SAM format, followed by a description of the fields [16].
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
| Field number | Field Name [20] | Comment [20] |
|---|---|---|
| 1 | Read name (QNAME) | Reads that originate from the same template, for example the two reads of a pair, share the same name. |
| 2 | Flag (FLAG) | A combination of bitwise flags indicating pairing, mapping status, etc. For example, flag 99 (0x63 in hexadecimal) combines bits 0x1, 0x2, 0x20, and 0x40, which together mean: the read is paired, both reads of the pair are aligned as a proper pair, the mate is in the reverse orientation, and this is the first read of the pair. |
| 3 | Reference name (RNAME) | The reference name must be present in one of the header lines starting with @SQ, if such lines are present in the file. |
| 4 | Leftmost alignment position (POS) | The reference position that corresponds to the start of the read alignment. Reference positions are numbered from 1. For an unmapped read, the value of this field is usually zero. (The format specification nevertheless permits a nonzero value for unmapped reads, which can be useful, for example, when sorting reads by coordinate.) |
| 5 | Mapping quality (MAPQ) | By definition, MAPQ is $-10 \log_{10} P$ rounded to the nearest integer, where $P$ is the probability that the alignment is incorrect. Thus MAPQ = 30 corresponds to $P = 0.001$. |
| 6 | CIGAR | A description of the alignment as a sequence of operations (match, insertion, etc.). For example, the string 8M2I4M1D3M means: 8 positions aligned to the reference, 2 inserted bases, 4 aligned positions, 1 deleted base, 3 aligned positions. |
| 7 | Mate reference name (RNEXT) | The reference name of the mate must be present in one of the header lines starting with @SQ, if such lines are present in the file. If the mate's reference is the same as that of the given read, the field value is =; if no information about the mate's reference is available, the value is * (for example, when the read is single-end). |
| 8 | Mate position (PNEXT) | The reference position that corresponds to the start of the mate's alignment. Analogous to field 4. |
| 9 | Template length (TLEN) | The distance on the reference between the outermost alignment points. For a paired alignment, the value is positive for the leftmost read (by reference position) and negative for the rightmost. For single-end reads, the value is zero. |
| 10 | Read sequence (SEQ) | Letter case is not significant. |
| 11 | Quality (QUAL) | Base qualities encoded in the Phred+33 format. If the information is not available, * is given. |
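As a rough sketch of how the 11 mandatory fields above can be read programmatically, the following Python code splits the first example record into named fields, breaks its CIGAR string into operations, and converts MAPQ back into an error probability. The names `Alignment`, `parse_alignment`, `cigar_operations`, `reference_span`, and `mapq_to_error_probability` are hypothetical, the field names follow the SAM specification, and the example assumes tab-separated input.

```python
import re
from typing import NamedTuple


class Alignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: tuple


def parse_alignment(line):
    """Parse one tab-separated SAM alignment line into an Alignment."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("an alignment line needs 11 mandatory fields")
    return Alignment(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]), cols[5],
        cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10],
        tuple(cols[11:]),
    )


def cigar_operations(cigar):
    """Turn '8M2I4M1D3M' into [(8, 'M'), (2, 'I'), ...]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered (operations M, D, N, = and X)."""
    return sum(n for n, op in cigar_operations(cigar) if op in "MDN=X")


def mapq_to_error_probability(mapq):
    """Invert MAPQ = -10 * log10(P)."""
    return 10 ** (-mapq / 10)


record = parse_alignment(
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
)
operations = cigar_operations(record.cigar)
span = reference_span(record.cigar)
error_probability = mapq_to_error_probability(record.mapq)
```

For this record the read covers 16 reference bases starting at position 7, and the bases consumed from the read (M and I operations) add up to the 17 letters of the sequence field.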
The flag field is a combination of bitwise flags. If a read has a mate, the pairing-related bits of the mate's flag, and hence information about the mate, can be reconstructed from this combination. The complete list of flags and their values is given in the table below [21] [20].
| Flag (decimal ≡ hexadecimal) | Description [20] |
|---|---|
| 1 ≡ 0x1 | Read is paired |
| 2 ≡ 0x2 | Read is mapped in a proper pair |
| 4 ≡ 0x4 | Read is unmapped |
| 8 ≡ 0x8 | Mate is unmapped |
| 16 ≡ 0x10 | Read is in the reverse orientation |
| 32 ≡ 0x20 | Mate is in the reverse orientation |
| 64 ≡ 0x40 | First read in the pair |
| 128 ≡ 0x80 | Second read in the pair |
| 256 ≡ 0x100 | Alignment is not primary |
| 512 ≡ 0x200 | Read failed quality control |
| 1024 ≡ 0x400 | Read is an optical or PCR duplicate |
| 2048 ≡ 0x800 | Alignment is supplementary |
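The flag table above maps directly onto a bit test. A minimal sketch, with the hypothetical names `FLAG_BITS` and `decode_flag`, lists the meanings of every bit that is set in a given flag value:

```python
# Bit values and short descriptions, in ascending bit order.
FLAG_BITS = {
    0x1: "read is paired",
    0x2: "read mapped in a proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read in reverse orientation",
    0x20: "mate in reverse orientation",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read failed quality control",
    0x400: "optical or PCR duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits set in a SAM flag value."""
    return [text for bit, text in FLAG_BITS.items() if flag & bit]


first_read = decode_flag(99)    # 0x1 + 0x2 + 0x20 + 0x40
second_read = decode_flag(147)  # 0x1 + 0x2 + 0x10 + 0x80
```

The two values decoded here are the flags of the two example records shown earlier (99 and 147), which describe the two reads of one properly aligned pair.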
The most common flag values can be grouped by their meaning [22]:
- one of the reads in a pair is unmapped: 73, 133, 89, 121, 165, 181, 101, 117, 153, 185, 69, 137;
- both reads are unmapped: 77, 141;
- reads are mapped within the insert size and in the correct orientation: 99, 147, 83, 163;
- reads are mapped within the insert size but in the wrong orientation: 67, 131, 115, 179;
- reads are uniquely mapped but with an incorrect insert size: 81, 161, 97, 145, 65, 129, 113, 177.
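The pairs of values in the groups above (for example 99 and 147, or 73 and 133) are related by swapping the "read" and "mate" bits. The sketch below, using the hypothetical function `mate_flag`, reconstructs the mate's flag from a read's flag under the assumption that only the pairing-related bits 0x1 to 0x80 matter; bits such as 0x100, 0x200, 0x400, and 0x800 describe an individual record and cannot be derived this way.

```python
PAIRED = 0x1
PROPER_PAIR = 0x2
# Each tuple holds a bit about this read and the matching bit about its mate.
SWAPPED_BITS = [
    (0x4, 0x8),    # read unmapped  <-> mate unmapped
    (0x10, 0x20),  # read reversed  <-> mate reversed
    (0x40, 0x80),  # first in pair  <-> second in pair
]


def mate_flag(flag):
    """Reconstruct the pairing-related bits of the mate's flag."""
    if not flag & PAIRED:
        raise ValueError("flag does not describe a paired read")
    # Bits shared by both reads of a pair are copied unchanged.
    mate = flag & (PAIRED | PROPER_PAIR)
    for own_bit, mate_bit in SWAPPED_BITS:
        if flag & own_bit:
            mate |= mate_bit
        if flag & mate_bit:
            mate |= own_bit
    return mate


proper_pair_mate = mate_flag(99)
unmapped_mate = mate_flag(73)
```

Applying the function twice returns the original pairing bits, which is a quick consistency check on any flag taken from the list above.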
Optional fields must follow the format TAG:TYPE:VALUE, where TAG is a two-character tag. For example, NH:i:1 gives the number of alignments in the file for the given read as an integer value equal to one [18]. Some other common tags [20]:
- AS: alignment score calculated by the aligner;
- NM: edit distance between the read and the reference;
- MD: string describing mismatching positions; for example, 10A5^AC6 means 10 matches with the reference → an A in the reference that differs from the nucleotide at the corresponding read position → 5 matches → a deletion (absent from the read) of two nucleotides, AC → 6 matches;
- CC: reference name of the "next" alignment ("hit"), for reads with non-unique alignments;
- CP: leftmost coordinate of the "next" alignment ("hit");
- HI: index of the alignment ("hit") for this read.
Optional fields whose tags begin with X, Y, or Z are reserved for use by individual programs and by end users. Such fields are often generated by BWA, and the most common of these tags can be found in the BWA documentation [20] as well as in the specification of optional SAM fields [18].
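To make the TAG:TYPE:VALUE layout and the MD example above concrete, here is a small Python sketch. The helper names `parse_optional_field` and `parse_md` are hypothetical, only the `i` (integer) and `f` (float) types are converted, and the MD parser assumes a well-formed string.

```python
import re


def parse_optional_field(field):
    """Split 'NM:i:1' into ('NM', 'i', 1); other types stay as strings."""
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        value = int(value)
    elif type_code == "f":
        value = float(value)
    return tag, type_code, value


def parse_md(md):
    """Break an MD string into match, mismatch and deletion events."""
    events = []
    # A number is a run of matches, '^' plus letters is a deletion from the
    # read, and a single letter is a reference base that the read mismatches.
    for token in re.findall(r"\d+|\^[A-Za-z]+|[A-Za-z]", md):
        if token.isdigit():
            if int(token) > 0:  # '0' only separates adjacent events
                events.append(("match", int(token)))
        elif token.startswith("^"):
            events.append(("deletion", token[1:]))
        else:
            events.append(("mismatch", token))
    return events


nm_field = parse_optional_field("NM:i:1")
md_field = parse_optional_field("MD:Z:10A5^AC6")
md_events = parse_md(md_field[2])
```

The result for `10A5^AC6` reproduces the reading given in the list above: 10 matches, a mismatch against reference base A, 5 matches, a deletion of AC, and 6 matches.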
SAMtools Commands
Commands are invoked in the form "samtools command_name", followed by the options and the required files (unless the input is supplied through a pipeline). For example, the command that converts a SAM file to BAM format is samtools view -bS sample.sam > sample.bam, where view is the command, -bS are the options, and sample.sam and sample.bam are the files in the respective formats [7].
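The same conversion can be driven from a script. The sketch below assumes that a `samtools` executable is available on the PATH and uses the hypothetical helper names `build_view_command` and `convert_sam_to_bam`; the redirection `> sample.bam` is reproduced by passing an open file as stdout.

```python
import subprocess


def build_view_command(sam_path):
    """Argument list equivalent to: samtools view -bS <sam_path>"""
    return ["samtools", "view", "-bS", sam_path]


def convert_sam_to_bam(sam_path, bam_path):
    """Run the conversion and write the BAM data to bam_path.

    Requires SAMtools to be installed; raises CalledProcessError on failure.
    """
    with open(bam_path, "wb") as bam_file:
        subprocess.run(build_view_command(sam_path), stdout=bam_file, check=True)


command = build_view_command("sample.sam")
```

Calling `convert_sam_to_bam("sample.sam", "sample.bam")` would then perform the conversion; it is not executed here because it depends on SAMtools and on an input file being present.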
The complete list of SAMtools commands is given in the official documentation [7].
Notes
- [1] The samtools Open Source Project on Open Hub: Languages Page. Retrieved May 4, 2019.
- [2] SAMtools manual pages.
- [3] Petr Danecek. Retrieved April 28, 2015. Archived August 20, 2017.
- [4] Regular Wednesday IMG seminar: Petr Daněček, Ph.D. Retrieved May 2, 2019.
- [5] John Marshall. Retrieved April 28, 2015. Archived June 10, 2017.
- [6] Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, August 15; vol. 25, no. 16, pp. 2078-2079. DOI: 10.1093/bioinformatics/btp352. PMID 19505943.
- [7] Official documentation of SAMtools (version 1.9). Retrieved April 21, 2019. Archived April 11, 2019.
- [8] Geradot. Output to a Bash file in Linux. Note: Samtools works from the command line.
- [9] Library for Python. Retrieved April 29, 2015. Archived August 3, 2015.
- [10] Library for Ruby. Retrieved May 1, 2019. Archived May 1, 2019.
- [11] Library for Perl. Retrieved May 1, 2019. Archived April 21, 2019.
- [12] Library for Haskell. Retrieved April 29, 2015. Archived March 31, 2015.
- [13] BamTools. Retrieved April 29, 2015. Archived August 2, 2015.
- [14] Picard. Retrieved September 29, 2017. Archived September 29, 2017.
- [15] cl-sam. Retrieved April 29, 2015. Archived July 17, 2017.
- [16] Specification of SAM/BAM formats. Retrieved April 28, 2015. Archived April 6, 2017.
- [17] CRAM format specification. Retrieved April 29, 2015. Archived April 27, 2015.
- [18] Specification of additional SAM format fields. Retrieved April 27, 2019. Archived March 28, 2019.
- [19] SAMtools repository. Retrieved April 28, 2015. Archived April 28, 2015.
- [20] BWA documentation. bio-bwa.sourceforge.net. Archived April 5, 2017.
- [21] SAM Format. www.samformat.info. Archived April 27, 2019.
- [22] SAMtool bitwise flag meaning explained: how to understand samflags without pains. A Pillow Diary of an Expatriate Scientist (August 25, 2010). Archived April 21, 2019.
- [23] SAMtools Official Documentation (version 1.2). Retrieved April 28, 2015. Archived March 26, 2015.
- [24] VCF/BCF format specification. Retrieved April 27, 2019. Archived March 28, 2019.
- [25] SAMtools NERC Environmental Bioinformatics Center documentation. Retrieved April 27, 2019. Archived April 27, 2019.
Literature
- Li H. Improving SNP discovery by base alignment quality. Bioinformatics. 2011, April 15; vol. 27, no. 8, pp. 1157-1158. DOI: 10.1093/bioinformatics/btr076. PMID 21320865.
- Ramirez-Gonzalez R. H., Bonnal R., Caccamo M., Maclean D. Bio-samtools: Ruby bindings for SAMtools, a library for accessing BAM files containing high-throughput sequence alignments. Source Code for Biology and Medicine. 2012, May 28; vol. 7, no. 1, p. 6. DOI: 10.1186/1751-0473-7-6. PMID 22640879.