SAMtools

SAMtools
SAMtools
Type of	Bioinformatics
Author	Han lee
Developers	John Marshall, Peter Danechek
Written on	C
operating system	Unix
First edition	2009
Latest version	1.9 (2018-07-19)
Readable File Formats	, and
Generated File Formats
License	MIT
Website	htslib.org

SAMtools is a set of utilities for processing short fragments of sequenced DNA in SAM or BAM formats. SAMtools is written by Chinese bioinformatics Heng Li , who is also the author of the specifications for SAM and BAM formats. Currently, the leading developers of SAMtools are Petr Daněček ( Czech Petr Daněček ) ^[3] ^[4] and John Marshall ( English John Marshall ) ^[5] .

Content

1 Background
2 Principle of work
3 SAM, BAM, and CRAM Formats
4 SAMtools Commands
5 notes
6 Literature

Creation Background

With the advent of new sequencing technologies such as Illumina / Solexa , AB / SOLiD, and Roche / 454 , many new alignment tools have been developed to implement effective mapping of readings into large reference sequences, including the human genome . However, these tools generate alignments in different formats, which complicates the subsequent processing. The general alignment format, which supports all types of sequences and tools for their alignment, creates a clearly defined transition from alignment to the subsequent analysis, including mutation search, genotyping, and genome assembly ^[6] .

The Sequence Alignment / Map (SAM) format is designed to achieve this. It supports single and double reads, and can also combine readings of various types, including readings in the AB / SOLiD color space. This format is intended for alignment of sets of 10 ¹¹ or more base pairs, which is typical for deep sequencing of one person ^[6] .

SAMtools is specifically designed to handle alignment in the SAM / BAM format. It can convert from other alignment formats, sort and merge alignments, remove duplicate PCRs, generate information on positions in the pileup format , call SNPs and short indels options, as well as display alignments in the text viewer ^[6] .

Principle of Operation

SAMtools is designed to work with data flow . Each program is called by a separate command, takes the input file through the standard input stream (stdin) and returns the result through the standard output stream (stdout). Warnings and error messages are output to the standard error stream (stderr). The samtools commands can be combined into pipelines with other Unix commands ^[7] .

By default, the output stream is directed to the screen. Since it can be cumbersome and complex, the output is redirected to a file (> and >>) or to the next command in the pipeline (|) ^[8] .

SAMtools can also open BAM files (but not SAM!) Via FTP or HTTP ^[7] .

SAMtools is written in C and can be used through the API . There are wrappers for other programming languages :

pysam for Python ^[9] ,
Bio-samtools for Ruby ^[10] ,
Bio-SamTools for Perl ^[11] ,
samtools for Haskell ^[12] .

It is worth noting that there are independent programs for working with SAM and BAM formats written in other languages:

BamTools for C ++ ^[13] ,
Picard for Java ^[14] ,
cl-sam for Common Lisp ^[15] .

SAM, BAM, and CRAM Formats

The BAM format ( Binary Alignment Map ) is the binary equivalent of SAM. BAM takes up less space and allows you to work with information faster than SAM. However, only SAM files are readable as text files . SAMtools allows you to effectively work with the BAM format and extract the necessary information in a human-readable format ^[7] ^[16] .

CRAM files are even more efficient in terms of disk space than BAM files. The CRAM file stores the differences in reading from the reference sequence , therefore, to work with it, you must have a file with a reference genome. The specification ^{[17] of the} format was developed at the European Institute of Bioinformatics . SAMtools allows you to convert between formats SAM, BAM and CRAM ^[7] .

The SAM format ( Sequence Alignment Map ) is a text format for storing biological sequences aligned with a reference sequence, also called reference . This format is widely used for storing data such as fragments of nucleotide sequences (otherwise called reads, reads or reads) obtained using a new generation of sequencing technology. Most often, SAM is obtained by mapping reads from a FASTQ file on the sequence of the reference genome . The format supports short and long reads (up to 128 Mbp) and may include one or more alignments. One alignment consists of several lines, each of which is the alignment of one fragment ^[16] .

A SAM file may contain a heading, the lines of which always begin with the “@” character, followed by one of the two-letter heading type codes. In the header, each line is separated by a tab character, and, in addition to the @CO lines, each data field corresponds to the tag format : value , where the tag is a two-character string that defines the format and content of the value . Below are briefly described the types of header that can be used in the file ^[16] .

Header type	Description
@HD	Title bar. This is the first line, if present. It must contain information about the version of the format, may contain information about sorting the alignments in the file (if there are several) and their grouping.
@SQ	Dictionary of reference sequences. The @SQ line order determines the alignment sort order. Necessarily contains the name of the reference sequence and its length. It may also include alternative sequence names, description, sequence features, and so on.
@RG	Group readings. Several unordered @RG lines are allowed. Must contain a unique group identifier. It may contain a description, the date of receipt of the group and the programs used for this, the name of the sample during sequencing, etc.
@PG	The program used at startup. Mandatory contains a unique identifier for the program entry. It may contain the name of the program, command text, description and version of the program.
@CO	One line text comment. Several unordered @CO lines are allowed.

Under the heading is the alignment section. It has 11 required fields containing information such as position and quality of alignment, direction of reading, indication of pair reading, etc. In addition, a number of optional fields are possible in the form of a tag: type: value ^[16] ^[18] .

The specification ^{[16] of} the SAM format can be found in the SAMtools repository ^[19] or in the official documentation of the format ^[16] . Below is a part of the alignment in SAM format with a description of the fields ^[16] .

  r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
 r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM: i: 1

Field number	Field Name ^[20]	Comment ^[20]
one	Reed Name	The same names are given to reads read on the same matrix, for example, pair reads.
2	Flag	It is a combination of several binary flags indicating pairing, mapping, etc. For example, flag 99 (in the hexadecimal system - 0x63) is a combination of bits 0x1, 0x2, 0x20 and 0x40, which together gives the following reading information: this is the first read from a pair, for each read of a pair there is an alignment, the pair of this read is in the reverse orientation.
3	Reference Name	The reference name should be present in one of the header lines starting with @SQ, if such lines are present in the file.
four	Leftmost alignment position	The position in the reference, which corresponds to the beginning of reading. Positions in the reference are numbered from 1. For an unmapped read, the value of this field is usually zero. (The format specification, however, assumes a nonzero value for unmapped reads, which can be useful, for example, when sorting reads by coordinate.)
5	Mapq	By definition, the MAPQ value is −10 log ₁₀ P, rounded to the nearest integer, where P is the probability that the alignment is incorrect. So, if MAPQ = 30, then P = 0.001.
6	Cigar	Description of alignment, in the record of which a set of operations is used (coincidence, insertion, etc.). For example, the line 8M2I4M1D3M means: 8 matches with reference, 2 insertions , 4 matches, 1 deletion , 3 matches.
7	The name of the pair reference	The reference name for the pair must be present in one of the header lines starting with @SQ, if such lines are present in the file. If the reference of the pair coincides with the reference of the given read, then the field value is =, and in the absence of information about the reference of the pair - * (for example, reading can be single).
8	Start pairing	The position in the reference, which corresponds to the beginning of the pair. Similar to the fourth field.
9	Distance between extreme alignment points	The maximum reference distance. Moreover, in the case of pair alignment, this value is positive for the left (by reference) read and negative for the right. For single reads, the value is zero.
10	Reed sequence	The case of letters does not matter.
eleven	Quality	It can be recorded in Phred + 33 format. If information is not available, the * sign is indicated.

The flag field is a combination of several binary flags. If the reading has a pair, then by this combination you can uniquely restore the flag of a coupled reading with it and, as a result, information about it. A complete list of flags with values is presented in the table below ^[21] ^[20] .

Flag	Description ^[20]
1 ₁₀ ≡ 1 ₁₆	Reading has a couple
2 ₁₀ ≡ 2 ₁₆	Reading mapped to the correct pair
4 ₁₀ ≡ 4 ₁₆	Reading is not mapped
8 ₁₀ ≡ 8 ₁₆	Reading couple not mapped
16 ₁₀ ≡ 10 ₁₆	Reverse reading
32 ₁₀ ≡ 20 ₁₆	Reverse orientation reading couple
64 ₁₀ ≡ 40 ₁₆	First couple reading
128 ₁₀ ≡ 80 ₁₆	Second reading in pair
256 ₁₀ ≡ 100 ₁₆	Alignment is not primary
512 ₁₀ ≡ 200 ₁₆	Reading failed quality control
1,024 ₁₀ ≡ 400 ₁₆	Reading is an optical or PCR duplicate
2048 ₁₀ ≡ 800 ₁₆	This alignment is optional.

Thus, the most common flags are grouped according to the main values ^[22] :

one of the readings in a pair is unmapped: 73, 133, 89, 121, 165, 181, 101, 117, 153, 185, 69, 137;
both reads are unmapped: 77, 141;
readings are mapped within the insert size and in the correct orientation: 99, 147, 83, 163;
readings are mapped within the insert size, but in the wrong orientation: 67, 131, 115, 179;
readings are unambiguously mapped, but with an incorrect insert size: 81, 161, 97, 145, 65, 129, 113, 177.

Optional fields must match the format of a two-letter tag: type: value . For example, NH:i:1 indicates the number of alignments in the file for a given read as an integer value equal to one ^[18] . Some other common tags ^[20] :

AS - alignment weight (score) calculated by the charting program;
NM is the editorial distance from reading to reference;
MD - a line with information about unaligned positions; for example, 10A5^AC6' means 10 matches with the reference → A in the reference, different from the nucleotide in the corresponding reading position → 5 matches → deletion (lack of reading) of two nucleotides - AC → 6 matches ;
CC - the name of the reference for the "next" alignment ("hit") - for the case of non-unique alignment;
CP - coordinate of the leftmost position for the “next” alignment (“hit”);
HI is the alignment index (“hit”) for this reading.

Optional fields whose tags begin with X, Y or Z are reserved for use by various programs and directly by users. Often these fields are generated using BWAtools , and the most common such tags can be found in the BWAtools specification ^[20] , as well as in the specification of additional SAM format fields ^[18] .

SAMtools Commands

Calling commands is in the form of "samtools command_name". Next, the call options and the necessary files are indicated (if the file was not transferred via the pipeline). An example is the command that converts a SAM file to BAM format: samtools view -bS sample.sam > sample.bam , where view is a command, -bS are options, and sample.sam and sample.bam specify files in the corresponding formats ^[7] .

The list of SAMtools commands is presented below.

^[7]
bedcov	The bedcov team reports the total number of read bases for each area of the genome indicated in the attached BED file.
cat	cat allows you to quickly combine BAM and CRAM files, the order of the entries in the input files is preserved.
calmd	The calmd command complements the entries in the input file with the MD tag, which contains information about polymorphisms and indels ^[20] ^[18] .
depth	Calculates the depth of each position or region.
dict	Creates a sequence dictionary from a fasta file.
faidx	faidx indexes the input FASTA file and creates the fasta.fai index file. The command can be used both for complete sequence and for a given region. All sequences in the input file must have different names, otherwise indexing will give a warning about duplicate sequences and upon extraction only subsequences from the first sequence with a duplicate name will be obtained. For example, if the header of the SAM file does not indicate the sequence in which alignment was performed (@SQ field), you can supplement the file with this information. Interestingly, when you call the view command with the -T option and specify an unindexed fastq file, the corresponding index will be generated automatically.
fastq / fasta	Converts a BAM or CRAM file to a FASTQ or FASTA format file (depending on which command was used). Used since version SAMtools 1.3, before that, the bam2fq command ^{[23] was} used to convert a BAM file to FASTQ format.
fixmate	fixmate allows you to supplement a file with paired reads: it fills in the coordinates of the corresponding pairs, adds the ISIZE tag (responsible for the distance between two pairs) and other tags responsible for information about pairs. Using the -r option, you can delete pairs with secondary and unmapped reads.
flags	flags converts mapping flags from numeric to text representation and vice versa.
flagstat	Gives statistics on 13 categories for the file submitted to the input of the command, working with the FLAG field. For each category, the number of past and non-passing QC (quality control) reads is written to the output file. The total number of reads that have passed and have not passed QC is recorded in the first line of the output file. If no output file is specified, information is returned to stdout.
fqidx	The fqidx command indexes the input FASTQ file (or extracts a subsequence from it) and creates the fasta.fai index file. The command can be used both for complete sequence and for a given region. Used only for files with a small number of entries. Using this command in a file containing millions of short reads will cause the index to become as large as the original file, and searching using the index will be very slow and require a lot of memory.
idxstats	idxstats displays a short annotation for the input BAM file in the format of values, separated by tabs ( TSV ). The annotation contains information about the names of reference sequences, their lengths, the number of mapped and unmapped reads. It is important that to use this command, the input file must be sorted and indexed using the SAMtools index command.

^[7]
index	The index command generates an index file based on a sorted BAM or CRAM file. The index file has the format .cram.crai for CRAM and .bam.bai or .bam.csi for BAM. Index allows view and tview commands to work faster if you need to extract a region from an associated file. Some SAMtools commands require mandatory indexing of files (e.g. idxstats).
merge	merge combines several sorted BAM files into one output file, also sorted. If the headers of the source files are different, merge forms a new, generalized one. If the files contain entries with the same ID, by default a suffix is added to them, allowing them to be distinguished.
mpileup	mpileup accepts one or more BAM files as input and generates pileup , en: VCF or its binary BCF format. They are designed to store or search for single nucleotide polymorphisms and indels (inserts or deletions). In the pileup format, each line represents one genomic position and contains information about its coverage, nucleotide composition, quality of alignment, and other characteristics. In the VCF (or BCF) format, information is stored only on positions that are changed compared to the reference genome ^[24] .
phase	The phase command allows you to find and phase heterozygous single nucleotide polymorphisms .
sort	sort sorts reads in the input file, using the -o option writes the result to a file. By default, sorting is done by the left coordinate of the reads (that is, by the first coordinate of reads on the forward chain and by the second coordinate of reads on the reverse chain). Using the -n option, you can sort the mappings by name. Sorting requires a lot of memory; therefore, to save resources, the program writes intermediate BAM files. By default, the maximum memory per execution thread is 768 MiB. Using the -m option, you can change this threshold, and using - @ you can increase the number of threads. Currently, when using the command, the prefix for temporary files (-T) and the output file (-o) are explicitly specified, however, an outdated form of use is still common. In it, after the name of the input file, the prefix for the output file and temporary files must be indicated. For example, a command accepts an unsorted_in.bam file, writes intermediate files of the form sorted_out.% D.out (% d indicates the number of a temporary file) ^[25], and writes the resulting file sorted_out.bam.
split	Splits a file by reading group.
stats	Collects statistics from BAM files and provides the output file in text format. The output can be visualized graphically using plot-bamstats.
reheader	The reheader command is used to rename the input BAM header to the header of another SAM input file. Such a command works faster than converting BAM → SAM → BAM.
tview	tview invokes an interactive console text viewer . It allows you to view the alignment of the mappings on a small fragment of the reference genome. Use the 'g' key to move to different alignment positions. '?' calls the assistant listing possible hot keys.
view	The view command allows you to perform various operations with the SAM, BAM and CRAM formats. For example, you can convert a file from one format to another, filter by region, quality or tag, view contents via stdout.

Notes

↑ The samtools Open Source Project on Open Hub: Languages Page . Circulation date May 4, 2019.
↑ ¹ ² ³ SAMtools Manual pages
↑ Petr Danecek (English) . Date of treatment April 28, 2015. Archived on August 20, 2017.
↑ Regular Wednesday IMG seminar Petr Daněček, Ph.D. (eng.) . Circulation date May 2, 2019.
↑ John Marshall . Date of treatment April 28, 2015. Archived June 10, 2017.
↑ ¹ ² ³ Li H. , Handsaker B. , Wysoker A. , Fennell T. , Ruan J. , Homer N. , Marth G. , Abecasis G. , Durbin R. , 1000 Genome Project Data Processing Subgroup. The Sequence Alignment / Map format and SAMtools. (English) // Bioinformatics. - 2009 .-- 15 August ( vol. 25 , no. 16 ). - P. 2078-2079 . - DOI : 10.1093 / bioinformatics / btp352 . - PMID 19505943 .
↑ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ Official documentation of SAMtools (version 1.9) (English) . Date of treatment April 21, 2019. Archived April 11, 2019.
↑ Geradot. Output to a Bash file in Linux (Russian) . - Samtools works from the command line.
↑ Library for Python . Date of treatment April 29, 2015. Archived August 3, 2015.
↑ Library for Ruby . Date of treatment May 1, 2019. Archived May 1, 2019.
↑ Library for Perl . Date of treatment May 1, 2019. Archived April 21, 2019.
↑ Library for Haskell . Date of treatment April 29, 2015. Archived March 31, 2015.
↑ BamTools . Date of treatment April 29, 2015. Archived August 2, 2015.
↑ Picard . Date of treatment September 29, 2017. Archived September 29, 2017.
↑ cl-sam . Date of treatment April 29, 2015. Archived July 17, 2017.
↑ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ Specification of SAM / BAM formats (English) . Date of treatment April 28, 2015. Archived on April 6, 2017.
↑ CRAM format specification . Date of treatment April 29, 2015. Archived April 27, 2015.
↑ ¹ ² ³ ⁴ Specification of additional SAM format fields . Date of treatment April 27, 2019. Archived March 28, 2019.
↑ SAMtools repository . Date of treatment April 28, 2015. Archived on April 28, 2015.
↑ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ BWAtools Specification . bio-bwa.sourceforge.net. Archived on April 5, 2017.
↑ SAM Format . www.samformat.info. Archived on April 27, 2019.
↑ SAMtool bitwise flag meaning explained: how to understand samflags without pains . A Pillow Diary of an Expatriate Scientist (August 25, 2010). Archived on April 21, 2019.
↑ SAMtools Official Documentation (version 1.2) Date of treatment April 28, 2015. Archived March 26, 2015.
↑ VCF / BCF format specification . Date of treatment April 27, 2019. Archived March 28, 2019.
↑ SAMtools NERC Enviromental Bioinformatics Center documentation . Date accessed April 27, 2019. Archived April 27, 2019.

Literature

Li H. Improving SNP discovery by base alignment quality. (English) // Bioinformatics. - 2011 .-- April 15 ( vol. 27 , no. 8 ). - P. 1157-1158 . - DOI : 10.1093 / bioinformatics / btr076 . - PMID 21320865 .
Ramirez-Gonzalez RH , Bonnal R. , Caccamo M. , Maclean D. Bio-samtools: Ruby bindings for SAMtools, a library for accessing BAM files containing high-throughput sequence alignments. (English) // Source Code For Biology And Medicine. - 2012 .-- 28 May ( vol. 7 , no. 1 ). - P. 6-6 . - DOI : 10.1186 / 1751-0473-7-6 . - PMID 22640879 .

[1] The samtools Open Source Project on Open Hub: Languages Page . Circulation date May 4, 2019.

[_7a658a2dce055f17-2] ¹ ² ³ SAMtools Manual pages

[3] Petr Danecek (English) . Date of treatment April 28, 2015. Archived on August 20, 2017.

[4] Regular Wednesday IMG seminar Petr Daněček, Ph.D. (eng.) . Circulation date May 2, 2019.

[5] John Marshall . Date of treatment April 28, 2015. Archived June 10, 2017.

[:3-6] ¹ ² ³ Li H. , Handsaker B. , Wysoker A. , Fennell T. , Ruan J. , Homer N. , Marth G. , Abecasis G. , Durbin R. , 1000 Genome Project Data Processing Subgroup. The Sequence Alignment / Map format and SAMtools. (English) // Bioinformatics. - 2009 .-- 15 August ( vol. 25 , no. 16 ). - P. 2078-2079 . - DOI : 10.1093 / bioinformatics / btp352 . - PMID 19505943 .

[:4-7] ¹ ² ³ ⁴ ⁵ ⁶ ⁷ Official documentation of SAMtools (version 1.9) (English) . Date of treatment April 21, 2019. Archived April 11, 2019.

[8] Geradot. Output to a Bash file in Linux (Russian) . - Samtools works from the command line.

[9] Library for Python . Date of treatment April 29, 2015. Archived August 3, 2015.

[10] Library for Ruby . Date of treatment May 1, 2019. Archived May 1, 2019.

[11] Library for Perl . Date of treatment May 1, 2019. Archived April 21, 2019.

[12] Library for Haskell . Date of treatment April 29, 2015. Archived March 31, 2015.

[13] BamTools . Date of treatment April 29, 2015. Archived August 2, 2015.

[14] Picard . Date of treatment September 29, 2017. Archived September 29, 2017.

[15] -sam . Date of treatment April 29, 2015. Archived July 17, 2017.

[:0-16] ¹ ² ³ ⁴ ⁵ ⁶ ⁷ Specification of SAM / BAM formats (English) . Date of treatment April 28, 2015. Archived on April 6, 2017.

[17] CRAM format specification . Date of treatment April 29, 2015. Archived April 27, 2015.

[:2-18] ¹ ² ³ ⁴ Specification of additional SAM format fields . Date of treatment April 27, 2019. Archived March 28, 2019.

[19] SAMtools repository . Date of treatment April 28, 2015. Archived on April 28, 2015.

[:1-20] ¹ ² ³ ⁴ ⁵ ⁶ ⁷ BWAtools Specification . bio-bwa.sourceforge.net. Archived on April 5, 2017.

[21] SAM Format . www.samformat.info. Archived on April 27, 2019.

[22] SAMtool bitwise flag meaning explained: how to understand samflags without pains . A Pillow Diary of an Expatriate Scientist (August 25, 2010). Archived on April 21, 2019.

[23] SAMtools Official Documentation (version 1.2) Date of treatment April 28, 2015. Archived March 26, 2015.

[24] VCF / BCF format specification . Date of treatment April 27, 2019. Archived March 28, 2019.

[25] SAMtools NERC Enviromental Bioinformatics Center documentation . Date accessed April 27, 2019. Archived April 27, 2019.