SpeedSeq
Current version: 0.0.2
Chiang, C, R M Layer, G G Faust, M R Lindberg, D B Rose, E P Garrison, G T Marth, A R Quinlan, and I M Hall. 2014. SpeedSeq: Ultra-Fast Personal Genome Analysis and Interpretation. bioRxiv. doi:10.1101/012179.
Current support for Linux only
Table of Contents
- Summary
- Constitutive Pipeline Tools
- Installation
- Automatic Installation
- Manual Installation
- Usage
- Annotations
- Example Workflows
Summary
The SpeedSeq suite is a lightweight, flexible, and open source pipeline that identifies genomic variation (single nucleotide variants (SNVs), indels, and structural variants (SVs)).
Constitutive Pipeline Tools
Installation
There is an automatic and manual installation process for SpeedSeq.
The following are required for both installations:
- cmake
- g++ 4.3 or later
- gcc
- git
- make
- python2.7
- python-devel
- python-yaml
- ncurses-devel
- zlib-devel
- numpy
- pip
A Linux package manager can be used to obtain these by with the command:
# general purpose tools
sudo yum update
sudo yum -y install cmake gcc-c++ gcc git make python27 python-devel python-yaml ncurses-devel zlib-devel numpy python-pip
# PERL modules for VEP
sudo yum -y install "perl(Archive::Extract)" "perl(CGI)" "perl(DBI)" "perl(Time::HiRes)" "perl(Archive::Tar)" "perl(Archive::Zip)"
or
sudo apt-get update
sudo apt-get -y install build-essential cmake gpp gcc git make python2.7 python-dev python-yaml ncurses-dev zlib1g-dev python-numpy python-pip
Automatic installation
SpeedSeq can be installed with the following commands:
git clone --recursive https://github.com/cc2qe/speedseq
cd speedseq
python speedseq_setup.py
sudo cp -r bin/* /usr/local/bin/
Configure the paths to the SpeedSeq dependencies by modifying the speedseq.config file. The speedseq.config file should reside in the same directory as the SpeedSeq executable. By default, SpeedSeq attempts to source the dependencies from the $PATH.
Note that the optional component CNVnator cannot be automatically installed (see the CNVnator installation instructions). SpeedSeq uses a multi-threaded implementation of CNVnator v0.3, which can be for in src/cnvnator
Genome
We recommend using the GRCh37 human genome for SpeedSeq, available here:
- ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
- ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.fai
Manual installation
The following instructions for installation assumes that the required tools are not installed. If you already have the SpeedSeq components on your system, then you can simply link them to the proper paths in the [speedseq.config](bin/speedseq.config] file.
It is recommended that the specified versions of each tool are used for this release of SpeedSeq.
The use of unspecified versions of any pipeline component is not guaranteed to work.
SpeedSeq can be installed with the following commands:
git clone --recursive https://github.com/cc2qe/speedseq
cd speedseq
make
sudo cp -r bin/* /usr/local/bin/
Configuration
Configure the paths to the SpeedSeq dependencies by modifying the speedseq.config file. The speedseq.config file should reside in the same directory as the SpeedSeq executable. By default, SpeedSeq attempts to source the dependencies from the $PATH.
Genome
We recommend using the GRCh37 human genome for SpeedSeq, available here:
- ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
- ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.fai
Obtain each of the external pipeline tools and install:
Sambamba
https://github.com/lomereiter/sambamba
Sambamba can be installed and used by SpeedSeq by:
curl -OL https://github.com/lomereiter/sambamba/releases/download/v0.4.6-beta/sambamba_v0.4.6-beta_centos5-x86_64.tar.bz2
tar -xvf sambamba_v0.4.6-beta_centos5-x86_64.tar.bz2
sudo cp sambamba_v0.4.6 /usr/local/bin/
CNVnator
http://sv.gersteinlab.org/cnvnator/
CNVnator can be installed and used by SpeedSeq with the following commands:
- Install the ROOT package (http://root.cern.ch/drupal/) ``` # install dependencies sudo yum -y install libX11-devel libXpm-devel libXft-devel libXext-devel
# get ROOT curl -OL ftp://root.cern.ch/root/root_v5.34.20.source.tar.gz tar -xvf root_v5.34.20.source.tar.gz cd root ./configure make cd .. sudo mv root /usr/local ```
- Add the following line to
~/.bashrc
source /usr/local/root/bin/thisroot.sh
Then run the following for reload .bashrc
. ~/.bashrc
-
Navigate to SpeedSeq git directory and compile the multi-threaded implementation of CNVnator v0.3
cd speedseq make cnvnator-multi sudo cp bin/* /usr/local/bin
-
Create chromosomes directory from genome fasta file
CNVnator requires a directory containing each chromosome as a separate fasta file. Run the following commands to generate this directory.
mkdir chroms
cd chroms
cat human_g1k_v37.fasta | awk 'BEGIN { CHROM="" } { if ($1~"^>") CHROM=substr($1,2); print $0 > CHROM".fa" }'
Then set the path to the chroms
directory as CNVNATOR_CHROMS_DIR
in speedseq.config
VEP
http://www.ensembl.org/info/docs/tools/vep/index.html
VEP can be installed and used by SpeedSeq with the following commands:
-
Install required PERL modules
sudo yum -y install "perl(Archive::Extract)" "perl(CGI)" "perl(DBI)" "perl(Time::HiRes)" "perl(Archive::Tar)" "perl(Archive::Zip)"
-
Download the software and install
curl -OL https://github.com/Ensembl/ensembl-tools/archive/release/76.zip unzip 76.zip perl ensembl-tools-release-76/scripts/variant_effect_predictor/INSTALL.pl -a ac -s homo_sapiens -y GRCh37 --CACHEDIR ~/.vep
-
Copy files to a directory in $PATH
sudo cp ensembl-tools-release-76/scripts/variant_effect_predictor/variant_effect_predictor.pl /usr/local/bin sudo cp -r ensembl-tools-release-76/scripts/variant_effect_predictor/Bio /usr/local/bin
-
Edit speedseq.config to set correct paths for the
VEP
andVEP_CACHE_DIR
variables.
GEMINI
https://github.com/arq5x/gemini
GEMINI can be installed and used by SpeedSeq with the following commands:
wget https://raw.github.com/arq5x/gemini/master/gemini/scripts/gemini_install.py
# or curl -OL https://raw.github.com/arq5x/gemini/master/gemini/scripts/gemini_install.py > gemini_install.py
sudo python2.7 gemini_install.py /usr/local /usr/local/share/gemini
export PATH=$PATH:/usr/local/gemini/bin
# it would be wise to add the above line to your ``.bashrc`` or ``.bash_profile``
gemini update
BEDTools
https://github.com/arq5x/bedtools2
BEDTools can be installed and used by SpeedSeq with the following commands:
curl -OL https://github.com/arq5x/bedtools2/releases/download/v2.20.1/bedtools-2.20.1.tar.gz
tar -xvf bedtools-2.20.1.tar.gz
make -C bedtools2-2.20.1
sudo cp bedtools2-2.20.1/bin/* /usr/local/bin
GNU Parallel
http://www.gnu.org/software/parallel/
Parallel can be installed and used by SpeedSeq with the following commands:
curl -OL http://ftp.gnu.org/gnu/parallel/parallel-20100424.tar.bz2
tar -xvf parallel-20100424.tar.bz2
cd parallel-20100424
./configure && sudo make && sudo make install
sudo cp src/parallel /usr/local/bin/
For alternative installations and release issues for any of the above tools please consult the website/creator.
Usage
SpeedSeq is a modular pipeline with four components: align
, var
, somatic
, and sv
.
speedseq align
- Take paired-end fastq sequences as input, and produce a duplicate-marked, sorted, indexed BAM file that can be processed with other speedseq modules. Currently, speedseq align does not support single-end reads.
speedseq var
- Run FreeBayes one or more BAM files
speedseq somatic
- Run FreeBayes on a tumor/normal pair of BAM files
speedseq sv
- Run LUMPY on one or more BAM files, with breakend genotyping and read-depth calculation.
speedseq align
speedseq align
takes paired-end fastq sequences as input, and produces a duplicate-marked, sorted, indexed BAM file that can be processed with other SpeedSeq modules. Currently, speedseq align
does not support single-end reads.
Internally, speedseq align
runs the following steps to produce three output BAM files:
- Alignment with BWA-MEM
- Prior to alignment, run
bwa index genome.fasta
on the reference genome - Duplicate marking with SAMBLASTER
- Discordant-read and split-read extraction with SAMBLASTER
- Position sorting with Sambamba
- BAM indexing with Sambamba
usage: speedseq align [options] <reference.fa> <in1.fq> [in2.fq]
Positional arguments
reference.fa genome reference fasta file (indexed with bwa) (required)
in1.fq paired-end fastq file. if -p flag is used then expected to be
an interleaved paired-end fastq file, and in2.fq may be omitted.
(may be gzipped) (required)
in2.fq paired-end fastq file. (may be gzipped) (required)
Alignment options
These options determine the behavior of BWA-MEM
-o STR output prefix [default: in1.fq]
-R read group header line such as "@RG\tID:libraryname\tSM:samplename" (required)
-p first fastq file consists of interleaved paired-end sequences
-t INT number of threads to use [default: 1]
-T DIR temp directory [default: ./temp]
Samblaster options
These options determine the behavior of SAMBLASTER
-i include duplicates in splitters and discordants
(default: exclude duplicates)
-c INT maximum number of split alignments for a read to be
included in splitter file [default: 2]
-m INT minimum non-overlapping base pairs between two alignments
for a read to be included in splitter file [default: 20]
Sambamba options
-M amount of memory in GB to be used for sorting [default: 20]
Global options
-K FILE path to speedseq.config file (default: same directory as speedseq)
-v verbose
-h show help message
Output
speedseq align
produces three sorted, indexed BAM files (plus their corresponding .bai index files):
outprefix.bam
- The full, duplicate-marked, sorted BAM file for the library. This file may serve as input for
speedseq var
,speedseq somatic
, andspeedseq sv
. outprefix.splitters.bam
- This BAM file contains split reads called by the BWA-MEM alignment of the library. It may be used as the
-S
flag input tospeedseq sv
. This file excludes duplicate reads by default, but they will be included if the-i
flag is specified as aspeedseq align
command line parameter. outprefix.discordants.bam
- This BAM file contains discordant read-pairs called by the BWA-MEM alignment of the library. These reads may be discordant by strand orientation, intrachromosomal distance, or interchromosomal mapping. This BAM file may be used as the
-D
flag input tospeedseq sv
. This file excludes duplicate reads by default, but they will be included if the-i
flag is specified as aspeedseq align
command line parameter.
speedseq var
speedseq var
runs FreeBayes one or more BAM files.
usage: speedseq var [options] <reference.fa> <input1.bam> [input2.bam [...]]
Positional arguments
reference.fa genome reference fasta file
input.bam BAM file(s) to call variants on. Must have readgroup information,
and the SM readgroup tags will be the VCF column header
Options
-o STR output prefix [default: input1.bam]
-w FILE BED file of windowed genomic intervals. For human genomes,
we recommend using the annotations/ceph18.b37.include.2014-01-15.bed
(see Annotations)
-q FLOAT minimum variant QUAL score to output [1]
-t INT number of threads to use [default: 1]
-T DIR temp directory [default: ./temp]
-A BOOL annotate the vcf with snpEff (true or false) [default: true]
-K FILE path to speedseq.config file [default: same directory as speedseq]
-v verbose
-h show help message
Output
speedseq var
produces a single indexed VCF file that optionally annotated with VEP.
outprefix.vcf.gz
speedseq somatic
speedseq somatic
runs FreeBayes on a tumor/normal pair of BAM files
usage: speedseq somatic [options] <reference.fa> <normal.bam> <tumor.bam>
Positional arguments
reference.fa genome reference fasta file
normal.bam germline BAM file(s) (comma separated BAMs from multiple libraries).
Must have readgroup information, and the SM readgroup tag will
be the VCF column header
tumor.bam tumor BAM file(s) (comma separated BAMs for multiple libraries).
Must have readgroup information, and the SM readgroup tag will
be the VCF column header
Options
-o STR output prefix [default: tumor.bam]
-w FILE BED file of windowed genomic intervals. For human genomes,
we recommend using the annotations/ceph18.b37.include.2014-01-15.bed
(see Annotations)
-t INT number of threads to use [default: 1]
-F FLOAT require at least this fraction of observations supporting
an alternate allele within a single individual in order
to evaluate the position [0.05]
-C INT require at least this count of observations supporting
an alternate allele within a single individual in order
to evaluate the position [2]
-n FLOAT minimum normal log odds ratio for PASS [7]
-u FLOAT minimum tumor log odds ratio for PASS [7]
-q FLOAT minimum QUAL score to output non-passing somatic variants [1e-5]
-T DIR temp directory [./temp]
-A BOOL annotate the vcf with snpEff (true or false) (default: true)
-K FILE path to speedseq.config file (default: same directory as speedseq)
-v verbose
-h show help message
Output
speedseq somatic
produces a single indexed VCF file and optionally annotated with VEP.
outprefix.vcf.gz
speedseq sv
speedseq sv
runs LUMPY on one or more BAM files, with optional breakend genotyping by SVtyper, and optional read-depth analysis by CNVnator.
Options
-B FILE full BAM file(s) (comma separated) (required)
example: -B in1.bam,in2.bam,in3.bam
-S FILE split reads BAM file(s) (comma separated, order same as in -B) (required)
example: -S in1.splitters.bam,in2.splitters.bam,in3.splitters.bam
-D FILE discordant reads BAM file(s) (comma separated, order same as in -B) (required)
example: -D in1.discordants.bam,in2.discordants.bam,in3.discordants.bam
-R FILE indexed reference genome fasta file (required)
-o STR output prefix [in1.bam]
-t INT threads [1]
-x FILE BED file to exclude
-g genotype SV breakends with svtyper
-d calculate read-depth with CNVnator
-A BOOL annotate the vcf with VEP (true or false) (default: true)
-m INT minimum weight for a call [default: 4]
-r FLOAT trim threshold [0]
-L INT read length [auto]
-T DIR temp directory [default: ./temp]
-k keep temporary files
The flags -s
and -p
are automatically generated using the defaults below, but may be overridden by the user by explicitly defining them using the following format.
-s STR lumpy split read parameters [auto]
bam_file:<splitreads.bam>,
back_distance:<20>,
min_mapping_threshold:<20>,
weight:<1>,
id:<11>,
min_clip:<20>
-p STR lumpy discordant read parameters [auto]
bam_file:<discreads.bam>,
histo_file:<auto>,
mean:<auto>,
stdev:<auto>,
read_length:<auto>,
min_non_overlap:<read_length>,
discordant_z:<5>,
back_distance:<20>,
min_mapping_threshold:<20>,
weight:<1>,
id:<10>
Global options
-K FILE path to speedseq.config file (default: same directory as speedseq)
-v verbose
-h show help message
Output
speedseq sv
produces a VCF file and a BEDPE file containing the same information.
outprefix.sv.vcf.gz
outprefix.sv.bedpe
The tab-delimited BEDPE file has the following structure:
1. chromosome 1
2. interval 1 start
3. interval 1 end
4. chromosome 2
5. interval 2 start
6. interval 2 end
7. id
8. evidence set score
9. strand 1
10. strand 2
11. type
12. id of samples containing evidence for this breakpoint
13. strand configurations observed in the evidence set
Example:
chr1 34971904 34971945 chr1 34976002 34976043 0x7f9eb0917210 0.0110386 + - TYPE:DELETION IDS:11,1 STRANDS:+-,1
Annotations
For human genome alignment using the GRCh37 build, we recommend using the annotations/ceph18.b37.include.2014-01-15.bed BED file to parallelize the variant calling (speedseq var
and speedseq somatic
). This BED file excludes regions of the genome where the coverage in the CEPH1463 pedigree was greater than twice the mode coverage plus 3 standard deviations. We believe these extremely high depth regions that we excluded are areas of misassembly in the GRCh37 human reference genome in which variant calling is time-consuming and error-prone.
Additionally, the regions in annotations/ceph18.b37.include.2014-01-15.bed are variable-width windows which each contain approximately the same coverage depth in the CEPH1463 pedigree, and sorted from highest to lowest depth. This ensures that the parallelization of Freebayes uses approximately the same amount of time per region.
The regions in annotations/ceph18.b37.exclude.2014-01-15.bed represent the complement of the regions in annotations/ceph18.b37.include.2014-01-15.bed.
In the speedseq sv
module, we recommend excluding the genomic regions in the annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed BED file. These regions represent the complement of those in annotations/ceph18.b37.include.2014-01-15.bed as well as the mitochondrial chromosome.
Example Workflows
Call variants on a single sample
- Use
speedseq align
to produce a sorted, duplicate-marked, BAM alignment from paired-end fastq data.
speedseq align -o NA12878 -R "@RG\tID:NA12878.S1\tSM:NA12878" \
human_g1k_v37.fasta NA12878.1.fq.gz NA12878.2.fq.gz
Note: if using an interleaved paired-end fastq file, use the -p
flag
speedseq align -p -o NA12878 -R "@RG\tID:NA12878.S1\tSM:NA12878" \
human_g1k_v37.fasta NA12878.interleaved.fq.gz
- Use
speedseq var
to call SNVs and indels on a single sample.
speedseq var -o NA12878 \
-w annotations/ceph18.b37.include.2014-01-15.bed \
human_g1k_v37.fasta NA12878.bam
- Use
speedseq sv
to call structural variants. The optional-g
and-d
flags perform breakend genotyping and read-depth calculation respectively
speedseq sv -o NA12878 \
-x annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed \
-g \
-d \
-B NA12878.bam \
-D NA12878.discordants.bam \
-S NA12878.splitters.bam
Call variants on a single sample sequenced with multiple libraries
- Use
speedseq align
to produce a sorted, duplicate-marked, BAM alignment of each library.
``` speedseq align -o NA12878_S1 -R "@RG\tID:NA12878.S1\tSM:NA12878" \ human_g1k_v37.fasta NA12878.S1.1.fq.gz NA12878.S1.2.fq.gz
speedseq align -o NA12878_S2 -R "@RG\tID:NA12878.S2\tSM:NA12878" \ human_g1k_v37.fasta NA12878.S2.1.fq.gz NA12878.S2.2.fq.gz
speedseq align -o NA12878_S3 -R "@RG\tID:NA12878.S3\tSM:NA12878" \ human_g1k_v37.fasta NA12878.S3.1.fq.gz NA12878.S3.2.fq.gz ```
- Use
speedseq var
to call SNVs and indels.speedseq var
will automatically recognize libraries with the same SM readgroup tag to be the same sample.
speedseq var -o NA12878 \
-w annotations/ceph18.b37.include.2014-01-15.bed \
human_g1k_v37.fasta NA12878_S1.bam NA12878_S2.bam NA12878_S3.bam
- Use
speedseq sv
to call structural variants. (For proper library merging, each library from the same sample must have the same SM readgroup tag.)speedseq -sv -o NA12878 \ -x annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed \ -B NA12878_S1.bam,NA12878_S2.bam,NA12878_S3.bam -S NA12878_S1.splitters.bam,NA12878_S2.splitters.bam,NA12878_S3.splitters.bam -D NA12878_S1.discordants.bam,NA12878_S2.discordants.bam,NA12878_S3.discordants.bam
Call variants on multiple samples
- Use
speedseq align
to produce sorted, duplicate-marked, BAM alignments for each sample.
``` speedseq align -o NA12877 -R "@RG\tID:NA12877.S1\tSM:NA12877" \ human_g1k_v37.fasta NA12877.1.fq.gz NA12877.2.fq.gz
speedseq align -o NA12878 -R "@RG\tID:NA12878.S1\tSM:NA12878" \ human_g1k_v37.fasta NA12878.1.fq.gz NA12878.2.fq.gz
speedseq align -o NA12879 -R "@RG\tID:NA12879.S1\tSM:NA12879" \ human_g1k_v37.fasta NA12879.1.fq.gz NA12879.2.fq.gz ```
- Use
speedseq var
to call SNVs and indels on multiple samples.
speedseq var -o cephtrio \
-w annotations/ceph18.b37.include.2014-01-15.bed \
human_g1k_v37.fasta NA12877.bam NA12878.bam NA12879.bam
- Use
speedseq sv
to call structural variants on multiple samples.
speedseq sv -o cephtrio \
-x annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed \
-B NA12877.bam,NA12878.bam,NA12879.bam \
-D NA12877.discordants.bam,NA12878.discordants.bam,NA12879.discordants.bam \
-S NA12877.splitters.bam,NA12878.splitters.bam,NA12879.splitters.bam
Call variants on a tumor/normal pair
- Use
speedseq align
to produce sorted, duplicate-marked, BAM alignments for the tumor/normal pair
``` speedseq align -p -o TCGA-B6-A0I6.normal \ -R "@RG\tID:TCGA-B6-A0I6-10A-01D-A128-09\tSM:TCGA-B6-A0I6-10A-01D-A128-09" \ human_g1k_v37.fasta TCGA-B6-A0I6-10A-01D-A128-09.interleaved.fq.gz
speedseq align -p -o TCGA-B6-A0I6.tumor \ -R "@RG\tID:TCGA-B6-A0I6-01A-11D-A128-09\tSM:TCGA-B6-A0I6-01A-11D-A128-09" \ human_g1k_v37.fasta TCGA-B6-A0I6-01A-11D-A128-09.interleaved.fq.gz ```
-
Use
speedseq somatic
to call SNVs and indels on the tumor/normal pair.speedseq somatic -o TCGA-B6-A0I6 \ -w annotations/ceph18.b37.include.2014-01-15.bed \ -F 0.05 \ -q 1 \ human_g1k_v37.fasta TCGA-B6-A0I6.normal.bam TCGA-B6-A0I6.tumor.bam
-
Use
speedseq sv
to call structural variants on the tumor/normal pair.speedseq sv -o TCGA-B6-A0I6 \ -x annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed \ -B TCGA-B6-A0I6.normal.bam,TCGA-B6-A0I6.tumor.bam \ -D TCGA-B6-A0I6.normal.discordants.bam,TCGA-B6-A0I6.tumor.discordants.bam \ -S TCGA-B6-A0I6.normal.splitters.bam,TCGA-B6-A0I6.tumor.splitters.bam