Lumpy

https://github.com/arq5x/lumpy-sv

Overview

This repository contains the source code for the Lumpy structural variation detection framework developed in the Hall and Quinlan laboratories at the University of Virginia.

Installation

Lumpy installation does not require any external softwart. We do recommend installing numpy, samtools, bedtools, bamtools, novoalign (or bwa), and yaha. Below is a step-by-step tutorial for how to install and use Lumpy. This guide assumes that /usr/local/bin is writable and in your path. If either is not true, use another directory that is both writable and in your path, or contact your administrator. If you have questions, email me.

Install numpy (One of the following commands should work. If not, then visit http://docs.scipy.org/doc/numpy/user/install.html for more informaiton.) ::

pip install numpy

# Ubuntu & Debian
sudo apt-get install python-numpy

# Fedora
sudo yum install numpy

Install samtools ::

git clone git://github.com/samtools/samtools.git
cd samtools
make
cp samtools /usr/local/bin/.

Install bedtools ::

git clone git@github.com:arq5x/bedtools.git    
cd bedtools
make
cp bin/bedtools /usr/local/bin/.

Install bamtools ::

git clone git://github.com/pezmaster31/bamtools.git
cd bamtools
mkdir build
cd build
cmake ..
make
cd ..
cp bin/bamtools /usr/local/bin/.

Install novoalign

- Novoalign is closed-source, commercial software that can be freely used
  by any not-for-profit projects within not-for-profit organizations.
  Visit the 'Downloads' page at http://www.novocraft.com/

Install yaha ::

wget http://faculty.virginia.edu/irahall/support/yaha/YAHA.0.1.79.tar.gz
tar zxvf YAHA.0.1.79.tar.gz
cp yaha /usr/local/bin/.

Install bwa ::

wget http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.5a.tar.bz2
bunzip2 bwa-0.7.5a.tar.bz2
tar xvf bwa-0.7.5a.tar
cd bwa-0.7.5a
make
cp bwa /usr/local/bin/.

Clone the Lumpy repository ::

git clone git://github.com/arq5x/lumpy-sv.git

Navigate into the lumpy directory ::

cd lumpy-sv

At this point, you should be ready to compile lumpy ::

    make

Now, you can test the paired-end, split-read, and combination paired-end and split-read versions of the tools by running the moving to the test directory and running the test.sh script. Also, this shell script demonstrates how to run each of the lumpy version ::

    ./test.sh

If all works well, and you have bedtools installed in your path you should see the following ::

Testing lumpy paired-end
549 0
chr10   1000000
Simulated:    1000  Predicted:      40  True:      40   False:       0
Testing lumpy split-read
chr10   1000000
Simulated:    1000  Predicted:      44  True:      43   False:       1
Testing lumpy paired-end and split-read
549 0
chr10   1000000
Simulated:    1000  Predicted:      95  True:      92   False:       1

Usage

General options ::

-e

The default output reports the predicted breakpoint. This option includes the evidence supporting each call. ::

-mw minimum weight for a call

Each piece of evidence has a weight, and each possible call has an evidence set. The sum of weights in the evidence set must be above this value. ::

-tt trim threshold

Each predicted breakpoint interval has a probability array associated with it. The intervals can be trimmed of values that are below some trimming percentile. NOTE: We recommend "-tt 0.0" (no trimming) since LUMPY now reports both the 95% confidence interval and the most probable single base for each breakpoint. ::

-P

Print the breakpoint probability array. ::

-x excluded regions bed file

Regions of the genome may be excluded from consideration by included them in bed file format. Any alignment that overlaps any of the regions will be ignored. This is particularly useful when a sample has regions with either too very low or very high coverage due to biases in sequencing or alignment. See below for help creating this file. ::

Split-read options ::

-sr 
    bam_file:<file name>,

Position sorted bam file containing the output of a single read split-read aligner (e.g., YAHA, bwasw) for this sample. ::

    back_distance:<distance>

The distance around the +/- of the split to include in the breakpoint interval. A distance of 20 will created a breakpoint interval of size 40 centered at the split. ::

    min_mapping_threshold:<mapping quality>

Minimum mapping quality (reported from the aligner) that a read must have to be considered. A quality of 1 will filter all reads with two or more equally good mappings. ::

    weight:<sample weight>

Weight of each piece of evidence from this sample. ::

    id:<sample id>

Sample id.

Paired-end options ::

-pe 
    bam_file:<file name>,

Position sorted bam file containing the output of a paired-end read aligner aligner (e.g., bwa) for this sample. ::

    histo_file:<file name>,

Histogram of observed library sizes for the sample. A script to generate this file is located in scripts/pairend_distro.py (NOTE: the output of this script is the breakpoint probability distortion, not the fragment size distribution) ::

    mean:<value>,

Sample mean library size (can be found using scripts/pairend_distro.py) ::

    stdev:<value>,

Sample mean library standard deviation (can be found using scripts/pairend_distro.py) ::

    read_length:<length>,

Length of sequenced reads ::

    min_non_overlap:<length>,

Number of base pair positions that must be unique to each end of a read pair. Some library preps are created with large reads and small library sizes such that read overlap, in all over cases overlapping reads tends to be a sign of an error. We typically set this to read length (pairs cannot overlap). ::

    discordant_z:<z value>,

Number of standard deviations away from the mean to be considered as a normal library size. ::

    back_distance:<distance>

Distance into the read to add to the breakpoint interval. ::

    min_mapping_threshold:<mapping quality>

Minimum mapping quality (reported from the aligner) that a read must have to be considered. A quality of 1 will filter all reads with two or more equally good mappings. ::

    weight:<sample weight>

Weight of each piece of evidence from this sample. ::

    id:<sample id>

Sample id.

BEDPE (general interface) options ::

-bedpe 
    bedpe_file:<bedpe file>,

Position sorted bedpe file containing the breakpoint intervals for this sample. ::

    back_distance:<distance>

Distance into the read to add to the breakpoint interval.
::

    weight:<sample weight>

Weight of each piece of evidence from this sample. ::

    id:<sample id>

Sample id.

Output

Tab separated::

1. chromosome 1
2. interval 1 start
3. interval 1 end
4. chromosome 2
5. interval 2 start
6. interval 2 end
7. id
8. evidence set score
9. strand 1
10. strand 2
11. type 
12. id of samples containing evidence for this breakpoint
    13. strand configurations observed in the evidence set
    14. point within the two breakpoint with the maximum probability
    15. segmetn of each breakpoint that contains 95% of the probability

Example::

    chr1    547154  547462  chr1    547265  547569  1   0.00254453  +   -   TYPE:DELETION   IDS:10,6    STRANDS:+-,6    MAX:chr1:547175;chr1:547569 95:chr1:547169-547225;chr1:547266-547569

Test data sets

The test/test.sh script executes lumpy against several simulated data sets and compares the results to the known correct result. The sample data sets are not part of the lumpy code base, and can be found at http://www.cs.virginia.edu/~rl6sf/lumpy/data.tar.gz. This tar ball should be extracted into the top-level lumpy directory. The script test/test.sh checks for the the existence of this directory before running lumpy.

Example Work flow

Assume that the input files are "sample.1.fq" and "sample.2.fq", and the read length is 150.

LUMPY is designed to consider both paired-end and split-read alignments, and can also consider each independently. There are two strategies for extracting constructing a split-read bam file that are fully explained below. One option is to first align a fastq file with a paired-end aligned (novoalign or bwa), extract candidate split reads from those alignments, then realign those candidate reads using a split-read aligner (yaha or bwasw). If you are starting with an aligned file (e.g., a bam file), this is probably your best option since it does not require full realignment. Another option is to align using bwa-mem, which will produce both paired-end alignments and split-read alignments in a single pass. Then, you can split this file into a paired-end file and a split-read file. This is probably the best option when starting from a fastq file.

Paired-end alignment

Both novoalign and bwa are options for paired-end alignment: ::

novoalign \
    -d hg19.ndx \
    -o SAM \
    -r Random \
    -i PE 500,50 -e 1 -c 20 \
    -f sample.1.fq sample.2.fq \
    | samtools view -Sb - > sample.pe.bam

bwa aln hg19.fa sample.1.fq > sample.1.sai
bwa aln hg19.fa sample.2.fq > sample.2.sai
bwa sampe hg19.fa \
    sample.1.sai sample.2.sai \
    sample.1.fq sample.2.fq \
    | samtools view -S -b - \
    > sample.pe.bam

Use bamtools or a recent version of samtools (0.1.19) to sort. NOTE: the resulting bam file must have the coordinate sort flag set (i.e., @HD VN:1.3 SO:coordinate). ::

bamtools sort -in sample.pe.bam -out sample.pe.sort.bam

samtools sort sample.pe.bam sample.pe.sort

Split read alignment

From the paired end aligned bam file sample.pe.sort.bam, you can extract the reads that are either unmapped or have a soft clipped portion of at least 20 base pairs ::

samtools view sample.pe.sort.bam \
    | scripts/split_unmapped_to_fasta.pl -b 20 \
> sample.um.fq

Use a split-read aligner on the unmapped/soft clipped reads; we prefer yaha: ::

# index first
yaha -g hg19.fa  -L 11

# using 20 threads
yaha \
    -t 20 \
-x hg19.X11_01_65525S
-q sample.um.fq \
-osh stdout \
-M 15 \
-H 2000 \
-L 11 \
| samtools view -Sb - \
> sample.sr.bam

For split reads, bwasw is another option: ::

bwa bwasw -H -t 20 hg19.fa sample.um.fq \
    | samtools view -Sb - \
    > sample.sr.bam

Sort the split-read alignments (again, using bamtools or samtools): ::

bamtools sort -in sample.sr.bam -out sample.sr.sort.bam

samtools sort sample.sr.bam sample.sr.sort

Paired-end and split-read alignment using bwa-mem

bwa-mem produces a single bam file with both paired-end alignments and split-read alignments ::

bwa mem hg19.fa sample.1.fq sample.2.fq -M \
    | samtools view -S -b - \
    > sample.pesr.bam

extract the disordant paired-end alignments. ::

samtools view -u -F 0x0002 sample.pesr.bam  \
    |  samtools view -u -F 0x0100 - \
    | samtools view -u -F 0x0004 - \
    | samtools view -u -F 0x0008 - \
    | samtools view -b -F 0x0400 - \
    > sample.discordant.pe.bam

extract the split-read alignments ::

samtools view -h sample.pesr.bam \
    | scripts/extractSplitReads_BwaMem -i stdin \
    | samtools view -Sb - \
    > sample.sr.bam

Sort both alignments (again, using bamtools or samtools): ::

bamtools sort -in sample.discordant.pe.bam -out sample.discordant.pe.sort.bam
bamtools sort -in sample.sr.bam -out sample.sr.sort.bam

samtools sort sample.discordant.pe.bam sample.discordant.pe.sort
samtools sort sample.sr.bam sample.sr.sort

Run lumpy-sv using paired end reads

Using the paired end mapped reads, empirically define the paired-end distribution from 10000 proper alignments. It is common practice to skip the first million reads. (NOTE: the output of this script is the breakpoint probability distortion, not the fragment size distribution) ::

samtools view sample.pesr.bam \
    | tail -n+100000 \
    | scripts/pairend_distro.py \
    -r 150 \
    -X 4 \
    -N 10000 \
    -o sample.pe.histo

The above script (scripts/pairend_distro.py) will display mean and stdev to screen.

To run lumpy with just the paired-end data, We will assume the mean=500 and stdev=50: ::

../bin/lumpy \
    -mw 4 \
-tt 0.0 \
-pe \
bam_file:sample.discordant.pe.sort.bam,histo_file:sample.pe.histo,mean:500,stdev:50,read_length:150,min_non_overlap:150,discordant_z:4,back_distance:20,weight:1,id:1,min_mapping_threshold:20\
> sample.pe.bedpe

Run lumpy-sv using split-reads reads

We can run lumpy with just the split-read data too: ::

../bin/lumpy \
    -mw 4 \
-tt 0.0 \
-sr \
bam_file:sample.sr.sort.bam,back_distance:20,weight:1,id:2,min_mapping_threshold:20 \
> sample.sr.bedpe

Run lumpy-sv using both paired and split reads

Or, we run lumpy with both the paired-end and split-read data: ::

../bin/lumpy \
    -mw 4 \
    -tt 0.0 \
    -pe \
    bam_file:sample.discordant.pe.sort.bam,histo_file:sample.pe.histo,mean:500,stdev:50,read_length:150,min_non_overlap:150,discordant_z:4,back_distance:20,weight:1,id:1,min_mapping_threshold:20\
    -sr \
    bam_file:sample.sr.sort.bam,back_distance:20,weight:1,id:2,min_mapping_threshold:20 \
    > sample.pesr.bedpe

Run lumpy-sv using matched samples

We can run lumpy with paired-end data from a matched tumor/normal samples ::

../bin/lumpy \
        -mw 4 \
        -tt 0.0 \
        -pe \
        bam_file:tumor.pe.sort.bam,histo_file:tumor.pe.histo,mean:500,stdev:50,read_length:150,min_non_overlap:150,discordant_z:4,back_distance:20,weight:1,id:1,min_mapping_threshold:1\
        -pe \
        bam_file:normal.pe.sort.bam,histo_file:normal.pe.histo,mean:500,stdev:50,read_length:150,min_non_overlap:150,discordant_z:4,back_distance:20,weight:1,id:2,min_mapping_threshold:1\
        > tumor_v_normal.pe.bedpe

Run lumpy-sv with regions of very high coverage excluded

We can direct lumpy to ignore certain regions by using the exclude region option. In this example we find and then exclude regions that have very high coverage. First we use the get_coverages.py script to find the min, max, and mean coverages of the the sr and pe bam files, and to create coverage profiles for both files. ::

    python ../scripts/get_coverages.py \
            sample.pe.sort.bam \
            sample.sr.sort.bam

    sample.pe.sort.bam.coverage  min:1   max:14  mean(non-zero):2.35557521272
    sample.sr.sort.bam.coverage  min:1   max:7   mean(non-zero):1.08945936729

From this output, we will choose to exclude regions that have more than 10x coverage. To create the exclude file we will use the get_exclude_regions.py script to create the exclude.bed file ::

    python ../scripts/get_exclude_regions.py \
            10 \
            exclude.bed \
            sample.pe.sort.bam \
            sample.sr.sort.bam

We now rerun lumpy with the exclude (-x) option ::

../bin/lumpy \
    -mw 4 \
    -tt 0.0 \
            -x exclude.bed \
    -pe \
    bam_file:sample.pe.sort.bam,histo_file:sample.pe.histo,mean:500,stdev:50,read_length:150,min_non_overlap:150,discordant_z:4,back_distance:20,weight:1,id:1,min_mapping_threshold:1\
    -sr \
    bam_file:sample.sr.sort.bam,back_distance:20,weight:1,id:2,min_mapping_threshold:1 \
    > sample.pesr.exclude.bedpe

Troubleshooting

All of the bam files that lumpy processes must be position sorted. To check if your bams are sorted correctly, use the check_sorting.py script ::

    python ../scripts/check_sorting.py \
            pe.pos_sorted.bam \
            sr.pos_sorted.bam \
            pe.name_sorted.bam
    pe.pos_sorted.bam
    in order
    sr.pos_sorted.bam
    in order
    pe.name_sorted.bam
    out of order:   chr10   102292476   occurred after   chr10   102292893