Friday, 8 May 2015

LARGE DELETIONS/DUPLICATIONS TESTING ON NGS DATA

CNV (copy number variation) analysis, also known as large deletion/duplication testing, has traditionally been performed with dedicated techniques such as MLPA (Multiplex Ligation-dependent Probe Amplification) or qPCR (quantitative PCR). Today, large deletion/duplication testing is also available through 'simple' in-silico analysis of NGS data.

WHY DO CNV ANALYSIS?

CNV analysis is essential in the study of all kinds of genetic conditions, from Mendelian inherited disorders to multifactorial diseases, and in quantitative trait loci (QTL) analysis. Traditionally, CNV analysis has required specific techniques such as MLPA, qPCR or, for larger deletions/duplications, FISH.

Today we can also test for large deletions/duplications by running specific algorithms on NGS sequencing data, especially when we do whole genome sequencing (WGS) or whole exome sequencing (WES). No single algorithm is capable of detecting the full range of large deletions/duplications, and each program has its own advantages and disadvantages. However, the concept they introduce is very innovative: we can screen for large deletions/duplications using sequencing data alone!

WHICH ALGORITHMS CAN BE USED?

These new algorithms for CNV analysis can be classified into four methods, which can also be combined to obtain the best results. The four methods are:

RP: read-pair
SR: split read
RD: read-depth
AS: assembly-based

When they are combined, we speak of a combined approach (CA).

THE READ-PAIR (RP) METHOD

The RP method compares the observed insert size between the two reads of a pair, as mapped to the reference sequence, with the insert size expected from the sequencing library. In case of discordance, the algorithm calls a deletion or duplication. The advantage of the RP method is that it can detect large deletions/duplications, but it usually misses smaller events, such as insertions or losses of a few tens of bases. Software based on the RP method includes PEMer, Hydra, Ulysses, and BreakDancer.
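
To make the insert-size comparison concrete, here is a minimal Python sketch. It is not how PEMer, BreakDancer or any other named tool works internally; the function name `classify_pair`, the cutoff of 3 standard deviations and the example numbers are all assumptions chosen for illustration. It only covers the span-based signal: a pair that maps farther apart than the library allows is consistent with a deletion in the sample, and one that maps closer together is consistent with inserted sequence.

```python
def classify_pair(mapped_span, expected_mean, expected_sd, n_sd=3.0):
    """Label one read pair by comparing its mapped span (bp on the reference)
    with the insert size expected from the library.

    n_sd is a hypothetical cutoff: how many standard deviations away from
    the expected mean a span must be before the pair counts as discordant.
    """
    upper = expected_mean + n_sd * expected_sd
    lower = expected_mean - n_sd * expected_sd
    if mapped_span > upper:
        # Reads map too far apart: sequence between them may be missing in the sample.
        return "deletion_candidate"
    if mapped_span < lower:
        # Reads map too close together: the sample may carry extra sequence here.
        return "insertion_candidate"
    return "concordant"


# Illustrative spans for a library with a 300 bp mean insert and 20 bp SD.
spans = [310, 295, 1450, 305, 120]
labels = [classify_pair(s, expected_mean=300, expected_sd=20) for s in spans]
```

In practice a single discordant pair means little. Real callers look for clusters of discordant pairs that support the same event, and they also use read orientation, which this sketch ignores.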

THE SPLIT READ (SR) METHOD

This method is based on paired-end sequencing and applies whenever only one read of the pair maps and the other one completely or partially fails to map to the genome, suggesting that it might span a breakpoint. In comparison to the RD method, this method is limited in identifying large structural variations, but it can be helpful for small deletions/insertions (the software Gustaf, for instance, can detect all types of variations larger than 30 bp). With the SR method it is in principle possible to pinpoint the breakpoints exactly. Other SR-based programs are Pindel, PRISM and SVseq2.
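
One simplified way to see how a partially mapped read can point to a breakpoint is to look at soft-clipped alignments. The Python sketch below assumes alignments described by a 1-based leftmost position and a SAM-style CIGAR string; the function name `clip_breakpoints` and the `min_clip` threshold of 20 bases are illustrative assumptions, not taken from Pindel, Gustaf or any other program mentioned here.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def clip_breakpoints(pos, cigar, min_clip=20):
    """Return candidate breakpoint positions (1-based) for one alignment.

    A soft clip ('S') of at least min_clip bases at either end of the read
    means that part of the read did not align, so the edge of the aligned
    part is a candidate breakpoint. Hard clips ('H') are ignored.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
    breakpoints = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        # Clipped on the left: the breakpoint sits at the first aligned base.
        breakpoints.append(pos)
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        # Clipped on the right: the breakpoint sits at the last aligned base.
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        breakpoints.append(pos + ref_len - 1)
    return breakpoints


right_clipped = clip_breakpoints(1000, "60M40S")
left_clipped = clip_breakpoints(2000, "40S60M")
fully_mapped = clip_breakpoints(500, "100M")
```

Dedicated SR tools go further and realign the unmapped portion to find where the other side of the breakpoint lies; this sketch only reports where the aligned part ends.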

THE READ DEPTH (RD) METHOD

The read depth (RD) method counts the reads obtained from the sequencing reaction and is based on the assumption that when a region is duplicated or deleted, there must be a proportional increase or reduction in the number of reads coming from that region. The read count can be compared with other genomic regions of the same sample or with the same genomic region of other samples. There are two main differences from the RP and SR methods: (1) the RD method can quantify the CNV, whereas the first two can only locate the variation; (2) the RD method can detect very large variations (although it may have problems detecting variations smaller than 1 kb). A limitation of the RD method is that it cannot locate the breakpoints of the deletion/duplication exactly. The following programs are based on the RD method: CNV-seq, BIC-seq, cn.MOPS, CNVnator, ERDS, RDXplorer, ReadDepth, SegSeq, CNVrd2.
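
The proportionality assumption can be illustrated with a short Python sketch that compares per-window read counts in a sample against the same windows in a control sample. The function name `copy_number_estimates`, the normalisation by total read count and the diploid baseline are assumptions made for this example; the programs listed above use more elaborate normalisation (for GC content and mappability, for instance) and proper segmentation.

```python
def copy_number_estimates(sample_counts, control_counts, ploidy=2):
    """Estimate copy number per window from read counts.

    Each window's count is scaled by the sample's total read count, so that
    samples sequenced to different depths can be compared, and the
    sample/control ratio is multiplied by the baseline ploidy.
    Windows with no control reads yield None.
    """
    sample_total = sum(sample_counts)
    control_total = sum(control_counts)
    estimates = []
    for s, c in zip(sample_counts, control_counts):
        if c == 0:
            estimates.append(None)  # no reference signal in this window
            continue
        ratio = (s * control_total) / (c * sample_total)
        estimates.append(ploidy * ratio)
    return estimates


# Illustrative counts for five consecutive windows.
sample = [100, 100, 50, 100, 150]
control = [100, 100, 100, 100, 100]
copy_numbers = copy_number_estimates(sample, control)
```

Here the third window has half the expected reads, which is consistent with a heterozygous deletion, and the fifth has 1.5 times the expected reads, which is consistent with a single-copy gain. Note that the estimate is per window, which is why the breakpoints can only be located to within the window size.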

THE ASSEMBLY-BASED (AS) METHOD

With this method, the contigs and scaffolds assembled from the patient's reads are compared directly with the reference sequence. It is rarely used because it demands tremendous computational power and can detect only homozygous variations. Magnolya is a program based on the assembly method.

THE COMBINED APPROACH (CA)

A combined approach can be used to overcome the limitations of each single method. The CA consists in using two or more of the single methods in a step-wise manner. Several programs based on the CA method are available: SVDetect, cnvHiTSeq, Clever-sv, CNVer, DELLY, GenomeSTRiP, Gindel, GASVPro, Hydra-Multi, LUMPY, PSCC, SoftSearch. The best match is probably the integration of the RP and RD methods, as it can virtually enable the detection of CNVs of any length together with the precise identification of the breakpoints.
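
As a rough illustration of a step-wise combination of RP and RD, the Python sketch below takes deletion candidates proposed by a read-pair step and keeps only those where the read-depth ratio is also reduced. Everything here is hypothetical: the function name `supported_deletions`, the 0-based half-open coordinates, the fixed window size and the 0.75 ratio cutoff are assumptions for the example, and none of the listed programs is claimed to work this way.

```python
def supported_deletions(candidates, window_ratios, window_size, max_ratio=0.75):
    """Keep RP deletion candidates that are also supported by read depth.

    candidates    : list of (start, end) intervals, 0-based, end exclusive
    window_ratios : sample/control depth ratio for consecutive windows
                    of window_size bp, starting at position 0
    max_ratio     : hypothetical cutoff; a mean ratio at or below it counts
                    as reduced depth
    """
    kept = []
    for start, end in candidates:
        first = start // window_size
        last = min((end - 1) // window_size, len(window_ratios) - 1)
        ratios = window_ratios[first:last + 1]
        if ratios and sum(ratios) / len(ratios) <= max_ratio:
            kept.append((start, end))
    return kept


# Six 1000 bp windows; depth is halved in windows 2 and 3.
window_ratios = [1.0, 1.0, 0.5, 0.5, 1.0, 1.0]
candidates = [(2000, 4000), (4000, 6000)]
confirmed = supported_deletions(candidates, window_ratios, window_size=1000)
```

In this toy case the RP step supplies the interval boundaries and the RD step supplies the evidence that copy number actually changed, so only the first candidate survives.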