LESSONS IN GENETICS: LABORATORY

Saturday, 9 May 2015

VAR-MD

VAR-MD has been designed to optimize annotation for Mendelian inherited disorders. It analyzes variants derived from whole genome or whole exome sequencing data coming from pedigrees. It outputs a list of putative pathogenic mutations based on inheritance models, genotype quality and allele frequency.
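
The kind of filtering described above can be sketched in a few lines. This is only an illustration and not VAR-MD's actual code: the `Variant` fields, the trio-based autosomal recessive rule and the thresholds (`min_gq`, `max_af`) are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    child_gt: str            # genotype of the affected child: "0/0", "0/1" or "1/1"
    mother_gt: str           # genotype of the unaffected mother
    father_gt: str           # genotype of the unaffected father
    genotype_quality: int    # Phred-scaled genotype quality (hypothetical field)
    allele_frequency: float  # population allele frequency (hypothetical field)


def fits_recessive(v):
    """Affected child homozygous, both unaffected parents heterozygous carriers."""
    return v.child_gt == "1/1" and v.mother_gt == "0/1" and v.father_gt == "0/1"


def candidate_variants(variants, min_gq=20, max_af=0.01):
    """Keep variants that fit the model, are well genotyped and are rare."""
    return [
        v for v in variants
        if fits_recessive(v)
        and v.genotype_quality >= min_gq
        and v.allele_frequency <= max_af
    ]


variants = [
    Variant("var_A", "1/1", "0/1", "0/1", 60, 0.0005),
    Variant("var_B", "1/1", "0/1", "0/1", 10, 0.0005),  # low genotype quality
    Variant("var_C", "1/1", "0/1", "0/1", 60, 0.20),    # too common
    Variant("var_D", "0/1", "0/1", "0/0", 60, 0.0005),  # does not fit the model
]
candidates = candidate_variants(variants)
print([v.name for v in candidates])
```

Only the variant that satisfies all three criteria survives the filter; a real pipeline would support several inheritance models and larger pedigrees.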

VARIANT

VARIANT (VARIant ANalysis Tool) is a web-based tool that interrogates several different databases in parallel. It utilizes dbSNP, 1000 Genomes, the GWAS catalog, OMIM and COSMIC.

VAT

VAT (Variant Analysis Tool) annotates variants that have been mapped onto a transcript after variant calling. It has been extensively used in the 1000 Genomes Project to annotate loss-of-function variants. Of note, VAT and ANNOVAR are the only annotation software capable of handling structural variations (e.g. large deletions/duplications).

VAAST

VAAST (Variant Annotation Analysis and Search Tool) uses an algorithm based on existing information on pathogenic variants to output a likelihood of pathogenicity. Unlike other software, it can also make predictions on non-coding variants.

SEATTLESEQ

SeattleSeq is a platform that integrates the output from other programs and databases to produce predictions on novel and known SNPs. It gives conservation scores, HapMap frequencies, Polyphen predictions and clinical associations. Its computations are also based on dbSNP via the Genome Variation Server.

SNPeffect

This software was designed to make predictions on mutations falling within coding regions. It relies on the information contained in UniProtKB, where there are currently data on more than 60,000 variant proteins. The software uses different algorithms such as TANGO (which detects regions prone to aggregation), WALTZ (which calculates the propensity of a certain region to produce amyloids) and LIMBO (which predicts chaperone binding sites for the Hsp70 chaperones).

snpEFF

snpEFF is open source software that can be used to rapidly categorize single nucleotide polymorphisms (SNPs), insertions and deletions. The Java-based program works with VCF files, is GATK compatible and classifies results as having high, moderate or low functional impact.
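
As a hedged illustration of how such impact categories can be tallied, the sketch below reads the `ANN` entry of the VCF INFO column, where the third pipe-separated subfield holds the impact (HIGH, MODERATE, LOW or MODIFIER). The VCF lines are made up and shortened for the example, and the helper name `impacts` is hypothetical.

```python
from collections import Counter


def impacts(vcf_line):
    """Return the impact category of every ANN entry on one VCF data line."""
    info = vcf_line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
    for field in info.split(";"):
        if field.startswith("ANN="):
            # One comma-separated entry per annotated effect;
            # the third subfield of each entry is the impact.
            return [entry.split("|")[2] for entry in field[4:].split(",")]
    return []


# Invented, truncated example lines (real ANN entries carry more subfields).
vcf_lines = [
    "1\t1000\t.\tG\tA\t50\tPASS\tANN=A|stop_gained|HIGH|GENE1|GENE1|transcript|T1|protein_coding",
    "1\t2000\t.\tC\tT\t50\tPASS\tDP=30;ANN=T|missense_variant|MODERATE|GENE2|GENE2|transcript|T2|protein_coding",
    "2\t3000\t.\tA\tG\t50\tPASS\tANN=G|synonymous_variant|LOW|GENE3|GENE3|transcript|T3|protein_coding",
]
counts = Counter(impact for line in vcf_lines for impact in impacts(line))
print(counts)
```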

VARIBENCH

VariBench is based on datasets of experimentally validated variations, against which variants can be compared to make predictions. The datasets of VariBench are compiled based on published literature and other public databases. The datasets are categorized in four sections:

1. Variants that affect protein tolerance.

2. Variants that affect protein stability.

3. Variants that affect transcription factor binding sites.

4. Variants that affect splicing sites (this section is actually very limited, as it contains information on just a couple of genes).

VariBench can also map variants to the sequences contained in RefSeq and within the 3D protein structures at Protein Data Bank (PDB).

Friday, 8 May 2015

POLYPHEN/POLYPHEN2

Polyphen, now available as Polyphen2, predicts the impact of a missense mutation based on (1) protein sequence, (2) phylogenetic information and (3) structural information. The software checks whether the mutation falls within a protein domain essential for binding to other molecules or for the formation of the secondary/tertiary structure. In particular, Polyphen2 looks at putative disulfide bonds, active sites, binding sites and transmembrane domains and makes computations on 3D models of the protein structure. Polyphen2 also looks at homologous proteins to see if the identified missense mutation has been observed in other proteins of the same family.

PROVEAN

PROVEAN (PROtein Variation Effect ANalyzer) is an in-silico analysis tool that predicts whether a missense mutation or an indel has an impact on the biological function of a protein. Like SIFT, PROVEAN is hosted by the J. Craig Venter Institute, which claims that its output is comparable to that of other software such as Polyphen2 or SIFT.

A variant of the software called PROVEAN HUMAN GENOME VARIANTS returns the results of PROVEAN and SIFT simultaneously.

SIFT

SIFT (Sorting Tolerant From Intolerant)

SIFT is an in-silico analysis tool that predicts pathogenicity based on the level of conservation of an amino acid residue across different species. The assumption is that residues which are essential for protein function must be highly conserved and that mutations affecting such residues are therefore very likely to be pathogenic.

The SIFT homepage is hosted by the J. Craig Venter Institute, where the PROVEAN tool is also available.
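
The conservation idea behind SIFT can be illustrated with a toy calculation. The sketch below is not the SIFT algorithm: it simply reports, for each column of a small invented alignment, the fraction of sequences that share the most common residue. The function name and the sequences are assumptions for illustration.

```python
from collections import Counter


def conservation(alignment, position):
    """Fraction of sequences sharing the most common residue in one column."""
    column = [sequence[position] for sequence in alignment]
    most_common_count = Counter(column).most_common(1)[0][1]
    return most_common_count / len(column)


# Invented alignment of the same protein segment in four species.
alignment = [
    "MKTAY",
    "MKSAY",
    "MKTGY",
    "MRTAY",
]
scores = [conservation(alignment, i) for i in range(len(alignment[0]))]
print(scores)
```

Columns where every species carries the same residue score 1.0; under the assumption described above, a substitution at such a position would be considered more suspicious than one at a variable position.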

LARGE DELETIONS/DUPLICATIONS TESTING ON NGS DATA

algorithms to do del/dup (CNV) testing on NGS data

CNV (Copy Number Variation) analysis - also known as large deletion/duplication testing - has traditionally been performed through dedicated techniques like MLPA (Multiplex Ligation-dependent Probe Amplification) or qPCR (quantitative PCR). Today, testing for large deletions/duplications is also available through 'simple' in-silico analysis of NGS data.
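
One common principle behind such in-silico analysis is read-depth comparison: an exon covered by roughly half (or one and a half times) the expected number of reads suggests a heterozygous deletion (or duplication). The sketch below is a deliberately simplified illustration of that principle; the thresholds, exon names and depth values are assumptions, and real tools use more robust normalisation and statistics.

```python
def copy_number_calls(sample_depth, reference_depth, loss=0.7, gain=1.3):
    """Label each exon from its normalised sample/reference depth ratio."""
    sample_total = sum(sample_depth.values())
    reference_total = sum(reference_depth.values())
    calls = {}
    for exon, depth in sample_depth.items():
        # Divide by the total depth so that runs of different size are comparable.
        ratio = (depth / sample_total) / (reference_depth[exon] / reference_total)
        if ratio < loss:
            calls[exon] = "deletion"
        elif ratio > gain:
            calls[exon] = "duplication"
        else:
            calls[exon] = "normal"
    return calls


# Invented mean read depths per exon.
reference_depth = {"exon1": 100, "exon2": 100, "exon3": 100, "exon4": 100}
sample_depth = {"exon1": 100, "exon2": 50, "exon3": 100, "exon4": 150}
calls = copy_number_calls(sample_depth, reference_depth)
print(calls)
```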

EXOME SEQUENCING MACHINES: ILLUMINA

The coding region (i.e. the part of the DNA which codes for proteins, commonly known as the exome) represents just 2% of the entire human genome, but it harbours more than 85% of all disease-causing mutations in humans. Exome sequencing is therefore the best and most cost-effective approach to investigate genetic diseases, especially where the cause (i.e. the gene) has not been discovered yet.

NANOBALLS FOR SEQUENCING

In our review of the most recent sequencing machinery, it is now time to mention the technology offered by Complete Genomics (Mountain View, California), a company which was bought in March 2013 by the Chinese giant of genomic sequencing services, BGI-Shenzhen.

MULTIPLEXING: A FORMIDABLE APPLICATION OF NEXT GENERATION SEQUENCING

MULTIPLEXING - sequencing of multiple samples in parallel

Modern Next Generation Sequencing (NGS) machinery makes it possible to process a large number of sequences in parallel. This has opened the way to the fast and cost-effective analysis of the entire genome of an individual (whole-genome sequencing). Similarly, it is also possible to analyze large parts of the genome of many individuals in parallel.

In other words, depending on the sequence size and the power of the machine available, it is possible to analyze more than one sample at a time. It is not uncommon, for example, to perform exome sequencing for two individuals in a single run. In microbiology, for instance, because the genomes of microorganisms are very small, it is possible to sequence the genomes of many viruses and bacteria simultaneously.

The analysis of multiple samples in parallel is called multiplexing. In order to perform a multiplexing analysis, it is necessary to add to each library fragment a short, patient-specific, synthetic sequence, also known as a barcode sequence. The barcode sequence works as a label to uniquely identify all the DNA fragments belonging to the same individual (or to the same microorganism). Once the DNA library fragments of every sample have been labelled with the barcode sequence, they can be pooled in the same tube to start the sequencing reaction. All the reads produced will also contain the barcode sequence. Thanks to the barcode sequences, it is then possible to separate the reads belonging to each sample (de-multiplexing) before proceeding to alignment and variant calling.

Multiplexing became a reality only thanks to NGS, since cost and time have always made such analysis prohibitive with Sanger capillary electrophoresis. NGS machines also minimize human intervention in the analysis, since there is no electropherogram to interpret and alignment and variant calling are fully automated. Not only that: the amount of DNA required for an NGS analysis (even in multiplexing) is much lower than what is needed for a Sanger analysis: only 30 ng of DNA may be enough to get an entire exome!
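
The de-multiplexing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the barcode is taken to be the first bases of each read, all barcodes have the same length, no sequencing errors are tolerated, and the barcodes, sample names and reads are invented.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by their leading barcode and strip the barcode from each read."""
    barcode_length = len(next(iter(barcodes)))  # assumes equal-length barcodes
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:barcode_length], read[barcode_length:]
        # Reads whose tag matches no known barcode are set aside.
        sample = barcodes.get(tag, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


barcodes = {"ACGT": "patient_1", "TGCA": "patient_2"}
reads = ["ACGTTTGACC", "TGCAGGATCA", "ACGTCCATGA", "GGGGAAAAAA"]
result = demultiplex(reads, barcodes)
print(result)
```

After this step, each sample's reads can be aligned and analyzed separately, exactly as if the samples had been sequenced in different runs.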

Friday, 28 November 2014

NEXT GENERATION SEQUENCING APPLICATIONS

WGS - WES - TARGETED SEQUENCING - TRANSCRIPTOME ANALYSIS

Next Generation Sequencing (NGS) represents the latest evolution of sequencing. For many years, sequencing was done by capillary electrophoresis (Sanger sequencing). Capillary electrophoresis sequencing allows us to reconstitute the sequence of a single DNA fragment. Sanger sequencing is done by recording the signals of incorporation of fluorochrome-labeled nucleotides, which are used to synthesize the strand complementary to the original DNA fragment. NGS is based on a similar principle; however, the reaction can be done for many fragments of DNA in parallel, not just one. Through NGS it is therefore possible to obtain an enormous amount of sequence (a single run can yield gigabases or terabases of information) more quickly and at a much lower cost. For this reason, NGS is also known as high-throughput sequencing.

In NGS, the DNA of an individual is broken into many small fragments (for example, using ultrasound) to constitute the so-called sequencing library. These small fragments serve as templates for the synthesis of numerous complementary fragments called reads. Each small fragment of DNA is copied many times, yielding a variable number of reads. Depending on the desired level of precision, the system can be set to achieve a certain level of coverage, i.e., a certain number of reads for each fragment of the library. For example, 30 reads per fragment, referred to in jargon as '30x coverage', are already sufficient for routine diagnostics of Mendelian diseases, while the detection of somatic mutations typical of tumors may require coverage of up to 1000x. A computer then collects all the reads and aligns them with the reference sequence of the human genome annotated in the databases. In this way the reads can be reassembled, as in a puzzle, to obtain the sequence of the gene or of the entire genome.

NGS machines available today are produced by many different brands and are very flexible devices. An NGS sequencer can actually be used for different types of applications:
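
A common back-of-the-envelope way to relate coverage to the amount of sequencing is: mean coverage = number of reads x read length / size of the target. The sketch below applies that estimate; the 50 Mb target size and the 100-base read length are assumptions for illustration, and the calculation ignores duplicates, off-target reads and uneven coverage.

```python
import math


def mean_coverage(read_count, read_length, target_size):
    """Average number of reads covering each base of the target."""
    return read_count * read_length / target_size


def reads_needed(coverage, read_length, target_size):
    """Number of reads required to reach a given mean coverage."""
    return math.ceil(coverage * target_size / read_length)


exome_size = 50_000_000  # assumed target of about 50 Mb
read_length = 100        # assumed read length in bases

reads_for_30x = reads_needed(30, read_length, exome_size)
print(reads_for_30x)
print(mean_coverage(reads_for_30x, read_length, exome_size))
```

Under these assumptions, a 1000x analysis of the same target would need about 33 times as many reads as a 30x analysis, which is one reason why very deep sequencing is usually restricted to small gene panels.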

1. Whole-genome sequencing (WGS) - also known as whole-genome shotgun, this is the analysis of the entire genome of an individual.

2. Whole-exome sequencing (WES) - analysis of the entire coding region of all the genes of an individual.

3. Targeted sequencing - analysis of a group of genes (panel) or of a single gene.

4. Transcriptome analysis - analysis of all RNA produced by a cell (transcriptome).

Applications 2 and 3 require an additional step (target enrichment) and can also be performed for many samples simultaneously through the technique known as multiplexing. The DNA of each individual can, in fact, be distinguished by attaching a specific sequence (barcode sequence) to it. Since barcode sequences are also sequenced, the reads belonging to each individual can be recognized and sorted before alignment.

Sunday, 5 January 2014

NEXT GENERATION SEQUENCING FOR DUMMIES!

After years in which Sanger sequencing was the gold standard of molecular genetic diagnostics, Next Generation Sequencing (NGS) is taking over. NGS is also known as high-throughput sequencing (high-yield sequencing), as it allows many fragments to be sequenced in parallel (which was impossible with traditional Sanger sequencing).
