Friday 16 January 2015

NLG: NATURAL LANGUAGE GENERATION

Natural Language Generation (NLG) is a computational process based on algorithms that try to transform structured data into sentences understandable by a human. The ultimate application of NLG is scientific curation. A computer can be taught to read a scientific article, extract the important data from it (e.g. evidence on genotype-phenotype correlations) and re-elaborate it into condensed statements that a scientist can quickly and easily understand without reading the original article. In theory. 

For instance, a computer could 'read' an article on EGFR gene mutations in lung cancer, extract the association between the L858R mutation and sensitivity to treatment with gefitinib, and finally output a statement like 'the somatic mutation L858R in lung cancer cells increases the sensitivity to the treatment with gefitinib'. In theory.
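In their simplest form, the 'generation' half of such systems is template-based: once the facts have been extracted, a fixed sentence pattern is filled in with them. A toy Python sketch of that last step, where the field names and template are illustrative assumptions rather than any real curation pipeline, might look like this:

```python
# A minimal, hypothetical sketch of template-based NLG: rendering a
# structured fact, already extracted from a paper, as a readable sentence.
# Field names and the template itself are illustrative assumptions.

def generate_statement(fact):
    """Render an extracted genotype-drug association as a sentence."""
    template = ("the somatic mutation {mutation} in {disease} cells "
                "{effect} the sensitivity to the treatment with {drug}")
    return template.format(**fact)

# Example fact, as the extraction step might produce it:
fact = {
    "mutation": "L858R",
    "disease": "lung cancer",
    "effect": "increases",
    "drug": "gefitinib",
}

print(generate_statement(fact))
# prints: the somatic mutation L858R in lung cancer cells increases
# the sensitivity to the treatment with gefitinib
```

Of course, the template is the easy part; the hard part, and the one discussed below, is getting the extraction right in the first place.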

Although it is said that machines are doubling their memory and computational capacity every year, we are still at a stage where algorithms may find it very difficult to understand, interpret and synthesize complex scientific concepts (and to distill them into correct, understandable human language). As a consequence, most NLG outputs derived from complex scientific data probably remain odd, contradictory and cumbersome today. To use another word, wrong.

The road to artificial intelligence is certainly exciting, and we all hope for future clever (yet friendly) machines that may be able to help humans as never before. However, data coming from scientific curation by NLG should still be used very, very cautiously.

NLG has been used so far mostly to curate scientific sources on somatic mutations in cancer. There is an incredible amount of data on cancer genetics out there, and the race to decode all the published evidence into clinical practice guidelines, realizing the dream of personalized medicine, has already started. Some companies and institutions are already using NLG for this purpose. However, it is precisely the amount of data, and the often arbitrary selection of publications (NLG often works on articles that have been preselected by a human), that makes it difficult to check the consistency of NLG statements and puts the reliability of NLG (and of databases built with it) in doubt. Considering that at the end of the process there is always a human patient, maximum caution is certainly warranted before physicians can confidently rely on NLG-based datasets.