Tuesday, September 25, 2012

The story of missing genes

The recent explosion in the number of genomes being sequenced has led to interesting experiments that use genome assemblies to test hypotheses about gene loss and gain, pathway enrichment, gene repertoire evolution and more. However, many reviews that have analysed NGS assemblies have shown that genome assembly quality and annotation methodology determine the apparent gene content.

Human genome assemblies generated from NGS data are completely missing more than 83 genes that are present in the Sanger genome assemblies. While no single functional category of genes is missing from the assemblies, genes located in regions that are difficult to assemble are often lost. MHC genes are notorious for being difficult to assemble. Analysis of the MHC region using draft assemblies can lead to underestimating gene content, or to predicting the loss of genes in a lineage, when the real cause is poor assembly quality.

Various programs have been developed to find these "missing genes", which are either lost in assembly gaps, too fragmented to be readily recognizable, or misassembled. For example, the IMAGE genome assembler tries to recover the parts that are lost in gaps. SOAP GapCloser also closes gaps in the assembly. GapFiller, from the makers of SSPACE, claims to be superior to both IMAGE and SOAP GapCloser in its ability to predict gap sizes and to use multiple libraries without a corresponding increase in memory usage. Replacing the N's in a genome assembly with the "correct bases" is very useful, because it recovers features of the genome that were essentially lost in gaps.
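Before running any gap-closing tool, it helps to know how many gaps an assembly has and how many N's they contain. The following is a minimal Python sketch of that bookkeeping, not a part of IMAGE, SOAP GapCloser or GapFiller. It assumes the assembly is a plain FASTA file in which gaps are written as runs of N; the function names (read_fasta, find_gaps, gap_summary) and the toy scaffolds are made up for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence_name: sequence}."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            header = line[1:].split()
            name = header[0] if header else ""
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}


def find_gaps(sequence: str, min_length: int = 1) -> List[Tuple[int, int, int]]:
    """Return (start, end, length) for every run of N in the sequence.

    Coordinates are 0-based and the end is exclusive.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        if length >= min_length:
            gaps.append((match.start(), match.end(), length))
    return gaps


def gap_summary(sequences: Dict[str, str]) -> Dict[str, Tuple[int, int]]:
    """Map each sequence name to (number_of_gaps, total_N_bases)."""
    summary = {}
    for name, seq in sequences.items():
        gaps = find_gaps(seq)
        summary[name] = (len(gaps), sum(length for _, _, length in gaps))
    return summary


# Hypothetical toy assembly: scaffold1 has two gaps, scaffold2 has none.
toy_assembly = [
    ">scaffold1 example",
    "ACGTNNNNAC",
    "GTNNAC",
    ">scaffold2",
    "ACGT",
]

if __name__ == "__main__":
    for name, (n_gaps, n_bases) in gap_summary(read_fasta(toy_assembly)).items():
        print(name, n_gaps, n_bases)
```

For a real assembly, read_fasta can be given an open file handle instead of the toy list. Running the same summary before and after gap closing gives a rough idea of how many N's a tool actually replaced, although it says nothing about whether the filled-in bases are correct.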

Assemblies based on longer PacBio reads, such as that of the budgerigar genome, are probably a way out of these poor assemblies. Genes like FOXP2 were found to be fragmented in assemblies that did not use the PBcR (PacBio corrected) reads. These fragmented genes could be recovered in the assemblies that used PBcR reads. However, a few regions could only be recovered by combining different technologies. It's not surprising that NGS assemblies have so many missing bits when even the "completed" human genome is still being updated 13 years after being published. Advances not only in sequencing technology but also in our understanding of the diversity in the genome and its representation will probably change the way we think of a genome assembly in the future.
