A few genomes are difficult to assemble with the current state of technology because of their unusual characteristics. The high cost of supporting methods such as Sanger sequencing, construction of fosmid or BAC libraries, and flow sorting of chromosomes means that these genomes are often left to NGS methods alone, which makes their assembly difficult. However, even Sanger-based methods struggle with some genomes, which are almost impossible to assemble to "completion". A genome can be difficult for reasons such as:
- Large size of genome: A very large genome (one with a great many bases) is difficult to sequence, mainly because of the higher cost of generating sufficient coverage. The largest known vertebrate genome is that of the lungfish, at 133 Gb, and the largest known plant genome is that of the canopy plant, at 150 Gb. An amoeboid, Polychaos dubium, might have the largest genome of all, at 670 Gb. Such large amounts of data are difficult to handle bioinformatically; in fact, most assemblers would be unable to handle the volume of data associated with these genomes. Moreover, these genomes are thought to be filled with repeats and genome duplication events, making their assembly even more complicated.
- Repeat content of genome: Certain genomes are known to have very high transposon activity, making them rich in repetitive content. These genomes need not be large, but they can still be difficult to assemble because of the almost identical copies of DNA spread throughout the genome.
- Extremes in GC content: Certain genomes are known to have very high or very low GC content. This makes them difficult to sequence because of the GC bias of NGS methods. Although GC content extremes are constrained by the requirements of the genetic code, Streptomyces coelicolor manages a GC content of 72%, while Plasmodium falciparum has just 20%. Apart from extremes in genome-wide GC content, individual parts of a genome can have extreme GC content, making those regions difficult to sequence.
- Rarity of sample: Some organisms are so rare that it is almost as if they were extinct. Finding such species and obtaining enough DNA from them can be almost impossible, and the situation is further complicated by various legal, ethical and technical issues. Rarity of sample can also result from the extremely small amounts of DNA available from certain species. Many microbial species cannot yet be cultivated in the lab, and the difficulty of obtaining enough DNA from them has driven research in metagenomics and, more recently, single-cell sequencing. DNA from extinct species is of lower quality and contains many artifacts, making correct genome assembly a daunting task. However, many of these problems have been overcome by novel methods, and extinct species such as the mammoth, Neanderthals and Denisovans have been sequenced and assembled to a quality comparable to that of other NGS genome assemblies.
- Genome definition inconsistencies: To assemble the genome of a species, it must be possible to define the species and what constitutes its genome. Symbiotic organisms can be difficult to delineate into distinct species because of their high degree of inter-dependence. The definition of a species is itself a controversial subject, and different interpretations of that definition make it controversial to claim that a particular "species" has been sequenced.
- Dynamic nature of genomes: Genomes are generally thought of as stable, inherited genetic material that remains exactly the same over short periods of time. However, genomes have many dynamic features. Telomeres change in length with age in almost all species. Similarly, small viral genomes with very high mutation rates can change drastically within the span of a few hours, making the virus completely immune to a drug or conferring a new phenotype. Such changes require re-sequencing of the genome to identify what has changed. Species whose chromosome number varies along a cline show another type of dynamism.
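
Two of the points above, the cost of coverage for large genomes and extremes in GC content, lend themselves to quick back-of-the-envelope checks. The following is a minimal Python sketch, not a tool from any particular assembler. The function names (`required_gigabases`, `gc_fraction`, `windowed_gc`), the 30x coverage target and the toy sequence are all illustrative assumptions; only the genome sizes come from the text.

```python
from typing import List, Tuple


def required_gigabases(genome_size_gb: float, coverage: float) -> float:
    """Raw sequence (in Gb) needed to reach a given mean coverage depth."""
    return genome_size_gb * coverage


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences with no unambiguous bases (e.g. empty or all N) report 0.0.
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """GC fraction in sliding windows, returned as (start, gc) pairs."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Genome sizes as quoted in the text; 30x is an assumed coverage target.
    genomes = [("lungfish", 133), ("canopy plant", 150), ("Polychaos dubium", 670)]
    for name, size_gb in genomes:
        needed = required_gigabases(size_gb, 30)
        print(f"{name}: about {needed:.0f} Gb of reads at 30x")

    # Toy sequence (made up) with an AT-rich, a GC-rich and a balanced block.
    toy = "ATATATATAT" + "GCGCGCGCGC" + "ATGCATGCAT"
    for start, gc in windowed_gc(toy, window=10, step=10):
        print(start, f"{gc:.2f}")
```

Even at a modest assumed depth, the largest genomes need thousands of gigabases of raw reads, which illustrates why both sequencing cost and data handling become limiting. The windowed calculation shows how local GC extremes can be located even when the genome-wide average looks unremarkable.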