Research in the Kim lab focuses on developing computer algorithms and statistical methods that enable accurate and rapid analysis of biological data, in particular sequencing data. Software developed in the lab includes several widely used programs, such as TopHat2, HISAT, TopHat-Fusion, and Centrifuge.
A major obstacle in the analysis of sequencing data today is the reliance on a single human reference genome for aligning sequencing reads. The human reference genome was assembled from only a few samples and thus does not reflect the genetic diversity across individuals and populations. Relying on it alone can introduce significant biases in downstream analyses, and disease-related genetic variants can be missed when they occur in regions that are not present in the reference genome.
To address these challenges, Dr. Kim recently developed a novel graph-based indexing scheme that captures a wide range of genetic variants while keeping memory requirements low. He has built a new alignment system, HISAT2, that enables fast search through this index. HISAT2 is the first and only practical method available for aligning sequencing reads to a graph at the scale of the human genome while requiring only the amount of memory typically available on a conventional desktop. Graph-based alignment achieves much higher sensitivity and accuracy than alignment to a linear reference, especially in highly polymorphic genomic regions such as HLA genes, DNA fingerprinting loci, and LINEs. The system also has the potential to perform unbiased alignment regardless of which individual genome is sequenced.
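The difference between linear and graph-based alignment can be illustrated with a toy example. The sketch below is not HISAT2's algorithm: HISAT2 searches an index built over the graph, whereas this example simply enumerates every sequence spelled by a small graph, which is only feasible for a handful of variants. The reference, the variant positions, and all function names are hypothetical and chosen for illustration.

```python
from itertools import product

def best_mismatches(read, sequence):
    """Fewest mismatches between read and any ungapped window of sequence."""
    n = len(read)
    return min(
        sum(a != b for a, b in zip(read, sequence[i:i + n]))
        for i in range(len(sequence) - n + 1)
    )

def graph_paths(reference, snps):
    """Yield every sequence spelled by a toy variant graph.

    snps maps a 0-based reference position to an alternate base.
    Enumeration is exponential in the number of variants (illustration only).
    """
    positions = sorted(snps)
    for choice in product([False, True], repeat=len(positions)):
        seq = list(reference)
        for pos, use_alt in zip(positions, choice):
            if use_alt:
                seq[pos] = snps[pos]
        yield "".join(seq)

def best_mismatches_graph(read, reference, snps):
    """Fewest mismatches between read and any path through the toy graph."""
    return min(best_mismatches(read, path) for path in graph_paths(reference, snps))

# Hypothetical data: a short reference, two known SNPs, and a read from an
# individual who carries both alternate alleles.
reference = "ACGTACGTTAGC"
known_snps = {5: "T", 9: "C"}
read = "TATGTTC"

print(best_mismatches(read, reference))                    # 2
print(best_mismatches_graph(read, reference, known_snps))  # 0
```

Against the linear reference the read is penalized for two mismatches even though both are known variants; against the graph it aligns exactly, which is the kind of reference bias the graph approach is meant to remove.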
Building on HISAT2, Dr. Kim plans to develop a practical software solution that can accurately analyze an individual’s genome, including its more than 20,000 genes, within a few hours on a desktop computer. The availability of an individual’s genetic information made possible by this proposed work is essential to advancing personalized medicine. The software will enable researchers to perform unbiased analyses of next-generation sequencing experiments more efficiently, furthering our understanding of tumorigenesis and helping to identify personalized treatments for cancer patients. Anyone with access to sequencing data will be able to perform these analyses easily using a single software package.