Applying Data Science, Bioinformatics and Networks to Cancer Genomics

Ann Nguyen:
Hi there, and welcome to this latest podcast from Cambridge Healthtech Institute for the 2015 Bio-IT World Conference & Expo, taking place April 21-23 in Boston, Massachusetts. My name is Ann Nguyen and I'm one of the Associate Conference Producers for the event.

Dr. Mark Gerstein, Albert L. Williams Professor of Biomedical Informatics at Yale University, is chatting with us today. He'll be speaking during the session on Databases, Sharing and Integration in the Clinical Genomics conference track.

Mark, thank you for finding some time for us.

How did you come to focus on bioinformatics and how has your work with applying data science to genomics evolved since you joined Yale in 1997?

Mark Gerstein:
I've focused on bioinformatics really since I graduated from college. First I was studying physics, and I got really interested in applying quantitative approaches to the natural world, which is of course where people with a physics background tend to go.

When I graduated from college, I was looking for a field to get into that was new to being quantified. This is the late 1980s now, and I looked around a bit and I got very interested in more biological things, biophysics and so forth, so that's where I really did my Ph.D.

I did my Ph.D. in England and I think I was one of the early people focusing on doing computational things with biological molecules. I started out with, as you can probably imagine, a more physical approach. I was very interested in the molecular mechanics of molecules and I really did a chemistry Ph.D. focusing on things like that. But then when I did my postdoc and when I started on the faculty, I got more into biological things. In particular, that was around the time when they started sequencing the first genomes, the Haemophilus influenzae genome, and then the human genome. I found this very exciting and so I got into genomics.

I guess over time, I followed along with the field. I started out more as a structural, protein-oriented person and then I got more into DNA genomics. And now I'm really into personal genomics and disease genomics, really paralleling the advent of next-gen sequencing as a driving force.

Ann Nguyen:
Your lab uses networks as a basis for representing, analyzing and integrating many types of biological data. What's so useful about networks?

Mark Gerstein:
I think networks are really great representations and in a sense they are a fundamental representation for this emerging area of data science now where people are starting to develop ways of analyzing big datasets and calculations and structures for that.

Networks are really useful because you can apply them in many, many different contexts. It’s a flexible and abstract representation, yet at the same time it really does help people understand things.

I think people find it very useful to see the network representation used in a traditional, biological context. But also they can have it applied in another context and that gives them some insight and intuition.

People often make networks of molecules, say, where one molecule regulates another, such as a transcription factor regulating a gene, or where two protein products physically interact. They can analyze these networks, but sometimes they don't have that much intuition for them because they are so removed from everyday experience.

However, you can construct very similar network diagrams for other things that people maybe have a little bit more intuition for. Obviously social networks would be the extreme, where people have a lot of intuition for them, but there are also things like electrical networks or neural networks that people have maybe a little bit more knowledge of.

I think what's needed is to see the same type of calculation applied in two contexts. One of the initial famous network calculations was the finding of the hubs in networks. These are the points with very large connectivity.

You can certainly find these points in molecular networks. But then when you make a comparison to, say, the transport network that we have and you talk about the airplane networks having hubs and you also talk about social networks having hubs where people are really connected to lots of other people, it gives people a little bit of intuition about how a hub might function.

The same is true, for instance, for another network concept like bottlenecks. Bottlenecks would be particular spots in the network where a lot of short paths go through. There is a lot of traffic through them, and people identify these in gene regulatory networks all the time but, again, with very little intuition.

When you look at, say, the transport network and you realize a bottleneck might correspond to a major bridge or a major tunnel where there is a lot of traffic being funneled through, again you get a lot of intuition. So I'm particularly enthusiastic about these types of comparisons that people can make.
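As an illustrative aside, the two network measures described above, hubs (high connectivity) and bottlenecks (many shortest paths passing through), can be sketched on a toy network. This is only a minimal sketch under stated assumptions: the nine-node network and its node labels are made up for illustration, the graph is undirected and unweighted, and bottlenecks are scored with the standard betweenness calculation (Brandes' algorithm), which may differ from what any particular lab uses.

```python
from collections import deque

def degree(adj):
    """Number of neighbors of each node; hubs are the highest-degree nodes."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def betweenness(adj):
    """For each node, the number of shortest paths between other node pairs
    that pass through it (Brandes' algorithm, undirected and unweighted)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

# Hypothetical toy network: two tightly knit groups joined through node "x".
network = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "x"},
    "x": {"d", "e"},
    "e": {"x", "f", "g", "h"},
    "f": {"e", "g", "h"}, "g": {"e", "f", "h"}, "h": {"e", "f", "g"},
}

deg = degree(network)
btw = betweenness(network)
hubs = sorted(n for n in deg if deg[n] == max(deg.values()))
bottleneck = max(btw, key=btw.get)
print("hubs:", hubs, "bottleneck:", bottleneck)
```

In this toy case the most connected nodes are "d" and "e", while the node carrying the most shortest paths is "x", which has only two connections. It plays the role of the bridge or tunnel in the transport analogy.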

Ann Nguyen:
Can you broadly describe some of your lab’s progress with analyzing cancer genomes, from identifying and annotating personal genome variants to determining each mutation’s impact?

Mark Gerstein:
We got into analyzing cancer genomes as an outgrowth of the work we started on personal genomes. Essentially, cancer genomics is one of the most direct applications of personal genomics, a place where it’s really quite useful.

In personal genomics, what one tries to do, in addition to looking at the genome generally, is to look at the variants in a particular person. The average person has 3 to 4 million single-nucleotide polymorphisms and a few thousand block variants, structural variants.

People often want to know what those variants do, and some of them have obvious functional aspects. They might hit a gene. For other ones, it's a little hard to see what they're doing. Maybe they change the regulation of genes by affecting their binding sites.

In cancer genomics what one wants to do is then to apply that logic, but now one is looking at the cancer genome. The cancer genome is really the personal genome with the variants from the cancer, the somatic variants, and usually these are considerably fewer than the natural polymorphisms. Usually, say, 5,000 or so, which is far fewer but still quite a large number when you need to figure out which ones are important, the driver events, and which ones are collateral damage, the passenger events. The mindset is that if you understand which of the alterations are key in the cancer, potentially they could be useful in targeting drugs to the cancer or even designing a treatment for a person.

What we've focused on very much is the non-coding mutations. The bulk of the mutations in the personal genome and also the somatic disease mutations in the cancer genome are non-coding. You probably know that only about 1% of the genome is coding, and the remaining 99%, the overall bulk, is non-coding.

Most of the mutations in the cancer genome are in this non-coding region, but people haven't focused on them as much because it’s harder to understand what they do. We’ve developed ways of trying to interpret them so we’ve taken all the somatic mutations and seen if they affect any of the non-coding annotations such as transcription factor binding sites or non-coding RNAs.

Then we have a way of highlighting the transcription factor binding sites and non-coding RNA annotations that are maybe most disabling if they are affected. For instance, they would be the hubs in networks, say, the regulatory network, or they would be regions in the genome that have very little natural variation, where we suspect that having a cancer variant would be very deleterious.

We have a way of prioritizing these non-coding variants and non-coding somatic variants based on the annotation. Then we can take the, say, 5,000 in the cancer genome and highlight the maybe 5 that we think are having the strongest effect. That’s a useful thing, because someone can actually take a number like 5, test them in the lab and really start to think about what they do.
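To make the idea of annotation-based prioritization concrete, here is a minimal sketch. It is not FunSeq's actual scoring scheme: the `Annotation` fields, the one-point-per-feature weights, and all coordinates are hypothetical, chosen only to show variants being ranked by whether they fall in a non-coding annotation, whether that annotation is a network hub, and whether the region shows little natural variation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    chrom: str
    start: int                   # 0-based, inclusive
    end: int                     # exclusive
    kind: str                    # e.g. "tf_binding_site" or "ncRNA"
    is_network_hub: bool         # element is a hub in a regulatory network
    low_natural_variation: bool  # region rarely varies among healthy people

def score_variant(chrom, pos, annotations):
    """Toy score: 1 point for hitting an annotation, plus 1 for each flag set."""
    score = 0
    for a in annotations:
        if a.chrom == chrom and a.start <= pos < a.end:
            score += 1 + int(a.is_network_hub) + int(a.low_natural_variation)
    return score

def prioritize(variants, annotations, top_n=5):
    """Return up to top_n (score, chrom, pos) tuples, highest score first.
    Variants that hit no annotation are dropped."""
    scored = [(score_variant(c, p, annotations), c, p) for c, p in variants]
    hits = [t for t in scored if t[0] > 0]
    hits.sort(key=lambda t: t[0], reverse=True)
    return hits[:top_n]

# Made-up annotations and somatic variant positions, for illustration only.
annotations = [
    Annotation("chr1", 100, 120, "tf_binding_site", True, True),
    Annotation("chr1", 500, 530, "ncRNA", False, True),
    Annotation("chr2", 40, 60, "tf_binding_site", False, False),
]
variants = [("chr2", 900), ("chr2", 45), ("chr1", 510), ("chr1", 105), ("chr3", 10)]

ranked = prioritize(variants, annotations, top_n=2)
print(ranked)
```

With a real cancer genome the input would be the roughly 5,000 somatic variants, and `top_n` would be the handful that someone can realistically test in the lab.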

That’s what we’ve been focusing on for cancer genomics. We developed a tool a while ago that does all this. It’s a computer tool. We call it FunSeq. We now have a second version, FunSeq 2, that does this type of prioritization.

The new version of the tool also includes a recurrence analysis. People usually think that if you look at, say, 10 cancers and you find more mutations than you'd expect hitting a particular non-coding region, that region may be important. So we can incorporate this recurrence or burden analysis in the newer FunSeq as well as the function-based prioritization. That’s what we’ve been focusing on for cancer genomics.
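The reasoning behind a recurrence test can be sketched with a simple calculation. This is a deliberately simplified illustration and not the method used in FunSeq 2: it assumes mutations fall uniformly at random across a genome of about 3 billion bases, that every sample carries 5,000 somatic mutations, and that samples are independent. The region length and the observed count are made-up numbers.

```python
from math import comb

def prob_region_hit(n_mutations, region_len, genome_len):
    """Chance that at least one of a sample's mutations lands in the region,
    assuming mutations fall uniformly at random (a strong simplification)."""
    return 1.0 - (1.0 - region_len / genome_len) ** n_mutations

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    hit samples out of n by coincidence alone."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Assumed inputs: 10 cancers, 5,000 somatic mutations each, and a
# hypothetical 1,000-base non-coding region mutated in 3 of the 10.
p_hit = prob_region_hit(n_mutations=5000, region_len=1000, genome_len=3_000_000_000)
p_value = binomial_tail(k=3, n=10, p=p_hit)
print(f"per-sample hit probability: {p_hit:.5f}")
print(f"chance of 3+ hits in 10 samples: {p_value:.2e}")
```

Under these assumptions, three hits in ten samples would be very unlikely by chance, which is the sense in which a recurrently mutated region "may be important". Real mutation rates vary widely along the genome, so a practical analysis would need a more careful background model.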

Ann Nguyen:
I know there’s so much more info and data that you could share and I can tell that you’d love to share it. So we’re looking forward to hearing more specifics later at the event. For now though, Mark, thank you again for your time.

That was Mark Gerstein of Yale. He'll be joining the Clinical Genomics conference at the upcoming Bio-IT World Conference & Expo, running April 21-23 in Boston.

If you'd like to hear him in person, go to www.bio-itworldexpo.com for registration information and enter the keycode “Podcast”.

This is Ann Nguyen. Thanks for listening.
