Cloud Computing and Experimental Research: Compatibility, Access, Virtualization and Scale


Ann Nguyen:
Hello and welcome to this Cambridge Healthtech Institute podcast for 2015’s Bio-IT World Conference & Expo, running April 21-23 in Boston, Massachusetts. I’m Ann Nguyen. I’m one of the event’s Associate Conference Producers.

Kate Keahey is joining us today to discuss some of her work and insights as related to cloud computing. She is a Senior Fellow with the Computation Institute of the University of Chicago and Argonne National Laboratory and Principal Investigator with Chameleon.

Kate, thanks so much for your time.

Kate Keahey:
Okay.

Ann Nguyen:
How did you end up working in infrastructure cloud computing with both the University of Chicago and Chameleon, and do your projects and resources intertwine?

Kate Keahey:
I started working at Argonne National Lab and the University of Chicago about 15 years ago, somewhere around 2001. I was working in the grid computing group. Now, grid computing, as you may know, is about accessing remote resources, using remote compute resources and storage for your computation. I was working with some of the communities, and they were fascinated by the concept of grid computing, but unfortunately they occasionally had trouble using it. The problem was that many scientific codes are very complex and hard to port to the remote resources that grid computing allows you to access.

There are many reasons for that. There is operating system incompatibility, problems with compiling libraries, things as simple as environment variables not being set up right. In other words, those codes are brittle, and yet they are very often groundbreaking and extremely important for their communities – very impactful, leading to many discoveries and serving many users. People wanted to run those codes on grid resources, but they were hitting this problem that they could not run them on resources that were not configured just right.

Of course, if you have many people, many groups, or even whole communities sharing a resource, and everybody wants their own configuration, that's very hard to reconcile. That was one problem we were hitting against.

The second problem was that I worked with many experimental communities that needed on-demand access to compute resources in order to support the experiments they were running. Most of the resources provided by grid computing were accessed through batch queues: you submit a job, the job stays in the queue for a long time – you don't know how long – and eventually gets executed. If you need quick feedback, this is unacceptable, because you need the guarantee that the execution will at least start within some critical time.

Those were two very fundamental issues: one was the ability to port code – compatibility with the different remote platforms – and the second was the ability to start running on demand. We experimented with many, many different solutions. We tried everything we could, and eventually we hit on the idea of using virtual machines to support those codes.

Virtual machines give you essentially a representation of a real machine, except that the machine is configured to support your specific environment, so users could be guaranteed to run their codes in a virtual machine. Then, if we could run a virtual machine on a remote resource, that solves the problem of compatibility of the code with the remote resource. We also said, “Well, we’re going to deploy them on demand, because there are all of those experimental communities that need answers in real time or near-real time.”

This is how we started working on cloud computing. That would have been 2001, 2002. The virtualization technology, at that time, was very expensive and not very high performing.

In 2003, everything changed when Xen virtualization came out – the Xen hypervisor. It was very fast, and it was open source, so it was free. It became possible to build a system that would take a virtual machine and deploy it on remote resources, solving those two problems I was talking about. We built such a system, and it eventually became known as Nimbus. We had the first production release in mid-2005. That was about a year before Amazon EC2 came out in mid-2006. At that time, we didn't really have the term “cloud computing”; we didn't have the notion of cloud computing. But this is essentially what we were doing – and what Amazon EC2 was doing later and is doing today. In fact, a few years later, we decided to provide interface compatibility with Amazon, so that users could move from Nimbus to Amazon and back. That took us very little time; essentially the interfaces were already the same, and we just had to make cosmetic changes. So anyhow, this is how I got interested in cloud computing, and this is how my team and I got involved in developing Nimbus, which today is recognized as the first open-source infrastructure-as-a-service implementation.
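To make the idea of interface compatibility concrete, here is a minimal sketch of launching an on-demand virtual machine through an EC2-style API: the same client code can point at Amazon EC2 or at an EC2-compatible service simply by changing the endpoint. The endpoint URL, image ID, and instance type below are hypothetical placeholders, not details from the interview.

```python
# Minimal sketch of EC2-style interface compatibility: the same client code
# can target Amazon EC2 or an EC2-compatible service by changing the endpoint.
# Credentials are read from the environment or AWS config, as usual for boto3.
import boto3

ec2 = boto3.client(
    "ec2",
    region_name="us-east-1",
    endpoint_url="https://cloud.example.org:8444",  # hypothetical EC2-compatible endpoint
)

# Launch one on-demand virtual machine from a pre-built image that already
# contains the scientific code and its configured environment.
response = ec2.run_instances(
    ImageId="ami-12345678",   # placeholder image ID
    InstanceType="m1.small",  # placeholder instance type
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print("Started instance:", instance_id)
```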

Now, Chameleon is a very logical extension of all those efforts. Obviously, I’m still working in cloud computing. These days my team is working more at the platform level rather than the infrastructure level. Chameleon is building an infrastructure that will provide resources for cloud computing research – something that, up till now, we didn’t really have.

Right now, we are just at the beginning of understanding what cloud computing can give us: what the exciting applications are, what the interesting interaction patterns are and so forth.

There are many exciting open issues in cloud computing. Is cloud computing suitable for HPC, for example? What types of applications is it suitable for? Whether it can provide a platform for cyber-physical systems is a very exciting issue. There was really no test bed to investigate all of those issues on, and now there is. That test bed is going to come from the two projects funded by NSFCloud; Chameleon is one of them.

Ann Nguyen:
Can you give us a quick preview of your presentation on “Chameleon: A Large-Scale, Reconfigurable Experimental Environment for Cloud Research,” slated for April 22?

Kate Keahey:
So Chameleon was designed around the idea that any research in cloud computing right now has to look at scale. I mentioned earlier the challenge of whether cloud computing is suitable for high-performance computing, for example, for supercomputers. If it’s not, then we’re going to end up with two different models, right? We’re going to end up with supercomputers as we see them today, and then we’re going to end up with 90% of the world running on cloud computing. Maintaining two different models like that can be very expensive.

People are researching all sorts of interesting questions there: whether we can make communication under virtualization faster, whether we can replace heavier-weight hypervisors with lighter-weight virtualization technologies such as Docker, or those coming out of several research projects that are open right now.

There’s also a lot of interest in data-intensive computing, in other words, big data. We’ve got research on big compute, we’ve got research on big data, and as I mentioned earlier, whether cloud computing can become a platform for cyber-physical systems is another very important area.

So we conceived this system, a research system built around those big questions. Can we scale cloud computing for compute? Can we scale cloud computing for data processing? Can we scale it for big instruments? By that I mean millions of sensors. Imagine we could instrument the planet and stream the data from different sensors – atmospheric sensors, demographic sensors, air pollution sensors – and combine them with social networks. We can imagine that that’s a lot of data coming into the cloud in a very volatile pattern.

Can we develop efficient algorithms to store and process all of that? A large-scale platform like that is going to have, obviously, at least one large homogeneous partition so that we can explore the big compute issues. It’s going to have a lot of storage in order to address big data, and it’s going to support interesting data patterns. So we designed a test bed composed of about 15,000 cores, and we are going to have one large homogeneous partition. We’re going to have only two sites, but those two sites are going to be connected by a 100-gigabit network, so in other words, you’re going to be able to experiment with large flows for big data applications. We’re going to have 5 petabytes of storage, again for big data, so that users can store experimental data on the test bed and don’t have to go far to fetch it. It’s all available on the test bed and can be used with those applications.

In addition, this is going to be highly reconfigurable. We’re trying to recreate the conditions that a scientist has in their lab as nearly as possible. In their lab, scientists have root access to machines. Not only that, they can reboot those machines whenever they want, they can power them on and off, and they can get console access to them.

We’re building infrastructure that will offer that level of reconfiguration to users. We’re building this infrastructure on OpenStack, which is itself a cloud computing technology. A recent OpenStack component, Ironic, allows users to do bare-hardware reconfiguration of machines, and we’re leveraging that feature to give bare-hardware reconfiguration to our users. Then, of course, we have to enhance the capabilities currently available in OpenStack quite significantly – to provide virtualization, to provide all those different power-on/power-off features, to provide more extensive monitoring – so that people can easily find out what’s happening in their experiments, and so forth.
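As an illustration of that bare-hardware workflow, the sketch below uses the OpenStack SDK to deploy an image onto a physical node through the Ironic-backed compute API and then power-cycle it – the kind of root-level, reboot-anytime control described above. The cloud name, image, flavor, and network names are hypothetical placeholders, and Chameleon’s actual workflow includes additional steps (such as reserving nodes in advance) that this sketch omits.

```python
# Minimal sketch: deploy an image onto a bare-metal node via the
# Ironic-backed OpenStack compute API, then hard-reboot it.
# Cloud, image, flavor, and network names below are placeholders.
import openstack

conn = openstack.connect(cloud="chameleon")  # credentials read from clouds.yaml

image = conn.compute.find_image("CC-CentOS7")      # placeholder image name
flavor = conn.compute.find_flavor("baremetal")     # placeholder bare-metal flavor
network = conn.network.find_network("sharednet1")  # placeholder network name

# With Ironic behind the compute API, this provisions physical hardware
# rather than a virtual machine, giving the experimenter root on the node.
server = conn.compute.create_server(
    name="experiment-node-0",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)

# The user controls power operations as well, e.g. a hard reboot.
conn.compute.reboot_server(server, reboot_type="HARD")
```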

In addition to all this, we’re also trying to organize a partnership of industry and academia around Chameleon. One of my co-PIs on Chameleon, Paul Rad, is a Vice President at Rackspace, which is a wonderful cloud company. We’re trying to make sure that researchers have access, for example, to traces of workloads running at places such as Rackspace or Google – companies we are partnering with – but also, for example, CERN. CERN has a large cloud. It’s interesting to see what scientific data patterns need, what kinds of algorithms they require, and how to research them.

In other words, we’re trying to create a sort of one-stop shop for experimental scientists to come to the cloud and run various experiments addressing different challenges. In addition to all this, we will also have a production OpenStack cloud deployed, and that is mainly for users with innovative applications: users who want to experiment with cloud computing but are not sure whether it is for them, who want to try doing something completely new and different and maybe don’t yet have the funds to go to production, so they need to do a proof of concept or a pilot project first to show their sponsors that something like that is possible. Or maybe they need to run performance tests on their application to see whether it is suitable for cloud computing at all, or something similar. Or they just want to experiment with completely new and different application patterns, such as for cyber-physical systems.

Ann Nguyen:
We appreciate the glimpses of your work on cloud computing so far, and definitely look forward to going more in-depth when you’re at the podium.

That was Kate Keahey of University of Chicago, Argonne National Laboratory and Chameleon. She’ll be speaking during the Flexibility: IT Infrastructure session for the Cloud Computing conference at the upcoming Bio-IT World Conference & Expo, happening April 21-23 in Boston.

To hear her in person, go to www.Bio-ITWorldExpo.com for registration information and enter the keycode “Podcast”.

I’m Ann Nguyen. Thank you for listening.

