Charged with storing and analysing hundreds of petabytes of life sciences research data, the European Bioinformatics Institute (EMBL-EBI) has a ravenous appetite for storage and compute infrastructure. Equipped with a £45m grant from UK Research and Innovation, the organisation recently struck a five-year strategic partnership with Google Cloud to help accelerate cloud adoption. Tech Monitor spoke to technical director Steven Newhouse about the aims of the deal and how EMBL-EBI is pursuing a hybrid, multicloud strategy.

300 petabytes of raw storage and over 50,000 cores of computing is not enough for EMBL-EBI. (Photo courtesy of EMBL-EBI)

EMBL-EBI’s growing appetite for data

Funded by a coalition of European countries and part of the European Molecular Biology Laboratory, Cambridge-headquartered EMBL-EBI collects and publishes open life sciences data from around the world to facilitate biomedical research. “Although we have Europe in our title, we’re a global reference point to support open science across the life sciences community,” explains Newhouse.

As the ability to sequence genetic material gets cheaper and more widespread, and digital healthcare systems create ever more in-depth data, the volume of data EMBL-EBI needs to store is expanding rapidly. “We’ve been consistently seeing 50% data growth each year over the last decade,” Newhouse says.
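To put that growth rate in perspective, the short Python sketch below works through the compounding arithmetic. It assumes a constant 50% annual rate, as quoted; the function names are illustrative and do not come from EMBL-EBI.

```python
import math


def growth_factor(annual_rate: float, years: int) -> float:
    """Total multiplier after compounding `annual_rate` for `years` years."""
    return (1 + annual_rate) ** years


def doubling_time(annual_rate: float) -> float:
    """Years needed for the data volume to double at a constant annual rate."""
    return math.log(2) / math.log(1 + annual_rate)


if __name__ == "__main__":
    # 50% growth per year, sustained for a decade (figures from the quote above).
    print(f"Growth over 10 years: {growth_factor(0.5, 10):.1f}x")
    print(f"Doubling time: {doubling_time(0.5):.2f} years")
```

Under that assumption, a decade of 50% annual growth multiplies the archive roughly 58-fold, with the volume doubling about every 1.7 years.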

One of EMBL-EBI’s projects is the UK Biobank, a database containing genetic and health information of half a million UK volunteer participants. As time goes on, the Biobank provides researchers with a longitudinal dataset through which to explore the relationship between genes and health outcomes, says Newhouse. “As a researcher, you will be able to say ‘I’m interested in having samples of people who have had this particular type of cancer’ and they can get all the information they need.”

In future, the Biobank may encompass new datasets, Newhouse says, such as data from fitness devices. These can provide “a much fuller picture” of an individual’s health and lifestyle than a traditional survey, he explains.

The pandemic has been another driver for data expansion at EMBL-EBI. Alongside partner organisations across Europe, EMBL-EBI established the Covid-19 Data Portal, allowing researchers to share data including protein sequences, medical images, research papers and more. In October, the Portal surpassed 3 million published sequences. “It has grown to encompass our data as well as data from other sources, into a global community hub built around understanding Covid-19 activity,” says Newhouse.

EMBL-EBI hybrid cloud strategy

Handling this data calls for considerable IT infrastructure. “We operate across three data centres in the UK, with over 300 petabytes of raw storage and about 50,000-plus cores of computing,” Newhouse explains. This voluminous archive is managed by a staff of 500 developers, a considerable chunk of the institute’s 850 employees.

Even this is not enough to meet demand, however: “We are always in need of growing our storage infrastructure, our analysis infrastructure,” and the publishing infrastructure that distributes data to users, explains Newhouse.

EMBL-EBI therefore augments its own facilities with public cloud services. These help improve the service it offers to users, it says, for example by hosting data or services closer to where those users are located. Given the sensitive nature of much of the data it handles, EMBL-EBI has implemented strict controls to govern which data can be stored in the public cloud and in what circumstances. Data is classified as ‘very confidential’, ‘confidential’, ‘internal’ or ‘public’; confidential data can only be moved to the public cloud with the express permission of the data controller.
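A rule of this kind could be expressed as a simple policy check. The Python sketch below is illustrative only and is not EMBL-EBI’s implementation: the four tier names and the permission requirement for confidential data come from the description above, while the treatment of the other three tiers (public data allowed, everything else denied by default) is an assumption made for the example, as are all function and class names.

```python
from enum import Enum


class Classification(Enum):
    """The four classification tiers named in the article."""
    VERY_CONFIDENTIAL = "very confidential"
    CONFIDENTIAL = "confidential"
    INTERNAL = "internal"
    PUBLIC = "public"


def may_move_to_public_cloud(level: Classification,
                             controller_approved: bool = False) -> bool:
    """Return True if data at `level` may be placed in the public cloud.

    Only the 'confidential' rule is taken from the article; the other
    branches are assumptions made for this sketch.
    """
    if level is Classification.PUBLIC:
        # Assumption: openly published data carries no restriction.
        return True
    if level is Classification.CONFIDENTIAL:
        # From the article: requires the data controller's express permission.
        return controller_approved
    # Assumption: the article does not describe the rules for
    # 'very confidential' or 'internal' data, so this sketch denies by default.
    return False


if __name__ == "__main__":
    print(may_move_to_public_cloud(Classification.CONFIDENTIAL))
    print(may_move_to_public_cloud(Classification.CONFIDENTIAL,
                                   controller_approved=True))
```

Denying by default for the tiers the article leaves unspecified is the conservative choice: a real policy would replace that branch with whatever rules the institute actually applies.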

The organisation is pursuing a hybrid multicloud strategy, drawing on services from providers including Google, AWS, and the European Open Science Cloud. “This ensures that the institute’s cloud infrastructure is flexible and can support the diverse needs of the different teams based at the institute,” it says.

EMBL-EBI’s Google Cloud partnership

Earlier this month, EMBL-EBI struck a five-year strategic partnership with Google Cloud. The two organisations are already collaborators: EMBL-EBI worked with Google-owned DeepMind to develop AlphaFold DB, an open-access database of protein structures. In addition to providing access to storage, compute and AI services, Google will help train EMBL-EBI’s staff in “building, deploying, and using cloud-native applications”.

One attraction of Google Cloud is the scale of its infrastructure, says Newhouse. “Being able to leverage Google Cloud’s global infrastructure for [the data] publishing component is incredibly interesting for us.”

Another is the opportunity to access non-traditional technology platforms that Google offers today, and may offer in future, Newhouse explains. “One of the things we’ve already seen through the relationship with Google is the opportunity it provides us to dip into exotic hardware, new GPU architectures and so forth, that we don’t have in-house,” he says. Quantum computing, he adds, “is another class of hardware that we will be able to dip into in the years to come”.