Meta recently completed an exascale data migration to a new architecture, and despite its size the Facebook parent company ran into issues familiar to any business migrating data: technical debt and insufficient staff training. The migration exposed systems built and patched together over years, with legacy dependencies and integrations that needed to be modernised, the tech giant has revealed. One analyst told Tech Monitor this type of change is often essential for modern AI and machine learning applications.
The social media giant has moved its data to improve performance and allow for improved scaling as it integrates artificial intelligence into more of its workflow. Meta first confirmed it needed to upgrade its data infrastructure in December last year after finding it was struggling with scale as it needed to access data from a more diverse range of sources.
“Over time, the data platform has morphed into various forms as the needs of the company have grown,” Meta’s engineers wrote in a blog post. “What was a modest data platform in the early days has grown into an exabyte-scale platform. Some systems serving a smaller scale began showing signs of being insufficient for the increased demands that were placed on them.”
Most of these problems concerned the reliability and efficiency of its data, particularly when re-using and analysing it in different ways. Meta found that improved data logging and serialisation was the solution, allowing the data to describe itself more efficiently – but changing legacy systems isn’t easy.
Meta’s engineers built Tulip, a new serialisation format designed to improve logging across its data platform. Serialisation is the process of converting a data object held in memory into a series of bytes so that it can be stored or transmitted more easily.
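Tulip itself is internal to Meta, but the general idea of serialisation can be illustrated with a small Python sketch. The fixed binary layout and field names here are hypothetical stand-ins, not Meta’s actual format:

```python
import struct

# Hypothetical log record: (user_id, event_code, timestamp) packed
# into a fixed binary layout -- an illustration of serialisation,
# not Meta's actual Tulip format.
RECORD_FORMAT = ">QHI"  # 8-byte id, 2-byte event code, 4-byte timestamp

def serialise(user_id: int, event: int, timestamp: int) -> bytes:
    """Convert the record into a compact series of bytes."""
    return struct.pack(RECORD_FORMAT, user_id, event, timestamp)

def deserialise(payload: bytes) -> tuple:
    """Reconstruct the record from its byte representation."""
    return struct.unpack(RECORD_FORMAT, payload)

record = serialise(42, 7, 1_660_000_000)
assert len(record) == 14  # fixed, compact size
assert deserialise(record) == (42, 7, 1_660_000_000)
```

Because the layout is declared once, every record carries only its values – the property that lets a schema-aware format “describe itself” without repeating metadata in each payload.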
Meta data migration: storage and CPU cycles
Switching over to Tulip helped Meta save on both storage and CPU cycles. The team found that, at the high end, data stored in Tulip required up to 85% fewer bytes and up to 90% fewer CPU cycles than the previous system. “Making huge bets such as the transformation of serialisation formats across the entire data platform is challenging in the short term, but it offers long-term benefits and leads to evolution over time,” the engineers wrote.
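The 85% figure is Meta’s own measurement, but the mechanism behind such savings can be sketched: a text-based encoding repeats field names in every record, while a schema-based binary encoding stores only the values. The record below and its layout are illustrative assumptions:

```python
import json
import struct

record = {"user_id": 123456789, "event": 7, "ts": 1_660_000_000}

# Text-based encoding repeats field names in every record.
text_bytes = json.dumps(record).encode()

# Schema-based binary encoding stores only the values; the schema
# (">QHI") is declared once, not shipped inside every record.
binary_bytes = struct.pack(">QHI",
                           record["user_id"], record["event"], record["ts"])

saving = 1 - len(binary_bytes) / len(text_bytes)
print(f"{len(text_bytes)} bytes as JSON vs {len(binary_bytes)} as binary "
      f"({saving:.0%} smaller)")
```

Even this toy example shrinks the payload by well over half; at exabyte scale, and with CPU spent parsing text, the same principle compounds into the kind of savings Meta reports.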
The problem was that the team was dealing with more than 30,000 logging schemas generated over almost two decades, built with different formats, equipment and code. The migration proved to be a four-year battle for the engineers because of this “tech debt” built up since 2004.
They discovered that some of the data couldn’t easily be ingested or converted, some would be too expensive in compute terms to convert, and some of the tools built to ease the migration developed bugs or surfaced problems during the process.
To stop these problems from getting out of hand, the Meta engineers wrote that they employed rate limiters and had to closely monitor ingestion, particularly for the trickier schemas.
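Meta hasn’t published its rate-limiting code, but a token bucket is one common way to cap ingestion throughput. The sketch below is a minimal, hypothetical illustration of the technique, not Meta’s implementation:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: tokens refill at a steady
    rate, and each request spends one; requests beyond the budget
    are rejected until tokens accumulate again."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=100, capacity=10)
allowed = sum(bucket.allow() for _ in range(50))
# Only the initial burst (roughly the bucket's capacity) passes at once.
```

Placing a limiter like this in front of schema conversion lets a migration proceed continuously without starving production ingestion of capacity.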
“A tool was built so that an engineer just needed to run a script that would automatically target unmigrated logging schemas for conversion. We also built tools to detect potential data loss for the targeted logging schemas,” the team wrote. “Eventually, this tooling was run daily by a cron-like scheduling system.”
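The blog post doesn’t show that tooling, but the workflow it describes – scan for unmigrated schemas, convert a bounded batch, repeat daily – can be sketched as follows. The registry structure, format labels and `migrate`-style helper are all hypothetical stand-ins:

```python
def find_unmigrated(registry: dict) -> list:
    """Return the names of logging schemas still on the legacy format."""
    return [name for name, fmt in registry.items() if fmt != "tulip"]

def run_daily_migration(registry: dict, batch_size: int = 2) -> list:
    """Convert a bounded batch per run, so a cron-like scheduler can
    chip away at the backlog without overwhelming ingestion."""
    targets = find_unmigrated(registry)[:batch_size]
    for name in targets:
        registry[name] = "tulip"  # stand-in for the real conversion step
    return targets

registry = {"ads_click": "legacy", "page_view": "tulip", "search": "legacy"}
migrated = run_daily_migration(registry)
```

Pairing each run with a data-loss check on the freshly converted schemas, as Meta describes, turns a one-off migration into a safe, repeatable daily job.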
Evolving migration guide
The engineers then had to win over colleagues, ensuring they followed the migration guide that had been created. This included producing an instructional video and setting up a support team. “Since the migration varied in scale and complexity on an individual, case-by-case basis, we started out by providing lead time to engineering teams via tasks to plan for the migration,” they wrote.
“We came up with a live migration guide along with a demo video which migrated some loggers to show end users how this should be done. Instead of a migration guide that was written once and never (or rarely) updated, a decision was made to keep this guide live and constantly evolving.”
The engineers concluded that, despite the complexities of the project, “designing and architecting solutions that are cognizant of both the technical as well as nontechnical aspects of performing a migration at this scale are important for success.”
Caroline Carruthers, CEO of global data consultancy Carruthers and Jackson, told Tech Monitor that large-scale data migrations come with multiple challenges, and that technical debt is often a major hurdle.
“With such a huge amount of data and years of legacy systems all integrating with each other, trying to untangle the web and free the data to be moved is a hugely complex task which will cost Meta a lot of time and money,” she explained.
“That being said, the benefits outweigh the costs: Meta’s old systems would struggle to keep up with the growing number of AI/ML products, so migration is an essential, if costly, project if the company wants to continue to innovate.”