I had the privilege of delivering the keynote on Data Science & Audience Engagement at the Data Science for Media Summit, hosted by the Alan Turing Institute in Edinburgh last week. It was the institute's first media-focused event, and it highlights the increasing role that data science is playing in our industry.

It is not surprising, however: the proliferation of content choices and sources means that media organizations need to work harder to understand and serve their audiences, while consumers need help to discover and enjoy great content. What role does data science play today, and where will it go in the future? To answer these questions it is useful to explore what data actually exists to work with. We can loosely group it into two categories: human-generated data and machine-generated data. Both provide important data sets that can be leveraged by data science tools and techniques.

Human-Generated Data
There is a rich and diverse set of data created by media professionals and consumers to describe, rate and debate content. This includes material such as subtitles and editorial metadata, which have been used to make programming more accessible and discoverable on TV for many years. Editorial metadata provides high-quality descriptions of the shows, series and contributors that populate EPGs, TV guides and, increasingly, apps. At Ericsson alone, we create over 200,000 hours of subtitles each year and host millions of editorial metadata records.
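To give a flavour of what editorial metadata looks like in practice, here is a minimal sketch of a record in Python. The field names are illustrative assumptions, not our actual schema, but they capture the kind of structured description that powers EPGs and discovery features.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class EditorialMetadata:
    """Illustrative shape of an editorial metadata record.
    Field names are hypothetical, not an actual Ericsson or EPG schema."""
    title: str
    synopsis: str
    series: Optional[str] = None
    episode: Optional[int] = None
    genres: List[str] = field(default_factory=list)
    contributors: Dict[str, str] = field(default_factory=dict)  # role -> name

record = EditorialMetadata(
    title="Blue Planet II",
    synopsis="David Attenborough explores the planet's oceans.",
    series="Blue Planet II",
    episode=1,
    genres=["Documentary", "Nature"],
    contributors={"Narrator": "David Attenborough"},
)
print(record.title, "-", ", ".join(record.genres))
```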

Alongside this data created by media professionals sits an increasingly large volume of data created by audiences themselves. Twitter has become the conversational medium of choice for many TV viewers, generating very large volumes of data in the process, while Facebook's ubiquitous ‘likes’ let viewers express and share their favourite shows. User ratings on dedicated review sites and content-owner platforms are crowdsourcing how much we love, or don't love, TV shows and movies at a scale never seen before.
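One simple way to make crowdsourced ratings usable at this scale is to damp raw averages towards a global prior, so a show with two five-star votes does not outrank one with thousands. The sketch below uses illustrative values for the prior and its weight; it is not any particular site's formula.

```python
def damped_mean(ratings, prior_mean=3.0, prior_weight=25):
    """Blend a title's raw average rating with a global prior so that
    titles with only a handful of votes aren't ranked above titles
    with thousands. prior_mean and prior_weight are illustrative."""
    n = len(ratings)
    if n == 0:
        return prior_mean
    raw_mean = sum(ratings) / n
    return (prior_weight * prior_mean + n * raw_mean) / (prior_weight + n)

# A show with two 5-star votes stays below a well-loved show with many votes.
print(damped_mean([5, 5]))                 # ~3.15
print(damped_mean([5, 4, 5, 4, 5] * 100))  # ~4.52
```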

Machine-Generated Data
While human-generated data volumes have grown enormously in recent years, they are nothing compared to what is being generated by the applications, devices, networks and platforms that serve our content. In fact, it is the sheer size of this data torrent that is driving the use of data science in media. We are evolving from an era of data scarcity to one of data abundance, and this brings a new set of challenges.

The ‘data currency’ for TV has traditionally been based on data collected from small, carefully managed audience panels (such as BARB in the UK and Nielsen in the US), with show ratings derived through statistical extrapolation. This approach is necessary in an era of data scarcity: traditional broadcast delivery systems provide no return path for collecting data from the viewing device. As we migrate to IP delivery systems we face the opposite challenge: vast quantities of viewing data generated at multiple points along the transmission path. How do we derive useful and timely insight from this data for the benefit of content providers and consumers? What are the technical, commercial and societal challenges that come with it?
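To illustrate the panel-based model, here is a toy version of that extrapolation in Python. The panel, weights and numbers are invented, and real methodologies such as BARB's and Nielsen's are far more sophisticated; the point is simply that each panellist stands in for many thousands of viewers.

```python
# Toy illustration of panel-based audience measurement: each panellist
# carries a weight (how many people in the population they represent),
# and a show's estimated audience is the sum of the weights of the
# panellists who watched. All figures here are invented.

panel = [
    {"id": 1, "weight": 120_000, "watched": True},
    {"id": 2, "weight": 95_000,  "watched": False},
    {"id": 3, "weight": 110_000, "watched": True},
    {"id": 4, "weight": 80_000,  "watched": True},
]

population = sum(p["weight"] for p in panel)           # people represented
audience = sum(p["weight"] for p in panel if p["watched"])

print(f"Estimated audience: {audience:,}")             # 310,000
print(f"Rating: {100 * audience / population:.1f}%")   # 76.5% of population
```

With IP delivery, by contrast, every viewing event can in principle be logged directly, and the problem shifts from extrapolating a small sample to aggregating and interpreting a flood of measurements in near real time.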

These topics will be covered in part two of this blog.