The start of April saw the largest ever leak of documents, the Mossack Fonseca files, published by the International Consortium of Investigative Journalists (ICIJ).
The leak contained 2.6 terabytes of data, 11.5 million documents, and more than 3 million database files. The contents of these files would have worldwide ramifications and lead to over 370 journalists collaborating to uncover the facts.
The files from the law firm based in Panama contain details about the offshore holdings of politicians, public officials, athletes, and businesses around the world.
To tackle this vast amount of data the ICIJ turned to big data tools such as Neo4j, a graph database from Neo Technology, tools from Talend, and software from Linkurious. Combined, these technologies helped the consortium produce reporting that contributed to the resignations of the prime minister of Iceland and Spain’s minister for industry.
In the UK it led to PM David Cameron planning to create a register for offshore companies buying property in the UK, forcing them to reveal who is behind them, while also revealing embarrassing links between his father, Ian Cameron, and a trust in a tax haven.
The ICIJ’s Head of Data and Research Unit, Mar Cabra, spoke to CBR about the challenges posed by the large volumes of data, how this was made available to the investigative journalists and the challenge of opening access to some of the data to the public.
The ICIJ made the decision to present the data in a graph format basically because that is what the journalists were already using, but in a manual way.
Cabra said: "Typically when investigating journalists get a piece of paper, start drawing the graph itself, maybe put it on a wall, or if a bit more sophisticated they do MS Paint and do it manually.
"Technology has advanced in a way where we no longer need these manual tools to explore graphs and if the data is already in a database format, then graphs are the best ways to explore these connections.
"Technology is here to help us do things better and in a more sophisticated way."
Getting the data into a graph database was not a simple process. Some of the databases were in Access and others in SQL Server or MySQL, which meant they had to be converted before they could be used in Neo4j. To do this the consortium used Talend as an extract, transform, and load (ETL) tool.
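The conversion described above, from relational tables into a graph, can be illustrated with a minimal sketch. It is not the ICIJ's Talend job: the table names (officers, companies, officer_company), the column names and the sample rows are all hypothetical, and an in-memory SQLite database stands in for SQL Server. The idea is only that each row of an entity table becomes a node and each row of a join table becomes a relationship.

```python
import sqlite3


def build_demo_db():
    """Create a tiny relational database (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE officers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE officer_company (
            officer_id INTEGER, company_id INTEGER, role TEXT
        );
        INSERT INTO officers VALUES (1, 'Officer A'), (2, 'Officer B');
        INSERT INTO companies VALUES (10, 'Company X');
        INSERT INTO officer_company VALUES
            (1, 10, 'DIRECTOR_OF'), (2, 10, 'SHAREHOLDER_OF');
        """
    )
    return conn


def rows_to_graph(conn):
    """Map entity rows to nodes and join-table rows to relationships."""
    nodes = {}
    edges = []
    # Each row of an entity table becomes one labelled node.
    for officer_id, name in conn.execute("SELECT id, name FROM officers"):
        nodes[("Officer", officer_id)] = {"name": name}
    for company_id, name in conn.execute("SELECT id, name FROM companies"):
        nodes[("Company", company_id)] = {"name": name}
    # Each row of the join table becomes one typed relationship.
    query = (
        "SELECT officer_id, company_id, role FROM officer_company "
        "ORDER BY officer_id, company_id"
    )
    for officer_id, company_id, role in conn.execute(query):
        edges.append((("Officer", officer_id), role, ("Company", company_id)))
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = rows_to_graph(build_demo_db())
    for source, role, target in edges:
        print(nodes[source]["name"], role, nodes[target]["name"])
```

In a real pipeline the resulting nodes and relationships would be loaded into the graph database rather than held in Python structures.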
"That allowed us to easily convert a relational database in SQL, in Access into Neo4j – Talend was very useful for that to do it easily."
Some of the work involved reverse engineering internal databases.
"We had to do reverse engineering to reconstruct the database and put it into a typical relational database format. In this case we put it into SQL Server; once in SQL Server we used Talend to transform into Neo4j and once in Neo4j we plugged it into Linkurious.
"Most of the work we had to do, most of the time is spent reconstructing the databases, unfortunately sometimes when we get a leak we don’t get a leak of the database in the original format so there is a lot of reverse engineering that has to happen there to put in a state where we can start working with it."
Aside from the questionable nature of the dealings highlighted by the leaked documents, the poor standardisation of the data also exposed poor data management at the law firm.
"What makes me wonder is if this data was so poorly standardised in some cases how could a company do much with it? If the data is not standardised it points out the fact that many companies out there are working with data, and working with databases where they are not taking care of data quality."
Data quality was the main challenge when it came to making some of the data public. For legal reasons, Cabra could not do much transformation of the data, such as cleaning addresses or standardising names.
The problem is that there were a lot of duplicated names, but four versions of the same name could be one person recorded four times or four separate people, and the data alone does not say which.
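The duplicate-name problem can be made concrete with a short sketch. This is an illustration under stated assumptions, not the ICIJ's process: the function names, the normalisation rule and the sample names are hypothetical. The code only flags records whose names match after light normalisation as candidates for human review. It never merges them, because a matching name does not prove that two records describe the same person.

```python
import re
from collections import defaultdict


def normalise(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def candidate_duplicates(records):
    """Group record ids by normalised name; keep groups with 2+ records.

    The original records are left untouched. The result is a list of
    candidates to review, not a statement that the people are identical.
    """
    groups = defaultdict(list)
    for record_id, name in records:
        groups[normalise(name)].append(record_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical sample records: (record id, name as written in the source).
    records = [
        (1, "Alex Example"),
        (2, "ALEX EXAMPLE"),
        (3, "Alex  Example."),
        (4, "Example, Alex"),
        (5, "Sam Sample"),
    ]
    print(candidate_duplicates(records))
```

Note that record 4 is not flagged: a simple rule like this misses reordered names, and a looser rule would flag more false matches, which is why the final judgement stays with a person.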
People who access the public data use much the same system as the investigative journalists: Linkurious.
Linkurious enables users to identify connections in the graph data. Entering a name in its intuitive search box brings up a node, and double-clicking that node expands the network to show its connections.
One of the features of this technology is the ability to find the shortest path between connections. Searching for two names brings up two nodes that may not appear connected, but by using the shortest path tool, it will tell you whether there is a connection.
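The idea behind a shortest-path search can be sketched with a plain breadth-first search over an undirected graph. This is not Linkurious's or Neo4j's implementation, and the node names below are hypothetical. It simply shows how two nodes that do not look connected can be linked through intermediate nodes, or shown to have no connection at all.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Return the shortest list of nodes from start to goal, or None."""
    # Build an undirected adjacency map from the edge list.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start not in graph or goal not in graph:
        return None

    previous = {start: None}  # node -> node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # the two nodes are in separate parts of the graph


if __name__ == "__main__":
    # Hypothetical connections between people, companies and intermediaries.
    edges = [
        ("Person A", "Company X"),
        ("Company X", "Intermediary Y"),
        ("Intermediary Y", "Person B"),
        ("Person C", "Company Z"),
    ]
    print(shortest_path(edges, "Person A", "Person B"))
    print(shortest_path(edges, "Person A", "Person C"))
```

Breadth-first search explores the graph one hop at a time, so the first route it finds to the target uses the fewest connections.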
Linkurious was also used as a kind of intranet: journalists were given the URL, a username and a password, and could then access the files and analyse the data from anywhere in the world.
Cabra said that one of the good things about using Neo4j and Linkurious was that the ICIJ controlled the whole process. This means that it controlled access and the software companies had no insight into the work being done, protecting everyone working on it.
The insights from a data leak of this magnitude would not have been possible without technology to uncover the connections. At the very least, they would have taken a lot longer to come out and might have had less impact.
Technology has reached a point where analysing even terabytes of data is no longer seen as a daunting task, and while some may use it for nefarious means, others like the ICIJ are using it to effect positive change.
This article is from the CBROnline archive: some formatting and images may not be present.