Earlier this year more than 50 news organisations worldwide revealed how HSBC had helped criminals, traffickers and tax evaders, and profited from doing business with them, by sheltering over 100,000 clients with accounts worth $100 billion in Switzerland.
Key to breaking this major financial story was overcoming a gargantuan technological challenge, and the way it was solved spotlights the power of a new way to work with complex data: graph databases.
Back in 2014, Le Monde investigative reporters Gérard Davet and Fabrice Lhomme found themselves with unexpected access to a valuable set of data.
These journalists knew at once the potential of what they had acquired: a major international scoop around fraud, tax evasion and international crime. Their problem, however, was as big as the opportunity to expose a scandal: the data was too complex to be analysed by traditional means.
Davet and Lhomme could not analyse the data themselves. The data they had discovered, eventually to become known as the ‘Swiss Leaks,’ included information from thousands of HSBC account holders located in more than 20 countries, with connections spread among thousands of files.
After much discussion they decided to ask colleagues for help, turning to the International Consortium of Investigative Journalists (ICIJ) – a move that set off one of the biggest digital cross-border journalistic collaborations ever seen.
When asked to help, Mar Cabra, editor of the ICIJ’s Data and Research Unit, knew her team would need a ground-breaking tool to analyse this complex data, one that could handle unstructured data quickly, easily and efficiently.
Cabra had one other demand: she wanted an intuitive, easy-to-use tool that didn’t need data scientists and developers to do all the work. She wanted the data discovery and analysis process to be accessible to investigative reporters worldwide, regardless of their technical skills.
A Graph Database Met the Challenge
Luckily for the Swiss Leaks probe, Cabra had come up against the complex data challenge before – and so knew that a technology called a graph database was probably the only solution available that could perform such demanding and complex analysis.
Whatever the total size of your data, graph databases are great at managing highly connected data and complex queries. Instead of using tables, like relational databases, graph databases use graph structures incorporating nodes, properties and edges to define and store data, making them best in breed for analysing relationships and interconnections between data. As a result, graph databases are widely used in data mining, for data with dynamic schemas and for highly complex data analysis.
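To make the idea of nodes, properties and edges concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the data model, not how Neo4j stores data internally; the class names, labels and sample records are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node: an id, a label and free-form properties."""
    node_id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, typed relationship between two nodes."""
    source: str
    target: str
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}     # node_id -> Node
        self.outgoing = {}  # node_id -> list of outgoing Edge objects

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.outgoing.setdefault(node.node_id, [])

    def add_edge(self, edge):
        # Both endpoints must already exist as nodes.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.outgoing[edge.source].append(edge)

    def neighbours(self, node_id, rel_type=None):
        """Follow outgoing edges directly, optionally filtered by type."""
        return [
            self.nodes[e.target]
            for e in self.outgoing[node_id]
            if rel_type is None or e.rel_type == rel_type
        ]


# Entirely made-up example data.
graph = PropertyGraph()
graph.add_node(Node("p1", "Person", {"name": "Example Client"}))
graph.add_node(Node("a1", "Account", {"currency": "CHF"}))
graph.add_node(Node("c1", "Country", {"name": "Exampleland"}))
graph.add_edge(Edge("p1", "a1", "HOLDS", {"since": 2005}))
graph.add_edge(Edge("p1", "c1", "LINKED_TO"))

held = graph.neighbours("p1", rel_type="HOLDS")
print([n.node_id for n in held])
```

The point of the structure is that a relationship is stored as a first-class record, so asking "what is this person connected to?" means following edges rather than joining tables.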
“While working on stories like Offshore Leaks, I learned how important graph analysis is when investigating financial corruption,” Cabra explained. “Connections are key to understanding what the real story is: they show you who’s doing business with whom. We decided early on that we needed to use a graph-based approach for the HSBC Leaks.”
The ICIJ worked with open source integration software specialist Talend to transfer the original dataset into Neo Technology’s Neo4j graph database. Another Neo partner, Linkurious, provided a web app user interface so that the graph database could be visualised and easily accessed by reporters.
The graph visualisation approach allowed ICIJ journalists to identify the connections between people and bank accounts, helping them to ‘follow the money’ in order to identify literally dozens of instances of fraud, corruption and tax evasion.
To get there, Cabra’s Data and Research Unit first created an HSBC client database from the plain Excel files provided. Next they connected every name to one or several countries, creating what a graph database needs to work: so-called ‘nodes’.
The data was then turned into a graph format to detect and then fine-tune the connections between the nodes. The Swiss Leaks data held around 60,000 files containing information about over 100,000 clients in 203 countries, and the resulting graph database had more than 275,000 nodes with 400,000 relationships among them.
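The step from spreadsheet rows to nodes and relationships can be sketched roughly as follows. This is a simplified illustration, not the ICIJ's actual pipeline (which used Talend and Neo4j): the column names, the semicolon-separated country field and the sample rows are all assumptions made for the example.

```python
import csv
import io

# Made-up rows standing in for a spreadsheet export; the column names
# and the ";" separator are assumptions for this sketch.
RAW = """client,countries
Client A,Exampleland;Samplestan
Client B,Exampleland
Client A,Samplestan
"""


def rows_to_graph(text):
    """Turn tabular rows into de-duplicated nodes and relationships."""
    nodes = set()          # (label, name)
    relationships = set()  # (client, "LINKED_TO", country)
    for row in csv.DictReader(io.StringIO(text)):
        client = row["client"].strip()
        nodes.add(("Client", client))
        # One client name may be connected to one or several countries.
        for country in row["countries"].split(";"):
            country = country.strip()
            if not country:
                continue
            nodes.add(("Country", country))
            relationships.add((client, "LINKED_TO", country))
    return nodes, relationships


nodes, relationships = rows_to_graph(RAW)
print(len(nodes), "nodes,", len(relationships), "relationships")
```

Using sets means a client or country that appears in many rows still becomes a single node, which is what lets connections scattered across thousands of files converge on the same point in the graph.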
Unlocking the Data
After importing the data into Neo4j with the Linkurious visualisation at the front, Cabra and her team noted interesting differences in how journalists were able to use the Swiss Leaks data. As an international collective, the ICIJ has many members in many countries – all of whom reported that the material they were now working with was easy to use, intuitive to navigate and required very little training. The ICIJ helped allay any fears by supplying a set of short online demos and webinars for journalists, but many say they were surprised that they didn’t need advanced technology skills to use the graph database.
The ICIJ shared the tool on its virtual newsroom, enabling journalists worldwide to tap into the dataset and the graph analysis tool within their respective regions, querying data on a worldwide scale.
By using Neo4j to investigate the HSBC Leaks data, journalists were quickly able to identify major players, intermediaries and beneficiaries in the scandal (regardless of location) and define how they were connected. Likewise, banks are keen to use the same tools to solve a variety of connected data problems at source and in real time, first and foremost the detection of potential fraud.
By being able to easily visualise the networks around clients and accounts, the reporters found many more connections than they had before, which led to the Swiss Leaks revelations soon being shared with the public and regulators across the globe. As a result, the 150-journalist project was awarded the prestigious Data Journalism Award (Investigation of the Year category) by the Global Editors Network – and stories are still appearing today.
No surprise, then, that Neo4j has become an integral part of the ICIJ for its Big Data projects. “It’s a revolutionary discovery tool that’s transformed our investigative journalism process,” Cabra says.
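Working out how two parties are connected is, at its core, a path-finding question. The sketch below uses a plain breadth-first search in Python over invented links to find the shortest chain between two entities. It illustrates the kind of question a graph database answers, and should not be read as Neo4j's query engine or the reporters' actual queries; every name in it is hypothetical.

```python
from collections import deque

# Made-up undirected links between hypothetical entities.
LINKS = [
    ("Client A", "Account 1"),
    ("Account 1", "Shell Co X"),
    ("Shell Co X", "Intermediary Y"),
    ("Intermediary Y", "Client B"),
    ("Client A", "Account 2"),
]


def build_adjacency(links):
    """Index every entity by the set of entities it is directly linked to."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_connection(adjacency, start, goal):
    """Breadth-first search; returns the shortest chain of entities or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(adjacency[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


adjacency = build_adjacency(LINKS)
path = shortest_connection(adjacency, "Client A", "Client B")
print(" -> ".join(path))
```

The entities in the middle of the returned chain are the intermediaries, which is the "follow the money" view that a visual graph tool presents to a reporter without requiring any code.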
“This simply wouldn’t have been possible before on this scale. It’s magic!”