By Harry Powell, Head of Industry Solutions, TigerGraph.
Despite the rapid growth in spending on fraud detection systems, financial institutions are seeing a diminishing rate of increase in the performance of the algorithms that drive these systems.
They are hitting a performance ceiling as they discover there is a limit to how much you can squeeze out of an algorithm without fundamentally changing the system.
Fraud detection is a rapidly growing industry. The value of the market has grown tenfold since 2009, reaching $2 billion in 2020.
However, one has to question the effectiveness of the existing fraud detection paradigm. Between 2021 and 2022, fraud rose by 30%. And although 70% of fraud is detected early enough to stop it, fraudsters are still getting away with a significant amount – a whopping $50 billion per year taken from financial institutions.
The problem for fraud detection systems is fundamental: while the algorithms used to detect fraud are effective, they are only as good as the data they have to work with.
The data being input into fraud detection systems typically consists of transactional data and information about the parties to the transaction. In simple terms, the system computes a fraud score by looking at the nature of the transaction and the history of the parties involved and uses a range of algorithms to arrive at an overall fraud score based on known risk factors.
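As a rough illustration, a scoring step of this kind might look like the following sketch. The risk factors, thresholds and weights here are invented for the example and do not reflect any particular vendor's model.

```python
# Illustrative only: a toy fraud score combining transaction attributes
# with the parties' history. All factors and weights are assumptions.

def fraud_score(txn, party_history):
    """Return a score in [0, 1] from simple, known risk factors."""
    score = 0.0
    if txn["amount"] > 10_000:                  # unusually large transfer
        score += 0.3
    if txn["cross_border"]:                     # higher-risk corridor
        score += 0.2
    if party_history["prior_flags"] > 0:        # parties flagged before
        score += 0.3
    if party_history["account_age_days"] < 30:  # newly opened account
        score += 0.2
    return min(score, 1.0)

txn = {"amount": 25_000, "cross_border": True}
history = {"prior_flags": 1, "account_age_days": 12}
print(fraud_score(txn, history))  # → 1.0
```

Real systems weigh far richer signals, but the shape is the same: known risk factors in, a single score out.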
However, a significant limitation of these systems is the scope of the data that is brought to bear on the problem. There is a lot more data available which, if it were used, would help to contextualise the transaction. Without that data, we are effectively throwing away half the information before we’ve even started.
What if we could examine all the entities with which the parties to the transaction are known to be associated? This doesn’t involve sourcing new data, just using data that is already available but not widely exploited.
This would allow us to use the fraud scores of connected entities – be they people, accounts or devices – and factor their fraud scores into the overall fraud scores of the parties to the transaction. A party with a large number of connections to entities with high fraud scores is almost certain to be worthy of further investigation.
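A minimal sketch of that idea, blending an entity's own score with the average score of the entities it is connected to. The 0.6/0.4 split is an arbitrary choice made for the illustration:

```python
# Illustrative sketch: an entity's contextual fraud score is its own
# score blended with the mean score of its connected entities
# (people, accounts or devices). Weights are assumptions.

def contextual_score(entity, own_scores, links):
    """Combine an entity's own score with the mean score of its neighbours."""
    neighbours = links.get(entity, [])
    if not neighbours:
        return own_scores[entity]
    neighbour_mean = sum(own_scores[n] for n in neighbours) / len(neighbours)
    return 0.6 * own_scores[entity] + 0.4 * neighbour_mean

own_scores = {"A": 0.1, "X": 0.9, "Y": 0.8, "Z": 0.7}
links = {"A": ["X", "Y", "Z"]}   # A is connected to three risky entities

print(round(contextual_score("A", own_scores, links), 2))  # → 0.38
```

Even though A's own score is low, its connections to high-scoring entities pull its contextual score up, which is exactly the "worthy of further investigation" signal described above.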
The problem is that the data is not always organised to facilitate this type of analysis. What that would entail is organising the data to allow us to readily view the links between various entities. What accounts and devices is this person linked to? How are those accounts and devices linked to other entities and how are they in turn linked to other entities?
Doing the database hop
Each link between entities in your database can be thought of as a ‘hop’, and as you hop from one entity to another, the complexity of your analysis grows. Greater complexity leads to deeper insights, but at a cost: computational effort.
The computational effort involved in using relational databases for this type of analysis grows exponentially with the number of hops.
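To see why hops compound, here is a small illustration: a breadth-first traversal over a toy adjacency list, counting how many new entities each hop reaches. The graph is invented for the sketch; in a relational database, each extra hop would typically mean another self-join over the whole link table.

```python
# Sketch: how the number of entities reached grows with each 'hop'.

def entities_per_hop(graph, start, max_hops):
    """Breadth-first search returning how many new entities each hop reaches."""
    seen = {start}
    frontier = [start]
    counts = []
    for _ in range(max_hops):
        nxt = []
        for node in frontier:
            for nb in graph.get(node, []):
                if nb not in seen:
                    seen.add(nb)
                    nxt.append(nb)
        counts.append(len(nxt))
        frontier = nxt
    return counts

# Toy graph where every entity links to two new ones: 2, 4, 8 ... per hop.
graph = {"root": ["a", "b"],
         "a": ["a1", "a2"], "b": ["b1", "b2"],
         "a1": ["a1x", "a1y"], "a2": ["a2x", "a2y"],
         "b1": ["b1x", "b1y"], "b2": ["b2x", "b2y"]}

print(entities_per_hop(graph, "root", 3))  # → [2, 4, 8]
```

When each entity links to several others, the candidate set roughly multiplies at every hop, which is what makes deep multi-hop joins so expensive.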
That’s where graph databases come in.
Graph databases represent a paradigm shift in the way information is stored and processed. While the strength of relational databases is tabulating lists and aggregating numbers, the strength of graph databases lies in analysing relationships.
Rather than storing data in tables, columns and rows, graph databases store information in nodes called ‘vertices’ and links called ‘edges’. With this data model, you can easily represent data as a web of information.
To analyse relationships in SQL, we must define the connections between entities in the database with a query and build the relationships at run time. This is computationally intensive in terms of both memory capacity and processing time. It is also hard to do right, so it takes a long time for data engineers to code and is prone to errors.
By contrast, the relationships between the entities in a graph database are defined in the data. A node called ‘Person A’ is linked to another node called ‘Account B’ by a shared edge. We can then use more vertices and edges to show that Account B has been accessed by Device C on a number of occasions. Through this chain of links, Account B has transferred funds to another node, Account D. Looking more closely at Account D, we see that it has a high fraud score, which could reflect badly on Account B. Therefore, Account B could be flagged for further investigation.
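The chain described above can be sketched as a tiny edge list. The entity names, scores and the 0.5 flagging threshold are illustrative assumptions:

```python
# The Person A / Account B / Device C / Account D chain as vertices
# and edges. Data and threshold are invented for the example.

edges = [
    ("Person A", "owns", "Account B"),
    ("Device C", "accessed", "Account B"),
    ("Account B", "transferred_to", "Account D"),
]
fraud_scores = {"Account D": 0.9}   # Account D is known high-risk

def neighbours(entity):
    """Entities sharing an edge with `entity`, in either direction."""
    out = set()
    for src, _rel, dst in edges:
        if src == entity:
            out.add(dst)
        elif dst == entity:
            out.add(src)
    return out

# Flag any entity directly linked to a high-risk entity.
flagged = {e for src, _r, dst in edges for e in (src, dst)
           if any(fraud_scores.get(n, 0.0) > 0.5 for n in neighbours(e))}

print(flagged)  # → {'Account B'}
```

Because the edges are stored directly, no join needs to be computed at query time: flagging Account B is just a walk over its stored links.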
That is a very simple example but shows how a graph database can be used to analyse relationships and add context to fraud scores. Scaled up to millions of accounts and transactions, it yields real fraud insight.
Harnessing advanced algorithms
While graph databases can be useful for visualising complex webs of relationships, the power of graph does not stop there.
The branch of mathematics upon which graph databases are modelled is called graph theory. It contains a treasure trove of algorithms that can be deployed to solve computational problems and yield actionable fraud intelligence. That intelligence greatly improves the accuracy of fraud detection scores and gives financial investigators a jumping-off point to dig deeper into suspect transactions and accounts.
It allows you to detect patterns in data that have been designed by fraudsters to look unremarkable, helping financial institutions reveal flows of money, chains and groups of relationships and shared characteristics that would not be obvious by simply looking at a party in isolation.
Examples of graph algorithms that can help to uncover these relationships and flows include:
Closeness algorithms – A class of algorithms which enable you to reveal how close an account is to other accounts with a high risk of fraud. An example of a closeness algorithm is ‘shortest path’, which works out the shortest distance between Entity A and Entity B. If B has a high risk of fraud then, depending on how close A is to B, A may also have a high risk of fraud.
Centrality algorithms – We know intuitively that an account at the nexus of an abnormally high number of capital flows is more likely to be fraudulent, and centrality algorithms can pick out those accounts for further investigation. An example of a centrality algorithm is PageRank which is the algorithm that search engines use to determine the relative importance of web pages. You can use the PageRank algorithm to compare large numbers of accounts and pick out those with abnormally large capital flows.
Community algorithms – Accounts that sit in a community with other suspicious accounts are more likely to be fraudulent, and community algorithms can crunch large numbers of relationships to identify these communities. An example of a community algorithm is Louvain, which identifies communities by looking for entities that have more links to each other than they do to other entities. If the overall fraud score of the community is high, then the individual accounts within it are likely worth further investigation.
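The three families above can be illustrated with toy, self-contained implementations: breadth-first search standing in for shortest path, a basic power iteration for PageRank, and, as a deliberately simplified stand-in for Louvain (which is considerably more involved), connected components to group linked accounts. All data here is invented for the sketch.

```python
# Toy versions of the three graph-algorithm families, for illustration.
from collections import deque

def shortest_path_len(graph, a, b):
    """Closeness: number of hops on the shortest path from a to b (or None)."""
    seen, q = {a}, deque([(a, 0)])
    while q:
        node, d = q.popleft()
        if node == b:
            return d
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                q.append((nb, d + 1))
    return None

def pagerank(graph, damping=0.85, iters=50):
    """Centrality: basic PageRank power iteration over the graph."""
    nodes = list(graph)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            share = rank[n] / len(graph[n])
            for nb in graph[n]:
                nxt[nb] += damping * share
        rank = nxt
    return rank

def communities(graph):
    """Community: connected components (a toy stand-in for Louvain)."""
    seen, comps = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(graph[n])
        comps.append(comp)
    return comps

# Undirected toy network: two clusters; 'hub' sits at the centre of one.
graph = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub", "risky"],
    "risky": ["d"],
    "p": ["q"], "q": ["p"],
}

print(shortest_path_len(graph, "a", "risky"))  # → 3  (a-hub-d-risky)
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))               # → hub
print(len(communities(graph)))                 # → 2
```

The account three hops from a risky entity, the hub with abnormally many flows, and the two separate account communities each surface a different kind of lead for investigators.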
With these algorithms, you can join the dots between the parties to a transaction and other parties which are not directly connected to the transaction but are known fraud risks.
Of course, it’s not enough simply to identify suspicious accounts and transactions worthy of action by human investigators – the computational model should also be able to explain why it has given a high risk score to a transaction.
One of the features of graph analytics is the ability to generate explainable models, not only giving a fraud score but revealing the specific connections which contributed to that computation, information that would allow a fraud investigator to pick up and carry on where the algorithms left off.
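A sketch of that idea: alongside a flag, return the actual chain of links connecting the flagged account to a known-risky entity, so an investigator can pick up where the algorithm left off. The entity names here are hypothetical.

```python
# Illustrative explainability: report the chain of links behind a flag.
from collections import deque

def explain_connection(graph, start, risky):
    """Return the shortest chain of entities linking `start` to any risky entity."""
    parent = {start: None}
    q = deque([start])
    while q:
        node = q.popleft()
        if node in risky:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return list(reversed(path))
        for nb in graph.get(node, []):
            if nb not in parent:
                parent[nb] = node
                q.append(nb)
    return None

graph = {
    "Account B": ["Person A", "Device C", "Account D"],
    "Person A": ["Account B"],
    "Device C": ["Account B"],
    "Account D": ["Account B", "Mule E"],
    "Mule E": ["Account D"],
}

print(explain_connection(graph, "Account B", {"Mule E"}))
# → ['Account B', 'Account D', 'Mule E']
```

Rather than a bare score, the investigator receives the specific path of relationships that drove it, which is the kind of explainable output described above.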
Graph analytics is being used by at least four of the tier one banks in the US to augment their existing AI fraud detection systems. One bank said it yielded a 20% increase in synthetic identity fraud detection. Another bank reports that graph gave the greatest return on investment (ROI) of any of its technology investments in a year, delivering $100 million per year.
Adding graph features to machine learning and AI models has enabled banks to use information that was once classified as too difficult to analyse, leading to significant improvements in fraud detection rates.