Financial analytics and the use of big data

Laura Shepard, Director of HPC Marketing, DataDirect Networks

Banks are arguably among the biggest enterprise consumers of IT across all industries – research from Ovum in June 2013 suggested global IT spending at retail banks alone would reach $132bn by 2015. You would think, therefore, that most would be a step ahead of the game in terms of how they use IT. Whilst that may be true across many aspects of the sector – for example, high-frequency traders using ultra-low-latency networking, or innovations such as ‘back testing’ against time series databases – most are still using last-century infrastructure design. The infrastructure simply hasn’t matched the capability or the scale of the innovative software. Financial firms are three to five years behind academics and research institutes when it comes to state-of-the-art infrastructure design.

Decades old enterprise computing
Quant analysts are constantly evolving and developing algorithms based on historical data to help predict market direction – important for a number of reasons, not least to make money and to reduce exposure to risk. Yet it puzzles me why, in so many firms, this core function is still implemented the way it has always been done – based on practices that were commonplace in enterprise computing several years ago. The latest technologies can increase the performance of these algorithms fivefold. Many of the ‘features’ of enterprise computing – replication, de-duplication and compression – interfere with performance. When performance matters most, only HPC techniques deliver.

Finance houses’ ‘back testing’ capability is also being undermined by enterprise computing from two decades ago. Whilst there is not necessarily anything wrong with this per se, it is limiting their ability to query and build algorithms for ever-expanding data volumes. Increasingly, financial institutions want to run back testing on multi-year and multi-location/market data, to help them more accurately predict market movement globally and to help reduce their risk exposure.

But today’s financial technology infrastructures, for the most part, are not tuned to deal with double-digit terabytes – and soon hundreds of terabytes – of data. Long term, it becomes impractical from both a cost and a ‘real estate’ standpoint to continue using ‘90s and ‘00s enterprise-style computing – these systems require a lot of space.

Today, finance houses want to analyse more data from more sources than just exchanges, and to run simultaneous programmes to create a better understanding of their market. This spans several kinds of data. Market tick data is fixed format: it moves fast, but its type is predictable. Sentiment analysis adds an additional dimension – variation in data type as well as volume. Here, algorithms are emerging that search popular social media feeds – Twitter, for example – and other public sources for key words and phrases, spotting trends that then inform trading practice. For example, an algorithm that looks for discussions of energy, pipeline failure or oil spills can ‘e-discover’ the mood of the market and predict how it will trade.
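In its simplest form, that keyword-driven approach amounts to scanning a stream of posts against a weighted watch-list. The sketch below is a minimal, hypothetical illustration – the phrase list, weights and in-memory batch of posts are invented for the example; a production system would stream from live feeds and use far more sophisticated scoring:

```python
# Hypothetical watch-list: phrases whose appearance in public feeds
# might signal energy-market stress or strength (weights are illustrative).
WATCH_PHRASES = {
    "pipeline failure": -2.0,
    "oil spill": -3.0,
    "supply glut": -1.0,
    "demand surge": +2.0,
}

def sentiment_score(posts):
    """Sum the signed weights of every watched phrase found in a batch of posts."""
    score = 0.0
    for post in posts:
        text = post.lower()
        for phrase, weight in WATCH_PHRASES.items():
            if phrase in text:
                score += weight
    return score

posts = [
    "Breaking: pipeline failure reported in the North Sea",
    "Analysts expect a demand surge this winter",
]
print(sentiment_score(posts))  # -2.0 + 2.0 = 0.0
```

The interesting part for infrastructure is not the scoring itself but the volume: running this continuously over every public feed is what pushes the problem out of memory and into the storage layer.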

With the sheer increase in data volumes and the desire to analyse more data from multiple sources – whether structured in databases or unstructured like social media – it is suddenly no longer cost effective to build enterprise systems that hold that much data in system memory or storage cache. Particle physicists and life science researchers hit this wall five years ago and turned to non-blocking storage and online analytics to pre-process the data before storing it, streaming gigabytes of data a second for months on end. If you were looking for the Higgs boson and someone at the back of the room put his hand up and said, “Sorry, can you run the three-month experiment again? We had a cache overflow and missed the bit in the middle,” you would probably want to review contracts. Financial services may not have the volumes of their academic peers, but they do have the same velocity problem – churning 30TB of market data faster than your competition will bring first-mover advantage to any algorithmic trader.

Embrace HPC techniques 
To be competitive, firms need to look beyond client/server and even distributed systems and embrace HPC techniques, which have been tried and tested by the academic research community for at least the last ten years. In addition, technology strategy ambitions should not stop at batching data through cache. Batching is not terribly efficient; SAP HANA and SAS Grid are ushering the merger of online transaction processing and online analytics into the enterprise, and batch processing will become as archaic as mainframes. Slicing data into ten iterations might speed things up, but ten iterations will soon become 100 iterations and so on – inefficient, to say the least.

Some organisations might look to put flash storage between the storage infrastructure and system memory. Whilst flash will provide better performance in some cases, on its own it will not scale to the terabytes of data finance houses hold, nor deliver the bandwidth. Flash also remains too expensive to replace spinning-disk data arrays.

There is no middle ground in addressing this data growth and ‘need to know more’ challenge; the successful firms will embrace Big Data. There is a new way, borrowed from HPC, and visionary finance houses are starting to see the benefit of a parallel approach – it started in the systems with grid computing and is now reaching the file system, helping firms analyse more positions faster and develop more effective trading strategies, which they can deploy in less time.

The majority of finance houses running algorithmic back testing do so out of Kx or other custom in-memory databases running on ‘90s-style infrastructure – which is like pairing a Ferrari engine with a Mini gearbox. STAC M3 tests showed that switching the storage to supercomputer class can accelerate the database by 800%.

By using fast, scalable, external disk systems with massively parallel access to data, researchers can perform analysis against much larger data sets delivering more effective models, faster. This means analysts can run hundreds of models or hundreds of iterations of the same model in the same time it used to take to run a few.
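In code terms, the shift is from looping over model iterations serially to farming them out across workers that all read from the same shared data store. A minimal sketch using Python’s standard library follows – `run_backtest` and the parameter grid are hypothetical placeholders standing in for a real model replaying historical ticks:

```python
from concurrent.futures import ProcessPoolExecutor

def run_backtest(params):
    """Hypothetical stand-in for one back-test iteration over shared tick data."""
    lookback, threshold = params
    # A real implementation would replay historical ticks from shared storage;
    # we return a toy 'P&L' so the parallel pattern itself is runnable.
    return lookback * 0.1 - threshold

# 10 parameter combinations = 10 independent iterations of the same model.
param_grid = [(lb, th) for lb in range(10, 60, 10) for th in (0.5, 1.0)]

if __name__ == "__main__":
    # Workers run concurrently, each reading the same data set -- the number
    # of iterations scales with cores/nodes rather than with wall-clock time.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_backtest, param_grid))
    best = max(zip(results, param_grid))
    print(best)  # → (4.5, (50, 0.5))
```

On a parallel file system the same pattern extends across machines: each node mounts the shared store directly instead of waiting for a staged copy.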

It is not just trading where HPC techniques can be exploited: risk and compliance departments will want to process more data, get answers more quickly and lower capital reserving. One firm I know has reduced a risk calculation from nine minutes to two minutes using HPC-style data storage – that gets them to market seven minutes earlier, which makes a big difference when volumes are high.

The UK regulator wants financial institutions to raise another £27 billion in capital reserve. If a firm can demonstrate that its risk control measures are strong enough to pass the stress tests, it will be able to put that capital to work rather than leave it stuck in reserve. This means dramatically increasing its ability to assess total market exposure – from only once or twice a day to multiple intra-day assessments.

Blind Case Study
At one global hedge fund, hundreds of servers capture 3GB of tick data per exchange every day. Dozens of quantitative analysts work with that data to create and test equity trading strategies on hundreds more servers. The firm was previously using several large NAS filers that simply could not keep up with the volume of data and the analysts’ data-access performance requirements.

They moved to a DDN SFA system of similar capacity and, with the additional performance delivered, were able to back test three times the number of models, significantly reducing their time to deploy new strategies. The Securities Technology Analysis Center LLC (STAC®) M3 benchmark recently demonstrated that DDN’s SFA12K-40, with a hybrid configuration of flash and spinning-disk technology, delivered performance more than eight times faster than the traditional storage average and, in some cases, almost twice the performance of flash storage.

“Before we rolled out the DDN systems, our [filer] farm just couldn’t keep up with the growth in the market and our business. With the DDN systems, our traders are getting new strategies into the market faster.” – director of IT, global hedge fund.

Organisations that want to analyse all their exchanges across multiple locations, and to query phenomena such as sentiment (Big Data), need to be looking at adopting a parallel file approach to their storage environment.

People who build their own tools that can take advantage of a parallel file system will do very well. They can use the parallel file system as the basis to stream data from multiple sources, with many different systems using the same data store concurrently, so they do not need a dedicated copy of each data set. In other words, parallel file systems enable multiple nodes to hit the same data – and the same data sources – concurrently: the tick database can be queried from London and Singapore at the same time, and so can the sentiment analysis mentioned earlier.
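To illustrate that ‘one copy, many readers’ pattern, the sketch below has two threads – standing in for London and Singapore analysis nodes – reading disjoint slices of the same fixed-width ‘tick file’ concurrently, with no private copies. The file name and record layout are invented for the example; on a parallel file system the same logic would run on separate machines against one shared mount:

```python
import os
import tempfile
import threading

RECORD = 8  # bytes per fixed-width record

# Write one shared 'tick file' of 1,000 records (stand-in for the tick store).
path = os.path.join(tempfile.gettempdir(), "ticks.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(i.to_bytes(RECORD, "little"))

totals = {}

def reader(name, start, count):
    """Each 'node' opens its own handle and reads only its slice of the one copy."""
    with open(path, "rb") as f:
        f.seek(start * RECORD)
        data = f.read(count * RECORD)
    totals[name] = sum(
        int.from_bytes(data[i:i + RECORD], "little")
        for i in range(0, len(data), RECORD)
    )

threads = [
    threading.Thread(target=reader, args=("london", 0, 500)),
    threading.Thread(target=reader, args=("singapore", 500, 500)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(totals)  # both halves read concurrently from one store
```

The point of the parallel file system is that this concurrency does not serialise at the storage layer: each reader gets full-bandwidth access to the shared data rather than queueing behind a single filer head.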

Blind case study
A large US proprietary trading firm specialising in high frequency trading recognised early that it would need to move away from direct-attached and NAS storage if it was going to share and access petabytes of data across multiple high-performance trading groups, including teams in currencies, derivatives, international equities, technology equities and more.

An extreme need for speed was identified, as the firm’s success depended on bringing many new strategies to bear quickly and retiring fading or unsuccessful strategies in as close to real time as possible. A parallel architecture was selected to meet two key criteria in support of these goals:

  1. A global namespace for more efficient data gathering and sharing
  2. Parallel IO to remove the time lag of the sequential jobs necessary in NAS architectures.

Several generations of parallel infrastructure were tried, based on storage and parallel file systems provided in one case by a major server vendor and in another by a major storage vendor. Under production conditions, neither was able to meet even the minimum performance requirements based on current and projected needs, which run to petabytes of capacity and gigabytes per second of sustained IO.
Impressed with DDN’s massively parallel architecture, and its open-platform approach to parallel file systems – which would allow the firm to change infrastructure if needed without throwing out its entire investment – the company evaluated and then selected DDN SFA with GPFS. In addition to it surpassing the performance, availability, scalability and cost requirements, the office of the firm’s CIO also liked DDN’s Python-based infrastructure and features that would support possible future directions, such as embedding applications in the storage itself to cut IO path process steps almost in half for extreme low latency.

The future
For me, the next big thing in terms of Big Data, and the analysis of that data, is risk management. The first step is CAT (the Consolidated Audit Trail), but there is so much more to gain than mere compliance. There are quantifiable goals on the table, such as intra-day risk assessment, capitalisation and liquidity reporting, and predictive modelling. Beyond that, institutions would love to know what their market exposure and opportunity are at any given time. Knowing the position at any moment in the day will allow a firm to know which way to jump and deliver that elusive first-mover advantage.



