Viewpoint : Data wrangling : Dan Mitchell

The availability and use of data in today’s investment world

Gaining a competitive edge from today’s incredibly broad data universe is not without its challenges. The Science of Investment talks to Founder and CEO of Hivemind, Dan Mitchell, about how investors can use datasets accurately and in a manner appropriate to their research.

How would you characterise the scope of data that investors might assess for opportunities today?

There’s no doubt there’s a huge volume and variety of data available for investors; the industry has been discussing the ‘data deluge’ for some time. The range of firms using this broad data universe and the uses to which it is being put have also expanded significantly.

The scope of the data itself has increased over the last couple of decades as more of the events of daily life have become digitised. Consumers interacting with companies online create data points, which in aggregate could be used to understand more about those companies. Similarly, whenever a company posts a job listing, files a patent or reduces its prices online, there’s a clear record of it. It’s not all consumer or equity-related data though. For instance, live ship tracking, satellite imagery of industrial activity, or real-time gas pipeline flow data could be used to inform commodity or macro investment strategies.

Some of this data has been commoditised and is sold to investors as ‘alternative data’. These datasets can be huge and so ‘alternative data’ has become entwined with parallel technological innovation in cloud computing and machine learning. As such it is often considered the preserve of technical quantitative firms rather than discretionary investors. But in fact, over the last decade more traditional, discretionary firms have also begun buying these datasets to augment or support their decision-making process. Similarly, quantitative investors have begun using more qualitative sources in their quantitative systems: for instance, news articles, text from annual reports or bond prospectuses, and other unstructured regulatory filings. And the data isn’t just being used to design new trading signals; it’s being used across businesses from investment risk factors to compliance – KYC or AML initiatives for example.

So there has been both an expansion in the scope of the data available and in the ways in which that data is used.


What potential advantages are there to accessing that very broad data universe?

Fundamentally, by expanding the range of data they ingest, investors can potentially make better and more timely decisions, and thereby gain a competitive edge.

From a trading signal perspective, the success of alternative data is a recognition that traditional datasets are limited in their scope and timeliness. Credit-card, email receipt or footfall data potentially allow equity investors to get an idea of quarterly sales numbers well before official release. Macro investors might ask whether there are better ways of measuring inflation than a relatively static basket of goods or better measures of economic activity than GDP, which is very slow and can get substantially restated.

But the richness of the data and the scope of potential applications suggest that there’s real gold that hasn’t yet been found in innovative use cases.

What are the challenges of getting information out of that data universe?

Before you approach the application of data to business problems, there are sourcing, infrastructure, data preparation and data interpretation challenges. The first two of those have been discussed at length recently but the preparation and interpretation challenges tend to get less attention.

What’s needed at the end of the preparation stage is data pertinent to your question or research hypothesis, cleaned and structured in a useful way. The difficulty is that datasets are never finished; they aren’t agnostic to use. Every task requires data in different formats, with different error tolerances, normalisations, or treatments of unexpected scenarios and edge cases. The key challenge is to set up a data preparation process which can create accurate and appropriate outputs as frictionlessly as possible irrespective of the source of data you’re working with.

When working with raw source material the nature of this challenge is very obvious. It comes in a variety of messy, semi-structured and unstructured formats and from a range of internal and external repositories: dense legal text documents, press releases, tables of statistics in PDFs, a mix of infographics, tables and text in presentations or reports, images and informal or colloquial text in social media posts. Turning this into usefully structured data in a spreadsheet or database is extremely challenging. When you buy from data vendors, they’ve done some of this work for you but they’re naturally building a somewhat generic product to attract clients with a range of use cases rather than yours specifically. With alternative datasets, a further challenge is that they often aren’t designed for use within finance, nor curated in the way financial professionals are used to.

Then there are data cleaning and linkage problems. Just because datasets are becoming bigger doesn’t mean you can ignore basic data quality problems: inconsistent units, outliers, gaps, etc. And individual datasets are rarely valuable on their own – it’s usually the combination of them that brings the greatest insight and value. So you need to link them together by working out how the people, places, products and companies mentioned in one dataset relate to those mentioned in another. Conceptually this seems trivial but at scale through time it’s a really complicated problem.
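To make those cleaning and linkage checks concrete, here is a minimal Python sketch of the kinds of steps involved. It is an illustration only, not a method described in the interview: the unit table, the list of legal suffixes, the outlier threshold and all function names are assumptions chosen for the example, and a real pipeline would need far more careful rules.

```python
import re
import statistics
from datetime import date, timedelta

# Hypothetical unit multipliers; a real dataset needs its own mapping.
UNIT_MULTIPLIERS = {"usd": 1.0, "usd_thousands": 1_000.0, "usd_millions": 1_000_000.0}

# Legal suffixes dropped when building a matching key (illustrative, not exhaustive).
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "ltd", "limited", "plc", "llc", "co", "company"}


def to_base_units(value, unit):
    """Convert a reported value to plain USD using the assumed multiplier table."""
    return value * UNIT_MULTIPLIERS[unit]


def flag_outliers(values, threshold=5.0):
    """Return indices whose distance from the median exceeds threshold * MAD."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values) if abs(v - median) / mad > threshold]


def missing_days(dates):
    """List calendar days absent between the first and last observed date."""
    observed = set(dates)
    day, last = min(observed), max(observed)
    gaps = []
    while day < last:
        day += timedelta(days=1)
        if day not in observed:
            gaps.append(day)
    return gaps


def company_key(name):
    """Build a crude matching key: lowercase, strip punctuation and legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def link_records(left_names, right_names):
    """Map each left-hand name to the right-hand names that share its key."""
    index = {}
    for name in right_names:
        index.setdefault(company_key(name), []).append(name)
    return {name: index.get(company_key(name), []) for name in left_names}
```

Exact key matching like this only catches the easy cases. Renamed companies, subsidiaries, mergers and tickers that change through time are where the linkage problem becomes genuinely hard at scale.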

In terms of interpretation, the key challenge is understanding the nature of the data. For instance, much alternative data is essentially a proxy for what you really want to measure. For equities, real-time sales data would be ideal; but you might use credit card data based on a panel of purchasers as a proxy. How has the vendor dealt with the panel so that it is relevant for different styles of stock (demographically, geographically, economically, etc.) and how representative is credit card data of sales for different stocks? Web traffic, footfall data or satellite images of car parks are steps further removed from sales and their use necessitates further approximations and assumptions. Then there are complicating factors: with credit cards, you have to consider third party sellers such as Amazon or Deliveroo, and how they appear in the data. This isn’t impossible at all, but it’s hard and needs careful thought about how the data you’ve bought relates to the system you’re developing.
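One common way to think about panel representativeness is post-stratification: reweighting each segment of the panel so that it counts in proportion to its share of the population you care about. The short Python sketch below shows the arithmetic. It is a simplified illustration rather than a description of how any vendor treats its panel, and the segment names, function names and figures are all invented for the example.

```python
def raw_spend_per_head(panel):
    """Average spend per panellist, ignoring who the panellists are.

    panel: {segment: (panellist_count, total_spend)}
    """
    total_count = sum(count for count, _ in panel.values())
    total_spend = sum(spend for _, spend in panel.values())
    return total_spend / total_count


def reweighted_spend_per_head(panel, population_share):
    """Post-stratified average: weight each segment by its population share.

    population_share: {segment: share of the target population}, summing to 1.
    """
    estimate = 0.0
    for segment, share in population_share.items():
        count, spend = panel[segment]
        estimate += share * (spend / count)
    return estimate


# Invented example: the panel over-represents younger, higher-spending customers.
panel = {"under_35": (800, 40_000.0), "35_plus": (200, 4_000.0)}
population_share = {"under_35": 0.5, "35_plus": 0.5}

raw = raw_spend_per_head(panel)                                # 44.0
adjusted = reweighted_spend_per_head(panel, population_share)  # 35.0
```

In this made-up case the raw panel average overstates spend per head because the over-represented segment spends more. Reweighting corrects for panel composition only; it does nothing about purchases that never appear on a card, or about third party sellers masking the underlying merchant.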

What are the risks and inefficiencies involved in overcoming these challenges?

In the first place, there are potentially serious legal and compliance risks with some alternative datasets. It’s very important to be sure that the vendor has acquired the data in a way which abides by all required data protection laws.

Beyond that, the basic risk is return on investment. You can spend a lot of money on alternative data and a lot of effort obtaining raw source material. You can spend on the infrastructure and the expert staff to interpret it, but if you don’t properly consider the data preparation aspect then you may struggle to realise the potential value.

Data preparation, or wrangling, is an area of expertise all of its own. Although discretionary analysts and quantitative researchers might be able to pick up those skills, you can end up with very highly paid staff doing something they aren’t expert in and aren’t enjoying. It’s an efficiency risk and a retention risk.

What steps can people take to overcome these challenges while minimising those risks?

When you consider how to include the broad range of data available in your business decisions, data preparation should be part of that overall strategy. How will you actually create and curate accurate datasets that are appropriate for your intended research without burdening inappropriate teams with data preparation work?

That can be achieved by developing a specialist internal team to deal with it or investing in third party tools and services to help.

Why is there little awareness of the importance of data wrangling?

I think amongst many practitioners there actually is awareness of the importance of data wrangling, but it doesn’t get the media coverage it deserves because it’s just not that cool. Machine learning and cloud computing are cool, but to many, preparing data is time-consuming, tedious and uncool. And yet it’s incredibly, incredibly valuable – it’s actually the thing at the heart of it all.

What are the potential outcomes if people do overcome these challenges?

It’s not contentious to say that you make better decisions if you make them on the basis of broader, richer and more varied datasets. In the end, research driven by these datasets will provide you with strong signals or systems that are differentiated from your competitors who have stuck with more traditional data sources. But there is no silver bullet – you need to go through the data preparation pain to get to the gain.


