Data wrangling crucial to AI development
Seeing data through the same lens as AI creates the transparency needed for trust. Dan Barnes reports.
Data wrangling and governance will be key to optimising the use of artificial intelligence (AI) within investment management. For investment firms seeking to apply machine learning (ML) algorithms within their business, be it in the investment process, risk management, or back office processing, both the outcome an algorithm generates and the justification for that outcome need to be trusted and tested.
“This is a world where the question ‘is the data good’ is meaningless,” said Mark Ainsworth, global head of data insights and analytics at Schroders, speaking at the InvestOps conference in London on 19 November 2019. “The question is, ‘what is the data good for?’ You can’t test that data until you have it in the building.”
In October 2019 the UK’s Financial Conduct Authority (FCA) and Bank of England released a joint paper entitled ‘Machine learning in UK financial services’. Based on research into regulated firms, it found that 45% of investment and capital markets firms had a machine learning strategy, but for 57% this was governed by their existing model risk management strategy. Furthermore, although 76% developed and implemented their ML use cases inhouse, 40% used third party data.
In the study, the top five risks identified from machine learning (ML) applications relate to: lack of explainability; biases in data and algorithms; poor performance for clients/customers and associated reputational damage; inadequate controls, validation or governance; and inaccurate predictions resulting in poor decisions.
The same month, BlackRock released a paper, ‘Artificial intelligence and machine learning in asset management’, which observed, “Asset managers rely on vast quantities of data, including from external data vendors. Thus, data quality and robust production monitoring should be of the utmost importance to reduce errors and mitigate operational risks.”
Data, even when referring to a specific asset class or market, can include a wide array of information types, presented in different ways. Listed instruments may have market data standardised by the primary exchange; in over-the-counter markets, however, the proprietary data representation used internally by firms will be reflected in the data exchanged with counterparties.
Moreover, new levels of information have become viable inputs into decisions for capital markets firms. ML and artificial intelligence technologies are able to analyse unstructured data, such as pictures and text, and draw out information which can be quantified.
The ‘2019 Future of Trading Study’, published by analyst firm Greenwich Associates, found that 95% of respondents saw alternative data such as this becoming a more valuable input into trading over the next three to five years (see Figure 1).
To realise the value of the “vast quantities of data” they use, asset managers must wrangle the relevant data sets together for ML projects, or build in safeguards for systems that have already been modelled on data sets.
Getting checked
Ainsworth says firms do not need perfect data to generate insights; what they need is access to the data and the capability to manage it.
“You are answering questions nobody asked before and you need to be able to answer them quickly,” he said. “If the insight is delivered too late but was delivered solidly and reliably, it is of no value. As a result, you are loading in data before you know whether it is good, and where the gaps are. Then you have a set of people who thrive on working that problem out.”
Wrangling data is the process of pulling it into a usable format or data model, with every data point tracked into a wider model to ensure a common, viable data set can be created (see Hivemind).
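As a rough illustration of what pulling data into a common model can look like, the sketch below maps two differently shaped price feeds onto one record type and keeps a note of where each data point came from. It is a minimal sketch only: the field names, feed layouts and instrument identifiers are invented for the example and do not describe any firm's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricePoint:
    """One record in the common data model (field names are illustrative)."""
    instrument: str
    price: float
    currency: str
    source: str      # which feed the value came from
    source_row: int  # position in that feed, so the point can be traced back


def from_exchange(rows):
    # Hypothetical standardised exchange feed: keys "isin", "last", "ccy".
    return [
        PricePoint(r["isin"], float(r["last"]), r["ccy"].upper(), "exchange", i)
        for i, r in enumerate(rows)
    ]


def from_counterparty(rows):
    # Hypothetical proprietary OTC format: different keys, lower-case currency.
    return [
        PricePoint(r["Instrument ID"], float(r["Px"]), r["Currency"].upper(),
                   "counterparty", i)
        for i, r in enumerate(rows)
    ]


def wrangle(exchange_rows, counterparty_rows):
    """Combine both feeds into one list that shares a single schema."""
    return from_exchange(exchange_rows) + from_counterparty(counterparty_rows)


# Made-up sample rows; the identifier is a placeholder, not a real instrument.
exchange_rows = [{"isin": "XS0000000001", "last": "101.25", "ccy": "GBP"}]
counterparty_rows = [
    {"Instrument ID": "XS0000000001", "Px": "101.30", "Currency": "gbp"}
]
common = wrangle(exchange_rows, counterparty_rows)
```

Because every record carries its source and row position, a data team can trace any value in the combined set back to the feed it arrived in.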
If the data is being pulled together internally within the firm, the right technology will be needed to process and map massive data sets in order to make sense of them.
That will need to happen whether the machine learning algorithm is built internally or provided by a vendor, if it is based on the investment firm’s historical data.
“Where the vendor provides a model but you provide the training data, the vendor is unlikely to agree to take full responsibility for the accuracy of the model because it’s been trained on your data. Equally, the vendor may be reluctant to provide other assurances regarding the model in these circumstances; for example, that the model’s output is free from bias,” says Minesh Tanna, AI Lead at law firm Simmons & Simmons. “The more collaborative the procurement / implementation process, the less protection you are likely to be able to obtain from the vendor.”
Data science teams will therefore need to buy or develop in-house data wrangling capabilities to manage these risks within the tolerances required for internal assurance and regulatory guidance. When third party systems are employed, a different approach is needed.
Managing third party providers
For vendors to the financial markets, AI capabilities can be a major selling point, but they need to be aware of the governance concerns and requirements of their users. SmartStream uses AI to support automated reconciliation by recognising the content of a file, matching files and creating matching rules.
“All the business logic is incorporated into our tool; after a few seconds you get the matching results. That’s a complete game changer in the reconciliation space because we have reduced something that took weeks to a few seconds,” says Andreas Burner, chief innovation officer at SmartStream. “Incorporated in that process is reconciliation via a white box, it can show the user the rules, the matching criteria, the mapping and so on. Therefore we have an audit trail in place.”
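To make the ‘white box’ idea concrete, the sketch below matches records between two files using named, explicit rules and records which rules justified each match. It is an assumption-laden illustration, not SmartStream’s method: the rule names, record fields and tolerance are hypothetical, and the rules here are written by hand rather than learned.

```python
def reconcile(ledger, statement, rules):
    """Match records with explicit rules and keep an audit trail of each match."""
    matches, audit = [], []
    unmatched = list(statement)  # copy, so the caller's list is not modified
    for left in ledger:
        for right in unmatched:
            passed = [name for name, rule in rules if rule(left, right)]
            if len(passed) == len(rules):  # every rule must hold
                matches.append((left["id"], right["id"]))
                audit.append(
                    {"left": left["id"], "right": right["id"], "rules": passed}
                )
                unmatched.remove(right)
                break  # stop scanning once this ledger record is matched
    return matches, audit, unmatched


# Hypothetical matching rules: each has a readable name and a test function.
rules = [
    ("same_reference", lambda a, b: a["ref"] == b["ref"]),
    ("amount_within_tolerance",
     lambda a, b: abs(a["amount"] - b["amount"]) <= 0.01),
]

# Made-up sample records.
ledger = [
    {"id": "L1", "ref": "T-100", "amount": 250.00},
    {"id": "L2", "ref": "T-101", "amount": 75.50},
]
statement = [
    {"id": "S1", "ref": "T-100", "amount": 250.00},
    {"id": "S2", "ref": "T-102", "amount": 75.50},
]

matches, audit, unmatched = reconcile(ledger, statement, rules)
```

Each audit entry lists the rules that passed for a matched pair, which is the kind of record a user or auditor could inspect afterwards; records left in `unmatched` would be passed on for manual review.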
Client demand also means providing insight into the way the model has been developed, including the data used, he says.
“We are in the banking and financial services world, so when clients implement the system it’s not just about AI and machine learning it’s also about data lineage, we have to explain where the data is coming from, and how it’s being used,” he says. “That is also a liability issue, we need to be transparent so customers need to know when we apply machine learning or AI.”
That liability issue means firms need to ensure they are protected legally if they have not managed the process of building data sets internally.
“If you are buying a more ‘off-the-shelf’ product which has already been trained on data, and which you plan to implement in your business from day one, you will naturally want greater assurances that the model is accurate, that it is free from bias as far as possible, and also that it’s compliant with regulatory obligations and ethical best practice in so far as the vendor / developer is able to guarantee this,” says Tanna. “In these circumstances, you should seek greater assurances, because you are ultimately not responsible for the data on which the model was trained and may not be able to interrogate it yourself.”
Other legal issues arise including the allocation of proprietary rights, which an asset manager will want to retain for valuable models.
“If you are an investment manager and are collaborating with a third party to develop an AI system which, for example, can predict stock movements based on your own research or data, you will want to ensure that you retain proprietary rights to appropriate parts of the AI system such as the model and the data,” says Tanna. ●