nep-big 2021-04-05 papers

on Big Data

Issue of 2021‒04‒05
nineteen papers chosen by
Tom Coupé
University of Canterbury

Betting models using AI: a review on ANN, SVM, and Markov chain By Kollár, Aladár
A machine learning approach to domain specific dictionary generation. An economic time series framework By Hanjo Odendaal
Text Mining of Stocktwits Data for Predicting Stock Prices By Mukul Jaggi; Priyanka Mandal; Shreya Narang; Usman Naseem; Matloob Khushi
The Hard Problem of Prediction for Conflict Prevention By Hannes Mueller; Christopher Rauh
The Value of Data for Prediction Policy Problems: Evidence from Antibiotic Prescribing By Shan Huang; Michael Allan Ribers; Hannes Ullrich
Concept of peer-to-peer lending and application of machine learning in credit scoring By Aleksy Klimowicz; Krzysztof Spirzewski
Data-intensive innovation and the State: evidence from AI firms in China By Martin Beraja; David Y. Yang; Noam Yuchtman
Higher-Order Orthogonal Causal Learning for Treatment Effect By Yiyan Huang; Cheuk Hang Leung; Xing Yan; Qi Wu
News media vs. FRED-MD for macroeconomic forecasting By Jon Ellingsen; Vegard H. Larsen; Leif Anders Thorsrud
Can Machine Learning Help to Select Portfolios of Mutual Funds? By Victor DeMiguel; Javier Gil-Bazo; Francisco J. Nogales; André A. P. Santos
Can machine learning help to select portfolios of mutual funds? By Victor DeMiguel; Javier Gil-Bazo; Francisco J. Nogales; André A. P. Santos
Interpretable ML-driven Strategy for Automated Trading Pattern Extraction By Artur Sokolovsky; Luca Arnaboldi; Jaume Bacardit; Thomas Gross
A deep learning approach to data-driven model-free pricing and to martingale optimal transport By Ariel Neufeld; Julian Sester
Machine Learning and Central Banks: Ready for Prime Time? By Hans Genberg; Özer Karagedikli
TradeR: Practical Deep Hierarchical Reinforcement Learning for Trade Execution By Karush Suri; Xiao Qi Shi; Konstantinos Plataniotis; Yuri Lawryshyn
How Many Online Workers are there in the World? A Data-Driven Assessment By Kässi, Otto; Lehdonvirta, Vili; Stephany, Fabian
How Many Online Workers are there in the World? A Data-Driven Assessment By Otto Kässi; Vili Lehdonvirta; Fabian Stephany
Measuring the Economic Cost of Conflict in Afflicted Arab Countries By Elif Semra Ceylan; Semih Tumen
Intraday trading strategy based on time series and machine learning for Chinese stock market By Q. Wang; Y. Zhou; J. Shen

Betting models using AI: a review on ANN, SVM, and Markov chain

By:	Kollár, Aladár
Abstract:	In today's world, sports generate a great deal of data about each athlete, team, event, and season. Many people, from spectators to bettors, find it fascinating to predict the outcomes of sporting events. With so much data available, the sports betting industry is turning to artificial intelligence. Sports betting all over the world requires working with large amounts of data and information, and artificial intelligence and machine learning are assisting in the prediction of sporting trends. The true influence of technology is felt as it offers these observations in real time, which can have an impact on important factors in betting. An artificial neural network (ANN) is made up of several small, interconnected processors called neurons, which are similar to the biological neurons in the brain. Within the ANN framework, the multilayer perceptron (MLP), the most widely applicable neural network algorithm, is generally selected as the best model for predicting the outcomes of football matches. This review also discusses another common modern intelligent technique, namely Support Vector Machines (SVM). Lastly, it discusses the use of Markov chains to predict the result of a sporting event. A Markov chain is a sequence of states in which the next state is sampled from the state space with probabilities that depend only on the current state.
Keywords:	Artificial Intelligence; ANN; Betting; sports; SVM; Markov chain
JEL:	C5 C55 C6
Date:	2021–03–21
URL:	http://d.repec.org/n?u=RePEc:pra:mprapa:106821&r=all
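
The Markov chain idea mentioned at the end of the abstract above can be illustrated with a minimal sketch. The three outcome states and the transition probabilities are hypothetical illustration values, not estimates from the reviewed paper, and the function names are invented for this example.

```python
# Hypothetical three-state chain over match outcomes. The probabilities are
# made-up illustration values, not estimates from the reviewed paper.
STATES = ["win", "draw", "loss"]
TRANSITION = [
    [0.5, 0.3, 0.2],  # row: probabilities of the next outcome after a win
    [0.3, 0.4, 0.3],  # after a draw
    [0.2, 0.3, 0.5],  # after a loss
]


def step(dist, matrix):
    """Advance a probability distribution over states by one transition."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def forecast(start_state, n_steps, matrix=TRANSITION, states=STATES):
    """Return the outcome distribution n_steps matches after start_state."""
    dist = [1.0 if s == start_state else 0.0 for s in states]
    for _ in range(n_steps):
        dist = step(dist, matrix)
    return dict(zip(states, dist))


if __name__ == "__main__":
    # One match after a win, the forecast is simply the "win" row.
    print(forecast("win", 1))
    # Further ahead, the influence of the starting state fades.
    print(forecast("win", 5))
```

The only information carried forward is the current distribution over states, which is the defining property of a Markov chain; a betting model would estimate the transition matrix from historical results rather than assume it.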

A machine learning approach to domain specific dictionary generation. An economic time series framework

By:	Hanjo Odendaal (Department of Economics, Stellenbosch University)
Abstract:	This paper aims to offer an alternative to the manually labour intensive process of constructing a domain specific lexicon or dictionary through the operationalization of subjective information processing. This paper builds on current empirical literature by (a) constructing a domain specific dictionary for various economic confidence indices, (b) introducing a novel weighting schema of text tokens that account for time dependence; and (c) operationalising subjective information processing of text data using machine learning. The results show that sentiment indices constructed from machine generated dictionaries have a better fit with multiple indicators of economic activity than the manually constructed dictionary of Loughran and McDonald (2011). Analysis shows a lower RMSE for the domain specific dictionaries in a five year holdout sample period from 2012 to 2017. The results also justify the time series weighting design used to overcome the p>>n problem, commonly found when working with economic time series and text data.
Keywords:	Sentometrics, Machine learning, Domain-specific dictionaries
JEL:	C32 C45 C53 C55
Date:	2021
URL:	http://d.repec.org/n?u=RePEc:sza:wpaper:wpapers366&r=all

Text Mining of Stocktwits Data for Predicting Stock Prices

By:	Mukul Jaggi; Priyanka Mandal; Shreya Narang; Usman Naseem; Matloob Khushi
Abstract:	Stock price prediction can be made more efficient by considering the price fluctuations and understanding the sentiments of people. A limited number of models understand financial jargon or have labelled datasets concerning stock price change. To overcome this challenge, we introduced FinALBERT, an ALBERT based model trained to handle financial domain text classification tasks by labelling Stocktwits text data based on stock price change. We collected Stocktwits data for over ten years for 25 different companies, including the major five FAANG (Facebook, Amazon, Apple, Netflix, Google). These datasets were labelled with three labelling techniques based on stock price changes. Our proposed model FinALBERT is fine-tuned with these labels to achieve optimal results. We experimented with the labelled dataset by training it on traditional machine learning, BERT, and FinBERT models, which helped us understand how these labels behaved with different model architectures. The competitive advantage of our labelling method is that it can help analyse the historical data effectively, and the mathematical function can be easily customised to predict stock movement.
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2103.16388&r=all

The Hard Problem of Prediction for Conflict Prevention

By:	Hannes Mueller; Christopher Rauh
Abstract:	There is a growing interest in prevention in several policy areas and this provides a strong motivation for an improved integration of forecasting with machine learning into models of decision making. In this article we propose a framework to tackle conflict prevention. A key problem of conflict forecasting for prevention is that predicting the start of conflict in previously peaceful countries needs to overcome a low baseline risk. To make progress in this hard problem this project combines a newspaper-text corpus of more than 4 million articles with unsupervised and supervised machine learning. The output of the forecast model is then integrated into a simple static framework in which a decision maker decides on the optimal number of interventions to minimize the total cost of conflict and intervention. This exercise highlights the potential cost savings of prevention for which reliable forecasts are a prerequisite.
Keywords:	armed conflict, forecasting, machine learning, newspaper text, random forest, topic models
JEL:	O11 O43
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:bge:wpaper:1244&r=all

The Value of Data for Prediction Policy Problems: Evidence from Antibiotic Prescribing

By:	Shan Huang; Michael Allan Ribers; Hannes Ullrich
Abstract:	Large-scale data show promise to provide efficiency gains through individualized risk predictions in many business and policy settings. Yet, assessments of the degree of data-enabled efficiency improvements remain scarce. We quantify the value of the availability of a variety of data combinations for tackling the policy problem of curbing antibiotic resistance, where the reduction of inefficient antibiotic use requires improved diagnostic prediction. Focusing on antibiotic prescribing for suspected urinary tract infections in primary care in Denmark, we link individual-level administrative data with microbiological laboratory test outcomes to train a machine learning algorithm predicting bacterial test results. For various data combinations, we assess out of sample prediction quality and efficiency improvements due to prediction-based prescription policies. The largest gains in prediction quality can be achieved using simple characteristics such as patient age and gender or patients’ health care data. However, additional patient background data lead to further incremental policy improvements even though gains in prediction quality are small. Our findings suggest that evaluating prediction quality against the ground truth only may not be sufficient to quantify the potential for policy improvements.
Keywords:	Prediction policy; data combination; machine learning; antibiotic prescribing
JEL:	C10 C55 I11 I18 Q28
Date:	2021
URL:	http://d.repec.org/n?u=RePEc:diw:diwwpp:dp1939&r=all

Concept of peer-to-peer lending and application of machine learning in credit scoring

By:	Aleksy Klimowicz (Faculty of Economic Sciences, University of Warsaw); Krzysztof Spirzewski (Faculty of Economic Sciences, University of Warsaw)
Abstract:	Numerous applications of AI are found in the banking sector, starting in the front office with enhanced customer recognition and personalized services, continuing in the middle office with automated fraud-detection systems, and ending in the back office with the automation of internal processes. In this paper we provide comprehensive information on the phenomenon of peer-to-peer lending in the modern view of alternative finance and crowdfunding from several perspectives. The aim of this research is to explore the peer-to-peer lending market model. We apply and check the suitability and effectiveness of credit scorecards in marketplace lending, along with determining the appropriate cut-off point. We conducted this research by exploring recent studies and open-source data on marketplace lending. The scorecard development is based on an open dataset of P2P loans that contains repayment records along with both hard and soft features of each loan. The quantitative part consists of applying a machine learning algorithm, namely logistic regression, to build a credit scorecard.
Keywords:	artificial intelligence, peer-to-peer lending, credit risk assessment, credit scorecards, logistic regression, machine learning
JEL:	G21 C25
Date:	2021
URL:	http://d.repec.org/n?u=RePEc:war:wpaper:2021-04&r=all

Data-intensive innovation and the State: evidence from AI firms in China

By:	Martin Beraja; David Y. Yang; Noam Yuchtman
Abstract:	Artificial intelligence (AI) innovation is data-intensive. States have historically collected large amounts of data, which is now being used by AI firms. Gathering comprehensive information on firms and government procurement contracts in China's facial recognition AI industry, we first study how government data shapes AI innovation. We find evidence of a precise mechanism: because data is sharable across uses, economies of scope arise. Firms awarded public security AI contracts providing access to more government data produce more software for both government and commercial purposes. In a directed technical change model incorporating this mechanism, we then study the trade-offs presented by states' AI procurement and data provision policies. Surveillance states' demand for AI may incidentally promote growth, but distort innovation, crowd-out resources, and infringe on civil liberties. Government data provision may be justified when economies of scope are strong and citizens' privacy concerns are limited.
Keywords:	data, innovation, artificial intelligence, China, economies of scope, directed technical change, industrial policy, privacy, surveillance
JEL:	O30 P00 E00 L5 L63 O25 O40
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:cep:cepdps:dp1755&r=all

Higher-Order Orthogonal Causal Learning for Treatment Effect

By:	Yiyan Huang; Cheuk Hang Leung; Xing Yan; Qi Wu
Abstract:	Most existing studies on the double/debiased machine learning method concentrate on estimating the causal parameter recovered from the first-order orthogonal score function. In this paper, we construct the $k^{\mathrm{th}}$-order orthogonal score function for estimating the average treatment effect (ATE) and present an algorithm that enables us to obtain the debiased estimator recovered from the score function. Such a higher-order orthogonal estimator is more robust to misspecification of the propensity score than the first-order one. Besides, it has the merit of being applicable with many machine learning methodologies such as Lasso, Random Forests, Neural Nets, etc. We also conduct comprehensive experiments to test the power of the estimator we construct from the score function, using both simulated and real datasets.
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2103.11869&r=all
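
For orientation, the first-order orthogonal score that the abstract above takes as its starting point is usually written in the double/debiased machine learning literature as the doubly robust score below. The notation is an assumption made here, not taken from the paper: $Y$ is the outcome, $D$ a binary treatment, $X$ the covariates, $g(d, X)$ the outcome regression, $m(X)$ the propensity score, and $\theta$ the ATE. The paper's $k^{\mathrm{th}}$-order construction is not reproduced.

```latex
\psi(W; \theta, \eta)
  = g(1, X) - g(0, X)
  + \frac{D \,\bigl(Y - g(1, X)\bigr)}{m(X)}
  - \frac{(1 - D)\,\bigl(Y - g(0, X)\bigr)}{1 - m(X)}
  - \theta,
\qquad
\eta = (g, m), \quad W = (Y, D, X).
```

Setting the sample average of $\psi$ to zero and solving for $\theta$ gives the estimator. First-order orthogonality means that small errors in the nuisance estimates $g$ and $m$ affect that moment condition only at second order.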

News media vs. FRED-MD for macroeconomic forecasting

By:	Jon Ellingsen; Vegard H. Larsen; Leif Anders Thorsrud
Abstract:	Using a unique dataset of 22.5 million news articles from the Dow Jones Newswires Archive, we perform an in depth real-time out-of-sample forecasting comparison study with one of the most widely used data sets in the newer forecasting literature, namely the FRED-MD dataset. Focusing on U.S. GDP, consumption and investment growth, our results suggest that the news data contains information not captured by the hard economic indicators, and that the news-based data are particularly informative for forecasting consumption developments.
Keywords:	forecasting, real-time, machine learning, news, text data
JEL:	C53 C55 E27 E37
Date:	2020–10–08
URL:	http://d.repec.org/n?u=RePEc:bno:worpap:2020_14&r=all

Can Machine Learning Help to Select Portfolios of Mutual Funds?

By:	Victor DeMiguel; Javier Gil-Bazo; Francisco J. Nogales; André A. P. Santos
Abstract:	Identifying outperforming mutual funds ex-ante is a notoriously difficult task. We use machine learning methods to exploit the predictive ability of a large set of mutual fund characteristics that are readily available to investors. Using data on US equity funds in the 1980-2018 period, the methods allow us to construct portfolios of funds that earn positive and significant out-of-sample risk-adjusted after-fee returns as high as 4.2% per year. We further show that such outstanding performance is the joint outcome of both exploiting the information contained in multiple fund characteristics and allowing for flexibility in the relationship between predictors and fund performance. Our results confirm that even retail investors can benefit from investing in actively managed funds. However, we also find that the performance of all our portfolios has declined over time, consistent with increased competition in the asset market and diseconomies of scale at the industry level.
Keywords:	mutual fund performance, performance predictability, active management, machine learning, elastic net, random forests, gradient boosting
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:bge:wpaper:1245&r=all

Can machine learning help to select portfolios of mutual funds?

By:	Victor DeMiguel; Javier Gil-Bazo; Francisco J. Nogales; André A. P. Santos
Abstract:	Identifying outperforming mutual funds ex-ante is a notoriously difficult task. We use machine learning methods to exploit the predictive ability of a large set of mutual fund characteristics that are readily available to investors. Using data on US equity funds in the 1980-2018 period, the methods allow us to construct portfolios of funds that earn positive and significant out-of-sample risk-adjusted after-fee returns as high as 4.2% per year. We further show that such outstanding performance is the joint outcome of both exploiting the information contained in multiple fund characteristics and allowing for flexibility in the relationship between predictors and fund performance. Our results confirm that even retail investors can benefit from investing in actively managed funds. However, we also find that the performance of all our portfolios has declined over time, consistent with increased competition in the asset market and diseconomies of scale at the industry level.
Keywords:	Mutual fund performance, performance predictability, active management, machine learning, elastic net, random forests, gradient boosting
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:upf:upfgen:1772&r=all

Interpretable ML-driven Strategy for Automated Trading Pattern Extraction

By:	Artur Sokolovsky; Luca Arnaboldi; Jaume Bacardit; Thomas Gross
Abstract:	Financial markets are a source of non-stationary multidimensional time series which has been drawing attention for decades. Each financial instrument has its own properties that change over time, making their analysis a complex task. Improving the understanding of financial time series and developing methods for their analysis is essential for successful operation on financial markets. In this study we propose a volume-based data pre-processing method for making financial time series more suitable for machine learning pipelines. We use a statistical approach for assessing the performance of the method. Namely, we formally state the hypotheses, set up associated classification tasks, compute effect sizes with confidence intervals, and run statistical tests to validate the hypotheses. We additionally assess the trading performance of the proposed method on historical data and compare it to a previously published approach. Our analysis shows that the proposed volume-based method allows successful classification of the financial time series patterns, and also leads to better classification performance than a price action-based method, excelling specifically on more liquid financial instruments. Finally, we propose an approach for obtaining feature interactions directly from tree-based models, using the CatBoost estimator as an example, and formally assess the relatedness of the proposed approach and SHAP feature interactions, with a positive outcome.
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2103.12419&r=all

A deep learning approach to data-driven model-free pricing and to martingale optimal transport

By:	Ariel Neufeld; Julian Sester
Abstract:	We introduce a novel and highly tractable supervised learning approach based on neural networks that can be applied for the computation of model-free price bounds of, potentially high-dimensional, financial derivatives and for the determination of optimal hedging strategies attaining these bounds. In particular, our methodology allows to train a single neural network offline and then to use it online for the fast determination of model-free price bounds of a whole class of financial derivatives with current market data. We show the applicability of this approach and highlight its accuracy in several examples involving real market data. Further, we show how a neural network can be trained to solve martingale optimal transport problems involving fixed marginal distributions instead of financial market data.
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2103.11435&r=all

Machine Learning and Central Banks: Ready for Prime Time?

By:	Hans Genberg (Asia School of Business); Özer Karagedikli (South East Asian Central Banks (SEACEN) Research and Training Centre and Centre for Applied Macroeconomic Analysis (CAMA))
Abstract:	In this article we review what machine learning might have to offer central banks as an analytical approach to support monetary policy decisions. After describing the central bank’s “problem” and providing a brief introduction to machine learning, we propose to use the gradual adoption of Vector Auto Regression (VAR) methods in central banks to speculate how machine learning models must (will?) evolve to become influential analytical tools supporting central banks’ monetary policy decisions. We argue that VAR methods achieved that status only after they incorporated elements that allowed users to interpret them in terms of structural economic theories. We believe that the same has to be the case for machine learning models.
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:sea:wpaper:wp43&r=all

TradeR: Practical Deep Hierarchical Reinforcement Learning for Trade Execution

By:	Karush Suri; Xiao Qi Shi; Konstantinos Plataniotis; Yuri Lawryshyn
Abstract:	Advances in Reinforcement Learning (RL) span a wide variety of applications which motivate development in this area. While application tasks serve as suitable benchmarks for real world problems, RL is seldom used in practical scenarios consisting of abrupt dynamics. This allows one to rethink the problem setup in light of practical challenges. We present Trade Execution using Reinforcement Learning (TradeR) which aims to address two such practical challenges of catastrophe and surprise minimization by formulating trading as a real-world hierarchical RL problem. Through this lens, TradeR makes use of hierarchical RL to execute trade bids on high frequency real market experiences comprising abrupt price variations during the 2019 fiscal year COVID19 stock market crash. The framework utilizes an energy-based scheme in conjunction with surprise value function for estimating and minimizing surprise. In a large-scale study of 35 stock symbols from the S&P500 index, TradeR demonstrates robustness to abrupt price changes and catastrophic losses while maintaining profitable outcomes. We hope that our work serves as a motivating example for application of RL to practical problems.
Date:	2021–02
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2104.00620&r=all

How Many Online Workers are there in the World? A Data-Driven Assessment

By:	Kässi, Otto; Lehdonvirta, Vili; Stephany, Fabian
Abstract:	An unknown number of people around the world are earning income by working through online labour platforms such as Upwork and Amazon Mechanical Turk. We combine data collected from various sources to build a data-driven assessment of the number of such online workers (also known as online freelancers) globally. Our headline estimate is that there are 163 million freelancer profiles registered on online labour platforms globally. Approximately 19 million of them have obtained work through the platform at least once, and 5 million have completed at least 10 projects or earned at least $1000. These numbers suggest a substantial growth from 2015 in registered worker accounts, but much less growth in amount of work completed by workers. Our results indicate that online freelancing represents a non-trivial segment of labour today, but one that is spread thinly across countries and sectors.
Date:	2021–03–24
URL:	http://d.repec.org/n?u=RePEc:osf:socarx:78nge&r=all

How Many Online Workers are there in the World? A Data-Driven Assessment

By:	Otto Kässi; Vili Lehdonvirta; Fabian Stephany
Abstract:	An unknown number of people around the world are earning income by working through online labour platforms such as Upwork and Amazon Mechanical Turk. We combine data collected from various sources to build a data-driven assessment of the number of such online workers (also known as online freelancers) globally. Our headline estimate is that there are 163 million freelancer profiles registered on online labour platforms globally. Approximately 19 million of them have obtained work through the platform at least once, and 5 million have completed at least 10 projects or earned at least $1000. These numbers suggest a substantial growth from 2015 in registered worker accounts, but much less growth in amount of work completed by workers. Our results indicate that online freelancing represents a non-trivial segment of labour today, but one that is spread thinly across countries and sectors.
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2103.12648&r=all

Measuring the Economic Cost of Conflict in Afflicted Arab Countries

By:	Elif Semra Ceylan (Ernst & Young); Semih Tumen (TED University)
Abstract:	The goal of this paper is to estimate the economic cost of conflict in selected Arab countries by using satellite images and geographical information systems (GIS) methods. Specifically, we employ image-processing techniques to generate data proxying intensity of economic activity at country and sub-region levels. The focus is on four countries: Iraq, Libya, Syria, and Yemen. These are the countries that have been most severely affected in various ways by the widespread wave of civil conflict that occurred in the MENA region in the aftermath of the Arab Spring. Certain back-of-the-envelope calculations suggest that GDP and main factors of production are nearly halved in those countries. We use data provided by the National Geophysical Data Center of the United States to compare the night-light intensities before and after the conflict in those four countries. The night-light data serve as a proxy for regional economic activity and are widely used to generate credible economic data, mainly in circumstances where official data either do not exist or are not reliable. We construct indices combining the contrast and dispersion of night-lights within fine-grained geographical regions and then report the time series evolution of those indices both at country and sub-region levels. The estimates suggest that the scale and intensity of economic destruction in the region have been unprecedented in recent history and that the extent of destruction is the largest in Syria and Yemen among those four conflict-afflicted countries. We also provide additional insights at sub-region level.
Date:	2021–02–20
URL:	http://d.repec.org/n?u=RePEc:erg:wpaper:1459&r=all

Intraday trading strategy based on time series and machine learning for Chinese stock market

By:	Q. Wang; Y. Zhou; J. Shen
Abstract:	This article proposes an intraday trading strategy under T+1 using Markowitz optimization and a Multilayer Perceptron (MLP), with published stock data obtained from the Shenzhen Stock Exchange and Shanghai Stock Exchange. The empirical results reveal the profitability of Markowitz portfolio optimization and validate the intraday stock price prediction using the MLP. The findings further combine Markowitz optimization and an MLP with the trading strategy to clarify this strategy's feasibility.
Date:	2021–03
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2103.13507&r=all

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.