nep-big New Economics Papers
on Big Data
Issue of 2019‒09‒09
eighteen papers chosen by
Tom Coupé
University of Canterbury

  1. The finer points of model comparison in machine learning: forecasting based on Russian banks’ data By Denis Shibitov; Mariam Mamedli
  2. Are Bitcoins price predictable? Evidence from machine learning techniques using technical indicators By Samuel Asante Gyamerah
  3. Reinforcement Learning: Prediction, Control and Value Function Approximation By Haoqian Li; Thomas Lau
  4. Does the Estimation of the Propensity Score by Machine Learning Improve Matching Estimation? The Case of Germany's Programmes for Long Term Unemployed By Goller, Daniel; Lechner, Michael; Moczall, Andreas; Wolff, Joachim
  5. Agricultural Loan Delinquency Prediction Using Machine Learning Methods By Chen, Jian; Katchova, Ani
  6. Economic Black Holes and Labor Singularities in the Presence of Self-replicating Artificial Intelligence By YANO Makoto; FURUKAWA Yuichi
  7. An introduction to flexible methods for policy evaluation By Huber, Martin
  8. Stock Price Forecasting and Hypothesis Testing Using Neural Networks By Kerda Varaku
  9. Predicting Returns With Text Data By Zheng Tracy Ke; Bryan T. Kelly; Dacheng Xiu
  10. Predicting systemic financial crises with recurrent neural networks By Tölö, Eero
  11. Rethinking travel behavior modeling representations through embeddings By Francisco C. Pereira
  12. Towards a Utility Theory of Privacy and Information Sharing and the Introduction of Hyper-Hyperbolic Discounting in the Digital Big Data Age By Julia M. Puaschunder
  13. Sabrina: Modeling and Visualization of Economy Data with Incremental Domain Knowledge By Alessio Arleo; Christos Tsigkanos; Chao Jia; Roger A. Leite; Ilir Murturi; Manfred Klaffenboeck; Schahram Dustdar; Michael Wimmer; Silvia Miksch; Johannes Sorger
  14. Crime and Networks: 10 Policy Lessons By Lindquist, Matthew J.; Zenou, Yves
  15. Predicting Consumer Default: A Deep Learning Approach By Stefania Albanesi; Domonkos F. Vamossy
  16. Predict Food Security with Machine Learning: Application in Eastern Africa By Zhou, Yujun; Baylis, Kathy
  17. QCNN: Quantile Convolutional Neural Network By Gábor Petneházi
  18. Data-sharing in IoT Ecosystems from a Competition Law Perspective: The Example of Connected Cars By Wolfgang Kerber

  1. By: Denis Shibitov (Bank of Russia, Russian Federation); Mariam Mamedli (Bank of Russia, Russian Federation)
    Abstract: We evaluate the forecasting ability of machine learning models to predict bank license withdrawal and the violation of statutory capital and liquidity requirements (capital adequacy ratio N1.0, common equity Tier 1 adequacy ratio N1.1, Tier 1 capital adequacy ratio N1.2, N2 instant and N3 current liquidity). On the basis of 35 series from the accounting reports of Russian banks, we form two data sets of 69 and 721 variables and use them to build random forest and gradient boosting models along with neural networks and a stacking model for different forecasting horizons (1, 2, 3, 6, 9 months). Based on the data from February 2014 to October 2018, we show that these models with fine-tuned architectures can successfully compete with the logistic regression usually applied to this task. Stacking and random forest generally have the best forecasting performance compared with the other models. We evaluate models with commonly used performance metrics (ROC-AUC and F1) and show that, depending on the task, the F1-score can be better at characterizing a model’s performance. Comparison of the results across the metrics applied and the types of cross-validation used illustrates the importance of choosing the appropriate performance metric and a cross-validation procedure that accounts for the characteristics of the data set and the task under consideration. The developed approach shows the advantages of non-linear methods for bank regulation tasks and provides guidelines for the application of machine learning algorithms to these tasks.
    Keywords: machine learning, random forest, neural networks, gradient boosting, forecasting, bank supervision
    JEL: C53 C52 C5
    Date: 2019–08
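The abstract's point about ROC-AUC versus F1 is easy to make concrete. A minimal sketch of both metrics in plain Python (inputs are illustrative, not the paper's data):

```python
# Sketch of the two metrics the paper compares; toy inputs only.

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random
    negative (ties count half) -- the ranking view of ROC-AUC."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for hard 0/1 predictions."""
    tp = sum(y == 1 and p == 1 for y, p in zip(y_true, y_pred))
    fp = sum(y == 0 and p == 1 for y, p in zip(y_true, y_pred))
    fn = sum(y == 1 and p == 0 for y, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Unlike ROC-AUC, F1 depends on the threshold used to turn scores into hard predictions, which is one reason it can be the more informative metric on heavily imbalanced targets such as rare license withdrawals.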
  2. By: Samuel Asante Gyamerah
    Abstract: The uncertainties in the future price of Bitcoin make it difficult to predict accurately. Accurate prediction of the price of Bitcoin is therefore important for the decision-making of investors and market players in the cryptocurrency market. Using historical data from 01/01/2012 to 16/08/2019, machine learning techniques (generalized linear model via penalized maximum likelihood, random forest, support vector regression with linear kernel, and stacking ensemble) were used to forecast the price of Bitcoin. The prediction models employed key, high-dimensional technical indicators as predictors. The performance of these techniques was evaluated using mean absolute percentage error (MAPE), root mean square error (RMSE), mean absolute error (MAE), and the coefficient of determination (R-squared). The performance metrics revealed that the stacking ensemble model with two base learners (random forest and generalized linear model via penalized maximum likelihood) and support vector regression with linear kernel as meta-learner was the optimal model for forecasting the Bitcoin price. The MAPE, RMSE, MAE, and R-squared values for the stacking ensemble model were 0.0191%, 15.5331 USD, 124.5508 USD, and 0.9967 respectively. These values show a high degree of reliability in predicting the price of Bitcoin with the stacking ensemble model. Accurately predicting the future price of Bitcoin will yield significant returns for investors and market players in the cryptocurrency market.
    Date: 2019–09
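For readers who want to reproduce this style of evaluation, the four error measures reported above are straightforward to compute; a minimal sketch (toy values, not the study's data):

```python
import math

def mape(actual, pred):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def mae(actual, pred):
    """Mean absolute error, in price units."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Root mean squared error; by construction never below the MAE
    computed on the same errors."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def r_squared(actual, pred):
    """Share of the variance in the actual values explained by the predictions."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

Because RMSE can never be smaller than MAE on the same forecast errors, the pair of reported values doubles as a useful sanity check on any evaluation pipeline.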
  3. By: Haoqian Li; Thomas Lau
    Abstract: With the increasing power of computers and the rapid development of self-learning methodologies such as machine learning and artificial intelligence, the problem of constructing automatic Financial Trading Systems (FTSs) has become an increasingly attractive research topic. An intuitive way of developing such a trading algorithm is to use Reinforcement Learning (RL) algorithms, which do not require model-building. In this paper, we dive into RL algorithms, illustrate the definitions of the reward function, actions and policy functions in detail, and introduce algorithms that could be applied to FTSs.
    Date: 2019–08
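The prediction/control setting the paper surveys can be illustrated with the tabular Q-learning update, one standard model-free control algorithm. The states, actions and reward below are toy placeholders, not a trading system:

```python
# One temporal-difference control step: move Q(s, a) toward the observed
# reward plus the discounted value of the best action in the next state.

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update; q maps state -> {action: value}.
    alpha is the learning rate, gamma the discount factor."""
    best_next = max(q[next_state].values())
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]

# Hypothetical two-state, two-action table purely for illustration.
q = {"s0": {"buy": 0.0, "sell": 0.0}, "s1": {"buy": 1.0, "sell": 0.0}}
new_value = q_update(q, "s0", "buy", reward=0.5, next_state="s1")
```

Value function approximation, as discussed in the paper, replaces the explicit table `q` with a parameterized function when the state space is too large to enumerate.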
  4. By: Goller, Daniel (University of St. Gallen); Lechner, Michael (University of St. Gallen); Moczall, Andreas (Institute for Employment Research (IAB), Nuremberg); Wolff, Joachim (Institute for Employment Research (IAB), Nuremberg)
    Abstract: Matching-type estimators using the propensity score are the major workhorse in active labour market policy evaluation. This work investigates whether machine learning algorithms for estimating the propensity score lead to more credible estimation of average treatment effects on the treated using a radius matching framework. Considering two popular methods, the results are ambiguous: we find that using LASSO-based logit models to estimate the propensity score delivers more credible results than conventional methods in small and medium-sized high-dimensional datasets. However, using Random Forests to estimate the propensity score may lead to a deterioration of performance in situations with a low treatment share. The application reveals a positive effect of the training programme on days in employment for the long-term unemployed. While the choice of the "first stage" is highly relevant in settings with a low number of observations and few treated, machine learning and conventional estimation become more similar in larger samples and with higher treatment shares.
    Keywords: programme evaluation, active labour market policy, causal machine learning, treatment effects, radius matching, propensity score
    JEL: J68 C21
    Date: 2019–08
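The radius-matching step the paper builds on can be sketched in a few lines. A hedged illustration with hypothetical propensity scores and outcomes (the estimators studied in the paper add weighting and bias-adjustment refinements omitted here):

```python
# Sketch of radius matching on the propensity score: each treated unit is
# compared with the mean outcome of all controls whose score lies within
# a fixed radius. Scores and outcomes below are made up for illustration.

def radius_match_att(treated, controls, radius=0.05):
    """ATT estimate: average over treated units of (own outcome minus
    mean outcome of in-radius controls). Units without a match are dropped.
    treated/controls: lists of (propensity_score, outcome) pairs."""
    effects = []
    for p_t, y_t in treated:
        matched = [y_c for p_c, y_c in controls if abs(p_c - p_t) <= radius]
        if matched:
            effects.append(y_t - sum(matched) / len(matched))
    return sum(effects) / len(effects)

treated = [(0.30, 12.0), (0.50, 15.0)]              # (score, outcome)
controls = [(0.28, 10.0), (0.33, 11.0), (0.52, 13.0)]
att = radius_match_att(treated, controls)
```

The paper's "first stage" question is about how the scores in the tuples above are produced (LASSO logit versus Random Forest); the matching step itself is unchanged.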
  5. By: Chen, Jian; Katchova, Ani
    Keywords: Agricultural Finance
    Date: 2019–06–25
  6. By: YANO Makoto; FURUKAWA Yuichi
    Abstract: This study is motivated by the widely-held view that self-replicating artificial intelligence may approach "some essential singularity . . . beyond which human affairs, as we know them, could not continue" (von Neumann). It investigates what state this process would lead to in an economy with frictionless markets. We demonstrate that if the production technologies, too, are frictionless, all workers will eventually be pulled into the most labor friendly sector (economic black hole). If, instead, they are subject to a friction created by congestion, it will eventually give rise to a state in which all workers will be unemployed (labor singularity).
    Date: 2019–08
  7. By: Huber, Martin
    Abstract: This chapter covers different approaches to policy evaluation for assessing the causal effect of a treatment or intervention on an outcome of interest. As an introduction to causal inference, the discussion starts with the experimental evaluation of a randomized treatment. It then reviews evaluation methods based on selection on observables (assuming a quasi-random treatment given observed covariates), instrumental variables (inducing a quasi-random shift in the treatment), difference-in-differences and changes-in-changes (exploiting changes in outcomes over time), as well as regression discontinuities and kinks (using changes in the treatment assignment at some threshold of a running variable). The chapter discusses methods particularly suited for data with many observations for a flexible (i.e. semi- or nonparametric) modeling of treatment effects, and/or many (i.e. high dimensional) observed covariates by applying machine learning to select and control for covariates in a data-driven way. This is not only useful for tackling confounding by controlling for instance for factors jointly affecting the treatment and the outcome, but also for learning effect heterogeneities across subgroups defined upon observable covariates and optimally targeting those groups for which the treatment is most effective.
    Keywords: Policy evaluation; treatment effects; machine learning; experiment; selection on observables; instrument; difference-in-differences; changes-in-changes; regression discontinuity design; regression kink design
    JEL: C21 C26 C29
    Date: 2019–08–12
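As one concrete instance of the methods reviewed, the canonical two-group, two-period difference-in-differences estimator compares outcome changes over time across groups; a minimal sketch with made-up group outcomes:

```python
# Sketch of the basic difference-in-differences estimator: the change over
# time for the treated group minus the change for the control group, which
# nets out common time trends under the parallel-trends assumption.

def did(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences on group outcome lists."""
    mean = lambda xs: sum(xs) / len(xs)
    return ((mean(treated_post) - mean(treated_pre))
            - (mean(control_post) - mean(control_pre)))

effect = did(treated_pre=[10.0, 12.0], treated_post=[15.0, 17.0],
             control_pre=[10.0, 12.0], control_post=[11.0, 13.0])
```

The flexible versions discussed in the chapter replace these raw group means with machine-learning-adjusted predictions, but the identifying comparison is the same.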
  8. By: Kerda Varaku
    Abstract: In this work we use Recurrent Neural Networks and Multilayer Perceptrons to predict NYSE, NASDAQ and AMEX stock prices from historical data. We experiment with different architectures and compare data normalization techniques. Then, we leverage those findings to question the efficient-market hypothesis through a formal statistical test.
    Date: 2019–08
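One of the normalisation choices such comparisons typically include is min-max scaling to [0, 1]; a brief sketch with toy prices:

```python
# Sketch of min-max scaling, a common normalisation for neural-network
# inputs. In a forecasting setting the min and max should come from the
# training window only, to avoid leaking future information.

def min_max_scale(series):
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]

scaled = min_max_scale([100.0, 150.0, 200.0])   # [0.0, 0.5, 1.0]
```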
  9. By: Zheng Tracy Ke; Bryan T. Kelly; Dacheng Xiu
    Abstract: We introduce a new text-mining methodology that extracts sentiment information from news articles to predict asset returns. Unlike more common sentiment scores used for stock return prediction (e.g., those sold by commercial vendors or built with dictionary-based methods), our supervised learning framework constructs a sentiment score that is specifically adapted to the problem of return prediction. Our method proceeds in three steps: 1) isolating a list of sentiment terms via predictive screening, 2) assigning sentiment weights to these words via topic modeling, and 3) aggregating terms into an article-level sentiment score via penalized likelihood. We derive theoretical guarantees on the accuracy of estimates from our model with minimal assumptions. In our empirical analysis, we text-mine one of the most actively monitored streams of news articles in the financial system—the Dow Jones Newswires—and show that our supervised sentiment model excels at extracting return-predictive signals in this context.
    JEL: C53 C58 G10 G11 G12 G14 G17
    Date: 2019–08
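Step 1 of the pipeline above (predictive screening) can be sketched simply: keep a term when the share of articles containing it that precede positive returns departs far enough from one half. A toy illustration with hypothetical tokens and thresholds, not the authors' implementation:

```python
from collections import defaultdict

def screen_terms(articles, returns_up, alpha=0.2, min_count=2):
    """articles: list of token lists; returns_up: parallel 0/1 labels
    (1 = positive subsequent return). Returns the terms whose
    positive-return frequency deviates from 0.5 by at least alpha,
    among terms appearing in at least min_count articles."""
    counts = defaultdict(lambda: [0, 0])    # term -> [appearances, up-days]
    for tokens, up in zip(articles, returns_up):
        for term in set(tokens):            # count each term once per article
            counts[term][0] += 1
            counts[term][1] += up
    return {term: n_up / n
            for term, (n, n_up) in counts.items()
            if n >= min_count and abs(n_up / n - 0.5) >= alpha}
```

Steps 2 and 3 of the paper then weight the surviving terms via topic modeling and aggregate them into an article-level score via penalized likelihood.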
  10. By: Tölö, Eero
    Abstract: We consider predicting systemic financial crises one to five years ahead using recurrent neural networks. The prediction performance is evaluated with the Jorda-Schularick-Taylor dataset, which includes the crisis dates and relevant macroeconomic series of 17 countries over the period 1870-2016. Previous literature has found simple neural network architectures to be useful in predicting systemic financial crises. We show that such predictions can be greatly improved by making use of recurrent neural network architectures, especially suited for dealing with time series input. The results remain robust after extensive sensitivity analysis.
    JEL: G21 C45 C52
    Date: 2019–08–27
  11. By: Francisco C. Pereira
    Abstract: This paper introduces the concept of travel behavior embeddings, a method for re-representing discrete variables that are typically used in travel demand modeling, such as mode, trip purpose, education level, family type or occupation. This re-representation process essentially maps those variables into a latent space called the embedding space. The benefit is that such spaces allow for richer nuances than the typical transformations used for categorical variables (e.g. dummy encoding, contrast encoding, principal components analysis). While the use of latent variable representations is not new per se in travel demand modeling, the idea presented here brings several innovations: it is an entirely data-driven algorithm; it is informative and consistent, since the latent space can be visualized and interpreted based on distances between different categories; it preserves the interpretability of coefficients, despite being based on neural network principles; and it is transferable, in that embeddings learned from one dataset can be reused for others, as long as travel behavior remains consistent across datasets. The idea is strongly inspired by natural language processing techniques, namely the word2vec algorithm, which underlies recent developments such as automatic translation and next-word prediction. Our method is demonstrated using a mode choice model, and shows improvements of up to 60% with respect to initial likelihood, and up to 20% with respect to the likelihood of the corresponding traditional model (i.e. using dummy variables) in out-of-sample evaluation. We provide a new Python package, called PyTre (PYthon TRavel Embeddings), that others can straightforwardly use to replicate our results or improve their own models. Our experiments are themselves based on an open dataset (swissmetro).
    Date: 2019–08
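The core re-representation idea is easy to illustrate: instead of a dummy (one-hot) code, each category gets a dense vector, and distances in that space become interpretable. The vectors below are hypothetical, not learned from any travel survey:

```python
import math

# Hypothetical 3-dimensional embedding of a trip-purpose variable; in the
# paper such vectors are learned from data rather than written by hand.
embeddings = {
    "commute":  [0.9, 0.1, 0.0],
    "business": [0.8, 0.2, 0.1],
    "leisure":  [0.1, 0.9, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_commute_business = cosine(embeddings["commute"], embeddings["business"])
sim_commute_leisure = cosine(embeddings["commute"], embeddings["leisure"])
```

With one-hot coding every pair of categories is equally distant; in the embedding space, related categories (here, commute and business travel) can sit closer together, which is what makes the latent space interpretable.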
  12. By: Julia M. Puaschunder (The New School, NY)
    Abstract: Economics is concerned with utility. Utility theory captures people’s preferences or values. Yet for all its foundational role in economic theory, the wealth of information and theories on utility lacks an account of decision-making conflicts between preferences and values. The preference for communication is inherent in human beings as a distinct feature of humanity. Leaving a written legacy that can inform many generations to come is a uniquely human advancement of society. At the same time, however, privacy is a core human value. People choose what information to share with whom and like to protect some parts of their selves by secrecy. Protecting people’s privacy is a codified virtue around the globe, grounded in the wish to uphold individuals’ dignity. Yet to this day, no utility theory exists to describe the internal conflict arising from the individual preference to communicate and the value of privacy. In the age of instant communication, social media, big data storage and computational power, the need to understand people’s trade-off between communication and privacy has gained unprecedented momentum. For one, enormous data storage capacities and computational power in the e-big data era have created unforeseen opportunities for big-data-hoarding corporations to reap hidden benefits from individuals’ information sharing, which occurs bit by bit in small tranches over time.
    Keywords: Behavioral Economics, Behavioral Political Economy, Democratisation of information, Education, Exchange value, Governance, Preferences, Right to delete, Right to be forgotten, Social media, Utility, Value
    Date: 2018
  13. By: Alessio Arleo; Christos Tsigkanos; Chao Jia; Roger A. Leite; Ilir Murturi; Manfred Klaffenboeck; Schahram Dustdar; Michael Wimmer; Silvia Miksch; Johannes Sorger
    Abstract: Investment planning requires knowledge of the financial landscape on a large scale, both in terms of geo-spatial and industry sector distribution. There is plenty of data available, but it is scattered across heterogeneous sources (newspapers, open data, etc.), which makes it difficult for financial analysts to understand the big picture. In this paper, we present Sabrina, a financial data analysis and visualization approach that incorporates a pipeline for the generation of firm-to-firm financial transaction networks. The pipeline is capable of fusing the ground truth on individual firms in a region with (incremental) domain knowledge on general macroscopic aspects of the economy. Sabrina unites these heterogeneous data sources within a uniform visual interface that enables the visual analysis process. In a user study with three domain experts, we illustrate the usefulness of Sabrina, which eases their analysis process.
    Date: 2019–08
  14. By: Lindquist, Matthew J. (SOFI, Stockholm University); Zenou, Yves (Monash University)
    Abstract: Social network analysis can help us understand more about the root causes of delinquent behavior and crime and provide practical guidance for the design of crime prevention policies. To illustrate these points, we first present a selective review of several key studies and findings from the criminology and police studies literature. We then turn to a presentation of recent contributions made by network economists. We highlight 10 policy lessons and provide a discussion of recent developments in the use of big data and computer technology.
    Keywords: co-offending, crime, criminal networks, social networks, peer effects, key player
    JEL: A14 K42 Z13
    Date: 2019–08
  15. By: Stefania Albanesi; Domonkos F. Vamossy
    Abstract: We develop a model to predict consumer default based on deep learning. We show that the model consistently outperforms standard credit scoring models, even though it uses the same data. Our model is interpretable and is able to provide a score to a larger class of borrowers relative to standard credit scoring models while accurately tracking variations in systemic risk. We argue that these properties can provide valuable insights for the design of policies targeted at reducing consumer default and alleviating its burden on borrowers and lenders, as well as macroprudential regulation.
    Date: 2019–08
  16. By: Zhou, Yujun; Baylis, Kathy
    Keywords: International Development
    Date: 2019–06–25
  17. By: Gábor Petneházi
    Abstract: A dilated causal one-dimensional convolutional neural network architecture is proposed for quantile regression. The model can forecast any arbitrary quantile, and it can be trained jointly on multiple similar time series. An application to Value at Risk forecasting shows that QCNN outperforms linear quantile regression and constant quantile estimates.
    Date: 2019–08
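The objective that quantile regression models of this kind minimise is the pinball (quantile) loss; a minimal sketch with toy values:

```python
# Sketch of the pinball loss at level tau: under-predictions are weighted
# by tau and over-predictions by (1 - tau), so minimising it targets the
# tau-quantile of the outcome rather than its mean.

def pinball_loss(y_true, y_pred, tau):
    """Average quantile loss at level tau (0 < tau < 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y_true)
```

For Value at Risk forecasting, tau is set to the VaR level (e.g. 0.01 or 0.05), so the loss penalises exceedances and non-exceedances asymmetrically.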
  18. By: Wolfgang Kerber (Philipps University Marburg)
    Abstract: This paper analyses whether competition law can help to solve problems of access to data and interoperability in IoT ecosystems, where often one firm has exclusive control of the data produced by a smart device (and of the technical access to this device). Such a gatekeeper position can lead to the elimination of competition for after-market and other complementary services in such IoT ecosystems. This problem is analysed both from an economic and a legal perspective, both generally for IoT ecosystems and for the much-discussed problems of “access to in-vehicle data and resources” in connected cars, where the “extended vehicle” concept of the car manufacturers leads to such positions of exclusive control. The paper analyses, in particular, the competition rules about abusive behavior of dominant firms (Art. 102 TFEU) and of firms with “relative market power” (§ 20 (1) GWB) in German competition law. These provisions might offer (if appropriately applied and amended) at least some solutions for these data access problems. Competition law, however, might not be sufficient for dealing with all or most of these problems, i.e. additional solutions might also be needed (data portability, direct data (access) rights, or sector-specific regulation).
    Keywords: data access, Internet of Things, data sharing, competition, digital economy, connected cars
    JEL: K23 L62 L86 O33
    Date: 2019

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.