nep-big New Economics Papers
on Big Data
Issue of 2020‒03‒02
twenty-one papers chosen by
Tom Coupé
University of Canterbury

  1. On fintech and financial inclusion By Thomas Philippon
  2. Discretization and Machine Learning Approximation of BSDEs with a Constraint on the Gains-Process By Idris Kharroubi; Thomas Lim; Xavier Warin
  3. How Magic a Bullet Is Machine Learning for Credit Analysis? An Exploration with FinTech Lending Data By J. Christina Wang; Charles B. Perkins
  4. The network of firms implied by the news By Zheng, Hannan; Schwenkler, Gustavo
  5. The gender pay gap revisited: Does machine learning offer new insights? By Brieland, Stephanie; Töpfer, Marina
  6. Corruption red flags in public procurement: new evidence from Italian calls for tenders By Francesco Decarolis; Cristina Giorgiantonio
  7. Deep Learning for Financial Applications : A Survey By Ahmet Murat Ozbayoglu; Mehmet Ugur Gudelek; Omer Berat Sezer
  8. Priority to Unemployed Immigrants? A Causal Machine Learning Evaluation of Training in Belgium By Cockx, Bart; Lechner, Michael; Bollens, Joost
  9. Diverging roads: Theory-based vs. machine learning-implied stock risk premia By Grammig, Joachim; Hanenberg, Constantin; Schlag, Christian; Sönksen, Jantje
  10. Misleading Estimation of Backwardness through NITI Aayog SDG index: A study to find loopholes and construction of alternative index with the help of Artificial Intelligence By Sen, Sugata; Sengupta, Soumya
  11. A Hierarchy of Limitations in Machine Learning By Momin M. Malik
  12. End-of-Conflict Deforestation: Evidence from Colombia’s Peace Agreement By Mounu Prem; Santiago Saavedra; Juan F. Vargas
  13. Predicting Bank Loan Default with Extreme Gradient Boosting By Rising Odegua
  14. Reinforcement-Learning based Portfolio Management with Augmented Asset Movement Prediction States By Yunan Ye; Hengzhi Pei; Boxin Wang; Pin-Yu Chen; Yada Zhu; Jun Xiao; Bo Li
  15. Terrorist Attacks, Cultural Incidents and the Vote for Radical Parties: Analyzing Text from Twitter By Francesco Giavazzi; Felix Iglhaut; Giacomo Lemoli; Gaia Rubera
  16. Using generative adversarial networks to synthesize artificial financial datasets By Dmitry Efimov; Di Xu; Luyang Kong; Alexey Nefedov; Archana Anandakrishnan
  17. Efficient Policy Learning from Surrogate-Loss Classification Reductions By Andrew Bennett; Nathan Kallus
  18. Matching State Business Registration Records to Census Business Data By J. Daniel Kim; Kristin McCue
  19. Econometrics at scale: Spark up big data in economics By Bluhm, Benjamin; Cutura, Jannic
  20. How Do Member Countries Receive IMF Policy Advice: Results from a State-of-the-art Sentiment Index By Ghada Fayad; Chengyu Huang; Yoko Shibuya; Peng Zhao
  21. A Novel Approach to the Automatic Designation of Predefined Census Enumeration Areas and Population Sampling Frames : A Case Study in Somalia By Qader,Sarchil; Lefebvre,Veronique; Ninneman,Amy; Himelein,Kristen; Pape,Utz Johann; Bengtsson,Linus; Tatem,Andy; Bird,Tomas

  1. By: Thomas Philippon
    Abstract: The cost of financial intermediation has declined in recent years thanks to technology and increased competition in some parts of the finance industry. I document this fact and I analyze two features of new financial technologies that have stirred controversy: returns to scale and the use of big data and machine learning. I argue that the nature of fixed versus variable costs in robo-advising is likely to democratize access to financial services. Big data is likely to reduce the impact of negative prejudice in the credit market but it could reduce the effectiveness of existing policies aimed at protecting minorities.
    Keywords: fintech, discrimination, robo advising, credit scoring, big data, machine learning
    JEL: E2 G2 N2
    Date: 2020–02
  2. By: Idris Kharroubi (LPSM UMR 8001 - Laboratoire de Probabilités, Statistique et Modélisation - UPMC - Université Pierre et Marie Curie - Paris 6 - UPD7 - Université Paris Diderot - Paris 7 - CNRS - Centre National de la Recherche Scientifique); Thomas Lim (LaMME - Laboratoire de Mathématiques et Modélisation d'Evry - INRA - Institut National de la Recherche Agronomique - UEVE - Université d'Évry-Val-d'Essonne - ENSIIE - CNRS - Centre National de la Recherche Scientifique, ENSIIE - Ecole Nationale Supérieure d'Informatique pour l'Industrie et l'Entreprise); Xavier Warin (EDF - EDF)
    Abstract: We study the approximation of backward stochastic differential equations (BSDEs for short) with a constraint on the gains process. We first discretize the constraint by applying a so-called facelift operator at times of a grid. We show that this discretely constrained BSDE converges to the continuously constrained one as the mesh grid converges to zero. We then focus on the approximation of the discretely constrained BSDE. For that we adopt a machine learning approach. We show that the facelift can be approximated by an optimization problem over a class of neural networks under constraints on the neural network and its derivative. We then derive an algorithm converging to the discretely constrained BSDE as the number of neurons goes to infinity. We end by numerical experiments. Mathematics Subject Classification (2010): 65C30, 65M75, 60H35, 93E20, 49L25.
    Keywords: Constrained BSDEs, discrete-time approximation, neural network approximation, facelift transformation
    Date: 2020–02–05
  3. By: J. Christina Wang; Charles B. Perkins
    Abstract: FinTech online lending to consumers has grown rapidly in the post-crisis era. As argued by its advocates, one key advantage of FinTech lending is that lenders can predict loan outcomes more accurately by employing complex analytical tools, such as machine learning (ML) methods. This study applies ML methods, in particular random forests and stochastic gradient boosting, to loan-level data from the largest FinTech lender of personal loans to assess the extent to which those methods can produce more accurate out-of-sample predictions of default on future loans relative to standard regression models. To explain loan outcomes, this analysis accounts for the economic conditions faced by a borrower after origination, which are typically absent from other ML studies of default. For the given data, the ML methods indeed improve prediction accuracy, but more so over the near horizon than beyond a year. This study then shows that having more data up to, but not beyond, a certain quantity enhances the predictive accuracy of the ML methods relative to that of parametric models. The likely explanation is that there has been data or model drift over time, so that methods that fit more complex models with more data can in fact suffer greater out-of-sample misses. Prediction accuracy rises, but only marginally, with additional standard credit variables beyond the core set, suggesting that unconventional data need to be sufficiently informative as a whole to help consumers with little or no credit history. This study further explores whether the greater functional flexibility of ML methods yields unequal benefit to consumers with different attributes or who reside in locales with varying economic conditions. It finds that the ML methods produce more favorable ratings for different groups of consumers, although those already deemed less risky seem to benefit more on balance.
    Keywords: FinTech/marketplace lending; supervised machine learning; default prediction
    JEL: C52 C53 C55 G23
    Date: 2019–10–14
  4. By: Zheng, Hannan; Schwenkler, Gustavo
    Abstract: We show that the news is a rich source of data on distressed firm links that drive firm-level and aggregate risks. The news tends to report about links in which a less popular firm is distressed and may contaminate a more popular firm. This constitutes a contagion channel that yields predictable returns and downgrades. Shocks to the degree of news-implied firm connectivity predict increases in aggregate volatilities, credit spreads, and default rates, and declines in output. To obtain our results, we propose a machine learning methodology that takes text data as input and outputs a data-implied firm network.
    Keywords: contagion, machine learning, natural language processing, networks, predictability, risk measurement
    JEL: E32 E44 L11 G10 C82
    Date: 2020–02
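The core step of turning news text into a firm network can be illustrated with a minimal co-mention sketch. This is not the paper's methodology (which is a full machine learning pipeline over news articles); the headlines and firm names below are hypothetical, and the "network" is just a count of joint mentions per headline:

```python
from collections import Counter
from itertools import combinations

def comention_network(headlines, firms):
    """Undirected firm network weighted by how often two firms
    are mentioned in the same headline."""
    edges = Counter()
    for text in headlines:
        lowered = text.lower()
        mentioned = [f for f in firms if f.lower() in lowered]
        for a, b in combinations(sorted(set(mentioned)), 2):
            edges[(a, b)] += 1
    return dict(edges)

# Hypothetical headlines and firm list, for illustration only.
headlines = [
    "Acme sues Globex over supply contract",
    "Globex shares fall after Acme dispute deepens",
    "Initech posts record quarterly profit",
]
firms = ["Acme", "Globex", "Initech"]
print(comention_network(headlines, firms))  # → {('Acme', 'Globex'): 2}
```

Edge weights from such a sketch could then feed standard network statistics (degree, connectivity), which is the kind of aggregate the abstract links to volatility and credit spreads.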
  5. By: Brieland, Stephanie; Töpfer, Marina
    Abstract: This paper analyses gender differences in pay at the mean as well as along the wage distribution. Using data from the German Socio-Economic Panel, we estimate the adjusted gender pay gap applying a machine learning method (post-double-LASSO procedure). Comparing results from this method to conventional models in the literature, we find that the size of the adjusted pay gap differs substantially depending on the approach used. The main reason is that the machine learning approach selects numerous interactions and second-order polynomials as well as different sets of covariates at various points of the wage distribution. This insight suggests that more flexible specifications are needed to estimate gender differences in pay more appropriately. We further show that estimates of all models are robust to remaining selection on unobservables.
    Keywords: Gender pay gap, Machine Learning, Selection on unobservables
    JEL: J7 J16 J31
    Date: 2020
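The post-double-LASSO idea can be sketched on simulated data: one lasso selects controls that predict the wage, a second selects controls that predict the gender indicator, and the gap is then estimated by OLS of the wage on gender plus the union of selected controls. This is a minimal sketch with made-up data and a made-up true effect of -0.1, not the authors' specification:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, p = 2000, 30

# Simulated data (illustrative only): X are controls, d is a gender
# indicator correlated with X[:, 0], y is a log wage.
X = rng.normal(size=(n, p))
d = (X[:, 0] + rng.normal(size=n) > 0).astype(float)
y = -0.1 * d + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)

# Step 1: lasso of the outcome on the controls.
sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
# Step 2: lasso of the treatment (gender) on the controls.
sel_d = np.flatnonzero(LassoCV(cv=5).fit(X, d).coef_)
# Step 3: OLS of the outcome on gender plus the union of selections.
union = sorted(set(sel_y) | set(sel_d))
Z = np.column_stack([d, X[:, union]])
gap = LinearRegression().fit(Z, y).coef_[0]
print(round(gap, 2))
```

Because X[:, 0] drives both the wage and the gender indicator, dropping either lasso step would bias the estimate; the union of selections is what makes the OLS estimate land near the simulated -0.1.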
  6. By: Francesco Decarolis (Bocconi University); Cristina Giorgiantonio (Bank of Italy)
    Abstract: This paper contributes to the analysis of quantitative indicators (i.e., red flags or screens) to detect corruption in public procurement. Expanding the set of commonly discussed indicators in the literature to new ones derived from the operating practices of police forces and the judiciary, this paper verifies the presence of these red flags in a sample of Italian awarding procedures for roadwork contracts in the period 2009-2015. Then, it validates the efficacy of the indicators through measures of direct corruption risks (judiciary cases and police investigations for corruption-related crimes) and indirect corruption risks (delays and cost overruns). From a policy perspective, our analysis shows that the most effective red flags in detecting corruption risks are those related to discretionary mechanisms for selecting private contractors (such as the most economically advantageous offer or negotiated procedures), compliance with the minimum time limit for the submission of tenders and subcontracting. Moreover, our analysis suggests that greater standardization in the call for tender documents can contribute to reducing corruption risks. From a methodological point of view, the paper highlights the relevance of prediction approaches based on machine learning methods (especially the random forests algorithm) for validating a large set of indicators.
    Keywords: public procurement, corruption, red flags
    JEL: D44 D47 H57 R42
    Date: 2020–02
  7. By: Ahmet Murat Ozbayoglu; Mehmet Ugur Gudelek; Omer Berat Sezer
    Abstract: Computational intelligence in finance has been a very popular topic in both academia and the financial industry over the last few decades, and numerous studies have been published resulting in various models. Meanwhile, within the Machine Learning (ML) field, Deep Learning (DL) has recently attracted considerable attention, mostly due to its outperformance over classical models. Many different implementations of DL exist today, and broad interest is continuing. Finance is one particular area where DL models have started gaining traction; however, the playing field is wide open, and many research opportunities still exist. In this paper, we try to provide a state-of-the-art snapshot of the DL models developed for financial applications as of today. We not only categorize the works according to their intended subfield in finance but also analyze them based on their DL models. In addition, we aim to identify possible future implementations and highlight the pathway for ongoing research within the field.
    Date: 2020–02
  8. By: Cockx, Bart (Ghent University); Lechner, Michael (University of St. Gallen); Bollens, Joost (VDAB, Belgium)
    Abstract: We investigate heterogenous employment effects of Flemish training programmes. Based on administrative individual data, we analyse programme effects at various aggregation levels using Modified Causal Forests (MCF), a causal machine learning estimator for multiple programmes. While all programmes have positive effects after the lock-in period, we find substantial heterogeneity across programmes and types of unemployed. Simulations show that assigning unemployed to programmes that maximise individual gains as identified in our estimation can considerably improve effectiveness. Simplified rules, such as one giving priority to unemployed with low employability, mostly recent migrants, lead to about half of the gains obtained by more sophisticated rules.
    Keywords: policy evaluation, active labour market policy, causal machine learning, modified causal forest, conditional average treatment effects
    JEL: J68
    Date: 2019–12
  9. By: Grammig, Joachim; Hanenberg, Constantin; Schlag, Christian; Sönksen, Jantje
    Abstract: We assess financial theory-based and machine learning-implied measurements of stock risk premia by comparing the quality of their return forecasts. In the low signal-to-noise environment of a one-month horizon, we find that it is preferable to rely on a theory-based approach instead of engaging in the computer-intensive hyper-parameter tuning of statistical models. The theory-based approach also delivers a solid performance at the one-year horizon, at which only one machine learning methodology (random forest) performs substantially better. We also consider ways to combine the opposing modeling philosophies, and identify the use of random forests to account for the approximation residuals of the theory-based approach as a promising hybrid strategy. It combines the advantages of the two diverging paths in the finance world.
    Keywords: stock risk premia, return forecasts, machine learning, theory-based return prediction
    JEL: C53 C58 G12 G17
    Date: 2020
  10. By: Sen, Sugata; Sengupta, Soumya
    Abstract: The UNDP Rio+20 summit in 2012 developed a set of indicators to realise the targets of the SDGs within a deadline. Performance under these goals has been measured following the methodology developed by the UNDP, which is simply the unweighted average of indicator performances across different domains. This work concludes that this methodology for measuring goal-wise as well as composite performance suffers from major shortcomings, and it proposes an alternative using ideas from artificial intelligence. Here it is accepted that the indicators under different goals are inter-related, so constructing an index through a simple average is misleading. Moreover, the methodologies behind the existing indices fail to assign weights to different indicators. This work is based on secondary data, and the goal-wise indices are determined through normalised sigmoid functions. These goal-wise indices are plotted on a radar chart, and the area of the radar polygon is treated as the measure of composite SDG performance. The whole framework is presented through an artificial neural network. The goal-wise index developed and tested here shows that the UNDP as well as the NITI Aayog index delivers exaggerated values of goal-wise and composite performance.
    Keywords: SDG Index, Sigmoidal Activation Function, Artificial Neural Network
    JEL: C63 O15
    Date: 2020–02–06
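A minimal sketch of the proposed construction, assuming a logistic (sigmoid) normalization of goal scores and equal angular spacing of goals on the radar chart; the midpoint and scale parameters and the goal scores below are illustrative choices, not the authors' calibration:

```python
import numpy as np

def sigmoid_index(raw, midpoint=50.0, scale=10.0):
    """Map a raw goal score to (0, 1) with a logistic curve.
    midpoint and scale are illustrative calibration choices."""
    return 1.0 / (1.0 + np.exp(-(raw - midpoint) / scale))

def radar_area(indices):
    """Area of the polygon traced by goal-wise indices placed on
    equally spaced spokes of a radar chart: the sum of the k
    triangles formed by adjacent spokes."""
    k = len(indices)
    theta = 2 * np.pi / k
    r = np.asarray(indices)
    return 0.5 * np.sin(theta) * np.sum(r * np.roll(r, -1))

goal_scores = np.array([40.0, 55.0, 70.0, 62.0])  # hypothetical scores
idx = sigmoid_index(goal_scores)
print(round(radar_area(idx), 3))
```

Unlike a simple average, the polygon area is a product-like aggregate: a single very weak goal pulls down both triangles it borders, which is one way such an index can penalize uneven performance.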
  11. By: Momin M. Malik
    Abstract: "All models are wrong, but some are useful", wrote George E. P. Box (1979). Machine learning has focused on the usefulness of probability models for prediction in social systems, but is only now coming to grips with the ways in which these models are wrong---and the consequences of those shortcomings. This paper attempts a comprehensive, structured overview of the specific conceptual, procedural, and statistical limitations of models in machine learning when applied to society. Machine learning modelers themselves can use the described hierarchy to identify possible failure points and think through how to address them, and consumers of machine learning models can know what to question when confronted with the decision about if, where, and how to apply machine learning. The limitations go from commitments inherent in quantification itself, through to showing how unmodeled dependencies can lead to cross-validation being overly optimistic as a way of assessing model performance.
    Date: 2020–02
  12. By: Mounu Prem (School of Economics, Universidad del Rosario); Santiago Saavedra (School of Economics, Universidad del Rosario); Juan F. Vargas (School of Economics, Universidad del Rosario)
    Abstract: Armed conflict can endanger natural resources through several channels such as direct predation from fighting groups, but it may also help preserve ecosystems by dissuading extractive economic activities through the fear of extortion. The effect of conflict on deforestation is thus an empirical question. This paper studies the effect on forest cover of Colombia’s recent peace negotiation between the central government and the FARC insurgency. Using yearly deforestation data from satellite images and a difference-in-differences identification strategy, we show that areas controlled by FARC prior to the declaration of a permanent ceasefire that ultimately led to a peace agreement experienced a differential increase in deforestation after the start of the ceasefire. The deforestation effect of peace is attenuated in municipalities with higher state capacity, and is exacerbated by land intensive economic activities. Our results highlight the importance of complementing peacemaking milestones with state building efforts to avoid environmental damage.
    Keywords: Deforestation, Conflict, Peace building, Colombia
    JEL: D74 Q34
    Date: 2019–01
  13. By: Rising Odegua
    Abstract: Loan default prediction is one of the most important and critical problems faced by banks and other financial institutions, as it has a huge effect on profit. Although many traditional methods exist for mining information about a loan application, most of these methods seem to be under-performing, as there have been reported increases in the number of bad loans. In this paper, we use an extreme gradient boosting algorithm called XGBoost for loan default prediction. The prediction is based on loan data from a leading bank, taking into consideration data sets from both the loan application and the demographics of the applicant. We also present important evaluation metrics of the analysis, such as accuracy, recall, precision, F1-score and area under the ROC curve. This paper provides an effective basis for loan credit approval, identifying risky customers among a large number of loan applications using predictive modeling.
    Date: 2020–01
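A hedged sketch of this kind of default-prediction pipeline, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost and reporting the metrics the abstract names. All feature names, coefficients and the synthetic loan data below are invented for illustration; this is not the paper's data or model:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 4000
# Synthetic application and demographic features (illustrative only).
income = rng.normal(50, 15, n)
debt_ratio = rng.uniform(0, 1, n)
age = rng.uniform(21, 70, n)
logit = -1 + 4 * debt_ratio - 0.05 * income
default = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([income, debt_ratio, age])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, default, test_size=0.25, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]  # default probabilities
print("accuracy ", round(accuracy_score(y_te, pred), 3))
print("recall   ", round(recall_score(y_te, pred), 3))
print("precision", round(precision_score(y_te, pred, zero_division=0), 3))
print("F1       ", round(f1_score(y_te, pred, zero_division=0), 3))
print("ROC AUC  ", round(roc_auc_score(y_te, proba), 3))
```

Because defaults are the minority class, ROC AUC on the predicted probabilities is usually the more informative headline number than raw accuracy.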
  14. By: Yunan Ye; Hengzhi Pei; Boxin Wang; Pin-Yu Chen; Yada Zhu; Jun Xiao; Bo Li
    Abstract: Portfolio management (PM) is a fundamental financial planning task that aims to achieve investment goals such as maximal profits or minimal risks. Its decision process involves continuous derivation of valuable information from various data sources and sequential decision optimization, which is a prospective research direction for reinforcement learning (RL). In this paper, we propose SARL, a novel State-Augmented RL framework for PM. Our framework aims to address two unique challenges in financial PM: (1) data heterogeneity -- the collected information for each asset is usually diverse, noisy and imbalanced (e.g., news articles); and (2) environment uncertainty -- the financial market is versatile and non-stationary. To incorporate heterogeneous data and enhance robustness against environment uncertainty, our SARL augments the asset information with their price movement prediction as additional states, where the prediction can be solely based on financial data (e.g., asset prices) or derived from alternative sources such as news. Experiments on two real-world datasets, (i) Bitcoin market and (ii) HighTech stock market with 7-year Reuters news articles, validate the effectiveness of SARL over existing PM approaches, both in terms of accumulated profits and risk-adjusted profits. Moreover, extensive simulations are conducted to demonstrate the importance of our proposed state augmentation, providing new insights and boosting performance significantly over standard RL-based PM method and other baselines.
    Date: 2020–02
  15. By: Francesco Giavazzi; Felix Iglhaut; Giacomo Lemoli; Gaia Rubera
    Abstract: We study the role of perceived threats from cultural diversity induced by terrorist attacks and a salient criminal event on public discourse and voters’ support for far-right parties. We first develop a rule which allocates Twitter users in Germany to electoral districts and then use a machine learning method to compute measures of textual similarity between the tweets they produce and tweets by accounts of the main German parties. Using the dates of the aforementioned exogenous events we estimate constituency-level shifts in similarity to party language. We find that following these events Twitter text becomes on average more similar to that of the main far-right party, AfD, while the opposite happens for some of the other parties. Regressing estimated shifts in similarity on changes in vote shares between federal elections we find a significant association. Our results point to the role of perceived threats on the success of nationalist parties.
    Date: 2020
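The textual-similarity step can be sketched with a plain TF-IDF and cosine-similarity computation over word counts. The paper uses a machine learning method on real German-language tweets and party accounts; the party text and tweets below are hypothetical English stand-ins:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Term-frequency / inverse-document-frequency vectors for a corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = []
    for doc in tokenized:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical party text and constituency tweets, for illustration only.
party_text = "secure the borders protect national culture"
tweets = ["we must secure the borders now",
          "lovely weather for the match today"]
vecs = tfidf_vectors([party_text] + tweets)
sims = [cosine(vecs[0], v) for v in vecs[1:]]
print([round(s, 2) for s in sims])
```

The first tweet shares substantive vocabulary with the party text and scores higher; averaging such similarities over a constituency's tweets, before and after an event, gives the kind of shift the paper regresses on vote-share changes.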
  16. By: Dmitry Efimov; Di Xu; Luyang Kong; Alexey Nefedov; Archana Anandakrishnan
    Abstract: Generative Adversarial Networks (GANs) have become very popular for generating realistic-looking images. In this paper, we propose to use GANs to synthesize artificial financial data for research and benchmarking purposes. We test this approach on three American Express datasets, and show that properly trained GANs can replicate these datasets with high fidelity. For our experiments, we define a novel type of GAN, and suggest methods for data preprocessing that allow good training and testing performance of GANs. We also discuss methods for evaluating the quality of the generated data and comparing it with the original real data.
    Date: 2020–02
  17. By: Andrew Bennett; Nathan Kallus
    Abstract: Recent work on policy learning from observational data has highlighted the importance of efficient policy evaluation and has proposed reductions to weighted (cost-sensitive) classification. But, efficient policy evaluation need not yield efficient estimation of policy parameters. We consider the estimation problem given by a weighted surrogate-loss classification reduction of policy learning with any score function, either direct, inverse-propensity weighted, or doubly robust. We show that, under a correct specification assumption, the weighted classification formulation need not be efficient for policy parameters. We draw a contrast to actual (possibly weighted) binary classification, where correct specification implies a parametric model, while for policy learning it only implies a semiparametric model. In light of this, we instead propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters. We propose a particular method based on recent developments on solving moment problems using neural networks and demonstrate the efficiency and regret benefits of this method empirically.
    Date: 2020–02
  18. By: J. Daniel Kim; Kristin McCue
    Abstract: We describe our methodology and results from matching state Business Registration Records (BRR) to Census business data. We use data from Massachusetts and California to develop methods and preliminary results that could be used to guide matching data for additional states. We obtain matches to Census business records for 45% of the Massachusetts BRR records and 40% of the California BRR records. We find higher match rates for incorporated businesses and businesses with higher startup-quality scores as assigned in Guzman and Stern (2018). Clerical reviews show that using relatively strict matching on address is important for match accuracy, while results are less sensitive to name matching strictness. Among matched BRR records, the modal timing of the first match to the BR is in the year in which the BRR record was filed. We use two sets of software to identify matches: SAS DQ Match and a machine-learning algorithm described in Cuffe and Goldschlag (2018). We find preliminary evidence that while the ML-based method yields more match results, SAS DQ tends to result in higher accuracy rates. To conclude, we provide suggestions on how to proceed with matching other states’ data in light of our findings using these two states.
    Date: 2020–01
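A toy version of the matching logic can be sketched with the standard library: strict address matching (which the clerical reviews found important for accuracy) combined with a fuzzy name score whose threshold can be looser. The threshold, normalization rules and records below are illustrative, not those of SAS DQ Match or the Cuffe and Goldschlag (2018) algorithm:

```python
from difflib import SequenceMatcher

def normalize(name):
    """Crude name normalization before matching (illustrative only)."""
    drop = {"inc", "llc", "corp", "co", "the"}
    tokens = [t.strip(".,") for t in name.lower().split()]
    return " ".join(t for t in tokens if t not in drop)

def best_match(record, candidates, name_floor=0.85):
    """Return the candidate with the most similar normalized name,
    requiring an exact address match and a minimum name score."""
    best, best_score = None, 0.0
    for cand in candidates:
        if record["address"] != cand["address"]:
            continue  # strict address matching
        score = SequenceMatcher(None, normalize(record["name"]),
                                normalize(cand["name"])).ratio()
        if score >= name_floor and score > best_score:
            best, best_score = cand, score
    return best

# Hypothetical BRR record and Census candidates, for illustration only.
brr = {"name": "Acme Widgets, Inc.", "address": "12 Main St"}
census = [{"name": "Acme Widgets LLC", "address": "12 Main St"},
          {"name": "Acme Widgets", "address": "99 Oak Ave"}]
print(best_match(brr, census)["name"])  # → Acme Widgets LLC
```

Note how the second candidate, despite an identical normalized name, is rejected on address alone; that asymmetry mirrors the paper's finding that results are less sensitive to name-matching strictness.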
  19. By: Bluhm, Benjamin; Cutura, Jannic
    Abstract: This paper provides an overview of how to use "big data" for economic research. We investigate the performance and ease of use of different Spark applications running on a distributed file system, enabling the handling and analysis of data sets that were previously not usable due to their size. More specifically, we explain how to use Spark to (i) explore big data sets which exceed a retail-grade computer's memory and (ii) run typical econometric tasks, including microeconometric, panel-data and time-series regression models, which are prohibitively expensive to evaluate on stand-alone machines. By bridging the gap between the abstract concept of Spark and ready-to-use examples that can easily be adapted to the researcher's needs, we provide economists, and social scientists more generally, with the theory and practice to handle the ever-growing datasets available. The ease of reproducing the examples in this paper makes this guide a useful reference for researchers with a limited background in data handling and distributed computing.
    Keywords: Econometrics, Distributed Computing, Apache Spark
    JEL: C53 C55
    Date: 2020
  20. By: Ghada Fayad; Chengyu Huang; Yoko Shibuya; Peng Zhao
    Abstract: This paper applies state-of-the-art deep learning techniques to develop the first sentiment index measuring member countries’ reception of IMF policy advice at the time of Article IV Consultations. This paper finds that while authorities of member countries largely agree with Fund advice, there is variation across country size, external openness, policy sectors and their assessed riskiness, political systems, and commodity export intensity. The paper also looks at how sentiment changes during and after a financial arrangement or program with the Fund, as well as when a country receives IMF technical assistance. The results shed light on key aspects of Fund surveillance while redefining how the IMF can view its relevance, value added, and traction with its member countries.
    Keywords: Fiscal sector; External sector; Economic conditions; Real sector; Commodity price indexes; IMF Surveillance; Economic Policy; Sentiment Analysis; Natural Language Processing; Article IV consultation
    Date: 2020–01–17
  21. By: Qader,Sarchil; Lefebvre,Veronique; Ninneman,Amy; Himelein,Kristen; Pape,Utz Johann; Bengtsson,Linus; Tatem,Andy; Bird,Tomas
    Abstract: Enumeration areas are the operational geographic units for the collection, dissemination, and analysis of census data and are often used as a national sampling frame for various types of surveys. Traditionally, enumeration areas are created by manually digitizing small geographic units on high-resolution satellite imagery or physically walking the boundaries of units, both of which are highly time, cost, and labor intensive. In addition, creating enumeration areas requires considering the size of the population and area within each unit. This is an optimization problem that can best be solved by a computer. This paper, for the first time, produces an automatic designation of predefined census enumeration areas based on high-resolution gridded population and settlement data sets and using publicly available natural and administrative boundaries. This automated approach is compared with manually digitized enumeration areas that were created in urban areas in Mogadishu and Hargeisa for the United Nations Population Estimation Survey for Somalia in 2014. The automatically generated enumeration areas are consistent with standard enumeration areas, including boundaries that field teams can identify on the ground and sizes and populations appropriate for coverage by an enumerator. Furthermore, the automated urban enumeration areas have no gaps. The paper extends this work to rural Somalia, for which no records exist of previous enumeration area demarcations. This work shows the time, labor, and cost-saving value of automated enumeration area delineation and points to the potential for broadly available tools that are suitable for low-income and data-poor settings but applicable to potentially wider contexts.
    Keywords: Inequality, Armed Conflict, ICT Applications, Employment and Unemployment
    Date: 2019–08–08
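A much cruder version of the underlying optimization can be sketched as a greedy pass over a gridded-population raster, closing an enumeration area whenever its accumulated population reaches a target. The target and the synthetic grid below are illustrative; the paper's method additionally respects natural and administrative boundaries, which this sketch ignores:

```python
import numpy as np

def delineate_eas(pop_grid, target=600):
    """Greedy sketch: visit gridded-population cells in serpentine
    order (so consecutive cells are always grid-adjacent) and close
    an enumeration area once its population reaches the target."""
    rows, cols = pop_grid.shape
    labels = np.zeros((rows, cols), dtype=int)
    ea, running = 1, 0.0
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        for c in cs:
            labels[r, c] = ea
            running += pop_grid[r, c]
            if running >= target:  # close this EA, start the next
                ea, running = ea + 1, 0.0
    return labels

rng = np.random.default_rng(1)
pop = rng.integers(20, 120, size=(8, 8)).astype(float)  # persons per cell
labels = delineate_eas(pop, target=600)
sizes = [float(pop[labels == k].sum()) for k in np.unique(labels)]
print(len(sizes), [int(s) for s in sizes])
```

Every closed area lands between the target and the target plus one cell's population, and the serpentine order keeps each area contiguous and gap-free, the two properties the abstract emphasizes for usable enumeration areas.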

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.