nep-big New Economics Papers
on Big Data
Issue of 2023‒01‒30
twenty papers chosen by
Tom Coupé
University of Canterbury

  1. Stock Market Prediction via Deep Learning Techniques: A Survey By Jinan Zou; Qingying Zhao; Yang Jiao; Haiyao Cao; Yanxi Liu; Qingsen Yan; Ehsan Abbasnejad; Lingqiao Liu; Javen Qinfeng Shi
  2. Application of Machine Learning Algorithms on Satellite Imagery for Road Quality Monitoring: An Alternative Approach to Road Quality Surveys By Thegeya, Aaron; Mitterling, Thomas; Martinez Jr., Arturo; Bulan, Joseph; Durante, Ron Lester; Mag-atas, Jayzon
  3. Smarter than humans? Validating how OpenAI's ChatGPT model explains crowdfunding, alternative finance and community finance By Wenzlaff, Karsten; Spaeth, Sebastian
  4. Corrupted by Algorithms? How AI-generated and Human-written Advice Shape (Dis)honesty By Margarita Leib; Nils Köbis; Rainer Michael Rilke; Marloes Hagens; Bernd Irlenbusch
  5. Measuring an artificial intelligence agent's trust in humans using machine incentives By Tim Johnson; Nick Obradovich
  6. Returns to Solar Panels in the Housing Market: A Meta Learner Approach By Elias Asproudis; Cigdem Gedikli; Oleksandr Talavera; Okan Yilmaz
  7. fintech-kMC: Agent based simulations of financial platforms for design and testing of machine learning systems By Isaac Tamblyn; Tengkai Yu; Ian Benlolo
  8. Greenhouse gases emissions: estimating corporate non-reported emissions using interpretable machine learning By Jeremi Assael; Thibaut Heurtebize; Laurent Carlier; François Soupé
  9. Bayesian Modeling of Time-Varying Parameters Using Regression Trees By Niko Hauzenberger; Florian Huber; Gary Koop; James Mitchell
  10. Model-based recursive partitioning to estimate unfair health inequalities in the United Kingdom Household Longitudinal Study By Brunori, Paolo; Davillas, Apostolos; Jones, Andrew M.; Scarchilli, Giovanna
  11. Who is mobilized to vote by short text messages? Evidence from a nationwide field experiment with young voters. By Salomo Hirvonen; Maarit Lassander; Lauri Sääksvuori; Janne Tukiainen
  12. Mean-field neural networks-based algorithms for McKean-Vlasov control problems By Huyên Pham; Xavier Warin
  13. Risk sharing with deep neural networks By Matteo Burzoni; Alessandro Doldi; Enea Monzio Compagnoni
  14. Vax Populi: The Social Costs of Online Vaccine Skepticism By Matilde Giaccherini; Joanna Kopinska
  15. Foundations for nighttime lights data analysis By Ayush Patnaik; Ajay Shah; Susan Thomas
  16. Refining Public Policies with Machine Learning: The Case of Tax Auditing By Marco Battaglini; Luigi Guiso; Chiara Lacava; Douglas L. Miller; Eleonora Patacchini
  17. Deep Learning on time-variant Product Space By Arnault Pachot
  18. Customer satisfaction and natural language processing By Yolande Piris; Anne-Cécile Gay
  19. A bankruptcy probability model for assessing credit risk on corporate loans with automated variable selection By Ida Nervik Hjelseth; Arvid Raknerud; Bjørn H. Vatne
  20. A Gender-Sensitive Earthquake Recovery Assessment Using Administrative and Satellite Data: The Case of Indonesia’s 2016 Aceh Earthquake By Akter, Sonia; Fauzia, Talitha; Pundit, Madhavi; Schroder, Marcel

  1. By: Jinan Zou; Qingying Zhao; Yang Jiao; Haiyao Cao; Yanxi Liu; Qingsen Yan; Ehsan Abbasnejad; Lingqiao Liu; Javen Qinfeng Shi
    Abstract: Stock market prediction has been a traditional yet complex problem researched within diverse research areas and application domains due to its non-linear, highly volatile and complex nature. Existing surveys on stock market prediction often focus on traditional machine learning methods instead of deep learning methods. Deep learning has dominated many domains and has gained much success and popularity in stock market prediction in recent years. This motivates us to provide a structured and comprehensive overview of the research on stock market prediction focusing on deep learning techniques. We present four elaborated subtasks of stock market prediction and propose a novel taxonomy to summarize the state-of-the-art models based on deep neural networks from 2011 to 2022. In addition, we provide detailed statistics on the datasets and evaluation metrics commonly used in the stock market. Finally, we highlight some open issues and point out several future directions by sharing some new perspectives on stock market prediction.
    Date: 2022–12
  2. By: Thegeya, Aaron (World Data Lab); Mitterling, Thomas (World Data Lab); Martinez Jr., Arturo (Asian Development Bank); Bulan, Joseph (Asian Development Bank); Durante, Ron Lester (Asian Development Bank); Mag-atas, Jayzon (Asian Development Bank)
    Abstract: Roads are vital to support the transportation of people, goods, and services, among others. To yield their optimal socioeconomic impact, proper maintenance of existing roads is required; however, this is typically underfunded. Since detecting road quality is both labor and capital intensive, information on it is usually scarce, especially in resource-constrained countries. Accordingly, the study examines the feasibility of using satellite imagery and artificial intelligence to develop an efficient and cost-effective way to determine and predict the condition of roads. With this goal, a preliminary algorithm was created and validated using medium-resolution satellite imagery and existing road roughness data from the Philippines. After analysis, it was determined that the algorithm had an accuracy rate up to 75% and can be used for the preliminary identification of poor to bad roads. This provides an alternative for compiling road quality data, especially for areas where conventional methods can be difficult to implement. Nonetheless, additional technical enhancements need to be explored to further increase the algorithm’s prediction accuracy and enhance its robustness.
    Keywords: road quality; road maintenance; Sustainable Development Goals; remote sensing; deep learning
    JEL: O18 R42
    Date: 2022–12–22
  3. By: Wenzlaff, Karsten; Spaeth, Sebastian
    Abstract: The ChatGPT model of OpenAI allows users to ask questions, which are answered through an artificial intelligence trained through supervised, reinforced machine-learning. The answers depend on the input which the algorithm receives from the users, as well as from the content it has been given. The paper explores how answers to definitions about crowdfunding, alternative finance and community finance deviate or correspond to answers given by real human-beings in academic scholarship. Crowdfunding, alternative finance and community finance are chosen because academic literature does not provide consistent definitions on each of these terms, but some definitions are accepted by more scholars. By addressing the research gap concerning the accuracy of answers generated by an artificial intelligence, the paper contributes to the growing literature of implications of textual artificial intelligence on academia.
    Keywords: Crowdfunding, Alternative Finance, Community Finance, Machine Learning, Artificial Intelligence
    Date: 2022
  4. By: Margarita Leib; Nils Köbis; Rainer Michael Rilke; Marloes Hagens; Bernd Irlenbusch
    Abstract: Artificial Intelligence (AI) increasingly becomes an indispensable advisor. New ethical concerns arise if AI persuades people to behave dishonestly. In an experiment, we study how AI advice (generated by a Natural-Language-Processing algorithm) affects (dis)honesty, compare it to equivalent human advice, and test whether transparency about advice source matters. We find that dishonesty-promoting advice increases dishonesty, whereas honesty-promoting advice does not increase honesty. This is the case for both AI- and human advice. Algorithmic transparency, a commonly proposed policy to mitigate AI risks, does not affect behaviour. The findings mark the first steps towards managing AI advice responsibly.
    Date: 2023–01
  5. By: Tim Johnson; Nick Obradovich
    Abstract: Scientists and philosophers have debated whether humans can trust advanced artificial intelligence (AI) agents to respect humanity's best interests. Yet what about the reverse? Will advanced AI agents trust humans? Gauging an AI agent's trust in humans is challenging because--absent costs for dishonesty--such agents might respond falsely about their trust in humans. Here we present a method for incentivizing machine decisions without altering an AI agent's underlying algorithms or goal orientation. In two separate experiments, we then employ this method in hundreds of trust games between an AI agent (a Large Language Model (LLM) from OpenAI) and a human experimenter (author TJ). In our first experiment, we find that the AI agent decides to trust humans at higher rates when facing actual incentives than when making hypothetical decisions. Our second experiment replicates and extends these findings by automating game play and by homogenizing question wording. We again observe higher rates of trust when the AI agent faces real incentives. Across both experiments, the AI agent's trust decisions appear unrelated to the magnitude of stakes. Furthermore, to address the possibility that the AI agent's trust decisions reflect a preference for uncertainty, the experiments include two conditions that present the AI agent with a non-social decision task that provides the opportunity to choose a certain or uncertain option; in those conditions, the AI agent consistently chooses the certain option. Our experiments suggest that one of the most advanced AI language models to date alters its social behavior in response to incentives and displays behavior consistent with trust toward a human interlocutor when incentivized.
    Date: 2022–12
  6. By: Elias Asproudis (Swansea University); Cigdem Gedikli (Swansea University); Oleksandr Talavera (University of Birmingham); Okan Yilmaz (Swansea University)
    Abstract: This paper aims to estimate the returns to solar panels in the UK residential housing market. Our analysis applies a causal machine learning approach to Zoopla property data containing about 5 million observations. Drawing on meta-learner algorithms, we provide strong evidence that solar panels are directly capitalized into sale prices. Our results point to a selling price premium above 6 percent (ranging from 6.2 percent to 6.9 percent depending on the meta-learner) associated with solar panels. Considering that the average selling price is 230,536 GBP in our sample, this corresponds to an additional 14,293 GBP to 15,906 GBP selling price premium for houses with solar panels. Our results are robust to traditional hedonic pricing models and matching techniques.
    Keywords: solar panels; residential housing market; sale prices; machine-learning; meta-learners
    JEL: R21 R31 Q42 Q5
    Date: 2023–01
  7. By: Isaac Tamblyn; Tengkai Yu; Ian Benlolo
    Abstract: We discuss our simulation tool, fintech-kMC, which is designed to generate synthetic data for machine learning model development and testing. fintech-kMC is an agent-based model driven by a kinetic Monte Carlo (a.k.a. continuous time Monte Carlo) engine which simulates the behaviour of customers using an online digital financial platform. The tool provides an interpretable, reproducible, and realistic way of generating synthetic data which can be used to validate and test AI/ML models and pipelines to be used in real-world customer-facing financial applications.
    Date: 2023–01
  8. By: Jeremi Assael (BNPP CIB GM Lab, MICS); Thibaut Heurtebize (BNPP CIB GM Lab); Laurent Carlier (BNPP CIB GM Lab); François Soupé
    Abstract: As of 2022, greenhouse gases (GHG) emissions reporting and auditing are not yet compulsory for all companies and methodologies of measurement and estimation are not unified. We propose a machine learning-based model to estimate scope 1 and scope 2 GHG emissions of companies not reporting them yet. Our model, specifically designed to be transparent and completely adapted to this use case, is able to estimate emissions for a large universe of companies. It shows good out-of-sample global performances as well as good out-of-sample granular performances when evaluating it by sectors, by countries or by revenues buckets. We also compare our results to those of other providers and find our estimates to be more accurate. Thanks to the proposed explainability tools using Shapley values, our model is fully interpretable, the user being able to understand which factors split explain the GHG emissions for each particular company.
    Date: 2022–12
  9. By: Niko Hauzenberger; Florian Huber; Gary Koop; James Mitchell
    Abstract: In light of widespread evidence of parameter instability in macroeconomic models, many time-varying parameter (TVP) models have been proposed. This paper proposes a nonparametric TVP-VAR model using Bayesian additive regression trees (BART). The novelty of this model stems from the fact that the law of motion driving the parameters is treated nonparametrically. This leads to great flexibility in the nature and extent of parameter change, both in the conditional mean and in the conditional variance. In contrast to other nonparametric and machine learning methods that are black box, inference using our model is straightforward because, in treating the parameters rather than the variables nonparametrically, the model remains conditionally linear in the mean. Parsimony is achieved through adopting nonparametric factor structures and use of shrinkage priors. In an application to US macroeconomic data, we illustrate the use of our model in tracking both the evolving nature of the Phillips curve and how the effects of business cycle shocks on inflationary measures vary nonlinearly with movements in uncertainty.
    Keywords: Bayesian Vector Autoregression; Time-varying Parameters; Nonparametric Modeling; Machine Learning; Regression Trees; Phillips Curve; Business Cycle Shocks
    JEL: C11 C32 C51 E32
    Date: 2023–01–11
  10. By: Brunori, Paolo; Davillas, Apostolos; Jones, Andrew M.; Scarchilli, Giovanna
    Abstract: We measure unfair health inequality in the UK using a novel data-driven empirical approach. We explain health variability as the result of circumstances beyond individual control and health-related behaviours. We do this using model-based recursive partitioning, a supervised machine learning algorithm. Unlike usual tree-based algorithms, model-based recursive partitioning not only identifies social groups with different expected levels of health but also unveils the heterogeneity of the relationship linking behaviours and health outcomes across groups. The empirical application is conducted using the UK Household Longitudinal Study. We show that unfair inequality is a substantial fraction of the total explained health variability. This finding holds no matter which exact definition of fairness is adopted: using both the fairness gap and direct unfairness measures, each evaluated at different reference values for circumstances or effort.
    Keywords: health equity; inequality of opportunity; machine learning; unhealthy lifestyle behaviours
    Note: Understanding Society is an initiative funded by the Economic and Social Research Council and various Government Departments, with scientific leadership by the Institute for Social and Economic Research, University of Essex, and survey delivery by NatCen Social Research and Kantar Public.
    JEL: D63
    Date: 2022–12–01
  11. By: Salomo Hirvonen (Department of Economics, University of Turku); Maarit Lassander (Prime Minister's Office, Finland); Lauri Sääksvuori (Finnish Institute for Health and Welfare, Finland); Janne Tukiainen (Department of Economics, University of Turku)
    Abstract: We conduct a large-scale randomized controlled trial to evaluate the effectiveness of short text messages (SMS) as a tool to mobilize young voters, and thus, ameliorate the stubborn gap in political participation between younger and older citizens. We find that receiving an SMS reminder before the Finnish county elections in 2022 increases the probability of voting among 18-29 year-old voters by 0.9 percentage points. Moreover, we observe that the most simplified message is more effective than messages appealing to expressive or rational motivations to vote. Using comprehensive administrative data and data-driven machine learning methods, we also examine treatment effect heterogeneity and spillover effects. We document that SMS based mobilization of voters does not only reduce existing social inequalities in voting between the age cohorts but also among the young citizens. Moreover, we remarkably find that over 100 percent of the direct treatment effect spilled over to non-treated household members. Our results highlight the importance of understanding spillover effects and treatment effect heterogeneities in the evaluation of get-out-the-vote interventions.
    Keywords: Get-out-the-vote, Field experiments, Spillover effects, Voter turnout
    JEL: C93 D72
    Date: 2023–01
  12. By: Huyên Pham (UPD7, LPSM); Xavier Warin (EDF R&D, FiME Lab)
    Abstract: This paper is devoted to the numerical resolution of McKean-Vlasov control problems via the class of mean-field neural networks introduced in our companion paper [25] in order to learn the solution on the Wasserstein space. We propose several algorithms either based on dynamic programming with control learning by policy or value iteration, or backward SDE from stochastic maximum principle with global or local loss functions. Extensive numerical results on different examples are presented to illustrate the accuracy of each of our eight algorithms. We discuss and compare the pros and cons of all the tested methods.
    Date: 2022–12
  13. By: Matteo Burzoni; Alessandro Doldi; Enea Monzio Compagnoni
    Abstract: We consider the problem of optimally sharing a financial position among agents with potentially different reference risk measures. The problem is equivalent to computing the infimal convolution of the risk metrics and finding the so-called optimal allocations. We propose a neural network-based framework to solve the problem and we prove the convergence of the approximated inf-convolution, as well as the approximated optimal allocations, to the corresponding theoretical values. We support our findings with several numerical experiments.
    Date: 2022–12
  14. By: Matilde Giaccherini; Joanna Kopinska
    Abstract: We quantify the effects of online vaccine skepticism on vaccine uptake and health complications for individuals not targeted by immunization campaigns. We collect the universe of Italian vaccine-related tweets for 2013-2018, label anti-vax stances using NLP, and match them with vaccine coverage and vaccine-preventable hospitalizations at the most granular level (municipality and year). We propose a model of opinion dynamics on social networks that matches the observed data and shows that a vaccine mandate increases the average vaccination rate, but it also increases the controversialness around the topic, endogenously fueling polarization of opinions among users. We then leverage the intransitivity in network connections with “friends of friends” to isolate the exogenous source of variation for users’ vaccine-related stances and implement an IV strategy. We find that a 10pp increase in the municipality anti-vax stance causes a 0.43pp decrease in coverage of the Measles-Mumps-Rubella vaccine, 2.1 additional hospitalizations every 100k residents among individuals untargeted by the immunization (newborns, the immunosuppressed, pregnant women) and an excess expenditure of 7,311 euro, representing an 11% increase in health expenses.
    Keywords: social media, Twitter, vaccines, controversialness, polarization, text analysis
    JEL: I18 L82 Z18
    Date: 2022
  15. By: Ayush Patnaik (xKDR Forum); Ajay Shah (xKDR Forum); Susan Thomas (xKDR Forum)
    Abstract: Nighttime lights data captured from satellites has emerged as an important way to measure prosperity. Every researcher who uses the files released by NASA and NOAA faces the challenge of pre-processing them to address data quality issues. We present NighttimeLights.jl, a package written in Julia, which implements conventional and novel methods for cleaning the data. The package also serves as a platform for methodological research in remote sensing.
    JEL: Y90
    Date: 2022–12
  16. By: Marco Battaglini; Luigi Guiso; Chiara Lacava; Douglas L. Miller; Eleonora Patacchini
    Abstract: We study the extent to which ML techniques can be used to improve tax auditing efficiency using administrative data, without the need for randomized audits. Using Italy's population data on sole proprietorship tax returns, audits and their outcome, we develop a new approach to address the so-called selective labels problem: the fact that an ML algorithm must necessarily be trained on endogenously selected data. We document the existence of substantial margins for raising revenue from audits by improving the selection of taxpayers to audit with ML. Replacing the 10% least productive audits with an equal number of taxpayers selected by our trained algorithm raises detected tax evasion by as much as 38%, and evasion that is actually paid back by 29%.
    JEL: H2 H20 H26
    Date: 2022–12
  17. By: Arnault Pachot (IP - Institut Pascal - CNRS - Centre National de la Recherche Scientifique - UCA - Université Clermont Auvergne - INP Clermont Auvergne - Institut national polytechnique Clermont Auvergne - UCA - Université Clermont Auvergne)
    Date: 2022–11–05
  18. By: Yolande Piris (LEGO - Laboratoire d'Economie et de Gestion de l'Ouest - UBS - Université de Bretagne Sud - UBO - Université de Brest - IMT - Institut Mines-Télécom [Paris] - IBSHS - Institut Brestois des Sciences de l'Homme et de la Société - UBO - Université de Brest - UBL - Université Bretagne Loire - IMT Atlantique - IMT Atlantique - IMT - Institut Mines-Télécom [Paris]); Anne-Cécile Gay
    Date: 2021–01
  19. By: Ida Nervik Hjelseth; Arvid Raknerud; Bjørn H. Vatne
    Abstract: We propose an econometric model for predicting the share of bank debt held by bankrupt firms by combining a novel set of firm-level financial variables and macroeconomic indicators. Our firm-level data include payment remarks in the form of debt collections from private agencies and attachments from private and public agencies and cover all Norwegian limited liability companies for the period 2010–2021. We use logistic Lasso regressions to select bankruptcy predictors from a large set of potential predictors, comparing a highly sparse variable selection criterion (“the one standard error rule”) with the minimum cross validation error (CVE) criterion. Moreover, we examine the implications of using debt shares as weights in the estimation and find that weighting has a large impact on variable selection and predictions and, generally, leads to lower out-of-sample prediction errors than alternative approaches. Debt weighting combined with sparse variable selection gives the best predictions of the risk of bankruptcy in firms holding high shares of the bank debt.
    Keywords: Bankruptcy prediction, credit risk, corporate bank debt, Lasso, weighted logistic regression
    JEL: C25 C33 C53 G33 D22
    Date: 2022–06–20
  20. By: Akter, Sonia (Lee Kuan Yew School of Public Policy, National University of Singapore); Fauzia, Talitha (Lee Kuan Yew School of Public Policy, National University of Singapore); Pundit, Madhavi (Asian Development Bank); Schroder, Marcel (Asian Development Bank)
    Abstract: This study presents a gender-specific assessment of medium-term disaster recovery following a series of earthquakes in Indonesia’s Aceh Province on 7 December 2016. For this assessment, we combine the village-level nighttime radiance data obtained from the Visible Infrared Imaging Radiometer Suite instrument, distance from the earthquake epicenters collected from the United Nations Satellite Centre and the Village Potential Statistics (PODES) 2014 and 2018, administrative data collected by Indonesia’s Central Statistics Bureau. We develop a novel index to represent women’s welfare in the context of a disaster: the Women’s Welfare after Disasters Index (W2DI). The nighttime radiance scores are used as indicators of overall economic welfare, while the W2DI specifically represents women’s welfare. Using the difference-in-differences method, we compare the average monthly nighttime radiance and W2DI scores in earthquake-affected and unaffected villages of the Aceh Province before and after the 2016 earthquake series. Similar to studies using the nighttime radiance to monitor disaster recovery and relief, our findings reveal that, on average, the monthly nighttime radiance scores of the earthquake-affected villages 2 months after the earthquakes were brighter relative to the changes of the unaffected villages, implying an improvement in overall economic well-being of the earthquake-affected population. However, findings from the W2DI give us richer insights related to women’s welfare. While an important domain of women’s welfare (particularly, availability and access to the health infrastructure) improved significantly after the earthquake series, there was substantial deterioration in access to basic needs (e.g., water, fuel, sanitation). Such access plays an essential role in women’s well-being as it is directly linked to women’s role in the society. This study demonstrates that women in disaster-affected areas may experience a setback in some domains of their welfare in the medium term even when the economic welfare in the disaster-affected areas, in general, improved because of the gradual increase of human activities after reconstruction work occurred. The study also shows how a gender-specific disaster assessment tool can be developed and applied to monitor and assess disaster recovery for a subgroup of population and identify areas that require intervention.
    Keywords: disaster relief; recovery; climate injustice; socioeconomic vulnerability; gender
    JEL: I30 J16 Q54
    Date: 2022–12–22
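
As a quick arithmetic check on entry 6 (Returns to Solar Panels in the Housing Market), the sketch below recomputes the premium in pounds from the average selling price and the percentage range quoted in that abstract. The variable names are illustrative assumptions and are not taken from the paper; only the three input figures come from the abstract.

```python
# Figures quoted in the abstract of entry 6; variable names are hypothetical.
average_price_gbp = 230_536
premium_low, premium_high = 0.062, 0.069

# Premium in pounds implied by applying each percentage to the average price.
low_gbp = average_price_gbp * premium_low
high_gbp = average_price_gbp * premium_high

print(f"{low_gbp:,.0f} GBP to {high_gbp:,.0f} GBP")
```

The lower bound matches the abstract's 14,293 GBP. The upper bound rounds to about 15,907 GBP, so the abstract's 15,906 GBP is consistent with truncating to whole pounds rather than rounding.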

This nep-big issue is ©2023 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.