nep-big New Economics Papers
on Big Data
Issue of 2022‒03‒14
fifteen papers chosen by
Tom Coupé
University of Canterbury

  1. Using satellites and artificial intelligence to measure health and material-living standards in India By Adel Daoud; Felipe Jordan; Makkunda Sharma; Fredrik Johansson; Devdatt Dubhashi; Sourabh Paul; Subhashis Banerjee
  2. Comparative Study of Machine Learning Models for Stock Price Prediction By Ogulcan E. Orsel; Sasha S. Yamada
  3. Can a Machine Correct Option Pricing Models? By Caio Almeida; Jianqing Fan; Francesca Tang
  4. What Drives Financial Sector Development in Africa? Insights from Machine Learning By Isaac K. Ofori; Christopher Quaidoo; Pamela E. Ofori
  5. Dual-CLVSA: a Novel Deep Learning Approach to Predict Financial Markets with Sentiment Measurements By Jia Wang; Hongwei Zhu; Jiancheng Shen; Yu Cao; Benyuan Liu
  6. Who Increases Emergency Department Use? New Insights from the Oregon Health Insurance Experiment By Augustine Denteh; Helge Liebert
  7. What governs attitudes toward artificial intelligence adoption and governance? By O'Shaughnessy, Matthew; Schiff, Daniel; Varshney, Lav R.; Rozell, Christopher; Davenport, Mark
  8. Artificial Intelligence and Reduced SMEs' Business Risks. A Dynamic Capabilities Analysis during the COVID-19 Pandemic By Drydakis, Nick
  9. StonkBERT: Can Language Models Predict Medium-Run Stock Price Movements? By Stefan Pasch; Daniel Ehnes
  10. Breakthroughs, Backlashes and Artificial General Intelligence: An Extended Real Options Approach By Gries, Thomas; Naudé, Wim
  11. Dependence model assessment and selection with DecoupleNets By Marius Hofert; Avinash Prasad; Mu Zhu
  12. Platform-based business models and financial inclusion By Karen Croxson; Jon Frost; Leonardo Gambacorta; Tommaso Valletti
  13. Application of K-means Clustering Algorithm in Evaluation and Statistical Analysis of Internet Financial Transaction Data By Shi Bo
  14. The impact of growing season temperature on grape prices in Australia By German Puga; Kym Anderson; Firmin Doko Tchatoka
  15. Air pollution in an urban world: A global view on density, cities and emissions By David Castells-Quintana; Elisa Dienesch; Melanie Krause

  1. By: Adel Daoud; Felipe Jordan; Makkunda Sharma; Fredrik Johansson; Devdatt Dubhashi; Sourabh Paul; Subhashis Banerjee
    Abstract: The application of deep learning methods to survey human development in remote areas with satellite imagery at high temporal frequency can significantly enhance our understanding of spatial and temporal patterns in human development. Current applications have focused on predicting a narrow set of asset-based measurements of human well-being within a limited group of African countries. Here, we leverage georeferenced village-level census data from across 30 percent of the landmass of India to train a deep neural network that predicts 16 variables representing material conditions from annual composites of Landsat 7 imagery. The census-based model is used as a feature extractor to train another network that predicts an even larger set of developmental variables (over 90 variables) included in two rounds of the National Family Health Survey (NFHS). The census-based model outperforms the current standard in the literature, night-time-luminosity-based models, as a feature extractor for several of these variables. To extend the temporal scope of the models, we suggest a distribution-transformation procedure to estimate outcomes over time and space in India. Our procedure achieves R-squared values of 0.92 to 0.60 for 21 development outcomes, 0.59 to 0.30 for 25 outcomes, and 0.29 to 0.00 for 28 outcomes; a further 19 outcomes had negative R-squared. Overall, the results show that combining satellite data with Indian Census data unlocks rich information for training deep learning models that track human development at an unprecedented geographical and temporal definition.
    Date: 2021–12
  2. By: Ogulcan E. Orsel; Sasha S. Yamada
    Abstract: In this work, we apply machine learning techniques to historical stock prices to forecast future prices. To achieve this, we use recursive approaches that are appropriate for handling time series data. In particular, we apply a linear Kalman filter and different varieties of long short-term memory (LSTM) architectures to historical stock prices over a 10-year range (1/1/2011 - 1/1/2021). We quantify the results of these models by computing the error of the predicted values versus the historical values of each stock. We find that of the algorithms we investigated, a simple linear Kalman filter can predict the next-day value of stocks with low-volatility (e.g., Microsoft) surprisingly well. However, in the case of high-volatility stocks (e.g., Tesla) the more complex LSTM algorithms significantly outperform the Kalman filter. Our results show that we can classify different types of stocks and then train an LSTM for each stock type. This method could be used to automate portfolio generation for a target return rate.
    Date: 2022–01
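To make the "simple linear Kalman filter" in the abstract above concrete, here is a minimal sketch of one-step-ahead forecasting with a scalar local-level (random-walk) model. It is not the authors' implementation: the state model, the function name `kalman_one_step_forecasts`, and the noise variances `process_var` and `obs_var` are assumptions chosen for illustration, and the price series is synthetic.

```python
import numpy as np

def kalman_one_step_forecasts(prices, process_var=1e-2, obs_var=1.0):
    """Scalar local-level Kalman filter (assumed model, not the paper's).

    forecasts[t] is the prediction of prices[t] made from prices[:t].
    """
    prices = np.asarray(prices, dtype=float)
    level = prices[0]          # initial state estimate
    var = 1.0                  # initial state variance (assumed)
    forecasts = np.empty_like(prices)
    forecasts[0] = prices[0]   # nothing earlier to predict from
    for t in range(1, len(prices)):
        # Predict step: a random-walk state carries the level forward.
        var = var + process_var
        forecasts[t] = level
        # Update step: blend the forecast with the newly observed price.
        gain = var / (var + obs_var)
        level = level + gain * (prices[t] - level)
        var = (1.0 - gain) * var
    return forecasts

# Usage on a synthetic random-walk "price" series (illustrative data only).
rng = np.random.default_rng(0)
synthetic = 100.0 + np.cumsum(rng.normal(scale=0.5, size=250))
pred = kalman_one_step_forecasts(synthetic, process_var=0.25, obs_var=0.1)
rmse = np.sqrt(np.mean((pred[1:] - synthetic[1:]) ** 2))
print(f"one-step RMSE: {rmse:.3f}")
```

A larger ratio of `process_var` to `obs_var` makes the filter track the latest price more closely; a smaller ratio smooths more heavily.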
  3. By: Caio Almeida (Princeton University); Jianqing Fan (Princeton University); Francesca Tang (Princeton University)
    Abstract: We introduce a novel approach to capture implied volatility smiles. Given any parametric option pricing model used to fit a smile, we train a deep feedforward neural network on the model’s orthogonal residuals to correct for potential mispricings and boost performance. Using a large number of recent S&P500 options, we compare our hybrid machine-corrected model to several standalone parametric models ranging from ad-hoc corrections of Black-Scholes to more structural no-arbitrage stochastic volatility models. Empirical results based on out-of-sample fitting errors - in cross-sectional and time-series dimensions - consistently confirm that a machine can in fact correct existing models without overfitting. Moreover, we find that our two-step technique is relatively indiscriminate: regardless of the bias or structure of the original parametric model, our boosting approach is able to correct it to approximately the same degree. Hence, our methodology is adaptable and versatile in its application to a large range of parametric option pricing models. As an overarching theme, machine-corrected methods, guided by an implied volatility model as a template, outperform pure machine learning methods.
    Keywords: Deep Learning, Boosting, Implied Volatility, Stochastic Volatility, Model Correction
    JEL: E37
    Date: 2021–05
  4. By: Isaac K. Ofori (University of Insubria, Varese, Italy); Christopher Quaidoo (Legon, Accra, Ghana); Pamela E. Ofori (University of Insubria, Varese, Italy)
    Abstract: This study uses machine learning techniques to identify the key drivers of financial development in Africa. To this end, four regularization techniques (the Standard lasso, the Adaptive lasso, the minimum Schwarz Bayesian information criterion lasso, and the Elasticnet) are trained on a dataset containing 86 covariates of financial development for the period 1990–2019. The results show that variables such as cell phones, economic globalisation, institutional effectiveness, and literacy are crucial for financial sector development in Africa. Evidence from the partialing-out lasso instrumental variable regression reveals that while inflation and agricultural sector employment suppress financial sector development, cell phones and institutional effectiveness play a notable role in spurring it. Policy recommendations are provided in line with the rise in globalisation and technological progress in Africa.
    Keywords: Africa, Elasticnet, Financial Development, Financial Inclusion, Lasso, Regularization, Variable Selection
    JEL: C01 C14 C52 C53 C55 E5 O55
    Date: 2021–01
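The variable-selection idea behind the lasso techniques named in the abstract above can be sketched with a plain coordinate-descent solver. This is an illustration only, not the study's estimator: the function names, the penalty value `lam`, and the synthetic data are assumptions, and the adaptive, BIC-tuned and elastic-net variants are not shown.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by coordinate descent.

    Assumes no column of X is identically zero (ideally standardised).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that excludes j.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            # Keep the residual in sync with the updated coefficient.
            resid += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

# Synthetic example: only the first two of five covariates matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=200)
beta_demo = lasso_cd(X_demo, y_demo, lam=0.5)
print("selected covariates:", np.flatnonzero(beta_demo))
```

Covariates whose coefficients are shrunk exactly to zero are dropped, which is what makes the lasso usable as a selection device when the number of candidate drivers is large.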
  5. By: Jia Wang; Hongwei Zhu; Jiancheng Shen; Yu Cao; Benyuan Liu
    Abstract: Predicting financial markets is a challenging task. The complexity of this task is mainly due to the interaction between financial markets and market participants, who cannot remain rational all the time and are often affected by emotions such as fear and ecstasy. Building on a state-of-the-art approach for financial market prediction, a hybrid convolutional LSTM-based variational sequence-to-sequence model with attention (CLVSA), we propose a novel deep learning approach, named dual-CLVSA, to predict financial market movement with both trading data and the corresponding social sentiment measurements, each through a separate sequence-to-sequence channel. We evaluate the performance of our approach with backtesting on historical trading data of SPDR S&P 500 Trust ETF over eight years. The experiment results show that dual-CLVSA can effectively fuse the two types of data, and verify that sentiment measurements are not only informative for financial market predictions, but also contain extra profitable features to boost the performance of our predicting system.
    Date: 2022–01
  6. By: Augustine Denteh (Tulane University); Helge Liebert (University of Zurich)
    Abstract: We provide new insights into the finding that Medicaid increased emergency department (ED) use from the Oregon experiment. Using nonparametric causal machine learning methods, we find economically meaningful treatment effect heterogeneity in the impact of Medicaid coverage on ED use. The effect distribution is widely dispersed, with significant positive effects concentrated among high-use individuals. A small group (about 14% of participants) in the right tail with significant increases in ED use drives the overall effect. The remainder of the individualized treatment effects is either indistinguishable from zero or negative. The average treatment effect is not representative of the individualized treatment effect for most people. We identify four priority groups with large and statistically significant increases in ED use: men, prior SNAP participants, adults less than 50 years old, and those with pre-lottery ED use classified as primary care treatable. Our results point to an essential role of intensive margin effects: Medicaid increases utilization among those already accustomed to ED use and who use the emergency department for all types of care. We leverage the heterogeneous effects to estimate optimal assignment rules to prioritize insurance applications in similar expansions.
    Keywords: Medicaid, ED visit, effect heterogeneity, machine learning, efficient policy learning
    JEL: H75 I13 I38
    Date: 2022–01
  7. By: O'Shaughnessy, Matthew; Schiff, Daniel (Georgia Institute of Technology); Varshney, Lav R.; Rozell, Christopher; Davenport, Mark
    Abstract: Designing effective and inclusive governance and public communication strategies for artificial intelligence (AI) requires understanding how stakeholders reason about its use and governance. We examine underlying factors and mechanisms that drive attitudes toward the use and governance of AI across six policy-relevant applications using structural equation modeling and surveys of both U.S. adults (N=3524) and technology workers enrolled in an online computer science master’s degree program (N=425). We find that the cultural values of individualism, egalitarianism, general risk aversion, and techno-skepticism are important drivers of AI attitudes. Perceived benefit drives attitudes toward AI use, but not its governance. Experts hold more nuanced views than the public, and are more supportive of AI use but not its regulation. Drawing on these findings, we discuss challenges and opportunities for participatory AI governance, and we recommend that trustworthy AI governance be emphasized as strongly as trustworthy AI.
    Date: 2021–12–14
  8. By: Drydakis, Nick (Anglia Ruskin University)
    Abstract: The study utilises the International Labor Organization's SMEs COVID-19 pandemic business risks scale to determine whether Artificial Intelligence (AI) applications are associated with reduced business risks for SMEs. A new 10-item scale was developed to capture the use of AI applications in core services such as marketing and sales, pricing and cash flow. Data were collected from 317 SMEs between April and June 2020, with follow-up data gathered between October and December 2020 in London, England. AI applications to target consumers online, offer cash flow forecasting and facilitate HR activities are associated with reduced business risks caused by the COVID-19 pandemic for both small and medium enterprises. The study indicates that AI enables SMEs to boost their dynamic capabilities by leveraging technology to meet new types of demand, move at speed to pivot business operations, boost efficiency and thus, reduce their business risks.
    Keywords: SMEs, business risks, COVID-19, artificial intelligence, dynamic capabilities
    JEL: O33 Q55 L26
    Date: 2022–02
  9. By: Stefan Pasch; Daniel Ehnes
    Abstract: To answer this question, we fine-tune transformer-based language models, including BERT, on different sources of company-related text data for a classification task to predict the one-year stock price performance. We use three different types of text data: News articles, blogs, and annual reports. This allows us to analyze to what extent the performance of language models is dependent on the type of the underlying document. StonkBERT, our transformer-based stock performance classifier, shows substantial improvement in predictive accuracy compared to traditional language models. The highest performance was achieved with news articles as text source. Performance simulations indicate that these improvements in classification accuracy also translate into above-average stock market returns.
    Date: 2022–02
  10. By: Gries, Thomas (University of Paderborn); Naudé, Wim (University College Cork)
    Abstract: Breakthroughs and backlashes have marked progress in the development and diffusion of Artificial Intelligence (AI). These shocks make the investment in developing an Artificial General Intelligence (AGI) subject to considerable uncertainty. This paper applies a real options model, extended to account for stochastic jumps, to model the consequences of these breakthroughs and backlashes for investment in an AGI. The model analytics indicate that the average magnitude and frequency of stochastic jumps will determine the optimum amount of time and money to invest in pursuing an AGI and that these may be too expensive and time-consuming for most private entrepreneurs.
    Keywords: radical innovation, real option models, artificial intelligence, risk
    JEL: O31 O32 C61 C65
    Date: 2022–02
  11. By: Marius Hofert; Avinash Prasad; Mu Zhu
    Abstract: Neural networks are suggested for learning a map from $d$-dimensional samples with any underlying dependence structure to multivariate uniformity in $d'$ dimensions. This map, termed DecoupleNet, is used for dependence model assessment and selection. If the data-generating dependence model was known, and if it was among the few analytically tractable ones, one such transformation for $d'=d$ is Rosenblatt's transform. DecoupleNets only require an available sample and are applicable to $d'
    Date: 2022–02
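For the analytically tractable case mentioned in the abstract above, Rosenblatt's transform can be written out explicitly. The sketch below does so for a bivariate normal with standard normal margins and known correlation `rho`; that model choice, the function names, and the simulated sample are assumptions for illustration and are not taken from the paper. With $d'=d=2$ the map is

$$U_1 = \Phi(X_1), \qquad U_2 = \Phi\!\left(\frac{X_2 - \rho X_1}{\sqrt{1-\rho^2}}\right),$$

and $(U_1, U_2)$ is uniform on the unit square when the assumed model is correct.

```python
import math
import numpy as np

def std_normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rosenblatt_gaussian(x, rho):
    """Rosenblatt transform for an assumed standard bivariate normal.

    x has shape (n, 2); the result has shape (n, 2), with columns that are
    independent U(0, 1) if the rows of x follow the assumed model.
    """
    x = np.asarray(x, dtype=float)
    cdf = np.vectorize(std_normal_cdf)
    u1 = cdf(x[:, 0])
    # Conditional law of X2 given X1 = x1 is N(rho * x1, 1 - rho^2).
    z2 = (x[:, 1] - rho * x[:, 0]) / math.sqrt(1.0 - rho ** 2)
    u2 = cdf(z2)
    return np.column_stack([u1, u2])

# Simulated check: the dependence present in x should vanish in u.
rng = np.random.default_rng(0)
rho_demo = 0.8
cov = np.array([[1.0, rho_demo], [rho_demo, 1.0]])
x_demo = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=20000)
u_demo = rosenblatt_gaussian(x_demo, rho_demo)
print("correlation before:", np.corrcoef(x_demo.T)[0, 1])
print("correlation after: ", np.corrcoef(u_demo.T)[0, 1])
```

If the wrong `rho` (or the wrong model family) is supplied, the transformed sample departs from uniformity, which is the sense in which such a map can serve for model assessment.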
  12. By: Karen Croxson; Jon Frost; Leonardo Gambacorta; Tommaso Valletti
    Abstract: Three types of digital platforms are expanding in financial services: (i) fintech entrants; (ii) big tech firms; and (iii) increasingly, incumbent financial institutions with platform-based business models. These platforms can dramatically lower costs and thereby aid financial inclusion – but these same features can give rise to digital monopolies and oligopolies. Digital platforms operate in multi-sided markets, and rely crucially on big data. This leads to specific network effects, returns to scale and scope, and policy trade-offs. To reap the benefits of platforms while mitigating risks, policy makers can: (i) apply existing financial, antitrust and privacy regulations, (ii) adapt old and adopt new regulations, combining an activity and entity-based approach, and/or (iii) provide new public infrastructures. The latter include digital identity, retail fast payment systems and central bank digital currencies (CBDCs). These public infrastructures, as well as ex ante competition rules and data portability, are particularly promising. Yet to achieve their policy goals, central banks and financial regulators need to coordinate with competition and data protection authorities.
    Keywords: financial inclusion, fintech, big tech, platforms
    JEL: E51 G23 O31
    Date: 2021–12
  13. By: Shi Bo
    Abstract: The purpose of this paper is to promote the orderly development of China's Internet financial transactions and minimize default and delinquency in Internet financial transactions. Based on the typical big data algorithm (K-means algorithm), this paper discusses the concepts of the K-means algorithm and Internet financial transactions, as well as the significance of big data algorithms for Internet financial transaction data evaluation and statistical analysis. Meanwhile, the existing Internet financial transaction systems are reviewed, and their deficiencies are summarized, based on which relevant countermeasures and suggestions are put forward. At the same time, the K-means clustering algorithm is applied to evaluate financial transaction data, finding that it can improve the accuracy of data and reduce the error by 40%. When the number of clusters is 7, the output result distribution interval of the K-means clustering algorithm is 4 days, and when the number of clusters is 10, the output result distribution interval of the K-means clustering algorithm is 6 days, indicating that the convergence effect of this algorithm is relatively good. Additionally, many small and micro individuals still hold a negative attitude towards the innovation and adjustment of Internet financial transactions, indicating that the construction of China's Internet financial transaction system needs further optimization. The satisfaction of most small and micro individuals with innovation and adjustment also shows that the proposed Internet financial transaction adjustment measures are feasible, can provide references for relevant Internet financial transactions, and contribute to the development of Internet financial transactions in China.
    Date: 2022–01
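For readers unfamiliar with the K-means algorithm referred to in the abstract above, the following is a minimal sketch of the standard Lloyd iteration. It is not the paper's code: the function name `kmeans`, the random initialisation, and the toy two-feature data are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centre assignment and centre update."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialise centres at k distinct, randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centre.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre; keep the old one if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy data: two well-separated groups of hypothetical transaction features.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
centers_toy, labels_toy = kmeans(X_toy, k=2)
print(labels_toy)
```

The result depends on the initial centres and on the chosen number of clusters `k`, so in practice the algorithm is usually run from several starting points and for several values of `k`.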
  14. By: German Puga (Centre for Global Food and Resources, School of Economics and Public Policy, University of Adelaide, Australia, and Wine Economics Research Centre, School of Economics and Public Policy, University of Adelaide, Australia); Kym Anderson (Wine Economics Research Centre, School of Economics and Public Policy, University of Adelaide, Australia, and Arndt-Corden Dept of Economics, Australian National University, Canberra ACT 2601, Australia); Firmin Doko Tchatoka (School of Economics and Public Policy, University of Adelaide, Australia)
    Abstract: Cross-sectional models are useful for quantifying the impact that climate or climate change may have on grape prices due to changes in grape quality. However, these models are susceptible to omitted variable bias. The aim of this study is to estimate the impact of growing season temperature (GST) on grape prices using cross-sectional data for Australia, while controlling for growing season precipitation, regional yield, variety, and 103 other characteristics that relate to the production system of the wine regions. We estimate this model using (area) weighted least squares and variables from a principal component analysis (PCA) to control for the characteristics that relate to the production system. This estimation strategy allows us to decrease omitted variable bias while avoiding multicollinearity and over-controlling issues. We show that failing to control for characteristics that relate to the production system overestimates the impact of GST and hence, climate change. This finding is confirmed by a LASSO model that also incorporates variables from the PCA, which we estimate as a robustness check using a cross-fit partialing-out estimator (double machine learning).
    Keywords: omitted variable bias, climate impact, grape quality, grape price, climate change
    JEL: Q11 Q15 Q54
    Date: 2021–11
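The estimation strategy described in the abstract above (weighted least squares with principal components standing in for many correlated controls) can be sketched as follows. This is a schematic under stated assumptions, not the authors' specification: the function names, the number of retained components, and all data are synthetic and hypothetical.

```python
import numpy as np

def pca_scores(Z, n_components):
    """Scores of the leading principal components of Z (columns centred)."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:n_components].T

def weighted_least_squares(X, y, w):
    """Solve min_b sum_i w_i * (y_i - x_i @ b)^2 by rescaling rows with sqrt(w)."""
    sw = np.sqrt(np.asarray(w, dtype=float))
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

# Synthetic stand-ins: 'temp' for growing season temperature, 'controls' for
# many production-system characteristics, 'area' for the regression weights.
rng = np.random.default_rng(1)
n = 60
temp = rng.normal(size=n)
controls = rng.normal(size=(n, 20))
area = rng.uniform(1.0, 5.0, size=n)
log_price = 2.0 - 0.3 * temp + 0.5 * controls[:, 0] + 0.1 * rng.normal(size=n)

pcs = pca_scores(controls, n_components=3)
X_design = np.column_stack([np.ones(n), temp, pcs])
coef = weighted_least_squares(X_design, log_price, area)
print("estimated temperature coefficient:", coef[1])
```

Replacing the full set of controls with a few component scores keeps the design matrix small and well conditioned, which is the multicollinearity argument made in the abstract.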
  15. By: David Castells-Quintana (Universidad Autónoma de Barcelona, University of Barcelona, AQR-IREA); Elisa Dienesch (IEP Aix-en-Provence - Sciences Po Aix - Institut d'études politiques d'Aix-en-Provence, AMSE - Aix-Marseille Sciences Economiques - EHESS - École des hautes études en sciences sociales - AMU - Aix Marseille Université - ECM - École Centrale de Marseille - CNRS - Centre National de la Recherche Scientifique); Melanie Krause (University of Hamburg)
    Abstract: In this paper, we take a global view of air pollution, looking at cities and countries worldwide. We pay special attention to the spatial distribution of population and its relationship with the evolution of emissions. To do so, we build i) a unique and large dataset for more than 1200 (big) cities around the world, combining data on emissions of CO2 and PM2.5 with satellite data on built-up areas, population and light intensity at night at the grid-cell level for the last two decades, and ii) a large dataset for more than 190 countries with data from 1960 to 2010. At the city level, we find that denser cities show lower emissions per capita. We also find evidence for the importance of the spatial structure of the city, with polycentricity being associated with lower emissions in the largest urban areas, while monocentricity is more beneficial for smaller cities. In sum, our results suggest that the size and structure of urban areas matter when studying the density-emissions relationship. This is reinforced by results using our country-level data, where we find that higher density in urban areas is associated with lower emissions per capita. All our main findings are robust to several controls and different specifications and estimation techniques, as well as different identification strategies.
    Keywords: Density, Pollution, Cities, City structure, Development
    Date: 2021–11

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
For comments please write to the director of NEP, Marco Novarese. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.