nep-big New Economics Papers
on Big Data
Issue of 2023‒02‒20
24 papers chosen by
Tom Coupé
University of Canterbury

  1. When are Google Data Useful to Nowcast GDP? An Approach via Preselection and Shrinkage By Laurent Ferrara; Anna Simoni
  2. Measuring Corporate Digital Divide with web scraping: Evidence from Italy By Mazzoni Leonardo; Pinelli Fabio; Riccaboni Massimo
  3. Adversarial AI in Insurance: Pervasiveness and Resilience By Elisa Luciano; Matteo Cattaneo; Ron Kenett
  4. Taste of home: Birth town bias in Geographical Indications By Resce, Giuliano; Vaquero-Piñeiro, Cristina
  5. Antitrust, Regulation, and User Union in the Era of Digital Platforms and Big Data By Lin William Cong; Simon Mayer
  6. A predictive model of sovereign investment grade using machine learning and natural language processing By María Victoria Landaberry; Kenji Nakasone; Johann Pérez; María del Pilar Posada
  7. Macroeconomic forecasting and sovereign risk assessment using deep learning techniques By Anastasios Petropoulos; Vassilis Siakoulis; Konstantinos P. Panousis; Loukas Papadoulas; Sotirios Chatzis
  8. An Optimal Control Strategy for Execution of Large Stock Orders Using LSTMs By A. Papanicolaou; H. Fu; P. Krishnamurthy; B. Healy; F. Khorrami
  9. Estimating the Impact of the Age of Criminal Majority: Decomposing Multiple Treatments in a Regression Discontinuity Framework By Michael Mueller-Smith; Benjamin Pyle; Caroline Walker
  10. Monotonicity for AI ethics and society: An empirical study of the monotonic neural additive model in criminology, education, health care, and finance By Dangxing Chen; Luyao Zhang
  11. The Potential Impact of Artificial Intelligence on Healthcare Spending By Nikhil Sahni; George Stein; Rodney Zemmel; David M. Cutler
  12. Leveraging Vision-Language Models for Granular Market Change Prediction By Christopher Wimmer; Navid Rekabsaz
  13. Sequential Graph Attention Learning for Predicting Dynamic Stock Trends (Student Abstract) By Tzu-Ya Lai; Wen Jung Cheng; Jun-En Ding
  14. The rise of China's technological power: the perspective from frontier technologies By Bergeaud, Antonin; Verluise, Cyril
  15. Eliminating Disparate Treatment in Modeling Default of Credit Card Clients By Tom, Daniel M. Ph.D.
  16. ddml: Double/debiased machine learning in Stata By Achim Ahrens; Christian B. Hansen; Mark E. Schaffer; Thomas Wiemann
  18. Personalized prognosis & treatment using Ledley-Jaynes machines: An example study on conversion from Mild Cognitive Impairment to Alzheimer's Disease By Porta Mana, PierGianLuca; Rye, Ingrid; Vik, Alexandra; Kociński, Marek; Lundervold, Astri Johansen; Lundervold, Arvid; Lundervold, Alexander Selvikvåg
  19. Using machine learning to measure financial risk in China By Al-Haschimi, Alexander; Apostolou, Apostolos; Azqueta-Gavaldon, Andres; Ricci, Martino
  20. Parameter Recovery Using Remotely Sensed Variables By Jonathan Proctor; Tamma Carleton; Sandy Sum
  21. AI Literacy - Towards Measuring Human Competency in Artificial Intelligence By Pinski, Marc; Benlian, Alexander
  22. Learning Production Process Heterogeneity Across Industries: Implications of Deep Learning for Corporate M&A Decisions By Jongsub Lee; Hayong Yun
  23. Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information— By Ola, Aranuwa Felix
  24. Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by non-human animals and humans By Ola, Aranuwa Felix

  1. By: Laurent Ferrara; Anna Simoni (CREST - Centre de Recherche en Économie et Statistique - ENSAI - Ecole Nationale de la Statistique et de l'Analyse de l'Information [Bruz] - X - École polytechnique - ENSAE Paris - École Nationale de la Statistique et de l'Administration Économique - CNRS - Centre National de la Recherche Scientifique, CNRS - Centre National de la Recherche Scientifique)
    Date: 2022–10–10
  2. By: Mazzoni Leonardo; Pinelli Fabio; Riccaboni Massimo
    Abstract: With the increasing pervasiveness of ICTs in the fabric of economic activities, the corporate digital divide has emerged as a new crucial topic to evaluate the IT competencies and the digital gap between firms and territories. Given the scarcity of available granular data to measure the phenomenon, most studies have used survey data. To bridge the empirical gap, we scrape the website homepage of 182 705 Italian firms, extracting ten features related to their digital footprint characteristics to develop a new corporate digital assessment index. Our results highlight a significant digital divide across dimensions, sectors and geographical locations of Italian firms, opening up new perspectives on monitoring and near-real-time data-driven analysis.
    Date: 2023–01
  3. By: Elisa Luciano; Matteo Cattaneo; Ron Kenett
    Abstract: The rapid and dynamic pace of Artificial Intelligence (AI) and Machine Learning (ML) is revolutionizing the insurance sector. AI offers significant and very welcome advantages to insurance companies, and is fundamental to their customer-centricity strategy. It also poses challenges in the project and implementation phases. Among those, we study Adversarial Attacks, which consist of the creation of modified input data to deceive an AI system and produce false outputs. We provide examples of attacks on insurance AI applications, categorize them, and discuss defence methods and precautionary systems, considering that they can involve few-shot and zero-shot multilabelling. A related topic, with growing interest, is the validation and verification of systems incorporating AI and ML components. These topics are discussed in various sections of this paper.
    Date: 2023–01
  4. By: Resce, Giuliano; Vaquero-Piñeiro, Cristina
    Abstract: We investigate the role of local favoritism in the Geographical Indications (GIs) quality scheme, one of the main pillars of agri-food policy in the EU. Taking advantage of a rich and unique municipalities' geo-referenced database over the 2000-2020 period, we evaluate whether the birthplaces of Regional council members are favored in the acknowledgment of GIs in Italy. To address the potential confounding effects and selection biases, we combine a Difference in Difference strategy with machine learning methods for counterfactual analysis. Results reveal that councilors' birth municipalities are more likely to obtain their products certified as GIs. The birth town bias is more substantial in areas where the level of institutional quality is lower, there is higher corruption, and lower government efficiency, suggesting that the mediation of politicians is determinant where the formal standardized procedures are muddled.
    Keywords: Political Economy; Geographical Indications; Political representation; Electoral success; Local Development.
    JEL: D72 L66 Q18 R11
    Date: 2023–02–07
  5. By: Lin William Cong; Simon Mayer
    Abstract: We model platform competition with endogenous data generation, collection, and sharing, thereby providing a unifying framework to evaluate data-related regulation and antitrust policies. Data are jointly produced from users' economic activities and platforms' investments in data infrastructure. Data improves service quality, causing a feedback loop that tends to concentrate market power. Dispersed users do not internalize the impact of their data contribution on (i) service quality for other users, (ii) market concentration, and (iii) platforms’ incentives to invest in data infrastructure, causing inefficient over- or under-collection of data. Data sharing proposals, user privacy protections, platform commitments, and markets for data cannot fully address these inefficiencies. We introduce and analyze user union, which represents and coordinates users, as a potential option for antitrust and consumer protection in the digital era.
    JEL: L10 L41 L50 O30
    Date: 2023–01
  6. By: María Victoria Landaberry (Banco Central del Uruguay); Kenji Nakasone (UTEC - Universidad Tecnológica); Johann Pérez (UTEC - Universidad Tecnológica); María del Pilar Posada (Banco Central del Uruguay)
    Abstract: Credit rating agencies such as Moody's, Standard and Poor's, and Fitch rate sovereign assets based on a mathematical analysis of economic, social, and political factors together with a qualitative analysis based on expert judgment. According to the rating obtained, countries can be classified as investment grade or speculative grade. Having investment grade matters because it reduces the cost of financing and expands the set of potential investors in an economy. In this paper we set out to predict whether a country's sovereign debt will be rated investment grade, using a set of macroeconomic variables and variables obtained from text analysis of Fitch reports between 2000 and 2018 with natural language processing techniques. We use a logistic regression and a set of alternative machine learning algorithms. According to our results, the uncertainty index, constructed from the Fitch reports, is statistically significant in predicting investment grade. Comparing the machine learning algorithms, random forest has the best out-of-sample predictive power when the dependent variable refers to the same year as the explanatory variables, while k-nearest neighbors has the best predictive performance, in terms of F1-score and recall, when the independent variables refer to the previous year.
    Keywords: Sovereign risk, rating agencies, macroeconomic variables, text analysis, natural language processing, machine learning
    JEL: E22 E66 G24
    Date: 2022
  7. By: Anastasios Petropoulos; Vassilis Siakoulis; Konstantinos P. Panousis; Loukas Papadoulas; Sotirios Chatzis
    Abstract: In this study, we propose a novel approach to nowcasting and forecasting the macroeconomic status of a country using deep learning techniques. We focus on the US economy, but the methodology can also be applied to other economies. Specifically, the US economy suffered a severe recession from 2008 to 2010, which in practice breaks conventional econometric modelling attempts. Deep learning has the advantage that it models all macro variables simultaneously, taking into account all interdependencies among them and detecting non-linear patterns that cannot easily be addressed under a univariate modelling framework. Our empirical results indicate that the deep learning methods have superior out-of-sample performance compared with traditional econometric techniques such as Bayesian Model Averaging (BMA). Our results therefore point to a more robust method for assessing sovereign risk, which is a crucial component in investment and monetary decisions.
    Date: 2023–01
  8. By: A. Papanicolaou; H. Fu; P. Krishnamurthy; B. Healy; F. Khorrami
    Abstract: In this paper, we simulate the execution of a large stock order with real data and general power law in the Almgren and Chriss model. The example that we consider is the liquidation of a large position executed over the course of a single trading day in a limit order book. Transaction costs are incurred because large orders walk the order book, that is, they consume order-book liquidity beyond the best bid/ask. We model these transaction costs with a power law that is inversely proportional to trading volume. We obtain a policy approximation by training a long short term memory (LSTM) neural network to minimize transaction costs accumulated when execution is carried out as a sequence of smaller sub orders. Using historical S&P100 price and volume data, we evaluate our LSTM strategy relative to strategies based on time-weighted average price (TWAP) and volume-weighted average price (VWAP). For execution of a single stock, the input to the LSTM includes the entire cross section of data on all 100 stocks, including prices, volume, TWAPs and VWAPs. By using the entire data cross section, the LSTM should be able to exploit any inter-stock co-dependence in volume and price movements, thereby reducing overall transaction costs. Our tests on the S&P100 data demonstrate that in fact this is so, as our LSTM strategy consistently outperforms TWAP and VWAP-based strategies.
    Date: 2023–01
  9. By: Michael Mueller-Smith; Benjamin Pyle; Caroline Walker
    Abstract: This paper studies the impact of adult prosecution on recidivism and employment trajectories for adolescent, first-time felony defendants. We use extensive linked Criminal Justice Administrative Record System and socio-economic data from Wayne County, Michigan (Detroit). Using the discrete age of majority rule and a regression discontinuity design, we find that adult prosecution reduces future criminal charges over 5 years by 0.48 felony cases (↓ 20%) while also worsening labor market outcomes: 0.76 fewer employers (↓ 19%) and $674 fewer earnings (↓ 21%) per year. We develop a novel econometric framework that combines standard regression discontinuity methods with predictive machine learning models to identify mechanism-specific treatment effects that underpin the overall impact of adult prosecution. We leverage these estimates to consider four policy counterfactuals: (1) raising the age of majority, (2) increasing adult dismissals to match the juvenile disposition rates, (3) eliminating adult incarceration, and (4) expanding juvenile record sealing opportunities to teenage adult defendants. All four scenarios generate positive returns for government budgets. When accounting for impacts to defendants as well as victim costs borne by society stemming from increases in recidivism, we find positive social returns for juvenile record sealing expansions and dismissing marginal adult charges; raising the age of majority breaks even. Eliminating prison for first-time adult felony defendants, however, increases net social costs. Policymakers may still find this attractive if they are willing to value beneficiaries (taxpayers and defendants) slightly higher (124%) than potential victims.
    Keywords: juvenile and criminal justice, regression discontinuity, machine learning, recidivism, employment
    JEL: C36 C45 K14 K42 J24
    Date: 2023–01
  10. By: Dangxing Chen; Luyao Zhang
    Abstract: Algorithm fairness in the application of artificial intelligence (AI) is essential for a better society. As the foundational axiom of social mechanisms, fairness consists of multiple facets. Although the machine learning (ML) community has focused on intersectionality as a matter of statistical parity, especially in discrimination issues, an emerging body of literature addresses another facet -- monotonicity. Based on domain expertise, monotonicity plays a vital role in numerous fairness-related areas, where violations could misguide human decisions and lead to disastrous consequences. In this paper, we first systematically evaluate the significance of applying monotonic neural additive models (MNAMs), which use a fairness-aware ML algorithm to enforce both individual and pairwise monotonicity principles, for the fairness of AI ethics and society. We have found, through a hybrid method of theoretical reasoning, simulation, and extensive empirical analysis, that considering monotonicity axioms is essential in all areas of fairness, including criminology, education, health care, and finance. Our research contributes to the interdisciplinary research at the interface of AI ethics, explainable AI (XAI), and human-computer interactions (HCIs). By evidencing the catastrophic consequences if monotonicity is not met, we address the significance of monotonicity requirements in AI applications. Furthermore, we demonstrate that MNAMs are an effective fairness-aware ML approach by imposing monotonicity restrictions integrating human intelligence.
    Date: 2023–01
  11. By: Nikhil Sahni; George Stein; Rodney Zemmel; David M. Cutler
    Abstract: The potential of artificial intelligence (AI) to simplify existing healthcare processes and create new, more efficient ones is a major topic of discussion in the industry. Yet healthcare lags other industries in AI adoption. In this paper, we estimate that wider adoption of AI could lead to savings of 5 to 10 percent in US healthcare spending—roughly $200 billion to $360 billion annually in 2019 dollars. These estimates are based on specific AI-enabled use cases that employ today’s technologies, are attainable within the next five years, and would not sacrifice quality or access. These opportunities could also lead to non-financial benefits such as improved healthcare quality, increased access, better patient experience, and greater clinician satisfaction. We further present case studies and discuss how to overcome the challenges to AI deployments. We conclude with a review of recent market trends that may shift the AI adoption trajectory toward a more rapid pace.
    JEL: I10 L2 M15
    Date: 2023–01
  12. By: Christopher Wimmer; Navid Rekabsaz
    Abstract: Predicting future direction of stock markets using the historical data has been a fundamental component in financial forecasting. This historical data contains the information of a stock in each specific time span, such as the opening, closing, lowest, and highest price. Leveraging this data, the future direction of the market is commonly predicted using various time-series models such as Long-Short Term Memory networks. This work proposes modeling and predicting market movements with a fundamentally new approach, namely by utilizing image and byte-based number representation of the stock data processed with the recently introduced Vision-Language models. We conduct a large set of experiments on the hourly stock data of the German share index and evaluate various architectures on stock price prediction using historical stock data. We conduct a comprehensive evaluation of the results with various metrics to accurately depict the actual performance of various approaches. Our evaluation results show that our novel approach based on representation of stock data as text (bytes) and image significantly outperforms strong deep learning-based baselines.
    Date: 2023–01
  13. By: Tzu-Ya Lai; Wen Jung Cheng; Jun-En Ding
    Abstract: The stock market is characterized by a complex relationship between companies and the market. This study combines a sequential graph structure with attention mechanisms to learn global and local information within temporal time. Specifically, our proposed "GAT-AGNN" module compares model performance across multiple industries as well as within single industries. The results show that the proposed framework outperforms the state-of-the-art methods in predicting stock trends across multiple industries on Taiwan Stock datasets.
    Date: 2023–01
  14. By: Bergeaud, Antonin; Verluise, Cyril
    Abstract: We use patent data to study the contribution of the US, Europe, China and Japan to frontier technology using automated patent landscaping. We find that China's contribution to frontier technology has become quantitatively similar to the US in the late 2010s while overcoming the European and Japanese contributions respectively. Although China still exhibits the stigmas of a catching up economy, these stigmas are on the downside. The quality of frontier technology patents published at the Chinese Patent Office has leveled up to the quality of patents published at the European and Japanese patent offices. At the same time, frontier technology patenting at the Chinese Patent Office seems to have been increasingly supported by domestic patentees, suggesting the build up of domestic capabilities.
    Keywords: frontier technologies; China; patent landscaping; machine learning; patents
    JEL: O30 O31 O32
    Date: 2022–10–14
  15. By: Tom, Daniel M. Ph.D.
    Abstract: A recent online search for model performance for benchmarking purposes reveals evidence of disparate treatment on a prohibitive basis in ML models appearing in the search result. Using our logistic regression with AI approach, we are able to build a superior credit model without any prohibitive and other demographic characteristics (gender, age, marital status, level of education) from the default of credit card clients dataset in the UCI Machine Learning Repository. We compare our AI flashlight beam search result to exhaustive search approach in the space of all possible models, and the AI search finds the highest separation/highest likelihood models efficiently after evaluating a small number of model candidates.
    Date: 2023–01–17
  16. By: Achim Ahrens; Christian B. Hansen; Mark E. Schaffer; Thomas Wiemann
    Abstract: We introduce the package ddml for Double/Debiased Machine Learning (DDML) in Stata. Estimators of causal parameters for five different econometric models are supported, allowing for flexible estimation of causal effects of endogenous variables in settings with unknown functional forms and/or many exogenous variables. ddml is compatible with many existing supervised machine learning programs in Stata. We recommend using DDML in combination with stacking estimation which combines multiple machine learners into a final predictor. We provide Monte Carlo evidence to support our recommendation.
    Date: 2023–01
  17. By: Bilgin, Rumeysa (Istanbul Sabahattin Zaim University)
    Abstract: The previous literature on capital structure has produced plenty of potential determinants of leverage over the last decades. However, their research models usually cover only a restricted number of explanatory variables, and many suffer from omitted variable bias. This study contributes to the literature by advocating a sound approach to selecting the control variables for empirical capital structure studies. We applied two linear LASSO inference approaches and the double machine learning (DML) framework to the LASSO, random forest, decision tree, and gradient boosting learners to evaluate the marginal contributions of three proposed determinants; cash holdings, non-debt tax shield, and current ratio. While some studies did not use these variables in their models, others obtained contradictory results. Our findings have revealed that cash holdings, current ratio, and non-debt tax shield are crucial factors that substantially affect the leverage decisions of firms and should be controlled in empirical capital structure studies.
    Date: 2023–01–23
  18. By: Porta Mana, PierGianLuca (HVL Western Norway University of Applied Sciences); Rye, Ingrid; Vik, Alexandra; Kociński, Marek; Lundervold, Astri Johansen; Lundervold, Arvid (University of Bergen); Lundervold, Alexander Selvikvåg (Western Norway University of Applied Sciences)
    Abstract: The present work presents a statistically sound, rigorous, and model-free algorithm – the Ledley-Jaynes machine – for use in personalized medicine. The Ledley-Jaynes machine is designed first to learn from a dataset of clinical data with relevant predictors and predictands, and then to assist a clinician in the assessment of prognosis & treatment for new patients. It allows the clinician to input, for each new patient, additional patient-dependent clinical information, as well as patient-dependent information about benefits and drawbacks of available treatments. We apply the algorithm in a realistic setting for clinical decision-making, incorporating clinical, environmental, imaging, and genetic data, using a data set of subjects suffering from mild cognitive impairment and Alzheimer’s Disease. We show how the algorithm is theoretically optimal, and discuss some of its major advantages for decision-making under risk, resource planning, imputation of missing values, assessing the prognostic importance of each predictor, and more.
    Date: 2023–01–26
  19. By: Al-Haschimi, Alexander; Apostolou, Apostolos; Azqueta-Gavaldon, Andres; Ricci, Martino
    Abstract: We develop a measure of overall financial risk in China by applying machine learning techniques to textual data. A pre-defined set of relevant newspaper articles is first selected using a specific constellation of risk-related keywords. Then, we employ topical modelling based on an unsupervised machine learning algorithm to decompose financial risk into its thematic drivers. The resulting aggregated indicator can identify major episodes of overall heightened financial risks in China, which cannot be consistently captured using financial data. Finally, a structural VAR framework is employed to show that shocks to the financial risk measure have a significant impact on macroeconomic and financial variables in China and abroad.
    Keywords: China, financial risk, LDA, machine learning, textual analysis, topic modelling
    JEL: C32 C65 E32 F44 G15
    Date: 2023–01
  20. By: Jonathan Proctor; Tamma Carleton; Sandy Sum
    Abstract: Remotely sensed measurements and other machine learning predictions are increasingly used in place of direct observations in empirical analyses. Errors in such measures may bias parameter estimation, but it remains unclear how large such biases are or how to correct for them. We leverage a new benchmark dataset providing co-located ground truth observations and remotely sensed measurements for multiple variables across the contiguous U.S. to show that the common practice of using remotely sensed measurements without correction leads to biased parameter point estimates and standard errors across a diversity of empirical settings. More than three-quarters of the 95% confidence intervals we estimate using remotely sensed measurements do not contain the true coefficient of interest. These biases result from both classical measurement error and more structured measurement error, which we find is common in machine learning based remotely sensed measurements. We show that multiple imputation, a standard statistical imputation technique so far untested in this setting, effectively reduces bias and improves statistical coverage with only minor reductions in power in both simple linear regression and panel fixed effects frameworks. Our results demonstrate that multiple imputation is a generalizable and easily implementable method for correcting parameter estimates relying on remotely sensed variables.
    JEL: C18 C45 C80 Q0
    Date: 2023–01
  21. By: Pinski, Marc; Benlian, Alexander
    Date: 2023
  22. By: Jongsub Lee; Hayong Yun
    Abstract: Using deep learning techniques, we introduce a novel measure for production process heterogeneity across industries. For each pair of industries during 1990-2021, we estimate the functional distance between two industries' production processes via deep neural network. Our estimates uncover the underlying factors and weights reflected in the multi-stage production decision tree in each industry. We find that the greater the functional distance between two industries' production processes, the lower are the number of M&As, deal completion rates, announcement returns, and post-M&A survival likelihood. Our results highlight the importance of structural heterogeneity in production technology to firms' business integration decisions.
    Date: 2023–01
  23. By: Ola, Aranuwa Felix
    Abstract: Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by non-human animals and humans. Example tasks in which this is done include speech recognition, computer vision, translation between (natural) languages, as well as other mappings of inputs. The Oxford English Dictionary of Oxford University Press defines artificial intelligence as:
    Date: 2023–01–09
  24. By: Ola, Aranuwa Felix
    Abstract: Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by non-human animals and humans
    Date: 2023–01–09

This nep-big issue is ©2023 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project is available from the NEP website. For comments, please write to the director of NEP, Marco Novarese. Put “NEP” in the subject line; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.