nep-big New Economics Papers
on Big Data
Issue of 2022‒05‒09
seventeen papers chosen by
Tom Coupé
University of Canterbury

  1. Prediction of motor insurance claims occurrence as an imbalanced machine learning problem By Sebastian Baran; Przemysław Rola
  2. Electricity Price Forecasting: The Dawn of Machine Learning By Arkadiusz Jędrzejewski; Jesus Lago; Grzegorz Marcjasz; Rafał Weron
  3. Neural Network and Order Flow, Technical Analysis: Predicting short-term direction of futures contract By Yiyang Zheng
  4. Attention-based CNN-LSTM and XGBoost hybrid model for stock prediction By Zhuangwei Shi; Yang Hu; Guangliang Mo; Jian Wu
  5. Dual humanness and trust in conversational AI : A person-centered approach By Peng Hu; Yaobin Lu; Yeming Gong
  6. Who Increases Emergency Department Use? New Insights from the Oregon Health Insurance Experiment By Denteh, Augustine; Liebert, Helge
  7. Exploring Artificial Intelligence as a General Purpose Technology with Patent Data -- A Systematic Comparison of Four Classification Approaches By Kerstin Hötte; Taheya Tarannum; Vilhelm Verendel; Lauren Bennett
  8. Learning Probability Distributions in Macroeconomics and Finance By Jozef Barunik; Lubos Hanus
  9. Does non-linear factorization of financial returns help build better and stabler portfolios? By Bruno Spilak; Wolfgang Karl Härdle
  10. You reap what (you think) you sow? Evidence on farmers’ behavioral adjustments in the case of correct crop varietal identification By Paola Mallia
  11. Extraction of deterministic components for high frequency stochastic process -- an application from CSI 300 index By Xianfei Hui; Baiqing Sun; Yan Zhou; Indranil SenGupta
  12. An Exploratory Study of Stock Price Movements from Earnings Calls By Sourav Medya; Mohammad Rasoolinejad; Yang Yang; Brian Uzzi
  13. Robust Operator Learning to Solve PDE By Carl Remlinger; Joseph Mikael; Romuald Elie
  15. Living and perceiving a crisis: how the pandemic influenced Americans' preferences and beliefs By Guglielmo Briscese; Maddalena Grignani; Stephen Stapleton
  16. Measuring Judicial Sentiment: Methods and Application to US Circuit Courts By Elliott Ash; Daniel L. Chen; Sergio Galletta
  17. The political economy of big data leaks: Uncovering the skeleton of tax evasion By Pier Luigi Sacco; Alex Arenas; Manlio De Domenico

  1. By: Sebastian Baran; Przemysław Rola
    Abstract: The insurance industry, with its large datasets, is a natural place to use big data solutions. However, it must be stressed that a significant number of machine learning applications in the insurance industry, such as fraud detection or claim prediction, involve learning from an imbalanced dataset. This is because frauds or claims are rare events when compared with the entire population of drivers. The problem of imbalanced learning is often hard to overcome. Therefore, the main goal of this work is to present and apply various methods of dealing with an imbalanced dataset in the context of claim occurrence prediction in car insurance. In addition, these techniques are used to compare the results of machine learning algorithms in the same context. Our study covers the following techniques: logistic regression, decision tree, random forest, XGBoost, and feed-forward network. The problem is one of classification.
    Date: 2022–04
  2. By: Arkadiusz Jędrzejewski; Jesus Lago; Grzegorz Marcjasz; Rafał Weron
    Abstract: Electricity price forecasting (EPF) is a branch of forecasting on the interface of electrical engineering, statistics, computer science, and finance, which focuses on predicting prices in wholesale electricity markets for a whole spectrum of horizons. These range from a few minutes (real-time/intraday auctions and continuous trading), through days (day-ahead auctions), to weeks, months or even years (exchange and over-the-counter traded futures and forward contracts). Over the last 25 years, various methods and computational tools have been applied to intraday and day-ahead EPF. Until the early 2010s, the field was dominated by relatively small linear regression models and (artificial) neural networks, typically with no more than two dozen inputs. As time passed, more data and more computational power became available. The models grew larger to the extent where expert knowledge was no longer enough to manage the complex structures. This, in turn, led to the introduction of machine learning (ML) techniques in this rapidly developing and fascinating area. Here, we provide an overview of the main trends and EPF models as of 2022.
    Date: 2022–04
  3. By: Yiyang Zheng
    Abstract: Predicting the short-term directional movement of a futures contract can be challenging, as its pricing is often based on multiple complex dynamic conditions. This work presents a method for predicting the short-term directional movement of an underlying futures contract. We engineered a set of features from technical analysis, order flow, and order-book data. Then, TabNet, a deep learning neural network, is trained using these features. We train our model on the Silver Futures Contract listed on the Shanghai Futures Exchange and achieve an accuracy of 0.601 in predicting the directional change during the selected period.
    Date: 2022–02
  4. By: Zhuangwei Shi; Yang Hu; Guangliang Mo; Jian Wu
    Abstract: The stock market plays an important role in economic development. Because of the complex volatility of the stock market, research on and prediction of stock price changes can help investors avoid risk. The traditional time series model ARIMA cannot describe nonlinearity and cannot achieve satisfactory results in stock prediction. As neural networks have strong nonlinear generalization ability, this paper proposes an attention-based CNN-LSTM and XGBoost hybrid model to predict the stock price. The model constructed in this paper integrates the time series model, Convolutional Neural Networks with an attention mechanism, the Long Short-Term Memory network, and an XGBoost regressor in a non-linear relationship, and improves the prediction accuracy. The model can fully mine the historical information of the stock market in multiple periods. The stock data is first preprocessed through ARIMA. Then, a deep learning architecture formed in a pretraining-finetuning framework is adopted. The pre-training model is the attention-based CNN-LSTM model based on a sequence-to-sequence framework. The model first uses convolution to extract the deep features of the original stock data, and then uses the Long Short-Term Memory networks to mine the long-term time series features. Finally, the XGBoost model is adopted for fine-tuning. The results show that the hybrid model is more effective and the prediction accuracy is relatively high, which can help investors or institutions to make decisions and achieve the purpose of expanding return and avoiding risk. Source code is available at X-stock-prediction.
    Date: 2022–04
  5. By: Peng Hu (emlyon business school); Yaobin Lu; Yeming Gong
    Abstract: Conversational Artificial Intelligence (AI) refers to digital agents that interact with users through natural language. To advance the understanding of trust in conversational AI, this study focused on two humanness factors manifested by conversational AI: speaking and listening. First, we explored users' heterogeneous perception patterns based on the two humanness factors. Next, we examined how this heterogeneity relates to trust in conversational AI. A two-stage survey was conducted to collect data. Latent profile analysis revealed three distinct patterns: para-human perception, para-machine perception, and asymmetric perception. Finite mixture modeling demonstrated that the benefit of humanizing AI's voice for competence-related trust can evaporate once AI's language understanding is perceived as poor. Interestingly, the asymmetry between humanness perceptions in speaking and listening can impede morality-related trust. By adopting a person-centered approach to address the relationship between dual humanness and user trust, this study contributes to the literature on trust in conversational AI and the practice of trust-inducing AI design.
    Keywords: Artificial intelligence, Humanness perception, Trust, Person-centered approach, Finite mixture modeling
    Date: 2021–06–01
  6. By: Denteh, Augustine (Georgia State University); Liebert, Helge (University of Zurich)
    Abstract: We provide new insights regarding the finding that Medicaid increased emergency department (ED) use from the Oregon experiment. We find meaningful heterogeneous impacts of Medicaid on ED use using causal machine learning methods. The treatment effect distribution is widely dispersed, and the average effect is not representative of most individualized treatment effects. A small group—about 14% of participants—in the right tail of the distribution drives the overall effect. We identify priority groups with economically significant increases in ED usage based on demographics and prior utilization. Intensive margin effects are an important driver of increases in ED utilization.
    Keywords: Medicaid, ED use, effect heterogeneity, causal machine learning, optimal policy
    JEL: H75 I13 I38
    Date: 2022–03
  7. By: Kerstin Hötte; Taheya Tarannum; Vilhelm Verendel; Lauren Bennett
    Abstract: Artificial Intelligence (AI) is often defined as the next general purpose technology (GPT) with profound economic and societal consequences. We examine how strongly four patent AI classification methods reproduce the GPT-like features of (1) intrinsic growth, (2) generality, and (3) innovation complementarities. Studying US patents from 1990-2019, we find that the four methods (keywords, scientific citations, WIPO, and USPTO approach) vary in classifying between 3-17% of all patents as AI. The keyword-based approach demonstrates the strongest intrinsic growth and generality despite identifying the smallest set of AI patents. The WIPO and science approaches generate each GPT characteristic less strikingly, whilst the USPTO set with the largest number of patents produces the weakest features. The lack of overlap and heterogeneity between all four approaches emphasises that the evaluation of AI innovation policies may be sensitive to the choice of classification method.
    Date: 2022–04
  8. By: Jozef Barunik; Lubos Hanus
    Abstract: We propose a deep learning approach to probabilistic forecasting of macroeconomic and financial time series. Being able to learn complex patterns from a data-rich environment, our approach is useful for decision making that depends on the uncertainty of a large number of economic outcomes. Specifically, it is informative to agents facing asymmetric dependence of their loss on outcomes from possibly non-Gaussian and non-linear variables. We show the usefulness of the proposed approach on two distinct datasets where a machine learns the pattern from data. First, we construct macroeconomic fan charts that reflect information from a high-dimensional data set. Second, we illustrate gains in prediction of stock return distributions, which are heavy-tailed, asymmetric and suffer from a low signal-to-noise ratio.
    Date: 2022–04
  9. By: Bruno Spilak; Wolfgang Karl Härdle
    Abstract: A portfolio allocation method based on linear and non-linear latent constrained conditional factors is presented. The factor loadings are constrained to always be positive in order to obtain long-only portfolios, which is not guaranteed by classical factor analysis or PCA. In addition, the factors are to be uncorrelated among clusters in order to build long-only portfolios. Our approach is based on modern machine learning tools: convex Non-negative Matrix Factorization (NMF) and autoencoder neural networks, designed in a specific manner to enforce the learning of useful hidden data structure such as correlation between the assets' returns. Our technique finds lowly correlated linear and non-linear conditional latent factors which are used to build outperforming global portfolios consisting of cryptocurrencies and traditional assets, similar to hierarchical clustering method. We study the dynamics of the derived non-linear factors in order to forecast tail losses of the portfolios and thus build more stable ones.
    Date: 2022–04
  10. By: Paola Mallia (PSE - Paris School of Economics - ENPC - École des Ponts ParisTech - ENS-PSL - École normale supérieure - Paris - PSL - Université Paris sciences et lettres - UP1 - Université Paris 1 Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique - EHESS - École des hautes études en sciences sociales - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement, PJSE - Paris Jourdan Sciences Economiques - UP1 - Université Paris 1 Panthéon-Sorbonne - ENS-PSL - École normale supérieure - Paris - PSL - Université Paris sciences et lettres - EHESS - École des hautes études en sciences sociales - ENPC - École des Ponts ParisTech - CNRS - Centre National de la Recherche Scientifique - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement)
    Abstract: Adoption of improved seed varieties has the potential to lead to substantial productivity increases in agriculture. However, only 36 percent of the farmers that grow an improved maize variety report doing so in Ethiopia. This paper provides the first causal evidence of the impact of misperception in improved maize varieties on farmers' production decisions, productivity and profitability. We employ an Instrumental Variable approach that takes advantage of the roll-out of a governmental program that increases transparency in the seed sector. We find that farmers who correctly classify the improved maize variety grown experience large increases in inputs usage (urea, NPS, labor) and yields, but no statistically significant changes in other agricultural practices or profits. Using machine learning techniques, we develop a model of interpolation to predict objectively measured varietal identification from farmers' self-reported data, which provides proof-of-concept towards scalable approaches to obtain reliable measures of crop varieties and allows us to extend the analysis to the nationally representative sample.
    Date: 2022–03
  11. By: Xianfei Hui; Baiqing Sun; Yan Zhou; Indranil SenGupta
    Abstract: This paper models the stochastic process of the price time series of the CSI 300 index in the Chinese financial market and analyzes the volatility characteristics of intraday high-frequency price data. In the new generalized Barndorff-Nielsen and Shephard model, the lag caused by asynchrony of market information is considered, and the problem of lack of long-term dependence is solved. To speed up the valuation process, several machine learning and deep learning algorithms are used to estimate parameters and evaluate forecast results. Tracking historical jumps of different magnitudes offers promising avenues for simulating dynamic price processes and predicting future jumps. Numerical results show that the deterministic component of stochastic volatility processes would always be captured over short and longer-term windows. The research findings could be useful to investors and regulators interested in predicting market dynamics based on realized volatility.
    Date: 2022–04
  12. By: Sourav Medya; Mohammad Rasoolinejad; Yang Yang; Brian Uzzi
    Abstract: Financial market analysis has focused primarily on extracting signals from accounting, stock price, and other numerical hard data reported in P&L statements or earnings per share reports. Yet, it is well-known that the decision-makers routinely use soft text-based documents that interpret the hard data they narrate. Recent advances in computational methods for analyzing unstructured and soft text-based data at scale offer possibilities for understanding financial market behavior that could improve investments and market equity. A critical and ubiquitous form of soft data are earnings calls. Earnings calls are periodic (often quarterly) statements usually by CEOs who attempt to influence investors' expectations of a company's past and future performance. Here, we study the statistical relationship between earnings calls, company sales, stock performance, and analysts' recommendations. Our study covers a decade of observations with approximately 100,000 transcripts of earnings calls from 6,300 public companies from January 2010 to December 2019. In this study, we report three novel findings. First, the buy, sell and hold recommendations from professional analysts made prior to the earnings have low correlation with stock price movements after the earnings call. Second, using our graph neural network based method that processes the semantic features of earnings calls, we reliably and accurately predict stock price movements in five major areas of the economy. Third, the semantic features of transcripts are more predictive of stock price movements than sales and earnings per share, i.e., traditional hard data in most of the cases.
    Date: 2022–01
  13. By: Carl Remlinger (Université Gustave Eiffel, EDF R&D - EDF R&D - EDF - EDF, FiME Lab - Laboratoire de Finance des Marchés d'Energie - EDF R&D - EDF R&D - EDF - EDF - CREST - Université Paris Dauphine-PSL - PSL - Université Paris sciences et lettres); Joseph Mikael (EDF R&D LME - Laboratoire des Matériels Électriques - EDF R&D - EDF R&D - EDF - EDF); Romuald Elie (Université Gustave Eiffel)
    Abstract: A model solving a family of partial differential equations (PDEs) with a single training is proposed. Re-calibrating a risk factor model or re-training a solver every time the market conditions change is costly and unsatisfactory. We therefore want to solve PDEs when the environment is not stationary or for several initial conditions at the same time. By learning operators in a single training, we ensure the robustness of optimal controls to variations of the models, options or constraints. But, ultimately, we want to generalize by solving the PDE with models or conditions that were not present during training. We confirm the effectiveness of the method with several risk management problems by comparing it with other machine learning approaches. We evaluate our DeepOHedger on option pricing tasks, including local volatility models and option spreads involved in energy markets. Finally, we present a purely data-driven approach to risk hedging, from time series generation to learning optimal policy. Our model then solves a family of parametric PDEs from synthetic samples produced by a deep generator previously trained on spot price data from different countries.
    Date: 2022–03–07
  14. By: Mouad Lamrabet (Laboratoire de Recherche en Intelligence Stratégique - UH2MC - Université Hassan II [Casablanca]); Lam'Hammdi Hicham; Amal Tomal; Tariq Akdim; Taoufik Benkaraache (Laboratoire de Recherche en Intelligence Stratégique - UH2MC - Université Hassan II [Casablanca])
    Abstract: The objective of this paper is both to present new scientific concepts of the territory and to call for the resumption of an actualist territorial research. To do this, we have come together around a collective reflection, as researchers of the territory, to present, each from his or her own field of research, the newest territorial concepts. In this sense, we successively present in this article four new concepts: (C1) Territorial Big Data, (C2) Attractiveness and Territorial Digital Marketing, (C3) Territorial Living Lab and (C4) Augmented Metropolis.
    Keywords: Territorial Big Data, Attractiveness and Territorial Digital Marketing, Territorial Living Lab, Augmented Metropolis
    Date: 2022–02–11
  15. By: Guglielmo Briscese; Maddalena Grignani; Stephen Stapleton
    Abstract: Crises can cause important societal changes by shifting citizens' preferences and beliefs, but how such change happens remains an open question. Following a representative sample of Americans in a longitudinal multi-wave survey throughout 2020, we find that citizens reduced trust in public institutions and became more supportive of government spending after being directly impacted by the crisis, such as when they lost a sizeable portion of their income or knew someone hospitalized with the virus. These shifts occurred very rapidly, sometimes in a matter of weeks, and persisted over time. We also record an increase in the partisan gap on the same outcomes, which can be largely explained by misperceptions about the crisis inflated by the consumption of partisan leaning news. In an experiment, we expose respondents to the same source of information and find that it successfully recalibrates perceptions, with persistent effects. We complement our analysis by employing machine learning to estimate heterogeneous treatment effects, and show that our findings are robust to several specifications and estimation strategies. In sum, both lived experiences and media inflated misperceptions can alter citizens' beliefs rapidly during a crisis.
    Date: 2022–02
  16. By: Elliott Ash; Daniel L. Chen (IAST - Institute for Advanced Study in Toulouse); Sergio Galletta
    Abstract: This paper provides a general method for analysing the sentiments expressed in the language of judicial rulings. We apply natural language processing tools to the text of US appellate court opinions to extrapolate judges' sentiments (positive/good vs. negative/bad) towards a number of target social groups. We explore descriptively how these sentiments vary over time and across types of judges. In addition, we provide a method for using random assignment of judges in an instrumental variables framework to estimate causal effects of judges' sentiments. In an empirical application, we show that more positive sentiment influences future judges by increasing the likelihood of reversal but also increasing the number of forward citations.
    Date: 2022
  17. By: Pier Luigi Sacco; Alex Arenas; Manlio De Domenico
    Abstract: After the leak of 11.5 million documents from the Panamanian corporation Mossack Fonseca, an intricate network of offshore business entities has been revealed. The emerging picture is that of legal entities, either individuals or companies, involved in offshore activities and transactions with several tax havens simultaneously which establish, indirectly, an effective network of countries acting on tax evasion. The analysis of this network quantitatively uncovers a strongly connected core (a rich-club) of countries whose indirect interactions, mediated by legal entities, form the skeleton for tax evasion worldwide. Intriguingly, the rich-club mainly consists of well-known tax havens such as British Virgin Islands and Hong Kong, and major global powers such as China, Russia, United Kingdom and United States of America. The analysis provides a new way to rank tax havens because of the role they play in this network, and the results call for an international coordination on taxation policies that take into account the complex interconnected structure of tax evaders in a globalized economy.
    Date: 2022–02

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.