nep-big New Economics Papers
on Big Data
Issue of 2023‒11‒20
23 papers chosen by
Tom Coupé, University of Canterbury

  1. Social Media and Real Estate: Do Twitter users predict REIT performance? By Nino Paulus; Lukas Lautenschlaeger; Wolfgang Schäfers
  2. Blending gradient boosted trees and neural networks for point and probabilistic forecasting of hierarchical time series By Ioannis Nasios; Konstantinos Vogklis
  3. Transparency challenges in policy evaluation with causal machine learning -- improving usability and accountability By Patrick Rehill; Nicholas Biddle
  4. GDP nowcasting with Machine Learning and Unstructured Data to Peru By Juan Tenorio; Wilder Pérez
  5. Co-Training Realized Volatility Prediction Model with Neural Distributional Transformation By Xin Du; Kai Moriyama; Kumiko Tanaka-Ishii
  6. Determinants of U.S. REIT Bond Risk Premia with Explainable Machine Learning By Jakob Kozak; Maximilian Nagl; Cathrine Nagl; Eli Beracha; Wolfgang Schäfers
  7. A Score Function to Prioritize Editing in Household Survey Data: A Machine Learning Approach By Nicolás Forteza; Sandra García-Uribe
  8. Demand Estimation with Text and Image Data By Giovanni Compiani; Ilya Morozov; Stephan Seiler
  9. Deriving Technology Indicators from Corporate Websites: A Comparative Assessment Using Patents By Sebastian Heinrich
  10. Neural Network for valuing Bitcoin options under jump-diffusion and market sentiment model By Edson Pindza; Jules Clement Mba; Sutene Mwambi; Nneka Umeorah
  11. Machine Learning for Staggered Difference-in-Differences and Dynamic Treatment Effect Heterogeneity By Julia Hatamyar; Noemi Kreif; Rudi Rocha; Martin Huber
  12. Understanding Models and Model Bias with Gaussian Processes By Thomas R. Cook; Nathan M. Palmer
  13. A Semiparametric Instrumented Difference-in-Differences Approach to Policy Learning By Pan Zhao; Yifan Cui
  14. Few-Shot Learning Patterns in Financial Time-Series for Trend-Following Strategies By Kieran Wood; Samuel Kessler; Stephen J. Roberts; Stefan Zohren
  15. Improving Portfolio Performance Using a Novel Method for Predicting Financial Regimes By Piotr Pomorski; Denise Gorse
  16. How Does China's Household Portfolio Selection Vary with Financial Inclusion? By Yong Bian; Xiqian Wang; Qin Zhang
  17. Functional gradient descent boosting for additive non‐linear spatial autoregressive model (gaussian and probit) By Ghislain Geniaux
  18. Towards reducing hallucination in extracting information from financial reports using Large Language Models By Bhaskarjit Sarmah; Tianjie Zhu; Dhagash Mehta; Stefano Pasquali
  19. Mining the Gap: Extracting Firms’ Inflation Expectations From Earnings Calls By Silvia Albrizio; Allan Dizioli; Pedro Vitale Simon
  20. Using big data to measure cultural tourism in Europe with unprecedented precision By Borowiecki, Karol Jan; Pedersen, Maja U.; Mitchell, Sara Beth
  21. Double and Single Descent in Causal Inference with an Application to High-Dimensional Synthetic Control By Jann Spiess; Guido Imbens; Amar Venugopal
  22. Towards Enhanced Local Explainability of Random Forests: a Proximity-Based Approach By Joshua Rosaler; Dhruv Desai; Bhaskarjit Sarmah; Dimitrios Vamvourellis; Deran Onay; Dhagash Mehta; Stefano Pasquali
  23. Forecast Reconciliation: A Review By George Athanasopoulos; Rob J Hyndman; Nikolaos Kourentzes; Anastasios Panagiotelis

  1. By: Nino Paulus; Lukas Lautenschlaeger; Wolfgang Schäfers
    Abstract: Problems and objective: Social media platforms have become vibrant online spaces where people share their opinions and views on any topic (Yadav and Vishwakarma, 2020). With the increasing volume and speed of social media, the exchange of stock market-related information has become more important, which is why the effects of social media information on stock markets are becoming increasingly salient (Li et al., 2018). Business organizations need to understand these dynamics, as they reflect the interests of all kinds of market participants: retail investors, institutional investors, but also clients, journalists and many others. Therefore, it is not surprising that there is evidence for public sentiment, obtained from social media, correlating with or even predicting economic indicators (e.g. Bollen et al., 2011; Sprenger et al., 2014; Xu and Cohen, 2018). Regarding real estate, Zamani and Schwartz (2017) successfully used Twitter language to forecast house price changes for a small sample at the county level. Apart from this limited research on real estate markets and the research on the general stock market, there is no more general study that examines the relationship between social media and real estate markets. Nevertheless, real estate markets are of particular interest, not only because of real estate's popularity as an asset class among retail investors, but also because real estate is ubiquitous in daily life and the market is opaque. Sentiment indicators extracted from social media therefore promise to cover perspectives from all kinds of people and could be more informative than traditional sentiment measures. However, as described by Li et al. (2018), social media-based sentiment indicators are not intended to replace traditional sentiment indicators, but rather to complement them, as these are usually based on the knowledge of only a few industry insiders instead of that of the general public. Besides, the study focuses on indirect real estate (i.e. REITs), as it allows retail investors, who represent the majority of social media users sharing equity-related information, to participate in real estate markets. Methodology & Data: Using a dictionary-based approach, a classical machine learning approach and a deep learning-based approach to extract the sentiment of approximately 4 million tweets, this paper compares methods of different complexity in terms of their ability to classify social media sentiment and predict indirect real estate returns on a monthly basis. The baseline for this comparison is a conventional dictionary-based approach including valence shifting properties. The dictionary used is the real estate-specific dictionary developed by Ruscheinsky et al. (2018). For the classical machine learning method, a support vector machine (SVM), which has already been shown to be effective in a real estate context (Hausler et al., 2018), is utilized. The more complex deep learning approach is based on a Long Short-Term Memory (LSTM) model. The usefulness of deep learning-based approaches for sentiment analysis in a real estate context has been demonstrated before by Braun et al. (2019). As high-trade-volume stocks tend to be discussed most on Twitter, posts are collected from this platform (Xu and Cohen, 2018), covering a ten-year timespan from 2013 to 2022. Tweets are selected on the basis of cashtags representing all US REITs. The monthly total return of the FTSE Nareit All Equity Total Return index is the dependent variable, and the constructed sentiment variable is the variable of interest. Contribution to science and practice: The aim of this study is to create a standardized framework that enables investors of all kinds to better classify current market events and thus better navigate the opaque real estate market. This framework could be applied not only by investors but also, conversely, by REITs to understand and optimize their position in society and in the investor landscape. To the authors' knowledge, this is the first study to analyze the impact of social media sentiment on (indirect) real estate returns, based on a comprehensive national dataset.
    JEL: R3
    Date: 2023–01–01
  2. By: Ioannis Nasios; Konstantinos Vogklis
    Abstract: In this paper we tackle the problem of point and probabilistic forecasting by describing a blending methodology of machine learning models that belong to the gradient boosted trees and neural networks families. These principles were successfully applied in the recent M5 Competition on both the Accuracy and Uncertainty tracks. The key points of our methodology are: a) transform the task to regression on sales for a single day, b) information-rich feature engineering, c) create a diverse set of state-of-the-art machine learning models, and d) carefully construct validation sets for model tuning. We argue that the diversity of the machine learning models, along with the careful selection of validation examples, were the most important ingredients for the effectiveness of our approach. Although the forecasting data had an inherent hierarchical structure (12 levels), none of our proposed solutions exploited that hierarchical scheme. Using the proposed methodology, our team was ranked within the gold medal range in both the Accuracy and the Uncertainty tracks. Inference code along with already trained models are available at rtainty_3rd_place
    Date: 2023–10
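The blending idea in the abstract above can be illustrated with a minimal sketch. It assumes a simple weighted average of two point forecasts, with the weight picked on a validation set by root mean squared error; the abstract does not state the authors' actual blending rule, and the names `blend`, `best_blend_weight`, `pred_tree` and `pred_nn` are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def blend(pred_a, pred_b, weight):
    """Weighted average of two forecasts: weight * a + (1 - weight) * b."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(pred_a, pred_b)]


def best_blend_weight(y_val, pred_a, pred_b, steps=20):
    """Grid-search the weight in [0, 1] that minimises validation RMSE.

    Assumption: a one-dimensional grid search stands in for whatever
    tuning procedure the paper actually uses.
    """
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(y_val, blend(pred_a, pred_b, w)))


# Toy validation data: one model under-predicts, the other over-predicts.
y_val = [10.0, 12.0, 14.0]
pred_tree = [9.0, 11.0, 13.0]   # hypothetical gradient boosted tree forecasts
pred_nn = [11.0, 13.0, 15.0]    # hypothetical neural network forecasts

weight = best_blend_weight(y_val, pred_tree, pred_nn)
blended = blend(pred_tree, pred_nn, weight)
```

Because the two toy models err in opposite directions, the blended forecast has a lower validation error than either model alone, which is the intuition behind using a diverse model set.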
  3. By: Patrick Rehill; Nicholas Biddle
    Abstract: Causal machine learning tools are beginning to see use in real-world policy evaluation tasks to flexibly estimate treatment effects. One issue with these methods is that the machine learning models used are generally black boxes, i.e., there is no globally interpretable way to understand how a model makes estimates. This is a clear problem in policy evaluation applications, particularly in government, because it is difficult to understand whether such models are functioning in ways that are fair, based on the correct interpretation of evidence and transparent enough to allow for accountability if things go wrong. However, there has been little discussion of transparency problems in the causal machine learning literature and how these might be overcome. This paper explores why transparency issues are a problem for causal machine learning in public policy evaluation applications and considers ways these problems might be addressed through explainable AI tools and by simplifying models in line with interpretable AI principles. It then applies these ideas to a case-study using a causal forest model to estimate conditional average treatment effects for a hypothetical change in the school leaving age in Australia. It shows that existing tools for understanding black-box predictive models are poorly suited to causal machine learning and that simplifying the model to make it interpretable leads to an unacceptable increase in error (in this application). It concludes that new tools are needed to properly understand causal machine learning models and the algorithms that fit them.
    Date: 2023–10
  4. By: Juan Tenorio; Wilder Pérez
    Abstract: In a context of ongoing change, “nowcasting” models based on Machine Learning (ML) algorithms deliver a noteworthy advantage for decision-making in both the public and private sectors due to their flexibility and ability to handle large amounts of data. This document presents projection models for the monthly GDP growth rate of Peru, which combine structured macroeconomic indicators with high-frequency unstructured sentiment variables. The sample window runs from January 2007 to May 2023 and includes a total of 91 variables. By assessing six ML algorithms, the best predictors for each model were identified. The results reveal the high capacity of each ML model with unstructured data to provide more accurate and earlier predictions than traditional time series models. The outstanding models were Gradient Boosting Machine, LASSO, and Elastic Net, which achieved a prediction error reduction of 20% to 25% compared to the AR and Dynamic Factor Model (DFM) benchmarks. These results could be influenced by the analysis period, which includes crisis events characterized by high uncertainty, during which ML models with unstructured data gain significance.
    Keywords: nowcasting, machine learning, GDP growth
    Date: 2023–11
  5. By: Xin Du; Kai Moriyama; Kumiko Tanaka-Ishii
    Abstract: This paper presents a novel machine learning model for realized volatility (RV) prediction using a normalizing flow, an invertible neural network. Since RV is known to be skewed and have a fat tail, previous methods transform RV into values that follow a latent distribution with an explicit shape and then apply a prediction model. However, knowing that shape is non-trivial, and the transformation result influences the prediction model. This paper proposes to jointly train the transformation and the prediction model. The training process follows a maximum-likelihood objective function that is derived from the assumption that the prediction residuals on the transformed RV time series are homogeneously Gaussian. The objective function is further approximated using an expectation-maximization algorithm. On a dataset of 100 stocks, our method significantly outperforms other methods using analytical or naive neural-network transformations.
    Date: 2023–10
  6. By: Jakob Kozak; Maximilian Nagl; Cathrine Nagl; Eli Beracha; Wolfgang Schäfers
    Abstract: Corporate bonds are an important source of funding for real estate investment trusts (REITs). The outstanding unsecured debt of U.S. equity REITs, which is an approximation for outstanding bond debt, was $450 billion in 2022, while REIT net asset value was $1.1 trillion in the same year. This highlights the importance of corporate bonds for U.S. REITs. However, the literature on bond risk premia focuses only on corporate bonds in general and neglects the specific structure and functioning of issuing REITs. Specifically, U.S. REITs must distribute 90% of their taxable income to shareholders, which prevents them from building capital internally through retained earnings. Since corporate bonds represent a general claim on corporate assets and cash in the case of default, we hypothesize that the drivers of REIT bond risk premia differ from those of the general corporate bond market. Therefore, this paper aims to fill this gap by examining yield spreads, which are the difference between the yield on a REIT bond and the U.S. Treasury yield having the same maturity. Based on findings in the empirical asset pricing literature on the superior performance of artificial neural networks in the adjacent fields of stock and bond return prediction, this paper applies an artificial neural network to predict REIT bond yield spreads. We use a data set of 27,014 monthly U.S. REIT bond transactions from 2010 to 2021 and 33 explanatory variables including bond characteristics, equity and bond market variables, macroeconomic indicators, and, as a novelty, REIT balance sheet data, REIT type, and direct real estate market total return. Preliminary results show that the neural network predicts REIT bond yield spreads with an out-of-sample mean R² of 36.3%. Feature importance analysis using explainable machine learning methods shows that default risk, captured by REIT size, economy-wide default risk spread, and interest rate volatility, is highly relevant to the prediction of REIT bond yield spreads. We also find evidence for tax and illiquidity risk premia. Interestingly, equity market-related variables are only important in times of economic recession. Real estate market return is an important feature and is negatively related to the predictions of REIT bond yield spreads. These findings underline that bond risk premia for REITs have additional drivers compared to those in the general corporate bond market.
    Keywords: Fixed Income; Machine Learning; REIT; Risk Premium
    JEL: R3
    Date: 2023–01–01
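The yield spread defined in the abstract above (REIT bond yield minus the U.S. Treasury yield at the same maturity) can be sketched as follows. This is an illustration only: the linear interpolation between quoted Treasury maturities is an assumption, not something the abstract specifies, and the curve values and function names are hypothetical.

```python
def interpolate_treasury_yield(curve, maturity):
    """Linearly interpolate a Treasury yield (in percent) at `maturity` years.

    `curve` maps maturity in years to yield in percent. Assumption: linear
    interpolation, with flat extrapolation outside the quoted range.
    """
    points = sorted(curve.items())
    if maturity <= points[0][0]:
        return points[0][1]
    if maturity >= points[-1][0]:
        return points[-1][1]
    for (m0, y0), (m1, y1) in zip(points, points[1:]):
        if m0 <= maturity <= m1:
            weight = (maturity - m0) / (m1 - m0)
            return y0 + weight * (y1 - y0)


def yield_spread(bond_yield, maturity, curve):
    """Yield spread: bond yield minus the maturity-matched Treasury yield."""
    return bond_yield - interpolate_treasury_yield(curve, maturity)


# Hypothetical Treasury curve (maturity in years -> yield in percent).
treasury_curve = {2: 4.0, 5: 4.2, 10: 4.5}

# A hypothetical 7-year REIT bond yielding 6.0%.
spread = yield_spread(6.0, 7, treasury_curve)
```

Here the interpolated 7-year Treasury yield is about 4.32%, so the spread is about 1.68 percentage points.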
  7. By: Nicolás Forteza (Banco de España); Sandra García-Uribe (Banco de España)
    Abstract: Errors in the collection of household finance survey data may proliferate in population estimates, especially when there is oversampling of some population groups. Manual case-by-case revision has been commonly applied in order to identify and correct potential errors and omissions such as omitted or misreported assets, income and debts. We derive a machine learning approach for the purpose of classifying survey data affected by severe errors and omissions in the revision phase. Using data from the Spanish Survey of Household Finances we provide the best-performing supervised classification algorithm for the task of prioritizing cases with substantial errors and omissions. Our results show that a Gradient Boosting Trees classifier outperforms several competing classifiers. We also provide a framework that takes into account the trade-off between precision and recall in the survey agency in order to select the optimal classification threshold.
    Keywords: machine learning, predictive models, selective editing, survey data
    JEL: C81 C83 C88
    Date: 2023–10
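The precision/recall trade-off behind choosing a classification threshold, mentioned in the abstract above, can be sketched as follows. The sketch assumes the trade-off is summarised by an F-beta score (beta below 1 favours precision, beta above 1 favours recall); the paper's own framework may differ, and all names and data here are hypothetical.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when cases with score >= threshold are flagged."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for t, f in zip(y_true, flagged) if t and f)
    fp = sum(1 for t, f in zip(y_true, flagged) if not t and f)
    fn = sum(1 for t, f in zip(y_true, flagged) if t and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta):
    """F-beta score; returns 0.0 when both precision and recall are zero."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def pick_threshold(y_true, scores, beta=1.0):
    """Return the observed score that maximises F-beta when used as threshold."""
    candidates = sorted(set(scores))
    return max(
        candidates,
        key=lambda t: f_beta(*precision_recall(y_true, scores, t), beta),
    )


# Toy data: 1 = questionnaire with severe errors, 0 = clean questionnaire.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # hypothetical classifier scores

balanced = pick_threshold(y_true, scores, beta=1.0)
precision_heavy = pick_threshold(y_true, scores, beta=0.5)
```

With these toy scores, weighting precision more heavily moves the chosen threshold up, so fewer cases are sent for manual revision at the cost of missing some erroneous ones.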
  8. By: Giovanni Compiani; Ilya Morozov; Stephan Seiler
    Abstract: We propose a demand estimation method that allows researchers to estimate substitution patterns from unstructured image and text data. We first employ a series of machine learning models to measure product similarity from products’ images and textual descriptions. We then estimate a nested logit model with product-pair specific nesting parameters that depend on the image and text similarities between products. Our framework does not require collecting product attributes for each category and can capture product similarity along dimensions that are hard to account for with observed attributes. We apply our method to a dataset describing the behavior of Amazon shoppers across several categories and show that incorporating texts and images in demand estimation helps us recover a flexible cross-price elasticity matrix.
    Keywords: demand estimation, unstructured data, computer vision, text models
    JEL: C10 C50 C81
    Date: 2023
  9. By: Sebastian Heinrich (KOF Swiss Economic Institute, ETH Zurich, Switzerland)
    Abstract: This paper investigates the potential of indicators derived from corporate websites to measure technology-related concepts. Using artificial intelligence (AI) technology as a case in point, I construct a 24-year panel combining the texts of websites and patent portfolios for over 1,000 large companies. By identifying AI exposure with a comprehensive keyword set, I show that website and patent data are strongly related, suggesting that corporate websites constitute a promising data source to trace AI technologies.
    Keywords: corporate website, patent portfolio, technology indicator, text data, artificial intelligence
    JEL: C81 O31 O33
    Date: 2023–07
  10. By: Edson Pindza; Jules Clement Mba; Sutene Mwambi; Nneka Umeorah
    Abstract: Cryptocurrencies, and Bitcoin in particular, are prone to wild swings resulting in frequent jumps in prices, making them historically popular for traders to speculate on. A better understanding of these fluctuations can greatly benefit crypto investors by allowing them to make informed decisions. It is claimed in recent literature that the Bitcoin price is influenced by sentiment about the Bitcoin system. Transactions, as well as popularity, have been shown to be potential drivers of the Bitcoin price. This study considers a bivariate jump-diffusion model to describe Bitcoin price dynamics and the number of Google searches affecting the price, representing a sentiment indicator. We obtain a closed formula for the Bitcoin price and derive the Black-Scholes equation for Bitcoin options. We first solve the corresponding Bitcoin option partial differential equation for the pricing process by introducing artificial neural networks and incorporating multi-layer perceptron techniques. Prediction performance and model validation were assessed using various highly volatile stocks.
    Date: 2023–10
  11. By: Julia Hatamyar; Noemi Kreif; Rudi Rocha; Martin Huber
    Abstract: We combine two recently proposed nonparametric difference-in-differences methods, extending them to enable the examination of treatment effect heterogeneity in the staggered adoption setting using machine learning. The proposed method, machine learning difference-in-differences (MLDID), allows for estimation of time-varying conditional average treatment effects on the treated, which can be used to conduct detailed inference on drivers of treatment effect heterogeneity. We perform simulations to evaluate the performance of MLDID and find that it accurately identifies the true predictors of treatment effect heterogeneity. We then use MLDID to evaluate the heterogeneous impacts of Brazil's Family Health Program on infant mortality, and find those in poverty and urban locations experienced the impact of the policy more quickly than other subgroups.
    Date: 2023–10
  12. By: Thomas R. Cook; Nathan M. Palmer
    Abstract: Despite growing interest in the use of complex models, such as machine learning (ML) models, for credit underwriting, ML models are difficult to interpret, and it is possible for them to learn relationships that yield de facto discrimination. How can we understand the behavior and potential biases of these models, especially if our access to the underlying model is limited? We argue that counterfactual reasoning is ideal for interpreting model behavior, and that Gaussian processes (GP) can provide approximate counterfactual reasoning while also incorporating uncertainty in the underlying model’s functional form. We illustrate with an exercise in which a simulated lender uses a biased machine model to decide credit terms. Comparing aggregate outcomes does not clearly reveal bias, but with a GP model we can estimate individual counterfactual outcomes. This approach can detect the bias in the lending model even when only a relatively small sample is available. To demonstrate the value of this approach for the more general task of model interpretability, we also show how the GP model’s estimates can be aggregated to recreate the partial density functions for the lending model.
    Keywords: models; Gaussian process; model bias
    JEL: C10 C14 C18 C45
    Date: 2023–06–15
  13. By: Pan Zhao; Yifan Cui
    Abstract: Recently, there has been a surge in methodological development for the difference-in-differences (DiD) approach to evaluate causal effects. Standard methods in the literature rely on the parallel trends assumption to identify the average treatment effect on the treated. However, the parallel trends assumption may be violated in the presence of unmeasured confounding, and the average treatment effect on the treated may not be useful in learning a treatment assignment policy for the entire population. In this article, we propose a general instrumented DiD approach for learning the optimal treatment policy. Specifically, we establish identification results using a binary instrumental variable (IV) when the parallel trends assumption fails to hold. Additionally, we construct a Wald estimator, novel inverse probability weighting (IPW) estimators, and a class of semiparametric efficient and multiply robust estimators, with theoretical guarantees on consistency and asymptotic normality, even when relying on flexible machine learning algorithms for nuisance parameter estimation. Furthermore, we extend the instrumented DiD to the panel data setting. We evaluate our methods in extensive simulations and a real data application.
    Date: 2023–10
  14. By: Kieran Wood; Samuel Kessler; Stephen J. Roberts; Stefan Zohren
    Abstract: Forecasting models for systematic trading strategies do not adapt quickly when financial market conditions change, as was seen in the advent of the COVID-19 pandemic in 2020, when market conditions changed dramatically causing many forecasting models to take loss-making positions. To deal with such situations, we propose a novel time-series trend-following forecaster that is able to quickly adapt to new market conditions, referred to as regimes. We leverage recent developments from the deep learning community and use few-shot learning. We propose the Cross Attentive Time-Series Trend Network - X-Trend - which takes positions attending over a context set of financial time-series regimes. X-Trend transfers trends from similar patterns in the context set to make predictions and take positions for a new distinct target regime. X-Trend is able to quickly adapt to new financial regimes with a Sharpe ratio increase of 18.9% over a neural forecaster and 10-fold over a conventional Time-series Momentum strategy during the turbulent market period from 2018 to 2023. Our strategy recovers twice as quickly from the COVID-19 drawdown compared to the neural-forecaster. X-Trend can also take zero-shot positions on novel unseen financial assets obtaining a 5-fold Sharpe ratio increase versus a neural time-series trend forecaster over the same period. X-Trend both forecasts next-day prices and outputs a trading signal. Furthermore, the cross-attention mechanism allows us to interpret the relationship between forecasts and patterns in the context set.
    Date: 2023–10
  15. By: Piotr Pomorski; Denise Gorse
    Abstract: This work extends a previous work in regime detection, which allowed trading positions to be profitably adjusted when a new regime was detected, to ex ante prediction of regimes, leading to substantial performance improvements over the earlier model, over all three asset classes considered (equities, commodities, and foreign exchange), over a test period of four years. The proposed new model is also benchmarked over this same period against a hidden Markov model, the most popular current model for financial regime prediction, and against an appropriate index benchmark for each asset class, in the case of the commodities model having a test period cost-adjusted cumulative return over four times higher than that expected from the index. Notably, the proposed model makes use of a contrarian trading strategy, not uncommon in the financial industry but relatively unexplored in machine learning models. The model also makes use of frequent short positions, something not always desirable to investors due to issues of both financial risk and ethics; however, it is discussed how further work could remove this reliance on shorting and allow the construction of a long-only version of the model.
    Date: 2023–09
  16. By: Yong Bian; Xiqian Wang; Qin Zhang
    Abstract: Portfolio underdiversification is one of the most costly losses accumulated over a household's life cycle. We provide new evidence on the impact of financial inclusion services on households' portfolio choice and investment efficiency using 2015, 2017, and 2019 survey data for Chinese households. We hypothesize that higher financial inclusion penetration encourages households to participate in the financial market, leading to better portfolio diversification and investment efficiency. The results of the baseline model are consistent with our proposed hypothesis that higher accessibility to financial inclusion encourages households to invest in risky assets and increases investment efficiency. We further estimate a dynamic double machine learning model to quantitatively investigate the non-linear causal effects and track the dynamic change of those effects over time. We observe that the marginal effect increases over time, and those effects are more pronounced among low-asset, less-educated households and those located in non-rural areas, except for investment efficiency for high-asset households.
    Date: 2023–11
  17. By: Ghislain Geniaux (ECODEVELOPPEMENT - Unité de recherche d'Écodéveloppement - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement)
    Abstract: In this working paper, I aim to establish a connection between the traditional models of spatial econometrics and machine learning algorithms. The objective is to determine, within the context of big data, which variables should be incorporated into autoregressive nonlinear models and in what forms: linear, nonlinear, spatially varying, or with interactions with other variables. To address these questions, I propose an extension of boosting algorithms (Friedman, 2001; Buhlmann et al., 2007) to semi-parametric autoregressive models (SAR, SDM, SEM, and SARAR), formulated as additive models with smoothing spline functions. This adaptation primarily relies on estimating the spatial parameter using the Quasi-Maximum Likelihood (QML) method, following the examples set by Basile and Gress (2004) and Su and Jin (2010). To simplify the calculation of the spatial multiplier, I propose two extensions. The first is based on the direct application of the Closed Form Estimator (CFE), recently proposed by Smirnov (2020). Additionally, I suggest a Flexible Instrumental Variable Approach/control function approach (Marra and Radice, 2010; Basile et al., 2014) for SAR models, which dynamically constructs the instruments based on the functioning of the functional gradient descent boosting algorithm. The proposed estimators can be easily extended to incorporate decision trees instead of smoothing splines, allowing for the identification of more complex variable interactions. For discrete choice models with spatial dependence, I extend the SAR probit model approximation method proposed by Martinetti and Geniaux (2018) to the nonlinear case using the boosting algorithm and smoothing splines. Using synthetic data, I study the finite sample properties of the proposed estimators for both Gaussian and probit cases. Finally, inspired by the work of Debarsy and LeSage (2018, 2022), I extend the Gaussian case of the nonlinear SAR model to a more complex spatial autoregressive multiplier, involving multiple spatial weight matrices. This extension helps determine the most geographically relevant spatial weight matrix. To illustrate the efficacy of functional gradient descent boosting for additive nonlinear spatial autoregressive models, I employ real data from a large dataset on house prices in France, assessing the out-of-sample accuracy.
    Keywords: Spatial Autoregressive model, gradient boosting
    Date: 2023–05–25
  18. By: Bhaskarjit Sarmah; Tianjie Zhu; Dhagash Mehta; Stefano Pasquali
    Abstract: For a financial analyst, the question and answer (Q&A) segment of the company financial report is a crucial piece of information for various analyses and investment decisions. However, extracting valuable insights from the Q&A section has posed considerable challenges: conventional methods such as detailed reading and note-taking lack scalability and are susceptible to human error, and Optical Character Recognition (OCR) and similar techniques encounter difficulties in accurately processing unstructured transcript text, often missing subtle linguistic nuances that drive investor decisions. Here, we demonstrate the use of Large Language Models (LLMs) to efficiently and rapidly extract information from earnings report transcripts with high accuracy, transforming the extraction process and reducing hallucination by combining a retrieval-augmented generation technique with metadata. We evaluate the outcomes of various LLMs with and without our proposed approach, based on various objective metrics for evaluating Q&A systems, and empirically demonstrate the superiority of our method.
    Date: 2023–10
  19. By: Silvia Albrizio; Allan Dizioli; Pedro Vitale Simon
    Abstract: Using a novel approach involving natural language processing (NLP) algorithms, we construct a new cross-country index of firms' inflation expectations from earnings call transcripts. Our index has a high correlation with existing survey-based measures of firms' inflation expectations, is robust to external validation tests, and is built using a new method that outperforms other NLP algorithms. In an application of our index to the United States, we uncover some facts related to firms' inflation expectations. We show that higher expected inflation translates into future inflation. Turning to the firm-level dimension of our index, we show departures from a rational framework in firms' inflation expectations and that firms' attention to the central bank enhances monetary policy effectiveness.
    Keywords: rms' inflation expectations; Firms' earnings calls transcripts; Natural Language processing; GPT3.5; Monetary policy
    Date: 2023–10–04
  20. By: Borowiecki, Karol Jan (Department of Economics); Pedersen, Maja U. (Department of Economics); Mitchell, Sara Beth (Department of Economics)
    Abstract: International tourism statistics are notorious for being over-aggregated, lacking information about the tourist, available with a lag, and often provided only at the annual level. In response to this, we suggest a unique complementary approach that is computer-science driven and relies on big data collected from a leading travel portal. The novel approach enables us to obtain a systematic, consistent, and reliable approximation of tourism flows with unparalleled precision, frequency, and depth of information. Our approach also delivers an unprecedented list of all tourist attractions in a country, along with data on the popularity and quality of these attractions. We provide validity tests of the approach and present one application of the data by illuminating the patterns and changes in travel flows in selected European destinations during and after the Covid-19 pandemic. This project opens a range of new research questions and possibilities for cultural economics, in particular related to cultural heritage and tourism.
    Keywords: Tourism; Cultural heritage; Big data; Covid-19
    JEL: J60 L83 O10 Z11 Z30
    Date: 2023–11–02
  21. By: Jann Spiess; Guido Imbens; Amar Venugopal
    Abstract: Motivated by a recent literature on the double-descent phenomenon in machine learning, we consider highly over-parameterized models in causal inference, including synthetic control with many control units. In such models, there may be so many free parameters that the model fits the training data perfectly. We first investigate high-dimensional linear regression for imputing wage data and estimating average treatment effects, where we find that models with many more covariates than sample size can outperform simple ones. We then document the performance of high-dimensional synthetic control estimators with many control units. We find that adding control units can help improve imputation performance even beyond the point where the pre-treatment fit is perfect. We provide a unified theoretical perspective on the performance of these high-dimensional models. Specifically, we show that more complex models can be interpreted as model-averaging estimators over simpler ones, which we link to an improvement in average performance. This perspective yields concrete insights into the use of synthetic control when control units are many relative to the number of pre-treatment periods.
    JEL: C01
    Date: 2023–10
  22. By: Joshua Rosaler; Dhruv Desai; Bhaskarjit Sarmah; Dimitrios Vamvourellis; Deran Onay; Dhagash Mehta; Stefano Pasquali
    Abstract: We initiate a novel approach to explain the out-of-sample performance of random forest (RF) models by exploiting the fact that any RF can be formulated as an adaptive weighted k-nearest-neighbors model. Specifically, we use the proximity between points in the feature space learned by the RF to rewrite random forest predictions exactly as a weighted average of the target labels of training data points. This linearity facilitates a local notion of explainability of RF predictions that generates attributions for any model prediction across observations in the training set, and thereby complements established methods like SHAP, which instead generate attributions for a model prediction across dimensions of the feature space. We demonstrate this approach in the context of a bond pricing model trained on US corporate bond trades, and compare our approach to various existing approaches to model explainability.
    Date: 2023–10
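The identity described in the abstract of item 22 can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy "forest" on a single feature in which each tree is just a set of random thresholds, uses no bootstrap sampling, and all names (`fit_toy_forest`, `neighbor_weights`, the sample data) are hypothetical. It shows that averaging leaf means over trees gives the same number as a weighted average of the training labels, with weights determined by how often a training point shares a leaf with the query.

```python
import bisect
import random

def fit_toy_forest(xs, n_trees, n_splits, seed=0):
    """Toy forest (assumption): each tree is a sorted list of random thresholds.

    Thresholds are midpoints between distinct neighbouring training points,
    so every leaf contains at least one training point.
    """
    rng = random.Random(seed)
    order = sorted(xs)
    forest = []
    for _ in range(n_trees):
        gaps = rng.sample(range(len(order) - 1), n_splits)
        forest.append(sorted((order[g] + order[g + 1]) / 2 for g in gaps))
    return forest

def leaf(tree, x):
    """Leaf index of x: the number of thresholds lying below x."""
    return bisect.bisect_left(tree, x)

def predict(forest, xs, ys, x0):
    """Usual forest prediction: average over trees of the leaf mean."""
    total = 0.0
    for tree in forest:
        members = [y for x, y in zip(xs, ys) if leaf(tree, x) == leaf(tree, x0)]
        total += sum(members) / len(members)
    return total / len(forest)

def neighbor_weights(forest, xs, x0):
    """Weight of each training point in the prediction at x0."""
    weights = [0.0] * len(xs)
    for tree in forest:
        same = [i for i, x in enumerate(xs) if leaf(tree, x) == leaf(tree, x0)]
        for i in same:
            weights[i] += 1.0 / (len(same) * len(forest))
    return weights

# Hypothetical training data with distinct feature values.
xs = [0.5, 1.0, 1.7, 2.2, 3.1, 3.9, 4.4, 5.0]
ys = [1.2, 0.9, 1.8, 2.4, 2.9, 3.6, 4.1, 4.8]
x0 = 2.5

forest = fit_toy_forest(xs, n_trees=25, n_splits=3)
direct = predict(forest, xs, ys, x0)
weights = neighbor_weights(forest, xs, x0)
as_weighted_average = sum(w * y for w, y in zip(weights, ys))
```

In this sketch the weights are non-negative and sum to one, so each weight can be read as the share of the prediction attributed to one training observation. A real random forest adds bootstrap sampling and data-driven splits, which change how the weights are computed but not the weighted-average form.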
  23. By: George Athanasopoulos; Rob J Hyndman; Nikolaos Kourentzes; Anastasios Panagiotelis
    Abstract: Collections of time series that are formed via aggregation are prevalent in many fields. These are commonly referred to as hierarchical time series and may be constructed cross-sectionally across different variables, temporally by aggregating a single series at different frequencies, or may even be generalised beyond aggregation as time series that respect linear constraints. When forecasting such time series, a desirable condition is for forecasts to be coherent, that is to respect the constraints. The past decades have seen substantial growth in this field with the development of reconciliation methods that not only ensure coherent forecasts but can also improve forecast accuracy. This paper serves as both an encyclopaedic review of forecast reconciliation and an entry point for researchers and practitioners dealing with hierarchical time series. The scope of the article includes perspectives on forecast reconciliation from machine learning, Bayesian statistics and probabilistic forecasting as well as applications in economics, energy, tourism, retail demand and demography.
    Keywords: aggregation, coherence, cross-temporal, hierarchical time series, grouped time series, temporal aggregation
    JEL: C10 C14 C53
    Date: 2023
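To make the notion of coherence in item 23 concrete, here is a minimal sketch of one simple reconciliation method, ordinary least squares (OLS) projection, for a two-series hierarchy Total = A + B. The hierarchy, the function name `ols_reconcile`, and the numbers are illustrative assumptions, not taken from the paper, which reviews many other methods.

```python
def ols_reconcile(total_hat, a_hat, b_hat):
    """OLS reconciliation for the hierarchy Total = A + B.

    With summing matrix S = [[1, 1], [1, 0], [0, 1]] and base forecasts
    y_hat = (total_hat, a_hat, b_hat), the reconciled bottom-level
    forecasts are (S'S)^{-1} S' y_hat, and the total is their sum.
    """
    # S' y_hat
    s1 = total_hat + a_hat
    s2 = total_hat + b_hat
    # S'S = [[2, 1], [1, 2]], whose inverse is (1/3) * [[2, -1], [-1, 2]]
    a_tilde = (2 * s1 - s2) / 3
    b_tilde = (-s1 + 2 * s2) / 3
    return a_tilde + b_tilde, a_tilde, b_tilde

# Hypothetical base forecasts that are incoherent: 40 + 50 != 100.
total_r, a_r, b_r = ols_reconcile(100.0, 40.0, 50.0)

# Base forecasts that are already coherent are left unchanged.
total_c, a_c, b_c = ols_reconcile(90.0, 40.0, 50.0)
```

The projection spreads the 10-unit discrepancy across all three series so that the reconciled forecasts add up by construction.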

This nep-big issue is ©2023 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found on the NEP website. For comments, please write to the director of NEP, Marco Novarese. Put “NEP” in the subject line; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.