nep-big New Economics Papers
on Big Data
Issue of 2019‒02‒04
eight papers chosen by
Tom Coupé
University of Canterbury

  1. Dying Light: War and Trade of the Separatist-Controlled Areas of Ukraine By Artem Kochnev
  2. Online Reputation Mechanisms and the Decreasing Value of Chain Affiliation By Hollenbeck, Brett
  3. Predicting innovative firms using web mining and deep learning By Kinne, Jan; Lenz, David
  4. Deep Learning Volatility By Blanka Horvath; Aitor Muguruza; Mehdi Tomas
  5. Orthogonal Statistical Learning By Dylan J. Foster; Vasilis Syrgkanis
  6. Temporal Logistic Neural Bag-of-Features for Financial Time series Forecasting leveraging Limit Order Book Data By Nikolaos Passalis; Anastasios Tefas; Juho Kanniainen; Moncef Gabbouj; Alexandros Iosifidis
  7. lassopack: Model selection and prediction with regularized regression in Stata By Achim Ahrens; Christian B. Hansen; Mark E. Schaffer
  8. Modified Causal Forests for Estimating Heterogeneous Causal Effects By Lechner, Michael

  1. By: Artem Kochnev
    Abstract: The paper investigates how war and war-related government policies affected economic activity in the separatist-controlled areas of Ukraine. The paper applies a quasi-experimental study design to estimate the impact of two events on the separatist-controlled areas: the introduction of separatist control and the introduction of the second round of the trade ban, which was imposed by the government of Ukraine on the separatist-controlled territories in 2017. Using a difference-in-differences estimation procedure that controls for yearly and monthly effects, individual fixed effects, and region-specific time shocks, the study finds that separatist rule decreased economic activity by 38% in the Donetsk region and 51% in the Luhansk region according to the preferred specifications. At the same time, the 2017 trade ban against the major industrial enterprises of the separatist-controlled areas decreased luminosity by 20%. The paper argues that the trade disruptions due to the war actions were nested within the negative effect of separatist rule and accounted for half of it.
    Keywords: costs of war, satellite data, trade, Ukraine crisis, political economy
    JEL: D74 E01 E20 F51
    Date: 2019–01
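The difference-in-differences idea mentioned in the abstract above can be illustrated with a minimal two-group, two-period sketch. This is not the paper's specification, which also includes fixed effects and region-specific time shocks; the function name `did_estimate` and the numbers in `example_rows` are made up purely for illustration.

```python
from statistics import mean

def did_estimate(rows):
    """Basic 2x2 difference-in-differences on (treated, post, outcome) rows.

    Returns (change in treated group) - (change in control group).
    """
    cells = {}
    for treated, post, outcome in rows:
        cells.setdefault((treated, post), []).append(outcome)
    m = {key: mean(values) for key, values in cells.items()}
    change_treated = m[(True, True)] - m[(True, False)]
    change_control = m[(False, True)] - m[(False, False)]
    return change_treated - change_control

# Hypothetical outcomes (for example, log luminosity); not data from the paper.
example_rows = [
    (True, False, 2.0), (True, False, 2.2),    # treated, before
    (True, True, 1.5), (True, True, 1.7),      # treated, after
    (False, False, 1.0), (False, False, 1.2),  # control, before
    (False, True, 0.9), (False, True, 1.1),    # control, after
]

print(did_estimate(example_rows))
```

The control group's change serves as the counterfactual trend, so only the treated group's additional change is attributed to the event.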
  2. By: Hollenbeck, Brett
    Abstract: This paper investigates the value of branding and how it is changing in response to a large increase in consumer information provided by online reputation mechanisms. As an application of umbrella branding, theory suggests much of the value to firms of chain affiliation results from asymmetric information between buyers and sellers. As more information becomes available, consumers should rely less on brand names as quality signals and the ability of firms to extend reputations across heterogeneous outlets should decrease. To examine this empirically, this paper combines a large, 15-year panel of hotel revenues with millions of online reviews from multiple platforms and performs a machine learning analysis of review text to recover latent, time-varying dimensions of firm quality. I find that branded, or chain-affiliated, hotels earn substantially higher revenues than equivalent independent hotels, but that this premium has declined by over 50% from 2000 to 2015. I find that this can be largely attributed to an increase in online reputation mechanisms, and that this effect is largest for low quality and small market firms. Numerous measures of the information content of online reviews show that as information has increased, independent hotel revenue grows substantially more than chain hotel revenue. Finally, the correlation between firm revenue and brand-wide reputation is decreasing and the correlation with individual hotel reputation is replacing it.
    Keywords: Online Reviews, Branding, Text Analysis, Franchising
    JEL: L15 L22
    Date: 2018–10
  3. By: Kinne, Jan; Lenz, David
    Abstract: Innovation is considered a main driver of economic growth. Promoting the development of innovation through STI (science, technology and innovation) policies requires accurate indicators of innovation. Traditional indicators often lack coverage, granularity as well as timeliness and involve high data collection costs, especially when conducted at a large scale. In this paper, we propose a novel approach to creating firm-level innovation indicators at the scale of millions of firms. We use traditional firm-level innovation indicators from the questionnaire-based Community Innovation Survey (CIS) to train an artificial neural network classification model on labelled (innovative/non-innovative) web texts of surveyed firms. Subsequently, we apply this classification model to the web texts of hundreds of thousands of firms in Germany to predict their innovation status. Our results show that this approach produces credible predictions and has the potential to be a valuable and highly cost-efficient addition to the existing set of innovation indicators, especially due to its coverage and regional granularity. The predicted firm-level probabilities can also directly be interpreted as a continuous measure of innovativeness, opening up additional advantages over traditional binary innovation indicators.
    Keywords: Web Mining,Web Scraping,R&D,R&I,STI,Innovation,Indicators,Text Mining,Natural Language Processing,NLP,Deep Learning
    JEL: O30 C81 C83
    Date: 2019
  4. By: Blanka Horvath; Aitor Muguruza; Mehdi Tomas
    Abstract: We present a consistent neural network based calibration method for a number of volatility models -- including the rough volatility family -- that performs the calibration task within a few milliseconds for the full implied volatility surface. The aim of neural networks in this work is an off-line approximation of complex pricing functions, which are difficult to represent or time-consuming to evaluate by other means. We highlight how this perspective opens new horizons for quantitative modelling: The calibration bottleneck posed by a slow pricing of derivative contracts is lifted. This brings several model families (such as rough volatility models) within the scope of applicability in industry practice. As customary for machine learning, the form in which information from available data is extracted and stored is crucial for network performance. With this in mind, we discuss how our approach addresses the usual challenges of machine learning solutions in a financial context (availability of training data, interpretability of results for regulators, control over generalisation errors). We present specific architectures for price approximation and calibration and optimize these with respect to different objectives regarding accuracy, speed and robustness. We also find that including the intermediate step of learning pricing functions of (classical or rough) models before calibration significantly improves network performance compared to direct calibration to data.
    Date: 2019–01
  5. By: Dylan J. Foster; Vasilis Syrgkanis
    Abstract: We provide excess risk guarantees for statistical learning in the presence of an unknown nuisance component. We analyze a two-stage sample splitting meta-algorithm that takes as input two arbitrary estimation algorithms: one for the target model and one for the nuisance model. We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the first stage error on the excess risk bound achieved by the meta-algorithm is of second order. Our general theorem is agnostic to the particular algorithms used for the target and nuisance and only makes an assumption on their individual performance. This enables the use of a plethora of existing results from statistical learning and machine learning literature to give new guarantees for learning with a nuisance component. Moreover, by focusing on excess risk rather than parameter estimation, we can give guarantees under weaker assumptions than in previous works and accommodate the case where the target parameter belongs to a complex nonparametric class. When the nuisance and target parameters belong to arbitrary classes, we characterize conditions on the metric entropy such that oracle rates---rates of the same order as if we knew the nuisance model---are achieved. We also analyze the rates achieved by specific estimation algorithms such as variance-penalized empirical risk minimization, neural network estimation and sparse high-dimensional linear model estimation. We highlight the applicability of our results via four applications of primary importance: 1) heterogeneous treatment effect estimation, 2) offline policy optimization, 3) domain adaptation, and 4) learning with missing data.
    Date: 2019–01
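A minimal sketch of the two-stage sample-splitting idea described in the abstract above, assuming a partially linear model y = theta*t + g(x) + noise, for which the residual-on-residual squared loss is a standard example of a Neyman-orthogonal loss. This is an illustration only, not the authors' meta-algorithm or code: the nuisance learners are plain least-squares lines, and the names `ols_line`, `orthogonal_two_stage` and `simulate` are hypothetical.

```python
import random

def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def orthogonal_two_stage(x, t, y):
    """Estimate theta in y = theta*t + g(x) + noise with sample splitting."""
    half = len(x) // 2
    # Stage 1: fit the nuisance models E[t|x] and E[y|x] on the first half.
    a_t, b_t = ols_line(x[:half], t[:half])
    a_y, b_y = ols_line(x[:half], y[:half])
    # Stage 2: on the second half, plug in the nuisance estimates and
    # minimise the residual-on-residual squared loss over theta.
    t_res = [ti - (a_t + b_t * xi) for xi, ti in zip(x[half:], t[half:])]
    y_res = [yi - (a_y + b_y * xi) for xi, yi in zip(x[half:], y[half:])]
    return sum(tr * yr for tr, yr in zip(t_res, y_res)) / sum(tr * tr for tr in t_res)

def simulate(n, theta, seed):
    """Simulated data (an assumption for the demo, not from the paper)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    t = [0.5 * xi + rng.gauss(0, 1) for xi in x]
    y = [theta * ti + 1.5 * xi + rng.gauss(0, 1) for xi, ti in zip(x, t)]
    return x, t, y

x, t, y = simulate(n=4000, theta=2.0, seed=0)
print(orthogonal_two_stage(x, t, y))
```

Because the second-stage loss is orthogonal to the nuisance functions, small errors in the first-stage fits affect the estimate of theta only at second order, which is the property the abstract highlights.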
  6. By: Nikolaos Passalis; Anastasios Tefas; Juho Kanniainen; Moncef Gabbouj; Alexandros Iosifidis
    Abstract: Time series forecasting is a crucial component of many important applications, ranging from forecasting the stock markets to energy load prediction. The high dimensionality, velocity and variety of the data collected in these applications pose significant and unique challenges that must be carefully addressed for each of them. In this work, a novel Temporal Logistic Neural Bag-of-Features approach that can be used to tackle these challenges is proposed. The proposed method can be effectively combined with deep neural networks, leading to powerful deep learning models for time series analysis. However, combining existing BoF formulations with deep feature extractors poses significant challenges: the distribution of the input features is not stationary, tuning the hyper-parameters of the model can be especially difficult and the normalizations involved in the BoF model can cause significant instabilities during the training process. The proposed method is capable of overcoming these limitations by employing a novel adaptive scaling mechanism and replacing the classical Gaussian-based density estimation involved in the regular BoF model with a logistic kernel. The effectiveness of the proposed approach is demonstrated using extensive experiments on a large-scale financial time series dataset that consists of more than 4 million limit orders.
    Date: 2019–01
  7. By: Achim Ahrens; Christian B. Hansen; Mark E. Schaffer
    Abstract: This article introduces lassopack, a suite of programs for regularized regression in Stata. lassopack implements lasso, square-root lasso, elastic net, ridge regression, adaptive lasso and post-estimation OLS. The methods are suitable for the high-dimensional setting where the number of predictors $p$ may be large and possibly greater than the number of observations, $n$. We offer three different approaches for selecting the penalization (`tuning') parameters: information criteria (implemented in lasso2), $K$-fold cross-validation and $h$-step ahead rolling cross-validation for cross-section, panel and time-series data (cvlasso), and theory-driven (`rigorous') penalization for the lasso and square-root lasso for cross-section and panel data (rlasso). We discuss the theoretical framework and practical considerations for each approach. We also present Monte Carlo results to compare the performance of the penalization approaches.
    Date: 2019–01
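To illustrate what choosing a lasso penalty by K-fold cross-validation involves, here is a small pure-Python sketch. It is not lassopack (a Stata package) and does not reproduce its commands or its scaling of the penalty; the objective (1/(2n))*RSS + lam*sum|b_j|, the coordinate-descent solver, the interleaved fold assignment and all function names are assumptions made for the example. Data are assumed to be centred, so no intercept is fitted.

```python
import random

def soft_threshold(z, g):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent on (1/(2n))*RSS + lam*sum|b_j|."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_iter):
        for j in range(p):
            col = cols[j]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / z
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
    return beta

def cv_lasso(X, y, lambdas, k=5):
    """Return (best_lambda, mean squared prediction error) over k folds."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved folds
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        err = 0.0
        for hold in folds:
            hold_set = set(hold)
            X_tr = [X[i] for i in range(n) if i not in hold_set]
            y_tr = [y[i] for i in range(n) if i not in hold_set]
            beta = lasso_cd(X_tr, y_tr, lam)
            err += sum((y[i] - sum(b * v for b, v in zip(beta, X[i]))) ** 2 for i in hold)
        err /= n
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

def simulate(n=100, p=5, seed=0):
    """Simulated data: only the first predictor matters (demo assumption)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] + rng.gauss(0, 1) for row in X]
    return X, y

X, y = simulate()
print(cv_lasso(X, y, [0.01, 0.1, 1.0, 10.0]))
```

Interleaved folds suit cross-section data only; for time series the abstract mentions rolling h-step-ahead cross-validation, which this sketch does not implement.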
  8. By: Lechner, Michael (University of St. Gallen)
    Abstract: Uncovering the heterogeneity of causal effects of policies and business decisions at various levels of granularity provides substantial value to decision makers. This paper develops new estimation and inference procedures for multiple treatment models in a selection-on-observables framework by modifying the Causal Forest approach suggested by Wager and Athey (2018). The new estimators have desirable theoretical and computational properties for various aggregation levels of the causal effects. An Empirical Monte Carlo study shows that they may outperform previously suggested estimators. Inference tends to be accurate for effects relating to larger groups and conservative for effects relating to fine levels of granularity. An application to the evaluation of an active labour market programme shows the value of the new methods for applied research.
    Keywords: causal machine learning, statistical learning, average treatment effects, conditional average treatment effects, multiple treatments, selection-on-observable, causal forests
    JEL: C21 J68
    Date: 2018–12

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
For comments, please write to the director of NEP, Marco Novarese. Put “NEP” in the subject; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.