nep-big New Economics Papers
on Big Data
Issue of 2019‒09‒30
nineteen papers chosen by
Tom Coupé
University of Canterbury

  1. Data Intermediaries and Selling Mechanisms for Customized Consumer Information By David Bounie; Antoine Dubus; Patrick Waelbroeck
  2. Data Science in Strategy: Machine learning and text analysis in the study of firm growth By Daan Kolkman; Arjen van Witteloostuijn
  3. Testing the Employment Impact of Automation, Robots and AI: A Survey and Some Methodological Issues By Barbieri, Laura; Mussida, Chiara; Piva, Mariacristina; Vivarelli, Marco
  4. Testing the employment and skill impact of new technologies: A survey and some methodological issues By Barbieri, Laura; Mussida, Chiara; Piva, Mariacristina; Vivarelli, Marco
  5. Machine Learning for Solar Accessibility: Implications for Low-Income Solar Expansion and Profitability By Sruthi Davuluri; René García Francheschini; Christopher R. Knittel; Chikara Onda; Kelly Roache
  6. Improving forecasts of the level and structure of long-run discount rates in the leasehold property market By Thomas Weston; Stanimira Milcheva
  7. To Detect Irregular Trade Behaviors In Stock Market By Using Graph Based Ranking Methods By Loc Tran; Linh Tran
  8. Deep Neural Networks for Choice Analysis: Architectural Design with Alternative-Specific Utility Functions By Shenhao Wang; Jinhua Zhao
  9. Financial Frictions and the Wealth Distribution By Jesus Fernandez-Villaverde; Samuel Hurtado; Galo Nuno
  10. From Twitter to GDP: Estimating Economic Activity From Social Media By Indaco, Agustín
  11. Automatic extraction of condition-specific visual characteristics from buildings By Miroslav Despotovic; David Koch; Sascha Leiber
  12. Gradient Boost with Convolution Neural Network for Stock Forecast By Jialin Liu; Chih-Min Lin; Fei Chao
  13. Specialization, Market Access and Medium-Term Growth By Dominick Bartelme; Andrei Levchenko; Ting Lan
  14. Automation probability within the German real estate industry due to digitalization: A calculation of the size of the job killer aspect of digitalization gilded with an optimistic outlook due to the job engine aspect By Daniel Piazolo
  15. A Test of DMSP and VIIRS Night Lights Data for Estimating GDP and Spatial Inequality for Rural and Urban Areas By John Gibson; Susan Olivia; Geua Boe-Gibson
  16. Digital innovation and Real estate appraisal By Agostino Valier; Ezio Micelli
  17. Reinforcement Learning for Portfolio Management By Angelos Filos
  18. Inference after lasso model selection By David Drukker
  19. From Transactions Data to Economic Statistics: Constructing Real-time, High-frequency, Geographic Measures of Consumer Spending By Aditya Aladangady; Shifrah Aron-Dine; Wendy Dunn; Laura Feiveson; Paul Lengermann; Claudia Sahm

  1. By: David Bounie (Télécom ParisTech); Antoine Dubus (Télécom ParisTech); Patrick Waelbroeck (Télécoms Paris Tech - Télécom ParisTech)
    Abstract: We investigate the strategies of a data intermediary selling customized consumer information to firms for price discrimination purposes. We analyze how the mechanism through which the data intermediary sells information influences how much consumer data it will collect and sell to firms, and how this impacts consumer surplus. We consider three selling mechanisms tailored to customized consumer information: take-it-or-leave-it offers, sequential bargaining, and simultaneous offers. We show that the more data the intermediary collects, the lower the consumer surplus. Consumer data collection is minimized, and consumer surplus maximized, under the take-it-or-leave-it mechanism, which is the least profitable mechanism for the intermediary. We argue that selling mechanisms can be used as a regulatory tool by data protection agencies and competition authorities to limit consumer information collection and increase consumer surplus.
    Date: 2019–09–15
    URL: http://d.repec.org/n?u=RePEc:hal:wpaper:hal-02288708&r=all
  2. By: Daan Kolkman (Technical University Eindhoven); Arjen van Witteloostuijn (Vrije Universiteit Amsterdam)
    Abstract: This study examines the applicability of modern Data Science techniques in the domain of Strategy. We apply novel techniques from the field of machine learning and text analysis. We proceed in two steps. First, we compare different machine learning techniques to traditional regression methods in terms of their goodness-of-fit, using a dataset of 168,055 firms that includes only basic demographic and financial information. The novel methods fare three to four times better, with the random forest technique achieving the best goodness-of-fit. Second, based on 8,163 informative websites of Dutch SMEs, we construct four additional proxies for personality and strategy variables. Including our four text-analyzed variables adds about 2.5 per cent to the R2. Together, this pair of contributions provides evidence of the large potential of applying modern Data Science techniques in Strategy research. We reflect on the potential contribution of modern Data Science techniques from the perspective of the common critique that machine learning offers increased predictive accuracy at the expense of explanatory insight. In particular, we argue and illustrate why and how machine learning can be a productive element in the abductive theory-building cycle.
    JEL: L1
    Date: 2019–09–20
    URL: http://d.repec.org/n?u=RePEc:tin:wpaper:20190066&r=all
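    A minimal sketch of the kind of goodness-of-fit comparison this abstract describes, assuming scikit-learn and a generic firm-level table (the file name and all column names are hypothetical):

      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import r2_score
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("firms.csv")                     # hypothetical dataset
      X = df[["firm_age", "employees", "assets"]]       # hypothetical predictors
      y = df["revenue_growth"]                          # hypothetical outcome
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

      for model in (LinearRegression(),
                    RandomForestRegressor(n_estimators=500, random_state=0)):
          model.fit(X_tr, y_tr)
          print(type(model).__name__, r2_score(y_te, model.predict(X_te)))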
  3. By: Barbieri, Laura (Università Cattolica di Piacenza); Mussida, Chiara (Università Cattolica del Sacro Cuore); Piva, Mariacristina (Università Cattolica del Sacro Cuore); Vivarelli, Marco (Università Cattolica del Sacro Cuore)
    Abstract: The present technological revolution, characterized by the pervasive and growing presence of robots, automation, Artificial Intelligence and machine learning, is going to transform societies and economic systems. However, this is not the first technological revolution humankind has faced, but it is probably the very first one with such an accelerated diffusion pace involving all industrial sectors. Studying its mechanisms and consequences (will the world turn into a jobless society or not?), mainly with regard to labor market dynamics, is a crucial matter. This paper aims to provide an updated picture of the main empirical evidence on the relationship between new technologies and employment, in terms of the overall consequences for the number of employees, the tasks required, and wage/inequality effects.
    Keywords: technology, innovation, employment, skill, task, routine
    JEL: O33
    Date: 2019–09
    URL: http://d.repec.org/n?u=RePEc:iza:izadps:dp12612&r=all
  4. By: Barbieri, Laura; Mussida, Chiara; Piva, Mariacristina; Vivarelli, Marco
    Abstract: The present technological revolution, characterized by the pervasive and growing presence of robots, automation, Artificial Intelligence and machine learning, is going to transform societies and economic systems. However, this is not the first technological revolution humankind has faced, but it is probably the very first one with such an accelerated diffusion pace involving all industrial sectors. Studying its mechanisms and consequences (will the world turn into a jobless society or not?), mainly with regard to labor market dynamics, is a crucial matter. This paper aims to provide an updated picture of the main empirical evidence on the relationship between new technologies and employment, in terms of the overall consequences for the number of employees, the tasks required, and wage/inequality effects.
    Keywords: technology, innovation, employment, skill, task, routine
    JEL: O33
    Date: 2019
    URL: http://d.repec.org/n?u=RePEc:zbw:glodps:397&r=all
  5. By: Sruthi Davuluri; René García Francheschini; Christopher R. Knittel; Chikara Onda; Kelly Roache
    Abstract: The solar industry in the US typically uses a credit score such as the FICO score as an indicator of consumer utility payment performance and creditworthiness to approve customers for new solar installations. Using data on the utility payment performance of over 800,000 consumers and over 5,000 demographic variables, we compare machine learning and econometric models that predict the probability of default against credit-score cutoffs. We compare these models across a variety of measures, including how they affect consumers of different socio-economic backgrounds and how they affect profitability. We find that a traditional regression analysis using a small number of variables specific to utility repayment performance greatly increases accuracy and low-to-moderate income (LMI) inclusivity relative to the FICO score, and that using machine learning techniques further enhances model performance. Relative to FICO, the machine learning model increases the number of low-to-moderate income consumers approved for community solar by 1.1% to 4.2%, depending on the stringency used for evaluating potential customers, while decreasing the default rate by 1.4 to 1.9 percentage points. Using electricity utility repayment as a proxy for solar installation repayment, shifting from a FICO score cutoff to the machine learning model increases profits by 34% to 1882%, depending on the stringency used for evaluating potential customers.
    JEL: C53 L11 L94 Q2
    Date: 2019–09
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:26178&r=all
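    A hedged sketch of the model-versus-cutoff comparison this abstract describes, assuming scikit-learn and a hypothetical payment-history table (the file and column names are invented for illustration):

      import numpy as np
      import pandas as pd
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("utility_payments.csv")          # hypothetical dataset
      features = ["late_payments_12m", "avg_bill", "tenure_months"]  # hypothetical
      X_tr, X_te, d_tr, d_te = train_test_split(df[features], df["default"],
                                                random_state=0)

      clf = GradientBoostingClassifier().fit(X_tr, d_tr)
      p_default = clf.predict_proba(X_te)[:, 1]

      approve = 0.80                                    # the "stringency" knob
      ml_ok = p_default <= np.quantile(p_default, approve)
      fico = df.loc[X_te.index, "fico"]
      fico_ok = fico >= np.quantile(fico, 1 - approve)
      print("default rate | ML:  ", d_te[ml_ok].mean())
      print("default rate | FICO:", d_te[fico_ok].mean())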
  6. By: Thomas Weston; Stanimira Milcheva
    Abstract: Transaction-level data will be utilised to explore the level and structure of long-run discount rates applicable to the residential leasehold property market. While techniques such as panel and hedonic regression have traditionally been applied to housing data, issues such as nonlinearity, multicollinearity and heteroscedasticity present challenges to the ability of traditional regression-based methodologies to make accurate long-term forecasts. This work will therefore compare these traditional regression techniques with two machine learning techniques – a long short-term memory (LSTM) model and a gradient episodic memory (GEM) model – that are anticipated to overcome these issues endemic to housing data and to provide more accurate and precise forecasts. LSTM models overcome some of the problems with regression models, namely nonlinearity and the level of memory over time. Where regression models lack a categorical memory component, LSTM models can learn features from the data, as opposed to directly applying a pre-conceived prior structure. As a result, LSTM models are better able to deal with inter-temporal, yet rarely occurring, events. GEM models build on the strengths of LSTM and allow task-based learning, which enables more precise modelling of the behaviour and recurrence of rarely occurring events within leasehold transaction data.
    Keywords: Discount-rate; Forecasting; housing; Leasehold; Machine Learning
    JEL: R3
    Date: 2019–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2019_71&r=all
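    A minimal sketch of the LSTM component described above, assuming Keras and a univariate series of observed discount rates (the file name is hypothetical; GEM has no comparably standard off-the-shelf implementation and is omitted):

      import numpy as np
      from tensorflow.keras import Sequential
      from tensorflow.keras.layers import LSTM, Dense

      rates = np.loadtxt("discount_rates.csv")          # hypothetical series
      lookback = 12
      X = np.stack([rates[i:i + lookback] for i in range(len(rates) - lookback)])
      y = rates[lookback:]
      X = X[..., None]                                  # (samples, lookback, 1)

      model = Sequential([LSTM(32, input_shape=(lookback, 1)), Dense(1)])
      model.compile(optimizer="adam", loss="mse")
      model.fit(X, y, epochs=50, verbose=0)
      next_rate = model.predict(rates[-lookback:].reshape(1, lookback, 1))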
  7. By: Loc Tran; Linh Tran
    Abstract: Detecting irregular trade behaviors in the stock market is an important problem in the machine learning field; such irregular trade behaviors are, obviously, illegal. To detect them, data scientists normally employ supervised learning techniques. In this paper, we instead employ three graph-Laplacian-based semi-supervised ranking methods to solve the irregular trade behavior detection problem. Experimental results show that the un-normalized and symmetric normalized graph-Laplacian-based ranking methods outperform the random-walk-Laplacian-based method.
    Date: 2019–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1909.08964&r=all
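    A sketch of the two better-performing variants, assuming NumPy, a symmetric affinity matrix W over traders or trades, and a label vector y marking known irregular cases (the regularization weight alpha and all names are illustrative):

      import numpy as np

      def rank_scores(W, y, alpha=0.9, variant="sym"):
          """Graph-Laplacian semi-supervised ranking: higher scores flag
          nodes more strongly connected to known-irregular nodes."""
          d = W.sum(axis=1)
          n = len(y)
          if variant == "sym":                  # symmetric normalized variant
              S = W / np.sqrt(np.outer(d, d))   # D^{-1/2} W D^{-1/2}
              return np.linalg.solve(np.eye(n) - alpha * S, y)
          L = np.diag(d) - W                    # un-normalized Laplacian
          return np.linalg.solve(np.eye(n) + alpha * L, y)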
  8. By: Shenhao Wang; Jinhua Zhao
    Abstract: Whereas deep neural network (DNN) is increasingly applied to choice analysis, it is challenging to reconcile domain-specific behavioral knowledge with generic-purpose DNN, to improve DNN's interpretability and predictive power, and to identify effective regularization methods for specific tasks. This study designs a particular DNN architecture with alternative-specific utility functions (ASU-DNN) by using prior behavioral knowledge. Unlike a fully connected DNN (F-DNN), which computes the utility value of an alternative k by using the attributes of all the alternatives, ASU-DNN computes it by using only k's own attributes. Theoretically, ASU-DNN can dramatically reduce the estimation error of F-DNN because of its lighter architecture and sparser connectivity. Empirically, ASU-DNN has 2-3% higher prediction accuracy than F-DNN over the whole hyperparameter space in a private dataset that we collected in Singapore and a public dataset from the R mlogit package. The alternative-specific connectivity constraint, as a domain-knowledge-based regularization method, is more effective than the most popular generic-purpose explicit and implicit regularization methods and architectural hyperparameters. ASU-DNN is also more interpretable because it provides a more regular substitution pattern of travel mode choices than F-DNN does. The comparison between ASU-DNN and F-DNN can also aid in testing the behavioral knowledge. Our results reveal that individuals are more likely to compute utility by using an alternative's own attributes, supporting the long-standing practice in choice modeling. Overall, this study demonstrates that prior behavioral knowledge could be used to guide the architecture design of DNN, to function as an effective domain-knowledge-based regularization method, and to improve both the interpretability and predictive power of DNN in choice analysis.
    Date: 2019–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1909.07481&r=all
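    A sketch of the alternative-specific idea, assuming Keras: each alternative's utility is computed only from that alternative's own attributes, and a softmax across the per-alternative utilities yields choice probabilities (the problem sizes and layer widths are invented):

      from tensorflow.keras import Model, layers

      n_alts, n_attrs = 4, 5                    # hypothetical problem size
      inputs, utilities = [], []
      for k in range(n_alts):
          x_k = layers.Input(shape=(n_attrs,), name=f"alt{k}_attrs")
          h_k = layers.Dense(16, activation="relu")(x_k)  # k's own attributes only
          utilities.append(layers.Dense(1)(h_k))          # scalar utility of k
          inputs.append(x_k)
      probs = layers.Softmax()(layers.Concatenate()(utilities))
      asu_dnn = Model(inputs, probs)
      asu_dnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")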
  9. By: Jesus Fernandez-Villaverde (University of Pennsylvania, NBER, and CEPR); Samuel Hurtado (Banco de Espana); Galo Nuno (Banco de Espana)
    Abstract: This paper investigates how, in a heterogeneous agents model with financial frictions, idiosyncratic individual shocks interact with exogenous aggregate shocks to generate time-varying levels of leverage and endogenous aggregate risk. To do so, we show how such a model can be efficiently computed, despite its substantial nonlinearities, using tools from machine learning. We also illustrate how the model can be structurally estimated with a likelihood function, using tools from inference with diffusions. We document, first, the strong nonlinearities created by financial frictions. Second, we report the existence of multiple stochastic steady states with properties that differ from the deterministic steady state along important dimensions. Third, we illustrate how the generalized impulse response functions of the model are highly state-dependent. In particular, we find that the recovery after a negative aggregate shock is more sluggish when the economy is more leveraged. Fourth, we prove that wealth heterogeneity matters in this economy because of the asymmetric responses of household consumption decisions to aggregate shocks.
    Keywords: Heterogeneous agents; aggregate shocks; continuous-time; machine learning; neural networks; structural estimation; likelihood functions
    JEL: C45 C63 E32 E44 G01 G11
    Date: 2019–09–13
    URL: http://d.repec.org/n?u=RePEc:pen:papers:19-015&r=all
  10. By: Indaco, Agustín
    Abstract: Using all geo-located image tweets shared on Twitter in 2012-2013, I find that the volume of tweets is a valid proxy for estimating current GDP in USD at the country level. Residuals from my preferred model are negatively correlated with a data quality index, indicating that my estimates of GDP are more accurate for countries with more reliable GDP data. Comparing Twitter with the more commonly used proxy of night-light data, I find that variation in Twitter activity explains slightly more of the cross-country variance in GDP. I also exploit the continuous time and geographic granularity of social media posts to create monthly and weekly estimates of GDP for the US, as well as subnational estimates, including for economic areas that span national borders. My findings suggest that Twitter can be used to measure economic activity in a more timely and more spatially disaggregated way than conventional data, and that governments’ statistical agencies could incorporate social media data to complement and further reduce measurement error in their official GDP estimates.
    Keywords: National Accounts, Big Data
    JEL: C53 E01 Q11
    Date: 2019–03–19
    URL: http://d.repec.org/n?u=RePEc:pra:mprapa:95885&r=all
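    A sketch of the cross-country proxy regression this abstract implies, assuming statsmodels and a hypothetical country-level table of tweet counts and GDP:

      import numpy as np
      import pandas as pd
      import statsmodels.api as sm

      df = pd.read_csv("country_tweets_gdp.csv")   # hypothetical: tweets, gdp_usd
      X = sm.add_constant(np.log(df["tweets"]))
      fit = sm.OLS(np.log(df["gdp_usd"]), X).fit()
      print(fit.rsquared)                          # cross-country variance explained
      df["gdp_hat"] = np.exp(fit.fittedvalues)     # tweet-based GDP estimate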
  11. By: Miroslav Despotovic; David Koch; Sascha Leiber
    Abstract: The value of a property is influenced by a number of factors such as location, year of construction, area used, etc. In particular, the classification of the condition of a building plays an important role in this context, since each real estate actor (expert, broker, etc.) perceives the condition individually. This paper investigates the automatic extraction of condition-specific visual characteristics from buildings using indoor and outdoor images, as well as the automatic classification of condition classes. This is a complex task because an object of interest can appear at different positions within the image. In addition, an object of interest and/or the building can be captured from different distances and perspectives and under different weather and lighting conditions. Furthermore, the classification method applied with the convolutional neural network, as described in this paper, requires a large amount of input data. The forecast results of the neural network are promising and show accuracy rates between 67% and 81% across various set-up configurations. The described method has high development potential in both the scientific and the practical sense. The results are technically innovative and should, beyond their contribution to research, make a practical contribution to future automation-supported real estate valuation procedures. The primary aim of this work is to stimulate the development of new scientifically relevant methods and questions in this direction.
    Keywords: Hedonic Pricing; image analyses; Neural Networks
    JEL: R3
    Date: 2019–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2019_284&r=all
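    A minimal sketch of a convolutional classifier for building-condition classes, assuming Keras (the image size, number of classes, and layer sizes are invented; the paper's actual architecture is not specified here):

      from tensorflow.keras import Sequential
      from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

      n_classes = 4                               # hypothetical condition classes
      model = Sequential([
          Conv2D(32, 3, activation="relu", input_shape=(128, 128, 3)),
          MaxPooling2D(),
          Conv2D(64, 3, activation="relu"),
          MaxPooling2D(),
          Flatten(),
          Dense(64, activation="relu"),
          Dense(n_classes, activation="softmax"),
      ])
      model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])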
  12. By: Jialin Liu; Chih-Min Lin; Fei Chao
    Abstract: The market economy touches all walks of life, and stock forecasting is one of the central tasks in studying it. However, market information contains a great deal of noise and uncertainty, which makes economic forecasting challenging. Ensemble learning and deep learning are the most common methods for the stock forecasting task. In this paper, we present a model that combines the advantages of the two methods to forecast changes in stock prices. The proposed method combines a CNN with GBoost. Experimental results on six market indexes show that the proposed method performs better than current popular methods.
    Date: 2019–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1909.09563&r=all
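    A sketch of one way to combine the two ingredients named in the title, assuming Keras and scikit-learn: convolutional features extracted from windows of price history feed a gradient-boosting classifier of the next move (the data files are hypothetical, and the untrained extractor is a stand-in for whatever training scheme the paper actually uses):

      import numpy as np
      from sklearn.ensemble import GradientBoostingClassifier
      from tensorflow.keras import Sequential
      from tensorflow.keras.layers import Conv1D, GlobalAveragePooling1D

      X = np.load("price_windows.npy")          # hypothetical (samples, window, features)
      y = np.load("up_or_down.npy")             # hypothetical 0/1 next-day direction

      extractor = Sequential([
          Conv1D(16, 3, activation="relu", input_shape=X.shape[1:]),
          GlobalAveragePooling1D(),
      ])
      features = extractor.predict(X)           # convolutional summary features
      gb = GradientBoostingClassifier().fit(features, y)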
  13. By: Dominick Bartelme (University of Michigan); Andrei Levchenko (University of Michigan); Ting Lan (University of Michigan)
    Abstract: This paper estimates the impact of foreign sectoral demand and supply shocks on medium-term economic growth. Our empirical strategy is based on a first order approximation to a wide class of small open economy models that feature sector-level gravity in trade flows. The framework allows us to measure foreign shocks and characterize their impact on growth in terms of reduced-form elasticities. We use machine learning techniques to group 4-digit manufacturing sectors into a smaller number of clusters, and show that the cluster-level growth elasticities can be estimated using high-dimensional statistical techniques. We find clear evidence of heterogeneity in the growth elasticities of different foreign shocks. Foreign demand shocks in complex intermediate and capital goods have large growth impacts, and both supply and demand shocks in capital goods have particularly large impacts on growth for poor countries. Counterfactual exercises show that both comparative advantage and geography play a quantitatively large role in how foreign shocks affect economic growth.
    Date: 2019
    URL: http://d.repec.org/n?u=RePEc:red:sed019:999&r=all
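    The abstract does not name its clustering method; as a purely illustrative stand-in, k-means on standardized sector characteristics (scikit-learn; the file and column names are hypothetical):

      import pandas as pd
      from sklearn.cluster import KMeans
      from sklearn.preprocessing import StandardScaler

      sectors = pd.read_csv("sector_traits.csv", index_col="sic4")  # hypothetical
      Z = StandardScaler().fit_transform(sectors)
      sectors["cluster"] = KMeans(n_clusters=10, random_state=0).fit_predict(Z)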
  14. By: Daniel Piazolo
    Abstract: By combining various sources of employment data, insights about the probability of automation of real estate jobs within Germany can be derived. The sources are: 1.) the BerufeNET database of the German federal employment agency (Bundesagentur für Arbeit), 2.) the JobFuturomat database with estimated automation probabilities across occupations, 3.) lists of the number of employees subject to social insurance at the occupation-group level. From these, a weighted average of the automation probability of jobs within each occupational group, and within the overall real estate sector of Germany, can be derived. Thus the negative side of digitalization is quantified (i.e. the job killer aspect). Since Germany is the largest economy within the European Union, some of the insights can be transferred to the European level. The paper also discusses how the novel possibilities offered by digital tools like artificial intelligence will create new employment opportunities within the various real estate areas. It addresses the challenges of localizing and quantifying this specific positive effect of digitalization (i.e. the job engine).
    Keywords: Automation; Digital Transformation; Disruption; Employment; Structural Change
    JEL: R3
    Date: 2019–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2019_181&r=all
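    The core calculation is an employment-weighted average; a sketch assuming pandas and a hypothetical merged table of occupation-level automation probabilities and employee counts:

      import pandas as pd

      # hypothetical columns: occupation, employees, p_automation
      jobs = pd.read_csv("real_estate_occupations.csv")
      sector_risk = ((jobs["p_automation"] * jobs["employees"]).sum()
                     / jobs["employees"].sum())
      print(f"employment-weighted automation probability: {sector_risk:.1%}")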
  15. By: John Gibson (University of Waikato); Susan Olivia (University of Waikato); Geua Boe-Gibson (University of Waikato)
    Abstract: Night lights, as detected by satellites, are increasingly used by economists, especially to proxy for economic activity in poor countries. Widely used data from the Defense Meteorological Satellite Program (DMSP) have several flaws; blurring, top-coding, lack of calibration, and variation in sensor amplification that impairs comparability over time and space. These flaws are not present in newer data from the Visible Infrared Imaging Radiometer Suite (VIIRS) that is widely used in other disciplines. Economists have been slow to switch to these better VIIRS data, perhaps because flaws in DMSP are rarely emphasized. We show the relationship between night lights and Indonesian GDP at the second sub-national level for 497 spatial units. The DMSP data are not a suitable proxy for GDP outside of cities. Within the urban sector, the lights-GDP relationship is twice as noisy using DMSP as using VIIRS. Spatial inequality is considerably understated by the DMSP data. A Pareto adjustment to correct for top-coding in DMSP data has a modest effect but still understates spatial inequality and misses much of the intra-city heterogeneity in the brightness of lights for Jakarta.
    Keywords: density; DMSP; inequality; night lights; VIIRS; Indonesia
    JEL: O15 R12
    Date: 2019–09–23
    URL: http://d.repec.org/n?u=RePEc:wai:econwp:19/11&r=all
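    One common form of the Pareto top-coding adjustment mentioned in the abstract, as a sketch (NumPy; the tail window and the use of a Hill estimator are assumptions, not necessarily the authors' exact procedure):

      import numpy as np

      def pareto_adjust(dn, top=63, lower=55):
          """DMSP digital numbers are top-coded at 63; fit a Pareto tail on
          [lower, top) and replace top-coded pixels with the tail's mean."""
          tail = dn[(dn >= lower) & (dn < top)]
          alpha = len(tail) / np.log(tail / lower).sum()  # Hill estimator
          out = dn.astype(float).copy()
          out[out >= top] = top * alpha / (alpha - 1)     # Pareto mean above top
          return out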
  16. By: Agostino Valier; Ezio Micelli
    Abstract: This research reviews the existing literature on the use of digital innovation in real estate valuation, focusing on three aspects. First, it analyses the factors that make the use of digital innovation increasingly relevant in the real estate sector and, more specifically, in the evaluation phase of assets. The need for innovation in the real estate market is highlighted, as is the demand from investors for fast, reliable and objective appraisals. The second part reports the literature on digital innovations applied to valuation models, distinguishing between forecasting models for future market trends and asset-specific automated valuation models. This section focuses on the impact that new models have on the approaches currently used for value assessment. Third, the use of digital-based valuation models is investigated by analysing the context conditions. The review analyses the literature that correlates the reliability of the new models with the conditions of the real estate market in which they are used, especially in terms of information efficiency. Finally, the conclusions summarise the limits and potential of digital innovation in the field of valuation, and future directions are identified.
    Keywords: Automated Valuation Models; Big data; Digital innovation; Forecasting analysis; proptech
    JEL: R3
    Date: 2019–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2019_320&r=all
  17. By: Angelos Filos
    Abstract: In this thesis, we develop a comprehensive account of the expressive power, modelling efficiency, and performance advantages of so-called trading agents (i.e., Deep Soft Recurrent Q-Network (DSRQN) and Mixture of Score Machines (MSM)), based on both traditional system identification (model-based approach) as well as on context-independent agents (model-free approach). The analysis provides conclusive support for the ability of model-free reinforcement learning methods to act as universal trading agents, which are not only capable of reducing the computational and memory complexity (owing to their linear scaling with the size of the universe), but also serve as generalizing strategies across assets and markets, regardless of the trading universe on which they have been trained. The relatively low volume of daily returns in financial market data is addressed via data augmentation (a generative approach) and a choice of pre-training strategies, both of which are validated against current state-of-the-art models. For rigour, a risk-sensitive framework which includes transaction costs is considered, and its performance advantages are demonstrated in a variety of scenarios, from synthetic time-series (sinusoidal, sawtooth and chirp waves), simulated market series (surrogate data based), through to real market data (S&P 500 and EURO STOXX 50). The analysis and simulations confirm the superiority of universal model-free reinforcement learning agents over current portfolio management models in asset allocation strategies, with an achieved performance advantage of as much as 9.2% in annualized cumulative returns and 13.4% in annualized Sharpe Ratio.
    Date: 2019–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1909.09571&r=all
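    A toy sketch of reinforcement-learning asset allocation, far simpler than the DSRQN/MSM agents above: REINFORCE on a one-step allocation bandit over synthetic returns (NumPy; every number here is invented):

      import numpy as np

      rng = np.random.default_rng(0)
      # synthetic daily returns for three assets with different means
      returns = rng.normal([0.0002, 0.0005, 0.0001], 0.01, size=(5000, 3))

      theta = np.zeros(3)                       # policy logits, one per asset
      for r in returns:
          p = np.exp(theta) / np.exp(theta).sum()
          a = rng.choice(3, p=p)                # sample an asset to hold today
          grad = -p
          grad[a] += 1.0                        # d log pi(a) / d theta
          theta += 0.05 * r[a] * grad           # reward = realized return
      print("learned allocation:", np.exp(theta) / np.exp(theta).sum())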
  18. By: David Drukker (StataCorp LP)
    Abstract: The increasing availability of high-dimensional data and increasing interest in more realistic functional forms have sparked a renewed interest in automated methods for selecting the covariates to include in a model. I discuss the promises and perils of model selection and pay special attention to estimators that provide reliable inference after model selection. I will demonstrate how to use Stata 16's new features for double selection, partialing out, and cross-fit partialing out to estimate the effects of variables of interest while using lasso methods to select control variables.
    Date: 2019–09–15
    URL: http://d.repec.org/n?u=RePEc:boc:usug19:25&r=all
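    The double-selection idea (Belloni, Chernozhukov and Hansen) that Stata 16 implements can be sketched in Python with scikit-learn; this illustrates the logic, not Stata's interface:

      import numpy as np
      from sklearn.linear_model import LassoCV, LinearRegression

      def double_selection(y, d, X):
          """Lasso y on X and d on X; then OLS of y on d plus the union
          of the selected controls. Returns the estimated effect of d."""
          keep_y = np.abs(LassoCV(cv=5).fit(X, y).coef_) > 1e-8
          keep_d = np.abs(LassoCV(cv=5).fit(X, d).coef_) > 1e-8
          Z = np.column_stack([d, X[:, keep_y | keep_d]])
          return LinearRegression().fit(Z, y).coef_[0]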
  19. By: Aditya Aladangady; Shifrah Aron-Dine; Wendy Dunn; Laura Feiveson; Paul Lengermann; Claudia Sahm
    Abstract: Access to timely information on consumer spending is important to economic policymakers. The Census Bureau’s monthly retail trade survey is a primary source for monitoring consumer spending nationally, but it is not well suited to study localized or short-lived economic shocks. Moreover, lags in the publication of the Census estimates and subsequent, sometimes large, revisions diminish its usefulness for real-time analysis. Expanding the Census survey to include higher frequencies and subnational detail would be costly and would add substantially to respondent burden. We take an alternative approach to fill these information gaps. Using anonymized transactions data from a large electronic payments technology company, we create daily estimates of retail spending at detailed geographies. Our daily estimates are available only a few days after the transactions occur, and the historical time series are available from 2010 to the present. When aggregated to the national level, the pattern of monthly growth rates is similar to the official Census statistics. We discuss two applications of these new data for economic analysis: First, we describe how our monthly spending estimates are useful for real-time monitoring of aggregate spending, especially during the government shutdown in 2019, when Census data were delayed and concerns about the economy spiked. Second, we show how the geographic detail allowed us to quantify in real time the spending effects of Hurricanes Harvey and Irma in 2017.
    JEL: E21 E27
    Date: 2019–09
    URL: http://d.repec.org/n?u=RePEc:nbr:nberwo:26253&r=all
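    A sketch of the aggregation step, assuming pandas and a hypothetical anonymized transactions file with date, county, and amount columns:

      import pandas as pd

      tx = pd.read_csv("card_transactions.csv", parse_dates=["date"])  # hypothetical
      daily = tx.groupby(["county", pd.Grouper(key="date", freq="D")])["amount"].sum()
      monthly = tx.groupby(["county", pd.Grouper(key="date", freq="M")])["amount"].sum()
      national = monthly.groupby(level="date").sum()   # compare with Census retail series
      print(national.pct_change().tail())              # monthly growth rates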

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.