nep-big New Economics Papers
on Big Data
Issue of 2019‒08‒26
24 papers chosen by
Tom Coupé
University of Canterbury

  1. From Mad Men to Maths Men: Concentration and Buyer Power in Online Advertising By Decarolis, Francesco; Rovigatti, Gabriele
  2. In search of a job: Forecasting employment growth using Google Trends By Daniel Borup; Erik Christian Montes Schütte
  3. British Stock Market, BREXIT and Media Sentiments - A Big Data Analysis By Gopal K. Basak; Pranab Kumar Das; Sugata Marjit; Debashis Mukherjee; Lei Yang
  4. Nowcasting US GDP with artificial neural networks By Loermann, Julius; Maas, Benedikt
  5. Predicting credit default probabilities using machine learning techniques in the face of unequal class distributions By Anna Stelzer
  6. Realized Volatility Forecasting with Neural Networks By Bucci, Andrea
  7. A hybrid neural network model based on improved PSO and SA for bankruptcy prediction By Fatima Zahra Azayite; Said Achchab
  8. Environnement big data et prise de décision intuitive : le cas de la Police Nationale des Bouches du Rhône By Jordan Vazquez; Cécile Godé; Jean-Fabrice Lebraty
  9. Artificial Intelligence Applications & Venture Funding in Healthcare By Halminen, Olli; Tenhunen, Henni; Heliste, Antti; Seppälä, Timo
  10. The Use of Binary Choice Forests to Model and Estimate Discrete Choice Models By Ningyuan Chen; Guillermo Gallego; Zhuodong Tang
  11. Forward-Selected Panel Data Approach for Program Evaluation By Zhentao Shi; Jingyi Huang
  12. Large scale continuous-time mean-variance portfolio allocation via reinforcement learning By Haoran Wang; Xun Yu Zhou
  13. Estimation of Conditional Average Treatment Effects with High-Dimensional Data By Qingliang Fan; Yu-Chin Hsu; Robert P. Lieli; Yichong Zhang
  14. Environnement big data et décision : l'étape de contre la montre du tour de France 2017 By Jordan Vazquez; Cécile Godé; Jean-Fabrice Lebraty
  15. Determining the Importance of an Attribute in a Demand System: Structural versus Machine Learning Approach By Badruddoza, Syed; Amin, Modhurima D.
  16. Bitcoin Return Volatility Forecasting: A Comparative Study of GARCH Model and Machine Learning Model By Shen, Ze; Wan, Qing; Leatham, David J.
  17. Who is Tested for Heart Attack and Who Should Be: Predicting Patient Risk and Physician Error By Sendhil Mullainathan; Ziad Obermeyer
  18. Worried about the fourth industrial revolution's impact on jobs? Scale up skills development and training! By Terry McKinley
  19. Review of the Plan for Integrating Big Data Analytics Program for the Electronic Marketing System and Customer Relationship Management: A Case Study XYZ Institution By Idha Sudianto
  20. Enhancing the Demand for Labour survey by including skills from online job advertisements using model-assisted calibration By Maciej Beręsewicz; Greta Białkowska; Krzysztof Marcinkowski; Magdalena Maślak; Piotr Opiela; Robert Pater; Katarzyna Zadroga
  21. Taxable Stock Trading with Deep Reinforcement Learning By Shan Huang
  22. Inference after lasso model selection By David Drukker
  23. Using lasso and related estimators for prediction By Di Liu
  24. Mobility Data Sharing: Challenges and Policy Recommendations By D'Agostino, Mollie; Pellaton, Paige; Brown, Austin

  1. By: Decarolis, Francesco; Rovigatti, Gabriele
    Abstract: This paper analyzes the impact of intermediaries' concentration on the allocation of revenues in online platforms. We study sponsored search - the sale of ad space on search engines through online auctions - documenting how advertisers increasingly bid through a handful of specialized intermediaries. This enhances automated bidding and data pooling, but lessens competition whenever the intermediary represents competing advertisers. Using data on nearly 40 million Google keyword auctions, we first apply machine learning algorithms to cluster keywords into thematic groups serving as relevant markets. Then, through an instrumental variable strategy, we quantify a negative and sizeable impact of intermediaries' concentration on the platform's revenues.
    Keywords: Buyer Power; Concentration; online advertising; platforms; Sponsored Search
    JEL: C72 D44 L81
    Date: 2019–07
  2. By: Daniel Borup (Aarhus University and CREATES); Erik Christian Montes Schütte (Aarhus University and CREATES)
    Abstract: We show that Google search activity on relevant terms is a strong out-of-sample predictor of future employment growth in the US over the period 2004-2018, at both short and long horizons. Using a subset of ten keywords associated with “jobs”, we construct a large panel of 173 variables using Google’s own algorithms to find related search queries. We find that the best Google Trends model achieves an out-of-sample R2 between 26% and 59% at horizons spanning from one month to a year ahead, strongly outperforming benchmarks based on a large set of macroeconomic and financial predictors. This strong predictability extends to US state-level employment growth, using state-specific Google search activity. Encompassing tests indicate that when the Google Trends panel is exploited using a non-linear model, it fully encompasses the macroeconomic forecasts and provides significant information beyond them.
    Keywords: Google Trends, Forecast comparison, US employment growth, Targeting predictors, Random forests, Keyword search.
    JEL: C22 C53 E17 E24
    Date: 2019–08–22
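    The out-of-sample R² quoted above is conventionally computed against a benchmark forecast, as one minus the ratio of the model's mean squared forecast error to the benchmark's. A minimal sketch of that statistic; the function and variable names are illustrative, not taken from the paper:

```python
def oos_r2(actual, model_forecast, benchmark_forecast):
    """Out-of-sample R^2: 1 - MSE(model) / MSE(benchmark).

    Positive values mean the model beats the benchmark out of sample.
    """
    n = len(actual)
    mse_model = sum((a - f) ** 2 for a, f in zip(actual, model_forecast)) / n
    mse_bench = sum((a - b) ** 2 for a, b in zip(actual, benchmark_forecast)) / n
    return 1.0 - mse_model / mse_bench
```

    For instance, a model that halves every error relative to the benchmark attains an out-of-sample R² of 0.75.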
  3. By: Gopal K. Basak; Pranab Kumar Das; Sugata Marjit; Debashis Mukherjee; Lei Yang
    Abstract: In this paper we show, using a machine learning framework and a substantial corpus of media articles on Brexit, evidence of co-integration and causality between the ensuing media sentiments and the British currency. The novel contribution of this paper is that, along with sentiment analysis using commonly used lexicons, we devise a method using Bayesian learning to create a more context-aware and more informative lexicon for Brexit. Moreover, by leveraging and extending this lexicon, we can unearth hidden relationships between originating media sentiments and related economic and financial variables. Our method is a distinct improvement over existing ones and can predict out-of-sample outcomes better than conventional approaches.
    Keywords: digitization, machine learning
    Date: 2019
  4. By: Loermann, Julius; Maas, Benedikt
    Abstract: We use a machine learning approach to forecast the US GDP value of the current quarter and several quarters ahead. Within each quarter, the contemporaneous value of GDP growth is unavailable but can be estimated using higher-frequency variables that are published in a more timely manner. Using the monthly FRED-MD database, we compare feedforward artificial neural network forecasts of GDP growth to forecasts from state-of-the-art dynamic factor models and the Survey of Professional Forecasters, and we evaluate their relative performance. The results indicate that the neural network outperforms the dynamic factor model in both nowcasting and forecasting, while generating nowcasts and forecasts at least as good as those of the Survey of Professional Forecasters.
    Keywords: Nowcasting; Machine learning; Neural networks; Big data
    JEL: C32 C53 E32
    Date: 2019–05
  5. By: Anna Stelzer
    Abstract: This study benchmarks 23 different statistical and machine learning methods in a credit scoring application. The models' performance is evaluated over four different data sets in combination with five data sampling strategies that tackle existing class imbalances in the data. Six different performance measures are used to cover different aspects of predictive performance. The results indicate a strong superiority of ensemble methods and show that simple sampling strategies deliver better results than more sophisticated ones.
    Date: 2019–07
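    One of the simple sampling strategies such benchmarks compare is random oversampling of the minority (default) class. A minimal, illustrative sketch of that strategy, not the paper's exact implementation; all names are assumptions:

```python
import random

def random_oversample(features, labels, minority_label, seed=0):
    """Duplicate minority-class rows until the classes are balanced."""
    rng = random.Random(seed)
    minority = [(x, y) for x, y in zip(features, labels) if y == minority_label]
    majority = [(x, y) for x, y in zip(features, labels) if y != minority_label]
    resampled = list(minority)
    while len(resampled) < len(majority):
        resampled.append(rng.choice(minority))  # sample minority rows with replacement
    combined = majority + resampled
    rng.shuffle(combined)
    xs, ys = zip(*combined)
    return list(xs), list(ys)
```

    More sophisticated alternatives (e.g. synthetic oversampling) interpolate new minority rows instead of duplicating them; the study's finding is that the simple variant often suffices.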
  6. By: Bucci, Andrea
    Abstract: In the last few decades, a broad strand of the finance literature has implemented artificial neural networks as a forecasting method. The major advantage of this approach is the possibility to approximate any linear and nonlinear behavior without knowing the structure of the data generating process. This makes it suitable for forecasting time series which exhibit long memory and nonlinear dependencies, like conditional volatility. In this paper, I compare the predictive performance of feedforward and recurrent neural networks (RNN), particularly focusing on the recently developed long short-term memory (LSTM) and NARX networks, with traditional econometric approaches. The results show that recurrent neural networks are able to outperform all the traditional econometric methods. Additionally, capturing long-range dependence through LSTM and NARX models seems to improve the forecasting accuracy even in a highly volatile framework.
    Keywords: Neural Networks; Realized Volatility; Forecast
    JEL: C22 C45 C53 G17
    Date: 2019–08
  7. By: Fatima Zahra Azayite; Said Achchab
    Abstract: Predicting a firm's failure is one of the most interesting subjects for investors and decision makers. In this paper, a bankruptcy prediction model based on artificial neural networks (ANN) is proposed. Considering that the choice of variables used to discriminate between bankrupt and non-bankrupt firms significantly influences the model's accuracy, and considering the problem of local minima, we propose a hybrid ANN based on variable selection techniques. Moreover, we improve the convergence of Particle Swarm Optimization (PSO) by proposing a training algorithm based on an improved PSO and Simulated Annealing. A comparative performance study is reported, and the proposed hybrid model shows high performance and convergence in the context of missing data.
    Date: 2019–07
  8. By: Jordan Vazquez (UJML - Université Jean Moulin - Lyon III - Université de Lyon); Cécile Godé (CRET-LOG - Centre de Recherche sur le Transport et la Logistique - AMU - Aix Marseille Université); Jean-Fabrice Lebraty (UJML - Université Jean Moulin - Lyon III - Université de Lyon)
    Abstract: This article examines the place of intuition in the decision-making process in a big data environment. It draws on an exploratory case study conducted with decision-makers at the Information and Command Centre (CIC) of the National Police (PN) of Bouches-du-Rhône. These decision-makers operate in a big data environment and must regularly manage unforeseen situations. The corpus of field data was built by triangulating 28 individual and group interviews, non-participant observations, and archives and official reports. This new information provides cues that allow decision-makers to anticipate the unexpected, leading them to reconfigure their expectations, objectives and actions. These positive aspects must nevertheless be weighed against the risk induced by the considerable volume of information now available to decision-makers, who must master the new systems and applications that make it possible to exploit the big data environment. Our results suggest that when decision-makers do not master these systems, a big data environment can turn an expert decision-maker back into a novice.
    Date: 2019–06–03
  9. By: Halminen, Olli; Tenhunen, Henni; Heliste, Antti; Seppälä, Timo
    Abstract: Venture Capital (VC) funding raised by companies producing Artificial Intelligence (AI) solutions is on the rise. In healthcare, VC funding is distributed unevenly, and certain technologies have attracted significantly more funding than others. The funding decisions made by VC companies also work as a technology driver for the industry. We analyzed a database of 106 healthcare AI companies collected from open online sources to understand the factors affecting VC funding of AI companies operating in different areas of healthcare. Companies acting as R&D catalysts have been most successful in raising VC funding. The results suggest a significant connection between higher funding and having research organizations and pharmaceutical companies as customers of the product or service. In addition, focusing on AI solutions applied to direct patient care delivery is associated with lower funding. We discuss the implications of our findings for health technology research and development and for the barriers to platform data markets in the healthcare industry.
    Keywords: Artificial Intelligence, Capital Funding, Technology
    JEL: G2 G24 I1 I19
    Date: 2019–08–20
  10. By: Ningyuan Chen; Guillermo Gallego; Zhuodong Tang
    Abstract: We show the equivalence of discrete choice models and the class of binary choice forests, which are random forests based on binary choice trees. This suggests that standard machine learning techniques based on random forests can serve to estimate discrete choice models with an interpretable output. This is confirmed by our data-driven result that random forests can accurately predict the choice probabilities of any discrete choice model. Our framework has unique advantages: it can capture behavioral patterns such as irrationality or sequential searches; it handles nonstandard formats of training data that result from aggregation; it can measure product importance based on how frequently a random customer's decision would depend on the presence of the product; and it can incorporate price information. Our numerical results show that binary choice forests can outperform the best parametric models with much better computational times.
    Date: 2019–08
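    As a toy illustration of the binary-choice-forest idea, one can bag one-split "stumps" over (assortment, choice) pairs and read choice probabilities off the vote shares across trees. This is a deliberately minimal sketch under simplifying assumptions, not the authors' estimator, which uses full binary choice trees:

```python
import random
from collections import Counter

def fit_choice_forest(assortments, choices, n_trees=50, seed=0):
    """Bag shallow trees (here, one-split stumps): each stump conditions on a
    randomly chosen product's presence in the assortment and predicts the
    majority choice observed in that branch of a bootstrap sample."""
    rng = random.Random(seed)
    data = list(zip(assortments, choices))
    n_products = len(assortments[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # bootstrap resample
        split = rng.randrange(n_products)          # random split product
        branch = {}
        for present in (0, 1):
            votes = Counter(c for a, c in sample if a[split] == present)
            branch[present] = votes.most_common(1)[0][0] if votes else rng.choice(choices)
        forest.append((split, branch))
    return forest

def choice_probabilities(forest, assortment):
    """Estimated choice probabilities: the share of trees voting for each
    alternative, mirroring how a random forest averages its trees."""
    votes = Counter(branch[assortment[split]] for split, branch in forest)
    total = sum(votes.values())
    return {alt: n / total for alt, n in votes.items()}
```

    The vote-share reading is what gives the method its interpretable output: each alternative's probability is just the fraction of trees that choose it.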
  11. By: Zhentao Shi; Jingyi Huang
    Abstract: Policy evaluation is central to economic data analysis, but economists mostly work with observational data in view of limited opportunities to carry out controlled experiments. In the potential outcome framework, the panel data approach (Hsiao, Ching and Wan, 2012) constructs the counterfactual by exploiting the correlation between cross-sectional units in panel data. The choice of cross-sectional control units, a key step in its implementation, is nevertheless unresolved in data-rich environments where many possible controls are at the researcher's disposal. We propose the forward selection method to choose control units and establish the validity of post-selection inference. Our asymptotic framework allows the number of possible controls to grow much faster than the time dimension. The easy-to-implement algorithms and their theoretical guarantees extend the panel data approach to big data settings. Monte Carlo simulations demonstrate the finite-sample performance of the proposed method, and two empirical examples illustrate the usefulness of our procedure when many controls are available in real-world applications.
    Date: 2019–08
  12. By: Haoran Wang; Xun Yu Zhou
    Abstract: We propose to solve the large-scale Markowitz mean-variance (MV) portfolio allocation problem using reinforcement learning (RL). By adopting the recently developed continuous-time exploratory control framework, we formulate the exploratory MV problem in high dimensions. We further show the optimality of a multivariate Gaussian feedback policy, with time-decaying variance, in trading off exploration and exploitation. Based on a provable policy improvement theorem, we devise a scalable and data-efficient RL algorithm and conduct large-scale empirical tests using data from the S&P 500 stocks. We find that our method consistently achieves over 10% annualized returns and outperforms econometric methods and the deep RL method by large margins, for both long and medium investment horizons with monthly and daily trading.
    Date: 2019–07
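    The Gaussian feedback policy with time-decaying variance can be illustrated schematically: actions are sampled around the exploitative mean allocation with noise that shrinks toward the terminal date. The linear decay schedule and all names below are illustrative assumptions, not the specific policy derived in the paper:

```python
import math
import random

def exploratory_action(mean_allocation, t, horizon, base_var=1.0, seed=None):
    """Sample an allocation from a Gaussian centred on the exploitative
    (mean) allocation, with variance decaying linearly to zero at the
    terminal date so that exploration fades into pure exploitation."""
    rng = random.Random(seed)
    variance = base_var * (horizon - t) / horizon  # illustrative decay schedule
    return rng.gauss(mean_allocation, math.sqrt(variance))
```

    At t = horizon the variance hits zero and the policy plays the mean allocation deterministically; earlier in the horizon it explores around it.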
  13. By: Qingliang Fan; Yu-Chin Hsu; Robert P. Lieli; Yichong Zhang
    Abstract: Given the unconfoundedness assumption, we propose new nonparametric estimators for the reduced dimensional conditional average treatment effect (CATE) function. In the first stage, the nuisance functions necessary for identifying CATE are estimated by machine learning methods, allowing the number of covariates to be comparable to or larger than the sample size. This is a key feature since identification is generally more credible if the full vector of conditioning variables, including possible transformations, is high-dimensional. The second stage consists of a low-dimensional kernel regression, reducing CATE to a function of the covariate(s) of interest. We consider two variants of the estimator depending on whether the nuisance functions are estimated over the full sample or over a hold-out sample. Building on Belloni et al. (2017) and Chernozhukov et al. (2018), we derive functional limit theory for the estimators and provide an easy-to-implement procedure for uniform inference based on the multiplier bootstrap. The empirical application revisits the effect of maternal smoking on a baby's birth weight as a function of the mother's age.
    Date: 2019–08
  14. By: Jordan Vazquez (UJML - Université Jean Moulin - Lyon III - Université de Lyon); Cécile Godé (CRET-LOG - Centre de Recherche sur le Transport et la Logistique - AMU - Aix Marseille Université); Jean-Fabrice Lebraty (UJML3 Droit - Université Jean Moulin Lyon 3 - Faculé de Droit - UJML - Université Jean Moulin - Lyon III - Université de Lyon)
    Date: 2018–05–16
  15. By: Badruddoza, Syed; Amin, Modhurima D.
    Keywords: Research Methods/ Statistical Methods
    Date: 2019–06–25
  16. By: Shen, Ze; Wan, Qing; Leatham, David J.
    Keywords: Agribusiness
    Date: 2019–06–25
  17. By: Sendhil Mullainathan; Ziad Obermeyer
    Abstract: In deciding whether to test for heart attack (acute coronary syndromes), physicians implicitly judge risk. To assess these decisions, we produce explicit risk predictions by applying machine learning to Medicare claims data. Comparing these on a patient-by-patient basis to physician decisions reveals more about low-value care than the usual approach of measuring average testing results. It more precisely quantifies over-use: while the average test is marginally cost-effective, tests at the bottom of the risk distribution are highly cost-ineffective. But it also reveals under-use: many patients at the top of the risk distribution go untested, and they go on to have frequent adverse cardiac events, including death, in the next 30 days. At standard clinical thresholds, these event rates suggest they should have been tested. In aggregate, 42.8% of the potential welfare gains of improving testing would come from addressing under-use. Existing policies, though, are too blunt: when testing is reduced, for example, both low-value and high-value tests fall. Finally, to understand physician error, we build a separate algorithmic model of the physician and find evidence of bounded rationality as well as biases such as representativeness. We suggest that models of physician moral hazard should be expanded to include ‘behavioral hazard’.
    JEL: D8 D84 D9 I1 I13
    Date: 2019–08
  18. By: Terry McKinley (IPC-IG)
    Abstract: "We have been living through the third industrial revolution, 'digitalisation', since 1980. However, the fourth industrial revolution (driven mainly by robotics and artificial intelligence) already appears to be fast approaching. What will be its likely impacts on jobs, incomes and economic inequality? And, more importantly, what can be done about them? This One Pager focuses on this revolution's practical implications for social protection programmes". (...)
    Keywords: fourth industrial revolution, impact on jobs, scale up, skills, development, training
    Date: 2019–07
  19. By: Idha Sudianto
    Abstract: This research aims to explore business processes and the factors that have a major influence on electronic marketing and CRM systems: which data need to be analyzed and integrated in the system, how to do so, and how effectively electronic marketing and CRM can be integrated with big data to support marketing and customer-relations operations. The research is based on a case study at the XYZ Organization, an international language education service in Surabaya. It studies secondary data, supported by qualitative research methods, using purposive sampling with observation and interviews with several respondents who need the system integration. The interview documentation is coded to keep informants confidential. Extended participation, triangulation of data sources, discussion, and the adequacy of the theory are used to validate the data, and the Miles and Huberman model is used to analyze the interview data. The results of the research are expected to provide a holistic approach to fully integrating the Big Data Analytics program with electronic marketing and CRM systems.
    Date: 2019–08
  20. By: Maciej Beręsewicz; Greta Białkowska; Krzysztof Marcinkowski; Magdalena Maślak; Piotr Opiela; Robert Pater; Katarzyna Zadroga
    Abstract: In the article we describe an enhancement to the Demand for Labour (DL) survey conducted by Statistics Poland, which involves the inclusion of skills obtained from online job advertisements. The main goal is to provide estimates of the demand for skills (competences), which is missing in the DL survey. To achieve this, we apply a data integration approach combining traditional calibration with the LASSO-assisted approach to correct representation error in the online data. Faced with the lack of access to unit-level data from the DL survey, we use estimated population totals and propose a bootstrap approach that accounts for the uncertainty of totals reported by Statistics Poland. We show that the calibration estimator assisted with LASSO outperforms traditional calibration in terms of standard errors and reduces representation bias in skills observed in online job ads. Our empirical results show that online data significantly overestimate interpersonal, managerial and self-organization skills while underestimating technical and physical skills. This is mainly due to the under-representation of occupations categorised as Craft and Related Trades Workers and Plant and Machine Operators and Assemblers.
    Date: 2019–08
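    The LASSO underlying the assisted calibration shrinks coefficients via the soft-thresholding operator S(z, γ) = sign(z)·max(|z| − γ, 0), which is what produces sparse weights on the candidate skills. A minimal sketch of that operator (illustrative, not the authors' estimator):

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator used in coordinate-descent lasso:
    shrinks z toward zero by gamma, and sets it exactly to zero when
    |z| <= gamma -- the mechanism behind sparse coefficient vectors."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0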
  21. By: Shan Huang
    Abstract: In this paper, we propose stock trading strategies based on the average tax basis. Recall that when selling stocks, capital gains are taxed while capital losses can earn a certain tax rebate. We learn the optimal trading strategies with and without considering taxes by reinforcement learning. The results show that ignoring taxes could induce more than a 62% loss in average portfolio returns, implying that taxes should be embedded in the environment of continuous stock trading on AI platforms.
    Date: 2019–07
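    The average tax basis driving the taxed gain or rebated loss on a sale is the cost-weighted average price over open lots. A minimal sketch (the names and the two-function split are illustrative, not from the paper):

```python
def average_tax_basis(lots):
    """Average cost basis per share: total purchase cost divided by total
    shares held, across all open (shares, price) lots."""
    total_shares = sum(shares for shares, _ in lots)
    total_cost = sum(shares * price for shares, price in lots)
    return total_cost / total_shares

def taxable_gain(lots, shares_sold, sale_price):
    """Capital gain (positive, taxed) or loss (negative, rebated) when
    selling shares_sold at sale_price against the average basis."""
    return shares_sold * (sale_price - average_tax_basis(lots))
```

    An RL trading environment that embeds taxes, as the paper argues it should, would feed this gain or loss into the reward at each sale.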
  22. By: David Drukker (StataCorp)
    Abstract: The increasing availability of high-dimensional data and increasing interest in more realistic functional forms have sparked a renewed interest in automated methods for selecting the covariates to include in a model. I discuss the promises and perils of model selection and pay special attention to estimators that provide reliable inference after model selection. I will demonstrate how to use Stata 16's new features for double selection, partialing out, and cross-fit partialing out to estimate the effects of variables of interest while using lasso methods to select control variables.
    Date: 2019–08–02
  23. By: Di Liu (StataCorp)
    Abstract: Users may extend Stata's features using other programming languages such as Java and C. New in Stata 16, Stata has tight integration with Python, which allows users to embed and execute Python code from within Stata. I will discuss how users can easily call Python from Stata, output Python results within Stata, and exchange data and results between Python and Stata, both interactively and as subroutines within do-files and ado-files. I will also show examples of the Stata Function Interface (sfi), a Python module provided with Stata that offers extensive facilities for accessing Stata objects from within Python.
    Date: 2019–08–02
  24. By: D'Agostino, Mollie; Pellaton, Paige; Brown, Austin
    Abstract: Dynamic and responsive transportation systems are a core pillar of equitable and sustainable communities. Achieving such systems requires comprehensive mobility data, or data that reports the movement of individuals and vehicles. Such data enable planners and policymakers to make informed decisions and enable researchers to model the effects of various transportation solutions. However, collecting mobility data also raises concerns about privacy and proprietary interests. This issue paper provides an overview of the top needs and challenges surrounding mobility data sharing and presents four relevant policy strategies: (1) Foster voluntary agreement among mobility providers for a set of standardized data specifications; (2) Develop clear data-sharing requirements designed for transportation network companies and other mobility providers; (3) Establish publicly held big-data repositories, managed by third parties, to securely hold mobility data and provide structured access by states, cities, and researchers; (4) Leverage innovative land-use and transportation-planning tools.
    Keywords: Social and Behavioral Sciences, data sharing, policy making, transportation policy, shared mobility, transportation planning, city planning
    Date: 2019–08–01

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.