nep-big 2020-11-30 papers

on Big Data

Issue of 2020–11–30
seventeen papers chosen by
Tom Coupé, University of Canterbury

Predicting well-being based on features visible from space – the case of Warsaw By Krystian Andruszek; Piotr Wójcik
Mostly Harmless Machine Learning: Learning Optimal Instruments in Linear IV Models By Jiafeng Chen; Daniel L. Chen; Greg Lewis
Analysis and Forecasting of Financial Time Series Using CNN and LSTM-Based Deep Learning Models By Sidra Mehtab; Jaydip Sen; Subhasis Dasgupta
China's Missing Pigs: Correcting China's Hog Inventory Data Using a Machine Learning Approach By Shao, Yongtong; Xiong, Tao; Li, Minghao; Hayes, Dermot; Zhang, Wendong; Xie, Wei
Population synthesis for urban resident modeling using deep generative models By Martin Johnsen; Oliver Brandt; Sergio Garrido; Francisco C. Pereira
Identifying Consumer Preferences from User- and Crowd-Generated Digital Footprints on Amazon.com by Leveraging Machine Learning and Natural Language Processing By Jikhan Jeong
Prospects and challenges of quantum finance By Adam Bouland; Wim van Dam; Hamed Joorati; Iordanis Kerenidis; Anupam Prakash
Data-driven mergers and personalization By Zhijun Chen; Chongwoo Choe; Jiajia Cong; Noriaki Matsushima
Addressing the productivity paradox with big data: A literature review and adaptation of the CDM econometric model By Schubert, Torben; Jäger, Angela; Türkeli, Serdar; Visentin, Fabiana
Exploration of model performances in the presence of heterogeneous preferences and random effects utilities awareness By Gusarov, N.; Talebijmalabad, A.; Joly, I.
Central bank tone and the dispersion of views within monetary policy committees By Paul Hubert; Fabien Labondance
We have just explained real convergence factors using machine learning By Piotr Wójcik; Bartłomiej Wieczorek
Central Bank Tone and the Dispersion of Views within Monetary Policy Committees By Paul Hubert; Fabien Labondance
Sequentially Estimating Approximate Conditional Mean Using the Extreme Learning Machine By LIJUAN HUO; JIN SEO CHO
Predicting United States Policy Outcomes with Random Forests By Shawn K. McGuire; Charles B. Delahunt
Past, Present and Future of the Spanish Labour Market: When the Pandemic meets the Megatrends By Juan J Dolado; Florentino Felgueroso; Juan F. Jimeno
Sentiment Diffusion in Financial News Networks and Associated Market Movements By Xingchen Wan; Jie Yang; Slavi Marinov; Jan-Peter Calliess; Stefan Zohren; Xiaowen Dong

Predicting well-being based on features visible from space – the case of Warsaw

By:	Krystian Andruszek (Data Science Lab WNE UW); Piotr Wójcik (Faculty of Economic Sciences, Data Science Lab WNE UW, University of Warsaw)
Abstract:	In recent years, availability of satellite imagery has grown rapidly. In addition, deep neural networks have gained popularity and become widely used in various applications. This article focuses on using innovative deep learning and machine learning methods in combination with data describing objects visible from space. High resolution daytime satellite images are used to extract features for particular areas with the use of transfer learning and convolutional neural networks. Then extracted features are used in machine learning models (LASSO and random forest) as predictors of various socio-economic indicators. The analysis is performed on a local level of Warsaw districts. The findings from such an approach can be a great help to get almost continuous measurement of the economic well-being, independently of statistical offices.
Keywords:	well-being, economic indicators, Open Street Map, satellite images, Warsaw
JEL:	I31 R12 O18 C14
Date:	2020
URL:	https://d.repec.org/n?u=RePEc:war:wpaper:2020-37

Mostly Harmless Machine Learning: Learning Optimal Instruments in Linear IV Models

By:	Jiafeng Chen; Daniel L. Chen; Greg Lewis
Abstract:	We provide some simple theoretical results that justify incorporating machine learning in a standard linear instrumental variable setting, prevalent in empirical research in economics. Machine learning techniques, combined with sample-splitting, extract nonlinear variation in the instrument that may dramatically improve estimation precision and robustness by boosting instrument strength. The analysis is straightforward in the absence of covariates. The presence of linearly included exogenous covariates complicates identification, as the researcher would like to prevent nonlinearities in the covariates from providing the identifying variation. Our procedure can be effectively adapted to account for this complication, based on an argument by Chamberlain (1992). Our method preserves standard intuitions and interpretations of linear instrumental variable methods and provides a simple, user-friendly upgrade to the applied economics toolbox. We illustrate our method with an example in law and criminal justice, examining the causal effect of appellate court reversals on district court sentencing decisions.
Date:	2020–11
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2011.06158

Analysis and Forecasting of Financial Time Series Using CNN and LSTM-Based Deep Learning Models

By:	Sidra Mehtab; Jaydip Sen; Subhasis Dasgupta
Abstract:	Prediction of stock price and stock price movement patterns has always been a critical area of research. While the well-known efficient market hypothesis rules out any possibility of accurate prediction of stock prices, there are formal propositions in the literature demonstrating accurate modeling of the predictive systems can enable us to predict stock prices with a very high level of accuracy. In this paper, we present a suite of deep learning-based regression models that yields a very high level of accuracy in stock price prediction. To build our predictive models, we use the historical stock price data of a well-known company listed in the National Stock Exchange (NSE) of India during the period December 31, 2012 to January 9, 2015. The stock prices are recorded at five minutes interval of time during each working day in a week. Using these extremely granular stock price data, we build four convolutional neural network (CNN) and five long- and short-term memory (LSTM)-based deep learning models for accurate forecasting of future stock prices. We provide detailed results on the forecasting accuracies of all our proposed models based on their execution time and their root mean square error (RMSE) values.
Date:	2020–11
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2011.08011

China's Missing Pigs: Correcting China's Hog Inventory Data Using a Machine Learning Approach

By:	Shao, Yongtong; Xiong, Tao; Li, Minghao; Hayes, Dermot; Zhang, Wendong; Xie, Wei
Abstract:	Small sample size often limits forecasting tasks such as the prediction of production, yield, and consumption of agricultural products. Machine learning offers an appealing alternative to traditional forecasting methods. In particular, Support Vector Regression has superior forecasting performance in small sample applications. In this article, we introduce Support Vector Regression via an application to China’s hog market. Since 2014, China’s hog inventory data has experienced an abnormal decline that contradicts price and consumption trends. We use Support Vector Regression to predict the true inventory based on the price-inventory relationship before 2014. We show that, in this application with a small sample size, Support Vector Regression out-performs neural networks, random forest, and linear regression. Predicted hog inventory decreased by 3.9% from November 2013 to September 2017, instead of the 25.4% decrease in the reported data.
Date:	2020–01–01
URL:	https://d.repec.org/n?u=RePEc:isu:genstf:202001010800001619

Population synthesis for urban resident modeling using deep generative models

By:	Martin Johnsen; Oliver Brandt; Sergio Garrido; Francisco C. Pereira
Abstract:	The impacts of new real estate developments are strongly associated to its population distribution (types and compositions of households, incomes, social demographics) conditioned on aspects such as dwelling typology, price, location, and floor level. This paper presents a Machine Learning based method to model the population distribution of upcoming developments of new buildings within larger neighborhood/condo settings. We use a real data set from Ecopark Township, a real estate development project in Hanoi, Vietnam, where we study two machine learning algorithms from the deep generative models literature to create a population of synthetic agents: Conditional Variational Auto-Encoder (CVAE) and Conditional Generative Adversarial Networks (CGAN). A large experimental study was performed, showing that the CVAE outperforms both the empirical distribution, a non-trivial baseline model, and the CGAN in estimating the population distribution of new real estate development projects.
Date:	2020–11
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2011.06851

Identifying Consumer Preferences from User- and Crowd-Generated Digital Footprints on Amazon.com by Leveraging Machine Learning and Natural Language Processing

By:	Jikhan Jeong
Abstract:	Inexperienced consumers may have high uncertainty about experience goods that require technical knowledge and skills to operate effectively; therefore, experienced consumers’ prior reviews can be useful for inexperienced ones. However, the one-sided review system (e.g., Amazon.com) only provides the opportunity for consumers to write a review as a buyer and contains no feedback from the seller’s side, so the information displayed about individual buyers is limited. This study analyzes consumers’ digital footprints (DFs) to identify and predict unobserved consumer preferences from online product reviews. It makes use of Python coding along with high-performance computing to extract reviewers’ DFs for a specific product group (programmable thermostats) from a dataset of 141 million Amazon reviews. It identifies consumers’ sentiment toward product content dimensions (PCDs) extracted from review text by applying topic modeling and domain expert annotations. However, some questionable reviews (posted by “suspicious one-time reviewers” and “always-the-same rating reviewers”) are excluded. This paper obtains three main results: First, I find that the factors that affect consumer ratings are: (a) users’ DFs (e.g., length of the product review, average rating across all categories, volume of prior reviews overall and in sub-categories), (b) reviewers’ attitudes toward eight product content dimensions (smart connectivity, easiness, energy saving, functionality, support, price value, privacy, and the Amazon effect), and (c) other prior reviewers DFs (e.g., length of the review summary.) All the heteroskedastic ordered probit models with DF and sentiment variables show a better model fit than the base model. This paper is the first to identify the effect of service quality of the online platform (Amazon.com) on ratings.
Second, extreme gradient boosting (XGBoost) is found to obtain the highest F1 score for predicting the ratings of potential consumers before they make a purchase or write a review. All the models containing DF and sentiment variables show a higher prediction performance than the base model. Classifications with a lower range of labels (three-class or binary classifications) show better prediction performance than the five-star rating classification. However, the performance for the minority class is low. Third, a convolutional neural network (CNN) on top of Bidirectional Encoder Representations from Transformers (BERT) embedding shows the highest F1 score for classifying consumers’ sentiment toward a specific PCD. Overall, this approach developed in this paper is applicable, scalable, and interpretable for distinguishing important drivers of consumer reviews for different goods in a specific industry and can be used by industry to identify and predict unobserved consumer preferences and sentiment associated with product content dimensions.
JEL:	D80 M21 M31 C45
Date:	2020–11–10
URL:	https://d.repec.org/n?u=RePEc:jmp:jm2020:pje208

Prospects and challenges of quantum finance

By:	Adam Bouland; Wim van Dam; Hamed Joorati; Iordanis Kerenidis; Anupam Prakash
Abstract:	Quantum computers are expected to have substantial impact on the finance industry, as they will be able to solve certain problems considerably faster than the best known classical algorithms. In this article we describe such potential applications of quantum computing to finance, starting with the state-of-the-art and focusing in particular on recent works by the QC Ware team. We consider quantum speedups for Monte Carlo methods, portfolio optimization, and machine learning. For each application we describe the extent of quantum speedup possible and estimate the quantum resources required to achieve a practical speedup. The near-term relevance of these quantum finance algorithms varies widely across applications - some of them are heuristic algorithms designed to be amenable to near-term prototype quantum computers, while others are proven speedups which require larger-scale quantum computers to implement. We also describe powerful ways to bring these speedups closer to experimental feasibility - in particular describing lower depth algorithms for Monte Carlo methods and quantum machine learning, as well as quantum annealing heuristics for portfolio optimization. This article is targeted at financial professionals and no particular background in quantum computation is assumed.
Date:	2020–11
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2011.06492

Data-driven mergers and personalization

By:	Zhijun Chen; Chongwoo Choe; Jiajia Cong; Noriaki Matsushima
Abstract:	Recent years have seen growing cases of data-driven tech mergers such as Google/Fitbit, in which a dominant digital platform acquires a relatively small firm possessing a large volume of consumer data. The digital platform can consolidate the consumer data with its existing data set from other services and use it for personalization in related markets. We develop a theoretical model to examine the impact of such mergers across the two markets that are related through a consumption synergy. The merger links the markets for data collection and data application, through which the digital platform can leverage its market power and hurt competitors in both markets. Personalization can lead to exploitation of some consumers in the market for data application. But insofar as competitors remain active, the merger increases total consumer surplus in both markets by intensifying competition. When the consumption synergy is large enough, the merger can result in monopolization of both markets, leading to further consumer harm when stand-alone competitors exit in the long run. Thus, there is a trade-off where potential dynamic costs can outweigh static benefits. We also discuss policy implications by considering various merger remedies.
Date:	2020–11
URL:	https://d.repec.org/n?u=RePEc:dpr:wpaper:1108

Addressing the productivity paradox with big data: A literature review and adaptation of the CDM econometric model

By:	Schubert, Torben (Fraunhofer ISI, and Circle, Lund University); Jäger, Angela (Fraunhofer ISI); Türkeli, Serdar (UNU-MERIT, Maastricht University); Visentin, Fabiana (UNU-MERIT, Maastricht University)
Abstract:	This paper develops the plan for the econometric estimations concerning the relationship between firm productivity and the specifics of the innovation process. The paper consists of three main parts. In the first, we review the relevant literature related to the productivity paradox and its causes. Specific attention will be paid to broad economic trends, in particular the higher importance of intangibles, the increasing importance of knowledge spillovers and servitisation as drivers of the slowdown in productivity growth. In the second part, we introduce a plan for the econometric estimation strategy. Here we propose an extended Crépon-Duguet-Mairesse type of model (CDM), which enriches the original specification by the three influence factors of intangibles, spillovers, and servitisation. This will allow testing the influence of these three factors on productivity at the level of the firm within a unified framework. In the third part, we build on the literature review in order to provide a detailed plan for the data collection procedure including a description of the variables to be collected and the source from which the variables are coming. It should be noted that we will rely partly on structured data (e.g. ORBIS), while many others variables will need to be generated from unstructured sources, in particular the webpages of firms. The use of unstructured data is a particular strength of our proposed data collection procedure because the use of such data is expected to offer novel insights. However, it implies additional risks in terms of data quality or missing data. Our data collection plan explores the maximum potential of variables that will ideally be made available for later econometric treatment. Whether indeed all variables will have sufficient quality to be used in the econometric estimations will be subject to the outcomes of the actual collection efforts.
Keywords:	Productivity, Intangibles, Servitisation, Innovation, R&D, Open Innovation, IPR, Knowledge diffusion, Economic growth, Productivity Paradox, Big data, Large data sets, data collection
JEL:	C55 C80 D24 E22 L80 O31 O32 O34 O40 O47
Date:	2020–11–11
URL:	https://d.repec.org/n?u=RePEc:unm:unumer:2020050

Exploration of model performances in the presence of heterogeneous preferences and random effects utilities awareness

By:	Gusarov, N.; Talebijmalabad, A.; Joly, I.
Abstract:	This work is a cross-disciplinary study of econometrics and machine learning (ML) models applied to consumer choice preference modelling. To bridge the interdisciplinary gap, a simulation and theory-testing framework is proposed. It incorporates all essential steps from hypothetical setting generation to the comparison of various performance metrics. The flexibility of the framework in theory-testing and models comparison over economics and statistical indicators is illustrated based on the work of Michaud, Llerena and Joly (2012). Two datasets are generated using the predefined utility functions simulating the presence of homogeneous and heterogeneous individual preferences for alternatives’ attributes. Then, three models issued from econometrics and ML disciplines are estimated and compared. The study demonstrates the proposed methodological approach’s efficiency, successfully capturing the differences between the models issued from different fields given the homogeneous or heterogeneous consumer preferences.
Keywords:	DISCRETE CHOICE MODELS;NEURAL NETWORK;PERFORMANCE COMPARISON;HETEROGENEOUS PREFERENCES
JEL:	C25 C45 C52 C80 C90
Date:	2020
URL:	https://d.repec.org/n?u=RePEc:gbl:wpaper:2020-12

Central bank tone and the dispersion of views within monetary policy committees

By:	Paul Hubert (Observatoire français des conjonctures économiques); Fabien Labondance (Observatoire français des conjonctures économiques)
Abstract:	Does policymakers’ choice of words matter? We explore empirically whether central bank tone conveyed in FOMC statements contains useful information for financial market participants. We quantify central bank tone using computational linguistics and identify exogenous shocks to central bank tone orthogonal to the state of the economy. Using an ARCH model and a high-frequency approach, we find that positive central bank tone increases interest rates at the 1-year maturity. We therefore investigate which potential pieces of information could be revealed by central bank tone. Our tests suggest that it relates to the dispersion of views among FOMC members. This information may be useful to financial markets to understand current and future policy decisions. Finally, we show that central bank tone helps predicting future policy decisions.
Keywords:	Animal spirits; Optimism; Confidence; FOMC; Central bank communication; Interest rate expectations; ECB; Aggregate effects
JEL:	E43 E52 E58
Date:	2019–11
URL:	https://d.repec.org/n?u=RePEc:spo:wpmain:info:hdl:2441/3mgbd73vkp9f9oje7utooe7vpg

We have just explained real convergence factors using machine learning

By:	Piotr Wójcik (Faculty of Economic Sciences, Data Science Lab WNE UW, University of Warsaw); Bartłomiej Wieczorek (Data Science Lab WNE UW)
Abstract:	There are several competing empirical approaches to identify factors of real economic convergence. However, all of the previous studies of cross-country convergence assume a linear model specification. This article uses a novel approach and shows the application of several machine learning tools to this topic discussing their advantages over the other methods, including possibility of identifying nonlinear relationships without any a priori assumptions about its shape. The results suggest that conditional convergence observed in earlier studies could have been a result of inappropriate model specification. We find that in a correct non-linear approach, initial GDP is not (strongly) correlated with growth. In addition, the tools of interpretable machine learning allow to discover the shape of relationship between the average growth and initial GDP. Based on these tools we prove the occurrence of convergence of clubs.
Keywords:	cross-country convergence, conditional convergence, determinants, machine learning, non-linear
JEL:	O47 C14 C52
Date:	2020
URL:	https://d.repec.org/n?u=RePEc:war:wpaper:2020-38

Central Bank Tone and the Dispersion of Views within Monetary Policy Committees

By:	Paul Hubert (Observatoire français des conjonctures économiques); Fabien Labondance (Observatoire français des conjonctures économiques)
Abstract:	Does policymakers’ choice of words matter? We explore empirically whether central bank tone conveyed in FOMC statements contains useful information for financial market participants. We quantify central bank tone using computational linguistics and identify exogenous shocks to central bank tone orthogonal to the state of the economy. Using an ARCH model and a high-frequency approach, we find that positive central bank tone increases interest rates at the 1-year maturity. We therefore investigate which potential pieces of information could be revealed by central bank tone. Our tests suggest that it relates to the dispersion of views among FOMC members. This information may be useful to financial markets to understand current and future policy decisions. Finally, we show that central bank tone helps predicting future policy decisions.
Keywords:	Optimism; FOMC; Dissent; Interest rate expectations; ECB
JEL:	E43 E52 E58
Date:	2020
URL:	https://d.repec.org/n?u=RePEc:spo:wpmain:info:hdl:2441/7v8fvu0bf08jcoi4epn8cutjm8

Sequentially Estimating Approximate Conditional Mean Using the Extreme Learning Machine

By:	LIJUAN HUO (Beijing Institute of Technology); JIN SEO CHO (Yonsei Univ)
Abstract:	This study applies the Wald test statistic assisted by the extreme learning machine (ELM) to test for model misspecification. When testing for model misspecification of conditional mean, the omnibus test statistics weakly converge to a Gaussian stochastic process under the null that makes their application inconvenient. We overcome this by applying the ELM to the Wald test statistic defined by the functional regression and also apply it to a sequential testing procedure to estimate an approximate conditional expectation. By conducting extensive Monte Carlo experiments, we evaluate its performance and verify that the sequential WELM testing procedure estimates the most parsimonious conditional mean equation consistently if the candidate polynomial models are correctly specified; and further it consistently rejects all candidate models if all of them are misspecified.
Keywords:	specification testing; conditional mean; omnibus test; Gaussian process; extreme learning machine; sequential testing procedure.
Date:	2020–10
URL:	https://d.repec.org/n?u=RePEc:yon:wpaper:2020rwp-180

Predicting United States Policy Outcomes with Random Forests

By:	Shawn K. McGuire; Charles B. Delahunt (University of Washington, Seattle, WA)
Abstract:	Two decades of U.S. government legislative outcomes, as well as the policy preferences of rich people, the general population, and diverse interest groups, were captured in a detailed dataset curated and analyzed by Gilens, Page et al. (2014). They found that the preferences of the rich correlated strongly with policy outcomes, while the preferences of the general population did not, except via a linkage with rich people’s preferences. Their analysis applied the tools of classical statistical inference, in particular logistic regression. In this paper we analyze the Gilens dataset using the complementary tools of Random Forest classifiers (RFs), from Machine Learning. We present two primary findings, concerning respectively prediction and inference: (i) Holdout test sets can be predicted with approximately 70% balanced accuracy by models that consult only the preferences of rich people and a small number of powerful interest groups, as well as policy area labels. These results include retrodiction, where models trained on pre-1997 cases predicted “future” (post-1997) cases. The 20% gain in accuracy over baseline (chance), in this detailed but noisy dataset, indicates the high importance of a few wealthy players in U.S. policy outcomes, and aligns with a body of research indicating that the U.S. government has significant plutocratic tendencies. (ii) The feature selection methods of RF models identify especially salient subsets of interest groups (economic players). These can be used to further investigate the dynamics of governmental policy making, and also offer an example of the potential value of RF feature selection methods for inference on datasets such as this one.
Keywords:	political economy, financial crisis, political parties, political money.
JEL:	G20 L5 N22 D72 G38 P16 K22
Date:	2020–10–02
URL:	https://d.repec.org/n?u=RePEc:thk:wpaper:inetwp138

Past, Present and Future of the Spanish Labour Market: When the Pandemic meets the Megatrends

By:	Juan J Dolado; Florentino Felgueroso; Juan F. Jimeno
Abstract:	This paper reviews the experience so far of the Spanish labour market during the Covid‐19 crisis in the light of the existing institutions, its performance during past recessions, and the policy measures adopted during the pandemic. Emphasis is placed on the role of worldwide trends in labour markets, due to automation and AI, in shaping a potential recovery of this (hopefully) transitory shock through a big reallocation process of employment and economic activity. It also highlights some innovations to employment and social policies needed to smooth the reallocation process and lessen the rise in inequality associated to technological trends.
Date:	2020–11
URL:	https://d.repec.org/n?u=RePEc:fda:fdaeee:eee2020-37

Sentiment Diffusion in Financial News Networks and Associated Market Movements

By:	Xingchen Wan; Jie Yang; Slavi Marinov; Jan-Peter Calliess; Stefan Zohren; Xiaowen Dong
Abstract:	In an increasingly connected global market, news sentiment towards one company may not only indicate its own market performance, but can also be associated with a broader movement on the sentiment and performance of other companies from the same or even different sectors. In this paper, we apply NLP techniques to understand news sentiment of 87 companies among the most reported on Reuters for a period of seven years. We investigate the propagation of such sentiment in company networks and evaluate the associated market movements in terms of stock price and volatility. Our results suggest that, in certain sectors, strong media sentiment towards one company may indicate a significant change in media sentiment towards related companies measured as neighbours in a financial network constructed from news co-occurrence. Furthermore, there exists a weak but statistically significant association between strong media sentiment and abnormal market return as well as volatility. Such an association is more significant at the level of individual companies, but nevertheless remains visible at the level of sectors or groups of companies.
Date:	2020–11
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2011.06430

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.