nep-big 2019-05-20 papers

on Big Data

Issue of 2019–05–20
eighteen papers chosen by
Tom Coupé, University of Canterbury

Job automation risk, economic structure and trade: a European perspective By Foster-McGregor, Neil; Nomaler, Önder; Verspagen, Bart
Selling Data By Carlos Segura-Rodriguez
Like it or not? The impact of online platforms on the productivity of incumbent service By Alberto Bailin Rivares; Peter Gal; Valentine Millot; Stéphane Sorbe
Of trackers and tractors. Using a smartphone app and compositional data analysis to explore the link between mechanization and intra-household allocation of time in Zambia By Daum, Thomas; Capezzone, Filippo; Birner, Regina
Credit Risk Analysis using Machine and Deep Learning models By Peter Addo; Dominique Guegan; Bertrand Hassani
Credit Risk Analysis Using Machine and Deep Learning Models By Dominique Guegan; Peter Addo; Bertrand Hassani
The European venture capital landscape: an EIF perspective. Volume V: The economic impact of VC investments supported by the EIF By Pavlova, Elitsa; Signore, Simone
A Stock Selection Method Based on Earning Yield Forecast Using Sequence Prediction Models By Jessie Sun
Improving Regression-based Event Study Analysis Using a Topological Machine-learning Method By Takashi Yamashita; Ryozo Miura
Automating Response Evaluation for Franchising Questions on the 2017 Economic Census By Joseph Staudt; Yifang Wei; Lisa Singh; Shawn D. Klimek; J. Bradford Jensen; Andrew L. Baer
(MARTINGALE) OPTIMAL TRANSPORT AND ANOMALY DETECTION WITH NEURAL NETWORKS: A PRIMAL-DUAL ALGORITHM By Pierre Henry-Labordère
Unconventional Exchange: Methods for Statistical Analysis of Virtual Goods By Oliver James Scholten; Peter Cowling; Kenneth A. Hawick; James Alfred Walker
Deep Haar Scattering Networks in Unidimensional Pattern Recognition Problems By Fernando Fernandes Neto; Claudio Garcia; Rodrigo de Losso da Silveira Bueno; Pedro Delano Cavalcanti; Alemayehu Solomon Admas
Automated Linking of Historical Data By Ran Abramitzky; Leah Platt Boustan; Katherine Eriksson; James J. Feigenbaum; Santiago Pérez
What is the state of the manufacturing sector in Mozambique? By Schou Soren; Fisker Peter
Financial Stability and the Fed: Evidence from Congressional Hearings By Arina Wischnewsky; David-Jan Jansen; Matthias Neuenkirch
Tax-motivated transfer mispricing in South Africa: Direct evidence using transaction data By Wier Ludvig
Migration Fear, Uncertainty, and Macroeconomic Dynamics By Michael Donadelli; Luca Gerotto; Marcella Lucchetta; Daniela Arzu

Job automation risk, economic structure and trade: a European perspective

By:	Foster-McGregor, Neil (UNU-MERIT); Nomaler, Önder (UNU-MERIT); Verspagen, Bart (UNU-MERIT)
Abstract:	Recent studies report that technological developments in machine learning and artificial intelligence present a significant risk to jobs in advanced countries. We re-estimate automation risk at the job level, finding sectoral employment structure to be key in determining automation risk at the country level. At the country level, we find a negative relationship between automation risk and labour productivity. We then analyse the role of trade as a factor leading to structural changes and consider the effect of trade on aggregate automation risk by comparing automation risk between a hypothetical autarky and the actual situation. Results indicate that trade increases automation risk in Europe, although moderately so. European countries with high labour productivity see automation risk increase due to trade, with trade between European and non-European nations driving these results. This implies that the high productivity countries do not, on balance, offshore automation risk, but rather import it.
Keywords:	Automation risk for employment, Industry 4.0, Globalisation, Global Value Chains
JEL:	F16 O33 J24
Date:	2019–04–08
URL:	https://d.repec.org/n?u=RePEc:unm:unumer:2019011

Selling Data

By:	Carlos Segura-Rodriguez (Department of Economics, University of Pennsylvania)
Abstract:	I study how a monopolist data broker (seller), who wants to maximize profits, should present and sell consumer data to a firm (buyer). The buyer has an interest in forecasting a particular consumer characteristic, but the seller is uncertain about which characteristic the buyer wants to forecast and how much the buyer values information. I assume that the joint distribution of both the unknown characteristics and the data is elliptical. This information environment reduces to a multidimensional, multi-product mechanism design problem in which the buyer’s payoffs are nonlinear. Hence, I cannot use the common differential approach to solve for the optimal mechanism. I obtain two main results. First, I show that the seller should optimally offer statistics that are linear combinations of the data and independent noise. Second, by using a direct approach, I show that in the optimal mechanism the seller might want to offer a continuum of different statistics, and these statistics, without containing independent noise, are less correlated than they would be if the seller could perfectly price discriminate. Thus this distortion affects the mimicking type more than the mimicked type.
Keywords:	Information Design, Mechanism Design, Multidimensional Screening, Product Design, Elliptical Distribution
JEL:	D42 D82 D83 D86
Date:	2019–04–21
URL:	https://d.repec.org/n?u=RePEc:pen:papers:19-006

Like it or not? The impact of online platforms on the productivity of incumbent service

By:	Alberto Bailin Rivares; Peter Gal; Valentine Millot; Stéphane Sorbe
Abstract:	This paper uses a novel empirical approach to assess if the development of online platforms affects the productivity of service firms. We build a proxy measure of platform use across four industries (hotels, restaurants, taxis and retail trade) and ten OECD countries using internet search data from Google Trends, which we link to firm-level data on productivity in these industries. We find that platform development supports the productivity of the average incumbent service firm and also stimulates labour reallocation towards more productive firms in these industries. This may notably reflect that platforms’ user review and rating systems reduce information asymmetries between consumers and service providers, enhancing competition between providers. The effects depend on platform type. “Aggregator” platforms that connect incumbent service providers to consumers tend to push up the productivity of incumbents, while more disruptive platforms that enable new types of providers to compete with them (e.g. home sharing, ride hailing) have on average no significant effect on it. Consistent with this, we find that different platform types affect the profits, mark-ups, employment and wages of incumbent service firms differently. Finally, the productivity gains from platforms are lower when a platform is persistently dominant in its market, suggesting that the contestability of platform markets should be promoted.
Keywords:	competition, digital, google trends, platforms, productivity, services, user rating
JEL:	D24 L13 L80 O33
Date:	2019–05–21
URL:	https://d.repec.org/n?u=RePEc:oec:ecoaaa:1548-en

Of trackers and tractors. Using a smartphone app and compositional data analysis to explore the link between mechanization and intra-household allocation of time in Zambia

By:	Daum, Thomas; Capezzone, Filippo; Birner, Regina
Abstract:	Digital tools may help to study socioeconomic aspects of agricultural development that are difficult to measure such as the effects of new technologies, policies and practices on the intra-household allocation of time. As new technologies, policies and practices may target different crops and tasks, they can affect time-use of men, women, boys and girls differently. Development strategies that overlook such effects can fail or have negative consequences for vulnerable household members. In this paper, the effects of agricultural mechanization on time-use in smallholder farming households in Zambia were investigated. For this, a novel data collection method was used: a pictorial smartphone application that allows real-time recording of time-use to eliminate recall bias. Existing studies analyzing intra-household allocation of resources often focus on adult males and females. This study paid particular attention to boys and girls. The study also addressed seasonal variations. For data analysis, compositional data analysis was used, which yields higher accuracy than univariate analysis by accounting for the co-dependence and sum constraint of time-use data. The study found that women benefit relatively more from mechanization with regard to time-use during land preparation, which leads to gender differentiation; for households using manual labor, such differentiation was not found. There was some evidence that the time "saved" is used for off-farm and domestic work. No negative second-round effects (such as higher labor burdens) during weeding and harvesting/processing and no negative effects on children were found. The study debunks some myths related to gender roles in African smallholder agriculture, opens the field to more studies on technology adoption and time-use and suggests that gender roles are changing with agricultural transformation.
Keywords:	Labor and Human Capital, Production Economics, Research and Development/Tech Change/Emerging Technologies, Research Methods/ Statistical Methods
Date:	2019–05–14
URL:	https://d.repec.org/n?u=RePEc:ags:ubzefd:288434
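
As a rough illustration of the compositional approach described in the abstract above, the sketch below applies a centred log-ratio (clr) transform to one day's time-use record. The abstract does not say which log-ratio transform the authors use, so the clr choice, the function name `clr`, and the sample hours are all assumptions made for illustration only.

```python
import math


def clr(parts):
    """Centred log-ratio transform of a strictly positive composition.

    Illustrative assumption: the abstract does not name the transform
    actually used in the paper.
    """
    if any(p <= 0 for p in parts):
        raise ValueError("clr needs strictly positive parts")
    total = sum(parts)
    # Close the composition so the parts sum to 1 (the sum constraint).
    logs = [math.log(p / total) for p in parts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log expresses each part relative to the others.
    return [value - mean_log for value in logs]


# Hypothetical daily record in hours: farm work, domestic work, everything else.
hours = [8.0, 6.0, 10.0]
coordinates = clr(hours)
```

The transformed coordinates sum to zero and do not change if the record is expressed in minutes or shares instead of hours, which is why standard multivariate methods can be applied to them without violating the 24-hour constraint.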

Credit Risk Analysis using Machine and Deep Learning models

By:	Peter Addo (Lead Data Scientist - SNCF Mobilité); Dominique Guegan (UP1 - Université Panthéon-Sorbonne, Labex ReFi - UP1 - Université Panthéon-Sorbonne, University of Ca’ Foscari [Venice, Italy], CES - Centre d'économie de la Sorbonne - UP1 - Université Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique, IPAG Business School); Bertrand Hassani (Labex ReFi - UP1 - Université Panthéon-Sorbonne, Capgemini Consulting [Paris])
Abstract:	Due to the hyper technology associated with Big Data, data availability and computing power, most banks or lending financial institutions are renewing their business models. Credit risk predictions, monitoring, model reliability and effective loan processing are key to decision making and transparency. In this work, we build binary classifiers based on machine and deep learning models on real data in predicting loan default probability. The top 10 important features from these models are selected and then used in the modelling process to test the stability of binary classifiers by comparing performance on separate data. We observe that tree-based models are more stable than models based on multilayer artificial neural networks. This opens several questions relative to the intensive use of deep learning systems in enterprises.
Keywords:	Deep learning,Data Science,Credit risk,Financial regulation,Bigdata
Date:	2018–02
URL:	https://d.repec.org/n?u=RePEc:hal:journl:halshs-01719983

Credit Risk Analysis Using Machine and Deep Learning Models

By:	Dominique Guegan (UP1 - Université Panthéon-Sorbonne, CES - Centre d'économie de la Sorbonne - UP1 - Université Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique, Labex ReFi - UP1 - Université Panthéon-Sorbonne, IPAG Business School, University of Ca’ Foscari [Venice, Italy]); Peter Addo (AFD - Agence française de développement, Labex ReFi - UP1 - Université Panthéon-Sorbonne); Bertrand Hassani (Labex ReFi - UP1 - Université Panthéon-Sorbonne, CES - Centre d'économie de la Sorbonne - UP1 - Université Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique, Capgemini Consulting [Paris], UCL-CS - Computer science department [University College London] - UCL - University College of London [London])
Abstract:	Due to the advanced technology associated with Big Data, data availability and computing power, most banks or lending institutions are renewing their business models. Credit risk predictions, monitoring, model reliability and effective loan processing are key to decision-making and transparency. In this work, we build binary classifiers based on machine and deep learning models on real data in predicting loan default probability. The top 10 important features from these models are selected and then used in the modeling process to test the stability of binary classifiers by comparing their performance on separate data. We observe that the tree-based models are more stable than the models based on multilayer artificial neural networks. This opens several questions relative to the intensive use of deep learning systems in enterprises.
Keywords:	financial regulation,deep learning,Big data,data science,credit risk
Date:	2018
URL:	https://d.repec.org/n?u=RePEc:hal:journl:halshs-01835164

The European venture capital landscape: an EIF perspective. Volume V: The economic impact of VC investments supported by the EIF

By:	Pavlova, Elitsa; Signore, Simone
Abstract:	This paper examines the impact of venture capital (VC) investments supported by the EIF on the financial growth and performance of young and innovative firms. Using a novel dataset covering European start-ups supported by VC in the years 2007 to 2014, we generate a counterfactual group of non-VC-backed firms through a combination of exact and propensity score matching. To offset the relatively limited set of observables allowed by our data, we estimate treatment propensity using a series of innovative measures based on machine learning, network theory, and satellite imagery analysis. Our results document the positive effects of EIF-supported VC investments on start-up performance, as measured through various financial indicators (e.g. assets, revenue, employment). We find that VC financing enables start-ups to prioritise long-term growth, trading off short- to medium-term profitability if necessary. Overall, our work provides meaningful evidence towards the positive effects of EIF-supported VC investment on the financial growth of young and innovative businesses in Europe.
Keywords:	EIF,venture capital,public intervention,real effects,start-ups,machine learning,geospatial analysis,network theory
JEL:	G24 L25 M13 O38
Date:	2019
URL:	https://d.repec.org/n?u=RePEc:zbw:eifwps:201955

A Stock Selection Method Based on Earning Yield Forecast Using Sequence Prediction Models

By:	Jessie Sun
Abstract:	Long-term investors, unlike short-term traders, focus on examining the underlying forces that affect the well-being of a company. They rely on fundamental analysis, which attempts to measure the intrinsic value of an equity. Quantitative investment researchers have identified some value factors to determine the cost of investment for a stock and compare different stocks. This paper proposes using sequence prediction models to forecast a value factor, the earning yield (EBIT/EV) of a company, for stock selection. Two advanced sequence prediction models, Long Short-term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, are studied. These two models can overcome the inherent problems of a standard Recurrent Neural Network, i.e., vanishing and exploding gradients. This paper first introduces the theory behind the networks and then elaborates on the workflow of stock pool creation, feature selection, data structuring, model setup and model evaluation. The LSTM and GRU models demonstrate superior forecast accuracy over a traditional Feedforward Neural Network model. The GRU model slightly outperformed the LSTM model.
Date:	2019–05
URL:	https://d.repec.org/n?u=RePEc:arx:papers:1905.04842

Improving Regression-based Event Study Analysis Using a Topological Machine-learning Method

By:	Takashi Yamashita; Ryozo Miura
Abstract:	This paper introduces a new correction scheme to a conventional regression-based event study method: a topological machine-learning approach with a self-organizing map (SOM). We use this new scheme to analyze a major market event in Japan and find that the factors of abnormal stock returns can be easily identified and the event-cluster can be depicted. We also find that a conventional event study method involves an empirical analysis mechanism that tends to introduce bias, typically in an event-clustered market situation. We explain our new correction scheme and apply it to an event in the Japanese market: the holding disclosure of the Government Pension Investment Fund (GPIF) on July 31, 2015.
Date:	2019–05
URL:	https://d.repec.org/n?u=RePEc:arx:papers:1905.06536

Automating Response Evaluation for Franchising Questions on the 2017 Economic Census

By:	Joseph Staudt; Yifang Wei; Lisa Singh; Shawn D. Klimek; J. Bradford Jensen; Andrew L. Baer
Abstract:	Between the 2007 and 2012 Economic Censuses (EC), the count of franchise-affiliated establishments declined by 9.8%. One reason for this decline was a reduction in resources that the Census Bureau was able to dedicate to the manual evaluation of survey responses in the franchise section of the EC. Extensive manual evaluation in 2007 resulted in many establishments, whose survey forms indicated they were not franchise-affiliated, being recoded as franchise-affiliated. No such evaluation could be undertaken in 2012. In this paper, we examine the potential of using external data harvested from the web in combination with machine learning methods to automate the process of evaluating responses to the franchise section of the 2017 EC. Our method allows us to quickly and accurately identify and recode establishments that have been mistakenly classified as not being franchise-affiliated, increasing the unweighted number of franchise-affiliated establishments in the 2017 EC by 22%-42%.
JEL:	C81 L8
Date:	2019–05
URL:	https://d.repec.org/n?u=RePEc:nbr:nberwo:25818

(MARTINGALE) OPTIMAL TRANSPORT AND ANOMALY DETECTION WITH NEURAL NETWORKS: A PRIMAL-DUAL ALGORITHM

By:	Pierre Henry-Labordère (Societe Generale - Société Générale)
Abstract:	In this paper, we introduce a primal-dual algorithm for solving (martingale) optimal transportation problems, with cost functions satisfying the twist condition, close to the one that has been used recently for training generative adversarial networks. As some additional applications, we consider anomaly detection and automatic generation of financial data.
Date:	2019–04–10
URL:	https://d.repec.org/n?u=RePEc:hal:wpaper:hal-02095222

Unconventional Exchange: Methods for Statistical Analysis of Virtual Goods

By:	Oliver James Scholten; Peter Cowling; Kenneth A. Hawick; James Alfred Walker
Abstract:	Hyperinflation and price volatility in virtual economies have the potential to reduce player satisfaction and decrease developer revenue. This paper describes intuitive analytical methods for monitoring volatility and inflation in virtual economies, with worked examples on the increasingly popular multiplayer game Old School Runescape. Analytical methods drawn from mainstream financial literature are outlined and applied in order to present a high-level overview of virtual economic activity of 3467 price series over 180 trading days. Six-monthly volume data for the top 100 most traded items is also used both for monitoring and value estimation, giving a conservative estimate of exchange trading volume of over £60m in real value. Our worked examples show results from a well-functioning virtual economy to act as a benchmark for future work. This work contributes to the growing field of virtual economics and game development, describing how data transformations and statistical tests can be used to improve virtual economic design and analysis, with applications in real-time monitoring systems.
Date:	2019–05
URL:	https://d.repec.org/n?u=RePEc:arx:papers:1905.06721

Deep Haar Scattering Networks in Unidimensional Pattern Recognition Problems

By:	Fernando Fernandes Neto; Claudio Garcia; Rodrigo de Losso da Silveira Bueno; Pedro Delano Cavalcanti; Alemayehu Solomon Admas
Abstract:	The aim of this paper is to discuss the use of Haar scattering networks, a very simple architecture that naturally supports a large number of stacked layers, yet with very few parameters, in a relatively broad set of pattern recognition problems, including regression and classification tasks. This architecture basically consists of stacking convolutional filters, which can be thought of as a generalization of Haar wavelets, followed by nonlinear operators which aim to extract symmetries and invariances that are later fed into a classification/regression algorithm. We show that good results can be obtained with the proposed method for both kinds of tasks. We outperformed the best available algorithms in 4 out of 18 important data classification problems, and obtained a more robust performance than ARIMA and ETS time series methods in regression problems for data with invariances and symmetries, with desirable features, such as the possibility to evaluate parameter stability and easy structural assessment.
Keywords:	Haar Scattering Network; Pattern Recognition; Classification; Regression; Time Series.
JEL:	C38 C45 C52 C63
Date:	2019–05–07
URL:	https://d.repec.org/n?u=RePEc:spa:wpaper:2019wpecon16
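
To make the stacking described in the abstract above more concrete, here is a minimal sketch of a one-dimensional Haar scattering cascade: each layer pairs neighbouring coefficients and keeps their sum (the Haar low-pass filter) and the absolute value of their difference (the Haar high-pass filter followed by a modulus nonlinearity). This is the textbook Haar scattering recursion, assumed here for illustration; the abstract does not spell out the authors' exact architecture, and the function names are hypothetical.

```python
def haar_scattering_layer(channels):
    """Apply one Haar scattering layer to a list of channels.

    Every channel is split into adjacent pairs; the pair sums and the
    absolute pair differences become two new channels of half the length.
    """
    next_channels = []
    for channel in channels:
        pairs = [(channel[i], channel[i + 1]) for i in range(0, len(channel) - 1, 2)]
        next_channels.append([a + b for a, b in pairs])       # sums
        next_channels.append([abs(a - b) for a, b in pairs])  # |differences|
    return next_channels


def haar_scattering(signal, depth):
    """Stack `depth` layers; the signal length must be divisible by 2**depth."""
    if depth < 0 or len(signal) % (2 ** depth) != 0:
        raise ValueError("signal length must be divisible by 2**depth")
    channels = [list(signal)]
    for _ in range(depth):
        channels = haar_scattering_layer(channels)
    return channels


# Hypothetical toy signal; the flattened output would feed a classifier or regressor.
features = haar_scattering([1, 3, 2, 2], depth=2)
```

Note that the cascade has no trainable weights: only the depth (and, in more general variants, the choice of pairing) needs to be set, which is consistent with the abstract's remark about very few parameters.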

Automated Linking of Historical Data

By:	Ran Abramitzky; Leah Platt Boustan; Katherine Eriksson; James J. Feigenbaum; Santiago Pérez
Abstract:	The recent digitization of complete count census data is an extraordinary opportunity for social scientists to create large longitudinal datasets by linking individuals from one census to another or from other sources to the census. We evaluate different automated methods for record linkage, performing a series of comparisons across methods and against hand linking. We have three main findings that lead us to conclude that automated methods perform well. First, a number of automated methods generate very low (less than 5%) false positive rates. The automated methods trace out a frontier illustrating the tradeoff between the false positive rate and the (true) match rate. Relative to more conservative automated algorithms, humans tend to link more observations but at a cost of higher rates of false positives. Second, when human linkers and algorithms have the same amount of information, there is relatively little disagreement between them. Third, across a number of plausible analyses, coefficient estimates and parameters of interest are very similar when using linked samples based on each of the different automated methods. We provide code and Stata commands to implement the various automated methods.
JEL:	C81 N0
Date:	2019–05
URL:	https://d.repec.org/n?u=RePEc:nbr:nberwo:25825

What is the state of the manufacturing sector in Mozambique?

By:	Schou Soren; Fisker Peter
Abstract:	The latest firm survey of Mozambique, the Inquerito ás Indústrias Manufactureiras (IIM) 2017, draws a concerning picture of the manufacturing sector. However, it is not obvious whether this is true for the population of manufacturing firms in Mozambique, as the representativeness of the IIM 2017 sample is not clear. This paper triangulates the findings of IIM 2017 by considering other indicators of the health of the manufacturing sector in Mozambique, including manufacturing gross domestic product, the latest enterprise census, and satellite imagery. Results indicate that the manufacturing sector grows at roughly the same pace as the population.
Keywords:	satellite images,census,firm survey,Manufacturing
Date:	2018
URL:	https://d.repec.org/n?u=RePEc:unu:wpaper:wp2018-190

Financial Stability and the Fed: Evidence from Congressional Hearings

By:	Arina Wischnewsky; David-Jan Jansen; Matthias Neuenkirch
Abstract:	This paper retraces how financial stability considerations interacted with U.S. monetary policy before and during the Great Recession. Using text-mining techniques, we construct indicators for financial stability sentiment expressed during testimonies of four Federal Reserve Chairs at Congressional hearings. Including these text-based measures adds explanatory power to Taylor-rule models. In particular, negative financial stability sentiment coincided with a more accommodative monetary policy stance than implied by standard Taylor-rule factors, even in the decades before the Great Recession. These findings are consistent with a preference for monetary policy reacting to financial instability rather than acting pre-emptively to a perceived build-up of risks.
Keywords:	monetary policy, financial stability, Taylor rule, text mining
JEL:	E52 E58 N12
Date:	2019
URL:	https://d.repec.org/n?u=RePEc:trr:wpaper:201908

Tax-motivated transfer mispricing in South Africa: Direct evidence using transaction data

By:	Wier Ludvig
Abstract:	This paper provides the first direct systematic evidence of profit shifting through transfer mispricing in a developing country. Using South African transaction-level customs data, I directly test for transfer price deviations from arm’s-length pricing. I find that multinational firms in South Africa manipulate transfer prices in order to shift taxable profits to low-tax countries. The estimated tax loss is 0.5 per cent of corporate tax payments. My estimates do not support the common belief that transfer mispricing in South Africa is more severe than in advanced economies. I find that an OECD-recommended reform had no long-term impact on transfer mispricing but argue that the method used in this paper provides a cost-efficient way to curb transfer mispricing.
Keywords:	Multinational firms,Profit shifting,Tax,Developing countries,International taxation
Date:	2018
URL:	https://d.repec.org/n?u=RePEc:unu:wpaper:wp2018-123

Migration Fear, Uncertainty, and Macroeconomic Dynamics

By:	Michael Donadelli (Faculty of Economics and Business Administration and Research Center SAFE, Goethe University Frankfurt; Department of Economics, University Of Venice Cà Foscari); Luca Gerotto (University Of Venice Cà Foscari); Marcella Lucchetta (University Of Venice Cà Foscari); Daniela Arzu (University Of Venice Cà Foscari)
Abstract:	This paper examines the effects of changes in immigration-related uncertainty and fear on real economic activity in four advanced economies (i.e., US, UK, Germany and France). Immigration uncertainty/fear is first captured by two news-based indicators developed by Baker et al. (2015), namely the Migration Policy Uncertainty Index (MPUI) and the Migration Fear Index (MFI), and then by a novel Google Trend Migration Uncertainty Index based on the frequency of internet searches for “immigration” (GTMU). VAR investigations suggest that the macroeconomic implications of rising immigration uncertainty/fear depend on the country under examination as well as on the way in which immigration uncertainty/fear is measured. In the US and UK, MPUI, MFI and GTMU shocks induce positive long-run effects on real economic activity. By contrast, in Germany, MPUI and MFI shocks lead to expansionary reactions whereas GTMU shocks generate significant adverse effects on the economy. This suggests that increasing media attention and the population’s rising interest in immigration-related issues affect people’s mood in different ways. In France, MPUI, MFI and GTMU shocks induce negative macroeconomic effects in the long run. A battery of robustness tests confirms our main findings.
Keywords:	Immigration, Uncertainty, Fear, Google Trends, Business Cycle
JEL:	C32 E32
Date:	2018
URL:	https://d.repec.org/n?u=RePEc:ven:wpaper:2018:29

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.