nep-big 2020-05-25 papers

on Big Data

Issue of 2020‒05‒25
eighteen papers chosen by
Tom Coupé
University of Canterbury

Machine Learning Econometrics: Bayesian algorithms and methods By Dimitris Korobilis; Davide Pettenuzzo
The economics of the German investigation of Facebook's data collection By Budzinski, Oliver; Gruésevaja, Marina; Noskova, Victoriia
The trainer, the verifier, the imitator: Three ways in which human platform workers support artificial intelligence By Paola Tubaro; Antonio Casilli; Marion Coville
A Machine Learning Approach for Flagging Incomplete Bid-rigging Cartels By Wallimann, Hannes; Imhof, David; Huber, Martin
Tree-based Synthetic Control Methods: Consequences of moving the US Embassy By Nicolaj N. Mühlbach
Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium By Cockx, Bart; Lehner, Michael; Bollens, Joost
Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium By Cockx, Bart; Lehner, Michael; Bollens, Joost
Causal mediation analysis with double machine learning By Farbmacher, Helmut; Huber, Martin; Langen, Henrika; Spindler, Martin
Crowdsourced production of AI Training Data: How human workers teach self-driving cars how to see By Schmidt, Florian Alexander
Fast and Accurate Variational Inference for Models with Many Latent Variables By Rubén Loaiza-Maya; Michael Stanley Smith; David J. Nott; Peter J. Danaher
Smart Data und Künstliche Intelligenz: Technologie, Arbeit, Akzeptanz By Kaiser, Oliver S.; Malanowski, Norbert
Corruption in the times of Pandemia By Jorge Gallego; Mounu Prem; Juan F. Vargas
Know Your Clients' behaviours: a cluster analysis of financial transactions By John R. J. Thompson; Longlong Feng; R. Mark Reesor; Chuck Grace
On health and privacy: technology to combat the pandemic By Carlos Cantú; Gong Cheng; Sebastian Doerr; Jon Frost; Leonardo Gambacorta
Public Concern and the Financial Markets during the COVID-19 outbreak By Michele Costola; Matteo Iacopini; Carlo R. M. A. Santagiustina
Targeting predictors in random forest regression By Daniel Borup; Bent Jesper Christensen; Nicolaj N. Mühlbach; Mikkel S. Nielsen
Media persuasion through slanted language: Evidence from the coverage of immigration By Milena Djourelova
Assessing the 'digital divide' and its regional determinants: Evidence from a web-scraping analysis By Thonipara, Anita; Sternberg, Rolf G.; Proeger, Till; Haefner, Lukas

Machine Learning Econometrics: Bayesian algorithms and methods

By:	Dimitris Korobilis; Davide Pettenuzzo
Abstract:	As the amount of economic and other data generated worldwide increases vastly, a challenge for future generations of econometricians will be to master efficient algorithms for inference in empirical models with large information sets. This Chapter provides a review of popular estimation algorithms for Bayesian inference in econometrics and surveys alternative algorithms developed in machine learning and computing science that allow for efficient computation in high-dimensional settings. The focus is on scalability and parallelizability of each algorithm, as well as their ability to be adopted in various empirical settings in economics and finance.
Keywords:	MCMC, approximate inference, scalability, parallel computation
JEL:	C11 C15 C49 C88
Date:	2020–04
URL:	http://d.repec.org/n?u=RePEc:gla:glaewp:2020_09&r=all

The economics of the German investigation of Facebook's data collection

By:	Budzinski, Oliver; Gruésevaja, Marina; Noskova, Victoriia
Abstract:	The importance of digital platforms and related data-driven business models is ever increasing and poses challenges for the workability of competition in the respective markets (tendencies towards dominant platforms, paying-with-data instead of traditional money, privacy concerns, etc.). Due to such challenges, investigations of such markets are of high interest. One recent case is the investigation of Facebook's data collection practices by German competition authorities. Our paper, in contrast to the wide stream of legal studies on this case, aims to analyze whether Facebook's practices regarding data collection could constitute an abuse of market power from an economic perspective, more specifically against the background of modern data economics. In doing so we summarize the state of the advanced theories, including influences from behavioral economics, addressing such markets, and discuss four potential theories of harm.
Keywords:	data economics,big data,economics of privacy,competition,Facebook case,paying-with-data,abuse of dominance,market power,digital economy
JEL:	K21 L41 L86 L12 M21 L14 K42
Date:	2020
URL:	http://d.repec.org/n?u=RePEc:zbw:tuiedp:139&r=all

The trainer, the verifier, the imitator: Three ways in which human platform workers support artificial intelligence

By:	Paola Tubaro (CNRS - Centre National de la Recherche Scientifique, TAU - TAckling the Underspecified - LRI - Laboratoire de Recherche en Informatique - UP11 - Université Paris-Sud - Paris 11 - CentraleSupélec - CNRS - Centre National de la Recherche Scientifique - Inria Saclay - Ile de France - Inria - Institut National de Recherche en Informatique et en Automatique, LRI - Laboratoire de Recherche en Informatique - UP11 - Université Paris-Sud - Paris 11 - CentraleSupélec - CNRS - Centre National de la Recherche Scientifique); Antonio Casilli (I3, une unité mixte de recherche CNRS (UMR 9217) - Institut interdisciplinaire de l’innovation - X - École polytechnique - Télécom ParisTech - MINES ParisTech - École nationale supérieure des mines de Paris - CNRS - Centre National de la Recherche Scientifique, Télécom ParisTech, IMT - Institut Mines-Télécom [Paris]); Marion Coville (Université de Poitiers)
Abstract:	This paper sheds light on the role of digital platform labour in the development of today's artificial intelligence, predicated on data-intensive machine learning algorithms. Focus is on the specific ways in which outsourcing of data tasks to myriad 'micro-workers', recruited and managed through specialized platforms, powers virtual assistants, self-driving vehicles and connected objects. Using qualitative data from multiple sources, we show that micro-work performs a variety of functions, between three poles that we label, respectively, 'artificial intelligence preparation', 'artificial intelligence verification' and 'artificial intelligence impersonation'. Because of the wide scope of application of micro-work, it is a structural component of contemporary artificial intelligence production processes - not an ephemeral form of support that may vanish once the technology reaches maturity stage. Through the lens of micro-work, we prefigure the policy implications of a future in which data technologies do not replace human workforce but imply its marginalization and precariousness.
Keywords:	Digital platform labour,micro-work,datafied production processes,artificial intelligence,machine learning
Date:	2020–04
URL:	http://d.repec.org/n?u=RePEc:hal:journl:hal-02554196&r=all

A Machine Learning Approach for Flagging Incomplete Bid-rigging Cartels

By:	Wallimann, Hannes (Faculty of Economics and Social Sciences); Imhof, David; Huber, Martin
Abstract:	We propose a new method for flagging bid rigging, which is particularly useful for detecting incomplete bid-rigging cartels. Our approach combines screens, i.e. statistics derived from the distribution of bids in a tender, with machine learning to predict the probability of collusion. As a methodological innovation, we calculate such screens for all possible subgroups of three or four bids within a tender and use summary statistics like the mean, median, maximum, and minimum of each screen as predictors in the machine learning algorithm. This approach tackles the issue that competitive bids in incomplete cartels distort the statistical signals produced by bid rigging. We demonstrate that our algorithm outperforms previously suggested methods in applications to incomplete cartels based on empirical data from Switzerland.
Keywords:	Bid rigging detection; screening methods; descriptive statistics; machine learning; random forest; lasso; ensemble methods
JEL:	C21 C45 C52 D22 D40 K40
Date:	2020–04–01
URL:	http://d.repec.org/n?u=RePEc:fri:fribow:fribow00513&r=all
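
Editor's illustration: the subgroup step described in the abstract above can be sketched as follows. This is a minimal sketch, not the authors' implementation. The abstract does not say which screens are used, so the coefficient of variation is an assumed example of a screen, and the function names and the sample tender are hypothetical. The resulting dictionary is the kind of feature vector that could then be passed to a machine learning classifier.

```python
from itertools import combinations
from statistics import mean, median, pstdev


def coefficient_of_variation(bids):
    """Assumed example screen: population std. dev. of the bids over their mean."""
    return pstdev(bids) / mean(bids)


def subgroup_screen_summary(bids, sizes=(3, 4), screen=coefficient_of_variation):
    """Evaluate the screen on every subgroup of the given sizes, then summarise."""
    values = [
        screen(group)
        for k in sizes
        for group in combinations(bids, k)  # empty when k exceeds len(bids)
    ]
    if not values:
        raise ValueError("tender has fewer bids than the smallest subgroup size")
    return {
        "mean": mean(values),
        "median": median(values),
        "max": max(values),
        "min": min(values),
    }


# Hypothetical tender: three tightly clustered bids plus two outsiders.
tender = [100.0, 101.0, 102.0, 120.0, 135.0]
features = subgroup_screen_summary(tender)
```

In this toy tender the minimum of the screen comes from the three clustered bids, which is the signal that a whole-tender statistic would dilute once the two outside bids are included.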

Tree-based Synthetic Control Methods: Consequences of moving the US Embassy

By:	Nicolaj N. Mühlbach (Aarhus University and CREATES)
Abstract:	We recast the synthetic controls for evaluating policies as a counterfactual prediction problem and replace its linear regression with a nonparametric model inspired by machine learning. The proposed method enables us to achieve more accurate counterfactual predictions. We apply our method to a highly-debated policy: the move of the US embassy to Jerusalem. In Israel and Palestine, we find that the average number of weekly conflicts has increased by roughly 103% over 48 weeks since the move was announced on December 6, 2017. Using conformal inference and placebo tests, we justify our model and find the increase to be statistically significant.
Keywords:	Treatment effects, Program evaluation, Synthetic control, Machine learning, US embassy move
JEL:	C14 C21 C54 D02 D74 F51
Date:	2020–05–19
URL:	http://d.repec.org/n?u=RePEc:aah:create:2020-04&r=all

Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium

By:	Cockx, Bart; Lehner, Michael; Bollens, Joost
Abstract:	Based on administrative data of unemployed in Belgium, we estimate the labour market effects of three training programmes at various aggregation levels using Modified Causal Forests, a causal machine learning estimator. While all programmes have positive effects after the lock-in period, we find substantial heterogeneity across programmes and unemployed. Simulations show that “black-box” rules that reassign unemployed to programmes that maximise estimated individual gains can considerably improve effectiveness: up to 20% more (less) time spent in (un)employment within a 30 months window. A shallow policy tree delivers a simple rule that realizes about 70% of this gain.
JEL:	J68
Date:	2020–05–12
URL:	http://d.repec.org/n?u=RePEc:unm:umagsb:2020015&r=all

Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium

By:	Cockx, Bart; Lehner, Michael; Bollens, Joost
Abstract:	Based on administrative data of unemployed in Belgium, we estimate the labour market effects of three training programmes at various aggregation levels using Modified Causal Forests, a causal machine learning estimator. While all programmes have positive effects after the lock-in period, we find substantial heterogeneity across programmes and unemployed. Simulations show that “black-box” rules that reassign unemployed to programmes that maximise estimated individual gains can considerably improve effectiveness: up to 20% more (less) time spent in (un)employment within a 30 months window. A shallow policy tree delivers a simple rule that realizes about 70% of this gain.
JEL:	J68
Date:	2020–05–11
URL:	http://d.repec.org/n?u=RePEc:unm:umaror:2020006&r=all

Causal mediation analysis with double machine learning

By:	Farbmacher, Helmut (Max Planck Society); Huber, Martin; Langen, Henrika (Faculty of Economics and Social Sciences); Spindler, Martin (Universität Hamburg)
Abstract:	This paper combines causal mediation analysis with double machine learning to control for observed confounders in a data-driven way under a selection-on-observables assumption in a high-dimensional setting. We consider the average indirect effect of a binary treatment operating through an intermediate variable (or mediator) on the causal path between the treatment and the outcome, as well as the unmediated direct effect. Estimation is based on efficient score functions, which possess a multiple robustness property w.r.t. misspecifications of the outcome, mediator, and treatment models. This property is key for selecting these models by double machine learning, which is combined with data splitting to prevent overfitting in the estimation of the effects of interest. We demonstrate that the direct and indirect effect estimators are asymptotically normal and root-n consistent under specific regularity conditions and investigate the finite sample properties of the suggested methods in a simulation study when considering lasso as machine learner. We also provide an empirical application to the U.S. National Longitudinal Survey of Youth, assessing the indirect effect of health insurance coverage on general health operating via routine checkups as mediator, as well as the direct effect. We find a moderate short term effect of health insurance coverage on general health which is, however, not mediated by routine checkups.
Keywords:	Mediation; direct and indirect effects; causal mechanisms; double machine learning; efficient score
JEL:	C21
Date:	2020–05–01
URL:	http://d.repec.org/n?u=RePEc:fri:fribow:fribow00515&r=all

Crowdsourced production of AI Training Data: How human workers teach self-driving cars how to see

By:	Schmidt, Florian Alexander
Abstract:	Since 2017 the automotive industry has developed a high demand for ground truth data. Without this data, the ambitious goal of producing fully autonomous vehicles will remain out of reach. The self-driving car depends on self-learning algorithms, which in turn have to undergo a lot of supervised training. This requires vast amounts of manual labour, performed by crowdworkers across the globe. As a consequence, the demand for training data is transforming the crowdsourcing industry. This study is an investigation into the dynamics of this shift and its impacts on the working conditions of the crowdworkers.
Keywords:	crowdworking,artificial Intelligence,self-driving cars,automotive industry,global labour markets,AI
Date:	2019
URL:	http://d.repec.org/n?u=RePEc:zbw:hbsfof:155&r=all

Fast and Accurate Variational Inference for Models with Many Latent Variables

By:	Rubén Loaiza-Maya; Michael Stanley Smith; David J. Nott; Peter J. Danaher
Abstract:	Models with a large number of latent variables are often used to fully utilize the information in big or complex data. However, they can be difficult to estimate using standard approaches, and variational inference methods are a popular alternative. Key to the success of these is the selection of an approximation to the target density that is accurate, tractable and fast to calibrate using optimization methods. Mean field or structured Gaussian approximations are common, but these can be inaccurate and slow to calibrate when there are many latent variables. Instead, we propose a family of tractable variational approximations that are more accurate and faster to calibrate for this case. The approximation is a parsimonious copula model for the parameter posterior, combined with the exact conditional posterior of the latent variables. We derive a simplified expression for the re-parameterization gradient of the variational lower bound, which is the main ingredient of efficient optimization algorithms used to implement variational estimation. We illustrate using two substantive econometric examples. The first is a nonlinear state space model for U.S. inflation. The second is a random coefficients tobit model applied to a rich marketing dataset with one million sales observations from a panel of 10,000 individuals. In both cases, we show that our approximating family is faster to calibrate than either mean field or structured Gaussian approximations, and that the gains in posterior estimation accuracy are considerable.
Date:	2020–05
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2005.07430&r=all

Smart Data und Künstliche Intelligenz: Technologie, Arbeit, Akzeptanz

By:	Kaiser, Oliver S.; Malanowski, Norbert
Abstract:	Innovative technologies are an important topic for co-determination actors, who need scientific expertise on them in order to help shape their use. The working paper serves above all to identify relevant trends early and to prepare them for a forward-looking design of innovation and technology in the triad of technology, people and organisation. On the one hand, it addresses Smart Data and artificial intelligence in the context of technology, work and acceptance. On the other hand, it discusses initial food for thought on innovation, industrial and labour policy for making use of the available scope for shaping these developments.
Keywords:	Smart Data,artificial intelligence,digitalization,Big Data,automation
Date:	2019
URL:	http://d.repec.org/n?u=RePEc:zbw:hbsfof:136&r=all

Corruption in the times of Pandemia

By:	Jorge Gallego; Mounu Prem; Juan F. Vargas
Abstract:	The public health crisis caused by the COVID-19 pandemic, coupled with the subsequent economic emergency and social turmoil, has pushed governments to substantially and swiftly increase spending. Because of the pressing nature of the crisis, public procurement rules and procedures have been relaxed in many places in order to expedite transactions. However, this may also create opportunities for corruption. Using contract-level information on public spending from Colombia’s e-procurement platform, and a difference-in-differences identification strategy, we find that municipalities classified by a machine learning algorithm as traditionally more prone to corruption react to the pandemic-led spending surge by using a larger proportion of discretionary non-competitive contracts and increasing their average value. This is especially so in the case of contracts to procure crisis-related goods and services. Our evidence suggests that large negative shocks that require fast and massive spending may increase corruption, thus at least partially offsetting the mitigating effects of this fiscal instrument.
Keywords:	Corruption, COVID-19, Public procurement, Machine learning
JEL:	H57 H75 D73 I18
Date:	2020–05–14
URL:	http://d.repec.org/n?u=RePEc:col:000518:018164&r=all

Know Your Clients' behaviours: a cluster analysis of financial transactions

By:	John R. J. Thompson; Longlong Feng; R. Mark Reesor; Chuck Grace
Abstract:	In Canada, financial advisors and dealers are required by provincial securities commissions, and by the self-regulatory organizations charged with direct regulation over investment dealers and mutual fund dealers, respectively, to collect and maintain Know Your Client (KYC) information, such as age or risk tolerance, for investor accounts. With this information, investors, under their advisor's guidance, make decisions on their investments which are presumed to be beneficial to their investment goals. Our unique dataset is provided by a financial investment dealer with over 50,000 accounts for over 23,000 clients. We use a modified behavioural finance recency, frequency, monetary model for engineering features that quantify investor behaviours, and machine learning clustering algorithms to find groups of investors that behave similarly. We show that the KYC information collected does not explain client behaviours, whereas trade and transaction frequency and volume are most informative. We believe the results shown herein encourage financial regulators and advisors to use more advanced metrics to better understand and predict investor behaviours.
Date:	2020–05
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2005.03625&r=all
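
Editor's illustration: the recency, frequency, monetary (RFM) feature step mentioned in the abstract above might look like the sketch below. It shows a plain RFM computation, not the authors' modified model, and it stops before any clustering. The tuple layout, the function name, the use of absolute amounts for the monetary value, and the sample transactions are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def rfm_features(transactions, as_of):
    """Return {client_id: (recency_days, frequency, monetary)}.

    transactions: iterable of (client_id, trade_date, signed_amount) tuples.
    Monetary value is taken here as the sum of absolute amounts (an assumption).
    """
    last_trade = {}
    counts = defaultdict(int)
    totals = defaultdict(float)
    for client_id, trade_date, amount in transactions:
        # Track the most recent trade date per client.
        if client_id not in last_trade or trade_date > last_trade[client_id]:
            last_trade[client_id] = trade_date
        counts[client_id] += 1
        totals[client_id] += abs(amount)
    return {
        cid: ((as_of - last_trade[cid]).days, counts[cid], totals[cid])
        for cid in last_trade
    }


# Hypothetical transactions: (client id, date, signed amount).
sample = [
    ("A", date(2020, 1, 10), 500.0),
    ("A", date(2020, 3, 1), -200.0),
    ("B", date(2019, 12, 5), 1500.0),
]
features = rfm_features(sample, as_of=date(2020, 3, 31))
```

Each client's triple could then be standardised and passed to a clustering algorithm of the reader's choice; the abstract does not specify which algorithms the authors use.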

On health and privacy: technology to combat the pandemic

By:	Carlos Cantú; Gong Cheng; Sebastian Doerr; Jon Frost; Leonardo Gambacorta
Abstract:	Technology has been harnessed in the fight against the Covid-19 pandemic, eg to administer remote medical consultations, analyse aggregate movements and track paths of contact. Successful applications are predicated on broad public support. They must address concerns about data privacy, and the potential for misuse of data by governments and companies. Transparent public policies and clear governance frameworks can help to build trust. One possible approach is to differentiate data use during a pandemic and in normal times.
Date:	2020–05–19
URL:	http://d.repec.org/n?u=RePEc:bis:bisblt:17&r=all

Public Concern and the Financial Markets during the COVID-19 outbreak

By:	Michele Costola; Matteo Iacopini; Carlo R. M. A. Santagiustina
Abstract:	We measure the public concern during the outbreak of COVID-19 disease using three data sources from Google Trends (YouTube, Google News, and Google Search). Our findings are three-fold. First, the public concern in Italy is found to be a driver of the concerns in other countries. Second, we document that Google Trends data for Italy explain the stock index returns of France, Germany, Great Britain, the United States, and Spain better than their own country-based indicators. Finally, we perform a time-varying analysis and identify that the most severe impacts in the financial markets occur at each step of the Italian lock-down process.
Date:	2020–05
URL:	http://d.repec.org/n?u=RePEc:arx:papers:2005.06796&r=all

Targeting predictors in random forest regression

By:	Daniel Borup (Aarhus University, CREATES and the Danish Finance Institute (DFI)); Bent Jesper Christensen (Aarhus University, CREATES and the Dale T. Mortensen Center); Nicolaj N. Mühlbach (Aarhus University and CREATES); Mikkel S. Nielsen (Columbia University)
Abstract:	Random forest regression (RF) is an extremely popular tool for the analysis of high-dimensional data. Nonetheless, its benefits may be lessened in sparse settings, due to weak predictors, and a pre-estimation dimension reduction (targeting) step is required. We show that proper targeting controls the probability of placing splits along strong predictors, thus providing an important complement to RF’s feature sampling. This is supported by simulations using representative finite samples. Moreover, we quantify the immediate gain from targeting in terms of increased strength of individual trees. Macroeconomic and financial applications show that the bias-variance tradeoff implied by targeting, due to increased correlation among trees in the forest, is balanced at a medium degree of targeting, selecting the best 10–30% of commonly applied predictors. Improvements in predictive accuracy of targeted RF relative to ordinary RF are considerable, up to 12–13%, occurring both in recessions and expansions, particularly at long horizons.
Keywords:	Random forests, LASSO, high-dimensional forecasting, weak predictors, targeted predictors
JEL:	C53 C55 E17 G12
Date:	2020–05–14
URL:	http://d.repec.org/n?u=RePEc:aah:create:2020-03&r=all

Media persuasion through slanted language: Evidence from the coverage of immigration

By:	Milena Djourelova
Abstract:	Can the language used by mass media to cover policy relevant issues affect readers' policy preferences? I examine this question for the case of immigration, exploiting an abrupt ban on the term "illegal immigrant" in wire content distributed to media outlets by the Associated Press (AP). Using text data on AP dispatches and the content of a large number of US print and online outlets, I find that articles mentioning "illegal immigrant" decline by 28% in outlets that rely on AP relative to others. This change in language appears to have had a tangible impact on readers' views on immigration. Following AP's ban, individuals exposed to outlets relying more heavily on AP tend to support less restrictive immigration and border security policies. The effect is driven by frequent readers and does not apply to views on issues other than immigration.
Keywords:	mass media, media slant, Framing, Immigration
JEL:	D72 L82 Z13
Date:	2020–05
URL:	http://d.repec.org/n?u=RePEc:upf:upfgen:1720&r=all

Assessing the 'digital divide' and its regional determinants: Evidence from a web-scraping analysis

By:	Thonipara, Anita; Sternberg, Rolf G.; Proeger, Till; Haefner, Lukas
Abstract:	Following the 'death of distance' postulate, digitization may reduce or even eliminate the penalty of firms being located in rural areas compared with those in urban agglomerations. Despite many recent attempts to measure digitization effects across space, there remains a lack of empirical evidence regarding the adoption of digital technologies from an explicit spatial perspective, i.e. comparing urban with rural areas. Using web-scraping data for a representative sample of 345,000 German firms, we analyze the determinants of homepage usage. Accordingly, we show that homepage usage - as a proxy for the degree of digitization of the respective firm - is highly dependent on location, whereby firms in urban areas are more than twice as likely to use webpages as those located in rural areas. Our county-level analysis shows that a high population density, young population, net gains in internal migration, high educational level and high firm-specific revenues have a positive and significant effect on the probability that firms conduct digital marketing using webpages. Access to broadband internet has a positive effect in rural areas. There are no differences between urban, suburban and rural areas in terms of webpage up-to-dateness as well as social media usage. We conclude that there is a substantial digital divide in online marketing and discuss policy implications.
Keywords:	digital divide,digitization,Germany,rural,Small and Medium-Sized Enterprises,urban,web-scraping
JEL:	D22 L22 L26
Date:	2020
URL:	http://d.repec.org/n?u=RePEc:zbw:ifhwps:252020&r=all

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.