nep-big 2018-09-10 papers

on Big Data

Issue of 2018‒09‒10
twelve papers chosen by
Tom Coupé
University of Canterbury

Profit Efficiency, DEA, FDH and Big Data By Valentin Zelenyuk
Stable Predictions across Unknown Environments By Kuang, Kun; Xiong, Ruoxuan; Cui, Peng; Athey, Susan; Li, Bo
Does Machine Translation Affect International Trade? Evidence from a Large Digital Platform By Erik Brynjolfsson; Xiang Hui; Meng Liu
A Self-Attention Network for Hierarchical Data Structures with an Application to Claims Management By Leander Löw; Martin Spindler; Eike Brechmann
DeepLOB: Deep Convolutional Neural Networks for Limit Order Books By Zihao Zhang; Stefan Zohren; Stephen Roberts
Uniform Inference in High-Dimensional Gaussian Graphical Models By Sven Klaassen; Jannis Kück; Martin Spindler
The Effect of Positive Mood on Cooperation in Repeated Interaction By Proto, Eugenio; Sgroi, Daniel; Nazneen, Mahnaz
An Open and Data-driven Taxonomy of Skills Extracted from Online Job Adverts By Jyldyz Djumalieva; Cath Sleeman
"Negotiating the algorithm" automation, artificial intelligence and labour protection By De Stefano, Valerio.
The roots of inequality: Estimating inequality of opportunity from regression trees By Paolo Brunori; Paul Hufe; Daniel Gerszon Mahler
Detecting and Validating Global Technology Trends Using Quantitative and Expert-Based Foresight Techniques By Ilya Kuzminov; Pavel Bakhtin; Elena Khabirova; Irina V. Loginova
Selection of calibration windows for day-ahead electricity price forecasting By Grzegorz Marcjasz; Tomasz Serafin; Rafal Weron

Profit Efficiency, DEA, FDH and Big Data

By:	Valentin Zelenyuk (CEPA - School of Economics, The University of Queensland)
Abstract:	The goal of this article is to outline a very simple way of estimating profit efficiency in the DEA and FDH frameworks while avoiding the computational burden of linear programming. With this result it is possible to compute profit efficiency even when the number of inputs and outputs is larger than the number of decision-making units (firms, individuals, etc.), as is often the case with 'big data'.
Date:	2018–08
URL:	http://d.repec.org/n?u=RePEc:qld:uqcepa:125&r=big
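
A minimal Python sketch of the idea in this abstract, not the author's code. It assumes non-negative prices and a technology (FDH, or DEA under variable returns to scale) for which the maximal attainable profit is reached at one of the observed units, so a simple maximum over units can stand in for a linear program. The function names, the toy data and the use of a profit gap (rather than a ratio) as the efficiency measure are all illustrative assumptions.

```python
from typing import List, Sequence


def profit(x: Sequence[float], y: Sequence[float],
           w: Sequence[float], p: Sequence[float]) -> float:
    """Revenue minus cost for one unit with inputs x, outputs y,
    input prices w and output prices p."""
    revenue = sum(pj * yj for pj, yj in zip(p, y))
    cost = sum(wi * xi for wi, xi in zip(w, x))
    return revenue - cost


def profit_gaps(inputs: List[Sequence[float]],
                outputs: List[Sequence[float]],
                w: Sequence[float],
                p: Sequence[float]) -> List[float]:
    """Gap between the best observed profit and each unit's own profit.

    Assumption: the maximal profit over the estimated technology is
    attained at an observed unit, so no linear program is solved.
    """
    profits = [profit(x, y, w, p) for x, y in zip(inputs, outputs)]
    best = max(profits)
    return [best - value for value in profits]


# Hypothetical toy data: three units, one input, one output.
example_inputs = [[2.0], [4.0], [3.0]]
example_outputs = [[3.0], [5.0], [3.5]]
example_gaps = profit_gaps(example_inputs, example_outputs, w=[1.0], p=[2.0])
print(example_gaps)
```

The cost of this computation grows linearly in the number of units and in the number of inputs and outputs, which is why it remains feasible when the dimension exceeds the number of units.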

Stable Predictions across Unknown Environments

By:	Kuang, Kun (Tsinghua University); Xiong, Ruoxuan (Stanford University); Cui, Peng (Tsinghua University); Athey, Susan (Stanford University); Li, Bo (Tsinghua University)
Abstract:	In many important machine learning applications, the training distribution used to learn a probabilistic classifier differs from the testing distribution on which the classifier will be used to make predictions. Traditional methods correct the distribution shift by reweighting the training data with the ratio of the density between test and training data. In many applications training takes place without prior knowledge of the testing distribution on which the algorithm will be applied in the future. Recently, methods have been proposed to address the shift by learning causal structure, but those methods rely on the diversity of multiple training datasets to achieve good performance, and have complexity limitations in high dimensions. In this paper, we propose a novel Deep Global Balancing Regression (DGBR) algorithm to jointly optimize a deep auto-encoder model for feature selection and a global balancing model for stable prediction across unknown environments. The global balancing model constructs balancing weights that facilitate estimation of the partial effects of features (holding fixed all other features), a problem that is challenging in high dimensions, and thus helps to identify stable, causal relationships between features and outcomes. The deep auto-encoder model is designed to reduce the dimensionality of the feature space, thus making global balancing easier. We show, both theoretically and with empirical experiments, that our algorithm can make stable predictions across unknown environments. Our experiments on both synthetic and real world datasets demonstrate that our DGBR algorithm outperforms the state-of-the-art methods for stable prediction across unknown environments.
Date:	2018–06
URL:	http://d.repec.org/n?u=RePEc:ecl:stabus:3695&r=big

Does Machine Translation Affect International Trade? Evidence from a Large Digital Platform

By:	Erik Brynjolfsson; Xiang Hui; Meng Liu
Abstract:	Artificial intelligence (AI) is surpassing human performance in a growing number of domains. However, there is limited evidence of its economic effects. Using data from a digital platform, we study a key application of AI: machine translation. We find that the introduction of a machine translation system has significantly increased international trade on this platform, increasing exports by 17.5%. Furthermore, heterogeneous treatment effects are all consistent with a substantial reduction in translation-related search costs. Our results provide causal evidence that language barriers significantly hinder trade and that AI has already begun to improve economic efficiency in at least one domain.
JEL:	D8 F1 F14 O3 O33
Date:	2018–08
URL:	http://d.repec.org/n?u=RePEc:nbr:nberwo:24917&r=big

A Self-Attention Network for Hierarchical Data Structures with an Application to Claims Management

By:	Leander Löw; Martin Spindler; Eike Brechmann
Abstract:	Insurance companies must manage millions of claims per year. While most of these claims are non-fraudulent, fraud detection is core for insurance companies. The ultimate goal is a predictive model to single out the fraudulent claims and pay out the non-fraudulent ones immediately. Modern machine learning methods are well suited for this kind of problem. Health care claims often have a data structure that is hierarchical and of variable length. We propose one model based on piecewise feed forward neural networks (deep learning) and another model based on self-attention neural networks for the task of claim management. We show that the proposed methods outperform bag-of-words based models, hand designed features, and models based on convolutional neural networks, on a data set of two million health care claims. The proposed self-attention method performs the best.
Date:	2018–08
URL:	http://d.repec.org/n?u=RePEc:arx:papers:1808.10543&r=big

DeepLOB: Deep Convolutional Neural Networks for Limit Order Books

By:	Zihao Zhang; Stefan Zohren; Stephen Roberts
Abstract:	We develop a large-scale deep learning model to predict price movements from limit order book (LOB) data of cash equities. The architecture utilises convolutional filters to capture the spatial structure of the limit order books as well as LSTM modules to capture longer time dependencies. The model is trained using electronic market quotes from the London Stock Exchange. Our model delivers a remarkably stable out-of-sample prediction accuracy for a variety of instruments and outperforms existing methods such as Support Vector Machines, standard Multilayer Perceptrons, as well as other previously proposed convolutional neural network (CNN) architectures. The results obtained lead to good profits in a simple trading simulation, especially when compared with the baseline models. Importantly, our model translates well to instruments which were not part of the training set, indicating the model's ability to extract universal features. In order to better understand these features and to go beyond a "black box" model, we perform a sensitivity analysis to understand the rationale behind the model predictions and reveal the components of LOBs that are most relevant. The ability to extract robust features which translate well to other instruments is an important property of our model which has many other applications.
Date:	2018–08
URL:	http://d.repec.org/n?u=RePEc:arx:papers:1808.03668&r=big

Uniform Inference in High-Dimensional Gaussian Graphical Models

By:	Sven Klaassen; Jannis Kück; Martin Spindler
Abstract:	Graphical models have become a very popular tool for representing dependencies within a large set of variables and are key for representing causal structures. We provide results for uniform inference on high-dimensional graphical models with the number of target parameters being possibly much larger than the sample size. This is particularly important when certain features or structures of a causal model should be recovered. Our results highlight how in high-dimensional settings graphical models can be estimated and recovered with modern machine learning methods in complex data sets. We also demonstrate in a simulation study that our procedure has good small sample properties.
Date:	2018–08
URL:	http://d.repec.org/n?u=RePEc:arx:papers:1808.10532&r=big

The Effect of Positive Mood on Cooperation in Repeated Interaction

By:	Proto, Eugenio; Sgroi, Daniel; Nazneen, Mahnaz
Abstract:	Existing research supports two opposing mechanisms through which positive mood might affect cooperation. Some studies have suggested that positive mood produces more altruistic, open and helpful behavior, fostering cooperation. However, there is contrasting research supporting the idea that positive mood produces more assertiveness and inward-orientation and reduced use of information, hampering cooperation. We find evidence that suggests the second hypothesis dominates when playing the repeated Prisoner’s Dilemma. Players in an induced positive mood tend to cooperate less than players in a neutral mood setting. This holds regardless of uncertainty surrounding the number of repetitions or whether pre-play communication has taken place. This finding is consistent with a text analysis of the pre-play communication between players indicating that subjects in a more positive mood use more inward-oriented, more negative and less positive language. To the best of our knowledge we are the first to use text analysis in pre-play communication.
Keywords:	Financial Economics
Date:	2017–11–11
URL:	http://d.repec.org/n?u=RePEc:ags:uwarer:269091&r=big

An Open and Data-driven Taxonomy of Skills Extracted from Online Job Adverts

By:	Jyldyz Djumalieva; Cath Sleeman
Abstract:	In this work we offer an open and data-driven skills taxonomy, which is independent of ESCO and O*NET, two popular available taxonomies that are expert-derived. Since the taxonomy is created in an algorithmic way without expert elicitation, it can be quickly updated to reflect changes in labour demand and provide timely insights to support labour market decision-making. Our proposed taxonomy also captures links between skills, aggregated job titles, and the salaries mentioned in the millions of UK job adverts used in this analysis. To generate the taxonomy, we employ machine learning methods, such as word embeddings, network community detection algorithms and consensus clustering. We model skills as a graph with individual skills as vertices and their co-occurrences in job adverts as edges. The strength of the relationships between the skills is measured using both the frequency of actual co-occurrences of skills in the same advert as well as their shared context, based on a trained word embeddings model. Once skills are represented as a network, we hierarchically group them into clusters. To ensure the stability of the resulting clusters, we introduce bootstrapping and consensus clustering stages into the methodology. While we share initial results and describe the skill clusters, the main purpose of this paper is to outline the methodology for building the taxonomy.
Keywords:	Skills, Skills taxonomy, Labour demand, Online job adverts, Big data, Machine learning, Word embeddings
JEL:	C18 C38 J23 J24
Date:	2018–08
URL:	http://d.repec.org/n?u=RePEc:nsr:escoed:escoe-dp-2018-13&r=big
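
The graph construction described in this abstract can be illustrated with a short Python sketch. It covers only the co-occurrence part: skills become vertices and the number of adverts in which two skills appear together becomes the edge weight. The word-embedding similarity, community detection and consensus clustering stages are not shown, and the function name and toy adverts are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence_edges(adverts: Iterable[List[str]]) -> Counter:
    """Count, for every unordered pair of skills, the number of adverts
    that mention both. Keys are alphabetically sorted (skill, skill) tuples."""
    edges: Counter = Counter()
    for skills in adverts:
        # Deduplicate within an advert so a repeated skill counts once.
        for pair in combinations(sorted(set(skills)), 2):
            edges[pair] += 1
    return edges


# Hypothetical toy adverts, each reduced to its list of extracted skills.
toy_adverts = [
    ["python", "sql"],
    ["python", "sql", "excel"],
    ["excel", "sql"],
]
toy_edges = cooccurrence_edges(toy_adverts)
for (skill_a, skill_b), weight in sorted(toy_edges.items()):
    print(skill_a, skill_b, weight)
```

A weighted edge list of this kind is the usual input to a community detection routine, which would then group the skills into clusters.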

"Negotiating the algorithm" automation, artificial intelligence and labour protection

By:	De Stefano, Valerio.
Abstract:	This paper aims at filling some gaps in the mainstream debate on automation, the introduction of new technologies at the workplace and the future of work. This debate has concentrated, so far, on how many jobs will be lost as a consequence of technological innovation. This paper examines instead issues related to the quality of jobs in future labour markets. It addresses the detrimental effects on workers of awarding legal capacity and rights and obligations to robots. It examines the implications of practices such as People Analytics and the use of big data and artificial intelligence to manage the workforce. It stresses an oft-neglected feature of the contract of employment, namely the fact that it vests the employer with authority and managerial prerogatives over workers. It points out that a vital function of labour law is to limit this authority and these prerogatives to protect the human dignity of workers. In light of this, it argues that even if a Universal Basic Income were introduced, the existence of managerial prerogatives would still warrant the existence of labour regulation since this regulation is about much more than protecting workers’ income. It then highlights the benefits of human-rights-based approaches to labour regulation to protect workers’ privacy against invasive electronic monitoring. It concludes by highlighting the crucial role of collective regulation and social partners in governing automation and the impact of technology at the workplace. It stresses that collective dismissal regulation and the involvement of workers’ representatives in managing and preventing job losses is crucial and that collective actors should actively participate in the governance of technology-enhanced management systems, to ensure a vital “human-in-command” approach.
Date:	2018
URL:	http://d.repec.org/n?u=RePEc:ilo:ilowps:994998792302676&r=big

The roots of inequality: Estimating inequality of opportunity from regression trees

By:	Paolo Brunori (University of Florence, Italy); Paul Hufe (ifo Munich and LMU Munich, Germany); Daniel Gerszon Mahler (University of Copenhagen, Denmark and World Bank)
Abstract:	We propose a set of new methods to estimate inequality of opportunity based on conditional inference regression trees. In particular, we illustrate how these methods represent a substantial improvement over existing empirical approaches to measure inequality of opportunity. First, they minimize the risk of arbitrary and ad-hoc model selection. Second, they provide a standardized way of trading off upward and downward biases in inequality of opportunity estimations. Finally, regression trees can be graphically represented; their structure is immediate to read and easy to understand. This will make the measurement of inequality of opportunity more easily comprehensible to a large audience. These advantages are illustrated by an empirical application based on the 2011 wave of the European Union Statistics on Income and Living Conditions.
Keywords:	Equality of opportunity, machine learning, random forests.
JEL:	D31 D63 C38
Date:	2018–01
URL:	http://d.repec.org/n?u=RePEc:inq:inqwps:ecineq2018-455&r=big

Detecting and Validating Global Technology Trends Using Quantitative and Expert-Based Foresight Techniques

By:	Ilya Kuzminov (National Research University Higher School of Economics); Pavel Bakhtin (National Research University Higher School of Economics); Elena Khabirova (National Research University Higher School of Economics); Irina V. Loginova (National Research University Higher School of Economics)
Abstract:	This paper contributes to the conceptualisation and operationalisation of the “technology trend” discussion in the scope of the Technology Foresight paradigm. It proposes a consistent logical approach to analysing technology trends and increasing the predictive potential of futures studies. The approach integrates Big Data analysis into the Foresight studies’ toolset by means of applying text mining, namely computerised analysis of large volumes of unstructured text-based industry-relevant analytics. It comprises methodological results, such as analytical decomposition of the trend concept, including trend attributes (inherent characteristics) and various trend types, and empirical results of detection and classification of global technology trends in the agricultural sector. The study makes a significant contribution to the development of a conceptual apparatus for trend analysis as a sub-area of Foresight methodology. The agricultural field is used to demonstrate the application of the methodology. The empirical results can be applied by federal and regional authorities responsible for promoting development of the sectors to design relevant strategies and programmes, and by companies to set their long-term marketing and investment priorities.
Keywords:	technology trends, innovation, science and technology forecasting, science and technology progress, foresight, text mining, survey, bibliometrics, patent analysis
JEL:	O1 O3
Date:	2018
URL:	http://d.repec.org/n?u=RePEc:hig:wpaper:82sti2018&r=big

Selection of calibration windows for day-ahead electricity price forecasting

By:	Grzegorz Marcjasz; Tomasz Serafin; Rafal Weron
Abstract:	We conduct an extensive empirical study on the selection of calibration windows for day-ahead electricity price forecasting, which involves 6-year long datasets from three major power markets and four autoregressive expert models fitted either to raw or transformed prices. Since the variability of prediction errors across windows of different lengths and across datasets can be substantial, selecting ex-ante one window is risky. Instead, we argue that averaging forecasts across different calibration windows is a robust alternative and introduce a new, well-performing weighting scheme for averaging these forecasts.
Keywords:	Electricity price forecasting; Forecast averaging; Calibration window; Autoregression; Variance stabilizing transformation; Conditional predictive ability
JEL:	C14 C22 C51 C53 Q47
Date:	2018–08–29
URL:	http://d.repec.org/n?u=RePEc:wuu:wpaper:hsc1806&r=big
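
The idea of averaging forecasts across calibration windows can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the authors' procedure: a plain AR(1) model fitted by least squares stands in for the four autoregressive expert models, equal weights stand in for the new weighting scheme the abstract mentions, and the function names and the toy series are hypothetical.

```python
from typing import List, Sequence


def ar1_forecast(series: Sequence[float], window: int) -> float:
    """One-step-ahead forecast from an AR(1) model with intercept,
    fitted by least squares on the last `window` observations."""
    data = list(series[-window:])
    x, y = data[:-1], data[1:]          # regress each value on its predecessor
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:                        # constant window: fall back to the mean
        return mean_y
    beta = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    alpha = mean_y - beta * mean_x
    return alpha + beta * data[-1]


def averaged_forecast(series: Sequence[float], windows: List[int]) -> float:
    """Equal-weight average of the forecasts obtained with each
    calibration window length (a stand-in for a tuned weighting scheme)."""
    forecasts = [ar1_forecast(series, w) for w in windows]
    return sum(forecasts) / len(forecasts)


# Hypothetical toy series; real applications would use daily or hourly prices.
toy_prices = [float(v) for v in range(1, 11)]
toy_forecast = averaged_forecast(toy_prices, windows=[4, 6, 10])
print(toy_forecast)
```

Averaging over several window lengths avoids committing ex ante to a single window, which is the risk the abstract highlights.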

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.