nep-big New Economics Papers
on Big Data
Issue of 2019‒06‒10
thirteen papers chosen by
Tom Coupé
University of Canterbury

  1. OK Computer: The Creation and Integration of AI in Europe By Bernardo S. Buarque; Ronald B. Davies; Dieter F. Kogler; Ryan M. Hynes
  2. Predicting free-riding in a public goods game: Analysis of content and dynamic facial expressions in face-to-face communication By Bershadskyy, Dmitri; Othman, Ehsan; Saxen, Frerk
  3. Attribute Sentiment Scoring With Online Text Reviews : Accounting for Language Structure and Attribute Self-Selection By Ishita Chakraborty; Minkyung Kim; K. Sudhir
  4. Machine Learning Estimation of Heterogeneous Treatment Effects with Instruments By Vasilis Syrgkanis; Victor Lei; Miruna Oprescu; Maggie Hei; Keith Battocchi; Greg Lewis
  5. Security Analysis of Machine Learning Systems for the Financial Sector By Shiori Inoue; Masashi Une
  6. News-driven inflation expectations and information rigidities By Vegard H. Larsen; Leif Anders Thorsrud; Julia Zhulanova
  7. Buy, Sell or Hold: Entity-Aware Classification of Business News By Sinha, Ankur; Kedas, Satishwar; Kumar, Rishu; Malo, Pekka
  8. Matching on What Matters: A Pseudo-Metric Learning Approach to Matching Estimation in High Dimensions By Gentry Johnson; Brian Quistorff; Matt Goldman
  9. Study of the readiness of managers of the civil service to work in a digital society By Sinyagin, Yuriy (Синягин, Юрий); Sinyagina, Natalia (Синягина, Наталья); Markaryan, Violetta (Маркарян, Виолетта); Barkova, Yulia (Баркова, Юлия)
  10. Principal component analysis-aided statistical process optimisation (PASPO) for process improvement in industrial refineries By Teng, Sin Yong; How, Bing Shen; Leong, Wei Dong; Teoh, Jun Hao; Cheah, Adrian Chee Siang; Motavasel, Zahra; Lam, Hon Loong
  11. Short-term forecasting of the US unemployment rate By Maas, Benedikt
  12. Intertemporal Evidence on the Strategy of Populism By Gloria Gennaro; Giampaolo Lecce; Massimo Morelli
  13. EFForTS-LGraf: A landscape generator for creating smallholder-driven land-use mosaics By Salecker, Jan; Dislich, Claudia; Wiegand, Kerstin; Meyer, Katrin M.; Pe'er, Guy

  1. By: Bernardo S. Buarque; Ronald B. Davies; Dieter F. Kogler; Ryan M. Hynes
    Abstract: This paper investigates the creation and integration of Artificial Intelligence (AI) patents in Europe. We create a panel of AI patents over time, mapping them into regions at the NUTS2 level. We then proceed by examining how AI is integrated into the knowledge space of each region. In particular, we find that those regions where AI is most embedded into the innovation landscape are also those where the number of AI patents is largest. This suggests that to increase AI innovation it may be necessary to integrate it with industrial development, a feature central to many recent AI-promoting policies.
    Keywords: Artificial Intelligence; Geography of Innovation; Knowledge Space; Technological Change; Regional Studies
    JEL: O33 O31 R11
    Date: 2019–05
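The "embeddedness" of AI in a region's knowledge space can be sketched with a standard relatedness-density measure: technological relatedness from the co-occurrence of technology classes across regions, then the density of AI around each region's existing specialisations. The toy specialisation matrix and the density formula below are illustrative assumptions, not the authors' exact measure.

```python
import numpy as np

rng = np.random.default_rng(5)
# Rows: regions, columns: technology classes; 1 = region is specialised.
M = rng.integers(0, 2, size=(6, 4))
ai = 0                    # index of the hypothetical "AI" class
M[0, ai] = 1              # ensure at least one region patents in AI

# Relatedness: cosine-like co-occurrence of classes across regions.
co = M.T @ M
norm = np.sqrt(np.outer(np.diag(co), np.diag(co)))
rel = np.divide(co, norm, out=np.zeros_like(co, dtype=float), where=norm > 0)

# Density of AI in each region: how strongly the classes present there
# relate to AI, as a share of total relatedness to AI.
density = (M @ rel[:, ai]) / rel[:, ai].sum()
print("AI embeddedness by region:", density.round(2))
```

Regions with a density near 1 hold most of the classes related to AI, which is the kind of region where the abstract reports AI patenting is largest.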
  2. By: Bershadskyy, Dmitri; Othman, Ehsan; Saxen, Frerk
    Abstract: This paper illustrates how audio-visual data from pre-play face-to-face communication can be used to identify groups that contain free-riders in a public goods experiment. It focuses on two channels through which face-to-face communication influences contributions to a public good. First, the content of the face-to-face communication is investigated by categorising specific strategic information and using simple meta-data. Second, a machine-learning approach is implemented to analyse the facial expressions of the subjects during their communication. These approaches are the first of their kind to analyse content and facial expressions in face-to-face communication with the aim of predicting subjects' behaviour in a public goods game. The analysis shows that verbally committing to contribute fully to the public good until the very end, and communicating through facial cues, reduce the commonly observed end-game behaviour. The length of the face-to-face communication, quantified as the number of words, is also a good predictor of cooperation behaviour towards the end of the game. The findings provide first insights into how a priori available information can be used to predict free-riding behaviour in public goods games.
    Keywords: automatic facial expressions recognition,content analysis,public goods experiment,face-to-face communication
    JEL: C80 C92 D91
    Date: 2019
  3. By: Ishita Chakraborty (School of Management, Yale University); Minkyung Kim (School of Management, Yale University); K. Sudhir (Cowles Foundation & School of Management, Yale University; School of Management, Yale University)
    Abstract: The authors address two novel and significant challenges in using online text reviews to obtain attribute-level ratings. First, they introduce the problem of inferring attribute-level sentiment from text data to the marketing literature and develop a deep learning model to address it. While extant bag-of-words based topic models are fairly good at attribute discovery based on the frequency of word or phrase occurrences, associating sentiments with attributes requires exploiting the spatial and sequential structure of language. Second, they illustrate how to correct for attribute self-selection (reviewers choose the subset of attributes to write about) in metrics of attribute-level restaurant performance. Using reviews for empirical illustration, they find that a hybrid deep learning (CNN-LSTM) model, in which the CNN and the LSTM exploit the spatial and sequential structure of language respectively, provides the best performance in accuracy, training speed and training-data size requirements. The model does particularly well on the "hard" sentiment classification problems. Further, accounting for attribute self-selection significantly impacts sentiment scores, especially on attributes that are frequently missing.
    Keywords: Text mining, Natural language processing (NLP), Convolutional neural networks (CNN), Long short-term memory (LSTM) networks, Deep learning, Lexicons, Endogeneity, Self-selection, Online reviews, Online ratings, Customer satisfaction
    JEL: M1 M3 C8 C5
    Date: 2019–05
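The hybrid architecture the abstract describes can be sketched in miniature: a 1-D convolution over token embeddings captures the local ("spatial") phrase structure, and a recurrent pass then reads the convolved features in order (the "sequential" structure). The dimensions below are arbitrary and a plain tanh recurrence stands in for the LSTM cell; this is an illustrative sketch, not the authors' model.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, n_filters, hidden = 10, 8, 6, 4

tokens = rng.normal(size=(seq_len, emb_dim))       # one embedded review
W_conv = rng.normal(size=(3, emb_dim, n_filters))  # kernel of width 3

# Convolution over token windows: local phrase features.
conv_out = np.array([
    np.tanh(np.einsum('we,wef->f', tokens[t:t + 3], W_conv))
    for t in range(seq_len - 2)
])                                                 # (seq_len - 2, n_filters)

# Recurrent pass over the convolved features, in word order.
W_x = rng.normal(size=(n_filters, hidden))
W_h = rng.normal(size=(hidden, hidden))
h = np.zeros(hidden)
for x in conv_out:
    h = np.tanh(x @ W_x + h @ W_h)

# Linear read-out of the final state into three sentiment classes.
logits = h @ rng.normal(size=(hidden, 3))
probs = np.exp(logits) / np.exp(logits).sum()
print("class probabilities:", probs.round(3))
```

In a real system the weights would of course be trained on labelled reviews; the point here is only the CNN-then-LSTM data flow.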
  4. By: Vasilis Syrgkanis; Victor Lei; Miruna Oprescu; Maggie Hei; Keith Battocchi; Greg Lewis
    Abstract: We consider the estimation of heterogeneous treatment effects with arbitrary machine learning methods in the presence of unobserved confounders with the aid of a valid instrument. Such settings arise in A/B tests with an intent-to-treat structure, where the experimenter randomizes over which user will receive a recommendation to take an action, and we are interested in the effect of the downstream action. We develop a statistical learning approach to the estimation of heterogeneous effects, reducing the problem to the minimization of an appropriate loss function that depends on a set of auxiliary models (each corresponding to a separate prediction task). The reduction enables the use of all recent algorithmic advances (e.g. neural nets, forests). We show that the estimated effect model is robust to estimation errors in the auxiliary models, by showing that the loss satisfies a Neyman orthogonality criterion. Our approach can be used to estimate projections of the true effect model on simpler hypothesis spaces. When these spaces are parametric, then the parameter estimates are asymptotically normal, which enables construction of confidence sets. We applied our method to estimate the effect of membership on downstream webpage engagement on TripAdvisor, using as an instrument an intent-to-treat A/B test among 4 million TripAdvisor users, where some users received an easier membership sign-up process. We also validate our method on synthetic data and on public datasets for the effects of schooling on income.
    Date: 2019–05
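The intent-to-treat setting the abstract describes reduces, in the homogeneous-effect baseline, to the classic Wald/2SLS ratio: the randomised nudge Z is the instrument, T the downstream action, Y the outcome. The simulation below (with made-up coefficients) shows why the instrument is needed when an unobserved confounder drives both T and Y; the paper's method generalises this baseline to heterogeneous effects with ML-estimated nuisance models.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
Z = rng.integers(0, 2, size=n).astype(float)    # randomised encouragement
U = rng.normal(size=n)                          # unobserved confounder
T = (0.3 * Z + 0.5 * U + rng.normal(size=n) > 0).astype(float)
Y = 1.0 * T + U + rng.normal(size=n)            # true effect is 1.0

# Wald / 2SLS estimate: cov(Y, Z) / cov(T, Z). A naive OLS of Y on T
# is biased upwards here because U raises both T and Y.
effect_iv = np.cov(Y, Z)[0, 1] / np.cov(T, Z)[0, 1]
effect_ols = np.cov(Y, T)[0, 1] / np.var(T)
print(f"2SLS: {effect_iv:.2f}  naive OLS: {effect_ols:.2f}")
```

The 2SLS estimate lands near the true effect of 1.0 while the naive regression overstates it, which is the bias the TripAdvisor A/B instrument is used to remove.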
  5. By: Shiori Inoue (Institute for Monetary and Economic Studies, Bank of Japan); Masashi Une (Director, Institute for Monetary and Economic Studies, Bank of Japan)
    Abstract: The use of artificial intelligence, particularly machine learning (ML), is being extensively discussed in the financial sector. ML systems, however, tend to have specific vulnerabilities as well as those common to all information technology systems. To effectively deploy secure ML systems, it is critical to consider in advance how to address potential attacks targeting the vulnerabilities. In this paper, we classify ML systems into 12 types on the basis of the relationships among entities involved in the system and discuss the vulnerabilities and threats, as well as the corresponding countermeasures for each type. We then focus on typical use cases of ML systems in the financial sector, and discuss possible attacks and security measures.
    Keywords: Artificial Intelligence, Machine Learning System, Security, Threat, Vulnerability
    JEL: L86 L96 Z00
    Date: 2019–05
  6. By: Vegard H. Larsen; Leif Anders Thorsrud; Julia Zhulanova
    Abstract: We investigate the role played by the media in the expectations formation process of households. Using a news-topic-based approach, we show that the news types the media choose to report on, e.g., (Internet) technology, health, and politics, are good predictors of households' stated inflation expectations. In turn, in a noisy information model setting, augmented with a simple media channel, we document that the underlying time series properties of relevant news topics explain the time-varying information rigidity among households. As such, we not only provide a novel estimate showing the degree to which information rigidities among households vary across time, but also provide, using a large news corpus and machine learning algorithms, robust and new evidence highlighting the role of the media for understanding inflation expectations and information rigidities.
    Keywords: Expectations, Media, Machine Learning, Inflation
    Date: 2019–04
  7. By: Sinha, Ankur; Kedas, Satishwar; Kumar, Rishu; Malo, Pekka
    Abstract: The financial sector is expected to be at the forefront of the adoption of machine learning methods, driven by the superior performance of data-driven approaches over traditional modelling approaches. There has been widespread interest in automatically extracting information from the financial news flow, as the signals might be useful for investment decisions. While quantitative finance focuses on the analysis of structured financial data for investment decisions, the potential of utilizing unstructured news flow in decision making is not fully tapped. Research in financial news analytics tries to address this gap by detecting events and aspects in news that provide buy, sell or hold information, commonly interpreted as financial sentiments. In this paper, we develop a framework utilizing information-theoretic concepts and machine learning methods that understands context and is capable of extracting the buy, sell or hold information contained within news headlines. The proposed framework is also capable of detecting conflicting sentiments on multiple companies within the same news headline, which to the best of our knowledge has not been studied earlier. Further, we develop an information system that analyzes the news flow in real time, allowing users to track financial sentiments by company, sector and index via a dashboard. Through this study we make three dataset-related contributions: first, a training dataset of more than 12,000 news headlines annotated for entities and their relevant financial sentiments by multiple annotators; second, an entity database of over 1,000 financial and economic entities relevant to the Indian economy and their forms of appearance in news media, amounting to over 5,000 phrases; and third, improvements to existing financial dictionaries. Using the proposed system, we study the effect of the information derived from the daily news flow over the years 2012 to 2017 on the Indian broad-market equity index NSE 500, and show that the information has predictive value.
    Date: 2019–04–30
  8. By: Gentry Johnson; Brian Quistorff; Matt Goldman
    Abstract: When pre-processing observational data via matching, we seek to approximate each unit with maximally similar peers that had an alternative treatment status, essentially replicating a randomized block design. However, as one considers a growing number of continuous features, a curse of dimensionality applies, making asymptotically valid inference impossible (Abadie and Imbens, 2006). The alternative of ignoring plausibly relevant features is certainly no better, and the resulting trade-off substantially limits the application of matching methods to "wide" datasets. Instead, Li and Fu (2017) recast the problem of matching in a metric learning framework that maps features to a low-dimensional space that facilitates "closer matches" while still capturing important aspects of unit-level heterogeneity. However, that method lacks key theoretical guarantees and can produce inconsistent estimates in cases of heterogeneous treatment effects. Motivated by a straightforward extension of existing results in the matching literature, we present alternative techniques that learn latent matching features through either multilayer perceptrons (MLPs) or siamese neural networks trained on a carefully selected loss function. We benchmark the resulting methods in simulations as well as against two experimental data sets, including the canonical NSW worker training program data set, and find superior performance of the neural-net-based methods.
    Date: 2019–05
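The matching step downstream of the learned metric can be sketched simply: once a network has learned a low-dimensional embedding phi(x), each treated unit is paired with its nearest control in that space. Below, a fixed random linear map stands in for the trained siamese network (learning that map is the paper's contribution and is omitted); all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, d = 100, 20, 3                  # "wide" features -> low-dimensional space
X = rng.normal(size=(n, p))
treated = rng.integers(0, 2, size=n).astype(bool)

A = rng.normal(size=(p, d))           # placeholder for learned network weights
phi = X @ A                           # units embedded in the matching space

# Nearest-neighbour matching of treated units to controls, in the
# learned space rather than the raw high-dimensional feature space.
controls = np.where(~treated)[0]
matches = {
    i: controls[np.argmin(np.linalg.norm(phi[controls] - phi[i], axis=1))]
    for i in np.where(treated)[0]
}
print(f"matched {len(matches)} treated units to controls")
```

The treatment-effect estimate would then average outcome differences across the matched pairs, exactly as in classical matching, but with distances computed in the learned space.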
  9. By: Sinyagin, Yuriy (Синягин, Юрий) (The Russian Presidential Academy of National Economy and Public Administration); Sinyagina, Natalia (Синягина, Наталья) (The Russian Presidential Academy of National Economy and Public Administration); Markaryan, Violetta (Маркарян, Виолетта) (The Russian Presidential Academy of National Economy and Public Administration); Barkova, Yulia (Баркова, Юлия) (The Russian Presidential Academy of National Economy and Public Administration)
    Abstract: The paper analyzes the challenges associated with the development of artificial intelligence and the digitalization of management. Based on the results of an empirical study of more than a thousand managers, it compiles a profile of a leader who is able to manage effectively under these new conditions. The paper characterizes a four-factor model of the integrated readiness of civil service managers to work with artificial intelligence systems in a digital society. Expectations and ideas about artificial intelligence among Russian leaders at various levels are identified, and the individual psychological and personal-professional qualities that influence the readiness of heads of the civil service to work in a digital society are studied. The paper also analyzes possible threats associated with the emergence and development of artificial intelligence systems.
    Date: 2019–05
  10. By: Teng, Sin Yong; How, Bing Shen; Leong, Wei Dong; Teoh, Jun Hao; Cheah, Adrian Chee Siang; Motavasel, Zahra; Lam, Hon Loong
    Abstract: Integrated refineries and industrial processing plants in the real world constantly face management and design difficulties in keeping processing operations lean and green. These challenges highlight the need to improve product quality and yield without compromising environmental aspects. For various process systems engineering applications, traditional optimisation methodologies (i.e., pure mixed-integer non-linear programming) can yield very precise global optimum solutions. However, for plant-wide optimisation, the solutions generated by such methods rely heavily on the accuracy of the constructed model and often require a large number of process changes to be implemented in the real world. This paper addresses this issue by using a special formulation of correlation-based principal component analysis (PCA) and Design of Experiment (DoE) methodologies to serve as statistical process optimisation for industrial refineries. The contribution of this work is an efficient framework for plant-wide optimisation based on plant operational data that does not compromise on environmental impacts. Fundamentally, PCA is used to prioritise statistically significant process variables based on their respective contribution scores. The variables with high contribution scores are then optimised by the experiment-based optimisation methodology. By doing so, the number of experimental runs for process optimisation and process changes can be reduced through efficient prioritisation. Process cycle assessment ensures that no negative environmental impact is caused by the optimisation result. As a proof of concept, the framework is implemented in a real oil re-refining plant. The overall product yield was improved by 55.25%, while overall product quality improved by 20.6%. Global Warming Potential (GWP) and Acidification Potential (AP) improved by 90.89% and 3.42%, respectively.
    Keywords: Principal Component Analysis, Design of Experiment, Plant-wide Optimisation, Statistical Process Optimisation, PASPO, Big Data Analytics
    JEL: C1 C6 C8 C9 L6
    Date: 2019–07–10
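The PCA prioritisation step can be sketched as follows: rank process variables by a "contribution score" built from squared loadings weighted by each component's explained variance, and hand only the top-ranked variables to the Design-of-Experiment stage. The data and the exact scoring rule are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy plant data: 200 observations of 5 process variables.
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]    # induce correlation

# Standardise, then PCA via SVD (equivalent to correlation-based PCA).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)            # variance share per component

# Contribution score per variable: squared loadings, weighted by the
# explained variance of each component; scores sum to one.
scores = (Vt**2 * explained[:, None]).sum(axis=0)
ranking = np.argsort(scores)[::-1]
print("variable priority order:", ranking)
```

Only the few variables at the front of `ranking` would then be varied in DoE runs, which is how the framework cuts down the number of experiments and plant changes.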
  11. By: Maas, Benedikt
    Abstract: This paper assesses whether Google search data is useful for predicting the US unemployment rate alongside more traditional predictor variables. A weekly Google index is derived from searches for the keyword “unemployment” and is used in diffusion index variants along with the weekly number of initial claims and monthly estimated latent factors. The unemployment rate forecasts are generated using MIDAS regression models that take into account the actual frequencies of the predictor variables. The forecasts are made in real time, and the best forecasting models for the most part improve on the root mean squared forecast error of two benchmarks. However, as the forecasting horizon increases, the performance of the best diffusion index variants deteriorates, which suggests that the forecasting methods proposed in this paper are most useful in the short term.
    Keywords: Forecasting, Unemployment rate, MIDAS, Google Trends
    JEL: C32 C53 E32
    Date: 2019–04–16
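The core MIDAS device is to collapse several high-frequency lags (here, weekly observations within a month) into one regressor through a tightly parameterised weight function, typically an exponential Almon polynomial. The sketch below uses made-up parameter values and fixes them before an OLS step; in practice the weight parameters are estimated jointly by nonlinear least squares, and the paper's exact specification may differ.

```python
import numpy as np

def almon_weights(theta1, theta2, n_lags):
    """Exponential Almon lag weights, normalised to sum to one."""
    k = np.arange(1, n_lags + 1)
    w = np.exp(theta1 * k + theta2 * k**2)
    return w / w.sum()

def midas_aggregate(x_high, weights):
    """Collapse each row of high-frequency lags into one regressor."""
    return x_high @ weights

rng = np.random.default_rng(1)
weekly = rng.normal(size=(60, 4))     # 60 months x 4 weekly lags of an index
w = almon_weights(0.1, -0.05, 4)      # illustrative weight parameters
x = midas_aggregate(weekly, w)        # one low-frequency regressor per month
y = 0.5 * x + rng.normal(scale=0.1, size=60)

# OLS on the aggregated regressor, with the weights held fixed.
beta = np.sum(x * y) / np.sum(x * x)
print(f"estimated slope: {beta:.2f}")
```

The two `theta` parameters let a long weekly lag structure enter the monthly regression without one free coefficient per lag, which is what makes mixed-frequency forecasting feasible.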
  12. By: Gloria Gennaro; Giampaolo Lecce; Massimo Morelli
    Abstract: Do candidates use populism to maximize the impact of political campaigns? Is the supply of populism strategic? We apply automated text analysis to all available 2016 US Presidential campaign speeches and 2018 midterm campaign programs, using a continuous index of populism. This novel dataset shows that the use of populist rhetoric responds to the level of expected demand for populism in the local audience. In particular, we provide evidence that U.S. President Donald Trump used more populist rhetoric in swing states and in locations where economic insecurity is prevalent. These findings are confirmed when the analysis is extended to recent legislative campaigns, where candidates tend towards populism when campaigning in stiffly competitive districts whose constituents experience high levels of economic insecurity. We also show that pandering is more common among candidates who can credibly sustain anti-elite positions, such as those with shorter political careers. Finally, our results suggest that a populist strategy is rewarded by voters, since higher levels of populism are associated with higher vote shares, precisely in competitive districts where voters are experiencing economic insecurity.
    Keywords: Populism, Electoral Campaign, American Politics, Text Analysis
    Date: 2019
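A continuous populism index from text is often built, in its simplest form, as the share of tokens hitting a dictionary of anti-elite and people-centric terms. The word list and scoring rule below are illustrative stand-ins, not the authors' index, which is more sophisticated.

```python
# Toy dictionary-based populism score; the term list is a made-up example.
populist_terms = {"elite", "elites", "corrupt", "people", "betray", "establishment"}

def populism_score(speech: str) -> float:
    """Share of tokens in the speech that hit the populist word list."""
    tokens = speech.lower().split()
    if not tokens:
        return 0.0
    hits = sum(tok.strip(".,!?") in populist_terms for tok in tokens)
    return hits / len(tokens)

print(populism_score("The corrupt elites betray the people!"))
```

Scoring every speech this way and regressing the index on district competitiveness and local economic insecurity is the shape of the exercise the abstract describes.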
  13. By: Salecker, Jan; Dislich, Claudia; Wiegand, Kerstin; Meyer, Katrin M.; Pe'er, Guy
    Abstract: Spatially-explicit simulation models are commonly used to study complex ecological and socio-economic research questions. Often these models depend on detailed input data, such as initial land-cover maps, to set up model simulations. Here we present the landscape generator EFForTS-LGraf, which provides artificially generated land-use maps of agricultural landscapes shaped by small-scale farms. EFForTS-LGraf is a process-based landscape generator that explicitly incorporates the human dimension of land-use change. The model generates roads and villages that consist of smallholder farming households. These smallholders use different establishment strategies to create fields in their close vicinity. Crop types are distributed to these fields based on crop fractions and specialization levels. EFForTS-LGraf model parameters such as household area or field size frequency distributions can be derived from household surveys or geospatial data. This can be an advantage over the abstract parameters of neutral landscape generators. We tested the model using oil palm and rubber farming in Indonesia as a case study and validated the artificially generated maps against classified satellite images. Our results show that EFForTS-LGraf is able to generate realistic land-cover maps with properties that lie within the boundaries of landscapes from classified satellite images. An applied simulation experiment on landscape-level effects of increasing household area and crop specialization revealed that larger households with higher specialization levels led to spatially more homogeneous and less scattered crop type distributions and a reduced edge area proportion. Thus, EFForTS-LGraf can be applied both to generate maps as inputs for simulation modelling and as a stand-alone tool for specific landscape-scale analyses in the context of ecological-economic studies of smallholder farming systems.
    Keywords: landscape generator,agent-based model,ABM,NetLogo,process-based,Indonesia
    Date: 2019
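The process-based idea behind such a generator can be sketched on a toy grid: place household "homes", then let each household claim unclaimed cells in an expanding neighbourhood around its home until a target farm area is reached. All parameters below are illustrative stand-ins for EFForTS-LGraf's survey-derived distributions, and the real model additionally generates roads, villages and crop allocations.

```python
import numpy as np

rng = np.random.default_rng(4)
size, n_households, target_cells = 30, 5, 40
grid = np.full((size, size), -1)            # -1 marks an unclaimed cell

homes = [tuple(rng.integers(0, size, 2)) for _ in range(n_households)]
for hh, (r, c) in enumerate(homes):
    claimed, radius = 0, 1
    # Expand the search window until the household's quota is filled.
    while claimed < target_cells and radius < size:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                rr, cc = (r + dr) % size, (c + dc) % size  # wrap-around edges
                if grid[rr, cc] == -1 and claimed < target_cells:
                    grid[rr, cc] = hh
                    claimed += 1
        radius += 1

print("cells claimed per household:",
      [(grid == hh).sum() for hh in range(n_households)])
```

Because each household fills its fields in its close vicinity, the resulting mosaic is clustered around homes rather than random, which is the qualitative property that distinguishes process-based from neutral landscape generators.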

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.