nep-big New Economics Papers
on Big Data
Issue of 2020‒04‒13
twenty-one papers chosen by
Tom Coupé
University of Canterbury

  1. Use of AI, Work Style Reform, and Productivity: Evidence from an Individual-Level Survey (Japanese) By MORIKAWA Masayuki
  2. Trustworthy artificial intelligence (AI) in education: Promises and challenges By Stéphan Vincent-Lancrin; Reyer van der Vlies
  3. Contracting, pricing, and data collection under the AI flywheel effect By Francis de Véricourt; Huseyin Gurkan
  4. Estimating the Green Potential of Occupations: A New Approach Applied to the U.S. Labor Market By Rutzer, Christian; Niggli, Matthias; Weder, Rolf
  5. Determining feature importance for actionable climate change mitigation policies By Romit Maulik; Junghwa Choi; Wesley Wehde; Prasanna Balaprakash
  6. ESG investments: Filtering versus machine learning approaches By Carmine de Franco; Christophe Geissler; Vincent Margot; Bruno Monnier
  7. Where do we stand in cryptocurrencies economic research? A survey based on hybrid analysis By Aurelio F. Bariviera; Ignasi Merediz-Solà
  8. Data Science in Economics By Saeed Nosratabadi; Amir Mosavi; Puhong Duan; Pedram Ghamisi
  9. Tecnologías de Big data y biopolítica: mecanismos relacionales de procesamiento de datos en época de pandemia mundial viral By Herrera, Pablo Matías; Garcia Fronti, Javier
  10. Double Machine Learning Based Program Evaluation under Unconfoundedness By Knaus, Michael C.
  11. Towards Explainability of Machine Learning Models in Insurance Pricing By Kevin Kuo; Daniel Lupton
  12. Double Debiased Machine Learning Nonparametric Inference with Continuous Treatments By Kyle Colangelo; Ying-Ying Lee
  13. Forecasting Waiting Time to Treatment for Emergency Department Patients By Pak, Anton; Gannon, Brenda; Staib, Andrew
  14. Failure of Equilibrium Selection Methods for Multiple-Principal, Multiple-Agent Problems with Non-Rivalrous Goods: An Analysis of Data Markets By Samir Wadhwa; Roy Dong
  15. Regulation of Data Localization Measures in WTO Law (Japanese) By TOJO Yoshizumi
  16. QuantNet: Transferring Learning Across Systematic Trading Strategies By Adriano Koshiyama; Sebastian Flennerhag; Stefano B. Blumberg; Nick Firoozye; Philip Treleaven
  17. Search of Attention in Financial Market By Chong, Terence Tai Leung; Li, Chen
  18. Sorting Big Data by Revealed Preference with Application to College Ranking By Xingwei Hu
  19. Reinforcement Learning in Economics and Finance By Arthur Charpentier; Romuald Elie; Carl Remlinger
  20. Analyzing the Online Advertising Market from the Perspective of Competition Policy (Japanese) By KAWAHAMA Noboru; TAKEDA Kuninobu
  21. Potential and pitfalls of big transport data for spatial interaction models of urban mobility By Oshan, Taylor M.

  1. By: MORIKAWA Masayuki
    Abstract: This study presents individual-level evidence of the use of artificial intelligence (AI) and big data and discusses the relationship of these new automation technologies with work-style reform and productivity. The results indicate, first, that young and highly educated individuals tend to use AI and big data in their jobs. Second, use of these automation technologies is positively associated with earnings. Third, recent work-style reform has not necessarily contributed to improving efficiency in work, but the results suggest that use of AI and big data combined with work-style reform may contribute to workers' productivity.
    Date: 2020–03
  2. By: Stéphan Vincent-Lancrin (OECD); Reyer van der Vlies (OECD)
    Abstract: This paper was written to support the G20 artificial intelligence (AI) dialogue. With the rise of AI, education faces two challenges: reaping the benefits of AI to improve education processes, both in the classroom and at the system level; and preparing students for new skillsets for increasingly automated economies and societies. AI applications are often still nascent, but there are many examples of promising uses that foreshadow how AI might transform education. With regard to the classroom, this paper highlights how AI can accelerate personalised learning and the support of students with special needs. At the system level, promising uses include predictive analysis to reduce dropout, and assessing new skillsets. A new demand for complex skills that are less easy to automate (e.g. higher cognitive skills like creativity and critical thinking) is also a consequence of AI and digitalisation. Reaching the full potential of AI requires that stakeholders trust not only the technology, but also its use by humans. This raises new policy challenges around “trustworthy AI”, encompassing the privacy and security of data, but also possible wrongful uses of data leading to biases against individuals or groups.
    Date: 2020–04–08
  3. By: Francis de Véricourt (ESMT European School of Management and Technology and E.CA Economics); Huseyin Gurkan (ESMT European School of Management and Technology)
    Abstract: This paper explores how firms that lack expertise in machine learning (ML) can leverage the so-called AI Flywheel effect. This effect designates a virtuous cycle by which, as an ML product is adopted and new user data are fed back to the algorithm, the product improves, enabling further adoptions. However, managing this feedback loop is difficult, especially when the algorithm is contracted out. Indeed, the additional data that the AI Flywheel effect generates may change the provider’s incentives to improve the algorithm over time. We formalize this problem in a simple two-period moral hazard framework that captures the main dynamics between machine learning, data acquisition, pricing and contracting. We find that the firm’s decisions crucially depend on how the amount of data on which the machine is trained interacts with the provider’s effort. If this effort has a more (resp. less) significant impact on accuracy for larger volumes of data, the firm underprices (resp. overprices) the product. Further, the firm’s starting dataset, as well as the data volume that its product collects per user, significantly affect its pricing and data collection strategies. The firm leverages the virtuous cycle less for larger starting datasets and sometimes more for larger data volumes per user. Interestingly, the presence of incentive issues can induce the firm to leverage the effect less when its product collects more data per user.
    Keywords: Data, machine learning, pricing, incentives and contracting
    Date: 2020–03–03
  4. By: Rutzer, Christian (University of Basel); Niggli, Matthias (University of Basel); Weder, Rolf (University of Basel)
    Abstract: This paper presents a new approach to estimate the green potential of occupations. Using data from O*NET on the skills that workers possess and the tasks they carry out, we train several machine learning algorithms to predict the green potential of U.S. occupations classified according to the 6-digit Standard Occupational Classification. Our methodology allows existing discrete classifications of occupations to be extended to a continuum of classes. This improves the analysis of heterogeneous occupations in terms of their green potential. Our approach makes two contributions to the literature. First, as it more accurately ranks occupations in terms of their green potential, it leads to a better understanding of the extent to which a given workforce is prepared to cope with a transition to a green economy. Second, it allows for a more accurate analysis of differences between workforces across regions. We use U.S. occupational employment data to highlight both aspects.
    Keywords: green skills, green tasks, green potential, supervised learning, labor market
    JEL: C53 J21 J24 Q52
    Date: 2020–03–01
  5. By: Romit Maulik; Junghwa Choi; Wesley Wehde; Prasanna Balaprakash
    Abstract: Given the importance of public support for policy change and implementation, public policymakers and researchers have attempted to understand the factors associated with this support for climate change mitigation policy. In this article, we compare the feasibility of using different supervised learning methods for regression using a novel socio-economic data set which measures public support for potential climate change mitigation policies. Following this model selection, we utilize gradient boosting regression, a well-known technique in the machine learning community, but relatively uncommon in public policy and public opinion research, and seek to understand what factors among the several examined in previous studies are most central to shaping public support for mitigation policies in climate change studies. The use of this method provides novel insights into the most important factors for public support for climate change mitigation policies. Using national survey data, we find that the perceived risks associated with climate change are more decisive for shaping public support for policy options promoting renewable energy and regulating pollutants. However, we observe a very different behavior related to public support for increasing the use of nuclear energy where climate change risk perception is no longer the sole decisive feature. Our findings indicate that public support for renewable energy is inherently different from that for nuclear energy reliance with the risk perception of climate change, dominant for the former, playing a subdued role for the latter.
    Date: 2020–03
  6. By: Carmine de Franco (OSSIAM); Christophe Geissler (Advestis); Vincent Margot (Advestis); Bruno Monnier (OSSIAM)
    Abstract: We designed a machine learning algorithm that identifies patterns between ESG profiles and financial performances for companies in a large investment universe. The algorithm consists of regularly updated sets of rules that map regions into the high-dimensional space of ESG features to excess return predictions. The final aggregated predictions are transformed into scores which allow us to design simple strategies that screen the investment universe for stocks with positive scores. By linking the ESG features with financial performances in a non-linear way, our strategy based upon our machine learning algorithm turns out to be an efficient stock picking tool, which outperforms classic strategies that screen stocks according to their ESG ratings, such as the popular best-in-class approach. Our paper brings new ideas to the growing field of financial literature that investigates the links between ESG behavior and the economy. We show indeed that there is clearly some form of alpha in the ESG profile of a company, but that this alpha can be accessed only with powerful, non-linear techniques such as machine learning.
    Keywords: Sustainable Investments,Best-in-class approach,ESG,Machine Learning,Portfolio Construction
    Date: 2018–10–22
  7. By: Aurelio F. Bariviera; Ignasi Merediz-Solà
    Abstract: This survey develops a dual analysis, consisting, first, in a bibliometric examination and, second, in a close literature review of all the scientific production around cryptocurrencies conducted in economics so far. The aim of this paper is twofold. On the one hand, we propose a hybrid methodological approach to perform comprehensive literature reviews. On the other hand, we provide an updated state of the art in cryptocurrency economic literature. Our methodology is relevant when the topic comprises a large number of papers, which makes a detailed reading of all of them unrealistic. This dual perspective offers a full landscape of cryptocurrency economic research. Firstly, by means of the distant reading provided by machine learning bibliometric techniques, we are able to identify main topics, journals, key authors, and other macro aggregates. Secondly, based on the information provided by the previous stage, the traditional literature review provides a closer look at methodologies, data sources and other details of the papers. In this way, we offer a classification and analysis of the mounting research produced in a relatively short time span.
    Date: 2020–03
  8. By: Saeed Nosratabadi; Amir Mosavi; Puhong Duan; Pedram Ghamisi
    Abstract: This paper provides the state of the art of data science in economics. Advances in data science are investigated through a novel taxonomy of applications and methods, in three classes: deep learning models, ensemble models, and hybrid models. Application domains include the stock market, marketing, e-commerce, corporate banking, and cryptocurrency. The PRISMA method, a systematic literature review methodology, is used to ensure the quality of the survey. The findings reveal a trend toward hybrid models, as more than 51% of the reviewed articles applied a hybrid model. In addition, based on the RMSE accuracy metric, hybrid models had higher prediction accuracy than other algorithms. The trend is nevertheless expected to move toward the advancement of deep learning models.
    Date: 2020–03
  9. By: Herrera, Pablo Matías; Garcia Fronti, Javier
    Abstract: The promises and risks associated with the development of big data technologies are exacerbated by the global viral pandemic known as COVID-19, SARS-CoV-2, or coronaviruses. From an organizational point of view, a series of debates arise that, although in principle antagonistic, have outcomes in intermediate positions and question values of society. Faced with these questions, a very common practice in the face of the development of the coronavirus was to raise dystopian futures. Within this work, avoiding the proposal of "what is expected once it ends", the global viral pandemic is taken as the opening of a space to reflect on the development of Big data technologies and elaborate questions related to biopolitics. One of these questions is the following: how can practical categories, strategies, protocols and policies be developed to articulate and assign responsibilities in the relationship between humans and nonhumans? The answer to that question, surely, is not related to the approach of dystopian futures. Proposing scenarios based on "what is expected once it ends" is not a viable path. In this sense, the answer is related rather to the understanding of the agencies that exist in a mechanism in which, at least, humans, technologies, algorithms, data, and viruses interact. Understanding what is happening today is what enables policy development based on the allocation of responsibilities within the complex hybrid that represents the development of big data technologies.
    Keywords: Tecnologías de Big data; Biopolítica; Mecanismos relacionales de procesamiento de datos; Pandemia mundial viral
    JEL: O32 O33 O38
    Date: 2020–04–09
  10. By: Knaus, Michael C. (University of St. Gallen)
    Abstract: This paper consolidates recent methodological developments based on Double Machine Learning (DML) with a focus on program evaluation under unconfoundedness. DML based methods leverage flexible prediction methods to control for confounding in the estimation of (i) standard average effects, (ii) different forms of heterogeneous effects, and (iii) optimal treatment assignment rules. We emphasize that these estimators all build on the same doubly robust score, which allows computational synergies to be exploited. An evaluation of multiple programs of the Swiss Active Labor Market Policy shows how DML based methods enable a comprehensive policy analysis. However, we find evidence that estimates of individualized heterogeneous effects can become unstable.
    Keywords: causal machine learning, conditional average treatment effects, optimal policy learning, individualized treatment rules, multiple treatments
    JEL: C21
    Date: 2020–03
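As a rough illustration of the doubly robust score mentioned in item 10, the sketch below uses the standard augmented inverse-probability-weighting (AIPW) form for a binary treatment. This is an assumption made for illustration: the paper covers multiple treatments and its exact notation is not reproduced here. The function names and the inputs `mu1`, `mu0` (predicted outcomes under treatment and control) and `e` (predicted treatment probabilities) are hypothetical; in practice they would come from cross-fitted prediction models.

```python
def doubly_robust_scores(y, d, mu1, mu0, e):
    """AIPW score per observation for a binary treatment d in {0, 1}.

    y   : observed outcomes
    d   : treatment indicators (1 = treated, 0 = control)
    mu1 : predicted outcome if treated (assumed supplied by an ML model)
    mu0 : predicted outcome if not treated (assumed supplied by an ML model)
    e   : predicted treatment probability, strictly between 0 and 1
    """
    scores = []
    for yi, di, m1, m0, ei in zip(y, d, mu1, mu0, e):
        # Outcome-model difference plus inverse-probability-weighted residuals.
        score = (m1 - m0
                 + di * (yi - m1) / ei
                 - (1 - di) * (yi - m0) / (1 - ei))
        scores.append(score)
    return scores


def average_treatment_effect(y, d, mu1, mu0, e):
    """The average effect estimate is the sample mean of the scores."""
    scores = doubly_robust_scores(y, d, mu1, mu0, e)
    return sum(scores) / len(scores)
```

Because the same per-observation scores can be averaged over the whole sample or within subgroups, they only need to be computed once, which is one way to read the "computational synergies" the abstract refers to.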
  11. By: Kevin Kuo; Daniel Lupton
    Abstract: Machine learning methods have garnered increasing interest among actuaries in recent years. However, their adoption by practitioners has been limited, partly due to the lack of transparency of these methods, as compared to generalized linear models. In this paper, we discuss the need for model interpretability in property & casualty insurance ratemaking, propose a framework for explaining models, and present a case study to illustrate the framework.
    Date: 2020–03
  12. By: Kyle Colangelo; Ying-Ying Lee
    Abstract: We propose a nonparametric inference method for causal effects of continuous treatment variables, under unconfoundedness and in the presence of high-dimensional or nonparametric nuisance parameters. Our double debiased machine learning (DML) estimators for the average dose-response function (or the average structural function) and the partial effects are asymptotically normal with nonparametric convergence rates. The nuisance estimators for the conditional expectation function and the conditional density can be nonparametric kernel or series estimators or ML methods. Using a kernel-based doubly robust influence function and cross-fitting, we give tractable primitive conditions under which the nuisance estimators do not affect the first-order large sample distribution of the DML estimators. We justify the use of kernel to localize the continuous treatment at a given value by the Gateaux derivative. We implement various ML methods in Monte Carlo simulations and an empirical application on a job training program evaluation.
    Date: 2020–04
  13. By: Pak, Anton; Gannon, Brenda; Staib, Andrew
    Abstract: Problem definition. The current systems of reporting waiting time to patients in public emergency departments (EDs) have largely relied on rolling average or median estimators, which have limited accuracy. This paper proposes using statistical learning algorithms that significantly improve waiting time forecasts. Practical Relevance. Generating and using a large set of queueing and service flow variables, we provide evidence of the improvement in waiting time accuracy and reduction in prediction errors. In addition to the mean squared prediction error (MSPE) and mean absolute prediction error (MAPE), we advocate using the percentage of underpredicted observations, as patients are more concerned when the actual waiting time exceeds the time forecast rather than vice versa. Providing accurate waiting times also helps to improve satisfaction of ED patients. Methodology. The use of the statistical learning methods (ridge, LASSO, random forest) is motivated by their advantages in exploring data connections in flexible ways, identifying relevant predictors, and preventing overfitting of the data. We also use quantile regression to generate time forecasts which may better address the patient's asymmetric perception of underpredicted and overpredicted ED waiting times. Results. We find robust evidence that the proposed estimators significantly outperform the commonly implemented rolling average. Using queueing and service flow variables together with information on diurnal fluctuations, quantile regression outperforms the best rolling average by 18% with respect to MSPE and reduces by 42% the number of patients with large underpredicted waiting times. Managerial implications. By reporting more accurate waiting times, hospitals may enjoy higher patient satisfaction. We show that to increase the predictive accuracy, a hospital ED may decide to provide predictions to patients registered only during the daytime when the ED operates at full capacity, translating to more predictive service rates and the demand for treatments.
    Date: 2020–03–26
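The three evaluation measures named in item 13 (MSPE, MAPE, and the share of underpredicted observations) and the rolling-average baseline can be sketched as follows. This is a minimal illustration rather than the authors' code: the window length, the function names, and the sample waiting times are all made up, and MAPE here follows the abstract's definition (mean absolute prediction error, in the same units as the waiting time) rather than a percentage error.

```python
def rolling_average_forecast(waits, window):
    """Forecast each waiting time as the mean of the previous `window` waits."""
    return [sum(waits[i - window:i]) / window
            for i in range(window, len(waits))]


def mspe(actual, predicted):
    """Mean squared prediction error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute prediction error (in minutes, not a percentage)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def share_underpredicted(actual, predicted):
    """Fraction of cases where the actual wait exceeded the forecast."""
    return sum(1 for a, p in zip(actual, predicted) if a > p) / len(actual)


# Hypothetical waiting times in minutes, for illustration only.
waits = [30, 40, 50, 20, 60, 40]
window = 2
forecasts = rolling_average_forecast(waits, window)
actual = waits[window:]  # the observations that have a forecast
```

The underprediction share treats errors asymmetrically, which matches the abstract's point that patients mind a wait longer than forecast more than a wait shorter than forecast.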
  14. By: Samir Wadhwa; Roy Dong
    Abstract: The advent of machine learning tools has led to the rise of data markets. These data markets are characterized by multiple data purchasers interacting with a set of data sources. Data sources have more information about the quality of data than the data purchasers; additionally, data itself is a non-rivalrous good that can be shared with multiple parties at negligible marginal cost. In this paper, we study the multiple-principal, multiple-agent problem with non-rivalrous goods. Under the assumption that the principal's payoff is quasilinear in the payments given to agents, we show that there is a fundamental degeneracy in the market of non-rivalrous goods. Specifically, for a general class of payment contracts, there will be an infinite set of generalized Nash equilibria. This multiplicity of equilibria also affects common refinements of equilibrium definitions intended to uniquely select an equilibrium: both variational equilibria and normalized equilibria will be non-unique in general. This implies that most existing equilibrium concepts cannot provide predictions on the outcomes of data markets emerging today. The results support the idea that modifications to payment contracts themselves are unlikely to yield a unique equilibrium, and either changes to the models of study or new equilibrium concepts will be required to determine unique equilibria in settings with multiple principals and a non-rivalrous good.
    Date: 2020–03
  15. By: TOJO Yoshizumi
    Abstract: An open internet and freedom of cross-border data distribution are critical for the evolution of digital trade. While governments are struggling to maximize the opportunities for economic growth via utilization of big data on the one hand, they face challenges in controlling the risks associated with data flows and achieving other public policy goals such as privacy and cybersecurity protection on the other. The latter legitimate concern accounts for the proliferation of data localization measures among many countries. The WTO Agreement, especially the General Agreement on Trade in Services (GATS), is the most important legal text governing data localization measures. The GATS faces several limitations in regulating data localization measures because most of it was negotiated when the internet was in its infancy. In comparison, over the last decade FTAs have developed rules on data flows in e-commerce chapters and complemented the multilateral rules. This is especially the case with the rules provided in CPTPP. FTAs, therefore, could also work as ’model rules’ for future multilateral negotiations.
    Date: 2020–02
  16. By: Adriano Koshiyama; Sebastian Flennerhag; Stefano B. Blumberg; Nick Firoozye; Philip Treleaven
    Abstract: In this work we introduce QuantNet: an architecture that is capable of transferring knowledge over systematic trading strategies in several financial markets. By having a system that is able to leverage and share knowledge across them, our aim is two-fold: to circumvent the so-called Backtest Overfitting problem; and to generate higher risk-adjusted returns and fewer drawdowns. To do that, QuantNet exploits a form of modelling called Transfer Learning, where two layers are market-specific and another one is market-agnostic. This ensures that the transfer occurs across trading strategies, with the market-agnostic layer acting as a vehicle to share knowledge, cross-influence each strategy's parameters, and ultimately the trading signal produced. In order to evaluate QuantNet, we compared its performance in relation to the option of not performing transfer learning, that is, using market-specific old-fashioned machine learning. In summary, our findings suggest that QuantNet performs better than non transfer-based trading strategies, improving the Sharpe ratio by 15% and the Calmar ratio by 41% across 3103 assets in 58 equity markets across the world. Code coming soon.
    Date: 2020–04
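For readers unfamiliar with the two performance measures quoted in item 16, the sketch below computes them from a series of periodic returns using common textbook definitions. The conventions are assumptions, not taken from the paper: a zero risk-free rate, 252 periods per year by default, the sample standard deviation, and Calmar defined as annualized compound return divided by maximum drawdown. The function names are illustrative.

```python
import math


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized mean return over sample standard deviation (risk-free rate assumed 0)."""
    n = len(returns)
    mean_r = sum(returns) / n
    variance = sum((r - mean_r) ** 2 for r in returns) / (n - 1)
    return mean_r / math.sqrt(variance) * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of cumulative wealth, as a fraction of the peak."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, (peak - wealth) / peak)
    return worst


def calmar_ratio(returns, periods_per_year=252):
    """Annualized compound return divided by maximum drawdown.

    Raises ZeroDivisionError if the series never falls below a previous peak.
    """
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    annualized = total ** (periods_per_year / len(returns)) - 1.0
    return annualized / max_drawdown(returns)
```

The Sharpe ratio penalizes volatility in both directions, while the Calmar ratio penalizes only the worst loss from a peak, which is why the abstract reports both alongside its aim of fewer drawdowns.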
  17. By: Chong, Terence Tai Leung; Li, Chen
    Abstract: This study employs correlation coefficients and the factor-augmented vector autoregressive (FAVAR) model to investigate the relationship between the stock market and investors’ sentiment measured by big data. The investors’ sentiment index is constructed from a pool of relative keyword series provided by the Baidu Index. We target two composite stock indices, namely the Hang Seng Index and the Shanghai Composite Index. We first compute the Pearson product-moment correlation coefficient to find the degree of correlation between keywords and composite stock price indices. Then, we apply the FAVAR model to obtain the impulse response of stock price to the investors’ sentiment index. Finally, we examine the leading effects of keywords on stock prices using lagged correlation coefficients. We obtain two main findings. First, a strong correlation exists between investors’ sentiment and composite stock price. Second, before and after the launch of the Shanghai-Hong Kong Stock Connect, the keywords affecting the fluctuation of the Hang Seng Index are different.
    Keywords: Baidu Index, Stock Connect
    JEL: G14
    Date: 2020–01–01
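The first and last steps described in item 17 (Pearson correlation, then lagged correlation to look for leading effects) can be sketched as below. This is a minimal sketch under stated assumptions: the function names are illustrative, the two series are assumed to be aligned and equally long, and a positive `lag` means the keyword series leads the price series by that many periods. The FAVAR step is not covered.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def lagged_correlation(keyword, price, lag):
    """Correlation between keyword[t] and price[t + lag], for lag >= 0."""
    if lag == 0:
        return pearson(keyword, price)
    # Drop the last `lag` keyword values and the first `lag` prices so the
    # remaining pairs line up as (keyword[t], price[t + lag]).
    return pearson(keyword[:-lag], price[lag:])
```

Comparing `lagged_correlation` across several lags gives a simple picture of whether search interest tends to move before prices do, although a high lagged correlation alone does not establish causation.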
  18. By: Xingwei Hu
    Abstract: When ranking big data observations such as colleges in the United States, diverse consumers reveal heterogeneous preferences. The objective of this paper is to sort out a linear ordering for these observations and to recommend strategies to improve their relative positions in the ranking. A properly sorted solution could help consumers make the right choices, and governments make wise policy decisions. Previous researchers have applied exogenous weighting or multivariate regression approaches to sort big data objects, ignoring their variety and variability. By recognizing the diversity and heterogeneity among both the observations and the consumers, we instead apply endogenous weighting to these contradictory revealed preferences. The outcome is a consistent steady-state solution to the counterbalance equilibrium within these contradictions. The solution takes into consideration the spillover effects of multiple-step interactions among the observations. When information from data is efficiently revealed in preferences, the revealed preferences greatly reduce the volume of the required data in the sorting process. The employed approach can be applied in many other areas, such as sports team ranking, academic journal ranking, voting, and real effective exchange rates.
    Date: 2020–03
  19. By: Arthur Charpentier; Romuald Elie; Carl Remlinger
    Abstract: Reinforcement learning algorithms describe how an agent can learn an optimal action policy in a sequential decision process, through repeated experience. In a given environment, the agent policy provides him some running and terminal rewards. As in online learning, the agent learns sequentially. As in multi-armed bandit problems, when an agent picks an action, he cannot infer ex-post the rewards induced by other action choices. In reinforcement learning, his actions have consequences: they influence not only rewards, but also future states of the world. The goal of reinforcement learning is to find an optimal policy, that is, a mapping from the states of the world to the set of actions, in order to maximize cumulative reward, which is a long-term strategy. Exploring might be sub-optimal on a short-term horizon but could lead to optimal long-term outcomes. Many problems of optimal control, popular in economics for more than forty years, can be expressed in the reinforcement learning framework, and recent advances in computational science, provided in particular by deep learning algorithms, can be used by economists in order to solve complex behavioral problems. In this article, we survey the state of the art of reinforcement learning techniques, and present applications in economics, game theory, operation research and finance.
    Date: 2020–03
  20. By: KAWAHAMA Noboru; TAKEDA Kuninobu
    Abstract: Mega platform operators are active in the advertising market. They make use of that profit to provide new services to consumers. In that respect, online advertising is at the core of the Internet ecosystem. Today many competition authorities are carrying out sector inquiries on this topic. One common concern they share is the lack of transparency in the online advertising market, especially in the programmatic display advertising market. This paper classifies ad tech markets in order to analyze the advertising industry from the perspective of competition policy and sorts essential issues from the practices and discussions in foreign countries. Significant results of this study are as follows. First, competitive advantage in the advertising market is determined by ad tech, data, and advertisement inventory. Google seems to control all of these. Second, the publisher-side ad servers are considered a bottleneck for competition among ad exchanges. Third, sequential auctions carried out by Google create arbitrage opportunities for Google. These insights have implications for public policy and enforcement of competition law in Japan.
    Date: 2020–02
  21. By: Oshan, Taylor M.
    Abstract: Massive amounts of data that characterize how people meet their economic needs, interact within social communities, and utilize shared resources are being produced by cities. Harnessing these ever-increasing data streams is crucial for understanding urban dynamics. Within the context of transportation modeling it still remains largely unknown whether or not these new data sources provide the opportunity to better understand spatial processes. Therefore, in this paper, the usefulness of a recently available big transport dataset - the New York City (NYC) taxi trip data - is evaluated within a spatial interaction modeling framework. This is done by first comparing parameter estimates from a model using the taxi data to parameter estimates from a model using a traditional commuting dataset. In addition, the high temporal resolution of the taxi data provides an exciting means to explore potential dynamics in movement behavior. It is demonstrated how parameter estimates can be obtained for temporal subsets of data and compared over time to investigate mobility dynamics. The results of this work indicate that a pitfall of big transport data is that it is less useful for modeling distinct phenomena; however, there is a strong potential for modeling high frequency temporal dynamics of diverse urban activities.
    Date: 2020–03–09

This nep-big issue is ©2020 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.