nep-big New Economics Papers
on Big Data
Issue of 2022‒03‒21
sixteen papers chosen by
Tom Coupé
University of Canterbury

  1. Managers versus Machines: Do Algorithms Replicate Human Intuition in Credit Ratings? By Matthew Harding; Gabriel F. R. Vasconcelos
  2. Could Machine Learning be a General Purpose Technology? A Comparison of Emerging Technologies Using Data from Online Job Postings By Avi Goldfarb; Bledi Taska; Florenta Teodoridis
  3. A Framework for using Machine Learning to Support Qualitative Data Coding By Baumgartner, Peter; Smith, Amanda; Olmsted, Murrey; Ohse, Dawn
  4. Traditional Marketing Analytics, Big Data Analytics, Big Data System Quality and the Success of New Product Development By Aljumah, Ahmad Ibrahim; Nuseir, Mohammed T.; Alam, Md. Mahmudul
  5. Machine Learning for Credit Scoring: Improving Logistic Regression with Non-Linear Decision-Tree Effects By Elena Ivona Dumitrescu; Sullivan Hué; Christophe Hurlin; Sessi Tokpavi
  6. Host type and pricing on Airbnb: Seasonality and perceived market power By Georges Casamatta; Sauveur Giannoni; Daniel Brunstein; Johan Jouve
  7. Data-Driven Incentive Alignment in Capitation Schemes By Mark Braverman; Sylvain Chassang
  8. Development of a Machine Learning Tool to Establish Relationships between Occupations and Training Programmes in Uruguay By Velardez, Miguel Omar; Dima, Germán César
  9. Organizational Performance and Capabilities to Analyze Big Data: Do the Ambidexterity and Business Value of Big Data Analytics Matter? By Aljumah, Ahmad Ibrahim; Nuseir, Mohammed T.; Alam, Md. Mahmudul
  10. Searching for Approval By Sumit Agarwal; John R. Grigsby; Ali Hortaçsu; Gregor Matvos; Amit Seru
  11. FisrEbp: Enterprise Bankruptcy Prediction via Fusing its Intra-risk and Spillover-Risk By Yu Zhao; Shaopeng Wei; Yu Guo; Qing Yang; Gang Kou
  12. Reading Between the Lines: Objective Function Estimation using RBA Communications By Gao, Robert
  13. A Neural Phillips Curve and a Deep Output Gap By Philippe Goulet Coulombe
  14. Ideology and monetary policy: the role of political parties’ stances in the ECB’s parliamentary hearings By Fraccaroli, Nicolò; Giovannini, Alessandro; Jamet, Jean-Francois; Persson, Eric
  15. Quantifying Vision through Language Demonstrates that Visionary Ideas Come from the Periphery By Vicinanza, Paul; Goldberg, Amir; Srivastava, Sameer
  16. Legal Loopholes and Data for Dollars: How Law Enforcement and Intelligence Agencies Are Buying Your Data from Brokers By Shenkman, Carey; Franklin, Sharon Bradford; Nojeim, Greg; Thakur, Dhanaraj

  1. By: Matthew Harding; Gabriel F. R. Vasconcelos
    Abstract: We use machine learning techniques to investigate whether it is possible to replicate the behavior of bank managers who assess the risk of commercial loans made by a large commercial US bank. Even though a typical bank already relies on an algorithmic scorecard process to evaluate risk, bank managers are given significant latitude in adjusting the risk score in order to account for other holistic factors based on their intuition and experience. We show that it is possible to find machine learning algorithms that can replicate the behavior of the bank managers. The input to the algorithms consists of a combination of standard financials and soft information available to bank managers as part of the typical loan review process. We also document the presence of significant heterogeneity in the adjustment process that can be traced to differences across managers and industries. Our results highlight the effectiveness of machine learning based analytic approaches to banking and the potential challenges to high-skill jobs in the financial sector.
    Date: 2022–02
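The replication exercise the abstract describes can be framed as a supervised learning problem: predict the manager's adjustment to the algorithmic risk score from financials plus soft information. The sketch below is a hypothetical illustration on synthetic data; the feature names, the gradient-boosting learner, and the data-generating process are all assumptions, not the paper's specification.

```python
# Hypothetical sketch: learn a manager's score adjustment from financial
# and soft-information features. Data and model choice are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 500
financials = rng.normal(size=(n, 5))   # stand-in for standard financial ratios
soft_info = rng.normal(size=(n, 3))    # stand-in for coded soft information
X = np.hstack([financials, soft_info])

# Synthetic "manager adjustment": nonlinear in one ratio, linear in soft info
adjustment = np.tanh(financials[:, 0]) + 0.5 * soft_info[:, 0] \
    + rng.normal(scale=0.1, size=n)

model = GradientBoostingRegressor(random_state=0).fit(X[:400], adjustment[:400])
preds = model.predict(X[400:])
r2 = model.score(X[400:], adjustment[400:])  # out-of-sample fit
```

A high out-of-sample R² on held-out loans is the kind of evidence the paper uses to argue that the managers' behavior is replicable.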
  2. By: Avi Goldfarb; Bledi Taska; Florenta Teodoridis
    Abstract: General purpose technologies (GPTs) push out the production possibility frontier and are of strategic importance to managers and policymakers. While theoretical models that explain the characteristics, benefits, and approaches to create and capture value from GPTs have advanced significantly, empirical methods to identify GPTs are lagging. The handful of available attempts are typically context specific and rely on hindsight. For managers deciding on technology strategy, this means that the classification, when available, comes too late. We propose a more universal approach of assessing the GPT likelihood of emerging technologies using data from online job postings. We benchmark our approach against prevailing empirical GPT methods that exploit patent data and provide an application on a set of emerging technologies. Our application exercise suggests that a cluster of technologies comprising machine learning and related data science technologies is relatively likely to be a GPT.
    JEL: O32 O33
    Date: 2022–02
  3. By: Baumgartner, Peter (RTI International); Smith, Amanda; Olmsted, Murrey; Ohse, Dawn
    Abstract: Open-ended survey questions provide qualitative data that are useful for a multitude of reasons. However, qualitative data analysis is labor intensive, and researchers often lack the needed time and resources resulting in underutilization of qualitative data. In attempting to address these issues, we looked to machine learning and recent advances in language models and transfer learning to assist in qualitative coding of responses. We trained a machine learning model following the BERT architecture to predict thematic codes that were then adjudicated by human coders. Results suggest this is a promising approach that can be used to support traditional coding methods and has the potential to alleviate some of the burden associated with qualitative data analysis.
    Date: 2021–11–16
  4. By: Aljumah, Ahmad Ibrahim; Nuseir, Mohammed T.; Alam, Md. Mahmudul (Universiti Utara Malaysia)
    Abstract: Objective/Purpose: This study investigates the impact of traditional marketing analytics and big data analytics on the success of a new product. Moreover, it assesses the mediating effects of the quality of big data system. Methodology/Design: This study is based on primary data that were collected through an online questionnaire survey from large manufacturing firms operating in UAE. Out of total distributed 421 samples, 327 samples were used for final data analysis. The survey was conducted from March-April 2020 and data analysis was done via Structural Equation Modelling (SEM-PLS). Findings: It emerges that big data analysis (BDA), traditional marketing analysis (TMA) and big data system quality (BDSQ) are significant determinants of new product development (NPD) success. Meanwhile, the BDA and TMA significantly affect the BDSQ. Results of the mediating role of BDSQ in the relationship between the BDA and NPD as well as TMA and NPD are significant. Implications: There are significant policy implications for practitioners and researchers concerning the role of analytics, particularly big data analytics and big data system quality, when attempting to achieve success in developing new products. Originality/Value: This is an original study based on primary data from UAE.
    Date: 2021–11–30
  5. By: Elena Ivona Dumitrescu (EconomiX - UPN - Université Paris Nanterre - CNRS - Centre National de la Recherche Scientifique); Sullivan Hué (LEO - Laboratoire d'Économie d'Orleans - UO - Université d'Orléans - UT - Université de Tours, AMSE - Aix-Marseille Sciences Economiques - EHESS - École des hautes études en sciences sociales - AMU - Aix Marseille Université - ECM - École Centrale de Marseille - CNRS - Centre National de la Recherche Scientifique); Christophe Hurlin (LEO - Laboratoire d'Économie d'Orleans - UO - Université d'Orléans - UT - Université de Tours); Sessi Tokpavi (LEO - Laboratoire d'Économie d'Orleans - UO - Université d'Orléans - UT - Université de Tours)
    Abstract: In the context of credit scoring, ensemble methods based on decision trees, such as the random forest method, provide better classification performance than standard logistic regression models. However, logistic regression remains the benchmark in the credit risk industry mainly because the lack of interpretability of ensemble methods is incompatible with the requirements of financial regulators. In this paper, we propose a high-performance and interpretable credit scoring method called penalised logistic tree regression (PLTR), which uses information from decision trees to improve the performance of logistic regression. Formally, rules extracted from various short-depth decision trees built with original predictive variables are used as predictors in a penalised logistic regression model. PLTR allows us to capture non-linear effects that can arise in credit scoring data while preserving the intrinsic interpretability of the logistic regression model. Monte Carlo simulations and empirical applications using four real credit default datasets show that PLTR predicts credit risk significantly more accurately than logistic regression and compares competitively to the random forest method.
    Keywords: Risk management,Credit scoring,Machine learning,Interpretability,Econometrics
    Date: 2022–03–16
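The PLTR construction described in the abstract — rules extracted from short-depth decision trees used as predictors in a penalised logistic regression — can be sketched roughly as below. This is a simplified illustration on synthetic data; the variable pairings, tree depth, and penalty strength are assumptions, not the paper's exact specification.

```python
# Rough PLTR sketch: short trees generate binary leaf-membership "rules",
# which feed an L1-penalised logistic regression. Synthetic credit data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
# Synthetic default indicator with a non-linear (threshold) effect
y = ((X[:, 0] > 0.5) & (X[:, 1] < 0)).astype(int)

# One short-depth tree per pair of predictors; record each point's leaf
leaf_ids = []
for i, j in [(0, 1), (2, 3), (0, 2)]:
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X[:, [i, j]], y)
    leaf_ids.append(tree.apply(X[:, [i, j]]))

# One-hot encode leaf memberships into binary rule features
stacked = np.column_stack(leaf_ids)
rules = np.hstack([
    (stacked[:, [k]] == np.unique(stacked[:, k])).astype(float)
    for k in range(stacked.shape[1])
])

# Penalised (L1) logistic regression on the extracted rules
pltr = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
pltr.fit(rules, y)
accuracy = pltr.score(rules, y)
```

The resulting model stays interpretable — each coefficient attaches to an explicit threshold rule — while capturing the non-linearity a plain logistic regression on raw variables would miss.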
  6. By: Georges Casamatta (LISA - Lieux, Identités, eSpaces, Activités - UPP - Université Pascal Paoli - CNRS - Centre National de la Recherche Scientifique, TSE - Toulouse School of Economics - UT1 - Université Toulouse 1 Capitole - Université Fédérale Toulouse Midi-Pyrénées - EHESS - École des hautes études en sciences sociales - CNRS - Centre National de la Recherche Scientifique - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement); Sauveur Giannoni (LISA - Lieux, Identités, eSpaces, Activités - UPP - Université Pascal Paoli - CNRS - Centre National de la Recherche Scientifique); Daniel Brunstein (LISA - Lieux, Identités, eSpaces, Activités - UPP - Université Pascal Paoli - CNRS - Centre National de la Recherche Scientifique); Johan Jouve (LISA - Lieux, Identités, eSpaces, Activités - UPP - Université Pascal Paoli - CNRS - Centre National de la Recherche Scientifique)
    Abstract: The literature on short-term rentals emphasises the heterogeneity of the host population. Some argue that professional and opportunistic hosts differ in their pricing strategies. This study highlights how differences in market perception and information create a price differential between professional and non-professional players. Proposing an original and accurate definition of professional hosts, we rely on a large dataset of almost 9,000 properties and 73,000 observations to investigate the pricing behaviour of Airbnb sellers in Corsica (France). Using OLS and double machine learning methods, we demonstrate that a price differential exists between professional and opportunistic sellers. In addition, we assess the impact of seasonality in demand on the size and direction of this price differential. We find that professionals perceive a higher degree of market power than others during the peak season, which allows them to enhance their revenues.
    Keywords: Short-term rental,Pricing,Professionalism,Double machine learning,Seasonality,Market-power
    Date: 2022
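The double machine learning step used in the paper can be illustrated on synthetic data: residualize both log price and the professional-host indicator on controls with a flexible learner (cross-fitted), then regress residual on residual to estimate the price differential. Everything below — the controls, the random-forest nuisance models, the 15% premium — is an invented example, not the paper's data or specification.

```python
# Hedged double-ML sketch (partialling-out, two-fold cross-fitting)
# for a professional-host price premium. All numbers are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 1000
controls = rng.normal(size=(n, 3))          # e.g. capacity, location, amenities
prob_pro = 1 / (1 + np.exp(-controls[:, 0]))
professional = (rng.uniform(size=n) < prob_pro).astype(float)
true_premium = 0.15                          # assumed effect on log price
log_price = true_premium * professional \
    + controls @ np.array([0.3, -0.2, 0.1]) \
    + rng.normal(scale=0.1, size=n)

# Two-fold cross-fitting: fit nuisances on one half, residualize the other
theta_parts = []
half = np.arange(n // 2)
for train, test in [(half, half + n // 2), (half + n // 2, half)]:
    m_hat = RandomForestRegressor(random_state=0).fit(controls[train], log_price[train])
    g_hat = RandomForestRegressor(random_state=0).fit(controls[train], professional[train])
    v = log_price[test] - m_hat.predict(controls[test])      # outcome residual
    w = professional[test] - g_hat.predict(controls[test])   # treatment residual
    theta_parts.append((w @ v) / (w @ w))
theta = float(np.mean(theta_parts))   # estimated price differential
```

The point of the residual-on-residual regression is that the estimate of the premium is robust to regularization bias in the two machine-learned nuisance functions.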
  7. By: Mark Braverman (Princeton University); Sylvain Chassang (New York University)
    Abstract: This paper explores whether Big Data, taking the form of extensive but high dimensional records, can reduce the cost of adverse selection in government-run capitation schemes. We argue that using data to improve the ex ante precision of capitation regressions is unlikely to be helpful. Even if types become essentially observable, the high dimensionality of covariates makes it infeasible to precisely estimate the cost of serving a given type. This gives an informed private provider scope to select types that are relatively cheap to serve. Instead, we argue that data can be used to align incentives by forming unbiased and non-manipulable ex post estimates of a private provider’s gains from selection.
    Keywords: adverse selection, big data, capitation, health-care regulation, detail-free mechanism design, delegated model selection
    JEL: C55 D82 H51 I11 I13
    Date: 2020–03
  8. By: Velardez, Miguel Omar; Dima, Germán César
    Abstract: This paper develops an automatic, unsupervised tool that recommends training programmes for a series of occupations, based on similarities between the graduate profiles of a set of programmes at the Universidad del Trabajo del Uruguay (UTU) and the descriptions of the tasks associated with 22 occupations obtained from Uruguay's ONET survey. The tool uses natural language processing (NLP) instruments that focus on the repetition of key concepts and on the similarity of the text as a whole. To evaluate the method, the recommendations produced by the tool were compared against those provided by a group of experts. The results show that the tool can recommend an average of up to nine training programmes per occupation, with a mean success rate of 85%. The potential of this methodology lies in its ability to handle large volumes of data efficiently, which can help provide unbiased information in career development services.
    Date: 2022–02–01
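The matching approach described in the abstract — scoring text similarity between training-programme descriptions and occupation task descriptions, then recommending the top-ranked programmes — can be sketched with TF-IDF and cosine similarity. The programme and occupation texts below are invented stand-ins for the UTU and ONET Uruguay data.

```python
# Illustrative programme-occupation matcher: rank training programmes by
# the cosine similarity of their descriptions to an occupation's tasks.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

programs = {
    "welding": "metal joining, arc welding, workshop safety",
    "accounting": "bookkeeping, balance sheets, financial records",
    "carpentry": "woodworking, furniture assembly, measuring and cutting",
}
occupation = "maintains financial records and prepares balance sheets"

vec = TfidfVectorizer()
matrix = vec.fit_transform(list(programs.values()) + [occupation])
sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

# Recommend programmes in descending order of similarity
ranked = sorted(zip(programs, sims), key=lambda t: -t[1])
best_program = ranked[0][0]
```

In practice, such recommendations would be validated against expert judgments, as the paper does when it reports an 85% mean success rate.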
  9. By: Aljumah, Ahmad Ibrahim; Nuseir, Mohammed T.; Alam, Md. Mahmudul (Universiti Utara Malaysia)
    Abstract: Objective/Purpose: The aim of the study is to examine the impact of big data analytics capabilities (BDAC) on organizational performance. The study also examines the mediating role of ambidexterity and the moderating role of the business value of big data (BVBD) analytics in the relationship between big data analytics capabilities and organizational performance. Methodology/Design: This study collected primary data through a questionnaire survey of large manufacturing firms operating in the UAE. A total of 650 questionnaires were distributed among the manufacturing firms, and 295 samples were used for the final data analysis. The survey was conducted from September to November 2019, and the data were analysed via partial least squares structural equation modelling (PLS-SEM). Findings: The findings support the scalability of BDA and its determinants of firm performance, such as system quality, business value, and information quality. The moderating role of business value and the mediating role of ambidexterity are found to be significant. The results reveal that managers need to consider business value and quality dynamics as crucial strategic objectives to achieve high firm performance. Implications: The study has significant policy implications for practitioners and researchers in understanding issues related to big data analytics. Originality/Value: This is an original study based on primary data from UAE manufacturing firms.
    Date: 2021–11–30
  10. By: Sumit Agarwal (National University of Singapore); John R. Grigsby (Princeton University); Ali Hortaçsu (University of Chicago); Gregor Matvos (Northwestern University); Amit Seru (Stanford University)
    Abstract: We study the interaction of search and application approval in credit markets. We combine a unique dataset, which details search behavior for a large sample of mortgage borrowers, with loan application and rejection decisions. Our data reveal substantial dispersion in mortgage rates and search intensity, conditional on observables. However, in contrast to the predictions of standard search models, we find a novel non-monotonic relationship between search and realized prices: borrowers who search a lot obtain more expensive mortgages than borrowers who search less frequently. The evidence suggests that this occurs because lenders screen borrowers' creditworthiness, rejecting unworthy borrowers, which differentiates consumer credit markets from other search markets. Based on these insights, we build a model that combines search and screening in the presence of asymmetric information. Risky borrowers internalize the probability that their application is rejected, and behave as if they had higher search costs. The model rationalizes the relationship between search, interest rates, defaults, and application rejections, and highlights the tight link between credit standards and pricing. We estimate the parameters of the model and study several counterfactuals. The model suggests that "overpayment" may be a poor proxy for consumer unsophistication, since it partly represents rational search in the presence of rejections. Moreover, the development of improved screening technologies from AI and big data (i.e., fintech lending) could endogenously lead to more severe adverse selection in credit markets. Finally, place-based policies, such as the Community Reinvestment Act, may affect equilibrium prices through endogenous search responses rather than increased credit risk.
    Keywords: credit markets, household finance
    JEL: G21 G50 G51 G53 L00
    Date: 2020–06
  11. By: Yu Zhao; Shaopeng Wei; Yu Guo; Qing Yang; Gang Kou
    Abstract: In this paper, we propose to model enterprise bankruptcy risk by fusing its intra-risk and spillover risk. Under this framework, we propose a novel method equipped with an LSTM-based intra-risk encoder and a GNN-based spillover-risk encoder. Specifically, the intra-risk encoder captures enterprise intra-risk using statistically correlated indicators from basic business information and litigation information. The spillover-risk encoder consists of hypergraph neural networks and heterogeneous graph neural networks, which model spillover risk through two aspects, i.e. hyperedges and multiplex heterogeneous relations in the enterprise knowledge graph, respectively. To evaluate the proposed model, we collect multi-source SME data and build a new dataset, SMEsD, on which the experimental results demonstrate the superiority of the proposed method. The dataset is expected to become a significant benchmark for SME bankruptcy prediction and to further promote the development of financial risk research.
    Date: 2022–01
  12. By: Gao, Robert (Monash University)
    Abstract: We use a dictionary-based natural language processing approach to quantify the sentiment of RBA communications. This measure of sentiment is then used as a proxy for loss in the estimation of the RBA’s objective function. We find that RBA communications imply a target for average inflation between 2.4% and 2.7% for short-run horizons of up to one year ahead, consistent with the RBA’s medium-term inflation target band of 2-3%. This result is robust to different forms of communication, forecast horizons, and allowing for asymmetric preferences. We also find that the RBA’s loss improves with rising output growth, commodity prices and stock market returns, as well as an appreciating exchange rate and falling unemployment.
    Keywords: central bank, natural language processing, objective function, Reserve Bank of Australia, text analysis
    JEL: E58 E5
    Date: 2021
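A dictionary-based sentiment measure of the kind the abstract describes reduces to counting words from positive and negative lists and normalizing by document length. The tiny word lists below are illustrative stand-ins; the paper's actual dictionaries are not specified in the abstract.

```python
# Minimal dictionary-based net-tone measure: (positive - negative) / total.
# The word lists are toy stand-ins for full sentiment dictionaries.
POSITIVE = {"strong", "growth", "improved", "stable", "gains"}
NEGATIVE = {"weak", "decline", "uncertainty", "risks", "falling"}

def net_tone(text: str) -> float:
    """Net sentiment of a document, in [-1, 1]."""
    words = text.lower().split()
    pos = sum(w.strip(".,") in POSITIVE for w in words)
    neg = sum(w.strip(".,") in NEGATIVE for w in words)
    return (pos - neg) / max(len(words), 1)

statement = "Growth remains strong and gains are broad, despite some uncertainty."
tone = net_tone(statement)
```

A series of such scores over successive statements is the kind of sentiment proxy the paper then feeds into the estimation of the central bank's objective function.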
  13. By: Philippe Goulet Coulombe
    Abstract: Many problems plague the estimation of Phillips curves. Among them is the hurdle that the two key components, inflation expectations and the output gap, are both unobserved. Traditional remedies include creating reasonable proxies for the notable absentees or extracting them via some form of assumptions-heavy filtering procedure. I propose an alternative route: a Hemisphere Neural Network (HNN) whose peculiar architecture yields a final layer where components can be interpreted as latent states within a Neural Phillips Curve. There are benefits. First, HNN conducts the supervised estimation of nonlinearities that arise when translating a high-dimensional set of observed regressors into latent states. Second, computations are fast. Third, forecasts are economically interpretable. Fourth, inflation volatility can also be predicted by merely adding a hemisphere to the model. Among other findings, the contribution of real activity to inflation appears severely underestimated in traditional econometric specifications. Also, HNN captures out-of-sample the 2021 upswing in inflation and attributes it first to an abrupt and sizable disanchoring of the expectations component, followed by a wildly positive gap starting from late 2020. HNN's unique gap path comes from dispensing with unemployment and GDP in favor of an amalgam of nonlinearly processed alternative tightness indicators -- some of which are skyrocketing as of early 2022.
    Date: 2022–02
  14. By: Fraccaroli, Nicolò; Giovannini, Alessandro; Jamet, Jean-Francois; Persson, Eric
    Abstract: We investigate whether ideology drives the sentiments of parliamentarians when they speak to the central bank they hold accountable. To this end, we collect textual data on the quarterly hearings of the ECB President before the European Parliament from 1999 to 2019. We apply sentiment analysis to more than 1,900 speeches of individual Members of the European Parliament (MEPs) from 128 parties. We find robust evidence that MEPs’ sentiments toward the ECB are correlated with their ideological stance predominantly on a pro-/anti-European dimension rather than on a left-right dimension.
    Keywords: Central Bank Accountability, Central Bank Independence, Party Ideology, Sentiment Analysis
    JEL: E02 E52 E58
    Date: 2022–03
  15. By: Vicinanza, Paul; Goldberg, Amir (Stanford University); Srivastava, Sameer
    Abstract: Where do visionary ideas come from? Although the products of vision as manifested in technical innovation are readily observed, the ideas that eventually change the world are often obscured. Here we develop a novel method that uses deep learning to identify visionary ideas from the language used by individuals and groups. Quantifying vision this way unearths prescient ideas, individuals, and documents that prevailing methods would fail to detect. Applying our model to corpora spanning the disparate worlds of politics, law, and business, we demonstrate that it reliably detects vision in each domain. Moreover, counter to many prevailing intuitions, vision emanates from each domain’s periphery rather than its center. These findings suggest that vision may be as much a property of contexts as of individuals.
    Date: 2021–11–12
  16. By: Shenkman, Carey; Franklin, Sharon Bradford; Nojeim, Greg; Thakur, Dhanaraj
    Abstract: Typically, government agencies seeking access to the personal electronic data of Americans must comply with a legal process to obtain that data. That process can be mandated by the Constitution (the Fourth Amendment’s warrant and probable cause requirement) or by statute (such as the federal Electronic Communications Privacy Act, or various state laws). This report examines the concerning and rising practice of federal agencies sidestepping these legal requirements by obtaining data on Americans through commercial purchases from data brokers. Our research for this report involved interviewing experts on this issue and reviewing approximately 150 publicly available documents covering awards, solicitations, requests for proposals, and related information on contracts. We found significant evidence of agencies exploiting loopholes in existing law by purchasing data from private data brokers. The practice has prompted scrutiny from government watchdogs as well as members of Congress (Tau, 2021a; Wyden, 2021).
    Date: 2021–12–01

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at . For comments, please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject line; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.