nep-big 2022-11-21 papers

on Big Data

Issue of 2022–11–21
fourteen papers chosen by
Tom Coupé, University of Canterbury

Whatever it Takes to Understand a Central Banker – Embedding their Words Using Neural Networks. By Zahner, Johannes; Baumgärtner, Martin
To Be or Not to Be: The Entrepreneur in Neo-Schumpeterian Growth Theory By Henrekson, Magnus; Johansson, Dan; Karlsson, Johan
Using Deep Learning to Find the Next Unicorn: A Practical Synthesis By Lele Cao; Vilhelm von Ehrenheim; Sebastian Krakowski; Xiaoxue Li; Alexandra Lutz
Cyber-Risk Forecasting using Machine Learning Models and Generalized Extreme Value Distributions By Jules Sadefo Kamdem; Danielle Selambi
Guiding Social Protection Targeting Through Satellite Data in São Tomé and Príncipe By Fisker,Peter Simonsen; Gallego-Ayala,Jordi Jose; Malmgren Hansen,David; Pave Sohnesen,Thomas; Murrugarra,Edmundo
Tweeting for Money: Social Media and Mutual Fund Flows By Javier Gil-Bazo; Juan F. Imbet
Designing Universal Causal Deep Learning Models: The Case of Infinite-Dimensional Dynamical Systems from Stochastic Analysis By Luca Galimberti; Giulia Livieri; Anastasis Kratsios
Dissecting the explanatory power of ESG features on equity returns by sector, capitalization, and year with interpretable machine learning By Jeremi Assael; Laurent Carlier; Damien Challet
Factor Investing with a Deep Multi-Factor Model By Zikai Wei; Bo Dai; Dahua Lin
In Light of What They Know : How do Local Leaders Make Targeting Decisions? By Dervisevic,Ervin; Garz,Seth Aaron Levine; Mannava,Aneesh; Perova,Elizaveta
Social Media and Newsroom Production Decisions By Julia Cagé; Nicolas Hervé; Béatrice Mazoyer
Congestion Costs and Scheduling Preferences of Car Commuters in California: Estimates Using Big Data By Jinwon Kim; Jucheol Moon
Ireland: Financial Sector Assessment Program-Technical Note on Anti-Money Laundering/Combating the Financing of Terrorism By International Monetary Fund
Effect of typhoons on economic activities in Vietnam: Evidence using satellite imagery By Etienne Espagne; Yen Boi Ha; Kenneth Houngbedji; Thanh Ngo-Duc

Whatever it Takes to Understand a Central Banker – Embedding their Words Using Neural Networks.

By: Zahner, Johannes; Baumgärtner, Martin

JEL: C45 C53 E52 Z13

Date: 2022

URL: https://d.repec.org/n?u=RePEc:zbw:vfsc22:264019

To Be or Not to Be: The Entrepreneur in Neo-Schumpeterian Growth Theory

By:	Henrekson, Magnus (Research Institute of Industrial Economics); Johansson, Dan (Örebro University); Karlsson, Johan (Jönköping University, Sogang University)
Abstract:	Based on a review of 700+ peer-reviewed articles since 1990, identified using text mining methodology and supervised machine learning, we analyze how neo-Schumpeterian growth theorists relate to the entrepreneur-centered view of Schumpeter (1934) and the entrepreneurless framework of Schumpeter (1942). The literature leans heavily towards Schumpeter (1942); innovation returns are modeled as following an ex ante known probability distribution. By assuming that innovation outcomes are (probabilistically) deterministic, the entrepreneur becomes redundant. Abstracting from genuine uncertainty implies that central issues regarding the economic function of the entrepreneur are overlooked, such as the roles of proprietary resources, skills, and profits.
Keywords:	creative destruction, economic growth, entrepreneur, innovation, judgment, Knightian uncertainty
JEL:	B40 O10 O30
Date:	2022–09
URL:	https://d.repec.org/n?u=RePEc:iza:izadps:dp15605

Using Deep Learning to Find the Next Unicorn: A Practical Synthesis

By:	Lele Cao; Vilhelm von Ehrenheim; Sebastian Krakowski; Xiaoxue Li; Alexandra Lutz
Abstract:	Startups often represent newly established business models associated with disruptive innovation and high scalability. They are commonly regarded as powerful engines for economic and social development. Meanwhile, startups are heavily constrained by many factors such as limited financial funding and human resources. Therefore, the chance for a startup to eventually succeed is as rare as "spotting a unicorn in the wild". Venture Capital (VC) strives to identify and invest in unicorn startups during their early stages, hoping to gain a high return. To avoid entirely relying on human domain expertise and intuition, investors usually employ data-driven approaches to forecast the success probability of startups. Over the past two decades, the industry has gone through a paradigm shift moving from conventional statistical approaches towards becoming machine-learning (ML) based. Notably, the rapid growth of data volume and variety is quickly ushering in deep learning (DL), a subset of ML, as a potentially superior approach in terms of capacity and expressivity. In this work, we carry out a literature review and synthesis on DL-based approaches, covering the entire DL life cycle. The objective is a) to obtain a thorough and in-depth understanding of the methodologies for startup evaluation using DL, and b) to distil valuable and actionable learning for practitioners. To the best of our knowledge, our work is the first of this kind.
Date:	2022–10
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2210.14195

Cyber-Risk Forecasting using Machine Learning Models and Generalized Extreme Value Distributions

By:	Jules Sadefo Kamdem (MRE - Montpellier Recherche en Economie - UM - Université de Montpellier); Danielle Selambi (African Institute for Mathematical Sciences (AIMS-Cameroon))
Abstract:	In this paper, we estimate the cost of a data breach using the number of compromised records. The number of such records is predicted by means of a machine learning model, particularly the Random Forest. We further analyse the fat tail phenomena which capture the underlying dynamics in the number of affected records. The objective is to calculate the maximum loss in order to answer the question of the insurability of cyber risk. Our results show that the total number of affected records follows a Fréchet distribution, and we then estimate the Generalized Extreme Value (GEV) parameters to calculate the value at risk (VaR). This analysis is critical because it gives an idea of the maximum loss that can be generated by an enterprise data breach. These results are usable in anticipating the premiums for cyber risk coverage in the insurance markets.
Keywords:	Cyber insurance,Cyber risk,Machine Learning,Regression Trees,Random Forest,Generalized Extreme Value
Date:	2022–10–13
URL:	https://d.repec.org/n?u=RePEc:hal:wpaper:hal-03814979
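
As a rough illustration of the last step described in this abstract, the sketch below computes a value at risk as the p-quantile of a GEV distribution in the Fréchet case (shape parameter greater than zero), using the standard closed-form quantile. It is not the authors' code: the function names and the parameter values (mu, sigma, xi) are hypothetical placeholders chosen for illustration, not estimates from the paper.

```python
import math


def gev_cdf(x, mu, sigma, xi):
    """CDF of a GEV distribution with location mu, scale sigma, shape xi > 0."""
    if sigma <= 0.0 or xi <= 0.0:
        raise ValueError("this sketch covers only sigma > 0 and xi > 0 (Frechet case)")
    t = 1.0 + xi * (x - mu) / sigma
    if t <= 0.0:
        # x lies below the lower end of the support when xi > 0.
        return 0.0
    return math.exp(-t ** (-1.0 / xi))


def gev_var(p, mu, sigma, xi):
    """Value at risk at level p, i.e. the p-quantile of the GEV (xi > 0)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if sigma <= 0.0 or xi <= 0.0:
        raise ValueError("this sketch covers only sigma > 0 and xi > 0 (Frechet case)")
    # Invert F(x) = exp(-(1 + xi * (x - mu) / sigma) ** (-1 / xi)) for x.
    return mu + (sigma / xi) * ((-math.log(p)) ** (-xi) - 1.0)


# Hypothetical parameters, for illustration only (not taken from the paper).
MU, SIGMA, XI = 1000.0, 500.0, 0.5

var_99 = gev_var(0.99, MU, SIGMA, XI)
var_999 = gev_var(0.999, MU, SIGMA, XI)
print(f"VaR at 99%:   {var_99:.1f} records")
print(f"VaR at 99.9%: {var_999:.1f} records")
```

With a positive shape parameter the tail is heavy, so the quantile grows quickly as p approaches 1; this is the property that bears on the insurability question raised in the abstract.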

Guiding Social Protection Targeting Through Satellite Data in São Tomé and Príncipe

By:	Fisker,Peter Simonsen; Gallego-Ayala,Jordi Jose; Malmgren Hansen,David; Pave Sohnesen,Thomas; Murrugarra,Edmundo
Abstract:	Social safety net programs focus on a subset of the population, usually the poorest and most vulnerable. However, in most developing countries there is no administrative data on relative wealth of the population to support the selection process of the potential beneficiaries of the social safety net programs. Hence, selection into programs often follows a multi-method approach and starts with geographical targeting for the selection of program implementation areas. To facilitate this stage of the targeting process in São Tomé and Príncipe, this working paper develops High Resolution Satellite Imagery (HRSI) poverty maps, providing both estimates of poverty incidence and program eligibility at a highly detailed resolution (110 m x 110 m). Furthermore, the analysis combines poverty incidence and population density to enable the geographical targeting process. This working paper shows that HRSI poverty maps can be used as key operational tools to facilitate the decision-making process of the geographical targeting and efficiently identify entry points for rapidly expanding social safety net programs. Unlike HRSI poverty maps based on census data, poverty maps based on satellite data and machine learning can be updated frequently at low cost, supporting more adaptive social protection programs.
Date:	2022–09–30
URL:	https://d.repec.org/n?u=RePEc:wbk:hdnspu:177340

Tweeting for Money: Social Media and Mutual Fund Flows

By:	Javier Gil-Bazo; Juan F. Imbet
Abstract:	We investigate whether asset management firms use social media to persuade investors. Combining a database of almost 1.6 million Twitter posts by U.S. mutual fund families with textual analysis, we find that flows of money to mutual funds respond positively to tweets with a positive tone. Consistent with the persuasion hypothesis, positive tweets work best when they convey advice or views on the market and when investor sentiment is higher. Using a high-frequency approach, we are able to identify a short-lived impact of families' tweets on ETF share prices. Finally, we reject the alternative hypothesis that asset management companies use social media to alleviate information asymmetries by either lowering search costs or disclosing privately observed information.
Keywords:	social media, Twitter, persuasion, mutual funds, mutual fund, flows, machine learning, textual analysis
JEL:	G11 G23 D83
Date:	2022–10
URL:	https://d.repec.org/n?u=RePEc:bge:wpaper:1366

Designing Universal Causal Deep Learning Models: The Case of Infinite-Dimensional Dynamical Systems from Stochastic Analysis

By:	Luca Galimberti; Giulia Livieri; Anastasis Kratsios
Abstract:	Deep learning (DL) is becoming indispensable to contemporary stochastic analysis and finance; nevertheless, it is still unclear how to design a principled DL framework for approximating infinite-dimensional causal operators. This paper proposes a "geometry-aware" solution to this open problem by introducing a DL model-design framework that takes suitable infinite-dimensional linear metric spaces as inputs and returns universal sequential DL models adapted to these linear geometries: we call these models Causal Neural Operators (CNO). Our main result states that the models produced by our framework can uniformly approximate on compact sets and across arbitrarily finite-time horizons Hölder or smooth trace class operators which causally map sequences between given linear metric spaces. Consequently, we deduce that a single CNO can efficiently approximate the solution operator to a broad range of SDEs, thus allowing us to simultaneously approximate predictions from families of SDE models, which is vital to computational robust finance. We deduce that the CNO can approximate the solution operator to most stochastic filtering problems, implying that a single CNO can simultaneously filter a family of partially observed stochastic volatility models.
Date:	2022–10
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2210.13300

Dissecting the explanatory power of ESG features on equity returns by sector, capitalization, and year with interpretable machine learning

By:	Jeremi Assael (BNPP CIB GM Lab - BNP Paribas CIB Global Markets Data & AI Lab, MICS - Mathématiques et Informatique pour la Complexité et les Systèmes - CentraleSupélec - Université Paris-Saclay); Laurent Carlier (BNPP CIB GM Lab - BNP Paribas CIB Global Markets Data & AI Lab); Damien Challet (MICS - Mathématiques et Informatique pour la Complexité et les Systèmes - CentraleSupélec - Université Paris-Saclay)
Abstract:	We systematically investigate the links between price returns and ESG features in the European market. We propose a cross-validation scheme with random company-wise validation to mitigate the relative initial lack of quantity and quality of ESG data, which allows us to use most of the latest and best data to both train and validate our models. Boosted trees successfully explain a part of annual price returns not accounted for by the market factor. We check with benchmark features that ESG features do contain significantly more information than basic fundamental features alone. The most relevant sub-ESG feature encodes controversies. Finally, we find opposite effects of better ESG scores on the price returns of small and large capitalization companies: better ESG scores are generally associated with larger price returns for the latter, and the reverse for the former.
Keywords:	ESG features,sustainable investing,interpretable machine learning,model selection,asset management,equity returns
Date:	2022–09–29
URL:	https://d.repec.org/n?u=RePEc:hal:wpaper:hal-03791538

Factor Investing with a Deep Multi-Factor Model

By:	Zikai Wei; Bo Dai; Dahua Lin
Abstract:	Modeling and characterizing multiple factors is perhaps the most important step in achieving excess returns over market benchmarks. Both academia and industry are striving to find new factors that have good explanatory power for future stock returns and good stability of their predictive power. In practice, factor investing is still largely based on linear multi-factor models, although many deep learning methods show promising results compared to traditional methods in stock trend prediction and portfolio risk management. However, the existing non-linear methods have two drawbacks: 1) there is a lack of interpretation of the newly discovered factors, 2) the financial insights behind the mining process are unclear, making practitioners reluctant to apply the existing methods to factor investing. To address these two shortcomings, we develop a novel deep multi-factor model that adopts industry neutralization and market neutralization modules with clear financial insights, which help us easily build a dynamic and multi-relational stock graph in a hierarchical structure to learn the graph representation of stock relationships at different levels, e.g., industry level and universal level. Subsequently, graph attention modules are adopted to estimate a series of deep factors that maximize the cumulative factor returns. And a factor-attention module is developed to approximately compose the estimated deep factors from the input factors, as a way to interpret the deep factors explicitly. Extensive experiments on real-world stock market data demonstrate the effectiveness of our deep multi-factor model in the task of factor investing.
Date:	2022–10
URL:	https://d.repec.org/n?u=RePEc:arx:papers:2210.12462

In Light of What They Know : How do Local Leaders Make Targeting Decisions?

By:	Dervisevic,Ervin; Garz,Seth Aaron Levine; Mannava,Aneesh; Perova,Elizaveta
Abstract:	This paper analyzes how local leaders make targeting decisions in the context of a public workfare program in the Lao People's Democratic Republic. The study finds that village heads are progressive in their targeting, prioritizing the poorer households in their villages. The study benchmarks this decentralized selection to the common alternative proxy means test method and finds that village heads are at least as progressive as a proxy means test method approach. To illuminate what poverty-related information village heads could plausibly be incorporating into their internal selection decisions, the study designs and administers a set of exercises for village heads to rank villagers on land ownership, access to nutrition, and experience with recent shocks -- indicators that are likely to differ in their observability to village heads and could plausibly be associated with need for public support. The study finds that village heads' perceptions, as revealed through the ranking exercise, differ substantially from actual levels reported in surveys of the villagers themselves. The study then uses a data-driven machine learning approach to identify the predictors of village head selection. It concludes that village heads rely on a combination of easily observable household characteristics, forming a holistic impression of household welfare, rather than specific indicators like actual land ownership, nutrition, or economic shocks.
Keywords:	Inequality,Nutrition,Urban Governance and Management,Municipal Management and Reform,Urban Housing,Urban Housing and Land Settlements,Services&Transfers to Poor,Economic Assistance,Access of Poor to Social Services,Disability,Roads&Highways
Date:	2020–11–03
URL:	https://d.repec.org/n?u=RePEc:wbk:wbrwps:9465

Social Media and Newsroom Production Decisions

By:	Julia Cagé (ECON - Département d'économie (Sciences Po) - Sciences Po - Sciences Po - CNRS - Centre National de la Recherche Scientifique, CEPR - Center for Economic Policy Research - CEPR); Nicolas Hervé (INA - Institut National de l'Audiovisuel); Béatrice Mazoyer (Médialab - Médialab (Sciences Po) - Sciences Po - Sciences Po)
Abstract:	Social media affects not only the way we consume news, but also the way news is produced, including by traditional media outlets. In this paper, we study the propagation of information from social media to mainstream media, and investigate whether news editors' editorial decisions are influenced by the popularity of news stories on social media. To do so, we build a novel dataset including a representative sample of all the tweets produced in French between August 1st 2018 and July 31st 2019 (1.8 billion tweets, around 70% of all tweets in French) and the content published online by 200 mainstream media outlets. We then develop novel algorithms to identify and link events on social and mainstream media. To isolate the causal impact of popularity, we rely on the structure of the Twitter network and propose a new instrument based on the interaction between measures of user centrality and "social media news pressure" at the time of the event. We show that story popularity has a positive effect on media coverage, and that this effect varies depending on the media outlets' characteristics, in particular on whether they use a paywall. Finally, we investigate consumers' reaction to a surge in social media popularity. Our findings shed new light on our understanding of how editors decide on the coverage for stories, and question the welfare effects of social media.
Keywords:	Internet,Information spreading,News editors,Network analysis,Social media,Twitter,Text analysis
Date:	2022–05–31
URL:	https://d.repec.org/n?u=RePEc:hal:wpaper:hal-03811318

Congestion Costs and Scheduling Preferences of Car Commuters in California: Estimates Using Big Data

By:	Jinwon Kim (Department of Economics, Sogang University, Seoul, Korea); Jucheol Moon (Department of Computer Engineering & Computer Science, California State University, Long Beach)
Abstract:	This paper aims to quantify congestion costs and estimate the scheduling utility function for commuters. To do so, we construct California commuters' travel-time profiles, namely, the menu of travel times that each individual will likely face according to alternate trip timing choices. On average, California commuters waste about 5 minutes per morning commute due to congestion. Commuters facing a higher congestion level at the peak hour tend to avoid congestion delays by arriving at an inconvenient edge time. We also discover that for the majority of the commuters in our data, travel-time profiles are much flatter than our estimated schedule utility. From this finding, we question the accuracy of the existing bottleneck models in quantifying the economic costs of congestion and the optimal toll to ameliorate congestion.
Keywords:	congestion costs; scheduling preference; commuting; Google Maps; big data; machine learning
JEL:	R41 R48 C8 C25 H21
Date:	2022
URL:	https://d.repec.org/n?u=RePEc:sgo:wpaper:2201

Ireland: Financial Sector Assessment Program-Technical Note on Anti-Money Laundering/Combating the Financing of Terrorism

By:	International Monetary Fund
Abstract:	While domestic money laundering (ML) threats are well understood by the authorities, Ireland faces significant and increasing threats from foreign criminal proceeds. As a growing international financial center, Ireland is exposed to inherent transnational money laundering and terrorist financing (ML/TF) related risks. The ML risks facing Ireland include illicit proceeds from foreign crimes (e.g., corruption, tax crimes). Retail and international banks, trust and company service providers (TCSPs), lawyers, and accountants are medium to high-risk for ML, while virtual asset service providers (VASPs) pose emerging risks. Brexit, the recent move of international banks to Dublin, and the COVID-19 pandemic increased the money laundering risks faced by Ireland. The Central Bank of Ireland (Central Bank) nevertheless has demonstrated a deep and robust experience in assessing and understanding their domestic ML/TF risks; however, an increased focus on risks related to transnational illicit financial flows is required. A thematic risk assessment undertaken by the Anti-Money Laundering Steering Committee (AMLSC) of international ML/TF risks would enhance the authorities’ risk understanding and is key to effective response to the rapid financial sector growth. Introducing data analytics tools, including machine learning to leverage potentially available big data on cross-border payments, would allow for efficient detection of emerging risks. The results of this assessment should be published to improve the understanding of transnational ML/TF risks and feed into the anti-money laundering and combating the financing of terrorism (AML/CFT) policy priorities going forward.
Date:	2022–10–21
URL:	https://d.repec.org/n?u=RePEc:imf:imfscr:2022/324

Effect of typhoons on economic activities in Vietnam: Evidence using satellite imagery

By:	Etienne Espagne (World Bank); Yen Boi Ha (ETH Zurich); Kenneth Houngbedji (LEDa-DIAL (IRD, CNRS, Universite Paris-Dauphine, Universite PSL)); Thanh Ngo-Duc (University of Science and Technology of Hanoi)
Abstract:	This paper investigates the effect of typhoons on economic activities in Vietnam. During the period covered by our analysis, 1992-2013, we observed 63 typhoons affecting different locations of the country in different years with varying intensity. Using measures of the intensity of nightlight from satellite imagery as a proxy for the level of economic activity, we study how the nighttime light brightness varies across locations that were variably affected by the tropical cyclones. The results suggest that typhoons have on average dimmed nighttime luminosity of the places hit by 5 ± 5.8 % or 8 ± 7.8 % depending on the specifications we made.
Keywords:	Natural disasters, economic growth, Vietnam
JEL:	Q54 Q52 C21 O53
Date:	2022–07
URL:	https://d.repec.org/n?u=RePEc:dia:wpaper:dt202206

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.
