nep-big New Economics Papers
on Big Data
Issue of 2024‒02‒19
seventeen papers chosen by
Tom Coupé, University of Canterbury


  1. Causal machine learning in public policy evaluation -- an application to the conditioning of cash transfers in Morocco By Patrick Rehill; Nicholas Biddle
  2. Text mining arXiv: a look through quantitative finance papers By Michele Leonardo Bianchi
  3. Comparing spatial and spatio-temporal paradigms to estimate the evolution of socio-economical indicators from satellite images By Robin Jarry; Marc Chaumont; Laure Berti-Équille; Gérard Subsol
  4. Sustainable digital marketing under big data: an AI random forest model approach By Jin, Keyan; Zhong, Ziqi; Zhao, Elena Yifei
  5. Model Averaging and Double Machine Learning By Ahrens, Achim; Hansen, Christian B.; Schaffer, Mark E; Wiemann, Thomas
  6. Remote work across jobs, companies and space By Bloom, Nicholas; Davis, Steven J.; Hansen, Stephen; Lambert, Peter John; Sadun, Raffaella; Taska, Bledi
  7. Using Supply Chain Network Information and High-frequency Mobility Data to Forecast Firm Dynamics (Japanese) By KATO Rui; MIYAKAWA Daisuke; YANAOKA Masaki; YUKIMOTO Shinji
  8. Business Model Contributions to Bank Profit Performance: A Machine Learning Approach By F. Bolivar; Miguel A. Duran; A. Lozano-Vivas
  9. Computing the Gerber-Shiu function with interest and a constant dividend barrier by physics-informed neural networks By Zan Yu; Lianzeng Zhang
  10. StockFormer: A Swing Trading Strategy Based on STL Decomposition and Self-Attention Networks By Bohan Ma; Yiheng Wang; Yuchao Lu; Tianzixuan Hu; Jinling Xu; Patrick Houlihan
  11. Deep Learning With DAGs By Sourabh Balgi; Adel Daoud; Jose M. Peña; Geoffrey T. Wodtke; Jesse Zhou
  12. Multimodal Gen-AI for Fundamental Investment Research By Lezhi Li; Ting-Yu Chang; Hai Wang
  13. New accessibility measures based on unconventional big data sources By G. Arbia; V. Nardelli; N. Salvini; I. Valentini
  14. Like, Comment, and Share: Analyzing Public Sentiments of Government Policies in Social Media By Albert, Jose Ramon G.; Siar, Sheila V.; Vizmanos, Jana Flor V.; Hernandez, Angelo C.; Sarmiento, Janina Luz C.
  15. Learning to be Homo Economicus: Can an LLM Learn Preferences from Choice By Jeongbin Kim; Matthew Kovach; Kyu-Min Lee; Euncheol Shin; Hector Tzavellas
  16. Impact of the central bank's communication on macro financial outcomes By Tetiana Yukhymenko; Oleh Sorochan
  17. Consumer-Driven Climate Mitigation: Exploring Barriers and Solutions in Studying Higher Mitigation Potential Behaviors By Lembregts, Christophe; Cadario, Romain

  1. By: Patrick Rehill; Nicholas Biddle
    Abstract: Causal machine learning methods can be used to search for treatment effect heterogeneity in high-dimensional datasets even where we lack a strong enough theoretical framework to select variables or make parametric assumptions about data. This paper uses causal machine learning methods to estimate heterogeneous treatment effects in the case of an experimental study carried out in Morocco which evaluated the effect of conditionalizing a cash transfer program on school attendance compared to a labelled cash transfer. We show that there is little heterogeneity in effects with the average treatment effect across three different conditioning policies all being negative. We then explore if there are any variables in the dataset of 1936 pre-treatment variables that are particularly strong predictors of heterogeneity to try to understand this effect. While there are some variables we expected to be important here based on our theoretical framework, most are atheoretical variables whose effects are difficult to interpret. Household spending variables and child time-use variables are particularly important; however, no variables have particularly large effects. The second purpose of this paper is to demonstrate and reflect upon a causal machine learning approach to policy evaluation. In this vein we suggest that findings that are difficult to interpret in this way are not surprising given the atheoretical methodology. We reflect that causal machine learning methods should not replace existing evaluation methodologies, but rather could be a useful tool for working with high-dimensional data and generating hypotheses.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.07075&r=big
  2. By: Michele Leonardo Bianchi
    Abstract: This paper explores articles hosted on the arXiv preprint server with the aim of uncovering valuable insights hidden in this vast collection of research. Employing text mining techniques and natural language processing methods, we examine the contents of quantitative finance papers posted on arXiv from 1997 to 2022. We extract and analyze crucial information from the entire documents, including the references, to understand topic trends over time and to identify the most cited researchers and journals in this domain. Additionally, we compare numerous algorithms to perform topic modeling, including state-of-the-art approaches.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.01751&r=big
  3. By: Robin Jarry (LIRMM | ICAR - Image & Interaction - LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier - CNRS - Centre National de la Recherche Scientifique - UM - Université de Montpellier); Marc Chaumont (LIRMM | ICAR - Image & Interaction - LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier - CNRS - Centre National de la Recherche Scientifique - UM - Université de Montpellier, UNIMES - Université de Nîmes); Laure Berti-Équille (UMR 228 Espace-Dev, Espace pour le développement - IRD - Institut de Recherche pour le Développement - UPVD - Université de Perpignan Via Domitia - AU - Avignon Université - UR - Université de La Réunion - UG - Université de Guyane - UA - Université des Antilles - UM - Université de Montpellier, AMU - Aix Marseille Université); Gérard Subsol (LIRMM | ICAR - Image & Interaction - LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier - CNRS - Centre National de la Recherche Scientifique - UM - Université de Montpellier)
    Abstract: In remote sensing, deep spatio-temporal models, i.e., deep learning models that estimate information based on Satellite Image Time Series obtain successful results in Land Use/Land Cover classification or change detection. Nevertheless, for socioeconomic applications such as poverty estimation, only deep spatial models have been proposed. In this paper, we propose a test-bed to compare spatial and spatio-temporal paradigms to estimate the evolution of Nighttime Light (NTL), a standard proxy for socioeconomic indicators. We applied the test-bed in the area of Zanzibar, Tanzania for 21 years. We observe that (1) both models obtain roughly equivalent performances when predicting the NTL value at a given time, but (2) the spatio-temporal model is significantly more efficient when predicting the NTL evolution.
    Keywords: Zanzibar, Tanzania, Deep learning, Time series analysis, Estimation, Predictive models, Satellite images, Standards, Remote sensing
    Date: 2023–07–16
    URL: http://d.repec.org/n?u=RePEc:hal:journl:hal-04268542&r=big
  4. By: Jin, Keyan; Zhong, Ziqi; Zhao, Elena Yifei
    Abstract: Digital marketing refers to the process of promoting, selling, and delivering products or services through online platforms and channels using the internet and electronic devices in a digital environment. Its aim is to attract and engage target audiences through various strategies and methods, driving brand promotion and sales growth. The primary objective of this scholarly study is to seamlessly integrate advanced big data analytics and artificial intelligence (AI) technology into the realm of digital marketing, thereby fostering the progression and optimization of sustainable digital marketing practices. First, the characteristics and applications of big data involving vast, diverse, and complex datasets are analyzed. Understanding their attributes and scope of application is essential. Subsequently, a comprehensive investigation into AI-driven learning mechanisms is conducted, culminating in the development of an AI random forest model (RFM) tailored for sustainable digital marketing. Subsequent to this, leveraging a real-world case study involving enterprise X, fundamental customer data is collected and subjected to meticulous analysis. The RFM model, ingeniously crafted in this study, is then deployed to prognosticate the anticipated count of prospective customers for said enterprise. The empirical findings spotlight a pronounced prevalence of university-affiliated individuals across diverse age cohorts. In terms of occupational distribution within the customer base, the categories of workers and educators emerge as dominant, constituting 41% and 31% of the demographic, respectively. Furthermore, the price distribution of patrons exhibits a skewed pattern, whereby the price bracket of 0–150 encompasses 17% of the population, whereas the range of 150–300 captures a notable 52%. These delineated price bands collectively constitute a substantial proportion, whereas the range exceeding 450 embodies a minority, accounting for less than 20%. Notably, the RFM model devised in this scholarly endeavor demonstrates a remarkable proficiency in accurately projecting forthcoming passenger volumes over a seven-day horizon, significantly surpassing the predictive capability of logistic regression. Evidently, the AI-driven RFM model proffered herein excels in the precise anticipation of target customer counts, thereby furnishing a pragmatic foundation for the intelligent evolution of sustainable digital marketing strategies.
    Keywords: artificial intelligence (AI); big data; random forest model (RFM); social media; sustainable digital marketing; technological innovation
    JEL: L81
    Date: 2024–01–01
    URL: http://d.repec.org/n?u=RePEc:ehl:lserod:121402&r=big
  5. By: Ahrens, Achim (ETH Zurich); Hansen, Christian B. (University of Chicago); Schaffer, Mark E (Heriot-Watt University, Edinburgh); Wiemann, Thomas (University of Chicago)
    Abstract: This paper discusses pairing double/debiased machine learning (DDML) with stacking, a model averaging method for combining multiple candidate learners, to estimate structural parameters. We introduce two new stacking approaches for DDML: short-stacking exploits the cross-fitting step of DDML to substantially reduce the computational burden and pooled stacking enforces common stacking weights over cross-fitting folds. Using calibrated simulation studies and two applications estimating gender gaps in citations and wages, we show that DDML with stacking is more robust to partially unknown functional forms than common alternative approaches based on single pre-selected learners. We provide Stata and R software implementing our proposals.
    Keywords: causal inference, partially linear model, high-dimensional models, super learners, nonparametric estimation
    JEL: C21 C26 C52 C55 J01 J08
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:iza:izadps:dp16714&r=big
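
    Illustration: the short-stacking idea in the abstract above can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' Stata or R software. The simulated data, the two candidate learners (a linear and a quadratic regression), and all function names are hypothetical, and the closed-form convex weight only covers the two-learner case. Each learner is cross-fitted once, and the stacking weights are then estimated a single time on the full set of out-of-fold predictions rather than separately inside every fold.

```python
import numpy as np

def fit_predict_ols(X_tr, y_tr, X_te):
    """Candidate learner 1 (assumed): linear regression with an intercept."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_te = np.column_stack([np.ones(len(X_te)), X_te])
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return A_te @ beta

def fit_predict_quadratic(X_tr, y_tr, X_te):
    """Candidate learner 2 (assumed): regression on levels and squares."""
    return fit_predict_ols(np.column_stack([X_tr, X_tr ** 2]), y_tr,
                           np.column_stack([X_te, X_te ** 2]))

LEARNERS = [fit_predict_ols, fit_predict_quadratic]

def cross_fitted_predictions(X, y, folds):
    """Out-of-fold predictions of y from every candidate learner."""
    Z = np.empty((len(y), len(LEARNERS)))
    for k in np.unique(folds):
        held_out = folds == k
        for j, learner in enumerate(LEARNERS):
            Z[held_out, j] = learner(X[~held_out], y[~held_out], X[held_out])
    return Z

def short_stack_weights(Z, y):
    """Non-negative weights summing to one, fitted once on all
    cross-fitted predictions (closed form for exactly two learners)."""
    diff = Z[:, 0] - Z[:, 1]
    denom = diff @ diff
    w = 0.5 if denom == 0 else float(np.clip(((y - Z[:, 1]) @ diff) / denom, 0.0, 1.0))
    return np.array([w, 1.0 - w])

def ddml_short_stacking(y, d, X, n_folds=5, seed=0):
    """Partially linear model y = theta * d + g(X) + e, estimated by
    regressing residualised y on residualised d."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds      # balanced random folds
    Zy = cross_fitted_predictions(X, y, folds)     # predictions of E[y | X]
    Zd = cross_fitted_predictions(X, d, folds)     # predictions of E[d | X]
    wy, wd = short_stack_weights(Zy, y), short_stack_weights(Zd, d)
    res_y, res_d = y - Zy @ wy, d - Zd @ wd
    theta = (res_d @ res_y) / (res_d @ res_d)
    return theta, wy, wd

# Simulated data (assumption): true theta is 1.0 and the nuisance functions are nonlinear.
rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 2))
d = 0.5 * X[:, 0] ** 2 + rng.normal(size=n)
y = 1.0 * d + X[:, 0] ** 2 + X[:, 1] + rng.normal(size=n)

theta_hat, wy, wd = ddml_short_stacking(y, d, X)
print("theta_hat:", theta_hat, "weights for y:", wy, "weights for d:", wd)
```

    In this toy setup the quadratic learner should receive most of the stacking weight because the simulated nuisance functions are quadratic. With more than two learners the weights would need a constrained least-squares solver instead of the closed form used here.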
  6. By: Bloom, Nicholas; Davis, Steven J.; Hansen, Stephen; Lambert, Peter John; Sadun, Raffaella; Taska, Bledi
    Abstract: The pandemic catalyzed an enduring shift to remote work. To measure and characterize this shift, we examine more than 250 million job vacancy postings across five English-speaking countries. Our measurements rely on a state-of-the-art language-processing framework that we fit, test, and refine using 30,000 human classifications. We achieve 99% accuracy in flagging job postings that advertise hybrid or fully remote work, greatly outperforming dictionary methods and also outperforming other machine-learning methods. From 2019 to early 2023, the share of postings that say new employees can work remotely one or more days per week rose more than three-fold in the U.S. and by a factor of five or more in Australia, Canada, New Zealand and the U.K. These developments are highly non-uniform across and within cities, industries, occupations, and companies. Even when zooming in on employers in the same industry competing for talent in the same occupations, we find large differences in the share of job postings that explicitly offer remote work.
    Keywords: Covid-19; hybrid working; employment
    JEL: C50 E24 M54 O33 R3
    Date: 2023–07–14
    URL: http://d.repec.org/n?u=RePEc:ehl:lserod:121302&r=big
  7. By: KATO Rui; MIYAKAWA Daisuke; YANAOKA Masaki; YUKIMOTO Shinji
    Abstract: The use of GPS location data has become increasingly common in recent years. In this paper, we use individual-level GPS location data to measure the size of factory-level populations and to forecast the leasing demand of the transaction partners of the companies for which the factory-level population is measured. First, we use GPS location data to measure changes in the population at the main factories of companies in the manufacturing industry. Second, using these measurements together with lease contract data, we construct a machine learning-based prediction model of leasing demand among the company’s suppliers. Except for the periods when corporate activities were greatly disturbed by the COVID-19 pandemic, the use of the GPS location data improves the predictive power for leasing demand.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:eti:rdpsjp:24005&r=big
  8. By: F. Bolivar; Miguel A. Duran; A. Lozano-Vivas
    Abstract: This paper analyzes the relation between bank profit performance and business models. Using a machine learning-based approach, we propose a methodological strategy in which balance sheet components' contributions to profitability are the identification instruments of business models. We apply this strategy to the European Union banking system from 1997 to 2021. Our main findings indicate that the standard retail-oriented business model is the profile that performs best in terms of profitability, whereas adopting a non-specialized business profile is a strategic decision that leads to poor profitability. Additionally, our findings suggest that the effect of high capital ratios on profitability depends on the business profile. The contributions of business models to profitability decreased during the Great Recession. Although the situation showed signs of improvement afterward, the European Union banking system's ability to yield returns is still problematic in the post-crisis period, even for the best-performing group.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.12334&r=big
  9. By: Zan Yu; Lianzeng Zhang
    Abstract: In this paper, we propose a new efficient method for calculating the Gerber-Shiu discounted penalty function. The Gerber-Shiu function generally satisfies a class of integro-differential equations. We introduce physics-informed neural networks (PINNs), which embed a differential equation into the loss of the neural network using automatic differentiation. In addition, PINNs allow more freedom in setting boundary conditions and do not rely on the determination of the initial value. This suggests a way to calculate more general Gerber-Shiu functions. Numerical examples are provided to illustrate the very good performance of our approximation.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.04378&r=big
  10. By: Bohan Ma; Yiheng Wang; Yuchao Lu; Tianzixuan Hu; Jinling Xu; Patrick Houlihan
    Abstract: Amidst ongoing market recalibration and increasing investor optimism, the U.S. stock market is experiencing a resurgence, prompting the need for sophisticated tools to protect and grow portfolios. Addressing this, we introduce "Stockformer," a cutting-edge deep learning framework optimized for swing trading, featuring the TopKDropout method for enhanced stock selection. By integrating STL decomposition and self-attention networks, Stockformer utilizes the S&P 500's complex data to refine stock return predictions. Our methodology entailed segmenting data for training and validation (January 2021 to January 2023) and testing (February to June 2023). During testing, Stockformer's predictions outperformed ten industry models, achieving superior precision in key predictive accuracy indicators (MAE, RMSE, MAPE), with a remarkable accuracy rate of 62.39% in detecting market trends. In our backtests, Stockformer's swing trading strategy yielded a cumulative return of 13.19% and an annualized return of 30.80%, significantly surpassing current state-of-the-art models. Stockformer has emerged as a beacon of innovation in these volatile times, offering investors a potent tool for market forecasting. To advance the field and foster community collaboration, we have open-sourced Stockformer, available at https://github.com/Eric991005/Stockformer.
    Date: 2023–11
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.06139&r=big
  11. By: Sourabh Balgi; Adel Daoud; Jose M. Peña; Geoffrey T. Wodtke; Jesse Zhou
    Abstract: Social science theories often postulate causal relationships among a set of variables or events. Although directed acyclic graphs (DAGs) are increasingly used to represent these theories, their full potential has not yet been realized in practice. As non-parametric causal models, DAGs require no assumptions about the functional form of the hypothesized relationships. Nevertheless, to simplify the task of empirical evaluation, researchers tend to invoke such assumptions anyway, even though they are typically arbitrary and do not reflect any theoretical content or prior knowledge. Moreover, functional form assumptions can engender bias, whenever they fail to accurately capture the complexity of the causal system under investigation. In this article, we introduce causal-graphical normalizing flows (cGNFs), a novel approach to causal inference that leverages deep neural networks to empirically evaluate theories represented as DAGs. Unlike conventional approaches, cGNFs model the full joint distribution of the data according to a DAG supplied by the analyst, without relying on stringent assumptions about functional form. In this way, the method allows for flexible, semi-parametric estimation of any causal estimand that can be identified from the DAG, including total effects, conditional effects, direct and indirect effects, and path-specific effects. We illustrate the method with a reanalysis of Blau and Duncan's (1967) model of status attainment and Zhou's (2019) model of conditional versus controlled mobility. To facilitate adoption, we provide open-source software together with a series of online tutorials for implementing cGNFs. The article concludes with a discussion of current limitations and directions for future development.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.06864&r=big
  12. By: Lezhi Li; Ting-Yu Chang; Hai Wang
    Abstract: This report outlines a transformative initiative in the financial investment industry, where the conventional decision-making process, laden with labor-intensive tasks such as sifting through voluminous documents, is being reimagined. Leveraging language models, our experiments aim to automate information summarization and investment idea generation. We seek to evaluate the effectiveness of fine-tuning methods on a base model (Llama2) to achieve specific application-level goals, including providing insights into the impact of events on companies and sectors, understanding market condition relationships, generating investor-aligned investment ideas, and formatting results with stock recommendations and detailed explanations. Through state-of-the-art generative modeling techniques, the ultimate objective is to develop an AI agent prototype, liberating human investors from repetitive tasks and allowing a focus on high-level strategic thinking. The project encompasses a diverse corpus dataset, including research reports, investment memos, market news, and extensive time-series market data. We conducted three experiments applying unsupervised and supervised LoRA fine-tuning on the llama2_7b_hf_chat as the base model, as well as instruction fine-tuning on the GPT3.5 model. Statistical and human evaluations both show that the fine-tuned versions perform better in solving text modeling, summarization, reasoning, and finance domain questions, demonstrating a pivotal step towards enhancing decision-making processes in the financial domain. Code implementation for the project can be found on GitHub: https://github.com/Firenze11/finance_lm.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.06164&r=big
  13. By: G. Arbia; V. Nardelli; N. Salvini; I. Valentini
    Abstract: In health econometric studies we are often interested in quantifying aspects related to the accessibility of medical infrastructures. The increasing availability of data automatically collected through unconventional sources (such as webscraping, crowdsourcing or the internet of things) has recently opened previously inconceivable opportunities for researchers interested in measuring accessibility and using it as a tool for real-time monitoring, surveillance and the definition of health policies. This paper contributes to this strand of literature by proposing new accessibility measures that can be continuously fed by automatic data collection. We present new measures of accessibility and we illustrate their use to study the territorial impact of supply-side shocks to health facilities. We also illustrate the potential of our proposal with a case study based on a huge set of data (related to the Emergency Departments in Milan, Italy) that were webscraped for the purpose of this paper every 5 minutes from November 2021 to March 2022, amounting to approximately 5 million observations.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.13370&r=big
  14. By: Albert, Jose Ramon G.; Siar, Sheila V.; Vizmanos, Jana Flor V.; Hernandez, Angelo C.; Sarmiento, Janina Luz C.
    Abstract: Social media has become an increasingly important tool for gauging public sentiment, offering real-time insights that can guide policy decisions. This study focuses on analyzing sentiments expressed on the Philippine Institute for Development Studies (PIDS) Facebook page, providing a window into public opinion on various development issues and governmental policies. By conducting opinion mining and sentiment analysis on comments from the top three viral Facebook posts of PIDS, which discuss education, the middle class, and social protection policies, the study reveals a range of public perspectives and highlights the challenges faced by the populace. Additionally, an online survey targeting PIDS' social media followers was conducted to understand their demographics and preferences in accessing development research. The findings demonstrate the effectiveness of social media analytics in capturing genuine public opinion, which can be instrumental in refining policies based on evidence. The study recommends enhancing analytics capabilities, systematically incorporating these insights while safeguarding data privacy, and continuously updating strategies to reflect changing public sentiments. This policy research study underscores the value of social media data in making governance more responsive and inclusive. Comments to this paper are welcome within 60 days from the date of posting. Email publications@pids.gov.ph.
    Keywords: public sentiments;opinion mining;social media
    Date: 2023
    URL: http://d.repec.org/n?u=RePEc:phd:dpaper:dp_2023-33&r=big
  15. By: Jeongbin Kim; Matthew Kovach; Kyu-Min Lee; Euncheol Shin; Hector Tzavellas
    Abstract: This paper explores the use of Large Language Models (LLMs) as decision aids, with a focus on their ability to learn preferences and provide personalized recommendations. To establish a baseline, we replicate standard economic experiments on choice under risk (Choi et al., 2007) with GPT, one of the most prominent LLMs, prompted to respond as (i) a human decision maker or (ii) a recommendation system for customers. With these baselines established, GPT is provided with a sample set of choices and prompted to make recommendations based on the provided data. From the data generated by GPT, we identify its (revealed) preferences and explore its ability to learn from data. Our analysis yields three results. First, GPT's choices are consistent with (expected) utility maximization theory. Second, GPT can align its recommendations with people's risk aversion, by recommending less risky portfolios to more risk-averse decision makers, highlighting GPT's potential as a personalized decision aid. Third, however, GPT demonstrates limited alignment when it comes to disappointment aversion.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.07345&r=big
  16. By: Tetiana Yukhymenko (National Bank of Ukraine); Oleh Sorochan (National Bank of Ukraine)
    Abstract: The study explores the impact of central bank communications on a range of macro-financial indicators. Specifically, we examine whether information posted on the National Bank of Ukraine (NBU) website influences foreign exchange (FX) markets and the inflation expectations of experts. Our main results suggest that the NBU's statements and press releases on monetary policy issues matter. For instance, we find that exchange rate movements and volatility are negatively correlated with the volumes of publications of the NBU on its official website. However, this effect is noticeably bigger for volatility than for exchange rate changes. The impact of communication on FX developments is the strongest a week after the news release, and it persists further. Furthermore, inflation expectations of financial experts, though indifferent to all NBU updates, turn out to be sensitive to monetary policy announcements. The latter reduce the level of expectations and interest rates.
    Keywords: central bank communications ; monetary policy ; FX market ; text analysis
    JEL: E58 E71 C55
    Date: 2024–02–05
    URL: http://d.repec.org/n?u=RePEc:gii:giihei:heidwp01-2024&r=big
  17. By: Lembregts, Christophe; Cadario, Romain
    Abstract: A systematic review of green consumer behaviors in five prominent consumer research journals revealed that behaviors with greater potential for climate mitigation (e.g., plant-based consumption) have not been broadly studied, indicating promising opportunities for future research. In an exploratory survey, we conceptually replicate this finding using a sample of consumer researchers with a general interest in studying higher-potential behaviors. We consider evidence for potential explanations, such as researchers’ primary focus on construct-to-construct mapping, a tendency to study behaviors that researchers have personal experience with or are easy to implement, a lack of incentives to study higher-potential behaviors, and insufficient knowledge of mitigation potential. To help shift consumer researchers’ focus on higher-potential behaviors, we offer concrete recommendations, such as proactively considering mitigation potential both as authors and reviewers, and utilizing phenomenon-to-construct mapping for enhancing theoretical contributions. In sum, we hope that this research will help interested consumer researchers to provide more relevant answers to the urgent challenge of climate change mitigation.
    Date: 2024–01–19
    URL: http://d.repec.org/n?u=RePEc:osf:osfxxx:ywus6&r=big

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.