nep-big New Economics Papers
on Big Data
Issue of 2024‒08‒19
seventeen papers chosen by
Tom Coupé, University of Canterbury


  1. Exploring Sectoral Profitability in the Indian Stock Market Using Deep Learning By Jaydip Sen; Hetvi Waghela; Sneha Rakshit
  2. A Review of New Developments in Finance with Deep Learning: Deep Hedging and Deep Calibration By Yuji Shinozaki
  3. Unveiling Patterns in European Airbnb Prices: A Comprehensive Analytical Study Using Machine Learning Techniques By Trinath Sai Subhash Reddy Pittala; Uma Maheswara R Meleti; Hemanth Vasireddy
  4. A Data-Driven Approach to Manage High-Occupancy Toll Lanes in California By Zhang, Michael PhD; Gao, Hang PhD; Chen, Di; Qi, Yanlin
  5. Nonparametric determinants of market Liquidity By João A. Bastos; Fernando Cascão
  6. Artificial Intelligence Driven Trend Forecasting: Integrating BERT Topic Modelling and Generative Artificial Intelligence for Semantic Insights By Kumar, Deepak; Weissenberger-Eibl, Marion
  7. Portfolio management with big data By Francisco Peñaranda; Enrique Sentana
  8. Artificial intelligence and central bank digital currency By Ozili, Peterson K
  9. Cattle Prices Under Arid Conditions: Hedonic and Neural Network Approach By Calil, Yuri Clements Daglia
  10. Macroeconomic Forecasting with Large Language Models By Andrea Carriero; Davide Pettenuzzo; Shubhranshu Shekhar
  11. Recovering Overlooked Information in Categorical Variables with LLMs: An Application to Labor Market Mismatch By Yi Chen; Hanming Fang; Yi Zhao; Zibo Zhao
  12. Objectifying the Measurement of Voter Ideology with Expert Data By Patrick Mellacher; Gernot Lechner
  13. Model Estimation using Categorical Satellite Data with Misclassification By Wardle, Arthur R.; Bruno, Ellen
  14. Expanding the Frontier of Economic Statistics Using Big Data: A Case Study of Regional Employment By Abe Dunn; Eric English; Kyle Hood; Lowell Mason; Brian Quistorff
  15. Dynamic Relationship between Information Dissemination by Local Governors and Mobility during the COVID-19 Pandemic By Yasuhiro Hara
  16. Humans vs GPTs: Bias and validity in hiring decisions By Lippens, Louis
  17. Returns to Data: Evidence from Web Tracking By Hannes Ullrich; Jonas Hannane; Christian Peukert; Luis Aguiar; Tomaso Duso

  1. By: Jaydip Sen; Hetvi Waghela; Sneha Rakshit
    Abstract: This paper explores using a deep learning Long Short-Term Memory (LSTM) model for accurate stock price prediction and its implications for portfolio design. Despite the efficient market hypothesis suggesting that predicting stock prices is impossible, recent research has shown the potential of advanced algorithms and predictive models. The study builds upon existing literature on stock price prediction methods, emphasizing the shift toward machine learning and deep learning approaches. Using historical stock prices of 180 stocks across 18 sectors listed on the NSE, India, the LSTM model predicts future prices. These predictions guide buy/sell decisions for each stock and support an analysis of sector profitability. The study's main contributions are threefold: introducing an optimized LSTM model for robust portfolio design, utilizing LSTM predictions for buy/sell transactions, and providing insights into sector profitability and volatility. Results demonstrate the efficacy of the LSTM model in accurately predicting stock prices and informing investment decisions. By comparing sector profitability and prediction accuracy, the work provides valuable insights into the dynamics of the current financial markets in India.
    Date: 2024–05
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2407.01572
  2. By: Yuji Shinozaki (Deputy Director, Institute for Monetary and Economic Studies, Bank of Japan (currently, Associate Professor, Musashino University, E-mail:y-shino@musashino-u.ac.jp))
    Abstract: The application of machine learning to the field of finance has recently become the subject of active discussions. In particular, deep learning is expected to significantly advance the techniques of hedging and calibration. As these two techniques play a central role in financial engineering and mathematical finance, the application to them attracts the attention of both practitioners and researchers. Deep hedging, which applies deep learning to hedging, is expected to make it possible to analyze how factors such as transaction costs affect hedging strategies. Since the impact of these factors was difficult to assess quantitatively due to the computational costs, deep hedging opens possibilities not only for refining and automating hedging operations of derivatives but also for broader applications in risk management. Deep calibration, which applies deep learning to calibration, is expected to make the parameter optimization calculation, which is an essential procedure in derivative pricing and risk management, faster and more stable. This paper provides an overview of the existing literature and suggests future research directions from both practical and academic perspectives. Specifically, the paper shows the implications of deep learning to existing theoretical frameworks and practical motivations in finance and identifies potential future developments that deep learning can bring about and the practical challenges.
    Keywords: Financial engineering, Mathematical finance, Derivatives, Hedging, Calibration, Numerical optimization
    JEL: C63 G12 G13
    Date: 2024–04
    URL: https://d.repec.org/n?u=RePEc:ime:imedps:24-e-02
  3. By: Trinath Sai Subhash Reddy Pittala; Uma Maheswara R Meleti; Hemanth Vasireddy
    Abstract: In the burgeoning market of short-term rentals, understanding pricing dynamics is crucial for a range of stakeholders. This study delves into the factors influencing Airbnb pricing in major European cities, employing a comprehensive dataset sourced from Kaggle. We utilize advanced regression techniques, including linear, polynomial, and random forest models, to analyze a diverse array of determinants, such as location characteristics, property types, and host-related factors. Our findings reveal nuanced insights into the variables most significantly impacting pricing, highlighting the varying roles of geographical, structural, and host-specific attributes. This research not only sheds light on the complex pricing landscape of Airbnb accommodations in Europe but also offers valuable implications for hosts seeking to optimize pricing strategies and for travelers aiming to understand pricing trends. Furthermore, the study contributes to the broader discourse on pricing mechanisms in the shared economy, suggesting avenues for future research in this rapidly evolving sector.
    Date: 2024–04
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2407.01555
  4. By: Zhang, Michael PhD; Gao, Hang PhD; Chen, Di; Qi, Yanlin
    Abstract: Managing traffic flow in high-occupancy toll (HOT) lanes is a tough balancing act and current tolling schemes often lead to either under- or over-utilization of HOT lane capacity. The inherent linear/nonlinear relationship between flow and tolls in HOT lanes suggests that recent advances in machine learning and the use of a data-driven model may help set toll rates for optimal flow and lane use. In this research project, a data-driven model was developed, using long short-term memory (LSTM) neural networks to capture the underlying flow-toll pattern on both HOT and general-purpose lanes. Then, a dynamic control strategy using a linear quadratic regulator (LQR) feedback controller was implemented to fully utilize the HOT lane capacity while maintaining congestion-free conditions. A case study of the I-580 freeway in Alameda County, California was carried out. The control system was evaluated in terms of vehicle hours traveled and person hours traveled for solo drivers and carpoolers. Results show that the tolling strategy helps to mitigate congestion in HOT and general-purpose lanes, benefiting every traveler on I-580.
    Keywords: Engineering, High occupancy toll lanes, traffic flow, traffic models, highway traffic control systems
    Date: 2024–06–01
    URL: https://d.repec.org/n?u=RePEc:cdl:itsdav:qt71d0h6hz
  5. By: João A. Bastos; Fernando Cascão
    Abstract: We examine the factors influencing equity market liquidity through explainable machine learning techniques. Unlike previous studies, our approach is entirely nonparametric. By studying daily placement orders for equity securities managed by a European asset management institution, we uncover multiple nonlinear relationships between market liquidity and placement characteristics typically not captured by a traditional parametric model. As expected, the results show that liquidity tends to increase in highly active markets. However, we also note that liquidity remains relatively stable within certain trading volume ranges. Price volatility, broker efficiency, and the market impact of the trade are important predictors of liquidity. Price volatility shows a linear relationship with bid-ask spreads, whereas broker efficiency and market impact have non-symmetric convex effects. Large bid-ask spreads are linked to increased uncertainty and weak economic activity.
    Keywords: Market liquidity; Equity markets; Bid-ask spreads; Nonparametric models; Machine learning; Explainable AI.
    Date: 2024–07
    URL: https://d.repec.org/n?u=RePEc:ise:remwps:wp03322024
  6. By: Kumar, Deepak; Weissenberger-Eibl, Marion
    Abstract: In the fast-paced realm of technological evolution, accurately forecasting emerging trends is critical for both academic inquiry and industry application. Traditional trend analysis methodologies, while valuable, struggle to efficiently process and interpret the vast datasets of today's information age. This paper introduces a novel approach that synergizes Generative AI and Bidirectional Encoder Representations from Transformers (BERT) for semantic insights and trend forecasting, leveraging the power of Retrieval-Augmented Generation (RAG) and the analytical prowess of BERT topic modeling. By automating the analysis of extensive datasets from publications and patents, the presented methodology not only expedites the discovery of emergent trends but also enhances the precision of these findings by generating a short summary for found emergent trends. For validation, three technologies - reinforcement learning, quantum machine learning, and cryptocurrencies - were analysed prior to their first appearance in the Gartner Hype Cycle. Research highlights the integration of advanced AI techniques in trend forecasting, providing a scalable and accurate tool for strategic planning and innovation management. Results demonstrated a significant correlation between the model's predictions and the technologies' appearances in the Hype Cycle, underscoring the potential of this methodology in anticipating technological shifts across various sectors.
    Keywords: BERT, Topic modelling, RAG, Gartner Hype Cycle, LLM, BERTopic
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:zbw:esconf:300545
  7. By: Francisco Peñaranda (Queens College CUNY); Enrique Sentana (CEMFI, Centro de Estudios Monetarios y Financieros)
    Abstract: The purpose of this survey is to summarize the academic literature that studies some of the ways in which portfolio management has been affected in recent years by the availability of big datasets: many assets, many characteristics for each of them, many macro predictors, and various sources of unstructured data. Thus, we deliberately focus on applications rather than methods. We also include brief reviews of the financial theories underlying asset management, which provide the relevant background to assess the plethora of recent contributions to such an active research field.
    Keywords: Conditioning information, intertemporal portfolio decisions, machine learning, mean-variance analysis, stochastic discount factors.
    JEL: G11 G12 C55 G17
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:cmf:wpaper:wp2024_2411
  8. By: Ozili, Peterson K
    Abstract: The purpose of this article is to explore the role of artificial intelligence, or AI, in a central bank digital currency project and its challenges. Artificial intelligence is transforming the digital finance landscape. Central bank digital currency is also transforming the nature of central bank money. This study also suggests some considerations which central banks should be aware of when deploying artificial intelligence in their central bank digital currency project. The study concludes by acknowledging that artificial intelligence will continue to evolve, and its role in developing a sustainable CBDC will expand. While AI will be useful in many CBDC projects, ethical concerns will emerge about the use of AI in a CBDC project. When such concerns arise, central banks should be prepared to have open discussions about how they are using, or intend to use, AI in their CBDC projects.
    Keywords: artificial intelligence, central bank digital currency, CBDC, machine learning, deep learning, cryptocurrency, CBDC project, CBDC pilot, blockchain
    JEL: E50 E51 E52 E58 O31
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:pra:mprapa:121567
  9. By: Calil, Yuri Clements Daglia
    Keywords: Livestock Production/Industries, Marketing, Agricultural Finance
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ags:aaea22:343928
  10. By: Andrea Carriero; Davide Pettenuzzo; Shubhranshu Shekhar
    Abstract: This paper presents a comparative analysis evaluating the accuracy of Large Language Models (LLMs) against traditional macro time series forecasting approaches. In recent times, LLMs have surged in popularity for forecasting due to their ability to capture intricate patterns in data and quickly adapt across very different domains. However, their effectiveness in forecasting macroeconomic time series data compared to conventional methods remains an area of interest. To address this, we conduct a rigorous evaluation of LLMs against traditional macro forecasting methods, using as common ground the FRED-MD database. Our findings provide valuable insights into the strengths and limitations of LLMs in forecasting macroeconomic time series, shedding light on their applicability in real-world scenarios.
    Date: 2024–06
    URL: https://d.repec.org/n?u=RePEc:arx:papers:2407.00890
  11. By: Yi Chen (ShanghaiTech University); Hanming Fang (University of Pennsylvania); Yi Zhao (Tsinghua University); Zibo Zhao (ShanghaiTech University)
    Abstract: Categorical variables have no intrinsic ordering, and researchers often adopt a fixed-effect (FE) approach in empirical analysis. However, this approach has two significant limitations: it overlooks textual information associated with the categorical variables; and it produces unstable results when there are only limited observations in a category. In this paper, we propose a novel method that utilizes recent advances in large language models (LLMs) to recover overlooked information in categorical variables. We apply this method to investigate labor market mismatch. Specifically, we task LLMs with simulating the role of a human resources specialist to assess the suitability of an applicant with specific characteristics for a given job. Our main findings can be summarized in three parts. First, using comprehensive administrative data from an online job posting platform, we show that our new match quality measure is positively correlated with several traditional measures in the literature, and we highlight the LLM’s capability to provide additional information beyond that contained in the traditional measures. Second, we demonstrate the broad applicability of the new method with survey data containing significantly less information than the administrative data, which makes it impossible to compute most of the traditional match quality measures. Our LLM measure successfully replicates most of the salient patterns observed in a hard-to-access administrative dataset using easily accessible survey data. Third, we investigate the gender gap in match quality and explore whether gender stereotypes exist in the hiring process. We simulate an audit study, examining whether revealing gender information to LLMs influences their assessment. We show that when gender information is disclosed to the LLMs, the model deems females better suited for traditionally female-dominated roles.
    Keywords: Large Language Models, Categorical Variables, Labor Market Mismatch
    JEL: C55 J16 J24 J31
    Date: 2024–07–23
    URL: https://d.repec.org/n?u=RePEc:pen:papers:24-017
  12. By: Patrick Mellacher (University of Graz, Austria); Gernot Lechner (University of Graz, Austria)
    Abstract: Many surveys require respondents to place themselves on a left-right ideology scale. However, non-experts may not understand the scale or their “objective” position. Furthermore, a uni-dimensional approach may not suffice to describe ideology coherently. We thus develop a novel way to measure voter ideology: Combining expert and voter survey data, we use classification models to infer how experts would place voters based on their policy stances on three axes: general left-right, economic left-right and libertarian-authoritarian. We validate our approach by finding i) a strong connection between policies and ideology using data-driven approaches, ii) a strong predictive power of our models in cross-validation exercises, and iii) that “objective” ideology as predicted by our models significantly explains the vote choice in simple spatial voting models even after accounting for the subjective ideological distance between voters and parties as perceived by the voters. Our results shed new light on debates around mass polarization.
    Keywords: machine learning, random forest, voter ideology, political economy, spatial voting.
    JEL: C38 D70 D72
    Date: 2024–01
    URL: https://d.repec.org/n?u=RePEc:grz:wpaper:2024-03
  13. By: Wardle, Arthur R.; Bruno, Ellen
    Keywords: Research Methods/Statistical Methods, Crop Production/Industries
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:ags:aaea22:343766
  14. By: Abe Dunn; Eric English; Kyle Hood; Lowell Mason; Brian Quistorff
    Abstract: Big data offers potentially enormous benefits for improving economic measurement, but it also presents challenges (e.g., lack of representativeness and instability), implying that its value is not always clear. We propose a framework for quantifying the usefulness of these data sources for specific applications, relative to existing official sources. We specifically weigh the potential benefits of additional granularity and timeliness, while examining the accuracy associated with any new or improved estimates, relative to comparable accuracy produced in existing official statistics. We apply the methodology to employment estimates using data from a payroll processor, considering both the improvement of existing state-level estimates and the production of new, more timely, county-level estimates. We find that incorporating payroll data can improve existing state-level estimates by 11% based on out-of-sample mean absolute error, although the improvement is considerably higher for smaller state-industry cells. We also produce new county-level estimates that could provide more timely granular estimates than previously available. We develop a novel test to determine if these new county-level estimates have errors consistent with official series. Given the level of granularity, we cannot reject the hypothesis that the new county estimates have an accuracy in line with official measures, implying an expansion of the existing frontier. We demonstrate the practical importance of these experimental estimates by investigating a hypothetical application during the COVID-19 pandemic, a period in which more timely and granular information could have assisted in implementing effective policies. Relative to existing estimates, we find that the alternative payroll data series could help identify areas of the country where employment was lagging. Moreover, we also demonstrate the value of a more timely series.
    Date: 2024–07
    URL: https://d.repec.org/n?u=RePEc:cen:wpaper:24-37
  15. By: Yasuhiro Hara (Visiting Scholar, Policy Research Institute, Ministry of Finance)
    Abstract: The COVID-19 pandemic has prompted countries to implement a variety of containment measures, including non-pharmaceutical interventions such as stay-at-home orders. Japan has avoided legally enforcing strict measures such as complete or partial lockdowns, instead relying on voluntary restraint from going out during the state of emergency. We evaluate the impact of information dissemination on people’s mobility. First, we apply the latest findings in natural language processing research to precisely measure the information dissemination effect for each prefecture in Japan. Second, we analyse the dynamic relationship between information dissemination and mobility in each prefecture in Japan using econometric methods. Third, we divide the sample into an early and a later period when the Delta variant emerged in order to analyse the time-varying dynamics of the information effect. Our investigation yields two major findings: First, the stay-at-home information dissemination significantly suppressed people’s mobility. Second, we found a remarkable change in the magnitude of the information effect over time. The information effect weakens after the dominance of the Delta variant compared with the early stage of the pandemic.
    Keywords: COVID-19, impulse response analysis, mobility control policy, sentiment analysis, BERT
    JEL: C23 C55 C61 H12 I18
    URL: https://d.repec.org/n?u=RePEc:mof:wpaper:ron373
  16. By: Lippens, Louis (Ghent University)
    Abstract: The advent of large language models (LLMs) may reshape hiring in the labour market. This paper investigates how generative pre-trained transformers (GPTs), i.e. OpenAI’s GPT-3.5, GPT-4, and GPT-4o, can aid hiring decisions. In a direct comparison between humans and GPTs on an identical hiring task, I show that GPTs tend to select candidates more liberally than humans but exhibit less ethnic bias. GPT-4 even slightly favours certain ethnic minorities. While LLMs may complement humans in hiring by making a (relatively extensive) pre-selection of job candidates, the findings suggest that they may mis-select due to a lack of contextual understanding and may reproduce pre-trained human bias at scale.
    Date: 2024–07–11
    URL: https://d.repec.org/n?u=RePEc:osf:osfxxx:zxf5y
  17. By: Hannes Ullrich; Jonas Hannane; Christian Peukert; Luis Aguiar; Tomaso Duso
    Abstract: Tracking online user behavior is essential for targeted advertising and is at the heart of the business model of major online platforms. We analyze tracker-specific web browsing data to show how the prediction quality of consumer profiles varies with data size and scope. We find decreasing returns to the number of observed users and tracked websites. However, prediction quality increases considerably when web browsing data can be combined with demographic data. We show that Google, Facebook, and Amazon, which can combine such data at scale via their digital ecosystems, may thus attenuate the impact of regulatory interventions such as the GDPR. In this light, even with decreasing returns to data, small firms can be prevented from catching up with these large incumbents. We document that proposed data-sharing provisions may level the playing field concerning the prediction quality of consumer profiles.
    Keywords: Prediction quality, Web Tracking, Cookies, Data protection, Competition Policy, Internet Regulation, GDPR
    JEL: C53 D22 D43 K21 L13 L4
    Date: 2024
    URL: https://d.repec.org/n?u=RePEc:diw:diwwpp:dp2091

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.