nep-big New Economics Papers
on Big Data
Issue of 2024‒03‒04
thirty papers chosen by
Tom Coupé, University of Canterbury


  1. Words of the RBNZ: Textual analysis of Monetary Policy Statements By Rennae Cherry; Eric Tong
  2. AI Oversight and Human Mistakes: Evidence from Centre Court By David Almog; Romain Gauriot; Lionel Page; Daniel Martin
  3. Disentangling Demand and Supply of Media Bias: The Case of Newspaper Homepages By Tin Cheuk Leung; Koleman Strumpf
  4. Determinants of renewable energy consumption in Madagascar: Evidence from feature selection algorithms By Franck Ramaharo; Fitiavana Randriamifidy
  5. Quantitative Theory of Money or Prices? A Historical, Theoretical, and Econometric Analysis By Gómez Julián, José Mauricio
  6. Inherited Inequality: A General Framework and an Application to South Africa By Brunori, Paolo; Ferreira, Francisco H. G.; Salas-Rojo, Pedro
  7. Multi-agent Deep Reinforcement Learning for Dynamic Pricing by Fast-charging Electric Vehicle Hubs in Competition By Diwas Paudel; Tapas K. Das
  8. Predicting the state of synchronization of financial time series using cross recurrence plots By M. Shabani; M. Magris; George Tzagkarakis; J. Kanniainen; A. Iosifidis
  9. Childhood Circumstances and Health of American and Chinese Older Adults: A Machine Learning Evaluation of Inequality of Opportunity in Health By Huo, Shutong; Feng, Derek; Gill, Thomas M.; Chen, Xi
  10. Improving Business Insurance Loss Models by Leveraging InsurTech Innovation By Zhiyu Quan; Changyue Hu; Panyi Dong; Emiliano A. Valdez
  11. THE DETERMINANTS OF CO2 EMISSIONS IN THE CONTEXT OF ESG MODELS AT WORLD LEVEL By Costantiello, Alberto; Leogrande, Angelo
  12. Preferential Trading in Agriculture: New Insights from a Structural Gravity Analysis and Machine Learning By Kim, Dongin
  13. BEDS IN HEALTH FACILITIES IN THE ITALIAN REGIONS: A SOCIO-ECONOMIC APPROACH By Leogrande, Angelo; Costantiello, Alberto; Leogrande, Domenico; Anobile, Fabio
  14. "Monitoring time-varying systemic risk in sovereign debt and currency markets with generative AI" By Helena Chuliá; Sabuhi Khalili; Jorge M. Uribe
  15. Nowcasting Madagascar's real GDP using machine learning algorithms By Franck Ramaharo; Gerzhino Rasolofomanana
  16. The Impact of China’s “Stadium Diplomacy” on Local Economic Development in Sub-Saharan Africa By Lindlacher Valentin; Gustav Pirich
  17. Portfolio Selection Under Non-Gaussianity And Systemic Risk: A Machine Learning Based Forecasting Approach. By Weidong Lin; Abderrahim Taamouti
  18. No More Cost in Translation: Validating Open-Source Machine Translation for Quantitative Text Analysis By Hauke Licht; Ronja Sczepanski; Moritz Laurer; Ayjeren Bekmuratovna
  19. Who are They Talking About? Detecting Mentions of Social Groups in Political Texts with Supervised Learning By Hauke Licht; Ronja Sczepanski
  20. Machine Learning Based Portfolio Selection Under Systemic Risk. By Weidong Lin; Abderrahim Taamouti
  21. Non-Banking Sector development effect on Economic Growth. A Nighttime light data approach By Leonard Mushunje; Maxwell Mashasha
  22. Using Generative Pre-Trained Transformers (GPT) for Supervised Content Encoding: An Application in Corresponding Experiments By Churchill, Alexander; Pichika, Shamitha; Xu, Chengxin
  23. The Role of Emotions in Investment Decisions: The Effects of Vividness of a Crowdfunding Campaign Video By Sander, Julian
  24. Childhood Circumstances and Health of American and Chinese Older Adults: A Machine Learning Evaluation of Inequality of Opportunity in Health By Huo, Shutong; Feng, Derek; Gill, Thomas M.; Chen, Xi
  25. Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending By Mario Sanz-Guerrero; Javier Arroyo
  26. Optimal Linear Signal: An Unsupervised Machine Learning Framework to Optimize PnL with Linear Signals By Pierre Renucci
  27. Supervised Autoencoder MLP for Financial Time Series Forecasting By Bartosz Bieganowski; Robert Ślepaczuk
  28. What Matters for Agricultural Trade? Assessing the Role of Trade Deal Provisions using Machine Learning By Gordeev, Stepan; Jelliffe, Jeremy; Kim, Dongin; Steinbach, Sandro
  29. The ECB Press Conference Statement Deriving a New Sentiment Indicator for the Euro Area By Dimitrios Kanelis; Pierre L. Siklos
  30. Application of Machine Learning in Stock Market Forecasting: A Case Study of Disney Stock By Dengxin Huang

  1. By: Rennae Cherry; Eric Tong (Reserve Bank of New Zealand)
    Abstract: Clear communication helps New Zealanders understand monetary policy and its relationship to them. Communication explains to the public the purpose and rationale behind monetary policy decisions and, when done right, may enhance monetary policy transmission via different channels (RBNZ, 2020; Blinder et al. 2008; Blot and Hubert, 2018). With this motivation, we apply textual analysis to flagship publications of the Reserve Bank—with the aim of assessing Reserve Bank communications and supporting its mandates of maintaining price stability over the medium term and supporting maximum sustainable employment. Key findings: - Textual analysis shows that keywords mentioned in the Monetary Policy Statements (MPS) align with the objectives in the Remit. - The tone of MPS has been neutral and objective, even as the sentiment in the MPS moves in tandem with household and business confidence surveys. - Similar to monetary policy documents published by central banks overseas, the MPS are complex and may not be accessible to the general public. However, readability, which measures the complexity of a text based on sentence length and the number of syllables in words, has remained stable over 1997Q1-2021Q4 and has marginally improved recently. - The Monetary Policy Snapshots, introduced in 2018, are easier to read than the main part of the MPS – they are accessible to a high school graduate rather than a university graduate.
    Date: 2023–07
    URL: http://d.repec.org/n?u=RePEc:nzb:nzbans:2023/03&r=big
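The abstract above describes readability as a function of sentence length and syllables per word but does not name the index used. As an illustration only, the sketch below assumes the Flesch-Kincaid grade level, a standard formula of that type. The word, sentence, and syllable counts are hypothetical and are not taken from any Monetary Policy Statement.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level (an assumed metric, not confirmed by the paper)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for two passages of equal length.
long_sentences = flesch_kincaid_grade(words=100, sentences=4, syllables=170)
short_sentences = flesch_kincaid_grade(words=100, sentences=8, syllables=140)

print(f"longer sentences, more syllables:   grade {long_sentences:.1f}")
print(f"shorter sentences, fewer syllables: grade {short_sentences:.1f}")
```

A higher grade corresponds to a text that needs more years of schooling to read, which is the sense in which a passage can be accessible to a high school graduate rather than a university graduate.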
  2. By: David Almog; Romain Gauriot; Lionel Page; Daniel Martin
    Abstract: Powered by the increasing predictive capabilities of machine learning algorithms, artificial intelligence (AI) systems have begun to be used to overrule human mistakes in many settings. We provide the first field evidence this AI oversight carries psychological costs that can impact human decision-making. We investigate one of the highest visibility settings in which AI oversight has occurred: the Hawk-Eye review of umpires in top tennis tournaments. We find that umpires lowered their overall mistake rate after the introduction of Hawk-Eye review, in line with rational inattention given psychological costs of being overruled by AI. We also find that umpires increased the rate at which they called balls in, which produced a shift from making Type II errors (calling a ball out when in) to Type I errors (calling a ball in when out). We structurally estimate the psychological costs of being overruled by AI using a model of rational inattentive umpires, and our results suggest that because of these costs, umpires cared twice as much about Type II errors under AI oversight.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.16754&r=big
  3. By: Tin Cheuk Leung; Koleman Strumpf
    Abstract: In this study, we propose a novel approach to detect supply-side media bias, independent of external factors like ownership or editors’ ideological leanings. Analyzing over 100,000 articles from The New York Times (NYT) and The Wall Street Journal (WSJ), complemented by data from 22 million tweets, we assess the factors influencing article duration on their digital homepages. By flexibly controlling for demand-side preferences, we attribute extended homepage presence of ideologically slanted articles to supply-side biases. Utilizing a machine learning model, we assign “pro-Democrat” scores to articles, revealing that both tweet count and ideological orientation significantly impact homepage longevity. Our findings show that liberal articles tend to remain longer on the NYT homepage, while conservative ones persist on the WSJ. Further analysis into articles’ transition to print and podcasts suggests that increased competition may reduce media bias, indicating a potential direction for future theoretical exploration.
    Keywords: media bias, media economics, social media, machine learning
    JEL: D22 D72 D83 L82
    Date: 2024
    URL: http://d.repec.org/n?u=RePEc:ces:ceswps:_10890&r=big
  4. By: Franck Ramaharo; Fitiavana Randriamifidy
    Abstract: The aim of this note is to identify the factors influencing renewable energy consumption in Madagascar. We tested 12 features covering macroeconomic, financial, social, and environmental aspects, including economic growth, domestic investment, foreign direct investment, financial development, industrial development, inflation, income distribution, trade openness, exchange rate, tourism development, environmental quality, and urbanization. To assess their significance, we assumed a linear relationship between renewable energy consumption and these features over the 1990-2021 period. Next, we applied different machine learning feature selection algorithms classified as filter-based (relative importance for linear regression, correlation method), embedded (LASSO), and wrapper-based (best subset regression, stepwise regression, recursive feature elimination, iterative predictor weighting partial least squares, Boruta, simulated annealing, and genetic algorithms) methods. Our analysis revealed that the five most influential drivers stem from macroeconomic aspects. We found that domestic investment, foreign direct investment, and inflation positively contribute to the adoption of renewable energy sources. On the other hand, industrial development and trade openness negatively affect renewable energy consumption in Madagascar.
    Date: 2023–10
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.13671&r=big
  5. By: Gómez Julián, José Mauricio
    Abstract: This research studies the relation between money and prices and its practical implications, analyzing quarterly data from the United States (1959-2022), Canada (1961-2022), the United Kingdom (1986-2022), and Brazil (1996-2022). The historical, logical, and econometric consistency of the logical core of the two main theories of money is analyzed using objective Bayesian and frequentist machine learning models, Bayesian regularized artificial neural networks, and ensemble learning. It is concluded that money is not neutral at any time horizon and that, although money is ultimately subordinated to prices, there is a reciprocal influence over time between money and prices, which constitute a complex system. Non-neutrality is transmitted through aggregate demand and is based on the exchange value of money as a monetary unit.
    Date: 2024–02–05
    URL: http://d.repec.org/n?u=RePEc:osf:osfxxx:eskmx&r=big
  6. By: Brunori, Paolo; Ferreira, Francisco H. G.; Salas-Rojo, Pedro
    Abstract: Scholars have sought to quantify the extent of inequality which is inherited from past generations in many different ways, including a large body of work on intergenerational mobility and inequality of opportunity. This paper makes three contributions to that broad literature. First, we show that many of the most prominent approaches to measuring mobility or inequality of opportunity fit within a general framework which involves, as a first step, a calculation of the extent to which inherited circumstances can predict current incomes. The importance of prediction has led to recent applications of machine learning tools to solve the model selection challenge in the presence of competing upward and downward biases. Our second contribution is to apply transformation trees to the computation of inequality of opportunity. Because the algorithm is built on a likelihood maximization that involves splitting the sample into groups with the most salient differences between their conditional cumulative distributions, it is particularly well-suited to measuring ex-post inequality of opportunity, following Roemer (1998). Our third contribution is to apply the method to data from South Africa, arguably the world’s most unequal country, and find that almost three-quarters of its current inequality is inherited from predetermined circumstances, with race playing the largest role, but parental background also making an important contribution. (Stone Center on Socio-Economic Inequality Working Paper)
    Date: 2024–01–22
    URL: http://d.repec.org/n?u=RePEc:osf:socarx:rgq7t&r=big
  7. By: Diwas Paudel; Tapas K. Das
    Abstract: Fast-charging hubs for electric vehicles will soon become part of the newly built infrastructure for transportation electrification across the world. These hubs are expected to host many DC fast-charging stations and will admit EVs only for charging. Like the gasoline refueling stations, fast-charging hubs in a neighborhood will dynamically vary their prices to compete for the same pool of EV owners. These hubs will interact with the electric power network by making purchase commitments for a significant part of their power needs in the day-ahead (DA) electricity market and meeting the difference from the real-time (RT) market. Hubs may have supplemental battery storage systems (BSS), which they will use for arbitrage. In this paper, we develop a two-step data-driven dynamic pricing methodology for hubs in price competition. We first obtain the DA commitment by solving a stochastic DA commitment model. Thereafter we obtain the hub pricing strategies by modeling the game as a competitive Markov decision process (CMDP) and solving it using a multi-agent deep reinforcement learning (MADRL) approach. We develop a numerical case study for a pricing game between two charging hubs. We solve the case study with our methodology by using combinations of two different DRL algorithms, DQN and SAC, and two different neural networks (NN) architectures, a feed-forward (FF) neural network, and a multi-head attention (MHA) neural network. We construct a measure of collusion (index) using the hub profits. A value of zero for this index indicates no collusion (perfect competition) and a value of one indicates full collusion (monopolistic behavior). Our results show that the collusion index varies approximately between 0.14 and 0.45 depending on the combinations of the algorithms and the architectures chosen by the hubs.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.15108&r=big
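The abstract above says the collusion index is built from hub profits, with zero meaning perfect competition and one meaning monopolistic behavior, but it does not give the formula. One common way to build such an index, assumed here purely for illustration, is to normalize the observed profit between a competitive benchmark and a monopoly benchmark. The function name and all profit figures below are hypothetical.

```python
def collusion_index(observed_profit: float,
                    competitive_profit: float,
                    monopoly_profit: float) -> float:
    """Normalized profit gain: 0 at the competitive benchmark, 1 at monopoly.

    This normalization is an assumption; the paper's exact definition
    is not stated in the abstract.
    """
    if monopoly_profit <= competitive_profit:
        raise ValueError("monopoly profit must exceed competitive profit")
    return (observed_profit - competitive_profit) / (
        monopoly_profit - competitive_profit
    )


# Hypothetical average hub profits (arbitrary currency units).
competitive = 100.0
monopoly = 200.0
for observed in (100.0, 130.0, 200.0):
    index = collusion_index(observed, competitive, monopoly)
    print(f"observed profit {observed:6.1f} -> index {index:.2f}")
```

Under this normalization, a reported index between roughly 0.14 and 0.45 would mean the learned pricing strategies capture between about a seventh and just under half of the profit gap between competition and monopoly.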
  8. By: M. Shabani; M. Magris; George Tzagkarakis (IRGO - Institut de Recherche en Gestion des Organisations - UB - Université de Bordeaux - Institut d'Administration des Entreprises (IAE) - Bordeaux); J. Kanniainen; A. Iosifidis
    Abstract: Cross-correlation analysis is a powerful tool for understanding the mutual dynamics of time series. This study introduces a new method for predicting the future state of synchronization of the dynamics of two financial time series. To this end, we use the cross recurrence plot analysis as a nonlinear method for quantifying the multidimensional coupling in the time domain of two time series and for determining their state of synchronization. We adopt a deep learning framework for methodologically addressing the prediction of the synchronization state based on features extracted from dynamically sub-sampled cross recurrence plots. We provide extensive experiments on several stocks, major constituents of the S&P 100 index, to empirically validate our approach. We find that the task of predicting the state of synchronization of two time series is in general rather difficult, but for certain pairs of stocks attainable with very satisfactory performance (84% F1-score, on average).
    Keywords: Cross recurrence plot, Synchronization, Kernel convolutional neural network, Financial time series
    Date: 2023–06
    URL: http://d.repec.org/n?u=RePEc:hal:journl:hal-04415269&r=big
  9. By: Huo, Shutong (University of California, Irvine); Feng, Derek (Yale University); Gill, Thomas M. (Yale University); Chen, Xi (Yale University)
    Abstract: Childhood circumstances may impact senior health, prompting this study to introduce novel machine learning methods to assess their individual and collective contributions to health inequality in old age. Using the US Health and Retirement Study (HRS) and the China Health and Retirement Longitudinal Study (CHARLS), we analyzed health outcomes of American and Chinese participants aged 60 and above. Conditional inference trees and forests were employed to estimate the influence of childhood circumstances on self-rated health (SRH), and the results were compared with the conventional parametric Roemer method. The conventional parametric Roemer method estimated higher inequality of opportunity (IOP) in health (China: 0.039, 22.67% of the total Gini coefficient 0.172; US: 0.067, 35.08% of the total Gini coefficient 0.191) than the conditional inference tree (China: 0.022, 12.79% of 0.172; US: 0.044, 23.04% of 0.191) and forest (China: 0.035, 20.35% of 0.172; US: 0.054, 28.27% of 0.191). Key determinants of health in old age were identified, including childhood health, family financial status, and regional differences. The conditional inference forest consistently outperformed other methods in predictive accuracy as measured by out-of-sample mean squared error (MSE). The findings demonstrate the importance of early-life circumstances in shaping later health outcomes and stress the need for early-life interventions to promote health equity in aging societies. Our methods highlight the utility of machine learning in public health to identify determinants of health inequality.
    Keywords: life course, inequality of opportunity, childhood circumstances, machine learning, conditional inference tree, random forest
    JEL: I14 J13 J14 O57 C53
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:iza:izadps:dp16764&r=big
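The percentages in the abstract above are the absolute inequality-of-opportunity (IOP) estimates divided by the total Gini coefficient for each country. The short check below reproduces them from the figures quoted in the abstract. The function and variable names are illustrative and do not come from the paper.

```python
# Total Gini coefficients and absolute IOP estimates, copied from the abstract.
TOTAL_GINI = {"China": 0.172, "US": 0.191}
ABSOLUTE_IOP = {
    "parametric Roemer": {"China": 0.039, "US": 0.067},
    "conditional inference tree": {"China": 0.022, "US": 0.044},
    "conditional inference forest": {"China": 0.035, "US": 0.054},
}


def relative_iop(absolute_iop: float, total_gini: float) -> float:
    """Return absolute IOP as a percentage of total inequality."""
    return 100.0 * absolute_iop / total_gini


for method, by_country in ABSOLUTE_IOP.items():
    for country, iop in by_country.items():
        share = relative_iop(iop, TOTAL_GINI[country])
        print(f"{method:30s} {country:6s} {share:6.2f}%")
```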
  10. By: Zhiyu Quan; Changyue Hu; Panyi Dong; Emiliano A. Valdez
    Abstract: Recent transformative and disruptive advancements in the insurance industry have embraced various InsurTech innovations. In particular, with the rapid progress in data science and computational capabilities, InsurTech is able to integrate a multitude of emerging data sources, shedding light on opportunities to enhance risk classification and claims management. This paper presents a groundbreaking effort as we combine real-life proprietary insurance claims information together with InsurTech data to enhance the loss model, a fundamental component of insurance companies' risk management. Our study further utilizes various machine learning techniques to quantify the predictive improvement of the InsurTech-enhanced loss model over that of the insurance in-house. The quantification process provides a deeper understanding of the value of the InsurTech innovation and advocates potential risk factors that are unexplored in traditional insurance loss modeling. This study represents a successful undertaking of an academic-industry collaboration, suggesting an inspiring path for future partnerships between industry and academic institutions.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.16723&r=big
  11. By: Costantiello, Alberto; Leogrande, Angelo
    Abstract: We estimate the determinants of CO2 emissions (COE) in the context of the Environmental, Social and Governance (ESG) model at world level. We use World Bank data for 193 countries over the period 2011-2020. We find that the level of COE is positively associated with, among others, “Methane Emissions” and “Research and Development Expenditures”, and negatively associated with, among others, “Renewable Energy Consumption” and “Mean Drought Index”. Furthermore, we apply a cluster analysis with the k-Means algorithm optimized with the Elbow Method and find four clusters. Finally, we apply eight machine-learning algorithms to predict the future value of COE and find that the Artificial Neural Network (ANN) algorithm is the best predictor. The ANN predicts a reduction in the level of COE of 5.69% on average for the analysed countries.
    Date: 2024–01–25
    URL: http://d.repec.org/n?u=RePEc:osf:socarx:53djm&r=big
  12. By: Kim, Dongin
    Keywords: Agribusiness, Agricultural Finance, International Relations/Trade, Research and Development/Tech Change/Emerging Technologies
    Date: 2022–12
    URL: http://d.repec.org/n?u=RePEc:ags:iats22:339469&r=big
  13. By: Leogrande, Angelo; Costantiello, Alberto; Leogrande, Domenico; Anobile, Fabio
    Abstract: In this article, we consider the determinants of beds in healthcare facilities (BEDS) in the Italian regions between 2004 and 2022. We use the ISTAT-BES database. We use different econometric techniques, i.e. Panel Data with Fixed Effects, Panel Data with Random Effects, Pooled Ordinary Least Squares (OLS), Weighted Least Squares (WLS), and Dynamic Panel at 1 Stage. The results show that the level of BEDS is positively associated with, among others, "General Doctors with a Number of Clients over the Threshold" and "Life Satisfaction", and negatively associated with, among others, "Trust in Parties" and "Positive Judgment on Future Prospects". Furthermore, we apply a clustering with the k-Means algorithm optimized with the Silhouette Coefficient and find two clusters in terms of BEDS. Finally, we compare eight machine-learning algorithms and find that the best predictor is the Artificial Neural Network (ANN).
    Date: 2024–01–25
    URL: http://d.repec.org/n?u=RePEc:osf:socarx:shkjt&r=big
  14. By: Helena Chuliá (Riskcenter- IREA and Department of Econometrics and Statistics, University of Barcelona.); Sabuhi Khalili (Department of Econometrics and Statistics, University of Barcelona.); Jorge M. Uribe (Faculty of Economics and Business Studies, Open University of Catalonia.)
    Abstract: We propose generative artificial intelligence to measure systemic risk in the global markets of sovereign debt and foreign exchange. Through a comparative analysis, we explore three models that are novel to the economics literature and integrate them with traditional factor models. These models are: Time Variational Autoencoders, Time Generative Adversarial Networks, and Transformer-based Time-series Generative Adversarial Networks. Our empirical results provide evidence in support of the Variational Autoencoder. Results here indicate that both the Credit Default Swaps and foreign exchange markets are susceptible to systemic risk, with a historically high probability of distress observed by the end of 2022, as measured by both the Joint Probability of Distress and the Expected Proportion of Markets in Distress. Our results provide insights for governments in both developed and developing countries, since the realistic counterfactual scenarios generated by the AI, yet to occur in global markets, underscore the potential worst-case scenarios that may unfold if systemic risk materializes. Considering such scenarios is crucial when designing macroprudential policies aimed at preserving financial stability and when measuring the effectiveness of the implemented policies.
    Keywords: Twin Ds, Sovereign Debt, Credit Risk, TimeGANs, Transformers, TimeVAEs, Autoencoders, Variational Inference
    JEL: C45 C53 F31 F37
    Date: 2024–02
    URL: http://d.repec.org/n?u=RePEc:ira:wpaper:202402&r=big
  15. By: Franck Ramaharo; Gerzhino Rasolofomanana
    Abstract: We investigate the predictive power of different machine learning algorithms to nowcast Madagascar's gross domestic product (GDP). We trained popular regression models, including linear regularized regression (Ridge, Lasso, Elastic-net), dimensionality reduction model (principal component regression), k-nearest neighbors algorithm (k-NN regression), support vector regression (linear SVR), and tree-based ensemble models (Random forest and XGBoost regressions), on 10 Malagasy quarterly macroeconomic leading indicators over the period 2007Q1--2022Q4, and we used simple econometric models as a benchmark. We measured the nowcast accuracy of each model by calculating the root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Our findings reveal that the Ensemble Model, formed by aggregating individual predictions, consistently outperforms traditional econometric models. We conclude that machine learning models can deliver more accurate and timely nowcasts of Malagasy economic performance and provide policymakers with additional guidance for data-driven decision making.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.10255&r=big
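The three accuracy measures named in the abstract above (RMSE, MAE, and MAPE) have standard definitions, sketched below in plain Python. The GDP growth figures are hypothetical placeholders, not data from the paper, and the simple average used for the "ensemble" is only one possible way of aggregating individual predictions.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (undefined if any actual is 0)."""
    return 100.0 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)


# Hypothetical quarterly values and two hypothetical model nowcasts.
actual = [4.0, 3.5, 5.0, 4.5]
model_a = [3.6, 3.9, 4.4, 4.9]
model_b = [4.5, 3.2, 5.5, 4.0]
# Simple average of the two models (an assumed aggregation rule).
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]

for name, nowcast in (("model A", model_a), ("model B", model_b), ("ensemble", ensemble)):
    print(
        f"{name:9s} RMSE={rmse(actual, nowcast):.3f} "
        f"MAE={mae(actual, nowcast):.3f} MAPE={mape(actual, nowcast):.2f}%"
    )
```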
  16. By: Lindlacher Valentin; Gustav Pirich
    Abstract: This study investigates the economic impact of China’s “stadium diplomacy” in Sub-Saharan Africa. Exploiting the staggered timing of the construction in a difference-in-differences framework, we analyze the effect of Chinese-built and financed stadiums on local economic development. Employing nighttime light satellite data, we provide both an aggregate and spatially disaggregated assessment of these investments. We find that nighttime light intensity in a stadium’s city increases by 25 percent, on average, after stadium completion. Nighttime light activity in the stadium’s direct surroundings increases by 34 percent, on average. The effects can be attributed to the stadiums but are not only visible close to the stadium’s location. The effect remains strong when controlling for other local Chinese investments. Thus, we find evidence for beneficial effects of Chinese-built and financed stadiums on local economic development in Sub-Saharan Africa, contrasting with the widely held notion that China’s development finance projects constitute “white elephants”.
    Keywords: stadium diplomacy, regional development, nighttime light, local public infrastructure, Sub-Saharan Africa
    JEL: O18 R11 O55 R53 Z20
    Date: 2024
    URL: http://d.repec.org/n?u=RePEc:ces:ceswps:_10893&r=big
  17. By: Weidong Lin; Abderrahim Taamouti
    Abstract: The Sharpe-ratio-maximizing portfolio becomes questionable under non-Gaussian returns, and it rules out, by construction, systemic risk, which can negatively affect its out-of-sample performance. In the present work, we develop a new performance ratio that simultaneously addresses these two problems when building optimal portfolios. To robustify the portfolio optimization and better represent extreme market scenarios, we simulate a large number of returns via a Monte Carlo method. This is done by first obtaining probabilistic return forecasts through a distributional machine learning approach in a big data setting, and then combining them with a fitted copula to generate return scenarios. Based on a large-scale comparative analysis conducted on the US market, the backtesting results demonstrate the superiority of our proposed portfolio selection approach against several popular benchmark strategies in terms of both profitability and minimizing systemic risk. This outperformance is robust to the inclusion of transaction costs.
    Keywords: Portfolio optimization; probability forecasting; quantile regression neural network; extreme scenarios; big data.
    Date: 2023–08
    URL: http://d.repec.org/n?u=RePEc:liv:livedp:202310&r=big
  18. By: Hauke Licht (University of Cologne); Ronja Sczepanski (Sciences Po Paris); Moritz Laurer (Hugging Face; Vrije Universiteit Amsterdam); Ayjeren Bekmuratovna (DHL)
    Abstract: As more and more scholars apply computational text analysis methods to multilingual corpora, machine translation has become an indispensable tool. However, relying on commercial services for machine translation, such as Google Translate or DeepL, limits reproducibility and can be expensive. This paper assesses the viability of a reproducible and affordable alternative: free and open-source machine translation models. We ask whether researchers who use an open-source model instead of a commercial service for machine translation would obtain substantially different measurements from their multilingual corpora. We address this question by replicating and extending an influential study by de Vries et al. (2018) on the use of machine translation in cross-lingual topic modeling, and an original study of its use in supervised text classification with Transformer-based classifiers. We find only minor differences between the measurements generated by these methods when applied to corpora translated with open-source models and commercial services, respectively. We conclude that “free” machine translation is a very valuable addition to researchers’ multilingual text analysis toolkit. Our study adds to a growing body of work on multilingual text analysis methods and has direct practical implications for applied researchers.
    Keywords: machine translation, multilingual topic modeling, multilingual Transformers
    JEL: C45
    Date: 2024–02
    URL: http://d.repec.org/n?u=RePEc:ajk:ajkdps:276&r=big
  19. By: Hauke Licht (University of Cologne, Cologne Center for Comparative Politics); Ronja Sczepanski (Sciences Po Paris, Center for European Studies and Comparative Research)
    Abstract: Politicians appeal to social groups to court their electoral support. However, quantifying which groups politicians refer to, claim to represent, or address in their public communication presents researchers with challenges. We propose a novel supervised learning approach for extracting group mentions in political texts. We first collect human annotations to determine the exact text passages that refer to social groups. We then fine-tune a Transformer language model for contextualized supervised classification at the word level. Applied to unlabeled texts, our approach enables researchers to automatically detect and extract word spans that contain group mentions. We illustrate our approach in three applications, generating new empirical insights into how British parties use social groups in their rhetoric. Our methodological innovation makes it possible to detect and extract mentions of social groups from various sources of texts, creating new possibilities for empirical research in political science.
    Keywords: social groups, political rhetoric, computational text analysis, supervised classification
    JEL: C45
    Date: 2024–02
    URL: http://d.repec.org/n?u=RePEc:ajk:ajkdps:277&r=big
  20. By: Weidong Lin; Abderrahim Taamouti
    Abstract: This paper aims to enhance the classical mean-variance portfolio selection by using machine learning techniques and accounting for systemic risk. The optimal portfolio is solved through a three-step supervised learning model. Firstly, the Smooth Pinball Neural Network is employed to predict return distributions of individual assets and the market. Secondly, we use copula to model dependence between assets and the market, based on which we simulate return scenarios. Lastly, we maximize an ex-ante conditional Sharpe ratio conditioning on systemic events. We run a large-scale comparative study using nearly 600 US individual stocks over 37 years. Our set of predictors includes 94 firm characteristics, 14 macroeconomic variables, and 74 industry dummies. The backtesting results demonstrate the superiority of our proposed approach over popular benchmark strategies including a GARCH-based model. This outperformance is statistically significant and robust to the inclusion of transaction costs.
    Keywords: machine learning, portfolio selection, systemic risk, simulation, probabilistic forecasting
    JEL: G11 C45 C53 C58
    URL: http://d.repec.org/n?u=RePEc:liv:livedp:202311&r=big
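The first step described in the abstract above predicts return distributions with a Smooth Pinball Neural Network. The sketch below shows only the plain (non-smoothed) pinball loss that quantile forecasts of this kind are typically trained against; the smoothing used in the paper and the network itself are not reproduced here. The return values and candidate forecasts are hypothetical.

```python
def pinball_loss(actual: float, forecast: float, tau: float) -> float:
    """Plain pinball (quantile) loss for quantile level tau in (0, 1)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    error = actual - forecast
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return max(tau * error, (tau - 1.0) * error)


def mean_pinball_loss(actuals, forecast: float, tau: float) -> float:
    """Average pinball loss of a single constant quantile forecast."""
    return sum(pinball_loss(a, forecast, tau) for a in actuals) / len(actuals)


# Hypothetical daily returns (in percent).
returns = [-2.1, -0.4, 0.0, 0.3, 0.8, 1.1, 1.5, 2.7]

# For the 5% quantile, a forecast in the left tail should score better
# than a forecast near the centre of the distribution.
for candidate in (-2.0, 0.5):
    loss = mean_pinball_loss(returns, candidate, tau=0.05)
    print(f"5% quantile forecast {candidate:5.1f}: mean pinball loss {loss:.4f}")
```

Minimizing this loss at several values of tau yields a set of quantile forecasts, which is one way to approximate a full predictive return distribution before the copula and simulation steps.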
  21. By: Leonard Mushunje; Maxwell Mashasha
    Abstract: This paper uses nighttime light (NTL) data to measure the nexus of the non-banking sector, particularly insurance, and economic growth in South Africa. We hypothesize that insurance sector growth positively propels economic growth due to its economic growth-supportive traits like investment protection and optimal risk mitigation. We also claim that nighttime light data is a better economic measure than gross domestic product (GDP). We used weighted regressions to measure the relationships between nighttime light data, GDP, and insurance sector development. We used time series South African GDP data collected from the World Bank for the period running from 2000 to 2018, and the nighttime lights data from the National Geophysical Data Centre (NGDC) in partnership with the National Oceanic and Atmospheric Administration (NOAA). From the models fitted and the reported BIC, AIC, and likelihood ratios, the insurance sector proved to have more predictive power on economic development in South Africa, and radiance light explained economic growth better than GDP and GDP per capita. We concluded that nighttime light data is a better proxy for economic growth than GDP per capita in emerging economies like South Africa, where secondary data are less robust and sometimes inflated. The findings will guide researchers and policymakers on what drives economic development and what policies to put in place. It would be interesting to extend the current study to other sectors such as microfinance, mutual funds, and hedge funds.
    Date: 2023–11
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.08596&r=big
  22. By: Churchill, Alexander; Pichika, Shamitha; Xu, Chengxin (Seattle University)
    Abstract: Supervised content encoding applies a given codebook to a larger non-numerical dataset and is central to empirical research in public administration. Not only is it a key analytical approach for qualitative studies, but the method also allows researchers to measure constructs using non-numerical data, which can then be applied to quantitative description and causal inference. Despite its utility, supervised content encoding faces challenges including high cost and low reproducibility. In this report, we test if large language models (LLMs), specifically generative pre-trained transformers (GPT), can solve these problems. Using email messages collected from a national correspondence experiment in the U.S. nursing home market as an example, we demonstrate that although we found some disparities between GPT and human coding results, the disagreement is acceptable for certain research designs, which makes GPT encoding a potential substitute for human encoders. Practical suggestions for encoding with GPT are provided at the end of the letter.
    Date: 2024–01–25
    URL: http://d.repec.org/n?u=RePEc:osf:socarx:6fpgj&r=big
  23. By: Sander, Julian
    Abstract: Current research in the domain of crowdfunding often overlooks the role of emotions in investment decision-making and the impact of varying vividness in campaign videos. Employing the Emotion-Imbued Choice Model (Lerner et al., 2015), this study investigates how different levels of a campaign video’s vividness shape emotions and the subsequent investment decision. Utilizing a quantitative approach with a between-group design, participants were exposed to either a campaign video, or an infographic accompanied by identical audio. Emotional reactions were analyzed via TAWNY, a deep learning tool for facial emotion recognition. Additionally, questionnaires assessed investment decisions, self-reported emotions, and evaluations of the stimulus. The findings suggest that emotions are likely to play an informative role throughout the decision-making process. Results show that the group receiving the infographic was more likely to invest, supporting the findings of Lagazio and Querci (2018). As posited by Dey et al. (2017), a crowdfunding video does not necessarily lead to higher crowdfunding chances. Notably, the mediating role of the conscious evaluation differs concerning idea and video evaluation, showing different effects on the relationship between current emotions and the outcome, that is, the investment decision and self-reported emotions. These findings align with the assumption of System 1 and System 2 thinking (Kahneman, 2011). Additionally, the lack of direct effects of current emotions on the outcome suggests the potential for a full mediation by the conscious evaluation, supporting modern decision-making theories (Lerner et al., 2015; Loewenstein et al., 2001).
    Date: 2024–01–21
    URL: http://d.repec.org/n?u=RePEc:osf:thesis:6gptv&r=big
  24. By: Huo, Shutong; Feng, Derek; Gill, Thomas M.; Chen, Xi
    Abstract: Childhood circumstances may impact senior health, prompting this study to introduce novel machine learning methods to assess their individual and collective contributions to health inequality in old age. Using the US Health and Retirement Study (HRS) and the China Health and Retirement Longitudinal Study (CHARLS), we analyzed health outcomes of American and Chinese participants aged 60 and above. Conditional inference trees and forests were employed to estimate the influence of childhood circumstances on self-rated health (SRH), comparing with the conventional parametric Roemer method. The conventional parametric Roemer method estimated higher inequality of opportunity (IOP) in health (China: 0.039, 22.67% of the total Gini coefficient 0.172; US: 0.067, 35.08% of the total Gini coefficient 0.191) than conditional inference tree (China: 0.022, 12.79% of 0.172; US: 0.044, 23.04% of 0.191) and forest (China: 0.035, 20.35% of 0.172; US: 0.054, 28.27% of 0.191). Key determinants of health in old age were identified, including childhood health, family financial status, and regional differences. The conditional inference forest consistently outperformed other methods in predictive accuracy as measured by out-of-sample mean squared error (MSE). The findings demonstrate the importance of early-life circumstances in shaping later health outcomes and stress the need for early-life interventions for health equity in aging societies. Our methods highlight the utility of machine learning in public health to identify determinants of health inequality.
    Keywords: Life Course, Inequality of Opportunity, Childhood Circumstances, Machine Learning, Conditional Inference Tree, Random Forest
    JEL: I14 J13 J14 O57 C53
    Date: 2024
    URL: http://d.repec.org/n?u=RePEc:zbw:glodps:1384&r=big
  25. By: Mario Sanz-Guerrero; Javier Arroyo
    Abstract: Peer-to-peer (P2P) lending has emerged as a distinctive financing mechanism, linking borrowers with lenders through online platforms. However, P2P lending faces the challenge of information asymmetry, as lenders often lack sufficient data to assess the creditworthiness of borrowers. This paper proposes a novel approach to address this issue by leveraging the textual descriptions provided by borrowers during the loan application process. Our methodology involves processing these textual descriptions using a Large Language Model (LLM), a powerful tool capable of discerning patterns and semantics within the text. Transfer learning is applied to adapt the LLM to the specific task at hand. Our results derived from the analysis of the Lending Club dataset show that the risk score generated by BERT, a widely used LLM, significantly improves the performance of credit risk classifiers. However, the inherent opacity of LLM-based systems, coupled with uncertainties about potential biases, underscores critical considerations for regulatory frameworks and engenders trust-related concerns among end-users, opening new avenues for future research in the dynamic landscape of P2P lending and artificial intelligence.
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.16458&r=big
  26. By: Pierre Renucci
    Abstract: This study presents an unsupervised machine learning approach for optimizing Profit and Loss (PnL) in quantitative finance. Our algorithm, akin to an unsupervised variant of linear regression, maximizes the Sharpe Ratio of PnL generated from signals constructed linearly from exogenous variables. The methodology employs a linear relationship between exogenous variables and the trading signal, with the objective of maximizing the Sharpe Ratio through parameter optimization. Empirical application on an ETF representing U.S. Treasury bonds demonstrates the model's effectiveness, supported by regularization techniques to mitigate overfitting. The study concludes with potential avenues for further development, including generalized time steps and enhanced corrective terms.
    Date: 2023–11
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.05337&r=big
  27. By: Bartosz Bieganowski (University of Warsaw, Faculty of Economic Sciences, Quantitative Finance Research Group); Robert Ślepaczuk (University of Warsaw, Faculty of Economic Sciences, Quantitative Finance Research Group, Department of Quantitative Finance)
    Abstract: This paper investigates the enhancement of financial time series forecasting with the use of neural networks through supervised autoencoders, aiming to improve investment strategy performance. It specifically examines the impact of noise augmentation and triple barrier labeling on risk-adjusted returns, using the Sharpe and Information Ratios. The study focuses on the S&P 500 index, EUR/USD, and BTC/USD as the traded assets from January 1, 2010, to April 30, 2022. Findings indicate that supervised autoencoders, with balanced noise augmentation and bottleneck size, significantly boost strategy effectiveness. However, excessive noise and large bottleneck sizes can impair performance, highlighting the importance of precise parameter tuning. This paper also presents a derivation of a novel optimization metric that can be used with triple barrier labeling. The results of this study have substantial policy implications, suggesting that financial institutions and regulators could leverage the techniques presented to enhance market stability and investor protection, while also encouraging more informed and strategic investment approaches in various financial sectors.
    Keywords: machine learning, algorithmic investment strategy, supervised autoencoders, financial time series, trading strategy, risk-adjusted return
    JEL: C4 C14 C45 C53 C58 G13
    Date: 2024
    URL: http://d.repec.org/n?u=RePEc:war:wpaper:2024-03&r=big
  28. By: Gordeev, Stepan; Jelliffe, Jeremy; Kim, Dongin; Steinbach, Sandro
    Keywords: International Relations/Trade, Research and Development/Tech Change/Emerging Technologies
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:ags:iats23:339533&r=big
  29. By: Dimitrios Kanelis; Pierre L. Siklos
    Abstract: We analyze the introductory statements of the ECB president and derive new sentiment indicators for the euro area based on a novel approach. To evaluate sentiment, we utilize a Large Language Model, namely FinBERT, which classifies the verbal sentiment of economics and finance-related textual data. We find that the ECB's conveyed sentiment about monetary policy, which is influenced by the economic outlook and the state of the euro area macroeconomy as expressed in speeches, plays a significant role in shaping the content of press conferences following a Governing Council decision. In contrast, speech sentiment regarding financial stability does not significantly influence introductory statements.
    Keywords: ECB, communication, financial stability, FinBERT, monetary policy, sentiment analysis
    JEL: E50 E58
    Date: 2024–01
    URL: http://d.repec.org/n?u=RePEc:een:camaaa:2024-10&r=big
  30. By: Dengxin Huang
    Abstract: This document presents a stock market analysis conducted on a dataset consisting of 750 instances and 16 attributes donated on 2014-10-23. The analysis includes an exploratory data analysis (EDA) section, feature engineering, data preparation, model selection, and insights from the analysis. The Fama-French 3-factor model is also utilized in the analysis. The results of the analysis are presented, with linear regression being the best-performing model.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2401.10903&r=big

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.