nep-big New Economics Papers
on Big Data
Issue of 2024‒01‒22
twenty-one papers chosen by
Tom Coupé, University of Canterbury


  1. Energy poverty prediction and effective targeting for just transitions with machine learning By Spandagos, Constantine; Tovar Reaños, Miguel; Lynch, Muireann Á.
  2. Finding the best trade-off between performance and interpretability in predicting hospital length of stay using structured and unstructured data By Franck Jaotombo; Luca Adorni; Badih Ghattas; Laurent Boyer
  3. Historical Calibration of SVJD Models with Deep Learning By Milan Ficura; Jiri Witzany
  4. How Generative-AI can be Effectively used in Government Chatbots By Zeteng Lin
  5. Machine-learning prediction for hospital length of stay using a French medico-administrative database By Franck Jaotombo; Vanessa Pauly; Guillaume Fond; Veronica Orleans; Pascal Auquier; Badih Ghattas; Laurent Boyer
  6. Leveraging Sample Entropy for Enhanced Volatility Measurement and Prediction in International Oil Price Returns By Radhika Prosad Datta
  7. Corporate Bankruptcy Prediction with Domain-Adapted BERT By Alex Kim; Sangwon Yoon
  8. The Machine Learning Control Method for Counterfactual Forecasting By Augusto Cerqua; Marco Letta; Fiammetta Menchetti
  9. Detecting Toxic Flow By Álvaro Cartea; Gerardo Duran-Martin; Leandro Sánchez-Betancourt
  10. Economic Forecasts Using Many Noises By Yuan Liao; Xinjie Ma; Andreas Neuhierl; Zhentao Shi
  11. Double Machine Learning for Static Panel Models with Fixed Effects By Paul Clarke; Annalivia Polselli
  12. Mapping the Dynamics of Management Styles - Evidence from German Survey Data By Florian Englmaier; Michael Hofmann; Stefanie Wolter
  13. Gender stereotypes embedded in natural language are stronger in more economically developed and individualistic countries By Clotilde Napp
  14. A unified repository for pre-processed climate data weighted by gridded economic activity By Marco Gortan; Lorenzo Testa; Giorgio Fagiolo; Francesco Lamperti
  15. The Causal Impact of Credit Lines on Spending Distributions By Yijun Li; Cheuk Hang Leung; Xiangqian Sun; Chaoqun Wang; Yiyan Huang; Xing Yan; Qi Wu; Dongdong Wang; Zhixiang Huang
  16. Predicting Financial Literacy via Semi-supervised Learning By David Hason Rudd; Huan Huo; Guandong Xu
  17. Exploring Nature: Datasets and Models for Analyzing Nature-Related Disclosures By Tobias Schimanski; Chiara Colesanti Senni; Glen Gostlow; Jingwei Ni; Tingyu Yu; Markus Leippold
  18. Fused Extended Two-Way Fixed Effects for Difference-in-Differences with Staggered Adoptions By Gregory Faletto
  19. Comparative Evaluation of Anomaly Detection Methods for Fraud Detection in Online Credit Card Payments By Hugo Thimonier; Fabrice Popineau; Arpad Rimmel; Bich-Liên Doan; Fabrice Daniel
  20. Skills or Degree? The Rise of Skill-Based Hiring for AI and Green Jobs By Eugenia Gonzalez Ehlinger; Fabian Stephany
  21. Shai: A large language model for asset management By Zhongyang Guo; Guanran Jiang; Zhongdan Zhang; Peng Li; Zhefeng Wang; Yinchun Wang

  1. By: Spandagos, Constantine; Tovar Reaños, Miguel; Lynch, Muireann Á.
    Date: 2023
    URL: http://d.repec.org/n?u=RePEc:esr:wpaper:wp762&r=big
  2. By: Franck Jaotombo (EM - emlyon business school); Luca Adorni; Badih Ghattas; Laurent Boyer
    Abstract: Objective: This study aims to develop high-performing machine learning and deep learning models for predicting hospital length of stay (LOS) while enhancing interpretability. We compare the performance and interpretability of models trained only on structured tabular data, models trained only on unstructured clinical text data, and models trained on the combined data. Methods: The structured data were used to train fourteen classical machine learning models, including advanced ensemble trees, neural networks, and k-nearest neighbors. The unstructured data were used to fine-tune a pre-trained Bio Clinical BERT Transformer deep learning model. The structured and unstructured data were then merged into a single tabular dataset after vectorization of the clinical text and dimensionality reduction through Latent Dirichlet Allocation. The study used the free and publicly available Medical Information Mart for Intensive Care (MIMIC) III database and the open-source AutoML library AutoGluon. Performance is evaluated against two types of random classifiers used as baselines. Results: The best model trained on structured data achieves high performance (ROC AUC = 0.944, PRC AUC = 0.655) with limited interpretability; the most important predictors of prolonged LOS are blood urea nitrogen and platelet levels. The Transformer model shows good but lower performance (ROC AUC = 0.842, PRC AUC = 0.375) with richer interpretability, surfacing more specific in-hospital factors including procedures, conditions, and medical history. The best model trained on the merged data combines high performance (ROC AUC = 0.963, PRC AUC = 0.746) with a much wider interpretive scope, covering pathologies of the intestine, colon, and blood; infectious diseases; respiratory problems; procedures involving sedation and intubation; and vascular surgery. Conclusions: Our results outperform most state-of-the-art LOS prediction models in both performance and interpretability. Fusing structured and unstructured text data may significantly improve both.
    Keywords: hospital length of stay, explainable AI, data fusion, structured and unstructured data, clinical transformers
    Date: 2023–11–30
    URL: http://d.repec.org/n?u=RePEc:hal:journl:hal-04339462&r=big
  3. By: Milan Ficura (Faculty of Finance and Accounting, Prague University of Economics and Business, Czech Republic); Jiri Witzany (Faculty of Finance and Accounting, Prague University of Economics and Business, Czech Republic.)
    Abstract: We propose how deep neural networks can be used to calibrate the parameters of Stochastic-Volatility Jump-Diffusion (SVJD) models to historical asset return time series. 1-Dimensional Convolutional Neural Networks (1D-CNN) are used for that purpose. The accuracy of the deep learning approach is compared with machine learning methods based on shallow neural networks and hand-crafted features, and with commonly used statistical approaches such as MCMC and approximate MLE. The deep learning approach is found to be accurate and robust, outperforming the other approaches in simulation tests. The main advantage of the deep learning approach is that it is fully generic and can be applied to any SVJD model from which simulations can be drawn. An additional advantage is the speed of the deep learning approach in situations when the parameter estimation needs to be repeated on new data. The trained neural network can be in these situations used to estimate the SVJD model parameters almost instantaneously.
    Keywords: Stochastic volatility, price jumps, SVJD, neural networks, deep learning, CNN
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:fau:wpaper:wp2023_36&r=big
  4. By: Zeteng Lin
    Abstract: With the rapid development of artificial intelligence and breakthroughs in machine learning and natural language processing, intelligent question-answering robots have become widely used in government affairs. This paper conducts a horizontal comparison between Guangdong Province's government chatbots and two large language models, ChatGPT and Wenxin Ernie, to analyze the strengths and weaknesses of existing government chatbots and AIGC technology. The study finds significant differences between government chatbots and large language models: China's government chatbots are still at an exploratory stage and have a gap to close before they can be called "intelligent." To chart the future direction of government chatbots, this research proposes targeted optimization paths to help generative AI be applied effectively in government chatbot conversations.
    Date: 2023–11
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.02181&r=big
  5. By: Franck Jaotombo (EM - emlyon business school); Vanessa Pauly; Guillaume Fond; Veronica Orleans; Pascal Auquier; Badih Ghattas; Laurent Boyer
    Abstract: Introduction: Prolonged hospital length of stay (PLOS) is an indicator of deteriorated efficiency in quality of care. One goal of public health management is to reduce PLOS by identifying its most relevant predictors. The objective of this study is to explore machine learning (ML) models that best predict PLOS. Methods: Our dataset was collected from the French medico-administrative database (PMSI) as a retrospective cohort study of all discharges in the year 2015 from a large university hospital in France (APHM). The study outcome was LOS transformed into a binary variable (long vs. short LOS) according to the 90th percentile (14 days). Logistic regression (LR), classification and regression trees (CART), random forest (RF), gradient boosting (GB) and neural networks (NN) were applied to the collected data. The predictive performance of the models was evaluated using the area under the ROC curve (AUC). Results: Our analysis included 73,182 hospitalizations, of which 7,341 (10.0%) led to PLOS. The GB classifier was the best-performing model, with the highest AUC (0.810), superior to all the other models (all comparisons statistically significant).
    Keywords: Machine learning, neural network, prediction, health services research, public health
    Date: 2023–01–01
    URL: http://d.repec.org/n?u=RePEc:hal:journl:hal-04325691&r=big
  6. By: Radhika Prosad Datta
    Abstract: This paper explores the application of Sample Entropy (SampEn) as a sophisticated tool for quantifying and predicting volatility in international oil price returns. SampEn, known for its ability to capture underlying patterns and predict periods of heightened volatility, is compared with traditional measures like standard deviation. The study utilizes a comprehensive dataset spanning 27 years (1986-2023) and employs both time series regression and machine learning methods. Results indicate SampEn's efficacy in predicting traditional volatility measures, with machine learning algorithms outperforming standard regression techniques during financial crises. The findings underscore SampEn's potential as a valuable tool for risk assessment and decision-making in the realm of oil price investments.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.12788&r=big
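A toy illustration of the core quantity in the paper above: Sample Entropy counts how often short templates of a series repeat within a tolerance, and how often those matches persist one step longer. The sketch below is a straightforward, unoptimized numpy implementation with rule-of-thumb parameters, not the authors' code.

```python
import numpy as np

def sample_entropy(x, m=2, r=None):
    """Sample Entropy SampEn(m, r): -log of the ratio between the number
    of template pairs matching at length m+1 and at length m, where a
    match means Chebyshev distance <= r."""
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * x.std()                  # common rule-of-thumb tolerance
    n = len(x)

    def count_matches(mm):
        # Pairs of length-mm templates within tolerance r; both lengths
        # use the same n - m template positions (standard definition).
        templates = [x[i:i + mm] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                if np.max(np.abs(templates[i] - templates[j])) <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    return -np.log(a / b) if a > 0 and b > 0 else np.inf

rng = np.random.default_rng(0)
regular = np.sin(np.linspace(0, 20 * np.pi, 300))   # highly predictable
noisy = rng.standard_normal(300)                    # irregular
# An irregular series scores higher entropy than a regular one.
```

The regular sine series yields a much lower SampEn than white noise, which is the property the paper exploits for volatility prediction.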
  7. By: Alex Kim; Sangwon Yoon
    Abstract: This study applies BERT, a representative contextualized language model, to corporate disclosure data to predict impending bankruptcies. Prior literature on bankruptcy prediction focuses mainly on developing more sophisticated prediction methodologies from financial variables; our study instead focuses on improving the quality of the input dataset. Specifically, we employ a BERT model to perform sentiment analysis on MD&A disclosures. We show that BERT outperforms dictionary-based and Word2Vec-based predictions in terms of adjusted R-squared in logistic regression, k-nearest neighbor (kNN-5), and linear-kernel support vector machine (SVM) models. Further, instead of pre-training the BERT model from scratch, we apply self-learning with confidence-based filtering to corporate disclosure data (10-K filings). We achieve an accuracy rate of 91.56% and demonstrate that the domain adaptation procedure brings a significant improvement in prediction accuracy.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.03194&r=big
  8. By: Augusto Cerqua; Marco Letta; Fiammetta Menchetti
    Abstract: Without a credible control group, the most widespread methodologies for estimating causal effects cannot be applied. To fill this gap, we propose the Machine Learning Control Method (MLCM), a new approach for causal panel analysis based on counterfactual forecasting with machine learning. The MLCM estimates policy-relevant causal parameters in short- and long-panel settings without relying on untreated units. We formalize identification in the potential outcomes framework and then provide estimation based on supervised machine learning algorithms. To illustrate the advantages of our estimator, we present simulation evidence and an empirical application on the impact of the COVID-19 crisis on educational inequality in Italy. We implement the proposed method in the companion R package MachineControl.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.05858&r=big
  9. By: Álvaro Cartea; Gerardo Duran-Martin; Leandro Sánchez-Betancourt
    Abstract: This paper develops a framework to predict toxic trades that a broker receives from her clients. Toxic trades are predicted with a novel online Bayesian method which we call the projection-based unification of last-layer and subspace estimation (PULSE). PULSE is a fast and statistically-efficient online procedure to train a Bayesian neural network sequentially. We employ a proprietary dataset of foreign exchange transactions to test our methodology. PULSE outperforms standard machine learning and statistical methods when predicting if a trade will be toxic; the benchmark methods are logistic regression, random forests, and a recursively-updated maximum-likelihood estimator. We devise a strategy for the broker who uses toxicity predictions to internalise or to externalise each trade received from her clients. Our methodology can be implemented in real-time because it takes less than one millisecond to update parameters and make a prediction. Compared with the benchmarks, PULSE attains the highest PnL and the largest avoided loss for the horizons we consider.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.05827&r=big
  10. By: Yuan Liao; Xinjie Ma; Andreas Neuhierl; Zhentao Shi
    Abstract: This paper addresses a key question in economic forecasting: does pure noise truly lack predictive power? Economists typically conduct variable selection to eliminate noise from predictors. Yet we prove a compelling result: in most economic forecasts, including noise among the predictors yields greater benefits than excluding it. Furthermore, if the total number of predictors is not sufficiently large, intentionally adding more noise yields superior forecast performance, outperforming benchmark predictors that rely on dimension reduction. The intuition is that economic predictive signals are densely distributed among the regression coefficients, maintaining modest forecast bias while diversifying away overall variance, even when a significant proportion of predictors is pure noise. One of our empirical demonstrations shows that intentionally adding 300 to 6,000 pure-noise predictors to the Welch and Goyal (2008) dataset achieves a noteworthy 10% out-of-sample R-squared in forecasting the annual U.S. equity premium, surpassing the majority of sophisticated machine learning models.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.05593&r=big
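The mechanics of the noise-augmentation idea can be sketched in a few lines of numpy. This is a hypothetical toy setup (simulated data, ridge shrinkage as the dense estimator), not the paper's estimator or dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for a forecasting dataset: T periods, a few
# genuine (but noisy) predictors of y.
T, k_real, k_noise = 200, 5, 300
X_real = rng.standard_normal((T, k_real))
beta = rng.standard_normal(k_real) / np.sqrt(k_real)
y = X_real @ beta + rng.standard_normal(T)

# The paper's device: deliberately append pure-noise predictor columns.
X_aug = np.hstack([X_real, rng.standard_normal((T, k_noise))])

def ridge_fit(X, y, lam=10.0):
    """Ridge estimator; with dense signals, heavy shrinkage lets the
    extra noise columns diversify variance instead of destroying fit."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

split = 150                                   # time-ordered split, no look-ahead
b_aug = ridge_fit(X_aug[:split], y[:split])
pred = X_aug[split:] @ b_aug
oos_mse = np.mean((y[split:] - pred) ** 2)    # out-of-sample error
```

With many more columns than observations, the shrinkage parameter does the work; the paper's theoretical results characterize when this augmented forecast beats dimension-reduction benchmarks.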
  11. By: Paul Clarke; Annalivia Polselli
    Abstract: Machine Learning (ML) algorithms are powerful data-driven tools for approximating high-dimensional or non-linear nuisance functions which are useful in practice because the true functional form of the predictors is ex-ante unknown. In this paper, we develop estimators of policy interventions from panel data which allow for non-linear effects of the confounding regressors, and investigate the performance of these estimators using three well-known ML algorithms, specifically, LASSO, classification and regression trees, and random forests. We use Double Machine Learning (DML) (Chernozhukov et al., 2018) for the estimation of causal effects of homogeneous treatments with unobserved individual heterogeneity (fixed effects) and no unobserved confounding by extending Robinson (1988)'s partially linear regression model. We develop three alternative approaches for handling unobserved individual heterogeneity based on extending the within-group estimator, first-difference estimator, and correlated random effect estimator (Mundlak, 1978) for non-linear models. Using Monte Carlo simulations, we find that conventional least squares estimators can perform well even if the data generating process is non-linear, but there are substantial performance gains in terms of bias reduction under a process where the true effect of the regressors is non-linear and discontinuous. However, for the same scenarios, we also find -- despite extensive hyperparameter tuning -- inference to be problematic for both tree-based learners because these lead to highly non-normal estimator distributions and the estimator variance being severely under-estimated. This contradicts the performance of trees in other circumstances and requires further investigation. Finally, we provide an illustrative example of DML for observational panel data showing the impact of the introduction of the national minimum wage in the UK.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.08174&r=big
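A minimal numpy sketch of the partialling-out logic in a fixed-effects setting: within-group demeaning removes the unit effects, cross-fitted nuisance fits remove the confounder's influence, and a Robinson-style residual-on-residual regression recovers the treatment effect. A cubic polynomial fit stands in for the ML learners used in the paper, and the data are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated panel: N units, T periods, homogeneous treatment effect
# theta = 2, unit fixed effects alpha, one confounder x.
N, T, theta = 500, 6, 2.0
alpha = rng.standard_normal(N).repeat(T)
ids = np.arange(N).repeat(T)
x = rng.standard_normal(N * T)
d = 0.5 * x + alpha + rng.standard_normal(N * T)        # treatment
y = theta * d + x + alpha + rng.standard_normal(N * T)  # outcome

def demean_within(v, ids):
    """Within-group (fixed-effects) transformation: subtract unit means."""
    means = np.bincount(ids, weights=v) / np.bincount(ids)
    return v - means[ids]

yd, dd, xd = (demean_within(v, ids) for v in (y, d, x))

def fit_predict(x_tr, v_tr, x_te):
    # Cubic polynomial fit standing in for the paper's ML learners.
    return np.polyval(np.polyfit(x_tr, v_tr, deg=3), x_te)

# Two-fold cross-fitting of both nuisance functions.
folds = rng.permutation(N * T).reshape(2, -1)
res_y, res_d = np.empty(N * T), np.empty(N * T)
for k in range(2):
    te, tr = folds[k], folds[1 - k]
    res_y[te] = yd[te] - fit_predict(xd[tr], yd[tr], xd[te])
    res_d[te] = dd[te] - fit_predict(xd[tr], dd[tr], xd[te])

# Robinson / DML residual-on-residual estimate of the treatment effect.
theta_hat = (res_d @ res_y) / (res_d @ res_d)
```

On this linear simulation the estimate lands close to the true theta = 2; the paper's contribution is the treatment of fixed effects (within-group, first-difference, and correlated-random-effects variants) and the inference issues that arise with tree-based learners.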
  12. By: Florian Englmaier (LMU Munich); Michael Hofmann (LMU München); Stefanie Wolter (IAB Nürnberg)
    Abstract: We study how firms adjust the bundles of management practices they adopt over time, using repeated survey data collected in Germany from 2012 to 2018. Employing unsupervised machine learning, we leverage high-dimensional data on human resource policies to describe clusters of management practices (management styles). Our results suggest that two management styles exist: one employs many, highly structured practices, while the other lacks these practices but retains training measures. We document sizeable differences in styles across German firms, which can only partially be explained by firm characteristics. Further, we show that management style is highly persistent over time, in part because newly adopted practices are discontinued after a short time. We suggest miscalculated cost-benefit trade-offs and a non-fitting corporate culture as potential hindrances to adopting structured management. In light of previous findings that structured management increases firm performance, our findings have important policy implications: they show that firms managed in an unstructured way fail to catch up and will continue to underperform.
    Keywords: management practices; personnel management; panel data analysis; machine learning;
    JEL: M12 D22 C38
    Date: 2023–12–14
    URL: http://d.repec.org/n?u=RePEc:rco:dpaper:481&r=big
  13. By: Clotilde Napp (DRM - Dauphine Recherches en Management - Université Paris Dauphine-PSL - PSL - Université Paris sciences et lettres - CNRS - Centre National de la Recherche Scientifique)
    Abstract: Gender stereotypes contribute to gender imbalances, and analyzing their variation across countries is important for understanding and mitigating gender inequalities. However, measuring stereotypes is difficult, particularly in a cross-cultural context. Word embeddings are a recent and useful natural language processing tool that makes it possible to measure the collective gender stereotypes embedded in a society's language. In this work, we used word embedding models pre-trained on large text corpora from more than 70 countries to examine how gender stereotypes vary across countries. We considered stereotypes associating men with career and women with family, as well as those associating men with math or science and women with arts or liberal arts. Relying on two different sources (Wikipedia and Common Crawl), we found that these gender stereotypes are all significantly more pronounced in the text corpora of more economically developed and more individualistic countries. Our analysis suggests that more economically developed countries, while more gender-equal along several dimensions, also have stronger gender stereotypes; public policy aiming to mitigate gender imbalances in these countries should take this feature into account. Our analysis also sheds light on the "gender equality paradox," i.e., the fact that gender imbalances in a large number of domains are paradoxically stronger in more developed, gender-equal, and individualistic countries.
    Keywords: gender stereotypes, gender equality, cross-cultural variations, gender equality paradox, word embeddings
    Date: 2023
    URL: http://d.repec.org/n?u=RePEc:hal:journl:hal-04316389&r=big
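The embedding-based stereotype measurement can be illustrated with a WEAT-style association score: a word's mean cosine similarity to male-gendered vectors minus its mean similarity to female-gendered ones. The vectors below are hand-made toys, and the exact statistic used in the paper may differ:

```python
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association_gap(word, male_vecs, female_vecs):
    """Mean cosine similarity to male-word vectors minus mean similarity
    to female-word vectors; positive values mean the word leans 'male'
    in the embedding space."""
    return (np.mean([cos(word, m) for m in male_vecs])
            - np.mean([cos(word, f) for f in female_vecs]))

# Hand-made 3-d 'embeddings': career sits near the male axis, family
# near the female axis, mimicking the stereotype the paper measures.
he, him = np.array([1.0, 0.2, 0.0]), np.array([0.9, 0.1, 0.1])
she, her = np.array([0.0, 1.0, 0.1]), np.array([0.1, 0.9, 0.0])
career = np.array([0.8, 0.1, 0.3])
family = np.array([0.1, 0.8, 0.2])

gap_career = association_gap(career, [he, him], [she, her])  # > 0
gap_family = association_gap(family, [he, him], [she, her])  # < 0
```

Applied to embeddings trained on a country's corpus, the size of these gaps is a corpus-level stereotype measure that can then be compared across countries.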
  14. By: Marco Gortan; Lorenzo Testa; Giorgio Fagiolo; Francesco Lamperti
    Abstract: Although high-resolution gridded climate variables are provided by multiple sources, country- and region-specific climate data weighted by indicators of economic activity are increasingly needed in environmental and economic research. We process available information from different climate data sources to provide spatially aggregated data with global coverage for countries (GADM0 resolution) and regions (GADM1 resolution) and for a variety of climate indicators (average precipitation, average temperature, average SPEI). We weight gridded climate data by population density or by night-light intensity -- both proxies of economic activity -- before aggregation. Climate variables are measured daily, monthly, and annually, covering (depending on the data source) a time window from as early as 1900 to 2023. We pipeline all the preprocessing procedures in a unified framework, which we share in the open-access Weighted Climate Data Repository web app. Finally, we validate our data through a systematic comparison with those employed in leading climate impact studies.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.05971&r=big
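The core aggregation step, averaging a gridded climate field over a country with economic-activity weights, reduces to a masked weighted mean. A toy numpy sketch with a hypothetical 2x2 grid and population density as the weight:

```python
import numpy as np

def weighted_country_mean(field_grid, weight_grid, country_mask):
    """Aggregate a gridded climate field to one country value, weighting
    cells by a proxy of economic activity (population density or night
    lights), as in the repository's preprocessing."""
    w = np.where(country_mask, weight_grid, 0.0)
    return float((field_grid * w).sum() / w.sum())

# Toy 2x2 grid: one hot, densely populated cell dominates the average.
temp = np.array([[30.0, 10.0],
                 [10.0, 10.0]])
pop = np.array([[9.0, 1.0],
                [1.0, 1.0]])
mask = np.ones((2, 2), dtype=bool)

t_weighted = weighted_country_mean(temp, pop, mask)   # pulled toward 30
t_unweighted = temp.mean()
```

The weighted value (25.0) sits well above the plain grid average (15.0), which is exactly why activity-weighted aggregates matter for economic impact studies.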
  15. By: Yijun Li; Cheuk Hang Leung; Xiangqian Sun; Chaoqun Wang; Yiyan Huang; Xing Yan; Qi Wu; Dongdong Wang; Zhixiang Huang
    Abstract: Consumer credit services offered by e-commerce platforms provide customers with convenient loan access during shopping and have the potential to stimulate sales. To understand the causal impact of credit lines on spending, previous studies have employed causal estimators, based on direct regression (DR), inverse propensity weighting (IPW), and double machine learning (DML) to estimate the treatment effect. However, these estimators do not consider the notion that an individual's spending can be understood and represented as a distribution, which captures the range and pattern of amounts spent across different orders. By disregarding the outcome as a distribution, valuable insights embedded within the outcome distribution might be overlooked. This paper develops a distribution-valued estimator framework that extends existing real-valued DR-, IPW-, and DML-based estimators to distribution-valued estimators within Rubin's causal framework. We establish their consistency and apply them to a real dataset from a large e-commerce platform. Our findings reveal that credit lines positively influence spending across all quantiles; however, as credit lines increase, consumers allocate more to luxuries (higher quantiles) than necessities (lower quantiles).
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.10388&r=big
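The distributional view of the outcome can be illustrated by comparing quantiles of spending under two credit-line levels. The data below are simulated to mimic the qualitative pattern the paper reports (positive effects at every quantile, largest at the top), not drawn from the paper:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated spending distributions under a low vs. high credit line
# (lognormal toys; the high line shifts spending up and fattens the
# right tail, i.e. relatively more 'luxury' spending).
low_line = rng.lognormal(mean=3.0, sigma=0.5, size=20_000)
high_line = rng.lognormal(mean=3.1, sigma=0.55, size=20_000)

qs = [10, 50, 90]
effect_by_quantile = np.percentile(high_line, qs) - np.percentile(low_line, qs)
# Positive at every quantile, and largest at the 90th.
```

Comparing whole quantile profiles rather than a single mean is the intuition behind the paper's distribution-valued estimators, which additionally adjust for confounding within Rubin's framework.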
  16. By: David Hason Rudd; Huan Huo; Guandong Xu
    Abstract: Financial literacy (FL) represents a person's ability to turn assets into income, and understanding digital currencies has been added to the modern definition. FL can be predicted by exploiting unlabelled recorded data in financial networks via semi-supervised learning (SSL). Measuring and predicting FL has not been widely studied, resulting in limited understanding of the consequences of customer financial engagement. Previous studies have shown that low FL increases the risk of social harm, so it is important to estimate FL accurately in order to allocate intervention programs to less financially literate groups; this not only increases company profitability but also reduces government spending. Some studies have framed FL prediction as a classification task, whereas others have developed FL definitions and impacts. This paper investigates mechanisms for learning customers' FL level from their financial data using the synthetic minority over-sampling technique for regression with Gaussian noise (SMOGN). We propose the SMOGN-COREG model for semi-supervised regression, applying SMOGN to deal with unbalanced datasets and a nonparametric multi-learner co-regression (COREG) algorithm for labelling. We compared SMOGN-COREG with six well-known regressors on five datasets to evaluate the proposed model's effectiveness on unbalanced and unlabelled financial data. Experimental results confirm that the proposed method outperforms the comparator models on such data; SMOGN-COREG is therefore a step towards using unlabelled data to estimate FL level.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.10984&r=big
  17. By: Tobias Schimanski; Chiara Colesanti Senni; Glen Gostlow; Jingwei Ni; Tingyu Yu; Markus Leippold
    Abstract: Nature is an amorphous concept. Yet it is essential for the planet's well-being to understand how the economy interacts with it. To address the growing demand for information on corporate nature disclosure, we provide datasets and classifiers to detect nature communication by companies. We ground our approach in the guidelines of the Taskforce on Nature-related Financial Disclosures (TNFD). In particular, we focus on the specific dimensions of water, forest, and biodiversity. For each dimension, we create an expert-annotated dataset with 2,200 text samples and train classifier models. Furthermore, we show that nature communication is more prevalent in hotspot areas and in directly affected industries like agriculture and utilities. Our approach is the first to respond to calls to assess corporate nature communication at large scale.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.17337&r=big
  18. By: Gregory Faletto
    Abstract: To address the bias of the canonical two-way fixed effects estimator for difference-in-differences under staggered adoptions, Wooldridge (2021) proposed the extended two-way fixed effects estimator, which adds many parameters. However, this reduces efficiency. Restricting some of these parameters to be equal helps, but ad hoc restrictions may reintroduce bias. We propose a machine learning estimator with a single tuning parameter, fused extended two-way fixed effects (FETWFE), that enables automatic data-driven selection of these restrictions. We prove that under an appropriate sparsity assumption FETWFE identifies the correct restrictions with probability tending to one. We also prove the consistency, asymptotic normality, and oracle efficiency of FETWFE for two classes of heterogeneous marginal treatment effect estimators under either conditional or marginal parallel trends, and we prove consistency for two classes of conditional average treatment effects under conditional parallel trends. We demonstrate FETWFE in simulation studies and an empirical application.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.05985&r=big
  19. By: Hugo Thimonier; Fabrice Popineau; Arpad Rimmel; Bich-Liên Doan; Fabrice Daniel
    Abstract: This study explores the application of anomaly detection (AD) methods to imbalanced learning tasks, focusing on fraud detection using real online credit card payment data. We assess the performance of several recent AD methods and compare their effectiveness against standard supervised learning methods. Offering evidence of distribution shift within our dataset, we analyze its impact on the tested models' performance. Our findings reveal that LightGBM exhibits significantly superior performance across all evaluated metrics but suffers more from distribution shift than the AD methods. Furthermore, LightGBM also captures the majority of the frauds detected by the AD methods, which challenges the potential benefit of ensembling supervised and AD approaches to enhance performance. In summary, this research provides practical insights into the utility of these techniques in real-world scenarios, showing LightGBM's superiority in fraud detection while highlighting challenges related to distribution shift.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.13896&r=big
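For contrast with the supervised LightGBM baseline, a classical unsupervised anomaly detector can be sketched in numpy: score each transaction by its Mahalanobis distance from the bulk of the data and flag the highest scores. Toy simulated data, and far simpler than the AD methods the paper evaluates:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 'transactions': legitimate payments form one correlated cluster,
# a handful of frauds sit far away from the bulk.
legit = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=1000)
fraud = rng.multivariate_normal([8, 8], [[1, 0], [0, 1]], size=10)
X = np.vstack([legit, fraud])           # frauds are rows 1000..1009

def mahalanobis_scores(X):
    """Unsupervised anomaly score: squared Mahalanobis distance of each
    row from the data's own mean under its estimated covariance."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = X - mu
    return np.einsum('ij,jk,ik->i', d, cov_inv, d)

scores = mahalanobis_scores(X)
flagged = np.argsort(scores)[-10:]      # ten most anomalous rows
```

No labels are used, which is the appeal of AD methods under distribution shift; the paper's finding is that a supervised learner still dominates when labels are available.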
  20. By: Eugenia Gonzalez Ehlinger; Fabian Stephany
    Abstract: For emerging professions, such as jobs in the field of Artificial Intelligence (AI) or sustainability (green), labour supply does not meet industry demand. In this scenario of labour shortages, our work aims to understand whether employers have started focusing on individual skills rather than on formal qualifications in their recruiting. By analysing a large time series dataset of around one million online job vacancies between 2019 and 2022 from the UK and drawing on diverse literature on technological change and labour market signalling, we provide evidence that employers have started so-called "skill-based hiring" for AI and green roles, as more flexible hiring practices allow them to increase the available talent pool. In our observation period the demand for AI roles grew twice as much as average labour demand. At the same time, the mention of university education for AI roles declined by 23%, while AI roles advertise five times as many skills as job postings on average. Our regression analysis also shows that university degrees no longer show an educational premium for AI roles, while for green positions the educational premium persists. In contrast, AI skills have a wage premium of 16%, similar to having a PhD (17%). Our work recommends making use of alternative skill building formats such as apprenticeships, on-the-job training, MOOCs, vocational education and training, micro-certificates, and online bootcamps to use human capital to its full potential and to tackle talent shortages.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.11942&r=big
  21. By: Zhongyang Guo; Guanran Jiang; Zhongdan Zhang; Peng Li; Zhefeng Wang; Yinchun Wang
    Abstract: This paper introduces "Shai," a 10B-level large language model specifically designed for the asset management industry and built upon an open-source foundation model. With continuous pre-training and fine-tuning on a targeted corpus, Shai demonstrates enhanced performance on tasks relevant to its domain, outperforming baseline models. Our research includes the development of an innovative evaluation framework that integrates professional qualification exams, tailored tasks, open-ended question answering, and safety assessments to comprehensively assess Shai's capabilities. Furthermore, we discuss the challenges and implications of using large language models like GPT-4 for performance assessment in asset management, suggesting a combination of automated evaluation and human judgment. By showcasing the potential and versatility of 10B-level large language models in the financial sector, with significant performance and modest computational requirements, Shai's development aims to provide practical insights and methodologies to assist industry peers in similar endeavors.
    Date: 2023–12
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2312.14203&r=big

This nep-big issue is ©2024 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.