|
on Big Data |
By: | Rameshwar Garg; Shriya Barpanda; Girish Rao Salanke N S; Ramya S |
Abstract: | Time series data is being used everywhere, from sales records to patients' health evolution metrics. The ability to deal with this data has become a necessity, and time series analysis and forecasting are used for the same. Every Machine Learning enthusiast would consider these as very important tools, as they deepen the understanding of the characteristics of data. Forecasting is used to predict the value of a variable in the future, based on its past occurrences. A detailed survey of the various methods that are used for forecasting has been presented in this paper. The complete process of forecasting, from preprocessing to validation has also been explained thoroughly. Various statistical and deep learning models have been considered, notably, ARIMA, Prophet and LSTMs. Hybrid versions of Machine Learning models have also been explored and elucidated. Our work can be used by anyone to develop a good understanding of the forecasting process, and to identify various state of the art models which are being used today. |
Date: | 2022–11 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2211.14387&r=big |
By: | Li, Ziqi |
Abstract: | Ridesharing, compared to traditional solo ride-hailing, can reduce traffic congestion, cut per-passenger carbon emissions, reduce parking infrastructure, and provide a more cost-effective way to travel. Despite these benefits, ridesharing only occupies a small percentage of the total ride-hailing trips. This study provides a reproducible and replicable framework that integrates big trip data, machine learning models, and explainable artificial intelligence (XAI) to better understand the factors that influence people's decisions to take or not to take a shared ride. |
Date: | 2022–04–01 |
URL: | http://d.repec.org/n?u=RePEc:osf:osfxxx:chy4p&r=big |
By: | Meisenbacher, Stephen; Norlander, Peter |
Abstract: | Popular approaches to building data from unstructured text come with limitations, such as scalability, interpretability, replicability, and real-world applicability. These can be overcome with Context Rule Assisted Machine Learning (CRAML), a method and no-code suite of software tools that builds structured, labeled datasets which are accurate and reproducible. CRAML enables domain experts to access uncommon constructs within a document corpus in a low-resource, transparent, and flexible manner. CRAML produces document-level datasets for quantitative research and makes qualitative classification schemes scalable over large volumes of text. We demonstrate that the method is useful for bibliographic analysis, transparent analysis of proprietary data, and expert classification of any documents with any scheme. To demonstrate this process for building data from text with Machine Learning, we publish open-source resources: the software, a new public document corpus, and a replicable analysis to build an interpretable classifier of suspected "no poach" clauses in franchise documents. |
Keywords: | machine learning, natural language processing, text classification, big data |
JEL: | B41 C38 C81 C88 J08 J41 J42 J47 J53 Z13 |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:zbw:glodps:1214&r=big |
By: | Kühl, Niklas; Schemmer, Max; Goutier, Marc; Satzger, Gerhard |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:dar:wpaper:135656&r=big |
By: | Christian B. Hansen (University of Chicago); Mark E. Schaffer (Heriot-Watt University); Thomas Wiemann (University of Chicago); Achim Ahrens (ETH Zürich) |
Abstract: | We introduce the Stata package ddml, which implements double/debiased machine learning (DDML) for causal inference aided by supervised machine learning. Five different models are supported, allowing for multiple treatment variables in the presence of high-dimensional controls and instrumental variables. ddml is compatible with many existing supervised machine learning programs in Stata. |
Date: | 2022–11–30 |
URL: | http://d.repec.org/n?u=RePEc:boc:csug22:02&r=big |
By: | Kakuho Furukawa (Bank of Japan); Ryohei Hisano (The University of Tokyo) |
Abstract: | We nowcast Japan's exports using maritime big data (the Automatic Identification System data), which contains information on vessels such as their locations. The data has rarely been used for capturing economic trends. In doing so, we improve the accuracy of nowcasts by utilizing official statistics such as geographical data on ports and machine learning techniques. The analysis shows that the nowcasting model augmented with AIS data improves the performance of nowcasting relative to existing statistics (provisional reports on the Trade Statistics of Japan) that are available in close to real time. In particular, the nowcasting model developed in this paper follows the movements of exports reasonably well even when they increase or decrease significantly (e.g., when the pandemic began in the spring of 2020 and when the supply chain was disrupted around mid-2021). |
Keywords: | Nowcasting; Alternative data; AIS data; Exports |
JEL: | C49 C55 E27 |
Date: | 2022–12–23 |
URL: | http://d.repec.org/n?u=RePEc:boj:bojwps:wp22e19&r=big |
By: | Zhenkun Zhou; Zikun Song; Tao Ren |
Abstract: | Scanner big data has the potential to be used to construct the Consumer Price Index (CPI). This work utilizes scanner data on supermarket retail sales, provided by the China Ant Business Alliance (CAA), to construct the Scanner-data Food Consumer Price Index (S-FCPI) in China; the reliability of the index is verified against other macro indicators, especially China's CPI. In addition, we build multiple machine learning models based on S-FCPI to quantitatively predict the CPI growth rate in months, and to qualitatively predict its direction and level. The prediction models achieve much better performance than the traditional time series models in existing research. This work paves the way to construct and predict price indexes using scanner big data in China. S-FCPI can not only reflect changes in goods prices at a higher frequency and over a wider geographic dimension than CPI, but also provide a new perspective for monitoring macroeconomic operation, predicting inflation and understanding other economic issues, making it a beneficial supplement to China's CPI. |
Date: | 2022–11 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2211.16641&r=big |
By: | Tamay Besiroglu; Nicholas Emery-Xu; Neil Thompson |
Abstract: | Since its emergence around 2010, deep learning has rapidly become the most important technique in Artificial Intelligence (AI), producing an array of scientific firsts in areas as diverse as protein folding, drug discovery, integrated chip design, and weather prediction. As more scientists and engineers adopt deep learning, it is important to consider what effect widespread deployment would have on scientific progress and, ultimately, economic growth. We assess this impact by estimating the idea production function for AI in two computer vision tasks that are considered key test-beds for deep learning and show that AI idea production is notably more capital-intensive than traditional R&D. Because increasing the capital-intensity of R&D accelerates the investments that make scientists and engineers more productive, our work suggests that AI-augmented R&D has the potential to speed up technological change and economic growth. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.08198&r=big |
By: | Zheng Cao; Raymond Guo; Wenyu Du; Jiayi Gao; Kirill V. Golubnichiy |
Abstract: | This paper introduced key aspects of applying Machine Learning (ML) models, improved trading strategies, and the Quasi-Reversibility Method (QRM) to optimize stock option forecasting and trading results. It presented the findings of the follow-up project of the research "Application of Convolutional Neural Networks with Quasi-Reversibility Method Results for Option Forecasting". First, the project included an application of Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks to provide a novel way of predicting stock option trends. Additionally, it examined the dependence of the ML models by evaluating the experimental method of combining multiple ML models to improve prediction results and decision-making. Lastly, two improved trading strategies and simulated investing results were presented. The Binomial Asset Pricing Model with discrete time stochastic process analysis and portfolio hedging was applied and suggested an optimized investment expectation. These results can be utilized in real-life trading strategies to optimize stock option investment results based on historical data. |
Date: | 2022–11 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2211.15912&r=big |
By: | Marc Wildi; Branka Hadji Misheva |
Abstract: | Artificial intelligence is creating one of the biggest revolutions across technology-driven application fields. For the finance sector, it offers many opportunities for significant market innovation, and yet broad adoption of AI systems heavily relies on our trust in their outputs. Trust in technology is enabled by understanding the rationale behind the predictions made. To this end, the concept of eXplainable AI emerged, introducing a suite of techniques attempting to explain to users how complex models arrived at a certain decision. For cross-sectional data, classical XAI approaches can lead to valuable insights about the models' inner workings, but these techniques generally cannot cope well with longitudinal data (time series) in the presence of dependence structure and non-stationarity. We here propose a novel XAI technique for deep learning methods which preserves and exploits the natural time ordering of the data. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.02906&r=big |
By: | Kim Ristolainen (Department of Economics, Turku School of Economics, University of Turku, Finland) |
Abstract: | Economic research has shown that debt markets have an information sensitivity property that allows these markets to work properly when price discovery is absent and opaqueness is maintained. Dang, Gorton and Holmström (2015) argue that sufficiently “bad news” can switch debt to become information sensitive and start a financial crisis. We identify narrative triggers in the news by utilizing machine learning methods and daily information about firm default probability, the public’s information acquisition and newspaper articles. We find state-specific generalizable triggers whose effect is determined by the language used by journalists. This language is associated with different psychological thinking processes. |
Keywords: | information sensitivity, debt markets, financial crisis, machine learning, news data, primordial thinking process |
JEL: | G01 G14 G41 |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:tkk:dpaper:dp156&r=big |
By: | Massimiliano Fessina; Giambattista Albora; Andrea Tacchella; Andrea Zaccaria |
Abstract: | Tree-based machine learning algorithms provide the most precise assessment of the feasibility for a country to export a target product given its export basket. However, the high number of parameters involved prevents a straightforward interpretation of the results and, in turn, the explainability of policy indications. In this paper, we propose a procedure to statistically validate the importance of the products used in the feasibility assessment. In this way, we are able to identify which products, called explainers, significantly increase the probability to export a target product in the near future. The explainers naturally identify a low dimensional representation, the Feature Importance Product Space, that enhances the interpretability of the recommendations and provides out-of-sample forecasts of the export baskets of countries. Interestingly, we detect a positive correlation between the complexity of a product and the complexity of its explainers. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.03094&r=big |
By: | Tauchert, Christoph |
Abstract: | Artificial intelligence (AI) is fundamentally changing our society and economy. Companies are investing a great deal of money and time into building corresponding competences and developing prototypes with the aim of integrating AI into their products and services, as well as enriching and improving their internal business processes. This inevitably brings corporate and private users into contact with a new technology that functions fundamentally differently than traditional software. The possibility of using machine learning to generate precise models based on large amounts of data capable of recognizing patterns within that data holds great economic and social potential—for example, in task augmentation and automation, medical diagnostics, and the development of pharmaceutical drugs. At the same time, companies and users are facing new challenges that accompany the introduction of this technology. Businesses are struggling to manage and generate value from big data, and employees fear increasing automation. To better prepare society for the growing market penetration of AI-based information systems into everyday life, a deeper understanding of this technology in terms of organizational and individual use is needed. Motivated by the many new challenges and questions for theory and practice that arise from AI-based information systems, this dissertation addresses various research questions with regard to the use of such information systems from both user and organizational perspectives. A total of five studies were conducted and published: two from the perspective of organizations and three among users. The results of these studies contribute to the current state of research and provide a basis for future studies. In addition, the gained insights enable recommendations to be derived for companies wishing to integrate AI into their products, services, or business processes. 
The first research article (Research Paper A) investigated which factors and prerequisites influence the success of the introduction and adoption of AI. Using the technology–organization–environment framework, various factors in the categories of technology, organization, and environment were identified and validated through the analysis of expert interviews with managers experienced in the field of AI. The results show that factors related to data (especially availability and quality) and the management of AI projects (especially project management and use cases) have been added to the framework, but regulatory factors have also emerged, such as the uncertainty caused by the General Data Protection Regulation. The focus of Research Paper B is companies’ motivation to host data science competitions on online platforms and which factors influence their success. Extant research has shown that employees with new skills are needed to carry out AI projects and that many companies have problems recruiting such employees. Therefore, data science competitions could support the implementation of AI projects via crowdsourcing. The results of the study (expert interviews among data scientists) show that these competitions offer many advantages, such as exchanges and discussions with experienced data scientists and the use of state-of-the-art approaches. However, only a small part of the effort related to AI projects can be represented within the framework of such competitions. The studies in the other three research papers (Research Papers C, D, and E) examine AI-based information systems from a user perspective, with two studies examining user behavior and one focusing on the design of an AI-based IT artifact. Research Paper C analyses perceptions of AI-based advisory systems in terms of the advantages associated with their use. 
The results of the empirical study show that the greatest perceived benefit is the convenience such systems provide, as they are easy to access at any time and can immediately satisfy informational needs. Furthermore, this study examined the effectiveness of 11 different measures to increase trust in AI-based advisory systems. This showed a clear ranking of measures, with effectiveness decreasing from non-binding testing to providing additional information regarding how the system works to adding anthropomorphic features. The goal of Research Paper D was to investigate actual user behavior when interacting with AI-based advisory systems. Based on the theoretical foundations of task–technology fit and judge–advisor systems, an online experiment was conducted. The results show that, above all, perceived expertise and the ability to make efficient decisions through AI-based advisory systems influence whether users assess these systems as suitable for supporting certain tasks. In addition, the study provides initial indications that users might be more willing to follow the advice of AI-based systems than that of human advisors. Finally, Research Paper E designs and implements an IT artifact that uses machine learning techniques to support structured literature reviews. Following the approach of design science research, an artifact was iteratively developed that can automatically download research articles from various databases and analyze and group them according to their content using the word2vec algorithm, the latent Dirichlet allocation model, and agglomerative hierarchical cluster analysis. An evaluation of the artifact on a dataset of 308 publications shows that it can be a helpful tool to support literature reviews but that much manual effort is still required, especially with regard to the identification of common concepts in extant literature. |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:dar:wpaper:135700&r=big |
By: | Altman, Edward I.; Balzano, Marco; Giannozzi, Alessandro; Srhoj, Stjepan |
Abstract: | SME default prediction is a long-standing issue in the finance and management literature. Proper estimates of the SME risk of failure can support policymakers in implementing restructuring policies, rating agencies and credit analytics firms in assessing creditworthiness, public and private investors in allocating funds, entrepreneurs in accessing funds, and managers in developing effective strategies. Drawing on the extant management literature, we argue that introducing management- and employee-related variables into SME prediction models can improve their predictive power. To test our hypotheses, we use a unique sample of SMEs and propose a novel and more accurate predictor of SME default, the Omega Score, developed by the Least Absolute Shrinkage and Selection Operator (LASSO). Results were further confirmed through other machine-learning techniques. Beyond traditional financial ratios and payment behavior variables, our findings show that the incorporation of change in management, employee turnover, and mean employee tenure significantly improves the model's predictive accuracy. |
Keywords: | Default prediction modeling, small and medium-sized enterprises, machine learning techniques, LASSO, logit regression |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:zbw:glodps:1207&r=big |
By: | Biagini Luigi; Severini Simone |
Abstract: | Identifying factors that affect participation is key to a successful insurance scheme. This study's challenges involve using many factors that could affect insurance participation to make a better forecast. Huge numbers of factors affect participation, making evaluation difficult. These interrelated factors can mask the influence on adhesion predictions, making them misleading. This study evaluated how 66 common characteristics affect insurance participation choices. We relied on individual farm data from FADN from 2016 to 2019 with type 1 (Fieldcrops) farming with 10,926 observations. We use three Machine Learning (ML) approaches (LASSO, Boosting, Random Forest) and compare them to the GLM model used in insurance modelling. ML methodologies can use a large set of information efficiently by performing variable selection. A highly accurate parsimonious model helps us understand the factors affecting insurance participation and design better products. ML predicts fairly well despite the complexity of the insurance participation problem. Our results suggest Boosting performs better than the other two ML tools using a smaller set of regressors. The proposed ML tools identify which variables explain participation choice. This information includes the number of cases in which single variables are selected and their relative importance in affecting participation. Focusing on the subset of information that best explains insurance participation could reduce the cost of designing insurance schemes. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.03092&r=big |
By: | Sara Salamat; Nima Tavassoli; Behnam Sabeti; Reza Fahmi |
Abstract: | Graph neural networks (GNNs) have been utilized for various natural language processing (NLP) tasks lately. The ability to encode corpus-wide features in graph representation made GNN models popular in various tasks such as document classification. One major shortcoming of such models is that they mainly work on homogeneous graphs, while representing text datasets as graphs requires several node types, which leads to a heterogeneous schema. In this paper, we propose a transductive hybrid approach composed of an unsupervised node representation learning model followed by a node classification/edge prediction model. The proposed model is capable of processing heterogeneous graphs to produce unified node embeddings which are then utilized for node classification or link prediction as the downstream task. The proposed model is developed to classify stock market technical analysis reports, which to our knowledge is the first work in this domain. Experiments, which are carried out using a constructed dataset, demonstrate the ability of the model in embedding extraction and the downstream tasks. |
Date: | 2022–11 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2211.16103&r=big |
By: | Edouard Ribes (CERNA i3 - Centre d'économie industrielle i3 - Mines Paris - PSL (École nationale supérieure des mines de Paris) - PSL - Université Paris sciences et lettres - CNRS - Centre National de la Recherche Scientifique) |
Abstract: | Background context. The retail side of the finance industry is currently undergoing a deep transformation associated with the rise of automation technologies. Wealth management services, which are traditionally associated with the retail distribution of financial investment products, are no stranger to this phenomenon. Specific knowledge gap the work aims to fill. The retail distribution of financial instruments is currently normalized for regulatory purposes yet remains costly. Documented examples of the use of automation technologies to improve its performance (outside of the classical example of robo-advisors) remain sparse. Methods used in the study. This work shows how machine learning techniques in the form of classification algorithms can be used to automate some activities (i.e. client expectations analysis) associated with one of the core steps behind the distribution of financial products, namely client discovery. Key findings. Once calibrated to a proprietary data-set owned by one of the leading French productivity tools providers specializing in the wealth management segment, standard classification algorithms (such as random forests or support vector machines) are able to accurately predict the majority of households' financial expectations (ROC either above 80% or 90%) when fed with standard wealth information available in most of the databases of financial product distributors. Implications. This study thus shows that classification tools could be easily embedded in the digital journey of distributors to improve access to financial expertise and accelerate the sales of financial products. |
Keywords: | Technological Change, Wealth Management, Brokerage, Machine learning, Classification |
Date: | 2022–12–07 |
URL: | http://d.repec.org/n?u=RePEc:hal:wpaper:hal-03887759&r=big |
By: | Pierre Brugière (CEREMADE - CEntre de REcherches en MAthématiques de la DEcision - Université Paris Dauphine-PSL - PSL - Université Paris sciences et lettres - CNRS - Centre National de la Recherche Scientifique); Gabriel Turinici (CEREMADE - CEntre de REcherches en MAthématiques de la DEcision - Université Paris Dauphine-PSL - PSL - Université Paris sciences et lettres - CNRS - Centre National de la Recherche Scientifique) |
Abstract: | We present in this paper a method to compute, using generative neural networks, an estimator of the "Value at Risk" for a financial asset. The method uses a Variational Auto Encoder with an 'energy' (a.k.a. Radon-Sobolev) kernel. The result behaves according to intuition and is in line with more classical methods. |
Date: | 2022–12–01 |
URL: | http://d.repec.org/n?u=RePEc:hal:wpaper:hal-03880381&r=big |
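Note on the preceding entry: the abstract compares its estimator with "more classical methods" without naming them. One widely used classical baseline is historical-simulation Value at Risk, which takes an empirical quantile of past losses. The sketch below illustrates that baseline only and is not the authors' generative method. The function name `historical_var`, the `alpha` parameter and the sample returns are illustrative assumptions.

```python
import math


def historical_var(returns, alpha=0.95):
    """Historical-simulation VaR (illustrative baseline, not the paper's method).

    Returns the smallest observed loss L such that at least a fraction
    `alpha` of the observed losses are less than or equal to L.
    Losses are defined as the negative of returns.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    losses = sorted(-r for r in returns)
    # Index of the first order statistic whose empirical CDF reaches alpha.
    k = math.ceil(alpha * len(losses)) - 1
    return losses[k]


# Hypothetical daily returns, for illustration only.
sample_returns = [0.02, -0.01, -0.04, 0.01]
var_75 = historical_var(sample_returns, alpha=0.75)
```

With so few observations the estimate is coarse. In practice the quantile would be taken over a long window of returns.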
By: | Edward I. Altman; Marco Balzano; Alessandro Giannozzi; Stjepan Srhoj |
Abstract: | SME default prediction is a long-standing issue in the finance and management literature. Proper estimates of the SME risk of failure can support policymakers in implementing restructuring policies, rating agencies and credit analytics firms in assessing creditworthiness, public and private investors in allocating funds, entrepreneurs in accessing funds, and managers in developing effective strategies. Drawing on the extant management literature, we argue that introducing management- and employee-related variables into SME prediction models can improve their predictive power. To test our hypotheses, we use a unique sample of SMEs and propose a novel and more accurate predictor of SME default, the Omega Score, developed by the Least Absolute Shrinkage and Selection Operator (LASSO). Results were further confirmed through other machine-learning techniques. Beyond traditional financial ratios and payment behavior variables, our findings show that the incorporation of change in management, employee turnover, and mean employee tenure significantly improves the model’s predictive accuracy. |
Keywords: | Default prediction modeling; small and medium-sized enterprises; machine learning techniques; LASSO; logit regression |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:inn:wpaper:2022-19&r=big |
By: | Jan Ditzen (Free University of Bozen-Bolzano, Italy); Francesco Ravazzolo (Free University of Bozen-Bolzano, Italy) |
Abstract: | For western economies a long-forgotten phenomenon is on the horizon: rising inflation rates. We propose a novel approach christened D^{2}ML to identify drivers of national inflation. D^{2}ML combines machine learning for model selection with time dependent data and graphical models to estimate the inverse of the covariance matrix, which is then used to identify dominant drivers. Using a dataset of 33 countries, we find that the US inflation rate and oil prices are dominant drivers of national inflation rates. For a more general framework, we carry out Monte Carlo simulations to show that our estimator correctly identifies dominant drivers. |
Keywords: | Time Series, Machine Learning, LASSO, High dimensional data, Dominant Units, Inflation. |
JEL: | C22 C23 C55 |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:bzn:wpaper:bemps97&r=big |
By: | Luigi Biagini |
Abstract: | This paper evaluates Machine Learning (ML) in establishing ratemaking for new insurance schemes. To make the evaluation feasible, we established expected indemnities as premiums. Then, we use ML to forecast indemnities using a minimum set of variables. The analysis simulates the introduction of an income insurance scheme, the so-called Income Stabilization Tool (IST), in Italy as a case study using farm-level data from the FADN from 2008-2018. We predicted the expected IST indemnities using three ML tools that perform variable selection (LASSO, Elastic Net, and Boosting), comparing them with the Generalized Linear Model (baseline) usually adopted in insurance investigations. Furthermore, a Tweedie distribution is implemented to account for the peculiar shape of the indemnities function, which is zero-inflated, non-negative, and asymmetric with a fat tail. The robustness of the results was evaluated by comparing the econometric and economic performance of the models. Specifically, ML obtained a better goodness-of-fit than the baseline, using a small and stable selection of regressors and significantly reducing the cost of gathering information. Boosting, however, obtained the best economic performance, optimally balancing the most and least risky subjects and achieving good economic sustainability. These findings suggest how machine learning can be successfully applied in agricultural insurance. This study represents one of the first to use ML and Tweedie distribution in agricultural insurance, demonstrating its potential to overcome multiple issues. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.03114&r=big |
By: | Laura Bonacorsi; Vittoria Cerasi; Paola Galfrascoli; Matteo Manera |
Abstract: | We study the relationship between the risk of default and Environmental, Social and Governance (ESG) factors using Machine Learning (ML) techniques on a cross-section of European listed companies. Our proxy for credit risk is the z-score originally proposed by Altman (1968). We consider an extensive number of ESG raw factors sourced from the rating provider MSCI as potential explanatory variables. In a first stage we show, using different SML methods such as LASSO and Random Forest, that a selection of ESG factors, in addition to the usual accounting ratios, helps explain a firm’s probability of default. In a second stage, we measure the impact of the selected variables on the risk of default. Our approach provides a novel perspective to understand which environmental, social responsibility and governance characteristics may reinforce the credit score of individual companies. |
Keywords: | credit risk, z-scores, ESG factors, Machine learning. |
JEL: | C5 D4 G3 |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:mib:wpaper:507&r=big |
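Note on the preceding entry: the abstract names the Altman (1968) z-score as its credit-risk proxy but does not restate it. The sketch below uses the commonly quoted form for listed manufacturing firms, with the conventional cut-offs of 1.81 and 2.99. The paper may use a different variant. The function names and the input figures are hypothetical, and some sources quote the last coefficient as 0.999 rather than 1.0.

```python
def altman_z(working_capital_ta, retained_earnings_ta, ebit_ta,
             market_equity_to_liabilities, sales_ta):
    """Commonly quoted Altman (1968) Z-score.

    The first three inputs and the last are ratios to total assets (TA);
    the fourth is market value of equity over book value of liabilities.
    """
    return (1.2 * working_capital_ta
            + 1.4 * retained_earnings_ta
            + 3.3 * ebit_ta
            + 0.6 * market_equity_to_liabilities
            + 1.0 * sales_ta)


def altman_zone(z):
    """Conventional interpretation bands for the original Z-score."""
    if z < 1.81:
        return "distress"
    if z > 2.99:
        return "safe"
    return "grey"


# Hypothetical firm, for illustration only.
z = altman_z(0.10, 0.20, 0.10, 1.00, 1.00)
zone = altman_zone(z)
```

A lower score indicates a higher risk of default, which is why the score can serve as a dependent variable when studying which ESG factors relate to credit risk.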
By: | Thi Thu Giang Nguyen (University of Warsaw, Faculty of Economic Sciences, Quantitative Finance Research Group); Robert Ślepaczuk (University of Warsaw, Faculty of Economic Sciences, Quantitative Finance Research Group, Department of Quantitative Finance) |
Abstract: | The study compares the use of various Long Short-Term Memory (LSTM) variants to conventional technical indicators for trading the S&P 500 index between 2011 and 2022. Two methods were used to test each strategy: a fixed training data set from 2001–2010 and a rolling train–test window. Due to the input sensitivity of LSTM models, we concentrated on data processing and hyperparameter tuning to find the best model. Instead of using the traditional MSE function, we used the Mean Absolute Directional Loss (MADL) function based on recent research to enhance model performance. The models were assessed using the Information Ratio and the Modified Information Ratio, which considers the maximum drawdown and the sign of the annualized return compounded (ARC). LSTM models' performance was compared to benchmark strategies using the SMA, MACD, RSI, and Buy&Hold strategies. We rejected the hypothesis that an algorithmic investment strategy using signals from an LSTM model with only daily returns in its input layer is more efficient. However, we could not reject the hypothesis that signals generated by an LSTM model combining daily returns and technical indicators in its input layer are more efficient. The LSTM Extended model that combined daily returns with MACD and RSI in the input layer generated a better result than Buy&Hold and other strategies using a single technical indicator. The results of the sensitivity analysis show how sensitive this model is to inputs like sequence length, batch size, technical indicators, and the length of the rolling train–test window. |
Keywords: | algorithmic investment strategies, machine learning, testing architecture, deep learning, recurrent neural networks, LSTM, technical indicators, forecasting financial-time series, hyperparameter tuning, S&P 500 Index |
JEL: | C15 C45 C52 C53 C58 C61 G14 G17 |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:war:wpaper:2022-29&r=big |
By: | Damian Kisiel; Denise Gorse |
Abstract: | Previous attempts to predict stock price from limit order book (LOB) data are mostly based on deep convolutional neural networks. Although convolutions offer efficiency by restricting their operations to local interactions, it is at the cost of potentially missing out on the detection of long-range dependencies. Recent studies address this problem by employing additional recurrent or attention layers that increase computational complexity. In this work, we propose Axial-LOB, a novel fully-attentional deep learning architecture for predicting price movements of stocks from LOB data. By utilizing gated position-sensitive axial attention layers our architecture is able to construct feature maps that incorporate global interactions, while significantly reducing the size of the parameter space. Unlike previous works, Axial-LOB does not rely on hand-crafted convolutional kernels and hence has stable performance under input permutations and the capacity to incorporate additional LOB features. The effectiveness of Axial-LOB is demonstrated on a large benchmark dataset, containing time series representations of millions of high-frequency trading events, where our model establishes a new state of the art, achieving an excellent directional classification performance at all tested prediction horizons. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.01807&r=big |
By: | Morteza Tahami Pour Zarandi; Mehdi Ghasemi Meymandi; Mohammad Hemami |
Abstract: | This paper proposes a new coherent model for a comprehensive study of the cotton price using econometrics and Long Short-Term Memory (LSTM) neural network methodologies. We start from a simple cotton price trend and then test conjectures using a structural method (ARMA), Markov switching dynamic regression, a simultaneous equation system, GARCH-family procedures, and Artificial Neural Networks to determine the characteristics of the cotton price trend over 1990-2020. It is established that in the structural method, the best procedure is AR(2) by Markov switching estimation. Based on the MS-AR procedure, we conclude that the tendency to change regime from a decreasing trend to an increasing one is more significant than the reverse. The simultaneous equation system investigates three procedures based on cotton acreage, value-added, and real cotton price. Finally, within the GARCH family, the TARCH procedure is the best-fitting prediction model, and in the LSTM neural network, the results show an accurate prediction by the training-testing method. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.01584&r=big |
By: | Laura Bonacorsi (Department of Social and Political Sciences, Bocconi University); Vittoria Cerasi (Italian Court of Auditors and CefES & O-Fire, University of Milano-Bicocca); Paola Galfrascoli (Department of Economics, Management and Statistics and CefES & O-Fire, University of Milano-Bicocca); Matteo Manera (Department of Economics, Management and Statistics, University of Milano-Bicocca and Fondazione Eni Enrico Mattei) |
Abstract: | We study the relationship between the risk of default and Environmental, Social and Governance (ESG) factors using Supervised Machine Learning (SML) techniques on a cross-section of European listed companies. Our proxy for credit risk is the z-score originally proposed by Altman (1968). We consider an extensive number of ESG raw factors sourced from the rating provider MSCI as potential explanatory variables. In a first stage we show, using different SML methods such as LASSO and Random Forest, that a selection of ESG factors, in addition to the usual accounting ratios, helps explain a firm’s probability of default. In a second stage, we measure the impact of the selected variables on the risk of default. Our approach provides a novel perspective to understand which environmental, social responsibility and governance characteristics may reinforce the credit score of individual companies. |
Keywords: | Credit risk, Z-scores, ESG factors, Machine learning |
JEL: | C5 D4 G3 |
Date: | 2022–11 |
URL: | http://d.repec.org/n?u=RePEc:fem:femwpa:2022.36&r=big |
By: | Ziwei Mei (The Chinese University of Hong Kong); Zhentao Shi (The Chinese University of Hong Kong); Peter C. B. Phillips (Cowles Foundation, Yale University) |
Abstract: | The global financial crisis and Covid recession have renewed discussion concerning trend-cycle discovery in macroeconomic data, and boosting has recently upgraded the popular HP filter to a modern machine learning device suited to data-rich and rapid computational environments. This paper sheds light on its versatility in trend-cycle determination, explaining in a simple manner both HP filter smoothing and the consistency delivered by boosting for general trend detection. Applied to a universe of time series in FRED databases, boosting outperforms other methods in timely capturing downturns at crises and recoveries that follow. With its wide applicability the boosted HP filter is a useful automated machine learning addition to the macroeconometric toolkit. |
Date: | 2022–09 |
URL: | http://d.repec.org/n?u=RePEc:cwl:cwldpp:2348&r=big |
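The boosting idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the standard matrix form of the HP filter, S = (I + λD'D)^(-1) with D the second-difference operator, and assumes that boosting means repeatedly re-applying the filter to the leftover cycle. The function names, the default λ = 1600, and the fixed iteration count are illustrative assumptions; the paper's own data-driven stopping rule is not reproduced here.

```python
import numpy as np

def hp_smoother(n, lam=1600.0):
    """Return the n x n HP smoother matrix S = (I + lam * D'D)^(-1)."""
    # D is the (n-2) x n second-difference operator.
    D = np.diff(np.eye(n), n=2, axis=0)
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

def boosted_hp(y, lam=1600.0, iterations=5):
    """Refit the HP filter on the residual a fixed number of times.

    iterations=1 is the ordinary HP filter. The fixed count is an
    assumption made for illustration only.
    """
    y = np.asarray(y, dtype=float)
    S = hp_smoother(y.size, lam)
    trend = np.zeros_like(y)
    for _ in range(iterations):
        # Smooth what is still unexplained and add it to the trend.
        trend = trend + S @ (y - trend)
    return trend, y - trend

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(80)
    y = 0.05 * t + np.sin(t / 6.0) + rng.normal(scale=0.2, size=t.size)
    for m in (1, 5):
        _, cycle = boosted_hp(y, iterations=m)
        print(m, float(np.sum(cycle ** 2)))
```

Each extra pass moves more of the series into the trend, so the residual cycle can only shrink; choosing when to stop is the substantive design decision, and it is left open in this sketch.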
By: | Julia Cagé (ECON - Département d'économie (Sciences Po) - Sciences Po - Sciences Po - CNRS - Centre National de la Recherche Scientifique, CEPR - Center for Economic Policy Research - CEPR); Nicolas Hervé (INA - Institut National de l'Audiovisuel); Béatrice Mazoyer (INA - Institut National de l'Audiovisuel, médialab - médialab (Sciences Po) - Sciences Po - Sciences Po) |
Abstract: | Social media are increasingly influencing society and politics, despite the fact that legacy media remain the most consumed source of news. In this paper, we study the propagation of information from social media to mainstream media, and investigate whether news editors' editorial decisions are influenced by the popularity of news stories on social media. To do so, we build a novel dataset including around 70% of all the tweets produced in French between August 2018 and July 2019 and the content published online by 200 mainstream media outlets. We then develop novel algorithms to identify and link events on social and mainstream media. To isolate the causal impact of popularity, we rely on the structure of the Twitter network and propose a new instrument based on the interaction between measures of user centrality and "social media news pressure" at the time of the event. We show that the social media popularity of a story increases the coverage of the same story by mainstream media. This effect varies depending on the media outlets' characteristics, in particular on whether they use a paywall. Finally, we investigate consumers' reaction to a surge in social media popularity. Our findings shed new light on news production decisions in the digital age and the welfare effects of social media. |
Keywords: | Internet, Information spreading, News editors, Network analysis, Social media, Twitter, Text analysis |
Date: | 2022–07–30 |
URL: | http://d.repec.org/n?u=RePEc:hal:wpaper:hal-03877907&r=big |
By: | C. Biliotti; F. J. Bargagli-Stoffi; N. Fraccaroli; M. Puliga; M. Riccaboni |
Abstract: | We study the causal effects of lockdown measures on uncertainty and sentiments on Twitter. By exploiting the quasi-experimental setting induced by the first Western COVID-19 lockdown - the unexpected lockdown implemented in Northern Italy in February 2020 - we measure changes in public uncertainty and sentiment expressed on daily pre and post-lockdown tweets geolocalized inside and in the proximity of the lockdown areas. Using natural language processing, including dictionary-based methods and deep learning models, we classify each tweet across four categories - economics, health, politics and lockdown policy - to identify in which areas uncertainty and sentiments concentrate. Using a Difference-in-Difference analysis, we show that areas under lockdown depict lower uncertainty around the economy and the stay-at-home mandate itself. This surprising result likely stems from an informational asymmetry channel, whereby treated individuals adjust their expectations once the policy is in place, while uncertainty persists around the untreated. However, we also find that the lockdown comes at a cost as political sentiments worsen. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.01705&r=big |
By: | Jie Zou; Jiashu Lou; Baohua Wang; Sixue Liu |
Abstract: | More and more stock trading strategies are constructed using deep reinforcement learning (DRL) algorithms, but DRL methods originally widely used in the gaming community are not directly adaptable to financial data with low signal-to-noise ratios and unevenness, and thus suffer from performance shortcomings. In this paper, to capture the hidden information, we propose a DRL based stock trading system using cascaded LSTM, which first uses LSTM to extract the time-series features from stock daily data, and then the features extracted are fed to the agent for training, while the strategy functions in reinforcement learning also use another LSTM for training. Experiments in DJI in the US market and SSE50 in the Chinese stock market show that our model outperforms previous baseline models in terms of cumulative returns and Sharpe ratio, and this advantage is more significant in the Chinese stock market, an emerging market. It indicates that our proposed method is a promising way to build an automated stock trading system. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.02721&r=big |
By: | Zhongchen Song; Tom Coupé (University of Canterbury) |
Abstract: | There is a substantial literature that suggests that search behavior data from Google Trends can be used for both private and public sector decision-making. In this paper, we use search behavior data from Baidu, the internet search engine most popular in China, to analyze whether these can improve nowcasts and forecasts of the Chinese economy. Using a wide variety of estimation and variable selection procedures, we find that Baidu’s search data can improve nowcast and forecast performance of the sales of automobiles and mobile phones, reducing forecast errors by more than 10%, as well as reducing forecast errors of total retail sales of consumption goods in China by more than 40%. Google Trends data, in contrast, do not improve performance. |
Keywords: | China, Baidu Index, Google Trends, forecasting, consumption. |
JEL: | C53 E21 E27 |
Date: | 2022–12–01 |
URL: | http://d.repec.org/n?u=RePEc:cbt:econwp:22/19&r=big |
By: | MASSUCCI Francesco; SERI Alessandro |
Abstract: | This study aims at improving the GLORIA project’s understanding of the alignment between private firms’ research, development & technological innovation (RD&TI) activities and SDGs. To accomplish such a goal, textual descriptions of single RD&TI records produced by the companies featured in the scoreboard will be analysed by means of different Natural Language Processing (NLP) techniques and classified in accordance with potential SDGs of interest. Specifically, patent descriptions, publication abstracts and summaries of research projects funded by the European Commission via the H2020 Framework Programme are analysed (see chapter 2. Data Sources) and linked with potential SDGs of concern by means of keyword-based and Deep Learning textual classifiers. |
Keywords: | SDG, innovation, Scoreboard |
Date: | 2022–11 |
URL: | http://d.repec.org/n?u=RePEc:ipt:iptwpa:jrc130479&r=big |
By: | Kim, Hyo Sang (KOREA INSTITUTE FOR INTERNATIONAL ECONOMIC POLICY (KIEP)); Kang, Eunjung (KOREA INSTITUTE FOR INTERNATIONAL ECONOMIC POLICY (KIEP)); Kim, Yuri (KOREA INSTITUTE FOR INTERNATIONAL ECONOMIC POLICY (KIEP)); Moon, Seongman (Jeonbuk National University); Jang, Huisu (Soongsil University) |
Abstract: | It is well known that exchange rates are difficult to forecast using observed macro-fundamental variables. This discrepancy between economic theory and empirical results is called the Meese and Rogoff puzzle. The purpose of this study is to address this puzzle from a new approach. Rather than pursuing a linkage between macro-fundamentals and exchange rates, we focus on the market sentiment index as a factor that could possibly enhance exchange rate predictability. The analysis unfolds in three phases. First, we conducted an assessment of the traditional exchange rate predictability model, as well as the augmented traditional model incorporating the market sentiment index. Second, we predicted the exchange rate by applying the market sentiment index, based on the contrarian opinion investment strategy commonly used by foreign exchange dealers. Finally, we analyzed whether the machine learning model incorporating both economic fundamentals and market sentiment index could enhance the predictability of the exchange rate. |
Keywords: | Exchange Rate; Exchange Rate Predictability; Market Sentiments |
Date: | 2022–09–30 |
URL: | http://d.repec.org/n?u=RePEc:ris:kiepwe:2022_042&r=big |
By: | Axenbeck, Janna; Berner, Anne; Kneib, Thomas |
Abstract: | The ongoing digital transformation has raised hopes for ICT-based climate protection within manufacturing industries, such as dematerialized products and energy efficiency gains. However, ICT also consume energy as well as resources, and detrimental effects on the environment are increasingly gaining attention. Accordingly, it is unclear whether trade-offs or synergies between the use of digital technologies and energy savings exist. Our analysis sheds light on the most important drivers of the relationship between ICT and energy use in manufacturing. We apply flexible tree-based machine learning to a German administrative panel data set including more than 25,000 firms. The results indicate firm-level heterogeneity, but suggest that digital technologies relate more frequently to an increase in energy use. Multiple characteristics, such as energy prices and firms' energy mix, explain differences in the effect. |
Keywords: | digital technologies,energy use,manufacturing,machine learning |
JEL: | C14 D22 L60 O33 Q40 |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:zbw:zewdip:22059&r=big |
By: | Peter Fisker (Development Economics Research Group, University of Copenhagen); Kenneth Mdadila (University of Dar es Salaam, School of Economics) |
Abstract: | This paper combines information from a representative household survey with publicly available spatial data extracted from satellite images to produce a high-resolution poverty map of Dar es Salaam. In particular, it builds a prediction model for per capita household consumption based on characteristics of the immediate neighborhood of the household, including the density of roads and buildings, the average size of houses, distances to places of interest, and night-time lights. The resulting poverty map of Dar es Salaam dramatically improves the spatial resolution of previous examples. Extreme Gradient Boosting (XGB) performs best in predicting household consumption levels given the input data. This result demonstrates the simplicity with which policy-relevant information containing a spatial dimension can be generated. |
Keywords: | Poverty, small-area estimation, building footprints, prediction models |
JEL: | O18 Q54 R11 |
Date: | 2022–12–17 |
URL: | http://d.repec.org/n?u=RePEc:kud:kuderg:2217&r=big |
By: | Dorothee Weiffen (ISDC - International Security and Development Center); Ghassan Baliki (ISDC - International Security and Development Center); Tilman Brück (ISDC - International Security and Development Center, Thaer-Institute, Humboldt University of Berlin, Leibniz Institute of Vegetable and Ornamental Crops) |
Abstract: | Agricultural interventions are one of the key policy tools to strengthen the food security of households living in conflict settings. Yet, given the complex nature of conflict-affected settings, existing theories of change might not hold, leading to misinterpretation of the significance and magnitude of these impacts. How contextual factors, including exposure to conflict intensity, shape treatment effects remains broadly unconfirmed. To address this research gap, we apply an honest causal forest algorithm to analyse the short-term impacts of an agricultural asset transfer on food security. Using a quasi-experimental panel dataset in Syria, comparing treatment and control households two years after receiving support, we first estimate the average treatment effect, and then we examine how contextual factors, particularly conflict, shape treatment heterogeneity. Our results show that agricultural asset transfers significantly improve food security in the short term. Moreover, exposure to conflict intensity plays a key role in determining impact size. We find that households living in moderately affected conflict areas benefited significantly from the agricultural intervention and improved their food security by up to 14.4%, while those living in no or high conflict areas did not. The positive effects were particularly strong for female-headed households. Our findings provide new insights on how violent conflict determines how households benefit from and respond to agricultural programming. This underscores the need to move away from one-size-fits-all agricultural support in difficult settings towards designing conflict-sensitive and inclusive interventions to ensure that no households are left behind. |
Keywords: | Agricultural intervention, Asset transfers, Food security, honest causal forest, impact evaluation, machine learning, Syria, Violent conflict |
JEL: | D10 D60 O12 O22 Q12 |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:hic:wpaper:381&r=big |
By: | Edouard Ribes (CERNA i3 - Centre d'économie industrielle i3 - Mines Paris - PSL (École nationale supérieure des mines de Paris) - PSL - Université Paris sciences et lettres - CNRS - Centre National de la Recherche Scientifique) |
Keywords: | Personal Finance, Wealth Management, Brokerage, Artificial Intelligence, Technological Change |
Date: | 2022–11–21 |
URL: | http://d.repec.org/n?u=RePEc:hal:wpaper:hal-03862261&r=big |
By: | Joël Terschuur |
Abstract: | Inequality of Opportunity (IOp) is considered an unfair source of inequality, an obstacle for economic growth and a determinant of preferences for redistribution. IOp can be estimated in two steps: (i) fitted values are estimated by predicting an outcome given some circumstances out of the control of the individual, (ii) the inequality in the distribution of the fitted values is measured by some inequality index. Using machine learners in the prediction step makes it possible to consider a rich set of circumstances, but it leads to biases in the estimation of IOp. We propose to use debiased estimators based on the Gini coefficient and the Mean Logarithmic Deviation (MLD). Further, we measure the effect of each circumstance on IOp and provide valid standard errors. To stress the usefulness of inference, we provide a test to compare IOp in two populations and a group test to check joint significance of a group of circumstances. We use the debiased estimators to measure IOp in 29 European countries. Romania and Bulgaria are the countries with the highest IOp. Southern countries tend to have high levels of IOp while Nordic countries have low IOp. Debiased estimators are more robust to the choice of the machine learner in the first step. Mother's education and father's occupation are key circumstances to explain inequality. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.02407&r=big |
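The two-step procedure described in the abstract above can be sketched as a naive plug-in estimate: predict the outcome from circumstances, then measure inequality in the fitted values with the Gini coefficient and the MLD. This is only the biased baseline the abstract argues against, not the authors' debiased estimator. Ordinary least squares stands in for the machine learner, and all function names are assumptions made for illustration.

```python
import numpy as np

def gini(x):
    """Gini coefficient of non-negative values, via the sorted-rank formula."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n)

def mld(x):
    """Mean Logarithmic Deviation: mean of log(mean / x_i); needs x > 0."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.log(x.mean() / x)))

def plug_in_iop(outcome, circumstances):
    """Step (i): fit outcome on circumstances (OLS as a stand-in learner).
    Step (ii): inequality indices of the fitted values."""
    outcome = np.asarray(outcome, dtype=float)
    X = np.column_stack([np.ones(outcome.size), circumstances])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    fitted = X @ beta
    return gini(fitted), mld(fitted)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: one binary circumstance shifting income upward.
    circ = rng.integers(0, 2, size=500)
    income = 20.0 + 10.0 * circ + rng.normal(scale=3.0, size=500)
    print(plug_in_iop(income, circ))
```

If circumstances explained nothing, the fitted values would be constant and both indices would be zero; the larger the share of the outcome explained by circumstances, the larger the plug-in IOp.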
By: | Mathias Valla (LSAF - Laboratoire de Sciences Actuarielles et Financières [Lyon] - ISFA - Institut de Science Financière et d'Assurances); Xavier Milhaud (I2M - Institut de Mathématiques de Marseille - AMU - Aix Marseille Université - ECM - École Centrale de Marseille - CNRS - Centre National de la Recherche Scientifique); Anani Ayodélé Olympio (SAF - Laboratoire de Sciences Actuarielle et Financière - UCBL - Université Claude Bernard Lyon 1 - Université de Lyon) |
Abstract: | A retention strategy based on an enlightened lapse modelization can be a powerful profitability lever for a life insurer. Some machine learning models are excellent at predicting lapse, but from the insurer's perspective, predicting which policyholder is likely to lapse is not enough to design a retention strategy. Changing the classical classification problem to a regression one with an appropriate validation metric based on Customer Lifetime Value (CLV) has recently been proposed. In our paper, we suggest several improvements and apply them to a sizeable real-world life insurance dataset. We include the risk of death in the study through competing risk considerations in parametric and tree-based models and show that further individualization of the existing approach leads to increased performance. We show that survival tree-based models can outperform parametric approaches and that the actuarial literature can significantly benefit from them. Then, we compare how this framework leads to increased predicted gains for the insurer regardless of the retention strategy. Finally, we discuss the benefits of our modelization in terms of commercial and strategic decision-making for a life insurer. |
Keywords: | Lapse, Lapse management strategy, Tree-based models, Competing risks, Customer lifetime value, Machine Learning |
Date: | 2022–12–16 |
URL: | http://d.repec.org/n?u=RePEc:hal:wpaper:hal-03903047&r=big |
By: | Kühl, Niklas; Goutier, Marc; Baier, Lucas; Wolff, Clemens; Martin, Dominik |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:dar:wpaper:135657&r=big |
By: | Dimitrios Kanelis; Pierre L. Siklos |
Abstract: | We combine modern methods from Speech Emotion Recognition and Natural Language Processing with high-frequency financial data to analyze how the vocal emotions and language of ECB President Mario Draghi affect the yield curve of major euro area economies. Vocal emotions significantly impact the yield curve. However, their impact varies in size and sign: positive signals raise German and French yields, while Italian yields react negatively, which is reflected in an increase in yield spreads. A by-product of our study is the construction and provision of a synchronized data set for voice and language. |
Keywords: | Communication, ECB, Neural Networks, High-Frequency Data, Speech Emotion Recognition, Asset Prices |
JEL: | E50 E58 G12 G14 |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:cqe:wpaper:10322&r=big |
By: | Dimitrios Kanelis; Pierre L. Siklos |
Abstract: | We combine modern methods from Speech Emotion Recognition and Natural Language Processing with high-frequency financial data to analyze how the vocal emotions and language of ECB President Mario Draghi affect the yield curve of major euro area economies. Vocal emotions significantly impact the yield curve. However, their impact varies in size and sign: positive signals raise German and French yields, while Italian yields react negatively, which is reflected in an increase in yield spreads. A by-product of our study is the construction and provision of a synchronized data set for voice and language. |
Keywords: | Communication, ECB, Neural Networks, High-Frequency Data, Speech Emotion Recognition, Asset Prices |
JEL: | E50 E58 G12 G14 |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:een:camaaa:2022-75&r=big |
By: | Dragos Gorduza; Xiaowen Dong; Stefan Zohren |
Abstract: | Understanding stock market instability is a key question in financial management as practitioners seek to forecast breakdowns in asset co-movements which expose portfolios to rapid and devastating collapses in value. The structure of these co-movements can be described as a graph where companies are represented by nodes and edges capture correlations between their price movements. Learning a timely indicator of co-movement breakdowns (manifested as modifications in the graph structure) is central in understanding both financial stability and volatility forecasting. We propose to use the edge reconstruction accuracy of a graph auto-encoder (GAE) as an indicator for how spatially homogeneous connections between assets are, which, based on financial network literature, we use as a proxy to infer market volatility. Our experiments on the S&P 500 over the 2015-2022 period show that higher GAE reconstruction error values are correlated with higher volatility. We also show that out-of-sample autoregressive modeling of volatility is improved by the addition of the proposed measure. Our paper contributes to the literature of machine learning in finance particularly in the context of understanding stock market instability. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.04974&r=big |
By: | Rossouw, Stephanié; Greyling, Talita |
Abstract: | We know that when collective emotions are prolonged, they lead not only to action (which could be negative) but also to the formation of identity, culture, or an emotional climate. Therefore, policymakers must understand how collective emotions react to macro-level shocks to mitigate potentially violent and destructive outcomes. Given the above, our paper's main aim is to determine the effect of macro-level shocks on collective emotions and the various stages they follow. To this end, we analyse the temporal evolution of different emotions before and after two different types of macro-level shocks: lockdown, a government-implemented regulation brought on by COVID-19, and the invasion of Ukraine. A secondary aim is to use narrative analysis to understand the public perceptions and concerns that lead to the observed emotional changes. To achieve these aims, we use a unique time series dataset derived from extracting tweets in real-time, filtering on specific keywords related to lockdowns (COVID-19) and the Ukrainian war for ten countries. Applying Natural Language Processing, we obtain these tweets' underlying emotion scores and derive daily time series data per emotion. We compare the different emotional time series data to a counterfactual to derive changes from the norm. Additionally, we use topic modelling to explain the emotional changes. We find that the same collective emotions are evoked following similar patterns over time regardless of whether it is a health or a war shock. Specifically, we find fear is the predominant emotion before the shocks, and anger leads the emotions after the shocks, followed by sadness and fear. |
Keywords: | COVID-19, Big Data, Twitter, collective emotions, Ukraine, macro-level shock |
JEL: | C55 I10 I31 H12 N40 |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:zbw:glodps:1210&r=big |
By: | Emily Silcock; Luca D'Amico-Wong; Jinglin Yang; Melissa Dell |
Abstract: | Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210-document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible - despite the massive overall size of the corpus - as duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a "re-rank" style approach combining a bi- and cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. We also apply our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean Crawled Corpus), illustrating that a neural approach can identify many near duplicates missed by hashing, in the presence of various types of noise. The public release of our NEWS-COPY de-duplication dataset, de-duplicated RealNews and patent corpuses, and the pre-trained models will facilitate further research and applications. |
JEL: | C81 |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:nbr:nberwo:30726&r=big |
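For readers unfamiliar with the N-gram overlap baseline mentioned in the abstract above, a minimal sketch follows. It assumes word-level trigrams and Jaccard similarity as the overlap measure; the tokenization, the value of n, and the function names are illustrative assumptions and not the configuration evaluated in the paper.

```python
def word_ngrams(text, n=3):
    """Return the set of word-level n-grams of a lower-cased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_jaccard(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two texts."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # both texts shorter than n words: no evidence either way
    return len(grams_a & grams_b) / len(grams_a | grams_b)

if __name__ == "__main__":
    original = "the quick brown fox jumps"
    noisy = "the quick brown fox leaps"  # one word changed, e.g. by OCR noise
    print(ngram_jaccard(original, noisy))  # 2 shared of 4 distinct trigrams: 0.5
```

A single changed word removes every n-gram that contains it, which is one reason overlap scores can fall quickly on noisy text.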
By: | Taiga Saito (Graduate School of Economics, The University of Tokyo); Shivam Gupta (Department of Information Systems, Supply Chain Management & Decision Support, NEOMA Business School) |
Abstract: | This study presents big data applications with quantitative theoretical models in financial management and investigates possible incorporation of social media factors into the models. Specifically, we examine three models, a revenue management model, an interest rate model with market sentiments, and a high-frequency trading equity market model, and consider possible extensions of those models to include social media. Since social media plays a substantial role in promoting products and services, engaging with customers, and sharing sentiments among market participants, it is important to include social media factors in the stochastic optimization models for financial management. Moreover, we compare the three models from a qualitative and quantitative point of view and provide managerial implications on how these models are synthetically used along with social media in financial management with a concrete case of a hotel REIT. The contribution of this research is that we investigate the possible incorporation of social media factors into the three models whose objectives are revenue management and debt and equity financing, essential areas in financial management, which helps to estimate the effect and the impact of social media quantitatively if internal data necessary for parameter estimation are available, and provide managerial implications for the synthetic use of the three models from a higher viewpoint. The numerical experiment along with the proposition indicates that the model can be used in the revenue management of hotels, and by improving the social media factor, the hotel can work on maximizing its sales. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:cfi:fseres:cf550&r=big |
By: | Salazar, Lina; Palacios, Ana Claudia; Selvaraj, Michael; Montenegro, Frank |
Abstract: | This study combines three rounds of surveys with remote sensing to measure long-term impacts of a randomized irrigation program in the Dominican Republic. Specifically, Landsat 7 and Landsat 8 satellite images are used to measure the causal effects of the program on agricultural productivity, measured through vegetation indices (NDVI and OSAVI). To this end, 377 plots were analyzed (129 treated and 248 controls) for the period from 2011 to 2019. Following a Difference-in-Differences (DD) and Event study methodology, the results confirmed that program beneficiaries have higher vegetation indices, and therefore experienced a higher productivity throughout the post-treatment period. Also, there is some evidence of spillover effects to neighboring farmers. Furthermore, the Event Study model shows that productivity impacts are obtained in the third year after the adoption takes place. These findings suggest that adoption of irrigation technologies can be a long and complex process that requires time to generate productivity impacts. In a more general sense, this study reveals the great potential that exists in combining field data with remote sensing information to assess long-term impacts of agricultural programs on agricultural productivity. |
Keywords: | Irrigation;Remote Sensing;Impact Evaluation;Agriculture |
JEL: | Q00 |
Date: | 2021–09 |
URL: | http://d.repec.org/n?u=RePEc:idb:brikps:11607&r=big |
By: | Irene Aldridge; Payton Martin |
Abstract: | Our main contribution is the use of AI to discern the key drivers of variation in ESG mentions in corporate filings. With AI, we are able to separate the "dimensions" along which corporate management presents its ESG policies to the world. These dimensions are 1) diversity, 2) hazardous materials, and 3) greenhouse gases. We are also able to identify separate "background" dimensions of unofficial ESG activity in the firms, which provide more color on how the firms and their shareholders think about their ESG processes. We then measure investors' response to the ESG activity "factors". The AI techniques presented can assist in building better, more reliable and useful ESG ratings systems. |
Date: | 2022–11 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2212.00018&r=big |
By: | Massimiliano Marcellino; Dalibor Stevanovic |
Abstract: | In this article we study how the demand and supply of information about inflation affect inflation developments. As a proxy for the demand for information, we extract Google Trends (GT) for keywords such as "inflation", "inflation rate", or "price increase". The rationale is that when agents are more interested in inflation, they should search for information about it, and Google is by now a natural source. As a proxy for the supply of information about inflation, we instead use an indicator based on a (standardized) count of the Wall Street Journal (WSJ) articles containing the word "inflat" in their title. We find that measures of demand (GT) and supply (WSJ) of inflation information play a relevant role in understanding and predicting actual inflation developments, with the more granular information improving expectation formation, especially during periods when inflation is very high or low. In particular, the full information rational expectation hypothesis is rejected, suggesting that some informational rigidities exist and are waiting to be exploited. Contrary to the existing evidence, we conclude that media communication and agents' attention do play an important role for aggregate inflation expectations, and this remains valid also when controlling for FED communications. |
Keywords: | Inflation, Expectations, Google trends, Text analysis |
JEL: | C53 C83 D83 D84 E31 E37 |
Date: | 2022–12–01 |
URL: | http://d.repec.org/n?u=RePEc:cir:cirwor:2022s-27&r=big |
By: | Proeger, Till; Meub, Lukas |
Abstract: | This web-scraping analysis covers, for the first time, the online presence of craft businesses across the entire chamber district of Osnabrück-Emsland-Grafschaft Bentheim. The websites are analyzed for direct proximity to AI, advanced digitalization, and indirect exposure to AI. To this end, a grid of 245 search terms is compiled from a literature review and expert interviews, and this grid then guides the analysis of the websites. The result is a comprehensive overview of technology use in the chamber district, with a focus on artificial intelligence. As the literature analysis would suggest, direct use of AI in the form of corresponding technologies is rare, technologies of advanced digitalization are considerably more common, and indirect exposure to AI through software, platforms, and social media is high. This yields a characteristic pyramid structure with respect to AI use: businesses that are advanced in technology and digitalization come into direct contact with AI, while the majority of businesses show indirect exposure. The orders of magnitude for the chamber district are roughly 180 businesses with a direct AI connection, 1,200 businesses with indicators of advanced digitalization, 1,700 businesses with indirect AI exposure, and 3,400 businesses that merely have a website with no reference to any of the three categories. With regard to trades, in absolute numbers most businesses with a direct connection to artificial intelligence are found among electrical technicians. AI terms also appear frequently among agricultural and construction machinery mechatronics technicians, plumbing, heating, and air-conditioning (SHK) businesses, precision mechanics, and information technology specialists. Other notable features of the distribution across trades are the frequent mention of the Internet of Things (IoT) among electrical technicians and the relatively frequent mention of big data and forecasting models among agricultural and construction machinery mechatronics technicians.
Indicators of advanced digitalization are found in particular among electrical technicians, joiners, metalworkers, SHK businesses, precision mechanics, opticians, information technology specialists, and hearing aid acousticians. Strong indirect exposure to AI is seen in particular among electrical technicians, joiners, bricklayers/concrete workers, painters/varnishers, photographers, opticians, tilers, bakers, and chimney sweeps, with the use of social media being the main determinant of membership in this category. The regional distribution shows that, for the central category of AI use, the affected businesses are in principle spread relatively evenly across the area. Regional concentrations are the Osnabrück area, Meppen, Nordhorn, and Bramsche. There is no clear concentration in urban areas; rather, both urban and rural districts are represented. The same holds for advanced digitalization and indirect exposure: here too, the spatial distribution is even. |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:zbw:ifhfob:5&r=big |
By: | Oeindrila Dube; Joshua E. Blumenstock; Michael J. Callen |
Abstract: | Religious adherence has been hard to study in part because it is hard to measure. We develop a new measure of religious adherence, which is granular in both time and space, using anonymized mobile phone transaction records. After validating the measure with traditional data, we show how it can shed light on the nature of religious adherence in Islamic societies. Exploiting random variation in climate, we find that as economic conditions in Afghanistan worsen, people become more religiously observant. The effects are most pronounced in areas where droughts have the biggest economic consequences, such as croplands without access to irrigation. |
Keywords: | religion, mobile phones, big data, climate, economic shocks |
JEL: | Z10 Z12 Q10 Q15 Q54 O13 |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:ces:ceswps:_10114&r=big |
By: | Carmen Vázquez de Castro Álvarez-Buylla |
Abstract: | This paper has a twofold purpose. On the one hand, it briefly presents some useful cases from among the countless possibilities that artificial intelligence offers in the business sphere. On the other hand, it explains how training in artificial intelligence is developing in non-academic settings. The paper starts from the premise that, for an organization to embrace this wave of innovations, it must understand the value they can bring to it. In that context, it describes some of the business challenges that these tools are currently being used to address, as well as the challenges facing non-formal training in artificial intelligence. |
Date: | 2022–12 |
URL: | http://d.repec.org/n?u=RePEc:fda:fdafen:2022-33&r=big |
By: | Sholler, Dan; MacInnes, Ian |
Date: | 2022 |
URL: | http://d.repec.org/n?u=RePEc:zbw:itse22:265669&r=big |