on Big Data |
By: | Giovanni Cerulli (IRCrES-CNR) |
Abstract: | We present two related Stata modules, r_ml_stata and c_ml_stata, for fitting popular machine learning (ML) methods in both regression and classification settings. Using the recent Stata/Python integration platform (sfi) of Stata 16, these commands provide optimal tuning of hyperparameters via K-fold cross-validation using grid search. More specifically, they make use of the Python Scikit-learn API to carry out both cross-validation and outcome/label prediction. |
Date: | 2021–08–07 |
URL: | http://d.repec.org/n?u=RePEc:boc:scon21:25&r= |
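The K-fold grid-search tuning these commands delegate to Scikit-learn can be sketched directly in Python; the estimator and parameter grid below are illustrative choices, not the commands' defaults:

```python
# Illustrative K-fold cross-validated grid search with scikit-learn,
# the tuning strategy r_ml_stata/c_ml_stata wrap for Stata users.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}  # assumed grid
search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

`best_params_` holds the hyperparameter combination with the best mean cross-validated score, which is then used for outcome prediction.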
By: | Simon Blöthner; Mario Larch |
Abstract: | While traditional empirical models using determinants like size and trade costs are able to predict RTA formation reasonably well, we demonstrate that allowing for machine detected non-linear patterns helps to improve the predictive power of RTA formation substantially. We employ machine learning methods and find that the fitted tree-based methods and neural networks deliver sharper and more accurate predictions than the probit model. For the majority of models the allowance of fixed effects increases the predictive performance considerably. We apply our models to predict the likelihood of RTA formation of the EU and the United States with their trading partners, respectively. |
Keywords: | Regional Trade Agreements, neural networks, tree-based methods, high-dimensional fixed effects |
JEL: | F14 F15 C45 C53 |
Date: | 2021 |
URL: | http://d.repec.org/n?u=RePEc:ces:ceswps:_9233&r= |
By: | Martin Baumgaertner (THM Business School); Johannes Zahner (Philipps-Universitaet Marburg) |
Abstract: | Dictionary approaches are at the forefront of current techniques for quantifying central bank communication. This paper proposes embeddings – a language model trained using machine learning techniques – to locate words and documents in a multidimensional vector space. To accomplish this, we gather a text corpus that is unparalleled in size and diversity in the central bank communication literature, and introduce a novel approach to text quantification from computational linguistics. Utilizing this novel text corpus of over 23,000 documents from over 130 central banks, we are able to provide high-quality text representations – embeddings – for central banks. Finally, we demonstrate the applicability of embeddings through several examples in the fields of monetary policy surprises, financial uncertainty, and gender bias. |
Keywords: | Word Embedding, Neural Network, Central Bank Communication, Natural Language Processing, Transfer Learning |
JEL: | C45 C53 E52 Z13 |
Date: | 2021 |
URL: | http://d.repec.org/n?u=RePEc:mar:magkse:202130&r= |
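As a purely illustrative analogue of locating documents in a vector space (the paper trains neural embeddings; this sketch substitutes LSA-style truncated SVD over TF-IDF, and the toy sentences are invented):

```python
# Minimal document-embedding analogue: TF-IDF followed by truncated SVD
# (latent semantic analysis), standing in for trained neural embeddings.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["rates were raised to curb inflation",
        "the committee held rates steady",
        "financial stability risks increased"]
tfidf = TfidfVectorizer().fit_transform(docs)          # sparse term weights
emb = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(emb.shape)  # (3, 2): each document mapped to a 2-D vector
```

Distances between the resulting vectors can then proxy for semantic similarity between communications, as with the paper's embeddings.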
By: | Baikulakov Shalkar (Center for the Development of Payment and Financial Technologies); Belgibayev Zanggar (National Bank of Kazakhstan) |
Abstract: | This project is an attempt to assess the creditworthiness of individuals through machine learning algorithms, based on regulatory data provided by second-tier banks to the National Bank of the Republic of Kazakhstan. Assessing borrowers' creditworthiness allows the central bank to examine the quality of loans issued by second-tier banks and to predict potential systemic risks. Two linear and six nonlinear classification methods were applied (linear models – Logistic Regression and Stochastic Gradient Descent; nonlinear – Neural Networks, kNN, Decision Tree, Random Forest, XGBoost, Naïve Bayes), and the algorithms were compared on accuracy, precision, and several other metrics. The nonlinear models deliver more accurate predictions than the linear models. In particular, nonlinear models such as the Random Forest and kNN classifiers on oversampled data demonstrated the most promising results. |
Keywords: | consumer credits, machine learning, bank regulation, stochastic gradient descent (linear model), logistic regression (linear model), kNN (neighbors), random forest classifier (ensemble), decision tree (tree), Gaussian NB (naïve Bayes), XGBoost, neural network (MLP classifier) |
JEL: | G21 G28 E37 E51 |
Date: | 2021 |
URL: | http://d.repec.org/n?u=RePEc:aob:wpaper:21&r= |
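The paper's comparison of a linear and a nonlinear classifier on oversampled data can be sketched as follows; the synthetic data, the naive random oversampling, and the model settings are all assumptions for illustration:

```python
# Sketch: linear vs. nonlinear classifier on imbalanced labels, with
# naive random oversampling of the minority class (synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# Duplicate random minority-class rows in the training split only,
# until both classes are balanced.
minority = np.where(ytr == 1)[0]
extra = np.random.default_rng(0).choice(minority, size=len(ytr) - 2 * len(minority))
Xtr_os = np.vstack([Xtr, Xtr[extra]])
ytr_os = np.hstack([ytr, ytr[extra]])

results = {}
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    pred = model.fit(Xtr_os, ytr_os).predict(Xte)
    results[type(model).__name__] = (accuracy_score(yte, pred),
                                     precision_score(yte, pred))
print(results)
```

Oversampling is applied after the train/test split so the held-out metrics are not contaminated by duplicated rows.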
By: | Meerza, Syed Imran Ali; Brooks, Kathleen R.; Gustafson, Christopher R.; Yiannaka, Amalia |
Keywords: | Institutional and Behavioral Economics, Research Methods/Statistical Methods, Health Economics and Policy |
Date: | 2021–08 |
URL: | http://d.repec.org/n?u=RePEc:ags:aaea21:312876&r= |
By: | International Monetary Fund |
Abstract: | The IMF’s Vulnerability Exercise (VE) is a cross-country exercise that identifies country-specific near-term macroeconomic risks. As a key element of the Fund’s broader risk architecture, the VE is a bottom-up, multi-sectoral approach to risk assessments for all IMF member countries. The VE modeling toolkit is regularly updated in response to global economic developments and the latest modeling innovations. The new generation of VE models presented here leverages machine-learning algorithms. The models can better capture interactions between different parts of the economy and non-linear relationships that are not well measured in ”normal times.” The performance of machine-learning-based models is evaluated against more conventional models in a horse-race format. The paper also presents direct, transparent methods for communicating model results. |
Keywords: | Risk Assessment, Supervised Machine Learning, Prediction, Sudden Stop, Exchange Market Pressure, Fiscal Crisis, Debt, Financial Crisis, Economic Crisis, Economic Growth |
Date: | 2021–05–07 |
URL: | http://d.repec.org/n?u=RePEc:imf:imftnm:2021/003&r= |
By: | Andres, Antonio Rodriguez; Otero, Abraham; Amavilah, Voxi Heinrich |
Abstract: | Missing values and the inconsistency of measures of the knowledge economy remain vexing problems that hamper policy-making and future research in developing and emerging economies. This paper contributes to the new and evolving literature that seeks to advance understanding of the importance of the knowledge economy for policy and further research in developing and emerging economies. We use a supervised deep learning neural network (DLNN) approach to predict the knowledge economy index of 71 developing and emerging economies over the 1995-2017 period. Applied in combination with a data imputation procedure based on the K-nearest neighbor algorithm, DLNN handles missing-data problems better than alternative methods. A 10-fold validation of the DLNN yielded low quadratic and absolute error (0.382 ± 0.065). The results are robust and efficient, and the model's predictive power is high. There is a difference in predictive power when we disaggregate countries into all emerging economies versus emerging Central European countries. We explain this result and leave the rest to future endeavors. Overall, this research has filled in gaps due to missing data, thereby allowing for effective policy strategies. At the aggregate level, development agencies, including the World Bank, which originated the KEI, would benefit from our approach until substitutes come along. |
Keywords: | Machine deep learning neural networks; developing economies, emerging economies, knowledge economy, knowledge economy index, World Bank |
JEL: | C45 C53 O38 O41 O57 P41 |
Date: | 2021–04–15 |
URL: | http://d.repec.org/n?u=RePEc:pra:mprapa:109137&r= |
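The K-nearest-neighbor imputation step described above can be sketched with scikit-learn's KNNImputer; this is a stand-in, and the paper's exact procedure and parameters may differ:

```python
# KNN imputation sketch: a missing entry is filled with the mean of the
# corresponding feature across its nearest neighbours.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],   # missing first feature
              [4.0, 6.0]])
filled = KNNImputer(n_neighbors=2).fit_transform(X)
# The missing entry becomes 2.5, the mean of its two neighbours' values.
print(filled)
```

KNNImputer measures neighbour distance with a NaN-aware Euclidean metric, so rows with missing entries still participate in the neighbour search.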
By: | Gabriel de Oliveira Guedes Nogueira; Marcel Otoboni de Lima |
Abstract: | To make good investment decisions, it is vitally important for an investor to be able to analyze financial time series. In this context, studies forecasting the values and trends of stock prices have become more relevant. Currently, there are different approaches to the task. The two main ones are the analysis of historical stock prices and technical indicators, and the analysis of sentiment in news, blogs, and tweets about the market. Among the most widely used statistical and artificial intelligence techniques are genetic algorithms, Support Vector Machines (SVM), and various architectures of artificial neural networks. This work proposes the improvement of a model based on the association of three distinct LSTM neural networks, each acting in parallel to predict the opening, minimum, and maximum prices of stock exchange indices on the day following the analysis. The dataset is composed of historical data from more than 10 indices from the world's largest stock exchanges. The results demonstrate that the model is able to predict trends and stock prices with reasonable accuracy. |
Date: | 2021–07 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2108.10065&r= |
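The windowed "predict the next day from the past w days" framing underlying such models can be sketched on synthetic data; a plain linear model stands in here for the paper's three parallel LSTM networks:

```python
# Sliding-window next-day prediction on a synthetic index series.
# A linear model is used as a simple stand-in for an LSTM.
import numpy as np
from sklearn.linear_model import LinearRegression

prices = np.sin(np.linspace(0.0, 20.0, 300)) + 5.0  # synthetic index levels
w = 10  # look-back window, chosen for illustration
X = np.array([prices[i:i + w] for i in range(len(prices) - w)])
y = prices[w:]                                       # next-day targets
model = LinearRegression().fit(X[:-50], y[:-50])     # hold out last 50 days
print(round(model.score(X[-50:], y[-50:]), 3))       # out-of-sample R^2
```

Recurrent networks replace the linear map with a learned state that can carry longer path dependence, but the supervised windowed setup is the same.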
By: | Buckmann, Marcus (Bank of England); Haldane, Andy (Bank of England); Hüser, Anne-Caroline (Bank of England) |
Abstract: | Is human or artificial intelligence more conducive to a stable financial system? To answer this question, we compare human and artificial intelligence with respect to several facets of their decision-making behaviour. On that basis, we characterise possibilities and challenges in designing partnerships that combine the strengths of both minds and machines. Building on those insights, we explain how the differences in human and artificial intelligence have driven the usage of new techniques in financial markets, regulation, supervision, and policy making and discuss their potential impact on financial stability. Finally, we describe how effective mind-machine partnerships might be able to reduce systemic risks. |
Keywords: | Artificial intelligence; machine learning; financial stability; innovation; systemic risk |
JEL: | C45 C55 C63 C81 |
Date: | 2021–08–20 |
URL: | http://d.repec.org/n?u=RePEc:boe:boeewp:0937&r= |
By: | Jieyi Kang (Department of Land Economy, University of Cambridge); David Reiner (EPRG, CJBS, University of Cambridge) |
Keywords: | Weather sensitivity, smart metering data, unsupervised learning, clusters, residential electricity, consumption patterns, Ireland |
JEL: | C55 D12 R22 Q41 |
Date: | 2021–05 |
URL: | http://d.repec.org/n?u=RePEc:enp:wpaper:eprg2113&r= |
By: | Qi Feng; Man Luo; Zhaoyu Zhang |
Abstract: | We propose a deep signature/log-signature FBSDE algorithm to solve forward-backward stochastic differential equations (FBSDEs) with state and path dependent features. By incorporating the deep signature/log-signature transformation into the recurrent neural network (RNN) model, our algorithm shortens the training time, improves the accuracy, and extends the time horizon compared to methods in the existing literature. Moreover, our algorithms can be applied to a wide range of applications such as state and path dependent option pricing involving high-frequency data, model ambiguity, and stochastic games, which are linked to parabolic partial differential equations (PDEs) and path-dependent PDEs (PPDEs). Lastly, we also derive the convergence analysis of the deep signature/log-signature FBSDE algorithm. |
Date: | 2021–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2108.10504&r= |
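The signature transformation at the heart of the algorithm has a concrete finite form for piecewise-linear paths. The sketch below computes the level-1 and level-2 signature terms from the iterated-integral definition using Chen's identity; the path values are invented, and this is independent of the paper's implementation:

```python
# Level-1 and level-2 signature terms of a 2-D piecewise-linear path,
# built segment by segment via Chen's identity.
import numpy as np

path = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.0]])
inc = np.diff(path, axis=0)   # per-segment increments
S1 = inc.sum(axis=0)          # level 1: total increment of the path
S2 = np.zeros((2, 2))         # level 2: double iterated integrals
run = np.zeros(2)             # running level-1 signature
for d in inc:
    # Chen: concatenating a linear segment adds run (x) d + d (x) d / 2.
    S2 += np.outer(run, d) + np.outer(d, d) / 2.0
    run += d
# The antisymmetric part of S2 is twice the signed (Levy) area.
print(S1, S2[0, 1] - S2[1, 0])
```

These low-order terms are exactly the features that the deep signature layer feeds into the RNN, truncated at a chosen depth.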
By: | Amin, Modhurima D.; Badruddoza, Syed; Mantle, Steve |
Keywords: | Productivity Analysis, Research Methods/Statistical Methods, Agribusiness |
Date: | 2021–08 |
URL: | http://d.repec.org/n?u=RePEc:ags:aaea21:312764&r= |
By: | Paul Hünermund (Copenhagen Business School); Beyers Louw (Maastricht University); Itamar Caspi (Bank of Israel) |
Abstract: | Double machine learning (DML) is becoming an increasingly popular tool for automated model selection in high-dimensional settings. At its core, DML assumes unconfoundedness, or exogeneity of all considered controls, which is likely to be violated if the covariate space is large. In this paper, we lay out a theory of bad controls building on the graph-theoretic approach to causality. We then demonstrate, based on simulation studies and an application to real-world data, that DML is very sensitive to the inclusion of bad controls and exhibits considerable bias even with only a few endogenous variables present in the conditioning set. The extent of this bias depends on the precise nature of the assumed causal model, which calls into question the ability of selecting appropriate controls for regressions in a purely data-driven way. |
Date: | 2021–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2108.11294&r= |
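The bad-control problem the paper studies can be seen in a minimal simulation even without the DML machinery: conditioning on a collider biases an otherwise consistent OLS estimate. The data-generating process below is assumed purely for illustration:

```python
# Collider ("bad control") bias in a minimal simulation.
# True effect of D on Y is 1; conditioning on C, a common child of
# D and Y, destroys the estimate.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
D = rng.normal(size=n)                 # treatment
Y = D + rng.normal(size=n)             # outcome, true effect = 1
C = D + Y + rng.normal(size=n)         # collider: caused by both D and Y

X_good = np.column_stack([np.ones(n), D])       # correct specification
X_bad = np.column_stack([np.ones(n), D, C])     # includes the bad control
b_good = np.linalg.lstsq(X_good, Y, rcond=None)[0][1]
b_bad = np.linalg.lstsq(X_bad, Y, rcond=None)[0][1]
print(round(b_good, 2), round(b_bad, 2))
```

With this design the coefficient on D collapses from roughly 1 to roughly 0 once the collider enters the conditioning set, which is the mechanism a data-driven control selector can trigger inadvertently.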
By: | David T. Frazier; Ruben Loaiza-Maya; Gael M. Martin; Bonsoo Koo |
Abstract: | We propose a new method for Bayesian prediction that caters for models with a large number of parameters and is robust to model misspecification. Given a class of high-dimensional (but parametric) predictive models, this new approach constructs a posterior predictive using a variational approximation to a loss-based, or Gibbs, posterior that is directly focused on predictive accuracy. The theoretical behavior of the new prediction approach is analyzed and a form of optimality demonstrated. Applications to both simulated and empirical data using high-dimensional Bayesian neural network and autoregressive mixture models demonstrate that the approach provides more accurate results than various alternatives, including misspecified likelihood-based predictions. |
Keywords: | loss-based Bayesian forecasting, variational inference, Gibbs posteriors, proper scoring rules, Bayesian neural networks, M4 forecasting competition |
JEL: | C11 C53 C58 |
Date: | 2021 |
URL: | http://d.repec.org/n?u=RePEc:msh:ebswps:2021-8&r= |
By: | Lin William Cong; Ke Tang; Jingyuan Wang; Yang Zhang |
Abstract: | We predict asset returns and measure risk premia using a prominent technique from artificial intelligence -- deep sequence modeling. Because asset returns often exhibit sequential dependence that may not be effectively captured by conventional time series models, sequence modeling offers a promising path with its data-driven approach and superior performance. In this paper, we first overview the development of deep sequence models, introduce their applications in asset pricing, and discuss their advantages and limitations. We then perform a comparative analysis of these methods using data on U.S. equities. We demonstrate how sequence modeling benefits investors in general through incorporating complex historical path dependence, and that Long Short-Term Memory (LSTM)-based models tend to have the best out-of-sample performance. |
Date: | 2021–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2108.08999&r= |
By: | Parvez, Rezwanul; Ali Meerza, Syed Imran; Hasan Khan Chowdhury, Nazea |
Keywords: | Teaching/Communication/Extension/Profession, Community/Rural/Urban Development, Institutional and Behavioral Economics |
Date: | 2021–08 |
URL: | http://d.repec.org/n?u=RePEc:ags:aaea21:312912&r= |
By: | Ramit Debnath (EPRG, CJBS, University of Cambridge); Sarah Darby (University of Oxford); Ronita Bardhan (Department of Architecture, University of Cambridge); Kamiar Mohaddes (EPRG, CJBS, University of Cambridge); Minna Sunikka-Blank (Department of Architecture, University of Cambridge) |
Keywords: | energy policy, narratives, topic modelling, computational social science, text analysis, methodological framework |
JEL: | Q40 Q48 R28 |
Date: | 2020–07 |
URL: | http://d.repec.org/n?u=RePEc:enp:wpaper:eprg2019&r= |
By: | Jieyi Kang (Department of Land Economy, University of Cambridge); David Reiner (EPRG, CJBS, University of Cambridge) |
Keywords: | Residential electricity, household consumption behaviour, China, machine learning |
JEL: | C55 D12 R22 Q41 |
Date: | 2021–05 |
URL: | http://d.repec.org/n?u=RePEc:enp:wpaper:eprg2114&r= |
By: | Marco Vega (Departamento de Economía de la Pontificia Universidad Católica del Perú); Erick Lahura (Departamento de Economía de la Pontificia Universidad Católica del Perú); Hilary Coronado (Universidad Científica del Sur) |
Abstract: | Motivated by the growth of e-commerce and the importance of price rigidity in explaining the real effects of monetary shocks, this paper assesses the degree of online price rigidity in Peru. To that end, we analyze 4.5 million prices posted daily on the website of a department store that, during the period of analysis, held a market share of roughly 50 percent. This "big data" was obtained through web scraping, applied daily between 2016 and 2020. Based on the frequency of price changes and their durations, the results indicate that online prices in Peru are less rigid than in other countries. |
JEL: | C55 C81 E31 L11 L81 |
Keywords: | Price rigidity, internet prices, web scraping, big data |
Date: | 2021 |
URL: | http://d.repec.org/n?u=RePEc:pcp:pucwps:wp00497&r= |
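A standard piece of arithmetic in this price-rigidity literature converts the frequency of price changes f into an implied average price duration of -1/ln(1-f). The daily price series below is invented for illustration:

```python
# Frequency of price change and the implied average price duration.
import math

prices = [10.0, 10.0, 10.0, 9.5, 9.5, 11.0, 11.0, 11.0]   # one product, daily
changes = sum(a != b for a, b in zip(prices, prices[1:]))
f = changes / (len(prices) - 1)          # share of days with a price change
duration = -1.0 / math.log(1.0 - f)      # implied mean spell length, in days
print(round(f, 3), round(duration, 2))   # 0.286 2.97
```

A lower change frequency implies a longer duration and hence stickier prices, which is the cross-country comparison the paper makes.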
By: | Ramit Debnath (EPRG, CJBS, University of Cambridge); Ronita Bardhan (Department of Architecture, University of Cambridge); Sarah Darby (University of Oxford); Kamiar Mohaddes (EPRG, CJBS, University of Cambridge); Minna Sunikka-Blank (Department of Architecture, University of Cambridge) |
Keywords: | energy justice, poverty, computational social science, policy design, machine learning, textual analysis |
JEL: | D63 I30 Q48 R20 |
Date: | 2020–11 |
URL: | http://d.repec.org/n?u=RePEc:enp:wpaper:eprg2030&r= |
By: | Sung Hoon Choi |
Abstract: | I develop a feasible weighted projected principal component (FPPC) analysis for factor models in which observable characteristics partially explain the latent factors. This novel method provides more efficient and accurate estimators than existing methods. To increase estimation efficiency, I take into account both cross-sectional dependence and heteroskedasticity by using a consistent estimator of the inverse error covariance matrix as the weight matrix. To improve accuracy, I employ a projection approach using characteristics because it removes noise components in high-dimensional factor analysis. By using the FPPC method, estimators of the factors and loadings have faster rates of convergence than those of the conventional factor analysis. Moreover, I propose an FPPC-based diffusion index forecasting model. The limiting distribution of the parameter estimates and the rate of convergence for forecast errors are obtained. Using U.S. bond market and macroeconomic data, I demonstrate that the proposed model outperforms models based on conventional principal component estimators. I also show that the proposed model performs well among a large group of machine learning techniques in forecasting excess bond returns. |
Date: | 2021–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2108.10250&r= |
By: | Jooyoung Cha; Harold D. Chiang; Yuya Sasaki |
Abstract: | This paper proposes a new method of inference in high-dimensional regression models and high-dimensional IV regression models. Estimation is based on a combined use of the orthogonal greedy algorithm, high-dimensional Akaike information criterion, and double/debiased machine learning. The method of inference for any low-dimensional subvector of high-dimensional parameters is based on a root-$N$ asymptotic normality, which is shown to hold without requiring the exact sparsity condition or the $L^p$ sparsity condition. Simulation studies demonstrate superior finite-sample performance of this proposed method over those based on the LASSO or the random forest, especially under less sparse models. We illustrate an application to production analysis with a panel of Chilean firms. |
Date: | 2021–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2108.09520&r= |
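The orthogonal greedy algorithm named above can be sketched as iteratively selecting the regressor most correlated with the current residual and refitting OLS on the selected set at each step. This is a minimal version on simulated data, not the authors' implementation, which pairs the selection rule with an information criterion:

```python
# Orthogonal greedy algorithm sketch: pick the column most correlated
# with the residual, refit OLS on all selected columns, repeat.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = 2.0 * X[:, 3] - X[:, 7] + rng.normal(size=500)  # true support {3, 7}

selected, resid = [], y.copy()
for _ in range(2):  # two steps; a stopping rule (e.g. AIC) would go here
    j = int(np.argmax(np.abs(X.T @ resid)))
    selected.append(j)
    beta = np.linalg.lstsq(X[:, selected], y, rcond=None)[0]
    resid = y - X[:, selected] @ beta
print(sorted(selected))
```

Refitting on the full selected set (rather than deflating greedily) is what makes the algorithm "orthogonal", and in this simulation it recovers the true support.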
By: | Rughinis, Razvan; Rughinis, Cosima; Vulpe, Simona Nicoleta; Rosner, Daniel |
Abstract: | We studied variability in General Data Protection Regulation (GDPR) awareness in relation to digital experience in the 28 European countries of EU27-UK, through secondary analysis of the Eurobarometer 91.2 survey conducted in March 2019 (N = 27,524). Education, occupation, and age were the strongest sociodemographic predictors of GDPR awareness, with little influence of gender, subjective economic well-being, or locality size. Digital experience was significantly and positively correlated with GDPR awareness in a linear model, but this relationship proved to be more complex when we examined it through a typological analysis. Using an exploratory k-means cluster analysis we identified four clusters of digital citizenship, across both dimensions of digital experience and GDPR awareness: the off-line citizens (22%), the social netizens (32%), the web citizens (17%), and the data citizens (29%). The off-line citizens ranked lowest in internet use and GDPR awareness; the web citizens ranked at about average values, while the data citizens ranked highest in both digital experience and GDPR knowledge and use. The fourth identified cluster, the social netizens, had a discordant profile, with remarkably high social network use, below-average online shopping experiences, and low GDPR awareness. Digitalization in human capital and general internet use is a strong country-level correlate of the national frequency of the data citizen type. Our results confirm previous studies of the low privacy awareness and skills associated with intense social media consumption, but we found that young generations are evenly divided between the rather carefree social netizens and the strongly invested data citizens. In order to achieve the full potential of the GDPR in changing surveillance practices while fostering consumer trust and responsible use of Big Data, policymakers should more effectively engage the digitally connected yet politically disconnected social netizens, while energizing the data citizens and the web citizens into proactive actions for defending the fundamental rights to private life and data protection. |
Keywords: | Privacy awareness; data citizenship; GDPR; Eurobarometer survey; cluster analysis |
JEL: | Y80 |
Date: | 2021–09 |
URL: | http://d.repec.org/n?u=RePEc:pra:mprapa:109117&r= |
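An exploratory k-means step like the one described can be sketched with scikit-learn; the two synthetic features below merely stand in for the standardized digital-experience and GDPR-awareness dimensions, and the blob data is invented:

```python
# Exploratory k-means with four clusters over two features, mirroring
# the (digital experience, GDPR awareness) plane of the study.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, n_features=2, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(sorted({int(l) for l in labels}))  # the four cluster labels
```

In practice the number of clusters is itself a modelling choice, typically checked with inertia or silhouette diagnostics before interpreting the groups as citizen types.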
By: | Ludovic Goudenège; Andrea Molent; Antonino Zanette |
Abstract: | Evaluating moving average options is a tough computational challenge for the energy and commodity market, as the payoff of the option depends on the prices of a certain underlying observed over a moving window; when a long window is considered, the pricing problem becomes high-dimensional. We present an efficient method for pricing Bermudan-style moving average options, based on Gaussian Process Regression and Gauss-Hermite quadrature, thus named GPR-GHQ. Specifically, the proposed algorithm proceeds backward in time and, at each time-step, the continuation value is computed only in a few points by using Gauss-Hermite quadrature, and then it is learned through Gaussian Process Regression. We test the proposed approach in the Black-Scholes model, where the GPR-GHQ method is made even more efficient by exploiting the positive homogeneity of the continuation value, which allows one to reduce the problem size. Positive homogeneity is also exploited to develop a binomial Markov chain, which is able to deal efficiently with medium-long windows. Secondly, we test GPR-GHQ in the Clewlow-Strickland model, the reference framework for modeling prices of energy commodities. Finally, we consider a challenging problem that involves a doubly non-Markovian feature: the rough Bergomi model. In this case, the pricing problem is even harder since the whole history of the volatility process impacts the future distribution of the process. The manuscript includes a numerical investigation, which shows that GPR-GHQ is very accurate and is able to handle options with a very long window, thus overcoming the problem of high dimensionality. |
Date: | 2021–08 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2108.11141&r= |
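The two building blocks of GPR-GHQ can each be exercised in isolation with standard tools: Gauss-Hermite nodes for a Gaussian expectation, and a Gaussian process fit to a few evaluation points. Both examples are illustrative toys, not the paper's pricing code:

```python
# Two GPR-GHQ ingredients exercised separately on toy inputs.
import numpy as np
from numpy.polynomial.hermite import hermgauss
from sklearn.gaussian_process import GaussianProcessRegressor

# 1) Gauss-Hermite quadrature of a Gaussian expectation:
#    E[Z^2] = 1 for Z ~ N(0,1), exact here up to quadrature precision.
nodes, weights = hermgauss(10)
ez2 = (weights * (np.sqrt(2.0) * nodes) ** 2).sum() / np.sqrt(np.pi)
print(round(ez2, 6))  # 1.0

# 2) Gaussian Process Regression: learn a smooth function from a few
#    points, as the continuation value is learned at each time-step.
x = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
gpr = GaussianProcessRegressor().fit(x, np.sin(3.0 * x).ravel())
pred = float(gpr.predict(np.array([[0.5]]))[0])
print(round(pred, 3))  # close to sin(1.5)
```

In the actual algorithm the quadrature computes the conditional expectation of the next-step value at a handful of states, and the GP interpolates those values over the whole state space.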
By: | Dörr, Julian Oliver; Kinne, Jan; Lenz, David; Licht, Georg; Winker, Peter |
Abstract: | Usually, official and survey-based statistics guide policy makers in their choice of response instruments to economic crises. However, in an early phase, after a sudden and unforeseen shock has caused incalculable and fast-changing dynamics, data from traditional statistics are only available with non-negligible time delays. This leaves policy makers uncertain about how to most effectively manage their economic countermeasures to support businesses, especially when they need to respond quickly, as in the COVID-19 pandemic. Given this information deficit, we propose a framework that guides policy makers throughout all stages of an unforeseen economic shock by providing timely and reliable data as a basis to make informed decisions. We do so by combining early-stage 'ad hoc' web analyses, 'follow-up' business surveys, and 'retrospective' analyses of firm outcomes. A particular focus of our framework is on assessing the early effects of the pandemic, using highly dynamic and large-scale data from corporate websites. Most notably, we show that textual references to the coronavirus pandemic published on a large sample of company websites, combined with state-of-the-art text analysis methods, allow us to capture the heterogeneity of the crisis' effects at a very early stage and provide a leading indicator of later movements in firm credit ratings. |
Keywords: | COVID-19, impact assessment, corporate sector, corporate websites, web mining, NLP |
JEL: | C38 C45 C55 C80 H12 |
Date: | 2021 |
URL: | http://d.repec.org/n?u=RePEc:zbw:zewdip:21062&r= |
By: | Li, Ran; Xu, Yuetong; Chen, Jian; Qi, Danyi |
Keywords: | Marketing, Research Methods/Statistical Methods, Food Consumption/Nutrition/Food Safety |
Date: | 2021–08 |
URL: | http://d.repec.org/n?u=RePEc:ags:aaea21:312878&r= |