on Big Data |
By: | Steyn, Dimitri H. W.; Greyling, Talita; Rossouw, Stephanie; Mwamba, John M. |
Abstract: | This paper investigates the predictability of stock market movements using text data extracted from the social media platform Twitter. We analyse text data to determine the sentiment and the emotion embedded in the Tweets and use them as explanatory variables to predict stock market movements. The study contributes to the literature by analysing high-frequency data and comparing the results obtained from analysing emerging and developed markets, respectively. To this end, the study uses three machine learning classification algorithms: Naïve Bayes, K-Nearest Neighbours and the Support Vector Machine. Furthermore, we use several evaluation metrics, such as precision, recall, specificity and the F1 score, to test and compare the performance of these algorithms. Lastly, we use the K-Fold Cross-Validation technique to validate the results of our machine learning models and a Variable Importance Analysis to show which variables play an important role in the predictions of our models. The predictability of the market movements is estimated by first including sentiment only and then sentiment with emotions. Our results indicate that investor sentiment and emotions derived from stock market-related Tweets are significant predictors of stock market movements, not only in developed markets but also in emerging markets. |
Keywords: | Sentiment Analysis, Classification, Stock Prediction, Machine Learning |
JEL: | C6 C8 G0 |
Date: | 2020 |
URL: | http://d.repec.org/n?u=RePEc:zbw:glodps:502&r=all |
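A minimal sketch of the kind of pipeline the abstract describes: tweet-derived sentiment and emotion features classifying market direction with Naïve Bayes, k-NN and SVM under k-fold cross-validation. The synthetic features and labels are illustrative assumptions, not the paper's data.

```python
# Sketch: classify daily market direction from tweet-derived sentiment and
# emotion features, mirroring the Naive Bayes / k-NN / SVM comparison above.
# Features, labels and the 5-fold setup are illustrative assumptions.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                         # e.g. sentiment, joy, fear, anger, sadness
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)  # 1 = market up, 0 = market down

scoring = ["precision", "recall", "f1"]               # specificity would need a custom scorer
for name, clf in [("Naive Bayes", GaussianNB()),
                  ("k-NN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVM", SVC(kernel="rbf"))]:
    cv = cross_validate(clf, X, y, cv=5, scoring=scoring)
    print(name, {m: round(cv[f"test_{m}"].mean(), 3) for m in scoring})
```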
By: | Artur Strzelecki |
Abstract: | The recent emergence of a new coronavirus, COVID-19, has gained extensive coverage in public media and global news. As of 24 March 2020, the virus has caused viral pneumonia in tens of thousands of people in Wuhan, China, and thousands of cases in 184 other countries and territories. This study explores the potential use of Google Trends (GT) to monitor worldwide interest in this COVID-19 epidemic. GT was chosen as a source of reverse engineering data, given the interest in the topic. Current data on COVID-19 is retrieved from GT using one main search topic: Coronavirus. Geographical settings for GT are worldwide, China, South Korea, Italy and Iran. The reported period is 15 January 2020 to 24 March 2020. The results show that the highest worldwide peak in the first wave of demand for information was on 31 January 2020. After the first peak, the number of new cases reported daily rose for 6 days. A second wave started on 21 February 2020 after the outbreaks were reported in Italy, with the highest peak on 16 March 2020. The second wave is six times as big as the first wave, and the number of new cases reported daily is still rising. This short communication gives a brief introduction to how the demand for information on the coronavirus epidemic is reported through GT. |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2003.10998&r=all |
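This kind of pull can be reproduced with pytrends, an unofficial Google Trends client; a minimal sketch follows, in which the plain search term stands in for the curated "Coronavirus" topic the paper used.

```python
# Sketch: retrieve worldwide and country-level interest for the study period
# with pytrends, an unofficial Google Trends client (pip install pytrends).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(["Coronavirus"], timeframe="2020-01-15 2020-03-24", geo="")
worldwide = pytrends.interest_over_time()            # geo="" means worldwide
print(worldwide["Coronavirus"].idxmax())             # date of the highest peak

for geo in ["CN", "KR", "IT", "IR"]:                 # China, South Korea, Italy, Iran
    pytrends.build_payload(["Coronavirus"], timeframe="2020-01-15 2020-03-24", geo=geo)
    print(geo, pytrends.interest_over_time()["Coronavirus"].idxmax())
```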
By: | Kohei Maehashi (School of Engineering, The University of Tokyo); Mototsugu Shintani (Faculty of Economics, The University of Tokyo) |
Abstract: | We perform a thorough comparative analysis of factor models and machine learning to forecast Japanese macroeconomic time series. Our main results can be summarized as follows. First, factor models and machine learning perform better than the conventional AR model in many cases. Second, predictions made by machine learning methods perform particularly well for medium to long forecast horizons. Third, the success of machine learning mainly comes from the nonlinearity and interaction of variables, suggesting the importance of nonlinear structure in predicting the Japanese macroeconomic series. Fourth, while neural networks are helpful in forecasting, simply adding many hidden layers does not necessarily enhance forecast accuracy. Fifth, the composite forecast of factor models and machine learning performs better than factor models or machine learning alone, and machine learning methods applied to principal components are found to be useful in the composite forecast. |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:tky:fseres:2020cf1146&r=all |
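A stylized sketch of the composite idea reported above: machine learning applied to principal components, averaged with a factor-model forecast. The synthetic data, equal weights and forecast split are placeholder assumptions, not the paper's design.

```python
# Sketch: ML on principal components plus an equal-weight composite forecast,
# in the spirit of the factor-model / machine-learning comparison above.
# Synthetic data stand in for the Japanese macro panel.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))                 # panel of predictors
y = X[:, :3].sum(axis=1) + rng.normal(size=200)

factors = PCA(n_components=5).fit_transform(X)
train, test = slice(0, 150), slice(150, 200)

f_hat = LinearRegression().fit(factors[train], y[train]).predict(factors[test])
ml_hat = RandomForestRegressor(n_estimators=300, random_state=0).fit(
    factors[train], y[train]).predict(factors[test])
composite = 0.5 * (f_hat + ml_hat)             # equal-weight composite forecast

for name, pred in [("factor", f_hat), ("ML-on-PCs", ml_hat), ("composite", composite)]:
    print(name, round(float(np.sqrt(np.mean((y[test] - pred) ** 2))), 3))
```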
By: | Paolo Massaro (Bank of Italy); Ilaria Vannini (Bank of Italy); Oliver Giudice (Bank of Italy) |
Abstract: | We implement machine learning techniques to obtain an automatic classification by sector of economic activity of the Italian companies recorded in the Bank of Italy Entities Register. To this end, first we extract a sample of correctly classified corporations from the universe of Italian companies. Second, we select a set of features that are related to the sector of economic activity code and use these to implement supervised approaches to infer output predictions. We choose a multi-step approach based on the hierarchical structure of the sector classification. Because of the imbalance in the target classes, at each step, we first apply two resampling procedures – random oversampling and the Synthetic Minority Over-sampling Technique – to get a more balanced training set. Then, we fit Gradient Boosting and Support Vector Machine models. Overall, the performance of our multi-step classifier yields very reliable predictions of the sector code. This approach can be employed to make the whole classification process more efficient by reducing the area of manual intervention. |
Keywords: | machine learning, entities register, classification by institutional sector |
JEL: | C18 C81 G21 |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:bdi:opques:qef_548_20&r=all |
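A minimal sketch of one step of such a classifier: rebalance the training set with SMOTE, then fit gradient boosting. The imbalanced-learn package and the synthetic class imbalance are assumptions for illustration.

```python
# Sketch: one step of the multi-step classifier above -- rebalance with SMOTE,
# then fit gradient boosting (pip install imbalanced-learn). Synthetic data.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # balance the training set only
clf = GradientBoostingClassifier().fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te)))
```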
By: | Cerulli, Giovanni |
Abstract: | We present a Super-Learning Machine (SLM) to predict economic outcomes, which improves prediction (i) by cross-validated optimal tuning and (ii) by comparing and combining results from different learners. Our application to a labor economics dataset shows that different learners may behave differently. However, combining learners into one singleton super-learner proves to preserve good predictive accuracy while lowering the variance more than stand-alone approaches. |
Keywords: | Machine learning; Ensemble methods; Optimal prediction |
JEL: | C53 C61 C63 |
Date: | 2020–03–10 |
URL: | http://d.repec.org/n?u=RePEc:pra:mprapa:99111&r=all |
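A super-learner-style combination can be sketched with scikit-learn's cross-validated stacking; this is a generic stand-in under assumed learners and data, not the paper's exact SLM.

```python
# Sketch: a super-learner-style combination of heterogeneous learners via
# cross-validated stacking; out-of-fold predictions feed the final combiner.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)
super_learner = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(random_state=0)),
                ("lasso", LassoCV()),
                ("knn", KNeighborsRegressor())],
    final_estimator=RidgeCV(), cv=5)
print(round(cross_val_score(super_learner, X, y, cv=5, scoring="r2").mean(), 3))
```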
By: | Elena Dumitrescu; Sullivan Hué; Christophe Hurlin (University of Orleans - LEO); Sessi Tokpavi (LEO - Laboratoire d'économie d'Orleans - UO - Université d'Orléans - CNRS - Centre National de la Recherche Scientifique) |
Abstract: | Decision trees and related ensemble methods like random forest are state-of-the-art tools in the field of machine learning for credit scoring. Although they are shown to outperform logistic regression, they lack interpretability, and this drastically reduces their use in the credit risk management industry, where decision-makers and regulators need transparent score functions. This paper proposes to get the best of both worlds, introducing a new, simple and interpretable credit scoring method which uses information from decision trees to improve the performance of logistic regression. Formally, rules extracted from various short-depth decision trees built with couples of predictive variables are used as predictors in a penalized logistic regression. By modeling such univariate and bivariate threshold effects, we achieve significant improvement in model performance for the logistic regression while preserving its simple interpretation. Applications using simulated data and four real credit default datasets show that our new method outperforms traditional logistic regressions. Moreover, it compares competitively to random forest, while providing an interpretable scoring function. |
Keywords: | Credit scoring, Machine Learning, Risk management, Interpretability, Econometrics |
JEL: | G10 C25 C53 |
Date: | 2020–03–13 |
URL: | http://d.repec.org/n?u=RePEc:hal:wpaper:hal-02507499&r=all |
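A sketch of the core idea: extract threshold rules from short-depth trees built on couples of predictors, then use the resulting binary indicators in a penalized logistic regression. This is a simplified reading of the method on synthetic data, not the authors' exact algorithm.

```python
# Sketch: leaf membership of depth-2 trees on predictor couples encodes the
# univariate/bivariate threshold rules; one-hot leaf ids become rule
# indicators for an L1-penalized logistic regression.
from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

leaves = []
for i, j in combinations(range(X.shape[1]), 2):      # every couple of predictors
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X[:, [i, j]], y)
    leaves.append(tree.apply(X[:, [i, j]]))          # leaf ids encode threshold rules

R = OneHotEncoder().fit_transform(np.column_stack(leaves))  # binary rule indicators
score = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(R, y)
print("rule predictors:", R.shape[1], "| in-sample accuracy:", round(score.score(R, y), 3))
```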
By: | Fabio Zambuto (Bank of Italy); Maria Rosaria Buzzi (Bank of Italy); Giuseppe Costanzo (Bank of Italy); Marco Di Lucido (Bank of Italy); Barbara La Ganga (Bank of Italy); Pasquale Maddaloni (Bank of Italy); Fabio Papale (Bank of Italy); Emiliano Svezia (Bank of Italy) |
Abstract: | We propose a new methodology, based on machine learning algorithms, for the automatic detection of outliers in the data that banks report to the Bank of Italy. Our analysis focuses on granular data gathered within the statistical data collection on payment services, in which the lack of strong ex ante deterministic relationships among the collected variables makes standard diagnostic approaches less powerful. Quantile regression forests are used to derive a region of acceptance for the targeted information. For a given level of probability, plausibility thresholds are obtained on the basis of individual bank characteristics and are automatically updated as new data are reported. The approach was applied to validate semi-annual data on debit card issuance received from reporting agents between December 2016 and June 2018. The algorithm was trained with data reported in previous periods and tested by cross-checking the identified outliers with the reporting agents. The method made it possible to detect, with a high level of precision in terms of false positives, new outliers that had not been detected using the standard procedures. |
Keywords: | banking data, data quality management, outlier detection, machine learning, quantile regression, random forests |
JEL: | C18 C81 G21 |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:bdi:opques:qef_547_20&r=all |
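A minimal sketch of the acceptance-region idea: a plausibility band conditional on bank characteristics, with reports outside the band flagged for review. The paper uses quantile regression forests; gradient boosting with quantile loss is used here as a readily available stand-in, and the data are synthetic.

```python
# Sketch: build a conditional plausibility band [q05, q95] for a reported
# variable and flag reports outside it; quantile-loss gradient boosting
# stands in for the paper's quantile regression forests.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))                           # bank characteristics
y = 10 + 3 * X[:, 0] + rng.normal(scale=2, size=1000)    # reported variable

lo = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
hi = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)

new_X, new_y = X[:5], y[:5].copy()
new_y[0] += 25                                           # inject an implausible report
outlier = (new_y < lo.predict(new_X)) | (new_y > hi.predict(new_X))
print(outlier)                                           # True flags go to manual review
```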
By: | Francesco Bloise; Paolo Brunori (University of Florence); Patrizio Piraino (University of Notre Dame) |
Abstract: | Much of the global evidence on intergenerational income mobility is based on sub-optimal data. In particular, two-stage techniques are widely used to impute parental incomes for analyses of developing countries and for estimating long-run trends across multiple generations and historical periods. We propose a machine learning method that may improve the reliability and comparability of such estimates. Our approach minimizes the out-of-sample prediction error in the parental income imputation, which provides an objective criterion for choosing across different specifications of the first-stage equation. We apply the method to data from the United States and South Africa to show that under common conditions it can limit the bias generally associated with mobility estimates based on imputed parental income. |
Keywords: | Intergenerational elasticity; income; mobility; elastic net; regularization; PSID; South Africa |
JEL: | J62 D63 C18 |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:inq:inqwps:ecineq2020-526&r=all |
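A minimal sketch of the first stage: a cross-validated elastic net chooses the imputation specification by out-of-sample prediction error. Variable names and data are illustrative, not the PSID or South African codebooks.

```python
# Sketch: first-stage parental income imputation selected by cross-validated
# out-of-sample error, using an elastic net; synthetic placeholder data.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
Z = rng.normal(size=(800, 12))        # e.g. parental education, occupation, region
log_parent_income = Z[:, 0] - 0.5 * Z[:, 1] + rng.normal(size=800)

first_stage = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(Z, log_parent_income)
imputed = first_stage.predict(Z)      # feeds the second-stage mobility regression
print("chosen alpha:", first_stage.alpha_, "l1_ratio:", first_stage.l1_ratio_)
```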
By: | Johannes Dahlke; Kristina Bogner; Matthias Mueller; Thomas Berger; Andreas Pyka; Bernd Ebersberger |
Abstract: | In recent years, many scholars praised the seemingly endless possibilities of using machine learning (ML) techniques in and for agent-based simulation models (ABM). To get a more comprehensive understanding of these possibilities, we conduct a systematic literature review (SLR) and classify the literature on the application of ML in and for ABM according to a theoretically derived classification scheme. We do so to investigate how exactly machine learning has been utilized in and for agent-based models so far and to critically discuss the combination of these two promising methods. We find that, indeed, there is a broad range of possible applications of ML to support and complement ABMs in many different ways, already applied in many different disciplines. We see that, so far, ML is mainly used in ABM for two broad cases: first, the modelling of adaptive agents equipped with experience learning and, second, the analysis of outcomes produced by a given ABM. While these are the most frequent, a variety of further interesting applications also exists. This being the case, researchers should dive deeper into the analysis of when and how particular kinds of ML techniques can support ABM, e.g. by conducting a more in-depth analysis and comparison of different use cases. Nonetheless, as the application of ML in and for ABM comes at certain costs, researchers should not use ML for ABMs just for the sake of doing so. |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2003.11985&r=all |
By: | Debnath, Ramit; Darby, Sarah; Bardhan, Ronita; Mohaddes, Kamiar; Sunikka-Blank, Minna |
Abstract: | Text-based data sources like narratives and stories have become increasingly popular as critical insight generators in energy research and social science. However, their implications in policy application usually remain superficial and fail to fully exploit the state-of-the-art resources which the digital era holds for text analysis. This paper illustrates the potential of deep-narrative analysis in energy policy research using text analysis tools from the cutting-edge domain of computational social sciences, notably topic modelling. We argue that a nested application of topic modelling and grounded theory in narrative analysis promises advances in areas where manual-coding-driven narrative analysis has traditionally struggled with directionality biases, scaling, systematisation and repeatability. The nested application of topic modelling and grounded theory goes beyond the frequentist approach of narrative analysis and introduces insight generation capabilities based on the probability distribution of words and topics in a text corpus. In this manner, our proposed methodology deconstructs the corpus and enables the analyst to answer research questions based on the foundational elements of the text data structure. We verify the theoretical and epistemological fit of the proposed nested methodology through a meta-analysis of a state-of-the-art bibliographic database on energy policy and computational social science. We find that the nested application contributes to the literature gap on the need for multidisciplinary polyvalence methodologies that can systematically include qualitative evidence into policymaking. |
Date: | 2020–03–27 |
URL: | http://d.repec.org/n?u=RePEc:osf:socarx:hvcb5&r=all |
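A minimal sketch of the topic-modelling half of the nested design: LDA over a document-term matrix. The example narrative fragments are invented placeholders, not the paper's corpus.

```python
# Sketch: LDA over a toy corpus of narrative fragments, the probabilistic
# topic-modelling step of the nested methodology described above.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

narratives = [
    "load shedding forces the household to cook earlier in the evening",
    "prepaid electricity meters change how families budget for energy",
    "solar panels on the roof cut the monthly electricity bill",
]
vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(narratives)                  # document-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

vocab = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [vocab[i] for i in topic.argsort()[-4:]])
```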
By: | Flor, Nick V. (University of New Mexico) |
Abstract: | The document term matrix (“DTM”) is a representation of a collection of documents, and is a key input to many machine learning algorithms. It can be applied to a collection of tweets as well. I give the set-predicate formalism for the tweet term matrix (“TTM”), and the tweet bio-term matrix (“TBTM”). |
Date: | 2020–03–09 |
URL: | http://d.repec.org/n?u=RePEc:osf:socarx:tp5mu&r=all |
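As a concrete counterpart of the set-predicate formalism, a tweet-term matrix can be built directly: rows are tweets, columns are terms. The tweets below are invented examples.

```python
# Sketch: a tweet-term matrix (TTM), i.e. a document-term matrix whose
# documents are tweets.
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["machine learning moves markets",
          "markets fall on virus news",
          "learning from tweet data"]
vec = CountVectorizer()
ttm = vec.fit_transform(tweets)          # rows: tweets, columns: terms
print(vec.get_feature_names_out())
print(ttm.toarray())
```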
By: | Jianbin Lin; Zhiqiang Zhang; Jun Zhou; Xiaolong Li; Jingli Fang; Yanming Fang; Quan Yu; Yuan Qi |
Abstract: | Ant Credit Pay is a consumer credit service of Ant Financial Service Group. As with credit cards, loan default is one of the major risks of this credit product. Hence, an effective algorithm for default prediction is the key to reducing losses and increasing profits for the company. However, the challenges faced in our scenario differ from those in conventional credit card services. The first is scalability. The huge volume of users and their behaviors in Ant Financial requires the ability to process industrial-scale data and perform model training efficiently. The second challenge is the cold-start problem. Unlike the manual review for credit card applications in conventional banks, the credit limit of Ant Credit Pay is automatically offered to users based on knowledge learned from big data. However, default prediction for new users suffers from a lack of credit behavior data. The proposal should therefore leverage other new data sources to alleviate the cold-start problem. Considering the above challenges and the special scenario in Ant Financial, we incorporate network information into default prediction to alleviate the cold-start problem. In this paper, we propose an industrial-scale distributed network representation framework, termed NetDP, for default prediction in Ant Credit Pay. The proposal exploits network information generated by various interactions between users, and blends unsupervised and supervised network representation in a unified framework for the default prediction problem. Moreover, we present a parameter-server-based distributed implementation of our proposal to handle the scalability challenge. Experimental results demonstrate the effectiveness of our proposal, especially for the cold-start problem, as well as its efficiency on industrial-scale datasets. |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2004.00201&r=all |
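NetDP itself is a distributed representation framework and is not reproduced here; as a far simpler stand-in for the idea of blending network information into default prediction, the sketch below derives hand-crafted graph features for each user and feeds them to a classifier alongside a behavioural feature. Graph, features and labels are all toy assumptions.

```python
# Sketch: network information (degree, clustering) from a toy interaction
# graph combined with a behavioural feature for default prediction; a crude
# stand-in for learned network representations.
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
G = nx.barabasi_albert_graph(300, 2, seed=4)    # toy user-interaction graph
deg = np.array([G.degree(n) for n in G.nodes()])
clust = np.array([nx.clustering(G, n) for n in G.nodes()])

behav = rng.normal(size=300)                    # toy behavioural feature
X = np.column_stack([behav, deg, clust])
y = (0.5 * behav - 0.05 * deg + rng.normal(size=300) > 0).astype(int)
print(LogisticRegression().fit(X, y).score(X, y))  # in-sample fit, illustration only
```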
By: | Xinyi Guo; Jinfeng Li |
Abstract: | A novel social network sentiment analysis model is proposed, based on a Twitter sentiment score (TSS), for real-time prediction of the future FTSE 100 stock market price, and compared with conventional econometric models of investor sentiment based on the closed-end fund discount (CEFD). The proposed TSS model features a new baseline correlation approach, which not only exhibits decent prediction accuracy but also reduces the computational burden and enables fast decision making without knowledge of historical data. Polynomial regression, classification modelling and lexicon-based sentiment analysis are performed using R. The obtained TSS predicts the future stock market trend 15 time samples (30 working hours) in advance with an accuracy of 67.22% using the proposed baseline criterion, without reference to historical TSS or market data. In particular, TSS's prediction performance for an upward market is found to be far better than for a downward market. Under logistic regression and linear discriminant analysis, the accuracy of TSS in predicting an upward trend of the future market reaches 97.87%. |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2003.08137&r=all |
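The paper works in R; as a minimal Python analogue of the pipeline, a lexicon-based sentiment score can feed a direction classifier. NLTK's VADER lexicon stands in for the paper's lexicon, and the two tweets and labels are invented.

```python
# Sketch: lexicon-based tweet sentiment scores (VADER, via NLTK) feeding a
# logistic regression that predicts market direction; illustration only.
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")
from sklearn.linear_model import LogisticRegression

tweets = ["FTSE rallies on strong earnings", "markets slump as fears grow"]
sia = SentimentIntensityAnalyzer()
tss = np.array([[sia.polarity_scores(t)["compound"]] for t in tweets])

y = np.array([1, 0])                       # 1 = market up over the next window
clf = LogisticRegression().fit(tss, y)     # two observations: illustration only
print(tss.ravel(), clf.predict(tss))
```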
By: | Jesus Fernandez-Villaverde (University of Pennsylvania) |
Abstract: | Can artificial intelligence, in particular, machine learning algorithms, replace the idea of simple rules, such as first possession and voluntary exchange in free markets, as a foundation for public policy? This paper argues that the preponderance of the evidence sides with the interpretation that while artificial intelligence will help public policy along important aspects, simple rules will remain the fundamental guideline for the design of institutions and legal environments where markets operate. “Digital socialism” might be a hipster thing to talk about in Williamsburg or Shoreditch, but it is as much of a chimera as “analog socialism.” |
Keywords: | Artificial intelligence, machine learning, economics, law, rule of law |
JEL: | D85 H10 H30 |
Date: | 2020–03–20 |
URL: | http://d.repec.org/n?u=RePEc:pen:papers:20-010&r=all |
By: | Nicola Branzoli (Bank of Italy); Ilaria Supino (Bank of Italy) |
Abstract: | FinTech credit has attracted significant attention from academics and policymakers in recent years. Given its growing importance, in this paper we provide an overview of the empirical research on FinTech credit to households and non-financial corporations (NFCs). We focus on three broad topics: i) the factors supporting the development of innovative business models for credit intermediation, such as marketplace lending; ii) the benefits of new credit risk assessment data and methods; iii) the implications of these innovations for access to credit. Three main messages emerge from the literature. First, the growth of lenders with innovative business models is mainly driven by the degree of local economic development and of competition in the banking sector. Second, new data and methods can improve traditional credit risk models because they are particularly helpful in screening opaque borrowers, such as those with scant credit history. Third, FinTech borrowers generally lack (or have limited) access to finance and tend to be riskier than traditional bank borrowers. |
Keywords: | artificial intelligence, credit, digital technologies, FinTech, marketplace lending |
JEL: | G21 G22 G23 G24 |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:bdi:opques:qef_549_20&r=all |
By: | Yang Chen; Emerson Li |
Abstract: | Stock prices are influenced over time by underlying macroeconomic factors. Moving beyond conventional assumptions about the unpredictability of market noise, we model the changes of stock prices over time through the Markov Decision Process, a discrete stochastic control process that aids decision making in situations that are partly random. We then apply "Region of Interest" (RoI) pooling to the stock time-series graphs in order to predict future prices from existing ones. A Generative Adversarial Network (GAN), built on a competing pair of supervised learning algorithms, is then used to regenerate future stock price projections on a real-time basis. The supervised learning algorithm used in this research is, moreover, original to this study and will have wider uses. With this ensemble of algorithms, we are able to identify to what extent each specific macroeconomic factor influences the change of the Brownian/random market movement. In addition, our model will have a wider influence on the predictions of other Brownian movements. |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:arx:papers:2003.11473&r=all |
By: | Fabio Montobbio (Dipartimento di Politica Economica, DISCE, Università Cattolica del Sacro Cuore – BRICK, Collegio Carlo Alberto, Torino – ICRIOS, Bocconi University, Milano); Jacopo Staccioli (Dipartimento di Politica Economica, DISCE, Università Cattolica del Sacro Cuore – Institute of Economics, Scuola Superiore Sant’Anna, Pisa); Maria Enrica Virgillito (Institute of Economics, Scuola Superiore Sant’Anna, Pisa – Dipartimento di Politica Economica, DISCE, Università Cattolica del Sacro Cuore); Marco Vivarelli (Dipartimento di Politica Economica, DISCE, Università Cattolica del Sacro Cuore – UNU-MERIT, Maastricht, The Netherlands – IZA, Bonn, Germany) |
Abstract: | This paper investigates the presence of explicit labour-saving heuristics within robotic patents. It analyses innovative actors engaged in robotic technology and their economic environment (identity, location, industry), and identifies the technological fields particularly exposed to labour-saving innovations. It exploits advanced natural language processing and probabilistic topic modelling techniques on the universe of patent applications at the USPTO between 2009 and 2018, matched with the ORBIS (Bureau van Dijk) firm-level dataset. The results show that labour-saving patent holders comprise not only robot producers but also adopters. Consequently, labour-saving robotic patents appear along the entire supply chain. The paper shows that labour-saving innovations challenge manual activities (e.g. in the logistics sector), activities entailing social intelligence (e.g. in the healthcare sector) and cognitive skills (e.g. learning and predicting). |
Keywords: | Robotic Patents, Labour-Saving Technology, Search Heuristics, Probabilistic Topic Models |
JEL: | O33 J24 C38 |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:ctc:serie5:dipe0009&r=all |
By: | J.W.A.M. Steegmans |
Abstract: | This study aims to provide insights into the correct usage of Google search data, which are available through Google Trends. The main focus is on the effects of sampling error in these data, which are ignored by most scholars using Google Trends. To demonstrate the effect, a housing market application is used; that is, the relationship between online search activity for mortgages and real housing market activity is investigated. A simple time series model, based on Van Veldhuizen, Vogt, and Voogt (2016), is estimated that explains house transactions using Google search data for mortgages. The results show that the effects of sampling errors are substantial. It is also stressed that in this particular application of Google Trends data, 'predetermined' transactions (house sales where the purchase contract has been signed but the conveyance has not yet occurred) should be excluded, as they lead to an overestimation of the effects of mortgage searches. All in all, the application of Google Trends data in economic applications remains promising. However, far more attention should be given to the limitations of these data. |
Date: | 2019–08 |
URL: | http://d.repec.org/n?u=RePEc:use:tkiwps:1911&r=all |
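Sampling error of this kind can be made visible by drawing the same query repeatedly and comparing draws; a minimal sketch with the unofficial pytrends client follows. The Dutch term and timeframe are illustrative assumptions, and identical draws can occur when Google serves a cached sample.

```python
# Sketch: quantify Google Trends sampling error by pulling the same query
# several times and measuring the dispersion across draws.
import time
import pandas as pd
from pytrends.request import TrendReq

draws = []
for i in range(5):
    tr = TrendReq(hl="nl-NL", tz=60)
    tr.build_payload(["hypotheek"], timeframe="2014-01-01 2019-01-01", geo="NL")
    draws.append(tr.interest_over_time()["hypotheek"].rename(f"draw_{i}"))
    time.sleep(10)                     # be polite to the endpoint

panel = pd.concat(draws, axis=1)
print(panel.std(axis=1).describe())    # cross-draw dispersion = sampling error
```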
By: | Kaiser, Ulrich (University of Zurich); Kuhn, Johan Moritz (EPAC) |
Abstract: | Can publicly available, web-scraped data be used to identify promising business startups at an early stage? To answer this question, we use textual and non-textual information about the names of Danish firms and their addresses, as well as their business purpose statements (BPSs), supplemented by core accounting information along with founder and initial startup characteristics, to forecast the performance of newly started enterprises over a five-year time horizon. The performance outcomes we consider are involuntary exit, above-average employment growth, a return on assets of above 20 percent, new patent applications and participation in an innovation subsidy program. Our first key finding is that our models predict startup performance with either high or very high accuracy, with the exception of high returns on assets, where predictive power remains poor. Our second key finding is that the data requirements for predicting performance outcomes with such accuracy are low. To forecast the two innovation-related performance outcomes well, we only need to include a set of variables derived from the BPS texts, while an accurate prediction of startup survival and high employment growth needs the combination of (i) information derived from the names of the startups, (ii) data on elementary founder-related characteristics and (iii) either variables describing the initial characteristics of the startup (to predict startup survival) or business purpose statement information (to predict high employment growth). These sets of variables are easily obtainable since the underlying information is mandatory to report upon business registration. The substantial accuracy of our predictions for survival, employment growth, new patents and participation in innovation subsidy programs indicates ample scope for algorithmic scoring models as an additional pillar of funding and innovation support decisions. |
Keywords: | startup, performance, prediction, text as data |
JEL: | L26 C53 |
Date: | 2020–03 |
URL: | http://d.repec.org/n?u=RePEc:iza:izadps:dp13029&r=all |
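A minimal sketch of the text-as-data setup: TF-IDF features from business purpose statements feeding a logistic regression for one outcome. The statements and labels below are invented placeholders, not the Danish registry data.

```python
# Sketch: predict a startup outcome from BPS text alone via TF-IDF features
# and logistic regression, echoing the text-as-data approach above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

bps = ["development of machine learning software for logistics",
       "retail sale of clothing and accessories",
       "biotech research and patent licensing",
       "operation of a neighbourhood restaurant"]
innovated = [1, 0, 1, 0]               # e.g. filed a patent within five years

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(bps, innovated)
print(model.predict(["ai research laboratory"]))
```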