nep-big New Economics Papers
on Big Data
Issue of 2021‒03‒15
27 papers chosen by
Tom Coupé
University of Canterbury

  1. Big data and machine learning in central banking By Sebastian Doerr; Leonardo Gambacorta; José María Serena Garralda
  2. Understanding the performance of machine learning models to predict credit default: a novel approach for supervisory evaluation By Andrés Alonso; José Manuel Carbó
  3. Explainable AI in Credit Risk Management By Branka Hadji Misheva; Joerg Osterrieder; Ali Hirsa; Onkar Kulkarni; Stephen Fung Lin
  4. Accepting the Future as Unforeseeable: Sensemaking by Professionals in the Rise of Artificial Intelligence By Masashi Goto
  5. DeepSets and their derivative networks for solving symmetric PDEs By Maximilien Germain; Mathieu Laurière; Huyên Pham; Xavier Warin
  6. Gender distribution across topics in Top 5 economics journals: A machine learning approach By J. Ignacio Conde-Ruiz; Juan José Ganuza; Manu Garcia; Luis A. Puch
  7. Panel semiparametric quantile regression neural network for electricity consumption forecasting By Xingcai Zhou; Jiangyan Wang
  8. Return on Investment on AI: The Case of Capital Requirement By Henri Fraisse; Matthias Laporte
  9. Confronting Machine Learning With Financial Research By Kristof Lommers; Ouns El Harzli; Jack Kim
  10. Forex exchange rate forecasting using deep recurrent neural networks By Dautel, Alexander Jakob; Härdle, Wolfgang Karl; Lessmann, Stefan; Seow, Hsin-Vonn
  11. A Machine Learning Based Regulatory Risk Index for Cryptocurrencies By Ni, Xinwen; Härdle, Wolfgang Karl; Xie, Taojun
  12. The LOB Recreation Model: Predicting the Limit Order Book from TAQ History Using an Ordinary Differential Equation Recurrent Neural Network By Zijian Shi; Yu Chen; John Cartlidge
  13. Using Machine Learning for Measuring Democracy: An Update By Klaus Gründler; Tommy Krieger
  14. Prediction of Attrition in Large Longitudinal Studies: Tree-based methods versus Multinomial Logistic Models By Best, Katherine Laura; Speyer, Lydia Gabriela; Murray, Aja Louise; Ushakova, Anastasia
  15. How People Pay Each Other: Data, Theory, and Calibrations By Claire Greene; Oz Shy
  16. A mathematical model for automatic differentiation in machine learning By Bolte, Jérôme; Pauwels, Edouard
  17. Can Machine Learning Catch the COVID-19 Recession? By Philippe Goulet Coulombe; Massimiliano Marcellino; Dalibor Stevanovic
  18. Learning Organization using Conversational Social Network for Social Customer Relationship Management Effort By Andry Alamsyah; Yahya Peranginangin; Gabriel Nurhadi
  19. Sales Prediction Model Using Classification Decision Tree Approach For Small Medium Enterprise Based on Indonesian E-Commerce Data By Raden Johannes; Andry Alamsyah
  20. Slow-Growing Trees By Philippe Goulet Coulombe
  21. Thinking outside the container: A machine learning approach to forecasting trade flows By Stamer, Vincent
  22. Service Data Analytics and Business Intelligence By Wu, Desheng Dang; Härdle, Wolfgang Karl
  23. Answering the Queen: Machine learning and financial crises By Jérémy Fouliard; Michael Howell; Hélène Rey
  24. On the Subbagging Estimation for Massive Data By Tao Zou; Xian Li; Xuan Liang; Hansheng Wang
  25. Forecasting the Stability and Growth Pact compliance using Machine Learning. By Kéa Baret; Amélie Barbier-Gauchard; Théophilos Papadimitriou
  26. Standing on the Shoulders of Machine Learning: Can We Improve Hypothesis Testing? By Gary Cornwall; Jeff Chen; Beau Sauley
  27. An economic perspective on data and platform market power By MARTENS Bertin

  1. By: Sebastian Doerr; Leonardo Gambacorta; José María Serena Garralda
    Abstract: This paper reviews the use of big data and machine learning in central banking, drawing on a recent survey conducted among the members of the Irving Fisher Committee (IFC). The majority of central banks discuss the topic of big data formally within their institution. Big data is used with machine learning applications in a variety of areas, including research, monetary policy and financial stability. Central banks also report using big data for supervision and regulation (suptech and regtech applications). Data quality, sampling and representativeness are major challenges for central banks, and so is legal uncertainty around data privacy and confidentiality. Several institutions report constraints in setting up an adequate IT infrastructure and in developing the necessary human capital. Cooperation among public authorities could improve central banks' ability to collect, store and analyse big data.
    Keywords: big data, central banks, machine learning, artificial intelligence, data science
    JEL: G17 G18 G23 G32
    Date: 2021–03
  2. By: Andrés Alonso (Banco de España); José Manuel Carbó (Banco de España)
    Abstract: In this paper we study the performance of several machine learning (ML) models for credit default prediction. We do so using a unique and anonymized database from a major Spanish bank. We compare the statistical performance of a simple and traditionally used model, logistic regression (Logit), with more advanced ones such as Lasso-penalized logistic regression, Classification And Regression Trees (CART), random forest, XGBoost and deep neural networks. Following the process deployed for the supervisory validation of internal ratings-based (IRB) systems, we examine the benefits of using ML in terms of predictive power, both in classification and in calibration. By running a simulation exercise for different sample sizes and numbers of features, we are able to isolate the information advantage associated with access to large amounts of data, and to measure the ML model advantage. Although ML models outperform Logit both in classification and in calibration, more complex ML algorithms do not necessarily predict better. We then translate this statistical performance into economic impact by estimating the savings in regulatory capital when using ML models instead of a simpler model like Lasso to compute the risk-weighted assets. Our benchmark results show that implementing XGBoost could yield savings from 12.4% to 17% in terms of regulatory capital requirements under the IRB approach. This leads us to conclude that the potential benefits in economic terms for the institutions would be significant, which justifies further research to better understand all the risks embedded in ML models.
    Keywords: machine learning, credit risk, prediction, probability of default, IRB system
    JEL: C45 C38 G21
    Date: 2021–01
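    The kind of head-to-head comparison the abstract describes can be sketched on synthetic data (the paper itself uses a proprietary bank database; the dataset, models and settings below are illustrative stand-ins, with gradient boosted trees in place of XGBoost):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a credit-default dataset: imbalanced classes,
# a handful of informative features.
X, y = make_classification(n_samples=4000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gbt = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Classification performance: area under the ROC curve on held-out data.
auc_logit = roc_auc_score(y_te, logit.predict_proba(X_te)[:, 1])
auc_gbt = roc_auc_score(y_te, gbt.predict_proba(X_te)[:, 1])
print(f"Logit AUC: {auc_logit:.3f}  Boosted trees AUC: {auc_gbt:.3f}")
```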
  3. By: Branka Hadji Misheva; Joerg Osterrieder; Ali Hirsa; Onkar Kulkarni; Stephen Fung Lin
    Abstract: Artificial Intelligence (AI) has created the single biggest technology revolution the world has ever seen. For the finance sector, it provides great opportunities to enhance customer experience, democratize financial services, ensure consumer protection and significantly improve risk management. While it is easier than ever to run state-of-the-art machine learning models, designing and implementing systems that support real-world finance applications has been challenging, in large part because such models lack the transparency and explainability that are important for establishing reliable technology. This paper contributes to the research on this topic with a specific focus on applications in credit risk management. We implement two advanced post-hoc, model-agnostic explainability techniques, Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), on machine learning (ML)-based credit scoring models applied to the open-access data set offered by the US-based P2P lending platform Lending Club. Specifically, we use LIME to explain instances locally, and SHAP to obtain both local and global explanations. We discuss the results in detail and present multiple comparison scenarios using the various kernels available for explaining the graphs generated from SHAP values. We also discuss the practical challenges associated with implementing these state-of-the-art eXplainable AI (XAI) methods and document them for future reference. We have made an effort to document every technical aspect of this research, while at the same time providing a general summary of the conclusions.
    Date: 2021–03
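    LIME's core idea, fitting a locally weighted linear surrogate around one instance of a black-box scorer, can be sketched as follows (a synthetic dataset and a random forest stand in for the Lending Club data and credit models used in the paper; the perturbation scale and kernel width are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# Black-box credit-scoring stand-in on synthetic data.
X, y = make_classification(n_samples=2000, n_features=6, random_state=1)
model = RandomForestClassifier(random_state=1).fit(X, y)

def lime_style_explanation(model, x, n_samples=500, width=1.0, seed=0):
    """Fit a locally weighted linear surrogate around instance x.

    Returns per-feature coefficients: the local attribution of each
    feature to the black-box score (LIME's essential mechanism)."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))  # perturbations
    p = model.predict_proba(Z)[:, 1]                         # black-box output
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / width ** 2)                       # proximity kernel
    surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
    return surrogate.coef_

coefs = lime_style_explanation(model, X[0])
print("local feature attributions:", np.round(coefs, 3))
```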
  4. By: Masashi Goto (Research Institute for Economics and Business Administration, Kobe University, JAPAN)
    Abstract: The disruptive influence of digitalisation on professions has been actively debated. This study analyses the nature of professionals' sensemaking with the rise of artificial intelligence (AI). I conducted a qualitative study with interview and archival data on a Big Four audit firm in Japan as they considered the application of AI to their core audit service in 2017–2019. The study discovers three themes in the professionals' sensemaking: monitoring environmental change, evolving future blueprints and learning through experimentation. It also shows their acceptance of the future as inherently unforeseeable and their perception of endless sensemaking as a resolution to cope with the excessive environmental complexity. In their sensemaking, institutional factors—audit standards, societal expectations and trends in other firms—played an important role in setting what is acceptable in technology use, while technological factors set what is possible. The professionals continue to explore what the unique value of humans is against machines and what stays legitimate to sustain their profession. With these findings, this article contributes to the literature of sensemaking and professions by discussing implications of this distinct and important perception mode.
    Keywords: Prospective sensemaking; Occupations and Professions; Qualitative research methods
    Date: 2021–03
  5. By: Maximilien Germain (LPSM); Mathieu Laurière (LPSM); Huyên Pham (LPSM); Xavier Warin
    Abstract: Machine learning methods for solving nonlinear partial differential equations (PDEs) are a hot topic, and different algorithms proposed in the literature show efficient numerical approximation in high dimension. In this paper, we introduce a class of PDEs that are invariant to permutations, called symmetric PDEs. Such problems are widespread, ranging from cosmology to quantum mechanics, and include option pricing/hedging in multi-asset markets with exchangeable payoffs. Our main application comes from the particle approximation of mean-field control problems. We design deep learning algorithms based on certain types of neural networks, named PointNet and DeepSet (and their associated derivative networks), for simultaneously computing an approximation of the solution of symmetric PDEs and its gradient. We illustrate the performance and accuracy of the PointNet/DeepSet networks compared to classical feedforward networks, and provide several numerical results of our algorithm for the examples of a mean-field systemic risk problem, a mean-variance problem and a min/max linear quadratic McKean-Vlasov control problem.
    Date: 2021–03
  6. By: J. Ignacio Conde-Ruiz; Juan José Ganuza; Manu Garcia; Luis A. Puch
    Abstract: We analyze all the articles published in Top 5 economics journals between 2002 and 2019 in order to find gender differences in their research approach. Using an unsupervised machine learning algorithm (Structural Topic Model) developed by Roberts et al. (2019), we jointly characterize the set of latent topics that best fits our data (the set of abstracts) and how the documents/abstracts are allocated across latent topics. These latent topics are mixtures over words, where each word has a probability of belonging to a topic after controlling for year and journal. The latent topics may capture research fields but also subtler characteristics related to the way in which the articles are written. Using only data-driven methods, we find that female authors are unevenly distributed across these latent topics. The gender differences in research approach that we find are generated "automatically" from the research articles, without an arbitrary allocation to particular categories (such as JEL codes or research areas).
    Keywords: machine learning, structural topic model, gender, research fields
    JEL: I20 J16
    Date: 2021–02
  7. By: Xingcai Zhou; Jiangyan Wang
    Abstract: China has made great achievements in the electric power industry during the long-term deepening of reform and opening up. However, owing to complex regional economic, social and natural conditions, electricity resources are not evenly distributed, which accounts for the electricity deficiency in some regions of China. It is therefore desirable to develop a robust electricity forecasting model. Motivated by this, we propose a Panel Semiparametric Quantile Regression Neural Network (PSQRNN) that combines an artificial neural network with semiparametric quantile regression. The PSQRNN can explore potential linear and nonlinear relationships among the variables, capture the unobserved provincial heterogeneity, and maintain the interpretability of parametric models simultaneously. The PSQRNN is trained by combining penalized quantile regression (with LASSO and ridge penalties) with the backpropagation algorithm. To evaluate prediction accuracy, an empirical analysis of provincial electricity consumption in China from 1999 to 2018 is conducted under three scenarios. We find that the PSQRNN model performs better for electricity consumption forecasting when economic and climatic factors are taken into account. Finally, forecasts of provincial electricity consumption in China for the next five years (2019-2023) are reported.
    Date: 2021–02
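    Quantile regression, the building block of the PSQRNN, minimizes the pinball (check) loss. A minimal sketch, verifying on simulated skewed data that the empirical quantile minimizes this loss over constant predictors:

```python
import numpy as np

def pinball_loss(y, q_hat, tau):
    """Check (pinball) loss for quantile level tau: the objective that
    quantile regression models, neural or linear, minimize."""
    u = y - q_hat
    return np.mean(np.maximum(tau * u, (tau - 1) * u))

# The empirical tau-quantile minimizes the pinball loss over constants.
rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.0, size=10_000)   # skewed "consumption" data
tau = 0.9
grid = np.linspace(y.min(), y.max(), 500)
losses = [pinball_loss(y, c, tau) for c in grid]
best = grid[int(np.argmin(losses))]
print(f"pinball minimizer {best:.2f} vs empirical q90 {np.quantile(y, tau):.2f}")
```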
  8. By: Henri Fraisse; Matthias Laporte
    Abstract: Taking advantage of granular data, we measure the change in bank capital requirements resulting from the implementation of AI techniques to predict corporate defaults. For each of the largest banks operating in France, we design an algorithm to build pseudo-internal models of credit risk management for a range of methodologies extensively used in AI (random forest, gradient boosting, ridge regression, deep learning). We compare these models to the traditional model usually in place, which relies on a combination of logistic regression and expert judgement. The comparison is made along two sets of criteria: (i) the ability to pass the compliance tests used by regulators during on-site missions of model validation, and (ii) the induced changes in capital requirements. The models show noticeable differences in their ability to pass the regulatory tests and to lead to a reduction in capital requirements. While displaying an ability to pass compliance tests similar to that of the traditional model, neural networks provide the strongest incentive for banks to apply AI models to their internal models of credit risk for corporate businesses, as they lead in some cases to sizeable reductions in capital requirements.
    Keywords: Artificial Intelligence, Credit Risk, Regulatory Requirement
    JEL: C4 C55 G21 K35
    Date: 2021
  9. By: Kristof Lommers; Ouns El Harzli; Jack Kim
    Abstract: This study aims to examine the challenges and applications of machine learning for financial research. Machine learning algorithms have been developed for certain data environments which substantially differ from the ones we encounter in finance. Not only do difficulties arise due to some of the idiosyncrasies of financial markets, but there is also a fundamental tension between the underlying paradigm of machine learning and the research philosophy in financial economics. Given the peculiar features of financial markets and the empirical framework of social science, various adjustments have to be made to the conventional machine learning methodology. We discuss some of the main challenges of machine learning in finance and examine how these could be accounted for. Despite these challenges, we argue that machine learning could be unified with financial research to become a robust complement to the econometrician's toolbox. Moreover, we discuss the various applications of machine learning in the research process, such as estimation, empirical discovery, testing, causal inference and prediction.
    Date: 2021–02
  10. By: Dautel, Alexander Jakob; Härdle, Wolfgang Karl; Lessmann, Stefan; Seow, Hsin-Vonn
    Abstract: Deep learning has substantially advanced the state of the art in computer vision, natural language processing, and other fields. This paper examines the potential of deep learning for exchange rate forecasting. We systematically compare long short-term memory networks and gated recurrent units to traditional recurrent network architectures as well as feedforward networks in terms of their directional forecasting accuracy and the profitability of trading on model predictions. Empirical results indicate the suitability of deep networks for exchange rate forecasting in general but also evidence the difficulty of implementing and tuning the corresponding architectures. Especially with regard to trading profit, a simpler neural network may perform as well as, if not better than, a more complex deep neural network.
    Keywords: Deep learning,Financial time series forecasting,Recurrent neural networks,Foreign exchange rates
    JEL: C14 C22 C45
    Date: 2020
  11. By: Ni, Xinwen; Härdle, Wolfgang Karl; Xie, Taojun
    Abstract: Cryptocurrencies’ values often respond aggressively to major policy changes, but none of the existing indices informs on the market risks associated with regulatory changes. In this paper, we quantify the risks originating from new regulations on FinTech and cryptocurrencies (CCs), and analyse their impact on market dynamics. Specifically, a Cryptocurrency Regulatory Risk IndeX (CRRIX) is constructed based on policy-related news coverage frequency. The unlabeled news data are collected from the top online CC news platforms and further classified using a Latent Dirichlet Allocation model and Hellinger distance. Our results show that the machine-learning-based CRRIX successfully captures major policy-changing moments. The movements for both the VCRIX, a market volatility index, and the CRRIX are synchronous, meaning that the CRRIX could be helpful for all participants in the cryptocurrency market. The algorithms and Python code are available for research purposes on
    Keywords: Cryptocurrency,Regulatory Risk,Index,LDA,News Classification
    JEL: C45 G11 G18
    Date: 2020
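    The classification device described, per-document topic distributions from a Latent Dirichlet Allocation model compared via Hellinger distance, can be sketched on toy headlines (the paper's actual news corpus and labeling pipeline are not reproduced here; the documents below are invented):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (in [0, 1])."""
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

# Toy stand-in for policy-related cryptocurrency headlines.
docs = ["regulator bans crypto exchange trading",
        "central bank warns on bitcoin regulation",
        "new blockchain startup raises funding round",
        "token price rallies as investors buy bitcoin"]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)              # per-document topic distributions

# Documents whose topic distributions are close in Hellinger distance
# can be grouped together; distance to reference "regulation" documents
# is the kind of signal an index like the CRRIX can aggregate.
d = hellinger(theta[0], theta[1])
print(f"Hellinger(doc0, doc1) = {d:.3f}")
```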
  12. By: Zijian Shi; Yu Chen; John Cartlidge
    Abstract: In an order-driven financial market, the price of a financial asset is discovered through the interaction of orders - requests to buy or sell at a particular price - that are posted to the public limit order book (LOB). Therefore, LOB data is extremely valuable for modelling market dynamics. However, LOB data is not freely accessible, which poses a challenge to market participants and researchers wishing to exploit this information. Fortunately, trades and quotes (TAQ) data - orders arriving at the top of the LOB, and trades executing in the market - are more readily available. In this paper, we present the LOB recreation model, a first attempt from a deep learning perspective to recreate the top five price levels of the LOB for small-tick stocks using only TAQ data. Volumes of orders sitting deep in the LOB are predicted by combining outputs from: (1) a history compiler that uses a Gated Recurrent Unit (GRU) module to selectively compile prediction relevant quote history; (2) a market events simulator, which uses an Ordinary Differential Equation Recurrent Neural Network (ODE-RNN) to simulate the accumulation of net order arrivals; and (3) a weighting scheme to adaptively combine the predictions generated by (1) and (2). By the paradigm of transfer learning, the source model trained on one stock can be fine-tuned to enable application to other financial assets of the same class with much lower demand on additional data. Comprehensive experiments conducted on two real world intraday LOB datasets demonstrate that the proposed model can efficiently recreate the LOB with high accuracy using only TAQ data as input.
    Date: 2021–03
  13. By: Klaus Gründler; Tommy Krieger
    Abstract: We provide a comprehensive overview of the literature on the measurement of democracy and present an extensive update of the Machine Learning indicator of Gründler and Krieger (2016, European Journal of Political Economy). Four improvements are particularly notable: First, we produce a continuous and a dichotomous version of the Machine Learning democracy indicator. Second, we calculate intervals that reflect the degree of measurement uncertainty. Third, we refine the conceptualization of the Machine Learning Index. Finally, we largely expand the data coverage by providing democracy indicators for 186 countries in the period from 1919 to 2019.
    Keywords: data aggregation, democracy indicators, machine learning, measurement issues, regime classifications, support vector machines
    JEL: C38 C43 C82 E02 P16
    Date: 2021
  14. By: Best, Katherine Laura; Speyer, Lydia Gabriela; Murray, Aja Louise; Ushakova, Anastasia
    Abstract: Identifying predictors of attrition is essential for designing longitudinal studies such that attrition bias can be minimised, and for identifying the variables that can be used as auxiliary in statistical techniques to help correct for non-random drop-out. This paper provides a comparative overview of predictive techniques that can be used to model attrition and identify important risk factors that help in its prediction. Logistic regression and several tree-based machine learning methods were applied to Wave 2 dropout in an illustrative sample of 5000 individuals from a large UK longitudinal study, Understanding Society. Each method was evaluated based on accuracy, AUC-ROC, plausibility of key assumptions and interpretability. Our results suggest a 10% improvement in accuracy for random forest compared to logistic regression methods. However, given the differences in estimation procedures we suggest that both models could be used in conjunction to provide the most comprehensive understanding of attrition predictors.
    Date: 2021–03–02
  15. By: Claire Greene; Oz Shy
    Abstract: Using a representative sample of the U.S. adult population, we analyze which payment methods consumers use to pay other consumers (p2p) and how these choices depend on transaction and demographic characteristics. We additionally construct a random matching model of consumers with diverse preferences over the use of different payment methods for p2p payments. The random matching model is calibrated to the shares of p2p payments made with cash, paper check, and electronic technologies observed from 2015 to 2019. We find that about two-thirds of consumers rank cash first as their preferred p2p payment method; the remaining one-third rank checks first. Approximately 93 percent of consumers rank electronic technologies second. Our empirical analysis finds that the most significant factors in determining the payment method used are the transaction value and the age and education of the payer.
    Keywords: consumer payment choice; person-to-person payments; electronic payments; mixed logit; machine learning; random matching
    JEL: D9 E42
    Date: 2021–02–05
  16. By: Bolte, Jérôme; Pauwels, Edouard
    Abstract: Automatic differentiation, as implemented today, does not have a simple mathematical model adapted to the needs of modern machine learning. In this work we articulate the relationships between differentiation of programs as implemented in practice and differentiation of nonsmooth functions. To this end we provide a simple class of functions, a nonsmooth calculus, and show how they apply to stochastic approximation methods. We also evidence the issue of artificial critical points created by algorithmic differentiation and show how usual methods avoid these points with probability one.
    Date: 2021–02–01
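    Forward-mode automatic differentiation as implemented in practice can be sketched with dual numbers, which carry a (value, derivative) pair through each elementary operation. The relu example below also shows the kind of nonsmooth point (a fixed branch choice at zero) whose treatment the paper's nonsmooth calculus formalizes; this is a generic illustration, not the paper's model:

```python
class Dual:
    """Forward-mode automatic differentiation with dual numbers."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.dot * o.val + self.val * o.dot)  # product rule
    __rmul__ = __mul__

def relu(x):
    # Nonsmooth at 0: like practical AD systems, we commit to one
    # branch, assigning derivative 0 there by convention.
    return x if x.val > 0 else Dual(0.0, 0.0)

x = Dual(3.0, 1.0)                  # seed dx/dx = 1
y = relu(x * x + (-2.0) * x)        # f(x) = relu(x^2 - 2x)
print(y.val, y.dot)                 # f(3) = 3, f'(3) = 2*3 - 2 = 4
```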
  17. By: Philippe Goulet Coulombe; Massimiliano Marcellino; Dalibor Stevanovic
    Abstract: Based on evidence gathered from a newly built large macroeconomic data set for the UK, labeled UK-MD and comparable to similar datasets for the US and Canada, it seems the most promising avenue for forecasting during the pandemic is to allow for general forms of nonlinearity by using machine learning (ML) methods. But not all nonlinear ML methods are alike. For instance, some cannot extrapolate (like regular trees and forests) and some can (when complemented with linear dynamic components). This and other crucial aspects of ML-based forecasting in unprecedented times are studied in an extensive pseudo-out-of-sample exercise.
    Date: 2021–03
  18. By: Andry Alamsyah; Yahya Peranginangin; Gabriel Nurhadi
    Abstract: The challenge for every organization is how to adapt to the shift toward more complex technologies such as mobile, big data, the interconnected world, and the Internet of Things. To achieve their objectives, organizations must understand how to take advantage of the interconnected individuals inside and outside the organization. A learning organization continues to transform by listening to and maintaining connections with its counterparts. Customer relationship management is an important means for business organizations to grow and secure their future. The complex social network, where interconnected people obtain information and are influenced very quickly, is certainly a big challenge for business organizations. The combination of these complex technologies provides intriguing insights, such as the capability to listen to what markets want, to understand market competition, and to understand market segmentation. In this paper, as part of organizational transformation, we show how a business organization can mine online conversations on Twitter related to its brand and analyze them in the context of customer relationship management to extract several insights about its market.
    Date: 2021–03
  19. By: Raden Johannes; Andry Alamsyah
    Abstract: The growth of internet users in Indonesia has an impact on many aspects of daily life, including commerce. Indonesian small and medium enterprises have taken advantage of new media to move their activity into online commerce. Until now, there has been no known practical implementation showing how to predict their sales and revenue from their historical transactions. In this paper, we build a sales prediction model for the Indonesian footwear industry using real-life data crawled from Tokopedia, one of the biggest e-commerce providers in Indonesia. Data mining is a discipline that can be used to gather information by processing data. Using classification methods from data mining, this research describes patterns of the market and predicts the potential of regions in the national market for these commodities. Our approach is based on the classification decision tree. We predict the number of items sold from the number of viewers, the price, and the type of shoes.
    Date: 2021–03
  20. By: Philippe Goulet Coulombe
    Abstract: Random Forest's performance can be matched by a single slow-growing tree (SGT), which uses a learning rate to tame CART's greedy algorithm. SGT exploits the view that CART is an extreme case of an iterative weighted least square procedure. Moreover, a unifying view of Boosted Trees (BT) and Random Forests (RF) is presented. Greedy ML algorithms' outcomes can be improved using either "slow learning" or diversification. SGT applies the former to estimate a single deep tree, and Booging (bagging stochastic BT with a high learning rate) uses the latter with additive shallow trees. The performance of this tree ensemble quaternity (Booging, BT, SGT, RF) is assessed on simulated and real regression tasks.
    Date: 2021–03
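    The "slow learning" idea, taming a greedy fit with a learning rate, can be illustrated with damped boosting of small trees. Note this sketch is ordinary shrunk boosting on synthetic data, not the paper's SGT, which applies the learning rate inside the growth of a single deep tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def slow_boost(X, y, n_steps=200, lr=0.05, depth=2):
    """Repeatedly fit a small tree to the current residuals and take
    only a fraction lr of each fit: the slow-learning principle."""
    pred = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_steps):
        t = DecisionTreeRegressor(max_depth=depth, random_state=0)
        t.fit(X, y - pred)                 # fit current residuals
        pred += lr * t.predict(X)          # damped update
        trees.append(t)
    return trees, pred

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)
_, pred = slow_boost(X, y)
print("train MSE:", round(float(np.mean((y - pred) ** 2)), 3))
```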
  21. By: Stamer, Vincent
    Abstract: Global container ship movements may reliably predict global trade flows. Aggregating both movements at sea and port call events produces a wealth of explanatory variables. The machine learning algorithm partial least squares can map these explanatory time series to unilateral imports and exports, as well as bilateral trade flows. Applying out-of-sample and time series methods on monthly trade data of 75 countries, this paper shows that the new shipping indicator outperforms benchmark models for the vast majority of countries. This holds true for predictions for the current and subsequent month even if one limits the analysis to data during the first half of the month. This makes the indicator available at least as early as other leading indicators.
    Keywords: Trade,Forecasting,Machine Learning,Container Shipping
    JEL: F17 C53
    Date: 2021
  22. By: Wu, Desheng Dang; Härdle, Wolfgang Karl
    Abstract: With growing economic globalization, the modern service sector is in great need of business intelligence for data analytics and computational statistics. The joint application of big data analytics, computational statistics and business intelligence has great potential to make the engineering of advanced service systems more efficient. The purpose of this COST issue is to publish high-quality research papers (including reviews) that address the challenges of service data analytics with business intelligence in the face of uncertainty and risk. High-quality contributions that are not yet published, and not under review by other journals or peer-reviewed conferences, have been collected. The resulting topic-oriented special issue includes research on business intelligence and computational statistics, data-driven financial engineering, service data analytics and algorithms for optimizing business engineering. It also covers implementation issues of managing the service process, computational statistics for risk analysis, novel theoretical and computational models, and data mining algorithms for risk-management-related business applications.
    Keywords: Data Analytics,Business Intelligence Systems
    JEL: C00
    Date: 2020
  23. By: Jérémy Fouliard; Michael Howell; Hélène Rey
    Abstract: Financial crises cause economic, social and political havoc. Macroprudential policies are gaining traction but are still severely under-researched compared to monetary policy and fiscal policy. We use the general framework of sequential predictions also called online machine learning to forecast crises out-of-sample. Our methodology is based on model averaging and is "meta-statistic" since we can incorporate any predictive model of crises in our set of experts and test its ability to add information. We are able to predict systemic financial crises twelve quarters ahead out-of-sample with high signal-to-noise ratio in most cases. We analyse which experts provide the most information for our predictions at each point in time and for each country, allowing us to gain some insights into economic mechanisms underlying the building of risk in economies.
    JEL: E37 E44 G01
    Date: 2021–02
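    Sequential prediction with expert advice typically aggregates experts by exponentially reweighting their past losses; a minimal sketch of that standard scheme on simulated data (the paper's exact aggregation rule and experts may differ):

```python
import numpy as np

def aggregate_experts(expert_probs, outcomes, eta=2.0):
    """Exponentially weighted model averaging: forecast with a weighted
    mean of expert crisis probabilities, then reweight each expert by
    exp(-eta * squared error) after the outcome is revealed."""
    T, K = expert_probs.shape
    w = np.ones(K) / K
    forecasts = np.empty(T)
    for t in range(T):
        forecasts[t] = w @ expert_probs[t]          # combined forecast
        losses = (expert_probs[t] - outcomes[t]) ** 2
        w = w * np.exp(-eta * losses)               # exponential reweighting
        w /= w.sum()
    return forecasts, w

# Three experts: one informative, two pure noise.
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=200)
experts = np.column_stack([
    0.8 * truth + 0.1,                              # good expert
    rng.uniform(size=200),                          # noise
    rng.uniform(size=200),                          # noise
])
forecasts, w = aggregate_experts(experts, truth)
print("final weights:", np.round(w, 2))             # mass shifts to expert 0
```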
  24. By: Tao Zou; Xian Li; Xuan Liang; Hansheng Wang
    Abstract: This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis with memory constraints of computers. Specifically, for the whole dataset with size $N$, $m_N$ subsamples are randomly drawn, and each subsample with a subsample size $k_N\ll N$ to meet the memory constraint is sampled uniformly without replacement. Aggregating the estimators of $m_N$ subsamples can lead to subbagging estimation. To analyze the theoretical properties of the subbagging estimator, we adapt the incomplete $U$-statistics theory with an infinite order kernel to allow overlapping drawn subsamples in the sampling procedure. Utilizing this novel theoretical framework, we demonstrate that via a proper hyperparameter selection of $k_N$ and $m_N$, the subbagging estimator can achieve $\sqrt{N}$-consistency and asymptotic normality under the condition $(k_Nm_N)/N\to \alpha \in (0,\infty]$. Compared to the full sample estimator, we theoretically show that the $\sqrt{N}$-consistent subbagging estimator has an inflation rate of $1/\alpha$ in its asymptotic variance. Simulation experiments are presented to demonstrate the finite sample performances. An American airline dataset is analyzed to illustrate that the subbagging estimate is numerically close to the full sample estimate, and can be computationally fast under the memory constraint.
    Date: 2021–02
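The subbagging procedure is straightforward to sketch for a simple estimator such as the sample mean: draw $m_N$ subsamples of size $k_N$ uniformly without replacement (the subsamples themselves may overlap) and average the per-subsample estimates. A minimal illustration; the function name and the choice of the mean as the base estimator are assumptions, not from the paper.

```python
import numpy as np

def subbagging_mean(data, k_n, m_n, seed=None):
    """Subbagging estimate of the mean: average the estimators computed
    on m_n subsamples of size k_n, each drawn uniformly without
    replacement. Only k_n observations need to sit in memory at once."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = [data[rng.choice(n, size=k_n, replace=False)].mean()
                 for _ in range(m_n)]
    return float(np.mean(estimates))
```

In practice each subsample would be loaded from disk separately; here the full array is passed in only to keep the sketch self-contained.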
  25. By: Kéa Baret; Amélie Barbier-Gauchard; Théophilos Papadimitriou
    Abstract: Since the reinforcement of the Stability and Growth Pact (1996), the European Commission has closely monitored public finances in the EU member states. A country's failure to comply with the 3% limit rule on the public deficit triggers an audit. In this paper, we present a machine learning based forecasting model for compliance with the 3% limit rule. To do so, we use data spanning the period from 2006 to 2018 (a turbulent period including the Global Financial Crisis and the Sovereign Debt Crisis) for the 28 EU Member States. A set of eight features is identified as predictors from 141 variables through a feature selection procedure. The forecasting is performed using Support Vector Machines (SVM). The proposed model reached 91.7% forecasting accuracy and outperformed the Logit model that we used as a benchmark.
    Keywords: Fiscal Rules; Fiscal Compliance; Stability and Growth Pact; Machine learning.
    JEL: E62 H11 H60 H68
    Date: 2021
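The SVM classifier used in the paper can be sketched in miniature with Pegasos-style sub-gradient descent on the hinge loss. This is a hypothetical stand-in (linear kernel, no intercept, invented function names), not the authors' model or data.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Minimal linear SVM via Pegasos-style stochastic sub-gradient
    descent on the L2-regularized hinge loss.
    X: (n, d) feature matrix; y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, t = np.zeros(d), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)          # Pegasos step-size schedule
            margin = y[i] * (X[i] @ w)
            w *= (1.0 - eta * lam)         # shrinkage from the L2 penalty
            if margin < 1:                 # point inside the margin
                w += eta * y[i] * X[i]     # hinge-loss sub-gradient step
    return w
```

A prediction is simply `np.sign(X @ w)`; the hinge loss drives the separating hyperplane to a maximum-margin solution on separable data.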
  26. By: Gary Cornwall; Jeff Chen; Beau Sauley
    Abstract: In this paper we update the hypothesis testing framework by drawing upon modern computational power and classification models from machine learning. We show that a simple classification algorithm such as a boosted decision stump can be used to fully recover the size-power trade-off for any single test statistic. This recovery implies an equivalence, under certain conditions, between the basic building block of modern machine learning and hypothesis testing. Second, we show that more complex algorithms such as the random forest and the gradient boosted machine can serve as mapping functions in place of the traditional null distribution. This allows multiple test statistics and other information to be evaluated simultaneously and thus forms a pseudo-composite hypothesis test. Moreover, we show how practitioners can make explicit the relative costs of Type I and Type II errors to contextualize the test within a specific decision framework. To illustrate this approach we revisit the case of testing for unit roots, a difficult problem in time series econometrics for which existing tests are known to exhibit low power. Using a simulation framework common to the literature, we show that this approach can improve upon the overall accuracy of the traditional unit root test(s) by seventeen percentage points, and the sensitivity by thirty-six percentage points.
    Date: 2021–03
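The core idea of recovering a test's size-power trade-off with a classifier can be sketched with a one-split decision stump fit on simulated test statistics labeled by hypothesis: the stump's split point plays the role of a critical value. A hypothetical illustration; the function name and the normal toy statistics are assumptions, not the paper's unit-root setup.

```python
import numpy as np

def stump_threshold(stat_null, stat_alt):
    """Fit a one-split decision stump on simulated test statistics
    (label 0 = drawn under the null, 1 = under the alternative) by
    exhaustive search over split points. The best split acts as a
    critical value for the implied test."""
    stats = np.concatenate([stat_null, stat_alt])
    labels = np.concatenate([np.zeros(len(stat_null)),
                             np.ones(len(stat_alt))])
    best_t, best_err = None, np.inf
    for t in np.unique(stats):                 # candidate split points
        pred = (stats > t).astype(float)       # reject when statistic exceeds t
        err = np.mean(pred != labels)          # misclassification rate
        if err < best_err:
            best_t, best_err = t, err
    return best_t
```

Reweighting the two error types in `err` (the relative costs of Type I and Type II errors) moves the chosen split, and sweeping those weights traces out the full size-power trade-off.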
  27. By: MARTENS Bertin (European Commission – JRC)
    Abstract: This paper starts with some basic economic characteristics of data that distinguish them from ordinary goods and services, including non-excludability and non-rivalry, economies of scope in data re-use and aggregation, the social value of data and their role in generating network effects. It explores how these characteristics contribute to the emergence of large digital platforms that generate a combination of positive and negative welfare effects for society, including data-driven network effects. It distinguishes between lexicographic and probabilistic data-driven matching in networks; both may lead to market "tipping". It emphasizes the social value of data and the positive and negative social externalities that may come with it. Platforms are necessary intermediaries to generate the social welfare or network externalities from data. However, the economic role of data-driven platforms is ambivalent. On the one hand, platforms enable society to benefit from positive externalities in data collection via economies of scale and scope in data aggregation of transactions and interactions across users, both firms and consumers. That gives them a privileged market overview that none of the individual users has. Platforms can use this information asymmetry to facilitate interaction and increase welfare for users. These data externalities attract users to the platform. On the other hand, data-driven network effects may result in monopolistic market power of platforms, which they can use for their own benefit at the expense of users. Any policy intervention that seeks to address the market power of online platforms requires careful balancing between these two poles. Finally, the paper briefly discusses ecosystems that leverage data to coordinate interactions between different platforms.
    Keywords: data, platforms, market power
    Date: 2021–02

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at . For comments, please write to the director of NEP, Marco Novarese, at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.