nep-big New Economics Papers
on Big Data
Issue of 2022‒07‒11
nineteen papers chosen by
Tom Coupé
University of Canterbury

  1. The Spatial Distribution of Public Support for AI Research By Chowdhury, Farhat; Link, Albert; van Hasselt, Martijn
  2. Deep Learning in Business Analytics: A Clash of Expectations and Reality By Marc Andreas Schmitt
  3. Uncovering Heterogeneous Regional Impacts of Chinese Monetary Policy By Tsang, Andrew
  4. Dynamic and Context-Dependent Stock Price Prediction Using Attention Modules and News Sentiment By Nicole Koenigstein
  5. Fast Instrument Learning with Faster Rates By Ziyu Wang; Yuhao Zhou; Jun Zhu
  6. Applying Machine Learning to Detect Outliers in Alternative Data Sources. A universal methodology framework for scanner and web-scraped data sources By Xuxin Mao; Janine Boshoff; Garry Young; Hande Kucuk
  7. Machine learning method for return direction forecasting of Exchange Traded Funds using classification and regression models By Raphael P. B. Piovezan; Pedro Paulo de Andrade Junior
  8. Real Time Indicators During the COVID-19 Pandemic Individual Predictors & Selection By George Kapetanios; Fotis Papailias
  9. Volatility-inspired $\sigma$-LSTM cell By German Rodikov; Nino Antulov-Fantulin
  10. Average Adjusted Association: Efficient Estimation with High Dimensional Confounders By Sung Jae Jun; Sokbae Lee
  11. Forecasting Bitcoin price direction with random forests: How important are interest rates, inflation, and market volatility? By Basher, Syed Abul; Sadorsky, Perry
  12. Nonparametric Value-at-Risk via Sieve Estimation By Philipp Ratz
  13. Public Support for Research in Artificial Intelligence: A Descriptive Study of U.S. Department of Defense SBIR Projects By Chowdhury, Farhat; Link, Albert; van Hasselt, Martijn
  14. A New Approach to Building a Skills Taxonomy By Elizabeth Gallagher; India Kerle; Cath Sleeman; George Richardson
  15. Bitcoin Price Factors: Natural Language Processing Approach By Oksana Bashchenko
  16. Deep Generators on Commodity Markets; application to Deep Hedging By Nicolas Boursin; Carl Remlinger; Joseph Mikael; Carol Anne Hargreaves
  17. Fifty years since Altman (1968): Performance of financial distress prediction models By Surbhi Bhatia; Manish K. Singh
  18. The Use of Online Job Sites for Measuring Skills and Labour Market Trends: A Review By Oleksii Romanko; Mary O'Mahony
  19. A Quality Assessment Framework for Maintaining & Publishing New Indicators By George Kapetanios; Fotis Papailias

  1. By: Chowdhury, Farhat (University of North Carolina at Greensboro, Department of Economics); Link, Albert (University of North Carolina at Greensboro, Department of Economics); van Hasselt, Martijn (University of North Carolina at Greensboro, Department of Economics)
    Abstract: A spatial distributional analysis of the population of Phase II research projects funded by the U.S. SBIR program in FY 2020 shows differences across states in projects focused on Artificial Intelligence (AI). AI is a relatively new research field, and this paper contributes to a better understanding of government support for such research. We find that AI projects are concentrated in states with complementary AI research resources available from universities nationally ranked in terms of their own AI research. To achieve a more diverse spatial distribution of AI-related technology development, the availability of complementary AI research resources must be expanded. We suggest that the National Science Foundation’s National AI Research Institutes represent an important step in this direction.
    Keywords: Artificial intelligence (AI); Public sector program management; Small Business Innovation Research (SBIR); Agglomeration; University research
    JEL: H54 O31 O38 R11
    Date: 2022–06–07
  2. By: Marc Andreas Schmitt
    Abstract: Our fast-paced digital economy shaped by global competition requires increased data-driven decision-making based on artificial intelligence (AI) and machine learning (ML). The benefits of deep learning (DL) are manifold, but it comes with limitations that have - so far - interfered with widespread industry adoption. This paper explains why DL - despite its popularity - has difficulties speeding up its adoption within business analytics. It is shown - by a mixture of content analysis and empirical study - that the adoption of deep learning is not only affected by computational complexity, lacking big data architecture, lack of transparency (black-box), and skill shortage, but also by the fact that DL does not outperform traditional ML models in the case of structured datasets with fixed-length feature vectors. Deep learning should be regarded as a powerful addition to the existing body of ML models instead of a one size fits all solution.
    Date: 2022–05
  3. By: Tsang, Andrew
    Abstract: This paper applies causal machine learning methods to analyze the heterogeneous regional impacts of monetary policy in China. The method uncovers the heterogeneous regional impacts of different monetary policy stances on the provincial figures for real GDP growth, CPI inflation and loan growth compared to the national averages. The varying effects of expansionary and contractionary monetary policy phases on Chinese provinces are highlighted and explained. Subsequently, applying interpretable machine learning, the empirical results show that the credit channel is the main channel affecting the regional impacts of monetary policy. An imminent conclusion of the uneven provincial responses to the "one size fits all" monetary policy is that different policymakers should coordinate their efforts to search for the optimal fiscal and monetary policy mix.
    Keywords: China, monetary policy, regional heterogeneity, machine learning, shadow banking
    JEL: E52 C54 R11 E61
    Date: 2021
  4. By: Nicole Koenigstein
    Abstract: The growth of machine-readable data in finance, such as alternative data, requires new modeling techniques that can handle non-stationary and non-parametric data. Due to the underlying causal dependence and the size and complexity of the data, we propose a new modeling approach for financial time series data, the $\alpha_{t}$-RIM (recurrent independent mechanism). This architecture makes use of key-value attention to integrate top-down and bottom-up information in a context-dependent and dynamic way. To model the data in such a dynamic manner, the $\alpha_{t}$-RIM utilizes an exponentially smoothed recurrent neural network, which can model non-stationary time series data, combined with a modular and independent recurrent structure. We apply our approach to the closing prices of three selected stocks of the S&P 500 universe as well as their news sentiment score. The results suggest that the $\alpha_{t}$-RIM is capable of reflecting the causal structure between stock prices and news sentiment, as well as the seasonality and trends. Consequently, this modeling approach markedly improves the generalization performance, that is, the prediction of unseen data, and outperforms state-of-the-art networks such as long short-term memory models.
    Date: 2022–03
  5. By: Ziyu Wang; Yuhao Zhou; Jun Zhu
    Abstract: We investigate nonlinear instrumental variable (IV) regression given high-dimensional instruments. We propose a simple algorithm which combines kernelized IV methods and an arbitrary, adaptive regression algorithm, accessed as a black box. Our algorithm enjoys faster-rate convergence and adapts to the dimensionality of informative latent features, while avoiding an expensive minimax optimization procedure, which has been necessary to establish similar guarantees. It further brings the benefit of flexible machine learning models to quasi-Bayesian uncertainty quantification, likelihood-based model selection, and model averaging. Simulation studies demonstrate the competitive performance of our method.
    Date: 2022–05
  6. By: Xuxin Mao; Janine Boshoff; Garry Young; Hande Kucuk
    Abstract: This research explores new ways of applying machine learning to detect outliers in alternative price data resources such as web-scraped data and scanner data sources. Based on text vectorisation and clustering methods, we build a universal methodology framework which identifies outliers in both data sources. We provide a unique way of conducting goods classification and outlier detection. Using density-based spatial clustering of applications with noise (DBSCAN), we can provide two layers of outlier detection for both scanner data and web-scraped data. For web-scraped data we provide a method to classify text information and identify clusters of products. The framework allows us to efficiently detect outliers and explore abnormal price changes that may be omitted by the current practices in line with the 2019 Consumer Prices Indices Manual. Our methodology also provides a good foundation for building better measurement of consumer prices with standard time series data transformed from alternative data sources.
    Keywords: consumer price index, machine learning, outlier detection, scanner data, text density based clustering, web-scraped data
    JEL: C43 E31
    Date: 2021–11
  7. By: Raphael P. B. Piovezan; Pedro Paulo de Andrade Junior
    Abstract: This article aims to propose and apply a machine learning method to analyze the direction of returns from Exchange Traded Funds (ETFs) using the historical return data of its components, helping to make investment strategy decisions through a trading algorithm. In methodological terms, regression and classification models were applied, using standard datasets from Brazilian and American markets, in addition to algorithmic error metrics. The research results were analyzed and compared to those of the Naïve forecast and the returns obtained by the buy & hold technique in the same period of time. In terms of risk and return, the models mostly performed better than the control metrics, with emphasis on the linear regression model and the classification models by logistic regression, support vector machine (using the LinearSVC model), Gaussian Naive Bayes and K-Nearest Neighbors, where in certain datasets the returns exceeded by two times and the Sharpe ratio by up to four times those of the buy & hold control model.
    Date: 2022–05
  8. By: George Kapetanios; Fotis Papailias
    Abstract: This technical report aims to present a generalised framework for assessing the predictive content of ONS real time indicators in both dimensions: (i) individual predictors (i.e. variable-by-variable), and (ii) using machine learning techniques to build variable selection models. The evaluation is done on a nowcasting basis (h = 0). Simple correlation and predictive power scores are included as well as best subset selection, penalised regressions, random forests and principal components.
    Keywords: factor models, nowcasting, penalised regression, variable selection
    JEL: C53 E37
    Date: 2022–03
  9. By: German Rodikov; Nino Antulov-Fantulin
    Abstract: Volatility models of price fluctuations are well studied in the econometrics literature, with more than 50 years of theoretical and empirical findings. The recent advancements in neural networks (NN) in the deep learning field have naturally offered novel econometric modeling tools. However, there is still a lack of explainability and stylized knowledge about volatility modeling with neural networks; the use of stylized facts could help improve the performance of the NN for the volatility prediction task. In this paper, we investigate how the knowledge about the "physics" of the volatility process can be used as an inductive bias to design or constrain a cell state of long short-term memory (LSTM) for volatility forecasting. We introduce a new type of $\sigma$-LSTM cell with a stochastic processing layer, design its learning mechanism and show good out-of-sample forecasting performance.
    Date: 2022–05
  10. By: Sung Jae Jun; Sokbae Lee
    Abstract: The log odds ratio is a common parameter to measure association between (binary) outcome and exposure variables. Much attention has been paid to its parametric but robust estimation, or its nonparametric estimation as a function of confounders. However, discussion on how to use a summary statistic by averaging the log odds ratio function is surprisingly difficult to find despite the popularity and importance of averaging in other contexts such as estimating the average treatment effect. We propose a couple of efficient double/debiased machine learning (DML) estimators of the average log odds ratio, where the odds ratios are adjusted for observed (potentially high dimensional) confounders and are averaged over them. The estimators are built from two equivalent forms of the efficient influence function. The first estimator uses a prospective probability of the outcome conditional on the exposure and confounders; the second one employs a retrospective probability of the exposure conditional on the outcome and confounders. Our framework encompasses random sampling as well as outcome-based or exposure-based sampling. Finally, we illustrate how to apply the proposed estimators using real data.
    Date: 2022–05
  11. By: Basher, Syed Abul; Sadorsky, Perry
    Abstract: Bitcoin has grown in popularity and has now attracted the attention of individual and institutional investors. Accurate Bitcoin price direction forecasts are important for determining the trend in Bitcoin prices and asset allocation. This paper addresses several unanswered questions. How important are business cycle variables like interest rates, inflation, and market volatility for forecasting Bitcoin prices? Does the importance of these variables change across time? Are the most important macroeconomic variables for forecasting Bitcoin prices the same as those for gold prices? To answer these questions, we utilize tree-based machine learning classifiers, along with traditional logit econometric models. The analysis reveals several important findings. First, random forests predict Bitcoin and gold price directions with a higher degree of accuracy than logit models. Prediction accuracy for bagging and random forests is between 75% and 80% for a five-day prediction. For 10-day to 20-day forecasts bagging and random forests record accuracies greater than 85%. Second, technical indicators are the most important features for predicting Bitcoin and gold price direction, suggesting some degree of market inefficiency. Third, oil price volatility is important for predicting Bitcoin and gold prices indicating that Bitcoin is a substitute for gold in diversifying this type of volatility. By comparison, gold prices are more influenced by inflation than Bitcoin prices, indicating that gold can be used as a hedge or diversification asset against inflation.
    Keywords: forecasting; machine learning; random forests; Bitcoin; gold; inflation
    JEL: C58 E44 G17
    Date: 2022–06–06
  12. By: Philipp Ratz
    Abstract: Artificial Neural Networks (ANN) have been employed for a range of modelling and prediction tasks using financial data. However, evidence on their predictive performance, especially for time-series data, has been mixed. Whereas some applications find that ANNs provide better forecasts than more traditional estimation techniques, others find that they barely outperform basic benchmarks. The present article aims to provide guidance as to when the use of ANNs might result in better results in a general setting. We propose a flexible nonparametric model and extend existing theoretical results for the rate of convergence to include the popular Rectified Linear Unit (ReLU) activation function and compare the rate to other nonparametric estimators. Finite sample properties are then studied with the help of Monte-Carlo simulations to provide further guidance. An application to estimate the Value-at-Risk of portfolios of varying sizes is also considered to show the practical implications.
    Date: 2022–05
  13. By: Chowdhury, Farhat (University of North Carolina at Greensboro, Department of Economics); Link, Albert (University of North Carolina at Greensboro, Department of Economics); van Hasselt, Martijn (University of North Carolina at Greensboro, Department of Economics)
    Abstract: We describe public support for AI research in small firms using data from U.S. Department of Defense-funded SBIR projects. Ours is the first collection of firm-level project information on publicly funded R&D investments in AI. We find that the likelihood of an SBIR funded research project being focused on AI is greater the larger the amount of the SBIR award. AI-focused research projects are associated with a 7.6 percent increase in average award amounts. We also find suggestive evidence that the likelihood of an SBIR project being AI-focused is greater in smaller-sized firms. Finally, we find that SBIR-funded AI research is more likely to occur in states with complementary university research resources.
    Keywords: Artificial intelligence; machine learning; Department of Defense; Small Business Innovation Research program; agglomeration
    JEL: O31 O38
    Date: 2022–06–07
  14. By: Elizabeth Gallagher; India Kerle; Cath Sleeman; George Richardson
    Abstract: This paper presents a new data-driven approach to building a UK skills taxonomy, improving upon the original approach developed in Djumalieva and Sleeman (2018). The new method improves on the original method as it does not rely on a predetermined list of skills, and can instead automatically detect previously unseen skills. This 'minimal judgement' approach is made possible by a classifier that automatically detects sentences within job adverts that are likely to contain skills. These 'skill sentences' are then grouped to define distinct skills, and a hierarchy is formed. The resulting taxonomy contains three levels and 6,685 separate skills. The taxonomy could be used as a base for developing the first UK-specific skills taxonomy, which domain experts would then refine and extend. It could also be used to spot regional skill clusters, and for rapid assessments of skill changes following shocks such as the COVID-19 pandemic.
    Keywords: big data, cluster analysis, job market, labour demand, machine learning, nlp, online job adverts, sentence embeddings, skills, skills taxonomy
    JEL: C18 C38 J23 J24
    Date: 2022–05
  15. By: Oksana Bashchenko (Swiss Finance Institute - HEC Lausanne)
    Abstract: I propose a new methodology to construct interpretable, fundamental-based pricing factors from news to explain Bitcoin returns. Each news article from a specialized cryptocurrency website is classified in a semi-supervised manner into one of the few predefined topics. Topic sentiments become factors contributing to the price variation. I use a cutting-edge NLP algorithm (SBERT network) to embed linguistic data into a vector space, which allows the application of an intuitive classification rule. This approach permits the exclusion of news pieces that describe the price movements per se from the analysis, thus mitigating endogeneity concerns. I show that non-endogenous news contains fundamental information about Bitcoin. Thus I reject the concept of Bitcoin price being based on pure speculation and show that Bitcoin returns are partially explained by fundamental topics. Among those, the adoption of cryptocurrencies and blockchain technology is the most important aspect. On top of that, I study the media expressed attitude toward Bitcoin from the functions of money perspective. I show that investors consider Bitcoin as the store of value rather than the medium of exchange.
    Keywords: Bitcoin, Cryptocurrency, Natural Language Processing, BERT.
    JEL: C45 C55 C80 G12 G19
    Date: 2022–05
  16. By: Nicolas Boursin; Carl Remlinger; Joseph Mikael; Carol Anne Hargreaves
    Abstract: Driven by the good results obtained in computer vision, deep generative methods for time series have been the subject of particular attention in recent years, particularly from the financial industry. In this article, we focus on commodity markets: four state-of-the-art generative methods, namely Time Series Generative Adversarial Network (GAN) Yoon et al. [2019], Causal Optimal Transport GAN Xu et al. [2020], Signature GAN Ni et al. [2020] and the conditional Euler generator Remlinger et al. [2021], are adapted and tested on commodity time series. A first series of experiments deals with the joint generation of historical time series on commodities. A second set deals with deep hedging of commodity options trained on the generated time series. This use case illustrates a purely data-driven approach to risk hedging.
    Date: 2022–05
  17. By: Surbhi Bhatia (Independent Researcher); Manish K. Singh (Department of Humanities and Social Sciences Indian Institute of Technology Roorkee and XKDR Forum)
    Abstract: Using bankruptcy filings under the new Insolvency and Bankruptcy Code (2016), we investigate the effect of firm characteristics and balance sheet variables on the forecast of one-year-ahead default for Indian manufacturing firms. We compare traditional discriminant analysis and logistic regression models with state-of-the-art variable selection techniques (the least absolute shrinkage and selection operator, and the unsupervised techniques of variable selection) to identify key predictive variables. Our findings suggest that the ratios considered as important by Altman (1968) still hold relevance for the prediction of default, no matter the technique applied for variable selection. We find cash to current liability (a liquidity measure) as an additional robust and significant predictor of default. In terms of predictive accuracy, the reduced-form multivariate discriminant analysis model used in Altman (1968) performs on par with the more advanced econometric specification for both in-sample and full-sample default prediction.
    JEL: C53 G17 G32 G33
    Date: 2022–06
  18. By: Oleksii Romanko; Mary O'Mahony
    Abstract: We explore the use of online job postings as an innovative complementary source of labour demand statistics. The paper concentrates on new developments in the area, including the usage and validation of online data sources, trends, biases and caveats of the data generation and data extraction process. We provide detailed explanations of the data cleansing and data preparation process, which should prove useful for anyone working with raw online data sources. We explore the general data pipeline underpinning continuous data mining and data utilization, which could be beneficial for any organization building its own online data analysis process. We provide detailed discussions of the design of the skills extraction process using a word2vec model. We discuss the application of the model and explore some of the skills analysis methods and visualizations, such as job titles, salaries, frequent skills histograms, skills correlation scatterplots, graph analysis of skills co-occurrence, and UK regional skills analysis. We applied regression analysis, outlining various effects of a person's competencies on the salary. We conclude that online job postings provide rich and extensive insights into the labour market and can complement the official statistics.
    Keywords: labour demand, skills analysis, skills demand, skills extraction, web scraping
    JEL: J23 J24 J31
    Date: 2022–05
  19. By: George Kapetanios; Fotis Papailias
    Abstract: This technical report builds on the research output of “National Accounts and Beyond GDP: Predictive Performance of Real-Time Indicators” ESCoE/ONS collaborative project to summarise the key findings and provide a standardised methodology to guide the editing, maintenance, publishing and potential incorporation of new real-time indicators into the development of early estimates of headline economic statistics. Empirical results from previous tasks are revisited and standardised qualitative and quantitative criteria are discussed.
    Keywords: machine learning, nowcasting, real-time indicators
    JEL: C53 E37
    Date: 2022–05

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.