nep-big New Economics Papers
on Big Data
Issue of 2019‒06‒17
eighteen papers chosen by
Tom Coupé
University of Canterbury

  1. Evaluating the Performance of Machine Learning Algorithms in Financial Market Forecasting: A Comprehensive Survey By Lukas Ryll; Sebastian Seidens
  2. Artificial Intelligence and Big Data in Entrepreneurship: A New Era Has Begun By Martin Obschonka; David B. Audretsch
  3. Institutional settings and urban sprawl: evidence from Europe By Ehrlich, Maximilian V.; Hilber, Christian A. L.; Schöni, Olivier
  4. The Art of Central Bank Communication: A Topic Analysis on Words used by the Bank of Japan's Governors By KEIDA Masayuki; TAKEDA Yosuke
  5. FinTech in Financial Inclusion: Machine Learning Applications in Assessing Credit Risk By Majid Bazarbash
  6. Neural Learning of Online Consumer Credit Risk By Di Wang; Qi Wu; Wen Zhang
  7. The Hard Problem of Prediction for Conflict Prevention By Mueller, Hannes Felix; Rauh, Christopher
  8. Investment Ranking Challenge: Identifying the best performing stocks based on their semi-annual returns By Shanka Subhra Mondal; Sharada Prasanna Mohanty; Benjamin Harlander; Mehmet Koseoglu; Lance Rane; Kirill Romanov; Wei-Kai Liu; Pranoot Hatwar; Marcel Salathe; Joe Byrum
  9. Multi-Likelihood Methods for Developing Stock Relationship Networks Using Financial Big Data By Xue Guo; Hu Zhang; Tianhai Tian
  10. Linkage of Patent and Design Right Data: Analysis of Industrial Design Activities in Companies at the Creator Level (Japanese) By IKEUCHI Kenta; MOTOHASHI Kazuyuki
  11. Deep Generalized Method of Moments for Instrumental Variable Analysis By Andrew Bennett; Nathan Kallus; Tobias Schnabel
  12. Counterfactual Inference for Consumer Choice Across Many Product Categories By Rob Donnelly; Francisco R. Ruiz; David Blei; Susan Athey
  13. Newspaper-based economic uncertainty indices for Poland By Marcin Hołda
  14. Economic growth and convergence during the transition to production using automation capital By Martin Labaj; Daniel Dujava
  15. An economy under the digital transformation By Bertani, Filippo; Ponta, Linda; Raberto, Marco; Teglio, Andrea; Cincotti, Silvano
  16. Let’s meet as usual: Do games on non-frequent days differ? Evidence from top European soccer leagues By Goller, Daniel; Krumer, Alex
  17. Early Identification of College Dropouts Using Machine-Learning: Conceptual Considerations and an Empirical Example By Isphording, Ingo E.; Raabe, Tobias
  18. Risk-Sensitive Compact Decision Trees for Autonomous Execution in Presence of Simulated Market Response By Svitlana Vyetrenko; Shaojie Xu

  1. By: Lukas Ryll; Sebastian Seidens
    Abstract: With increasing competition and pace in the financial markets, robust forecasting methods are becoming more and more valuable to investors. While machine learning algorithms offer a proven way of modeling non-linearities in time series, their advantages against common stochastic models in the domain of financial market prediction are largely based on limited empirical results. The same holds true for determining advantages of certain machine learning architectures against others. This study surveys more than 150 related articles on applying machine learning to financial market forecasting. Based on a comprehensive literature review, we build a table across seven main parameters describing the experiments conducted in these studies. Through listing and classifying different algorithms, we also introduce a simple, standardized syntax for textually representing machine learning algorithms. Based on performance metrics gathered from papers included in the survey, we further conduct rank analyses to assess the comparative performance of different algorithm classes. Our analysis shows that machine learning algorithms tend to outperform most traditional stochastic methods in financial market forecasting. We further find evidence that, on average, recurrent neural networks outperform feed-forward neural networks as well as support vector machines, which implies the existence of exploitable temporal dependencies in financial time series across multiple asset classes and geographies.
    Date: 2019–06
  2. By: Martin Obschonka; David B. Audretsch
    Abstract: While the disruptive potential of artificial intelligence (AI) and Big Data has been receiving growing attention and concern in a variety of research and application fields over the last few years, it has not received much scrutiny in contemporary entrepreneurship research so far. Here we present some reflections and a collection of papers on the role of AI and Big Data for this emerging area in the study and application of entrepreneurship research. While being mindful of the potentially overwhelming nature of the rapid progress in machine intelligence and other Big Data technologies for contemporary structures in entrepreneurship research, we put an emphasis on the reciprocity of the co-evolving fields of entrepreneurship research and practice. How can AI and Big Data contribute to a productive transformation of the research field and the real-world phenomena (e.g., 'smart entrepreneurship')? We also discuss, however, ethical issues as well as challenges around a potential contradiction between entrepreneurial uncertainty and rule-driven AI rationality. The editorial gives researchers and practitioners orientation and showcases avenues and examples for concrete research in this field. At the same time, however, it is not unlikely that we will encounter unforeseeable and currently inexplicable developments in the field soon. We call on entrepreneurship scholars, educators, and practitioners to proactively prepare for future scenarios.
    Date: 2019–06
  3. By: Ehrlich, Maximilian V.; Hilber, Christian A. L.; Schöni, Olivier
    Abstract: This article explores the role of institutional settings in determining spatial variation in urban sprawl across Europe. We first synthesize the emerging literature that links land use policies and local fiscal incentives to urban sprawl. Next, we compile a panel dataset on various measures of urban sprawl for European countries using high-resolution satellite images. We document substantial variation in urban sprawl across countries. This variation remains roughly stable over the period of our analysis (1990-2012). Urban sprawl is particularly pronounced in emerging Central and Eastern Europe but is comparatively low in Northern European countries. Urban sprawl – especially outside functional urban areas – is strongly negatively associated with real house price growth, suggesting a trade-off between urban containment and housing affordability. Our main novel empirical findings are that decentralization and local political fragmentation are significantly positively associated with urban sprawl. Decentralized countries have a 25 to 30 percent higher sprawl index than centralized ones. This finding is consistent with the proposition that in decentralized countries, fiscal arrangements at the local level may provide strong incentives to permit residential development at the outskirts of existing developments.
    Keywords: Decentralization; Housing supply; Supply constraints; Land use regulation; Urban sprawl; Europe
    JEL: H2 H3 H4 H7 R3 R4 R5
    Date: 2018–12–01
  4. By: KEIDA Masayuki; TAKEDA Yosuke
    Abstract: This paper addresses the art of central bank communication, in a semantic analysis which applies a topic model to the regular press conference documents of the Bank of Japan (BOJ)'s Gov. Masaaki Shirakawa and Gov. Haruhiko Kuroda. Based on the standard method of latent Dirichlet allocation (LDA) in the statistical natural language processing literature, our research on the communication strategies that the BOJ pursued under two governorships using over 70 press conference documents indicates significant differences between the Shirakawa and Kuroda governorships in terms of topic distribution. In early 2016, when the negative interest rate policy was introduced during the era of Kuroda's governorship, the ratio of "policy goal" topics decreased dramatically, despite being an essential feature of Gov. Kuroda's vocabulary relative to Gov. Shirakawa to that point in time. Since the ambiguity in the words of the governors is contained in "discretionary" topics, which include to strengthen, to confront, to recognize, to plan and so forth, the communication strategy in the Shirakawa governorship was considered "Delphic" in that the semantic ambiguity may reveal bad fundamental conditions concerning the Japanese economy.
    Date: 2019–05
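For readers unfamiliar with the LDA machinery referenced in this entry, the following is a minimal, self-contained collapsed-Gibbs sketch in plain Python. The toy corpus, number of topics, and hyperparameters are invented for illustration; they are not the authors' setup, which applies LDA to over 70 BOJ press conference transcripts.

```python
# Minimal collapsed-Gibbs LDA sketch (pure Python, invented toy corpus).
import random

random.seed(0)
docs = [
    "inflation target price stability".split(),
    "interest rate easing purchases".split(),
    "price stability inflation expectations".split(),
    "negative interest rate policy".split(),
]
K, alpha, beta = 2, 0.1, 0.01          # topics and Dirichlet hyperparameters
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# z[d][i]: topic of word i in doc d; counters used by the Gibbs updates.
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]                       # topic counts per document
nkw = [{w: 0 for w in vocab} for _ in range(K)]     # word counts per topic
nk = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

for _ in range(200):                                # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            t = random.choices(range(K), weights)[0]
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

# Per-document topic shares, analogous to the paper's topic distributions.
shares = [[(c + alpha) / (sum(row) + K * alpha) for c in row] for row in ndk]
print([round(s, 2) for s in shares[0]])
```

Each row of `shares` is one document's estimated topic distribution; tracking such shares across consecutive press conferences is what yields the topic-ratio time series discussed above.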
  5. By: Majid Bazarbash
    Abstract: Recent advances in digital technology and big data have allowed FinTech (financial technology) lending to emerge as a potentially promising solution to reduce the cost of credit and increase financial inclusion. However, machine learning (ML) methods that lie at the heart of FinTech credit have remained largely a black box for the nontechnical audience. This paper contributes to the literature by discussing potential strengths and weaknesses of ML-based credit assessment through (1) presenting core ideas and the most common techniques in ML for the nontechnical audience; and (2) discussing the fundamental challenges in credit risk analysis. FinTech credit has the potential to enhance financial inclusion and outperform traditional credit scoring by (1) leveraging nontraditional data sources to improve the assessment of the borrower’s track record; (2) appraising collateral value; (3) forecasting income prospects; and (4) predicting changes in general conditions. However, because of the central role of data in ML-based analysis, data relevance should be ensured, especially in situations when a deep structural change occurs, when borrowers could counterfeit certain indicators, and when agency problems arising from information asymmetry could not be resolved. To avoid digital financial exclusion and redlining, variables that trigger discrimination should not be used to assess credit rating.
    Date: 2019–05–17
  6. By: Di Wang; Qi Wu; Wen Zhang
    Abstract: This paper takes a deep learning approach to understanding consumer credit risk when e-commerce platforms issue unsecured credit to finance customers' purchases. The "NeuCredit" model can capture serial dependences in multi-dimensional time series data when event frequencies in each dimension differ. It also captures nonlinear cross-sectional interactions among different time-evolving features. In addition, the predicted default probability is designed to be interpretable, such that risks can be decomposed into three components: the subjective risk indicating the consumers' willingness to repay, the objective risk indicating their ability to repay, and the behavioral risk indicating consumers' behavioral differences. Using a unique dataset from one of the largest global e-commerce platforms, we show that including shopping behavioral data, in addition to conventional payment records, requires a deep learning approach to extract the information content of these data, and that doing so significantly enhances forecasting performance relative to traditional machine learning methods.
    Date: 2019–06
  7. By: Mueller, Hannes Felix; Rauh, Christopher
    Abstract: There is a growing interest in better conflict prevention, and this provides a strong motivation for better conflict forecasting. A key problem of conflict forecasting for prevention is that predicting the start of conflict in previously peaceful countries is extremely hard. To make progress on this hard problem, this project exploits both supervised and unsupervised machine learning. Specifically, the latent Dirichlet allocation (LDA) model is used for feature extraction from 3.8 million newspaper articles, and these features are then used in a random forest model to predict conflict. We find that forecasting hard cases is possible and benefits from supervised learning despite the small sample size. Several topics are negatively associated with the outbreak of conflict, and these gain importance when predicting hard onsets. The trees in the random forest use the topics in lower nodes, where they are evaluated conditionally on conflict history; this allows the random forest to adapt to the hard problem and provides useful forecasts for prevention.
    Keywords: Armed Conflict; Forecasting; Machine Learning; Newspaper Text; Random Forest; Topic Models
    Date: 2019–05
  8. By: Shanka Subhra Mondal; Sharada Prasanna Mohanty; Benjamin Harlander; Mehmet Koseoglu; Lance Rane; Kirill Romanov; Wei-Kai Liu; Pranoot Hatwar; Marcel Salathe; Joe Byrum
    Abstract: In the IEEE Investment Ranking Challenge 2018, participants were asked to build a model that would identify the best performing stocks based on their returns over a forward six-month window. Anonymized financial predictors and semi-annual returns were provided for a group of anonymized stocks from 1996 to 2017, divided into 42 non-overlapping six-month periods. The second half of 2017 was used as an out-of-sample test of the model's performance. The metrics used were Spearman's rank correlation coefficient and the normalized discounted cumulative gain (NDCG) of the top 20% of a model's predicted rankings. The top six participants were invited to describe their approaches. The solutions were varied and were based on selecting a subset of data to train on, combinations of deep and shallow neural networks, different boosting algorithms, different models with different sets of features, a linear support vector machine, and a combination of a convolutional neural network (CNN) and long short-term memory (LSTM).
    Date: 2019–06
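The two evaluation metrics named in this entry can be stated compactly. Below is an illustrative pure-Python sketch, assuming no tied values for simplicity and using invented scores and returns; it is not the competition's official scoring code.

```python
# Spearman rank correlation and NDCG@k, as used to score the challenge.
import math

def spearman(x, y):
    """Spearman rank correlation for sequences without ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def ndcg_at_k(relevances, k):
    """NDCG of the top-k items: relevances listed in predicted order."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances[:k]) / dcg(ideal[:k])

predicted = [0.9, 0.1, 0.4, 0.7]   # invented model scores for four stocks
actual    = [0.8, 0.0, 0.5, 0.6]   # invented realized semi-annual returns
print(round(spearman(predicted, actual), 3))  # 1.0: identical orderings
```

Both metrics depend only on orderings, which is why the challenge could anonymize the stocks and predictors without affecting scoring.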
  9. By: Xue Guo; Hu Zhang; Tianhai Tian
    Abstract: The development of stock networks is an important approach to exploring the relationships between different stocks in the era of big data. Although a number of methods have been designed to construct stock correlation networks, it remains a challenge to balance the selection of prominent correlations against the connectivity of the network. To address this issue, we propose a new approach that selects essential edges in stock networks while maintaining the connectivity of the established networks. This approach uses a different threshold value for choosing the edges connecting to each particular stock, rather than the single threshold value employed by the existing asset-value method. The innovation of our algorithm is the use of multiple distributions in a maximum-likelihood estimator for selecting the threshold values, rather than the single-distribution estimator of existing methods. Using Chinese Shanghai security market data on 151 stocks, we develop a stock relationship network and analyze the topological properties of the developed network. Our results suggest that the proposed method is able to develop networks that maintain appropriate connectivity within the class of threshold-based methods.
    Date: 2019–06
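A hypothetical sketch of the per-stock (rather than global) thresholding idea, on simulated returns. The paper's actual estimator is likelihood-based; this toy version simply keeps each stock's k strongest absolute correlations, which already guarantees that no node is isolated.

```python
# Per-stock thresholding for a correlation network (simulated returns).
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(size=(250, 6))           # 250 days x 6 stocks, synthetic
corr = np.corrcoef(returns, rowvar=False)     # 6 x 6 correlation matrix

k = 2
edges = set()
for i in range(corr.shape[0]):
    strength = np.abs(corr[i]).copy()
    strength[i] = -1.0                        # exclude self-correlation
    for j in np.argsort(strength)[-k:]:       # per-stock threshold: top-k
        edges.add((min(i, j), max(i, j)))

# By construction every stock has degree at least k, so the selection of
# prominent correlations never disconnects a node from the network.
print(len(edges))
```

A single global threshold would instead drop all edges of a weakly correlated stock, which is exactly the connectivity problem the paper addresses.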
  10. By: IKEUCHI Kenta; MOTOHASHI Kazuyuki
    Abstract: In addition to technological superiority (functional value), design superiority (semantic value) is attracting increasing attention as a source of the competitiveness of new products relative to competing products. In this research, we link patent right data and design right data at the inventor/creator level and quantitatively analyze corporate organizations engaged in design innovation. First, we trained a classification model for the name disambiguation of inventors and creators using patent right and design right applications to the Japan Patent Office; the training data was constructed from rare names, which are less likely to suffer from the shared-name problem. By interconnecting the inventor and creator identifiers estimated with the learned classification model, we identified design creators who also created patented inventions. Next, using this information, we organized the participation of design creators in patent inventions by time series and design category. As a result, we found that a division of innovative labor between invention activity and design activity is in progress. Furthermore, we confirmed that this division of labor is particularly advanced among major patent applicants. Possible background factors include the specialization and fragmentation of innovation activities, the use of external designers, and open innovation.
    Date: 2019–03
  11. By: Andrew Bennett; Nathan Kallus; Tobias Schnabel
    Abstract: Instrumental variable analysis is a powerful tool for estimating causal effects when randomization or full control of confounders is not possible. The application of standard methods such as 2SLS, GMM, and more recent variants is significantly impeded when the causal effects are complex, the instruments are high-dimensional, and/or the treatment is high-dimensional. In this paper, we propose the DeepGMM algorithm to overcome this. Our algorithm is based on a new variational reformulation of GMM with optimal inverse-covariance weighting that allows us to efficiently control very many moment conditions. We further develop practical techniques for optimization and model selection that make it particularly successful in practice. Our algorithm is also computationally tractable and can handle large-scale datasets. Numerical results show our algorithm matches the performance of the best tuned methods in standard settings and continues to work in high-dimensional settings where even recent methods break.
    Date: 2019–05
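For orientation, here is the classical 2SLS baseline that DeepGMM generalizes, sketched on simulated data with a single instrument. This is not the authors' algorithm; the data-generating process and all numbers below are synthetic.

```python
# Two-stage least squares (2SLS) on synthetic data with one instrument z
# and an endogenous regressor x confounded by u; true causal effect = 2.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unobserved confounder
x = z + u + rng.normal(size=n) * 0.1         # endogenous: correlated with u
y = 2.0 * x + u + rng.normal(size=n) * 0.1   # outcome

# Stage 1: project x on z; Stage 2: regress y on the fitted x.
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
X_hat = np.column_stack([np.ones(n), x_hat])
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]
print(round(beta[1], 2))                     # close to the true effect of 2
```

With a single instrument and a linear effect this works well; DeepGMM is aimed at the regime where the effect is nonlinear and the instruments or treatment are high-dimensional, so no such simple projection is available.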
  12. By: Rob Donnelly; Francisco R. Ruiz; David Blei; Susan Athey
    Abstract: This paper proposes a method for estimating consumer preferences among discrete choices, where the consumer chooses at most one product in a category, but selects from multiple categories in parallel. The consumer's utility is additive in the different categories. Her preferences about product attributes as well as her price sensitivity vary across products and are in general correlated across products. We build on techniques from the machine learning literature on probabilistic models of matrix factorization, extending the methods to account for time-varying product attributes and products going out of stock. We evaluate the performance of the model using held-out data from weeks with price changes or out of stock products. We show that our model improves over traditional modeling approaches that consider each category in isolation. One source of the improvement is the ability of the model to accurately estimate heterogeneity in preferences (by pooling information across categories); another source of improvement is its ability to estimate the preferences of consumers who have rarely or never made a purchase in a given category in the training data. Using held-out data, we show that our model can accurately distinguish which consumers are most price sensitive to a given product. We consider counterfactuals such as personally targeted price discounts, showing that using a richer model such as the one we propose substantially increases the benefits of personalization in discounts.
    Date: 2019–06
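The matrix-factorization building block that this paper extends can be sketched as plain SGD on a synthetic consumer-product affinity matrix. The data, rank, learning rate, and regularization below are my invented choices for illustration, not the authors' probabilistic model.

```python
# SGD matrix factorization on a synthetic rank-3 "affinity" matrix.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 20, 10, 3
true_u = rng.normal(size=(n_users, k))
true_v = rng.normal(size=(n_items, k))
R = true_u @ true_v.T                        # synthetic affinities to recover

U = rng.normal(scale=0.5, size=(n_users, k))
V = rng.normal(scale=0.5, size=(n_items, k))
lr, reg = 0.02, 0.01
for _ in range(200):                         # SGD sweeps over all entries
    for i in range(n_users):
        for j in range(n_items):
            e = R[i, j] - U[i] @ V[j]
            U[i] += lr * (e * V[j] - reg * U[i])
            V[j] += lr * (e * U[i] - reg * V[j])

rmse = np.sqrt(np.mean((R - U @ V.T) ** 2))
print(round(rmse, 3))
```

Because each consumer and each product shares a low-dimensional factor, information pools across categories, which is the mechanism behind the paper's ability to estimate preferences even for consumers with few purchases in a given category.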
  13. By: Marcin Hołda (Narodowy Bank Polski)
    Abstract: Using text mining and web scraping techniques, we develop newspaper-based economic uncertainty measures for Poland. We build ‘general’ economic and economic-policy uncertainty indices, as well as category-specific ones designed to capture, for example, the economic uncertainty related to fiscal policy or to stock-market movements. Several types of evidence suggest that these indices do proxy for changes in economic uncertainty in Poland. In particular, our measures spike around uncertainty-laden events or periods, such as the initial phase of Poland’s post-communist economic transition, the global financial crisis or the European debt crisis that followed. Our indices also exhibit correlation with a variety of other indicators of economic uncertainty, such as financial-market data and the results of corporate surveys. The newspaper-based indices behave similarly to uncertainty indicators developed from other textual data and are strongly correlated with relevant economic uncertainty indicators developed by other researchers.
    Keywords: economic uncertainty, index, macroeconomic policy, text mining, web scraping
    JEL: C82 D80 E66
    Date: 2019
  14. By: Martin Labaj; Daniel Dujava
    Abstract: This paper examines the implications of automation capital in a Solow growth model with two types of labour. We study the transition from standard production to production using automation capital, which substitutes for low-skilled workers. We assume that despite advances in technology, AI and machine learning, certain tasks can be performed only by high-skilled labour and are not automatable. We show that under these assumptions, automation capital does not generate endogenous growth without technological progress. However, assuming the presence of technological progress augmenting both the effective number of workers and the effective number of industrial robots, automation increases the rate of long-run growth. We analyse a situation in which some countries do not use robots at all and another group of countries starts the transition to an economy where industrial robots replace low-skilled labour. We show that this has potentially non-linear effects on β-convergence and that the model is consistent with temporary divergence of incomes per capita. We derive a set of estimable equations that allows us to test these hypotheses in a Mankiw-Romer-Weil framework.
    Keywords: Automation, Economic growth, Income inequality, Convergence, Robots
    JEL: D63 E25 O11 O41
    Date: 2019–04–24
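A hypothetical numerical sketch of the mechanism this entry describes, with a functional form and parameters chosen by me rather than taken from the paper: automation capital accumulates alongside ordinary capital and substitutes for low-skilled labour, but without technological progress output converges to a steady state rather than growing forever.

```python
# Toy Solow-style economy with automation capital P substituting for
# low-skilled labour. Production function and parameters are assumptions:
# Y = K^a * (L_low + P)^b * L_high^(1 - a - b).
a, b = 0.3, 0.4
s_k, s_p, delta = 0.2, 0.05, 0.05   # saving rates and depreciation
L_low, L_high = 1.0, 1.0
K, P = 1.0, 0.0                     # start with no robots

path = []
for t in range(300):
    Y = K ** a * (L_low + P) ** b * L_high ** (1 - a - b)
    K += s_k * Y - delta * K
    P += s_p * Y - delta * P        # accumulating automation capital
    path.append(Y)

# Joint returns to K and P are diminishing (a + b < 1), so without
# technological progress output flattens out toward a steady state.
print(round(path[-1] - path[-2], 6))
```

This mirrors the paper's first result; their second result is that adding progress which augments both workers and robots restores long-run growth.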
  15. By: Bertani, Filippo; Ponta, Linda; Raberto, Marco; Teglio, Andrea; Cincotti, Silvano
    Abstract: During the last twenty years, we have witnessed the deep development of digital technologies. Artificial intelligence, software and algorithms have come to affect our daily lives more and more frequently, and most people did not notice. Recently, economists seem to have perceived that this new technological wave could have consequences, but what will they be? Will they be positive or negative? In this paper we try to give a possible answer to these questions through an agent-based computational approach; more specifically, we enrich the large-scale macroeconomic model EURACE with the concept of digital technologies in order to investigate the effects that their business dynamics have at the macroeconomic level. Our preliminary results show that the resulting productivity increase could be a double-edged sword: although the development of the digital technology sector can create new job opportunities, these products could at the same time jeopardize employment inside the traditional mass-production system.
    Keywords: Intangible assets, Industry 4.0, Digital revolution, Agent-based macroeconomics
    JEL: C63 O33
    Date: 2019–05–30
  16. By: Goller, Daniel; Krumer, Alex
    Abstract: Balancing the allocation of games in sports competitions is an important organizational task that can have serious financial consequences. In this paper, we examine data from 9,930 soccer games played in the top German, Spanish, French, and English soccer leagues between 2007/2008 and 2016/2017. Using a machine learning technique for variable selection and applying a semi-parametric analysis of radius matching on the propensity score, we find that in all four leagues, attendance as a share of stadium capacity is lower in games that take place on non-frequently played days than in games on frequently played days. In addition, we find that in all leagues except the English Premier League, there is a significantly lower home advantage for underdog teams on non-frequent days. Our findings suggest that the current schedule favors underdog teams with fewer home games on non-frequent days. Therefore, to increase the fairness of the competitions, it is necessary to adjust the allocation of home games on non-frequent days in a way that eliminates any advantage driven by the schedule. These findings have implications for the stakeholders of the leagues, as well as for coaches and players.
    Keywords: Performance, schedule effects, soccer
    JEL: D00 L00 D20
    Date: 2019–06
  17. By: Isphording, Ingo E. (IZA); Raabe, Tobias (IZA)
    Abstract: Research report funded by the Bundesministerium für Bildung und Forschung (Federal Ministry of Education and Research), Bonn 2019 (30 pages)
    Date: 2019–06–13
  18. By: Svitlana Vyetrenko; Shaojie Xu
    Abstract: We demonstrate an application of risk-sensitive reinforcement learning to optimizing execution in limit order book markets. We represent order execution decisions based on limit order book knowledge as a Markov decision process, and we train a trading agent in a market simulator, which emulates multi-agent interaction by synthesizing market response to our agent's execution decisions from historical data. Due to market impact, executing high-volume orders can incur significant cost. We learn trading signals from market microstructure in the presence of simulated market response and derive explainable decision-tree-based execution policies using risk-sensitive Q-learning to minimize execution cost subject to constraints on cost variance.
    Date: 2019–06
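The ingredients of this entry — a Markov decision process for execution, market impact, and Q-learning — can be illustrated with a toy, risk-neutral tabular example. The environment below (sell 2 lots in 2 steps, quadratic impact) is invented and far simpler than the paper's market simulator, and it omits the risk-sensitive objective.

```python
# Toy tabular Q-learning for order execution: sell 2 lots in 2 steps;
# selling both at once incurs quadratic market impact (cost a**2).
import random

random.seed(0)
ACTIONS = (1, 2)                 # lots to sell this step
Q = {}                           # (inventory, steps_left) -> {action: value}

def step(inv, t, a):
    a = min(a, inv)
    penalty = 10.0 if (t == 1 and inv - a > 0) else 0.0  # leftover inventory
    return inv - a, t - 1, -(a ** 2 + penalty)

alpha, gamma, eps = 0.2, 1.0, 0.1
for _ in range(2000):            # episodes of epsilon-greedy Q-learning
    inv, t = 2, 2
    while t > 0 and inv > 0:
        q = Q.setdefault((inv, t), {a: 0.0 for a in ACTIONS})
        a = random.choice(ACTIONS) if random.random() < eps else max(q, key=q.get)
        inv2, t2, r = step(inv, t, a)
        nxt = max(Q.get((inv2, t2), {0: 0.0}).values()) if inv2 > 0 and t2 > 0 else 0.0
        q[a] += alpha * (r + gamma * nxt - q[a])
        inv, t = inv2, t2

# Splitting the order (1 lot per step, cost 1 + 1 = 2) beats selling both
# at once (cost 4), so the learned policy at (2, 2) should prefer action 1.
best = max(Q[(2, 2)], key=Q[(2, 2)].get)
print(best)
```

The learned Q-table is itself a small, inspectable policy, which hints at why the paper can distill execution policies into compact, explainable decision trees.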

This nep-big issue is ©2019 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found on the NEP homepage. For comments, please write to the director of NEP, Marco Novarese. Put “NEP” in the subject line; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.