nep-big New Economics Papers
on Big Data
Issue of 2021‒11‒22
twenty-two papers chosen by
Tom Coupé
University of Canterbury

  1. What Drives Financial Sector Development in Africa? Insights from Machine Learning By Isaac K. Ofori; Christopher Quaidoo; Pamela E. Ofori
  2. Power of machine learning algorithms for predicting dropouts from a German telemonitoring program using standardized claims data By Hofer, Florian; Birkner, Benjamin; Spindler, Martin
  3. A Scalable Inference Method For Large Dynamic Economic Systems By Pratha Khandelwal; Philip Nadler; Rossella Arcucci; William Knottenbelt; Yi-Ke Guo
  4. Empirical asset pricing and ensemble machine learning By Zhang, Hongwei
  5. On the Fairness of Machine-Assisted Human Decisions By Talia Gillis; Bryce McLaughlin; Jann Spiess
  6. Understanding Science and Policy Making in Agriculture: A Machine Learning Application for India By SJ, Balaji; Babu, Suresh Chandra; Pal, Suresh
  7. Fraud detection in the era of Machine Learning: a household insurance case By Denisa BANULESCU-RADU; Meryem YANKOL-SCHALCK
  8. Understanding Algorithmic Discrimination in Health Economics Through the Lens of Measurement Errors By Anirban Basu; Noah Hammarlund; Sara Khor; Aasthaa Bansal
  9. Forecasting with VAR-teXt and DFM-teXt Models: exploring the predictive power of central bank communication By Leonardo Nogueira Ferreira
  10. Deep Calibration of Interest Rates Model By Mohamed Ben Alaya; Ahmed Kebaier; Djibril Sarr
  11. Algorithmic and human collusion By Werner, Tobias
  12. Double generative adversarial networks for conditional independence testing By Shi, Chengchun; Xu, Tianlin; Bergsma, Wicher; Li, Lexin
  13. Data-Driven Incentive Alignment in Capitation Schemes By Mark Braverman; Sylvain Chassang
  14. Crypto-exchanges and Credit Risk: Modelling and Forecasting the Probability of Closure By Fantazzini, Dean; Calabrese, Raffaella
  15. Ask "Who", Not "What": Bitcoin Volatility Forecasting with Twitter Data By M. Eren Akbiyik; Mert Erkul; Killian Kaempf; Vaiva Vasiliauskaite; Nino Antulov-Fantulin
  16. Forecasting internal migration in Russia using Google Trends: Evidence from Moscow and Saint Petersburg By Fantazzini, Dean; Pushchelenko, Julia; Mironenkov, Alexey; Kurbatskii, Alexey
  17. Regulating big tech: From competition policy to sector regulation? By Budzinski, Oliver; Mendelsohn, Juliane
  18. The direction of technical change in AI and the trajectory effects of government funding By Martina Iori; Arianna Martinelli; Andrea Mina
  19. Tracking Trade from Space: An Application to Pacific Island Countries By Mr. Serkan Arslanalp; Mr. Robin Koepke; Jasper Verschuur
  20. Education for All? Assessing the Impact of Socio-economic Disparity on Learning Engagement During the COVID-19 Pandemic in Indonesia By Samuel Nursamsu; Wisnu Harto Adiwijoyo; Anissa Rahmawati
  21. Landmines: The Local Effects of Demining By Mounu Prem; Juan Vargas; Miguel E. Purroy
  22. As long as you talk about me: The importance of family firm brands and the contingent role of family-firm identity By P. Rovelli; C. Benedetti; A. Fronzetti Colladon; A. De Massis

  1. By: Isaac K. Ofori (University of Insubria, Varese, Italy); Christopher Quaidoo (University of Professional Studies, Accra, Ghana); Pamela E. Ofori (University of Insubria, Varese, Italy)
    Abstract: This study uses machine learning techniques to identify the key drivers of financial development in Africa. To this end, four regularization techniques (the standard lasso, the adaptive lasso, the minimum Schwarz Bayesian information criterion lasso, and the elastic net) are trained on a dataset containing 86 covariates of financial development for the period 1990–2019. The results show that variables such as cell phones, economic globalisation, institutional effectiveness, and literacy are crucial for financial sector development in Africa. Evidence from the partialing-out lasso instrumental variable regression reveals that while inflation and agricultural sector employment suppress financial sector development, cell phones and institutional effectiveness are remarkable in spurring financial sector development in Africa. Policy recommendations are provided in line with the rise in globalisation and technological progress in Africa.
    Keywords: Africa, Elasticnet, Financial Development, Financial Inclusion, Lasso, Regularization, Variable Selection
    JEL: C01 C14 C52 C53 C55 E5 O55
    Date: 2021–01
  2. By: Hofer, Florian; Birkner, Benjamin; Spindler, Martin
    Abstract: Background: Statutory health insurers in Germany offer a variety of disease management, prevention and health promotion programs to their insurees. Identifying patients with a high probability of leaving these programs prematurely helps insurers to offer better support to those at the highest risk of dropping out, potentially reducing costs and improving health outcomes for the most vulnerable. Objective: To evaluate whether machine learning methods outperform linear regression in predicting dropouts from a telemonitoring program. Methods: Use of linear regression and machine learning to predict dropouts from a telemonitoring program for patients with COPD by using information derived from claims data only. Different feature sets are used to compare model performance between and within different methods. Repeated 10-fold cross-validation with downsampling followed by grid searches was applied to tune relevant hyperparameters. Results: Random forest performed best with the highest AUC of 0.60. Applying logistic regression resulted in higher predictive power with regard to the correct classification of dropouts compared to neural networks with a sensitivity of 56%. All machine learning algorithms outperformed linear regression with respect to specificity. Overall predictive performance of all methods was only modest at best. Conclusion: Using features derived from claims data only, machine learning methods performed similarly to linear regression in predicting dropouts from a telemonitoring program. However, as our data set contained information from only 1,302 individuals, our results may not be generalizable to the broader population.
    Date: 2021
  3. By: Pratha Khandelwal; Philip Nadler; Rossella Arcucci; William Knottenbelt; Yi-Ke Guo
    Abstract: The nature of available economic data has changed fundamentally in the last decade due to the economy's digitisation. With the prevalence of often black box data-driven machine learning methods, there is a necessity to develop interpretable machine learning methods that can conduct econometric inference, helping policymakers leverage the new nature of economic data. We therefore present a novel Variational Bayesian Inference approach to incorporate a time-varying parameter auto-regressive model which is scalable for big data. Our model is applied to a large blockchain dataset containing prices, transactions of individual actors, analyzing transactional flows and price movements on a very granular level. The model is extendable to any dataset which can be modelled as a dynamical system. We further improve the simple state-space modelling by introducing non-linearities in the forward model with the help of machine learning architectures.
    Date: 2021–10
  4. By: Zhang, Hongwei (Tilburg University, School of Economics and Management)
    Date: 2021
  5. By: Talia Gillis; Bryce McLaughlin; Jann Spiess
    Abstract: When machine-learning algorithms are deployed in high-stakes decisions, we want to ensure that their deployment leads to fair and equitable outcomes. This concern has motivated a fast-growing literature that focuses on diagnosing and addressing disparities in machine predictions. However, many machine predictions are deployed to assist in decisions where a human decision-maker retains the ultimate decision authority. In this article, we therefore consider how properties of machine predictions affect the resulting human decisions. We show in a formal model that the inclusion of a biased human decision-maker can revert common relationships between the structure of the algorithm and the qualities of resulting decisions. Specifically, we document that excluding information about protected groups from the prediction may fail to reduce, and may even increase, ultimate disparities. While our concrete results rely on specific assumptions about the data, algorithm, and decision-maker, they show more broadly that any study of critical properties of complex decision systems, such as the fairness of machine-assisted human decisions, should go beyond focusing on the underlying algorithmic predictions in isolation.
    Date: 2021–10
  6. By: SJ, Balaji; Babu, Suresh Chandra; Pal, Suresh
    Keywords: Agricultural and Food Policy, Research and Development/Tech Change/Emerging Technologies
    Date: 2021–08
  7. By: Denisa BANULESCU-RADU; Meryem YANKOL-SCHALCK
    Keywords: Fraud detection, Household insurance, Machine learning, Logistic LASSO, XGBoost, Imbalanced data, SHAP
    Date: 2021
  8. By: Anirban Basu; Noah Hammarlund; Sara Khor; Aasthaa Bansal
    Abstract: There is growing concern that the increasing use of machine learning and artificial intelligence-based systems may exacerbate health disparities through discrimination. We provide a hierarchical definition of discrimination consisting of algorithmic discrimination arising from predictive scores used for allocating resources and human discrimination arising from allocating resources by human decision-makers conditional on these predictive scores. We then offer an overarching statistical framework of algorithmic discrimination through the lens of measurement errors, which is familiar to the health economics audience. Specifically, we show that algorithmic discrimination exists when measurement errors exist in either the outcome or the predictors, and there is endogenous selection for participation in the observed data. The absence of any of these phenomena would eliminate algorithmic discrimination. We show that although equalized odds constraints can be employed as bias-mitigating strategies, such constraints may increase algorithmic discrimination when there is measurement error in the dependent variable.
    JEL: C53 I10 I14
    Date: 2021–10
  9. By: Leonardo Nogueira Ferreira
    Abstract: This paper explores the complementarity between traditional econometrics and machine learning and applies the resulting model – the VAR-teXt – to central bank communication. The VAR-teXt is a vector autoregressive (VAR) model augmented with information retrieved from text, turned into quantitative data via a Latent Dirichlet Allocation (LDA) model, whereby the number of topics (or textual factors) is chosen based on their predictive performance. A Markov chain Monte Carlo (MCMC) sampling algorithm for the estimation of the VAR-teXt that takes into account the fact that the textual factors are estimates is also provided. The approach is then extended to dynamic factor models (DFM) generating the DFM-teXt. Results show that textual factors based on Federal Open Market Committee (FOMC) statements are indeed useful for forecasting.
    Date: 2021–11
  10. By: Mohamed Ben Alaya; Ahmed Kebaier; Djibril Sarr
    Abstract: For any financial institution, it is necessary to understand the behavior of interest rates. Despite the rapidly growing use of Deep Learning, classic rate models such as CIR or the Gaussian family are still widely used for many reasons (expertise, ease of use, ...). We propose to calibrate the five parameters of the G2++ model using Neural Networks. To achieve that, we construct synthetic data sets of parameters drawn uniformly from a reference set of parameters calibrated from the market. From those parameters, we compute Zero-Coupon and Forward rates and their covariances and correlations. Our first model is a Fully Connected Neural Network and uses only covariances and correlations. We show that covariances are more suited to the problem than correlations. The second model is a Convolutional Neural Network using only Zero-Coupon rates with no transformation. The methods we propose perform very quickly (less than 0.3 seconds for 2,000 calibrations) and achieve low errors and a good fit.
    Date: 2021–10
  11. By: Werner, Tobias
    Abstract: As self-learning pricing algorithms become popular, there are growing concerns among academics and regulators that algorithms could learn to collude tacitly on non-competitive prices and thereby harm competition. I study popular reinforcement learning algorithms and show that they develop collusive behavior in a simulated market environment. To derive a counterfactual that resembles traditional tacit collusion, I conduct market experiments with human participants in the same environment. Across different treatments, I vary the market size and the number of firms that use a self-learned pricing algorithm. I provide evidence that oligopoly markets can become more collusive if algorithms make pricing decisions instead of humans. In two-firm markets, market prices are weakly increasing in the number of algorithms in the market. In three-firm markets, algorithms weaken competition if most firms use an algorithm and human sellers are inexperienced.
    Keywords: Artificial Intelligence,Collusion,Experiment,Human-Machine Interaction
    JEL: C90 D83 L13 L41
    Date: 2021
  12. By: Shi, Chengchun; Xu, Tianlin; Bergsma, Wicher; Li, Lexin
    Abstract: In this article, we study the problem of high-dimensional conditional independence testing, a key building block in statistics and machine learning. We propose an inferential procedure based on double generative adversarial networks (GANs). Specifically, we first introduce a double GANs framework to learn two generators of the conditional distributions. We then integrate the two generators to construct a test statistic, which takes the form of the maximum of generalized covariance measures of multiple transformation functions. We also employ data-splitting and cross-fitting to minimize the conditions on the generators to achieve the desired asymptotic properties, and employ multiplier bootstrap to obtain the corresponding p-value. We show that the constructed test statistic is doubly robust, and the resulting test both controls type-I error and has the power approaching one asymptotically. Also notably, we establish those theoretical guarantees under much weaker and practically more feasible conditions compared to the existing tests, and our proposal gives a concrete example of how to utilize some state-of-the-art deep learning tools, such as GANs, to help address a classical but challenging statistical problem. We demonstrate the efficacy of our test through both simulations and an application to an anti-cancer drug dataset.
    Keywords: conditional independence; double-robustness; generalized covariance measure; generative adversarial networks; multiplier bootstrap
    JEL: C1
    Date: 2021–11–02
  13. By: Mark Braverman (Princeton University); Sylvain Chassang (Princeton University and NBER)
    Abstract: This paper explores whether Big Data, taking the form of extensive high dimensional records, can reduce the cost of adverse selection by private service providers in government-run capitation schemes, such as Medicare Advantage. We argue that using data to improve the ex ante precision of capitation regressions is unlikely to be helpful. Even if types become essentially observable, the high dimensionality of covariates makes it infeasible to precisely estimate the cost of serving a given type: Big Data makes types observable, but not necessarily interpretable. This gives an informed private operator scope to select types that are relatively cheap to serve. Instead, we argue that data can be used to align incentives by forming unbiased and non-manipulable ex-post estimates of a private operator’s gains from selection.
    Keywords: adverse selection, big data, capitation, health-care regulation, detail-free mechanism design, delegated model selection
    JEL: C55 D82 H51 I11 I13
    Date: 2021–02
  14. By: Fantazzini, Dean; Calabrese, Raffaella
    Abstract: While there is an increasing interest in crypto-assets, the credit risk of crypto-exchanges is still relatively unexplored. To fill this gap, we consider a unique data set on 144 exchanges active from the first quarter of 2018 to the first quarter of 2021. We analyze the determinants of the decision to close an exchange using credit scoring and machine learning techniques. The cybersecurity grades, having a public developer team, the age of the exchange, and the number of available traded cryptocurrencies are the main significant covariates across different model specifications. Both in-sample and out-of-sample analyses confirm these findings. These results are robust to the inclusion of additional variables considering the country of registration of these exchanges and whether they are centralized or decentralized.
    Keywords: Exchange; Bitcoin; Crypto-assets; Crypto-currencies; Credit risk; Bankruptcy; Default Probability
    JEL: C21 C35 C51 C53 G23 G32 G33
    Date: 2021
  15. By: M. Eren Akbiyik; Mert Erkul; Killian Kaempf; Vaiva Vasiliauskaite; Nino Antulov-Fantulin
    Abstract: Understanding the variations in trading price (volatility) and their response to external information is a well-studied topic in finance. In this study, we focus on volatility predictions for a relatively new asset class of cryptocurrencies (in particular, Bitcoin) using deep learning representations of public social media data from Twitter. For the field work, we extracted semantic information and user interaction statistics from over 30 million Bitcoin-related tweets, in conjunction with 15-minute intraday price data over a 144-day horizon. Using this data, we built several deep learning architectures that utilized a combination of the gathered information. For all architectures, we conducted ablation studies to assess the influence of each component and feature set in our model. We found statistical evidence for the hypotheses that: (i) temporal convolutional networks perform significantly better than both autoregressive and other deep learning-based models in the literature, and (ii) the tweet author meta-information, even detached from the tweet itself, is a better predictor than the semantic content and tweet volume statistics.
    Date: 2021–10
  16. By: Fantazzini, Dean; Pushchelenko, Julia; Mironenkov, Alexey; Kurbatskii, Alexey
    Abstract: This paper examines the suitability of Google Trends data for the modeling and forecasting of interregional migration in Russia. Monthly migration data, search volume data, and macro variables are used with a set of univariate and multivariate models to study the migration data of the two Russian cities with the largest migration inflows: Moscow and Saint Petersburg. The empirical analysis does not provide evidence that the more people search online, the more likely they are to relocate to other regions. However, the inclusion of Google Trends data in a model improves the forecasting of the migration flows, because the forecasting errors are lower for models with internet search data than for models without them. These results also hold after a set of robustness checks that consider multivariate models able to deal with potential parameter instability and with a large number of regressors.
    Keywords: Migration; Forecasting; Google Trends; VAR; Cointegration; ARIMA; Russia; Time-varying VAR; Multivariate Ridge regression.
    JEL: C22 C32 C52 C53 C55 F22 J11 O15 R23
    Date: 2021
  17. By: Budzinski, Oliver; Mendelsohn, Juliane
    Keywords: big tech,digital economy,digital ecosystems,GAFAM,competition policy,antitrust,Digital Markets Act (DMA),sector-specific regulation,law and economics
    JEL: K21 K23 K24 L40 L50 L81 L86
    Date: 2021
  18. By: Martina Iori; Arianna Martinelli; Andrea Mina
    Abstract: Government funding of innovation can have a significant impact not only on the rate of technical change, but also on its direction. In this paper, we examine the role that government grants and government departments played in the development of artificial intelligence (AI), an emergent general purpose technology with the potential to revolutionize many aspects of the economy and society. We analyze all AI patents filed at the US Patent and Trademark Office and develop network measures that capture each patent's influence on all possible sequences of follow-on innovation. By identifying the effect of patents on technological trajectories, we are able to account for the long-term cumulative impact of new knowledge that is not captured by standard patent citation measures. We show that patents funded by government grants, but above all patents filed by federal agencies and state departments, profoundly influenced the development of AI. These long-term effects were especially significant in early phases, and weakened over time as private incentives took over. These results are robust to alternative specifications and controlling for endogeneity.
    Keywords: R&D; Technical change; Government subsidies; Technology policy; General purpose technology.
    Date: 2021–11–16
  19. By: Mr. Serkan Arslanalp; Mr. Robin Koepke; Jasper Verschuur
    Abstract: This paper proposes an easy-to-follow approach to track merchandise trade using vessel data and applies it to Pacific island countries. Pacific islands rely heavily on imports and maritime transport for trade. They are also highly vulnerable to climate change and natural disasters that pose risks to ports and supply chains. Using satellite-based vessel tracking data from the UN Global Platform, we construct daily indicators of port and trade activity for Pacific island countries. The algorithm significantly advances estimation techniques of previous studies, particularly by employing ways to overcome challenges with the estimation of cargo payloads, using detailed information on shipping liner schedules to validate port calls, and applying country-specific information to define port boundaries. The approach can complement and help fill gaps in official data, provide early warning signs of turning points in economic activity, and assist policymakers and international organizations to monitor and provide timely responses to shocks (e.g., COVID-19).
    Keywords: Merchandise trade Data; I. Pacific island; liner shipping connectivity index; Merchandise import; supply disruption; Imports; Natural disasters; COVID-19; Trade balance; Tourism; Pacific Islands; Global; port boundary; port call; vessel data
    Date: 2021–08–20
  20. By: Samuel Nursamsu (Australia-Indonesia Partnership for Economic Development (PROSPERA)); Wisnu Harto Adiwijoyo (University of Göttingen); Anissa Rahmawati (Presisi Indonesia)
    Abstract: This paper attempts to shed light on the impact of socio-economic disparity on learning engagement during the COVID-19 pandemic in Indonesia. Utilising search intensity data from Google Trends, school data from Dapodik (Education Core Database), and socio-economic data from the National Socioeconomic Survey, we conduct descriptive analysis, an event study, and difference-in-difference estimations. First, school quality differs in terms of the regions’ development level, especially between western and eastern Indonesia. However, densely populated and well-developed areas generally have lower offline classroom availability. In addition, the quality of public schools is generally lower than that of private schools. Second, our estimation results show that only online-classroom related search intensity increased significantly after school closures on 16 March 2020, while self-learning related search intensity did not. Further, the analysis shows that socio-economic disparity within provinces widens the gap in online learning engagement, albeit with weak evidence from per capita expenditure. Interestingly, provinces with higher inequality and a larger rural population tend to have higher self-learning related search intensity due to students’ necessity to compensate for low learning quality from schools. In addition, technology adoption does not seem to give much of an increase to online-classroom related search intensity but contributes to lower self-learning related search intensity due to increased academic distraction. Our study provides evidence for the Indonesian government to make more precise policy in improving learning quality during the pandemic.
    Keywords: Covid-19 Impact, Education Inequality, Online learning
    JEL: I24 O15
    Date: 2021–11–06
  21. By: Mounu Prem (Universidad del Rosario); Juan Vargas (Universidad del Rosario); Miguel E. Purroy (Inter-American Development Bank)
    Abstract: Anti-personnel landmines are one of the main causes of civilian victimization in conflict-affected areas and a significant obstacle for post-war reconstruction. Demining campaigns are therefore a promising policy instrument to promote long-term development. We argue that the economic and social effects of demining are not unambiguously positive. Demining may have unintended negative consequences if it takes place while conflicts are ongoing, or if it does not lead to full clearance. Using highly disaggregated data on demining operations in Colombia from 2004 to 2019, and exploiting the staggered fashion of demining activity, we find that post-conflict humanitarian demining generates economic growth (measured with nighttime light density) and increases students’ performance in test scores. In contrast, economic activity does not react to post-conflict demining events carried out during military operations, and it decreases if demining takes place while the conflict is ongoing. Rather, demining events that result from military operations are more likely to exacerbate extractive activities.
    Keywords: Landmines, demining, conflict, peace, local development, Colombia
    JEL: D74 P48 Q56 I25
    Date: 2021–11
  22. By: P. Rovelli; C. Benedetti; A. Fronzetti Colladon; A. De Massis
    Abstract: This study explores the role of external audiences in determining the importance of family firm brands and the relationship with firm performance. Drawing on text mining and social network analysis techniques, and considering the brand prevalence, diversity, and connectivity dimensions, we use the semantic brand score to measure the importance the media give to family firm brands. The analysis of a sample of 52,555 news articles published in 2017 about 63 Italian entrepreneurial families reveals that brand importance is positively associated with family firm revenues, and this relationship is stronger when there is an identity match between the family and the firm. This study advances current literature by offering a rich and multifaceted perspective on how external audiences’ perceptions of the brand shape family firm performance.
    Date: 2021–10
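
Several papers in this issue (for example, papers 1 and 7) rely on lasso-type regularization for variable selection. As a rough illustration of the idea only, and not the estimator or software used by any of the authors, the following sketch fits a plain lasso by cyclic coordinate descent on a tiny made-up design. The function names, the toy data, and the penalty value `lam = 0.1` are all assumptions introduced here; the objective is taken to be (1/2n)·||y − Xb||² + lam·||b||₁.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything in [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    X is a list of rows, y a list of responses (hypothetical helper, no intercept).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    # Mean squared value of each column, used to rescale each coordinate update.
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = list(y)  # residual y - X beta, starting from beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


# Toy data (made up): three orthogonal candidate covariates, of which only
# column 1 has a sizeable effect; column 2 has a tiny effect, column 0 none.
X = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1]]
y = [2.05, -1.95, 1.95, -2.05]  # equals 2*col1 + 0.05*col2

beta = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("coefficients:", beta)
print("selected columns:", selected)
```

In this toy case the penalty sets the coefficients on the weak and irrelevant columns to exactly zero and shrinks the remaining one, which is the mechanism that lets the lasso act as a variable-selection device. Adaptive and information-criterion variants differ mainly in how the penalty is weighted or chosen.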

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project is available online. For comments please write to the director of NEP, Marco Novarese. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.