nep-big New Economics Papers
on Big Data
Issue of 2022‒08‒29
nineteen papers chosen by
Tom Coupé
University of Canterbury

  1. Double/debiased machine learning in Stata By Achim Ahrens
  2. Stacking generalization and machine learning in Stata By Achim Ahrens
  3. Prévision de l’inflation en Côte D’ivoire : Analyse Comparée des Modèles Arima, Holt-Winters, et Lstm By Koffi, Siméon
  4. Accuracy of explanations of machine learning models for credit decisions By Andrés Alonso; José Manuel Carbó
  5. Prediction of WIC Program Participation: A Machine Learning Approach to Fix Reporting Error By Luo, Yufeng; Zhen, Chen
  6. Can Machine Learning Predict Defaults in Peer-to-Peer Small Loans? By Muriuki, James M.; Badruddoza, Syed; Fuad, Syed M.
  7. Learning Embedded Representation of the Stock Correlation Matrix using Graph Machine Learning By Bhaskarjit Sarmah; Nayana Nair; Dhagash Mehta; Stefano Pasquali
  8. Preferential Trading in Agricultural and Food Products: New Insights from a Structural Gravity Analysis and Machine Learning By Kim, Dongin; Steinbach, Sandro
  9. Understanding Unfairness in Fraud Detection through Model and Data Bias Interactions By José Pombal; André F. Cruz; João Bravo; Pedro Saleiro; Mário A. T. Figueiredo; Pedro Bizarro
  10. Re-examining adaptation theory using Big Data: Reactions to external shocks By Greyling, Talita; Rossouw, Stephanié
  11. Crypto Coins and Credit Risk: Modelling and Forecasting their Probability of Death By Fantazzini, Dean
  12. Global combinations of expert forecasts By Qian, Yilin; Thompson, Ryan; Vasnev, Andrey L
  13. Distributional neural networks for electricity price forecasting By Grzegorz Marcjasz; Michał Narajewski; Rafał Weron; Florian Ziel
  14. Who Did the Affordable Care Act Medicaid Expansion Impact? Using Linear Discriminant Analysis to Estimate the Probability of Being a Complier By Benjamin C. Chu
  15. Constructing GDP Nowcasting Models Using Alternative Data By Takashi Nakazawa
  16. Estimating Inequality with Missing Incomes By Brunori, Paolo; Salas-Rojo, Pedro; Verme, Paolo
  17. SustainGraph: a Knowledge Graph for tracking Evolution and Interlinking of Sustainable Development Goals' Targets By Eleni Fotopoulou; Ioanna Mandilara; Anastasios Zafeiropoulos; Chrysi Laspidou; Giannis Adamos; Phoebe Koundouri; Symeon Papavassiliou
  18. Trust in an autonomous agent for predictive maintenance: how agent transparency could impact compliance. By Loïck Simon; Philippe Rauffet; Clément Guérin; Cédric Seguin
  19. Developing a Pathway for the Adoption of Machine Learning Systems in Organizations: An Analysis of Drivers, Barriers, and Impacts with a Focus on the Healthcare Sector By Pumplun, Luisa

  1. By: Achim Ahrens (ETH Zürich)
    Abstract: ddml implements algorithms for causal inference aided by supervised machine learning as proposed in "Double/debiased machine learning for treatment and structural parameters" (Econometrics Journal 2018). Five different models are supported, allowing for binary or continuous treatment variables and endogeneity. ddml supports a variety of different ML programs, including lassopack and pystacked.
    Date: 2022–07–03
  2. By: Achim Ahrens (ETH Zürich)
    Abstract: pystacked implements stacked generalization (Wolpert 1992) for regression and binary classification.
    Date: 2022–07–03
  3. By: Koffi, Siméon
    Abstract: This paper attempts to highlight the role of new short-term forecasting methods. It finds that artificial neural networks (LSTM) are more efficient than classical methods (ARIMA and Holt-Winters) in forecasting the HICP of Côte d'Ivoire. The data are from the “Direction des Prévisions, des Politiques et des Statistiques Economiques (DPPSE)” and cover the period from January 2012 to May 2022. The root mean square error of the long short-term memory recurrent neural network (LSTM) is the lowest of the three techniques. Thus, one can assert that the LSTM method improves the prediction by more than 90%, ARIMA by 68%, and Holt-Winters by 61%. These results make machine learning techniques (LSTM) excellent forecasting tools.
    JEL: C15 C81 C88
    Date: 2022–08–01
  4. By: Andrés Alonso (Banco de España); José Manuel Carbó (Banco de España)
    Abstract: One of the biggest challenges for the application of machine learning (ML) models in finance is how to explain their results. In recent years, innovative interpretability techniques have appeared to assist in this task, although their usefulness is still a matter of debate within the industry. In this article we propose a novel framework to assess how accurate these techniques are. Our work is based on the generation of synthetic datasets. This allows us to define the importance of the variables, so we can calculate to what extent the explanations given by these techniques match the ground truth of our data. We perform an empirical exercise in which we apply two non-interpretable ML models (XGBoost and Deep Learning) to the synthetic datasets, and then we explain their results using two popular interpretability techniques, SHAP and permutation Feature Importance (FI). We conclude that generating synthetic datasets shows potential as a useful approach for supervisors and practitioners who wish to assess interpretability techniques.
    Keywords: synthetic datasets, artificial intelligence, interpretability, machine learning, credit assessment
    JEL: C55 C63 G17
    Date: 2022–06
  5. By: Luo, Yufeng; Zhen, Chen
    Keywords: Food Consumption/Nutrition/Food Safety, Research Methods/Statistical Methods, Health Economics and Policy
    Date: 2022–08
  6. By: Muriuki, James M.; Badruddoza, Syed; Fuad, Syed M.
    Keywords: Risk and Uncertainty, Institutional and Behavioral Economics, Agribusiness
    Date: 2022–08
  7. By: Bhaskarjit Sarmah; Nayana Nair; Dhagash Mehta; Stefano Pasquali
    Abstract: Understanding non-linear relationships among financial instruments has various applications in investment processes, ranging from risk management and portfolio construction to trading strategies. Here, we focus on interconnectedness among stocks based on their correlation matrix, which we represent as a network with the nodes representing individual stocks and the weighted links between pairs of nodes representing the corresponding pair-wise correlation coefficients. The traditional network science techniques, which are extensively utilized in financial literature, require handcrafted features such as centrality measures to understand such correlation networks. However, manually listing all such handcrafted features may quickly turn out to be a daunting task. Instead, we propose a new approach for studying nuances and relationships within the correlation network in an algorithmic way using a graph machine learning algorithm called Node2Vec. In particular, the algorithm compresses the network into a lower dimensional continuous space, called an embedding, where pairs of nodes that are identified as similar by the algorithm are placed closer to each other. By using log returns of S&P 500 stock data, we show that our proposed algorithm can learn such an embedding from its correlation network. We define various domain specific quantitative (and objective) and qualitative metrics that are inspired by metrics used in the field of Natural Language Processing (NLP) to evaluate the embeddings in order to identify the optimal one. Further, we discuss various applications of the embeddings in investment management.
    Date: 2022–07
  8. By: Kim, Dongin; Steinbach, Sandro
    Keywords: International Relations/Trade, International Development, Research Methods/Statistical Methods
    Date: 2022–08
  9. By: José Pombal; André F. Cruz; João Bravo; Pedro Saleiro; Mário A. T. Figueiredo; Pedro Bizarro
    Abstract: In recent years, machine learning algorithms have become ubiquitous in a multitude of high-stakes decision-making applications. The unparalleled ability of machine learning algorithms to learn patterns from data also enables them to incorporate biases embedded within. A biased model can then make decisions that disproportionately harm certain groups in society -- limiting their access to financial services, for example. The awareness of this problem has given rise to the field of Fair ML, which focuses on studying, measuring, and mitigating unfairness in algorithmic prediction, with respect to a set of protected groups (e.g., race or gender). However, the underlying causes for algorithmic unfairness still remain elusive, with researchers divided between blaming either the ML algorithms or the data they are trained on. In this work, we maintain that algorithmic unfairness stems from interactions between models and biases in the data, rather than from isolated contributions of either of them. To this end, we propose a taxonomy to characterize data bias and we study a set of hypotheses regarding the fairness-accuracy trade-offs that fairness-blind ML algorithms exhibit under different data bias settings. On our real-world account-opening fraud use case, we find that each setting entails specific trade-offs, affecting fairness in expected value and variance -- the latter often going unnoticed. Moreover, we show how algorithms compare differently in terms of accuracy and fairness, depending on the biases affecting the data. Finally, we note that under specific data bias conditions, simple pre-processing interventions can successfully balance group-wise error rates, while the same techniques fail in more complex settings.
    Date: 2022–07
  10. By: Greyling, Talita; Rossouw, Stephanié
    Abstract: During the global response to COVID-19, the analogy of fighting a war was often used. In 2022, the world faced a different war altogether, an unprovoked Russian invasion of Ukraine. Since 2020 the world has faced these unprecedented shocks. Although we realise these events' health and economic effects, more can be known about the happiness effects on the people in a country and how it differs between a health and a war shock. Additionally, we need to investigate if these external shocks do affect wellbeing, how they differ from one another, and how long it takes happiness to adapt to these shocks. Therefore, this paper aims to compare these two external shocks for ten countries spanning the Northern and Southern hemispheres to investigate the effect on happiness. By investigating the aforementioned, we also re-examine the adaptation theory and see whether it holds at the country level. We use a unique dataset derived from tweets extracted in real-time per country. We derive each tweet's underlying sentiment by applying Natural Language Processing (machine learning). Using the sentiment score, we apply algorithms to construct daily time-series data to measure happiness (Gross National Happiness (GNH)). Our Twitter dataset is combined with data from Oxford's COVID-19 Government Response Tracker. We find that in both instances, the external shocks caused a decrease in GNH. Considering both types of shocks, the adaptation to previous happiness levels occurred within weeks. Understanding the effects of external shocks on happiness is essential for policymakers as effects on happiness have a spillover effect on other variables such as production, safety and trust. Furthermore, the additional macro-level results on the adaptation theory contribute to previously unexplored fields of study.
    Date: 2022
  11. By: Fantazzini, Dean
    Abstract: This paper examined a set of over two thousand crypto-coins observed between 2015 and 2020 to estimate their credit risk by computing their probability of death. We employed different definitions of dead coins, ranging from academic literature to professional practice, alternative forecasting models, ranging from credit scoring models to machine learning and time series-based models, and different forecasting horizons. We found that the choice of the coin death definition affected the set of the best forecasting models to compute the probability of death. However, this choice was not critical, and the best models turned out to be the same in most cases. In general, we found that the cauchit and the zero-price-probability (ZPP) based on the random walk or the Markov Switching-GARCH(1,1) were the best models for newly established coins, whereas credit scoring models and machine learning methods using lagged trading volumes and online searches were better choices for older coins. These results also held after a set of robustness checks that considered different time samples and the coins' market capitalization.
    Keywords: Bitcoin, Crypto-assets, Crypto-currencies, Credit risk, Default Probability, Probability of Death, ZPP, Cauchit, Logit, Probit, Random Forests, Google Trends.
    JEL: C32 C35 C51 C53 C58 G12 G17 G32 G33
    Date: 2022
  12. By: Qian, Yilin; Thompson, Ryan; Vasnev, Andrey L
    Abstract: Expert forecast combination, the aggregation of individual forecasts from multiple subject matter experts, is a proven approach to economic forecasting. To date, research in this area has exclusively concentrated on local combination methods, which handle separate but related forecasting tasks in isolation. Yet, it has been known for over two decades in the machine learning community that global methods, which exploit task-relatedness, can improve on local methods that ignore it. Motivated by the possibility for improvement, this paper introduces a framework for globally combining expert forecasts. Through our framework, we develop global versions of several existing forecast combinations. To evaluate the efficacy of these new global forecast combinations, we conduct extensive comparisons using synthetic and real data. Our real data comparisons, which involve expert forecasts of core economic indicators in the Eurozone, are the first empirical evidence that the accuracy of global combinations of expert forecasts can surpass local combinations.
    Keywords: Forecast combination, local forecasting, global forecasting, multi-task learning, European Central Bank, Survey of Professional Forecasters
    Date: 2022–07–29
  13. By: Grzegorz Marcjasz; Michał Narajewski; Rafał Weron; Florian Ziel
    Abstract: We present a novel approach to probabilistic electricity price forecasting (EPF) which utilizes distributional artificial neural networks. The novel network structure for EPF is based on a regularized distributional multilayer perceptron (DMLP) which contains a probability layer. Using the TensorFlow Probability framework, the neural network's output is defined to be a distribution, either normal or potentially skewed and heavy-tailed Johnson's SU (JSU). The method is compared against state-of-the-art benchmarks in a forecasting study. The study comprises forecasting involving day-ahead electricity prices in the German market. The results show evidence of the importance of higher moments when modeling electricity prices.
    Date: 2022–07
  14. By: Benjamin C. Chu (University of Hawaii)
    Abstract: What is the likelihood of being a complier in the ACA Medicaid expansion? Using linear discriminant analysis (LDA), I estimate how characteristics relating to socioeconomic status and race/ethnicity affect the likelihood that an individual will be a complier, defined as those induced by the expansion to obtain Medicaid coverage. Across multiple specifications, part-time and full-time workers are more likely than non-workers to be compliers. Not only is this result more prominent for Black individuals, but they are also more likely to be compliers compared to other racial/ethnic groups. This paper not only serves to identify the types of individuals who were directly impacted by the expansion, but it also introduces a new approach that combines complier analysis with techniques from machine learning.
    Keywords: Medicaid, ACA, Complier, Linear discriminant analysis
    JEL: I13
    Date: 2022–08
  15. By: Takashi Nakazawa (Bank of Japan)
    Abstract: With coronavirus (COVID-19) having a significant impact on economic activity, it has become difficult for the existing GDP nowcasting model, which uses only monthly and quarterly economic data, to forecast with high accuracy. In this paper, we attempt to improve the accuracy of GDP nowcasting models by using alternative data that are available more promptly. Specifically, we construct nowcasting models that incorporate sparse estimation by Elastic Net using weekly retail sales data and hundreds of daily Internet search volume data, in addition to conventional monthly economic data. For the model formulation and data selection, we prepare a large number of candidate models using the method of forecast combination, which combines multiple forecasting models, and select "Best models" which minimize the forecast error, including data after the spread of COVID-19. The analysis shows that the use of alternative data significantly improves the forecasting accuracy of the model, especially two months prior to the release of GDP, when the availability of monthly and quarterly economic data is limited.
    Keywords: Nowcasting; Alternative Data; Elastic Net; Forecast Combination
    JEL: C52 C53 C55
    Date: 2022–07–28
  16. By: Brunori, Paolo; Salas-Rojo, Pedro; Verme, Paolo
    Abstract: The measurement of income inequality is affected by missing observations, especially if they are concentrated on the tails of an income distribution. This paper conducts an experiment to test how the different correction methods proposed by the statistical, econometric and machine learning literature address measurement biases of inequality due to item non-response. We take a baseline survey and artificially corrupt the data employing several alternative non-linear functions that simulate patterns of income non-response, and show how biased inequality statistics can be when item non-responses are ignored. The comparative assessment of correction methods indicates that most methods are able to partially correct for missing data biases. Sample reweighting based on probabilities of non-response produces inequality estimates quite close to true values in most simulated missing data patterns. Matching and Pareto corrections can also be effective to correct for selected missing data patterns. Other methods, such as Single and Multiple imputations and Machine Learning methods, are less effective. A final discussion provides some elements that help explain these findings.
    Date: 2022
  17. By: Eleni Fotopoulou (National Technical University of Athens); Ioanna Mandilara (National Technical University of Athens); Anastasios Zafeiropoulos (National Technical University of Athens); Chrysi Laspidou (University of Thessaly); Giannis Adamos (University of Thessaly); Phoebe Koundouri; Symeon Papavassiliou (National Technical University of Athens)
    Abstract: The development of solutions to manage or mitigate climate change impacts is very challenging, given the complexity and dynamicity of the socio-environmental and socio-ecological systems that have to be modeled and analyzed to include qualitative variables that are not so easily quantifiable. The existence of qualitative, interoperable and well-interlinked data is considered a must to support this objective, since scientists from different disciplines will have no option but to collaborate and co-design solutions, overcoming barriers related to the semantic mis-alignment of the plethora of available data, the existence of multiple data silos that cannot be easily and jointly processed, and the lack of data quality in many of the produced datasets. In the current work, we present SustainGraph, as a Knowledge Graph that is developed to track information related to the evolution of targets defined in the United Nations Sustainable Development Goals (SDGs) at national and regional level. SustainGraph aims to act as a unified source of knowledge around information related to the SDGs, by taking advantage of the power provided by the development of graph databases and the exploitation of Machine Learning (ML) techniques for data population, knowledge production and analysis purposes. The main concepts represented in SustainGraph are detailed, while indicative usage scenarios are provided. A set of opportunities to take advantage of SustainGraph and open research areas are identified and presented.
    Keywords: Knowledge Graph, Sustainable Development Goal (SDG), Systems Innovation Approach, Climate Change Impact, Participatory Modeling, Graph Database
    Date: 2022–07–25
  18. By: Loïck Simon; Philippe Rauffet; Clément Guérin; Cédric Seguin
    Abstract: Human-machine cooperation is increasingly present in industry. Machines will act as sources of proposals, giving humans propositions and advice. Humans will need to decide whether or not to comply with (i.e., agree to) those propositions. Compliance can be seen as an objective measure of trust, and experimental results are unclear about the role of risk in this compliance. We wanted to understand how transparency on reliability, on risk, or on both together affects compliance with machine propositions. Using an AI for predictive maintenance, we asked participants to make a decision about a replanning proposal. Preliminary results show that transparency on risk and total transparency are linked with less compliance with the AI. Risk transparency appears to have more effect than reliability transparency on creating appropriate trust. In agreement with recent studies, there is a need to understand at a finer level the impact of transparency on human-machine interaction.
    Keywords: Transparency, Human-machine interaction, Compliance, Trust
    Date: 2022–07–24
  19. By: Pumplun, Luisa
    Abstract: The potential of machine learning (ML) and systems based thereon has grown steadily in recent years. The ability of ML systems to rapidly and systematically identify relationships in large volumes of data, which can be used to analyze new data to make meaningful predictions, enables organizations of all industries to make their processes more effective and efficient. Healthcare in particular may benefit greatly from ML systems in the future, as these systems’ capabilities could help to ensure adequate patient care despite many pressing issues, such as the acute shortage of specialists (e.g., through diagnostic support). However, many organizations are currently still failing to harness the potential of ML systems to their advantage, as implementing these systems is not a trivial task. Rather, the integration of ML systems requires the organization to identify and meet novel, multi-faceted preconditions that are unfamiliar as compared with previous, conventional technologies. This is mainly because ML systems exhibit unique characteristics. In particular, ML systems possess probabilistic properties due to their data-based learning approach, implying that their application can lead to erroneous results and that their functioning is often opaque. Particularly in healthcare, in which patients' lives depend on proper diagnoses and treatment, these characteristics result in ML systems not only being helpful, but – if introduced improperly – can also lead to severe detrimental consequences. Since previous research on the adoption of conventional technologies has not yet considered the characteristic properties of ML systems, the aim of this dissertation is to better understand the complex requirements for the successful adoption of ML systems in organizations in order for them to sustainably realize ML systems’ potential. 
The three qualitative, two experimental, and one simulation study included in this cumulative dissertation have been published in peer-reviewed journals and conference proceedings and are divided into three distinct parts with different focuses: The first part of this dissertation identifies the drivers of and barriers to the adoption of ML systems in organizations in general, and in healthcare organizations specifically. Drawing on an interview study with 14 experts from a variety of industries, an integrative overview of the factors influencing the adoption of ML systems is provided, structured according to technical, organizational, and environmental aspects. The interviews further reveal several problem areas where ML provider and ML user organizations’ perceptions diverge, which can lead to the flawed design of ML systems and thus delayed integration. In a second qualitative study, specific factors affecting the integration of ML systems in healthcare organizations are derived based on 22 expert interviews with physicians with ML expertise, and with health information technology providers. In a following step, these interviews are used to establish an operationalized maturity model, which allows for the analysis of the status quo in the adoption process of ML systems in healthcare organizations. How the identified requirements for the organizational introduction of ML systems can be fulfilled is subject of the second part of this dissertation. First, the concept of data donation is introduced as a potential mechanism for organizations, particularly in the healthcare sector, to achieve a valid database. More specifically, individuals’ donation behavior along with its antecedents, such as privacy risks and trust, and under different emotional states, is investigated based on an experimental study among 445 Internet users. 
Next, a design for rendering ML systems more transparent is proposed and evaluated using a questionnaire and an experiment among 223 Internet users. Thereby, the relevance of transparency for building trust among potential users and the resulting willingness to pay for transparent designs is highlighted. A qualitative study is further employed to reveal what motivates potential users, and especially the elderly, to accept health-related ML systems. The third part of this work includes a simulation study that presents the potential impact of adopting ML systems for organizational learning. The results suggest that an organization’s employees can be relieved of some of their learning burden through the application of ML systems, but the systems must be reconfigured appropriately over time. This holds especially true in case of rapid environmental changes, such as those caused by the COVID-19 pandemic. In summary, this dissertation assumes a socio-technical perspective to shed light on the integration of ML systems in organizations. It helps organizations better understand the complex interplay of technical, organizational, human, and environmental factors that are critical to the successful adoption of ML systems, enabling decision makers to target scarce corporate resources more effectively. Moreover, this work enables IS researchers to better grasp the specifics of ML systems, provide required adjustments to theoretical foundations, and sharpen their understanding of the contextual factors involved in the adoption of ML systems in organizations.
    Date: 2022

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.