nep-big New Economics Papers
on Big Data
Issue of 2021‒09‒13
eighteen papers chosen by
Tom Coupé
University of Canterbury

  1. Should I Contact Him or Not? – Quantifying the Demand for Real Estate with Interpretable Machine Learning Methods By Marcelo Cajias; Joseph-Alexander Zeitler
  2. Creating Powerful and Interpretable Models with Regression Networks By Lachlan O'Neill; Simon D Angus; Satya Borgohain; Nader Chmait; David Dowe
  3. The Roots of Inequality: Estimating Inequality of Opportunity from Regression Trees and Forests By Brunori, Paolo; Hufe, Paul; Mahler, Daniel Gerszon
  4. Scaling up SME's credit scoring scope with LightGBM By Bastien Lextrait
  5. Artificial intelligence in asset management By Söhnke M. Bartram; Jürgen Branke; Mehrshad Motahari
  6. Three fundamental problems in risk modeling on big data: an information theory view By Jiamin Yu
  7. Iterated and exponentially weighted moving principal component analysis By Paul Bilokon; David Finkelstein
  8. A random forest based approach for predicting spreads in the primary catastrophe bond market By Makariou, Despoina; Barrieu, Pauline; Chen, Yining
  9. A Gender perspective on the use of Artificial Intelligence in the African FinTech Ecosystem: Case studies from South Africa, Kenya, Nigeria, and Ghana By Ahmed, Shamira
  10. Socialbot representations on cross media platforms during 2020 Taiwanese Presidential Election By Lin, Trisha T. C.
  11. What drives bitcoin? An approach from continuous local transfer entropy and deep learning classification models By Andrés García-Medina; Toan Luu Duc Huynh
  12. Matrix Completion of World Trade By Gnecco Giorgio; Nutarelli Federico; Riccaboni Massimo
  13. Document Classification and Key Information for Technical Due Diligence in Real Estate Management By Philipp Maximilian Mueller; Björn-Martin Kurzrock
  14. From Big to Smart: Ausgewählte Einsatzmöglichkeiten von Smart Data in Banken By Hastenteufel, Jessica; Günther, Maik; Rehfeld, Katharina
  15. Forecasting High-Dimensional Covariance Matrices of Asset Returns with Hybrid GARCH-LSTMs By Lucien Boulet
  16. Proceedings of KDD 2021 Workshop on Data-driven Humanitarian Mapping: Harnessing Human-Machine Intelligence for High-Stake Public Policy and Resilience Planning By Snehalkumar (Neil) S. Gaikwad; Shankar Iyer; Dalton Lunga; Elizabeth Bondi
  17. Bias in Algorithms: On the trade-off between accuracy and fairness By Janssen, Patrick; Sadowski, Bert M.
  18. Forecasting Dynamic Term Structure Models with Autoencoders By Castro-Iragorri, C; Ramírez, J

  1. By: Marcelo Cajias; Joseph-Alexander Zeitler
    Abstract: In light of the rise of the World Wide Web, there is an intense debate about the potential impact of online user-generated data on classical economics. This paper is one of the first to analyze housing demand on that account, employing a large internet search dataset from a housing market platform. Focusing on the German rental housing market, we employ the variable ‘contacts per listing’ as a measure of demand intensity. Apart from traditional economic methods, we apply a state-of-the-art machine learning algorithm, XGBoost, to quantify the factors that lead an apartment to be demanded. Since machine learning algorithms alone cannot establish the causal relationship between the independent and dependent variables, we make use of eXplainable AI (XAI) techniques to draw out the economic meaning and inferences of our results. These suggest that hedonic, socioeconomic, and spatial aspects all influence search intensity. We further find differences in temporal dynamics and geographical variation. Additionally, we compare our results to alternative parametric models and find evidence of the superiority of our nonparametric model. Overall, our findings entail potentially important implications for both researchers and practitioners.
    Keywords: eXtreme Gradient Boosting; Machine Learning; online user-generated search data; Residential Real Estate
    JEL: R3
    Date: 2021–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2021_70&r=
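    A minimal Python sketch of the workflow described in the entry above (not the authors' code; the dataset file and column names are hypothetical): train a gradient-boosted model on listing features and attribute its predictions with SHAP.

      import pandas as pd
      import shap
      import xgboost as xgb
      from sklearn.model_selection import train_test_split

      listings = pd.read_csv("rental_listings.csv")      # hypothetical dataset
      features = ["rent_per_sqm", "living_area", "rooms", "year_built",
                  "dist_city_centre_km", "pop_density"]  # illustrative hedonic/spatial features
      X_train, X_test, y_train, y_test = train_test_split(
          listings[features], listings["contacts_per_listing"],
          test_size=0.2, random_state=0)

      model = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
      model.fit(X_train, y_train)

      # SHAP values attribute each predicted demand intensity to the inputs,
      # providing the post-hoc interpretability layer described in the abstract.
      explainer = shap.TreeExplainer(model)
      shap_values = explainer.shap_values(X_test)
      shap.summary_plot(shap_values, X_test)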
  2. By: Lachlan O'Neill (Faculty of Information Technology, Monash University); Simon D Angus (Dept. of Economics & SoDa Laboratories, Monash Business School, Monash University); Satya Borgohain (SoDa Laboratories, Monash Business School, Monash University); Nader Chmait (Faculty of Information Technology, Monash University); David Dowe (Faculty of Information Technology, Monash University)
    Abstract: As the discipline has evolved, machine learning research has focused more and more on creating more powerful neural networks, without regard for the interpretability of these networks. Such “black-box models” yield state-of-the-art results, but we cannot understand why they make a particular decision or prediction. Sometimes this is acceptable, but often it is not. We propose a novel architecture, Regression Networks, which combines the power of neural networks with the understandability of regression analysis. While some methods for combining these exist in the literature, our architecture generalizes such approaches by taking interactions into account, offering the power of a dense neural network without forsaking interpretability. We demonstrate that these models exceed the state-of-the-art performance of interpretable models on several benchmark datasets, matching the power of a dense neural network. Finally, we discuss how these techniques can be generalized to other neural architectures, such as convolutional and recurrent neural networks.
    Keywords: machine learning, policy evaluation, neural networks, regression, classification
    JEL: C45 C14 C52
    Date: 2021–09
    URL: http://d.repec.org/n?u=RePEc:ajr:sodwps:2021-09&r=
  3. By: Brunori, Paolo (London School of Economics); Hufe, Paul (LMU Munich); Mahler, Daniel Gerszon (World Bank)
    Abstract: In this paper we propose the use of machine learning methods to estimate inequality of opportunity. We illustrate how our proposed methods—conditional inference regression trees and forests—represent a substantial improvement over existing estimation approaches. First, they reduce the risk of ad-hoc model selection. Second, they establish estimation models by trading off upward and downward bias in inequality of opportunity estimates. The advantages of regression trees and forests are illustrated by an empirical application for a cross-section of 31 European countries. We show that arbitrary model selection may lead to significant biases in inequality of opportunity estimates relative to our preferred method. These biases are reflected in both point estimates and country rankings. Our results illustrate the practical importance of leveraging machine learning algorithms to avoid giving misleading information about the level of inequality of opportunity in different societies to policymakers and the general public.
    Keywords: equality of opportunity, machine learning, random forests
    JEL: D31 D63 C38
    Date: 2021–08
    URL: http://d.repec.org/n?u=RePEc:iza:izadps:dp14689&r=
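    A minimal sketch of the ex-ante estimation logic described in the entry above. The paper uses conditional inference trees and forests (e.g. R's partykit); a plain scikit-learn random forest is substituted here purely for illustration, and the survey file and column names are hypothetical.

      import numpy as np
      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor

      data = pd.read_csv("eu_silc_sample.csv")            # hypothetical survey extract
      circumstances = ["parental_education", "parental_occupation",
                       "sex", "country_of_birth"]
      X = pd.get_dummies(data[circumstances], drop_first=True)

      forest = RandomForestRegressor(n_estimators=500, min_samples_leaf=50,
                                     random_state=0)
      forest.fit(X, data["equivalised_income"])
      y_hat = forest.predict(X)                           # outcomes explained by circumstances only

      def mean_log_deviation(y):
          # Mean log deviation, a standard inequality index.
          y = np.asarray(y, dtype=float)
          return float(np.mean(np.log(y.mean() / y)))

      # Inequality of opportunity = inequality in the circumstance-predicted outcomes.
      print("IOp (MLD of fitted values):", mean_log_deviation(y_hat))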
  4. By: Bastien Lextrait
    Abstract: Small and medium-sized enterprises (SMEs) are critical actors in the fabric of the economy. Their growth is often limited by the difficulty of obtaining financing. The Basel II accords obliged banks to estimate the probability of default of their obligors. Currently used models are limited by the simplicity of their architecture and the available data. State-of-the-art machine learning models are not widely used because they are often considered black boxes that cannot be easily explained or interpreted. We propose a methodology that combines high predictive power with strong explainability, using various Gradient Boosting Decision Tree (GBDT) implementations such as the LightGBM algorithm, together with SHapley Additive exPlanation (SHAP) values as a post-prediction explanation model. SHAP values are among the most recent methods for consistently quantifying the impact of each input feature on the credit score. The model is developed and tested on a nation-wide sample of French companies with a highly unbalanced positive event ratio. The performance of the GBDT models is compared with traditional credit scoring algorithms such as Support Vector Machines (SVM) and logistic regression. LightGBM provides the best performance over the test sample, while being fast to train and economically sound. Results obtained from the SHAP value analysis are consistent with previous socio-economic studies, in that they can pinpoint known influential economic factors among hundreds of other features. Providing such a level of explainability for complex models may convince regulators to accept their use in automated credit scoring, which could ultimately benefit both borrowers and lenders.
    Keywords: Credit scoring, SMEs, Machine Learning, Gradient Boosting, Interpretability
    JEL: C53 C63 M21
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:drm:wpaper:2021-25&r=
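    A minimal sketch of the pipeline described in the entry above, with a hypothetical SME dataset and target column: a LightGBM classifier for a highly unbalanced default flag, explained post hoc with SHAP values.

      import lightgbm as lgb
      import pandas as pd
      import shap
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import train_test_split

      firms = pd.read_csv("sme_financials.csv")           # hypothetical SME dataset
      X = firms.drop(columns=["default_12m"])             # financial ratios, etc.
      y = firms["default_12m"]                            # rare positive event
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.3, stratify=y, random_state=0)

      clf = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.03,
                               class_weight="balanced")   # handle the unbalanced event ratio
      clf.fit(X_train, y_train)
      print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

      # SHAP attributes each credit score to the individual input features.
      explainer = shap.TreeExplainer(clf)
      shap_values = explainer.shap_values(X_test)
      shap.summary_plot(shap_values, X_test)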
  5. By: Söhnke M. Bartram; Jürgen Branke; Mehrshad Motahari (Cambridge Judge Business School, University of Cambridge)
    Abstract: Artificial intelligence (AI) has a growing presence in asset management and has revolutionized the sector in many ways. It has improved portfolio management, trading, and risk management practices by increasing efficiency, accuracy, and compliance. In particular, AI techniques help construct portfolios based on more accurate risk and returns forecasts and under more complex constraints. Trading algorithms utilize AI to devise novel trading signals and execute trades with lower transaction costs, and AI improves risk modelling and forecasting by generating insights from new sources of data. Finally, robo-advisors owe a large part of their success to AI techniques. At the same time, the use of AI can create new risks and challenges, for instance as a result of model opacity, complexity, and reliance on data integrity.
    Date: 2020–03
    URL: http://d.repec.org/n?u=RePEc:jbs:wpaper:20202001&r=
  6. By: Jiamin Yu
    Abstract: Since Claude Shannon founded information theory, it has widely fostered other scientific fields, such as statistics, artificial intelligence, biology, behavioral science, neuroscience, economics, and finance. Unfortunately, actuarial science has hardly benefited from information theory: so far, only one actuarial paper on information theory can be found through academic search engines. Undoubtedly, information and risk, both forms of uncertainty, are constrained by the laws of entropy. Today's insurance big data era means more data and more information, and it is unacceptable for risk management and actuarial science to ignore information theory. This paper therefore aims to exploit information theory to discover the performance limits of insurance big data systems and to seek guidance for risk modeling and the development of actuarial pricing systems.
    Date: 2021–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2109.03541&r=
  7. By: Paul Bilokon; David Finkelstein
    Abstract: The principal component analysis (PCA) is a staple statistical and unsupervised machine learning technique in finance. The application of PCA in a financial setting is associated with several technical difficulties, such as numerical instability and nonstationarity. We attempt to resolve them by proposing two new variants of PCA: an iterated principal component analysis (IPCA) and an exponentially weighted moving principal component analysis (EWMPCA). Both variants rely on the Ogita-Aishima iteration as a crucial step.
    Date: 2021–08
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2108.13072&r=
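    A simplified sketch of the exponentially weighted moving PCA idea from the entry above: update an exponentially weighted covariance matrix one observation at a time and re-extract its leading eigenvectors. The Ogita-Aishima refinement step that the paper relies on is omitted here.

      import numpy as np

      def ewm_pca(returns, lam=0.94, n_components=3):
          # returns : (T, N) array of asset returns
          # lam     : exponential decay factor (RiskMetrics-style)
          T, N = returns.shape
          mean = np.zeros(N)
          cov = np.zeros((N, N))
          for t in range(T):
              x = returns[t]
              mean = lam * mean + (1 - lam) * x
              d = x - mean
              cov = lam * cov + (1 - lam) * np.outer(d, d)
              eigval, eigvec = np.linalg.eigh(cov)        # ascending order
              yield eigval[::-1][:n_components], eigvec[:, ::-1][:, :n_components]

      # Example on synthetic data
      rng = np.random.default_rng(0)
      rets = rng.standard_normal((500, 10)) * 0.01
      for vals, vecs in ewm_pca(rets):
          pass                                            # consume the stream of components online
      print("last eigenvalues:", vals)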
  8. By: Makariou, Despoina; Barrieu, Pauline; Chen, Yining
    Abstract: We introduce a random forest approach to enable spread prediction in the primary catastrophe bond market. In a purely predictive framework, we assess the importance of catastrophe spread predictors using permutation and minimal depth methods. The whole population of non-life catastrophe bonds issued from December 2009 to May 2018 is used. We find that the random forest has prediction performance at least as good as our benchmark, linear regression, in the temporal context, and better prediction performance in the non-temporal one. The random forest also performs better than the benchmark when multiple predictors are excluded in accordance with the importance rankings or at random, which indicates that the random forest extracts information from existing predictors more effectively and captures interactions better without the need to specify them. The results of the random forest, in terms of both prediction accuracy and minimal depth importance, are stable. There is only a small divergence between the drivers of catastrophe bond spread in the predictive versus explanatory framework. We believe that the use of random forests can speed up investment decisions in the catastrophe bond industry for both would-be issuers and investors.
    Keywords: catastrophe bond pricing; interactions; machine learning in insurance; minimal depth-importance; permutation importance; primary market spread prediction; random forest; stability
    JEL: G22
    Date: 2021–07–30
    URL: http://d.repec.org/n?u=RePEc:ehl:lserod:111529&r=
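    A minimal sketch of the predictive framework described in the entry above, with hypothetical file and predictor names: a random forest for primary-market spreads, with predictors ranked by permutation importance (one of the two importance measures used in the paper).

      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.inspection import permutation_importance
      from sklearn.model_selection import train_test_split

      bonds = pd.read_csv("cat_bond_issues.csv")          # hypothetical issuance data
      predictors = ["expected_loss", "peril", "trigger_type", "rating",
                    "issue_size", "term_years"]           # illustrative predictors
      X = pd.get_dummies(bonds[predictors], drop_first=True)
      y = bonds["spread_bps"]
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                          random_state=0)

      rf = RandomForestRegressor(n_estimators=1000, random_state=0)
      rf.fit(X_train, y_train)

      # Rank predictors by how much shuffling each one degrades test performance.
      imp = permutation_importance(rf, X_test, y_test, n_repeats=30, random_state=0)
      ranking = pd.Series(imp.importances_mean, index=X.columns).sort_values(ascending=False)
      print(ranking.head(10))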
  9. By: Ahmed, Shamira
    Abstract: The use of artificial intelligence (AI) based systems complements the time-sensitive and data-intensive nature of many activities in the financial services industry (FSI), specifically in the FinTech Ecosystems (FEs) that have disrupted many financial services landscapes in sub-Saharan Africa (SSA). However, FEs have distinct, interrelated supply-side and demand-side dynamics that perpetuate gender disparity. As evidenced in many wealthier economies with greater AI maturity, technology is not neutral; AI in particular is inherently biased and magnifies the pervasive gender and racial discrimination that exists in the ecosystems where it is deployed. This raises questions about how AI can exacerbate or alleviate current dynamics in inequitable and unfair ecosystems. With this in mind, the paper adopts a gender lens to examine the potential impact of using AI-based systems in the FEs of four SSA countries: South Africa, Kenya, Nigeria, and Ghana.
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:zbw:itsb21:238002&r=
  10. By: Lin, Trisha T. C.
    Abstract: This big data research uses CORPRO, sentiment analysis, and an inter-media agenda-setting approach to investigate the cross-platform representation of socialbots and disinformation during the 2020 Taiwanese presidential election. The results show that key terms associated with socialbots emphasize Internet armies, election candidates, Facebook, and China/Taiwan relations. Sentiment in socialbot-related cross-platform content tends to be negative, regardless of media type. Forum content encompassed more diverse topics and more negativity than news media and Facebook. Polarized sentiments and political slants were found across the three media types. A time-series analysis of the salient socialbot event concerning Facebook's deletion of illegal accounts showed inter-media agenda setting from news to Facebook and forums. Even though news media set the initial agenda, journalists and social media users altered story angles and extended narratives to fit political inclinations and reflect polarized views.
    Keywords: Socialbots, disinformation, Taiwanese Presidential Election, big data, sentiment analysis, inter-media agenda setting
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:zbw:itsb21:238037&r=
  11. By: Andrés García-Medina; Toan Luu Duc Huynh
    Abstract: Bitcoin has attracted attention from different market participants due to its unpredictable price patterns: the price has exhibited big jumps as well as extreme, unexpected crashes. We test the predictive power of a wide range of determinants on bitcoin's price direction under the continuous transfer entropy approach as a feature selection criterion. Accordingly, the assets that are statistically significant in the sense of a permutation test on the nearest-neighbour estimation of local transfer entropy are used as features, or explanatory variables, in a deep learning classification model to predict the price direction of bitcoin. The proposed variable selection methodology excludes the NASDAQ index and Tesla as drivers. Under different scenarios and metrics, the best results are obtained using the drivers that are significant during the pandemic as validation. In the test, accuracy increased in the post-pandemic scenario of July 2020 to January 2021 without drivers. In other words, our results indicate that in times of high volatility, bitcoin seems to self-regulate and does not need additional drivers to improve the accuracy of its price direction.
    Date: 2021–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2109.01214&r=
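    A minimal sketch of the downstream classification step only, for the entry above: a small dense network predicting the next-day direction of bitcoin from a set of drivers. The transfer-entropy feature selection described in the abstract is assumed to have been carried out already, and the data file and driver columns are hypothetical.

      import pandas as pd
      import tensorflow as tf

      data = pd.read_csv("btc_drivers.csv")               # hypothetical, pre-selected drivers
      X = data[["gold", "sp500", "ethereum", "vix"]].values
      y = (data["btc_return"].shift(-1) > 0).astype(int).values  # next-day direction
      X, y = X[:-1], y[:-1]                               # drop the last, unlabeled row

      model = tf.keras.Sequential([
          tf.keras.layers.Dense(32, activation="relu", input_shape=(X.shape[1],)),
          tf.keras.layers.Dense(16, activation="relu"),
          tf.keras.layers.Dense(1, activation="sigmoid"),
      ])
      model.compile(optimizer="adam", loss="binary_crossentropy",
                    metrics=["accuracy"])
      history = model.fit(X, y, epochs=50, batch_size=32,
                          validation_split=0.2, verbose=0)
      print("validation accuracy:", history.history["val_accuracy"][-1])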
  12. By: Gnecco Giorgio; Nutarelli Federico; Riccaboni Massimo
    Abstract: This work applies Matrix Completion (MC) -- a class of machine-learning methods commonly used in the context of recommendation systems -- to analyse economic complexity. MC is applied to reconstruct the Revealed Comparative Advantage (RCA) matrix, whose elements express the relative advantage of countries in given classes of products, as evidenced by yearly trade flows. A high-accuracy binary classifier is derived from the application of MC, with the aim of discriminating between elements of the RCA matrix that are, respectively, higher or lower than one. We introduce a novel Matrix cOmpletion iNdex of Economic complexitY (MONEY) based on MC, which is related to the predictability of countries' RCA (the lower the predictability, the higher the complexity). Differently from previously-developed indices of economic complexity, the MONEY index takes into account the various singular vectors of the matrix reconstructed by MC, whereas other indices are based only on one/two eigenvectors of a suitable symmetric matrix, derived from the RCA matrix. Finally, MC is compared with a state-of-the-art economic complexity index (GENEPY). We show that the false positive rate per country of a binary classifier constructed starting from the average entry-wise output of MC can be used as a proxy of GENEPY.
    Date: 2021–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2109.03930&r=
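    A simplified sketch of the matrix-completion step for the entry above: mask part of a (here synthetic) RCA matrix, reconstruct it with a generic soft-impute-style low-rank SVD routine, and classify held-out entries as above or below one. This is not the specific MC algorithm used by the authors.

      import numpy as np

      def complete_low_rank(M, mask, rank=5, n_iter=200):
          # Fill missing entries (mask == False) with a rank-`rank` approximation.
          X = np.where(mask, M, M[mask].mean())           # initialise missing entries
          for _ in range(n_iter):
              U, s, Vt = np.linalg.svd(X, full_matrices=False)
              X_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
              X = np.where(mask, M, X_low)                # keep observed entries fixed
          return X

      rng = np.random.default_rng(0)
      rca = np.exp(rng.standard_normal((100, 200)))       # synthetic stand-in for an RCA matrix
      mask = rng.random(rca.shape) > 0.2                  # hide 20% of the entries

      rca_hat = complete_low_rank(rca, mask)
      pred = rca_hat[~mask] > 1                           # binary classification: RCA above one?
      true = rca[~mask] > 1
      print("held-out accuracy:", (pred == true).mean())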
  13. By: Philipp Maximilian Mueller; Björn-Martin Kurzrock
    Abstract: In real estate transactions, the parties generally have limited time to provide and process information. Using building documentation and digital building data may help to obtain an unbiased view of the asset. In practice, it is still particularly difficult to assess the physical structure of a building due to shortcomings in data structure and quality. Machine learning may improve speed and accuracy of information processing and results. This requires structured documents and applying a taxonomy of unambiguous document classes. In this paper, prioritized document classes from previous research (Müller, Päuser, Kurzrock 2020) are supplemented with key information for technical due diligence reports. The key information is derived from the analysis of n=35 due diligence reports. Based on the analyzed reports and identified key information, a checklist for technical due diligence is derived. The checklist will serve as a basis for a standardized reporting structure. The paper provides fundamentals for generating a (semi-)automated standardized due diligence report with a focus on the technical assessment of the asset. The paper includes recommendations for improving the machine readability of documents and indicates the potential for (partially) automated due diligence processes. The paper concludes with challenges towards an automated information extraction in due diligence processes and the potential for digital real estate management.
    Keywords: digital building documentation; Document Classification; Due diligence; Machine Learning
    JEL: R3
    Date: 2021–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2021_64&r=
  14. By: Hastenteufel, Jessica; Günther, Maik; Rehfeld, Katharina
    Keywords: Big Data, Smart Data, banks, sales management, sales control
    JEL: C89 G21 L29 M19
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:zbw:iubhbm:82021&r=
  15. By: Lucien Boulet
    Abstract: Several academics have studied the ability of hybrid models mixing univariate Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models and neural networks to deliver better volatility predictions than purely econometric models. Despite very promising results, the generalization of such models to the multivariate case has yet to be studied. Moreover, very few papers have examined the ability of neural networks to predict the covariance matrix of asset returns, and all use a rather small number of assets, thus not addressing what is known as the curse of dimensionality. The goal of this paper is to investigate the ability of hybrid models, mixing GARCH processes and neural networks, to forecast covariance matrices of asset returns. To do so, we propose a new model, based on multivariate GARCHs, that decomposes volatility and correlation predictions. The volatilities are forecast using hybrid neural networks, while the correlations follow a traditional econometric process. After implementing the models in a minimum-variance portfolio framework, our results are as follows. First, the addition of GARCH parameters as inputs is beneficial to the proposed model. Second, the use of one-hot encoding to help the neural network differentiate between stocks improves performance. Third, the new model is very promising, as it not only outperforms the equally weighted portfolio but also, by a significant margin, its econometric counterpart that uses univariate GARCHs to predict the volatilities.
    Date: 2021–08
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2109.01044&r=
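    A heavily simplified single-asset sketch of the hybrid idea in the entry above, assuming the arch and TensorFlow packages: fit a GARCH(1,1), then feed its conditional volatility together with the estimated GARCH parameters into an LSTM that forecasts next-day volatility. The paper's model is multivariate and adds a separate correlation process, both omitted here.

      import numpy as np
      import tensorflow as tf
      from arch import arch_model

      rng = np.random.default_rng(0)
      returns = rng.standard_normal(2000)                 # stand-in for percentage returns

      garch = arch_model(returns, vol="GARCH", p=1, q=1).fit(disp="off")
      sigma = garch.conditional_volatility                # GARCH conditional volatility path
      params = garch.params[["omega", "alpha[1]", "beta[1]"]].values

      # Build (lookback, features) windows: past volatilities plus static GARCH parameters.
      lookback = 20
      Xs, ys = [], []
      for t in range(lookback, len(sigma) - 1):
          window = np.column_stack([sigma[t - lookback:t],
                                    np.tile(params, (lookback, 1))])
          Xs.append(window)
          ys.append(sigma[t + 1])
      X, y = np.array(Xs), np.array(ys)

      model = tf.keras.Sequential([
          tf.keras.layers.LSTM(32, input_shape=(lookback, X.shape[2])),
          tf.keras.layers.Dense(1, activation="softplus"),  # volatility is positive
      ])
      model.compile(optimizer="adam", loss="mse")
      model.fit(X, y, epochs=10, batch_size=64, verbose=0)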
  16. By: Snehalkumar (Neil) S. Gaikwad; Shankar Iyer; Dalton Lunga; Elizabeth Bondi
    Abstract: Humanitarian challenges, including natural disasters, food insecurity, climate change, racial and gender violence, environmental crises, the COVID-19 coronavirus pandemic, human rights violations, and forced displacements, disproportionately impact vulnerable communities worldwide. According to UN OCHA, 235 million people will require humanitarian assistance in 2021. Despite these growing perils, there remains a notable paucity of data science research to scientifically inform equitable public policy decisions for improving the livelihood of at-risk populations. Scattered data science efforts exist to address these challenges, but they remain isolated from practice and prone to algorithmic harms concerning lack of privacy, fairness, interpretability, accountability, transparency, and ethics. Biases in data-driven methods carry the risk of amplifying inequalities in high-stakes policy decisions that impact the livelihood of millions of people. Consequently, proclaimed benefits of data-driven innovations remain inaccessible to policymakers, practitioners, and marginalized communities at the core of humanitarian actions and global development. To help fill this gap, we propose the Data-driven Humanitarian Mapping Research Program, which focuses on developing novel data science methodologies that harness human-machine intelligence for high-stakes public policy and resilience planning.
    Date: 2021–08
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2109.00100&r=
  17. By: Janssen, Patrick; Sadowski, Bert M.
    Abstract: Within the discussion on bias in algorithmic selection, fairness interventions are increasingly becoming a popular means to generate more socially responsible outcomes. The paper uses a modified framework based on Rambachan et al. (2020) to empirically investigate the extent to which bias mitigation techniques can provide a more socially responsible outcome and prevent bias in algorithms. Using the algorithmic auditing tool AI Fairness 360 on a synthetically biased dataset, the paper applies different bias mitigation techniques at the pre-processing, in-processing, and post-processing stages of algorithmic selection to account for fairness. The data analysis is aimed at detecting violations of group fairness definitions in trained classifiers. In contrast to previous research, the empirical analysis focuses on the outcomes produced by decisions and the incentive problems behind fairness. The paper shows that training binary classifiers on synthetically generated biased data while applying bias mitigation techniques leads to a decrease in both social welfare and predictive accuracy in 43% of the cases tested. The results of our empirical study demonstrate that fairness interventions, which are designed to correct for bias, often lead to worse societal outcomes. Based on these results, we propose that algorithmic selection involves a trade-off between accuracy of prediction and fairness of outcomes. Furthermore, we suggest that bias mitigation techniques certainly have a place in algorithmic selection, but they have to be evaluated in the context of welfare economics.
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:zbw:itsb21:238032&r=
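    A minimal sketch of the accuracy/fairness trade-off in the entry above, using a hypothetical synthetic data-generating process: train a baseline classifier, then retrain it with Kamiran-Calders-style reweighing (the idea behind AIF360's pre-processing Reweighing) and compare accuracy against a demographic-parity gap.

      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score

      rng = np.random.default_rng(0)
      n = 10_000
      group = rng.integers(0, 2, n)                       # protected attribute A
      skill = rng.standard_normal(n)
      y = ((skill + 0.8 * group + rng.standard_normal(n) * 0.5) > 0.4).astype(int)  # biased labels
      X = np.column_stack([skill, group])

      def demographic_parity_diff(y_pred, a):
          return abs(y_pred[a == 1].mean() - y_pred[a == 0].mean())

      base = LogisticRegression().fit(X, y)
      pred = base.predict(X)
      print("baseline  acc=%.3f  DP gap=%.3f" % (accuracy_score(y, pred),
                                                 demographic_parity_diff(pred, group)))

      # Reweighing: w(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y)
      w = np.empty(n)
      for a in (0, 1):
          for label in (0, 1):
              idx = (group == a) & (y == label)
              w[idx] = (np.mean(group == a) * np.mean(y == label)) / idx.mean()

      fair = LogisticRegression().fit(X, y, sample_weight=w)
      pred_fair = fair.predict(X)
      print("reweighed acc=%.3f  DP gap=%.3f" % (accuracy_score(y, pred_fair),
                                                 demographic_parity_diff(pred_fair, group)))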
  18. By: Castro-Iragorri, C; Ramírez, J
    Abstract: Principal component analysis (PCA) is a statistical approach to building factor models in finance. PCA is also a particular case of a type of neural network known as an autoencoder. Recently, autoencoders have been successfully applied in financial applications using factor models (Gu et al., 2020; Heaton and Polson, 2017). We study the relationship between autoencoders and dynamic term structure models, and we propose different approaches for forecasting. We compare the forecasting accuracy of dynamic factor models based on autoencoders, the classical term structure models proposed in Diebold and Li (2006), and neural network-based approaches for time series forecasting. Empirically, we test the forecasting performance of autoencoders using U.S. yield curve data over the last 35 years. Preliminary results indicate that a hybrid approach using autoencoders and vector autoregressions, framed as a dynamic term structure model, provides an accurate forecast that is consistent throughout the sample. This hybrid approach overcomes in-sample overfitting and structural changes in the data.
    Keywords: autoencoders, factor models, principal components, recurrent neural networks
    JEL: C45 C53 C58
    Date: 2021–07–29
    URL: http://d.repec.org/n?u=RePEc:col:000092:019431&r=
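    A minimal sketch of the hybrid approach in the entry above, on synthetic data standing in for the U.S. yield curve panel: compress the curve with a small autoencoder, fit a VAR on the latent factors, and decode the factor forecast back into a full curve.

      import numpy as np
      import tensorflow as tf
      from statsmodels.tsa.api import VAR

      rng = np.random.default_rng(0)
      T, n_maturities, n_factors = 400, 10, 3
      yields = np.cumsum(rng.standard_normal((T, n_maturities)) * 0.05, axis=0) + 3.0

      # Autoencoder: yields -> latent factors -> yields
      inputs = tf.keras.Input(shape=(n_maturities,))
      latent = tf.keras.layers.Dense(n_factors, activation="linear", name="factors")(inputs)
      outputs = tf.keras.layers.Dense(n_maturities)(latent)
      autoencoder = tf.keras.Model(inputs, outputs)
      autoencoder.compile(optimizer="adam", loss="mse")
      autoencoder.fit(yields, yields, epochs=100, batch_size=32, verbose=0)

      encoder = tf.keras.Model(inputs, latent)
      decoder_in = tf.keras.Input(shape=(n_factors,))
      decoder = tf.keras.Model(decoder_in, autoencoder.layers[-1](decoder_in))

      # VAR on the latent term-structure factors, then decode the forecast.
      factors = encoder.predict(yields, verbose=0)
      var_res = VAR(factors).fit(maxlags=2)
      factor_forecast = var_res.forecast(factors[-var_res.k_ar:], steps=1)
      curve_forecast = decoder.predict(factor_forecast, verbose=0)
      print("forecast curve:", np.round(curve_forecast, 2))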

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.