nep-big New Economics Papers
on Big Data
Issue of 2021‒09‒06
eleven papers chosen by
Tom Coupé
University of Canterbury

  1. Peeking inside the Black Box: Interpretable Machine Learning and Hedonic Rental Estimation By Marcelo Cajias; Jonas Willwersch; Felix Lorenz; Franz Fuerst
  2. On the interpretation of black-box default prediction models: an Italian Small and Medium Enterprises case By Lisa Crosato; Caterina Liberati; Marco Repetto
  3. Evaluation of technology clubs by clustering: A cautionary note By Andres, Antonio Rodriguez; Otero, Abraham; Amavilah, Voxi Heinrich
  4. Bilinear Input Normalization for Neural Networks in Financial Forecasting By Dat Thanh Tran; Juho Kanniainen; Moncef Gabbouj; Alexandros Iosifidis
  5. The Future of Commercial Real Estate Market Research: A Case for Applying Machine Learning By Benedict von Ahlefeldt-Dehn; Marcelo Cajias; Wolfgang Schäfers
  6. Document Classification for Machine Learning in Real Estate Professional Services – Results of the Property Research Trust Project By Philipp Maximilian Mueller; Björn-Martin Kurzrock
  7. Accounting for Spatial Autocorrelation in Algorithm-Driven Hedonic Models: A Spatial Cross-Validation Approach By Juergen Deppner; Marcelo Cajias; Wolfgang Schäfers
  8. Proceedings of KDD 2020 Workshop on Data-driven Humanitarian Mapping: Harnessing Human-Machine Intelligence for High-Stake Public Policy and Resilience Planning By Snehalkumar (Neil) S. Gaikwad; Shankar Iyer; Dalton Lunga; Yu-Ru Lin
  9. Onboarding AI By Boris Babic; Daniel L. Chen; Theodoros Evgeniou; Anne-Laure Fayard
  10. Analysis of Property Yields for Multi-Family Houses with Spatial Method and ANN By Matthias Soot; Sabine Horvath; Hans-Berndt Neuner; Alexandra Weitkamp
  11. What costs should we expect from the EU’s AI Act? By Haataja, Meeri; Bryson, Joanna J.

  1. By: Marcelo Cajias; Jonas Willwersch; Felix Lorenz; Franz Fuerst
    Abstract: Machine Learning (ML) can detect complex relationships to solve problems in various research areas. For estimating real estate prices and rents, ML represents a promising extension to the hedonic literature, since it increases predictive accuracy and is more flexible than the standard regression-based hedonic approach in handling a variety of quantitative and qualitative inputs. Nevertheless, its inferential capacity is limited by its complex non-parametric structure and the ‘black box’ nature of its operations. In recent years, research on Interpretable Machine Learning (IML) has emerged that improves the interpretability of ML applications. This paper aims to elucidate the analytical behaviour of ML methods and their predictions of residential rents by applying a set of model-agnostic methods. Using a dataset of 58k apartment listings in Frankfurt am Main (Germany), we estimate rent levels with the eXtreme Gradient Boosting algorithm (XGB). We then apply Permutation Feature Importance (PFI), Partial Dependence Plots (PDP), Individual Conditional Expectation (ICE) curves and Accumulated Local Effects (ALE). Our results suggest that IML methods can provide valuable insights and yield higher interpretability of ‘black box’ models. According to the PFI results, the most relevant locational variables for apartments are proximity to bars, convenience stores and bus station hubs. Feature effects show that ML identifies non-linear relationships between rent and proximity variables. Rental prices increase up to a distance of approximately 3 kilometres from a central bus hub, followed by a steep decline. We therefore assume that tenants face a trade-off between good infrastructural accessibility and locational separation from the disamenities associated with traffic hubs, such as noise and air pollution. The same holds true for proximity to bars, with rents peaking at a distance of 1 km. While tenants appear to appreciate nearby nightlife facilities, immediate proximity is subject to rental discounts. In summary, IML methods can increase the transparency of ML models and thereby identify important patterns in rental markets. This may lead to a better understanding of residential real estate and offer new insights for researchers as well as practitioners. (An illustrative code sketch of such a pipeline follows this entry.)
    Keywords: Explainable Artificial Intelligence; housing; Machine Learning; Non-parametric hedonic models
    JEL: R3
    Date: 2021–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2021_104&r=
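    Code sketch: a minimal, illustrative Python sketch of fitting a gradient-boosting model on hypothetical rental data and inspecting it with model-agnostic IML tools (PFI and PDP/ICE curves); the data file, column names and parameters are assumptions, not the authors' implementation.
      import pandas as pd
      from xgboost import XGBRegressor
      from sklearn.inspection import permutation_importance, PartialDependenceDisplay
      from sklearn.model_selection import train_test_split

      listings = pd.read_csv("frankfurt_listings.csv")   # hypothetical data file
      X = listings[["sqm", "rooms", "dist_bus_hub_km", "dist_bar_km"]]
      y = listings["net_rent"]
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      model = XGBRegressor(n_estimators=500, learning_rate=0.05).fit(X_train, y_train)

      # Permutation Feature Importance: drop in test accuracy when a feature is shuffled
      pfi = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
      print(sorted(zip(X.columns, pfi.importances_mean), key=lambda t: -t[1]))

      # Partial dependence plus ICE curves for one proximity variable
      PartialDependenceDisplay.from_estimator(model, X_test, ["dist_bus_hub_km"], kind="both")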
  2. By: Lisa Crosato; Caterina Liberati; Marco Repetto
    Abstract: Academic research and the financial industry have recently paid great attention to Machine Learning algorithms due to their power to solve complex learning tasks. In the field of firms' default prediction, however, the lack of interpretability has prevented the extensive adoption of black-box models. To overcome this drawback while maintaining the high performance of black-boxes, this paper relies on a model-agnostic approach. Accumulated Local Effects and Shapley values are used to shape the predictors' impact on the likelihood of default and to rank them according to their contribution to the model outcome. Prediction is achieved by two Machine Learning algorithms (eXtreme Gradient Boosting and a Feedforward Neural Network) compared with three standard discriminant models. The results show that, for Italian Small and Medium Enterprises in the manufacturing industry, the eXtreme Gradient Boosting algorithm delivers the highest overall classification power without giving up a rich interpretation framework. (A minimal code sketch in this spirit follows this entry.)
    Date: 2021–08
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2108.13914&r=
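    Code sketch: a hedged Python sketch of the kind of pipeline described, a gradient-boosting default classifier explained with Shapley values; the data file and columns are placeholders, not the authors' specification.
      import pandas as pd
      import shap
      from xgboost import XGBClassifier

      firms = pd.read_csv("sme_financials.csv")   # hypothetical: financial ratios + default flag
      X = firms.drop(columns=["default"])
      y = firms["default"]

      clf = XGBClassifier(n_estimators=300, max_depth=4).fit(X, y)

      # Shapley values attribute each prediction to the individual predictors,
      # yielding a ranking by contribution to the estimated default risk.
      explainer = shap.TreeExplainer(clf)
      shap_values = explainer.shap_values(X)
      shap.summary_plot(shap_values, X)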
  3. By: Andres, Antonio Rodriguez; Otero, Abraham; Amavilah, Voxi Heinrich
    Abstract: Applications of machine learning techniques to economic problems are increasing. These are powerful techniques with great potential to extract insights from economic data. However, care must be taken to apply them correctly, or the wrong conclusions may be drawn. In the technology clubs literature, after applying a clustering algorithm, some authors train a supervised machine learning technique, such as a decision tree or a neural network, to predict the labels of the clusters. They then use some performance metric of that prediction (typically accuracy) as a measure of the quality of the clustering configuration they have found. This is an error with potentially negative implications for policy, because obtaining high accuracy in such a prediction does not mean that the clustering configuration found is correct. This paper explains in detail why this modus operandi is not sound from a theoretical point of view and uses computer simulations to demonstrate it. We caution policymakers and indicate directions for future investigations. (A small simulation illustrating the point is sketched after this entry.)
    Keywords: Machine learning; clustering; technological change; technology clubs; knowledge economy; cross-country
    JEL: C45 C53 O38 O57 P41
    Date: 2021–05–15
    URL: http://d.repec.org/n?u=RePEc:pra:mprapa:109138&r=
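    Code sketch: a small Python simulation in the spirit of the paper's argument; even when the data contain no real group structure, a classifier can reproduce k-means labels with high accuracy, so that accuracy says nothing about cluster quality.
      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(0)
      X = rng.normal(size=(500, 5))   # pure noise: no true clusters exist

      labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

      # A decision tree easily learns the arbitrary k-means partition of noise.
      acc = cross_val_score(DecisionTreeClassifier(), X, labels, cv=5).mean()
      print(f"accuracy on meaningless cluster labels: {acc:.2f}")   # typically high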
  4. By: Dat Thanh Tran; Juho Kanniainen; Moncef Gabbouj; Alexandros Iosifidis
    Abstract: Data normalization is one of the most important preprocessing steps when building a machine learning model, especially when the model of interest is a deep neural network. This is because deep neural networks optimized with stochastic gradient descent are sensitive to the input variable range and prone to numerical issues. Unlike other types of signals, financial time-series often exhibit unique characteristics such as high volatility, non-stationarity and multi-modality that make them challenging to work with, often requiring expert domain knowledge to devise a suitable processing pipeline. In this paper, we propose a novel data-driven normalization method for deep neural networks that handle high-frequency financial time-series. The proposed normalization scheme, which takes into account the bimodal characteristic of financial multivariate time-series, requires no expert knowledge to preprocess a financial time-series, since this step is formulated as part of the end-to-end optimization process. Our experiments, conducted with state-of-the-art neural networks and high-frequency data from two large-scale limit order books from the Nordic and US markets, show significant improvements over other normalization techniques in forecasting future stock price dynamics. (A hedged sketch of the general idea of learnable normalization follows this entry.)
    Date: 2021–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2109.00983&r=
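    Code sketch: a hedged PyTorch sketch of the general idea of a trainable, data-driven normalization layer placed in front of a forecasting network; it does not reproduce the paper's bilinear scheme, it only shows how normalization can become part of end-to-end optimization.
      import torch
      import torch.nn as nn

      class LearnableNormalization(nn.Module):
          """Normalizes a (batch, time, features) tensor and rescales it with
          trainable parameters along the time and feature dimensions."""
          def __init__(self, n_time, n_features, eps=1e-6):
              super().__init__()
              self.eps = eps
              self.feature_gain = nn.Parameter(torch.ones(n_features))
              self.feature_bias = nn.Parameter(torch.zeros(n_features))
              self.time_gain = nn.Parameter(torch.ones(n_time, 1))

          def forward(self, x):
              mean = x.mean(dim=1, keepdim=True)   # mean over the time axis
              std = x.std(dim=1, keepdim=True) + self.eps
              z = (x - mean) / std                 # standard z-scoring
              return self.time_gain * (z * self.feature_gain + self.feature_bias)

      # The layer's parameters are learned jointly with the forecasting network,
      # so no hand-crafted preprocessing of the order-book series is required.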
  5. By: Benedict von Ahlefeldt-Dehn; Marcelo Cajias; Wolfgang Schäfers
    Abstract: The commercial real estate market is opaque and built upon complex relationships between countless property market and macroeconomic factors. Yet, due to their sheer volume and importance for numerous market players such as investors, developers, mortgage underwriters and valuation firms, office markets are broadly researched. In particular, the prediction of property market indicators has attracted strong interest among researchers and practitioners in the field of commercial real estate. The literature proposes three main frameworks for predicting office rents, among others: estimation via multiple-equation models such as error correction models (e.g. Hendershott et al., 2002; Ke and White, 2009; McCartney, 2012) or interlinked demand and supply models (e.g. Rosen, 1984; Hendershott et al., 1999; Kim, 2012); reduced-form single-equation models (e.g. Matysiak and Tsolacos, 2003; Voigtländer, 2010; Kiehelä and Falkenbach, 2014); and autoregressive models (e.g. McGough and Tsolacos, 1995; Brooks and Tsolacos, 2000; Stevenson and McGrath, 2003). However, the limitations of the applied methods lie within the econometric methods themselves. “Traditional” statistical modelling as an approximation of causality will only capture trends and relationships in the underlying market to the degree that the employed econometric methods can mirror them. In contrast, more recent methodological approaches such as machine learning can be seen as a process of selecting the relevant features, leading to a trade-off between precision and stability of a predictive model (Conway, 2018). This, however, creates opportunities to expand and enhance existing efforts in a way that captures complex and non-linear relationships within the data. Many studies (e.g. Dabrowski and Adamczyk, 2010; Rafatirad, 2017; Cajias and Ertl, 2018; Mayer et al., 2019) apply advanced machine learning methods to residential markets and demonstrate that “traditional” linear hedonic models can be outperformed. While linear models are found to produce less volatile predictions, advanced machine learning methods yield more accurate results. Promising results can also be shown in commercial real estate markets. The aim of this research is to assess the performance of advanced machine learning methods in forecasting office rents in European markets. A dataset of European markets with office prime rents as well as market and macroeconomic indicators is analysed, and advanced machine learning models are estimated. A “traditional” linear regression model (ordinary least squares) serves as a benchmark for the evaluation of the employed methods: random forest and extreme gradient boosting. In particular, the predictive power and forecasting ability are assessed in-sample and out-of-sample, respectively. The tree-based advanced machine learning methods yield promising estimations in the observed markets. It becomes clear that complex and non-linear relationships are present in commercial real estate markets and can effectively be estimated by non-parametric econometric models. By applying these methods, the out-of-sample estimation error can be reduced by up to 60 percent. To the best of the authors' knowledge, such applications of machine learning methods to commercial real estate markets have not been considered in prior research. However, in the area of textual analysis, results show that commercial real estate markets can be forecast on the basis of market sentiment (e.g. Beracha et al., 2019). The capability to improve forecasting power with advanced machine learning methods creates value and transparency for numerous market players and authorities. (A schematic benchmark comparison is sketched after this entry.)
    Keywords: commercial real estate; Forecast; Machine Learning; Office Rent
    JEL: R3
    Date: 2021–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2021_49&r=
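    Code sketch: a schematic Python comparison, under assumed data, of an OLS benchmark against the tree-based learners named in the abstract; the file, features and train/test split are placeholders.
      import pandas as pd
      from sklearn.linear_model import LinearRegression
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import mean_absolute_percentage_error
      from xgboost import XGBRegressor

      panel = pd.read_csv("office_markets.csv")   # hypothetical market panel
      train = panel[panel["year"] <= 2017]        # in-sample period
      test = panel[panel["year"] > 2017]          # out-of-sample period
      features = ["vacancy_rate", "gdp_growth", "employment_growth", "completions"]

      for name, est in [("OLS", LinearRegression()),
                        ("Random forest", RandomForestRegressor(n_estimators=500)),
                        ("XGBoost", XGBRegressor(n_estimators=500, learning_rate=0.05))]:
          est.fit(train[features], train["prime_rent"])
          err = mean_absolute_percentage_error(test["prime_rent"], est.predict(test[features]))
          print(f"{name}: out-of-sample MAPE = {err:.1%}")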
  6. By: Philipp Maximilian Mueller; Björn-Martin Kurzrock
    Abstract: Due to numerous documents and the lack of widely acknowledged standards, the capture and provision of information in transaction processes frequently remains challenging. Since construction and maintenance come with substantial costs, the evaluation of the structural condition and maintenance requirements, as well as the assessment of contracts and legal structures, are important in real estate transactions. The quality and completeness of digital building documentation is increasingly becoming a deal maker or deal breaker. Artificial intelligence can assist in the classification of documents and the extraction of information. This research provides the fundamentals for generating a (semi-)automated, standardized technical and legal assessment of buildings. Based on a large building documentation set from (institutional) investors, the potential for digital processing, automated classification and information extraction through machine learning algorithms is demonstrated. For this purpose, more than 400 document classes are derived, reviewed, prioritized and checked in principle for machine readability. In addition, key information is structured and prioritized for technical and legal due diligence. The paper highlights recommendations for improving the machine readability of documents and indicates the potential for partially automating technical and legal due diligence processes. The practical recommendations are relevant for investors, owners, users and service providers who depend on specific real estate information, as well as for companies that develop or use software tools. For policymaking, the research offers some guidance for standardizing documents to support digital information processing in real estate. The recommendations are helpful for improving information processing and, in general, for promoting the use of automated information extraction based on machine learning in real estate. (A hedged sketch of such a document-classification step follows this entry.)
    Keywords: digital building documentation; Due diligence; Machine Learning; Property Research Trust
    JEL: R3
    Date: 2021–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2021_65&r=
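    Code sketch: a hedged Python sketch of a document-classification step of the kind described, mapping extracted document text to document classes; the file and labels are illustrative, not the project's actual taxonomy.
      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import classification_report
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline

      docs = pd.read_csv("building_documents.csv")   # hypothetical: text, doc_class
      X_train, X_test, y_train, y_test = train_test_split(
          docs["text"], docs["doc_class"], stratify=docs["doc_class"], random_state=0)

      clf = make_pipeline(TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
      clf.fit(X_train, y_train)
      print(classification_report(y_test, clf.predict(X_test)))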
  7. By: Juergen Deppner; Marcelo Cajias; Wolfgang Schäfers
    Abstract: Aim of the research: Real estate markets feature a spatial dimension that is pivotal for the economic value of housing. The inherent spatial dependence in the underlying price determination process cannot simply be overlooked in linear hedonic model specifications, as this would render spurious results (see Anselin, 1988; Can and Megbolugbe, 1997; Basu and Thibodeau, 1998). Guidance on how to account for spatial dependence in linear regression models is vast and remains the subject of many contributions to the hedonic and spatial econometric literature (see LeSage and Pace, 2009; Anselin, 2010; Elhorst, 2014). Moving from the parametric paradigm of hedonic regression methods to the universe of non-parametric statistical learning methods such as decision trees, random forests or boosting techniques, the literature has brought forth an increasing body of evidence that such algorithms are capable of providing superior predictive performance for complex, non-linear and multi-dimensional regression problems, including various applications to house price estimation (e.g. Mayer et al., 2019; Pace and Hayunga, 2020; Bogin and Shui, 2020). However, in contrast to linear models, little attention has been paid to the implications of spatial dependence in house prices for the statistical validity of error estimates of machine learning algorithms, although independence of the data is implicitly assumed (see Roberts et al., 2017; Schratz et al., 2019). Our study investigates the role of spatial autocorrelation (SAC) in the accuracy assessment of algorithmic hedonic methods, thereby benchmarking spatially conscious machine learning approaches against linear and spatial hedonic methods. Study design and methodology: Machine learning algorithms learn the relationship between the response and the regressors autonomously, without requiring any a-priori specification of its functional form. As their high flexibility makes such approaches prone to overfitting, resampling strategies such as k-fold cross-validation are applied to approximate a model's out-of-sample predictive performance. During resampling, the observations are randomly partitioned into mutually exclusive training and test subsets, whereby the predictor is fitted on the training data and evaluated on the test data. SAC can be accounted for using spatial resampling strategies, which attempt to reduce SAC between training and test data through a modification of the splitting process. Instead of randomly partitioning the data, which implicitly assumes their independence, spatially clustered partitions are created using the observations' coordinates (see Brenning, 2012). We train and evaluate tree-based algorithms on a pooled cross-section of asking rents in Germany using both random and spatial partitioning, and subsequently forecast out-of-sample data to assess the bias in the in-sample error estimates associated with SAC. The results are benchmarked against well-specified ordinary least squares and spatial autoregressive frameworks to compare the models' generalizability. Originality and implications: Applying machine learning to spatial data without accounting for SAC provides the predictor with information that is assumed to be unavailable during training, which may lead to a biased accuracy assessment (see Lovelace et al., 2021). This study sheds light on the accuracy bias of random resampling induced by SAC in a hedonic context. The results prove useful for increasing the robustness and generalizability of algorithmic approaches to hedonic regression problems, and thereby contain valuable implications for appraisal practice. To the best of our knowledge, no research in the existing literature has thus far accounted for SAC in an algorithm-driven hedonic context by applying spatial cross-validation. We conclude that random resampling yields over-optimistic prediction accuracies, whereas spatial resampling increases generalizability and thus robustness to unseen data. We also find the bias to be lower for algorithms that apply column subsampling to counteract overfitting. (A minimal sketch contrasting random and spatial cross-validation follows this entry.)
    Keywords: Hedonic Models; Machine Learning; Spatial Autocorrelation; Spatial Cross Validation
    JEL: R3
    Date: 2021–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2021_51&r=
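    Code sketch: a minimal Python sketch of the contrast between random and spatial cross-validation; folds are either drawn at random or built from k-means clusters of the listings' coordinates, so that test folds are spatially separated from training folds. Data and columns are placeholders.
      import pandas as pd
      from sklearn.cluster import KMeans
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import GroupKFold, KFold, cross_val_score

      rents = pd.read_csv("asking_rents.csv")   # hypothetical listings data
      X = rents[["sqm", "rooms", "age", "lon", "lat"]]
      y = rents["rent"]
      model = RandomForestRegressor(n_estimators=300, random_state=0)

      # Random k-fold: ignores spatial dependence between training and test folds.
      random_cv = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error",
                                  cv=KFold(5, shuffle=True, random_state=0))

      # Spatial k-fold: folds follow spatial clusters of the coordinates.
      blocks = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(rents[["lon", "lat"]])
      spatial_cv = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error",
                                   cv=GroupKFold(5), groups=blocks)

      print("random CV RMSE: ", -random_cv.mean())
      print("spatial CV RMSE:", -spatial_cv.mean())   # typically larger, less optimistic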
  8. By: Snehalkumar (Neil) S. Gaikwad; Shankar Iyer; Dalton Lunga; Yu-Ru Lin
    Abstract: Humanitarian challenges, including natural disasters, food insecurity, climate change, racial and gender violence, environmental crises, the COVID-19 coronavirus pandemic, human rights violations, and forced displacements, disproportionately impact vulnerable communities worldwide. According to UN OCHA, 235 million people will require humanitarian assistance in 2021. Despite these growing perils, there remains a notable paucity of data science research to scientifically inform equitable public policy decisions for improving the livelihood of at-risk populations. Scattered data science efforts exist to address these challenges, but they remain isolated from practice and prone to algorithmic harms concerning lack of privacy, fairness, interpretability, accountability, transparency, and ethics. Biases in data-driven methods carry the risk of amplifying inequalities in high-stakes policy decisions that impact the livelihood of millions of people. Consequently, the proclaimed benefits of data-driven innovations remain inaccessible to policymakers, practitioners, and marginalized communities at the core of humanitarian actions and global development. To help fill this gap, we propose the Data-driven Humanitarian Mapping Research Program, which focuses on developing novel data science methodologies that harness human-machine intelligence for high-stakes public policy and resilience planning.
    Date: 2021–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:2109.00435&r=
  9. By: Boris Babic; Daniel L. Chen (TSE - Toulouse School of Economics - UT1 - Université Toulouse 1 Capitole - Université Fédérale Toulouse Midi-Pyrénées - EHESS - École des hautes études en sciences sociales - CNRS - Centre National de la Recherche Scientifique - INRAE - Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement); Theodoros Evgeniou; Anne-Laure Fayard
    Abstract: In a 2018 Workforce Institute survey of 3,000 managers across eight industrialized nations, the majority of respondents described artificial intelligence as a valuable productivity tool. But respondents to that survey also expressed fears that AI would take their jobs. They are not alone. The Guardian recently reported that in the UK "more than 6 million workers fear being replaced by machines". AI's advantages can be cast in a dark light: why would humans be needed when machines can do a better job? To allay such fears, employers must set AI up to succeed rather than to fail. The authors draw on their own and others' research and consulting on AI and information systems implementation, along with organizational studies of innovation and work practices, to present a four-phase approach to implementing AI. It allows organizations to cultivate people's trust (a key condition for adoption) and to work toward a distributed cognitive system in which humans and artificial intelligence both continually improve.
    Date: 2021
    URL: http://d.repec.org/n?u=RePEc:hal:journl:hal-03276433&r=
  10. By: Matthias Soot; Sabine Horvath; Hans-Berndt Neuner; Alexandra Weitkamp
    Abstract: In this work, we compare the results of multiple linear regression (MLR) with a spatial analysis method, geographically weighted regression (GWR), and an artificial neural network (ANN) approach to derive a state-wide model for property yields. The database consists of approximately 3,000 purchase prices for multi-family houses collected in the purchase price database of Lower Saxony (Germany). The purchases occurred between 2014 and 2018. The locational quality as well as the theoretical age (depreciation) of the properties are the influencing variables in the analysis. In the GWR, different fixed and variable kernels are used. The approaches are evaluated using a cross-validation procedure with quality measures such as the root mean square error (RMSE), the mean absolute error (MAE) and the share of errors below 5% (eb5). The first analysis shows that GWR leads to better results than the classical approach (MLR) because local phenomena can be modelled. The ANN approach is also superior to classical regression analysis because of its ability to model non-linearities. On this dataset, however, the ANN cannot reach the accuracy of GWR, which leads to the conclusion that spatial inhomogeneity has a bigger influence than data non-linearity. Further investigation shows that the complexity of the data and the amount of available data play a key role in the performance of the ANN. (The evaluation logic is sketched after this entry.)
    Keywords: ANN; GWR; Multi-family houses; property yields
    JEL: R3
    Date: 2021–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2021_44&r=
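    Code sketch: an illustrative Python sketch of the evaluation logic described, comparing models by cross-validated RMSE, MAE and the share of errors below 5 percent; the GWR step itself would require a dedicated package (e.g. mgwr) and is omitted, and data and column names are placeholders.
      import numpy as np
      import pandas as pd
      from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import cross_val_predict
      from sklearn.neural_network import MLPRegressor

      sales = pd.read_csv("multifamily_sales.csv")   # hypothetical purchase price data
      X = sales[["locational_quality", "theoretical_age"]]
      y = sales["property_yield"]

      for name, est in [("MLR", LinearRegression()),
                        ("ANN", MLPRegressor(hidden_layer_sizes=(32, 16),
                                             max_iter=2000, random_state=0))]:
          pred = cross_val_predict(est, X, y, cv=10)
          err = pred - y
          rmse = np.sqrt(np.mean(err ** 2))
          mae = np.mean(np.abs(err))
          eb5 = np.mean(np.abs(err / y) < 0.05)   # share of errors below 5 percent
          print(f"{name}: RMSE={rmse:.3f}  MAE={mae:.3f}  eb5={eb5:.1%}")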
  11. By: Haataja, Meeri; Bryson, Joanna J. (Hertie School)
    Abstract: This short analysis aims to provide an overview of the anticipated costs of the EU’s proposed AI regulation, the AI Act (AIA), for impacted organisations: both providers and deployers of systems containing AI. We focus our analysis at the enterprise level, leaving the macroeconomic discussion for later. While the bulk of the paper explains and critiques the European Commission’s (EC) own analysis, we also comment on the critiques raised recently by a high-profile US lobbyist, the Center for Data Innovation, in their report “How Much Will the Artificial Intelligence Act Cost Europe?” We conclude by highlighting topics that would benefit from further elaboration by the EC. As a reminder, the AIA is presently draft legislation, written by the European Commission. While something quite similar can be expected to be implemented ultimately by the European Union’s member states, the legislation is presently in a period of revision by the elected members of the European Parliament, in cooperation and consultation with EU national governments. While the heart of the EU’s regulatory proposal lies in safeguarding people against AI risks to health, safety and fundamental rights, we acknowledge the importance of rooting policies in a sound analysis of financial impacts. Only in that way do requirements get translated into solid action plans and, finally, into actions. The process of such a pragmatic analysis can also surface assumptions and failures of coherence that might otherwise be overlooked. We also have, separately, a longer commentary on the act itself; see “Reflections on the EU’s AI Act and how we could make it even better.” Our analysis of the AIA’s costs is based on two main sources: the EC’s Impact Assessment of the AIA, and the EC’s study to support an impact assessment of regulatory requirements for artificial intelligence in Europe. It is noteworthy that, while the EC uses the support study as its main source for the financial impact assessment, in some contexts it specifically chooses to interpret the figures differently, e.g. by excluding some categories of costs from the impact assessment. For this reason, it is critical to treat the EC’s impact assessment and the support study as two separate sources. This was one of a number of things apparently missed by the Center for Data Innovation in their report.
    Date: 2021–08–27
    URL: http://d.repec.org/n?u=RePEc:osf:socarx:8nzb4&r=

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.