nep-big New Economics Papers
on Big Data
Issue of 2023‒03‒13
Nineteen papers chosen by
Tom Coupé
University of Canterbury

  1. AI scoring for international large-scale assessments using a deep learning model and multilingual data By Tomoya Okubo; Wayne Houlden; Paul Montuoro; Nate Reinertsen; Chi Sum Tse; Tanja Bastianianic
  2. A Machine Learning Approach to Measuring Climate Adaptation By Max Vilgalys
  3. From Human Business to Machine Learning – Methods for Automating Real Estate Appraisals and their Practical Implications By Moritz Stang; Bastian Krämer; Cathrine Nagl; Wolfgang Schäfers
  4. Updating of Input-Output tables in Russia by machine learning methods By Vladimir Potashnikov
  5. Accounting for Spatial Autocorrelation in Algorithm-Driven Hedonic Models: A Spatial Cross-Validation Approach By Juergen Deppner; Marcelo Cajias
  6. A blueprint for building national compute capacity for artificial intelligence By OECD
  7. Policy Learning with Rare Outcomes By Julia Hatamyar; Noemi Kreif
  8. Economics of ChatGPT: A Labor Market View on the Occupational Impact of Artificial Intelligence By Zarifhonarvar, Ali
  9. Automated Assessment of Housing Quality with the Use of Wordscores Algorithm By Michal Hebdzynski
  10. Proposal for a Forecasting Methodology to Predict Commercial Real Estate Values in Istanbul Using Social Big Data By Maral Taclar; Kerem Yavuz Arslanli
  11. Assessing the impact of regulations and standards on innovation in the field of AI By Alessio Tartaro; Adam Leon Smith; Patricia Shaw
  12. Characterizing Financial Market Coverage using Artificial Intelligence By Jean Marie Tshimula; D'Jeff K. Nkashama; Patrick Owusu; Marc Frappier; Pierre-Martin Tardif; Froduald Kabanza; Armelle Brun; Jean-Marc Patenaude; Shengrui Wang; Belkacem Chikhaoui
  13. Minimax Instrumental Variable Regression and $L_2$ Convergence Guarantees without Identification or Closedness By Andrew Bennett; Nathan Kallus; Xiaojie Mao; Whitney Newey; Vasilis Syrgkanis; Masatoshi Uehara
  14. The Rule of Law in the ESG Framework in the World Economy By LEOGRANDE, ANGELO
  15. Machine Learning Applications to Valuation of Options on Non-liquid Markets By Jiří Witzany; Milan Fičura
  16. Data Management and the Nigerian Built Environment in the Fourth Industrial Revolution: Challenges and Prospects By David Akinwamide; Tunbosun Oyedokun; Jonas Hahn
  17. Order book regulatory impact on stock market quality: a multi-agent reinforcement learning perspective By Johann Lussange; Boris Gutkin
  18. Towards Evology: a Market Ecology Agent-Based Model of US Equity Mutual Funds II By Aymeric Vie; J. Doyne Farmer
  19. A Reinforcement Learning Algorithm for Trading Commodities By Federico Giorgi; Stefano Herzel; Paolo Pigato

  1. By: Tomoya Okubo; Wayne Houlden; Paul Montuoro; Nate Reinertsen; Chi Sum Tse; Tanja Bastianianic
    Abstract: Artificial Intelligence (AI) scoring for constructed-response items is examined using international large-scale assessment data, drawing on recent advancements in multilingual deep learning techniques that utilise models pre-trained on a massive multilingual text corpus. Historical student responses to Reading and Science literacy cognitive items developed under the PISA analytical framework are used, together with multilingual data, as training data to construct an AI model. The trained AI models are then used to score responses, and the results are compared with human-scored data. The score distributions estimated from the AI-scored data and the human-scored data are highly consistent with each other; furthermore, the item-level psychometric properties of the majority of items showed high levels of agreement, although a few items showed discrepancies. This study demonstrates a practical procedure for using a multilingual data approach, and this new AI-scoring methodology reached a practical level of quality, even in the context of an international large-scale assessment.
    Date: 2023–02–21
  2. By: Max Vilgalys
    Abstract: I measure adaptation to climate change by comparing elasticities from short-run and long-run changes in damaging weather. I propose a debiased machine learning approach to flexibly measure these elasticities in panel settings. In a simulation exercise, I show that debiased machine learning has considerable benefits relative to standard machine learning or ordinary least squares, particularly in high-dimensional settings. I then measure adaptation to damaging heat exposure in United States corn and soy production. Using rich sets of temperature and precipitation variation, I find evidence that short-run impacts from damaging heat are significantly offset in the long run. I show that this is because the impacts of long-run changes in heat exposure do not follow the same functional form as short-run shocks to heat exposure.
    Date: 2023–02
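    The cross-fitted "partialling-out" idea behind the debiased machine learning approach described above can be sketched as follows. This is a minimal illustration, not the author's implementation: an ordinary linear model stands in for the ML nuisance learners (in practice one would plug in forests or boosting), and the data are synthetic.

```python
import numpy as np

def dml_partial_out(y, d, X, n_folds=2, seed=None):
    """Cross-fitted partialling-out estimate of the effect of d on y.

    Nuisance models for y|X and d|X are fit on held-in folds and used
    to residualize the held-out fold; a linear model stands in for the
    ML learners used in the debiased-ML literature."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res, d_res = np.empty(n), np.empty(n)
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # fit each nuisance on the training folds, residualize the test fold
        for target, res in ((y, y_res), (d, d_res)):
            beta, *_ = np.linalg.lstsq(X[train], target[train], rcond=None)
            res[test] = target[test] - X[test] @ beta
    # final stage: regress outcome residuals on treatment residuals
    return float(d_res @ y_res / (d_res @ d_res))

# synthetic check: the true effect of d on y is 2.0
gen = np.random.default_rng(0)
X = np.column_stack([np.ones(500), gen.normal(size=(500, 3))])
d = X @ np.array([0.5, 1.0, -1.0, 0.3]) + gen.normal(size=500)
y = 2.0 * d + X @ np.array([1.0, 0.5, 0.5, -0.2]) + gen.normal(size=500)
theta = dml_partial_out(y, d, X, seed=1)
```

    Cross-fitting (estimating nuisances on one fold, residualizing on another) is what removes the own-observation overfitting bias that plain ML plug-ins suffer from in high dimensions.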
  3. By: Moritz Stang; Bastian Krämer; Cathrine Nagl; Wolfgang Schäfers
    Abstract: The ongoing digitalization is picking up speed in slowly changing industries such as real estate. Especially in the field of real estate valuation, which is strongly dependent on data quality and quantity, automation is able to change the appraisal process substantially. However, in most countries, only the use of Automated Valuation Models (AVMs) based on simple non-statistical methods is allowed, as the regulatory system does not yet give the green light to higher-order methods. This study provides a relevant contribution to the debate on why AVMs based on statistical and machine learning methods should be widely used in practice. Therefore, various methods for AVMs are implemented and applied to a dataset of 1.2 million observations across Germany. An automation of the traditional sales comparison method, two hedonic price functions, as well as a machine learning approach are compared with each other. The aim of this paper is to show how the methods perform in direct comparison with each other as well as in different structural regions of Germany and whether the use of modern learning-based algorithms in real estate valuation is beneficial. The results of this research have various implications regarding the different accuracy and transparency levels of the methods from a regulatory and practical perspective. Moreover, the comparison at the spatial level shows that the models perform differently in urban and rural areas. This allows conclusions to be drawn about the design of AVMs for cross-regional models.
    Keywords: Automated Valuation Models; eXtreme Gradient Boosting; housing market; Machine Learning
    JEL: R3
    Date: 2022–01–01
  4. By: Vladimir Potashnikov (The Russian Presidential Academy Of National Economy And Public Administration)
    Abstract: Relevance: Input-output tables are the basis for many types of analysis of the real sector, which are necessary to build well-thought-out long-term and short-term policy. Estimating input-output tables is an expensive and time-consuming procedure. At the same time, national statistical agencies publish additional forecast information, which makes it possible to update the input-output tables, for example, output and intermediate consumption by sector. The main methods for updating input-output tables, RAS (or its modification, GRAS) and Cross Entropy, use data on intermediate demand, the calculation of which requires additional time-consuming work. The main disadvantage of these methods is that they use information only from the previous and current periods. In recent decades, machine learning methods have been gaining popularity; their main advantage is finding relationships that can be hard to identify otherwise, for example, due to the large dimension of the task or the lack of evidence of cause-and-effect relationships. These methods have proven themselves well in all kinds of image recognition tasks, voice-to-text conversion, and so on. Currently, attempts are being made to apply machine learning methods to economic problems, and applying them to the task of updating input-output tables is scientifically novel. The purpose of the study is to update the input-output tables by machine learning methods. The result of the work is a method for updating input-output tables using convolutional neural networks, as well as a forecast of the coefficients of the direct cost matrix for Russia. Conclusion: the use of machine learning methods can improve the quality of forecasts of input-output tables. Recommendations: it is necessary to continue research in this direction.
    Keywords: Input-Output tables, machine learning, CNN
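    The RAS benchmark mentioned in the abstract is a simple biproportional scaling procedure; it can be sketched in a few lines. The matrix and margin values below are toy numbers for illustration only.

```python
def ras_update(A, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Biproportional (RAS) updating of a positive matrix to new
    row and column totals: alternately rescale rows and columns until
    both sets of margins match. Row and column targets must share the
    same grand total."""
    n, m = len(A), len(A[0])
    X = [row[:] for row in A]
    for _ in range(max_iter):
        # scale each row to its target sum
        for i in range(n):
            r = row_targets[i] / sum(X[i])
            X[i] = [x * r for x in X[i]]
        # scale each column to its target sum, tracking the discrepancy
        err = 0.0
        for j in range(m):
            s = sum(X[i][j] for i in range(n))
            err = max(err, abs(s - col_targets[j]))
            c = col_targets[j] / s
            for i in range(n):
                X[i][j] *= c
        if err < tol:
            break
    return X

# toy intermediate-flows matrix updated to new margins (grand total 30)
A = [[10.0, 5.0], [3.0, 7.0]]
X = ras_update(A, row_targets=[18.0, 12.0], col_targets=[14.0, 16.0])
```

    The paper's point is that RAS-type methods only see the previous-period table and the new margins, whereas a learned model can exploit patterns across many historical tables.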
  5. By: Juergen Deppner; Marcelo Cajias
    Abstract: Data-driven machine learning algorithms have initiated a paradigm shift in hedonic house price and rent modeling through their ability to capture highly complex and non-monotonic relationships. Their superior accuracy compared to parametric model alternatives has been demonstrated repeatedly in the literature. However, the statistical independence of the data implicitly assumed by resampling-based error estimates is unlikely to hold in a real estate context as price-formation processes in property markets are inherently spatial, which leads to spatial dependence structures in the data. When performing conventional cross-validation techniques for model selection and model assessment, spatial dependence between training and test data may lead to undetected overfitting and over-optimistic perception of predictive power. This study sheds light on the bias in cross-validation errors of tree-based algorithms induced by spatial autocorrelation and proposes a bias-reduced spatial cross-validation strategy. The findings confirm that error estimates from non-spatial resampling methods are overly optimistic, whereas spatially conscious techniques are more dependable and can increase generalizability. As accurate and unbiased error estimates are crucial to automated valuation methods, our results prove helpful for applications including, but not limited to, mass appraisal, credit risk management, portfolio allocation and investment decision making.
    Keywords: Hedonic modeling; Machine Learning; Spatial Autocorrelation; spatial cross-validation
    JEL: R3
    Date: 2022–01–01
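    The core of a spatial cross-validation strategy like the one described above is that resampling folds are built from spatial blocks rather than random draws, so that nearby (spatially correlated) observations never straddle the train/test split. A minimal sketch, with illustrative grid-cell blocking and toy coordinates:

```python
def spatial_folds(coords, cell_size):
    """Group observations into spatial blocks (grid cells of their
    coordinates) for leave-block-out cross-validation. Each cell
    becomes one held-out fold, keeping spatial neighbours together."""
    blocks = {}
    for idx, (x, y) in enumerate(coords):
        key = (int(x // cell_size), int(y // cell_size))
        blocks.setdefault(key, []).append(idx)
    return list(blocks.values())

# toy coordinates: two tight spatial clusters of listings
coords = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4),   # cluster near the origin
          (5.1, 5.3), (5.4, 5.2), (5.2, 5.5)]   # cluster near (5, 5)
folds = spatial_folds(coords, cell_size=1.0)
```

    With random K-fold, points from the same cluster would appear in both training and test sets, and the spatially autocorrelated signal would leak across the split, producing the over-optimistic error estimates the paper warns about.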
  6. By: OECD
    Abstract: Artificial intelligence (AI) is transforming economies and promising new opportunities for productivity, growth, and resilience. Countries are responding with national AI strategies to capitalise on these transformations. However, no country today has data on, or a targeted plan for, national AI compute capacity. This policy blind-spot may jeopardise domestic economic goals. This report provides the first blueprint for policy makers to help assess and plan for the national AI compute capacity needed to enable productivity gains and capture AI’s full economic potential. It provides guidance for policy makers on how to develop a national AI compute plan along three dimensions: capacity (availability and use), effectiveness (people, policy, innovation, access), and resilience (security, sovereignty, sustainability). The report also defines AI compute, takes stock of indicators, datasets, and proxies for measuring national AI compute capacity, and identifies obstacles to measuring and benchmarking national AI compute capacity across countries.
    Date: 2023–02–28
  7. By: Julia Hatamyar; Noemi Kreif
    Abstract: Machine learning (ML) estimates of conditional average treatment effects (CATEs) can be used to inform policy allocation rules, such as treating those with a beneficial estimated CATE ("plug-in policy"), or searching for a decision tree that optimises overall outcomes. Little is known about the practical performance of these algorithms in usual settings of policy evaluations. We contrast the performance of various policy learning algorithms, using synthetic data with varying outcome prevalence (rare vs. not rare), positivity violations, extent of treatment effect heterogeneity and sample size. For each algorithm, we evaluate the performance of the estimated treatment allocation by assessing how far the benefit from a resulting policy is from the best possible ("oracle") policy. We find that the plug-in policy type outperforms tree-based policies, regardless of ML method used. Specifically, the tree-based policy class may lead to overly-optimistic estimated benefits of a learned policy; i.e., the estimated advantages of tree-based policies may be much larger than the true possible maximum advantage. Within either policy class, Causal Forests and the Normalised-Double-Robust Learner performed best, while Bayesian Additive Regression Trees performed worst. Additionally, we find evidence that with small sample sizes or in settings where the ratio of covariates to samples is high, learning policy trees using CATEs has a better performance than using doubly-robust scores. The methods are applied to a case study that investigates infant mortality through improved targeting of subsidised health insurance in Indonesia.
    Date: 2023–02
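    The two objects compared in the abstract, a plug-in allocation rule and its regret against the oracle policy, can be sketched directly. The CATE values below are illustrative toy numbers, not the paper's estimates.

```python
def plug_in_policy(cate_hat):
    """Plug-in allocation rule: treat exactly those units whose
    estimated conditional average treatment effect (CATE) is positive."""
    return [1 if tau > 0 else 0 for tau in cate_hat]

def policy_regret(policy, true_cate):
    """Regret versus the oracle policy, which treats iff the true CATE
    is positive: the benefit gap between oracle and learned policy."""
    value = sum(t for p, t in zip(policy, true_cate) if p == 1)
    oracle = sum(t for t in true_cate if t > 0)
    return oracle - value

# illustrative estimated vs. true effects for five units
cate_hat  = [0.3, -0.1, 0.05, -0.4, 0.2]
true_cate = [0.25, -0.2, -0.1, -0.3, 0.15]
policy = plug_in_policy(cate_hat)
regret = policy_regret(policy, true_cate)
```

    Here unit 3 is treated because its estimated CATE is (wrongly) positive, so the learned policy falls short of the oracle; the paper's simulations measure exactly this kind of gap across algorithms and data regimes.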
  8. By: Zarifhonarvar, Ali
    Abstract: This study examines how ChatGPT affects the labor market. I first thoroughly analyzed the prior research that has been done on the subject in order to start understanding how ChatGPT and other AI-related services are influencing the labor market. Using the supply and demand model, I then assess ChatGPT's impact. This paper examines this innovation's short- and long-term effects on the labor market, concentrating on its challenges and opportunities. Furthermore, I employ a text-mining approach to extract various tasks from the International Standard Occupation Classification to present a comprehensive list of occupations most sensitive to ChatGPT.
    Keywords: Large Language Models, Artificial Intelligence, Automation, Labor Saving Technology, ChatGPT, Labor Market, Generative AI, Occupational Classification
    JEL: O33 E24 J21 J24
    Date: 2023
  9. By: Michal Hebdzynski
    Abstract: The aim of this paper is to address the unavailability or inaccuracy of information on housing quality in existing data sources, which may bias hedonic analyses of the market conducted for macroprudential and statistical purposes. We target this problem by proposing a supervised machine learning framework based on the Wordscores algorithm. We try to answer the question of whether it is possible to reliably and automatically assess the quality of an apartment based solely on the textual description of its listing posted on an internet advertisement site. The accuracy of the method has been tested on Polish-language apartment sales and rental listings from 2019-2021. The obtained point estimates of the quality level show a high correlation with human assessments. The results indicate that the Wordscores algorithm achieves 71% effectiveness in categorizing apartments for rent into three quality groups: low, medium and high. For apartments for sale, the effectiveness equals 64%. The study indicates that textual descriptions of apartment listings convey usable, yet most often unused, information on housing quality. Using the method's output may increase the accuracy of market analyses and thus lead to a better understanding of the market. The relative ease of applying the algorithm and its high interpretability make the proposed method advantageous over the already developed, more econometrically sophisticated approaches.
    Keywords: Hedonic methods; Housing quality; supervised machine-learning; Textual Analysis
    JEL: R3
    Date: 2022–01–01
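    The Wordscores algorithm is simple enough to sketch: reference texts with known positions (here, quality levels) induce a score for each word, and an unseen ("virgin") text is scored by averaging the scores of its words. This is a stripped-down illustration of the Laver/Benoit-style procedure with toy listings, not the paper's pipeline (which also handles Polish text and score rescaling).

```python
from collections import Counter

def wordscores(reference_texts, virgin_text):
    """Score an unseen text from reference texts with known positions.
    reference_texts maps position (e.g. quality level) -> text."""
    # relative word frequencies within each reference text
    freqs = {s: Counter(t.split()) for s, t in reference_texts.items()}
    rel = {s: {w: c / sum(f.values()) for w, c in f.items()}
           for s, f in freqs.items()}
    vocab = {w for f in freqs.values() for w in f}
    # a word's score is the frequency-weighted mean of reference positions
    wscore = {}
    for w in vocab:
        probs = {s: rel[s].get(w, 0.0) for s in rel}
        total = sum(probs.values())
        wscore[w] = sum(s * p for s, p in probs.items()) / total
    # virgin text score: mean score over its words known from the references
    words = [w for w in virgin_text.split() if w in wscore]
    return sum(wscore[w] for w in words) / len(words)

# toy references: quality level 1 (low) and 3 (high)
refs = {1: "old damaged dark old basement",
        3: "renovated bright modern renovated terrace"}
score = wordscores(refs, "renovated bright old")
```

    Because every word's contribution is an explicit, inspectable score, the method retains the high interpretability the abstract emphasises.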
  10. By: Maral Taclar; Kerem Yavuz Arslanli
    Abstract: This paper provides a new forecasting methodology for commercial real estate in Istanbul using social big data. Big data has gained popularity as a tool for the growth of real estate research in recent years. Location-based social networks (LBSNs), in particular, provide excellent potential to demonstrate the characteristics of metropolitan cities and the human activities within them. While there is relatively limited research on the relationship between social big data and real estate values, most of the existing research focuses on residential properties. This paper aims to discover the potential of social media data to forecast future rent and price levels of retail properties in Istanbul. Two LBSN platforms, Instagram and Twitter, are chosen as the social media data sources. For the timeframe June 2019 - May 2021, 16 million geo-tagged Instagram posts and 230 thousand geo-referenced tweets from a total of 174 thousand venues were collected by the authors. The data set is clustered by the relevant districts of Istanbul and the spatial distribution of social media content is observed. Finally, the data sets are combined temporally with the commercial real estate data for the districts. Multivariate time-series analyses are conducted to obtain the optimum prediction model and interval. This method increases the accuracy of rent and/or price predictions by selecting the best exogenous variables and forecasting models for each district, where applicable. This paper demonstrates the significance and the leveraging potential of incorporating human activities into the decision-making processes of the commercial real estate sector.
    Keywords: commercial real estate; Large Data Sets Modelling; Multivariate Time Series Analysis; Urban Spatial Analysis
    JEL: R3
    Date: 2022–01–01
  11. By: Alessio Tartaro; Adam Leon Smith; Patricia Shaw
    Abstract: Regulations and standards in the field of artificial intelligence (AI) are necessary to minimise risks and maximise benefits, yet some argue that they stifle innovation. This paper critically examines the idea that regulation stifles innovation in the field of AI. Current trends in AI regulation, particularly the proposed European AI Act and the standards supporting its implementation, are discussed. Arguments in support of the idea that regulation stifles innovation are analysed and criticised, and an alternative point of view is offered, showing how regulation and standards can foster innovation in the field of AI.
    Date: 2023–02
  12. By: Jean Marie Tshimula; D'Jeff K. Nkashama; Patrick Owusu; Marc Frappier; Pierre-Martin Tardif; Froduald Kabanza; Armelle Brun; Jean-Marc Patenaude; Shengrui Wang; Belkacem Chikhaoui
    Abstract: This paper scrutinizes a database of over 4900 YouTube videos to characterize financial market coverage. Financial market coverage generates a large number of videos. Therefore, watching these videos to derive actionable insights could be challenging and complex. In this paper, we leverage Whisper, a speech-to-text model from OpenAI, to generate a text corpus of market coverage videos from Bloomberg and Yahoo Finance. We employ natural language processing to extract insights regarding language use from the market coverage. Moreover, we examine the prominent presence of trending topics and their evolution over time, and the impacts that some individuals and organizations have on the financial market. Our characterization highlights the dynamics of the financial market coverage and provides valuable insights reflecting broad discussions regarding recent financial events and the world economy.
    Date: 2023–02
  13. By: Andrew Bennett; Nathan Kallus; Xiaojie Mao; Whitney Newey; Vasilis Syrgkanis; Masatoshi Uehara
    Abstract: In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. Recently, many flexible machine learning methods have been developed for instrumental variable estimation. However, these methods have at least one of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) only obtaining estimation error rates in terms of pseudometrics (\emph{e.g., } projected norm) rather than valid metrics (\emph{e.g., } $L_2$ norm); or (3) imposing the so-called closedness condition that requires a certain conditional expectation operator to be sufficiently smooth. In this paper, we present the first method and analysis that can avoid all three limitations, while still permitting general function approximation. Specifically, we propose a new penalized minimax estimator that can converge to a fixed IV solution even when there are multiple solutions, and we derive a strong $L_2$ error rate for our estimator under lax conditions. Notably, this guarantee only needs a widely-used source condition and realizability assumptions, but not the so-called closedness condition. We argue that the source condition and the closedness condition are inherently conflicting, so relaxing the latter significantly improves upon the existing literature that requires both conditions. Our estimator can achieve this improvement because it builds on a novel formulation of the IV estimation problem as a constrained optimization problem.
    Date: 2023–02
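    In the notation common to this literature (a reconstruction for orientation, not the authors' exact display), the IV regression solves a conditional moment restriction, and the penalized minimax estimator pits a model class against a class of test functions:

```latex
% The IV regression h_0 solves the conditional moment restriction
%   E[ Y - h_0(X) | Z ] = 0 .
% A penalized minimax estimator over a model class H and a test-function
% class F takes the generic form
\hat{h} \in \arg\min_{h \in \mathcal{H}} \; \sup_{f \in \mathcal{F}}
  \frac{1}{n} \sum_{i=1}^{n} f(Z_i)\,\bigl(Y_i - h(X_i)\bigr)
  \;-\; \mu \,\|f\|_{\mathcal{F}}^{2}
  \;+\; \lambda \,\|h\|_{\mathcal{H}}^{2}
```

    The inner supremum searches for a function of the instrument that detects violations of the moment condition; the paper's contribution is showing that a penalization of this form yields $L_2$ convergence to a fixed solution even without unique identification or the closedness condition.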
  14. By: LEOGRANDE, ANGELO
    Abstract: In this article, I estimate the Rule of Law for 193 countries using data from the Environment, Social and Governance (ESG) database of the World Bank. I use different econometric techniques to estimate the value of “Rule of Law”, i.e.: Panel Data with Fixed Effects, Panel Data with Random Effects, and Pooled OLS. I find that Rule of Law is positively associated with, among others, “Regulatory Quality” and “Control of Corruption”, and negatively associated with, among others, “Access to Electricity” and “Prevalence of Overweight”. I perform a cluster analysis with the k-means algorithm optimized with the Elbow Method and find the presence of four clusters. Finally, I present a comparison of eight different machine-learning algorithms for predicting the level of Rule of Law, and find that Linear Regression is the best predictor according to MAE, MSE, RMSE and R-squared.
    Keywords: Analysis of Collective Decision-Making, General, Political Processes: Rent-Seeking, Lobbying, Elections, Legislatures, and Voting Behaviour, Bureaucracy, Administrative Processes in Public Organizations, Corruption, Positive Analysis of Policy Formulation, and Implementation.
    JEL: D7 D70 D71 D72 D73 D78
    Date: 2023–02–11
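    The elbow-method step in the abstract can be sketched with a tiny 1-D k-means. This is an illustration on toy data, not the paper's multivariate setup, and it uses one simple elbow variant (largest relative drop in within-cluster sum of squares) among several in use.

```python
def kmeans_1d(xs, k, iters=50):
    """Tiny 1-D k-means (Lloyd's algorithm) returning the within-cluster
    sum of squares (WCSS); centroids start on evenly spaced data points."""
    xs = sorted(xs)
    cents = [xs[int(i * (len(xs) - 1) / max(k - 1, 1))] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: (x - cents[c]) ** 2)
            groups[j].append(x)
        cents = [sum(g) / len(g) if g else cents[j]
                 for j, g in enumerate(groups)]
    return sum((x - cents[j]) ** 2 for j, g in enumerate(groups) for x in g)

def elbow_k(xs, k_max=6):
    """Elbow heuristic: pick the k with the largest relative drop in WCSS."""
    wcss = [kmeans_1d(xs, k) for k in range(1, k_max + 1)]
    ratios = [wcss[i - 1] / wcss[i] for i in range(1, len(wcss))]
    return 2 + ratios.index(max(ratios))  # ratios[0] compares k=1 vs k=2

# four well-separated 1-D clusters
data = [0.0, 0.2, 0.1, 5.0, 5.2, 5.1,
        10.0, 10.2, 10.1, 15.0, 15.2, 15.1]
best_k = elbow_k(data)
```

    Once the WCSS curve stops dropping sharply, adding further clusters buys little, which is the bend the elbow method looks for.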
  15. By: Jiří Witzany; Milan Fičura
    Abstract: Recently, there has been considerable interest in machine learning (ML) applications to the valuation of options. The main motivation is the speed of calibration or, for example, the calculation of credit valuation adjustments (CVA). It is usually assumed that there is a relatively liquid market with plain vanilla option quotations that can be used to calibrate (using an ML model) the volatility surface, or to estimate the parameters of an advanced stochastic model. In the second stage, the calibrated volatility surface (or the model parameters) is used to value given exotic options, again using a trained NN (or another ML model). The NNs are typically trained "off-line" by sampling many combinations of model and market parameters and calculating the options' market values. In our research, we focus on the quite common situation of a non-liquid option market where we lack sufficiently many plain vanilla option quotations to calibrate the volatility surface, but still need to value an exotic option, or just a plain vanilla option subject to a more advanced stochastic model, as is typical on energy and carbon derivative markets. We show that it is possible to use selected moments of the underlying historical price return series, complemented with a volatility risk premium estimate, to value such options using the ML approach.
    Keywords: derivatives valuation, options, calibration, neural networks
    JEL: C45 C63 G13
    Date: 2023–01–24
  16. By: David Akinwamide; Tunbosun Oyedokun; Jonas Hahn
    Abstract: The fourth industrial revolution has transformed how data and information are shared and managed in the built environment. The adoption of various technologies (such as BIM, drone technologies, VR, Big Data, machine learning, etc.) for effective data management in the built environment has created many challenges for professionals seeking to upgrade their practice. This study therefore explores the prospects and challenges of data management in the Nigerian built environment in the fourth industrial revolution. A qualitative technique (oral interviews), together with a review of relevant existing literature on the subject, was adopted for this study. Challenges outlined by professionals include the effects of cloud computing, big data and mobile devices, as well as financial and ethical concerns. Possible solutions for the effective adoption of data management in the fourth industrial revolution are discussed. The study recommends that professionals upgrade their knowledge and competencies in step with changes in the built environment through training, education opportunities and continuing development programmes.
    Keywords: Built Environment; Data Management; Fourth Industrial Revolution; Professionals
    JEL: R3
    Date: 2022–01–01
  17. By: Johann Lussange; Boris Gutkin
    Abstract: Recent technological developments have changed the fundamental ways stock markets function, prompting regulators to assess the benefits of these developments. In parallel, the ongoing machine learning revolution and its multiple applications to trading can now be used to design a next generation of financial models, and thereby explore the systemic complexity of financial stock markets in new ways. We build on previous groundwork, where we designed and calibrated a novel agent-based stock market simulator in which each agent autonomously learns to trade by reinforcement learning. In this paper, we study the predictions of this model from a regulator's perspective. In particular, we focus on how market quality is impacted by smaller order book tick sizes, increasingly larger metaorders, and higher trading frequencies, respectively. Under our model assumptions, we find that market quality benefits from the latter, but not from the other two trends.
    Date: 2023–02
  18. By: Aymeric Vie; J. Doyne Farmer
    Abstract: Agent-based models (ABMs) are well suited to modelling heterogeneous, interacting systems such as financial markets. We present the latest advances in Evology: a heterogeneous, empirically calibrated market ecology agent-based model of the US stock market. Prices emerge endogenously from the interactions of market participants with diverse investment behaviours and their reactions to fundamentals. This approach allows a trading strategy to be tested while accounting for its interactions with other market participants and conditions. These early results encourage a closer association between ABMs and machine learning algorithms for testing and optimising investment strategies.
    Date: 2023–02
  19. By: Federico Giorgi (Università di Roma ‘Tor Vergata’); Stefano Herzel (Università di Roma ‘Tor Vergata’); Paolo Pigato (Università di Roma ‘Tor Vergata’)
    Abstract: We propose a Reinforcement Learning (RL) algorithm for generating a trading strategy in a realistic setting, that includes transaction costs and factors driving the asset dynamics. We benchmark our algorithm against the analytical optimal solution, available when factors are linear and transaction costs are quadratic, showing that RL is able to mimic the optimal strategy. Then we consider a more realistic setting, including non-linear dynamics, that better describes the WTI spot prices time series. For these more general dynamics, an optimal strategy is not known and RL becomes a viable alternative. We show that on synthetic data generated from WTI spot prices, the RL agent outperforms a trader that linearizes the model to apply the theoretical optimal strategy.
    Keywords: Portfolio Optimization, Reinforcement Learning, SARSA, Commodities, Threshold Models.
    Date: 2023–02–18
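    The on-policy SARSA update at the heart of such an RL trading algorithm is Q(s, a) += alpha * (r + gamma * Q(s', a') - Q(s, a)). A minimal tabular sketch on a toy "hold vs. trade" chain (the environment, rewards and hyperparameters are illustrative, far simpler than the paper's factor-driven commodity dynamics):

```python
import random

def sarsa_chain(n_states=4, episodes=500, alpha=0.1, gamma=0.9,
                eps=0.1, seed=0):
    """Tabular SARSA on a toy chain: action 1 advances one state,
    action 0 stays; reaching the last state pays +1 and ends the
    episode. Uses an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]

    def choose(s):
        if rng.random() < eps:
            return rng.randrange(2)        # explore
        return 0 if Q[s][0] > Q[s][1] else 1

    for _ in range(episodes):
        s, a = 0, choose(0)
        for _ in range(50):                # cap episode length
            s2 = min(s + a, n_states - 1)
            r = 1.0 if (a == 1 and s2 == n_states - 1) else 0.0
            if s2 == n_states - 1:         # terminal state reached
                Q[s][a] += alpha * (r - Q[s][a])
                break
            a2 = choose(s2)                # on-policy: next action a'
            Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])
            s, a = s2, a2
    return Q

Q = sarsa_chain()
greedy = [0 if q0 > q1 else 1 for q0, q1 in Q[:-1]]
```

    Because SARSA bootstraps from the action the behaviour policy actually takes next, it learns the value of the policy it follows, which is the on-policy property the abstract's benchmark against the analytical optimum relies on.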

This nep-big issue is ©2023 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.