nep-big New Economics Papers
on Big Data
Issue of 2021‒05‒31
twenty papers chosen by
Tom Coupé
University of Canterbury

  1. Applying Artificial Intelligence on Satellite Imagery to Compile Granular Poverty Statistics By Hofer, Martin; Sako, Tomas; Martinez, Jr., Arturo; Addawe, Mildred; Durante, Ron Lester
  2. Predicting Poverty Using Geospatial Data in Thailand By Puttanapong, Nattapong; Martinez, Jr., Arturo; Addawe, Mildred; Bulan, Joseph; Durante, Ron Lester; Martillan, Marymell
  3. Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units By Zihao Zhang; Stefan Zohren
  4. Financial Time Series Analysis and Forecasting with HHT Feature Generation and Machine Learning By Tim Leung; Theodore Zhao
  5. Can we imitate stock price behavior to reinforcement learn option price? By Xin Jin
  6. Preaching to Social Media: Turkey’s Friday Khutbas and Their Effects on Twitter By Ozan Aksoy
  7. Contracting, pricing, and data collection under the AI flywheel effect By Huseyin Gurkan; Francis de Véricourt
  8. Towards Artificial Intelligence Enabled Financial Crime Detection By Zeinab Rouhollahi
  9. What do we know from the vast literature on efficiency and productivity in healthcare? A Systematic Review and Bibliometric Analysis By Kok Fong See; Shawna Grosskopf; Vivian Valdmanis; Valentin Zelenyuk
  10. Deep Kernel Gaussian Process Based Financial Market Predictions By Yong Shi; Wei Dai; Wen Long; Bo Li
  11. Central Bank Communication: One Size Does Not Fit All By Joan Huang; John Simon
  12. Financial Intermediation and Technology: What's Old, What's New? By Boot, Arnoud W A; Hoffmann, Peter; Laeven, Luc; Ratnovski, Lev
  13. Using Machine Learning to Create an Early Warning System for Welfare Recipients By Sansone, Dario; Zhu, Anna
  14. Belief Distortions and Macroeconomic Fluctuations By Bianchi, Francesco; Ludvigson, Sydney C.; Ma, Sai
  15. Predicting Nature of Default using Machine Learning Techniques By Longden, Elaine
  16. Research on Regional Urban Economic Development by Nightlight-time Remote Sensing By Jiongyan Zhang
  17. Option Valuation through Deep Learning of Transition Probability Density By Haozhe Su; M. V. Tretyakov; David P. Newton
  18. Real-Time Inequality and the Welfare State in Motion: Evidence from COVID-19 in Spain By Aspachs, Oriol; Durante, Ruben; Graziano, Alberto; Mestres, Josep; Montalvo, Jose G; Reynal-Querol, Marta
  19. Extraction of measurements from medical reports By El Youssefi Ahmed; Abdelahad Chraibi; Julien Taillard; Ahlame Begdouri
  20. Neural Options Pricing By Timothy DeLise

  1. By: Hofer, Martin (Vienna University of Economics and Business); Sako, Tomas (Freelance data scientist); Martinez, Jr., Arturo (Asian Development Bank); Addawe, Mildred (Asian Development Bank); Durante, Ron Lester (Asian Development Bank)
    Abstract: The spatial granularity of poverty statistics can have a significant impact on the efficiency of targeting resources meant to improve the living conditions of the poor. However, achieving granularity typically requires increasing the sample sizes of surveys on household income and expenditure or living standards, an option that is not always practical for the government agencies that conduct these surveys. Previous studies that examined the use of innovative (geospatial) data sources, such as high-resolution satellite imagery, suggest that such methods may offer an alternative approach to producing granular poverty maps. This study outlines a computational framework to enhance the spatial granularity of government-published poverty estimates, applying a deep-layer computer vision technique to publicly available medium-resolution satellite imagery, household surveys, and census data from the Philippines and Thailand. In doing so, the study explores a potentially more cost-effective method for poverty estimation. The results suggest that even with publicly accessible satellite imagery, whose resolution is not as fine as that of commercially sourced images, predictions generally aligned with the distributional structure of government-published poverty estimates after calibration. The study further contributes to the existing literature by examining the robustness of the resulting estimates to user-specified algorithmic parameters and model specifications.
    Keywords: big data; computer vision; data for development; machine learning algorithm; official statistics; poverty; SDG
    JEL: C19 D31 I32 O15
    Date: 2020–12–29
  2. By: Puttanapong, Nattapong (Thammasat University); Martinez, Jr., Arturo (Asian Development Bank); Addawe, Mildred (Asian Development Bank); Bulan, Joseph (Asian Development Bank); Durante, Ron Lester (Asian Development Bank); Martillan, Marymell (Asian Development Bank)
    Abstract: Poverty statistics are conventionally compiled using data from household income and expenditure surveys or living standards surveys. This study examines an alternative approach to estimating poverty by investigating whether readily available geospatial data can accurately predict the spatial distribution of poverty in Thailand. In particular, the geospatial data examined in this study include night-light intensity, land cover, vegetation index, land surface temperature, built-up areas, and points of interest. The study also compares the predictive performance of various econometric and machine learning methods, such as generalized least squares, neural networks, random forests, and support vector regression. Results suggest that the intensity of night lights and other variables that approximate population density are highly associated with the proportion of an area's population living in poverty. The random forest technique yielded the highest prediction accuracy among the methods considered in this study, perhaps due to its capability to fit complex association structures even with small and medium-sized datasets. Moving forward, additional studies are needed to investigate whether the relationships observed here remain stable over time and therefore may be used to approximate the prevalence of poverty in years when household surveys on income and expenditures are not conducted but data on geospatial correlates of poverty are available.
    Keywords: big data; computer vision; data for development; machine learning algorithm; multidimensional poverty; official statistics; poverty; SDG; Thailand
    JEL: C19 D31 I32 O15
    Date: 2020–12–29
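    The core association the authors report — brighter night lights go with lower poverty — can be illustrated with a minimal sketch. The data below are invented for illustration; the helper name pearson_r and the feature values are assumptions, not from the paper, which fits random forests to many geospatial features.

```python
# Toy illustration of the reported night-light/poverty association.
# All numbers are synthetic; the study uses real area-level data.

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Synthetic area-level data: brighter areas tend to be less poor.
night_light = [2.0, 5.0, 9.0, 15.0, 30.0, 55.0]      # mean radiance
poverty_rate = [0.42, 0.35, 0.30, 0.22, 0.12, 0.05]  # share poor

r = pearson_r(night_light, poverty_rate)
print(round(r, 3))  # strongly negative association
```

A random forest generalizes this idea by combining many such correlates non-linearly, which is why it outperformed linear methods in the study.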
  3. By: Zihao Zhang; Stefan Zohren
    Abstract: We design multi-horizon forecasting models for limit order book (LOB) data using deep learning techniques. Unlike standard structures, where a single prediction is made, we adopt encoder-decoder models with sequence-to-sequence and attention mechanisms to generate a forecasting path. Our methods achieve performance comparable to state-of-the-art algorithms at short prediction horizons. Importantly, they outperform when generating predictions over long horizons by leveraging the multi-horizon setup. Given that encoder-decoder models rely on recurrent neural layers, they generally suffer from slow training. To remedy this, we experiment with utilising novel hardware, so-called Intelligent Processing Units (IPUs) produced by Graphcore. IPUs are specifically designed for machine intelligence workloads with the aim of speeding up computation. We show that in our setup this leads to significantly faster training times compared to training the models on GPUs.
    Date: 2021–05
  4. By: Tim Leung; Theodore Zhao
    Abstract: We present the method of complementary ensemble empirical mode decomposition (CEEMD) and Hilbert-Huang transform (HHT) for analyzing nonstationary financial time series. This noise-assisted approach decomposes any time series into a number of intrinsic mode functions, along with the corresponding instantaneous amplitudes and instantaneous frequencies. Different combinations of modes allow us to reconstruct the time series using components of different timescales. We then apply Hilbert spectral analysis to define and compute the associated instantaneous energy-frequency spectrum to illustrate the properties of various timescales embedded in the original time series. Using HHT, we generate a collection of new features and integrate them into machine learning models, such as regression tree ensemble, support vector machine (SVM), and long short-term memory (LSTM) neural network. Using empirical financial data, we compare several HHT-enhanced machine learning models in terms of forecasting performance.
    Date: 2021–05
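    The Hilbert spectral step of HHT — extracting instantaneous amplitude and frequency from a single mode — can be sketched as below. This is a minimal FFT-based analytic signal, not the authors' CEEMD pipeline: the sifting step that produces the intrinsic mode functions is omitted, and the even-length assumption is a simplification.

```python
import numpy as np

def analytic_signal(x):
    """Discrete analytic signal via FFT (Hilbert transform); even-length input."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:n // 2] = 2.0       # double positive frequencies
    h[n // 2] = 1.0         # Nyquist bin kept once
    return np.fft.ifft(X * h)

# A pure tone: instantaneous amplitude should be ~1, frequency constant.
t = np.arange(512) / 512.0
x = np.cos(2 * np.pi * 8 * t)
z = analytic_signal(x)
amplitude = np.abs(z)                            # instantaneous amplitude
phase = np.unwrap(np.angle(z))
inst_freq = np.diff(phase) / (2 * np.pi) * 512   # cycles per unit time
```

For real market data, the same computation would be applied to each intrinsic mode function produced by CEEMD, giving the energy-frequency features fed to the downstream models.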
  5. By: Xin Jin
    Abstract: This paper presents a framework for imitating the price behavior of the underlying stock in order to reinforcement-learn the option price. We use accessible features of the equities pricing data to construct a non-deterministic Markov decision process for modeling stock price behavior driven by the principal investor's decision making. However, the low signal-to-noise ratio and instability inherent in equity markets make it challenging to determine the state transition (price change) after executing an action (the principal investor's decision), as well as to decide an action based on the current state (spot price). To overcome these challenges, we resort to a Bayesian deep neural network for computing the predictive distribution of the state transition led by an action. Additionally, instead of exploring a state-action relationship to formulate a policy, we seek an episode-based visible-hidden state-action relationship to probabilistically imitate the principal investor's successive decision making. Our algorithm then maps the imitated principal investor's decisions to simulated stock price paths by a Bayesian deep neural network. Finally, the optimal option price is learned through reinforcement by maximizing the cumulative risk-adjusted return of a dynamically hedged portfolio over simulated price paths of the underlying.
    Date: 2021–05
  6. By: Ozan Aksoy (Centre for Quantitative Social Sciences in the Social Research Institute, University College London)
    Abstract: In this study I analyse, through machine learning, the content of all Friday khutbas (sermons) read to millions of citizens in thousands of mosques in Turkey since 2015. I focus on six non-religious and recurrent topics that feature in the sermons, namely business, family, nationalism, health, trust, and patience. I demonstrate that the content of the sermons responds strongly to events of national importance. I then link the Friday sermons with ~4.8 million tweets on these topics to study whether and how the content of sermons affects social media behaviour. I find generally large effects of the sermons on tweets, but there is also heterogeneity by topic: the effect is strongest for nationalism, patience, and health and weakest for business. Overall, these results show that religious institutions in Turkey are influential in shaping the public's social media content and that this influence is mainly prevalent on salient issues. More generally, they show that mass offline religious activity can have strong effects on social media behaviour.
    Keywords: text-as-data analysis, computational social science, social media, religion, Islam, Turkey
    JEL: C63 N35 Z12
    Date: 2021–05–01
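    A much-simplified version of the text-as-data step — tagging each document with the recurring topics it mentions — might look like the sketch below. The topic lexicons and the function name tag_topics are invented for illustration; the study itself applies machine learning to Turkish-language sermons rather than hand-built English keyword lists.

```python
# Hypothetical keyword-based topic tagger, standing in for the paper's
# machine learning classification of sermon content.

TOPICS = {
    "family": {"family", "parent", "child", "marriage"},
    "health": {"health", "illness", "medicine", "body"},
    "patience": {"patience", "endure", "perseverance"},
}

def tag_topics(text):
    """Return the set of topics whose keywords appear in the text."""
    words = set(text.lower().split())
    return {topic for topic, keys in TOPICS.items() if words & keys}

tags = tag_topics("Show patience in illness and care for your family")
```

Counting such tags per sermon and per tweet, week by week, is what allows the offline-to-online comparison the paper draws.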
  7. By: Huseyin Gurkan (ESMT European School of Management and Technology); Francis de Véricourt (ESMT European School of Management and Technology)
    Abstract: This paper explores how firms that lack expertise in machine learning (ML) can leverage the so-called AI Flywheel effect. This effect designates a virtuous cycle by which, as an ML product is adopted and new user data are fed back to the algorithm, the product improves, enabling further adoptions. However, managing this feedback loop is difficult, especially when the algorithm is contracted out. Indeed, the additional data that the AI Flywheel effect generates may change the provider's incentives to improve the algorithm over time. We formalize this problem in a simple two-period moral hazard framework that captures the main dynamics among ML, data acquisition, pricing, and contracting. We find that the firm's decisions crucially depend on how the amount of data on which the machine is trained interacts with the provider's effort. If this effort has a more (less) significant impact on accuracy for larger volumes of data, the firm underprices (overprices) the product. Interestingly, these distortions sometimes improve social welfare, which accounts for the customer surplus and profits of both the firm and provider. Further, the interaction between incentive issues and the positive externalities of the AI Flywheel effect has important implications for the firm's data collection strategy. In particular, the firm can boost its profit by increasing the product's capacity to acquire usage data only up to a certain level. If the product collects too much data per user, the firm's profit may actually decrease, i.e., more data is not necessarily better. As a result, the firm should consider reducing its product's data acquisition capacity when its initial dataset to train the algorithm is large enough.
    Keywords: Data, machine learning, data product, pricing, incentives, contracting
    Date: 2020–03–03
  8. By: Zeinab Rouhollahi
    Abstract: Recently, financial institutions have been dealing with an increase in financial crimes. In this context, financial services firms have started to improve their vigilance and to use new technologies and approaches to identify and predict financial fraud and crime. This task is challenging, as institutions need to upgrade their data and analytics capabilities to enable new technologies such as Artificial Intelligence (AI) to predict and detect financial crimes. In this paper, we take a step towards AI-enabled financial crime detection in general, and money laundering detection in particular, to address this challenge. We review and analyse recent work in financial crime detection and present a novel model to detect money laundering cases with minimal human intervention.
    Date: 2021–05
  9. By: Kok Fong See (Universiti Sains Malaysia, Malaysia); Shawna Grosskopf (Oregon State University, United States); Vivian Valdmanis (Western Michigan University, United States); Valentin Zelenyuk (School of Economics and Centre for Efficiency and Productivity Analysis (CEPA) at The University of Queensland, Australia)
    Abstract: Not only does healthcare play a key role in a country’s economy, but it is also one of the fastest-growing sectors in most countries, resulting in rising expenditures. In turn, efficiency and productivity analyses of the healthcare industry have attracted attention from a wide variety of interested parties, including academics, hospital administrators, and policy makers. As a result, a very large number of studies of efficiency and productivity in the healthcare industry have appeared over the past three decades in a variety of outlets. In this paper, we conduct a comprehensive and systematic review of these studies with the aid of modern machine learning methods for bibliometric analysis. This approach facilitated our identification and analysis, allowing us to reveal patterns and clusters in the data from 477 efficiency and productivity articles on the healthcare industry from 1983 to 2019, produced by nearly 1,000 authors and published in a multitude of academic journals. Leveraging such ‘biblioanalytics’, combined with our own understanding of the field, we then highlight the trends and possible future of efficiency and productivity studies in healthcare.
    Date: 2021–05
  10. By: Yong Shi; Wei Dai; Wen Long; Bo Li
    Abstract: The Gaussian Process with a deep kernel is an extension of the classic GP regression model; the extended model usually constructs a new kernel function by deploying deep learning techniques such as long short-term memory (LSTM) networks. A Gaussian Process with a kernel learned by an LSTM, abbreviated as GP-LSTM, has the advantage of capturing the complex dependency of financial sequential data while retaining the ability of probabilistic inference. However, to the best of our knowledge, the deep kernel Gaussian Process has not been applied to forecasting conditional returns and volatility in financial markets. In this paper, a grid search algorithm for hyper-parameter optimization is integrated with GP-LSTM to predict both the conditional mean and the volatility of stock returns, which are then combined to calculate the conditional Sharpe ratio for constructing a long-short portfolio. The experiments are performed on a dataset covering all constituents of the Shenzhen Stock Exchange Component Index. Based on the empirical results, we find that the GP-LSTM model provides more accurate forecasts of stock returns and volatility, as jointly evaluated by the performance of the constructed portfolios. Further sub-period analysis indicates that the superiority of the GP-LSTM model over the benchmark models stems from better performance in highly volatile periods.
    Date: 2021–05
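    The classic GP regression the paper extends can be sketched in a few lines. This uses a fixed RBF kernel with illustrative hyper-parameters (length scale, noise); the paper's contribution is to replace this fixed kernel with one learned by an LSTM.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """RBF (squared-exponential) kernel matrix between 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean of GP regression with an RBF kernel."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf(x_test, x_train)
    return k_star @ np.linalg.solve(K, y_train)

# Fit a smooth function from four noisy-free observations.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(x)
mu = gp_posterior_mean(x, y, np.array([1.5]))
```

The probabilistic inference the abstract mentions comes from the posterior variance (omitted here for brevity), which GP-LSTM retains while the LSTM shapes the kernel.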
  11. By: Joan Huang (Reserve Bank of Australia); John Simon (Reserve Bank of Australia)
    Abstract: High-quality central bank communication can improve the effectiveness of monetary policy and is an essential element in providing greater central bank transparency. There is, however, no agreement on what high-quality communication looks like. To shed light on this, we investigate three important aspects of central bank communication. We focus on how different audiences perceive the readability and degree of reasoning within various economic publications; providing the reasons for decisions is a critical element of transparency. We find that there is little correlation between perceived readability and reasoning in the economic communications we analyse, which highlights that commonly used measures of readability can miss important aspects of communication. We also find that perceptions of communication quality can vary significantly between audiences; one size does not fit all. To dig deeper, we use machine learning techniques and develop a model that predicts the way different audiences rate the readability of and reasoning within texts. The model highlights that simpler writing is not necessarily more readable nor more revealing of the author's reasoning. The results also show how readability and reasoning vary within and across documents; good communication requires a variety of styles within a document, each serving a different purpose, and different audiences need different styles. Greater central bank transparency and more effective communication require an emphasis not just on greater readability of a single document, but also on setting out the reasoning behind conclusions in a variety of documents that each meet the needs of different audiences.
    Keywords: central bank communications; machine learning; natural language processing; readability; central bank transparency
    JEL: C61 C83 D83 E58 Z13
    Date: 2021–05
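    The "commonly used measures of readability" the paper argues can mislead are formulas of the kind sketched below: the Flesch Reading Ease score, which depends only on sentence length and syllables per word. The syllable counter here is a rough vowel-group heuristic, an assumption for illustration rather than a reference implementation.

```python
import re

def count_syllables(word):
    """Rough heuristic: count runs of vowels as syllables (at least one)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher means easier to read."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

simple = flesch_reading_ease("The cat sat. The dog ran.")
dense = flesch_reading_ease(
    "Macroprudential considerations necessitate "
    "countercyclical capital requirements.")
```

Such a score ranks the second sentence as far harder, yet says nothing about whether either text explains its reasoning — exactly the gap the paper's audience-rated measures are designed to fill.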
  12. By: Boot, Arnoud W A; Hoffmann, Peter; Laeven, Luc; Ratnovski, Lev
    Abstract: We study the effects of technological change on financial intermediation, distinguishing between innovations in information (data collection and processing) and communication (relationships and distribution). Both follow historic trends towards an increased use of hard information and less in-person interaction, which are accelerating rapidly. We point to more recent innovations, such as the combination of data abundance and artificial intelligence, and the rise of digital platforms. We argue that in particular the rise of new communication channels can lead to the vertical and horizontal disintegration of the traditional bank business model. Specialized providers of financial services can chip away at activities that do not rely on access to balance sheets, while platforms can interject themselves between banks and customers. We discuss limitations to these challenges, and the resulting policy implications.
    Keywords: communication; financial innovation; Financial Intermediation; Fintech; Information
    JEL: E58 G20 G21 O33
    Date: 2020–07
  13. By: Sansone, Dario (University of Exeter); Zhu, Anna (RMIT University)
    Abstract: Using high-quality nationwide social security data combined with machine learning tools, we develop predictive models of income support receipt intensities for any payment enrolee in the Australian social security system between 2014 and 2018. We show that off-the-shelf machine learning algorithms can significantly improve predictive accuracy compared to simpler heuristic models or early warning systems currently in use. Specifically, the former predicts the proportion of time individuals are on income support in the subsequent four years with greater accuracy, by a magnitude of at least 22% (a 14 percentage point increase in the R2), compared to the latter. This gain can be achieved at no extra cost to practitioners since the algorithms use administrative data currently available to caseworkers. Consequently, our machine learning algorithms can improve the detection of long-term income support recipients, which can potentially provide governments with large savings in accrued welfare costs.
    Keywords: income support, machine learning, Australia
    JEL: C53 H53 I38 J68
    Date: 2021–05
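    The paper's headline gain is stated as an increase in R2 (the coefficient of determination) over heuristic baselines. A minimal sketch of that comparison, on invented toy numbers rather than the paper's administrative data:

```python
# R^2 comparison between a heuristic rule and a model prediction.
# All figures below are synthetic illustrations.

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Share of the next four years spent on income support, per person.
actual = [0.10, 0.25, 0.40, 0.60, 0.90]
heuristic = [0.30, 0.30, 0.30, 0.60, 0.70]  # simple rule of thumb
model = [0.12, 0.22, 0.45, 0.58, 0.85]      # ML-style prediction

gain = r_squared(actual, model) - r_squared(actual, heuristic)
```

A positive gain of this kind, expressed in percentage points of R2, is the metric behind the paper's "at least 22%" claim.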
  14. By: Bianchi, Francesco; Ludvigson, Sydney C.; Ma, Sai
    Abstract: This paper combines a data rich environment with a machine learning algorithm to provide estimates of time-varying systematic expectational errors ("belief distortions") about the macroeconomy embedded in survey responses. We find that such distortions are large on average even for professional forecasters, with all respondent-types over-weighting their own forecast relative to other information. Forecasts of inflation and GDP growth oscillate between optimism and pessimism by quantitatively large amounts. To investigate the dynamic relation of belief distortions with the macroeconomy, we construct indexes of aggregate (across surveys and respondents) expectational biases in survey forecasts. Over-optimism is associated with an increase in aggregate economic activity. Our estimates provide a benchmark to evaluate theories for which information capacity constraints, extrapolation, sentiments, ambiguity aversion, and other departures from full information rational expectations play a role in business cycles.
    Keywords: beliefs; Biases; Expectations; Machine Learning
    JEL: E17 E27 E32 E7 G4
    Date: 2020–07
  15. By: Longden, Elaine (Tilburg University, School of Economics and Management)
    Date: 2021
  16. By: Jiongyan Zhang
    Abstract: In order to study regional economic development and urban expansion from the perspective of night-light remote sensing images, this study uses NOAA night-light remote sensing image data from 1992 to 2013, together with ArcGIS software, to process the imagery, obtain basic pixel-level information for specific areas, and analyse these data across space and time to present the trend of regional economic development in China in recent years, and to explore the urbanization effect brought about by the rapid development of China's economy. The analysis of these data shows that the speed of urbanization in China is still at its peak and has great development potential and space. At the same time, attention also needs to be paid to the imbalance of regional development.
    Date: 2020–11
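    The space-time analysis described above reduces, at its simplest, to aggregating pixel brightness for a region in each annual composite and fitting a trend. The sketch below uses synthetic values; the study works with real NOAA annual composites in ArcGIS.

```python
# Toy trend analysis of regional night-light brightness over time.
# The brightness values are invented for illustration.

def linear_trend(ys):
    """Ordinary least-squares slope of ys against time indices 0..n-1."""
    n = len(ys)
    mx = (n - 1) / 2
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(range(n), ys))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

# Sum of pixel digital-number values over a region, one value per year.
regional_brightness = [1200, 1350, 1500, 1700, 1950, 2250]
slope = linear_trend(regional_brightness)  # positive: region brightening
```

A positive slope indicates brightening — the proxy for economic growth and urban expansion — while comparing slopes across regions exposes the development imbalance the abstract notes.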
  17. By: Haozhe Su; M. V. Tretyakov; David P. Newton
    Abstract: Transition probability densities are fundamental to option pricing. Advancing recent work in deep learning, we develop novel transition density function generators by solving backward Kolmogorov equations in parametric space for cumulative probability functions, using neural networks to obtain accurate approximations of transition probability densities and creating ultra-fast transition density function generators offline that can be trained for any underlying. These are 'single solve', so they do not require recalculation when parameters are changed (e.g. recalibration of volatility) and are portable to other option pricing setups as well as to less powerful computers, where they can be accessed as quickly as closed-form solutions. We demonstrate the range of application for one-dimensional cases, exemplified by the Black-Scholes-Merton model; two-dimensional cases, exemplified by the Heston process; and finally a modified Heston model with time-dependent parameters that has no closed-form solution.
    Date: 2021–05
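    The one-dimensional benchmark above, the Black-Scholes-Merton model, has a closed-form (lognormal) transition density — the kind of target a trained density generator must reproduce. A sketch of that closed form, with illustrative parameter values:

```python
import math

def bsm_transition_density(s_t, s0, r, sigma, t):
    """Lognormal density of the terminal price S_t given S_0 under GBM."""
    mu = math.log(s0) + (r - 0.5 * sigma ** 2) * t
    var = sigma ** 2 * t
    return math.exp(-(math.log(s_t) - mu) ** 2 / (2 * var)) / (
        s_t * math.sqrt(2 * math.pi * var))

# Density of ending at 105 after one year, starting at 100,
# with 2% rate and 20% volatility (illustrative inputs).
p = bsm_transition_density(s_t=105.0, s0=100.0, r=0.02, sigma=0.2, t=1.0)
```

For the Heston and time-dependent-Heston cases no such formula exists, which is where the neural generators trained on the backward Kolmogorov equation earn their keep.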
  18. By: Aspachs, Oriol; Durante, Ruben; Graziano, Alberto; Mestres, Josep; Montalvo, Jose G; Reynal-Querol, Marta
    Abstract: Official statistics on economic inequality are only available at low frequency and with considerable delay. This makes it challenging to assess the impact on inequality of fast-unfolding crises like the COVID-19 pandemic, and to rapidly evaluate and tailor policy responses. We propose a new methodology to track income inequality at high frequency using anonymized data from bank records for over three million account holders in Spain. Using this approach, we analyse how inequality evolved between February and July 2020 (compared to the same months of 2019). We first show that the wage distribution in our data matches very closely that from official labour surveys. We then document that, in the absence of government intervention, inequality would have increased dramatically, mainly due to job losses and wage cuts experienced by low-wage workers. The increase in pre-transfer inequality was especially pronounced among younger and foreign-born individuals, and in regions more dependent on tourism. Finally, we find that public transfers and unemployment insurance schemes were very effective at providing a safety net to the most affected segments of the population and at offsetting most of the increase in inequality.
    Keywords: Administrative data; COVID-19; High Frequency Data; inequality
    JEL: C81 D63 E24 J31
    Date: 2020–07
  19. By: El Youssefi Ahmed (USMBA - Université Sidi Mohamed Ben Abdellah - Fès [Université de Taza]); Abdelahad Chraibi (Alicante [Seclin]); Julien Taillard (Alicante [Seclin]); Ahlame Begdouri (USMBA - Université Sidi Mohamed Ben Abdellah - Fès [Université de Taza])
    Abstract: A patient's medical record represents their medical history and encloses useful information about their health status within written reports. These reports usually contain measurements (among other information) that need to be reviewed before any new medical intervention, since they might influence medical decisions regarding the types of drugs that are prescribed or their dosage. In this paper, we introduce a method that extracts measurements automatically from textual medical discharge summaries, admission notes, progress notes, and primary care notes. We do not distinguish between reports belonging to different services. To do so, we propose a system that uses Grobid-quantities to extract value/unit pairs and uses rules generated from the analysis of medical reports, together with text mining tools, to identify candidate measurements. These candidates are then classified using a trained Long Short-Term Memory (LSTM) network model to determine which measurement corresponds to each value/unit pair. The results are promising: 95.13% accuracy, a precision of 92.38%, a recall of 94.01%, and an F1 score of 89.49%.
    Keywords: Conditional Random Fields (CRF),Long Short Term Memory (LSTM),Natural Language Processing,Measurement,Medical report
    Date: 2020–10–26
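    A much-reduced stand-in for the value/unit extraction step might use a regex like the one below. The unit list is a tiny illustrative subset, and the downstream LSTM classification that assigns each pair to a named measurement is omitted; none of this reflects Grobid-quantities' actual internals.

```python
import re

# Hypothetical value/unit extractor; longer units are listed first so
# that e.g. "mmHg" is not partially matched as "mg" or "mm".
UNIT = r"(mmHg|bpm|mg|kg|ml|cm|mm|g|l|%)"
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*" + UNIT)

def extract_measurements(text):
    """Return (value, unit) pairs found in a report sentence."""
    return [(float(v), u) for v, u in PATTERN.findall(text)]

note = "Blood pressure 120 mmHg, weight 72.5 kg, aspirin 100 mg daily."
pairs = extract_measurements(note)
```

The hard part the paper tackles is the next step: deciding that "120 mmHg" is a blood pressure and "72.5 kg" a body weight, which is what the trained LSTM model does.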
  20. By: Timothy DeLise
    Abstract: This research investigates pricing financial options based on the traditional martingale theory of arbitrage pricing applied to neural SDEs. We treat neural SDEs as universal Itô process approximators. In this way we can lift all assumptions on the form of the underlying price process and compute theoretical option prices numerically. We propose a variation of the SDE-GAN approach by implementing the Wasserstein distance metric as a loss function for training. Furthermore, it is conjectured that the error of the option price implied by the learnt model can be bounded by the very Wasserstein distance metric that was used to fit the empirical data.
    Date: 2021–05
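    The loss at the heart of the approach is easy to sketch in one dimension: for two equal-size empirical samples, the Wasserstein-1 distance reduces to the mean absolute difference of the sorted samples. The data below are toy numbers, not anything from the paper.

```python
# Empirical Wasserstein-1 distance in one dimension, the metric used
# (in the paper, as a GAN training loss) to match simulated SDE paths
# to observed data.

def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size empirical distributions."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.5, 2.5, 3.5, 4.5]  # model samples shifted by 0.5
print(wasserstein_1d(observed, simulated))  # 0.5
```

The paper's conjecture is that this same quantity, measured between model and data, bounds the error of the option prices the learnt model implies.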

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.