nep-big New Economics Papers
on Big Data
Issue of 2022‒02‒07
fourteen papers chosen by
Tom Coupé
University of Canterbury

  1. The Territorial Big Data: An innovative concept of Territorial Economic Intelligence. By Mouad Lamrabet; Taoufik Benkaraache
  2. Artificial Intelligence and Big Data in the Age of COVID-19 By Francisco J. Bariffi; Julia M. Puaschunder
  3. 'Moving On' -- Investigating Inventors' Ethnic Origins Using Supervised Learning By Matthias Niggli
  4. By Reshmaan Hussam; Abu S. Shonchoy; Chikako Yamauchi; Kailash Pandey
  5. LSTM Architecture for Oil Stocks Prices Prediction By Javad T. Firouzjaee; Pouriya Khaliliyan
  6. Winning the War? New Evidence on the Measurement and the Determinants of Poverty in the United States By Miss Anke Weber; Katharina Bergant; Andrea Medici
  7. Black-box Bayesian inference for economic agent-based models By Farmer, J. Doyne; Dyer, Joel; Cannon, Patrick; Schmon, Sebastian
  8. The Earth is Not Flat: A New World of High-Dimensional Peer Effects By Aurélien Sallin; Simone Balestra
  9. Use of Alternative Data in the Bank of Japan's Research Activities By Seisaku Kameda
  10. Robust Algorithmic Collusion By Nicolas Eschenbaum; Filip Melgren; Philipp Zahn
  11. Big Data for smart cities and citizen engagement: evidence from Twitter data analysis on Italian municipalities By Silvia Blasi; Edoardo Gobbo; Silvia Rita Sedita
  12. Vaccination Policy and Trust By Jelnov, Artyom; Jelnov, Pavel
  13. Predicting housing prices. A long term housing price path for Spanish regions By Paloma Taltavull de La Paz
  14. The DONUT Approach to Ensemble Combination Forecasting By Lars Lien Ankile; Kjartan Krange

  1. By: Mouad Lamrabet (Laboratoire de Recherche en Intelligence Stratégique - UH2MC - Université Hassan II [Casablanca]); Taoufik Benkaraache (Laboratoire de Recherche en Intelligence Stratégique - UH2MC - Université Hassan II [Casablanca])
    Abstract: We are currently witnessing a "data revolution" and the emergence of a new resource: Big Data. Mastery of this resource remains a strategic stake for economic actors. In this perspective, we develop in this article the elements of an original concept, which we call Territorial Big Data, and define it as an innovative concept of Territorial Economic Intelligence whose raw material is strategic data.
    Keywords: Big Data, Territorial Economic Intelligence, Strategic Data
    Date: 2021–06–30
  2. By: Francisco J. Bariffi (University Carlos III of Madrid, Spain); Julia M. Puaschunder (The New School, Department of Economics, School of Public Engagement, USA)
    Abstract: The view that the COVID-19 pandemic has set in motion profound changes in our modern societies is practically unanimous. The global effort to contain, cure, and eradicate COVID-19 has benefited greatly from the use, development, and/or adaptation of technological tools for mass surveillance based on artificial intelligence and robotics. Yet the management of the pandemic has also revealed many shortcomings generated by the need to make decisions "in extremis". Systematic lockdowns of entire populations pushed people to increase their exposure to digital devices in order to achieve some form of social connection. Some nations with the necessary technological capacity used AI systems to access individual digital data in order to control and contain SARS-CoV-2. Massive surveillance of entire populations is now possible. The problem thus arises of how to strike an adequate balance between the utility and results offered by AI- and robotics-based mass surveillance systems in the fight against COVID-19, on the one hand, and the protection of personal and collective fundamental rights and freedoms, on the other.
    Keywords: Artificial Intelligence, AI, Anti-Discrimination, Big Data, COVID-19, COVID Long Haulers, Democratization of Healthcare Information, Digitalization, Healthcare, Human Rights, Massive Surveillance, Prevention, Tracking
    Date: 2021–10
  3. By: Matthias Niggli
    Abstract: Patent data provide rich information about technical inventions but do not disclose the ethnic origin of inventors. In this paper, I use supervised learning techniques to infer this information. To do so, I construct a dataset of 95,202 labeled names and train an artificial recurrent neural network with long short-term memory (LSTM) to predict ethnic origins from names. The trained network achieves an overall performance of 91% across 17 ethnic origins. I use this model to classify and investigate the ethnic origins of 2.68 million inventors and provide novel descriptive evidence on the composition of their ethnic origins over time and across countries and technological fields. The global ethnic-origin composition has become more diverse over the last decades, mostly due to a relative increase of inventors of Asian origin. Furthermore, the prevalence of foreign-origin inventors is especially high in the USA, but has also increased in other high-income economies. For the USA, this increase was mainly driven by an inflow of non-western inventors into emerging high-technology fields; this was not the case for other high-income countries.
    Date: 2022–01
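The core preprocessing step such a name-classification network relies on, turning names into fixed-length integer sequences that an embedding plus LSTM layer can consume, can be sketched in a few lines. The alphabet, padding length, and example names below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of character-level preprocessing for an LSTM name
# classifier: each name becomes a fixed-length sequence of integer
# character ids, with 0 reserved for padding.

ALPHABET = "abcdefghijklmnopqrstuvwxyz- '"          # assumed character set
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(ALPHABET)}  # 0 = padding

def encode_name(name: str, max_len: int = 20) -> list[int]:
    """Lower-case the name, map known characters to integer ids,
    drop unknown characters, and pad/truncate to max_len."""
    ids = [CHAR_TO_ID[c] for c in name.lower() if c in CHAR_TO_ID]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))

# A batch of encoded names is a (batch, max_len) integer matrix,
# suitable input for an embedding layer followed by an LSTM.
batch = [encode_name(n) for n in ["Matthias Niggli", "Li Wei"]]
```

The trained network itself would then map each such sequence to a probability distribution over the 17 origin classes.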
  4. By: Reshmaan Hussam (Harvard Business School); Abu S. Shonchoy (Department of Economics, Florida International University); Chikako Yamauchi (GRIPS); Kailash Pandey (Harvard Business School)
    Abstract: While models of technology adoption posit learning as the basis of behavior change, information campaigns in public health frequently fail to change behavior. We design an information campaign embedding hand-hygiene edutainment within popular dramas using mobile phones, randomly distributed to households in Bangladesh. We document no change in hygiene knowledge, yet substantial improvements in handwashing and health. Employing machine learning techniques with temporal data on media exposure and handwashing, we find that a combination of cumulative and immediate exposure predicts washing, consistent with cue-based habituation. Results highlight how behavior change may be induced by tacit, rather than explicit, knowledge acquisition.
    Date: 2021–12
  5. By: Javad T. Firouzjaee; Pouriya Khaliliyan
    Abstract: Oil companies are among the largest companies in the world, and their economic indicators in global stock markets have a great impact on the world economy and markets because of their relation to gold, crude oil, and the dollar. To quantify these relations, we use correlations between these stocks and the dollar, crude oil, gold, and major oil-company stock indices; we create datasets and compare forecast results with real data. To predict the stocks of different companies, we use recurrent neural networks (RNNs) and LSTMs, because these stocks evolve as time series. We carry out empirical experiments on the stock-index datasets and evaluate prediction performance in terms of several common error metrics, such as Mean Square Error (MSE), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). The results are promising and provide reasonably accurate predictions of oil companies' stock prices in the near future. The results also show that RNNs lack interpretability and that the model cannot be improved by adding correlated data.
    Date: 2022–01
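For readers unfamiliar with the four error metrics named in the abstract, their standard definitions can be sketched in a few lines; the function and variable names and the example values are illustrative, not taken from the paper.

```python
import math

def forecast_errors(y_true, y_pred):
    """Compute MSE, MAE, RMSE, and MAPE for two equal-length series."""
    n = len(y_true)
    diffs = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(d * d for d in diffs) / n
    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(mse)
    # MAPE is undefined when a true value is zero; assumed nonzero here.
    mape = 100 * sum(abs(d) / abs(t) for d, t in zip(diffs, y_true)) / n
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "MAPE": mape}

# Example: two prices, each off by 10.
metrics = forecast_errors([100.0, 200.0], [110.0, 190.0])
# MSE = 100.0, MAE = 10.0, RMSE = 10.0, MAPE ~ 7.5 (up to float rounding)
```

Note that MAPE weights errors by the inverse of the true value, which is why the same absolute error of 10 contributes 10% at a price of 100 but only 5% at 200.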
  6. By: Miss Anke Weber; Katharina Bergant; Andrea Medici
    Abstract: Using micro-data from household expenditure surveys, we document the evolution of consumption poverty in the United States over the last four decades. Employing a price index that appears appropriate for low income households, we show that poverty has not declined materially since the 1980s and even increased for the young. We then analyze which social and economic factors help explain the extent of poverty in the U.S. using probit, tobit, and machine learning techniques. Our results are threefold. First, we identify the poor as more likely to be minorities, without a college education, never married, and living in the Midwest. Second, the importance of some factors, such as race and ethnicity, for determining poverty has declined over the last decades but they remain significant. Third, we find that social and economic factors can only partially capture the likelihood of being poor, pointing to the possibility that random factors (“bad luck”) could play a significant role.
    Keywords: Poverty, Inequality, Consumption, Provision and Effects of Welfare Programs
    Date: 2022–01–14
  7. By: Farmer, J. Doyne; Dyer, Joel; Cannon, Patrick; Schmon, Sebastian
    Abstract: Simulation models, in particular agent-based models, are gaining popularity in economics. The considerable flexibility they offer, as well as their capacity to reproduce a variety of empirically observed behaviors of complex systems, give them broad appeal, and the increasing availability of cheap computing power has made their use feasible. Yet a widespread adoption in real-world modelling and decision-making scenarios has been hindered by the difficulty of performing parameter estimation for such models. In general, simulation models lack a tractable likelihood function, which precludes a straightforward application of standard statistical inference techniques. A number of recent works (Grazzini et al., 2017; Platt, 2020, 2021) have sought to address this problem through the application of likelihood-free inference techniques, in which parameter estimates are determined by performing some form of comparison between the observed data and simulation output. However, these approaches are (a) founded on restrictive assumptions, and/or (b) typically require many hundreds of thousands of simulations. These qualities make them unsuitable for large-scale simulations in economics and can cast doubt on the validity of these inference methods in such scenarios. In this paper, we investigate the efficacy of two classes of simulation-efficient black-box approximate Bayesian inference methods that have recently drawn significant attention within the probabilistic machine learning community: neural posterior estimation and neural density ratio estimation. We present a number of benchmarking experiments in which we demonstrate that neural network based black-box methods provide state-of-the-art parameter inference for economic simulation models and, crucially, are compatible with generic multivariate time-series data. In addition, we suggest appropriate assessment criteria for use in future benchmarking of approximate Bayesian inference procedures for economic simulation models.
    Date: 2022–02
  8. By: Aurélien Sallin; Simone Balestra
    Abstract: The majority of recent peer-effect studies in education have focused on the effect of one particular type of peer on classmates. This view fails to take into account that peer effects are heterogeneous across students with different characteristics, and that there are at least as many peer-effect functions as there are types of peers. In this paper, we develop a general empirical framework that accounts for systematic interactions between peer types and nonlinearities of peer effects. We use machine-learning methods to (i) understand which dimensions of peer characteristics are the most predictive of academic success, (ii) estimate high-dimensional peer-effect functions, and (iii) investigate performance-improving classroom allocation through policy-relevant simulations. First, we find that students' own characteristics are the most predictive of academic success, and that the most predictive peer effects are generated by students with special needs, low-achieving students, and male students. Second, we show that peer effects traditionally reported in the literature likely miss important nonlinearities in the distribution of peer proportions. Third, we find that classroom compositions that are most balanced across student characteristics achieve the highest aggregate school performance.
    Keywords: peer effects, high dimensionality, machine learning, classroom composition
    JEL: C31 H75 I21 I28
    Date: 2022–01
  9. By: Seisaku Kameda (Bank of Japan)
    Abstract: The Bank of Japan (BOJ) has recently launched a new page on its website titled, "Alternative Data Analysis." In light of the launch of this page, this paper outlines initiatives taken by the BOJ's research divisions (including but not limited to the Research and Statistics Department) in the field of alternative data analysis. Since the spread of COVID-19, the BOJ has been making active use of high-frequency data - such as mobility trends based on location data - in assessing economic conditions to conduct monetary policy. Moreover, in light of the lessons from the Global Financial Crisis of the late 2000s, the BOJ also has been continuing its efforts to strengthen the collection and use of various individual transaction data in the financial field. Such new forms of big data are called alternative data as opposed to traditional economic and financial statistics. The alternative data employed at the BOJ have been wide-ranging, including high-frequency data, textual data, and granular data; recently, the range of these data has been extending further to cover, for example, climate-related data.
    Date: 2022–01–21
  10. By: Nicolas Eschenbaum; Filip Melgren; Philipp Zahn
    Abstract: This paper develops a formal framework to assess policies of learning algorithms in economic games. We investigate whether reinforcement-learning agents with collusive pricing policies can successfully extrapolate collusive behavior from training to the market. We find that in testing environments collusion consistently breaks down. Instead, we observe static Nash play. We then show that restricting algorithms' strategy space can make algorithmic collusion robust, because it limits overfitting to rival strategies. Our findings suggest that policy-makers should focus on firm behavior aimed at coordinating algorithm design in order to make collusive policies robust.
    Date: 2022–01
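The kind of reinforcement-learning pricing agent studied in this literature can be illustrated with a minimal Q-learning sketch. The price grid, the linear demand function, the random rival, and all learning parameters below are purely illustrative assumptions, not the authors' environment or results.

```python
import random

random.seed(0)

PRICES = [1.0, 2.0, 3.0, 4.0, 5.0]        # assumed discrete price grid
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1        # assumed learning parameters

def profit(p, rival_p):
    """Illustrative linear demand: quantity falls in own price and
    rises in the rival's price."""
    quantity = max(0.0, 10.0 - 2.0 * p + rival_p)
    return p * quantity

# State = index of the rival's last price; Q[state][action] = value estimate.
Q = [[0.0] * len(PRICES) for _ in PRICES]

state = 0                                  # rival starts at PRICES[0]
for _ in range(5000):
    # epsilon-greedy action choice
    if random.random() < EPS:
        action = random.randrange(len(PRICES))
    else:
        action = max(range(len(PRICES)), key=lambda a: Q[state][a])
    rival_action = random.randrange(len(PRICES))   # random rival behavior
    r = profit(PRICES[action], PRICES[rival_action])
    next_state = rival_action
    # standard Q-learning update
    Q[state][action] += ALPHA * (r + GAMMA * max(Q[next_state]) - Q[state][action])
    state = next_state

# Learned policy: the agent's preferred price for each rival price.
policy = [PRICES[max(range(len(PRICES)), key=lambda a: Q[s][a])]
          for s in range(len(PRICES))]
```

The paper's robustness question can be read in these terms: a policy like `policy` above is learned against the training rival, and the issue is whether it survives when deployed against rivals whose behavior differs from training.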
  11. By: Silvia Blasi (University of Verona); Edoardo Gobbo (University of Padova); Silvia Rita Sedita (University of Padova)
    Abstract: Smart cities are increasingly keen to establish a fruitful conversation with their citizens, to better capture their needs, and to create virtual platforms that stimulate co-creation processes between government and users, with the final objective of increasing quality of life and well-being. Social media applications provide an opportunity for dialogic communication where, for a relatively low cost, a large amount of information reaching a wide audience can be published and exchanged in real time, fueling opportunities for citizen engagement. This study is based on a social media listening method, using Twitter data mining, which enabled us to disentangle different components of citizen engagement (popularity, commitment, and virality) for a sample of Italian municipalities. In addition, we carried out a deep analysis of the types of communication artifacts exchanged and, through a content analysis of the tweets published by followers of the municipalities' accounts, identified the main areas of interest of the social media conversations. Our results are based on the analysis of online conversations engaged in by followers of the Twitter accounts of a sample of 28 Italian municipalities, chosen among the most active and densely populated. We show that municipalities tend to use their Twitter accounts as a channel of communication to inform the population about a variety of topics, such as transport and public works, among others. The volume of activity and the number of followers (audience) vary from one municipality to another. There is generally a negative relationship between the population density of a municipality and citizen engagement: smaller municipalities show higher citizen engagement, while the biggest ones, such as Rome, Milan, Turin, and Naples, are laggards. Finally, we conducted a city-profiling process, which provides a representation of key citizen segments in terms of engagement. Policy makers may find in our work useful tools for increasing their capacity to listen to citizens.
    Keywords: smart cities, e-government, Twitter, web scraping, social media listening, we-government
    JEL: M10 M38
    Date: 2022
  12. By: Jelnov, Artyom (Ariel University); Jelnov, Pavel (Leibniz University of Hannover)
    Abstract: We study the relationship between trust and vaccination. We show theoretically that vaccination rates are higher in countries with more transparent and accountable governments. The mechanism that generates this result is the lower probability that a transparent and accountable government promotes an unsafe vaccine. Empirical evidence supports this result. We find that countries perceived as less corrupt and more liberal experience higher vaccination rates. Furthermore, they are less likely to adopt a mandatory vaccination policy. One unit of the Corruption Perception Index (scaled from 0 to 10) is associated with a vaccination rate that is higher by one percentage point (pp) but with a likelihood of compulsory vaccination that is lower by 10 pp. In addition, Google Trends data show that public interest in corruption is correlated with interest in vaccination. The insight from our analysis is that corruption affects not only the supply but also the demand for public services.
    Keywords: vaccination, corruption, COVID-19
    JEL: I18
    Date: 2021–12
  13. By: Paloma Taltavull de La Paz
    Abstract: This paper aims to forecast the long-term trend of housing prices in Spanish cities with more than 25,000 inhabitants, a total of 275 individual municipalities. Based on a causal model explaining housing prices with six fundamental variables (changes in population, income, number of mortgages, interest rates, vacancies, and housing prices), a pooled VECM technique is used to estimate a housing price model and calculate the 'stable long-term price', a central concept defined in the formal valuation process. The model covers the period 1995-2020, and the long term is approached from 2000 to 2026, so the prediction exercise includes both backcast and forecast periods, allowing us to extract the long-term cycle housing prices have followed during the last 20 years and to project it a further six years. The analytical process first identifies the cities following a common pattern in their housing market by clustering the cities twice: (1) using house-price time series and (2) using a machine learning approach with the six fundamental variables. The results give a comprehensible evolution of the long-term component of housing prices, and the model also permits an understanding of the main drivers of housing prices in each Spanish region. Clustering cities with the two statistical tools gives quite similar results for some cities but differs for others. The challenge of finding the correct grouping is critical for understanding the housing market and forecasting prices.
    Keywords: Error correction models; Forecast; Housing Prices; Housing valuation; Machine Learning; Time Series
    JEL: R3
    Date: 2021–01–01
  14. By: Lars Lien Ankile; Kjartan Krange
    Abstract: This paper presents an ensemble forecasting method, termed DONUT (DO Not UTilize human assumptions), that shows strong results on the M4 Competition dataset by reducing feature- and model-selection assumptions. Our assumption reductions, consisting mainly of auto-generated features and a more diverse model pool for the ensemble, significantly outperform the statistical-feature-based ensemble method FFORMA by Montero-Manso et al. (2020). Furthermore, we investigate feature extraction with a long short-term memory (LSTM) autoencoder and find that such features contain crucial information not captured by traditional statistical feature approaches. The ensemble weighting model uses both LSTM features and statistical features to combine the models accurately. Analysis of feature importance and interactions shows a slight superiority of the LSTM features over the statistical ones alone. Clustering analysis shows that the essential LSTM features differ from most statistical features and from each other. We also find that increasing the solution space of the weighting model, by augmenting the ensemble with new models, is something the weighting model learns to exploit, explaining part of the accuracy gains. Lastly, we present a formal ex-post-facto analysis of optimal combination and selection for ensembles, quantifying the differences through linear optimization on the M4 dataset. We also include a short proof that model combination is, a posteriori, superior to model selection.
    Date: 2022–01
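The closing claim, that model combination is a posteriori superior to model selection, follows because selection is the special case of combination with a weight of 0 or 1. A small illustration on made-up data (the forecasts, biases, and grid-search weighting below are illustrative assumptions, not the paper's method):

```python
def mse(errors):
    """Mean squared error of a list of forecast errors."""
    return sum(e * e for e in errors) / len(errors)

def best_convex_combination(actual, f1, f2, steps=1000):
    """Grid-search the weight w in [0, 1] minimizing the MSE of
    w*f1 + (1-w)*f2. Since w=0 and w=1 reproduce the single models,
    the best combination can never do worse than the best selection."""
    best_w, best_mse = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        errs = [a - (w * x + (1 - w) * y)
                for a, x, y in zip(actual, f1, f2)]
        m = mse(errs)
        if m < best_mse:
            best_w, best_mse = w, m
    return best_w, best_mse

# Two forecasts with opposite biases: combining cancels the bias.
actual = [1.0, 2.0, 3.0]
f1 = [a + 0.2 for a in actual]     # biased upward
f2 = [a - 0.1 for a in actual]     # biased downward
w, combined = best_convex_combination(actual, f1, f2)
single_best = min(mse([a - x for a, x in zip(actual, f1)]),
                  mse([a - y for a, y in zip(actual, f2)]))
assert combined <= single_best     # combination beats selection, a posteriori
```

Here the optimal weight lands near 1/3, which exactly offsets the +0.2 and -0.1 biases, so the combined error is nearly zero while the best single model still carries its bias.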

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at its homepage. For comments, please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.