nep-big New Economics Papers
on Big Data
Issue of 2021‒09‒20
nineteen papers chosen by
Tom Coupé
University of Canterbury

  1. Artificial Intelligence in the Field of Economics By Steve J. Bickley; Ho Fai Chan; Benno Torgler
  2. Who develops AI-related innovations, goods and services?: A firm-level analysis By Hélène Dernis; Laurent Moussiegt; Daisuke Nawa; Mariagrazia Squicciarini
  3. The human capital behind AI: Jobs and skills demand from online job postings By Lea Samek; Mariagrazia Squicciarini; Emile Cammeraat
  4. Using Satellite Imagery and Machine Learning to Estimate the Livelihood Impact of Electricity Access By Nathan Ratledge; Gabe Cadamuro; Brandon de la Cuesta; Matthieu Stigler; Marshall Burke
  5. Gender Distribution across Topics in the Top 5 Economics Journals: A Machine Learning Approach By J.Ignacio Conde-Ruiz; Juan-José Ganuza; Manu García; Luis A. Puch
  6. Integrating R Machine Learning Algorithms in Stata using rcall: A Tutorial By Ebad F. Haghish
  7. Housing-Price Prediction in Colombia using Machine Learning By Otero Gomez, Daniel; MANRIQUE, MIGUEL ANGEL CORREA; Sierra, Omar Becerra; Laniado, Henry; Mateus C, Rafael; Millan, David Andres Romero
  8. Solid Domestic Waste classification using Image Processing and Machine Learning By Otero Gomez, Daniel; Toro, Mauricio
  9. Various Course Proposals for: Mathematics with a View Towards (the Theoretical Underpinnings of) Machine Learning By Marc S. Paolella
  10. Investigating the determinants of successful budgeting with SVM and Binary models By Hariharan, Naveen Kunnathuvalappil
  11. Nowcasting the Growth Rates of the Value Volumes of Exports and Imports by Commodity Group By Maiorova, Ksenia; Fokin, Nikita
  12. WaveCorr: Correlation-savvy Deep Reinforcement Learning for Portfolio Management By Saeed Marzban; Erick Delage; Jonathan Yumeng Li; Jeremie Desgagne-Bouchard; Carl Dussault
  13. Surveillance capitalism – a new techno-economic paradigm? By Falch, Morten
  14. Study on the Requirements for AI Development and Operation Ethics Centered on Children By Saito, Nagayuki
  15. Risk Measurement, Risk Entropy, and Autonomous Driving Risk Modeling By Jiamin Yu
  16. Using solar panels for business purposes: Evidence based on high-frequency power usage data By Weisser, Christoph; Lenel, Friederike; Lu, Yao; Kis-Katos, Krisztina; Kneib, Thomas
  17. Rich Cities, Poor Countryside? Social Structure of the Poor and Poverty Risks in Urban and Rural Places in an Affluent Country. An Administrative Data based Analysis using Random Forest By Oliver Hümbelin; Lukas Hobi; Robert Fluder
  18. A Sentiment Analysis Model of a Civil Service Performance Evaluation Using a Feminist Framework By Cao, Shurui
  19. Competition and Mergers with Strategic Data Intermediaries By David Bounie; Antoine Dubus; Patrick Waelbroeck

  1. By: Steve J. Bickley; Ho Fai Chan; Benno Torgler
    Abstract: The history of AI in economics is long and winding, much the same as the evolving field of AI itself. Economists have engaged with AI since its beginnings, albeit in varying degrees and with changing focus across time and places. In this study, we have explored the diffusion of AI and different AI methods (e.g., machine learning, deep learning, neural networks, expert systems, knowledge-based systems) through and within economic subfields, taking a scientometrics approach. In particular, we centre our accompanying discussion of AI in economics around the problems of economic calculation and social planning as proposed by Hayek. To map the history of AI within and between economic subfields, we construct two datasets containing bibliometrics information of economics papers based on search query results from the Scopus database and the EconPapers (and IDEAs/RePEc) repository. We present descriptive results that map the use and discussion of AI in economics over time, place, and subfield. In doing so, we also characterise the authors and affiliations of those engaging with AI in economics. Additionally, we find positive correlations between quality of institutional affiliation and engagement with or focus on AI in economics and negative correlations between the Human Development Index and share of learning-based AI papers.
    Keywords: Artificial Intelligence; Machine Learning; Economics; Scientometrics; Science of Science; Bibliometrics
    JEL: B40 N01 A14
    Date: 2021–09
  2. By: Hélène Dernis (OECD); Laurent Moussiegt (OECD); Daisuke Nawa (OECD); Mariagrazia Squicciarini (OECD)
    Abstract: This study proposes an exploratory analysis of the characteristics of Artificial Intelligence (AI) “actors”. It focuses on entities that deploy AI-related technologies or introduce AI-related goods and services on large international markets. It builds on the OECD Science, Technology and Innovation Micro-data Lab infrastructure, and, in particular, on Intellectual Property (IP) rights data (patents and trademarks) combined with company-level data. Statistics on AI-related patents and trademarks show that AI-related activities are strongly concentrated in some countries, sectors, and actors. Development of AI technologies and/or goods and services is mainly due to start-ups or large incumbents, located in the United States, Japan, Korea, or the People’s Republic of China, and, to a lesser extent, in Europe. A majority of these actors operate in ICT-related sectors. The composition of the IP portfolio of the AI actors indicates that AI is frequently combined with a variety of sector-specific technologies, goods, or services.
    Date: 2021–09–22
  3. By: Lea Samek (OECD); Mariagrazia Squicciarini (OECD); Emile Cammeraat (OECD)
    Abstract: Building on recent OECD work, this paper analyses the skills sets (“skills bundles”) demanded in artificial intelligence (AI)-related online job postings. The analysis uses Burning Glass Technologies’ data for the United States and the United Kingdom and finds that skills related to the open source programming software Python and to machine learning represent “must-haves” for working with AI. Employers additionally value specialised skills related to robotics, AI development and applying AI. A comparison of the periods 2013-15 and 2017-19 shows that the latter two have become more interrelated over time, with “neural network” skills connecting both groups. Network analysis relating AI skills to general skills highlights the growing role of socio-emotional skills; and of skill bundles related to programming, management of big data and data analysis. Key results hold for both countries and time periods, though differences emerge across occupations and industries.
    Keywords: AI, Online jobs, Skill bundles, Skills
    Date: 2021–09–22
  4. By: Nathan Ratledge; Gabe Cadamuro; Brandon de la Cuesta; Matthieu Stigler; Marshall Burke
    Abstract: In many regions of the world, sparse data on key economic outcomes inhibits the development, targeting, and evaluation of public policy. We demonstrate how advancements in satellite imagery and machine learning can help ameliorate these data and inference challenges. In the context of an expansion of the electrical grid across Uganda, we show how a combination of satellite imagery and computer vision can be used to develop local-level livelihood measurements appropriate for inferring the causal impact of electricity access on livelihoods. We then show how ML-based inference techniques deliver more reliable estimates of the causal impact of electrification than traditional alternatives when applied to these data. We estimate that grid access improves village-level asset wealth in rural Uganda by 0.17 standard deviations, more than doubling the growth rate over our study period relative to untreated areas. Our results provide country-scale evidence on the impact of a key infrastructure investment, and provide a low-cost, generalizable approach to future policy evaluation in data sparse environments.
    Date: 2021–09
  5. By: J.Ignacio Conde-Ruiz (Universidad Complutense de Madrid and ICAE (Spain).); Juan-José Ganuza (Universitat Pompeu Fabra and Barcelona GSE.); Manu García (Washington University in St. Louis and ICAE.); Luis A. Puch (Universidad Complutense de Madrid and ICAE (Spain).)
    Abstract: We analyze all the articles published in the top five (T5) Economics journals between 2002 and 2019 in order to find gender differences in their research approach. We implement an unsupervised machine learning algorithm, the Structural Topic Model (STM), so as to incorporate gender document-level meta-data into a probabilistic text model. This algorithm jointly characterizes the set of latent topics that best fits our data (the set of abstracts) and how the documents/abstracts are allocated to each latent topic. Latent topics are mixtures over words where each word has a probability of belonging to a topic after controlling for journal name and publication year (the meta-data). Thus, the topics may capture research fields but also other more subtle characteristics related to the way in which the articles are written. Using only data-driven methods, we find that females are unevenly distributed along the estimated latent topics. This finding relies only on data generated automatically from the contents of the abstracts of the articles in the T5 journals, without any arbitrary allocation of texts to particular categories (such as JEL codes or research areas).
    Keywords: Machine Learning; Gender Gaps; Structural Topic Model; Gendered Language; Research Fields.
    JEL: I20 J16 Z13
    Date: 2021–06
  6. By: Ebad F. Haghish (Department of Psychology, University of Oslo, Norway)
    Abstract: rcall is a Stata package that integrates R and R packages in Stata and supports seamless two-way data communication between R and Stata. The package offers two modes of data communication: 1) interactive and 2) non-interactive. In the first part of the presentation, I introduce the latest updates of the package (version 3.0) and show how to use it in practice for data analysis (interactive mode). The second part of the presentation concerns developing Stata packages with rcall (non-interactive mode) and how to defensively embed R and R packages within Stata programs. All examples in the presentation, whether for data analysis or package development, are based on embedding R machine learning algorithms in Stata and using them in practice.
    Date: 2021–09–12
  7. By: Otero Gomez, Daniel; MANRIQUE, MIGUEL ANGEL CORREA; Sierra, Omar Becerra; Laniado, Henry; Mateus C, Rafael; Millan, David Andres Romero
    Abstract: It is common practice to price a house without proper evaluation studies being performed for assurance. The purpose of this study is therefore to provide an explanatory model, establishing parameters for accuracy in the interpretation and projection of housing prices. In addition, it is intended to establish proper data preprocessing practices in order to increase the accuracy of machine learning algorithms. Indeed, according to our literature review, there are few articles and reports on the use of machine learning tools for the prediction of property prices in Colombia. The dataset on which the research is built was provided by an existing real estate company. It contains nearly 940,000 items (housing advertisements) posted on the platform from 2018 to 2020. The database was enriched using statistical imputation techniques. Housing price prediction was performed using Decision Tree Regressor and LightGBM methods, thus providing better alternatives for house price prediction in Colombia. Moreover, to measure the accuracy of the proposed models, the Root Mean Squared Logarithmic Error (RMSLE) statistical indicator was used. The best cross-validation results obtained were 0.25354±0.00699 for the LightGBM, 0.25296±0.00511 for the Bagging Regressor, and 0.25312±0.00559 for the ExtraTree Regressor with Bagging Regressor; no statistically significant difference was found between their performances.
    Date: 2020–09–02
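As a side note for readers, the RMSLE criterion reported above is straightforward to compute; the following minimal Python sketch uses invented prices, not the paper's data:

```python
import math

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error over paired actuals and predictions."""
    assert len(y_true) == len(y_pred)
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred))
        / len(y_true)
    )

# Hypothetical house prices and model predictions (illustrative only).
actual    = [250, 480, 310, 720]
predicted = [260, 450, 300, 800]
score = rmsle(actual, predicted)
```

Because the error is taken on log1p-transformed values, RMSLE penalises relative rather than absolute deviations, which suits prices that span orders of magnitude.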
  8. By: Otero Gomez, Daniel; Toro, Mauricio
    Abstract: This research concentrates on a bounded version of the waste image classification problem. It focuses on determining the more useful of two kinds of feature vectors, one constructed from pixel values and the second constructed from a Bag of Features (BoF). Several image processing techniques, such as object centering, pixel value rescaling, and edge filtering, are applied. Logistic Regression, K-Nearest Neighbors, and Support Vector Machines are used as classification algorithms. Experiments demonstrate that object centering significantly improves the models’ performance when working with pixel values. Moreover, it is determined that by generating sufficiently simple data relations the BoF approach achieves superior overall results. The Support Vector Machine achieved a 0.9 AUC score and a 0.84 accuracy score.
    Date: 2021–06–07
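For readers unfamiliar with the Bag of Features representation used here: a BoF vector is a normalised histogram of local descriptors assigned to their nearest "visual word" in a codebook. A toy Python sketch, with codebook and descriptors invented for illustration:

```python
def nearest(codebook, desc):
    """Index of the codebook 'visual word' closest to a descriptor (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], desc)))

def bag_of_features(codebook, descriptors):
    """Normalised histogram of visual-word assignments: the BoF vector."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest(codebook, d)] += 1
    total = sum(counts)
    return [c / total for c in counts]

codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]              # 3 visual words
descriptors = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1), (0.0, 0.8)]
vec = bag_of_features(codebook, descriptors)
```

The resulting fixed-length vector can then be fed to any of the classifiers named in the abstract, regardless of how many local descriptors each image produced.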
  9. By: Marc S. Paolella (University of Zurich - Department of Banking and Finance; Swiss Finance Institute)
    Abstract: In light of the growing use of, acceptance of, and demand for machine learning in many fields, notably data science but also others such as finance, in both industry and academia, some university departments might wish, or find themselves forced, to bend to the winds of change and address this pressing issue. The goal of this document is to assist in designing relevant courses using material at the appropriate mathematical level. It catalogues, sorts, evaluates, and contrasts numerous viable books for a variety of possible courses. The subjects span several levels of, and different avenues in, linear algebra and real analysis, with briefer discussions of material in probability theory and mathematical finance.
    Date: 2021–09
  10. By: Hariharan, Naveen Kunnathuvalappil
    Abstract: Learning the determinants of successful project budgeting is crucial. This research attempts to empirically identify the determinants of a successful budget. To this end, three supervised machine learning algorithms for classification were applied: Support Vector Machine (SVM), Logistic regression, and Probit regression, with data from 470 projects. Five features were selected: coordination, participation, budget control, communication, and motivation. The SVM analysis showed that SVM can predict successful and failed budgets with fairly good accuracy. The results from the Logistic and Probit regressions showed that if managers focus properly on coordination, participation, budget control, and communication, the probability of project-budget success increases.
    Date: 2021–09–08
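Of the three classifiers compared, logistic regression is the simplest to sketch from scratch. The following toy Python example, using invented data rather than the paper's 470 projects and only two of the five features, fits a logistic model by gradient descent:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Stochastic gradient descent on the log-loss; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi                       # gradient of log-loss w.r.t. logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Toy project data: features are (coordination, communication), 1 = present.
X = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (0, 0)]
y = [1, 1, 1, 0, 1, 0]                         # 1 = successful budget
w, b = fit_logistic(X, y)
predict = lambda xi: sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
```

On this separable toy data the fitted model assigns a high success probability when both features are present and a low one when both are absent.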
  11. By: Maiorova, Ksenia; Fokin, Nikita
    Abstract: In this paper we consider a set of machine learning and econometric models, namely Elastic Net, Random Forest, XGBoost, and SSVS, as applied to nowcasting a large dataset of USD volumes of Russian exports and imports by commodity group. We use lags of the volumes of export and import commodity groups, prices for some imported and exported goods, and other variables, due to which the curse of dimensionality becomes quite acute. The models we use are very popular and have proven themselves well in forecasting under the curse of dimensionality, when the number of model parameters exceeds the number of observations. The best model is a weighted combination of the machine learning methods, which outperforms the ARIMA benchmark in nowcasting the volumes of both exports and imports. For the largest commodity groups, we often obtain significantly more accurate nowcasts than the ARIMA model, according to the Diebold-Mariano test. In addition, the nowcasts turn out to be quite close to the historical forecasts of the Bank of Russia, which were constructed under similar conditions.
    Keywords: nowcasting; foreign trade; curse of dimensionality; machine learning; Russian economy
    JEL: C52 C53 C55 F17
    Date: 2020–06
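The Diebold-Mariano test mentioned above checks whether two forecasts' average loss differs significantly. A simple one-step-ahead version with squared-error loss, on invented forecast errors rather than the paper's data:

```python
import math

def diebold_mariano(e1, e2):
    """One-step-ahead DM statistic under squared-error loss.
    e1, e2: forecast errors of two competing models on the same series.
    |DM| > 1.96 rejects equal accuracy at roughly the 5% level."""
    d = [a * a - b * b for a, b in zip(e1, e2)]   # loss differential series
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / n
    return mean_d / math.sqrt(var_d / n)

# Toy errors: the second model is systematically more accurate.
e_arima = [1.2, -0.8, 1.5, -1.1, 0.9, -1.4, 1.3, -1.0]
e_ml    = [0.3, -0.2, 0.4, -0.1, 0.2, -0.3, 0.1, -0.2]
dm = diebold_mariano(e_arima, e_ml)
```

For multi-step forecasts the variance in the denominator would need an autocovariance correction; this sketch covers only the simplest one-step case.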
  12. By: Saeed Marzban; Erick Delage; Jonathan Yumeng Li; Jeremie Desgagne-Bouchard; Carl Dussault
    Abstract: The problem of portfolio management represents an important and challenging class of dynamic decision making problems, where rebalancing decisions need to be made over time with the consideration of many factors such as investors’ preferences, trading environments, and market conditions. In this paper, we present a new portfolio policy network architecture for deep reinforcement learning (DRL) that can exploit cross-asset dependency information more effectively and achieve better performance than state-of-the-art architectures. In particular, we introduce a new property, referred to as “asset permutation invariance”, for portfolio policy networks that exploit multi-asset time series data, and design the first portfolio policy network, named WaveCorr, that preserves this invariance property when treating asset correlation information. At the core of our design is an innovative permutation invariant correlation processing layer. An extensive set of experiments is conducted using data from both the Canadian (TSX) and American (S&P 500) stock markets, and WaveCorr consistently outperforms other architectures with an impressive 3%-25% absolute improvement in terms of average annual return, and up to more than 200% relative improvement in average Sharpe ratio. We also measured an improvement of a factor of up to 5 in the stability of performance under random choices of initial asset ordering and weights. The stability of the network has been found particularly valuable by our industrial partner.
    Date: 2021–09
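The "asset permutation invariance" property named in the abstract can be illustrated without any network: a summary of cross-asset correlation is invariant if reordering the assets leaves it unchanged. A toy Python check on made-up returns (illustrative only; the paper's layer is a learned network component, not this scalar summary):

```python
import math
from itertools import permutations

def corr(x, y):
    """Pearson correlation of two return series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_pairwise_corr(returns):
    """Average correlation over all asset pairs: a permutation-invariant
    summary of cross-asset dependency."""
    k = len(returns)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(corr(returns[i], returns[j]) for i, j in pairs) / len(pairs)

returns = [
    [0.010, -0.020, 0.015, 0.004],   # asset A (invented)
    [0.012, -0.018, 0.010, 0.006],   # asset B (invented)
    [-0.005, 0.010, -0.020, 0.003],  # asset C (invented)
]
base = mean_pairwise_corr(returns)
# Reordering the assets must not change the summary.
for p in permutations(range(3)):
    assert abs(mean_pairwise_corr([returns[i] for i in p]) - base) < 1e-12
```

A policy network with this property produces the correspondingly reordered portfolio when its input assets are reordered, which is what stabilises performance under random initial asset orderings.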
  13. By: Falch, Morten
    Abstract: This paper looks at surveillance capitalism as described in Shoshana Zuboff's book and discusses whether surveillance capitalism represents a new stage of capitalist development, using the theory of techno-economic paradigms as a theoretical framework.
    Keywords: Surveillance Capitalism, Techno-economic paradigm, artificial intelligence, big data
    Date: 2021
  14. By: Saito, Nagayuki
    Abstract: In recent years, the use of artificial intelligence (AI) has become widespread in society, and the need for ethical considerations regarding its development and operation has been discussed. However, although those considerations are equally important for adults and children, international discussions on ethical guidelines addressing the impact of AI on children have not progressed. This study examined the requirements for child-centered AI ethics guidelines based on the United Nations Convention on the Rights of the Child. It became clear that it is necessary to develop AI in accordance with the child's developmental stage, to secure a suitable usage environment, and to educate government and industry about the distinct characteristics of children.
    Keywords: Children, Artificial Intelligence (AI), Ethics Guidelines, UNCRC, Rights and Protection
    Date: 2021
  15. By: Jiamin Yu
    Abstract: Big data from autonomous vehicles has long been used for the perception, prediction, planning, and control of driving. Naturally, it is increasingly asked why this big data is not also used for risk management and actuarial modeling. This article examines the emerging technical difficulties, new ideas, and methods of risk modeling under autonomous driving scenarios. Compared with the traditional risk model, the novel model is more consistent with real road traffic and driving safety performance. More importantly, it provides technical feasibility for realizing risk assessment and car insurance pricing in a computer simulation environment.
    Date: 2021–09
  16. By: Weisser, Christoph; Lenel, Friederike; Lu, Yao; Kis-Katos, Krisztina; Kneib, Thomas
    Abstract: Access to electricity is typically the main benefit associated with solar panels, but in economically less developed countries, where access to electricity is still very limited, solar panel systems can also serve as means to generate additional income and to diversify income sources. We analyze high-frequency electricity usage and repayment data of around 70,000 households in Tanzania that purchased a solar panel system on credit, in order to (1) determine the extent to which solar panel systems are used for income generation, and (2) explore the link between the usage of the solar system for business purposes and the repayment of the customer credit that finances its purchase. Based on individual patterns of energy consumption within each day, we use XGBoost as a supervised machine learning model combined with labels from a customer survey on business usage to generate out-of-sample predictions of the daily likelihood that customers operate a business. We find a low average predicted business probability; yet there is considerable variation across households and over time. While the majority of households are predicted to use their system primarily for private consumption, our findings suggest that a substantial proportion uses it for income generation purposes occasionally. Our subsequent statistical analysis regresses the occurrence of individual credit delinquency within each month on the monthly average predicted probability of business-like electricity usage, relying on a time-dependent proportional hazards model. Our results show that customers with more business-like electricity usage patterns are significantly less likely to face repayment difficulties, suggesting that using the system to generate additional income can help to alleviate cash constraints and prevent default.
    Keywords: Rural electrification, Off-grid energy, High-frequency electricity usage data, Solar panels, Tanzania, Risk management, Credit default, Big Data, Supervised machine learning, Time-dependent proportional hazards model, XGBoost
    Date: 2021
  17. By: Oliver Hümbelin; Lukas Hobi; Robert Fluder
    Abstract: In many countries, it is difficult to study subnational poverty patterns, as official statistics often rely on surveys with limited ability to disaggregate regionally. This is a drawback because the social and economic structure varies within countries, which has a significant impact on who lives below the poverty line. To address poverty, it is therefore important to further understand urban/rural differences. In this context, administrative-data-based approaches offer new opportunities. This paper contributes to the field of territorial poverty studies by using linked tax data to examine poverty in a large political district in Switzerland with one million inhabitants and large rural and urban parts. We measure poverty using income and financial reserves (asset-based poverty) and examine poverty in urban and rural areas, which allows us to compare the social structure of the poor in detail. We then use random forest based variable importance analysis to see whether the importance of poverty risk factors differs between urban and rural parts. We show that poor people in rural areas are more likely to be of retirement age than those in urban parts. Among the workforce, the share of poor is higher for those who work in agriculture than for those working in industry or the service sector. In urban areas, the poor are more often freelancers and people of foreign origin. Regardless of where they live, people with little or no education, single parents, and people working in gastronomy/tourism are disproportionately often poor. With respect to risk factors, we find that the general opportunity structure, such as the density of workplaces or aggravated access in mountain areas, seems to be of minor importance compared with risk factors that relate to the immediate social situation. Low attachment to the labor market is by far the most important characteristic predicting poverty at the household level. However, the sector of occupation is of great importance too. Since the possibility of engaging in a specific occupation is linked to the regional opportunity structure, this result supports the argument that territorial opportunities matter. The importance of the sector of occupation is especially dominant in predicting poor households in rural parts.
    Keywords: poverty, poverty risk factors, regional difference, admin-data, random forest
    JEL: I32
    Date: 2021–09–07
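Variable importance of the kind computed here from a random forest is often estimated by permutation: shuffle one feature and measure the drop in accuracy. A toy Python sketch with a hand-written rule standing in for the forest (all data invented):

```python
import random

def accuracy(predict, X, y):
    return sum(predict(xi) == yi for xi, yi in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, feature, trials=50, seed=0):
    """Mean accuracy drop when `feature` is shuffled across observations."""
    rng = random.Random(seed)
    base = accuracy(predict, X, y)
    drops = []
    for _ in range(trials):
        col = [xi[feature] for xi in X]
        rng.shuffle(col)
        Xp = [tuple(col[i] if j == feature else v for j, v in enumerate(xi))
              for i, xi in enumerate(X)]
        drops.append(base - accuracy(predict, Xp, y))
    return sum(drops) / trials

# Features: (labor-market attachment, household size); only the first matters.
X = [(0, 2), (1, 3), (0, 5), (1, 1), (0, 4), (1, 2), (0, 1), (1, 5)]
y = [1, 0, 1, 0, 1, 0, 1, 0]            # 1 = poor
predict = lambda xi: 1 - xi[0]           # poor iff low labor-market attachment
imp_attachment = permutation_importance(predict, X, y, 0)
imp_size = permutation_importance(predict, X, y, 1)
```

Shuffling the irrelevant feature leaves accuracy untouched, so its importance is zero, while shuffling the decisive feature produces a large drop, mirroring the paper's finding that labor-market attachment dominates.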
  18. By: Cao, Shurui
    Abstract: The aim of this article is to explore using sentiment analysis to assess the openness of governments and establish a dynamic evaluation mechanism to supervise governments, and consequently improve the administrative law on a feminist basis. I build a sentiment analysis model based on communication between me and governments, discuss the implication of the model, and propose potential improvements to the administrative law in China.
    Date: 2021–09–16
  19. By: David Bounie (IP Paris - Institut Polytechnique de Paris); Antoine Dubus (ETH Zürich - Eidgenössische Technische Hochschule - Swiss Federal Institute of Technology [Zürich]); Patrick Waelbroeck (Télécom Paris)
    Abstract: We analyze competition between data intermediaries collecting information on consumers, which they sell to firms for price discrimination purposes. We show that competition between data intermediaries benefits consumers by increasing competition between firms, and by reducing the amount of consumer data collected. We argue that merger policy guidelines should investigate the effect of the data strategies of large intermediaries on competition and consumer surplus in related markets.
    Date: 2021–09–07

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at . For comments, please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject line, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.