nep-big New Economics Papers
on Big Data
Issue of 2018‒08‒13
twelve papers chosen by
Tom Coupé
University of Canterbury

  1. Machine Learning Macroeconometrics: A Primer By Dimitris Korobilis
  2. Some Facts of High-Tech Patenting By Michael Webb; Nick Short; Nicholas Bloom; Josh Lerner
  3. Thresholded ConvNet Ensembles: Neural Networks for Technical Forecasting By Sid Ghoshal; Stephen J. Roberts
  4. AI: Intelligent machines, smart policies: Conference summary By OECD
  5. Robustness Analysis of a Website Categorization Procedure based on Machine Learning By Renato Bruni; Gianpiero Bianchi
  6. A Machine Learning Approach to the Forecast Combination Puzzle By Antoine Mandel; Amir Sani
  7. Falling Through the Net: The Digital Divide in Western Australia By Steven Bond-Smith; Alan S Duncan; Daniel Kiely; Silvia Salazar
  8. Detecting Urban Markets with Satellite Imagery: An Application to India By Kathryn Baragwanath Vogel; Ran Goldblatt; Gordon H. Hanson; Amit K. Khandelwal
  9. Financial Inclusion and Contract Terms: Experimental Evidence From Mexico By Sara G. Castellanos; Diego Jiménez Hernández; Aprajit Mahajan; Enrique Seira
  10. A New Approach to Nowcasting Indian Gross Value Added By Soumya Bhadury; Sanjib Pohit; Robert C. M. Beyer
  11. Classifying Patents Based on their Semantic Content By Antonin Bergeaud; Yoann Potiron; Juste Raimbault
  12. Julia Cagé, Nicolas Hervé, Marie-Luce Viaud, L’information à tout prix, Ina Editions, 2017. By Antoine Machut

  1. By: Dimitris Korobilis (University of Essex, UK; Rimini Centre for Economic Analysis)
    Abstract: This chapter reviews econometric methods for dealing with the challenges of inference in high-dimensional empirical macro models with possibly “more parameters than observations”. These methods broadly include machine learning algorithms for Big Data, but also more traditional estimation algorithms for data with a short span of observations relative to the number of explanatory variables. While building mainly on a univariate linear regression setting, I show how machine learning ideas can be generalized to classes of models that are interesting to applied macroeconomists, such as time-varying parameter models and vector autoregressions.
    Date: 2018–07
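The "more parameters than observations" setting the primer surveys can be illustrated with a minimal sketch: when p > n, ordinary least squares is ill-posed because X'X is singular, while a ridge penalty restores a unique solution. The data and penalty below are invented for illustration, not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # fewer observations than predictors
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]   # only a few predictors actually matter
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# OLS (lam = 0) would fail here: with p > n, X'X has rank at most n < p.
beta_hat = ridge(X, y, lam=1.0)
```

The same shrinkage idea underlies many of the Big Data estimators the chapter covers; only the form of the penalty changes.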
  2. By: Michael Webb; Nick Short; Nicholas Bloom; Josh Lerner
    Abstract: Patenting in software, cloud computing, and artificial intelligence has grown rapidly in recent years. Such patents are acquired primarily by large US technology firms such as IBM, Microsoft, Google, and HP, as well as by Japanese multinationals such as Sony, Canon, and Fujitsu. Chinese patenting in the US is small but growing rapidly, and world-leading for drone technology. Patenting in machine learning has seen exponential growth since 2010, although patenting in neural networks saw a strong burst of activity in the 1990s that has only recently been surpassed. In all technological fields, the number of patents per inventor has declined near-monotonically, except for large increases in inventor productivity in software and semiconductors in the late 1990s. In most high-tech fields, Japan is the only country outside the US with significant US patenting activity; however, whereas Japan played an important role in the burst of neural network patenting in the 1990s, it has not been involved in the current acceleration. Comparing the periods 1970-89 and 2000-15, patenting in the current period has been primarily by entrant assignees, with the exception of neural networks.
    JEL: L86 O34
    Date: 2018–07
  3. By: Sid Ghoshal; Stephen J. Roberts
    Abstract: Much of modern practice in financial forecasting relies on technicals, an umbrella term for several heuristics applying visual pattern recognition to price charts. Despite its ubiquity in financial media, the reliability of its signals remains a contentious and highly subjective form of 'domain knowledge'. We investigate the predictive value of patterns in financial time series, applying machine learning and signal processing techniques to 22 years of US equity data. By reframing technical analysis as a poorly specified, arbitrarily preset feature-extractive layer in a deep neural network, we show that better convolutional filters can be learned directly from the data, and provide visual representations of the features being identified. We find that an ensemble of shallow, thresholded CNNs optimised over different resolutions achieves state-of-the-art performance on this domain, outperforming technical methods while retaining some of their interpretability.
    Date: 2018–07
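The paper's reframing — a chart "pattern" is just a fixed convolutional filter whose thresholded response becomes a signal — can be sketched in a toy form. The V-shaped kernel, prices, and threshold below are invented for illustration; they are not the filters the authors learn from data.

```python
import numpy as np

def pattern_response(prices, kernel):
    """Normalised cross-correlation of a pattern kernel with each price window."""
    k = (kernel - kernel.mean()) / kernel.std()
    out = []
    for i in range(len(prices) - len(k) + 1):
        w = prices[i:i + len(k)]
        w = (w - w.mean()) / (w.std() + 1e-12)   # scale-invariant comparison
        out.append(float(np.dot(w, k)) / len(k)) # 1.0 = perfect pattern match
    return np.array(out)

# A crude hand-set "V-shaped reversal" kernel (hypothetical technical pattern).
kernel = np.array([3, 2, 1, 0, 1, 2, 3], dtype=float)
prices = np.array([5, 4, 3, 2, 3, 4, 5, 6, 7, 8], dtype=float)

resp = pattern_response(prices, kernel)
signal = resp > 0.9   # thresholded activation, as in a thresholded CNN
```

The first window reproduces the V shape exactly (response near 1.0), while the final monotone run does not; learning the kernel instead of presetting it is the step the paper takes.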
  4. By: OECD
    Abstract: This report reflects discussions at the OECD conference “AI: Intelligent Machines, Smart Policies” held in Paris on 26-27 October, 2017. After discussing the state of Artificial intelligence (AI) research – in particular ‘machine learning’ – speakers illustrated the opportunities that AI provides to improve economies and societies, in areas ranging from scientific discovery and satellite data analysis to music creation. There was broad agreement that the rapid development of AI calls for national and international policy frameworks that engage all stakeholders. Discussions focused on the need for policy to facilitate the adoption of AI systems to promote innovation and growth, help address global challenges, and boost jobs and skills development, while at the same time establishing appropriate safeguards to ensure that AI systems are human-centric and benefit people broadly. Transparency and oversight, algorithmic discrimination and privacy abuses were key concerns, as were new liability, responsibility, security and safety questions.
    Date: 2018–08–02
  5. By: Renato Bruni (Department of Computer, Control and Management Engineering Antonio Ruberti (DIAG), University of Rome La Sapienza, Rome, Italy); Gianpiero Bianchi (Direzione centrale per la metodologia e disegno dei processi statistici (DCME),Italian National Institute of Statistics Istat, Rome, Italy)
    Abstract: Website categorization has recently emerged as a very important task in several contexts. A huge amount of information is freely available through websites, and it could be used to accomplish statistical surveys, saving the cost of the surveys, or to validate already surveyed data. However, the information of interest for the specific categorization has to be mined from that huge amount, which turns out to be a difficult task in practice. This work describes techniques that can be used to convert website categorization into a supervised classification problem. To do so, each data record should summarize the content of an entire website. We generate this kind of record by using web scraping and optical character recognition, followed by a number of automated feature engineering steps. When such records have been produced, we apply state-of-the-art classification techniques to categorize the websites according to the aspect of interest. We use Support Vector Machines, Random Forest and Logistic classifiers. Since in many applied settings the labels available for the training set may be noisy, we analyze the robustness of our procedure with respect to the presence of misclassified training records. We present results on real-world data for the problem of detecting websites providing e-commerce facilities.
    Keywords: Classification ; Machine Learning ; Feature Engineering ; Text
    Date: 2018
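The robustness question the paper studies — how much does accuracy degrade when a fraction of training labels are flipped? — can be sketched with a toy experiment. The synthetic data and the nearest-centroid classifier below are invented stand-ins for the paper's scraped website records and SVM/Random Forest models.

```python
import random

rng = random.Random(0)

def make_data(n, label, centre):
    """n two-feature points drawn around a class-specific centre."""
    return [((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label)
            for _ in range(n)]

def centroid(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def train(data):
    return (centroid([x for x, y in data if y == 0]),
            centroid([x for x, y in data if y == 1]))

def predict(model, x):
    d = [(x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2 for c in model]
    return 0 if d[0] <= d[1] else 1

def accuracy(model, data):
    return sum(predict(model, x) == y for x, y in data) / len(data)

def flip_labels(data, frac):
    """Simulate noisy training labels by flipping a random fraction."""
    return [(x, 1 - y) if rng.random() < frac else (x, y) for x, y in data]

train_set = make_data(200, 0, 0.0) + make_data(200, 1, 3.0)
test_set = make_data(100, 0, 0.0) + make_data(100, 1, 3.0)

clean_acc = accuracy(train(train_set), test_set)
noisy_acc = accuracy(train(flip_labels(train_set, 0.3)), test_set)
```

Running the same comparison over a grid of flip rates, as the paper does with real misclassified records, traces out the robustness curve of the procedure.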
  6. By: Antoine Mandel (CES - Centre d'économie de la Sorbonne - UP1 - Université Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique); Amir Sani (CFM-Imperial Institute of Quantitative Finance - Imperial College London, CES - Centre d'économie de la Sorbonne - UP1 - Université Panthéon-Sorbonne - CNRS - Centre National de la Recherche Scientifique)
    Abstract: Forecast combination algorithms provide a robust solution to noisy data and shifting process dynamics. However in practice, sophisticated combination methods often fail to consistently outperform the simple mean combination. This "forecast combination puzzle" limits the adoption of alternative combination approaches and forecasting algorithms by policy-makers. Through an adaptive machine learning algorithm designed for streaming data, this paper proposes a novel time-varying forecast combination approach that retains distribution-free guarantees in performance while automatically adapting combinations according to the performance of any selected combination approach or forecaster. In particular, the proposed algorithm offers policy-makers the ability to compute the worst-case loss with respect to the mean combination ex-ante, while also guaranteeing that the combination performance is never worse than this explicit guarantee. Theoretical bounds are reported with respect to the relative mean squared forecast error. Out-of-sample empirical performance is evaluated on the Stock and Watson seven-country dataset and the ECB Survey of Professional Forecasters.
    Keywords: Forecasting,Forecast Combination Puzzle,Forecast combinations,Machine Learning,Econometrics,Apprentissage statistique,Combinaison de prédicteurs,Econométrie
    Date: 2017–04–19
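The flavour of an adaptive, distribution-free combiner for streaming forecasts can be sketched with standard exponentially weighted averaging, where each forecaster's weight decays with its cumulative squared loss. This is a generic online-learning combiner for illustration, not the authors' exact algorithm, and the toy forecasters are invented.

```python
import math

def combine_stream(forecasts, outcomes, eta=0.5):
    """Exponentially weighted forecast combination over a stream.

    forecasts: list of rounds, each a list with one forecast per expert.
    Returns the combined forecast issued at each round (before seeing y_t).
    """
    k = len(forecasts[0])
    log_w = [0.0] * k                      # log-weights, start uniform
    combined = []
    for f_t, y_t in zip(forecasts, outcomes):
        m = max(log_w)                     # stabilise the exponentials
        w = [math.exp(lw - m) for lw in log_w]
        s = sum(w)
        combined.append(sum(wi * fi for wi, fi in zip(w, f_t)) / s)
        # Update: penalise each expert by its squared error this round.
        log_w = [lw - eta * (fi - y_t) ** 2 for lw, fi in zip(log_w, f_t)]
    return combined

# Two toy forecasters: expert 0 nearly unbiased, expert 1 badly biased.
outcomes = [1.0, 1.2, 0.9, 1.1, 1.0, 1.05, 0.95, 1.0]
forecasts = [[y + 0.05, y + 1.0] for y in outcomes]

comb = combine_stream(forecasts, outcomes)
loss = sum((c - y) ** 2 for c, y in zip(comb, outcomes))
mean_loss = sum((sum(f) / 2 - y) ** 2 for f, y in zip(forecasts, outcomes))
```

Here the adaptive weights quickly concentrate on the accurate forecaster, so the combined loss falls below that of the simple mean combination — the behaviour the puzzle says is hard to obtain consistently on real data.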
  7. By: Steven Bond-Smith (Bankwest Curtin Economics Centre (BCEC), Curtin University); Alan S Duncan (Bankwest Curtin Economics Centre (BCEC), Curtin University); Daniel Kiely (Bankwest Curtin Economics Centre, Curtin Business School); Silvia Salazar (Bankwest Curtin Economics Centre, Curtin University)
    Abstract: New data technologies, big data analytics and intelligent software systems are transforming the way we produce, consume or distribute commodities, and increasingly, the way we access services. They are also changing the way in which we engage with our personal, social and business networks and communities. This BCEC Focus on WA report shows there are clear divides between the haves and have nots, across various measures of access, ability and affordability. At the household and individual level, we find clear differences along geographic, demographic and socio-economic lines, with one in four of the poorest households in Western Australia without access to the internet compared to almost all of the highest income households. Those most at risk of falling through the net in WA, and of becoming increasingly disconnected from society, include: those living in the most remote areas; families at higher levels of socio-economic disadvantage; older population cohorts; and low income families, including children at risk of missing out on the educational benefits of ICT. Analysis of expenditure patterns over time shows that digital technologies are a necessity, particularly for those on lower incomes. The newly devised BCEC digital stress indicator identifies those households, by family composition and housing tenure, which are chiefly at risk. The BCEC Small Business Survey highlights that internet quality and coverage vary significantly between Western Australia’s regions. The report found 26 per cent of small businesses in the South West and Pilbara regions rated the quality of their internet infrastructure as low, compared with 25 per cent in the Wheatbelt and only 11 per cent in Perth.
    Keywords: digital divide, productivity and innovation, digital transformation, digital infrastructure, internet connectivity, Western Australia, WA economy, regional connectedness, small business
    Date: 2018–08
  8. By: Kathryn Baragwanath Vogel; Ran Goldblatt; Gordon H. Hanson; Amit K. Khandelwal
    Abstract: This paper proposes a methodology for defining urban markets based on economic activity detected by satellite imagery. We use nighttime lights data, whose use in economics is increasingly common, to define urban markets based on contiguous pixels that have a minimum threshold of light intensity. The coarseness of the nightlight data and the blooming effect of lights, however, create markets whose boundaries are too expansive and too smooth relative to the visual inspection of actual cities. We compare nightlight-based markets to those formed using high-resolution daytime satellite imagery, whose use in economics is less common, to detect the presence of built-up landcover. We identify an order of magnitude more markets with daytime imagery; these markets are realistically jagged in shape and reveal much more within- and across-market variation in the density of economic activity. The size of landcover-based markets displays a sharp sensitivity to the proximity of paved roads that is not present in the case of nightlight-based markets. Our results suggest that daytime satellite imagery is a promising source of data for economists to study the spatial extent and distribution of economic activity.
    JEL: O1 O18 R1
    Date: 2018–07
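The nightlight step the abstract describes — threshold pixel light intensity, then group contiguous bright pixels into markets — is essentially a connected-components pass over a raster. A minimal sketch on an invented toy grid (real nightlight rasters and thresholds are far larger):

```python
from collections import deque

def detect_markets(grid, threshold):
    """Group contiguous pixels at or above `threshold` into markets (4-connectivity)."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    markets = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] >= threshold and not seen[r][c]:
                # Flood-fill one contiguous bright region.
                q, pixels = deque([(r, c)]), []
                seen[r][c] = True
                while q:
                    i, j = q.popleft()
                    pixels.append((i, j))
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and grid[ni][nj] >= threshold
                                and not seen[ni][nj]):
                            seen[ni][nj] = True
                            q.append((ni, nj))
                markets.append(pixels)
    return markets

lights = [
    [0, 8, 9, 0, 0],
    [0, 7, 0, 0, 6],
    [0, 0, 0, 5, 7],
]
markets = detect_markets(lights, threshold=5)   # two separate bright clusters
```

The blooming problem the paper highlights shows up here as a threshold choice: lowering it merges the two clusters into one over-smooth market, which is what the daytime built-up-landcover comparison is meant to correct.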
  9. By: Sara G. Castellanos; Diego Jiménez Hernández; Aprajit Mahajan; Enrique Seira
    Abstract: This paper provides evidence on the difficulty of expanding access to credit through large institutions. We use detailed observational data and a large-scale countrywide experiment to examine a large bank's experience with a credit card that accounted for approximately 15% of all first-time formal sector borrowing in Mexico in 2010. Borrowers have limited credit histories and high exit-risk – a third of all study cards are defaulted on or canceled during the 26-month sample period. We use a large-scale randomized experiment on a representative sample of the bank's marginal borrowers to test whether contract terms affect default. We find that large experimental changes in interest rates and minimum payments do little to mitigate default risk. We also use detailed data on purchases and payments to construct a measure of bank revenue per card and find it is generally low and difficult to predict (using machine learning methods), perhaps explaining the bank's eventual discontinuation of the product. Finally, we show that borrowers generating a favorable credit history are much more likely to switch banks, providing suggestive evidence of a lending externality. Taken together, these facts highlight the difficulty of increasing financial access using large formal sector financial organizations.
    JEL: D14 D18 D82 G20 G21
    Date: 2018–07
  10. By: Soumya Bhadury; Sanjib Pohit (National Council of Applied Economic Research, New Delhi); Robert C. M. Beyer (The World Bank)
    Abstract: In India, quarterly growth of Gross Value Added (GVA) is published with a large lag, and nowcasting is complicated by data challenges typically faced by emerging market economies, such as large data revisions, publication of data at mixed frequencies, small sample sizes, non-synchronous data releases, and releases with varying lags. This paper presents a new framework to nowcast India’s GVA that incorporates information from mixed data frequencies and other data characteristics. In addition, evening-hour luminosity is included as a crucial high-frequency indicator: changes in nightlight intensity contain information about economic activity, especially in countries with a large informal sector and significant data challenges, including India. The paper illustrates the framework for the ‘trade, hotels, transport, communication and services related to broadcasting’ bloc of Indian GVA.
    Keywords: Nowcasting, India, gross value added, evening-hour luminosity, dynamic factor analysis, EM algorithm.
    JEL: C32 C51 C53
    Date: 2018–06
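The basic mechanics of using a high-frequency indicator such as evening-hour luminosity to nowcast a quarterly series can be sketched with a simple bridge equation: aggregate the monthly indicator to quarterly averages, fit a regression on past quarters, and apply it to the months already observed in the current quarter. All numbers below are synthetic, and this deliberately simplifies away the paper's dynamic factor model and EM estimation.

```python
def ols_1d(x, y):
    """Simple OLS with one regressor; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

# Monthly luminosity for 5 published quarters (3 months each), plus 2 months
# of the current quarter whose GVA is not yet released.
monthly_lum = [10, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18,
               18, 19]
gva_growth = [5.0, 5.6, 6.1, 6.5, 7.0]   # past quarterly GVA growth (toy)

# Bridge step: aggregate the monthly indicator to quarterly averages.
q_lum = [sum(monthly_lum[3 * q:3 * q + 3]) / 3 for q in range(5)]
a, b = ols_1d(q_lum, gva_growth)

# Nowcast from the months observed so far in the current quarter.
current_lum = sum(monthly_lum[15:]) / 2
nowcast = a + b * current_lum
```

Handling ragged-edge data releases, revisions, and multiple indicators at once is what the paper's dynamic-factor/EM framework adds on top of this bridge logic.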
  11. By: Antonin Bergeaud; Yoann Potiron; Juste Raimbault
    Abstract: In this paper, we extend standard classification techniques using a large-scale data-mining and network approach. This approach, designed in particular to be suitable for big data, is used to construct an open consolidated database from raw data on 4 million patents taken from the US patent office from 1976 onward. To build the patent network, we look not only at each patent's title but also at its full abstract, extracting the relevant keywords accordingly. We refer to this classification as the semantic approach, in contrast with the more common technological approach based on the topology induced by US Patent Office technological classes. Moreover, we document that the two approaches yield markedly different topological measures, with strong statistical evidence that they reflect different underlying models. This suggests that our method is a useful tool for extracting endogenous information.
    Keywords: Patents, Semantic Analysis, Network, Modularity, Innovation, USPTO
    JEL: O3 O39
    Date: 2018
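The core of the semantic approach — extract keywords from each patent's abstract and link patents that share them, rather than grouping by official technology class — can be sketched on a toy corpus. The abstracts, stopword list, and overlap rule below are invented simplifications of the authors' keyword-extraction pipeline.

```python
# Minimal stopword list for the toy example only.
STOPWORDS = {"a", "an", "the", "for", "of", "and", "to", "in", "method"}

def keywords(abstract):
    """Crude keyword extraction: lowercase tokens minus stopwords."""
    return {w for w in abstract.lower().split() if w not in STOPWORDS}

patents = {
    "P1": "A method for training a neural network",
    "P2": "Neural network hardware for image recognition",
    "P3": "A chemical process for polymer synthesis",
}

kw = {pid: keywords(text) for pid, text in patents.items()}

# Semantic network: an edge links two patents whose keyword sets overlap.
edges = {(p, q) for p in kw for q in kw if p < q and kw[p] & kw[q]}
```

On the full 4-million-patent corpus, community detection (e.g. modularity maximisation) over this network yields the semantic classes that the paper compares against the official technological classes.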
  12. By: Antoine Machut (Pacte, Laboratoire de sciences sociales - UPMF - Université Pierre Mendès France - Grenoble 2 - UJF - Université Joseph Fourier - Grenoble 1 - IEPG - Sciences Po Grenoble - Institut d'études politiques de Grenoble - CNRS - Centre National de la Recherche Scientifique - UGA - Université Grenoble Alpes)
    Abstract: Julia Cagé, Nicolas Hervé, Marie-Luce Viaud, L'information à tout prix, Ina Editions, 2017. This book addresses a highly topical problem concerning the production of news. How capable are the media of producing original information when an enormous quantity of articles is available to everyone, quickly and free of charge, on the Internet? If the media earn (almost) no money from producing articles online, what economic incentive do they have to produce original information rather than recycling information that already exists? The book grew out of a large-scale research project tracking the diffusion of news in France and is co-written by an economist (Julia Cagé) and two computer scientists (Nicolas Hervé and Marie-Luce Viaud). This cross-disciplinary collaboration yields a book of impressive empirical richness and clarity, despite the highly technical tools it employs. Big data tools allowed the authors to collect a near-exhaustive corpus of press articles published online in 2013: in total, more than 2.5 million documents from 86 general-news outlets, including Agence France Presse (AFP), ten pure players (outlets publishing exclusively online), and the websites of radio and television stations. Machine learning tools then let them run algorithms for detecting media events, copies and citations over this corpus. The study has a twofold aim: first, to quantify and explain the extent of copy-and-paste practices in online journalism (and, conversely, the production of original information), which is the subject of chapters 1 to 4; second, to relate these findings to the business models that may give the media more or less incentive to produce original information (chapters 5 to 7).
    Date: 2018–09–01

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.