nep-big New Economics Papers
on Big Data
Issue of 2018‒05‒07
sixteen papers chosen by
Tom Coupé
University of Canterbury

  1. Rethinking Policy Evaluation – Do Simple Neural Nets Bear Comparison with Synthetic Control Method? By Steinkraus, Arne
  2. The Impact of Artificial Intelligence on Innovation By Iain M. Cockburn; Rebecca Henderson; Scott Stern
  3. Can media and text analytics provide insights into labour market conditions in China? By Bailliu, Jeannine; Han, Xinfen; Kruger, Mark; Liu, Yu-Hsien; Thanabalasingam, Sri
  4. The quantification of text - Supervised learning methods - The application of textual sentiment indicators to the UK CRE market By Steffen Heinig
  5. Machine Learning Forecasts of Public Transport Demand: A comparative analysis of supervised algorithms using smart card data By Sebastián M. Palacio
  6. Image Analyses and Real Estate: Evaluation of the Quality of Location Using Remotely Sensed Imagery By Miroslav Despotovic; David Koch; Gunther Maier; Matthias Zeppelzauer
  7. Detecting Outliers with Semi-Supervised Machine Learning: A Fraud Prediction Application By Sebastián M. Palacio
  8. Quantifying macroeconomic expectations in stock markets using Google Trends By Johannes Bock
  9. Real Estate valuation and forecasting in non-homogeneous markets: A case study in Greece during the financial crisis By Dimitrios Papastamos; Antonis Alexandridis; Dimitris Karlis
  10. The Effect of Deforestation on the Access to Clean Drinking Water: A Study of Malawi's Deforestation By Annie Mwai Mapulanga and Hisahiro Naito
  11. Does A Higher Population Growth Cause Deforestation? : A Study of Malawi's Rapid Deforestation By Annie Mwai Mapulanga and Hisahiro Naito
  12. The Production of Information in an Online World: Is Copy Right? By Julia Cage; Nicolas Hervé; Marie-Luce Viaud
  13. Exploring the determinants of liquidity with big data – Market heterogenity in German markets By Marcelo Cajias; Philipp Freudenreich
  14. Debt Overhang, Rollover Risk, and Corporate Investment: Evidence from the European Crisis By Kalemli-Ozcan, Sebnem; Laeven, Luc; Moreno, David
  15. Capital humano para la transformación digital en América Latina By Katz, Raúl L.
  16. Decision Sciences, Economics, Finance, Business, Computing, and Big Data: Connections By Chang, C-L.; McAleer, M.J.; Wong, W.-K.

  1. By: Steinkraus, Arne
    Abstract: With the advent of big data in economics, machine learning algorithms are becoming more and more appealing to economists. Despite some attempts to establish artificial neural networks in the early 1990s, little is known about their ability to estimate causal effects in policy evaluation. We employ a simple forecasting neural network to analyze the effect of the construction of the Oresund Bridge on the local economy. The outcome is compared to the causal effect estimated by the proven synthetic control method. Our results suggest that, especially in so-called prediction policy problems, neural nets may outperform traditional approaches.
    Keywords: Artificial Neural Nets,Machine Learning,Synthetic Control Method,Policy Evaluation
    JEL: C45 O18
    Date: 2018
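As an illustration of the prediction-policy idea, the sketch below trains a tiny one-hidden-layer network on pre-intervention data only and reads the treatment effect off the gap between actual and predicted post-intervention outcomes. The data, network size, and learning rate are all invented for the example; this is not the paper's implementation.

```python
import math
import random

random.seed(1)

# Invented data: a control unit's outcome drives the treated unit's outcome;
# after the intervention the treated unit receives a constant effect of +1.0.
pre_control = [random.random() for _ in range(30)]
pre_treated = [0.3 + 0.5 * c for c in pre_control]
post_control = [random.random() for _ in range(10)]
TRUE_EFFECT = 1.0
post_treated = [0.3 + 0.5 * c + TRUE_EFFECT for c in post_control]

# One-hidden-layer tanh network trained by plain stochastic gradient descent.
H = 4
w1 = [random.uniform(-0.5, 0.5) for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-0.5, 0.5) for _ in range(H)]
b2 = 0.0

def forward(x):
    hidden = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
    return sum(w2[j] * hidden[j] for j in range(H)) + b2, hidden

lr = 0.05
for _ in range(2000):
    for x, y in zip(pre_control, pre_treated):
        yhat, hidden = forward(x)
        err = yhat - y
        # compute all gradients before updating any weight
        grad_w2 = [err * hidden[j] for j in range(H)]
        grad_hid = [err * w2[j] * (1 - hidden[j] ** 2) for j in range(H)]
        b2 -= lr * err
        for j in range(H):
            w2[j] -= lr * grad_w2[j]
            w1[j] -= lr * grad_hid[j] * x
            b1[j] -= lr * grad_hid[j]

# The net, trained on pre-intervention data only, supplies the counterfactual.
counterfactual = [forward(c)[0] for c in post_control]
effect = sum(a - f for a, f in zip(post_treated, counterfactual)) / len(post_treated)
print(round(effect, 2))
```

Because the network only sees pre-intervention data, any systematic post-period gap is attributed to the intervention, which is exactly where a forecasting model can stand in for a synthetic control.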
  2. By: Iain M. Cockburn; Rebecca Henderson; Scott Stern
    Abstract: Artificial intelligence may greatly increase the efficiency of the existing economy. But it may have an even larger impact by serving as a new general-purpose “method of invention” that can reshape the nature of the innovation process and the organization of R&D. We distinguish between automation-oriented applications such as robotics and the potential for recent developments in “deep learning” to serve as a general-purpose method of invention, finding strong evidence of a “shift” in the importance of application-oriented learning research since 2009. We suggest that this is likely to lead to a significant substitution away from more routinized labor-intensive research towards research that takes advantage of the interplay between passively generated large datasets and enhanced prediction algorithms. At the same time, the potential commercial rewards from mastering this mode of research are likely to usher in a period of racing, driven by powerful incentives for individual companies to acquire and control critical large datasets and application-specific algorithms. We suggest that policies which encourage transparency and sharing of core datasets across both public and private actors may be critical tools for stimulating research productivity and innovation-oriented competition going forward.
    JEL: L1
    Date: 2018–03
  3. By: Bailliu, Jeannine; Han, Xinfen; Kruger, Mark; Liu, Yu-Hsien; Thanabalasingam, Sri
    Abstract: The official Chinese labour market indicators have been seen as problematic, given their small cyclical movement and the fact that they capture only part of the labour force. In this paper, we build a monthly Chinese labour market conditions index (LMCI) using text analytics applied to mainland Chinese-language newspapers over the period from 2003 to 2017. We use a supervised machine learning approach, training a support vector machine classification model. The information content and forecast ability of our LMCI are tested against official labour market activity measures in wage and credit growth estimations. Surprisingly, one of our findings is that the much-maligned official labour market indicators do contain information. However, their information content is not robust and, in many cases, our LMCI provides significantly superior forecasts. Moreover, regional disaggregation of the LMCI shows that labour conditions in the export-oriented coastal region are sensitive to export growth, while those in inland regions are not. This suggests that text analytics can, indeed, be used to extract useful labour market information from Chinese newspaper articles.
    JEL: C38 E24 E27
    Date: 2018–04–20
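A minimal sketch of the kind of support vector machine text classifier the paper trains: a linear SVM fitted by Pegasos-style sub-gradient descent on a hypothetical six-document bag-of-words corpus. The sentences and labels are invented; the paper's model is trained on Chinese-language newspaper text with far richer features.

```python
# Tiny bag-of-words corpus; +1 = labour-market strength, -1 = weakness.
train = [
    ("factories report hiring surge and new jobs", 1),
    ("strong wage growth as employers add jobs", 1),
    ("recruitment rises in coastal export firms", 1),
    ("mass layoffs as plants cut shifts", -1),
    ("unemployment climbs and wage growth stalls", -1),
    ("factory closures trigger job losses", -1),
]
vocab = sorted({w for text, _ in train for w in text.split()})
index = {w: i for i, w in enumerate(vocab)}

def features(text):
    x = [0.0] * len(vocab)
    for w in text.split():
        if w in index:
            x[index[w]] += 1.0
    return x

# Pegasos-style sub-gradient descent for a linear SVM (hinge loss + L2 penalty).
lam = 0.01
w = [0.0] * len(vocab)
t = 0
for _ in range(200):
    for text, y in train:
        t += 1
        eta = 1.0 / (lam * t)
        x = features(text)
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        decay = 1.0 - eta * lam
        if margin < 1.0:
            w = [decay * wi + eta * y * xi for wi, xi in zip(w, x)]
        else:
            w = [decay * wi for wi in w]

def classify(text):
    x = features(text)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

print(classify("hiring surge adds jobs"), classify("layoffs and unemployment climbs"))
```

Words unseen in training are simply ignored, so the classifier degrades gracefully on out-of-vocabulary news text.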
  4. By: Steffen Heinig
    Abstract: In the real estate industry, information is an essential good that influences the behaviour of market participants. One main source of information about the market is news articles. For financial markets, and especially for the real estate market, the quantification of text represents a new source for the extraction of market sentiment. In this study, I examine a newly constructed corpus of news articles on the London real estate market with the help of supervised learning algorithms (SVM, maximum entropy, GLMNET). More than 100,000 articles are used over a period of 11 years (2004-2015). One central issue in this process is the annotation of the documents in the training corpus. Since the real estate market does not offer an annotated news corpus, and labelling such a large corpus manually would be expensive in several ways, I propose a new method to overcome this gap. The use of real-estate-related Amazon book reviews to train the different classifiers proves quite promising; I used more than 220,000 reviews for the training process. The results suggest that the book reviews are a good substitute and that classifiers trained on the reviews are able to extract the sentiment from the articles. Satisfactory graphical results reveal, at least for some of the classifiers, that the underlying market sentiment was extracted. The textual sentiment indicators are also able to improve the performance of different models. Finally, I use the textual indicators in a probit model to test whether they have any signalling power for future developments.
    Keywords: Natural Language Processing; Quantification of text; Sentiment Analysis; Supervised Learning algorithm
    JEL: R3
    Date: 2017–07–01
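The paper's key trick, training sentiment classifiers on Amazon book reviews and applying them to news text, can be sketched with a simple multinomial Naive Bayes stand-in. The paper itself uses SVM, maximum entropy, and GLMNET, and all the sentences below are invented; the point is only the train-on-reviews, score-on-news transfer step.

```python
import math

# Hypothetical "review" sentences stand in for the Amazon book reviews.
reviews = [
    ("excellent clear insightful guide to property investing", "pos"),
    ("great practical advice strong recommendations", "pos"),
    ("superb analysis of booming housing markets", "pos"),
    ("terrible outdated advice poor examples", "neg"),
    ("weak analysis dull writing disappointing", "neg"),
    ("misleading claims about collapsing markets", "neg"),
]

# Multinomial Naive Bayes with add-one smoothing.
counts = {"pos": {}, "neg": {}}
totals = {"pos": 0, "neg": 0}
docs = {"pos": 0, "neg": 0}
for text, label in reviews:
    docs[label] += 1
    for w in text.split():
        counts[label][w] = counts[label].get(w, 0) + 1
        totals[label] += 1
vocab = {w for c in counts.values() for w in c}

def score(text, label):
    logp = math.log(docs[label] / len(reviews))
    for w in text.split():
        logp += math.log((counts[label].get(w, 0) + 1) / (totals[label] + len(vocab)))
    return logp

def sentiment(text):
    return "pos" if score(text, "pos") >= score(text, "neg") else "neg"

# Classifier trained on reviews, applied to news-style text (the transfer step).
print(sentiment("strong booming housing market"),
      sentiment("weak disappointing outlook for markets"))
```

The transfer works only to the degree that sentiment-bearing vocabulary overlaps between reviews and news, which is exactly what the paper's graphical checks probe.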
  5. By: Sebastián M. Palacio (GiM, Department of Econometrics, Statistics and Applied Economics, Universitat de Barcelona)
    Abstract: Public transport smart cards are widely used around the world. However, while they provide information about various aspects of passenger behavior, they have not been properly exploited to predict demand. Indeed, traditional methods in economics employ linear unbiased estimators that pay little attention to accuracy, which is the main problem faced by the sector’s regulators. This paper reports the application of various supervised machine learning (SML) techniques to smart card data in order to forecast demand, and it compares these outcomes with traditional linear model estimates. We conclude that the forecasts obtained from these algorithms are much more accurate.
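The gist of the comparison, a flexible supervised learner beating a straight line on peaked demand data, can be shown with a toy: ordinary least squares versus k-nearest-neighbours regression on hypothetical hourly boarding counts. None of this is the paper's data or model set; it only illustrates why accuracy-focused learners can dominate linear estimators here.

```python
import math

# Hypothetical hourly boarding counts with morning and evening peaks.
def demand(hour):
    return 10 + 100 * math.exp(-((hour - 8) ** 2) / 4) + 120 * math.exp(-((hour - 18) ** 2) / 4)

train_x = [h for h in range(24) if h % 2 == 0]
train_y = [demand(h) for h in train_x]
test_x = [h for h in range(24) if h % 2 == 1]
test_y = [demand(h) for h in test_x]

# Benchmark: ordinary least squares line via the closed-form slope/intercept.
n = len(train_x)
mx = sum(train_x) / n
my = sum(train_y) / n
slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / \
    sum((x - mx) ** 2 for x in train_x)
intercept = my - slope * mx

def predict_linear(h):
    return intercept + slope * h

# k-nearest-neighbours regression (k = 2): average the closest training hours.
def predict_knn(h, k=2):
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - h))[:k]
    return sum(y for _, y in nearest) / k

def mse(predict):
    return sum((predict(h) - y) ** 2 for h, y in zip(test_x, test_y)) / len(test_x)

mse_linear = mse(predict_linear)
mse_knn = mse(predict_knn)
print(mse_knn < mse_linear)
```

A single line cannot represent two daily peaks, so its out-of-sample error stays large no matter how the data are sampled, while the local learner tracks the shape.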
  6. By: Miroslav Despotovic; David Koch; Gunther Maier; Matthias Zeppelzauer
    Abstract: A growing number of applied studies examine the impact of urban space quality on property prices. The planning and development of the immediate neighborhood (micro location) is an especially important influencing factor in regional economics. An image-based method for estimating location quality in the context of property valuation does not yet exist. We develop a method for determining the quality of a location using image processing, while taking into account a classification into quality classes based on regional structural characteristics. With the help of automatic image analysis, a new information source is leveraged that previously could not be taken into account in the evaluation of location quality or in automated valuation models (e.g. hedonic models). In the field of image analysis, the extraction of parameters related to location quality is a new task, and it is so far unclear to what degree meaningful parameters can be found autonomously by machine learning. This dissertation investigates this question in detail and is, to our knowledge, the first approach to the automatic image-based valuation of location quality.
    Keywords: Hedonic Pricing; Image Processing; location quality; Machine Learning; Neighborhoods
    JEL: R3
    Date: 2017–07–01
  7. By: Sebastián M. Palacio (GiM, Department of Econometrics, Statistics and Applied Economics, Universitat de Barcelona)
    Abstract: Abnormal pattern prediction has received a great deal of attention from both academia and industry, with applications that range from fraud, terrorism and intrusion detection to sensor events, medical diagnoses, weather patterns, etc. In practice, most abnormal pattern prediction problems are characterized by a small number of labeled data and a huge number of unlabeled data. While this points most obviously to the adoption of a semi-supervised approach, most empirical studies have opted for a simplification and treated it as a supervised problem, resulting in a severe bias towards false negatives. In this paper, we propose an innovative methodology based on semi-supervised techniques and introduce a new metric, the Cluster-Score, for measuring the homogeneity of abnormalities. Specifically, the methodology involves transmuting unsupervised models into supervised models using the Cluster-Score metric, which defines objective boundaries between clusters and evaluates the homogeneity of the abnormalities in the cluster construction. We apply this methodology to a problem of fraud detection among property insurance claims. The objectives are to increase the number of fraudulent claims detected and to reduce the proportion of investigated claims that are, in fact, non-fraudulent. Applying our methodology considerably improved both objectives.
    Keywords: Outlier Detection, Semi-Supervised Models, Fraud, Cluster, Insurance
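A rough sketch of the semi-supervised idea: cluster all claims, then use the few available labels to decide which cluster to investigate. The purity score below is a generic stand-in, not the paper's Cluster-Score, and the claim features are invented.

```python
# Hypothetical claim features: (claim amount in k-EUR, days from policy start / 100).
claims = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0), (1.2, 1.1),   # ordinary-looking
          (5.0, 4.8), (5.2, 5.1), (4.9, 5.3)]               # abnormal-looking
# Only a handful of labels are known, as in most fraud settings.
labels = {4: "fraud", 5: "fraud", 0: "legit"}

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# Plain k-means (k = 2) with fixed initial centres for reproducibility.
centres = [(0.0, 0.0), (6.0, 6.0)]
for _ in range(10):
    assign = [min(range(2), key=lambda c: dist2(p, centres[c])) for p in claims]
    for c in range(2):
        members = [p for p, a in zip(claims, assign) if a == c]
        if members:
            centres[c] = (sum(p[0] for p in members) / len(members),
                          sum(p[1] for p in members) / len(members))

# Score each cluster by the share of its *labelled* points that are fraud,
# then flag every member of the most fraud-pure cluster for investigation.
purity = []
for c in range(2):
    tagged = [labels[i] for i, a in enumerate(assign) if a == c and i in labels]
    purity.append(sum(t == "fraud" for t in tagged) / len(tagged) if tagged else 0.0)
suspect_cluster = max(range(2), key=lambda c: purity[c])
flagged = [i for i, a in enumerate(assign) if a == suspect_cluster]
print(flagged)
```

The unlabeled points riding along in the flagged cluster (here claim 6) are exactly the extra detections a purely supervised model, trained on the three labels alone, would tend to miss.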
  8. By: Johannes Bock
    Abstract: Among other macroeconomic indicators, the monthly release of U.S. unemployment rate figures in the Employment Situation report by the U.S. Bureau of Labor Statistics receives a great deal of media attention and strongly affects the stock markets. I investigate whether a profitable investment strategy can be constructed by predicting likely changes in U.S. unemployment before the official news release using Google query volumes for related search terms. I find that massive new data sources of human interaction with the Internet not only improve the predictability of the U.S. unemployment rate but can also enhance the market timing of trading strategies when considered jointly with macroeconomic data. My results illustrate the potential of combining extensive behavioural data sets with economic data to anticipate investor expectations and stock market moves.
    Date: 2018–05
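The underlying signal-extraction idea can be reduced to a toy: use the pre-release change in query volume to call the direction of the unemployment change, which is what a trading rule would position on ahead of the release. All numbers below are invented for illustration.

```python
# Hypothetical monthly data: change in Google query volume for an
# unemployment-related term (observed before the release) and the
# subsequent change in the official unemployment rate.
search_change = [2.0, -1.0, 3.0, -2.0, 1.0]
rate_change = [0.1, -0.2, 0.3, -0.1, -0.05]

# Naive signal: rising search interest predicts a rising unemployment rate.
def sign(v):
    return 1 if v > 0 else -1

predictions = [sign(s) for s in search_change]
actual = [sign(r) for r in rate_change]
hit_rate = sum(p == a for p, a in zip(predictions, actual)) / len(actual)
print(hit_rate)  # 4 of 5 directions called correctly
```

A real strategy would combine this directional signal with the macroeconomic data the paper mentions rather than trade on query volumes alone.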
  9. By: Dimitrios Papastamos; Antonis Alexandridis; Dimitris Karlis
    Abstract: In recent years, big financial institutions have taken an interest in creating and maintaining property valuation models. The main objective is to use reliable historical data to forecast the price of a new property in a comprehensible manner and to provide some indication of the uncertainty around this forecast. In this paper we develop an automatic valuation model for property valuation using a large database of historical prices from Greece. The Greek property market is an inefficient, non-homogeneous market, still in its infancy and governed by a lack of information. As a result, modelling the Greek real estate market is a very challenging problem. The available data cover a wide range of properties across time and include the financial crisis period in Greece, which led to tremendous changes in the dynamics of the real estate market. We formulate and compare linear and non-linear models based on regression, hedonic equations and artificial neural networks. The forecasting ability of each method is evaluated out-of-sample. Special care is taken to measure the success of the forecasts and to identify the property characteristics that lead to large forecasting errors. Finally, by examining the strengths and performance of each method, we apply a combined forecasting rule to improve performance. Our results indicate that the proposed methodology constitutes an accurate tool for property valuation in non-homogeneous, newly developed markets.
    Keywords: Artificial Neural Networks; Automated Valuation Models; Forecasting Accuracy; Residential Market; Valuations
    JEL: R3
    Date: 2017–07–01
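A combined forecasting rule of the kind the abstract describes amounts to averaging forecasts whose errors lean in opposite directions. A toy example with invented prices and forecasts shows the mechanism:

```python
# Hypothetical out-of-sample prices (in k-EUR) and two model forecasts whose
# errors lean in opposite directions; a simple average offsets the biases.
actual = [100.0, 120.0, 110.0, 130.0]
hedonic = [106.0, 127.0, 118.0, 135.0]   # tends to over-predict
network = [96.0, 115.0, 104.0, 127.0]    # tends to under-predict

def mse(forecast):
    return sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual)

combined = [(h + n) / 2 for h, n in zip(hedonic, network)]
print(mse(hedonic), mse(network), mse(combined))
```

The equal-weight average is the simplest combination rule; weights can also be tuned on a validation sample when one model is systematically more accurate.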
  10. By: Annie Mwai Mapulanga and Hisahiro Naito
    Abstract: Using Malawi's satellite images of land use and land cover, weather data and population data for each cluster, and two waves of the Demographic and Health Survey (DHS), this paper estimates the causal effect of deforestation on access to clean drinking water. The previous forest-science literature has examined the effect of deforestation on water flow, with mixed results. This paper instead directly examines the causal effect of deforestation on households' access to clean drinking water using two-stage least squares (2SLS) estimation. The results provide strong empirical evidence that deforestation decreases access to clean water. Falsification tests show that it is very unlikely that our instrumental variable is picking up an unobserved time trend. We find that a one percentage point increase in deforestation decreases access to clean water by 1.0-1.3 percentage points.
    Date: 2018–03
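The identification strategy, two-stage least squares with an instrument for deforestation, can be sketched on simulated data. The instrument, coefficients, and sample below are all invented; the point is only that 2SLS removes the endogeneity bias that naive OLS suffers when the regressor is correlated with the error.

```python
import random

random.seed(7)
n = 500
TRUE_BETA = -1.2  # invented effect of deforestation on clean-water access

# Simulated data: z is an exogenous instrument; x (deforestation) is endogenous
# because the disturbance v also feeds into the outcome error.
z = [random.gauss(0, 1) for _ in range(n)]
v = [random.gauss(0, 1) for _ in range(n)]
e = [random.gauss(0, 1) for _ in range(n)]
x = [0.8 * zi + vi for zi, vi in zip(z, v)]
y = [TRUE_BETA * xi + vi + ei for xi, vi, ei in zip(x, v, e)]

def ols_slope(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / \
        sum((a - mx) ** 2 for a in xs)

# Naive OLS is biased because cov(x, error) > 0.
b_ols = ols_slope(x, y)

# 2SLS: first stage projects x on z; second stage regresses y on the fitted x.
pi = ols_slope(z, x)
x_hat = [pi * zi for zi in z]
b_2sls = ols_slope(x_hat, y)
print(round(b_ols, 2), round(b_2sls, 2))
```

With one instrument and one endogenous regressor this collapses to the classic IV ratio cov(z, y)/cov(z, x), which is why only the fitted first-stage values enter the second stage.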
  11. By: Annie Mwai Mapulanga and Hisahiro Naito
    Abstract: Using Malawi's satellite images of land use and land cover change, weather data and population data for each cluster, and Population and Housing Census (PHC) data, this paper estimates the causal effect of local population growth on deforestation in Malawi. We use the average number of births in the census ten years earlier as an instrumental variable to control for the endogeneity of population growth. The results provide strong empirical evidence that high population growth among local residents increases deforestation through the expansion of agricultural land. The results show that a 1 percent increase in population growth increases the deforestation rate by 2.7 percent through the increase in agricultural land. In terms of land use changes, a one hectare gain in agricultural land results in a 0.57 hectare loss of forest cover.
    Date: 2018–03
  12. By: Julia Cage (Département d'économie); Nicolas Hervé (Institut national de l'audiovisuel); Marie-Luce Viaud (Institut national de l'audiovisuel)
    Abstract: This paper documents the extent of copying and estimates the returns to originality in online news production. We build a unique dataset combining all the online content produced by French news media (newspapers, television, radio, pure online media, and a news agency) during the year 2013 with new micro audience data. We develop a topic detection algorithm that identifies each news event, trace the timeline of each story, and study news propagation. We unravel new evidence on online news production. First, we show that one quarter of the news stories are reproduced online in less than 4 minutes. Second, we find that only 32.6% of the online content is original. Third, we show that reputation effects partly counterbalance the negative impact of plagiarism on newsgathering incentives. Using media-level daily audience and article-level social media statistics (Facebook and Twitter shares), we find that original content represents between 54 and 62% of online news consumption. Reputation mechanisms actually appear to solve about 30 to 40% of the copyright violation problem.
    Keywords: internet; information spreading; copyright; investigative journalism; social media
    Date: 2018–01
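A crude version of copy detection can be built from word n-gram "shingles": articles that reproduce long verbatim passages share many shingles with earlier articles, and the share of unseen shingles gives a rough originality score. The snippets below are invented; the paper's topic-detection algorithm is far richer.

```python
# Word 5-gram "shingles": two articles that share long verbatim runs will
# share many shingles; the share of shingles not seen in earlier articles
# is a crude originality measure.
def shingles(text, k=5):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

original = ("the ministry announced a new infrastructure plan on tuesday "
            "that will fund rail lines in three regions over five years")
copy = ("breaking news the ministry announced a new infrastructure plan "
        "on tuesday that will fund rail lines in three regions")
fresh = ("local farmers are experimenting with drought resistant crops "
         "after two consecutive dry summers hit harvests hard")

def originality(article, earlier):
    s = shingles(article)
    seen = set().union(*(shingles(a) for a in earlier))
    return len(s - seen) / len(s)

print(round(originality(copy, [original]), 2), originality(fresh, [original]))
```

Scaled up, scores like this make it possible to classify each article in a news timeline as mostly copied or mostly original, which is the raw material for estimating returns to originality.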
  13. By: Marcelo Cajias; Philipp Freudenreich
    Abstract: Purpose – The purpose of this paper is to examine market liquidity (time-on-market) and its determinants for rental dwellings in the seven largest German cities using big data. Design/methodology/approach – The determinants of time-on-market are estimated with the Cox proportional hazards model. Hedonic characteristics as well as socioeconomic and spatial variables are combined with different fixed effects and controls for non-linearity to maximise the explanatory power of the model. Findings – A higher asking rent and larger living space decrease liquidity in all seven markets, while a dwelling's age, the number of rooms and proximity to the city centre speed up the letting process. Heterogeneous implications are found for the linear and non-linear hedonic characteristics. Practical implications – The findings are of interest to institutional and private landlords as well as governmental organizations in charge of housing and urban development. Originality/value – This is the first paper to deal with the liquidity of rental dwellings in the seven most populated cities of Europe's second-largest rental market by applying the Cox proportional hazards model. Furthermore, the German rental market is of particular interest, as approximately 60% of all rental dwellings are owned by private landlords and the market is organized polycentrically.
    Keywords: Big data; Cox proportional hazard model; Housing real estate; Liquidity/Time-on-market; Non-linearity
    JEL: R3
    Date: 2017–07–01
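The Cox proportional hazards model the paper applies can be fitted by Newton-Raphson on the partial log-likelihood. Below is a single-covariate, no-censoring sketch on simulated listings; the covariate, effect size, and data are all invented, and a real analysis would handle censoring, ties, and many covariates.

```python
import math
import random

random.seed(3)
TRUE_BETA = -1.0  # invented: higher asking rent lowers the letting hazard
n = 200

# Simulated listings: covariate x (standardised asking rent), exponential
# survival times with hazard proportional to exp(TRUE_BETA * x).
x = [random.uniform(0, 2) for _ in range(n)]
t = [-math.log(random.random()) / math.exp(TRUE_BETA * xi) for xi in x]
data = sorted(zip(t, x))  # ascending time-on-market, no censoring, no ties

# Newton-Raphson on the Cox partial log-likelihood (single covariate).
beta = 0.0
for _ in range(20):
    grad = hess = 0.0
    s0 = s1 = s2 = 0.0
    # walk from the longest time down so the running sums equal each risk set
    for ti, xi in reversed(data):
        w = math.exp(beta * xi)
        s0 += w
        s1 += xi * w
        s2 += xi * xi * w
        mean = s1 / s0
        grad += xi - mean
        hess += mean * mean - s2 / s0
    beta -= grad / hess

print(round(beta, 2))
```

The partial likelihood conditions out the baseline hazard entirely, which is why only the risk-set sums appear; a negative fitted beta means the dwelling stays on the market longer, i.e. lower liquidity.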
  14. By: Kalemli-Ozcan, Sebnem; Laeven, Luc; Moreno, David
    Abstract: We quantify the role of financial factors that have contributed to sluggish investment in Europe in the aftermath of the 2008-2009 crisis. Using a big data approach, we match firms to their banks based on banking relationships in 8 European countries over time, obtaining over 2 million observations. We document four stylized facts. First, the decline in investment in the aftermath of the crisis can be linked to higher leverage, increased debt service, and having a relationship with a weak bank, once we condition on aggregate demand shocks. Second, the relation between leverage and investment depends on the maturity structure of debt: firms with a higher share of long-term debt have higher investment rates relative to firms with a lower share of long-term debt, since rollover risk is lower for the former and higher for the latter. Third, the negative effect of leverage is more pronounced when firms are linked to weak banks, i.e., banks with high exposure to sovereign risk. Firms with higher shares of short-term debt decrease investment more than firms with lower shares of short-term debt, even when both sets of firms are linked to weak banks. This result suggests that loan evergreening by weak banks played a limited role in increasing investment. Fourth, the direct negative effect of weak banks on the average firm's investment disappears once demand shocks are controlled for, although the differential effects with respect to leverage and the maturity of debt remain.
    Keywords: Bank-Sovereign Nexus; Debt Maturity; Firm Investment; Rollover Risk
    JEL: E22 E32 E44 F34 F36 G32
    Date: 2018–04
  15. By: Katz, Raúl L.
    Abstract: This paper analyzes the supply of training programs in mature digital technologies in seven countries of the region (Argentina, Brazil, Chile, Colombia, Mexico, Peru and Uruguay). Most of the programs surveyed include courses related to robotics/control, artificial intelligence/machine learning, or big data/analytics. Among the key findings, the study indicates that Brazil offers the largest number of courses in advanced digital technologies and Uruguay the smallest. A clear supply gap is also observed in high-level training programs, which has an impact on the level of, and resources devoted to, research and development in the region. As regards concentration across advanced technologies, robotics and control tend to account for the largest share of the training supply. Finally, the paper finds that higher education in the region is characterized by a fragmented and diversified system in which private higher-education systems prevail over public ones. In this system, institutions offering higher-education programs proliferate in an uncoordinated fashion, without responding to a uniform educational development matrix aimed at increasing countries' human capital endowment.
    Date: 2018–04–24
  16. By: Chang, C-L.; McAleer, M.J.; Wong, W.-K.
    Abstract: This paper provides a review of some connecting literature in Decision Sciences, Economics, Finance, Business, Computing, and Big Data. We then discuss some research that is related to the six cognate disciplines. Academics could develop theoretical models and subsequent econometric and statistical models to estimate the parameters in the associated models. Moreover, they could conduct simulations to examine whether the estimators or statistics in the new theories of estimation and hypothesis testing have small size and high power. Thereafter, academics and practitioners could apply their theories to analyze interesting problems and issues in the six disciplines and other cognate areas.
    Keywords: Decision sciences, economics, finance, business, computing, and big data, theoretical models, econometric and statistical models, applications
    JEL: A10 G00 G31 O32
    Date: 2018–03–01

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.