nep-big 2018-06-18 papers

on Big Data

Issue of 2018–06–18
nine papers chosen by
Tom Coupé, University of Canterbury

Big Data, artificial intelligence and the geography of entrepreneurship in the United States By Ebert, Tobias; Eichstaedt, Johannes C.; Lee, Neil; Obschonka, Martin; Rodríguez-Pose, Andrés
Machine Learning the Cryptocurrency Market By Laura Alessandretti; Abeer ElBahrawy; Luca Maria Aiello; Andrea Baronchelli
Political Connections and Firms: Network Dimensions By Bussolo, Maurizio; Commander, Simon; Poupakis, Stavros
Tobacco spending in Georgia: Machine learning approach By Maksym Obrizan; Karine Torosyan; Norberto Pignatti
Data Science for Institutional and Organizational Economics By Prüfer, Jens; Prüfer, Patricia
That's classified! Inventing a new patent taxonomy By Billington, Stephen D.; Hanna, Alan J.
Day-ahead electricity price forecasting with high-dimensional structures: Univariate vs. multivariate modeling frameworks By Florian Ziel; Rafal Weron
Regime switching in the presence of endogeneity By Tom Auld; Oliver Linton
Économétrie & Machine Learning By Arthur Charpentier; Emmanuel Flachaire; Antoine Ly

Big Data, artificial intelligence and the geography of entrepreneurship in the United States

By:	Ebert, Tobias; Eichstaedt, Johannes C.; Lee, Neil; Obschonka, Martin; Rodríguez-Pose, Andrés
Abstract:	There is increasing interest in the potential of artificial intelligence and Big Data (e.g., generated via social media) to help understand economic outcomes and processes. But can artificial intelligence models, solely based on publicly available Big Data (e.g., language patterns left on social media), reliably identify geographical differences in entrepreneurial personality/culture that are associated with entrepreneurial activity? Using a machine learning model processing 1.5 billion tweets by 5.25 million users, we estimate the Big Five personality traits and an entrepreneurial personality profile for 1,772 U.S. counties. We find that these Twitter-based personality estimates show substantial relationships to county-level entrepreneurship activity, accounting for 24% (entrepreneurial personality profile) and 32% (all Big Five traits as separate predictors in one model) of the variance in local entrepreneurship, and are robust to the introduction in the model of conventional economic factors that affect entrepreneurship. We conclude that artificial intelligence methods, analysing publicly available social media data, are indeed able to detect entrepreneurial patterns, by measuring territorial differences in entrepreneurial personality/culture that are valid markers of actual entrepreneurial behaviour. More importantly, such social media datasets and artificial intelligence methods are able to deliver similar (or even better) results than studies based on millions of personality tests (self-report studies). Our findings have a wide range of implications for research and practice concerned with entrepreneurial regions and eco-systems, and regional economic outcomes interacting with local culture.
Keywords:	artificial intelligence; Big Data; Big Five; Counties; entrepreneurship; personality; psychological traits; social media; Twitter; U.S.
JEL:	L26 R11 R12
Date:	2018–05
URL:	https://d.repec.org/n?u=RePEc:cpr:ceprdp:12949

Machine Learning the Cryptocurrency Market

By:	Laura Alessandretti; Abeer ElBahrawy; Luca Maria Aiello; Andrea Baronchelli
Abstract:	Machine learning and AI-assisted trading have attracted growing interest for the past few years. Here, we use this approach to test the hypothesis that the inefficiency of the cryptocurrency market can be exploited to generate abnormal profits. We analyse daily data for 1,681 cryptocurrencies for the period between Nov. 2015 and Apr. 2018. We show that simple trading strategies assisted by state-of-the-art machine learning algorithms outperform standard benchmarks. Our results show that non-trivial, but ultimately simple, algorithmic mechanisms can help anticipate the short-term evolution of the cryptocurrency market.
Date:	2018–05
URL:	https://d.repec.org/n?u=RePEc:arx:papers:1805.08550

Political Connections and Firms: Network Dimensions

By:	Bussolo, Maurizio (World Bank); Commander, Simon (IE Business School, Altura Partners); Poupakis, Stavros (University College London)
Abstract:	Business and politician interaction is pervasive but has mostly been analysed with a binary approach. Yet the network dimensions of such connections are ubiquitous. We use a unique dataset for seven economies that documents politically exposed persons (PEPs) and their links to companies, political parties and other individuals. With this dataset, we can identify networks of connections, including their scale and composition. We find that all country networks are integrated, having a Big Island. They also tend to be marked by small-world properties of high clustering and short path length. Matching our data to firm-level information, we examine the association between being connected and firm-level attributes. The originality of our analysis is to identify how location in a network, including extent of ties and centrality, is correlated with firm scale and performance. In a binary approach such network characteristics are omitted and the scale and economic impact of politically connected business may be significantly mis- or under-estimated. By comparing results of the binary approach with our network approach, we can also assess the biases that result from ignoring network attributes.
Keywords:	connections: PEPs, networks, rents
JEL:	L14 L53 P26
Date:	2018–04
URL:	https://d.repec.org/n?u=RePEc:iza:izadps:dp11498
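
Illustrative sketch (not from the paper): the abstract above describes small-world properties as high clustering combined with short path length. Assuming an undirected, connected network, the two measures can be computed as below. The toy edge list and the function names are hypothetical and have nothing to do with the authors' PEP dataset.

```python
from collections import deque


def average_clustering(adj):
    """Mean local clustering coefficient of an undirected graph.

    Nodes with fewer than two neighbours contribute 0 but still count
    in the denominator.
    """
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        ordered = sorted(nbrs)
        # Count links that exist between pairs of this node's neighbours.
        links = sum(
            1
            for i, u in enumerate(ordered)
            for v in ordered[i + 1:]
            if v in adj[u]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Hypothetical toy network: a triangle A-B-C with a tail C-D-E.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

clustering = average_clustering(adj)
path_length = average_path_length(adj)
print(f"clustering={clustering:.3f}, path length={path_length:.2f}")
```

A small-world diagnosis would normally compare both numbers with those of a random graph of the same size and density; that comparison is left out here.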

Tobacco spending in Georgia: Machine learning approach

By:	Maksym Obrizan (Kyiv School of Economics); Karine Torosyan (International School of Economics at TSU); Norberto Pignatti (International School of Economics at TSU)
Abstract:	The purpose of this study is to analyze tobacco spending in Georgia using various machine learning methods applied to a sample of 10,757 households from the Integrated Household Survey collected by GeoStat in 2016. Previous research has shown that smoking is the leading cause of death for 35-69 year olds. In addition, tobacco expenditures may constitute as much as 17% of the household budget. Five different algorithms (ordinary least squares, random forest, two gradient boosting methods and deep learning) were applied to 8,173 households (or 76.0%) in the train set. Out-of-sample predictions were then obtained for the 2,584 remaining households in the test set. Under the default settings, the random forest algorithm showed the best performance, with more than 10% improvement in terms of root-mean-square error (RMSE). The improved accuracy and availability of machine learning tools in R call for active use of these methods by policy makers and scientists in health economics, public health and related fields.
Keywords:	Tobacco Spending, Household Survey, Georgia, Machine Learning
JEL:	I12 L66 D12
Date:	2018–05
URL:	https://d.repec.org/n?u=RePEc:rcd:wpaper:3184
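
Illustrative sketch (not from the paper): the evaluation described in the abstract above rests on a train/test split and an out-of-sample RMSE comparison. The helpers below show that arithmetic in Python (the paper itself reports using R). Only the household counts come from the abstract; the function names and the placeholder rows are hypothetical.

```python
import math
import random


def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def train_test_split(rows, n_train, seed=0):
    """Shuffle a copy of rows and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def relative_improvement(baseline_rmse, model_rmse):
    """Share by which a model lowers RMSE relative to a baseline."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Household counts taken from the abstract; the rows are placeholders.
households = list(range(10757))
train, test = train_test_split(households, n_train=8173)
print(len(train), len(test), f"{len(train) / len(households):.1%}")
```

With RMSE values for a baseline and a candidate model on the test set, `relative_improvement` gives the kind of percentage figure the abstract quotes; no such values are reproduced here because the abstract does not report them.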

Data Science for Institutional and Organizational Economics

By:	Prüfer, Jens (Tilburg University, Center For Economic Research); Prüfer, Patricia (Tilburg University, Center For Economic Research)
Abstract:	To what extent can data science methods – such as machine learning, text analysis, or sentiment analysis – push the research frontier in the social sciences? This essay briefly describes the most prominent data science techniques that lend themselves to analyses of institutional and organizational governance structures. We elaborate on several examples applying data science to analyze legal, political, and social institutions and sketch how specific data science techniques can be used to study important research questions that could not (to the same extent) be studied without these techniques. We conclude by comparing the main strengths and limitations of computational social science with traditional empirical research methods and its relation to theory.
Keywords:	data science; machine learning; institutions; text analysis
JEL:	C50 C53 C87 D02 K0
Date:	2018
URL:	https://d.repec.org/n?u=RePEc:tiu:tiucen:6d04f0fe-0bcd-4cf4-86f6-f2e0a86fa575

That's classified! Inventing a new patent taxonomy

By:	Billington, Stephen D.; Hanna, Alan J.
Abstract:	Patent studies inform our understanding of innovation. Any study of patenting involves classifying patent data according to a chosen taxonomy. The literature has produced numerous taxonomies, which means patents are being classified differently across studies. This potential inconsistency is compounded by a lack of documentation provided on existing taxonomies, making them difficult to replicate. Because of this, we develop a new patent taxonomy using machine learning techniques, and propose a new methodology to automate patent classification. We contrast existing taxonomies with our own upon a widely used patent dataset. In a regression analysis of patent classes upon patent characteristics, we show that classification bias exists: the size, statistical significance, and direction of association of coefficients depend upon how a patent dataset has been classified. We recommend investigators adopt our approach to ensure future studies are comparable and replicable.
Keywords:	Innovation, Invention, Machine Learning, Patents, Patent Classification, Taxonomy, Economic History
JEL:	K11 N24 N74 O31 O33
Date:	2018
URL:	https://d.repec.org/n?u=RePEc:zbw:qucehw:201806

Day-ahead electricity price forecasting with high-dimensional structures: Univariate vs. multivariate modeling frameworks

By:	Florian Ziel; Rafal Weron
Abstract:	We conduct an extensive empirical study on short-term electricity price forecasting (EPF) to address the long-standing question of whether the optimal model structure for EPF is univariate or multivariate. We provide evidence that despite a minor edge in predictive performance overall, the multivariate modeling framework does not uniformly outperform the univariate one across all 12 considered datasets, seasons of the year or hours of the day, and at times is outperformed by the latter. This is an indication that combining advanced structures or the corresponding forecasts from both modeling approaches can bring a further improvement in forecasting accuracy. We show that this indeed can be the case, even for a simple averaging scheme involving only two models. Finally, we also analyze variable selection for the best performing high-dimensional lasso-type models, thus providing guidelines for structuring better performing forecasting model designs.
Date:	2018–05
URL:	https://d.repec.org/n?u=RePEc:arx:papers:1805.06649
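
Illustrative sketch (not from the paper): the abstract above mentions a simple averaging scheme involving only two models. Assuming that means an equal-weight average of two hourly price forecasts, it can be written as below. All prices are hypothetical and were chosen so that the two models' errors partly offset; they are not results from the study.

```python
def mean_absolute_error(actual, forecast):
    """Average absolute forecast error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def average_forecasts(*forecasts):
    """Equal-weight combination of any number of forecast series."""
    return [sum(values) / len(values) for values in zip(*forecasts)]


# Hypothetical day-ahead prices for four hours (illustration only).
actual = [40.0, 42.0, 45.0, 50.0]
univariate = [38.0, 44.0, 43.0, 53.0]
multivariate = [43.0, 41.0, 46.0, 48.0]

combined = average_forecasts(univariate, multivariate)

for name, series in [
    ("univariate", univariate),
    ("multivariate", multivariate),
    ("combined", combined),
]:
    print(name, mean_absolute_error(actual, series))
```

Averaging helps in this toy case only because the two series err in opposite directions; when both models miss in the same direction, the combination lands between them.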

Regime switching in the presence of endogeneity

By:	Tom Auld; Oliver Linton
Abstract:	We study the behaviour of the Betfair betting market and the sterling/dollar exchange rate (futures price) during 24 June 2016, the night of the EU referendum. We investigate how the two markets responded to the announcement of the voting results. We employ a Bayesian updating methodology to update prior opinion about the likelihood of the final outcome of the vote. We then relate the voting model to the real-time evolution of the market-determined prices as results are announced. We find that although both markets appear to be inefficient in absorbing the new information contained in vote outcomes, the betting market is apparently less inefficient than the FX market. The different rates of convergence to fundamental value between the two markets lead to highly profitable arbitrage opportunities.
Keywords:	EU Referendum, prediction markets, machine learning, efficient markets hypothesis, pairs trading, cointegration, Bayesian methods, exchange rates.
Date:	2018
URL:	https://d.repec.org/n?u=RePEc:msh:ebswps:2018-10
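
Illustrative sketch (not from the paper): the abstract above refers to Bayesian updating of a prior opinion as results are announced, without spelling out the model. One textbook way to do this, assumed here purely for illustration, is a normal prior on the national Leave share that is updated with a noisy signal after each declaration. The prior, the noise variance and the signals are all hypothetical.

```python
import math


def update_normal(prior_mean, prior_var, observation, noise_var):
    """Conjugate normal update for one noisy observation of the mean."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + observation / noise_var)
    return post_mean, post_var


def prob_above(mean, var, threshold):
    """P(X > threshold) for X ~ Normal(mean, var)."""
    z = (mean - threshold) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical prior on the national Leave share: mean 48%, sd 3 points.
mean, var = 0.48, 0.03 ** 2
probabilities = [prob_above(mean, var, 0.5)]

# Hypothetical national share implied by each declaration, sd 2 points.
for implied_share in [0.53, 0.52, 0.525]:
    mean, var = update_normal(mean, var, implied_share, 0.02 ** 2)
    probabilities.append(prob_above(mean, var, 0.5))

print([round(p, 3) for p in probabilities])
```

Each entry of `probabilities` is the model-implied chance of a Leave majority after that many declarations, the kind of quantity that could be set against betting odds or an exchange rate in real time.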

Économétrie & Machine Learning

By:	Arthur Charpentier (CREM - Centre de recherche en économie et management - UNICAEN - Université de Caen Normandie - NU - Normandie Université - UR1 - Université de Rennes 1 - UNIV-RENNES - Université de Rennes - CNRS - Centre National de la Recherche Scientifique); Emmanuel Flachaire (GREQAM - Groupement de Recherche en Économie Quantitative d'Aix-Marseille - ECM - Ecole Centrale de Marseille - CNRS - Centre National de la Recherche Scientifique - AMU - Aix Marseille Université - EHESS - École des hautes études en sciences sociales); Antoine Ly (UPE - Université Paris-Est)
Abstract:	Econometrics and machine learning seem to share a common goal: building a predictive model for a variable of interest using explanatory variables (or features). Yet the two fields developed in parallel, creating two different cultures, to paraphrase Breiman (2001a). The first aimed to build probabilistic models that describe economic phenomena. The second uses algorithms that learn from their mistakes, most often with the aim of classifying (sounds, images, etc.). Recently, however, learning models have proved more effective than traditional econometric techniques (at the cost of lower explanatory power) and, above all, they can handle much larger volumes of data. In this context, econometricians need to understand what these two cultures are, what sets them apart and, above all, what brings them together, so that they can adopt the tools developed by the statistical learning community and integrate them into econometric models.
Keywords:	machine learning, least squares, modelling, econometrics, big data
Date:	2018–05–25
URL:	https://d.repec.org/n?u=RePEc:hal:wpaper:hal-01568851

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.

General information on the NEP project can be found at https://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.

NEP’s infrastructure is sponsored by the Griffith Business School of Griffith University in Australia.