nep-big New Economics Papers
on Big Data
Issue of 2018‒02‒19
twelve papers chosen by
Tom Coupé
University of Canterbury

  1. Prediction, Judgment and Complexity By Ajay K. Agrawal; Joshua S. Gans; Avi Goldfarb
  2. Leadership in Scholarship: A Machine Learning Based Investigation of Editors' Influence on Textual Structure By Onder, Ali Sina; Popov, Sergey V; Schweitzer, Sascha
  3. Will Robots Automate Your Job Away? Full Employment, Basic Income, and Economic Democracy By Ewan McGaughey; Centre for Business Research
  4. Moral Values and Voting: Trump and Beyond By Benjamin Enke
  5. The Roots of Inequality: Estimating Inequality of Opportunity from Regression Trees By Paolo Brunori; Paul Hufe; Daniel Gerszon Mahler
  6. A portrait of innovative start-ups across countries By Stefano Breschi; Julie Lassébie; Carlo Menon
  7. Robust machine learning by median-of-means : theory and practice By Guillaume Lecué; Mathieu Lerasle
  8. Innovative events By Max Nathan; Anna Rosso
  9. Big data, computational science, economics, finance, marketing, management, and psychology: connections By Chia-Lin Chang; Wing-Keung Wong; Michael McAleer
  10. Probabilistic forecasting of the wind energy resource at the monthly to seasonal scale By Bastien Alonzo; Philippe Drobinski; Riwal Plougonven; Peter Tankov
  11. Data science applications to connected vehicles: Key barriers to overcome By Alvaro Gomez Losada
  12. L’intelligence économique au Maroc : l’apport d’une stratégie offensive de l’information au travers d’une analyse automatique des brevets By Nezha Cherrabi; Maud Pélissier; David Reymond

  1. By: Ajay K. Agrawal; Joshua S. Gans; Avi Goldfarb
    Abstract: We interpret recent developments in the field of artificial intelligence (AI) as improvements in prediction technology. In this paper, we explore the consequences of improved prediction in decision-making. To do so, we adapt existing models of decision-making under uncertainty to account for the process of determining payoffs. We label this process of determining the payoffs ‘judgment.’ There is a risky action, whose payoff depends on the state, and a safe action with the same payoff in every state. Judgment is costly; for each potential state, it requires thought on what the payoff might be. Prediction and judgment are complements as long as judgment is not too difficult. We show that in complex environments with a large number of potential states, the effect of improvements in prediction on the importance of judgment depends a great deal on whether the improvements in prediction enable automated decision-making. We discuss the implications of improved prediction in the face of complexity for automation, contracts, and firm boundaries.
    JEL: D81 D86 O33
    Date: 2018–01
  2. By: Onder, Ali Sina (University of Bayreuth); Popov, Sergey V (Cardiff Business School); Schweitzer, Sascha (University of Bayreuth)
    Abstract: Academic journals disseminate new knowledge, and editors of prominent journals are in a position to affect the direction and composition of research. Using machine learning procedures, we measure the influence of editors of the American Economic Review (AER) on the relative topic structure of papers published in the AER and other top general interest journals. We apply the topic analysis apparatus to the corpus of all publications in the Top 5 journals in Economics between 1976 and 2013, and also to the publications of the AER's editors during the same period. This enables us to observe the changes occurring over time in the relative frequency of topics covered by the AER and other leading general interest journals. We find that the assignment of a new editor tends to coincide with a change of topics in the AER in favour of the new editor's topics that cannot be explained away by shifts in overall research trends observed in other leading general interest journals.
    Keywords: Text Search; Topical Analysis; Academia; Knowledge Dissemination; Influence; Journals; Editors
    JEL: A11 A14 O3
    Date: 2018–01
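The topical analysis the abstract describes can be sketched in miniature. The following is a hypothetical, dictionary-based stand-in (the topic names and keyword lists are invented for illustration; the paper's actual machine-learning pipeline is richer): score each abstract by the share of its words falling in each topic's keyword list, then track those shares over journal-years and editor tenures.

```python
from collections import Counter

# Invented topic dictionaries; a real application would learn topics
# from the corpus rather than fix them by hand.
TOPICS = {
    "macro": {"inflation", "monetary", "policy", "rates"},
    "labor": {"wages", "unemployment", "labor", "bargaining"},
}

def topic_shares(text):
    """Share of a text's words that belong to each topic's keyword list."""
    words = text.lower().split()
    counts = Counter(words)
    return {t: sum(counts[w] for w in kws) / len(words)
            for t, kws in TOPICS.items()}

shares = topic_shares("Monetary policy and inflation under labor bargaining")
# shares["macro"] == 3/7, shares["labor"] == 2/7
```

Averaging such shares per journal and year gives the kind of topic-composition series whose breaks can be compared against editor turnover dates.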
  3. By: Ewan McGaughey; Centre for Business Research
    Abstract: Will the internet, robotics and artificial intelligence mean a 'jobless future'? A recent narrative says tomorrow's technology will fundamentally differ from cotton mills, steam engines, or washing machines. Automation will be less like post-WW2 demobilisation for soldiers, and more like the car for horses. Driverless vehicles will oust truckers and taxi drivers. Hyper-intelligent clouds will oust financial advisers, doctors, and journalists. We face more 'natural' or 'technological' unemployment than ever. Government, it is said, must enact a basic income, because so many jobs will vanish. Also, maybe robots should become 'electronic persons', the subjects of rights and duties, so they can be taxed. This narrative is endorsed by prominent tech-billionaires, but it is flawed. Everything depends on social policy. Instead of mass unemployment and a basic income, the law can achieve full employment and fair incomes. This article explains three views of the causes of unemployment: as 'natural', as stemming from irrationality or technology, or as caused by laws that let people restrict the supply of capital to the job market. Only the third view has any credible evidence to support it. After WW2, 42% of UK jobs were redundant (actually, not hypothetically) but social policy maintained full employment, and it can be done again. Unemployment is driven by inequality of wealth and of votes in the economy. Democratic governments should reprogramme the law: for full employment and universal fair incomes. The owners of the robots will not automate your job away, if we defend economic democracy.
    Keywords: Robots, automation, inequality, democracy, unemployment, basic income, NAIRU, sheep, Luddites, washing machines, flying skateboards
    JEL: E62 E6 E52 E51 E50 E32 E12 E00 E02 D6 J01 K1 J20 K31 J23 J32 K22 J41 J51 J58 J6
    Date: 2018–03
  4. By: Benjamin Enke
    Abstract: This paper studies the supply of and demand for moral values in recent U.S. presidential elections. The hypothesis is that people exhibit heterogeneity in their adherence to “individualizing” relative to “communal” moral values and that politicians' vote shares reflect the interaction of their relative moral appeal and the values of the electorate. To investigate the supply of morality, a text analysis of campaign documents classifies all candidates for the presidency since 2008 along the moral individualism vs. communalism dimension. On the demand-side, the analysis exploits two separate survey datasets to link the structure of voters' moral values to election outcomes, both across individuals within counties and across counties within states or commuting zones. The results document that heterogeneity in moral values is systematically related to voting behavior in ways that are predicted by supply-side text analyses. For example, Donald Trump's rhetoric exhibits the largest communal moral appeal among all recent presidential nominees. This pattern is matched on the demand-side, where communal values are strongly correlated with votes for Trump in the primaries, the difference in votes between Trump and past Republicans in the presidential election, and increases in voter turnout in 2016. Similarly tight connections between supply- and demand-side analyses hold for almost all contenders for the presidency in recent years, hence suggesting that morality is a key determinant of election outcomes more generally. Still, a key difference between 2016 and earlier elections appears to be the salience of moral threat in political language.
    JEL: D03 D72
    Date: 2018–01
  5. By: Paolo Brunori; Paul Hufe; Daniel Gerszon Mahler
    Abstract: We propose a set of new methods to estimate inequality of opportunity based on conditional inference regression trees. In particular, we illustrate how these methods represent a substantial improvement over existing empirical approaches to measuring inequality of opportunity. First, they minimize the risk of arbitrary and ad-hoc model selection. Second, they provide a standardized way of trading off upward and downward biases in inequality of opportunity estimations. Finally, regression trees can be graphically represented; their structure is easy to read and understand. This makes the measurement of inequality of opportunity more easily comprehensible to a large audience. These advantages are illustrated by an empirical application based on the 2011 wave of the European Union Statistics on Income and Living Conditions.
    Keywords: Equality of opportunity; machine learning; random forests.
    JEL: D31 D63 C38
    Date: 2018
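The logic behind tree-based estimates of inequality of opportunity can be illustrated with a toy example. Assuming a single, hypothetical circumstance variable (parental education) and a fixed split — a conditional inference tree would instead select splits data-dependently — inequality of opportunity is the inequality of the "smoothed" distribution in which everyone receives the mean income of their circumstance type:

```python
import math

# Toy sample: one circumstance (parental education) and invented incomes.
people = [
    {"parent_edu": "low",  "income": 20.0},
    {"parent_edu": "low",  "income": 30.0},
    {"parent_edu": "high", "income": 50.0},
    {"parent_edu": "high", "income": 60.0},
]

def mld(incomes):
    """Mean log deviation, a standard inequality index in this literature."""
    mu = sum(incomes) / len(incomes)
    return sum(math.log(mu / y) for y in incomes) / len(incomes)

# Group incomes by "type" (a leaf of the tree) and build the smoothed
# distribution in which each person gets their type's mean income.
type_incomes = {}
for p in people:
    type_incomes.setdefault(p["parent_edu"], []).append(p["income"])
smoothed = [sum(type_incomes[p["parent_edu"]]) / len(type_incomes[p["parent_edu"]])
            for p in people]

iop = mld(smoothed)  # inequality of opportunity: the between-type MLD
```

With a learned tree, the leaves would replace the hand-picked split, which is exactly where the paper argues trees remove arbitrary model-selection choices.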
  6. By: Stefano Breschi; Julie Lassébie; Carlo Menon (OECD)
    Abstract: The report presents new cross-country descriptive evidence on innovative start-ups and related venture capital investments drawing upon Crunchbase, a new dataset that is unprecedented in terms of scope and comprehensiveness. The analysis employs a mix of different statistical techniques (descriptive graphics, econometric analysis, and machine learning) to highlight a number of findings. First, there are significant cross-country differences in the professional and educational background of start-ups’ founders, notably in the share of founders with previous academic experience and in the share of “serial entrepreneurs”. Conversely, the founders’ average age is rather constant across countries, but shows a fair degree of variability across sectors. Second, IP assets, and in particular the presence of an inventor in the team of founders, are strongly associated with start-ups’ success. Finally, female founders are less likely to receive funding, receive lower amounts when they do receive financing, and have a lower probability of successful exit, when other factors are controlled for.
    Date: 2018–02–08
  7. By: Guillaume Lecué (CREST; CNRS; Université Paris Saclay); Mathieu Lerasle (CNRS,département de mathématiques d’Orsay)
    Abstract: We introduce new estimators for robust machine learning based on median-of-means (MOM) estimators of the mean of real-valued random variables. These estimators achieve optimal rates of convergence under minimal assumptions on the dataset. The dataset may also have been corrupted by outliers, on which no assumption is made. We also analyze these new estimators with standard tools from robust statistics. In particular, we revisit the concept of breakdown point. We modify the original definition by studying the number of outliers that a dataset can contain without deteriorating the estimation properties of a given estimator. This new notion of breakdown number, which takes into account the statistical performance of the estimators, is non-asymptotic in nature and adapted to machine learning purposes. We prove that the breakdown number of our estimator is of the order of the number of observations times the rate of convergence. For instance, the breakdown number of our estimator for the problem of estimating a d-dimensional vector with noise variance σ² is σ²d, and it becomes σ²s log(ed/s) when the vector has only s non-zero components. Beyond this breakdown point, we prove that the rate of convergence achieved by our estimator is the number of outliers divided by the number of observations. Besides these theoretical guarantees, the major improvement brought by these new estimators is that they are easily computable in practice. In fact, essentially any algorithm used to approximate the standard Empirical Risk Minimizer (or its regularized versions) has a robust version approximating our estimators. On top of being robust to outliers, the MOM versions of the algorithms are even faster than the original ones, less demanding in memory resources in some situations, and well adapted to distributed datasets, which makes them particularly attractive for large-scale data analysis. As a proof of concept, we study many algorithms for the classical LASSO estimator. It turns out that the original algorithm can be improved a lot in practice by randomizing the blocks on which "local means" are computed at each step of the descent algorithm. A byproduct of this modification is that our algorithms come with a measure of depth of data that can be used to detect outliers, which is another major issue in machine learning.
    Date: 2017–11–01
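The core MOM construction is simple enough to sketch directly. A minimal version on toy data (the block count k = 11 is an arbitrary illustration, not the paper's tuning rule):

```python
import random
import statistics

def median_of_means(xs, k):
    """Split a shuffled sample into k blocks, average each block, and
    return the median of the block means; up to roughly k/2 blocks can
    be contaminated by outliers without moving the median much."""
    xs = list(xs)
    random.shuffle(xs)  # randomised blocks, in the spirit of the paper's variant
    blocks = [xs[i::k] for i in range(k)]
    return statistics.median(statistics.fmean(b) for b in blocks)

random.seed(0)
data = [random.gauss(0, 1) for _ in range(1000)] + [1e6] * 5  # 5 gross outliers

plain_mean = statistics.fmean(data)   # dragged to ~5000 by the outliers
mom_mean = median_of_means(data, 11)  # stays near the true mean, 0
```

The 5 contaminated values can sit in at most 5 of the 11 blocks, so the median block mean is always computed from clean blocks — the non-asymptotic robustness the abstract's breakdown number quantifies.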
  8. By: Max Nathan (University of Birmingham); Anna Rosso (University of Milan)
    Abstract: Policymakers need to understand innovation in high-profile sectors like technology. This can be surprisingly hard to observe. We combine UK administrative microdata, media and website content to develop experimental measures of firm innovation – new product/service launches – that complement existing metrics. We then explore the innovative performance of technology sector SMEs – firms also of great policy interest – using panel fixed effects settings, comparing conventional and machine-learning-based definitions of industry space. For companies with event coverage, tech SMEs are substantially more launch-active than non-tech firms, with suggestive evidence of firm-city interactions. We use instruments and reweighting to handle underlying event exposure probabilities.
    Keywords: innovation, ICT, data science
    JEL: L86
    Date: 2017–10–09
  9. By: Chia-Lin Chang (Department of Applied economics, Department of Finance National Chung Hsing University, Taiwan.); Wing-Keung Wong (Department of Finance, Fintech Center, and Big Data Research Center, Asia University, Taiwan and Department of Medical Research, China Medical University Hospital, Taiwan And Department of Economics and Finance, Hang Seng Management College, Hong Kong, China and Department of Economics, Lingnan University, Hong Kong, China.); Michael McAleer (Department of Quantitative Finance National Tsing Hua University, Taiwan and Econometric Institute Erasmus School of Economics Erasmus University Rotterdam, The Netherlands and Department of Quantitative Economics Complutense University of Madrid, Spain And Institute of Advanced Sciences Yokohama National University, Japan.)
    Abstract: The paper provides a review of the literature that connects Big Data, Computational Science, Economics, Finance, Marketing, Management, and Psychology, and discusses some research that is related to the seven disciplines. Academics could develop theoretical models and subsequent econometric and statistical models to estimate the parameters in the associated models, as well as conduct simulations to examine whether the estimators in their theories of estimation and hypothesis testing have good size and high power. Thereafter, academics and practitioners could apply theory to analyse some interesting issues in the seven disciplines and cognate areas.
    Keywords: Big Data, Computational science, Economics, Finance, Management, Theoretical models, Econometric and statistical models, Applications.
    JEL: A10 G00 G31 O32
    Date: 2018–01
  10. By: Bastien Alonzo (IPSL; LMD; CNRS; Ecole Polytechnique; Université de Paris-Saclay; Laboratoire de Probabilités et Modéles Aléatoires, Université Paris Diderot-Paris 7); Philippe Drobinski (IPSL; LMD; CNRS; Ecole Polytechnique; Université de Paris-Saclay); Riwal Plougonven (IPSL; LMD; CNRS; Ecole Polytechnique; Université de Paris-Saclay); Peter Tankov (CREST; ENSAE ParisTech)
    Abstract: We build and evaluate a probabilistic model designed for forecasting the distribution of the daily mean wind speed at the seasonal timescale in France. On such long-term timescales, the variability of the surface wind speed is strongly influenced by the atmospheric large-scale situation. Our aim is to predict the daily mean wind speed distribution at a specific location using information on the atmospheric large-scale situation, summarized by an index. To this end, we estimate, over 20 years of daily data, the conditional probability density function of the wind speed given the index. We next use the ECMWF seasonal forecast ensemble to predict the atmospheric large-scale situation and the index at the seasonal timescale. We show that the model is sharper than the climatology at the monthly horizon, even if it displays a strong loss of precision after 15 days. Using a statistical postprocessing method to recalibrate the ensemble forecast leads to further improvement of our probabilistic forecast, which then remains sharper than the climatology at the seasonal horizon.
    Keywords: Wind energy, Wind speed forecasting, Seasonal forecasting, Probabilistic forecasting, Ensemble forecasts, Ensemble model output statistics
    Date: 2017–10–11
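The conditioning step can be illustrated with a crude, self-contained stand-in: the paper estimates a conditional probability density of wind speed given a large-scale index, while the sketch below (synthetic data, invented coefficients) simply reads empirical quantiles off the historical days whose index was close to the forecast index.

```python
import random

# Synthetic 20-year-style history: a large-scale index and a wind speed
# that depends on it (the coefficients are invented for illustration).
rng = random.Random(1)
history = []
for _ in range(5000):
    idx = rng.gauss(0, 1)                            # circulation index
    speed = max(0.0, 5 + 2 * idx + rng.gauss(0, 1))  # daily mean wind speed
    history.append((idx, speed))

def conditional_quantiles(history, idx_forecast, width=0.25, qs=(0.1, 0.5, 0.9)):
    """Empirical quantiles of wind speed on days whose index was within
    `width` of the forecast index: a binned stand-in for a conditional pdf."""
    pool = sorted(s for i, s in history if abs(i - idx_forecast) < width)
    return [pool[int(q * (len(pool) - 1))] for q in qs]

# Probabilistic forecast for a day whose predicted index is 1.0.
q10, q50, q90 = conditional_quantiles(history, idx_forecast=1.0)
```

Replacing the point forecast of the index with the ECMWF ensemble of index forecasts, and recalibrating, yields the kind of seasonal-scale probabilistic forecast the paper evaluates against climatology.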
  11. By: Alvaro Gomez Losada (European Commission - JRC)
    Abstract: Connected vehicles will generate huge amounts of pervasive, real-time data at very high frequencies. This poses new challenges for data science. How to analyse these data, and how to address their short-term and long-term storage, are some of the key barriers to overcome.
    Keywords: data recording, digital technology, intelligent transport system, new technology, recording equipment, research report, road safety, road transport, satellite navigation, speed control, technical specification, technical standard, traffic control, transport regulations, vehicle.
    Date: 2017–12
  12. By: Nezha Cherrabi (I3M - Laboratoire Information, Milieux, Médias, Médiations - EA 3820 - UTLN - Université de Toulon - Université Nice Sophia Antipolis [UNS] : EA3820); Maud Pélissier (CLILLAC-ARP EA 3967 - Centre de Linguistique Inter-langues, de Lexicologie, de Linguistique Anglaise et de Corpus - UPD7 - Université Paris Diderot - Paris 7); David Reymond (MICA - Médiations, Informations, Communication, Arts - Université Michel de Montaigne - Bordeaux 3 - Université Bordeaux Montaigne)
    Abstract: The opening of the Moroccan economy following free-trade agreements initially weakened the country by creating a large trade deficit, as exports grew very slowly and relied mainly on price competitiveness. Gradually, the country's governing bodies implemented a policy aimed at modernising the national industrial fabric so as to attract foreign investors, but also to develop internationally oriented activities and the production of quality products with a strong R&D component. Over the last ten years, the non-price competitiveness of Moroccan exports has increased significantly as a result of these innovation efforts. Innovative sectors have emerged: renewable energy, logistics, the automotive industry, aeronautics. Extractive industries have moved upmarket and have positioned Morocco as an exporter of chemical products (fertilisers, halogen salts, etc.). To support this offensive industrial policy, the country is also gradually acquiring competitive-intelligence mechanisms as decision-support tools to strengthen the competitiveness of its SMEs, which make up more than 90% of its productive fabric. In recent years, multiple initiatives have been taken to implement such a policy but, to this day, challenges remain in fostering the innovation dynamics of SMEs. It is a practice still in gestation and largely compartmentalised (Achchab and Ahdil, 2015). Among the various strands of a competitive-intelligence policy, the patent occupies a leading place. It is, however, often exploited from a defensive standpoint, with the aim of making SMEs in particular aware of the importance of protecting their informational assets, the key to their competitiveness. The example of Moroccan babouches (Bredeloup and Bertoncello, 2006) and the Chinese attack on this so-called "terroir" product shows the importance of integrating a global analysis of patents on Moroccan territory as a strategic source of information for companies. But it increasingly appears that the patent can also be used in an offensive informational strategy, thereby becoming an indispensable element guiding the innovation dynamics of SMEs (Shih, Liu, and Hsu, 2010). In the specific case of developing countries, such an offensive patent-information strategy can help to "improve existing products, add value to natural resources, and to the machines and first-stage processing methods concerned" (Dou and Leveillé, 2015). This new perspective is made possible by the development of software enabling the automatic analysis of patents based on a big-data logic. In this perspective, we present here the contribution of a tool, Patent2Net (Reymond and Quoniam, 2014), which makes it possible to crawl the universe of patents in the context of a patent analysis in Morocco. It is free software under an open licence (CECILL-B), produced by the I3M and IRSIC information and communication science laboratories and an international team of university professors and researchers (ibid.). We show how an analysis of patent metadata (applicants, inventors, filing dates, countries of protection, filing offices, etc.), and of the networks between applicants, between inventors, and between citing and cited patents, can provide strategic information on the technologies and knowledge used by inventors and, as such, constitutes a strategic lever both for government institutions and for companies.
    Keywords: data analysis, competitive intelligence, patent information, automatic patent analysis, patent2net, infometrics
    Date: 2016–05–12

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at . For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.