nep-big New Economics Papers
on Big Data
Issue of 2018‒09‒24
eighteen papers chosen by
Tom Coupé
University of Canterbury

  1. "Read My Lips": Using Automatic Text Analysis to Classify Politicians by Party and Ideology By Eitan Sapiro-Gheiler
  2. Occupational Classifications: A Machine Learning Approach By Ikudo, Akina; Lane, Julia; Staudt, Joseph; Weinberg, Bruce A.
  3. The Race for an Artificial General Intelligence: Implications for Public Policy By Naudé, Wim; Dimitri, Nicola
  4. Sharing Responsibility with a Machine By Oliver Kirchkamp; Christina Strobel
  5. The Double-Edged Sword of Global Integration: Robustness, Fragility & Contagion in the International Firm Network By Everett Grant
  6. Slanted images: Measuring nonverbal media bias By Boxell, Levi
  7. Measuring Bias in Consumer Lending By Will Dobbie; Andres Liberman; Daniel Paravisini; Vikram Pathania
  8. Big Data Econometrics: Now Casting and Early Estimates By Dario Buono; George Kapetanios; Massimiliano Marcellino; Gianluigi Mazzi; Fotis Papailias
  9. Machine Learning for Regularized Survey Forecast Combination: Partially-Egalitarian Lasso and its Derivatives By Francis X. Diebold; Minchul Shin
  10. Building on the Hamburg Statement and the G20 Roadmap for Digitalization: Toward a G20 framework for artificial intelligence in the workplace By Twomey, Paul
  11. Can Machine Learning Techniques Predict Non-performance of Farm Non-Real Estate Loans in the Ag Finance Databook By Mallory, Mindy; Kuethe, Todd; Hubbs, Todd
  12. Learning L2 Continuous Regression Functionals via Regularized Riesz Representers By Victor Chernozhukov; Whitney K Newey; Rahul Singh
  13. Nowcasting Food Stock Movement using Food Safety Related Web Search Queries By Asgari, Mahdi; Nemati, Mehdi; Zheng, Yuqing
  14. Affordable or Unaffordable? Social Housing Rent in Taipei By Ying-Hui Chiang
  15. Improve Naïve Bayesian Classifier by Using Genetic Algorithm for Arabic Document By Farah Zawaideh; Raed Sahawneh
  16. Wheat Nitrogen Response Conditional on Past Yield and Weather: A Step in Making Use of Big Data By Mills, Brian; Brorsen, Wade; Tostão, Emílio
  17. Comparative Study of Three Time Series Methods in Forecasting Dengue Hemorrhagic Fever Incidence in Thailand By Somsri Banditvilai; Siriluck Anansatitzin
  18. Markets for Information: An Introduction By Bergemann, Dirk; Bonatti, Alessandro

  1. By: Eitan Sapiro-Gheiler
    Abstract: The increasing digitization of political speech has opened the door to studying a new dimension of political behavior using text analysis. This work investigates the value of word-level statistical data from the US Congressional Record--which contains the full text of all speeches made in the US Congress--for studying the ideological positions and behavior of senators. Applying machine learning techniques, we use this data to automatically classify senators according to party, obtaining accuracy in the 70-95% range depending on the specific method used. We also show that using text to predict DW-NOMINATE scores, a common proxy for ideology, does not improve upon these already-successful results. This classification deteriorates when applied to text from sessions of Congress that are four or more years removed from the training set, pointing to a need on the part of voters to dynamically update the heuristics they use to evaluate party based on political speech. Text-based predictions are less accurate than those based on voting behavior, supporting the theory that roll-call votes represent greater commitment on the part of politicians and are thus a more accurate reflection of their ideological preferences. However, the overall success of the machine learning approaches studied here demonstrates that political speeches are highly predictive of partisan affiliation. In addition to these findings, this work also introduces the computational tools and methods relevant to the use of political speech data.
    Date: 2018–09
  2. By: Ikudo, Akina (University of California, Los Angeles); Lane, Julia (New York University); Staudt, Joseph (U.S. Census Bureau); Weinberg, Bruce A. (Ohio State University)
    Abstract: Characterizing the work that people do on their jobs is a longstanding and core issue in labor economics. Traditionally, classification has been done manually. If it were possible to combine new computational tools and administrative wage records to generate an automated crosswalk between job titles and occupations, millions of dollars could be saved in labor costs, data processing could be sped up, data could become more consistent, and it might be possible to generate, without a lag, current information about the changing occupational composition of the labor market. This paper examines the potential to assign occupations to job titles contained in administrative data using automated, machine-learning approaches. We use a new extraordinarily rich and detailed set of data on transactional HR records of large firms (universities) in a relatively narrowly defined industry (public institutions of higher education) to identify the potential for machine-learning approaches to classify occupations.
    Keywords: UMETRICS, occupational classifications, machine learning, administrative data, transaction data
    JEL: J0 J21 J24
    Date: 2018–08
  3. By: Naudé, Wim (Maastricht University); Dimitri, Nicola (University of Siena)
    Abstract: An arms race for an artificial general intelligence (AGI) would be detrimental for, and even pose an existential threat to, humanity if it results in an unfriendly AGI. In this paper an all-pay contest model is developed to derive implications for public policy to avoid such an outcome. It is established that in a winner-takes-all race, where players must invest in R&D, only the most competitive teams will participate. Given the difficulty of AGI, the number of competing teams is unlikely ever to be very large. It is also established that the intentions of teams competing in an AGI race, as well as the possibility of an intermediate prize, are important in determining the quality of the eventual AGI. The possibility of an intermediate prize will raise the quality of research but also the probability of finding the dominant AGI application, and hence will make public control more urgent. It is recommended that the danger of an unfriendly AGI can be reduced by taxing AI and by using public procurement. This would reduce the pay-off of contestants, raise the amount of R&D needed to compete, and coordinate and incentivize co-operation, all outcomes that would help alleviate the control and political problems in AI. Future research is needed to elaborate the design of systems of public procurement of AI innovation and to appropriately adjust the legal frameworks underpinning high-tech innovation, in particular dealing with patents created by AI.
    Keywords: artificial intelligence, innovation, technology, public policy
    JEL: O33 O38 O14 O15 H57
    Date: 2018–08
  4. By: Oliver Kirchkamp (FSU Jena, School of Economics); Christina Strobel (FSU Jena, School of Economics)
    Abstract: Humans make decisions jointly with others. They share responsibility for the outcome with their interaction partners. Today, more and more often the partner in a decision is not another human but, instead, a machine. Here we ask whether the type of the partner, machine or human, affects our responsibility, our perception of the choice and the choice itself. As a workhorse we use a modified dictator game with two joint decision makers: either two humans or one human and one machine. We find no treatment effect on perceived responsibility or guilt. We also find only a small and insignificant effect on actual choices.
    Keywords: Human-computer interaction, Experiment, Shared responsibility, Moral wiggle room
    JEL: C91 D63 D80
    Date: 2018–09–12
  5. By: Everett Grant (Federal Reserve Bank of Dallas)
    Abstract: We use daily equity returns to estimate global inter-firm networks across all major industries from 1981-2016 and test whether the network is robust or fragile, relating multinational firms' overall health with global integration. More connected firms are less likely to be in distress and have higher profit growth and equity returns, but are also more exposed to direct contagion from distressed neighboring firms and network level crises. Our machine learning analysis reveals the centrality of finance in the international firm network and increased globalization over time, with greater potential for crises to spread globally when they do occur.
    Date: 2018
  6. By: Boxell, Levi
    Abstract: I build a dataset of over one million images used on the front page of websites around the 2016 election period. I then use machine-learning tools to detect the faces of politicians across the images and measure the nonverbal emotional content expressed by each politician. Combining this with data on the partisan composition of each website’s users, I show that websites portray politicians that align with the partisan preferences of their users with more positive emotions. I also find that nonverbal coverage by Republican-leaning websites was not consistent over the 2016 election, but became more favorable towards Donald Trump after he clinched the Republican nomination.
    Keywords: media bias, images, emotions, nonverbal, polarization
    JEL: C0 H0 L82 L86
    Date: 2018–09–17
  7. By: Will Dobbie; Andres Liberman; Daniel Paravisini; Vikram Pathania
    Abstract: This paper tests for bias in consumer lending decisions using administrative data from a high-cost lender in the United Kingdom. We motivate our analysis using a simple model of bias in lending, which predicts that profits should be identical for loan applicants from different groups at the margin if loan examiners are unbiased. We identify the profitability of marginal loan applicants by exploiting variation from the quasi-random assignment of loan examiners. We find significant bias against both immigrant and older loan applicants when using the firm's preferred measure of long-run profits. In contrast, there is no evidence of bias when using a short-run measure used to evaluate examiner performance, suggesting that the bias in our setting is due to the misalignment of firm and examiner incentives. We conclude by showing that a decision rule based on machine learning predictions of long-run profitability can simultaneously increase profits and eliminate bias.
    JEL: J15 J16
    Date: 2018–08
  8. By: Dario Buono; George Kapetanios; Massimiliano Marcellino; Gianluigi Mazzi; Fotis Papailias
    Abstract: This paper aims at providing a primer on the use of big data in macroeconomic nowcasting and early estimation. We discuss: (i) a typology of big data characteristics relevant for macroeconomic nowcasting and early estimates, (ii) methods for features extraction from unstructured big data to usable time series, (iii) econometric methods that could be used for nowcasting with big data, (iv) some empirical nowcasting results for key target variables for four EU countries, and (v) ways to evaluate nowcasts and flash estimates. We conclude by providing a set of recommendations to assess the pros and cons of the use of big data in a specific empirical nowcasting context.
    Keywords: Big Data, Nowcasting, Early Estimates, Econometric Methods
    JEL: C32 C53
    Date: 2018
  9. By: Francis X. Diebold; Minchul Shin
    Abstract: Despite the clear success of forecast combination in many economic environments, several important issues remain incompletely resolved. The issues relate to selection of the set of forecasts to combine, and whether some form of additional regularization (e.g., shrinkage) is desirable. Against this background, and also considering the frequently-found good performance of simple-average combinations, we propose a LASSO-based procedure that sets some combining weights to zero and shrinks the survivors toward equality ("partially-egalitarian LASSO"). Ex-post analysis reveals that the optimal solution has a very simple form: The vast majority of forecasters should be discarded, and the remainder should be averaged. We therefore propose and explore direct subset-averaging procedures motivated by the structure of partially-egalitarian LASSO and the lessons learned, which, unlike LASSO, do not require choice of a tuning parameter. Intriguingly, in an application to the European Central Bank Survey of Professional Forecasters, our procedures outperform simple average and median forecasts – indeed they perform approximately as well as the ex-post best forecaster.
    JEL: C53
    Date: 2018–08
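    Editor's sketch: the subset-averaging idea in this abstract (discard most forecasters, average the survivors with equal weights, choose the subset size on held-out data) can be illustrated as follows. The data layout and function names are invented for illustration, not taken from the paper.

```python
def rmse(errors):
    """Root mean square error of a list of forecast errors."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

def subset_average_forecast(history, target, candidates_k, holdout):
    """history: dict forecaster -> list of past forecasts;
    target: list of realized values. Rank forecasters on the first
    `holdout` periods, then pick the subset size k (from candidates_k)
    whose equal-weight average does best on the remaining periods."""
    # Rank forecasters by RMSE on the ranking window.
    ranked = sorted(history, key=lambda f: rmse(
        [history[f][t] - target[t] for t in range(holdout)]))
    best_k, best_err = 1, float("inf")
    for k in candidates_k:
        team = ranked[:k]
        # Equal-weight average of the top-k forecasters, evaluated out of window.
        errs = [sum(history[f][t] for f in team) / k - target[t]
                for t in range(holdout, len(target))]
        err = rmse(errs)
        if err < best_err:
            best_k, best_err = k, err
    return ranked[:best_k]
</antml>```

    Unlike LASSO proper, this direct subset-averaging variant needs no tuning parameter beyond the candidate subset sizes, which matches the abstract's motivation.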
  10. By: Twomey, Paul
    Abstract: Building on the 2017 Hamburg Statement and the G20 Roadmap for Digitalization, this paper recommends a G20 framework for artificial intelligence in the workplace. It proposes high-level principles for such a framework, for G20 governments, to enable the smoother, internationally broader, and more socially acceptable introduction of big data and AI. The principles are dedicated to the workplace. The paper summarises the main issues behind the framework principles and suggests two paths towards adoption of a G20 framework for artificial intelligence in the workplace.
    Keywords: artificial intelligence, privacy, wealth distribution, workplace, regulation, political principles, workers, transparency, G20, heads of government, big data, Hamburg Statement
    JEL: K2 O3
    Date: 2018
  11. By: Mallory, Mindy; Kuethe, Todd; Hubbs, Todd
    Keywords: Agricultural Finance, Risk and Uncertainty
    Date: 2018–04–06
  12. By: Victor Chernozhukov; Whitney K Newey; Rahul Singh
    Abstract: Many objects of interest can be expressed as an L2 continuous functional of a regression, including average treatment effects, economic average consumer surplus, expected conditional covariances, and discrete choice parameters that depend on expectations. Debiased machine learning (DML) of these objects requires learning a Riesz representer (RR). We provide here Lasso and Dantzig learners of the RR and corresponding learners of affine and other nonlinear functionals. We give an asymptotic variance estimator for DML. We allow for a wide variety of regression learners that can converge at relatively slow rates. We give conditions for root-n consistency and asymptotic normality of the functional learner. We give results for non-affine functionals in addition to affine functionals.
    Date: 2018–09
  13. By: Asgari, Mahdi; Nemati, Mehdi; Zheng, Yuqing
    Abstract: Predicting financial market movements in today’s fast-paced and complex environment is more challenging than ever. For many investors, online resources are a major source of information. Researchers can use Google Trends to access the number of search queries on a particular topic by internet users. The search volume index provided by Google can then be used as a proxy for the importance of that topic. To predict the collective response to a particular piece of news, we can use the search index for relevant search terms in our forecasting model. The focus of our study is forecasting food stock movement. A unique feature of the food industry is that, besides common fundamental information, stakeholders are responsive to food safety news. In this study, we test whether including relevant search terms reduces forecasting error and improves the predictive power of traditional models. We use market data and the Google Trends index for 46 listed food companies. The empirical results show that, on average, the use of search terms reduces forecasting error by 2 to 31 percent for predicting trading volume, and by 3.5 to 77 percent for predicting the closing price, depending on the company. We also applied a model confidence set (MCS) approach to create a set of specifications with the statistically smallest forecasting error. The average forecasting error of the models in the set is lower than that of all models with search terms, which implies that the MCS approach is efficient in identifying the models with the best predictive power.
    Keywords: Agribusiness, Research Methods/ Statistical Methods
    Date: 2018–02–06
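    Editor's sketch: the basic exercise in this abstract, augmenting a conventional forecasting model with a search-volume regressor and comparing fit, can be illustrated with a hand-rolled least-squares fit. All series and names below are invented, not the paper's data.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols(X, y):
    """Least squares via the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y[t] for t, r in enumerate(X)) for i in range(k)]
    return solve(XtX, Xty)

def rmse(y, yhat):
    return (sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)) ** 0.5

# Invented series: daily trading volume and the prior day's search index.
volume = [10, 12, 15, 14, 18, 20, 19, 23, 25, 24]
search = [1, 2, 4, 3, 6, 7, 6, 9, 10, 9]

y = volume[1:]
base = [[1.0, float(volume[t])] for t in range(len(y))]                   # lag only
aug = [[1.0, float(volume[t]), float(search[t])] for t in range(len(y))]  # + search index

b0, b1 = ols(base, y), ols(aug, y)
err_base = rmse(y, [sum(c * x for c, x in zip(b0, r)) for r in base])
err_aug = rmse(y, [sum(c * x for c, x in zip(b1, r)) for r in aug])
# In-sample, adding a regressor can only lower (or match) the RMSE;
# the paper's out-of-sample comparisons are the substantive test.
</antml>```
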
  14. By: Ying-Hui Chiang
    Abstract: Current governmental data on rental housing, which cover only agency-registered rentals, do not reflect real market activity in the Taipei rental market. This study uses web scraping to collect big data on rentals; cleaning, analyzing, and mapping the data reveal spatial and temporal patterns across district housing markets in Taipei City. The rental market issue has become more important in Taipei as housing prices surge. The research builds a rent model to estimate the fair rent of different housing types, assesses rent affordability as the ratio between social housing rent and fair rent, and calculates rent burdens as the ratio between median household income and median rent across statistical areas. We use these two indicators, rent affordability and rent burden, to discuss social housing policy in Taipei. The findings capture the real rental market in Taipei and provide suggestions for social housing policy based on big data.
    Keywords: Housing Affordability; Housing Rent; Social Housing
    JEL: R3
    Date: 2018–01–01
  15. By: Farah Zawaideh (Irbid National University); Raed Sahawneh (Irbid National University)
    Abstract: Automatic text categorization (TC) has become one of the most interesting fields for researchers in data mining, information retrieval, web text mining, and natural language processing, due to the vast number of new documents being retrieved by various information retrieval systems. This paper proposes a new TC technique, which classifies Arabic-language text documents using a naïve Bayesian classifier combined with a genetic algorithm model; this algorithm classifies documents by generating a random sample of chromosomes that represent documents in the corpus. The developed model aims to enhance the naïve Bayesian classifier by applying the genetic algorithm model. Experimental results show that precision and recall increase when testing a higher number of documents; precision ranged from 0.8 to 0.97 across different testing environments. The number of genes placed in each chromosome was also tested, and experiments show that the best value is 50 genes.
    Keywords: Data mining, Text classification, Genetic algorithm, Naïve Bayesian Classifier, N-gram processing
    JEL: C80
    Date: 2018–06
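    Editor's sketch: a minimal multinomial naïve Bayes text classifier with Laplace smoothing, the base classifier this paper enhances. The genetic-algorithm feature-selection layer the paper adds on top is not shown, and the training documents below are invented.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs):
    """docs: list of (label, list-of-words). Returns per-label log priors
    and Laplace-smoothed per-word log likelihoods."""
    priors = Counter(lbl for lbl, _ in docs)
    words = defaultdict(Counter)
    vocab = set()
    for lbl, ws in docs:
        words[lbl].update(ws)
        vocab.update(ws)
    n = len(docs)
    model = {}
    for lbl in priors:
        total = sum(words[lbl].values())
        model[lbl] = (
            log(priors[lbl] / n),
            {w: log((words[lbl][w] + 1) / (total + len(vocab))) for w in vocab},
            log(1 / (total + len(vocab))),  # fallback for unseen words
        )
    return model

def classify(model, ws):
    """Return the label with the highest posterior log score."""
    def score(lbl):
        prior, lik, unseen = model[lbl]
        return prior + sum(lik.get(w, unseen) for w in ws)
    return max(model, key=score)
</antml>```

    A genetic algorithm along the lines the abstract describes would wrap this classifier, evolving chromosomes that select which documents (or features) enter training and scoring each chromosome by classification accuracy.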
  16. By: Mills, Brian; Brorsen, Wade; Tostão, Emílio
    Keywords: Agricultural Finance, Production Economics, Agribusiness
    Date: 2017–06–30
  17. By: Somsri Banditvilai (King Mongkut's Institute of Technology Ladkrabang); Siriluck Anansatitzin (King Mongkut's Institute of Technology Ladkrabang)
    Abstract: Accurate incidence forecasting of infectious diseases such as dengue hemorrhagic fever is critical for early prevention and detection of outbreaks. This research presents a comparative study of three different forecasting methods based on the monthly incidence of dengue hemorrhagic fever. The Holt-Winters method, the Box-Jenkins method, and Artificial Neural Networks were compared. The data were taken from the Bureau of Epidemiology, Department of Disease Control, Ministry of Public Health, covering January 2003 to December 2016. The data were divided into two sets. The first set, from January 2003 to December 2015, was used for constructing and selecting the forecasting models. The second set, from January 2016 to December 2016, was used for computing the accuracy of the forecasting models. The forecasting models were chosen by the smallest root mean square error (RMSE), and the mean absolute percentage error (MAPE) was used to measure forecasting accuracy. The results showed that Artificial Neural Networks obtained the smallest RMSE in the modeling process, and the MAPE in the forecasting process was 14.05%.
    Keywords: Dengue hemorrhagic fever, Time Series Forecasting, Holt-Winters method, Box-Jenkins method, Artificial Neural Networks
    JEL: C22 C45
    Date: 2018–06
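    Editor's sketch: the two accuracy criteria used in this study, RMSE for model selection and MAPE for forecast evaluation. The incidence numbers below are invented, not the Ministry of Public Health data.

```python
def rmse(actual, forecast):
    """Root mean square error, used for choosing among candidate models."""
    return (sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)) ** 0.5

def mape(actual, forecast):
    """Mean absolute percentage error, used to assess forecast accuracy."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

# Choose between two candidate models by smallest RMSE on the fitting
# period; MAPE would then be reported on the hold-out year.
actual = [120, 150, 180, 160]
model_a = [118, 155, 175, 162]
model_b = [100, 170, 200, 140]
best = min(("model_a", model_a), ("model_b", model_b),
           key=lambda m: rmse(actual, m[1]))[0]
</antml>```
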
  18. By: Bergemann, Dirk; Bonatti, Alessandro
    Abstract: We survey a recent and growing literature on markets for information. We offer a comprehensive view of information markets through an integrated model of consumers, information intermediaries, and firms. The model embeds a large set of applications ranging from sponsored search advertising to credit scores to information sharing among competitors. We then review a mechanism design approach to selling information in greater detail. We distinguish between ex ante sales of information (the buyer acquires an information structure) and ex post sales (the buyer pays for specific realizations). We relate this distinction to the different products that brokers, advertisers, and publishers use to trade consumer information online. We discuss the endogenous limits to the trade of information that derive from its potential adverse use for consumers. Finally, we revisit the role of recommender systems and artificial intelligence systems as markets for indirect information.
    Keywords: information design; information markets; intermediaries; mechanism design; predictions; ratings
    JEL: D42 D82 D83
    Date: 2018–08

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.