nep-big New Economics Papers
on Big Data
Issue of 2021‒02‒15
twenty papers chosen by
Tom Coupé
University of Canterbury

  1. FinTech in Financial Inclusion: Machine Learning Applications in Assessing Credit Risk By Majid Bazarbash
  2. HANA: A HAndwritten NAme Database for Offline Handwritten Text Recognition By Christian M. Dahl; Torben Johansen; Emil N. Sørensen; Simon Wittrock
  3. Discrete Choice Analysis with Machine Learning Capabilities By Youssef M. Aboutaleb; Mazen Danaf; Yifei Xie; Moshe Ben-Akiva
  4. Absolute Value Constraint: The Reason for Invalid Performance Evaluation Results of Neural Network Models for Stock Price Prediction By Yi Wei; Cristian Tiu; Vipin Chaudhary
  5. Predicting Recession Probabilities Using Term Spreads: New Evidence from a Machine Learning Approach By Jaehyuk Choi; Desheng Ge; Kyu Ho Kang; Sungbin Sohn
  6. The present vision of AI… or the HAL syndrome By Pierre-Jean Benghozi; Hugues Chevalier
  7. Solving optimal stopping problems with Deep Q-Learning By John Ery; Loris Michel
  8. Modeling surrender risk in life insurance: theoretical and experimental insight By Mark Kiermayer
  9. Comparing conventional and machine-learning approaches to risk assessment in domestic abuse cases By Jeffrey Grogger; Ria Ivandic; Tom Kirchmaier
  10. Constructing Daily Economic Sentiment Indices Based on Google Trends By Vera Eichenauer; Ronald Indergand; Isabel Z. Martínez; Christoph Sax
  11. Measuring National Happiness with Music By Benetos, Emmanouil; Ragano, Alessandro; Sgroi, Daniel; Tuckwell, Anthony
  12. Machine Learning for Strategic Inference By In-Koo Cho; Jonathan Libgober
  13. Modelling Sovereign Credit Ratings: Evaluating the Accuracy and Driving Factors using Machine Learning Techniques By Bart H. L. Overes; Michel van der Wel
  14. Tree-based Node Aggregation in Sparse Graphical Models By Ines Wilms; Jacob Bien
  15. Won't Get Fooled Again: A Supervised Machine Learning Approach for Screening Gasoline Cartels By Douglas Silveira; Silvinha Vasconcelos; Marcelo Resende; Daniel O. Cajueiro
  16. Nowcasting Monthly GDP with Big Data: a Model Averaging Approach By Tommaso Proietti; Alessandro Giovannelli
  17. Does Fake News Affect Voting Behaviour? By Michele Cantarella; Nicolò Fraccaroli; Roberto Volpe
  18. Rise of the Machines: The Impact of Automated Underwriting By Jansen, Mark; Nguyen, Hieu; Shams, Amin
  19. News-based Sentiment Indicators By Chengyu Huang; Sean Simpson; Daria Ulybina; Agustin Roitman
  20. A Simplified measure of nutritional empowerment using machine learning to abbreviate the Women's Empowerment in Nutrition Index (WENI) By Shree Saha; Sudha Narayanan

  1. By: Majid Bazarbash
    Abstract: Recent advances in digital technology and big data have allowed FinTech (financial technology) lending to emerge as a potentially promising solution to reduce the cost of credit and increase financial inclusion. However, machine learning (ML) methods that lie at the heart of FinTech credit have remained largely a black box for the nontechnical audience. This paper contributes to the literature by discussing potential strengths and weaknesses of ML-based credit assessment through (1) presenting core ideas and the most common techniques in ML for the nontechnical audience; and (2) discussing the fundamental challenges in credit risk analysis. FinTech credit has the potential to enhance financial inclusion and outperform traditional credit scoring by (1) leveraging nontraditional data sources to improve the assessment of the borrower’s track record; (2) appraising collateral value; (3) forecasting income prospects; and (4) predicting changes in general conditions. However, because of the central role of data in ML-based analysis, data relevance should be ensured, especially in situations when a deep structural change occurs, when borrowers may counterfeit certain indicators, and when agency problems arising from information asymmetry cannot be resolved. To avoid digital financial exclusion and redlining, variables that trigger discrimination should not be used to assess credit rating.
    Keywords: Credit risk;Credit;Credit ratings;Loans;Machine learning;WP,ML model,bears risk,machine learning technique,ML analysis,ML evaluation
    Date: 2019–05–17
  2. By: Christian M. Dahl; Torben Johansen; Emil N. Sørensen; Simon Wittrock
    Abstract: Methods for linking individuals across historical data sets, typically in combination with AI-based transcription models, are developing rapidly. Probably the single most important identifier for linking is the personal name. However, personal names are prone to enumeration and transcription errors, and although modern linking methods are designed to handle such challenges, these sources of error are critical and should be minimized. For this purpose, improved transcription methods and large-scale databases are crucial components. This paper describes and provides documentation for HANA, a newly constructed large-scale database which consists of more than 1.1 million images of handwritten word-groups. The database is a collection of personal names, containing more than 105 thousand unique names with a total of more than 3.3 million examples. In addition, we present benchmark results for deep learning models that can automatically transcribe the personal names from the scanned documents. Focusing mainly on personal names, due to their vital role in linking, we hope to foster more sophisticated, accurate, and robust models for handwritten text recognition by making more challenging large-scale databases publicly available. This paper describes the data source, the collection process, and the image-processing procedures and methods involved in extracting the handwritten personal names, and handwritten text in general, from the forms.
    Date: 2021–01
  3. By: Youssef M. Aboutaleb; Mazen Danaf; Yifei Xie; Moshe Ben-Akiva
    Abstract: This paper discusses capabilities that are essential to models applied in policy analysis settings and the limitations of direct applications of off-the-shelf machine learning methodologies to such settings. Traditional econometric methodologies for building discrete choice models for policy analysis involve combining data with modeling assumptions guided by subject-matter considerations. Such considerations are typically most useful in specifying the systematic component of random utility discrete choice models but are typically of limited aid in determining the form of the random component. We identify an area where machine learning paradigms can be leveraged, namely in specifying and systematically selecting the best specification of the random component of the utility equations. We review two recent novel applications where mixed-integer optimization and cross-validation are used to algorithmically select optimal specifications for the random utility components of nested logit and logit mixture models subject to interpretability constraints.
    Date: 2021–01
  4. By: Yi Wei; Cristian Tiu; Vipin Chaudhary
    Abstract: Neural networks for stock price prediction (NNSPP) have been popular for decades. However, most study results remain confined to research papers and have played little role in the securities market. One of the main reasons for this is that evaluation results based on prediction error (PE) have statistical flaws: they cannot represent the most critical financial attribute, the direction of price changes, and so cannot provide investors with convincing, interpretable, and consistent model performance evaluations for practical applications in the securities market. To illustrate, we use data from 20 stock datasets over six years from the Shanghai and Shenzhen stock markets in China, and 20 stock datasets from NASDAQ and the NYSE in the USA. We implement six shallow and deep neural networks to predict stock prices and use four prediction error measures for evaluation. The results show that the prediction error value only partially reflects the model accuracy of the stock price prediction, and cannot reflect changes in the direction of the model's predicted stock price. This characteristic makes PE unsuitable as an evaluation indicator for NNSPP; otherwise, it brings huge potential risks to investors. Therefore, this paper establishes an experimental platform to confirm that the PE method is not suitable for NNSPP evaluation, and provides a theoretical basis for the necessity of creating a new NNSPP evaluation method in the future.
    Date: 2021–01
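The paper's central claim — that a small prediction error can coexist with nearly useless directional forecasts — is easy to reproduce on synthetic data. The sketch below (illustrative numbers only, not the authors' datasets or code) scores a hypothetical forecaster whose error is small relative to the price level but large relative to the typical daily move:

```python
import random

random.seed(5)

# random-walk prices with small daily moves; the forecaster's error is
# larger than the typical move (all magnitudes are invented)
prices = [100.0]
for _ in range(500):
    prices.append(prices[-1] + random.gauss(0, 0.2))

preds = [p + random.gauss(0, 1.0) for p in prices[1:]]   # forecast of tomorrow

# prediction error looks excellent: about 1% of the price level
rmse = (sum((f - a) ** 2 for f, a in zip(preds, prices[1:])) / 500) ** 0.5
rel_err = rmse / (sum(prices) / len(prices))

# directional accuracy: did the forecast get the sign of the move right?
hits = sum((f - prev > 0) == (a - prev > 0)
           for f, a, prev in zip(preds, prices[1:], prices[:-1]))
dir_acc = hits / 500
```

On this draw the relative RMSE is around one percent of the price level while directional accuracy hovers near a coin flip — exactly the dissociation between PE and direction that the abstract describes.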
  5. By: Jaehyuk Choi; Desheng Ge; Kyu Ho Kang; Sungbin Sohn
    Abstract: The literature on using yield curves to forecast recessions typically measures the term spread as the difference between the 10-year and the three-month Treasury rates. Furthermore, using the term spread constrains the long- and short-term interest rates to have the same absolute effect on the recession probability. In this study, we adopt a machine learning method to investigate whether the predictive ability of interest rates can be improved. The machine learning algorithm identifies the best maturity pair, separating the effects of interest rates from those of the term spread. Our comprehensive empirical exercise shows that, despite the likelihood gain, the machine learning approach does not significantly improve the predictive accuracy, owing to the estimation error. Our finding supports the conventional use of the 10-year minus three-month Treasury yield spread. This is robust to the forecasting horizon, control variables, sample period, and oversampling of the recession observations.
    Date: 2021–01
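The conventional benchmark the authors end up defending — a binary recession model driven by the term spread — can be sketched as a logistic regression fitted by gradient ascent. The data below are simulated so that a narrower spread raises the recession probability by construction; this is not the authors' sample or code:

```python
import math
import random

random.seed(0)

# simulate: recession probability falls as the 10y-3m spread widens
spreads = [random.uniform(-1.0, 3.0) for _ in range(2000)]
recess = [1 if random.random() < 1 / (1 + math.exp(-(-1.0 - 1.5 * s))) else 0
          for s in spreads]

# fit P(recession) = sigmoid(b0 + b1 * spread) by gradient ascent on the
# log-likelihood
b0, b1, lr = 0.0, 0.0, 0.5
for _ in range(3000):
    g0 = g1 = 0.0
    for s, y in zip(spreads, recess):
        p = 1 / (1 + math.exp(-(b0 + b1 * s)))
        g0 += y - p
        g1 += (y - p) * s
    b0 += lr * g0 / len(spreads)
    b1 += lr * g1 / len(spreads)
```

The fitted slope on the spread comes out clearly negative, reproducing the stylized fact the paper starts from: an inverted yield curve signals elevated recession risk.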
  6. By: Pierre-Jean Benghozi (i3-CRG - Centre de recherche en gestion i3 - X - École polytechnique - Université Paris-Saclay - CNRS - Centre National de la Recherche Scientifique); Hugues Chevalier (Institut de l'Iconomie, Bordeaux)
    Abstract: Purpose: The HAL syndrome names a pathology of analysts and commentators dealing with the stakes and risks of AI: stressing the omnipotence of technology and its expected performance, the autonomy of machines, the problem of human control, and anthropomorphism in discussions of usage. The perception of new uses, the capacity to appropriate the digital dimension, and the very conception of applications, terminals, and infrastructures are strongly structured by the shared visions of technology that spread within society. Design/methodology/approach: Analyzing fictional content such as "2001: A Space Odyssey" and the forward-looking vision of AI it offers helps characterize the deep ambiguity of AI. HAL, the computer of 2001, helps us understand that AI is just an umbrella term covering very different configurations and systems. HAL's power to inspire stems from its belonging to an identifiable genre, fiction, a privileged container for projecting phantasms about unknown future domains. Findings: The HAL syndrome leads us to relativize the omnipotence granted to technology and willingly circulated both by digital companies and by transhumanist thinkers who advocate the use of science and technology, including IT, to enhance the human condition. Originality/value: The HAL syndrome, as it continues to influence our minds, underlies the questions, concerns, and enthusiasms triggered by AI. It therefore calls for original reflection on the need for, and modalities of, regulating the current technological dynamics.
    Keywords: Artificial intelligence,Regulation
    Date: 2019–05–13
  7. By: John Ery; Loris Michel
    Abstract: We propose a reinforcement learning (RL) approach to model optimal exercise strategies for option-type products. We pursue the RL avenue in order to learn the optimal action-value function of the underlying stopping problem. In addition to retrieving the optimal Q-function at any time step, one can also price the contract at inception. We first discuss the standard setting with one exercise right, and later extend this framework to the case of multiple stopping opportunities in the presence of constraints. We propose to approximate the Q-function with a deep neural network, which does not require the specification of basis functions as in the least-squares Monte Carlo framework and is scalable to higher dimensions. We derive a lower bound on the option price obtained from the trained neural network and an upper bound from the dual formulation of the stopping problem, which can also be expressed in terms of the Q-function. Our methodology is illustrated with examples covering the pricing of swing options.
    Date: 2021–01
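The Bellman logic behind the approach — learn Q-values for "stop" versus "continue" and exercise whenever the immediate payoff beats the continuation value — can be illustrated with a tiny tabular stand-in for the paper's deep network. The offer process below is a toy (i.i.d. uniform offers over three exercise dates), not the authors' model:

```python
import random

random.seed(1)

T, VALUES = 3, (0, 1, 2, 3, 4)   # 3 exercise dates, i.i.d. uniform offers
STOP, CONT = 0, 1
Q = {(t, v): [0.0, 0.0] for t in range(T) for v in VALUES}
alpha, eps = 0.1, 0.2

for _ in range(50_000):
    t, v = 0, random.choice(VALUES)
    while True:
        if t == T - 1:                       # must exercise at the horizon
            a = STOP
        elif random.random() < eps:          # epsilon-greedy exploration
            a = random.randrange(2)
        else:
            a = STOP if Q[(t, v)][STOP] >= Q[(t, v)][CONT] else CONT
        if a == STOP:                        # stopping pays the current offer
            Q[(t, v)][STOP] += alpha * (v - Q[(t, v)][STOP])
            break
        nt, nv = t + 1, random.choice(VALUES)
        target = max(Q[(nt, nv)])            # one-step Q-learning target
        Q[(t, v)][CONT] += alpha * (target - Q[(t, v)][CONT])
        t, v = nt, nv
```

The learned rule has the threshold structure of an optimal exercise policy: on the top offer, stopping dominates continuing; on the lowest offer, the continuation value dominates.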
  8. By: Mark Kiermayer
    Abstract: Surrender poses one of the major risks to life insurance and a sound modeling of its true probability has direct implication on the risk capital demanded by the Solvency II directive. We add to the existing literature by performing extensive experiments that present highly practical results for various modeling approaches, including XGBoost and neural networks. Further, we detect shortcomings of prevalent model assessments, which are in essence based on a confusion matrix. Our results indicate that accurate label predictions and a sound modeling of the true probability can be opposing objectives. We illustrate this with the example of resampling. While resampling is capable of improving label prediction in rare event settings, such as surrender, and thus is commonly applied, we show theoretically and numerically that models trained on resampled data predict significantly biased event probabilities. Following a probabilistic perspective on surrender, we further propose time-dependent confidence bands on predicted mean surrender rates as a complementary assessment and demonstrate its benefit. This evaluation takes a very practical, going concern perspective, which respects that the composition of a portfolio might change over time.
    Date: 2021–01
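The resampling bias the authors prove can be demonstrated in a few lines: duplicating every positive K times inflates every predicted odds by a factor of K, so probabilities read off a resampled sample are biased upward until the inflation is undone. The sketch uses empirical group frequencies as a stand-in for a fitted classifier, with invented surrender rates:

```python
import random

random.seed(0)

# hypothetical portfolio: two policyholder groups with different true
# surrender rates (numbers are invented for illustration)
TRUE_RATE = {0: 0.02, 1: 0.10}
data = [(g, 1 if random.random() < TRUE_RATE[g] else 0)
        for g in (0, 1) for _ in range(200_000)]

# oversample: duplicate every surrender event K times, a common rare-event trick
K = 19
resampled = [row for row in data for _ in range(K if row[1] else 1)]

def rate(rows, g):
    ys = [y for grp, y in rows if grp == g]
    return sum(ys) / len(ys)

biased, corrected = {}, {}
for g in (0, 1):
    p_res = rate(resampled, g)       # what a model fitted on resampled data learns
    odds = p_res / (1 - p_res) / K   # undo the K-fold odds inflation
    biased[g] = p_res
    corrected[g] = odds / (1 + odds)
```

The raw resampled estimates are wildly too high (a 10% surrender rate reads as roughly 68%), while the odds-corrected probabilities recover the original empirical rates exactly — the probability-calibration point the abstract makes.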
  9. By: Jeffrey Grogger; Ria Ivandic; Tom Kirchmaier
    Abstract: We compare predictions from a conventional protocol-based approach to risk assessment with those based on a machine-learning approach. We first show that the conventional predictions are less accurate than, and have similar rates of negative prediction error as, a simple Bayes classifier that makes use only of the base failure rate. A random forest based on the underlying risk assessment questionnaire does better under the assumption that negative prediction errors are more costly than positive prediction errors. A random forest based on two-year criminal histories does better still. Indeed, adding the protocol-based features to the criminal histories adds almost nothing to the predictive adequacy of the model. We suggest using the predictions based on criminal histories to prioritize incoming calls for service, and devising a more sensitive instrument to distinguish true from false positives that result from this initial screening.
    Keywords: domestic abuse, risk assessment, machine learning
    JEL: K42
    Date: 2020–02
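The "simple Bayes classifier that makes use only of the base failure rate" reduces to a constant decision rule: flag every case as high-risk exactly when the expected cost of a miss exceeds that of a false alarm. A minimal sketch (the cost numbers below are hypothetical, not from the paper):

```python
def base_rate_classifier(base_rate, cost_miss, cost_false_alarm):
    """Constant rule using only the base failure rate: flag everyone as
    high-risk iff the expected cost of missing a true case exceeds the
    expected cost of a false alarm."""
    return 1 if base_rate * cost_miss > (1 - base_rate) * cost_false_alarm else 0
```

With a 10% base failure rate, the rule flags everyone when misses are 20 times as costly as false alarms, and no one under symmetric costs — which is why the cost asymmetry assumption matters for the comparisons in the paper.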
  10. By: Vera Eichenauer (KOF Swiss Economic Institute, ETH Zurich, Switzerland); Ronald Indergand (State Secretariat for Economic Affairs SECO, Switzerland); Isabel Z. Martínez (KOF Swiss Economic Institute, ETH Zurich, Switzerland); Christoph Sax (University of Basel; cynkra LLC, Switzerland)
    Abstract: Google Trends have become a popular data source for social science research. We show that for small countries or sub-national regions like U.S. states, underlying sampling noise in Google Trends can be substantial. The data may therefore be unreliable for time series analysis and is furthermore frequency-inconsistent: daily data differs from weekly or monthly data. We provide a novel sampling technique along with the R-package trendecon in order to generate stable daily Google search results that are consistent with weekly and monthly queries of Google Trends. We use this new approach to construct long and consistent daily economic indices for the (mainly) German-speaking countries Germany, Austria, and Switzerland. The resulting indices are significantly correlated with traditional leading indicators, with the advantage that they are available much earlier.
    Keywords: Google Trends, measurement, high frequency, forecasting, Covid-19 Market, Euro, sectoral heterogeneity
    JEL: E01 E32 E37
    Date: 2020–06
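The core of the sampling fix is simple: each Google Trends download is one noisy draw of the underlying series, so averaging many independent draws shrinks the sampling noise roughly with the square root of the number of draws. A self-contained simulation of that idea (the trendecon pipeline also reconciles daily, weekly, and monthly frequencies, which is not shown here):

```python
import math
import random

random.seed(42)

latent = [50 + 10 * math.sin(t / 5) for t in range(100)]   # "true" search interest

def noisy_draw(series, sd=8.0):
    """One Google-Trends-style download: the latent series plus sampling noise."""
    return [x + random.gauss(0, sd) for x in series]

def average_draws(series, n):
    """Average n independent downloads; noise shrinks roughly as 1/sqrt(n)."""
    draws = [noisy_draw(series) for _ in range(n)]
    return [sum(col) / n for col in zip(*draws)]

def rmse(a, b):
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

err_single = rmse(noisy_draw(latent), latent)
err_avg = rmse(average_draws(latent, 25), latent)
```

Averaging 25 draws cuts the error by roughly a factor of five here, which is why repeated queries make small-country daily series usable for time-series work.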
  11. By: Benetos, Emmanouil (Queen Mary University of London and The Alan Turing Institute); Ragano, Alessandro (University College Dublin); Sgroi, Daniel (University of Warwick, ESRC CAGE Centre and IZA Bonn.); Tuckwell, Anthony (University of Warwick and ESRC CAGE Centre.)
    Abstract: We propose a new measure for national happiness based on the emotional content of a country’s most popular songs. Using machine learning to detect the valence of the UK’s chart-topping song of each year since the 1970s, we find that it reliably predicts the leading survey-based measure of life satisfaction. Moreover, we find that music valence is better able to predict life satisfaction than a recently-proposed measure of happiness based on the valence of words in books (Hills et al., 2019). Our results have implications for the role of music in society, and at the same time validate a new use of music as a measure of public sentiment.
    Keywords: subjective wellbeing, life satisfaction, national happiness, music information retrieval, machine learning
    JEL: N30 Z11 Z13
    Date: 2021
  12. By: In-Koo Cho; Jonathan Libgober
    Abstract: We study interactions between strategic players and markets whose behavior is guided by an algorithm. Algorithms use data from prior interactions and a limited set of decision rules to prescribe actions. While as-if rational play need not emerge if the algorithm is constrained, it is possible to guide behavior across a rich set of possible environments using limited details. Provided a condition known as weak learnability holds, Adaptive Boosting algorithms can be specified to induce behavior that is (approximately) as-if rational. Our analysis provides a statistical perspective on the study of endogenous model misspecification.
    Date: 2021–01
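The role of weak learnability can be made concrete with classic AdaBoost on decision stumps: each stump is only slightly better than chance, yet boosting drives the ensemble toward the correct rule. A generic sketch on a toy interval-labeling task (unrelated to the paper's economic environments):

```python
import math
import random

random.seed(3)

# toy task: label +1 inside (0.3, 0.7), -1 outside; no single threshold
# stump solves it, so every stump is a genuinely weak learner
X = [random.random() for _ in range(400)]
Y = [1 if 0.3 < x < 0.7 else -1 for x in X]
thresholds = [i / 20 for i in range(21)]   # ends give constant stumps

def stump_error(t, pol, w):
    """Weighted error of the stump 'predict pol if x > t else -pol'."""
    return sum(wi for x, y, wi in zip(X, Y, w) if (pol if x > t else -pol) != y)

w = [1 / len(X)] * len(X)
ensemble = []                              # (alpha, threshold, polarity)
for _ in range(25):
    t, pol = min(((t, p) for t in thresholds for p in (1, -1)),
                 key=lambda tp: stump_error(tp[0], tp[1], w))
    err = stump_error(t, pol, w)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
    ensemble.append((alpha, t, pol))
    # reweight: misclassified points gain weight, correct ones lose it
    w = [wi * math.exp(-alpha * y * (pol if x > t else -pol))
         for x, y, wi in zip(X, Y, w)]
    s = sum(w)
    w = [wi / s for wi in w]

def predict(x):
    vote = sum(a * (p if x > t else -p) for a, t, p in ensemble)
    return 1 if vote > 0 else -1

train_acc = sum(predict(x) == y for x, y in zip(X, Y)) / len(X)
```

Provided each round's best stump retains an edge over 1/2 — the weak-learnability condition the abstract invokes — the weighted vote converges toward the target rule.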
  13. By: Bart H. L. Overes; Michel van der Wel
    Abstract: Sovereign credit ratings summarize the creditworthiness of countries. These ratings have a large influence on the economy and the yields at which governments can issue new debt. This paper investigates the use of a Multilayer Perceptron (MLP), Classification and Regression Trees (CART), and an Ordered Logit (OL) model for the prediction of sovereign credit ratings. We show that MLP is best suited for predicting sovereign credit ratings, with an accuracy of 68%, followed by CART (59%) and OL (33%). Investigation of the determining factors shows that roughly the same explanatory variables are important in all models, with regulatory quality, GDP per capita and unemployment rate as common important variables. Consistent with economic theory, a higher regulatory quality and/or GDP per capita are associated with a higher credit rating, while a higher unemployment rate is associated with a lower credit rating.
    Date: 2021–01
  14. By: Ines Wilms; Jacob Bien
    Abstract: High-dimensional graphical models are often estimated using regularization that is aimed at reducing the number of edges in a network. In this work, we show how even simpler networks can be produced by aggregating the nodes of the graphical model. We develop a new convex regularized method, called the tree-aggregated graphical lasso or tag-lasso, that estimates graphical models that are both edge-sparse and node-aggregated. The aggregation is performed in a data-driven fashion by leveraging side information in the form of a tree that encodes node similarity and facilitates the interpretation of the resulting aggregated nodes. We provide an efficient implementation of the tag-lasso by using the locally adaptive alternating direction method of multipliers and illustrate our proposal's practical advantages in simulation and in applications in finance and biology.
    Date: 2021–01
  15. By: Douglas Silveira; Silvinha Vasconcelos; Marcelo Resende; Daniel O. Cajueiro
    Abstract: In this article, we combine machine learning techniques with statistical moments of the gasoline price distribution. By doing so, we aim to detect and predict cartels in the Brazilian retail market. In addition to the traditional variance screen, we evaluate how the standard deviation, coefficient of variation, skewness, and kurtosis can be useful features in identifying anti-competitive market behavior. We complement our discussion with the so-called confusion matrix and discuss the trade-offs related to false-positive and false-negative predictions. Our results show that in some cases, false-negative outcomes critically increase when the main objective is to minimize false-positive predictions. We offer a discussion regarding the pros and cons of our approach for antitrust authorities aiming at detecting and avoiding gasoline cartels.
    Keywords: cartel screens, price dynamics, fuel retail market, machine learning
    JEL: C21 C45 C52 K40 L40 L41
    Date: 2021
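The screens themselves are just distributional moments of the price series: collusive pricing tends to compress variation, so low dispersion is a classic red flag. A sketch of the feature construction on made-up prices (not the Brazilian retail data):

```python
import statistics

def price_screens(prices):
    """Distributional screens: std, coefficient of variation, skewness,
    excess kurtosis of a price series."""
    m, sd = statistics.mean(prices), statistics.pstdev(prices)
    z = [(p - m) / sd for p in prices]
    n = len(prices)
    return {
        "std": sd,
        "cv": sd / m,
        "skewness": sum(v ** 3 for v in z) / n,
        "kurtosis": sum(v ** 4 for v in z) / n - 3,
    }

competitive = [5.1, 4.8, 5.6, 4.5, 5.9, 4.2, 5.4, 5.0]   # invented prices
collusive = [5.50, 5.52, 5.49, 5.51, 5.50, 5.48, 5.51, 5.50]

s_comp = price_screens(competitive)
s_coll = price_screens(collusive)
```

In the paper these moments become features of a supervised classifier; the variance-style screens alone already separate the two invented series.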
  16. By: Tommaso Proietti (CEIS & DEF, University of Rome "Tor Vergata"); Alessandro Giovannelli (DEF, University of Rome "Tor Vergata")
    Abstract: Gross domestic product (GDP) is the most comprehensive and authoritative measure of economic activity. The macroeconomic literature has focused on nowcasting and forecasting this measure at the monthly frequency, using related high-frequency indicators. We address the issue of estimating monthly gross domestic product using a large dimensional set of monthly indicators, by pooling the disaggregate estimates arising from simple and feasible bivariate models that consider one indicator at a time in conjunction with GDP. Our base model handles mixed-frequency data and ragged-edge data structures with any pattern of missingness. Our methodology enables us to distill the common component of the available economic indicators, so that the monthly GDP estimates arise from the projection of the quarterly figures on the space spanned by the common component. The weights used for the combination reflect the ability to nowcast quarterly GDP and are obtained as a function of the regularized estimator of the high-dimensional covariance matrix of the nowcasting errors. A recursive nowcasting and forecasting experiment illustrates that the optimal weights adapt to the information set available in real time and vary according to the phase of the business cycle.
    Keywords: Mixed-Frequency Data, Dynamic Factor Models, State Space Models, Shrinkage
    JEL: C32 C52 C53 E37
    Date: 2020–05–12
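The weighting step — down-weight indicators whose past nowcast errors are large — can be sketched with plain inverse-variance weights. The paper's actual weights come from a regularized estimate of the full error covariance matrix, which this simplified stand-in ignores (it treats errors as uncorrelated across indicators):

```python
import statistics

def combination_weights(errors):
    """Inverse-variance combination weights from each indicator's past
    nowcast errors; weights sum to one."""
    inv = [1.0 / statistics.pvariance(e) for e in errors]
    s = sum(inv)
    return [v / s for v in inv]

past_errors = [
    [0.1, -0.2, 0.15, -0.05],   # indicator A: small errors -> large weight
    [0.8, -0.9, 1.1, -0.7],     # indicator B: large errors -> small weight
]
w = combination_weights(past_errors)
```

The accurate indicator dominates the pooled nowcast, mirroring how the paper's model-averaging weights adapt to each indicator's real-time track record.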
  17. By: Michele Cantarella (University of Modena and Reggio Emilia and University of Helsinki); Nicolò Fraccaroli (Università di Roma "Tor Vergata"); Roberto Volpe (LUISS Guido Carli)
    Abstract: We study the impact of fake news on votes for populist parties in the Italian elections of 2018. Our empirical strategy exploits the presence of Italian- and German-speaking voters in the Italian region of Trentino Alto-Adige/Südtirol as an exogenous source of assignment to fake news exposure. Using municipal data, we compare the effect of exposure to fake news on the vote for populist parties in the 2013 and 2018 elections. To do so, we introduce a novel indicator of populism using text mining on the Facebook posts of Italian parties before the elections. We find that exposure to fake news is positively correlated with vote for populist parties, but that less than half of this correlation is causal. Our findings support the view that exposure to fake news (i) favours populist parties, but also that (ii) it is positively correlated with prior support for populist parties, suggesting a self-selection mechanism.
    Keywords: Fake News, Political Economy, Electoral Outcomes, Populism
    JEL: C26 D72 P16
    Date: 2020–06–17
  18. By: Jansen, Mark (U of Utah); Nguyen, Hieu (U of Utah); Shams, Amin (Ohio State U)
    Abstract: Using a randomized experiment in auto lending, we provide evidence of higher loan profitability with algorithmic machine underwriting relative to human underwriting. Machine-underwritten loans generate 10.2% higher loan-level profit than human-underwritten loans in a sample of 140,000 randomly assigned applications. The loans underwritten by machines not only have higher interest rates but also realize a 6.8% lower incidence of default. The performance gap is mainly driven by loans with higher complexity and where potential for agency conflicts is the highest. These results are consistent with algorithmic underwriting mitigating agency conflicts and humans' limited capacity for analyzing complex problems.
    JEL: D14 G21 O33
    Date: 2020–07
  19. By: Chengyu Huang; Sean Simpson; Daria Ulybina; Agustin Roitman
    Abstract: We construct sentiment indices for 20 countries from 1980 to 2019. Relying on computational text analysis, we capture specific language like “fear”, “risk”, “hedging”, “opinion”, and “crisis”, as well as “positive” and “negative” sentiments, in news articles from the Financial Times. We assess the performance of our sentiment indices as “news-based” early warning indicators (EWIs) for financial crises. We find that sentiment indices spike and/or trend up ahead of financial crises.
    Keywords: Early warning systems;Financial crises;Hedging;Global financial crisis of 2008-2009;Banking crises;WP,sentiment index,crisis sentiment,seed words,sentiment indices,term cluster,word vector representation,word-vector models
    Date: 2019–12–06
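At its simplest, a news-based crisis indicator of this kind counts how often seed words appear per article. The sketch below uses the terms quoted in the abstract as the seed vocabulary; the two headlines are invented, and the paper's actual method additionally expands the vocabulary with word-vector models:

```python
CRISIS_TERMS = {"fear", "risk", "hedging", "crisis"}   # seed words from the abstract

def crisis_index(articles):
    """Share of words in each article drawn from the crisis vocabulary."""
    out = []
    for text in articles:
        words = text.lower().split()
        out.append(sum(w.strip(".,") in CRISIS_TERMS for w in words) / len(words))
    return out

idx = crisis_index([
    "markets calm as growth continues",                  # invented headline
    "banks warn of crisis risk and fear of contagion",   # invented headline
])
```

Aggregated over articles per month and country, a series of such shares is the kind of index that spikes ahead of crises in the paper's exercise.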
  20. By: Shree Saha (Cornell University); Sudha Narayanan (Indira Gandhi Institute of Development Research)
    Abstract: Measuring empowerment is both complicated and time-consuming. A number of recent efforts have focused on how to better measure this complex multidimensional concept such that it is easy to implement. In this paper, we use machine learning techniques, specifically LASSO, using survey data from five Indian states to abbreviate a recently developed measure of nutritional empowerment, the Women's Empowerment in Nutrition Index (WENI), which has 33 distinct indicators. Our preferred Abridged Women's Empowerment in Nutrition Index (A-WENI) consists of 20 indicators. We validate the A-WENI via a field survey from a new context, the western Indian state of Maharashtra. We find that the 20-indicator A-WENI is both capable of reproducing well the empowerment status generated by the 33-indicator WENI and predicting nutritional outcomes such as BMI and dietary diversity. Using this index, we find that in our Maharashtra sample, on average, only 51.2% of mothers of children under the age of 5 years are nutritionally empowered, whereas 86.1% of their spouses are nutritionally empowered. We also find that only 22.3% of the elderly women are nutritionally empowered. These estimates are broadly consistent with those based on the 33-indicator WENI. The A-WENI will reduce the time burden on respondents and can be incorporated in any general-purpose survey conducted in rural contexts. Many of the indicators in A-WENI are already collected routinely in contemporary household surveys. Hence, capturing nutritional empowerment does not entail a significant additional burden. Developing A-WENI can thus aid in an expansion of efforts to measure nutritional empowerment; this is key to better understanding the barriers and challenges women face and to identifying ways in which women can improve their nutritional well-being in meaningful ways.
    Keywords: Empowerment, nutrition, machine learning, LASSO, gender, India, South Asia
    JEL: J16 D63 I00 C55
    Date: 2020–10
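Abbreviation via LASSO works by soft-thresholding: coefficients on indicators that add little predictive power are driven exactly to zero, leaving a short list. A self-contained coordinate-descent sketch on synthetic binary indicators (15 rather than 33, with only the first two truly driving the outcome; nothing here uses the WENI data):

```python
import random

random.seed(7)

def soft(r, lam):
    """Soft-thresholding operator behind the LASSO."""
    return (r - lam) if r > lam else (r + lam) if r < -lam else 0.0

def lasso(X, y, lam, sweeps=100):
    """Coordinate-descent LASSO on centered data (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(row[j] ** 2 for row in X) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # partial residual excluding coordinate j
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            beta[j] = soft(rho, lam) / col_sq[j]
    return beta

# synthetic survey: 15 binary indicators; only the first two drive the outcome
n, p = 300, 15
X = [[random.choice((0.0, 1.0)) for _ in range(p)] for _ in range(n)]
y = [2 * row[0] + 1.5 * row[1] + random.gauss(0, 0.5) for row in X]

# center columns and outcome so no intercept is needed
means = [sum(row[j] for row in X) / n for j in range(p)]
Xc = [[row[j] - means[j] for j in range(p)] for row in X]
ybar = sum(y) / n
yc = [v - ybar for v in y]

beta = lasso(Xc, yc, lam=15.0)
```

The penalty zeroes out the 13 irrelevant indicators while retaining the two informative ones (with some shrinkage toward zero) — the same selection mechanism that trims 33 WENI indicators down to the 20 in A-WENI.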

This nep-big issue is ©2021 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at <>. For comments please write to the director of NEP, Marco Novarese at <>. Put “NEP” in the subject; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.