nep-big New Economics Papers
on Big Data
Issue of 2022‒09‒19
thirteen papers chosen by
Tom Coupé
University of Canterbury

  1. Mental health concerns prelude the Great Resignation: Evidence from Social Media By R. Maria del Rio-Chanona; Alejandro Hermida-Carrillo; Melody Sepahpour-Fard; Luning Sun; Renata Topinkova; Ljubica Nedelkoska
  2. Learn Then Test: Calibrating Predictive Algorithms to Achieve Risk Control By Angelopoulos, Anastasios N.; Bates, Stephen; Candes, Emmanuel J.; Jordan, Michael I.; Lei, Lihua
  3. An intelligent algorithmic trading based on a risk-return reinforcement learning algorithm By Boyi Jin
  4. Deep Learning for Choice Modeling By Zhongze Cai; Hanzhao Wang; Kalyan Talluri; Xiaocheng Li
  5. pystacked: Stacking generalization and machine learning in Stata By Achim Ahrens; Christian B. Hansen; Mark E. Schaffer
  6. Non-Stationary Dynamic Pricing Via Actor-Critic Information-Directed Pricing By Po-Yi Liu; Chi-Hua Wang; Heng-Hsui Tsai
  7. Local processing of massive databases with R: a national analysis of a Brazilian social programme By Paz, Hellen; Maia, Mateus; Moraes, Fernando; Lustosa, Ricardo; Costa, Lilia; Macêdo, Samuel; Barreto, Marcos E.; Ara, Anderson
  8. Modeling Path-Dependent State Transition by a Recurrent Neural Network By Yang, Bill Huajian
  9. Estimating inequality with missing incomes By Brunori, Paolo; Salas-Rojo, Pedro; Verne, Paolo
  10. DSGE Models and Machine Learning: An Application to Monetary Policy in the Euro Area By Daniel Stempel; Johannes Zahner
  11. Transfer Ranking in Finance: Applications to Cross-Sectional Momentum with Data Scarcity By Daniel Poh; Stephen Roberts; Stefan Zohren
  12. Combining Forecasts under Structural Breaks Using Graphical LASSO By Tae-Hwy Lee; Ekaterina Seregina
  13. Neural Payoff Machines: Predicting Fair and Stable Payoff Allocations Among Team Members By Daphne Cornelisse; Thomas Rood; Mateusz Malinowski; Yoram Bachrach; Tal Kachman

  1. By: R. Maria del Rio-Chanona; Alejandro Hermida-Carrillo; Melody Sepahpour-Fard; Luning Sun; Renata Topinkova; Ljubica Nedelkoska
    Abstract: To study the causes of the 2021 Great Resignation, we use text analysis to investigate the changes in work- and quit-related posts between 2018 and 2021 on Reddit. We find that the Reddit discourse evolution resembles the dynamics of the U.S. quit and layoff rates. Furthermore, when the COVID-19 pandemic started, conversations related to working from home, switching jobs, work-related distress, and mental health increased. We distinguish between general work-related and specific quit-related discourse changes using a difference-in-differences method. Our main finding is that mental health and work-related distress topics disproportionately increased among quit-related posts since the onset of the pandemic, likely contributing to the Great Resignation. Along with better labor market conditions, some relief came in early to mid-2021, when these concerns decreased. Our study validates the use of forums such as Reddit for studying emerging economic phenomena in real time, complementing traditional labor market surveys and administrative data.
    Date: 2022–08
  2. By: Angelopoulos, Anastasios N. (?); Bates, Stephen (?); Candes, Emmanuel J. (?); Jordan, Michael I. (?); Lei, Lihua (Stanford U)
    Abstract: We introduce a framework for calibrating machine learning models so that their predictions satisfy explicit, finite-sample statistical guarantees. Our calibration algorithm works with any underlying model and (unknown) data-generating distribution and does not require model refitting. The framework addresses, among other examples, false discovery rate control in multi-label classification, intersection-over-union control in instance segmentation, and the simultaneous control of the type-1 error of outlier detection and confidence set coverage in classification or regression. Our main insight is to reframe the risk-control problem as multiple hypothesis testing, enabling techniques and mathematical arguments different from those in the previous literature. We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision and tabular medical data.
    Date: 2022–04
  3. By: Boyi Jin
    Abstract: This paper proposes a novel portfolio optimization model using an improved deep reinforcement learning algorithm. The objective function of the optimization model is the weighted sum of the expectation and the value at risk (VaR) of the portfolio cumulative return. The proposed algorithm is based on an actor-critic architecture, in which the main task of the critic network is to learn the distribution of the portfolio cumulative return using quantile regression, and the actor network outputs the optimal portfolio weights by maximizing the objective function mentioned above. Meanwhile, we exploit a linear transformation function to realize asset short selling. Finally, a multi-process method called Ape-x is used to accelerate deep reinforcement learning training. To validate our proposed approach, we conduct backtesting for two representative portfolios and observe that the proposed model is superior to the benchmark strategies.
    Date: 2022–08
  4. By: Zhongze Cai; Hanzhao Wang; Kalyan Talluri; Xiaocheng Li
    Abstract: Choice modeling has been a central topic in the study of individual preference or utility across many fields including economics, marketing, operations research, and psychology. While the vast majority of the literature on choice models has been devoted to the analytical properties that lead to managerial and policy-making insights, the existing methods to learn a choice model from empirical data are often either computationally intractable or sample inefficient. In this paper, we develop deep learning-based choice models under two settings of choice modeling: (i) feature-free and (ii) feature-based. Our model captures both the intrinsic utility for each candidate choice and the effect that the assortment has on the choice probability. Synthetic and real data experiments demonstrate the performance of the proposed models in terms of the recovery of the existing choice models, sample complexity, assortment effect, architecture design, and model interpretation.
    Date: 2022–08
  5. By: Achim Ahrens; Christian B. Hansen; Mark E. Schaffer
    Abstract: pystacked implements stacked generalization (Wolpert, 1992) for regression and binary classification via Python's scikit-learn. Stacking combines multiple supervised machine learners -- the "base" or "level-0" learners -- into a single learner. The currently supported base learners include regularized regression, random forest, gradient boosted trees, support vector machines, and feed-forward neural nets (multi-layer perceptron). pystacked can also be used as a "regular" machine learning program to fit a single base learner and, thus, provides an easy-to-use API for scikit-learn's machine learning algorithms.
    Date: 2022–08
  6. By: Po-Yi Liu; Chi-Hua Wang; Heng-Hsui Tsai
    Abstract: This paper presents a novel non-stationary dynamic pricing algorithm design, where pricing agents face incomplete demand information and market environment shifts. The agents run price experiments to learn about each product's demand curve and the profit-maximizing price, while being aware of market environment shifts to avoid high opportunity costs from offering sub-optimal prices. The proposed ACIDP extends information-directed sampling (IDS) algorithms from statistical machine learning to include microeconomic choice theory, with a novel pricing strategy auditing procedure to escape sub-optimal pricing after a market environment shift. The proposed ACIDP outperforms competing bandit algorithms, including Upper Confidence Bound (UCB) and Thompson sampling (TS), in a series of market environment shifts.
    Date: 2022–08
  7. By: Paz, Hellen; Maia, Mateus; Moraes, Fernando; Lustosa, Ricardo; Costa, Lilia; Macêdo, Samuel; Barreto, Marcos E.; Ara, Anderson
    Abstract: The analysis of massive databases is a key issue for most applications today and the use of parallel computing techniques is one of the suitable approaches for that. Apache Spark is a widely employed tool within this context, aiming at processing large amounts of data in a distributed way. For the Statistics community, R is one of the preferred tools. Despite its growth in recent years, it still has limitations for processing large volumes of data on single local machines. In general, the data analysis community has difficulty handling a massive amount of data on local machines, often requiring high-performance computing servers. One way to perform statistical analyses over massive databases is combining both tools (Spark and R) via the sparklyr package, which allows an R application to use Spark. This paper presents an analysis of Brazilian public data from the Bolsa Família Programme (BFP, a conditional cash transfer programme), comprising a large data set with 1.26 billion observations. Our goal was to understand how this social programme acts in different cities, as well as to identify potentially important variables reflecting its utilization rate. Statistical modeling was performed using random forest to predict the utilization rate of BFP. Variable selection was performed through a recent method based on the importance and interpretation of variables in the random forest model. Among the 89 variables initially considered, the final model presented a high predictive performance capacity with 17 selected variables, as well as indicating the high importance of some variables for the observed utilization rate in the areas of income, education, job informality, and inactive youth, namely: family income, education, occupation and density of people in the homes. In this work, using a local machine, we highlighted the potential of combining Spark and R for the analysis of a large database of 111.6 GB. This can serve as a proof of concept or reference for other similar works within the Statistics community, and our case study can provide important evidence for further analysis of this important social support programme.
    Keywords: big data; massive databases; impact evaluation; sparklyr; Bolsa Familia
    JEL: C1
    Date: 2020–12–01
  8. By: Yang, Bill Huajian
    Abstract: Rating transition models are widely used for credit risk evaluation. It is not uncommon that a time-homogeneous Markov rating migration model deteriorates quickly after projecting repeatedly for a few periods. This is because the time-homogeneous Markov condition is generally not satisfied. For a credit portfolio, rating transition is usually path dependent. In this paper, we propose a recurrent neural network (RNN) model for modeling path-dependent rating migration. An RNN is a type of artificial neural network where connections between nodes form a directed graph along a temporal sequence. There are neurons for input and output at each time period. The model is informed by the past behaviours of a loan along the path. Information learned from previous periods propagates to future periods. Experiments show this RNN model is robust.
    Keywords: Path-dependent, rating transition, recurrent neural network, deep learning, Markov property, time-homogeneity
    JEL: C13 C18 C45 C51 C58 G12 G17 G32 G33 M3
    Date: 2022–08–18
  9. By: Brunori, Paolo; Salas-Rojo, Pedro; Verne, Paolo
    Abstract: The measurement of income inequality is affected by missing observations, especially if they are concentrated on the tails of an income distribution. This paper conducts an experiment to test how the different correction methods proposed by the statistical, econometric and machine learning literature address measurement biases of inequality due to item non-response. We take a baseline survey and artificially corrupt the data employing several alternative non-linear functions that simulate patterns of income non-response and show how biased inequality statistics can be when item non-responses are ignored. The comparative assessment of correction methods indicates that most methods are able to partially correct for missing data biases. Sample reweighting based on probabilities of non-response produces inequality estimates quite close to true values in most simulated missing data patterns. Matching and Pareto corrections can also be effective to correct for selected missing data patterns. Other methods, such as Single and Multiple imputations and Machine Learning methods, are less effective. A final discussion provides some elements that help explain these findings.
    Keywords: income inequality; item non-response; income distributions; inequality predictions; imputations
    JEL: D31 D63 E64 O15
    Date: 2022–07
  10. By: Daniel Stempel (University of Duesseldorf); Johannes Zahner (University of Marburg)
    Abstract: In the euro area, monetary policy is conducted by a single central bank for 19 member countries. However, countries are heterogeneous in their economic development, including their inflation rates. This paper combines a New Keynesian model and a neural network to assess whether the European Central Bank (ECB) conducted monetary policy between 2002 and 2022 according to the weighted average of the inflation rates within the European Monetary Union (EMU) or reacted more strongly to the inflation rate developments of certain EMU countries. The New Keynesian model first generates data which is used to train and evaluate several machine learning algorithms. We find that a neural network performs best out-of-sample. Thus, we use this algorithm to classify historical EMU data. Our findings suggest disproportional emphasis on the inflation rates experienced by southern EMU members for the vast majority of the time frame considered (80%). We argue that this result stems from a tendency of the ECB to react more strongly to countries whose inflation rates exhibit greater deviations from their long-term trend.
    Keywords: New Keynesian Models, Monetary Policy, European Monetary Union, Neural Networks, Transfer Learning
    JEL: C45 C53 E58
    Date: 2022
  11. By: Daniel Poh; Stephen Roberts; Stefan Zohren
    Abstract: Cross-sectional strategies are a classical and popular trading style, with recent high performing variants incorporating sophisticated neural architectures. While these strategies have been applied successfully to data-rich settings involving mature assets with long histories, deploying them on instruments with limited samples generally produces over-fitted models with degraded performance. In this paper, we introduce Fused Encoder Networks -- a hybrid parameter-sharing transfer ranking model. The model fuses information extracted using an encoder-attention module operated on a source dataset with a similar but separate module focused on a smaller target dataset of interest. In addition to mitigating the issue of target data scarcity, the model's self-attention mechanism enables interactions among instruments to be accounted for, not just at the loss level during model training, but also at inference time. Focusing on momentum applied to the top ten cryptocurrencies by market capitalisation as a demonstrative use-case, the Fused Encoder Networks outperforms the reference benchmarks on most performance measures, delivering a three-fold boost in the Sharpe ratio over classical momentum as well as an improvement of approximately 50% against the best benchmark model without transaction costs. It continues outperforming baselines even after accounting for the high transaction costs associated with trading cryptocurrencies.
    Date: 2022–08
  12. By: Tae-Hwy Lee (Department of Economics, University of California Riverside); Ekaterina Seregina (Colby College)
    Abstract: In this paper we develop a novel method of combining many forecasts based on a machine learning algorithm called Graphical LASSO. We visualize forecast errors from different forecasters as a network of interacting entities and generalize network inference in the presence of common factor structure and structural breaks. First, we note that forecasters often use common information and hence make common mistakes, which makes the forecast errors exhibit common factor structures. We propose the Factor Graphical LASSO (Factor GLASSO), which separates common forecast errors from the idiosyncratic errors and exploits sparsity of the precision matrix of the latter. Second, since the network of experts changes over time as a response to unstable environments such as recessions, it is unreasonable to assume constant forecast combination weights. Hence, we propose Regime-Dependent Factor Graphical LASSO (RD-Factor GLASSO) and develop its scalable implementation using the Alternating Direction Method of Multipliers (ADMM) to estimate regime-dependent forecast combination weights. The empirical application to forecasting macroeconomic series using the data of the European Central Bank’s Survey of Professional Forecasters (ECB SPF) demonstrates superior performance of a combined forecast using Factor GLASSO and RD-Factor GLASSO.
    Keywords: Common Forecast Errors, Regime Dependent Forecast Combination, Sparse Precision Matrix of Idiosyncratic Errors, Structural Breaks.
    JEL: C13 C38 C55
    Date: 2022–09
  13. By: Daphne Cornelisse; Thomas Rood; Mateusz Malinowski; Yoram Bachrach; Tal Kachman
    Abstract: In many multi-agent settings, participants can form teams to achieve collective outcomes that may far surpass their individual capabilities. Measuring the relative contributions of agents and allocating them shares of the reward that promote long-lasting cooperation are difficult tasks. Cooperative game theory offers solution concepts identifying distribution schemes, such as the Shapley value, which fairly reflects the contribution of individuals to the performance of the team, or the Core, which reduces the incentive of agents to abandon their team. Applications of such methods include identifying influential features and sharing the costs of joint ventures or team formation. Unfortunately, using these solutions requires tackling a computational barrier as they are hard to compute, even in restricted settings. In this work, we show how cooperative game-theoretic solutions can be distilled into a learned model by training neural networks to propose fair and stable payoff allocations. We show that our approach creates models that can generalize to games far from the training distribution and can predict solutions for more players than observed during training. An important application of our framework is Explainable AI: our approach can be used to speed up Shapley value computations on many instances.
    Date: 2022–08
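
As a reader's aid for item 13: the Shapley value mentioned in that abstract can be computed exactly for small games by averaging each player's marginal contribution over all coalitions. The sketch below is only an illustration of that textbook definition, not the authors' neural method; the function name `shapley_values` and the toy three-player game are assumptions made for this example. The nested loop over coalitions grows exponentially with the number of players, which is the computational barrier the abstract refers to.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition.

    players: list of hashable player labels.
    value:   characteristic function mapping a frozenset of players to a
             number (hypothetical; supplied by the caller).
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # size of the coalition that p joins: 0..n-1
            # Probability weight of a size-k coalition preceding p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal contribution of p to coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Toy game (illustrative numbers only): a coalition earns 1 when it
# contains player "a" and at least one other player, otherwise 0.
def toy_value(coalition):
    return 1.0 if "a" in coalition and len(coalition) >= 2 else 0.0


phi = shapley_values(["a", "b", "c"], toy_value)
for player, share in phi.items():
    print(player, round(share, 4))
```

In this toy game player "a" is indispensable and receives 2/3 of the payoff, while "b" and "c" receive 1/6 each; the shares sum to the value of the full team, which is the efficiency property of the Shapley value.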

This nep-big issue is ©2022 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found on the NEP website. For comments, please write to the director of NEP, Marco Novarese. Put “NEP” in the subject; otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.