nep-big New Economics Papers
on Big Data
Issue of 2018‒10‒08
eighteen papers chosen by
Tom Coupé
University of Canterbury

  1. News-based sentiment analysis in real estate: A supervised machine learning approach with support vector networks By Jochen Hausler; Marcel Lang; Jessica Ruscheinsky
  2. Automation of the technical due diligence with artificial intelligence in the real estate industry By Philipp Maximilien Mueller
  3. Nowcasting New Zealand GDP using machine learning algorithms By Adam Richardson; Thomas van Florenstein Mulder; Tugrul Vehbi
  4. Preventing rather than Punishing: An Early Warning Model of Malfeasance in Public Procurement By Gallego, J; Rivero, G; Martínez, J.D.
  5. Implementing machine learning methods in Stata By Austin Nichols
  6. Temporal Relational Ranking for Stock Prediction By Fuli Feng; Xiangnan He; Xiang Wang; Cheng Luo; Yiqun Liu; Tat-Seng Chua
  7. LASSOPACK and PDSLASSO: Prediction, model selection and causal inference with regularized regression By Achim Ahrens; Christian B Hansen; Mark E Schaffer
  8. Applying Deep Learning to Derivatives Valuation By Ryan Ferguson; Andrew Green
  9. Central Bank Communication and the Yield Curve: A Semi-Automatic Approach using Non-Negative Matrix Factorization By Ancil Crayton
  10. Towards the broad application of machine learning for document classification and data migration in real estate By Mario Bodenbender; Björn-Martin Kurzrock
  11. Artificial neural network regression models: Predicting GDP growth By Jahn, Malte
  12. Deep Neural Networks for Estimation and Inference: Application to Causal Effects and Other Semiparametric Estimands By Max H. Farrell; Tengyuan Liang; Sanjog Misra
  13. Experimental spatial modelling of commercial property stock using GIS By Paul Greenhalgh; Kevin Muldoon-Smith; Adejimi Adebayo; Josephine Ellis
  14. Data-driven sensitivity analysis for matching estimators By Giovanni Cerulli
  15. Constructing Financial Sentimental Factors in Chinese Market Using Natural Language Processing By Junfeng Jiang; Jiahao Li
  16. A big data based method for pass rates optimization in mathematics university lower division courses By Fernando A Morales; Carlos A Osorio; Daniel Cabarcas J
  17. A new price test in geographic market definition – an application to the German retail gasoline market By Bantle, Melissa; Muijs, Matthias
  18. Placement Optimization in Refugee Resettlement By Trapp, Andrew C.; Teytelboym, Alexander; Martinello, Alessandro; Andersson, Tommy; Ahani, Narges

  1. By: Jochen Hausler; Marcel Lang; Jessica Ruscheinsky
    Abstract: With the rapid growth of news, information and opinionated data available in digital form, accompanied by swift progress in textual analysis techniques, sentiment analysis has become a hotspot in natural language processing. Scientists can nowadays also draw on increased computational power to study textual documents. These developments have allowed real estate researchers to advance beyond traditional sentiment measures such as closed-end fund discounts and survey-based measures (see, e.g., Lin et al. (2009) and Jin et al. (2014)) and to develop new sentiment proxies. For example, Google search volume data has been used successfully to forecast commercial real estate market developments (Dietzel et al. (2014)) and to predict market volatility (Braun (2016)) as well as housing market turning points (Dietzel (2016)). Using sentiment dictionaries and content-analysis software, Walker (2014) examined the relationship between media coverage and the boom of the UK housing market. In similar fashion, Soo (2015) showed that local housing media sentiment can predict future house prices in US cities. In contrast to related research in finance, however, sentiment analysis in real estate still lags behind: the real estate literature has yet to apply more advanced machine learning techniques, such as supervised learning algorithms, to extracting sentiment from news items. Drawing on a dataset of about 54,000 headlines from the S&P Global Market Intelligence database collected over a 12-year timespan (01/2005 – 12/2016), this paper examines the relationship between media sentiment and movements of both direct and indirect commercial real estate markets in the United States. It thereby explores the performance and potential of a support vector machine as classification algorithm (see Cortes and Vapnik (1995)). By mapping headlines into a high-dimensional feature space, we can identify the polarity of individual news items and aggregate the results into three different sentiment measures. Controlling for other influencing factors and sentiment indices, we show that these 'tone' measures indeed have the potential to explain real estate market movements over time. To our knowledge, this paper is the first to explicitly explore a support vector machine's potential for extracting media sentiment, not only for the United States but for real estate markets in general.
    Keywords: Commercial Real Estate; Machine Learning; News-based sentiment analysis; Support vector networks; United States
    JEL: R3
    Date: 2018–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2018_153&r=big
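    A minimal Python sketch of the classification step described above, with scikit-learn's TfidfVectorizer and LinearSVC standing in for the authors' support vector machine pipeline; the toy headlines and labels are invented, not drawn from the S&P dataset:

      # Headline-polarity classification: TF-IDF features feed a linear
      # support vector machine, mapping headlines into a high-dimensional
      # feature space where positive and negative tone are separated.
      from sklearn.pipeline import make_pipeline
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import LinearSVC

      headlines = [
          "Office rents climb as vacancy hits record low",
          "REIT defaults on loan amid falling occupancy",
          "Retail landlord posts strong quarterly earnings",
          "Mall values slump as anchor tenants exit",
      ]
      labels = [1, 0, 1, 0]  # 1 = positive tone, 0 = negative tone

      clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
      clf.fit(headlines, labels)
      print(clf.predict(["Vacancy rates fall across major markets"]))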
  2. By: Philipp Maximilien Mueller
    Abstract: Over the real estate lifecycle, numerous documents and data are generated. The majority of building-related data is collected in day-to-day operations, in the form of maintenance protocols, contracts or energy consumption records. Previous successes in classification already make it possible to automatically recognize, categorize and name documents, and to sort them into an individual structure in digital data rooms (Bodenbender/Kurzrock 2018). The actual added value is created in the next step: efficient data analysis with specific utilization of the data. This paper describes an approach to the automation of Technical Due Diligence (TDD) through information extraction (IE). The aim is to extract relevant information from building-related documents and to automatically gain quick insights into the condition of real estate. Global assets under management (AuM) of US$1.2 trillion (PWC, AWM Report, 2017) and a global real estate transaction volume of around US$650 billion in 2016 (JLL Global Market Perspective, 2017) show that there is a regular need to analyze building data. Transactions are a very dynamic area, and current trends point toward a more data-driven approach to saving time and cost. In addition, the paper focuses on the standardization of information extraction methods for TDD, as well as the prioritization and evaluation of building-related data. Automated evaluation supports value-adding decisions in the real estate lifecycle with a detailed database. TDD audits are a key instrument for reducing information asymmetries, especially in large transactions. Efficient technologies are now available for IE from digital building data. Through machine learning, documents can be read and evaluated automatically. Digital data rooms and operational applications such as ERP systems serve as sources of information for the extraction. Because of the heterogeneity of the documents, both rule-based and learning-based algorithms are used. The IE builds on various technical foundations, especially neural networks and deep learning methods. As documents are often only available as scans, OCR methods must be integrated. The contribution to the ERES PhD session presents the current state of information extraction in the real estate industry, the research method used for the automation of TDD, and its potential benefits for real estate management.
    Keywords: Artificial Intelligence; Automation; digital building data; Information Extraction; technical due diligence
    JEL: R3
    Date: 2018–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2018_313&r=big
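    As a rough illustration of the rule-based side of such an IE pipeline, the Python sketch below pulls dates and cost figures out of an (already OCR'd) maintenance protocol with regular expressions; the document text is fabricated and the patterns are illustrative assumptions, not the authors' rules:

      # Rule-based information extraction from a building document:
      # regular expressions recover inspection dates and cost items.
      import re

      doc = """Maintenance protocol, elevator system.
      Last inspection: 12.03.2017. Defect: worn cables.
      Estimated repair cost: EUR 14,500. Next inspection due 12.03.2019."""

      dates = re.findall(r"\b\d{2}\.\d{2}\.\d{4}\b", doc)
      costs = re.findall(r"EUR\s*[\d,]+(?:\.\d+)?", doc)

      print("Inspection dates:", dates)   # ['12.03.2017', '12.03.2019']
      print("Cost items:", costs)         # ['EUR 14,500']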
  3. By: Adam Richardson; Thomas van Florenstein Mulder; Tugrul Vehbi
    Abstract: This paper analyses the real-time nowcasting performance of machine learning algorithms estimated on New Zealand data. Using a large set of real-time quarterly macroeconomic indicators, we train a range of popular machine learning algorithms and nowcast real GDP growth for each quarter over the 2009Q1-2018Q1 period. We compare the predictive accuracy of these nowcasts with that of other traditional univariate and multivariate statistical models. We find that the machine learning algorithms outperform the traditional statistical models. Moreover, combining the individual machine learning nowcasts further improves performance relative to the individual nowcasts alone.
    Keywords: Nowcasting, Machine learning, Forecast evaluation
    JEL: C52 C53
    Date: 2018–09
    URL: http://d.repec.org/n?u=RePEc:een:camaaa:2018-47&r=big
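    A stylized Python sketch of the nowcast-combination idea: several learners are trained on the same indicators and their nowcasts averaged. The data are random stand-ins, and the model set is illustrative rather than the paper's:

      # Train several ML regressors on quarterly indicators, then combine
      # their nowcasts with equal weights.
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
      from sklearn.linear_model import ElasticNet

      rng = np.random.default_rng(0)
      X = rng.normal(size=(80, 20))          # 80 quarters, 20 indicators
      y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=80)  # "GDP growth"

      X_train, y_train, X_test = X[:-4], y[:-4], X[-4:]   # hold out 4 quarters

      models = [RandomForestRegressor(random_state=0),
                GradientBoostingRegressor(random_state=0),
                ElasticNet(alpha=0.1)]
      nowcasts = np.column_stack([m.fit(X_train, y_train).predict(X_test)
                                  for m in models])

      combined = nowcasts.mean(axis=1)       # simple equal-weight combination
      print(combined)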
  4. By: Gallego, J; Rivero, G; Martínez, J.D.
    Abstract: Is it possible to predict corruption and public inefficiency in public procurement? With the proliferation of e-procurement in the public sector, anti-corruption agencies and watchdog organizations in many countries currently have access to powerful sources of information. These may help anticipate which transactions will turn out to be faulty and why. In this paper, we discuss the promises and challenges of using machine learning models to predict inefficiency and corruption in public procurement, from the perspective of both researchers and practitioners. We exemplify this procedure using a unique dataset characterizing more than 2 million public contracts in Colombia, and training machine learning models to predict which of them face corruption investigations or implementation inefficiencies. We use different techniques to handle the problem of class imbalance typical of these applications, report the high accuracy of our models, simulate the trade-off between precision and recall in this context, and determine which features contribute the most to the prediction of malfeasance within contracts. Our approach is useful for governments interested in exploiting large administrative datasets to improve the provision of public goods, and highlights some of the tradeoffs and challenges they might face throughout this process.
    Keywords: Corruption, Inefficiency, Machine Learning, Public Procurement
    JEL: C53 M42 O12
    Date: 2018–09–26
    URL: http://d.repec.org/n?u=RePEc:col:000092:016724&r=big
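    A hedged Python sketch of the class-imbalance handling and the precision-recall trade-off discussed above, on synthetic data; class weighting is one of several possible rebalancing techniques and is not necessarily the one the authors use:

      # Reweight the rare "malfeasance" class and trace the
      # precision-recall frontier an agency could operate on.
      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import precision_recall_curve
      from sklearn.model_selection import train_test_split

      X, y = make_classification(n_samples=5000, weights=[0.97, 0.03],
                                 random_state=0)  # ~3% flagged contracts
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

      clf = LogisticRegression(class_weight="balanced", max_iter=1000)
      clf.fit(X_tr, y_tr)

      scores = clf.predict_proba(X_te)[:, 1]
      precision, recall, thresholds = precision_recall_curve(y_te, scores)
      # Each threshold is one point on the precision-recall trade-off.
      for p, r, t in list(zip(precision, recall, thresholds))[::200]:
          print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")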
  5. By: Austin Nichols (Abt Associates)
    Abstract: This presentation will discuss some popular supervised and unsupervised machine learning algorithms and their recommended uses, and then present implementations in Stata. The emphasis is on prediction and causal inference, and on how to tailor a method to a specific application.
    Date: 2018–10–15
    URL: http://d.repec.org/n?u=RePEc:boc:usug18:08&r=big
  6. By: Fuli Feng; Xiangnan He; Xiang Wang; Cheng Luo; Yiqun Liu; Tat-Seng Chua
    Abstract: Stock prediction aims to predict the future trends of a stock in order to help investors make good investment decisions. Traditional solutions for stock prediction are based on time-series models. With the recent success of deep neural networks in modeling sequential data, deep learning has become a promising choice for stock prediction. However, most existing deep learning solutions are not optimized towards the target of investment, i.e., selecting the stock with the highest expected revenue. Specifically, they typically formulate stock prediction as a classification problem (to predict stock trend) or a regression problem (to predict stock price). More importantly, they largely treat stocks as independent of one another, ignoring the valuable signal in the rich relations between stocks (or companies), such as two stocks being in the same sector or two companies having a supplier-customer relation. In this work, we contribute a new deep learning solution, named Relational Stock Ranking (RSR), for stock prediction. Our RSR method advances existing solutions in two major aspects: 1) tailoring the deep learning models for stock ranking, and 2) capturing the stock relations in a time-sensitive manner. The key novelty of our work is a new component in neural network modeling, named Temporal Graph Convolution, which jointly models the temporal evolution and relation network of stocks. To validate our method, we perform back-testing on historical data from two stock markets, NYSE and NASDAQ. Extensive experiments demonstrate the superiority of our RSR method: it outperforms state-of-the-art stock prediction solutions, achieving average return ratios of 98% and 71% on NYSE and NASDAQ, respectively.
    Date: 2018–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1809.09441&r=big
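    A stylized numpy sketch of the propagation step at the heart of a graph convolution like the paper's Temporal Graph Convolution; here the temporal embeddings, relation graph and weights are random placeholders rather than learned, time-conditioned quantities:

      # One graph-convolution layer: per-stock temporal embeddings are
      # propagated over a (normalized) relation graph, then scored and
      # ranked. E stands in for the output of a temporal encoder.
      import numpy as np

      rng = np.random.default_rng(1)
      n_stocks, d = 5, 8
      E = rng.normal(size=(n_stocks, d))   # temporal embeddings (e.g. LSTM output)

      A = np.array([[0, 1, 0, 0, 1],       # 1 = stocks share a relation
                    [1, 0, 1, 0, 0],
                    [0, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1],
                    [1, 0, 0, 1, 0]], dtype=float)
      A_hat = A + np.eye(n_stocks)                 # add self-loops
      D_inv = np.diag(1.0 / A_hat.sum(axis=1))     # row-normalize

      W = rng.normal(size=(d, d))                  # (here: random) weights
      H = np.tanh(D_inv @ A_hat @ E @ W)           # graph-convolved embeddings

      w_out = rng.normal(size=d)
      ranking = np.argsort(-(H @ w_out))           # rank stocks by score
      print("predicted ranking:", ranking)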
  7. By: Achim Ahrens (Economic and Social Research Institute, Dublin); Christian B Hansen (University of Chicago Booth School of Business); Mark E Schaffer (Heriot-Watt University)
    Abstract: The field of machine learning is attracting increasing attention among social scientists and economists. At the same time, Stata offers to date only a very limited set of machine learning tools. This one-hour session introduces two Stata packages, lassopack and pdslasso, which implement regularized regression methods, including but not limited to the lasso (Tibshirani 1996 Journal of the Royal Statistical Society Series B), for Stata. The packages include features intended for prediction, model selection and causal inference, and are thus applicable in a wide range of settings. The commands allow for high-dimensional models, where the number of regressors may be large or even exceed the number of observations under the assumption of sparsity. The package lassopack implements lasso, square-root lasso (Belloni et al. 2011 Biometrika; 2014 Annals of Statistics), elastic net (Zou and Hastie 2005 Journal of the Royal Statistical Society Series B), ridge regression (Hoerl and Kennard 1970 Technometrics), adaptive lasso (Zou 2006 Journal of the American Statistical Association) and post-estimation OLS. These methods rely on tuning parameters, which determine the degree and type of penalization. lassopack supports three approaches for selecting these tuning parameters: information criteria (implemented in lasso2), K-fold and h-step ahead rolling cross-validation (cvlasso), and theory-driven penalization (rlasso) due to Belloni et al. (2012 Econometrica). In addition, rlasso implements the Chernozhukov et al. (2013 Annals of Statistics) sup-score test of joint significance of the regressors.
    Date: 2018–10–15
    URL: http://d.repec.org/n?u=RePEc:boc:usug18:12&r=big
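    For readers outside Stata, a rough Python analogue of the tuning approaches listed above, using scikit-learn counterparts (information-criterion selection as in lasso2, K-fold cross-validation as in cvlasso); the theory-driven rlasso penalization has no direct scikit-learn equivalent and is omitted:

      # Regularized regression with three ways of picking the penalty.
      import numpy as np
      from sklearn.linear_model import LassoLarsIC, LassoCV, ElasticNetCV

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 50))       # 50 candidate regressors
      beta = np.zeros(50)
      beta[:3] = [2.0, -1.5, 1.0]          # sparse true coefficients
      y = X @ beta + rng.normal(size=100)

      print(LassoLarsIC(criterion="bic").fit(X, y).alpha_)   # IC-based penalty
      print(LassoCV(cv=5).fit(X, y).alpha_)                  # cross-validated
      print(ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y).alpha_)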
  8. By: Ryan Ferguson; Andrew Green
    Abstract: The universal approximation theorem for artificial neural networks states that a feedforward network with a single hidden layer can approximate any continuous function, given a finite number of hidden units and mild constraints on the activation functions (see Hornik, 1991; Cybenko, 1989). Deep neural networks are preferred over shallow ones, as the latter can be shown to require an exponentially larger number of hidden units (Telgarsky, 2016). This paper applies deep learning to train deep artificial neural networks to approximate derivative valuation functions, using a basket option as an example. To do so, it develops a Monte Carlo based sampling technique to derive appropriate training and test data sets. The paper explores a range of network geometries. The performance of the training and inference phases is presented using GPU technology.
    Date: 2018–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1809.02233&r=big
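    A minimal Python sketch of the Monte Carlo sampling and network fitting described above, for a two-asset basket call; all parameter choices (strike, volatility range, correlation, network size) are illustrative assumptions, and a small scikit-learn MLP stands in for the paper's GPU-trained deep nets:

      # Monte Carlo sampling for a two-asset basket call: draw inputs,
      # simulate correlated terminal prices, record the (noisy) payoff.
      import numpy as np
      from sklearn.neural_network import MLPRegressor

      rng = np.random.default_rng(0)
      n, K, T, r, rho = 20_000, 100.0, 1.0, 0.0, 0.5   # illustrative params

      S0 = rng.uniform(80, 120, size=(n, 2))       # initial spot prices
      sigma = rng.uniform(0.1, 0.4, size=(n, 2))   # volatilities

      z1 = rng.normal(size=n)
      z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
      Z = np.column_stack([z1, z2])                # correlated GBM shocks
      ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
      payoff = np.maximum(ST.mean(axis=1) - K, 0.0)

      # Regressing noisy payoffs on the inputs recovers the conditional
      # expectation, i.e. the (undiscounted, since r = 0) option value.
      X = np.column_stack([S0, sigma])
      net = MLPRegressor(hidden_layer_sizes=(64, 64), random_state=0)
      net.fit(X, payoff)
      print(net.predict([[100.0, 100.0, 0.2, 0.2]]))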
  9. By: Ancil Crayton
    Abstract: Communication is now a standard tool in the central bank's monetary policy toolkit. Theoretically, communication provides the central bank an opportunity to guide public expectations, and it has been shown empirically that central bank communication can lead to financial market fluctuations. However, there has been little research into which dimensions or topics of information are most important in causing these fluctuations. We develop a semi-automatic methodology that summarizes FOMC statements into their main themes, automatically selects the best model based on coherence, and assesses whether these themes have a significant impact on the shape of the U.S. Treasury yield curve, using topic modeling methods from the machine learning literature. Our findings suggest that the FOMC statements can be decomposed into three topics: (i) information related to economic conditions and the mandates, (ii) information related to monetary policy tools and intermediate targets, and (iii) information related to financial markets and the financial crisis. We find that statements are most influential during the financial crisis, and that the effects are mostly present in the curvature of the yield curve, through information related to the financial theme.
    Date: 2018–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1809.08718&r=big
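    A small Python sketch of the topic-extraction step, using TF-IDF and scikit-learn's NMF; the three mini-"statements" are fabricated, and the coherence-based model selection the paper performs is omitted:

      # Non-negative matrix factorization on TF-IDF text recovers themes;
      # the top-weighted terms characterize each topic.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.decomposition import NMF

      statements = [
          "inflation employment growth remain consistent with the mandate",
          "the committee decided to lower the target range for the funds rate",
          "strains in financial markets and tighter credit weigh on the outlook",
      ]

      tfidf = TfidfVectorizer()
      X = tfidf.fit_transform(statements)
      nmf = NMF(n_components=3, random_state=0).fit(X)

      terms = tfidf.get_feature_names_out()
      for k, row in enumerate(nmf.components_):
          top = [terms[i] for i in row.argsort()[::-1][:4]]
          print(f"topic {k}: {top}")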
  10. By: Mario Bodenbender; Björn-Martin Kurzrock
    Abstract: Real estate is increasingly becoming an asset class subject to the same requirements as other capital investments. As a consequence, the strategic relevance of real estate portfolios has gained in importance for many businesses. The resulting large quantities of documentation and information require a structured database system, in which information and documents will remain permanently transparent, complete, and findable. Portfolio and operating documentation must be reliably and consistently available to a variety of actors, over a period of decades. In order to facilitate effective document protection, administration and access at all times, it is necessary to establish a unique structure and identification system for the information. In practice, however, there are a variety of existing standards relating to document structures for particular lifecycle phases and for transmission of the data between specific phases. The documents are consequently subject to repeated restructuring throughout their lifecycle, a process that is expensive and entails a risk of data loss. The paper describes an approach for unifying, and establishing compatibility between, the existing document structure standards throughout the property's lifecycle, making use of unique document classes. The goal is to achieve a stable, unique document classification, accompanied by a capacity to automatically classify relevant (and, in particular, unstructured) documents. In this way, in the course of digitalization or migration, it will be possible to directly associate documents with a document class and thus ensure that they have a single unique classification throughout their lifecycle; users can then display them in restructured forms for specific use cases at any time without incurring additional costs. In order to determine to what extent this process can be automated with machine learning, a range of algorithms were applied to real building documentation, analyzed, tested for reliability and optimized for building-specific data. The analysis demonstrated that not all digitalized documents are directly suited to automated classification; the paper therefore illustrates the associated problems and presents detailed recommendations for facilitating automated classification and migration using machine learning. In this way, major errors can be avoided from the very beginning of the digitalization process.
    Keywords: Artificial Intelligence; data room; Digitization; document classification; real estate data
    JEL: R3
    Date: 2018–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2018_165&r=big
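    A rough Python sketch of the automated classification step described above: a TF-IDF model with a linear classifier assigns text from a digitalized document to a document class. The three classes and training snippets are invented placeholders, not the authors' taxonomy:

      # Multi-class document classification for data-room migration:
      # each incoming (OCR'd) document is mapped to one document class.
      from sklearn.pipeline import make_pipeline
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression

      texts = ["tenant agrees to pay monthly rent of",      # lease contract
               "annual inspection of fire safety systems",  # maintenance
               "meter reading kwh consumption period"]      # energy report
      classes = ["lease", "maintenance", "energy"]

      clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
      clf.fit(texts, classes)
      print(clf.predict(["quarterly rent adjustment clause"]))  # expected: 'lease'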
  11. By: Jahn, Malte
    Abstract: Artificial neural networks have become increasingly popular for statistical model fitting in recent years, mainly due to increasing computational power. This paper gives an introduction to the use of artificial neural network (ANN) regression models, using the problem of predicting the GDP growth rates of 15 industrialized economies over the period 1996-2016 as an example. It is shown that the ANN model yields much more accurate predictions of GDP growth rates than a corresponding linear model. In particular, ANN models capture time trends very flexibly, which is relevant for forecasting, as demonstrated by out-of-sample predictions for 2017.
    Keywords: neural network, forecasting, panel data
    JEL: C45 C53 C61 O40
    Date: 2018
    URL: http://d.repec.org/n?u=RePEc:zbw:hwwirp:185&r=big
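    A toy Python comparison in the spirit of the paper: an ANN regression against a linear benchmark on data with a nonlinear trend. The data are synthetic; the paper uses an actual panel of GDP growth rates:

      # Nonlinear trend: the MLP should fit it, the linear model cannot.
      import numpy as np
      from sklearn.neural_network import MLPRegressor
      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import mean_squared_error
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      t = np.linspace(0, 4, 200)
      y = np.sin(t) + 0.2 * t + rng.normal(scale=0.1, size=200)
      X = t.reshape(-1, 1)

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

      for model in (LinearRegression(),
                    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000,
                                 random_state=0)):
          model.fit(X_tr, y_tr)
          print(type(model).__name__,
                mean_squared_error(y_te, model.predict(X_te)))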
  12. By: Max H. Farrell; Tengyuan Liang; Sanjog Misra
    Abstract: We study deep neural networks and their use in semiparametric inference. We provide new rates of convergence for deep feedforward neural nets and, because our rates are sufficiently fast (in some cases minimax optimal), prove that semiparametric inference is valid using deep nets for first-step estimation. Our estimation rates and semiparametric inference results are the first in the literature to handle the current standard architecture: fully connected feedforward neural networks (multi-layer perceptrons), with the now-default rectified linear unit (ReLU) activation function and a depth explicitly diverging with the sample size. We discuss other architectures as well, including fixed-width, very deep networks. We establish nonasymptotic bounds for these deep ReLU nets, for both least squares and logistic losses in nonparametric regression. We then apply our theory to develop semiparametric inference, focusing on treatment effects and expected profits for concreteness, and demonstrate their effectiveness with an empirical application to direct mail marketing. Inference in many other semiparametric contexts can be readily obtained.
    Date: 2018–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1809.09953&r=big
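    A hedged Python sketch of the two-step logic: ReLU MLPs estimate the nuisance functions, which are then plugged into the standard doubly-robust (AIPW) formula for the average treatment effect. The simulation has a true ATE of 2, and refinements such as cross-fitting are omitted:

      # Two-step semiparametric estimation: MLPs for the nuisance
      # functions, then the doubly-robust (AIPW) formula for the ATE.
      import numpy as np
      from sklearn.neural_network import MLPClassifier, MLPRegressor

      rng = np.random.default_rng(0)
      n = 4000
      X = rng.normal(size=(n, 3))
      T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # confounded treatment
      Y = 2 * T + X[:, 0] + np.sin(X[:, 1]) + rng.normal(size=n)

      mlp = dict(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
      e_hat = MLPClassifier(**mlp).fit(X, T).predict_proba(X)[:, 1]  # propensity
      m1 = MLPRegressor(**mlp).fit(X[T == 1], Y[T == 1]).predict(X)  # E[Y|X,T=1]
      m0 = MLPRegressor(**mlp).fit(X[T == 0], Y[T == 0]).predict(X)  # E[Y|X,T=0]

      e_hat = np.clip(e_hat, 0.01, 0.99)     # trim extreme propensities
      ate = np.mean(m1 - m0 + T * (Y - m1) / e_hat
                    - (1 - T) * (Y - m0) / (1 - e_hat))
      print(f"estimated ATE: {ate:.2f} (truth: 2)")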
  13. By: Paul Greenhalgh; Kevin Muldoon-Smith; Adejimi Adebayo; Josephine Ellis
    Abstract: This novel research project draws upon the experience of a small number of experimental research projects in seeking to extend the frontiers of spatial modelling of commercial real estate markets. In so doing, it explores new ways of capturing, integrating, representing, illustrating and modelling commercial real estate data alongside other spatial variables. There is tacit understanding of the relationship between the distribution of commercial properties and spatial factors such as proximity to footfall, transport and other infrastructure. However, there has been surprisingly little research able to illustrate these tangled market relationships using spatial analyses underpinned by empirical quantitative data. This research project has developed a methodology to visualise the distribution of rateable value, used as a proxy for the attractiveness of commercial property, across a pilot study area: the City of York, in North Yorkshire, England. The project has experimented with the use of grid squares to analyse geo-spatial relations between commercial real estate variables (such as rental value, stock, vacancy and availability) and other spatial variables (such as infrastructural facilities and transportation nodes). The project has confirmed that grid squares are more effective at representing data that are unevenly distributed across urban space at city level than other artificial delineations, such as area postcodes, political boundaries or streets. The grid square approach can be further enhanced using 3D extrusions, which facilitate simultaneous representation of an additional data characteristic, for example total stock combined with average rental value by location. Finally, modelling was conducted using hexagonal rather than square tiles, which revealed that hexagonal tiles are potentially more accurate, due to the proximity of data to the centroid of the tile (effectively losing the corners), and more efficacious at representing linear spatial patterns of commercial property market data, as hexagons have 50% more directions of alignment than square tiles.
    Keywords: business rates; Commercial; Rateable value; retail property; Spatial Analysis
    JEL: R3
    Date: 2018–01–01
    URL: http://d.repec.org/n?u=RePEc:arz:wpaper:eres2018_61&r=big
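    A short Python sketch of the hexagonal aggregation idea, using matplotlib's hexbin to average a simulated stand-in for rateable value within hexagonal tiles; the GIS workflow in the paper is of course richer than this:

      # Hexagonal binning: average a property value within hexagonal
      # tiles over simulated locations, peaking near a "city centre".
      import numpy as np
      import matplotlib.pyplot as plt

      rng = np.random.default_rng(0)
      x, y = rng.uniform(0, 10, 2000), rng.uniform(0, 10, 2000)   # locations
      value = 100 + 50 * np.exp(-((x - 5)**2 + (y - 5)**2) / 4)   # centre peak

      plt.hexbin(x, y, C=value, gridsize=15, reduce_C_function=np.mean)
      plt.colorbar(label="mean rateable value per tile")
      plt.savefig("hex_rateable_value.png")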
  14. By: Giovanni Cerulli (Research Institute on Sustainable Economic Growth, Rome)
    Abstract: Matching is a popular estimator of the Average Treatment Effects (ATEs) within counterfactual observational studies. In recent years, however, many scholars have questioned the validity of this approach for causal inference, as its reliability draws heavily upon the so-called selection-on-observables assumption. When unobservable confounders are possibly at work, they say, it becomes hard to trust matching results, and the analyst should consider alternative methods suitable for tackling unobservable selection. Unfortunately, these alternatives require extra information that may be costly to obtain, or even not accessible. For this reason, some scholars have proposed matching sensitivity tests for the possible presence of unobservable selection. The literature sets out two methods: the Rosenbaum (1987) and the Ichino, Mealli, and Nannicini (2008) tests. Both are implemented in Stata. In this work, I propose a third, different sensitivity test for unobservable selection in matching estimation, based on a 'leave-covariates-out' (LCO) approach. Rooted in the machine learning literature, this sensitivity test bootstraps over different subsets of covariates and simulates various estimation scenarios to be compared with the baseline matching estimated by the analyst. Finally, I present sensimatch, the Stata routine I developed to run this method, and provide some instructional applications on real datasets.
    Date: 2018–10–15
    URL: http://d.repec.org/n?u=RePEc:boc:usug18:02&r=big
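    A simplified Python sketch of the leave-covariates-out idea: the matching ATT is re-estimated with each covariate dropped in turn, and instability across rows signals sensitivity to the covariate set. sensimatch bootstraps over covariate subsets; this one-covariate-out loop is only the simplest variant:

      # 1-NN propensity-score matching, re-run with each covariate left out.
      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.neighbors import NearestNeighbors

      rng = np.random.default_rng(0)
      n = 2000
      X = rng.normal(size=(n, 4))
      T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0] - 0.5 * X[:, 1])))
      Y = 1.0 * T + X[:, 0] + X[:, 1] + rng.normal(size=n)   # true ATT = 1

      def att_matching(X, T, Y):
          ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
          nn = NearestNeighbors(n_neighbors=1).fit(ps[T == 0].reshape(-1, 1))
          _, idx = nn.kneighbors(ps[T == 1].reshape(-1, 1))
          return np.mean(Y[T == 1] - Y[T == 0][idx.ravel()])

      print("all covariates:", round(att_matching(X, T, Y), 3))
      for j in range(X.shape[1]):
          X_lco = np.delete(X, j, axis=1)      # leave covariate j out
          print(f"without covariate {j}:", round(att_matching(X_lco, T, Y), 3))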
  15. By: Junfeng Jiang; Jiahao Li
    Abstract: In this paper, we design an integrated algorithm to evaluate the sentiment of the Chinese market. First, using web browser automation, we automatically crawl news items and comments from several influential financial websites. Second, we use natural language processing (NLP) techniques adapted to the Chinese context, including tokenization, Word2vec word embeddings and the semantic database WordNet, to compute senti-scores for these news items and comments, and then construct the sentimental factor. We build a finance-specific sentimental lexicon so that the factor reflects the sentiment of the financial market rather than general sentiments such as happiness or sadness. Third, we implement an adjustment of the standard sentimental factor. Our experiments show a significant correlation between our standard sentimental factor and the Chinese market, and the adjusted factor is even more informative, having a stronger correlation with the Chinese market. Our sentimental factors can therefore serve as important references when making investment decisions, especially in periods when the market is strongly influenced by public sentiment: during the Chinese market crash in 2015, the Pearson correlation coefficient of the adjusted sentimental factor with the SSE index is 0.5844, suggesting that our model can provide solid guidance.
    Date: 2018–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1809.08390&r=big
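    A toy Python sketch of the lexicon scoring and correlation check; the six-word English lexicon and the four "daily" observations are placeholders for the paper's finance-specific Chinese lexicon and actual market data:

      # Lexicon-based senti-scores, then Pearson correlation with returns.
      import numpy as np
      from scipy.stats import pearsonr

      lexicon = {"rally": 1, "surge": 1, "gain": 1,
                 "crash": -1, "default": -1, "panic": -1}

      def senti_score(text):
          words = text.lower().split()
          return sum(lexicon.get(w, 0) for w in words) / max(len(words), 1)

      daily_news = ["markets rally on stimulus hopes",
                    "panic selling as banks default",
                    "modest gain after quiet session",
                    "crash fears spark panic"]
      factor = np.array([senti_score(d) for d in daily_news])
      index_returns = np.array([0.8, -2.1, 0.3, -1.7])   # toy index moves

      r, p = pearsonr(factor, index_returns)
      print(f"Pearson correlation: {r:.3f} (p = {p:.3f})")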
  16. By: Fernando A Morales; Carlos A Osorio; Daniel Cabarcas J
    Abstract: This paper introduces an algorithm, designed for large databases, for improving pass rates in lower-division university mathematics courses with several sections. Using integer programming techniques, the algorithm finds the optimal pairing of students and lecturers in order to maximize the success chances of the student body. The student-lecturer success probabilities are computed from the corresponding profiles stored in the databases.
    Date: 2018–09
    URL: http://d.repec.org/n?u=RePEc:arx:papers:1809.09724&r=big
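    A compact Python sketch of the pairing step: given a matrix of estimated student-lecturer success probabilities, section capacities are expanded into seats and the assignment is solved exactly (here via the Hungarian algorithm; the paper works with integer programming more generally). The probabilities are random stand-ins for the profile-based estimates:

      # Capacity-constrained assignment maximizing total success probability.
      import numpy as np
      from scipy.optimize import linear_sum_assignment

      rng = np.random.default_rng(0)
      n_students, capacities = 6, [2, 2, 2]        # 3 lecturers, 2 seats each
      P = rng.uniform(0.3, 0.9, size=(n_students, len(capacities)))

      # Expand each lecturer column into one column per seat.
      cols = np.repeat(np.arange(len(capacities)), capacities)
      seat_matrix = P[:, cols]

      rows, seats = linear_sum_assignment(seat_matrix, maximize=True)
      for s, seat in zip(rows, seats):
          print(f"student {s} -> lecturer {cols[seat]} "
                f"(success prob {P[s, cols[seat]]:.2f})")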
  17. By: Bantle, Melissa (Helmut Schmidt University, Hamburg); Muijs, Matthias (University of Hohenheim)
    Abstract: Market delineation is a fundamental tool in modern antitrust analysis. However, the definition of relevant markets can be very difficult in practice. This preliminary draft applies a new methodology, combining a simple price correlation test with hierarchical clustering (a method known from machine learning), to analyze the competitive situation in the German retail gasoline market. Our analysis reveals two remarkable results. First, there is a uniform pattern across stations of the same brand in their maximum daily prices, which supports the claim that prices are partly set centrally. More importantly, however, price reactions are also influenced by regional or local market conditions, as the price setting of gasoline stations is strongly affected by commuter routes.
    Keywords: market definition; gasoline market; price tests; competition; k-means clustering; hierarchical clustering
    JEL: D22 D40 D43 L10
    Date: 2018–08–29
    URL: http://d.repec.org/n?u=RePEc:ris:vhsuwp:2018_180&r=big
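    A brief Python sketch of the combined price-test-and-clustering idea: stations whose price series co-move are grouped by hierarchical clustering on a correlation-based distance. The five series are simulated; the paper uses actual German retail gasoline prices:

      # Hierarchical clustering of price series on distance = 1 - correlation.
      import numpy as np
      from scipy.cluster.hierarchy import linkage, fcluster
      from scipy.spatial.distance import squareform

      rng = np.random.default_rng(0)
      common = rng.normal(size=100)                # shared brand price pattern
      prices = np.vstack([common + rng.normal(scale=0.2, size=100),  # brand A
                          common + rng.normal(scale=0.2, size=100),  # brand A
                          rng.normal(size=100),                      # isolated
                          common + rng.normal(scale=0.2, size=100),  # brand A
                          rng.normal(size=100)])                     # isolated

      dist = 1 - np.corrcoef(prices)               # correlation-based distance
      np.fill_diagonal(dist, 0.0)
      Z = linkage(squareform(dist, checks=False), method="average")
      print(fcluster(Z, t=2, criterion="maxclust"))  # two candidate markets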
  18. By: Trapp, Andrew C. (Foisie Business School, Worcester Polytechnic Institute); Teytelboym, Alexander (Department of Economics, University of Oxford); Martinello, Alessandro (Department of Economics, Lund University); Andersson, Tommy (Department of Economics, Lund University); Ahani, Narges (Foisie Business School, Worcester Polytechnic Institute)
    Abstract: The paper will be available on Wednesday October 3 at 9am (Swedish time)
    Keywords: Refugee Resettlement; Matching; Integer Optimization; Machine Learning; Humanitarian Operations
    JEL: C44 C61 C78 F22 J61
    Date: 2018–10–03
    URL: http://d.repec.org/n?u=RePEc:hhs:lunewp:2018_023&r=big

This nep-big issue is ©2018 by Tom Coupé. It is provided as is without any express or implied warranty. It may be freely redistributed in whole or in part for any purpose. If distributed in part, please include this notice.
General information on the NEP project can be found at http://nep.repec.org. For comments please write to the director of NEP, Marco Novarese at <director@nep.repec.org>. Put “NEP” in the subject, otherwise your mail may be rejected.
NEP’s infrastructure is sponsored by the School of Economics and Finance of Massey University in New Zealand.