NEP’s Wildwood project

Introduction

The Wildwood project aims to pre-sort the NEP-all report issues by the likelihood that a paper in the current issue is a duplicate of a paper that appeared in any previous issue. The only sources used are the names of the authors and the titles of the papers. Thus, from each paper, only the author names, the titles and the handle need to be processed. The two source types go through different processing.

Processing of author records

Author records are found with the //text//hasauthor/person//name/text() XPath expression. I leave out the namespace here, and use // to express that all repetitions of elements have to be taken into account. The name expressions are subjected to removal of punctuation, conversion to all lower case, trimming of leading and trailing whitespace and collapsing of all remaining whitespace, and removal of single letters together with the whitespace to their left. The result is split at the whitespace to yield a set of features. The result is stored in a database as feature → handle → number of occurrences in the record. Storing occurrence counts lets us take account of repeated names in an author list, such as husband-and-wife teams. Finally, the total number of author features is stored for each handle for ease of lookup later.
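The normalisation steps above can be sketched as follows. This is a minimal sketch only: the function name and the exact regular expressions are illustrative assumptions, not taken from the actual Wildwood code.

```python
import re
from collections import Counter

def author_features(name):
    """Turn one author name into a bag of lower-case word features.

    Steps follow the description in the text: strip punctuation,
    lower-case, trim and collapse whitespace, drop single letters
    (initials), split on whitespace.  Removing the letter and then
    re-collapsing whitespace has the same effect as removing the
    letter together with the whitespace to its left.
    """
    s = re.sub(r"[^\w\s]", " ", name)       # remove punctuation
    s = s.lower()                            # all lower case
    s = re.sub(r"\s+", " ", s).strip()       # trim and collapse whitespace
    s = re.sub(r"\b\w\b", "", s)            # remove single letters
    s = re.sub(r"\s+", " ", s).strip()       # collapse again
    return Counter(s.split())
```

The record-level counts for a paper would then be the sum of the per-name counters over its author list, which is what makes repeated names (such as husband-and-wife teams) count twice.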

Processing of title records

Title records are found with the //text//title expression. If there are several titles, they are not concatenated; each title is treated separately. The title expressions are subjected to removal of punctuation, conversion to all lower case, removal of leading and trailing whitespace and collapsing of all remaining whitespace. Then a moving window of length three is used to collect adjacent words. For example, the previous sentence would yield “then a moving”, “a moving window”, “moving window of” and so on. If a title has three words or fewer, its single title feature is the processed title itself. The features are stored in a database as feature → handle → number of occurrences in the record.

Document to document comparison

When a new set of papers from a new nep-all issue is found, we process all papers in the issue. For each paper we record the title and author features. For each feature we have a---possibly empty---set of documents that contain the same feature. A feature that occurs in common is a co-feature. Duplicates are taken into account as the same feature appearing several times. That is, “foo bar” and “foo bar bar” have two features in common. Any known paper that has at least one common author feature AND at least one common title feature with a new paper is said to be in the target set of the new paper. For each paper in the target set, a coefficient of similarity is composed.
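The moving-window extraction of title features described above can be sketched as follows; the function name is illustrative, and the input is assumed to be a title that has already been cleaned (punctuation removed, lower-cased, whitespace normalised).

```python
def title_features(title):
    """Word trigrams from an already-processed title (sketch).

    A moving window of length three collects adjacent words.
    Titles of three words or fewer yield the whole processed
    title as their single feature.
    """
    words = title.split()
    if len(words) <= 3:
        return [" ".join(words)]
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]
```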

Composition of similarity coefficient

For each of the feature types, i.e., author features and title features, we look at the number of common features between the new paper and each paper in the target set. Note that the number of common features takes account of duplicate occurrences of a feature. Thus, if a feature appears twice in paper A and three times in paper B, there are two common occurrences. We then divide the number of common features of a type by the minimum of the number of features of that type in the two papers. This ratio is the typed co-feature ratio of the two papers. The presumed similarity between two papers is the weighted geometric average of the two typed co-feature ratios.
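The typed co-feature ratio can be sketched as follows, assuming the features of each paper are kept as occurrence counts (a Counter); the function name is illustrative only.

```python
from collections import Counter

def cofeature_ratio(features_a, features_b):
    """Typed co-feature ratio of two papers (sketch).

    features_a and features_b map a feature of one type to its
    number of occurrences in each paper.  Duplicate occurrences
    count via the minimum, as described in the text: a feature
    appearing twice in A and three times in B contributes two
    common occurrences.
    """
    common = sum(min(n, features_b[f]) for f, n in features_a.items())
    smaller = min(sum(features_a.values()), sum(features_b.values()))
    return common / smaller if smaller else 0.0
```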

Weight in the geometric average

The weights in the geometric average are calculated using the total number of features of each type and the sum of all features of all types. The weight of a type is equal to the total number of features of the other type divided by the total number of features. Thus we give priority to the feature type for which we observe relatively few features.
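Putting the two sections together, the weighted geometric average can be sketched as below. Whether the feature totals are taken per document pair or over the whole database is not fully specified above; the sketch simply takes them as parameters.

```python
def similarity(r_author, r_title, n_author, n_title):
    """Weighted geometric average of the typed ratios (sketch).

    The weight of each type is the total feature count of the
    *other* type divided by the grand total, so the weights sum
    to one and the scarcer feature type gets the larger weight.
    """
    total = n_author + n_title
    w_author = n_title / total   # weight of author features
    w_title = n_author / total   # weight of title features
    return (r_author ** w_author) * (r_title ** w_title)
```

For example, with 3 author features and 7 title features in total, the author ratio gets weight 0.7 and the title ratio weight 0.3.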

Internal duplication

Once the commonality with papers from previous issues has been calculated, we proceed to the calculation of internal duplication. There we compare the papers in the new issue with the other papers in the same issue.

XML representation

The result of the calculation is a new file in the ~/ernad/var/report/nep-all/source/ps directory. The new XML document contains a copy of the new issue. For each paper, a list of internal and external similarity data is attached. If the Wildwood data is not empty for a given paper, its <text> gets a new child <wildwood>. <wildwood> itself has children <similar id="…" strength="…" type="…"/>. Here the id attribute holds the handle of the target paper. The strength attribute holds the calculated strength of similarity. The type attribute takes the value “int” or “ext” depending on whether we have an internal or an external potential duplication. The data is only added when the strength is higher than the relevant one of two external parameters. The parameters represent the thresholds above which external and internal documents, respectively, are included as duplicate candidates in the XML file. If only one parameter value is given, it is assumed to be the common value for both.
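A sketch of the resulting structure is shown below. The handles, the id attribute on <text> and the strength values are invented for illustration; only the <wildwood> and <similar> element and attribute names come from the description above.

```xml
<!-- Illustrative sketch only: handles and strength values are invented. -->
<text id="RePEc:xxx:yyyyyy:1234">
  ...
  <wildwood>
    <similar id="RePEc:aaa:bbbbbb:5678" strength="0.87" type="ext"/>
    <similar id="RePEc:xxx:yyyyyy:1299" strength="0.64" type="int"/>
  </wildwood>
</text>
```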

Implementation

All work has to be done on khufu. The databases live in ~/ernad/var/db/. A NoSQL database system is used. The wildwood script takes a date and a limit parameter as arguments. It looks up all past data in the databases. It then parses the nep-all issue of that date. That issue is in a file called the report issue file, rif for short. Rifs live in ~/ernad/var/reports/source/us. wildwood calculates the similarity data. If a rif with that date already exists in the ernad/var/reports/nep-all/source/ps directory, it replaces it; otherwise it creates a new one. It then proceeds to add the data from the current rif to the databases. If there is already data for that issue, it is simply replaced. A wildwood_all script acts as a wrapper. It creates the databases anew, and then proceeds to calculate similarity with past issues for all rifs found in ~/ernad/var/reports/source/us.

User interface

This is out of scope for this project.