Introduction
The Wildwood project aims to pre-sort the NEP-all report issues by the
likelihood that a paper in a current issue is a duplicate of a paper
that appeared in any previous issue. The only sources used are the
names of the authors and the titles of the papers. Thus, from each
paper, only the author names, the titles and the handle need to be
processed. The two source types go through different processing.
Processing of author records
Author records are found with the
//text//hasauthor/person//name/text() XPath expression. I leave out
the namespace here, and use // to express that all repetitions of
elements have to be taken into account. The name expressions are
subjected to removal of punctuation, change to all lower case,
trimming of leading and trailing whitespace, collapsing of all
remaining whitespace, and removal of single letters together with the
whitespace to their left. The result is split at whitespace to yield a
set of features. The features are stored in a database, as feature →
handle → number of occurrences in the record. Thus we take account of
repeated names in an author list, such as a husband-and-wife team
sharing a surname. Finally, the total number of author features is
stored for each handle for ease of lookup later.
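As an illustration, here is a minimal sketch of the author pipeline in
Python; the function names, the plain dicts standing in for the
database, and the exact punctuation pattern are ours, not prescribed
by ernad:

    import re
    from collections import Counter

    def author_features(name):
        """One author name -> list of word features, per the rules above."""
        s = re.sub(r"[^\w\s]", " ", name)   # remove punctuation
        s = s.lower().strip()               # lower case, trim the ends
        s = re.sub(r"\s+", " ", s)          # collapse remaining whitespace
        s = re.sub(r"\s*\b\w\b", "", s)     # drop single letters and the
                                            # whitespace to their left
        return s.split()

    def index_authors(index, totals, handle, names):
        """Store feature -> handle -> occurrence count, plus the total
        feature count per handle for later lookup."""
        counts = Counter(f for n in names for f in author_features(n))
        for feature, n in counts.items():
            index.setdefault(feature, {})[handle] = n
        totals[handle] = sum(counts.values())

For example, author_features("Smith, J. A.") yields ["smith"].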
Processing of title records
Title records are found with the //text//title XPath expression. If
there are several titles, they are not concatenated; each title is
treated separately. The title expressions are subjected to removal of
punctuation, change to all lower case, removal of leading and trailing
whitespace, and collapsing of all remaining whitespace. Then a moving
window of length three is used to collect adjacent words. For example,
the previous sentence would yield “then a moving”, “a moving window”,
“moving window of”, and so on. The features are stored in a
database as feature → handle → number of occurrences in the record. If
a title has three words or fewer, its single title feature is the
processed title itself.
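A matching sketch for the title side; as before, the names and the
pattern are illustrative:

    import re

    def title_features(title):
        """One title -> features from a moving window of three words."""
        s = re.sub(r"[^\w\s]", " ", title)          # remove punctuation
        s = re.sub(r"\s+", " ", s.lower()).strip()  # lower case, tidy spaces
        words = s.split()
        if len(words) <= 3:                  # short title: the processed
            return [" ".join(words)]         # title is the one feature
        return [" ".join(words[i:i + 3])     # moving window of length 3
                for i in range(len(words) - 2)]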
Document to document comparison
When a new set of papers from a new nep-all issue is found, we process
all papers in the issue. For each paper we record the title and author
features. For each feature we have a (possibly empty) set of known
documents that contain the same feature. A feature that two documents
have in common is a co-feature. Duplicates are taken into account as
the same feature appearing twice. That is, “foo bar” and “foo bar bar”
have two features in common. Any known paper that has at least one
common author feature AND at least one common title feature with a new
paper is said to be in the target set of the new paper. For each paper
in the target set, a coefficient of similarity is computed.
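To make the target-set rule concrete, a sketch of counting common
occurrences against the indexes built above (the *_feats arguments are
Counters of the new paper's features; all names are again ours):

    from collections import Counter

    def common_counts(index, new_feats):
        """Map handle -> common occurrences with the new paper; each
        feature counts up to the smaller of its two occurrence counts."""
        common = Counter()
        for feature, n_new in new_feats.items():
            for handle, n_old in index.get(feature, {}).items():
                common[handle] += min(n_new, n_old)
        return common

    def target_set(author_index, title_index, author_feats, title_feats):
        """Known papers sharing at least one author feature AND at
        least one title feature with the new paper."""
        a = common_counts(author_index, author_feats)
        t = common_counts(title_index, title_feats)
        return set(a) & set(t)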
Composition of similarity coefficient
For each of the feature types, i.e., author features and title
features, we look for the number of common features between the new
paper and the papers in the target set. Note that the number of common
features takes account of duplicate occurrences of a feature. Thus, if
a feature appears twice in paper A and three times in paper B, there
are two common occurrences. We then divide the number of common
features of a type by the minimum of the number of features of that
type in the two papers. This ratio is the typed co-feature ratio of
the two papers. The presumed similarity between two papers is the
weighted geometric average of the two typed co-feature ratios.
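Restated in symbols (the notation is ours): for feature type t, papers
A and B, common occurrence count c_t(A, B), and per-type feature
totals n_t(·),

    r_t(A, B) = c_t(A, B) / min(n_t(A), n_t(B))
    sim(A, B) = r_author(A, B)^w_author * r_title(A, B)^w_title

where the weights w_author and w_title are defined in the next
section.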
Weight in the geometric average
The weight in the geometric average is calculated using the total
number of features of each type, and the sum of all features of both
types. The weight of a type is equal to the total number of features
of the other type divided by the total number of features. Thus we
give priority to the feature type for which we observe relatively few
features.
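Since the two weights as defined sum to one, the coefficient can be
computed directly. A sketch in the same illustrative Python; the text
does not pin down whose feature totals enter the weights, so this
assumes the new paper's:

    def similarity(c_author, c_title, n_author, n_title, m_author, m_title):
        """Weighted geometric average of the typed co-feature ratios.
        c_*: common occurrences per type; n_*: feature totals of the
        new paper; m_*: feature totals of the known paper."""
        r_author = c_author / min(n_author, m_author)
        r_title = c_title / min(n_title, m_title)
        total = n_author + n_title
        w_author = n_title / total   # weight of a type is the share of
        w_title = n_author / total   # the OTHER type, so the scarcer
                                     # type gets the larger weight
        return r_author ** w_author * r_title ** w_title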
Internal duplication
Once the commonality with papers from previous issues has been
calculated, we proceed to the calculation of internal duplication.
There we compare the papers in the new issue with the other papers in
the same issue.
XML representation
The result of the
calculation is the creation of a new file in the
~/ernad/var/reports/nep-all/source/ps directory. The new XML document
contains a copy of the new issue. For each paper, a list of internal
and external similarity data is attached. If the Wildwood data is not
empty for a given paper, its <text> gets a new child <wildwood>.
<wildwood> itself has children <similar id="…" strength="…"
type="…"/>. Here the id= attribute holds the handle of the target
paper. The strength= attribute holds the calculated strength of
similarity. The type= attribute takes the value “int” or “ext”,
depending on whether we have an internal or an external potential
duplication. The data is only added when the strength is higher than
one of two external parameters. The parameters represent the
thresholds above which external and internal duplicate candidates are
included in the XML file. If only one parameter value is given, it is
taken as the common value for both.
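A hypothetical fragment of the resulting document, with made-up
handles and strengths:

    <text>
      ...
      <wildwood>
        <similar id="RePEc:aaa:bbbccc:1111" strength="0.91" type="ext"/>
        <similar id="RePEc:aaa:bbbccc:2222" strength="0.78" type="int"/>
      </wildwood>
    </text>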
Implementation
All work has to be done on khufu. The databases live in
~/ernad/var/db/. A NoSQL database system is used. The wildwood script
takes a date and limit parameters as arguments. It looks up all past
data in the databases. It then parses the nep-all issue of that date.
That issue is in a file called the report issue file, rif for short.
Rifs live in ~/ernad/var/reports/source/us. wildwood calculates the
similarity data. If a rif with that date already exists in the
~/ernad/var/reports/nep-all/source/ps directory, it replaces it;
otherwise it creates a new one. It then proceeds to add the data from
the current rif to the databases. If there is already data for that
issue, it is simply replaced. A wildwood_all script acts as a wrapper.
It creates the databases anew, and then proceeds to calculate the
similarity with past issues for all rifs found in
~/ernad/var/reports/source/us.
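For concreteness, hypothetical invocations; the argument syntax is an
assumption, only the parameter meanings come from this note:

    wildwood 2011-03-07 0.8       # one limit: common value for ext and int
    wildwood 2011-03-07 0.8 0.6   # separate external and internal limits
    wildwood_all                  # rebuild the databases, then process all
                                  # rifs in ~/ernad/var/reports/source/us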
User interface
This is out of scope for this project.