NEP’s Yanabino protocol

0. Status

This is the Yanabino protocol, dedicated to the memory of Рузалья Нургалеевна Ханнанова. It is written and maintained by me, Thomas Krichel. I first put it out to NEP’s politburo on 2016‒07‒24. This is the version of 2017‒02‒06.

The protocol deals with regulating the inflow of valid papers from RePEc to NEP. At this time valid papers are working papers with a new handle that pass Sune Karlsson’s download check.

1. Introduction

The issue of the irregularity in NEP paper input has been vexing the NEP management ever since I created NEP in 1998. A simple workflow brings out all available papers at issuing time. It can be done in two ways:
(1) issue according to a rigid schedule, regardless of the number of papers;
(2) issue a roughly equal number of papers, regardless of the time of issue. Basically, that would happen whenever the number of outstanding papers reaches a critical value.
The practice has been aligned with the first approach. The second was proposed by Sune many, many moons ago.

Since taking de facto the job of general editor in 2013, I have abandoned simple management. Non-simple management means managing stocks of papers in a queue. This process has been labour-intensive and error-prone. I got sick and tired of manual queue maintenance. Here I set out an algorithm capable of handling stock management in a more automated fashion.

2. Theory

Over the years we have gained insight into the nature of the irregularity of the input flow. Basically, RePEc has a few large archives and many small archives. The output of the small archives has some cyclicality throughout the year, the week and the time of day. Overall this is not a large variation. The main issue with the variability comes from a few large archives. These dump large numbers of papers at times we can’t forecast, or should not have to forecast.

Let’s take an economic approach to the problem. The idea is to claim that those archives that dump papers in bulk signal to us that they are not really interested in timely dissemination. Thus we can have them wait. When we get a large input, we store the papers from these archives for later issue. The proper way to implement this would be to develop a scoring system for archives and series according to how promptly they update their contents. My idea is a bit simpler, but it goes in the same direction.

3. A basic algorithm

We look for new valid documents every day. Thus each day we get a pile of papers that we have found. We issue only weekly. Thus at issue time we should have a bunch of piles available, well, at least six. As per the theory, we try to avoid issuing too many papers from one division. By division I mean an archive or series. For the moment, let’s just think about the division as a series. Thus I propose to “layer” piles. A pile is layered if its papers are sorted by the number of papers in their series. Thus a paper in a series that appears only once, say, appears at the top of the pile, whereas a paper in a series with lots of papers is at the bottom of the layered pile. We can talk about thin layers at the top and thick layers at the bottom.
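Layering by series can be sketched in a few lines of Python. This is only an illustration, not the code NEP runs. It assumes that a pile is a list of (handle, series) pairs and that the count is taken within the pile itself; both the data shape and the name layer_pile are made up for the example.

```python
from collections import Counter


def layer_pile(pile):
    """Return the pile sorted so that thin layers come first.

    `pile` is assumed to be a list of (handle, series) pairs.
    """
    # How many papers does each series have in this pile?
    counts = Counter(series for _, series in pile)
    # sorted() is stable, so papers within one layer keep their order.
    return sorted(pile, key=lambda paper: counts[paper[1]])


# Hypothetical handles and series names, for illustration only.
pile = [
    ("a:1", "big"), ("a:2", "big"), ("a:3", "big"),
    ("b:1", "small"),
    ("c:1", "mid"), ("c:2", "mid"),
]
layered = layer_pile(pile)
```

In this example the single paper of the series "small" ends up at the top, and the three papers of the series "big" form the thick layer at the bottom.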

How do we release? Well, assume we have an arbitrary target number M that we want to release. At release time we have N piles available. We pick the top M / N papers from every pile. If that is not an integer, let’s round it down. Picking from each pile in this way will give us M₁ ≤ M papers, because some piles will contain fewer than M / N papers and because M / N may not be an integer, in which case we have rounded down. Thus after the first take, we may have M − M₁ spaces left to fill, and we start again. At the end, we have something that is a bit less than M. Thus M is the maximum that we take.
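The repeated take can be sketched as follows. This is a sketch under assumptions, not the running implementation: the text does not say how later passes divide the remaining spaces, so here each pass divides them among the piles that still hold papers, and the loop stops once the rounded-down share is zero. The function name release and the list-of-lists representation are hypothetical.

```python
def release(piles, target):
    """Take at most `target` papers from layered piles, top papers first.

    Returns the released batch and what is left of each pile.
    """
    piles = [list(p) for p in piles]  # work on copies
    batch = []
    while True:
        live = [p for p in piles if p]  # piles that still hold papers
        if not live:
            break
        # Share of the remaining spaces per pile, rounded down.
        quota = (target - len(batch)) // len(live)
        if quota == 0:
            break
        for p in live:
            batch.extend(p[:quota])
            del p[:quota]
    return batch, piles


# Three piles of sizes 6, 1 and 5, with a target of M = 10.
batch, leftover = release([[1, 2, 3, 4, 5, 6], [7], [8, 9, 10, 11, 12]], 10)
```

With these numbers the first pass takes 3 papers per pile and yields 7, the second pass takes 1 from each of the two remaining piles, and the third pass stops because the share rounds down to zero. The batch holds 9 papers, a bit less than M.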

Let H be a size horizon, a number of days, say 365 for a year. We can take the number of papers in nep-all for issues from today back to H days ago, say N(H). Then N(H) × 7 / H is the number of papers we get per week, as observed over the H days up to today. We can make this the base size T. Note that if RePEc grows, the estimate T is biased below the actual level. Because of this, and because we sometimes issue fewer than T, we can expect a queue to form.
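As an illustration of the base size, here is a small Python sketch. It assumes that the sizes of past nep-all issues are available as a mapping from issue date to paper count, and it treats the horizon as the half-open window of the last H days; the mapping, the window convention, the name base_size and the numbers are all made up for the example.

```python
from datetime import date, timedelta


def base_size(issue_sizes, today, horizon_days=365):
    """Estimate T = N(H) * 7 / H from past nep-all issue sizes.

    `issue_sizes` is assumed to map an issue date to its paper count.
    """
    start = today - timedelta(days=horizon_days)
    # N(H): papers in issues dated within the last `horizon_days` days.
    n_h = sum(n for d, n in issue_sizes.items() if start < d <= today)
    return n_h * 7 / horizon_days


# Four weekly issues of 700 papers each, observed over H = 28 days.
today = date(2017, 2, 4)
issue_sizes = {today - timedelta(days=7 * k): 700 for k in range(4)}
t = base_size(issue_sizes, today, horizon_days=28)
```

Here N(H) is 2800, so T comes out at 2800 × 7 / 28 = 700 papers per week.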

Let’s look at a fixed calendar variant per week, e.g., every Saturday morning we release a bunch of papers. Let the stock of outstanding papers be S. If S < T, we release S papers. We can’t release more because we don’t have more. If S > T, we need to release more than T. If we only release T, the system has no way to raise T over time, as the future values of T are determined by the past.

The current implementation uses the release algorithm described earlier in this section as its basic release algorithm. This is the one that is bounded by T. In addition, it implements an unbounded release as a function of the age of the piles and the number of papers in them that have not made it into the previous release. Again, let H, our time horizon, be a year. For each pile, consider its date. If this date is more recent than the last release date, skip it. Otherwise, calculate the difference of the last release date minus the date on the pile. Let that difference be d. Let n be the number of papers in the pile. Then, to handle excess, add the first d × n / H papers of the pile to the released batch. Note that this will guarantee that no pile is more than H + 7 days old.
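The excess handling can be sketched like this. Again this is a sketch under assumptions rather than the actual code: piles are assumed to be a mapping from pile date to the list of leftover papers, and d × n / H is rounded down, which the text does not specify. The name excess is hypothetical.

```python
from datetime import date, timedelta


def excess(piles, last_release, horizon_days=365):
    """Collect the first d * n / H papers of every sufficiently old pile.

    `piles` is assumed to map a pile date to its leftover, layered papers.
    """
    extra = []
    for pile_date, papers in piles.items():
        if pile_date > last_release:
            continue  # pile is more recent than the last release
        d = (last_release - pile_date).days
        # Rounding down is an assumption.
        k = d * len(papers) // horizon_days
        extra.extend(papers[:k])
    return extra


last_release = date(2017, 1, 28)
piles = {
    last_release - timedelta(days=73): list(range(10)),    # 73 * 10 / 365 = 2
    last_release - timedelta(days=365): ["x", "y", "z"],   # a year old: all go
    last_release + timedelta(days=2): ["new"],             # skipped
}
extra = excess(piles, last_release)
```

A pile that is H days old has d × n / H = n, so all of its remaining papers are released. That is where the bound on the age of a pile comes from.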

I need to build up a queue that allows me to smooth future output. I build T from observed paper numbers as proposed to the general editor. I don’t take into account that a proper estimation of T needs to consider that the queue should, theoretically, have been flushed every week. This is a second downward bias of T.

4. The details of layering

In the previous section, I proposed to layer the papers by series only. Here I set out the actual algorithm for layering. Each paper in an issue will be scored. The score is a string that contains 23 characters. There are four components of 5 digits each, and three blanks to separate the components for readability.

(1) The first component is the number of papers in the same series. I pad it by zeros to the left to get 5 digits.

(2) The second component is the number of papers from the same archive, again padded by zeros to the left.

(3) The third component is the age. I want papers from earlier dates to appear later. An older paper can wait. It’s part of the strategy to punish non-timely publishers. If there is no date on the paper, the age component is 99999. Otherwise, take the date on the paper. Pad it with “-01” until we get a day, so that 2016 becomes 2016-01-01. Then calculate the age of the paper by counting the number of days from the padded date to today. Pad to the left with zeros.

(4) Finally, for the fourth component, I prioritize well-described papers. I take the character length of the description of the paper. Take 99999 and subtract the character length.

When we sort by standard string comparison, this forms a unique ranking value of each paper that can be computed using the information in the collected data file only. This is the Yanabino score.
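The four components can be put together as in the following Python sketch. It is an illustration under assumptions: paper dates are taken to be strings of the form YYYY, YYYY-MM or YYYY-MM-DD, and every component is clamped to the range 0 to 99999 so that the score always has 23 characters. The clamping and the names field and yanabino_score are not part of the protocol text.

```python
from datetime import date


def field(n):
    """Format a number as 5 digits, clamped to 0..99999 (an assumption)."""
    return f"{max(0, min(n, 99999)):05d}"


def yanabino_score(series_count, archive_count, paper_date, description, today):
    """Build the 23-character score; lower strings sort to the top."""
    if paper_date:
        padded = paper_date
        while padded.count("-") < 2:  # pad with "-01" until we get a day
            padded += "-01"
        age = (today - date.fromisoformat(padded)).days
    else:
        age = 99999  # no date on the paper
    quality = 99999 - len(description)  # longer descriptions sort earlier
    return " ".join(
        [field(series_count), field(archive_count), field(age), field(quality)]
    )


# Hypothetical paper: 3 in its series, 12 in its archive, dated 2016-07,
# with a description of 100 characters.
score = yanabino_score(3, 12, "2016-07", "x" * 100, date(2016, 8, 7))
undated = yanabino_score(1, 1, None, "", date(2016, 8, 7))
```

Sorting the papers by this string with the ordinary string comparison then puts papers from thin series, small archives and recent dates at the top.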

5. Implementation

The protocol has been running since 2016‒08‒07. I introduced the excess handling on 2016‒09‒20. The queue is public.