Handling arXiv feeds to avoid duplicates

Unless the arXiv has changed recently, articles are published daily, which means that the feeds and the email are completely in step.

The problem with the duplicates is that each feed is a separate request to the arXiv for information. The arXiv doesn't know that you are going to merge these results, and I've never heard of a feed reader that attempts to merge feeds to remove duplicates.

However, all is not lost. The feeds that the arXiv provides are not the only way to find information. The arXiv has an API which means that you can effectively craft your own feed. For example, if you point your browser at:

http://export.arxiv.org/api/query?search_query=submittedDate:[20091014200000+TO+20091015200000]&start=0&max_results=500

then you get all the papers submitted yesterday. You can also filter your search by subject:

http://export.arxiv.org/api/query?search_query=%28cat:math.AT+OR+cat:math.CT%29+AND+submittedDate:[20091014200000+TO+20091015200000]&start=0&max_results=500

Because the requests are handled all at once, there are no duplicates produced (as can be seen since Emily Riehl's paper is both math.AT and math.CT).
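If you'd rather not type these URLs by hand, something like the following Python sketch could assemble them. The function name `build_query_url` and its parameters are made up for illustration; only the endpoint and the query syntax are taken from the URLs above. Note that `urlencode` percent-encodes the colons and brackets as well as the parentheses, which decodes to the same query string.

```python
from datetime import datetime
from urllib.parse import urlencode

# Endpoint as it appears in the example URLs above.
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(categories, start, end, max_results=500):
    """Build one API query covering several categories and a date window.

    `categories` is a list such as ["math.AT", "math.CT"]; `start` and `end`
    are datetimes already expressed in GMT/UTC. (Hypothetical helper, not
    part of any arXiv library.)
    """
    stamp = "%Y%m%d%H%M%S"
    window = f"submittedDate:[{start.strftime(stamp)} TO {end.strftime(stamp)}]"
    if categories:
        cats = " OR ".join(f"cat:{c}" for c in categories)
        query = f"({cats}) AND {window}"
    else:
        query = window
    params = {"search_query": query, "start": 0, "max_results": max_results}
    # urlencode turns spaces into '+' and escapes the punctuation.
    return f"{API_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url(
        ["math.AT", "math.CT"],
        datetime(2009, 10, 14, 20, 0, 0),
        datetime(2009, 10, 15, 20, 0, 0),
    ))
```

Since all the categories go into a single request, a cross-listed paper should come back once rather than once per category.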

The only catch is that you need to put the date in the proper form each time; you can't put in dates such as "today" or "yesterday". Plus, the timezone handling is a little weird: the arXiv publishes updates at a certain time determined by its local timezone, which includes daylight saving changes, but the API uses GMT/UTC. So if you want to exactly replicate the "new preprints" announcement of the arXiv, then you need to do some funky timezone conversions.

However, this can be done and I've done it. I use a program called RefBase for organising my references and I've modified it so that each morning it presents me with a list of what's new on the arXiv for me to scan through and decide which articles to add to my own bibliographic database. I can also scan back a few days if I've been on holiday. Buried in this extension is the code for figuring out what the date-stamp should be. I could extract it if there's any interest.
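This is not the RefBase code, but a minimal Python sketch of the kind of conversion involved might look like the following. Two things here are assumptions rather than documented facts: that the arXiv's local timezone can be modelled as `America/New_York`, and that the daily cutoff is 16:00 local time (inferred from the 20:00 GMT stamps in the example URLs above, which date from October). The name `utc_window` is made up. Weekends and holidays are ignored.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo  # standard library from Python 3.9

# Assumption: the arXiv's local time follows US Eastern time.
ARXIV_TZ = ZoneInfo("America/New_York")
# Assumption: the daily cutoff is 16:00 local time.
CUTOFF = time(16, 0)


def utc_window(day):
    """Return (start, end) date-stamps in UTC for the 24 hours that end
    at the local cutoff on `day`, formatted as YYYYMMDDHHMMSS."""
    end_local = datetime.combine(day, CUTOFF, tzinfo=ARXIV_TZ)
    start_local = datetime.combine(day - timedelta(days=1), CUTOFF, tzinfo=ARXIV_TZ)
    fmt = "%Y%m%d%H%M%S"
    # astimezone applies the daylight-saving offset in force on each date.
    return tuple(
        t.astimezone(timezone.utc).strftime(fmt) for t in (start_local, end_local)
    )


if __name__ == "__main__":
    print(utc_window(date(2009, 10, 15)))
    print(utc_window(date(2009, 12, 15)))
```

For 15 October 2009 this reproduces the window used in the URLs above; for a date in December the stamps shift by an hour because daylight saving time is no longer in effect.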

Documentation on the arXiv API is at their documentation site. The 'submittedDate' stuff isn't covered there, though; that's a newer feature.

Scirate does exactly this. You register there and pick your favorite arXiv sections, and you'll get all papers in those categories every day with no duplicates. You can browse through past feeds, too. Scirate does not publish Atom or RSS feeds. What they call "feeds" are simply listings on a web page.

I've been using Scirate for math.CO, cs.IT, and quant-ph and have never seen duplicates of cross-posted preprints.

You can also "upvote" (or "scite" in their jargon) a preprint and leave a comment. I don't think there are many mathematicians there; it was originally for quantum information folks. But it's already useful if you're alone in your field, and it may prove even more useful if more and more peers in your field start using it, I think.

Admittedly old-fashioned, but I get my arXiv fix via email. Over email, if you sign up to several repositories, you get a single combined message, without duplicates.

On the downside, that email comes once a day, so if you're obsessed with the latest, you'll find yourself slightly out of date relative to RSS readers.
