Biological data being used by an unpublished research paper is considered proprietary

The copyright is probably on the full database release flatfile and the formatted entries ... you will find similar conditions for UniProt/SwissProt so it is not so unusual.

Restrictions on scripts are common; they exist to prevent server performance hits from a large number of requests.

You can simply invite reviewers to download the data from some other server, for example from the EBI SRS server. The URL for entry A00673 would be

"http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[IMGTLIGM-ID:a00673]+-view+FastaSeqs+-ascii"

You can also use a list of accessions, for example A00673 or A01650

"http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[IMGTLIGM-ID:a00673|a01650]+-view+FastaSeqs+-ascii"

If downloading many entries you should pause between requests, but putting lists into the URLs may reduce the number of requests to few enough not to cause a problem. I doubt EBI would be upset by 200 requests - they would be concerned about thousands.
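As a rough illustration of batching accessions into URLs and pausing between requests, here is a minimal Python sketch. The URL pattern is copied from the examples above; the function names, the batch size of 50 and the 2-second pause are my own assumptions, not values recommended by EBI, and the sketch assumes the server still answers at that address.

```python
import time
from urllib.request import urlopen

# URL pattern copied from the examples above; everything else is illustrative.
SRS_BASE = "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_url(accessions):
    """Build one query URL for a batch of accessions (lower-cased, '|'-joined)."""
    ids = "|".join(acc.lower() for acc in accessions)
    return f"{SRS_BASE}?[IMGTLIGM-ID:{ids}]+-view+FastaSeqs+-ascii"


def fetch_all(accessions, batch_size=50, pause_seconds=2.0):
    """Fetch FASTA text batch by batch, sleeping between requests.

    batch_size and pause_seconds are assumed values, chosen only to keep
    the number of requests small and spaced out.
    """
    results = []
    for index, batch in enumerate(chunked(accessions, batch_size)):
        if index > 0:
            time.sleep(pause_seconds)  # be polite to the server
        with urlopen(build_url(batch)) as response:
            results.append(response.read().decode("utf-8", errors="replace"))
    return "".join(results)
```

With these assumed defaults, 200 accessions would turn into four requests rather than 200.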

There are various fasta formats available for IMGT data, so you need to find a server that produces fasta files compatible with your input requirements.

Alternatively of course your reviewers could download the whole database from IMGT or any of the other servers (including ftp://ftp.ebi.ac.uk/pub/databases/imgt/) and generate their own fasta subset from the list of accessions/ids
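Generating the subset could look something like the sketch below. It assumes the reviewers already have the database as a fasta file, and that the accession is the first field of each header line, separated from the rest by whitespace or '|'. That header layout is a guess on my part (fasta headers differ between servers), so `accession_of` would need adjusting to the file actually used.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def accession_of(header):
    """Return the accession from a header line.

    ASSUMPTION: the accession is the first field, delimited by
    whitespace or '|'. Adjust this for the header format of your file.
    """
    fields = header.replace("|", " ").split()
    return fields[0].upper() if fields else ""


def subset_fasta(lines, wanted):
    """Yield only the records whose accession is in `wanted` (case-insensitive)."""
    wanted = {acc.upper() for acc in wanted}
    for header, sequence in read_fasta(lines):
        if accession_of(header) in wanted:
            yield header, sequence
```

Usage would be along the lines of `subset_fasta(open("all_entries.fasta"), ["A00673", "A01650"])`, where the file name is hypothetical.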

Hope that helps!


Yes, it is making your life a little harder, but it doesn't mean publication of your work is impossible, nor does it make IMGT's action unethical. I cannot comment on the legality of IMGT's copyright claim on the data, but contesting their claim doesn't sound like a great idea in the first place. In any case, talk it through with a lawyer from your university's legal department, before you do anything that deviates from what IMGT asks for.

Now, how can you move forward? Well, separate your existing code (which does the scraping and the analysis) into two separate parts:

  1. The IMGT website scraper/parser, which will download data and write it to files named after each query (M38103.txt for query “M38103”).

    Do not publish that part (but keep it around, it would be a shame to throw away code that you have already written, and that works: you never know, IMGT's policy may change in the future).

  2. The bulk of your analysis code, which takes these query results as text input files.

You now publish #2, and give the referees access to the files (journal submission websites have an option for “supporting information for reviewers only”, although it may be called differently). In the paper, you indicate clearly (but not aggressively) that “because licensing restrictions do not allow us to redistribute IMGT data, we provide a script that requires query results as text input files”.
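To make the split concrete, part #2 could start with something like the following Python sketch. It only assumes what is described above (one text file per query, named after the query ID); the function names are made up for illustration, and `analyse` is a placeholder standing in for your real analysis.

```python
from pathlib import Path


def load_query_results(directory):
    """Map each query ID to the text of its result file.

    Files are expected to be named after the query, e.g. M38103.txt
    for query "M38103". Other file types in the directory are ignored.
    """
    results = {}
    for path in sorted(Path(directory).glob("*.txt")):
        results[path.stem] = path.read_text(encoding="utf-8")
    return results


def analyse(results):
    """Placeholder for the real analysis: report the length of each result."""
    return {query: len(text) for query, text in results.items()}
```

The published code then never touches the IMGT website: anyone who has the files, however they obtained them, can run the analysis on a directory of query results.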

You're not the first person to publish valid research results that come from analysis of a proprietary data source. There is no ethical issue here, because the reviewers have enough information to accurately review the validity of your work. Moreover, even the readers will be able to reproduce your work, though it will require a separate download step (and definitely depends on IMGT keeping its database online and freely accessible).

So yeah, it makes your code a little harder to use for others, but it doesn't diminish the value of the results you have obtained with it!


As F'x already said, from a scientific perspective there is no problem.

The data is available, just not automatically and from you, but from the original source (via the ID numbers). So editor, reviewers and readers can get the data, given they are not too lazy to download it from the original web site. That is much more than is common in many fields of physical/experimental science.
Considering that you are talking about biological data, even a download without automation is so much more convenient than trying to reproduce experimental data (although I have to say that it is a sensible and under-used mid-way checkpoint) ...

However, here are a few more thoughts:

Isn't biological data like this public domain? Is it really possible to treat immunoglobulin and T cell receptor nucleotide sequence data as proprietary information?

  • Facts cannot be copyrighted. But the measured data is subject to copyright. If I go through the effort of doing these measurements, I'm the owner of that data. Just as you are the author of the program code you wrote and the paper you wrote.
    But just as you cannot forbid someone else from writing another program that does the same, or another paper on the same subject, I cannot forbid you from making your own measurements just because I have the rights to mine.
    Of course I could donate my data base to the public domain, just as you can put your program under a FOSS license.

  • Copyright varies considerably depending on jurisdiction. So IMHO this question cannot really be answered without taking into account where the database comes from (EU) and where you are located.
    Now, in the EU we have a database copyright, provided you put enough effort into making the data base. It is not enough to grab an old encyclopedia and scan pages from that, but carefully curating a nucleotide sequence database clearly is enough. Again, this copyright is for the database, not for the facts stored in the database.
    For your nucleotide sequences that means: if you choose to use their nucleotide database, you have to stick to their rules. But again, you are free to measure the nucleotide sequence yourself and use that data set instead.


how does one make data available to reviewers and not to users

You can use the letter to the editor to tell the editor that you'd be happy to supply the reviewers with the curated data set you actually used for the analysis.


the data is inconvenient to download. [...]
I realise that the users of this site may not be comfortable offering what is essentially a legal opinion

While it is certainly a good idea to learn about copyright, I cannot recommend going for legal loopholes against the database owner's expressed wishes. They do allow enough for you and other scientists to do science. Why would you want to upset them?

I think it would be better to talk to them. What about this: instead of asking them to allow scripts, you could offer to produce for them a second version of the data base that is suitable as machine-readable input, and kindly ask them whether they would be willing to make that version available via their server.

In addition to learning about copyright, maybe you could ask them about the reasons for their download policy. There may be a whole lot for you to learn in that answer as well.

I come from one of the physical sciences where good measurements take lots of effort. Here are some reasons why a data base owner may say that people should download the data base from the original source:

  • The name of "owner"/author of the database is associated with it. As author you may not want to run the risk of getting associated with derivatives that do not follow your strict high-quality policy.
  • One very simple way to ensure that people actually get your data base when they think they do is to tell them always to download the original
    (there are alternatives, such as signing a version for distribution etc.).
  • In addition, the owner may want/need to have at least a rough overview of how many people use the data base. Such information is at the very least extremely helpful when you need to show that you are not doing useless stuff for your wages.
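On the point about making sure people really have the original: one simpler alternative to a signed release is comparing a checksum. This is weaker than a signature, since it only detects changes relative to a value you trust, and it assumes the owner publishes such a value, which I do not know to be the case for IMGT. A minimal Python sketch, with made-up function names:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_checksum(path, expected_hex):
    """Compare a local file against a checksum obtained from the owner.

    ASSUMPTION: `expected_hex` comes from a source you trust; no such
    published value is mentioned anywhere in this thread.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```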