Popular proprietary program or obscure open source substitute for reproducible research?

I think there are two kinds of reproducibility:

  1. The ability of someone else to run your code and obtain the same output.
  2. The ability of someone else to write their own code that does the same thing as yours based on your description and on examination of your code (reproduction from scratch).

The second kind of reproducibility is much more convincing, since the main point of scientific reproducibility is to verify correctness of the result. For science that relies on code, it is usually impossible to include every detail of the code in the paper, so verification requires examination of the code.

If you use proprietary software, your code probably makes use of closed source code, and therefore it cannot be verified or reproduced from scratch. If you use open source software, then all of the code that your code calls is probably open source, so it can all be verified or reproduced by someone else from scratch.

At present, it is probably true that the first kind of reproducibility is more achievable with proprietary, widely-used software. I am optimistic that the current trend will lead to open-source software catching up in terms of wide use (consider SAGE, for example).


Addendum, in light of Epigrad's answer below, which I mainly agree with: The problem with relying on closed-source code isn't that someone else won't know what that closed-source code is expected to do.

The problem is that if you have two closed-source implementations of the same algorithm and they give different results (trust me, they usually will), then you have no way of determining which (if either) is correct.

In other words, closed-source code would be fine for reproducibility if it were bug-free. But it's not.
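
A small illustration of how two implementations of the same computation can disagree, even without an outright bug. This is only a sketch: the values and the function name `sum_in_order` are made up for the example and are not taken from any particular package. Both calls add the same three numbers left to right; only the order of the terms differs.

```python
def sum_in_order(values):
    """Add the values left to right in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, two different accumulation orders (hypothetical data).
order_a = [1e16, 1.0, -1e16]
order_b = [1e16, -1e16, 1.0]

result_a = sum_in_order(order_a)  # the 1.0 is lost to rounding when added to 1e16
result_b = sum_in_order(order_b)  # the large terms cancel first, so the 1.0 survives

print(result_a, result_b)
```

With the source in hand you can see that the discrepancy comes from rounding and accumulation order. With two closed-source routines you would only see two different numbers.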


To supplement @David Ketcheson's answer with a "Yes, but..."

I agree that there are two types of reproducibility - CrossValidated discusses them with some degree of frequency. There is, as has been mentioned, "Can I click 'Run' and get the same answer you did" reproducibility, which I generally don't find very compelling.

There's also "Could I repeat your analysis from what you have provided from Step 1 to Step End, and get the same or a similar answer?" I think this is the one we should be aiming for.

That is often helped by using accessible, non-proprietary code, but not always. Consider, for example, an infectious disease dynamics model expressed as a system of ODEs.

Here, in order to replicate (or fail to replicate) my findings, the software I used doesn't matter. What matters is the equations and parameter values I chose. If I provide those, then the only reason anyone needs my code is that they don't want to implement the study from scratch and would rather just run the code to see whether the results match, tinker with the assumptions a bit, etc. In that case, everyone benefits from the code being in a form people use.
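
As an illustration of reproducing such a model from its equations alone, the sketch below assumes a textbook SIR model (dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I) as a stand-in. It is not the specific model referred to above, and `beta`, `gamma`, the initial conditions, the step size, and all function names are illustrative assumptions. It uses a hand-written classical Runge-Kutta integrator and nothing beyond the standard library, so the same few lines could be rewritten in any language.

```python
def sir_rhs(state, beta, gamma):
    """Right-hand side of the assumed SIR equations; state = (S, I, R) as fractions."""
    s, i, r = state
    new_infections = beta * s * i
    recoveries = gamma * i
    return (-new_infections, new_infections - recoveries, recoveries)


def rk4_step(state, dt, beta, gamma):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, factor):
        return tuple(b + factor * k for b, k in zip(base, slope))

    k1 = sir_rhs(state, beta, gamma)
    k2 = sir_rhs(shifted(state, k1, dt / 2), beta, gamma)
    k3 = sir_rhs(shifted(state, k2, dt / 2), beta, gamma)
    k4 = sir_rhs(shifted(state, k3, dt), beta, gamma)
    return tuple(
        y + dt / 6 * (a + 2 * b + 2 * c + d)
        for y, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate_sir(initial_state, beta, gamma, t_end, dt):
    """Integrate from t = 0 to t_end with a fixed step and return the final state."""
    state = tuple(initial_state)
    n_steps = int(round(t_end / dt))
    for _ in range(n_steps):
        state = rk4_step(state, dt, beta, gamma)
    return state


# Hypothetical parameter values, chosen only for the example.
final_s, final_i, final_r = simulate_sir(
    (0.99, 0.01, 0.0), beta=0.3, gamma=0.1, t_end=100.0, dt=0.1
)
print(final_s, final_i, final_r)
```

Anyone given the equations, the parameter values, and the initial conditions can write an equivalent implementation in SAS, R, Matlab, or anything else and compare the numbers.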

I think the same is often true for statistical analysis that doesn't use novel methods. At this point, what matters is that the data is available, and that the code is implemented in a language people understand and use. If 95% of people use SAS, even if it is proprietary, then the way to make your results most accessible, and easiest to replicate, is to have an implementation in SAS. If you pick an obscure but free language, all you've done is replace the "Money" barrier with a "Time to understand" barrier, which for most people equates to the same thing.

The summary is this: I don't think "Free/Open" vs. "Proprietary/Closed" is necessarily the deciding distinction. I think the deciding distinction is accessibility, and the goal should be to maximize it. If there is a software package that is open, free, and popular (R, for example), then great! Use that. But if the field primarily uses one commercial package, picking an obscure alternative just because it's free doesn't fix accessibility; it just shifts the burden.


Let me start with a disclaimer. I generally subscribe to the free software community perspective that proprietary software is questionable ethically, and best avoided if possible. I realise this perspective is not commonly held in scientific circles. Having said that, sometimes proprietary software is a necessary, or at least not easily avoided, evil, and I'm generally pragmatic about using proprietary software when no good alternatives exist. I've used proprietary software in the past, though the only proprietary software I'm currently (sporadically) using is Skype, for which no good free alternatives exist.

However, special considerations apply in a scientific context. One of these has already been covered by @David, namely that in general you can't "see inside" proprietary software to see how something is implemented. Having said that, sometimes proprietary software is written in an interpreted language, as in Splus, and one may be able to see part or all of an algorithm implementation. Regardless, the point holds generally.

A separate and obvious issue, which I don't think anyone has raised, is that using proprietary software forces others who want to use your software to buy the proprietary product you use. These products can be quite expensive, especially for people from poor countries. For example, Matlab, which has been mentioned in this thread, runs to thousands of dollars if one has to pay for a license oneself. Western academic institutions often have site licenses for such popular software, so researchers don't have to pay for it themselves. I personally am quite unhappy when I am expected to use a piece of software written using some proprietary language or package that has to be purchased.

A related issue is that much, if not most, research is done using public funding, i.e. taxpayer money. It seems undesirable to me to use such funds to buy proprietary software, thus adding to the profit of some corporation. In general, there is some movement to make academic work that is done using public funding free; for example, I believe the NIH now has some such policies in place. Similar arguments could be applied to the usage of software tools, and one can easily make the argument that the usage of proprietary software makes one's scientific product less free.

A tangential technical issue is that it is often difficult to get proprietary software to play nice on free software platforms such as the free Unix-like systems currently popular in scientific circles, e.g. the Linux-based systems and the BSD systems. These difficulties include, but are not restricted to:

a) ABI problems. If one wants to compile a C/C++ extension for Matlab, for example, one has to use exactly the version of the compiler that the Matlab program has been compiled with.

b) The program requires obsolete libraries or requires libraries to be in non-standard places.

I mention this issue in part because my understanding of the question is that it is asking about proprietary vs free in the context of pragmatic usage.

So, to respond to the question directly:

Assuming I'm starting a new project and I wish to make it as reproducible as possible. Should I be using relatively unpopular free software or extremely popular proprietary ones?

I don't think there is a clear answer. If there is no viable alternative, then one would have to use the proprietary software, as I do with Skype. If there is a viable free version, I would use it. Bear in mind that if more people start using the "relatively unpopular free software" it will become more popular. :-)