Is it safe to use "df" as the name for a dataframe?
Is this a bug in the tree package, or is it an important cautionary tale whose moral is that I should avoid using df as the name for a dataframe since doing so introduces a name-clash?
I think in this case it may be both, but for your purposes I would take it more as a cautionary example. The fact that it causes an error here indicates that it may not be the best practice.
In my experience R does not manage namespaces very well (compared with Python, for example). Because of this, it may have been unwise for the authors of tree to introduce (intentionally or not) a conflict with df, which is a common throwaway name for a data frame, if in fact they did so (see the comments here and on the question; it is unclear whether this is a clash between data.frame names or improper use of eval() causing clashes between data.frame objects and functions).
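One place where the name df is already taken is base R itself: the stats package exports a function df(), the density of the F distribution. The sketch below is only an illustration of how such a clash can arise and makes no claim about what the tree package actually does internally. It shows that a data frame called df and the function df() can coexist, but that code which looks the name up as a value gets whichever object is found first.

```r
# In a fresh session, `df` already exists: it is stats::df, the F-distribution density.
df_is_function_before <- is.function(df)

# Assigning a data frame to `df` masks that function for plain value lookups.
df <- data.frame(x = 1:3)
df_is_data_frame_now <- is.data.frame(df)

# In call position R skips non-function objects, so df(...) still reaches stats::df.
f_density <- df(1, df1 = 2, df2 = 3)
matches_stats <- identical(f_density, stats::df(1, df1 = 2, df2 = 3))

# Code that looks the name up as a value (here via get()) finds the data frame first.
looked_up <- get("df")
lookup_is_data_frame <- is.data.frame(looked_up)
```

Ordinary calls therefore keep working, which is why the clash usually goes unnoticed until some code resolves the name in an unusual way.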
With that said, it is a good example of why namespaces are important and (IMO) suggestive of how to write better R code. I think namespaces are being introduced to the R ecosystem, but my experience with R is that there is a lot of namespace 'flatness' and lots of opportunities for name conflicts. For this reason I would suggest that you take this as a reason to use more descriptive / unique identifiers for your own variables. This avoids conflicts like the one you encountered, and provides some future-proofing to help avoid conflicts creeping into previously working code if package internals change.
Because a potential name conflict would make errors more difficult to debug, I forced myself to use dtf instead of df for a long time. However, an important collection of packages, the tidyverse, seems to be fine with using df everywhere in its tests, for example in test-select.r:
df <- tibble(g = 1:3, x = 3:1) %>% group_by(g)
I've been using df a lot recently to name Python pandas data frames, so I tend to use df in R as well nowadays. Let's see if this bites back.
Flat or nested namespace
The question of namespaces is not part of the original question, but it is related to this issue of name conflicts with df. A flat namespace is easier and more fun to use in exploratory data analysis, since you just call all functions directly, but it can lead to collisions. A nested namespace makes debugging more reliable at the cost of being a little more cumbersome, because you have to prefix each function call with the package name.
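As a small sketch of the trade-off, the snippet below defines a hypothetical function that reuses the name sd (the real one lives in the stats package). The unqualified call picks up whichever definition comes first on the search path, while the prefixed call is unambiguous.

```r
# Hypothetical user-defined function that happens to reuse the name of stats::sd.
sd <- function(x) "my own sd"

values <- c(1, 2, 3)

# Unqualified call: the definition found first on the search path wins.
unqualified_result <- sd(values)

# Qualified call: the package prefix reaches the original regardless of masking.
qualified_result <- stats::sd(values)

# Remove the masking definition so later calls to sd() behave normally again.
rm(sd)
```

The same thing happens silently when two attached packages export the same name, which is what makes flat namespaces convenient but fragile.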
Namespace collisions are less of an issue in Python because it has a more nested namespace. For example, you import numpy as np and prefix all numpy function calls with np, as in np.array(). (It's possible to do from numpy import *, but it is frowned upon and linters typically complain about it.)
In R you have to distinguish throwaway code used in exploratory data analysis from more durable code that you are going to reuse. In the second case, if you use only one or a few functions from another package, it's better not to attach the package with library(package_name) but to call the functions you really need with package_name::function.
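A minimal sketch of that style, using only packages that ship with R (the file name and the data frame are made up for the example):

```r
# The tools package ships with R but is not attached by default,
# so this call works without any library(tools).
extension <- tools::file_ext("results.csv")

# The same style works for packages that are attached by default.
first_rows <- utils::head(data.frame(x = 1:10), n = 3)
```

Each call states where the function comes from, so a later change in which packages are attached cannot silently redirect it.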