Perform a semi-join with data.table
One solution I can think of is:
tmp <- x[!y]
x[!tmp]
In data.table, you can have another data table as an i expression (i.e., the first expression in the [.data.table call), and that will perform a join, e.g.:
x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1,3,5,1), z = 1:4)
> x[y]
x y z
1: 1 a 1
2: 3 c 2
3: 5 e 3
4: 1 a 4
The ! before the i expression is an extension of the syntax above that performs a 'not-join', as described on p. 11 of the data.table documentation. So the first assignment evaluates to the subset of x that has no rows where the key (column x) is present in y:
> x[!y]
x y
1: 2 b
2: 4 d
3: 6 f
4: 7 g
5: 8 h
6: 9 i
7: 10 j
It is similar to setdiff in this regard. Since tmp holds exactly the rows of x with no match in y, the second statement removes them and returns all the rows in x where the key is present in y.
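As a minimal sketch of the two steps above, reusing the example tables from this answer (the name semi is only illustrative), the double not-join can be run end to end like this:

```r
library(data.table)

x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1, 3, 5, 1), z = 1:4)

tmp  <- x[!y]    # not-join: rows of x whose key is absent from y
semi <- x[!tmp]  # not-join against tmp: rows of x whose key is present in y

print(semi)
```

Unlike x[y], the result keeps only the columns of x, and the duplicated key 1 in y does not duplicate the matching row of x.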
The ! feature was added in data.table 1.8.4 with the following note in NEWS:
o A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
      DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
      DT[!"a"]                              # same result, now preferred.
      DT[!J(6),...]                         # !J == not-join
      DT[!2:3,...]                          # ! on all types of i
      DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach (slow)
      DT[!J(6L,23L)]                        # same result, faster binary search
  '!' has been used rather than '-' :
      * to match the 'not-join'/'not-where' nomenclature
      * with '-', DT[-0] would return DT rather than DT[0] and not be backwards
        compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
        base R) and after this new feature.
      * to leave DT[+J...] and DT[-J...] available for future use
For some reason, the following doesn't work: x[!(x[!y])]. Probably data.table is too smart about parsing the argument.
P.S. As Josh O'Brien pointed out in another answer, a one-liner would be x[!eval(x[!y])].
More possibilities :
w = unique(x[y,which=TRUE]) # the row numbers in x which have a match from y
x[w]
If there are duplicate key values in x, then that needs :
w = unique(x[y,which=TRUE,allow.cartesian=TRUE])
x[w]
Or, the other way around :
setkey(y,x)
w = !is.na(y[x,which=TRUE,mult="first"])
x[w]
If nrow(x) << nrow(y) then the y[x] approach should be faster.
If nrow(x) >> nrow(y) then the x[y] approach should be faster.
But the anti anti join appeals too :-)
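A rough sketch to check that the two index-based variants select the same rows as the double not-join, using the example tables from the first answer. The names w1, w2 and anti2 are illustrative, and nomatch = 0 is an added assumption so that keys in y with no match in x would not produce NA row numbers (the example data has none):

```r
library(data.table)

x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1, 3, 5, 1), z = 1:4)

# x[y] direction: row numbers of x that have a match in y
w1 <- unique(x[y, which = TRUE, nomatch = 0])

# y[x] direction: one logical flag per row of x
setkey(y, x)  # note: this reorders y by reference
w2 <- !is.na(y[x, which = TRUE, mult = "first"])

# double not-join ("anti anti join")
tmp   <- x[!y]
anti2 <- x[!tmp]

# all three should pick the same rows of x
stopifnot(identical(x[w1]$x, x[w2]$x),
          identical(x[w1]$x, anti2$x))
```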