How to perform a DISTINCT in Pig Latin on a subset of columns?
The accepted answer is a great solution, but it might not work if you want to reorder the fields in the output (something I had to do recently). Here's an alternative:
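A = LOAD '$input' AS (f1, f2, f3, f4, f5);
GP = GROUP A BY (f1, f2, f3);
-- OUTPUT is a reserved keyword in Pig, so use a different alias
RESULT = FOREACH GP {
    -- f4 and f5 live inside the bag A after GROUP; keep one arbitrary row per group
    one = LIMIT A 1;
    GENERATE group.f1, group.f2, FLATTEN(one.(f4, f5)), group.f3;
};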
Grouping on a set of fields yields exactly one tuple per unique combination of those fields, so the grouped columns are distinct by construction.
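To make the effect of the grouping concrete, here is a small Python model of it (not Pig, just a sketch for checking the semantics on a sample). The sample rows and the function name `distinct_on_key` are made up for illustration, and keeping the first row seen per key is an assumption of this sketch; Pig does not promise which row of a group you get.

```python
# Hypothetical sample rows: (f1, f2, f3, f4, f5)
rows = [
    ("a", "b", "c", 1, 2),
    ("a", "b", "c", 3, 4),  # same (f1, f2, f3) key as the row above
    ("x", "y", "z", 5, 6),
]

def distinct_on_key(rows):
    """Keep one row per (f1, f2, f3) key and emit the fields as f1, f2, f4, f5, f3."""
    first_seen = {}
    for f1, f2, f3, f4, f5 in rows:
        # setdefault keeps the first (f4, f5) seen for each key (assumption of this sketch)
        first_seen.setdefault((f1, f2, f3), (f4, f5))
    # Reorder the columns on the way out, as in the GENERATE clause
    return [(f1, f2, f4, f5, f3) for (f1, f2, f3), (f4, f5) in first_seen.items()]

print(distinct_on_key(rows))
```

Each key appears once in the result, and the column order is whatever the final projection asks for.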
Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:
A_unique =
    FOREACH (GROUP A BY a4) {
        b = A.(a1, a2, a3);
        s = DISTINCT b;
        GENERATE FLATTEN(s), group AS a4;
    };
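As a sanity check on what this produces, the same steps can be modeled in a few lines of Python (a sketch of the semantics, not Pig). The sample data, the extra column `a5`, and the function name `distinct_within_groups` are assumptions added for illustration; `a5` is there only to show that columns left out of the projection do not affect the result.

```python
from collections import defaultdict

# Hypothetical sample rows: (a1, a2, a3, a4, a5); a5 is an extra column that gets dropped
rows = [
    (1, 2, 3, "k1", "x"),
    (1, 2, 3, "k1", "y"),  # differs from the first row only in a5
    (1, 2, 9, "k1", "x"),
    (1, 2, 3, "k2", "x"),
]

def distinct_within_groups(rows):
    """Group by a4, deduplicate (a1, a2, a3) inside each group, then flatten."""
    groups = defaultdict(set)
    for a1, a2, a3, a4, _a5 in rows:
        # The set plays the role of DISTINCT on the projected bag
        groups[a4].add((a1, a2, a3))
    # Flatten each group's distinct tuples back into rows; sorted for a stable order
    return sorted((a1, a2, a3, a4) for a4, s in groups.items() for (a1, a2, a3) in s)

print(distinct_within_groups(rows))
```

The first two sample rows collapse into one because they agree on a1, a2, a3 and a4, even though the full rows differ.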