finding mean using pig or hadoop
Apache PIG is well adapted for such tasks. See example:
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray,c2:chararray);
grp = group inpt by id;
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate group as id, sum/count as mean, sum as sum, count as count;
};
Pay special attention to the data type of the amnt column as it will influence which implementation of the SUM function PIG is going to invoke.
PIG can also do something that SQL can not, it can put the mean against each input row without using any inner joins. That is useful if you are calculating z-scores using standard deviation.
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate FLATTEN(inpt), sum/count as mean, sum as sum, count as count;
};
FLATTEN(inpt) does the trick, now you have access to the original amount that had contributed to the groups average, sum and count.
UPDATE 1:
Calculating variance and standard deviation:
inpt = load '~/pig_data/pig_fun/input/group.txt' as (amnt:double, id:chararray, c2:chararray);
grp = group inpt by id;
mean = foreach grp {
sum = SUM(inpt.amnt);
count = COUNT(inpt);
generate flatten(inpt), sum/count as avg, count as count;
};
tmp = foreach mean {
dif = (amnt - avg) * (amnt - avg) ;
generate *, dif as dif;
};
grp = group tmp by id;
standard_tmp = foreach grp generate flatten(tmp), SUM(tmp.dif) as sqr_sum;
standard = foreach standard_tmp generate *, sqr_sum / count as variance, SQRT(sqr_sum / count) as standard;
It will use 2 jobs. I have not figured out how to do it in one, hmm, need to spend more time on it.