Using shapiro.test on multiple columns in a data frame
Use do.call with rbind and lapply for a simpler, more compact solution:
df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
do.call(rbind, lapply(df, function(x) shapiro.test(x)[c("statistic", "p.value")]))
#> statistic p.value
#> a 0.986224 0.3875904
#> b 0.9894938 0.6238027
#> c 0.9652532 0.009694794
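One caveat, as far as I can tell: because each element returned by lapply here is itself a list, the rbind result is a matrix of mode list rather than a numeric matrix, which can be awkward to sort or filter. A minimal sketch of one way to get plain numeric columns instead, assuming you only need W and the p-value (the names res and st, and the seed, are just illustrative):

```r
set.seed(1)  # illustrative seed, not from the original answer
df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))

# Return a named numeric vector per column, so sapply() simplifies
# to a numeric matrix (one column per variable); t() flips it so
# each variable becomes a row.
res <- t(sapply(df, function(x) {
  st <- shapiro.test(x)
  c(W = unname(st$statistic), p.value = st$p.value)
}))

# A data frame with numeric columns is easier to sort or filter.
res <- as.data.frame(res)
res[order(res$p.value), ]
```

This is only one option; unlisting the pieces of the list matrix afterwards would work too.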
To apply a function over the rows or columns of a data frame, use the apply family:
df <- data.frame(a=rnorm(100), b=rnorm(100))
df.shapiro <- apply(df, 2, shapiro.test)
df.shapiro
$a
Shapiro-Wilk normality test
data: newX[, i]
W = 0.9895, p-value = 0.6276
$b
Shapiro-Wilk normality test
data: newX[, i]
W = 0.9854, p-value = 0.3371
Note that column names are preserved, and df.shapiro is a named list.
Now, if you want, say, a vector of p-values, all you have to do is extract them from the corresponding list elements:
unlist(lapply(df.shapiro, function(x) x$p.value))
a b
0.6275521 0.3370931
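If you would rather have R check that every extracted value really is a single number, vapply() is a stricter alternative to unlist(lapply(...)). A small sketch, assuming the same kind of named list of htest objects as above (the seed and the name pvals are illustrative):

```r
set.seed(1)  # illustrative seed, not from the original answer
df <- data.frame(a = rnorm(100), b = rnorm(100))

# lapply() over a data frame iterates over its columns and keeps the names.
df.shapiro <- lapply(df, shapiro.test)

# vapply() errors out if any element does not yield exactly one numeric
# value, instead of silently returning something of a different shape.
pvals <- vapply(df.shapiro, function(x) x$p.value, numeric(1))
pvals
```

The result is a named numeric vector, just like the unlist() version.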
Not that I think this is a sensible approach to data analysis, but the underlying issue of applying a function to the columns of a data frame is a general task that can easily be achieved using one of sapply() or lapply() (or even apply(), but for data frames, one of the two earlier-mentioned functions would be best).
Here is an example, using some dummy data:
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
Uniform = runif(50))
Now apply the shapiro.test() function. Because it returns a list-like object (of class "htest") for each column, we capture the output in a list, so we will use lapply().
lshap <- lapply(df, shapiro.test)
lshap[[1]] ## look at the first column results
R> lshap[[1]]
Shapiro-Wilk normality test
data: X[[1L]]
W = 0.9802, p-value = 0.5611
You will need to extract the things you want from these objects, which all have the structure:
R> str(lshap[[1]])
List of 4
$ statistic: Named num 0.98
..- attr(*, "names")= chr "W"
$ p.value : num 0.561
$ method : chr "Shapiro-Wilk normality test"
$ data.name: chr "X[[1L]]"
- attr(*, "class")= chr "htest"
If you want the statistic and p.value components of this object for all elements of lshap, use sapply() this time, which arranges the results nicely for us:
lres <- sapply(lshap, `[`, c("statistic","p.value"))
R> lres
Gaussian Poisson Uniform
statistic 0.9802 0.9371 0.918
p.value 0.5611 0.01034 0.001998
Given that you have 500 of these, I'd transpose lres:
R> t(lres)
statistic p.value
Gaussian 0.9802 0.5611
Poisson 0.9371 0.01034
Uniform 0.918 0.001998
If you plan on doing anything with the p-values from this exercise, I suggest you start thinking about how to correct for multiple comparisons before you shoot yourself in the foot with a 30-cal.
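As a rough sketch of what such a correction could look like, base R's p.adjust() takes a vector of p-values and a method name. Holm and Benjamini-Hochberg are shown here purely as examples; which method is appropriate depends on your analysis, and the object names (pvals, p_holm, p_bh) are my own:

```r
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))

# One raw p-value per column.
pvals <- vapply(df, function(x) shapiro.test(x)$p.value, numeric(1))

# Holm controls the family-wise error rate; "BH" controls the
# false discovery rate. Both are methods built into p.adjust().
p_holm <- p.adjust(pvals, method = "holm")
p_bh   <- p.adjust(pvals, method = "BH")

# Raw and adjusted p-values side by side.
cbind(raw = pvals, holm = p_holm, BH = p_bh)
```

With 500 columns the adjusted values can differ a lot from the raw ones, so decide on the correction before looking at which tests come out "significant".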