How to find significant correlations in a large dataset

You can use the function rcorr from the package Hmisc.

Using the same demo data from Richie:

m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

Then:

library(Hmisc)
correlations <- rcorr(as.matrix(the_data))

To access the p-values:

correlations$P

To visualize you can use the package corrgram

library(corrgram)
corrgram(the_data)

Which will produce: enter image description here


In order to print a list of the significant correlations (p < 0.05), you can use the following.

  1. Using the same demo data from @Richie:

    m <- 40
    n <- 80
    the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
    colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
    
  2. Install Hmisc

    install.packages("Hmisc")
    
  3. Import library and find the correlations (@Carlos)

    library(Hmisc)
    correlations <- rcorr(as.matrix(the_data))
    
  4. Loop over the values printing the significant correlations

    for (i in 1:m){
      for (j in 1:m){
        if ( !is.na(correlations$P[i,j])){
          if ( correlations$P[i,j] < 0.05 ) {
            print(paste(rownames(correlations$P)[i], "-" , colnames(correlations$P)[j], ": ", correlations$P[i,j]))
          }
        }
      }
    }
    

Warning

You should not use this for drawing any serious conclusion; only useful for some exploratory analysis and formulate hypothesis. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that do do some adjustments to compensate for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.


Here's some sample data for reproducibility.

m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.

correlations <- vapply(
  the_data[, -1],
  function(x)
  {
    cor(the_data[, 1], x)
  },
  numeric(1)
)

You can then find the column with the largest magnitude of correlation with y using:

correlations[which.max(abs(correlations))]

So knowing which variables are correlated which which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.

Tags:

R

Correlation