How to apply linear regression to every pixel in a large multi-dimensional array containing NaNs?
This blog post mentioned in the comments above contains an incredibly fast vectorized function for cross-correlation, covariance, and regression for multi-dimensional data in Python. It produces all of the regression outputs I need, and does so in milliseconds as it relies entirely on simple vectorised array operations in xarray
.
https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html
I have made one minor change (first line after #3
) to ensure the function correctly accounts for different numbers of NaN values in each pixel:
def lag_linregress_3D(x, y, lagx=0, lagy=0):
"""
Input: Two xr.Datarrays of any dimensions with the first dim being time.
Thus the input data could be a 1D time series, or for example, have three
dimensions (time,lat,lon).
Datasets can be provided in any order, but note that the regression slope
and intercept will be calculated for y with respect to x.
Output: Covariance, correlation, regression slope and intercept, p-value,
and standard error on regression between the two datasets along their
aligned time dimension.
Lag values can be assigned to either of the data, with lagx shifting x, and
lagy shifting y, with the specified lag amount.
"""
#1. Ensure that the data are properly alinged to each other.
x,y = xr.align(x,y)
#2. Add lag information if any, and shift the data accordingly
if lagx!=0:
# If x lags y by 1, x must be shifted 1 step backwards.
# But as the 'zero-th' value is nonexistant, xr assigns it as invalid
# (nan). Hence it needs to be dropped
x = x.shift(time = -lagx).dropna(dim='time')
# Next important step is to re-align the two datasets so that y adjusts
# to the changed coordinates of x
x,y = xr.align(x,y)
if lagy!=0:
y = y.shift(time = -lagy).dropna(dim='time')
x,y = xr.align(x,y)
#3. Compute data length, mean and standard deviation along time axis:
n = y.notnull().sum(dim='time')
xmean = x.mean(axis=0)
ymean = y.mean(axis=0)
xstd = x.std(axis=0)
ystd = y.std(axis=0)
#4. Compute covariance along time axis
cov = np.sum((x - xmean)*(y - ymean), axis=0)/(n)
#5. Compute correlation along time axis
cor = cov/(xstd*ystd)
#6. Compute regression slope and intercept:
slope = cov/(xstd**2)
intercept = ymean - xmean*slope
#7. Compute P-value and standard error
#Compute t-statistics
tstats = cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
stderr = slope/tstats
from scipy.stats import t
pval = t.sf(tstats, n-2)*2
pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)
return cov,cor,slope,intercept,pval,stderr
I'm not sure how this would scale up (perhaps you could use dask), but here is a pretty straightforward way to do this with a pandas DataFrame
using the apply method:
import pandas as pd
import numpy as np
from scipy.stats import linregress
# Independent variable: four time-steps of 1-dimensional data
x_array = np.array([0.5, 0.2, 0.4, 0.4])
# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2, -0.2, -0.3],
[-0.3, -0.2, -0.3],
[-0.3, -0.4, -0.4]],
[[-0.2, -0.2, -0.4],
[-0.3, np.nan, -0.3],
[-0.3, -0.3, -0.4]],
[[np.nan, np.nan, -0.3],
[-0.2, -0.3, -0.7],
[-0.3, -0.3, -0.3]],
[[-0.1, -0.3, np.nan],
[-0.2, -0.3, np.nan],
[-0.1, np.nan, np.nan]]])
def lin_regress(col):
"Mask nulls and apply stats.linregress"
col = col.loc[~pd.isnull(col)]
return linregress(col.index.tolist(), col)
# Build the DataFrame (each index represents a pixel)
df = pd.DataFrame(y_array.reshape(len(y_array), -1), index=x_array.tolist())
# Apply a our custom linregress wrapper to each function, split the tuple into separate columns
final_df = df.apply(lin_regress).apply(pd.Series)
# Name the index and columns to make this easier to read
final_df.columns, final_df.index.name = 'slope, intercept, r_value, p_value, std_err'.split(', '), 'pixel_number'
print(final_df)
Output:
slope intercept r_value p_value std_err
pixel_number
0 0.071429 -0.192857 0.188982 8.789623e-01 0.371154
1 -0.071429 -0.207143 -0.188982 8.789623e-01 0.371154
2 0.357143 -0.464286 0.944911 2.122956e-01 0.123718
3 0.105263 -0.289474 0.229416 7.705843e-01 0.315789
4 1.000000 -0.700000 1.000000 9.003163e-11 0.000000
5 -0.285714 -0.328571 -0.188982 8.789623e-01 1.484615
6 0.105263 -0.289474 0.132453 8.675468e-01 0.557000
7 -0.285714 -0.228571 -0.755929 4.543711e-01 0.247436
8 0.071429 -0.392857 0.188982 8.789623e-01 0.371154
The answer provided here https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html is absolutely good in that it mostly utilises the great power of numpy
vectorization and broadcasting but it assumes the data to be analysed are complete, which is not usually the case in real research cycle. One answer above intended to address the missing data problem but I personally think more codes needs to be updated simply because np.mean()
will return nan if there is nan in the data. Fortunately, numpy
has provided nanmean()
, nanstd()
, and so forth for us to use to calculate mean, standard error, and so forth by ignoring nans in the data. Meanwhile, the program in the original blog targets data formatted netCDF. Some might not know this but be more familiar with the raw numpy.array
format. Therefore, I provide here a code example showing how to calculate co-variance, correlation coefficients, and so forth between two 3-D dimensional arrays (n-D dimensional is of the same logic). Note that I let x_array
to be the indexes of the first dimension of y_array
for convenience but x_array
can surely be read from outside in real analysis.
Code
def linregress_3D(y_array):
# y_array is a 3-D array formatted like (time,lon,lat)
# The purpose of this function is to do linear regression using time series of data over each (lon,lat) grid box with consideration of ignoring np.nan
# Construct x_array indicating time indexes of y_array, namely the independent variable.
x_array=np.empty(y_array.shape)
for i in range(y_array.shape[0]): x_array[i,:,:]=i+1 # This would be fine if time series is not too long. Or we can use i+yr (e.g. 2019).
x_array[np.isnan(y_array)]=np.nan
# Compute the number of non-nan over each (lon,lat) grid box.
n=np.sum(~np.isnan(x_array),axis=0)
# Compute mean and standard deviation of time series of x_array and y_array over each (lon,lat) grid box.
x_mean=np.nanmean(x_array,axis=0)
y_mean=np.nanmean(y_array,axis=0)
x_std=np.nanstd(x_array,axis=0)
y_std=np.nanstd(y_array,axis=0)
# Compute co-variance between time series of x_array and y_array over each (lon,lat) grid box.
cov=np.nansum((x_array-x_mean)*(y_array-y_mean),axis=0)/n
# Compute correlation coefficients between time series of x_array and y_array over each (lon,lat) grid box.
cor=cov/(x_std*y_std)
# Compute slope between time series of x_array and y_array over each (lon,lat) grid box.
slope=cov/(x_std**2)
# Compute intercept between time series of x_array and y_array over each (lon,lat) grid box.
intercept=y_mean-x_mean*slope
# Compute tstats, stderr, and p_val between time series of x_array and y_array over each (lon,lat) grid box.
tstats=cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
stderr=slope/tstats
from scipy.stats import t
p_val=t.sf(tstats,n-2)*2
# Compute r_square and rmse between time series of x_array and y_array over each (lon,lat) grid box.
# r_square also equals to cor**2 in 1-variable lineare regression analysis, which can be used for checking.
r_square=np.nansum((slope*x_array+intercept-y_mean)**2,axis=0)/np.nansum((y_array-y_mean)**2,axis=0)
rmse=np.sqrt(np.nansum((y_array-slope*x_array-intercept)**2,axis=0)/n)
# Do further filteration if needed (e.g. We stipulate at least 3 data records are needed to do regression analysis) and return values
n=n*1.0 # convert n from integer to float to enable later use of np.nan
n[n<3]=np.nan
slope[np.isnan(n)]=np.nan
intercept[np.isnan(n)]=np.nan
p_val[np.isnan(n)]=np.nan
r_square[np.isnan(n)]=np.nan
rmse[np.isnan(n)]=np.nan
return n,slope,intercept,p_val,r_square,rmse
Sample output
I have used this program to test two 3-D arrays with 227x3601x6301 pixels and it completed the work within 20 minutes, each less than 10 minutes.