PCA with several time series as features of one instance with sklearn
Flatten the 2D features of each instance into a 1D feature vector and then use this new feature set to perform PCA.
Assuming X holds the entire 1000 instances:
from sklearn.decomposition import PCA
X = X.reshape(1000, -1)  # flatten each instance's 2D features into a single 1D vector
pca = PCA(n_components=250)
pca.fit(X)
You could further improve the performance by passing each instance through an LSTM to get a vector that summarizes the entire data frame in a lower-dimensional representation, and then perform PCA on those summary vectors (a sketch follows below).
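This LSTM step is not spelled out above, so here is a minimal, hedged sketch of the idea. It assumes TensorFlow/Keras is available and uses an untrained encoder purely for illustration; in practice you would train it, for example as part of an autoencoder or on a downstream task.
import numpy as np
from sklearn.decomposition import PCA
from tensorflow.keras import layers, models

# toy data: 1000 instances, each a (300, 20) data frame
X = np.random.randn(1000, 300, 20).astype("float32")

# LSTM encoder that maps each (300, 20) instance to a 64-dimensional summary vector
encoder = models.Sequential([
    layers.Input(shape=(300, 20)),
    layers.LSTM(64),
])

summaries = encoder(X).numpy()           # shape (1000, 64)

# PCA on the summary vectors instead of the raw flattened series
pca = PCA(n_components=32)
reduced = pca.fit_transform(summaries)   # shape (1000, 32)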
First of all, I would recommend reading this link to get a better understanding of PCA analysis and data series.
Take into account that if you have 1000 pandas instances, your data needs to be converted into a numpy array so it can be processed more easily. You'd have something like the following:
import numpy as np
import pandas as pd

# your 1000 pandas instances
instances = [pd.DataFrame(data=np.random.normal(0, 1, (300, 20))) for _ in range(1000)]
# transformation to be able to process the data more easily as a numpy array
data = np.array([d.values for d in instances])
That said, let's tackle two different solutions.
Simple Solution
The easiest solution is to ignore that you have a time series and just concatenate the information to perform the PCA analysis with all of the data.
import numpy as np
from sklearn.decomposition import PCA
data = np.random.randn(1000, 300, 20) # n_instances, n_steps, n_features
# combine the features and the steps, then
# you perform PCA for your 1000 instances
preprocessed = data.reshape((1000, 20*300))
pca = PCA(n_components=100)
pca.fit(preprocessed)
# test it in one sample
sample = pca.transform(preprocessed[0].reshape(1,-1))
Variation with Fourier Transform
Another solution could be to use a Fourier transform to try to extract more information from your time series.
import numpy as np
from sklearn.decomposition import PCA
data = np.random.randn(1000, 300, 20) # n_instances, n_steps, n_features
# perform a fast fourier transform and keep the magnitude, so the
# input to PCA stays real-valued (sklearn's PCA does not accept the
# complex FFT output; stacking real and imaginary parts also works)
preprocessed_1 = np.abs(np.fft.fft(data, axis=1))
# combine the features and the steps, then
# you perform PCA for your 1000 instances
preprocessed_2 = preprocessed_1.reshape((1000, 20*300))
pca = PCA(n_components=100)
pca.fit(preprocessed_2)
# test it in one sample
pca.transform(preprocessed_2[0].reshape(1,-1))
Note: be careful, in both cases I'm assuming that every time series has the same length.
The dataset contains time-series-based features. Appending all the series of one instance into a single series destroys the underlying properties of the time series.
To preserve the time series property after dimensionality reduction, you need to generate new time series features from the existing features.
import numpy as np
from sklearn.decomposition import PCA

data = np.random.randn(1000, 300, 20)    # instances x timestamps x features
pre_data = data.reshape((1000*300, 20))  # samples x features
pca = PCA(n_components=5)                # number of features in the transformed data
pca.fit(pre_data)
instance_new = pca.transform(data[0])
Here five transformed features are generated from the original features at each timestamp, so the new features keep the same timestamps as the original ones (a quick shape check is shown below).
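As a quick sanity check of the resulting layout, using the variables from the snippet above:
print(instance_new.shape)   # (300, 5): 300 timestamps, 5 transformed features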
I do not find the other answers satisfactory, mainly because you should account for both the time series structure of the data and the cross-sectional information. You can't simply treat the features of each instance as a single series. Doing so would inevitably lead to a loss of information and is, simply speaking, statistically wrong.
That said, if you really need to go for PCA, you should at least preserve the time series information:
PCA
Following silgon, we transform the data into a numpy array:
# your 1000 pandas instances
instances = [pd.DataFrame(data=np.random.normal(0, 1, (300, 20))) for _ in range(1000)]
# transformation to be able to process the data more easily as a numpy array
data = np.array([d.values for d in instances])
This makes applying PCA way easier:
reshaped_data = data.reshape((1000*300, 20))  # create one big data panel with 20 series and 300,000 datapoints
n_comp = 10  # choose the number of features to have after dimensionality reduction
pca = PCA(n_components=n_comp)  # create the pca object
pca.fit(reshaped_data)  # fit it to your reshaped data
transformed_data = np.empty([1000, 300, n_comp])
for i in range(len(data)):
    transformed_data[i] = pca.transform(data[i])  # iteratively apply the transformation to each instance of the original dataset
Final output shape: transformed_data.shape: Out[]: (1000, 300, n_comp).
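Not part of the original snippet, but an equivalent vectorized way to apply the fitted transformation (using the same variables as above) avoids the Python loop:
transformed_data = pca.transform(data.reshape((1000*300, 20))).reshape((1000, 300, n_comp))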
PLS
However, you can (and should, in my opinion) construct the factors from your matrix of features using partial least squares (PLS). This will also grant a further dimensionality reduction.
Let's say your data has the following shape: T=1000, N=300, P=20. Then we have y=[T,1], X=[N,P,T].
Now, it's pretty easy to understand that for this to work our matrices need to be conformable for multiplication. In our case we will have: y=[T,1]=[1000,1], Xpca=[T,P*N]=[1000,20*300].
Intuitively, what we are doing is creating a new feature for each lag (N-1 = 299 of them) of each of the P=20 basic features.
I.e. for a given instance i, we will have something like this:
Instance_i: x_{1,i}, x_{1,i-1}, ..., x_{1,i-j}, x_{2,i}, x_{2,i-1}, ..., x_{2,i-j}, ..., x_{P,i}, x_{P,i-1}, ..., x_{P,i-j}, with j = 1, ..., N-1
Now, implementing PLS in Python is pretty straightforward.
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression

# your 1000 pandas instances
instances = [pd.DataFrame(data=np.random.normal(0, 1, (300, 20))) for _ in range(1000)]
# transformation to be able to process the data more easily as a numpy array
data = np.array([d.values for d in instances])
# reshape your data:
reshaped_data = data.reshape((1000, 20*300))
# placeholder target with one observation per instance; replace it with your own y
y = np.random.normal(0, 1, (1000, 1))
n_comp = 10
pls_obj = PLSRegression(n_components=n_comp)
factorsPLS = pls_obj.fit_transform(reshaped_data, y)[0]
factorsPLS.shape
Out[]: (1000, n_comp)
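As a usage note (not in the original answer), the fitted pls_obj can also produce fitted values of the target on the same reshaped layout:
y_hat = pls_obj.predict(reshaped_data)   # shape (1000, 1)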
What is PLS doing?
To make things easier to grasp, we can look at the three-pass regression filter (3PRF) (working paper here). Kelly and Pruitt show that PLS is just a special case of their 3PRF:
(The 3PRF runs three regressions: first, a time-series regression of each feature on a set of proxies Z; second, for each period, a cross-sectional regression of the features on the first-pass loadings to estimate the factors; third, a predictive regression of y on the estimated factors.)
Here Z represents a matrix of proxies. We don't have those, but luckily Kelly and Pruitt have shown that we can live without them. All we need to do is make sure the regressors (our features) are standardized and run the first two regressions without an intercept. Doing so, the proxies are selected automatically. A sketch of this standardization step follows below.
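This standardization is not shown in the code above; a minimal sketch, reusing reshaped_data and y from the snippet above, could look like the following. Note that sklearn's PLSRegression already centers and scales its inputs by default (scale=True), so the explicit step mainly makes the assumption visible.
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression

# standardize every column (each feature/lag combination) to zero mean and unit variance
scaler = StandardScaler()
standardized_data = scaler.fit_transform(reshaped_data)

pls_obj = PLSRegression(n_components=10)                     # scale=True by default
factorsPLS = pls_obj.fit_transform(standardized_data, y)[0]  # shape (1000, 10)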
So, in short, PLS allows you to:
- achieve further dimensionality reduction than PCA;
- account for both the cross-sectional variability among the features and the time series information of each series when creating the factors.