How does Pytorch Dataloader handle variable size data?
This is the way I do it:
import torch

def collate_fn_padd(batch):
    '''
    Pads a batch of variable-length sequences.
    note: it converts things to tensors manually here since the ToTensor transform
    assumes it takes in images rather than arbitrary tensors.
    '''
    ## get sequence lengths (assumes `device` is defined elsewhere, e.g. torch.device('cuda'))
    lengths = torch.tensor([t.shape[0] for t in batch]).to(device)
    ## pad: pad_sequence stacks the sequences and 0-pads them to the longest one
    batch = [torch.Tensor(t).to(device) for t in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch)
    ## compute mask (True wherever the padded batch is non-zero)
    mask = (batch != 0).to(device)
    return batch, lengths, mask
Then I pass that function to the DataLoader class as its collate_fn.
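For completeness, here is a minimal sketch of how that hook-up looks (my_dataset and the batch size are placeholders, not part of the original code):

from torch.utils.data import DataLoader

# `my_dataset` is a placeholder for any Dataset whose samples are
# variable-length sequences.
loader = DataLoader(my_dataset, batch_size=32, shuffle=True,
                    collate_fn=collate_fn_padd)

for padded_batch, lengths, mask in loader:
    # padded_batch: [max_seq_len, batch_size, ...] (pad_sequence default layout)
    # lengths:      [batch_size], original sequence lengths
    # mask:         same shape as padded_batch, True at non-zero entries
    ...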
There seem to be many posts about this on the PyTorch forum; I link to them all below. Each has its own answers and discussion. It does not look to me like there is one "standard way to do it", but if there is an authoritative reference, please share it.
It would be nice if the ideal answer also mentioned:
- efficiency, e.g. whether to do the processing on the GPU with torch inside the collate function or with numpy,
and things of that sort.
List:
- https://discuss.pytorch.org/t/how-to-create-batches-of-a-list-of-varying-dimension-tensors/50773
- https://discuss.pytorch.org/t/how-to-create-a-dataloader-with-variable-size-input/8278
- https://discuss.pytorch.org/t/using-variable-sized-input-is-padding-required/18131
- https://discuss.pytorch.org/t/dataloader-for-various-length-of-data/6418
- https://discuss.pytorch.org/t/how-to-do-padding-based-on-lengths/24442
- bucketing: https://discuss.pytorch.org/t/tensorflow-esque-bucket-by-sequence-length/41284 (see the sketch after this list)
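For the bucketing idea from the last link, here is a minimal sketch under the assumption that per-sample lengths are known up front (all names are placeholders): sort the indices by length, slice them into batches, and hand those to the DataLoader as a batch_sampler.

import random
from torch.utils.data import DataLoader

def length_bucketed_batches(lengths, batch_size):
    # Sort sample indices by sequence length so that each batch groups
    # similarly sized sequences and padding is minimised.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(batches)  # shuffle the order of batches, not their contents
    return batches

# lengths = [sample.shape[0] for sample in my_dataset]   # precomputed per sample
# loader = DataLoader(my_dataset,
#                     batch_sampler=length_bucketed_batches(lengths, 32),
#                     collate_fn=collate_fn_padd)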
So how do you handle the fact that your samples are of different lengths? torch.utils.data.DataLoader has a collate_fn parameter which is used to transform a list of samples into a batch. By default it simply stacks the samples into batched tensors, which only works when they all have the same shape. You can write your own collate_fn, which for instance 0-pads the input, truncates it to some predefined length, or applies any other operation of your choice.
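As one concrete illustration of such a custom collate_fn (a sketch, not an official recipe; the names and the batch_first layout are my own choices), this 0-pads the sequences and keeps the lengths so the batch can later be packed for an RNN:

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

def pad_collate(batch):
    # `batch` is assumed to be a list of tensors whose first dimension varies
    lengths = torch.tensor([t.shape[0] for t in batch])
    padded = pad_sequence(batch, batch_first=True)  # [batch, max_len, ...], 0-padded
    return padded, lengths

# If the batch feeds an RNN, the padded positions can then be skipped with:
# packed = pack_padded_sequence(padded, lengths, batch_first=True,
#                               enforce_sorted=False)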