Why Pytorch officially use mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225] to normalize images?
Using the mean and std of Imagenet is a common practice. They are calculated based on millions of images. If you want to train from scratch on your own dataset, you can calculate the new mean and std. Otherwise, using the Imagenet pretrianed model with its own mean and std is recommended.
In that example, they are using the mean and stddev of ImageNet, but if you look at their MNIST examples, the mean and stddev are 1-dimensional (since the inputs are greyscale-- no RGB channels).
Whether or not to use ImageNet's mean and stddev depends on your data. Assuming your data are ordinary photos of "natural scenes"† (people, buildings, animals, varied lighting/angles/backgrounds, etc.), and assuming your dataset is biased in the same way ImageNet is (in terms of class balance), then it's ok to normalize with ImageNet's scene statistics. If the photos are "special" somehow (color filtered, contrast adjusted, uncommon lighting, etc.) or an "un-natural subject" (medical images, satellite imagery, hand drawings, etc.) then I would recommend correctly normalizing your dataset before model training!*
Here's some sample code to get you started:
import os
import torch
from torchvision import datasets, transforms
from torch.utils.data.dataset import Dataset
from tqdm.notebook import tqdm
from time import time
N_CHANNELS = 1
dataset = datasets.MNIST("data", download=True,
train=True, transform=transforms.ToTensor())
full_loader = torch.utils.data.DataLoader(dataset, shuffle=False, num_workers=os.cpu_count())
before = time()
mean = torch.zeros(1)
std = torch.zeros(1)
print('==> Computing mean and std..')
for inputs, _labels in tqdm(full_loader):
for i in range(N_CHANNELS):
mean[i] += inputs[:,i,:,:].mean()
std[i] += inputs[:,i,:,:].std()
mean.div_(len(dataset))
std.div_(len(dataset))
print(mean, std)
print("time elapsed: ", time()-before)
† In computer vision, "Natural scene" has a specific meaning which isn't related to nature vs man-made, see https://en.wikipedia.org/wiki/Natural_scene_perception
* Otherwise you run into optimization problems due to elongations in the loss function-- see my answer here.