loading EMNIST-letters dataset
Because of the way the dataset is structured, the array of image arrays can be accessed with mat['dataset'][0][0][0][0][0][0]
and the array of label arrays with mat['dataset'][0][0][0][0][0][1]
. For instance, print(mat['dataset'][0][0][0][0][0][0][0])
will print out the pixel values of the first image, and print(mat['dataset'][0][0][0][0][0][1][0])
will print the first image's label.
For a less...convoluted dataset, I'd actually recommend using the CSV version of the EMNIST dataset on Kaggle: https://www.kaggle.com/crawford/emnist, where each row is a separate image, there are 785 columns where the first column = class_label and each column after represents one pixel value (784 total for a 28 x 28 image).
@Josh Payne's answer is correct, but I'll expand on it for those who want to use the .mat file with an emphasis on typical data splits.
The data itself has already been split up in to a training and test set. Here's how I accessed the data:
from scipy import io as sio
mat = sio.loadmat('emnist-letters.mat')
data = mat['dataset']
X_train = data['train'][0,0]['images'][0,0]
y_train = data['train'][0,0]['labels'][0,0]
X_test = data['test'][0,0]['images'][0,0]
y_test = data['test'][0,0]['labels'][0,0]
There is an additional field 'writers' (e.g. data['train'][0,0]['writers'][0,0]
) that distinguishes the original sample writer. Finally, there is another field data['mapping']
, but I'm not sure what it is mapping the digits to.
In addition, in Secion II D, the EMNIST paper states that "the last portion of the training set, equal in size to the testing set, is set aside as a validation set". Strangely, the .mat file training/testing size does not match the number listed in Table II, but it does match the size in Fig. 2.
val_start = X_train.shape[0] - X_test.shape[0]
X_val = X_train[val_start:X_train.shape[0],:]
y_val = y_train[val_start:X_train.shape[0]]
X_train = X_train[0:val_start,:]
y_train = y_train[0:val_start]
If you don't want a validation set it is fine to leave these samples in the training set.
Also, if you would like to reshape the data into 2D, 28x28 sized images instead of a 1D 784 array, to get the correct image orientation you'll need to do a numpy reshape using Fortran ordering (Matlab uses column-major ordering, just like Fortran. reference). e.g. -
X_train = X_train.reshape( (X_train.shape[0], 28, 28), order='F')
An alternative solution is to use the EMNIST python package. (Full details at https://pypi.org/project/emnist/)
This lets you pip install emnist
in your environment then import the datasets (they will download when you run the program for the first time).
Example from the site:
>>> from emnist import extract_training_samples
>>> images, labels = extract_training_samples('digits')
>>> images.shape
(240000, 28, 28)
>>> labels.shape
(240000,)
You can also list the datasets
>>> from emnist import list_datasets
>>> list_datasets()
['balanced', 'byclass', 'bymerge', 'digits', 'letters', 'mnist']
And replace 'digits' in the first example with your choice.
This gives you all the data in numpy arrays which I have found makes things easy to work with.