How to read MNIST database in R?

The MNIST dataset is also available in the keras package.

library(keras)
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
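The image arrays returned above are 3-D (images x rows x columns) with pixel values from 0 to 255. Many models expect one row per image instead, so here is a minimal base-R sketch of flattening and rescaling. The helper name flatten_images is made up for illustration and is not part of keras.

```r
# Hypothetical helper (not part of keras): turn an n x rows x cols array
# into an n x (rows * cols) matrix with values scaled to [0, 1].
flatten_images <- function(x) {
  stopifnot(length(dim(x)) == 3)
  n <- dim(x)[1]
  # Re-dimensioning keeps R's column-major order, so row i of the result
  # holds image i with its pixels listed column by column.
  dim(x) <- c(n, length(x) / n)
  x / 255
}

# Usage, assuming x_train was loaded as shown above:
# x_train_flat <- flatten_images(x_train)
# dim(x_train_flat)
```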

Following up on the darch (not Darch) package mentioned above:

The package is called darch. It has been moved to MRAN (Microsoft R Application Network) but is available on CRAN as well.

It provides two functions for the MNIST data:

readMNIST, which reads the ubyte files stored on your hard drive and saves them as the archives test.Rdata and train.Rdata.

provideMNIST, which downloads the files and calls readMNIST on them.

When calling these functions, give the directory with its parts separated by single forward slashes, e.g. readMNIST("../MNIST/") (the trailing slash is required).

If you download the files yourself, you will need to rename them: the gz archives contain files with a dot in the name, like t10k-labels.idx1-ubyte, but readMNIST looks for names with a dash instead, like t10k-labels-idx1-ubyte. So you have to change the dot to a dash (this applies to darch version 0.12.0; maybe they'll fix this).
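The renaming can be scripted rather than done by hand. This is a minimal sketch using only base R; the helper name fix_mnist_names and the example directory are assumptions for illustration, and it only touches files whose names end in .idx1-ubyte or .idx3-ubyte.

```r
# Hypothetical helper: rename e.g. "t10k-labels.idx1-ubyte" to
# "t10k-labels-idx1-ubyte" for every matching file in `dir`.
fix_mnist_names <- function(dir) {
  old <- list.files(dir, pattern = "\\.idx[13]-ubyte$", full.names = TRUE)
  # Replace the dot in front of "idx" with a dash, in the file name only.
  new <- file.path(dirname(old), sub("\\.idx", "-idx", basename(old)))
  ok <- file.rename(old, new)
  # Return (invisibly) the new paths of the files that were renamed.
  invisible(new[ok])
}

# Usage (the directory is an example path):
# fix_mnist_names("../MNIST")
```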

To load the files in R, use the load function (e.g. load("..\\MNIST\\test.Rdata")). This will create the matrices trainData and testData in the environment.
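If you would rather see exactly which objects an archive contains before they land in your workspace, you can load it into a separate environment. A minimal sketch with base R only; the helper name load_as_list is made up for illustration.

```r
# Hypothetical helper: load an .Rdata archive into a fresh environment
# and return its contents as a named list.
load_as_list <- function(path) {
  env <- new.env()
  # load() returns the names of the objects it created.
  loaded <- load(path, envir = env)
  mget(loaded, envir = env)
}

# Usage (the path is an example):
# test <- load_as_list("../MNIST/test.Rdata")
# names(test)
# dim(test$testData)
```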

For some reason I did not get any dimnames for the matrices.


endian="big", not "high":

> to.read = file("~/Downloads/t10k-images-idx3-ubyte", "rb")

magic number:

> readBin(to.read, integer(), n=1, endian="big")
[1] 2051

number of images:

> readBin(to.read, integer(), n=1, endian="big")
[1] 10000

number of rows:

> readBin(to.read, integer(), n=1, endian="big")
[1] 28

number of columns:

> readBin(to.read, integer(), n=1, endian="big")
[1] 28

here comes the data:

> readBin(to.read, integer(), n=1, endian="big")
[1] 0
> readBin(to.read, integer(), n=1, endian="big")
[1] 0

as per the training set image data description on the web site.

Now you just need to loop and read 28*28 byte chunks into matrices.

Start again:

> to.read = file("~/Downloads/t10k-images-idx3-ubyte", "rb")

skip header:

> readBin(to.read, integer(), n=4, endian="big")
[1]  2051 10000    28    28

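The 28, 28 should really be taken from the header that was just read, but they are hard-coded here (signed=FALSE keeps pixel values above 127 from being read as negative numbers):

> m = matrix(readBin(to.read, integer(), size=1, n=28*28, endian="big", signed=FALSE), 28, 28)
> image(m)

Might need to transpose or flip the matrix; I think it's an upside-down "7".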

par(mfrow=c(5,5))
par(mar=c(0,0,0,0))
for(i in 1:25){
  # signed=FALSE: pixels are unsigned bytes (0-255)
  m = matrix(readBin(to.read, integer(), size=1, n=28*28, endian="big", signed=FALSE), 28, 28)
  image(m[,28:1])
}

This plots the next 25 digits in a 5-by-5 grid.
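Putting the steps together, here is a minimal sketch of a function that takes the image count and dimensions from the header instead of hard-coding 28. The function name read_idx_images is made up for illustration; it assumes an uncompressed image file with the magic number 2051 seen above, and uses only base R.

```r
# Hypothetical helper: read a whole IDX image file into an
# images x rows x cols integer array.
read_idx_images <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  # Header: magic number, image count, rows, columns (4-byte big-endian).
  header <- readBin(con, integer(), n = 4, size = 4, endian = "big")
  stopifnot(length(header) == 4, header[1] == 2051)
  n <- header[2]
  n_row <- header[3]
  n_col <- header[4]
  # Pixels are unsigned bytes, hence size = 1 and signed = FALSE.
  pixels <- readBin(con, integer(), n = n * n_row * n_col,
                    size = 1, signed = FALSE)
  stopifnot(length(pixels) == n * n_row * n_col)
  # The file stores image after image, row by row, while R fills arrays
  # first-index-first: fill as [col, row, image], then reorder the axes.
  aperm(array(pixels, dim = c(n_col, n_row, n)), c(3, 2, 1))
}

# Usage (example path); x[i, , ] is image i as a rows x cols matrix:
# x <- read_idx_images("~/Downloads/t10k-images-idx3-ubyte")
# img <- x[1, , ]
# image(t(img)[, nrow(img):1])
```

Reading everything in one readBin call avoids the explicit loop; the transpose-and-flip in the usage lines is the same orientation fix as image(m[,28:1]) above.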

Oh, and Google leads me to http://www.inside-r.org/packages/cran/darch/docs/readMNIST, which might be useful.


Here's how you can do it using the darch package:

Run readMNIST('C:/Users/pj_/Dir/')

This will store test.RData and train.RData in the directory you passed. When you load these two files into your workspace, you will see 'testData', 'testLabels', 'trainData' and 'trainLabels' in your Global Environment.