How to create a dataset similar to CIFAR-10

First we need to understand the format the CIFAR-10 data set is in. If we refer to https://www.cs.toronto.edu/~kriz/cifar.html, and specifically the Binary Version section, we see:

the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
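A minimal sketch of that layout, using a synthetic record so the round trip can be checked end to end (the filename `demo.bin` is just an example):

```python
import numpy as np

# Each record: 1 label byte + 1024 bytes per channel (R, G, B).
RECORD_BYTES = 1 + 3 * 32 * 32  # 3073

# Build one synthetic record: label 7 followed by fake pixel planes.
label_byte = np.array([7], dtype=np.uint8)
planes = (np.arange(3 * 32 * 32) % 256).astype(np.uint8)
np.concatenate([label_byte, planes]).tofile('demo.bin')

# Read it back the way a CIFAR-10 reader would.
record = np.fromfile('demo.bin', dtype=np.uint8, count=RECORD_BYTES)
label = int(record[0])                  # number in the range 0-9
image = record[1:].reshape(3, 32, 32)   # channel-major: R, G, B planes
image = image.transpose(1, 2, 0)        # to HxWxC, ready to display
```

Note the channel-major order: the red plane comes first, not interleaved RGB pixels as in a typical image file.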

Intuitively, we need to store the data in this format. As a baseline experiment, first get images that are exactly the same size and number of classes as CIFAR-10 and put them in this format. This means your images should have a size of 32x32x3 and fall into 10 classes. If you can successfully run this, you can go further and handle cases like single channels, different input sizes, and different numbers of classes. Doing so means changing many variables in other parts of the code, so you have to slowly work your way through.
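A quick sketch of forcing an arbitrary picture into CIFAR-10's expected 32x32x3 shape with PIL; here a synthetic 100x80 image stands in for one of your own files:

```python
from PIL import Image
import numpy as np

# Placeholder for one of your own images; real use: Image.open('my_image.jpg')
im = Image.new('RGB', (100, 80), color=(255, 0, 0))

im = im.convert('RGB')    # drops alpha / promotes grayscale to 3 channels
im = im.resize((32, 32))  # CIFAR-10 spatial size

arr = np.array(im)        # arr.shape is now (32, 32, 3)
```

The `convert('RGB')` call is what guards against grayscale or RGBA inputs sneaking into your dataset with the wrong number of channels.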

I'm in the midst of working out a general module. My code for this is at https://github.com/jkschin/svhn. If you refer to the svhn_flags.py code, you will see many flags there that can be changed to accommodate your needs. I admit it's cryptic now, as I haven't cleaned it up to be readable, but it works. If you are willing to spend some time taking a rough look, you will figure something out.

This is probably the easiest way to run your own data set through the CIFAR-10 code. You could of course just copy the neural network definition and implement your own reader, input format, batching, etc., but if you want it up and running fast, just shape your inputs to fit CIFAR-10.

EDIT:

Some really basic code that I hope will help.

from PIL import Image
import numpy as np

im = Image.open('images.jpeg')  # assumed to already be a 32x32 RGB image
im = np.array(im)

r = im[:, :, 0].flatten()  # red channel plane, row-major
g = im[:, :, 1].flatten()  # green channel plane
b = im[:, :, 2].flatten()  # blue channel plane
label = [1]                # class label in the range 0-9

# One record: the label byte followed by the R, G and B planes.
out = np.array(list(label) + list(r) + list(g) + list(b), np.uint8)
out.tofile("out.bin")

This would convert an image into a byte file that is ready for use with the CIFAR-10 code. For multiple images, just keep concatenating the arrays, as stated in the format above. To check whether your format is correct, the file size should come out to N x (1 + 3072) = N x 3073 bytes, where N is the number of images, assuming your pictures are RGB and values range from 0-255. Once you verify that, you're all set to run in TensorFlow. Do use TensorBoard to visualize an image or two, just to guarantee correctness.
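The concatenation step above can be sketched as a small helper; `append_record` is a hypothetical name, and the demo uses synthetic images where real use would loop over your files on disk:

```python
import os
import numpy as np
from PIL import Image

def append_record(path, image, label):
    """Append one (label, R plane, G plane, B plane) record to a bin file.
    `image` is a PIL image; it is forced to 32x32 RGB first."""
    arr = np.array(image.convert('RGB').resize((32, 32)))
    record = np.concatenate(([label],
                             arr[:, :, 0].flatten(),
                             arr[:, :, 1].flatten(),
                             arr[:, :, 2].flatten())).astype(np.uint8)
    with open(path, 'ab') as f:  # append mode: records just concatenate
        record.tofile(f)

# Demo: four synthetic images in one bin file.
if os.path.exists('out_all.bin'):
    os.remove('out_all.bin')
for i in range(4):
    append_record('out_all.bin', Image.new('RGB', (64, 64)), label=i)

# Sanity check: file size must be num_images * (1 + 3072) bytes.
size = os.path.getsize('out_all.bin')
```

Appending in binary mode is all the "concatenation" needed, since the format has no header or separators between records.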

EDIT 2:

As per Asker's question in comments,

if not eval_data:
    filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i)
                 for i in xrange(1, 6)]

If you really want it to work as it is, you need to study the function calls of the CIFAR-10 code. In cifar10_input, the batch filenames are hardcoded, so you have to edit this line of code to fit the name of your bin file. Or, just distribute your images evenly across the data_batch files the loop expects.


None of the answers did what I wanted, so I made my own solution. It can be found on my GitHub here: https://github.com/jdeepee/machine_learning/tree/master

This script will convert any number of images into training and test data where the arrays are the same shape as the CIFAR-10 dataset.

The code is commented, so it should be easy enough to follow. I should note that it iterates through a master directory containing multiple folders which contain the images.