I've recently been using Azure Custom Vision to train an image classifier and export the trained model in TensorFlow format. Thanks to transfer learning, it can train a model from only a few samples (at least 50 images per class). Great, isn't it?
However, if you are starting your project from scratch, you may not have even 50 images per class. Or you may have plenty of images for one class and only a few for another. In this short article we are going to explore some code snippets that make it easy to generate synthetic data.
Data augmentation in data analysis refers to techniques used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model. It is closely related to oversampling in data analysis... Geometric transformations, flipping, color modification, cropping, rotation, noise injection and random erasing are used to augment images in deep learning.
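To make "slightly modified copies" concrete, here is a minimal sketch of two of the transformations listed above, flipping and noise injection, written in plain Python on a tiny grayscale image stored as nested lists. The helper names `flip_horizontal` and `add_noise` and the sample pixel values are made up for illustration; real projects would use a library such as imgaug, as in the rest of this article.

```python
import random

def flip_horizontal(image):
    """Return a copy of the image with every row reversed (mirror left-right)."""
    return [row[::-1] for row in image]

def add_noise(image, scale, rng):
    """Add Gaussian noise to every pixel and clip the result to 0..255."""
    return [
        [min(255, max(0, round(pixel + rng.gauss(0, scale)))) for pixel in row]
        for row in image
    ]

# A tiny 2x3 grayscale "image" (illustrative values only).
image = [
    [0, 128, 255],
    [10, 20, 30],
]

rng = random.Random(42)  # fixed seed so the run is repeatable
flipped = flip_horizontal(image)
noisy = add_noise(image, scale=5.0, rng=rng)

print(flipped)  # [[255, 128, 0], [30, 20, 10]]
print(noisy)    # same shape as the input, pixel values slightly shifted
```

Each call returns a new image and leaves the original untouched, which is what lets one source photo yield many training samples.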
Step 1: Create a folder with classes
For this experiment I've created a single folder called images with many subfolders, each corresponding to a class.
Step 2: Put your data into the folders
Again, just put your samples into the corresponding folder. For example, I've created a simple electrical switch classifier and prepared two images per class: two for the class "ON" and two for the class "OFF".
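Before augmenting, it can help to check that the layout is what you expect. The sketch below assumes the one-subfolder-per-class layout described above; the function name `count_images_per_class` and the file names in the comment are hypothetical.

```python
from pathlib import Path

def count_images_per_class(root):
    """Map each class subfolder name to the number of files it contains."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():  # skip stray files next to the class folders
            counts[class_dir.name] = sum(1 for f in class_dir.iterdir() if f.is_file())
    return counts

if __name__ == "__main__":
    # Expected layout (folder and file names are illustrative):
    # images/
    #   ON/   on_1.jpg, on_2.jpg
    #   OFF/  off_1.jpg, off_2.jpg
    if Path("images").is_dir():
        print(count_images_per_class("images"))
```

With the switch example, this should report two files for each of the two classes.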

Step 3: Prepare common functions
Here are some useful functions that can be reused across projects, such as saving images or listing the files in a folder.
import os
import cv2
import imageio
import numpy as np
import imgaug as ia
import imgaug.augmenters as iaa
from PIL import Image
from datetime import datetime
from imgaug.augmentables.batches import UnnormalizedBatch
def count_files_in_folder(folder):
    """Count the regular files (not subfolders) directly inside a folder."""
    files_count = len([name for name in os.listdir(folder) if os.path.isfile(os.path.join(folder, name))])
    return files_count

def save_image(image, folder):
    """Save an image with unique name

    Arguments:
        image {Pillow} -- image object to be saved
        folder {string} -- output folder
    """
    # check whether the folder exists and create one if not
    if not os.path.exists(folder):
        os.makedirs(folder)
    # to not erase previously saved photos counter (image name) = number of photos in a folder + 1
    image_counter = count_files_in_folder(folder)+1
    # save image to the dedicated folder (folder name = label)
    image_name = folder + '/' + str(image_counter) + '.png'
    image.save(image_name)

def get_files_in_folder(folder):
    return [os.path.join(folder, name) for name in os.listdir(folder) if os.path.isfile(os.path.join(folder, name))]

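def list_oversample(initial_list, max_size):
    """Duplicate a list n times or take a part of a list

    Arguments:
        initial_list {list} -- array to be resized
        max_size {int} -- majority class size
    """
    resized_array = []
    initial_length = len(initial_list)
    # number of extra items needed; never negative, so a class that is
    # already large enough yields an empty list
    new_size = max(max_size - initial_length, 0)
    if new_size >= initial_length:
        # repeat the whole list as many times as fits
        augment_rate = int(new_size/initial_length)
        resized_array = initial_list*augment_rate
    else:
        # fewer extra items than originals: take only the first ones
        resized_array = initial_list[:new_size]
    return resized_array
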
def save_image_array(image_array, folder):
    for image in image_array:
        save_image(Image.fromarray(image), folder)
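When the target size is not a multiple of the number of originals, repeating the list a whole number of times leaves the class a few items short. If you need to hit the target exactly, one possible variant cycles through the originals until the gap is filled. This is only a sketch; `oversample_to_target` is a hypothetical helper, not part of the code above.

```python
from itertools import cycle, islice

def oversample_to_target(items, target_size):
    """Return the extra items needed to grow `items` to `target_size`.

    Cycles through `items` in order, so the result has exactly
    target_size - len(items) entries (or none if already large enough).
    """
    missing = max(target_size - len(items), 0)
    if not items or missing == 0:
        return []
    return list(islice(cycle(items), missing))

files = ["a.png", "b.png", "c.png"]
extra = oversample_to_target(files, 8)
print(extra)  # ['a.png', 'b.png', 'c.png', 'a.png', 'b.png']
```

Three originals plus five picked copies give the eight items requested.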
Step 4: Set the augmenters
# Set augmenters
seq = iaa.Sequential([
    iaa.Fliplr(0.5), # horizontal flips
    iaa.Crop(percent=(0, 0.1)), # random crops
    # Small gaussian blur with random sigma between 0 and 0.5.
    # But we only blur about 50% of all images.
    iaa.Sometimes(
        0.5,
        iaa.GaussianBlur(sigma=(0, 0.5))
    ),
    # Strengthen or weaken the contrast in each image.
    iaa.LinearContrast((0.75, 1.5)),
    # Add gaussian noise.
    # For 50% of all images, we sample the noise once per pixel.
    # For the other 50% of all images, we sample the noise per pixel AND
    # channel. This can change the color (not only brightness) of the
    # pixels.
    iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.05*255), per_channel=0.5),
    # Make some images brighter and some darker.
    # In 20% of all cases, we sample the multiplier once per channel,
    # which can end up changing the color of the images.
    iaa.Multiply((0.8, 1.2), per_channel=0.2),
    # Apply affine transformations to each image.
    # Scale/zoom them, translate/move them, rotate them and shear them.
    iaa.Affine(
        scale={"x": (0.8, 1.2), "y": (0.8, 1.2)},
        translate_percent={"x": (-0.2, 0.2), "y": (-0.2, 0.2)},
        rotate=(-25, 25),
        shear=(-8, 8)
    )
], random_order=True) # apply augmenters in random order
Step 5 (option 1): Manually define the number of desired items
Here the process is quite straightforward: I simply tell the augmenter to bring every class up to N items, say, 50 per class.
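# input image folder
IMAGE_FOLDER = 'images'
# all subfolders in the initial directory (stray files are skipped)
image_subfolders = [os.path.join(IMAGE_FOLDER, subfolder) for subfolder in os.listdir(IMAGE_FOLDER)
                    if os.path.isdir(os.path.join(IMAGE_FOLDER, subfolder))]
# target number of images per class
max_image_count = 50
# only the classes with fewer images than the target need augmentation
image_target_subfolders = [subfolder for subfolder in image_subfolders if count_files_in_folder(subfolder) < max_image_count]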
Step 5 (option 2): Set the number of items per class depending on the majority class
This one is more interesting. For example, say we have 100k images for class A and fewer than 1k images for each of the other classes (B, C, D, etc.). There's no need to generate more synthetic data for the majority class, so we automatically set the number of items for each minority class according to the size of the largest one.
# input image folder
IMAGE_FOLDER = '../data/categories_resized'
# all subfolders in the initial directory (stray files are skipped)
image_subfolders = [os.path.join(IMAGE_FOLDER, subfolder) for subfolder in os.listdir(IMAGE_FOLDER)
                    if os.path.isdir(os.path.join(IMAGE_FOLDER, subfolder))]
# number of instances in the majority class
max_image_count = max([count_files_in_folder(subfolder) for subfolder in image_subfolders])
# every class smaller than the majority class needs augmentation
image_target_subfolders = [subfolder for subfolder in image_subfolders if count_files_in_folder(subfolder) < max_image_count]
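The arithmetic behind this option is simple: each class needs the difference between the majority class size and its own size. A minimal sketch, with a hypothetical helper `synthetic_counts` and made-up class counts that mirror the example above:

```python
def synthetic_counts(class_counts):
    """For each class, how many synthetic images bring it up to the largest class."""
    majority = max(class_counts.values())
    return {label: majority - count for label, count in class_counts.items()}

# Illustrative counts, not measured data.
counts = {"A": 100_000, "B": 800, "C": 950}
print(synthetic_counts(counts))  # {'A': 0, 'B': 99200, 'C': 99050}
```

The majority class gets zero, which is why it is filtered out of the target subfolders.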
Step 6: Generate the synthetic data
for subfolder in image_target_subfolders:
    print(subfolder)
    # =============Time calculation===============
    start_time = datetime.now()
    # =============Time calculation===============
    # create images array per folder
    image_files = get_files_in_folder(subfolder)
    # pick the source files to augment, using the target size set in step 5
    synthetic_image_files = list_oversample(image_files, max_image_count)
    images = [imageio.imread(image_file) for image_file in synthetic_image_files]
    # apply image augmentation on a subfolder
    augmented_images = seq(images=images)
    save_image_array(augmented_images, subfolder)
    # =============Time calculation===============
    # check the endtime
    end_time = datetime.now()
    # get the total time spent
    time_spent = end_time - start_time
    spent_minutes, spent_seconds = divmod(
        time_spent.days * 86400 + time_spent.seconds, 60)
    print("{} min {} sec".format(spent_minutes, spent_seconds))
    # =============Time calculation===============
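If the timing lines feel noisy, they could be wrapped in a small context manager. This is an optional sketch using only the standard library; `timed` and `format_elapsed` are hypothetical names, and the `sum(...)` call is just a stand-in for the per-folder augmentation work.

```python
import time
from contextlib import contextmanager

def format_elapsed(seconds):
    """Format a duration in seconds as 'M min S sec'."""
    minutes, secs = divmod(int(seconds), 60)
    return "{} min {} sec".format(minutes, secs)

@contextmanager
def timed(label):
    """Print how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(label, format_elapsed(time.perf_counter() - start))

with timed("demo"):
    sum(range(1000))  # stand-in for the per-folder augmentation work
```

Inside the loop, the body for each subfolder would go under `with timed(subfolder):`.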
Now let's have a look at our initial folders.

Now we can bring everything to Azure Custom Vision to train a classifier.
Hope this was useful, enjoy!