Speed up Your Data Pipelines for Deep Learning

tf.data meets Object Storage

Jerome Kafrouni
IBM Data Science in Practice

--

Photo by Quinten de Graaf on Unsplash

When training Deep Learning models using GPUs, speeding up your data pipeline is paramount. The goal is to make sure that you are accessing and preparing data fast enough so that the data pipeline is not a bottleneck. In other words: if your GPU is fast, you should feed it data even faster!

In this article, I will show you how to set up a fast and elegant data pipeline leveraging TensorFlow’s tf.data API to train Computer Vision models using a GPU, when your data is stored in an Object Storage solution such as S3 or IBM Cloud Object Storage. The end-to-end code used for this article is available on my GitHub account in this repository.

Note: this post is a shorter version of an end-to-end tutorial that includes instructions dedicated to IBM Watson Studio, available on the IBM Community blogging platform.

Part 1 — Environment description

Let’s consider a typical setup to run Deep Learning code:

  • Your model is training in an environment (a container in Kubernetes, an individual VM…) that is running Python and has access to a GPU
  • Your data is in an Object Storage bucket.

If you don’t have access to a Python runtime with a GPU, or to a bucket to store your data, head to the longer version of this post for instructions on how to set them up.

What problem are we solving exactly?

We live in an era where separation of compute and data storage is becoming the norm. One reason for that separation is that with this type of architecture, you can scale data storage on one end (in the case of Object Storage, virtually infinitely), and compute resources on the other. While I certainly believe in this type of architecture, there is always a challenge associated with it: it takes time to move data around.

Depending on what you are doing with this data, and how much data you are working with, you will need to take this transfer time into account: it’s one thing to have fast training code, but you also need to keep data reads from becoming the bottleneck.

Let’s start writing some code

Let’s assume that you have credentials to a bucket containing labeled images, stored in a variable called credentials. There are several Object Storage solutions out there that implement the same API as AWS S3, which means any tool designed to work with S3 will work with them. The first obvious such tool is boto3, the AWS SDK for Python, which we can instantiate first:

import boto3

s3 = boto3.resource(
    's3',
    aws_access_key_id=credentials['access_key'],
    aws_secret_access_key=credentials['secret_key'],
    endpoint_url=credentials['url'],
)
# bucket_name holds the name of your bucket as a string:
bucket = s3.Bucket(bucket_name)

Part 2 — Reading the data — the naive way

At this point, you should have the following:

  • A notebook open, with access to a GPU
  • Data stored in an Object Storage bucket, such as IBM Cloud Object Storage
  • The credentials to access this bucket loaded safely in your notebook in a Python variable called credentials.

The dataset I will be using here is a simple Image Classification dataset containing ~25k images of different natural scenes (buildings, forests, glaciers…) sourced from Kaggle, namely the Intel Image Classification dataset. It’s not as big as some popular image datasets, but large enough to already show significant differences in training time.

A naive approach: download the data sequentially

The simplest thing to do is to use the bucket object we created earlier to list all remote files stored in your bucket. You can then filter on the file suffix to keep only .jpg images, and store the result in a list called keys.

Once you have that list, simply iterate over it and use boto3’s download_file method. Note that in my case the keys all start with two extra path components, /<bucket-name>/<unnecessary-top-folder>/…, which I strip off here.

import os
from tqdm import tqdm

keys = [file.key for file in bucket.objects.all() if file.key.endswith('.jpg')]

for key in tqdm(keys):
    local_path = '/'.join(key.split('/')[2:])
    # make sure all subfolders needed exist locally:
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    bucket.download_file(key, local_path)

A less naive approach: download the data in parallel

If you run the code above, you will notice that it’s not very efficient and can take up to 30 minutes to download all the data locally. To speed it up, we can use multiple threads to download files in parallel.

In the snippet above, I was using tqdm to keep track of progress, and tqdm happens to provide a one-line solution to parallelize code and get a progress bar at the same time, using thread_map().

One thing to be careful about when parallelizing code is whether or not the objects you are using are thread safe, i.e. whether weird and unexpected things can happen when the same object is accessed simultaneously by different threads. So far I’ve used the boto3 Resource API, which is not thread safe. Due to this, we’ll switch here to the low-level Client. Read more about threading concerns with boto3 here, and if you don’t know the difference between the resource and client APIs, I recommend reading this article.

from tqdm import tqdm
from tqdm.contrib.concurrent import thread_map

s3 = boto3.client(
    's3',
    aws_access_key_id=credentials['access_key'],
    aws_secret_access_key=credentials['secret_key'],
    endpoint_url=credentials['url'],
)

def download(key):
    local_path = '/'.join(key.split('/')[2:])
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file(bucket_name, key, local_path)

thread_map(download, keys, max_workers=10, tqdm_class=tqdm)

By parallelizing this code, I was able to get the download time from ~30 minutes down to ~5 minutes! 🎉

As a note to those new to multithreading/processing: at some point, the overhead of adding more concurrent workers outweighs the benefit of parallelization, i.e. switching to 20 workers won’t necessarily bring the download down to 2.5 minutes.
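
If you want to find the sweet spot for your own environment, a quick and informal way is to time a small sample of keys with a few different worker counts, reusing the download helper and keys list from the snippet above (the sample size and worker counts below are arbitrary choices of mine):

import time

sample = keys[:500]  # a small sample is enough to compare worker counts
for workers in (1, 5, 10, 20):
    start = time.perf_counter()
    thread_map(download, sample, max_workers=workers, tqdm_class=tqdm)
    print(f'{workers:>2} workers: {time.perf_counter() - start:.1f}s')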

Photo by Julian Hochgesang on Unsplash

Once you have the data downloaded locally, you can use the well-known ImageDataGenerator and its flow_from_directory() method. For example:

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
val_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

train_ds = train_datagen.flow_from_directory(
    './train',
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='sparse'
)
val_ds = val_datagen.flow_from_directory(
    './test',
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='sparse'
)

Part 3 — Efficient data pipelines with tf.data

In Part 2, we saved a lot of time. But keep in mind that we only have ~25k images here. What if we had millions of images? We probably don’t want to sit around waiting for the entire dataset to download before we can start defining and training a model!

It’s time to introduce tf.data: this TensorFlow API provides a variety of building blocks to assemble a data pipeline, including reading and preprocessing the data to be fed into your model. This pipeline will only be evaluated when it’s actually used during training, which means that you can define the pipeline and quickly move on to defining your model’s architecture.

There are many good tutorials out there covering tf.data, and the TensorFlow documentation (tf.data basics here, and pipeline optimization here) is a very good place to start. Rather than giving a comprehensive guide to tf.data here, I instead want to show you how to set it up to work with Cloud Object Storage, and how to write an efficient image processing pipeline.

First, let’s set up TensorFlow to connect to our bucket. Remember the credentials variable from Part 1? Let’s set some of its content as environment variables here. Make sure to run this before importing TensorFlow.

os.environ['AWS_ACCESS_KEY_ID'] = credentials['access_key']
os.environ['AWS_SECRET_ACCESS_KEY'] = credentials['secret_key']
# for TF, make sure to remove the 'https://' prefix:
os.environ['S3_ENDPOINT'] = 's3.private.us-south.cloud-object-storage.appdomain.cloud'
# avoid certificate issues when connecting to COS from TF:
os.environ['S3_VERIFY_SSL'] = "0"

We can now import TensorFlow and run a “smoke test” to check that we can access one sample file in the bucket. If file_io.stat() returns, it means we’re connected!

import tensorflow as tf
from tensorflow.python.lib.io import file_io
path = f's3://{bucket_name}/intel-image-classification/seg_pred/seg_pred/10004.jpg'
file_io.stat(path)

Once connected, let’s write the first step of the pipeline: listing filenames.

train_keys_ds = tf.data.Dataset.list_files(f's3://{bucket_name}/intel-image-classification/seg_train/*/*/*.jpg')
val_keys_ds = tf.data.Dataset.list_files(f's3://{bucket_name}/intel-image-classification/seg_test/*/*/*.jpg')

Before we define the rest of the pipeline, we prepare two helper functions. These functions use class names which we extract by iterating once over the keys dataset.
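
The helper code itself lives in the repository; here is a minimal sketch of what these two functions can look like, adapted from the TensorFlow image-loading tutorial (the names get_label and process_path, the way class_names is built, and the rescaling to [0, 1] are my assumptions):

import numpy as np

# The class name is the parent folder of each image,
# e.g. .../seg_train/forest/12345.jpg -> 'forest'
class_names = np.array(sorted(
    {key.numpy().decode('utf-8').split('/')[-2] for key in train_keys_ds}
))

def get_label(file_path):
    # the class is the second-to-last path component
    parts = tf.strings.split(file_path, '/')
    one_hot = parts[-2] == class_names
    return tf.argmax(one_hot)  # integer-encoded (sparse) label

def process_path(file_path):
    label = get_label(file_path)
    img = tf.io.read_file(file_path)          # raw bytes, works with s3:// paths too
    img = tf.io.decode_jpeg(img, channels=3)  # decode the JPEG
    img = tf.image.resize(img, IMG_SIZE)      # resize to the model's input size
    return img / 255.0, label                 # rescale pixel values to [0, 1]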

As you can see, everything is written in TensorFlow: processing filenames, loading image bytes, decoding them, and resizing the image. This will make scaling the pipeline easier. We now use these functions when defining the full pipeline.
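
Here is a sketch of what it can look like (the exact ordering of the steps and the buffer sizes are my choices, so they may differ slightly from the repository’s code):

AUTOTUNE = tf.data.experimental.AUTOTUNE

train_ds = (
    train_keys_ds
    .map(process_path, num_parallel_calls=AUTOTUNE)  # read, decode and resize in parallel
    .cache()                          # keep decoded images in memory after the first pass
    .shuffle(buffer_size=1000)
    .repeat(EPOCHS)                   # enough samples for every epoch, without duplicating data
    .batch(BATCH_SIZE)
    .prefetch(buffer_size=AUTOTUNE)   # prepare the next batches while the GPU trains
)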

Notice the use of AUTOTUNE for several functions to let TF optimize some parameters.

We first map the process_path function defined earlier to the file list dataset. The pipeline also contains steps to shuffle the data, as well as repeat it the same number of times as defined by the EPOCHS variable to make sure we have enough samples to iterate over (don’t worry, we are not actually duplicating data, just defining a generator). But more importantly we make two specific function calls:

  • .cache() makes sure that after the first read from Object Storage the data will stay cached locally. Here we are caching in memory, but could specify a file path to cache to disk for larger data.
  • .prefetch() makes sure that the data pipeline is already loading the next batch of data in a buffer while the GPU is crunching the current batch.
Timeline of a tf.data pipeline when prefetch is on (source: TensorFlow documentation)

In the diagram above, you can get a sense of what’s happening when you prefetch in a tf.data pipeline: by the time we reach each pink step, i.e. training the model on a mini batch, we have already run the purple step needed to read the corresponding files, which of course we did after first opening the files. As long as enough blue and purple steps can happen (on CPU) before pink steps (on GPU), our training pipeline is blazing fast!

Before feeding that data pipeline to a model, we can get one batch and visualize the images to make sure all steps are correct:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
iterator = iter(train_ds)
img_batch, label_batch = next(iterator)
for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(img_batch[i])
    plt.title(class_names[label_batch[i]])
    plt.axis("off")
A 3×3 grid of sample images labeled forest, mountain, street, glacier, and buildings
If your pipeline was set up correctly, you’ll see nine sample images from our dataset, each labeled with its class.

Part 4 — Training a model

This part is not the main focus of the blog post, so let’s create a simple model that re-uses VGG16 layers from Keras applications.
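
Here is a minimal sketch of what such a model can look like, assuming we keep the pretrained VGG16 convolutional base frozen and add a small classification head on top (the exact head and compile settings in the repository may differ):

base_model = tf.keras.applications.VGG16(
    include_top=False,            # reuse only the convolutional base
    weights='imagenet',
    input_shape=(*IMG_SIZE, 3),
)
base_model.trainable = False      # freeze the pretrained layers

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(len(class_names), activation='softmax'),
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # labels are sparse integer classes
    metrics=['accuracy'],
)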

Training this model with the training dataset created with tf.data (and the validation dataset, which is defined with mostly the same code and which I purposely skipped for brevity) is as simple as calling .fit()! No special code needed.

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    steps_per_epoch=len(train_keys_ds) // BATCH_SIZE,
    callbacks=callbacks,
)

Part 5 — Pros and cons and simple benchmarking

First, let’s look at some rough numbers, then think about when to use a tf.data pipeline and how.

Option 1: Download all files sequentially, train the model.

Overhead to read the data: ~30 min. This should be enough to convince you not to do it.

Option 2: Download all files in parallel

  • Overhead to read the data: ~5 min.
  • Average epoch duration: ~75 sec if using ImageDataGenerator, ~65 sec if using tf.data*

*Yes, you can use tf.data with local files too. More on that later.

Option 3: Use a tf.data pipeline with prefetching from Object Storage

  • Overhead to read the data: None
  • Average epoch duration: ~65 sec

So should I always use tf.data and read directly from Object Storage?

Yes and no. There are a couple of advantages to downloading the files to your notebook environment.

First, as far as I know, there is no way for ImageDataGenerator to take file locations on S3/COS as input (drop a note in the comments if there is!). But once you have downloaded the images, you can rely on ImageDataGenerator to create your image processing pipeline, and it’s much easier to use! For example, you won’t need to worry about the lower-level code to read and decode images, resize them, etc. Plus, ImageDataGenerator has convenient options to augment data on the fly, such as random rotations or crops, among others.

It’s also a little easier to work with local files directly for other tasks: here we didn’t do much exploration of the data, but if we had, it would probably have been easier with files downloaded locally. You might also want to use OpenCV to do some preprocessing without worrying about scaling it right away.

In other terms, optimizing your code too early is not always the best workflow. It’s ok for your experimental code not to be perfect! Here’s what I would recommend:

  • First, download a small random subset of images
  • Write your first models using ImageDataGenerator and test the code by training on a few epochs
  • Replace the ImageDataGenerator by a tf.data pipeline reading from local files (see the sketch after this list), then test the code again
  • Now replace the source of the tf.data pipeline to Object Storage, and train on all your data!
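
As a sketch of that third step, pointing tf.data at local files only changes the source of the pipeline, assuming here that the data was downloaded into ./train with one subfolder per class; the rest of the training code stays exactly the same:

local_train_keys_ds = tf.data.Dataset.list_files('./train/*/*.jpg')

local_train_ds = (
    local_train_keys_ds
    .map(process_path, num_parallel_calls=AUTOTUNE)
    .cache()
    .shuffle(buffer_size=1000)
    .repeat(EPOCHS)
    .batch(BATCH_SIZE)
    .prefetch(buffer_size=AUTOTUNE)
)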

You are now ready to read and process your data efficiently and make sure your model training will be as fast as possible. As a reminder, all the code and a bonus section showing you how to deploy the models created is available in this repository.

If you liked this article, feel free to check out other posts on my personal account and on IBM’s Data Science in Practice Medium publication. I’m part of a team called Data Science and AI Elite at IBM and my job is to write efficient code to solve business use cases at scale every day.

To learn how to kick-start your data science project with the right expertise, tools and resources, the Data Science and AI Elite (DSE) is here to help. The DSE team can plan, co-create and prove the project with you based on our proven Agile AI methodology. Visit ibm.biz/datascienceelite to connect with us, explore our resources and learn more about Data Science and AI Elite. Request a complimentary consultation: ibm.co/DSE-Consultation.
