Using the TensorFlow 2 Data API to Load and Preprocess Data

Ferry Djaja
7 min read · Oct 6, 2020

In this tutorial, I will walk through how to use the TensorFlow Data API to load and preprocess data. This tutorial (and code) is based on the book Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow.

Deep learning systems are often trained on very large datasets that will not fit in RAM. The TensorFlow Data API makes it easy to get the data, load it, and transform it. TensorFlow takes care of all the implementation details, such as multithreading, queuing, batching and prefetching. Moreover, the Data API works seamlessly with tf.keras.

The Data API can read data from text files (such as CSV files), binary files with fixed-size records, and binary files that use TensorFlow’s TFRecord format. In this write-up, I will cover the Data API.

The Data API

The Data API revolves around the concept of a dataset that represents a sequence of data items.

Let’s create a simple dataset with the from_tensor_slices() method, as shown below.
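A minimal sketch (the sample values here are purely illustrative):

    import tensorflow as tf

    # Create a dataset whose items are the slices of a tensor
    X = tf.range(10)  # sample data: 0, 1, ..., 9
    dataset = tf.data.Dataset.from_tensor_slices(X)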

The from_tensor_slices() method takes a tensor and creates a tf.data.Dataset whose items are the slices of that tensor along its first dimension. Let’s iterate over the dataset’s items.
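Iterating over the dataset yields one tensor per item, along these lines:

    for item in dataset:
        print(item)
    # tf.Tensor(0, shape=(), dtype=int32)
    # tf.Tensor(1, shape=(), dtype=int32)
    # ...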

Transformations

Once we have a dataset, we can apply all sorts of transformations by calling its methods.

Chaining dataset transformations
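For example, repeat() and batch() can be chained; each call returns a new dataset, roughly like this:

    dataset = dataset.repeat(3).batch(7)  # 3 passes over the data, in batches of 7
    for item in dataset:
        print(item)
    # tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
    # ...
    # tf.Tensor([8 9], shape=(2,), dtype=int32)  # the final batch is smaller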

We can also transform the items by calling the map() method.
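A sketch of map(), here simply doubling each item (the transformation itself is arbitrary):

    dataset = tf.data.Dataset.range(10)
    dataset = dataset.map(lambda x: x * 2)  # items: 0, 2, 4, ..., 18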

We can also apply a filter on the dataset using the filter() method.
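For instance, keeping only the items smaller than 10:

    dataset = dataset.filter(lambda x: x < 10)  # keeps 0, 2, 4, 6, 8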

Shuffle

Using the shuffle() method, the following code creates and displays a dataset containing the integers 0 to 9, repeated 3 times, shuffled using a buffer size of 3 and a random seed of 42, and batched with a batch size of 9.
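Something along these lines:

    dataset = tf.data.Dataset.range(10).repeat(3)
    dataset = dataset.shuffle(buffer_size=3, seed=42).batch(9)
    for item in dataset:
        print(item)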

Interleaving Lines from Multiple Files

To shuffle the instances, a common approach is to split the source data into multiple files and then read them in a random order during training. However, instances located in the same file will still end up close to each other. To avoid this, you can pick multiple files randomly and read them simultaneously, interleaving their records.

We will load the California Housing dataset, split it into a training set, a validation set and a test set, and scale it using the StandardScaler() function.
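One way to do it (following the referenced notebook, we only fit the scaler here and keep its mean and scale; the actual scaling is applied later, in the preprocess() function):

    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    housing = fetch_california_housing()
    X_train_full, X_test, y_train_full, y_test = train_test_split(
        housing.data, housing.target, random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_train_full, y_train_full, random_state=42)

    scaler = StandardScaler()
    scaler.fit(X_train)
    X_mean = scaler.mean_   # per-feature mean, used later in preprocess()
    X_std = scaler.scale_   # per-feature standard deviation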

We need to reshape housing.target (our target) so that it has 1 column and as many rows as there are instances, aligned with the shape of housing.data (our features).
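For instance:

    y_train = y_train.reshape(-1, 1)
    y_valid = y_valid.reshape(-1, 1)
    y_test = y_test.reshape(-1, 1)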

Let’s check the shapes of the training set, test set and validation set.
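For example (the exact numbers depend on the split, but should look roughly like this):

    print(X_train.shape, y_train.shape)  # (11610, 8) (11610, 1)
    print(X_valid.shape, y_valid.shape)  # (3870, 8) (3870, 1)
    print(X_test.shape, y_test.shape)    # (5160, 8) (5160, 1)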

For a very large dataset that does not fit in memory, we will typically want to split it into many files first, then have TensorFlow read these files in parallel. Let’s start by splitting the housing dataset and saving it to 20 CSV files:
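A sketch of how this can be done; the save_to_multiple_csv_files() helper below is my own, modeled on the referenced notebook:

    import os
    import numpy as np

    def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
        housing_dir = os.path.join("datasets", "housing")
        os.makedirs(housing_dir, exist_ok=True)
        path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

        filepaths = []
        m = len(data)
        for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
            part_csv = path_format.format(name_prefix, file_idx)
            filepaths.append(part_csv)
            with open(part_csv, "wt", encoding="utf-8") as f:
                if header is not None:
                    f.write(header + "\n")
                for row_idx in row_indices:
                    f.write(",".join(str(col) for col in data[row_idx]))
                    f.write("\n")
        return filepaths

    train_data = np.c_[X_train, y_train]
    valid_data = np.c_[X_valid, y_valid]
    test_data = np.c_[X_test, y_test]
    header = ",".join(housing.feature_names + ["MedianHouseValue"])

    train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
    valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
    test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)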

Let’s check the first few lines of one of the CSV files.
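For example, with pandas:

    import pandas as pd

    pd.read_csv(train_filepaths[0]).head()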

Now, we will use tf.data.Dataset.list_files() to create a dataset containing these file paths (train_filepaths).
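For instance:

    filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)
    for filepath in filepath_dataset:
        print(filepath)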

By default, the list_files() method returns a dataset that shuffles the file paths.

Interleaving Lines From Multiple Files

Now we can call the interleave() method to read from five files at a time and interleave their lines. We skip the first line of each file, which is the header row.
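A sketch of this, reading from n_readers = 5 files at a time:

    n_readers = 5
    dataset = filepath_dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers)

    for line in dataset.take(5):
        print(line.numpy())  # one raw CSV line (a byte string) per item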

The interleave() method creates a dataset that pulls five file paths from filepath_dataset and, for each one, calls the lambda function to create a new TextLineDataset.

Preprocessing the Data

The output of the dataset above is just byte strings; we need to parse them and scale the data. Let’s create a small preprocess() function.
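Here is one way to write it; X_mean and X_std are the per-feature statistics computed from the training set earlier:

    n_inputs = 8  # number of input features

    def preprocess(line):
        defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
        fields = tf.io.decode_csv(line, record_defaults=defs)
        x = tf.stack(fields[:-1])   # the 8 features as a 1D tensor
        y = tf.stack(fields[-1:])   # the target as a 1D tensor of size 1
        return (x - X_mean) / X_std, y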

Let’s walk through this code briefly:

  • We have pre-computed the mean and standard deviation of each feature in the training set. X_mean and X_std are 1D tensors containing eight floats, one per input feature.
  • The tf.io.decode_csv() function takes one CSV line and parses it. It takes two arguments: the line to parse and an array containing the default value for each column in the CSV file.
    All feature columns are floats whose missing values should default to 0., while for the last column (the target) we provide an empty array of type tf.float32; this tells TensorFlow that the target column has no default value and is therefore required:
    n_inputs = 8
    [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
  • The decode_csv() function returns a list of scalar tensors (one per column).
  • We need to return 1D tensor arrays rather than a list of scalars, so we apply tf.stack() to all the tensors except the last one (the target), and do the same for the target value to get x and y.
  • Finally, we scale the input features by subtracting the feature means and then dividing by the feature standard deviations, and we return a tuple containing the scaled features and the target.

Let’s create a csv_reader_dataset() function that puts it all together.
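One possible implementation, following the same pattern as the referenced notebook (the parameter defaults are just reasonable choices):

    def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
                           n_read_threads=None, shuffle_buffer_size=10000,
                           n_parse_threads=5, batch_size=32):
        dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
        dataset = dataset.interleave(
            lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
            cycle_length=n_readers, num_parallel_calls=n_read_threads)
        dataset = dataset.shuffle(shuffle_buffer_size)
        dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
        dataset = dataset.batch(batch_size)
        return dataset.prefetch(1)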

Loading and preprocessing data from multiple CSV files

Prefetching

By calling prefetch(1) at the end, we are creating a dataset that will do its best to always be one batch ahead. In other words, while our training algorithm is working on one batch, the dataset will already be working in parallel on getting the next batch ready (e.g., reading the data from disk and preprocessing it). This can improve performance dramatically.

With prefetching, the CPU and the GPU work in parallel: as the GPU works on one batch, the CPU works on the next.

Dataset with tf.keras

Now we can use the csv_reader_dataset() function to create a dataset for the training set. We also create datasets for the validation set and test set.
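For example (with the default batch size of 32, and one pass over the data per epoch):

    train_set = csv_reader_dataset(train_filepaths)
    valid_set = csv_reader_dataset(valid_filepaths)
    test_set = csv_reader_dataset(test_filepaths)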

Now we can build and train a Keras model using these datasets by passing the training and validation datasets to the fit() method. The fit() method will take care of repeating the training dataset.

You can see the dataset using the following code:
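For example, taking a couple of batches:

    for X_batch, y_batch in train_set.take(2):
        print("X =", X_batch)
        print("y =", y_batch)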

Create a Keras Sequential model and display the model summary.
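A simple sketch of such a model (this particular architecture is just one reasonable choice for this dataset):

    from tensorflow import keras

    model = keras.models.Sequential([
        keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
        keras.layers.Dense(1),
    ])
    model.summary()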

And compile the model.
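For instance, with the mean squared error loss and an SGD optimizer:

    model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))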

Train the model by calling the fit() method with the train_set dataset (already batched with a batch size of 32 when it was created), valid_set as the validation data, and 20 epochs.
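Something like:

    history = model.fit(train_set, validation_data=valid_set, epochs=20)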

Evaluate the model against the test_set by calling the method evaluate().
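For example:

    model.evaluate(test_set)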

And finally perform prediction by calling the method predict().
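For instance, pretending the first three test batches are new, unlabeled instances:

    new_set = test_set.take(3).map(lambda X, y: X)  # drop the labels
    model.predict(new_set)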

Reference

https://github.com/ferrygun/TF2_DataAPI/blob/main/tutorial_tfdataset.ipynb
