Fast.Ai Chapter 1-2: DataBlocks and DataLoaders

DataBlocks and DataLoaders

PyTorch and FastAi’s way of working with data is a new concept for me, and I’m going to try to parse it in a way that I understand. The docs are too intimidating (for now at least) so I’m going to try to learn what I can from the lectures, books, and code examples. My understanding of these classes might be off or unsophisticated, but it’s more of an exercise to grapple with them than trying to get it right. That’ll come with time.

Preparing your data to feed into your machine takes care, and there are a lot of levers you can pull when you train your model once you get your data into it. DataBlocks and DataLoaders are abstractions which convert your raw data into something more structured so that it’s easier for you and for machine learning models to work with.

As the text says, a DataBlock object is like a template for creating DataLoaders. So let’s start there. I’m going to talk about images since that’s the main example in the chapters, but the same holds for other kinds of data.

The basic syntax for constructing DataBlocks is this:

bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock), 
    get_items=get_image_files, 
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=Resize(128))

Let’s go through each of these lines in detail.

blocks=...: In its pure, unprocessed form your data is just a blob. Is it an image? Does it represent a bounding box, say for a face in an image? Does it carry the data for segmenting an image (e.g. in object detection)? Each of these types of data has its own transformations and ways of working with it. You specify the type in this argument so that the pipeline knows the right way to handle your data.

In this example the argument is a tuple, with the first element being the type of the input and the second being the type of your output. If you have multiple inputs or multiple outputs, you can specify which blocks are which when you construct the DataBlock; we won’t need to do that here though.
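To make the "tuple of types" idea concrete, here is a stdlib-only caricature (none of this is fastai internals; the function names and the toy vocab are made up for illustration): each block knows how to turn raw material into one side of an (x, y) pair.

```python
def image_block(raw):
    # Stand-in for ImageBlock: "decode" an image file into a usable object
    return f"<image from {raw}>"

def category_block(raw):
    # Stand-in for CategoryBlock: map a label string to a category index
    vocab = ["black", "grizzly", "teddy"]   # hypothetical label vocabulary
    return vocab.index(raw)

blocks = (image_block, category_block)      # (input type, output type)

x = blocks[0]("bears/grizzly/001.jpg")
y = blocks[1]("grizzly")
print(x)  # <image from bears/grizzly/001.jpg>
print(y)  # 1
```

The real blocks do much more (tensor conversion, vocab construction, display logic), but the shape of the idea is the same: one callable-ish thing per side of the data.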

get_items=...: When we create DataLoaders from this DataBlock in a moment, we feed it the path (more precisely, a Path object) containing all of our files:

dls = bears.dataloaders(path)

But trying to train a model without telling the DataLoaders what to do with that path is like throwing a filing cabinet full of images at it and expecting it to know what to do! The get_items argument should be a function which tells it how to pick out the data (in this case our images) from the path. As explained in the text, the get_image_files function recursively goes through the path and finds all image files.
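A rough stdlib sketch of the idea behind get_image_files (the real fastai function knows many more extensions and options, so treat this as a toy): walk the path recursively and keep anything with an image extension.

```python
from pathlib import Path
import tempfile

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}

def get_image_files_sketch(path):
    # Recursively collect files whose extension looks like an image
    return sorted(p for p in Path(path).rglob("*")
                  if p.suffix.lower() in IMAGE_EXTS)

# Tiny demo on a throwaway directory tree
root = Path(tempfile.mkdtemp())
(root / "grizzly").mkdir()
(root / "grizzly" / "001.jpg").touch()
(root / "grizzly" / "notes.txt").touch()   # not an image, should be skipped

files = get_image_files_sketch(root)
print([f.name for f in files])  # ['001.jpg']
```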

splitter=...: This argument tells the pipeline how to split the data into a training and validation set. In this example we split it 80:20. Depending on your dataset, though, a random split might not be representative of unseen data in the wild; this was discussed in Chapter 1. This argument lets you control exactly how the data gets split.
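The mechanics behind RandomSplitter(valid_pct=0.2, seed=42) can be sketched in plain Python (this is an illustration of the idea, not fastai's actual implementation): shuffle the indices with a fixed seed, then carve off 20% for validation and return the two index lists.

```python
import random

def random_splitter_sketch(items, valid_pct=0.2, seed=42):
    # Shuffle indices deterministically, then split into (train, valid)
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)
    cut = int(len(items) * valid_pct)
    return idxs[cut:], idxs[:cut]

train, valid = random_splitter_sketch(list(range(10)))
print(len(train), len(valid))  # 8 2
```

Fixing the seed means you get the same split every run, which makes experiments reproducible.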

get_y=...: This is a supervised learning problem, so each data point needs to come with a label. Is it the parent directory of the image? Maybe there’s a separate file which lists all of the labels for your data? This is where you make that explicit.
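parent_label just reads the name of the directory the file sits in. A one-line stdlib version captures the spirit of it (fastai's real function is essentially this, modulo type handling):

```python
from pathlib import Path

def parent_label_sketch(path):
    # The label is the name of the file's immediate parent directory
    return Path(path).parent.name

print(parent_label_sketch("bears/grizzly/001.jpg"))  # grizzly
```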

item_tfms=...: We might need to clean the raw data before we feed it into the model. In this example, we need to resize every image to the same dimensions. Perhaps in some other model we need to convert every image to greyscale, and for tabular data we might need to address empty fields or outliers. Any transformation you want to apply to each item, one at a time, goes in this argument.
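The key property of an item transform is that it runs on one item at a time. As a toy stand-in for Resize (which operates on real PIL images), here's a nearest-neighbour "resize" on a 2-D list pretending to be an image:

```python
def resize_sketch(img, size):
    # Map each output pixel back to its nearest source pixel
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

img = [[1, 2], [3, 4]]          # a 2x2 "image"
big = resize_sketch(img, 4)     # upsample to 4x4
print(len(big), len(big[0]))    # 4 4
print(big[0])                   # [1, 1, 2, 2]
```

The point is the shape of the contract: an item_tfm takes one item in and gives one transformed item out, so every image comes out the same size before batching.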

As well as item_tfms, the example later uses

batch_tfms=...: Similar to item_tfms, but for transformations applied to a whole batch of data at once, typically in parallel on the GPU. This includes rotating, shearing, changing the saturation, all sorts of things (many of them handily prepared for us inside the aug_transforms function).
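The item vs batch distinction can be caricatured with lists (in fastai the batch version is a single tensor operation on the GPU, which is where the speedup comes from; everything here is an illustrative stand-in):

```python
def brighten_item(img, amount):
    # item_tfm style: transform one "image" (a flat list of pixels)
    return [px + amount for px in img]

def brighten_batch(batch, amount):
    # batch_tfm style: transform the whole stack of images in one call
    return [[px + amount for px in img] for img in batch]

batch = [[0, 1], [2, 3]]              # two tiny 1-D "images"
print(brighten_batch(batch, 10))      # [[10, 11], [12, 13]]
```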

If the DataBlock defines the pipeline, the DataLoaders object is your data processed through that pipeline, with all the bells and whistles that help you see into your processed dataset. In this example, we construct it like this:

dls = bears.dataloaders(path)
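As a stdlib caricature of what that object gives you, here is a minimal shuffled mini-batch iterator (fastai's real DataLoader also handles collation into tensors, transforms, and GPU transfer, so this is only the skeleton of the idea):

```python
import random

def batches_sketch(items, batch_size, seed=42):
    # Yield the dataset in shuffled mini-batches
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)
    for i in range(0, len(idxs), batch_size):
        yield [items[j] for j in idxs[i:i + batch_size]]

data = list(range(10))
for batch in batches_sketch(data, 4):
    print(batch)    # three batches, of sizes 4, 4, and 2
```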

The example we see in the text is dls.valid.show_batch(max_n=4, nrows=1), which is great for instantly visualizing a few samples.

I think of a DataLoader as an extremely structured dataset. It exposes functionality for us to inspect and tweak the data, as well as being a standardized format that Fast.Ai and PyTorch’s machine learning models understand. And once we have this, training a model (through transfer learning) is easy:

learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)

I like the elegance of this formalism a lot.