Data in fastai

from fastbook import *

One of the most important things to understand in fastai is how to prepare your data for a model. The main workhorse for this is the DataBlock API. Here is a hello-world example of how it works:

Hello World DataBlock

The get_x and get_y arguments are functions that are applied to each item of an iterable. Let's define an iterable as our data:

data = list(range(100))
def get_x(r): return r
def get_y(r): return r + 10
dblock = DataBlock(get_x=get_x, get_y=get_y)
dsets = dblock.datasets(data)

You can look at one item of the training dataset like so (the split is random, so the item you see will vary):

dsets.train[0]
(89, 99)
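
To make the idea concrete, here is a rough pure-Python sketch of what building the datasets amounts to for this example: apply get_x and get_y to every item, and split the items at random into a training and a validation set. It is an illustration, not fastai's implementation; the helper name make_datasets and its valid_pct and seed parameters are assumptions, and the 80/20 proportion is chosen to mirror a random split.

```python
import random

def get_x(r): return r
def get_y(r): return r + 10

def make_datasets(items, valid_pct=0.2, seed=None):
    """Hypothetical helper: shuffle the indices, hold out valid_pct of them
    for validation, and turn each item into an (x, y) pair."""
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)
    n_valid = int(len(items) * valid_pct)
    valid_idxs, train_idxs = idxs[:n_valid], idxs[n_valid:]
    def to_pairs(ix): return [(get_x(items[i]), get_y(items[i])) for i in ix]
    return to_pairs(train_idxs), to_pairs(valid_idxs)

data = list(range(100))
train, valid = make_datasets(data, seed=0)
print(len(train), len(valid))  # 80 20
print(train[0])                # some random item x paired with x + 10
```

Every item ends up in exactly one of the two sets, and each sample is the pair (get_x(item), get_y(item)).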

You can also build DataLoaders and pull one batch from the training set:

dls = dblock.dataloaders(data, bs=5)
next(iter(dls.train))
(tensor([57, 66, 73, 30, 14]), tensor([67, 76, 83, 40, 24]))
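
A batch is bs samples regrouped column-wise: all the x values together and all the y values together. The pure-Python sketch below shows that regrouping on plain lists. It is an illustration only (the helper names collate and batches are hypothetical); fastai additionally turns each column into a tensor, as in the output above.

```python
def collate(samples):
    """Turn [(x0, y0), (x1, y1), ...] into ([x0, x1, ...], [y0, y1, ...])."""
    xs, ys = zip(*samples)
    return list(xs), list(ys)

def batches(samples, bs):
    """Yield consecutive batches of up to bs samples (hypothetical helper)."""
    for i in range(0, len(samples), bs):
        yield collate(samples[i:i + bs])

samples = [(x, x + 10) for x in [57, 66, 73, 30, 14, 8, 21]]
first = next(batches(samples, bs=5))
print(first)  # ([57, 66, 73, 30, 14], [67, 76, 83, 40, 24])
```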

With a DataFrame

Similarly, when the data is a DataFrame, get_x and get_y receive one row at a time:

import pandas as pd
df = pd.DataFrame({'x': range(100), 'y': range(100) })
df.head()
   x  y
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
def get_x(r): return r.x
def get_y(r): return r.y + 10
dblock = DataBlock(get_x=get_x, get_y=get_y)
dsets = dblock.datasets(df)
dsets.train[0]
(78, 88)
dls = dblock.dataloaders(df, bs=3)
next(iter(dls.train))
(tensor([90, 55, 11]), tensor([100,  65,  21]))
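
The row-at-a-time behavior can be checked without fastai. In this sketch, which assumes only pandas, each row is pulled out with df.iloc and passed to the same get_x and get_y. Attribute access such as r.x works because the row is a Series indexed by column name. How fastai iterates over the DataFrame internally is not shown here.

```python
import pandas as pd

df = pd.DataFrame({'x': range(100), 'y': range(100)})

def get_x(r): return r.x
def get_y(r): return r.y + 10

# Apply the getters to a single row (a pandas Series).
row = df.iloc[78]
sample = (int(get_x(row)), int(get_y(row)))
print(sample)  # (78, 88)

# The same thing for a small group of rows, collected column-wise.
rows = [df.iloc[i] for i in (90, 55, 11)]
xs = [int(get_x(r)) for r in rows]
ys = [int(get_y(r)) for r in rows]
print(xs, ys)  # [90, 55, 11] [100, 65, 21]
```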
def tracer(nm):
    def f(x, nm):
        # print(f'{nm}:')
        # print(f'\tinput: {x}')
        # import ipdb; ipdb.set_trace()
        return str(x)
    return partial(f, nm=nm)
def mult_0(x): return x * 0
def add_1(x): return x + 1
tb = TransformBlock(item_tfms=[tracer('item_tfms')])
# def get_y(l): return sum(l)
db = DataBlock(blocks=(TransformBlock, TransformBlock),
               get_x=mult_0,
               get_y=add_1,
               item_tfms=lambda x: str(x))
data = L(range(10))
result = db.datasets(data)
db.summary(data)
Setting-up type transforms pipelines
Collecting items from [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Found 10 items
2 datasets of sizes 8,2
Setting up Pipeline: mult_0
Setting up Pipeline: add_1

Building one sample
Pipeline: mult_0
starting from
1
applying mult_0 gives
0
Pipeline: add_1
starting from
1
applying add_1 gives
2

Final sample: (0, 2)


Collecting items from [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Found 10 items
2 datasets of sizes 8,2
Setting up Pipeline: mult_0
Setting up Pipeline: add_1
Setting up after_item: Pipeline: <lambda> -> ToTensor
Setting up before_batch: Pipeline:
Setting up after_batch: Pipeline:

Building one batch
Applying item_tfms to the first sample:
Pipeline: <lambda> -> ToTensor
starting from
(0, 2)
applying <lambda> gives
(0, 2)
applying ToTensor gives
(0, 2)

Adding the next 3 samples

No before_batch transform to apply

Collating items in a batch

No batch_tfms to apply
result.train[0]
(0, 5)
result = db.dataloaders(data, bs=3)
thing = iter(result.train)
next(thing)
(('0', '0', '0'), ('6', '7', '4'))
next(thing)
(('0', '0', '0'), ('9', '5', '3'))
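
The batches above can be reproduced by hand, which helps in reading the summary output: get_x and get_y build the sample from a single item, the item transform is then applied to each element of the sample, and the samples are finally regrouped into columns. The sketch below is plain Python with hypothetical helper names (make_sample, item_tfm, collate); it imitates the observed behavior and is not fastai's implementation.

```python
def mult_0(x): return x * 0
def add_1(x): return x + 1

def make_sample(item):
    """Build an (x, y) sample from one item, as get_x and get_y do."""
    return (mult_0(item), add_1(item))

def item_tfm(sample):
    """Apply str to every element of the sample, like the item_tfms lambda."""
    return tuple(str(o) for o in sample)

def collate(samples):
    """Regroup a list of samples into one tuple per column."""
    return tuple(zip(*samples))

items = [5, 6, 3]  # chosen so the result matches the first batch shown above
batch = collate([item_tfm(make_sample(i)) for i in items])
print(batch)  # (('0', '0', '0'), ('6', '7', '4'))
```

Because mult_0 maps every item to 0, the x column is always '0', while the y column carries the item plus one, converted to a string.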
??TransformBlock
db = DataBlock(blocks=(TransformBlock, tb),
               get_y=lambda x: str(x),
               batch_tfms=tracer('batch_tfms'))
result = db.datasets(data)
result = db.dataloaders(data, bs=3)
result
<fastai.data.core.DataLoaders at 0x7f9e08ff0160>
thing = iter(result.train)
next(thing)
(('1', '5', '6'), ('1', '5', '6'))
f = aug_transforms()[0]
f
Flip -- {'size': None, 'mode': 'bilinear', 'pad_mode': 'reflection', 'mode_mask': 'nearest', 'align_corners': True, 'p': 0.5}:
encodes: (TensorImage,object) -> encodes
(TensorMask,object) -> encodes
(TensorBBox,object) -> encodes
(TensorPoint,object) -> encodes
decodes: