How to add a new dataset to the Feast feature store

In this article, I’ll show how to use the Feast feature store in a local environment. We will download a dataset, store it in a Parquet file, define a new FeatureView in the Feast repository, and retrieve it using Feast.

How to prepare a dataset for the Feast feature store

The Feast feature store works with time-series features. Therefore, every dataset must contain a timestamp column in addition to the entity id. Multiple observations of the same entity may exist, as long as they have different timestamps.
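To make this concrete, here is a minimal sketch of a dataset in the shape Feast expects (the column names and values are made up for illustration):

import pandas as pd

# One entity id column, one event timestamp column, plus the feature values.
# The same entity may appear multiple times with different timestamps.
df = pd.DataFrame({
    "driver_id": [1001, 1001, 1002],
    "event_timestamp": pd.to_datetime([
        "2021-07-08 10:00:00",
        "2021-07-09 10:00:00",
        "2021-07-09 10:00:00",
    ]),
    "trips_today": [4, 7, 2],
})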

In our example, we are going to use the Iris dataset. It contains a single observation per flower, and it has no entity identifier. What can we do?

We can use the time when the dataset was obtained as the observation date and turn the DataFrame index into the entity ids. That is fine as long as we don’t use those ids to join with other datasets. After all, the values have no business meaning; we created them only to identify the observations.

import seaborn
from datetime import datetime

data = seaborn.load_dataset('iris')

# Turn the DataFrame index into a column and use it as the entity id.
data.reset_index(level=0, inplace=True)
data = data.rename(columns={'index': 'iris_id'})

# Use the time when we obtained the dataset as the observation date.
data['observation_date'] = datetime(2021, 7, 9, 10, 0, 0)

Now, we have to create and initialize a Feast repository. I run the code in a Jupyter Notebook, hence the exclamation marks at the beginning of the commands. If you use the command line, you won’t need them.

!feast init feature_repo
!cd feature_repo && feast apply

The freshly created Feast repository contains an example dataset, so we should see the following output:

Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats
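At the time of writing, feast init scaffolds a repository that looks roughly like this (the exact file names vary between Feast versions):

feature_repo/
├── feature_store.yaml   # store configuration: registry, provider, online store
├── example.py           # the example driver_hourly_stats definitions
└── data/
    └── driver_stats.parquet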

When the repository is ready, we can store the dataset in the data directory as a Parquet file:

data.to_parquet('/content/feature_repo/data/iris.parquet')
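As a quick sanity check, we can read the file back with pandas:

import pandas as pd

pd.read_parquet('/content/feature_repo/data/iris.parquet').head()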

Defining features in the Feast repository

In the next step, we have to prepare a Python file describing the FeatureView. It must define the data input location, the entity identifier, and the available feature columns. We store this file in the feature repository using the Jupyter %%writefile magic:

%%writefile /content/feature_repo/iris.py
from datetime import timedelta

from feast import Entity, Feature, FeatureView, ValueType
from feast.data_source import FileSource

# The offline data source: a Parquet file plus the column with the event timestamp.
iris_observations = FileSource(
    path="/content/feature_repo/data/iris.parquet",
    event_timestamp_column="observation_date",
)

iris = Entity(name="iris_id", value_type=ValueType.INT64, description="Iris identifier")

iris_observations_view = FeatureView(
    name="iris_observations",
    entities=["iris_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="sepal_length", dtype=ValueType.FLOAT),
        Feature(name="sepal_width", dtype=ValueType.FLOAT),
        Feature(name="petal_length", dtype=ValueType.FLOAT),
        Feature(name="petal_width", dtype=ValueType.FLOAT),
        Feature(name="species", dtype=ValueType.STRING),
    ],
    online=False,
    input=iris_observations,
    tags={},
)

The code above does three things:

  • The FileSource defines the feature source location, in this case a path on the local file system. Note that it also requires the name of the column containing the event timestamp.
  • The Entity object describes which column contains the entity identifier. In our example, the value is useless and has no business meaning, but we still need it.
  • Finally, we define the FeatureView, which combines the available column names (and types) with the entity identifier and the data location. We have only historical data in our example, so I set the online parameter to False.

Since Feast 0.11, we can skip the features parameter in FeatureView, and the library will infer the column names and types from the data.
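For instance, the definition could be shortened like this (a sketch assuming Feast >= 0.11; the FileSource and Entity definitions stay the same):

iris_observations_view = FeatureView(
    name="iris_observations",
    entities=["iris_id"],
    ttl=timedelta(days=1),
    online=False,
    input=iris_observations,
    tags={},
)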

When we have the FeatureView definition, we can reload the repository and use the new feature:

!cd feature_repo && feast apply

Output:

Registered entity driver_id
Registered entity iris_id
Registered feature view driver_hourly_stats
Registered feature view iris_observations
Deploying infrastructure for driver_hourly_stats
Deploying infrastructure for iris_observations

What does TTL mean?

In the examples below, we retrieve values from the feature store. We must specify the event_timestamp of every entity row. The ttl describes the maximum time difference between the actual observation timestamp and the timestamp we request. Of course, it is a difference “in the past.” We can never retrieve events “in the future.”
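We can sketch this rule in plain Python (is_visible is a made-up helper for illustration; Feast applies the equivalent condition during the point-in-time join):

from datetime import datetime, timedelta

observation_date = datetime(2021, 7, 9, 10, 0, 0)
ttl = timedelta(days=1)

def is_visible(event_timestamp):
    # An observation is joined only when it is not newer than the requested
    # timestamp and not older than event_timestamp - ttl.
    return event_timestamp - ttl <= observation_date <= event_timestamp

print(is_visible(datetime(2021, 7, 10, 10, 0, 0)))  # True: within the TTL
print(is_visible(datetime(2021, 7, 11, 10, 0, 0)))  # False: the observation expired
print(is_visible(datetime(2021, 7, 9, 9, 59, 59)))  # False: the observation is still in the future

All three cases appear in the examples below.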

Retrieving value from the feature store

To retrieve the value, we must specify the entity ids and the desired observation time:


import pandas as pd
from datetime import datetime

from feast import FeatureStore

# Request features for entities 0-99 "as of" the given event timestamp.
entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 10, 10, 0, 0),
    }
)

store = FeatureStore(repo_path="feature_repo")

training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=[
        "iris_observations:sepal_length",
        "iris_observations:sepal_width",
        "iris_observations:species",
    ],
).to_df()

Feast left-joins the requested feature columns onto the given entity_df DataFrame, so when no data is available for a row, we get the entity_df values joined with nulls or NaNs.
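If we want only complete rows for training, we can drop the incomplete ones afterwards:

# Rows whose observations fell outside the TTL window contain NaNs.
training_df = training_df.dropna()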

Working with TTLs and the event_timestamp

In the previous example, I specified an event_timestamp one day after the observation_date, which is still within the one-day TTL. Because of that, Feast retrieved 100 observations:

event_timestamp	iris_id	iris_observations__sepal_length	iris_observations__sepal_width	iris_observations__species
0	2021-07-10 10:00:00+00:00	0	5.1	3.5	setosa
1	2021-07-10 10:00:00+00:00	72	6.3	2.5	versicolor
2	2021-07-10 10:00:00+00:00	71	6.1	2.8	versicolor
3	2021-07-10 10:00:00+00:00	70	5.9	3.2	versicolor
4	2021-07-10 10:00:00+00:00	69	5.6	2.5	versicolor

Values in the future

When I specify an event_timestamp “in the future” (more than one TTL after the only available observation), Feast returns NaNs because the observation has already expired:

entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 11, 10, 0, 0)
    }
)
...
event_timestamp	iris_id	iris_observations__sepal_length	iris_observations__sepal_width	iris_observations__species
0	2021-07-11 10:00:00+00:00	0	NaN	NaN	NaN

Values in the past

In our example, the observation date is datetime(2021, 7, 9, 10, 0, 0) and the TTL is one day, so we can request values with an event_timestamp between datetime(2021, 7, 9, 10, 0, 0) and datetime(2021, 7, 10, 10, 0, 0). For any timestamp in this range, we get the freshest available value (the only one we have in the store).

entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 9, 10, 0, 0)
    }
)
...

Returns:

event_timestamp	iris_id	iris_observations__sepal_length	iris_observations__sepal_width	iris_observations__species
0	2021-07-09 10:00:00+00:00	0	5.1	3.5	setosa

Values before the observation

However, when we go back one second further, Feast returns NaNs because the only available observation is no longer in the range between event_timestamp - ttl and event_timestamp: it happened after the requested event_timestamp, and Feast never returns values from the future:

entity_df = pd.DataFrame.from_dict(
    {
        "iris_id": range(0, 100),
        "event_timestamp": datetime(2021, 7, 9, 9, 59, 59)
    }
)
...
event_timestamp	iris_id	iris_observations__sepal_length	iris_observations__sepal_width	iris_observations__species
0	2021-07-09 09:59:59+00:00	0	NaN	NaN	NaN