feast.infra.offline_stores.contrib.spark_offline_store package¶
Subpackages¶
- feast.infra.offline_stores.contrib.spark_offline_store.tests package
- Submodules
- feast.infra.offline_stores.contrib.spark_offline_store.tests.data_source module
SparkDataSourceCreator
SparkDataSourceCreator.create_data_source()
SparkDataSourceCreator.create_offline_store_config()
SparkDataSourceCreator.create_saved_dataset_destination()
SparkDataSourceCreator.get_prefixed_table_name()
SparkDataSourceCreator.spark_offline_store_config
SparkDataSourceCreator.spark_session
SparkDataSourceCreator.tables
SparkDataSourceCreator.teardown()
- Module contents
Submodules¶
feast.infra.offline_stores.contrib.spark_offline_store.spark module¶
- class feast.infra.offline_stores.contrib.spark_offline_store.spark.SparkOfflineStore[source]¶
Bases: OfflineStore
- static get_historical_features(config: RepoConfig, feature_views: List[FeatureView], feature_refs: List[str], entity_df: DataFrame | str, registry: Registry, project: str, full_feature_names: bool = False) RetrievalJob [source]¶
Retrieves the point-in-time correct historical feature values for the specified entity rows.
- Parameters:
config – The config for the current feature store.
feature_views – A list containing all feature views that are referenced in the entity rows.
feature_refs – The features to be retrieved.
entity_df – A collection of rows containing all entity columns on which features need to be joined, as well as the timestamp column used for point-in-time joins. Either a pandas dataframe or a SQL query can be provided.
registry – The registry for the current feature store.
project – Feast project to which the feature views belong.
full_feature_names – If True, feature names will be prefixed with the corresponding feature view name, changing them from the format “feature” to “feature_view__feature” (e.g. “daily_transactions” changes to “customer_fv__daily_transactions”).
- Returns:
A RetrievalJob that can be executed to get the features.
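The point-in-time join semantics described above can be sketched in plain Python. This is a simplified illustration only, not the Spark SQL the store actually generates; the `customer_fv` feature view name and entity rows are hypothetical, chosen to match the full_feature_names example above:

```python
from datetime import datetime

# Hypothetical feature rows as stored in the offline source:
# (entity key, event timestamp, value).
daily_transactions = [
    ("c1", datetime(2024, 1, 1), 10.0),
    ("c1", datetime(2024, 1, 3), 30.0),
    ("c2", datetime(2024, 1, 2), 20.0),
]

def point_in_time_join(entity_rows, feature_rows, feature_name, full_feature_names=False):
    """For each entity row, pick the latest feature value at or before the row's timestamp."""
    out_name = f"customer_fv__{feature_name}" if full_feature_names else feature_name
    results = []
    for entity_id, ts in entity_rows:
        # Only feature rows at or before the entity row's timestamp are eligible.
        candidates = [(f_ts, v) for e, f_ts, v in feature_rows
                      if e == entity_id and f_ts <= ts]
        value = max(candidates)[1] if candidates else None
        results.append({"customer_id": entity_id, "event_timestamp": ts, out_name: value})
    return results

entity_df = [("c1", datetime(2024, 1, 2)), ("c2", datetime(2024, 1, 1))]
rows = point_in_time_join(entity_df, daily_transactions, "daily_transactions",
                          full_feature_names=True)
# c1's row at 2024-01-02 sees the 2024-01-01 value (10.0); c2's row
# predates any of its feature rows, so it gets None.
```

Note how full_feature_names=True changes the output column from "daily_transactions" to "customer_fv__daily_transactions", exactly as described in the parameter list above.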
- static offline_write_batch(config: RepoConfig, feature_view: FeatureView, table: Table, progress: Callable[[int], Any] | None)[source]¶
Writes the specified arrow table to the data source underlying the specified feature view.
- Parameters:
config – The config for the current feature store.
feature_view – The feature view whose batch source should be written.
table – The arrow table to write.
progress – Function to be called once a portion of the data has been written, used to show progress.
- static pull_all_from_table_or_query(config: RepoConfig, data_source: DataSource, join_key_columns: List[str], feature_name_columns: List[str], timestamp_field: str, start_date: datetime, end_date: datetime) RetrievalJob [source]¶
Note that join_key_columns, feature_name_columns, and timestamp_field have all already been mapped to column names of the source table, and those column names are the values passed into this function.
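A hedged sketch of what "already mapped" means here: the data source's field_mapping is applied to rename source columns before the pull_* functions receive them. The column and mapping names below are hypothetical, for illustration only:

```python
def apply_field_mapping(row, field_mapping):
    """Rename source columns according to a data source's field_mapping.

    Columns absent from the mapping keep their original names.
    """
    return {field_mapping.get(col, col): value for col, value in row.items()}

# Hypothetical raw source row and mapping (not from the Feast codebase).
source_row = {"event_ts": "2024-01-01T00:00:00", "txn_amt": 42.0, "customer_id": "c1"}
field_mapping = {"event_ts": "event_timestamp", "txn_amt": "daily_transactions"}

mapped = apply_field_mapping(source_row, field_mapping)
# After mapping, timestamp_field="event_timestamp" and
# feature_name_columns=["daily_transactions"] refer to columns that exist.
```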
- static pull_latest_from_table_or_query(config: RepoConfig, data_source: DataSource, join_key_columns: List[str], feature_name_columns: List[str], timestamp_field: str, created_timestamp_column: str | None, start_date: datetime, end_date: datetime) RetrievalJob [source]¶
Extracts the latest entity rows (i.e. the combination of join key columns, feature columns, and timestamp columns) from the specified data source that lie within the specified time range.
All of the column names should refer to columns that exist in the data source. In particular, any mapping of column names must have already happened.
- Parameters:
config – The config for the current feature store.
data_source – The data source from which the entity rows will be extracted.
join_key_columns – The columns of the join keys.
feature_name_columns – The columns of the features.
timestamp_field – The timestamp column, used to determine which rows are the most recent.
created_timestamp_column – The column indicating when the row was created, used to break ties.
start_date – The start of the time range.
end_date – The end of the time range.
- Returns:
A RetrievalJob that can be executed to get the entity rows.
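The deduplication semantics of pull_latest_from_table_or_query can be sketched in plain Python. This is an illustrative simplification, not the actual Spark implementation; the column names and data are hypothetical:

```python
from datetime import datetime

rows = [
    # (join key, event timestamp, created timestamp, feature value)
    ("c1", datetime(2024, 1, 2), datetime(2024, 1, 2, 1), 10.0),
    ("c1", datetime(2024, 1, 2), datetime(2024, 1, 2, 2), 11.0),  # later creation wins the tie
    ("c1", datetime(2024, 1, 1), datetime(2024, 1, 1, 0), 9.0),
    ("c2", datetime(2024, 1, 5), datetime(2024, 1, 5, 0), 20.0),  # outside the range below
]

def pull_latest(rows, start_date, end_date):
    """Per join key, keep the row with the greatest (timestamp, created_timestamp)
    that falls within [start_date, end_date]."""
    latest = {}
    for key, ts, created, value in rows:
        if not (start_date <= ts <= end_date):
            continue
        best = latest.get(key)
        # Tuple comparison: the event timestamp decides, created_timestamp breaks ties.
        if best is None or (ts, created) > (best[0], best[1]):
            latest[key] = (ts, created, value)
    return {k: v[2] for k, v in latest.items()}

result = pull_latest(rows, datetime(2024, 1, 1), datetime(2024, 1, 3))
# c1 resolves to 11.0 (latest timestamp, tie broken by created_timestamp);
# c2 is dropped because its row lies outside the time range.
```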
- class feast.infra.offline_stores.contrib.spark_offline_store.spark.SparkOfflineStoreConfig(*, type: StrictStr = 'spark', spark_conf: Dict[str, str] | None = None, staging_location: StrictStr | None = None, region: StrictStr | None = None)[source]¶
Bases: FeastConfigBaseModel
- type: StrictStr¶
Offline store type selector
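Assuming the fields shown in the constructor signature above (type, spark_conf, staging_location, region), a feature_store.yaml offline store section might look like the following sketch. The spark_conf keys, bucket, and region values are illustrative assumptions, not required settings:

```yaml
# feature_store.yaml (offline_store section only)
offline_store:
    type: spark
    spark_conf:                          # passed through to the Spark session
        spark.master: "local[*]"
        spark.sql.session.timeZone: "UTC"
    staging_location: s3://my-bucket/staging   # hypothetical bucket
    region: us-west-2                          # hypothetical region
```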
- class feast.infra.offline_stores.contrib.spark_offline_store.spark.SparkRetrievalJob(spark_session: SparkSession, query: str, full_feature_names: bool, config: RepoConfig, on_demand_feature_views: List[OnDemandFeatureView] | None = None, metadata: RetrievalMetadata | None = None)[source]¶
Bases: RetrievalJob
- property full_feature_names: bool¶
Returns True if full feature names should be applied to the results of the query.
- property metadata: RetrievalMetadata | None¶
Return metadata information about retrieval. Should be available even before materializing the dataset itself.
- property on_demand_feature_views: List[OnDemandFeatureView]¶
Returns a list containing all the on demand feature views to be handled.
- persist(storage: SavedDatasetStorage, allow_overwrite: bool = False)[source]¶
Run the retrieval and persist the results in the same offline store used for the read. Note that the persisted results are only available within the scope of the Spark session.
- supports_remote_storage_export() bool [source]¶
Returns True if the RetrievalJob supports to_remote_storage.
- feast.infra.offline_stores.contrib.spark_offline_store.spark.get_spark_session_or_start_new_with_repoconfig(store_config: SparkOfflineStoreConfig) SparkSession [source]¶
feast.infra.offline_stores.contrib.spark_offline_store.spark_source module¶
- class feast.infra.offline_stores.contrib.spark_offline_store.spark_source.SavedDatasetSparkStorage(table: str | None = None, query: str | None = None, path: str | None = None, file_format: str | None = None)[source]¶
Bases: SavedDatasetStorage
- static from_proto(storage_proto: SavedDatasetStorage) SavedDatasetStorage [source]¶
- spark_options: SparkOptions¶
- to_data_source() DataSource [source]¶
- class feast.infra.offline_stores.contrib.spark_offline_store.spark_source.SparkOptions(table: str | None, query: str | None, path: str | None, file_format: str | None)[source]¶
Bases: object
- allowed_formats = ['csv', 'json', 'parquet', 'delta', 'avro']¶
- property file_format¶
- classmethod from_proto(spark_options_proto: SparkOptions)[source]¶
Creates a SparkOptions object from its protobuf representation.
- Parameters:
spark_options_proto – A protobuf representation of a data source.
- Returns:
Returns a SparkOptions object based on the spark_options protobuf
- property path¶
- property query¶
- property table¶
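The allowed_formats constraint above can be illustrated with a minimal, hypothetical validator. This helper is not part of the Feast API; it only mirrors the constraint that file_format, when set, must be one of the listed formats:

```python
# Mirrors SparkOptions.allowed_formats as documented above.
ALLOWED_FORMATS = ["csv", "json", "parquet", "delta", "avro"]

def check_file_format(file_format):
    """Hypothetical check: accept None or one of the allowed Spark file formats."""
    if file_format is not None and file_format.lower() not in ALLOWED_FORMATS:
        raise ValueError(
            f"file_format must be one of {ALLOWED_FORMATS}, got {file_format!r}"
        )
    return file_format

check_file_format("parquet")   # accepted
# check_file_format("orc")     # would raise ValueError
```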
- class feast.infra.offline_stores.contrib.spark_offline_store.spark_source.SparkSource(*, name: str | None = None, table: str | None = None, query: str | None = None, path: str | None = None, file_format: str | None = None, event_timestamp_column: str | None = None, created_timestamp_column: str | None = None, field_mapping: Dict[str, str] | None = None, description: str | None = '', tags: Dict[str, str] | None = None, owner: str | None = '', timestamp_field: str | None = None)[source]¶
Bases: DataSource
- property file_format¶
Returns the file format of this feature data source.
- static from_proto(data_source: DataSource) Any [source]¶
Converts data source config in protobuf spec to a DataSource class object.
- Parameters:
data_source – A protobuf representation of a DataSource.
- Returns:
A DataSource class object.
- Raises:
ValueError – The type of DataSource could not be identified.
- get_table_column_names_and_types(config: RepoConfig) Iterable[Tuple[str, str]] [source]¶
Returns the list of column names and raw column types.
- Parameters:
config – Configuration object used to configure a feature store.
- get_table_query_string() str [source]¶
Returns a string that can directly be used to reference this table in SQL.
- property path¶
Returns the path of the spark data source file.
- property query¶
Returns the query of this feature data source.
- static source_datatype_to_feast_value_type() Callable[[str], ValueType] [source]¶
Returns the callable method that returns Feast type given the raw column type.
- property table¶
Returns the table of this feature data source.
- validate(config: RepoConfig)[source]¶
Validates the underlying data source.
- Parameters:
config – Configuration object used to configure a feature store.