Datasource Module¶
-
class
great_expectations.datasource.
Datasource
(name, data_context=None, data_asset_type=None, generators=None, **kwargs)¶ Datasources are responsible for connecting data and compute infrastructure. Each Datasource provides Great Expectations DataAssets (or batches in a DataContext) connected to a specific compute environment, such as a SQL database, a Spark cluster, or a local in-memory Pandas DataFrame. Datasources know how to access data from relevant sources such as an existing object from a DAG runner, a SQL database, S3 bucket, or local filesystem.
To bridge the gap between those worlds, Datasources interact closely with generators, which are aware of a source of data and can produce identifying information, called “batch_kwargs”, that datasources can use to get individual batches of data. Generators add flexibility in how data is obtained, for example through time-based partitioning, downsampling, or other techniques appropriate for the datasource.
For example, a generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
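As a minimal sketch of that idea, the stand-alone Python below (it does not import Great Expectations) shows a hypothetical helper, `daily_batch_kwargs`, that yields one batch_kwargs dictionary per day. The helper name, the `events` table, and the `timestamp` column are illustrative assumptions; only the `query` key is taken from the SqlAlchemyDatasource rules described later on this page.

```python
from datetime import date, timedelta


def daily_batch_kwargs(table, column, start, days):
    """Yield one batch_kwargs dict per day (hypothetical helper, not a library API)."""
    for offset in range(days):
        day = start + timedelta(days=offset)
        next_day = day + timedelta(days=1)
        # The half-open interval [day, next_day) selects exactly one calendar day.
        query = (
            f"SELECT * FROM {table} "
            f"WHERE {column} >= '{day.isoformat()}' "
            f"AND {column} < '{next_day.isoformat()}'"
        )
        yield {"query": query}


batches = list(daily_batch_kwargs("events", "timestamp", date(2012, 2, 7), days=2))
print(batches[0]["query"])
```

Each yielded dictionary describes one logical batch; a real generator would hand such dictionaries to the datasource rather than print them.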
Opinionated DAG managers such as airflow, dbt, prefect.io, and dagster can also act as datasources and/or generators for a more generic datasource.
When adding custom expectations by subclassing an existing DataAsset type, use the data_asset_type parameter to configure the datasource to load and return DataAssets of the custom type.
-
classmethod
from_configuration
(**kwargs)¶ Build a new datasource from a configuration dictionary.
- Parameters
**kwargs – configuration key-value pairs
- Returns
the newly-created datasource
- Return type
datasource (Datasource)
-
classmethod
build_configuration
(class_name, module_name='great_expectations.datasource', data_asset_type=None, generators=None, **kwargs)¶ Build a full configuration object for a datasource, potentially including generators with defaults.
- Parameters
class_name – The name of the class for which to build the config
module_name – The name of the module in which the datasource class is located
data_asset_type – A ClassConfig dictionary
generators – Generator configuration dictionary
**kwargs – Additional kwargs to be part of the datasource constructor’s initialization
- Returns
A complete datasource configuration.
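The shape of such a configuration can be pictured with a small stand-in. This is only a sketch under assumptions: `sketch_configuration` is a hypothetical function, the key names simply mirror the parameter names in the signature above, and the real method may lay out its result differently and fill in generator defaults that are omitted here.

```python
def sketch_configuration(class_name, module_name="great_expectations.datasource",
                         data_asset_type=None, generators=None, **kwargs):
    """Assemble a plain configuration dict (illustrative stand-in, not the library's code)."""
    config = dict(kwargs)  # extra constructor kwargs are carried along unchanged
    config.update({
        "class_name": class_name,
        "module_name": module_name,
        "data_asset_type": data_asset_type,
        # An empty dict stands in for whatever default generators the library adds.
        "generators": generators if generators is not None else {},
    })
    return config


config = sketch_configuration(
    "PandasDatasource",
    boto3_options={"endpoint_url": "http://localhost:9000"},
)
print(sorted(config))
```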
-
property
data_context
¶ Property for attached DataContext
-
property
name
¶ Property for datasource name
-
get_config
()¶ Get the current configuration.
- Returns
datasource configuration dictionary
-
build_generator
(**kwargs)¶ Build a generator using the provided configuration and return the newly-built generator.
-
add_generator
(name, generator_config, **kwargs)¶ Add a generator to the datasource.
- Parameters
name (str) – the name of the new generator to add
generator_config – the configuration parameters to add to the datasource
kwargs – additional keyword arguments will be passed directly to the new generator’s constructor
- Returns
generator (Generator)
-
get_generator
(generator_name='default')¶ Get the (named) generator from a datasource.
- Parameters
generator_name (str) – name of generator (default value is ‘default’)
- Returns
generator (Generator)
-
list_generators
()¶ List currently-configured generators for this datasource.
- Returns
each dictionary includes “name” and “type” keys
- Return type
List(dict)
-
get_batch
(data_asset_name, expectation_suite_name, batch_kwargs, **kwargs)¶ Get a batch of data from the datasource.
If a DataContext is attached, then expectation_suite_name can be used to define an expectation suite to attach to the data_asset being fetched. Otherwise, the expectation suite will be empty.
If no batch_kwargs are specified, the next kwargs for the named data_asset will be fetched from the generator first.
Specific datasource types implement the internal _get_data_asset method to use appropriate batch_kwargs to construct and return GE data_asset objects.
- Parameters
data_asset_name – the name of the data asset for which to fetch data.
expectation_suite_name – the name of the expectation suite to attach to the batch
batch_kwargs – dictionary of key-value pairs describing the batch to get, or a single identifier if that can be unambiguously translated to batch_kwargs
**kwargs – Additional key-value pairs to pass to the datasource, such as reader parameters
- Returns
A data_asset consisting of the specified batch of data with the named expectation suite connected.
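The fallback described above (use explicit batch_kwargs when given, otherwise take the next set from the generator) can be sketched in plain Python. The function `resolve_batch_kwargs`, the iterator standing in for a generator, and the `path` key are all hypothetical and chosen only for illustration; this is not the datasource's actual implementation.

```python
def resolve_batch_kwargs(batch_kwargs, kwargs_iterator):
    """Return explicit batch_kwargs, else the next set offered by the generator stand-in."""
    if batch_kwargs is not None:
        return batch_kwargs
    return next(kwargs_iterator)


# A plain iterator plays the role of a generator that yields batch_kwargs in order.
pending = iter([{"path": "data/2012-02-07.csv"}, {"path": "data/2012-02-08.csv"}])

explicit = resolve_batch_kwargs({"path": "data/manual.csv"}, pending)
first = resolve_batch_kwargs(None, pending)
second = resolve_batch_kwargs(None, pending)
print(explicit, first, second)
```

Note that passing explicit batch_kwargs leaves the stand-in iterator untouched, so the next implicit request still receives the first pending batch.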
-
get_data_asset
(generator_asset, generator_name=None, expectation_suite=None, batch_kwargs=None, **kwargs)¶ Get a DataAsset using a datasource. generator_asset and generator_name are required.
- Parameters
generator_asset – The name of the asset as identified by the generator to return.
generator_name – The name of the configured generator to use.
expectation_suite – The expectation suite to attach to the data_asset
batch_kwargs – Additional batch_kwargs that can
**kwargs – Additional kwargs that can be used to supplement batch_kwargs
- Returns
DataAsset
-
get_available_data_asset_names
(generator_names=None)¶ Returns a dictionary of data_asset_names that the specified generators can provide. Note that some generators, such as the “no-op” in-memory generator, may not be capable of describing specific named data assets, and some generators (such as filesystem glob generators) require the user to configure data asset names.
- Parameters
generator_names – the generators for which to fetch available data asset names.
- Returns
{ generator_name: [ data_asset_1, data_asset_2, ... ] ... }
- Return type
dictionary consisting of sets of generator assets available for the specified generators
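To make the return shape concrete, here is a small stand-alone sketch. The function `sketch_available_names` and the generator and asset names are hypothetical. The description above mentions both lists and sets, so sorted lists are used here purely for readable output.

```python
def sketch_available_names(generators, generator_names=None):
    """Map each requested generator name to the asset names it reports (illustrative only)."""
    if generator_names is None:
        generator_names = list(generators)  # default: ask every configured generator
    return {name: sorted(generators[name]) for name in generator_names}


# Hypothetical generators: one that knows its assets, one that cannot describe any.
known = {"default": {"events", "users"}, "in_memory": set()}

print(sketch_available_names(known))
print(sketch_available_names(known, generator_names=["default"]))
```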
-
build_batch_kwargs
(data_asset_name, *args, **kwargs)¶ Build batch kwargs for a requested data_asset. Try to use a generator where possible to support partitioning, but fall back to datasource-default behavior if the generator cannot be identified.
- Parameters
data_asset_name – the data asset for which to build batch_kwargs; if a normalized name is provided, use the named generator.
*args – at most one positional argument can be provided from which to build kwargs
**kwargs – additional keyword arguments to be used to build the batch_kwargs
- Returns
A PandasDatasourceBatchKwargs object suitable for building a batch of data from this datasource
-
named_generator_build_batch_kwargs
(generator_name, generator_asset, partition_id=None, **kwargs)¶ Use the named generator to build batch_kwargs
-
get_data_context
()¶ Getter for the currently-configured data context.
PandasDatasource¶
-
class
great_expectations.datasource.pandas_datasource.
PandasDatasource
(name='pandas', data_context=None, data_asset_type=None, generators=None, boto3_options=None, **kwargs)¶ Bases:
great_expectations.datasource.datasource.Datasource
The PandasDatasource produces PandasDataset objects and supports generators capable of interacting with the local filesystem (the default subdir_reader generator) and with existing in-memory dataframes.
-
classmethod
build_configuration
(data_asset_type=None, generators=None, boto3_options=None, **kwargs)¶ Build a full configuration object for a datasource, potentially including generators with defaults.
- Parameters
data_asset_type – A ClassConfig dictionary
generators – Generator configuration dictionary
boto3_options – Optional dictionary with key-value pairs to pass to boto3 during instantiation.
**kwargs – Additional kwargs to be part of the datasource constructor’s initialization
- Returns
A complete datasource configuration.
SqlAlchemyDatasource¶
-
class
great_expectations.datasource.sqlalchemy_datasource.
SqlAlchemyDatasource
(name='default', data_context=None, data_asset_type=None, credentials=None, generators=None, **kwargs)¶ Bases:
great_expectations.datasource.datasource.Datasource
- A SqlAlchemyDatasource provides data_assets by converting batch_kwargs according to the following rules:
if the batch_kwargs include a table key, the datasource will provide a dataset object connected to that table
if the batch_kwargs include a query key, the datasource will create a temporary table using that query. The query can be parameterized according to the standard Python Template engine (string.Template), which uses $parameter placeholders, with the values supplied as additional kwargs passed to the get_batch method.
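The `$parameter` substitution mentioned in the second rule is the behavior of `string.Template` from the Python standard library. The sketch below shows only that substitution step, with an illustrative query and parameter names; whether the datasource calls `substitute` or `safe_substitute` internally is not stated here, so both are shown.

```python
from string import Template

# Illustrative query text; the table, column, and parameter names are assumptions.
query_template = "SELECT * FROM events WHERE event_date = '$day' LIMIT $limit"

# substitute() requires every placeholder to be supplied and raises KeyError otherwise.
query = Template(query_template).substitute(day="2012-02-07", limit=100)
print(query)

# safe_substitute() leaves unknown placeholders in place instead of raising.
partial = Template(query_template).safe_substitute(day="2012-02-07")
print(partial)
```

Because the substitution is purely textual, values should come from trusted sources; it offers no protection against SQL injection.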
-
classmethod
build_configuration
(data_asset_type=None, generators=None, **kwargs)¶ Build a full configuration object for a datasource, potentially including generators with defaults.
- Parameters
data_asset_type – A ClassConfig dictionary
generators – Generator configuration dictionary
**kwargs – Additional kwargs to be part of the datasource constructor’s initialization
- Returns
A complete datasource configuration.
SparkDFDatasource¶
-
class
great_expectations.datasource.sparkdf_datasource.
SparkDFDatasource
(name='default', data_context=None, data_asset_type=None, generators=None, spark_config=None, **kwargs)¶ Bases:
great_expectations.datasource.datasource.Datasource
The SparkDFDatasource produces SparkDFDatasets and supports generators capable of interacting with the local filesystem (the default subdir_reader generator) and Databricks notebooks.
-
classmethod
build_configuration
(data_asset_type=None, generators=None, spark_config=None, **kwargs)¶ Build a full configuration object for a datasource, potentially including generators with defaults.
- Parameters
data_asset_type – A ClassConfig dictionary
generators – Generator configuration dictionary
spark_config – dictionary of key-value pairs to pass to the spark builder
**kwargs – Additional kwargs to be part of the datasource constructor’s initialization
- Returns
A complete datasource configuration.