Dataset Module

great_expectations.dataset.base

great_expectations.dataset.pandas_dataset

great_expectations.dataset.sqlalchemy_dataset

class great_expectations.dataset.sqlalchemy_dataset.MetaSqlAlchemyDataset(*args, **kwargs)

Bases: great_expectations.dataset.base.Dataset

classmethod column_map_expectation(func)

For SqlAlchemy, this decorator allows individual column_map_expectations to simply return the filter that describes the expected condition on their data.

The decorator will then use that filter to obtain unexpected elements, relevant counts, and return the formatted object.
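The flow can be illustrated with a simplified, pure-Python stand-in (a sketch only: the real decorator builds a SQL query from the returned SQLAlchemy filter clause; here a predicate over an in-memory list plays that role, and the result fields mirror the library's result object):

```python
# Simplified stand-in for column_map_expectation: the decorated function
# returns a predicate (standing in for a SQLAlchemy filter), and the
# decorator applies it to find unexpected elements and relevant counts.
def column_map_expectation(func):
    def wrapper(column_values):
        predicate = func()  # the "filter" describing the expected condition
        unexpected = [v for v in column_values
                      if v is not None and not predicate(v)]
        element_count = sum(1 for v in column_values if v is not None)
        return {
            "success": len(unexpected) == 0,
            "result": {
                "element_count": element_count,
                "unexpected_count": len(unexpected),
                "unexpected_list": unexpected,
            },
        }
    return wrapper

@column_map_expectation
def expect_values_to_be_positive():
    # The expectation only states the condition; the decorator does the rest.
    return lambda v: v > 0

result = expect_values_to_be_positive([1, 2, -3, None])
```

Note that nulls are excluded from the map check, matching the column-map convention of evaluating only non-null values.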

classmethod column_aggregate_expectation(func)

Constructs an expectation using column-aggregate semantics.

class great_expectations.dataset.sqlalchemy_dataset.SqlAlchemyDataset(table_name=None, engine=None, connection_string=None)

Bases: great_expectations.dataset.sqlalchemy_dataset.MetaSqlAlchemyDataset

add_default_expectations()

The default behavior for SqlAlchemyDataset is to explicitly add an expectation that each column present at initialization exists.
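A minimal sketch of that behavior (the real method discovers the column list from the table via SQLAlchemy; here it is supplied directly, and the config dict shape is an assumption for illustration):

```python
# Sketch of add_default_expectations: for every column found at
# initialization, register an expect_column_to_exist expectation.
def default_expectations(columns):
    return [
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": col}}
        for col in columns
    ]

configs = default_expectations(["id", "name", "created_at"])
```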

expect_table_columns_to_match_ordered_list(column_list, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the columns to exactly match a specified list.

expect_table_columns_to_match_ordered_list is an expectation, not a column_map_expectation or column_aggregate_expectation.

Args:
column_list (list of str): The column names, in the correct order.
Other Parameters:
result_format (str or None): Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY.
For more detail, see result_format.

include_config (boolean): If True, then include the expectation config as part of the result object. For more detail, see include_config.
catch_exceptions (boolean or None): If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.
meta (dict or None): A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.
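The comparison semantics can be sketched in a few lines (an illustrative stand-in, not the library implementation; the mismatch-detail keys are assumptions):

```python
# Sketch of the table-level check: the observed column order must match the
# expected list exactly; mismatched positions are reported in the details.
def expect_table_columns_to_match_ordered_list(observed_columns, column_list):
    mismatches = [
        {"position": i, "expected": exp, "found": obs}
        for i, (exp, obs) in enumerate(zip(column_list, observed_columns))
        if exp != obs
    ]
    return {
        "success": observed_columns == column_list,
        "result": {
            "observed_value": observed_columns,
            "details": {"mismatched": mismatches},
        },
    }

result = expect_table_columns_to_match_ordered_list(["a", "c", "b"],
                                                    ["a", "b", "c"])
```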

expect_column_max_to_be_between(column, min_value=None, max_value=None, parse_strings_as_datetimes=None, output_strftime_format=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column max to be between a min and max value.

expect_column_max_to_be_between is a column_aggregate_expectation.

Args:
column (str): The column name.
min_value (comparable type or None): The minimum value allowed for the column max.
max_value (comparable type or None): The maximum value allowed for the column max.
Keyword Args:
parse_strings_as_datetimes (Boolean or None): If True, parse min_value, max_value, and all non-null column values to datetimes before making comparisons.
output_strftime_format (str or None): A valid strftime format for datetime output. Only used if parse_strings_as_datetimes=True.
Other Parameters:
result_format (str or None): Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY.
For more detail, see result_format.

include_config (boolean): If True, then include the expectation config as part of the result object. For more detail, see include_config.
catch_exceptions (boolean or None): If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.
meta (dict or None): A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes:

These fields in the result object are customized for this expectation:

{
    "observed_value": (single value) The actual column max
}
  • min_value and max_value are both inclusive.
  • If min_value is None, then max_value is treated as an upper bound
  • If max_value is None, then min_value is treated as a lower bound
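The three bullet points above amount to the following check, shared by the *_to_be_between aggregate expectations (a sketch; `between` is not a library function):

```python
# Sketch of the inclusive-bounds semantics: a missing bound simply drops
# that side of the test, so a single bound acts as a one-sided limit.
def between(observed, min_value=None, max_value=None):
    above_min = min_value is None or observed >= min_value  # inclusive lower bound
    below_max = max_value is None or observed <= max_value  # inclusive upper bound
    return above_min and below_max
```

For example, `between(10, min_value=10)` succeeds because both bounds are inclusive, while `between(10, max_value=5)` fails.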
expect_column_min_to_be_between(column, min_value=None, max_value=None, parse_strings_as_datetimes=None, output_strftime_format=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column min to be between a min and max value.

expect_column_min_to_be_between is a column_aggregate_expectation.

Args:
column (str): The column name.
min_value (comparable type or None): The minimum value allowed for the column min.
max_value (comparable type or None): The maximum value allowed for the column min.
Keyword Args:
parse_strings_as_datetimes (Boolean or None): If True, parse min_value, max_value, and all non-null column values to datetimes before making comparisons.
output_strftime_format (str or None): A valid strftime format for datetime output. Only used if parse_strings_as_datetimes=True.
Other Parameters:
result_format (str or None): Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY.
For more detail, see result_format.

include_config (boolean): If True, then include the expectation config as part of the result object. For more detail, see include_config.
catch_exceptions (boolean or None): If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.
meta (dict or None): A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes:

These fields in the result object are customized for this expectation:

{
    "observed_value": (single value) The actual column min
}
  • min_value and max_value are both inclusive.
  • If min_value is None, then max_value is treated as an upper bound
  • If max_value is None, then min_value is treated as a lower bound
expect_column_sum_to_be_between(column, min_value=None, max_value=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column sum to be between a min and max value.

expect_column_sum_to_be_between is a column_aggregate_expectation.

Args:
column (str): The column name.
min_value (comparable type or None): The minimum value allowed for the column sum.
max_value (comparable type or None): The maximum value allowed for the column sum.
Other Parameters:
result_format (str or None): Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY.
For more detail, see result_format.

include_config (boolean): If True, then include the expectation config as part of the result object. For more detail, see include_config.
catch_exceptions (boolean or None): If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.
meta (dict or None): A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes:

These fields in the result object are customized for this expectation:

{
    "observed_value": (float) The actual column sum
}
  • min_value and max_value are both inclusive.
  • If min_value is None, then max_value is treated as an upper bound
  • If max_value is None, then min_value is treated as a lower bound

great_expectations.dataset.util

great_expectations.dataset.util.parse_result_format(result_format)

This is a simple helper utility that can be used to parse a string result_format into the dict format used internally by great_expectations. It is not necessary but allows shorthand for result_format in cases where there is no need to specify a custom partial_unexpected_count.
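The expansion can be sketched as follows (a stand-in, not the library source; the default partial_unexpected_count of 20 is an assumption here):

```python
# Sketch of parse_result_format: a bare string is expanded into the dict
# form used internally; a dict passes through with a default
# partial_unexpected_count filled in if one was not given.
def parse_result_format(result_format):
    if isinstance(result_format, str):
        return {"result_format": result_format,
                "partial_unexpected_count": 20}
    result_format.setdefault("partial_unexpected_count", 20)
    return result_format

short = parse_result_format("SUMMARY")
custom = parse_result_format({"result_format": "COMPLETE",
                              "partial_unexpected_count": 5})
```

The string form is the shorthand; the dict form is the place to set a custom partial_unexpected_count.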

class great_expectations.dataset.util.DotDict

Bases: dict

dot.notation access to dictionary attributes
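A minimal sketch of the idea (not the library source, but the standard dict-subclass pattern the docstring describes):

```python
# Minimal DotDict sketch: a dict whose keys are also readable and writable
# as attributes. Missing attributes fall through to dict.get, so they
# return None rather than raising AttributeError.
class DotDict(dict):
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

d = DotDict({"result_format": "BASIC"})
d.partial_unexpected_count = 20  # attribute write becomes a key write
```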

class great_expectations.dataset.util.DocInherit(mthd)

Bases: object

great_expectations.dataset.util.recursively_convert_to_json_serializable(test_obj)

Helper function to convert an object to one that is JSON serializable.

Parameters:test_obj – an object to attempt to convert to a corresponding JSON-serializable object
Returns:(dict) A converted test_obj

Warning

test_obj may also be converted in place.
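The recursive walk can be sketched with stdlib types only (an illustration of the technique, not the library source, which also handles NumPy scalars and arrays):

```python
import datetime
import decimal

# Sketch of the recursive conversion: containers are walked, and common
# non-serializable leaf types are mapped to JSON-friendly equivalents.
def to_json_serializable(obj):
    if isinstance(obj, dict):
        return {str(k): to_json_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [to_json_serializable(v) for v in obj]
    if isinstance(obj, (datetime.date, datetime.datetime)):
        return obj.isoformat()
    if isinstance(obj, decimal.Decimal):
        return float(obj)
    return obj  # assume already serializable (str, int, float, bool, None)

converted = to_json_serializable({"when": datetime.date(2018, 1, 1),
                                  "vals": (1, 2)})
```

Unlike this sketch, the library version may also mutate nested structures in place, which is why the warning above applies.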

great_expectations.dataset.util.is_valid_partition_object(partition_object)

Tests whether a given object is a valid continuous or categorical partition object.

Parameters:partition_object – The partition_object to evaluate
Returns:Boolean

great_expectations.dataset.util.is_valid_categorical_partition_object(partition_object)

Tests whether a given object is a valid categorical partition object.

Parameters:partition_object – The partition_object to evaluate
Returns:Boolean

great_expectations.dataset.util.is_valid_continuous_partition_object(partition_object)

Tests whether a given object is a valid continuous partition object.

Parameters:partition_object – The partition_object to evaluate
Returns:Boolean
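The shape of the checks can be sketched as follows (a stand-in under the assumption that the key names match the partition objects shown in this module: "partition"/"weights" for categorical, "bins"/"weights" for continuous; tolerances are illustrative):

```python
# Sketch of the validity checks: a categorical partition pairs values with
# weights that sum to 1; a continuous partition needs one more bin edge
# than it has weights, with strictly increasing edges.
def is_valid_categorical_partition_object(p):
    return (
        isinstance(p, dict)
        and {"partition", "weights"} <= set(p)
        and len(p["partition"]) == len(p["weights"])
        and abs(sum(p["weights"]) - 1.0) < 1e-6
    )

def is_valid_continuous_partition_object(p):
    return (
        isinstance(p, dict)
        and {"bins", "weights"} <= set(p)
        and len(p["bins"]) == len(p["weights"]) + 1
        and all(a < b for a, b in zip(p["bins"], p["bins"][1:]))
        and abs(sum(p["weights"]) - 1.0) < 1e-6
    )
```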

great_expectations.dataset.util.categorical_partition_data(data)

Convenience method for creating weights from categorical data.

Parameters:data (list-like) – The data from which to construct the estimate.
Returns:A new partition object:
{
    "partition": (list) The categorical values present in the data
    "weights": (list) The weights of the values in the partition.
}
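The construction is a straightforward tally-and-normalize, sketched here (an illustration, not the library source; sorting the values is an assumption for determinism):

```python
from collections import Counter

# Sketch of categorical_partition_data: tally each distinct value and
# normalize the counts into weights that sum to 1.
def categorical_partition_data(data):
    counts = Counter(data)
    total = sum(counts.values())
    values = sorted(counts)
    return {
        "partition": values,
        "weights": [counts[v] / total for v in values],
    }

partition = categorical_partition_data(["a", "b", "a", "a"])
```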
great_expectations.dataset.util.kde_partition_data(data, estimate_tails=True)

Convenience method for building a partition and weights using a gaussian Kernel Density Estimate and default bandwidth.

Parameters:
  • data (list-like) – The data from which to construct the estimate
  • estimate_tails (bool) – Whether to estimate the tails of the distribution to keep the partition object finite
Returns:

A new partition_object:

{
    "partition": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}

great_expectations.dataset.util.partition_data(data, bins='auto', n_bins=10)
great_expectations.dataset.util.continuous_partition_data(data, bins='auto', n_bins=10)

Convenience method for building a partition object on continuous data

Parameters:
  • data (list-like) – The data from which to construct the estimate.
  • bins (string) – One of ‘uniform’ (for uniformly spaced bins), ‘ntile’ (for percentile-spaced bins), or ‘auto’ (for automatically spaced bins)
  • n_bins (int) – Ignored if bins is auto.
Returns:

A new partition_object:

{
    "bins": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}
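The binning strategies can be sketched with NumPy (an illustrative stand-in, not the library source; here the weights are bin fractions that sum to 1):

```python
import numpy as np

# Sketch of continuous_partition_data: 'auto' delegates bin selection to
# numpy, while 'uniform' and 'ntile' use n_bins explicitly.
def continuous_partition_data(data, bins="auto", n_bins=10):
    if bins == "auto":
        edges = np.histogram_bin_edges(data, bins="auto")
    elif bins == "uniform":
        edges = np.linspace(min(data), max(data), n_bins + 1)
    elif bins == "ntile":
        edges = np.percentile(data, np.linspace(0, 100, n_bins + 1))
    else:
        raise ValueError("bins must be 'auto', 'uniform', or 'ntile'")
    hist, _ = np.histogram(data, bins=edges)
    return {"bins": edges.tolist(),
            "weights": (hist / hist.sum()).tolist()}

partition = continuous_partition_data(range(100), bins="uniform", n_bins=4)
```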

great_expectations.dataset.util.infer_distribution_parameters(data, distribution, params=None)

Convenience method for determining the shape parameters of a given distribution

Parameters:
  • data (list-like) – The data to build shape parameters from.
  • distribution (string) – Scipy distribution, determines which parameters to build.
  • params (dict or None) – The known parameters. Parameters given here will not be altered. Keep as None to infer all necessary parameters from the data.
Returns:

A dictionary of named parameters:

{
    "mean": (float),
    "std_dev": (float),
    "loc": (float),
    "scale": (float),
    "alpha": (float),
    "beta": (float),
    "min": (float),
    "max": (float),
    "df": (float)
}

See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
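The inference for the 'norm' case can be sketched with the stdlib (a sketch only: just one distribution branch, and mirroring the estimates into loc/scale is an assumption based on scipy's cdf argument scheme):

```python
import statistics

# Sketch of infer_distribution_parameters for 'norm': estimate mean and
# standard deviation from the data, keeping any pre-supplied params.
def infer_distribution_parameters(data, distribution, params=None):
    params = dict(params or {})
    if distribution == "norm":
        params.setdefault("mean", statistics.mean(data))
        params.setdefault("std_dev", statistics.pstdev(data))
    else:
        raise ValueError("only 'norm' is sketched here")
    # scipy's cdf takes loc/scale, so mirror the estimates there too
    params.setdefault("loc", params["mean"])
    params.setdefault("scale", params["std_dev"])
    return params

inferred = infer_distribution_parameters([2, 4, 4, 4, 5, 5, 7, 9], "norm")
```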

great_expectations.dataset.util.validate_distribution_parameters(distribution, params)

Ensures that necessary parameters for a distribution are present and that all parameters are sensible.

If parameters necessary to construct a distribution are missing or invalid, this function raises ValueError with an informative description. Note that ‘loc’ and ‘scale’ are optional arguments, and that ‘scale’ must be positive.

Parameters:
  • distribution (string) – The scipy distribution name, e.g. normal distribution is ‘norm’.
  • params (dict or list) –

    The distribution shape parameters in a named dictionary or positional list form following the scipy cdf argument scheme.

    params={'mean': 40, 'std_dev': 5} or params=[40, 5]

Exceptions:
ValueError: With an informative description, usually when necessary parameters are omitted or are invalid.
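The validation contract can be sketched for the 'norm' case (an illustration of the described behavior, not the library source; only the checks named above are shown):

```python
# Sketch of validate_distribution_parameters for 'norm': required shape
# parameters must be present and 'scale', if given, must be positive;
# failures raise ValueError with an informative description.
def validate_distribution_parameters(distribution, params):
    if isinstance(params, dict):
        if distribution == "norm" and not {"mean", "std_dev"} <= set(params):
            raise ValueError("norm requires 'mean' and 'std_dev'")
        if "scale" in params and params["scale"] <= 0:
            raise ValueError("'scale' must be positive")
    elif not isinstance(params, (list, tuple)):
        raise ValueError("params must be a dict or a positional list")

validate_distribution_parameters("norm", {"mean": 40, "std_dev": 5})  # passes silently
```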