Dataset Module

great_expectations.dataset.dataset

class great_expectations.dataset.dataset.MetaDataset(*args, **kwargs)

Bases: great_expectations.data_asset.data_asset.DataAsset

Holds expectation decorators.

classmethod column_map_expectation(func)

Constructs an expectation using column-map semantics.

The column_map_expectation decorator handles boilerplate issues surrounding the common pattern of evaluating truthiness of some condition on a per-row basis.

Parameters

func (function) – The function implementing a row-wise expectation. The function should take a column of data and return an equally-long column of boolean values corresponding to the truthiness of the underlying expectation.

Notes

column_map_expectation intercepts and takes action based on the following parameters: mostly (None or a float between 0 and 1): Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

column_map_expectation excludes null values from being passed to the function

Depending on the result_format selected, column_map_expectation can add additional data to a return object, including element_count, nonnull_values, nonnull_count, success_count, unexpected_list, and unexpected_index_list. See _format_map_output.

See also

expect_column_values_to_be_in_set for an example of a column_map_expectation
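The column-map pattern described above can be sketched in plain Python. This is an illustration only, not the library's implementation: column_map_sketch, is_null and values_in_set are hypothetical names, nulls are assumed to be None or NaN, mostly is treated as a fraction of non-null values, and an all-null column is assumed to succeed.

```python
import math
from functools import wraps


def is_null(value):
    """Treat None and float NaN as null (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def column_map_sketch(func):
    """Hypothetical stand-in for the decorator; not the library's code."""
    @wraps(func)
    def wrapper(column, *args, mostly=None, **kwargs):
        # Nulls are dropped before the row-wise function is called.
        nonnull = [v for v in column if not is_null(v)]
        # func returns one boolean per non-null value.
        flags = func(nonnull, *args, **kwargs)
        unexpected = [v for v, ok in zip(nonnull, flags) if not ok]
        success_count = len(nonnull) - len(unexpected)
        if not nonnull:
            success = True  # assumption: nothing to check means success
        elif mostly is None:
            success = not unexpected
        else:
            success = success_count / len(nonnull) >= mostly
        return {
            "success": success,
            "element_count": len(column),
            "nonnull_count": len(nonnull),
            "success_count": success_count,
            "unexpected_list": unexpected,
        }
    return wrapper


@column_map_sketch
def values_in_set(values, value_set):
    return [v in value_set for v in values]


result = values_in_set([1, 2, 2, 3, None], {2, 3}, mostly=0.75)
print(result["success"], result["unexpected_list"])  # True [1]
```

Three of the four non-null values pass, so the call succeeds at mostly=0.75 and would fail with mostly left as None.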

classmethod column_aggregate_expectation(func)

Constructs an expectation using column-aggregate semantics.

The column_aggregate_expectation decorator handles boilerplate issues surrounding the common pattern of evaluating truthiness of some condition on an aggregated-column basis.

Parameters

func (function) – The function implementing an expectation using an aggregate property of a column. The function should take a column of data and return the aggregate value it computes.

Notes

column_aggregate_expectation excludes null values from being passed to the function

See also

expect_column_mean_to_be_between for an example of a column_aggregate_expectation
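A comparable sketch for the column-aggregate pattern follows. It is not the library's code: column_aggregate_sketch and column_mean are hypothetical names, and comparing the aggregate against min_value and max_value inside the wrapper is an assumption made to keep the example short.

```python
import math
from functools import wraps
from statistics import mean


def column_aggregate_sketch(func):
    """Hypothetical stand-in for the decorator; not the library's code."""
    @wraps(func)
    def wrapper(column, min_value=None, max_value=None):
        # Null values (None or NaN here) never reach the wrapped function.
        nonnull = [
            v for v in column
            if v is not None and not (isinstance(v, float) and math.isnan(v))
        ]
        observed = func(nonnull)
        # Assumption: the wrapper compares the aggregate to optional bounds.
        success = (min_value is None or observed >= min_value) and (
            max_value is None or observed <= max_value
        )
        return {"success": success, "result": {"observed_value": observed}}
    return wrapper


@column_aggregate_sketch
def column_mean(values):
    return mean(values)


result = column_mean([1, 2, None, 3], min_value=1.5, max_value=2.5)
print(result["success"])  # True
```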

class great_expectations.dataset.dataset.Dataset(*args, **kwargs)

Bases: great_expectations.dataset.dataset.MetaDataset

hashable_getters = ['get_column_min', 'get_column_max', 'get_column_mean', 'get_column_modes', 'get_column_median', 'get_column_quantiles', 'get_column_nonnull_count', 'get_column_stdev', 'get_column_sum', 'get_column_unique_count', 'get_column_value_counts', 'get_row_count', 'get_table_columns', 'get_column_count_in_range']

classmethod from_dataset(dataset=None)

This base implementation naively passes its arguments on to the real constructor, which is suitable only when the constructor accepts an instance of its own type. In general, subclasses should override this method.

get_row_count()

Returns: int, table row count

get_table_columns()

Returns: List[str], list of column names

get_column_nonnull_count(column)

Returns: int

get_column_mean(column)

Returns: float

get_column_value_counts(column)

Returns: pd.Series of value counts for a column, sorted by value

get_column_sum(column)

Returns: float

get_column_max(column, parse_strings_as_datetimes=False)

Returns: any

get_column_min(column, parse_strings_as_datetimes=False)

Returns: any

get_column_unique_count(column)

Returns: int

get_column_modes(column)

Returns: List[any], list of modes (ties OK)

get_column_median(column)

Returns: any

get_column_quantiles(column, quantiles)

Get the values in column closest to the requested quantiles.

Parameters
  • column (string) – name of column

  • quantiles (tuple of float) – the quantiles to return. quantiles must be a tuple to ensure caching is possible

Returns

the nearest values in the dataset to those quantiles

Return type

List[any]
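The tuple requirement is easiest to see with a hash-based cache. The sketch below uses functools.lru_cache as a stand-in; whether the library caches this way is an assumption, and DATA and nearest_quantile_values are hypothetical names.

```python
from functools import lru_cache

# Hypothetical in-memory column store used only for this sketch.
DATA = {"age": [18, 21, 25, 30, 42, 57, 63]}


@lru_cache(maxsize=None)
def nearest_quantile_values(column, quantiles):
    """Return the stored values closest to each requested quantile."""
    values = sorted(DATA[column])
    last = len(values) - 1
    # round() picks the nearest index, so every result is an actual value.
    return [values[round(q * last)] for q in quantiles]


print(nearest_quantile_values("age", (0.0, 0.5, 1.0)))  # [18, 30, 63]

try:
    nearest_quantile_values("age", [0.0, 0.5, 1.0])
except TypeError as err:
    # Lists are unhashable, so they cannot serve as cache keys.
    print("list rejected:", err)
```

A tuple is hashable and can be used as a cache key; a list cannot, which is why the call with a list fails before any work is done.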

get_column_stdev(column)

Returns: float

get_column_partition(column, bins='uniform', n_bins=10)

Get a partition of the range of values in the specified column.

Parameters
  • column – the name of the column

  • bins – ‘uniform’ for evenly spaced bins or ‘quantile’ for bins spaced according to quantiles

  • n_bins – the number of bins to produce

Returns

A list of bins
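As a rough illustration of the 'uniform' option, the sketch below computes evenly spaced bin edges between the column minimum and maximum. Returning n_bins + 1 edges is an assumption about what "a list of bins" means here, and uniform_partition is a hypothetical helper, not part of the library.

```python
def uniform_partition(values, n_bins=10):
    """Return n_bins + 1 evenly spaced edges spanning the value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


edges = uniform_partition([0, 3, 7, 10], n_bins=5)
print(edges)  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```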

get_column_hist(column, bins)

Get a histogram of column values.

Parameters
  • column – the column for which to generate the histogram

  • bins (tuple) – the bins to slice the histogram. bins must be a tuple to ensure caching is possible

Returns: List[int], a list of counts corresponding to bins
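A minimal sketch of counting values into bins given as a tuple of edges. The edge handling is an assumption borrowed from NumPy's convention (each bin is half-open except the last, which includes its right edge); histogram_sketch is a hypothetical helper.

```python
from bisect import bisect_right


def histogram_sketch(values, bins):
    """Count values per bin; bins is a tuple of ascending edges."""
    counts = [0] * (len(bins) - 1)
    for v in values:
        # Skip nulls and values outside the overall range.
        if v is None or v < bins[0] or v > bins[-1]:
            continue
        # The final edge is inclusive, so clamp to the last bin.
        idx = min(bisect_right(bins, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts


print(histogram_sketch([0, 1, 2, 5, 9, 10], (0, 5, 10)))  # [3, 3]
```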

get_column_count_in_range(column, min_val=None, max_val=None, strict_min=False, strict_max=True)

Returns: int

test_column_map_expectation_function(function, *args, **kwargs)

Test a column map expectation function

Parameters
  • function (func) – The function to be tested. (Must be a valid column_map_expectation function.)

  • *args – Positional arguments to be passed to the function

  • **kwargs – Keyword arguments to be passed to the function

Returns

A JSON-serializable expectation result object.

Notes

This function is a thin layer to allow quick testing of new expectation functions, without having to define custom classes, etc. To use developed expectations from the command-line tool, you’ll still need to define custom classes, etc.

Check out Custom expectations for more information.

test_column_aggregate_expectation_function(function, *args, **kwargs)

Test a column aggregate expectation function

Parameters
  • function (func) – The function to be tested. (Must be a valid column_aggregate_expectation function.)

  • *args – Positional arguments to be passed to the function

  • **kwargs – Keyword arguments to be passed to the function

Returns

A JSON-serializable expectation result object.

Notes

This function is a thin layer to allow quick testing of new expectation functions, without having to define custom classes, etc. To use developed expectations from the command-line tool, you’ll still need to define custom classes, etc.

Check out Custom expectations for more information.

expect_column_to_exist(column, column_index=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the specified column to exist.

expect_column_to_exist is an expectation, not a column_map_expectation or column_aggregate_expectation.

Parameters

column (str) – The column name.

Other Parameters
  • column_index (int or None) – If not None, checks the order of the columns. The expectation will fail if the column is not in location column_index (zero-indexed).

  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.
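The interaction between column and column_index can be sketched as follows. column_exists is a hypothetical helper over a plain list of column names, and the result carries only the success flag.

```python
def column_exists(table_columns, column, column_index=None):
    """Succeed if column is present and, optionally, at column_index."""
    if column not in table_columns:
        return {"success": False}
    if column_index is None:
        return {"success": True}
    # Positions are zero-indexed.
    return {"success": table_columns.index(column) == column_index}


columns = ["id", "name", "email"]
print(column_exists(columns, "name"))                  # {'success': True}
print(column_exists(columns, "name", column_index=0))  # {'success': False}
```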

expect_table_columns_to_match_ordered_list(column_list, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the columns to exactly match a specified list.

expect_table_columns_to_match_ordered_list is an expectation, not a column_map_expectation or column_aggregate_expectation.

Parameters

column_list (list of str) – The column names, in the correct order.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_table_row_count_to_be_between(min_value=None, max_value=None, strict_min=False, strict_max=False, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the number of rows to be between two values.

expect_table_row_count_to_be_between is an expectation, not a column_map_expectation or column_aggregate_expectation.

Keyword Arguments
  • min_value (int or None) – The minimum number of rows, inclusive unless strict_min=True.

  • max_value (int or None) – The maximum number of rows, inclusive unless strict_max=True.

  • strict_min (boolean) – If True, the table row count must be strictly larger than min_value.

  • strict_max (boolean) – If True, the table row count must be strictly smaller than max_value.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

  • min_value and max_value are both inclusive unless strict_min or strict_max are set to True.

  • If min_value is None, then max_value is treated as an upper bound, and the number of acceptable rows has no minimum.

  • If max_value is None, then min_value is treated as a lower bound, and the number of acceptable rows has no maximum.

See also

expect_table_row_count_to_equal
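The bound rules in the notes above can be sketched as a small function. row_count_between is a hypothetical helper that takes the row count directly, and the observed_value key is an assumption about the result shape.

```python
def row_count_between(row_count, min_value=None, max_value=None,
                      strict_min=False, strict_max=False):
    """Apply optional, optionally strict bounds to a row count."""
    if min_value is None:
        above_min = True  # no lower bound to check
    elif strict_min:
        above_min = row_count > min_value
    else:
        above_min = row_count >= min_value

    if max_value is None:
        below_max = True  # no upper bound to check
    elif strict_max:
        below_max = row_count < max_value
    else:
        below_max = row_count <= max_value

    return {"success": above_min and below_max,
            "result": {"observed_value": row_count}}


print(row_count_between(100, min_value=100, max_value=200)["success"])  # True
print(row_count_between(100, min_value=100, strict_min=True)["success"])  # False
```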

expect_table_row_count_to_equal(value, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the number of rows to equal a value.

expect_table_row_count_to_equal is an expectation, not a column_map_expectation or column_aggregate_expectation.

Parameters

value (int) – The expected number of rows.

Other Parameters
  • result_format (string or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

See also

expect_table_row_count_to_be_between

expect_column_values_to_be_unique(column, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect each column value to be unique.

This expectation detects duplicates. All duplicated values are counted as exceptions.

For example, [1, 2, 3, 3, 3] will return [3, 3, 3] in result.exceptions_list, with unexpected_percent = 0.6.

expect_column_values_to_be_unique is a column_map_expectation.
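The numbers in the example above can be reproduced with a short sketch that flags every occurrence of a repeated value. duplicated_values is a hypothetical helper, not the library's implementation.

```python
from collections import Counter


def duplicated_values(values):
    """Return every occurrence of any value that appears more than once."""
    counts = Counter(values)
    return [v for v in values if counts[v] > 1]


values = [1, 2, 3, 3, 3]
unexpected = duplicated_values(values)
print(unexpected, len(unexpected) / len(values))  # [3, 3, 3] 0.6
```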

Parameters

column (str) – The column name.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_not_be_null(column, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column values to not be null.

To be counted as an exception, values must be explicitly null or missing, such as a NULL in PostgreSQL or an np.NaN in pandas. Empty strings don’t count as null unless they have been coerced to a null type.

expect_column_values_to_not_be_null is a column_map_expectation.
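A small sketch of the distinction drawn above between null values and empty strings. It assumes nulls appear as None or NaN in a plain Python list; is_null is a hypothetical helper.

```python
import math


def is_null(value):
    """None and float NaN count as null; empty strings and zeros do not."""
    return value is None or (isinstance(value, float) and math.isnan(value))


column = ["a", "", None, float("nan"), 0]
unexpected_count = sum(1 for v in column if is_null(v))
print(unexpected_count)  # 2
```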

Parameters

column (str) – The column name.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_be_null(column, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column values to be null.

expect_column_values_to_be_null is a column_map_expectation.

Parameters

column (str) – The column name.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_be_of_type(column, type_, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect a column to contain values of a specified data type.

expect_column_values_to_be_of_type is a column_aggregate_expectation for typed-column backends, and also for PandasDataset where the column dtype and provided type_ are unambiguous constraints (any dtype except ‘object’ or dtype of ‘object’ with type_ specified as ‘object’).

For PandasDataset columns with dtype of ‘object’ expect_column_values_to_be_of_type is a column_map_expectation and will independently check each row’s type.

Parameters
  • column (str) – The column name.

  • type_ (str) – A string representing the data type that each column should have as entries. Valid types are defined by the current backend implementation and are dynamically loaded. For example, valid types for PandasDataset include any numpy dtype values (such as ‘int64’) or native python types (such as ‘int’), whereas valid types for a SqlAlchemyDataset include types named by the current driver such as ‘INTEGER’ in most SQL dialects and ‘TEXT’ in dialects such as postgresql. Valid types for SparkDFDataset include ‘StringType’, ‘BooleanType’ and other pyspark-defined type names.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_be_in_type_list(column, type_list, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect a column to contain values from a specified type list.

expect_column_values_to_be_in_type_list is a column_aggregate_expectation for typed-column backends, and also for PandasDataset where the column dtype provides an unambiguous constraint (any dtype except ‘object’). For PandasDataset columns with dtype of ‘object’, expect_column_values_to_be_in_type_list is a column_map_expectation and will independently check each row’s type.

Parameters
  • column (str) – The column name.

  • type_list (list of str) – A list of strings representing the data types that each column should have as entries. Valid types are defined by the current backend implementation and are dynamically loaded. For example, valid types for PandasDataset include any numpy dtype values (such as ‘int64’) or native python types (such as ‘int’), whereas valid types for a SqlAlchemyDataset include types named by the current driver such as ‘INTEGER’ in most SQL dialects and ‘TEXT’ in dialects such as postgresql. Valid types for SparkDFDataset include ‘StringType’, ‘BooleanType’ and other pyspark-defined type names.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_be_in_set(column, value_set, mostly=None, parse_strings_as_datetimes=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect each column value to be in a given set.

For example:

# my_df.my_col = [1,2,2,3,3,3]
>>> my_df.expect_column_values_to_be_in_set(
    "my_col",
    [2,3]
)
{
  "success": false,
  "result": {
    "unexpected_count": 1,
    "unexpected_percent": 0.16666666666666666,
    "unexpected_percent_nonmissing": 0.16666666666666666,
    "partial_unexpected_list": [
      1
    ]
  }
}

expect_column_values_to_be_in_set is a column_map_expectation.
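The figures in the example above can be checked with a plain-Python sketch. The reading of unexpected_percent (relative to all rows) versus unexpected_percent_nonmissing (relative to non-null rows) is an assumption based on the field names, in_set_summary is a hypothetical helper, and the truncation length of the partial list is arbitrary.

```python
def in_set_summary(values, value_set):
    """Summarise values that fall outside value_set."""
    nonnull = [v for v in values if v is not None]
    unexpected = [v for v in nonnull if v not in value_set]
    total, present = len(values), len(nonnull)
    return {
        "success": not unexpected,
        "result": {
            "unexpected_count": len(unexpected),
            # Assumed: share of all rows, nulls included.
            "unexpected_percent": len(unexpected) / total if total else 0.0,
            # Assumed: share of non-null rows only.
            "unexpected_percent_nonmissing":
                len(unexpected) / present if present else 0.0,
            "partial_unexpected_list": unexpected[:20],
        },
    }


summary = in_set_summary([1, 2, 2, 3, 3, 3], {2, 3})
print(summary["result"]["unexpected_percent"])  # 0.16666666666666666

with_nulls = in_set_summary([1, 2, 2, 3, 3, 3, None, None], {2, 3})
print(with_nulls["result"]["unexpected_percent"])  # 0.125
```

Without nulls the two percentages coincide, as in the example; under this reading they diverge once the column contains missing values.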

Parameters
  • column (str) – The column name.

  • value_set (set-like) – A set of objects used for comparison.

Keyword Arguments
  • mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

  • parse_strings_as_datetimes (boolean or None) – If True values provided in value_set will be parsed as datetimes before making comparisons.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_not_be_in_set(column, value_set, mostly=None, parse_strings_as_datetimes=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column entries to not be in the set.

For example:

# my_df.my_col = [1,2,2,3,3,3]
>>> my_df.expect_column_values_to_not_be_in_set(
    "my_col",
    [1,2]
)
{
  "success": false,
  "result": {
    "unexpected_count": 3,
    "unexpected_percent": 0.5,
    "unexpected_percent_nonmissing": 0.5,
    "partial_unexpected_list": [
      1, 2, 2
    ]
  }
}

expect_column_values_to_not_be_in_set is a column_map_expectation.

Parameters
  • column (str) – The column name.

  • value_set (set-like) – A set of objects used for comparison.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_be_between(column, min_value=None, max_value=None, strict_min=False, strict_max=False, allow_cross_type_comparisons=None, parse_strings_as_datetimes=False, output_strftime_format=None, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column entries to be between a minimum value and a maximum value (inclusive).

expect_column_values_to_be_between is a column_map_expectation.

Parameters
  • column (str) – The column name.

  • min_value (comparable type or None) – The minimum value for a column entry.

  • max_value (comparable type or None) – The maximum value for a column entry.

Keyword Arguments
  • strict_min (boolean) – If True, values must be strictly larger than min_value, default=False

  • strict_max (boolean) – If True, values must be strictly smaller than max_value, default=False

  • allow_cross_type_comparisons (boolean or None) – If True, allow comparisons between types (e.g. integer and string). Otherwise, attempting such comparisons will raise an exception.

  • parse_strings_as_datetimes (boolean or None) – If True, parse min_value, max_value, and all non-null column values to datetimes before making comparisons.

  • output_strftime_format (str or None) – A valid strftime format for datetime output. Only used if parse_strings_as_datetimes=True.

  • mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

  • min_value and max_value are both inclusive unless strict_min or strict_max are set to True.

  • If min_value is None, then max_value is treated as an upper bound, and there is no minimum value checked.

  • If max_value is None, then min_value is treated as a lower bound, and there is no maximum value checked.

expect_column_values_to_be_increasing(column, strictly=None, parse_strings_as_datetimes=False, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column values to be increasing.

By default, this expectation only works for numeric or datetime data. When parse_strings_as_datetimes=True, it can also parse strings to datetimes.

If strictly=True, then this expectation is only satisfied if each consecutive value is strictly increasing; equal values are treated as failures.

expect_column_values_to_be_increasing is a column_map_expectation.
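The effect of strictly can be sketched by comparing each value with its predecessor. increasing_flags is a hypothetical helper, and treating the first value as passing (it has no predecessor) is an assumption.

```python
def increasing_flags(values, strictly=False):
    """Return one boolean per value: does it keep the sequence increasing?"""
    if not values:
        return []
    flags = [True]  # assumption: the first value has nothing to compare to
    for prev, cur in zip(values, values[1:]):
        flags.append(cur > prev if strictly else cur >= prev)
    return flags


print(increasing_flags([1, 2, 2, 3]))                 # [True, True, True, True]
print(increasing_flags([1, 2, 2, 3], strictly=True))  # [True, True, False, True]
```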

Parameters

column (str) – The column name.

Keyword Arguments
  • strictly (Boolean or None) – If True, values must be strictly greater than previous values

  • parse_strings_as_datetimes (boolean or None) – If True, parse all non-null column values to datetimes before making comparisons

  • mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_be_decreasing(column, strictly=None, parse_strings_as_datetimes=False, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column values to be decreasing.

By default, this expectation only works for numeric or datetime data. When parse_strings_as_datetimes=True, it can also parse strings to datetimes.

If strictly=True, then this expectation is only satisfied if each consecutive value is strictly decreasing; equal values are treated as failures.

expect_column_values_to_be_decreasing is a column_map_expectation.

Parameters

column (str) – The column name.

Keyword Arguments
  • strictly (Boolean or None) – If True, values must be strictly less than previous values

  • parse_strings_as_datetimes (boolean or None) – If True, parse all non-null column values to datetimes before making comparisons

  • mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_value_lengths_to_be_between(column, min_value=None, max_value=None, strict_min=False, strict_max=False, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column entries to be strings with length between a minimum value and a maximum value (inclusive).

This expectation only works for string-type values. Invoking it on ints or floats will raise a TypeError.

expect_column_value_lengths_to_be_between is a column_map_expectation.

Parameters

column (str) – The column name.

Keyword Arguments
  • min_value (int or None) – The minimum value for a column entry length.

  • max_value (int or None) – The maximum value for a column entry length.

  • mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

  • strict_min (boolean) – If True, value lengths must be strictly larger than min_value, default=False

  • strict_max (boolean) – If True, value lengths must be strictly smaller than max_value, default=False

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

  • min_value and max_value are both inclusive unless strict_min or strict_max are set to True.

  • If min_value is None, then max_value is treated as an upper bound, and value lengths have no minimum.

  • If max_value is None, then min_value is treated as a lower bound, and value lengths have no maximum.
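The bound handling described in these notes can be sketched in plain Python. This is an illustrative re-implementation, not the library's code: the function name lengths_between is hypothetical, nulls are represented by None, and treating a column with no non-null values as passing is an assumption.

```python
def lengths_between(values, min_value=None, max_value=None,
                    strict_min=False, strict_max=False, mostly=None):
    """Hypothetical helper mirroring the documented bound semantics."""
    # Null values are excluded before the row-wise check runs.
    non_null = [v for v in values if v is not None]

    def ok(value):
        n = len(value)  # raises TypeError for ints and floats
        if min_value is not None:
            if n < min_value or (strict_min and n == min_value):
                return False
        if max_value is not None:
            if n > max_value or (strict_max and n == max_value):
                return False
        return True

    flags = [ok(v) for v in non_null]
    if not flags:
        return True  # assumption: nothing to check counts as success
    fraction = sum(flags) / len(flags)
    # With mostly=None every non-null value must pass.
    return fraction >= (1.0 if mostly is None else mostly)


names = ["ab", "abc", "abcd", None]
inclusive = lengths_between(names, min_value=2, max_value=4)
strict = lengths_between(names, min_value=2, max_value=4, strict_min=True)
tolerant = lengths_between(names, min_value=2, max_value=4,
                           strict_min=True, mostly=0.6)
print(inclusive, strict, tolerant)
```

With strict_min=True the two-character entry no longer passes, so only two of the three non-null values succeed; setting mostly=0.6 accepts that fraction.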

expect_column_value_lengths_to_equal(column, value, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column entries to be strings with length equal to the provided value.

This expectation only works for string-type values. Invoking it on ints or floats will raise a TypeError.

expect_column_value_lengths_to_equal is a column_map_expectation.

Parameters
  • column (str) – The column name.

  • value (int or None) – The expected value for a column entry length.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_match_regex(column, regex, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column entries to be strings that match a given regular expression. Valid matches can be found anywhere in the string, for example “[at]+” will identify the following strings as expected: “cat”, “hat”, “aa”, “a”, and “t”, and the following strings as unexpected: “fish”, “dog”.

expect_column_values_to_match_regex is a column_map_expectation.
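The "match anywhere in the string" behavior corresponds to search semantics rather than anchored matching. The snippet below uses Python's re module to illustrate the difference on the strings from the description; that the library evaluates the pattern this way internally is an assumption.

```python
import re

pattern = re.compile(r"[at]+")
values = ["cat", "hat", "aa", "a", "t", "fish", "dog"]

# re.search scans the whole string, so a match anywhere counts.
unexpected = [v for v in values if pattern.search(v) is None]

# re.match only succeeds at the start of the string, which is stricter
# than the behavior described above.
anchored = [v for v in values if pattern.match(v) is not None]

print(unexpected)
print(anchored)
```

To require a match at a particular position, anchor the pattern itself with ^ or $.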

Parameters
  • column (str) – The column name.

  • regex (str) – The regular expression the column entries should match.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_not_match_regex(column, regex, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column entries to be strings that do NOT match a given regular expression. The regex must not match any portion of the provided string. For example, “[at]+” would identify the following strings as expected: “fish”, “dog”, and the following as unexpected: “cat”, “hat”.

expect_column_values_to_not_match_regex is a column_map_expectation.

Parameters
  • column (str) – The column name.

  • regex (str) – The regular expression the column entries should NOT match.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_match_regex_list(column, regex_list, match_on='any', mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column entries to be strings that can be matched to either any of or all of a list of regular expressions. Matches can be anywhere in the string.

expect_column_values_to_match_regex_list is a column_map_expectation.
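The difference between the two match_on modes can be sketched for a single value as follows. The helper name matches_regex_list is hypothetical, and the use of re.search reflects the "matches can be anywhere in the string" description rather than the library's actual implementation.

```python
import re


def matches_regex_list(value, regex_list, match_on="any"):
    """Hypothetical per-value check for the two match_on modes."""
    hits = [re.search(rx, value) is not None for rx in regex_list]
    if match_on == "any":
        return any(hits)  # at least one pattern matches
    if match_on == "all":
        return all(hits)  # every pattern matches
    raise ValueError("match_on must be 'any' or 'all'")


print(matches_regex_list("cat", [r"^c", r"dog"], match_on="any"))
print(matches_regex_list("cat", [r"^c", r"dog"], match_on="all"))
print(matches_regex_list("cat", [r"^c", r"t$"], match_on="all"))
```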

Parameters
  • column (str) – The column name.

  • regex_list (list) – The list of regular expressions which the column entries should match

Keyword Arguments
  • match_on (string) – “any” or “all”. Use “any” if the value should match at least one regular expression in the list. Use “all” if it should match each regular expression in the list.

  • mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_not_match_regex_list(column, regex_list, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column entries to be strings that do not match any of a list of regular expressions. Matches can be anywhere in the string.

expect_column_values_to_not_match_regex_list is a column_map_expectation.

Parameters
  • column (str) – The column name.

  • regex_list (list) – The list of regular expressions which the column entries should not match

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_match_strftime_format(column, strftime_format, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column entries to be strings representing a date or time with a given format.

expect_column_values_to_match_strftime_format is a column_map_expectation.
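A per-value check of this kind can be approximated with datetime.strptime from the standard library. This is a sketch under the assumption that a value matches when it parses with the given format; the helper name matches_strftime_format is hypothetical.

```python
from datetime import datetime


def matches_strftime_format(value, strftime_format):
    """Hypothetical check: True if value parses with the given format."""
    try:
        datetime.strptime(value, strftime_format)
        return True
    except ValueError:
        # Raised both for a layout mismatch and for impossible dates.
        return False


print(matches_strftime_format("2019-03-01", "%Y-%m-%d"))
print(matches_strftime_format("03/01/2019", "%Y-%m-%d"))
print(matches_strftime_format("2019-02-30", "%Y-%m-%d"))
```

Note that under this approach a string with the right layout but an impossible date, such as February 30, is also treated as not matching.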

Parameters
  • column (str) – The column name.

  • strftime_format (str) – A strftime format string to use for matching

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_be_dateutil_parseable(column, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column entries to be parsable using dateutil.

expect_column_values_to_be_dateutil_parseable is a column_map_expectation.

Parameters

column (str) – The column name.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_be_json_parseable(column, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column entries to be data written in JavaScript Object Notation.

expect_column_values_to_be_json_parseable is a column_map_expectation.
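As a rough illustration of what "JSON parseable" means for a single value, the sketch below tries json.loads and reports whether it succeeds. The helper name is_json_parseable is hypothetical, and the library's own check may differ in detail.

```python
import json


def is_json_parseable(value):
    """Hypothetical check: True if value is a valid JSON document."""
    try:
        json.loads(value)
        return True
    except (ValueError, TypeError):
        # ValueError covers malformed JSON; TypeError covers non-string input.
        return False


print(is_json_parseable('{"a": 1}'))
print(is_json_parseable("[1, 2"))
print(is_json_parseable("{'a': 1}"))
```

Single-quoted keys are valid Python literals but not valid JSON, so the last value is rejected.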

Parameters

column (str) – The column name.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_values_to_match_json_schema(column, json_schema, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column entries to be JSON objects matching a given JSON schema.

expect_column_values_to_match_json_schema is a column_map_expectation.

Parameters
  • column (str) – The column name.

  • json_schema (dict) – The JSON schema that column entries should match.

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than(column, distribution, p_value=0.05, params=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column values to be distributed similarly to a scipy distribution. This expectation compares the provided column to the specified continuous distribution with a parametric Kolmogorov-Smirnov test. The K-S test compares the provided column to the cumulative distribution function (CDF) of the specified scipy distribution. If you don’t know the desired distribution shape parameters, use the ge.dataset.util.infer_distribution_parameters() utility function to estimate them.

It returns ‘success’=True if the p-value from the K-S test is greater than or equal to the provided p-value.

expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than is a column_aggregate_expectation.
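To make the comparison against a CDF concrete, the following standard-library sketch computes the one-sample Kolmogorov-Smirnov statistic D against a normal distribution. It only illustrates the statistic; it does not compute the p-value that the expectation compares against its threshold, and the helper names normal_cdf and ks_statistic are hypothetical.

```python
import math


def normal_cdf(x, loc=0.0, scale=1.0):
    """CDF of a normal distribution with the given loc and scale."""
    return 0.5 * (1.0 + math.erf((x - loc) / (scale * math.sqrt(2.0))))


def ks_statistic(values, cdf):
    """Largest gap between the empirical CDF of values and cdf."""
    xs = sorted(values)
    n = len(xs)
    # The empirical CDF jumps from i/n to (i+1)/n at the i-th sorted value,
    # so both sides of each jump are compared with the reference CDF.
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)


centered = ks_statistic([-1.0, 0.0, 1.0], normal_cdf)
shifted = ks_statistic([9.0, 10.0, 11.0], normal_cdf)
print(centered, shifted)
```

A sample far from the reference distribution yields a statistic close to 1, which corresponds to a small p-value and therefore a failing expectation.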

Parameters
  • column (str) – The column name.

  • distribution (str) – The scipy distribution name. See: https://docs.scipy.org/doc/scipy/reference/stats.html Currently supported distributions are listed in the Notes section below.

  • p_value (float) – The threshold p-value for a passing test. Default is 0.05.

  • params (dict or list) – A dictionary or positional list of shape parameters that describe the distribution you want to test the data against. Include key values specific to the distribution from the appropriate scipy distribution CDF function. ‘loc’ and ‘scale’ are used as translational parameters. See https://docs.scipy.org/doc/scipy/reference/stats.html#continuous-distributions

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "details":
        "expected_params" (dict): The specified or inferred parameters of the distribution to test against
        "ks_results" (dict): The raw result of stats.kstest()
}
  • The Kolmogorov-Smirnov test’s null hypothesis is that the column values are drawn from the provided distribution.

  • Supported scipy distributions:

    • norm

    • beta

    • gamma

    • uniform

    • chi2

    • expon

expect_column_distinct_values_to_be_in_set(column, value_set, parse_strings_as_datetimes=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the set of distinct column values to be contained by a given set.

The success value for this expectation will match that of expect_column_values_to_be_in_set. However, expect_column_distinct_values_to_be_in_set is a column_aggregate_expectation.

For example:

# my_df.my_col = [1,2,2,3,3,3]
>>> my_df.expect_column_distinct_values_to_be_in_set(
    "my_col",
    [2, 3, 4]
)
{
  "success": false,
  "result": {
    "observed_value": [1,2,3],
    "details": {
      "value_counts": [
        {
          "value": 1,
          "count": 1
        },
        {
          "value": 2,
          "count": 2
        },
        {
          "value": 3,
          "count": 3
        }
      ]
    }
  }
}
Parameters
  • column (str) – The column name.

  • value_set (set-like) – A set of objects used for comparison.

Keyword Arguments

parse_strings_as_datetimes (boolean or None) – If True, values provided in value_set will be parsed as datetimes before making comparisons.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_distinct_values_to_equal_set(column, value_set, parse_strings_as_datetimes=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the set of distinct column values to equal a given set.

In contrast to expect_column_distinct_values_to_contain_set(), this ensures not only that a certain set of values is present in the column but also that these and only these values are present.

expect_column_distinct_values_to_equal_set is a column_aggregate_expectation.

For example:

# my_df.my_col = [1,2,2,3,3,3]
>>> my_df.expect_column_distinct_values_to_equal_set(
    "my_col",
    [2,3]
)
{
  "success": false,
  "result": {
    "observed_value": [1,2,3]
  }
}
Parameters
  • column (str) – The column name.

  • value_set (set-like) – A set of objects used for comparison.

Keyword Arguments

parse_strings_as_datetimes (boolean or None) – If True, values provided in value_set will be parsed as datetimes before making comparisons.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_distinct_values_to_contain_set(column, value_set, parse_strings_as_datetimes=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the set of distinct column values to contain a given set.

In contrast to expect_column_values_to_be_in_set(), this does not require every column value to be a member of the given set. Instead, it requires every value in the given set to be present in the column.

expect_column_distinct_values_to_contain_set is a column_aggregate_expectation.

For example:

# my_df.my_col = [1,2,2,3,3,3]
>>> my_df.expect_column_distinct_values_to_contain_set(
    "my_col",
    [2,3]
)
{
  "success": true,
  "result": {
    "observed_value": [1,2,3]
  }
}
Parameters
  • column (str) – The column name.

  • value_set (set-like) – A set of objects used for comparison.

Keyword Arguments

parse_strings_as_datetimes (boolean or None) – If True, values provided in value_set will be parsed as datetimes before making comparisons.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.
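The three distinct-value expectations above (to_be_in_set, to_equal_set, and to_contain_set) differ only in the set relation they check. The plain-Python sketch below restates those relations for the column used in the examples; it illustrates the semantics only and is not the library's implementation.

```python
column_values = [1, 2, 2, 3, 3, 3]
distinct = set(column_values)  # {1, 2, 3}

# to_be_in_set: every distinct column value is in the given set (subset).
in_set = distinct <= {2, 3, 4}

# to_equal_set: the distinct column values are exactly the given set.
equal_set = distinct == {2, 3}

# to_contain_set: every value of the given set occurs in the column (superset).
contain_set = distinct >= {2, 3}

print(in_set, equal_set, contain_set)
```

The value 1 appears in the column but not in either comparison set, which is why only the containment check succeeds.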

expect_column_mean_to_be_between(column, min_value=None, max_value=None, strict_min=False, strict_max=False, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column mean to be between a minimum value and a maximum value (inclusive).

expect_column_mean_to_be_between is a column_aggregate_expectation.

Parameters
  • column (str) – The column name.

  • min_value (float or None) – The minimum value for the column mean.

  • max_value (float or None) – The maximum value for the column mean.

  • strict_min (boolean) – If True, the column mean must be strictly larger than min_value, default=False

  • strict_max (boolean) – If True, the column mean must be strictly smaller than max_value, default=False

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "observed_value": (float) The true mean for the column
}
  • min_value and max_value are both inclusive unless strict_min or strict_max are set to True.

  • If min_value is None, then max_value is treated as an upper bound.

  • If max_value is None, then min_value is treated as a lower bound.
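The aggregate pattern and the optional bounds described in these notes can be sketched with the standard library. The function mean_between is a hypothetical stand-in, nulls are represented by None, and the returned dictionary only imitates the shape of a result object.

```python
from statistics import mean


def mean_between(values, min_value=None, max_value=None,
                 strict_min=False, strict_max=False):
    """Hypothetical aggregate check mirroring the documented bounds."""
    # Nulls are dropped before the aggregate is computed.
    observed = mean(v for v in values if v is not None)

    if min_value is None:
        above = True  # no lower bound
    elif strict_min:
        above = observed > min_value
    else:
        above = observed >= min_value

    if max_value is None:
        below = True  # no upper bound
    elif strict_max:
        below = observed < max_value
    else:
        below = observed <= max_value

    return {"success": above and below, "result": {"observed_value": observed}}


print(mean_between([1, 2, 3, None], min_value=2, max_value=3))
print(mean_between([1, 2, 3, None], min_value=2, max_value=3, strict_min=True))
```

The observed mean of 2 sits exactly on the lower bound, so the check passes with inclusive bounds and fails once strict_min is set.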

expect_column_median_to_be_between(column, min_value=None, max_value=None, strict_min=False, strict_max=False, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column median to be between a minimum value and a maximum value.

expect_column_median_to_be_between is a column_aggregate_expectation.

Parameters
  • column (str) – The column name.

  • min_value (int or None) – The minimum value for the column median.

  • max_value (int or None) – The maximum value for the column median.

  • strict_min (boolean) – If True, the column median must be strictly larger than min_value, default=False

  • strict_max (boolean) – If True, the column median must be strictly smaller than max_value, default=False

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "observed_value": (float) The true median for the column
}
  • min_value and max_value are both inclusive unless strict_min or strict_max are set to True.

  • If min_value is None, then max_value is treated as an upper bound

  • If max_value is None, then min_value is treated as a lower bound

expect_column_quantile_values_to_be_between(column, quantile_ranges, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect specific provided column quantiles to be between provided minimum and maximum values.

quantile_ranges must be a dictionary with two keys:

  • quantiles: (list of float) list of desired quantile values, in increasing order

  • value_ranges: (list of lists): Each element in this list consists of a list with two values, a lower and upper bound (inclusive) for the corresponding quantile.

For each provided range:

  • min_value and max_value are both inclusive.

  • If min_value is None, then max_value is treated as an upper bound only

  • If max_value is None, then min_value is treated as a lower bound only

The quantiles list and the value_ranges list must have the same length.

For example:

# my_df.my_col = [1,2,2,3,3,3,4]
>>> my_df.expect_column_quantile_values_to_be_between(
    "my_col",
    {
        "quantiles": [0., 0.333, 0.6667, 1.],
        "value_ranges": [[0,1], [2,3], [3,4], [4,5]]
    }
)
{
  "success": true,
  "result": {
    "observed_value": {
      "quantiles": [0., 0.333, 0.6667, 1.],
      "values": [1, 2, 3, 4]
    },
    "element_count": 7,
    "missing_count": 0,
    "missing_percent": 0.0,
    "details": {
      "success_details": [true, true, true, true]
    }
  }
}

expect_column_quantile_values_to_be_between can be computationally intensive for large datasets.

expect_column_quantile_values_to_be_between is a column_aggregate_expectation.
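The structural rules for quantile_ranges, and the per-range bound check, can be sketched as two small helpers. Both function names are hypothetical and the error messages are illustrative; this is a sketch of the documented constraints, not the library's validation code.

```python
def validate_quantile_ranges(quantile_ranges):
    """Hypothetical up-front check of the quantile_ranges structure."""
    quantiles = quantile_ranges["quantiles"]
    value_ranges = quantile_ranges["value_ranges"]
    if len(quantiles) != len(value_ranges):
        raise ValueError("quantiles and value_ranges must have the same length")
    if list(quantiles) != sorted(quantiles):
        raise ValueError("quantiles must be in increasing order")
    if any(len(bounds) != 2 for bounds in value_ranges):
        raise ValueError("each value range needs a lower and an upper bound")


def quantile_in_range(observed, bounds):
    """Inclusive check; a None bound leaves that side open."""
    low, high = bounds
    return (low is None or observed >= low) and (high is None or observed <= high)


ranges = {
    "quantiles": [0.0, 0.333, 0.6667, 1.0],
    "value_ranges": [[0, 1], [2, 3], [3, 4], [4, 5]],
}
validate_quantile_ranges(ranges)
observed_values = [1, 2, 3, 4]  # observed quantile values from the example above
success_details = [
    quantile_in_range(value, bounds)
    for value, bounds in zip(observed_values, ranges["value_ranges"])
]
print(success_details)
```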

Parameters
  • column (str) – The column name.

  • quantile_ranges (dictionary) – Quantiles and associated value ranges for the column. See above for details.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

details.success_details

expect_column_stdev_to_be_between(column, min_value=None, max_value=None, strict_min=False, strict_max=False, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column standard deviation to be between a minimum value and a maximum value. Uses sample standard deviation (normalized by N-1).

expect_column_stdev_to_be_between is a column_aggregate_expectation.
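The difference between the sample standard deviation (normalized by N-1) and the population standard deviation (normalized by N) can be seen with the standard library's statistics module. The data below is made up for illustration.

```python
from statistics import pstdev, stdev

values = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = stdev(values)       # divides the squared deviations by N-1
population_sd = pstdev(values)  # divides the squared deviations by N

print(sample_sd, population_sd)
```

Because the sample form divides by a smaller number, it is always at least as large as the population form, so bounds tuned against one may not hold for the other.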

Parameters
  • column (str) – The column name.

  • min_value (float or None) – The minimum value for the column standard deviation.

  • max_value (float or None) – The maximum value for the column standard deviation.

  • strict_min (boolean) – If True, the column standard deviation must be strictly larger than min_value, default=False

  • strict_max (boolean) – If True, the column standard deviation must be strictly smaller than max_value, default=False

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "observed_value": (float) The true standard deviation for the column
}
  • min_value and max_value are both inclusive unless strict_min or strict_max are set to True.

  • If min_value is None, then max_value is treated as an upper bound

  • If max_value is None, then min_value is treated as a lower bound

expect_column_unique_value_count_to_be_between(column, min_value=None, max_value=None, strict_min=False, strict_max=False, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the number of unique values to be between a minimum value and a maximum value.

expect_column_unique_value_count_to_be_between is a column_aggregate_expectation.

Parameters
  • column (str) – The column name.

  • min_value (int or None) – The minimum number of unique values allowed.

  • max_value (int or None) – The maximum number of unique values allowed.

  • strict_min (boolean) – If True, the number of unique values must be strictly larger than min_value, default=False

  • strict_max (boolean) – If True, the number of unique values must be strictly smaller than max_value, default=False

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "observed_value": (int) The number of unique values in the column
}
  • min_value and max_value are both inclusive unless strict_min or strict_max are set to True.

  • If min_value is None, then max_value is treated as an upper bound

  • If max_value is None, then min_value is treated as a lower bound
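The bound semantics in the notes above can be sketched in plain Python. This is an illustration only, not the library's implementation: `is_between` is a hypothetical helper name, and dropping nulls before counting is assumed from the general behavior of column aggregate expectations.

```python
def is_between(observed, min_value=None, max_value=None,
               strict_min=False, strict_max=False):
    """Hypothetical helper: a bound of None means no bound on that side."""
    if min_value is not None:
        if strict_min and not observed > min_value:
            return False
        if not strict_min and not observed >= min_value:
            return False
    if max_value is not None:
        if strict_max and not observed < max_value:
            return False
        if not strict_max and not observed <= max_value:
            return False
    return True


values = ["a", "b", "b", "c", None]
# Assumption: nulls are dropped before the unique values are counted.
unique_count = len({v for v in values if v is not None})

print(unique_count)                                            # 3
print(is_between(unique_count, min_value=3, max_value=5))      # True (inclusive)
print(is_between(unique_count, min_value=3, strict_min=True))  # False (strict)
```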

expect_column_proportion_of_unique_values_to_be_between(column, min_value=0, max_value=1, strict_min=False, strict_max=False, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the proportion of unique values to be between a minimum value and a maximum value.

For example, in a column containing [1, 2, 2, 3, 3, 3, 4, 4, 4, 4], there are 4 unique values and 10 total values for a proportion of 0.4.

expect_column_proportion_of_unique_values_to_be_between is a column_aggregate_expectation.

Parameters
  • column (str) – The column name.

  • min_value (float or None) – The minimum proportion of unique values. (Proportions are on the range 0 to 1)

  • max_value (float or None) – The maximum proportion of unique values. (Proportions are on the range 0 to 1)

  • strict_min (boolean) – If True, the proportion of unique values must be strictly larger than min_value, default=False

  • strict_max (boolean) – If True, the proportion of unique values must be strictly smaller than max_value, default=False

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "observed_value": (float) The proportion of unique values in the column
}
  • min_value and max_value are both inclusive unless strict_min or strict_max are set to True.

  • If min_value is None, then max_value is treated as an upper bound

  • If max_value is None, then min_value is treated as a lower bound
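The worked example from the description can be reproduced in a few lines of plain Python. This sketch only mirrors the arithmetic; the sample column has no nulls, so how nulls affect the denominator is not shown here.

```python
values = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

unique_count = len(set(values))          # 4 distinct values
proportion = unique_count / len(values)  # 4 / 10

print(proportion)  # 0.4

# The default bounds are min_value=0 and max_value=1, both inclusive.
success = 0 <= proportion <= 1
print(success)     # True
```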

expect_column_most_common_value_to_be_in_set(column, value_set, ties_okay=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the most common value to be within the designated value set.

expect_column_most_common_value_to_be_in_set is a column_aggregate_expectation.

Parameters
  • column (str) – The column name

  • value_set (set-like) – A list of potential values to match

Keyword Arguments

ties_okay (boolean or None) – If True, then the expectation will still succeed if values outside the designated set are as common as (but not more common than) designated values

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "observed_value": (list) The most common values in the column
}

observed_value contains a list of the most common values. Often, this will just be a single element. But if there’s a tie for most common among multiple values, observed_value will contain a single copy of each most common value.
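A rough sketch of how tied modes and ties_okay could interact, in plain Python. The helper names are hypothetical, and the handling without ties_okay follows one reading of the parameter description; the library's treatment of a tie that falls entirely inside the value set is not specified here.

```python
from collections import Counter


def column_modes(values):
    """Return every value tied for most common, ignoring None."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def most_common_value_in_set(values, value_set, ties_okay=False):
    modes = column_modes(values)
    in_set = [m for m in modes if m in value_set]
    if ties_okay:
        # A tie with outside values is tolerated if some mode is in the set.
        success = len(in_set) > 0
    else:
        # Assumption: otherwise every most common value must be in the set.
        success = len(modes) > 0 and len(in_set) == len(modes)
    return {"success": success, "result": {"observed_value": modes}}


tied = ["a", "a", "b", "b", "c", None]
print(most_common_value_in_set(tied, {"a"}))
# {'success': False, 'result': {'observed_value': ['a', 'b']}}
print(most_common_value_in_set(tied, {"a"}, ties_okay=True))
# {'success': True, 'result': {'observed_value': ['a', 'b']}}
```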

expect_column_sum_to_be_between(column, min_value=None, max_value=None, strict_min=False, strict_max=False, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column sum to be between a min and max value.

expect_column_sum_to_be_between is a column_aggregate_expectation.

Parameters
  • column (str) – The column name

  • min_value (comparable type or None) – The minimal sum allowed.

  • max_value (comparable type or None) – The maximal sum allowed.

  • strict_min (boolean) – If True, the column sum must be strictly larger than min_value, default=False

  • strict_max (boolean) – If True, the column sum must be strictly smaller than max_value, default=False

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "observed_value": (float) The actual column sum
}
  • min_value and max_value are both inclusive unless strict_min or strict_max are set to True.

  • If min_value is None, then max_value is treated as an upper bound

  • If max_value is None, then min_value is treated as a lower bound

expect_column_min_to_be_between(column, min_value=None, max_value=None, strict_min=False, strict_max=False, parse_strings_as_datetimes=False, output_strftime_format=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column minimum to be between a min and max value.

expect_column_min_to_be_between is a column_aggregate_expectation.

Parameters
  • column (str) – The column name

  • min_value (comparable type or None) – The minimal column minimum allowed.

  • max_value (comparable type or None) – The maximal column minimum allowed.

  • strict_min (boolean) – If True, the column minimum must be strictly larger than min_value, default=False

  • strict_max (boolean) – If True, the column minimum must be strictly smaller than max_value, default=False

Keyword Arguments
  • parse_strings_as_datetimes (boolean or None) – If True, parse min_value, max_value, and all non-null column values to datetimes before making comparisons.

  • output_strftime_format (str or None) – A valid strftime format for datetime output. Only used if parse_strings_as_datetimes=True.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "observed_value": (comparable type) The actual column min
}
  • min_value and max_value are both inclusive unless strict_min or strict_max are set to True.

  • If min_value is None, then max_value is treated as an upper bound

  • If max_value is None, then min_value is treated as a lower bound

expect_column_max_to_be_between(column, min_value=None, max_value=None, strict_min=False, strict_max=False, parse_strings_as_datetimes=False, output_strftime_format=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column max to be between a min and max value.

expect_column_max_to_be_between is a column_aggregate_expectation.

Parameters
  • column (str) – The column name

  • min_value (comparable type or None) – The minimal column maximum allowed.

  • max_value (comparable type or None) – The maximal column maximum allowed.

Keyword Arguments
  • parse_strings_as_datetimes (boolean or None) – If True, parse min_value, max_value, and all non-null column values to datetimes before making comparisons.

  • output_strftime_format (str or None) – A valid strftime format for datetime output. Only used if parse_strings_as_datetimes=True.

  • strict_min (boolean) – If True, the column maximum must be strictly larger than min_value, default=False

  • strict_max (boolean) – If True, the column maximum must be strictly smaller than max_value, default=False

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "observed_value": (comparable type) The actual column max
}
  • min_value and max_value are both inclusive unless strict_min or strict_max are set to True.

  • If min_value is None, then max_value is treated as an upper bound

  • If max_value is None, then min_value is treated as a lower bound
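The effect of parse_strings_as_datetimes and output_strftime_format can be illustrated with a small plain-Python sketch. It is not the library's code: `column_max_between` is a hypothetical function, ISO 8601 input parsed with `datetime.fromisoformat` is an assumption (the library's parser may accept other formats), and strict bounds are left out for brevity.

```python
from datetime import datetime


def column_max_between(values, min_value=None, max_value=None,
                       parse_strings_as_datetimes=False,
                       output_strftime_format=None):
    # Assumption: strings are ISO 8601 dates such as "2019-03-17".
    parse = datetime.fromisoformat if parse_strings_as_datetimes else (lambda v: v)

    nonnull = [parse(v) for v in values if v is not None]
    observed = max(nonnull)
    low = parse(min_value) if min_value is not None else None
    high = parse(max_value) if max_value is not None else None

    # Inclusive bounds; a bound of None is not checked.
    success = ((low is None or observed >= low) and
               (high is None or observed <= high))

    if parse_strings_as_datetimes and output_strftime_format:
        observed = observed.strftime(output_strftime_format)
    return {"success": success, "result": {"observed_value": observed}}


result = column_max_between(
    ["2019-01-05", None, "2019-03-17", "2019-02-01"],
    min_value="2019-03-01",
    max_value="2019-03-31",
    parse_strings_as_datetimes=True,
    output_strftime_format="%Y-%m-%d",
)
print(result)
# {'success': True, 'result': {'observed_value': '2019-03-17'}}
```

Parsing the bounds together with the column values keeps every comparison between datetimes rather than between raw strings.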

expect_column_chisquare_test_p_value_to_be_greater_than(column, partition_object=None, p=0.05, tail_weight_holdout=0, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column values to be distributed similarly to the provided categorical partition. This expectation compares categorical distributions using a Chi-squared test. It returns success=True if values in the column match the distribution of the provided partition.

expect_column_chisquare_test_p_value_to_be_greater_than is a column_aggregate_expectation.

Parameters
  • column (str) – The column name.

  • partition_object (dict) – The expected partition object (see Partition Objects).

  • p (float) – The p-value threshold for rejecting the null hypothesis of the Chi-Squared test. For values below the specified threshold, the expectation will return success=False, rejecting the null hypothesis that the distributions are the same. Defaults to 0.05.

Keyword Arguments

tail_weight_holdout (float between 0 and 1 or None) – The amount of weight to split uniformly between values observed in the data but not present in the provided partition. tail_weight_holdout provides a mechanism to make the test less strict by assigning positive weights to unknown values observed in the data that are not present in the partition.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "observed_value": (float) The true p-value of the Chi-squared test
    "details": {
        "observed_partition" (dict):
            The partition observed in the data.
        "expected_partition" (dict):
            The partition expected from the data, after including tail_weight_holdout
    }
}
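How tail_weight_holdout could reshape the expected partition is sketched below in plain Python. Several details are assumptions rather than documented behavior: the "values" key of the categorical partition object, the rescaling of known weights by (1 - tail_weight_holdout), and the helper names. Turning the statistic into a p-value requires the chi-squared distribution and is omitted.

```python
from collections import Counter


def expected_counts(values, partition_object, tail_weight_holdout=0):
    """Expected count per category, with holdout weight for unseen categories.

    Assumption: known weights are scaled by (1 - tail_weight_holdout) so the
    expected counts still add up to the number of observed values.
    """
    observed = Counter(values)
    n = len(values)
    expected = {
        value: weight * n * (1 - tail_weight_holdout)
        for value, weight in zip(partition_object["values"],
                                 partition_object["weights"])
    }
    unknown = [v for v in observed if v not in expected]
    for v in unknown:
        # The holdout is split uniformly among values missing from the partition.
        expected[v] = n * tail_weight_holdout / len(unknown)
    return observed, expected


def chi_square_statistic(observed, expected):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    return sum((observed.get(k, 0) - e) ** 2 / e for k, e in expected.items())


partition = {"values": ["a", "b"], "weights": [0.5, 0.5]}
column = ["a"] * 3 + ["b"] * 3 + ["c"] * 2

observed, expected = expected_counts(column, partition, tail_weight_holdout=0.25)
print(expected)                                  # {'a': 3.0, 'b': 3.0, 'c': 2.0}
print(chi_square_statistic(observed, expected))  # 0.0
```

With a holdout of 0, the unseen value "c" would get an expected count of zero, which is why the holdout makes the test less strict.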
expect_column_bootstrapped_ks_test_p_value_to_be_greater_than(column, partition_object=None, p=0.05, bootstrap_samples=None, bootstrap_sample_size=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect column values to be distributed similarly to the provided continuous partition. This expectation compares continuous distributions using a bootstrapped Kolmogorov-Smirnov test. It returns success=True if values in the column match the distribution of the provided partition.

The expected cumulative density function (CDF) is constructed as a linear interpolation between the bins, using the provided weights. Consequently the test expects a piecewise uniform distribution using the bins from the provided partition object.

expect_column_bootstrapped_ks_test_p_value_to_be_greater_than is a column_aggregate_expectation.

Parameters
  • column (str) – The column name.

  • partition_object (dict) – The expected partition object (see Partition Objects).

  • p (float) – The p-value threshold for the Kolmogorov-Smirnov test. For values below the specified threshold the expectation will return success=False, rejecting the null hypothesis that the distributions are the same. Defaults to 0.05.

Keyword Arguments
  • bootstrap_samples (int) – The number of bootstrap rounds. Defaults to 1000.

  • bootstrap_sample_size (int) – The number of samples to take from the column for each bootstrap. A larger sample will increase the specificity of the test. Defaults to 2 * len(partition_object['weights'])

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "observed_value": (float) The true p-value of the KS test
    "details": {
        "bootstrap_samples": The number of bootstrap rounds used
        "bootstrap_sample_size": The number of samples taken from
            the column in each bootstrap round
        "observed_cdf": The cumulative density function observed
            in the data, a dict containing 'x' values and cdf_values
            (suitable for plotting)
        "expected_cdf" (dict):
            The cumulative density function expected based on the
            partition object, a dict containing 'x' values and
            cdf_values (suitable for plotting)
        "observed_partition" (dict):
            The partition observed on the data, using the provided
            bins but also expanding from min(column) to max(column)
        "expected_partition" (dict):
            The partition expected from the data. For KS test,
            this will always be the partition_object parameter
    }
}
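The piecewise-linear expected CDF described for this expectation can be sketched as follows. This is a plain-Python illustration under stated assumptions: the partition object is taken to hold n + 1 bin edges in "bins" and n weights summing to 1 in "weights", and `expected_cdf` is a hypothetical helper, not a library function.

```python
from itertools import accumulate


def expected_cdf(partition_object):
    """Return a function that evaluates the piecewise-linear expected CDF."""
    bins = partition_object["bins"]        # n + 1 bin edges (assumed layout)
    weights = partition_object["weights"]  # n bin weights summing to 1
    cdf_values = [0.0] + list(accumulate(weights))

    def cdf(x):
        if x <= bins[0]:
            return 0.0
        if x >= bins[-1]:
            return 1.0
        for i in range(len(weights)):
            left, right = bins[i], bins[i + 1]
            if left <= x <= right:
                # Within a bin the weight accrues linearly (uniform density).
                fraction = (x - left) / (right - left)
                return cdf_values[i] + fraction * weights[i]

    return cdf


partition = {"bins": [0, 10, 20], "weights": [0.25, 0.75]}
cdf = expected_cdf(partition)
print(cdf(5), cdf(10), cdf(15))  # 0.125 0.25 0.625

# Default bootstrap_sample_size per the parameter description above.
default_sample_size = 2 * len(partition["weights"])
print(default_sample_size)       # 4
```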
expect_column_kl_divergence_to_be_less_than(column, partition_object=None, threshold=None, tail_weight_holdout=0, internal_weight_holdout=0, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the Kullback-Leibler (KL) divergence (relative entropy) of the specified column with respect to the partition object to be lower than the provided threshold.

KL divergence compares two distributions. The higher the divergence value (relative entropy), the larger the difference between the two distributions. A relative entropy of zero indicates that the data are distributed identically, when binned according to the provided partition.

In many practical contexts, choosing a threshold between 0.5 and 1 will provide a useful test.

This expectation works on both categorical and continuous partitions. See notes below for details.

expect_column_kl_divergence_to_be_less_than is a column_aggregate_expectation.

Parameters
  • column (str) – The column name.

  • partition_object (dict) – The expected partition object (see Partition Objects).

  • threshold (float) – The maximum KL divergence for which to return success=True. If KL divergence is larger than the provided threshold, the test will return success=False.

Keyword Arguments
  • internal_weight_holdout (float between 0 and 1 or None) – The amount of weight to split uniformly among zero-weighted partition bins. internal_weight_holdout provides a mechanism to make the test less strict by assigning positive weights to values observed in the data for which the partition explicitly expected zero weight. With no internal_weight_holdout, any value observed in such a region will cause KL divergence to rise to +Infinity. Defaults to 0.

  • tail_weight_holdout (float between 0 and 1 or None) – The amount of weight to add to the tails of the histogram. Tail weight holdout is split evenly between (-Infinity, min(partition_object['bins'])) and (max(partition_object['bins']), +Infinity). tail_weight_holdout provides a mechanism to make the test less strict by assigning positive weights to values observed in the data that are not present in the partition. With no tail_weight_holdout, any value observed outside the provided partition_object will cause KL divergence to rise to +Infinity. Defaults to 0.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
  "observed_value": (float) The true KL divergence (relative entropy), or None if the value is calculated as infinity, -infinity, or NaN
  "details": {
    "observed_partition": (dict) The partition observed in the data
    "expected_partition": (dict) The partition against which the data were compared,
                            after applying specified weight holdouts.
  }
}

If the partition_object is categorical, this expectation will expect the values in column to also be categorical.

  • If the column includes values that are not present in the partition, the tail_weight_holdout will be equally split among those values, providing a mechanism to weaken the strictness of the expectation (otherwise, relative entropy would immediately go to infinity).

  • If the partition includes values that are not present in the column, the test will simply include zero weight for that value.

If the partition_object is continuous, this expectation will discretize the values in the column according to the bins specified in the partition_object, and apply the test to the resulting distribution.

  • The internal_weight_holdout and tail_weight_holdout parameters provide a mechanism to weaken the expectation, since an expected weight of zero would drive relative entropy to be infinite if any data are observed in that interval.

  • If internal_weight_holdout is specified, that value will be distributed equally among any intervals with weight zero in the partition_object.

  • If tail_weight_holdout is specified, that value will be appended to the tails of the bins: (-Infinity, min(bins)) and (max(bins), Infinity).

If relative entropy (KL divergence) goes to infinity for any of the reasons mentioned above, the observed value will be set to None. This is because inf, -inf, and NaN are not JSON-serializable and cause some JSON parsers to crash when encountered. The Python None token will be serialized to null in JSON.
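A minimal plain-Python sketch of the behavior described in these notes: relative entropy computed over binned weights, None returned when the value is not finite, and a holdout relaxing a zero-weight bin. The function names are hypothetical, treating the observed weights as p and the expected weights as q is an assumption, and so is the rescaling of nonzero weights by (1 - internal_weight_holdout).

```python
import json
import math


def kl_divergence(observed_weights, expected_weights):
    """Relative entropy sum(p * log(p / q)), or None when it is not finite."""
    total = 0.0
    for p, q in zip(observed_weights, expected_weights):
        if p == 0:
            continue       # a bin with no observed data contributes nothing
        if q == 0:
            return None    # data in a zero-weight bin: divergence is infinite
        total += p * math.log(p / q)
    return total if math.isfinite(total) else None


def apply_internal_holdout(weights, internal_weight_holdout):
    """Spread the holdout evenly over zero-weight bins (assumed rescaling)."""
    zero_bins = sum(1 for w in weights if w == 0)
    if zero_bins == 0 or internal_weight_holdout == 0:
        return list(weights)
    return [
        internal_weight_holdout / zero_bins if w == 0
        else w * (1 - internal_weight_holdout)
        for w in weights
    ]


observed = [0.75, 0.25]
expected = [1.0, 0.0]

print(kl_divergence(observed, expected))          # None
relaxed = apply_internal_holdout(expected, 0.25)  # [0.75, 0.25]
print(kl_divergence(observed, relaxed))           # 0.0
print(json.dumps({"observed_value": None}))       # {"observed_value": null}
```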

expect_column_pair_values_to_be_equal(column_A, column_B, ignore_row_if='both_values_are_missing', result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the values in column A to be the same as column B.

Parameters
  • column_A (str) – The first column name

  • column_B (str) – The second column name

Keyword Arguments

ignore_row_if (str) – “both_values_are_missing”, “either_value_is_missing”, “never”

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_pair_values_A_to_be_greater_than_B(column_A, column_B, or_equal=None, parse_strings_as_datetimes=False, allow_cross_type_comparisons=None, ignore_row_if='both_values_are_missing', result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect values in column A to be greater than column B.

Parameters
  • column_A (str) – The first column name

  • column_B (str) – The second column name

  • or_equal (boolean or None) – If True, then values can be equal, not strictly greater

Keyword Arguments
  • allow_cross_type_comparisons (boolean or None) – If True, allow comparisons between types (e.g. integer and string). Otherwise, attempting such comparisons will raise an exception.

  • ignore_row_if (str) – “both_values_are_missing”, “either_value_is_missing”, “never”

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_pair_values_to_be_in_set(column_A, column_B, value_pairs_set, ignore_row_if='both_values_are_missing', result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect paired values from columns A and B to belong to a set of valid pairs.

Parameters
  • column_A (str) – The first column name

  • column_B (str) – The second column name

  • value_pairs_set (list of tuples) – All the valid pairs to be matched

Keyword Arguments

ignore_row_if (str) – “both_values_are_missing”, “either_value_is_missing”, “never”

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.
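The three ignore_row_if options can be illustrated with a plain-Python sketch over two lists. `pair_values_in_set` is a hypothetical helper, None stands in for a missing value, and the result dictionary is simplified; none of this is the library's implementation.

```python
def pair_values_in_set(col_a, col_b, value_pairs_set,
                       ignore_row_if="both_values_are_missing"):
    """Collect row pairs that are not among the valid pairs."""
    valid = set(value_pairs_set)
    unexpected = []
    for a, b in zip(col_a, col_b):
        if ignore_row_if == "both_values_are_missing":
            skip = a is None and b is None
        elif ignore_row_if == "either_value_is_missing":
            skip = a is None or b is None
        elif ignore_row_if == "never":
            skip = False
        else:
            raise ValueError(f"Unknown value of ignore_row_if: {ignore_row_if}")
        if not skip and (a, b) not in valid:
            unexpected.append((a, b))
    return {"success": not unexpected, "unexpected_list": unexpected}


col_a = ["x", "y", None, None]
col_b = [1, 2, 3, None]
pairs = [("x", 1), ("y", 2)]

print(pair_values_in_set(col_a, col_b, pairs))
# {'success': False, 'unexpected_list': [(None, 3)]}
print(pair_values_in_set(col_a, col_b, pairs, "either_value_is_missing"))
# {'success': True, 'unexpected_list': []}
print(pair_values_in_set(col_a, col_b, pairs, "never"))
# {'success': False, 'unexpected_list': [(None, 3), (None, None)]}
```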

expect_multicolumn_values_to_be_unique(column_list, ignore_row_if='all_values_are_missing', result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the values for each row to be unique across the columns listed.

Parameters

column_list (tuple or list) – The names of the columns to compare

Keyword Arguments

ignore_row_if (str) – “all_values_are_missing”, “any_value_is_missing”, “never”

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.
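Row-wise uniqueness across columns can be sketched in plain Python as below. The helper is hypothetical, rows are given as tuples with None for missing values, and the treatment of rows that contain some missing values is an assumption of this sketch.

```python
def multicolumn_values_unique(rows, ignore_row_if="all_values_are_missing"):
    """Check that the values within each row are distinct across the columns."""
    unexpected = []
    for row in rows:
        missing = [v is None for v in row]
        if ignore_row_if == "all_values_are_missing":
            skip = all(missing)
        elif ignore_row_if == "any_value_is_missing":
            skip = any(missing)
        elif ignore_row_if == "never":
            skip = False
        else:
            raise ValueError(f"Unknown value of ignore_row_if: {ignore_row_if}")
        if not skip and len(set(row)) != len(row):
            unexpected.append(row)
    return {"success": not unexpected, "unexpected_list": unexpected}


# Each tuple holds one row's values for the columns in column_list.
rows = [(1, 2, 3), (1, 1, 3), (None, None, None), (None, 2, 2)]

print(multicolumn_values_unique(rows)["unexpected_list"])
# [(1, 1, 3), (None, 2, 2)]
print(multicolumn_values_unique(rows, "any_value_is_missing")["unexpected_list"])
# [(1, 1, 3)]
```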

great_expectations.dataset.pandas_dataset

class great_expectations.dataset.pandas_dataset.MetaPandasDataset(*args, **kwargs)

Bases: great_expectations.dataset.dataset.Dataset

MetaPandasDataset is a thin layer between Dataset and PandasDataset.

This two-layer inheritance is required to make @classmethod decorators work.

Practically speaking, that means that MetaPandasDataset implements expectation decorators, like column_map_expectation and column_aggregate_expectation, and PandasDataset implements the expectation methods themselves.

classmethod column_map_expectation(func)

Constructs an expectation using column-map semantics.

The MetaPandasDataset implementation replaces the “column” parameter supplied by the user with a pandas Series object containing the actual column from the relevant pandas dataframe. This simplifies the implementing expectation logic while preserving the standard Dataset signature and expected behavior.

See column_map_expectation for full documentation of this function.
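The decorator pattern described here can be sketched without pandas. In this toy version a dict of lists stands in for the dataframe, and the decorator swaps the column name for the column's values, drops nulls, and evaluates mostly. `ToyDataset` and this simplified `column_map_expectation` are hypothetical; the real decorator is a classmethod and handles result formatting, which is not reproduced.

```python
from functools import wraps


def column_map_expectation(func):
    """Toy decorator: resolve the column, drop nulls, then apply `mostly`."""
    @wraps(func)
    def wrapper(self, column, *args, mostly=None, **kwargs):
        series = self.data[column]                    # column name -> values
        nonnull = [v for v in series if v is not None]
        flags = func(self, nonnull, *args, **kwargs)  # one boolean per value
        unexpected = [v for v, ok in zip(nonnull, flags) if not ok]
        if not nonnull:
            success = True  # Assumption: nothing to check passes vacuously.
        elif mostly is None:
            success = not unexpected
        else:
            success = (len(nonnull) - len(unexpected)) / len(nonnull) >= mostly
        return {
            "success": success,
            "result": {
                "element_count": len(series),
                "nonnull_count": len(nonnull),
                "unexpected_list": unexpected,
            },
        }
    return wrapper


class ToyDataset:
    """Hypothetical stand-in for a DataFrame-backed dataset."""

    def __init__(self, data):
        self.data = data  # dict mapping column name to a list of values

    @column_map_expectation
    def expect_column_values_to_be_in_set(self, column, value_set):
        # Receives the column's values, not its name.
        return [v in value_set for v in column]


ds = ToyDataset({"grade": ["a", "b", "z", None]})
strict = ds.expect_column_values_to_be_in_set("grade", {"a", "b"})
lenient = ds.expect_column_values_to_be_in_set("grade", {"a", "b"}, mostly=0.6)
print(strict["success"], lenient["success"])  # False True
```

The expectation body stays a one-line row-wise check, while null handling and the mostly threshold live in the decorator.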

classmethod column_pair_map_expectation(func)

The column_pair_map_expectation decorator handles boilerplate issues surrounding the common pattern of evaluating truthiness of some condition on a per-row basis across a pair of columns.

classmethod multicolumn_map_expectation(func)

The multicolumn_map_expectation decorator handles boilerplate issues surrounding the common pattern of evaluating truthiness of some condition on a per-row basis across a set of columns.

class great_expectations.dataset.pandas_dataset.PandasDataset(*args, **kwargs)

Bases: great_expectations.dataset.pandas_dataset.MetaPandasDataset, pandas.core.frame.DataFrame

PandasDataset instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.

For the full API reference, please see Dataset

Notes

  1. Samples and subsets of a PandasDataset have ALL the expectations of the original data frame unless the user specifies the discard_subset_failing_expectations = True property on the original data frame.

  2. Concatenations, joins, and merges of PandasDatasets contain NO expectations (since no autoinspection is performed by default).

get_row_count()

Returns: int, table row count

get_table_columns()

Returns: List[str], list of column names

get_column_sum(column)

Returns: float

get_column_max(column, parse_strings_as_datetimes=False)

Returns: any

get_column_min(column, parse_strings_as_datetimes=False)

Returns: any

get_column_mean(column)

Returns: float

get_column_nonnull_count(column)

Returns: int

get_column_value_counts(column)

Returns: pd.Series of value counts for a column, sorted by value

get_column_unique_count(column)

Returns: int

get_column_modes(column)

Returns: List[any], list of modes (ties OK)

get_column_median(column)

Returns: any

get_column_quantiles(column, quantiles)

Get the values in column closest to the requested quantiles.

Parameters
  • column (string) – The name of the column.

  • quantiles (tuple of float) – The quantiles to return. quantiles must be a tuple to ensure caching is possible.

Returns

the nearest values in the dataset to those quantiles

Return type

List[any]

get_column_stdev(column)

Returns: float

get_column_hist(column, bins)

Get a histogram of column values.

Parameters
  • column – The column for which to generate the histogram.

  • bins (tuple) – The bins to slice the histogram. bins must be a tuple to ensure caching is possible.

Returns: List[int], a list of counts corresponding to bins
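Why bins must be a tuple can be shown with a small caching sketch. It uses `functools.lru_cache` as an assumed stand-in for whatever caching the library applies, a module-level dict in place of the dataframe, and a binning convention borrowed from numpy.histogram; all of these are assumptions for illustration.

```python
from functools import lru_cache

DATA = {"age": [1, 5, 10, 15, 20, 20]}  # hypothetical stand-in for a dataframe


@lru_cache(maxsize=None)
def get_column_hist(column, bins):
    """Count values per bin; bins is a tuple of edges, so it is hashable."""
    counts = [0] * (len(bins) - 1)
    last = len(counts) - 1
    for v in DATA[column]:
        for i in range(len(counts)):
            # Assumption: bins are half-open [left, right), except the last,
            # which also includes its right edge.
            upper_ok = v <= bins[i + 1] if i == last else v < bins[i + 1]
            if bins[i] <= v and upper_ok:
                counts[i] += 1
                break
    return counts


print(get_column_hist("age", (0, 10, 20)))  # [2, 4]
print(get_column_hist("age", (0, 10, 20)))  # [2, 4], served from the cache

try:
    get_column_hist("age", [0, 10, 20])     # a list cannot be a cache key
except TypeError as err:
    print("TypeError:", err)
```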

get_column_count_in_range(column, min_val=None, max_val=None, strict_min=False, strict_max=True)

Returns: int

expect_column_values_to_not_match_regex_list(column, regex_list, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)
Expect the column entries to be strings that do not match any of a list of regular expressions. Matches can be anywhere in the string.

expect_column_values_to_not_match_regex_list is a column_map_expectation.

Args:

column (str): The column name. regex_list (list): The list of regular expressions which the column entries should not match

Keyword Args:

mostly (None or a float between 0 and 1): Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters:

result_format (str or None): Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format. include_config (boolean): If True, then include the expectation config as part of the result object. For more detail, see include_config. catch_exceptions (boolean or None): If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions. meta (dict or None): A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

See Also:

expect_column_values_to_match_regex_list
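
The row-wise rule can be sketched on a plain list of strings. Because matches may occur anywhere in the string, the sketch uses re.search rather than re.match. The function name, the shape of the returned dictionary, and the handling of mostly as a fraction of non-null values are simplifying assumptions, not the library's implementation.

```python
import re


def values_not_matching_any(values, regex_list, mostly=None):
    """Succeed when enough non-null values match none of the patterns."""
    nonnull = [v for v in values if v is not None]
    # A value is unexpected if any pattern matches anywhere inside it.
    unexpected = [v for v in nonnull
                  if any(re.search(rx, v) for rx in regex_list)]
    if nonnull:
        fraction_ok = (len(nonnull) - len(unexpected)) / len(nonnull)
    else:
        fraction_ok = 1.0
    threshold = 1.0 if mostly is None else mostly
    return {"success": fraction_ok >= threshold, "unexpected_list": unexpected}


column = ["apple", "banana", "cherry", None, "date"]
print(values_not_matching_any(column, [r"^b", r"rr"]))
print(values_not_matching_any(column, [r"^b", r"rr"], mostly=0.5))
```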

expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than(column, distribution, p_value=0.05, params=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column values to be distributed similarly to a scipy distribution. This expectation compares the provided column to the specified continuous distribution with a parametric Kolmogorov-Smirnov test. The K-S test compares the provided column to the cumulative distribution function (CDF) of the specified scipy distribution. If you don’t know the desired distribution shape parameters, use the ge.dataset.util.infer_distribution_parameters() utility function to estimate them.

It returns ‘success’=True if the p-value from the K-S test is greater than or equal to the provided p-value.

expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than is a column_aggregate_expectation.

Parameters
  • column (str) – The column name.

  • distribution (str) –

    The scipy distribution name. See: https://docs.scipy.org/doc/scipy/reference/stats.html Currently supported distributions are listed in the Notes section below.

  • p_value (float) – The threshold p-value for a passing test. Default is 0.05.

  • params (dict or list) –

    A dictionary or positional list of shape parameters that describe the distribution you want to test the data against. Include key values specific to the distribution from the appropriate scipy distribution CDF function. ‘loc’ and ‘scale’ are used as translational parameters. See https://docs.scipy.org/doc/scipy/reference/stats.html#continuous-distributions

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

Notes

These fields in the result object are customized for this expectation:

{
    "details":
        "expected_params" (dict): The specified or inferred parameters of the distribution to test against
        "ks_results" (dict): The raw result of stats.kstest()
}
  • The Kolmogorov-Smirnov test’s null hypothesis is that the column is similar to the provided distribution.

  • Supported scipy distributions:

    • norm

    • beta

    • gamma

    • uniform

    • chi2

    • expon
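
To make the test concrete, the following sketch computes the K-S statistic (the largest gap between the empirical CDF and a reference CDF) and a large-sample approximation of the p-value from the Kolmogorov distribution. It is a standard-library illustration only: the expectation itself relies on scipy's stats.kstest, whose p-value computation can differ, especially for small samples. The function names and the example CDFs are assumptions.

```python
import math


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))


def norm_cdf(x, mean=0.0, std_dev=1.0):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std_dev * math.sqrt(2.0))))


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of `data` and `cdf`."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Check the gap just after and just before the empirical CDF steps up.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_p_value_asymptotic(d, n, terms=100):
    """Large-sample approximation of the K-S p-value."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here and the value is ~1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


sample = [0.1, 0.3, 0.5, 0.7, 0.9]
d = ks_statistic(sample, uniform_cdf)
p = ks_p_value_asymptotic(d, len(sample))
print(d, p, p >= 0.05)  # the last value mirrors the expectation's success rule
```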

expect_column_pair_values_to_be_equal(column_A, column_B, ignore_row_if='both_values_are_missing', result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the values in column A to be the same as column B.

Parameters
  • column_A (str) – The first column name

  • column_B (str) – The second column name

Keyword Arguments

ignore_row_if (str) – “both_values_are_missing”, “either_value_is_missing”, “never”

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.
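
The effect of ignore_row_if can be sketched on two plain lists, using None for a missing value. The two skip modes named above are handled explicitly; with any other setting every row is compared. The function name and the returned dictionary are illustrative assumptions.

```python
def pair_values_equal(col_a, col_b, ignore_row_if="both_values_are_missing"):
    """Compare two columns row by row, optionally skipping rows with missing values."""
    unexpected = []
    for a, b in zip(col_a, col_b):
        a_missing, b_missing = a is None, b is None
        if ignore_row_if == "both_values_are_missing" and a_missing and b_missing:
            continue  # row is ignored entirely
        if ignore_row_if == "either_value_is_missing" and (a_missing or b_missing):
            continue  # row is ignored entirely
        if a != b:
            unexpected.append((a, b))
    return {"success": not unexpected, "unexpected_list": unexpected}


col_a = [1, 2, None, None]
col_b = [1, 3, None, 4]
print(pair_values_equal(col_a, col_b))
print(pair_values_equal(col_a, col_b, ignore_row_if="either_value_is_missing"))
```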

expect_column_pair_values_A_to_be_greater_than_B(column_A, column_B, or_equal=None, parse_strings_as_datetimes=None, allow_cross_type_comparisons=None, ignore_row_if='both_values_are_missing', result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect values in column A to be greater than column B.

Parameters
  • column_A (str) – The first column name

  • column_B (str) – The second column name

  • or_equal (boolean or None) – If True, then values can be equal, not strictly greater

Keyword Arguments
  • allow_cross_type_comparisons (boolean or None) – If True, allow comparisons between types (e.g. integer and string). Otherwise, attempting such comparisons will raise an exception.

  • ignore_row_if (str) – “both_values_are_missing”, “either_value_is_missing”, “never”

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_column_pair_values_to_be_in_set(column_A, column_B, value_pairs_set, ignore_row_if='both_values_are_missing', result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect paired values from columns A and B to belong to a set of valid pairs.

Parameters
  • column_A (str) – The first column name

  • column_B (str) – The second column name

  • value_pairs_set (list of tuples) – All the valid pairs to be matched

Keyword Arguments

ignore_row_if (str) – “both_values_are_missing”, “either_value_is_missing”, “never”

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

expect_multicolumn_values_to_be_unique(column_list, ignore_row_if='all_values_are_missing', result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the values for each row to be unique across the columns listed.

Parameters

column_list (tuple or list) – The names of the columns to compare

Keyword Arguments

ignore_row_if (str) – “all_values_are_missing”, “any_value_is_missing”, “never”

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.
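
A minimal sketch of the per-row check, assuming rows are dictionaries keyed by column name and None marks a missing value. A row passes when no value is repeated across the listed columns. The helper name is hypothetical.

```python
def multicolumn_values_unique(rows, column_list,
                              ignore_row_if="all_values_are_missing"):
    """Flag rows in which any value repeats across the listed columns."""
    unexpected = []
    for row in rows:
        values = [row.get(c) for c in column_list]
        missing = [v is None for v in values]
        if ignore_row_if == "all_values_are_missing" and all(missing):
            continue
        if ignore_row_if == "any_value_is_missing" and any(missing):
            continue
        # A set drops duplicates, so a shorter set means a repeated value.
        if len(set(values)) != len(values):
            unexpected.append(row)
    return {"success": not unexpected, "unexpected_list": unexpected}


rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 1, "c": 3},
    {"a": None, "b": None, "c": None},
]
print(multicolumn_values_unique(rows, ["a", "b", "c"]))
```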

great_expectations.dataset.sqlalchemy_dataset

class great_expectations.dataset.sqlalchemy_dataset.MetaSqlAlchemyDataset(*args, **kwargs)

Bases: great_expectations.dataset.dataset.Dataset

classmethod column_map_expectation(func)

For SqlAlchemy, this decorator allows individual column_map_expectations to simply return the filter that describes the expected condition on their data.

The decorator will then use that filter to obtain unexpected elements, relevant counts, and return the formatted object.

class great_expectations.dataset.sqlalchemy_dataset.SqlAlchemyDataset(table_name=None, engine=None, connection_string=None, custom_sql=None, schema=None, *args, **kwargs)

Bases: great_expectations.dataset.sqlalchemy_dataset.MetaSqlAlchemyDataset

classmethod from_dataset(dataset=None)

This base implementation naively passes arguments on to the real constructor, which is suitable only when the constructor knows how to accept its own type. In general, this method should be overridden.

head(n=5)

Returns a PandasDataset with the first n rows of the given Dataset

get_row_count()

Returns: int, table row count

get_table_columns()

Returns: List[str], list of column names

get_column_nonnull_count(column)

Returns: int

get_column_sum(column)

Returns: float

get_column_max(column, parse_strings_as_datetimes=False)

Returns: any

get_column_min(column, parse_strings_as_datetimes=False)

Returns: any

get_column_value_counts(column)

Returns: pd.Series of value counts for a column, sorted by value

get_column_mean(column)

Returns: float

get_column_unique_count(column)

Returns: int

get_column_median(column)

Returns: any

get_column_quantiles(column, quantiles)

Get the values in column closest to the requested quantiles :param column: name of column :type column: string :param quantiles: the quantiles to return. quantiles must be a tuple to ensure caching is possible :type quantiles: tuple of float

Returns

the nearest values in the dataset to those quantiles

Return type

List[any]

get_column_stdev(column)

Returns: float

get_column_hist(column, bins)

Returns a list of counts corresponding to bins.

Parameters
  • column – the name of the column for which to get the histogram

  • bins – tuple of bin edges for which to get histogram values; must be tuple to support caching

get_column_count_in_range(column, min_val=None, max_val=None, strict_min=False, strict_max=True)

Returns: int

create_temporary_table(table_name, custom_sql)

Create a temporary table based on a SQL query. This will be used as a basis for executing expectations.

WARNING: this feature is new in v0.4. It hasn’t been tested in all SQL dialects, and may change based on community feedback.

Parameters

custom_sql – The SQL query on which the temporary table is based

column_reflection_fallback()

If we can’t reflect the table, use a query to at least get column names.

expect_column_values_to_not_match_regex_list(column, regex_list, mostly=None, result_format=None, include_config=False, catch_exceptions=None, meta=None)

Expect the column entries to be strings that do not match any of a list of regular expressions. Matches can be anywhere in the string.

expect_column_values_to_not_match_regex_list is a column_map_expectation.

Parameters
  • column (str) – The column name.

  • regex_list (list) – The list of regular expressions which the column entries should not match

Keyword Arguments

mostly (None or a float between 0 and 1) – Return “success”: True if at least mostly percent of values match the expectation. For more detail, see mostly.

Other Parameters
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.

  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.

  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.

  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.

Returns

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format and include_config, catch_exceptions, and meta.

great_expectations.dataset.sparkdf_dataset

class great_expectations.dataset.sparkdf_dataset.MetaSparkDFDataset(*args, **kwargs)

Bases: great_expectations.dataset.dataset.Dataset

MetaSparkDFDataset is a thin layer between Dataset and SparkDFDataset. This two-layer inheritance is required to make @classmethod decorators work. Practically speaking, that means that MetaSparkDFDataset implements expectation decorators, like column_map_expectation and column_aggregate_expectation, and SparkDFDataset implements the expectation methods themselves.

classmethod column_map_expectation(func)

Constructs an expectation using column-map semantics.

The MetaSparkDFDataset implementation replaces the “column” parameter supplied by the user with a Spark Dataframe with the actual column data. The current approach for functions implementing expectation logic is to append a column named “__success” to this dataframe and return to this decorator.

See column_map_expectation for full documentation of this function.

class great_expectations.dataset.sparkdf_dataset.SparkDFDataset(spark_df, *args, **kwargs)

Bases: great_expectations.dataset.sparkdf_dataset.MetaSparkDFDataset

This class holds an attribute spark_df which is a spark.sql.DataFrame.

classmethod from_dataset(dataset=None)

This base implementation naively passes arguments on to the real constructor, which is suitable only when the constructor knows how to accept its own type. In general, this method should be overridden.

head(n=5)

Returns a PandasDataset with the first n rows of the given Dataset

get_row_count()

Returns: int, table row count

get_table_columns()

Returns: List[str], list of column names

get_column_nonnull_count(column)

Returns: int

get_column_mean(column)

Returns: float

get_column_sum(column)

Returns: float

get_column_max(column, parse_strings_as_datetimes=False)

Returns: any

get_column_min(column, parse_strings_as_datetimes=False)

Returns: any

get_column_value_counts(column)

Returns: pd.Series of value counts for a column, sorted by value

get_column_unique_count(column)

Returns: int

get_column_modes(column)

Leverages the computation done in _get_column_value_counts.

get_column_median(column)

Returns: any

get_column_quantiles(column, quantiles)

Get the values in column closest to the requested quantiles :param column: name of column :type column: string :param quantiles: the quantiles to return. quantiles must be a tuple to ensure caching is possible :type quantiles: tuple of float

Returns

the nearest values in the dataset to those quantiles

Return type

List[any]

get_column_stdev(column)

Returns: float

get_column_hist(column, bins)

Returns a list of counts corresponding to bins.

get_column_count_in_range(column, min_val=None, max_val=None, strict_min=False, strict_max=True)

Returns: int

great_expectations.dataset.util

great_expectations.dataset.util.is_valid_partition_object(partition_object)

Tests whether a given object is a valid continuous or categorical partition object.

Parameters

partition_object – The partition_object to evaluate

Returns

Boolean

great_expectations.dataset.util.is_valid_categorical_partition_object(partition_object)

Tests whether a given object is a valid categorical partition object.

Parameters

partition_object – The partition_object to evaluate

Returns

Boolean

great_expectations.dataset.util.is_valid_continuous_partition_object(partition_object)

Tests whether a given object is a valid continuous partition object. See Partition Objects.

Parameters

partition_object – The partition_object to evaluate

Returns

Boolean

great_expectations.dataset.util.categorical_partition_data(data)

Convenience method for creating weights from categorical data.

Parameters

data (list-like) – The data from which to construct the estimate.

Returns

A new partition object:

{
    "values": (list) The categorical values present in the data
    "weights": (list) The weights of the values in the partition.
}

See Partition Objects.
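
A rough sketch of how such a partition object could be assembled from a list, assuming that each weight is the relative frequency of its value and that null values are skipped. The function name categorical_partition is hypothetical.

```python
from collections import Counter


def categorical_partition(data):
    """Build {"values": [...], "weights": [...]} from categorical data."""
    counts = Counter(v for v in data if v is not None)
    total = sum(counts.values())
    if total == 0:
        return {"values": [], "weights": []}
    values = sorted(counts)
    # Each weight is the share of observations taking that value.
    return {"values": values, "weights": [counts[v] / total for v in values]}


print(categorical_partition(["a", "b", "a", "c", "a", "b"]))
```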

great_expectations.dataset.util.kde_partition_data(data, estimate_tails=True)

Convenience method for building a partition and weights using a gaussian Kernel Density Estimate and default bandwidth.

Parameters
  • data (list-like) – The data from which to construct the estimate

  • estimate_tails (bool) – Whether to estimate the tails of the distribution to keep the partition object finite

Returns

A new partition_object:

{
    "bins": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}

See Partition Objects.

great_expectations.dataset.util.partition_data(data, bins='auto', n_bins=10)
great_expectations.dataset.util.continuous_partition_data(data, bins='auto', n_bins=10, **kwargs)

Convenience method for building a partition object on continuous data

Parameters
  • data (list-like) – The data from which to construct the estimate.

  • bins (string) – One of ‘uniform’ (for uniformly spaced bins), ‘ntile’ (for percentile-spaced bins), or ‘auto’ (for automatically spaced bins)

  • n_bins (int) – Ignored if bins is auto.

  • kwargs (mapping) – Additional keyword arguments to be passed to numpy histogram

Returns

A new partition_object:

{
    "bins": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}
See Partition Objects.
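
For the ‘uniform’ case, a partition object can be sketched with equally wide bins between the minimum and maximum of the data. This is an illustration only: the function name is hypothetical, weights are taken here as the fraction of observations per bin, and the library delegates the binning itself to numpy's histogram.

```python
def uniform_partition(data, n_bins=10):
    """Build a partition object with n_bins equally wide bins."""
    values = [v for v in data if v is not None]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # n_bins bins need n_bins + 1 edges; the last edge is the maximum itself.
    bins = [lo + i * width for i in range(n_bins)] + [hi]
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin.
        idx = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
        counts[idx] += 1
    return {"bins": bins, "weights": [c / len(values) for c in counts]}


print(uniform_partition(list(range(11)), n_bins=5))
```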

great_expectations.dataset.util.build_continuous_partition_object(dataset, column, bins='auto', n_bins=10)

Convenience method for building a partition object on continuous data from a dataset and column

Parameters
  • dataset (GE Dataset) – the dataset for which to compute the partition

  • column (string) – The name of the column for which to construct the estimate.

  • bins (string) – One of ‘uniform’ (for uniformly spaced bins), ‘ntile’ (for percentile-spaced bins), or ‘auto’ (for automatically spaced bins)

  • n_bins (int) – Ignored if bins is auto.

Returns

A new partition_object:

{
    "bins": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}
See Partition Objects.

great_expectations.dataset.util.infer_distribution_parameters(data, distribution, params=None)

Convenience method for determining the shape parameters of a given distribution

Parameters
  • data (list-like) – The data to build shape parameters from.

  • distribution (string) – Scipy distribution, determines which parameters to build.

  • params (dict or None) – The known parameters. Parameters given here will not be altered. Keep as None to infer all necessary parameters from the data.

Returns

A dictionary of named parameters:

{
    "mean": (float),
    "std_dev": (float),
    "loc": (float),
    "scale": (float),
    "alpha": (float),
    "beta": (float),
    "min": (float),
    "max": (float),
    "df": (float)
}

See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
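
The idea of filling in only the missing parameters can be sketched for the ‘norm’ case. The function name is hypothetical, and using the sample standard deviation is an assumption; the estimator used by the library may differ.

```python
import statistics


def infer_norm_parameters(data, params=None):
    """Fill in "mean" and "std_dev" from data, keeping any known parameters."""
    inferred = dict(params or {})  # copy so the caller's dict is not modified
    if "mean" not in inferred:
        inferred["mean"] = statistics.mean(data)
    if "std_dev" not in inferred:
        inferred["std_dev"] = statistics.stdev(data)
    return inferred


print(infer_norm_parameters([38, 40, 42]))
print(infer_norm_parameters([38, 40, 42], params={"mean": 41}))
```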

great_expectations.dataset.util.validate_distribution_parameters(distribution, params)

Ensures that necessary parameters for a distribution are present and that all parameters are sensible.

If parameters necessary to construct a distribution are missing or invalid, this function raises ValueError with an informative description. Note that ‘loc’ and ‘scale’ are optional arguments, and that ‘scale’ must be positive.

Parameters
  • distribution (string) – The scipy distribution name, e.g. normal distribution is ‘norm’.

  • params (dict or list) –

    The distribution shape parameters in a named dictionary or positional list form following the scipy cdf argument scheme.

    params={'mean': 40, 'std_dev': 5} or params=[40, 5]

Raises

ValueError – With an informative description, usually when necessary parameters are omitted or are invalid.
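
A hedged sketch of this kind of validation for the ‘norm’ case, accepting either the dictionary or the positional-list form shown above. Which keys are required and the rule that std_dev must be positive are assumptions for illustration; only the positivity of ‘scale’ is stated in this entry.

```python
def validate_norm_parameters(params):
    """Raise ValueError if parameters for a normal distribution are unusable."""
    if isinstance(params, (list, tuple)):
        # Positional form: [mean, std_dev]
        params = dict(zip(("mean", "std_dev"), params))
    missing = [k for k in ("mean", "std_dev") if k not in params]
    if missing:
        raise ValueError("norm distributions require: " + ", ".join(missing))
    if params["std_dev"] <= 0:
        raise ValueError("std_dev must be positive")
    if "scale" in params and params["scale"] <= 0:
        raise ValueError("scale must be positive")
    return params


print(validate_norm_parameters([40, 5]))
```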

great_expectations.dataset.util.create_multiple_expectations(df, columns, expectation_type, *args, **kwargs)

Creates an identical expectation for each of the given columns with the specified arguments, if any.

Parameters
  • df (great_expectations.dataset) – A great expectations dataset object.

  • columns (list) – A list of column names represented as strings.

  • expectation_type (string) – The expectation type.

Raises
  • KeyError if the provided column does not exist.

  • AttributeError if the provided expectation type does not exist or df is not a valid great expectations dataset.

Returns

A list of expectation results.
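
The pattern can be sketched with getattr on a small stand-in class. ToyDataset and apply_to_columns are hypothetical names used only for illustration; they are not part of the library. An unknown expectation name raises AttributeError from getattr, and an unknown column raises KeyError.

```python
class ToyDataset:
    """Hypothetical stand-in for a Great Expectations dataset."""

    def __init__(self, data):
        self.data = data  # mapping of column name -> list of values

    def expect_column_values_to_not_be_null(self, column, **kwargs):
        return {"success": all(v is not None for v in self.data[column])}


def apply_to_columns(df, columns, expectation_type, *args, **kwargs):
    """Run the same expectation on each column and collect the results."""
    expectation = getattr(df, expectation_type)  # AttributeError if unknown
    results = []
    for column in columns:
        if column not in df.data:
            raise KeyError(column)
        results.append(expectation(column, *args, **kwargs))
    return results


df = ToyDataset({"a": [1, 2], "b": [1, None]})
print(apply_to_columns(df, ["a", "b"], "expect_column_values_to_not_be_null"))
```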