Data Asset Module

great_expectations.data_asset.base

class great_expectations.data_asset.base.DataAsset(*args, **kwargs)

class great_expectations.data_asset.base.ValidationStatistics(evaluated_expectations, successful_expectations, unsuccessful_expectations, success_percent, success)

Bases: tuple

__getnewargs__()

Return self as a plain tuple. Used by copy and pickle.

__getstate__()

Exclude the OrderedDict from pickling

__repr__()

Return a nicely formatted representation string

evaluated_expectations

Alias for field number 0

success

Alias for field number 4

success_percent

Alias for field number 3

successful_expectations

Alias for field number 1

unsuccessful_expectations

Alias for field number 2
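The field aliases above follow the usual named-tuple pattern. As a minimal sketch, the stand-in below is built with collections.namedtuple using the documented field order; it is an illustration, not the library's own class, and the sample values are made up.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented class (illustrative only).
ValidationStatistics = namedtuple(
    "ValidationStatistics",
    [
        "evaluated_expectations",
        "successful_expectations",
        "unsuccessful_expectations",
        "success_percent",
        "success",
    ],
)

# Hypothetical values for a run in which 3 of 4 expectations passed.
stats = ValidationStatistics(
    evaluated_expectations=4,
    successful_expectations=3,
    unsuccessful_expectations=1,
    success_percent=75.0,
    success=False,
)

# Each field name is an alias for a positional index (0 through 4).
same_value = stats.evaluated_expectations == stats[0]
```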

great_expectations.data_asset.file_data_asset

class great_expectations.data_asset.file_data_asset.MetaFileDataAsset(*args, **kwargs)

Bases: great_expectations.data_asset.base.DataAsset

MetaFileDataAsset is a thin layer above FileDataAsset. This two-layer inheritance is required to make @classmethod decorators work. In practice, MetaFileDataAsset implements expectation decorators, such as file_lines_map_expectation, and FileDataAsset implements the expectation methods themselves.

classmethod file_lines_map_expectation(func)

Constructs an expectation using file lines map semantics. The file_lines_map_expectation decorator handles the boilerplate surrounding the common pattern of evaluating the truthiness of some condition on a line-by-line basis in a file.

Parameters:func (function) – The function implementing an expectation that will be applied line by line across a file. The function should take a file and return information about how many lines met expectations.

Notes

Users can specify a skip value k that will cause the expectation function to disregard the first k lines of the file.

file_lines_map_expectation will add a kwarg _lines to the called function with the nonnull lines to process.

null_lines_regex defines a regex used to skip lines, but can be overridden.

See also

expect_file_line_regex_match_count_to_be_between for an example of a file_lines_map_expectation
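The decorator pattern described above can be sketched with the standard library alone. Everything in this block is an assumption made for illustration: the names file_lines_map_sketch and expect_lines_to_start_with are hypothetical, the wrapped function is assumed to return one boolean per line, and the result dictionary is a simplified stand-in for the library's result object.

```python
import re
from functools import wraps


def file_lines_map_sketch(func):
    """Simplified stand-in for a file-lines map decorator (not the library code)."""

    @wraps(func)
    def wrapper(path, *args, skip=None, mostly=None,
                null_lines_regex=r"^\s*$", **kwargs):
        with open(path) as handle:
            lines = handle.readlines()
        # Disregard the first `skip` lines of the file.
        lines = lines[skip or 0:]
        # Drop lines that the null regex marks as empty.
        if null_lines_regex is not None:
            null_re = re.compile(null_lines_regex)
            lines = [line for line in lines if not null_re.match(line)]
        # The wrapped function receives the surviving lines as `_lines`
        # and is assumed to return one boolean per line.
        outcomes = func(*args, _lines=lines, **kwargs)
        passed = sum(1 for ok in outcomes if ok)
        fraction = passed / len(lines) if lines else 1.0
        threshold = 1.0 if mostly is None else mostly
        return {"success": fraction >= threshold,
                "unexpected_count": len(lines) - passed}

    return wrapper


@file_lines_map_sketch
def expect_lines_to_start_with(prefix, _lines=None):
    # Hypothetical expectation: every line should start with `prefix`.
    return [line.startswith(prefix) for line in _lines]
```

The expectation body only states the per-line condition; skipping, null-line filtering, and the mostly threshold are handled once in the wrapper.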

class great_expectations.data_asset.file_data_asset.FileDataAsset(file_path=None, *args, **kwargs)

Bases: great_expectations.data_asset.file_data_asset.MetaFileDataAsset

FileDataAsset instantiates the great_expectations Expectations API as a subclass of a Python file object. For the full API reference, please see DataAsset.

expect_file_line_regex_match_count_to_be_between(*args, **kwargs)

Expect the number of times a regular expression appears on each line of a file to be between a maximum and minimum value.

Parameters:
  • regex – A string that can be compiled as a valid regular expression to match
  • expected_min_count (None or nonnegative integer) – Specifies the minimum number of times regex is expected to appear on each line of the file
  • expected_max_count (None or nonnegative integer) – Specifies the maximum number of times regex is expected to appear on each line of the file
Keyword Arguments:
 
  • skip (None or nonnegative integer) – The number of lines at the start of the file that the method should skip before assessing expectations
  • mostly (None or number between 0 and 1) – Specifies the fraction of lines that must meet the expectation. If the fraction of expected lines is at least mostly, the method returns true even if some lines do not match the expectation criteria.
  • null_lines_regex (valid regular expression or None) – If not none, a regex to skip lines as null. Defaults to empty or whitespace-only lines.
Other Parameters:
 
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.
  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.
  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.
  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.
  • _lines (list) – The lines over which to operate (provided by the file_lines_map_expectation decorator)
Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format, include_config, catch_exceptions, and meta.
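The per-line check behind this expectation can be sketched as follows. This is an illustration, not the library's code: line_match_counts_between is a hypothetical helper, and it assumes that bounds are inclusive and that a bound of None means "no limit".

```python
import re


def line_match_counts_between(lines, regex, expected_min_count=None,
                              expected_max_count=None):
    """Return one boolean per line: is the regex match count within bounds?"""
    pattern = re.compile(regex)
    results = []
    for line in lines:
        count = len(pattern.findall(line))
        # Assumption: None leaves that side of the range unbounded.
        above_min = expected_min_count is None or count >= expected_min_count
        below_max = expected_max_count is None or count <= expected_max_count
        results.append(above_min and below_max)
    return results


# Require exactly two commas per line (that is, three fields).
rows = ["a,b,c", "a,b", "a,b,c,d"]
flags = line_match_counts_between(rows, ",", 2, 2)
# flags == [True, False, False]
```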

expect_file_line_regex_match_count_to_equal(*args, **kwargs)

Expect the number of times a regular expression appears on each line of a file to equal a given value.

Parameters:
  • regex – A string that can be compiled as a valid regular expression to match
  • expected_count (None or nonnegative integer) – Specifies the number of times regex is expected to appear on each line of the file
Keyword Arguments:
 
  • skip (None or nonnegative integer) – The number of lines at the start of the file that the method should skip before assessing expectations
  • mostly (None or number between 0 and 1) – Specifies the fraction of lines that must meet the expectation. If the fraction of expected lines is at least mostly, the method returns true even if some lines do not match the expectation criteria.
  • null_lines_regex (valid regular expression or None) – If not none, a regex to skip lines as null. Defaults to empty or whitespace-only lines.
Other Parameters:
 
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.
  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.
  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.
  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.
  • _lines (list) – The lines over which to operate (provided by the file_lines_map_expectation decorator)
Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format, include_config, catch_exceptions, and meta.

expect_file_hash_to_equal(*args, **kwargs)

Expect computed file hash to equal some given value.

Parameters:

value – A string to compare with the computed hash value

Keyword Arguments:
 
  • hash_alg (string) – Indicates the hash algorithm to use
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.
  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.
  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.
  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.
Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format, include_config, catch_exceptions, and meta.
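Comparing a computed file hash with an expected value can be sketched with hashlib. The helper name file_hash_equals, the chunked read, and the sha256 default are assumptions of this sketch; the default algorithm used by the library is not stated in this reference.

```python
import hashlib


def file_hash_equals(path, value, hash_alg="sha256", chunk_size=65536):
    """Return True if the hex digest of the file at `path` equals `value`."""
    digest = hashlib.new(hash_alg)
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == value
```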

expect_file_size_to_be_between(*args, **kwargs)

Expect file size to be between a user-specified minsize and maxsize.

Parameters:
  • minsize (integer) – minimum expected file size
  • maxsize (integer) – maximum expected file size
Keyword Arguments:
 
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.
  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.
  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.
  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.
Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format, include_config, catch_exceptions, and meta.
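A size check of this kind can be sketched with os.path.getsize, which reports the size in bytes. The helper file_size_between is hypothetical; treating both bounds as inclusive and allowing maxsize=None for "no upper limit" are assumptions of the sketch, not documented behavior.

```python
import os


def file_size_between(path, minsize=0, maxsize=None):
    """Return True if the file size in bytes lies within [minsize, maxsize]."""
    size = os.path.getsize(path)
    if size < minsize:
        return False
    # Assumption: maxsize=None means there is no upper limit.
    return maxsize is None or size <= maxsize
```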

expect_file_to_exist(*args, **kwargs)

Checks to see if a file specified by the user actually exists.

Parameters:

filepath (str or None) – The filepath to evaluate. If None, will check the currently-configured path object of this FileDataAsset.

Keyword Arguments:
 
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.
  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.
  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.
  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.
Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format, include_config, catch_exceptions, and meta.

expect_file_to_have_valid_table_header(*args, **kwargs)

Checks to see if a file has a line with unique delimited values. Such a line may be used as a table header.

Keyword Arguments:
 
  • skip (nonnegative integer) – The number of lines at the start of the file that the method should skip before assessing expectations
  • regex (string) – A string that can be compiled as a valid regular expression. Used to specify the elements of the table header (the column headers)
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.
  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.
  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.
  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.
Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format, include_config, catch_exceptions, and meta.
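One way to read the header check is: skip the leading lines, split the next line on the regex, and require every resulting name to be unique. The helper has_valid_table_header below is a hypothetical sketch of that reading. It assumes the regex describes the delimiter between column names, which is one interpretation of the regex parameter above.

```python
import re


def has_valid_table_header(lines, regex, skip=0):
    """Return True if the first unskipped line splits into unique names."""
    remaining = lines[skip:]
    if not remaining:
        return False
    header = remaining[0].strip()
    # Assumption: `regex` matches the delimiter between column names.
    names = re.split(regex, header)
    return len(names) == len(set(names))
```

For example, with a comma delimiter, "id,name,age" passes and "id,name,id" fails because "id" repeats.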

expect_file_to_be_valid_json(*args, **kwargs)

Parameters:
  • schema (string) – Optional JSON schema file against which the JSON data file is validated
  • result_format (str or None) – Which output mode to use: BOOLEAN_ONLY, BASIC, COMPLETE, or SUMMARY. For more detail, see result_format.
  • include_config (boolean) – If True, then include the expectation config as part of the result object. For more detail, see include_config.
  • catch_exceptions (boolean or None) – If True, then catch exceptions and include them as part of the result object. For more detail, see catch_exceptions.
  • meta (dict or None) – A JSON-serializable dictionary (nesting allowed) that will be included in the output without modification. For more detail, see meta.
Returns:

A JSON-serializable expectation result object.

Exact fields vary depending on the values passed to result_format, include_config, catch_exceptions, and meta.
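Without a schema, the check reduces to whether the file parses as JSON. The hypothetical helper is_valid_json_file below sketches only that case with the standard json module; schema validation needs a third-party validator and is left out of this sketch.

```python
import json


def is_valid_json_file(path):
    """Return True if the file at `path` parses as JSON."""
    try:
        with open(path, encoding="utf-8") as handle:
            json.load(handle)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        return False
    return True
```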

great_expectations.data_asset.util

great_expectations.dataset.util.is_valid_partition_object(partition_object)

Tests whether a given object is a valid continuous or categorical partition object.

Parameters:partition_object – The partition_object to evaluate
Returns:Boolean

great_expectations.dataset.util.is_valid_categorical_partition_object(partition_object)

Tests whether a given object is a valid categorical partition object.

Parameters:partition_object – The partition_object to evaluate
Returns:Boolean

great_expectations.dataset.util.is_valid_continuous_partition_object(partition_object)

Tests whether a given object is a valid continuous partition object.

Parameters:partition_object – The partition_object to evaluate
Returns:Boolean
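This reference does not spell out the validity rules, so the sketch below states plausible ones as assumptions: a continuous partition object is a dictionary with "bins" and "weights" keys, there is one more bin edge than weights, the edges strictly increase, and the weights sum to 1. The helper name looks_like_continuous_partition is hypothetical.

```python
import math


def looks_like_continuous_partition(partition_object):
    """Check the assumed structural rules for a continuous partition object."""
    if not isinstance(partition_object, dict):
        return False
    bins = partition_object.get("bins")
    weights = partition_object.get("weights")
    if bins is None or weights is None:
        return False
    # n weights describe the n intervals between n + 1 bin edges.
    if len(bins) != len(weights) + 1:
        return False
    increasing = all(lo < hi for lo, hi in zip(bins, bins[1:]))
    return increasing and math.isclose(sum(weights), 1.0)
```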

great_expectations.dataset.util.categorical_partition_data(data)

Convenience method for creating weights from categorical data.

Parameters:data (list-like) – The data from which to construct the estimate.
Returns:

A new partition object:

{
    "partition": (list) The categorical values present in the data,
    "weights": (list) The weights of the values in the partition.
}
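Building weights from categorical data amounts to counting each value and dividing by the total. The sketch below uses collections.Counter and returns the keys shown above; categorical_partition_sketch is a hypothetical stand-in, and the order of values (first appearance) is a choice made for this example.

```python
from collections import Counter


def categorical_partition_sketch(data):
    """Return the distinct values in `data` and their relative frequencies."""
    counts = Counter(data)
    total = sum(counts.values())
    values = list(counts)  # distinct values, in order of first appearance
    return {
        "partition": values,
        "weights": [counts[value] / total for value in values],
    }


result = categorical_partition_sketch(["a", "b", "a", "a"])
# result == {"partition": ["a", "b"], "weights": [0.75, 0.25]}
```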

great_expectations.dataset.util.kde_partition_data(data, estimate_tails=True)

Convenience method for building a partition and weights using a gaussian Kernel Density Estimate and default bandwidth.

Parameters:
  • data (list-like) – The data from which to construct the estimate
  • estimate_tails (bool) – Whether to estimate the tails of the distribution to keep the partition object finite
Returns:

A new partition_object:

{
    "partition": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}

great_expectations.dataset.util.partition_data(data, bins='auto', n_bins=10)
great_expectations.dataset.util.continuous_partition_data(data, bins='auto', n_bins=10)

Convenience method for building a partition object on continuous data

Parameters:
  • data (list-like) – The data from which to construct the estimate.
  • bins (string) – One of ‘uniform’ (for uniformly spaced bins), ‘ntile’ (for percentile-spaced bins), or ‘auto’ (for automatically spaced bins)
  • n_bins (int) – The number of bins to use. Ignored if bins is ‘auto’.
Returns:

A new partition_object:

{
    "bins": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}
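For the ‘uniform’ option, a partition can be sketched in pure Python by cutting the data range into n_bins equal-width intervals. The helper uniform_partition_sketch is hypothetical, and it reports each weight as the fraction of observations falling in the bin, which is an assumption about what the weights represent.

```python
def uniform_partition_sketch(data, n_bins=10):
    """Partition `data` into `n_bins` equal-width bins over its range."""
    lo, hi = min(data), max(data)
    if hi <= lo:
        raise ValueError("data must contain at least two distinct values")
    width = (hi - lo) / n_bins
    # n_bins + 1 edges; the last edge is pinned to the maximum.
    bins = [lo + i * width for i in range(n_bins)] + [hi]
    counts = [0] * n_bins
    for x in data:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((x - lo) / width), n_bins - 1)
        counts[index] += 1
    return {"bins": bins, "weights": [c / len(data) for c in counts]}


partition = uniform_partition_sketch(list(range(11)), n_bins=2)
```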

great_expectations.dataset.util.infer_distribution_parameters(data, distribution, params=None)

Convenience method for determining the shape parameters of a given distribution

Parameters:
  • data (list-like) – The data to build shape parameters from.
  • distribution (string) – Scipy distribution, determines which parameters to build.
  • params (dict or None) – The known parameters. Parameters given here will not be altered. Keep as None to infer all necessary parameters from the data.
Returns:

A dictionary of named parameters:

{
    "mean": (float),
    "std_dev": (float),
    "loc": (float),
    "scale": (float),
    "alpha": (float),
    "beta": (float),
    "min": (float),
    "max": (float),
    "df": (float)
}

See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
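For the normal distribution, inference of this kind can be sketched with the statistics module. The helper infer_norm_parameters_sketch is hypothetical; it covers only the "mean" and "std_dev" keys and assumes the population standard deviation is the intended estimate. Values supplied in params are kept as given.

```python
import statistics


def infer_norm_parameters_sketch(data, params=None):
    """Fill in 'mean' and 'std_dev' from `data` unless already supplied."""
    inferred = dict(params or {})
    # setdefault leaves any caller-supplied value untouched.
    inferred.setdefault("mean", statistics.fmean(data))
    # Assumption: population (not sample) standard deviation.
    inferred.setdefault("std_dev", statistics.pstdev(data))
    return inferred


sample = [2, 4, 4, 4, 5, 5, 7, 9]
estimated = infer_norm_parameters_sketch(sample)
pinned = infer_norm_parameters_sketch(sample, params={"mean": 0})
```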

great_expectations.dataset.util.validate_distribution_parameters(distribution, params)

Ensures that necessary parameters for a distribution are present and that all parameters are sensible.

If parameters necessary to construct a distribution are missing or invalid, this function raises ValueError with an informative description. Note that ‘loc’ and ‘scale’ are optional arguments, and that ‘scale’ must be positive.

Parameters:
  • distribution (string) – The scipy distribution name, e.g. normal distribution is ‘norm’.
  • params (dict or list) –

    The distribution shape parameters in a named dictionary or positional list form following the scipy cdf argument scheme.

    params={'mean': 40, 'std_dev': 5} or params=[40, 5]

Raises:
  • ValueError with an informative description, usually when necessary parameters are omitted or invalid.
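The validation rules for a single distribution can be sketched as below. The helper validate_norm_parameters_sketch is hypothetical and covers only ‘norm’; it assumes that "mean" and "std_dev" are required in dictionary form, that the list form is [mean, std_dev], and that "std_dev" must be positive in addition to the documented rule that ‘scale’ must be positive.

```python
def validate_norm_parameters_sketch(params):
    """Raise ValueError if `params` cannot describe a normal distribution."""
    if isinstance(params, dict):
        missing = [key for key in ("mean", "std_dev") if key not in params]
        if missing:
            raise ValueError("norm requires parameters: " + ", ".join(missing))
        std_dev = params["std_dev"]
        scale = params.get("scale", 1)  # 'scale' is optional
    else:
        # Positional form, assumed here to be [mean, std_dev].
        if len(params) != 2:
            raise ValueError("norm requires [mean, std_dev]")
        std_dev = params[1]
        scale = 1
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    if scale <= 0:
        raise ValueError("scale must be positive")
```

The function returns None when the parameters pass, so callers rely on the exception rather than a return value.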

great_expectations.dataset.util.create_multiple_expectations(df, columns, expectation_type, *args, **kwargs)

Creates an identical expectation for each of the given columns with the specified arguments, if any.

Parameters:
  • df (great_expectations.dataset) – A great expectations dataset object.
  • columns (list) – A list of column names represented as strings.
  • expectation_type (string) – The expectation type.
Raises:
  • KeyError if the provided column does not exist.
  • AttributeError if the provided expectation type does not exist or df is not a valid great expectations dataset.
Returns:

A list of expectation results.
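The behavior described above can be sketched as looking up the expectation method by name and calling it once per column. FakeDataset and create_multiple_expectations_sketch are hypothetical stand-ins written for this example; the result dictionaries are simplified and do not match the library's result objects.

```python
class FakeDataset:
    """Tiny stand-in for a dataset object exposing one expectation method."""

    def __init__(self, columns):
        self.columns = set(columns)

    def expect_column_to_exist(self, column):
        return {"success": column in self.columns, "column": column}


def create_multiple_expectations_sketch(df, columns, expectation_type,
                                        *args, **kwargs):
    """Apply the named expectation to each column and collect the results."""
    # getattr raises AttributeError if the expectation type does not exist.
    expectation = getattr(df, expectation_type)
    return [expectation(column, *args, **kwargs) for column in columns]


dataset = FakeDataset(["id", "name"])
results = create_multiple_expectations_sketch(
    dataset, ["id", "age"], "expect_column_to_exist"
)
```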