Creating expectations is an opportunity to blend contextual knowledge from subject-matter experts and insights from profiling and performing exploratory analysis on your dataset. This tutorial covers creating expectations for a data asset using a Jupyter notebook.
0. Open Jupyter Notebook
This tutorial assumes that:
you completed the steps covered in the previous tutorial: Run great_expectations init.
your current directory is the root of the project where you ran great_expectations init.
You can either follow the tutorial with the dataset that it uses or you can execute the same steps on your project with your own data.
If you get stuck, find a bug or want to ask a question, go to our Slack - this is the best way to get help from the contributors and other users.
The great_expectations init command created a great_expectations/notebooks/ folder in your project. The folder contains example notebooks for pandas, Spark and SQL datasources.
If you are following this tutorial using the NPI dataset, open the pandas notebook. If you are working with a different dataset, follow along in the notebook with instructions tailored to your datasource:
jupyter notebook great_expectations/notebooks/pandas/create_expectations.ipynb
jupyter notebook great_expectations/notebooks/spark/create_expectations.ipynb
jupyter notebook great_expectations/notebooks/sql/create_expectations.ipynb
1. Get a DataContext Object
A DataContext represents a Great Expectations project. It organizes datasources, notification settings, data documentation sites, and storage and access for expectation suites and validation results.
The DataContext is configured via a yml file stored in a directory called great_expectations.
This entire directory, which includes configuration files as well as expectation suites, should be stored in version control.
Instantiating a DataContext loads your project configuration and all its resources.
import great_expectations as ge

context = ge.data_context.DataContext()
To read more about DataContext, see: DataContexts
2. List Data Assets
A Data Asset is data you can describe with expectations.
A Pandas datasource generates data assets from Pandas DataFrames or CSV files. In this example the pipeline reads NPI data from CSV files in the npidata directory into Pandas DataFrames. This is the data you want to describe with expectations. That directory and its files form a data asset named "npidata" (based on the directory name).
A Spark datasource generates data assets from Spark DataFrames or CSV files. The data loaded into a data asset is the data you want to describe and specify with expectations. If this example read CSV files in a directory called npidata into a Spark DataFrame, the resulting data asset would be called "npidata", based on the directory name.
A SQLAlchemy datasource generates data assets from tables, views and query results.
If the data resided in a table (or view) in a database, it would be accessible as a data asset with the name of that table (or view).
If the data did not reside in one table npidata and, instead, the example pipeline ran an SQL query that fetched the data (probably from multiple tables), the result set of that query would be accessible as a data asset. The name of this data asset would be up to us (e.g., "npidata" or "npidata_query").
Use this convenience method to list all data assets and expectation suites in your project (using the DataContext).
The output looks like this:
npidata is the short name of the data asset. Full names of data assets in a DataContext consist of three parts, for example: data__dir/default/npidata. You don't need to know (yet) how the namespace is managed or the exact meaning of each part; the DataContexts article describes this in detail.
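As a rough illustration (plain Python, independent of Great Expectations, and with part names that are an assumption based on the example above rather than the official terminology), a full data asset name can be viewed as three slash-separated components:

```python
# Illustrative sketch only: split a full data asset name such as
# "data__dir/default/npidata" into its three components. The component
# names used here (datasource, generator, asset) are assumptions for
# illustration; the DataContexts article defines the real namespace.
def split_full_name(full_name):
    datasource, generator, asset = full_name.split("/")
    return datasource, generator, asset

parts = split_full_name("data__dir/default/npidata")
# parts[-1] is the short name "npidata" used elsewhere in this tutorial
```

The short name is simply the last component, which is why the tutorial can refer to the asset as just "npidata".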
3. Pick a data asset and set the expectation suite name
The normalize_data_asset_name method converts the short name of a data asset to a full name:
data_asset_name = "npidata"
normalized_data_asset_name = context.normalize_data_asset_name(data_asset_name)
normalized_data_asset_name
expectation_suite_name = "warning"
4. Create a new empty expectation suite
Individual Expectations are organized into expectation suites. We recommend ‘warning’ or ‘default’ as the name for a first expectation suite associated with a data asset.
Let’s create a new empty suite in our project so we can start writing Expectations!
If an expectation suite with this name already exists for this data asset, you will get an error. If you would like to overwrite this expectation suite, set overwrite_existing=True.
5. Load a batch of data to create Expectations
Expectations describe data assets. Data assets are composed of batches. Validation checks expectations against a batch of data.
For example, a batch could be the most recent day of log data. For a database table, a batch could be the data in that table at a particular time.
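To make the batch idea concrete, here is a small plain-Python sketch (independent of Great Expectations) that groups log records by day; validating "the most recent day of log data" means validating one of these groups:

```python
from collections import defaultdict

# Sketch: each per-day group of records plays the role of a "batch"
# of the overall log data asset. Record contents are made up for
# illustration.
records = [
    {"ts": "2019-06-01", "msg": "a"},
    {"ts": "2019-06-01", "msg": "b"},
    {"ts": "2019-06-02", "msg": "c"},
]

batches = defaultdict(list)
for record in records:
    batches[record["ts"]].append(record)

most_recent_day = max(batches)           # "2019-06-02"
latest_batch = batches[most_recent_day]  # the batch you would validate
```

Great Expectations handles this partitioning for you; the sketch only shows what "a batch of a data asset" means conceptually.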
To create expectations about a data asset you will load a batch of data as a Great Expectations Dataset and then call expectation methods.
The get_batch method is used to load a batch of a data asset:
batch = context.get_batch(normalized_data_asset_name, expectation_suite_name, batch_kwargs)
Calling this method asks the Context to get a batch of data from the data asset normalized_data_asset_name and attach the expectation suite expectation_suite_name to it. The batch_kwargs argument specifies which batch of the data asset should be loaded.
If you have no preference as to which batch of the data asset should be loaded, use the yield_batch_kwargs method on the data context:
batch_kwargs = context.yield_batch_kwargs(data_asset_name)
This is most likely sufficient for the purpose of this tutorial.
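For a sense of what yield_batch_kwargs produces, batch_kwargs is a plain dictionary that tells the datasource what to load. The sketch below is an assumption for illustration: the "path" key and the file name are hypothetical, and the exact keys depend on your datasource:

```python
# Hypothetical batch_kwargs for a pandas/CSV datasource: point at one
# CSV file in the npidata directory. The key name and file name here
# are illustrative assumptions, not the definitive schema.
batch_kwargs = {"path": "npidata/example_file.csv"}
```

A dictionary like this could then be passed as the batch_kwargs argument of get_batch.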
Now you have the contents of one of the files loaded as a batch of the data asset.
7. Review and save your Expectations
The expectations_store attribute in the great_expectations.yml configuration file controls the location where the DataContext saves the expectation suite.
When you call get_expectation_suite, you might see this warning in the output:
This warning is produced because, by default, GE drops any expectation that was not successful on its last run.
Sometimes, you may want to save an expectation even though it did not validate successfully on the current batch (e.g., you have a reason to believe that the expectation is correct and the current batch has bad entries). In this case, pass the additional argument discard_failed_expectations=False to the get_expectation_suite method.
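The discard behavior can be sketched in plain Python (an illustration of the logic, not the actual Great Expectations implementation):

```python
# Sketch: keep only expectations whose last run succeeded, unless the
# caller explicitly asks to keep failed ones as well. The expectation
# entries below are made up for illustration.
def filter_expectations(expectations, discard_failed_expectations=True):
    if not discard_failed_expectations:
        return list(expectations)
    return [e for e in expectations if e["success"]]

suite = [
    {"type": "expect_column_to_exist", "success": True},
    {"type": "expect_column_values_to_not_be_null", "success": False},
]

kept = filter_expectations(suite)  # the failed expectation is dropped
kept_all = filter_expectations(suite, discard_failed_expectations=False)
```

With the default setting the failed expectation is dropped; with discard_failed_expectations=False both are kept.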
8. View the Expectations in Data Docs
Data Docs compiles Expectations and Validations into HTML documentation. By default the HTML website is hosted on your local filesystem. When you are working in a team, the website can be hosted in the cloud (e.g., on S3) and serve as the shared source of truth for the team working on the data pipeline.
To view the expectation suite you just created as HTML, rebuild the data docs and open the website in the browser:
Read more about the capabilities and configuration of Data Docs here: Data Docs.