Data quality plays an important role while building an extract, transform, and load (ETL) pipeline for sending data to downstream analytical applications and machine learning (ML) models. The analogy “garbage in, garbage out” is apt at describing why it’s important to filter out bad data before further processing. Continuously monitoring data quality and comparing it with predefined target metrics helps you comply with your governance frameworks.

In November 2020, AWS announced the general availability of AWS Glue DataBrew, a new visual data preparation tool that helps you clean and normalize data without writing code. This reduces the time it takes to prepare data for analytics and ML by up to 80% compared to traditional approaches to data preparation.

In this post, we walk through a solution in which we apply various business rules to determine the quality of incoming data and separate good and bad records. Furthermore, we publish a data quality score card using Amazon QuickSight and make records available for further analysis.

Use case overview

For our use case, we use a public dataset that is available for download at Synthetic Patient Records with COVID-19. It contains 100,000 synthetic patient records in CSV format. Data hosted within SyntheticMass has been generated by SyntheaTM, an open-source patient population simulation made available by The MITRE Corporation.

When we unzip the 100k_synthea_covid19_csv.zip file, we see the following CSV files:

  • Allergies.csv
  • Careplans.csv
  • Conditions.csv
  • Devices.csv
  • Encounters.csv
  • Imaging_studies.csv
  • Immunizations.csv
  • Medications.csv
  • Observations.csv
  • Organizations.csv
  • Patients.csv
  • Payer_transitions.csv
  • Payers.csv
  • Procedures.csv
  • Providers.csv
  • Supplies.csv

We perform the data quality checks categorized by the following data quality dimensions:

  • Completeness
  • Consistency
  • Integrity

For our use case, these CSV files are maintained by your organization’s data ingestion team, which uploads the updated CSV file to Amazon Simple Storage Service (Amazon S3) every week. The good and bad records are separated through a series of data preparation steps, and the business team uses the output data to create business intelligence (BI) reports.

Architecture overview

The following architecture uses DataBrew for data preparation and building key KPIs, Amazon Athena for data analysis with standard SQL, and QuickSight for building the data quality score card.

The workflow includes the following steps:

  1. The ingestion team receives CSV files in an S3 input bucket every week.
  2. The DataBrew job scheduled to run every week triggers the recipe job.
  3. DataBrew processes the input files and generates output files that contain additional fields depending on the recipe job logic.
  4. After the output data is written, we create external table on top of it by creating and running an AWS Glue crawler.
  5. The good and bad records are separated by creating views on top of the external table.
  6. Data analysts can use Athena to analyze good and bad records.
  7. The records can also be separated directly using QuickSight calculated fields.
  8. We use QuickSight to create the data quality score card in the form of a dashboard, which fetches data through Athena.

Prerequisites

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

Additionally, create the S3 input and output buckets to capture the data, and upload the input data into the input bucket.

Create DataBrew datasets

To create a DataBrew dataset for the patient data, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. Choose Connect new dataset.
  3. For Dataset name, enter a name (for this post, Patients).
  4. For Enter your source from S3, enter the S3 path of the patients input CSV.
  5. Choose Create Dataset.

Repeat these steps to create datasets for other CSV files, such as encounters, conditions, and so on.

Create a DataBrew project

To create a DataBrew project for marketing data, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create a project.
  3. For Project name, enter a name (for this post, patients-data-quality).
  4. For Select a dataset, select My datasets.
  5. Select the patients dataset.
  6. Under Permissions, for Role name, choose an AWS Identity and Access Management (IAM) role that allows DataBrew to read from your Amazon S3 input location.

You can choose a role if you already created one, or create a new one. For more information, see Adding an IAM role with data resource permissions.

  1. Wait till the dataset is loaded (about 1–2 minutes).
  2. To make a consistency check, choose Birthdate.
  3. On the Create menu, choose Flag column.
  4. Under Create column, for Values to flag, select Custom value.
  5. For Source column, choose BIRTHDATE.
  6. For Values to flag, enter the regular expression (?:(?:18|19|20)[0-9]{2}).
  7. For Flag values as, choose Yes or no.
  8. For Destination column, enter BIRTHDATE_flagged.

The new column BIRTHDATE_FLAGGED now displays Yes for a valid four-digit year within BIRTHDATE.

  1. To create a completeness check, repeat the preceding steps to create a DRIVERS_FLAGGED column by choosing the DRIVERS column to mark missing values.
  2. To create an integrity check, choose the JOIN transformation.
  3. Choose the encounters dataset and choose Next.
  4. For Select join type, select Left join.
  5. For Join keys, choose Id for Table A and Patient for Table B.
  6. Under Column list, unselect all columns from Table B except for Patient.
  7. Choose Finish.
  8. Choose the Patient column and create another flag column PATIENTS_FLAG to mark missing values from the Patient column.

For our use case, we created three new columns to demonstrate data quality checks for data quality dimensions in scope (consistency, completeness, and integrity), but you can integrate additional transformations on the same or additional columns as needed.

  1. After you finish applying all your transformations, choose Publish on the recipe.
  2. Enter a description of the recipe version and choose Publish.

Create a DataBrew job

Now that our recipe is ready, we can create a job for it, which gets invoked through our AWS Lambda functions.

  1. On the DataBrew console, choose Jobs.
  2. Choose Create a job.
  3. For Job name¸ enter a name (for example, patient-data-quality).

Your recipe is already linked to the job.

  1. Under Job output settings¸ for File type, choose your final storage format (for this post, we choose CSV).
  2. For S3 location, enter your final S3 output bucket path.
  3. For Compression, choose the compression type you want to apply (for this post, we choose None).
  4. For File output storage, select Replace output files for each job run.

We choose this option because our use case is to publish a data quality score card for every new set of data files.

  1. Under Permissions, for Role name¸ choose your IAM role.
  2. Choose Create and run job.

Create an Athena table

If you’re familiar with Apache Hive, you may find creating tables on Athena to be familiar. You can create tables by writing the DDL statement on the query editor, or by using the wizard or JDBC driver. To use the query editor, enter the following DDL statement to create a table:

CREATE EXTERNAL TABLE `blog_output`(
  `id` string, 
  `birthdate` string, 
  `birthdate_flagged` string, 
  `deathdate` string, 
  `ssn` string, 
  `drivers` string, 
  `drivers_flagged` string, 
  `passport` string, 
  `prefix` string, 
  `first` string, 
  `last` string, 
  `suffix` string, 
  `maiden` string, 
  `marital` string, 
  `race` string, 
  `ethnicity` string, 
  `gender` string, 
  `birthplace` string, 
  `address` string, 
  `city` string, 
  `state` string, 
  `county` string, 
  `zip` bigint, 
  `lat` double, 
  `lon` double, 
  `healthcare_expenses` double, 
  `healthcare_coverage` double, 
  `patient` string, 
  `patient_flagged` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<your-bucket>/blog_output/';

Let’s validate the table output in Athena by running a simple SELECT query. The following screenshot shows the output.

Create views to filter good and bad records (optional)

To create a good records view, enter the following code:

CREATE OR REPLACE VIEW good_records AS
SELECT * FROM "databrew_blog"."blog_output"
where 
birthdate_flagged = 'Yes' AND
drivers_flagged = 'No' AND
patient_flagged = 'No'

To create a bad records view, enter the following code:

CREATE OR REPLACE VIEW bad_records AS
SELECT * FROM "databrew_blog"."blog_output"
where 
birthdate_flagged = 'No' OR
drivers_flagged = 'Yes' OR 
patient_flagged = 'Yes'

Now you have the ability to query the good and bad records in Athena using these views.

Create a score card using QuickSight

Now let’s complete our final step of the architecture, which is creating a data quality score card through QuickSight by connecting to the Athena table.

  1. On the QuickSight console, choose Athena as your data source.
  2. For Data source name, enter a name.
  3. Choose Create data source.
  4. Choose your catalog and database.
  5. Select the table you have in Athena.
  6. Choose Select.

Now you have created a dataset.

To build the score card, you add calculated fields by editing the dataset blog_output.

  1. Locate your dataset.
  2. Choose Edit dataset.
  3. Choose Add calculated field.
  4. Add the field DQ_Flag with value ifelse({birthdate_flagged} = 'No' OR {drivers_flagged} = 'Yes' OR {patient_flagged} = 'Yes' , 'Invalid', 'Valid').

Similarly, add other calculated fields.

  1. Add the field % Birthdate Invalid Year with value countIf({birthdate_flagged}, {birthdate_flagged} = 'No')/count({birthdate_flagged}).
  2. Add the field % Drivers Missing with value countIf({drivers_flagged}, {drivers_flagged} = 'Yes')/count({drivers_flagged}).
  3. Add the field % Patients missing encounters with value countIf({patient_flagged}, {patient_flagged} = 'Yes')/count({patient_flagged}).
  4. Add the field % Bad records with the value countIf({DQ_Flag}, {DQ_Flag} = 'Invalid')/count({DQ_Flag}).

Now we create the analysis blog_output_analysis.

  1. Change the format of the calculated fields to display the Percent format.
  2. Start adding visuals by choosing Add visual on the + Add menu.

Now you can create a quick report to visualize your data quality score card, as shown in the following screenshot.

If QuickSight is using SPICE storage, you need to refresh the dataset in QuickSight after you receive notification about the completion of the data refresh. If the QuickSight report is running an Athena query for every request, you might see a “table not found” error when data refresh is in progress. We recommend using SPICE storage to get better performance.

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

Conclusion

This post explains how to create a data quality score card using DataBrew, Athena queries, and QuickSight.

This gives you a great starting point for using this solution with your datasets and applying business rules to build a complete data quality framework to monitor issues within your datasets. We encourage you to use various built-in transformations to get the maximum value for your project.


About the Authors

Nitin Aggarwal is a Senior Solutions Architect at AWS, where helps digital native customers with architecting data analytics solutions and providing technical guidance on various AWS services. He brings more than 16 years of experience in software engineering and architecture roles for various large-scale enterprises.

 

 

 

Gaurav Sharma is a Solutions Architect at AWS. He works with digital native business customers providing architectural guidance on AWS services.

 

 

 

Vivek Kumar is a Solutions Architect at AWS. He works with digital native business customers providing architectural guidance on AWS services.