Customers are looking for ways to securely and cost-efficiently archive and delete large volumes of sensitive data in their data lakes while complying with regulations and data protection and privacy laws such as GDPR, POPIA, and LGPD. This post describes a way to automatically identify sensitive data stored in your data lake on AWS, tag the data according to its sensitivity level, and apply appropriate lifecycle policies in a secure and cost-effective way.

Amazon Macie is a managed data security and data privacy service that uses machine learning (ML) and pattern matching to discover and protect your sensitive data stored in Amazon Simple Storage Service (Amazon S3). In this post, we show you how to develop a solution using Macie, Amazon Kinesis Data Firehose, Amazon S3, Amazon EventBridge, and AWS Lambda to identify sensitive data across a large number of S3 buckets, tag the affected objects, and apply lifecycle policies for transition and deletion.

Solution overview

The following diagram illustrates the architecture of our solution.

The flow of the solution is as follows:

  1. An Amazon S3 bucket contains multiple objects with different sensitivities of data.
  2. A Macie job analyses the S3 bucket to identify the different sensitivities.
    1. An EventBridge rule is triggered for each finding that Macie generates from the job.
    2. The rule copies the results created by the Macie job to a Kinesis Data Firehose delivery stream.
    3. The delivery stream copies the results to an S3 bucket as a JSON file for
  3. The arrival of the results triggers a Lambda function that parses the sensitivity metadata from the JSON file.
  4. The function tags the objects in the bucket mentioned in Step 1, creates an S3 Lifecycle policy based on the sensitivity level of each object, and overwrites any existing S3 Lifecycle policy.
  5. The S3 Lifecycle policy moves data to different storage classes and deletes data based on the configured rules. For example, we implement the following rules:
    1. Archive objects with high sensitivity, tagged as High, after 700 days.
    2. Delete objects tagged as High after 3,000 days.
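The parsing in Step 3 could be sketched as follows. This is a minimal illustration, not the function shipped with the template. It assumes that Firehose writes the records back to back (with or without newlines between them) and that each record is the EventBridge event with the Macie finding under detail. The function names and the sample events are hypothetical; the field names follow the Macie finding schema.

```python
import json


def iter_json_objects(text):
    """Yield each JSON object from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace or newlines between records, if any.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def extract_sensitivity(event):
    """Return (bucket, key, severity) from one Macie finding event, or None."""
    detail = event.get("detail", {})
    resources = detail.get("resourcesAffected", {})
    bucket = resources.get("s3Bucket", {}).get("name")
    key = resources.get("s3Object", {}).get("key")
    severity = detail.get("severity", {}).get("description")
    if bucket and key and severity:
        return bucket, key, severity
    return None  # for example, a finding that has no affected object


# Two hand-written sample events, concatenated without a delimiter.
sample = (
    '{"detail": {"severity": {"description": "High"}, "resourcesAffected": '
    '{"s3Bucket": {"name": "subject-bucket"}, "s3Object": {"key": "sample_pii.xlsx"}}}}'
    '{"detail": {"severity": {"description": "Medium"}, "resourcesAffected": '
    '{"s3Bucket": {"name": "subject-bucket"}, "s3Object": {"key": "notes.csv"}}}}'
)
results = [extract_sensitivity(event) for event in iter_json_objects(sample)]
print(results)
```

Using raw_decode avoids depending on a particular record delimiter, which is useful if the delivery stream configuration changes later.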

Create resources with AWS CloudFormation

We provide an AWS CloudFormation template to create the following resources:

  • An S3 bucket named archival-blog-<account_number>-<region_name> as a sample subject bucket as described above.
  • An S3 bucket named archival-blog-results-<account_number>-<region_name> to store the results generated by the Macie job.
  • A Firehose delivery stream to send the results of the Macie job to the S3 bucket.
  • An EventBridge rule that matches the incoming event of a result generated by the Macie job and routes the result to the Firehose delivery stream.
  • A Lambda function to apply tags and S3 Lifecycle policies on the data objects of the subject S3 bucket based on the result generated by the Macie job.
  • AWS Identity and Access Management (IAM) roles and policies with appropriate permissions.
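To make the EventBridge rule more concrete, the following sketch shows an event pattern that matches Macie findings, plus a deliberately simplified matcher for local experimentation. The pattern in the provided template may differ; the matches helper is hypothetical and handles only exact matches on top-level fields.

```python
import json

# Event pattern for findings that Macie publishes to EventBridge.
MACIE_FINDING_PATTERN = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
}


def matches(pattern, event):
    """Simplified matcher: each pattern field must list the event's value.

    Real EventBridge patterns also support nested fields, prefixes, and
    other operators that this sketch does not cover.
    """
    return all(event.get(field) in allowed for field, allowed in pattern.items())


# The JSON form is what an EventBridge rule expects as its event pattern.
print(json.dumps(MACIE_FINDING_PATTERN, indent=2))
```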

Launch the following stack, providing your stack name:

After the CloudFormation stack is deployed, copy sample_pii.xlsx and recipe.xlsx into archival-blog-<account_number>-<region_name> as sample data for Macie to scan for sensitive data.

Next, we scan the subject S3 bucket for sensitive data to tag and attach the appropriate lifecycle policies.

Configure a Macie job

Macie uses ML and pattern matching to discover and protect your sensitive data in AWS. To configure a Macie job, complete the following steps:

  1. On the Macie console, create a new job by choosing Create Job.
  2. Select the bucket that you want to analyze and choose Next.
  3. Select a schedule if you want to run the job on a schedule, or One-time job if you want to run the job one time.

    For this post, we select One-time job. We can also choose Scheduled job for periodic jobs in production, but this is out of the scope of this post.

  4. Choose Next.
  5. Enter a name for the job and choose Next.
  6. Review the job details and confirm they’re correct before choosing Submit.

The job starts immediately after you submit it.
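If you prefer to script the console steps, a one-time job can also be created through the Macie API. The following is a rough sketch that assumes boto3 and a client created with boto3.client("macie2"). The helper names, the job name, the account ID 111122223333, and the Region are placeholders, not values from this post.

```python
import uuid


def build_one_time_job_request(account_id, bucket_name, job_name):
    """Build keyword arguments for the macie2 create_classification_job call."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "jobType": "ONE_TIME",
        "name": job_name,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket_name]}
            ]
        },
    }


def create_job(macie_client, request):
    """Submit the job; macie_client is expected to be boto3.client("macie2")."""
    response = macie_client.create_classification_job(**request)
    return response["jobId"]


# Placeholder values for illustration only.
request = build_one_time_job_request(
    "111122223333",
    "archival-blog-111122223333-us-east-1",
    "archival-blog-job",
)
```

Passing the client in as an argument keeps the request builder free of AWS calls, so it can be checked locally before anything is submitted.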

Review the results

Whenever the Macie job runs, it generates the following results.

Secondly, the Lambda function tags each sensitive object with Sensitivity: High.
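The tagging step might look like the following sketch. It assumes a client created with boto3.client("s3"), and the helper names are hypothetical. Because put_object_tagging replaces the whole tag set of an object, the sketch reads the existing tags first and merges the Sensitivity tag into them.

```python
def merge_tags(existing, key, value):
    """Return a new tag set with key set to value, keeping all other tags."""
    merged = [tag for tag in existing if tag["Key"] != key]
    merged.append({"Key": key, "Value": value})
    return merged


def tag_object(s3_client, bucket, object_key, sensitivity):
    """Set the Sensitivity tag on one object and return the resulting tag set."""
    current = s3_client.get_object_tagging(Bucket=bucket, Key=object_key)["TagSet"]
    tag_set = merge_tags(current, "Sensitivity", sensitivity)
    # put_object_tagging overwrites the full tag set, hence the merge above.
    s3_client.put_object_tagging(
        Bucket=bucket, Key=object_key, Tagging={"TagSet": tag_set}
    )
    return tag_set
```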

Thirdly, the function creates a corresponding S3 Lifecycle policy.
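As an illustration of such a policy, the sketch below builds a tag-filtered rule using the sample periods from this post (archive after 700 days, delete after 3,000 days). The GLACIER storage class, the rule ID format, and the helper names are assumptions; the post does not state which archive class the function uses. The client is expected to come from boto3.client("s3").

```python
def build_lifecycle_rule(sensitivity, archive_after_days, delete_after_days,
                         storage_class="GLACIER"):
    """Build one S3 Lifecycle rule that applies to objects tagged Sensitivity=<value>."""
    if delete_after_days <= archive_after_days:
        raise ValueError("expiration must come after the transition")
    return {
        "ID": f"sensitivity-{sensitivity.lower()}",
        "Filter": {"Tag": {"Key": "Sensitivity", "Value": sensitivity}},
        "Status": "Enabled",
        "Transitions": [
            {"Days": archive_after_days, "StorageClass": storage_class}
        ],
        "Expiration": {"Days": delete_after_days},
    }


def apply_lifecycle(s3_client, bucket, rules):
    """Write the rules to the bucket, replacing its current lifecycle configuration."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration={"Rules": rules}
    )


high_rule = build_lifecycle_rule("High", 700, 3000)
```

Because the put call replaces the whole configuration, any rules that should survive need to be included in the list passed to apply_lifecycle.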

Clean up

When you’re done with this solution, delete the sample data from the subject S3 bucket, delete all the data objects that are stored as Macie results in the S3 bucket for this post, and then delete the CloudFormation stack to remove all the service resources used in the solution.
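The cleanup order above could be scripted roughly as follows. This sketch assumes unversioned buckets and clients created with boto3.client("s3") and boto3.client("cloudformation"); the helper names are hypothetical.

```python
def empty_bucket(s3_client, bucket):
    """Delete every object in an unversioned bucket and return the count."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # A page holds at most 1,000 keys, which is also the
            # per-request limit of delete_objects.
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted


def clean_up(s3_client, cfn_client, buckets, stack_name):
    """Empty the given buckets first, then delete the CloudFormation stack."""
    for bucket in buckets:
        empty_bucket(s3_client, bucket)
    cfn_client.delete_stack(StackName=stack_name)
```

Emptying the buckets before deleting the stack matters because CloudFormation cannot delete an S3 bucket that still contains objects.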

Conclusion

In this post, we highlighted how you can use Macie to scan your data stored on Amazon S3 and how to store the resulting findings in Amazon S3 using EventBridge and Kinesis Data Firehose. We also explored how you can use Lambda to tag the relevant objects in your subject S3 bucket based on the Macie job's findings. An S3 Lifecycle transition policy moves tagged objects across storage classes to help save cost, and an S3 Lifecycle expiration policy deletes objects that have reached the end of their lifecycle.

Under privacy laws such as GDPR and POPIA, personal data should be retained only as long as it is needed for legal purposes or for processing. The mechanism in this post lets you archive sensitive data that you might not want to delete immediately, which reduces the related storage costs, and then delete it after a certain period when it is no longer required. The data archival and deletion periods used above are sample numbers that can be customised within the Lambda function based on your requirements.

Additionally, we encourage you to explore Macie's different types of findings. You can use these findings to build capabilities such as sending notifications or searching AWS CloudTrail logs of S3 object-level activity to see who is accessing specific objects.

If you have any questions, comments, or concerns, please reach out to AWS Support. If you have feedback about this post, submit it in the comments section.


About the Authors

Subhro Bose is a Data Architect in Emergent Technologies and Intelligence Platform in Amazon. He loves working on ways for emergent technologies such as AI/ML, big data, quantum, and more to help businesses across different industry verticals succeed within their innovation journey.

Akshay Chandiramani is a Data Analytics Consultant for AWS Professional Services. He is passionate about solving complex big data and MLOps problems for customers to create tangible business outcomes. In his spare time, he enjoys playing snooker and table tennis, and capturing trends on the stock market.