Batch data processing is ubiquitous in computing, especially for teams creating data processing workflows and pipelines. The AWS Batch service enables analysts, data scientists, and developers to run batch computing jobs in a fully managed AWS environment.

In keeping with other Amazon innovations, this means there is no infrastructure to build or manage. Any AWS resources needed to run your batch processes are based on your definitions: AWS provisions the compute resources (e.g., CPU, memory) you tell it to apply to a given workload. You can also use Docker images for your batch workloads.

What Is A Batch Job?

An important concept with any batch processing effort is the “job”. A batch job is generally composed of three elements that define a specific workload (a minimal AWS CLI sketch follows the list):

  • Task — the unit of work submitted
  • Queues — order and flow of tasks to be completed
  • Environment — the compute resources that run your tasks
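To make these three concrete, here is a minimal AWS CLI sketch of the same elements. The names (my-batch-env, my-batch-queue, my-batch-job), instance sizes, subnets, security groups, and roles are placeholders you would replace with values from your own account:

# 1. Environment: a managed compute environment (placeholder subnets, roles, sizes)
aws batch create-compute-environment \
  --compute-environment-name my-batch-env \
  --type MANAGED \
  --compute-resources type=EC2,minvCpus=0,maxvCpus=16,desiredvCpus=0,instanceTypes=optimal,subnets=subnet-xxxx,securityGroupIds=sg-xxxx,instanceRole=ecsInstanceRole \
  --service-role AWSBatchServiceRole

# 2. Queue: controls the order and flow of submitted jobs
aws batch create-job-queue \
  --job-queue-name my-batch-queue \
  --priority 1 \
  --compute-environment-order order=1,computeEnvironment=my-batch-env

# 3. Task: a Docker-based job definition, then a submitted job
aws batch register-job-definition \
  --job-definition-name my-batch-job \
  --type container \
  --container-properties '{"image": "busybox", "vcpus": 1, "memory": 512, "command": ["echo", "hello"]}'

aws batch submit-job \
  --job-name hello-1 \
  --job-queue my-batch-queue \
  --job-definition my-batch-job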

How these three manifest themselves within your batch application will vary. For example, you may want to use CloudWatch to schedule Batch jobs or have an AWS Lambda function handle this.
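For the CloudWatch route, a scheduled rule can target a Batch job queue directly. Here is a hedged sketch reusing the placeholder names from the sketch above; the account ID and the events-to-Batch IAM role are assumptions:

# Fire a daily trigger and point it at the Batch queue and job definition above
aws events put-rule \
  --name nightly-batch-job \
  --schedule-expression "cron(5 12 * * ? *)"

aws events put-targets \
  --rule nightly-batch-job \
  --targets '[{"Id": "1", "Arn": "arn:aws:batch:us-east-1:123456789012:job-queue/my-batch-queue", "RoleArn": "arn:aws:iam::123456789012:role/events-invoke-batch", "BatchParameters": {"JobDefinition": "my-batch-job", "JobName": "nightly-batch-job"}}]'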

If you have a long-running batch process, you may want to localize a schedule using CRON inside your Docker image. The point is that you have options, so put some forethought into your workflow; it will pay dividends when you are ready to use AWS.
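For instance, a crontab baked into the image might look like the line below. The schedule is illustrative, and it mirrors the crontab.conf file referenced by GOOGLE_CLOUDSDK_CRONFILE in the .env example later in this post:

# Illustrative crontab.conf: kick off yesterday's export every day at 12:05 UTC
05 12 * * * /usr/bin/env bash -c 'bigquery-run prod' >> /ebs/logs/gcloud.log 2>&1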

First Step? The AWS Batch Tutorial

Amazon provides a quick primer post on how to get started with AWS Batch using Docker:

Creating a Simple "Fetch & Run" AWS Batch Job | Amazon Web Services

Familiarize yourself with the simple “Fetch & Run” process detailed in the tutorial. It provides the basic know-how to run just about any Docker-based batch application, including the BigQuery one we detail below.
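If you work through it, submitting a “fetch & run” job comes down to something like the sketch below. The fetch_and_run job definition and the BATCH_FILE_* environment variables come from that AWS post; the job name, queue, and S3 script URL here are placeholders:

# Submit a job that fetches myjob.sh from S3 and runs it (placeholder queue/bucket)
aws batch submit-job \
  --job-name fetch-and-run-test \
  --job-queue my-batch-queue \
  --job-definition fetch_and_run \
  --container-overrides '{"environment": [{"name": "BATCH_FILE_TYPE", "value": "script"}, {"name": "BATCH_FILE_S3_URL", "value": "s3://my-bucket/myjob.sh"}]}'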

Done with the tutorial? Now you can run the pre-built Google BigQuery Docker image.

AWS Batch Processing Example: Google BigQuery Batch Exports

In this ETL batch example, we demonstrate how to create a batch file (or files) using Docker, BigQuery, AWS, and Google Cloud.

Our tutorial shows you how to do Google Analytics 360 batch exports from BigQuery for use in the AWS environment.

Step 1: SQL query batch file processing

In our BigQuery batch process, you need to define your export. This is done with SQL. The .sql file contains a query written in SQL (Structured Query Language); if you have any experience with SQL, you have likely come across .sql files before.

Here is an example batch SQL statement for our BigQuery exports. This query is what we want to run as part of a daily batch job for exporting Google Analytics 360 data to an AWS data lake:

# ga_master SQL
SELECT
CONCAT(trafficSource.source, ' / ', trafficSource.medium) AS source_medium,
count(DISTINCT CONCAT(fullVisitorId, STRING(visitId)), 100000) as sessions,
SUM(totals.bounces) as bounces,
100 * SUM(totals.bounces) / count(DISTINCT CONCAT(fullVisitorId, STRING(visitId)), 100000) as bounce_rate,
SUM(totals.transactions) as transactions,
100 * SUM(totals.transactions) / count(DISTINCT CONCAT(fullVisitorId, STRING(visitId)), 100000) as conversion_rate,
SUM(totals.transactionRevenue) / 1000000 as transaction_revenue,
SUM(hits.transaction.transactionRevenue) / 1000000 as rev2,
AVG(hits.transaction.transactionRevenue) / 1000000 as avg_rev2,
(SUM(hits.transaction.transactionRevenue) / 1000000 ) / SUM(totals.transactions) as avg_rev
FROM TABLE_DATE_RANGE([{{GOOGLE_CLOUDSDK_CORE_PROJECT}}:{{GOOGLE_BIGQUERY_JOB_DATASET}}.{{GOOGLE_BIGQUERY_TABLE}}],TIMESTAMP('{{QDATE}}'),TIMESTAMP('{{QDATE}}'))
GROUP BY source_medium
ORDER BY sessions DESC

Our export process leverages one or more batch *.sql files stored in a ./sql/ directory. These files provide the batch SQL query definition for each export and become part of the batch export definitions for each job.

You can see a collection of these SQL files here: https://github.com/openbridge/ob_google-bigquery/tree/master/sql
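Note the {{...}} tokens in the query above (project, dataset, table, and {{QDATE}}). They are not standard SQL; the export script in Step 4 renders them with sed before handing the query to bq. A minimal illustration of that substitution, using made-up values:

# Render ga_master.sql for a single day (values are illustrative)
QDATE="2019-01-01"
sed "s/{{GOOGLE_CLOUDSDK_CORE_PROJECT}}/foo-bar-buzz-1234/g; s/{{GOOGLE_BIGQUERY_JOB_DATASET}}/1122334455/g; s/{{GOOGLE_BIGQUERY_TABLE}}/ga_sessions_/g; s/{{QDATE}}/${QDATE}/g" sql/ga_master.sql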

Step 2: Define a batch AWS data integration job

In this example, we have a couple of batch processing scripts that will create a file from BigQuery based on the SQL we defined earlier. However, the Docker images need environment variables set to run properly. The *.env files define the environment variables for use within your Docker container.

Here is an example of an .env file for use in your batch job. Notice that GOOGLE_BIGQUERY_SQL=ga_master refers to the ga_master SQL file above:

MODE=test
GOOGLE_CLOUDSDK_ACCOUNT_FILE=/auth.json
GOOGLE_CLOUDSDK_ACCOUNT_EMAIL=foo-bar-buzz-1234@appspot.gserviceaccount.com
GOOGLE_CLOUDSDK_CRONFILE=/crontab.conf
GOOGLE_CLOUDSDK_CORE_PROJECT=foo-bar-buzz-1234
GOOGLE_CLOUDSDK_COMPUTE_ZONE=us-east1-b
GOOGLE_CLOUDSDK_COMPUTE_REGION=us-east1
GOOGLE_BIGQUERY_SQL=ga_master # Name of the SQL file (Do not include the .sql extension)
GOOGLE_BIGQUERY_JOB_DATASET=1122334455
GOOGLE_BIGQUERY_TABLE=ga_sessions_
GOOGLE_STORAGE_PATH=google_analytics # No preceding or trailing slashes
GOOGLE_STORAGE_BUCKET=openbridge-ga360
GOOGLE_DOUBLECLICK_ID=
GOOGLE_DOUBLECLICK_NETWORK_CODE=
GOOGLE_ADWORDS_CUSTOMER_ID=
AWS_ACCESS_KEY_ID=ADSDLKDLKSDKDDLW34
AWS_SECRET_ACCESS_KEY=SDLEOAODEMK$%+teSDLLKLDSLLKDSLeer
AWS_S3_BUCKET=openbridge-bucket # Just the name
AWS_S3_PATH=ebs/my/path/to/google_analytics # No preceding or trailing slashes
LOG_FILE=/ebs/logs/gcloud.log

The batch data integration job initializes each batch export with this command-line loop:

for i in ./env/prod/*.env; do
  echo "working on $i"
  bash -c "docker run -it -v /github/ob_google-cloud/auth/prod/prod.json:/auth.json -v /github/ob_google-cloud/sql:/sql --env-file ${i} ${dockerimage} bigquery-run ${MODE} ${START} ${END}"
  if [[ $? = 0 ]]; then echo "OK: "; else echo "ERROR: "; fi
done

Notice that bigquery-run controls the batch process for the job. We will dig into this process below.

This is all you need to run the process. However, you may want to customize how it gets run. This is where you can go deeper with the next steps…

Step 3: Batch script to control runtime

The bigquery-run.sh script controls and executes the batch export process, bigquery-export.sh:

#!/usr/bin/env bash
set -ex
exec 200< $0
if ! flock -n 200; then echo "OK: There can be only one query process happening at once. Exit" && exit 1; fi
args=("$@")
# args is zero-indexed: $1=mode, $2=start date, $3=end date
if [[ ${args[0]} = "prod" ]]; then MODE=prod && export MODE=${1}; else export MODE="test"; fi
if [[ -z ${args[1]} ]]; then START=$(date -d "1 day ago" "+%Y-%m-%d"); else export START="${2}" && echo "OK: START date passed... "; fi
if [[ -z ${args[2]} ]]; then END=$(date -d "1 day ago" "+%Y-%m-%d"); else export END="${3}" && echo "OK: END date passed... "; fi
function account_auth() {
# batch file if statement for auth files
if test -n "${GOOGLE_CLOUDSDK_ACCOUNT_FILE}";then echo "OK: GOOGLE_CLOUDSDK_ACCOUNT_FILE is present. Authenticating with ${GOOGLE_CLOUDSDK_ACCOUNT_FILE}..." && gcloud auth activate-service-account --key-file="${GOOGLE_CLOUDSDK_ACCOUNT_FILE}" "${GOOGLE_CLOUDSDK_ACCOUNT_EMAIL}"; else echo "INFO: GOOGLE_CLOUDSDK_ACCOUNT_FILE not present. Assuming you are using --volumes-from for Google Cloud SDK authorization.";fi
}
function export_process_sql() {
exec 200< $0
if ! flock -n 200; then echo "OK: There can be only one query process happening at once. Exit" && exit 1; fi
{
flock -s 200
# Iterate over each date in the range and run the export for that day
echo "OK: Starting to run query ${GOOGLE_BIGQUERY_SQL} ${GOOGLE_CLOUDSDK_CORE_PROJECT} ${GOOGLE_BIGQUERY_JOB_DATASET} ${START} ${END}"
cur=$(dateseq -i%Y-%m-%d -f%Y-%m-%d "${START}" "${END}")
export cur
while read -r line; do
echo "Working on date $line"
bash -c "/usr/bin/bigquery-export ${GOOGLE_BIGQUERY_SQL} ${GOOGLE_CLOUDSDK_CORE_PROJECT} ${GOOGLE_BIGQUERY_JOB_DATASET} ${line} ${line} >> /tmp/query.log 2>&1"
if [[ $? = 0 ]]; then echo "OK: ${GOOGLE_BIGQUERY_SQL} query completed"; else echo "ERROR: ${GOOGLE_BIGQUERY_SQL} query did not complete"; fi
if [[ $? = 0 ]]; then echo "OK: "; else echo "ERROR: "; fi

done <<< "${cur}"
} 200>/tmp/query.lock
}
function run() {
account_auth
export_process_sql
echo "OK: All processes have completed."
}
run
exit 0

Step 4: Batch code for exporting data

You will notice in Step 3 we referenced a script called bigquery-export. Here is the bigquery-export.sh batch script, which handles all of the file creation based on the supplied SQL. It internally sets job queues for different stages of the process:

#!/usr/bin/env bash
set -ex
exec 200< $0
if ! flock -n 200; then echo "OK: There can be only one query process happening at once. Exit" && exit 1; fi
args=("$@")
# Google Settings
if [[ -z ${args[1]} ]]; then echo "OK: BIGQUERY SQL was not passed. Using container default"; else GOOGLE_BIGQUERY_SQL=${1}; fi
if [[ -z ${args[2]} ]]; then echo "OK: CLOUDSDK PROJECT was not passed. Using container default"; else GOOGLE_CLOUDSDK_CORE_PROJECT=${2}; fi
if [[ -z ${args[3]} ]]; then echo "OK: BIGQUERY DATASET was not passed. Using container default"; else GOOGLE_BIGQUERY_JOB_DATASET=${3}; fi
# Set the dates
DATE=$(date +"%Y%m%d") && echo "Runnning on $DATE..."
DSTART=${4}
DEND=${5}
START=$(date -d${DSTART} +%s)
END=$(date -d${DEND} +%s)
CUR=${START}
# Check if CRON is being used. We set a "lock" file for run times
if [[ -f "/cron/${GOOGLE_CLOUDSDK_CRONFILE}" ]]; then
# Check if the process should run or not.
if [[ -f "/tmp/runjob.txt" ]]; then
echo "OK: Time to run export process..."
else
echo "INFO: No export job is present. Exit" && exit 1
fi
else
echo "OK: Cron is not present. Running outside of cron..."
fi
# Check if the passed SQL is present.
if [[ -f "/sql/${GOOGLE_BIGQUERY_SQL}.sql" ]]; then
echo "OK: The ${GOOGLE_BIGQUERY_SQL} is present. Ready to run..."
else
echo "ERROR: There is no ${GOOGLE_BIGQUERY_SQL} present. Cannot run without ${GOOGLE_BIGQUERY_SQL} being present. Please check your configs." && exit 1
fi
function setup_gs_storage() {
# Set the table and file name to the date of the data being extracted
# We use a "prodcution" and "test" context
FID=$(cat /dev/urandom | tr -cd [:alnum:] | head -c 8)
if [[ -z ${GOOGLE_STORAGE_PATH} ]]; then
if [[ "${MODE}" = "prod" ]]; then
FILEDATE="${FDATE}"_"${FID}" && GOOGLE_STORAGE_PATH="production"
else
FILEDATE="testing_${FDATE}_${FID}" && GOOGLE_STORAGE_PATH="testing"
fi
else
echo "OK: GOOGLE_STORAGE_PATH is set to ${GOOGLE_STORAGE_PATH}"
FILEDATE="${FDATE}"_"${FID}"
fi
# Setup the working bucket to be used for exports
# Check if the bucket exists.
if [[ -n "${GOOGLE_CLOUDSDK_ACCOUNT_FILE}" ]]; then
echo "INFO: gs:// check skipped due to a Google auth bug"
else
if ! gsutil ls gs:// | grep -e "gs://${GOOGLE_STORAGE_BUCKET}/"; then
echo "OK: Creating bucket ${GOOGLE_STORAGE_BUCKET}"
gsutil mb -p "${GOOGLE_CLOUDSDK_CORE_PROJECT}" -l us gs://"${GOOGLE_STORAGE_BUCKET}"
gsutil lifecycle set /lifecycle.json gs://"${GOOGLE_STORAGE_BUCKET}"/
else
echo "INFO: ${GOOGLE_STORAGE_BUCKET} bucket already exists"
fi
fi
}
function setup_bq_tables() {
# Set working dataset and make sure tables auto-expire after an hour
# We will archive our processed data for auditing purposes for 30 days
GOOGLE_BIGQUERY_WD_DATASET="${GOOGLE_BIGQUERY_JOB_DATASET}_${GOOGLE_BIGQUERY_SQL}_wd"
GOOGLE_BIGQUERY_ARCHIVE_DATASET="${GOOGLE_BIGQUERY_JOB_DATASET}_${GOOGLE_BIGQUERY_SQL}_archive"
if ! bq ls -d | grep -e "${GOOGLE_BIGQUERY_JOB_DATASET}_${GOOGLE_BIGQUERY_SQL}_wd"; then
echo "OK: ${GOOGLE_BIGQUERY_JOB_DATASET}_wd does not exist. Creating..."
bq mk --default_table_expiration 3600 "${GOOGLE_BIGQUERY_WD_DATASET}"
else
echo "INFO: ${GOOGLE_BIGQUERY_JOB_DATASET}_${GOOGLE_BIGQUERY_SQL}_wd dataset already exists"
fi
if ! bq ls -d | grep -e "${GOOGLE_BIGQUERY_JOB_DATASET}_${GOOGLE_BIGQUERY_SQL}_archive"; then
echo "OK: ${GOOGLE_BIGQUERY_JOB_DATASET}_${GOOGLE_BIGQUERY_SQL}_archive does not exist. Creating..."
bq mk --default_table_expiration 2678400 "${GOOGLE_BIGQUERY_ARCHIVE_DATASET}"
else
echo "INFO: ${GOOGLE_BIGQUERY_JOB_DATASET}_archive dataset already exists."
fi
#Import the query and set the required BQ variables
BQQUERY=$(cat /sql/${GOOGLE_BIGQUERY_SQL}.sql | sed "s/{{QDATE}}/${QDATE}/g; s/{{GOOGLE_BIGQUERY_SOURCE_TABLE}}/${GOOGLE_BIGQUERY_SOURCE_TABLE}/g; s/{{GOOGLE_CLOUDSDK_CORE_PROJECT}}/${GOOGLE_CLOUDSDK_CORE_PROJECT}/g; s/{{GOOGLE_BIGQUERY_JOB_DATASET}}/${GOOGLE_BIGQUERY_JOB_DATASET}/g; s/{{GOOGLE_DOUBLECLICK_NETWORK_CODE}}/${GOOGLE_DOUBLECLICK_NETWORK_CODE}/g; s/{{GOOGLE_DOUBLECLICK_ID}}/${GOOGLE_DOUBLECLICK_ID}/g" )
}
function table_check() {
# Check if the table exists. For GA360 we want to use the current date given the delay from Google in delivery may cause it not to be present
if [[ ${GOOGLE_BIGQUERY_SOURCE_TABLE} = *"ga_sessions"* ]]; then
GOOGLE_TABLE_TEST="ga_sessions_${FDATE}"
else
GOOGLE_TABLE_TEST=${GOOGLE_BIGQUERY_SOURCE_TABLE}
fi
for i in $(bq ls -n 9999 "${GOOGLE_CLOUDSDK_CORE_PROJECT}":"${GOOGLE_BIGQUERY_JOB_DATASET}" | grep "${GOOGLE_TABLE_TEST}" | awk '{print $1}'); do if test "${i}" = "${GOOGLE_TABLE_TEST}"; then GASESSIONSCHECK="0"; else GASESSIONSCHECK="1"; fi done
}
function export_data() {
# We export the data to a temp table. This is used for the data file export
if [[ "${GASESSIONSCHECK}" = "0" ]]; then
echo "OK: ${GOOGLE_TABLE_TEST} table exists. Run export process..."
bq query --batch --allow_large_results --destination_table="${GOOGLE_BIGQUERY_WD_DATASET}"."${FILEDATE}"_"${GOOGLE_BIGQUERY_SQL}" "${BQQUERY}"
else
echo "ERROR: The ${GOOGLE_TABLE_TEST} data is not present yet. Cant start export process" && exit 1
fi
# Check if the process created the daily export table in the working job dataset
for i in $(bq ls -n 9999 "${GOOGLE_CLOUDSDK_CORE_PROJECT}":"${GOOGLE_BIGQUERY_WD_DATASET}" | grep "${FILEDATE}_${GOOGLE_BIGQUERY_SQL}" | awk '{print $1}'); do if test "${i}" = "${FILEDATE}_${GOOGLE_BIGQUERY_SQL}"; then BQTABLECHECK="0"; else BQTABLECHECK="1"; fi done
# We will perform a spot check to make sure that the job table in the working dataset does in fact have data present. If it does, run the export process
if test "${BQTABLECHECK}" = "0"; then
echo "OK: ${FILEDATE}_${GOOGLE_BIGQUERY_SQL} table exists. Checking record counts..."
while read -r num; do echo "${num}" && if [[ $num =~ \"num\":\"([[:digit:]]+)\" ]] && (( BASH_REMATCH[1] > 1000 )); then echo "OK: ${FILEDATE}_${GOOGLE_BIGQUERY_SQL} table count test meets expectations. Ready to create extracts..."
bq extract --compression=GZIP ${GOOGLE_CLOUDSDK_CORE_PROJECT}:${GOOGLE_BIGQUERY_WD_DATASET}.${FILEDATE}_${GOOGLE_BIGQUERY_SQL} gs://${GOOGLE_STORAGE_BUCKET}/${GOOGLE_STORAGE_PATH}/${GOOGLE_BIGQUERY_SQL}/${DSTART}/${FILEDATE}_${GOOGLE_BIGQUERY_SQL}_export*.gz; fi done < <(echo "SELECT COUNT(*) as num FROM [${GOOGLE_CLOUDSDK_CORE_PROJECT}:${GOOGLE_BIGQUERY_WD_DATASET}.${FILEDATE}_${GOOGLE_BIGQUERY_SQL}] HAVING COUNT(*) > 100000" | bq query --format json)
else
echo "ERROR: The ${FILEDATE}_${GOOGLE_BIGQUERY_SQL} table counts are too low. Exiting" && exit 1
fi
}
function transfer_s3() {
# Transfer to S3
if [[ "${MODE}" = "prod" && -n ${AWS_S3_BUCKET} ]]; then
# batch copy file but exclude any possible test data
gsutil -m rsync -d -r -x "testing_" gs://"${GOOGLE_STORAGE_BUCKET}"/"${GOOGLE_STORAGE_PATH}"/"${GOOGLE_BIGQUERY_SQL}"/ s3://"${AWS_S3_BUCKET}"/"${AWS_S3_PATH}"/
echo "OK: Completed S3 transfer"
else
echo "OK: No S3 Transfer. Running in TEST mode"
fi
}
function archive_table() {
# batch file if else to make an archive of the production table
if [[ "${MODE}" = "prod" ]]; then
# Make sure we don't copy test tables
for i in $(bq ls -n 9999 "${GOOGLE_CLOUDSDK_CORE_PROJECT}" | grep "testing_*" | awk '{print $1}'); do bq rm -ft "${GOOGLE_CLOUDSDK_CORE_PROJECT}"."${i}"; done
# Transfer to archive dataset
bq cp "${GOOGLE_CLOUDSDK_CORE_PROJECT}":"${GOOGLE_BIGQUERY_WD_DATASET}"."${FILEDATE}"_"${GOOGLE_BIGQUERY_SQL}" "${GOOGLE_CLOUDSDK_CORE_PROJECT}":"${GOOGLE_BIGQUERY_ARCHIVE_DATASET}"."${FILEDATE}"_"${GOOGLE_BIGQUERY_SQL}"_archive
else
echo "OK: No Table Archive. Running in TEST mode"
fi
}
{
flock -s 200
while [[ ${CUR} -le ${END} ]]; do
#Set date to proper BQ query format for SQL
QDATE=$(date -d@${CUR} +%Y-%m-%d)
FDATE=$(date -d@${CUR} +%Y%m%d)
let CUR+=24*60*60
setup_gs_storage
setup_bq_tables
table_check
export_data
if [[ -z ${AWS_S3_BUCKET} ]]; then echo "INFO: AWS_S3_BUCKET not set"; else transfer_s3; fi
archive_table
done
} 200>/tmp/query.lock
# Batch file echo that everything worked. Cleanup and reset the process
echo "OK: Job is complete" && rm -Rf /tmp/*
exit 0

Example: Defining a batch job

Typically, a batch job will execute one or more tasks on a schedule. For our BigQuery example above, we will set up a command with the query start and end dates set to 2019-01-01:

/usr/bin/env bash -c 'bigquery-job prod 2019-01-01 2019-01-01'

This will run a batch job for January 1st, 2019. If you run this batch job on January 2nd, you will pull data for the full 2019-01-01 date.

However, if you run this at 5 PM on the 1st, you will only pull partial data in your batch job. Be mindful of what data is present in the upstream tables, and make sure your query aligns with when the desired data actually lands.
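If you always want the previous, complete day, compute the dates at submit time instead of hard-coding them. A small sketch using GNU date, which the scripts above already rely on:

# Run the batch job for yesterday, a complete day of GA360 data
YESTERDAY=$(date -d "1 day ago" +%Y-%m-%d)
/usr/bin/env bash -c "bigquery-job prod ${YESTERDAY} ${YESTERDAY}"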

Example: Batch job processing with CRON

When bigquery-job is run, it is designed to run all the jobs present in the ./env directory. If you wanted to set this up as a recurring operation, you can create a CRON task:

05 12 * * * /usr/bin/env bash -c 'bigquery-job prod <start date> <end date>'

However, you don’t have to use the bigquery-job wrapper. You can call the job script bigquery-run directly via Docker:

docker run -it -v /Users/bob/Documents/github/ob_google-bigquery/auth/prod.json:/auth.json -v /Users/bob/Documents/github/ob_google-bigquery/sql:/sql --env-file /env/file.env openbridge/ob_google-bigquery bigquery-run prod 2019-01-01 2019-01-01

BigQuery Batch Docker Image

For detailed examples and all the code shown here, please reference the GitHub project:

Google BigQuery: Stop Worrying & Do Data Pipelines Like A Boss With Speedy Google Cloud SDK And…

What Is Your Feedback On AWS Batch?

Our takeaway is that the service offers value for those needing batch services, especially if you already use Docker to package your batch applications.

As the “Fetch & Run” example showed, you can easily run Docker-based batch jobs. This means you can set up batch jobs for just about anything, including exports from Google Cloud to AWS (see https://github.com/openbridge/ob_google-bigquery).

Using the service as the compute environment helps you automate your workflows. As you can see from the examples, the heavy lifting is the development of your batch processing application in Docker. If you are running batch processes locally on your laptop or in some other environment, AWS Batch gives you a secure, consistent, managed environment to run your processes.
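For example, once a compute environment and queue exist, you could register the BigQuery image from this post as a Batch job definition and submit it. This is a sketch under assumptions: the definition and queue names are illustrative, the resource sizes are guesses, and credential handling (the auth.json and .env values shown earlier) is omitted and would need to be baked into your image or injected securely:

# Register the BigQuery export image as a job definition (illustrative names/sizes)
aws batch register-job-definition \
  --job-definition-name bigquery-export \
  --type container \
  --container-properties '{"image": "openbridge/ob_google-bigquery", "vcpus": 1, "memory": 2048, "command": ["bigquery-run", "prod", "2019-01-01", "2019-01-01"]}'

# Submit a run to an existing queue
aws batch submit-job \
  --job-name ga360-export-2019-01-01 \
  --job-queue my-batch-queue \
  --job-definition bigquery-export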

What is AWS Batch pricing?

Since there is no additional charge for AWS Batch, you can focus on efficiently aligning your resources (e.g., EC2 instances) to your batch jobs. The longer your jobs run, or the more instances you need to run them, the higher your costs will be.

For example, in our BigQuery scenario, let us assume we wanted to export 720 days of data. We can do that sequentially on one instance, or break the work up so that 24 instances each get about a month (720 / 24 = 30 days) to work on. In that case, you are trading additional instance cost for speed: the 24 instances will get the work done in a fraction of the time the single server would.
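One way to fan that work out is an AWS Batch array job: submit a single job with 24 child indexes and have each container derive its month from the AWS_BATCH_JOB_ARRAY_INDEX environment variable that AWS Batch injects. A sketch, reusing the illustrative names above:

# Each of the 24 children receives AWS_BATCH_JOB_ARRAY_INDEX (0-23), which your
# entrypoint can map to the month of data it should export.
aws batch submit-job \
  --job-name ga360-backfill \
  --job-queue my-batch-queue \
  --job-definition bigquery-export \
  --array-properties size=24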

Getting Started

Are you looking for fully automated, zero-administration batch processing? Openbridge offers batch data ingestion as a service, as well as streaming API data ingestion.

In addition to our batch data processing system and streaming API, we also offer pre-built data connectors to popular services like Facebook, Amazon Seller Central, Google Ads, Salesforce, Google Analytics 360, YouTube and many others.

If you do not want to learn batch processing and are looking for code-free, fully managed, zero-administration data pipelines, we have you covered.

We have launched code-free, zero-admin, fully automated batch data pipelines to leading cloud warehouses and data lakes.

Get started with Amazon Redshift, Google BigQuery, Redshift Spectrum or Amazon Athena for free!

Not ready to get started just yet? If you are trying to learn how to implement AWS Batch, reach out to our team of data experts to discuss your needs further.

Need a platform and team of experts to kickstart your data and analytics efforts? Our batch service can help! If you want to discuss a proof-of-concept, pilot, project or any other effort, the Openbridge platform and team of data experts are ready to help.

Visit us at www.openbridge.com to learn how we are helping other companies with code-free, fully managed, and automated batch on AWS.
