Flattening JSON data with nested schema structure using Apache PySpark


Introduction

JavaScript Object Notation (JSON) is a text-based, flexible, lightweight data-interchange format for semi-structured data. It is heavily used in transferring data between servers, web applications, and web-connected devices.

More often than not, events that are generated by a service or a product are in JSON format. These JSON records can have multi-level nesting and array-type fields, which in turn have their own schema. Additionally, some of these fields are mandatory and some are optional. Hence, retrieving the schema and extracting only the required columns becomes a tedious task.

This article presents an approach that minimizes the effort spent retrieving the schema of JSON records in order to extract specific columns, and that flattens out the entire JSON data passed as input.

Implementation

Let’s say that two people have ordered items from an online delivery platform and the events generated were dumped as ORC files in an S3 location, here s3://mybucket/orders/. To read these records, execute this piece of code:

df = spark.read.orc('s3://mybucket/orders/')

When you do a df.show(5, False), it displays up to 5 records without truncating the output of each column.

JSON records

Let’s print the schema of the JSON and visualize it. To do that, execute this piece of code:

json_df = spark.read.json(df.rdd.map(lambda row: row.json))
json_df.printSchema()
JSON schema
Note: Reading a collection of files from a path ensures that a global schema is captured over all the records stored in those files.

The JSON schema can be visualized as a tree where each field can be considered as a node. If a field contains sub-fields then that node can be considered to have multiple child nodes. The tree for this schema would look like this:

Tree visualization of JSON schema

The first record in the JSON data belongs to a person named John who ordered 2 items. The second record belongs to Chris who ordered 3 items. The expectation of our algorithm would be to extract all fields and generate a total of 5 records, each record for each item.

The key to flattening these JSON records is to obtain:

  1. the path to every leaf node (these nodes could be of string, bigint, timestamp, etc. types, but not of struct-type or array-type),
  2. the order of exploding (the sequence in which array-type columns are to be exploded), and
  3. the order of opening (the sequence in which struct-type columns are to be opened, when a struct-type field is the parent of an array-type field); a minimal illustration of exploding vs. opening follows this list.
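To make the two operations concrete, here is a small illustration of opening a struct-type field versus exploding an array-type field. The dataframe, column names, and field names below are made up for the example:

from pyspark.sql import functions as F

# Hypothetical dataframe: one struct column holding a scalar and an array of structs
data = [({"id": "ORD1", "items": [{"product_name": "pen"}, {"product_name": "book"}]},)]
schema = "order_details struct<id:string, items:array<struct<product_name:string>>>"
sdf = spark.createDataFrame(data, schema)

# "Opening" a struct-type field: reach into it with dot notation
opened = sdf.select(
    F.col("order_details.id").alias("order_details-id"),
    F.col("order_details.items").alias("items"),
)

# "Exploding" an array-type field: one output row per array element
exploded = opened.select(
    F.col("order_details-id"),
    F.explode("items").alias("item"),
).select("order_details-id", F.col("item.product_name").alias("product_name"))

exploded.show(truncate=False)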

Code Implementation

First import the necessary library:

import json

Next, the class variables are defined:
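The original code is not reproduced verbatim; a minimal sketch of the initializer, setting up one class variable per item in the list that follows, might look like this:

class AutoFlatten:
    def __init__(self, json_schema):
        # metadata of every field in the schema
        self.fields_in_json = self.get_fields_in_json(json_schema)
        # path to a leaf field -> column name in the flattened dataframe
        self.all_fields = {}
        # paths to array-type fields
        self.cols_to_explode = set()
        # nested dict used for step-by-step traversal to the array-type fields
        self.structure = {}
        # sequence in which fields have to be exploded/opened
        self.order = []
        # exploded/opened field -> mutually exclusive leaf fields under it
        self.bottom_to_top = {}
        # fields directly accessible with or without dot notation
        self.rest = set()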

where the get_fields_in_json function is defined as:
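A plausible sketch, assuming json_schema is the StructType obtained from json_df.schema: StructType.json() serializes the schema, and json.loads turns it into a plain dictionary whose 'fields' key holds the field metadata.

    def get_fields_in_json(self, json_schema):
        """Return the schema's field metadata as a list of plain dicts."""
        # StructType -> JSON string -> dict; the top-level dict has a 'fields' key
        return json.loads(json_schema.json())['fields']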

A brief explanation of each of the class variables is given below:

  1. fields_in_json: This variable contains the metadata of the fields in the schema.
  2. all_fields: This variable contains a 1–1 mapping between the path to a leaf field and the column name that would appear in the flattened dataframe.
  3. cols_to_explode: This variable is a set containing paths to array-type fields.
  4. structure: This variable is a dictionary that is used for step-by-step node traversal to the array-type fields in cols_to_explode.
  5. order: This is a list containing the order in which array-type fields have to be exploded. If an array-type field is inside a struct-type field, then the struct-type field has to be opened first and hence has to appear before the array-type field.
  6. bottom_to_top: This is a dictionary where each key is an array-type field, or a struct-type field that is the parent of an array-type field, and maps to a list of mutually exclusive leaf fields under it.
  7. rest: Contains fields that are directly accessible with or without dot notation.

All these class variables are then used to explode/open the fields. But how are these class variables computed? Let us analyze this in steps.

Step 1: When the compute function is called on the AutoFlatten object, the class variables get updated. The compute function is defined as follows:
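A sketch of compute that mirrors the steps described below is shown here; the helper method names (get_structure, get_order, get_bottom_to_top) are illustrative placeholders rather than the exact original implementation:

    def compute(self):
        # Step 2: record leaf-field paths and the paths to array-type fields
        self.unnest_dict(self.fields_in_json, '')
        # Step 3: leaf fields that sit underneath an array-type field
        self.all_cols_in_explode_cols = set(
            col for col in self.all_fields
            if col.startswith(tuple(self.cols_to_explode)))
        # Step 4: everything else is directly accessible, with or without dot notation
        self.rest = set(self.all_fields) - self.all_cols_in_explode_cols
        # Step 5: nested dict used to navigate down to the array-type fields
        self.structure = self.get_structure(self.cols_to_explode)
        # Step 6: BFS over structure gives the order of exploding/opening
        self.order = self.get_order(self.structure)
        # Step 7: mutually exclusive leaf fields for each exploded/opened field
        self.bottom_to_top = self.get_bottom_to_top(
            self.order, self.all_cols_in_explode_cols)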

Each of the class variables would then look like this:

class variables (image)

Step 2: The unnest_dict function unnests the dictionaries in the json_schema recursively and maps the hierarchical path of a field to the column name in the all_fields dictionary whenever it encounters a leaf node (a check done by the is_leaf function). Additionally, it also stores the paths to the array-type fields in the cols_to_explode set.
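A sketch of these two methods, assuming the field metadata follows the JSON layout produced by Spark's StructType.json() (a struct field carries a dict under 'type' with its own 'fields', an array field carries 'elementType'):

    def is_leaf(self, field):
        """A field is a leaf when its type is a primitive, not a nested struct/array."""
        return not isinstance(field['type'], dict)

    def unnest_dict(self, fields, parent_path):
        """Recursively walk the schema metadata, recording leaf and array paths."""
        for field in fields:
            path = f"{parent_path}.{field['name']}"
            if self.is_leaf(field):
                # hierarchical path -> column name in the flattened dataframe
                self.all_fields[path] = field['name']
            elif field['type']['type'] == 'struct':
                self.unnest_dict(field['type']['fields'], path)
            elif field['type']['type'] == 'array':
                self.cols_to_explode.add(path)
                element = field['type']['elementType']
                if isinstance(element, dict) and element['type'] == 'struct':
                    self.unnest_dict(element['fields'], path)
                else:
                    # array of primitives: the array itself becomes the leaf
                    self.all_fields[path] = field['name']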

Step 3: Next, all the leaf fields that lie under array-type fields are obtained by checking whether elements of all_fields start with any element in cols_to_explode; these are stored in all_cols_in_explode_cols.

Step 4: Using all_cols_in_explode_cols, rest is calculated with a simple set difference operation; it contains the fields that are directly accessible with or without dot notation.

Step 5: Now, structure is computed using cols_to_explode; it is used for step-by-step node traversal to get to the array-type fields.
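One way to build such a nested dictionary is sketched below; keys are full paths, so every intermediate struct on the way to an array also appears as a node:

    def get_structure(self, cols_to_explode):
        """Nested dict describing how to walk from the root to every array field."""
        structure = {}
        for col in sorted(cols_to_explode):
            node, path = structure, ''
            for part in col.split('.')[1:]:   # paths carry a leading '.'
                path = f"{path}.{part}"
                node = node.setdefault(path, {})
        return structure

    # e.g. {'.order_details.items'} -> {'.order_details': {'.order_details.items': {}}}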

Step 6: Next, a BFS traversal is performed on structure to obtain the order in which the array explode has to take place, and this order is stored in the order class variable.
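A breadth-first traversal over that nested dict could look like this (a plain list is used as the queue for brevity):

    def get_order(self, structure):
        """BFS: parents are always exploded/opened before anything nested in them."""
        order, queue = [], list(structure.items())
        while queue:
            path, children = queue.pop(0)     # FIFO -> breadth-first
            order.append(path)
            queue.extend(children.items())
        return order

    # e.g. {'.order_details': {'.order_details.items': {}}}
    #      -> ['.order_details', '.order_details.items']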

Step 7: Finally, order and all_cols_in_explode_cols are used to get all the mutually exclusive leaf fields for each element in order. To do this, a bottom-to-top approach is used, i.e. the order list is reversed and the leaf fields inside each of the fields in order are mapped and stored in bottom_to_top. Note that the '.order_details' key in bottom_to_top has no elements in it; if there were leaf nodes under it, those would be directly accessible and would appear in rest.
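A sketch of that bottom-to-top pass: walking order in reverse and assigning each leaf to the deepest field it sits under keeps the lists mutually exclusive.

    def get_bottom_to_top(self, order, all_cols_in_explode_cols):
        """Map each exploded/opened field to the leaf fields that belong to it alone."""
        bottom_to_top, taken = {}, set()
        for field in reversed(order):
            leaves = [col for col in all_cols_in_explode_cols
                      if col.startswith(f"{field}.") and col not in taken]
            bottom_to_top[field] = leaves
            taken.update(leaves)
        return bottom_to_top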

Combining all the functions, the class would look like this:
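Gathered together from the sketches above, the outline of the class is:

class AutoFlatten:
    def __init__(self, json_schema): ...                 # initialize the class variables
    def get_fields_in_json(self, json_schema): ...       # schema -> list of field dicts
    def is_leaf(self, field): ...                        # primitive-type check
    def unnest_dict(self, fields, parent_path): ...      # fill all_fields / cols_to_explode
    def get_structure(self, cols_to_explode): ...        # nested traversal dict
    def get_order(self, structure): ...                  # BFS order of explode/open
    def get_bottom_to_top(self, order, all_cols_in_explode_cols): ...  # leaf mapping
    def compute(self): ...                               # run steps 2 to 7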

To make use of the class variables to open/explode, this block of code is executed:
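A simplified sketch of such a driver block is given below. The helper alias_for and the temporary column names are illustrative, and the actual implementation may handle more edge cases:

from pyspark.sql import functions as F

df = spark.read.orc('s3://mybucket/orders/')
json_df = spark.read.json(df.rdd.map(lambda row: row.json))

af = AutoFlatten(json_df.schema)
af.compute()   # populates all_fields, rest, order, bottom_to_top, ...

final_df = json_df
seen, out_cols, visited = set(), [], set()

def alias_for(path):
    """Target column name for a leaf path; duplicates become <parent>-<column>."""
    name = af.all_fields[path]
    if name in seen:
        name = f"{path.split('.')[-2]}-{name}"
    seen.add(name)
    out_cols.append(name)
    visited.add(path)          # remember which leaf paths have been handled
    return name

# 1. directly accessible fields: first-level columns and struct leaves in `rest`
for path in sorted(af.rest):
    final_df = final_df.withColumn(alias_for(path), F.col(path[1:]))

# 2. walk `order`: explode array-type fields, open struct-type parents of arrays
opened = {}                                   # path -> temporary column holding it
for path in af.order:
    parent, field = path.rsplit('.', 1)
    src = f"{opened[parent]}.{field}" if parent in opened else path[1:]
    tmp = path.replace('.', '_')              # collision-free temporary name
    if path in af.cols_to_explode:            # array-type -> one row per element
        final_df = final_df.withColumn(tmp, F.explode(F.col(src)))
    else:                                     # struct-type parent -> just open it
        final_df = final_df.withColumn(tmp, F.col(src))
    opened[path] = tmp
    # pull out the leaf fields that belong to this exploded/opened field
    for leaf in af.bottom_to_top[path]:
        if leaf not in visited:
            final_df = final_df.withColumn(alias_for(leaf),
                                           F.col(opened[path] + leaf[len(path):]))

# 3. keep only the flattened leaf columns
final_df = final_df.select(out_cols)

Note that F.explode drops rows whose array is empty or null; F.explode_outer can be used instead to keep them.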

Here, the JSON records are read from the S3 path and the global schema is computed. This schema is then passed while creating an object of the AutoFlatten class, which initializes all class variables. When the compute function is called on the AutoFlatten object, the class variables are updated.

To open/explode, all first-level columns are selected along with the columns in rest that haven't appeared already. If there are duplicate target column names, the duplicates are aliased as <parent_field_name>-<target_column_name>, and the paths to those fields are added to the visited set of paths.

Then a check is done on whether order is empty or not. An empty order list means that there is no array-type field in the schema, and vice versa. If the order list isn't empty, a traversal is done over each element in order and, based on its type, array or struct (the latter only appears if an array-type field is a child of a struct-type field), the columns are exploded or opened respectively, and only those columns are selected that haven't already appeared. Additionally, duplicate target column names are replaced by <parent_field_name>-<target_column_name> and the paths to those fields are added to the visited set of paths.

Let's see what columns appear in final_df .

>>> final_df.columns
['pincode', 'street', 'name', 'quantity', 'email', 'city', 'ordered_time', 'id', 'state', 'house_number', 'price', 'discount', 'country', 'product_name', 'order_details-id']

Since id in the order_details field was a duplicate, it was renamed as order_details-id.

Looking at the counts of the initial dataframe df and final_df dataframe, we know that the array explode has occurred properly.

dataframe counts

Let's now verify by looking at the records belonging to the final_df dataframe.

final dataframe

As you can see, there is one record for every item that was purchased, and the algorithm has worked as expected.

Things to note

There are a few things to keep in mind while using this approach.

  1. All the target column names have been retrieved by using the name of the leaf node in the metadata of the JSON schema.
  2. If you are working in a constrained environment, the column names will have to be changed to meet compliance standards after performing flattening.
  3. Incorporating regexp_replace, epoch-to-timestamp conversion, string-to-timestamp conversion, and the like are regarded as custom transformations on the raw data extracted from each of the columns. Hence, they have to be defined by the developer after performing the autoflatten operation.
  4. Beware of exposing Personally Identifiable Information (PII) columns, as this mechanism exposes all columns. You would have to perform custom operations like hashing on those columns.

That’s it! I hope this helps people who are looking to flatten out their JSON data without defining and passing a schema to extract required fields and also those who are looking to learn new stuff.

Thanks to Sandesh for collaborating with me on this!

Cheers!

