Some of the most useful Pandas tricks

Reading data is the first step in any data science project. As a machine learning practitioner or a data scientist, you would have surely come across JSON (JavaScript Object Notation) data. JSON is a widely used format for storing and exchanging data. For example, NoSQL database like MongoDB store the data in JSON format, and REST API’s responses are mostly available in JSON.
Although this format works well for storing and exchanging data, it needs to be converted into a tabular form for further analysis. You are likely to deal with 2 types of JSON structure, a JSON object or a list of JSON objects. In internal Python lingo, you are most likely to deal with a dict or a list of dicts.

In this article, you’ll learn how to use Pandas’s built-in function json_normalize() to flatten those 2 types of JSON into Pandas DataFrames. This article is structured as follows:
- Flattening a simple JSON
- Flattening a JSON with multiple levels
- Flattening a JSON with a nested list
- Ignoring KeyError if keys are not always present
- Custom separator using sep
- Adding prefix for meta and record data
- Working with a local file
- Working with a URL
Please check out Notebook for the source code.
1. Flattening a simple JSON
Let’s begin with 2 simple JSON, a simple dict and a list of simple dicts.
When the JSON is a simple dict
a_dict = {
'school': 'ABC primary school',
'location': 'London',
'ranking': 2,
}
df = pd.json_normalize(a_dict)

The result looks great. Let’s take a look at the data types with df.info(). We can see that columns that are numerical are cast to numeric types.
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 1 non-null object
1 location 1 non-null object
2 ranking 1 non-null int64
dtypes: int64(1), object(2)
memory usage: 152.0+ bytes
When the data is a list of dicts
json_list = [
{ 'class': 'Year 1', 'student number': 20, 'room': 'Yellow' },
{ 'class': 'Year 2', 'student number': 25, 'room': 'Blue' },
]
pd.json_normalize(json_list)

The result looks great. json_normalize() function is able to convert each record in the list into a row of tabular form.
What about keys that are not always present, for example, num_of_students is not available in the 2nd record.
json_list = [
{ 'class': 'Year 1', 'num_of_students': 20, 'room': 'Yellow' },
{ 'class': 'Year 2', 'room': 'Blue' }, # no num_of_students
]
pd.json_normalize(json_list)

We can see that no error is thrown and those missing keys are shown as NaN.
2. Flattening a JSON with multiple levels
Pandas json_normalize() works great for simple JSON (known as flattened JSON). What about JSON with multiple levels?
When the data is a dict
Let’s first take a look at the following dict:
json_obj = {
'school': 'ABC primary school',
'location': 'London',
'ranking': 2,
'info': {
'president': 'John Kasich',
'contacts': {
'email': {
'admission': 'admission@abc.com',
'general': 'info@abc.com'
},
'tel': '123456789',
}
}
}
The value of info is multiple levels (known as a nested dict). By calling pd.json_normalize(json_obj), we get:

The result looks great. All nested values are flattened and converted into separate columns.
If you don’t want to dig all the way down to each value use the max_level argument. With the argument max_level=1, we can see that our nested value contacts is put up into a single column info.contacts.
pd.json_normalize(data, max_level=1)

When the data is a list of dicts
json_list = [
{
'class': 'Year 1',
'student count': 20,
'room': 'Yellow',
'info': {
'teachers': {
'math': 'Rick Scott',
'physics': 'Elon Mask'
}
}
},
{
'class': 'Year 2',
'student count': 25,
'room': 'Blue',
'info': {
'teachers': {
'math': 'Alan Turing',
'physics': 'Albert Einstein'
}
}
},
]
pd.json_normalize(json_list)

We can see that all nested values in each record of the list are flattened and converted into separate columns. Similarly, we can use the max_level argument to limit the number of levels, for example
pd.json_normalize(json_list, max_level=1)

3. Flattening JSON with a nested list
What about JSON with a nested list?
When the data is a dict
Let’s see how to flatten the following JSON into a DataFrame:
json_obj = {
'school': 'ABC primary school',
'location': 'London',
'ranking': 2,
'info': {
'president': 'John Kasich',
'contacts': {
'email': {
'admission': 'admission@abc.com',
'general': 'info@abc.com'
},
'tel': '123456789',
}
},
'students': [
{ 'name': 'Tom' },
{ 'name': 'James' },
{ 'name': 'Jacqueline' }
],
}
Notes the value of students is a nested list. By calling pd.json_normalize(json_obj), we get:

We can see that our nested list is put up into a single column students and other values are flattened. How can we flatten the nested list? To do that, we can set the argument record_path to ['students']:
# Flatten students
pd.json_normalize(data, record_path=['students'])

The result looks great but doesn’t include school and tel. To include them, we can use the argument meta to specify a list of metadata we want in the result.
pd.json_normalize(
json_obj,
record_path =['students'],
meta=['school', ['info', 'contacts', 'tel']],
)

When the data is a list of dicts
json_list = [
{
'class': 'Year 1',
'student count': 20,
'room': 'Yellow',
'info': {
'teachers': {
'math': 'Rick Scott',
'physics': 'Elon Mask'
}
},
'students': [
{
'name': 'Tom',
'sex': 'M',
'grades': { 'math': 66, 'physics': 77 }
},
{
'name': 'James',
'sex': 'M',
'grades': { 'math': 80, 'physics': 78 }
},
]
},
{
'class': 'Year 2',
'student count': 25,
'room': 'Blue',
'info': {
'teachers': {
'math': 'Alan Turing',
'physics': 'Albert Einstein'
}
},
'students': [
{ 'name': 'Tony', 'sex': 'M' },
{ 'name': 'Jacqueline', 'sex': 'F' },
]
},
]
pd.json_normalize(json_list)

All nested lists are put up into a single column students and other values are flattened. To flatten the nested list, we can set the argument record_path to ['students']. Notices that not all records have math and physics, and those missing values are shown as NaN.
pd.json_normalize(json_list, record_path=['students'])

If you would like to include other metadata use the argument meta:
pd.json_normalize(
json_list,
record_path =['students'],
meta=['class', 'room', ['info', 'teachers', 'math']]
)

4. The errors argument
The errors argument default to 'raise’ and will raise KeyError if keys listed in meta are not always present. For example, the math teacher is not available from the second record.
data = [
{
'class': 'Year 1',
'student count': 20,
'room': 'Yellow',
'info': {
'teachers': {
'math': 'Rick Scott',
'physics': 'Elon Mask',
}
},
'students': [
{ 'name': 'Tom', 'sex': 'M' },
{ 'name': 'James', 'sex': 'M' },
]
},
{
'class': 'Year 2',
'student count': 25,
'room': 'Blue',
'info': {
'teachers': {
# no math teacher
'physics': 'Albert Einstein'
}
},
'students': [
{ 'name': 'Tony', 'sex': 'M' },
{ 'name': 'Jacqueline', 'sex': 'F' },
]
},
]
A KeyError will be thrown when trying to flatten the math.
pd.json_normalize(
data,
record_path =['students'],
meta=['class', 'room', ['info', 'teachers', 'math']],
)

To work around it, set the argument errors to 'ignore' and those missing values are filled with NaN.
pd.json_normalize(
data,
record_path =['students'],
meta=['class', 'room', ['info', 'teachers', 'math']],
errors='ignore'
)

5. Custom Separator using the sep argument
By default, all nested values will generate column names separated by .. For example info.teachers.math. To separate column names with something else, you can use the sep argument.
pd.json_normalize(
data,
record_path =['students'],
meta=['class', 'room', ['info', 'teachers', 'math']],
sep='->'
)

6. Adding prefix for meta and record data
Sometimes, it may be more descriptive to add prefixes for the column names. To do that for the meta and record_path, we can simply pass the string to the argument meta_prefix and record_prefix respectively:
pd.json_normalize(
data,
record_path=['students'],
meta=['class'],
meta_prefix='meta-',
record_prefix='student-'
)

7. Working with a local file
Often, the JSON data you will be working on is stored locally as a .json file. However, Pandas json_normalize() function only accepts a dict or a list of dicts. To work around it, you need help from a 3rd module, for example, the Python json module:
import json
# load data using Python JSON module
with open('data/simple.json','r') as f:
data = json.loads(f.read())
# Flattening JSON data
pd.json_normalize(data)
data = json.loads(f.read()) loads data using Python json module. After that, json_normalize() is called on the data to flatten it into a DataFrame.
8. Working with a URL
JSON is a standard format for transferring data in REST APIs. Often, you need to work with API’s response in JSON format. The simplest way to do that is using the Python request modules:
import requests
URL = 'http://raw.githubusercontent.com/BindiChen/machine-learning/master/data-analysis/027-pandas-convert-json/data/simple.json'
data = json.loads(requests.get(URL).text)
# Flattening JSON data
pd.json_normalize(data)
Conclusion
Pandas json_normalize() function is a quick, convenient, and powerful way for flattening JSON into a DataFrame.
I hope this article will help you to save time in flattening JSON data. I recommend you to check out the documentation for the json_normalize() API and to know about other things you can do.
Thanks for reading. Please check out the notebook for the source code and stay tuned if you are interested in the practical aspect of machine learning.
You may be interested in some of my other Pandas articles:
- Pandas cut() function for transforming numerical data into categorical data
- Using Pandas method chaining to improve code readability
- How to do a Custom Sort on Pandas DataFrame
- All the Pandas shift() you should know for data analysis
- When to use Pandas transform() function
- Pandas concat() tricks you should know
- Difference between apply() and transform() in Pandas
- All the Pandas merge() you should know
- Working with datetime in Pandas DataFrame
- Pandas read_csv() tricks you should know
- 4 tricks you should know to parse date columns with Pandas read_csv()
More tutorials can be found on my Github
All Pandas json_normalize() you should know for flattening JSON was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.