Read CSV, converting categories, and finding empty strings in a dataframe

Data, also known these days as the new oil, is one of the most sought after natural resources in the world right now. However, acquiring data is one thing, while obtaining insights from it is another.

This process is also called data handling, an umbrella term that encompasses data cleaning, data preprocessing, data analysis, data predictions, and feature engineering.

There’s no one right method for handling data, and techniques used depend largely on the scenario, but there are some fast methods or tricks for tackling these challenges in certain scenarios. This is the purpose of this blog post and the subsequent series of posts.

Each part of the series will be structured into 3 parts: An introduction, three scenarios faced, and three proposed solutions.

Let’s dive into the scenarios.

SCENARIO 1: Reading specific columns, from CSV to dataframes

To get data into to a dataframe, most of the time the data is read from an external source, like a CSV or HTML file, database, etc. An ideal way to read data from a CSV using Pandas is pd.read_csv().

But what if there are certain columns in the CSV that are actually needed? A typical approach is to read the CSV and then slice the dataframe to select the columns that are of interest, like this: (Please note: The data used for this scenario is from Kaggle on Customer details on Loan)

However, there is a better way, as shown below and explained in the next couple of paragraphs:

Here, instead of slicing we identified and stated the important columns needed by using the argument usecols in the read_csv() method before loading the data into a dataframe.

Although the former method (slicing) works fine, it also requires time and efficiency. For the data used, the CSV consists of 145 columns, which means it will take a bit of time to read and load on the dataframe if reading all the columns at once. Plus, there’s no point in writing 1000 lines codes when you can achieve the same goal with just 10 lines of code—or in this case, a single line of code.

SCENARIO 2: Creating a new column from a previous column

Creating a new column from previous columns (also known as feature engineering) is very common when trying to improve a machine learning model, or when you’re narrow down insights from an existing column(s).

The data used below depicts information like gender, age_group and survival status (where 1 means Survived and 0 mean Not survived), which was artificially created for the purpose of this example. Take a look at the dataframe below

Focusing on the age_group column, there may be times where we need to change the column from age_group to another related category such as: teenager, child, etc. We can easily do this by first creating a dictionary with the keys as the existing variables and the values as the new variables, and then using a .map() method to apply it to a new column in a dataframe.

SCENARIO 3: Finding empty strings in a dataframe that aren’t exactly null values

Having missing values in a dataframe is very common when handling real-world data. However, especially when dealing with text, there are often missing values that Pandas may not recognize as missing.

One example of this is an empty string (a string object containing only space characters). One amazing string attribute used to identify such a string is .isspace. In this demo, we’re going to convert all the rows in a dataframe into a tuple using .itertuple() in a for-loop and filter the index of all the texts that have only spaces in them. Our data below shows movie reviews that have been labeled positive or negative. Find more details here.

Let see how the dataframe looks like and see if there are missing values:

From the above code, we see that there are 35 missing reviews and 0 missing labels. This means that there are labels with no review assigned to them. Missing reviews add no insight into the analysis and should be dropped. But how can we identify empty reviews as well?

In the above code, we created two empty lists and converted the dataframe into a tuple in order to separate the columns and index entry. This was also used to iterate through all the rows in the form of a tuple. We separated strings from the NULL values using if type(review)==str and stored the index of the null values in one of the lists created. Then for the reviews that are strings, we checked for strings containing only space characters using review.isspace; and if it does, we store the index in another list.

(Please note that we can store all the indexes found in one list. The reason for separating the index for the null values from the empty strings is to clearly understand the number of identified empty strings and null values—separately.)

Let's check the length of both lists:

It’s time to drop them by simply using the .drop() method:

We’ve successfully removed the null values as well as the empty review strings. At this stage, your data should be ready to be analyzed or preprocessed and is one step closer to being passed into a machine learning model.

Conclusion

Data handling plays a huge part in how quickly we can extract insights from our data—and in and how credible those insights are. So this process shouldn’t be taken likely.

As we already know, handling data is very much a decision-making process. Although there are rules of thumb, one still has to decide which method works best in a given situation. A major characteristic of real-world data is the presence of missing values. The next post in this series will be shedding light on how missing values in data should be handled, depending on the data type, purpose, etc. So stay tuned.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to exploring the emerging intersection of mobile app development and machine learning. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Fritz AI, the machine learning platform that helps developers teach devices to see, hear, sense, and think. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and Heartbeat), join us on Slack, and follow Fritz AI on Twitter for all the latest in mobile machine learning.