Reference source: vitu.ai

Data grouping, aggregation, and sorting

Grouping and aggregation are important enough to have their own section in the official pandas documentation: "Group by: split-apply-combine".

import pandas as pd
pd.set_option('display.max_rows', 5)  # full option name; a bare 'max_rows' can match more than one option
import numpy as np
melbourne_data = pd.read_csv('melb_data.csv') 
melbourne_data.head()

Maps allow us to transform the data in a DataFrame or Series one value at a time for an entire column. However, we often want to group the data first and then do something specific to the group each row belongs to. To do this, we can use the groupby operation.

For example, one function we have used a lot so far is value_counts. We can replicate what value_counts does with the following groupby operation:

melbourne_data.groupby('Rooms').Rooms.count()

groupby creates groups of houses that share the same Rooms value. Then, for each of these groups, we take the Rooms column and count how many entries it contains.
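
To make the equivalence concrete, here is a minimal sketch on a small hand-made table. The `toy` DataFrame and its values are invented for illustration and are not taken from melb_data.csv. One detail to note: value_counts orders its result by count, while groupby orders by group key, so the sketch sorts the value_counts result by index before comparing.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({"Rooms": [2, 3, 3, 4, 2, 3]})

# Count rows per group with groupby ...
by_group = toy.groupby("Rooms").Rooms.count()

# ... and with value_counts, re-ordered by key so the two line up.
by_value_counts = toy.Rooms.value_counts().sort_index()

print(by_group.to_dict())
print(by_value_counts.to_dict())
```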

value_counts is just a shortcut for this groupby operation. We can use any of the aggregate functions we have used before on the grouped data. For example, to get the cheapest house for each number of rooms, we can do the following:

melbourne_data.groupby('Rooms').Price.min()

You can think of each group we generate as a slice of our DataFrame that contains only the rows whose values match. We can access this slice directly using the apply method, and then manipulate the data in any way we see fit. For example, here is one way to select the address of the first house listed for each suburb in the dataset:

melbourne_data.groupby('Suburb').apply(lambda df: df.Address.iloc[0])
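
To see what such a slice looks like, the sketch below pulls one group out with get_group and then takes the first address per group. The `toy` table is made up for illustration. The sketch selects the Address column before calling apply, which is an assumption about style rather than the author's exact call; on this data it gives the same first address per suburb.

```python
import pandas as pd

# Hypothetical toy data; suburbs, addresses and prices are invented.
toy = pd.DataFrame({
    "Suburb": ["Abbotsford", "Richmond", "Abbotsford", "Richmond"],
    "Address": ["1 First St", "2 Second St", "3 Third St", "4 Fourth St"],
    "Price": [900000, 1200000, 950000, 1100000],
})

grouped = toy.groupby("Suburb")

# One group is just the rows of the original table for that suburb.
abbotsford = grouped.get_group("Abbotsford")
print(abbotsford)

# First address within each suburb, in original row order.
first_addresses = grouped.Address.apply(lambda s: s.iloc[0])
print(first_addresses)
```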

For finer-grained control, you can also group by more than one column. For example, here is how to pick the most expensive house in each region and suburb:

melbourne_data.groupby(['Regionname', 'Suburb']).apply(lambda df: df.loc[df.Price.idxmax()])
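
As an alternative sketch of the same idea, idxmax can be called on the grouped Price column to get the row label of the most expensive house in each group, and those labels can then be used to look up the full rows. This is a swapped-in technique, not the call used above, and the `toy` table is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; all values are invented.
toy = pd.DataFrame({
    "Regionname": ["Northern", "Northern", "Southern", "Southern"],
    "Suburb": ["Preston", "Preston", "Elwood", "Elwood"],
    "Address": ["1 A St", "2 B St", "3 C St", "4 D St"],
    "Price": [700000, 850000, 1300000, 1250000],
})

# Row label of the highest price within each (region, suburb) group.
top_rows = toy.groupby(["Regionname", "Suburb"]).Price.idxmax()

# Look the full rows up in the original table.
most_expensive = toy.loc[top_rows.values]
print(most_expensive)
```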

Another groupby method worth mentioning is agg, which lets you run several different functions on the grouped data at once. For example, we can generate a simple statistical summary of the dataset as follows:

melbourne_data.groupby(['Suburb']).Price.agg([len, min, max])
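
The sketch below runs a similar summary on a small invented table so the output can be checked by eye. It passes the aggregations as strings ("count", "min", "max") instead of Python functions, which is an assumption about style. Note that "count" skips missing values whereas len does not; the toy data has no missing prices, so the two agree here.

```python
import pandas as pd

# Hypothetical toy data; all values are invented.
toy = pd.DataFrame({
    "Suburb": ["Abbotsford", "Richmond", "Abbotsford", "Richmond", "Richmond"],
    "Price": [900000, 1200000, 950000, 1100000, 1000000],
})

# One row per suburb, one column per aggregation.
summary = toy.groupby("Suburb").Price.agg(["count", "min", "max"])
print(summary)
```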

Using groupby effectively will let you do many powerful things with your dataset.

Multi-indexes

In all the examples we have seen so far, we have been working with DataFrame or Series objects that have a single-label index. groupby is slightly different because, depending on the operation we run, it sometimes produces what is called a multi-index.

A multi-index differs from a regular index in that it has multiple levels. For example:

house = melbourne_data.groupby(['Regionname', 'Suburb']).Address.agg([len])
house
mi = house.index  # `_` only refers to the previous output inside IPython/Jupyter
type(mi)

Multi-indexes have several methods for dealing with their tiered structure that do not exist for single-level indexes. They also require two levels of labels to retrieve a value. Dealing with multi-index output is a common "gotcha" for users who are new to pandas.

The use cases for a multi-index, along with instructions for using one, are detailed in the "MultiIndex / advanced indexing" section of the pandas documentation.

However, the multi-index method you will use most often is the one that converts back to a regular index, the reset_index method:
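
As a rough illustration of the two-level lookup, the sketch below builds a small grouped table and retrieves values from it. The `toy` data is invented; only the column names follow the example above.

```python
import pandas as pd

# Hypothetical toy data; all values are invented.
toy = pd.DataFrame({
    "Regionname": ["Northern", "Northern", "Southern", "Southern", "Southern"],
    "Suburb": ["Preston", "Preston", "Brighton", "Elwood", "Elwood"],
    "Address": ["1 A St", "2 B St", "3 C St", "4 D St", "5 E St"],
})

house = toy.groupby(["Regionname", "Suburb"]).Address.agg([len])

# The index has one level per grouping column.
n_levels = house.index.nlevels

# A full lookup needs a label for each level, passed as a tuple.
elwood_count = house.loc[("Southern", "Elwood"), "len"]

# Giving only the first level returns every suburb in that region.
southern = house.loc["Southern"]
print(n_levels, elwood_count)
print(southern)
```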

house.reset_index()

Sorting

Looking at house, we can see that grouping returns data in index order, not in value order. That is, when outputting the result of a groupby, the order of the rows depends on the values in the index, not the values in the data.

To get the data in the order we want, we can sort it ourselves. The sort_values method is handy for this.

house = house.reset_index()
house.sort_values(by='len')

sort_values defaults to an ascending sort, with the lowest values first. Most of the time, however, we want a descending sort, with the highest numbers first. That is done like this:

house.sort_values(by='len', ascending=False)

To sort by index values, use the companion method sort_index. It takes the same arguments and has the same default order:

house.sort_index()

Finally, you can sort by more than one column at a time:

house.sort_values(by=['Regionname', 'len'])
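
When sorting by several columns, the ascending argument of sort_values also accepts a list with one flag per column. The sketch below, on an invented stand-in for the `house` table, sorts regions alphabetically and, within each region, puts the largest count first.

```python
import pandas as pd

# Hypothetical stand-in for the house table after reset_index().
house = pd.DataFrame({
    "Regionname": ["Southern", "Northern", "Southern", "Northern"],
    "Suburb": ["Elwood", "Preston", "Brighton", "Reservoir"],
    "len": [2, 5, 1, 7],
})

# Regionname ascending, then len descending within each region.
ordered = house.sort_values(by=["Regionname", "len"], ascending=[True, False])
print(ordered)
```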

Original address: data processing [Swiss Army knife pandas guide]: 4. Grouping aggregation and sorting