Pandas - GroupBy Method for data analysis

In data analysis, we often need to analyze data by a category or a set of categories. In SQL, the GROUP BY statement separates data into groups that can be aggregated independently of one another. For example:

SELECT column_1, column_2, SUM(column_3)
FROM table_1
GROUP BY column_1, column_2;

In Pandas, SQL's GROUP BY operation is performed using the similarly named groupby() method. The groupby() method splits data into groups so that an operation can be computed on each group separately. We separate the data into groups and then apply a function that aggregates, converts, or transforms each group in some way.

In this article, you'll learn the groupby process (split-apply-combine) and how to use the Pandas groupby() method to group data and perform operations. The examples use the Student Performance in Exams dataset, which is available on Kaggle.

The groupby process: split-apply-combine

By 'group by' we are referring to a process involving one or more of the following steps:

Splitting the data into groups based on some criteria.

Applying a function to each group independently.

Combining the results into a data structure.

Out of these, the split step is the most straightforward: it splits the data into groups. In the next step, we apply a function to each group. It can be any one of the following:

Aggregation: compute a summary statistic (or statistics) for each group. For example:
- Compute group sums or means.
- Compute group sizes/counts.
Transformation: perform some group-specific computation and return a like-indexed object. For example:
- Standardize data (z-score) within a group.
- Fill NAs within groups with a value derived from each group.
Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. For example:
- Discard data that belongs to groups with only a few members.
- Filter out data based on the group sum or mean.
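
The three kinds of apply step listed above can be sketched on a small made-up DataFrame. The `toy` data and its column names are assumptions for illustration only and are not part of the Student Performance dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three apply steps.
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b", "c"],
    "points": [10, 20, 5, 15, 10, 40],
})
grouped = toy.groupby("team")

# Aggregation: one value per group.
totals = grouped["points"].sum()

# Transformation: result has the same length and index as the input.
centered = grouped["points"].transform(lambda s: s - s.mean())

# Filtration: keep only rows that belong to groups with at least two members.
kept = grouped.filter(lambda g: len(g) >= 2)

print(totals)
print(centered)
print(kept)
```

Note how aggregation shrinks the data to one row per group, transformation keeps every row, and filtration keeps or drops whole groups.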

DataFrame.groupby() Method

This is the syntax of the Pandas groupby() method, showing only the most frequently used parameters.

Syntax:

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, dropna=True)

Parameters:

by: mapping, function, label, or list of labels
- the keys to group by; these can be column or index labels.
axis: {0 or 'index', 1 or 'columns'}, default = 0
- split along rows (0) or columns (1). This parameter is deprecated since pandas 2.1.
level: int, level name, or sequence of such, default None
- If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
as_index: bool, default=True
- for aggregated output, return an object with group labels as the index.
sort: bool, default=True
- sort group keys.
dropna: bool, default=True
- If True, rows (or columns) whose group key contains NA values are dropped. If False, NA values are also treated as a key in groups.
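
The `by` parameter accepts more than column labels. As a minimal sketch, the example below groups by a mapping and by a function, both of which are applied to the index labels. The `sales` frame and the `country` mapping are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data indexed by city name.
sales = pd.DataFrame(
    {"units": [3, 5, 2, 7]},
    index=["paris", "lyon", "berlin", "munich"],
)

# by as a mapping: each index label is looked up to get its group key.
country = {"paris": "FR", "lyon": "FR", "berlin": "DE", "munich": "DE"}
by_country = sales.groupby(by=country)["units"].sum()

# by as a function: called on each index label to compute its group key.
by_initial = sales.groupby(by=lambda label: label[0])["units"].sum()

print(by_country)
print(by_initial)
```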

Before working with the groupby() method, let's load the data and take a look at the first five rows.

import pandas as pd
import numpy as np

data = pd.read_csv('data.csv')
data.head()

   gender race/ethnicity parental level of education         lunch  \
0  female        group B           bachelor's degree      standard   
1  female        group C                some college      standard   
2  female        group B             master's degree      standard   
3    male        group A          associate's degree  free/reduced   
4    male        group C                some college      standard   

  test preparation course  math score  reading score  writing score  
0                    none          72             72             74  
1               completed          69             90             88  
2                    none          90             95             93  
3                    none          47             57             44  
4                    none          76             78             75

What is Pandas `groupby()`?

The Pandas groupby() method splits data into groups by one or more categories so that each group can be analyzed separately. The categories can simply be one or more labels (index or columns). In our example, let's first use the single gender column.

>>> group = data.groupby(by='gender')

In the above code, we create a groupby object that divides our data by the values of the gender column. Calling the type() function on the result shows that it is a DataFrameGroupBy object.

>>> type(group)
pandas.core.groupby.generic.DataFrameGroupBy

The groupby() method returns a DataFrameGroupBy object that contains information about the groups. To see how the dataframe is divided into groups of rows, we can use the attributes and methods of the DataFrameGroupBy object.

ngroups: The ngroups attribute will return the number of groups.

>>> group.ngroups
2

groups: The groups attribute returns a dict that maps each group name to the row labels belonging to that group.

>>> group.groups
{'female': [0, 1, 2, 5, 6, 9, 12, 14, 15, 17, 19, 21, 23, 27, 29, 30, 31, 32, 36, 37, 38, 41, 42, 44, 46, 47, 48, 54, 55, 56, 59, 63, 64, 67, 69, 70, 72, 78, 79, 80, 85, 86, 87, 88, 89, 90, 94, 97, 98, 99, 102, 105, 106, 108, 109, 110, 113, 114, 116, 117, 118, 119, 120, 122, 125, 129, 133, 138, 140, 141, 142, 145, 148, 152, 155, 156, 158, 161, 164, 165, 167, 168, 169, 172, 173, 174, 175, 176, 177, 178, 179, 181, 182, 183, 189, 190, 192, 194, 198, 199, ...], 
'male': [3, 4, 7, 8, 10, 11, 13, 16, 18, 20, 22, 24, 25, 26, 28, 33, 34, 35, 39, 40, 43, 45, 49, 50, 51, 52, 53, 57, 58, 60, 61, 62, 65, 66, 68, 71, 73, 74, 75, 76, 77, 81, 82, 83, 84, 91, 92, 93, 95, 96, 100, 101, 103, 104, 107, 111, 112, 115, 121, 123, 124, 126, 127, 128, 130, 131, 132, 134, 135, 136, 137, 139, 143, 144, 146, 147, 149, 150, 151, 153, 154, 157, 159, 160, 162, 163, 166, 170, 171, 180, 184, 185, 186, 187, 188, 191, 193, 195, 196, 197, ...]}

In the above output, each group is listed together with the row labels of the dataframe assigned to it.
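
Besides inspecting the groups attribute, a grouped object can be iterated over directly, which yields one (name, sub-DataFrame) pair per group. The sketch below uses a small hypothetical `toy` frame rather than the full dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the article's dataset.
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [72, 47, 90, 76],
})

shapes = {}
for name, frame in toy.groupby("gender"):
    # name is the group key; frame holds the rows that belong to that key.
    shapes[name] = frame.shape
    print(name, frame.shape)
```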

size(): The size() method returns the size of each group as a Series whose index holds the group names and whose values are the group sizes. The numbers match what value_counts() returns for the grouping column, although value_counts() orders the result by frequency rather than by group key.

>>> group.size()
gender
female    518
male      482
dtype: int64
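
To make the difference between the counting methods concrete, here is a small sketch on hypothetical data containing one missing score. It compares size(), count(), and value_counts().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing math score.
toy = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female"],
    "math score": [47, 72, np.nan, 76, 69],
})

# size(): number of rows per group, missing values included.
sizes = toy.groupby("gender").size()

# count(): number of non-null values per group.
counts = toy.groupby("gender")["math score"].count()

# value_counts(): same numbers as size(), ordered by frequency.
vc = toy["gender"].value_counts()

print(sizes)
print(counts)
print(vc)
```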

Selecting a group

You can use the get_group() method to select a particular group. Since get_group() returns a DataFrame (or a Series, when a single column has been selected), you can use the usual DataFrame or Series attributes and methods on the result.

>>> group.get_group('female').head()

   gender race/ethnicity parental level of education     lunch  \
0  female        group B           bachelor's degree  standard   
1  female        group C                some college  standard   
2  female        group B             master's degree  standard   
5  female        group B          associate's degree  standard   
6  female        group B                some college  standard   

  test preparation course  math score  reading score  writing score  
0                    none          72             72             74  
1               completed          69             90             88  
2                    none          90             95             93  
5                    none          71             83             78  
6               completed          88             95             92

The output shows the data with gender value 'female' only.

Data Aggregation

After creating the DataFrameGroupBy object, you can apply an operation to the grouped data. Data aggregation is a transformation that produces a single scalar value from an array of values. Any function that reduces a group to a single scalar value is an aggregate function.

For example, you can compute a summary statistic (or statistics) for each group using sum(), mean(), min(), max(), size(), count(), etc.

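You can perform aggregation on a particular column. If no column is specified, the computation is applied to every column. For functions such as mean() that are only defined for numbers, pandas 2.0 and later require numeric_only=True to restrict the computation to numeric columns, whereas older versions dropped non-numeric columns silently.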

>>> group['math score'].sum()
gender
female    32962
male      33127
Name: math score, dtype: int64

>>> group.mean(numeric_only=True)
        math score  reading score  writing score
gender                                          
female   63.633205      72.608108      72.467181
male     68.728216      65.473029      63.311203

After an aggregation, the column names may no longer be meaningful. It is often useful to add a prefix that describes the aggregate values. The add_prefix() method adds a prefix to the column names of the aggregated result as follows.

>>> group.mean(numeric_only=True).add_prefix('mean ')
        mean math score  mean reading score  mean writing score
gender                                                         
female        63.633205           72.608108           72.467181
male          68.728216           65.473029           63.311203

Very often, the two phases of grouping and applying the function(s) are performed in a single step as follows:

>>> data.groupby(by=['gender'])['math score'].mean()
gender
female    63.633205
male      68.728216
Name: math score, dtype: float64

`agg()` method

The agg() method of a grouped object, or its alias aggregate(), allows you to compute one or several aggregation functions at once.

>>> group['math score'].agg(['count','std', 'median'])
        count        std  median
gender                          
female    518  15.491453    65.0
male      482  14.356277    69.0

The agg() method also supports user-defined functions.

# using a lambda function
>>> group['reading score'].agg([lambda x: x.median() - x.mean()])
        <lambda>
gender          
female  0.391892
male    0.526971

# using a user-defined function
def diff(x):
    return x.max() - x.min()

>>> group['math score'].aggregate(diff)
gender
female    100
male       73
Name: math score, dtype: int64

Named aggregation

Pandas supports named aggregation, which gives you control over the output column names in the agg() method:

>>> data.groupby(by='lunch')['reading score'].agg(
                               min_score = 'min', 
                               max_score = 'max',
                               )
              min_score  max_score
lunch                             
free/reduced         17        100
standard             26        100
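
Named aggregation also works across several columns of a grouped DataFrame by passing (column, function) tuples as keyword values. The following is a minimal sketch on hypothetical data. The output names `math_mean` and `reading_max` are chosen here purely for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the article's dataset.
toy = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced", "free/reduced"],
    "math score": [72, 90, 47, 58],
    "reading score": [72, 95, 57, 54],
})

# Each keyword names an output column; the tuple is (input column, function).
summary = toy.groupby("lunch").agg(
    math_mean=("math score", "mean"),
    reading_max=("reading score", "max"),
)
print(summary)
```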

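We can apply different functions to different columns of a DataFrame by passing a dictionary to the agg() method. Function names given as strings avoid the deprecation warning that recent pandas versions raise for NumPy callables such as np.sum.

>>> group.agg({'math score': 'sum', 'reading score': 'mean'})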
        math score  reading score
gender                           
female       32962      72.608108
male         33127      65.473029
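
A dictionary value can also be a list of functions. In that case the result gets two-level (MultiIndex) column labels of the form (column, function). The sketch below shows this on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the article's dataset.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math score": [72, 90, 47, 76],
    "reading score": [72, 95, 57, 78],
})

# Two functions for one column, a single function for another.
summary = toy.groupby("gender").agg(
    {"math score": ["min", "max"], "reading score": "mean"}
)
print(summary)

# Columns are addressed with (column, function) tuples.
print(summary[("math score", "max")])
```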

Transformation

A transformation performs some group-specific computation and returns a like-indexed object with the same size as the input data.

Let's standardize data (z-score) within each group using the transform() method.

>>> zscore = lambda x: (x - x.mean())/x.std()

# To standardize the `writing score` column using transform function.
>>> group['writing score'].transform(zscore)
0      0.103256
1      1.046344
2      1.383162
3     -1.368247
4      0.828180
         ...   
995    1.517889
996   -0.588869
997   -0.503015
998    0.305346
999    0.911618
Name: writing score, Length: 1000, dtype: float64

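You can also use the apply() method with the same function, which produces the same values as above. Depending on the pandas version and the group_keys setting, the result of apply() may carry the group key as an extra index level.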
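
Another common transformation, mentioned earlier, is filling missing values with a value derived from each group. Here is a minimal sketch that fills gaps with the group mean. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing writing scores.
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "writing score": [74.0, np.nan, 88.0, 44.0, np.nan],
})

# Replace each missing value with the mean of its own group.
filled = toy.groupby("gender")["writing score"].transform(
    lambda s: s.fillna(s.mean())
)
print(filled)
```

Because transform() returns a like-indexed result, `filled` can be assigned straight back to a column of `toy`.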

Filtration

The filter() method returns a subset of the original DataFrame. Filtration discards some groups according to a group-wise computation that evaluates to True or False.

The argument of filter() must be a function that, applied to the group as a whole, returns True or False. Let's group the data by "parental level of education" and look at the size of each group using the size() method of the DataFrameGroupBy object.

>>> data.groupby("parental level of education").size()
parental level of education
associate's degree    222
bachelor's degree     118
high school           196
master's degree        59
some college          226
some high school      179
dtype: int64

Now, let's keep only the students whose parental level of education is associate's degree. That group is the only one with exactly 222 rows, so we can select it by passing filter() a lambda function that checks the group size.

>>> data.groupby('parental level of education').filter(lambda x: len(x) == 222).head()

    gender race/ethnicity parental level of education         lunch  \
3     male        group A          associate's degree  free/reduced   
5   female        group B          associate's degree      standard   
10    male        group C          associate's degree      standard   
11    male        group D          associate's degree      standard   
19  female        group C          associate's degree  free/reduced   

   test preparation course  math score  reading score  writing score  
3                     none          47             57             44  
5                     none          71             83             78  
10                    none          58             54             52  
11                    none          40             52             43  
19                    none          54             58             61
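
Filtering on a group statistic works the same way. As a sketch on hypothetical data, the example below keeps only the groups whose mean math score exceeds a threshold. The threshold of 65 is arbitrary.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the article's dataset.
toy = pd.DataFrame({
    "parental level of education": [
        "master's degree", "master's degree",
        "high school", "high school", "some college",
    ],
    "math score": [90, 80, 50, 60, 70],
})

# Keep every row of the groups whose mean math score is above 65.
high = toy.groupby("parental level of education").filter(
    lambda g: g["math score"].mean() > 65
)
print(high)
```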

Grouping by multiple categories

1. Multiple columns

So far, we have grouped our data by a single column, but grouping can also be done with multiple columns.

>>> data.groupby(by=['gender', 'test preparation course'])[['math score', 'writing score']].min()

                                math score  writing score
gender test preparation course                           
female completed                        23             36
       none                              0             10
male   completed                        39             38
       none                             27             15

2. Combination of columns and index

A DataFrame may be grouped by a combination of columns and index levels by specifying the column names as strings and the index levels as pd.Grouper objects:

>>> arrays = [['a', 'a', 'a', 'b', 'b'], ['one', 'two', 'one', 'two', 'one']]
>>> index = pd.MultiIndex.from_arrays(arrays=arrays, names=['first', 'second'])
>>> df = pd.DataFrame({'A':[1, 1, 1, 1, 0], 'B':np.arange(5)}, index=index)
>>> df
              A  B
first second      
a     one     1  0
      two     1  1
      one     1  2
b     two     1  3
      one     0  4

# Let's group the dataframe df by the second index level and the A column
>>> df.groupby([pd.Grouper(level=1), 'A']).sum()

          B
second A   
one    0  4
       1  2
two    1  4

Or by directly specifying the index level name as a key to the groupby() method:

>>> df.groupby(['second', 'A']).sum()
          B
second A   
one    0  4
       1  2
two    1  4
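
When grouping by index levels only, the level parameter described earlier can be used instead. The sketch below rebuilds the same small `df` so that it runs on its own, then groups by the first index level.

```python
import numpy as np
import pandas as pd

# Same small MultiIndex frame as in the example above.
arrays = [['a', 'a', 'a', 'b', 'b'], ['one', 'two', 'one', 'two', 'one']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'A': [1, 1, 1, 1, 0], 'B': np.arange(5)}, index=index)

# Group by the index level named 'first' and sum each column.
by_first = df.groupby(level='first').sum()
print(by_first)
```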

GroupBy Sorting

By default, the group keys (the index of the result) are sorted during the groupby() operation. You can pass sort=False to keep the groups in the order in which they first appear in the data.

>>> data.groupby('lunch').count()['gender']
lunch
free/reduced    355
standard        645
Name: gender, dtype: int64

>>> data.groupby('lunch', sort=False).count()['gender']
lunch
standard        645
free/reduced    355
Name: gender, dtype: int64

GroupBy dropna

By default, NA values are excluded from group keys during the groupby operation, because dropna defaults to True. However, sometimes we need to analyze the null values in group keys. To include NA values in group keys, pass dropna=False.

>>> lists = [[1, 2, 3], [None, 2, 3], [2, 1, 4], [1, 2, 3]]
>>> df = pd.DataFrame(lists, columns=['a', 'b', 'c'])
>>> df
     a  b  c
0  1.0  2  3
1  NaN  2  3
2  2.0  1  4
3  1.0  2  3
>>> df.groupby(by=['a']).sum()
     b  c
a        
1.0  4  6
2.0  1  4
>>> df.groupby(by=['a'], dropna=False).sum()
     b  c
a        
1.0  4  6
2.0  1  4
NaN  2  3

Resetting index with as_index

Aggregating after grouping by multiple columns or levels results in a DataFrame with a MultiIndex. Passing as_index=False instead returns the group keys as regular columns, with a default zero-based integer index.

>>> df.groupby(by=['a', 'b']).sum()
       c
a   b   
1.0 2  6
2.0 1  4

>>> df.groupby(by=['a', 'b'], as_index=False).sum()
     a  b  c
0  1.0  2  6
1  2.0  1  4

The same can be achieved using the reset_index() DataFrame method.

>>> df.groupby(by=['a', 'b']).sum().reset_index()
     a  b  c
0  1.0  2  6
1  2.0  1  4

Conclusion

Pandas provides powerful and flexible groupby functionality for both aggregating and transforming data for better analysis and visualization. I hope this article helps you learn about Pandas. I recommend checking out the documentation to learn more about the groupby() method.

Thanks for reading.
