In data analysis, we often need to analyze data by a category or a set of categories. In SQL, the GROUP BY statement separates data into groups that can be aggregated independently of one another. For example:
SELECT column_1, column_2, SUM(column_3)
FROM table_1
GROUP BY column_1, column_2;
In Pandas, SQL's GROUP BY operation is performed using the similarly named groupby() method. The groupby() method splits data into groups so that we can compute operations on each group for better analysis. We separate our data into groups and apply a function that converts or transforms the data in some way depending on the group.
In this article, you'll learn the groupby process (split-apply-combine) and how to use the Pandas groupby() method to group data and perform operations. The examples use the Student Performance in Exams dataset, which is available on Kaggle.
The groupby process: split-apply-combine
By 'group by' we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
Of these, the split step is the most straightforward: it divides the data into groups. In the next step, we apply a function to each group. The function can be any one of the following.
Aggregation: compute a summary statistic (or statistics) for each group. For example:
- Compute group sums or means.
- Compute group sizes/counts.
Transformation: perform some group-specific computation and return a like-indexed object. For example:
- Standardize data (z-score) within a group.
- Fill NAs within groups with a value derived from each group.
Filtration: discard some groups according to a group-wise computation that evaluates to True or False. For example:
- Discard data that belongs to groups with only a few members.
- Filter out data based on the group sum or mean.
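The three kinds of apply step can be sketched on a tiny frame. The toy DataFrame and its column names (team, points) are made up for illustration and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three apply steps.
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [1, 3, 2, 4, 6],
})
grouped = toy.groupby("team")

# Aggregation: one value per group.
totals = grouped["points"].sum()

# Transformation: result has the same length and index as the input.
centered = grouped["points"].transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups with more than two members.
large = grouped.filter(lambda g: len(g) > 2)

print(totals)
print(centered)
print(large)
```

Aggregation shrinks the data to one row per group, transformation keeps every row, and filtration keeps or drops whole groups.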
DataFrame.groupby() Method
The syntax of the Pandas groupby() method is shown below with only its most frequently used parameters.
Syntax:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, dropna=True)
Parameters:
- by: mapping, function, label, or list of labels
  - It can be columns or index labels.
- axis: {0 or 'index', 1 or 'columns'}, default = 0
  - Split along rows (0) or columns (1). This parameter is deprecated since pandas 2.1.
- level: int, level name, or sequence of such, default None
  - If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
- as_index: bool, default=True
  - For aggregated output, return an object with the group labels as the index.
- sort: bool, default=True
  - Sort the group keys.
- dropna: bool, default=True
  - If True, rows/columns whose group keys contain NA values are dropped. If False, NA values are also treated as a key in groups.
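As a rough sketch of the less common forms of by and of the level parameter, the example below groups a made-up frame with a function, a mapping, and a MultiIndex level. All names here (toy, mapping, outer, inner) are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame with string index labels.
toy = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["ax", "ay", "bx", "by"])

# by as a function: it is called once per index label.
by_func = toy.groupby(lambda label: label[0])["value"].sum()

# by as a mapping: index label -> group name.
mapping = {"ax": "x", "ay": "y", "bx": "x", "by": "y"}
by_map = toy.groupby(mapping)["value"].sum()

# level: group by one named level of a MultiIndex.
multi = toy.copy()
multi.index = pd.MultiIndex.from_tuples(
    [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")],
    names=["outer", "inner"],
)
by_level = multi.groupby(level="inner")["value"].sum()

print(by_func)
print(by_map)
print(by_level)
```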
Before we look at the groupby() method, let's load the data and take a look at its first five rows.
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv')
data.head()
   gender  race/ethnicity  parental level of education         lunch  \
0  female         group B            bachelor's degree      standard
1  female         group C                 some college      standard
2  female         group B              master's degree      standard
3    male         group A           associate's degree  free/reduced
4    male         group C                 some college      standard

   test preparation course  math score  reading score  writing score
0                     none          72             72             74
1                completed          69             90             88
2                     none          90             95             93
3                     none          47             57             44
4                     none          76             78             75
What is Pandas groupby()?
The Pandas groupby() method splits data into groups by some categories so that each group can be analyzed. These categories can simply be a mapping of one or more labels (index or columns). In our example, let's first use the single gender column.
>>> group = data.groupby(by='gender')
In the above code, we create a groupby object that divides our data by the distinct gender values. By calling the type() function on the result, we can see that it is a DataFrameGroupBy object.
>>> type(group)
pandas.core.groupby.generic.DataFrameGroupBy
The groupby() method returns a DataFrameGroupBy object that contains information about the groups. To analyze how the dataframe is divided into groups of rows, we can use attributes and methods of the DataFrameGroupBy object.
- ngroups: The ngroups attribute returns the number of groups.
>>> group.ngroups
2
- groups: The groups attribute returns a dictionary-like object that maps each group key to the index labels of the rows in that group.
>>> group.groups
{'female': [0, 1, 2, 5, 6, 9, 12, 14, 15, 17, 19, 21, 23, 27, 29, 30, 31, 32, 36, 37, 38, 41, 42, 44, 46, 47, 48, 54, 55, 56, 59, 63, 64, 67, 69, 70, 72, 78, 79, 80, 85, 86, 87, 88, 89, 90, 94, 97, 98, 99, 102, 105, 106, 108, 109, 110, 113, 114, 116, 117, 118, 119, 120, 122, 125, 129, 133, 138, 140, 141, 142, 145, 148, 152, 155, 156, 158, 161, 164, 165, 167, 168, 169, 172, 173, 174, 175, 176, 177, 178, 179, 181, 182, 183, 189, 190, 192, 194, 198, 199, ...],
'male': [3, 4, 7, 8, 10, 11, 13, 16, 18, 20, 22, 24, 25, 26, 28, 33, 34, 35, 39, 40, 43, 45, 49, 50, 51, 52, 53, 57, 58, 60, 61, 62, 65, 66, 68, 71, 73, 74, 75, 76, 77, 81, 82, 83, 84, 91, 92, 93, 95, 96, 100, 101, 103, 104, 107, 111, 112, 115, 121, 123, 124, 126, 127, 128, 130, 131, 132, 134, 135, 136, 137, 139, 143, 144, 146, 147, 149, 150, 151, 153, 154, 157, 159, 160, 162, 163, 166, 170, 171, 180, 184, 185, 186, 187, 188, 191, 193, 195, 196, 197, ...]}
In the above output, each group is listed together with the rows of the dataframe assigned to it.
- size(): The size() method returns the size of each group as a Series whose index holds the group names and whose values are the group sizes. It gives the same counts as calling value_counts() on the grouping column, although value_counts() orders the result by count rather than by group key.
>>> group.size()
gender
female 518
male 482
dtype: int64
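To make the relationship between ngroups, groups, size() and value_counts() concrete, here is a minimal sketch on a made-up four-row frame (the data is hypothetical, not the article's dataset).

```python
import pandas as pd

# Hypothetical toy data with an uneven split between the two keys.
toy = pd.DataFrame({"gender": ["male", "female", "male", "male"]})
grouped = toy.groupby("gender")

n_groups = grouped.ngroups                      # number of distinct keys
female_rows = list(grouped.groups["female"])    # index labels in one group

# size() is ordered by group key; value_counts() by descending count.
sizes = grouped.size()
counts = toy["gender"].value_counts()

print(n_groups, female_rows)
print(sizes)
print(counts)
```

The two results hold the same numbers, so they can be compared after aligning on the index, for example with sort_index().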
Selecting a group
You can use the get_group() method to select a particular group. Since get_group() returns a DataFrame (or a Series, if a single column was selected first), you can use the attributes and functions of DataFrame or Series on the result.
>>> group.get_group('female').head()
   gender  race/ethnicity  parental level of education     lunch  \
0  female         group B            bachelor's degree  standard
1  female         group C                 some college  standard
2  female         group B              master's degree  standard
5  female         group B           associate's degree  standard
6  female         group B                 some college  standard

   test preparation course  math score  reading score  writing score
0                     none          72             72             74
1                completed          69             90             88
2                     none          90             95             93
5                     none          71             83             78
6                completed          88             95             92
The output shows the data with gender value 'female' only.
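Besides picking one group with get_group(), a groupby object can be iterated over, which yields each key together with its sub-DataFrame. The sketch below uses a made-up three-row frame; the score column is an illustrative assumption.

```python
import pandas as pd

# Hypothetical toy data, not the article's dataset.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "score": [72, 47, 90],
})
grouped = toy.groupby("gender")

# Select a single group as an ordinary DataFrame.
females = grouped.get_group("female")

# Iterating yields (key, sub-DataFrame) pairs, one per group.
means = {key: frame["score"].mean() for key, frame in grouped}

print(females)
print(means)
```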
Data Aggregation
After creating the DataFrameGroupBy object, you can apply operations to the grouped data. Data aggregation is a transformation that produces a single scalar value from an array of values. Any function that reduces an array to a single scalar value is an aggregate function.
For example: computing a summary statistic (or statistics) for each group using sum(), mean(), min(), max(), size(), count(), etc.
You can perform aggregation on a particular column. If no column is specified, the computation is applied to every column; for functions such as mean() that are only defined for numbers, pass numeric_only=True (needed since pandas 2.0 when the data has non-numeric columns) to restrict it to the numeric columns.
>>> group['math score'].sum()
gender
female 32962
male 33127
Name: math score, dtype: int64
>>> group.mean(numeric_only=True)
math score reading score writing score
gender
female 63.633205 72.608108 72.467181
male 68.728216 65.473029 63.311203
After an aggregation, the column names may no longer be meaningful. It is often useful to add a prefix to the column names that describes the aggregated values. The add_prefix() function adds a prefix to the column names of the grouped result as follows.
>>> group.mean(numeric_only=True).add_prefix('mean ')
mean math score mean reading score mean writing score
gender
female 63.633205 72.608108 72.467181
male 68.728216 65.473029 63.311203
Very often, the two phases of grouping and applying the function(s) are performed in a single step as follows:
>>> data.groupby(by=['gender'])['math score'].mean()
gender
female 63.633205
male 68.728216
Name: math score, dtype: float64
agg() method
The agg() method, or the equivalent aggregate() method, of a grouped object allows you to compute one or several aggregation functions at once.
>>> group['math score'].agg(['count','std', 'median'])
count std median
gender
female 518 15.491453 65.0
male 482 14.356277 69.0
The agg() method also supports user-defined functions.
# using a lambda function
>>> group['reading score'].agg([lambda x: x.median() - x.mean()])
<lambda>
gender
female 0.391892
male 0.526971
# using a user-defined function
>>> def diff(x):
...     return x.max() - x.min()
...
>>> group['math score'].aggregate(diff)
gender
female 100
male 73
Name: math score, dtype: int64
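Built-in function names and user-defined functions can also be mixed in one agg() call. The following is a small sketch on made-up data; score_range is a hypothetical helper, and its name becomes the output column label.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [1, 3, 2, 4, 6],
})

def score_range(s):
    """Spread between the largest and smallest value in a group."""
    return s.max() - s.min()

# Strings name built-in aggregations; the function is applied per group.
summary = toy.groupby("team")["points"].agg(["min", "max", score_range])
print(summary)
```

Using a named function rather than a lambda gives the result column a readable label instead of <lambda>.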
Named aggregation
Pandas supports named aggregation, which gives control over the output column names in the agg() method:
>>> data.groupby(by='lunch')['reading score'].agg(
...     min_score='min',
...     max_score='max',
... )
min_score max_score
lunch
free/reduced 17 100
standard 26 100
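Named aggregation also works across several columns when agg() is called on the whole grouped DataFrame: each keyword is the output name and each value is a (column, function) pair. This is a minimal sketch on made-up rows that borrow the article's column names.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the article's dataset.
toy = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced"],
    "math score": [70, 80, 50],
    "reading score": [60, 90, 55],
})

# keyword = output column name, value = (input column, aggregation)
summary = toy.groupby("lunch").agg(
    math_mean=("math score", "mean"),
    reading_max=("reading score", "max"),
)
print(summary)
```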
We can apply different functions to different columns of a DataFrame by passing a dictionary to the agg() method.
>>> group.agg({'math score': 'sum', 'reading score': 'mean'})
math score reading score
gender
female 32962 72.608108
male 33127 65.473029
Transformation
A transformation performs a group-specific computation and returns a like-indexed object that is the same size as the input data.
Let's standardize data (z-score) within each group using the transform() method.
>>> zscore = lambda x: (x - x.mean())/x.std()
# To standardize the `writing score` column using transform function.
>>> group['writing score'].transform(zscore)
0 0.103256
1 1.046344
2 1.383162
3 -1.368247
4 0.828180
...
995 1.517889
996 -0.588869
997 -0.503015
998 0.305346
999 0.911618
Name: writing score, Length: 1000, dtype: float64
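You can also use the apply() method for this. In pandas 2.0 and later, create the groupby object with group_keys=False so that the result keeps the original index, as the transform() output above does; otherwise the group key is added as an extra index level.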
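Another common transformation is filling missing values with a statistic of the row's own group. The sketch below does this with transform() and, for comparison, with apply(); the frame, the column names g and x, and the helper fill_with_group_mean are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each group.
toy = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b"],
    "x": [1.0, np.nan, 3.0, 10.0, np.nan],
})

def fill_with_group_mean(s):
    """Replace NaN with the mean of the non-missing values in the group."""
    return s.fillna(s.mean())

# transform() returns a Series aligned with the original index.
filled = toy.groupby("g")["x"].transform(fill_with_group_mean)

# apply() with group_keys=False avoids adding the group key to the index.
filled_apply = toy.groupby("g", group_keys=False)["x"].apply(fill_with_group_mean)

print(filled)
print(filled_apply)
```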
Filtration
The filter() method returns a subset of the original DataFrame. Filtration discards some groups according to a group-wise computation that evaluates to True or False.
The argument of filter() must be a function that, applied to the group as a whole, returns True or False. Let's group the data by "parental level of education" and look at the size of each group using the size() method of the DataFrameGroupBy object.
>>> data.groupby("parental level of education").size()
parental level of education
associate's degree 222
bachelor's degree 118
high school 196
master's degree 59
some college 226
some high school 179
dtype: int64
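Now, let's keep only the groups that have exactly 222 rows. In this dataset only the associate's degree group has that size, so the result contains all students whose parental level of education is associate's degree. For that, we use the filter() method with a lambda function.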
>>> data.groupby('parental level of education').filter(lambda x: len(x) == 222).head()
    gender  race/ethnicity  parental level of education         lunch  \
3     male         group A           associate's degree  free/reduced
5   female         group B           associate's degree      standard
10    male         group C           associate's degree      standard
11    male         group D           associate's degree      standard
19  female         group C           associate's degree  free/reduced

    test preparation course  math score  reading score  writing score
3                      none          47             57             44
5                      none          71             83             78
10                     none          58             54             52
11                     none          40             52             43
19                     none          54             58             61
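Filtering on an exact group size is fragile, so a threshold or a group statistic is usually a more natural predicate. Here is a small sketch of both on made-up data; the level and score columns are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: three groups of sizes 3, 2 and 1.
toy = pd.DataFrame({
    "level": ["hs", "hs", "hs", "ms", "ms", "col"],
    "score": [50, 60, 70, 80, 90, 40],
})
grouped = toy.groupby("level")

# Keep rows from groups with at least two members.
big = grouped.filter(lambda g: len(g) >= 2)

# Keep rows from groups whose mean score is above 65.
high = grouped.filter(lambda g: g["score"].mean() > 65)

print(big)
print(high)
```

In both cases the result keeps the original rows and index labels of the surviving groups; nothing is aggregated.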
Grouping by multiple categories
1. Multiple columns
So far, we have grouped our data by a single column, but grouping can also be done with multiple columns.
>>> data.groupby(by=['gender', 'test preparation course'])[['math score', 'writing score']].min()
math score writing score
gender test preparation course
female completed 23 36
none 0 10
male completed 39 38
none 27 15
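Grouping by several columns produces a result indexed by a MultiIndex, so a single combination of keys can be looked up with a tuple. This is a minimal sketch on made-up rows; prep and math are shortened, hypothetical column names.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "prep": ["none", "completed", "none", "none"],
    "math": [60, 80, 50, 70],
})

# One row per (gender, prep) combination that occurs in the data.
lowest = toy.groupby(["gender", "prep"])["math"].min()

# Look up one combination with a tuple of keys.
male_none = lowest.loc[("male", "none")]

print(lowest)
print(male_none)
```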
2. Combination of columns and index
A DataFrame may be grouped by a combination of columns and index levels by specifying the column names as strings and the index levels as pd.Grouper objects.
>>> arrays = [['a', 'a', 'a', 'b', 'b'], ['one', 'two', 'one', 'two', 'one']]
>>> index = pd.MultiIndex.from_arrays(arrays=arrays, names=['first', 'second'])
>>> df = pd.DataFrame({'A':[1, 1, 1, 1, 0], 'B':np.arange(5)}, index=index)
>>> df
A B
first second
a one 1 0
two 1 1
one 1 2
b two 1 3
one 0 4
# Group df by the second index level and the A column
>>> df.groupby([pd.Grouper(level=1), 'A']).sum()
B
second A
one 0 4
1 2
two 1 4
Alternatively, pass the index level names directly as keys to the groupby() method.
>>> df.groupby(['second', 'A']).sum()
B
second A
one 0 4
1 2
two 1 4
GroupBy Sorting
By default, the group keys (the index of the result) are sorted during the groupby() operation. You can pass sort=False to keep the groups in order of first appearance instead.
>>> data.groupby('lunch').count()['gender']
lunch
free/reduced 355
standard 645
Name: gender, dtype: int64
>>> data.groupby('lunch', sort=False).count()['gender']
lunch
standard 645
free/reduced 355
Name: gender, dtype: int64
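The effect of sort can be checked on a tiny made-up frame in which the alphabetically later key appears first (the data is hypothetical).

```python
import pandas as pd

# Hypothetical toy data: 'standard' appears before 'free/reduced'.
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard"],
    "score": [1, 2, 3],
})

# Default: group keys are sorted.
sorted_keys = toy.groupby("lunch")["score"].sum().index.tolist()

# sort=False: group keys keep their order of first appearance.
first_seen = toy.groupby("lunch", sort=False)["score"].sum().index.tolist()

print(sorted_keys)
print(first_seen)
```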
GroupBy dropna
By default, NA values are excluded from group keys during the groupby operation. However, sometimes we need to analyze the NA values in group keys. To include them, pass dropna=False; the default setting of dropna is True.
>>> lists = [[1, 2, 3], [None, 2, 3], [2, 1, 4], [1, 2, 3]]
>>> df = pd.DataFrame(lists, columns=['a', 'b', 'c'])
>>> df
a b c
0 1.0 2 3
1 NaN 2 3
2 2.0 1 4
3 1.0 2 3
>>> df.groupby(by=['a']).sum()
b c
a
1.0 4 6
2.0 1 4
>>> df.groupby(by=['a'], dropna=False).sum()
b c
a
1.0 4 6
2.0 1 4
NaN 2 3
Resetting index with as_index
Aggregating after grouping by multiple columns or levels returns a DataFrame with a MultiIndex. Passing as_index=False instead returns the group keys as ordinary columns, with a default zero-based integer index.
>>> df.groupby(by=['a', 'b']).sum()
c
a b
1.0 2 6
2.0 1 4
>>> df.groupby(by=['a', 'b'], as_index=False).sum()
a b c
0 1.0 2 6
1 2.0 1 4
The same can be achieved using the reset_index() DataFrame method.
>>> df.groupby(by=['a', 'b']).sum().reset_index()
a b c
0 1.0 2 6
1 2.0 1 4
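As a quick check that the two approaches agree, the sketch below compares them on a made-up integer frame (hypothetical data with the same column names a, b, c).

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"a": [1, 1, 2], "b": [2, 2, 1], "c": [3, 3, 4]})

# Group keys returned as ordinary columns.
flat = toy.groupby(["a", "b"], as_index=False).sum()

# Group keys moved out of the MultiIndex after aggregating.
reset = toy.groupby(["a", "b"]).sum().reset_index()

print(flat)
print(reset)
```

as_index=False avoids building the MultiIndex in the first place, while reset_index() is handy when the grouped result already exists.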
Conclusion
Pandas provides powerful and flexible groupby functionality for both aggregating and transforming data for better analysis and visualization. I hope this article helps you learn about Pandas groupby(). I recommend checking out the official documentation to learn more about the groupby() method.
Thanks for reading.