This is the fourth part of the series Data Analysis with Python Pandas. The DataFrame is one of the most important data structures in pandas. The last post covered creating DataFrames. Now that we know how to create them, let's start retrieving information from them to get the most out of the data.
In this post, we'll look at the DataFrame attributes that are most commonly used in data analysis.
Let's start by loading the libraries and creating a DataFrame.
# importing libraries
import pandas as pd
import numpy as np
# Create dataframe using numpy array.
arr = np.random.randint(20, size=(4,4))
df = pd.DataFrame(arr, columns=['one', 'two', 'three', 'four'])
print(df)
output
one two three four
0 10 2 4 6
1 0 15 19 8
2 15 6 19 12
3 10 13 3 18
Retrieving labels
You can get the row labels using the .index attribute and the column labels using the .columns attribute.
>>> df.index
RangeIndex(start=0, stop=4, step=1)
>>> df.columns
Index(['one', 'two', 'three', 'four'], dtype='object')
These attributes return Index objects, which can be used like any other Python sequence (such as a list) to get individual labels.
>>> df.columns[0]
'one'
# To convert into list.
>>> df.columns.to_list()
['one', 'two', 'three', 'four']
You can use the built-in list() function or the to_list() method of the Index object to convert the labels into a list.
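As a small sketch of treating the labels as sequences, the example below builds its own two-row frame (the row labels 'r1' and 'r2' are made up for illustration) and uses the usual sequence operations on .index and .columns.

```python
import pandas as pd

# Hypothetical frame with custom row labels, used only for illustration.
df = pd.DataFrame({"one": [10, 0], "two": [2, 15]}, index=["r1", "r2"])

row_labels = df.index.to_list()   # Index.to_list() gives a plain Python list
col_labels = list(df.columns)     # the built-in list() works as well
has_two = "two" in df.columns     # membership test on the column labels
n_rows = len(df.index)            # number of row labels

print(row_labels, col_labels, has_two, n_rows)
```

Membership tests like the one above are a quick way to check that a column exists before you access it.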
DataTypes
.dtypes
The type of the values in a column is called its data type, or dtype. You can get the data type of each column in a DataFrame using the .dtypes attribute. It is useful for checking whether each column was given the data type you expect.
>>> df.dtypes
one int64
two int64
three int64
four int64
dtype: object
We can see that our DataFrame has four columns, all holding integer values.
It returns a Series in which the index holds the column names and the values are the corresponding data types. The dtype at the bottom is the data type of the returned Series itself. Its values are dtype objects rather than numbers, so the Series is of type object.
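To make the structure of the .dtypes result concrete, here is a minimal sketch on a separate three-column frame. It looks up the dtype of one column by name and checks it with pandas.api.types.is_integer_dtype, which is one of several ways to test a column's type.

```python
import numpy as np
import pandas as pd

# Small frame with three different column types, used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [True, False], "c": [1.0, 2.0]})

dtypes = df.dtypes              # a Series indexed by column name
series_dtype = dtypes.dtype     # dtype of the Series itself: object
a_dtype = dtypes["a"]           # look up one column's dtype by name

# Check a column's type without comparing strings.
a_is_integer = pd.api.types.is_integer_dtype(df["a"])
c_is_float = pd.api.types.is_float_dtype(df["c"])

print(series_dtype, a_dtype, a_is_integer, c_is_float)
```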
.astype()
You can change the data type explicitly using the astype() method. It returns a new object with the converted values and leaves the original DataFrame unchanged.
# To change datatype of single column.
>>> df['one'].astype('float')
0 10.0
1 0.0
2 15.0
3 10.0
Name: one, dtype: float64
# To change data type of multiple columns.
>>> df.astype({'one': 'float', 'two': 'float'})
one two three four
0 10.0 2.0 4 6
1 0.0 15.0 19 8
2 15.0 6.0 19 12
3 10.0 13.0 3 18
We can change the data type of several columns at once by passing a dictionary that maps column names to their target data types. Columns that are not in the dictionary keep their original type.
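Because astype() returns a new object, the conversion only sticks if you assign the result. The sketch below uses a fresh two-column frame with the same values as the example columns to show the difference.

```python
import pandas as pd

# Two integer columns, used only for illustration.
df = pd.DataFrame({"one": [10, 0, 15, 10], "two": [2, 15, 6, 13]})

# astype() returns a converted copy; df itself is not modified.
converted = df.astype({"one": "float"})
original_kind = df["one"].dtype.kind          # 'i' (still integer)
converted_kind = converted["one"].dtype.kind  # 'f' (float)

# To keep the change, assign the result back.
df["two"] = df["two"].astype("float")

print(df.dtypes)
```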
.select_dtypes()
We can select columns based on their data types using .select_dtypes(). It returns a subset of the DataFrame's columns based on the column data types.
# Let's create dataframe with different datatypes.
>>> df = pd.DataFrame({'a': [1, 2] * 2 ,
'b': [True, False] * 2,
'c': [1.0, 2.0] * 2})
>>> df
a b c
0 1 True 1.0
1 2 False 2.0
2 1 True 1.0
3 2 False 2.0
# To select columns of a single data type
>>> df.select_dtypes(include='bool')
b
0 True
1 False
2 True
3 False
# To select columns of multiple data types.
>>> df.select_dtypes(include=['int', 'bool'])
a b
0 1 True
1 2 False
2 1 True
3 2 False
>>> df.select_dtypes(exclude=['bool'])
a c
0 1 1.0
1 2 2.0
2 1 1.0
3 2 2.0
The signature is .select_dtypes(include=None, exclude=None). It takes two parameters, include and exclude, and each accepts a single data type or a list of data types. At least one of them must be supplied.
When we pass values to include, it returns a DataFrame containing only the columns of those data types, whereas when we pass values to exclude, it returns a DataFrame without the columns of those data types.
In the last example above, we pass bool to exclude, and it returns a DataFrame with the columns of every other data type but no boolean columns.
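A common variation is to select every numeric column at once with the 'number' alias, and to combine include and exclude in one call. This is a minimal sketch on the same three-column frame. Note that boolean columns are not treated as numeric here.

```python
import pandas as pd

# Same mixed-type frame as in the examples above.
df = pd.DataFrame({"a": [1, 2] * 2,
                   "b": [True, False] * 2,
                   "c": [1.0, 2.0] * 2})

# 'number' matches all numeric dtypes (integers and floats), but not bool.
numeric_cols = df.select_dtypes(include="number").columns.to_list()

# include and exclude can be combined: numeric columns except floats.
non_float_cols = df.select_dtypes(include="number",
                                  exclude="float").columns.to_list()

print(numeric_cols, non_float_cols)
```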
Size
The .ndim, .size, and .shape attributes return the number of dimensions, the size, and the shape of the DataFrame respectively.
.ndim
.ndim returns an integer representing the number of axes (array dimensions).
# To get the dimensions of dataframe.
>>> df.ndim
2
It returns 2 for Dataframe since it has two axes (rows and columns) and 1 for Series since it has one axis.
.shape
Returns a tuple of the form (number of rows, number of columns) for the DataFrame.
# To get the shape of dataframe
>>> df.shape
(4, 3)
.size
Returns an int representing the total number of elements in the DataFrame, which is the number of rows multiplied by the number of columns.
# To get the total count of data values in dataframe
>>> df.size
12
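The three attributes are related: for a DataFrame, .size equals the product of the two entries of .shape. The short sketch below checks this on the same frame and also shows what a single column (a Series) reports.

```python
import pandas as pd

# Same mixed-type frame as in the examples above.
df = pd.DataFrame({"a": [1, 2] * 2,
                   "b": [True, False] * 2,
                   "c": [1.0, 2.0] * 2})

n_rows, n_cols = df.shape          # unpack the (rows, columns) tuple
total = df.size                    # rows * columns

col = df["a"]                      # a single column is a Series
col_ndim = col.ndim                # one axis
col_shape = col.shape              # one-element tuple

print(n_rows, n_cols, total, col_ndim, col_shape, len(df))
```

Note that len(df) gives only the number of rows, not the total number of elements.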
Data
We can extract the data from a DataFrame without the axis labels using the .values attribute or the .to_numpy() method.
.values
It returns the values of the DataFrame as a NumPy ndarray. The pandas documentation recommends using .to_numpy() instead.
# To get all the values of dataframe as numpy array.
>>> df.values
array([[1, True, 1.0],
[2, False, 2.0],
[1, True, 1.0],
[2, False, 2.0]], dtype=object)
.to_numpy()
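It converts the DataFrame into a NumPy ndarray. By default, the dtype of the returned array is the common NumPy dtype of all the column types in the DataFrame. Our DataFrame mixes integer, boolean, and float columns, so the common dtype here is object.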
# To convert dataframe into numpy array.
# returns all values without labels.
>>> df.to_numpy()
array([[1, True, 1.0],
[2, False, 2.0],
[1, True, 1.0],
[2, False, 2.0]], dtype=object)
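An object array is usually not what you want for numeric work. One way around it, sketched below, is to select the numeric columns first so that the result has a numeric dtype. The dtype parameter of to_numpy() can also request a specific type.

```python
import numpy as np
import pandas as pd

# Same mixed-type frame as in the examples above.
df = pd.DataFrame({"a": [1, 2] * 2,
                   "b": [True, False] * 2,
                   "c": [1.0, 2.0] * 2})

mixed = df.to_numpy()                      # int + bool + float columns -> object

# Keep only the numeric columns; int64 and float64 share float64.
numeric = df.select_dtypes(include="number").to_numpy()

# Ask for a specific dtype explicitly.
as_float32 = df[["a", "c"]].to_numpy(dtype="float32")

print(mixed.dtype, numeric.dtype, as_float32.dtype)
```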
.info()
This is one of the most useful methods of Dataframe in Data Analysis. It prints a summary, which includes information about index dtype and columns, non-null values, and memory usage of the Dataframe.
# to get the summary of data.
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 4 non-null int64
1 b 4 non-null bool
2 c 4 non-null float64
dtypes: bool(1), float64(1), int64(1)
memory usage: 196.0 bytes
By default, .info() prints summary information for every column. To print only the column count and the dtypes, without the per-column details, set the verbose parameter to False.
# To print a summary of columns count and its dtypes
# but not per column information.
>>> df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Columns: 3 entries, a to c
dtypes: bool(1), float64(1), int64(1)
memory usage: 196.0 bytes
We can see that with verbose set to False, we get only the index range, the column count, the column data types, and the memory usage of the DataFrame. This is very useful when we need a quick summary of our data.
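Since .info() prints its summary and returns None, you cannot store the result directly. A minimal sketch of one workaround is to pass a text buffer through the buf parameter. The memory_usage() method gives the per-column byte counts behind the last line of the summary.

```python
import io

import pandas as pd

# Same mixed-type frame as in the examples above.
df = pd.DataFrame({"a": [1, 2] * 2,
                   "b": [True, False] * 2,
                   "c": [1.0, 2.0] * 2})

# Capture the summary as a string instead of printing it.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()

# Bytes used by each column (index excluded).
per_column = df.memory_usage(index=False)

print(summary)
print(per_column)
```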
In this article, we learned about the most useful attributes of the DataFrame, along with some methods that can be used in their place. These attributes are often used in descriptive analysis.
That's all for today's article. Thank you for reading, and I hope it helps you. :)
See you in my next article.