Attributes of Pandas DataFrame.

This is the fourth part of the series Data Analysis with Python Pandas. The DataFrame is one of the most important Pandas data structures used in data analysis. The last post covered creating DataFrames. Now that we know how to create them, let's start retrieving information from them to get the most out of the data.

In this post, we'll look at the DataFrame attributes, and a few related methods, that are most commonly used in data analysis.

Let's start by loading the libraries and creating a DataFrame.

# importing libraries
import pandas as pd
import numpy as np

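# Create a dataframe from a NumPy array of random integers in the range 0 to 19.
# The values are random, so your output will differ from the one shown below.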
arr = np.random.randint(20, size=(4,4))
df = pd.DataFrame(arr, columns=['one', 'two', 'three', 'four'])

print(df)

output

   one  two  three  four
0   10    2      4     6
1    0   15     19     8
2   15    6     19    12
3   10   13      3    18

Retrieving labels

You can get the row labels using the .index attribute and the column labels using the .columns attribute.

>>> df.index
RangeIndex(start=0, stop=4, step=1)

>>> df.columns
Index(['one', 'two', 'three', 'four'], dtype='object')

These attributes return Index objects, which behave like other Python sequences (such as a list), so you can index into them to get individual labels.

>>> df.columns[0]
'one'

# To convert into a list.
>>> df.columns.to_list()
['one', 'two', 'three', 'four']

You can use the built-in list() function or the to_list() method of the Index object to convert the labels into a list.
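As a small sketch of the sequence-like behaviour described above, the example below uses a hypothetical two-row DataFrame with the same column names. It only assumes that list(), len(), slicing, and the in operator work on the object returned by .columns.

```python
import pandas as pd

# Hypothetical sample data with the same column names as the article.
df = pd.DataFrame([[10, 2, 4, 6], [0, 15, 19, 8]],
                  columns=['one', 'two', 'three', 'four'])

# Both conversions produce the same plain Python list.
cols_builtin = list(df.columns)
cols_method = df.columns.to_list()

# The Index object supports len(), slicing, and membership tests.
n_cols = len(df.columns)
first_two = df.columns[:2].to_list()
has_three = 'three' in df.columns

print(cols_builtin, n_cols, first_two, has_three)
```

The same operations work on df.index for the row labels.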

DataTypes

.dtypes

The type of the values stored in a DataFrame column is called its data type, or dtype. You can get the data type of each column using the .dtypes attribute. It is useful for checking whether each column has been assigned the data type you expect.

>>> df.dtypes
one      int64
two      int64
three    int64
four     int64
dtype: object

We can see that our DataFrame has four columns, all holding 64-bit integers.

.dtypes returns a Series whose index holds the column names and whose values are the corresponding data types. The dtype at the bottom is the data type of the returned Series itself; because its values are dtype objects rather than numbers, pandas reports it as object.
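Because .dtypes is an ordinary Series, you can look up one column's dtype by name. The sketch below uses a hypothetical two-column DataFrame and the pd.api.types helper functions to check the result.

```python
import pandas as pd

# Hypothetical sample data: one integer column and one float column.
df = pd.DataFrame({'one': [10, 0], 'two': [2.5, 15.0]})

dtypes = df.dtypes

# Look up a single column's dtype by its column name.
one_dtype = dtypes['one']
two_dtype = dtypes['two']

# Helper functions give a readable way to test the kind of dtype.
one_is_int = pd.api.types.is_integer_dtype(one_dtype)
two_is_float = pd.api.types.is_float_dtype(two_dtype)

# The Series that holds the dtypes is itself of object dtype.
print(one_dtype, two_dtype, dtypes.dtype)
```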

.astype()

You can change the data type explicitly using the astype() method. It returns a new object and leaves the original DataFrame unchanged.

# To change the data type of a single column.
>>> df['one'].astype('float')
0    10.0
1     0.0
2    15.0
3    10.0
Name: one, dtype: float64

# To change the data type of multiple columns.
>>> df.astype({'one': 'float', 'two': 'float'})
    one   two  three  four
0  10.0   2.0      4     6
1   0.0  15.0     19     8
2  15.0   6.0     19    12
3  10.0  13.0      3    18

We can change the data type of multiple columns by passing a dictionary that maps column names to their new data types.
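One detail worth seeing in code is that astype() does not modify the DataFrame in place. The following sketch, using hypothetical sample values, shows the original column keeping its integer dtype until the result is assigned back.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'one': [10, 0, 15, 10], 'two': [2, 15, 6, 13]})

# astype() builds a new DataFrame; the original is left untouched.
converted = df.astype({'one': 'float'})
dtype_before = df['one'].dtype
dtype_converted = converted['one'].dtype

# To keep the change, assign the result back.
df = df.astype({'one': 'float'})
dtype_after = df['one'].dtype

print(dtype_before, dtype_converted, dtype_after)
```

Columns that are not listed in the dictionary, such as 'two' here, keep their existing dtype.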

.select_dtypes()

We can select columns based on their data types using .select_dtypes(). It returns a subset of the DataFrame's columns based on the column data types.

# Let's create a dataframe with different data types.
>>> df = pd.DataFrame({'a': [1, 2] * 2 ,
                       'b': [True, False] * 2, 
                       'c': [1.0, 2.0] * 2})

>>> df
   a      b    c
0  1   True  1.0
1  2  False  2.0
2  1   True  1.0
3  2  False  2.0

# To select columns of a single data type.
>>> df.select_dtypes(include='bool')
       b
0   True
1  False
2   True
3  False

# To select columns of multiple data types.
>>> df.select_dtypes(include=['int', 'bool'])
   a      b
0  1   True
1  2  False
2  1   True
3  2  False

>>> df.select_dtypes(exclude=['bool'])
   a    c
0  1  1.0
1  2  2.0
2  1  1.0
3  2  2.0

.select_dtypes(include=None, exclude=None) takes two parameters, include and exclude. We can pass a single data type or a list of data types to either of them.

When we pass values to include, it returns a DataFrame containing only the columns with those data types. When we pass values to exclude, it returns a DataFrame containing every column except those with the given data types.

In the last code example above, we pass bool to exclude, and it returns a DataFrame with the columns of all other data types but not the boolean column.
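The two parameters can also be combined in one call. The sketch below rebuilds the same three-column DataFrame and uses the 'number' selector, which matches integer and float columns but not booleans, together with an exclude value.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2] * 2,
                   'b': [True, False] * 2,
                   'c': [1.0, 2.0] * 2})

# 'number' selects every numeric column; bool columns are not included.
numeric = df.select_dtypes(include='number')
numeric_cols = numeric.columns.to_list()

# include and exclude together: numeric columns, minus the float64 ones.
ints_only = df.select_dtypes(include='number', exclude='float64')
ints_only_cols = ints_only.columns.to_list()

print(numeric_cols, ints_only_cols)
```

Note that pandas raises an error if the same data type appears in both include and exclude.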

Size

The .ndim, .size, and .shape attributes return the number of dimensions, the number of elements, and the shape of the DataFrame respectively.

.ndim

.ndim returns an integer value representing the number of axes/array dimensions.

# To get the dimensions of dataframe.
>>> df.ndim
2

It returns 2 for Dataframe since it has two axes (rows and columns) and 1 for Series since it has one axis.
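A quick way to see the difference is to compare a DataFrame with one of its columns. In this sketch (hypothetical data), selecting a column with single brackets gives a Series, while double brackets keep it a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

frame_ndim = df.ndim                 # DataFrame: rows and columns
series_ndim = df['a'].ndim           # single brackets return a Series
one_col_frame_ndim = df[['a']].ndim  # double brackets return a DataFrame

print(frame_ndim, series_ndim, one_col_frame_ndim)
```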

.shape

.shape returns a tuple containing the number of rows and the number of columns of the DataFrame.

# To get the shape of dataframe
>>> df.shape
(4, 3)

.size

.size returns an int representing the total number of elements/values in the DataFrame.

# To get the total count of data values in dataframe
>>> df.size
12
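The three size attributes are related: .size is the product of the two numbers in .shape. A short sketch on the same three-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2] * 2,
                   'b': [True, False] * 2,
                   'c': [1.0, 2.0] * 2})

# Unpack the shape tuple into row and column counts.
n_rows, n_cols = df.shape

# size counts every cell, so it equals rows * columns.
total = df.size

# len() on a DataFrame gives the number of rows.
row_count = len(df)

print(n_rows, n_cols, total, row_count)
```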

Data

We can extract the data from a DataFrame without the axis labels using the .values attribute or the .to_numpy() method. The pandas documentation recommends .to_numpy() over .values.

.values

It returns the values of the DataFrame as a NumPy ndarray.

# To get all the values of dataframe as numpy array.
>>> df.values
array([[1, True, 1.0],
       [2, False, 2.0],
       [1, True, 1.0],
       [2, False, 2.0]], dtype=object)

.to_numpy()

It converts the DataFrame into a NumPy ndarray. By default, the dtype of the returned array is the common NumPy dtype of all columns in the DataFrame. Here the columns mix integers, booleans, and floats, so the result falls back to object.

# To convert dataframe into numpy array.
# returns all values without labels.
>>> df.to_numpy()
array([[1, True, 1.0],
       [2, False, 2.0],
       [1, True, 1.0],
       [2, False, 2.0]], dtype=object)
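To illustrate how the resulting dtype depends on the columns involved, the sketch below compares the mixed DataFrame with a numeric-only subset, and also passes the dtype parameter of to_numpy() to request a float array explicitly. The two-row data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data with int, bool, and float columns.
df = pd.DataFrame({'a': [1, 2], 'b': [True, False], 'c': [1.0, 2.0]})

# Mixed int/bool/float columns have no common numeric dtype.
mixed = df.to_numpy()

# Int and float columns alone share a float dtype.
numeric_only = df[['a', 'c']].to_numpy()

# Requesting a dtype converts every value, so True becomes 1.0.
as_float = df.to_numpy(dtype=float)

print(mixed.dtype, numeric_only.dtype, as_float.dtype)
```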

.info()

.info() is one of the most useful DataFrame methods in data analysis. It prints a concise summary that includes the index dtype, the columns with their dtypes and non-null counts, and the memory usage of the DataFrame.

# to get the summary of data.
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       4 non-null      int64  
 1   b       4 non-null      bool   
 2   c       4 non-null      float64
dtypes: bool(1), float64(1), int64(1)
memory usage: 196.0 bytes

By default, .info() prints the full per-column summary, as long as the DataFrame has no more columns than the display.max_info_columns option allows. Setting the verbose parameter to False prints only the column count and the dtype summary, without the per-column information.

# To print a summary of columns count and its dtypes 
# but not per column information.

>>> df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Columns: 3 entries, a to c
dtypes: bool(1), float64(1), int64(1)
memory usage: 196.0 bytes

We can see that with verbose set to False we get only information about the index, the column count, the data types of the columns, and the memory usage of the DataFrame. This is very useful when we need a quick summary of our data.
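Since .info() prints its summary and returns None, capturing the text requires the buf parameter. The following sketch writes the short summary into an in-memory buffer so that it can be stored or logged; the variable names are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({'a': [1, 2] * 2,
                   'b': [True, False] * 2,
                   'c': [1.0, 2.0] * 2})

# Write the summary into a text buffer instead of printing it.
buffer = io.StringIO()
result = df.info(verbose=False, buf=buffer)

# The summary text is now available as an ordinary string.
summary = buffer.getvalue()

print(summary)
```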

In this article, we looked at the most useful attributes of the DataFrame, along with a few methods that can be used in their place. These attributes come up often in descriptive analysis.

That's all for today's article. Thank you for reading; I hope it helps you. :)

See you in my next article.

Did you find this article valuable?

Support Madhuri Patil by becoming a sponsor. Any amount is appreciated!