How to Create a Data Frame in Pandas.
DataFrame is one of the most important data structures in Pandas when it comes to Data Analysis. Let's learn how to create a DataFrame.
Hello Learners,
Welcome to Data Analysis in Python Pandas. The DataFrame is one of the most important data structures in Pandas, and most of the time we work with DataFrames while doing data analysis. So, let's learn how to create a DataFrame in Pandas.
In the last article, we learned how to create a Series in Python Pandas. If you have not read that yet, you can read it here, since this article is a continuation of it.
So, let's get started.
What are Dataframes?
You can think of a DataFrame as a table in SQL; the two are very similar. A DataFrame is a 2-dimensional tabular data structure that stores data in rows and columns, with labeled axes. It can hold heterogeneous data, meaning each column can have a different data type.
If you remember, a Series has an index array containing the labels associated with each element. Similarly, a DataFrame has two labeled axes, one for rows and one for columns.
rows (index) | column 1 | column 2
0 | 1 | 1
1 | 2 | 2
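To make the two labeled axes concrete, here is a minimal sketch that builds the small table above and reads its row labels and column labels back. The variable names are only illustrative.

```python
import pandas as pd

# A small frame matching the table above.
df = pd.DataFrame({"column 1": [1, 2], "column 2": [1, 2]})

row_labels = list(df.index)       # labels of the row axis
column_labels = list(df.columns)  # labels of the column axis

print(row_labels)
print(column_labels)
print(df.shape)  # (number of rows, number of columns)
```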
How to create a DataFrame?
In data analysis, a Pandas DataFrame is usually created by loading data from a SQL database, a CSV file, or an Excel file. But we can also create a DataFrame from scratch using a list, a dictionary, a list of dictionaries, or even Series. There are various ways to create a DataFrame.
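As a quick illustration of the file-loading route, the sketch below reads CSV text with pd.read_csv. To keep it self-contained, it uses an in-memory string instead of a real file; the sample names and ages are made up for this example.

```python
import io

import pandas as pd

# Made-up CSV content; in practice you would pass a file path instead.
csv_text = "name,age\nAsha,31\nBen,27\n"

# read_csv accepts any file-like object, so StringIO stands in for a file.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```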
The syntax for creating a DataFrame is:
pandas.DataFrame(data, index, columns, dtype)
Here:
data: the input data. It can be
- a list, a list of lists, or a list of dictionaries, etc.
- a dictionary of lists, ndarrays, Series, or dictionaries.
index: you can optionally pass an index array (row labels) to the DataFrame. If it is not passed, the default index (0, 1, ..., len(data)-1) is assigned as the row labels.
columns: you can optionally pass an array of column labels to the DataFrame. If it is not passed, default labels (0, 1, ..., n-1), where n is the number of columns, are assigned. When data is a dictionary, its keys become the column labels instead.
dtype: the data type to force on the data. Only a single dtype is allowed, and it applies to the whole DataFrame rather than to particular columns. If it is not passed, the types are inferred from the data.
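The four parameters described above can be seen together in one call. This is only a small sketch; the labels r1, r2, c1, and c2 are placeholder names chosen for the example.

```python
import pandas as pd

df_params = pd.DataFrame(
    data=[[1, 2], [3, 4]],   # a list of lists: each inner list is one row
    index=["r1", "r2"],      # row labels
    columns=["c1", "c2"],    # column labels
    dtype=float,             # one dtype applied to the whole frame
)

print(df_params)
print(df_params.dtypes)  # every column is float64
```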
Now that we have an idea of what a DataFrame is, let's look at how we can create one using different data inputs.
Dictionary as Dataframe
The most common way to create a DataFrame is to pass a dictionary as the data input to the pandas DataFrame() constructor.
The dictionary contains a key for each column that you want to define, with a list of values for each of them.
When we use a dictionary of lists as data, all lists must be of the same length. If an index is passed, it must also be the same length as the lists. If no index is passed, the index will be range(n), where n is the length of the lists.
Let's start by importing the NumPy and Pandas libraries.
import numpy as np
import pandas as pd
# Creating dataframe using dictionary of lists.
d = {'one':[1, 2, 3, 4], 'two':[2.0, 3.2, 4.5, 5.5]}
df = pd.DataFrame(d)
df
Output
one two
0 1 2.0
1 2 3.2
2 3 4.5
3 4 5.5
Did you notice the index? We did not specify an index while creating the DataFrame, so the default index with values 0, 1, 2, and 3 is assigned, since each list in the dictionary has length 4. The keys of the dictionary become the column names and the values become the data.
But if you want to assign labels to the index of a DataFrame, you have to use the index option.
# creating DataFrame with the specified index.
d = {'one':[1, 2, 3, 4], 'two':[2.0, 3.2, 4.5, 5.5]}
index = ['a', 'b', 'c', 'd']
df = pd.DataFrame(d, index=index)
df
Output
one two
a 1 2.0
b 2 3.2
c 3 4.5
d 4 5.5
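The equal-length rule for a dictionary of lists can be checked directly. In this sketch, d_bad is a deliberately broken dictionary made up for the example; pandas rejects it with a ValueError.

```python
import pandas as pd

# 'one' has 4 values but 'two' has only 3, so the lists do not line up.
d_bad = {"one": [1, 2, 3, 4], "two": [2.0, 3.2, 4.5]}

try:
    pd.DataFrame(d_bad)
    error_raised = False
except ValueError as exc:
    error_raised = True
    print("ValueError:", exc)
```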
If you want to use only particular columns from the dictionary when creating the DataFrame, pass a sequence of column names to the columns option of the DataFrame constructor.
Let's select only column one from the dictionary we defined above.
# Pass index again to keep the row labels; without it the default 0-3 index is used.
df = pd.DataFrame(d, index=index, columns=['one'])
df
Output
one
a 1
b 2
c 3
d 4
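The columns option also controls the column order, and a name that is not a key in the dictionary produces a column filled with NaN. The sketch below shows both effects; the column name three is an assumption added for illustration and does not exist in the dictionary.

```python
import pandas as pd

d = {"one": [1, 2, 3, 4], "two": [2.0, 3.2, 4.5, 5.5]}

# Reorder the columns and ask for 'three', which is not a key in d.
df_cols = pd.DataFrame(d, columns=["two", "one", "three"])
print(df_cols)
```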
ndarray as Dataframe
Now that we know how to use the index and columns parameters of the DataFrame constructor, we can easily define a DataFrame from other inputs.
Instead of using a dictionary, we can pass three arguments to the constructor: data with the ndarray values, an array of row labels assigned to the index option, and an array of column names assigned to the columns option.
# Create DataFrame using ndarray.
arr = np.arange(16).reshape((4, 4))
index = ['a', 'b', 'c', 'd']
columns = ['one', 'two', 'three', 'four']
df = pd.DataFrame(data = arr, index=index, columns = columns)
df
Output
one two three four
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
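If the index and columns arguments are left out, an ndarray gets the default integer labels on both axes. A short sketch with a smaller array:

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape((2, 3))

# No index or columns passed: both axes fall back to 0, 1, 2, ...
df_default = pd.DataFrame(arr)
print(df_default)
```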
List of lists as Dataframe
Similarly, we can create a DataFrame using a list or a list of lists.
# Dataframe from a simple list.
df = pd.DataFrame([1, 2, 3, 4], index=['a', 'b', 'c', 'd'], columns=['one'])
print(df)
Output
one
a 1
b 2
c 3
d 4
# Create DataFrame using lists of list.
arr = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
index = ['a', 'b', 'c']
columns = ['one', 'two', 'three']
df = pd.DataFrame(data = arr, index=index, columns=columns)
print(df)
Output
one two three
a 1 2 3
b 4 5 6
c 7 8 9
As you can see in the above code example, each inner list becomes a row: the first list is the first row, the second list is the second row, and so on.
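A list of dictionaries, mentioned earlier as another possible input, works in a similar row-by-row way: each dictionary is one row and its keys are matched to columns. In this sketch the records are made up, and the second one has an extra key to show how a missing value is filled.

```python
import pandas as pd

# Each dictionary is one row; keys become column labels.
records = [
    {"one": 1, "two": 2},
    {"one": 3, "two": 4, "three": 5},
]

df_records = pd.DataFrame(records, index=["a", "b"])
print(df_records)  # row 'a' has NaN under 'three'
```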
Series as Dataframe
We can create a Dataframe from Series. The resulting index will be the union of the indexes of the various Series.
# Series
a = pd.Series([1, 2, 3], index = ['a', 'b', 'c'], dtype=float)
b = pd.Series([11, 22, 33, 44], index = ['a', 'b', 'c', 'd'])
# DataFrame using series `a` and `b`.
df = pd.DataFrame({'first':a, 'second':b})
df
Output
first second
a 1.0 11
b 2.0 22
c 3.0 33
d NaN 44
Here, the index of the new DataFrame is the union of the index labels of all the Series. Since the d label is not present in Series a, that position is filled with NaN, which marks missing data.
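Once a DataFrame contains NaN values, you will usually want to find them. As a small sketch of one way to do that, the code below rebuilds the same DataFrame, counts the missing values per column with isna(), and keeps only the complete rows with dropna().

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["a", "b", "c"], dtype=float)
b = pd.Series([11, 22, 33, 44], index=["a", "b", "c", "d"])
df = pd.DataFrame({"first": a, "second": b})

missing_per_column = df.isna().sum()  # number of NaN values in each column
complete_rows = df.dropna()           # rows without any NaN

print(missing_per_column)
print(complete_rows)
```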
There are various ways to create a Dataframe in Pandas, and we learned some of the most common ways to create them.
Thank you for reading. See you in my next article.