What is Pandas AI: The Generative AI Python Library
One of the most important libraries data professionals use day to day is Pandas. Pandas is an open-source Python package for data analysis and manipulation. It offers capabilities such as creating tables, working with spreadsheets or relational databases, and structuring or cleaning data.
What is Pandas AI
Pandas AI is a new data science library that makes data frames conversational. What does this mean? According to the documentation, “Pandas AI is a Python library that enhances Pandas with generative AI capabilities. It is intended to complement, not replace, the popular data analysis and manipulation tool. Using pandasai, users can summarize pandas data frames data by interacting like humans.”
Pandas AI takes data analysis to the next level because you can converse with your data instead of going back and forth with it. It does not replace Pandas; it is a complementary tool that works hand in hand with it. Data professionals spend a significant amount of time cleaning data in preparation for analysis, and they keep looking for approaches that reduce the time spent on data preparation. Pandas AI gives them one more way to do so.
Pandas AI is built on OpenAI. It uses ChatGPT’s power to generate and run Python code.
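To make the "generate and run Python code" idea concrete, here is a minimal sketch of the pattern. It is not how Pandas AI is actually implemented: fake_llm is a hypothetical stand-in that returns a hard-coded snippet instead of calling OpenAI, and run_generated_code is an illustrative helper name.

```python
import pandas as pd


def fake_llm(question: str) -> str:
    """Hypothetical stand-in for a real LLM call: returns Python source as text."""
    # A real model would write this code from the question and the column names.
    return "result = df.loc[df['states'] == 'Abia', 'gdp'].iloc[0]"


def run_generated_code(df: pd.DataFrame, question: str):
    """Ask the (fake) model for code, execute it against df, return `result`."""
    code = fake_llm(question)
    namespace = {"df": df}
    # Generated code should be treated as untrusted; exec is used here only
    # because the snippet is hard-coded above.
    exec(code, namespace)
    return namespace["result"]


df = pd.DataFrame({"states": ["Abia", "Adamawa"], "gdp": [65000000000, 78000000000]})
answer = run_generated_code(df, "What is the gdp of Abia?")
print(answer)
```

The point of the sketch is the division of labor: the model only produces text, and the library is what runs that text against your data frame.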
How to use Pandas AI
First, we have to install Pandas AI. I will be using Google Colab for this project.
pip install pandasai
Importing Libraries
After installing Pandas AI, you need to go to OpenAI and create an API key to continue. The API key lets us connect to OpenAI and use this technology.
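import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
# Paste your own OpenAI API key here (placeholder value shown)
api = "YOUR_OPENAI_API_KEY"
llm = OpenAI(api_token=api)
# Wrap the LLM in a PandasAI object; the examples below call it as pandas_ai
pandas_ai = PandasAI(llm)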
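The api variable passed to OpenAI(api_token=api) has to hold your key. One way to supply it without pasting it into the notebook is to read it from an environment variable. This is a minimal sketch, assuming the key is stored under the name OPENAI_API_KEY; load_api_key is a hypothetical helper, not part of pandasai.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in env_var, or raise a clear error if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage (uncomment once the variable is set):
# api = load_api_key()
```

Failing early with a readable message is usually easier to debug than an authentication error raised later by the first query.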
Creating our data frame
For this project, I created a dummy dataset of 15 Nigerian states, with a GDP figure and a happiness index for each.
data = {
    'states': [
        'Abia', 'Adamawa', 'Akwa Ibom', 'Anambra', 'Bauchi',
        'Bayelsa', 'Benue', 'Borno', 'Cross River', 'Delta',
        'Ebonyi', 'Edo', 'Ekiti', 'Enugu', 'Gombe'
    ],
    'gdp': [
        65000000000, 78000000000, 920000000000, 570000000000, 41000000000,
        320000000000, 46000000000, 39000000000, 62000000000, 860000000000,
        40000000000, 500000000000, 42000000000, 57000000000, 39000000000
    ],
    'happiness_index': [
        5.8, 6.2, 7.5, 8.1, 4.6,
        7.3, 5.1, 4.4, 6.9, 8.5,
        4.8, 7.9, 5.0, 6.7, 4.3
    ]
}
We have created the dummy data; now it is time to convert it to a data frame.
df = pd.DataFrame(data)
df.head(5)
Displays
states gdp happiness_index
0 Abia 65000000000 5.8
1 Adamawa 78000000000 6.2
2 Akwa Ibom 920000000000 7.5
3 Anambra 570000000000 8.1
4 Bauchi 41000000000 4.6
Example 1: Finding the GDP of Abia
res = pandas_ai(df, "What is the gdp of Abia?")
print(res)
Displays
65000000000
Example 2: Using info() to learn more about the data
res = pandas_ai(df, "Show the info of data in tabular form")
print(res)
Displays
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 states 15 non-null object
1 gdp 15 non-null int64
2 happiness_index 15 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 488.0+ bytes
None
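The trailing None in the output above is most likely the return value of info() being printed: in plain pandas, DataFrame.info() writes its report to a stream and returns None. A small sketch of that behavior, using two rows of the dataset so the snippet runs on its own:

```python
import io

import pandas as pd

df = pd.DataFrame({
    "states": ["Abia", "Adamawa"],
    "gdp": [65000000000, 78000000000],
    "happiness_index": [5.8, 6.2],
})

buffer = io.StringIO()
returned = df.info(buf=buffer)  # the report goes to buffer, not to the return value
report = buffer.getvalue()

print(report)
print(returned)
```

If you want the report as text, capturing it through buf as above avoids the stray None.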
Example 3: Finding duplicate rows in the data
res = pandas_ai(df, "Are there any duplicate rows?")
print(res)
Displays
There are no duplicate rows.
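Because the answer comes from a language model, it is worth cross-checking against plain pandas. The sketch below rebuilds the dataset and counts duplicates with DataFrame.duplicated(); the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "states": ["Abia", "Adamawa", "Akwa Ibom", "Anambra", "Bauchi",
               "Bayelsa", "Benue", "Borno", "Cross River", "Delta",
               "Ebonyi", "Edo", "Ekiti", "Enugu", "Gombe"],
    "gdp": [65000000000, 78000000000, 920000000000, 570000000000, 41000000000,
            320000000000, 46000000000, 39000000000, 62000000000, 860000000000,
            40000000000, 500000000000, 42000000000, 57000000000, 39000000000],
    "happiness_index": [5.8, 6.2, 7.5, 8.1, 4.6, 7.3, 5.1, 4.4,
                        6.9, 8.5, 4.8, 7.9, 5.0, 6.7, 4.3],
})

# A row counts as a duplicate only if every column matches an earlier row
duplicate_rows = int(df.duplicated().sum())

# Restricting the check to one column is a different question
repeated_gdp = int(df.duplicated(subset=["gdp"]).sum())

print(duplicate_rows, repeated_gdp)
```

Borno and Gombe share the same GDP value, so the column-level check finds one repeat even though no full row is duplicated.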
Example 4: Printing all column names
res = pandas_ai(df, "List all the column names")
print(res)
Displays
['states', 'gdp', 'happiness_index']
Example 5: Finding the happiest state
res = pandas_ai(df, "What is the happiest state")
print(res)
Displays
Delta
So, according to our data, Delta State is the happiest state.
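The same answer can be checked directly with idxmax(), which returns the index label of the largest value. Only the two columns needed for this question are reproduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "states": ["Abia", "Adamawa", "Akwa Ibom", "Anambra", "Bauchi",
               "Bayelsa", "Benue", "Borno", "Cross River", "Delta",
               "Ebonyi", "Edo", "Ekiti", "Enugu", "Gombe"],
    "happiness_index": [5.8, 6.2, 7.5, 8.1, 4.6, 7.3, 5.1, 4.4,
                        6.9, 8.5, 4.8, 7.9, 5.0, 6.7, 4.3],
})

# idxmax() gives the row label of the highest happiness_index
happiest_row = df["happiness_index"].idxmax()
happiest_state = df.loc[happiest_row, "states"]
print(happiest_state)
```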
Example 6: Filtering rows with a where-style query
res = pandas_ai(df, "Show the data in the row where 'states'='Edo'")
print(res)
Displays
states gdp happiness_index
11 Edo 500000000000 7.9
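In plain pandas, this kind of filter is a boolean mask on the states column, which is where the state names live. A short sketch for comparison:

```python
import pandas as pd

df = pd.DataFrame({
    "states": ["Abia", "Adamawa", "Akwa Ibom", "Anambra", "Bauchi",
               "Bayelsa", "Benue", "Borno", "Cross River", "Delta",
               "Ebonyi", "Edo", "Ekiti", "Enugu", "Gombe"],
    "gdp": [65000000000, 78000000000, 920000000000, 570000000000, 41000000000,
            320000000000, 46000000000, 39000000000, 62000000000, 860000000000,
            40000000000, 500000000000, 42000000000, 57000000000, 39000000000],
    "happiness_index": [5.8, 6.2, 7.5, 8.1, 4.6, 7.3, 5.1, 4.4,
                        6.9, 8.5, 4.8, 7.9, 5.0, 6.7, 4.3],
})

# Keep only the rows whose states value equals "Edo"
edo_row = df[df["states"] == "Edo"]
print(edo_row)
```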
Next, we will use visualization to show patterns in our data.
Example 7: Pandas AI can also do data visualization. We will plot the GDP of the five happiest states in the dataset.
pandas_ai.run(
df,
"Plot a bar chart of the gdp of the 5 happiest states, using a different color for each bar",
)
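For comparison, a similar chart can be drawn by hand with pandas and matplotlib. This is a sketch of one possible version, not the code Pandas AI generates; the color list, the labels, and the output filename are my own choices.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook such as Colab
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "states": ["Abia", "Adamawa", "Akwa Ibom", "Anambra", "Bauchi",
               "Bayelsa", "Benue", "Borno", "Cross River", "Delta",
               "Ebonyi", "Edo", "Ekiti", "Enugu", "Gombe"],
    "gdp": [65000000000, 78000000000, 920000000000, 570000000000, 41000000000,
            320000000000, 46000000000, 39000000000, 62000000000, 860000000000,
            40000000000, 500000000000, 42000000000, 57000000000, 39000000000],
    "happiness_index": [5.8, 6.2, 7.5, 8.1, 4.6, 7.3, 5.1, 4.4,
                        6.9, 8.5, 4.8, 7.9, 5.0, 6.7, 4.3],
})

# Pick the five rows with the highest happiness_index, highest first
top5 = df.nlargest(5, "happiness_index")

colors = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple"]
fig, ax = plt.subplots()
ax.bar(top5["states"], top5["gdp"], color=colors)  # one color per bar
ax.set_xlabel("State")
ax.set_ylabel("GDP")
ax.set_title("GDP of the five happiest states")
fig.savefig("top5_happiest_gdp.png")
```

Writing the selection step yourself makes it explicit that the ranking is by happiness_index while the bar heights show gdp.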
Example 8: Plotting the five least happy states
pandas_ai.run(
df,
"Plot a bar chart of the gdp of the 5 least happy states, using a different color for each bar",
)
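To check which states the chart should contain, the selection can be reproduced with nsmallest(). This sketch covers only the data selection, not the plotting.

```python
import pandas as pd

df = pd.DataFrame({
    "states": ["Abia", "Adamawa", "Akwa Ibom", "Anambra", "Bauchi",
               "Bayelsa", "Benue", "Borno", "Cross River", "Delta",
               "Ebonyi", "Edo", "Ekiti", "Enugu", "Gombe"],
    "gdp": [65000000000, 78000000000, 920000000000, 570000000000, 41000000000,
            320000000000, 46000000000, 39000000000, 62000000000, 860000000000,
            40000000000, 500000000000, 42000000000, 57000000000, 39000000000],
    "happiness_index": [5.8, 6.2, 7.5, 8.1, 4.6, 7.3, 5.1, 4.4,
                        6.9, 8.5, 4.8, 7.9, 5.0, 6.7, 4.3],
})

# Five rows with the lowest happiness_index, lowest first
bottom5 = df.nsmallest(5, "happiness_index")
print(bottom5[["states", "gdp", "happiness_index"]])
```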
Conclusion
I will conclude by saying that Pandas AI is still in its early stages and, for now, it cannot replace the Pandas library. But we should note that it is a very powerful library and can do more than the examples I have shown here. Its addition enhances Pandas' functionality and further increases the efficiency and simplicity of handling data in Python.