Pandas Crash Course: Transform Your Data Skills with Simple Slicing and Dicing


What’s the difference between a panda and the Pandas library?

One is a lovable bear that munches on bamboo, while the other is a powerful Python library that processes datasets with ease.

Pandas is a powerful Python package that helps you manipulate and analyze data with relatively simple syntax. It’s like a hot knife cutting through butter: super satisfying, easy, and gets the job done in a pinch.

With Pandas you can:

  • Clean data
  • Transform data
  • Read and write common file types (e.g., Excel, CSV, and more) into DataFrames
  • Visualize data through plots, graphs, and charts
  • Perform statistical analysis
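As a quick taste of a few of these, here’s a minimal sketch (the dataset and numbers are made up for illustration):

import pandas as pd

# A tiny dataset with a missing value
df = pd.DataFrame({'product': ['pen', 'pad', 'pen'], 'price': [1.5, None, 2.0]})

# Clean: drop rows with missing values
clean = df.dropna()

# Transform: add a new column derived from an existing one
clean = clean.assign(price_with_tax=clean['price'] * 1.08)

# Analyze: quick summary statistics
print(clean.describe())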

The name “pandas” is a combination of “panel data,” a term from econometrics referring to multi-dimensional data sets, and a play on “Python data analysis.”

Under the hood, Pandas is built on top of another package called NumPy for fast and memory-efficient computation of numerical data. In other words, a lot of the heavy lifting is done for you so you can focus on using that hot knife!

We’re lucky to have the Python open-source community, where countless developers have generously dedicated their time to creating powerful tools like Pandas. Thanks to their efforts, we can use this incredible package for free, with much of the complexity abstracted away, making it easier for us to focus on solving problems.

Pandas is a fundamental tool for data scientists, but really, anyone who works with data should understand how to leverage it effectively.

Let’s dive into the basics of this powerful package and explore the variety of tools it offers. Whether you’re cleaning data, analyzing trends, or just organizing information, Pandas is an essential skill to have in your toolkit!

Fundamentals of Pandas


Note: To effectively learn the fundamentals of Pandas, it’s essential to have a strong foundation in Python. Without a solid understanding of Python basics, mastering this package may be challenging.

I highly encourage you to check out my other blog on how to learn Python effectively.

Now, to start off, there are two fundamental objects, or workhorses, you’ll use in Pandas: DataFrames and Series.

DataFrame

A DataFrame is a 2D representation of data. Think of columns and rows like you would see in Excel.

Creating a DataFrame from scratch is very easy. You can create a dictionary where the keys act as column names and the values are lists holding each column’s data:

import pandas as pd

# Creating a DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

print(df)

Output:

   A  B
0  1  4
1  2  5
2  3  6

With a DataFrame you have access to all kinds of attributes and functions that allow you to manipulate the data. Let’s check out some common usages:

Sorting

Use the sort_values function to sort a DataFrame by one column or several. You can pass a single column name as a string, or a list of column names to sort by multiple columns (there’s a multi-column sketch after the output below).

import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

# Sorting the DataFrame by 'Age' in ascending order
sorted_df = df.sort_values(by='Age')
print(sorted_df)

Output:

      Name  Age         City
0    Alice   25     New York
3    David   28      Houston
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
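To sort by multiple columns, pass lists to by and ascending. A minimal sketch using the same data (each city appears only once here, so the secondary key is just to show the syntax):

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# Sort by 'City' alphabetically, then by 'Age' in descending order within each city
multi_sorted = df.sort_values(by=['City', 'Age'], ascending=[True, False])
print(multi_sorted)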

Filtering

Use the loc accessor to filter a DataFrame. The loc accessor takes a row condition as its first argument and a column selection (such as a list of column names) as its second argument.

df.loc[<row condition>, <column selection>]

Typically, you’d pass a condition to filter rows as the first argument and a list of column names as the second.

But if you want all columns, you can omit the second argument entirely and just pass your row condition (see the sketch after the example below).

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

# Selecting rows where Age is at least 28 and picking the 'Name' and 'City' columns
result = df.loc[df['Age'] >= 28, ['Name', 'City']]
print(result)

Output:

      Name         City
1      Bob  Los Angeles
2  Charlie      Chicago
3    David      Houston

Since Alice is the only one younger than 28, her row gets filtered out, and we’re left with the three rows where the age is at least 28, handpicking just the Name and City columns.
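And here’s a sketch of the row-only version, continuing with the same df. Both forms below return every column:

# Keep all columns for rows where Age is at least 28
result = df.loc[df['Age'] >= 28]
print(result)

# The bracket shorthand (boolean masking) gives the same result
print(df[df['Age'] >= 28])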

Joining Multiple DataFrames

Use the merge function to perform all kinds of joins between multiple DataFrames. If you’re familiar with SQL, you’ll be able to perform joins like inner, left outer, cross, etc.

The left parameter takes your base DataFrame, and the right parameter takes the second DataFrame you’re joining to it.

The on parameter takes the column name (or list of column names) to join the DataFrames on.

Lastly, the how parameter takes the type of join as a string.

import pandas as pd

# DataFrame 1
df1 = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "City": ["New York", "Los Angeles", "Chicago", "Houston"],
    }
)

# DataFrame 2
df2 = pd.DataFrame(
    {
        "ID": [2, 3, 4, 5],
        "Age": [27, 22, 32, 29],
        "City": ["Los Angeles", "Chicago", "Houston", "Phoenix"],
    }
)

# Perform an inner join using the 'ID' and 'City' columns as keys
# If columns differ in both dataframes you can use `left_on=["key1", "key2"], right_on=["key3", "key4"]`
result = pd.merge(left=df1, right=df2, on=["ID", "City"], how="inner")
print(result)

Output:

   ID     Name         City  Age
0   2      Bob  Los Angeles   27
1   3  Charlie      Chicago   22
2   4    David      Houston   32
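To see how the join type changes things, here’s a sketch of a left outer join using the same df1 and df2. Every row from df1 is kept, and Alice’s Age becomes NaN since she has no match in df2:

# Left outer join: unmatched rows from the left DataFrame get NaN for the right's columns
left_result = pd.merge(left=df1, right=df2, on=["ID", "City"], how="left")
print(left_result)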

Aggregations

To perform group-bys and use aggregate functions, you can use the groupby method followed by the agg method.

In groupby you specify the list of column names you want to group the data by. In agg, each keyword argument becomes a new column name, and its value is a tuple of the source column and the aggregation to perform (sum, mean, etc.).

import pandas as pd

# Sample DataFrame
data = {
    "A": ["foo", "foo", "bar", "bar", "foo"],
    "B": ["one", "two", "one", "two", "one"],
    "C": [10, 20, 30, 40, 50],
    "D": [1, 2, 3, 4, 5],
}

df = pd.DataFrame(data)

# Groupby with aggregation and custom column names
# The variable name on the left will be a new column name
# To the right of the equal sign is a tuple. The first item is the name of the column and the second item is the aggregation performed
result = df.groupby(["A", "B"], as_index=False).agg(
    C_sum=('C', 'sum'),
    C_mean=('C', 'mean'),
    C_count=('C', 'count'),
    D_sum=('D', 'sum')
)

# Display the result
print(result)

Output:

     A    B  C_sum  C_mean  C_count  D_sum
0  bar  one     30    30.0        1      3
1  bar  two     40    40.0        1      4
2  foo  one     60    30.0        2      6
3  foo  two     20    20.0        1      2

Reading and Writing Excel Files

In order to read Excel files into DataFrames we need to use another package called openpyxl.

If you’re running your Python code in a hosted notebook environment like Google Colab, you likely won’t need to worry about this, since common packages like openpyxl usually come pre-installed.

But if you’re developing locally on your computer, you can easily install the necessary packages:

pip install openpyxl pandas

Then use the read_excel method from Pandas to read Excel files into DataFrames! What’s also great is this method comes with parameters that let you customize how the data is read in.

import pandas as pd

# Define the path to your Excel file
file_path = 'path_to_your_file.xlsx'

# Read the Excel file into a DataFrame
# Specify the sheet name, columns to read, and handle missing values
df = pd.read_excel(
    file_path,          # Path to the Excel file
    sheet_name='Sheet1', # Name of the sheet to read
    usecols=['A', 'C', 'E'], # Columns to read (can also use column letters or indices)
    na_values=['NA', 'N/A', 'Missing'], # Additional strings to recognize as NA/NaN
    engine='openpyxl'   # Engine to use for reading Excel files
)

# Display the first few rows of the DataFrame
print(df.head())

After you’ve sliced and diced your DataFrame and you want to export to an Excel file, it’s as easy as using the to_excel method.

import pandas as pd

# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)

# Define the path to save the Excel file
file_path = r'C:\Users\YourUsername\Documents\output.xlsx'

# Write the DataFrame to the specified Excel file
df.to_excel(file_path, index=False, sheet_name='People')

There are other methods, like to_csv, that let you export DataFrames to other formats. Check out the documentation.
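For instance, exporting to and reading back from CSV follows the same pattern, continuing with the df from the example above (the file name here is just a placeholder):

# Write the DataFrame to a CSV file, dropping the index column
df.to_csv('output.csv', index=False)

# Read it back into a DataFrame
df_from_csv = pd.read_csv('output.csv')
print(df_from_csv.head())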

Series

A Series is a labeled one-dimensional array. In tabular data (i.e., rows and columns), you can think of a Series as a single column.

A column is essentially a list of values, and an array is the same concept: a list of values labeled with the name of that column.

For example, when you select a column from a DataFrame (here, the first df we created), it’s actually a Series:

column_a = df['A']
print(column_a)

Output:

0    1
1    2
2    3
Name: A, dtype: int64
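You can also create a Series directly with pd.Series, optionally giving it a name and custom index labels:

import pandas as pd

# A labeled one-dimensional array with a name and custom index labels
s = pd.Series([1, 2, 3], name='A', index=['x', 'y', 'z'])
print(s)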

Don’t worry if this doesn’t click right now. You’ll mostly be working with DataFrames, but it’s good to understand Series since they’re the building blocks of DataFrames.

Limitations of Pandas


While Pandas is great for handling large amounts of data, it does have its limitations.

Memory Consumption: Pandas loads entire datasets into memory, which can lead to high RAM usage and potential performance issues with large datasets. Once your dataset grows to a few gigabytes or more, you will probably notice slower performance.
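If you’re curious how much RAM a DataFrame is actually using, one way to check is the memory_usage method. A minimal sketch with a toy DataFrame:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Per-column memory usage in bytes; deep=True also measures string contents
print(df.memory_usage(deep=True))

# info() can include the same total in its summary
df.info(memory_usage='deep')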

Single-threaded Processing: Most Pandas operations run on a single CPU core, limiting its ability to efficiently process large volumes of data.

Scalability Constraints: Pandas is designed for single-machine use and lacks built-in support for distributed computing, making it less suitable for processing massive datasets across multiple machines.

For handling larger datasets, consider tools like PySpark or other comparable frameworks, which are designed for distributed computing and can efficiently process Big Data.

Slicing and dicing Big Data is a separate topic, but keep these limitations in mind as the volume of data you work with grows.

Wrapping Up: Next Steps with Pandas


To truly grasp the material we’ve covered, I highly encourage you to practice the attributes and functions we’ve explored in Pandas.

I firmly believe that the best way to learn is by doing.

You can read, watch, and absorb as much content as you want, but until you actually apply those concepts in code, those thoughts will remain abstract in your mind and may fade away.

Tutorial hell is real, and to break free from that cycle, you need to build up your courage and embrace failure.

I get it—it’s tough to face complexity and the fear of getting stuck. But when you push through those barriers and dive in, you’ll start building confidence, knowing that by breaking down challenges into smaller, manageable pieces, you can tackle anything that comes your way.

By writing code and seeing the results firsthand, you’ll solidify what you’ve learned. Trust me, practicing this way will help you retain more information.

And don’t hesitate to search for help when you need it. It’s completely normal to run into bugs or errors—that’s all part of the learning process.

The beauty of Pandas is its popularity, which means it’s well-documented. You’ll almost always find answers to the questions or challenges you face while working with this library.

So, give it a shot! Try taking a task you normally do in Excel, load it into a Pandas DataFrame, and apply the same transformations you’d use to analyze or generate the output you need.

I hope this helps open your eyes to the possibilities for automating tasks and working more seamlessly with data.
