Split Excel Data into Multiple Files by Column Values using Python (fast & easy)
Introduction
In this tutorial, I will show you how to separate Excel data into multiple files using Python and the pandas library. My sample workbook contains sales records for different segments, countries, and products. The goal is to create a separate workbook for each country, which includes only the data specific to that country.
Manually, you would need to filter the column, copy the data, paste it into a new workbook, and save it. This process can be tedious, especially with large datasets. Python simplifies this task significantly. However, real-world data isn’t always neat, so I will also cover how to handle messy data towards the end of this tutorial.
Getting Started
First, I have an empty output folder ready in my workbooks directory. I’ll start with a blank Python file. The first step is to import the required libraries:
import pandas as pd
import os
If you haven’t installed pandas yet, you can do so by using the command:
pip install pandas
You might also need the openpyxl
library, which is an optional dependency for pandas. Install it using:
pip install openpyxl
Creating the DataFrame
Next, create a pandas DataFrame by loading your Excel file:
df = pd.read_excel('your_file.xlsx')
Make sure your Python script is in the same directory as your workbook to avoid specifying the full path. If it’s not, you’ll need to provide the exact path to the file.
Set a variable called column_name to the column you want to split:
column_name = 'Country'
Now, create a list of unique values from that column:
countries = df[column_name].unique()
Iterating Over Each Country
With the list of countries, I can iterate through each one using a for-loop.
This code checks if each country is contained within the specified column and creates a separate DataFrame for each country, exporting it to an Excel file. The output path is defined as the output folder.
Running the Script
To execute the script, navigate to the script directory in your command prompt and run the Python file. You should now see the separated Excel files in the output folder.
Handling Messy Data
The real-world data can be messy. For instance, country names might have inconsistencies like different cases or extra spaces. Let’s modify our script to handle such scenarios:
df['Country'] = df['Country'].str.strip().str.title()
This line removes any leading or trailing whitespace and converts the country names to title case, ensuring consistency.
Dealing with Special Characters
In some cases, country names may include special characters that can cause issues when creating filenames. For this, I will create a list of characters to remove:
symbols_to_remove = ['/', '\\', '?', '*', ':', '<', '>', '|']
Next, I will apply the replace method to clean the country names:
Conclusion
In this tutorial, I demonstrated how to split Excel data into multiple files using Python. We started with a clean dataset and moved on to handling messier data with inconsistencies and special characters. This approach can save a lot of time and effort, especially when dealing with large datasets. If you have any questions, feel free to ask in the comments!