Turn Your Excel Worksheet Into An Exploratory Data Analysis Report In Just 3 Lines Of Python Code

Category: Data Analysis,Python + Excel

Introduction

In this tutorial, I will showcase how you can perform exploratory data analysis (EDA) using the Python library Pandas Profiling. This tool allows you to generate comprehensive reports from your datasets in just three lines of code. Imagine transforming your Excel workbook into an interactive HTML report that gives you insights into missing values, distributions, correlations, and descriptive statistics like mean and standard deviation.

Interactive exploratory data analysis report

Getting Started

When you receive a new dataset, such as the housing prices in Melbourne from Kaggle, one of the first steps is to explore the data. You might want to check for missing values, understand how the values are distributed, and see if there are correlations between different variables. Typically, users apply filters or create pivot tables and charts to analyze the data. However, with Pandas Profiling, you can generate a complete summary of your data effortlessly.

Melbourne housing dataset

Python Code

To get started with Pandas Profiling, you first need to ensure you have Python installed on your machine. If not, there are many tutorials available to guide you through the installation process.

Once Python is set up, create a new Python file and follow these steps:

Import the necessary libraries:

import pandas as pd
from pandas_profiling import ProfileReport

Install the required libraries via command prompt:

pip install pandas
pip install xlrd
pip install pandas-profiling

Load your dataset into a Pandas DataFrame:

df = pd.read_csv('melb_data.csv')

Generate the profiling report:

profile = ProfileReport(df, title="EDA House Prices")
profile.to_file('EDA_Houseprices_Analysis.html')

After executing this code, you’ll have a new HTML file with a detailed EDA report.

Generating Pandas Profiling report

Upon opening the generated report, you’ll find a comprehensive overview, including the number of columns, rows, duplicates, and missing values. The report also highlights any warnings, such as high cardinality in certain columns or potential correlations.

Overview of the EDA report

For each column, you will get a quick overview of unique values, missing values, and their distribution. Numerical columns will display a histogram, while categorical columns will show count plots. The report even includes a correlation heat map to identify relationships between variables, such as house prices and the number of rooms.

Correlation heat map

Additionally, the report provides a missing values overview, helping you understand which columns require attention. For instance, you might notice that 40-50% of data for certain attributes like “building area” is missing.

Missing values overview

Outro

Pandas Profiling is a powerful tool for quickly generating exploratory data analysis reports. It automates many aspects of data exploration, providing insights that can help guide further analysis and data cleaning efforts. For more information about this library, you can visit the official GitHub page linked below.

Turn Your Excel Worksheet Into An Exploratory Data Analysis Report In Just 3 Lines Of Python Code

Introduction

Getting Started

Python Code

Outro

Further Links