Solve Real-World Data Science Tasks in Python | Data Analysis with Pandas & Plotly (Full Tutorial)
In this tutorial, we are going to solve a real-world data science/analysis project with Python.
We will be using the following Python Libraries:
- Pandas
- Pandas Profiling Report
- AutoViz
- Plotly
After we have loaded the dataset, we will do some initial exploratory data analysis to get an idea of the dataset. I will show you very useful pandas’ functions that you can apply to any dataset you might deal with.
However, nowadays, there are so many cool libraries available, making exploratory data analysis so much easier. I will show you my favourite 2 libraries, which will generate automated reports for us in just a few lines of code.
Those reports are a great starting point before we are moving on to answer real-world business type questions.
While answering those questions, we will cover a wide range of various pandas’ functions. Additionally, we will also code our own python helper function, which we will use in the deep-dive & visualization section. All the charts we are going to create will be interactive and have a clean design.
We will cover the following chart types:
- Histogram
- Box Plot
- Bar Charts
- Scatter Plot
- Line Chart
Feel free to code along with me. In the project files, you will also find an exercise Notebook that includes all the tasks we will solve.
# Version: # --Python 3.8.5-- # autoviz==0.0.81 # numpy==1.19.3 # openpyxl==3.0.5 # pandas==1.2.0 # pandas-profiling==2.9.0 # plotly==4.14.1 # plotly-express==0.4.1 # xlrd==2.0.1 # Imports: import pandas as pd import plotly import plotly.express as px from pandas_profiling import ProfileReport from autoviz.AutoViz_Class import AutoViz_Class # -- Settings Plotly template # Reference Link: # https://plotly.com/python/templates/ # Try other themes: 'plotly_dark', 'plotly_white', 'ggplot2', 'seaborn', 'simple_white' template_style = "plotly_white" # ### Load DataFrame df = pd.read_excel("data/data.xlsx", engine="openpyxl") # **Inspeact first 5 rows of the DataFrame** df.head() # # Explore Dataset # ## Traditionally # Basic Info about DataFrame df.info() # Describe Method df.describe() # Get a view of unique values in column, e.g. 'Ship Mode' df["Ship Mode"].unique() # NaN count for each column df.isnull().sum() # ## Automated Reports # #### Pandas Profiling Report # Generate Pandas Profiling Report profile = ProfileReport(df, title="Sales Profiling Report") # Export Pandas Profiling Report to HTML profile.to_file("output/Salesreport.html") # #### Auto Viz Report AV = AutoViz_Class() df_autoviz = AV.AutoViz("data/data.xlsx") # # Data Preperation & Analysis # ### 🚩 TASKS: # - What was the highest Sale in 2020? # - What is average discount rate of charis? # - Add extra columns to seperate Year & Month from the Order Date # - Add a new column to calculate the Profit Margin for each sales record # - Export manipulated dataframe to Excel # - Create a new dataframe to reflect total Profit & Sales by Sub-Category # - Develop a function, to return a dataframe which is grouped by a particular column (as an input) # **What was the highest Sales?** df.nlargest(3, "Sales") # Highest Sale df.iloc[df["Sales"].argmax()] # **What is average Discount of charis?** # Create Boolean mask mask = df["Sub-Category"] == "Chairs" # Use Boolean mask to filter dataframe df[mask]["Discount"].mean() # **Add an extra column for "Order Month" & "Order Year"** df["Order Date Year"] = df["Order Date"].dt.year df["Order Date Month"] = df["Order Date"].dt.month df.head(2) # **Add a new column to calculate the Profit Margin for each sales record** df["Profit Margin"] = df["Profit"] / df["Sales"] df.head(3) # **Export manipulated dataframe back to excel** df.to_excel("output/data_output.xlsx", index=False) # #### Total Profit &Sales by Sub-Category # Group By Sub-Category [SUM] df_by_sub_category = df.groupby("Sub-Category").sum() # Reset Index df_by_sub_category.reset_index(inplace=True) # Print out Head df_by_sub_category.head() # #### Develop a function, to return a dataframe which is grouped by a particular column (as an input) # Groupby as a function def grouped_data(column_name): """ Groupby column and return DataFrame Input: Column Name """ df_tmp = df.groupby(column_name).sum() df_tmp.reset_index(inplace=True) return df_tmp # Group DataFrame by Segment grouped_data("Segment") # # Further Deep Dive & Visualization # ### 🚩 Objective: # - Further Analysis/Deep Dive using various kind of Charts # - Prepare/Refactor Dataframe for different Charttypes # - Generate & Export 'Ready-To-Present- Charts': Clean & Interactive # ----- # #### 📊 Chart Types: # - [x] Histogram # - [x] Boxpot # - [x] Various Barplots # - [x] Scatterplot # - [x] Linechart # **Distribution Sales [Histogram]** # Quick Stats Overview for Sales df["Sales"].describe() # Create Chart fig = px.histogram(df, x="Sales", template=template_style) # Plot Chart fig.show() # **Show the distribution and skewness of Sales [Boxplot]** # Create Chart fig = px.box(df, y="Sales", range_y=[0, 1000], template=template_style) # Plot Chart fig.show() # **Plot Sales by Sub-Category** # Create Dataframe data = grouped_data("Sub-Category") data.head() # Create Chart fig = px.bar( data, x="Sub-Category", y="Sales", title="<b>Sales by Sub Category</b>", template=template_style, ) # Display Plot fig.show() # Export Chart to HTML plotly.offline.plot(fig, filename="output/Sales_Sub_Cat.html", auto_open=False) # **Plot Profit by Sub-Category** # Create Chart fig = px.bar( data, x="Sub-Category", y="Profit", title="<b>Profit by Sub Category</b>", template=template_style, ) # Display Plot fig.show() # Export Chart to HTML plotly.offline.plot(fig, filename="output/Profit_Sub_Cat.html", auto_open=False) # **Plot Sales & Profit by Sub-Category** # Create Chart fig = px.bar( data, x="Sub-Category", y="Sales", color="Profit", color_continuous_scale=["red", "yellow", "green"], template=template_style, title="<b>Sales & Profit by Sub Category</b>", ) # Display Plot fig.show() # Export Chart to HTML plotly.offline.plot(fig, filename="output/Profit_Sales_Sub_Cat.html", auto_open=False) # #### Inspect Negative Profit of Tables # Is there any linear correlation between Sales/Profit & Discount? [Scatterplot] # Create Chart fig = px.scatter( df, x="Sales", y="Profit", color="Discount", template=template_style, title="<b>Scatterplot Sales/Profit</b>", ) # Display Plot fig.show() # Export Chart to HTML plotly.offline.plot( fig, filename="output/Sales_Profit_Scatterplot.html", auto_open=False ) # **Check Discount mean by Sub Category** # Create new dataframe: Group by 'Sub-Category' and aggregate the mean of 'Discount' df_discount = df.groupby("Sub-Category").agg({"Discount": "mean", "Profit": "sum"}) # Display first 5 rows of new dataframe df_discount.head() # **Plot Mean Discount by Sub Category** # Create Chart fig = px.bar( df_discount, x=df_discount.index, y="Discount", color="Profit", color_continuous_scale=["red", "yellow", "green"], template=template_style, title="<b>Mean Discount by Sub Category</b>", ) # Display Plot fig.show() # Export Chart to HTML # **Plot Sales & Profit Development for the year 2020** # Sort Values by Order Date df_sorted = df.sort_values(by=["Order Date"]) # Add cummulative Sales & Profit df_sorted["cummulative_sales"] = df_sorted["Sales"].cumsum() df_sorted["cummulative_profit"] = df_sorted["Profit"].cumsum() # Print tail & head of sorted dataframe df_sorted.tail(3) # Create Chart fig = px.line( df_sorted, x="Order Date", y=["cummulative_sales", "cummulative_profit"], template=template_style, title="<b>Sales/Profit Development</b>", ) # Display Plot fig.show() # Export Chart to HTML plotly.offline.plot( fig, filename="output/Sales_Profit_Development.html", auto_open=False ) # # Well Done 🚀