After understanding individual features using univariate analysis, the next important step in Exploratory Data Analysis (EDA) is Bivariate Analysis.
Bivariate analysis focuses on analyzing the relationship between two variables. This step is crucial because machine learning models rely heavily on how features interact with each other and with the target variable.
If you skip this step, you may miss important patterns, correlations, and dependencies that directly affect your model’s performance.
What is Bivariate Analysis?
Bivariate Analysis means studying the relationship between two variables.
Examples:
- Age vs Salary
- Experience vs Salary
- Study Hours vs Marks
- Product Category vs Sales
The goal is to understand:
- Whether variables are related
- The strength of the relationship
- The direction (positive or negative)
- Patterns or trends
- Differences between categories
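Strength and direction can also be checked numerically. The sketch below is a minimal illustration using made-up study data (the column names and values are assumptions, not from a real dataset). It computes the Pearson correlation coefficient, whose sign indicates direction and whose magnitude indicates strength.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5, 6],
    "marks": [35, 45, 50, 62, 70, 81],
})

# Pearson correlation: the sign gives the direction,
# the absolute value (0 to 1) gives the strength of the linear relationship.
r = df["study_hours"].corr(df["marks"])

direction = "positive" if r > 0 else "negative"
print(f"r = {r:.2f} ({direction})")
```

A value close to +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or no linear relationship.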
Why Bivariate Analysis is Important
Many beginners only focus on individual features, but real insights come from relationships.
Benefits:
- Helps in feature selection
- Can improve model accuracy by guiding feature engineering
- Identifies strong predictors
- Detects hidden patterns
- Helps detect multicollinearity (highly correlated features) before modeling
Example:
If experience and salary are strongly correlated, experience is likely to be a useful predictor of salary, and your model can exploit that relationship.
Best Plots for Bivariate Analysis
Let’s explore the most useful plots and when to use them.
1. Scatter Plot
Best For:
Numerical vs Numerical
Use When:
You want to see the relationship between two continuous variables.
What It Shows:
- Positive or negative correlation
- Clusters
- Outliers
Code:
sns.scatterplot(x="Age", y="Salary", data=df)
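The one-line snippets in this article assume that seaborn is imported as sns and that a DataFrame named df already exists. A complete, runnable version of the scatter plot might look like the sketch below. The data values and the output file name are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40, 45, 50],
    "Salary": [28000, 32000, 41000, 50000, 58000, 61000, 75000],
})

# Each row becomes one point: Age on the x-axis, Salary on the y-axis.
ax = sns.scatterplot(x="Age", y="Salary", data=df)
ax.set_title("Age vs Salary")

# Save to a file; in a notebook you could call plt.show() instead.
plt.savefig("age_vs_salary.png")
```

The same setup (imports plus a DataFrame) applies to every other plot in this article.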
2. Line Plot
Best For:
Trends over time or ordered data
Use When:
You want to track how one variable changes over time or along another ordered variable.
Code:
sns.lineplot(x="Year", y="Sales", data=df)
3. Bar Plot
Best For:
Categorical vs Numerical
Use When:
You want to compare average values across categories.
Code:
sns.barplot(x="Category", y="Sales", data=df)
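By default, seaborn's barplot draws the mean of the numerical variable for each category. The numbers behind the bars can be reproduced with a pandas groupby, as in this small sketch (the data is hypothetical):

```python
import pandas as pd

# Hypothetical sales data for illustration only.
df = pd.DataFrame({
    "Category": ["A", "A", "B", "B", "C", "C"],
    "Sales": [100, 140, 200, 260, 90, 110],
})

# Mean of Sales per Category: the height of each bar in the default bar plot.
mean_sales = df.groupby("Category")["Sales"].mean()
print(mean_sales)
```

Checking these values alongside the plot is a quick way to confirm that the chart shows what you expect.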
4. Boxplot
Best For:
Categorical vs Numerical
Use When:
You want to compare distributions across categories.
What It Shows:
- Spread of data
- Median
- Outliers
Code:
sns.boxplot(x="Category", y="Price", data=df)
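The quantities a boxplot displays can also be computed directly. The sketch below uses hypothetical prices and a helper named iqr_outliers (an illustrative name, not a library function) to flag points outside 1.5 × IQR from the quartiles, which corresponds to seaborn's default whisker setting.

```python
import pandas as pd

# Hypothetical price data for illustration only.
df = pd.DataFrame({
    "Category": ["A"] * 6 + ["B"] * 6,
    "Price": [10, 12, 13, 14, 15, 60, 20, 22, 23, 25, 26, 28],
})

def iqr_outliers(prices: pd.Series) -> list:
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    mask = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
    return prices[mask].tolist()

# Median and range per category (the center line and extent of each box plot).
summary = df.groupby("Category")["Price"].agg(["median", "min", "max"])

# Outliers per category, using the helper above.
outliers = {
    name: iqr_outliers(group)
    for name, group in df.groupby("Category")["Price"]
}
print(summary)
print(outliers)
```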
5. Violin Plot
Best For:
Distribution comparison
Use When:
You want more detailed distribution insights than a boxplot provides, such as the shape and density of the data.
Code:
sns.violinplot(x="Category", y="Score", data=df)
6. Strip Plot
Best For:
Showing individual data points
Use When:
You want to see every individual observation within each category, which works best for small datasets.
Code:
sns.stripplot(x="Category", y="Value", data=df)
7. Swarm Plot
Best For:
Showing individual data points without overlap (best for small to medium datasets)
Use When:
You want to avoid overlapping points.
Code:
sns.swarmplot(x="Category", y="Value", data=df)
8. Heatmap
Best For:
Correlation analysis
Use When:
You want to visualize pairwise correlations between several numerical variables at once.
What It Shows:
- Correlation strength
- Positive/negative relationships
Code:
sns.heatmap(df.corr(numeric_only=True), annot=True)
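A heatmap is only a picture of the correlation matrix, so the same matrix can be used to list strongly correlated pairs explicitly. The following is a minimal sketch with made-up data. The 0.8 threshold is an arbitrary choice for illustration, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Advertising": [10, 20, 30, 40, 50, 60],
    "Sales": [15, 28, 33, 47, 52, 66],
    "Discount": [5, 3, 6, 2, 4, 5],
    "Region": ["N", "S", "N", "E", "W", "S"],  # non-numeric, excluded below
})

# numeric_only=True skips text columns such as Region.
corr = df.corr(numeric_only=True)

# Keep only the upper triangle (without the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the illustrative threshold.
strong = pairs[pairs.abs() > 0.8]
print(strong)
```

Pairs of input features that show up in this list are candidates for a multicollinearity check before modeling.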
9. Regression Plot
Best For:
Trend + relationship
Use When:
You want to see a scatter plot together with a fitted linear regression line.
Code:
sns.regplot(x="Age", y="Salary", data=df)
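By default, regplot fits a straight line to the points. To obtain the slope and intercept of such a line as numbers, one option is an ordinary least-squares fit with NumPy, sketched here on hypothetical data:

```python
import numpy as np

# Hypothetical data for illustration only.
age = np.array([22, 25, 30, 35, 40, 45, 50])
salary = np.array([28000, 32000, 41000, 50000, 58000, 61000, 75000])

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(age, salary, deg=1)

# Values on the fitted line for each observed age.
predicted = slope * age + intercept
print(f"salary ≈ {slope:.1f} * age + {intercept:.1f}")
```

The slope estimates how much salary changes per additional year of age in this toy data.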
How to Choose the Right Plot
Numerical vs Numerical:
- Scatter Plot
- Line Plot
- Regression Plot
Categorical vs Numerical:
- Bar Plot
- Boxplot
- Violin Plot
Multiple Relationships:
- Heatmap
Distribution with categories:
- Strip Plot
- Swarm Plot
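The guide above can be turned into a small helper. The function below, suggest_plots, is a hypothetical name introduced for illustration. It inspects the two column types and returns the seaborn function names recommended in this article.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def suggest_plots(df: pd.DataFrame, x: str, y: str) -> list[str]:
    """Suggest seaborn plot functions based on the types of two columns."""
    x_num = is_numeric_dtype(df[x])
    y_num = is_numeric_dtype(df[y])
    if x_num and y_num:
        # Numerical vs numerical
        return ["scatterplot", "lineplot", "regplot"]
    if x_num != y_num:
        # Categorical vs numerical
        return ["barplot", "boxplot", "violinplot", "stripplot", "swarmplot"]
    # Categorical vs categorical is not covered by this guide.
    return []

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Age": [25, 32, 41],
    "Salary": [30000, 45000, 60000],
    "Category": ["A", "B", "A"],
})

print(suggest_plots(df, "Age", "Salary"))
print(suggest_plots(df, "Category", "Salary"))
```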
Real Example in Machine Learning
Suppose you are building a sales prediction model.
You can use:
- Scatter plot → Sales vs Advertising
- Bar plot → Sales by category
- Heatmap → Correlation between features
- Regression plot → Trend analysis
This helps you select better features and build a stronger model.
Common Mistakes Beginners Make
1. Ignoring Relationships
Only focusing on single variables.
2. Using the Wrong Plot
Using a bar plot instead of a scatter plot for two numerical variables.
3. Not Checking Correlation
Missing important relationships.
4. Overlooking Outliers
Outliers can distort relationships.
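To illustrate how much a single outlier can matter, the sketch below adds one extreme, made-up point to an otherwise clean dataset and compares the Pearson correlation before and after. It also computes a rank-based (Spearman) correlation, taken here as the Pearson correlation of the ranks, which is usually less sensitive to extreme values.

```python
import pandas as pd

# Hypothetical, roughly linear data for illustration only.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [2, 4, 5, 4, 6, 7, 8, 9],
})
r_clean = df["x"].corr(df["y"])

# Add a single extreme point.
outlier = pd.DataFrame({"x": [9], "y": [-40]})
with_outlier = pd.concat([df, outlier], ignore_index=True)

# Pearson correlation after adding the outlier.
r_outlier = with_outlier["x"].corr(with_outlier["y"])

# Spearman correlation: Pearson correlation computed on the ranks.
rho_outlier = with_outlier["x"].rank().corr(with_outlier["y"].rank())

print(f"Pearson without outlier: {r_clean:.2f}")
print(f"Pearson with outlier:    {r_outlier:.2f}")
print(f"Spearman with outlier:   {rho_outlier:.2f}")
```

In this toy example one point is enough to flip the sign of the Pearson correlation, which is why it helps to inspect a scatter plot before trusting a single coefficient.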
Final Thoughts
Bivariate analysis is where real insights begin. It helps you understand how variables interact, which is essential for building accurate machine learning models.
Before moving to modeling, always analyze relationships properly.
Remember:
Good EDA = Better Insights = Better Models
Mastering bivariate analysis will take your data science skills to the next level.