Exploratory data analysis (EDA) can be divided into 3 types:
- Univariate (single-column analysis)
- Bivariate (two-column analysis)
- Multivariate (analysis of more than two columns)
Types of Data
- Numerical Data: data whose values are numbers
- Categorical Data: data whose values are labels that split the observations into groups
Examples of Categorical Data:
Example 1: Branches in engineering
- Students in Electronics form one group
- Students in Computer Science form one group
- Students in Electrical form one group
Example 2: Gender
- Males can be one group
- Females can be one group
Example 3: Nationality
- Indian
- Korean
- Canadian
Example 4: Voting
- People who cast their vote
- People who did not cast their vote
Example 5: Vaccination
- People who are vaccinated
- People who are not vaccinated
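As a small illustration of splitting categorical data into groups, the sketch below counts how many observations fall under each label. The list of branches is made-up sample data, and `collections.Counter` is just one convenient way to do the counting.

```python
from collections import Counter

# Hypothetical sample: the engineering branch of each student
branches = ["Electronics", "Computer Science", "Electrical",
            "Computer Science", "Electronics", "Computer Science"]

# Each distinct label becomes one group; the stored number is the group size
group_sizes = Counter(branches)

# Print the groups from largest to smallest
for branch, size in group_sizes.most_common():
    print(f"{branch}: {size}")
```

These per-group counts are the numbers a count plot would draw as bars.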
Step 1 in Univariate Analysis
- Identify which columns are categorical and which are numerical
- For categorical data we use a count plot
- For numerical data we use a histogram
A histogram has bins, and each bin has a bin size.
The range between the min and max values in the column is split into bins, and the values that fall into each bin are counted and plotted. For example, assume I have a data set that has values from 0 to 100.
If I give the bin size as 10, my graph will be divided into 10 ranges: 0-10, 10-20, and so on up to 90-100.
If the number of bins is given instead, we can say that bin size = (maximum observation - minimum observation) / number of bins.
The y-axis represents the frequency, that is, the number of values that fall in that range or bin.
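The binning described above can be sketched in plain Python. This is only an illustration: `histogram_counts` is a hypothetical helper, the data is made up, and it assumes the values are not all identical (otherwise the bin size would be zero).

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width ranges."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # bin size = (max - min) / number of bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land one bin past the end, so clamp it
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return width, counts

# Hypothetical data with values from 0 to 100
data = [0, 5, 12, 18, 25, 33, 41, 47, 55, 58, 62, 70, 77, 85, 93, 100]
width, counts = histogram_counts(data, bins=10)

print(width)   # bin size
print(counts)  # frequency per bin, i.e. the heights on the y-axis
```

Plotting libraries do this counting internally; the sketch only shows where the bar heights come from.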
Step 2: Box Plot or 5 Number Summary
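A box plot gives the 5 number summary: the min, the 25th percentile (Q1), the median (Q2), the 75th percentile (Q3), and the max.
- Median: after sorting the data in increasing order, the central value is called the median. With an even number of observations it is the average of the two central values.
It is a measure of central tendency and is often preferred over the mean, because the mean is severely affected by outliers.
For categorical data we use the mode, which is the most frequent value.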
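To see why the median is more robust than the mean, here is a minimal sketch using Python's standard `statistics` module. The salary and colour lists are made-up sample data.

```python
import statistics

# Hypothetical salaries; the last value is an outlier
salaries = [30, 32, 35, 36, 40, 1000]

mean_value = statistics.mean(salaries)      # pulled upward by the outlier
median_value = statistics.median(salaries)  # average of the two central values

# For categorical data the mode is the most frequent label
colours = ["red", "blue", "red", "green", "red"]
most_common = statistics.mode(colours)

print(mean_value, median_value, most_common)
```

Here the mean is 195.5 while the median is 35.5, so a single extreme value moves the mean far away from the typical salary.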
- Percentile: a value that tells us what percentage of the observations lie below it.
Example: if 8 is the 25th percentile, then 25 percent of the values lie below 8.
Calculating the percentile = (number of observations below that value) / (total number of observations) * 100
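The percentile formula can be checked with a few lines of Python. `percentile_rank` is a hypothetical helper name and the scores are made-up data, chosen so that 8 comes out as the 25th percentile.

```python
def percentile_rank(values, x):
    """Percentage of observations that lie strictly below x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100

# Hypothetical data: 2 of the 8 values are below 8
scores = [2, 4, 8, 9, 11, 13, 15, 20]
rank = percentile_rank(scores, 8)
print(rank)  # 25.0
```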
If we want to find the value at a given percentile, for example the 25th, the formula is
Position = (percentile / 100) * (n + 1)
Here n is the number of observations, and the result is a position in the sorted data, counted from 1, not the value itself. So if we get 5, we look up the 5th observation in the sorted data.
For a decimal result such as 5.5, take the average of the 5th and 6th observations.
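A minimal sketch of this lookup is given below. `value_at_percentile` is a hypothetical helper and the data is made up. It follows the simple averaging rule described in these notes; statistics libraries usually interpolate between the two neighbours instead, so their results can differ slightly.

```python
import math

def value_at_percentile(values, percentile):
    """Look up the value at a percentile using position = (p / 100) * (n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    position = (percentile / 100) * (n + 1)  # 1-based position in sorted data
    # Clamp so very low or very high percentiles stay inside the data
    position = max(1.0, min(position, float(n)))
    lower = math.floor(position)
    upper = math.ceil(position)
    # Whole position: lower == upper, so this returns that observation.
    # Decimal position: average of the two neighbouring observations.
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

# Hypothetical data with n = 10 observations
data = [3, 5, 7, 8, 9, 11, 13, 15, 18, 21]
print(value_at_percentile(data, 50))  # position 5.5: average of 5th and 6th
print(value_at_percentile(data, 25))  # position 2.75: average of 2nd and 3rd
```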
2. 25th Percentile - Also known as the first, or lower, quartile. The 25th percentile is the value at which 25% of the values lie below it and 75% of the values lie above it.
3. 50th Percentile - Also known as the median. The median cuts the data set in half. Half of the values lie below the median and half lie above it.
4. 75th Percentile - Also known as the third, or upper, quartile. The 75th percentile is the value at which 25% of the values lie above it and 75% of the values lie below it.
5. The spread between the 25th and 75th percentiles is the interquartile range (IQR): IQR = Q3 - Q1
6. The 25th percentile is called Q1
7. The 50th percentile is called Q2
8. The 75th percentile is called Q3
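The quartiles and the IQR can be computed with `statistics.quantiles` from the Python standard library (Python 3.8+). Its default method places the cut points using n + 1 positions and interpolates between neighbours. The data below is made up and chosen so that every cut point lands on an actual observation.

```python
import statistics

# Hypothetical data with 11 observations
data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]

# n=4 splits the data into four parts and returns the three cut points
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle 50% of the data

print(q1, q2, q3, iqr)
```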
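- Min and max
The min and max define the range in which values are expected to lie, so that anything outside that range can be detected as an outlier. In a box plot these limits are commonly computed as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, and the whiskers extend to the most extreme values inside those limits.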
10. Outliers
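A minimal sketch of outlier detection follows. It assumes the common box plot convention that the limits are Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. That 1.5 factor is an assumption, not something these notes state. `find_outliers` is a hypothetical helper and the readings are made-up data.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the lower limit, upper limit, and the values outside them."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_limit = q1 - k * iqr  # assumed convention: k = 1.5
    upper_limit = q3 + k * iqr
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return lower_limit, upper_limit, outliers

# Hypothetical data; 60 sits far above the rest
readings = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 60]
lower_limit, upper_limit, outliers = find_outliers(readings)
print(lower_limit, upper_limit, outliers)
```

A larger `k` flags fewer points, so it can be raised when only very extreme values should count as outliers.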