Statistics For Machine Learning

Introduction to Statistics

  1. Statistics is the science of collecting, organizing, and analyzing data.

Types of Statistics

There are two types of statistics:

  1. Inferential Statistics: We use a sample of data to make an inference or draw a conclusion about the larger population.
  2. Descriptive Statistics: We summarize and describe the data set itself.
    

Types of Variables

  1. Numerical Variables: They hold measured or counted quantities, so arithmetic on them is meaningful.
  2. Categorical Variables: They hold labels that place each observation into one of a limited set of groups.

Examples:

  1. Age, weight, etc. are numerical variables.
  2. Gender and the course selected during undergrad are categorical variables.

Further Classification of Categorical Variables

Categorical variables can be further classified as ordinal and nominal categorical variables.

  1. Ordinal Categorical Variables: Their categories follow a certain order, so they can be ranked. Example: worst, good, better.
  2. Nominal Categorical Variables: Their categories do not follow any order. Example: gender, state.

Descriptive Statistics

As the name suggests, descriptive statistics describe the data.

Measure Of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying its central value.

  1. Mean
  2. Mode
  3. Median

Note: the mean and median can be calculated only if the data is numerical; the mode can also be found for categorical data. If numbers are stored as strings, we need to convert them first. Also, if there are NaN values, we need to handle them.

  1. Mean: It is the average of the values. Drawback: it is easily affected by outliers.

  2. Median: Sort the data in ascending order; the median is the middle of the sorted data.

    1. For a dataset with an even number of values n, it is the average of the (n/2)th and (n/2 + 1)th values.
    2. For a dataset with an odd number of values, it is the middle value.
  3. Mode: It is the most frequently occurring value.
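The three measures can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the `ages` list and the outlier value 120 are made-up data, chosen to show how one extreme value pulls the mean much more than the median.

```python
import statistics

# Hypothetical sample with an even number of values.
ages = [22, 25, 25, 29, 31, 34]

mean = statistics.mean(ages)      # sum of the values divided by their count
median = statistics.median(ages)  # average of the two middle values (25 and 29)
mode = statistics.mode(ages)      # most frequently occurring value

# Add a single extreme value to see its effect on each measure.
with_outlier = ages + [120]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)  # odd count: the middle value

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"with outlier: mean={mean_out:.2f} median={median_out}")
```

Here the mean moves from about 27.7 to about 40.9 after adding the outlier, while the median only moves from 27 to 29.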

Measure Of Dispersion

Statistical Dispersion indicates the spread of the data.

  1. Variance: It is the average squared deviation from the mean. If the variance is high, the graph of the distribution is more spread out. If the variance is low, the graph is squeezed around the mean.
  2. Standard Deviation: It is the square root of the variance and measures the spread of the data around the mean. Note: Shifting (adding a constant to every value) does not affect the spread of the data, but scaling (multiplying every value by a constant) scales the standard deviation by the same factor.

Two common ways to rescale data:

  1. Normalization: We change the values such that they fall between 0 and 1. The mean need not stay the same.

  2. Standardization: We adjust values measured on different scales to a common scale by centring the graph at a mean of 0 with a standard deviation of 1.
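A minimal sketch of these quantities using only the standard library. It assumes population (not sample) variance and standard deviation, min-max scaling for normalization, and z-scores for standardization; the `values` list is made-up data.

```python
import statistics

# Hypothetical data set.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
variance = statistics.pvariance(values)  # population variance
std = statistics.pstdev(values)          # population standard deviation

# Shifting leaves the spread unchanged; scaling multiplies it.
shifted_std = statistics.pstdev([v + 10 for v in values])
scaled_std = statistics.pstdev([v * 3 for v in values])

# Normalization (min-max scaling): values end up between 0 and 1.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-scores): mean 0 and standard deviation 1.
standardized = [(v - mean) / std for v in values]

print(variance, std, shifted_std, scaled_std)
print(normalized)
print(standardized)
```

Use `statistics.variance` and `statistics.stdev` instead when the data is a sample and the n - 1 denominator is wanted.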

Encoding Categorical Values

Most machine learning algorithms need the data to be in numerical form.

  1. One-Hot Encoding: We use this technique when there is no dependency between the categorical groups, that is, when we have nominal data. Example: Gender has 2 groups:

    1. Male
    2. Female

Both groups are independent of each other; in other words, there is no order.

Definition of One-Hot Encoding: In digital circuits and machine learning, a one-hot is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).

How to perform One-Hot Encoding manually? We create dummy variables. The number of dummy variables created is equal to n - 1, where n represents the number of categories; the remaining category is implied when all dummy variables are 0. Example: Gender has the categories Male and Female, so n = 2 and the number of dummy variables is 2 - 1 = 1. The drawback of this method is that it increases the dimensionality of the data, which can lead to the curse of dimensionality.
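The manual procedure can be sketched in plain Python. The helper name `one_hot` and its `drop_first` flag are hypothetical, introduced only for this illustration; with `drop_first=True` it produces the n - 1 dummy variables described above, and with `drop_first=False` it produces one column per category.

```python
def one_hot(values, drop_first=True):
    """Return (column_names, rows) of 0/1 dummy variables for a list of labels."""
    categories = sorted(set(values))
    # Dropping the first category leaves n - 1 dummy variables;
    # a row of all zeros then stands for the dropped category.
    kept = categories[1:] if drop_first else categories
    rows = [[1 if value == category else 0 for category in kept] for value in values]
    return kept, rows


genders = ["Male", "Female", "Female", "Male"]

columns, dummies = one_hot(genders)
print(columns, dummies)

full_columns, full_dummies = one_hot(genders, drop_first=False)
print(full_columns, full_dummies)
```

With two categories, the first call yields a single `Male` column, and `Female` is represented by 0.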

  2. Label Encoding: When we have ordinal data, we use a label encoder. We specify the order of the categories as numbers. Example: worst = 0, good = 1, better = 2.
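Encoding ordinal data by hand amounts to a lookup from category to rank. A minimal sketch, assuming the order worst < good < better and a made-up `ratings` column:

```python
# Explicit category order, from lowest to highest rank (an assumption for this example).
order = ["worst", "good", "better"]
rank = {category: index for index, category in enumerate(order)}

# Hypothetical ordinal column.
ratings = ["good", "worst", "better", "good"]
encoded = [rank[rating] for rating in ratings]

print(encoded)  # [1, 0, 2, 1]
```

Stating the order explicitly matters: an encoder that sorts labels alphabetically would rank "better" below "good".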

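Both approaches share a problem: we first have to separate the column to encode and then combine the result with the remaining columns. To solve this, we use a column transformer, which applies each encoder to its own columns and assembles the output in one step.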
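A sketch of this using scikit-learn's `ColumnTransformer`, assuming pandas and scikit-learn are installed. The DataFrame and its column names are made up. Note one swap: `OrdinalEncoder` is used here in place of a label encoder, because it accepts an explicit category order through its `categories` parameter.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one nominal, one ordinal, and one numerical column.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "rating": ["good", "worst", "better"],
    "age": [25, 32, 41],
})

transformer = ColumnTransformer(
    transformers=[
        # Nominal column: one-hot encode and drop the first category (n - 1 dummies).
        ("nominal", OneHotEncoder(drop="first"), ["gender"]),
        # Ordinal column: encode using the stated category order.
        ("ordinal", OrdinalEncoder(categories=[["worst", "good", "better"]]), ["rating"]),
    ],
    remainder="passthrough",  # keep the untouched columns (here: age)
    sparse_threshold=0,       # always return a dense array
)

# Output columns, in order: gender (1 = Male), rating rank, age.
encoded = transformer.fit_transform(df)
print(encoded)
```

No column has to be split off and merged back by hand; the transformer handles both encoders and the passthrough column in a single `fit_transform` call.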

Handling NaN values

  1. If only a few rows contain NaN values, we can drop those rows directly. It will not have a severe effect on our distribution.
  2. `fillna` replaces every NaN with the value passed to it.
  3. If the variable is categorical, we replace NaN values with the mode.
  4. If the variable is numerical, we replace NaN values with the median, since the median is less affected by outliers than the mean.
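These steps might look as follows in pandas. The DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "age": [25, None, 41, 32],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Option 1: drop every row that contains a NaN.
dropped = df.dropna()

# Option 2: fill NaN values instead of dropping rows.
# Numerical column: fill with the median.
df["age"] = df["age"].fillna(df["age"].median())
# Categorical column: fill with the mode (mode() returns a Series, so take its first entry).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(dropped)
print(df)
```

Dropping rows costs half of this tiny data set, which is why filling is usually preferred when missing values are not rare.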

Percentiles

A percentile is a value below which a certain percentage of observations fall. Example: if the 25th percentile is 30, then 25 percent of the values are less than 30.
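A small check of this definition with the standard library. The `data` list is made up, and the `"inclusive"` interpolation method is an assumption: different percentile methods can give slightly different values on small data sets.

```python
import statistics

# Hypothetical data set.
data = [10, 20, 30, 40, 50, 60, 70, 80]

# quantiles(n=4) returns the three quartile cut points:
# the 25th, 50th, and 75th percentiles.
p25, p50, p75 = statistics.quantiles(data, n=4, method="inclusive")

# Share of observations that lie below the 25th percentile.
share_below = sum(1 for x in data if x < p25) / len(data)

print(p25, share_below)
```

For this data the 25th percentile is 27.5, and 2 of the 8 values (25 percent) lie below it.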

Box Plot

The box plot gives us the 5-number summary:

  1. Minimum
  2. 25th percentile
  3. Median
  4. 75th percentile
  5. Maximum

This helps us to identify outliers. An outlier in the box plot is any point that lies beyond the minimum and maximum whiskers.
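The numbers behind a box plot can be computed without plotting. This sketch assumes the common convention that the whiskers reach at most 1.5 times the interquartile range (IQR) beyond the quartiles; that rule is not stated in the text above, and the `data` list is made up.

```python
import statistics

# Hypothetical data with one extreme value.
data = [12, 15, 18, 20, 22, 25, 28, 30, 95]

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumption: points further than 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inliers = [x for x in data if lower_fence <= x <= upper_fence]

# The whiskers end at the smallest and largest values that are not outliers.
summary = (min(inliers), q1, median, q3, max(inliers))

print("5-number summary:", summary)
print("outliers:", outliers)
```

Here the value 95 lies above the upper fence of 43, so it is reported as an outlier and the upper whisker ends at 30.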

Correlation

Correlation measures the relationship, or association, between two variables by looking at how the variables change with respect to each other.

Types Of Correlation

  1. High Correlation: High correlation describes a stronger relationship between two variables.
  2. Low Correlation: Low correlation describes a weaker relationship, meaning that the two variables are probably not linearly related.

Correlation represents a linear relationship.

  3. Positive Correlation: The linear relationship is positive, and the two variables increase or decrease in the same direction.
  4. Negative Correlation: It is just the opposite; the relationship line has a negative slope and the variables change in opposite directions (i.e., one variable decreases while the other increases).
  5. No Correlation: The variables have no linear relationship.

Correlation Coefficient

The correlation coefficient r is a statistical indicator of the direction and strength of a correlation between two variables. It ranges from -1 to 1.

  1. r < 0 implies negative correlation
  2. r > 0 implies positive correlation
  3. r = 0 implies no linear correlation
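A minimal sketch of the three cases, assuming r is the Pearson correlation coefficient (the text does not name a specific coefficient). The function name `pearson_r` and all data are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (the covariance, up to a factor of n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises
r_none = pearson_r(x, [4, 1, 0, 1, 4])       # y = (x - 3)^2, a non-linear relationship

print(r_positive, r_negative, r_none)
```

The last case gives r = 0 even though y is fully determined by x, which is why r = 0 only rules out a linear relationship.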