# Data Analysis Methods useful for HRP

#### What is the meaning of analysis?

By analysis we mean the computation of indices or measures along with searching for patterns of relationships that exist among the data groups. The analysis may be categorized as descriptive analysis and inferential analysis (a.k.a statistical analysis).

##### Descriptive Statistics

Typically descriptive statistics (also known as descriptive analysis) is the first level of analysis. It helps researchers summarize the data and find patterns. A few commonly used descriptive statistics are:

• Mean: numerical average of a set of values.
• Median: midpoint of a set of numerical values.
• Mode: most common value among a set of values.
• Percentage: used to express how a value or group of respondents within the data relates to a larger group of respondents.
• Frequency: the number of times a value is found.
• Range: the difference between the highest and lowest values in a set of values.

Descriptive statistics provide absolute numbers. However, they do not explain the rationale or reasoning behind those numbers. Before applying descriptive statistics, it’s important to think about which one is best suited for your research question and what you want to show. For example, a percentage is a good way to show the gender distribution of respondents.

Descriptive statistics are most helpful when the research is limited to the sample and does not need to be generalized to a larger population. For example, if you are comparing the percentage of children vaccinated in two different villages, then descriptive statistics are enough.

Since descriptive analysis is mostly used for analyzing a single variable, it is often called univariate analysis. Descriptive analysis methods can therefore be used by BHM students in their projects. Let us look at the details of the two statistical methods, descriptive and inferential.

## Types of descriptive statistics

### Distribution

A distribution is a summary of how often each value of a variable occurs. A simple distribution table lists each value against the number of units or individuals that have it. For instance, it may show the percentage marks of each college student or the total count of students in each subject stream.

Example 1: Table of grades

Example 2: Frequency distribution table
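A frequency distribution like the ones named above can be sketched in a few lines of Python. The grades below are hypothetical, illustrative data (they do not come from the original tables), and the variable names are assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical grades for ten students (illustrative data only).
grades = ["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"]

# Frequency: the number of times each grade occurs.
frequency = Counter(grades)

# Percentage: each count expressed as a share of all students.
total = len(grades)
percentage = {grade: 100 * count / total for grade, count in frequency.items()}

# Print one row per grade, like a simple distribution table.
for grade in sorted(frequency):
    print(f"{grade}: frequency={frequency[grade]}, percentage={percentage[grade]:.0f}%")
```

Each printed row pairs a value with its count, which is all a simple distribution table contains.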

### Central Tendency

Central tendency is a term that defines the idea of having one number which can summarize the full data set. Simply put, it is a number that lies in the center of the set.

The three main measures of central tendency are:

• Mean – The mean is the average of all the values in the sample. For example, if you have 10 values, sum all the values and divide the sum by 10 to find the mean.
• Median – As the name suggests, the median lies in the middle of the data. First arrange the values in numerical order, and then take the central value as the median.
• Mode – The mode is the value that occurs most frequently. You can determine the mode by finding the number that occurs the most times. Arranging the numbers in ascending order helps, because it groups repeated values together.
###### Example: Find the mean, median and mode of the following numbers: 9,8,9,6,4,5,2,3,1,7

Mean = (9+8+9+6+4+5+2+3+1+7)/10 = 54/10 = 5.4

Median

• Step 1 – Arrange the numbers in ascending order = 1,2,3,4,5,6,7,8,9,9
• Step 2 – Find the mid value = Here there are 2 numbers in the middle, 5 & 6; these numbers have to be added & divided by 2 = (5+6)/2 = 5.5

Mode = 9, since 9 occurs twice, more often than any other value in the distribution.
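The same three measures can be checked with Python's standard `statistics` module. This is a minimal sketch using the ten numbers from the example; the variable names are chosen only for illustration.

```python
import statistics

values = [9, 8, 9, 6, 4, 5, 2, 3, 1, 7]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # average of the two middle values after sorting
mode = statistics.mode(values)      # the most frequently occurring value

print(mean, median, mode)  # 5.4 5.5 9
```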

### Dispersion

Dispersion describes how the values of a distribution are spread around the central tendency.

The following measures are used to describe dispersion in descriptive statistics.

### Range

The range is the simplest measure of dispersion in descriptive statistics. You can find the range by subtracting the minimum value from the maximum value.

In the above example, we have the values: 1,2,3,4,5,6,7,8,9,9

9-1 = 8 is the range.

### Variance

The variance is calculated by finding the difference between each value and the mean, then adding the squares of these differences, and dividing by (n-1). Here, n is the total number of values.

Example: If the values are 1,2,3

• Calculate the mean = (1+2+3)/3 = 2
• Calculate (1-2), (2-2), (3-2) = -1, 0, 1
• Add the Squares = 1 + 0 + 1 = 2
• Divide by (n-1) = 2/2 = 1

Hence the variance is = 1
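As a cross-check, here is a minimal Python sketch of the standard sample variance, that is, the squared deviations from the mean, summed and divided by (n-1). It uses the values 1, 2, 3 and compares the hand calculation with `statistics.variance`; the variable names are illustrative.

```python
import statistics

values = [1, 2, 3]
n = len(values)
mean = sum(values) / n

# Squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in values]

# Sample variance: sum of squared deviations divided by (n - 1).
variance = sum(squared_deviations) / (n - 1)

# The standard library implements the same definition.
library_variance = statistics.variance(values)

print(variance, library_variance)
```

Dividing by (n-1) rather than n gives the sample variance, which is the form used in the standard deviation example in this document.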

### Standard deviation

Standard deviation is one of the most widely used measures in descriptive statistics because it describes how far, on average, the values lie from the mean.

Example: Find the SD of 1,2,3,4,5,6,7,8,9,9

Step 1 – Subtract the mean from each value

• 1 – 5.4 = -4.4
• 2 – 5.4 = -3.4
• 3 – 5.4 = -2.4
• 4 – 5.4 = -1.4
• 5 – 5.4 = -0.4
• 6 – 5.4 = 0.6
• 7 – 5.4 = 1.6
• 8 – 5.4 = 2.6
• 9 – 5.4 = 3.6
• 9 – 5.4 = 3.6

Step 2 – Square each of these differences and add the squares

19.36 + 11.56 + 5.76 + 1.96 + 0.16 + 0.36 + 2.56 + 6.76 + 12.96 +12.96 = 74.4

Step 3 – Find variance by dividing step 2 result by (n-1) {10-1=9}

Variance = 74.4/9 = 8.27

Step 4 – Square root of variance is standard deviation. Hence,

Standard deviation = 2.88

Standard deviation is necessary for descriptive statistics as it shows how representative the mean is: a small standard deviation means the values cluster closely around the mean, and a large one means they are widely spread.
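The four steps above can be sketched in Python as follows. The sketch reuses the ten values from the example and cross-checks the result against `statistics.stdev`, which also divides by (n-1); the variable names are assumptions made for illustration.

```python
import math
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]
n = len(values)
mean = sum(values) / n

# Step 1: subtract the mean from each value.
deviations = [x - mean for x in values]

# Step 2: square the deviations and add them.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 3: divide by (n - 1) to get the sample variance.
variance = sum_of_squares / (n - 1)

# Step 4: the square root of the variance is the standard deviation.
std_dev = math.sqrt(variance)

print(round(sum_of_squares, 2), round(variance, 2), round(std_dev, 2))  # 74.4 8.27 2.88
```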

### Standard Errors

A sample mean is a random variable, because its value depends on which units happen to be selected. Here’s how:

Suppose you need to find the height of trees that are 100 years old. You randomly select 100 such trees and find their sample mean. This mean is then used as the estimate in other calculations of descriptive statistics.

However, it is necessary to understand that the sample mean we have taken is just one of the possibilities. If we select another sample of the same size, it will have a different mean.

This is why the sample mean is always a random variable. This random variable has a probability distribution, which is commonly referred to as the sampling distribution. The standard deviation of the sampling distribution is called the standard error. For the mean, it is commonly estimated as the sample standard deviation divided by the square root of the sample size.
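One common way to estimate the standard error of the mean is to divide the sample standard deviation by the square root of the sample size. The sketch below assumes that estimator; the function name `standard_error` and the tree heights are hypothetical and serve only as an illustration.

```python
import math
import statistics

def standard_error(sample):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Hypothetical heights (in metres) of eight 100-year-old trees.
heights = [21.5, 19.8, 22.1, 20.4, 23.0, 18.9, 21.2, 20.7]

sample_mean = statistics.mean(heights)
se = standard_error(heights)

print(f"sample mean = {sample_mean:.2f}, standard error = {se:.2f}")
```

A larger sample shrinks the standard error, because the denominator grows with the square root of the sample size.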

### Variability

Variability is defined in multiple ways.

(i) Standard deviation measures variability from point to point within a sample. This means it measures variation among the sampling units.

(ii) The coefficient of variation also measures variability from point to point. But it measures variability on a relative basis (the standard deviation divided by the mean), so it is not affected by the measurement units.

(iii) Standard error measures variability from sample to sample. This means it describes how much a statistic varies across repeated samples.
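To illustrate why the coefficient of variation does not depend on measurement units, the sketch below computes it for the same hypothetical heights expressed in metres and in centimetres. It assumes the usual definition (sample standard deviation divided by the mean); the function name and the data are illustrative.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample standard deviation divided by the mean (a unitless ratio)."""
    return statistics.stdev(sample) / statistics.mean(sample)

# Hypothetical heights in metres, and the same heights in centimetres.
heights_m = [1.50, 1.62, 1.75, 1.80, 1.68]
heights_cm = [h * 100 for h in heights_m]

# The standard deviation changes with the unit of measurement...
sd_m = statistics.stdev(heights_m)
sd_cm = statistics.stdev(heights_cm)

# ...but the coefficient of variation stays the same.
cv_m = coefficient_of_variation(heights_m)
cv_cm = coefficient_of_variation(heights_cm)

print(sd_m, sd_cm)
print(cv_m, cv_cm)
```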

### Interquartile Range

One of the important measures of descriptive statistics is the interquartile range.

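In this method, the ordered data is split into four quarters, so that ¼ of the data lies in each of the 1st, 2nd, 3rd, and 4th quarters. The number that divides the 1st and 2nd quarters is the first quartile (Q1), and the number that divides the 3rd and 4th quarters is the third quartile (Q3). The number that lies between the 2nd and 3rd quarters is also called the median. The interquartile range is Q3 – Q1, which is the spread of the middle half of the data.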
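A minimal sketch of the quartiles and the interquartile range (the third quartile minus the first), using `statistics.quantiles` on the ten values from the earlier examples. Note that several conventions exist for interpolating quartiles, so other tools may report slightly different cut points; this sketch simply uses the function's default method.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]

# With n=4, quantiles() returns the three cut points between the four quarters:
# Q1, Q2 (the median), and Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)

# The interquartile range is the distance between the third and first quartiles.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```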

### The Normal Curve

You can plot the values of a variable as a frequency curve. For many variables, it is a bell-shaped curve that contains most values in the middle and some values at the extremes.

For example, take height. Most people have a height between 5 and 6 feet, while some have a height below 5 feet and some above 6 feet.

In a normal curve, the mean, mode, and median all coincide at the center of the curve.

## What is Inferential Statistics?

Inferential statistics are used when you have to draw conclusions about a population from sample data. For example, if you have to judge who was a better president, A or B, then you can’t talk to the whole population. In this case, you’d use sample data to judge the situation.

There are two major types of inferential statistics.

#### 1.Hypothesis Tests

Hypothesis tests are used to answer research questions using the collected sample data. For example, testing whether eating breakfast leads to a more productive day for both children and adults.

#### 2.Estimating Parameters

Parameters are the statistical features of a population, such as its mean. Estimating parameters means using sample statistics, such as the sample mean, to estimate the corresponding population values.

## Difference between descriptive & inferential statistics

Both inferential and descriptive statistics utilize the same type of data. Inferential statistics use this data to predict results for a larger group, whereas descriptive statistics use it to summarize the sample itself.

## Inferential Statistics Types

### Z Statistics

Z statistics are based on the Z score, which is used to make inferences or predictions about the population.
A Z score, also known as a standard score, gives the number of standard deviations by which a data point lies above or below the mean. For normally distributed data, it usually falls between -3 and +3.

A positive score means the data point is above the mean, and a negative score means the data point is below the mean.

#### Why do we use the Z score in inferential statistics?

The Z score helps in finding the relative position of a score. For instance, whether a score lies in the top 10% or not. Knowing the relative position of a value in the entire dataset helps in finding various details about the population.

For instance, you can compare two values, even when they come from different data sets.

Example of Z score:

If Jack and John took two different exams but scored the same, how can you compare their performance?

#### Through Z Score

John took an English test and scored 50%, and Jack took a History test and scored the same marks. You can use the Z score to find out how each of them performed relative to the population.

#### The formula of Z Score

Z Score = (Datapoint – Mean)/ Standard Deviation

However, it is necessary to note that the Z score is generally used when the sample has more than 30 people or values and the standard deviation of the population is known. If the data set is smaller than that, you should use the T score of inferential statistics instead. We have explained the T score in the following sections.

Let’s consider the above example of Jack and John to understand how the Z score helps in calculating the performance of the two students.

For the English test, the standard deviation was 10, and the mean was 40.

For the History test, the standard deviation was 10, and the mean was 60.

Z score for Jack (History) is (50-60)/10 = –1

Z score for John (English) is (50-40)/10 = 1

This shows that John scored one standard deviation above his class average, but Jack scored one standard deviation below his.
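The Jack and John comparison can be reproduced with a small helper. The function name `z_score` is an assumption made for this sketch; the numbers are the ones given in the example.

```python
def z_score(data_point, mean, std_dev):
    """Number of standard deviations a data point lies from the mean."""
    return (data_point - mean) / std_dev

# Both students scored 50, but on tests with different class means.
john_z = z_score(50, mean=40, std_dev=10)  # English test
jack_z = z_score(50, mean=60, std_dev=10)  # History test

print(john_z, jack_z)  # 1.0 -1.0
```

Because both scores are expressed in standard deviations from their own class mean, they can be compared directly even though the tests differ.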

### Hypothesis Testing

#### What is hypothesis testing?

In simple words, a hypothesis is a guess about your surroundings or any event that can be checked with data. Hypothesis testing is used to test whether a study’s results are valid or not. This is done using a random sample, which allows the data scientist to analyze whether the results were achieved by chance or are repeatable.

For instance, if you need to find out who is a better president, A or B, you take the hypothesis that B is better than A. Based on the sample data, we then either find support for this hypothesis or fail to.

#### Steps for Hypothesis Testing

(i) First, define the null hypothesis. This is the default, widely accepted position, usually that there is no effect or no difference.
(ii) Then, define an alternative hypothesis. This is the claim that the test looks for evidence to support.
(iii) Now, define the significance level (a), which is usually 0.05, 0.02, or 0.01 depending upon the test.
(iv) Select the test statistic that best suits the situation, such as the T score or Z score, compute it from the sample, and convert it to a (p) value.
(v) Compare the (a) and (p) values: if (p) is less than or equal to (a), reject the null hypothesis; otherwise, fail to reject it.

It helps to write the decision rule as an if-else statement to make the task easier.
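The steps above can be sketched as a two-sided, one-sample Z test. This is only an illustration under stated assumptions: the population standard deviation is taken as known, the numbers are hypothetical, and the function name `one_sample_z_test` is invented for this sketch. `NormalDist` comes from Python's standard `statistics` module.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, population_mean, population_sd, n, alpha=0.05):
    """Two-sided one-sample Z test. Returns (z, p_value, reject_null)."""
    # Test statistic: distance of the sample mean from the hypothesised mean,
    # measured in standard errors.
    z = (sample_mean - population_mean) / (population_sd / math.sqrt(n))
    # p-value: probability of a result at least this extreme if the null is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value <= alpha

# Hypothetical study: null hypothesis says the population mean is 50.
z, p_value, reject = one_sample_z_test(
    sample_mean=52, population_mean=50, population_sd=10, n=100, alpha=0.05
)

# The if-else decision rule.
if reject:
    print(f"z = {z:.2f}, p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"z = {z:.2f}, p = {p_value:.4f}: fail to reject the null hypothesis")
```

With these hypothetical numbers the p-value falls just below 0.05, so the null hypothesis is rejected at that significance level but would not be at 0.01.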

#### Null Hypothesis

The null hypothesis is the default position, usually a statement that is widely accepted or that asserts no effect. For instance, presidents A and B are equally good, or the independent and dependent variables have no relationship. In your hypothesis test for inferential statistics, you will either reject this statement or fail to reject it.

#### Alternative Hypothesis

An alternative hypothesis is the claim we are trying to find evidence for. While the null hypothesis is stated as an equality (=), the alternative hypothesis is stated as less than (<), not equal to (≠), or greater than (>).

But, it should be remembered that the alternative hypothesis is mutually exclusive with the null hypothesis – always.

#### Confidence Level

The confidence level is the percentage of times a result would hold if you repeated the study with many random samples. For example, with a 95% confidence level, which is highly common, about 95 out of 100 such samples would produce an interval that contains the true population value.

#### Significance Level

The significance level is the probability of rejecting the null hypothesis when the null hypothesis is in fact true.

A = 1 – C

Here, C is the confidence level, and A is the significance level, both expressed as proportions. For example, a 95% confidence level (C = 0.95) corresponds to a significance level of A = 0.05.

#### Rejection or Acceptance of the Given Null Hypothesis

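The p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. Hence, to decide about a null hypothesis, we compare the (a) and (p) values: if (p) is less than or equal to (a), we reject the null hypothesis; otherwise, we fail to reject it.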

### T Statistics

T statistics, also known as Student’s T, are similar to Z statistics. The difference is that T statistics use the standard deviation of the sample, whereas Z statistics use the standard deviation of the population.

Generally, T statistics are used when you have fewer than 30 sample units, or when the standard deviation of the population is not known. If the sample is larger than 30, the T distribution becomes nearly the same as the Z distribution.

Here, the degree of freedom holds high importance. It is the number of independent values in the data set that are free to vary.

Degree of freedom (df) = n – 1, where n is the number of values in the sample

Note: With T statistics, it is harder to reject the null hypothesis than with Z statistics, because the T distribution has heavier tails, especially for small samples.

#### The Formula of T Score

T score = (x – u) / (S / √n)

Here, x is the mean of the sample, u is the mean of the population, S is the sample’s standard deviation, and n is the sample size. The degree of freedom, df = n – 1, determines which T distribution the score is compared against.
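A minimal sketch of the T score in its standard form, in which the sample standard deviation is divided by the square root of the sample size n and the degree of freedom is n – 1. The function name `t_score` and the hypothesised population mean of 5 are assumptions made for illustration; the sample reuses the ten values from the descriptive statistics examples.

```python
import math
import statistics

def t_score(sample, population_mean):
    """Return (t, df) for a one-sample T score."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)  # divides by (n - 1)
    t = (sample_mean - population_mean) / (sample_sd / math.sqrt(n))
    return t, n - 1  # the degree of freedom is n - 1

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]
t, df = t_score(sample, population_mean=5)  # 5 is a hypothetical population mean

print(f"t = {t:.2f}, df = {df}")
```

The resulting score would then be compared against a T distribution with the returned degree of freedom, for example by looking it up in a T table.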

## Central Limit Theorem

Z scores are commonly interpreted against the normal distribution, with each score expressed in units of standard deviation.

The central limit theorem is important for the normal distribution of inferential statistics. This theorem says that as you keep increasing the sample size, the distribution of the sample mean approaches a normal distribution. This holds regardless of the shape of the population distribution.

For instance, the number of Facebook posts per person has a skewed distribution: most people post rarely, and a few post very often. But if you repeatedly draw samples of 200 people and record the mean number of posts for each sample, those sample means form a bell-shaped curve, and the curve gets even closer to a normal distribution if each sample has 2000 people.

Some properties of the central limit theorem of inferential statistics are:

(i) The mean of the sampling distribution will be nearly equal to the population mean.
(ii) If you divide the standard deviation of the population by the square root of the sample size, you will get nearly the standard deviation of the sampling distribution, which is the standard error.
(iii) Even if the population distribution is bimodal or skewed, the sampling distribution of the mean will be approximately normal with a large sample. (We have already covered this in the above example).
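These properties can be illustrated with a small simulation. The sketch below assumes a skewed population (an exponential distribution with mean 1 and standard deviation 1), draws many samples from it, and looks at the distribution of the sample means. The sample size, the number of samples, and the seed are arbitrary choices made for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

sample_size = 50     # values in each sample
num_samples = 2000   # number of repeated samples

# Draw repeated samples from a skewed population and record each sample mean.
sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(sample_size)]
    sample_means.append(statistics.mean(sample))

# Property (i): the mean of the sample means is close to the population mean (1.0).
mean_of_means = statistics.mean(sample_means)

# Property (ii): their standard deviation is close to sigma / sqrt(n).
sd_of_means = statistics.stdev(sample_means)
expected_se = 1.0 / math.sqrt(sample_size)

print(f"mean of sample means = {mean_of_means:.3f}")
print(f"SD of sample means = {sd_of_means:.3f}, sigma/sqrt(n) = {expected_se:.3f}")
```

Plotting a histogram of `sample_means` would show the bell shape described in property (iii), even though the underlying population is strongly skewed.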

#### Confidence Interval

In inferential statistics, we use a sample mean to estimate the mean of the population. But it is hard to know how accurately the sample reflects the population. Hence, we use the confidence interval for this.

A confidence interval gives a range of values that is likely to contain the population parameter.

(i) When using a one-sided interval with a 95% confidence level, we leave 5% in either the right or the left tail of the given distribution.
(ii) When using a two-sided interval, we leave 2.5% in each of the right and left tails.
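A two-sided confidence interval for a mean can be sketched as the sample mean plus or minus a critical value times the standard error. The sketch below assumes a normal approximation (a Z critical value); for small samples a T critical value would usually be more appropriate. The function name `confidence_interval` is invented for this illustration, and the data reuses the ten values from the earlier examples.

```python
import math
import statistics
from statistics import NormalDist

def confidence_interval(sample, confidence=0.95):
    """Two-sided normal-approximation confidence interval for the mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # For 95% confidence, 2.5% is left in each tail, so the critical value
    # is the 97.5th percentile of the standard normal distribution.
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_critical * standard_error
    return mean - margin, mean + margin

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]
low, high = confidence_interval(sample, confidence=0.95)

print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
```

Raising the confidence level widens the interval, because a larger critical value is needed to leave less probability in the tails.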

#### Some of the inferential statistic examples:

(i) Making a prediction about the whole population just by using a random sample.
(ii) Understanding how a random sample differs from the whole population.
(iii) Understanding the impact of a feature on the hypothesis or result.

Thus, inferential statistics are important for judging a feature of the entire population without actually collecting opinions or data from the entire population. You can include additional factors and elements in a hypothesis and still receive a valuable result. This is why inferential statistics is considered one of the most important disciplines of statistics.

```Reference:
Kothari, C. R., & Garg, G. (n.d.). Research Methodology: Methods & Techniques (Second). New Age International Publishers.
Tony, A. (2019, December 11). A-Z of Inferential Statistics. Digital Vidya. https://www.digitalvidya.com/blog/inferential-statistics/
```