**AIOU Solved Assignments Code B.Ed 8614 Spring 2020 Assignment 2** **Course: Introduction to Educational Statistics (8614)** Spring 2020. AIOU past papers

**ASSIGNMENT No: 2**

**Introduction to Educational Statistics (8614) B.Ed 1.5 Years**

**Spring, 2020**

## AIOU Solved Assignment 2 Code 8614 Spring 2020

**Q1. What is correlation? How level of measurement help us in selecting correct types of correlation? Write comprehensive note on range of correlation coefficient and what does it explain? Can we predict future correlation by current relationship? If yes (20)**


Correlation has many uses and definitions. As Carol Alexander (2001) observes, correlation may only be meaningfully computed for stationary processes. Covariance stationarity for a time series y_{t} is defined by three conditions:

- Constant, finite mean
- Constant, finite variance
- Covariance(y_{t}, y_{t-s}) depends only on the lag s

For financial data, this implies that correlation is only meaningful for variates such as rates of return or normally transformed variates, z, such that:

z = (x – m)/s

where x is the non-stationary variate, m is the mean of x, and s is its standard deviation. For non-stationary variates like prices, correlation is not usually meaningful.

A more coherent measure of relatedness is cointegration. Cointegration uses a two-step process:

- Long-term equilibrium relationships are established
- A dynamic correlation of returns is estimated

Cointegration will not be discussed in these ERM sessions; however, it is very important in developing dynamic hedges that seek to keep stationary tracking error within preset bounds. Hedging using correlation measures typically is not able to achieve such control.

However, instantaneous and terminal measures of correlation are used in various applications such as developing stochastic interest rate generators.

## Definitions of Correlation

### Pearson’s correlation formula

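Linear relationships between two variables, X and Y, can be quantified using the Pearson Product-Moment Correlation Coefficient, or r:

$$r = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}}$$

The value of this statistic is always between -1 and 1, and if X and Y are unrelated it will equal zero.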

(source: http://maigret.psy.ohio-state.edu/~trish/Teaching/Intro_Stats/Lecture_Notes/chapter5/node5.html)
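As a minimal sketch of the Pearson coefficient described above, the function below computes r from deviations about the two means. The name `pearson_r` and the `hours`/`marks` values are hypothetical and serve only as an illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the two means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied and marks obtained by five students
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 70]
print(round(pearson_r(hours, marks), 3))
```

A value close to +1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.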

**Spearman’s Correlation Method**

A nonparametric (distribution-free) rank statistic proposed by Spearman in 1904 as a measure of the strength of the association between two variables (Lehmann and D’Abrera 1998). The Spearman rank correlation coefficient can be used to give an *R*-estimate, and is a measure of monotone association that is used when the distribution of the data makes Pearson’s correlation coefficient undesirable or misleading.

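The Spearman rank correlation coefficient is defined by

$$r' = 1 - \frac{6\sum d^2}{N(N^2 - 1)} \qquad (1)$$

where *d* is the difference in statistical rank of corresponding variables, and is an approximation to the exact correlation coefficient

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} \qquad (2)$$

computed from the original data (with x and y measured as deviations from their means). Because it uses ranks, the Spearman rank correlation coefficient is much easier to compute.

The variance, kurtosis, and higher-order moments are

$$\sigma^2 = \frac{1}{N - 1} \qquad (3)$$

$$\gamma_2 = -\frac{114}{25N} - \frac{6}{5N^2} - \cdots \qquad (4)$$

$$\gamma_3 = \gamma_5 = \cdots = 0 \qquad (5)$$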

Student was the first to obtain the variance.

(source: http://mathworld.wolfram.com/SpearmanRankCorrelationCoefficient.html)

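**The Simple Formula for r_{s}, for Rankings without Ties**

| Wine | X | Y | D | D^{2} |
|---|---|---|---|---|
| a | 1 | 2 | -1 | 1 |
| b | 2 | 1 | 1 | 1 |
| c | 3 | 5 | -2 | 4 |
| d | 4 | 3 | 1 | 1 |
| e | 5 | 4 | 1 | 1 |
| f | 6 | 7 | -1 | 1 |
| g | 7 | 8 | -1 | 1 |
| h | 8 | 6 | 2 | 4 |

N = 8, ΣD^{2} = 14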

Here is the same table you saw above, except now we also take the difference between each pair of ranks (**D** = X − Y), and then the square of each difference. All that is required for the calculation of the Spearman coefficient are the values of N and Σ**D**^{2}, according to the formula

$$r_s = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$$

(source: http://faculty.vassar.edu/lowry/ch3b.html)

There is no generally accepted method for computing the standard error for small samples.
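The rank-difference formula can be checked against the wine rankings with a short sketch. The function name `spearman_rs` is hypothetical, and the formula applies only when there are no tied ranks.

```python
def spearman_rs(rank_x, rank_y):
    """Spearman's r_s for rankings without ties: 1 - 6*sum(D^2) / (N*(N^2 - 1))."""
    n = len(rank_x)
    # Sum of squared differences between paired ranks
    sum_d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks X and Y of wines a to h
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 5, 3, 4, 7, 8, 6]
print(round(spearman_rs(x, y), 4))  # 0.8333
```

With N = 8 and ΣD^{2} = 14 the calculation is 1 − 84/504, which is about 0.83 and indicates a strong positive agreement between the two rankings.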

**Kendall’s Tau Coefficient**

Spearman’s r treats ranks as scores and computes the correlation between two sets of ranks. Kendall’s tau is based on the number of inversions in rankings.

Although there is evidence that Kendall’s Tau holds up better than Pearson’s *r* to extreme nonnormality in the data, that seems to be true only at quite extreme levels.

Let inv be the number of inversions, i.e. reversals of pair-wise rank orders among the n ranked objects. Equal rankings (ties) need an adjustment.

$$\tau = 1 - \frac{2\,\mathrm{inv}}{\text{number of pairs of objects}} = 1 - \frac{2\,\mathrm{inv}}{n(n-1)/2} = 1 - \frac{4\,\mathrm{inv}}{n(n-1)}$$

(source: http://www.psych.yorku.ca/dand/tsp/general/corrstats.pdf)
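The inversion-count formula for tau can be sketched as follows. The function name `kendall_tau` and the two illustrative rankings are assumptions, and no adjustment for ties is included.

```python
from itertools import combinations

def kendall_tau(rank_x, rank_y):
    """Kendall's tau for rankings without ties, using the inversion count."""
    n = len(rank_x)
    # Put the second ranking in the order given by the first ranking
    y_ordered = [y for _, y in sorted(zip(rank_x, rank_y))]
    # An inversion is any pair that appears in reversed order
    inv = sum(1 for a, b in combinations(y_ordered, 2) if a > b)
    return 1 - 4 * inv / (n * (n - 1))

# Illustrative rankings of eight objects by two judges
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 5, 3, 4, 7, 8, 6]
print(round(kendall_tau(x, y), 4))  # 0.6429
```

These rankings contain 5 inversions out of 28 possible pairs, so tau = 1 − 10/28. Tau is usually smaller in absolute value than Spearman's coefficient for the same data because it counts reversals rather than squared rank differences.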

## Relationship Between Correlation and Volatility


In *Volatility and Correlation in Option Pricing* (1999), in the context of two imperfectly correlated variables, Ricardo Rebonato states:

“Under these assumptions we can now run two simulations, one with a constant …identical volatility for both variables and with imperfect correlation, and the other with different instantaneous volatilities …but perfect correlation. One can then evaluate correlation, calculated along the path, between the changes in the log of the two variables in the two cases. …As is apparent from the two figures, the same sample correlation can be obtained despite the fact that the two de-correlation-generating mechanisms are very different.”


**Q2. Explain the following terms with examples:**

**Degree of freedom**

Degrees of freedom are the number of values in a study that have the freedom to vary. For example, if five scores must have a mean of 10, four of them can take any value, but the fifth is then fixed, so there are 4 degrees of freedom. Degrees of freedom are commonly discussed in relation to various forms of hypothesis testing in statistics, such as the chi-square test. It is essential to calculate degrees of freedom when trying to understand the significance of a chi-square statistic and the validity of the null hypothesis.

## Chi Square Tests

There are two different kinds of chi square tests: the test of independence, which asks a question of relationship, such as, “Is there a relationship between gender and SAT scores?”; and the goodness-of-fit test, which asks something like “If a coin is tossed 100 times, will it come up heads 50 times and tails 50 times?” For these tests, degrees of freedom are utilized to determine if a certain null hypothesis can be rejected based on the total number of variables and samples within the experiment. For example, when considering students and course choice, a sample size of 30 or 40 students is likely not large enough to generate significant data. Getting the same or similar results from a study using a sample size of 400 or 500 students is more valid.
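To make the role of degrees of freedom concrete, here is a minimal sketch for the coin-toss example. The observed counts (58 heads, 42 tails) are hypothetical, and the function names are assumptions. The degrees of freedom are the number of categories minus one for a goodness-of-fit test, and (rows − 1) × (columns − 1) for a test of independence.

```python
def gof_chi_square(observed, expected):
    """Return (chi-square statistic, degrees of freedom) for a goodness-of-fit test."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Once the total is known, the last category count is no longer free to vary
    df = len(observed) - 1
    return stat, df

def independence_df(n_rows, n_cols):
    """Degrees of freedom for a chi-square test of independence."""
    return (n_rows - 1) * (n_cols - 1)

# Hypothetical result: 100 tosses giving 58 heads and 42 tails
stat, df = gof_chi_square([58, 42], [50, 50])
print(stat, df)  # 2.56 1
```

With 1 degree of freedom the critical value at the .05 level is 3.841, so a statistic of 2.56 would not lead us to reject the null hypothesis that the coin is fair.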

**Dispersion of Data.**

The dispersion of a data set is the amount of variability seen in that data set. This lesson will review the three most common measures of dispersion, defining and giving examples of each.

## Dispersion Gives Information

Pretend that you want to sell your house. You narrow your search to two companies: SCT Housing and WCT Housing. Both companies advertise that sellers receive, on average, 90% of their asking price. Does it matter which company you choose?

The real question is, does the **mean** (average), describe the data accurately enough to make an informed decision? No, it doesn’t. The mean is not a reliable predictor; it only describes the data set as a whole and doesn’t tell what’s happening within the set.

Let’s add some data to the example to illustrate the point. Assume the following shows the percent of asking price received on the previous nine sales for each company:

- SCT: 88, 92, 91, 89, 89, 91, 91, 89, and 90
- WCT: 71, 100, 100, 83, 85, 100, 95, 86, and 90

How can you make an informed decision about which company will offer you the greatest benefit for the least risk?

You must analyze the **dispersion**, the amount of variation within a data set, of each set. Only then will you be able to truly compare these two companies.

## Measures of Dispersion

Data sets with **strong central tendencies** are sets in which items are tightly grouped around the mean. **Weak central tendency** in data indicates that individual items are not grouped with any significance, which makes predictions based on this data less reliable than those based on data sets with strong central tendencies.

For example, if your mail is always delivered between 8:02 a.m. and 8:08 a.m., you can reliably predict when the mail will come. However, if your mail delivery can range from 8:00 a.m. to 5:30 p.m., you are no longer able to pinpoint a delivery time and planning for special deliveries becomes much more tricky.

In this lesson, we will review three measures of dispersion: **range** (the distance between the lowest and highest values in the data set), interquartile range and standard deviation.

Let’s explore these measures of dispersion by applying them to our opening scenario.

### Range

To find the range of any data set, you need to first put the values in order from lowest to highest. Then you simply subtract the lowest from the highest.

Before continuing, go back and find the range of each of the data sets above.

So:

- SCT Housing: 88, 89, 89, 89, 90, 91, 91, 91, 92. Range = 92 − 88 = 4.
- WCT Housing: 71, 83, 85, 86, 90, 95, 100, 100, 100. Range = 100 − 71 = 29.

A small range indicates a strong central tendency. Remember that a strong central tendency tells us that all the data is grouped tightly around the mean.

The range identifies how varied a data set is but does not account for **outliers**, pieces of data that fall far outside the remainder of the data set (like the 71 in the WCT set). Outliers can skew measures of central tendency artificially.

### Interquartile Range

To account for possible outliers use the **interquartile range (IQR)**. This is a measure of the range within only the middle 50% of the data set.

To find the IQR, you separate the data set into **quartiles** (four equal parts) by first putting the data set into numerical order (as we did with the range). Then find the **median** (middle) of the set. This is identified as **Q2**, the second quartile. After finding the median of the whole set, identify the median of each half of the set.

The median of the lower half is **Q1** and the median of the upper half is **Q3**. These two points mark the bottom and top of the middle 50% of the data set. The IQR is the difference between Q3 and Q1 (Q3 − Q1).

See if you can identify the IQR of each set before moving on.

A box plot can help you to visualize the IQR.

The lower the IQR, the stronger the central tendency of the data set. Since we are still just dealing with ranges, there are still unknowns in the data. To confidently compare these two data sets we will need a stronger measure.
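The three measures named in this lesson can be computed for both companies with a short sketch. It assumes the median-of-halves method described above for Q1 and Q3, uses the sample standard deviation as the stronger measure, and takes the nine WCT values from the sorted list in the range example. The function name `median_of_halves_iqr` is hypothetical.

```python
import statistics

def median_of_halves_iqr(values):
    """Return (Q1, Q3, IQR), with Q1 and Q3 as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the median
    upper = data[-half:]   # values above the median (middle value skipped when the count is odd)
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return q1, q3, q3 - q1

sct = [88, 92, 91, 89, 89, 91, 91, 89, 90]
wct = [71, 83, 85, 86, 90, 95, 100, 100, 100]

for name, data in (("SCT", sct), ("WCT", wct)):
    q1, q3, iqr = median_of_halves_iqr(data)
    print(name, "range:", max(data) - min(data), "IQR:", iqr,
          "sample SD:", round(statistics.stdev(data), 2))
```

Both companies have a mean of 90, but SCT has an IQR of 2 and a standard deviation of about 1.32, while WCT has an IQR of 16 and a standard deviation of about 9.85. SCT's results are therefore far more predictable.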

**Sample size**

In statistics and quantitative research methodology, a data sample is a set of data collected and/or selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units or observations.

Typically, the population is very large, making a census or a complete enumeration of all the values in the population either impractical or impossible. The sample usually represents a subset of manageable size. Samples are collected and statistics are calculated from the samples, so that one can make inferences or extrapolations from the sample to the population.

The data sample may be drawn from a population without replacement (i.e. no element can be selected more than once in the same sample), in which case it is a subset of a population; or with replacement (i.e. an element may appear multiple times in the one sample), in which case it is a multisubset.

In statistics, the term population is used to describe the subjects of a particular study—everything or everyone who is the subject of a statistical observation. Populations can be large or small in size and defined by any number of characteristics, though these groups are typically defined specifically rather than vaguely—for instance, a population of women over 18 who buy coffee at Starbucks rather than a population of women over 18.

Statistical populations are used to observe behaviors, trends, and patterns in the way individuals in a defined group interact with the world around them, allowing statisticians to draw conclusions about the characteristics of the subjects of study. These subjects are most often humans, animals, and plants, but they can also be objects such as stars.

**Importance of Populations**

The Australian Government Bureau of Statistics notes:

It is important to understand the target population being studied, so you can understand who or what the data are referring to. If you have not clearly defined who or what you want in your population, you may end up with data that are not useful to you.

There are, of course, certain limitations with studying populations, mostly in that it is rare to be able to observe all of the individuals in any given group. For this reason, scientists who use statistics also study subpopulations and take statistical samples of small portions of larger populations to more accurately analyze the full spectrum of behaviors and characteristics of the population at large.

A statistical population is any group of individuals who are the subject of a study, meaning that almost anything can make up a population so long as the individuals can be grouped together by a common feature, or sometimes two common features. For example, in a study that is trying to determine the mean weight of all 20-year-old males in the United States, the population would be all 20-year-old males in the United States.

Another example would be a study that investigates how many people live in Argentina wherein the population would be every person living in Argentina, regardless of citizenship, age, or gender. By contrast, the population in a separate study that asked how many men under 25 lived in Argentina might be all men who are 24 and under who live in Argentina regardless of citizenship.

Statistical populations can be as vague or specific as the statistician desires; it ultimately depends on the goal of the research being conducted. A cow farmer wouldn’t want to know the statistics on how many red female cows he owns; instead, he would want to know the data on how many female cows he has that are still able to produce calves. That farmer would want to select the latter as his population of study.

**Limited Resources**

Although the total population is what scientists wish to study, it is very rare to be able to perform a census of every individual member of the population. Due to constraints of resources, time, and accessibility, it is nearly impossible to perform a measurement on every subject. As a result, many statisticians, social scientists and others use inferential statistics, where scientists are able to study only a small portion of the population and still observe tangible results. Rather than performing measurements on every member of the population, scientists consider a subset of this population called a statistical sample. These samples provide measurements of the individuals that tell scientists about corresponding measurements in the population, which can then be repeated and compared with different statistical samples to more accurately describe the whole population.

**T-Test**

T-tests are used to compare two means to assess whether they are from the same population. T-tests presume that both groups are normally distributed and have relatively equal variances. The t-statistic is distributed on a curve that is based on the number of degrees of freedom (df). There are three kinds of t-tests: independent-samples, paired-samples, and one-sample.

## Overview Of T-Test

It’s Battle of the Sexes, Round 6172. DING! We want to compare the guys and the gals on one last question before declaring a winner. So we tested 100 guys, and they scored an average of 80% with a standard deviation of 3%. The 100 gals, however, scored an average of 81% with a standard deviation of 4%. Can the gals declare victory and call it a day?

Not so fast!

Note that, although the gals have a higher score than the guys, they have a wider range of scores, too. Assuming a normal distribution, note that nearly all of the guys have scores higher than 74 (2 standard deviations below their mean — recall that, in a normal distribution, more than 95% of the scores are within two standard deviations of the mean), while some of the gals have scores below 74 (as 2 standard deviations below the gals’ mean is 73!).

Since, barring any outliers, the gals have both lower and higher scores than the guys, we are going to have to see if the gals have done better in light of both their averages *and* their standard deviations. To do this, we use a t-test.

Essentially, a **t-test** is used to compare two samples to determine if they came from the same population. Whenever we draw a sample from the population, we can reasonably expect that the sample mean will deviate from the population mean a little bit. So, if we were to take a sample of guys, and a sample of gals, we would not expect them to have exactly the same mean and standard deviation.

The question is, are their means so different that we are willing to call it unlikely that they are indeed from the same population? After all, despite the so-called Martian and Venusian origins and multiple jests to the contrary, both guys and gals are human. Might the two groups just be slight variations on the human population, or are guys and gals really different on this measure?

Thus, our *null hypothesis* states that the guys and gals are from the same population, and thus their means are equal. The alternative hypothesis states that the guys and gals are not from the same population, and thus their means are not equal. Since we will reject the null hypothesis if a group did either better or worse, we make a *two-tailed* test and split our minimum probability for rejecting the null hypothesis (alpha) into two rejection regions of .025 instead of one rejection region of .05.

To determine how probable results like these would be if the null hypothesis were true, let’s compute the t-statistic:

## Formula for an Independent Samples T-test

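$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}}$$

This is the formula for an independent samples t-test, where the *X’s* (‘X-bar-one’ and ‘X-bar-two’) are the means of the two *independent* samples (hence this is called an independent-samples t-test), the *s* represents the standard deviation for each group, and the *N’s* each represent the respective sample sizes of each group.

In the case of the guys vs. gals battle example, taking the gals as the first sample, we have this:

$$t = \frac{81 - 80}{\sqrt{\dfrac{4^2}{100} + \dfrac{3^2}{100}}} = \frac{1}{\sqrt{0.25}} = 2$$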

**Evaluating t**

We can evaluate the t-statistic on the t-distribution, which varies in shape slightly depending upon the number of degrees of freedom (*df*). The formula for degrees of freedom in an independent samples t-test is:

*df* = *N*_{1} + *N*_{2} − 2

We subtract 2 because each of the two means we computed costs us one degree of freedom.

We use the area under the curve of the t-distribution to determine the probability of obtaining a value for t that is higher/lower than the one calculated (in this case, we want the probability of obtaining *t* >= 2, on a t-distribution based on 198 (100+100-2) degrees of freedom). To do this, we will need a table of t-distributions.
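In place of a printed table, the area under the t-distribution can be approximated numerically. The sketch below is one way to do this with the Python standard library only. The function names are hypothetical, and the p-value comes from Simpson's rule applied to the t density, so it is an approximation rather than a table value.

```python
import math

def independent_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for two independent samples, from their summary statistics."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error, n1 + n2 - 2

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t

# Gals: mean 81, SD 4, N 100. Guys: mean 80, SD 3, N 100.
t, df = independent_t(81, 4, 100, 80, 3, 100)
p = two_tailed_p(t, df)
print(round(t, 2), df, p < 0.05)  # 2.0 198 True
```

The two-tailed probability comes out just under .05, so with alpha set at .05 the null hypothesis of equal means would be rejected, though only narrowly.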


**Q3. What is data cleaning? Write down its importance and benefits. How to ensure it before analysis of data?**

Data cleansing is the process of altering data in a given storage resource to make sure that it is accurate and correct. There are many ways to pursue data cleansing in various software and data storage architectures; most of them center on the careful review of data sets and the protocols associated with any particular data storage technology.

Data cleansing is also known as data cleaning or data scrubbing.

## Techopedia explains Data Cleansing

Data cleansing is sometimes compared to data purging, where old or useless data will be deleted from a data set. Although data cleansing can involve deleting old, incomplete or duplicated data, data cleansing is different from data purging in that data purging usually focuses on clearing space for new data, whereas data cleansing focuses on maximizing the accuracy of data in a system. A data cleansing method may use parsing or other methods to get rid of syntax errors, typographical errors or fragments of records. Careful analysis of a data set can show how merging multiple sets led to duplication, in which case data cleansing may be used to fix the problem.

Many issues involving data cleansing are similar to problems that archivists, database admin staff and others face around processes like data maintenance, targeted data mining and the extract, transform, load (ETL) methodology, where old data is reloaded into a new data set. These issues often concern the syntax and specific use of commands to carry out related tasks in database and server technologies like SQL or Oracle. Database administration is a highly important role in many businesses and organizations that rely on large data sets and accurate records for commerce or any other initiative. Data cleansing is hard to do and hard to maintain, and it is hard to know where to start. There always seem to be errors, duplicates, or format inconsistencies.

A simple, five-step data cleansing process can help you target the areas where your data is weak and needs more attention. From the first planning stage up to the last step of monitoring your cleansed data, the process will help your team home in on duplicates and other problems within your data. What’s important to remember about the five-step process is that it’s a continuous circle. So you can start small and make incremental changes, repeating the process several times to continue improving data quality.

**Plan**

First off, you want to identify the set of data that is critical for making your marketing efforts the best they can possibly be. When looking at data you should focus on high priority data, and start small. The fields you will want to identify will be unique to your business and what information you are specifically looking for, but it may include: job title, role, email address, phone, industry, revenue, etc.

It would be beneficial to create and put into place specific validation rules at this point to standardize and cleanse the existing data as well as automate this process for the future. For example, make sure your postal codes and state codes agree and that the addresses are all standardized the same way. Seek out your IT team members for help with setting these up! They are more help than just deleting a virus!

**Analyze to Cleanse**

After you have an idea of the priority data your company desires, it’s important to go through the data you already have in order to see what is missing, what can be thrown out, and what, if any, are gaps between them.

You will also need to identify a set of resources to handle and manually cleanse exceptions to your rules. The amount of manual intervention depends directly on the level of data quality you consider acceptable. Once you build out a list of rules or standards, it’ll be much easier to actually begin cleansing.

**Implement Automation**

Once you’ve begun to cleanse, you should begin to standardize and cleanse the flow of new data as it enters the system by creating scripts or workflows. These can be run in real-time or in batch (daily, weekly, monthly) depending on how much data you’re working with. These routines can be applied to new data, or to previously keyed-in data.
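A cleansing script of the kind described above might look like the following sketch. The field names (`name`, `email`, `state`), the loose email pattern, and the sample records are all hypothetical. Real validation rules would depend on the data set being cleaned.

```python
import re

# Deliberately loose check: something@something.something, with no spaces
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def clean_contacts(records):
    """Standardize, validate and de-duplicate contact records.

    Returns (clean, exceptions): rows that pass the rules and rows needing manual review.
    """
    clean, exceptions, seen = [], [], set()
    for record in records:
        # Standardize: trim whitespace, lower-case emails, upper-case state codes
        row = {key: str(value).strip() for key, value in record.items()}
        row["email"] = row.get("email", "").lower()
        row["state"] = row.get("state", "").upper()
        if not EMAIL_PATTERN.match(row["email"]):
            exceptions.append(row)   # cannot be corrected automatically
        elif row["email"] in seen:
            continue                 # duplicate of a record already kept
        else:
            seen.add(row["email"])
            clean.append(row)
    return clean, exceptions

raw = [
    {"name": " Ana Ruiz ", "email": "Ana.Ruiz@Example.com ", "state": "ca"},
    {"name": "Ana Ruiz", "email": "ana.ruiz@example.com", "state": "CA"},
    {"name": "Bo Chen", "email": "bo.chen(at)example.com", "state": "ny"},
]
clean, exceptions = clean_contacts(raw)
print(len(clean), len(exceptions))  # 1 1
```

The same function can be run on newly entered records or in a scheduled batch over existing ones. Rows returned in `exceptions` are the ones that go to manual review or to the append step.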

**Append Missing Data**

Step four is important especially for records that cannot be automatically corrected. Examples of this are emails, phone numbers, industry, company size, etc. It’s important to identify the correct way of getting a hold of the missing data, whether it’s from 3rd party append sites, reaching out to the contacts or just via good old-fashioned Google.

**Monitor**

You will want to set up a periodic review so that you can monitor issues before they become a major problem. You should be monitoring your database as a whole as well as in individual units (contacts, accounts, etc.). You should also be aware of bounce rates, and keep track of bounced emails as well as response rates. It’s important to keep up-to-date with who is working at the company; so if a customer does not reply to any campaign in more than 6 months, it’s a good idea to dig a little deeper and make sure that that person still holds that position, is still at that company, or quite frankly, depending on how well you’ve maintained the database, hasn’t already kicked the bucket.

The end of this cycle, or step six if you will, is to bring the whole process full circle. Revisit your plans from the first step and reevaluate. Can your priorities be changed? Do the rules you implemented still fit into your overall business strategy? Pinpointing these necessary changes will equip you to work through the cycle; make changes that benefit your process and conduct periodic reviews to make sure that your data cleansing is running with smoothness and accuracy.

Follow this cycle and you’ll be well on your way to having the cleanest and thus most effective data.


**Q4. Explain the concept of reliability, Explain types of reliability, and methods used to calculate each types. (20)**

‘Reliability’ of any research is the degree to which it gives an accurate score across a range of measurement. It can thus be viewed as being ‘repeatability’ or ‘consistency’. In summary:

- **Inter-rater**: Different people, same test.
- **Test-retest**: Same people, different times.
- **Parallel-forms**: Different people, same time, different test.
- **Internal consistency**: Different questions, same construct.

## Inter-Rater Reliability

When multiple people are giving assessments of some kind or are the subjects of some test, then similar people should lead to the same resulting scores. It can be used to calibrate people, for example those being used as observers in an experiment.

Inter-rater reliability thus evaluates reliability *across different people*.

Two major ways in which inter-rater reliability is used are (a) testing how similarly people *categorize* items, and (b) how similarly people *score* items.

This is the best way of assessing reliability when you are using observation, as observer bias very easily creeps in. It does, however, assume you have multiple observers, which is not always the case.

Inter-rater reliability is also known as *inter-observer reliability* or *inter-coder reliability*.

### Examples

Two people may be asked to categorize pictures of animals as being dogs or cats. A perfectly reliable result would be that they both classify the same pictures in the same way.

Observers being used in assessing prisoner stress are asked to assess several ‘dummy’ people who are briefed to respond in a programmed and consistent way. The variation in results from a standard gives a measure of their reliability.

In a test scenario, an IQ test applied to several people with a true score of 120 should result in a score of 120 for everyone. In practice, there will usually be some variation between people.
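For the categorization case (two people sorting pictures into dogs and cats), agreement can be quantified as in the sketch below. Percent agreement is the simplest measure. Cohen's kappa, which corrects for agreement expected by chance, is not named in the text above and is included here as one commonly used option. The ratings are hypothetical.

```python
from collections import Counter

def percent_agreement(a, b):
    """Proportion of items that two raters placed in the same category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' proportions, summed over categories
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical ratings of ten pictures by two raters
rater_a = ["dog", "dog", "cat", "cat", "dog", "cat", "dog", "dog", "cat", "dog"]
rater_b = ["dog", "dog", "cat", "dog", "dog", "cat", "dog", "cat", "cat", "dog"]
print(percent_agreement(rater_a, rater_b), round(cohens_kappa(rater_a, rater_b), 3))
```

Here the raters agree on 8 of 10 pictures. Kappa is lower than the raw agreement because some of that agreement would be expected even if both raters were guessing.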

## Test-Retest Reliability

An assessment or test of a person should give the same results whenever you apply the test.

Test-retest reliability evaluates reliability across *time*.

Reliability can vary with the many factors that affect how a person responds to the test, including their mood, interruptions, time of day, etc. A good test will largely cope with such factors and give relatively little variation. An unreliable test is highly sensitive to such factors and will give widely varying results, even if the person re-takes the same test half an hour later.

Generally speaking, the longer the delay between tests, the greater the likely variation. Better tests will give less retest variation with longer delays.

Of course the problem with test-retest is that people may have learned and that the second test is likely to give different results.

This method is particularly used in experiments that use a no-treatment control group that is measured pre-test and post-test.

### Examples

Various questions for a personality test are tried out with a class of students over several years. This helps the researcher determine those questions and combinations that have better reliability.

In the development of national school tests, a class of children are given several tests that are intended to assess the same abilities. A week and a month later, they are given the same tests. With allowances for learning, the variation in the test and retest results are used to assess which tests have better test-retest reliability.

## Parallel-Forms Reliability

One problem with questions or assessments is knowing what questions are the best ones to ask. A way of discovering this is to do two tests in parallel, using different questions.

Parallel-forms reliability evaluates different questions and question sets that seek to assess the same construct.

Parallel-Forms evaluation may be done in combination with other methods, such as *Split-half*, which divides items that measure the same construct into two tests and applies them to the same group of people.

### Examples

An experimenter develops a large set of questions. They split these into two and administer them each to a randomly-selected half of a target sample.

In development of national tests, two different tests are simultaneously used in trials. The test that gives the most consistent results is used, whilst the other (provided it is sufficiently consistent) is used as a backup.

## Internal Consistency Reliability

When asking questions in research, the purpose is to assess the response against a given construct or idea. Different questions that test the same construct should give consistent results.

Internal consistency reliability evaluates individual questions in comparison with one another for their ability to give consistently appropriate results.

*Average inter-item correlation* compares correlations between all pairs of questions that test the same construct by calculating the mean of all paired correlations.

*Average item total correlation* takes the average inter-item correlations and calculates a total score for each item, then averages these.

*Split-half correlation* divides items that measure the same construct into two tests, which are applied to the same group of people, then calculates the correlation between the two total scores.

*Cronbach’s alpha* calculates an equivalent to the average of all possible split-half correlations and is calculated thus:

$$\alpha = \frac{N\,\bar{r}}{1 + (N - 1)\,\bar{r}}$$

where N is the number of components and $\bar{r}$ is the average of all Pearson correlation coefficients between the components.
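The alpha formula above can be sketched in code as follows. The item scores are hypothetical (three questions answered by five respondents), and the helper names `pearson_r` and `standardized_alpha` are assumptions.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardized_alpha(items):
    """Alpha = N * r_bar / (1 + (N - 1) * r_bar), with r_bar the mean inter-item correlation.

    items: one list of scores per question, with respondents in the same order in each list.
    """
    n_items = len(items)
    correlations = [pearson_r(a, b) for a, b in combinations(items, 2)]
    r_bar = sum(correlations) / len(correlations)
    return n_items * r_bar / (1 + (n_items - 1) * r_bar)

# Hypothetical scores on three questions from five respondents
q1 = [4, 3, 5, 2, 4]
q2 = [5, 3, 4, 2, 4]
q3 = [4, 2, 5, 3, 5]
alpha = standardized_alpha([q1, q2, q3])
print(round(alpha, 2))
```

The same `pearson_r` helper also covers the split-half and test-retest methods, where the correlation is taken between two total scores or two administrations of the test.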


**Q5. What are measures of difference? Explain different types of detail with examples. How are these tests used in hypothesis testing (20)**

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode.

The mean, median and mode are all valid measures of central tendency, but under different conditions, some measures of central tendency become more appropriate to use than others. In the following sections, we will look at the mean, mode and median, and learn how to calculate them and under what conditions they are most appropriate to be used.

## Mean (Arithmetic)

The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have n values in a data set and they have values x_{1}, x_{2}, …, x_{n}, the sample mean, usually denoted by $\bar{x}$ (pronounced “x bar”), is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

This formula is usually written in a slightly different manner using the Greek capital letter $\Sigma$, pronounced “sigma”, which means “sum of…”:

$$\bar{x} = \frac{\sum x}{n}$$

You may have noticed that the above formula refers to the sample mean. So, why have we called it a sample mean? This is because, in statistics, samples and populations have very different meanings and these differences are very important, even if, in the case of the mean, they are calculated in the same way. To acknowledge that we are calculating the population mean and not the sample mean, we use the Greek lower case letter “mu”, denoted as µ:

$$\mu = \frac{\sum x}{n}$$

The mean is essentially a model of your data set: a single value that summarises the data. You will notice, however, that the mean is often not one of the actual values that you have observed in your data set. Nevertheless, one of its important properties is that it minimises error in the prediction of any one value in your data set. That is, it is the value that produces the lowest total squared error when used as a prediction for every value in the data set.

An important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.
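Both properties of the mean can be checked numerically. The sketch below uses a small hypothetical data set and measures "error" as the sum of squared differences, which is an assumption about what kind of error is meant.

```python
from statistics import mean

values = [4, 8, 15, 16, 23, 42]   # hypothetical data set
x_bar = mean(values)

# Property 1: deviations from the mean cancel out exactly
deviation_sum = sum(x - x_bar for x in values)

def sum_squared_error(data, guess):
    """Total squared error if `guess` is used to predict every value."""
    return sum((x - guess) ** 2 for x in data)

# Property 2: any guess other than the mean gives a larger squared error
sse_at_mean = sum_squared_error(values, x_bar)
sse_nearby = [sum_squared_error(values, x_bar + d) for d in (-1, -0.5, 0.5, 1)]

print(x_bar, deviation_sum, sse_at_mean, sse_nearby)
```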

### When not to use the mean

The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set by being especially small or large in numerical value. For example, consider the wages of staff at a factory below:

| Staff | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Salary | 15k | 18k | 16k | 14k | 15k | 15k | 12k | 17k | 90k | 95k |

The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value might not be the best way to accurately reflect the typical salary of a worker, as most workers have salaries in the $12k to $18k range. The mean is being skewed by the two large salaries. Therefore, in this situation, we would like to have a better measure of central tendency. As we will find out later, taking the median would be a better measure of central tendency in this situation.
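The effect of the two large salaries can be verified with Python's standard `statistics` module. The salaries are the ten values from the table; the cut-off of 50 used to separate out the two large salaries is an arbitrary choice made only for this illustration.

```python
from statistics import mean, median

# Salaries from the table, in thousands of dollars
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = mean(salaries_k)       # pulled upward by 90k and 95k
median_salary = median(salaries_k)   # middle of the ordered salaries

# Mean of the eight "typical" salaries only (cut-off of 50 is an assumption)
typical = [s for s in salaries_k if s < 50]
mean_typical = mean(typical)

print(mean_salary, median_salary, mean_typical)
```

The median and the mean of the eight typical salaries both fall inside the $12k to $18k range, while the overall mean does not.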

## Median

The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:

65 | 55 | 89 | 56 | 35 | 14 | 56 | 55 | 87 | 45 | 92 |

We first need to rearrange that data into order of magnitude (smallest first):

14 | 35 | 45 | 55 | 55 | **56** | 56 | 65 | 87 | 89 | 92 |

Our median mark is the middle mark – in this case, 56 (highlighted in bold). It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when you have an odd number of scores, but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply have to take the middle two scores and average the result. So, if we look at the example below:

65 | 55 | 89 | 56 | 35 | 14 | 56 | 55 | 87 | 45 |

We again rearrange that data into order of magnitude (smallest first):

14 | 35 | 45 | 55 | **55** | **56** | 56 | 65 | 87 | 89 |

Only now we have to take the 5th and 6th score in our data set and average them to get a median of 55.5.
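The two cases (odd and even number of scores) can be written as a short function. This is a minimal sketch; the name `median_of` is hypothetical, and the data are the two sets of marks used above.

```python
def median_of(scores):
    """Return the median: the middle score, or the mean of the two middle scores."""
    if not scores:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(scores)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: single middle score
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the middle two

odd_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]   # 11 scores
even_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]      # 10 scores

print(median_of(odd_scores), median_of(even_scores))
```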

## Mode

The mode is the most frequent score in our data set. On a histogram or bar chart it is represented by the highest bar. You can, therefore, sometimes consider the mode as being the most popular option.

Normally, the mode is used for categorical data where we wish to know which is the most common category. For example, in a data set recording how people travel, the mode is the most common form of transport, which in this particular data set is the bus.

However, one of the problems with the mode is that it is not unique, so it leaves us with problems when two or more values share the highest frequency.

When that happens, we are stuck as to which mode best describes the central tendency of the data. This is particularly problematic when we have continuous data because we are less likely to have any one value that is more frequent than the others. For example, consider measuring the weight of 30 people (to the nearest 0.1 kg). How likely is it that we will find two or more people with **exactly** the same weight (e.g., 67.4 kg)? The answer is: probably very unlikely. Many people might be close, but with such a small sample (30 people) and a large range of possible weights, you are unlikely to find two people with exactly the same weight to the nearest 0.1 kg. This is why the mode is very rarely used with continuous data.
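The non-uniqueness of the mode is easy to see in code. In this sketch the function name `modes` and all three data sets are hypothetical; they are made up only to show a single mode, a tie, and continuous measurements with no repeated value.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Hypothetical categorical data with one clear mode
transport = ["bus", "car", "bus", "train", "bus", "car", "walk"]

# Hypothetical data where two categories tie for the highest frequency
tied = ["bus", "car", "bus", "car", "train"]

# Hypothetical weights in kg (to the nearest 0.1 kg), none repeated
weights = [67.4, 70.1, 58.9, 81.3, 64.0]

print(modes(transport))   # a single mode
print(modes(tied))        # two modes
print(modes(weights))     # every value is a "mode"
```

When no value repeats, every observation ties for the highest frequency, so the mode tells us nothing about where the data are centred.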

Another problem with the mode is that it will not provide us with a very good measure of central tendency when the most common mark is far away from the rest of the data in the data set. For example, if the mode has a value of 2 while most of the data are concentrated in the 20 to 30 range, the mode is clearly not representative of the data. To use the mode to describe the central tendency of such a data set would be misleading.

## Skewed Distributions and the Mean and Median

We often test whether our data is normally distributed because this is a common assumption underlying many statistical tests.

When you have a normally distributed sample you can legitimately use either the mean or the median as your measure of central tendency. In fact, in any symmetrical distribution with a single peak the mean, median and mode are equal. However, in this situation, the mean is widely preferred as the best measure of central tendency because it is the measure that includes all the values in the data set for its calculation, and any change in any of the scores will affect the value of the mean. This is not the case with the median or mode.

However, when our data is skewed, for example right-skewed, we find that the mean is being dragged in the direction of the skew. In these situations, the median is generally considered to be the best representative of the central location of the data. The more skewed the distribution, the greater the difference between the median and mean, and the greater emphasis should be placed on using the median as opposed to the mean. A classic example of a right-skewed distribution is income (salary), where higher earners provide a false representation of the typical income if expressed as a mean and not a median.

If the data are expected to follow a normal distribution, but tests of normality show that the data is non-normal, it is customary to use the median instead of the mean. However, this is more a rule of thumb than a strict guideline. Sometimes, researchers wish to report the mean of a skewed distribution if the median and mean are not appreciably different (a subjective assessment), and if it allows easier comparisons to previous research to be made.
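One rough way to see how skew separates the mean from the median is to compute both alongside a skewness coefficient. The sketch below reuses the factory salaries from the earlier table and a small symmetric data set that is hypothetical. The moment-based skewness formula (third central moment divided by the second central moment raised to the power 1.5) is one common definition and is an assumption here, not something the text above prescribes.

```python
from statistics import mean, median

def sample_skewness(data):
    """Moment-based skewness: positive for right skew, about 0 for symmetry."""
    n = len(data)
    m = mean(data)
    m2 = sum((x - m) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]   # right-skewed
symmetric = [1, 2, 3, 4, 5]                              # hypothetical, symmetric

for name, data in (("salaries", salaries_k), ("symmetric", symmetric)):
    gap = mean(data) - median(data)
    print(name, round(sample_skewness(data), 2), round(gap, 2))
```

For the symmetric data the mean and median coincide, while for the salaries the positive skew goes together with a mean well above the median.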
