AIOU Solved Assignment 2 Code 8602 Spring 2020


Education, Assessment and Evaluation (8602)
B.ed 1.5 Years
Spring, 2020


Q.1   State different methods to enhance the reliability of the measurement tool. Also explain each type by providing examples.

Reliability is the consistency of a measure. In educational testing, reliability refers to the confidence that a test score will be the same across repeated administrations of the test. There is a close relation between the construct of reliability and the construct of validity. Many sources note that a test can have reliability without validity, but that a test cannot have validity without reliability. In the theoretical sense, these statements are true, but not in any practical sense. A test is designed to be reliable and valid, consistent and accurate. Practical conceptualizations of reliability cannot be discussed separately from examples involving validity.


Reliability without validity would be similar to an archer consistently hitting the target in the same place but missing the bull’s eye by a foot. The archer’s aim is reliable because it is predictable but it is not accurate. The archer’s aim never hits what it is expected to hit. In this analogy, validity without reliability would be the arrows hitting the target in a haphazard manner but close to the bull’s eye and centering around the bull’s eye. In this second example, it can be seen that the validity is evidence that the archer is aiming at the right place. However, it also demonstrates that, even though the reliability is low, there is still some degree of reliability. That is, at least the arrows are consistently hitting the target. In addition, if the arrows are centered around the bull’s eye, the error of one aim leading too far to the right is balanced by another aim leading too far to the left. Looking at the unpainted backside of the target’s canvas, someone would be able to identify where the bull’s eye was by averaging the distance of all the shots from the bull’s eye.

Reliability of a test is an important selling point for publishers of standardized tests, especially in high-stakes testing. If an institute asserts that its instrument can identify children who qualify for a special education program, the users of that test would hope that it has high reliability. Otherwise, some children requiring special education may not be identified, whereas others who do not require it may be unnecessarily assigned to the special education program.

Situations perceived as low-stakes testing

Even in situations perceived as low-stakes testing, such as classroom testing, reliability and validity are serious concerns. Classroom teachers are concerned that the tests they administer are truly reflective of their students’ abilities. If a teacher administered a test that was reliable but not valid, it would not have much practical use. An example would be a teacher in a grade-school history class administering, as a midterm exam, a standardized test from a reputable publisher. If that exam was suggested by the test developer as the final exam, the results would most likely be reliable but not valid. The test results would be reliable because they would most likely reflect the students’ rank order in class performance. However, the test would not be valid as most students would not be ready for half of the material being tested. If the grade-school history teacher administered as a midterm exam a standardized test recommended by the test developer as a midterm exam for the appropriate grade level, the test could be considered valid. However, if the students (for some strange reason) did not receive uniform instruction in grade-appropriate history, the test would most likely not be reliable.

From these examples, it is clear that it is easier to increase the validity of a reliable measure than to increase the reliability of an otherwise valid measure. The reliable archer could be trained, little by little, to move the aim in the direction of the bull’s eye. Alternatively, the target could be moved over one foot, so that the bull’s eye is at the spot on the target that the archer usually hits. The teacher could take the time necessary (half a school year) to teach the students what they need to know to pass the valid final exam. (This is similar to training the archer to shoot in the right direction.) Another solution for the classroom situation would be for the teacher to adapt the test items in the exam, so that it is more appropriate as a midterm exam, instead of as a final exam. (This is similar to moving the target so that the archer’s aim hits the bull’s eye.)

In the test publishing world, the reliability of a draft test instrument is often quickly established. However, after the test is used many times, its validity might be questioned. A test designed as a verbal reasoning test may rely heavily on the test-taker’s knowledge of music, art, and history. Because the test is reliable, the publishers might redefine what construct it is measuring; in this case, it is a better measure of the students’ knowledge of the humanities than of verbal reasoning. The publisher would recall all copies of the Verbal Reasoning test; then, with little change to the test, the publisher could offer it again as the Humanities Achievement test.

Test reliability is explained through the true score theory and the theory of reliability. True score theory states that the observed score on a test is the result of a true score plus some error in measurement. The theory of reliability compares the reliability of a test of human characteristics with the reliability of measuring instruments in the physical sciences.


True score is the exact measure of the test taker’s true ability in the area being tested. With a perfect test, the observed score would be equal to the true score. However, there is no perfect test. As one example of where the error may occur, the wording of test items may not be detailed enough for some test takers yet be too detailed for others. The examples of test errors are innumerable. According to true score theory, no one can know what the reliability of the test is unless one knows how much random error exists in the test. One cannot know how much error exists in the test unless one knows what the true score is. As a theoretical concept, the true score cannot be known. Therefore, what the reliability is can never be known with certainty. However, one can still estimate what the reliability is through repeated measures. As the error is assumed to be random, it should be balanced out over many administrations of the same test. If the test-taker’s ability measured by the test is unchanging, when the error inflates the observed score one time it can be expected to deflate the observed score to the same degree at another time.
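The true score model described above can be illustrated with a small simulation. This is only an illustrative sketch with invented numbers, not a real testing procedure: a hypothetical fixed true score is perturbed by random error on each administration, and because the error is random it balances out over many administrations.

```python
import random
import statistics

random.seed(42)

TRUE_SCORE = 80   # hypothetical true ability on a 0-100 scale (unknowable in practice)
ERROR_SD = 5      # assumed spread of the random measurement error

def administer_test(true_score, error_sd):
    """One administration: observed score = true score + random error."""
    return true_score + random.gauss(0, error_sd)

# A single observed score can miss the true score by several points...
single = administer_test(TRUE_SCORE, ERROR_SD)

# ...but averaging many administrations washes the random error out,
# so the average observed score converges toward the true score.
observed = [administer_test(TRUE_SCORE, ERROR_SD) for _ in range(1000)]
average = statistics.mean(observed)

print(round(single, 1), round(average, 1))
```

Under these assumptions, the average of the 1,000 observed scores lands within a fraction of a point of 80, while any single administration may be several points off in either direction.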

Reliability in assessment

Reliability indicates the consistency or stability of test performance and is one of the most important considerations when selecting tests and other assessment tools. A test must be constructed so that examiners can administer the test with minimal errors and can interpret the performance of students with confidence.

The assessment process is subject to error from many sources. Errors in measurement can stem from the testing environment, the student, the test, and the examiner. Sources of error stemming from the student include:

  • Hunger
  • Fatigue
  • Illness
  • Difficulty in understanding test instructions
  • Difficulty in understanding or interpreting language used

Sources of error stemming from the test include:

  • Ambiguously worded questions
  • Biased questions
  • Different interpretations of the wording of test questions

An examiner who is not prepared or who incorrectly interprets administration or scoring guidelines contributes to measurement errors. Sources of error associated with test administration include:

  • Unclear directions
  • Difficulty in achieving rapport
  • Insensitivity to student’s culture, language, preferences, or other characteristics
  • Ambiguous scoring
  • Errors in recording information about the student

Reliability information that is reported in test manuals should be carefully considered. While there are some books and journal articles that report evaluations of tests, tests are not given “seals of approval.” To be useful, they must meet certain standards. Three professional organizations, the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (1999), have published the Standards for Educational and Psychological Testing, which provide criteria for evaluating tests, testing practices, and the effects of test use on individuals. The 1999 edition of the Standards describes reliability in a way that departs from more traditional thinking about it. In this edition, reliability refers to the “scoring procedure that enables the examiner to quantify, evaluate, and interpret behavior or work samples,” and to “the consistency of such measurements when the testing procedure is repeated on a population of individuals or groups.”

Test developers convey reliability of assessment instruments in various ways. They are responsible for reporting evidence of reliability. Test users and consumers must use this evidence in deciding the suitability of various assessment instruments. While no one approach is preferred, educators should be familiar with all of the approaches in order to judge the usefulness of instruments. These approaches are: (1) one or more correlation coefficients, (2) variances or standard deviations of measurement errors, and (3) technical information about tests known as IRT (item response theory).
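As a sketch of approach (1), the split-half method correlates students' scores on two halves of the same test and then applies the Spearman-Brown correction to estimate full-test reliability. The item data below are invented for illustration only.

```python
import statistics

def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical item scores for 6 students on a 10-item test (1 = correct).
items = [
    [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
]

# Split each student's test into odd- and even-numbered item halves.
odd_half = [sum(row[0::2]) for row in items]
even_half = [sum(row[1::2]) for row in items]

r_half = pearson(odd_half, even_half)

# Spearman-Brown correction: reliability of the full-length test.
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```

The corrected coefficient is always higher than the half-test correlation, because a longer test samples the construct more thoroughly.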


Q.2   Explain the effects of curricular validity on the performance of examinees. Also, how can you measure the curricular validity of tests? Elaborate in detail.

Assessment is important because of all the decisions you will make about children when teaching and caring for them. The decisions teachers face all involve how best to educate children; you will be called upon every day to make decisions before, during, and after your teaching. Whereas some of these decisions will seem small and inconsequential, others will be “high stakes,” influencing the life course of children. All of your assessment decisions taken as a whole will direct and alter children’s learning outcomes. The sections below outline some purposes of assessment and how assessment can enhance your teaching and student learning. All of these purposes are important; if you use assessment procedures appropriately, you will help all children learn well.

The following general principles should guide both policies and practices for the assessment of young children:

Assessment should bring about benefits for children. Gathering accurate information from young children is difficult and potentially stressful. Assessments must have a clear benefit—either in direct services to the child or in improved quality of educational programs. Assessment should be tailored to a specific purpose and should be reliable, valid, and fair for that purpose. Assessments designed for one purpose are not necessarily valid if used for other purposes. In the past, many of the abuses of testing with young children have occurred because of misuse.

Assessment policies

Assessment policies should be designed recognizing that reliability and validity of assessments increase with children’s age. The younger the child, the more difficult it is to obtain reliable and valid assessment data. It is particularly difficult to assess children’s cognitive abilities accurately before age six. Because of problems with reliability and validity, some types of assessment should be postponed until children are older, while other types of assessment can be pursued, but only with necessary safeguards. Assessment should be age appropriate in both content and the method of data collection. Assessments of young children should address the full range of early learning and development, including physical well-being and motor development; social and emotional development; approaches toward learning; language development; and cognition and general knowledge. Methods of assessment should recognize that children need familiar contexts to be able to demonstrate their abilities. Abstract paper-and-pencil tasks may make it especially difficult for young children to show what they know.

Assessment should be linguistically appropriate, recognizing that to some extent all assessments are measures of language. Regardless of whether an assessment is intended to measure early reading skills, knowledge of color names, or learning potential, assessment results are easily confounded by language proficiency, especially for children who come from home backgrounds with limited exposure to English, for whom the assessment would essentially be an assessment of their English proficiency. Each child’s first- and second-language development should be taken into account when determining appropriate assessment methods and in interpreting the meaning of assessment results.

Parents should be a valued source of assessment information, as well as an audience for assessment. Because of the fallibility of direct measures of young children, assessments should include multiple sources of evidence, especially reports from parents and teachers. Assessment results should be shared with parents as part of an ongoing process that involves parents in their child’s education.

Purposes of Assessment


  • Identify what children know
  • Identify children’s special needs
  • Determine appropriate placement
  • Select appropriate curricula to meet children’s individual needs
  • Refer children and, as appropriate, their families for additional services to programs and agencies


  • Communicate with parents to provide information about their children’s progress and learning
  • Relate school activities to home activities and experiences

Early Childhood Programs

  • Make policy decisions regarding what is and is not appropriate for children
  • Determine how well and to what extent programs and services children receive are beneficial and appropriate

Early Childhood Teachers

  • Identify children’s skills, abilities, and needs
  • Make lesson and activity plans and set goals
  • Create new classroom arrangements
  • Select materials
  • Make decisions about how to implement learning activities
  • Report to parents and families about children’s developmental status and achievement
  • Monitor and improve the teaching-learning process
  • Meet the individual needs of children
  • Group for instruction

The Public

  • Inform the public regarding children’s achievement
  • Provide information relating to student’s school-wide achievements
  • Provide a basis for public policy (e.g., legislation, recommendations, and statements)

1. Face Validity ascertains that the measure appears to be assessing the intended construct under study. Stakeholders can easily assess face validity. Although this is not a very “scientific” type of validity, it may be an essential component in enlisting the motivation of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the ability, they may become disengaged from the task.

Example: If a measure of art appreciation is created, all of the items should be related to the different components and types of art. If the questions concern historical time periods, with no reference to any artistic movement, stakeholders may not be motivated to give their best effort or invest in this measure because they do not believe it is a true assessment of art appreciation.

2. Construct Validity is used to ensure that the measure is actually measuring what it is intended to measure (i.e., the construct), and not other variables. Using a panel of “experts” familiar with the construct is one way in which this type of validity can be assessed. The experts can examine the items and decide what each specific item is intended to measure. Students can be involved in this process to obtain their feedback.

Example: A women’s studies program may design a cumulative assessment of learning throughout the major. If the questions are written with complicated wording and phrasing, the test can inadvertently become a test of reading comprehension rather than a test of women’s studies. It is important that the measure actually assesses the intended construct rather than an extraneous factor.

3. Criterion-Related Validity is used to predict future or current performance – it correlates test results with another criterion of interest.

Example: Suppose a physics program designed a measure to assess cumulative student learning throughout the major. The new measure could be correlated with a standardized measure of ability in this discipline, such as an ETS field test or the GRE subject test. The higher the correlation between the established measure and the new measure, the more faith stakeholders can have in the new assessment tool.


Q.3   Write down learning outcomes for any unit of Social Studies for 8th grade and develop an essay type test item with rubric, 5 multiple choice questions and 5 short questions for the written learning outcomes.

Since the implementation of the curriculum reform, the Education Bureau has aimed at promoting learning to learn and whole-person development. It has introduced a flexible and open curriculum framework to promote a “paradigm shift” in school education: steering from a textbook-oriented and teacher-centred teaching approach to a multi-dimensional, interactive and student-centred learning approach. According to the Interim Review of the curriculum reform and the Inspection Annual Reports, students were interested in learning and willing to answer teachers’ questions. They participated actively in learning activities and cooperated with their peers in discussions and presentations. Teachers possessed good professional knowledge and were capable of using information technology and subject resources properly to facilitate learning and teaching. A wide range of teaching and assessment strategies were adopted to cater for students’ learning needs, and quality feedback was provided to enhance students’ learning. Students had outstanding performance in international assessments in reading, mathematics and science. Thus, basic education in Hong Kong has achieved considerable success.

Learning and teaching culture

Regarding the learning and teaching culture and the professional development of teachers in primary schools, a sustainable paradigm shift has been witnessed: students have become more active in learning. The development of generic skills, especially communication skills, creativity and critical thinking skills, and the inculcating of positive core values and attitudes in students, can reach the major goals of the curriculum reform, and schools have been moving towards self-directed learning. Building on the achievements of the curriculum reform, schools can further enhance learning and teaching, adopt appropriate strategies to cater for learner diversity and help students develop self-directed learning capabilities.

Apart from adjusting the pace of learning and teaching according to students’ ability, teachers should embrace different cultures and provide students with various learning opportunities and room for self-directed learning. For example, through assignments, project learning, life-wide learning, group discussion and sharing, students can select their own learning strategies and develop self-directed learning skills according to their ability, personality, learning style, expected learning outcomes, etc.

In the learning process, teachers can help students reflect on their own learning in order to develop their self-directed learning skills.

There are a number of steps in the strategic planning process, and it is recommended that you complete each one. Some organizations choose to bypass steps in the hope of reducing planning time; we recommend you follow all the steps in the appropriate sequence. Although some steps can be time-consuming and consensus can be difficult to obtain, the end result is a plan that has support from planning group members and other stakeholders.

Catering for Learner Diversity

Every student is a unique individual. They are different in level of maturity, gender, personality, ability, aspiration, interest, learning motivation, culture, language and socioeconomic background. Their intelligence, cognitive and learning styles influence the learning traits. Therefore, in addition to a comprehensive understanding of the curriculum content and features, schools and teachers should cater for learner diversity in lessons. For example, the newly-arrived children, non-Chinese speaking students and cross-boundary students may lack the prior knowledge for understanding the learning content of certain topics due to their different backgrounds. Under such situations, teachers may teach them the relevant knowledge beforehand.

Cognitive style reflects an individual’s thinking mode, which is the methods and habits that one tends to adopt when receiving, processing, organising and remembering information. It will affect one’s performance and achievement in learning. Scholars classify cognitive styles into different categories. For example, cognitive styles are categorised into two dimensions – “holistic-analytical” style and “verbal-imagery” style. Learners of the former cognitive style tend to treat the information as a whole, or the collection of parts when organising information, whereas learners of the latter cognitive style tend to think and express in words, or mental images.

Curriculum Level

Teachers may adapt the curriculum according to learner diversity, including learning needs, styles, interests and abilities. For example, teachers can adjust the teaching pace, content, hierarchy, strategies, and assessment tools and methods. Curriculum adaptation can target a class, a group or an individual student. The learning objectives set for students can be partially the same and partially different. Even if the learning objectives are the same, the allocation of time, content and form of learning activities can be adapted. The ultimate goal of curriculum adaptation is to provide an environment that supports student learning so that every student can participate in the learning process and achieve learning goals.

Curriculum adaptation usually takes place in terms of content, process and outcome. One or two of the following areas could be adapted:

Content: Teachers may focus on teaching the most crucial concepts, processes and skills, adjust the difficulty of learning content, or select basic or more advanced level learning materials relevant to the topics.

Process: Teachers may consider adjusting the complexity and abstractness of the learning task, or allow different students to learn in different ways.

Outcomes: Teachers may consider adjusting the degree of challenge of the learning tasks, or expect different learning outcomes according to students’ learning abilities or styles. For example, after reading a story book, teachers usually require students to submit book reports, but they may allow accommodator students to discuss the assignment questions and present in a group; allow converger students to carry out role plays and propose a method to solve problems, or attempt to relate the story content to real life; allow diverger students to rewrite the ending of the story; and allow assimilator students to infer the main idea of the story.

Teachers can develop or design appropriate learning materials and/or activities according to students’ cognitive styles. For example, if students tend to acquire information by reading or listening to text messages, teachers can provide them with text-based learning materials. The activities can include reading articles, listening to recordings and group discussions. If students tend to acquire information through visual channels, teachers could incorporate more images in the learning materials. The activities can include watching video clips and reading charts. Although an individual may have his or her own habitual cognitive style, he or she may develop other styles according to the situation. Therefore, by creating different learning contexts, teachers can nurture and develop different cognitive styles in students.

Learning styles can be innate or nurtured in social interaction. They reflect learners’ unique learning habits and their preference in processing information. They include the specific learning strategy that learners adopt or the learning mode and environment that they prefer when completing a learning task. Similarly, scholars classify learning styles into different categories. For example, according to the two dimensions of perception and processing modes, learning styles can be divided into four types – accommodator, diverger, converger and assimilator.

Reading tests give you only general measures of reading ability. Some students may be good readers in certain content areas, yet they may score poorly on a given test because the reading passages in that test do not include the content areas they know.

Good programs select students by using several assessment tools, rather than just one. Although the regulations do not explicitly state other requirements, they do allow you to use additional assessment tools in selecting students. Ask your state director how you can best use other assessment tools, such as report card grades, results of other tests, and systematic teacher assessments obtained through questionnaires.

Some common methods for using multiple assessments are:

  • selecting students who score below prescribed cut-offs on both your district’s standardized test and another state-mandated test;
  • using your district’s standardized test to identify a pool of possible participants, then using either a teacher-completed questionnaire or report card grades to select students from the pool;
  • using a systematic method for obtaining teachers’ judgments about students’ needs in order to identify a pool of possible participants, then using a standardized test to select students from the pool; or
  • using the standardized test to identify a pool of students, then creating a study team to select students from the pool and carefully documenting the study team’s process.
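The first two selection methods above can be sketched in a few lines. The student names, percentile ranks, and cutoff value below are all invented for illustration; a real program would pull these from district records and state guidelines.

```python
# Hypothetical student records: percentile ranks on the district's
# standardized test and on a state-mandated test.
students = {
    "Amna":  {"district": 18, "state": 22},
    "Bilal": {"district": 45, "state": 30},
    "Sana":  {"district": 12, "state": 15},
    "Tariq": {"district": 28, "state": 40},
    "Zara":  {"district": 20, "state": 19},
}

CUTOFF = 25  # assumed percentile-rank cutoff for program eligibility

# Method 1: select students who score below the cutoff on BOTH tests.
selected = [name for name, s in students.items()
            if s["district"] < CUTOFF and s["state"] < CUTOFF]

# Method 2: use the district test to build a pool of possible
# participants, then let a second measure narrow the pool down.
pool = [name for name, s in students.items() if s["district"] < CUTOFF]
final = [name for name in pool if students[name]["state"] < CUTOFF]

print(selected)  # ['Amna', 'Sana', 'Zara']
print(final)
```

In practice, the second measure in method 2 could just as well be a teacher-completed questionnaire or report card grades rather than another test.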


Out-of-level testing occurs when you give a standardized test to students who are at a different grade level than the one for which the test is designed. In some cases, school officials use out-of-level tests in compensatory programs because those students are behind their peers and in-level testing is frustrating for them. Administrators who follow this practice believe that somehow it is more valid to give those students tests designed for lower grade levels.

While out-of-level tests may be less frustrating to some students, the scores obtained from them are also less valid because

  • the content for out-of-level tests does not represent the content taught in the classroom,
  • the scale that test publishers use to link different test levels is loaded with error,
  • there are no norms for out-of-level tests,
  • scores obtained on tests of different difficulty are not comparable, and
  • when obtained, out-of-level scores appear to be too low.

Although in-level test scores are more reliable in the middle than at the high- and low-score ranges, they are quite reliable in placing students at the high or low end of the scale. For example, with a reasonable degree of assurance, we can say that a student who scores at the 10th percentile rank is most likely a low-achieving student. What we are less sure about is whether the student is at the 10th percentile rank or the 15th percentile rank. Either way, we are reasonable in concluding that the student is low achieving.


Generally, when school personnel say that certain students perform at grade-level, they mean that those students can learn material at about the same rate and quality as others in the same class. The implication is that students who don’t perform at grade-level have significantly more difficulty in class than their peers. Accordingly, when students are labeled as working below grade-level, the implication is that they may not have the aptitude, maturity, or interest to do the work that others in the same class are doing. This interpretation of students’ abilities is made by relatively few people.

In contrast, in the testing arena at grade-level has a different meaning. When students score at grade-level, their scores are at the 50th percentile rank. It means that about half of their peers score higher and about half score lower. In testing, at grade-level does not relate to how well students perform in the classroom. Therefore, when you review students’ scores, you must consider that, by definition, many students score below grade-level.

Since most people use the term grade-level in the general sense, you can either avoid using grade-equivalent test scores or develop a range of scores that indicate satisfactory achievement in the classroom. You may also think of average performance on a test as being between the 23rd and the 77th percentile rank.


Administrators tend to interpret differences in test scores in one of two ways. First, they may think that a difference of one or two percentile rank points is an important difference. Second, they may think that a difference of ten points shows that the test is unreliable. Few administrators can differentiate the degree of error in individual and group scores.

In general, individual scores have more error in them than group scores do. The error in an individual score is largely a function of the test’s standard error that is described in the publisher’s technical manual. For most of the tests given in elementary and secondary schools, the standard error is about 2.5 raw score points. This means that about 95% of the time, we would expect the scores for individual students to fall within a range of 10 raw score points. That is not particularly reassuring, but it is exactly why we need to use multiple measures for selecting students and why for most of the tests we use we should be a little skeptical of individual test scores and cautious in interpreting differences.
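The arithmetic behind that 10-point range can be made explicit. Using the assumed standard error of about 2.5 raw score points, the roughly 95% band around an individual score is the observed score plus or minus two standard errors:

```python
SEM = 2.5  # standard error of measurement, as reported in a technical manual

def score_band(observed, sem=SEM, z=2.0):
    """Roughly 95% confidence band: observed score plus or minus 2 SEMs."""
    return observed - z * sem, observed + z * sem

# A hypothetical student with a raw score of 43 could plausibly have
# scored anywhere from 38 to 48 on another administration.
low, high = score_band(43)
print(low, high, high - low)  # 38.0 48.0 10.0
```

This is why two students whose raw scores differ by a few points should not be treated as meaningfully different on the basis of one test.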

The error in group scores largely depends on the size of the group. Once you have a group of about 30 scores, the magnitude of the errors decreases. By the time you average all the scores for your school district, you can regard the results as accurate as long as there is not some systematic bias operating for most everyone in the district.

You can be confident of your interpretation when you consider score averages of large groups. For instance, if, when you consider a group of 55 scores, the score average changes one or two percentile rank points, then that is an important change. If you consider averages based on fewer cases, you must be more cautious. You can be more or less confident of average scores depending on the level. There is a definite hierarchy in the strength of your interpretations: your interpretations are most sure when you consider district averages, followed in order by building averages, classroom averages, and finally individual students’ scores.

Q.4   Describe the measures of central tendency. Also elaborate how these measures can be utilized in the interpretation of the test results. Provide examples where necessary.

Explore the measures of central tendency, and learn how the mean, median, and mode are used in the field of psychology.

What Is Central Tendency?

Think about how you describe a single piece of numerical data. This is usually done in terms of its value. For example, in order to describe the number 2, you might put up two fingers or you might say 2 = 1 + 1. How would you describe a group of data? It would not be beneficial to use your fingers in this instance. Nor is it beneficial to simply add the data together. However, you can describe a group of data in a single value by using measures of central tendency.

So, what exactly is a measure of central tendency? A measure of central tendency is a single value that describes the way in which a group of data cluster around a central value. In other words, it is a way to describe the center of a data set. There are three measures of central tendency: the mean, the median, and the mode.

How can you tell what is the average or typical value in a data set? The measures of central tendency can help you to figure this out! In this lesson, learn about these common ways to characterize data.

What are the Measures of Central Tendency?

Sarah made an 85 on her last math test, and she wondered how her grade compared to the other students in her class. The grades of all eleven students in the class were listed in a table, so Sarah decided to take this data and do some quick calculations.

[Table: test score data from the math class (not reproduced here)]

She wanted to know how well the average student in her class did on the exam. After doing some research, she concluded that the best way to do that would be to calculate one or more of the measures of central tendency for the test score data.

There are three common measures of central tendency: mean, median, and mode. Although each of these tries to give an average or representative value for the entire data set, they do so in different ways and, therefore, can be used to analyze different types of data.

Definitions and Equations

The mean is the arithmetic average of all the values in the data set. This is the most common measure of central tendency, but it has some disadvantages. If the data is significantly skewed or there are some outliers, the mean may not accurately reflect the true middle of the distribution.

To calculate the mean of a group of numerical values, add all the values together and then divide by the total number of values (n).

The median is the value that is in the middle of the data set when all the values are arranged from smallest to largest. Median is a better measure of central tendency in situations where the mean might be skewed. This usually occurs if the data is not normally distributed or if there are outliers. Outliers will have a much bigger effect on the mean than the median.

Finally, the mode is the most commonly occurring value. One advantage of the mode is that it can be used with qualitative data and not just numerical data. The mean and median, by contrast, can only be calculated for numerical data.
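The three definitions above are simple to compute. The sketch below uses Python's standard `statistics` module on a hypothetical set of eleven scores (Sarah's actual class data are not reproduced in this text), including two zero scores so that the contrast between the measures is visible.

```python
import statistics

# Hypothetical class scores (not Sarah's actual data): eleven values
# with two zeros as outliers and 95 as the most frequent score.
scores = [0, 0, 65, 70, 75, 80, 85, 90, 95, 95, 95]

print(statistics.mean(scores))    # arithmetic average, pulled down by the zeros
print(statistics.median(scores))  # middle value of the sorted list
print(statistics.mode(scores))    # most frequently occurring value
```

Note how the two zeros drag the mean well below the median, which is the point the worked example below makes with Sarah's data.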

Example – Mean

Let’s look at Sarah’s math test scores again and try to calculate the mean, median, and mode for this data.

To find the mean, add up all the test scores and divide by the total number of scores in the list (11 for Sarah’s class).

Example – Median

To find the median score on the test, put all the scores in order from smallest to largest. The one right in the middle is the median.

So, the mean of this data is 68.6 and the median is 80. If the data was normally distributed without outliers, you would expect these to be very similar, but in this case, they aren’t! Which one would give Sarah a better representation of the true middle of the data?

Looking at the data again, notice that there are two students who had a score of zero. Maybe those students were absent on the day of the test or didn’t take the test for some other reason. The rest of the scores fall between 65 and 100, so these two zeros are definitely outliers relative to the rest of the data. When there are outliers like this, the median will give you a better estimate of the true middle value of the data than the mean.

Example – Mode

The mode of this data is the number that occurs most frequently. In this case, the mode is 95 because it is the only score that occurs more than once in the data. For Sarah’s test score data, mode may not be the best way to measure central tendency.

Why Is Central Tendency Important?

Central tendency is very useful in psychology. It lets us know what is normal or ‘average’ for a set of data. It also condenses the data set down to one representative value, which is useful when you are working with large amounts of data. Could you imagine how difficult it would be to describe the central location of a 1000-item data set if you had to consider every number individually?

Central tendency also allows you to compare one data set to another. For example, let’s say you have a sample of girls and a sample of boys, and you are interested in comparing their heights. By calculating the average height for each sample, you could easily draw comparisons between the girls and boys.

Central tendency is also useful when you want to compare one piece of data to the entire data set. Let’s say you received a 60% on your last psychology quiz, which is usually in the D range. You go around and talk to your classmates and find out that the average score on the quiz was 43%. In this instance, your score was significantly higher than those of your classmates. Since your teacher grades on a curve, your 60% becomes an A. Had you not known about the measures of central tendency, you probably would have been really upset by your grade and assumed that you bombed the test.

Three Measures of Central Tendency

Let’s talk more about the different measures of central tendency. You are probably already familiar with the mean, or average. The mean is calculated in two steps:

  1. Add the data together to find the sum
  2. Take the sum of the data and divide it by the total number of data

Now let’s see how this is done using the height example from earlier. Let’s say you have a sample of ten girls and nine boys.

The girls’ heights in inches are: 60, 72, 61, 66, 63, 66, 59, 64, 71, 68.

Here are the steps to calculate the mean height for the girls:

First, you add the data together: 60 + 72 + 61 + 66 + 63 + 66 + 59 + 64 + 71 + 68 = 650. Then, you take the sum of the data (650) and divide it by the total number of data (10 girls): 650 / 10 = 65. The average height for the girls in the sample is 65 inches. If you look at the data, you can see that 65 is a good representation of the data set because 65 lands right around the middle of the data set.

To illustrate this point, let’s look at what happens to the mean when we change 68 to 680. Again, we add the data together: 60 + 72 + 61 + 66 + 63 + 66 + 59 + 64 + 71 + 680 = 1262. Then we take the sum of the data (1262) and divide it by the total number of data (10 girls): 1262 / 10 = 126.2. The mean height (in inches) for the sample of girls is now 126.2. This number is not a good estimate of the central height for the girls. This number is almost twice as high as the height of most of the girls!
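The same calculation can be checked in code. This brief sketch uses the girls' heights listed above, first as given and then with 68 replaced by the outlier 680.

```python
# Girls' heights from the example above, and the same list with the
# value 68 replaced by the outlier 680.
girls = [60, 72, 61, 66, 63, 66, 59, 64, 71, 68]
with_outlier = [60, 72, 61, 66, 63, 66, 59, 64, 71, 680]

mean = sum(girls) / len(girls)                         # 650 / 10 = 65.0
mean_outlier = sum(with_outlier) / len(with_outlier)   # 1262 / 10 = 126.2
print(mean, mean_outlier)
```

A single wild value shifts the mean from 65 to 126.2, even though nine of the ten heights are unchanged.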

However, we can still use other measures of central tendency even when there are outliers. In the scenario above, where a girl who is 680 inches is an outlier, we could use the median. But first, let’s explore how to find the median.

The median is the value that cuts the data set in half. If you have an odd number of data, then it’s the value that’s right in the middle. Let’s practice with the boys’ heights since there are nine boys. There are two steps to finding the median in a sample with an odd number of data:

  1. List the data in numerical order
  2. Locate the value in the middle of the list

Now let’s find the median height for our sample of boys. The boys’ heights in inches are: 66, 78, 79, 69, 77, 79, 73, 74, 62. So, first we list the data in numerical order: 62, 66, 69, 73, 74, 77, 78, 79, 79. Then, we locate the value in the middle of the list: 62, 66, 69, 73, 74, 77, 78, 79, 79. In a data set that consists of nine items, the datum in the fifth place is the median. The median height for the boys is 74 inches.
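The two-step median procedure translates directly to code. This sketch handles only the odd-length case described here; the function and variable names are my own.

```python
def median_odd(values):
    """Middle value of the sorted list (odd-length samples only)."""
    ordered = sorted(values)           # step 1: list in numerical order
    return ordered[len(ordered) // 2]  # step 2: value in the middle

boys = [66, 78, 79, 69, 77, 79, 73, 74, 62]
print(median_odd(boys))  # 74, the fifth of the nine sorted values
```

For an even number of values, the usual convention is to average the two middle values instead.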

Q.5   Briefly describe the present trends and classroom techniques used by teachers for the formative assessment of the students’ learning. Also enlist the components of a good progress report.

There are many ways of reporting test performance. A variety of scores can be used when interpreting students’ test performance.

Raw Scores

The raw score is the number of items a student answers correctly without adjustment for guessing. For example, if there are 15 problems on an arithmetic test, and a student answers 11 correctly, then the raw score is 11. Raw scores, however, do not provide us with enough information to describe student performance.

Percentage Scores

A percentage score is the percent of test items answered correctly. These scores can be useful when describing a student’s performance on a teacher-made test or on a criterion-referenced test. However, percentage scores have a major disadvantage: We have no way of comparing the percentage correct on one test with the percentage correct on another test. Suppose a child earned a score of 85 percent correct on one test and 55 percent correct on another test. The interpretation of the score is related to the difficulty level of the test items on each test. Because each test has a different or unique level of difficulty, we have no common way to interpret these scores; there is no frame of reference.
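As a quick sketch, the conversion from raw score to percentage score for the arithmetic-test example above (11 of 15 items correct) looks like this; the function name is my own.

```python
# Raw score -> percentage score: percent of items answered correctly.
def percentage_score(raw_score, total_items):
    return 100 * raw_score / total_items

print(round(percentage_score(11, 15), 1))  # about 73.3 percent correct
```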

To interpret raw scores and percentage-correct scores, it is necessary to change the raw or percentage score to a different type of score in order to make comparisons. Evaluators rarely use raw scores and percentage-correct scores when interpreting performance because it is difficult to compare one student’s scores on several tests or the performance of several students on several tests.

Derived Scores

Derived scores are a family of scores that allow us to make comparisons between test scores. Raw scores are transformed to derived scores. Developmental scores and scores of relative standing are two types of derived scores. Scores of relative standing include percentiles, standard scores, and stanines.

Developmental Scores

Sometimes called age and grade equivalents, developmental scores are scores that have been transformed from raw scores and reflect the average performance at age and grade levels. Thus, the student’s raw score (number of items correct) is the same as the average raw score for students of a specific age or grade. Age equivalents are written with a hyphen between years and months (e.g., 12–4 means that the age equivalent is 12 years, 4 months old). A decimal point is used between the grade and month in grade equivalents (e.g., 1.2 is the first grade, second month).

Developmental scores can be useful (McLean, Bailey, & Wolery, 1996; Sattler, 2001). Parents and professionals can easily interpret them and place the performance of students within a context. However, because these scores are easily misinterpreted, parents and professionals should approach them with caution. There are a number of reasons for criticizing these scores.

One criticism is that, for a student who is 6 years old and in the first grade, grade and age equivalents presume that an equal amount of learning occurs for each month of first grade. But, from our knowledge of child growth and development and theories about learning, we know that neither growth nor learning occurs in equal monthly intervals. A second criticism is that age and grade equivalents do not take into consideration the variation in individual growth and learning.

A third criticism of developmental scores is that age and grade equivalents encourage the use of false standards. A second-grade teacher should not expect all students in the class to perform at the second-grade level on a reading test. Differences between students within a grade mean that the range of achievement actually spans several grades. In addition, developmental scores are calculated so that half of the scores fall below the median and half fall above the median. Age and grade equivalents are not standards of performance.

A fourth criticism of age and grade equivalents is that they promote typological thinking. The use of age and grade equivalents causes us to think in terms of a typical kindergartener or a typical 10-year-old. In reality, students vary in their abilities and levels of performance. Developmental scores do not take these variations into account.

A fifth criticism is that most developmental scores are interpolated and extrapolated. A normed test includes students of specific ages and grades—not all ages and grades—in the norming sample. Interpolation is the process of estimating the scores of students within the ages and grades of the norming sample. Extrapolation is the process of estimating the performance of students outside the ages and grades of the normative sample.

Developmental Quotient

A developmental quotient is an estimate of the rate of development. If we know a student’s developmental age and chronological age, it is possible to calculate a developmental quotient. For example, suppose a student’s developmental age is 12 years (12 years × 12 months per year = 144 months) and the chronological age is also 12 years, or 144 months. Using the following formula, we arrive at a developmental quotient of 100.

Developmental quotient = (Developmental age / Chronological age) × 100

(144 months / 144 months) × 100 = 1 × 100 = 100

But, suppose another student’s chronological age is also 144 months and that the developmental age is 108 months. Using the formula, this student would have a developmental quotient of 75.

Developmental quotient = (108 months / 144 months) × 100 = 75
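The formula translates directly into a one-line function. This is a minimal sketch using the two examples above; the function name is my own.

```python
# Developmental quotient: rate of development as a ratio of
# developmental age to chronological age, both in months, times 100.
def developmental_quotient(developmental_age_months, chronological_age_months):
    return 100 * developmental_age_months / chronological_age_months

print(developmental_quotient(144, 144))  # 100.0
print(developmental_quotient(108, 144))  # 75.0
```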

Developmental quotients have all of the drawbacks associated with age and grade equivalents. In addition, they may be misleading because developmental age may not keep pace with chronological age as the individual gets older. Consequently, the gap between developmental age and chronological age becomes larger as the student gets older.

Standard Scores   Another type of derived score is a standard score. Standard score is the name given to a group or category of scores. Each specific type of standard score within this group has a fixed mean and a fixed standard deviation, which makes standard scores an excellent way of representing a child’s performance. Standard scores allow us to compare a child’s performance on several tests and to compare one child’s performance to the performance of other students. Unlike percentile scores, standard scores can be used in mathematical operations; for instance, standard scores can be averaged. Standard scores are equal-interval scores. The different types of standard scores, some of which we discuss in the following subsections, are:

  1. z-scores: have a mean of 0 and a standard deviation of 1.
  2. T-scores: have a mean of 50 and a standard deviation of 10.
  3. Deviation IQ scores: have a mean of 100 and a standard deviation of 15 or 16.
  4. Normal curve equivalents: have a mean of 50 and a standard deviation of 21.06.
  5. Stanines: standard score bands that divide a distribution of scores into nine parts.
  6. Percentile ranks: the point in a distribution at or below which the scores of a given percentage of students fall.
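Because every standard-score scale in the list above is defined by its mean and standard deviation, converting a z-score to any of the other scales is a single multiply-and-add. A minimal sketch (the function name is my own):

```python
# Convert a z-score to another standard-score scale: each scale is
# just mean + z * standard_deviation.
def to_scale(z, mean, sd):
    return mean + z * sd

z = 1.0  # one standard deviation above the mean
print(to_scale(z, 50, 10))     # T-score: 60.0
print(to_scale(z, 100, 15))    # deviation IQ (SD 15): 115.0
print(to_scale(z, 50, 21.06))  # normal curve equivalent: about 71.06
```

A student one standard deviation above the mean therefore has a z-score of 1, a T-score of 60, and a deviation IQ of 115: the same standing expressed on different scales.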

Deviation IQ Scores   Deviation IQ scores are frequently used to report the performance of students on norm-referenced standardized tests. The deviation scores of the Wechsler Intelligence Scale for Children–III and the Wechsler Individual Achievement Test–II have a mean of 100 and a standard deviation of 15, while the Stanford-Binet Intelligence Scale–IV has a mean of 100 and a standard deviation of 16. Many test manuals provide tables that allow conversion of raw scores to deviation IQ scores.

Normal Curve Equivalents   Normal curve equivalents (NCEs) are a type of standard score with a mean of 50 and a standard deviation of 21.06. When the baseline of the normal curve is divided into 99 equal units, the percentile ranks of 1, 50, and 99 are the same as NCE units (Lyman, 1986). One test that does report NCEs is the Developmental Inventory-2. However, NCEs are not reported for some tests.

Stanines  Stanines are bands of standard scores that have a mean of 5 and a standard deviation of 2. Stanines range from 1 to 9. Despite their relative ease of interpretation, stanines have several disadvantages. A change in just a few raw score points can move a student from one stanine to another. Also, because stanines are a general way of interpreting test performance, caution is necessary when making classification and placement decisions. As an aid in interpreting stanines, evaluators can assign descriptors to each of the 9 values:

9—very superior

7—very good

4—below average

3—considerably below average

1—very poor

Basal and Ceiling Levels

Many tests, because test authors construct them for students of differing abilities, contain more items than are necessary. To determine the starting and stopping points for administering a test, test authors designate basal and ceiling levels. (Although these are really not types of scores, basal and ceiling levels are sometimes called rules or scores.) The basal level is the point below which the examiner assumes that the student could obtain all correct responses and, therefore, it is the point at which the examiner begins testing.

The test manual will designate the point at which testing should begin. For example, a test manual states, “Students who are 13 years old should begin with item 12. Continue testing when three items in a row have been answered correctly. If three items in a row are not answered correctly, the examiner should drop back a level.” This is the basal level.

Let’s look at the example of the student who is 9 years old. Although the examiner begins testing at the 9-year-old level, the student fails to answer correctly three in a row. Thus, the examiner is unable to establish a basal level at the suggested beginning point. Many manuals instruct the examiner to continue testing backward, dropping back one item at a time, until the student correctly answers three items. Some test manuals instruct examiners to drop back an entire level, for instance, to age 8, and begin testing. When computing the student’s raw score, the examiner includes items below the basal point as items answered correctly. Thus, the raw score includes all the items the student answered correctly plus the test items below the basal point.

The ceiling level is the point above which the examiner assumes that the student would obtain all incorrect responses if the testing were to continue; it is, therefore, the point at which the examiner stops testing. “To determine a ceiling,” a manual may read, “discontinue testing when three items in a row have been missed.”
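The basal and ceiling rules described above can be sketched as a scoring routine. This is a hypothetical illustration, not any publisher's actual algorithm: it credits every item below the basal as correct and stops testing at a ceiling of three consecutive misses.

```python
# Hypothetical basal/ceiling scoring sketch.
def raw_score(responses, basal_index):
    """responses: True/False per item, in item order.
    basal_index: first item actually administered; all items below it
    are credited as correct (the basal assumption).
    Testing stops at the ceiling: three misses in a row."""
    score = basal_index  # items below the basal count as correct
    misses_in_a_row = 0
    for answered_correctly in responses[basal_index:]:
        if answered_correctly:
            score += 1
            misses_in_a_row = 0
        else:
            misses_in_a_row += 1
            if misses_in_a_row == 3:
                break  # ceiling reached; stop testing
    return score

# Hypothetical run: basal established at item 12, then a mix of
# hits and misses ending in three consecutive misses.
answers = [True] * 12 + [True, True, True, False, True, False, False, False]
print(raw_score(answers, 12))  # 12 credited + 4 answered correctly = 16
```

Items beyond the ceiling are never administered, which is the whole point of basal and ceiling rules: they shorten testing without changing the score the student would have earned.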
