
Advances in Social Sciences Research Journal – Vol. 10, No. 7

Publication Date: July 25, 2023

DOI:10.14738/assrj.107.14968.

Iddrisu, S. A. (2023). Validation of Mentee-Teachers’ Assessment Tool within the Framework of Generalisability Theory at the

Faculty of Education, University for Development Studies. Advances in Social Sciences Research Journal, 10(7). 103-128.


Validation of Mentee-Teachers’ Assessment Tool within the

Framework of Generalisability Theory at the Faculty of

Education, University for Development Studies

Simon Alhassan Iddrisu

University for Development Studies,

P. O. Box TL 1350, Tamale, Northern Region, Ghana

ABSTRACT

Practitioners in assessment and other researchers have over the years expressed

dissatisfaction with the lack of consistency in scores obtained from use of multiple

measurement instruments. Such scores derived from these largely inconsistent and

unreliable procedures are relied upon by decision makers in taking very important

decisions in education, health and other related fields. The purpose of this study

therefore, was to apply the G theory procedures to validate the mentee assessment

tool being used at the Faculty of Education, University for Development Studies. The

G study involved estimating the generalisability (reliability-like) coefficients of

mentees’ assessment scores, and determining the level of acceptability (validity) of

these coefficients. A nested design was used because different sets of raters

assessed different student-mentees on different occasions in the field. The

relationship among these variables (facets); students, raters and occasions,

appropriately mirrored a nested relationship. Data obtained by raters on 300

students, in the 2018/2019 off-campus teaching practice, were entered into EDUG

software for relevant analysis of the results. The study found that both rater and

student facets accounted for the largest measurement errors in mentees’ observed

scores, reporting an estimated G coefficient of 0.62 (62%) and representing a

positive moderate relationship. Based on these findings, the study concluded that,

the quality of mentee observed scores could be improved for either relative or

absolute decisions by varying the number of levels of both raters and occasions. To

achieve acceptable G coefficient values of 0.83 and above, it is recommended that

decision makers employ a model that uses four raters per occasion for three

occasions of assessment.

Keywords: Object of measurement, Relative decision, Absolute decision, universe of

generalisation, universe of admissible observations, Composite facet, Generalisability

study, Decision(D) Study, Optimisation

INTRODUCTION

Educational assessors and researchers alike increasingly express concern about the reliability

and validity of scores produced from multiple measurement procedures such as tests, rating

scales, surveys and other forms of observations (Alkharusi, 2012). This is because scores

generated through the use of any measurement procedure in educational and psychometric

assessments, often are basis for making very important decisions (Kolen and Brennan, 2014,

1995; Hughes & Garrett, 2018). Kolen (2014) identified three levels of decision-making based


on assessment scores which are: the individual, institutional, and public policy levels. Individual

level decisions based on results may involve a student opting to attend a certain tertiary or non-tertiary institution or even electing to pursue a certain programme of study (Fendler, 2006;

Kolen & Brennan, 2014). Institutional level decisions likewise rely on previous assessment

records to either certify professionals or to admit applicants into tertiary programmes in

relevant institutions. Public policy level decisions address general problems such as improving

quality and access to education in the nation for all to benefit from. Shavelson and Webb (2004)

submitted that the usefulness of any assessment score depends largely on the extent to which we can generalize from it, with accuracy and precision, to a wider set of situations.

Allen and Yen (2011) also reckoned that, assessment results generally have multiple purposes

and applications, such as found in the selection of new employees, applicants or clients for

varied reasons. Yukawa, Gansky, O’Sullivan, Teherani and Fieldman (2020) maintained that in

the training of budding professionals in the fields of education, health, law, agriculture,

business, etc, assessments remained integral in the process, where relevant rating scales or

tools are administered periodically in the conduct of these assessments.

Atilgan (2019), likewise indicated that the choice of a tool for assessment in education depends

on what attribute to be measured. Essay-type instruments and tailored rubrics are among

several tools reviewed in literature for purposes of assessing the writing skills and other

competencies of trainees (Atilgan, 2019; Atilgan, Kan & Aydin, 2017; Turgut & Baykul, 2010).

Graham, Harris, and Herbert (2011) for instance, used a writing-based essay type rubric in

assessing students’ writing skills at the primary level. Fleming, House, Hanson, Garbutt, Kroenke,

Abedin and Rubio (2013) developed the Mentoring Competency Assessment tool (MCA), which

they used in assessing the skills of mentors in clinical and translational science. An estimate of

the reliability and validity of scores obtained from the MCA tool showed high and positive

relationship among the competencies examined.

Many educational settings, schools and other similar institutions are arguably the largest

consumers of data emanating from multiple testing and other assessment procedures (Miller et

al., 2011). A major challenge associated with measurements in both the social sciences and

education is therefore the inconsistency (unreliability) of those measurements (Sirec, 2017,

Revelle, 2016, Brennan, 2005). When the same characteristic is measured on two different

occasions, the results obtained often are different (Steyer, 1999; Revelle, 2016). Steyer et al.

(1999), also intimated that, irrespective of the measures an institution or body may put in place

to assure the sanctity of scores produced from measurement processes, many potential

sources of error continue to persist and must be removed.

Many studies cited in literature have used the G theory in investigating the reliability of rating

scales and scores obtained from such ratings. Kim, Schatschneider, Wanzek, Gatlin and Otaiba

(2017) examined raters and tasks as facets of interest contributing to measurement error in a generalisability study and reported that the rater facet was a major

contributor to measurement error. Sudweeks, Reeve and Bradshaw (2005) similarly, estimated

the individual contributions to total variance in a G study with raters and occasions as variables

(facets) of interest. However, they reported that, a rater’s years of experience in teaching

contributed more to measurement error than the rater factor. This report on rater contribution


to measurement errors contradicted that by Kim et al. (2017), who reported a substantial

contribution by the rater factor to measurement error.

Researchers like Kan (2007), Graham, Hebert, Sandbank and Harris, (2016), Bouwer, Beguin,

Sanders, and van der Ber (2015), and Gebril (2009) variously conducted G studies on a guidance

tool rating scale, with the number of essay samples, types of essays, and types of tasks set as

factors of interest. Atilgan (2019), examined the reliability of essay rating rubrics using a G

theory framework, while Lin and Xiao (2018) investigated the rater reliability using holistic and

analytic scoring keys within G theory procedures. With these different studies, the ultimate goal

was to quantify the contribution of the individual facets and their composites to total variance.

G theory was chosen over Classical Test Theory (CTT) and Item Response Theory (IRT) due to

its superiority in quantifying multiple sources of error in a single study (Brennan, 2005).

Whereas the Classical Test Theory focuses on the measurement of reliability in order to

differentiate among individuals (Cardinet et al., 2010), G theory in addition, enables the user to

evaluate the quality of measurements, not just among individuals but also objects (Cardinet,

Johnson, & Pini; 2010). Again, while in CTT, coefficient values determined serve as global

indicators of the quality of measurement, G theory does not only calculate the coefficients, but

it further provides information on the relative contributions and importance of the different

sources of measurement error. Through this unique function, G theory thus permits the user to

adjust the factors in a measurement procedure in order to improve measurements.

The Generalisability theory practically is applied at two levels, namely; the G study and the

decision (D) study levels (Heitman, Kovaleski & Pugh, 2009). Whereas the G study enables the

estimation of variance components and reliability coefficients, the D study enables the

investigator to determine the optimal number of levels of facets and possibly to ‘positively impact

interrater reliability’ for making decisions (Moskal & Leydens, 2000, p.28). It allows the user to

alternatively employ different levels of the variables involved so as to improve the quality of

measurements.

Statement of the Problem

The Faculty of Education, University for Development Studies, trains professional teachers for

the various levels of education, in line with the national aims of producing quality teachers. This

Faculty continues to introduce new programmes and to create additional academic departments.

Like many curricula used for training of professionals, that for the training of professional

teachers has two main components, namely the content and the pedagogy (practical) aspects.

Pedagogical training, equips students with the relevant professional skills and attitudes they

require to enable them teach proficiently in the classrooms, at all levels of education.

The practical components of the training, which are implemented as school observation, on-campus (peer) teaching, and off-campus teaching practice, often are assessed using a tool. This

assessment instrument has, over the years (since the 2012/2013 academic year), been changed at least twice on account of the grossly unsatisfactory grades the Professional

Education Practice Unit (PEPU) receives on behalf of faculty from raters assigned to evaluate

mentees.


Both mentees and faculty have at various times vented their displeasure and suspicions

regarding the accuracy, and the quality of grades awarded during professional practice

sessions. Similar complaints and expressions of dissatisfaction with the observed scores given

to students keep recurring with different cohorts of students at the Faculty of Education.

Essentially, the problems over the period have been the glaring inconsistencies associated with

scores assigned by different raters to students using this assessment tool. Notwithstanding

these dissenting views about the dependability of the observed scores, assessments of mentee

work in the field remains an indispensable component of their training. It was therefore

incumbent on faculty, lecturers, students and the Professional training unit to devise ways of

refining the mentee assessment tool, so as to drastically diminish the deficiencies associated

with students’ grades, assigned to mentees during field practice. The deficiencies associated

with the observed scores essentially were the measurement errors whose sources were

unknown and needed to be investigated.

It appears this particular tool has never been scientifically studied either within this university

or outside it, to determine the error sources and their magnitudes in the measurements.

Possible sources of errors linked with mentee scores may include, but are not limited to, raters, occasions of rating, mood of raters, and items used to examine students, among several others

(Cook et al., 2018; Atilgan, 2012; Brennan, 2005; Shavelson & Webb, 1991). This study therefore

set out with a primary purpose to examine closely the overall reliability of mentee scores

assigned by raters on the different occasions using the G theory framework.

Purpose of the Study

The purpose of this study was to validate the mentee assessment tool through estimating the

generalisability (reliability) coefficients, and the variance components, and to estimate the

optimal levels of the facets for achieving high reliability estimates.

Specifically, this study was guided by the objectives outlined below:

1. to quantify the rater and occasion-effects on the reliability of mentees’ observed scores.

2. to explore the effects of composite factors of interest on reliability of the mentees’

assessment scores.

Research Questions

This study was guided by the following questions:

1. What amount of variance in mentees’ observed scores is attributable to the rater and

occasion facets?

2. What effects do the composite facets have on the dependability (reliability) and validity of

students’ observed scores?

Significance of the Study

The significance of this research finds expression in its meaningful contribution to improving existing knowledge and practice in the relevant discipline. First, the outcomes have the potential to repose confidence and trust in mentees’ assessment scores. By estimating

the dependability coefficients of the observed scores, the strength of relationship and quality of

scores so produced are established enabling decision makers to be well informed on their

decisions based on these scores.


Again, decision makers can tell which individual or composite factor(s) contribute the least or the

largest variance to the total variance, so that measures can be taken to reduce the errors and

generate high enough coefficients. By subjecting an assessment procedure (tool) through an

empirical study, the scores that emanated from the use of that procedure equally were refined

through the process and, therefore it is expected that scores would be acceptable to all

stakeholders (both students and faculty), as reflecting true attributes of mentees.

Many researchers and other incidental users of the tool and its products, would and should

invest greater “faith” in the tool and its outcomes following this validation study. Also, mentees

whose performance was assessed using the tool would have known at least the estimated

levels of accuracy in their performance leading to some confidence being reposed in the scores.

Delimitation of the Study

This study was restricted in scope naturally by the choice of a named tool and the application

of a particular theoretical framework to study it. Participation in this study was restricted to

only raters, and mentees who were captured in documents (assessment tool), from which data

for the study were extracted. Within the institutions that use this tool, there were many

categories of students pursuing other programmes, who equally were excluded from

participating in this study by virtue of not being members of the faculty.

In addition, mentee observed scores obtained during the 2018/2019 off-campus internship

were also the only admissible data for analysis in this study. Scores of mentees obtained without

the use of the assessment tool were ineligible for use in this study.

Limitations of the Study

Limitations of the study entailed the deficiencies associated with the research methodologies

applied in this study, which were capable of affecting the generalisability of the findings of this

study. Fixing the object of study (the assessment tool) nulled all the possible contributions of all composites of the tool, limiting knowledge of how much these composites added to total variance.

While the nested G study design is less costly, and also usable in multiple situations, it does not

enable the maximisation of information about the facets of interest in the study (Cardinet et al.,

2010).

G theory is context specific, and so coefficients estimated for a particular population or universe

may not be generalised beyond that particular universe. The findings on a given population or

conditions for a given study may not be generalised to another population.

LITERATURE REVIEW

Generalisability (G) Theory

Generalizability (G) theory is a statistical theory, or body of procedures, for evaluating the dependability (or reliability) of behavioural measurements (Cronbach, Gleser, Nanda, & Rajaratnam, cited in

Brennan, 2001; Shavelson & Webb, 1992). The concept of dependability defines the accuracy with which one generalizes from, say, a student’s observed score on an assessment or other measure to the average score that student would have received under all possible conditions

(Alkharusi, 2012). These possible conditions include the different forms of the instrument, all

occasions of rating, or all possible raters. He indicated that the average score is the same as the

universe score, and it is synonymous with CTT’s concept of true score. G theory thus defines a


score as dependable and valid if it allows accurate inferences to be made about the universe of

admissible observations that it is meant to replace (Allal & Cardinet, 1997).

The G theory extends, amplifies and widens the classical reliability theory by identifying,

defining and estimating multiple sources of measurement error (Shavelson, Webb, & Rowley;

1992; Alkharusi, 2012). According to Cronbach et al, cited in Orenz (2018), the philosophy upon

which G theory is founded maintained that, researchers and assessors are interested in the

precision or reliability of a measure because they wish to generalise from the observation in

hand to some class of observations to which it belongs (p.144). For instance, teachers would

often wish to generalise from students’ performance on particular occasions to future student

performances on multiple occasions.

In G theory, the presumed multitude or infinite number of occasions over which the investigator

wishes to generalize is termed a universe of occasions (Shavelson & Webb, 1991).

Likewise, an investigator may want to generalise over a universe of items, raters, situations of

observations and students among others. This set of conditions of measurement the researcher

might desire to generalise over is the universe of generalisation (Alkharusi, 2012; Shavelson,

Wiley & Yin, 2015). This universe of generalisation varies from study to study and so a

researcher must define it explicitly by identifying the conditions of measurement he/she

desires to generalise to in the study (Brennan, 2010).

Again, in G theory language, a set of conditions of measurement of the same type are the facets

(Alkharusi, 2012; Brennan, 2010). For instance, the occasions of assessment, raters who

conduct assessments, rating scales or items used may constitute facets in a study (Shavelson,

Webb & Wiley, 1992; Menendez and Gregori-Giralt, 2018).

Generalisability theory is useful if the desire is to refine the measurement tools/procedures to

yield improved and reliable data (Heitman, Kovaleski & Pugh, 2009). G theory was chosen over

the CTT because of its superior alternative procedures, which yield more useful intra-class

correlation coefficients (Denegar & Ball, 1993). G theory further addresses the dependability of

measurement issues, and allows the simultaneous estimation of multiple sources of variance,

including interactions (Shavelson & Webb; 1991, Naizer, 2007; Morrow, 1989) within a single

study. This theory permits a decision maker to investigate the dependability of scores for

different kinds of interpretations and uses, making it the preferred choice among other

theories for this study.

Generalisability Theory and Validity

Validation is a process of evaluating the claims, assumptions and inferences that link

assessment scores with their intended interpretations and uses (Kane, 2006). To validate an

assessment tool using G theory procedures requires the merging of the concepts; “reliability”

and “validity” (Gugiu, Gugiu, & Baldus, 2012). Orenz (2018), reported another approach to

validation which requires the investigator to commence by defining the universe of interests,

and then observe two or more independently selected conditions within that universe. A

relationship between a generalizability coefficient and validity is thus inferred, where the G

coefficient resolves into a validity coefficient. And because the generalisability coefficient

indicates how accurately universe scores can be inferred from observed scores, it can be

interpreted as a validity coefficient (Brennan, 2011).
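For reference in what follows, the link drawn here between generalisability and validity rests on the standard definitions of the two G coefficients. A minimal statement of these definitions, in Brennan's conventional notation rather than a formula reproduced from this paper, is:

```latex
% Relative (generalisability) and absolute (dependability/phi) coefficients:
% universe-score variance divided by itself plus the relevant error variance.
E\rho^{2} = \frac{\sigma^{2}_{\tau}}{\sigma^{2}_{\tau} + \sigma^{2}(\delta)}
\qquad
\Phi = \frac{\sigma^{2}_{\tau}}{\sigma^{2}_{\tau} + \sigma^{2}(\Delta)}
```

Here σ²_τ is the universe-score (differentiation) variance, σ²(δ) the relative error variance and σ²(Δ) the absolute error variance; the closer either ratio is to 1, the more accurately observed scores can stand in for universe scores.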


Contemporary validation frameworks identify five sources of validity evidence without

classifying them as ‘types’ of validity. For instance, Messick cited in Cook and Beckman (2006),

listed five sources of evidence in support of construct validity as; content, response process,

internal structure, relations to other variables and consequences. These sources of validity

evidence are not types but represent categories of evidence, that can be gathered to support

construct validity conclusions derived from assessment scores (Cook & Beckman, 2006).

Besides the popular Cronbach’s alpha (α) coefficients, which are generally accepted standards for interpreting these reliability and validity coefficients (Naumenko, 2015), George and Mallery (2003) proposed guidelines which serve as yardsticks for interpreting validity-like coefficients. These guidelines provide that a coefficient alpha value greater than 0.90 be

interpreted as “excellent”, around 0.80 as “good”, with 0.70 as “acceptable”, 0.60 as

“questionable”, around 0.50 as “poor”, and figures below 0.50 as “unacceptable”. For purposes

of interpreting coefficients in this study, descriptors for professional performance such as

outstanding shall apply for coefficients 0.90 or higher, proficient for coefficients from 0.80 to

less than 0.90, competent for coefficients from 0.70 to less than 0.80, apprentice for coefficients from

0.60 to less than 0.70, and novice for coefficients from 0.50 and below. These descriptors seem

more appropriate for describing performance and categorizing different levels of performance

by trainees.
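As an illustration of the descriptor scheme adopted above, the following minimal sketch maps a coefficient value to the performance-style label used in this study; the thresholds come from the paragraph above, the function name is purely illustrative, and values between 0.50 and 0.60 (not explicitly assigned in the text) are treated here as novice:

```python
def describe_coefficient(coef: float) -> str:
    """Map a reliability/validity-like coefficient to the descriptors adopted in this study."""
    if coef >= 0.90:
        return "outstanding"
    elif coef >= 0.80:
        return "proficient"
    elif coef >= 0.70:
        return "competent"
    elif coef >= 0.60:
        return "apprentice"
    else:
        # The text assigns "novice" to 0.50 and below; 0.50-0.60 is treated as novice here.
        return "novice"

print(describe_coefficient(0.62))  # -> "apprentice" (the G coefficient reported later in this study)
```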

Sources of Measurement Error

Measurement error is an integral part of any assessment exercise, and therefore, it is a truism

that every measurement situation or practice is susceptible to some measurement error of

some sort (Menendez & Gregori-Giralt; 2018). Measurement errors may manifest as random or

systematic types of error (Revelle, 2016; Marcoulides, 1993) as indicated earlier. These

measurement errors vary due to differences in raters’ inconsistencies in rating, and

consequently culminate in a decrease in reliability and validity of the outcomes (Atilgan, 2019).

Tavares, Brydges and Turner (2018) proposed a departure from the use of the traditional

classical methods of estimating reliability, to more contemporary procedures like the use of G

theory methods which have the advantage of differentiating error sources, and quantifying the

contributions of each source to total measurement error. These diverse sources of

measurement error may include information introduced through external sources (Sadler,

2009); differences in severity or leniency of scoring by assessors (Knight, 2006), occasions of

assessment or differences in the level of training or experience of raters (Howell, 2014). Other

error sources may include variations in the depth of meanings or interpretations obtained by

raters of the rating scale or tool (Tan & Prosser, 2004).

Generalisability Designs Studies

G theory, according to Brennan (2010), recognizes and distinguishes among designs as either

crossed, nested or mixed, and facets as random, fixed, finite or infinite. In a crossed

measurement design, all conditions of the facet, for instance, items are observed with all

conditions of another source of variation (e.g., persons). If raters were crossed with students and

occasions, it means all students will be assessed by all raters on all occasions (Shavelson &

Webb, 2018;1991). However, in some situations, not all raters observe all students on all

occasions. Instead, for reasons of resource constraints, different raters may assess different

students on different situations or occasions in a nested design (Cardinet et al., 2010). A

notation i:s or i(s) is used to denote facet i nested within facet s. A facet i is nested within facet s when two or more conditions of i are observed with each condition of s, and when different conditions of i are associated with each condition of s; both conditions stated in this


definition must be upheld for a facet to be nested within another (Cardinet et al., 2010;

Shavelson & Webb, 2018).

G theory may also consider a facet(s) as either “random” or fixed in study design. A sample is

random when it is drawn randomly or assumed exchangeable with any other sample of the

same size obtained from the universe (Gugiu, Baldus & Gugiu, 2012; Cardinet et al., 2010;

Shavelson & Webb, 1991). In G theory, the number of variance components produced is

dependent on the design adopted for the study. This is because, the variation would be

partitioned for the object of measurement, the facets, the interactions, and the residuals.

Advantages of G Theory

The advantages of G theory have been espoused by Shavelson and Webb (1991), Thompson

(2003), and Rentz (2018) as follows:

G theory applies less restrictive assumptions, when used in study designs. Particularly, it

operates on the assumption that persons and other measurement facets are randomly sampled.

Again, unlike other theories, G theory can be used to assess multiple sources of error

simultaneously in a single measurement (i.e., items, raters, occasions, etc.). It has the unique

property of identifying and estimating separately, each of the possible sources of measurement

errors and their combined effects. It is advantageous to an investigator whose measure

combines several components in a study of a construct and its dependability. G theory also

distinguishes for the researcher, between the magnitude and type of errors, to enable decision

makers decide whether the levels of errors are within permissible limits for subsequent

applications. G theory also estimates and reports generalisability coefficients explicitly without

any ambiguity. It distinguishes between relative and absolute coefficients, which are applied

respectively in taking relative and absolute decisions. Other theories like the CTT do not specify

the type of coefficient estimated and may not specify the universe to which researchers may

generalise, unlike the G theory.

Notwithstanding the obvious advantages G theory has over the other theories, it is not without

limitations in its applications (Alkharusi, 2012; Strube, 2002; Shavelson & Webb, 1991; Webb,

Rowley, & Shavelson, 1988). First, G theory is context specific, and so coefficients estimated for

a particular population or universe cannot be generalised beyond the particular universe. The

findings on a given population or conditions for a given study may not be generalised to another

population.

Research Design

A partially nested G design was used to estimate and quantify the individual and composite

contributions to the total variance in the mentee’s observed score. The G theory procedures

were found more appropriate, for they permitted me to isolate, quantify, and to describe the

contributions of constituent sources of measurement error. The outputs from G theory analysis

were relatively simple and easy to interpret compared with other theories. G theory was thus

justifiably selected for its versatility and for its multiple utility in simultaneously quantifying,

and segregating variance components of multiple error sources (Cardinet et al., 2010;

Shavelson & Webb, 2005). The design was implemented at three levels; Observation, Estimation

and Measurement designs as explained below:


Observation design described the data structure that identified specific facets and their

interactions, involving the number of levels of each facet, such as raters, students and occasions

(Cardinet et al, 2010). The facets of interest were the assessment tool (Object of measurement),

Students (mentees), Occasions and Raters. For the purposes of this study, raters were

reorganized into two sets; rater 1s and rater 2s based on which occasion the rater did the

assessment. The nested design; (r:o) x ps, represented different sets of raters, who rated

different groups of students on different occasions, using the same assessment tool.

The G study was implemented at two levels, involving a generalisability (G) study and a decision (D) study (Brennan, 2001; Gugiu, Gugiu & Baldus, 2012; Marcoulides, 1999).

While the G study estimated the G coefficients (G-relative and G-absolute), the D study relied on

the information obtained from G study to conduct an optimization study using alternative levels

of the factors (facets) of interests specified under the observation design (Cardinet et al., 2010).

The measurement design for this study is denoted by the relationship, SP/RO. A diagrammatic

representation of the observation design is shown in Figure 2, originally used by Cronbach et

al. (cited in Cardinet et al., 2010). The intersecting circles in the Venn diagram describe the regions common to the ellipses used in the variance partition diagram; these regions represent the contributions or “effects” of facets to total score variance.

The nesting of raters (r) within occasions (o) is represented by the inclusion of one entire circle

within another circle.

FIGURE 2: Venn Diagram for ps x (r:o) design
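The partition that Figure 2 depicts can also be written out algebraically. For a design in which students (s) are crossed with raters (r) nested in occasions (o), with the single-level instrument facet fixed so that it contributes no random effects, the observed score and total variance decompose as below; this is the standard decomposition for such a design (written here for clarity, not an equation printed in the original) and yields exactly the five non-null components reported later in Table 13:

```latex
% Observed-score decomposition and variance partition for s x (r:o), instrument fixed
X_{sr:o} = \mu + \nu_{s} + \nu_{o} + \nu_{r:o} + \nu_{so} + \nu_{sr:o,e}
\sigma^{2}(X_{sr:o}) = \sigma^{2}_{s} + \sigma^{2}_{o} + \sigma^{2}_{r:o}
                     + \sigma^{2}_{so} + \sigma^{2}_{sr:o,e}
```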

This D study, represented by the ps x (r:o) design, reported five variance components whose estimations were carried out as follows: σ²_o (the occasion variance) was divided by the number of occasions and the number of raters (n′_o n′_r) in the D study, by virtue of its being confounded with σ̂²_ro. In the same way, σ²_po (the variance of the interaction between instrument and occasions) was divided by the number of occasions and raters by virtue of its confounding with σ̂²_pro,e.
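Written out, the D-study error variances for this design divide each component by the number of conditions over which scores are averaged. With n′_o occasions and n′_r raters per occasion, a conventional formulation (consistent with the values reported later in Table 14, though not printed as formulas in the paper) is:

```latex
% Relative and absolute D-study error variances for the s x (r:o) design
\sigma^{2}(\delta) = \frac{\sigma^{2}_{so}}{n'_{o}} + \frac{\sigma^{2}_{sr:o,e}}{n'_{o}\,n'_{r}}
\qquad
\sigma^{2}(\Delta) = \frac{\sigma^{2}_{o}}{n'_{o}} + \frac{\sigma^{2}_{r:o}}{n'_{o}\,n'_{r}}
                   + \frac{\sigma^{2}_{so}}{n'_{o}} + \frac{\sigma^{2}_{sr:o,e}}{n'_{o}\,n'_{r}}
```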

Population

This study targeted all the assessment scores of all raters who assessed mentees at the Faculty

of Education. The category of students of interest comprised those who participated in the

2018/2019 Off-campus teaching practice at both basic and Senior High schools. Out of a total

student population of four hundred and thirty (430) who participated in the Off-campus

teaching practice, three hundred (300) of them, whose records were available and complete for


the two occasions were used for this study. Table 5 shows the breakdown of students by

programme, summing up to three hundred (300).

Table 5: Number of Students by Departments

Serial No. Programme of Study Total number of Students

1 Social Science Education 100

2 Business Studies Education 90

3 Mathematics and Science Education 30

4 Agric. and Consumer Sciences Education 20

5 Basic and Early Childhood Care Education 60

Total 300

Sample and Sampling Procedure

This study used all mentees who had scores for both first and second round of assessments

from raters. Criterion sampling procedure (Ary et al., 2014), was employed to sample the 300

mentees who met the criterion for inclusion into the sample. The criterion for selection was

that a mentee should have a score from each of the first- and second-occasion raters,

respectively.

The sample size of 300 mentees was determined for this study relying on Krejcie and Morgan

(2006) Table of sample size determination which prescribed a sample size of 309, for a given

population of 400 at a 3.5% margin of error. Combining the suggested sample size by Morgan’s

Table and the criterion set for inclusion, and the EduG requirement for balanced data for

optimum application, I settled on a sample size of 300 mentees who participated in the

2018/2019 off-campus teaching practice programme at the Faculty of Education, University for

Development Studies. The adequacy of this sample size is supported by Vispoel, Morris and

Kilinc (2018), who used a sample of size of 206 participants, in a G study estimating various

reliability coefficients. Atilgan’s (2013) proposition that, a sample size of 400 is sufficient for

performing an accurate and reliable estimation of the generalisability (G) and Phi coefficients

for populations beyond five thousand (5000) also authenticated this sample size for a

quantitative G study.

The records of a total of thirty-six (36) raters comprising 15 from Tamale campus, 14 from Wa,

and 7 from the Navrongo campus were used for the study. Table 4 contains a summary of raters

based on the campus of location.

Table 4: Number of Raters on the Campuses

Serial Number    Name of Campus    Number of Raters
1                Tamale            15
2                Navrongo          7
3                Wa                14
                 Total             36

Data Collection Instruments

The mentee assessment instrument comprised twenty (20) items measuring teacher-related professional attributes, organized under three sub-dimensions.

The items on these sub-dimensions are placed on a five-point Likert-like scale, indicating the


various levels of performance. The points on the scale are 1 to 5, with corresponding descriptors

as, Novice, Apprentice, Competent, Proficient and Distinguishing/Outstanding. Another column

precedes these five columns and it is labelled ‘Item and Score’. This column captures the core

items the student’s performance is being rated against together with a score box. There are

three distinct sub-dimensions on the instruments which are; Objectives and Core Points in

Lesson Plan, Classroom Organization and Management, and Teaching Methodology and

Delivery.

Under the first sub-dimension, Objectives and Core Points in Lesson Plan, are the Objectives, Summaries/Core points, Teaching and Learning activities, Teaching and Learning

Materials (TLMs), as well as Subject and Pedagogical Knowledge. These sub-dimensions focus

on measuring the extent to which a mentee has mastered these core professional attributes in

the lesson plan. The second sub-dimension is Classroom organisation and Management. Three

core attributes required by student-teachers include, classroom management and control skills

or techniques, and a good professional teacher’s attitude in general. Sub-dimension three is

“Teaching methodology and delivery” which covers core items 9 to 20. Core items under this

sub-dimension include; Introduction of the lesson, presentation of teaching and learning

activities, pace of lesson and audibility of voice, questioning and feedback techniques,

professional use of chalkboard, use of teaching and learning materials (TLMs), communication

skills, Student participation, Assessment for student learning, Mastery of subject matter,

Closure of lesson and Professional commitment. These core items in sub-dimension three

represent the central activities both learners and teachers, must be engaged in a lesson to

achieve learning.
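A minimal sketch of the instrument's structure as a data object is given below. The sub-dimension and item grouping follows the description above (5 + 3 + 12 = 20 items); treating a mentee's total as the simple unweighted sum of the twenty item ratings is an assumption, since the paper does not state how item scores are aggregated:

```python
# Hypothetical representation of the 20-item, five-point mentee assessment tool.
SCALE = {1: "Novice", 2: "Apprentice", 3: "Competent",
         4: "Proficient", 5: "Distinguishing/Outstanding"}

SUB_DIMENSIONS = {
    "Objectives and Core Points in Lesson Plan": 5,   # items 1-5, as listed in the text
    "Classroom Organisation and Management": 3,       # items 6-8 (three core attributes)
    "Teaching Methodology and Delivery": 12,          # items 9-20, as stated in the text
}
assert sum(SUB_DIMENSIONS.values()) == 20

def total_score(item_ratings):
    """Sum the 20 item ratings (each 1-5); assumes a simple unweighted total."""
    assert len(item_ratings) == 20 and all(1 <= r <= 5 for r in item_ratings)
    return sum(item_ratings)
```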

Data Processing and Analysis

Mentee assessment scores which formed the data used for this study were extracted from

documents. Access to these records was granted by the Dean of the Faculty of Education, and

the Professional Education Practice Unit, who had custody of these records. These records were

then sorted, counted and arranged for easy identification and use. To protect the identities and

privacies of both student- and rater groups, and also, to avoid the violation of individual rights

to secrecy, anonymity and non-intrusion, labels and codes were used in place of their names as

identification numbers.

Scores, raters and students’ characteristics such as gender, programmes, occasion of

assessment among others were extracted from completed assessment tools and the results

tabulated under three columns; students, raters and occasions.

Raters were regrouped into two, and designated as rater1s and rater2s depending on whether

they assessed students on the first or second occasions. The first column was labeled students

with codes from 001 to 300, consisting of the entire group. Student scores accordingly were

recorded under rater1 and rater2 respectively in preparation for entry into the EduG software

as defined under the observation design.
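A minimal sketch of the resulting data layout is shown below; the column names and the reshaping step are illustrative only, not the paper's actual EduG input file, and the scores shown are placeholders:

```python
import pandas as pd

# Wide layout as described: one row per student, with the scores awarded by the
# occasion-1 and occasion-2 raters in separate columns.
wide = pd.DataFrame({
    "student": [f"{i:03d}" for i in range(1, 301)],   # codes 001-300
    "rater1_score": [70] * 300,                        # placeholder values
    "rater2_score": [71] * 300,                        # placeholder values
})

# Long layout (one row per student-occasion), the shape implied by the ps x (r:o) design.
long = wide.melt(id_vars="student", var_name="occasion_rater", value_name="score")
print(long.head())
```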

Data so obtained was then prepared and entered into the EduG software for analysis by

research questions posed for the study. Research question one examined the major contributors to measurement error in mentees’ observed scores in the G study: the rater (r) and occasion (o) facets, their interaction effects, and the residuals. Variance components were computed for


individual facets and their composites, including the instrument (p), together with other unaccounted-for error components. Research question two also examined, estimated and compared the contributions of facet composites to total error variance, with a view to determining the overall effects on the dependability of these mentees’ observed scores.

RESULTS

G theory procedures were applied in this study to explore ways of generating dependable

scores using the mentee assessment tool, irrespective of the raters or occasions involved when

assessments are conducted. This study involved estimating the generalisability (reliability-like)

coefficients of mentees’ observed scores, computing variance components of facets, and their

composites. The information obtained from the G study was then used in subsequent decision

studies to explore how G coefficients could be improved to acceptable quality or conventional

levels. Following are presentations of varied data groupings relevant to this study:

Background Characteristics of Raters and Students

Raters and students’ background characteristics of interest in this study such as sex, campus

location and programme specializations or departments were examined, and the statistics

captured in Tables 9 and 10.

Table 9: Distribution of Raters by Campus and Gender

Campus      Male    Female    Total
Tamale      12      3         15
Navrongo    7       0         7
Wa          10      4         14
Total       29      7         36

Table 9 displays the distribution of raters by campus and sex as obtained from the records used

for this study. The total number of male raters across the three campuses was twenty-nine (29), while that of female raters was seven (7). All campuses except Navrongo had female

representation from the list of raters captured in Table 9. Table 10 also displays the distribution

of respondents by sex and programme.

Table 10: Distribution of Students by Programme and Sex

Programme/Department                    Male    Female    Total
Social Science Education                96      35        131
Business Education                      78      13        91
Mathematics and Science Education       36      2         38
Agricultural Science Education          23      1         24
Basic and Early Childhood Education     10      6         16
Total                                   243     57        300

In Table 10, the students involved in this study were spread across five departments under the

Faculty of Education located across the campuses of UDS. Of the total of 300, 131 Social Science

students formed the majority, representing over forty-three percent (43.6%) of the total; ninety-one (91) students, representing just over thirty percent (30.3%), offered the Business


Education programme, while sixteen (16) students, representing just over five percent (5.3%) of the total, pursued the Basic and Early Childhood Care Education programme.

Table 11 illustrates the descriptive statistics of raters as presented above.

Table 11: Descriptive Statistics of Rater Scores

                      N      Range    Mean     Std. Deviation    Variance
RATER1 SCORES         300    59       70.41    7.975             63.594
RATER2 SCORES         300    53       70.53    8.001             64.009
Valid N (listwise)    300

In Table 11, the mean scores of raters who assessed students on each of the two occasions are

presented. While the mean score reported for rater1s was 70.41, with a standard deviation of

7.98, that of rater2s was 70.53 with a corresponding standard deviation of 8.00, reflecting a

very negligible difference between their mean scores. These mean scores obtained for the two

pairs of raters were close in size suggesting some similarity in the mode of scoring by these

pairs of raters.

Analysis

Analysis of variance (ANOVA) procedure under the G theory was applied to estimate variance

components of the variables specified in the design. The resulting observation and estimation designs for the G study are illustrated in Table 12.

Table 12: Observation and Estimation Designs

Facet                             Label    Level    Universe
Instrument                        P        1        1
Students                          S        300      INF
Occasions                         O        2        INF
Raters nested within Occasions    R:O      2        INF

In Table 12, the G study considered the facets Students, Occasions and Raters as random facets, because each group of students, for instance, was drawn from an infinitely large universe of

students, and occasions or groups of raters that could equally have participated in this study.

Raters and occasions were also denoted random facets, based on similar assumptions. The

estimation design in this study thus included three infinite random facets; Students, Occasions

and Raters, with the assessment tool designated as a fixed facet. The measurement design depicting

the relationship among the facets in this design was SP/RO.

Research Question 1: What amount of variance is attributable to the rater and occasion

facets in the estimated total variance of mentees’ observed scores?

Results of the analysis, showing the contributions of the multiple sources to the measurement error in the G study as requested by research question 1, are presented in Table 13. The ps x

(r:o) G design, presented in Table 13, depicted multiple sources of variance components, where

the mentee assessment tool (P), and students (S) were the facets of differentiation, and the

occasion and rater facets were designated as the instrumentation facets. All estimates of the

variance components illustrated in Table 13, were positive except that of the occasion facet


which reported a negative variance component (-1.48994); this was resolved to 0.0 in line with the recommendations of Brennan (cited in Shavelson & Webb, 1991).

The contribution of the object of measurement (p) was reported as dots or null values in Table

13, and these represented zero percentage (0.0%) contribution to measurement error. This

result aligned well with G theory principles and procedures, which hold that

fixed facets do not contribute to measurement error (Cardinet et al., 2010). The student (S)

facet produced close to 29% (σ²_s = 41.04; 28.8%) of the total variance in Table 13, being the single largest facet-level contributor to measurement error in this study.

Again, the contribution to total variance by the components of the nested relationship (R:O) was reported as 1.9% (σ²_r.ro = 2.70). This percentage represented the third largest contribution among the components reported in Table 13. This result is supported by the findings of Kim

et al. (2017), who also found the rater facet as one of the greatest contributors to measurement

error. The rater effect (r) was confounded with the rater-by-occasion interaction (ro) and

therefore, the variance component for raters (σ²_r) was hidden in the rater-by-occasion interaction (σ²_ro). By convention, the variance component for these confounded effects is denoted σ²_r.ro.

Compared with the student facet, the raters’ contribution to measurement error is considered

relatively smaller, and suggests that raters differed minimally in the manner

(leniency/harshness) with which they assessed mentees on the different occasions of

assessments (Cardinet et al., 2010). It is the desire of every user of an assessment procedure

that any error occurring in the course of measurement be very small or negligible. This is because the smaller the error, the better the quality, accuracy or precision of the measurement. Decision-makers would be more confident in relying on the scores to make important decisions.

Table 13: Analysis of Variance

                                         Variance components
Source    SS           df     MS        Random    Mixed     Corrected    %        SE
P         ...          ...    ...       ...       ...       ...          ...      ...
S         78996.60     299    264.20    41.044    41.044    41.044       28.8     5.76
O         17.28        1      17.28     -1.49     -1.49     -1.49        0.0      1.07
R:O       1817.27      2      908.63    2.70      2.70      2.70         1.90     2.14
PS        ...          ...    ...       ...       ...       ...          ...      ...
PO        ...          ...    ...       ...       ...       ...          ...      ...
PR:O      ...          ...    ...       ...       ...       ...          ...      ...
SO        29907.72     299    100.03    1.31      1.31      1.31         0.90     4.9525
SR:O      58250.74     598    97.41     97.41     97.41     97.41        68.40    5.62
PSO       ...          ...    ...       ...       ...       ...          ...      ...
PSR:O     ...          ...    ...       ...       ...       ...          ...      ...
Total     168989.60    1199                                              100%
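The variance components in Table 13 can be recovered from the reported mean squares through the expected-mean-square equations for a design with 300 students crossed with two raters nested in each of two occasions. The sketch below is my own back-calculation (not EduG output) and reproduces the estimates in the table:

```python
# Mean squares from Table 13 and the design sizes:
# n_s students, n_o occasions, n_r raters per occasion.
MS = {"s": 264.20, "o": 17.28, "r:o": 908.63, "so": 100.03, "sr:o,e": 97.41}
n_s, n_o, n_r = 300, 2, 2

var = {}
var["sr:o,e"] = MS["sr:o,e"]                                    # residual, ~97.41
var["so"]     = (MS["so"] - var["sr:o,e"]) / n_r                # ~1.31
var["r:o"]    = (MS["r:o"] - var["sr:o,e"]) / n_s               # ~2.70
var["s"]      = (MS["s"] - n_r * var["so"] - var["sr:o,e"]) / (n_o * n_r)   # ~41.04
var["o"]      = (MS["o"] - n_r * var["so"] - var["sr:o,e"]
                 - n_s * var["r:o"]) / (n_s * n_r)              # ~ -1.49
var["o"] = max(var["o"], 0.0)   # negative estimates are truncated to zero, as in the paper

print({k: round(v, 2) for k, v in var.items()})
```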

Table 13 further illustrated both crossed and nested facet relationships alongside their

combined effects, and contributions to total error variance. The paired crossed relationships

included; PS, PO, PR:O, SO, and SR:O. In addition, the three-way interaction relationships

reported were PSO and PSR:O. The fixed instrument (p) facet had significant effects on other

components, like the standard error of measurement (SEM) values (Cardinet et al., 2010).

Consequently, the composites of all facets associated with the instrument reported null values


as illustrated in Table 13. These reported effects corroborated the findings of Vispoel et al. (2018) and Cardinet et al. (2010), who held the position that fixed facets have nulling effects on all their composites.

The student-by-rater interaction nested in occasions (SR:O) reported the highest variance contribution (σ²_sr:o = 97.41; 68.4%) to total variance. This variance also represented the residual component, reporting a substantial amount of variation compared with all other composites reported in Table 13. This comparatively large variance contribution could be explained partly by the confounded facets and the unmeasured sources of variation in this design. The student-by-occasion interaction (SO) also reported a nonnegligible variance component (σ²_so = 1.31; 0.9% of total variance), implying that the relative standings of students in respect of the

attribute of professional knowledge and skills were unstable on each occasion of assessment of

these mentees.

Research Question 2: What effects do the composite facets have on the dependability

(reliability) of students’ observed scores?

This question was posed purposely to enable me to estimate, quantify, and examine the

contributions of these facet combinations or interactions to the total variance in the observed

scores. In Table 15, all interactions of all other facets with the fixed facet yielded nothing or zero

contribution to total variance. Table 14 captured the ANOVA results from which conclusions

about the quality and precision of measurements for the selected facets under this

measurement design may be drawn.

Table 14: G Study Table (Measurement design PS/RO)

Source    Differentiation    Relative error    % Relative    Absolute error    % Absolute
          variance           variance                        variance
P         (0.00000)          ...               ...           ...               ...
S         41.04421           ...               ...           ...               ...
O         ...                ...               ...           (0.00000)         0.0
R:O       ...                ...               ...           0.67602           2.6
PS        (0.00000)          ...               ...           ...               ...
PO        ...                ...               ...           ...               ...
PR:O      ...                ...               ...           ...               ...
SO        ...                0.65414           2.6           0.65414           2.5
SR:O      ...                24.35232          97.4          24.35232          94.8
PSO       ...                ...               ...           ...               ...
PSR:O     ...                ...               ...           ...               ...
Sum       41.04421           25.00645          100%          25.68248          100%

Standard deviation: 6.40658    Relative SE: 5.00065    Absolute SE: 5.06779
Coef-G relative: 0.62
Coef-G absolute: 0.62
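Using the G-study components above together with the D-study error-variance expressions sketched earlier (with n′_o = 2 occasions and n′_r = 2 raters per occasion, mirroring the actual data), the error variances and coefficients in Table 14 can be reproduced; this is a back-calculation for illustration, not EduG output:

```python
# G-study variance components (from Table 13); the occasion component was truncated to zero.
var_s, var_o, var_ro, var_so, var_sro_e = 41.044, 0.0, 2.704, 1.31, 97.41
n_o, n_r = 2, 2   # occasions and raters per occasion in this D study

rel_err = var_so / n_o + var_sro_e / (n_o * n_r)          # ~25.01 (relative error variance)
abs_err = var_o / n_o + var_ro / (n_o * n_r) + rel_err    # ~25.68 (absolute error variance)

coef_g_relative = var_s / (var_s + rel_err)   # ~0.62
coef_g_absolute = var_s / (var_s + abs_err)   # ~0.62
print(round(coef_g_relative, 2), round(coef_g_absolute, 2))
```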

In both Tables 14 and 15, variance components for each of the facets of generalisation are

different. The tool was fixed to reduce random fluctuations and as such, it made no


contributions to measurement error (Cardinet et al., 2010; Shavelson & Webb, 1991). The student facet was not fixed, and so contributed true (universe-score) variance (σ²_s = 41.04), accounting for as much as 28.8% of total variance as shown in Table 14.

The composite facets whose effects were being examined included PS, PO, PR:O, SO, SR:O, PSO and PSR:O (Table 15). Among these composites, only two made significant contributions to total variance, namely SO and SR:O. The zero effect of the fixed facet (p) was amply manifested in the interactions involving it; PS, PO, PR:O, PSO and PSR:O registered null values in each of these cases (Cardinet et al., 2010; Brennan, 2005). The interaction between students and occasions (SO) was thus the only crossed pair that made a nonnegligible contribution to absolute error variance in the measurement design, accounting for σ²_so = 0.65 (2.5% of total variance). In comparison with other composite facets, this was a substantial contribution to

variation of student behaviour or performance on the different occasions of assessment,

consistent with findings by Sudweeks, Reeve and Bradshaw (2005).

Again, in Table 15, the largest contributor to total variance was the composite between student

and rater facets nested in the occasion facet, which reported over 94% (σ²_sr:o = 24.35; 94.8% of total variance) of the variability. The effects of multiple facets such as SR, RO and other

unmeasured sources were confounded in this composite variance reported here.

DISCUSSION OF RESULTS

This study set out to examine the quality of mentee scores obtained using an assessment tool

through G theory procedures. One key objective of the study was to estimate, and isolate the

contributions of facets of interest to total variance in the mentee’s observed scores. In this

generalisability analysis, variance components, and the G coefficients, involving the relative and

absolute coefficients, were used as indicators of the quality of performance or scores. While the

G coefficients provided a global measure of the reliability of student scores, the variance

components mirrored the individual contributions of facets to total measurement error

(Cardinet et al., 2010). Information generated through the G analysis served as evidence to

support assumptions and inferences derived from these scores. The study used a nested design

where the object of measurement, the assessment tool was fixed, and then students, raters and

occasions designated as random facets. This approach appears consistent with that adopted in

many validation studies conducted using G theory procedures in literature (Cardinet et

al., 2010; Atilgan, 2019; Naumenko, 2015).

Regarding the individual contributions of rater and occasion facets in response to the first

research objective, the study reported a zero contribution by the occasion facet to absolute

error variance in line with the findings of Medvedev et al. (2018). The zero contribution by the

occasion facet is further corroborated by the findings of Sudweeks, Reeve and Bradshaw (2005),

which maintained that alternative variables other than just the occasion would add to

measurement error. They mentioned variables such as the environment, experience of the

rater, quality of the measure among possible sources of error. Again, since the rater facet was

nested in occasions, no lone contribution by the rater facet was possible from the nested

relationship because of confounding. However, the contribution of the nested relationship of

raters and occasions (R:O) to total variance was reported as 2.6% of absolute error variance,

representing a relatively small part of the total variance. Since the rater effect was confounded

with the rater-by-occasion interaction (ro), the lone contribution to measurement error by


the rater variable could not be observed separately within the rater-by-occasion interaction. The percentage thus

reported signified a minimal difference in rating performance by raters on each occasion of

assessment. The contribution means that raters differed minimally in the leniency or

severity with which they scored the mentees in the classrooms. Since no main rater effect was

reported in this result due to confounding, with a null or zero contribution by the occasion facet,

it is only reasonable and logical to infer that the 2.6% contribution to total variance was

accounted for by raters in the nested relation. This contribution to total measurement error in

the mentee’s observed score is supported by the findings by Kim et al. (2017) and Naumenko

(2015) as reported in literature.

Objective two of this study examined the contributions of composite facets to measurement error in mentee observed scores. In Table 14, only two composites made significant contributions to measurement error besides that of the nested relationship. It was revealed that the student-by-occasion composite (SO) contributed 2.5% (Table 15) to absolute error variance. This percentage contribution ranks third highest, after the rater-nested-in-occasions (R:O) composite, which recorded a 2.6% addition to absolute error variance. The absolute variance components are reported here because the focus was to estimate the precision of observed scores independently and to determine the exact position of each individual on the assessment scale.
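To make the arithmetic behind such percentage figures concrete, the short sketch below converts a set of estimated variance components into percentage shares of total observed-score variance, which is how software such as EDUG reports them. The numbers are hypothetical placeholders loosely patterned on the profile described here (a dominant SR:O component, small SO and R:O components, and a null occasion component); they are not the study's Table 14 or Table 15 estimates.

```python
# Minimal sketch: expressing estimated G-study variance components as
# percentage shares of total observed-score variance for an s x (r:o) design.
# All values are hypothetical placeholders, not the study's EDUG estimates.

variance_components = {
    "s (students)": 0.45,
    "o (occasions)": 0.00,
    "r:o (raters within occasions)": 0.70,
    "so (student x occasion)": 0.65,
    "sr:o,e (student x rater within occasion + residual)": 24.35,
}

total = sum(variance_components.values())

for source, sigma2 in variance_components.items():
    share = 100 * sigma2 / total
    print(f"{source:<55} {sigma2:7.2f}   {share:5.1f}%")
```

Each percentage quoted in the text is simply the corresponding component divided by the sum of all estimated components.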

Again, the student-by-occasion contribution of 2.5% to absolute variance meant that, on each occasion of assessment, students' 'real' standing on the construct of interest varied by about 2.5%. Ordinarily, an individual's true standing or performance should be repeatable if the construct is stable over time, so a situation in which the absolute position kept changing on different occasions casts doubt on the reliability and validity of conclusions drawn from such results. This concern was further aggravated by the comparatively larger percentage contribution to absolute error variance reported for the student-by-rater interaction nested in occasions (SR:O). This composite facet registered the highest effect on absolute error variance, accounting for over 94% of total variance in the observed score. This high percentage of absolute variance in respect of the student-rater interaction signified that the pattern of raters' ratings of students' performance on the scale remained relatively stable across occasions: if raters gave students high or low scores on the construct of interest, the students' professional knowledge and skills, on one occasion, a similar observation would have been repeated on another occasion of assessment. Alternatively, if raters were severe in rating students on one occasion, they would likely behave the same way on a different occasion. This inference aligns well with Messick's position, cited in Brennan (2005), which identified two categories of raters, namely lenient and strict scorers. By disposition, some scorers award high scores while others give low scores irrespective of the occasion. Such scorers, whether lenient or strict, still exhibit subjectivity in their ratings, thereby rendering their observed scores error-infested.

SUMMARY OF KEY FINDINGS

Findings on Research Question 1

Based on the results and analysis of data in this study, it was revealed that, while the occasion facet made a zero contribution to measurement error, the rater facet (r), though confounded (R:O), contributed 1.9% to total variance in the observed mentee scores. The study also found that the overall dependability (reliability) and precision (quality) with which the mentee tool


measured the construct of interest in mentees was 62% (Coef G = 0.62); that is, an estimated 62% of the observed-score variance was attributable to the mentees' universe (true) scores. This measure represented a positive coefficient of moderate strength. In addition, the nested relationship (R:O) accounted for 1.9% of total variance, which also represented the third largest contribution to total variance among the variance components. The rater facet was found to be a higher contributor to total variance than the occasion facet.

Findings on Research Question 2

The study also found that the student-by-occasion composite (SO) and the student-by-rater interaction nested in the occasion facet (SR:O) were the only two composites that made substantial contributions to total variance relative to the other composites in the design. The interaction between the student and occasion facets (SO) made a significant, nonnegligible contribution to absolute error variance, accounting for 2.5% of total variance (σ²(SO) = 0.65).

The student-by-rater composite nested in the occasion facet (SR:O) was the largest contributor to absolute variance, accounting for over 94% of the variability (σ²(SR:O) = 24.35, 94.8% of total variance). The interaction between students and raters on the different occasions of assessment therefore made the most substantial contribution to total variance compared with the other composites.

CONCLUSIONS

Overall, the study ascertained that the dependability (reliability) of mentee observed scores, as estimated, was 0.62 (i.e., 62%), representing a positive, moderate quality level when placed on an absolute measurement scale. These G coefficients represented global measures of the dependability (reliability) and precision with which mentees' attributes were estimated. Conventionally, dependability coefficients above 0.80 are considered more desirable and are preferable for absolute types of decisions. The rater nested in occasions (R:O) composite, the crossed relationship between students and occasions (SO), and the composite of the student and rater facets nested in the occasion facet (SR:O) each made contributions to total variance in the design. Based on the nonnegligible contribution to total variance by the rater-occasion composite, it is instructive that the rater main effect be investigated further.
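A D-study projection is the usual way of checking how many raters and occasions would be needed before the dependability coefficient clears the 0.80 mark noted above. The sketch below illustrates the mechanics only: it recomputes the absolute coefficient (Phi) for different numbers of occasions and raters per occasion, using hypothetical single-observation variance components rather than the study's EDUG estimates.

```python
# D-study sketch for the s x (r:o) random design: project the absolute
# (dependability) coefficient Phi for n_o occasions and n_r raters per
# occasion. The variance components below are hypothetical placeholders.

def phi(vc, n_o, n_r):
    """Absolute G coefficient for the s x (r:o) design."""
    abs_error = (vc["o"] / n_o
                 + vc["r:o"] / (n_o * n_r)
                 + vc["so"] / n_o
                 + vc["sr:o,e"] / (n_o * n_r))
    return vc["s"] / (vc["s"] + abs_error)

vc = {"s": 2.0, "o": 0.0, "r:o": 0.3, "so": 0.4, "sr:o,e": 2.5}  # hypothetical

for n_o in (1, 2, 3):
    for n_r in (1, 2, 3, 4):
        p = phi(vc, n_o, n_r)
        flag = "  <- reaches 0.80" if p >= 0.80 else ""
        print(f"n_o={n_o}, n_r={n_r}: Phi = {p:.2f}{flag}")
```

Averaging over more raters and occasions shrinks the error terms they divide, so the same components can support stricter, absolute decisions once enough observations are collected; in practice the projection would be run on the actual estimates, for example through EDUG's optimisation output.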

The student-and-rater composite made a comparatively larger variance contribution than the other composites, which meant that different raters, while using the same assessment tool on students, differed substantially in what they observed about student performance on the different occasions. These raters were either lenient in scoring students, and thus saw more of the construct of interest, or were probably harsh in scoring students on different occasions, and therefore observed less of the construct in students. The suggested variations in rater interpretations of items on the measurement tool have implications for both the reliability and validity of scores obtained from these raters. It is therefore imperative that the relevant authorities take appropriate steps to ensure that raters are given adequate exposure and training to enable them to develop a common understanding of the tool.

RECOMMENDATIONS

The following recommendations are made based on the findings and conclusions of the study:

1. Based on the finding that raters and students were the largest contributors to total measurement error in mentees' observed scores, it is recommended that decision makers, such as Teaching Practice officials across universities and colleges of education, regularly build the capacities of both raters and students through periodic, tailored


training on the contents of the assessment tool, so as to reduce rating errors and to clarify students' understanding of the tool. Periodic intensive training is particularly necessary for newly recruited teaching staff who might be using the tool for the first time for mentee assessments.

2. This study used a nested design with the object of measurement fixed, resulting in an extensive nulling effect on all composites involving the object of measurement (the mentee tool). It is recommended that a crossed design be used in future, without fixing the object of measurement, to maximise the information available on all facets' contributions to total variance (the sketch after this list contrasts the variance components estimable under the two designs). Alternatively, the nested design could be trialled without fixing the mentee tool, to explore the individual and composite facet contributions to the total variance of the mentee observed scores. A decision maker would then be better informed about which design to employ, considering which of them is cost-effective while at the same time producing high-quality outcomes.

3. This study relied solely on G theory in investigating the mentee assessment tool, leading to the results and conclusions arrived at. The use of an alternative theory or procedure could help to ascertain and confirm the findings of this study. It is therefore recommended that a future validation study adopt either factor analysis (FA) or item response theory (IRT) procedures as a form of confirmatory study of the findings reported here.
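As background to the second recommendation above (which points to this sketch), the two decompositions below contrast what a fully crossed random design would estimate with what the present nested design can estimate. The extra terms in the crossed layout are exactly the information that nesting gives up through confounding; the notation is standard G-theory usage rather than output from this study.

```latex
% Fully crossed random design s x r x o: seven estimable components
\sigma^2(X_{sro}) = \sigma^2_s + \sigma^2_r + \sigma^2_o
                  + \sigma^2_{sr} + \sigma^2_{so} + \sigma^2_{ro} + \sigma^2_{sro,e}

% Nested design s x (r:o): five estimable components, with
% \sigma^2_{r:o} confounding r and ro, and \sigma^2_{sr:o,e} confounding sr, sro and e
\sigma^2(X_{s,r:o}) = \sigma^2_s + \sigma^2_o + \sigma^2_{r:o}
                    + \sigma^2_{so} + \sigma^2_{sr:o,e}
```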

SUGGESTIONS FOR FURTHER RESEARCH

The following suggestions are made for future researchers interested in conducting similar studies:

1. Future researchers who wish to use G theory for a similar study may still explore the nested design without fixing the object of measurement. Alternatively, such researchers could apply factor analysis or IRT procedures, which have the advantage of computing item characteristics in addition to separating the measurement errors associated with the individual factors of interest.

2. It is recommended that future researchers define the sub-dimensions of the tool as facets in their own right, in addition to raters, students and occasions, to examine their separate contributions to total variance in the observed score under either a crossed or a nested design. A revised model that integrates these additional facets could lead to improved reliability coefficients for mentee observed scores.

3. Furthermore, future researchers may also consider setting up both crossed and nested designs in one study, using separate student groups, to compare the G coefficients and error variances under either design and thereby inform decision making about which design is more economical for future assignment of raters to assessment exercises.

References

Adom, D., Husein, E. K., & Agyem, J. A. (2018). Theoretical and conceptual framework: Mandatory ingredients of a quality research. International journal of scientific research. https://www.researchgate.net/publication/322204153

Alexopoulos, D.S. (2007). Classical test theory. In N.J. Salkind (Ed.), Encyclopedia of measurement and statistics.

Thousand oaks: CA: Sage publications, 140-143.

Allal, L., & Cardinet, J. (1997). Generalisability theory. In J.P. Keres (Ed.), Educational research, methodology, and

measurement: An international handbook (2nd, p. 737 -741). Cambridge, United Kingdom: Cambridge University.


Allen, M. J. & Yen, W. M. (2002). Introduction to measurement theory. Illinois: Waveland press, Inc.

Ali, A. M., & Yusof, H. (2011). Quality in qualitative studies: the case of validity, reliability and generalisability.

Issues in social and environmental accounting, 5(1), p.25 -64.

Alkharusi, H. (2012). Generalisability theory: An analysis of variance approach to measurement problems in

educational assessment. Journal of studies in education. https://www.researchgate.net/publication/256349585

Ary, D., Jacobs, L. C., Sorensen, C. K., & Walker, D. A. (2014). Introduction to research in education. Wadsworth,

Cengage Learning Inc.

Atilgan, H. (2019). Reliability of essay ratings: A study on generalisability theory. Eurasian journal of educational

research, 1, 133 – 150.

Atilgan, H. (2008). Using generalisability theory to assess the score reliability of the special ability selection

examinations for music education programmes in higher education. International Journal of research &methods in

education, 31 (1), 63 –76 https://doi.org/10.1080/17437270801919925.

Baartman, L. K.J., Bastiaenns, T. J., Kirschner, P. A., van der Vleuten, C. P. M. (2007). Evaluating assessment quality

in competence-based education: A qualitative comparison of two frameworks. Educational research review;

www.elsevier.com/locate/EDUREV

Babbie, E. (2013). The practice of social research (13th ed.). Canada: Wadsworth, Cengage Learning.

Baker, F. B. (2001). The basics of item response theory (2nd ed.). Washington, DC. Eric publications.

Bluman, A. G. (2012). Elementary statistics: A step-by-step approach (8th ed.). USA, McGraw-Hill.

Bond, T. G., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences (3rd ed.). New York, Routledge group.

Bouwer, R., Beguin, A., Sanders, T., & van den Berg, H. (2015). Effect of genre on the generalisability of writing

scores. Language testing, 32 (1), 83 – 100.

Brennan, R. L. (2013). Commentary on 'Validating the interpretations and uses of test scores'. Journal of educational measurement, 50, 74-83.

Brennan, R. L. (2011). Generalisability theory and classical test theory. Centre for advanced studies in

measurement and assessment, University of Iowa. Applied measurement in education,24: 1-21.

Brennan, R.L. (1996). Generalisability of performance assessments. In technical issues in large-scale performance

assessment, edited by Philips, G., 19 -58. Washington, DC: National center for educational statistics.

Brennan, R. L. (2003). Coefficients and indices in Generalisability theory. Center for advanced studies in

measurement and assessment (CASMA) research report (1).

Brennan, R.L. (2001). Generalisability theory. New York: Springer-Verlag New York, Inc.

Brennan, R. L. (1998). A perspective on the history of generalizability theory. Educational measurement: issues

and practice,16(4),14-20


Brennan, R.L. (1992). Elements of generalisability (2nd ed.) Iowa city. ACT publications.


Cardinet, J., Johnson, S., & Pini, G. (2010). Applying generalisability theory using Edu G. Quantitative methodology

series. New York, Routledge Group.

Cook, D.A., Thompson, W.G., & Thomas, K.G. (2014). Test-enhanced web-based learning: optimizing the number

of questions (a randomized crossover trial). Academic med: 169 -175

Cook, D.A., Beckman, T. (2006). Current concepts in validity and reliability for psychometric instruments: theory

and application. Journal of education, 19(7), 166.

Crocker, L., & Algina, J. (2008). Introduction to classical and modern test theory. OH, Cengage learning.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA: Wadsworth

group/Thomson learning.

Cronbach, L. J. (1984). Essentials of psychological testing. New York: Harper & Row publishers.

Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. H. (1997). Generalisability analysis for performance assessment of student achievement or school effectiveness. Educational and psychological measurement, 57, 373-399.

Denegar, C. R., & Ball, D. W. (1993). Assessing reliability and precision of measurement: an introduction to intra

class correlation and standard error of measurement. Journal sport rehabil, 2 (1),35- 42.

DeVillis, R.F. (2017). Classical test theory. Med care, 44(11), 50 – 59.

Dogan, C.D., & Uluman, M. (2017). A comparison of rubrics and graded category rating scales with various methods regarding raters' reliability. Educational sciences: Theory & practice, 7, 631-651. https://dx.doi.org/10.12738/estp.2017.2.0321

Downing, S.M. (2003). Validity: on the meaningful interpretation of assessment data. Med education, 37: 830-837.

Downing, S.M. & Haladyna, T.M. (2004). Validity threats: overcoming interference with proposed interpretations of assessment data. Med Educ, 38: 327-333.

DeShon, R. P. (2002). Generalisability theory. In F. Drasgow & N. Schmitt (Eds.), Measuring and analysing

behaviour in organisations, 189 -220.

Edenborough, R. (1999). Using psychometrics, a practical guide to testing and assessment (2nd ed.). London,

Kogan page Ltd.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 127 -144).

New York: Macmillan.

Fleming, M., House, S., Hanson, V. S., Yu, L., Garbutt, J., McGee, R., Kroenke, K., Abedin, Z., & Rubio, D. M (2013).

The mentoring competency assessment: Validation of a new instrument to evaluate skills of research mentors.

Academic medicine, 88(7).

Fraenkel, J. R., Wallen, N. E., & Hyun, H. H. (2015). How to design and evaluate research in education (9th ed.).

New York, McGraw-Hill education.

Fulton, S. & Krainovich-Miller, B. (2010). Gathering and appraising the literature in LoBiondo-Wood, G. & Haber,

J. (Eds). Nursing research: methods and critical appraisal for evidence-based practice (7th ed.). St, Luis MO:

Mosby Elsevier.

Gaur, A. & Gaur, S.S. (2009). Statistical methods for practice in research: A guide to data analysis using SPSS. New

Delhi, Sage publications Inc.


Gebril, A. (2009). Score generalisability of academic writing tasks: does one test method fit it all? Language

testing, 26(4):507 – 531.

Ghana Education Service Headteacher’s handbook.

Ghazali, N. H. (2016). A reliability and validity of an instrument to evaluate the school-based assessment system:

A pilot study. International journal of evaluation and research in education (IJERE),5 (2), 148 – 157.

Gierl, M. J, & Bisanz, J. (2001). Item response theory for psychologists -Book review. Applied psychological

measurement, 25 (4), 405 -408.

Graham, S., Harris, K., & Herbert, M. (2011). Informing writing: The benefits of formative assessment. A carnegie

corporation time to act report. Washington, DC: Alliance for excellent education.

Grant, C. & Osandoo, A. (2014). Understanding, selecting and integrating a theoretical framework in dissertation

research. Creating the blueprint for ‘House’. Administrative issues journal: connecting education, practice and

research. pp.12 -22.

Gullo, D. F. (2005). Understanding assessment and evaluation in early childhood education (2nd ed.). New York,

Teachers college press.

Gugiu, P.C. (2011). Summative confidence. Unpublished doctoral dissertation, Western Michigan university,

Kalamazoo.

Gugiu, M.R., Gugiu, P. C., & Baldus, R. M. A. (2012). Utilizing generalizability theory to investigate the reliability of

the grades assigned to undergraduate research papers. Journal of multidisciplinary evaluation, 8(19),26-40.

Hafner, J. C., & Hafner, P. M. (2003). Quantitative analysis of the rubric as an assessment tool: An empirical study

of student peer-group rating. International journal of science education, 25 (12), 1509 -1528.

Hambleton, R.K. & Jones, R.W. (1997). Comparison of classical test theory and item response theory and their

applications to test development.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury; Sage

publications.

Hambleton, R.K., & Jones, R. W. (1993). Comparison of classical test theory and item response theory and their

applications to test development.

Hamrick, L.R., Haney, A. B., Kelleher, B. L., & Lane, S. P. (2020). Using generalisability theory to evaluate the

comparative reliability of developmental measures in neurogenic syndrome and low-risk populations. Journal of

neurodevelopmental disorders, 12 (16).

Heitman, R. J., Kovaleski, J.E., & Pugh, S. F. (2009). Application of generalizability theory in estimating the reliability of ankle-complex laxity measurement. National Athletic Trainers' Association, Inc. Journal of Athletic training, 44(1), 48-52.

Hopkins, K. D. (1998). Educational and psychological measurement and evaluation (8th ed.). Boston, MA: Allyn

and Bacon.

Howell, R. (2014). Grading rubrics: Hoopla or help? Innovations in education and teaching international. 51(4);

400 -410.

Ilgen, J.S., Hatala, R., & Cook, D.A. (2015). A systematic review of validity evidence for checklists versus global

rating scales in simulation-based assessment. Med education, 161 -173.


Institute for Continuing Education and Interdisciplinary Research (ICEIR) (2012). University for Development

Studies mentoring policy. Tamale, M-bukConcepts publication.

Jonsson, A. & Svingby, G. (2007). The use of scoring rubrics; reliability, validity and educational consequences.

Educational research review 2. 130 – 144.

Kan, A. (2007). Effects of using a scoring guide on essay scores: Generalisability theory. Perceptual and motor

skills, 105, 891 -905.https://doi.org/10.2466/pms.105.3.891 -905.

Kane, M.T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17 -64). Westport, CT:

American Council on Education /Praeger.

Kane, M.T. (2001). Current concerns in validity theory. Journal of education measurement, 38: 319 -342.

Kane, M.T. (1992). An argument-based approach to validity. Psychological bulletin, 112: 527 – 535.

Kaplan, R, M. & Saccuzzo, D.P. (1989). Psychological testing: Principles, applications and issues (2nd ed.).

California, Brooks/Cole publishing Co.

Kovaleski, J.E, Hollis, J. M., Heitman, R. J., Gurchiek, L. R., Pearsall, A. W, (IV). Assessment of ankle-subtalar-joint

complex laxity using an instrumented ankle arthrometer: an experimental cadaveric investigation. Journal of athl

train, 37(4), 467 -474.

Kean, J. & Reilly, J. (2014). Classical test theory.192-194

Kim, Y. S. G., Schatschneider, C., Wanzek, J., Gatlin, B., Otaiba, S. (2017). Writing evaluation: rater and task effects

on the reliability of writing scores for children in Grades 3 and 4. Read writ, 30, 1287- 1310.

Kline, J. B. (2014). Classical test theory: Assumptions, equations, limitations, and item analyses. Thousand Oaks,

Sage publications.

Knight, P. (2006). The local practices of assessment; Assessment and evaluation in higher education.31(4);435 -

452.

Kolen, M. J., & Brennan, R. L. (1995). Test equating methods and practices. New York, Springer – Verlag Inc.

Lawson, J.E. & Cruz, R. A (2017). "Evaluating special educators’ classroom performance: Does rater “type”

matter?", assessment for effective intervention.

Leedy, P. O., & Ormrod, J. E. (2010). Practical research, planning and design (9th ed.). New Jersey, Pearson

education Inc.

Lissitz, R. W., & Samuelsen, K. (2007). A suggested change in terminology and emphasis regarding validity and

education. Educational researcher, 36, 437 - 448.

Mafla, A. C., Herrera-Lopez, H.M., & Villalobos-Galvis, F. H. (2018). Psychometric approach of the revised illness

perception questionnaire for oral health (IPQ-R-OH) in patients with periodontal disease. Journal of

periodontology. DOI: 10.1002/JPER.18-0136.

Magis, D. (2014). Accuracy of asymptotic standard errors of the maximum and weighted likelihood estimators of

proficiency levels with short tests. Applied psychological measurement, 38, 105 -121.

Marcoulides, G., & Ing, M. (2013). The use of generalisability theory in language assessment. In A. Kunnan, (Ed.),

The companion language assessment, 3, 1207 -1223. New York, NY: John Wiley & Sons, Inc. DOI:

10.1002/9781118411360.wbcla014


Marcoulides, G. A. (1999). Generalisability theory: Picking up where the Rasch IRT model leaves off? In Embretson, S. & Hershberger, S., The new rules of measurement: What every psychologist and educator should know. Lawrence Erlbaum associates Inc., 129-130.

Marcoulides, G. A. (1996) "Estimating variance components in generalizability theory: The covariance structure

analysis approach", structural equation modeling: A multidisciplinary journal.

Marcoulides, G. A. (1993). Maximizing power in generalizability studies under budget constraints, Journal of

educational statistics, 18(2), 197-206.

McDonald, R. P. (1999). Test theory. A unified treatment Mahwah, NJ: Lawrence Erlbaum Associates.

Medvedev, O.N., Merry, A.F., Skilton, C., Gargiulo, D. A., Mitchell, S. J., & Weller, J. M. (2019). Examining reliability

of WHOBARS: a tool to measure the quality of administration of WHO surgical safety checklist using

generalisability theory with surgical teams from three New Zealand hospitals. BMJ Open

2019;9:e022625. doi:10.1136/bmjopen-2018-022625.

Messick, S. (1995). Validation of inferences from persons’ responses and performances as scientific inquiry into

score meaning. American Psychological Association; 50:741 – 749.

Messick, S. (1989). Validity in educational measurement (3rd ed). New York; American Council on education and

Macmillan.

Moskal, B., & Leydens, J. (2000). Scoring rubric development: Validity and reliability. Practical assessment,

research and evaluation,7(10),1-11.

Muijs, D. (2011). Doing quantitative research in education with SPSS. London: Sage Publications limited.

Naizer, G.(2007). Basic concepts in generalisability theory: a more powerful approach to evaluating reliability.

http://www.eric.ed.gov. Eric document reproduction service no. ED341729. Accessed

Nitko, A. J., & Brookert, S. M. (2011). Educational assessment of student. Boston, MA: Pearson Education.

Nitko, A. J. (2004). Educational assessment of students (4th ed.). New Jersey, Pearson Education, Inc.

Nunnally, J., & Bernstein, I. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill, Inc.

Ojerinde, D., Popoola, K., Oje, F., & Onyeneho, P. (2012). Introduction to item response theory (2nd ed.). Abuja.

Marvelouse mike press Ltd.

Orentz, J.O. (2018). Generalisability theory: A comprehensive method for assessing and improving the

dependability of marketing measures. Journal of marketing research, 24, 19 -28.

Pallant, J. (2007). Step-by-step guide to data analysis using version 15: SPSS survival manual (3rd). McGraw-Hill,

open university press.

Regan, B.G., & Kang, M. (2005). Reliability: issues and concerns. Athletic Therapy Today, 10(6), 30-33.

Revelle, W. (2013). Personality project. Retrieved October, 2013, from http://personality-project.org/revelle/syllabi/405.old.syllabus.html

Sadler, D. (2009). “Indeterminacy in the use of preset criteria for assessment and grading”. Assessment and

evaluation in higher education, 34 (2): 159 -165.


Sadler, P. M. & Good, E. (2006). The impact of self –and peer-grading on student learning. Educational

assessment,11(1), 1 -31.

Shavelson, R. J. & Webb, N.M. (2005). Generalisability theory. British journal of mathematical and statistical

psychology, 599- 612.

Shavelson, R.J., Webb, N., & Rowley, G. (1992). Generalisability theory. https://www.researchgate.net/publication/344083148

Shavelson, R.J., & Webb, N.M. (1991). Generalizability theory: A primer. Newbury Park, Sage publications.

Sinclair, M. (2007). “A guide to understanding theoretical and conceptual frameworks.” Evidence-Based

midwifery, 5 (2), 39. Gale onefile: Health and

medicine. link.gale.com/apps/doc/A167108906/HRCA?u=anon~a99d 7734&sid=googleScholar&xid=20965ff

(Accessed 5 May 2022).

Sireci, S. (2007). On validity theory and test validation. Educational researcher, 36(8), 477 -481.

Sireci, S. G., & Parker, P. (2006). Validity on trial: Psychometric and legal conceptualization of validity.

Educational measurement: Issues and practice, 25 (3), 27 -34.

Stangor, C. (2006). Research methods for the behavioural sciences (3rd ed). Boston: Houghton Mifflin.

Stegner, A.J., Tobar, D. A., & Kane, M. T. (1999). Generalizability of change scores on the body awareness scale.

Measurement in physical education and exercise science.

Stemler, S. E. (2007). Cohen’s kappa. In N.J. Salkind (Ed.), Encyclopedia of measurement and statistics. Thousand

Oaks, CA: Sage publications.

Steyer, R. (1999). Classical (Psychometric) test theory. Jena, Germany.

Steyer, R., Schmitt, M., & Eid, M. (1999). Latent state–trait theory and research in personality and individual

differences. European journal of Personality,13(5),389 –408.

Sudweeks, R., Reeve, S. & Bradshaw, W. (2005). A comparison of G theory and many-facet Rasch measurement in

an analysis of college sophomore writing. Assessing writing Vol. 9, 239-261.

Suen, H.K. & Lei, P. W. (2007). Classical versus generalizability theory of measurement. Journal of Educational

measurement,4, 1-13.

Schoonen, R. (2012). The validity and generalisability of writing scores. The effect of rater, task and language. In

E. Van Steendam, M. Tillema, G. Rijlaarsdam, & H. van den Berg (Ed.), Measuring writing: Recent insights into

theory, methodology and practice (pp. 1- 22). Leiden, The Netherlands: Brill.

Stevens, D. D., & Levi, A. J. (2005). Introduction to rubrics. Sterling, VA: Stylus publishing.

Strube, M.J. (2002). Reliability and generalisability theory. In L.G grimme & P.R. Yarnold (Eds.) Reading and

understanding more multivariate statistics (pp. 23 -66). Washington, DC: American Psychological Association.

Swiss Society for Research in Education Working Group (2010). Quality of measurement in education- EDUG

user guide. Switzerland.

Tabachnick, B.G., & Fidell, L.S. (2007). Using multivariate statistics (5th ed.). Allyn and Bacon/Pearson education.

Tan, K. & Prosser, M. (2004). Qualitatively different ways of differentiating student achievement: A

phenomenographic study of academic’s conceptions of grade descriptors. Assessment and evaluation in higher

education, 29(3), 267 -282.


Tavares, W., Bridges, R. & Tuner, L. (2018). Applying Kane's validity framework to a simulation-based assessment of clinical competence. Advances in health science education. https://www.researchgate.net/publication/320679278

Templin, J. (2012). Item response theory. Colorado, Georgia, USA.

Thompson, B. (2003). A brief introduction to generalisability theory. In B. Thompson (Eds.) Score reliability:

Contemporary thinking on reliability issues, 43 -58. Thousand Oaks, C.A: Sage.

Turgut, M. & Baykul, Y. (2010). Measurement and evaluation in education (Eğitimde ölçme ve değerlendirme). Ankara: Pegem Akademi.

University for Development Studies Administrative Manual (2018).

University for Development Studies Statutes (2017).

Vispoel, W. P., Moris, C. A., & Kilinc, M. (2018). Applications of generalisability theory and their relations to

classical test theory and structural equation modeling. American psychological association; Psychological methods,

23(1),1-26.

Webb, N. & Shavelson, R.J. (2005). Generalisability theory: Overview. https://www.researchgate.net/publication/22758011

Webb, N.M., Rowley, G.L., & Shavelson, R.J. (1988). Using generalisability theory in counselling and development. Measurement and evaluation in counselling and development, 21, 81-90.

Yukawa, M., Gansky, S.A., O’Sullivan, P., Teherani, A., & Fieldman, M.D. (2020). A new mentor evaluation tool:

Evidence of validity. Plos one 15 (6): e0234345. https://doi.org/10.1371/journal.pone.0234345. Accessed on

June 16,2020.

Zanon, C., Claudio, S. H., Hanwook, Y., & Hambleton, R. K. (2016). An application of item response theory to psychological test development. Psicologia: Reflexão e Crítica, 29:18. DOI: 10.1186/s41155-016-0040-x.

Zhang, B., Johnston, L., & Kilinc, G. (2008). Assessing the reliability of self-and peer rating in student group work.

Assessing and evaluation in higher education,33(3), 329– 340.

Zumbo, B. D., Gadermann, A. M., & Zeisser, C. (2007). Ordinal versions of coefficients alpha and theta for Likert

rating scales. Journal of modern applied statistical methods, 6(1), 21 -29.