Why are predicted grades so hard to give, and can we change it?

It’s been a difficult old year when it comes to predictions (and so many things, but let’s stay focused!). I’m not going to specifically explore my own statistics here – it isn’t fair to any individuals involved – but I’m interested in the reasons that predictions are often so inaccurate when compared with exam results. This year in particular, it’s especially important: today’s announcement that Scotland have regraded everyone in line with centre assessment may or may not affect Thursday, but we should consider this for our own good in the future.

Predictions are important. With them, we know which students to target for improvement – and so do the students themselves. Inaccurate predictions are really unfair on students, especially as they tend to be optimistic. With so many exams at GCSE, for example, it’s not uncommon for students to say they’re going to focus on subjects where they’re ‘under target’ or doing less well. I don’t think anyone would disagree that we owe it to them to be as accurate as we can.

But it’s really, really, really hard. There are variations between subjects, but I think most HODs/teachers looking at their statistics would have to say, hand on heart, that we find this very difficult and need to do better.

Caution – even when you seem to be accurate, check your stats. I had a conversation with a Maths teacher one September and he said, “Your (English) predictions were really good.” Well, we got a similar number of the grades that we expected. But they weren’t necessarily the right students!!

Obviously everything here is very context-bound but I think there are likely to be similarities.

Predictions are likely to be inaccurate, and over-optimistic. This research summarises studies suggesting around half (depending on the study and year) of grades are accurate – and often closer to 40% – and that the inaccurate grades are around 50% more likely to be optimistic than pessimistic. Their own research is really interesting; although they admit it’s a small sample based on one board, the English results are some of the most accurate, which surprised me! Still, just under 50% of A-Level predictions were awarded the same grade; another 40% (ish) were awarded a grade above or below. None were 3+ grades out, which wasn’t the case for Chemistry and Psychology, the other subjects studied.

What are some of the barriers to accurate predictions?

The removal of AS levels and (most) coursework
AS Levels gave you an idea of both skills and content coverage; in the OCR specification, for example, you would do the whole-text question for the Shakespeare, then for A-Level add a close analysis (which you’ve also been examined on elsewhere in the spec). The old AS Levels made it a bit easier too, because they were simply a percentage and you could extrapolate based on that.
NEAs are worth just 20% – they really don’t change a candidate’s grade that much from the examination grade. It used to be that we said full marks on the coursework could make all the difference in pushing you over the boundary, but that really doesn’t seem to be the case any more. They might confirm a grade, but they won’t dramatically improve or reduce it (see the sketch below). Obviously at GCSE there’s no coursework at all, so we’re reliant on other data.
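
To see why, here’s the weighting arithmetic as a minimal sketch. Only the 20% weighting comes from the spec; the marks, and the idea that most candidates already score fairly well on NEA, are my own illustrative assumptions:

```python
# Illustrative only: how much the NEA can shift an overall A-Level
# percentage when it carries 20% of the weighting. All marks made up.

NEA_WEIGHT = 0.2
EXAM_WEIGHT = 1 - NEA_WEIGHT

def overall(exam_pct: float, nea_pct: float) -> float:
    """Weighted overall percentage from exam and NEA components."""
    return EXAM_WEIGHT * exam_pct + NEA_WEIGHT * nea_pct

exam = 60  # hypothetical exam performance
# Most candidates already do reasonably well on NEA, so the realistic
# range between a decent and a perfect NEA is fairly narrow.
for nea in (70, 85, 100):
    print(f"Exam {exam}%, NEA {nea}% -> overall {overall(exam, nea):.0f}%")
# Prints 62%, 65% and 68% respectively: even jumping from a decent NEA
# to full marks moves the total by ~6 points - enough to confirm or
# nudge a grade near a boundary, not to transform it.
```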

Teachers vary wildly in their marking. 
I don’t think this is an exaggeration, and there are lots of reasons for it:

  1. Summer examination marking is hugely variable
    Ofqual research into marking demonstrates this. In Maths, the “probability of receiving the ‘definitive grade’ is around 0.96…For history, English language and English literature, the average figures are around 0.55 to 0.6.” While they’re at pains to state that this might just be a mark or two out, and therefore quite legitimate, it’s also true that it could fall either side of a grade boundary, or add up to quite a significant difference over a whole examination. Tolerance on exam marking is even more lenient than that. It’s been a while since I examined, but given the average difference between grades is around 14 marks, a tolerance of 2 marks would easily be a grade’s difference for many candidates – so if the “true” mark is, say, 23/30 (as agreed by senior examiners), then anything between 21 and 25 would be acceptable. Over 4 Literature questions, you could easily be 16 marks under what another examiner would give you – 2 grades’ worth, potentially – and still be within acceptable tolerance limits (see the sketch after this list).
    This makes it VERY difficult for teachers, who see two pieces of work given seemingly the same grade but looking very different – how do we know where to aim and how to balance our AOs? A “good” Literature essay is often different in different questions; do they need more close analysis, a holistic overview, more context? It’s even more prevalent at A-Level, but it’s a balancing act throughout a student’s examinations.
  2. We mark the students in front of us, and we mark for different reasons
    Bias has had a lot of press lately and is definitely worth examining. Bias might include systemic and structural bias related to heritage or disadvantaged status, or it might be much, much more personal (and yet just as stereotyped!) – the ‘neat, clever’ girl who gets over-predicted, the disruptive boy who turns out to be doing decent work at home while avoiding being called geeky, the one who’s a bit nervous so we’re generous and think ‘they’re working really hard’, the one who never seems to be listening but somehow always ‘pulls it off’ but maybe needs a warning… Marking is a multi-faceted beast and it’s never as easy as right/wrong in our subject. We mark to be encouraging, or to give a kick up the backside. If you are in a toxic school environment, you might be more generous because you can’t cope with the extra workload a sea of red and amber will create for you. It’s not fair on students, but it is reality and we should address it in the same way that we might look at postcodes or the EM code on SIMS.
  3. We don’t have enough moderation time, and we don’t moderate widely.
    I know few departments that feel they have enough time as a team. We’re split between planning and assessment. Really good moderation isn’t swapping a top/middle/bottom and scribbling ‘agree’ at the bottom. It takes time, good examples and – perhaps most importantly – a thorough and robust discussion with understanding of what we’re aiming for. That last is complicated by point 1 above, but the ability to say to a colleague, “I think you’re too generous here,” is actually really difficult sometimes. People are defensive of themselves but also of their students (point 2!). It’s even harder to moderate cross-school, even within a trust, due to lack of time.
  4. We’re often led by previous data
    There’s a lot on bias tied up with this that I won’t re-cover, but target grades – from KS2, MidYis, Yellis, FFT or wherever – have a lot to answer for. In my opinion, they’re really rubbish in anything that isn’t English or Maths, and even there, they’re pretty rubbish. In most cases they’re based on probabilities and, for example, a GDS student could get anything from a 5 to a 9 at GCSE depending on a whole host of things. If 7 is the “most likely” outcome, though, that’s likely to be the target grade. The school’s internal pressures have a lot to answer for here, but it’s also part of point 2, in that we’re wary of the messages we send students if they’re over- or underperforming against their data – and again, we might just add that one mark or two to bump them into it if we feel it could be justified.
  5. We can’t predict exam technique. 
    I wonder if some of the over-prediction also comes from exam technique failure. We predict based on the most likely outcome of things going right: not getting ‘the dream question’, but getting something where they feel reasonably confident and get their timings etc. right. We don’t predict based on them mucking up their timings, mistakenly answering two questions instead of one, having a rubbish exam in the morning (or last Tuesday!) and feeling less confident for the afternoon one, or answering on Mrs Birling instead of Mr Birling… And we shouldn’t. But it does affect a reasonable proportion of students.
  6. We’re not very confident!!
    For all these reasons, teachers themselves doubt their ability to give a genuine mark. They usually feel better giving grades, with the range that implies, and anecdotally I think a lot of teachers were more comfortable rank-ordering students this summer – it was the assignment of grade boundaries that was more problematic.
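
To make the tolerance arithmetic from point 1 concrete, here’s a minimal sketch. The figures (a ‘true’ mark of 23/30, a tolerance of 2 marks either side per question, 4 questions, roughly 14 marks between grades) are the illustrative numbers from above, not official ones:

```python
# Illustrative only: two examiners, both within tolerance, marking the
# same four Literature answers. All figures are from the worked example.

TRUE_MARK = 23        # per-question mark agreed by senior examiners (/30)
TOLERANCE = 2         # acceptable deviation either side of the true mark
NUM_QUESTIONS = 4     # questions on the paper
MARKS_PER_GRADE = 14  # rough average gap between grade boundaries

# A severe examiner marks at the bottom of tolerance on every question,
# a generous one at the top - neither would be flagged for re-marking.
severe_total = (TRUE_MARK - TOLERANCE) * NUM_QUESTIONS    # 21 * 4 = 84
generous_total = (TRUE_MARK + TOLERANCE) * NUM_QUESTIONS  # 25 * 4 = 100

gap = generous_total - severe_total
print(f"Gap between 'acceptable' totals: {gap} marks")     # 16 marks
print(f"That is {gap / MARKS_PER_GRADE:.1f} grade widths")  # ~1.1
# 16 marks is more than a whole grade's width; if the candidate sits
# near a boundary, two legitimate markers can land two grades apart.
```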

We can’t predict revision
If you’re in a school that keeps students in lessons to the bitter end, it’s a bit easier, but not always. We don’t always know who’s putting in the hours, or who will chuck their notes as soon as they’re on study leave. It’s also harder in non-core subjects, I think, where students will ditch their revision time to focus on core, or they’ll unbalance it in other ways, revising what they find easy and enjoy.

Lack of regular graded assessment
I’m not a fan of 4 mocks a year of every paper. The workload is too intense. But then, how do we accommodate teaching exam technique (e.g. timing), and when we move to whole-class feedback, how do we accurately communicate marks? I’ve loved our FLASH marking trial for improving students’ metacognition and skills, and for its reduced workload. But we have to balance providing students with grades to gauge progress (and us with data to support predictions) against teacher workload, as well as all the potential damage I think grades can do to students’ attitude to learning, mental wellbeing and so on. Developing an attitude of continuous improvement is, for me, paramount – but it does mean I lack numbers to draw from!

So, are there solutions? 
  1. Know your data
    I’m a big fan of data: it reveals patterns, and then you can start looking at individual students and staff involved. At a department and individual level:
    * What is your differential between predicted and actual grades? What are the patterns? Do you over-predict grade 5 and under-predict grade 8? Breaking it down into groups – especially PP, EM, gender, attendance etc. – is important. Don’t just do white/non-white (obvious but important!).
    * How do these work on an individual level? These conversations are the hardest. Keep it very data-focused; for many, pointing out the biases – perhaps compared with the whole department too – will be enough to make people think carefully and recognise their unconscious bias. It’s not about judging people for this; it’s about drawing attention to it, checking it, tracking it, and trying to minimise it. It can be part of a bigger conversation too, perhaps about curriculum design, attitudes to homework and so on, but do stay focused on the data you’re exploring. Never share this data with the whole team; individuals need their own data but shouldn’t be encouraged to compare themselves to other individuals. Put in the effort to separate it into different documents.
    * Is your accuracy as good as it seems? When you’ve predicted 25% at grade 6 and achieved 23% at grade 6, are they the right students? (See the sketch after this list.)
  2. Do lots of moderation.
    I need to do a lot more of this: using real scripts, exam scripts and student mocks at all stages; developing a good eye for what a grade is and then trying to refine it to a mark using the exam board method, whether that’s marks by AO or holistic ‘shading’ within the grade. Know how it’s marked and be rigorous in applying the method. Moderate based on new marking if you can – using previous students’ papers or exam scripts, asking teachers to mark them and comparing what they gave is really useful and removes student and teacher biases. It’s more demanding in terms of workload (it’s easier to moderate work teachers have to mark anyway than to generate extra) but it can be really powerful.
  3. Use exam scripts
    We spend a bit of the department budget each year on getting these back to build a bank of models, but also scripts to use for moderation. You do need students’ permission – if you’re losing students at sixth form or after A-Level, get them to sign the appropriate form before they go on study leave!
  4. Make exam technique explicit, and habitual.
    Students need to practise time management and exam technique, including planning, calming strategies etc.
  5. Experiment with marking practices
    Think about mocks – splitting the papers by question rather than by class, dividing a year group alphabetically, or perhaps blind marking. Swap classes. Teachers might say they need the mocks to be able to support students’ improvement, but in my experience mocks tell me little I didn’t already know – which is actually a problem. For one, it makes them a waste of time – what’s the point if we don’t learn something? The benefit there is exam technique, which as I’ve said is important, but are we sure the marks are accurate?
  6. Consider paying for marking.
    We’ve been fortunate enough to do this for a couple of years: the Language GCSE mock at Christmas is externally marked. I found it really useful as an interim check, partly because it removes our own biases and means we can focus on what the data is telling us. There have been some complaints from teachers – but these have tended to be about students not performing as expected!! Which for me is the benefit here: we can see why students don’t perform, explore the expectations of examiners, and realign our own marking too. It was also really helpful this year to have three years of marking from the same examiner to inform our GCSE predictions.
  7. Have a department policy on predictions
    This is something I really want to think about – everyone should be predicting in the same way. We need the ability to exercise professional judgement and knowledge of students (if they were ill during mocks we take that into account; they’ve worked really hard since January, etc.) BUT there also needs to be consistency. It’s got to be a balance of data and judgement. A difficult one to make into policy, but I want to try to refine the methods we use this year.
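
As a minimal sketch of the checks in point 1, here’s the kind of analysis I mean – assuming a simple spreadsheet of per-student predicted and actual grades (the file name and column names here are invented for illustration):

```python
import pandas as pd

# Invented file/column names: one row per student, with numeric 9-1
# grades and a 'group' column (PP, EM, gender, attendance band etc.).
df = pd.read_csv("english_predictions.csv")

# Distribution-level accuracy: % predicted vs % achieved at each grade.
# This is the headline figure that can look deceptively good.
dist = pd.DataFrame({
    "predicted %": df["predicted"].value_counts(normalize=True) * 100,
    "actual %": df["actual"].value_counts(normalize=True) * 100,
}).fillna(0).sort_index(ascending=False)
print(dist.round(1))

# Student-level accuracy: were they the *right* students?
df["diff"] = df["predicted"] - df["actual"]
exact = (df["diff"] == 0).mean() * 100
over = (df["diff"] > 0).mean() * 100
under = (df["diff"] < 0).mean() * 100
print(f"Exact: {exact:.0f}%  Over: {over:.0f}%  Under: {under:.0f}%")

# Break the differential down by group to surface patterns worth a
# conversation - a positive mean means systematic over-prediction.
print(df.groupby("group")["diff"].agg(["mean", "count"]).round(2))
```

The gap between the first and second outputs is exactly the trap in my Maths-teacher anecdote above: matching percentages at each grade can hide a lot of wrong students.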


I really don’t know what Thursday this week and next will hold. I expect similar headlines to Scotland’s, but who can tell? What I do know is that students need the best possible chance to work towards the grade they deserve. Giving them as accurate a prediction as possible does that. If we inflate or reduce it for any reason, we’re doing them a huge disservice and robbing them of that opportunity. It’s not easy. We don’t always get it right, though we try really hard. We can keep trying to do better.

Thanks for reading.

What do you think?
