Kousha, K. and Thelwall, M. (2024) Assessing the societal influence of academic research with ChatGPT: Impact case study evaluations. https://arxiv.org/pdf/2410.19948
Abstract
Academics and departments are sometimes judged by how their research has benefitted society. For example, the UK’s Research Excellence Framework (REF) assesses Impact Case Studies (ICS), which are five-page, evidence-based claims of societal impacts. This study investigates whether ChatGPT can evaluate societal impact claims and, therefore, potentially support expert human assessors. For this, various parts of 6,220 public ICS from REF2021 were fed to ChatGPT 4o-mini along with the REF2021 evaluation guidelines, comparing the results with published departmental average ICS scores. The results suggest that the optimal strategy for high correlations with expert scores is to input the title and summary of an ICS but not the remaining text, and to modify the original REF guidelines to encourage a stricter evaluation. The scores generated by this approach correlated positively with departmental average scores in all 34 Units of Assessment (UoAs), with values between 0.18 (Economics and Econometrics) and 0.56 (Psychology, Psychiatry and Neuroscience). At the departmental level, the corresponding correlations were higher, reaching 0.71 for Sport and Exercise Sciences, Leisure and Tourism. Thus, ChatGPT-based ICS evaluations are simple and viable to support or cross-check expert judgments, although their value varies substantially between fields.
This appears to be a pre-print or an author-published manuscript. It has not been subjected to peer review, but that doesn’t make it any less interesting.
Heads up: I am fascinated by the potential (positive and negative) of artificial intelligence in research and impact assessment. At the time of writing, I am a skeptic. I think there is too much unknown about the potential uses/abuses of AI tools everywhere, not just in research and impact assessment. But I am glad we have research like this (and check out the references for other studies) that will generate the evidence to help skeptics like me change my mind (or not).
So, skeptics and enthusiasts…read on!
There were 6,781 impact case studies (ICS) in REF 2021. The authors fed “various parts of 6,220 public ICS from REF2021” into ChatGPT. No clue why the remaining 561 weren’t used or why only various parts were used. On the latter, previous AI research on research quality assessment showed a better correlation between AI and human scores when only the title and abstract were used. Similarly, the researchers here found a better correlation when only the title and impact summary were used, not the full ICS.
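The authors haven’t published their pipeline as code, but the workflow they describe — prompt the model with only the title and impact summary plus (stricter) REF guidelines, parse a star rating from the reply, then correlate the resulting scores with published REF averages — could be sketched roughly like this. Everything here (function names, prompt wording, the reply format) is my illustrative assumption, not their actual implementation:

```python
import re
from statistics import mean

def build_prompt(title, summary, guidelines):
    # Combine the (modified, stricter) REF guidelines with only the
    # title and impact summary, since the paper found that feeding the
    # full ICS text gave weaker correlations with expert scores.
    return (f"{guidelines}\n\n"
            f"Title: {title}\n"
            f"Impact summary: {summary}\n"
            "Rate this impact case study from 1* to 4*. "
            "Reply in the form 'Score: N'.")

def parse_score(reply):
    # Pull the first 1-4 star rating out of the model's free-text reply;
    # return None if no rating can be found.
    m = re.search(r"Score:\s*([1-4])", reply)
    return int(m.group(1)) if m else None

def pearson(xs, ys):
    # Plain Pearson correlation, as used to compare average GPT scores
    # against published departmental average REF scores.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```

In a real run, `build_prompt` output would go to an LLM API (the paper used ChatGPT 4o-mini) and `parse_score` would be applied to each reply; here the scoring call is left out so the sketch stays self-contained.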
Concern #1: I have recently returned from chairing a review of 70+ ICS. Reviewers had to read everything, but only for fewer than 20 ICS each. I reviewed them all, thinking I could get away with reading only the impact summary. I ended up needing to read all the impact sections and corroborating evidence for all of them. The title and summary were not enough for me. But apparently good enough for ChatGPT.
ICS are scored on reach (which the authors call “scope”) and significance (which the authors call “depth of impact”). The reasons ChatGPT did not award a 4* (the highest score) were predicated on geography (i.e., international is better than local) and on transformations beyond the scope of the research topic. Neither is a good reason for withholding a 4*, in my opinion.
For Significance, “lack of evidence” was cited as a reason for not awarding 4*. This is curious if only the impact summary was used (see above) and not the underpinning evidence. Other reasons were cited by the authors as being more convincing; however, they did state “the reasons given for scores were often weak or non-existent but, in some cases, seemed to point to genuine limitations”.
ChatGPT also scored clinical medicine ICS higher than those from Music, Drama, Dance, Performing Arts, Film and Screen Studies. Is this the bias in the system?
Bottom line from the authors: “The maximum within-UoA departmental-level correlation of 0.711 between departmental REF average and GPT average score are not high enough to consider replacing expert evaluations of ICS with AI evaluations” but they say they may be high enough to support internal university reviews of potential ICS submissions.
The latter suggestion seems to be less risky and potentially useful.
I remain a skeptic but still open to new evidence.
Questions for brokers:
- AI and impact assessment: Are you a skeptic or enthusiast? Why?
- What do you think of assessing impact based solely on the title and the impact summary without the detail and the underpinning evidence?
- And re question #2: “It will reduce the burden of drafting and reviewing ICS” vs. “This will open up ICS for gaming, overstating, and downright fabrication of impact.” Discuss.
Research Impact Canada is producing this journal club series to make evidence on knowledge mobilization (KMb) more accessible to knowledge brokers and to create online discussions about research on knowledge mobilization. It is designed for knowledge brokers and other people interested in knowledge mobilization. Read this open-access article, then come back to this post and join the journal club by posting your comments.