Exploring the application of machine learning to expert evaluation of research impact

Williams K, Michalska S, Cohen E, Szomszor M, Grant J (2023) Exploring the application of machine learning to expert evaluation of research impact. PLoS ONE 18(8): e0288469. https://doi.org/10.1371/journal.pone.0288469


The objective of this study is to investigate the application of machine learning techniques to the large-scale human expert evaluation of the impact of academic research. Using publicly available impact case study data from the UK’s Research Excellence Framework (2014), we trained five machine learning models on a range of qualitative and quantitative features, including institution, discipline, narrative style (explicit and implicit), and bibliometric and policy indicators. Our work makes two key contributions. Based on the accuracy metric in predicting high- and low-scoring impact case studies, it shows that machine learning models are able to process information to make decisions that resemble those of expert evaluators. It also provides insights into the characteristics of impact case studies that would be favoured if a machine learning approach was applied for their automated assessment. The results of the experiments showed strong influence of institutional context, selected metrics of narrative style, as well as the uptake of research by policy and academic audiences. Overall, the study demonstrates promise for a shift from descriptive to predictive analysis, but suggests caution around the use of machine learning for the assessment of impact case studies.
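The paper's exact pipeline and features are not reproduced here, but the kind of task it describes — binary classification of case studies as high- or low-scoring from mixed quantitative features — can be sketched roughly as follows. This is a minimal illustration on invented, synthetic data; the feature names (citation counts, policy mentions, narrative length) are assumptions loosely echoing the paper's feature categories, not the authors' actual variables or models.

```python
# Hypothetical sketch of high/low impact-case-study classification.
# All data and feature choices are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200
# Invented features: academic citations, policy-document mentions, word count
X = np.column_stack([
    rng.poisson(30, n),        # academic citation count
    rng.poisson(2, n),         # policy-document mentions
    rng.normal(750, 100, n),   # narrative word count
])
# Invented labels: assume high scores track policy and academic uptake
y = (X[:, 1] + 0.05 * X[:, 0] + rng.normal(0, 1, n) > 3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"held-out accuracy: {acc:.2f}")
```

The study trained five such models and compared them on accuracy; the sketch above stands in for the simplest member of that family.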

This article looks at aspects of artificial intelligence (AI) as they apply to assessment of research impact. AI has important but largely unexplored implications for almost everything, so it is important to examine it in the context of impact assessment. And this study is by Jonathan Grant and Kate Williams (with newly added Eliel Cohen) – a team well grounded in impact assessment.


The paper reviews some recent literature regarding AI and research assessment, though none of it addresses impact assessment. A recent deep dive into “Responsible use of technology in research assessment” was undertaken by the Statistical Cybermetrics and Research Evaluation Group, University of Wolverhampton, and was summarized by those authors in the LSE Impact Blog. They summarize the report as follows: “We are not recommending this solution because in our judgement, its benefits are marginally outweighed by the perverse incentive it would generate for institutions to overvalue journal impact factors. UKRI has signed the Declaration on Research Assessment (DORA) against overuse of journal impact factors and is currently attempting to reduce its influence in the sector and so an AI system informing REF scores that relied partly on a journal impact calculation would be unwelcome.”

Back to the article at hand… see the article for the details on data sources and methods (what the authors did, and to which data). I will go right to the “so what”.

Peer review (the method for assessment of impact case studies in the UK Research Excellence Framework – REF) is challenged by lengthy delays, bias, superficiality, and a lack of generalization across disciplines and between reviewers and editors. “At least in theory, the (semi)automated nature of ML can add to the overall objectivity, transparency and reproducibility of the process, potentially making it an attractive complement to peer review.” This article addresses the “in theory”. The questions asked are:

  1. Can we predict high-scoring impact case studies using ML?
  2. What are the characteristics of high-scoring impact case studies?

And yet the analysis is limited because UKRI makes the 6,679 REF 2014 impact case studies available but does not publish their individual scores.

Some observations

  • Predictions were stronger among REF Panels A (medicine, health and life sciences), B (physical sciences, engineering and mathematics) and C (social sciences) in comparison with Panel D (arts and humanities)
  • The models worked better when case studies from top peer institutions (judged by research income as a % of overall institutional income) and lower peer institutions were compared. Predictions were less strong with middle peer institutions.
  • The presence of policy references was strongly predictive of high-scoring case studies.
  • Case studies affiliated with companies, health care institutions, and governments tended to score high, while those affiliated with archives, non-profits, and educational institutions tended to score low.
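One way to surface tendencies like those above is to inspect the fitted coefficients of a linear classifier, which show each feature's direction of influence on the predicted score. The sketch below does this on invented data whose labels are deliberately constructed to echo the reported patterns (policy references and certain affiliations pushing scores up, archive affiliation pushing them down); the feature names and effect sizes are assumptions, not the paper's results.

```python
# Hypothetical illustration: reading off which invented features drive
# high/low predictions in a linear model. Data and effects are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
features = ["policy_refs", "top_peer_institution",
            "company_partner", "archive_partner"]
X = np.column_stack([
    rng.poisson(2, n),       # policy-document references
    rng.integers(0, 2, n),   # 1 if top peer institution
    rng.integers(0, 2, n),   # 1 if affiliated with a company
    rng.integers(0, 2, n),   # 1 if affiliated with an archive
])
# Labels built to encode the tendencies described in the observations
logit = 0.8 * X[:, 0] + 1.0 * X[:, 1] + 0.7 * X[:, 2] - 0.7 * X[:, 3] - 1.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
for name, coef in zip(features, clf.coef_[0]):
    print(f"{name:>22s}: {coef:+.2f}")
```

With signed coefficients in hand, one can see directly how a model of this kind could encode (and recapitulate) institutional and sectoral patterns present in its training labels.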

Some conclusions

  • ML may be recapitulating bias by rating impact case studies from higher peer institutions as high-scoring and those from lower peer institutions as low-scoring.
  • ML may be recapitulating assumptions about the “importance” of impact affiliated with the private and health care sectors, as opposed to the social care and education sectors.

An interesting conclusion about REF 2014 assessments: “It may be that REF2014 was not entirely able to avoid the social conditions that surround specific institutions. That is, there may have been implicit pre-conceptions that were captured in the evaluation.” And ML recapitulated these pre-conceptions.

Questions for brokers:

  1. The article presents what (they did) and so what (what they found). What about now what? What recommendations would you make to agencies overseeing assessments of research impact?
  2. Discuss: AI and ML should be used as inputs to expert review, not to replace it.
  3. Do AI and ML help or hinder a commitment to DORA?

Research Impact Canada is producing this journal club series to make evidence on knowledge mobilization more accessible to knowledge brokers and to facilitate discussion about research on knowledge mobilization. It is designed for knowledge brokers and other parties interested in knowledge mobilization.