Flagship project delivers step change in text analytics capability

November 15, 2022

The HDR UK National Text Analytics Project team recently came together to share the impacts of their work and opportunities for clinical natural language processing (NLP). Rene Ndoyi, one of the attendees, describes his experience of the HDR UK National Text Analytics project symposium.


Author: Rene Ndoyi, Intern at Institute of Health Informatics


Maximizing text analytics capability for health data research: key learnings from the HDR UK National Text Analytics project symposium

On 28 September 2022, the HDR UK National Text Analytics Project team, led by Professor Richard Dobson (UCL Institute of Health Informatics; King’s College London) and Dr Angus Roberts (King’s College London), came together to share the impacts of their work and opportunities for the clinical natural language processing (NLP) community to deliver and use new NLP tools at this HDR UK symposium.


This flagship project has delivered a step-change in text analytics capability, enabling a major shift in the UK’s ability to use research-ready, actionable, real-time electronic health records by delivering data-driven systems with potential to transform patient care. Sixty people from across HDR UK and the text analytics community attended the symposium to hear about the wide-reaching impacts of the project, learn about methods, tools and challenges for NLP and text analytics research, and discuss what the community needs to be able to access and use NLP resources for research. One of the attendees, Rene Ndoyi, describes his thoughts and learning from the symposium below.


My name is Rene Ndoyi, and I am a recent graduate of the HDR UK Black Internship Programme and an intern at the UCL Institute of Health Informatics. The internship programme has been a real success in my quest to develop a career in health data science. Among the many interesting projects that I was introduced to is the National Text Analytics Resource, led by Professor Richard Dobson (UCL Institute of Health Informatics; King’s College London) and Dr Angus Roberts (King’s College London).


The project has built a community and brought together specialised resources that provide researchers with the tools and support to explore unstructured free text clinical data, using natural language processing (NLP) and text analytics.




My internship mentor, Natalie Fitzpatrick, recommended that I attend the symposium, as it is one of the many ways in which the project not only brings together a community but also raises awareness of the opportunities for NLP research being carried out across HDR UK.


It was insightful to learn about the work that has been done and the success the project has achieved over the past five years.


As an early career researcher who is building my skills in data science, I was keen to learn about the various tools and methods that have been developed to address the challenges of using unstructured free text data. A key piece of work is CogStack, a clinical information retrieval and extraction platform designed to create richer, more useful clinical information to improve healthcare. The tool lets users query real-time data without having to write thousands of SQL queries.

Another tool I learnt about was MedCAT, which extracts information from electronic health records and links it to biomedical vocabulary systems such as SNOMED CT and UMLS. Both of these tools are available for the research community to use via the Health Data Research Innovation Gateway, with the code made open source on GitHub.


Efforts to develop and apply these kinds of tools are important in tackling challenges around bias, transferability and model sharing.


The team described various ways that they are approaching this – from improving access to unstructured data for research, to developing trusted models of governance and standards. They have developed a template model sharing agreement that is being used across 10 different NHS Trusts to date, so that NLP models can be shared easily.


I also learnt that analysis of free text data can be carried out in R, a programming language I am currently learning. The idea of coding reproducible, step-by-step workflows and frameworks relates closely to my internship learning experiences. Under Dr Johan Thygesen’s supervision, we are exploring the development of reproducible and extensible frameworks, based on a previous study that developed a framework for COVID-19 trajectories among 57 million adults in England.


Speakers also highlighted the importance of data governance and employing user-centred approaches. Natalie Fitzpatrick gave an interesting talk on creating a free text donated databank to develop and train NLP tools. I was fascinated to hear people’s feedback about this databank. Stakeholders, including patients and the public, researchers, clinicians and information governance and ethics experts, shared their thoughts through focus groups. There was a lot of support for the databank, but important issues were highlighted, such as the need to overcome different forms of bias, lack of generalisability, poor quality of data and patients’ ability to access their data to correct errors.


From my experience at the symposium, I have no doubt that these efforts will create more opportunities for improved patient care. I look forward to future meetings and opportunities to learn more about the National Text Analytics Resource project.

