Flagship project delivers step change in text analytics capability

November 15, 2022

The HDR UK National Text Analytics Project team recently came together to share the impacts of their work and opportunities for clinical natural language processing (NLP). Rene Ndoyi, one of the attendees, describes his experience of the HDR UK National Text Analytics project symposium.


Author: Rene Ndoyi, Intern at Institute of Health Informatics


Maximizing text analytics capability for health data research: key learnings from the HDR UK National Text Analytics project symposium

On 28 September 2022, the HDR UK National Text Analytics Project team, led by Professor Richard Dobson (UCL Institute of Health Informatics; King’s College London) and Dr Angus Roberts (King’s College London), came together to share the impacts of their work and opportunities for the clinical natural language processing (NLP) community to deliver and use new NLP tools at this HDR UK symposium.


This flagship project has delivered a step-change in text analytics capability, enabling a major shift in the UK’s ability to use research-ready, actionable, real-time electronic health records by delivering data-driven systems with potential to transform patient care. Sixty people from across HDR UK and the text analytics community attended the symposium to hear about the wide-reaching impacts of the project, learn about methods, tools and challenges for NLP and text analytics research, and discuss what the community needs to be able to access and use NLP resources for research. One of the attendees, Rene Ndoyi, describes his thoughts and learning from the symposium below.


My name is Rene Ndoyi, a recent graduate of the HDR UK Black Internship Programme and intern at the UCL Institute of Health Informatics. The internship programme was such a success in my quest to develop a career in health data science. Among the many interesting projects that I was introduced to is the National Text Analytics Resource – led by Professor Richard Dobson (UCL Institute of Health Informatics; King’s College London) and Dr Angus Roberts (King’s College London).


This flagship project has delivered a step-change in text analytics capability, enabling a major shift in the UK’s ability to use research-ready, actionable, real-time electronic health records by delivering data-driven systems with potential to transform patient care. The project has built a community and brought together specialised resources that provide researchers with the tools and support to explore unstructured free text clinical data, using natural language processing (NLP) and text analytics.


Sixty people from across HDR UK and the text analytics community attended the symposium to hear about the wide-reaching impacts of the project, learn about methods, tools and challenges for NLP and text analytics research. Attendees also discussed what the community needs to be able to access and use NLP resources for research.


My internship mentor, Natalie Fitzpatrick, recommended that I attend the symposium, as it is one of the many ways in which the project not only brings together a community but also raises awareness of the opportunities for NLP research being carried out across HDR UK.


It was very insightful and interesting to learn about the work that has been done and the success the project has earned over the past five years.


As an early career researcher building my skills in data science, I was keen to learn about the various tools and methods that have been developed to address the challenges of using unstructured free text data. A key piece of work is CogStack, a clinical information retrieval and extraction platform that creates richer, more useful clinical information to improve healthcare. The tool enables users to query real-time data without having to write thousands of SQL queries.

Another tool I learnt about was MedCAT, which extracts information from electronic health records and links it to biomedical vocabulary systems like SNOMED CT and UMLS. Both of these tools are available for the research community to use via the Health Data Research Innovation Gateway, with the code made open source on GitHub.
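To make the idea of linking free text to a vocabulary more concrete, here is a minimal toy sketch in Python. It is not MedCAT's API or method: the `VOCAB` dictionary, the `DEMO-` concept IDs and the `link_concepts` function are all hypothetical, and the example only does exact phrase matching, whereas real tools handle misspellings, abbreviations, negation and context.

```python
import re

# Hypothetical mini-vocabulary: surface term -> made-up concept ID.
# Real systems link to vocabularies such as SNOMED CT or UMLS instead.
VOCAB = {
    "heart attack": "DEMO-001",
    "myocardial infarction": "DEMO-001",
    "high blood pressure": "DEMO-002",
    "hypertension": "DEMO-002",
}


def link_concepts(text):
    """Return (matched text, concept ID, start offset) tuples in order of appearance."""
    found = []
    for term, concept_id in VOCAB.items():
        # Match the term as a whole phrase, ignoring case.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((match.group(0), concept_id, match.start()))
    return sorted(found, key=lambda item: item[2])


note = "Patient has a history of hypertension and a previous heart attack."
for span, concept_id, start in link_concepts(note):
    print(start, span, concept_id)
```

Because different phrasings ("heart attack", "myocardial infarction") map to the same concept ID, records written in different words can be counted and queried together, which is the basic benefit of linking text to a shared vocabulary.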


Efforts to develop and apply these kinds of tools are important in tackling challenges around bias, transferability and model sharing.


The team described various ways that they are approaching this – from improving access to unstructured data for research, to developing trusted models of governance and standards. They have developed a template model sharing agreement that is being used across 10 different NHS Trusts to date, so that NLP models can be shared easily.


I also learnt that analysis of free text data can be achieved through R programming, a language I am currently learning. The idea of coding reproducible, step-by-step workflows and frameworks relates closely to my internship learning experiences. Under Dr Johan Thygesen’s supervision, we are exploring the development of reproducible and extensible frameworks, based on a previous study that developed a framework for COVID-19 trajectories among 57 million adults in England.


Speakers also highlighted the importance of data governance and employing user-centred approaches. Natalie Fitzpatrick gave an interesting talk on creating a free text donated databank to develop and train NLP tools. I was fascinated to hear people’s feedback about this databank. Stakeholders, including patients and the public, researchers, clinicians and information governance and ethics experts, shared their thoughts through focus groups. There was a lot of support for the databank, but important issues were highlighted, such as the need to overcome different forms of bias, lack of generalisability, poor quality of data and patients’ ability to access their data to correct errors.


From my experiences at the symposium, I have no doubt that these efforts will harness more opportunities for improved patient care. I look forward to future meetings and opportunities to learn more about the National Text Analytics Resource project.


