Training Cancer Diagnostics Models for Histopathology without Human Intervention:

When Natural Language Processing meets Computer Vision

In this blog post, we discuss the viability of training cancer diagnostics models for histopathology without requiring human intervention. How do Natural Language Processing (NLP) techniques make it possible to bypass the need for manual annotations when developing Computer Vision (CV) applications for histopathology? What opportunities can such an approach generate? Let’s find out!

When human annotations are not enough

The digitalization of histopathology workflows, along with the advent of effective Deep Learning methods, has paved the way towards new solutions to support cancer diagnosis. Nevertheless, the annotation process required to train robust computer-aided diagnosis tools is a bottleneck: too few pathologists are available to annotate large amounts of Whole Slide Images (WSIs) – the digitized images of tissue specimens that are the main asset of digital pathology. Therefore, alternative approaches are required to reduce – or even eliminate – the need for human annotations when training computer-aided diagnosis tools for histopathology.

Weak supervision to the rescue

In recent years, the weak supervision paradigm has emerged to address these challenges. Weak supervision can be performed in different ways, depending on the task and domain. In digital pathology, weakly-supervised methods typically rely on global – also known as weak or image-level – annotations instead of local – or pixel-wise – annotations. These global annotations refer to whole images, even though the regions corresponding to the annotations are just a small part of them. For instance, a WSI might be labeled as “cancer” even if the cancerous tissue represents just 1-2% of the whole image. As a consequence, weakly-supervised methods require larger training sets to reach performance comparable to that of fully-supervised approaches. Nevertheless, compared to pixel-wise annotations, image-level ones are easier to obtain. In particular, image-level annotations can be extracted from pathology reports, which are often provided together with WSIs. However, most of the recent works in this direction rely on medical experts to extract weak labels from reports or WSIs.
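To make the “cancer label despite 1-2% cancerous tissue” intuition concrete, here is a minimal sketch of the Multiple Instance Learning (MIL) view commonly used in weakly-supervised pathology: a slide is a “bag” of patches, and the slide is positive if at least one patch is. The function names and the 0.5 threshold below are illustrative, not taken from any specific system:

```python
# Sketch of weak (slide-level) supervision in a MIL setting.
# A slide is a bag of patches; we only know the slide-level label,
# so the slide score is aggregated from per-patch scores.

def slide_score(patch_scores):
    """Max-pooling aggregation: the slide is as suspicious as its
    most suspicious patch."""
    return max(patch_scores)

def slide_label(patch_scores, threshold=0.5):
    """Binary slide-level decision from patch scores."""
    return int(slide_score(patch_scores) >= threshold)

# A slide where only 2 of 100 patches look cancerous (~2% of the
# tissue) is still labeled "cancer" (1) at the slide level.
scores = [0.05] * 98 + [0.91, 0.87]
print(slide_label(scores))  # 1

# A slide with uniformly low patch scores is labeled "non-cancer" (0).
print(slide_label([0.05] * 100))  # 0
```

Max-pooling is only one possible aggregation; attention-based pooling and other learned aggregators are common alternatives in the literature.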

A notable example is the work by Campanella et al. (2019), where the authors trained a model based on Deep Neural Networks (DNNs) to classify WSIs as cancer or non-cancer using a weakly-supervised dataset of over 30,000 prostate, breast, and skin tissue WSIs. In this work, weak labels were either provided by pathologists after a time-consuming analysis of the WSIs or automatically retrieved from anatomic pathology Laboratory Information Systems (LISs) – which contain structured diagnoses that can be directly used as annotations. With this approach, Campanella et al. achieved excellent performance on cancer classification.

NLP is all you need

Unfortunately, most LISs do not store structured information, but rather deal with noisy and heterogeneous free-text reports. Thus, how can weakly-supervised approaches be adopted when human intervention is not feasible and structured information is not available? That’s when NLP comes into play!
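As a toy illustration of what such an NLP step might do, the sketch below derives a weak slide-level label from a free-text report using simple keyword and negation rules. Everything here – the term lists, the negation pattern, the function name – is hypothetical and far simpler than a real pipeline, which would typically rely on medical ontologies or transformer-based models:

```python
import re

# Illustrative keyword rules for deriving a weak label from a
# free-text pathology report. Real systems use richer vocabularies
# and proper negation/uncertainty detection.
CANCER_TERMS = re.compile(
    r"\b(carcinoma|adenocarcinoma|malignant|melanoma)\b", re.IGNORECASE
)
# Very rough negation check: a negation cue followed, within the same
# sentence, by a cancer-related term.
NEGATION = re.compile(
    r"\b(no|negative for|absence of)\b[^.]*\b(carcinoma|malignan\w*)\b",
    re.IGNORECASE,
)

def weak_label(report: str) -> str:
    """Map a free-text report to a weak slide-level label."""
    if NEGATION.search(report):
        return "non-cancer"
    if CANCER_TERMS.search(report):
        return "cancer"
    return "non-cancer"

print(weak_label("Invasive ductal carcinoma, grade 2."))      # cancer
print(weak_label("No evidence of carcinoma; benign tissue.")) # non-cancer
```

Even such crude rules hint at why free-text reports are attractive: the label already exists in the text, it just needs to be surfaced automatically.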

Recent advances in NLP – with models like BERT opening up the pre-training era – have created unprecedented opportunities to exploit raw textual information in clinical practice. Thus, new and exciting directions in weak supervision envision NLP pipelines that automatically extract semantically meaningful concepts from free-text diagnosis reports, to be used as weak labels for training computer-assisted diagnosis tools for histopathology. We, at the ExaMode project, are also working in this direction, so stay tuned for more on this!