OpenDeID Dataset

The OpenDeID corpus is the first Australian gold-standard corpus designed specifically for patient de-identification. It supports the development and evaluation of automated de-identification systems, whether they rely on rule-based algorithms or machine learning approaches. The corpus comprises 2,100 pathology reports drawn from 1,833 cancer patients, with each report averaging approximately 717 tokens. Manual annotation identified 38,414 Protected Health Information (PHI) entities. Across all three de-identification settings, the inter-annotator agreement and deviation scores are high, at 0.9464 and 0.9503, respectively. The corpus has been manually annotated with surrogate information, so it contains no identifiable patient data and can be used to develop and evaluate de-identification technologies while upholding stringent privacy standards.

About HSA Biobank


The HSA Biobank is a collaborative initiative based at the Lowy Cancer Research Centre at the University of New South Wales, Sydney, Australia. It aims to support researchers and clinicians in advancing translational cancer research across Australia and internationally. The HSA Biobank houses all types of tumor tissue obtained from patients who have undergone surgery at one of the HSA hospitals and have given their consent to the HSA Biobank. For more information, see:

http://www.tcrn.unsw.edu.au/hsa


Ethics approval

Dataset access fees

Dataset Access Instructions

Access Criteria


Frequently Asked Questions



Selected publications