SREDH/AI-Cup 2023 deidentification competition

In recent years, artificial intelligence (AI) technology has developed rapidly. In the past year especially, companies such as OpenAI, Microsoft and Google have introduced their own large language models (LLMs) and built them into related products. Applications such as ChatGPT have shown the potential of LLMs in various fields. The application of LLMs in clinical medicine is therefore regarded as the future of AI in digital health, which is now a very important and evolving research area. However, when applying such AI models, ordinary users and even system or program developers are often unaware of the privacy issues that arise when interacting with LLMs, which may lead to leaks of important confidential information. In addition, if the data used to train such large language models contains real private information (such as an individual's name, phone number or ID card number), there is a risk that the LLM memorizes that information and later leaks it through its interactions with users.

At the same time, health, medical and biomedical institutions at all levels around the world are using electronic health records (EHRs) for research. However, EHRs are often filled with private or confidential information related to patients, and fragments of information collected across various EHR systems can be used to deduce the true identity of a patient. Therefore, in order to properly utilize EHRs for secondary research and to promote the development of innovative digital health applications, it is very important to identify and remove patients' private information. In response to the privacy issues reported in the literature, especially those concerning the use of LLMs for automatic deidentification of unstructured text notes, the Ministry of Education in Taiwan has sponsored a large nationwide competition, Artificial Intelligence CUP 2023-Privacy Protection and Medical Data Standardization Challenge, through the nationwide project titled “Ministry of Education Artificial Intelligence Competition and Annotation Data Collection Project”. The aim is to seek automatic de-identification and standardization solutions from researchers around the world.

The challenge participants evaluated their AI models on a large Australian multicentre corpus. The models were primarily evaluated on two tasks: entity recognition of sensitive health information, and entity recognition and normalisation of temporal information. Please refer to the CodaLab site to participate.
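To make the second task more concrete, the following is a minimal sketch of what recognising and normalising a temporal expression can look like. It is an illustration only, not the competition's method or scoring format: the regular expression, the function name `find_and_normalise_dates`, the day-first date interpretation, the ISO 8601 output and the example note are all assumptions introduced here.

```python
import re
from datetime import datetime

# Hypothetical pattern: numeric dates such as "3/11/2015".
# Assumption: dates are written day-first (day/month/year).
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def find_and_normalise_dates(text):
    """Return (start, end, surface_text, iso_date) for each date-like span."""
    results = []
    for match in DATE_PATTERN.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        try:
            # Normalise to an ISO 8601 calendar date (assumed target format).
            iso_date = datetime(year, month, day).date().isoformat()
        except ValueError:
            # Skip strings that look like dates but are not valid ones.
            continue
        results.append((match.start(), match.end(), match.group(0), iso_date))
    return results


# Made-up example sentence, not taken from any real record.
note = "Specimen received on 3/11/2015 and reported on 05/11/2015."
spans = find_and_normalise_dates(note)
for start, end, surface, iso_date in spans:
    print(start, end, surface, iso_date)
```

Each result pairs the character offsets of the recognised span with a normalised value, which mirrors the two halves of the task: recognition locates the expression and normalisation maps it to a standard form. A realistic system would need to handle many more surface forms than this single pattern covers.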


OpenDeID Corpus Dataset 

2024 International workshop on deidentification of electronic medical record notes (IW-DMRN)

2024 IW-DMRN Information 


SREDH Consortium

National Kaohsiung University of Science and Technology

Asian University