This dataset was generated as part of the study available at https://github.com/SREDH-Consortium/mRNA-Vaccine-Hesitancy. It is derived from a structured analysis of Reddit discussions about mRNA vaccines and is designed to enable systematic surveillance of misinformation narratives grounded in both clinical and thematic frameworks.
The dataset was generated using a five‑stage multi‑agent processing pipeline that transforms raw social media discourse into structured, analyzable outputs. Reddit posts are sequentially processed through autonomous agents responsible for (1) content summarisation, (2) clinical concept and disease entity extraction, (3) mapping of extracted entities to ICD‑11 clinical codes, and (4) classification of narratives using a thematic misinformation taxonomy. A final integration stage consolidates outputs across agents to support downstream analysis.
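The stage ordering described above can be pictured with a minimal Python sketch. This is an illustration only, not the study's implementation: the function names, the keyword lists, the rule-based stand-ins for each agent, the placeholder code strings (which are not real ICD-11 codes), and the narrative labels are all assumptions made for the example.

```python
from dataclasses import dataclass, field, asdict

# Everything below is a hypothetical placeholder: these are not the study's
# agents, not real ICD-11 codes, and not the study's narrative taxonomy.
KNOWN_CONCEPTS = ("myocarditis", "infertility")
ICD11_PLACEHOLDERS = {
    "myocarditis": "PLACEHOLDER-001",
    "infertility": "PLACEHOLDER-002",
}


@dataclass
class Record:
    post: str
    summary: str = ""
    concepts: list = field(default_factory=list)
    icd11_codes: dict = field(default_factory=dict)
    narratives: list = field(default_factory=list)


def summarise(rec):
    # Stage 1: stand-in summary (first sentence of the post).
    rec.summary = rec.post.split(".")[0].strip()
    return rec


def extract_concepts(rec):
    # Stage 2: stand-in entity extraction via keyword matching.
    text = rec.post.lower()
    rec.concepts = [c for c in KNOWN_CONCEPTS if c in text]
    return rec


def map_to_icd11(rec):
    # Stage 3: look up each extracted concept in the placeholder table.
    rec.icd11_codes = {c: ICD11_PLACEHOLDERS[c] for c in rec.concepts}
    return rec


def classify_narrative(rec):
    # Stage 4: stand-in classifier with made-up labels.
    is_causal = "cause" in rec.post.lower()
    rec.narratives = ["example-causal-claim"] if is_causal else ["unclassified"]
    return rec


def integrate(rec):
    # Stage 5: consolidate all stage outputs into one flat dictionary.
    return asdict(rec)


STAGES = (summarise, extract_concepts, map_to_icd11, classify_narrative)


def run_pipeline(post):
    rec = Record(post=post)
    for stage in STAGES:  # stages run strictly in sequence
        rec = stage(rec)
    return integrate(rec)


result = run_pipeline("mRNA vaccines cause myocarditis. Everyone knows it.")
print(result["summary"], result["icd11_codes"], result["narratives"])
```

Each stage takes the record produced by the previous one and fills in a single field, so any stage could be swapped for a real agent without changing the others.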
Each record in the dataset links original Reddit content with its corresponding summaries, identified disease concepts, ICD‑11 codes, and assigned misinformation narratives. This dual‑layered annotation—combining biomedical classification with narrative taxonomy—enables both health‑centric and discourse‑centric analyses of vaccine misinformation. The dataset is particularly suited for research on misinformation surveillance, narrative dynamics, clinical framing of health discourse, and the development or evaluation of automated public health monitoring systems.
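As a rough illustration of how the two annotation layers might be used together, the Python sketch below counts how often each narrative label co-occurs with each clinical code. The field names, the example values, and the placeholder code strings are assumptions made for this example; the released dataset's actual schema and labels may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedPost:
    # Hypothetical field names; the real schema may differ.
    post_text: str
    summary: str
    disease_concepts: tuple
    icd11_codes: tuple
    narratives: tuple


def narrative_by_code(records):
    """Count (ICD-11 code, narrative) pairs across all records."""
    counts = Counter()
    for rec in records:
        for code in rec.icd11_codes:
            for narrative in rec.narratives:
                counts[(code, narrative)] += 1
    return counts


# Made-up records with placeholder codes and labels, for illustration only.
records = [
    AnnotatedPost("post A", "summary A", ("concept-x",), ("PLACEHOLDER-001",), ("label-1",)),
    AnnotatedPost("post B", "summary B", ("concept-x",), ("PLACEHOLDER-001",), ("label-1", "label-2")),
    AnnotatedPost("post C", "summary C", ("concept-y",), ("PLACEHOLDER-002",), ("label-2",)),
]

for (code, narrative), n in sorted(narrative_by_code(records).items()):
    print(code, narrative, n)
```

Grouping by code first supports a health-centric view of the data, while grouping by narrative first supports a discourse-centric one; the same counter serves both.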
The dataset is intended for use by researchers in digital public health, medical informatics, health communication, and AI‑driven surveillance, and provides a reusable benchmark for studying the intersection of clinical representation and misinformation narratives in social media environments.
Fill out the data request form to obtain access to the dataset.
Once the request is approved, please sign and return the SREDH Consortium membership, data usage, and project description forms, which will be sent upon completion of the data request form above.
Pay the data access and associated fees, if applicable.
Download the dataset from the SREDH secure server.
Submit a progress report every 6 months until the project is completed.
Available to researchers (academic and non-academic) for non-commercial purposes.
Researchers need to have experience handling sensitive patient data and training in ethics.
Researchers are required to report biannually to the SREDH Consortium on any research outputs that arise.
Any output that arises from this dataset needs to be reviewed by the data custodian (SREDH Consortium) before submission.
Frequently Asked Questions
Please refer to the FAQs page.
Selected publications
Adam, D. C., Jonnagaddala, J., Han-Chen, D., Batongbacal, S., Almeida, L., Zhu, J. Z., ... & MacIntyre, C. R. (2017, November). ZikaHack 2016: A digital disease detection competition. In Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017) (pp. 39-46).
Jonnagaddala, J., Dai, H. J., & Chang, Y. C. (2017, November). Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017). In Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017).
Huang, Y. J., Su, C. H., Chang, Y. C., Ting, T. H., Fu, T. Y., Wang, R. M., ... & Hsu, W. L. (2017, November). Incorporating dependency trees improve identification of pregnant women on social media platforms. In Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017) (pp. 26-32).