Detecting Emerging Drug Slang and Code Language in Social Media Posts

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 | MSLD 2026 Poster | CC BY 4.0
Keywords: Large Language Model, Ontology, OntologyRAG, Lexical Semantics, Terminology Identification
TL;DR: The project generates an ontology from a dictionary source and combines it with an LLM, as an OntologyRAG, to detect new terminology referring to drugs in social media.
Abstract: Dynamic lexical change in natural language communication is a serious challenge for Natural Language Processing (NLP) and AI solutions that aim to extract named entities, concepts, or semantic content representations. Social media and other online communication platforms accelerate the spread of newly coined terminology, including slang, multilingual code-switching, creative spellings, and other non-standard linguistic forms. These environments make it easy for users to invent, adapt, and circulate expressions that are unfamiliar outside specific communities or regions. In this project we focus on texts that discuss drug use, drug acquisition, and personal experiences with taking drugs. The language in such texts uses novel, low-frequency, or community-specific terms, including newly coined slang, coded references, emojis, creative spellings, and regionally bound expressions (Holbrook et al., 2024). This is an interdisciplinary research activity involving Public Health and NLP/AI interests. From the Public Health perspective, transmission of drug information increasingly happens in digital environments, including social media. Estimates suggest that 10-20% of young adults who use drugs have connected with dealers through social media (Kazemi et al., 2017), and that roughly 13% of social media posts advertise an illegal drug (Fuller et al., 2023). Clandestine online speech is partially to blame: novel terminology and innovative linguistic forms evolve quickly, often intentionally, to avoid moderation or detection. Such variability poses a persistent challenge for speech and language processing systems that work with multilingual text and speech, especially when posts describe substances, effects, or behaviors using terminology unfamiliar outside specific online subcultures (Nahar et al., 2022). Unfortunately, existing technologies are ill-equipped to identify these patterns with the necessary semantic depth.
As a source of data, in this project we use large corpora of social media posts from a variety of platforms. Social media has been shown to be an effective, temporally proximal signal for individual and collective drug-related behaviors through semantically expressed user-generated content (Chen et al., 2025). Yet despite its scientific value and well-established use in addiction research, prior research also shows that social media data analyzed with probabilistic NLP and transformer-based language learning models directly contribute to the aforementioned detection gap. In such computational work, Rao et al. (2024) and Parker et al. (2023) observed that some issues result from ambiguities of common slang terms; for example, "Vikes" may refer either to the professional US football team or to Vicodin. Detecting completely unseen and unknown terminology, and identifying its semantic type and value, is significantly more challenging. Herein, we present the development and performance of the Drug Knowledge Ontology (DKO), informed by the National Institute on Drug Abuse (NIDA) Drug A-Z Inventory (National Institute on Drug Abuse, 2024). In this project we apply neuro-symbolic approaches to facilitate the detection of social media posts that mention or describe experiences with the consumption of drugs. We show how the detection of new drug terminology can be achieved using Large Language Models (LLMs) and Description Logic (DL) (Baader et al., 2007) ontologies. We aim to demonstrate how the DKO, and the resulting knowledge representation, can be leveraged to surveil noisy social media discourse and uncover nuanced discussions involving new and emerging drug terminology. This work is guided by two hypotheses. H1: The DKO will effectively classify labeled and unlabeled drug-related tweets as true/untrue drug references. H2: The DKO can predict a drug class for linguistic inferences in a random collection of drug-related tweets.
Generating the DKO starts with a basic conversion of a structured dictionary format to a raw OWL RDF format. The descriptions of risk factors, side effects, methods of use and consumption, and other properties, which appear as narratives in the dictionary elements, are converted to structured DL concepts and relations using LLMs. This process establishes a class hierarchy and formal relations between concepts that enable DL-based reasoning. Using state-of-the-art binary text classification approaches, we developed classifiers that identify social media posts that might involve mentions of drugs and drug-related activities. These classifiers achieve an accuracy of more than 90% on our social media posts corpus. The identified posts are analyzed using LLMs that include the NIDA Drug Ontology as context, using a retrieval-augmented generation (RAG) approach (Gao et al., 2024), which we refer to as LLM + OntologyRAG. We compare LLM-only detection of drug names in social media posts with an OntologyRAG approach that provides semantic guidance to LLMs in the context of a query.
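The dictionary-to-ontology conversion and the OntologyRAG prompting step might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the entry schema, the `dko` prefix, the property names (`hasStreetName`, `hasMethodOfUse`), and the token-overlap retrieval are all hypothetical, and the LLM call itself is omitted.

```python
import re

# Hypothetical dictionary entries; the field names and values are illustrative
# assumptions, not the actual NIDA inventory schema or content.
ENTRIES = [
    {"name": "Hydrocodone", "drug_class": "Opioid",
     "street_names": ["Vikes", "Vike"], "methods_of_use": ["swallowed"]},
    {"name": "Cannabis", "drug_class": "Cannabinoid",
     "street_names": ["Weed", "Pot"], "methods_of_use": ["smoked", "eaten"]},
]

PREFIX = "dko"  # hypothetical namespace prefix for the ontology


def slug(text):
    """Turn a label into a safe local name by replacing non-word runs with '_'."""
    return re.sub(r"\W+", "_", text.strip()).strip("_")


def entry_to_turtle(entry):
    """Serialize one dictionary entry as a small block of Turtle triples."""
    subject = f"{PREFIX}:{slug(entry['name'])}"
    parent = f"{PREFIX}:{slug(entry['drug_class'])}"
    label = entry["name"]
    lines = [f"{subject} a owl:Class ;", f"    rdfs:subClassOf {parent} ;"]
    for name in entry["street_names"]:
        lines.append(f'    {PREFIX}:hasStreetName "{name}" ;')
    for method in entry["methods_of_use"]:
        lines.append(f'    {PREFIX}:hasMethodOfUse "{method}" ;')
    lines.append(f'    rdfs:label "{label}" .')
    return "\n".join(lines)


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(post, entries, top_k=2):
    """Rank entries by token overlap between the post and each entry's terms."""
    post_tokens = tokens(post)
    scored = []
    for entry in entries:
        terms = [entry["name"], *entry["street_names"], *entry["methods_of_use"]]
        overlap = len(post_tokens & tokens(" ".join(terms)))
        if overlap:
            scored.append((overlap, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


def build_prompt(post, entries):
    """Assemble an LLM prompt that carries the retrieved ontology fragments."""
    context = "\n\n".join(entry_to_turtle(e) for e in retrieve(post, entries))
    return (
        "Ontology context (Turtle):\n"
        f"{context or '(no matching entries)'}\n\n"
        f"Post: {post}\n"
        "Task: list any term in the post that may refer to a drug, "
        "and name the most likely drug class."
    )


if __name__ == "__main__":
    print(build_prompt("took two vikes and smoked after the game", ENTRIES))
```

Keyword overlap is used here only to keep the sketch self-contained; it can retrieve context for known terms but cannot surface unseen slang on its own, which in the approach described above is the role of the LLM guided by the ontology context.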
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 59