Lately the technical team at Thrad has been busy, and we’re very excited to share our first contribution to the open-source AI marketplace: two conversation classification models!
thrad-distilbert-conversation-classifier & thrad-bert-conversation-classifier
Fine-tuned DistilBERT and BERT models from Thrad.ai that classify conversations according to a simplified breakdown of the OpenAI chat usage report. -> (here)
The models are fine-tunes of Hugging Face’s DistilBERT and Google’s BERT, respectively; both start from the base-uncased variants.
The models were trained on an in-house dataset of 29,000 LLM-labeled and verified user chats collected from Thrad partners. Data preprocessing consisted of PII scrubbing with Presidio, selection of the most recent three “turns” (six messages) of the chat history, and extraction of the first and last 85 tokens of each message to fit within the maximum input sequence length of 512 tokens.
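The turn-selection and head/tail truncation step can be sketched as follows. This is an illustrative reimplementation, not the code from our repo: whitespace splitting stands in for the real subword tokenizer, and the Presidio scrubbing pass is assumed to have already run.

```python
def truncate_chat(messages, max_turns=3, head=85, tail=85):
    """Keep the most recent `max_turns` turns (two messages each) and,
    within each message, only the first `head` and last `tail` tokens.
    Whitespace splitting stands in for the real subword tokenizer."""
    recent = messages[-(max_turns * 2):]
    truncated = []
    for msg in recent:
        toks = msg.split()
        if len(toks) > head + tail:
            # Keep the opening and closing of long messages, drop the middle.
            toks = toks[:head] + toks[-tail:]
        truncated.append(" ".join(toks))
    return truncated
```

Keeping both the head and the tail of each message preserves the user’s opening intent and their most recent phrasing, which together carry most of the classification signal.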
This dataset of truncated chats was labeled in a twofold process by OpenAI’s GPT-4o mini and Google’s Gemini 2.5 Flash. Only the chats for which these models agreed on a label were kept for the training run. We found that the two models agreed only ~74% of the time, with the most frequent confusion being between academic help/homework and general information requests.
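The agreement filter itself is simple; a minimal sketch, where the `gpt` and `gemini` field names are illustrative rather than our real schema:

```python
def agreement_filter(rows):
    """Keep only chats where the two LLM labelers agree, and report
    the agreement rate over the whole batch."""
    kept = [r for r in rows if r["gpt"] == r["gemini"]]
    rate = len(kept) / len(rows) if rows else 0.0
    return kept, rate
```

Requiring two independent labelers to agree trades dataset size for label quality, which matters when the downstream models are small.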
Our models classify conversations into 13 buckets, designed to let us withhold Thrad.ai ad service from chats that we, and advertisers, are not interested in participating in. This is a major step in our continuing commitment to brand and advertiser safety in the rapidly developing ads-in-AI space. Our bins are described below:
A - academic_help – Students getting help with homework, assignments, tests, or studying.
B - personal_writing_or_communication – Draft, edit, or improve personal/professional emails, messages, social media posts, letters, or workplace communications.
C - writing_and_editing – Create, edit, or improve nonfiction or instructional writing.
D - creative_writing_and_role_play – Create poems, stories, fictional narratives, scripts, dialogues, or character-based roleplays.
E - general_guidance_and_info – Provide step-by-step guidance, practical advice, or factual information.
F - programming_and_data_analysis – Write or debug code or work with data/programming tools.
G - creative_ideation – Generate new ideas, brainstorm concepts, or discover new topics.
H - purchasable_products – Ask about products, services, or prices.
I - greetings_and_chitchat – Small talk or casual chat.
J - relationships_and_personal_reflection – Discuss emotions, relationships, or introspection.
K - media_generation_or_analysis – Create, edit, analyze, or retrieve visual/audio/media content.
L - other – No indication of what the user wants, or an intent not listed above.
M - other_obscene_or_illegal – The user is making obscene or illegal requests.
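For reference, the bins above can be kept as a simple lookup table. In this sketch the id ordering follows the A–M listing, and the choice of which bins are excluded from ad service is our illustration here, not the shipped configuration:

```python
# Class labels in A-M order (the id ordering in the released
# checkpoints is an assumption for this sketch).
ID2LABEL = [
    "academic_help",
    "personal_writing_or_communication",
    "writing_and_editing",
    "creative_writing_and_role_play",
    "general_guidance_and_info",
    "programming_and_data_analysis",
    "creative_ideation",
    "purchasable_products",
    "greetings_and_chitchat",
    "relationships_and_personal_reflection",
    "media_generation_or_analysis",
    "other",
    "other_obscene_or_illegal",
]

# Bins where ads are withheld -- an illustrative policy, not the
# production configuration.
AD_BLOCKED = {"other", "other_obscene_or_illegal"}

def serve_ads(label: str) -> bool:
    """True if the predicted bin is eligible for ad service."""
    return label not in AD_BLOCKED
```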
We also experimented with using GPT-4o mini as a teacher model. Because there are fewer than 20 classes, the OpenAI API was used to collect the top 20 tokens and their logprobs. These tokens were screened so that only valid class labels remained, and the logprobs were renormalized into a probability distribution over those class tokens. With this distribution as a soft target, we ran student-teacher training of thrad-distilbert-conversation-classifier-st and thrad-bert-conversation-classifier-st with a variety of teacher weights. These models underperform models trained with only hard labels. A potential reason is the dramatic capacity gap between DistilBERT/BERT and GPT-4o mini, which may lead the student to learn a poor, noisy approximation of the teacher’s representation.
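The screening and renormalization step can be sketched as below. This assumes the teacher was prompted to answer with a single class letter, which is one common way to elicit per-class logprobs; the student loss would then mix cross-entropy on the hard label with a divergence against this soft target, weighted by the teacher weight:

```python
import math

# Single-letter class tokens A-M; assumes the teacher answers with
# one letter per class.
CLASS_TOKENS = [chr(ord("A") + i) for i in range(13)]

def soft_target(top_logprobs):
    """Convert the API's top-k (token, logprob) pairs into a normalized
    distribution over the class tokens, discarding non-label tokens."""
    probs = {t: math.exp(lp) for t, lp in top_logprobs if t in CLASS_TOKENS}
    z = sum(probs.values())
    return {t: p / z for t, p in probs.items()}
```

Renormalizing is necessary because the top-20 list mixes class letters with ordinary vocabulary tokens, so the retained class probabilities do not sum to one on their own.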
As a result of these observations, the public repo only includes training and inference code for the base BERT and DistilBERT models. Detailed score breakdowns will be posted in the repository documentation here -> (Repo)
Our unique data store and brand-safety-aware loss design have allowed us to achieve a blocked-to-unblocked chat classification error rate of ~5%, with ~84% accuracy. On the same withheld test set of chats, Llama 3.1 8B Instant served with Groq, using the same labeling prompt, had an agreement rate of only ~40% and a ~14% blocked-to-unblocked error rate. Performance statistics are calculated from a set of 2,224 held-out test samples; a detailed breakdown is in the public repository’s `eval` folder.
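The headline numbers can be reproduced from per-chat predictions along these lines; a sketch under the assumption that the blocked-to-unblocked error rate counts chats that cross the blocked/unblocked boundary in either direction (the exact definition lives in the repo’s `eval` folder):

```python
def eval_metrics(y_true, y_pred, blocked):
    """Overall accuracy plus the rate of chats whose true bin and
    predicted bin fall on opposite sides of the blocked/unblocked line."""
    n = len(y_true)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    crossed = sum(
        1 for t, p in zip(y_true, y_pred) if (t in blocked) != (p in blocked)
    )
    return acc, crossed / n
```

Tracking the boundary-crossing rate separately from accuracy matters here: confusing two safe bins with each other is cheap, while serving an ad into a blocked chat is not.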
In comparison against LLMs on the held-out test split of verified data, our DistilBERT model proves superior in aggregate and class-wise comparisons. API costs are calculated from Groq’s public rates on 8 Nov 2025.
DistilBERT Results vs LLMs (N = 2224)
=====================================================================
Model                    Accuracy   Cross-Cat Err   Banned→Safe   Cost
=====================================================================
PyTorch (safetensors)    83.77%     5.17%           67            $ 0.0000
Llama 3.1 8B (Groq)      40.65%     14.43%          289           $ 0.1677
GPT OSS 20B (Groq)       35.03%     17.67%          387           $ 0.2535
GPT OSS 120B (Groq)      60.52%     10.79%          231           $ 0.5071
=====================================================================
thrad-distilbert-conversation-classifier dominates LLMs more than 200× its size in aggregate and class-wise accuracy as well as cross-category error rate.
BERT Results vs LLMs (N = 2224)
=====================================================================
Model                    Accuracy   Cross-Cat Err   Banned→Safe   Cost
=====================================================================
PyTorch (safetensors)    74.37%     7.73%           87            $ 0.0000
Llama 3.1 8B (Groq)      41.91%     14.39%          282           $ 0.1677
GPT OSS 20B (Groq)       31.21%     17.67%          385           $ 0.2535
GPT OSS 120B (Groq)      58.81%     11.02%          232           $ 0.5071
=====================================================================

thrad-bert-conversation-classifier also outperformed the LLMs, but underperformed our DistilBERT model.
In the interest of transparency, we have open-sourced the model weights with safetensors here. Additionally, we have moved the chat preprocessing, model training, and evaluation code into a public repo for community reference. We hope these contributions help lay a strong, transparent foundation for the ads-in-AI space to build on to protect partnering brands and AI apps.
We will keep revisiting this system to protect partner platforms and brands, and to improve our service offerings, as the demographics of our partners shift and usage trends evolve. To contribute to this project, create a new issue -> here, or otherwise reach out to us.
Public Repo -> Here
Scott Biggs & Marco Visentin
