




LLM Red-Team harness
Find where your model breaks.
Links
Additional info
How was your experience building with Codex?
The experience was unforgettable. I am a regular user of CODEX, use it for vibe-coding. With so many credits it was so easy to build. Lightning fast to be precise.
Describe your experience using Loops House as the hackathon platform. What worked well, what challenges (if any) did you face, and what improvements would you like to see?
I haven't seen many platforms like these. This platform is so good, very easy to use software. Just loved it
Tell us about your overall experience at Codex Community Hackathon Pune.
Unbelievable, the way this event was managed was exceptional. The volunteers were extremely helpful and the energy inside was great.
What could Codex Community improve to create a better experience for participants?
It was a perfect event, nothing to complaint about to be honest.
Team
3 members- ADOwner
Aditya C
- AR
Arnav Gupta
Team Member - IS
Ishan Gite
Team Member
Overview
An interactive red-team harness for testing how an LLM responds to risky prompts, adversarial variants, and prompt-injection attempts. The app analyzes a prompt, generates attack variants, scores model behavior, suggests defensive prompt hardening, and can produce a safety-style report for the tested session.
The project is built with a FastAPI backend and a Next.js frontend. It uses OpenAI models for classification, attack generation, target responses, judging, defense synthesis, and embeddings. If no valid API key is configured, the backend falls back to a local mock mode so the app can still be opened and tested without crashing.
Analyze -> Attack -> Score -> Defend -> Certify It helps teams discover where a model breaks, understand the strongest failure mode, and generate a concrete mitigation path.
Features -Prompt risk analysis with category, intent, and risk score -Adversarial attack generation from the analyzed prompt -Multi-signal scoring across judge score, refusal quality, keyword signals, embeddings, and heuristic evaluation -Prompt-injection firewall checks -Iterative red-team mode with streaming progress events -Multi-turn attack simulation -Defense synthesis that proposes a hardened system prompt -Session history backed by local SQLite/libSQL or Turso -Safety certificate view with grade, recommendations, and OWASP-style breakdown -Modern Next.js UI with a landing page and an app workspace at /harness
Workflow
- Analyze
Enter a user prompt. The backend classifies the prompt's risk, intent, and category. Examples of categories include system intrusion, harmful content, credential theft, information disclosure, prompt injection, and other safety-relevant classes.
- Attack
The harness turns the analyzed prompt into adversarial variants. These variants preserve the original goal while probing different routes such as indirect phrasing, authority framing, roleplay, obfuscation, context stuffing, and other jailbreak-style techniques.
- Score
Each attack is sent through the scoring pipeline. The backend collects the model response, judges whether the attack succeeded, computes a composite score, assigns severity, and returns the strongest attack.
- Defend
For a high-risk or successful attack, the defense synthesis step proposes a hardened system prompt and re-tests the model to estimate whether the defense improved safety.
- Review
Results can be reviewed in the UI through summaries, detailed attack cards, history, firewall checks, and certificate/report views.