LLM Red-Team harness

Find where your model breaks.

Event

Codex Community Hackathon - Pune

Links

Repository

github.com/inslot2525-ctrl/LLM--HARNESS

Website

llm-harness-llg61hi42-ishan-gite-s-projects.vercel.app

Demo video

drive.google.com/drive/u/1/folders/1ulH7QFQnLLA_DqHwHLflcVEnnbEMKsU0

Team

3 members

AD
Aditya C
Owner
AR
Arnav Gupta
Team Member
IS
Ishan Gite
Team Member

Overview

An interactive red-team harness for testing how an LLM responds to risky prompts, adversarial variants, and prompt-injection attempts. The app analyzes a prompt, generates attack variants, scores model behavior, suggests defensive prompt hardening, and can produce a safety-style report for the tested session.

The project is built with a FastAPI backend and a Next.js frontend. It uses OpenAI models for classification, attack generation, target responses, judging, defense synthesis, and embeddings. If no valid API key is configured, the backend falls back to a local mock mode so the app can still be opened and tested without crashing.

Analyze -> Attack -> Score -> Defend -> Certify It helps teams discover where a model breaks, understand the strongest failure mode, and generate a concrete mitigation path.

Features -Prompt risk analysis with category, intent, and risk score -Adversarial attack generation from the analyzed prompt -Multi-signal scoring across judge score, refusal quality, keyword signals, embeddings, and heuristic evaluation -Prompt-injection firewall checks -Iterative red-team mode with streaming progress events -Multi-turn attack simulation -Defense synthesis that proposes a hardened system prompt -Session history backed by local SQLite/libSQL or Turso -Safety certificate view with grade, recommendations, and OWASP-style breakdown -Modern Next.js UI with a landing page and an app workspace at /harness

Workflow

Analyze

Enter a user prompt. The backend classifies the prompt's risk, intent, and category. Examples of categories include system intrusion, harmful content, credential theft, information disclosure, prompt injection, and other safety-relevant classes.

Attack

The harness turns the analyzed prompt into adversarial variants. These variants preserve the original goal while probing different routes such as indirect phrasing, authority framing, roleplay, obfuscation, context stuffing, and other jailbreak-style techniques.

Score

Each attack is sent through the scoring pipeline. The backend collects the model response, judges whether the attack succeeded, computes a composite score, assigns severity, and returns the strongest attack.

Defend

For a high-risk or successful attack, the defense synthesis step proposes a hardened system prompt and re-tests the model to estimate whether the defense improved safety.

Review

Results can be reviewed in the UI through summaries, detailed attack cards, history, firewall checks, and certificate/report views.