Back
Explore EventsExplore ProjectsMy Projects
LLM Red-Team harness — screenshot 1
LLM Red-Team harness — screenshot 2
LLM Red-Team harness — screenshot 3
LLM Red-Team harness — screenshot 4

LLM Red-Team harness

Find where your model breaks.

Codex Community Hackathon - Pune

Links

Repository

github.com/inslot2525-ctrl/LLM--HARNESS

Website

llm-harness-llg61hi42-ishan-gite-s-projects.vercel.app

Demo video

drive.google.com/drive/u/1/folders/1ulH7QFQnLLA_DqHwHLflcVEnnbEMKsU0

Additional info

How was your experience building with Codex?

The experience was unforgettable. I am a regular user of CODEX, use it for vibe-coding. With so many credits it was so easy to build. Lightning fast to be precise.

Describe your experience using Loops House as the hackathon platform. What worked well, what challenges (if any) did you face, and what improvements would you like to see?

I haven't seen many platforms like these. This platform is so good, very easy to use software. Just loved it

Tell us about your overall experience at Codex Community Hackathon Pune.

Unbelievable, the way this event was managed was exceptional. The volunteers were extremely helpful and the energy inside was great.

What could Codex Community improve to create a better experience for participants?

It was a perfect event, nothing to complaint about to be honest.

Team

3 members
  • AD

    Aditya C

    Owner
  • AR

    Arnav Gupta

    Team Member
  • IS

    Ishan Gite

    Team Member

Overview

An interactive red-team harness for testing how an LLM responds to risky prompts, adversarial variants, and prompt-injection attempts. The app analyzes a prompt, generates attack variants, scores model behavior, suggests defensive prompt hardening, and can produce a safety-style report for the tested session.

The project is built with a FastAPI backend and a Next.js frontend. It uses OpenAI models for classification, attack generation, target responses, judging, defense synthesis, and embeddings. If no valid API key is configured, the backend falls back to a local mock mode so the app can still be opened and tested without crashing.

Analyze -> Attack -> Score -> Defend -> Certify It helps teams discover where a model breaks, understand the strongest failure mode, and generate a concrete mitigation path.

Features -Prompt risk analysis with category, intent, and risk score -Adversarial attack generation from the analyzed prompt -Multi-signal scoring across judge score, refusal quality, keyword signals, embeddings, and heuristic evaluation -Prompt-injection firewall checks -Iterative red-team mode with streaming progress events -Multi-turn attack simulation -Defense synthesis that proposes a hardened system prompt -Session history backed by local SQLite/libSQL or Turso -Safety certificate view with grade, recommendations, and OWASP-style breakdown -Modern Next.js UI with a landing page and an app workspace at /harness

Workflow

  1. Analyze

Enter a user prompt. The backend classifies the prompt's risk, intent, and category. Examples of categories include system intrusion, harmful content, credential theft, information disclosure, prompt injection, and other safety-relevant classes.

  1. Attack

The harness turns the analyzed prompt into adversarial variants. These variants preserve the original goal while probing different routes such as indirect phrasing, authority framing, roleplay, obfuscation, context stuffing, and other jailbreak-style techniques.

  1. Score

Each attack is sent through the scoring pipeline. The backend collects the model response, judges whether the attack succeeded, computes a composite score, assigns severity, and returns the strongest attack.

  1. Defend

For a high-risk or successful attack, the defense synthesis step proposes a hardened system prompt and re-tests the model to estimate whether the defense improved safety.

  1. Review

Results can be reviewed in the UI through summaries, detailed attack cards, history, firewall checks, and certificate/report views.

ExploreProjectsMine