Repository Overview
Purpose
A self-hosted AI coding agent purpose-built to automatically generate JUnit 5 + Mockito unit tests for large Java repositories that follow DDD architecture. It solves a real-world problem: manually writing tests for hundreds of services, repositories, and domain objects is time-consuming, error-prone, and inconsistent.
Key Capabilities
- Structural Intelligence — Graph-based AST analysis, knows exactly which classes need to be mocked
- RAG Hybrid Context — Vector search + dependency graph traversal in parallel (ThreadPoolExecutor)
- Token Budget — Priority-based snippet selection P1→P5; TokenOptimizer enforces a 6000-token budget
- Plan-driven StateMachine — ExecutionPlan with 8 StepActions, driven by AgentState machine
- 7-Pass Validation — Severity-aware (ERROR/WARNING/INFO), RAG-aware construction check (Pass 7)
- Auto-repair Loop — Self-corrects validation errors up to 2 times with RepairStrategySelector
- Streaming API — Token-by-token SSE with 6 phase events (PLANNING→DONE)
- EventBus + Metrics — Publishes events at every step, MetricsCollector, structlog
- OpenAI-compatible API — /v1/chat/completions for Tabby IDE
The intelligence/ layer provides ground truth from the AST graph instead of letting the LLM guess dependencies.
Request → FastAPI → AgentOrchestrator
→ Planner.plan_test_generation() → ExecutionPlan {8 steps}
→ StateMachine: IDLE→PLANNING→RETRIEVING→GENERATING→VALIDATING→COMPLETED
→ ContextBuilder (context/)
→ DependencyAnalyzer (intelligence/) ← exact mocks from AST graph
→ RAGClient (rag/) + ThreadPoolExecutor parallel fetch deps
→ SnippetSelector (P1→P5 priority tiers)
→ TokenOptimizer (budget 6000 tokens)
→ PromptBuilder → vLLM / Qwen2.5-Coder-7B
→ ValidationPipeline (7 passes, ERROR/WARNING/INFO)
→ RepairStrategySelector → repair loop (max 2x, plan_repair())
→ EventBus.publish() + MetricsCollector
→ GenerationResult {test_code, validation_summary, plan_summary, repair_attempts}
Architecture Guide — What Each Subsystem Does
The system is divided into 7 clearly-defined functional layers. Each layer has a single responsibility and does not encroach on the responsibilities of the others.
intelligence/ — FileGraph (import edges) + SymbolMap (field/method table) feed DependencyAnalyzer, which merges both graphs and returns a TestContext. 4 files, 4 distinct responsibilities:
- RepoScanner walks the .java files, parses each one, and creates a RepoSnapshot with O(1) lookups by class name / FQN / file path. Scan only — no analysis.
- DependencyAnalyzer takes a class_name, merges FileGraph + SymbolMap, and returns TestContext {mocks, domain_types, layer}. This is the only answer the rest of the system needs. Its outputs are TestContext & ImpactReport.
- The parser extracts from each .java file: class_name, FQN, methods list, fields, dependencies (FQN list), used_types, has_builder, java_type (record/class/interface), record_components, DDD layer. This metadata is critical — Pass 7 validation relies on it.

indexer/ — Runs only on demand (POST /reindex; recreate=True to wipe and rebuild) and does not participate in the real-time generation flow. Output: a fully populated Qdrant collection of CodeChunk objects containing both source code and metadata.

rag/ — Searches by class_name + semantic similarity. Returns source + dependencies list from the payload, and exposes the dep_simple_names list from the main chunk's payload; context/ will then fetch each dep in parallel.

RAG vs intelligence — RAG finds code "semantically close to X" (may miss some). intelligence/ finds "exactly which classes X depends on" (AST, 100%). The system uses both: intelligence/ for the precise mock list, RAG for the source code of those classes.

context/ — Assembles a ContextResult using Snippet priority logic. Mock candidates come from TestContext.mocks (intelligence/); the LLM must understand their interface to mock correctly. Output: ContextResult.

agent/ — Runs an ExecutionPlan (list of steps). StateMachine tracks state. A while loop runs each PlanStep. ValidationPipeline checks. RepairStrategySelector appends repair steps if validation fails. agent/ has 10 modules — one responsibility each:
| Module | Responsibility | Input | Output |
|---|---|---|---|
| orchestrator.py | Full coordination. Sole entry point from server. | GenerationRequest | GenerationResult |
| planner.py | Decides what to do (plan steps). Separated from execution. | request metadata | ExecutionPlan |
| plan.py | Data model: PlanStep, ExecutionPlan, StepAction enum. No logic. | — | Data structures |
| state_machine.py | Tracks current state. Prevents invalid transitions. | transition event | AgentState |
| validation.py | Checks 7 quality criteria. Classifies issues by severity. | Java test code string | ValidationResult |
| repair.py | Decides how to fix based on the type of validation error. | list[ValidationIssue] | RepairPlan |
| prompt.py | Builds messages[] sent to LLM. Controls format. | context + rules + history | messages list |
| memory.py | Stores conversation history + generated tests per session. | session events | SessionMemory |
| events.py | Pub/sub bus. Publishes events at each important step. | Event objects | SSE stream (downstream) |
| metrics.py | Collects statistics: tokens, chunks, validation pass rate. | Events | MetricsReport |
vllm/ — Forwards /v1/chat/completions to vLLM on :8000 (OpenAI format). vLLM uses PagedAttention for efficient serving. AWQ = 4-bit quantization, reduces VRAM ~4x. Request timeout comes from .env (REQUEST_TIMEOUT seconds).

server/ — Parses each request into a GenerationRequest and calls the orchestrator in a thread pool. SessionManager manages the session lifecycle: creates a UUID, expires sessions after 1h, and runs a background cleanup task every 5 minutes.

Summary: Who does what?
indexer/ — “The library”: converts code into searchable vectors.
rag/ — “The librarian”: given a query, returns the exact relevant pages.
context/ — “The editor”: curates from a large body of content, keeping the most important within the page limit.
agent/ — “Brain / conductor”: plans, coordinates, quality-controls, self-repairs when wrong.
vllm/ — “The pen”: given full instructions, writes the actual code.
server/ — “The receptionist”: accepts requests from outside, routes to the right person, returns results.
Project Structure UPDATED
Full Dependency Map
server/api.py
→ agent/orchestrator.py
→ agent/planner.py → agent/plan.py (ExecutionPlan, PlanStep, StepAction)
→ agent/state_machine.py (AgentState, StateMachine, TransitionError)
→ context/context_builder.py [optional — graceful degradation]
→ intelligence/dependency_analyzer.py
→ intelligence/repo_scanner.py (RepoSnapshot, O(1) lookup)
→ intelligence/file_graph.py (FileGraph, transitive closure)
→ intelligence/symbol_map.py (SymbolMap, global symbol table)
→ rag/client.py
→ context/snippet_selector.py (5 priority tiers)
→ context/token_optimizer.py (budget-aware truncation)
→ agent/prompt.py (PromptBuilder)
→ vllm/client.py (VLLMClient)
→ agent/validation.py (ValidationPipeline — 7 passes)
→ agent/repair.py (RepairStrategySelector, RepairPlan)
→ agent/memory.py (MemoryManager, SessionMemory)
→ agent/events.py (EventBus, Event, EventType)
→ agent/metrics.py (MetricsCollector)
Execution Flow UPDATED
A — Indexing Flow (runs once / reindex)
Output goes into the java_codebase Qdrant collection. Rich metadata enables intelligence/ and Pass 7 to operate correctly.
B — Test Generation Flow (per request)
Agent Loop UPDATED
StateMachine States
IDLE
  | transition_to(PLANNING)
  v
PLANNING   ← Planner.plan_test_generation() → ExecutionPlan
  | transition_to(RETRIEVING)
  v
RETRIEVING ← ContextBuilder.build_context() or _get_rag_context()
  | transition_to(GENERATING)
  v
GENERATING ← vllm.generate() / stream_generate() ←―――――――――――――――――――――+
  | transition_to(VALIDATING)                                           |
  v                                                                     |
VALIDATING ← ValidationPipeline.validate(code, rag_chunks)              |
  | passed → transition_to(COMPLETED)                                   |
  | failed → _ValidationFailed → transition_to(REPAIRING)               |
  v                                                                     |
REPAIRING  ← RepairStrategySelector.build_repair_plan()                 |
  | planner.plan_repair() appends REPAIR_CODE + GENERATE_CODE steps     |
  +―――――――――――――――――――――― continue loop ――――――――――――――――――――――――――――――+
  | current_repair_attempt >= max_repair_attempts → COMPLETED (with issues)
  v
COMPLETED
On repair, the orchestrator calls planner.plan_repair() to append new steps to the running ExecutionPlan, then continues the while loop. No code duplication between the generate path and the repair path.
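The transition discipline described above can be sketched as a minimal state machine. The names AgentState, StateMachine, and TransitionError come from the dependency map earlier in this document; the allowed-transition table itself is an assumption inferred from the diagram, not the real implementation.

```python
from enum import Enum, auto

class AgentState(Enum):
    IDLE = auto()
    PLANNING = auto()
    RETRIEVING = auto()
    GENERATING = auto()
    VALIDATING = auto()
    REPAIRING = auto()
    COMPLETED = auto()

class TransitionError(Exception):
    pass

# Hypothetical transition table: REPAIRING loops back to GENERATING;
# VALIDATING can either complete or fall into REPAIRING.
_ALLOWED = {
    AgentState.IDLE: {AgentState.PLANNING},
    AgentState.PLANNING: {AgentState.RETRIEVING},
    AgentState.RETRIEVING: {AgentState.GENERATING},
    AgentState.GENERATING: {AgentState.VALIDATING},
    AgentState.VALIDATING: {AgentState.COMPLETED, AgentState.REPAIRING},
    AgentState.REPAIRING: {AgentState.GENERATING, AgentState.COMPLETED},
    AgentState.COMPLETED: set(),
}

class StateMachine:
    def __init__(self) -> None:
        self.state = AgentState.IDLE

    def transition_to(self, target: AgentState) -> None:
        # Reject any transition not whitelisted for the current state
        if target not in _ALLOWED[self.state]:
            raise TransitionError(f"{self.state.name} -> {target.name} is invalid")
        self.state = target
```

The point of the whitelist is that a bug in the orchestrator (e.g. validating before generating) fails loudly with TransitionError instead of silently producing a half-built result.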
Plan-driven Execution Engine — Core Loop
plan = self.planner.plan_test_generation(...)
while True:
    step = plan.get_next_pending_step()
    if step is None:
        break  # all steps done
    step.start()
    self.event_bus.publish(Event(type=STEP_STARTED, ...))
    try:
        self._execute_step(sm, plan, step, ctx)
        step.complete()
    except _ValidationFailed as vf:
        if plan.can_repair:
            sm.transition_to(AgentState.REPAIRING, ...)
            self.planner.plan_repair(plan, vf.issues, ctx["extracted_code"])
            sm.transition_to(AgentState.GENERATING, repair=True)
            continue  # continue loop with newly appended repair steps
        else:
            ctx["validation_passed"] = False  # accept with issues
Streaming Events
yield StreamEvent(phase=PLANNING, content="📋 Planning...")
yield StreamEvent(phase=RETRIEVING, content="🔍 Searching...", metadata={chunks_count})
yield StreamEvent(phase=GENERATING, content=token, delta=True) # token-by-token
yield StreamEvent(phase=VALIDATING, content="🔎 Validating...", metadata={passed, errors})
yield StreamEvent(phase=REPAIRING, content="🔧 Repair 1/2...")
yield StreamEvent(phase=DONE, metadata={tokens_used, rag_chunks_used, ...})
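On the client side, those phase events arrive as an SSE stream. Below is a minimal sketch of consuming it: a `data:`-line parser plus a helper that reassembles the generated code from GENERATING deltas. The field names (phase, content, delta) mirror the StreamEvent shape above, but the exact wire framing (one JSON object per `data:` line, OpenAI-style `[DONE]` terminator) is an assumption.

```python
import json
from typing import Iterable, Iterator

def parse_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one JSON payload per SSE 'data:' line, skipping keep-alives.

    Assumes each event is serialized as a single `data: {...}` line —
    a common SSE framing; the agent's actual format may differ.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # comments, blank keep-alive lines, etc.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # OpenAI-style stream terminator
            return
        yield json.loads(payload)

def collect_code(lines: Iterable[str]) -> str:
    """Reassemble generated code from token-by-token GENERATING deltas."""
    return "".join(
        ev.get("content", "")
        for ev in parse_sse_events(lines)
        if ev.get("phase") == "GENERATING" and ev.get("delta")
    )
```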
Model Integration
| Model | Size | Role | Access |
|---|---|---|---|
| Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Test generation + repair | vLLM OpenAI-compat :8000 |
| sentence-transformers/all-MiniLM-L6-v2 | 22M | Embedding for RAG indexing + search | In-process SentenceTransformers |
Prompt Construction
[SYSTEM]
You are a Java test engineer.
{rules: JUnit5, Mockito, AAA pattern, no Spring context, no @SpringBootTest, ...}
{unfound_types: "These types not in index — mock them without source: [X, Y]"}
[CONTEXT — from ContextBuilder, priority-ordered, token-optimized]
// P1 — AuthUseCaseService.java (target, 100% kept)
{source code}
// P2 — OpenAPIRepository.java (mockable dep — from intelligence/)
{source code}
// P3 — UserProfile.java (domain type — record, no @Builder)
{source code, truncated if over budget}
...
[HISTORY — if refinement / session has history]
User: "Generate tests for AuthUseCaseService"
Assistant: {previous test code}
User: "Add null input tests"
[USER]
{task_description}. File: {file_path}
Response Parsing — _extract_code()
# Level 1: ```java ... ``` code block (preferred)
pattern = r"```(?:java)?\s*\n(.*?)```"
matches = re.findall(pattern, response, re.DOTALL)
if matches: return max(matches, key=len).strip()
# Level 2: detect class declaration directly
pattern = r"((?:import.*?\n)*\s*(?:@\w+.*?\n)*\s*(?:public\s+)?class\s+\w+.*?\{.*\})"
class_match = re.search(pattern, response, re.DOTALL)
if class_match: return class_match.group(1).strip()
return response.strip() # fallback
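The two-level extraction can be exercised directly. Below is a runnable wrapper around the exact regexes shown above; the function name is illustrative (the real method is `_extract_code()` on the orchestrator).

```python
import re

def extract_code(response: str) -> str:
    """Two-level extraction mirroring _extract_code() described above."""
    # Level 1: fenced ```java block — take the longest match
    matches = re.findall(r"```(?:java)?\s*\n(.*?)```", response, re.DOTALL)
    if matches:
        return max(matches, key=len).strip()
    # Level 2: bare class declaration (imports + annotations + class body)
    m = re.search(
        r"((?:import.*?\n)*\s*(?:@\w+.*?\n)*\s*(?:public\s+)?class\s+\w+.*?\{.*\})",
        response,
        re.DOTALL,
    )
    if m:
        return m.group(1).strip()
    return response.strip()  # fallback: return the whole response
```

Taking the longest fenced match at Level 1 matters when the model emits a short explanatory snippet before the full test class.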
Tool System UPDATED
No dynamic tool-calling via LLM. Tools are fixed step executors dispatched by the orchestrator according to the StepAction enum from ExecutionPlan:
| StepAction | Executor | Description |
|---|---|---|
| EXTRACT_CLASS_INFO | _step_extract_class_info() | Extract class name from file path |
| RETRIEVE_CONTEXT | _step_retrieve_context() | ContextBuilder or RAG fallback, parallel fetch |
| BUILD_PROMPT | _step_build_prompt() | PromptBuilder with session history |
| GENERATE_CODE | _step_generate_code() | vllm.generate(), check response.success |
| EXTRACT_CODE | _step_extract_code() | Regex 2-level extraction |
| VALIDATE_CODE | _step_validate_code() | ValidationPipeline 7 passes, raise _ValidationFailed |
| RECORD_SESSION | _step_record_session() | session.record_generated_test() |
| REPAIR_CODE | _step_repair_code() | RepairStrategySelector → rebuild repair prompt |
Fallback Hierarchy — 3-tier degradation
# Tier 1: Full ContextBuilder (intelligence + RAG + priority + token budget)
if self.context_builder:
    context_result = self.context_builder.build_context(...)
# Tier 2: RAG-only with graph traversal (parallel fetch)
else:
    rag_chunks = self._get_rag_context(class_name, file_path, session)

# Tier 3: Inline source parsing (regex from Java source sent with request)
if not types_to_fetch and inline_source:
    fallback_types = self._extract_types_from_source(inline_source, class_name)
Memory System
| Type | Storage | Scope | Persist? |
|---|---|---|---|
| Session conversation | Python dict — MemoryManager in-process | Per session UUID | ❌ Lost on restart |
| RAG context cache | Session-level dict (key: "class:file") | Per session | ❌ In-memory |
| generated_tests list | SessionMemory.generated_tests[] | Per session | ❌ In-memory |
| Code index (RAG) | Qdrant (disk-backed) | Entire Java repo | ✅ Yes |
RAG context is cached per-session by key "ClassName:file_path" so refinement requests don’t need to re-query Qdrant. Session memory also stores a generated_tests list — each entry has class_name, test_code, and a success flag.
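The per-session cache and generated-tests log described above can be sketched as a small in-process structure. The attribute names mirror the table (rag_cache is a name assumed here; the document only specifies the "ClassName:file_path" key format).

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class SessionMemory:
    """Minimal sketch of the per-session memory described above."""
    rag_cache: dict[str, Any] = field(default_factory=dict)
    generated_tests: list[dict] = field(default_factory=list)

    def cached_rag_context(self, class_name: str, file_path: str,
                           fetch: Callable[[], Any]) -> Any:
        key = f"{class_name}:{file_path}"  # cache key format from the doc
        if key not in self.rag_cache:
            self.rag_cache[key] = fetch()  # only hit Qdrant on a cache miss
        return self.rag_cache[key]

    def record_generated_test(self, class_name: str, test_code: str,
                              success: bool) -> None:
        self.generated_tests.append(
            {"class_name": class_name, "test_code": test_code, "success": success}
        )
```

A refinement request for the same class/file pair is then served from the dict, which is why refinements do not re-query Qdrant.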
Data Flow UPDATED
Indexing Data Flow
Java Repo (.java files)
→ tree-sitter parse_java.py
→ ClassInfo {class_name, FQN, methods, fields,
dependencies (FQN list), used_types (simple names),
has_builder, java_type, record_components, DDD layer}
→ summarize.py
→ MiniLM-L6-v2 → 384-dim vector
→ Qdrant upsert: vector + full payload (rich metadata for Pass 7 + intelligence/)
Generation Data Flow (full detail)
HTTP {file_path, class_name, task, session_id}
→ server/api.py → AgentOrchestrator.generate_test()
→ Planner.plan_test_generation() → ExecutionPlan
→ _execute_plan() while loop:
[EXTRACT_CLASS_INFO] file_path → class_name
[RETRIEVE_CONTEXT]
ContextBuilder.build_context(class_name, file_path, max_tokens=6000)
→ DependencyAnalyzer.test_context_for(class_name)
→ TestContext {mocks: [...], domain_types: [...], layer: "service"}
→ rag.search_by_class(class_name, top_k=1) ← main chunk
→ extract dep_simple_names from main_chunk.dependencies (FQN)
→ extract used_types from main_chunk.used_types
→ types_to_fetch = dep_simple_names | used_types
→ ThreadPoolExecutor(max_workers=5):
parallel rag.search_by_class(dep) for dep in types_to_fetch
→ unfound_types → main_chunk.unfound_types
→ SnippetSelector: P1(target)→P2(mocks)→P3(domain)→P4(iface)→P5(trans)
→ TokenOptimizer: P1 keep 100%, P5 drop first if over budget
→ ContextResult {snippets, rag_chunks, token_count, mock_types}
EventBus.publish(CONTEXT_RETRIEVED)
[BUILD_PROMPT]
PromptBuilder.build_test_generation_prompt(context_result, session)
[GENERATE_CODE]
vllm.generate(system, user) → GenerationResponse
EventBus.publish(STEP_COMPLETED)
[EXTRACT_CODE] _extract_code(response) → Java string
[VALIDATE_CODE]
ValidationPipeline.validate(code, rag_chunks) ← 7 passes
→ ValidationResult {errors, warnings, infos}
→ passed? → RECORD_SESSION
→ failed? → _ValidationFailed(issues, validation_result)
→ plan.can_repair? → plan_repair() → loop
→ else → accept with issues
EventBus.publish(VALIDATION_COMPLETED)
[RECORD_SESSION]
session.add_assistant_message(code)
session.record_generated_test(class_name, code, success)
→ StateMachine.transition_to(COMPLETED)
→ EventBus.publish(GENERATION_COMPLETED)
→ GenerationResult
intelligence/ — Structural Graph Intelligence NEW
intelligence/ returns “exactly what X needs to mock” by traversing the AST graph — no guessing, no approximation.
4 Components
| File | Class | Function |
|---|---|---|
| repo_scanner.py | RepoScanner | Scans repo with JavaParser → RepoSnapshot with O(1) lookup by name / FQN / file path |
| file_graph.py | FileGraph | Directed graph based on import relationships. Finds dependencies, dependents, transitive closures |
| symbol_map.py | SymbolMap | Global symbol table: class→methods/fields, method→classes, field_type→injectors, annotation→classes |
| dependency_analyzer.py | DependencyAnalyzer | Merges FileGraph + SymbolMap → TestContext and ImpactReport |
Key Queries
analyzer = DependencyAnalyzer(repo_scanner, file_graph, symbol_map)
# "What does AuthUseCaseService need to mock?"
ctx = analyzer.test_context_for("AuthUseCaseService")
# ctx.mocks → ["OpenAPIRepository", "UserQueryService", ...] ← 100% accurate
# ctx.domain_types → ["UserProfile", "JwtToken", ...]
# ctx.layer → "service"
# "If UserProfile changes, what is affected?"
report = analyzer.impact_of("UserProfile")
# report.direct_dependents → ["AuthUseCaseService", "UserUseCase"]
# report.transitive_dependents → [...]
RAG vs Intelligence — Direct Comparison
| RAG vector search | intelligence/ graph | |
|---|---|---|
| Finds | Semantically similar code chunks | Exact dependencies from AST |
| Answer | “Code related to X” | “X needs to mock A, B, C” |
| Mechanism | Cosine similarity | Graph traversal |
| Mock accuracy | Estimated, may miss some | 100%, complete |
| Compile result | May miss mocks → NullPointer | Full mock list → compiles immediately |
context/ — Smart Context Assembly NEW
context/ solves the problem: token budget is limited (6000 tokens) — how do you select the most important content rather than dumping everything in?
4-step Pipeline
ContextBuilder.build_context("AuthUseCaseService", file_path, max_tokens=6000)
|
+-- [1] Intelligence Layer (optional — graceful degradation)
| DependencyAnalyzer.test_context_for("AuthUseCaseService")
| → TestContext {mocks: [...], domain_types: [...], layer: "service"}
|
+-- [2] RAG Search
| rag.search_by_class(include_dependencies=True)
| + parallel ThreadPoolExecutor fetch per dep (from TestContext.mocks)
| → List[CodeChunk]
|
+-- [3] SnippetSelector — Priority tiers
| P1 ████████ target source AuthUseCaseService.java (keep 100%)
| P2 ██████ mockable deps OpenAPIRepository, ... (keep 100%)
| P3 ████ domain types UserProfile, JwtToken (truncate)
| P4 ██ interfaces IUserRepository (hard truncate)
| P5 █ transitive deps indirect dependencies (drop first)
|
+-- [4] TokenOptimizer — Budget-aware (~4 chars/token)
→ ContextResult {
snippets: list[Snippet] (priority-ordered)
rag_chunks: list[CodeChunk]
token_count: int
mock_types: list[str]
intelligence_available: bool
elapsed_ms: float
}
ContextBuilder(rag_client, intelligence=None) — if intelligence/ is unavailable, the system still runs in RAG-only mode. No crash, only reduced mock accuracy.
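The budget step can be sketched with the ~4 chars/token heuristic and the P1→P5 drop order described above. This is a simplified sketch — the real TokenOptimizer also truncates P3/P4 snippets rather than only dropping whole ones, and the class and function names here are assumptions.

```python
from dataclasses import dataclass

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # the ~4 chars/token heuristic used throughout

@dataclass
class Snippet:
    name: str
    priority: int  # 1 (target, always kept) .. 5 (transitive, dropped first)
    source: str

def fit_to_budget(snippets: list[Snippet], max_tokens: int = 6000) -> list[Snippet]:
    """Keep snippets in priority order; lowest-priority ones fall out first."""
    kept, used = [], 0
    for s in sorted(snippets, key=lambda s: s.priority):
        cost = estimate_tokens(s.source)
        if s.priority == 1 or used + cost <= max_tokens:
            kept.append(s)  # P1 is always kept at 100%
            used += cost
    return kept
```

Because selection iterates from P1 upward, an over-budget context sheds P5 transitive deps first and never sacrifices the target class source.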
agent/orchestrator.py — Detail NEW
4 Development phases within the same file
| Phase | Added | Components |
|---|---|---|
| Phase 1 | StateMachine + Planner | StateMachine, Planner, ExecutionPlan, PlanStep |
| Phase 2 | Optional ContextBuilder | ContextBuilder, 3-tier graceful degradation |
| Phase 3 | Validation + Repair | ValidationPipeline, RepairStrategySelector, _ValidationFailed |
| Phase 4 | EventBus + Metrics | EventBus, MetricsCollector, structlog structured logging |
Shared Context Dict (ctx)
ctx = {
"class_name": str,
"file_path": str,
"session": Optional[SessionMemory],
"rag_chunks": list[CodeChunk],
"context_result": Optional[ContextResult],
"system_prompt": str,
"user_prompt": str,
"full_response": str,
"extracted_code": str,
"validation_result": Optional[ValidationResult],
"validation_passed": bool,
"validation_issues": list[str],
"tokens_used": int, # len(full_response) // 4 (estimate)
"repair_plan": Optional[RepairPlan],
}
agent/validation.py — ValidationPipeline 7 Passes NEW
passed = not any(i.severity == ERROR for i in issues)
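That one-liner is the whole pass/fail rule: only ERROR-severity issues fail a run. A minimal sketch with an assumed ValidationIssue shape (the real class may carry more fields, e.g. the pass number):

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    ERROR = "error"      # fails validation, triggers the repair loop
    WARNING = "warning"  # reported, but the test is still accepted
    INFO = "info"        # advisory only

@dataclass
class ValidationIssue:
    severity: Severity
    message: str

def validation_passed(issues: list[ValidationIssue]) -> bool:
    # Same rule as the one-liner above: only ERROR severity fails the run
    return not any(i.severity is Severity.ERROR for i in issues)
```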
Pass 6 — 8 Common Anti-patterns
Pass 7 — _extract_builder_chain_fields() Mini-parser
# Problem: a simple regex would confuse nested calls with fields
#   UserProfile.builder()
#       .id(UUID.fromString("abc"))   ← UUID.fromString is nested, NOT a field
#       .name("test")                 ← name IS a real field
# Solution: track parenthesis depth
depth = 0
i = 0
while i < length:
    c = chain[i]
    if c == '(':
        depth += 1
    elif c == ')':
        depth -= 1
    elif c == '.' and depth == 0:  # only match at depth == 0
        name = match_next_method()  # reads the identifier after the dot
        if name not in ("build", "toBuilder"):
            fields.append(name)  # ← this is a real field
    i += 1
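The depth-tracking idea can be made fully runnable. In this sketch the function name and the exact exclusion list are assumptions ("builder" is excluded too, since the chain text includes the leading `.builder()` call), and string literals containing parentheses are not handled — the real mini-parser may differ.

```python
import re

def extract_builder_chain_fields(chain: str) -> list[str]:
    """Depth-tracking sketch of the Pass 7 mini-parser described above.

    Only method names at parenthesis depth 0 are builder fields; anything
    inside an argument list (like UUID.fromString) is nested and skipped.
    """
    fields = []
    depth = 0
    i = 0
    while i < len(chain):
        c = chain[i]
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
        elif c == "." and depth == 0:
            m = re.match(r"\.(\w+)", chain[i:])  # identifier after the dot
            if m and m.group(1) not in ("builder", "build", "toBuilder"):
                fields.append(m.group(1))  # a real builder field
        i += 1
    return fields
```

Pass 7 can then compare the returned field names against the record_components / fields metadata stored at indexing time and flag any field the builder never defined.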
agent/plan.py — ExecutionPlan NEW
@dataclass
class PlanStep:
    step_id: int          # 1, 2, 3, ...
    action: StepAction    # enum — no magic strings
    description: str
    params: dict          # input for the executor
    status: StepStatus    # PENDING/IN_PROGRESS/COMPLETED/FAILED/SKIPPED
    result: Any           # output available to subsequent steps
    started_at: float
    completed_at: float

    @property
    def duration_ms(self) -> float:
        return (self.completed_at - self.started_at) * 1000  # profiling per step

@dataclass
class ExecutionPlan:
    plan_id: str          # "plan-a3f92bc1" — prefixed UUID
    task_type: TaskType   # TEST_GENERATION | REFINEMENT | GENERAL_CHAT
    max_repair_attempts: int = 2
    current_repair_attempt: int = 0
    metadata: dict        # session_id, task_description, ...

    @property
    def can_repair(self) -> bool:
        # Single source of truth — the orchestrator only checks plan.can_repair
        return self.current_repair_attempt < self.max_repair_attempts

    def add_step(self, action, description, **params) -> PlanStep:
        ...  # dynamic step addition — allows plan_repair() to append repair steps

    def get_next_pending_step(self) -> Optional[PlanStep]:
        return next((s for s in self.steps if s.status == PENDING), None)
Architecture Issues UPDATED
After a full read of the source code, several problems flagged in an earlier assessment turned out to already be resolved. Two issues remain:
- Token estimation heuristic: tokens_used = len(full_response) // 4 appears in both the orchestrator and the streaming path. Java code has many special characters — the actual Qwen tokenizer can deviate 20–30%. TokenOptimizer uses the same heuristic.
- Magic number: builder_field_count > 30 is hardcoded in Pass 6. This threshold doesn't fit all project sizes; it should be config-driven in agent.yaml.
Code Quality UPDATED
Simple Explanation
Before you even ask, he has already sat down and read the entire codebase (this is the indexing step). He didn’t just skim it — he drew a map of which class depends on which, which records have no builder, which interfaces are injected where. This is intelligence/.
When you ask “write tests for UserService”, he doesn’t re-read the entire codebase. He looks at the map he drew and instantly knows UserService needs to mock OpenAPIRepository and UserQueryService. Then he picks exactly the relevant files and prioritises the most important ones within his time limit. This is context/: SnippetSelector + TokenOptimizer.
Before writing, he lays out a clear plan: how many steps, which order, how many times to self-fix if something goes wrong. This is ExecutionPlan + StateMachine.
Then he writes the tests according to the team’s rules: JUnit 5, Mockito, AAA pattern, no Spring context. Afterwards he re-reads his own work 7 times, checking for different types of issues: correct structure, sufficient annotations, is SecurityContextHolder mocked correctly, do the field names in the builder actually exist. This is ValidationPipeline 7 passes.
If he finds a mistake, he self-corrects up to 2 times before handing it to you — no need for you to tell him. This is the repair loop. If you want more edge cases, you give feedback and he remembers the entire conversation to fix it in exactly the right place. This is session memory + refinement.
Every action he takes is logged: how long each step took, how many chunks retrieved, how many tokens used, whether validation passed or failed. This is EventBus + MetricsCollector.
From your IDE’s perspective (Tabby), everything looks like chatting with GPT-4 via the OpenAI API. All the complexity is hidden behind the /v1/chat/completions endpoint.
Suggested Improvements UPDATED
💾 Persistent Session Storage
Replace in-memory dict with Redis or SQLite. Sessions survive restart, support multi-worker. Priority HIGH — required before production deployment.
🔐 API Key Authentication
FastAPI middleware with API key check. Hash keys in env. Especially critical for POST /reindex — a destructive operation that needs protection.
⌛ Async Indexing + Job Queue
/reindex returns a job_id immediately. Background asyncio task or Celery handles processing. Add GET /reindex/{job_id}/status. Avoids HTTP timeout.
📲 Real Tokenizer
Replace len(response) // 4 with AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B"). Use for both TokenOptimizer and tokens_used tracking. Current deviation: 20–30%.
🔍 Hybrid Search (BM25 + Dense)
Add BM25 keyword search alongside vector search in Qdrant. Improves recall for exact class/method name lookups that dense vectors tend to miss.
📝 Auto-write Test Files
Option to automatically write generated tests to the correct path in the Java project (src/test/java/...). Closes the workflow loop, no more manual copy-paste.
🧪 Unit Tests for the System
Write tests for ValidationPipeline (mock Java strings), ContextBuilder, RepairStrategySelector. Especially important when adding new anti-patterns to Pass 6.
⚙️ Config-driven AP8 threshold
Replace the magic number > 30 in Pass 6 AP8 with a configurable value in agent.yaml. Per-project tuning instead of hardcoding.
📊 Full Evaluation Pipeline
Extend benchmark.py: compile check (javac), coverage target, validation pass rate, repair success rate. Detects quality regression when changing model or prompt.
🗼 ImpactReport Integration
DependencyAnalyzer.impact_of() is implemented but not yet used in the generation flow. Use it to warn: “changing UserProfile will affect 5 other tests”.
Architecture Diagrams UPDATED
Diagram 1 — System Architecture (Full)
Diagram 2 — StateMachine + Repair Loop
Diagram 3 — ValidationPipeline 7 Passes
Diagram 4 — context/ Priority Assembly Pipeline
Quick Start — Run the Agent in 5 Minutes
This guide assumes you already have Docker + Docker Compose and an NVIDIA GPU (or run CPU-only with a smaller model).
Step 0 — Prerequisites
| Requirement | Minimum Version | Notes |
|---|---|---|
| Python | 3.11+ | 3.12 recommended |
| Docker + Compose | Docker 24+ | Used for Qdrant & vLLM |
| NVIDIA GPU | CUDA 12.1+ & ≥16 GB VRAM | Qwen2.5-Coder-7B needs ~14 GB VRAM (AWQ) |
| RAM | ≥ 16 GB | embedding model + FastAPI + indexer |
| Java Repo | Java 17+, Maven/Gradle | Source code to index |
Step 1 — Clone & install dependencies
git clone https://github.com/huynguyenjv/ai-agent.git
cd ai-agent
python -m venv venv
# Windows: venv\Scripts\activate
# Linux/macOS: source venv/bin/activate
pip install -r requirements.txt
python download_model.py
Downloads all-MiniLM-L6-v2 (22 MB) into ./models/. Only downloads once; used offline afterwards.
Step 2 — Configure environment
cp env.example .env
# Edit the required values in .env
- VLLM_BASE_URL — URL of the vLLM server (default http://localhost:8000/v1)
- JAVA_REPO_PATH — path to the Java source code to index
- QDRANT_HOST — Qdrant host (default localhost)
Step 3 — Start supporting services
# Start Qdrant (vector database) via Docker
docker compose up -d qdrant
# Start vLLM (LLM server) — requires GPU
docker compose up -d vllm
# Check both are ready
curl http://localhost:6333/health   # Qdrant
curl http://localhost:8000/health   # vLLM
Running without a GPU? Set VLLM_MODEL=Qwen/Qwen2.5-Coder-1.5B-Instruct in .env and add --device cpu in docker-compose.yml. ~10x slower but still works.
Step 4 — Index Java codebase
# Start agent server first
python main.py
# In another terminal: call the indexing API
curl -X POST http://localhost:8080/reindex \
-H "Content-Type: application/json" \
-d '{"repo_path": "/path/to/your/java/repo", "recreate": false}'
Indexing typically takes 1–10 minutes depending on repo size. Each .java file is parsed by tree-sitter, chunked, embedded, and upserted into Qdrant.
Step 5 — Generate your first test
# Basic test generation
curl -X POST http://localhost:8080/generate-test \
-H "Content-Type: application/json" \
-d '{
"file_path": "src/main/java/com/example/service/UserService.java",
"class_name": "UserService",
"task_description": "Generate comprehensive JUnit5 unit tests"
}'
The response returns test_code as a complete Java class with JUnit5 + Mockito. validation_passed: true means all 7 passes passed. repair_attempts shows whether self-repair was needed.
Tabby IDE Integration
The agent exposes a /v1/chat/completions endpoint fully compatible with the OpenAI API. Configure Tabby to point at the agent:
| Tabby Setting | Value |
|---|---|
| Completion Provider | OpenAI Compatible |
| API Endpoint | http://localhost:8080/v1 |
| Model | ai-agent |
| API Key | any value (no auth yet, see issue s09) |
API Reference — All Endpoints
The agent runs at http://localhost:8080 (configured via SERVER_PORT). All endpoints return JSON unless otherwise noted.
Core Endpoints
GET /health — Response
{
"status": "healthy",
"vllm_healthy": true,
"qdrant_healthy": true,
"index_stats": {
"points_count": 1842,
"collection": "java_codebase"
}
}
Returns 200 if all services (vLLM + Qdrant) are running. Use for k8s readiness probe.
POST /generate-test — Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| file_path | string | ● required | Path to the .java file (relative or absolute) |
| class_name | string | optional | Class name. If omitted, auto-detected from file path |
| task_description | string | optional | Additional instructions for the LLM |
| session_id | string | optional | Session UUID — if provided, restores chat history |
Response: GenerateTestResponse
{
"success": true,
"test_code": "import org.junit.jupiter.api.Test;\n...",
"class_name": "UserService",
"validation_passed": true,
"validation_issues": [],
"session_id": "a3f9-...",
"rag_chunks_used": 7,
"tokens_used": 1240,
"plan_summary": "Steps: 7 completed, 0 failed",
"repair_attempts": 0
}
validation_passed: false does not mean the request failed — test code is still returned, with warnings. Check validation_issues for details.
POST /refine-test — Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| session_id | string | ● required | Session ID from the previous generate call |
| feedback | string | ● required | Description of the changes to make |
Example
curl -X POST http://localhost:8080/refine-test \
-H "Content-Type: application/json" \
-d '{
"session_id": "a3f9-...",
"feedback": "Add test cases for null input and empty list scenarios"
}'
The agent will call vLLM again with the full conversation history + new feedback. Test code is regenerated from scratch but has the prior session context.
POST /reindex
If recreate: true, the entire vector collection will be deleted and rebuilt. This endpoint has no authentication — do not expose publicly!
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| repo_path | string | ● required | Absolute path to the Java repository |
| recreate | bool | optional | Delete and recreate the collection. Default: false |
Response
{"success": true, "message": "Indexed 1842 points", "points_indexed": 1842}
Current implementation is synchronous — the HTTP connection is held open until indexing is complete. For large repos this may take several minutes. See §12 for the async indexing plan.
{"points_count": 1842, "collection": "java_codebase", "vector_size": 384}
Returns a list of CodeChunk matching the class_name. Useful for debugging whether a class is in the index.
GET /index/lookup/UserService
→ [{class_name, file_path, content, metadata, score}, ...]
Session Endpoints
Creates a new session UUID. Returns SessionInfo containing session_id, created_at, expires_at.
Returns session metadata: ID, creation time, number of generated tests, conversation turns.
Immediately removes the session from memory. Sessions also auto-expire after 1 hour (session_timeout configured in agent.yaml).
Returns a list of SessionInfo for all sessions that have not yet expired.
OpenAI-Compatible Endpoints (for use with Tabby)
GET /v1/models
{"object": "list", "data": [{"id": "ai-agent", "object": "model", ...}]}
POST /v1/chat/completions
Supports the full OpenAI chat format. The agent automatically parses message content to detect file_path, class_name and dispatches to the /generate-test flow. Supports both stream: true (SSE token-by-token) and blocking mode.
Custom fields
{
"model": "ai-agent",
"messages": [{"role": "user", "content": "Write tests for UserService"}],
"stream": true,
"file_path": "src/main/java/.../UserService.java", // optional
"workspace_path": "/path/to/workspace" // optional
}
With stream: true, the response is an SSE stream with 6 phase events: PLANNING → RETRIEVING → GENERATING (token-by-token) → VALIDATING → REPAIRING (if needed) → DONE.
{"state": "IDLE", "active_sessions": 2, "total_generations": 47, ...}
{"total_generations": 47, "avg_tokens": 1380, "validation_pass_rate": 0.91, "repair_rate": 0.21, ...}
Long-lived SSE connection. Receives all events that EventBus publishes: PLAN_CREATED, STEP_STARTED, CONTEXT_RETRIEVED, VALIDATION_COMPLETED, REPAIR_STARTED, GENERATION_COMPLETED. Useful for monitoring dashboards.
GET /v1/rag-context?class_name=UserService&file_path=...
→ {snippets, token_count, mock_types, intelligence_available}
Rate Limiting
Default: 10 requests / 60 seconds per IP. Exceeding this returns HTTP 429. Configure via env: RATE_LIMIT_REQUESTS, RATE_LIMIT_WINDOW. Health check endpoints (/health, /v1/models) are exempt from rate limiting.
Note that the limiter is in-memory and per process, so in a multi-worker deployment the effective limit is RATE_LIMIT_REQUESTS × MAX_WORKERS. Redis is needed for distributed rate limiting.
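A per-IP sliding-window limiter matching the description above can be sketched as follows. This is illustrative, not the agent's actual middleware; the injectable `now` parameter exists only to make the sketch deterministic to test:

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Per-IP sliding-window rate limiter (hypothetical sketch)."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 60):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str, now=None) -> bool:
        now = time.time() if now is None else now
        hits = self._hits[ip]
        # Evict timestamps that have fallen out of the window
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # caller responds with HTTP 429
        hits.append(now)
        return True
```

A FastAPI middleware would call `allow(request.client.host)` on each request, skipping the exempt paths (/health, /v1/models).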
Configuration Guide — Detailed Configuration
The agent is configured in two layers: YAML files in config/ (defaults) and environment variables in .env (overrides). Environment variables always win over YAML.
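The override order can be sketched as below. This assumes the YAML file has already been parsed into a nested dict and that env var names are the upper-cased, underscore-joined key path — both assumptions about the real loader, which may use pydantic or similar:

```python
import os


def load_setting(key: str, yaml_defaults: dict, env=None):
    """Resolve a dotted config key: env var wins over YAML default.

    Sketch only — illustrates the layering rule, not the agent's
    actual configuration code.
    """
    env = os.environ if env is None else env
    env_key = key.upper().replace(".", "_")
    if env_key in env:
        return env[env_key]
    # Fall back to walking the dotted path into the YAML dict
    node = yaml_defaults
    for part in key.split("."):
        node = node[part]
    return node


defaults = {"qdrant": {"host": "localhost", "port": 6333}}
```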
Environment Variables (.env)
Qdrant — Vector Database
| Variable | Default | Description |
|---|---|---|
| QDRANT_HOST | localhost | Hostname of the Qdrant server |
| QDRANT_PORT | 6333 | Qdrant REST API port |
| QDRANT_COLLECTION | java_codebase | Name of the collection storing vectors |
vLLM — LLM Server
| Variable | Default | Description |
|---|---|---|
| VLLM_BASE_URL | http://localhost:8000/v1 | Base URL of vLLM's OpenAI-compatible API |
| VLLM_MODEL | Qwen/Qwen2.5-Coder-7B-Instruct-AWQ | Model name being served by vLLM |
| VLLM_API_KEY | token-abc123 | API key (authentication with vLLM) |
Embedding Model
| Variable | Default | Description |
|---|---|---|
| EMBEDDING_MODEL | sentence-transformers/all-MiniLM-L6-v2 | Model producing 384-dim vectors. Use a local path for offline usage. |
| SENTENCE_TRANSFORMERS_HOME | ./models | HuggingFace model cache directory |
Server & Performance
| Variable | Default | Description |
|---|---|---|
| SERVER_HOST | 0.0.0.0 | Bind address |
| SERVER_PORT | 8080 | HTTP port |
| MAX_WORKERS | 4 | ThreadPoolExecutor workers for blocking I/O |
| REQUEST_TIMEOUT | 300 | Timeout (seconds) for a single generation request |
| LOG_LEVEL | INFO | DEBUG / INFO / WARNING / ERROR |
Security & Rate Limiting
| Variable | Default | Description |
|---|---|---|
| CORS_ORIGINS | * | Allowed CORS origins. Use * for dev, specific list for prod. |
| RATE_LIMIT_REQUESTS | 10 | Maximum requests per window |
| RATE_LIMIT_WINDOW | 60 | Window duration (seconds) |
| DISABLE_SSL_VERIFY | false | Only set true in corporate proxy environments |
agent.yaml — Details
| Key | Default | Meaning |
|---|---|---|
| orchestrator.max_context_tokens | 4000 | Token budget for RAG context when ContextBuilder is unavailable |
| orchestrator.top_k_results | 10 | Number of chunks returned from Qdrant search |
| orchestrator.session_timeout | 3600 | Session expiry in seconds |
| prompt.test_constraints | list | Rules injected into the system prompt (JUnit5, Mockito, AAA...) |
| rules.layer_detection | patterns | Regex mapping class name → DDD layer (application/domain/infra) |
rag.yaml — Details
| Key | Default | Meaning |
|---|---|---|
| qdrant.vector_size | 384 | Must match the output dimension of the embedding model |
| qdrant.distance | Cosine | Similarity metric |
| embedding.batch_size | 32 | Number of chunks embedded in parallel per batch (indexing) |
| search.default_top_k | 10 | Default number of results to return |
| search.score_threshold | 0.5 | Filter out chunks with cosine score below this threshold |
vllm.yaml — Details
| Key | Default | Meaning |
|---|---|---|
| generation.temperature | 0.2 | Low = deterministic. High = creative. 0.2 is good for code gen. |
| generation.max_tokens | 4096 | Token limit for a single response |
| generation.top_p | 0.95 | Nucleus sampling threshold |
| retry.max_attempts | 3 | Retry count if vLLM call fails (separate from the repair loop) |
Tuning tips: increase top_k_results (agent.yaml) to 15–20 for large repos. Increase max_tokens (vllm.yaml) to 8192 for classes with many methods. Lower temperature to 0.1 for more deterministic code.
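The retry.max_attempts behaviour (separate from the repair loop) amounts to a standard retry wrapper. A minimal sketch, with an exponential backoff schedule that is an assumption rather than the documented policy:

```python
import time


def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 0.0):
    """Retry a flaky call up to max_attempts times (illustrative sketch)."""
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # real code would catch the client's error type
            last_exc = exc
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise last_exc


calls = {"n": 0}

def flaky():
    """Simulated vLLM call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("vLLM unavailable")
    return "ok"
```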
Developer Guide — Extending & Contributing
Quick code orientation
To understand the system, read the components in this order:
1. ExecutionPlan, PlanStep, StepAction, StepStatus — understand this struct thoroughly before reading any other file.
2. The AgentState enum (6 states) and the StateMachine managing transitions. Check the VALID_TRANSITIONS dict to see which states can transition to which.
3. _execute_plan() — this is the main while loop. Find _execute_step() — the dispatch table from the StepAction enum to the executing method.
4. The build_context() method, from top to bottom.
5. ValidationPipeline.validate(), which runs 7 _pass_*() methods. Each pass returns list[ValidationIssue]. Easy to add new passes.
Adding a new Anti-pattern to Pass 6
def _pass6_antipatterns(self, code: str, rag_chunks) -> list[ValidationIssue]:
    issues = []
    lines = code.split("\n")
    # ...existing checks...
    # ADD NEW: Detect Thread.sleep() in tests
    for i, line in enumerate(lines, 1):
        if "Thread.sleep(" in line:
            issues.append(ValidationIssue(
                severity=IssueSeverity.WARNING,
                pass_number=6,
                category="antipattern",
                message=f"Thread.sleep() at line {i} makes tests flaky",
                suggestion="Use Awaitility or mock the dependency causing delay",
                line_number=i,
            ))
    return issues
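A quick sanity check for the new rule: the scan can be exercised in isolation by reimplementing just the line matcher, so it runs without the pipeline or its imports (standalone sketch, not the project's own test):

```python
def find_thread_sleep(code: str) -> list[int]:
    """Return 1-based line numbers containing Thread.sleep().

    Mirrors the matching logic of the new Pass 6 check above.
    """
    return [
        i for i, line in enumerate(code.split("\n"), 1)
        if "Thread.sleep(" in line
    ]


test_code = """\
@Test
void waitsForResult() {
    Thread.sleep(1000);
    assertTrue(done);
}
"""
```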
Adding a new StepAction to ExecutionPlan
This is the canonical pattern for extending the pipeline:
# 1. Add enum to agent/plan.py
class StepAction(Enum):
    # ...existing...
    SAVE_TEST_FILE = "save_test_file"  # NEW

# 2. Create executor method in agent/orchestrator.py
def _step_save_test_file(self, plan: ExecutionPlan, step: PlanStep, ctx: dict):
    code = ctx["extracted_code"]
    output_path = self._resolve_test_path(ctx["file_path"])
    Path(output_path).write_text(code, encoding="utf-8")
    ctx["saved_path"] = output_path

# 3. Register in the dispatch table _execute_step()
def _execute_step(self, sm, plan, step, ctx):
    executor_map = {
        StepAction.EXTRACT_CLASS_INFO: self._step_extract_class_info,
        # ...existing...
        StepAction.SAVE_TEST_FILE: self._step_save_test_file,  # NEW
    }
    executor = executor_map.get(step.action)
    if executor:
        executor(plan, step, ctx)

# 4. Add the step in Planner
def plan_test_generation(self, ...) -> ExecutionPlan:
    plan.add_step(StepAction.RECORD_SESSION, "Record session")
    plan.add_step(StepAction.SAVE_TEST_FILE, "Save test file")  # NEW
    return plan
Running system tests
cd ai-agent
source venv/bin/activate

# Unit tests (no vLLM/Qdrant needed)
pytest tests/test_phase1.py -v    # StateMachine + Planner
pytest tests/test_phase2.py -v    # ContextBuilder + intelligence/
pytest tests/test_phase3_4.py -v  # Validation + Repair + Events + Metrics

# Generation quality benchmark
python benchmark.py --test-file benchmark/results/gen_quality_bench.json
The tests/ directory only has placeholders. Most test files have no actual test logic yet. See §12 for the test-writing plan.
Advanced Local Development Setup
# Hot-reload when editing code
uvicorn main:app --reload --host 0.0.0.0 --port 8080 --log-level debug

# Run with structured log output (pretty-printed)
LOG_LEVEL=DEBUG python main.py 2>&1 | python -m structlog.dev

# Check health after startup
curl -s http://localhost:8080/health | python -m json.tool
Docker Compose — Infrastructure
version: "3.8"
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
    volumes: ["./data/qdrant:/qdrant/storage"]
  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "Qwen/Qwen2.5-Coder-7B-Instruct-AWQ", "--gpu-memory-utilization", "0.85"]
    ports: ["8000:8000"]
    deploy:
      resources:
        reservations:
          devices: [{capabilities: ["gpu"]}]
  agent:
    build: .
    ports: ["8080:8080"]
    env_file: .env
    depends_on: [qdrant, vllm]
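After docker compose up, a small readiness poll can confirm the services are actually answering before sending work. This is a sketch (the function name and defaults are assumptions); the probe is injectable so the logic itself is testable without a network:

```python
import time
import urllib.request


def wait_until_ready(urls, probe=None, timeout: float = 120, interval: float = 2.0):
    """Poll each URL until it answers, or raise TimeoutError (illustrative sketch)."""
    def default_probe(url):
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except OSError:
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout
    pending = list(urls)
    while pending:
        # Keep only the URLs that still fail their probe
        pending = [u for u in pending if not probe(u)]
        if pending and time.monotonic() > deadline:
            raise TimeoutError(f"services not ready: {pending}")
        if pending:
            time.sleep(interval)
    return True
```

Typical usage would poll the Qdrant, vLLM, and agent ports from the compose file, e.g. the agent's /health endpoint on 8080.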
Glossary — Terms & Concepts
Architecture terminology
- AAA (Arrange-Act-Assert) — the test structure the validator enforces: // Arrange, // Act, // Assert comments are clearly present.
- Plan execution loop — _execute_plan() in the orchestrator. Each iteration processes one pending PlanStep. The loop ends when there are no more pending steps or the plan reaches COMPLETED.
- Repair loop — when the ValidationPipeline detects an ERROR, the orchestrator calls planner.plan_repair() to append REPAIR_CODE + GENERATE_CODE + VALIDATE_CODE steps to the running plan, then loops again. Maximum max_repair_attempts=2.
- Class info — the extracted class metadata: class_name, file_path, dependencies (FQN list), used_types, has_builder, record_components, layer.
- Built context — the result of ContextBuilder.build_context(). Contains: snippets (priority-ordered list), rag_chunks, token_count, mock_types, intelligence_available, elapsed_ms.
- DDD layer — application (*Service, *UseCase, *Handler), domain (*Entity, *ValueObject, *Aggregate), infrastructure (*Repository, *Adapter, *Client). The layer influences which types of mocks are generated.
- DependencyAnalyzer — service in intelligence/ that merges FileGraph + SymbolMap to return a TestContext. Analyzes the AST graph to know exactly which classes need to be mocked — no guessing.
- EventBus — publishes an Event at each step. EventTypes: PLAN_CREATED, STEP_STARTED, STEP_COMPLETED, CONTEXT_RETRIEVED, VALIDATION_COMPLETED, REPAIR_STARTED, GENERATION_COMPLETED.
- ExecutionPlan — the ordered list of PlanSteps to execute. The plan can be extended at runtime (append repair steps). plan.can_repair is the single source of truth for the repair limit.
- FileGraph — graph in intelligence/ built on import relationships between Java files. Traverse the graph to find dependencies, dependents, transitive closures.
- Fallback chain — if intelligence/ is not available, ContextBuilder falls back to RAG-only. If RAG fails, it falls back to inline source parsing.
- Embedding model — sentence-transformers/all-MiniLM-L6-v2, a 22 MB embedding model producing 384-dim dense vectors. Used for both indexing (building Qdrant) and search (query transform). Runs in-process inside FastAPI.
- Repo scan result — structure in intelligence/repo_scanner.py containing all Java repo information after scanning: lookup by class name (O(1)), by FQN, by file path. A static snapshot of the repo.
- RepairPlan — built by RepairStrategySelector.build_repair_plan(). Contains: the list of issues to fix grouped by category, a system prompt for repair, and a user prompt with the broken code plus specific fix instructions.
- SnippetSelector — component in context/snippet_selector.py that classifies CodeChunks into 5 priority tiers based on their relationship to the target class (target/mock/domain/interface/transitive).
- SSE streaming — token-by-token responses on /v1/chat/completions when stream: true.
- StateMachine — enforces the VALID_TRANSITIONS dict. Raises TransitionError for invalid transitions.
- SymbolMap — index in intelligence/: maps class → methods/fields, method → class, field type → injectors, annotation → classes. Used to find dependency injection points.
- TestContext — returned by DependencyAnalyzer.test_context_for(class_name). Contains: mocks (list of class names needing @Mock), domain_types (list of value objects/entities), layer (application/domain/infra).
- Token budget — the TokenOptimizer estimates ~4 chars/token and trims or drops lower-priority snippets to stay within budget.
- Tree-sitter parser — indexer/parse_java.py uses tree-sitter to extract class structure, methods, and dependencies from .java files.
- ValidationIssue — produced by the ValidationPipeline. Contains: severity (ERROR/WARNING/INFO), pass_number, category, message, suggestion, line_number.
- Validation result — passed = not any(i.severity == ERROR for i in issues). On failure, a _ValidationFailed exception is raised.
- vLLM client — supports blocking (generate()) and streaming (stream_generate()) calls.
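The ~4 chars/token heuristic and priority-based trimming mentioned in the glossary can be sketched as follows (hypothetical function names; the real TokenOptimizer may differ in detail):

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token."""
    return max(1, len(text) // 4)


def fit_budget(snippets: list[tuple[int, str]], budget: int = 6000) -> list[str]:
    """Keep the highest-priority snippets within the token budget.

    Snippets are (priority, text) pairs where a lower number means a
    higher priority tier (P1→P5). Lower-priority snippets that no
    longer fit are dropped. Sketch only.
    """
    kept, used = [], 0
    for _, text in sorted(snippets, key=lambda s: s[0]):
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # drop snippets that would blow the budget
        kept.append(text)
        used += cost
    return kept


snippets = [(2, "x" * 400), (1, "y" * 400), (5, "z" * 40000)]
```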