// huynguyenjv/ai-agent · Deep Architecture Review · v2.0 Full

AI Coding Agent
Comprehensive Architecture Analysis

Structural Intelligence + RAG Hybrid · Plan-driven StateMachine · 7-Pass Severity Validation · Auto-Repair Loop · Self-hosted JUnit5 Generator

Structural Intelligence · RAG Hybrid · StateMachine + Planner · Python · FastAPI · Qwen2.5-Coder-7B · vLLM · Qdrant · MiniLM-L6-v2 · 7-Pass Validation · Auto-Repair Loop · EventBus · Metrics
✦ Complete edition — covers intelligence/ · context/ · orchestrator.py · validation.py (7 passes) · plan.py · state_machine · repair · events · metrics
// 01

Repository Overview

Purpose

A self-hosted AI coding agent purpose-built to automatically generate JUnit 5 + Mockito unit tests for large Java repositories following DDD architecture. Solves a real-world problem: manually writing tests for hundreds of services, repositories, and domain objects is time-consuming, error-prone, and inconsistent.

Key Capabilities

  • Structural Intelligence — Graph-based AST analysis, knows exactly which classes need to be mocked
  • RAG Hybrid Context — Vector search + dependency graph traversal in parallel (ThreadPoolExecutor)
  • Token Budget — Priority-based snippet selection P1→P5, TokenOptimizer 6000 tokens
  • Plan-driven StateMachine — ExecutionPlan with 8 StepActions, driven by AgentState machine
  • 7-Pass Validation — Severity-aware (ERROR/WARNING/INFO), RAG-aware construction check (Pass 7)
  • Auto-repair Loop — Self-corrects validation errors up to 2 times with RepairStrategySelector
  • Streaming API — Token-by-token SSE with 6 phase events (PLANNING→DONE)
  • EventBus + Metrics — Publishes events at every step, MetricsCollector, structlog
  • OpenAI-compatible API — /v1/chat/completions for Tabby IDE
✦ Architecture Classification
Structural Intelligence + RAG Hybrid Coding Agent with Plan-driven Execution and Agentic Repair Loop. Not a typical RAG agent — the intelligence/ layer provides ground truth from the AST graph instead of letting the LLM guess dependencies.
Request → FastAPI → AgentOrchestrator
  → Planner.plan_test_generation() → ExecutionPlan {8 steps}
  → StateMachine: IDLE→PLANNING→RETRIEVING→GENERATING→VALIDATING→COMPLETED
  → ContextBuilder (context/)
      → DependencyAnalyzer (intelligence/) ← exact mocks from AST graph
      → RAGClient (rag/) + ThreadPoolExecutor parallel fetch deps
      → SnippetSelector (P1→P5 priority tiers)
      → TokenOptimizer (budget 6000 tokens)
  → PromptBuilder → vLLM / Qwen2.5-Coder-7B
  → ValidationPipeline (7 passes, ERROR/WARNING/INFO)
  → RepairStrategySelector → repair loop (max 2x, plan_repair())
  → EventBus.publish() + MetricsCollector
  → GenerationResult {test_code, validation_summary, plan_summary, repair_attempts}
// Architecture

Architecture Guide — What Each Subsystem Does

The system is divided into 7 clearly defined functional layers. Each layer has a single responsibility and does not encroach on the responsibilities of other layers.

🔮
Layer 1 — Static information gathering
intelligence/
repo_scanner · file_graph · symbol_map · dependency_analyzer
WHY — Problem
The LLM does not know exactly which classes need to be mocked. If the LLM guesses, many mocks will be missing or arbitrary → tests won’t compile. Ground truth must come from the AST, not from model guesswork.
WHAT — Responsibility
Reads the entire Java repo once, builds a graph representing the relationships between classes. On demand, answers: “Which classes does UserService need to mock exactly?”
HOW — Mechanism
tree-sitter parses AST → builds FileGraph (import edges) + SymbolMap (field/method table) → DependencyAnalyzer merges both graphs → returns TestContext.

4 files, 4 distinct responsibilities

repo_scanner.py — Scans all .java files, parses each one, creates a RepoSnapshot with O(1) lookups by class name / FQN / file path. Scan only — no analysis.
file_graph.py — Uses import statements to build a directed graph: edge A→B means “A imports B”. Supports transitive closure: “all things A depends on, directly and indirectly”. Graph only — no business logic.
symbol_map.py — Global symbol table: class → fields/methods, field type → who injects that field, annotation → classes with that annotation. Used to determine “what does UserService inject via constructor”. Lookup only — no code generation.
dependency_analyzer.py — The sole Facade external callers use. Takes class_name, merges FileGraph + SymbolMap, returns TestContext {mocks, domain_types, layer}. This is the only answer the rest of the system needs.
Clear boundary: intelligence/ only reads static information (pre-indexed). Does not call RAG, does not call LLM, does not generate code. Only output: TestContext & ImpactReport.
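The facade described above can be sketched in a few lines. This is a hypothetical simplification: `TestContext` follows the shape stated in this review ({mocks, domain_types, layer}), but the real merge logic in dependency_analyzer.py is more elaborate.

```python
from dataclasses import dataclass, field

@dataclass
class TestContext:
    class_name: str
    mocks: list[str] = field(default_factory=list)         # injected deps to @Mock
    domain_types: list[str] = field(default_factory=list)  # value objects / entities used
    layer: str = "unknown"                                 # DDD layer, e.g. "service"

class DependencyAnalyzer:
    """Facade merging the import graph and the symbol table (illustrative)."""

    def __init__(self, file_graph: dict[str, set[str]], injected_fields: dict[str, list[str]]):
        self.file_graph = file_graph            # class -> imported classes (FileGraph)
        self.injected_fields = injected_fields  # class -> injected field types (SymbolMap)

    def test_context_for(self, class_name: str) -> TestContext:
        injected = self.injected_fields.get(class_name, [])
        imported = self.file_graph.get(class_name, set())
        # Mock exactly what is injected; other imported types are domain types.
        domain = sorted(t for t in imported if t not in injected)
        return TestContext(class_name, mocks=list(injected), domain_types=domain,
                           layer="service" if class_name.endswith("Service") else "unknown")
```

The key property is determinism: the mock list comes straight from the injected-field table, never from similarity scores.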
📂
Layer 2 — Build search index
indexer/
parse_java · summarize · build_index
WHY — Problem
Cannot fit the entire codebase into every prompt. Need a pre-indexed “library” that can be searched quickly by semantics.
WHAT — Responsibility
Converts each Java class into a 384-dimensional vector + rich metadata, stored in Qdrant. Runs only once (or when the repo changes).
HOW — Mechanism
tree-sitter parses AST → extracts full metadata → summarize.py writes a description → MiniLM embeds → Qdrant upsert.
parse_java.py — tree-sitter parses each .java file. Extracts: class_name, FQN, methods list, fields, dependencies (FQN list), used_types, has_builder, java_type (record/class/interface), record_components, DDD layer. This metadata is critical — Pass 7 validation relies on it.
summarize.py — Generates a short text description for each chunk so embeddings have better semantic context.
build_index.py — Orchestrates everything: calls parser, embedder, upserts into Qdrant with full payload metadata. Supports recreate=True to wipe and rebuild.
Boundary: indexer/ only runs offline (via POST /reindex), does not participate in the real-time generation flow. Output: a fully populated Qdrant collection.
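The summarization step can be illustrated with a minimal sketch. The exact wording summarize.py produces is not shown in this review; this only mimics the idea of turning ClassInfo metadata into text with better semantic context for embedding.

```python
def summarize_chunk(info: dict) -> str:
    """Turn parsed class metadata into a short description for embedding (sketch)."""
    parts = [f"{info['java_type']} {info['class_name']} in {info['layer']} layer."]
    if info.get("methods"):
        parts.append("Methods: " + ", ".join(info["methods"]) + ".")
    if info.get("dependencies"):
        parts.append("Depends on: " + ", ".join(info["dependencies"]) + ".")
    return " ".join(parts)
```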
🔍
Layer 3 — Search for relevant code
rag/
client · schema
WHY — Problem
intelligence/ tells us “what needs to be mocked” but doesn’t have the source code of those classes. The LLM needs to see the actual code of dependencies to generate correctly.
WHAT — Responsibility
Queries Qdrant by class name or semantic similarity, returns a list of CodeChunk objects containing both source code and metadata.
HOW — Mechanism
Embeds query with MiniLM → cosine similarity search in Qdrant → filters by metadata → returns top-k highest scoring results.
search_by_class(class_name) — Finds chunks matching exactly by metadata class_name + semantic similarity. Returns source + dependencies list from the payload.
include_dependencies=True — Automatically extracts the dep_simple_names list from the main chunk’s payload. context/ will then fetch each dep in parallel.
RAG vs intelligence/ — the difference:
RAG finds code “semantically close to X” (may miss some). intelligence/ finds “exactly which classes X depends on” (AST, 100%). The system uses both: intelligence/ for the precise mock list, RAG for the source code of those classes.
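The "semantically close" side of that comparison boils down to cosine similarity over embedding vectors. A brute-force sketch (Qdrant does this server-side, plus metadata filtering; names here are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec: list[float], index: dict[str, list[float]], top_k: int = 3) -> list[str]:
    # Rank every stored vector by similarity to the query, keep the top-k names.
    scored = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return scored[:top_k]
```

This also shows why RAG alone "may miss some": a dependency whose embedding is not close to the query simply never ranks high enough.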
🏗️
Layer 4 — Select & optimize context
context/
context_builder · snippet_selector · token_optimizer
WHY — Problem
The LLM has a token limit (6000 tokens for context). Stuffing all code in would overflow. Need to intelligently select the most important pieces.
WHAT — Responsibility
Receives raw chunks from RAG + mock list from intelligence/, sorts by importance, smart-truncates, returns a pre-packaged ContextResult.
HOW — Mechanism
4-step pipeline: intelligence → RAG parallel fetch → SnippetSelector (5 tiers P1→P5) → TokenOptimizer (trim from P5 down). Gracefully degrades if intelligence/ unavailable.

Snippet priority logic

P1 — Target class (keep 100%) — The class being tested. Must be complete, never truncated.
P2 — Mockable dependencies (keep 100%) — Classes in TestContext.mocks from intelligence/. The LLM must understand their interface to mock correctly.
P3 — Domain types (truncate if needed) — Value objects and entities that are returned or accepted. The LLM needs their fields to create test data.
P4 — Interfaces (truncate aggressively) — Interfaces expose method signatures. Truncated heavily since implementation detail is rarely needed.
P5 — Transitive deps (dropped first when over budget) — Indirect dependencies. Least important, first to be removed when budget runs low.
Boundary: context/ knows nothing about validation or LLM. It does one thing: converts raw data into optimized context within budget. Output: ContextResult.
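The budget logic above can be sketched as follows. This is a simplified model using the ~4 chars/token estimate mentioned later in this review; it drops over-budget tiers outright, whereas the real TokenOptimizer also truncates P3/P4 before dropping. Names are illustrative.

```python
CHARS_PER_TOKEN = 4  # rough estimate used throughout this review

def estimate_tokens(source: str) -> int:
    return len(source) // CHARS_PER_TOKEN

def optimize(snippets: list[tuple[int, str]], budget: int) -> list[tuple[int, str]]:
    """snippets: (priority, source) pairs, P1 = most important.
    Drop from P5 downward until the estimate fits the budget; never drop P1/P2."""
    kept = sorted(snippets, key=lambda s: s[0])          # P1 first
    while sum(estimate_tokens(src) for _, src in kept) > budget:
        droppable = [s for s in kept if s[0] >= 3]       # only P3–P5 may be removed
        if not droppable:
            break                                        # P1/P2 always survive
        kept.remove(max(droppable, key=lambda s: s[0]))  # lowest tier goes first
    return kept
```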
🧠
Layer 5 — Orchestration brain
agent/
orchestrator · planner · plan · state_machine · validation · repair · prompt · memory · events · metrics
WHY — Problem
Need a “conductor” to coordinate all other layers in the right order, handle errors, self-repair if needed, and return results in the right format.
WHAT — Responsibility
Receives request → builds a plan → executes each step in sequence → validates output → self-repairs if wrong → returns result.
HOW — Mechanism
Planner creates an ExecutionPlan (list of steps). StateMachine tracks state. A while loop runs each PlanStep. ValidationPipeline checks. RepairStrategySelector appends repair steps if validation fails.

agent/ has 10 modules — one responsibility each

Module — Responsibility (Input → Output)
orchestrator.py — Full coordination. Sole entry point from server. (GenerationRequest → GenerationResult)
planner.py — Decides what to do (plan steps). Separated from execution. (request metadata → ExecutionPlan)
plan.py — Data model: PlanStep, ExecutionPlan, StepAction enum. No logic. (Data structures)
state_machine.py — Tracks current state. Prevents invalid transitions. (transition event → AgentState)
validation.py — Checks 7 quality criteria. Classifies issues by severity. (Java test code string → ValidationResult)
repair.py — Decides how to fix based on the type of validation error. (list[ValidationIssue] → RepairPlan)
prompt.py — Builds messages[] sent to the LLM. Controls format. (context + rules + history → messages list)
memory.py — Stores conversation history + generated tests per session. (session events → SessionMemory)
events.py — Pub/sub bus. Publishes events at each important step. (Event objects → SSE stream, downstream)
metrics.py — Collects statistics: tokens, chunks, validation pass rate. (Events → MetricsReport)
Key insight: orchestrator.py knows nothing about Java parsing or vector embedding. It only coordinates other layers through clear interfaces. If a layer (e.g. intelligence/) is unavailable, the orchestrator gracefully degrades without crashing.
Layer 6 — Code generation
vllm/
client.py — Qwen2.5-Coder-7B via OpenAI-compat HTTP
WHY — Problem
Need an LLM that understands Java and writes structurally correct tests. Qwen2.5-Coder is trained on code — much better than a general-purpose text model for this task.
WHAT — Responsibility
Receives messages[] (system + context + user), sends HTTP request to vLLM server, returns Java code string (blocking) or token stream (stream mode).
HOW — Mechanism
POST /v1/chat/completions to vLLM :8000 (OpenAI format). vLLM uses PagedAttention for efficient serving. AWQ = 4-bit quantization, reduces VRAM ~4x.
generate(messages) — Blocking. Waits for the full response. Used for test generation (the orchestrator needs the full text before validating).
stream_generate(messages) — Iterator returning tokens one at a time. Used for SSE streaming to Tabby IDE so users see tokens appearing progressively.
Boundary: vllm/ only wraps an HTTP call. Knows nothing about Java, validation, or sessions. To switch to OpenAI GPT-4, simply change the URL + API key in .env.
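Because the wrapper is just the OpenAI chat-completions format over HTTP, its request/response handling can be sketched without any network code. The model name and temperature below are examples, not confirmed config values:

```python
import json

def build_chat_payload(messages: list[dict], model: str = "Qwen/Qwen2.5-Coder-7B-Instruct",
                       stream: bool = False) -> str:
    # OpenAI-compatible chat-completions request body, as served by vLLM.
    return json.dumps({"model": model, "messages": messages,
                       "temperature": 0.2, "stream": stream})

def parse_chat_response(body: str) -> str:
    """Extract the assistant text from a non-streaming completion response."""
    return json.loads(body)["choices"][0]["message"]["content"]
```

Swapping the backend really is just a URL/key change, since any OpenAI-compatible server accepts the same payload shape.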
🌐
Layer 7 — External communication
server/
api.py · session.py
WHY — Problem
Tabby IDE communicates via the OpenAI API protocol. Need a layer dedicated to translating between HTTP/OpenAI format and the internal agent API.
WHAT — Responsibility
Exposes REST endpoints, handles OpenAI-compat requests, rate limiting, CORS, SSE streaming, session lifecycle, and delegates actual work to the orchestrator.
HOW — Mechanism
FastAPI + uvicorn. Async handlers + ThreadPoolExecutor for blocking orchestrator calls. asyncio.wait_for() for automatic timeout after REQUEST_TIMEOUT seconds.
api.py — 19 endpoints. Parses Tabby message content to detect file_path and class_name. Converts to GenerationRequest, calls orchestrator in thread pool.
session.pySessionManager manages session lifecycle: creates UUID, expires after 1h, background cleanup task every 5 minutes.
Boundary: server/ only handles HTTP and session lifecycle. No domain logic (no Java validation, no RAG knowledge). All business logic lives in agent/.

Summary: Who does what?

intelligence/ — “The codebase map”: knows exactly which class depends on which.
indexer/ — “The library”: converts code into searchable vectors.
rag/ — “The librarian”: given a query, returns the exact relevant pages.
context/ — “The editor”: curates from a large body of content, keeping the most important within the page limit.
agent/ — “Brain / conductor”: plans, coordinates, quality-controls, self-repairs when wrong.
vllm/ — “The pen”: given full instructions, writes the actual code.
server/ — “The receptionist”: accepts requests from outside, routes to the right person, returns results.
// 02

Project Structure UPDATED

🌐
server/
api.py — FastAPI: all endpoints, OpenAI-compat, SSE streaming. session.py — Session lifecycle.
🧠
agent/
orchestrator · planner · plan · state_machine · prompt · rules · memory · validation · repair · events · metrics — 11 modules.
🔮
intelligence/ ✦
repo_scanner · file_graph · symbol_map · dependency_analyzer — Graph-based structural intelligence. O(1) lookups. Exact mock inference from AST.
🏗️
context/ ✦
context_builder · snippet_selector · token_optimizer — 4-stage pipeline replacing “dump all chunks”. Priority tiers + budget-aware truncation.
📂
indexer/
parse_java.py (tree-sitter AST) · summarize.py · build_index.py (Qdrant upsert with rich metadata: has_builder, record_components, dependencies...).
🔍
rag/
client.py — search_by_class(), include_dependencies=True. schema.py — CodeChunk, SearchQuery, MetadataFilter.
vllm/
client.py — generate() blocking + stream_generate() iterator. Wraps the vLLM OpenAI-compat HTTP API.
⚙️
config/
agent.yaml · rag.yaml · vllm.yaml + .env overrides + env.example.
🧪
tests/
System test directory — currently empty (ironic).

Full Dependency Map

server/api.py
  → agent/orchestrator.py
        → agent/planner.py  →  agent/plan.py (ExecutionPlan, PlanStep, StepAction)
        → agent/state_machine.py  (AgentState, StateMachine, TransitionError)
        → context/context_builder.py  [optional — graceful degradation]
              → intelligence/dependency_analyzer.py
                    → intelligence/repo_scanner.py  (RepoSnapshot, O(1) lookup)
                    → intelligence/file_graph.py    (FileGraph, transitive closure)
                    → intelligence/symbol_map.py    (SymbolMap, global symbol table)
              → rag/client.py
              → context/snippet_selector.py  (5 priority tiers)
              → context/token_optimizer.py   (budget-aware truncation)
        → agent/prompt.py        (PromptBuilder)
        → vllm/client.py         (VLLMClient)
        → agent/validation.py    (ValidationPipeline — 7 passes)
        → agent/repair.py        (RepairStrategySelector, RepairPlan)
        → agent/memory.py        (MemoryManager, SessionMemory)
        → agent/events.py        (EventBus, Event, EventType)
        → agent/metrics.py       (MetricsCollector)
// 03

Execution Flow UPDATED

A — Indexing Flow (runs once / reindex)

POST /reindex
API receives repo_path, delegates to the indexer pipeline.
parse_java.py — tree-sitter AST
Parses all .java files. Extracts: class_name, FQN, methods, fields, dependencies (FQN list), used_types (simple names), has_builder, java_type (record/class/interface), record_components, DDD layer (application/domain/infrastructure).
summarize.py — Chunk Summarization
Each chunk gets a summary. Rich metadata is attached so Pass 7 validation can cross-check later.
MiniLM-L6-v2 — Embedding
Each chunk → 384-dim dense vector.
Qdrant Upsert
Vector + full payload upserted into collection java_codebase. Rich metadata enables intelligence/ and Pass 7 to operate correctly.

B — Test Generation Flow (per request)

HTTP Request
POST /generate-test or /v1/chat/completions with file_path, class_name, task_description, session_id.
Planner → ExecutionPlan
Planner creates a plan with list of PlanSteps: EXTRACT_CLASS_INFO → RETRIEVE_CONTEXT → BUILD_PROMPT → GENERATE_CODE → EXTRACT_CODE → VALIDATE_CODE → RECORD_SESSION.
StateMachine: IDLE → PLANNING → RETRIEVING
Each transition is tracked. EventBus publishes PLAN_CREATED + GENERATION_STARTED.
ContextBuilder.build_context()
① DependencyAnalyzer → TestContext {mocks, domain_types, layer} from AST graph  ② RAG search_by_class() fetches main chunk  ③ Extract dep_simple_names from Qdrant payload  ④ ThreadPoolExecutor(max_workers=5) parallel fetch all deps  ⑤ unfound_types → main_chunk.unfound_types  ⑥ SnippetSelector P1→P5  ⑦ TokenOptimizer budget=6000.
PromptBuilder → vLLM
Build messages[] with system prompt + rules + context snippets + unfound_types warnings + session history. POST to vLLM Qwen2.5-Coder-7B.
ValidationPipeline (7 passes)
Validates with severity. If ERROR → _ValidationFailed exception raised. EventBus publishes VALIDATION_COMPLETED.
Repair Loop (if ERROR)
planner.plan_repair() appends REPAIR_CODE + GENERATE_CODE + VALIDATE_CODE steps to the running plan. Loop continues. Max max_repair_attempts=2. EventBus publishes REPAIR_STARTED.
COMPLETED → Response
GenerationResult {test_code, validation_passed, validation_summary, rag_chunks_used, tokens_used, plan_summary, repair_attempts}.
// 04

Agent Loop UPDATED

StateMachine States

IDLE
  | transition_to(PLANNING)
  v
PLANNING       ← Planner.plan_test_generation() → ExecutionPlan
  | transition_to(RETRIEVING)
  v
RETRIEVING     ← ContextBuilder.build_context() or _get_rag_context()
  | transition_to(GENERATING)
  v
GENERATING     ← vllm.generate() / stream_generate() ←────────────────────+
  | transition_to(VALIDATING)                                             |
  v                                                                       |
VALIDATING     ← ValidationPipeline.validate(code, rag_chunks)            |
  | passed → transition_to(COMPLETED)                                     |
  | failed → _ValidationFailed → transition_to(REPAIRING)                 |
  v                                                                       |
REPAIRING      ← RepairStrategySelector.build_repair_plan()               |
  | planner.plan_repair() appends REPAIR_CODE + GENERATE_CODE steps       |
  +──────────────────────────── continue loop ────────────────────────────+
  | current_repair_attempt >= max_repair_attempts → COMPLETED (with issues)
  v
COMPLETED
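The transition guard implied by the diagram can be sketched directly. The allowed-transition table below is inferred from the arrows above, not copied from state_machine.py:

```python
from enum import Enum, auto

class AgentState(Enum):
    IDLE = auto(); PLANNING = auto(); RETRIEVING = auto()
    GENERATING = auto(); VALIDATING = auto(); REPAIRING = auto(); COMPLETED = auto()

class TransitionError(Exception):
    pass

# Edges taken from the state diagram; anything not listed is an invalid transition.
ALLOWED = {
    AgentState.IDLE:       {AgentState.PLANNING},
    AgentState.PLANNING:   {AgentState.RETRIEVING},
    AgentState.RETRIEVING: {AgentState.GENERATING},
    AgentState.GENERATING: {AgentState.VALIDATING},
    AgentState.VALIDATING: {AgentState.COMPLETED, AgentState.REPAIRING},
    AgentState.REPAIRING:  {AgentState.GENERATING, AgentState.COMPLETED},
    AgentState.COMPLETED:  set(),
}

class StateMachine:
    def __init__(self):
        self.state = AgentState.IDLE

    def transition_to(self, target: AgentState) -> None:
        if target not in ALLOWED[self.state]:
            raise TransitionError(f"{self.state.name} -> {target.name} not allowed")
        self.state = target
```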
✦ Plan-driven Repair — Elegant Design
When validation fails, the orchestrator does not restart from scratch. It calls planner.plan_repair() to append new steps to the running ExecutionPlan, then continues the while loop. No code duplication between the generate path and the repair path.

Plan-driven Execution Engine — Core Loop

plan = self.planner.plan_test_generation(...)

while True:
    step = plan.get_next_pending_step()
    if step is None: break       # all steps done

    step.start()
    self.event_bus.publish(Event(type=STEP_STARTED, ...))

    try:
        self._execute_step(sm, plan, step, ctx)
        step.complete()

    except _ValidationFailed as vf:
        if plan.can_repair:
            sm.transition_to(AgentState.REPAIRING, ...)
            self.planner.plan_repair(plan, vf.issues, ctx["extracted_code"])
            sm.transition_to(AgentState.GENERATING, repair=True)
            continue       # continue loop with newly appended repair steps
        else:
            ctx["validation_passed"] = False   # accept with issues

Streaming Events

yield StreamEvent(phase=PLANNING,   content="📋 Planning...")
yield StreamEvent(phase=RETRIEVING, content="🔍 Searching...", metadata={chunks_count})
yield StreamEvent(phase=GENERATING, content=token, delta=True)  # token-by-token
yield StreamEvent(phase=VALIDATING, content="🔎 Validating...", metadata={passed, errors})
yield StreamEvent(phase=REPAIRING,  content="🔧 Repair 1/2...")
yield StreamEvent(phase=DONE,       metadata={tokens_used, rag_chunks_used, ...})
// 05

Model Integration

Model — Size — Role — Access
Qwen/Qwen2.5-Coder-7B-Instruct — 7B — Test generation + repair — vLLM OpenAI-compat :8000
sentence-transformers/all-MiniLM-L6-v2 — 22M — Embedding for RAG indexing + search — In-process SentenceTransformers

Prompt Construction

[SYSTEM]
You are a Java test engineer.
{rules: JUnit5, Mockito, AAA pattern, no Spring context, no @SpringBootTest, ...}
{unfound_types: "These types not in index — mock them without source: [X, Y]"}

[CONTEXT — from ContextBuilder, priority-ordered, token-optimized]
// P1 — AuthUseCaseService.java (target, 100% kept)
{source code}
// P2 — OpenAPIRepository.java (mockable dep — from intelligence/)
{source code}
// P3 — UserProfile.java (domain type — record, no @Builder)
{source code, truncated if over budget}
...

[HISTORY — if refinement / session has history]
User: "Generate tests for AuthUseCaseService"
Assistant: {previous test code}
User: "Add null input tests"

[USER]
{task_description}. File: {file_path}

Response Parsing — _extract_code()

# Level 1: ```java ... ``` code block (preferred)
pattern = r"```(?:java)?\s*\n(.*?)```"
matches = re.findall(pattern, response, re.DOTALL)
if matches: return max(matches, key=len).strip()

# Level 2: detect class declaration directly
pattern = r"((?:import.*?\n)*\s*(?:@\w+.*?\n)*\s*(?:public\s+)?class\s+\w+.*?\{.*\})"
class_match = re.search(pattern, response, re.DOTALL)
if class_match: return class_match.group(1).strip()

return response.strip()  # fallback
// 06

Tool System UPDATED

No dynamic tool-calling via LLM. Tools are fixed step executors dispatched by the orchestrator according to the StepAction enum from ExecutionPlan:

StepAction — Executor — Description
EXTRACT_CLASS_INFO — _step_extract_class_info() — Extract class name from file path
RETRIEVE_CONTEXT — _step_retrieve_context() — ContextBuilder or RAG fallback, parallel fetch
BUILD_PROMPT — _step_build_prompt() — PromptBuilder with session history
GENERATE_CODE — _step_generate_code() — vllm.generate(), check response.success
EXTRACT_CODE — _step_extract_code() — Regex 2-level extraction
VALIDATE_CODE — _step_validate_code() — ValidationPipeline 7 passes, raise _ValidationFailed
RECORD_SESSION — _step_record_session() — session.record_generated_test()
REPAIR_CODE — _step_repair_code() — RepairStrategySelector → rebuild repair prompt

Fallback Hierarchy — 3-tier degradation

# Tier 1: Full ContextBuilder (intelligence + RAG + priority + token budget)
if self.context_builder:
    context_result = self.context_builder.build_context(...)

# Tier 2: RAG-only with graph traversal (parallel fetch)
else:
    rag_chunks = self._get_rag_context(class_name, file_path, session)

# Tier 3: Inline source parsing (regex from Java source sent with request)
if not types_to_fetch and inline_source:
    fallback_types = self._extract_types_from_source(inline_source, class_name)
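Tier 3 amounts to pulling candidate type names out of raw Java source with a regex. The real `_extract_types_from_source()` may differ; this hypothetical sketch shows the idea:

```python
import re

# Common JDK types that should never be fetched or mocked (illustrative subset).
JAVA_BUILTINS = {"String", "Integer", "Long", "Boolean", "List", "Map", "Optional", "Set"}

def extract_types_from_source(source: str, class_name: str) -> set[str]:
    # Match declarations like "private UserRepository repo" or params "(JwtToken t)":
    # a capitalized identifier followed by a lowercase variable/method name.
    candidates = set(re.findall(r"\b([A-Z][A-Za-z0-9]*)\s+[a-z]\w*", source))
    return candidates - JAVA_BUILTINS - {class_name}
```

Crude by design: it over-matches (e.g. return types of methods) but gives the RAG layer something to fetch when no index metadata is available.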
// 07

Memory System

Type — Storage — Scope — Persist?
Session conversation — Python dict (MemoryManager, in-process) — Per session UUID — ❌ Lost on restart
RAG context cache — Session-level dict (key: "class:file") — Per session — ❌ In-memory
generated_tests list — SessionMemory.generated_tests[] — Per session — ❌ In-memory
Code index (RAG) — Qdrant (disk-backed) — Entire Java repo — ✅ Yes

RAG context is cached per-session by key "ClassName:file_path" so refinement requests don’t need to re-query Qdrant. Session memory also stores a generated_tests list — each entry has class_name, test_code, and a success flag.

// 08

Data Flow UPDATED

Indexing Data Flow

Java Repo (.java files)
  → tree-sitter parse_java.py
  → ClassInfo {class_name, FQN, methods, fields,
                   dependencies (FQN list), used_types (simple names),
                   has_builder, java_type, record_components, DDD layer}
  → summarize.py
  → MiniLM-L6-v2 → 384-dim vector
  → Qdrant upsert: vector + full payload (rich metadata for Pass 7 + intelligence/)

Generation Data Flow (full detail)

HTTP {file_path, class_name, task, session_id}
  → server/api.py → AgentOrchestrator.generate_test()
  → Planner.plan_test_generation() → ExecutionPlan
  → _execute_plan() while loop:

    [EXTRACT_CLASS_INFO]  file_path → class_name

    [RETRIEVE_CONTEXT]
      ContextBuilder.build_context(class_name, file_path, max_tokens=6000)
        → DependencyAnalyzer.test_context_for(class_name)
            → TestContext {mocks: [...], domain_types: [...], layer: "service"}
        → rag.search_by_class(class_name, top_k=1)   ← main chunk
        → extract dep_simple_names from main_chunk.dependencies (FQN)
        → extract used_types from main_chunk.used_types
        → types_to_fetch = dep_simple_names | used_types
        → ThreadPoolExecutor(max_workers=5):
            parallel rag.search_by_class(dep) for dep in types_to_fetch
        → unfound_types → main_chunk.unfound_types
        → SnippetSelector: P1(target)→P2(mocks)→P3(domain)→P4(iface)→P5(trans)
        → TokenOptimizer: P1 keep 100%, P5 drop first if over budget
        → ContextResult {snippets, rag_chunks, token_count, mock_types}
      EventBus.publish(CONTEXT_RETRIEVED)

    [BUILD_PROMPT]
      PromptBuilder.build_test_generation_prompt(context_result, session)

    [GENERATE_CODE]
      vllm.generate(system, user) → GenerationResponse
      EventBus.publish(STEP_COMPLETED)

    [EXTRACT_CODE]  _extract_code(response) → Java string

    [VALIDATE_CODE]
      ValidationPipeline.validate(code, rag_chunks)  ← 7 passes
      → ValidationResult {errors, warnings, infos}
      → passed? → RECORD_SESSION
      → failed? → _ValidationFailed(issues, validation_result)
                → plan.can_repair? → plan_repair() → loop
                → else → accept with issues
      EventBus.publish(VALIDATION_COMPLETED)

    [RECORD_SESSION]
      session.add_assistant_message(code)
      session.record_generated_test(class_name, code, success)

  → StateMachine.transition_to(COMPLETED)
  → EventBus.publish(GENERATION_COMPLETED)
  → GenerationResult
// Deep Dive

intelligence/ — Structural Graph Intelligence NEW

✦ The biggest differentiator
Pure RAG returns “code semantically close to X”. intelligence/ returns “exactly what X needs to mock” by traversing the AST graph — no guessing, no approximation.

4 Components

File — Class — Function
repo_scanner.py — RepoScanner — Scans the repo with the tree-sitter Java parser → RepoSnapshot with O(1) lookup by name / FQN / file path
file_graph.py — FileGraph — Directed graph based on import relationships. Finds dependencies, dependents, transitive closures
symbol_map.py — SymbolMap — Global symbol table: class→methods/fields, method→classes, field_type→injectors, annotation→classes
dependency_analyzer.py — DependencyAnalyzer — Merges FileGraph + SymbolMap → TestContext and ImpactReport

Key Queries

analyzer = DependencyAnalyzer(repo_scanner, file_graph, symbol_map)

# "What does AuthUseCaseService need to mock?"
ctx = analyzer.test_context_for("AuthUseCaseService")
# ctx.mocks        → ["OpenAPIRepository", "UserQueryService", ...]  ← 100% accurate
# ctx.domain_types → ["UserProfile", "JwtToken", ...]
# ctx.layer        → "service"

# "If UserProfile changes, what is affected?"
report = analyzer.impact_of("UserProfile")
# report.direct_dependents     → ["AuthUseCaseService", "UserUseCase"]
# report.transitive_dependents → [...]

RAG vs Intelligence — Direct Comparison

Aspect — RAG vector search — intelligence/ graph
Finds — Semantically similar code chunks — Exact dependencies from AST
Answer — “Code related to X” — “X needs to mock A, B, C”
Mechanism — Cosine similarity — Graph traversal
Mock accuracy — Estimated, may miss some — 100%, complete
Compile result — May miss mocks → NullPointer — Full mock list → compiles immediately
// Deep Dive

context/ — Smart Context Assembly NEW

context/ solves the problem: token budget is limited (6000 tokens) — how do you select the most important content rather than dumping everything in?

4-step Pipeline

ContextBuilder.build_context("AuthUseCaseService", file_path, max_tokens=6000)
  |
  +-- [1] Intelligence Layer  (optional — graceful degradation)
  |         DependencyAnalyzer.test_context_for("AuthUseCaseService")
  |         → TestContext {mocks: [...], domain_types: [...], layer: "service"}
  |
  +-- [2] RAG Search
  |         rag.search_by_class(include_dependencies=True)
  |         + parallel ThreadPoolExecutor fetch per dep (from TestContext.mocks)
  |         → List[CodeChunk]
  |
  +-- [3] SnippetSelector — Priority tiers
  |         P1 ████████ target source   AuthUseCaseService.java  (keep 100%)
  |         P2 ██████   mockable deps   OpenAPIRepository, ...   (keep 100%)
  |         P3 ████     domain types    UserProfile, JwtToken     (truncate)
  |         P4 ██       interfaces      IUserRepository           (hard truncate)
  |         P5 █         transitive deps  indirect dependencies     (drop first)
  |
  +-- [4] TokenOptimizer — Budget-aware (~4 chars/token)
            → ContextResult {
                snippets:               list[Snippet] (priority-ordered)
                rag_chunks:             list[CodeChunk]
                token_count:            int
                mock_types:             list[str]
                intelligence_available: bool
                elapsed_ms:             float
              }
✦ Graceful Degradation
ContextBuilder(rag_client, intelligence=None) — if intelligence/ is unavailable, the system still runs in RAG-only mode. No crash, only reduced mock accuracy.
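The optional-dependency pattern behind this degradation is simple to sketch. Names mirror `ContextBuilder(rag_client, intelligence=None)` above; the chunk shape returned by the fake RAG client here is illustrative:

```python
class ContextBuilder:
    def __init__(self, rag_client, intelligence=None):
        self.rag = rag_client
        self.intelligence = intelligence   # None → RAG-only mode, no crash

    def mock_types_for(self, class_name: str) -> list[str]:
        if self.intelligence is not None:
            # Tier 1: exact mock list from the AST graph
            return self.intelligence.test_context_for(class_name).mocks
        # Tier 2: RAG-only — fall back to dependency names stored in the payload
        chunk = self.rag.search_by_class(class_name)
        return chunk.get("dep_simple_names", []) if chunk else []
```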
// Deep Dive

agent/orchestrator.py — Detail NEW

4 Development phases within the same file

Phase — Added — Components
Phase 1 — StateMachine + Planner — StateMachine, Planner, ExecutionPlan, PlanStep
Phase 2 — Optional ContextBuilder — ContextBuilder, 3-tier graceful degradation
Phase 3 — Validation + Repair — ValidationPipeline, RepairStrategySelector, _ValidationFailed
Phase 4 — EventBus + Metrics — EventBus, MetricsCollector, structlog structured logging

Shared Context Dict (ctx)

ctx = {
    "class_name":        str,
    "file_path":         str,
    "session":           Optional[SessionMemory],
    "rag_chunks":        list[CodeChunk],
    "context_result":    Optional[ContextResult],
    "system_prompt":     str,
    "user_prompt":       str,
    "full_response":     str,
    "extracted_code":    str,
    "validation_result": Optional[ValidationResult],
    "validation_passed": bool,
    "validation_issues": list[str],
    "tokens_used":       int,         # len(full_response) // 4 (estimate)
    "repair_plan":       Optional[RepairPlan],
}
// Deep Dive

agent/validation.py — ValidationPipeline 7 Passes NEW

Severity Model
ERROR — blocks acceptance, triggers repair loop  |  WARNING — repair if budget allows, does not block  |  INFO — recorded, no impact on passed/failed. passed = not any(i.severity == ERROR for i in issues)
P1
Structural
Class declaration present? Braces balanced (open == close)? Import statements present? → ERROR if any are missing.
P2
Forbidden Patterns — Spring Context Leaks
@SpringBootTest, @DataJpaTest, @WebMvcTest, @SpringExtension, @ContextConfiguration, @RunWith(SpringRunner), @Autowired, @MockBean, TestRestTemplate, MockMvc → all ERROR with specific suggestion + line number.
P3
Required Patterns — JUnit5 & Mockito
Missing @ExtendWith(MockitoExtension.class) → ERROR | missing @Mock → ERROR | missing @InjectMocks → ERROR | missing @Test → ERROR | missing @DisplayName → WARNING.
P4
AAA Pattern — Arrange-Act-Assert
Checks for // Arrange, // Act, // Assert comments in each test method. Missing → WARNING with specific list of missing parts.
P5
Quality — Best Practices
Missing verify() calls → WARNING. Test naming convention method_WhenCondition_ShouldResult not followed → INFO. lenient() usage → WARNING. Missing @BeforeEach → INFO.
P6
Anti-patterns — 8 Real-world Issues (detail below)
SecurityContextHolder, LocalDateTime.now(), @InjectMocks without @Mock, verify on chained static call, setUp with >30 builder calls...
P7
RAG-Aware Construction — Cross-check vs AST Metadata
Depth-tracking mini-parser extracts builder chain fields. Cross-checks against record_components/fields from Qdrant payload. Detects .builder() on record without @Builder, non-existent field names → ERROR with correct constructor suggestion.
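Pass 4's comment check, for instance, can be sketched in a few lines (a simplified whole-file check; the real pass inspects each test method separately):

```python
def check_aaa_comments(code: str) -> list[str]:
    """Return the AAA section comments missing from the test code (Pass 4 sketch)."""
    required = ("// Arrange", "// Act", "// Assert")
    return [marker for marker in required if marker not in code]
```

An empty return list means Pass 4 adds no WARNING; a non-empty list becomes the "specific list of missing parts" in the issue message.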

Pass 6 — 8 Common Anti-patterns

AP1
SecurityContextHolder without MockedStatic — will leak state between tests. ERROR. Suggestion: try-with-resources MockedStatic.
AP2
LocalDateTime.now() / UUID.randomUUID() in test body — value will differ at runtime. WARNING. Suggestion: use ArgumentCaptor, any() matcher, or a fixed value.
AP3
@InjectMocks without any @Mock — all dependencies will be null. ERROR.
AP4
SecurityContextHolder.setContext() called directly — not safe. ERROR. Suggestion: MockedStatic.when(SecurityContextHolder::getContext).
AP5
SecurityContextHolder.getContext() directly without MockedStatic — will fail at runtime. ERROR.
AP6
Missing MockedStatic import when using SecurityContextHolder — WARNING. Auto-detects missing import.
AP7
verify() on chained static call — verify(SecurityContextHolder.getContext().getAuthentication()) is wrong. ERROR. Suggestion: verify on mock variable directly.
AP8
setUp with >30 builder/setter calls — likely guessing domain fields. WARNING. Suggestion: use mock(Type.class) + stub only the necessary accessors.

Pass 7 — _extract_builder_chain_fields() Mini-parser

# Problem: simple regex would confuse nested calls with fields
# UserProfile.builder()
#   .id(UUID.fromString("abc"))  ← UUID.fromString is nested, NOT a field
#   .name("test")                ← name IS a real field

# Solution: track parenthesis depth; only dots at depth 0 separate chain links
import re

def extract_builder_chain_fields(chain: str) -> list[str]:
    fields, depth = [], 0
    for i, c in enumerate(chain):
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
        elif c == '.' and depth == 0:           # only match at depth=0
            m = re.match(r'\.(\w+)\s*\(', chain[i:])
            # "builder" is excluded here because this sketch scans the whole chain
            if m and m.group(1) not in ("builder", "build", "toBuilder"):
                fields.append(m.group(1))       # ← this is a real field
    return fields
// Deep Dive

agent/plan.py — ExecutionPlan NEW

@dataclass
class PlanStep:
    step_id:      int           # 1, 2, 3, ...
    action:       StepAction    # enum — no magic strings
    description:  str
    params:       dict          # input for the executor
    status:       StepStatus    # PENDING/IN_PROGRESS/COMPLETED/FAILED/SKIPPED
    result:       Any           # output available to subsequent steps
    started_at:   float
    completed_at: float

    @property
    def duration_ms(self) -> float:
        return (self.completed_at - self.started_at) * 1000  # profiling per step

@dataclass
class ExecutionPlan:
    plan_id:                str       # "plan-a3f92bc1" — prefixed UUID
    task_type:              TaskType  # TEST_GENERATION | REFINEMENT | GENERAL_CHAT
    steps:                  list[PlanStep] = field(default_factory=list)
    max_repair_attempts:    int = 2
    current_repair_attempt: int = 0
    metadata:               dict = field(default_factory=dict)  # session_id, task_description, ...

    @property
    def can_repair(self) -> bool:
        return self.current_repair_attempt < self.max_repair_attempts
    # Single source of truth — orchestrator only checks plan.can_repair

    def add_step(self, action, description, **params) -> PlanStep:
        # Dynamic step addition — allows plan_repair() to append repair steps

    def get_next_pending_step(self) -> Optional[PlanStep]:
        return next((s for s in self.steps if s.status == StepStatus.PENDING), None)
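With that shape, the orchestrator's plan-driven loop reduces to: fetch the next pending step, dispatch by action, mark it completed. A minimal self-contained sketch (types simplified; the real dispatch table lives in `_execute_step()`):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Optional

class StepStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"

@dataclass
class PlanStep:
    step_id: int
    action: str                      # a StepAction enum in the real code
    status: StepStatus = StepStatus.PENDING
    result: Any = None

@dataclass
class ExecutionPlan:
    steps: list[PlanStep] = field(default_factory=list)

    def add_step(self, action: str) -> PlanStep:
        step = PlanStep(step_id=len(self.steps) + 1, action=action)
        self.steps.append(step)
        return step

    def get_next_pending_step(self) -> Optional[PlanStep]:
        return next((s for s in self.steps if s.status is StepStatus.PENDING), None)

def execute_plan(plan: ExecutionPlan, executors: dict[str, Callable[[dict], Any]]) -> dict:
    """Drive the plan to completion; ctx is the shared context dict."""
    ctx: dict = {}
    while (step := plan.get_next_pending_step()) is not None:
        step.status = StepStatus.IN_PROGRESS
        step.result = executors[step.action](ctx)   # dispatch by action
        step.status = StepStatus.COMPLETED
    return ctx
```

Because repair steps are appended to `steps` at runtime, the same loop transparently executes them on the next iteration.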
// 09

Architecture Issues UPDATED

After reading the full source code, several problems flagged in the initial assessment turned out to be already resolved:

FIXED
No auto-retry → ✓ RepairStrategySelector + plan_repair() appends repair steps + loop up to 2 times. Targeted repair prompt with structured validation issues per category.
FIXED
Validation only string-matching → ✓ ValidationPipeline 7 passes, severity-aware (ERROR/WARNING/INFO), depth-tracking mini-parser for builder chains, RAG cross-check with AST metadata.
FIXED
No observability → ✓ EventBus publishes at every step (PLAN_CREATED, STEP_STARTED, CONTEXT_RETRIEVED, VALIDATION_COMPLETED...), MetricsCollector, structlog with context fields.
FIXED
Tight-coupled orchestrator → ✓ Plan-driven step dispatcher, 3-tier graceful degradation, ContextBuilder is an optional dependency, clear fallback hierarchy.
HIGH
In-memory sessions: MemoryManager stores sessions in an in-process Python dict. Server restart → all active sessions lost. Multi-worker deployment won’t work. Needs Redis or SQLite backend.
HIGH
No authentication: All endpoints, including POST /reindex (a destructive operation that deletes and rebuilds the entire index), have zero auth layer. Exposing to a network is a serious risk.
MED
Rough token estimation: tokens_used = len(full_response) // 4 appears in both orchestrator and streaming. Java code has many special characters — the actual Qwen tokenizer can deviate 20–30%. TokenOptimizer uses the same heuristic.
MED
Synchronous /reindex: Blocking HTTP call. Indexing a large Java repo can take several minutes → HTTP timeout. Needs background job + GET /reindex/{job_id}/status.
MED
Embedding model in-process: MiniLM-L6-v2 loads into the FastAPI process. Costs RAM (~90MB), increases startup time. Should be isolated into a separate embedding microservice.
LOW
Magic number AP8: builder_field_count > 30 hardcoded in Pass 6. This threshold doesn’t fit all project sizes. Should be config-driven in agent.yaml.
LOW
Generated tests not auto-saved: Test code is returned in the response but is not automatically written to src/test/java/... The user must copy-paste manually.
LOW
Zero tests for the system itself: The tests/ directory contains only placeholders with no real test logic. Ironic, and a genuine risk during refactoring — especially for ValidationPipeline and ContextBuilder.
// 10

Code Quality UPDATED

Module Structure
9.5/10
server/agent/intelligence/context/rag/indexer/vllm — clear, correct responsibilities
Naming
9.8/10
ExecutionPlan, SnippetSelector, DependencyAnalyzer — completely self-documenting
Modularity
8.2/10
Plan-driven, optional deps, graceful degradation — much better than initially assessed
Domain Expertise
9.7/10
SecurityContextHolder, builder/record patterns — learned from practice, not theory
Error Handling
8.5/10
_ValidationFailed pattern, fallback hierarchy, structlog context — consistent
Concurrency
8.0/10
ThreadPoolExecutor parallel dep fetch — applied where it actually matters
Documentation
8.0/10
Detailed README + TABBY_SETUP.md + docstrings in code
Testing
0.5/10
Zero tests — critical gap, especially for ValidationPipeline and ContextBuilder
Security
1.8/10
No auth. /reindex is an open destructive endpoint. .env in repo
// 11

Simple Explanation

For developers new to the project
Imagine you have a brilliant Java testing colleague, and you want to ask him to write tests for your UserService.

Before you even ask, he has already sat down and read the entire codebase (this is the indexing step). He didn’t just skim it — he drew a map of which class depends on which, which records have no builder, which interfaces are injected where. This is intelligence/.

When you ask “write tests for UserService”, he doesn’t re-read the entire codebase. He looks at the map he drew and instantly knows UserService needs to mock OpenAPIRepository and UserQueryService. Then he picks exactly the relevant files and prioritises the most important ones within his time limit. This is context/: SnippetSelector + TokenOptimizer.

Before writing, he lays out a clear plan: how many steps, which order, how many times to self-fix if something goes wrong. This is ExecutionPlan + StateMachine.

Then he writes the tests according to the team’s rules: JUnit 5, Mockito, AAA pattern, no Spring context. Afterwards he re-reads his own work 7 times, checking for different types of issues: correct structure, sufficient annotations, is SecurityContextHolder mocked correctly, do the field names in the builder actually exist. This is ValidationPipeline 7 passes.

If he finds a mistake, he self-corrects up to 2 times before handing it to you — no need for you to tell him. This is the repair loop. If you want more edge cases, you give feedback and he remembers the entire conversation to fix it in exactly the right place. This is session memory + refinement.

Every action he takes is logged: how long each step took, how many chunks retrieved, how many tokens used, whether validation passed or failed. This is EventBus + MetricsCollector.

From your IDE’s perspective (Tabby), everything looks like chatting with GPT-4 via the OpenAI API. All the complexity is hidden behind the /v1/chat/completions endpoint.

// 12

Suggested Improvements UPDATED

💾 Persistent Session Storage

Replace in-memory dict with Redis or SQLite. Sessions survive restart, support multi-worker. Priority HIGH — required before production deployment.

🔐 API Key Authentication

FastAPI middleware with API key check. Hash keys in env. Especially critical for POST /reindex — a destructive operation that needs protection.

⌛ Async Indexing + Job Queue

/reindex returns a job_id immediately. Background asyncio task or Celery handles processing. Add GET /reindex/{job_id}/status. Avoids HTTP timeout.
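The pattern behind this is a small job registry; a thread-based sketch with the FastAPI wiring omitted (class and status names are illustrative):

```python
import threading
import uuid
from typing import Callable

class JobRegistry:
    """Tracks long-running jobs so /reindex can return a job_id immediately."""

    def __init__(self):
        self._status: dict[str, str] = {}
        self._lock = threading.Lock()

    def submit(self, work: Callable[[], None]) -> str:
        job_id = f"job-{uuid.uuid4().hex[:8]}"
        with self._lock:
            self._status[job_id] = "running"

        def _run():
            try:
                work()
                outcome = "completed"
            except Exception:
                outcome = "failed"
            with self._lock:
                self._status[job_id] = outcome

        threading.Thread(target=_run, daemon=True).start()
        return job_id

    def status(self, job_id: str) -> str:
        with self._lock:
            return self._status.get(job_id, "unknown")
```

POST /reindex would call `submit(run_indexing)` and return the job_id; GET /reindex/{job_id}/status would call `status()`. An asyncio task or Celery swaps in cleanly behind the same interface.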

📲 Real Tokenizer

Replace len(response) // 4 with AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B"). Use for both TokenOptimizer and tokens_used tracking. Current deviation: 20–30%.

🔍 Hybrid Search (BM25 + Dense)

Add BM25 keyword search alongside vector search in Qdrant. Improves recall for exact class/method name lookups that dense vectors tend to miss.
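One common way to fuse the two result lists is reciprocal rank fusion, which needs only ranks, not comparable scores (a sketch; k=60 is the conventional constant, and the BM25 side would come from e.g. a keyword index or Qdrant sparse vectors):

```python
def reciprocal_rank_fusion(*ranked_lists: list[str], k: int = 60) -> list[str]:
    """Merge ranked result lists (e.g. BM25 and dense retrieval) by RRF score."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # A document gains 1/(k + rank) from each list it appears in
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by both retrievers float to the top, while exact-name matches that only BM25 finds still survive into the fused list.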

📝 Auto-write Test Files

Option to automatically write generated tests to the correct path in the Java project (src/test/java/...). Closes the workflow loop, no more manual copy-paste.

🧪 Unit Tests for the System

Write tests for ValidationPipeline (mock Java strings), ContextBuilder, RepairStrategySelector. Especially important when adding new anti-patterns to Pass 6.

⚙️ Config-driven AP8 threshold

Replace the magic number > 30 in Pass 6 AP8 with a configurable value in agent.yaml. Per-project tuning instead of hardcoding.

📊 Full Evaluation Pipeline

Extend benchmark.py: compile check (javac), coverage target, validation pass rate, repair success rate. Detects quality regression when changing model or prompt.

🗼 ImpactReport Integration

DependencyAnalyzer.impact_of() is implemented but not yet used in the generation flow. Use it to warn: “changing UserProfile will affect 5 other tests”.

// 13

Architecture Diagrams UPDATED

Diagram 1 — System Architecture (Full)

Tabby IDE (OpenAI API compat)
  → FastAPI :8080 (server/api.py · server/session.py · EventBus · Metrics · structlog)
  → Orchestrator (agent/orchestrator.py): Planner + StateMachine · PromptBuilder · ValidationPipeline (7P) · RepairStrategySelector · MemoryManager · Phase 1·2·3·4
  → ContextBuilder (context/): SnippetSelector P1→P5 · TokenOptimizer 6000t · graceful degradation → ContextResult
      ↳ intelligence/: DependencyAnalyzer · FileGraph · SymbolMap → TestContext · ImpactReport
      ↳ RAG Client (rag/client.py): MiniLM-L6-v2 embed · ThreadPoolExecutor → Qdrant :6333
  → vLLM :8000 (vllm/client.py): Qwen2.5-Coder-7B · generate() blocking · stream_generate()
Indexer (parse_java.py with tree-sitter · summarize.py · build_index.py): reads the Java repo (.java files · DDD application/domain/infra) and upserts vectors + rich metadata (has_builder, record_components, dependencies, used_types) into Qdrant

Diagram 2 — StateMachine + Repair Loop

IDLE → PLANNING (Planner.plan_test_generation()) → RETRIEVING (ContextBuilder.build_context()) → GENERATING (vllm.generate()) → VALIDATING (ValidationPipeline, 7 passes) → COMPLETED on pass. On fail + can_repair: plan_repair() appends steps → REPAIRING (RepairStrategySelector) → GENERATING again; at max attempts, accept with issues.

Diagram 3 — ValidationPipeline 7 Passes

code string + rag_chunks →
  P1 Structural: class · braces · imports → ERROR
  P2 Forbidden: @SpringBoot* · @Autowired · MockMvc → ERROR
  P3 Required: @ExtendWith · @Mock · @InjectMocks → ERR/WARN
  P4 AAA: // Arrange · // Act · // Assert → WARNING
  P5 Quality: verify() · naming · lenient() → WARN/INFO
  P6 Anti-patterns: SecurityCtx · DateTime.now · AP1–AP8 → ERR/WARN
  P7 ✦ RAG-aware: .builder() check · field names vs AST metadata · depth-tracking mini-parser → ERROR
→ ValidationResult {errors, warnings, infos} · passed = no ERRORs · triggers repair loop if errors exist

Diagram 4 — context/ Priority Assembly Pipeline

intelligence/ TestContext {mocks, types, layer}
  → SnippetSelector: P1 target (100%) · P2 mocks (100%) · P3 domain types · P4 interfaces · P5 transitive (drop first if over)
  → TokenOptimizer: budget = 6000 tokens · ~4 chars/token · truncate from P5 down
  → ContextResult: snippets (priority-ordered) · rag_chunks · token_count · mock_types · intelligence_available · elapsed_ms
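The budget step can be sketched as: sort snippets by tier, then drop from P5 downward until the ~4 chars/token estimate fits (simplified; the real TokenOptimizer also truncates individual snippets rather than only dropping them):

```python
def fit_token_budget(snippets: list[tuple[int, str]], budget: int = 6000) -> list[str]:
    """snippets: (priority 1..5, code). Keep high-priority code within the budget."""
    def estimate(text: str) -> int:
        return len(text) // 4                       # ~4 chars per token heuristic

    kept = sorted(snippets, key=lambda s: s[0])     # P1 first, P5 last
    while kept and sum(estimate(code) for _, code in kept) > budget:
        kept.pop()                                  # drop lowest priority first
    return [code for _, code in kept]
```

P1 (target class) and P2 (mocks) are effectively protected because P5 and P4 run out first.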
// Quick Start

Quick Start — Run the Agent in 5 Minutes

This guide assumes you already have Docker + Docker Compose and an NVIDIA GPU (or run CPU-only with a smaller model).

Step 0 — Prerequisites

Requirement | Minimum Version | Notes
Python | 3.11+ | 3.12 recommended
Docker + Compose | Docker 24+ | Used for Qdrant & vLLM
NVIDIA GPU | CUDA 12.1+ & ≥16 GB VRAM | Qwen2.5-Coder-7B needs ~14 GB VRAM (AWQ)
RAM | ≥ 16 GB | embedding model + FastAPI + indexer
Java Repo | Java 17+, Maven/Gradle | Source code to index

Step 1 — Clone & install dependencies

1
Clone repo and create virtual environment
Create an isolated Python 3.11 environment to avoid conflicts. Never install globally.
git clone https://github.com/huynguyenjv/ai-agent.git
cd ai-agent
python -m venv venv
# Windows:
venv\Scripts\activate
# Linux/macOS:
source venv/bin/activate
2
Install Python packages
pip install -r requirements.txt
Includes: fastapi, uvicorn, qdrant-client, sentence-transformers, tree-sitter, structlog, pydantic v2.
3
Download embedding model (offline)
python download_model.py
Downloads all-MiniLM-L6-v2 (22M parameters, ~90 MB) into ./models/. Downloaded once, then used offline.

Step 2 — Configure environment

cp env.example .env
# Edit the required values in .env
⚠️ Three most important values
VLLM_BASE_URL — URL of the vLLM server (default http://localhost:8000/v1)
JAVA_REPO_PATH — path to the Java source code to index
QDRANT_HOST — Qdrant host (default localhost)

Step 3 — Start supporting services

# Start Qdrant (vector database) via Docker
docker compose up -d qdrant

# Start vLLM (LLM server) — requires GPU
docker compose up -d vllm

# Check both are ready
curl http://localhost:6333/health     # Qdrant
curl http://localhost:8000/health     # vLLM
💡 Running without a GPU
Change model in .env: VLLM_MODEL=Qwen/Qwen2.5-Coder-1.5B-Instruct and add --device cpu in docker-compose.yml. ~10x slower but still works.

Step 4 — Index Java codebase

# Start agent server first
python main.py

# In another terminal: call the indexing API
curl -X POST http://localhost:8080/reindex \
  -H "Content-Type: application/json" \
  -d '{"repo_path": "/path/to/your/java/repo", "recreate": false}'

Indexing typically takes 1–10 minutes depending on repo size. Each .java file is parsed by tree-sitter, chunked, embedded, and upserted into Qdrant.

Step 5 — Generate your first test

# Basic test generation
curl -X POST http://localhost:8080/generate-test \
  -H "Content-Type: application/json" \
  -d '{
    "file_path": "src/main/java/com/example/service/UserService.java",
    "class_name": "UserService",
    "task_description": "Generate comprehensive JUnit5 unit tests"
  }'
✓ Expected result
Response JSON includes test_code as a complete Java class with JUnit5 + Mockito. validation_passed: true means all 7 passes passed. repair_attempts shows whether self-repair was needed.

Tabby IDE Integration

The agent exposes a /v1/chat/completions endpoint fully compatible with the OpenAI API. Configure Tabby to point at the agent:

Tabby Setting | Value
Completion Provider | OpenAI Compatible
API Endpoint | http://localhost:8080/v1
Model | ai-agent
API Key | any value (no auth yet, see §09)
// API Reference

API Reference — All Endpoints

The agent runs at http://localhost:8080 (configured via SERVER_PORT). All endpoints return JSON unless otherwise noted.

Core Endpoints

GET /health Check health of the entire system
Response
{
  "status": "healthy",
  "vllm_healthy": true,
  "qdrant_healthy": true,
  "index_stats": {
    "points_count": 1842,
    "collection": "java_codebase"
  }
}

Returns 200 if all services (vLLM + Qdrant) are running. Use for k8s readiness probe.

POST /generate-test Generate JUnit5 tests for a Java class
Request Body
Field | Type | Required | Description
file_path | string | required | Path to the .java file (relative or absolute)
class_name | string | optional | Class name. If omitted, auto-detected from file path
task_description | string | optional | Additional instructions for the LLM
session_id | string | optional | Session UUID — if provided, restores chat history
Response: GenerateTestResponse
{
  "success": true,
  "test_code": "import org.junit.jupiter.api.Test;\n...",
  "class_name": "UserService",
  "validation_passed": true,
  "validation_issues": [],
  "session_id": "a3f9-...",
  "rag_chunks_used": 7,
  "tokens_used": 1240,
  "plan_summary": "Steps: 7 completed, 0 failed",
  "repair_attempts": 0
}
Note
validation_passed: false does not mean the request failed. Test code is still returned but has warnings. Check validation_issues for details.
POST /refine-test Refine tests based on feedback (requires session_id)
Request Body
Field | Type | Required | Description
session_id | string | required | Session ID from the previous generate call
feedback | string | required | Description of the changes to make
Example
curl -X POST http://localhost:8080/refine-test \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "a3f9-...",
    "feedback": "Add test cases for null input and empty list scenarios"
  }'

The agent will call vLLM again with the full conversation history + new feedback. Test code is regenerated from scratch but has the prior session context.

POST /reindex Re-index the entire Java repository into Qdrant
⚠️ Warning
This is a destructive operation. With recreate: true, the entire vector collection will be deleted and rebuilt. This endpoint has no authentication — do not expose publicly!
Request Body
Field | Type | Required | Description
repo_path | string | required | Absolute path to the Java repository
recreate | bool | optional | Delete and recreate the collection. Default: false
Response
{"success": true, "message": "Indexed 1842 points", "points_indexed": 1842}

Current implementation is synchronous — the HTTP connection is held open until indexing is complete. For large repos this may take several minutes. See §12 for the async indexing plan.

GET /index/stats Qdrant collection statistics
{"points_count": 1842, "collection": "java_codebase", "vector_size": 384}
GET /index/lookup/{class_name} Find RAG chunks for a specific class

Returns a list of CodeChunk matching the class_name. Useful for debugging whether a class is in the index.

GET /index/lookup/UserService
→ [{class_name, file_path, content, metadata, score}, ...]

Session Endpoints

POST /session Create a new session

Creates a new session UUID. Returns SessionInfo containing session_id, created_at, expires_at.

GET /session/{session_id} Get session information

Returns session metadata: ID, creation time, number of generated tests, conversation turns.

DELETE /session/{session_id} Delete session (free memory)

Immediately removes the session from memory. Sessions also auto-expire after 1 hour (session_timeout configured in agent.yaml).

GET /sessions List all active sessions

Returns a list of SessionInfo for all sessions that have not yet expired.

OpenAI-Compatible Endpoints (for use with Tabby)

GET /v1/models List models (OpenAI format)
{"object": "list", "data": [{"id": "ai-agent", "object": "model", ...}]}
POST /v1/chat/completions OpenAI chat completions (Tabby IDE entry point)

Supports the full OpenAI chat format. The agent automatically parses message content to detect file_path, class_name and dispatches to the /generate-test flow. Supports both stream: true (SSE token-by-token) and blocking mode.

Custom fields
{
  "model": "ai-agent",
  "messages": [{"role": "user", "content": "Write tests for UserService"}],
  "stream": true,
  "file_path": "src/main/java/.../UserService.java",  // optional
  "workspace_path": "/path/to/workspace"              // optional
}
Streaming
When stream: true, the response is an SSE stream with 6 phase events: PLANNING → RETRIEVING → GENERATING (token-by-token) → VALIDATING → REPAIRING (if needed) → DONE.
GET /v1/agent/status Detailed agent status (StateMachine + session count)
{"state": "IDLE", "active_sessions": 2, "total_generations": 47, ...}
GET /v1/agent/metrics Aggregated metrics from MetricsCollector
{"total_generations": 47, "avg_tokens": 1380, "validation_pass_rate": 0.91, "repair_rate": 0.21, ...}
GET /v1/agent/events/stream SSE stream of realtime events from EventBus

Long-lived SSE connection. Receives all events that EventBus publishes: PLAN_CREATED, STEP_STARTED, CONTEXT_RETRIEVED, VALIDATION_COMPLETED, REPAIR_STARTED, GENERATION_COMPLETED. Useful for monitoring dashboards.

GET /v1/rag-context Debug: view the RAG context that will be built for a given class
GET /v1/rag-context?class_name=UserService&file_path=...
→ {snippets, token_count, mock_types, intelligence_available}

Rate Limiting

Default: 10 requests / 60 seconds per IP. Exceeding this returns HTTP 429. Configure via env: RATE_LIMIT_REQUESTS, RATE_LIMIT_WINDOW. Health check endpoints (/health, /v1/models) are exempt from rate limiting.

⚠️ Current rate limiter
In-memory per-process. In a multi-worker deployment (Gunicorn/uvicorn), each worker has its own rate limit — the effective limit is RATE_LIMIT_REQUESTS × MAX_WORKERS. Redis is needed for distributed rate limiting.
// Configuration

Configuration Guide — Detailed Configuration

The agent is configured in two layers: YAML files in config/ (defaults) and environment variables in .env (overrides). Environment variables always win over YAML.

Environment Variables (.env)

Qdrant — Vector Database

Variable | Default | Description
QDRANT_HOST | localhost | Hostname of the Qdrant server
QDRANT_PORT | 6333 | Qdrant REST API port
QDRANT_COLLECTION | java_codebase | Name of the collection storing vectors

vLLM — LLM Server

Variable | Default | Description
VLLM_BASE_URL | http://localhost:8000/v1 | Base URL of vLLM's OpenAI-compatible API
VLLM_MODEL | Qwen/Qwen2.5-Coder-7B-Instruct-AWQ | Model name being served by vLLM
VLLM_API_KEY | token-abc123 | API key for authenticating with vLLM

Embedding Model

Variable | Default | Description
EMBEDDING_MODEL | sentence-transformers/all-MiniLM-L6-v2 | Model producing 384-dim vectors. Use a local path for offline usage.
SENTENCE_TRANSFORMERS_HOME | ./models | HuggingFace model cache directory

Server & Performance

Variable | Default | Description
SERVER_HOST | 0.0.0.0 | Bind address
SERVER_PORT | 8080 | HTTP port
MAX_WORKERS | 4 | ThreadPoolExecutor workers for blocking I/O
REQUEST_TIMEOUT | 300 | Timeout (seconds) for a single generation request
LOG_LEVEL | INFO | DEBUG / INFO / WARNING / ERROR

Security & Rate Limiting

Variable | Default | Description
CORS_ORIGINS | * | Allowed CORS origins. Use * for dev, a specific list for prod.
RATE_LIMIT_REQUESTS | 10 | Maximum requests per window
RATE_LIMIT_WINDOW | 60 | Window duration (seconds)
DISABLE_SSL_VERIFY | false | Only set true in corporate proxy environments

agent.yaml — Details

Key | Default | Meaning
orchestrator.max_context_tokens | 4000 | Token budget for RAG context when ContextBuilder is unavailable
orchestrator.top_k_results | 10 | Number of chunks returned from Qdrant search
orchestrator.session_timeout | 3600 | Session expiry in seconds
prompt.test_constraints | list | Rules injected into the system prompt (JUnit5, Mockito, AAA...)
rules.layer_detection | patterns | Regex mapping class name → DDD layer (application/domain/infra)

rag.yaml — Details

Key | Default | Meaning
qdrant.vector_size | 384 | Must match the output dimension of the embedding model
qdrant.distance | Cosine | Similarity metric
embedding.batch_size | 32 | Number of chunks embedded in parallel per batch (indexing)
search.default_top_k | 10 | Default number of results to return
search.score_threshold | 0.5 | Filter out chunks with cosine score below this threshold

vllm.yaml — Details

Key | Default | Meaning
generation.temperature | 0.2 | Low = deterministic, high = creative; 0.2 suits code generation
generation.max_tokens | 4096 | Token limit for a single response
generation.top_p | 0.95 | Nucleus sampling threshold
retry.max_attempts | 3 | Retry count if the vLLM call fails (separate from the repair loop)
✌ Tip: tuning for best quality
Increase top_k_results (agent.yaml) to 15–20 for large repos. Increase max_tokens (vllm.yaml) to 8192 for classes with many methods. Lower temperature to 0.1 for more deterministic code.
// Developer Guide

Developer Guide — Extending & Contributing

Quick code orientation

To understand the system, read them in this order:

1
agent/plan.py
Core data model: ExecutionPlan, PlanStep, StepAction, StepStatus. Understand this struct thoroughly before reading any other file.
2
agent/state_machine.py
AgentState enum (6 states) and StateMachine managing transitions. Check the VALID_TRANSITIONS dict to see which states can transition to which.
3
agent/orchestrator.py
Central engine. Find _execute_plan() — this is the main while loop. Find _execute_step() — the dispatch table from StepAction enum to the executing method.
4
context/context_builder.py
4-step pipeline: intelligence → RAG → SnippetSelector → TokenOptimizer. Read the build_context() method from top to bottom.
5
agent/validation.py
ValidationPipeline.validate() runs 7 _pass_*() methods. Each pass returns list[ValidationIssue]. Easy to add new passes.
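The transition guard described in step 2 can be sketched like this (state names follow the diagram in §13; the contents of VALID_TRANSITIONS here are a plausible subset, not copied from the repo):

```python
from enum import Enum

class AgentState(Enum):
    IDLE = "idle"
    PLANNING = "planning"
    RETRIEVING = "retrieving"
    GENERATING = "generating"
    VALIDATING = "validating"
    REPAIRING = "repairing"
    COMPLETED = "completed"

class TransitionError(Exception):
    pass

# Assumed subset; the real dict lives in agent/state_machine.py
VALID_TRANSITIONS = {
    AgentState.IDLE: {AgentState.PLANNING},
    AgentState.PLANNING: {AgentState.RETRIEVING},
    AgentState.RETRIEVING: {AgentState.GENERATING},
    AgentState.GENERATING: {AgentState.VALIDATING},
    AgentState.VALIDATING: {AgentState.REPAIRING, AgentState.COMPLETED},
    AgentState.REPAIRING: {AgentState.GENERATING},
    AgentState.COMPLETED: {AgentState.IDLE},
}

class StateMachine:
    def __init__(self):
        self.state = AgentState.IDLE

    def transition(self, target: AgentState) -> None:
        if target not in VALID_TRANSITIONS[self.state]:
            raise TransitionError(f"{self.state.name} -> {target.name} not allowed")
        self.state = target
```

Raising on an invalid transition (rather than silently ignoring it) is what makes orchestrator bugs surface immediately instead of producing half-finished plans.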

Adding a new Anti-pattern to Pass 6

def _pass6_antipatterns(self, code: str, rag_chunks) -> list[ValidationIssue]:
    issues = []
    lines = code.split("\n")

    # ...existing checks...

    # ADD NEW: Detect Thread.sleep() in tests
    for i, line in enumerate(lines, 1):
        if "Thread.sleep(" in line:
            issues.append(ValidationIssue(
                severity=IssueSeverity.WARNING,
                pass_number=6,
                category="antipattern",
                message=f"Thread.sleep() at line {i} makes tests flaky",
                suggestion="Use Awaitility or mock the dependency causing delay",
                line_number=i
            ))

    return issues

Adding a new StepAction to ExecutionPlan

This is the canonical pattern for extending the pipeline:

# 1. Add enum to agent/plan.py
class StepAction(Enum):
    # ...existing...
    SAVE_TEST_FILE = "save_test_file"   # NEW

# 2. Create executor method in agent/orchestrator.py
def _step_save_test_file(self, plan: ExecutionPlan, step: PlanStep, ctx: dict):
    code = ctx["extracted_code"]
    output_path = self._resolve_test_path(ctx["file_path"])
    Path(output_path).write_text(code, encoding="utf-8")
    ctx["saved_path"] = output_path

# 3. Register in the dispatch table _execute_step()
def _execute_step(self, sm, plan, step, ctx):
    executor_map = {
        StepAction.EXTRACT_CLASS_INFO:  self._step_extract_class_info,
        # ...existing...
        StepAction.SAVE_TEST_FILE: self._step_save_test_file,  # NEW
    }
    executor = executor_map.get(step.action)
    if executor:
        executor(plan, step, ctx)

# 4. Add step in Planner
def plan_test_generation(self, ...) -> ExecutionPlan:
    plan.add_step(StepAction.RECORD_SESSION, "Record session")
    plan.add_step(StepAction.SAVE_TEST_FILE, "Save test file")  # NEW
    return plan

Running system tests

cd ai-agent
source venv/bin/activate

# Unit tests (no vLLM/Qdrant needed)
pytest tests/test_phase1.py -v      # StateMachine + Planner
pytest tests/test_phase2.py -v      # ContextBuilder + intelligence/
pytest tests/test_phase3_4.py -v    # Validation + Repair + Events + Metrics

# Generation quality benchmark
python benchmark.py --test-file benchmark/results/gen_quality_bench.json
⚠️ Current status
The tests/ directory only has placeholders. Most test files have no actual test logic yet. See §12 for the test-writing plan.

Advanced Local Development Setup

# Hot-reload when editing code
uvicorn main:app --reload --host 0.0.0.0 --port 8080 --log-level debug

# Run with debug-level structured logs (structlog's dev ConsoleRenderer pretty-prints)
LOG_LEVEL=DEBUG python main.py

# Check health after startup
curl -s http://localhost:8080/health | python -m json.tool

Docker Compose — Infrastructure

version: "3.8"
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
    volumes: ["./data/qdrant:/qdrant/storage"]

  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "Qwen/Qwen2.5-Coder-7B-Instruct-AWQ", "--gpu-memory-utilization", "0.85"]
    ports: ["8000:8000"]
    deploy:
      resources:
        reservations:
          devices: [{capabilities: ["gpu"]}]

  agent:
    build: .
    ports: ["8080:8080"]
    env_file: .env
    depends_on: [qdrant, vllm]
// Glossary

Glossary — Terms & Concepts

Architecture terminology

AAA Pattern
Arrange-Act-Assert — Unit test organization pattern with 3 sections: Arrange (configure mocks/data), Act (call the method under test), Assert (check results). The agent uses Pass 4 to ensure // Arrange, // Act, // Assert comments are clearly present.
Agent Loop
The while loop inside _execute_plan() in the orchestrator. Each iteration processes one pending PlanStep. The loop ends when there are no more pending steps or the plan reaches COMPLETED.
Auto-Repair Loop
When ValidationPipeline detects an ERROR, the orchestrator calls planner.plan_repair() to append REPAIR_CODE + GENERATE_CODE + VALIDATE_CODE steps to the running plan, then loops again. Maximum max_repair_attempts=2.
CodeChunk
Unit of data in Qdrant. Each chunk = 1 Java class + rich metadata: class_name, file_path, dependencies (FQN list), used_types, has_builder, record_components, layer.
ContextResult
Output of ContextBuilder.build_context(). Contains: snippets (priority-ordered list), rag_chunks, token_count, mock_types, intelligence_available, elapsed_ms.
DDD Layer
Domain-Driven Design layers. The agent recognizes: application (*Service, *UseCase, *Handler), domain (*Entity, *ValueObject, *Aggregate), infrastructure (*Repository, *Adapter, *Client). The layer influences which types of mocks are generated.
DependencyAnalyzer
Component in intelligence/ that merges FileGraph + SymbolMap to return a TestContext. Analyzes the AST graph to know exactly which classes need to be mocked — no guessing.
EventBus
In-process pub/sub. The orchestrator publishes an Event at each step. EventTypes: PLAN_CREATED, STEP_STARTED, STEP_COMPLETED, CONTEXT_RETRIEVED, VALIDATION_COMPLETED, REPAIR_STARTED, GENERATION_COMPLETED.
ExecutionPlan
Dataclass storing the list of PlanSteps to execute. The plan can be extended at runtime (append repair steps). plan.can_repair is the single source of truth for repair limit.
FileGraph
Directed graph in intelligence/ built on import relationships between Java files. Traverse the graph to find dependencies, dependents, transitive closures.
Graceful Degradation
The system continues to operate when one component is unavailable. Specifically: if intelligence/ is not available, ContextBuilder falls back to RAG-only. If RAG fails, falls back to inline source parsing.
MiniLM-L6-v2
sentence-transformers/all-MiniLM-L6-v2 — 22M-parameter (~90 MB) embedding model producing 384-dim dense vectors. Used for both indexing (building the Qdrant collection) and search (embedding queries). Runs in-process inside FastAPI.
Priority Tiers
Snippet priority system: P1 (target class, 100%) → P2 (mockable deps, 100%) → P3 (domain types, truncate) → P4 (interfaces, heavy truncate) → P5 (transitive, drop first). TokenOptimizer removes from P5 downward when over budget.
Qwen2.5-Coder-7B
Primary LLM model. The AWQ (4-bit quantized) version needs ~14 GB VRAM. Accessed via vLLM OpenAI-compatible API on port 8000. Used for both initial generation and repair.
RAG
Retrieval-Augmented Generation. Instead of relying solely on the LLM's knowledge, the agent retrieves relevant code snippets from Qdrant and injects them into the prompt before generating.
RepoSnapshot
Object in intelligence/repo_scanner.py containing all Java repo information after scanning: lookup by class name (O(1)), by FQN, by file path. A static snapshot of the repo.
RepairPlan
Output of RepairStrategySelector.build_repair_plan(). Contains: list of issues to fix by category, system prompt for repair, user prompt with the broken code + specific fix instructions.
SnippetSelector
Component in context/snippet_selector.py that classifies CodeChunks into 5 priority tiers based on their relationship to the target class (target/mock/domain/interface/transitive).
SSE
Server-Sent Events. HTTP streaming where the server pushes data to the client continuously. Used for token-by-token streaming and phase events. Tabby IDE receives the SSE stream from /v1/chat/completions when stream: true.
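On the wire, each SSE frame is a `data:` line terminated by a blank line; a sketch of how the two frame kinds described above (token chunks and phase events) might be formatted. The chunk shape follows the OpenAI streaming convention; the `event: phase` field is an assumption.

```python
# Illustrative SSE frame formatting (field names are assumptions).
import json

def sse_token(token: str) -> str:
    # OpenAI-style streaming chunk carrying one generated token.
    chunk = {"choices": [{"delta": {"content": token}}]}
    return f"data: {json.dumps(chunk)}\n\n"

def sse_phase(phase: str) -> str:
    # Hypothetical named phase event (PLANNING, RETRIEVING, ..., DONE).
    return f"event: phase\ndata: {json.dumps({'phase': phase})}\n\n"

frame = sse_token("assertEquals")
```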
StateMachine
IDLE → PLANNING → RETRIEVING → GENERATING → VALIDATING → REPAIRING → COMPLETED. Each transition is validated against the VALID_TRANSITIONS dict. Raises TransitionError for invalid transitions.
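A sketch of the transition table and guard; the state names come from the doc, but the exact edge set (e.g. REPAIRING looping back to VALIDATING) is an assumption.

```python
# Illustrative StateMachine sketch; the exact transition set is an assumption.
class TransitionError(Exception):
    pass

VALID_TRANSITIONS = {
    "IDLE": {"PLANNING"},
    "PLANNING": {"RETRIEVING"},
    "RETRIEVING": {"GENERATING"},
    "GENERATING": {"VALIDATING"},
    "VALIDATING": {"REPAIRING", "COMPLETED"},
    "REPAIRING": {"VALIDATING"},      # assumed: re-validate after a repair
    "COMPLETED": set(),
}

class StateMachine:
    def __init__(self) -> None:
        self.state = "IDLE"

    def transition(self, new_state: str) -> None:
        # Every transition is checked against the table; invalid -> raise.
        if new_state not in VALID_TRANSITIONS[self.state]:
            raise TransitionError(f"{self.state} -> {new_state}")
        self.state = new_state

sm = StateMachine()
for s in ("PLANNING", "RETRIEVING", "GENERATING", "VALIDATING"):
    sm.transition(s)
```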
SymbolMap
Global symbol table in intelligence/: maps class → methods/fields, method → class, field type → injectors, annotation → classes. Used to find dependency injection points.
TestContext
Output of DependencyAnalyzer.test_context_for(class_name). Contains: mocks (list of class names needing @Mock), domain_types (list of value objects/entities), layer (application/domain/infra).
Token Budget
Token limit for context sent to the LLM: default 6000 tokens. TokenOptimizer estimates ~4 chars/token, trims or drops lower-priority snippets to stay within budget.
tree-sitter
Library that parses source code into an AST (Abstract Syntax Tree) quickly and accurately. indexer/parse_java.py uses tree-sitter to extract class structure, methods, and dependencies from .java files.
ValidationIssue
A single issue detected by ValidationPipeline. Contains: severity (ERROR/WARNING/INFO), pass_number, category, message, suggestion, line_number.
ValidationPipeline
7-pass validator: P1 Structural, P2 Forbidden, P3 Required, P4 AAA, P5 Quality, P6 Anti-patterns, P7 RAG-aware. passed = not any(i.severity == ERROR for i in issues). On failure → raises _ValidationFailed exception.
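The severity-aware pass/fail rule quoted above (only ERROR blocks; WARNING/INFO are reported but non-blocking) can be sketched as:

```python
# Sketch of the severity-aware passed() rule; dataclass fields follow the
# ValidationIssue entry above, the enum values are assumptions.
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"

@dataclass
class ValidationIssue:
    severity: Severity
    pass_number: int
    category: str
    message: str

def passed(issues: list[ValidationIssue]) -> bool:
    # Only ERROR-level issues fail the pipeline.
    return not any(i.severity == Severity.ERROR for i in issues)

issues = [
    ValidationIssue(Severity.WARNING, 5, "quality", "test method name is vague"),
    ValidationIssue(Severity.INFO, 4, "aaa", "no explicit Arrange comment"),
]
ok = passed(issues)                   # warnings alone do not fail the run
```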
vLLM
High-performance LLM inference server with PagedAttention. The agent calls vLLM via the OpenAI-compatible HTTP API. Supports both blocking (generate()) and streaming (stream_generate()).
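A sketch of the request payload such an OpenAI-compatible call would carry; the port and endpoint follow the doc, while the exact model identifier and sampling parameters are assumptions.

```python
# Illustrative payload builder for vLLM's OpenAI-compatible endpoint.
import json

VLLM_URL = "http://localhost:8000/v1/chat/completions"   # port 8000, per the doc

def build_request(system: str, user: str, stream: bool = False) -> dict:
    return {
        "model": "Qwen2.5-Coder-7B-AWQ",   # assumed model id for the AWQ build
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": stream,                  # True -> token-by-token SSE chunks
        "temperature": 0.2,                # assumed sampling setting
    }

payload = build_request("You write JUnit 5 + Mockito tests.",
                        "Generate tests for OrderService.", stream=True)
body = json.dumps(payload)                 # what an HTTP client would POST
```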