Repository Overview
Purpose
A self-hosted AI coding agent purpose-built to automatically generate JUnit 5 + Mockito unit tests for large Java repositories that follow DDD architecture. It solves a real-world problem: manually writing tests for hundreds of services, repositories, and domain objects is time-consuming, error-prone, and inconsistent.
Key Capabilities
- Structural Intelligence — Graph-based AST analysis, knows exactly which classes need to be mocked
- RAG Hybrid Context — Vector search + dependency graph traversal in parallel (ThreadPoolExecutor)
- Token Budget — Priority-based snippet selection P1→P5; TokenOptimizer enforces a 6000-token budget
- Plan-driven StateMachine — ExecutionPlan with 8 StepActions, driven by AgentState machine
- 7-Pass Validation — Severity-aware (ERROR/WARNING/INFO), RAG-aware construction check (Pass 7)
- Auto-repair Loop — Self-corrects validation errors up to 2 times with RepairStrategySelector
- Streaming API — Token-by-token SSE with 6 phase events (PLANNING→DONE)
- EventBus + Metrics — Publishes events at every step, MetricsCollector, structlog
- OpenAI-compatible API — /v1/chat/completions for Tabby IDE
The intelligence/ layer provides ground truth from the AST graph instead of letting the LLM guess dependencies.
Request → FastAPI → AgentOrchestrator
→ Planner.plan_test_generation() → ExecutionPlan {8 steps}
→ StateMachine: IDLE→PLANNING→RETRIEVING→GENERATING→VALIDATING→COMPLETED
→ ContextBuilder (context/)
→ DependencyAnalyzer (intelligence/) ← exact mocks from AST graph
→ RAGClient (rag/) + ThreadPoolExecutor parallel fetch deps
→ SnippetSelector (P1→P5 priority tiers)
→ TokenOptimizer (budget 6000 tokens)
→ PromptBuilder → vLLM / Qwen2.5-Coder-7B
→ ValidationPipeline (7 passes, ERROR/WARNING/INFO)
→ RepairStrategySelector → repair loop (max 2x, plan_repair())
→ EventBus.publish() + MetricsCollector
→ GenerationResult {test_code, validation_summary, plan_summary, repair_attempts}
Architecture Guide — What Each Subsystem Does
The system is divided into 7 clearly-defined functional layers. Each layer has a single responsibility and does not encroach on the responsibilities of the others.
intelligence/ — FileGraph (import edges) + SymbolMap (field/method table) feed DependencyAnalyzer, which merges both graphs and returns a TestContext. 4 files, 4 distinct responsibilities:
- RepoScanner walks the .java files, parses each one, and creates a RepoSnapshot with O(1) lookups by class name / FQN / file path. Scan only — no analysis.
- DependencyAnalyzer takes a class_name, merges FileGraph + SymbolMap, and returns TestContext {mocks, domain_types, layer}. This is the only answer the rest of the system needs. Its outputs are TestContext & ImpactReport.
- The parser extracts from each .java file: class_name, FQN, methods list, fields, dependencies (FQN list), used_types, has_builder, java_type (record/class/interface), record_components, DDD layer. This metadata is critical — Pass 7 validation relies on it.

indexer/ — Runs only on demand (POST /reindex; recreate=True to wipe and rebuild) and does not participate in the real-time generation flow. Output: a fully populated Qdrant collection of CodeChunk objects containing both source code and metadata.

rag/ — Searches by class_name + semantic similarity. Returns source + dependencies list from the payload, and exposes the dep_simple_names list from the main chunk's payload; context/ will then fetch each dep in parallel.

RAG vs intelligence — RAG finds code "semantically close to X" (may miss some). intelligence/ finds "exactly which classes X depends on" (AST, 100%). The system uses both: intelligence/ for the precise mock list, RAG for the source code of those classes.

context/ — Assembles a ContextResult using Snippet priority logic. Mock candidates come from TestContext.mocks (intelligence/); the LLM must understand their interface to mock correctly. Output: ContextResult.

agent/ — Runs an ExecutionPlan (list of steps). StateMachine tracks state. A while loop runs each PlanStep. ValidationPipeline checks. RepairStrategySelector appends repair steps if validation fails. agent/ has 10 modules — one responsibility each:
| Module | Responsibility | Input | Output |
|---|---|---|---|
| orchestrator.py | Full coordination. Sole entry point from server. | GenerationRequest | GenerationResult |
| planner.py | Decides what to do (plan steps). Separated from execution. | request metadata | ExecutionPlan |
| plan.py | Data model: PlanStep, ExecutionPlan, StepAction enum. No logic. | — | Data structures |
| state_machine.py | Tracks current state. Prevents invalid transitions. | transition event | AgentState |
| validation.py | Checks 7 quality criteria. Classifies issues by severity. | Java test code string | ValidationResult |
| repair.py | Decides how to fix based on the type of validation error. | list[ValidationIssue] | RepairPlan |
| prompt.py | Builds messages[] sent to LLM. Controls format. | context + rules + history | messages list |
| memory.py | Stores conversation history + generated tests per session. | session events | SessionMemory |
| events.py | Pub/sub bus. Publishes events at each important step. | Event objects | SSE stream (downstream) |
| metrics.py | Collects statistics: tokens, chunks, validation pass rate. | Events | MetricsReport |
vllm/ — Forwards /v1/chat/completions to vLLM on :8000 (OpenAI format). vLLM uses PagedAttention for efficient serving. AWQ = 4-bit quantization, reduces VRAM ~4x. Request timeout comes from .env (REQUEST_TIMEOUT seconds).

server/ — Parses each request into a GenerationRequest and calls the orchestrator in a thread pool. SessionManager manages the session lifecycle: creates a UUID, expires sessions after 1h, and runs a background cleanup task every 5 minutes.

Summary: Who does what?
indexer/ — “The library”: converts code into searchable vectors.
rag/ — “The librarian”: given a query, returns the exact relevant pages.
context/ — “The editor”: curates from a large body of content, keeping the most important within the page limit.
agent/ — “Brain / conductor”: plans, coordinates, quality-controls, self-repairs when wrong.
vllm/ — “The pen”: given full instructions, writes the actual code.
server/ — “The receptionist”: accepts requests from outside, routes to the right person, returns results.
Project Structure UPDATED
Full Dependency Map
server/api.py
→ agent/orchestrator.py
→ agent/planner.py → agent/plan.py (ExecutionPlan, PlanStep, StepAction)
→ agent/state_machine.py (AgentState, StateMachine, TransitionError)
→ context/context_builder.py [optional — graceful degradation]
→ intelligence/dependency_analyzer.py
→ intelligence/repo_scanner.py (RepoSnapshot, O(1) lookup)
→ intelligence/file_graph.py (FileGraph, transitive closure)
→ intelligence/symbol_map.py (SymbolMap, global symbol table)
→ rag/client.py
→ context/snippet_selector.py (5 priority tiers)
→ context/token_optimizer.py (budget-aware truncation)
→ agent/prompt.py (PromptBuilder)
→ vllm/client.py (VLLMClient)
→ agent/validation.py (ValidationPipeline — 7 passes)
→ agent/repair.py (RepairStrategySelector, RepairPlan)
→ agent/memory.py (MemoryManager, SessionMemory)
→ agent/events.py (EventBus, Event, EventType)
→ agent/metrics.py (MetricsCollector)
Execution Flow UPDATED
A — Indexing Flow (runs once / reindex)
Output goes into the java_codebase Qdrant collection. Rich metadata enables intelligence/ and Pass 7 to operate correctly.
B — Test Generation Flow (per request)
Agent Loop UPDATED
StateMachine States
IDLE
  | transition_to(PLANNING)
  v
PLANNING   ← Planner.plan_test_generation() → ExecutionPlan
  | transition_to(RETRIEVING)
  v
RETRIEVING ← ContextBuilder.build_context() or _get_rag_context()
  | transition_to(GENERATING)
  v
GENERATING ← vllm.generate() / stream_generate() ←―――――――――――――――――――――+
  | transition_to(VALIDATING)                                           |
  v                                                                     |
VALIDATING ← ValidationPipeline.validate(code, rag_chunks)              |
  | passed → transition_to(COMPLETED)                                   |
  | failed → _ValidationFailed → transition_to(REPAIRING)               |
  v                                                                     |
REPAIRING  ← RepairStrategySelector.build_repair_plan()                 |
  | planner.plan_repair() appends REPAIR_CODE + GENERATE_CODE steps     |
  +―――――――――――――――――――――― continue loop ――――――――――――――――――――――――――――――+
  | current_repair_attempt >= max_repair_attempts → COMPLETED (with issues)
  v
COMPLETED
On repair, the orchestrator calls planner.plan_repair() to append new steps to the running ExecutionPlan, then continues the while loop. No code duplication between the generate path and the repair path.
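The transition discipline described above can be sketched as a minimal state machine. The names AgentState, StateMachine, and TransitionError come from the dependency map earlier in this document; the allowed-transition table itself is an assumption inferred from the diagram, not the real implementation.

```python
from enum import Enum, auto

class AgentState(Enum):
    IDLE = auto()
    PLANNING = auto()
    RETRIEVING = auto()
    GENERATING = auto()
    VALIDATING = auto()
    REPAIRING = auto()
    COMPLETED = auto()

class TransitionError(Exception):
    pass

# Hypothetical transition table: REPAIRING loops back to GENERATING;
# VALIDATING can either complete or fall into REPAIRING.
_ALLOWED = {
    AgentState.IDLE: {AgentState.PLANNING},
    AgentState.PLANNING: {AgentState.RETRIEVING},
    AgentState.RETRIEVING: {AgentState.GENERATING},
    AgentState.GENERATING: {AgentState.VALIDATING},
    AgentState.VALIDATING: {AgentState.COMPLETED, AgentState.REPAIRING},
    AgentState.REPAIRING: {AgentState.GENERATING, AgentState.COMPLETED},
    AgentState.COMPLETED: set(),
}

class StateMachine:
    def __init__(self) -> None:
        self.state = AgentState.IDLE

    def transition_to(self, target: AgentState) -> None:
        # Reject any transition not whitelisted for the current state
        if target not in _ALLOWED[self.state]:
            raise TransitionError(f"{self.state.name} -> {target.name} is invalid")
        self.state = target
```

The point of the whitelist is that a bug in the orchestrator (e.g. validating before generating) fails loudly with TransitionError instead of silently producing a half-built result.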
Plan-driven Execution Engine — Core Loop
plan = self.planner.plan_test_generation(...)
while True:
    step = plan.get_next_pending_step()
    if step is None:
        break  # all steps done
    step.start()
    self.event_bus.publish(Event(type=STEP_STARTED, ...))
    try:
        self._execute_step(sm, plan, step, ctx)
        step.complete()
    except _ValidationFailed as vf:
        if plan.can_repair:
            sm.transition_to(AgentState.REPAIRING, ...)
            self.planner.plan_repair(plan, vf.issues, ctx["extracted_code"])
            sm.transition_to(AgentState.GENERATING, repair=True)
            continue  # continue loop with newly appended repair steps
        else:
            ctx["validation_passed"] = False  # accept with issues
Streaming Events
yield StreamEvent(phase=PLANNING, content="📋 Planning...")
yield StreamEvent(phase=RETRIEVING, content="🔍 Searching...", metadata={chunks_count})
yield StreamEvent(phase=GENERATING, content=token, delta=True) # token-by-token
yield StreamEvent(phase=VALIDATING, content="🔎 Validating...", metadata={passed, errors})
yield StreamEvent(phase=REPAIRING, content="🔧 Repair 1/2...")
yield StreamEvent(phase=DONE, metadata={tokens_used, rag_chunks_used, ...})
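On the client side, those phase events arrive as an SSE stream. Below is a minimal sketch of consuming it: a `data:`-line parser plus a helper that reassembles the generated code from GENERATING deltas. The field names (phase, content, delta) mirror the StreamEvent shape above, but the exact wire framing (one JSON object per `data:` line, OpenAI-style `[DONE]` terminator) is an assumption.

```python
import json
from typing import Iterable, Iterator

def parse_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one JSON payload per SSE 'data:' line, skipping keep-alives.

    Assumes each event is serialized as a single `data: {...}` line —
    a common SSE framing; the agent's actual format may differ.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # comments, blank keep-alive lines, etc.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # OpenAI-style stream terminator
            return
        yield json.loads(payload)

def collect_code(lines: Iterable[str]) -> str:
    """Reassemble generated code from token-by-token GENERATING deltas."""
    return "".join(
        ev.get("content", "")
        for ev in parse_sse_events(lines)
        if ev.get("phase") == "GENERATING" and ev.get("delta")
    )
```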
Model Integration
| Model | Size | Role | Access |
|---|---|---|---|
| Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Test generation + repair | vLLM OpenAI-compat :8000 |
| sentence-transformers/all-MiniLM-L6-v2 | 22M | Embedding for RAG indexing + search | In-process SentenceTransformers |
Prompt Construction
[SYSTEM]
You are a Java test engineer.
{rules: JUnit5, Mockito, AAA pattern, no Spring context, no @SpringBootTest, ...}
{unfound_types: "These types not in index — mock them without source: [X, Y]"}
[CONTEXT — from ContextBuilder, priority-ordered, token-optimized]
// P1 — AuthUseCaseService.java (target, 100% kept)
{source code}
// P2 — OpenAPIRepository.java (mockable dep — from intelligence/)
{source code}
// P3 — UserProfile.java (domain type — record, no @Builder)
{source code, truncated if over budget}
...
[HISTORY — if refinement / session has history]
User: "Generate tests for AuthUseCaseService"
Assistant: {previous test code}
User: "Add null input tests"
[USER]
{task_description}. File: {file_path}
Response Parsing — _extract_code()
# Level 1: ```java ... ``` code block (preferred)
pattern = r"```(?:java)?\s*\n(.*?)```"
matches = re.findall(pattern, response, re.DOTALL)
if matches: return max(matches, key=len).strip()
# Level 2: detect class declaration directly
pattern = r"((?:import.*?\n)*\s*(?:@\w+.*?\n)*\s*(?:public\s+)?class\s+\w+.*?\{.*\})"
class_match = re.search(pattern, response, re.DOTALL)
if class_match: return class_match.group(1).strip()
return response.strip() # fallback
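The two-level extraction can be exercised directly. Below is a runnable wrapper around the exact regexes shown above; the function name is illustrative (the real method is `_extract_code()` on the orchestrator).

```python
import re

def extract_code(response: str) -> str:
    """Two-level extraction mirroring _extract_code() described above."""
    # Level 1: fenced ```java block — take the longest match
    matches = re.findall(r"```(?:java)?\s*\n(.*?)```", response, re.DOTALL)
    if matches:
        return max(matches, key=len).strip()
    # Level 2: bare class declaration (imports + annotations + class body)
    m = re.search(
        r"((?:import.*?\n)*\s*(?:@\w+.*?\n)*\s*(?:public\s+)?class\s+\w+.*?\{.*\})",
        response,
        re.DOTALL,
    )
    if m:
        return m.group(1).strip()
    return response.strip()  # fallback: return the whole response
```

Taking the longest fenced match at Level 1 matters when the model emits a short explanatory snippet before the full test class.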
Tool System UPDATED
No dynamic tool-calling via LLM. Tools are fixed step executors dispatched by the orchestrator according to the StepAction enum from ExecutionPlan:
| StepAction | Executor | Description |
|---|---|---|
| EXTRACT_CLASS_INFO | _step_extract_class_info() | Extract class name from file path |
| RETRIEVE_CONTEXT | _step_retrieve_context() | ContextBuilder or RAG fallback, parallel fetch |
| BUILD_PROMPT | _step_build_prompt() | PromptBuilder with session history |
| GENERATE_CODE | _step_generate_code() | vllm.generate(), check response.success |
| EXTRACT_CODE | _step_extract_code() | Regex 2-level extraction |
| VALIDATE_CODE | _step_validate_code() | ValidationPipeline 7 passes, raise _ValidationFailed |
| RECORD_SESSION | _step_record_session() | session.record_generated_test() |
| REPAIR_CODE | _step_repair_code() | RepairStrategySelector → rebuild repair prompt |
Fallback Hierarchy — 3-tier degradation
# Tier 1: Full ContextBuilder (intelligence + RAG + priority + token budget)
if self.context_builder:
    context_result = self.context_builder.build_context(...)
# Tier 2: RAG-only with graph traversal (parallel fetch)
else:
    rag_chunks = self._get_rag_context(class_name, file_path, session)

# Tier 3: Inline source parsing (regex from Java source sent with request)
if not types_to_fetch and inline_source:
    fallback_types = self._extract_types_from_source(inline_source, class_name)
Memory System
| Type | Storage | Scope | Persist? |
|---|---|---|---|
| Session conversation | Python dict — MemoryManager in-process | Per session UUID | ❌ Lost on restart |
| RAG context cache | Session-level dict (key: "class:file") | Per session | ❌ In-memory |
| generated_tests list | SessionMemory.generated_tests[] | Per session | ❌ In-memory |
| Code index (RAG) | Qdrant (disk-backed) | Entire Java repo | ✅ Yes |
RAG context is cached per-session by key "ClassName:file_path" so refinement requests don’t need to re-query Qdrant. Session memory also stores a generated_tests list — each entry has class_name, test_code, and a success flag.
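The per-session cache and generated-tests log described above can be sketched as a small in-process structure. The attribute names mirror the table (rag_cache is a name assumed here; the document only specifies the "ClassName:file_path" key format).

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class SessionMemory:
    """Minimal sketch of the per-session memory described above."""
    rag_cache: dict[str, Any] = field(default_factory=dict)
    generated_tests: list[dict] = field(default_factory=list)

    def cached_rag_context(self, class_name: str, file_path: str,
                           fetch: Callable[[], Any]) -> Any:
        key = f"{class_name}:{file_path}"  # cache key format from the doc
        if key not in self.rag_cache:
            self.rag_cache[key] = fetch()  # only hit Qdrant on a cache miss
        return self.rag_cache[key]

    def record_generated_test(self, class_name: str, test_code: str,
                              success: bool) -> None:
        self.generated_tests.append(
            {"class_name": class_name, "test_code": test_code, "success": success}
        )
```

A refinement request for the same class/file pair is then served from the dict, which is why refinements do not re-query Qdrant.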
Data Flow UPDATED
Indexing Data Flow
Java Repo (.java files)
→ tree-sitter parse_java.py
→ ClassInfo {class_name, FQN, methods, fields,
dependencies (FQN list), used_types (simple names),
has_builder, java_type, record_components, DDD layer}
→ summarize.py
→ MiniLM-L6-v2 → 384-dim vector
→ Qdrant upsert: vector + full payload (rich metadata for Pass 7 + intelligence/)
Generation Data Flow (full detail)
HTTP {file_path, class_name, task, session_id}
→ server/api.py → AgentOrchestrator.generate_test()
→ Planner.plan_test_generation() → ExecutionPlan
→ _execute_plan() while loop:
[EXTRACT_CLASS_INFO] file_path → class_name
[RETRIEVE_CONTEXT]
ContextBuilder.build_context(class_name, file_path, max_tokens=6000)
→ DependencyAnalyzer.test_context_for(class_name)
→ TestContext {mocks: [...], domain_types: [...], layer: "service"}
→ rag.search_by_class(class_name, top_k=1) ← main chunk
→ extract dep_simple_names from main_chunk.dependencies (FQN)
→ extract used_types from main_chunk.used_types
→ types_to_fetch = dep_simple_names | used_types
→ ThreadPoolExecutor(max_workers=5):
parallel rag.search_by_class(dep) for dep in types_to_fetch
→ unfound_types → main_chunk.unfound_types
→ SnippetSelector: P1(target)→P2(mocks)→P3(domain)→P4(iface)→P5(trans)
→ TokenOptimizer: P1 keep 100%, P5 drop first if over budget
→ ContextResult {snippets, rag_chunks, token_count, mock_types}
EventBus.publish(CONTEXT_RETRIEVED)
[BUILD_PROMPT]
PromptBuilder.build_test_generation_prompt(context_result, session)
[GENERATE_CODE]
vllm.generate(system, user) → GenerationResponse
EventBus.publish(STEP_COMPLETED)
[EXTRACT_CODE] _extract_code(response) → Java string
[VALIDATE_CODE]
ValidationPipeline.validate(code, rag_chunks) ← 7 passes
→ ValidationResult {errors, warnings, infos}
→ passed? → RECORD_SESSION
→ failed? → _ValidationFailed(issues, validation_result)
→ plan.can_repair? → plan_repair() → loop
→ else → accept with issues
EventBus.publish(VALIDATION_COMPLETED)
[RECORD_SESSION]
session.add_assistant_message(code)
session.record_generated_test(class_name, code, success)
→ StateMachine.transition_to(COMPLETED)
→ EventBus.publish(GENERATION_COMPLETED)
→ GenerationResult
intelligence/ — Structural Graph Intelligence NEW
intelligence/ returns “exactly what X needs to mock” by traversing the AST graph — no guessing, no approximation.
4 Components
| File | Class | Function |
|---|---|---|
| repo_scanner.py | RepoScanner | Scans repo with JavaParser → RepoSnapshot with O(1) lookup by name / FQN / file path |
| file_graph.py | FileGraph | Directed graph based on import relationships. Finds dependencies, dependents, transitive closures |
| symbol_map.py | SymbolMap | Global symbol table: class→methods/fields, method→classes, field_type→injectors, annotation→classes |
| dependency_analyzer.py | DependencyAnalyzer | Merges FileGraph + SymbolMap → TestContext and ImpactReport |
Key Queries
analyzer = DependencyAnalyzer(repo_scanner, file_graph, symbol_map)
# "What does AuthUseCaseService need to mock?"
ctx = analyzer.test_context_for("AuthUseCaseService")
# ctx.mocks → ["OpenAPIRepository", "UserQueryService", ...] ← 100% accurate
# ctx.domain_types → ["UserProfile", "JwtToken", ...]
# ctx.layer → "service"
# "If UserProfile changes, what is affected?"
report = analyzer.impact_of("UserProfile")
# report.direct_dependents → ["AuthUseCaseService", "UserUseCase"]
# report.transitive_dependents → [...]
RAG vs Intelligence — Direct Comparison
| RAG vector search | intelligence/ graph | |
|---|---|---|
| Finds | Semantically similar code chunks | Exact dependencies from AST |
| Answer | “Code related to X” | “X needs to mock A, B, C” |
| Mechanism | Cosine similarity | Graph traversal |
| Mock accuracy | Estimated, may miss some | 100%, complete |
| Compile result | May miss mocks → NullPointer | Full mock list → compiles immediately |
context/ — Smart Context Assembly NEW
context/ solves the problem: token budget is limited (6000 tokens) — how do you select the most important content rather than dumping everything in?
4-step Pipeline
ContextBuilder.build_context("AuthUseCaseService", file_path, max_tokens=6000)
|
+-- [1] Intelligence Layer (optional — graceful degradation)
| DependencyAnalyzer.test_context_for("AuthUseCaseService")
| → TestContext {mocks: [...], domain_types: [...], layer: "service"}
|
+-- [2] RAG Search
| rag.search_by_class(include_dependencies=True)
| + parallel ThreadPoolExecutor fetch per dep (from TestContext.mocks)
| → List[CodeChunk]
|
+-- [3] SnippetSelector — Priority tiers
| P1 ████████ target source AuthUseCaseService.java (keep 100%)
| P2 ██████ mockable deps OpenAPIRepository, ... (keep 100%)
| P3 ████ domain types UserProfile, JwtToken (truncate)
| P4 ██ interfaces IUserRepository (hard truncate)
| P5 █ transitive deps indirect dependencies (drop first)
|
+-- [4] TokenOptimizer — Budget-aware (~4 chars/token)
→ ContextResult {
snippets: list[Snippet] (priority-ordered)
rag_chunks: list[CodeChunk]
token_count: int
mock_types: list[str]
intelligence_available: bool
elapsed_ms: float
}
ContextBuilder(rag_client, intelligence=None) — if intelligence/ is unavailable, the system still runs in RAG-only mode. No crash, only reduced mock accuracy.
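The budget step can be sketched with the ~4 chars/token heuristic and the P1→P5 drop order described above. This is a simplified sketch — the real TokenOptimizer also truncates P3/P4 snippets rather than only dropping whole ones, and the class and function names here are assumptions.

```python
from dataclasses import dataclass

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # the ~4 chars/token heuristic used throughout

@dataclass
class Snippet:
    name: str
    priority: int  # 1 (target, always kept) .. 5 (transitive, dropped first)
    source: str

def fit_to_budget(snippets: list[Snippet], max_tokens: int = 6000) -> list[Snippet]:
    """Keep snippets in priority order; lowest-priority ones fall out first."""
    kept, used = [], 0
    for s in sorted(snippets, key=lambda s: s.priority):
        cost = estimate_tokens(s.source)
        if s.priority == 1 or used + cost <= max_tokens:
            kept.append(s)  # P1 is always kept at 100%
            used += cost
    return kept
```

Because selection iterates from P1 upward, an over-budget context sheds P5 transitive deps first and never sacrifices the target class source.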
agent/orchestrator.py — Detail NEW
4 Development phases within the same file
| Phase | Added | Components |
|---|---|---|
| Phase 1 | StateMachine + Planner | StateMachine, Planner, ExecutionPlan, PlanStep |
| Phase 2 | Optional ContextBuilder | ContextBuilder, 3-tier graceful degradation |
| Phase 3 | Validation + Repair | ValidationPipeline, RepairStrategySelector, _ValidationFailed |
| Phase 4 | EventBus + Metrics | EventBus, MetricsCollector, structlog structured logging |
Shared Context Dict (ctx)
ctx = {
"class_name": str,
"file_path": str,
"session": Optional[SessionMemory],
"rag_chunks": list[CodeChunk],
"context_result": Optional[ContextResult],
"system_prompt": str,
"user_prompt": str,
"full_response": str,
"extracted_code": str,
"validation_result": Optional[ValidationResult],
"validation_passed": bool,
"validation_issues": list[str],
"tokens_used": int, # len(full_response) // 4 (estimate)
"repair_plan": Optional[RepairPlan],
}
agent/validation.py — ValidationPipeline 7 Passes NEW
passed = not any(i.severity == ERROR for i in issues)
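That one-liner is the whole pass/fail rule: only ERROR-severity issues fail a run. A minimal sketch with an assumed ValidationIssue shape (the real class may carry more fields, e.g. the pass number):

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    ERROR = "error"      # fails validation, triggers the repair loop
    WARNING = "warning"  # reported, but the test is still accepted
    INFO = "info"        # advisory only

@dataclass
class ValidationIssue:
    severity: Severity
    message: str

def validation_passed(issues: list[ValidationIssue]) -> bool:
    # Same rule as the one-liner above: only ERROR severity fails the run
    return not any(i.severity is Severity.ERROR for i in issues)
```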
Pass 6 — 8 Common Anti-patterns
Pass 7 — _extract_builder_chain_fields() Mini-parser
# Problem: a simple regex would confuse nested calls with fields
#   UserProfile.builder()
#       .id(UUID.fromString("abc"))   ← UUID.fromString is nested, NOT a field
#       .name("test")                 ← name IS a real field
# Solution: track parenthesis depth
depth = 0
i = 0
while i < length:
    c = chain[i]
    if c == '(':
        depth += 1
    elif c == ')':
        depth -= 1
    elif c == '.' and depth == 0:  # only match at depth == 0
        name = match_next_method()  # reads the identifier after the dot
        if name not in ("build", "toBuilder"):
            fields.append(name)  # ← this is a real field
    i += 1
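The depth-tracking idea can be made fully runnable. In this sketch the function name and the exact exclusion list are assumptions ("builder" is excluded too, since the chain text includes the leading `.builder()` call), and string literals containing parentheses are not handled — the real mini-parser may differ.

```python
import re

def extract_builder_chain_fields(chain: str) -> list[str]:
    """Depth-tracking sketch of the Pass 7 mini-parser described above.

    Only method names at parenthesis depth 0 are builder fields; anything
    inside an argument list (like UUID.fromString) is nested and skipped.
    """
    fields = []
    depth = 0
    i = 0
    while i < len(chain):
        c = chain[i]
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
        elif c == "." and depth == 0:
            m = re.match(r"\.(\w+)", chain[i:])  # identifier after the dot
            if m and m.group(1) not in ("builder", "build", "toBuilder"):
                fields.append(m.group(1))  # a real builder field
        i += 1
    return fields
```

Pass 7 can then compare the returned field names against the record_components / fields metadata stored at indexing time and flag any field the builder never defined.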
agent/plan.py — ExecutionPlan NEW
@dataclass
class PlanStep:
    step_id: int          # 1, 2, 3, ...
    action: StepAction    # enum — no magic strings
    description: str
    params: dict          # input for the executor
    status: StepStatus    # PENDING/IN_PROGRESS/COMPLETED/FAILED/SKIPPED
    result: Any           # output available to subsequent steps
    started_at: float
    completed_at: float

    @property
    def duration_ms(self) -> float:
        return (self.completed_at - self.started_at) * 1000  # profiling per step

@dataclass
class ExecutionPlan:
    plan_id: str          # "plan-a3f92bc1" — prefixed UUID
    task_type: TaskType   # TEST_GENERATION | REFINEMENT | GENERAL_CHAT
    max_repair_attempts: int = 2
    current_repair_attempt: int = 0
    metadata: dict        # session_id, task_description, ...

    @property
    def can_repair(self) -> bool:
        # Single source of truth — the orchestrator only checks plan.can_repair
        return self.current_repair_attempt < self.max_repair_attempts

    def add_step(self, action, description, **params) -> PlanStep:
        ...  # dynamic step addition — allows plan_repair() to append repair steps

    def get_next_pending_step(self) -> Optional[PlanStep]:
        return next((s for s in self.steps if s.status == PENDING), None)
Architecture Issues UPDATED
After a full read of the source code, several problems flagged in an earlier assessment turned out to already be resolved. Two issues remain:
- Token estimation heuristic: tokens_used = len(full_response) // 4 appears in both the orchestrator and the streaming path. Java code has many special characters — the actual Qwen tokenizer can deviate 20–30%. TokenOptimizer uses the same heuristic.
- Magic number: builder_field_count > 30 is hardcoded in Pass 6. This threshold doesn't fit all project sizes; it should be config-driven in agent.yaml.
Code Quality UPDATED
Simple Explanation
Before you even ask, he has already sat down and read the entire codebase (this is the indexing step). He didn’t just skim it — he drew a map of which class depends on which, which records have no builder, which interfaces are injected where. This is intelligence/.
When you ask “write tests for UserService”, he doesn’t re-read the entire codebase. He looks at the map he drew and instantly knows UserService needs to mock OpenAPIRepository and UserQueryService. Then he picks exactly the relevant files and prioritises the most important ones within his time limit. This is context/: SnippetSelector + TokenOptimizer.
Before writing, he lays out a clear plan: how many steps, which order, how many times to self-fix if something goes wrong. This is ExecutionPlan + StateMachine.
Then he writes the tests according to the team’s rules: JUnit 5, Mockito, AAA pattern, no Spring context. Afterwards he re-reads his own work 7 times, checking for different types of issues: correct structure, sufficient annotations, is SecurityContextHolder mocked correctly, do the field names in the builder actually exist. This is ValidationPipeline 7 passes.
If he finds a mistake, he self-corrects up to 2 times before handing it to you — no need for you to tell him. This is the repair loop. If you want more edge cases, you give feedback and he remembers the entire conversation to fix it in exactly the right place. This is session memory + refinement.
Every action he takes is logged: how long each step took, how many chunks retrieved, how many tokens used, whether validation passed or failed. This is EventBus + MetricsCollector.
From your IDE’s perspective (Tabby), everything looks like chatting with GPT-4 via the OpenAI API. All the complexity is hidden behind the /v1/chat/completions endpoint.
Suggested Improvements UPDATED
💾 Persistent Session Storage
Replace in-memory dict with Redis or SQLite. Sessions survive restart, support multi-worker. Priority HIGH — required before production deployment.
🔐 API Key Authentication
FastAPI middleware with API key check. Hash keys in env. Especially critical for POST /reindex — a destructive operation that needs protection.
⌛ Async Indexing + Job Queue
/reindex returns a job_id immediately. Background asyncio task or Celery handles processing. Add GET /reindex/{job_id}/status. Avoids HTTP timeout.
📲 Real Tokenizer
Replace len(response) // 4 with AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B"). Use for both TokenOptimizer and tokens_used tracking. Current deviation: 20–30%.
🔍 Hybrid Search (BM25 + Dense)
Add BM25 keyword search alongside vector search in Qdrant. Improves recall for exact class/method name lookups that dense vectors tend to miss.
📝 Auto-write Test Files
Option to automatically write generated tests to the correct path in the Java project (src/test/java/...). Closes the workflow loop, no more manual copy-paste.
🧪 Unit Tests for the System
Write tests for ValidationPipeline (mock Java strings), ContextBuilder, RepairStrategySelector. Especially important when adding new anti-patterns to Pass 6.
⚙️ Config-driven AP8 threshold
Replace the magic number > 30 in Pass 6 AP8 with a configurable value in agent.yaml. Per-project tuning instead of hardcoding.
📊 Full Evaluation Pipeline
Extend benchmark.py: compile check (javac), coverage target, validation pass rate, repair success rate. Detects quality regression when changing model or prompt.
🗼 ImpactReport Integration
DependencyAnalyzer.impact_of() is implemented but not yet used in the generation flow. Use it to warn: “changing UserProfile will affect 5 other tests”.
Architecture Diagrams UPDATED
Diagram 1 — System Architecture (Full)
Diagram 2 — StateMachine + Repair Loop
Diagram 3 — ValidationPipeline 7 Passes
Diagram 4 — context/ Priority Assembly Pipeline
Quick Start — Run the Agent in 5 Minutes
This guide assumes you already have Docker + Docker Compose and an NVIDIA GPU (or run CPU-only with a smaller model).
Step 0 — Prerequisites
| Requirement | Minimum Version | Notes |
|---|---|---|
| Python | 3.11+ | 3.12 recommended |
| Docker + Compose | Docker 24+ | Used for Qdrant & vLLM |
| NVIDIA GPU | CUDA 12.1+ & ≥16 GB VRAM | Qwen2.5-Coder-7B needs ~14 GB VRAM (AWQ) |
| RAM | ≥ 16 GB | embedding model + FastAPI + indexer |
| Java Repo | Java 17+, Maven/Gradle | Source code to index |
Step 1 — Clone & install dependencies
git clone https://github.com/huynguyenjv/ai-agent.git
cd ai-agent
python -m venv venv
# Windows: venv\Scripts\activate
# Linux/macOS: source venv/bin/activate
pip install -r requirements.txt
python download_model.py
Downloads all-MiniLM-L6-v2 (22 MB) into ./models/. Only downloads once; used offline afterwards.
Step 2 — Configure environment
cp env.example .env
# Edit the required values in .env
- VLLM_BASE_URL — URL of the vLLM server (default http://localhost:8000/v1)
- JAVA_REPO_PATH — path to the Java source code to index
- QDRANT_HOST — Qdrant host (default localhost)
Step 3 — Start supporting services
# Start Qdrant (vector database) via Docker
docker compose up -d qdrant
# Start vLLM (LLM server) — requires GPU
docker compose up -d vllm
# Check both are ready
curl http://localhost:6333/health   # Qdrant
curl http://localhost:8000/health   # vLLM
Running without a GPU? Set VLLM_MODEL=Qwen/Qwen2.5-Coder-1.5B-Instruct in .env and add --device cpu in docker-compose.yml. ~10x slower but still works.
Step 4 — Index Java codebase
# Start agent server first
python main.py
# In another terminal: call the indexing API
curl -X POST http://localhost:8080/reindex \
-H "Content-Type: application/json" \
-d '{"repo_path": "/path/to/your/java/repo", "recreate": false}'
Indexing typically takes 1–10 minutes depending on repo size. Each .java file is parsed by tree-sitter, chunked, embedded, and upserted into Qdrant.
Step 5 — Generate your first test
# Basic test generation
curl -X POST http://localhost:8080/generate-test \
-H "Content-Type: application/json" \
-d '{
"file_path": "src/main/java/com/example/service/UserService.java",
"class_name": "UserService",
"task_description": "Generate comprehensive JUnit5 unit tests"
}'
The response returns test_code as a complete Java class with JUnit5 + Mockito. validation_passed: true means all 7 passes passed. repair_attempts shows whether self-repair was needed.
Tabby IDE Integration
The agent exposes a /v1/chat/completions endpoint fully compatible with the OpenAI API. Configure Tabby to point at the agent:
| Tabby Setting | Value |
|---|---|
| Completion Provider | OpenAI Compatible |
| API Endpoint | http://localhost:8080/v1 |
| Model | ai-agent |
| API Key | any value (no auth yet, see issue s09) |
API Reference — All Endpoints
The agent runs at http://localhost:8080 (configured via SERVER_PORT). All endpoints return JSON unless otherwise noted.
Core Endpoints
GET /health — Response
{
"status": "healthy",
"vllm_healthy": true,
"qdrant_healthy": true,
"index_stats": {
"points_count": 1842,
"collection": "java_codebase"
}
}
Returns 200 if all services (vLLM + Qdrant) are running. Use for k8s readiness probe.
POST /generate-test — Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| file_path | string | ● required | Path to the .java file (relative or absolute) |
| class_name | string | optional | Class name. If omitted, auto-detected from file path |
| task_description | string | optional | Additional instructions for the LLM |
| session_id | string | optional | Session UUID — if provided, restores chat history |
Response: GenerateTestResponse
{
"success": true,
"test_code": "import org.junit.jupiter.api.Test;\n...",
"class_name": "UserService",
"validation_passed": true,
"validation_issues": [],
"session_id": "a3f9-...",
"rag_chunks_used": 7,
"tokens_used": 1240,
"plan_summary": "Steps: 7 completed, 0 failed",
"repair_attempts": 0
}
validation_passed: false does not mean the request failed — test code is still returned, with warnings. Check validation_issues for details.
POST /refine-test — Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| session_id | string | ● required | Session ID from the previous generate call |
| feedback | string | ● required | Description of the changes to make |
Example
curl -X POST http://localhost:8080/refine-test \
-H "Content-Type: application/json" \
-d '{
"session_id": "a3f9-...",
"feedback": "Add test cases for null input and empty list scenarios"
}'
The agent will call vLLM again with the full conversation history + new feedback. Test code is regenerated from scratch but has the prior session context.
POST /reindex
If recreate: true, the entire vector collection will be deleted and rebuilt. This endpoint has no authentication — do not expose publicly!
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| repo_path | string | ● required | Absolute path to the Java repository |
| recreate | bool | optional | Delete and recreate the collection. Default: false |
Response
{"success": true, "message": "Indexed 1842 points", "points_indexed": 1842}
Current implementation is synchronous — the HTTP connection is held open until indexing is complete. For large repos this may take several minutes. See §12 for the async indexing plan.
{"points_count": 1842, "collection": "java_codebase", "vector_size": 384}
Returns a list of CodeChunk matching the class_name. Useful for debugging whether a class is in the index.
GET /index/lookup/UserService
→ [{class_name, file_path, content, metadata, score}, ...]
Session Endpoints
Creates a new session UUID. Returns SessionInfo containing session_id, created_at, expires_at.
Returns session metadata: ID, creation time, number of generated tests, conversation turns.
Immediately removes the session from memory. Sessions also auto-expire after 1 hour (session_timeout configured in agent.yaml).
Returns a list of SessionInfo for all sessions that have not yet expired.
OpenAI-Compatible Endpoints (for use with Tabby)
GET /v1/models
{"object": "list", "data": [{"id": "ai-agent", "object": "model", ...}]}
POST /v1/chat/completions
Supports the full OpenAI chat format. The agent automatically parses message content to detect file_path, class_name and dispatches to the /generate-test flow. Supports both stream: true (SSE token-by-token) and blocking mode.
Custom fields
{
"model": "ai-agent",
"messages": [{"role": "user", "content": "Write tests for UserService"}],
"stream": true,
"file_path": "src/main/java/.../UserService.java", // optional
"workspace_path": "/path/to/workspace" // optional
}
With stream: true, the response is an SSE stream with 6 phase events: PLANNING → RETRIEVING → GENERATING (token-by-token) → VALIDATING → REPAIRING (if needed) → DONE.
{"state": "IDLE", "active_sessions": 2, "total_generations": 47, ...}
{"total_generations": 47, "avg_tokens": 1380, "validation_pass_rate": 0.91, "repair_rate": 0.21, ...}
Long-lived SSE connection. Receives all events that EventBus publishes: PLAN_CREATED, STEP_STARTED, CONTEXT_RETRIEVED, VALIDATION_COMPLETED, REPAIR_STARTED, GENERATION_COMPLETED. Useful for monitoring dashboards.
GET /v1/rag-context?class_name=UserService&file_path=...
→ {snippets, token_count, mock_types, intelligence_available}
Rate Limiting
Default: 10 requests / 60 seconds per IP. Exceeding this returns HTTP 429. Configure via env: RATE_LIMIT_REQUESTS, RATE_LIMIT_WINDOW. Health check endpoints (/health, /v1/models) are exempt from rate limiting.
Note that the limiter is in-memory and per process, so in a multi-worker deployment the effective limit is RATE_LIMIT_REQUESTS × MAX_WORKERS. Redis is needed for distributed rate limiting.
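A per-IP sliding-window limiter matching the description above can be sketched as follows. This is illustrative, not the agent's actual middleware; the injectable `now` parameter exists only to make the sketch deterministic to test:

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Per-IP sliding-window rate limiter (hypothetical sketch)."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 60):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str, now=None) -> bool:
        now = time.time() if now is None else now
        hits = self._hits[ip]
        # Evict timestamps that have fallen out of the window
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # caller responds with HTTP 429
        hits.append(now)
        return True
```

A FastAPI middleware would call `allow(request.client.host)` on each request, skipping the exempt paths (/health, /v1/models).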
Configuration Guide — Detailed Configuration
The agent is configured in two layers: YAML files in config/ (defaults) and environment variables in .env (overrides). Environment variables always win over YAML.
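The override order can be sketched as below. This assumes the YAML file has already been parsed into a nested dict and that env var names are the upper-cased, underscore-joined key path — both assumptions about the real loader, which may use pydantic or similar:

```python
import os


def load_setting(key: str, yaml_defaults: dict, env=None):
    """Resolve a dotted config key: env var wins over YAML default.

    Sketch only — illustrates the layering rule, not the agent's
    actual configuration code.
    """
    env = os.environ if env is None else env
    env_key = key.upper().replace(".", "_")
    if env_key in env:
        return env[env_key]
    # Fall back to walking the dotted path into the YAML dict
    node = yaml_defaults
    for part in key.split("."):
        node = node[part]
    return node


defaults = {"qdrant": {"host": "localhost", "port": 6333}}
```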
Environment Variables (.env)
Qdrant — Vector Database
| Variable | Default | Description |
|---|---|---|
| QDRANT_HOST | localhost | Hostname of the Qdrant server |
| QDRANT_PORT | 6333 | Qdrant REST API port |
| QDRANT_COLLECTION | java_codebase | Name of the collection storing vectors |
vLLM — LLM Server
| Variable | Default | Description |
|---|---|---|
| VLLM_BASE_URL | http://localhost:8000/v1 | Base URL of vLLM's OpenAI-compatible API |
| VLLM_MODEL | Qwen/Qwen2.5-Coder-7B-Instruct-AWQ | Model name being served by vLLM |
| VLLM_API_KEY | token-abc123 | API key (authentication with vLLM) |
Embedding Model
| Variable | Default | Description |
|---|---|---|
| EMBEDDING_MODEL | sentence-transformers/all-MiniLM-L6-v2 | Model producing 384-dim vectors. Use a local path for offline usage. |
| SENTENCE_TRANSFORMERS_HOME | ./models | HuggingFace model cache directory |
Server & Performance
| Variable | Default | Description |
|---|---|---|
| SERVER_HOST | 0.0.0.0 | Bind address |
| SERVER_PORT | 8080 | HTTP port |
| MAX_WORKERS | 4 | ThreadPoolExecutor workers for blocking I/O |
| REQUEST_TIMEOUT | 300 | Timeout (seconds) for a single generation request |
| LOG_LEVEL | INFO | DEBUG / INFO / WARNING / ERROR |
Security & Rate Limiting
| Variable | Default | Description |
|---|---|---|
| CORS_ORIGINS | * | Allowed CORS origins. Use * for dev, specific list for prod. |
| RATE_LIMIT_REQUESTS | 10 | Maximum requests per window |
| RATE_LIMIT_WINDOW | 60 | Window duration (seconds) |
| DISABLE_SSL_VERIFY | false | Only set true in corporate proxy environments |
agent.yaml — Details
| Key | Default | Meaning |
|---|---|---|
| orchestrator.max_context_tokens | 4000 | Token budget for RAG context when ContextBuilder is unavailable |
| orchestrator.top_k_results | 10 | Number of chunks returned from Qdrant search |
| orchestrator.session_timeout | 3600 | Session expiry in seconds |
| prompt.test_constraints | list | Rules injected into the system prompt (JUnit5, Mockito, AAA...) |
| rules.layer_detection | patterns | Regex mapping class name → DDD layer (application/domain/infra) |
rag.yaml — Details
| Key | Default | Meaning |
|---|---|---|
| qdrant.vector_size | 384 | Must match the output dimension of the embedding model |
| qdrant.distance | Cosine | Similarity metric |
| embedding.batch_size | 32 | Number of chunks embedded in parallel per batch (indexing) |
| search.default_top_k | 10 | Default number of results to return |
| search.score_threshold | 0.5 | Filter out chunks with cosine score below this threshold |
vllm.yaml — Details
| Key | Default | Meaning |
|---|---|---|
| generation.temperature | 0.2 | Low = deterministic. High = creative. 0.2 is good for code gen. |
| generation.max_tokens | 4096 | Token limit for a single response |
| generation.top_p | 0.95 | Nucleus sampling threshold |
| retry.max_attempts | 3 | Retry count if vLLM call fails (separate from the repair loop) |
Tuning tips: increase top_k_results (agent.yaml) to 15–20 for large repos. Increase max_tokens (vllm.yaml) to 8192 for classes with many methods. Lower temperature to 0.1 for more deterministic code.
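The retry.max_attempts behaviour (separate from the repair loop) amounts to a standard retry wrapper. A minimal sketch, with an exponential backoff schedule that is an assumption rather than the documented policy:

```python
import time


def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 0.0):
    """Retry a flaky call up to max_attempts times (illustrative sketch)."""
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # real code would catch the client's error type
            last_exc = exc
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise last_exc


calls = {"n": 0}

def flaky():
    """Simulated vLLM call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("vLLM unavailable")
    return "ok"
```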
Developer Guide — Extending & Contributing
Quick code orientation
To understand the system, read the components in this order:
1. ExecutionPlan, PlanStep, StepAction, StepStatus — understand this struct thoroughly before reading any other file.
2. The AgentState enum (6 states) and the StateMachine managing transitions. Check the VALID_TRANSITIONS dict to see which states can transition to which.
3. _execute_plan() — this is the main while loop. Find _execute_step() — the dispatch table from the StepAction enum to the executing method.
4. The build_context() method, from top to bottom.
5. ValidationPipeline.validate(), which runs 7 _pass_*() methods. Each pass returns list[ValidationIssue]. Easy to add new passes.
Adding a new Anti-pattern to Pass 6
def _pass6_antipatterns(self, code: str, rag_chunks) -> list[ValidationIssue]:
    issues = []
    lines = code.split("\n")
    # ...existing checks...
    # ADD NEW: Detect Thread.sleep() in tests
    for i, line in enumerate(lines, 1):
        if "Thread.sleep(" in line:
            issues.append(ValidationIssue(
                severity=IssueSeverity.WARNING,
                pass_number=6,
                category="antipattern",
                message=f"Thread.sleep() at line {i} makes tests flaky",
                suggestion="Use Awaitility or mock the dependency causing delay",
                line_number=i,
            ))
    return issues
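A quick sanity check for the new rule: the scan can be exercised in isolation by reimplementing just the line matcher, so it runs without the pipeline or its imports (standalone sketch, not the project's own test):

```python
def find_thread_sleep(code: str) -> list[int]:
    """Return 1-based line numbers containing Thread.sleep().

    Mirrors the matching logic of the new Pass 6 check above.
    """
    return [
        i for i, line in enumerate(code.split("\n"), 1)
        if "Thread.sleep(" in line
    ]


test_code = """\
@Test
void waitsForResult() {
    Thread.sleep(1000);
    assertTrue(done);
}
"""
```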
Adding a new StepAction to ExecutionPlan
This is the canonical pattern for extending the pipeline:
# 1. Add enum to agent/plan.py
class StepAction(Enum):
    # ...existing...
    SAVE_TEST_FILE = "save_test_file"  # NEW

# 2. Create executor method in agent/orchestrator.py
def _step_save_test_file(self, plan: ExecutionPlan, step: PlanStep, ctx: dict):
    code = ctx["extracted_code"]
    output_path = self._resolve_test_path(ctx["file_path"])
    Path(output_path).write_text(code, encoding="utf-8")
    ctx["saved_path"] = output_path

# 3. Register in the dispatch table _execute_step()
def _execute_step(self, sm, plan, step, ctx):
    executor_map = {
        StepAction.EXTRACT_CLASS_INFO: self._step_extract_class_info,
        # ...existing...
        StepAction.SAVE_TEST_FILE: self._step_save_test_file,  # NEW
    }
    executor = executor_map.get(step.action)
    if executor:
        executor(plan, step, ctx)

# 4. Add the step in Planner
def plan_test_generation(self, ...) -> ExecutionPlan:
    plan.add_step(StepAction.RECORD_SESSION, "Record session")
    plan.add_step(StepAction.SAVE_TEST_FILE, "Save test file")  # NEW
    return plan
Running system tests
cd ai-agent
source venv/bin/activate

# Unit tests (no vLLM/Qdrant needed)
pytest tests/test_phase1.py -v    # StateMachine + Planner
pytest tests/test_phase2.py -v    # ContextBuilder + intelligence/
pytest tests/test_phase3_4.py -v  # Validation + Repair + Events + Metrics

# Generation quality benchmark
python benchmark.py --test-file benchmark/results/gen_quality_bench.json
The tests/ directory only has placeholders. Most test files have no actual test logic yet. See §12 for the test-writing plan.
Advanced Local Development Setup
# Hot-reload when editing code
uvicorn main:app --reload --host 0.0.0.0 --port 8080 --log-level debug

# Run with structured log output (pretty-printed)
LOG_LEVEL=DEBUG python main.py 2>&1 | python -m structlog.dev

# Check health after startup
curl -s http://localhost:8080/health | python -m json.tool
Docker Compose — Infrastructure
version: "3.8"
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
    volumes: ["./data/qdrant:/qdrant/storage"]
  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "Qwen/Qwen2.5-Coder-7B-Instruct-AWQ", "--gpu-memory-utilization", "0.85"]
    ports: ["8000:8000"]
    deploy:
      resources:
        reservations:
          devices: [{capabilities: ["gpu"]}]
  agent:
    build: .
    ports: ["8080:8080"]
    env_file: .env
    depends_on: [qdrant, vllm]
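After docker compose up, a small readiness poll can confirm the services are actually answering before sending work. This is a sketch (the function name and defaults are assumptions); the probe is injectable so the logic itself is testable without a network:

```python
import time
import urllib.request


def wait_until_ready(urls, probe=None, timeout: float = 120, interval: float = 2.0):
    """Poll each URL until it answers, or raise TimeoutError (illustrative sketch)."""
    def default_probe(url):
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except OSError:
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout
    pending = list(urls)
    while pending:
        # Keep only the URLs that still fail their probe
        pending = [u for u in pending if not probe(u)]
        if pending and time.monotonic() > deadline:
            raise TimeoutError(f"services not ready: {pending}")
        if pending:
            time.sleep(interval)
    return True
```

Typical usage would poll the Qdrant, vLLM, and agent ports from the compose file, e.g. the agent's /health endpoint on 8080.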
Glossary — Terms & Concepts
Architecture terminology
- AAA (Arrange-Act-Assert) — the test structure the validator enforces: // Arrange, // Act, // Assert comments are clearly present.
- Plan execution loop — _execute_plan() in the orchestrator. Each iteration processes one pending PlanStep. The loop ends when there are no more pending steps or the plan reaches COMPLETED.
- Repair loop — when the ValidationPipeline detects an ERROR, the orchestrator calls planner.plan_repair() to append REPAIR_CODE + GENERATE_CODE + VALIDATE_CODE steps to the running plan, then loops again. Maximum max_repair_attempts=2.
- Class info — the extracted class metadata: class_name, file_path, dependencies (FQN list), used_types, has_builder, record_components, layer.
- Built context — the result of ContextBuilder.build_context(). Contains: snippets (priority-ordered list), rag_chunks, token_count, mock_types, intelligence_available, elapsed_ms.
- DDD layer — application (*Service, *UseCase, *Handler), domain (*Entity, *ValueObject, *Aggregate), infrastructure (*Repository, *Adapter, *Client). The layer influences which types of mocks are generated.
- DependencyAnalyzer — service in intelligence/ that merges FileGraph + SymbolMap to return a TestContext. Analyzes the AST graph to know exactly which classes need to be mocked — no guessing.
- EventBus — publishes an Event at each step. EventTypes: PLAN_CREATED, STEP_STARTED, STEP_COMPLETED, CONTEXT_RETRIEVED, VALIDATION_COMPLETED, REPAIR_STARTED, GENERATION_COMPLETED.
- ExecutionPlan — the ordered list of PlanSteps to execute. The plan can be extended at runtime (append repair steps). plan.can_repair is the single source of truth for the repair limit.
- FileGraph — graph in intelligence/ built on import relationships between Java files. Traverse the graph to find dependencies, dependents, transitive closures.
- Fallback chain — if intelligence/ is not available, ContextBuilder falls back to RAG-only. If RAG fails, it falls back to inline source parsing.
- Embedding model — sentence-transformers/all-MiniLM-L6-v2, a 22 MB embedding model producing 384-dim dense vectors. Used for both indexing (building Qdrant) and search (query transform). Runs in-process inside FastAPI.
- Repo scan result — structure in intelligence/repo_scanner.py containing all Java repo information after scanning: lookup by class name (O(1)), by FQN, by file path. A static snapshot of the repo.
- RepairPlan — built by RepairStrategySelector.build_repair_plan(). Contains: the list of issues to fix grouped by category, a system prompt for repair, and a user prompt with the broken code plus specific fix instructions.
- SnippetSelector — component in context/snippet_selector.py that classifies CodeChunks into 5 priority tiers based on their relationship to the target class (target/mock/domain/interface/transitive).
- SSE streaming — token-by-token responses on /v1/chat/completions when stream: true.
- StateMachine — enforces the VALID_TRANSITIONS dict. Raises TransitionError for invalid transitions.
- SymbolMap — index in intelligence/: maps class → methods/fields, method → class, field type → injectors, annotation → classes. Used to find dependency injection points.
- TestContext — returned by DependencyAnalyzer.test_context_for(class_name). Contains: mocks (list of class names needing @Mock), domain_types (list of value objects/entities), layer (application/domain/infra).
- Token budget — the TokenOptimizer estimates ~4 chars/token and trims or drops lower-priority snippets to stay within budget.
- Tree-sitter parser — indexer/parse_java.py uses tree-sitter to extract class structure, methods, and dependencies from .java files.
- ValidationIssue — produced by the ValidationPipeline. Contains: severity (ERROR/WARNING/INFO), pass_number, category, message, suggestion, line_number.
- Validation result — passed = not any(i.severity == ERROR for i in issues). On failure, a _ValidationFailed exception is raised.
- vLLM client — supports blocking (generate()) and streaming (stream_generate()) calls.
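The ~4 chars/token heuristic and priority-based trimming mentioned in the glossary can be sketched as follows (hypothetical function names; the real TokenOptimizer may differ in detail):

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token."""
    return max(1, len(text) // 4)


def fit_budget(snippets: list[tuple[int, str]], budget: int = 6000) -> list[str]:
    """Keep the highest-priority snippets within the token budget.

    Snippets are (priority, text) pairs where a lower number means a
    higher priority tier (P1→P5). Lower-priority snippets that no
    longer fit are dropped. Sketch only.
    """
    kept, used = [], 0
    for _, text in sorted(snippets, key=lambda s: s[0]):
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # drop snippets that would blow the budget
        kept.append(text)
        used += cost
    return kept


snippets = [(2, "x" * 400), (1, "y" * 400), (5, "z" * 40000)]
```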