Using Large Language Models for Automated Test Triage
Traditional ML models classify failures into buckets. LLMs go much further — they can explain what went wrong, why it happened, and how to fix it. Let's explore how to implement LLM-powered test triage.
Beyond Classification: What LLMs Offer
Instead of this:
Category: TIMEOUT_ERROR
Confidence: 0.94
Similar failures: 47
You get this:
"This test failed because the API endpoint /api/users took 32 seconds to respond, exceeding the 30-second timeout. This is likely due to the new N+1 query introduced in commit a1b2c3d that loads all user permissions without pagination.
Suggested fix: Add pagination to the get_user_permissions() method or use eager loading with select_related(). See the performance guidelines in docs/api/pagination.md.
Similar issue was resolved in PR #4521 — you might want to apply the same pattern."
This is genuinely useful. It spares engineers the work of digging through logs, tracing the code change, and hunting for similar issues themselves.
How It Works
The key is providing the LLM with sufficient context:
1. Context Collection
Gather all relevant information:
• Stack trace and error message
• Relevant log lines (filtered for noise)
• The code diff from the last passing commit
• Test metadata (name, duration, history)
• Previous similar failures and their resolutions
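As a concrete starting point, the gathered context can be bundled into a single object. This is a minimal sketch; the `FailureContext` name and its fields are illustrative, not any particular tool's schema:

```python
from dataclasses import dataclass, field

@dataclass
class FailureContext:
    """One failure's worth of evidence for the LLM (illustrative fields)."""
    test_name: str
    error_message: str
    stack_trace: str
    log_lines: list[str]          # pre-filtered to drop noisy INFO/DEBUG chatter
    code_diff: str                # diff against the last passing commit
    duration_seconds: float
    recent_results: list[str]     # e.g. ["pass", "pass", "fail"], useful for flakiness signals
    similar_failures: list[dict] = field(default_factory=list)  # past failures and their resolutions
```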
2. Prompt Engineering
Structure the context for optimal LLM performance:
• System prompt: Define the role (expert QA engineer)
• Context section: All gathered information, well-formatted
• Task: Generate explanation, root cause, and fix suggestions
• Constraints: Be concise, cite evidence, admit uncertainty
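As a sketch of how those four pieces might be assembled into a chat-style prompt, building on the FailureContext sketch above; the exact wording is an assumption to adapt:

```python
def build_messages(ctx: FailureContext) -> list[dict]:
    """Assemble the prompt: role, context, task, constraints."""
    system = (
        "You are an expert QA engineer triaging automated test failures. "
        "Be concise, cite the evidence you rely on, and say so when you are unsure."
    )
    logs = "\n".join(ctx.log_lines)
    context_block = (
        f"Test: {ctx.test_name} ({ctx.duration_seconds:.1f}s)\n"
        f"Error: {ctx.error_message}\n"
        f"Stack trace:\n{ctx.stack_trace}\n"
        f"Relevant logs:\n{logs}\n"
        f"Diff since last passing commit:\n{ctx.code_diff}"
    )
    task = (
        "Explain what went wrong, give a root-cause hypothesis, suggest a fix, "
        "and state your confidence as high, medium, or low."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"{context_block}\n\n{task}"},
    ]
```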
3. Response Generation
The LLM produces:
• Natural language explanation
• Root cause hypothesis
• Suggested fix (with code if relevant)
• Links to documentation or similar issues
• Confidence level
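To keep that output machine-checkable, it helps to request a fixed JSON shape and parse it defensively. The schema below is an assumption, not a standard:

```python
import json
from dataclasses import dataclass

@dataclass
class TriageResult:
    explanation: str
    root_cause: str
    suggested_fix: str
    references: list[str]   # docs, PRs, or tickets the model cited
    confidence: str         # "high" | "medium" | "low"

def parse_triage_response(raw: str) -> TriageResult | None:
    """Parse the model's JSON reply; return None if it doesn't match the expected shape."""
    try:
        data = json.loads(raw)
        return TriageResult(
            explanation=data["explanation"],
            root_cause=data["root_cause"],
            suggested_fix=data["suggested_fix"],
            references=list(data.get("references", [])),
            confidence=data["confidence"],
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed output falls back to manual triage
```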
4. Validation
Cross-reference with traditional ML models:
• Does the LLM's classification match?
• Is the suggested fix relevant to the failure type?
• Are the linked issues actually similar?
This catches hallucinations before they reach engineers.
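A hedged sketch of that cross-check, reusing the TriageResult type above; the agreement test is deliberately crude and stands in for whatever traditional classifier you already run:

```python
def validate_triage(result: TriageResult, ml_category: str) -> bool:
    """Reject LLM output that disagrees with the traditional classifier or cites no evidence."""
    # 1. Does the LLM's root cause broadly agree with the ML-predicted category?
    category_words = ml_category.lower().replace("_", " ")   # e.g. "TIMEOUT_ERROR" -> "timeout error"
    agrees = category_words in result.root_cause.lower()
    # 2. Did the model point at any documentation, PR, or ticket at all?
    cites_evidence = bool(result.references)
    return agrees and cites_evidence
```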
Implementation Tips
Use Retrieval-Augmented Generation (RAG)
Don't expect the LLM to know your codebase. Instead:
• Index your documentation
• Index your historical PR descriptions
• Index your resolved Jira tickets
• Retrieve relevant context before generating
This grounds the LLM in your specific domain.
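The retrieval step can start very simply. The sketch below ranks indexed documents by keyword overlap with the error message; a real deployment would likely swap in an embedding index, but the shape of the step is the same:

```python
def retrieve_context(error_message: str, documents: list[dict], top_k: int = 3) -> list[dict]:
    """Rank indexed docs (runbooks, PR descriptions, resolved tickets) by word overlap with the error."""
    query_terms = set(error_message.lower().split())

    def score(doc: dict) -> int:
        return len(query_terms & set(doc["text"].lower().split()))

    ranked = sorted(documents, key=score, reverse=True)
    return [doc for doc in ranked[:top_k] if score(doc) > 0]
```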
Fine-Tune on Your Data
If you have labeled triage decisions:
• Historical root causes assigned by engineers
• Fix commits linked to test failures
• Documentation that was actually helpful
Use this to fine-tune or create few-shot examples.
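If full fine-tuning is more than you need, the same labeled history can be turned into few-shot examples. This sketch assumes each record pairs a past error with the root cause and fix that engineers recorded:

```python
def few_shot_block(history: list[dict], max_examples: int = 3) -> str:
    """Turn past (failure, resolution) pairs into a few-shot section of the prompt."""
    lines = []
    for record in history[:max_examples]:
        lines.append(f"Past failure: {record['error']}")
        lines.append(f"Root cause found by engineers: {record['root_cause']}")
        lines.append(f"Fix that worked: {record['fix']}")
        lines.append("")
    return "\n".join(lines)
```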
Cache Responses
Same failure pattern = same explanation:
• Hash the error signature
• Store generated explanations
• Serve from cache for duplicates
This reduces cost and latency significantly.
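A minimal caching sketch keyed on a normalized error signature; the normalization rules (stripping timestamps, addresses, counters) are assumptions you would tune to your own logs:

```python
import hashlib
import re

_explanations: dict[str, str] = {}

def error_signature(error_message: str, stack_trace: str) -> str:
    """Hash a normalized failure so near-identical repeats map to the same key."""
    text = f"{error_message}\n{stack_trace}"
    text = re.sub(r"\d{4}-\d{2}-\d{2}[T ][\d:.]+", "<TS>", text)   # timestamps
    text = re.sub(r"0x[0-9a-fA-F]+", "<ADDR>", text)               # memory addresses
    text = re.sub(r"\b\d+\b", "<N>", text)                         # line numbers, counts
    return hashlib.sha256(text.encode()).hexdigest()

def cached_explanation(sig: str, generate) -> str:
    """Serve a stored explanation if one exists; otherwise generate once and store it."""
    if sig not in _explanations:
        _explanations[sig] = generate()
    return _explanations[sig]
```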
Set Confidence Thresholds
Not all LLM outputs are created equal:
• High confidence: Auto-assign to engineer with explanation
• Medium confidence: Show explanation but request human review
• Low confidence: Flag for manual triage, don't show explanation
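Routing on confidence can be as simple as the sketch below, reusing the TriageResult type from earlier; the action names are placeholders for whatever your test UI supports:

```python
def route_triage(result: TriageResult) -> str:
    """Decide what the engineer sees based on the model's self-reported confidence."""
    if result.confidence == "high":
        return "auto_assign"        # attach the explanation and assign an owner
    if result.confidence == "medium":
        return "show_with_review"   # display the explanation, but ask a human to confirm
    return "manual_triage"          # hide the explanation and route to a person
```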
Limitations and Risks
Hallucination
LLMs confidently make things up. They might:
• Cite non-existent documentation
• Suggest fixes that don't compile
• Attribute blame to the wrong commit
Always verify before trusting.
Cost
LLM inference is expensive:
• GPT-4 class models: ~$0.03-0.10 per failure
• At 1000 failures/day: $30-100/day in LLM costs alone
• Fine-tuned smaller models can reduce cost 10x
Consider whether the value justifies the expense.
Latency
LLM generation takes 2-10 seconds:
• Not suitable for blocking the CI pipeline
• Run asynchronously after test completion
• Cache aggressively for repeated failures
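One way to keep the LLM off the critical path is a plain in-process queue, as in this sketch; `generate_explanation` and `store_result` are placeholders for the steps described above:

```python
import queue
import threading

triage_queue: queue.Queue = queue.Queue()

def on_test_failed(failure: dict) -> None:
    """Called from the CI webhook: enqueue and return immediately, never block the pipeline."""
    triage_queue.put(failure)

def triage_worker() -> None:
    """Background worker: runs the slow LLM step out of band and stores the result."""
    while True:
        failure = triage_queue.get()
        explanation = generate_explanation(failure)       # placeholder for the LLM call
        store_result(failure["test_name"], explanation)   # placeholder for the test-UI store
        triage_queue.task_done()

threading.Thread(target=triage_worker, daemon=True).start()
```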
Privacy
Your test failures contain sensitive information:
• Code snippets
• Internal URLs and credentials
• Customer data in test fixtures
Use on-premise models or ensure your provider meets compliance requirements.
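If failure context has to leave your network at all, redact it first. The patterns below are illustrative and nowhere near exhaustive; treat them as a starting point, not a compliance control:

```python
import re

REDACTIONS = [
    (re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"), r"\1=<REDACTED>"),  # credential assignments
    (re.compile(r"https?://[\w.-]+\.internal\S*"), "<INTERNAL_URL>"),                  # internal hostnames (assumed convention)
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),                               # email addresses in fixtures
]

def redact(text: str) -> str:
    """Strip obvious secrets, internal URLs, and PII before sending context to a hosted model."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```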
Practical Architecture
Here's a realistic implementation:
1. Test fails → webhook triggers triage service
2. Service collects context (logs, diff, history)
3. RAG retrieves relevant documentation
4. LLM generates explanation (async)
5. Traditional ML validates the output
6. Result stored and displayed in test UI
7. Engineer feedback improves future predictions
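Tied together, steps 2 through 6 reduce to a short orchestration function. Every helper named here was either sketched earlier in this post or is a placeholder (`build_prompt`, `call_llm`, `store_result`, `flag_for_manual_triage`) for your own services:

```python
def triage_failure(failure: dict, documents: list[dict], ml_category: str) -> None:
    """End-to-end pass for one failure: context -> RAG -> LLM -> validation -> storage."""
    docs = retrieve_context(failure["error_message"], documents)  # step 3: RAG retrieval
    messages = build_prompt(failure, docs)                        # step 4: prompt assembly (placeholder)
    raw = call_llm(messages)                                      # step 4: LLM call, run asynchronously
    result = parse_triage_response(raw)
    if result and validate_triage(result, ml_category):           # step 5: traditional ML cross-check
        store_result(failure["test_name"], result)                # step 6: surface in the test UI
    else:
        flag_for_manual_triage(failure["test_name"])              # fall back when validation fails
```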
Start simple: One model, one failure type, one team. Iterate based on feedback.
Conclusion
LLMs are powerful tools for test triage, but they're not magic. They work best when:
• Grounded in your specific context (RAG)
• Validated by traditional ML
• Used asynchronously, not blocking
• Continuously improved with human feedback
The future is hybrid: specialized ML models for classification and retrieval, LLMs for explanation and suggestion. Together, they can make test triage almost effortless.