Using Large Language Models for Automated Test Triage
Traditional ML models classify failures into buckets. LLMs go much further — they can explain what went wrong, why it happened, and how to fix it. Let's explore how to implement LLM-powered test triage.
Beyond Classification: What LLMs Offer
Instead of this:
Category: TIMEOUT_ERROR
Confidence: 0.94
Similar failures: 47
You get this:
"This test failed because the API endpoint /api/users took 32 seconds to respond, exceeding the 30-second timeout. This is likely due to the new N+1 query introduced in commit a1b2c3d that loads all user permissions without pagination.
Suggested fix: Add pagination to the get_user_permissions() method or use eager loading with select_related(). See the performance guidelines in docs/api/pagination.md.
Similar issue was resolved in PR #4521 — you might want to apply the same pattern."
This is genuinely useful. It spares engineers the work of digging through logs, tracing the code change, and hunting for similar issues themselves.
How It Works
The key is providing the LLM with sufficient context:
1. Context Collection
Gather all relevant information:
• Stack trace and error message
• Relevant log lines (filtered for noise)
• The code diff from the last passing commit
• Test metadata (name, duration, history)
• Previous similar failures and their resolutions
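As a concrete starting point, the gathered context can be bundled into a single object. This is a minimal sketch; the `FailureContext` name and its fields are illustrative, not any particular tool's schema:

```python
from dataclasses import dataclass, field

@dataclass
class FailureContext:
    """One failure's worth of evidence for the LLM (illustrative fields)."""
    test_name: str
    error_message: str
    stack_trace: str
    log_lines: list[str]          # pre-filtered to drop noisy INFO/DEBUG chatter
    code_diff: str                # diff against the last passing commit
    duration_seconds: float
    recent_results: list[str]     # e.g. ["pass", "pass", "fail"], useful for flakiness signals
    similar_failures: list[dict] = field(default_factory=list)  # past failures and their resolutions
```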
2. Prompt Engineering
Structure the context for optimal LLM performance:
• System prompt: Define the role (expert QA engineer)
• Context section: All gathered information, well-formatted
• Task: Generate explanation, root cause, and fix suggestions
• Constraints: Be concise, cite evidence, admit uncertainty
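As a sketch of how those four pieces might be assembled into a chat-style prompt, building on the FailureContext sketch above; the exact wording is an assumption to adapt:

```python
def build_messages(ctx: FailureContext) -> list[dict]:
    """Assemble the prompt: role, context, task, constraints."""
    system = (
        "You are an expert QA engineer triaging automated test failures. "
        "Be concise, cite the evidence you rely on, and say so when you are unsure."
    )
    logs = "\n".join(ctx.log_lines)
    context_block = (
        f"Test: {ctx.test_name} ({ctx.duration_seconds:.1f}s)\n"
        f"Error: {ctx.error_message}\n"
        f"Stack trace:\n{ctx.stack_trace}\n"
        f"Relevant logs:\n{logs}\n"
        f"Diff since last passing commit:\n{ctx.code_diff}"
    )
    task = (
        "Explain what went wrong, give a root-cause hypothesis, suggest a fix, "
        "and state your confidence as high, medium, or low."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"{context_block}\n\n{task}"},
    ]
```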
3. Response Generation
The LLM produces:
• Natural language explanation
• Root cause hypothesis
• Suggested fix (with code if relevant)
• Links to documentation or similar issues
• Confidence level
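To keep that output machine-checkable, it helps to request a fixed JSON shape and parse it defensively. The schema below is an assumption, not a standard:

```python
import json
from dataclasses import dataclass

@dataclass
class TriageResult:
    explanation: str
    root_cause: str
    suggested_fix: str
    references: list[str]   # docs, PRs, or tickets the model cited
    confidence: str         # "high" | "medium" | "low"

def parse_triage_response(raw: str) -> TriageResult | None:
    """Parse the model's JSON reply; return None if it doesn't match the expected shape."""
    try:
        data = json.loads(raw)
        return TriageResult(
            explanation=data["explanation"],
            root_cause=data["root_cause"],
            suggested_fix=data["suggested_fix"],
            references=list(data.get("references", [])),
            confidence=data["confidence"],
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed output falls back to manual triage
```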
4. Validation
Cross-reference with traditional ML models:
• Does the LLM's classification match?
• Is the suggested fix relevant to the failure type?
• Are the linked issues actually similar?
This catches hallucinations before they reach engineers.
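A hedged sketch of that cross-check, reusing the TriageResult type above; the agreement test is deliberately crude and stands in for whatever traditional classifier you already run:

```python
def validate_triage(result: TriageResult, ml_category: str) -> bool:
    """Reject LLM output that disagrees with the traditional classifier or cites no evidence."""
    # 1. Does the LLM's root cause broadly agree with the ML-predicted category?
    category_words = ml_category.lower().replace("_", " ")   # e.g. "TIMEOUT_ERROR" -> "timeout error"
    agrees = category_words in result.root_cause.lower()
    # 2. Did the model point at any documentation, PR, or ticket at all?
    cites_evidence = bool(result.references)
    return agrees and cites_evidence
```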
Implementation Tips
Use Retrieval-Augmented Generation (RAG)
Don't expect the LLM to know your codebase. Instead:
• Index your documentation
• Index your historical PR descriptions
• Index your resolved Jira tickets
• Retrieve relevant context before generating
This grounds the LLM in your specific domain.
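The retrieval step can start very simply. The sketch below ranks indexed documents by keyword overlap with the error message; a real deployment would likely swap in an embedding index, but the shape of the step is the same:

```python
def retrieve_context(error_message: str, documents: list[dict], top_k: int = 3) -> list[dict]:
    """Rank indexed docs (runbooks, PR descriptions, resolved tickets) by word overlap with the error."""
    query_terms = set(error_message.lower().split())

    def score(doc: dict) -> int:
        return len(query_terms & set(doc["text"].lower().split()))

    ranked = sorted(documents, key=score, reverse=True)
    return [doc for doc in ranked[:top_k] if score(doc) > 0]
```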
Fine-Tune on Your Data
If you have labeled triage decisions:
• Historical root causes assigned by engineers
• Fix commits linked to test failures
• Documentation that was actually helpful
Use this to fine-tune or create few-shot examples.
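If full fine-tuning is more than you need, the same labeled history can be turned into few-shot examples. This sketch assumes each record pairs a past error with the root cause and fix that engineers recorded:

```python
def few_shot_block(history: list[dict], max_examples: int = 3) -> str:
    """Turn past (failure, resolution) pairs into a few-shot section of the prompt."""
    lines = []
    for record in history[:max_examples]:
        lines.append(f"Past failure: {record['error']}")
        lines.append(f"Root cause found by engineers: {record['root_cause']}")
        lines.append(f"Fix that worked: {record['fix']}")
        lines.append("")
    return "\n".join(lines)
```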
Cache Responses
Same failure pattern = same explanation:
• Hash the error signature
• Store generated explanations
• Serve from cache for duplicates
This reduces cost and latency significantly.
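A minimal caching sketch keyed on a normalized error signature; the normalization rules (stripping timestamps, addresses, counters) are assumptions you would tune to your own logs:

```python
import hashlib
import re

_explanations: dict[str, str] = {}

def error_signature(error_message: str, stack_trace: str) -> str:
    """Hash a normalized failure so near-identical repeats map to the same key."""
    text = f"{error_message}\n{stack_trace}"
    text = re.sub(r"\d{4}-\d{2}-\d{2}[T ][\d:.]+", "<TS>", text)   # timestamps
    text = re.sub(r"0x[0-9a-fA-F]+", "<ADDR>", text)               # memory addresses
    text = re.sub(r"\b\d+\b", "<N>", text)                         # line numbers, counts
    return hashlib.sha256(text.encode()).hexdigest()

def cached_explanation(sig: str, generate) -> str:
    """Serve a stored explanation if one exists; otherwise generate once and store it."""
    if sig not in _explanations:
        _explanations[sig] = generate()
    return _explanations[sig]
```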
Set Confidence Thresholds
Not all LLM outputs are created equal:
• High confidence: Auto-assign to engineer with explanation
• Medium confidence: Show explanation but request human review
• Low confidence: Flag for manual triage, don't show explanation
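Routing on confidence can be as simple as the sketch below, reusing the TriageResult type from earlier; the action names are placeholders for whatever your test UI supports:

```python
def route_triage(result: TriageResult) -> str:
    """Decide what the engineer sees based on the model's self-reported confidence."""
    if result.confidence == "high":
        return "auto_assign"        # attach the explanation and assign an owner
    if result.confidence == "medium":
        return "show_with_review"   # display the explanation, but ask a human to confirm
    return "manual_triage"          # hide the explanation and route to a person
```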
Limitations and Risks
Hallucination
LLMs confidently make things up. They might:
• Cite non-existent documentation
• Suggest fixes that don't compile
• Attribute blame to the wrong commit
Always verify before trusting.
Cost
LLM inference is expensive:
• GPT-4 class models: ~$0.03-0.10 per failure
• At 1000 failures/day: $30-100/day in LLM costs alone
• Fine-tuned smaller models can reduce cost 10x
Consider whether the value justifies the expense.
Latency
LLM generation takes 2-10 seconds:
• Not suitable for blocking the CI pipeline
• Run asynchronously after test completion
• Cache aggressively for repeated failures
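One way to keep the LLM off the critical path is a plain in-process queue, as in this sketch; `generate_explanation` and `store_result` are placeholders for the steps described above:

```python
import queue
import threading

triage_queue: queue.Queue = queue.Queue()

def on_test_failed(failure: dict) -> None:
    """Called from the CI webhook: enqueue and return immediately, never block the pipeline."""
    triage_queue.put(failure)

def triage_worker() -> None:
    """Background worker: runs the slow LLM step out of band and stores the result."""
    while True:
        failure = triage_queue.get()
        explanation = generate_explanation(failure)       # placeholder for the LLM call
        store_result(failure["test_name"], explanation)   # placeholder for the test-UI store
        triage_queue.task_done()

threading.Thread(target=triage_worker, daemon=True).start()
```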
Privacy
Your test failures contain sensitive information:
• Code snippets
• Internal URLs and credentials
• Customer data in test fixtures
Use on-premise models or ensure your provider meets compliance requirements.
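If failure context has to leave your network at all, redact it first. The patterns below are illustrative and nowhere near exhaustive; treat them as a starting point, not a compliance control:

```python
import re

REDACTIONS = [
    (re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"), r"\1=<REDACTED>"),  # credential assignments
    (re.compile(r"https?://[\w.-]+\.internal\S*"), "<INTERNAL_URL>"),                  # internal hostnames (assumed convention)
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),                               # email addresses in fixtures
]

def redact(text: str) -> str:
    """Strip obvious secrets, internal URLs, and PII before sending context to a hosted model."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```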
Practical Architecture
Here's a realistic implementation:
1. Test fails → webhook triggers triage service
2. Service collects context (logs, diff, history)
3. RAG retrieves relevant documentation
4. LLM generates explanation (async)
5. Traditional ML validates the output
6. Result stored and displayed in test UI
7. Engineer feedback improves future predictions
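Tied together, steps 2 through 6 reduce to a short orchestration function. Every helper named here was either sketched earlier in this post or is a placeholder (`build_prompt`, `call_llm`, `store_result`, `flag_for_manual_triage`) for your own services:

```python
def triage_failure(failure: dict, documents: list[dict], ml_category: str) -> None:
    """End-to-end pass for one failure: context -> RAG -> LLM -> validation -> storage."""
    docs = retrieve_context(failure["error_message"], documents)  # step 3: RAG retrieval
    messages = build_prompt(failure, docs)                        # step 4: prompt assembly (placeholder)
    raw = call_llm(messages)                                      # step 4: LLM call, run asynchronously
    result = parse_triage_response(raw)
    if result and validate_triage(result, ml_category):           # step 5: traditional ML cross-check
        store_result(failure["test_name"], result)                # step 6: surface in the test UI
    else:
        flag_for_manual_triage(failure["test_name"])              # fall back when validation fails
```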
Start simple: One model, one failure type, one team. Iterate based on feedback.
Conclusion
LLMs are powerful tools for test triage, but they're not magic. They work best when:
• Grounded in your specific context (RAG)
• Validated by traditional ML
• Used asynchronously, not blocking
• Continuously improved with human feedback
The future is hybrid: specialized ML models for classification and retrieval, LLMs for explanation and suggestion. Together, they can make test triage almost effortless.