Tests fail. AI explains why.
Testhide is a full CI/CD platform where 8 specialized ML models analyze every failure automatically — root cause, flakiness, log patterns, similar failures, Jira auto-link. One pipeline YAML. Zero AI configuration.
# ~/testhide/pipeline.yml: this is all you write
name: payments-regression
steps:
  - name: unit_tests
    type: pytest
    path: tests/unit/          # ✓ 48 passed, 0 failed
  - name: e2e_payments         # ← failed step
    type: pytest
    path: tests/e2e/
    expected: success
    actual: FAIL (exit 2, 47.2s)
  - name: llm_quality_check
    type: llm_eval
    prompt_ref: support_quality_v4
    judge: gpt-4o
    on_fail: block_pr
# ── stderr (e2e_payments) ──────────────────────
requests.exceptions.HTTPError:
502 BadGateway /api/v1/charge
retry attempt 3 of 3
# Root-Cause Classifier — DistilBERT (fine-tuned)
# runs automatically on every failed step
→ classification: ENV_FAILURE
→ confidence: 0.87
→ alt classes:
PRODUCT_BUG 0.09
TEST_ISSUE 0.04
# top evidence tokens
"502 BadGateway" · "retry attempt 3"
"/api/v1/charge" · "HTTPError"
# verdict: infra issue — not your code
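A minimal sketch of the last step behind a verdict like the one above: turning a three-way classifier's raw scores into a label, a confidence, and ranked alternatives. It assumes a softmax output layer. The logits and the `classify` helper are hypothetical and are not Testhide's actual model code.

```python
import math

# Class names taken from the panel above.
LABELS = ["ENV_FAILURE", "PRODUCT_BUG", "TEST_ISSUE"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, labels=LABELS):
    """Return (top_label, confidence, alternatives ranked by probability)."""
    ranked = sorted(zip(labels, softmax(logits)),
                    key=lambda pair: pair[1], reverse=True)
    (top, confidence), alternatives = ranked[0], ranked[1:]
    return top, confidence, alternatives

# Hypothetical logits for one failed step.
label, confidence, alternatives = classify([2.0, 0.0, -1.0])
print(label, round(confidence, 2), alternatives)
```

The confidence and the alternative scores always sum to 1, which is why the panel's 0.87, 0.09, and 0.04 add up.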
# Flakiness Predictor — gradient-boosted ensemble
# 41 features: temporal, failure history, env, build context
→ verdict: REAL REGRESSION
→ flap_count_7d: 0 transitions
→ fail_streak: 3 consecutive
# top contributing features
fail_rate_last_6h 0.83
consecutive_fails 0.71
env_error_rate 0.64
# not flaky — something actually broke
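Two of the signals in the panel above, `flap_count` and `fail_streak`, are simple to compute from a run history. The sketch below assumes a history is a list of booleans (True means the run passed, newest run last). These definitions are an assumption for illustration. The predictor described above uses 41 features, and this covers only two of them.

```python
def flap_count(history):
    """Count pass/fail transitions between consecutive runs."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)

def fail_streak(history):
    """Count consecutive failures at the end of the history (newest last)."""
    streak = 0
    for passed in reversed(history):
        if passed:
            break
        streak += 1
    return streak

# A test that has only failed inside the window: no flapping, streak of 3.
steady_failure = [False, False, False]
print(flap_count(steady_failure), fail_streak(steady_failure))  # 0 3

# A test that alternates every run: three transitions, no trailing failure.
flapping = [True, False, True, False, True]
print(flap_count(flapping), fail_streak(flapping))
```

A high transition count points toward flakiness, while zero transitions plus a growing streak is the shape of a real break.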
# Failure Retriever — FAISS + sentence-transformers
# indexes every failure; cosine search over embeddings
→ top match: e2e_checkout_flow 0.94
→ occurred: 2026-05-16, build #5019
→ root_cause: ENV_FAILURE (same)
→ resolved: infra team restart
# 4 more similar failures this week
# pattern: payment gateway instability
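The cosine search mentioned above can be sketched in plain Python. This is a toy version: the two-dimensional vectors and the test names in `index` are made up, and a real deployment would use sentence-transformers embeddings with a FAISS index instead of a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_matches(query, index, k=3):
    """Return the k most similar (name, score) pairs, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Hypothetical embeddings of past failures.
index = {
    "e2e_checkout_flow": [1.0, 0.0],
    "unit_currency_format": [0.0, 1.0],
    "e2e_refund_flow": [1.0, 1.0],
}
print(top_matches([1.0, 0.1], index, k=2))
```

The score next to the top match in the panel is this kind of similarity value: closer to 1 means the two failures look more alike.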
# Log Signature Miner — Drain3 template extraction
# clusters thousands of log lines into semantic patterns
→ signature:
"* BadGateway for * during retry * of *"
→ template_id: LT-2847
→ seen: 17 times this sprint
# ↑ 340% vs previous sprint baseline
# spike alert → #platform-infra
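To show where a signature like the one above comes from, here is a deliberately simplified template merge: tokens that two log lines share at the same position are kept, and everything else becomes a wildcard. This is not Drain3's algorithm, which also uses a parse tree and a similarity threshold. The two log lines are hypothetical.

```python
def merge_template(line_a, line_b, wildcard="*"):
    """Keep tokens both lines share at the same position; mask the rest.

    Returns None when the lines have different token counts, because this
    simplified version only merges lines of the same length.
    """
    tokens_a, tokens_b = line_a.split(), line_b.split()
    if len(tokens_a) != len(tokens_b):
        return None
    return " ".join(a if a == b else wildcard
                    for a, b in zip(tokens_a, tokens_b))

# Two hypothetical log lines from failed payment calls.
first = "502 BadGateway for /api/v1/charge during retry 3 of 3"
second = "504 BadGateway for /api/v1/refund during retry 1 of 5"
print(merge_template(first, second))
# * BadGateway for * during retry * of *
```

Once lines collapse to one template, counting occurrences per template is what makes a "seen 17 times this sprint" figure possible.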
# Bug Linker — MiniLM + FAISS · Jira embedding index
# matches failure to open tickets by semantic similarity
→ ticket: PLAT-1183
→ title: "Payment gateway 502 spikes"
→ status: In Progress
→ assignee: @platform-infra
→ linked: auto (similarity 0.91)
# PR marked yellow — known issue
# will not block merge
# Emerging Issues Detector — clustering + trend analysis
# surfaces new failure patterns before they become incidents
→ pattern: "payment gateway instability"
→ tests: 9 affected this week
→ trend: ↑ 340% vs last week
→ impact: 4 blocked PRs
# recommendation: create P1 ticket
# auto-assign to: platform-infra
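A trend figure like the one above is a percent change between two periods. The sketch below shows the arithmetic. The counts 5 and 22 are hypothetical, chosen only because they produce a 340% increase, and `percent_change` is an illustrative helper.

```python
def percent_change(previous, current):
    """Percent change from previous to current (positive means an increase)."""
    if previous <= 0:
        raise ValueError("previous count must be positive")
    return (current - previous) * 100.0 / previous

# Hypothetical failure counts: 5 last week, 22 this week.
print(percent_change(5, 22))  # 340.0
```

Note that the baseline has to be non-zero, so a pattern that never appeared before needs separate handling.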
/ the problem
CI tells you what broke. Never why.
The test goes red. You open the log. You read 400 lines of output looking for a clue. You open Jira. You check flakiness history in a spreadsheet. Forty minutes later you file a ticket that says "502 again" and hope infra picks it up.
Most CI systems don't classify failures. ENV_FAILURE vs PRODUCT_BUG vs TEST_ISSUE: you decide by reading logs.
Tests flip red for one run, green the next. You retry blindly, waste agent time, and miss actual regressions.
Eval tools, CI logs, and Jira never talk to each other. Reconciling across tools is itself a full-time job.
Test traces, prompt outputs, and failure patterns get shipped to third-party SaaS. That is a non-starter for regulated industries.
/ how it works
Three steps. AI runs itself.
You write the pipeline. Testhide runs the tests and routes every failure through 8 ML models automatically.
Connect an agent
Pull the .NET CI agent (Windows / Linux / macOS) or spawn Docker agents dynamically. It registers with your Testhide server automatically.
docker pull testhide-agent
Write one YAML
Define your pipeline steps — pytest, jest, command, llm_eval. Same format as your existing CI configs. No wrapper scripts.
type: pytest · llm_eval · command
AI analyzes every build
8 specialized ML models run on every failure automatically. Root cause, flakiness score, similar failures, log templates, Jira link — all in the dashboard. No config needed.
8 models · zero config
/ the solution
One platform. Eight AI models included.
Testhide is a full CI/CD build server with 8 embedded diagnostic ML models and a native LLM eval step type — all in one deployment, no external services required.
Full CI/CD — not a wrapper
Build pipelines, matrix builds, parallel agents, build history, artifact storage, and a real-time dashboard. Everything GitHub Actions does, minus the 47 third-party actions you had to stitch together.
8 embedded diagnostic AI models
Run on every failed build automatically — no configuration. Root Cause Classifier (DistilBERT), Flakiness Predictor, Failure Retriever (FAISS), OOD Detector, Log Signature Miner (Drain3), Bug Linker, Visual Diff Analyzer, AI Investigator Agent (local LLM).
LLM eval as a first-class step
Add type: llm_eval to any step. Choose your judge model (GPT-4o, Claude, Gemini, or local Phi-3.5 on-prem). Set a pass threshold. If it fails, the PR blocks. Same signal as unit tests.
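The gate described above can be sketched as a threshold check. This assumes the judge returns one score between 0 and 1 per evaluated sample and that scores are averaged. Both are assumptions, since the page does not say how scores are aggregated. `eval_gate` is a hypothetical helper, and `block_pr` mirrors the `on_fail` value from the pipeline example.

```python
from statistics import mean

def eval_gate(scores, threshold):
    """Return "pass" if the mean judge score meets the threshold, else "block_pr"."""
    if not scores:
        raise ValueError("no judge scores to evaluate")
    return "pass" if mean(scores) >= threshold else "block_pr"

# Hypothetical judge scores for three sampled responses.
scores = [0.9, 0.8, 0.7]
print(eval_gate(scores, threshold=0.75))  # pass
print(eval_gate(scores, threshold=0.85))  # block_pr
```

Treating the result as a plain pass or fail is what lets an eval step sit next to unit tests in the same pipeline.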
100% self-hosted. Zero data leakage.
All 8 ML models run locally on your server, so no test data, prompts, or failure traces are sent to third-party services. Enterprise adds on-premise SSO, a dedicated manager, and white-labeling. The free tier ships as a Docker image and stays free.
/ pricing
Start free. Scale when you're ready.
Self-hosted Docker is free forever — full platform, all 8 AI models, unlimited builds. Cloud from $49/mo when you want us to run it.
Free tier · Cloud Starter $49/mo · Cloud Team $299/mo · Enterprise custom
/ faq
Common questions.
What are the 8 AI models — do I have to configure them?
No configuration needed. The 8 models — Root Cause Classifier, Flakiness Predictor, Failure Retriever, OOD Detector, Log Signature Miner, Bug Linker, Visual Diff Analyzer, and AI Investigator Agent — run automatically on every failed build. You write the pipeline YAML; the models do the rest.
Is Testhide only for AI/LLM projects?
No. Testhide is a full CI/CD platform that works for any test suite — pytest, jest, xUnit, JUnit, Playwright. The LLM eval step (type: llm_eval) is opt-in. The 8 diagnostic AI models run on every failure regardless of what you're testing.
How is this different from GitHub Actions + Braintrust?
Testhide runs evals synchronously as build steps — the PR can't merge until evals pass, and you see all results in one dashboard. With Braintrust + GitHub Actions you configure external webhooks, manage credentials in two places, and reconcile results manually across three UIs. The 8 embedded diagnostic models have no equivalent in that stack.
What CI agent do I need? Does it work on Windows?
Yes. The Testhide CI agent is a .NET 8 application that runs on Windows, Linux, and macOS. You can also spawn Docker agents dynamically for isolated builds. Multiple agents can connect to a single Testhide server and work in parallel.
Can I self-host on my own infrastructure?
Yes. The free tier is a Docker image you run on-prem — full platform, all 8 AI models, no phone-home. Enterprise adds SSO, on-premise licensing, and a dedicated manager. Your test data never leaves your network.
What LLM judges does the eval step support?
GPT-4o, GPT-3.5, Claude 3.5 Sonnet, Claude 3 Haiku, Gemini 1.5 Pro, and local Phi-3.5-mini / Llama 3 via llama.cpp (no API key needed for local models). Set judge: gpt-4o or judge: phi-3.5-local in your pipeline YAML.
Stop reading logs. Let AI read them for you.
Free forever self-hosted · Cloud from $49/mo · No credit card for free tier