v6.2.x · CI/CD + 8 AI models for every build

Tests fail. AI explains why.

Testhide is a full CI/CD platform where 8 specialized ML models analyze every failure automatically — root cause, flakiness, log patterns, similar failures, Jira auto-link. One pipeline YAML. Zero AI configuration.

Install with Docker → See all features →

→ Self-hosted Docker, free forever. → Cloud from $49/mo. → pytest · jest · xUnit · JUnit · Playwright.

~/testhide / pipeline.yml

build #5041

# ~/testhide/pipeline.yml  —  this is all you write
name: payments-regression
steps:
  - name: unit_tests
    type: pytest
    path: tests/unit/      # ✓ 48 passed, 0 failed

  - name: e2e_payments     # ← failed step
    type: pytest
    path: tests/e2e/
    expected: success
    actual:   FAIL (exit 2, 47.2s)

  - name: llm_quality_check
    type: llm_eval
    prompt_ref: support_quality_v4
    judge: gpt-4o
    on_fail: block_pr

# ── stderr (e2e_payments) ──────────────────────
requests.exceptions.HTTPError:
  502 BadGateway /api/v1/charge
  retry attempt 3 of 3

# Root-Cause Classifier — DistilBERT (fine-tuned)
# runs automatically on every failed step

→ classification: ENV_FAILURE
→ confidence:    0.87
→ alt classes:
     PRODUCT_BUG   0.09
     TEST_ISSUE    0.04

# top evidence tokens
"502 BadGateway" · "retry attempt 3"
"/api/v1/charge" · "HTTPError"

# verdict: infra issue — not your code

# Flakiness Predictor — gradient-boosted ensemble
# 41 features: temporal, failure history, env, build context

→ verdict:       REAL REGRESSION
→ flap_count_7d: 0 transitions
→ fail_streak:   3 consecutive

# top contributing features
fail_rate_last_6h   0.83
consecutive_fails   0.71
env_error_rate      0.64

# not flaky — something actually broke
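The two history signals in the panel above, flap count and fail streak, can be computed from a plain list of pass/fail results. What follows is a minimal sketch under stated assumptions, not Testhide's implementation: the function names and the sample histories are hypothetical, and the real predictor combines 41 features rather than two.

```python
from typing import List


def flap_count(history: List[bool]) -> int:
    """Count pass/fail transitions between consecutive runs (oldest first)."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)


def fail_streak(history: List[bool]) -> int:
    """Count consecutive failures at the end of the history (most recent last)."""
    streak = 0
    for passed in reversed(history):
        if passed:
            break
        streak += 1
    return streak


# Hypothetical 7-day histories, oldest first; True means the run passed.
recent_regression = [False, False, False]
recent_flaky = [True, False, True, False, True, False]

print(flap_count(recent_regression), fail_streak(recent_regression))  # 0 3
print(flap_count(recent_flaky), fail_streak(recent_flaky))            # 5 1
```

A test that never flips but keeps failing looks like the first history; a test that alternates looks like the second.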

# Failure Retriever — FAISS + sentence-transformers
# indexes every failure; cosine search over embeddings

→ top match: e2e_checkout_flow  0.94
→ occurred:  2026-05-16, build #5019
→ root_cause: ENV_FAILURE (same)
→ resolved:  infra team restart

# 4 more similar failures this week
# pattern: payment gateway instability
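Cosine search over embeddings, as named in the panel above, comes down to ranking stored vectors by their angle to the query vector. The sketch below uses plain Python and toy 3-dimensional vectors in place of FAISS and sentence-transformers; the vectors, the second test name, and the helper names are illustrative assumptions only.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(
    query: List[float], index: Dict[str, List[float]], k: int = 1
) -> List[Tuple[str, float]]:
    """Return the k stored failures most similar to the query, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy "embeddings"; a real index would hold vectors from an embedding model.
index = {
    "e2e_checkout_flow": [0.9, 0.1, 0.0],
    "unit_currency_rounding": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # stand-in for the embedding of the new failure

best = top_matches(query, index, k=1)
print(best[0][0])  # e2e_checkout_flow
```

A linear scan like this is fine for a handful of vectors; a vector index exists to keep the same ranking fast across a large failure history.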

# Log Signature Miner — Drain3 template extraction
# clusters thousands of log lines into semantic patterns

→ signature:
  "* BadGateway for * during retry * of *"
→ template_id: LT-2847
→ seen:        17 times this sprint

# ↑ 340% vs previous sprint baseline
# spike alert → #platform-infra
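A signature like the one above is what remains after the parts of a log line that vary are replaced with wildcards. The sketch below shows only that merge step, on made-up log lines shaped to match the signature. Drain-style miners also route each line to a candidate cluster and apply a similarity threshold before merging, which is omitted here, and the function name is hypothetical.

```python
from typing import List, Optional


def merge_template(template: List[str], tokens: List[str]) -> Optional[List[str]]:
    """Merge a tokenized log line into a template.

    Positions where the template and the line disagree become "*".
    Returns None when the token counts differ, since this sketch only
    merges lines of equal length.
    """
    if len(template) != len(tokens):
        return None
    return [t if t == tok else "*" for t, tok in zip(template, tokens)]


# Made-up log lines for illustration.
lines = [
    "502 BadGateway for /api/v1/charge during retry 3 of 3",
    "502 BadGateway for /api/v1/refund during retry 1 of 3",
    "504 BadGateway for /api/v1/charge during retry 2 of 5",
]

template = lines[0].split()
for line in lines[1:]:
    merged = merge_template(template, line.split())
    if merged is not None:
        template = merged

signature = " ".join(template)
print(signature)  # * BadGateway for * during retry * of *
```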

# Bug Linker — MiniLM + FAISS · Jira embedding index
# matches failure to open tickets by semantic similarity

→ ticket:   PLAT-1183
→ title:    "Payment gateway 502 spikes"
→ status:   In Progress
→ assignee: @platform-infra
→ linked:   auto (similarity 0.91)

# PR marked yellow — known issue
# will not block merge

# Emerging Issues Detector — clustering + trend analysis
# surfaces new failure patterns before they become incidents

→ pattern: "payment gateway instability"
→ tests:   9 affected this week
→ trend:   ↑ 340% vs last week
→ impact:  4 blocked PRs

# recommendation: create P1 ticket
# auto-assign to: platform-infra
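A trend figure like "↑ 340%" reads as an increase relative to the earlier period, so a 340% rise means roughly 4.4 times the earlier count. The sketch below works through that arithmetic with made-up counts; the panel does not state the underlying numbers, and the function name is hypothetical.

```python
def percent_change(previous: int, current: int) -> float:
    """Percentage change from previous to current; positive means growth."""
    if previous == 0:
        raise ValueError("previous count must be non-zero")
    return (current - previous) * 100.0 / previous


# Made-up weekly failure counts for one pattern.
last_week, this_week = 5, 22

change = percent_change(last_week, this_week)
print(change)  # 340.0
```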

8 AI models per build

1 YAML to configure everything

0 AI config required

∞ Free tier — self-hosted

/ the problem

CI tells you what broke. Never why.

The test goes red. You open the log. You read 400 lines of output looking for a clue. You open Jira. You check flakiness history in a spreadsheet. Forty minutes later you file a ticket that says "502 again" and hope infra picks it up.

🕵️

Root cause is manual investigation

Most CI systems don't classify failures. ENV_FAILURE, PRODUCT_BUG, or TEST_ISSUE: you decide by reading logs.

🎲

You can't tell flaky from real

Tests flip red for one run, green the next. You retry blindly, waste agent time, and miss actual regressions.

🔀

Three dashboards, zero signal

Eval tools, CI logs, and Jira never talk to each other. Reconciling across tools is itself a full-time job.

🔒

Your data belongs to the cloud

Test traces, prompt outputs, and failure patterns get shipped to third-party SaaS. That's a non-starter for regulated industries.

/ how it works

Three steps. AI runs itself.

You write the pipeline. Testhide runs the tests and routes every failure through 8 ML models automatically.

Connect an agent

Pull the .NET CI agent (Windows / Linux / macOS) or spawn Docker agents dynamically. Either way, the agent registers with your Testhide server automatically.

docker pull testhide-agent

→

Write one YAML

Define your pipeline steps — pytest, jest, command, llm_eval. Same format as your existing CI configs. No wrapper scripts.

type: pytest · llm_eval · command

→

AI analyzes every build

8 specialized ML models run on every failure automatically. Root cause, flakiness score, similar failures, log templates, Jira link — all in the dashboard. No config needed.

8 models · zero config

/ the solution

One platform. Eight AI models included.

Testhide is a full CI/CD build server with 8 embedded diagnostic ML models and a native LLM eval step type — all in one deployment, no external services required.

🏗️

Full CI/CD — not a wrapper

Build pipelines, matrix builds, parallel agents, build history, artifact storage, and a real-time dashboard. Everything GitHub Actions does, minus the 47 third-party actions you had to stitch together.

matrix builds · parallel agents · artifacts · webhooks

🧠

8 embedded diagnostic AI models

Run on every failed build automatically — no configuration. Root Cause Classifier (DistilBERT), Flakiness Predictor, Failure Retriever (FAISS), OOD Detector, Log Signature Miner (Drain3), Bug Linker, Visual Diff Analyzer, AI Investigator Agent (local LLM).

DistilBERT · FAISS · Drain3 · CLIP · llama.cpp

⚖️

LLM eval as a first-class step

Add type: llm_eval to any step. Choose your judge model (GPT-4o, Claude, Gemini, or local Phi-3.5 on-prem). Set a pass threshold. If it fails, the PR blocks. Same signal as unit tests.

gpt-4o judge · claude judge · local phi-3.5 · block_pr
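The gating idea behind an eval step can be sketched in a few lines: collect the judge's scores, compare them against a pass threshold, and block the PR when they fall short. This is an illustration under assumptions, not Testhide's logic: the 0 to 1 score scale, the use of a mean, and the function name are all hypothetical, since the page does not say how scores are aggregated.

```python
from statistics import mean
from typing import List


def gate(scores: List[float], threshold: float) -> str:
    """Return "pass" when the mean judge score meets the threshold, else "block_pr"."""
    if not scores:
        return "block_pr"  # nothing was scored, so nothing was verified
    return "pass" if mean(scores) >= threshold else "block_pr"


# Hypothetical judge scores on a 0-1 scale, one per evaluated prompt.
print(gate([0.9, 0.8, 0.7], threshold=0.75))  # pass
print(gate([0.9, 0.4, 0.5], threshold=0.75))  # block_pr
```

Treating an empty score list as a failure keeps a misconfigured eval step from silently passing.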

🔒

100% self-hosted. Zero data leakage.

All 8 ML models run locally on your server, so no test data, prompts, or failure traces are sent to third-party services. Enterprise adds on-premise SSO, a dedicated manager, and white-labeling. The free tier ships as a Docker image and is always free.

on-prem · SSO · white-label · no vendor lock-in

/ pricing

Start free. Scale when you're ready.

Self-hosted Docker is free forever — full platform, all 8 AI models, unlimited builds. Cloud from $49/mo when you want us to run it.

Install with Docker → See all pricing

Free tier · Cloud Starter $49/mo · Cloud Team $299/mo · Enterprise custom

/ faq

Common questions.

What are the 8 AI models — do I have to configure them?

No configuration needed. The 8 models — Root Cause Classifier, Flakiness Predictor, Failure Retriever, OOD Detector, Log Signature Miner, Bug Linker, Visual Diff Analyzer, and AI Investigator Agent — run automatically on every failed build. You write the pipeline YAML; the models do the rest.

Is Testhide only for AI/LLM projects?

No. Testhide is a full CI/CD platform that works for any test suite — pytest, jest, xUnit, JUnit, Playwright. The LLM eval step (type: llm_eval) is opt-in. The 8 diagnostic AI models run on every failure regardless of what you're testing.

How is this different from GitHub Actions + Braintrust?

Testhide runs evals synchronously as build steps — the PR can't merge until evals pass, and you see all results in one dashboard. With Braintrust + GitHub Actions you configure external webhooks, manage credentials in two places, and reconcile results manually across three UIs. The 8 embedded diagnostic models have no equivalent in that stack.

What CI agent do I need? Does it work on Windows?

Yes. The Testhide CI agent is a .NET 8 application that runs on Windows, Linux, and macOS. You can also spawn Docker agents dynamically for isolated builds. Multiple agents can connect to a single Testhide server and work in parallel.

Can I self-host on my own infrastructure?

Yes. The free tier is a Docker image you run on-prem — full platform, all 8 AI models, no phone-home. Enterprise adds SSO, on-premise licensing, and a dedicated manager. Your test data never leaves your network.

What LLM judges does the eval step support?

GPT-4o, GPT-3.5, Claude 3.5 Sonnet, Claude 3 Haiku, Gemini 1.5 Pro, and local Phi-3.5-mini / Llama 3 via llama.cpp (no API key needed for local models). Set judge: gpt-4o or judge: phi-3.5-local in your pipeline YAML.

Stop reading logs. Let AI read them for you.

Free forever self-hosted · Cloud from $49/mo · No credit card for free tier

Install with Docker → Talk to us