v6.2.x CI/CD + 8 AI models for every build

Tests fail. AI explains why.

Testhide is a full CI/CD platform where 8 specialized ML models analyze every failure automatically — root cause, flakiness, log patterns, similar failures, Jira auto-link. One pipeline YAML. Zero AI configuration.

Self-hosted Docker, free forever. Cloud from $49/mo. pytest · Jest · xUnit · JUnit · Playwright.
~/testhide / pipeline.yml
build #5041
# ~/testhide/pipeline.yml  —  this is all you write
name: payments-regression
steps:
  - name: unit_tests
    type: pytest
    path: tests/unit/      # ✓ 48 passed, 0 failed

  - name: e2e_payments     # ← failed step
    type: pytest
    path: tests/e2e/
    expected: success
    actual:   FAIL (exit 2, 47.2s)

  - name: llm_quality_check
    type: llm_eval
    prompt_ref: support_quality_v4
    judge: gpt-4o
    on_fail: block_pr

# ── stderr (e2e_payments) ──────────────────────
requests.exceptions.HTTPError:
  502 BadGateway /api/v1/charge
  retry attempt 3 of 3

8 AI models per build · 1 YAML to configure everything · 0 AI config required · Free tier, self-hosted
works with → pytest Jest xUnit JUnit NUnit Playwright Git Jira

/ the problem

CI tells you what broke. Never why.

The test goes red. You open the log. You read 400 lines of output looking for a clue. You open Jira. You check flakiness history in a spreadsheet. Forty minutes later you file a ticket that says "502 again" and hope infra picks it up.

🕵️
Root cause is manual investigation

Most CI systems don't classify failures. ENV_FAILURE vs PRODUCT_BUG vs TEST_ISSUE: you decide by reading logs.

🎲
You can't tell flaky from real

Tests flip red for one run, green the next. You retry blindly, waste agent time, and miss actual regressions.

🔀
Three dashboards, zero signal

Eval tools, CI logs, and Jira never talk to each other. Reconciling across tools is itself a full-time job.

🔒
Your data belongs to the cloud

Test traces, prompt outputs, and failure patterns are shipped to third-party SaaS. That is a non-starter for regulated industries.
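
To make the ENV_FAILURE / PRODUCT_BUG / TEST_ISSUE split named above concrete, here is a minimal Python sketch that labels a failure log using hand-written keyword rules. The rules, the UNKNOWN fallback, and the name classify_failure are assumptions made for this example. It is not how Testhide classifies failures; it only shows what the three labels mean in practice.

```python
import re

# Hypothetical keyword rules, for illustration only. A trained classifier
# would learn these signals from data instead of hard-coding them.
RULES = [
    ("ENV_FAILURE", re.compile(r"\b(502|503|504)\b|connection refused|timed out", re.I)),
    ("TEST_ISSUE", re.compile(r"fixture .* not found|stale element", re.I)),
    ("PRODUCT_BUG", re.compile(r"AssertionError", re.I)),
]


def classify_failure(log_text: str) -> str:
    """Return the first category whose pattern matches, else UNKNOWN."""
    for label, pattern in RULES:
        if pattern.search(log_text):
            return label
    return "UNKNOWN"


# The stderr excerpt from the e2e_payments step shown earlier on this page.
stderr = (
    "requests.exceptions.HTTPError:\n"
    "  502 BadGateway /api/v1/charge\n"
    "  retry attempt 3 of 3"
)
print(classify_failure(stderr))
```

Rule order matters in a sketch like this: a gateway error that also triggers an assertion is attributed to the environment first, which is a design choice, not a given.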

/ how it works

Three steps. AI runs itself.

You write the pipeline. Testhide runs the tests and routes every failure through 8 ML models automatically.

1

Connect an agent

Pull the .NET CI agent (Windows / Linux / macOS) or spawn Docker agents dynamically. Either way, the agent registers with your Testhide server automatically.

docker pull testhide-agent
2

Write one YAML

Define your pipeline steps — pytest, jest, command, llm_eval. Same format as your existing CI configs. No wrapper scripts.

type: pytest · llm_eval · command
3

AI analyzes every build

8 specialized ML models run on every failure automatically. Root cause, flakiness score, similar failures, log templates, Jira link — all in the dashboard. No config needed.

8 models · zero config
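
For a feel of what a flakiness score could measure, the Python sketch below computes a simple flip rate: the share of consecutive runs where a test's outcome changed. This is an assumed stand-in metric chosen for illustration, not the Flakiness Predictor itself, and the run histories are made up.

```python
def flip_rate(history: list[bool]) -> float:
    """Fraction of consecutive run pairs where the outcome changed.

    history holds one entry per run, oldest first: True = pass, False = fail.
    """
    if len(history) < 2:
        return 0.0
    flips = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return flips / (len(history) - 1)


# Made-up histories: one test that alternates, one that broke and stayed broken.
flaky = [True, False, True, True, False, True]
regression = [True, True, True, False, False, False]

print(flip_rate(flaky))       # high: outcome keeps changing
print(flip_rate(regression))  # low: a single, lasting change
```

A high flip rate suggests retrying is reasonable; a low one with a recent failure points at a real regression.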

/ the solution

One platform. Eight AI models included.

Testhide is a full CI/CD build server with 8 embedded diagnostic ML models and a native LLM eval step type — all in one deployment, no external services required.

🏗️

Full CI/CD — not a wrapper

Build pipelines, matrix builds, parallel agents, build history, artifact storage, and a real-time dashboard. Everything GitHub Actions does, minus the 47 third-party actions you had to stitch together.

matrix builds parallel agents artifacts webhooks
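
If matrix builds are new to you, the idea is a Cartesian product: every combination of the listed axes becomes its own job. The Python sketch below shows that expansion. The axis names (os, python) and the function expand_matrix are hypothetical; Testhide's own matrix syntax is not shown on this page.

```python
from itertools import product


def expand_matrix(axes: dict[str, list[str]]) -> list[dict[str, str]]:
    """Turn {axis: [values]} into one job dict per combination of values."""
    keys = list(axes)
    return [dict(zip(keys, combo)) for combo in product(*(axes[k] for k in keys))]


# Hypothetical axes: 2 operating systems x 2 interpreter versions = 4 jobs.
jobs = expand_matrix({"os": ["linux", "windows"], "python": ["3.11", "3.12"]})
for job in jobs:
    print(job)
```

Each resulting job can then be handed to a separate agent, which is where parallel agents pay off.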
🧠

8 embedded diagnostic AI models

Run on every failed build automatically — no configuration. Root Cause Classifier (DistilBERT), Flakiness Predictor, Failure Retriever (FAISS), OOD Detector, Log Signature Miner (Drain3), Bug Linker, Visual Diff Analyzer, AI Investigator Agent (local LLM).

DistilBERT FAISS Drain3 CLIP llama.cpp
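
Log signature mining boils down to collapsing lines that differ only in their variable parts. The Python sketch below does this with plain regex masking, which is a much simpler technique than Drain3's parse tree and is used here only to illustrate the idea. The mask tokens and the sample lines (taken from the stderr excerpt earlier on this page, plus two assumed retries) are illustrative.

```python
import re
from collections import Counter

# Mask hex values first, then any remaining numbers, so "0x1f" is one token.
MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\d+(\.\d+)?"), "<NUM>"),
]


def signature(line: str) -> str:
    """Replace variable-looking tokens so similar lines share one template."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line.strip()


lines = [
    "retry attempt 1 of 3",  # assumed earlier retries
    "retry attempt 2 of 3",
    "retry attempt 3 of 3",
    "502 BadGateway /api/v1/charge",
]
counts = Counter(signature(line) for line in lines)
for template, n in counts.most_common():
    print(n, template)
```

Three retry lines collapse into one template, so a 400-line log shrinks to a handful of distinct patterns worth reading.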
⚖️

LLM eval as a first-class step

Add type: llm_eval to any step. Choose your judge model (GPT-4o, Claude, Gemini, or a local Phi-3.5 running on-prem). Set a pass threshold. If the eval fails, the PR is blocked. Same signal as unit tests.

gpt-4o judge claude judge local phi-3.5 block_pr
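
The pass-threshold idea can be sketched in a few lines of Python: average the judge's scores and fail the step when the average falls short. Everything here is an assumption for illustration: the 0-to-1 score scale, the 0.75 threshold, the use of a mean, and the hard-coded scores standing in for real judge output. No judge API is called.

```python
from statistics import mean


def gate(scores: list[float], threshold: float) -> tuple[bool, float]:
    """Return (passed, average). An empty score list counts as a failure."""
    if not scores:
        return False, 0.0
    avg = mean(scores)
    return avg >= threshold, avg


# Hard-coded stand-ins for per-example judge scores on an assumed 0..1 scale.
passed, avg = gate([0.9, 0.8, 0.4], threshold=0.75)

# A non-zero exit code is how a CI step conventionally signals failure.
exit_code = 0 if passed else 1
print(f"average={avg:.2f} passed={passed} exit_code={exit_code}")
```

Treating the eval like any other step with an exit code is what lets it block a PR the same way a failing unit test does.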
🔒

100% self-hosted. Zero data leakage.

All 8 ML models run locally on your server: no test data, prompts, or failure traces are sent to third-party services. Enterprise adds on-premise SSO, a dedicated manager, and white-labeling. The free tier ships as a Docker image and stays free.

on-prem SSO white-label no vendor lock-in

/ pricing

Start free. Scale when you're ready.

Self-hosted Docker is free forever — full platform, all 8 AI models, unlimited builds. Cloud from $49/mo when you want us to run it.


Free tier · Cloud Starter $49/mo · Cloud Team $299/mo · Enterprise custom

/ faq

Common questions.

What are the 8 AI models — do I have to configure them?

No configuration needed. The 8 models — Root Cause Classifier, Flakiness Predictor, Failure Retriever, OOD Detector, Log Signature Miner, Bug Linker, Visual Diff Analyzer, and AI Investigator Agent — run automatically on every failed build. You write the pipeline YAML; the models do the rest.
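
To give a sense of what "similar failures" means, the Python sketch below ranks past failures by bag-of-words cosine similarity to a new one. This is a deliberately simple stand-in: it uses no embeddings and no FAISS index, and the build IDs and past failure messages are made up for the example.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Lowercase word counts; digits and punctuation are ignored."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query: str, past: dict[str, str]) -> tuple[str, float]:
    """Return (build_id, score) of the closest past failure; past must be non-empty."""
    q = vectorize(query)
    scored = ((bid, cosine(q, vectorize(text))) for bid, text in past.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical history of earlier failures.
past_failures = {
    "build-A": "HTTPError 502 BadGateway on charge endpoint after retry",
    "build-B": "AssertionError: cart total mismatch",
}
best_id, score = most_similar(
    "502 BadGateway /api/v1/charge retry attempt 3 of 3", past_failures
)
print(best_id, round(score, 3))
```

A vector index does the same ranking at scale; the principle of "nearest past failure wins" is unchanged.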

Is Testhide only for AI/LLM projects?

No. Testhide is a full CI/CD platform that works for any test suite: pytest, Jest, xUnit, JUnit, Playwright. The LLM eval step (type: llm_eval) is opt-in. The 8 diagnostic AI models run on every failure regardless of what you're testing.

How is this different from GitHub Actions + Braintrust?

Testhide runs evals synchronously as build steps — the PR can't merge until evals pass, and you see all results in one dashboard. With Braintrust + GitHub Actions you configure external webhooks, manage credentials in two places, and reconcile results manually across three UIs. The 8 embedded diagnostic models have no equivalent in that stack.

What CI agent do I need? Does it work on Windows?

The Testhide CI agent is a .NET 8 application, and yes, it runs on Windows as well as Linux and macOS. You can also spawn Docker agents dynamically for isolated builds. Multiple agents can connect to a single Testhide server and work in parallel.

Can I self-host on my own infrastructure?

Yes. The free tier is a Docker image you run on-prem — full platform, all 8 AI models, no phone-home. Enterprise adds SSO, on-premise licensing, and a dedicated manager. Your test data never leaves your network.

What LLM judges does the eval step support?

GPT-4o, GPT-3.5, Claude 3.5 Sonnet, Claude 3 Haiku, Gemini 1.5 Pro, and local Phi-3.5-mini / Llama 3 via llama.cpp (no API key needed for local models). Set judge: gpt-4o or judge: phi-3.5-local in your pipeline YAML.

Stop reading logs. Let AI read them for you.

Free forever self-hosted · Cloud from $49/mo · No credit card for free tier