East-West GPU Fabric Observability

Your cluster looks healthy. It isn't.

NVSentinel tells you a node is sick. Clockwork moves your job to a healthy node. Fabric Sentinel tells you which workload made the node sick — and what it cost every team sharing it.

Live interference detection — Fabric Sentinel output
[Chart: NCCL collective throughput for training job gpt-finetuner-7b2a1, alongside NIC QP table pressure on node gpu-node-12, 14:32:00–14:32:11]

CAUSE POD: inference-team-a / llama-70b-server-84cf2
THROUGHPUT LOSS: ↓ 24% (4m 12s)
ROOT CAUSE: QP table exhaustion — mlx5 NIC
ESTIMATED COST: $183 wasted

One inference job exhausts NIC microarchitecture resources. Your training job slows 20–30%. NVSentinel shows everything healthy. Your engineers find nothing. "The cluster was slow today."

The NIC layer is a blind spot for every monitoring tool on the market.

When multiple AI workloads share GPU servers, inference jobs quietly exhaust the NIC queue-pair (QP) tables and on-NIC caches that NCCL collectives depend on. Training throughput collapses. No alarm fires. No ticket gets opened.

20–30%
throughput loss in victim training jobs, with no visible fault
Microsoft Harmonic, NSDI 2024

5–15%
extra TCO lost by Silver-tier operators vs. Gold-tier operators at the same pricing
SemiAnalysis ClusterMAX

6–21%
of total GPU cluster TCO wasted as lost goodput
SemiAnalysis ClusterMAX

7.9 hr
mean time to failure for 1,024-GPU jobs; NIC errors are the leading cause
Meta, HPCA 2025

A read-only eBPF DaemonSet. One Helm install. No code changes.

Fabric Sentinel runs entirely in the Linux kernel, on the CPU. Zero GPU overhead. Zero production traffic touched. It deploys alongside NVSentinel and Clockwork — because neither sees this layer.

01

eBPF kprobes on mlx5 kernel driver

Reads NIC cache and QP table counters per PID at sub-second resolution — no kernel patch, no agent restart.
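A minimal sketch of what step 01 can look like with the cilium/ebpf Go library. The object file, program name, map name, and the mlx5_core_create_qp symbol are placeholders, not Fabric Sentinel's actual hook points; real mlx5 symbols vary by driver and kernel version.

package main

import (
	"log"
	"time"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// Per-PID counters mirrored from a BPF hash map; the layout must
// match the value struct on the BPF side.
type qpStats struct {
	QPAllocs  uint64
	CacheMiss uint64
}

func main() {
	// qp_probe.o is an assumed pre-compiled BPF object whose kprobe
	// handler bumps per-PID counters in the "qp_stats" map.
	coll, err := ebpf.LoadCollection("qp_probe.o")
	if err != nil {
		log.Fatalf("load BPF objects: %v", err)
	}
	defer coll.Close()

	// Hook point is illustrative; the real symbol depends on the
	// mlx5 driver and kernel version running on the node.
	kp, err := link.Kprobe("mlx5_core_create_qp", coll.Programs["trace_create_qp"], nil)
	if err != nil {
		log.Fatalf("attach kprobe: %v", err)
	}
	defer kp.Close()

	// Poll the per-PID counter map at sub-second resolution.
	for range time.Tick(250 * time.Millisecond) {
		var pid uint32
		var st qpStats
		it := coll.Maps["qp_stats"].Iterate()
		for it.Next(&pid, &st) {
			log.Printf("pid=%d qp_allocs=%d cache_misses=%d", pid, st.QPAllocs, st.CacheMiss)
		}
	}
}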

02

Correlation engine (100ms sliding window)

Matches counter events to throughput degradation signals across co-located workloads in real time.
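The window logic itself is small. A self-contained Go sketch of the idea, with made-up thresholds (90% QP occupancy, a 20% throughput dip) standing in for whatever the production engine tunes:

package main

import (
	"fmt"
	"time"
)

type sample struct {
	at    time.Time
	pid   uint32  // cause side: PID driving QP allocations
	job   string  // victim side: training job name
	value float64 // QP occupancy (cause) or throughput (victim)
}

const window = 100 * time.Millisecond

type correlator struct {
	pressure []sample // recent NIC-pressure samples
}

// observePressure records a QP-pressure sample and drops anything
// older than the sliding window.
func (c *correlator) observePressure(s sample) {
	c.pressure = append(c.pressure, s)
	cutoff := s.at.Add(-window)
	i := 0
	for ; i < len(c.pressure) && c.pressure[i].at.Before(cutoff); i++ {
	}
	c.pressure = c.pressure[i:]
}

// observeThroughput flags a candidate event when a throughput dip
// overlaps a pressure spike inside the window.
func (c *correlator) observeThroughput(s sample, baseline float64) {
	if s.value > 0.8*baseline { // less than a 20% dip: ignore
		return
	}
	for _, p := range c.pressure {
		if s.at.Sub(p.at) <= window && p.value > 0.9 { // assumed occupancy threshold
			fmt.Printf("candidate event: pid %d pressuring job %s (%.0f%% of baseline)\n",
				p.pid, s.job, 100*s.value/baseline)
		}
	}
}

func main() {
	c := &correlator{}
	now := time.Now()
	c.observePressure(sample{at: now, pid: 4242, value: 0.97})
	c.observeThroughput(sample{at: now.Add(40 * time.Millisecond), job: "gpt-finetuner-7b2a1", value: 152}, 200)
}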

03

Kubernetes pod attribution

PID → cgroup → container → K8s pod → tenant. Exact cause pod and victim pods, every time.
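Step 03's PID-to-pod hop is a /proc lookup. A hedged Go sketch, assuming the two common kubelet cgroup layouts (systemd and cgroupfs drivers); the final pod-to-tenant hop is an informer lookup against the API server, not shown here.

package main

import (
	"fmt"
	"os"
	"regexp"
	"strings"
)

// Matches pod UIDs in both common kubelet cgroup layouts, e.g.
//   .../kubepods-burstable-pod<uid>.slice/...  (systemd driver)
//   /kubepods/burstable/pod<uid>/...           (cgroupfs driver)
var podRe = regexp.MustCompile(`pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})`)

// podUIDForPID reads /proc/<pid>/cgroup and extracts the pod UID.
func podUIDForPID(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	m := podRe.FindStringSubmatch(string(data))
	if m == nil {
		return "", fmt.Errorf("pid %d is not in a pod cgroup", pid)
	}
	// The systemd driver escapes '-' as '_' in slice names; normalize back.
	return strings.ReplaceAll(m[1], "_", "-"), nil
}

func main() {
	uid, err := podUIDForPID(os.Getpid())
	if err != nil {
		fmt.Println("no pod:", err)
		return
	}
	fmt.Println("pod UID:", uid)
}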

04

Output: cause, victims, duration, dollar cost

Prometheus metrics + forensic audit log + scheduler recommendations. Plugs into existing dashboards immediately.
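A sketch of step 04's Prometheus surface using the client_golang library; metric names and labels are illustrative, not Fabric Sentinel's actual schema. Point an existing scrape job at :9090/metrics and the counters land in the dashboards you already have.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	interferenceEvents = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "fabric_interference_events_total",
			Help: "Detected NIC interference events, by cause pod.",
		},
		[]string{"cause_pod", "root_cause"},
	)
	wastedDollars = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "fabric_interference_wasted_dollars_total",
			Help: "Estimated spend lost to interference, by victim pod.",
		},
		[]string{"victim_pod"},
	)
)

func main() {
	prometheus.MustRegister(interferenceEvents, wastedDollars)

	// Record one event matching the sample report shown below.
	interferenceEvents.WithLabelValues("inference-team-a/llama-70b-server-84cf2", "qp_table_exhaustion").Inc()
	wastedDollars.WithLabelValues("training-team-b/gpt-finetuner-7b2a1").Add(183)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}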

# fabric-sentinel interference event — 2026-05-03T14:32:11Z

cause_pod: inference-team-a/llama-70b-server-84cf2
cause: QP table exhaustion (mlx5 NIC cache eviction)
duration: 4m 12s

victim_pods:
- training-team-b/gpt-finetuner-7b2a1 · ↓ 24% throughput · $183 wasted
- training-team-c/llm-pretrain-a100-001 · ↓ 31% throughput · $241 wasted

total_cost: $424
recommendation: isolate inference-team-a to nodes [gpu-04, gpu-05]
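The dollar figures come from a cost model. A minimal sketch, assuming one plausible formulation (fraction of throughput lost × event duration × hourly burn rate); the price below is a placeholder, and a real report would use each victim job's actual burn rate.

package main

import (
	"fmt"
	"time"
)

// wastedDollars estimates the spend lost to one interference event.
// Assumed model: fraction of throughput lost, times event duration,
// times the victim's hourly burn rate in USD.
func wastedDollars(lossFraction float64, d time.Duration, hourlyUSD float64) float64 {
	return lossFraction * d.Hours() * hourlyUSD
}

func main() {
	// Placeholder inputs: 24% loss for 4m 12s against a hypothetical
	// $1,100/hr victim job.
	fmt.Printf("$%.2f wasted\n", wastedDollars(0.24, 4*time.Minute+12*time.Second, 1100))
}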

Every other tool is fabric-blind.

NVSentinel uses DCGM — GPU hardware only. Clockwork detects link flaps for node migration. CoreWeave Mission Control is CoreWeave-only. None of them see NIC microarchitecture interference between workloads.

Tool                        NIC layer       Inter-workload attribution   Dollar cost output   Multi-tenant K8s
NVSentinel (NVIDIA)         —               —                            —                    —
Clockwork ($41.5M)          Partial         —                            —                    —
CoreWeave Mission Control   —               —                            —                    —
Fabric Sentinel (Plexar)    ✓ sub-second    ✓ per-pod                    ✓ per-event          ✓

Built by someone who's lived inside this infrastructure.

Harsha Sanjeeva
Founder & CEO — plexar.io
7+ years building identity infrastructure, cloud-native security tooling, and Kubernetes networking at Cisco Meraki. Approver-track contributor to Kyverno (CNCF), SPIFFE/SPIRE, and WG AI Gateway. Built eBPF runtime security tooling in production. Shipping Fabric Sentinel to close the gap the research proved exists.
eBPF · Kubernetes · CNCF / Kyverno · SPIFFE/SPIRE · mlx5 / RDMA · Cisco Meraki

Get your first fabric interference report in 48 hours.

No code changes. No production impact. Deploy the DaemonSet in your kind cluster today.