East-West GPU Fabric Observability

Your cluster looks healthy. It isn't.

NVSentinel tells you a node is sick. Clockwork moves your job to a healthy node. Fabric Sentinel tells you which workload made the node sick — and what it cost every team sharing it.

Live interference detection — Fabric Sentinel output
[Chart: NCCL collective throughput for training job gpt-finetuner-7b2a1, alongside NIC QP table pressure on node gpu-node-12, 14:32:00–14:32:11]

CAUSE POD: inference-team-a / llama-70b-server-84cf2
THROUGHPUT LOSS: ↓ 24% (4m 12s)
ROOT CAUSE: QP table exhaustion — mlx5 NIC
ESTIMATED COST: $183 wasted

One inference job exhausts NIC microarchitecture resources. Your training job slows 20–30%. NVSentinel shows everything healthy. Your engineers find nothing. "The cluster was slow today."

The NIC layer is a blind spot for every monitoring tool on the market.

When multiple AI workloads share GPU servers, inference jobs quietly exhaust the NIC queue-pair (QP) tables and on-NIC caches that NCCL collectives depend on. Training throughput collapses. No alarm fires. No ticket gets opened.

20–30%
throughput loss in victim training jobs, with no visible fault
Microsoft Harmonic, NSDI 2024

5–15%
extra TCO lost by Silver-tier operators vs. Gold-tier operators at the same pricing
SemiAnalysis ClusterMAX

6–21%
of total GPU cluster TCO wasted as lost goodput
SemiAnalysis ClusterMAX

7.9 hr
mean time to failure for 1,024-GPU jobs; NIC errors are the leading cause
Meta, HPCA 2025

A read-only eBPF DaemonSet. One Helm install. No code changes.

Fabric Sentinel runs entirely in the Linux kernel, on the CPU. Zero GPU overhead. Zero production traffic touched. It deploys alongside NVSentinel and Clockwork — because neither sees this layer.

01

eBPF kprobes on mlx5 kernel driver

Reads NIC cache and QP table counters per PID at sub-second resolution — no kernel patch, no agent restart.
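A minimal sketch of what step 01 can look like with the cilium/ebpf Go library. The object file, program name, map name, and the mlx5_core_create_qp symbol are placeholders, not Fabric Sentinel's actual hook points; real mlx5 symbols vary by driver and kernel version.

package main

import (
	"log"
	"time"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// Per-PID counters mirrored from a BPF hash map; the layout must
// match the value struct on the BPF side.
type qpStats struct {
	QPAllocs  uint64
	CacheMiss uint64
}

func main() {
	// qp_probe.o is an assumed pre-compiled BPF object whose kprobe
	// handler bumps per-PID counters in the "qp_stats" map.
	coll, err := ebpf.LoadCollection("qp_probe.o")
	if err != nil {
		log.Fatalf("load BPF objects: %v", err)
	}
	defer coll.Close()

	// Hook point is illustrative; the real symbol depends on the
	// mlx5 driver and kernel version running on the node.
	kp, err := link.Kprobe("mlx5_core_create_qp", coll.Programs["trace_create_qp"], nil)
	if err != nil {
		log.Fatalf("attach kprobe: %v", err)
	}
	defer kp.Close()

	// Poll the per-PID counter map at sub-second resolution.
	for range time.Tick(250 * time.Millisecond) {
		var pid uint32
		var st qpStats
		it := coll.Maps["qp_stats"].Iterate()
		for it.Next(&pid, &st) {
			log.Printf("pid=%d qp_allocs=%d cache_misses=%d", pid, st.QPAllocs, st.CacheMiss)
		}
	}
}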

02

Correlation engine (100ms sliding window)

Matches counter events to throughput degradation signals across co-located workloads in real time.
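The window logic itself is small. A self-contained Go sketch of the idea, with made-up thresholds (90% QP occupancy, a 20% throughput dip) standing in for whatever the production engine tunes:

package main

import (
	"fmt"
	"time"
)

type sample struct {
	at    time.Time
	pid   uint32  // cause side: PID driving QP allocations
	job   string  // victim side: training job name
	value float64 // QP occupancy (cause) or throughput (victim)
}

const window = 100 * time.Millisecond

type correlator struct {
	pressure []sample // recent NIC-pressure samples
}

// observePressure records a QP-pressure sample and drops anything
// older than the sliding window.
func (c *correlator) observePressure(s sample) {
	c.pressure = append(c.pressure, s)
	cutoff := s.at.Add(-window)
	i := 0
	for ; i < len(c.pressure) && c.pressure[i].at.Before(cutoff); i++ {
	}
	c.pressure = c.pressure[i:]
}

// observeThroughput flags a candidate event when a throughput dip
// overlaps a pressure spike inside the window.
func (c *correlator) observeThroughput(s sample, baseline float64) {
	if s.value > 0.8*baseline { // less than a 20% dip: ignore
		return
	}
	for _, p := range c.pressure {
		if s.at.Sub(p.at) <= window && p.value > 0.9 { // assumed occupancy threshold
			fmt.Printf("candidate event: pid %d pressuring job %s (%.0f%% of baseline)\n",
				p.pid, s.job, 100*s.value/baseline)
		}
	}
}

func main() {
	c := &correlator{}
	now := time.Now()
	c.observePressure(sample{at: now, pid: 4242, value: 0.97})
	c.observeThroughput(sample{at: now.Add(40 * time.Millisecond), job: "gpt-finetuner-7b2a1", value: 152}, 200)
}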

03

Kubernetes pod attribution

PID → cgroup → container → K8s pod → tenant. Exact cause pod and victim pods, every time.
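Step 03's PID-to-pod hop is a /proc lookup. A hedged Go sketch, assuming the two common kubelet cgroup layouts (systemd and cgroupfs drivers); the final pod-to-tenant hop is an informer lookup against the API server, not shown here.

package main

import (
	"fmt"
	"os"
	"regexp"
	"strings"
)

// Matches pod UIDs in both common kubelet cgroup layouts, e.g.
//   .../kubepods-burstable-pod<uid>.slice/...  (systemd driver)
//   /kubepods/burstable/pod<uid>/...           (cgroupfs driver)
var podRe = regexp.MustCompile(`pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})`)

// podUIDForPID reads /proc/<pid>/cgroup and extracts the pod UID.
func podUIDForPID(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	m := podRe.FindStringSubmatch(string(data))
	if m == nil {
		return "", fmt.Errorf("pid %d is not in a pod cgroup", pid)
	}
	// The systemd driver escapes '-' as '_' in slice names; normalize back.
	return strings.ReplaceAll(m[1], "_", "-"), nil
}

func main() {
	uid, err := podUIDForPID(os.Getpid())
	if err != nil {
		fmt.Println("no pod:", err)
		return
	}
	fmt.Println("pod UID:", uid)
}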

04

Output: cause, victims, duration, dollar cost

Prometheus metrics + forensic audit log + scheduler recommendations. Plugs into existing dashboards immediately.
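A sketch of step 04's Prometheus surface using the client_golang library; metric names and labels are illustrative, not Fabric Sentinel's actual schema. Point an existing scrape job at :9090/metrics and the counters land in the dashboards you already have.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	interferenceEvents = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "fabric_interference_events_total",
			Help: "Detected NIC interference events, by cause pod.",
		},
		[]string{"cause_pod", "root_cause"},
	)
	wastedDollars = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "fabric_interference_wasted_dollars_total",
			Help: "Estimated spend lost to interference, by victim pod.",
		},
		[]string{"victim_pod"},
	)
)

func main() {
	prometheus.MustRegister(interferenceEvents, wastedDollars)

	// Record one event matching the sample report shown below.
	interferenceEvents.WithLabelValues("inference-team-a/llama-70b-server-84cf2", "qp_table_exhaustion").Inc()
	wastedDollars.WithLabelValues("training-team-b/gpt-finetuner-7b2a1").Add(183)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}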

# fabric-sentinel interference event — 2026-05-03T14:32:11Z

cause_pod: inference-team-a/llama-70b-server-84cf2
cause: QP table exhaustion (mlx5 NIC cache eviction)
duration: 4m 12s

victim_pods:
- training-team-b/gpt-finetuner-7b2a1 · ↓ 24% throughput · $183 wasted
- training-team-c/llm-pretrain-a100-001 · ↓ 31% throughput · $241 wasted

total_cost: $424
recommendation: isolate inference-team-a to nodes [gpu-04, gpu-05]
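The dollar figures come from a cost model. A minimal sketch, assuming one plausible formulation (fraction of throughput lost × event duration × hourly burn rate); the price below is a placeholder, and a real report would use each victim job's actual burn rate.

package main

import (
	"fmt"
	"time"
)

// wastedDollars estimates the spend lost to one interference event.
// Assumed model: fraction of throughput lost, times event duration,
// times the victim's hourly burn rate in USD.
func wastedDollars(lossFraction float64, d time.Duration, hourlyUSD float64) float64 {
	return lossFraction * d.Hours() * hourlyUSD
}

func main() {
	// Placeholder inputs: 24% loss for 4m 12s against a hypothetical
	// $1,100/hr victim job.
	fmt.Printf("$%.2f wasted\n", wastedDollars(0.24, 4*time.Minute+12*time.Second, 1100))
}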

Every other tool is fabric-blind.

NVSentinel uses DCGM — GPU hardware only. Clockwork detects link flaps for node migration. CoreWeave Mission Control is CoreWeave-only. None of them see NIC microarchitecture interference between workloads.

Tool                        NIC layer       Inter-workload attribution   Dollar cost output   Multi-tenant K8s
NVSentinel (NVIDIA)         —               —                            —                    —
Clockwork ($41.5M)          Partial         —                            —                    —
CoreWeave Mission Control   —               —                            —                    —
Fabric Sentinel (Plexar)    ✓ sub-second    ✓ per-pod                    ✓ per-event          ✓

Built by someone who's lived inside this infrastructure.

Harsha Sanjeeva
Founder & CEO — plexar.io
7+ years building identity infrastructure, cloud-native security tooling, and Kubernetes networking at Cisco Meraki. Approver-track contributor to Kyverno (CNCF), SPIFFE/SPIRE, and WG AI Gateway. Built eBPF runtime security tooling in production. Shipping Fabric Sentinel to close the gap the research proved exists.
eBPF · Kubernetes · CNCF / Kyverno · SPIFFE/SPIRE · mlx5 / RDMA · Cisco Meraki

Get your first fabric interference report in 48 hours.

No code changes. No production impact. Deploy the DaemonSet in your kind cluster today.