Your cluster looks healthy.
It isn't.
NVSentinel tells you a node is sick. Clockwork moves your job to a healthy node. Fabric Sentinel tells you which workload made the node sick — and what it cost every team sharing it.
One inference job exhausts NIC microarchitecture resources. Your training job slows 20–30%. NVSentinel shows everything healthy. Your engineers find nothing. "The cluster was slow today."
The NIC layer is a blind spot for every monitoring tool on the market.
When multiple AI workloads share GPU servers, inference jobs quietly exhaust NIC queue-pair tables and cache needed by NCCL collectives. Training throughput collapses. No alarm fires. No ticket gets opened.
A read-only eBPF DaemonSet. One Helm install. No code changes.
Fabric Sentinel runs entirely on the host CPU, inside the Linux kernel. Zero GPU overhead. Zero production traffic touched. It deploys alongside NVSentinel and Clockwork, because neither sees this layer.
eBPF kprobes on mlx5 kernel driver
Reads NIC cache and QP table counters per PID at sub-second resolution — no kernel patch, no agent restart.
Correlation engine (100 ms sliding window)
Matches counter events to throughput degradation signals across co-located workloads in real time.
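As a rough illustration of the sliding-window idea, the sketch below keeps recent NIC counter events in a queue and, when a throughput drop is reported, returns the other pods that had a counter event in the preceding 100 ms. The class and field names (`Correlator`, `CounterEvent`, `Degradation`) are hypothetical and are not Fabric Sentinel's actual API. The sketch also assumes events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass

WINDOW_S = 0.100  # 100 ms sliding window


@dataclass(frozen=True)
class CounterEvent:
    ts: float  # event time in seconds
    pod: str   # pod whose traffic triggered the NIC counter spike (hypothetical field)


@dataclass(frozen=True)
class Degradation:
    ts: float  # time the throughput drop was observed, in seconds
    pod: str   # pod whose throughput dropped (the victim)


class Correlator:
    """Hypothetical correlator; assumes events arrive in timestamp order."""

    def __init__(self, window_s: float = WINDOW_S) -> None:
        self.window_s = window_s
        self.recent = deque()  # counter events, oldest first

    def on_counter_event(self, ev: CounterEvent) -> None:
        self.recent.append(ev)

    def on_degradation(self, deg: Degradation) -> list:
        # Evict counter events that fell out of the window.
        while self.recent and deg.ts - self.recent[0].ts > self.window_s:
            self.recent.popleft()
        # Candidate causes: other pods with a counter event inside the window.
        return sorted({ev.pod for ev in self.recent
                       if ev.pod != deg.pod and ev.ts <= deg.ts})


correlator = Correlator()
correlator.on_counter_event(CounterEvent(ts=0.00, pod="inference-a"))
correlator.on_counter_event(CounterEvent(ts=0.95, pod="inference-b"))
suspects = correlator.on_degradation(Degradation(ts=1.00, pod="train-0"))
print(suspects)
```

Only `inference-b` falls inside the window here. A real engine would also need to handle out-of-order events and rank candidates rather than return all of them.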
Kubernetes pod attribution
PID → cgroup → container → K8s pod → tenant. Exact cause pod and victim pods, every time.
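To make the first hops of that chain concrete, here is a minimal sketch that extracts a pod UID and container ID from one line of `/proc/<pid>/cgroup`. It assumes the common kubelet cgroup layouts (systemd driver with underscore-separated UIDs, or cgroupfs driver with hyphenated UIDs) and a 64-character hex container ID. The function name `parse_cgroup_line` is hypothetical, and other runtimes or cgroup layouts may need different patterns.

```python
import re
from typing import Optional, Tuple

# Pod UID in 8-4-4-4-12 hex form; the systemd cgroup driver writes "_" instead of "-".
_POD_RE = re.compile(r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})")
# Container runtimes typically use a 64-character hex container ID (assumption).
_CONTAINER_RE = re.compile(r"[0-9a-f]{64}")


def parse_cgroup_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (pod_uid, container_id) for a Kubernetes cgroup line, else None."""
    # Lines look like "hierarchy-id:controllers:path"; keep only the path.
    path = line.strip().split(":", 2)[-1]
    pod = _POD_RE.search(path)
    container = _CONTAINER_RE.search(path)
    if pod is None or container is None:
        return None
    return pod.group(1).replace("_", "-"), container.group(0)


sample = (
    "0::/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pod12345678_1234_1234_1234_123456789abc.slice/"
    "cri-containerd-" + "ab" * 32 + ".scope"
)
print(parse_cgroup_line(sample))
```

Mapping the pod UID to a pod name and tenant would then be a lookup against the Kubernetes API, which is not shown here.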
Output: cause, victims, duration, dollar cost
Prometheus metrics + forensic audit log + scheduler recommendations. Plugs into existing dashboards immediately.
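One plausible way to arrive at a per-event dollar figure is sketched below. The formula (victim GPUs × hourly GPU price × event duration × fractional slowdown) is an assumption for illustration, not Fabric Sentinel's documented method. The metric name `fabric_interference_cost_dollars`, its labels, and the $2.50 per GPU-hour price are likewise hypothetical. The output line follows the Prometheus text exposition format; label-value escaping is omitted for brevity.

```python
def interference_cost_usd(gpus: int, usd_per_gpu_hour: float,
                          duration_s: float, slowdown: float) -> float:
    """Assumed cost model: GPU-hours lost to the slowdown, priced per GPU-hour."""
    return gpus * usd_per_gpu_hour * (duration_s / 3600.0) * slowdown


def prometheus_line(metric: str, labels: dict, value: float) -> str:
    """Render one sample in Prometheus text exposition format (no escaping)."""
    body = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{metric}{{{body}}} {value:.2f}"


# Example: a 64-GPU training job slowed by 25% for 30 minutes.
cost = interference_cost_usd(gpus=64, usd_per_gpu_hour=2.50,
                             duration_s=1800, slowdown=0.25)
line = prometheus_line(
    "fabric_interference_cost_dollars",  # hypothetical metric name
    {"cause_pod": "inference-b", "victim_pod": "train-0"},
    cost,
)
print(line)
```

Under these assumptions the event costs $20.00, and the resulting sample can be scraped and graphed like any other Prometheus series.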
Every other tool is fabric-blind.
NVSentinel uses DCGM — GPU hardware only. Clockwork detects link flaps for node migration. CoreWeave Mission Control is CoreWeave-only. None of them see NIC microarchitecture interference between workloads.
| Tool | NIC layer | Inter-workload attribution | Dollar cost output | Multi-tenant K8s |
|---|---|---|---|---|
| NVSentinel (NVIDIA) | ✗ | ✗ | ✗ | — |
| Clockwork ($41.5M) | Partial | ✗ | ✗ | ✗ |
| CoreWeave Mission Control | ✗ | ✗ | ✗ | ✗ |
| Fabric Sentinel (Plexar) | ✓ sub-second | ✓ per-pod | ✓ per-event | ✓ |
Built by someone who's lived inside this infrastructure.
Get your first fabric interference report in 48 hours.
No code changes. No production changes. Deploy the DaemonSet in your kind cluster today.