Rethinking NKP for MSP Environments

What changed as we evolved our management model, licensing strategy, and operational visibility in real-world MSP environments.

When we started running NKP in a centralized multi-cluster model, the goal was simple: one management plane, consistent governance, and a clean operational experience. In practice, running NKP as an MSP soon showed that this model does not always scale the way you would expect.

In my previous article, Why NKP Feels Different When You Run It as an MSP, I described the first set of challenges we encountered and why the multi-cluster approach quickly becomes complex in real service provider environments.

After operating both models in production and discussing these operational patterns with multiple practitioners working on real NKP deployments, we gradually moved toward a decentralized, self-managed approach. This article explains what changed and why.

Two operating models with very different trade-offs

The first one is the centralized multi-cluster model. A single management cluster controls multiple workload clusters. This provides strong governance, unified lifecycle management, and a fleet-style operational experience. The downside is that all managed clusters are effectively tied to the same upgrade cadence, typically N-1. Once the management plane is upgraded, workload clusters are expected to follow in order to preserve federation compatibility.

The second model is the self-managed approach. Each tenant or customer runs its own NKP cluster where management and workloads coexist. In this scenario, the lifecycle coupling introduced by multi-cluster management disappears.

NKP allows self-managed clusters to remain up to three versions behind while still being supported. This additional flexibility is only available when clusters are operated independently and not attached to a centralized multi-cluster management plane.

In practice, this gives MSPs significantly more operational freedom and makes it possible to isolate slow-moving tenants without forcing upgrades across the entire fleet.

Why centralized management becomes a bottleneck for MSPs

The core problem is not Kubernetes itself. The problem is organizational coupling.

When multiple tenants share a single management plane, one slow-moving customer can delay upgrades for everyone else. Platform validation effort grows with every additional cluster and version combination. Operational risk becomes correlated across tenants.

In an MSP context, this creates structural friction. Independent customers become implicitly bound to the same lifecycle timeline.

Moving to self-managed clusters breaks this dependency. Each tenant regains control of its own upgrade cadence, including the ability to remain behind the latest platform version without being forced into global upgrade timelines.

The hidden cost of the management plane

Another factor that is often underestimated is the resource footprint of the NKP management layer.

When running NKP Ultimate in centralized mode, the management cluster deploys a large number of platform services. To keep those services highly available, the recommended architecture introduces a dedicated worker pool for platform components. This typically results in a baseline of multiple worker nodes that are not hosting customer workloads at all.
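To make that overhead concrete, here is a minimal sketch of the dedicated platform pool pattern expressed with plain Kubernetes scheduling primitives. The pool label, taint, namespace, and image are placeholders rather than NKP defaults; the point is simply that highly available platform replicas end up pinned to nodes that never run customer workloads.

```yaml
# Hypothetical illustration of the dedicated platform worker pool pattern.
# Labels, taints, names, and images are examples, not NKP defaults.
# Nodes in the platform pool are assumed to carry the label node-pool=platform
# and the taint dedicated=platform:NoSchedule.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-platform-service      # stand-in for one of many platform components
  namespace: platform-system
spec:
  replicas: 3                         # the HA replica count is what drives the node baseline
  selector:
    matchLabels:
      app: example-platform-service
  template:
    metadata:
      labels:
        app: example-platform-service
    spec:
      nodeSelector:
        node-pool: platform           # pins the pod to the dedicated pool
      tolerations:
        - key: dedicated
          value: platform
          effect: NoSchedule          # tolerates the taint that keeps tenant workloads out
      containers:
        - name: service
          image: registry.example.com/platform/service:1.0.0
```

Multiply this by every highly available platform service and the dedicated pool quickly grows into several nodes that never host a single customer pod.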

I previously analyzed this aspect in more detail in NKP Starter vs Ultimate, especially focusing on the operational cost of platform services and management plane sizing.

In MSP scenarios with many small or medium tenants, this overhead becomes expensive very quickly.

By switching to NKP Starter in self-managed clusters, the platform service footprint is significantly reduced. There is no need for a dedicated platform worker pool. Management services and workloads can safely coexist on the same node pool.

This turns the cluster into a much leaner Kubernetes platform and dramatically improves infrastructure efficiency.

For an MSP, this kind of overhead can make the difference between a sustainable margin and a losing deal.

Why NKP Starter makes sense for tenant clusters

NKP Starter is frequently underestimated.

In a self-managed tenant model, you typically do not need fleet-wide lifecycle orchestration, centralized multi-cluster governance, or advanced platform services designed for large centralized environments.

What you actually need is a supported Kubernetes distribution, reliable lifecycle management for the local cluster, integration with Nutanix networking and storage, and GitOps-driven application and configuration management.

NKP Starter covers these needs without introducing the heavier operational footprint of Ultimate.

It's an architectural choice that fits what you actually need in a tenant cluster.
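The GitOps piece is worth making concrete. Below is a minimal sketch of what GitOps-driven configuration could look like on a tenant cluster, assuming a Flux-style toolchain; the repository URL, branch, and path are placeholders, and any reconciler that pulls desired state from Git fills the same role.

```yaml
# Minimal GitOps sketch for a self-managed tenant cluster.
# Assumes a Flux-based toolchain; URL, branch, and path are placeholders.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: tenant-config
  namespace: flux-system
spec:
  interval: 5m
  url: https://git.example.com/msp/tenant-a-config.git   # hypothetical per-tenant repo
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: tenant-baseline
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: tenant-config
  path: ./clusters/tenant-a          # hypothetical path holding RBAC, policies, and apps
  prune: true                        # drift from the repository is reverted automatically
```

Each tenant cluster pulls its own desired state, which is exactly what keeps configuration standardized without reintroducing a central control plane.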

Losing the single pane of glass and what changes in practice

One of the biggest concerns when moving to self-managed clusters is visibility.

Without a central NKP management plane, the built-in multi-cluster view disappears. For MSP operations, this initially feels uncomfortable.

This is where Prism Central with Konnector becomes critical.

I already covered this topic in detail in Nutanix Konnector and Kubernetes Visibility in Prism Central, focusing on how cluster inventory and operational visibility are integrated directly into the platform.

Konnector does not try to replace NKP management. It plays a different role. It provides centralized inventory of Kubernetes clusters, operational visibility for infrastructure teams, health status awareness, and a human-friendly interface for troubleshooting.

At the same time, GitOps remains the source of truth for configuration and standardization.

The resulting operational model becomes layered.

GitOps controls desired state and enforcement. Prism Central with Konnector provides observed state and operational visibility. Alerting systems trigger human intervention.

This separation scales better than trying to force everything into a single control plane.
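For the alerting layer, something along these lines is usually enough to route tenant incidents to humans. The sketch assumes Prometheus Operator-style monitoring running on each tenant cluster; the alert name, expression, threshold, and routing labels are illustrative, not a recommended baseline.

```yaml
# Illustrative alert for the human-intervention layer.
# Assumes Prometheus Operator-style monitoring on the tenant cluster;
# names, expressions, and labels are examples only.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tenant-cluster-health
  namespace: monitoring
spec:
  groups:
    - name: tenant-cluster.rules
      rules:
        - alert: TenantNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 10m
          labels:
            severity: critical
            tenant: tenant-a          # hypothetical label used to route to the right on-call queue
          annotations:
            summary: "Node NotReady in tenant cluster"
            description: "A node has been NotReady for more than 10 minutes. Check Prism Central for the infrastructure view, then the cluster itself."
```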

The MSP architecture that emerged in practice

After iterating on multiple designs, the pattern that worked best looks like this.

One self-managed NKP cluster per tenant. NKP Starter to minimize platform overhead. GitOps for configuration management, RBAC standardization, and application delivery. Prism Central with Konnector for inventory and operational visibility. Centralized alerting for incident-driven workflows.

There is no global management cluster. There is no shared lifecycle coupling. Each tenant evolves independently.
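The RBAC standardization mentioned above is mostly about shipping the same roles and bindings to every tenant cluster from Git. A sketch of a standard read-only role for the MSP operations team could look like the following; group names, role names, and resource lists are placeholders, not NKP conventions.

```yaml
# Example RBAC baseline delivered to every tenant cluster via GitOps.
# Group and role names are placeholders, not NKP conventions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: msp-operations-readonly
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "statefulsets", "jobs", "events"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: msp-operations-readonly
subjects:
  - kind: Group
    name: msp-operations              # hypothetical IdP group for the MSP operations team
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: msp-operations-readonly
  apiGroup: rbac.authorization.k8s.io
```

Because this baseline lives in each tenant's Git repository, every cluster ends up with the same operational access model without any shared management plane.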

It's a similar model to how managed Kubernetes services work: you own your upgrade timeline, while the provider handles the platform.

There is no perfect model, only trade-offs

Centralized NKP management optimizes governance and uniformity. Self-managed NKP optimizes autonomy and scalability.

Which one works depends on what you're optimizing for.

In MSP environments, operational independence, cost efficiency, and tenant isolation often matter more than fleet-wide control.

NKP provides both options. The important part is understanding the trade-offs and making intentional architectural decisions.

Final thoughts

Running NKP as an MSP is still new territory. The platform is evolving. Operational patterns are evolving. There is no definitive blueprint yet.

What matters is designing for how things actually work, not for how they look on a slide.

For us, moving toward self-managed tenant clusters with lightweight NKP deployments, GitOps governance, and centralized visibility through Prism Central has proven to be the most balanced solution so far.

It's not perfect, but it works and it scales.