High-Availability Storage for VM-Based Kubernetes
TL;DR
A Kubernetes-native iSCSI storage monitor for k3s clusters running on VMs. It automatically detects SAN connection failures (latency above 100 ms, or 3 or more errors per minute) and triggers CSI driver failover to a secondary VM in under 5 seconds, so that DevOps teams can reduce unplanned storage downtime from 12+ hours per year to under 1 hour per year without adding hardware or vendor dependencies.
Target Audience
DevOps and SRE engineers managing small Kubernetes clusters (especially k3s) on virtualized infrastructure with iSCSI-backed storage
The Problem
Problem Context
Teams running small Kubernetes clusters (k3s) on virtual machines need reliable storage, but their current solutions either create single points of failure or introduce unacceptable tradeoffs such as high CPU usage, vendor lock-in, or instability. These teams rely on iSCSI-backed storage presented to the VMs as virtual disks, which complicates traditional high-availability storage solutions designed for physical hardware.
Pain Points
The Local Storage Provisioner works but is a single point of failure. Mayastor consumes too much CPU, VMware CSI drivers are unstable, and SAN vendor drivers create vendor lock-in. Ceph and Longhorn don’t fit well with virtual disk setups, leaving teams stuck between unreliable options or unacceptable compromises. Every solution tried so far has introduced new problems instead of solving the core high-availability challenge.
Impact
Downtime from storage failures disrupts production workloads, wasting engineering time on manual failovers and troubleshooting. High CPU usage from storage solutions reduces cluster capacity for actual workloads, forcing teams to over-provision resources. Vendor lock-in limits future flexibility, while unstable drivers create constant operational risk. The combination of these issues leads to higher cloud costs, slower deployments, and increased stress for small DevOps teams.
Urgency
This is a mission-critical problem for any team running production workloads on Kubernetes. A single storage failure can take down entire services, and the current workarounds either don’t provide true high availability or introduce unacceptable operational overhead. Teams can’t ignore this risk because it directly impacts their ability to deliver reliable services to customers, making it a top priority for any production environment.
Target Audience
DevOps engineers and site reliability engineers at small-to-medium companies running Kubernetes clusters (especially k3s) on virtualized infrastructure. This includes teams using vSphere, OpenStack, or cloud VMs with iSCSI-backed storage who need high-availability storage but can’t use physical-disk-focused tools like Ceph (typically deployed through Rook). Startups and internal IT teams managing multi-cluster environments also face this challenge when they outgrow simple local storage provisioning.
Proposed AI Solution
Solution Approach
A lightweight, Kubernetes-native storage solution specifically designed for VM-based environments. Instead of building another complex storage system, it wraps existing iSCSI storage with intelligent health monitoring and automated failover capabilities. The solution focuses on minimizing resource usage while providing true high availability for stateful workloads in small k3s clusters, without requiring physical disks or vendor-specific dependencies.
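The health-monitoring idea can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the product's implementation: it applies the thresholds quoted in the TL;DR (latency above 100 ms, or 3 or more errors within one minute) to samples from a single iSCSI path. The class name `PathHealth`, its methods, and the choice to compare only the most recent latency sample are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field

LATENCY_THRESHOLD_MS = 100.0  # from the TL;DR: latency > 100 ms
ERROR_THRESHOLD = 3           # from the TL;DR: 3+ errors per minute
WINDOW_SECONDS = 60.0         # the "per minute" window


@dataclass
class PathHealth:
    """Hypothetical tracker for one iSCSI path (illustration only)."""
    error_times: deque = field(default_factory=deque)
    last_latency_ms: float = 0.0

    def _prune(self, now: float) -> None:
        # Forget errors that are at least one window old.
        while self.error_times and now - self.error_times[0] >= WINDOW_SECONDS:
            self.error_times.popleft()

    def record(self, now: float, latency_ms: float, error: bool = False) -> None:
        """Store one probe result taken at time `now` (seconds)."""
        self.last_latency_ms = latency_ms
        if error:
            self.error_times.append(now)
        self._prune(now)

    def is_unhealthy(self, now: float) -> bool:
        """True if either threshold from the TL;DR is exceeded."""
        self._prune(now)
        too_slow = self.last_latency_ms > LATENCY_THRESHOLD_MS
        too_many_errors = len(self.error_times) >= ERROR_THRESHOLD
        return too_slow or too_many_errors
```

A sliding window of error timestamps keeps the check cheap, since each probe does a constant amount of work on average. A real monitor would probably smooth latency over several samples rather than react to a single slow probe.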
Key Features
- Automated Failover: Uses Kubernetes-native mechanisms to fail over storage volumes between VMs when issues are detected, without requiring manual intervention.
- Low-Overhead CSI Driver: A specialized Container Storage Interface driver optimized for VM environments that avoids the CPU-intensive polling used by solutions like Mayastor.
- K3s-Native Integration: Designed specifically for small k3s clusters, with simple Helm-based installation and minimal configuration requirements.
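The automated failover feature above implies two decisions: which secondary VM should take over, and whether the switch finished inside the 5-second target from the TL;DR. The sketch below illustrates one possible policy and makes no Kubernetes or CSI API calls. The `NodeStatus` type, both function names, and the "lowest latency wins" rule are assumptions for illustration, not the described product's behavior.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

FAILOVER_BUDGET_SECONDS = 5.0  # from the TL;DR: failover in under 5 s


@dataclass(frozen=True)
class NodeStatus:
    """Hypothetical view of one VM's storage path (illustration only)."""
    name: str
    healthy: bool
    latency_ms: float


def pick_failover_target(current: str,
                         nodes: Sequence[NodeStatus]) -> Optional[str]:
    """Return the healthy node with the lowest latency, excluding `current`.

    Ties are broken by node name so the choice is deterministic.
    Returns None when no healthy secondary exists.
    """
    candidates = [n for n in nodes if n.healthy and n.name != current]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (n.latency_ms, n.name)).name


def within_budget(detected_at: float, completed_at: float) -> bool:
    """True if the failover finished inside the 5-second budget."""
    return (completed_at - detected_at) < FAILOVER_BUDGET_SECONDS
```

For example, with `vm-a` failing and two healthy secondaries, the faster one is chosen:

```python
nodes = [
    NodeStatus("vm-a", healthy=False, latency_ms=250.0),
    NodeStatus("vm-b", healthy=True, latency_ms=12.0),
    NodeStatus("vm-c", healthy=True, latency_ms=8.0),
]
target = pick_failover_target("vm-a", nodes)  # "vm-c"
```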
User Experience
Users install the solution via a Helm chart in their k3s cluster, then configure it with their existing iSCSI storage details. The system automatically monitors storage health in the background, and when issues are detected, it triggers failovers while keeping applications running. Teams get alerts about potential problems before they cause downtime, and the solution works seamlessly with their existing storage infrastructure without requiring new hardware or complex setup.
Differentiation
Unlike existing solutions that either expect direct access to physical disks (Ceph) or consume too many resources (Mayastor), this focuses specifically on VM-based environments with iSCSI storage. It avoids vendor lock-in by working with any iSCSI SAN and doesn’t require vCenter connectivity. The solution is designed to be lightweight enough for small clusters while still providing enterprise-grade high availability, filling the gap between overkill solutions and unreliable workarounds.
Scalability
The solution starts with single-cluster support but can scale to manage multiple k3s clusters through a centralized monitoring interface. As teams grow, they can add more storage capacity by simply expanding their iSCSI LUNs, with the system automatically handling the increased load. Advanced features like cross-cluster replication can be added later for teams that need disaster recovery across multiple data centers.
Expected Impact
Teams gain reliable high-availability storage without the resource overhead or vendor lock-in of existing solutions. They reduce downtime risk while maintaining flexibility to move between cloud providers or virtualization platforms. The solution pays for itself by preventing storage-related outages and reducing the need for over-provisioning cluster resources, making it a cost-effective choice for small-to-medium Kubernetes deployments.
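To put the downtime figures from the TL;DR in perspective, the short calculation below converts them into availability percentages. It is plain arithmetic on the stated targets (12 hours per year before, 1 hour per year after, 5 seconds per failover), not a measured result. The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_pct(downtime_hours_per_year: float) -> float:
    """Availability as a percentage for a given yearly downtime."""
    return 100.0 * (1.0 - downtime_hours_per_year / HOURS_PER_YEAR)


before = availability_pct(12.0)  # about 99.863 %
after = availability_pct(1.0)    # about 99.989 %

# How many 5-second failovers fit in a one-hour yearly downtime budget.
max_failovers = (1 * 3600) // 5  # 720
```

In other words, the stated goal moves a cluster from just under "three nines" to nearly "four nines" of storage availability. At 5 seconds per failover, the one-hour budget leaves room for hundreds of failover events per year, provided each one completes within the target.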