VMware Footguns¶
Mistakes that cause failed deploys, outages, or data loss with VMware vSphere.
1. Leaving snapshots running for days or weeks¶
Snapshots create a delta VMDK that captures every write. Over time this delta grows until it fills the datastore, freezing every VM on that storage. Snapshot consolidation on a large delta can take hours and causes heavy I/O.
Fix: Monitor snapshot age with vCenter alarms. Set a policy: no snapshot older than 72 hours. Never use snapshots as backups — use a real backup tool that creates/deletes snapshots as part of its workflow.
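The age check can also be scripted. A minimal PowerCLI sketch, assuming an existing Connect-VIServer session and the 72-hour policy above:

```powershell
# Report snapshots older than 72 hours (adjust the threshold to your policy)
$maxAge = (Get-Date).AddHours(-72)

Get-VM | Get-Snapshot |
    Where-Object { $_.Created -lt $maxAge } |
    Select-Object VM, Name, Created, SizeGB |
    Sort-Object Created
```

Running this on a schedule (or wiring the same logic into a vCenter alarm) catches forgotten snapshots before the delta eats the datastore.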
2. Over-provisioning vCPUs on VMs¶
Giving a VM 16 vCPUs when it only uses 2 hurts performance. ESXi uses relaxed
co-scheduling: the vCPUs don't all have to run at the same instant, but the
scheduler must keep them roughly in sync, so every idle vCPU still adds
overhead. More vCPUs mean more co-stop time (%CSTP in esxtop), and a VM with
16 vCPUs often runs slower than the same workload on 4.
Fix: Start small, monitor, and scale up. Check %CSTP and %RDY in esxtop.
Right-size vCPUs to what the application actually uses under peak load.
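If you'd rather pull the numbers with PowerCLI than watch esxtop, a sketch using the standard cpu.ready.summation counter (the VM name is an example; assumes a Connect-VIServer session):

```powershell
# Approximate CPU ready time from realtime stats (20-second samples)
$vm = Get-VM -Name "app01"   # hypothetical VM name

Get-Stat -Entity $vm -Stat cpu.ready.summation -Realtime -MaxSamples 15 |
    Where-Object { $_.Instance -eq "" } |   # aggregate across all vCPUs
    ForEach-Object {
        # Value is ready time in ms per 20 000 ms sample, summed across
        # vCPUs; divide by the vCPU count to compare against the usual
        # "under ~5% per vCPU" rule of thumb
        [pscustomobject]@{
            Time     = $_.Timestamp
            ReadyPct = [math]::Round($_.Value / 20000 * 100, 1)
        }
    }
```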
3. No EVC mode on the cluster — vMotion fails at the worst time¶
Enhanced vMotion Compatibility (EVC) masks CPU features to the lowest common denominator in the cluster. Without it, vMotion between different CPU generations fails — and you discover this during a host failure when HA tries to restart VMs on a host with a different CPU.
Fix: Set EVC mode when creating the cluster. Raising the baseline later is non-disruptive (VMs pick it up after a power cycle), but lowering it, or enabling EVC on a cluster whose running VMs already use newer CPU features, requires powering those VMs off. Plan CPU homogeneity or set EVC from day one.
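In PowerCLI terms, a sketch (cluster, datacenter, and baseline names are examples; check which EVC modes your hardware supports before picking one):

```powershell
# Check the current EVC baseline on an existing cluster
Get-Cluster "Prod" | Select-Object Name, EVCMode

# Setting EVC at creation time avoids the power-off churn later
New-Cluster -Name "Prod" -Location (Get-Datacenter "DC1") -EVCMode "intel-broadwell"
```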
4. Thin provisioning everything without monitoring space¶
Thin-provisioned disks look great on paper — allocate 500GB, use 50GB. But every VM grows, and the sum of thin disks can exceed physical capacity. When the datastore fills, all VMs on it freeze simultaneously.
Fix: Set datastore usage alarms at 75% and 85%. Track the over-commit ratio (provisioned / actual capacity). Reserve headroom for snapshot operations.
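The over-commit ratio the fix mentions can be computed per datastore. A sketch, assuming a Connect-VIServer session (Summary.Uncommitted is the not-yet-written portion of thin disks, in bytes):

```powershell
# Per-datastore usage and over-commit ratio; a ratio above 1.0 means
# thin disks could collectively outgrow the physical capacity
Get-Datastore | ForEach-Object {
    $capGB  = $_.CapacityGB
    $usedGB = $capGB - $_.FreeSpaceGB
    $provGB = $usedGB + ($_.ExtensionData.Summary.Uncommitted / 1GB)
    [pscustomobject]@{
        Datastore  = $_.Name
        UsedPct    = [math]::Round($usedGB / $capGB * 100, 1)
        OverCommit = [math]::Round($provGB / $capGB, 2)
    }
} | Sort-Object OverCommit -Descending
```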
5. Putting vMotion traffic on the management network¶
vMotion moves gigabytes of RAM over the network. If it shares the management network, vCenter connectivity, SSH, and monitoring all degrade during migrations. DRS-triggered migrations can cascade and saturate the link.
Fix: Dedicated VMkernel port for vMotion on its own VLAN with 10G+ bandwidth. Same for vSAN traffic. Management, vMotion, vSAN, and NFS should each have separate VMkernel adapters and VLANs.
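Creating the dedicated vMotion adapter looks roughly like this in PowerCLI. Host name, switch, port group, and addressing are all examples for one host; repeat per host with your own network design:

```powershell
$esx = Get-VMHost "esx01.lab.local"   # hypothetical host

# Dedicated VMkernel adapter on its own port group/VLAN, vMotion-only
New-VMHostNetworkAdapter -VMHost $esx `
    -PortGroup "vMotion-VLAN20" -VirtualSwitch "vSwitch1" `
    -IP "10.0.20.11" -SubnetMask "255.255.255.0" `
    -VMotionEnabled $true
```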
6. Not setting admission control in HA clusters¶
Without admission control, HA does not reserve capacity for failover. If a host dies, there may not be enough CPU/memory on surviving hosts to restart all VMs. You get a false sense of protection — HA is "enabled" but can't actually recover all workloads.
Fix: Enable admission control. Set policy to "tolerate 1 host failure" (or 2 for critical clusters). Accept that this reserves ~25-50% of cluster capacity. That reserved capacity is the whole point.
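A minimal PowerCLI sketch of that policy (cluster name is an example):

```powershell
# Enable HA with admission control sized for one host failure
Get-Cluster "Prod" | Set-Cluster -HAEnabled $true `
    -HAAdmissionControlEnabled $true -HAFailoverLevel 1 -Confirm:$false
```

Use -HAFailoverLevel 2 for the critical clusters mentioned above, and expect the reserved capacity to show up as "unused" in utilization reports.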
7. Forgetting to install or update VMware Tools¶
Without VMware Tools: no graceful shutdown (only hard power-off), no memory ballooning, no quiesced snapshots, no heartbeat monitoring, inaccurate performance metrics, degraded network and disk performance.
Fix: Install open-vm-tools on Linux VMs (package managed, auto-updated).
On Windows, install VMware Tools from the mounted ISO. Include Tools installation
in your VM template build process.
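Auditing for missing or stale Tools is straightforward from PowerCLI. A sketch, assuming a Connect-VIServer session:

```powershell
# Powered-on VMs whose Tools are not current
# (ToolsStatus values include toolsOk, toolsOld, toolsNotInstalled, toolsNotRunning)
Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" } |
    Select-Object Name,
        @{N = "ToolsStatus";  E = { $_.ExtensionData.Guest.ToolsStatus } },
        @{N = "ToolsVersion"; E = { $_.ExtensionData.Guest.ToolsVersion } } |
    Where-Object { $_.ToolsStatus -ne "toolsOk" }
```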
8. Editing VMDK files or VM config while the VM is running¶
Manually editing .vmx or .vmdk descriptor files while the VM is powered on
can corrupt the VM configuration or the virtual disk. The hypervisor has these
files locked for a reason.
Fix: Power off the VM before editing config files. Use vCenter or
vim-cmd/PowerCLI for runtime changes. If you must edit .vmx, unregister the
VM first, edit, then re-register.
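The unregister → edit → re-register flow can also be done from PowerCLI rather than vim-cmd. A sketch (VM name, datastore path, and host are examples); note that Remove-VM without -DeletePermanently only removes the VM from inventory and leaves its files on the datastore:

```powershell
$vm = Get-VM "app01"
Stop-VM $vm -Confirm:$false     # power off first
Remove-VM $vm -Confirm:$false   # unregister only; files stay on the datastore

# ... edit the .vmx via the datastore browser or SSH ...

# Re-register the edited VM
New-VM -VMFilePath "[datastore1] app01/app01.vmx" -VMHost (Get-VMHost "esx01.lab.local")
```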
9. Running vSAN with mismatched disk groups or firmware¶
vSAN is sensitive to disk and controller firmware. Mixing drives not on the VMware HCL, or running different firmware versions across hosts, causes intermittent I/O errors, resync storms, and data unavailability.
Fix: Only use drives and controllers on the VMware vSAN HCL. Keep firmware consistent across all hosts in the cluster. Use vSAN health checks to flag compatibility issues before they bite.
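Those health checks can be triggered from PowerCLI as well as the vSphere client. A sketch, assuming the VMware.VimAutomation.Storage module and an example cluster name (the exact shape of the result object varies by PowerCLI version):

```powershell
# Run the vSAN health check, which includes HCL and firmware validation
$health = Test-VsanClusterHealth -Cluster (Get-Cluster "vSAN-Prod")
$health   # inspect the returned summary for non-green test groups
```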
10. Ignoring vSAN quorum — shutting down too many hosts¶
vSAN with FTT=1 (RAID-1) tolerates 1 host failure. If you put 2 of 3 hosts in maintenance mode simultaneously, you lose quorum and all data becomes inaccessible. This catches people during maintenance windows.
Fix: Only put one host in maintenance mode at a time. Wait for resync to complete before proceeding to the next host. With 3 hosts and FTT=1, you have zero margin — consider 4+ host clusters for rolling maintenance.
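Making the data-migration choice explicit when entering maintenance mode helps here. A PowerCLI sketch (host and cluster names are examples; Get-VsanResyncingComponent comes from the storage module):

```powershell
# One host at a time, with the vSAN evacuation mode stated explicitly
Set-VMHost -VMHost (Get-VMHost "esx02.lab.local") -State Maintenance `
    -VsanDataMigrationMode EnsureAccessibility

# Before touching the next host, confirm resync has finished
# (an empty result means no components are still resyncing)
Get-VsanResyncingComponent -Cluster (Get-Cluster "vSAN-Prod")
```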
11. Using resource limits instead of reservations¶
Setting a CPU or memory limit on a VM caps its performance even when the host has spare capacity. Limits are frequently set "temporarily" and forgotten, causing mysterious performance degradation months later.
Fix: Use reservations (guaranteed minimums) and shares (relative priority)
instead of limits. If you must use limits, document them and set calendar
reminders to review. Search for limits: Get-VM | Get-VMResourceConfiguration |
Where-Object {$_.CpuLimitMhz -ne -1}.
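The search above can be extended to cover memory limits and to clear anything it finds. A sketch; depending on PowerCLI version an unlimited limit may surface as -1 or $null, so the filter checks both:

```powershell
# Find VMs with a CPU or memory limit set
$limited = Get-VM | Get-VMResourceConfiguration | Where-Object {
    ($null -ne $_.CpuLimitMhz -and $_.CpuLimitMhz -ne -1) -or
    ($null -ne $_.MemLimitGB  -and $_.MemLimitGB  -ne -1)
}
$limited | Select-Object VM, CpuLimitMhz, MemLimitGB

# Clear the limits back to unlimited ($null removes the cap)
$limited | Set-VMResourceConfiguration -CpuLimitMhz $null -MemLimitGB $null
```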
12. Changing the vCenter FQDN or IP without updating all integrations¶
vCenter's identity is baked into ESXi host configurations, backup tools, monitoring, Terraform state, and every script that connects to it. Changing the FQDN or IP without updating all consumers breaks vMotion, backup jobs, and automation silently.
Fix: Plan FQDN/IP carefully at deployment. If you must change it, inventory every integration first: backup tools, monitoring, Ansible/Terraform configs, DNS entries, ESXi host vpxa configs, SSO identity sources.