VMware - Street-Level Ops¶
Real-world patterns and debugging techniques for VMware vSphere in production.
Quick Diagnosis Commands¶
# 1. Check host health and uptime
esxcli system stats uptime get
esxcli hardware platform get
# 2. List VMs and their states
vim-cmd vmsvc/getallvms
esxcli vm process list
# 3. Check storage status
esxcli storage filesystem list
esxcli storage core device list -d <naa.id>
# 4. Check network connectivity
esxcli network ip interface list
vmkping -I vmk0 <vcenter-ip>
# 5. Review recent logs
tail -100 /var/log/hostd.log
tail -100 /var/log/vpxa.log
tail -100 /var/log/vmkernel.log
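The three log tails can be wrapped in one labeled loop so the output is easy to scan. A minimal sketch; the log directory is parameterized here for illustration, on a real host it is /var/log:

```shell
#!/bin/sh
# Tail the key ESXi logs, printing a header per file.
# Pass an alternate log directory as $1 (defaults to /var/log).
tail_esxi_logs() {
    logdir="${1:-/var/log}"
    for name in hostd vpxa vmkernel; do
        f="$logdir/$name.log"
        echo "===== $f ====="
        [ -f "$f" ] && tail -100 "$f"   # skip files that do not exist
    done
    return 0
}
tail_esxi_logs "$@"
```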
Common Scenarios¶
Scenario 1: VM Won't Power On — "File Locked"¶
Symptoms: Power-on fails with "Cannot open the disk" or "Failed to lock the file."
Diagnosis:
# Find which host holds the lock
vmkfstools -D /vmfs/volumes/datastore1/vm-name/vm-name-flat.vmdk
# Check for stale .lck directories
ls -la /vmfs/volumes/datastore1/vm-name/*.lck
# Look for leftover .vswp files
ls -la /vmfs/volumes/datastore1/vm-name/*.vswp
Resolution:
1. Confirm the VM is not running on another host (esxcli vm process list on
all hosts)
2. If truly stale, remove the .lck directories
3. If a host crashed, the lock auto-expires after ~15 seconds — wait and retry
4. As a last resort, remove .vswp and retry power-on
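Before deleting anything in steps 2 and 4, it helps to list every lock and swap artifact in the VM's directory in one pass. A sketch assuming the standard one-directory-per-VM layout; the `.lck-*` pattern covers the hidden lock files seen on NFS datastores:

```shell
#!/bin/sh
# List .lck entries and .vswp files in a VM's datastore directory
# so they can be reviewed before removal. Usage: check_locks <vm-dir>
check_locks() {
    vmdir="$1"
    echo "-- lock artifacts in $vmdir --"
    find "$vmdir" -maxdepth 1 \
        \( -name '*.lck' -o -name '.lck-*' -o -name '*.vswp' \) \
        -exec ls -ld {} \;
}
check_locks "${1:-.}"
```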
Scenario 2: vMotion Fails — "Migration Exceeded Maximum Switchover Time"¶
Symptoms: vMotion starts but times out. Common with memory-heavy, write-intensive VMs.
Diagnosis:
# Check vMotion VMkernel connectivity between hosts
vmkping -I vmk1 <destination-host-vmk1-ip>
# Check vMotion network throughput (should be 10 Gbps+)
esxcli network nic stats get -n vmnic2
# Check VM memory dirty rate
# In vCenter: Monitor > Performance > Advanced > Memory > Active/Swap
Resolution:
1. Ensure vMotion uses a dedicated 10G+ NIC, not shared with management
2. For large VMs (1TB+ RAM), enable multi-NIC vMotion
3. If the VM is dirtying memory faster than vMotion can copy it, try migrating during a lower-activity window
4. Check that EVC mode is set — CPU incompatibility causes late-stage failure
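Whether precopy converges is simple arithmetic: the network copy rate must exceed the memory dirty rate. A rough back-of-the-envelope check; the function name and the example numbers are illustrative, not measured values:

```shell
#!/bin/sh
# Rough vMotion precopy feasibility check: converges only if
# the copy rate exceeds the dirty rate.
# Usage: precopy_check <link_gbps> <dirty_rate_MBps>
precopy_check() {
    link_gbps="$1"; dirty_mbps="$2"
    # 1 Gbps is roughly 125 MB/s of copy bandwidth (ignoring overhead)
    copy_mbps=$((link_gbps * 125))
    if [ "$copy_mbps" -gt "$dirty_mbps" ]; then
        echo "converges (copy ${copy_mbps} MB/s > dirty ${dirty_mbps} MB/s)"
    else
        echo "will not converge (copy ${copy_mbps} MB/s <= dirty ${dirty_mbps} MB/s)"
    fi
}
precopy_check "${1:-10}" "${2:-800}"
```

A 10 Gbps link copies about 1250 MB/s, so a VM dirtying 800 MB/s converges; on a 1 Gbps link the same VM never would.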
Scenario 3: ESXi Host Disconnected from vCenter¶
Symptoms: Host shows "Disconnected" or "Not Responding" in vCenter.
Diagnosis:
# On the ESXi host via DCUI or SSH:
# Check vpxa (vCenter agent)
/etc/init.d/vpxa status
tail -50 /var/log/vpxa.log
# Check management network
esxcli network ip interface list
vmkping <vcenter-ip>
# Check hostd
/etc/init.d/hostd status
tail -50 /var/log/hostd.log
# Check DNS resolution
nslookup <vcenter-fqdn>
Resolution:
1. Restart the vpxa agent: /etc/init.d/vpxa restart
2. If hostd is hung: /etc/init.d/hostd restart (does NOT affect running VMs)
3. If network is down, check physical NIC and switch port
4. In vCenter, right-click host → "Reconnect"
Scenario 4: Datastore Running Out of Space¶
Symptoms: VM operations fail, thin-provisioned disks can't grow, snapshots fill remaining space.
Diagnosis:
# Check datastore usage
esxcli storage filesystem list
# Find large files
du -sh /vmfs/volumes/datastore1/*/
# Check for snapshot chains
find /vmfs/volumes/datastore1/ -name "*-delta.vmdk" -exec ls -lh {} \;
# Check thin disk actual usage vs provisioned
vim-cmd vmsvc/get.summary <vmid> | grep storage
Resolution:
1. Consolidate snapshots — this is the #1 space waster. In vCenter: right-click VM → Snapshots → "Delete All Snapshots"
2. Remove orphaned VMDK files (check they're not attached to any VM first)
3. Storage vMotion VMs to other datastores to rebalance
4. If using thin provisioning, set alarms at 80% to prevent surprises
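The find command above can be extended to total delta-disk usage per VM directory, which makes the worst snapshot chains obvious. A sketch assuming one directory per VM; the default datastore path is illustrative:

```shell
#!/bin/sh
# Sum -delta.vmdk sizes (in KB) under each VM directory to show which
# snapshot chains consume the most space. Usage: delta_usage <datastore-path>
delta_usage() {
    ds="$1"
    for vmdir in "$ds"/*/; do
        total=0
        for f in "$vmdir"*-delta.vmdk; do
            [ -f "$f" ] || continue
            kb=$(du -k "$f" | awk '{print $1}')
            total=$((total + kb))
        done
        # Only report directories that actually have delta disks
        [ "$total" -gt 0 ] && echo "$total KB $vmdir"
    done | sort -rn
    true
}
delta_usage "${1:-/vmfs/volumes/datastore1}"
```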
Scenario 5: VM Performance Degraded — CPU Ready Time High¶
Symptoms: Application inside VM is slow, but CPU usage doesn't look high from inside the guest.
Diagnosis:
# Check CPU ready time (should be <5%)
# In vCenter: Monitor > Performance > Advanced > CPU > Ready
# Or via esxtop:
esxtop
# Press 'c' for CPU view, look at %RDY column
Key metrics in esxtop:
- %RDY > 5%: VM is waiting for physical CPU time — host is overcommitted
- %CSTP > 3%: co-scheduling delay — reduce vCPUs on the VM
- %SWPWT > 0: host is swapping VM memory — add RAM or reduce overcommit
- %MLMTD > 0: resource limit is throttling the VM — check limits
Resolution:
1. If %RDY is high across many VMs: host is overcommitted on CPU, migrate
VMs or add hosts
2. If %CSTP is high: the VM has more vCPUs than it needs — reduce vCPU count
3. Never assign more vCPUs than the VM actually uses (right-size first)
4. Check for CPU resource limits — someone may have set a limit and forgotten
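vCenter's real-time charts report CPU Ready as milliseconds summed over a 20-second sample, not as the percentage esxtop shows. The conversion is: %RDY = ready_ms / (20000 ms × vCPU count) × 100. A worked example (the function name is just for illustration):

```shell
#!/bin/sh
# Convert vCenter "CPU Ready" (ms summed over a 20 s real-time sample)
# into the percentage esxtop displays.
# Usage: rdy_pct <ready_ms> <num_vcpus>
rdy_pct() {
    ready_ms="$1"; vcpus="$2"
    # A 20 s sample offers 20000 ms of possible run time per vCPU
    awk -v ms="$ready_ms" -v n="$vcpus" \
        'BEGIN { printf "%.1f%%\n", ms / (20000 * n) * 100 }'
}
rdy_pct 2000 1   # 2000 ms ready over 20 s on 1 vCPU = 10.0%
```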
Scenario 6: PSOD (Purple Screen of Death)¶
Symptoms: ESXi host crashes with a purple diagnostic screen. All VMs on that host go down.
Diagnosis:
# After host reboots, collect the core dump
esxcli system coredump file list
# Check VMkernel logs from before the crash
less /var/log/vmkernel.log
# Upload the core dump to VMware support
vm-support
Common causes:
- Faulty hardware (RAM, NIC firmware, HBA)
- Driver bugs (especially third-party drivers)
- NFS storage disconnection combined with specific conditions
Resolution:
1. Check VMware KB for the specific backtrace
2. Update ESXi to the latest patch level
3. Update hardware firmware and drivers to VMware HCL versions
4. If recurring, run hardware diagnostics (memtest, HBA diag)
5. HA will restart VMs on other hosts — this is your safety net
Operational Patterns¶
Pattern: Maintenance Mode Workflow¶
# 1. Put host in maintenance mode (DRS migrates VMs automatically)
# In vCenter: right-click host → Maintenance Mode → Enter
# 2. Verify all VMs migrated
esxcli vm process list # should be empty
# 3. Apply patches
esxcli software vib install -d /vmfs/volumes/datastore1/patches/ESXi-7.0U3-patch.zip
# For full patch rollups, 'esxcli software profile update -d <zip> -p <profile>'
# is generally safer than 'vib install', which can downgrade VIBs
# 4. Reboot if required
reboot
# 5. Exit maintenance mode after reboot
# In vCenter: right-click host → Maintenance Mode → Exit
Pattern: ESXi Patch Compliance with vLCM¶
vSphere Lifecycle Manager (vLCM) replaces the old Update Manager:
1. Create a cluster image with a specific ESXi version + drivers + firmware
2. Check compliance across all hosts
3. Remediate non-compliant hosts (auto-enters maintenance mode, patches, reboots)
Pattern: Performance Baseline with esxtop Batch Mode¶
# Capture 60 samples at 5-second intervals
esxtop -b -d 5 -n 60 > /tmp/esxtop-$(date +%Y%m%d).csv
# Download and analyze in Excel or with perfmon (Windows)
# Key columns: %RDY, %CSTP, %SWPWT, MCTLSZ, SWCUR
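esxtop batch output is a perfmon-style CSV: row 1 holds quoted counter names, later rows hold samples. Finding the right columns by hand is tedious, so a small helper can locate every "% Ready" column by index. A hedged sketch, assuming counter names contain no embedded commas (the default path is hypothetical):

```shell
#!/bin/sh
# Print the column index of every "% Ready" counter in an esxtop
# batch-mode CSV, so the right columns can be pulled for analysis.
# Usage: rdy_columns <csv-file>
rdy_columns() {
    # Split the quoted header row on '","' boundaries
    head -1 "$1" 2>/dev/null | awk -F'","' '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /% Ready/) print i ": " $i
    }'
}
rdy_columns "${1:-/tmp/esxtop-batch.csv}"
```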
Pattern: Automating VM Provisioning¶
Use Terraform for repeatable VM deployments:
# Initialize and apply
cd vmware-terraform/
terraform init
terraform plan -var-file=prod.tfvars
terraform apply -var-file=prod.tfvars
# Destroy when done
terraform destroy -var-file=prod.tfvars
Pattern: Backup and Snapshot Hygiene¶
- Never use snapshots as backups — they grow unbounded and degrade I/O
- Use backup tools (Veeam, Commvault, ghettoVCB) that create temporary snapshots, stream data, then delete the snapshot
- Monitor snapshot age: any snapshot older than 72 hours is a red flag
- Set vCenter alarms for snapshot size thresholds
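The 72-hour rule can also be checked from the shell. The snapshot descriptor files (vm-000001.vmdk etc.) are written at snapshot creation, so their mtime is a rough proxy for snapshot age; a sketch (the default datastore path is illustrative, and find's -mtime +3 means "more than 3 days"):

```shell
#!/bin/sh
# Flag snapshot descriptor files older than ~72 hours (3 days).
# Descriptor mtime is only an approximation of snapshot age.
# Usage: old_snapshots <datastore-path>
old_snapshots() {
    find "$1" -name '*-00000[0-9].vmdk' -mtime +3 \
        -exec ls -lh {} \; 2>/dev/null
    true
}
old_snapshots "${1:-/vmfs/volumes/datastore1}"
```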
Pattern: Log Bundle Collection¶
When opening a VMware support request:
# On ESXi host — generate support bundle
vm-support
# On vCenter — export support bundle
# vCenter UI: Menu → Administration → Support Bundle → Export
# Specific log locations on ESXi:
# /var/log/hostd.log — host daemon
# /var/log/vpxa.log — vCenter agent
# /var/log/vmkernel.log — kernel messages
# /var/log/vobd.log — VMkernel observations (VOB) daemon
# /var/log/fdm.log — HA agent