Pattern: Port-Forward as Permanent Fix

ID: FP-049
Family: Human Error Amplifier
Frequency: Common
Blast Radius: Single Service
Detection Difficulty: Obvious (when it breaks again)

The Shape

During an incident, an engineer uses kubectl port-forward to bypass a broken Ingress, Service, or load balancer. Traffic flows; the immediate crisis is resolved. The engineer closes their laptop for the night. The port-forward process (which lives in the engineer's terminal session) terminates. The service becomes unreachable again. The fix was ephemeral but was treated as permanent, creating a false sense of resolution and a repeat incident hours later.

How You'll See It

In Kubernetes

# "Fix" during 3am incident:
kubectl port-forward svc/payment-service 8080:80 &
# Payment service accessible. Monitoring turns green. Incident resolved?

# 4 hours later (engineer's session ended):
# Connection refused. Incident re-opens.
The root cause (broken Ingress, misconfigured Service selector) was never fixed, only bypassed. The bypass was session-local and died with the engineer's terminal.
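One of those root causes, a Service selector that no longer matches the pod labels, is cheap to confirm: the Service ends up with no Endpoints. The kubectl commands below are standard; the service name and labels are illustrative, and the subset check itself is sketched on sample data so it runs without a cluster.

```shell
# On a live cluster, the tell is a Service with empty Endpoints:
#   kubectl get endpoints payment-service          # no ADDRESSES => no pod matches
#   kubectl get svc payment-service -o jsonpath='{.spec.selector}'
#   kubectl get pods --show-labels
# A Service routes to a pod only if its selector is a subset of the pod's
# labels. Sample data below (note the typo: "payments" vs "payment"):
svc_selector='{"app": "payment"}'
pod_labels='{"app": "payments", "tier": "backend"}'
result=$(python3 -c '
import json, sys
sel, lab = json.loads(sys.argv[1]), json.loads(sys.argv[2])
print("match" if all(lab.get(k) == v for k, v in sel.items()) else "MISMATCH")
' "$svc_selector" "$pod_labels")
echo "selector vs labels: $result"
```

If the output is MISMATCH, the durable fix is to correct the selector (or the pod labels), not to bypass the Service.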

In Linux/Infrastructure

SSH tunnel (ssh -L 5432:db-server:5432 jump-host) used to connect an application to a database when the direct connection was broken. The tunnel worked; the team moved on. The SSH connection dropped (idle timeout, network issue). The application lost database connectivity again.
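A quick way to surface this hidden dependency before the tunnel drops is to ask who actually owns the listening socket the application connects to: if an ssh process is listening on the "database" port, the connection is riding a tunnel. The `ss` invocation is standard iproute2; the port number is a placeholder for whatever the app is configured to use.

```shell
# If the app connects to localhost:5432 "as the database", check who is
# really listening there. An ssh process in the owner column is the tell.
DB_PORT=${DB_PORT:-5432}
listener=$(ss -ltnp 2>/dev/null | grep ":${DB_PORT}" || true)
if [ -n "$listener" ]; then
  msg="port ${DB_PORT} is owned by: ${listener}"
else
  msg="nothing listening on :${DB_PORT} (tunnel already dead, or no local bypass)"
fi
echo "$msg"
# Equivalent check: lsof -iTCP:5432 -sTCP:LISTEN
```

Either answer is useful: an ssh owner means the "working" connection depends on a session-bound tunnel; nothing listening means the bypass has already expired.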

In CI/CD

A CI step uses kubectl port-forward to access a service for testing, started as a background process with no readiness wait and no cleanup. If the forward isn't ready when the first request fires, or dies mid-run (dropped connection, pod restart), the test fails intermittently.
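If CI must use a port-forward, two guards make it deterministic: wait until the tunnel actually accepts connections, and kill it on exit. The sketch below substitutes a local `python3 -m http.server` for the forward so it runs anywhere; in a real pipeline the first line would be `kubectl port-forward svc/payment-service 8765:80 &`. The port and service names are illustrative.

```shell
# Stand-in for: kubectl port-forward svc/payment-service 8765:80 &
python3 -m http.server 8765 --bind 127.0.0.1 >/dev/null 2>&1 &
PF_PID=$!
trap 'kill "$PF_PID" 2>/dev/null' EXIT   # clean up even if the test fails

# Wait up to ~10s for the tunnel to accept connections; never sleep blindly
ready=no
for _ in $(seq 1 20); do
  if curl -sf -o /dev/null http://127.0.0.1:8765/; then ready=yes; break; fi
  sleep 0.5
done
echo "tunnel ready: $ready"

# The actual test request, only attempted once the tunnel is up
status=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8765/)
echo "test request status: $status"
```

The trap matters as much as the readiness loop: without it, a failed test leaves an orphaned forward holding the port for the next pipeline run.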

The Tell

The incident was "resolved" but the same incident recurred hours later. No changes were made to Ingress, Service, or NetworkPolicy resources. A port-forward process was running during the "resolved" window. The recurrence happened exactly when the engineer's session or terminal was closed.

Common Misdiagnosis

Looks Like                   | But Actually                     | How to Tell the Difference
Intermittent network issue   | Port-forward died                | Correlates with terminal session ending, not with network events
Service instability          | Unstable fix (port-forward)      | Service itself is stable; the bypass mechanism is what's intermittent
Resolved incident re-opening | Was never fixed; bypass expired  | Root cause (Ingress/Service issue) is still present in the config

The Fix (Generic)

  1. Immediate: Fix the actual Ingress/Service/NetworkPolicy/DNS issue; use port-forward only for diagnosis, never as a fix.
  2. Short-term: Before closing an incident, validate that traffic flows through the actual production path (not via port-forward); run an end-to-end test that exercises the real path.
  3. Long-term: Add to incident runbooks: "port-forward is a diagnostic tool, not a fix; the incident is not resolved until traffic flows through the actual service mesh path."
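The short-term validation can be scripted into the close-out step. `close_out_check` below is a hypothetical helper, not part of any tool: it refuses to pass while a kubectl port-forward is alive on the host, then exercises the real URL. `pgrep` and `curl` are standard; the URL is a placeholder for your actual Ingress/LB hostname.

```shell
# Refuse to close the incident while a port-forward is propping things up,
# then prove the real path works. Hypothetical helper; adapt names and URLs.
close_out_check() {
  url="$1"
  if pgrep -f 'kubectl port-forward' >/dev/null 2>&1; then
    echo "FAIL: a kubectl port-forward is still running on this host"
    return 1
  fi
  # Hit the actual Ingress/LB hostname, never localhost
  if curl -sf -o /dev/null --max-time 10 "$url"; then
    echo "PASS: $url reachable via the production path"
    return 0
  fi
  echo "FAIL: $url unreachable via the production path"
  return 1
}
# e.g. close_out_check https://payments.example.com/healthz
```

Note the check only sees port-forwards on the host it runs from; it catches the common case (the responder's own laptop or a shared jump host), not every possible bypass.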

Real-World Examples

  • Example 1: Ingress controller misconfigured after a certificate renewal. Engineer port-forwarded directly to the pod. "Incident resolved." 6 hours later, engineer's laptop closed, port-forward died, Ingress still broken. Incident re-opened.
  • Example 2: Service selector label mismatch (pod labels changed but the Service selector wasn't updated). Port-forward bypassed the selector entirely. The "fix" lasted 8 hours, until the next deployment restarted the pods and killed the port-forward.

War Story

My worst 3am: fixed a payment outage with a port-forward. Wrote "resolved" in the incident. Went to sleep. 7am: same outage, fresh engineers, same confusion. They couldn't figure out why it had been working at 3am and was broken now. My port-forward had been the bridge. I had left the terminal open, gone to sleep, and the laptop hibernated. The port-forward died. The Ingress was still broken; I hadn't touched it at all, just bypassed it. New rule for me: I don't close an incident until I can curl the actual service URL (not localhost:8080). Port-forward is for diagnosing, not fixing.

Cross-References

  • Topic Packs: k8s-ops, incident-command
  • Footguns: k8s-ops/footguns.md — "Port-forward as a 'fix'"
  • Related Patterns: FP-051 (missing escalation — same "incident prematurely closed" pattern), FP-024 (health check lying — another "appears fixed but isn't")