Thinking Out Loud: Security Basics

A senior SRE's internal monologue while working through a real security task. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

The security team has flagged our production web servers during a quarterly audit: several are running outdated packages with known CVEs, SSH is configured with password authentication, and there are unmonitored open ports. I need to harden these servers to pass the audit.

The Monologue

Security audit findings on production servers. Let me not shotgun-fix everything at once — I need to assess severity and fix in order of actual risk, not audit checkbox order.

First, let me understand the exposure. Which of these servers are internet-facing?

for host in web-{1..4}.prod; do
  echo -n "$host: "
  ssh $host 'curl -s ifconfig.me; echo -n " — "; ss -tlnp | grep -c LISTEN'
  echo " listening ports"
done

All four have public IPs, each with four to seven listening ports. Let me see what's actually exposed.

ssh web-1.prod 'ss -tlnp | column -t'

Ports: 22 (sshd), 80 (nginx), 443 (nginx), 8080 (app), 9090 (Prometheus), 9100 (node exporter), 3306 (wait — MySQL?). MySQL is listening on 0.0.0.0:3306 on a web server? That should NOT be internet-accessible.

Mental Model: Reduce the Attack Surface First

In security hardening, the highest-value action is reducing the attack surface: close unnecessary ports, remove unused services, restrict network access. A perfectly patched service that shouldn't be exposed at all is still a liability. Fix exposure first, then patch, then harden configuration.

Let me check if MySQL on port 3306 is actually accessible from the internet or if there's a firewall rule.

ssh web-1.prod 'sudo iptables -L INPUT -n | grep 3306'

No iptables rule for 3306. And we're on AWS — let me check the security group.

aws ec2 describe-security-groups --group-ids sg-0abc123 --query 'SecurityGroups[0].IpPermissions[?FromPort==`3306`]'

The security group allows 3306 from 0.0.0.0/0. MySQL is open to the entire internet. This is the highest-priority fix. Let me restrict it now.

aws ec2 revoke-security-group-ingress --group-id sg-0abc123 --protocol tcp --port 3306 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0abc123 --protocol tcp --port 3306 --cidr 10.0.0.0/8

MySQL is now only accessible from the private network. But honestly, MySQL shouldn't be running on a web server at all — that's a separate conversation.

Now, SSH hardening. The audit flagged password authentication.

ssh web-1.prod 'grep -E "^(PasswordAuthentication|PermitRootLogin|PubkeyAuthentication)" /etc/ssh/sshd_config'

PasswordAuthentication yes, PermitRootLogin yes. Both are bad. Let me fix SSH config across all servers.

for host in web-{1..4}.prod; do
  echo "Hardening $host..."
  ssh "$host" 'sudo sed -i.bak -E \
    -e "s/^#?PasswordAuthentication .*/PasswordAuthentication no/" \
    -e "s/^#?PermitRootLogin .*/PermitRootLogin no/" \
    -e "s/^#?MaxAuthTries.*/MaxAuthTries 3/" \
    /etc/ssh/sshd_config && sudo sshd -t && sudo systemctl reload sshd'
done

The `#?` matters: distro default configs ship these directives commented out, and a sed that only matches the uncommented form would silently change nothing.

Wait — before I reload sshd, let me make sure I have key-based access working. If I disable password auth and my key isn't set up, I lock myself out.

ssh -o PreferredAuthentications=publickey web-1.prod 'echo key auth works'

Good, key auth works. Let me also verify there are no authorized_keys with suspicious entries.

for host in web-{1..4}.prod; do
  echo "=== $host ==="
  ssh "$host" 'for d in /root /home/*; do
    f="$d/.ssh/authorized_keys"
    [ -f "$f" ] && { echo "$d ($(wc -l < "$f") keys):"; cut -d" " -f3- "$f"; }
  done'
done

Printing the key comments (the trailing user@host field), not just counts, is what actually surfaces a key that doesn't belong.

Mental Model: Verify Before Hardening

Never apply security hardening that could lock you out without first verifying your alternative access method works. Test key-based auth before disabling password auth. Test serial console access before restricting SSH. Test IPMI before restricting network access. The most dangerous security incident is the one caused by your own hardening.
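The same caution applies to the config edit itself: the hardening sed can be dry-run locally on a sample file before it touches any server. A minimal sketch — the sample config and /tmp paths are made up for illustration, and the patterns also catch the commented-out distro defaults:

```shell
#!/bin/sh
# Dry run of the SSH-hardening sed on a local sample before pushing it to
# production. The sample sshd_config below is fabricated for illustration.
cat > /tmp/sshd_config.sample <<'EOF'
#PasswordAuthentication yes
PermitRootLogin yes
#MaxAuthTries 6
EOF

sed -E \
  -e 's/^#?PasswordAuthentication .*/PasswordAuthentication no/' \
  -e 's/^#?PermitRootLogin .*/PermitRootLogin no/' \
  -e 's/^#?MaxAuthTries.*/MaxAuthTries 3/' \
  /tmp/sshd_config.sample > /tmp/sshd_config.hardened

cat /tmp/sshd_config.hardened
```

Inspecting the diff between the sample and the output takes seconds and catches a bad regex before it lands on four production hosts.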

Now the outdated packages with CVEs. Let me see how bad it is.

ssh web-1.prod 'apt list --upgradable 2>/dev/null | tail -n +2 | wc -l'

47 packages need updates. Let me check which ones have known CVEs.

ssh web-1.prod 'sudo apt list --upgradable 2>/dev/null | grep -i -E "openssl|libssl|curl|wget|sudo|kernel|nginx|openssh"'

openssl and libssl have updates. Those are critical — SSL vulnerabilities can be remotely exploitable. Let me prioritize those.

ssh web-1.prod 'sudo apt-get update && sudo apt-get install --only-upgrade openssl libssl3 -y'
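One catch with library upgrades: processes that already mapped the old libssl keep running the vulnerable code until restarted. `lsof` flags such mappings as DEL (deleted file still in use). A sketch using fabricated sample output so the filter is visible; the real invocation is in the trailing comment:

```shell
#!/bin/sh
# Find processes still holding a deleted (pre-upgrade) libssl mapping —
# they need a service restart to pick up the patched library.
# /tmp/lsof.sample is made-up stand-in data for real lsof output.
cat > /tmp/lsof.sample <<'EOF'
nginx   1234 root DEL REG /usr/lib/x86_64-linux-gnu/libssl.so.3
sshd    5678 root mem REG /usr/lib/x86_64-linux-gnu/libssl.so.3
EOF
awk '/DEL/ && /libssl/ {print $1, $2}' /tmp/lsof.sample > /tmp/needs_restart
cat /tmp/needs_restart
# real host: sudo lsof -nP 2>/dev/null | awk '/DEL/ && /libssl/ {print $1, $2}' | sort -u
```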

For the rest of the packages, I'll do a rolling update — one server at a time, pulling it from the load balancer first.

# Take web-1 out of rotation
aws elbv2 deregister-targets --target-group-arn arn:aws:elasticloadbalancing:... --targets Id=i-abc123

# Wait for connections to drain
sleep 30

# Full update
ssh web-1.prod 'sudo apt-get upgrade -y && sudo apt-get autoremove -y'

# Verify the server is healthy
ssh web-1.prod 'curl -s localhost/health'

# Put it back in rotation
aws elbv2 register-targets --target-group-arn arn:aws:elasticloadbalancing:... --targets Id=i-abc123
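The `sleep 30` is a blind guess at drain time; polling until a check passes is more robust. A minimal sketch — `wait_healthy` is a made-up helper name, and the curl target in the comment is illustrative:

```shell
#!/bin/sh
# Poll a health command until it succeeds or we give up, instead of a fixed
# sleep. In practice $cmd would be a curl health check or an
# `aws elbv2 describe-target-health` query.
wait_healthy() {
  cmd=$1; tries=${2:-10}; delay=${3:-3}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if $cmd >/dev/null 2>&1; then
      echo "healthy after $((i + 1)) check(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gave up after $tries checks" >&2
  return 1
}

# e.g. wait_healthy "curl -sf http://web-1.prod/health" 20 5
wait_healthy true 3 0
```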

Now let me also lock down the unnecessary ports. The metrics endpoints (Prometheus on 9090, node exporter on 9100) should not be internet-facing — metrics should only be scraped from the internal network.

aws ec2 revoke-security-group-ingress --group-id sg-0abc123 --protocol tcp --port 9090 --cidr 0.0.0.0/0
aws ec2 revoke-security-group-ingress --group-id sg-0abc123 --protocol tcp --port 9100 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0abc123 --protocol tcp --port 9090 --cidr 10.0.0.0/8
aws ec2 authorize-security-group-ingress --group-id sg-0abc123 --protocol tcp --port 9100 --cidr 10.0.0.0/8

And port 8080 (the application's direct port) — should only be accessible from the load balancer, not directly from the internet.

aws ec2 revoke-security-group-ingress --group-id sg-0abc123 --protocol tcp --port 8080 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0abc123 --protocol tcp --port 8080 --source-group sg-alb-group

Now only 22, 80, and 443 are internet-facing. Everything else is restricted to internal or LB-only access.

Let me also set up unattended-upgrades for automatic security patches going forward.

for host in web-{1..4}.prod; do
  ssh $host 'sudo apt-get install unattended-upgrades -y && sudo dpkg-reconfigure -plow unattended-upgrades'
done
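Note that `dpkg-reconfigure -plow` pops an interactive prompt, which is awkward in a loop. The non-interactive equivalent is to write the two apt settings directly; the option names below are the stock Debian/Ubuntu ones, but I'm writing to /tmp for illustration — the real target is /etc/apt/apt.conf.d/20auto-upgrades:

```shell
#!/bin/sh
# Enable daily list refresh and unattended security upgrades without the
# interactive dpkg-reconfigure prompt. Writing to /tmp for illustration;
# on a real host the path is /etc/apt/apt.conf.d/20auto-upgrades.
cat > /tmp/20auto-upgrades <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF
cat /tmp/20auto-upgrades
```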

One last thing — let me set up fail2ban for SSH brute-force protection. Even with key-only auth, it reduces noise in the logs and blocks scanners.

for host in web-{1..4}.prod; do
  ssh $host 'sudo apt-get install fail2ban -y && sudo systemctl enable fail2ban && sudo systemctl start fail2ban'
done
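The package defaults for the sshd jail vary by distro, so pinning them in jail.local makes the policy explicit. A sketch with illustrative thresholds — written to /tmp here; the real path is /etc/fail2ban/jail.local, followed by a fail2ban restart:

```shell
#!/bin/sh
# Minimal explicit sshd jail config for fail2ban. Thresholds are
# illustrative, not a recommendation. Real path: /etc/fail2ban/jail.local.
cat > /tmp/jail.local <<'EOF'
[sshd]
enabled  = true
maxretry = 3
findtime = 10m
bantime  = 1h
EOF
cat /tmp/jail.local
```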

Let me verify the hardening by running a quick audit.

ssh web-1.prod 'grep PasswordAuthentication /etc/ssh/sshd_config; dpkg -l openssl | tail -1; systemctl is-active fail2ban'

# From a machine outside the VPC — the security group changes don't alter
# what's listening locally, only what's reachable:
nmap -Pn -p 22,80,443,3306,8080,9090,9100 web-1.prod

From outside, only 22, 80, and 443 come back open; 3306, 8080, 9090, and 9100 show as filtered by the security group (they still listen locally, which is why checking `ss` on the host alone wouldn't prove anything). Password auth disabled. OpenSSL updated. Fail2ban active. This should pass the audit.

What Made This Senior-Level

| Junior Would... | Senior Does... | Why |
| --- | --- | --- |
| Patch all packages first because "CVEs" sounds scary | Close exposed ports first, then patch | An unpatched service behind a firewall is less risky than a patched service exposed to the internet |
| Disable password auth and hope key auth works | Test key-based auth FIRST, then disable password auth | Locking yourself out during hardening is worse than the vulnerability you're fixing |
| Update all servers simultaneously | Roll updates one server at a time, pulling from the LB first | Rolling updates prevent a bad package from taking down the entire fleet |
| Fix the audit findings and stop there | Also set up unattended-upgrades and fail2ban to prevent recurrence | The audit findings are symptoms of missing ongoing processes |

Key Heuristics Used

  1. Attack Surface First: Close unnecessary ports and restrict network access before patching or configuring. Reduce exposure first, then harden what's left.
  2. Verify Before Restricting: Always confirm your alternative access method works before disabling the current one. Never lock yourself out.
  3. Rolling Hardening: Apply changes one server at a time with health verification between each. Treat hardening changes like deploys.
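The rolling pattern generalizes into a reusable loop. In this sketch, drain/update/healthy/restore are stand-ins for the aws elbv2 and ssh commands used above, stubbed with echo so the control flow is visible:

```shell
#!/bin/sh
# Rolling rollout skeleton: one host at a time, abort on the first failed
# health check so a bad change stops at one host instead of the fleet.
# drain/update/healthy/restore are stubs for the real aws/ssh commands.
drain()   { echo "drain $1"; }
update()  { echo "update $1"; }
healthy() { echo "health-check $1"; }   # real check: curl -sf "$1/health"
restore() { echo "restore $1" >> /tmp/rollout.log; }

: > /tmp/rollout.log
for host in web-1 web-2 web-3 web-4; do
  drain "$host"
  update "$host"
  if ! healthy "$host"; then
    echo "ABORT: $host unhealthy after update; halting rollout" >&2
    exit 1
  fi
  restore "$host"
done
cat /tmp/rollout.log
```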

Cross-References

  • Primer — Security fundamentals, defense in depth, and the principle of least privilege
  • Street Ops — Security audit commands, SSH hardening checklist, and firewall configuration
  • Footguns — Disabling password auth before verifying key auth, security groups open to 0.0.0.0/0, and MySQL on web servers