AI Tools for DevOps - Street Ops

Practical workflows that show how AI tools fit into real DevOps work.

Workflow 1: Debugging a Failing Deploy with Claude Code

You: The staging deploy failed. Here's the ArgoCD error:
     "sync failed: resource apps/v1/Deployment/order-api
      failed to sync: Forbidden: PodDisruptionBudget
      'order-api' is blocking eviction"

Claude Code:
1. Reads the Helm chart and PDB definition
2. Checks the PDB's minAvailable vs replica count
3. Identifies that minAvailable=2 with replicas=2 means
   zero disruption budget for rolling updates
4. Fixes PDB to use maxUnavailable=1 instead
5. Runs helm lint to validate
6. Commits the fix

The key: Claude Code works directly on your files and can run helm lint itself to validate the fix, so nothing has to be copied and pasted between tools.
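The arithmetic behind step 3 fits in a few lines. This is a simplified sketch, assuming every replica is healthy and the PDB uses integer values rather than percentages; the function name `allowed_disruptions` is hypothetical and not part of any Kubernetes client.

```python
from typing import Optional


def allowed_disruptions(
    replicas: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Return how many pods a PDB allows to be evicted at once.

    Simplified model: assumes all replicas are healthy and that the PDB
    uses integer values (real PDBs also accept percentages).
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        # Evictions are allowed only while min_available pods stay up.
        return max(replicas - min_available, 0)
    return min(max_unavailable, replicas)


if __name__ == "__main__":
    # The failing configuration: no pod may be evicted.
    print(allowed_disruptions(replicas=2, min_available=2))    # 0
    # The fix from step 4: one pod may be evicted at a time.
    print(allowed_disruptions(replicas=2, max_unavailable=1))  # 1
```

With minAvailable equal to the replica count the budget is always zero, which is why switching to maxUnavailable=1 unblocks the eviction.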

Workflow 2: Terraform Module Generation with ChatGPT

Step 1 - Generate:

"Generate a Terraform module for an AWS RDS PostgreSQL instance.
Requirements:
- PostgreSQL 15, db.t3.medium (parameterized)
- Multi-AZ toggle via variable
- Encrypted at rest with KMS
- Automated backups, 7-day retention
- Private subnet group
- Security group allowing inbound 5432 from CIDR list
- Parameter group: log_statement=all, log_min_duration_statement=1000
- Outputs: endpoint, port, security group ID"

Step 2 - Refine:

"Now add:
- deletion_protection = true by default
- performance_insights_enabled = true
- A variable for storage type (gp3 vs io1)
- Tagging with var.common_tags merged with Name"

Step 3 - Validate locally:

cd devops/terraform/modules/rds
terraform init -backend=false
terraform validate
terraform fmt -check

Gotcha: AI-generated Terraform often uses deprecated arguments or outdated provider versions, so check generated code against the current provider docs. Pin your provider version and run terraform validate and terraform plan before applying. Major provider releases deprecate arguments regularly; AWS provider v4.0, for example, deprecated many aws_s3_bucket arguments in favor of standalone resources. Never terraform apply AI output without reviewing the plan.
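Deprecation warnings are easy to miss in terminal output, so it can help to pull them out of the machine-readable report. The sketch below assumes the report comes from `terraform validate -json`, whose `diagnostics` entries carry `severity`, `summary`, and `detail` fields. The sample JSON is illustrative, not real output, and `deprecation_warnings` is a hypothetical helper name.

```python
import json
from typing import List

# Illustrative sample only; real input would come from `terraform validate -json`.
SAMPLE_OUTPUT = """
{
  "valid": true,
  "error_count": 0,
  "warning_count": 1,
  "diagnostics": [
    {
      "severity": "warning",
      "summary": "Argument is deprecated",
      "detail": "Use the aws_s3_bucket_acl resource instead"
    }
  ]
}
"""


def deprecation_warnings(validate_json: str) -> List[str]:
    """Collect warning diagnostics whose summary or detail mentions deprecation."""
    report = json.loads(validate_json)
    found = []
    for diag in report.get("diagnostics", []):
        text = f"{diag.get('summary', '')}: {diag.get('detail', '')}"
        if diag.get("severity") == "warning" and "deprecated" in text.lower():
            found.append(text)
    return found


if __name__ == "__main__":
    for warning in deprecation_warnings(SAMPLE_OUTPUT):
        print(warning)
```

A check like this could fail a CI job when the list is non-empty, which turns "always check the provider docs" into something enforced rather than remembered.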

Workflow 3: ChatGPT + Codex Pipeline

A power-user pattern using both tools in sequence:

ChatGPT (design): "What's the best approach to add rate limiting to our FastAPI services? Compare slowapi vs custom middleware."
ChatGPT (detail): "Design the implementation using slowapi with Redis backend. What config do we need?"
Codex (implement): "Implement rate limiting across all API services using slowapi. Config in services/shared/rate_limit.py. Add tests."
ChatGPT (review): Upload the Codex PR diff and ask "Review this against our design. Any gaps?"

This plays to each tool's strength: ChatGPT's conversational style for design, and Codex's agentic execution for implementation.

Workflow 4: Incident Response with AI

During the incident (Claude Code in terminal):

You: I'm seeing 503 errors on the /api/orders endpoint.
     Help me investigate.

Claude Code:
1. Runs: kubectl get pods -n app | grep order
2. Sees: order-api-7d8f9c6b4-xk2lm  0/1  CrashLoopBackOff
3. Runs: kubectl logs order-api-7d8f9c6b4-xk2lm -n app --previous
4. Identifies: "PAYMENT_GATEWAY_URL not set" in traceback
5. Runs: kubectl get configmap order-api-config -n app -o yaml
6. Shows you the missing env var
7. Suggests the fix
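The check in steps 5 and 6 can also be scripted so it runs before a deploy instead of during an incident. This is a minimal sketch, assuming the service reads its settings from the ConfigMap's `data` keys. The sample ConfigMap, the `REQUIRED_KEYS` list, and the `missing_keys` helper are all hypothetical; real input would come from `kubectl get configmap ... -o json`.

```python
import json
from typing import Iterable, List

# Illustrative ConfigMap, shaped like `kubectl get configmap ... -o json` output.
SAMPLE_CONFIGMAP = """
{
  "kind": "ConfigMap",
  "metadata": {"name": "order-api-config"},
  "data": {
    "DATABASE_URL": "postgres://db.internal:5432/orders",
    "LOG_LEVEL": "info"
  }
}
"""

# Hypothetical list of settings the service needs at startup.
REQUIRED_KEYS = ["DATABASE_URL", "LOG_LEVEL", "PAYMENT_GATEWAY_URL"]


def missing_keys(configmap_json: str, required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent from the ConfigMap's data."""
    data = json.loads(configmap_json).get("data") or {}
    return [key for key in required if key not in data]


if __name__ == "__main__":
    print(missing_keys(SAMPLE_CONFIGMAP, REQUIRED_KEYS))  # ['PAYMENT_GATEWAY_URL']
```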

After the incident (ChatGPT for postmortem):

"Write a blameless postmortem. Facts:
- Service: order-api
- Duration: 45 min (09:30-10:15 UTC)
- Impact: 30% of orders failed
- Root cause: missing env var after deploy
- Detection: Datadog alert on error rate
- Resolution: added env var to ConfigMap, restarted pods

Structure: Summary, Timeline, Root Cause, Impact,
What Went Well, What Went Wrong, Action Items (with priority)"
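If you write postmortems often, the prompt above can be assembled from structured facts so the duration is computed rather than typed by hand. This is a sketch only: the `facts` dictionary keys and both function names are hypothetical, and the timestamps are assumed to fall on the same day.

```python
from datetime import datetime
from typing import Dict


def duration_minutes(start: str, end: str) -> int:
    """Minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def build_postmortem_prompt(facts: Dict[str, str]) -> str:
    """Render incident facts into the postmortem prompt format."""
    minutes = duration_minutes(facts["start"], facts["end"])
    lines = [
        "Write a blameless postmortem. Facts:",
        f"- Service: {facts['service']}",
        f"- Duration: {minutes} min ({facts['start']}-{facts['end']} UTC)",
        f"- Impact: {facts['impact']}",
        f"- Root cause: {facts['root_cause']}",
        f"- Detection: {facts['detection']}",
        f"- Resolution: {facts['resolution']}",
        "",
        "Structure: Summary, Timeline, Root Cause, Impact,",
        "What Went Well, What Went Wrong, Action Items (with priority)",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_postmortem_prompt({
        "service": "order-api",
        "start": "09:30",
        "end": "10:15",
        "impact": "30% of orders failed",
        "root_cause": "missing env var after deploy",
        "detection": "Datadog alert on error rate",
        "resolution": "added env var to ConfigMap, restarted pods",
    }))
```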

Workflow 5: Writing Ansible Playbooks with ChatGPT

Prompt: "Write an Ansible playbook that bootstraps Ubuntu 22.04:
- Updates all packages
- Installs Docker CE from official repo
- Adds deploy user to docker group
- Configures UFW: allow SSH (22) and HTTPS (443) only
- Sets up unattended-upgrades for security patches

Make it idempotent. Use handlers for service restarts."

Then validate:

ansible-lint playbooks/bootstrap.yml
ansible-playbook --check -i inventory playbooks/bootstrap.yml

Default trap: AI tools confidently generate plausible-looking Kubernetes YAML that uses wrong API versions (e.g., extensions/v1beta1 for Ingress, which was removed in K8s 1.22). Always validate generated manifests against your cluster version: kubectl apply --dry-run=server -f manifest.yaml catches version mismatches that --dry-run=client misses.
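A quick offline pre-check can catch the most common removed API versions before the manifest ever reaches a cluster. The sketch below is a rough text scan, not a YAML parser, and its `REMOVED_APIS` table is a deliberately partial list. The `find_removed_apis` name is hypothetical, and the scan complements the server-side dry run rather than replacing it.

```python
import re
from typing import List, Tuple

# (apiVersion, kind) -> Kubernetes release in which that API was removed.
# Partial list for illustration; consult the deprecation guide for the rest.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("networking.k8s.io/v1beta1", "Ingress"): "1.22",
    ("extensions/v1beta1", "Deployment"): "1.16",
    ("batch/v1beta1", "CronJob"): "1.25",
    ("policy/v1beta1", "PodDisruptionBudget"): "1.25",
}

# Match only top-level (unindented) fields, with optional quotes around the value.
_API_RE = re.compile(r"^apiVersion:[ \t]*[\"']?([^\s\"'#]+)", re.MULTILINE)
_KIND_RE = re.compile(r"^kind:[ \t]*[\"']?([^\s\"'#]+)", re.MULTILINE)


def find_removed_apis(manifest: str) -> List[Tuple[str, str, str]]:
    """Return (apiVersion, kind, removed_in) for each document using a removed API."""
    findings = []
    # Split a multi-document YAML stream on '---' separator lines.
    for doc in re.split(r"^---[ \t]*$", manifest, flags=re.MULTILINE):
        api = _API_RE.search(doc)
        kind = _KIND_RE.search(doc)
        if api and kind:
            removed_in = REMOVED_APIS.get((api.group(1), kind.group(1)))
            if removed_in:
                findings.append((api.group(1), kind.group(1), removed_in))
    return findings


SAMPLE_MANIFEST = """\
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: order-api
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-api
"""

if __name__ == "__main__":
    for api_version, kind, removed_in in find_removed_apis(SAMPLE_MANIFEST):
        print(f"{kind} uses {api_version}, removed in Kubernetes {removed_in}")
```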

Workflow 6: PromQL Query Writing

Prompt: "Write PromQL queries for a FastAPI application:
1. Request rate by endpoint (5m window)
2. p95 latency by endpoint
3. Alert expression: error rate > 5% over 10 minutes
4. Container memory usage as percentage of limit

Use histogram_quantile for latency. Assume
http_request_duration_seconds_bucket metric."

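This saves significant time compared with debugging PromQL syntax by hand, but check the generated queries against the metric and label names your services actually export.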
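For reference, the kind of answer this prompt should produce looks roughly like the queries below, kept as templates so the label names stay adjustable. Several details are assumptions rather than facts about your setup: the `endpoint` and `status` label names, using the histogram's `_count` series for request and error rates, and the cAdvisor metrics `container_memory_working_set_bytes` and `container_spec_memory_limit_bytes` for memory. The `build_queries` helper is hypothetical.

```python
from string import Template
from typing import Dict

# PromQL templates; $endpoint and $status are placeholders for label names.
QUERY_TEMPLATES = {
    # 1. Request rate by endpoint over a 5m window.
    "request_rate": Template(
        "sum by ($endpoint) (rate(http_request_duration_seconds_count[5m]))"
    ),
    # 2. p95 latency by endpoint; the le label must survive the aggregation.
    "p95_latency": Template(
        "histogram_quantile(0.95, "
        "sum by (le, $endpoint) (rate(http_request_duration_seconds_bucket[5m])))"
    ),
    # 3. Alert expression: share of 5xx responses above 5% over 10 minutes.
    "error_rate_alert": Template(
        'sum(rate(http_request_duration_seconds_count{$status=~"5.."}[10m])) '
        "/ sum(rate(http_request_duration_seconds_count[10m])) > 0.05"
    ),
    # 4. Memory as a percentage of the limit, skipping containers with no limit.
    "memory_pct_of_limit": Template(
        '100 * container_memory_working_set_bytes{container!=""} '
        '/ (container_spec_memory_limit_bytes{container!=""} > 0)'
    ),
}


def build_queries(endpoint_label: str = "endpoint", status_label: str = "status") -> Dict[str, str]:
    """Fill the label-name placeholders and return ready-to-paste PromQL."""
    return {
        name: template.substitute(endpoint=endpoint_label, status=status_label)
        for name, template in QUERY_TEMPLATES.items()
    }


if __name__ == "__main__":
    for name, query in build_queries().items():
        print(f"{name}:\n  {query}")
```

Pass your own label names, for example `build_queries(endpoint_label="handler")`, if your exporter labels routes differently.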

Workflow 7: Claude Code for Learning a Codebase

You: I'm new to this repo. Read the Helm chart and explain
     how the deployment pipeline works, from PR to production.

Claude Code:
1. Reads the Helm chart templates, values files, CI workflows
2. Reads deploy scripts and ArgoCD config
3. Gives you a narrative explanation of the full pipeline
4. Points out specific files for each stage

This is one of the highest-value uses of Claude Code - it can read dozens of files and synthesize an explanation tailored to your questions.

Remember: as a rule of thumb for tool selection, use conversational AI (ChatGPT, Claude chat) for design decisions, architecture review, and learning. Use agentic AI (Claude Code, Codex) for implementation across files, running commands, and validating changes. The conversational tool thinks with you; the agentic tool works for you.

Workflow 8: Custom GPTs for Team Workflows

Build specialized GPTs for your team:

IaC Reviewer GPT:
- Upload: your Terraform conventions doc, module examples, tagging standards
- Instructions: "Review Terraform code against our standards. Flag missing tags, overly permissive security groups, resources without lifecycle rules."

Incident Commander GPT:
- Upload: runbooks, escalation matrix, postmortem template
- Instructions: "Help triage incidents. Ask clarifying questions about symptoms. Suggest diagnostic commands. Draft comms for stakeholders."

PR Reviewer GPT:
- Upload: coding standards, security checklist, common anti-patterns doc
- Instructions: "Review code diffs. Focus on security (OWASP Top 10), error handling, and adherence to our conventions."