Level: L1: Foundations | Topics: AI Tools for DevOps | Domain: DevOps & Tooling
AI-Assisted DevOps Cookbook
Concrete recipes for common DevOps tasks using AI tools (ChatGPT, Codex, Claude Code). Each recipe gives the suggested tool, a prompt you can adapt, and, where relevant, what to watch out for.
Recipes
1. Generate a Terraform Module From Scratch
Tool: ChatGPT or Claude Code
Time saved: 30-60 min
Prompt:
Generate a Terraform module for an AWS RDS PostgreSQL instance.
Requirements:
- PostgreSQL 15, db.t3.medium (parameterized)
- Multi-AZ for production, single-AZ for dev (variable toggle)
- Encrypted at rest with KMS (key ARN as variable)
- Automated backups, 7-day retention
- Private subnet group (subnet IDs as variable)
- Security group allowing inbound 5432 from a provided CIDR list
- Parameter group with: log_statement=all, log_min_duration_statement=1000
- Outputs: endpoint, port, security group ID
Variables should have descriptions and sensible defaults where appropriate.
Include a versions.tf with required_providers (aws >= 5.0).
Watch out for:
- AI may use deprecated resource arguments - check against current AWS provider docs
- Security groups may be too permissive - verify CIDR blocks and egress rules
- Check that deletion_protection is enabled by default
- Verify the parameter group family matches the engine version
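Several of these checks can be automated once the module is wired into a root configuration. The sketch below is a hypothetical helper (check_rds_plan is not part of Terraform or any library) that reads the JSON produced by terraform show -json and flags a missing deletion_protection, unencrypted storage, backup retention under 7 days, and a parameter group family that doesn't match the engine's major version. It assumes the postgresNN family naming; adjust for other engines.

```python
import json
import sys


def check_rds_plan(plan: dict) -> list[str]:
    """Hypothetical helper: flag risky RDS settings in a `terraform show -json` plan."""
    findings = []
    engine_majors = set()
    param_groups = []

    for rc in plan.get("resource_changes", []):
        # "after" holds the planned attribute values; it is null for deletions.
        after = (rc.get("change") or {}).get("after") or {}
        address = rc.get("address", "?")
        if rc.get("type") == "aws_db_instance":
            if not after.get("deletion_protection"):
                findings.append(f"{address}: deletion_protection is not enabled")
            if not after.get("storage_encrypted"):
                findings.append(f"{address}: storage_encrypted is not enabled")
            if (after.get("backup_retention_period") or 0) < 7:
                findings.append(f"{address}: backup_retention_period is below 7 days")
            # "15.4" -> "15"; assumes the postgresNN parameter group family naming
            major = str(after.get("engine_version") or "").split(".")[0]
            if major:
                engine_majors.add(major)
        elif rc.get("type") == "aws_db_parameter_group":
            param_groups.append((address, after.get("family")))

    expected = {f"postgres{m}" for m in engine_majors}
    for address, family in param_groups:
        if expected and family not in expected:
            findings.append(
                f"{address}: family {family} does not match engine major version(s) {sorted(engine_majors)}"
            )
    return findings


def main(path: str) -> int:
    with open(path) as fh:
        findings = check_rds_plan(json.load(fh))
    for finding in findings:
        print(f"FINDING: {finding}")
    return 1 if findings else 0  # non-zero exit fails a CI step


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

Usage, with hypothetical file names: run terraform plan -out=plan.out, then terraform show -json plan.out > plan.json, then python check_rds_plan.py plan.json. Checking the rendered plan rather than the HCL means defaults and module variables are already resolved.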
2. Write a GitHub Actions Workflow
Tool: ChatGPT or Claude Code
Time saved: 20-40 min
Prompt:
Write a GitHub Actions workflow called "CI" that runs on push to main
and on pull requests. Steps:
1. Checkout code
2. Set up Python 3.11
3. Cache pip dependencies (hash requirements.txt for cache key)
4. Install dependencies from requirements.txt
5. Run ruff check .
6. Run ruff format --check .
7. Run pytest --cov=app --cov-fail-under=70 --tb=short -q
8. Upload coverage report as artifact
Use ubuntu-latest runner. Pin action versions to specific SHAs
for supply chain security (not @v4, use @<sha>).
Watch out for:
- AI rarely pins to SHAs by default - you'll likely need to follow up
- Check that cache paths match your actual pip cache location, or use the built-in cache: pip option of actions/setup-python instead
- Verify the workflow sets an explicit permissions: block with least privilege (e.g. contents: read)
- Make sure it doesn't use actions/checkout@v2 (old version)
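Because SHA pinning and permissions are easy to miss, a quick local check before opening the PR can help. This is a minimal regex-based sketch (audit_workflow is a hypothetical name, not a GitHub tool): it flags any uses: reference whose ref isn't a 40-character hex commit SHA, and warns when there's no top-level permissions: block. It doesn't parse YAML, so job-level permissions and unusual formatting aren't recognized.

```python
import re
import sys

# A full git commit SHA is 40 hex characters.
FULL_SHA = re.compile(r"[0-9a-f]{40}")
# Matches "uses: owner/repo@ref" with or without a leading "- " and optional quotes.
USES = re.compile(r"""^\s*(?:-\s*)?uses:\s*['"]?([^'"\s#]+)""")


def audit_workflow(text: str) -> list[str]:
    """Hypothetical helper: report unpinned actions and a missing top-level permissions block."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = USES.match(line)
        if not match:
            continue
        ref = match.group(1)
        # Local actions and docker:// images are not pinned with @<sha>.
        if ref.startswith(("./", "docker://")):
            continue
        version = ref.partition("@")[2]
        if not FULL_SHA.fullmatch(version):
            findings.append(f"line {lineno}: {ref} is not pinned to a full commit SHA")
    # Only checks the top level; job-level permissions are not detected.
    if not re.search(r"^permissions:", text, re.MULTILINE):
        findings.append("no top-level permissions: block; default token permissions may be broader than needed")
    return findings


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as fh:
            for finding in audit_workflow(fh.read()):
                print(f"{path}: {finding}")
```

Running it over .github/workflows/*.yml after each AI revision gives you a concrete list to paste back as the follow-up prompt.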
3. Debug a Kubernetes CrashLoopBackOff
Tool: Claude Code (best - can run kubectl) or ChatGPT (paste logs)
Claude Code prompt:
My pod app/order-api is in CrashLoopBackOff. Help me debug it.
Run kubectl commands to check:
1. Pod events and status
2. Container logs (current and previous)
3. Resource limits vs actual usage
4. ConfigMap/Secret mounts
5. Liveness/readiness probe config
Diagnose the issue and suggest a fix.
ChatGPT prompt (if you can't use Claude Code):
My pod is in CrashLoopBackOff. Here's the output of:
kubectl describe pod order-api-7d8f9c6b4-xk2lm:
[paste output]
kubectl logs order-api-7d8f9c6b4-xk2lm --previous:
[paste output]
What's the root cause and how do I fix it?
Watch out for:
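- Review every kubectl command before letting it run against production: read-only verbs (get, describe, logs) are safe, but delete, edit, apply, or scale change the cluster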
- AI may suggest kubectl delete pod as a fix - that's a band-aid, not a fix
- Check if the issue is in the app code, the config, or the infrastructure
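If you're taking the ChatGPT route, or want the investigation to stay read-only, you can collect the diagnostics yourself. A minimal sketch, assuming "app/order-api" means namespace app; the function names are hypothetical. It refuses any kubectl verb outside an allowlist of read-only commands (get, describe, logs, top). Note that kubectl top requires metrics-server in the cluster.

```python
import shlex
import subprocess

# Verbs that only read cluster state; anything else is refused.
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}


def diagnostic_commands(namespace: str, pod: str) -> list[list[str]]:
    """Hypothetical helper: the read-only checks listed in the prompt above."""
    base = ["kubectl", "-n", namespace]
    return [
        base + ["describe", "pod", pod],              # events, status, probe config
        base + ["logs", pod],                         # current container logs
        base + ["logs", pod, "--previous"],           # logs from the crashed container
        base + ["get", "events", "--field-selector", f"involvedObject.name={pod}"],
        base + ["get", "pod", pod, "-o", "yaml"],     # limits, ConfigMap/Secret mounts
        base + ["top", "pod", pod],                   # actual usage (needs metrics-server)
    ]


def is_read_only(cmd: list[str]) -> bool:
    # Layout is: kubectl -n <namespace> <verb> ...
    return len(cmd) > 3 and cmd[0] == "kubectl" and cmd[3] in READ_ONLY_VERBS


def collect(namespace: str, pod: str) -> str:
    """Run each diagnostic and return one paste-ready text blob."""
    sections = []
    for cmd in diagnostic_commands(namespace, pod):
        if not is_read_only(cmd):
            raise ValueError(f"refusing to run non-read-only command: {shlex.join(cmd)}")
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        sections.append(f"$ {shlex.join(cmd)}\n{result.stdout}{result.stderr}")
    return "\n".join(sections)
```

Calling print(collect("app", "order-api-7d8f9c6b4-xk2lm")) produces output you can paste in place of the [paste output] placeholders.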
4. Convert Docker Compose to Kubernetes Manifests
Tool: ChatGPT or Claude Code
Time saved: 45-90 min
Prompt:
Convert this docker-compose.yml to Kubernetes manifests.
Create separate YAML files for each resource.
For each service, generate:
- Deployment with resource requests/limits
- Service (ClusterIP for internal, LoadBalancer for external)
- ConfigMap for environment variables
- PersistentVolumeClaim for any volumes
Don't use kompose - write clean, idiomatic Kubernetes YAML.
Add standard labels (app, component, version).
Set securityContext: runAsNonRoot where possible.
[paste docker-compose.yml]
Watch out for:
- Volume mounts need real StorageClass names for your cluster
- AI won't know your ingress controller - you'll need to adapt Services
- Environment variables with secrets should use Secret, not ConfigMap
- Resource limits are AI guesses - adjust based on actual usage data
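Before trusting the generated manifests, it's worth checking which compose environment variables belong in a Secret rather than a ConfigMap. The sketch below uses a simple name-based heuristic (the SECRET_HINTS pattern and both helper names are assumptions, not a standard). Values such as connection URLs can embed credentials without a telltale name, so still review the ConfigMap side by hand.

```python
import json
import re

# Heuristic only: variable names containing these fragments are treated as secrets.
SECRET_HINTS = re.compile(r"PASSWORD|PASSWD|SECRET|TOKEN|API_KEY|PRIVATE_KEY|CREDENTIAL", re.IGNORECASE)


def split_env(env: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Hypothetical helper: split a compose service's environment into (configmap_data, secret_data)."""
    config, secret = {}, {}
    for name, value in env.items():
        target = secret if SECRET_HINTS.search(name) else config
        target[name] = value
    return config, secret


def configmap_yaml(name: str, data: dict[str, str]) -> str:
    """Render a minimal ConfigMap; json.dumps yields a valid YAML double-quoted string."""
    lines = ["apiVersion: v1", "kind: ConfigMap", "metadata:", f"  name: {name}", "data:"]
    lines += [f"  {key}: {json.dumps(str(value))}" for key, value in sorted(data.items())]
    return "\n".join(lines) + "\n"
```

Comparing this split against what the AI produced is a quick way to catch a password that landed in a ConfigMap.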
5. Write an Incident Postmortem
Tool: ChatGPT
Time saved: 30-45 min
Prompt:
Write a blameless postmortem for the following incident.
Use our template: Summary, Detection, Timeline, Root Cause,
Impact, What Went Well, What Went Wrong, Action Items.
Facts:
- Date: 2025-11-15, 09:30-10:45 UTC
- Service: checkout-api (EKS, us-east-1)
- Symptoms: HTTP 500 errors on /api/checkout, 30% failure rate
- Detection: Datadog alert on error rate > 5% fired at 09:35
- On-call acknowledged at 09:38, began investigating
- 09:42: Identified correlation with deploy at 09:28
- 09:50: Attempted rollback, but ArgoCD sync was stuck
- 10:00: Manually reverted the Helm release with helm rollback
- 10:15: Error rate returned to baseline
- 10:45: Confirmed all queued orders were processed
- Root cause: New env var PAYMENT_GATEWAY_URL was required
  but not added to the production ConfigMap, causing a nil pointer
  dereference in payment client initialization
- Customer impact: ~450 failed checkout attempts, $12K estimated
lost revenue
Action items should be specific, assigned (use placeholder names),
and have priority (P1/P2/P3).
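Models sometimes miscalculate durations, so it can help to compute the headline timings yourself and include them with the facts. A minimal sketch, assuming all timestamps fall on the same UTC day; the metric names are common postmortem conventions rather than part of the template above, and "mitigated" is taken to be 10:15, when the error rate returned to baseline.

```python
from datetime import datetime

# Timestamps from the facts above (UTC; assumed to be the same day).
TIMELINE = {
    "impact_start": "09:30",
    "alert_fired": "09:35",
    "acknowledged": "09:38",
    "cause_identified": "09:42",
    "mitigated": "10:15",   # error rate back to baseline
    "resolved": "10:45",    # queued orders confirmed processed
}


def minutes_between(start: str, end: str) -> int:
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def incident_metrics(t: dict[str, str]) -> dict[str, int]:
    start = t["impact_start"]
    return {
        "time_to_detect": minutes_between(start, t["alert_fired"]),
        "time_to_acknowledge": minutes_between(t["alert_fired"], t["acknowledged"]),
        "time_to_identify": minutes_between(start, t["cause_identified"]),
        "time_to_mitigate": minutes_between(start, t["mitigated"]),
        "total_duration": minutes_between(start, t["resolved"]),
    }


if __name__ == "__main__":
    print(incident_metrics(TIMELINE))
    # -> {'time_to_detect': 5, 'time_to_acknowledge': 3, 'time_to_identify': 12,
    #     'time_to_mitigate': 45, 'total_duration': 75}
```

Adding these numbers as extra facts lets the model focus on narrative and action items instead of arithmetic.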
6. Create a Helm Values Diff Report
Tool: Claude Code
Time saved: 15-20 min
Prompt:
Compare values-dev.yaml, values-staging.yaml, and values-prod.yaml
in devops/helm/. Create a summary table showing the differences
across environments. Flag any values that look wrong:
- Prod with fewer replicas than staging
- Dev with production-like resource limits
- Missing values in any environment
- Security settings that differ unexpectedly
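To sanity-check the table Claude Code produces, or to generate a deterministic first draft, the comparison itself is simple: flatten each values file into dotted keys and compare. A minimal sketch that works on already-loaded dicts (in practice you would load each file with PyYAML's yaml.safe_load); it assumes the environment names dev, staging, and prod, and that replicas live under replicaCount, the key helm create scaffolds. Function names are hypothetical.

```python
MISSING = "<missing>"


def flatten(values: dict, prefix: str = "") -> dict:
    """{"a": {"b": 1}} -> {"a.b": 1}; non-empty dicts are expanded, everything else is a leaf."""
    flat = {}
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_envs(envs: dict[str, dict]) -> tuple[list[tuple], list[str]]:
    """Hypothetical helper: (rows of differing keys, flags) for {"dev": ..., "staging": ..., "prod": ...}."""
    flat = {env: flatten(values) for env, values in envs.items()}
    rows, flags = [], []
    for key in sorted(set().union(*flat.values())):
        values = [flat[env].get(key, MISSING) for env in envs]
        if len({repr(v) for v in values}) > 1:
            rows.append((key, *values))
        missing = [env for env in envs if key not in flat[env]]
        if missing:
            flags.append(f"{key}: missing in {', '.join(missing)}")

    # Assumes the chart uses replicaCount (helm create's default key).
    prod = flat.get("prod", {}).get("replicaCount")
    staging = flat.get("staging", {}).get("replicaCount")
    if isinstance(prod, int) and isinstance(staging, int) and prod < staging:
        flags.append(f"replicaCount: prod ({prod}) < staging ({staging})")
    return rows, flags


def render_table(envs: list[str], rows: list[tuple]) -> str:
    header = "| key | " + " | ".join(envs) + " |"
    divider = "|" + "---|" * (len(envs) + 1)
    body = ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join([header, divider, *body])
```

The other two checks in the prompt (production-like limits in dev, unexpected security differences) need judgment, so they are left to the review rather than hard-coded.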
7. Generate Ansible Role From Existing Commands
Tool: ChatGPT or Claude Code
Time saved: 30-60 min
Prompt:
I currently run these commands manually to set up a new app server.
Convert this into an Ansible role called 'app-server'.
Commands I run:
apt update && apt upgrade -y
apt install -y nginx python3.11 python3.11-venv certbot
useradd -m -s /bin/bash deploy
mkdir -p /opt/app /var/log/app
chown deploy:deploy /opt/app /var/log/app
cp nginx.conf /etc/nginx/sites-available/app
ln -s /etc/nginx/sites-available/app /etc/nginx/sites-enabled/
systemctl enable --now nginx
ufw allow 22/tcp
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
Make it idempotent. Use handlers for service restarts.
Template the nginx config with variables for server_name and upstream_port.
Add a defaults/main.yml with sensible defaults.
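The reason to ask for idempotency is visible in the command list: ln -s fails on the second run because the link already exists. Ansible's ansible.builtin.file module with state: link avoids that by comparing the current state to the desired one and only acting (and reporting "changed") when they differ. The Python sketch below illustrates that check-then-converge pattern for the symlink step only; ensure_symlink and demo are hypothetical helpers for illustration, not something the generated role should contain.

```python
import os
import tempfile
from pathlib import Path


def ensure_symlink(target: Path, link: Path) -> bool:
    """Hypothetical helper: make `link` point at `target`; return True only if something changed."""
    if link.is_symlink() and os.readlink(link) == str(target):
        return False                      # already converged: nothing to do
    if link.is_symlink():
        link.unlink()                     # points somewhere else: replace it
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(target)
    return True


def demo() -> tuple[bool, bool]:
    """Run the step twice in a scratch directory, like re-running a playbook."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        available = root / "sites-available"
        enabled = root / "sites-enabled"
        available.mkdir()
        enabled.mkdir()
        target = available / "app"
        target.write_text("server { listen 80; }\n")
        first = ensure_symlink(target, enabled / "app")
        second = ensure_symlink(target, enabled / "app")
        return first, second


if __name__ == "__main__":
    print(demo())  # (True, False): changed on the first run, no-op on the second
```

When reviewing the generated role, the equivalent test is to run the playbook twice and confirm the second run reports no changes.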
8. Security Audit a Dockerfile
Tool: Claude Code or ChatGPT
Time saved: 15-30 min
Prompt:
Audit this Dockerfile for security issues. Check against:
1. Running as root
2. Using latest tags
3. Unnecessary packages installed
4. Secrets in build args or env
5. Missing health check
6. Excessive file permissions
7. Not using multi-stage build
8. Including dev dependencies in production image
9. Missing .dockerignore considerations
10. Base image CVE exposure
Rate each finding as Critical/High/Medium/Low.
Provide the fixed Dockerfile after your review.
[paste Dockerfile]
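A few of these ten checks are mechanical enough to script, which gives you a baseline to compare the AI's review against. The sketch below covers items 1, 2, 4, 5, and 7 with line-based heuristics; it doesn't handle line continuations or multi-variable ENV lines, and the severity labels are illustrative assumptions rather than a standard. audit_dockerfile is a hypothetical name.

```python
import re

# Heuristic: ARG/ENV names that suggest a secret is baked into the image.
SECRET_NAME = re.compile(r"PASSWORD|SECRET|TOKEN|API_KEY|PRIVATE_KEY", re.IGNORECASE)


def audit_dockerfile(text: str) -> list[tuple[str, str]]:
    """Hypothetical helper: return (severity, message) pairs; severities are illustrative."""
    findings = []
    stages = set()        # aliases from "FROM ... AS <name>"
    from_count = 0
    user = None           # USER of the current stage (the final stage at the end)
    has_healthcheck = False

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        instr, _, args = line.partition(" ")
        instr, args = instr.upper(), args.strip()

        if instr == "FROM":
            from_count += 1
            user = None  # USER does not carry over into a new stage
            tokens = [t for t in args.split() if not t.startswith("--")]  # skip --platform=...
            image = tokens[0] if tokens else ""
            name = image.rsplit("/", 1)[-1]  # ignore registry host:port when looking for a tag
            if image.lower() not in stages and image != "scratch" and "@" not in image:
                if ":" not in name or name.endswith(":latest"):
                    findings.append(("Medium", f"base image '{image}' is untagged or uses :latest"))
            if len(tokens) >= 3 and tokens[1].upper() == "AS":
                stages.add(tokens[2].lower())
        elif instr == "USER":
            user = args.split(":")[0] or None
        elif instr == "HEALTHCHECK":
            has_healthcheck = True
        elif instr in ("ARG", "ENV"):
            var = re.split(r"[=\s]", args, maxsplit=1)[0]
            if SECRET_NAME.search(var):
                findings.append(("Critical", f"{instr} {var} may bake a secret into the image"))

    if user in (None, "root", "0"):
        findings.append(("High", "final stage runs as root (no non-root USER)"))
    if not has_healthcheck:
        findings.append(("Low", "no HEALTHCHECK instruction"))
    if from_count == 1:
        findings.append(("Medium", "single-stage build: build tools may end up in the runtime image"))
    return findings
```

If the AI's review misses something this script catches, that's a signal to push back with a follow-up prompt.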
Tips for All Recipes
- Always review before applying: No recipe output should go straight to production
- Iterate: First pass is a starting point. Follow up with "now also handle X"
- Add your context: These prompts are templates - add your specific stack details
- Save what works: When a prompt gives great results, save it for reuse
- Version your prompts: As tools evolve, update your prompt templates
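One lightweight way to save and version prompts is to keep them in the repo as templates with explicit placeholders for your stack details. A minimal sketch using Python's string.Template; the PROMPTS registry, its keys, and the render helper are hypothetical. Template.substitute raises KeyError when a placeholder is left unfilled, which catches a template reused without adding your context.

```python
from string import Template
from typing import Optional

# Hypothetical in-repo prompt library, keyed by (name, version).
PROMPTS: dict[tuple[str, int], str] = {
    ("crashloop-debug", 1): "My pod $namespace/$pod is in CrashLoopBackOff. Help me debug it.",
    ("crashloop-debug", 2): (
        "My pod $namespace/$pod is in CrashLoopBackOff on $platform. "
        "Only run read-only kubectl commands. Diagnose the issue and suggest a fix."
    ),
}


def render(name: str, version: Optional[int] = None, **context: str) -> str:
    """Fill a saved prompt; defaults to the newest version of `name`."""
    versions = [v for (n, v) in PROMPTS if n == name]
    if not versions:
        raise KeyError(f"unknown prompt: {name}")
    chosen = version if version is not None else max(versions)
    # substitute() raises KeyError if any $placeholder is left unfilled.
    return Template(PROMPTS[(name, chosen)]).substitute(**context)
```

Adding a new version instead of editing in place keeps older prompts reproducible when you compare results across tool updates.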
Wiki Navigation
Prerequisites
- AI Tools for DevOps (Topic Pack, L1)
Related Content
- AI Tools for DevOps (Topic Pack, L1) — AI Tools for DevOps
- Generativeai Flashcards (CLI) (flashcard_deck, L1) — AI Tools for DevOps
- The Ops of AI/ML Workloads (Topic Pack, L2) — AI Tools for DevOps