Ops Archaeology: The Service That Won't Start
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L1 | Estimated time: 15 min | Domains: Linux, systemd, Ansible, File Permissions
Artifact 1: CLI Output
$ systemctl status data-exporter.service
● data-exporter.service - Nightly Data Export Agent
     Loaded: loaded (/etc/systemd/system/data-exporter.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Wed 2024-10-02 02:00:03 UTC; 8h ago
    Process: 48291 ExecStart=/opt/data-exporter/bin/export.sh (code=exited, status=126)
   Main PID: 48291 (code=exited, status=126)
        CPU: 4ms
Oct 02 02:00:03 prod-batch-01 systemd[1]: Started data-exporter.service - Nightly Data Export Agent.
Oct 02 02:00:03 prod-batch-01 systemd[1]: data-exporter.service: Main process exited, code=exited, status=126/n/a
Oct 02 02:00:03 prod-batch-01 systemd[1]: data-exporter.service: Failed with result 'exit-code'.
$ ls -la /opt/data-exporter/bin/export.sh
-rw-r--r-- 1 exporter exporter 2841 Oct 01 18:45 /opt/data-exporter/bin/export.sh
Artifact 2: Metrics
# Node exporter metrics for prod-batch-01
# HELP node_systemd_unit_state Systemd unit state (1 = unit is in the state named by the label, 0 = it is not)
node_systemd_unit_state{name="data-exporter.service",state="failed"} 1
node_systemd_unit_state{name="data-exporter.service",state="active"} 0
# Last successful export timestamp (custom metric pushed by export.sh)
data_export_last_success_timestamp_seconds 1727746803
# ^ That's 2024-10-01T01:40:03Z, roughly 24 hours before the failed run
# Export row count from last successful run
data_export_rows_total{table="transactions"} 1482937
data_export_rows_total{table="user_events"} 3291044
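
The epoch value in data_export_last_success_timestamp_seconds can be converted to UTC and compared with the timestamps in the other artifacts. A minimal sketch, assuming GNU date from coreutils; the function name epoch_to_utc is a hypothetical helper, not part of the system under investigation:

```shell
# Hypothetical helper: print an epoch-seconds value as an ISO 8601 UTC timestamp.
# Assumes GNU date (coreutils); BSD/macOS date takes `-r <epoch>` instead of `-d @<epoch>`.
epoch_to_utc() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# Value copied from the data_export_last_success_timestamp_seconds metric
epoch_to_utc 1727746803
```

Converting the value yourself, rather than trusting an annotation, is a cheap way to confirm when the last successful run actually finished.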
Artifact 3: Infrastructure Code
# From: ansible/playbooks/deploy-exporter.yml
- name: Deploy data exporter binary
  ansible.builtin.copy:
    src: files/export.sh
    dest: /opt/data-exporter/bin/export.sh
    owner: exporter
    group: exporter
    mode: "0644"
  notify: restart data-exporter

- name: Ensure systemd unit
  ansible.builtin.template:
    src: templates/data-exporter.service.j2
    dest: /etc/systemd/system/data-exporter.service
  notify:
    - reload systemd
    - restart data-exporter
Artifact 4: Log Lines
[2024-10-02T02:00:03Z] systemd | data-exporter.service: Main process exited, code=exited, status=126/n/a
[2024-10-02T02:00:01Z] crond | (exporter) CMD (/opt/data-exporter/bin/cleanup.sh)
[2024-10-01T18:45:22Z] ansible | TASK [Deploy data exporter binary] changed: [prod-batch-01]
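
The log lines above are listed newest first. To reconstruct the timeline it can help to put them in chronological order. A minimal sketch, assuming every line starts with a bracketed ISO 8601 UTC timestamp in the same fixed-width format, so a plain text sort is also a time sort; sort_log_lines is a hypothetical helper name:

```shell
# Hypothetical helper: order log lines oldest-first.
# Fixed-width ISO 8601 UTC timestamps sort chronologically as plain text;
# LC_ALL=C forces byte-wise comparison regardless of the system locale.
sort_log_lines() {
  LC_ALL=C sort
}

# Usage with the three lines from this artifact
sort_log_lines <<'EOF'
[2024-10-02T02:00:03Z] systemd | data-exporter.service: Main process exited, code=exited, status=126/n/a
[2024-10-02T02:00:01Z] crond | (exporter) CMD (/opt/data-exporter/bin/cleanup.sh)
[2024-10-01T18:45:22Z] ansible | TASK [Deploy data exporter binary] changed: [prod-batch-01]
EOF
```

Reading the events oldest-first makes it easier to see which change preceded the first failure.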
Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?
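
For the "what would you check first" step, a read-only pass that gathers facts before changing anything is one reasonable starting point. The sketch below assumes GNU stat; describe_path is a hypothetical helper, and the commented commands are standard systemctl and journalctl invocations that need to run on the affected host:

```shell
# Hypothetical helper: summarize mode, ownership, and executability of a path.
# Read-only; assumes GNU stat (%a = octal mode, %A = symbolic mode, %U/%G = owner/group).
describe_path() {
  local path="$1"
  if [ ! -e "$path" ]; then
    printf '%s: missing\n' "$path"
    return 1
  fi
  stat -c '%n: mode=%a (%A) owner=%U:%G' "$path"
  if [ -x "$path" ]; then
    printf '%s: executable by current user\n' "$path"
  else
    printf '%s: NOT executable by current user\n' "$path"
  fi
}

# Example evidence-gathering pass on the host (commented out so the sketch
# runs anywhere):
#   describe_path /opt/data-exporter/bin/export.sh
#   systemctl cat data-exporter.service
#   journalctl -u data-exporter.service --since "2024-10-01" --no-pager
```

Keeping the first pass read-only means the evidence is captured before any fix changes the state of the host.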