
Ops Archaeology: The Service That Won't Start

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L1
Estimated time: 15 min
Domains: Linux, systemd, Ansible, File Permissions


Artifact 1: CLI Output

$ systemctl status data-exporter.service
× data-exporter.service - Nightly Data Export Agent
     Loaded: loaded (/etc/systemd/system/data-exporter.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Wed 2024-10-02 02:00:03 UTC; 8h ago
    Process: 48291 ExecStart=/opt/data-exporter/bin/export.sh (code=exited, status=126)
   Main PID: 48291 (code=exited, status=126)
        CPU: 4ms

Oct 02 02:00:03 prod-batch-01 systemd[1]: Started data-exporter.service - Nightly Data Export Agent.
Oct 02 02:00:03 prod-batch-01 systemd[1]: data-exporter.service: Main process exited, code=exited, status=126/n/a
Oct 02 02:00:03 prod-batch-01 systemd[1]: data-exporter.service: Failed with result 'exit-code'.

$ ls -la /opt/data-exporter/bin/export.sh
-rw-r--r-- 1 exporter exporter 2841 Oct 01 18:45 /opt/data-exporter/bin/export.sh

Artifact 2: Metrics

# Node exporter metrics for prod-batch-01

# HELP node_systemd_unit_state Systemd unit state (1=active, 0=inactive)
node_systemd_unit_state{name="data-exporter.service",state="failed"} 1
node_systemd_unit_state{name="data-exporter.service",state="active"} 0

# Last successful export timestamp (custom metric pushed by export.sh)
data_export_last_success_timestamp_seconds 1727748003
# ^ That's 2024-10-01T02:00:03Z, 24 hours before the failed run

# Export row count from last successful run
data_export_rows_total{table="transactions"} 1482937
data_export_rows_total{table="user_events"} 3291044
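
The custom metric above is a Unix epoch value, while the other artifacts use calendar timestamps, so lining them up means converting between the two. The following is a minimal sketch of that conversion, assuming GNU `date` (the `-d` option is not portable to BSD or macOS). The function names `epoch_to_utc` and `hours_between` are hypothetical helpers, not part of the exporter.

```shell
# Hypothetical helpers for reading epoch-valued metrics. Assumes GNU date.

# Print an epoch value (in seconds) as an ISO 8601 UTC timestamp.
epoch_to_utc() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# Print the whole hours elapsed between an epoch value ($1)
# and an ISO 8601 timestamp ($2). The body runs in a subshell,
# so its variables do not leak into the caller.
hours_between() (
  start_epoch=$1
  end_epoch=$(date -u -d "$2" +%s)
  echo $(( (end_epoch - start_epoch) / 3600 ))
)
```

Passing the `data_export_last_success_timestamp_seconds` value to `epoch_to_utc` gives the time of the last good run. Calling `hours_between` with that value and `2024-10-02T02:00:03Z` gives the gap between that run and the failed start.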

Artifact 3: Infrastructure Code

# From: ansible/playbooks/deploy-exporter.yml
- name: Deploy data exporter binary
  ansible.builtin.copy:
    src: files/export.sh
    dest: /opt/data-exporter/bin/export.sh
    owner: exporter
    group: exporter
    mode: "0644"
  notify: restart data-exporter

- name: Ensure systemd unit
  ansible.builtin.template:
    src: templates/data-exporter.service.j2
    dest: /etc/systemd/system/data-exporter.service
  notify:
    - reload systemd
    - restart data-exporter

Artifact 4: Log Lines

[2024-10-02T02:00:03Z] systemd    | data-exporter.service: Main process exited, code=exited, status=126/n/a
[2024-10-02T02:00:01Z] crond      | (exporter) CMD (/opt/data-exporter/bin/cleanup.sh)
[2024-10-01T18:45:22Z] ansible    | TASK [Deploy data exporter binary] changed: [prod-batch-01]
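
The log lines above are not in chronological order. Each one starts with a bracketed ISO 8601 UTC timestamp, so a byte-wise sort is enough to rebuild the timeline. This is a minimal sketch, assuming every line follows that format. The function name `timeline` and the file name `artifact4.log` are hypothetical.

```shell
# Sort log lines oldest-first.
# ISO 8601 UTC timestamps in a fixed format order correctly as plain
# text, so no date parsing is needed. LC_ALL=C forces byte-wise
# comparison. With no file arguments, sort reads standard input.
timeline() {
  LC_ALL=C sort -- "$@"
}
```

Usage, assuming the excerpt has been saved to a file: `timeline artifact4.log`. Reading events oldest-first makes it easier to see what changed on the host before the failure.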

Your Mission

  1. Reconstruct: What does this system do? What are its components and purpose?
  2. Diagnose: What is currently broken or degraded, and why?
  3. Propose: What would you do to fix it? What would you check first?