Vendor Management & Escalation - Street-Level Ops

Quick Diagnosis Commands

Most of these are shell commands; the rest are vendor CLI equivalents. Run them before opening a case so the evidence is in hand before the vendor asks for it.

# Before opening a case: gather everything the vendor will ask for

# System information dump (Linux)
uname -a
cat /etc/os-release
dmidecode -s system-serial-number
dmidecode -s system-product-name

# Hardware serial numbers (Dell)
dmidecode -s system-serial-number
# Or: ipmitool fru print

# Collect logs with timestamps
journalctl --since "2 hours ago" > /tmp/syslog_last_2h.txt
dmesg -T > /tmp/dmesg.txt

# Network device show-tech equivalent
# Cisco: show tech-support | redirect flash:show_tech.txt
# Palo Alto: show system info; show running resource-monitor
# F5: qkview (generates diagnostic archive)

# Software versions
dpkg -l | grep vendor-package
rpm -qa | grep vendor-package

# Generate a support bundle (many products have this)
# Kubernetes: kubectl cluster-info dump --output-directory=/tmp/cluster-dump
# Docker: docker info > /tmp/docker-info.txt
# VMware: vm-support (generates log bundle)
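The Linux checks above can be rolled into one attachable archive. A minimal sketch, assuming a Linux host with journalctl; the bundle name and /tmp paths are illustrative, not a vendor requirement:

```shell
# Collect the standard evidence into one timestamped bundle.
STAMP=$(date -u +%Y%m%d_%H%M%SUTC)
OUT="/tmp/support_bundle_${STAMP}"
mkdir -p "$OUT"

uname -a                          > "$OUT/uname.txt"
cat /etc/os-release               > "$OUT/os-release.txt" 2>/dev/null
dmesg -T                          > "$OUT/dmesg.txt"       2>/dev/null
journalctl --since "2 hours ago"  > "$OUT/journal_2h.txt"  2>/dev/null

tar -czf "${OUT}.tar.gz" -C /tmp "$(basename "$OUT")"
echo "Attach this: ${OUT}.tar.gz"
```

Attach the single tarball instead of pasting fragments into the ticket; it keeps every file stamped with the same collection time.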

Gotcha: Writing Bad Ticket Titles

The title is the first (and sometimes only) thing the triage engineer reads. Bad titles get bad routing.

Bad titles:

"Help needed"
"System not working"
"URGENT!!!"
"Issue with product"
"Error message"

Good titles:

"[Sev1] Node kernel panic under >50Gbps load — 3 crashes in 24h, v14.2.3"
"[Sev2] API returns 503 after TLS cert rotation — affects /v2/orders endpoint"
"[Sev3] Config sync fails between HA pair — error: 'peer unreachable' since upgrade to 8.1.4"

Formula: [Severity] Specific symptom — scope + trigger + version


Pattern: Escalation Triggers — When to Push

Do not escalate too early (burns political capital) or too late (burns downtime).

Escalate technically when:

  • L1 asks you to perform steps you have already documented in the ticket
  • The response suggests a misunderstanding of the architecture
  • You are asked to reproduce an issue that is clearly documented with logs and timestamps
  • The vendor says "working as designed" but the behavior contradicts their own documentation
  • The case has been reassigned more than twice

Escalate managerially when:

  • SLA response time has been violated (with timestamps to prove it)
  • No substantive update for 2+ business days on a Sev1/Sev2
  • The engineer asks for information already provided in the ticket (a pattern of not reading)
  • You are told a fix requires a release with no ETA and no interim workaround

How to escalate without burning bridges:

"Hi [Manager Name],

I'm writing regarding case #12345, opened [date] as Severity 1.

Current status: We have not received a substantive technical update
in 48 hours. Our SLA specifies 4-hour response and 8-hour update
cadence for Sev1 cases.

Impact: [X] users affected, [Y] revenue at risk per hour.

Request: Please assign a senior engineer and provide an update on
root cause analysis by [specific time].

Attached: Full case timeline with timestamps.

Thank you,
[Your name]"

Gotcha: Not Attaching Logs

You open a Sev1 ticket at 3 AM. You write three sentences. No logs, no screenshots, no version numbers. L1 responds 4 hours later asking for all of that. You provide it. L1 reviews and routes to L2. L2 asks for a different log format. You have now lost 12 hours.

What to attach on every case:

Minimum viable attachment list:
├── System logs (filtered to relevant timeframe)
├── Application logs (filtered to relevant timeframe)
├── Version / build numbers for all relevant components
├── Network diagram (if networking issue)
├── Config files (sanitized — remove passwords/keys)
├── Screenshots of error messages
├── Output of vendor's diagnostic command (show tech, qkview, etc.)
└── Timeline of events with UTC timestamps

Name files clearly: syslog_node2_20260315_0200-0400UTC.txt, not logs.txt.
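That naming convention can be generated rather than hand-typed. A sketch, assuming GNU date for `-d` and UTC timestamps; the hostname, window, and `logname` helper are illustrative:

```shell
# Build "<what>_<host>_<YYYYMMDD>_<HHMM-HHMM>UTC.txt" filenames.
logname() {   # usage: logname what "start (UTC)" "end (UTC)"
  day=$(date -u -d "$2 UTC" +%Y%m%d)
  win="$(date -u -d "$2 UTC" +%H%M)-$(date -u -d "$3 UTC" +%H%M)UTC"
  echo "${1}_$(hostname -s)_${day}_${win}.txt"
}

# e.g. journalctl --utc --since "$start" --until "$end" > "$(logname syslog "$start" "$end")"
logname syslog "2026-03-15 02:00" "2026-03-15 04:00"
```

Encoding host, date, and window in the name means the triage engineer never has to open the file to know what it covers.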


Pattern: RMA Tracking and Inventory

When hardware fails, you need to move fast. Have this information ready before it happens:

Critical Hardware Inventory (maintain this document):

| Device          | Serial     | Location    | Support Contract | Expiry     | Spare On-Site? |
|-----------------|------------|-------------|------------------|------------|----------------|
| Core Switch A   | SN-1234    | DC-East R14 | Premium 4HR      | 2027-01-15 | Yes (cold)     |
| Core Switch B   | SN-1235    | DC-East R14 | Premium 4HR      | 2027-01-15 | Yes (cold)     |
| Firewall HA-1   | SN-5678    | DC-East R02 | Standard NBD     | 2026-09-30 | No             |
| Server Node 01  | SN-9012    | DC-East R20 | Premium 4HR      | 2026-12-31 | Yes (drives)   |
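If the inventory table is also kept as CSV, expiring contracts can be flagged automatically. A sketch, assuming GNU date and a hypothetical `device,serial,location,contract,expiry,spare` column layout:

```shell
# Flag support contracts expiring within 90 days.
expiring_soon() {   # usage: expiring_soon inventory.csv
  cutoff=$(( $(date +%s) + 90*24*3600 ))
  while IFS=, read -r device serial location contract expiry spare; do
    exp=$(date -d "$expiry" +%s 2>/dev/null) || continue
    [ "$exp" -lt "$cutoff" ] &&
      echo "RENEW SOON: $device ($serial), $contract expires $expiry"
  done < "$1"
}

# Example with synthetic dates (one contract 30 days out, one 400 days out):
printf 'Core Switch A,SN-1234,DC-East R14,Premium 4HR,%s,yes\n' "$(date -d '+400 days' +%F)"  > /tmp/inventory.csv
printf 'Firewall HA-1,SN-5678,DC-East R02,Standard NBD,%s,no\n'  "$(date -d '+30 days' +%F)" >> /tmp/inventory.csv
expiring_soon /tmp/inventory.csv
```

Run it on a schedule; finding out a contract lapsed while a switch is dead is the expensive way to learn.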

RMA process checklist:

□ Identify failed component and serial number
□ Verify support contract is active (check vendor portal)
□ Gather failure evidence (logs, LED status, diagnostic output)
□ Open case / request RMA with evidence attached
□ Record RMA number: _______________
□ Record tracking number (inbound replacement): _______________
□ Replacement received — verify part number matches
□ Install and verify functionality
□ Ship failed unit back (record return tracking): _______________
□ Confirm vendor received return (avoid "unreturned hardware" charges)
□ Close case

Gotcha: Accepting "Working as Designed" Without Challenge

The vendor tells you the behavior is working as designed. Maybe it is. But often:

  • The documentation contradicts the behavior
  • The behavior changed between versions without release notes
  • The "design" does not match what the sales team sold you
  • The behavior is technically correct but operationally unusable

Response pattern:

"I understand this may be the intended behavior. However:

1. The documentation at [URL] states [X], which contradicts this behavior
2. This behavior was not present in version [Y], which we upgraded from
3. This creates an operational impact of [specific impact]

Can you confirm this was an intentional change? If so, please:
- Update the documentation to reflect the actual behavior
- File a feature request for [the behavior we need]
- Provide a workaround for the current version

If this was not intentional, please escalate as a defect."

Pattern: The Vendor War Room

During a major incident involving a vendor product, you may need a dedicated working session:

Setting up a war room call:

  1. Request a bridge/conference call with the vendor's assigned engineer
  2. Share screen access or remote session to the affected system
  3. Ensure both sides have the same data (logs, dashboards)
  4. Designate one person on your team as the vendor liaison
  5. Keep a shared timeline document updated in real-time

War room rules:

├── One person talks to the vendor at a time (avoid confusion)
├── Keep a running log of every action taken and by whom
├── Do not make changes without discussing with the vendor first
├── Take screenshots before and after every change
├── Agree on rollback criteria before attempting fixes
├── Set a checkpoint every 2 hours: are we making progress?
└── If no progress in 4 hours, escalate to next tier + management
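The running-log rule is easier to keep when appending an entry is one command. A sketch; the log location and `log_action` helper are assumptions, so point LOGFILE at whatever shared document or repo your team uses:

```shell
# Shared war-room timeline: one timestamped line per action, per person.
LOGFILE=/tmp/warroom_timeline.txt
: > "$LOGFILE"   # start a fresh timeline for this incident

log_action() {   # usage: log_action "who" "what"
  printf '%s | %s | %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$LOGFILE"
}

log_action "alice" "Restarted config-sync service on node2 (vendor approved)"
log_action "bob"   "Uploaded qkview to case #12345"
cat "$LOGFILE"
```

UTC timestamps on every line mean your timeline and the vendor's case notes can be reconciled later without timezone archaeology.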

Gotcha: Escalating Too Late

The most common vendor management failure: waiting too long. You spend 3 days going back and forth with L1, trying every suggestion, before escalating. Those 3 days of production impact were avoidable.

Early escalation signals:

Day 1: L1 response does not address the actual problem described
→ Reply with clarification AND request L2 review

Day 1: L1 suggests steps already documented in your ticket
→ Point this out politely, request escalation

Day 2: No update on Sev1/Sev2
→ Call support, request status + manager name

Day 2: Suggested fix does not resolve issue
→ Document the failed attempt, request L2/L3

Day 3: Still at L1 with no progress
→ You have waited too long. Escalate now.

Pattern: Contract Leverage Points During Renewal

Renewal is your one moment of maximum leverage. Use it.

Before the renewal meeting:

# Compile your case history
Total cases opened:         47
Cases requiring L3+:        12
Average time to resolution: 4.2 days
SLA violations:             6
RMAs processed:             3
Downtime attributed:        18 hours
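Numbers like these should come from your case log, not memory. A sketch that compiles them with awk, assuming a hypothetical `case_id,severity,opened,resolved_days,max_tier,sla_violated` CSV:

```shell
# Three synthetic cases for illustration.
cat > /tmp/cases.csv <<'EOF'
12345,1,2026-03-15,2.2,3,1
12391,2,2026-04-02,6.1,2,0
12422,3,2026-05-20,4.3,3,1
EOF

summary=$(awk -F, '
  { total++; days += $4
    if ($5 >= 3) l3++
    if ($6 == 1) viol++ }
  END {
    printf "Total cases opened:         %d\n", total
    printf "Cases requiring L3+:        %d\n", l3
    printf "Average time to resolution: %.1f days\n", days/total
    printf "SLA violations:             %d\n", viol
  }' /tmp/cases.csv)
echo "$summary"
```

Rerun it the week before the renewal meeting and paste the output straight into your talking points.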

Negotiation talking points:

├── "We opened 47 cases last year. 12 needed L3. We need a named TAM."
├── "Six SLA violations. We need premium SLA or a service credit."
├── "Our team needs 3 training seats. Include them in the renewal."
├── "We are evaluating [competitor]. Match their response SLA."
├── "We need early access to patches for critical fixes."
└── "Multi-year commitment in exchange for [specific concession]."

Never negotiate renewals under time pressure. Start the conversation 90 days before expiry.


Gotcha: No SLA Tracking

You think the vendor violated SLA but you have no timestamps to prove it. The vendor says they responded within the window. You have no evidence to the contrary. During renewal, you cannot quantify the support quality.

Track every case:

Case #12345
├── Opened: 2026-03-15 02:15 UTC (Sev1)
├── SLA Response: 1 hour
├── Actual First Response: 2026-03-15 06:30 UTC (4h15m — VIOLATION)
├── First Useful Response: 2026-03-15 14:00 UTC (L2 assigned)
├── Updates: 3/15 18:00, 3/16 10:00, 3/16 22:00
├── Resolution: 2026-03-17 08:00 UTC
├── Total Time to Resolution: 53h 45m
├── SLA Violations: 1 (response time)
└── Notes: L1 spent 12h before escalating to L2

Store this in a shared spreadsheet or ticketing system. It is your evidence for every escalation and renewal conversation.
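The violation call in the timeline above is arithmetic, not opinion, so script it. A sketch assuming GNU date and UTC timestamps; `sla_check` is an illustrative helper, not a vendor tool:

```shell
# Compare first-response time against the SLA window.
sla_check() {   # usage: sla_check "opened UTC" "responded UTC" sla_hours
  opened=$(date -u -d "$1 UTC" +%s)
  responded=$(date -u -d "$2 UTC" +%s)
  delta_min=$(( (responded - opened) / 60 ))
  limit_min=$(( $3 * 60 ))
  printf 'First response after %dh%02dm' $((delta_min/60)) $((delta_min%60))
  [ "$delta_min" -gt "$limit_min" ] && printf ' VIOLATION (exceeds %sh SLA)' "$3"
  printf '\n'
}

# The Sev1 case from the timeline above: opened 02:15, first response 06:30, 1h SLA.
sla_check "2026-03-15 02:15" "2026-03-15 06:30" 1
```

Recording the computed delta in the case notes at the time of the violation is far more credible than reconstructing it months later at renewal.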