
Scenario: Duplex Mismatch Causing Slow Transfers and Late Collisions

Situation

At 16:05 UTC, the database team reports that replication between the primary database server (db-primary) and the replica (db-replica) is falling behind. Bulk data transfers between the two machines top out at roughly 10-15 Mbps on what should be a 1 Gbps link. Small queries work fine and latency looks normal for interactive operations. The issue appeared after a NIC replacement on db-replica during a hardware maintenance window two days ago, but was not noticed until replication lag alerts fired today.

What You Know

  • Both servers are connected to the same switch via 1 Gbps copper links
  • db-replica had its network card replaced two days ago
  • Small queries and interactive SSH feel normal
  • Bulk transfers (rsync, mysqldump over network) are extremely slow
  • The switch is managed but the network team says "the port is up and looks fine"
  • No packet loss visible with simple ping tests

Investigation Steps

1. Check negotiated speed and duplex on both servers

Command(s):

# On db-primary
ethtool eth0

# On db-replica
ethtool eth0

# Quick summary of link parameters
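ethtool eth0 | grep -iE 'speed|duplex|auto-negotiation|link detected'

What to look for: One side showing Duplex: Full and the other showing Duplex: Half. The classic scenario: the new NIC on db-replica was manually configured (or has a driver default) to force 1000Mbps/Full, but the switch port is set to auto-negotiate. When one side is forced and the other auto-negotiates, the auto-negotiating side can still detect the speed but receives no duplex information, so it falls back to half duplex (IEEE 802.3 parallel detection, which is defined for 10/100 Mbps; with a forced peer at 1000 Mbps the outcome depends on the hardware). In this scenario the switch port ends up at half duplex while the server thinks it is at 1000Mbps/Full. Because ethtool reports only the local view, Auto-negotiation: off on a server whose switch port auto-negotiates is itself a warning sign.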
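
To compare the two ends quickly, the relevant ethtool fields can be condensed into one line per host. The following is a minimal sketch, not part of the original runbook: link_summary is a hypothetical helper name, and the sample text is made-up output shaped like what ethtool prints on a NIC with a forced setting.

```shell
#!/usr/bin/env bash
# link_summary (hypothetical helper): read `ethtool <iface>` output on stdin
# and print "speed=<..> duplex=<..> autoneg=<..>" on a single line.
link_summary() {
  awk -F': *' '
    /^[ \t]*Speed:/            { speed = $2 }
    /^[ \t]*Duplex:/           { duplex = $2 }
    /^[ \t]*Auto-negotiation:/ { autoneg = $2 }
    END { printf "speed=%s duplex=%s autoneg=%s\n", speed, duplex, autoneg }
  '
}

# Made-up sample resembling a NIC forced to 1000/Full (assumption, not real data).
sample='Settings for eth0:
    Speed: 1000Mb/s
    Duplex: Full
    Auto-negotiation: off
    Link detected: yes'

printf '%s\n' "$sample" | link_summary
```

On a live host this would be run as `ethtool eth0 | link_summary` on each server, so the two summaries can be pasted side by side with the switch port status.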

2. Check interface error counters for collisions

Command(s):

# Detailed NIC statistics — look for collisions and errors
ethtool -S eth0 | grep -iE 'collision|error|drop|crc|frame'

# Legacy but still useful
ifconfig eth0 2>/dev/null || ip -s link show eth0

# Watch counters increment in real time during a transfer
watch -n 1 'ethtool -S eth0 | grep -iE "collision|error|drop"'

What to look for: Late collisions are the hallmark of a duplex mismatch. They are counted on the half-duplex side, which follows CSMA/CD (carrier sense multiple access with collision detection), while the full-duplex side transmits without checking whether the line is busy. On the half-duplex end you will see counters like tx_late_collision, tx_single_collision, or tx_multi_collision incrementing (exact counter names vary by driver). A healthy full-duplex link should report zero collisions. On the full-duplex end, look for rx_crc_errors and rx_frame_errors, which climb because frames the peer aborts on collision arrive truncated or damaged.
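
Absolute counter values matter less than whether they grow during a transfer. As a rough sketch, two snapshots of `ethtool -S` output can be diffed to show only the error-type counters that increased. The helper name counter_delta and the sample counter values below are assumptions for illustration; real counter names depend on the NIC driver.

```shell
#!/usr/bin/env bash
# counter_delta BEFORE AFTER (hypothetical helper): given two files holding
# `ethtool -S <iface>` snapshots, print each collision/error/crc/drop counter
# that increased, together with the size of the increase.
counter_delta() {
  awk -F': *' '
    { gsub(/^[ \t]+/, "", $1) }                       # strip leading indent from the name
    tolower($1) !~ /collision|error|crc|drop/ { next } # keep only error-type counters
    NR == FNR { before[$1] = $2; next }                # first file: remember baseline
    ($2 - before[$1]) > 0 { printf "%s +%d\n", $1, $2 - before[$1] }
  ' "$1" "$2"
}

# Made-up snapshots (assumed counter names taken from the text above).
before=$(mktemp); after=$(mktemp)
printf '%s\n' 'NIC statistics:' '     tx_late_collision: 12' \
              '     rx_crc_errors: 3' '     rx_packets: 1000' > "$before"
printf '%s\n' 'NIC statistics:' '     tx_late_collision: 57' \
              '     rx_crc_errors: 3' '     rx_packets: 9000' > "$after"

counter_delta "$before" "$after"
rm -f "$before" "$after"
```

In practice the snapshots would come from `ethtool -S eth0 > before.txt`, a bulk transfer, then `ethtool -S eth0 > after.txt`. Any line printed for a link that should be full duplex deserves a closer look.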

3. Measure actual throughput and confirm the degradation

Command(s):

# Use iperf3 for a clean bandwidth test
# On db-primary (server mode):
iperf3 -s

# On db-replica (client mode):
iperf3 -c db-primary -t 30

# If iperf3 is not available, use a large file transfer
dd if=/dev/zero bs=1M count=500 | ssh db-primary 'cat > /dev/null'

# Check for TCP retransmissions during the transfer
ss -ti dst db-primary
netstat -s | grep -i retransmit

What to look for: Throughput will be a fraction of the expected 1 Gbps, typically 10-50 Mbps with high jitter. The iperf3 output will show frequent retransmissions in its Retr column. ss -ti will show high retransmit counts on active sockets. The half-duplex side backs off exponentially on each collision, creating severe throughput degradation under sustained load. Short transactions may not trigger enough collisions to be noticeable.
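
To put a number on "frequent retransmissions", the kernel's TCP counters can be turned into a percentage. This is a sketch under the assumption that the input follows the Linux /proc/net/snmp layout (a "Tcp:" header line followed by a "Tcp:" value line); tcp_retrans_pct is a hypothetical helper name and the sample values are invented.

```shell
#!/usr/bin/env bash
# tcp_retrans_pct (hypothetical helper): read /proc/net/snmp-style text on
# stdin and print RetransSegs as a percentage of OutSegs.
tcp_retrans_pct() {
  awk '
    $1 == "Tcp:" && !have_header {       # first Tcp: line names the columns
      for (i = 2; i <= NF; i++) col[$i] = i
      have_header = 1
      next
    }
    $1 == "Tcp:" {                       # second Tcp: line carries the values
      out = $(col["OutSegs"]); retrans = $(col["RetransSegs"])
      if (out > 0) printf "%.2f\n", 100 * retrans / out
      else print "0.00"
    }
  '
}

# Invented sample in /proc/net/snmp layout (not real measurements).
sample='Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 10 5 0 0 3 40000 50000 2500 0 4 0'

printf '%s\n' "$sample" | tcp_retrans_pct   # prints 5.00
```

On a live host the input would be `tcp_retrans_pct < /proc/net/snmp`. These counters are cumulative since boot and cover all connections, so comparing the raw OutSegs and RetransSegs values before and after a test transfer gives a cleaner picture than a single reading.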

4. Check the switch side (if you have access)

Command(s):

# If managed switch with SSH access
ssh admin@switch 'show interface GigabitEthernet0/1 status'
ssh admin@switch 'show interface GigabitEthernet0/1 counters errors'

# If using SNMP
snmpwalk -v2c -c public switch-ip IF-MIB::ifSpeed
snmpwalk -v2c -c public switch-ip EtherLike-MIB::dot3StatsDuplexStatus

What to look for: The switch port showing a-half (auto-negotiated to half duplex) while the server shows forced full duplex. The switch error counters will also show late collisions, FCS errors, and runts.

Root Cause

When the NIC on db-replica was replaced, the new card's driver or a configuration script forced the link to 1000Mbps Full Duplex instead of using auto-negotiation. The switch port was configured for auto-negotiation. Per IEEE 802.3 auto-negotiation rules, when one side forces its speed/duplex and the other auto-negotiates, the auto-negotiating side can detect the speed from the link signaling (parallel detection) but cannot detect the duplex setting. It defaults to half duplex for 10/100 Mbps and may behave unpredictably at 1000 Mbps, because 1000BASE-T (Gigabit Ethernet over copper) requires auto-negotiation.

The result: db-replica transmits in full duplex mode (never checking if the line is busy), while the switch port operates in half duplex mode (expecting CSMA/CD behavior). Frames collide, causing late collisions, CRC errors, retransmissions, and severely degraded throughput.

Fix

Immediate:

# Set db-replica's NIC back to auto-negotiation
ethtool -s eth0 autoneg on

# Verify the link renegotiates correctly
sleep 3
ethtool eth0 | grep -iE 'speed|duplex|auto-negotiation'

# Expected output:
#   Speed: 1000Mb/s
#   Duplex: Full
#   Auto-negotiation: on

# Confirm error counters stop incrementing
ethtool -S eth0 | grep -iE 'collision|error'

Preventive:

  • Always use auto-negotiation on both sides. The IEEE 802.3 specification mandates auto-negotiation for 1000BASE-T (Gigabit Ethernet over copper). Forcing speed/duplex is a legacy practice from Fast Ethernet days and causes more problems than it solves.
  • Persist the configuration so it survives reboots:

# For systemd .link files, applied by systemd-udevd (/etc/systemd/network/10-eth0.link)
[Match]
MACAddress=xx:xx:xx:xx:xx:xx

[Link]
AutoNegotiation=yes

  • Monitor NIC collision counters with your monitoring system (Prometheus node_exporter exposes node_network_transmit_colls_total). Alert on any non-zero collision count on links that should be full duplex.
  • Add a post-maintenance checklist that includes verifying link speed, duplex, and running a throughput test after any NIC or cable replacement.
  • Document the expected link parameters for each server in your CMDB.
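
The link-parameter part of a post-maintenance checklist can be scripted so it fails loudly instead of relying on someone reading ethtool output. The sketch below is one possible shape, not an established tool: check_link is a hypothetical helper, and the expected values passed to it would come from whatever the CMDB records for that server.

```shell
#!/usr/bin/env bash
# check_link SPEED DUPLEX (hypothetical helper): read `ethtool <iface>` output
# on stdin; succeed only if Speed and Duplex match the expected values and
# auto-negotiation is reported as on.
check_link() {
  local want_speed="$1" want_duplex="$2" input speed duplex autoneg
  input=$(cat)
  speed=$(printf '%s\n' "$input"   | sed -n 's/^[[:space:]]*Speed: *//p')
  duplex=$(printf '%s\n' "$input"  | sed -n 's/^[[:space:]]*Duplex: *//p')
  autoneg=$(printf '%s\n' "$input" | sed -n 's/^[[:space:]]*Auto-negotiation: *//p')

  if [ "$speed" = "$want_speed" ] && [ "$duplex" = "$want_duplex" ] && [ "$autoneg" = "on" ]; then
    echo "OK: $speed $duplex autoneg=$autoneg"
  else
    echo "MISMATCH: got speed=$speed duplex=$duplex autoneg=$autoneg, want $want_speed $want_duplex autoneg=on"
    return 1
  fi
}

# Made-up sample of a bad state (assumption, not captured output).
printf '%s\n' '    Speed: 1000Mb/s' '    Duplex: Half' '    Auto-negotiation: on' \
  | check_link 1000Mb/s Full || echo "link check failed"
```

After a NIC or cable swap this would be run as `ethtool eth0 | check_link 1000Mb/s Full`. It only validates the server's view of the link, so the switch port still needs to be checked separately.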

Common Mistakes

  • Forcing speed and duplex on the server "because auto-negotiation is unreliable." This was true 20 years ago. Modern NICs and switches auto-negotiate reliably. Forcing one side while the other auto-negotiates is the most common cause of duplex mismatch.
  • Not checking the switch side. The server may report Full Duplex because it was forced to Full, but the switch is actually running Half. You need to check both endpoints.
  • Blaming the application or the network for "slowness" without checking layer 1/2 fundamentals. Duplex mismatch is invisible to tools like ping and traceroute, and the symptoms (slow bulk transfers with normal latency for small operations) mimic many higher-layer issues.
  • Looking only at packet loss percentages. A duplex mismatch may not show significant packet loss in a brief ping test because CSMA/CD retries eventually succeed. The damage shows up as throughput degradation under sustained load.

Interview Angle

Q: Two servers on the same switch, but bulk transfers between them are extremely slow. What do you check?

Good answer shape: Start with the physical and data-link layers. Check ethtool on both servers and the switch port for a speed/duplex mismatch. Explain that when one side forces speed/duplex and the other auto-negotiates, the auto-negotiating side defaults to half duplex, causing late collisions under load. Describe checking ethtool -S for collision counters and using iperf3 to measure actual throughput. Mention that the fix is to ensure both sides use auto-negotiation, and that 1000BASE-T (Gigabit Ethernet over copper) requires auto-negotiation per the IEEE 802.3 spec. This demonstrates understanding of layer 2 fundamentals that many engineers overlook when troubleshooting performance issues.

