Binary and Floating Point Footguns

Mistakes that cause silent data corruption, counter wraps, monitoring false alarms, and billing errors.

1. Comparing Floats for Exact Equality

Your monitoring alert checks if error_rate == 0.0 to determine a healthy state. It fires intermittently because floating-point arithmetic produces 0.0000000000000001 instead of exact zero. The alert flaps every few minutes, desensitizing the on-call team to real problems.

Fix: Use a tolerance (epsilon) comparison for floats. In Prometheus: abs(metric) < 0.001. In Python: math.isclose(a, b, rel_tol=1e-9), adding abs_tol (for example abs_tol=1e-9) whenever either side can be zero, because a relative tolerance alone never matches against exact zero. In bash/awk: awk 'function fclose(a,b) { return (a-b < 0 ? b-a : a-b) < 0.0001 }'. Exact float equality is almost never safe for operational decisions.
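A minimal sketch of the tolerance check in Python. The helper name `is_healthy` and the 1e-9 tolerance are assumptions chosen for illustration, not values from any particular monitoring stack.

```python
import math

def is_healthy(error_rate: float, tolerance: float = 1e-9) -> bool:
    """Treat error rates within `tolerance` of zero as healthy (hypothetical helper)."""
    # rel_tol scales with the operands, so it cannot match against 0.0;
    # abs_tol provides the fixed floor needed for a comparison with zero.
    return math.isclose(error_rate, 0.0, abs_tol=tolerance)

residue = 0.1 + 0.2 - 0.3    # float rounding leaves a tiny nonzero residue
print(residue == 0.0)        # False
print(is_healthy(residue))   # True
print(is_healthy(0.01))      # False
```

The right tolerance depends on the metric: it should sit well above float noise and well below the smallest value you actually care about.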

2. Using Floating Point for Money

Your billing system stores prices as DOUBLE PRECISION in PostgreSQL. A customer buys 1000 items at $0.01 each. Adding the line items one by one gives a total slightly below $10.00 (about $9.9999999999998) instead of exactly $10.00. After millions of transactions, the accounting discrepancy becomes material. Auditors flag it. The fix requires a schema migration on a multi-TB table.

Fix: Use integer cents (store 1999 for $19.99), DECIMAL/NUMERIC types in SQL, or Decimal in Python. Never use float, REAL, or DOUBLE PRECISION for any monetary value. This includes tax calculations, exchange rates applied to amounts, and running totals.
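The difference is easy to reproduce. The sketch below accumulates 1000 items at $0.01 three ways; the variable names are illustrative only.

```python
from decimal import Decimal

# Naive float accumulation: 1000 additions of 0.01 drift away from 10.0.
float_total = 0.0
for _ in range(1000):
    float_total += 0.01

# Decimal built from a string keeps the exact base-10 value.
decimal_total = sum(Decimal("0.01") for _ in range(1000))

# Integer cents: 1 cent per item, converted to dollars only for display.
cents_total = sum(1 for _ in range(1000))

print(float_total == 10.0)                                # False
print(decimal_total == Decimal("10.00"))                  # True
print(f"${cents_total // 100}.{cents_total % 100:02d}")   # $10.00
```

Build Decimal values from strings, not from floats: Decimal(0.01) inherits the binary rounding error that the type is meant to avoid.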

3. Ignoring 32-bit Counter Wraps in SNMP Monitoring

You monitor network interface throughput using SNMP ifInOctets (32-bit counter). On a 10 Gbps link, this counter wraps every 3.4 seconds. Your monitoring system sees the counter go from 4 billion to near zero and interprets it as negative throughput or a massive spike, triggering false alerts.

Fix: Always use 64-bit HC (High Capacity) counters: ifHCInOctets and ifHCOutOctets. In Prometheus with SNMP exporter, ensure the MIB configuration references HC OIDs. A 64-bit octet counter on a 100 Gbps link takes roughly 47 years of sustained line-rate traffic to wrap (2^64 octets x 8 bits / 10^11 bits per second).
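The wrap interval follows directly from the counter width and the link speed. Here is a rough sketch of that arithmetic, plus a wrap-tolerant delta for cases where you are stuck with 32-bit counters. Both function names are hypothetical, and the delta is only valid if at most one wrap happened between samples.

```python
def seconds_to_wrap(counter_bits: int, link_bps: float) -> float:
    """Time for an octet counter to wrap at sustained line rate."""
    max_octets = 2 ** counter_bits
    return max_octets * 8 / link_bps

def counter_delta(previous: int, current: int, counter_bits: int = 32) -> int:
    """Difference between two samples, assuming at most one wrap occurred."""
    return (current - previous) % (2 ** counter_bits)

print(round(seconds_to_wrap(32, 10e9), 1))   # 3.4
print(counter_delta(4_294_967_000, 704))     # 1000
```

If the polling interval is longer than the wrap interval, no amount of modular arithmetic recovers the lost wraps, which is why the HC counters are the real fix.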

4. Forgetting Network Byte Order When Parsing Binary Protocols

You write a script to parse TCP headers from a packet capture. You read the destination port as two bytes 0x00 0x50 and interpret them in little-endian order as 20480. The actual port is 80 (big-endian / network byte order). Your parser silently produces wrong data, and you spend hours debugging a "firewall issue" that does not exist.

Fix: Most network protocols, including IP, TCP, and UDP, use big-endian (most significant byte first). In Python: struct.unpack('!H', data) (the ! means network byte order; the result is a one-element tuple). In C: use ntohs() and ntohl(). Always verify byte order when reading raw packet data, binary file formats, or wire protocols.
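A short sketch of the port-parsing case. The four header bytes are made up for illustration (source port 51000, destination port 80), not taken from a real capture.

```python
import struct

# First four bytes of a TCP header: source port, then destination port.
# Sample bytes: source 51000 (0xC738), destination 80 (0x0050).
header = bytes([0xC7, 0x38, 0x00, 0x50])

src_port, dst_port = struct.unpack("!HH", header)   # "!" = network (big-endian) order
wrong_dst, = struct.unpack("<H", header[2:4])       # little-endian misread

print(src_port, dst_port)   # 51000 80
print(wrong_dst)            # 20480
```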

5. Assuming NaN Comparisons Work Like Normal Numbers

Your Python monitoring script checks if value > threshold to decide whether to alert. The value is NaN (because an invalid operation such as 0/0 occurred upstream, for example in NumPy or in the data source). The comparison returns False, so no alert fires. The system is actually broken, but NaN silently passes through every threshold check.

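Fix: Always check for NaN before comparing. In Python: math.isnan(value). In JavaScript: Number.isNaN(value) (not isNaN(), which coerces its argument). In SQL, behavior depends on the engine: under IEEE 754 semantics NaN != NaN, so WHERE value = value filters NaN out, but PostgreSQL treats NaN as equal to itself, so there use WHERE value <> 'NaN'. In Prometheus, PromQL has no isnan() function; a common idiom is metric != metric, which selects only NaN samples, while absent() detects missing series rather than NaN values. Add NaN guards at data ingestion, not just at alerting.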
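One way to sketch the guard in Python. The function name `should_alert` and the choice to treat NaN as an alert condition are assumptions for illustration; some pipelines may prefer a separate "no data" alert.

```python
import math

def should_alert(value: float, threshold: float) -> bool:
    """Alert on a threshold breach, and also when the value is not a usable number."""
    if math.isnan(value):
        return True   # treat NaN as a pipeline failure, not as "healthy"
    return value > threshold

nan = float("nan")
print(nan > 0.5)                 # False
print(should_alert(nan, 0.5))    # True
print(should_alert(0.2, 0.5))    # False
```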

6. Integer Overflow in 32-bit Timestamps (Y2038)

Your embedded device, IoT fleet, or legacy system stores Unix timestamps as 32-bit signed integers. The last representable moment is 2038-01-19 03:14:07 UTC. One second later, the counter overflows from 2147483647 to -2147483648. The system thinks the date is 1901-12-13. Certificate validation fails because all certificates appear to be from the future. Time-based logic breaks silently.

Fix: Audit all timestamp storage for 32-bit time_t. Check database columns (TIMESTAMP in MySQL is 32-bit; use DATETIME or BIGINT instead). Check embedded firmware, file formats (ext3 superblock timestamps are 32-bit), and protocol fields. Migrate to 64-bit timestamps. This is not a 2037 problem — certificates and scheduled events years in the future are already affected.
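The overflow can be simulated without waiting for 2038. This sketch uses a hypothetical helper, `as_int32`, to reinterpret a value the way a 32-bit signed field would store it.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def as_int32(seconds: int) -> int:
    """Reinterpret the low 32 bits of `seconds` as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", seconds & 0xFFFFFFFF))[0]

last_ok = 2**31 - 1
print(EPOCH + timedelta(seconds=last_ok))                 # 2038-01-19 03:14:07+00:00
print(as_int32(last_ok + 1))                              # -2147483648
print(EPOCH + timedelta(seconds=as_int32(last_ok + 1)))   # 1901-12-13 20:45:52+00:00
```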

7. Mixing Signed and Unsigned Integer Comparison

Your C program compares a signed int -1 with an unsigned int 1. In C, the signed value is implicitly converted to unsigned, turning -1 into 4294967295 (the largest uint32). The comparison (-1 > 1u) evaluates to true. This silent conversion causes buffer overflows, incorrect loop bounds, and security vulnerabilities in network protocol parsers.

Fix: Never mix signed and unsigned in comparisons without explicit casting. Enable compiler warnings (-Wall -Wextra -Wsign-compare). In Go, the compiler prevents this. In Python, integers have unlimited precision so the issue does not arise. When reading C code that parses network data, audit every comparison involving size fields (which are typically unsigned) and return values (which are typically signed, with -1 for error).
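Python does not have this bug itself, but it can imitate the conversion C performs. This is a sketch using ctypes, assuming 32-bit int and unsigned int; `compare_like_c` is a hypothetical name.

```python
import ctypes

def compare_like_c(signed_value: int, unsigned_value: int) -> bool:
    """Mimic C's `signed_value > unsigned_value` for 32-bit int vs unsigned int."""
    converted = ctypes.c_uint32(signed_value).value   # -1 becomes 4294967295
    return converted > unsigned_value

print(ctypes.c_uint32(-1).value)   # 4294967295
print(compare_like_c(-1, 1))       # True, matching C's (-1 > 1u)
print(-1 > 1)                      # False in Python
```

The classic real-world form is a length check such as `if (len > sizeof(buf))` where len is a signed return value holding -1.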

8. Trusting Float Precision at Large Magnitudes

Your metrics system stores a counter as a 64-bit float. The counter reaches 2^53 (about 9 quadrillion, 9.007 x 10^15). At this magnitude, the gap between adjacent representable values grows to 2.0, so the float cannot distinguish between 9007199254740992 and 9007199254740993. Individual increments are silently lost. Your counter appears to stop increasing even though events are still occurring. With a 32-bit float the same thing happens much earlier, at 2^24 (about 16.7 million).

Fix: Use integer types for counters, always. A 64-bit integer can count to 9.2 x 10^18 without losing precision at any magnitude. If you must use floats (e.g., Prometheus internally uses float64), understand that precision degrades above 2^53 (about 9 x 10^15). For counters that might reach this range, use integer storage and convert to float only for display.
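A small sketch of the stall at 2^53, comparing a float64 counter with an integer one. Python's float is a float64, so this runs as-is.

```python
LIMIT = 2 ** 53   # 9007199254740992: beyond this, not every integer fits in a float64

float_counter = float(LIMIT)
int_counter = LIMIT

float_counter += 1.0   # the result rounds back to 2**53, so the increment is lost
int_counter += 1

print(float_counter == float(LIMIT))   # True
print(int_counter - LIMIT)             # 1
```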

9. Getting Subnet Calculations Wrong by Misunderstanding Bitmasks

You need to check if IP 10.0.1.50 is in the 10.0.0.0/22 network. You mentally simplify /22 as "first three octets match" (which is /24). You conclude the IP is not in the network because 10.0.1 does not equal 10.0.0. In reality, /22 covers 10.0.0.0 through 10.0.3.255, and the IP is in the network. Your firewall rule is wrong.

Fix: Always compute the network address with a bitwise AND: IP AND mask = network address. For /22, the mask is 255.255.252.0. 10.0.1.50 AND 255.255.252.0 = 10.0.0.0. Use ipcalc, Python's ipaddress module, or sipcalc to verify. Do not do subnet math in your head for anything other than /8, /16, /24, and /32.
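The same check with Python's ipaddress module, followed by the manual AND for comparison:

```python
import ipaddress

network = ipaddress.ip_network("10.0.0.0/22")
address = ipaddress.ip_address("10.0.1.50")

print(network.netmask)            # 255.255.252.0
print(network[0], network[-1])    # 10.0.0.0 10.0.3.255
print(address in network)         # True

# The same check done by hand: IP AND mask == network address.
masked = int(address) & int(network.netmask)
print(ipaddress.ip_address(masked))   # 10.0.0.0
```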

10. Accumulating Floating-Point Errors in Running Sums

Your monitoring system calculates a running average by adding each new data point to a sum and dividing. After millions of additions, the accumulated rounding error becomes significant. The average drifts from the true value. On a dashboard, the SLO metric shows 99.97% when the real value is 99.95%, and nobody notices the drift until an audit reveals the discrepancy.

Fix: Use compensated summation (Kahan summation) for long-running float accumulations. Better yet, use integer arithmetic (count events as integers, compute ratios only at display time). For monitoring systems, prefer rate-based calculations that reset periodically over unbounded running sums.

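# Kahan summation: carries a compensation term for lost low-order bits
def kahan_sum(values):
    total = 0.0
    compensation = 0.0                   # running estimate of the rounding error
    for val in values:
        y = val - compensation           # apply the correction from the previous step
        t = total + y                    # low-order bits of y may be lost here
        compensation = (t - total) - y   # capture the rounding error of this addition
        total = t
    return total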
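If you are in Python and do not need a streaming sum, the standard library's math.fsum is another option: it tracks partial sums to return an accurately rounded result. A minimal comparison against an uncompensated loop (written out explicitly, since newer Python versions already apply compensation inside the built-in sum() for floats):

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation with no error compensation."""
    total = 0.0
    for val in values:
        total += val
    return total

samples = [0.1] * 10
print(naive_sum(samples) == 1.0)   # False
print(math.fsum(samples) == 1.0)   # True
```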

11. Misreading Hex Dumps Due to Endianness Confusion

You see 04 03 02 01 in a hex dump and interpret it as the integer 0x04030201 (67,305,985 in decimal). But the system is little-endian, so the actual value is 0x01020304 (16,909,060). You file a bug report about a wrong value in a binary protocol, wasting a day of investigation.

Fix: Always determine the byte order before interpreting multi-byte values in hex dumps. Check the file header (ELF files specify endianness in the EI_DATA byte at offset 5: 1 for little-endian, 2 for big-endian), the protocol spec (most network protocols are big-endian), or the architecture (x86 is always little-endian). Use od -A x -t x1 file to see raw bytes without interpretation, then apply the correct byte order manually or with struct.unpack using the right format character (< for little-endian, > for big-endian).
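Decoding the same four bytes both ways with struct.unpack makes the difference concrete:

```python
import struct

raw = bytes.fromhex("04 03 02 01")   # the four bytes as they appear in the dump

big, = struct.unpack(">I", raw)      # most significant byte first
little, = struct.unpack("<I", raw)   # least significant byte first

print(hex(big), big)        # 0x4030201 67305985
print(hex(little), little)  # 0x1020304 16909060
```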

12. Bit-Shifting Errors in Permission and Flag Calculations

You intend to set the setuid bit on a file with chmod 4755, but you type chmod 755, forgetting that the leading octal digit controls setuid/setgid/sticky. Or you check whether a Linux capability is set by testing flags & CAP_NET_RAW, but CAP_NET_RAW is the capability number (a bit index), not a mask. The test should be flags & (1 << CAP_NET_RAW), so your check silently examines the wrong bits.

Fix: For file permissions, always explicitly include the leading digit when setting special bits: chmod 4755 (setuid), chmod 2755 (setgid), chmod 1755 (sticky). For capability and flag checks, use the defined constants from headers or libraries — never hardcode bit positions. Verify with stat -c '%a' for permissions or getfacl for ACLs. When in doubt, use the symbolic form: chmod u+s file is less error-prone than remembering octal positions.
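Both mistakes can be sketched in a few lines of Python. The stat constants come from the standard library. CAP_NET_RAW = 13 is the value from linux/capability.h, written out here only for illustration (real code should take it from a header or library), and the flags value is made up.

```python
import stat

mode_with_setuid = 0o4755
mode_without = 0o755

print(bool(mode_with_setuid & stat.S_ISUID))             # True
print(bool(mode_without & stat.S_ISUID))                 # False
print(stat.filemode(stat.S_IFREG | mode_with_setuid))    # -rwsr-xr-x

# The capability number is a bit index, so the mask is 1 << number.
CAP_NET_RAW = 13              # illustrative; normally taken from a header or library
flags = 1 << CAP_NET_RAW      # hypothetical capability set with only CAP_NET_RAW
print(bool(flags & (1 << CAP_NET_RAW)))   # True: correct mask
print(bool(flags & CAP_NET_RAW))          # False: number used as a mask
```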