Linux Performance

43 cards: 🟢 7 easy | 🟡 22 medium | 🔴 14 hard

🟢 Easy (7)

1. What is load average?


Linux **load averages** are "system load averages" that show the running thread (task) demand on the system as an average number of running plus waiting threads. This measures demand, which can be greater than what the system is currently processing. Most tools show three averages, for 1, 5, and 15 minutes.

Remember: Load avg = processes in run/IO-wait over 1/5/15 min. 4-core: load 4.0 = 100%.

Gotcha: Load includes I/O wait. High load + low CPU = disk bottleneck.
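As a rough illustration of reading these numbers programmatically, the sketch below parses text in the format of `/proc/loadavg` and normalizes a load value by core count. The function names `parse_loadavg` and `per_core` are hypothetical, chosen for this example only.

```python
def parse_loadavg(text):
    """Return the 1-, 5-, and 15-minute averages from /proc/loadavg-style text."""
    # The first three whitespace-separated fields are the three averages.
    one, five, fifteen = (float(value) for value in text.split()[:3])
    return one, five, fifteen


def per_core(load, cores):
    """Normalize a load average by core count (1.0 means fully loaded)."""
    return load / cores


# Sample text in the /proc/loadavg layout (values are illustrative).
sample = "1.43 2.34 2.78 2/345 12345\n"
one, five, fifteen = parse_loadavg(sample)
print(one, five, fifteen, per_core(one, 4))
```

On a live Linux system the same parser could be fed the contents of `/proc/loadavg`, with `os.cpu_count()` supplying the core count.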

2. How do you trace system calls?


`strace <command>` or `strace -f -p <PID>`.

Use `-e` filters. For lower overhead: `perf trace`.

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

3. How do you check the current load average?


Use `uptime` or `top`; both print the 1-, 5-, and 15-minute averages.

Remember: Load avg = processes in run/IO-wait over 1/5/15 min. 4-core: load 4.0 = 100%.

Gotcha: Load includes I/O wait. High load + low CPU = disk bottleneck.

4. Your first 5 commands on a *nix server after login.


- `w` - shows server uptime, load averages, and who is logged in
- `top` - lists running processes, which you can sort by CPU, memory utilization, and more
- `netstat` - shows which ports and IP addresses the server is listening on and which processes are using them
- `df` - reports the disk space used and available on each mounted file system
- `history` - shows what the user you are logged in as has run previously

5. Explain the difference between "Load Average" and "CPU Utilization."


These metrics measure different aspects of system performance.

CPU Utilization:
- Percentage of time CPU was busy (not idle)
- Ranges from 0% to 100% per CPU core
- Measured via /proc/stat (user, system, idle, iowait, etc.)
- High CPU = CPU is actively processing work
- Tools: top, mpstat, sar
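The `/proc/stat` measurement mentioned above can be sketched as a delta between two samples of the aggregate `cpu` line. This is a minimal sketch, assuming the standard field order (user, nice, system, idle, iowait, irq, softirq, steal); the helper names are hypothetical.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into tick counters."""
    parts = line.split()
    # parts[0] is the label ("cpu"); the next eight values are cumulative ticks.
    return dict(zip(FIELDS, (int(value) for value in parts[1:9])))


def cpu_percentages(prev, curr):
    """Percentage of time spent in each state between two samples."""
    deltas = {name: curr[name] - prev[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        pct = {name: 0.0 for name in FIELDS}
        pct["busy"] = 0.0
        return pct
    pct = {name: 100.0 * ticks / total for name, ticks in deltas.items()}
    # "busy" here means everything except idle and iowait.
    pct["busy"] = 100.0 - pct["idle"] - pct["iowait"]
    return pct


# Two illustrative samples taken some interval apart.
before = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
after = parse_cpu_line("cpu  150 0 70 1500 270 0 0 10 0 0")
print(cpu_percentages(before, after))
```

Tools such as `top` and `mpstat` do essentially this kind of differencing, which is why utilization is always reported over an interval rather than at an instant.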

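Load Average:
- Average number of processes that are runnable or waiting in uninterruptible (D-state) I/O
- Can exceed the number of cores
- Measured via uptime, /proc/loadavg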

Remember: Load avg = processes in run/IO-wait over 1/5/15 min. 4-core: load 4.0 = 100%.

Gotcha: Load includes I/O wait. High load + low CPU = disk bottleneck.

6. Explain the difference between symmetric and asymmetric encryption.


Symmetric: same key for encryption and decryption (e.g., AES). Fast, used for bulk data encryption. Challenge: secure key distribution.
Asymmetric: uses a key pair. For encryption (e.g., RSA), the public key encrypts and the private key decrypts; for signatures (e.g., RSA, ECDSA), the private key signs and the public key verifies. Slower, used for key exchange, digital signatures, and TLS handshakes. In practice, TLS uses asymmetric crypto to authenticate the server and agree on a symmetric session key, then uses symmetric crypto for the data.


7. A junior engineer sees 128 MB free in top and panics that the server is out of memory. The server has 16 GB total. Is this a real problem?


Almost certainly not. Linux aggressively uses free memory for page cache (buff/cache). The real question is what `avail Mem` shows — this includes reclaimable cache and buffers. If avail Mem is 11 GB, the server has plenty of memory. Only worry if avail Mem drops below ~10% of total AND swap is actively churning (check vmstat si/so columns).
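The check described above can be sketched against `/proc/meminfo`, where the `MemAvailable` field backs the `avail Mem` figure. The sample values and function names below are illustrative assumptions, not output from a real server.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def available_fraction(info):
    """Fraction of total memory that is available without swapping."""
    return info["MemAvailable"] / info["MemTotal"]


# Illustrative sample: little "free" memory, but plenty available.
sample = """MemTotal:       16384000 kB
MemFree:          131072 kB
MemAvailable:   11534336 kB
Buffers:          524288 kB
Cached:         10485760 kB
"""
info = parse_meminfo(sample)
print(f"free={info['MemFree']} kB, available={available_fraction(info):.0%}")
```

In this sample only about 128 MB is free, yet roughly 70% of memory is available, which matches the scenario in the question.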

Remember: top: PID, PR, NI, VIRT, RES, %CPU, %MEM. htop = prettier with mouse.

🟡 Medium (22)

1. Explain the difference between ulimit -n and fs.file-max — how do they interact?


These are two different layers of file descriptor limits.

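ulimit -n (per-process limit):
- Soft and hard limits per process
- Configured in /etc/security/limits.conf
- Syntax: `<domain> <type> nofile <value>`
- Example: `* hard nofile 65535`
- Check current: `ulimit -n` (soft), `ulimit -Hn` (hard)

fs.file-max (system-wide limit):
- Maximum number of file handles the kernel will allocate across all processes
- Check current: `sysctl fs.file-max` or `cat /proc/sys/fs/file-max`
- Current usage: `cat /proc/sys/fs/file-nr`

Interaction: a process fails to open more files as soon as it hits either limit, its own ulimit -n or the system-wide fs.file-max.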
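Both layers can also be inspected from code. The sketch below uses Python's standard `resource` module for the per-process limit and reads the system-wide value from `/proc/sys/fs/file-max`; the helper names are hypothetical.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def parse_file_max(text):
    """Parse the contents of /proc/sys/fs/file-max (a single integer)."""
    return int(text.split()[0])


def read_file_max(path="/proc/sys/fs/file-max"):
    """Read the system-wide file handle limit (Linux only)."""
    with open(path) as handle:
        return parse_file_max(handle.read())


soft, hard = nofile_limits()
print(f"per-process nofile: soft={soft} hard={hard}")
```

Calling `read_file_max()` on a Linux host gives the second number to compare against; the per-process values correspond to `ulimit -n` and `ulimit -Hn`.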

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

2. You have added several aliases to .profile. How do you reload the shell without exiting?


The best way is `exec $SHELL -l`, because `exec` replaces the current shell process with a new login shell that rereads its startup files. Another good option is `. ~/.profile`, which sources the file into the current shell instead.

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

3. What is the difference between CPU load and utilization?


They measure different things:

CPU Utilization:
- Percentage of time CPU is busy (0-100%)
- Measured by: top, mpstat

CPU Load (Load Average):
- Number of processes wanting CPU + waiting for I/O
- Can exceed number of cores
- Measured by: uptime, /proc/loadavg

Key insight: Linux load includes D-state (I/O wait) processes.
- High load + low CPU util = I/O bottleneck
- High load + high CPU util = CPU bottleneck

Remember: CPU: us(user), sy(system), wa(IO wait), st(stolen). High wa = disk problem.

4. A Linux server is slow. Where do you start?


Systematic approach - don't guess, validate bottlenecks:

1. **Load**: `uptime` - is the system under pressure?
2. **CPU**: `top`/`htop` - check steal time (VM), iowait, user vs system
3. **Memory**: `free -h`, check for swapping (`vmstat 1`)
4. **Disk I/O**: `iostat -x 1`, `iotop` - check await, %util
5. **Network**: `ss -s`, `iftop` if network-bound

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

5. You know how to see the load average. But what does each part of it mean, for example 1.43, 2.34, 2.78?


The three numbers are the 1-, 5-, and 15-minute load averages. [This article](http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html) covers the topic in depth.

Remember: Load avg = processes in run/IO-wait over 1/5/15 min. 4-core: load 4.0 = 100%.

Gotcha: Load includes I/O wait. High load + low CPU = disk bottleneck.

6. How do you measure the execution time of a program?


Several methods:

1. `time` command: `time ./program`
- real = wall clock, user = user CPU, sys = kernel CPU

2. `/usr/bin/time -v`: Detailed stats including memory

3. `perf stat`: CPU cycles, cache misses

4. `hyperfine`: Benchmarking with statistics

Interpretation:
- real > user+sys = I/O or sleep
- user+sys > real = multi-core
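The real versus user+sys distinction can be reproduced inside a program. This is a small sketch using Python's `time.perf_counter` (wall clock) and `time.process_time` (user plus system CPU of the current process); `measure` is a hypothetical helper name.

```python
import time


def measure(func, *args, **kwargs):
    """Run func and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu


# Sleeping consumes wall-clock time but almost no CPU time,
# which is the "real > user+sys" case from the list above.
_, wall, cpu = measure(time.sleep, 0.2)
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

A CPU-heavy function would show the opposite pattern, with CPU time close to wall time on a single thread.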

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

7. You found a server with high CPU load but it's not clear which process is causing it. How would you troubleshoot?


Systematic approach to find CPU hogs:

1. Real-time monitoring:
- top (press 1 for per-CPU view)
- htop (more user-friendly)
- Look for high %CPU processes

2. Sort by CPU:
- ps aux --sort=-%cpu | head
- ps -eo pid,ppid,%cpu,cmd --sort=-%cpu

3. Check load average:
- uptime
- cat /proc/loadavg

Remember: CPU: us(user), sy(system), wa(IO wait), st(stolen). High wa = disk problem.

8. How do you troubleshoot a Linux system that's acting slow?


I start by checking CPU, I/O wait, memory pressure, and storage latency — top, iostat, vmstat, dstat, and logs. Then I look at runaway processes, misbehaving services, and disk space. From there I verify systemd units, network paths, and kernel messages. My approach is always layered: symptoms → resource limits → logs → root cause.

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

9. What do the three load average numbers represent in Linux (shown by uptime or top)?


The three numbers represent the average number of processes in a runnable or
uninterruptible state over 1, 5, and 15 minute intervals.

- 1-minute average: Short-term load, shows recent activity
- 5-minute average: Medium-term trend
- 15-minute average: Long-term trend

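Interpretation on a single-core system:
- 1.0 = CPU is exactly at capacity
- Below 1.0 = CPU has idle time
- Above 1.0 = processes are queuing

On a multi-core system, scale by the core count: on 4 cores, a load of 4.0 is full capacity.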
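Comparing the three windows gives a quick read on direction. The sketch below is one simple way to label the trend; the rule (strictly increasing or decreasing across windows) and the name `load_trend` are assumptions made for illustration.

```python
def load_trend(one, five, fifteen):
    """Label the direction of load from the 1-, 5-, and 15-minute averages."""
    if one > five > fifteen:
        return "rising"   # the most recent window is the highest
    if one < five < fifteen:
        return "falling"  # the most recent window is the lowest
    return "steady or mixed"


# Load was higher 15 minutes ago than it is now.
print(load_trend(1.43, 2.34, 2.78))
```

A rising trend with the 1-minute value above the core count is usually the case worth investigating first.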

Remember: top: PID, PR, NI, VIRT, RES, %CPU, %MEM. htop = prettier with mouse.

10. Explain iostat output


iostat shows CPU and I/O statistics. Key columns:

CPU section:
- %user, %system - User/kernel CPU time
- %iowait - CPU waiting for I/O
- %idle - Idle CPU time

Device section (iostat -x):
- r/s, w/s - Reads/writes per second
- rkB/s, wkB/s - KB read/written per second
- await - Average I/O wait time (ms)
- %util - Device utilization percentage

Key insights:
- High %iowait = I/O bottleneck
- High await = slow storage
- %util near 100% = device saturated (HDDs)

Example: `iostat -xz 1` — %util(busy), await(latency), r/s & w/s (IOPS).
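To make the columns less opaque, here is a minimal sketch of how r/s, w/s, await, and %util can be derived from two samples of a `/proc/diskstats` line. It assumes the standard field layout (reads completed, ms spent reading, writes completed, ms spent writing, ms doing I/O); the function names and sample values are hypothetical.

```python
def parse_diskstats_line(line):
    """Extract the counters needed for iostat-style metrics from one line."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),      # reads completed
        "read_ms": int(f[6]),    # time spent reading (ms)
        "writes": int(f[7]),     # writes completed
        "write_ms": int(f[10]),  # time spent writing (ms)
        "io_ms": int(f[12]),     # time the device was busy (ms)
    }


def io_metrics(prev, curr, interval_s):
    """Compute r/s, w/s, await (ms), and %util between two samples."""
    reads = curr["reads"] - prev["reads"]
    writes = curr["writes"] - prev["writes"]
    ios = reads + writes
    wait_ms = (curr["read_ms"] - prev["read_ms"]) + (curr["write_ms"] - prev["write_ms"])
    busy_ms = curr["io_ms"] - prev["io_ms"]
    return {
        "r/s": reads / interval_s,
        "w/s": writes / interval_s,
        "await": wait_ms / ios if ios else 0.0,
        "%util": 100.0 * busy_ms / (interval_s * 1000.0),
    }


# Two illustrative samples taken one second apart.
before = parse_diskstats_line("8 0 sda 1000 0 8000 500 2000 0 16000 1500 0 1000 2000")
after = parse_diskstats_line("8 0 sda 1100 0 8800 700 2100 0 16800 2300 0 1900 3000")
print(io_metrics(before, after, 1.0))
```

The calculation shows why %util can sit near 100% on devices that serve requests in parallel without being truly saturated: it only measures the share of time at least one request was in flight.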

11. What do you use to debug CPU-related issues?


- `top` shows how much CPU each process consumes
- `perf` is a sampling profiler, useful for figuring out where your CPU cycles are being spent
- Flame graphs are great for visualizing CPU consumption (http://www.brendangregg.com/flamegraphs.html)

Remember: CPU: us(user), sy(system), wa(IO wait), st(stolen). High wa = disk problem.

12. You get a call from someone claiming "my system is SLOW". What do you do?


* Check with `top` for anything unusual
* Run `dstat -t` to check if it's related to disk or network.
* Check if it's network related with `sar`
* Check I/O stats with `iostat`

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

13. What is the OWASP Top 10 and why does it matter for DevOps?


The OWASP Top 10 is a regularly updated list of the most critical web application security risks. Current top entries include: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging Failures, and SSRF. DevOps teams use it to prioritize security testing in CI/CD pipelines and set security gates.


14. What are the phases of incident response?


1) Preparation: policies, tools, training, runbooks.
2) Identification: detect and confirm the incident via monitoring, alerts, or reports.
3) Containment: limit damage (short-term: isolate affected systems; long-term: apply temporary fixes).
4) Eradication: remove the root cause (malware, compromised accounts, vulnerabilities).
5) Recovery: restore systems to normal operation, verify integrity.
6) Lessons Learned: post-incident review, update procedures, improve defenses.


15. How do you implement least privilege in cloud IAM?


1) Start with zero permissions and add only what is needed.
2) Use managed policies scoped to specific services.
3) Avoid wildcard permissions (Resource: "*").
4) Use conditions (IP range, MFA required, time-based).
5) Separate roles for different workloads.
6) Use IAM Access Analyzer to find unused permissions.
7) Regularly audit and remove stale permissions.
8) Prefer short-lived credentials (STS AssumeRole) over long-lived access keys.


16. You run `df` and get "command not found". What could be wrong, and how do you fix it?


Most likely the default $PATH was modified or overridden, so `/bin/` (where df lives) is missing.

Fix:
1. Manually reset PATH: `PATH=/bin:/sbin:/usr/bin:/usr/sbin`
2. Check what broke it: review `~/.bashrc`, `~/.bash_profile`, `/etc/profile`
3. Fix the offending file and source it: `source ~/.bashrc`

To schedule periodic tasks, use `cron`:
`crontab -e` then add entries like: `*/30 * * * * bash myscript.sh`
Format: `<minute> <hour> <day-of-month> <month> <day-of-week> <command>`. On systemd distros, consider systemd timers as an alternative.

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

17. You define x=2 in /etc/bashrc and x=6 in ~/.bashrc. You then log in. What is the value of x?


x=6 (user's .bashrc overrides system bashrc)

Order of execution (login shell):
1. /etc/profile
2. ~/.bash_profile (or ~/.bash_login or ~/.profile)
- Often sources ~/.bashrc
3. /etc/bashrc (typically sourced by .bashrc)
4. ~/.bashrc

For login shells:
- System files first, user files after
- Later definitions override earlier ones
- x=2 set in /etc/bashrc
- x=6 set in ~/.bashrc (wins)

Important notes:
- Non-login shells may differ
- Depends on how files source each other
- .bashrc usually sources /etc/bashrc first

Result: x=6 (user config takes precedence)

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

18. Explain piping. How do you perform piping?


A pipe (`|`) sends the standard output of one command to the standard input of the next. For example: `cat /etc/services | wc -l`
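What the shell does for `producer | consumer` can be sketched with Python's `subprocess` module: the producer's stdout is connected directly to the consumer's stdin. The example assumes the `printf` and `wc` utilities are installed, and `count_lines_via_pipe` is a hypothetical name.

```python
import subprocess


def count_lines_via_pipe(text):
    """Equivalent of: printf '%s' "$text" | wc -l"""
    producer = subprocess.Popen(["printf", "%s", text], stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        ["wc", "-l"], stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close our copy so the producer sees SIGPIPE if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return int(output.decode().strip())


print(count_lines_via_pipe("a\nb\nc\n"))
```

Both processes run concurrently, and data flows through a kernel buffer rather than a temporary file.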

Example: `perf top` = live CPU hotspots. `perf record && perf report` for profiling.

19. What does LC_ALL=C before a command do? In what cases is it useful?


`LC_ALL` is the environment variable that overrides all the other localisation settings. This sets all `LC_` type variables at once to a specified locale.

The main reason to set `LC_ALL=C` before a command is to get plain, untranslated English output, or more generally to change the locale the command uses.

It can also speed up commands such as `grep` or `fgrep`: in the C locale, text is handled as plain bytes rather than multibyte characters, which can noticeably reduce execution time.

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

20. A Kubernetes node on AWS shows 8% st (steal time) in top but all pods report normal CPU usage via kubectl top. What is happening and what can you do?


Steal time means the underlying hypervisor is taking CPU cycles from the VM to serve other tenants. Pods report normal usage because cAdvisor measures CPU time consumed, not wall-clock time. The actual execution is slower because the VM is not getting all the CPU time it asks for. Solutions: resize to a dedicated/larger instance type, migrate the node, or use instances with dedicated tenancy. You cannot fix steal time from inside the VM.

Remember: top: PID, PR, NI, VIRT, RES, %CPU, %MEM. htop = prettier with mouse.

21. How does high wa (I/O wait) on a container host relate to container I/O throttling?


`wa` on the host means CPUs are idle waiting for I/O to complete. For containers, this often maps to pods hitting their blkio cgroup limits or contending for shared node storage. A container writing heavily to an emptyDir on the node's disk drives up host `wa` and affects every container on that node. Check container I/O limits and consider using dedicated volumes or adjusting blkio cgroup settings.

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

22. A server shows load average of 24 on a 4-core system but CPU utilization is only 15%. What does this indicate?


High load average with low CPU utilization means most of the load is from processes in uninterruptible sleep (D state), not from CPU work. These processes are blocked on I/O — typically disk, NFS, or SAN. The load average counts both runnable and D-state processes. Confirm with `vmstat 1` (check the `b` column for blocked processes) and `iostat -xz 1` to identify the saturated device.
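The reasoning in this scenario can be written down as a small decision rule. This is only a sketch: the 70% busy threshold and the name `classify_load` are assumptions chosen for illustration, not established cutoffs.

```python
def classify_load(load_1m, cores, cpu_util_pct, busy_threshold=70.0):
    """Rough first-pass classification of a load average reading."""
    if load_1m <= cores:
        return "not overloaded"
    if cpu_util_pct >= busy_threshold:
        return "CPU-bound"
    # More waiters than cores but idle CPUs: tasks are blocked, not computing.
    return "likely I/O-bound"


# The scenario from the question: load 24 on 4 cores, 15% CPU utilization.
print(classify_load(24, 4, 15))
print(f"load per core: {24 / 4:.1f}")
```

A result of "likely I/O-bound" is a prompt to run `vmstat 1` and `iostat -xz 1`, not a diagnosis in itself.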

Remember: Load avg = processes in run/IO-wait over 1/5/15 min. 4-core: load 4.0 = 100%.

Gotcha: Load includes I/O wait. High load + low CPU = disk bottleneck.

🔴 Hard (14)

1. High load average but low CPU usage - why?


Load average includes both runnable AND uninterruptible (D state) processes. Low CPU with high load means processes are blocked waiting on something:

Common causes:
* **I/O wait**: Disk saturation, slow storage
* **NFS latency**: Hung NFS mounts
* **Blocked threads**: Mutex contention, lock waits
* **Storage issues**: SAN latency, RAID rebuild

Remember: Load avg = processes in run/IO-wait over 1/5/15 min. 4-core: load 4.0 = 100%.

Gotcha: Load includes I/O wait. High load + low CPU = disk bottleneck.

2. What do CPU jumps mean?


An OS is always busy, whether or not you are actively using it, and in an active enterprise environment something is always going on.

Most of this activity is "bursty", meaning processes are typically quiescent with short periods of intense activity. This is certainly true of any type of network-based activity. These bursts show up as sudden jumps in CPU usage.

Remember: CPU: us(user), sy(system), wa(IO wait), st(stolen). High wa = disk problem.

3. How would you debug high load average with almost no CPU usage?


High load with low CPU indicates processes in uninterruptible sleep (D state), typically waiting on I/O.

Diagnostic approach:
1. Identify D state processes: `top` or `htop` - look for 'D' in state column
2. Check I/O metrics: `iostat -x 1` - look at %util, await, avgqu-sz
3. System overview: `vmstat 1` - check 'b' column (blocked processes)

Remember: Load avg = processes in run/IO-wait over 1/5/15 min. 4-core: load 4.0 = 100%.

Gotcha: Load includes I/O wait. High load + low CPU = disk bottleneck.

4. Load averages are above 30 on a server with 24 cores, but CPU shows around 70 percent idle. What is a common cause of this condition, and how do you debug and fix it?


Requests that involve disk I/O can be slowed greatly if the CPUs have to wait for the disk to read or write data. I/O wait is the percentage of time the CPU is idle while waiting on outstanding disk I/O.

You can confirm whether disk I/O is slowing down application performance with a few command-line tools (`top`, `atop`, and `iotop`).

Remember: Load avg = processes in run/IO-wait over 1/5/15 min. 4-core: load 4.0 = 100%.

Gotcha: Load includes I/O wait. High load + low CPU = disk bottleneck.

5. Explain interrupts and interrupt handlers in Linux.


Here's a high-level view of the low-level processing. I'm describing a simple typical architecture, real architectures can be more complex or differ in ways that don't matter at this level of detail.

When an **interrupt** occurs, the processor checks whether interrupts are masked. If they are, nothing happens until they are unmasked.

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

6. Is there a way to allow multiple cross-domains using the Access-Control-Allow-Origin header in Nginx?


Yes. Use `if` blocks to match `$http_origin` against a regex of allowed domains, then set the header dynamically:

```
location / {
    if ($http_origin ~* (^https?://([^/]+\.)*(domain1|domain2)\.com$)) {
        add_header 'Access-Control-Allow-Origin' "$http_origin";
        add_header 'Access-Control-Allow-Credentials' 'true';
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS';
    }
}
```

Key point: you cannot list multiple origins in a single `Access-Control-Allow-Origin` header. Instead, dynamically echo back the matched origin.
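The matching logic can be checked outside Nginx before deploying it. The sketch below applies the same regular expression in Python, case-insensitively like `~*`; `cors_origin_header` is a hypothetical helper, and `domain1`/`domain2` are the placeholder names from the config above.

```python
import re

# Same pattern as the Nginx config; re.IGNORECASE mirrors the ~* operator.
ALLOWED_ORIGIN = re.compile(
    r"^https?://([^/]+\.)*(domain1|domain2)\.com$", re.IGNORECASE
)


def cors_origin_header(origin):
    """Return the value to echo back in Access-Control-Allow-Origin, or None."""
    if origin and ALLOWED_ORIGIN.match(origin):
        return origin
    return None


for candidate in (
    "https://domain1.com",
    "https://api.domain2.com",
    "https://domain1.com.evil.com",
    "https://notdomain1.com",
):
    print(candidate, "->", cors_origin_header(candidate))
```

Testing look-alike origins this way helps catch an unanchored or overly loose pattern, which is a common source of CORS misconfiguration.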


7. How do you recover a deleted file that is still held open, e.g. by Apache?


A deleted file that is still open retains its inode (hard link count = 0). Linux exposes open file descriptors via `/proc/<PID>/fd/`. The symlink target shows the original path with `(deleted)` appended.

To recover: `cat /proc/<PID>/fd/<FD> > /path/to/recovered_file`

To find the fd: check `ls -l /proc/<PID>/fd/` or use `lsof | grep deleted`. You can iterate all open fds for a process or scan all processes under `/proc/[1-9]*/fd/*`.
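The recovery step can be sketched in Python as a copy from the `/proc` descriptor path to a new file. This is a Linux-only sketch; `recover_open_file` is a hypothetical helper, and reading another user's process normally requires root.

```python
import shutil


def recover_open_file(pid, fd, dest):
    """Copy the contents behind /proc/<pid>/fd/<fd> to dest and return dest."""
    source = f"/proc/{pid}/fd/{fd}"
    # Opening the /proc entry reaches the inode even though its name is gone.
    with open(source, "rb") as reader, open(dest, "wb") as writer:
        shutil.copyfileobj(reader, writer)
    return dest
```

For a file that is still being written, such as a live log, the copy is a snapshot; anything appended afterwards has to be copied again before the process closes the descriptor.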

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

8. Write two golden rules for reducing the impact of a hacked system.


1. **Principle of Least Privilege**: Run services with the minimum permissions needed. If Apache is compromised, the attacker is limited to what the `apache` user can access — not root.

2. **Principle of Separation of Privileges**: Isolate components — e.g., give the web app a read-only database account. Use SELinux or AppArmor to enforce mandatory access controls. Whitelist allowed actions rather than blacklisting bad ones to reduce attack surface.


9. Explain `:(){ :|:& };:`. How do you stop this code if you are already logged into the system?


It is a **fork bomb**. `:()` defines a function named `:`. The body `:|:&` calls itself, pipes output to another copy of itself, and backgrounds it. The final `:` executes it, causing exponential process creation.

To stop it if already logged in:
- `killall -STOP -u <user>` to freeze all of that user's processes
- If the shell can't fork: `exec killall -STOP -u <user>` (replaces the shell process)

Prevention: use PAM (`/etc/security/limits.conf`) to limit per-user process count (`nproc`).

Remember: top: PID, PR, NI, VIRT, RES, %CPU, %MEM. htop = prettier with mouse.

10. The team of admins needs your support. You must remotely reinstall the system on one of the main servers. There is no access to the management console (e.g. iDRAC). How do you install Linux onto the disk from the Linux system that is already running on it?


Use `debootstrap` to install a minimal Linux into a working directory, chroot into it, then mount and wipe the old root filesystem, restore from backup, and reinstall GRUB.

High-level steps:
1. `debootstrap` a minimal system to `/mnt/system`
2. Bind-mount `/proc`, `/sys`, `/dev` and chroot in
3. Mount the old root (e.g., `/dev/sda1`), delete old files, extract backup tarball
4. Chroot into restored system, run `grub-install` and `update-grub`
5. Reboot with `sync; reboot -f` (normal shutdown commands won't work from chroot)

Remember: Toolkit: top/htop(overview), vmstat(memory), iostat(disk), sar(historical).

Remember: USE method: Utilization, Saturation, Errors for each resource.

11. What is threat modeling and name a common framework for it.


Threat modeling is the process of identifying potential threats to a system during design. It answers: What are we building? What can go wrong? What are we going to do about it?
STRIDE is a common framework: Spoofing (identity), Tampering (data), Repudiation (deniability), Information Disclosure (confidentiality), Denial of Service (availability), Elevation of Privilege (authorization). Each maps to a security property to protect.


12. What is the difference between penetration testing and red teaming?


Penetration testing: scoped, time-boxed assessment of specific systems or applications. Goal is to find as many vulnerabilities as possible. The target team usually knows it is happening.
Red teaming: adversary simulation that tests the organization holistically (people, processes, technology). Goal is to test detection and response capabilities. Often covert, longer duration, uses social engineering and physical access. Red teams emulate real attackers; pen testers find bugs.


13. What does high si (software interrupts) in top indicate on a Kubernetes node, and why is it invisible to kubectl top?


High `si` means the kernel is spending significant time processing softirqs — typically network packet processing (NET_RX). On a container host, all pods share the host kernel's softirq handling. A pod receiving a flood of traffic drives up `si` on the host, degrading all pods. `kubectl top` only reports per-pod CPU usage and cannot see shared kernel overhead. Diagnose with `cat /proc/softirqs` and look for NET_RX growth.
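Watching for NET_RX growth is easier with per-type totals than with the raw per-CPU table. The sketch below sums each row of `/proc/softirqs`-style text and diffs two snapshots; the function names and sample counters are hypothetical.

```python
def softirq_totals(text):
    """Sum per-CPU counters for each softirq type in /proc/softirqs-style text."""
    totals = {}
    for line in text.splitlines()[1:]:  # the first line is the CPU header
        name, _, counts = line.partition(":")
        if counts.strip():
            totals[name.strip()] = sum(int(count) for count in counts.split())
    return totals


def softirq_growth(before, after):
    """Difference between two snapshots of totals, per softirq type."""
    return {name: after[name] - before.get(name, 0) for name in after}


# Illustrative snapshots taken a few seconds apart on a 2-CPU host.
first = """                    CPU0       CPU1
          HI:          1          0
       TIMER:     123456     234567
      NET_TX:        100        200
      NET_RX:      50000      60000
"""
second = first.replace("50000", "90000").replace("60000", "95000")
print(softirq_growth(softirq_totals(first), softirq_totals(second)))
```

If NET_RX dominates the growth while the other types stay flat, network receive processing is the likely source of the high `si`.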

Remember: top: PID, PR, NI, VIRT, RES, %CPU, %MEM. htop = prettier with mouse.

14. Multiple processes show state D (uninterruptible sleep) in top. What does this mean and how do you investigate?


D state means the process is waiting for the kernel to complete an I/O operation and cannot be interrupted — not even by kill -9. Multiple D-state processes signal an active I/O problem: disk failure, NFS hang, SAN timeout, or filesystem corruption.

Investigate with:
1. `cat /proc/<PID>/wchan`: kernel function it's blocked in
2. `cat /proc/<PID>/stack`: full kernel stack trace
3. `iostat -xz 1`: identify saturated devices
4. `dmesg -T`: check for hardware errors or NFS timeouts

Fix the underlying I/O problem; the processes will unblock on their own.
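Finding the D-state processes in the first place can be scripted from `/proc/<PID>/stat`, where the state letter follows the parenthesized command name. This is a minimal sketch with hypothetical function names; it parses from the last `)` because command names may contain spaces or parentheses.

```python
def parse_stat_state(stat_text):
    """Return (pid, comm, state) from the text of a /proc/<PID>/stat file."""
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")  # comm itself may contain ')'
    pid = int(stat_text[:open_paren].strip())
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def d_state_processes(stat_texts):
    """Filter a list of stat file contents down to processes in state D."""
    parsed = (parse_stat_state(text) for text in stat_texts)
    return [(pid, comm) for pid, comm, state in parsed if state == "D"]


# Illustrative stat contents (truncated after the first few fields).
samples = [
    "4242 (nfs worker) D 1 4242 4242 0 -1",
    "4243 (bash) S 1 4243 4243 0 -1",
]
print(d_state_processes(samples))
```

On a real host the inputs would come from reading every `/proc/[0-9]*/stat` file, tolerating processes that exit between listing and reading.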