Monitoring

75 cards — 🟢 10 easy | 🟡 42 medium | 🔴 8 hard

🟢 Easy (10)

1. What is Grafana Enterprise?

Show answer

[Grafana Enterprise](https://grafana.com/docs/grafana/latest/enterprise/#enterprise-plugins) is a commercial edition of Grafana offered with enterprise features such as _Enterprise datasource_ plugins and built-in collaboration features. The edition includes full-time support and training from the Grafana team.

Remember: Grafana = visualization, doesn't store data. Queries Prometheus/Loki. Port 3000.

2. What is the data format for the dashboard?

Show answer

[Grafana docs](https://grafana.com/docs/grafana/latest/dashboards/json-model/): Grafana dashboards are represented as JSON objects. Each object stores metadata about a dashboard, e.g. dashboard properties, panel metadata, and variables.
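As a rough illustration of that JSON model, the sketch below builds a small dashboard object in Python and serializes it. The field names follow the commonly documented model, but the title, tag, panel, and `schemaVersion` value are made-up placeholders, not a dashboard exported from a real instance.

```python
import json

# Hypothetical minimal dashboard; all values are illustrative assumptions.
dashboard = {
    "id": None,                      # left empty so Grafana can assign one on import
    "uid": None,
    "title": "Example service overview",
    "tags": ["example"],
    "timezone": "browser",
    "schemaVersion": 16,             # placeholder; real exports carry their own value
    "version": 0,
    "time": {"from": "now-6h", "to": "now"},
    "templating": {"list": []},      # dashboard variables live here
    "panels": [
        {
            "id": 1,
            "type": "timeseries",
            "title": "Request rate",
            "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
        }
    ],
}

# Serialize to the text form that would be stored in a file or pasted on import.
dashboard_json = json.dumps(dashboard, indent=2)
round_tripped = json.loads(dashboard_json)
print(dashboard_json)
```

The round trip through `json.dumps` and `json.loads` only shows that the structure is plain JSON; it does not validate the object against Grafana itself.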

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

3. What is the default HTTP port of Grafana?

Show answer

[Grafana getting started](https://grafana.com/docs/grafana/latest/getting-started/getting-started/): Grafana runs on port 3000 by default.

Remember: Grafana = visualization, doesn't store data. Queries Prometheus/Loki. Port 3000.

4. What is the "Default configuration"?

Show answer

[Grafana docs](https://grafana.com/docs/grafana/latest/administration/configuration/): The default configuration contains the settings that Grafana uses by default. The file location depends on the OS environment; note that $WORKING_DIR refers to the working directory of Grafana.
- Windows: ```$WORKING_DIR/conf/defaults.ini```
- Linux: ```/etc/grafana/grafana.ini```
- macOS: ```/usr/local/etc/grafana/grafana.ini```

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

5. What is Grafana Cloud?

Show answer

[Grafana Cloud](https://grafana.com/products/cloud/) is an edition of Grafana that is offered as a service through the cloud. The observability stack is set up, administered, and maintained by Grafana Labs, with both free and paid options. You can also send data from existing data sources, e.g. Prometheus and Loki, and visualise existing time series data.

Remember: Grafana = visualization, doesn't store data. Queries Prometheus/Loki. Port 3000.

6. What's your monitoring experience?

Show answer

I've built and maintained large-scale monitoring stacks: Nagios XI, CloudWatch, LogicMonitor, SumoLogic, custom Bash/Python checkers, and centralized logging pipelines. My focus is on actionable alerts — signal over noise — and building automation around remediation.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

7. Explain what Grafana is

Show answer

[Grafana Docs](https://grafana.com/docs/grafana/latest/introduction): "Grafana is a complete observability stack that allows you to monitor and analyze metrics, logs and traces. It allows you to query, visualize, alert on and understand your data no matter where it is stored. Create, explore, and share beautiful dashboards with your team and foster a data driven culture."

Remember: Grafana = visualization, doesn't store data. Queries Prometheus/Loki. Port 3000.

8. What is the fundamental difference between Nagios-style and Prometheus-style monitoring?

Show answer

Nagios uses check-based (pass/fail) monitoring with active polling or push. Prometheus uses metric-based monitoring (continuous numerical values) with a pull/scrape model, multi-dimensional labels, and PromQL for dynamic queries.

Remember: Prometheus = pull-based, scrapes /metrics. Port 9090.

Under the hood: Local TSDB. Long-term: Thanos or Cortex.

9. What is node_exporter, and what type of metrics does it provide?

Show answer

node_exporter is a Prometheus exporter that runs on each host (port 9100) and exposes OS-level metrics: CPU, memory, disk, network, and I/O. It replaces Nagios checks like check_disk, check_load, and check_mem.

Remember: Exporters → Prometheus format. node_exporter, blackbox_exporter, mysqld_exporter.

10. What is the blackbox_exporter used for in a Prometheus deployment?

Show answer

The blackbox_exporter performs endpoint probes (HTTP, TCP, ICMP, DNS) from outside the target, replacing Nagios checks like check_http, check_tcp, and check_ping. It exposes metrics like probe_success and probe_http_status_code.

Remember: Prometheus = pull-based, scrapes /metrics. Port 9090.

Under the hood: Local TSDB. Long-term: Thanos or Cortex.

🟡 Medium (42)

1. How can we install plugins for Grafana?

Show answer

[Grafana getting started](https://grafana.com/docs/grafana/latest/plugins/installation/): Navigate to the [Grafana plugins page](https://grafana.com/grafana/plugins/), find the desired plugin and click on it, then click the installation tab. There are two ways to install depending on where your Grafana server is running:
- Cloud: On the **For** field of the installation tab, select the name of the organization you want to install the plugin on (unless you are only part of one), then click **install plugin**.

Remember: Grafana = visualization, doesn't store data. Queries Prometheus/Loki. Port 3000.

2. How can we import a dashboard to a Grafana instance?

Show answer

[Grafana getting started](https://grafana.com/docs/grafana/latest/dashboards/export-import/): Grafana dashboards can be imported through the Grafana UI. Click the + icon in the sidebar, then click Import. You can import a dashboard in any of the following ways:
- Upload a dashboard JSON file, which is exported from the Grafana UI or fetched through the [HTTP API](https://grafana.com/docs/grafana/latest/http_api/dashboard/#create-update-dashboard).
- Paste a Grafana dashboard URL, found at [Grafana Dashboards](https://grafana.com/grafana/dashboards/), or a dashboard's unique ID into the text area.
- Paste raw dashboard JSON text into the panel area.

Click Load afterwards.

3. Which external authentication is supported out-of-the-box?

Show answer

[Grafana docs](https://grafana.com/docs/grafana/latest/auth/overview/): Grafana Auth is the built-in authentication system with password authentication enabled by default.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

4. How can you organise your dashboards and users in Grafana?

Show answer

[Grafana docs](https://grafana.com/blog/2022/03/14/how-to-best-organize-your-teams-and-resources-in-grafana/): Grafana Labs recommends creating Folders to group dashboards, library panels, and alerts. Users can be organised through Teams, which grant permissions to the members of a group.
- [Folders](https://grafana.com/docs/grafana/latest/dashboards/dashboard_folders/): Click the + icon in the sidebar, then click "Create folder".

Remember: Grafana = visualization, doesn't store data. Queries Prometheus/Loki. Port 3000.

5. Explain how we can enforce HTTPS

Show answer

[Grafana community](https://grafana.com/docs/grafana/latest/getting-started/getting-started/): Set the protocol to _https_ in the configuration settings. Grafana will then expect clients to send requests over HTTPS, and any client that uses plain HTTP will receive an SSL/TLS error.
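To make the setting concrete, the sketch below uses Python's `configparser` to render a `[server]` section of the kind that would go into the Grafana configuration file. The `protocol`, `cert_file`, and `cert_key` keys are the documented server options as far as I know; the certificate paths are hypothetical placeholders.

```python
import configparser
import io

# Assumed [server] settings for HTTPS; the file paths are placeholders.
config = configparser.ConfigParser()
config["server"] = {
    "protocol": "https",
    "cert_file": "/etc/grafana/grafana.crt",  # hypothetical certificate path
    "cert_key": "/etc/grafana/grafana.key",   # hypothetical private key path
}

# Render the section as INI text instead of writing to a real config file.
buffer = io.StringIO()
config.write(buffer)
server_section = buffer.getvalue()
print(server_section)
```

Generating the fragment in memory keeps the example side-effect free. In practice the same three lines would be edited directly in the custom configuration file and Grafana restarted.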

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

6. Explain how we can add Custom configuration to Grafana

Show answer

[Grafana docs](https://grafana.com/docs/grafana/latest/administration/configuration/):
The custom configuration can be set either by modifying the custom configuration file or by adding environment variables that override the default configuration. The file location varies depending on the OS:
- Windows: There is a file ```sample.ini``` in the same directory as the defaults.ini file, copy sample.ini and name it ```custom.ini```.
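For the environment-variable route, Grafana's configuration documentation describes a `GF_<SectionName>_<KeyName>` naming pattern, upper-cased, with `.` and `-` replaced by `_`. The helper below is a hypothetical convenience function written for this card, not part of Grafana, that derives the variable name from a section and key.

```python
def grafana_env_var(section: str, key: str) -> str:
    """Build the GF_<SECTION>_<KEY> override name for a config setting."""

    def normalize(part: str) -> str:
        # Upper-case and replace the characters that are not valid in the name.
        return part.upper().replace(".", "_").replace("-", "_")

    return f"GF_{normalize(section)}_{normalize(key)}"


port_var = grafana_env_var("server", "http_port")
secret_var = grafana_env_var("auth.google", "client_secret")
print(port_var)
print(secret_var)
```

Exporting the resulting name with a value, for example in a container's environment, would override the matching key from the configuration file.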

Remember: Grafana = visualization, doesn't store data. Queries Prometheus/Loki. Port 3000.

7. Explain the steps to create an 'Alert'

Show answer

[Grafana docs](https://grafana.com/docs/grafana/latest/alerting/old-alerting/create-alerts/):

"Navigate to the panel you want to add or edit an alert rule for, click the title, and then click Edit. On the Alert tab, click Create Alert. If an alert already exists for this panel, then you can just edit the fields on the Alert tab. Fill out the fields. Descriptions are listed below in Alert rule fields. When you have finished writing your rule, click Save in the upper right corner to save alert rule and the dashboard. (Optional but recommended) Click Test rule to make sure the rule returns the results you expect"

8. Explain what a 'Data source' is

Show answer

[Grafana Docs](https://grafana.com/docs/grafana/latest/datasources/): A data source is a storage backend that acts as a source of data for Grafana. Some popular data sources are Prometheus, InfluxDB, Loki, and AWS CloudWatch.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

9. Explain the steps to share your dashboard with your team

Show answer

[Grafana docs](https://grafana.com/docs/grafana/latest/sharing/share-dashboard/): Go to the homepage of your Grafana instance. Click the share icon in the top navigation; three tabs are visible, with the Link tab shown first.
- Direct link: Click Copy and send the link to a Grafana user. Note that the user needs authorization to view the link, which is granted by adding the user to a team.
- Public Snapshot: Click Local Snapshot to publish a snapshot to your local Grafana instance, or Publish to snapshots.raintank.io, a free service for publishing dashboard snapshots to an external Grafana instance.

You can configure snapshots to expire after a certain time, and set the timeout value for collecting dashboard metrics.

10. What measures do you take to minimize downtime during planned maintenance activities in the data center?

Show answer

- Comprehensive Planning: Develop a detailed maintenance plan outlining tasks, timelines, and dependencies to minimize surprises during execution.
- Communication Strategy: Communicate maintenance schedules well in advance to stakeholders, including IT teams, end-users, and management, to manage expectations.
- Impact Assessment: Conduct a thorough impact assessment to understand potential disruptions and plan mitigations for critical services.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

11. How do you ensure that backups are consistent across different types of databases and applications?

Show answer

- Backup Policies and Procedures: Establish standardized backup policies and procedures that are tailored to the specific requirements of different types of databases and applications.
- Database Consistency Checks: Implement database consistency checks before initiating backups to ensure that data is in a stable and coherent state.
- Application-Aware Backups: Utilize application-aware backup solutions that understand the internal structure and dependencies of applications, ensuring consistent backups.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

12. Discuss the factors that influence capacity planning in a data center.

Show answer

- Business Growth Projections: Anticipated business growth and expansion plans significantly influence capacity planning, ensuring that the data center can accommodate increased demand.
- Seasonal Variations: Industries with seasonal variations experience fluctuating demand for resources, requiring capacity planning to address peak loads during high-demand periods.
- Technology Trends: Emerging technologies and trends, such as the adoption of new applications or services, influence capacity planning to support the integration of these technologies.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

13. How do you ensure physical security in a data center?

Show answer

Ensuring physical security in a data center is crucial for protecting sensitive equipment and data. Key measures include:
- **Access Control:**
    - Implement strict access control measures, including biometric authentication, key cards, or access codes.
    - Restrict access to only authorized personnel.
- **Surveillance Systems:**
    - Install security cameras at key locations to monitor and record activities in the data center.
    - Ensure cameras cover entry points, server racks, and other critical areas.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

14. Have you implemented any self-healing mechanisms in your data center environment, and how do they work?

Show answer

Yes, in a previous role, I implemented self-healing mechanisms using a combination of monitoring tools and automation scripts. Implementation Steps:
- Continuous Monitoring: Deployed monitoring tools to continuously monitor the health and performance of servers, applications, and networking devices.
- Threshold-based Alerts: Set up threshold-based alerts to trigger notifications when predefined thresholds for resource utilization or performance were exceeded.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

15. How do you evaluate and select vendors for data center hardware and software?

Show answer

- Vendor Reputation and Reliability: Assess the reputation and reliability of vendors by reviewing customer testimonials, industry reports, and case studies to ensure a history of delivering quality products and services.
- Technical Compatibility: Evaluate the technical compatibility of hardware and software solutions with existing infrastructure, ensuring seamless integration and minimal disruptions.
- Scalability and Future-Proofing: Consider the scalability of hardware and software solutions to accommodate future growth and technological advancements, ensuring a long-term investment.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

16. How do you stay informed about the latest developments and products in the data center industry?

Show answer

- Industry Publications: Regularly read industry publications, journals, and magazines to stay updated on the latest trends and developments.
- Online Forums and Communities: Participate in online forums and communities dedicated to data center management to engage in discussions and share insights with peers.
- Vendor Webinars and Documentation: Attend webinars hosted by data center equipment vendors and review documentation to learn about new products and features.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

17. Have you used any tools or methodologies for predicting resource usage and capacity requirements?

Show answer

- Performance Monitoring Tools: Utilized performance monitoring tools such as Nagios, Prometheus, or SolarWinds to collect and analyze real-time data on resource usage, aiding in the identification of usage patterns.
- Historical Data Analysis: Leveraged historical data analysis to identify trends and patterns in resource usage, providing insights into seasonal variations and long-term growth trends.
- Capacity Planning Tools: Employed capacity planning tools like VMware vRealize Operations or Turbonomic to model resource usage, simulate scenarios, and predict future capacity requirements.

18. How do you keep abreast of industry best practices and evolving technologies in data center management?

Show answer

- Professional Memberships: Join professional organizations and communities related to data center management to stay informed about industry trends and best practices.
- Conferences and Seminars: Attend conferences, seminars, and webinars focused on data center management to learn about the latest technologies and industry insights.
- Continuous Learning: Engage in continuous learning through online courses, certifications, and workshops to stay updated on evolving technologies.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

19. How do you troubleshoot network connectivity issues in a data center?

Show answer

Network connectivity issues in a data center can be complex, and troubleshooting involves a systematic approach:
- **Verify Physical Connections:**
    - Check physical connections, ensuring cables are securely plugged in.
    - Inspect network interface cards (NICs), switches, and routers for any physical damage.
- **Check Network Configuration:**
    - Confirm IP configurations, subnet masks, and gateway settings.
    - Ensure DNS and DHCP settings are correct.
    - Validate VLAN configurations if VLANs are in use.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

20. Describe the difference between a router and a switch.

Show answer

- **Router:**
    - Function: A router operates at Layer 3 (Network layer) of the OSI model and is responsible for routing packets between different IP networks.
    - Traffic Handling: Routers make decisions based on IP addresses, enabling them to forward traffic between networks. They can connect multiple subnets and route traffic based on logical addressing.
    - Use Case: Routers are commonly used to connect different segments of a network, including connecting a local network to the internet.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

21. How do you manage and monitor servers in a cloud-based data center?

Show answer

- Cloud Management Console: Use the cloud provider's management console to access and manage cloud resources, configure settings, and monitor overall health.
- Infrastructure as Code (IaC): Implement Infrastructure as Code (IaC) tools like Terraform or AWS CloudFormation to automate the provisioning and configuration of cloud resources.
- Cloud Monitoring Tools: Utilize native cloud monitoring tools provided by the cloud provider to track server performance, resource utilization, and operational metrics.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

22. How do you stay updated on the latest trends and technologies in data center engineering?

Show answer

- Continuous Learning Courses: Enroll in continuous learning courses and certifications to stay updated on emerging technologies in data center engineering.
- Industry Webinars: Attend webinars hosted by industry experts and organizations to gain insights into the latest trends and technologies.
- Research and Whitepapers: Regularly read research papers, whitepapers, and publications from reputable sources to stay informed about advancements in data center engineering.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

23. Discuss the role of intrusion detection and prevention systems in data center security.

Show answer

- **Intrusion Detection Systems (IDS):**
    - Monitoring: IDS monitors network or system activities for malicious activities or security policy violations.
    - Alerting: When suspicious behavior is detected, the IDS generates alerts or notifications to security administrators.
- **Intrusion Prevention Systems (IPS):**
    - Blocking and Mitigation: IPS goes beyond detection by actively preventing or blocking malicious activities. It can drop or modify packets in real-time to stop attacks.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

24. Discuss the importance of log analysis in data center operations.

Show answer

Log analysis is crucial for maintaining the health and security of a data center:
- **Issue Detection:** Logs capture events, errors, and warnings, aiding in the early detection of issues or anomalies within the data center environment.
- **Troubleshooting:** When issues arise, logs provide detailed information to troubleshoot and identify the root cause, expediting problem resolution.
- **Security:** Log analysis helps in detecting and investigating security incidents by monitoring for unusual or suspicious activities.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

25. What are the best practices for securing server rooms in a data center?

Show answer

Securing server rooms in a data center involves implementing a combination of physical and logical security measures. Best practices include:
- **Access Control:**
    - Implement strict access controls using key cards, biometric authentication, or access codes.
    - Restrict access to only authorized personnel.
- **Surveillance Systems:**
    - Install security cameras to monitor server room entrances, exits, and equipment.
    - Use motion sensors to detect unauthorized access.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

26. Explain the importance of forecasting in capacity planning.

Show answer

- Resource Optimization: Forecasting helps organizations optimize resource allocation by predicting future demands and ensuring that adequate resources are provisioned to meet those demands.
- Cost Management: Accurate forecasting contributes to effective cost management by preventing over-provisioning of resources, minimizing unnecessary expenses, and optimizing the return on investment.
- Performance Assurance: Forecasting aids in maintaining optimal system performance by anticipating increases in demand and proactively adjusting capacity to prevent performance bottlenecks.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

27. How do you approach the integration of automation scripts into existing workflows in the data center?

Show answer

- Workflow Analysis: Conduct a thorough analysis of existing workflows to identify manual or repetitive tasks that can be automated.
- Script Compatibility: Ensure that automation scripts are compatible with the existing infrastructure, applications, and tools used in the data center.
- Incremental Implementation: Integrate automation scripts incrementally, starting with less critical tasks, to allow for testing and validation without disrupting the entire workflow.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

28. Discuss the significance of a UPS (Uninterruptible Power Supply) in a data center.

Show answer

- **Continuous Power Supply:** A UPS provides an immediate and seamless switch to battery power in the event of a power outage, ensuring continuous and uninterrupted power to critical systems.
- **Prevention of Downtime:** By bridging the gap between a power outage and the activation of backup generators or restoration of utility power, a UPS prevents downtime and maintains business continuity.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

29. How do you handle a situation where multiple servers are down, and the cause is unclear?

Show answer

- Initial Triage: Quickly assess the severity of the issue and its impact on critical services and operations.
- Communication: Initiate communication with the relevant teams, including system administrators, network engineers, and support staff.
- Check Centralized Monitoring: Review centralized monitoring tools to identify common patterns or alerts across the affected servers.
- Review Recent Changes: Investigate recent changes in configurations, updates, or deployments that may have contributed to the outage.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

30. What is subnetting, and how is it used in networking?

Show answer

- **Subnetting:**
    - Subnetting is the process of dividing a larger IP network into smaller, more manageable sub-networks (subnets).
    - Subnetting allows for efficient use of IP address space, reduces broadcast domains, and enhances network security by isolating different parts of the network.
- **How it is Used:**
    - Address Organization: Subnetting divides a network into subnets, each with its own range of IP addresses. This helps in organizing and managing IP addresses more effectively.
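A short worked example may help here. The sketch below uses Python's standard `ipaddress` module to split one network into four subnets; the `192.168.0.0/24` range and the `/26` target prefix are arbitrary choices for illustration.

```python
import ipaddress

# Example network, chosen only for illustration.
network = ipaddress.ip_network("192.168.0.0/24")

# Borrow two host bits: a /24 splits into four /26 subnets.
subnets = list(network.subnets(new_prefix=26))

for subnet in subnets:
    # Subtract the network and broadcast addresses to get usable hosts.
    usable_hosts = subnet.num_addresses - 2
    print(subnet, "broadcast:", subnet.broadcast_address, "usable hosts:", usable_hosts)
```

Each `/26` has its own address range and its own broadcast address, which is the sense in which subnetting shrinks broadcast domains.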

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

31. How do you handle a situation where there's resistance to implementing a new technology or process in the data center?

Show answer

- Communication: Engage in open and transparent communication to understand the concerns and reasons behind the resistance.
- Stakeholder Involvement: Involve key stakeholders in the decision-making process to gather diverse perspectives and address specific objections.
- Educational Initiatives: Conduct educational sessions to communicate the benefits of the new technology or process, addressing any misconceptions and building awareness.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

32. How do you set up alerts for critical events in a data center environment?

Show answer

Setting up alerts ensures that critical events are promptly addressed:
- **Define Critical Events:** Identify key metrics or events that indicate potential issues or require immediate attention.
- **Select Monitoring Tools:** Use monitoring tools that support alerting and integrate with the data center environment.
- **Set Thresholds:** Define thresholds for each critical metric. When a metric surpasses the threshold, an alert is triggered.
- **Notification Channels:** Configure notification channels such as email, SMS, or integration with collaboration tools to receive alerts.

33. What certifications do you hold related to data center management?

Show answer

- Certified Data Center Professional (CDCP): The CDCP certification validates knowledge of data center design, operations, and best practices.
- Certified Data Center Specialist (CDCS): The CDCS certification focuses on advanced skills in data center design and operations.
- ITIL Foundation: The ITIL Foundation certification demonstrates understanding of IT service management, including data center processes.
- Cisco Certified Network Associate (CCNA): The CCNA certification showcases proficiency in networking, a crucial aspect of data center management.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

34. Discuss the importance of firmware and driver updates for servers.

Show answer

Firmware and driver updates are crucial for maintaining the health, security, and performance of servers in a data center. Here's why they are important:
- **Security Patching:** Firmware and driver updates often include security patches that address vulnerabilities. Regularly updating firmware and drivers helps protect servers from potential security threats and attacks.
- **Bug Fixes:** Updates can include fixes for bugs and issues identified in previous versions. Applying these updates improves the stability and reliability of server operations.

35. What measures do you take to ensure data integrity in backup processes?

Show answer

Ensuring data integrity in backup processes is critical for reliable disaster recovery:
- **Checksums and Hashing:** Use checksums or hashing algorithms to verify the integrity of backup files. A mismatch indicates data corruption.
- **Regular Validation:** Periodically validate backups by restoring a subset of data and confirming its consistency with the original.
- **Error Handling:** Implement robust error-handling mechanisms during backup operations to detect and address any issues promptly.
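The checksum step can be sketched with Python's standard `hashlib`. The example below records a SHA-256 digest for a temporary stand-in "backup" file, re-checks it, then appends a byte to simulate corruption. The file name and contents are made up, and SHA-256 is one reasonable choice rather than a requirement.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    backup_path = os.path.join(workdir, "backup.bin")  # hypothetical backup file
    with open(backup_path, "wb") as handle:
        handle.write(b"example backup payload")

    # Digest recorded at backup time, stored alongside the backup in practice.
    recorded = sha256_of(backup_path)
    unchanged_ok = sha256_of(backup_path) == recorded

    # Simulate corruption by appending a byte, then verify again.
    with open(backup_path, "ab") as handle:
        handle.write(b"!")
    corrupted_ok = sha256_of(backup_path) == recorded

print("unchanged verifies:", unchanged_ok)
print("corrupted verifies:", corrupted_ok)
```

A matching digest only shows the file has not changed since the digest was recorded; it does not prove the backup restores correctly, which is why the periodic restore test is listed separately.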

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

36. Explain the concept of SNMP (Simple Network Management Protocol) and its role in monitoring.

Show answer

SNMP is an application-layer protocol used for network management and monitoring. It facilitates the exchange of management information between network devices.
- **Components:** SNMP involves three key components: managed devices (routers, servers), agents (software installed on managed devices), and a network management system (NMS).
- **Information Retrieval:** SNMP allows the NMS to retrieve information from managed devices (e.g., performance metrics, error rates) and, in some cases, configure these devices.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

37. Explain the purpose of VLANs (Virtual Local Area Networks) in a data center.

Show answer

Virtual Local Area Networks (VLANs) are used in data centers to logically segment a physical network into multiple isolated broadcast domains. The primary purposes of VLANs include:
- **Network Segmentation:** VLANs allow the segmentation of a physical network into multiple logical networks. This segmentation is beneficial for better network management, improved performance, and increased security.
- **Broadcast Control:** In a VLAN, broadcast traffic is contained within the virtual network, reducing the overall broadcast domain size.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

38. How do you handle server capacity planning in a data center?

Show answer

Server capacity planning involves predicting future resource needs to ensure that the data center has the necessary computing resources to meet demand. The process typically includes the following steps:
- **Performance Monitoring:** Regularly monitor the performance of existing servers to identify trends and patterns in resource usage.
- **Collect Usage Data:** Gather historical data on server performance, including CPU usage, memory usage, storage capacity, and network bandwidth.
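One simple way to turn collected usage data into a prediction is a straight-line trend fit. The sketch below is an assumption-heavy illustration: the weekly disk-usage figures are invented, the 80% ceiling is arbitrary, and real workloads are rarely this linear.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: disk usage in percent, sampled once per week.
weeks = [0, 1, 2, 3, 4]
disk_used_pct = [50.0, 52.0, 54.0, 56.0, 58.0]

slope, intercept = linear_fit(weeks, disk_used_pct)

# Week (counted from the first sample) at which the trend reaches 80%.
weeks_until_80 = (80.0 - intercept) / slope
print(f"growth per week: {slope:.1f} points, reaches 80% at week {weeks_until_80:.1f}")
```

With this made-up data the fit gives 2 percentage points of growth per week, so the trend line crosses 80% at week 15. On real data the residuals would need checking before trusting such a projection.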

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

39. What are the five phases of a monitoring migration, and why is the parallel run phase the longest?

Show answer

Assessment (2-4 weeks), Foundation (2-4 weeks), Parallel Run (4-8 weeks), Cutover (1-2 weeks), Decommission (2-4 weeks). The parallel run is longest because both systems must monitor simultaneously to compare alert fidelity, tune Prometheus thresholds, train the team, and build confidence in the new system before committing.
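
To make the timeline concrete, the short sketch below adds up the phase ranges quoted above. It assumes the phases run strictly one after another, which the card does not state explicitly.

```python
# Phase durations in weeks as (min, max), taken from the answer above.
PHASES = {
    "Assessment": (2, 4),
    "Foundation": (2, 4),
    "Parallel Run": (4, 8),
    "Cutover": (1, 2),
    "Decommission": (2, 4),
}

# Assumption: phases are sequential, so the totals are simple sums.
total_min = sum(low for low, _ in PHASES.values())
total_max = sum(high for _, high in PHASES.values())
longest = max(PHASES, key=lambda name: PHASES[name][1])

print(f"Total: {total_min}-{total_max} weeks; longest phase: {longest}")
# Total: 11-22 weeks; longest phase: Parallel Run
```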

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

40. How do you translate a Nagios "check_disk -w 20% -c 10%" check to a Prometheus alerting rule?

Show answer

Use PromQL: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20 with "for: 5m" and severity: warning. Add a second rule with the same expression and < 10 with severity: critical to cover the -c 10% threshold. This continuously evaluates free disk space rather than running a point-in-time check script.
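
The check_disk flags carry two thresholds (-w for warning, -c for critical), so the translation yields one rule per severity. The sketch below builds both as Python dictionaries using the standard alerting-rule fields (`alert`, `expr`, `for`, `labels`). The `disk_rules` helper and the alert names are hypothetical, and the dictionaries would still need to be serialized into a rules file.

```python
def disk_rules(mountpoint, warn_free_pct, crit_free_pct, hold="5m"):
    """Build Prometheus-style alerting rules from check_disk thresholds."""
    # Percentage of free space on the filesystem, as in the answer above.
    free_pct = (
        f'(node_filesystem_avail_bytes{{mountpoint="{mountpoint}"}}'
        f' / node_filesystem_size_bytes{{mountpoint="{mountpoint}"}}) * 100'
    )
    rules = []
    for severity, threshold in (("warning", warn_free_pct),
                                ("critical", crit_free_pct)):
        rules.append({
            "alert": f"DiskSpaceLow{severity.capitalize()}",  # hypothetical name
            "expr": f"{free_pct} < {threshold}",
            "for": hold,
            "labels": {"severity": severity},
        })
    return rules


# Equivalent of: check_disk -w 20% -c 10% on the root filesystem.
for rule in disk_rules("/", warn_free_pct=20, crit_free_pct=10):
    print(rule["alert"], "->", rule["expr"])
```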

Remember: Prometheus = pull-based, scrapes /metrics. Port 9090.

Under the hood: Local TSDB. Long-term: Thanos or Cortex.

41. In a Zabbix-to-Prometheus migration, what replaces Zabbix triggers, templates, and host groups?

Show answer

Zabbix triggers become Prometheus alerting rules (PromQL expressions). Templates become recording rules plus Grafana dashboards. Host groups become labels (job, environment, team) which provide multi-dimensional grouping.
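
One way to picture the host-group-to-label mapping is to convert an inventory into target groups shaped like Prometheus file-based service discovery entries (a list of objects with `targets` and `labels`). The inventory, the label values, and the `to_file_sd` helper below are hypothetical. Port 9100 assumes node_exporter on its default port.

```python
import json
from collections import defaultdict

# Hypothetical inventory: host groups already resolved to label key/values.
HOSTS = {
    "web01": {"environment": "prod", "team": "web"},
    "web02": {"environment": "prod", "team": "web"},
    "db01": {"environment": "staging", "team": "data"},
}


def to_file_sd(hosts, port=9100):
    """Group hosts that share a label set into file_sd-style target groups."""
    groups = defaultdict(list)
    for host, labels in sorted(hosts.items()):
        key = tuple(sorted(labels.items()))  # hashable form of the label set
        groups[key].append(f"{host}:{port}")
    return [{"targets": targets, "labels": dict(key)}
            for key, targets in groups.items()]


print(json.dumps(to_file_sd(HOSTS), indent=2))
```

Because labels are independent dimensions, the same host can be selected by environment, by team, or by both, which a single host-group hierarchy cannot express as directly.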

Remember: Prometheus = pull-based, scrapes /metrics. Port 9090.

Under the hood: Local TSDB. Long-term: Thanos or Cortex.

42. Should you migrate historical metric data from the legacy system to Prometheus? Why or why not?

Show answer

Generally no — the data models are fundamentally different (check results vs time series), making direct migration impractical. Options: accept the data break, keep the legacy system read-only for N months for historical queries, or export key metrics to CSV for compliance/reporting.

Remember: Prometheus = pull-based, scrapes /metrics. Port 9090.

Under the hood: Local TSDB. Long-term: Thanos or Cortex.

🔴 Hard (8)¶

1. What is the difference between metrics, logs, and alerts?

Show answer

Three related but distinct things, each with its own purpose:

**Metrics**: Numerical time-series data showing trends.
* CPU usage, request latency, error rates
* Good for: dashboards, capacity planning, anomaly detection
* Aggregated, cheap to store long-term

**Logs**: Event records explaining what happened.
* Stack traces, request details, audit trails
* Good for: debugging, forensics, compliance
* Expensive at scale, usually sampled or rotated

**Alerts**: Notifications requiring human attention.
* Something is wrong NOW and needs action
* Should be actionable and rare

**Anti-pattern**: Mixing them. Dashboards full of logs. Alerts on every metric. Logs nobody reads.

Metrics show trends, logs explain why, alerts wake humans.

2. Why do alerts rot over time?

Show answer

Alerts decay because systems change but alerts don't:

**Causes**:
* Services deprecated but alerts remain
* Thresholds no longer meaningful after scaling
* Alert copied from template, never tuned
* Original author left, context lost
* Business requirements changed

**Symptoms**:
* Alerts ignored or auto-closed
* "We always silence that one"
* Nobody knows what action to take

**Solutions**:
* Alert ownership must be enforced - every alert has an owner
* Regular alert reviews (quarterly)
* Track alert → action ratio
* Delete alerts that don't result in action
* Document runbooks linked to alerts
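
The alert → action ratio from the list above can be computed from any log of alert firings that records whether a human had to act. A minimal sketch, assuming a hypothetical log format and an arbitrary 0.5 review cutoff:

```python
from collections import Counter

# Hypothetical alert log: (alert name, whether a human had to act).
FIRINGS = [
    ("DiskSpaceLow", True),
    ("DiskSpaceLow", True),
    ("HighCPU", False),
    ("HighCPU", False),
    ("HighCPU", True),
    ("LegacyBatchDown", False),
]


def action_ratios(firings):
    """Return the fraction of firings per alert that led to an action."""
    fired = Counter(name for name, _ in firings)
    acted = Counter(name for name, acted_on in firings if acted_on)
    return {name: acted[name] / fired[name] for name in fired}


def review_candidates(firings, min_ratio=0.5):
    """Alerts whose action ratio falls below the (arbitrary) cutoff."""
    ratios = action_ratios(firings)
    return sorted(name for name, ratio in ratios.items() if ratio < min_ratio)


print(review_candidates(FIRINGS))
# ['HighCPU', 'LegacyBatchDown']
```

Alerts on the resulting list are candidates for tuning or deletion at the next quarterly review.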

An unowned alert is technical debt.

3. What makes an alert bad in monitoring and how do you improve it?

Show answer

A bad alert is one that fires but doesn't require action.

**Examples of bad alerts**:
* Fires constantly (becomes noise)
* No clear remediation steps
* Requires no human intervention (self-heals)
* Too sensitive (false positives)
* Too vague ("Something is wrong")
* Pages for non-urgent issues

**Good alert criteria**:
* Actionable: Someone must DO something
* Urgent: Can't wait until business hours
* Real: High signal, low false positive rate
* Documented: Runbook exists

**Test**: If you can't write a runbook for it, it's not an alert - it's a metric to watch.

Every unnecessary alert trains people to ignore alerts.

4. How do you reduce alert fatigue?

Show answer

Alert fatigue kills incident response effectiveness:

**Fewer alerts**:
* Delete alerts that don't result in action
* Consolidate related alerts
* Alert on symptoms, not causes (user impact rather than internal metrics)

**Better thresholds**:
* Base on actual SLOs, not arbitrary numbers
* Use anomaly detection where appropriate
* Tune after every false positive

**Actionable context**:
* Include relevant metrics in alert
* Link to runbook
* Show recent changes (deployments, config)

**Proper routing**:
* Not everything pages
* Severity levels that mean something
* Escalation paths

**Measure and improve**:
* Track alert → incident correlation
* Regular review of noisy alerts
* On-call retrospectives

If everything is urgent, nothing is.
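
For the point about basing thresholds on SLOs, one common framing (an assumption here, since the card does not name a method) is the error budget: the amount of unavailability an SLO permits over a window. The helper names below are hypothetical.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO."""
    return window_days * 24 * 60 * (1 - slo)


def budget_consumed(bad_minutes, slo, window_days=30):
    """Fraction of the error budget used up by the observed bad minutes."""
    return bad_minutes / error_budget_minutes(slo, window_days)


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))   # 43.2
# A 10-minute outage uses roughly 23% of that budget.
print(round(budget_consumed(10, 0.999), 2))    # 0.23
```

Paging only when the budget is being consumed quickly ties the threshold to user impact instead of an arbitrary number.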

5. How do you design good monitoring?

Show answer

Avoid alert storms. Set thresholds based on observed behavior, not guesses. Monitor symptoms (latency, queue depth) as well as root causes. Add runbooks or automation to reduce manual intervention.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

6. How should custom Nagios NRPE plugins be handled during a migration to Prometheus?

Show answer

Custom NRPE plugins (typically shell scripts checking specific things) should NOT be ported directly. They need to become either custom Prometheus exporters (which expose metrics on an HTTP endpoint) or Pushgateway jobs (for batch/cron jobs that run and push their results). This transforms point-in-time checks into continuous metric collection.
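
To show what "expose metrics on an HTTP endpoint" can look like, here is a minimal exporter sketch using only the Python standard library. It serves one gauge in the Prometheus text exposition format. The metric name `custom_disk_free_bytes`, the checked path, and port 9101 are assumptions for illustration. A real exporter would more likely use the official Prometheus Python client library.

```python
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(path="/"):
    """Render free disk space in the Prometheus text exposition format."""
    usage = shutil.disk_usage(path)
    return (
        "# HELP custom_disk_free_bytes Free bytes on the checked path.\n"
        "# TYPE custom_disk_free_bytes gauge\n"
        f'custom_disk_free_bytes{{path="{path}"}} {usage.free}\n'
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current metric values on /metrics, 404 elsewhere."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9101):  # arbitrary port chosen for this sketch
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Unlike the NRPE script, it reports a raw value on every scrape and leaves the thresholds to alerting rules.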

Remember: Prometheus = pull-based, scrapes /metrics. Port 9090.

Under the hood: Local TSDB. Long-term: Thanos or Cortex.

7. Why is "big-bang cutover" dangerous in a monitoring migration, and what should you do instead?

Show answer

Turning off Nagios and enabling Prometheus on the same day risks missing incidents because the new system has untested coverage gaps. Instead, run both systems in parallel for at least 4 weeks, comparing alert fidelity to ensure Prometheus catches the same incidents Nagios does, then gradually cut over alerting before decommissioning legacy.
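
Comparing alert fidelity during the parallel run can be as simple as a set comparison of which incidents each system alerted on. The incident IDs and the `fidelity_report` helper below are hypothetical.

```python
# Hypothetical incident IDs each system alerted on during the parallel run.
nagios_alerts = {"INC-101", "INC-102", "INC-103", "INC-104"}
prometheus_alerts = {"INC-101", "INC-102", "INC-104", "INC-105"}


def fidelity_report(legacy, new):
    """Summarize how well the new system covers the legacy system's alerts."""
    missed = sorted(legacy - new)   # coverage gaps to close before cutover
    extra = sorted(new - legacy)    # new detections, or possibly noise
    coverage = len(legacy & new) / len(legacy) if legacy else 1.0
    return {"coverage": coverage, "missed": missed, "extra": extra}


report = fidelity_report(nagios_alerts, prometheus_alerts)
print(report)
# {'coverage': 0.75, 'missed': ['INC-103'], 'extra': ['INC-105']}
```

An empty `missed` list over several weeks is one reasonable signal that alerting can start moving to the new system.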

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.

8. What challenge does network device monitoring present during a monitoring migration?

Show answer

Nagios and Zabbix handle SNMP natively and well. Prometheus requires the snmp_exporter, which needs MIB configuration — a non-trivial setup. If this is overlooked, network devices go unmonitored during the migration gap, creating a blind spot for switch/router health, port utilization, and environmental sensors.

Remember: Stack: Prometheus(metrics) + Grafana(viz) + AlertManager(alerts) + Loki(logs).

Remember: Four Golden Signals: Latency, Traffic, Errors, Saturation.