Prometheus. A Time Series database that is aimed at storing metrics. I'm going to be giving you a quick start on getting it running without digging too deep into how it works (it'd be no fun exploring it if I just explained everything, right?). As always, please see the docs for more information.
A quick note on the software and components
Prometheus: A time series database, alerting and metric server. Prometheus obtains its data by scraping HTTP endpoints.
Alert Manager: An endpoint for Prometheus' alerts which aggregates them and decides what to do with them next (where to send them, etc).
Node Exporter: One of many Exporters that are Prometheus compatible. They provide metrics over HTTP for Prometheus to scrape.
Let's get to work.
If you'd rather not build it from source, PackageCloud has a repository, which I will be using throughout this little introductory post. Install instructions are contained within the repo. You can skip this part if you prefer installing from source.
Once the repo is enabled, run the following on your monitor node to install Prometheus 2 and Alert Manager:
yum -y install prometheus2 alertmanager
On the server you're going to monitor, run:
yum -y install node-exporter
You will need to use commands appropriate for your Linux distro if you're not using CentOS 7.
Bonus: A simple Ansible playbook to deploy Node Exporter is on this blog post's git
To enable/disable the exporters you want in Node Exporter, edit
/etc/default/node_exporter. A sample of this file is provided here. You can skip it if you'd like. I've disabled a few as I don't need all the data at the moment.
Once you've got the software set up, it's time to write all the configs. A gotcha for server monitoring is that Prometheus labels instances (scrape targets) by the names they are given in the configuration file. So, in my case, to keep proper names for hosts that I don't have DNS records for, I'm going to add hosts entries for them:
220.127.116.11 some.server
18.104.22.168 some.other.server
This enables me to have Prometheus "see" these instances as the hostnames, instead of me having to explicitly specify their IP addresses, regardless of DNS.
In my setup, these hosts entries are generated automatically from the Ansible inventory, but for simplicity let's set them up manually here. Skip this step and you end up listing IP addresses as targets, so Prometheus reports alerts for the IP addresses, which isn't the best for reading.
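Since my own entries come out of the Ansible inventory, here is a rough idea of what that generation step amounts to, sketched in Python rather than Ansible. The function name, the inventory dict and the 192.0.2.x addresses (a range reserved for documentation) are all made up for illustration; swap in your own.

```python
def render_hosts_entries(inventory):
    """Turn a {hostname: ip} mapping into hosts-file lines, one per host."""
    return "".join(
        f"{ip} {hostname}\n" for hostname, ip in sorted(inventory.items())
    )


# Hypothetical inventory: placeholder IPs from the documentation range.
inventory = {
    "some.server": "192.0.2.10",
    "some.other.server": "192.0.2.11",
}

print(render_hosts_entries(inventory), end="")
```

The printed lines can be appended to the hosts file on the monitor node by whatever means you prefer.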
The Prometheus configuration file
# my global config
global:
  # Scrape every minute, default setting
  scrape_interval: 1m
  # Evaluate rules every minute, default setting
  evaluation_interval: 1m
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "monitoring.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'monitoring'
    static_configs:
    - targets:
      - 'some.server:9100'
      - 'some.other.server:9100'
Moving forward, let's create a simple alert for the 5-minute load average reaching 4 or more. Uncomment the "monitoring.yml" line in /etc/prometheus/prometheus.yml, then put the following in that monitoring.yml rules file:
groups:
- name: monitoring
  rules:
  - alert: LoadAlert
    expr: node_load5 >= 4
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "5-minute load reached 4. Send help."
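The for: 1m line is what keeps a brief spike from e-mailing you: the expression has to hold on every evaluation for a full minute before the alert moves from pending to firing. Below is a toy Python model of that behaviour as I understand it, not Prometheus' actual code; the function and its inputs are made up for illustration.

```python
def alert_states(samples, threshold=4.0, hold_seconds=60):
    """Walk (timestamp, load5) evaluations and return the alert state at each.

    Rough model of 'expr: node_load5 >= 4' with 'for: 1m':
    inactive -> pending when the expression first holds,
    pending -> firing once it has held for hold_seconds,
    back to inactive as soon as it stops holding.
    """
    states = []
    active_since = None  # timestamp at which the expression started holding
    for timestamp, load5 in samples:
        if load5 >= threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= hold_seconds else "pending")
        else:
            active_since = None
            states.append("inactive")
    return states


# One evaluation per minute, matching evaluation_interval: 1m.
print(alert_states([(0, 4.2), (60, 5.1), (120, 4.8)]))
print(alert_states([(0, 4.2), (60, 1.3), (120, 4.8)]))
```

In the second run the dip at the 60-second mark resets the clock, so the alert drops back to pending instead of firing.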
Other metrics offered by your deployed Node Exporter can be seen live on port 9100 of the monitored server, e.g. http://some.server:9100/metrics
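If you'd rather check a value from a script than eyeball the page, the exporter's output is plain text with one sample per line. Here's a small Python sketch that pulls node_load5 out of it. The helper names are mine, the sample text is a hand-written stand-in rather than real output, and it assumes the metric carries no labels (true for node_load5, not for most others).

```python
from urllib.request import urlopen


def fetch_metrics(host, port=9100, timeout=5):
    """Download the raw metrics page from a Node Exporter."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as response:
        return response.read().decode("utf-8")


def read_unlabelled_metric(metrics_text, name):
    """Return the value of a label-free metric, or None if it is absent."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # HELP/TYPE comment lines
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


# Hand-written sample in the exporter's text format, not real output.
sample = """\
# HELP node_load5 5m load average.
# TYPE node_load5 gauge
node_load5 4.37
"""

load5 = read_unlabelled_metric(sample, "node_load5")
print(load5, load5 >= 4)  # the same comparison the LoadAlert rule makes
```

Against a live host this would be read_unlabelled_metric(fetch_metrics("some.server"), "node_load5").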
For the purpose of this exercise, we're going to configure Alert Manager to send us e-mail. Open up
/etc/prometheus/alertmanager.yml and edit accordingly:
global:
  smtp_smarthost: 'mail.domain.com:PORT'
  smtp_from: "AlertManager firstname.lastname@example.org"
  smtp_auth_username: "email@example.com"
  smtp_auth_password: "PassWordGoesHere"

route:
  group_by: ['alertname', 'cluster']
  group_wait: 1m
  group_interval: 1m
  repeat_interval: 10m
  receiver: email

receivers:
- name: email
  email_configs:
  - to: 'myrecipientemailaddress'
More information on e-mail configs and other alerting methods: https://prometheus.io/docs/alerting/configuration/#email_config
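To check the e-mail settings without waiting for a server to get busy, you can hand Alert Manager an alert yourself. The Python sketch below builds one and posts it. It assumes your Alert Manager version exposes the /api/v2/alerts endpoint and listens on 127.0.0.1:9093 as in the Prometheus config above; TestAlert and both function names are made up for illustration.

```python
import json
from urllib.request import Request, urlopen


def build_test_alert(alertname="TestAlert", severity="critical"):
    """Build the JSON body for one hand-made alert (labels are required)."""
    return [
        {
            "labels": {"alertname": alertname, "severity": severity},
            "annotations": {"summary": "Test alert, please ignore."},
        }
    ]


def send_alerts(alerts, base_url="http://127.0.0.1:9093"):
    """POST alerts to Alertmanager's v2 API and return the HTTP status."""
    request = Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=5) as response:
        return response.status


print(json.dumps(build_test_alert(), indent=2))
# On the monitor node: send_alerts(build_test_alert())
```

With group_wait set to 1m, expect the e-mail roughly a minute after the alert is accepted.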
All configs done? Restart both services:
service prometheus restart; service alertmanager restart
That's it. Your 5-minute load alert is now live: once a server's load average stays at or above 4 for a minute, it fires. Check your inbox!
Now onto making more alerts! Info on the PromQL query language can be found here: https://prometheus.io/docs/prometheus/latest/querying/basics/
Tip: Once Prometheus is up and running, graphs can be viewed and set up in your browser on port 9090, i.e. http://127.0.0.1:9090 if it's set up on localhost. This is, incidentally, where this post's image came from. You can talk to Prometheus using Grafana and other tools as well!
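Those other tools get their data through Prometheus' HTTP API on the same port, and you can use it too. Here's a minimal Python sketch of an instant query. It assumes Prometheus is on 127.0.0.1:9090; the function names are mine and the sample reply is hand-written in the API's documented shape, not real output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(expr, base_url="http://127.0.0.1:9090"):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def values_by_instance(payload):
    """Map each instance label to its current value from a vector result."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


def run_query(expr):
    """Run an instant query against Prometheus and decode the JSON reply."""
    with urlopen(query_url(expr), timeout=5) as response:
        return json.load(response)


# Hand-written reply in the API's documented shape, not real output.
sample_reply = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "node_load5",
                    "instance": "some.server:9100",
                    "job": "monitoring",
                },
                "value": [1700000000.0, "0.42"],
            }
        ],
    },
}

print(query_url("node_load5 >= 4"))
print(values_by_instance(sample_reply))
```

On the monitor node, values_by_instance(run_query("node_load5")) gives the current load per scrape target.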
A note on security
By default there are no access control mechanisms, so it is up to us to make sure only we can view it. For simplicity, if you have access to a firewall wrapper such as CSF or firewall-cmd, or just plain iptables, make sure only authorized IP addresses can reach the relevant ports (9090 for Prometheus, 9093 for Alert Manager and 9100 for Node Exporter in this setup). The provided Ansible playbook does exactly that, though only for CSF and iptables for now.
- All relevant logs are saved to /var/log/messages (or syslog)
- If you need to start the service manually to troubleshoot an issue, grab the ExecStart line from the systemd unit (service prometheus status, etc) and run it by hand; the output should be a bit more verbose than the log files
Remember, all of the files mentioned here are available on this blog post's GitLab page