Thursday, June 4, 2020

Anarcat: Replacing Smokeping with Prometheus

I've been struggling with replacing parts of my old sysadmin monitoring toolkit (previously built with Nagios, Munin and Smokeping) with more modern tools (specifically Prometheus, its "exporters" and Grafana) for a while now.

Replacing Munin with Prometheus and Grafana is fairly straightforward: the network architecture ("server pulls metrics from all nodes") is similar and there are lots of exporters. They are a little harder to write than Munin modules, but that makes them more flexible and efficient, which was a huge problem in Munin. I wrote a Migrating from Munin guide that summarizes those differences. Replacing Nagios is much harder, and I still haven't quite figured out if it's worth it.

How does Smokeping work

Leaving those two aside for now, I'm left with Smokeping, which I used in my previous job to diagnose routing issues, using Smokeping as a decentralized looking glass, which was handy to debug long term issues. Smokeping is a strange animal: it's fundamentally similar to Munin, except it's harder to write plugins for it, so most people just use it for Ping, something for which it excels at.

Its trick is this: instead of doing a single ping and returning this metrics, it does multiple ones and returns multiple metrics. Specifically, smokeping will send multiple ICMP packets (20 by default), with a low interval (500ms by default) and a single retry. It also pings multiple hosts at once which means it can quickly scan multiple hosts simultaneously. You therefore see network conditions affecting one host reflected in further hosts down (or up) the chain. The multiple metrics also mean you can draw graphs with "error bars" which Smokeping shows as "smoke" (hence the name). You also get per-metric packet loss.

Basically, smokeping runs this command and collects the output in a RRD database:

fping -c $count -q -b $backoff -r $retry -4 -b $packetsize -t $timeout -i $mininterval -p $hostinterval $host [ $host ...]

... where those parameters are, by default:

  • $count is 20 (packets)
  • $backoff is 1 (avoid exponential backoff)
  • $timeout is 1.5s
  • $mininterval is 0.01s (minimum wait interval between any target)
  • $hostinterval is 1.5s (minimum wait between probes on a single target)

It can also override stuff like the source address and TOS fields. This probe will complete between 30 and 60 seconds, if my math is right (0% and 100% packet loss).

How do draw Smokeping graphs in Grafana

A naive implementation of Smokeping in Prometheus/Grafana would be to use the blackbox exporter and create a dashboard displaying those metrics. I've done this at home, and then I realized that I was missing something. Here's what I did.

  1. install the blackbox exporter:

    apt install prometheus-blackbox-exporter
    
  2. make sure to allow capabilities so it can ping:

    dpkg-reconfigure prometheus-blackbox-exporter
    
  3. hook monitoring targets into prometheus.yml (the default blackbox exporter configuration is fine):

    scrape_configs:
      - job_name: blackbox
          metrics_path: /probe
          params:
            module: [icmp]
          scrape_interval: 5s
          static_configs:
            - targets:
              - octavia.anarc.at
              # hardcoded in DNS
              - nexthop.anarc.at
              - koumbit.net
              - dns.google
          relabel_configs:
            - source_labels: [__address__]
              target_label: __param_target
            - source_labels: [__param_target]
              target_label: instance
            - target_label: __address__
              replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.
    

    Notice how we lower the scrape_interval to 5 seconds to get more samples. nexthop.anarc.at was added into DNS to avoid hardcoding my upstream ISP's IP in my configuration.

  4. create a Grafana panel to graph the results. first, add this query:

    sum(probe_icmp_duration_seconds{phase="rtt"}) by (instance)
    
    • Set the Legend field to RTT
    • Set Draw modes to lines and Mode options to staircase
    • Set the Left Y axis Unit to duration(s)
    • Show the Legend As table, with Min, Avg, Max and Current enabled

    Then add this query, for packet loss:

    1-avg_over_time(probe_success[$__interval])!=0 or null
    
    • Set the Legend field to packet loss
    • Set a Add series override to Lines: false, Null point mode: null, Points: true, Points Radios: 1, Color: deep red, and, most importantly, Y-axis: 2
    • Set the Right Y axis Unit to percent (0.0-1.0) and set Y-max to 1

    Then set the entire thing to Repeat, on target, vertically. And you need to add a target variable like label_values(probe_success, instance).

The result looks something like this:

A plot of RTT and packet loss over time of three nodes Not bad, but not Smokeping

This actually looks pretty good!

I've uploaded the resulting dashboard in the Grafana dashboard repository.

What is missing?

Now, that doesn't exactly look like Smokeping, does it. It's pretty good, but it's not quite what we want. What is missing is variance, the "smoke" in Smokeping.

There's a good article about replacing Smokeping with Grafana. They wrote a custom script to write samples into InfluxDB so unfortunately we can't use it in this case, since we don't have InfluxDB's query language. I couldn't quite figure out how to do the same in PromQL. I tried:

stddev(probe_icmp_duration_seconds{phase="rtt",instance=~"$instance"})
stddev_over_time(probe_icmp_duration_seconds{phase="rtt",instance=~"$instance"}[$__interval])
stddev_over_time(probe_icmp_duration_seconds{phase="rtt",instance=~"$instance"}[1m])

The first two give zero for all samples. The latter works, but doesn't looks as good as Smokeping. So there might be something I'm missing.

SuperQ wrote a special exporter for this called smokeping_prober that came out of this discussion in the blackbox exporter. Instead of delegating scheduling and target definition to Prometheus, the targets are set in the exporter.

They also take a different approach than Smokeping: instead of recording the individual variations, they delegate that to Prometheus, through the use of "buckets". Then they use a query like this:

histogram_quantile(0.9 rate(smokeping_response_duration_seconds_bucket[$__interval]))

This is the rationale to SuperQ's implementation:

Yes, I know about smokeping's bursts of pings. IMO, smokeping's data model is flawed that way. This is where I intentionally deviated from the smokeping exact way of doing things. This prober sends a smooth, regular series of packets in order to be measuring at regular controlled intervals.

Instead of 20 packets, over 10 seconds, every minute. You send one packet per second and scrape every 15. This has the same overall effect, but the measurement is, IMO, more accurate, as it's a continuous stream. There's no 50 second gap of no metrics about the ICMP stream.

Also, you don't get back one metric for those 20 packets, you get several. Min, Max, Avg, StdDev. With the histogram data, you can calculate much more than just that using the raw data.

For example, IMO, avg and max are not all that useful for continuous stream monitoring. What I really want to know is the 90th percentile or 99th percentile.

This smokeping prober is not intended to be a one-to-one replacement for exactly smokeping's real implementation. But simply provide similar functionality, using the power of Prometheus and PromQL to make it better.

[...]

one of the reason I prefer the histogram datatype, is you can use the heatmap panel type in Grafana, which is superior to the individual min/max/avg/stddev metrics that come from smokeping.

Say you had two routes, one slow and one fast. And some pings are sent over one and not the other. Rather than see a wide min/max equaling a wide stddev, the heatmap would show a "line" for both routes.

That's an interesting point. I have also ended up adding a heatmap graph to my dashboard, independently. And it is true it shows those "lines" much better... So maybe that, if we ignore legacy, we're actually happy with what we get, even with the plain blackbox exporter.

So yes, we're missing pretty "fuzz" lines around the main lines, but maybe that's alright. It would be possible to do the equivalent to the InfluxDB hack, with queries like:

min_over_time(probe_icmp_duration_seconds{phase="rtt",instance=~"$instance"}[1m])
avg_over_time(probe_icmp_duration_seconds{phase="rtt",instance=~"$instance"}[1m])
max_over_time(probe_icmp_duration_seconds{phase="rtt",instance=~"$instance"}[1m])

... and setup the "fill lines" but that's not really the way things work in Prometheus/Grafana land. I'm actually satisfied with the current result.

Credits

Credits to Chris Siebenmann for his article about Prometheus and pings which gave me the avg_over_time query idea.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...