Writing agent-based check plug-ins

1. Introduction

Check plug-ins are software modules written in Python that are executed on a Checkmk site and which create and evaluate the services on a host.

Checkmk includes over 2000 ready-made check plug-ins for all conceivable hardware and software. These plug-ins are maintained by the Checkmk team and new ones are added every week. In addition, there are further plug-ins on the Checkmk Exchange that are contributed by our users.

And yet it can always happen that a device, an application or simply a certain metric that is important to you is not yet covered by any of these existing plug-ins, perhaps simply because it was developed in your organization and no one else could have it. In the article on the introduction to developing extensions for Checkmk, you can find out what options are available to you.

This article shows you how to develop real check plug-ins for the Checkmk agent — including everything that goes with it. There is a separate article for SNMP-based check plug-ins.

1.1. The Check API documentation

Since Checkmk version 2.0.0, there has been a newly developed Check API for programming check plug-ins. We will show you how to use this Check API for plug-in programming.

You can access the Check API documentation at any time via the Checkmk user interface: Help > Developer resources > Check plugin API reference. In the new browser window, select BASE > Agent based API ("Check API") in the left-hand navigation bar:

Page for getting started with the Check API documentation.

Note for users of the Check API valid up to version 1.6.0: If you still have check plug-ins that were developed with the old API, you should migrate them to the new Check API soon. Although the old Check API will continue to be supported for a transitional period, this period will also come to an end. The advantages of the new Check API make up for the one-off migration effort, because it is more consistent, more logical, better documented and future-proof. In the blog post on the migration of check plug-ins we provide you with detailed information on the necessary steps for a migration.

1.2. Prerequisites

If you are interested in programming check plug-ins, you will need the following:

Knowledge of the Python programming language.
Experience with Checkmk, especially when it comes to agents and checks.
Practice using Linux from the command line.

1.3. Terms and definitions

There is always talk here of 'a check plug-in' and of writing it. But strictly speaking, you always need two plug-ins, the agent plug-in on the monitored host and the check plug-in on the Checkmk server. Both must be written and be compatible with each other — only then can each function smoothly in a monitoring operation.

2. Writing an agent plug-in

If you are interested in programming plug-ins for Checkmk, it is very likely that you have already set up a Checkmk server. If you have done this, you have probably also monitored your Checkmk server itself as a host.

In the following, we will assume an exemplary scenario in which the Checkmk server and the monitored host are identical. This allows us to use Livestatus queries from the host to obtain information about host groups provided by the Checkmk server.

In the example described, we will assume an organization with several locations:

Each of these locations is represented in Checkmk by a host group.
Each location has its own service team.

To ensure that the correct service team can be notified in the event of problems, each host must be assigned to a location, i.e. to a host group. The aim of this example is to set up a check which ensures that no host has been left without a host group assignment.

The whole process comprises two steps:

Reading information for monitoring from the host. This is what this chapter is about.
Writing a check plug-in in the Checkmk site that evaluates this data. We will show this in the next chapter.

So, let’s go …

2.1. Retrieving and filtering information

The first step before writing any plug-in program is research! This means you need to find out how to get the information you need for the monitoring.

For the example chosen, we use the fact that the Checkmk server is also the host. This means that initially it is sufficient to retrieve the status data via Livestatus, i.e. the data organized in tables that Checkmk holds about the monitored hosts and services in volatile memory.

OMD[mysite]:~$ lq "GET hostgroups"
action_url;alias;members;members_with_state;name;notes;notes_url;num_hosts;num_hosts_down;num_hosts_handled_problems;num_hosts_pending;num_hosts_unhandled_problems;num_hosts_unreach;num_hosts_up;num_services;num_services_crit;num_services_handled_problems;num_services_hard_crit;num_services_hard_ok;num_services_hard_unknown;num_services_hard_warn;num_services_ok;num_services_pending;num_services_unhandled_problems;num_services_unknown;num_services_warn;worst_host_state;worst_service_hard_state;worst_service_state
;Hamburg;myhost11,myhost22,myhost33;myhost11|0|1,myhost22|0|1,myhost33|0|1;Hamburg;;;3;0;0;0;0;0;3;123;10;0;10;99;0;14;99;0;24;0;14;0;2;2
;Munich;myhost1,myhost2,myhost3;myhost1|0|1,myhost2|0|1,myhost3|0|1;Munich;;;3;0;0;0;0;0;3;123;10;0;10;99;0;14;99;0;24;0;14;0;2;2
;check_mk;localhost;localhost|0|1;check_mk;;;1;0;0;0;0;0;1;66;0;0;0;4;0;1;4;61;1;0;1;0;1;1

The first line of the output contains the column names from the queried hostgroups table. The semicolon acts as a separator. The following lines then contain the contents of all columns, also separated by semicolons.
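The structure described here (one header line with column names, then one line per row, all separated by semicolons) can be read with a few lines of Python. The following is only a minimal sketch: the function name parse_livestatus_table and the shortened sample data are assumptions made for this illustration and are not part of Checkmk.

```python
def parse_livestatus_table(output):
    """Turn raw Livestatus output (header line plus data lines) into a list of dicts."""
    lines = output.strip().splitlines()
    header = lines[0].split(";")  # first line: the column names
    return [dict(zip(header, line.split(";"))) for line in lines[1:]]


# Shortened, made-up sample with only three of the columns shown above
sample = (
    "alias;members;name\n"
    "Hamburg;myhost11,myhost22,myhost33;Hamburg\n"
    "check_mk;localhost;check_mk\n"
)

rows = parse_livestatus_table(sample)
print(rows[1]["name"], rows[1]["members"])  # check_mk localhost
```

Each row can then be addressed by column name instead of by counting semicolons.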

Even in this small example, the output is hard to read and contains information that is not relevant for our purpose. In general, you should leave the interpretation of the data to Checkmk. However, pre-filtering on the host can reduce the volume of data to be transferred if not all of it is actually required. So in this case, use the Columns header to restrict the query to the relevant columns: the names of the host groups (name) and the hosts in those groups (members):

OMD[mysite]:~$ lq "GET hostgroups\nColumns: name members"
Hamburg;myhost11,myhost22,myhost33
Munich;myhost1,myhost2,myhost3
check_mk;localhost

The Livestatus interface expects to receive all commands and headers in their own separate line. The necessary line breaks are indicated by \n.

In this example, there are currently three host groups: two groups for the locations and the check_mk group, which contains a host called localhost.

The check_mk host group is a special feature within the host groups. You have not created it yourself. And you cannot actively add a host to this group. So where does this host group come from? As by definition every host in Checkmk must belong to a group, Checkmk assigns every host that you do not specifically assign to a group to the 'special' check_mk group as the default.

As soon as you have assigned a host to one of your own host groups, it will be removed by Checkmk from the check_mk group. There is also no way to reassign a host to the check_mk host group.

Exactly these properties of the check_mk group are now used for our example: Since each host should be assigned to a location, the host group check_mk should be empty. If it is not empty, action is required, i.e. the hosts in it must be assigned to the host groups and thus to their appropriate locations.

2.2. Incorporate the command into the agent

Up until now, as a site user, you have used the lq command to display the information. This is helpful for getting an understanding of the data.

However, in order to retrieve this data from the Checkmk server, the new command must become part of the Checkmk agent on the monitored host. Theoretically, you could now directly edit the Checkmk agent in the /usr/bin/check_mk_agent file and include this part. This method would however have the disadvantage that your new command would disappear again when the agent software is updated because this file will be overwritten during the update.

It is therefore better to create an agent plug-in. All you need for this is an executable file that contains the command and which is located in the /usr/lib/check_mk_agent/plugins/ directory.

And one more thing is important: The data cannot be simply output. You will still need a section header. This is a specially formatted line that contains the name of the new agent plug-in. This section header allows Checkmk to later recognize where the data from the new agent plug-in begins and where the data from the preceding plug-in ends. It is easiest when the section header and check plug-in have the same name — even if this is not mandatory.

So first of all, you will need a meaningful name for your new check plug-in. This name may only contain lower case letters (only a-z, no umlauts, no accents), underscores and digits and must be unique. Avoid name conflicts with existing check plug-ins. If you are curious which names already exist, in a Checkmk site on the command line you can list these with cmk -L:

OMD[mysite]:~$ cmk -L
3par_capacity               agent      HPE 3PAR: Capacity
3par_cpgs                   agent      HPE 3PAR: CPGs
3par_cpgs_usage             agent      HPE 3PAR: CPGs Usage
3par_hosts                  agent      HPE 3PAR: Hosts
3par_ports                  agent      HPE 3PAR: Ports
3par_remotecopy             agent      HPE 3PAR: Remote Copy
3par_system                 agent      HPE 3PAR: System
3par_volumes                agent      HPE 3PAR: Volumes
3ware_disks                 agent      3ware ATA RAID Controller: State of Disks
3ware_info                  agent      3ware ATA RAID Controller: General Information
3ware_units                 agent      3ware ATA RAID Controller: State of Units
acme_agent_sessions         snmp       ACME Devices: Agent Sessions
acme_certificates           snmp       ACME Devices: Certificates

The output here only shows the first lines of the very long list. By using prefixes, the assignment of many check plug-ins can already be easily recognized here. The use of prefixes is therefore also recommended for your own check plug-ins. Incidentally, the second column shows how the respective check plug-in obtains its data.

A suitable name for the new check plug-in for our example is myhostgroups.
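If you want to check a candidate name against the character rule above, a small regular expression is enough. This is a sketch under the assumption that the rule is exactly "lower case a-z, digits and underscores". The helper is_valid_plugin_name is hypothetical, and it does not check uniqueness against the output of cmk -L.

```python
import re

# Allowed according to the rule above: lower case a-z, digits and underscores
VALID_NAME = re.compile(r"[a-z0-9_]+")


def is_valid_plugin_name(name):
    """Return True if the name consists only of the permitted characters."""
    return VALID_NAME.fullmatch(name) is not None


for candidate in ("myhostgroups", "my_hostgroups2", "MyHostGroups", "my-hostgroups"):
    print(candidate, is_valid_plugin_name(candidate))
```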

Now you have all of the information you need to create the script for the agent plug-in. Create a new file myhostgroups as the root user in the /usr/lib/check_mk_agent/plugins/ directory:

/usr/lib/check_mk_agent/plugins/myhostgroups

#!/bin/bash

columns="name members"
site="mysite"

echo '<<<myhostgroups:sep(59)>>>'
su - ${site} lq "GET hostgroups\nColumns: ${columns}"

What does this mean in detail?

The first line contains the 'shebang' (a contraction of sharp and bang, the latter being a common nickname for the exclamation mark), by which Linux recognizes that it should execute the script with the specified shell.

To keep the script adaptable, two variables are introduced next:

the columns variable, which currently contains the group names and the associated members,
the site variable, which contains the name of the Checkmk site.

Use the echo command to output the section header. As the table columns are separated by a semicolon, use the addition sep(59) to specify that the semicolon is used as a separator for the data in the agent output. The 59 stands for the ASCII character code 59, the semicolon. Without this addition, the space character (ASCII character 32) would be used as a separator by default.
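The effect of the separator can be illustrated in plain Python. This sketch only approximates what happens to a line of agent output. It is not the code Checkmk itself uses.

```python
separator = chr(59)  # ASCII code 59
print(repr(separator))  # ';'

line = "Hamburg;myhost11,myhost22,myhost33"

# With sep(59): the line is split at every semicolon
print(line.split(separator))  # ['Hamburg', 'myhost11,myhost22,myhost33']

# Without the addition: splitting at whitespace leaves the line in one piece
print(line.split())  # ['Hamburg;myhost11,myhost22,myhost33']
```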

To be able to use the lq command, which is available to you as site user, in a script that is executed by the root user, prefix it with su.

Note: It is possible that accessing lq via su can cause problems. Alternatively, as a root user, you can also access Livestatus directly in the shell with printf or echo -e via a Unix socket. The article on Livestatus explains how to do this.
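As a rough illustration of the socket access mentioned in the note, here is the same idea in Python rather than with printf or echo -e. This is a sketch under assumptions: the socket path /omd/sites/mysite/tmp/run/live and the function names build_query and query_livestatus are chosen for this example. Consult the article on Livestatus for the details that apply to your site.

```python
import socket


def build_query(table, columns):
    """Assemble a Livestatus query: one header per line, ended by an empty line."""
    return f"GET {table}\nColumns: {' '.join(columns)}\n\n"


def query_livestatus(socket_path, query):
    """Send the query to the Livestatus Unix socket and return the raw response."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # tell Livestatus that the request is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


query = build_query("hostgroups", ["name", "members"])
print(repr(query))

# On a host with a running site (path is an assumption):
# print(query_livestatus("/omd/sites/mysite/tmp/run/live", query))
```

The actual socket call is left commented out so that the sketch can also be run on a machine without a Checkmk site.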

One more point is very important once you have created the file — make the file executable:

root@linux# chmod +x /usr/lib/check_mk_agent/plugins/myhostgroups

You can try out the agent plug-in directly by hand by entering the complete path as a command:

root@linux# /usr/lib/check_mk_agent/plugins/myhostgroups
<<<myhostgroups:sep(59)>>>
Hamburg;myhost11,myhost22,myhost33
Munich;myhost1,myhost2,myhost3
check_mk;localhost

Host groups that do not contain any hosts are not listed here.

2.3. Testing the agent

Testing and troubleshooting are the most important tasks when creating a functioning agent plug-in. It is best to proceed in three steps:

Try out the agent plug-in 'standalone'. You have just done this in the previous section.
Test the agent as a whole locally.
Retrieve the agent from the Checkmk server.

Testing the agent locally is very simple. As root, call the check_mk_agent command:

root@linux# check_mk_agent

The new section must appear somewhere in the very long output. Agent plug-ins are output by the agent at the end of processing.

You can scroll through the output by appending less (press the space bar to scroll, / to search and q to exit):

root@linux# check_mk_agent | less

Or you can search the output for the interesting lines. For example, grep has the -A option to output a few more lines after each hit. This allows you to conveniently search and output the section:

root@linux# check_mk_agent | grep -A3 '^<<<myhostgroups'
<<<myhostgroups:sep(59)>>>
Hamburg;myhost11,myhost22,myhost33
Munich;myhost1,myhost2,myhost3
check_mk;localhost

The third and final test is then directly from the Checkmk site. Include the host in the monitoring (for example as localhost), log in as the site user and then retrieve the agent data with cmk -d:

OMD[mysite]:~$ cmk -d localhost | grep -A3 '^<<<myhostgroups'

This should produce the same output as the previous command.
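Note that grep -A3 prints exactly three lines after the hit, which only fits as long as the section has three lines. If you prefer to cut out the complete section regardless of its length, the following Python sketch shows one way to do it on saved agent output. The function extract_section and the sample output (including the version number and the uptime values) are made up for this illustration.

```python
def extract_section(agent_output, section_name):
    """Return the data lines of one agent section, without the header line."""
    lines = []
    inside = False
    for line in agent_output.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            # A header looks like <<<name>>> or <<<name:option(...)>>>
            name = line[3:-3].split(":", 1)[0]
            inside = name == section_name
            continue
        if inside:
            lines.append(line)
    return lines


# Made-up excerpt of agent output for demonstration purposes
sample_output = """<<<check_mk>>>
Version: 2.2.0
<<<myhostgroups:sep(59)>>>
Hamburg;myhost11,myhost22,myhost33
Munich;myhost1,myhost2,myhost3
check_mk;localhost
<<<uptime>>>
12345.67 54321.00
"""

for line in extract_section(sample_output, "myhostgroups"):
    print(line)
```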

If this works, your agent is ready. And what have you done for this? You have created a short script under the path /usr/lib/check_mk_agent/plugins/myhostgroups and made it executable.

Everything that follows now only takes place on the Checkmk server: There you write the check plug-in.

3. Writing a simple check plug-in

Preparing the agent is only half the fun. Now you need to teach Checkmk how to handle the information from the new agent section, which services it should generate, when they should go to WARN or CRIT, etc. You can do all this by programming a check plug-in using Python.

3.1. Preparing the file

You will find a directory prepared for your own check plug-ins in the local hierarchy in the site directory. This is ~/local/lib/check_mk/base/plugins/agent_based/. Here in the path, base means the part of Checkmk that is responsible for the actual monitoring and notifications. The agent_based folder contains all of the plug-ins associated with the Checkmk agent (i.e. not notification plug-ins, for example). It is best to switch to this folder to work with:

OMD[mysite]:~$ cd local/lib/check_mk/base/plugins/agent_based

The directory belongs to the site user and you can therefore edit it. You can edit your check plug-in with any text editor installed on the Linux system.

Thus, create the myhostgroups.py file for the check plug-in here. The convention is that the file name reflects the name of the agent section. It is mandatory that the file ends with .py, because from version 2.0.0 of Checkmk the check plug-ins are always real Python modules.

An executable basic framework (Download at GitHub), which you will expand step by step in the following, looks like this:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

#!/usr/bin/env python3

from .agent_based_api.v1 import check_levels, Metric, register, Result, Service, State

def parse_myhostgroups(string_table):
    parsed = {}
    return parsed

def discover_myhostgroups(section):
    yield Service()

def check_myhostgroups(section):
    yield Result(state=State.OK, summary="Everything is fine")

register.agent_section(
    name = "myhostgroups",
    parse_function = parse_myhostgroups,
)

register.check_plugin(
    name = "myhostgroups",
    service_name = "Host group check_mk",
    discovery_function = discover_myhostgroups,
    check_function = check_myhostgroups,
)

First you need to import the functions and classes required for the check plug-ins from Python modules. The simplest method for this is import *, but you should avoid this, as it obscures which names have actually been made available.

For our example, only what is used in the rest of this article will be imported:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

from .agent_based_api.v1 import check_levels, Metric, register, Result, Service, State

3.2. Writing the parse function

The parse function has the task of 'parsing' the 'raw' agent data, i.e. analyzing and splitting it up, and putting this data into a logically-structured form that is easy for all subsequent steps to process.

As shown in the section on testing the agent, the section supplied by the agent plug-in has the following structure:

<<<myhostgroups:sep(59)>>>
Hamburg;myhost11,myhost22,myhost33
Munich;myhost1,myhost2,myhost3
check_mk;localhost

Checkmk already splits the section supplied by the agent plug-in into a list of lines. Each of these lines is in turn split into a list of words, based on the separator in the section header (in the example ;). The following data structure is therefore available in Checkmk instead of the raw data from the agent plug-in:

[
    ['Hamburg', 'myhost11,myhost22,myhost33'],
    ['Munich', 'myhost1,myhost2,myhost3'],
    ['check_mk', 'localhost']
]

In the inner list, the first element contains the name of the host group and the second the names of the hosts belonging to the group.

You can address all of this information, but only via its position in the data set. You would therefore always need to specify the number of square brackets and the 'sequence' number of the desired content(s) within each bracket. With larger volumes of data, this becomes increasingly complex and it becomes more and more difficult to maintain an overview.
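A short illustration of what positional access looks like with the data structure from above:

```python
string_table = [
    ["Hamburg", "myhost11,myhost22,myhost33"],
    ["Munich", "myhost1,myhost2,myhost3"],
    ["check_mk", "localhost"],
]

# Members of the third group: row index 2, column index 1
print(string_table[2][1])  # localhost

# Finding a group by its name means searching through the rows
members = next(line[1] for line in string_table if line[0] == "Munich")
print(members)  # myhost1,myhost2,myhost3
```

Both accesses only work if you know the order of rows and columns, which is exactly what becomes hard to keep track of as the data grows.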

At this point, the parse function offers clear advantages thanks to the structure it creates. It makes the code easier to read, the accesses are more performant and it is much easier to maintain an overview. It transforms the data structure supplied by Checkmk in such a way that you can address each of the individual values by name (or key) at will, and not be dependent on repeatedly searching through the field (array) to find what you are looking for:

{
    'Hamburg': {'members': 'myhost11,myhost22,myhost33'},
    'Munich': {'members': 'myhost1,myhost2,myhost3'},
    'check_mk': {'members': 'localhost'}
}

The convention is that the parse function is named after the agent section and begins with parse_. It receives string_table as its only argument. Note that you are not free to choose the argument here — it really must be named like this.

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

def parse_myhostgroups(string_table):
    # print(string_table)
    parsed = {}
    for line in string_table:
        parsed[line[0]] = {"members": line[1]}
    # print(parsed)
    return parsed

With def you specify in Python that a function is to be defined below. parsed = {} creates the dictionary with the improved data structure. In our example, the for loop goes through string_table line by line. The name of the host group and the members of the host group are taken from each line and assembled into an entry for the dictionary.

The dictionary is then returned with return parsed.

Note: In the example shown above, you will find two commented-out lines. If you uncomment these later when testing the check plug-in, the data before and after executing the parse function will be displayed on the command line. This allows you to check whether the function really does what it is supposed to do.
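Because the parse function is plain Python, you can also try it out without Checkmk by passing it a hand-written string_table. The following sketch repeats the function from above and feeds it the sample data of this chapter:

```python
def parse_myhostgroups(string_table):
    parsed = {}
    for line in string_table:
        parsed[line[0]] = {"members": line[1]}
    return parsed


# Hand-written input in the format Checkmk passes to the parse function
string_table = [
    ["Hamburg", "myhost11,myhost22,myhost33"],
    ["Munich", "myhost1,myhost2,myhost3"],
    ["check_mk", "localhost"],
]

section = parse_myhostgroups(string_table)
print(section["check_mk"]["members"])  # localhost
print(section.get("Berlin"))  # None, because there is no such group
```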

3.3. Registering the agent section

In order for the whole thing to take effect, you must make the parse function, and the new agent section in general, known to Checkmk. To do this, call a registration function:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

register.agent_section(
    name = "myhostgroups",
    parse_function = parse_myhostgroups,
)

Here it is important that the name of the section exactly matches the section header in the agent output. From this moment on, every check plug-in that uses the myhostgroups section receives the return value of the parse function. As a rule, this will be the check plug-in of the same name. But other check plug-ins can also subscribe to this section, as we will show in the extension of the check plug-in.

By the way: If you want to know exactly how this works, you can take a look at the Check API documentation at this point. There you will find a detailed description of this registration function — and also of the functions and objects that will be used later in this article.

Check API documentation for the registration function 'agent_section'.

3.4. Registering the check plug-in

In order for Checkmk to know that there is a new check plug-in, this must be registered. It is done by calling the register.check_plugin function. You must always specify at least four things:

name: The name of the check plug-in. The easiest way to do this is to use the same name as your new agent section. This way, the check defined later in the check function automatically knows which section it should evaluate.
service_name: The name of the service as it should then appear in the monitoring.
discovery_function: The function for discovering services of this type (more on this shortly).
check_function: The function for performing the actual check (more on this shortly).

For the check plug-in it will look like this:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

register.check_plugin(
    name = "myhostgroups",
    service_name = "Host group check_mk",
    discovery_function = discover_myhostgroups,
    check_function = check_myhostgroups,
)

It is best not to try this out just yet, because you first need to write the functions discover_myhostgroups and check_myhostgroups, since these must appear in the source code before the above registration.

3.5. Writing the discovery function

A special feature of Checkmk is the automatic discovery of services to be monitored. For this to work, each check plug-in must define a function that uses the agent output to recognize whether a service of this type or which services of this type should be created for the host in question.

The discovery function is always called when a service discovery is carried out for a host. It then decides whether or which services should be created. In the standard case, it receives exactly one argument with the name section. This contains the agent section data in a format prepared by the parse function.

Therefore, implement the following simple logic: If the myhostgroups agent section exists, then also create a suitable service. This will then automatically appear on all hosts on which the agent plug-in is deployed.

For check plug-ins that only create one service per host, no further information is required:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

def discover_myhostgroups(section):
    yield Service()

The discovery function must return an object of the Service type for each service to be created using yield (not with return). In Python, yield has a similar function to return: both pass a value to the calling function. The decisive difference is that yield remembers how far the function has come in data processing. The next call continues after the last yield statement and does not start at the beginning again. This means that not only the first hit is read out (as would be the case with return), but all hits in sequence (this advantage will become relevant later in our example with service discovery).
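The difference between return and yield is easiest to see side by side. The two functions in this sketch are purely illustrative and have nothing to do with the Check API:

```python
def first_only(groups):
    for name in groups:
        return name  # ends the function at the first hit


def all_of_them(groups):
    for name in groups:
        yield name  # hands over one value and resumes here on the next request


groups = ["Hamburg", "Munich", "check_mk"]

print(first_only(groups))  # Hamburg
print(list(all_of_them(groups)))  # ['Hamburg', 'Munich', 'check_mk']
```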

3.6. Writing the check function

You can now move on to the actual check function, which uses the current agent output to decide which state the service should assume and can output further information.

The aim of the check function is to verify whether a host group has been assigned for every host. To do this, it checks whether the check_mk host group contains hosts. If this is the case, the service should receive the status CRIT. If not, everything is OK and so is the state of the service.

Here is the implementation:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

def check_myhostgroups(section):
    attr = section.get("check_mk")
    hosts = attr["members"] if attr else ""
    if hosts:
        yield Result(state=State.CRIT, summary=f"Default group is not empty; Current member list: {hosts}")
    else:
        yield Result(state=State.OK, summary="Everything is fine")

And now the explanation: The check_myhostgroups() function first fetches the value belonging to the check_mk key into the attr variable. Then the hosts variable is linked to the members value if it exists. If there are no members, hosts remains empty.

This is followed by an if query for the actual evaluation:

If the hosts variable has content, i.e. the check_mk host group is not empty, the status of the service goes to CRIT and an advisory text is output. This text also contains a list of the host names of all hosts that are in the check_mk host group. The Python F-String is used to output the text with expressions, which is so named because the string is preceded by the letter f.
If the hosts variable is empty, i.e. there are no hosts in the check_mk host group, the status of the service instead changes to OK. In this case, an appropriate message text is also output.
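To try out this logic outside of a Checkmk site, the sketch below replaces Result and State with minimal stand-ins. These stand-ins are assumptions for illustration only and do not reproduce the real classes of the Check API. The check function itself is the one shown above.

```python
from enum import Enum
from typing import NamedTuple


class State(Enum):  # stand-in, not the real Check API class
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


class Result(NamedTuple):  # stand-in, not the real Check API class
    state: State
    summary: str


def check_myhostgroups(section):
    attr = section.get("check_mk")
    hosts = attr["members"] if attr else ""
    if hosts:
        yield Result(state=State.CRIT, summary=f"Default group is not empty; Current member list: {hosts}")
    else:
        yield Result(state=State.OK, summary="Everything is fine")


# Case 1: no check_mk group in the section, case 2: check_mk group with one host
results_empty = list(check_myhostgroups({"Hamburg": {"members": "myhost11"}}))
results_filled = list(check_myhostgroups({"check_mk": {"members": "localhost"}}))

print(results_empty[0].summary)
print(results_filled[0].summary)
```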

Once the check function has been created, the check plug-in is ready.

The check plug-in as well as the agent plug-in have been made available on GitHub.

3.7. Testing and activating the check plug-in

Testing and activation are carried out on the command line with the cmk command.

First try the service discovery with the option -I. Adding the v option (for verbose) requests detailed output. The --detect-plugins option restricts the command execution to this check plug-in, and specifying localhost restricts it to this host:

OMD[mysite]:~$ cmk -vI --detect-plugins=myhostgroups localhost
Discovering services and host labels on: localhost
localhost:
+ FETCHING DATA
[TCPFetcher] Execute data source
[PiggybackFetcher] Execute data source
No piggyback files for 'localhost'. Skip processing.
No piggyback files for '127.0.0.1'. Skip processing.
+ ANALYSE DISCOVERED HOST LABELS
SUCCESS - Found no new host labels
+ ANALYSE DISCOVERED SERVICES
+ EXECUTING DISCOVERY PLUGINS (1)
  1 myhostgroups
SUCCESS - Found 1 services

As planned, the service discovery recognizes a new service in the myhostgroups check plug-in.

Now you can try out the check contained in the check plug-in:

OMD[mysite]:~$ cmk --detect-plugins=myhostgroups -v localhost
+ FETCHING DATA
[TCPFetcher] Execute data source
[PiggybackFetcher] Execute data source
No piggyback files for 'localhost'. Skip processing.
No piggyback files for '127.0.0.1'. Skip processing.
Host group check_mk   Default group is not empty; Current member list: localhost
[agent] Success, [piggyback] Success (but no data found for this host), execution time 1.3 sec | execution_time=1.330 user_time=0.010 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=1.330

By executing the check, the status of the service found previously will be determined.

If everything went as expected, you can activate the changes. If not, you will find helpful information in the chapter on troubleshooting.

Finally, activate the changes by restarting the monitoring core:

OMD[mysite]:~$ cmk -R
Generating configuration for core (type nagios)...
Precompiling host checks...OK
Validating Nagios configuration...OK
Restarting monitoring core...OK

In the Checkmk monitoring, you will now find the new service Host group check_mk at the host localhost:

The new service created in the monitoring by the check plug-in.

Since the host group check_mk is not empty, the service is CRIT.

Congratulations on the successful creation of your first check plug-in!

4. Extending the check plug-in

4.1. Preparatory work

The recently completed first check plug-in is now to be extended step by step. So far, the agent plug-in has only provided information on the names and members within the host groups. In order to be able to evaluate the status of the hosts and the services running on them, for example, more data is required.

Extending the agent plug-in

You will first extend the agent plug-in once to collect all of the information that will be needed to extend the check plug-in in the following sections.

To find out what information Checkmk provides for host groups, you can query all available columns of the host group table with the following command as site user:

OMD[mysite]:~$ lq "GET columns\nFilter: table = hostgroups\nColumns: name"
action_url
alias
members
members_with_state
name
notes
notes_url
num_hosts
...

The output goes even further. The table has almost 30 columns, and most of them have meaningful names. The following columns are of interest here: Number of hosts per group (column num_hosts), number of hosts in the UP state (num_hosts_up), number of services of all hosts in the group (num_services) and number of services in the OK state (num_services_ok).

Now these new columns only need to be supplied by the agent. You can achieve this by extending the agent plug-in created in the previous chapter.

As the root user, edit the agent plug-in’s script. Since the script has already put the configurable values into variables, it is sufficient to only change the line beginning with columns and enter the four additional columns retrieved there:

/usr/lib/check_mk_agent/plugins/myhostgroups

#!/bin/bash

columns="name members num_hosts num_hosts_up num_services num_services_ok"
site="mysite"

echo '<<<myhostgroups:sep(59)>>>'
su - ${site} lq "GET hostgroups\nColumns: ${columns}"

Run the script to verify this:

root@linux# /usr/lib/check_mk_agent/plugins/myhostgroups
<<<myhostgroups:sep(59)>>>
Munich;myhost3,myhost2,myhost1;3;3;180;144
Hamburg;myhost22,myhost33,myhost11;3;2;132;105
check_mk;SQL-Server,localhost;2;2;95;83

The four new values — each separated by a semicolon — now appear at the end of each line.

With this change, the agent plug-in now provides different data than before. At this point, it is important to make sure that with the changed data, the check plug-in still does what it is supposed to do.

Extending the parse function

In a check plug-in, the parse function is responsible for converting the data supplied by the agent plug-in. When writing the parse function, you only considered two columns of the host group table. Now six columns are supplied instead of two. The parse function must therefore be customized to process the additional four columns.

As the site user, change the parse function in the myhostgroups.py file, which contains the check plug-in:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

def parse_myhostgroups(string_table):
    parsed = {}
    column_names = [
        "name",
        "members",
        "num_hosts",
        "num_hosts_up",
        "num_services",
        "num_services_ok",
    ]
    for line in string_table:
        parsed[line[0]] = {}
        for n in range(1, len(column_names)):
            parsed[line[0]][column_names[n]] = line[n]
    return parsed

Everything between parsed = {} and return parsed has been changed here. First, the columns to be processed are defined under their names as a column_names list. A dictionary is then created in the for loop by generating the key-value pairs in each line from the column name and the value read.

This extension is not critical for the existing check function, as the data structure in the first two columns remains unchanged. Only additional columns are being provided, which are not (yet) evaluated in the check function.
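You can convince yourself of this by running the extended parse function on the new agent output by hand. The sketch repeats the function from above and uses the sample values from the extended agent plug-in:

```python
def parse_myhostgroups(string_table):
    parsed = {}
    column_names = [
        "name",
        "members",
        "num_hosts",
        "num_hosts_up",
        "num_services",
        "num_services_ok",
    ]
    for line in string_table:
        parsed[line[0]] = {}
        for n in range(1, len(column_names)):
            parsed[line[0]][column_names[n]] = line[n]
    return parsed


string_table = [
    ["Munich", "myhost3,myhost2,myhost1", "3", "3", "180", "144"],
    ["Hamburg", "myhost22,myhost33,myhost11", "3", "2", "132", "105"],
    ["check_mk", "SQL-Server,localhost", "2", "2", "95", "83"],
]

section = parse_myhostgroups(string_table)
print(section["check_mk"]["members"])  # SQL-Server,localhost
print(section["Hamburg"]["num_hosts_up"])  # 2
```

The members key is still where the existing check function expects it. Note that the numeric columns arrive as strings, so they need to be converted (for example with int()) before you can calculate with them.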

Now that the new data can be processed, you will also use this data.

4.2. Service discovery

In the example you have built a very simple check that creates a service on a host. However, it is also very common for there to be several services from a single check on a host.

The most common example of this is a service for a file system on a host. The check plug-in with the name df creates one service per file system on the host. In order to distinguish these services, the mount point of the file system (for example /var) or the drive letter (for example C:) is incorporated into the name of the service. This results in a service name such as Filesystem /var or Filesystem C:. The word /var or C: is referred to here as an item. We are therefore also talking about a check with items.

If you want to build a check with items, you must implement the following functions:

The discovery function must generate a service for each of the items that should be monitored on the host.
You must include the item in the service name using the placeholder %s (e.g. "Filesystem %s").
The check function is called once, separately for each item, and receives this as an argument. From the agent data it must then fish out the relevant data for this item.
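The effect of the placeholder can be pictured with ordinary Python string formatting. This is purely an illustration with hypothetical variable names; Checkmk performs the substitution itself when it creates the services.

```python
# Template as it would be passed as service_name, and two example items.
service_name_template = "Filesystem %s"
items = ["/var", "C:"]

# One service name per item.
service_names = [service_name_template % item for item in items]
print(service_names)
```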

To test this in practice, you will create a separate service for each existing host group.

Since the myhostgroups check plug-in created in the previous chapter for checking the standard check_mk group should continue to function, this check plug-in remains as it is. For the extension, create a new check plug-in in the file myhostgroups.py — in the first step, in the same way as previously, by registering the plug-in.

Important: The new registration is made in addition to the existing one, the registration shown in the previous chapter remains unchanged. Here is just the new code:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

...

register.check_plugin(
    name = "myhostgroups_advanced",
    sections = ["myhostgroups"],
    service_name = "Host group %s",
    discovery_function = discover_myhostgroups_advanced,
    check_function = check_myhostgroups_advanced,
)

So that the new check plug-in can be distinguished from the old one, it is given a unique name with myhostgroups_advanced. The sections parameter determines the sections of the agent output that the check plug-in subscribes to. Here, myhostgroups is used to specify that the new check plug-in uses the same data as the old one: the section of the agent plug-in prepared by the parse function. The service name now contains the placeholder %s. The name of the item is later inserted at this point by Checkmk. In the last two lines, the names for the new discovery function and the new check function are defined, both of which still need to be written.

First, the discovery function, which now has the task of determining the items to be monitored. It is also added alongside the existing one:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

def discover_myhostgroups_advanced(section):
    for group in section:
        if group != "check_mk":
            yield Service(item=group)

As before, the discovery function receives the section argument. The individual host groups are run through in a loop. All host groups are of interest here, with the exception of check_mk, as this special host group has already been taken care of by the existing myhostgroups check plug-in. Whenever an item is found, it is returned with yield, which creates an object of the type Service that in turn receives the host group name as an item.
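The behavior of the discovery function can be checked without a Checkmk site by using a small stand-in for Service. The dataclass below is an assumption made for this sketch and only mimics the item argument; in the real plug-in, Service comes from the Check API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Minimal stand-in for the Check API class of the same name."""
    item: str


def discover_myhostgroups_advanced(section):
    for group in section:
        if group != "check_mk":
            yield Service(item=group)


# Parsed section reduced to its keys; the attribute dictionaries are not
# needed for discovery.
section = {"Munich": {}, "Hamburg": {}, "check_mk": {}}
discovered = list(discover_myhostgroups_advanced(section))
print(discovered)
```

With this input, one Service object each is produced for Munich and Hamburg, while check_mk is skipped.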

If the host is monitored later, the check function is called separately for each service — and therefore for each item. This brings you to the definition of the check function for the new myhostgroups_advanced check plug-in. The check function receives the item argument in addition to the section. The first line of the function then looks like this:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

def check_myhostgroups_advanced(item, section):

The algorithm for the check function is simple: If the host group exists, the service is set to OK and the number and names of the hosts in the group are listed. The complete function for this:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

def check_myhostgroups_advanced(item, section):
    attr = section.get(item)
    if attr:
        yield Result(state=State.OK, summary=f"{attr['num_hosts']} hosts in this group: {attr['members']}")

The check result is delivered by returning an object of the class Result via yield. This requires the parameters state and summary. Here, state defines the state of the service (in the example OK) and summary the text that is displayed in the Summary of the service. This is purely informative and is not evaluated further by Checkmk. You can find out more about this in the next section.

So far, so good. But what happens if the item you are looking for is not found? This can happen if a service has already been created for a host group in the past, but this host group has now disappeared — either because the host group still exists in Checkmk but no longer contains a host, or because it has been deleted completely. In both cases, this host group will no longer be present in the agent output.

The good news: Checkmk takes care of this! If the item being searched for is not found, Checkmk automatically generates the result UNKNOWN - Item not found in monitoring data for the service. This is intentional and a good thing. If the item is not found, you can simply let Python run off the end of the function without yielding anything and let Checkmk do its work.

Checkmk only knows that the item that was there before is now gone. Checkmk does not know the reason for this — but you do. It is therefore appropriate to not keep such knowledge to yourself and to intercept this condition in the check function and to display a helpful message.

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

def check_myhostgroups_advanced(item, section):
    attr = section.get(item)
    if not attr:
        yield Result(state=State.CRIT, summary="Group is empty or has been deleted")
        return

    yield Result(state=State.OK, summary=f"{attr['num_hosts']} hosts in this group: {attr['members']}")

What has changed? The error condition is now handled first. Therefore, check in the if branch if the item really does not exist, set the status to CRIT and exit the function with return. In all other cases, return OK as before.

This means that you have adopted the disappearing host groups situation into the check function. Instead of UNKNOWN, the associated service will now be CRIT and contain information on the cause of the critical status.
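Both branches of the check function can be exercised in isolation. The State and Result classes below are simplified stand-ins written for this sketch (the real ones come from the Check API), and the section dictionary is a hand-made example in which the Hamburg group no longer appears.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    """Stand-in for the Check API states."""
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


@dataclass(frozen=True)
class Result:
    """Stand-in that only carries state and summary."""
    state: State
    summary: str


def check_myhostgroups_advanced(item, section):
    attr = section.get(item)
    if not attr:
        yield Result(state=State.CRIT, summary="Group is empty or has been deleted")
        return

    yield Result(state=State.OK, summary=f"{attr['num_hosts']} hosts in this group: {attr['members']}")


section = {"Munich": {"members": "myhost3,myhost2,myhost1", "num_hosts": "3"}}
present = list(check_myhostgroups_advanced("Munich", section))
missing = list(check_myhostgroups_advanced("Hamburg", section))
print(present[0].summary)
print(missing[0].summary)
```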

This completes the new check plug-in as an extension of the old one. The extended agent plug-in and the extended file for the check plug-ins can again be found on GitHub. The latter contains the simple myhostgroups check plug-in from the previous chapter, the extended parse function and the components of the new myhostgroups_advanced check plug-in with its registration, discovery function and check function. Note that the functions must always be defined before registration so that there are no errors caused by undefined function names.

As the new myhostgroups_advanced check plug-in provides new services, you must perform a service discovery for this check plug-in and activate the changes in order to see these services in the monitoring:

Figure: Two new services in the monitoring.

Proceed as described in the simple check plug-in chapter.

4.3. Summary and details

In the monitoring of Checkmk, each service has a status (OK, WARN, etc.) as well as a line of text. This text is located in the Summary column, as can be seen in the previous screenshot, and therefore has the task of providing a brief summary of the service’s status. The concept is that this text should not exceed a length of 60 characters. That ensures a concise table display without annoying line breaks.

There is also the Details field, in which all details on the status of the service are displayed, which also includes all of the summary information. Clicking on the service opens the service page, in which the two fields Summary and Details can be seen alongside many others.

When calling yield Result(...), you can determine which information is so important that it should be displayed in the summary, and for which it is sufficient for it to appear in the details.

In our example, you have always used a call of the following type:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

    yield Result(state=State.OK, summary=f"{attr['num_hosts']} hosts in this group: {attr['members']}")

This means that the text defined as summary always appears in the Summary — and also in the Details. You should therefore only use this for important information. If a host group contains many hosts, the summary list can become very long — longer than the recommended 60 characters. If a piece of information is of secondary importance, you can use details to specify that its text only appears in the details:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

    yield Result(
        state = State.OK,
        summary = f"{attr['num_hosts']} hosts in this group",
        details = f"{attr['num_hosts']} hosts in this group: {attr['members']}",
    )

In the example above, the list of hosts is therefore only displayed in the Details. The Summary then only shows the number of hosts in the group:

Figure: Different content for summary and details in the monitoring.

In addition to summary and details, there is a third parameter. With notice you specify that a text for a service in the OK state is only displayed in the details, but also in the summary for all other states. This makes it immediately clear from the summary why the service is not OK. The notice parameter is not particularly useful if texts are permanently bound to states, as in our example so far.

In summary, this means:

The total text for the summary should not be longer than 60 characters for services that are OK.
Always use either summary or notice — at least one or the other, but not both.
If required, add details if the text for the details is to be an alternative.
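The rules above can be read as a small decision function. The following is a simplified model derived only from the description in this section, not Checkmk's actual implementation; the function render_texts and its parameters are hypothetical.

```python
def render_texts(is_ok, summary=None, notice=None, details=None):
    """Return (summary_text, details_text) for a single result."""
    # Exactly one of summary and notice must be given.
    if (summary is None) == (notice is None):
        raise ValueError("use either summary or notice, but not both")

    main_text = summary if summary is not None else notice

    # A summary is always shown; a notice reaches the summary only
    # when the state is not OK.
    if summary is not None or not is_ok:
        summary_text = main_text
    else:
        summary_text = ""

    # If details is given, it is shown in the details instead of the main text.
    details_text = details if details is not None else main_text
    return summary_text, details_text


print(render_texts(True, summary="3 hosts in this group",
                   details="3 hosts in this group: myhost3,myhost2,myhost1"))
print(render_texts(True, notice="UP hosts: 100.00%"))
print(render_texts(False, notice="UP hosts: 66.67%"))
```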

4.4. Multiple partial results per service

To prevent the number of services on a host from increasing excessively, several partial results are often combined in one service. For example, the Memory service under Linux not only checks RAM and swap usage, but also shared memory, page tables and all sorts of other things.

The Check API provides a very convenient interface for this. A check function can simply generate a result with yield as often as required. The overall status of the service is then based on the worst partial result in the order OK → WARN → UNKNOWN → CRIT.
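The ordering used to determine the overall status can be sketched in a few lines. Checkmk does this aggregation internally; the State enum here is a stand-in and the helper worst_state is a hypothetical name chosen for this illustration.

```python
from enum import Enum


class State(Enum):
    """Stand-in for the Check API states."""
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


# Severity order as described above: OK -> WARN -> UNKNOWN -> CRIT.
SEVERITY_ORDER = [State.OK, State.WARN, State.UNKNOWN, State.CRIT]


def worst_state(states):
    """Return the most severe state of a non-empty list of partial results."""
    return max(states, key=SEVERITY_ORDER.index)


print(worst_state([State.OK, State.WARN, State.OK]))
print(worst_state([State.WARN, State.UNKNOWN]))
print(worst_state([State.UNKNOWN, State.CRIT]))
```

Note that UNKNOWN ranks between WARN and CRIT in this ordering, which is why the sketch uses an explicit list rather than comparing the numeric values.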

In our example, use this option to define two additional results for each service of the host groups in addition to the existing result. These evaluate the percentage of hosts in the UP state and the services in the OK state. You use the additional columns of the host group table previously defined in the agent output and the parse function.

Now expand the check function in stages from top to bottom:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

def check_myhostgroups_advanced(item, section):
    attr = section.get(item)
    if not attr:
        yield Result(state=State.CRIT, summary="Group is empty or has been deleted")
        return

    members = attr["members"]
    num_hosts = int(attr["num_hosts"])
    num_hosts_up = int(attr["num_hosts_up"])
    num_services = int(attr["num_services"])
    num_services_ok = int(attr["num_services_ok"])

The if branch remains unchanged, i.e. the new partial results only apply to host groups that also exist. You then define five variables for the columns in the host group table contained in the section. On the one hand, this increases readability and, on the other hand, you can convert the strings read into numbers with int() for the four columns that are still to be used for the calculation.

The only existing result remains (almost) unchanged:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

    yield Result(
        state = State.OK,
        summary = f"{num_hosts} hosts in this group",
        details = f"{num_hosts} hosts in this group: {members}",
    )

The only difference is that the expressions in the Python f-string are now simpler than before, since the values from attr are already available in the variables defined above.

Now to the actual core of the extension, the definition of a result that implements the following statement: "The service of the host group is WARN when 90 % of the hosts are UP, and CRIT when 80 % of the hosts are UP." The convention here is that the check goes to WARN or CRIT as soon as the threshold is reached - and not only when it is exceeded. The Check API provides the auxiliary check_levels function for comparing a determined value with threshold values.

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

    hosts_up_perc = 100.0 * num_hosts_up / num_hosts
    yield from check_levels(
        hosts_up_perc,
        levels_lower = (90.0, 80.0),
        label = "UP hosts",
        notice_only = True,
    )

In the first line, the percentage is calculated from the total number and number of hosts in the UP state and stored in the hosts_up_perc variable. The single slash (/) executes a floating point division, which ensures that the result is a float value. This is useful because some of the functions used later on expect float as input.

In the second line, the result of the check_levels function is then returned as an object of the type Result. This function is fed with the percentage just calculated as a value (hosts_up_perc), the two lower threshold values (levels_lower), a label that precedes the output (label) and finally with notice_only=True.

The last parameter makes use of the notice parameter already introduced in the previous section for the Result() object. With notice_only=True you specify that the text for the service is only displayed in the Summary if the status is not OK. However, partial results that lead to a WARN or CRIT will in any case always be visible in the summary — regardless of the value of notice_only.
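The percentage calculation from the first line can be tried on the host columns of the sample agent output. The helper hosts_up_percentage and the sample dictionary are hypothetical and only serve this illustration; check_levels itself is not modeled here.

```python
# Host columns of two groups from the sample agent output, still as strings.
sample = {
    "Munich": {"num_hosts": "3", "num_hosts_up": "3"},
    "Hamburg": {"num_hosts": "3", "num_hosts_up": "2"},
}


def hosts_up_percentage(attr):
    num_hosts = int(attr["num_hosts"])
    num_hosts_up = int(attr["num_hosts_up"])
    # The single slash performs a floating point division.
    return 100.0 * num_hosts_up / num_hosts


percentages = {group: hosts_up_percentage(attr) for group, attr in sample.items()}
for group, perc in percentages.items():
    print(f"{group}: {perc:.2f} % of hosts UP")

# For comparison: integer division would silently drop the fraction.
truncated = 100 * 2 // 3
```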

Finally, you define the third result in the same way as the second, which evaluates the percentage of services in the OK state:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

    services_ok_perc = 100.0 * num_services_ok / num_services
    yield from check_levels(
        services_ok_perc,
        levels_lower = (90.0, 80.0),
        label = "OK services",
        notice_only = True,
    )

This completes the check function.

The service for a host group now evaluates three results and displays the worst status from these in the monitoring, as in the following example:

Figure: The check function summary displays the text for the critical status.

4.5. Metrics

Not always, but often, checks deal with numbers — and these numbers are very often measured or calculated values. In our example, the number of hosts in the host group (num_hosts) and the number of hosts in the UP state (num_hosts_up) are the measured values. The percentage of hosts in the UP state (hosts_up_perc) is a value calculated from these values. If such a value can then be displayed over a time range, it is also referred to as a metric.

With its built-in graphing system, Checkmk has a component for storing, evaluating and displaying such values. This is completely independent of the calculation of the OK, WARN and CRIT states.

In this example, you will define the two calculated values, hosts_up_perc and services_ok_perc, as metrics. Metrics will be immediately visible in Checkmk’s graphical user interface without you having to do anything. A graph is automatically generated for each metric.

Metrics are determined by the check function and returned as an additional result. The easiest way is to pass the metric information in the check_levels() call.

As a reminder, the line with the check_levels() function call from the previous section follows here:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

    yield from check_levels(
        hosts_up_perc,
        levels_lower = (90.0, 80.0),
        label = "UP hosts",
        notice_only = True,
    )

The two new arguments for the metric are metric_name and boundaries:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

    yield from check_levels(
        hosts_up_perc,
        levels_lower = (90.0, 80.0),
        metric_name = "hosts_up_perc",
        label = "UP hosts",
        boundaries = (0.0, 100.0),
        notice_only = True,
    )

To keep it nice and simple and meaningful, for the name of the metric use the name of the variable in which the percentage is stored as a value.

You can use boundaries to provide the graphing system with information about the range of possible values. This refers to the smallest and largest possible value. In the case of a percentage, the limits of 0.0 and 100.0 are not too difficult to determine. Both floating point numbers and integers (which are converted internally into floating point numbers) are permitted, but not strings. If only one limit of the value range is defined, simply enter None for the other, for example boundaries=(0.0, None).

With this extension, the check_levels function now also returns an object of the type Metric via yield in addition to the Result.

You can now define the metric services_ok_perc in the same way. The last lines of the check function will then look like this:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

    hosts_up_perc = 100.0 * num_hosts_up / num_hosts
    yield from check_levels(
        hosts_up_perc,
        levels_lower = (90.0, 80.0),
        metric_name = "hosts_up_perc",
        label = "UP hosts",
        boundaries = (0.0, 100.0),
        notice_only = True,
    )
    services_ok_perc = 100.0 * num_services_ok / num_services
    yield from check_levels(
        services_ok_perc,
        levels_lower = (90.0, 80.0),
        metric_name = "services_ok_perc",
        label = "OK services",
        boundaries = (0.0, 100.0),
        notice_only = True,
    )

With the extended check function, both graphs are visible in the monitoring. In the service list, the graph icon now shows that there are graphs for the service. If you point to the icon with the mouse, the graphs will be displayed as a preview.

Figure: The service list with two graphs as a preview. The names of the metrics are used as the titles for the graphs.

An overview of all graphs including their legends and more can be found in the service details.

But what do you do if the value for the desired metric has not been defined with the check_levels() function? You can, of course, define a metric independently of a function call. The Metric() object, which you can also create directly via its constructor, is used for this purpose. The alternative definition of a metric for the value hosts_up_perc looks like this:

    yield Metric(
        name = "hosts_up_perc",
        value = hosts_up_perc,
        levels = (80.0, 90.0),
        boundaries = (0.0, 100.0),
    )

The arguments in Metric() are very similar to those in the function call shown above: Mandatory are the first two arguments for the metric name and the value. In addition, there are two optional arguments: levels for the threshold values WARN and CRIT and boundaries for the value range.

Important: The specification of levels is only used here as information for displaying the graph. In the graph, the threshold values are usually drawn as yellow and red lines. The check_levels function with its specified threshold values is responsible for the actual checking.

Now use the option of defining not only the two calculated values, but all measured values as metrics using Metric() — in our example, the four measured values from the host group table. Limit yourself to the two obligatory specifications of metric name and value. The four new lines complete the extension of the check function for metrics:

    hosts_up_perc = 100.0 * num_hosts_up / num_hosts
    yield from check_levels(
        hosts_up_perc,
        levels_lower = (90.0, 80.0),
        metric_name = "hosts_up_perc",
        label = "UP hosts",
        boundaries = (0.0, 100.0),
        notice_only = True,
    )
    services_ok_perc = 100.0 * num_services_ok / num_services
    yield from check_levels(
        services_ok_perc,
        levels_lower = (90.0, 80.0),
        metric_name = "services_ok_perc",
        label = "OK services",
        boundaries = (0.0, 100.0),
        notice_only = True,
    )

    yield Metric(name="num_hosts", value=num_hosts)
    yield Metric(name="num_hosts_up", value=num_hosts_up)
    yield Metric(name="num_services", value=num_services)
    yield Metric(name="num_services_ok", value=num_services_ok)

This increases the number of graphs per service, but also gives you the option of combining multiple metrics in a single graph, for example. We show these and other options in the section Customizing the display of metrics below.
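The four Metric() lines follow the same pattern, so they could equally be generated in a loop. The sketch below uses a stand-in Metric class written for this illustration (the real class comes from the Check API) and a hypothetical helper named measured_value_metrics.

```python
from typing import NamedTuple, Optional, Tuple


class Metric(NamedTuple):
    """Stand-in: name and value are mandatory, the rest is optional."""
    name: str
    value: float
    levels: Optional[Tuple[float, float]] = None
    boundaries: Optional[Tuple[Optional[float], Optional[float]]] = None


def measured_value_metrics(attr):
    """Yield one metric per measured column of the host group table."""
    for name in ("num_hosts", "num_hosts_up", "num_services", "num_services_ok"):
        yield Metric(name=name, value=int(attr[name]))


# Values of the Hamburg group from the sample agent output.
attr = {
    "members": "myhost22,myhost33,myhost11",
    "num_hosts": "3",
    "num_hosts_up": "2",
    "num_services": "132",
    "num_services_ok": "105",
}
metrics = list(measured_value_metrics(attr))
for metric in metrics:
    print(metric.name, metric.value)
```

Whether the loop or the four explicit lines read better is a matter of taste; the explicit form makes it easier to add levels or boundaries to a single metric later.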

In the example file on GitHub you can find the entire check function again.

4.6. Check parameters for rule sets

In the advanced check plug-in myhostgroups_advanced you have created the WARN state if only 90 % of the hosts are UP, and CRIT state if only 80 % of the hosts are UP. The numbers 90 and 80 are programmed directly into the check function or, as programmers would say, hard-coded. In Checkmk, however, users are used to being able to configure such threshold values and other check parameters via rules. For example, if a host group only has four members, then the two threshold values of 90 % and 80 % do not really fit well, since the percentage will drop to 75 % as soon as the first host fails and the status will go directly to CRIT — without an interim WARN level.
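The arithmetic behind the four-member example can be worked through in a short sketch. The functions classify_lower and state_after_failures are hypothetical helpers, not part of the Check API. The sketch assumes that a value strictly below a threshold triggers the state and deliberately avoids values that lie exactly on a threshold, since that case is decided by check_levels.

```python
def classify_lower(value, warn, crit):
    """Classify a value against lower thresholds (assumption: strictly below)."""
    if value < crit:
        return "CRIT"
    if value < warn:
        return "WARN"
    return "OK"


def state_after_failures(num_hosts, failed, levels=(90.0, 80.0)):
    """Return the percentage of UP hosts and the resulting state."""
    perc = 100.0 * (num_hosts - failed) / num_hosts
    return perc, classify_lower(perc, *levels)


# Four hosts: a single failure drops to 75 % and skips WARN entirely.
print(state_after_failures(4, 0))
print(state_after_failures(4, 1))

# Twenty hosts: three failures leave 85 %, which lands in the WARN band.
print(state_after_failures(20, 3))
```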

Therefore, the check plug-in should now be improved so that it can be configured via the Setup interface. To do this, you will need a rule set. If there is already a suitable rule set in Checkmk, you can use this. As the check function and rule set must match, this will not usually be the case. You can find more information on existing rule sets below.

Note: Be aware that the examples presented in this section are not covered by the Check API.

Defining a new rule set

To create a new rule set, create a new file in the ~/local/share/check_mk/web/plugins/wato directory. The name of the file should be based on that of the check plug-in and, like all plug-in files, must have the .py extension. For our example, the file name myhostgroups_advanced_parameters.py is suitable.

Take a step by step look at the structure of such a file. First come some import commands.

If the texts in your file, which are displayed on the Checkmk GUI, are to be translatable into other languages, import _ (underscore):

~/local/share/check_mk/web/plugins/wato/myhostgroups_advanced_parameters.py

from cmk.gui.i18n import _

This is a function and acts as a marker for all translatable texts. For example, instead of "Threshold for Warning", write _("Threshold for Warning") for the function call. The translation system in Checkmk, which is based on gettext, finds such texts and adds them to the list of texts to be translated. If you are only building the check for yourself, you can do without it and therefore also without the import line.
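The marker idea can be tried with gettext from the Python standard library. This is a stand-in for illustration only: in the rule set file, _ is imported from cmk.gui.i18n as shown above, not from the standard library.

```python
import gettext

# Bind the conventional marker name to the standard library lookup function.
_ = gettext.gettext

# Without an installed translation catalog, the original text is returned.
title = _("Threshold for Warning")
print(title)
```

The call marks the string for extraction by the translation tooling and, at runtime, returns either the translated text or the original text if no translation is available.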

Next, import the so-called ValueSpecs. A ValueSpec is a very practical and universal tool that Checkmk uses in many places. It is used to generate customized input forms, to display and validate the values entered and to convert these into Python data structures.

~/local/share/check_mk/web/plugins/wato/myhostgroups_advanced_parameters.py

from cmk.gui.valuespec import (
    Dictionary,
    Percentage,
    TextInput,
    Tuple,
)

You will need the Dictionary in any case, because from Checkmk version 2.0.0 check parameters are stored in Python dictionaries. Percentage is responsible for the input of percentage values, TextInput for a Unicode text and Tuple for an ordered, one-dimensional set of objects. You will also use Float and Integer more frequently.

The next step is to import classes and functions that are required when registering:

~/local/share/check_mk/web/plugins/wato/myhostgroups_advanced_parameters.py

from cmk.gui.plugins.wato.utils import (
    CheckParameterRulespecWithItem,
    rulespec_registry,
    RulespecGroupCheckParametersApplications,
)

As your myhostgroups_advanced check plug-in generates several services, import CheckParameterRulespecWithItem. If your check generates only a single service, in other words does not have an item, import CheckParameterRulespecWithoutItem instead. There is more information on the RulespecGroup at the end of this section.

Next come the actual definitions. First, you define an input field with which the user can specify the Item of the check. This field is displayed in the rule condition and allows users to restrict the rule to specific services. This is also necessary for the manual creation of checks that are to function without the need for a discovery.

You create this field with TextInput. It is assigned a title using title, which is then usually displayed as the heading for the input field. If you have sympathy for your users, include a helpful comment for the inline help:

~/local/share/check_mk/web/plugins/wato/myhostgroups_advanced_parameters.py

def _item_valuespec_myhostgroups_advanced():
    return TextInput(
        title = "Host group name",
        help = "You can restrict this rule to certain services of the specified hosts.",
    )

You are free to choose the name for the function that returns this ValueSpec; it is only required for registration below. The name should begin with an underscore so that the function is not visible beyond the module boundary.

Next comes the ValueSpec for entering the actual check parameters. You also create a function which generates the ValueSpec for this. The user should be able to define the two threshold values for WARN and CRIT separately for the number of hosts in the UP state and the services in the OK state:

~/local/share/check_mk/web/plugins/wato/myhostgroups_advanced_parameters.py

def _parameter_valuespec_myhostgroups_advanced():
    return Dictionary(
        elements = [
            ("hosts_up_lower",
                Tuple(
                    title = _("Lower percentage threshold for host in UP status"),
                    elements = [
                        Percentage(title=_("Warning")),
                        Percentage(title=_("Critical")),
                    ],
                )
            ),
            ("services_ok_lower",
                Tuple(
                    title = _("Lower percentage threshold for services in OK status"),
                    elements = [
                        Percentage(title=_("Warning")),
                        Percentage(title=_("Critical")),
                    ],
                )
            ),
        ],
    )

The return Dictionary() is mandatory. Within it, use elements=[] to create the list of parameters, in our example hosts_up_lower and services_ok_lower. Both parameters are each defined by a tuple of two values for the WARN and CRIT threshold values. All threshold values should be percentages, so use Percentage here.

Note: If you want to offer the users of the rule set preset values instead of empty input fields, you can do this by adding the default_value argument to the Percentage() call. For the hosts_up_lower tuple from the example just shown, it looks like this:

            ("hosts_up_lower",
                Tuple(
                    title = _("Lower percentage threshold for host in UP status"),
                    elements = [
                        Percentage(title=_("Warning"), default_value=90.0),
                        Percentage(title=_("Critical"), default_value=80.0),
                    ],
                )
            ),

Note that these values displayed in the GUI are not the default values that are defined further down via check_default_parameters in the registration of the check plug-in. If you want to display the same default values in the GUI that also apply to the check function, you must keep the values consistent in both places.

Finally, register the new rule set using the imported and self-defined items. The rulespec_registry.register() function is available for this purpose:

~/local/share/check_mk/web/plugins/wato/myhostgroups_advanced_parameters.py

rulespec_registry.register(
    CheckParameterRulespecWithItem(
        check_group_name = "myhostgroups_advanced",
        group = RulespecGroupCheckParametersApplications,
        match_type = "dict",
        item_spec = _item_valuespec_myhostgroups_advanced,
        parameter_valuespec = _parameter_valuespec_myhostgroups_advanced,
        title = lambda: _("Host group status"),
    )
)

The following explanations apply:

If your check does not use an item, the inner function will be CheckParameterRulespecWithoutItem. The item_spec line is then omitted.
The check_group_name as the name of the rule set establishes the connection to the check plug-ins. A check plug-in that wants to use this rule set must use the same name as check_ruleset_name when registering. Under no circumstances may the name be identical to an existing rule set, as this would overwrite it. To avoid this risk, it is best to use a prefix in the name.
The group determines where the rule set should appear in the Setup. With the value selected in the example, you can find it under Setup > Services > Service monitoring rules in the first box Applications, Processes & Services. Most of these groups are defined in the ~/lib/check_mk/gui/plugins/wato/utils/__init__.py file. There you will also find examples of how to create your own new group.
The match_type is always "dict".
Enter the names of the two previously created functions as item_spec and parameter_valuespec.
title defines the title of the rule set as it appears in the Checkmk GUI. However, the title is not specified directly as text, but as an executable function that returns the text (hence the lambda:).

Testing the rule set

Once you have created the file for the rule set, you should test whether everything is working up to this point — still without a connection to the check plug-in. To do this, you must first restart the Apache of the site so that the new file will be read. This is done by the command:

OMD[mysite]:~$ omd restart apache

The rule set should then be found in the Setup on the page mentioned above. You can also find the rule set using the search function in the Setup menu — but only after restarting Redis:

OMD[mysite]:~$ omd restart redis

The rule set you have just defined looks like this in the GUI:

The newly created rule set in the Setup.

In the condition you will find the defined Host group name field with the inline help:

The field for service selection in the rule condition.

Incidentally, Checkmk has automatically added the explanatory text on using regular expressions to your rule set.

Create a rule and try out different values, as shown in the screenshot above. If this works without errors, you can now use the check parameters in the check function.

Using the rule set in a check plug-in

In order for the rule to take effect, you must allow the check plug-in to accept check parameters and tell it which rule should be used. To do this, add two new lines to the registration function:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

register.check_plugin(
    name = "myhostgroups_advanced",
    sections = ["myhostgroups"],
    service_name = "Host group %s",
    discovery_function = discover_myhostgroups_advanced,
    check_function = check_myhostgroups_advanced,
    check_default_parameters = {},
    check_ruleset_name = "myhostgroups_advanced",
)

With the check_default_parameters entry you can define the default values that apply as long as no rule has been created. This line must be present during registration. In the simplest case, pass an empty dictionary {}.

Secondly, pass the check_ruleset_name to the registration function, i.e. the name that you assigned to the rule set above using check_group_name. This way Checkmk knows from which rule set the parameters are to be determined.

Now Checkmk will try to pass parameters to the check function. For this to work, you must extend the check function so that it expects the params argument, which is inserted between item and section:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

def check_myhostgroups_advanced(item, params, section):

If you are building a check without an item, the item is omitted and params is at the beginning.

It is highly recommended to have the content of the variable params output with a print as a first test:

def check_myhostgroups_advanced(item, params, section):
    print(params)

When the check plug-in is executed, the printed lines (one for each service) will look something like this:

OMD[mysite]:~$ cmk --detect-plugins=myhostgroups_advanced -v localhost
Parameters({'hosts_up_lower': (70.0, 60.0), 'services_ok_lower': (75.0, 65.0)})
Parameters({'hosts_up_lower': (70.0, 60.0), 'services_ok_lower': (75.0, 65.0)})
Parameters({'hosts_up_lower': (70.0, 60.0), 'services_ok_lower': (75.0, 65.0)})

Important: Once everything is ready and working correctly, be sure to remove the print commands again, as these can confuse Checkmk’s internal communication.

Now adapt your check function further so that the transferred parameters can take effect. Get the two tuples from the parameters with the name selected in the rule:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

def check_myhostgroups_advanced(item, params, section):
    hosts_up_lower = params["hosts_up_lower"]
    services_ok_lower = params["services_ok_lower"]

Further down in the check function, the previously hard-coded threshold values (90, 80) are then replaced by the hosts_up_lower and services_ok_lower variables:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

    hosts_up_perc = 100.0 * num_hosts_up / num_hosts
    yield from check_levels(
        hosts_up_perc,
        levels_lower = (hosts_up_lower),
        metric_name = "hosts_up_perc",
        label = "UP hosts",
        boundaries = (0, 100),
        notice_only = True,
    )
    services_ok_perc = 100.0 * num_services_ok / num_services
    yield from check_levels(
        services_ok_perc,
        levels_lower = (services_ok_lower),
        metric_name = "services_ok_perc",
        label = "OK services",
        boundaries = (0, 100),
        notice_only = True,
    )
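
To make the effect of the two tuples more tangible, the following stand-alone sketch mimics how a pair of lower levels (warn, crit) is evaluated. The helper state_for_lower_levels is hypothetical and only illustrates the comparison. In the plug-in itself, check_levels() does this work for you.

```python
def state_for_lower_levels(value, levels_lower):
    """Return "OK", "WARN" or "CRIT" for a value checked against lower levels.

    Hypothetical helper for illustration only, not part of the Check API.
    """
    warn, crit = levels_lower
    if value < crit:
        return "CRIT"
    if value < warn:
        return "WARN"
    return "OK"


# Parameters as they were printed in the test output above
params = {"hosts_up_lower": (70.0, 60.0), "services_ok_lower": (75.0, 65.0)}

print(state_for_lower_levels(82.5, params["hosts_up_lower"]))
print(state_for_lower_levels(66.0, params["hosts_up_lower"]))
print(state_for_lower_levels(50.0, params["services_ok_lower"]))
```

With lower levels, a value is a problem when it falls below a threshold, which is why the warning threshold is the larger of the two numbers.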

If a rule has been configured, you can now monitor the host groups in the example with the threshold values which were defined via the GUI. However, if no rule has been defined, this check function will crash, since the default parameters for the check plug-in are not filled and the plug-in will generate a KeyError in the absence of a rule.

However, this problem can be solved if the default values are defined during registration:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

register.check_plugin(
    name = "myhostgroups_advanced",
    sections = ["myhostgroups"],
    service_name = "Host group %s",
    discovery_function = discover_myhostgroups_advanced,
    check_function = check_myhostgroups_advanced,
    check_default_parameters = {"hosts_up_lower": (90, 80), "services_ok_lower": (90, 80)},
    check_ruleset_name = "myhostgroups_advanced",
)

You should always pass default values in this way (and not handle the case of missing parameters inside the check plug-in), as these default values can also be displayed in the Setup interface. For example, in the service discovery for a host, on the Services of host page, the Display menu contains the Show check parameters entry.
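
The difference between an empty dictionary and real default values can be reproduced with plain Python dictionaries. The following sketch runs without Checkmk. The helper effective_params is hypothetical, and it is an assumption of this sketch that values from a rule override the defaults key by key.

```python
# Default values as passed with check_default_parameters during registration
DEFAULTS = {"hosts_up_lower": (90, 80), "services_ok_lower": (90, 80)}


def effective_params(defaults, rule_values):
    """Hypothetical helper: values from a rule take precedence over defaults."""
    return {**defaults, **rule_values}


# Empty defaults and no rule: the lookup in the check function fails
try:
    effective_params({}, {})["hosts_up_lower"]
except KeyError as exc:
    print(f"KeyError: {exc}")

# With defaults, the lookup works even if no rule exists
print(effective_params(DEFAULTS, {})["hosts_up_lower"])

# A rule only needs to contain the values that differ from the defaults
print(effective_params(DEFAULTS, {"hosts_up_lower": (70.0, 60.0)}))
```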

Note: On GitHub you will find both the file with the rule set as well as the rule set-enhanced check plug-in.

Using an existing rule set

It is unlikely, but not impossible, that a rule set delivered with Checkmk matches your check function. This can actually only be the case if your check evaluates something for which Checkmk already has check plug-ins in the same form, for example the monitoring of a temperature or other values measured with sensors. However, if such a rule set matches your check plug-in, then you can save yourself the trouble of creating your own rule set.

The existing rule sets for check parameters supplied with Checkmk can be found in the ~/lib/check_mk/gui/plugins/wato/check_parameters/ directory.

Take the file memory_simple.py as an example. This declares a rule set with the following section:

~/lib/check_mk/gui/plugins/wato/check_parameters/memory_simple.py

rulespec_registry.register(
    CheckParameterRulespecWithItem(
        check_group_name = "memory_simple",
        group = RulespecGroupCheckParametersOperatingSystem,
        item_spec = _item_spec_memory_simple,
        match_type = "dict",
        parameter_valuespec = _parameter_valuespec_memory_simple,
        title = lambda: _("Main memory usage of simple devices"),
    )
)

As with the self-written rule set, the check_group_name keyword, which is set to "memory_simple" here, is decisive, because it establishes the connection to the check plug-in. You make this connection when registering the check plug-in with the check_ruleset_name keyword, for example like this:

register.check_plugin(
    name = "foobar",
    service_name = "Foobar %s",
    discovery_function = discover_foobar,
    check_function = check_foobar,
    check_ruleset_name = "memory_simple",
    check_default_parameters = {},
)

Here as well it is necessary to define default values using the check_default_parameters keyword.

5. Customizing the display of metrics

In the example above you have let the myhostgroups_advanced check plug-in generate metrics for all measured and calculated values. We have presented two ways of achieving this. First, the metrics of the calculated values were created as part of the check_levels() function with the metric_name argument, for example like this:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

    yield from check_levels(
        services_ok_perc,
        levels_lower = (services_ok_lower),
        metric_name = "services_ok_perc",
        label = "OK services",
        boundaries = (0.0, 100.0),
        notice_only = True,
    )

Then we generated the measured metrics directly with the Metric() object — for the number of services in the OK state, for example, like this:

~/local/lib/check_mk/base/plugins/agent_based/myhostgroups.py

    yield Metric(name="num_services_ok", value=num_services_ok)

Metrics will be immediately visible in Checkmk’s graphical user interface without you having to do anything. However, there are a few restrictions:

Matching metrics are not automatically combined in a graph, instead each one appears individually.
The metric does not have a proper title, instead it shows the internal variable name of the metric.
No unit is used that allows a meaningful representation (e.g. GB instead of individual bytes).
A color is selected at random.
A 'Perf-O-Meter', i.e. the graphical preview of the metric shown as a bar, does not automatically appear in the service list (for example, in the view that shows all of a host’s services).

To complete the display of your metrics in these aspects, you will need metric definitions.

Note: The examples presented in this chapter are not covered by the Check API.

5.1. Using existing metric definitions

Before you create a new metric definition, you should first check whether Checkmk already has a suitable definition. The predefined metric definitions can be found in the ~/lib/check_mk/gui/plugins/metrics/ directory.

For CPU utilization, for example, you will find the metric with the name util in the cpu.py file:

~/lib/check_mk/gui/plugins/metrics/cpu.py

metric_info["util"] = {
    "title": _l("CPU utilization"),
    "unit": "%",
    "color": "26/a",
}

If your check plug-in measures the CPU utilization as a percentage, you can use this metric definition without any problems. All you have to do is enter "util" as the name for the metric in your check function. Everything else will then be automatically derived from this.

5.2. Creating a new metric definition

If there is no suitable existing metric definition, simply create one yourself.

For our example, define your own metric for the number of services in the OK state. To do this, create a file in ~/local/share/check_mk/web/plugins/metrics:

~/local/share/check_mk/web/plugins/metrics/myhostgroups_advanced_metrics.py

from cmk.gui.i18n import _

from cmk.gui.plugins.metrics import metric_info

metric_info["num_services_ok"] = {
    "title": _("Services in OK status"),
    "unit": "count",
    "color": "15/a",
}

Here is the explanation:

Importing and using the underscore (_) for internationalization is optional, as already discussed when creating rule sets.
The key (here "num_services_ok") is the metric name and this must correspond to what the check function outputs. Choose a unique name so that no existing metric will be 'overwritten'.
The title is the heading in the metric graph and replaces the previously used internal variable name.
You can find out which definitions are available for units (unit) in the ~/lib/check_mk/gui/plugins/metrics/unit.py file.
The color definition color uses a palette. Each palette color has /a and /b. These are two shades of the same color. In the existing metric definitions you will also find many direct color codes such as "#ff8800". These will be gradually phased out and replaced by palette colors, as these offer a more uniform look and can also be more easily adapted to the GUI’s color themes.

This definition in the metric file now ensures that the title, unit and color of the metric are displayed appropriately.

Analogous to the creation of a rule set file, the metrics file must first be read before the change becomes visible in the GUI. This is done by restarting the site’s Apache:

OMD[mysite]:~$ omd restart apache

The metric graph will then look something like this in the Checkmk GUI:

The new metric definition in the service details.

5.3. Graphs with multiple metrics

If you want to combine several metrics into a single graph (which is often very useful), you will need a graph definition that you can add to the metrics file created in the previous section. This is done via the global dictionary graph_info.

For our example, the two metrics num_services_ok and num_services are to be displayed in the same graph. The metric definitions for this are as follows:

~/local/share/check_mk/web/plugins/metrics/myhostgroups_advanced_metrics.py

from cmk.gui.i18n import _

from cmk.gui.plugins.metrics import (
    metric_info,
    graph_info,
)

metric_info["num_services_ok"] = {
    "title": _("Services in OK status"),
    "unit": "count",
    "color": "15/a",
}

metric_info["num_services"] = {
    "title": _("Number of services"),
    "unit": "count",
    "color": "24/a",
}

Now add a graph that shows these two metrics as lines:

graph_info["num_services_combined"] = {
    "metrics": [
        ("num_services_ok", "line"),
        ("num_services", "line"),
    ],
}

The first entry under metrics determines which metric is used for the title of the graph. The result is the combined graph in the Checkmk GUI:

The graph shows both metrics in the service details.

5.4. Metrics in the Perf-O-Meter

Would you like to display a Perf-O-Meter for a metric in the line in the service list? It could look like this, for example:

The Perf-O-Meter shows the number of services with the status 'OK'.

The Perf-O-Meter shows the absolute number of services

To create such a Perf-O-Meter, you need another file, this time in the ~/local/share/check_mk/web/plugins/perfometer directory:

~/local/share/check_mk/web/plugins/perfometer/myhostgroups_advanced_perfometer.py

from cmk.gui.plugins.metrics import perfometer_info

perfometer_info.append(
    {
        "type": "logarithmic",
        "metric": "num_services_ok",
        "half_value": 25,
        "exponent": 2.0,
    }
)

Perf-O-Meters are somewhat trickier than graphs, as they have no legend. This makes it difficult to visualize the ranges of the displayed values. As the Perf-O-Meter cannot know which values are even possible and since the space is very limited, many built-in check plug-ins use a logarithmic representation. This is also the case in the example above:

type selects the format for displaying the values, in this case the logarithmic representation.
metric denotes the name of the metric, in this case the number of services in the OK state.
half_value is the measured value that is displayed exactly in the middle of the Perf-O-Meter. For a value of 25, the bar is therefore half full.
exponent specifies the factor required to fill a further 10 % of the range. So in the example, a measured value of 50 would fill the bar up to 60 %, and one of 100 up to 70 %.
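
The arithmetic behind half_value and exponent can be checked with a few lines of Python. This is a sketch derived only from the description above, not the code Checkmk uses internally. The function name logarithmic_fill and the clamping to the range 0 to 100 are assumptions.

```python
import math


def logarithmic_fill(value, half_value, exponent):
    """Return the fill level of the bar in percent (hypothetical helper)."""
    if value <= 0:
        return 0.0
    # half_value marks 50 %; every factor of `exponent` adds another 10 %
    position = 50.0 + 10.0 * math.log(value / half_value, exponent)
    return min(100.0, max(0.0, position))


for value in (6.25, 25, 50, 100):
    fill = logarithmic_fill(value, half_value=25, exponent=2.0)
    print(value, round(fill, 1))
```

With these settings, 25 services fill half of the bar, and each doubling adds one tenth of the bar's width.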

The advantage with this method: If you equip a list of services of the same type with similar Perf-O-Meters, you can quickly compare the Perf-O-Meters with each other visually, as they all use the same scale. And despite the small display format, you can easily recognize the differences in both very small and very large values. The values in such a display are however not true to their actual scale.

The Perf-O-Meter above shows a metric, but not the one that is responsible for the status of the service. It is not the number of services in the OK state, but their percentage that determines the state of the host group service shown.

To display the metric of the percentage (services_ok_perc), you can use a linear Perf-O-Meter. This is always useful if there is a known maximum value. For example, it would look like this in the file:

~/local/share/check_mk/web/plugins/perfometer/myhostgroups_advanced_perfometer.py

perfometer_info.append(
    {
        "type": "linear",
        "segments": ["services_ok_perc"],
        "total": 100.0,
    }
)

And so in the GUI:

The Perf-O-Meter shows the percentage of services with the status 'OK'.

The percentage of services in the OK state is displayed here

This is already an improvement. However, the state of the service is not only determined by the percentage of services in the OK state, but also by the percentage of hosts in the UP state. There are various ways to combine multiple metrics in a Perf-O-Meter. One method looks like this:

~/local/share/check_mk/web/plugins/perfometer/myhostgroups_advanced_perfometer.py

perfometer_info.append(
    {
        "type": "dual",
        "perfometers": [
            {
                "type": "linear",
                "segments": ["hosts_up_perc"],
                "total": 100.0,
            },
            {
                "type": "linear",
                "segments": ["services_ok_perc"],
                "total": 100.0,
            },
        ],
    }
)

This ensures that the two metrics share the bar in the Perf-O-Meter and are displayed next to each other:

The Perf-O-Meter displays two percentages next to each other.

Here, two percentage values are displayed next to each other

You can find many examples on this topic in the check plug-ins supplied by Checkmk — in the ~/lib/check_mk/gui/plugins/metrics/perfometers.py file.

6. Formatting numbers

Numbers are often output in the Summary and the Details of a service. To make neat and correct formatting as easy as possible for you, and also to standardize the output of all check plug-ins, there are helper functions for displaying the various types of sizes. All of these are functions of the render module and are therefore called with the prefix render followed by a dot. For example, render.bytes(2000) results in the text 1.95 KiB.

What all of these functions have in common is that their value is passed in a so-called canonical or natural unit. This means you never have to think about which unit to use, and conversions cause neither difficulties nor errors. For example, times are always given in seconds, and the sizes of hard disks, files, etc., are always given in bytes and not in kilobytes, kibibytes, blocks or other such confusion.

Use these functions even if you do not like the display that much. In any case, it will be standardized for the user. Future versions of Checkmk may be able to change the display or even make it configurable by the user, as is already the case with the temperature display, for example. Your check plug-in will then also benefit from this.

Before you can use the render functions in your check plug-in, you must also import the render module:

from .agent_based_api.v1 import check_levels, Metric, register, render, Result, Service, State

Following this detailed description of all display functions (render functions), you will find a summary in the form of an easy to read table.

6.1. Times, time ranges, frequencies

Absolute time specifications (timestamps) are formatted with render.date() or render.datetime(). The information is always given as Unix time, i.e. in seconds from January 1, 1970, 00:00:00 UTC — the beginning of the Unix epoch. This is also the format used by the Python function time.time().

The advantage with this representation is that it is very easy to calculate, for example the calculation of a time range, when the start and end times are known. The formula is then simply duration = end - start. These calculations work regardless of the time zone, daylight saving time changes, or leap years.

render.date() only outputs the date, render.datetime() adds the time. The output is according to the current time zone in which the Checkmk server that is executing the check is located. Examples:

Call	Output
`render.date(0)`	`Jan 01 1970`
`render.datetime(0)`	`Jan 01 1970 01:00:00`
`render.date(1700000000)`	`Nov 14 2023`
`render.datetime(1700000000)`	`Nov 14 2023 23:13:20`

Do not be surprised that render.datetime(0) does not output 00:00 as the time, but 01:00. This is because we are writing this User Guide in the time zone for Germany — which is one hour ahead of standard UTC time (at least during standard time, because January 1st is not in daylight saving time).

For time ranges (or time spans) there is also the render.timespan() function. This is given a duration in seconds and outputs it in human-readable form. For larger time spans, seconds or minutes are omitted. If you have a time span in a Python timedelta object, use its total_seconds() method to read out the number of seconds as a floating point number.

Call	Output
`render.timespan(1)`	`1 second`
`render.timespan(123)`	`2 minutes 3 seconds`
`render.timespan(12345)`	`3 hours 25 minutes`
`render.timespan(1234567)`	`14 days 6 hours`
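
How a time span ends up as a number of seconds can be tried out with the Python standard library alone. The following sketch uses made-up start and end times. Only the resulting number of seconds would be passed on to render.timespan().

```python
from datetime import datetime, timedelta, timezone

# Made-up example times; 1700000000 is the Unix time used in the table above
start = datetime.fromtimestamp(1700000000, tz=timezone.utc)
end = start + timedelta(hours=3, minutes=25, seconds=45)

span = end - start               # subtracting two datetimes yields a timedelta
seconds = span.total_seconds()   # float, the canonical unit for time spans
print(seconds)

# The same result with plain Unix timestamps: duration = end - start
print(end.timestamp() - start.timestamp())
```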

A frequency is effectively the reciprocal of time. The canonical unit is Hz, which means the same as 1 / sec (the reciprocal of one second). An example of its use is the clock rate of a CPU:

Call	Output
`render.frequency(111222333444)`	`111 GHz`

6.2. Bytes

Wherever working memory, files, hard disks, file systems and the like are concerned, the canonical unit is the byte. Since computers usually organize such things in powers of two, for example in units of 512, 1024 or 65,536 bytes, it became established early on that a kilobyte is not 1000 bytes, as the prefix would suggest, but 1024 bytes (2 to the power of 10). This is admittedly illogical, but very practical because it usually results in round numbers. The legendary Commodore C64 had 64 kilobytes of memory and not 65.536.

Unfortunately, at some point hard disk manufacturers came up with the idea of specifying the sizes of their disks in units of 1000. Since the difference between 1000 and 1024 is 2.4 % for each size, and these are multiplied, a disk of size 1 GB (1024 times 1024 times 1024) suddenly becomes 1.07 GB. That sells better.

This annoying confusion still exists today and continues to produce errors. To alleviate this, the International Electrotechnical Commission (IEC) has defined new prefixes based on the binary system. Accordingly, today a kilobyte is officially 1000 bytes and a kibibyte is 1024 bytes (2 to the power of 10). Analogously, the larger binary units are called mebibyte, gibibyte and tebibyte. The abbreviations are KiB, MiB, GiB and TiB.
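
The difference between the two systems of prefixes is easy to verify. The following lines use only plain arithmetic. The constant names are chosen for this sketch and do not come from Checkmk.

```python
# Binary prefixes (IEC) and decimal prefixes (SI), both expressed in bytes
KIB, MIB, GIB = 1024, 1024**2, 1024**3
KB, MB, GB = 1000, 1000**2, 1000**3

one_gib = 1 * GIB
print(one_gib)                    # size in bytes, the canonical unit
print(round(one_gib / GB, 2))     # the same size expressed in GB

# One and the same number of bytes in both notations
size = 2_000_000
print(round(size / MIB, 2), "MiB")
print(round(size / MB, 2), "MB")
```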

Checkmk conforms to this standard and helps you with a number of customized render functions to ensure that you always produce the correct output. For example, there is the render.disksize() function especially for hard disks and file systems, which produces its output in powers of 1000.

Call	Output
`render.disksize(1000)`	`1.00 kB`
`render.disksize(1024)`	`1.02 kB`
`render.disksize(2000000)`	`2.00 MB`

When it comes to the sizes of files, it is often customary to specify the exact size in bytes without rounding. This has the advantage that you can see very quickly if a file has changed even minimally or that two files are (probably) the same. The render.filesize() function is used for this:

Call	Output
`render.filesize(1000)`	`1,000 B`
`render.filesize(1024)`	`1,024 B`
`render.filesize(2000000)`	`2,000,000 B`

If you want to output a value that is not the size of a hard disk or file, simply use the generic render.bytes(). With this you get the output in the 'classic' 1024 powers in the official notation:

Call	Output
`render.bytes(1000)`	`1000 B`
`render.bytes(1024)`	`1.00 KiB`
`render.bytes(2000000)`	`1.91 MiB`

6.3. Bandwidths, data rates

Networkers have their own terms and ways of expressing things. And as always, Checkmk makes every effort to adopt the conventional way of communicating in each domain. That is why there are three different render functions for data rates and speeds. What these all have in common is that the rates are passed in bytes per second, even if the actual output is in bits!

render.nicspeed() represents the maximum speed of a network card or a switch port. As these are not measured values, there is no need for rounding. Although no port can send individual bits, the data is in bits for historical reasons.

Important: Nevertheless, you must also pass bytes per second here!

Call	Output
`render.nicspeed(12500000)`	`100 MBit/s`
`render.nicspeed(100000000)`	`800 MBit/s`
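
Since the input is in bytes per second but the output is in bits, a factor of 8 is involved in both directions. The following sketch shows the conversion you may need if a device reports its speed in bits per second. Both helper functions are hypothetical and not part of the render module.

```python
def bits_to_bytes_per_second(bits_per_second):
    """Convert a speed reported in bit/s into the canonical unit bytes/s."""
    return bits_per_second / 8


def bytes_to_bits_per_second(bytes_per_second):
    """Convert bytes/s back into bit/s, as the network render functions display it."""
    return bytes_per_second * 8


# A port reporting 100 MBit/s must be passed on as 12,500,000 bytes per second
print(bits_to_bytes_per_second(100_000_000))

# 100,000,000 bytes per second correspond to 800 MBit/s
print(bytes_to_bits_per_second(100_000_000))
```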

render.networkbandwidth() is intended for an actually measured transmission speed in the network. The input value is again bytes per second:

Call	Output
`render.networkbandwidth(123)`	`984 Bit/s`
`render.networkbandwidth(123456)`	`988 kBit/s`
`render.networkbandwidth(123456789)`	`988 MBit/s`

Where a network is not involved and data rates are still output, bytes are again common. The most prominent case is the IO rates of hard disks. The render.iobandwidth() function, which works with powers of 1000 in Checkmk, is used for this:

Call	Output
`render.iobandwidth(123)`	`123 B/s`
`render.iobandwidth(123456)`	`123 kB/s`
`render.iobandwidth(123456789)`	`123 MB/s`

6.4. Percentages

The render.percent() function represents a percentage, rounded to two decimal places. It is an exception among the render functions in that you do not pass the actual natural value, i.e. the ratio, but the real percentage. For example, if something is half full, you do not have to pass 0.5 but 50.

Because it can sometimes be interesting to know whether a value is almost zero or exactly zero, values that are greater than zero but less than 0.01 percent are marked with a preceding ‘<’ character.

Call	Output
`render.percent(0.004)`	`<0.01%`
`render.percent(18.5)`	`18.50%`
`render.percent(123)`	`123.00%`
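
The step from a ratio to the percentage that render.percent() expects is a multiplication by 100. The following sketch runs without Checkmk. The function format_percent is a hypothetical stand-in that only mimics the behavior described above for the three table rows and is not the implementation of render.percent().

```python
def format_percent(percentage):
    """Hypothetical stand-in: two decimal places, '<' marker for tiny values."""
    if 0 < percentage < 0.01:
        return "<0.01%"
    return f"{percentage:.2f}%"


# Made-up example: 37 of 200 units are in use
used, total = 37, 200
ratio = used / total            # 0.185 would be the "natural" value
percentage = 100.0 * ratio      # but the percentage is what gets passed

print(format_percent(percentage))
print(format_percent(0.004))
```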

6.5. Summary

In conclusion, here is an overview of all render functions:

Function	Input	Description	Output example
`date()`	Unix time	Date	`Nov 14 2023`
`datetime()`	Unix time	Date and time	`Nov 14 2023 23:13:20`
`timespan()`	Seconds	Duration / age	`3 hours 25 minutes`
`frequency()`	Hz	Frequency (e.g. clock rate)	`111 GHz`
`disksize()`	Bytes	Size of a hard disk, base 1000	`1.234 GB`
`filesize()`	Bytes	Size of a file, full precision	`1,334,560 B`
`bytes()`	Bytes	Size, base 1024	`23.4 KiB`
`nicspeed()`	Bytes per second	Network card speed	`100 MBit/s`
`networkbandwidth()`	Bytes per second	Transmission speed	`23.50 GBit/s`
`iobandwidth()`	Bytes per second	IO bandwidths	`124 MB/s`
`percent()`	Percentage number	Percentage, meaningfully-rounded	`99.997%`

7. Troubleshooting

The correct handling of errors (unfortunately) takes up a large part of any programming work. The good news is that the Check API already takes a lot of the work out of error handling. For some types of errors, it is therefore better not to handle them at all yourself.

If Python encounters a situation that is in any way unexpected, it reacts with a so-called exception. Here are a few examples:

You convert a string to a number with int(), but the string does not contain a number, for example int("foo").
You use bar[4] to access the fifth element of bar, but it only has four elements.
You are calling a function that does not exist.
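
The three situations listed above can be reproduced in a plain Python session. The helper provoke is only a device of this sketch for showing the name of the exception without aborting the script.

```python
def provoke(action):
    """Run action() and return the name of the exception it raises, if any."""
    try:
        action()
    except Exception as exc:
        return type(exc).__name__
    return None


bar = [1, 2, 3, 4]

print(provoke(lambda: int("foo")))             # string without a number
print(provoke(lambda: bar[4]))                 # fifth element of a list of four
print(provoke(lambda: undefined_function()))   # function that does not exist
```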

In order to decide how to tackle errors, it is first important to know the exact point in the code where an error occurs. You can use either the GUI or the command line for this — depending on where you are currently working.

7.1. Exceptions and crash reports in the GUI

If an exception occurs in monitoring or during service discovery in the Setup, the Summary contains references to the crash report that has just been created. It will look like this, for example:

A service whose check plug-in has crashed.

Clicking on the icon for a crashed check plug-in displays a page with details in which you:

can see the file in which the crash occurred,
receive all information about the crash, such as a list of the errors that occurred in the program (traceback), the current values of local variables, the agent output and much more, and
can send the report to us (Checkmk GmbH) as feedback.

The traceback helps you as a developer to decide whether there is an error in the program (e.g. the call of a non-existent function) or agent data that could not be processed as expected. In the first case you will want to correct the error, in the second case it often makes sense to do nothing.

Submitting the report is of course only useful for check plug-ins that are officially part of Checkmk. If you make your own plug-ins available to third parties, you can ask your users to send you the data.

7.2. Viewing exceptions on the command line

If you run your check plug-in using the command line, you will not receive any indication of the ID of any crash report generated. You will only see the summarized error message:

OMD[mysite]:~$ cmk --detect-plugins=myhostgroups_advanced localhost
Error in agent based plugin myhostgroups: invalid syntax (myhostgroups.py, line 11)

If you append the --debug option as an additional call parameter, you will receive the traceback from the Python interpreter:

OMD[mysite]:~$ cmk --debug --detect-plugins=myhostgroups_advanced localhost
Traceback (most recent call last):
  File "/omd/sites/mysite/bin/cmk", line 97, in <module>
    errors = config.load_all_agent_based_plugins(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mysite/lib/python3/cmk/base/config.py", line 1673, in load_all_agent_based_plugins
    errors = agent_based_register.load_all_plugins()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/mysite/lib/python3/cmk/base/api/agent_based/register/__init__.py", line 48, in load_all_plugins
    raise exception
  File "/omd/sites/mysite/lib/python3/cmk/utils/plugin_loader.py", line 49, in load_plugins_with_exceptions
    importlib.import_module(full_name)
  File "/omd/sites/mysite/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 936, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1074, in get_code
  File "<frozen importlib._bootstrap_external>", line 1004, in source_to_code
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/omd/sites/mysite/local/lib/python3/cmk/base/plugins/agent_based/myhostgroups.py", line 11
    parsed =
            ^
SyntaxError: invalid syntax

If the error does not occur again the next time you call the command with --debug, for example because a new agent output is available, you can also view the last crash reports in the file system:

OMD[mysite]:~$ ls -lhtr ~/var/check_mk/crashes/check/ | tail -n 5
drwx------ 1 mysite mysite 44 Sep  3 05:15 f47550fc-4a18-11ee-a46c-001617122312/
drwx------ 1 mysite mysite 44 Sep  3 05:17 38657652-4a19-11ee-a46c-001617122312/
drwx------ 1 mysite mysite 44 Sep  3 09:31 9e716690-4a3c-11ee-9eaf-001617122312/
drwx------ 1 mysite mysite 44 Sep  3 10:40 479b20ea-4a46-11ee-9eaf-001617122312/
drwx------ 1 mysite mysite 44 Sep  4 14:11 fdec3ef6-4b2c-11ee-9eaf-001617122312/

There are two files in each of these folders:

crash.info contains a Python dictionary with the traceback and much more information. A glance at the file with a pager such as less is often sufficient.
agent_output contains the complete agent output that was current at the time of the crash.

7.3. Custom debug output

In the examples shown above, we use the print() function to output the content of variables or the structure of objects for you as a developer. These functions for debug output must be removed from the finished check plug-in.

As an alternative to removal, you can also have your debug output displayed only when the check plug-in is called from the console in debug mode. To do this, import the debug object from the Checkmk toolbox and, if necessary, the pprint() formatting aid. You can now produce debug output depending on the value of the debug object:

from cmk.utils import debug

from pprint import pprint

def check_mystuff(section):
    if debug.enabled():
        pprint(section)

Note that any remaining debug output should be used sparingly and limited to hints that will help subsequent users with debugging. Obvious and foreseeable user errors (for example, that the contents of the agent section indicate that the agent plug-in has been incorrectly configured) should be answered with the UNKNOWN state and include explanatory notes in the summary.

7.4. Invalid agent output

The question is how you should react if the output from the agent is not in the form you actually expect — whether from the Checkmk agent or received via SNMP. Suppose you always expect three words per line, what should you do if you only get two words?

Well, if this is a permitted and known behavior by the agent, then of course you need to intercept this and work with a case distinction. However, if this is not actually allowed, then it is best to act as if the line always consists of three words, for example with the following parse function:

def parse_foobar(string_table):
    for foo, bar, baz in string_table:
        # ...

If there is a line that does not consist of exactly three words, an exception is generated and you receive the very helpful crash report just mentioned.
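The effect of this strict unpacking can be tried out in plain Python, without a Checkmk site. The following is only a minimal sketch: the dictionary layout inside the parse function and the sample data are assumptions made for illustration.

```python
def parse_foobar(string_table):
    # Hypothetical layout: the first word becomes the key, the other two the value.
    parsed = {}
    for foo, bar, baz in string_table:
        parsed[foo] = (bar, baz)
    return parsed


good = [["a", "1", "2"], ["b", "3", "4"]]
bad = [["a", "1", "2"], ["b", "3"]]  # the second line has only two words

print(parse_foobar(good))

try:
    parse_foobar(bad)
except ValueError as exc:
    # In a real check plug-in you would not catch this exception,
    # so that Checkmk can create the crash report.
    print(f"ValueError: {exc}")
```

The try/except block is only there to make the exception visible in this demonstration.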

If you access keys in a dictionary that can be expected to be missing occasionally, it can of course make sense to react accordingly. This can be done by setting the service to CRIT or UNKNOWN and placing a note in the summary about the agent output that cannot be evaluated. In any case, it is better to use the get() function of the dictionary for this than to catch the KeyError exception. This is because get() returns None, or an optional replacement passed as the second parameter, if the key is not available:

def check_foobar(section):
    foo = section.get("bar")
    # Compare with None so that valid but falsy values (0, "") are not treated as missing
    if foo is None:
        yield Result(state=State.CRIT, summary="Missing key in section: bar")
        return
    # ...
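The behavior of get() can be checked in plain Python. The section content in this short sketch is invented for illustration. A value such as 0 exists but evaluates to false, so a comparison with None is what distinguishes a missing key from a falsy value:

```python
section = {"foo": 0}  # hypothetical parsed section

present = section.get("foo")          # key exists, but the value is falsy
missing = section.get("bar")          # key is missing, no replacement given
fallback = section.get("bar", "n/a")  # key is missing, replacement given

print(present is None)
print(missing is None)
print(fallback)
```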

7.5. Missing items

What if the agent outputs correct data, but the item to be checked is missing? Like this, for example:

def check_foobar(item, section):
    # Try to access the item as key in the section:
    foo = section.get(item)
    if foo is not None:
        yield Result(state=State.OK, summary="Item found in monitoring data")
    # If foo is None, nothing is yielded here

If the item you are looking for is not included, Python simply drops out at the end of the function without returning a result via yield. And that is exactly the right approach! Checkmk recognizes that the item to be monitored is missing and sets the service to UNKNOWN with a suitable standard text.

7.6. Testing with spool files

If you want to simulate particular agent outputs, spool files are very helpful. You can use these to test borderline cases that are otherwise difficult to recreate. Or you can directly use the agent output that led to a crash report to test changes to a check plug-in.

First deactivate your regular agent plug-in, for example by revoking its execute permission. Then create a file in the /var/lib/check_mk_agent/spool/ directory that contains the agent section (or agent sections) that your check plug-in expects, including the section header, and that ends with a newline. The next time the agent is called, the content of the spool file will be transferred instead of the output from the agent plug-in.
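Such a spool file can also be generated with a few lines of Python. The following is a sketch under assumptions: the helper write_spool_file() and the file name it chooses are hypothetical, the data line is made up, and a temporary directory stands in for the real spool directory so that the example can be run anywhere.

```python
import tempfile
from pathlib import Path


def write_spool_file(spool_dir, section_name, lines):
    """Write one agent section, including its header, and return the file path."""
    # The content starts with the section header and ends with a newline.
    content = f"<<<{section_name}>>>\n" + "\n".join(lines) + "\n"
    path = Path(spool_dir) / f"{section_name}_test"  # hypothetical file name
    path.write_text(content)
    return path


# A temporary directory stands in for /var/lib/check_mk_agent/spool/ here.
with tempfile.TemporaryDirectory() as tmp:
    spool_file = write_spool_file(tmp, "myhostgroups", ["example_group host1,host2"])
    content = spool_file.read_text()

print(content, end="")
```

On a monitored host you would pass the real spool directory instead, which normally requires root permissions.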

7.7. Old check plug-ins become slow for many services

With some check plug-ins that use items, it is quite possible that several hundred services will be generated on larger servers. If no separate parse function is used, the entire list of hundreds of lines must be run through for each of the hundreds of items. The time required to search therefore grows with the square of the number of listed items, which means tens of thousands of comparisons for hundreds of services. If, on the other hand, the nested list is transferred to a dictionary, a single lookup takes roughly constant time, so the total time for all items grows only linearly with their number.

In the Python wiki you will find an overview of the costs of searching in different data types, including an explanation and O notation. Using the parse function reduces the complexity of a single search from O(n) to O(1).

As older versions of this article did not make use of the parse function, you should identify check plug-ins that were written this way and rewrite them to use a parse function.
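The difference between the two approaches can be sketched in plain Python. The sample data and the function find_in_list() are invented for illustration, and the parse function shown here is only one possible way to build the dictionary:

```python
string_table = [[f"item{i}", str(i)] for i in range(500)]  # invented sample data


def find_in_list(item, table):
    # Without a parse function: scan the nested list for every single item, O(n).
    for name, value in table:
        if name == item:
            return value
    return None


def parse_foobar(table):
    # With a parse function: transfer the nested list to a dictionary once.
    return {name: value for name, value in table}


section = parse_foobar(string_table)

# Both approaches deliver the same value, but the dictionary lookup is O(1) on average.
print(find_in_list("item499", string_table))
print(section.get("item499"))
```

With 500 items, the list variant needs up to 500 comparisons per service, whereas the dictionary is built once and then queried directly.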

8. Files and directories

File path	Function
`~/local/lib/check_mk/base/plugins/agent_based/`	Storage location for self-written check plug-ins.
`~/local/share/check_mk/web/plugins/wato/`	Storage location for your rule sets for check parameters.
`~/local/share/check_mk/web/plugins/metrics/`	Storage location for your own metric definitions.
`~/local/share/check_mk/web/plugins/perfometer/`	Storage location for your own definitions of Perf-O-Meters.
`~/lib/check_mk/gui/plugins/wato/check_parameters/`	Here you will find the rule set definitions of all check plug-ins supplied by Checkmk.
`~/lib/check_mk/gui/plugins/wato/utils/__init__.py`	In this file the groups of the Setup interface are defined in which you can store new rule sets.
`~/lib/check_mk/gui/plugins/metrics/`	Here you will find the metric definitions of the supplied plug-ins.
`~/lib/check_mk/gui/plugins/metrics/unit.py`	This file contains the predefined units for metrics.
`/usr/lib/check_mk_agent/plugins/`	This directory refers to a monitored Linux host. The Checkmk agent for Linux expects extensions to the agent (agent plug-ins) here.
