Dynamic Tracing on OpenBSD 7.3

A couple years ago a new device silently entered the OpenBSD source tree.

The dt(4) device first appeared in OpenBSD 6.7 with very sparse documentation. In the releases since, it has gained quite a bit of functionality and now has proper tools to drive it.

The documentation has gotten better over the last couple of years, but there still isn't much information about how to use the device to do anything useful.

The dt pseudo-device driver is exposed as /dev/dt, and programs interact with it through ioctl calls. dt stands for "dynamic tracer", similar to dtrace and bpftrace. It's a device that lets you attach to probes in various parts of the system.

By default you can't communicate with the device, because it can't be opened unless the kern.allowdt sysctl flag is set to 1. However, you can't set this flag while the system's kern.securelevel is above 0, which it is under normal circumstances.

The securelevel of the machine is raised during the boot process and can't be lowered once the machine is running. That means you need to set the flag before the securelevel is raised. You can do this by adding the following line to /etc/sysctl.conf:

kern.allowdt=1
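If you'd rather script that change, here is a minimal sketch of a shell helper that appends the line only when it isn't already there. The function name enable_allowdt is made up for this example, and it only looks for an exact kern.allowdt=1 line, so a differently formatted entry would not be detected.

```shell
# Hypothetical helper: append kern.allowdt=1 to a sysctl.conf-style file
# unless an identical line is already present.
enable_allowdt() {
    conf=${1:-/etc/sysctl.conf}
    # -x matches whole lines only, -F treats the pattern as a fixed string
    if ! grep -qxF 'kern.allowdt=1' "$conf" 2>/dev/null; then
        echo 'kern.allowdt=1' >> "$conf"
    fi
}
```

Run enable_allowdt from a root shell (it defaults to /etc/sysctl.conf), then reboot for the setting to take effect. Calling it twice leaves a single line in the file.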

You can see in /etc/rc that /etc/sysctl.conf is loaded very early in the boot process, and that the securelevel is raised near the end of the boot.

# Apply sysctl.conf(5) settings.
sysctl_conf() {
        # do not use a pipe as limits would only be applied to the subshell
        set -- $(stripcom /etc/sysctl.conf)
        while [[ $# > 0 ]] ; do
                sysctl "$1"

                case "$1" in
                kern.maxproc=*)
                        update_limit -p maxproc
                        ;;
                kern.maxfiles=*)
                        update_limit -n openfiles
                        ;;
                esac
                shift
        done
}

[...]

[[ -f /etc/rc.securelevel ]] && sh /etc/rc.securelevel

# rc.securelevel did not specifically set -1 or 2, so select the default: 1.
(($(sysctl -n kern.securelevel) == 0)) && sysctl kern.securelevel=1

Taken from /etc/rc

You could also run the machine in permanently insecure mode by adding kern.securelevel=-1 to /etc/sysctl.conf. This would let you change the kern.allowdt flag at runtime, but it's not recommended because it greatly reduces the security of your system.

Once you've added the line to your sysctl.conf and rebooted, you should be able to open the device as root.

The main way to interact with the device is through the btrace command. btrace is the "bug tracer" tool, used to run scripts written in bt, the bug tracing language. bt's syntax is closely modeled on that of Linux's bpftrace tool.

Using the bt language and the btrace tool, you can profile system internals and probe the inner workings of many programs. You can see which processes are forking, opening files, reading from and writing to file descriptors, using pipes, calling chown, pledging, listening on a socket, and so on. For each event you can get kernel and userland stack traces, process IDs, function arguments, command names, thread IDs, CPU IDs, and return values.

You can see the full list of supported probes with btrace -l.

doas btrace -l
profile:hz:97
interval:hz:1
syscall:exit:entry
syscall:exit:return
syscall:fork:entry
syscall:fork:return
syscall:read:entry
syscall:read:return
syscall:write:entry
syscall:write:return
syscall:open:entry
syscall:open:return
syscall:close:entry
syscall:close:return
syscall:getentropy:entry
syscall:getentropy:return
syscall:__tfork:entry
syscall:__tfork:return
syscall:unlink:entry
syscall:unlink:return
syscall:wait4:entry
syscall:wait4:return
syscall:chdir:entry
syscall:chdir:return
syscall:fchdir:entry
syscall:fchdir:return
syscall:mknod:entry
syscall:mknod:return
syscall:chmod:entry
syscall:chmod:return
syscall:chown:entry
syscall:chown:return
[...]
tracepoint:raw_syscalls:sys_enter(register_t)
tracepoint:raw_syscalls:sys_exit(register_t)
tracepoint:uvm:fault(vaddr_t, vm_fault_t, vm_prot_t)
tracepoint:uvm:map_insert(vaddr_t, vaddr_t, vm_prot_t)
tracepoint:uvm:map_remove(vaddr_t, vaddr_t, vm_prot_t)
tracepoint:uvm:malloc(int, void *, size_t, int)
tracepoint:uvm:free(int, void *, size_t)
tracepoint:uvm:pool_get(void *, void *, int)
tracepoint:uvm:pool_put(void *, void *)
tracepoint:vfs:bufcache_rel(long, int, int64_t)
tracepoint:vfs:bufcache_take(long, int, int64_t)
tracepoint:vfs:cleaner(long, int, long, long)
tracepoint:vmm:guest_enter(void *, void *)
tracepoint:vmm:guest_exit(void *, void *, uint64_t)
tracepoint:vmm:inout(void *, uint16_t, uint8_t, uint32_t)
[...]

A sample of the available probes

Currently most probes seem to be for system calls (syscall:*), but there are also a few tracepoints for other parts of the kernel, such as the scheduler (tracepoint:sched:enqueue), the virtual memory system (tracepoint:uvm:malloc), and the virtual machine monitor, vmm (tracepoint:vmm:guest_enter).

Tracepoints often carry arguments, listed by type in parentheses in the output above, that you can inspect as part of your probe.

You can print values every time a probe fires using printf. You can also collect aggregate values, build simple histograms, calculate sums, or get min/max values. Check the bt(5) man page for the full syntax. Since it's based on the bpftrace syntax, you should also be able to look at bpftrace resources for ideas on how to use it.

The bt language is somewhat similar to awk.

bt programs have probes, filters, and actions.

They're put together like this:

PROBE /FILTER/ { ACTION(S) }

The action runs every time the probe fires. You can optionally add a filter so the action only runs under certain conditions, such as a specific PID or thread ID, using the form /pid == 1234/.

There are special BEGIN and END probes that are called once at the beginning and end of program execution, which can be used to set and clear variables.

Variables are global and take the form @var for scalars or @var[key] for maps. There are functions like clear(@map) and delete(@map[key]) that operate on maps and their entries.

You can also do basic math operations on values.
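To make the structure concrete, here is a small sketch that writes a bt script to a file instead of passing it with -e. It only combines pieces described in this post: a BEGIN probe, an unfiltered probe that fills a map, a filtered probe that fills a scalar, and an END probe. The file name forks.bt and the PID 1234 are placeholders, and I'm assuming the multi-line layout with trailing semicolons used by the example scripts shipped with the system.

```shell
# Write a small bt script; forks.bt is a hypothetical file name.
# The quoted 'EOF' keeps the shell from expanding anything inside.
cat > forks.bt <<'EOF'
BEGIN {
	printf("tracing, press Ctrl-C to stop\n");
}

syscall:fork:entry {
	@forks[comm] = count();
}

syscall:read:return /pid == 1234/ {
	@readsize = hist(retval);
}

END {
	printf("done\n");
}
EOF

# Then, as root on OpenBSD: doas btrace forks.bt
```

Here @forks is a map keyed by command name, while @readsize is a single scalar that holds a histogram for the one process selected by the filter.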

Here are a couple examples of how it can be used:

Print the process name and pid every time fork is called

doas btrace -e 'syscall:fork:entry { printf("%s[%d]\n", comm, pid) }'
ksh[68490]
ksh[68490]
ksh[68490]
ksh[68490]
sh[83762]
ksh[68490]
ksh[87548]
ksh[87548]
ksh[87548]
sh[33064]
smtpd[32246]
smtpd[33076]
smtpd[33076]
ksh[88916]
smtpd[33076]
smtpd[33076]
smtpd[33076]
ksh[52579]
smtpd[33076]
smtpd[33076]

Count how many times each process calls read (values printed after Ctrl-C)

doas btrace -e 'syscall:read:entry { @[comm] = count() }'
^C
@[sshd]: 243
@[ksh]: 48
@[less]: 31
@[cat]: 2

See which processes are calling which syscalls, and how many times

doas btrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm, arg0] = count() }'
^C
@[sshd, 48]: 354
@[top, 49]: 292
@[sshd, 109]: 176
@[top, 86]: 155
@[top, 202]: 124
@[sshd, 3]: 92
@[sshd, 4]: 92
@[less, 86]: 82
@[ksh, 4]: 55
@[less, 49]: 42
@[top, 46]: 36
@[sshd, 54]: 34
@[top, 169]: 31
@[ksh, 3]: 28
@[top, 3]: 27
@[ksh, 202]: 26
@[top, 252]: 22
@[ksh, 46]: 20
@[top, 253]: 19
@[ksh, 40]: 18
@[top, 5]: 18
@[top, 6]: 18
@[top, 74]: 18
@[top, 53]: 16
@[top, 54]: 15
@[ksh, 54]: 14
@[ksh, 99]: 14
@[less, 159]: 14
@[less, 74]: 14
@[top, 159]: 14
@[top, 4]: 11
@[ksh, 48]: 10
@[ksh, 6]: 9
@[sshd, 73]: 9
@[ksh, 38]: 8

You can check the syscall numbers here: https://github.com/openbsd/src/blob/master/sys/kern/syscalls.master
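Looking numbers up by hand gets tedious. As a rough sketch, assuming the header /usr/include/sys/syscall.h is installed and lists the same numbers as "#define SYS_name number" lines, a small shell helper can do the lookup locally. The function name syscall_name is made up for this example.

```shell
# Hypothetical helper: print the name of syscall number $1 by scanning
# "#define SYS_<name> <number>" lines in a syscall.h-style header.
# The header path ($2) defaults to /usr/include/sys/syscall.h (an assumption).
syscall_name() {
    awk -v n="$1" '
        $1 == "#define" && $2 ~ /^SYS_/ && $3 == n {
            print substr($2, 5)   # strip the leading "SYS_"
        }' "${2:-/usr/include/sys/syscall.h}"
}
```

With this, syscall_name 3 should print read, matching entry 3 in syscalls.master. Any other SYS_ constant that happens to share the number would be printed as well.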

See a histogram of the size of read() calls for a certain process

doas btrace -e 'syscall:read:return /pid == 5353/ { @readsize = hist(retval) }'
^C
@readsize:
[0, 1)             3 |@@@@@@@@@@@@@@@@@@@                                 |
[2, 4)             3 |@@@@@@@@@@@@@@@@@@@                                 |
[32, 64)           8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

See a histogram of how long it takes for a read() to return

doas btrace -e 'syscall:read:entry { @start[tid] = nsecs } syscall:read:return /@start[tid]/ { @times = hist(nsecs - @start[tid]); delete(@start[tid]) } END { clear(@start) }'
^C
@times:
[0]                3 |@@@                                                 |
[1K, 2K)           3 |@@@                                                 |
[2K, 4K)          50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)          24 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[8K, 16K)          8 |@@@@@@@@                                            |
[16M, 32M)         2 |@@                                                  |
[32M, 64M)         1 |@                                                   |
[64M, 128M)        5 |@@@@@                                               |
[128M, 256M)       4 |@@@@                                                |
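That one-liner is hard to read, so here is the same program laid out as a script file. The logic is unchanged from the command above; only the file name readlat.bt is made up, and the multi-line layout with trailing semicolons is my assumption about what btrace accepts in a file.

```shell
# Write the read() latency program to a file; readlat.bt is a hypothetical name.
cat > readlat.bt <<'EOF'
syscall:read:entry {
	@start[tid] = nsecs;
}

syscall:read:return /@start[tid]/ {
	@times = hist(nsecs - @start[tid]);
	delete(@start[tid]);
}

END {
	clear(@start);
}
EOF

# Then, as root on OpenBSD: doas btrace readlat.bt
```

The entry probe stores a timestamp per thread, the return probe only runs for threads that have one, and the difference in nanoseconds goes into the histogram. Clearing @start in END keeps leftover timestamps out of the final output.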

There are also a couple of examples in /usr/share/btrace that use sampling and command line arguments.

You can also check out the Linux bpftrace guide for ideas.