I think my firewalls are trying to kill me

From time to time, my OpenNMS installation has appeared to cause issues with devices that it’s managing. Sometimes I’ve got to the bottom of it; on other occasions I’ve not been able to find a solution. Failure to isolate the problem has usually been because the device was critical and there was no test environment. Whenever I’ve found a solution, it’s always been with the device, rather than OpenNMS.

I had just such an example today. I have a remote network that’s only accessible via a VPN. The VPN endpoint is maintained by a firewall. Naturally, I wanted to poll and collect stats from the firewall with OpenNMS.

It seemed sensible to collect from the inside interface of the firewall across the VPN rather than the outside interface directly. As usual, this required a couple of arcane global configuration directives to make it happen: one to allow traffic to come out of an interface and then re-enter the same interface, another to permit management access to an interface other than the one from which you entered the firewall.

I did this, and added the firewall’s IP address to OpenNMS. After an eternity of capsd chugging away against the firewall, a new node appeared with all the interfaces and services I expected, plus a number that I didn’t. For some reason, OpenNMS seemed to have discovered Postgres, MySQL, OpenManage and a host of other TCP-based services running on the firewall, which was odd. Further examination showed that all the unexpected services had been detected by capsd’s TCP plugin.

A few minutes with tcpdump showed that my firewall was actually completing a TCP handshake for every service tested, then sending a RST back after the ACK from the OpenNMS box. As a completed handshake was all that the capsd plugin and the pollers required to detect and poll a service, these bogus services had mysteriously appeared.
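To make the failure mode concrete, here is a minimal Python sketch of the two styles of check. It is not how capsd is implemented; the function names, timeouts and the "wait briefly after connecting" heuristic are all my own assumptions for illustration. The first function treats a completed handshake as proof of a service, which is the behavior that was fooled here. The second waits a moment after connecting and treats an immediate reset or close as "no service".

```python
import socket


def connect_only_check(host, port, timeout=3.0):
    """Naive check (hypothetical): a completed TCP handshake counts as a service."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: no handshake, no service.
        return False


def connect_and_wait_check(host, port, timeout=3.0, settle=0.5):
    """Stricter check (hypothetical): after the handshake, wait `settle`
    seconds and treat an immediate reset or close as 'no service'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(settle)
            try:
                data = sock.recv(1)
            except socket.timeout:
                # Connection is still up and the peer is silent, which is
                # what a real service waiting for a request looks like.
                return True
            except ConnectionResetError:
                # The peer sent a RST right after the handshake.
                return False
            # Empty read means the peer closed straight away; any data
            # means something on the other end is actually talking.
            return data != b""
    except OSError:
        return False
```

Against a device that behaves like my firewall, the first function would be expected to report every port as open, while the second should report only the ports that stay connected. The trade-off is that the stricter check costs up to `settle` seconds per port, and a service that deliberately closes connections at once would be missed.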

It’s not obvious why this is happening. It seems as if one of the arcane configuration directives is causing the firewall to somehow proxy the TCP connection to its inside interface, then send a RST back if there is no service available. Firewalls are complex beasts and it’s not as if I have the source code to hand to find out what it’s up to. I’m just stuck with the empirical evidence: this is the device’s behavior and OpenNMS has to deal with it.

Fortunately, OpenNMS provided a solution: turning off capsd scanning of the inside interface for everything apart from ICMP and SNMP. It’s a pain for large numbers of nodes, but it’s good enough for this one instance, and it’s secure (which is key). The more disturbing aspect is that in earlier versions of the firewall vendor’s software, there is a “severe” bug associated with this behavior. Apparently, TCP connections made via the VPN to any port on a management interface configured in this way will stay open “forever”. Enough of them will eventually hit the maximum connection count and prevent further connections to the device. Fortunately this is fixed in the version I’m running.

A younger and less pragmatic me would have railed against the firewall vendor (how dare they release security software with bugs of any kind) and argued with the firewall administrator (“of course I _have_ to collect stats from your firewall this way”). As it was, the solution was with OpenNMS, and the problem has vanished after a little research. Anyhow, as the firewall admin, I can hardly castigate myself for my choice of platform.

It was reassuring to see that OpenNMS was behaving itself and should not be “considered harmful”. My firewalls are _not_ trying to kill me. It was also nice to be reminded that my job has its positive aspects. In the past this would have resulted in a week-long conversation with a firewall admin about which of our systems was the culprit for the “bogus” services.
