Designing for failure, part deux

October 10th, 2008

Following up on my previous post (sadly now over a year old), I wanted to talk about how to write good error messages in Java. If you haven’t read the previous post, go read it, including Eric Schrock’s Designing for Failure post. Go on, I’ll wait …..

Now that you’ve done your homework, let’s get started!

As I mentioned in my previous post, a comment on Eric’s blog covers one of my Java pet peeves:

try {
    // do something which throws an exception
} catch (RuntimeException ux) {
    throw new ExceptionHiddenException("Something bad happened");
}

This is error hiding. It doesn’t tell you what is wrong, in fact it explicitly discards all information in the caught RuntimeException–information that will likely tell you what is wrong. It would almost always be better to let this exception bubble up, even to the top-level, than to do what is done above. If you need to do some cleanup, like close a (possibly) open database connection, at least wrap it in a finally clause and drop information-hiding catch clauses like the one above. Or better–fix it…

What’s better? Maybe something like this:

try {
    // do something which throws an exception
} catch (RuntimeException ux) {
    throw new ModeratelyUsefulException("Something bad happened", ux);
}

So, we have a new exception that includes the cause exception, but we didn’t give any context, and we still fall into the trap of two of my lesser Java pet peeves: we still have getMessage() results that are largely devoid of information and useful exception data split across multiple lines. If you want to search through your application logs for errors, you might find the “Something bad happened” line, but you’d have to see the adjacent lines to get more details, making grep less handy than it could be and making you lean towards using much more heavyweight tools to look at logs. If you include the nested exception’s message in your message along with any additional context you can add, you end up with a very detailed (albeit fairly long) error message on one line that is very greppable. Your support staff will love you. And so will I.
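Here is a minimal sketch of that idea. The class names, host name, and port are made up for illustration, and connect() just simulates a failure; the point is only that the wrapping exception carries both the context and the cause’s own text in its message, while still keeping the cause attached.

```java
public class ContextualExceptionExample {

    /** Hypothetical exception type, used only for this illustration. */
    static class HostConnectionException extends RuntimeException {
        HostConnectionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Stands in for "do something which throws an exception"; it always fails. */
    static void connect(String host, int port) {
        throw new IllegalStateException("connection refused");
    }

    /** Wraps the failure, adding context and the cause's text to the new message. */
    static void connectWithContext(String host, int port) {
        try {
            connect(host, port);
        } catch (RuntimeException ex) {
            // Concatenating ex calls ex.toString(): the class name plus its message.
            throw new HostConnectionException(
                    "Could not connect to " + host + ":" + port + ": " + ex, ex);
        }
    }

    public static void main(String[] args) {
        try {
            connectWithContext("db.example.org", 5432);
        } catch (HostConnectionException e) {
            // One greppable line with the context and the underlying reason.
            System.out.println(e.getMessage());
        }
    }
}
```

The message ends up on a single line, and the stack trace of the original exception is still available through getCause() for anyone who needs it.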

An even better example is in a patch for a bug I submitted to Maven’s Mojo project after being irritated that part of my OpenNMS build failed with no details whatsoever. The original re-thrown exception:

throw new MojoExecutionException( "Unable to copy dependency to staging area" );

Great! No nested exception, and it tells me neither which dependency it was nor where in the staging area it was trying to copy it. The only way I could figure this out was to fire up a system call tracer (truss/strace/etc.) and see what system call failed, or fire up Maven inside of a debugger.

Here’s the improved message with context and the nested exception:

throw new MojoExecutionException( "Unable to copy dependency to staging area. Could not copy " + artifact.getFile() + " to " + newLocation, ioe );

My only complaint about the new code is that it doesn’t include the nested exception’s message in the new exception’s message. This isn’t so bad with Maven, because it will show nested exceptions if there are any. If that was going to end up in a log4j log message, you can bet that I would include the nested exception’s message. Speaking of log4j, this is how I like to log exceptions:

log().warn("Could not connect to host " + host + ":" + port + " due to: " + ex, ex);

I both include the exception in the String log message and also pass it as a second argument. This gets support staff a pretty darn good amount of information on one line, as well as a stack backtrace for developers (and those super-duper support staff in the world).
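For anyone who wants to try the pattern without pulling in log4j, here is a rough sketch using java.util.logging as a stand-in (that swap is mine, as are the class name, method names, and the example host and port). It follows the same two-part idea: the exception’s text goes into the message, and the exception itself is passed along so the logger can print the stack trace.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class OneLineLoggingExample {

    private static final Logger LOG =
            Logger.getLogger(OneLineLoggingExample.class.getName());

    /** Builds the single greppable line: context first, then the exception's own text. */
    static String describe(String host, int port, Throwable ex) {
        return "Could not connect to host " + host + ":" + port + " due to: " + ex;
    }

    /** Logs the failure with the exception both in the text and as the thrown argument. */
    static void logFailure(String host, int port, Throwable ex) {
        // The message is for grep; the Throwable argument is for the stack trace.
        LOG.log(Level.WARNING, describe(host, port, ex), ex);
    }

    public static void main(String[] args) {
        logFailure("192.0.2.10", 161, new IOException("Connection timed out"));
    }
}
```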

I’ll stop blabbering on and share one of the simple little classes that makes me love Spring: NestedExceptionUtils, in particular its buildMessage method. This method is used by the getMessage() methods in the Exception classes that are the root of all Spring Exceptions to return their own message along with the message for a nested exception, if any. This can create huge, long messages, but they sure are chock-full-o-information. Take a peek at the NestedExceptionUtils.java source in its elegant simplicity along with the source of NestedRuntimeException to see it in action.
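To give a feel for the idea without opening the Spring source, here is a simplified sketch of the same technique. This is not Spring’s code: the class names are hypothetical, and the “; nested exception is” separator is my assumption about the wording, so check the real source before relying on it.

```java
public class NestedMessageSketch {

    /** Returns the message followed by the cause's text, or just the message if there is no cause. */
    static String buildMessage(String message, Throwable cause) {
        if (cause == null) {
            return message;
        }
        StringBuilder sb = new StringBuilder();
        if (message != null) {
            sb.append(message).append("; ");
        }
        // Appending the cause calls cause.toString(), which includes the cause's getMessage().
        return sb.append("nested exception is ").append(cause).toString();
    }

    /** Hypothetical base exception whose getMessage() folds in the nested exception. */
    static class ChainedException extends RuntimeException {
        ChainedException(String message, Throwable cause) {
            super(message, cause);
        }

        @Override
        public String getMessage() {
            return buildMessage(super.getMessage(), getCause());
        }
    }

    public static void main(String[] args) {
        Throwable root = new java.io.IOException("disk full");
        RuntimeException middle = new ChainedException("Could not copy artifact", root);
        RuntimeException top = new ChainedException("Build failed", middle);
        // Each level contributes its own message, all on one line.
        System.out.println(top.getMessage());
    }
}
```

Because each level’s getMessage() includes the level below it, the chaining happens for free: the top-level message carries every message down to the root cause.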

Happy coding, and good luck improving your logging!

Designing for failure

February 24th, 2007

This is a shameless plug for Eric Schrock‘s blog post on the same subject.

When we write software, one of the things we usually strive for is to make the user experience friendly. A large part of this is making the software easy to use, but a bigger part is making it fail nicely. Although it’s relatively easy to write software that works when everything goes well, it’s harder to make software that does the right thing in the case of failure. This is one of my personal IT crusades because I’ve dealt with way too much software that provides near meaningless error messages, or even worse, completely fails with dreaded segmentation violations, NullPointerExceptions, etc. due to poor error checking.

Eric Schrock has some good ideas on what the right thing is and suggests that “[w]hen choosing to display an error message, you should always take the following into account:

“An error message must clearly identify the source of the problem in a way that the user can understand.

“An error message must suggest what the user can do to fix the problem.”

I like these principles; they are simple, concise, and I can do nothing but agree with both of them based on my experience as a user, a system administrator, and a developer. He goes into more detail in his blog post, and I suggest that anyone who writes any code, be it a small shell script or a large application, read it. We can all do our own part to improve the situation for our users and raise the bar for all software.

Next post, I’ll talk about some of the things that I’ve found help out a lot in Java programming (one of the comments on Eric’s blog post goes into one such item that happens to be a pet peeve of mine). For now, I just have to say that I think the people working on the Spring Framework do a great job. Rock on, folks!

Using jicmp from the command-line

November 3rd, 2006

One of the bits of C code in OpenNMS is the code that we use to “ping” machines with ICMP echoes. Java doesn’t have the ability for us to send and receive ICMP packets, so we have a little bit of C code that we call from Java to do the twiddling with the ICMP packets for us.

We run on a handful of different instruction set architectures (ISAs), including x86, x64, SPARC 32 & 64 bit, and PowerPC 32 & 64 bit. Now, if you are running an Intel 32 bit Java virtual machine, but our C code gets compiled for a 64 bit ISA, then the C code won’t work from Java, and you’ll get errors whenever we try to use it. Even when the ISAs match, linking problems or other oddities can cause the same kind of failure.
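A quick way to see which architecture your JVM thinks it is running as is to print a couple of system properties. This is only a rough diagnostic sketch with a made-up class name: os.arch is a standard property, but sun.arch.data.model is specific to some JVMs and may be missing, so the code falls back to “unknown”.

```java
public class JvmArchCheck {

    /** Describes the architecture the running JVM reports for itself. */
    static String describeJvm() {
        String arch = System.getProperty("os.arch");
        // Not every JVM sets this property, so supply a default.
        String bits = System.getProperty("sun.arch.data.model", "unknown");
        return "JVM architecture: " + arch + ", data model: " + bits + " bit";
    }

    public static void main(String[] args) {
        System.out.println(describeJvm());
    }
}
```

If what this prints doesn’t match the architecture the C library was compiled for, that mismatch is a good first suspect.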

You can read the rest in the wiki.

Moved to WordPress

March 23rd, 2006

We’ve gone from Tiki Wiki blogs, to MediaWiki user talk pages (which just aren’t good for blogging), and now we have settled on WordPress. It looks quite nice, and I hope to blog more often.

Welcome to our new blog home, everyone!

More pollers here we come!

December 1st, 2005

The unstable branch of OpenNMS has had some recent commits that make it easier for users to convert from Nagios. Over the past three months I added NRPE support, and complete support for it is available in 1.3.1 (partial support was in 1.3.0–it lacked a capsd plugin). Last week, Matt Raykowski, who has been a fixture on the IRC channel lately, submitted enhancement bug #1389, an NSClient capsd plugin and poller monitor.

What do these two things do? NRPE is a simple way to execute Nagios plugins on a remote (UNIX) machine. You set up the remote machine with an NRPE daemon along with the Nagios plugins that you want to run, and configure the remote daemon to execute those plugins with the options you desire. You then set up your central monitoring host to query the plugins you set up on the remote host. With Nagios, you would use the check_nrpe plugin to query the remote NRPE daemon (which you could also call from OpenNMS using the Generic Plugin poller), but now you can directly query the remote NRPE daemon from OpenNMS using the NRPE plugin. You can see an example of configuring the NRPE poller in the examples poller-configuration.xml file. Please note that you always have to have a capsd plugin for each poller monitor you have configured.

NSClient is a bit different, in that it allows you to query performance counters on Windows systems. Like NRPE, it requires a daemon to be installed on the remote system. What’s different is that it doesn’t let you run arbitrary Nagios plugins on the remote system (there’s nrpe_nt for that). It lets you query utilization of CPU, memory, and disk, check whether a service or process is running, and read arbitrary Windows performance data.

I have included the NsclientManager documentation below for those who are curious. Unless you feel comfortable building from source (and doing so in the unstable branch), you should hold off until the 1.3.2 release.

For testing, you can run the NsclientManager implementation directly:

 java -cp $OPENNMS_HOME/lib/opennms_services.jar org.opennms.netmgt.poller.nsclient.CheckNsc

Here are a couple of example tests using the command-line tool:

  1. Testing the client version:
    java -cp $OPENNMS_HOME/lib/opennms_services.jar org.opennms.netmgt.poller.nsclient.CheckNsc CLIENTVERSION 0 0 "1.0.7.0"
  2. Testing used disk space:
    java -cp $OPENNMS_HOME/lib/opennms_services.jar org.opennms.netmgt.poller.nsclient.CheckNsc  USEDDISKSPACE 95 90 "C:"
  3. Testing memory usage:
    java -cp $OPENNMS_HOME/lib/opennms_services.jar org.opennms.netmgt.poller.nsclient.CheckNsc  MEMUSE 95 90 ""

Available commands

CLIENTVERSION
This check uses the parameter property only. This string must be formatted like “#.#.#.#” or empty. If you provide a number, such as “1.0.7.0”, you are specifying the minimum version supported; any version lower than this parameter will return ‘critical’. If you do not specify a parameter (for example, supplying “”), the check always returns ‘okay’ and will provide the version number in the response. Using CLIENTVERSION in capsd:
<protocol-plugin protocol="NSC-CLIENTVERSION" class-name="org.opennms.netmgt.capsd.plugins.NsclientPlugin" scan="on" user-defined="false">
  <property key="port" value="1248" />
  <property key="timeout" value="3000" />
  <property key="retry" value="2" />
  <property key="command" value="CLIENTVERSION" />
  <!-- parameter is optional, if you want to do version checking
       <property key="parameter" value="1.0.7.0" />
  -->
</protocol-plugin>
CPULOAD
This check uses the warningPercent and criticalPercent properties only. These values must be integers in percentage format. Using CPULOAD in capsd:
<protocol-plugin protocol="NSC-CPULOAD" class-name="org.opennms.netmgt.capsd.plugins.NsclientPlugin" scan="on" user-defined="false">
  <property key="port" value="1248" />
  <property key="timeout" value="3000" />
  <property key="retry" value="2" />
  <property key="command" value="CPULOAD" />
  <property key="warningPercent" value="90" />
  <property key="criticalPercent" value="95" />
</protocol-plugin>
UPTIME
This check simply returns the uptime of the system. No validation is performed.
USEDDISKSPACE
This check uses the warningPercent and criticalPercent properties to determine thresholds. These values must be integers and they should be in percentage form (e.g. 1-100.) The parameter property is used to determine the drive letter, either “C” or “C:” can be used. Some older services have mixed support, so try adding/removing the ‘:’ if you are experiencing problems. Using USEDDISKSPACE in capsd:
<protocol-plugin protocol="NSC-C-SPACE" class-name="org.opennms.netmgt.capsd.plugins.NsclientPlugin" scan="on" user-defined="false">
  <property key="port" value="1248" />
  <property key="timeout" value="3000" />
  <property key="retry" value="2" />
  <property key="command" value="USEDDISKSPACE" />
  <property key="warningPercent" value="90" />
  <property key="criticalPercent" value="95" />
  <property key="parameter" value="C:" />
</protocol-plugin>
SERVICESTATE
This check determines the status of NT services on a remote server. This check only uses the parameter property. The parameter property should contain a comma delimited list of services you would like the status of. Using SERVICESTATE in capsd:
<protocol-plugin protocol="NSC-SERVICE-SERVERS" class-name="org.opennms.netmgt.capsd.plugins.NsclientPlugin" scan="on" user-defined="false">
  <property key="port" value="1248" />
  <property key="timeout" value="3000" />
  <property key="retry" value="2" />
  <property key="command" value="SERVICESTATE" />
  <property key="parameter" value="Eventlog,lanmanserver,Netlogon,RpcSs" />
</protocol-plugin>
PROCSTATE
This check determines whether or not a list of processes are running on a remote server. This check only uses the parameter property. The parameter property should contain a comma delimited list of processes you want to determine the status of. Using PROCSTATE in capsd:
<protocol-plugin protocol="NSC-PROCESS-NAVISPHERE" class-name="org.opennms.netmgt.capsd.plugins.NsclientPlugin" scan="on" user-defined="false">
  <property key="port" value="1248" />
  <property key="timeout" value="3000" />
  <property key="retry" value="2" />
  <property key="command" value="PROCSTATE" />
  <property key="parameter" value="naviagent.exe,EmcPowSrv.exe" />
</protocol-plugin>
MEMUSE
This check uses the warningPercent and criticalPercent properties only. These values must be integers in percentage format. Using MEMUSE in capsd:
<protocol-plugin protocol="NSC-MEMORY" class-name="org.opennms.netmgt.capsd.plugins.NsclientPlugin" scan="on" user-defined="false">
  <property key="port" value="1248" />
  <property key="timeout" value="3000" />
  <property key="retry" value="2" />
  <property key="command" value="MEMUSE" />
  <property key="warningPercent" value="90" />
  <property key="criticalPercent" value="95" />
</protocol-plugin>
COUNTER
This check is used to check the PerfMon OID objects on a remote Windows server. This check uses the warningPercent and criticalPercent properties as values. These values may or may not actually be in percentage format. You will have to use discretion when setting these up and read the documentation for the specific PerfMon OID that you will be monitoring. It also uses the parameter property to define the PerfMon OID that you will be monitoring. Using COUNTER in capsd:
<protocol-plugin protocol="NSC-PRINT-QUEUE" class-name="org.opennms.netmgt.capsd.plugins.NsclientPlugin" scan="on" user-defined="false">
  <property key="port" value="1248" />
  <property key="timeout" value="3000" />
  <property key="retry" value="2" />
  <property key="command" value="COUNTER" />
  <property key="warningPercent" value="10" />
  <property key="criticalPercent" value="20" />
  <property key="parameter" value="\Print Queue(_Total)\Jobs" />
</protocol-plugin>
FILEAGE
This check is used to determine the age of a specified file on the remote server. This check uses the warningPercent and criticalPercent properties as values. These values will be the newest age of the file allowed in minutes. For example, if you set the criticalPercent to 60 then a file that is 59 minutes or newer will result in a critical response from the manager. This check uses the parameter property to determine the full path of the file to be monitored. This check is useful for monitoring log files or other files that may collect critical events or files that should rarely change. Note: This check does not yet support using the modified date response supplied by the server. Using FILEAGE in capsd:
<protocol-plugin protocol="NSC-SYSTEMFILES" class-name="org.opennms.netmgt.capsd.plugins.NsclientPlugin" scan="on" user-defined="false">
  <property key="port" value="1248" />
  <property key="timeout" value="3000" />
  <property key="retry" value="2" />
  <property key="command" value="FILEAGE" />
  <property key="warningPercent" value="0" />
  <property key="criticalPercent" value="525600" /> <!-- one year -->
  <property key="parameter" value="C:\autoexec.bat" />
</protocol-plugin>

Changing snmp-graph.properties to XML and making a resilient parser

July 11th, 2005

The configuration file for OpenNMS’s SNMP (AKA performance) graphs leaves a bit to be desired. snmp-graph.properties is a Java properties file and it is very picky about being formatted correctly.

For a while, I’ve wanted to change snmp-graph.properties to an XML format, maybe something based on JRobin’s XML template format. Over the weekend I made an XML schema for the JRobin XML template format, and I also created a conversion program to take the existing snmp-graph.properties file and convert it into individual XML files. I’m pretty happy with the format, although I still need to deal with an issue where I need Castor to order things in a certain way.

What does everyone think about the format of these files? So far, this is just the format for each graph, and it needs a few more things to make it a functional replacement for what we have now, and a few more things on top of that to make it super-sweet. First, stuff that is needed:

  1. Sorting. The existing file has a “reports” property that specifies all of the available graphs (AKA reports). They are displayed in the order in which they are listed in the property (and only if all of the RRD files needed for the graph exist). I would like to just have a directory full of XML files that get loaded automatically, but I have the issue of sorting. The sanest way to me would be to sort them alphabetically or based on some sorting key that is in each XML file, neither of which I’m a fan of. So, I’ll need some other configuration file that specifies sorting.

Features that would be super-sweet:

  1. Resiliency. If someone screws up a single graph, only make that one graph fail, and make it fail gracefully. Either don’t attempt to use that graph at all, or display a helpful, administrator/user-friendly error message that things are broken. If someone screws up the master graph configuration file, display things unsorted and/or an administrator/user-friendly error message.
  2. Auto-loading. I would like all graphs that the system can find to be configured, even if they aren’t listed in the master list that specifies sorting (e.g.: all files in the dedicated graph configuration directory). If a graph is found but isn’t in the sorting list, just toss it at the end of the graphs in the sorting list. If a graph is changed/removed/added or if the sorting list is changed, reload things automatically. We shouldn’t have to restart.
  3. Templates. I want to be able to have templates that specific graphs can use to specify common elements, such as fonts, borders, etc. Things might be made a lot less redundant by having templates for graphs with one, two, three, etc. lines, for example. Chaining of templates would also be good to support. Strictly, what I already have are templates because they exclude the location of the RRD files (these would be {rrd1}, {rrd2}, etc. in snmp-graph.properties). The RRD file location would be filled in based on the node (and possibly interface) information that OpenNMS knows, along with the name of the RRD file, which is derived from the datasource names in the graph.
  4. Graph editor. Let users specify their own graphs from the web interface.
  5. Graph builder. Have something like Dave Plonka’s RRGrapher that administrators can use to easily build graphs and then “store” them as an XML configuration file for future automated graphing.
  6. Data-source flexibility. Right now, snmp-graph.properties only supports node-specific and interface-specific graphs. We need to support graphs across multiple nodes, across multiple interfaces, and with indexes other than the interface index. We also should support non-SNMP data. It would be great to support finding and displaying a RRD file from a group of hosts, such as based on filters and rules that are used elsewhere in OpenNMS. E.g.: graph the ping times for all of my web servers on one graph, or average all of the ping times from my web servers into a single line on a graph–all without having to explicitly specify the hosts and/or RRD files that are involved.
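To make the resiliency and auto-loading items a bit more concrete, here is a rough sketch of the loading logic I have in mind. Nothing here exists in OpenNMS: the class, the in-memory map standing in for a directory of XML files, and the trivial parse() check are all hypothetical. Appending unlisted graphs in alphabetical order is also just one assumed choice for “toss it at the end”.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ResilientGraphLoader {

    /** Graphs that loaded, in display order, plus one error message per broken graph. */
    static class LoadResult {
        final List<String> graphs = new ArrayList<>();
        final Map<String, String> errors = new LinkedHashMap<>();
    }

    /** Hypothetical stand-in for real XML parsing: rejects empty definitions. */
    static void parse(String name, String definition) {
        if (definition == null || definition.trim().isEmpty()) {
            throw new IllegalArgumentException("graph definition is empty");
        }
    }

    static LoadResult load(Map<String, String> definitions, List<String> sortOrder) {
        // Graphs named in the sorting list come first, in that order.
        Set<String> names = new LinkedHashSet<>();
        for (String name : sortOrder) {
            if (definitions.containsKey(name)) {
                names.add(name);
            }
        }
        // Everything else that was found is appended (alphabetically, by assumption).
        names.addAll(new TreeSet<>(definitions.keySet()));

        LoadResult result = new LoadResult();
        for (String name : names) {
            try {
                parse(name, definitions.get(name));
                result.graphs.add(name);
            } catch (RuntimeException ex) {
                // One broken graph only costs us that graph, and we say why.
                result.errors.put(name, "Could not load graph '" + name + "': " + ex);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> definitions = new HashMap<>();
        definitions.put("mib2.bits", "<graph/>");
        definitions.put("mib2.errors", "");
        definitions.put("icmp", "<graph/>");
        LoadResult result = load(definitions, Arrays.asList("mib2.bits", "mib2.errors"));
        System.out.println("Loaded: " + result.graphs);
        System.out.println("Errors: " + result.errors.values());
    }
}
```

The per-graph try/catch is the whole trick: a bad file ends up as an entry in the error map, with a message an administrator can act on, instead of taking every other graph down with it.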

I’m open to comments and suggestions on the XML format, features, and anything else listed above or that you can think of. I hope to have this done by August.

What I’ve been working on lately

July 8th, 2005

So, I’ve gotten into documentation lately, which feels a little weird for me. I don’t think of myself as a documentation person, but instead as one of those people that should write it but never does. I’ve noticed myself drifting this way over the past few years, and I guess it isn’t a bad thing. :-)

So, I attacked the install guide a few months ago, and ever since I’ve been itching to do something about an administrator’s guide and a reference guide. I just got back from a week-long vacation, and I had planned to do some hacking on OpenNMS while I was away. Well, most of what I got done was hacking not on code, but on documentation–this time for our XML schemas. These define what is allowed in OpenNMS’s XML configuration files, and probably more importantly from a developer’s standpoint, they are used as input to Castor to automatically generate Java code that we then use to access the XML files.

It turns out that XML schemas allow embedded documentation pretty much anywhere in the schema inside of almost any element. There are a few tools out there to take this documentation embedded in the schemas and make a user-friendly representation of it, and I chose the open-source xsddoc tool (part of the xframe project) to do the work for us. The only thing left for me was to add the elements that xsddoc uses. It turned out that most of the files were already pretty well commented with XML comments (<!-- -->), so after writing a short script I got everything converted.

I committed my work last week into HEAD (the “unstable” branch), and I also have a copy of it on my website. There are some bugs in xsddoc that need to be worked out, and it could use some enhancements, but so far, I’m pretty happy with it. It was definitely a lot easier than writing everything by hand, and is a lot more maintainable.

That’s all for now, folks!