GaffneyWare

Thoughts on Software Development from Michael Gaffney

The BoxTone Dashboard and the BlackBerry Outage

"A dashboard is a visual display of the most important information needed to achieve one or more objectives; consolidated and arranged on a single screen so the information can be monitored at a glance." Stephen Few, Information Dashboard Design (2006)

The BlackBerry infrastructure went down on Monday February 11, 2008 at approximately 3:20 PM EST. This event, and the reasons behind it, have been covered in detail on a number of other websites, so I'm not going to cover them here. Instead, I want to show you what this event looked like on the BoxTone Dashboard.

Please note that I work for BoxTone. I am one of the two Principal Software Architects there and I designed the BoxTone Dashboard.

Below are two screen shots of the Dashboard: figure 1 shows the Dashboard when everything is normal and figure 2 shows the Dashboard during the BlackBerry outage.

* This screen shot was provided by jibi. It was taken on Tuesday February 19, 2008 at 9:54 AM EST. He blurred out names for security reasons.

* It has been resized to fit better; the original can be viewed here.
BoxTone Dashboard

Figure 1: BoxTone Dashboard when everything is normal. Click to enlarge.

* This screen shot was originally posted on the BlackBerry Forums by jibi. It was taken on Monday February 11, 2008 at 4:47 PM EST. For security reasons, he changed the name of each BES and removed the "Users in Critical" and the "Mail Servers with hung threads" sections.
BoxTone Dashboard during BlackBerry Outage

Figure 2: BoxTone Dashboard during the BlackBerry outage. Click to enlarge.

A Global Problem

If you quickly glance at figure 1 and then quickly glance at figure 2 you should immediately notice that figure 2 stands out more. It stands out more because of the careful and restrained use of colors.

Unlike many other dashboards, the BoxTone Dashboard uses highly saturated colors only when there is a problem.* When everything is normal, the BoxTone Dashboard looks, well ... normal.

* Edward Tufte's Envisioning Information (1990) has a chapter entitled "Color and Information" which provides an excellent overview of the use of color.

Our users are highly intelligent people. They do not need bright green check marks to make them feel good about themselves. They bought BoxTone to tell them when something is wrong. And when something is wrong, they want to know what it is and quickly!

A single bright red dot on a calm page screams out and demands the user's immediate attention. On February 11, 2008, this technique is what allowed our users to know instantly that the problem was a global one.
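The color rule described in this section can be sketched as a small status-to-color table. This is only an illustration of the principle: the status names and hex values below are hypothetical and are not taken from the BoxTone product. The helper measures HSV saturation so the "calm when normal, saturated when broken" rule can be checked rather than eyeballed.

```python
import colorsys

# Hypothetical palette: a muted gray for the normal state, saturated
# colors reserved for problem states only.
STATUS_COLORS = {
    "normal": "#8a8f94",
    "warning": "#f2c200",
    "critical": "#e00000",
}


def saturation(hex_color: str) -> float:
    """Return the HSV saturation (0.0 to 1.0) of a '#rrggbb' color."""
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5))
    _, s, _ = colorsys.rgb_to_hsv(r, g, b)
    return s


def color_for(status: str) -> str:
    """Look up the display color for a status name."""
    return STATUS_COLORS[status]
```

With a palette like this, a healthy page stays visually quiet and any saturated mark is, by construction, a problem.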

The devil is in the details

* This detail is from figure 1 above. I have replaced the blurred name with BES09 for illustrative purposes only.
Normal BES

Figure 3: The detail for each BES shows over 600 individual points of data.

The BoxTone Dashboard displays the values of 10 KPIs over the last 3 hours at 3 minute intervals for each BES. All of this data is graphed using sparklines which are stacked vertically and aligned across their common axis of time. Figure 3 shows what these KPIs look like when a BES is healthy.
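The arithmetic behind that density is easy to check: a 3 hour window sampled every 3 minutes gives 60 samples per KPI, and 10 KPIs give about 600 points per BES, in line with the caption of figure 3. The sketch below works that out and renders a series as a one-line text sparkline. The `sparkline` function and the block characters are my own illustration of the idea, not how the BoxTone Dashboard draws its graphs.

```python
from typing import Sequence

WINDOW_MINUTES = 3 * 60   # last 3 hours
INTERVAL_MINUTES = 3      # one sample every 3 minutes
KPI_COUNT = 10            # KPIs shown per BES

samples_per_kpi = WINDOW_MINUTES // INTERVAL_MINUTES
total_points = samples_per_kpi * KPI_COUNT

BARS = "▁▂▃▄▅▆▇█"


def sparkline(values: Sequence[float]) -> str:
    """Render values as one line of text, scaled to the series' own range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series has no range to scale against; draw a flat line.
        return BARS[0] * len(values)
    top = len(BARS) - 1
    return "".join(BARS[int((v - lo) / span * top)] for v in values)
```

Stacking one such line per KPI, all covering the same 60 time slots, is what lets the eye read straight down a column and compare KPIs at the same moment.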

In addition to being aligned by time, the KPIs are organized so that the health KPIs sit on top of the performance KPIs. This ordering allows the BES Administrator to quickly detect any causal relationships which may emerge between the health and the performance of a BES.

The details of the BlackBerry outage

Knowing there is a problem and knowing what the problem is are two very different things. Bright red dots are just not enough.

* This image is a closeup from figure 2 above.
BES with a problem

Figure 4: One of the BES from the BoxTone Dashboard during the BlackBerry Outage.

Figure 4 shows the details of the BlackBerry outage from the point of view of BES09 and, I must say, it's quite a view!

At 3:24 PM, the first SRP connection failure appears on the BoxTone Dashboard. Then, for the next 9 minutes, the SRP connection is up, but it fails again at 3:36 and stays down for 6 minutes. During this entire 18-minute window, the BoxTone Dashboard shows BES09 as having multiple errors.
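The timeline above can be checked with a little arithmetic. The sketch below reconstructs the SRP status at 3 minute samples from the description in the text, assuming the first failure occupies a single sample at 3:24; that sample list is my reading of the paragraph, not data exported from the product. It then groups consecutive samples into up and down runs.

```python
from datetime import datetime, timedelta
from itertools import groupby

INTERVAL = timedelta(minutes=3)
START = datetime(2008, 2, 11, 15, 24)

# Assumed SRP status per 3 minute sample, starting at 3:24 PM:
# down, up for 9 minutes, then down for 6 minutes.
SRP_UP = [False, True, True, True, False, False]


def runs(start, states, interval):
    """Group consecutive equal states into (start_time, state, duration)."""
    result = []
    t = start
    for state, group in groupby(states):
        duration = interval * len(list(group))
        result.append((t, state, duration))
        t += duration
    return result


timeline = runs(START, SRP_UP, INTERVAL)
window = sum((d for _, _, d in timeline), timedelta())

for begin, up, duration in timeline:
    label = "up" if up else "DOWN"
    print(begin.strftime("%H:%M"), label, int(duration.total_seconds() // 60), "min")
```

Under that assumption the runs add up to the 18 minutes described above: 3 minutes down, 9 minutes up, 6 minutes down.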

You can see the effects of the SRP connection failure are immediate and long lasting. The percentage of users with messages pending quickly jumps to nearly 100% and at the same time, the inbound and outbound message volume plummets. A causal relationship is clearly visible.

If you look at figure 2 again, you will see that all but three of the other BES display a similar pattern.

EUBES01, PACBES01, and PACBES02 were unaffected by the outage. These BES were located outside of North America, and so another causal relationship appears: this outage was localized to North America. Again, the design of the BoxTone Dashboard allows the BES Administrator to quickly see what the problem is, how widespread it is, and finally which users are affected by it.

The BlackBerry outage from another product's point of view

Figure 5 is a screen shot of the Zenprise* User Dashboard, which was posted to the BlackBerry Forums by mingjing on Tuesday February 13, 2008 at 4:03 PM EST, along with this message:

* Please note that Zenprise is considered one of BoxTone's competitors.

"We are using Zenprise to monitor our BlackBerry infrastructure ... I've included a screenshot of the Zenprise console and one alert message. We were able to see pending messages growing for critical users, as well as immediately identify the root cause to be connectivity problems with the SRP network."

* I removed two big red circles that were in the original image. I did this on the assumption that someone added them to the screen shot. If my assumption is incorrect and those red circles do appear under normal use of the product, please let me know. The original image can be found here.
Zenprise Screen Shot

Figure 5: Zenprise Screen Shot.

So, according to the message, figure 5* is a screen shot of what they, a customer of Zenprise, saw during the BlackBerry outage in the Zenprise User Dashboard. And, again according to this customer, they could see the number of pending messages for "critical users" was growing and they were able to "immediately identify" the global SRP connection problem to be the root cause.

* Update: I noticed a larger version of figure 5 was linked to the original posting in the BlackBerry Forums. It shows the timestamps much more clearly. You can see it here.

OK, I must confess that I have been studying this screen shot for more than a few hours now, and I honestly cannot see how that would be possible.

Annotated Zenprise Screen Shot

Figure 6: Four main problems in the Zenprise User Dashboard.

This screen shot has four major problems in it:

  1. It was taken sometime after 11:29 AM on 2/12/2008, which is nearly a full day after the BlackBerry outage. There is nothing to indicate which data might be historical and which data might be current.

  2. SRP connections to RIM were working when this screen shot was taken, so this icon is just plain wrong.

  3. The starting time of the SRP connectivity problem is shown as 2/9/2008 and the last occurrence is shown as 2/11/2008. This is horribly incorrect. The actual SRP connection was only down for approximately 15-20 minutes, not 3 days.

  4. This big blue rectangle does not give the user any useful information. It does not show growth. It does not show causality. It shows less than 2 minutes' worth of data, and it simply does not answer the fundamental question all good data graphics should answer: "Compared to what?"

Seeing the Big Picture in the BoxTone Dashboard

* This image is a closeup from figure 2 above.
Cluster of BES on the BoxTone Dashboard

Figure 7: Seeing a global problem is easy in the BoxTone Dashboard.

I think figure 7 speaks for itself. The fact that there was a global problem during the RIM outage was crystal clear in the BoxTone Dashboard.

Wrapping it up

I plan on writing more posts about the BoxTone Dashboard and other parts of our system in the future. As always, I would appreciate any comments you might want to share on this topic.

Posted in: blackberry · dashboard · visualization

6 Comments

Erik Cussack · Friday, 9 May 2008 9:42 AM

An excellent and interesting explanation.

Michael Gaffney · Friday, 7 March 2008 5:50 AM

Geoffrey, thank you. The sparklines crossing cells has the added benefit of showing a user that a drastic change has occurred. It has been very useful.

Michael Gaffney · Friday, 7 March 2008 5:45 AM

Jorge, thank you and I agree with your feedback about Ben Shneiderman's Mantra. We are working on the next revision and for larger deployments a cleaner overview would be more helpful. I'm also a regular reader of your site: http://charts.jorgecamoes.com . Thanks again.

Jorge Camoes · Thursday, 6 March 2008 6:47 PM

Michael, I really like your dashboard. It is very clean and shows how Tufte's and Few's principles work when correctly implemented. Ben Shneiderman's Visual Information-Seeking Mantra ("overview first, zoom and filter, then details on demand") could be a base for some improvement (adding a bird's eye view of the servers, for example).

Geoffrey Grosenbach · Thursday, 6 March 2008 5:47 PM

Beautiful and useful. I'm using a grey/blue scheme, but a yellow/grey/red scheme (a la Information Dashboard Design) is more informative at a glance.

I also like the way that sparklines cross cells if needed. It's easy to lose information if the scale is incorrect, but this makes it easier to see relative differences.

Bharat · Tuesday, 4 March 2008 6:32 PM

Nice article and excellent explanation.