The BoxTone Dashboard and the BlackBerry Outage
"A dashboard is a visual display of the most important information needed to achieve one or more objectives; consolidated and arranged on a single screen so the information can be monitored at a glance." —Stephen Few, Information Dashboard Design (2006)
The BlackBerry infrastructure went down on Monday, February 11, 2008 at approximately 3:20 PM EST. This event, and the reasons behind it, have been covered in detail on a number of other websites, so I'm not going to cover them here. Instead, I want to show you what this event looked like on the BoxTone Dashboard.
Please note that I work for BoxTone. I am one of the two Principal Software Architects there and I designed the BoxTone Dashboard.
Below are two screen shots of the Dashboard: figure 1 shows the Dashboard when everything is normal and figure 2 shows the Dashboard during the BlackBerry outage.
* This screen shot was provided by jibi. It was taken on Tuesday, February 19, 2008 at 9:54 AM EST. He blurred out names for security reasons.
* It has been resized to fit better; however, the original can be viewed here.
Figure 1: BoxTone Dashboard when everything is normal. Click to enlarge.
* This screen shot was originally posted on the BlackBerry Forums by jibi. It was taken on Monday, February 11, 2008 at 4:47 PM EST. For security reasons, he changed the name of each BES and removed the "Users in Critical" and the "Mail Servers with hung threads" sections.
Figure 2: BoxTone Dashboard during the BlackBerry outage. Click to enlarge.
A Global Problem
If you glance quickly at figure 1 and then at figure 2, you should immediately notice that figure 2 stands out more. It stands out because of the careful and restrained use of color.
Unlike many other dashboards, the BoxTone Dashboard uses highly saturated colors only when there is a problem.* When everything is normal, the BoxTone Dashboard looks, well ... normal.
* Edward Tufte's Envisioning Information (1990) has a chapter entitled "Color and Information" which provides an excellent overview of the use of color.
Our users are highly intelligent people. They do not need bright green check marks to make them feel good about themselves. They bought BoxTone to tell them when something is wrong. And when something is wrong, they want to know what it is and quickly!
A single bright red dot on a calm page screams out and demands the user's immediate attention. On February 11, 2008, this technique is what allowed our users to know instantly that the problem was a global one.
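To make the color rule concrete, here is a minimal Python sketch. It is not BoxTone's code: the palette values and the `color_for` helper are hypothetical, chosen only to illustrate the idea that a saturated color is reserved for problems while the normal state stays muted. The `saturation` helper uses the standard library's `colorsys` module to measure how saturated each color is.

```python
import colorsys

# Hypothetical palette: these hex values are illustrative only,
# not the colors the BoxTone Dashboard actually uses.
PALETTE = {
    "normal": "#9a9a9a",    # muted gray for healthy items
    "critical": "#e00000",  # saturated red, reserved for problems
}


def saturation(hex_color: str) -> float:
    """Return the HSV saturation (0.0 to 1.0) of a '#rrggbb' color."""
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5))
    _, s, _ = colorsys.rgb_to_hsv(r, g, b)
    return s


def color_for(has_problem: bool) -> str:
    """Pick a display color: saturated only when something is wrong."""
    return PALETTE["critical"] if has_problem else PALETTE["normal"]


for state in (False, True):
    color = color_for(state)
    print(f"problem={state}: {color} (saturation {saturation(color):.2f})")
```

With a palette like this, a page of healthy servers carries almost no saturated color, so a single problem indicator is the only thing competing for attention.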
The devil is in the details
* This detail is from Figure 1 above. I have replaced the blurred name with BES09 for illustrative purposes only.
Figure 3: The detail for each BES shows over 600 individual points of data.
The BoxTone Dashboard displays the values of 10 KPIs over the last 3 hours at 3 minute intervals for each BES. All of this data is graphed using sparklines which are stacked vertically and aligned across their common axis of time. Figure 3 shows what these KPIs look like when a BES is healthy.
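The numbers in that paragraph can be worked through in a short Python sketch. This is an illustration under stated assumptions, not how the BoxTone Dashboard is implemented: the `sparkline` function, the block characters, and the two sample series are all made up for the example. The plain product of 10 KPIs and 60 samples gives 600 points per BES; how the caption arrives at "over 600" is not specified here. The sketch also shows why stacking rows with the same number of samples lines them up on a common time axis.

```python
KPI_COUNT = 10            # from the text: 10 KPIs per BES
WINDOW_MINUTES = 3 * 60   # from the text: the last 3 hours
INTERVAL_MINUTES = 3      # from the text: one sample every 3 minutes

samples_per_kpi = WINDOW_MINUTES // INTERVAL_MINUTES
points_per_bes = KPI_COUNT * samples_per_kpi

BARS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Render a sequence of numbers as one row of block characters."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for a flat series
    return "".join(
        BARS[int((v - lo) / span * (len(BARS) - 1))] for v in values
    )


# Hypothetical data: two made-up KPI series sharing the same 60 time slots.
srp_connected = [1] * 40 + [0] * 5 + [1] * 15
pending_pct = [5] * 40 + [95] * 20

print(f"{samples_per_kpi} samples per KPI, {points_per_bes} points per BES")
for name, series in [("SRP connected", srp_connected),
                     ("% pending", pending_pct)]:
    # A fixed-width label keeps every row aligned on the shared time axis.
    print(f"{name:>14} {sparkline(series)}")
```

Because each row has one character per time slot, a drop in one KPI sits directly above or below the matching change in another, which is what makes relationships between rows visible.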
In addition to being aligned by time, the KPIs are organized so that the health KPIs sit on top of the performance KPIs. This ordering allows the BES Administrator to quickly detect any causal relationships which may emerge between the health and the performance of a BES.
The details of the BlackBerry outage
Knowing there is a problem and knowing what the problem is are two very different things. Bright red dots are just not enough.
* This image is a closeup from figure 2 above.
Figure 4: One of the BES from the BoxTone Dashboard during the BlackBerry Outage.
Figure 4 shows the details of the BlackBerry outage from the point of view of BES09 and, I must say, it's quite a view!
At 3:24 PM, the first SRP connection failure appears on the BoxTone Dashboard. Then, for the next 9 minutes, the SRP connection is up, but it fails again at 3:36 PM and stays down for 6 minutes. During this entire 18-minute window, the BoxTone Dashboard shows BES09 as having multiple errors.
You can see that the effects of the SRP connection failure are immediate and long-lasting. The percentage of users with messages pending quickly jumps to nearly 100% and, at the same time, the inbound and outbound message volume plummets. A causal relationship is clearly visible.
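The SRP timeline for BES09 can be checked with a small Python sketch. The `down_windows` helper and the list of samples are hypothetical: the sample list is my reconstruction of the description above, and it assumes the first failure occupies exactly one 3-minute sample, which the text does not state outright.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=3)  # sampling interval from the text


def down_windows(start, statuses, interval=INTERVAL):
    """Return (begin, end) pairs for each consecutive run of 'down' samples.

    statuses holds one boolean per sample: True means the connection is up.
    """
    windows = []
    run_start = None
    for i, up in enumerate(statuses):
        t = start + i * interval
        if not up and run_start is None:
            run_start = t                    # a down run begins
        elif up and run_start is not None:
            windows.append((run_start, t))   # the down run ends
            run_start = None
    if run_start is not None:                # still down at the end
        windows.append((run_start, start + len(statuses) * interval))
    return windows


# Assumed reconstruction of the SRP samples starting at 3:24 PM:
# one failed sample, three good samples (9 minutes),
# two failed samples (6 minutes), then up again.
start = datetime(2008, 2, 11, 15, 24)
statuses = [False, True, True, True, False, False, True]

windows = down_windows(start, statuses)
total = windows[-1][1] - windows[0][0]

for begin, end in windows:
    print(f"down {begin:%H:%M} to {end:%H:%M}")
print(f"window: {int(total.total_seconds() // 60)} minutes")
```

Under that assumption the two down periods run from 3:24 to 3:27 and from 3:36 to 3:42, which spans the 18-minute window described above.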
If you look at figure 2 again, you will see that almost all of the other BES display a similar pattern. That is, all except for three of them.
EUBES01, PACBES01, and PACBES02 were unaffected by the outage. These BES were located outside of North America, and so another causal relationship appears: this outage was localized to North America. Again, the design of the BoxTone Dashboard allows the BES Administrator to quickly see what the problem is, how widespread it is, and finally which users are affected by it.
The BlackBerry outage from another product's point of view
Figure 5 is a screen shot of the Zenprise* User Dashboard which was posted to the BlackBerry Forums by mingjing on Tuesday February 13, 2008 at 4:03 PM EST, along with this message:
"We are using Zenprise to monitor our BlackBerry infrastructure ... I've included a screenshot of the Zenprise console and one alert message. We were able to see pending messages growing for critical users, as well as immediately identify the root cause to be connectivity problems with the SRP network."
* Please note that Zenprise is considered one of BoxTone's competitors.
* I removed two big red circles that were in the original image. I did this on the assumption that someone added them to the screen shot. If my assumption is incorrect and those red circles do appear under normal use of the product, please let me know. The original image can be found here.
Figure 5: Zenprise Screen Shot.
So, according to the message, figure 5* is a screen shot of what they, a customer of Zenprise, saw during the BlackBerry outage in the Zenprise User Dashboard. And, again according to this customer, they could see the number of pending messages for "critical users" was growing and they were able to "immediately identify" the global SRP connection problem to be the root cause.
* Update: I noticed a larger version of figure 5 was linked to the original posting in the BlackBerry Forums. It shows the timestamps much more clearly. You can see it here.
OK, I must confess that I have been studying this screen shot for more than a few hours now, and I honestly cannot see how that would be possible.
Figure 6: Four main problems in the Zenprise User Dashboard.
This screen shot has four major problems in it:
1. It was taken sometime after 11:29 AM on 2/12/2008, which is nearly a full day after the BlackBerry outage. There is nothing to indicate which data might be historical and which data might be current.
2. SRP connections to RIM were working when this screen shot was taken, so this icon is just plain wrong.
3. The starting time of the SRP connectivity problem is shown as 2/9/2008 and the last occurrence is shown as 2/11/2008. This is horribly incorrect. The actual SRP connection was only down for approximately 15-20 minutes, not 3 days.
4. This big blue rectangle does not give the user any useful information. It does not show growth. It does not show causality. It shows less than 2 minutes' worth of data and it simply does not answer the fundamental question all good data graphics should answer: "Compared to what?"
Seeing the Big Picture in the BoxTone Dashboard
* This image is a closeup from figure 2 above.
Figure 7: Seeing a global problem is easy in the BoxTone Dashboard.
I think figure 7 speaks for itself. During the RIM outage, the fact that the problem was a global one was crystal clear in the BoxTone Dashboard.
Wrapping it up
I plan on writing more posts about the BoxTone Dashboard and other parts of our system in the future. As always, I would appreciate any comments you might want to share on this topic.
10 Comments
adave · Monday, 10 November 2008 2:54 PM
I was looking for available products for BES and Exchange monitoring and troubleshooting and came across your article. I have never used any of the products, but I can see the Zenprise dashboard image is showing a single user profile. Last Sync is, I think, the last time the device was synchronized with the server, and I guess the pending messages graph is also for a single user. The other details are not readable on the Zenprise dashboard image. I think the BoxTone Dashboard is very nice and detailed, but the article is not conclusive for me, as I think the comparison is between a server dashboard and a user profile dashboard.
Jeff · Wednesday, 4 June 2008 12:32 PM
Mike, this article was very helpful in understanding more about the BoxTone dashboard and was a great follow up to our discussion yesterday. It was great to meet you. Thanks for taking the time.
Jeff
Andreas · Wednesday, 14 May 2008 8:43 AM
Check out MicroCharts & XLCubed for a nice Excel based implementation of the BoxTone dashboard:
http://www.bonavistasystems.com/OnlineDemoReports.html
Erik Cussack · Friday, 9 May 2008 9:42 AM
An excellent and interesting explanation.
Michael Gaffney · Friday, 7 March 2008 5:50 AM
Geoffrey, thank you. The sparklines crossing cells has the added benefit of showing a user that a drastic change has occurred. It has been very useful.
Michael Gaffney · Friday, 7 March 2008 5:45 AM
Jorge, thank you and I agree with your feedback about Ben Shneiderman's Mantra. We are working on the next revision and for larger deployments a cleaner overview would be more helpful. I'm also a regular reader of your site: http://charts.jorgecamoes.com. Thanks again.
Jorge Camoes · Thursday, 6 March 2008 6:47 PM
Michael, I really like your dashboard. It is very clean and shows how Tufte's and Few's principles work when correctly implemented. Ben Shneiderman's Visual Information-Seeking Mantra ("overview first, zoom and filter, then details on demand") could be a base for some improvement (adding a bird's eye view of the servers, for example).
Geoffrey Grosenbach · Thursday, 6 March 2008 5:47 PM
Beautiful and useful. I'm using a grey/blue scheme, but a yellow/grey/red scheme (a la Information Dashboard Design) is more informative at a glance.
I also like the way that sparklines cross cells if needed. It's easy to lose information if the scale is incorrect, but this makes it easier to see relative differences.
Bharat · Tuesday, 4 March 2008 6:32 PM
Nice article and excellent explanation.
Michael Gaffney · Tuesday, 2 December 2008 6:44 AM
Adave, thanks for your comments. The Zenprise screenshot I used for this post was the only one they released, which, according to them, showed their product detecting the SRP outage. I would have happily used a screenshot comparable to the BoxTone dashboard, but I don’t think their product has an equivalent view.