If there's another e-mail crash, CIT will be ready

A report from Cornell Information Technologies (CIT) on the Great E-mail Outage of 2008 suggests that the best way to prepare for such crises in the future is to improve communication -- with the public, with vendors and between communicators.

A major hardware crash June 15 left many Cornell users unable to receive e-mail for up to a week. It was a wake-up call that e-mail has become as integral to daily life as electric lights and telephones.

"I'm not sure any of us fully anticipated how debilitating this outage would be, but it is clear that we cannot tolerate the loss of what has become our main communication channel," said Polley McClure, vice president for information technologies, in a June 18 statement.

A "root cause analysis" by CIT has nailed down the technical problems and recommends procedures to remedy them and to improve communication with the campus community in the event of a future outage.

Incoming e-mail for cornell.edu addresses is stored in "postoffices," mostly on a unit known as the Sun 6120 Storage Array, made by Sun Microsystems. The 6120 has a well-known, albeit silly, bug: on its 994th day of continuous operation it spontaneously reboots. The solution is obvious: Before day 994, turn it off and on again. Sun had warned users to do this, and CIT technicians did so a week before the deadline. But it turned out the procedure Sun provided wasn't complete. Some circuits inside the system remembered the time, and the 6120 rebooted anyway on day 994 -- June 15.
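
For readers who want to see the arithmetic behind that deadline, here is a minimal sketch in Python. It is an illustration only, not CIT's or Sun's procedure. The 994-day limit and the one-week margin come from the account above; the function names, the sample start date, and the convention that the day of the power cycle counts as day zero are assumptions made for this example. As the outage showed, the date only means something if it is counted from a power cycle that actually cleared the internal counter.

```python
from datetime import date, timedelta

REBOOT_DAY = 994        # from the article: the array reboots on day 994 of uptime
SAFETY_MARGIN_DAYS = 7  # from the article: technicians power-cycled a week early


def forced_reboot_date(last_full_reset: date) -> date:
    """Date on which the uptime counter would reach REBOOT_DAY.

    Assumes the day of the reset is day zero (an assumption, not a
    documented Sun convention) and that the reset really cleared the counter.
    """
    return last_full_reset + timedelta(days=REBOOT_DAY)


def latest_safe_power_cycle(last_full_reset: date) -> date:
    """Last date to power-cycle while keeping the safety margin."""
    return forced_reboot_date(last_full_reset) - timedelta(days=SAFETY_MARGIN_DAYS)


if __name__ == "__main__":
    # Hypothetical reset date, chosen only for illustration.
    start = date(2020, 1, 1)
    print("Forced reboot on:", forced_reboot_date(start))
    print("Power-cycle by:  ", latest_safe_power_cycle(start))
```

If the counter survives the power cycle, as it did on the 6120, the calculation has to start from the earlier date, which is why the reboot still arrived on schedule.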

Many computer users know to their sorrow that if a system reboots while it is writing data, records can be damaged. Technicians were able to get the drives up and running in a few hours, but corrupted data caused more crashes. Lost data was restored from backups, but since the backups were made about 12 hours before the crash, some mail was lost.

Sun Microsystems took its share of the blame, and sent a letter of apology to Cornell President David Skorton.

CIT has strengthened communications with Sun, said Rick MacDonald, CIT director of systems and operations. Part of the delay in recovery from the crash, he said, resulted from trying to contact Sun tech support on a Sunday. "Once we got the right people, Sun's support was exemplary," he said, adding, "Now ... if we're not getting answers we know who to call at the next step up."

CIT has set up its own rules about when and how to send signals up the chain of command and to make sure information flows to the public.

An "after-action review" panel of campus stakeholders noted that information did not get out to the campus community soon enough to let people use workarounds like forwarding mail to another e-mail address. Tech support people and the Help Desk were not well informed. One reason, MacDonald said, was that technicians at first thought they had the problem solved. "By Monday evening all but one postoffice was back up," he explained. "And then they started crashing again."

For a while the only source of information on the Web was the Network Status Page -- hard to find and full of technical terms. An Audix broadcast message went out, but since broadcast messages don't light the message light on phones, most people didn't know there was a message there.

A "rapid response team" within CIT Communications Services now will include writers who can craft messages in nontechnical terms. The Network Status page is being redesigned to provide both nontechnical information and technical details for those who need them. New channels to unit support providers are being developed, and a public awareness campaign is planned to tell people where to look for information if there is a major outage.

The after-action review panel did find one positive effect of the June outage. Ironically, their report noted, "Some people used the opportunity to get more work done."
