Hardware crash leaves Cornell e-mail users message-less for up to a week

A major hardware crash Sunday, June 15, left many Cornell users unable to receive e-mail for periods ranging from two days to almost a week. Some incoming mail was irretrievably lost in the process, according to Cornell Information Technologies (CIT) technicians.

"I'm not sure any of us fully anticipated how debilitating this outage would be, but it is clear that we cannot tolerate the loss of what has become our main communication channel," said Polley McClure, vice president for information technologies, in a statement posted on the CIT Web site.

McClure promised a review of the incident to determine how to reduce the likelihood of a recurrence and how to recover more effectively if one occurs. She thanked IT staff for their work in repairing the system and helping those affected by the outage. Many CIT staffers, she said, literally worked around the clock, sleeping on campus, until the job was done. "You cannot pay people enough to have this kind of dedication to their work and Cornell," she said.

The first sign of a problem was a rather innocuous message from the Network Operations Center at 12:22 p.m. on June 15: "A service affecting issue with SAN (Storage Area Network) is currently under investigation. This may also be affecting some of the Mail servers." It eventually emerged that eight arrays of hard disks had failed. These arrays hold the "mailboxes" in which each user's incoming mail is stored.

While details are still to come from Sun Microsystems, the supplier of the disk arrays, there is a known bug in the hardware that causes the storage arrays to spontaneously reboot (i.e., abruptly shut down and restart) on the 994th day of operation, according to Rick MacDonald, director of systems and operations for CIT. Sun had supplied instructions for correcting the problem, and Cornell followed them, MacDonald said.

Even so, on the 994th day of continuous operation -- June 15 -- the arrays rebooted. A reboot while read and write operations are in progress can damage stored data.

The damage varied across the disk arrays, so some "postoffices" that house individual mailboxes (actually the equivalent of folders in the storage system) were affected more than others. Postoffices 9 and 10 were up and running Monday afternoon but failed again Monday evening. Postoffices 6, 9 and 10 returned to service on Tuesday evening. Postoffices 7 and 8 were hardest hit, with some users not seeing incoming mail until Friday morning and with some stored mail completely lost.

CIT restored mail for many users of these postoffices from a backup, but any mail received between the last backup, made between 7 and 10 p.m. the preceding Saturday, and the crash on Sunday is lost, MacDonald reported. About 3,800 users were affected, and they have been notified by e-mail. Some mail that arrived at other postoffices just at the time of the crash may also have been lost, he said, but there is no way to know how much.

Mail that arrived after the crash, including mail from Cornell users to other cornell.edu addresses, was held on Cornell mail hubs and delivered to the postoffices once the system was restored, so it eventually went through. For a few days, "Your mail has not yet been delivered" messages were almost as common as spam. "This is merely a warning. There is no need for these messages to be sent again," CIT announced. "The messages are in queue and will be delivered."

Outgoing mail from Cornell users to the outside world was sent on its way with no problems. On CIT's recommendation, many Cornell users coped by having their incoming mail forwarded to temporary accounts they set up with services like Gmail, Yahoo and Hotmail.

Anyone who is still having problems with e-mail should call the CIT Contact Center at 255-8990. CIT cautioned that "CIT will *NOT* need your NetID password from you to restore your e-mail account." Messages asking for such information, CIT said, would be "phishing" or identity-theft attempts.

The current status of Cornell's networks is always available at http://networkstatus.cit.cornell.edu/. Members of the Cornell community can find technical details of the e-mail outage at http://www.cit.cornell.edu/computer/news/cuonly/. McClure's complete statement is at http://www.cit.cornell.edu/computer/news/emailoutage.html.
