CIT continues investigation of June e-mail failure

Since the massive e-mail outage that hit campus June 15, administrators and technicians at Cornell Information Technologies (CIT) have worked tirelessly to confirm the cause of the system failure and to strengthen their response to any future disruptions in e-mail service.

"By the end of July, we will complete our report detailing the root causes of the failure and our proposals for how to better prepare for an event such as this one," said Polley Ann McClure, vice president for information technologies.

This effort, she said, encompasses an after-action review by a cross-campus group; a root-cause analysis by Sun Microsystems, the manufacturer of the disk storage systems that failed in the June outage; a CIT analysis of the technical events and how they were communicated to the campus; and an analysis of how the failure affected campus users. CIT is also seeking input from the Cornell community about what went well and how the disruption of one of the main channels of university communication could have been better handled.

The e-mail outage occurred at midday on Sunday, June 15, when eight Sun Microsystems disk storage arrays crashed. Five of these arrays served Cornell's five post office servers, where individual users' e-mail accounts are stored.

The crash was due to a known bug that causes the Sun disk arrays to spontaneously reboot on their 994th day of continuous operation, McClure explained. Sun had previously alerted users to this problem, and on June 8 Cornell technicians performed a set of Sun-supplied procedures to avert it. On June 15, the 994th day of continuous service, the arrays spontaneously rebooted anyway, bringing down Cornell's e-mail system and damaging the disk file systems that support it.

McClure said Sun has confirmed that CIT performed the preventive procedures as indicated in the Sun alert. As a result of the findings in its root-cause analysis, dated July 15, Sun has updated its remediation procedures, adding steps that it believes will properly reset the disk arrays and thereby prevent the bug from activating.

On Monday, June 16, CIT tried to restore the disk arrays to operational status. By late evening, some post office servers were operational, and some people had e-mail service. But the servers crashed again because of operating system failures, which were traced to data corruption that had gone undetected during the previous day's restoration efforts. On Tuesday afternoon, CIT discovered, and Sun verified, that a specific patch and configuration setting were needed to repair the data corruption.

By Wednesday morning, June 18, e-mail service for most users had been restored. Over the next two days, CIT technicians undertook patching, restoring, replacing and repairing various aspects of the stored e-mail data until all systems became operational.

The e-mail data for a small number of users was so badly damaged that it had to be replaced from backups, and there were some difficulties in making those restorations, McClure said. Further maintenance to resolve possible file system corruption was successfully completed July 13.

"I appreciate the support, understanding and patience that people on campus have given CIT during this difficult situation," McClure said. She urges people to send their input to email-feedback@cornell.edu.

CIT's final report and recommendations will be available by July 31 on the Office of Information Technologies Web site and the Computing at Cornell Web site.

 
