Statistics professor helps navigate the 'data deluge'

Statistician Paul Velleman explains the methods researchers use to sift through the enormous amounts of data being generated. (Photo: Jason Koski/University Photography)

Humanity is generating massive amounts of data, and extracting useful information from this deluge is extremely challenging, said Paul Velleman, associate professor of social statistics, at a public lecture April 21 in celebration of National Math Awareness Month.

In his lecture, "Surfing the Data Deluge," Velleman noted that in 2005, approximately 1 billion gigabytes of data were generated. Just five years later, it was eight times that amount. This massive mushrooming of information will soon overwhelm our ability to store the data, Velleman said.
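As a rough illustration of the pace those figures imply, the sketch below converts "eight times in five years" into an annual multiplier and a doubling time. It assumes the growth was smooth and exponential over the period, which the lecture did not claim. The function name is made up for this example.

```python
import math

def implied_growth(factor: float, years: float) -> tuple[float, float]:
    """Return (annual multiplier, doubling time in years).

    Assumes smooth exponential growth over the whole period,
    which is an illustrative assumption, not a claim from the lecture.
    """
    annual = factor ** (1.0 / years)
    doubling_years = years * math.log(2) / math.log(factor)
    return annual, doubling_years

# Figures quoted in the lecture: eight times as much data after five years.
annual, doubling = implied_growth(8, 5)
print(f"annual multiplier: {annual:.2f}")
print(f"doubling time: {doubling * 12:.0f} months")
```

Under that assumption, an eightfold increase is three doublings in 60 months, or one doubling roughly every 20 months.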

One way to make sense of all this data is through data mining, which uses statistical methods and computer algorithms to discover patterns. However, Velleman pointed out that without meaningful questions to guide it, data mining isn't very helpful.

A better alternative would be to "surf the deluge" and to learn to think statistically, which humans don't do naturally, he continued. Thinking statistically is often counterintuitive, he said, and can require effort.

One case where statistics can be confusing is in presidential election polls, Velleman said, as different polls give different results for predicted winners. While Gallup predicts that Mitt Romney will win the election, NBC and the Wall Street Journal predict Barack Obama.

Why do these polls give conflicting results? Individuals, samples, statistical methods and polling organizations all vary, and these can account for the discrepancies often observed in poll results, Velleman said.

To examine a poll effectively, one should look at who was sampled, how large the sample was and how the question was worded, he said. For instance, respondents are more likely to reply affirmatively when asked whether they favor "President Obama" than when asked whether they favor "Obama," he said.
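The effect of sample-to-sample variation alone can be sketched with a small simulation. Everything in it is hypothetical: the true level of support (50 percent), the sample size (1,000) and the function names are assumptions chosen for illustration, not figures from any real poll. It covers only random sampling error, not differences in question wording or in who was sampled.

```python
import math
import random

def simulate_poll(true_support: float, sample_size: int, rng: random.Random) -> float:
    """Share of a simple random sample that favors candidate A."""
    favorable = sum(rng.random() < true_support for _ in range(sample_size))
    return favorable / sample_size

def margin_of_error(share: float, sample_size: int) -> float:
    """Approximate 95% margin of error for a sample proportion."""
    return 1.96 * math.sqrt(share * (1 - share) / sample_size)

rng = random.Random(2012)  # fixed seed so the run is repeatable
TRUE_SUPPORT = 0.50        # hypothetical value, not from any real poll
SAMPLE_SIZE = 1000         # hypothetical sample size

# Five polls of the same population, differing only in who happened to be sampled.
polls = [simulate_poll(TRUE_SUPPORT, SAMPLE_SIZE, rng) for _ in range(5)]
for share in polls:
    moe = margin_of_error(share, SAMPLE_SIZE)
    leader = "A" if share > 0.5 else "B" if share < 0.5 else "tie"
    print(f"A: {share:.1%} (+/- {moe:.1%}) -> apparent leader: {leader}")
```

With a sample of 1,000, the margin of error is about three percentage points, so two polls of an evenly split electorate can name different leaders without either one being wrong.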

Another area where statistical thinking can be revealing is in selecting lottery numbers, for which people often develop strategies. For instance, Velleman noted that people often pick "hot" numbers (numbers that have come up recently) or "lucky" numbers.

Velleman pointed out the futility of this approach: lottery drawings are not only random; each drawing is also independent of every previous one.

"There's no way pingpong balls can remember what was selected in any previous time, and either be 'hot' or be 'due' or be more random or less random," he said. "Every possible collection of five eligible numbers is equally likely."
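The "equally likely" point can be made concrete by counting the possible sets. The sketch below assumes a hypothetical game that draws five balls from a pool of 59; the pool size is not from the lecture, and real lotteries vary.

```python
import math
import random

POOL_SIZE = 59  # hypothetical number of balls; real lotteries vary
PICKS = 5       # five numbers per drawing, as in the quote

# Number of distinct five-number sets that can be drawn.
total = math.comb(POOL_SIZE, PICKS)
print(f"{total:,} possible sets, each with probability 1/{total:,}")

def draw(rng: random.Random) -> frozenset:
    """One drawing: five distinct balls, chosen with no memory of past draws."""
    return frozenset(rng.sample(range(1, POOL_SIZE + 1), PICKS))

rng = random.Random(0)
last_draw = draw(rng)

hot_ticket = last_draw                    # "hot": replay the last drawing's numbers
plain_ticket = frozenset({1, 2, 3, 4, 5})  # a set most players would avoid

# Each ticket is exactly one set out of `total`, so the chance is identical.
chance = 1 / total
print(f"hot ticket: {chance:.2e}, plain ticket: {chance:.2e}")
```

Because the drawing function never looks at earlier results, no choice of ticket, "hot," "due" or otherwise, changes the one-in-`total` chance of matching the next draw.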

Statistical thinking isn't about mathematical ability, Velleman concluded, but it does require thinking in ways that often don't come naturally to people. He noted that Mark Twain said he was "beguiled" by figures, leading to his oft-quoted "There are three kinds of lies: lies, damned lies and statistics." Velleman said he believes Twain was referring to an alternate definition of "beguile": "to win and hold somebody's attention, interest or devotion."

"I like to think that Twain was beguiled by arranging his figures because he discovered the truth in his data," Velleman said. "I hope that you, too, will be beguiled by statistics."

Farhan Nuruzzaman '12 is a writer intern for the Cornell Chronicle.

 
