Baseball produces a welter of data, from which correlations can be drawn – for example between the number of hits and a player’s salary. Donald Miralle / Getty Images

The US humorist Evan Esar once called statistics the science of producing unreliable facts from reliable figures. An innovative technique now promises to make those facts a whole lot more dependable.

Brothers David Reshef of the Broad Institute of MIT and Harvard in Cambridge, Massachusetts, and Yakir Reshef, now at the Weizmann Institute of Science in Rehovot, Israel, and their coworkers have devised a method for extracting from complex data sets the relationships and trends that are invisible to other types of statistical analysis. They describe their approach in Science today.

“This appears to be an outstanding achievement,” says Douglas Simpson, a statistician at the University of Illinois at Urbana–Champaign. “It opens up whole new avenues of inquiry.”

Dizzying complexity

Here is the basic problem. You have collected lots of data on some property of a system that could depend on many governing factors. To work out what depends on what, you plot them on a graph.

If you are lucky, you might find that one property changes in a simple way as a function of another factor: for example, people’s health might steadily get better as their wealth increases. There are well-known statistical methods for assessing how reliable such correlations are. But what if there are many simultaneous dependencies in the data? Suppose that you are looking at how genes interact in an organism. The activity of one gene could be correlated with that of another, but there could be hundreds of such relationships all mixed together. To a cursory inspection, the data might look like random noise.
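A small sketch may make the difficulty concrete. The article does not say which standard methods it has in mind; the example below assumes Pearson's correlation coefficient as a representative one. The function name `pearson_r` and the synthetic data are made up for illustration and are not taken from the study. A single straight-line trend scores close to 1, whereas two opposite trends mixed together in the same data score close to 0, even though every point follows one rule or the other.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Synthetic, illustrative data only (not from the study).
x = [i / 10 for i in range(1, 101)]

# One simple trend: y rises steadily with x.
simple = [2.0 * v + 1.0 for v in x]

# Two superimposed trends: half the points rise with x, half fall.
mixed = [v if i % 2 == 0 else -v for i, v in enumerate(x)]

print("simple trend:", round(pearson_r(x, simple), 3))  # close to 1
print("mixed trends:", round(pearson_r(x, mixed), 3))   # close to 0
```

The second result does not mean the variables are unrelated. It means that a single summary number designed for one straight line cannot see two relationships at once.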

“If you have a data set with 22 million relationships, the 500 relationships in there that you care about are effectively invisible to a human,” says Yakir Reshef.

And the relationships are all the harder to tease out if you don’t know what you’re looking for in the first place, that is, if you have no reason to suspect that one thing depends on another.

The statistical method that Reshef and his colleagues have devised aims to crack those problems. It can spot many superimposed correlations between variables and measure exactly how tight each relationship is, on the basis of a quantity that the team calls the maximal information coefficient (MIC). The MIC is calculated by plotting data on a graph and looking for all ways of dividing up the graph into blocks or grids that capture the largest possible number of data points. The MIC can then be deduced from the grids that do the best job.
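The grid idea can be sketched in a few lines of Python. This is a simplified stand-in, not the authors' algorithm. It tries grids of several sizes, scores each one, and keeps the best score. Three assumptions go beyond the description above: each grid is scored by mutual information (a standard measure of how much a point's column tells you about its row), rescaled to lie between 0 and 1; the grid lines are placed so that each row and column holds an equal share of the points, rather than being optimized; and the number of cells is capped at n to the power 0.6. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Label each value with a bin index; bins hold roughly equal counts.
    Tied values may land in different bins (acceptable for a sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels


def mutual_information(x_labels, y_labels):
    """Mutual information, in bits, between two label sequences."""
    n = len(x_labels)
    joint = Counter(zip(x_labels, y_labels))
    count_x = Counter(x_labels)
    count_y = Counter(y_labels)
    return sum(
        (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in joint.items()
    )


def approx_mic(x, y, exponent=0.6):
    """Best normalized grid score over all grid sizes tried.
    Assumption: at most n**exponent cells per grid. Returns 0.0 when
    there are too few points to form even a 2-by-2 grid."""
    max_cells = len(x) ** exponent
    best = 0.0
    n_cols = 2
    while n_cols * 2 <= max_cells:
        x_labels = equal_frequency_bins(x, n_cols)
        n_rows = 2
        while n_cols * n_rows <= max_cells:
            y_labels = equal_frequency_bins(y, n_rows)
            info = mutual_information(x_labels, y_labels)
            best = max(best, info / math.log2(min(n_cols, n_rows)))
            n_rows += 1
        n_cols += 1
    return best


if __name__ == "__main__":
    # Synthetic, illustrative data only.
    xs = [i / 199 for i in range(200)]
    u_shape = [(v - 0.5) ** 2 for v in xs]  # clear pattern, not a line
    random.seed(0)
    noise = [random.random() for _ in xs]   # no relationship to xs
    print("U-shaped curve:", round(approx_mic(xs, u_shape), 2))
    print("pure noise:   ", round(approx_mic(xs, noise), 2))
```

On the synthetic U-shaped curve the score comes out high, and on unrelated noise it comes out low. A faithful implementation would also search over where the grid lines fall, which this sketch skips for brevity.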

To demonstrate the power of their technique, the researchers applied it to a diverse range of problems. In one case they looked at factors that influence people’s health globally, using data collected by the World Health Organization in Geneva, Switzerland. Here they were able to tease out superimposed trends: for example, female obesity increases with income in the Pacific Islands, where it is considered a sign of status, but there is no such link in the rest of the world.

In another example, the researchers identified genes that were expressed periodically, but with differing cycles, during the cell cycle of brewer’s yeast (Saccharomyces cerevisiae). They also uncovered groups of human gut bacteria that proliferate or decline when diet is altered, finding that some bacteria are abundant precisely when others are not. Finally, the team identified performance factors for baseball players that are strongly correlated with their salaries.

Correlation and causation

Reshef cautions that finding statistical correlations is only the start of understanding relationships between variables. “At the end of the day you’ll need an expert to tell you what your data mean,” he says. “But filtering out the junk in a data set in order to allow someone to explore it is often a task that doesn’t require much context or specialized knowledge.”

He adds, “Our hope is that this tool will be useful in just about any field that is amassing large amounts of data.” Reshef points to genomics, proteomics, epidemiology, particle physics, sociology, neuroscience and atmospheric science as just some of the fields that are “saturated with data”. The method should also be valuable for ‘data mining’ in sports statistics, social media and economics.

One of the big questions remaining after a relationship has been uncovered is what causes what; the familiar mantra of statisticians is that correlation does not imply causality. “We see the issue of causality as a potential follow-up,” says Reshef. “Inferring causality is an immensely complicated problem, but has been well studied previously.”

Raya Khanin, a bioinformatician at the Memorial Sloan–Kettering Cancer Center in New York, acknowledges the need for a technique like the Reshefs’, but reserves judgement about whether the MIC is the answer. “I’m not sure whether its performance is as good as and different from other measures,” she says.

For example, she questions whether the findings about gut bacteria really needed this advanced statistical technique. “Having worked with this type of data, and judging from the figures, I’m quite certain that some basic correlation measures would have uncovered the same type of non-coexistence behaviour,” she says.
