Showing posts with label voodoo.

Friday, February 25, 2011

The Decline And Fall of Effects In Science

Nature has a piece called Unpublished results hide the decline effect.
This refers to the fact that many scientific findings which initially seem to indicate something big is happening end up getting smaller and smaller as more people try to replicate them, until eventually they may vanish entirely.

The Last Psychiatrist's take is that "The Decline Effect" just represents sloppy thinking, treating different things as if they were all instances of The One True Phenomenon. Someone does a study about something and finds an effect. Then someone else comes along and does a new study, of a related but different topic, and finds a different result. Both are right: there's a difference. Only if you, sloppily, decide that both studies were measuring the same thing does the "Decline Effect" appear.

This is perfectly true and I've touched on it before, but I think it's a bit optimistic. It assumes that the first study was true. Sometimes it is. But because of the way science is published at the moment, a lot of the results that get published are flukes. Some even say that the majority are.

The problem is that there are so many ways to statistically analyze any given body of data that it's easy to test and retest it until you find a "positive result" - and then publish that, without saying (or only saying in the small print) that your original tests all came out negative. Combine this with selective publication of only the best data, and other scientific sins, and you can pull positive results out of the hat of mere random noise.
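
To make that concrete, here's a little simulation - my own back-of-the-envelope sketch, not anything from the Nature piece - of what happens when you try twenty different analyses on pure noise and only report the best one:

```python
# A minimal sketch of how pure noise yields a "positive result" if you test
# enough outcomes and report only the winner. All numbers are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 1000   # hypothetical replications of the whole study
n_outcomes = 20        # number of different measures/analyses tried per study
n_per_group = 30       # subjects per group

false_positive_studies = 0
for _ in range(n_experiments):
    # Both groups are drawn from the same distribution: there is no real effect.
    p_values = [
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_outcomes)
    ]
    if min(p_values) < 0.05:   # publish the best-looking analysis
        false_positive_studies += 1

print(f"Studies reporting a 'significant' effect: "
      f"{false_positive_studies / n_experiments:.0%}")  # roughly 1 - 0.95**20, about 64%
```

Roughly two-thirds of these "studies" have something to publish, even though there is nothing there.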

In the Nature article, Jonathan Schooler discusses this and suggests that an open-access repository of findings (meaning raw data rather than the end product of analyses) would be A Good Thing. I agree. However, he seems to think that if we did this, we might still observe the "Decline Effect", and would be able to find out more about it. He even seems to suggest that some kind of weird quantum effect might mean that scientists are actually changing the laws of reality by observing them:
Perhaps, just as the act of observation has been suggested to affect quantum measurements, scientific observation could subtly change some scientific effects. Although the laws of reality are usually understood to be immutable, some physicists, including Paul Davies, director of the BEYOND: Center for Fundamental Concepts in Science at Arizona State University in Tempe, have observed that this should be considered an assumption, not a foregone conclusion.
Hmm. Maybe. But there is really no need to posit such magical mysteries when plain old statistical conjuring tricks seem like a perfectly good explanation. In my view, a raw-results repository wouldn't explain the decline effect; it would simply make it disappear.

Schooler doesn't go into detail as to how this repository would be set up, but he does point out that we already have a pretty good one for clinical trials of medicines conducted in the USA. Anyone running a clinical trial is required to register it in advance, saying what they're planning to do and, crucially, spelling out which statistics they are going to run on the data when it arrives.

What's really silly is that most scientists already do this when applying for funding: most grant applications include detailed statistical protocols. The problem is that these are not made public, so people can ignore them when it comes to publication. Back in 2008 I suggested that scientific journals should require all studies, not just clinical trials, to be publicly pre-registered if they're to be considered for publication. This would be eminently doable if there were the will to make it happen.

Schooler, J. (2011). Unpublished results hide the decline effect. Nature, 470(7335), 437. DOI: 10.1038/470437a

Friday, June 25, 2010

The A Team Sets fMRI to Rights

Remember the voodoo correlations and double-dipping controversies that rocked the world of fMRI last year? Well, the guys responsible have teamed up and written a new paper together.

The paper is Everything you never wanted to know about circular analysis, but were afraid to ask. Our all-star team of voodoo-hunters - including Ed "Hannibal" Vul (now styled Professor Vul), Nikolaus "Howling Mad" Kriegeskorte, and Russell "B. A." Poldrack - provides a good overview of the various issues and offers opinions on how the field should move forward.

The fuss concerns a statistical trap that it's easy for neuroimaging researchers, and certain other scientists, to fall into. Suppose you have a large set of data - like a scan of the brain, which is a set of perhaps 40,000 little cubes called voxels - and you search it for data points where there is a statistically significant effect of some kind.

Because you're searching in so many places, you set the threshold for significance very high in order to avoid getting lots of false positives. That's fine in itself, but a problem arises if you find some significant effects and then use those significant data points as a measure of the size of the effects - because you have specifically selected the data points that show the very biggest effects out of all your data. This is called the non-independence error, and it can make small effects seem much bigger than they really are.
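
Here's a toy simulation of the problem - my own illustration, not one from the paper. Every voxel has the same small true effect, but if you only quote the effect size in the voxels that cleared a stringent threshold, the numbers come out much bigger:

```python
# Selecting voxels that cross a high threshold and then quoting their effect
# sizes overstates the true effect, because the selected voxels are the ones
# whose noise happened to push them upward. All parameters are invented.
import numpy as np

rng = np.random.default_rng(1)
n_voxels = 40_000
n_subjects = 20
true_effect = 0.2          # the same small true effect in every voxel (in SD units)

# Per-voxel mean effect estimated from noisy per-subject data.
data = true_effect + rng.normal(size=(n_subjects, n_voxels))
estimates = data.mean(axis=0)
se = data.std(axis=0, ddof=1) / np.sqrt(n_subjects)
t_values = estimates / se

threshold = 4.0            # a stringent voxel-wise threshold, as in whole-brain fMRI
selected = t_values > threshold

print(f"True effect:                      {true_effect:.2f}")
print(f"Mean estimate over all voxels:    {estimates.mean():.2f}")            # ~0.20, unbiased
print(f"Mean estimate in selected voxels: {estimates[selected].mean():.2f}")  # much larger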

The latest paper offers little that's new in terms of theory, but it's a good read and it's interesting to get the authors' expert opinion on some hot topics. Here's what they have to say about the question of whether it's acceptable to present results that suffer from the non-independence error just to "illustrate" your statistically valid findings:
Q: Are visualizations of non-independent data helpful to illustrate the claims of a paper?

A: Although helpful for exploration and story telling, circular data plots are misleading when presented as though they constitute empirical evidence unaffected by selection. Disclaimers and graphical indications of circularity should accompany such visualizations.
Now an awful lot of people - and I confess that I've been among them - do this without the appropriate disclaimers. Indeed, it is routine. Why? Because it can be a useful illustration: although the size of the effects appears inflated in such graphs, on a qualitative level they give a useful impression of the direction and nature of the effects.

But the A Team are right. Such figures are misleading - they mislead about the size of the effect, even if only inadvertently. We should use disclaimers, or ideally, avoid using misleading graphs. Of course, this is a self-appointed committee: no-one has to listen to them. We really should though, because what they're saying is common sense once you understand the issues.

It's really not that scary - as I said on this blog at the outset, this is not going to bring the whole of fMRI crashing down and end everyone's careers; it's a technical issue, but it is a serious one, and we have no excuse for not dealing with it.

Kriegeskorte, N., Lindquist, M., Nichols, T., Poldrack, R., & Vul, E. (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. Journal of Cerebral Blood Flow & Metabolism. DOI: 10.1038/jcbfm.2010.86

Friday, April 30, 2010

New, Voodoo-Free fMRI Technique

MIT brain-scanners Fedorenko et al present A new method for fMRI investigations of language: Defining ROIs functionally in individual subjects. Also on the list of authors is Nancy Kanwisher, one of the feared fMRI voodoo correlations posse.

The paper describes a technique for mapping out the "language areas" of the brain in individual people, not for their own sake, but as a way of improving other fMRI studies of language. That's important because while everyone's brain is organized roughly the same way, there are always individual differences in the shape, size and location of the different regions.

This is a problem for fMRI researchers. Suppose you scan 10 people and show them pictures of apples and pictures of pears. And suppose that apples activate the brain's Fruit Cortex much more strongly than pears. But unfortunately, the Fruit Cortex is a small area, and its location varies between people. In fact, in your 10 subjects, no-one's Fruit Cortex overlaps with anyone else's, even though everyone has one and they all work exactly the same way.

If you did this experiment you'd fail to find the effect of apples vs. pears, even though it's a strong effect, because there is no one place in the brain where apples reliably cause more activation. What you need is a way of finding the Fruit Cortex in each person beforehand: a functional localizer scan - say, showing people a big bowl of fruit - run as a preliminary step.

Fedorenko et al scanned a group of people while they performed a simple reading task, and compared that to a control condition in which they read lists of linguistically meaningless nonsense. There's a lot of variation between people, but there's also clearly a basic pattern of activation: it looks a bit like a tilted "V" on the left side of the brain.

These are the language areas of each person. (Incidentally, this is why fMRI, despite its limitations, is an amazing technology. There is no better way of measuring this activation. EEG is cheaper but nowhere near as good at localizing activity; PET is close, but it's slow, expensive and involves injecting people with radioactivity.)

Fedorenko et al then overlapped all the individual images to produce a map of the brain showing how many people showed activation in each part.

The most robust activations were on the left side of the brain, and they formed a nice "V" shape again. These are the areas which have long been known to be involved in language, so this is not surprising in itself.

Here's the clever bit: they then took the areas activated in a large percentage of people, and automatically divided them up into sub-regions; each of the "peaks" where an especially large proportion of subjects showed activation became a separate region.

This is on the assumption that these peaks represent parts of the brain with distinct functions - separate "language modules" as it were. But each module will be in a slightly different place in each person (see the first picture). So they overlapped the subdivisions with the individual activation blobs to get a set of individual functional zones they call Group-constrained Subject-Specific functional Regions of Interest, or GcSSfROIs to their friends.
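
If that's hard to picture, here's a very stripped-down sketch of the logic - a 1-D toy "brain" of my own invention, not the authors' actual pipeline:

```python
# Group-constrained subject-specific ROIs, in miniature: build a map of how many
# subjects activate each location, carve the supra-threshold part into parcels,
# then intersect each parcel with each subject's own activation map.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(2)
n_subjects, n_voxels = 12, 200

# Each subject activates a 20-voxel "language region" centred near voxel 100,
# but jittered by a few voxels from person to person.
subject_maps = np.zeros((n_subjects, n_voxels), dtype=bool)
for s in range(n_subjects):
    centre = 100 + rng.integers(-8, 9)
    subject_maps[s, centre - 10:centre + 10] = True

# Group overlap map: what fraction of subjects is active at each voxel?
overlap = subject_maps.mean(axis=0)

# Keep locations where at least half the subjects overlap, and split the
# surviving territory into contiguous parcels ("candidate language areas").
parcels, n_parcels = ndimage.label(overlap >= 0.5)

# A subject's fROI for a parcel is the intersection of that parcel with
# their own activation map - their personal patch of the group region.
for p in range(1, n_parcels + 1):
    parcel_mask = parcels == p
    sizes = [(parcel_mask & subject_maps[s]).sum() for s in range(n_subjects)]
    print(f"Parcel {p}: per-subject fROI sizes (voxels) = {sizes}")
```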

Fedorenko et al claim various advantages to this technique, and present data showing that it produces nice results in independent subjects (i.e. not the ones they used to make the group map in the first place.)

In particular, they argue that it should give future fMRI studies a better chance of finding the specific functions of each region. So far, experiments using fMRI to investigate language have largely failed to find activations specific to particular aspects of language such as grammar or word meaning, which is unexpected, because patients with lesions to specific areas often show very selective language problems.

Does this relate to the voodoo correlations issue? Indirectly, yes. The voodoo (non-independence error) problem arises when you do a large number of comparisons, and then focus on the "best" results, because these are likely to be wholly, or partially, only that good by chance.

Fedorenko et al's method allows you to avoid doing lots of comparisons in the first place. Instead of looking all over the whole brain for something interesting, you can first do a preliminary scan to map out where in each person's brain interesting stuff is likely to happen, and then focus on those bits in the real experiment.

There's still a multiple-comparisons problem: Fedorenko et al identified 16 candidate language areas per brain, and future studies could well provide more. But that's nothing compared to the 40,000 voxels in a typical whole-brain analysis. We'll have to wait and see if this technique proves useful in the real world, but it's an interesting idea...

Fedorenko, E., Hsieh, P., Nieto Castanon, A., Whitfield-Gabrieli, S., & Kanwisher, N. (2010). A new method for fMRI investigations of language: Defining ROIs functionally in individual subjects. Journal of Neurophysiology. DOI: 10.1152/jn.00032.2010

Wednesday, March 10, 2010

Can We Rely on fMRI?

Craig Bennett (of Prefrontal.org) and Michael Miller, of dead fish brain scan fame, have a new paper out: How reliable are the results from functional magnetic resonance imaging?


Tal over at the [citation needed] blog has an excellent in-depth discussion of the paper, and Mind Hacks has a good summary, but here's my take on what it all means in practical terms.

Suppose you scan someone's brain while they're looking at a picture of a cat. You find that certain parts of their brain are activated to a certain degree by looking at the cat, compared to when they're just lying there with no picture. You happily publish your results as showing The Neural Correlates of Cat Perception.

If you then scanned that person again while they were looking at the same cat, you'd presumably hope that exact same parts of the brain would light up to the same degree as they did the first time. After all, you claim to have found The Neural Correlates of Cat Perception, not just any old random junk.

If you did find a perfect overlap in the area and the degree of activation, that would be an example of 100% test-retest reliability. In their paper, Bennett and Miller review the evidence on the test-retest reliability of fMRI studies; they found 63 of them. On average, the reliability of fMRI falls quite far short of perfection: the areas activated (clusters) had a mean Dice overlap of 0.476, while the strength of activation had a mean ICC of 0.50.

But those numbers, taken out of context, do not mean very much. Indeed, what is a Dice overlap? You'll have to read the whole paper to find out, but even when you do, they still don't mean that much. I suspect this is why Bennett and Miller don't mention them in the Abstract of the paper, and in fact they don't spend more than a few lines discussing them at all.
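
Very roughly, though, the two numbers capture "did the same voxels light up?" (Dice) and "were the activation strengths consistent?" (ICC). Here's a quick sketch with made-up numbers - mine, not the paper's data:

```python
# Dice overlap between two thresholded maps, and a one-way ICC for activation
# strengths measured in a scan and a re-scan. The example values are invented.
import numpy as np

def dice(a, b):
    """Dice overlap between two binary activation maps."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    return 2 * (a & b).sum() / (a.sum() + b.sum())

def icc_oneway(x):
    """One-way random-effects ICC(1,1): rows = voxels/ROIs, columns = sessions."""
    x = np.asarray(x, float)
    n, k = x.shape
    row_means = x.mean(axis=1)
    msb = k * ((row_means - x.mean()) ** 2).sum() / (n - 1)        # between-voxel mean square
    msw = ((x - row_means[:, None]) ** 2).sum() / (n * (k - 1))    # within-voxel mean square
    return (msb - msw) / (msb + (k - 1) * msw)

scan1 = np.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=bool)   # voxels above threshold, session 1
scan2 = np.array([1, 1, 0, 0, 1, 0, 1, 0], dtype=bool)   # same voxels, session 2
print(f"Dice overlap: {dice(scan1, scan2):.2f}")          # 0.75 for these toy maps

strengths = np.array([[2.1, 1.8], [0.4, 0.9], [1.5, 1.2], [3.0, 2.6], [0.2, 0.5]])
print(f"ICC(1,1):     {icc_oneway(strengths):.2f}")
```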

A Dice overlap of 0.476 and an ICC of 0.50 are what you get if you average over all of the studies that anyone's done looking at the test-retest reliability of any particular fMRI experiment. But different fMRI experiments have different reliabilities. Saying that the average reliability of fMRI is 0.5 is rather like saying that the mean velocity of a human being is 0.3 km per hour. That's probably about right, averaging over everyone in the world, including those who are asleep in bed and those who are flying on airplanes - but it's not very useful. Some people are moving faster than others, and some scans are more reliable than others.


Most of this paper is not concerned with "how reliable fMRI is", but rather, with how to make any given scanning experiment more reliable. And this is an important thing to write about, because even the most optimistic cognitive neuroscientist would agree that many fMRI results are not especially reliable, and as Bennett and Miller say, reliability matters for lots of reasons:
Scientific truth. While it is a simple statement that can be taken straight out of an undergraduate research methods course, an important point must be made about reliability in research studies: it is the foundation on which scientific knowledge is based. Without reliable, reproducible results no study can effectively contribute to scientific knowledge.... if a researcher obtains a different set of results today than they did yesterday, what has really been discovered?
Clinical and Diagnostic Applications. The longitudinal assessment of changes in regional brain activity is becoming increasingly important for the diagnosis and treatment of clinical disorders...
Evidentiary Applications. The results from functional imaging are increasingly being submitted as evidence into the United States legal system...
Scientific Collaboration. A final pragmatic dimension of fMRI reliability is the ability to share data between researchers...
So what determines the reliability of any given fMRI study? Lots of things. Some of them are inherent to the nature of the brain, and are not really things we can change: activation in response to basic perceptual and motor tasks is probably always going to be more reliable than activation related to "higher" functions like emotions.

But there are lots of things we can change. Although it's rarely obvious from the final results, researchers make dozens of choices when designing and analyzing an fMRI experiment, many of which can at least potentially have a big impact on the reliability of their findings. Bennett and Miller cover lots of them:
voxel size... repetition time (TR), echo time (TE), bandwidth, slice gap, and k-space trajectory... spatial realignment of the EPI data can have a dramatic effect on lowering movement-related variance ... Recent algorithms can also help remove remaining signal variability due to magnetic susceptibility induced by movement... simply increasing the number of fMRI runs improved the reliability of their results from ICC = 0.26 to ICC = 0.58. That is quite a large jump for an additional ten or fifteen minutes of scanning...
The details get extremely technical, but then, when you do an fMRI scan you're using a superconducting magnet to image human neural activity by measuring the quantum spin properties of protons. It doesn't get much more technical.

Perhaps the central problem with modern neuroimaging research is that it's all too easy for researchers to write off the important experimental design issues as "merely" technicalities, and just put some people in a scanner using the default scan sequence and see what happens. This is something few fMRI users are entirely innocent of, and I'm certainly not, but it is a serious problem. As Bennett and Miller point out, the devil is in the technical details.
The generation of highly reliable results requires that sources of error be minimized across a wide array of factors. An issue within any single factor can significantly reduce reliability. Problems with the scanner, a poorly designed task, or an improper analysis method could each be extremely detrimental. Conversely, elimination of all such issues is necessary for high reliability. A well maintained scanner, well designed tasks, and effective analysis techniques are all prerequisites for reliable results.
Bennett, C. M., & Miller, M. B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences.

Wednesday, September 16, 2009

fMRI Gets Slap in the Face with a Dead Fish

A reader drew my attention to this gem from Craig Bennett, who blogs at prefrontal.org:

Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction

This is a poster presented by Bennett and colleagues at this year's Human Brain Mapping conference. It's about performing an fMRI scan on a dead fish - specifically, a salmon. They put the salmon in an MRI scanner and "the salmon was shown a series of photographs depicting human individuals in social situations. The salmon was asked to determine what emotion the individual in the photo must have been experiencing."

I'd say that this research was justified on comedic grounds alone, but they were also making an important scientific point. The (fish-)bone of contention here is multiple comparisons correction. The "multiple comparisons problem" is simply the fact that if you do a lot of different statistical tests, some of them will, just by chance, give interesting results.

In fMRI, the problem is particularly severe. An MRI scan divides the brain up into cubic units called voxels. There are over 40,000 in a typical scan. Most fMRI analysis treats every voxel independently, and tests to see if each voxel is "activated" by a certain stimulus or task. So that's at least 40,000 separate comparisons going on - potentially many more, depending upon the details of the experiment.

Luckily, during the 1990s, fMRI pioneers developed techniques for dealing with the problem: multiple comparisons correction. The most popular method uses Gaussian Random Field Theory to calculate the probability of falsely "finding" activated areas just by chance, and to keep this acceptably low, although there are alternatives.

But not everyone uses multiple comparisons correction. This is where the fish comes in - Bennett et al show that if you don't use it, you can find "neural activation" even in the tiny brain of a dead fish. Of course, with the appropriate correction, you don't. There's nothing original about this, except the colourful nature of the example - but many fMRI publications still report "uncorrected" results (here's just the last one I read).
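
If you want to see the principle without a salmon to hand, here's a quick simulation of my own. The standard packages use GRF-based corrections; plain Bonferroni is used here purely to illustrate the point:

```python
# Pure-noise "brain": at an uncorrected threshold, dozens of voxels look
# "active"; after correction for 40,000 comparisons, essentially none do.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_voxels = 40_000
df = 15                                        # e.g. degrees of freedom from a small study

t_values = rng.standard_t(df, size=n_voxels)   # pure noise: no signal anywhere
p_values = stats.t.sf(t_values, df)            # one-sided p-value per voxel

uncorrected = (p_values < 0.001).sum()         # a common "uncorrected" threshold
bonferroni = (p_values < 0.05 / n_voxels).sum()

print(f"'Active' voxels, uncorrected p<0.001:  {uncorrected}")   # expect around 40
print(f"'Active' voxels, Bonferroni-corrected: {bonferroni}")    # almost always 0
```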

Bennett concludes that "the vast majority of fMRI studies should be utilizing multiple comparisons correction as standard practice". But he says on his blog that he's encountered some difficulty getting the results published as a paper, because not everyone agrees. Some say that multiple comparisons correction is too conservative, and could lead to genuine activations being overlooked - throwing the baby salmon out with the bathwater, as it were. This is a legitimate point, but as Bennett says, in this case we should report both corrected and uncorrected results, to make it clear to the readers what is going on.

Monday, April 27, 2009

More Brain Voodoo, and This Time, It's Not Just fMRI

Ed Vul et al recently created a splash with their paper, Puzzlingly high correlations in fMRI studies of emotion, personality and social cognition (better known by its previous title, Voodoo Correlations in Social Neuroscience.) Vul et al accused a large proportion of the published studies in a certain field of neuroimaging of committing a statistical mistake. The problem, which they call the "non-independence error", may well have made the results of these experiments seem much more impressive than they should have been. Although there was no suggestion that the error was anything other than an honest mistake, the accusations still sparked a heated and ongoing debate. I did my best to explain the issue in layman's terms in a previous post.

Now, like the aftershock following an earthquake, a second paper has appeared, from a different set of authors, making essentially the same accusations. But this time, they've cast their net even more widely. Vul et al focused on only a small sub-set of experiments using fMRI to examine correlations between brain activity and personality traits. But they implied that the problem went far beyond this niche field. The new paper extends the argument to encompass papers from across much of modern neuroscience.

The article, Circular analysis in systems neuroscience: the dangers of double dipping, appears in the extremely prestigious journal Nature Neuroscience. The lead author, Dr. Nikolaus Kriegeskorte, is a postdoc in the Section on Functional Imaging Methods at the National Institutes of Health (NIH).

Kriegeskorte et al's essential point is the same as Vul et al's. They call the error in question "circular analysis" or "double-dipping", but it is the same thing as Vul et al's "non-independent analysis". As they put it, the error could occur whenever
data are first analyzed to select a subset and then the subset is reanalyzed to obtain the results.
and it will be a problem whenever the selection criteria in the first step are not independent of the reanalysis criteria in the second step. If the two sets of criteria are independent, there is no problem.


Suppose that I have some eggs. I want to know whether any of the eggs are rotten. So I put all the eggs in some water, because I know that rotten eggs float. Some of the eggs do float, so I suspect that they're rotten. But then I decide that I also want to know the average weight of my eggs. So I take a handful of eggs within easy reach - the ones that happen to be floating - and weigh them.

Obviously, I've made a mistake. I've selected the eggs that weigh the least (the rotten ones) and then weighed them. They're not representative of all my eggs. Obviously, they will be lighter than the average. Obviously. But in the case of neuroscience data analysis, the same mistake may be much less obvious. And the worst thing about the error is that it makes data look better, i.e. more worth publishing:
Distortions arising from selection tend to make results look more consistent with the selection criteria, which often reflect the hypothesis being tested. Circularity is therefore the error that beautifies results, rendering them more attractive to authors, reviewers and editors, and thus more competitive for publication. These implicit incentives may create a preference for circular practices so long as the community condones them.
To try to establish how prevalent the error is, Kriegeskorte et al reviewed all of the 134 fMRI papers published in the highly regarded journals Science, Nature, Nature Neuroscience, Neuron and the Journal of Neuroscience during 2008. Of these, they say, 42% contained at least one non-independent analysis, and another 14% may have done. That leaves 44% which were definitely "clean". Unfortunately, unlike Vul et al who did a similar review, they don't list the "good" and the "bad" papers.

They then go on to present the results of two simulated fMRI experiments in which seemingly exciting results emerge out of pure random noise, all because of the non-independence error. (One of these simulations concerns the use of pattern-classification algorithms to "read minds" from neural activity, a technique which I previously discussed). As they go on to point out, these are extreme cases - in real life situations, the error might only have a small impact. But the point, and it's an extremely important one, is that the error can creep in without being detected if you're not very careful. In both of their examples, the non-independence error is quite subtle and at first glance the methodology is fine. It's only on closer examination that the problem becomes apparent. The price of freedom from the error is eternal vigilance.
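
To give a flavour of the "mind-reading" case, here's my own toy version (using scikit-learn, not the authors' actual simulation): selecting the "most discriminative" voxels using the whole dataset, and only then cross-validating, gives impressive-looking decoding of pure noise; doing the selection inside each training fold brings it back to chance.

```python
# Double-dipping in a pattern classifier: feature selection that peeks at the
# test data inflates decoding accuracy even when the data are pure noise.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5000))           # 40 "trials" x 5000 noise voxels
y = np.repeat([0, 1], 20)                 # two arbitrary "conditions"

# Circular: pick the 20 most "discriminative" voxels using ALL the data,
# then cross-validate a classifier on just those voxels.
selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
circular_acc = cross_val_score(SVC(), selected, y, cv=5).mean()

# Independent: do the voxel selection inside each training fold only.
pipeline = make_pipeline(SelectKBest(f_classif, k=20), SVC())
proper_acc = cross_val_score(pipeline, X, y, cv=5).mean()

print(f"Accuracy with circular selection: {circular_acc:.2f}")   # well above chance
print(f"Accuracy with nested selection:   {proper_acc:.2f}")     # roughly 0.5, i.e. chance
```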

But it would be wrong to think that this is a problem with fMRI alone, or even neuroimaging alone. Any neuroscience experiment in which a large amount of data is collected and only some of it makes it into the final analysis is equally at risk. For example, many neuroscientists use electrodes to record the electrical activity in the brain. It's increasingly common to use not just one electrode but a whole array of them, to record activity from more than one brain cell at once. This is a very powerful technique, but it raises the risk of the non-independence error, because there is a temptation to only analyze the data from those electrodes where there is the "right" signal, as the authors point out:
In single-cell recording, for example, it is common to select neurons according to some criterion (for example, visual responsiveness or selectivity) before applying further analyses to the selected subset. If the selection is based on the same dataset as is used for selective analysis, biases will arise for any statistic not inherently independent of the selection criterion.
In fact, Kriegeskorte et al praise fMRI for being, in some ways, rather good at avoiding the problem:
To its great credit, neuroimaging has developed rigorous methods for statistical mapping from its beginning. Note that mapping the whole measurement volume avoids selection altogether; we can analyze and report results for all locations equally, while accounting for the multiple tests performed across locations.
With any luck, the publication of this paper and Vul's so close together will force the neuroscience community to seriously confront this error and related statistical weaknesses in modern neuroscience data analysis. Neuroscience can only emerge stronger from the debate.

Kriegeskorte, N., Simmons, W., Bellgowan, P., & Baker, C. (2009). Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience. DOI: 10.1038/nn.2303
