227 Comments

DARE has a positive effect size for keeping kids off drugs? I realize this isn't the point of the article, but I thought I remembered it as being negative (and significant).

Expand full comment

I recall Scared Straight being found to have a negative effect size, but while it's easy to lump initiatives like this together into some category like "Naive social pressure interventions," they're still different initiatives with separate outcomes.

That said, with that effect size, I'd guess that the error bars for DARE most likely cross zero.

Expand full comment

From the linked abstract:

> Results. The overall weighted effect size for the included D.A.R.E. studies was extremely small (correlation coefficient = 0.011; Cohen d = 0.023; 95% confidence interval = −0.04, 0.08) and nonsignificant (z = 0.73, NS).

So, as @Desertopa speculated, the error bars cross zero; the true effect size could be negative (though not statistically distinguishable from zero in either direction).
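Rounding aside, the quoted numbers hang together; a back-of-the-envelope check, assuming a symmetric normal-approximation confidence interval:

$$\mathrm{SE} \approx \frac{0.08 - (-0.04)}{2 \times 1.96} \approx 0.031, \qquad z = \frac{d}{\mathrm{SE}} \approx \frac{0.023}{0.031} \approx 0.75,$$

which is consistent with the reported z = 0.73 and well below the 1.96 needed for significance at the 5% level.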

Expand full comment

Scott cited a meta-analysis from 2004. If you look at individual studies, some found negative effects and some found positive, and the meta-analysis itself said the overall effect was not significant. It's very possible you read a study or someone else's interpretation of a single study that found a negative effect. This relates to Scott's older article, Beware the Man of One Study. https://slatestarcodex.com/2014/12/12/beware-the-man-of-one-study/

Expand full comment

It should be noted that people who accept that result (that the effect is statistically meaningless) include... DARE. That meta-analysis came out about the same time they did a major overhaul of their curriculum. (I've seen no research on whether the new one works any better, though.)

Expand full comment

All the numbers in this post easily have ±0.05 error bars and could be rounded to the nearest 0.1. That would make DARE a 0.0 and possibly negative.

Expand full comment

These standardized effect sizes are pretty useless things to report.

The context for an effect size is the subject matter. Normalizing in the way you've done smooths over the variances of the populations as well as the size of the effect. Those things matter a TON when thinking about whether an effect is worth caring about!

Things like Cohen's d can be useful for making comparisons within a specific context, but as you cross contextual boundaries (which is the stated purpose of your exercise here), it becomes actively detrimental by acting as a substitute for thinking about the actual subject matter.

Expand full comment

Disagree. They're interesting to report, and enlightening to me as empirical examples of the point that (to quote the post)

> But there are many statistics that are much higher than you would intuitively think, and many other statistics that are much lower than you would intuitively think. A dishonest person can use one of these for “context”, and then you will incorrectly think the effect is very high or very low.

This is thematically similar to basically all the stuff written in https://slatestarcodex.com/tag/statistics/

Expand full comment

They're fine as trivia or to help you understand better those specific populations. But the magnitude of that is rarely what you care about when considering effect sizes in a practical context.

For example, you probably don't care whether oxy helps pain in relation to variation in pain reporting over time. You care if it decreases pain in absolute terms. You probably don't care about the effect of tutoring relative to the variation in learning among children. You probably care about how much tutoring will improve your child's educational outcomes.

Anyhow, my point is that the 'context' thing Scott is describing is a BAD thing to do for normalized effect sizes. Just try unpacking the statements one would be making:

"The ratio of the difference in the average reported pain levels between populations who used oxycodone and those who didn't to the variation among individuals in these populations is similar to the ratio of the difference in the average self-reported interest in engineering among men and women to the variation in self-reported interest in engineering among the whole population."

Aside from people with an intuitive understanding of population-level variation in engineering interest along with good knowledge of how much men vs. women self-report interest in engineering, I don't believe that adding this "context" will be helpful.

Expand full comment

Isn’t that Scott’s entire point? That this is a way to lie or mislead with statistics that one should be wary of?

Expand full comment

Scott is saying, "Telling people about ratios of differences in means to variances in those populations of something not necessarily related to the thing you're discussing is a useful way to give them an intuition about the ratio of the difference in means of two populations to the variance of those populations." My opinion is that attempting to do this at all is likely to be counterproductive, because measures like Cohen's d are only relevant or useful in their own context with respect to whether an effect is "big" or "important".

Expand full comment

> You probably don't care about the effect of tutoring relative to the variation in learning among children. You probably care about how much tutoring will improve your child's educational outcomes.

What exactly are the "educational outcomes"? If that means grades at school, or a chance to get into university, those things already depend on the variation in learning among children, so you do indirectly care about how much the tutoring helps relative to that.

As a thought experiment, imagine being teleported to a planet where the variation between children is so great that unless you are an Einstein, you have no chance of completing high school. On such a planet, parents of Earth-average children would not waste money on tutoring, because it would be pointless; their child will not complete high school anyway. -- On the other hand, imagine being teleported to a planet where people are practically clones of each other. A little bit of tutoring could make your child understand something that no one else understands at a given age, so your child would be considered a genius by all teachers.

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

> As a thought experiment, imagine being teleported to a planet where the variation between children is so great that unless you are an Einstein, you have no chance of completing high school.

I'm having trouble with this. There is no obvious connection between (1) the variance of intelligence in the population; and (2) the intelligence threshold for completing high school. Why would increasing or decreasing (1) have a predictable effect on (2)?

We have a lot of variance in the ability of schoolchildren now, but the response to that has been to steadily lower the threshold for graduation over time. Our threshold is currently at negative infinity. Why would it get higher if variance increased?

Expand full comment

If you have a wide variation, then the class instruction will only be suited to some of the students. If you suppose that the class instruction is aimed at the mean or median, a large part of the class will be left out. If you assume it's aimed at the low end, just about nobody progressing from that class will be prepared for the next class. Repeat this a few times in series and you'd probably get the described effect.

Expand full comment

But you're talking about knowledge gain, the objective phenomenon with continuous results. Viliam is very explicit that he's talking about school completion, the credentialist phenomenon with discrete binary results. These are unrelated concepts. In an American school, no one is expected to gain any knowledge, but everyone is expected to graduate.

Expand full comment

You wrote: " You probably don't care about the effect of tutoring relative to the variation in learning among children. You probably care about how much tutoring will improve your child's educational outcomes."

What I tried to say is that "educational outcomes" and "variation in learning" are related -- so if you care about the former, you at least indirectly care about the latter.

Education is set up to reflect the variation in learning in the population. If grade A is the best, and grade Z (or whatever it is in a given country) is the worst, then the difference between A and Z reflects the local variation in learning. On a planet of clones, either tiny differences in knowledge would result in dramatically different grades, or everyone would get the same grade. On a planet of clones, tutoring would make a great difference, because it would be the only difference there is.

I don't have a good metaphor for whatever is the *opposite* extreme of the planet of clones.

Expand full comment

In my opinion, it depends a lot on the tutor. OTOH, this is consistent with smaller class sizes improving learning. And with the right tutor (and the right student) the effect can be quite strong. My wife used to tutor in music and art, so I've got anecdotal confirmation of that opinion.

Expand full comment

I have no strong opinion on class size, because a lot of research ignores one statistical issue -- it is easier to get extreme values when you calculate averages of smaller sets, compared to averages of larger sets.

As an example, imagine that each student rolls a die, and you calculate the average for the entire classroom. Statistically, the classrooms with the highest averages will be the small ones. (Also, the classrooms with the lowest averages will be the small ones.) This is because if you e.g. roll three dice, your chances of getting 6,6,6 (or 1,1,1) are way higher than your chances of getting 6,6,6,6,6,6 (or 1,1,1,1,1,1) when you roll six dice.

So you need to pay great attention to whether the finding is "small classes on average have a better average score than large classes", which would mean something, or whether the finding is "the classes with the best average scores are the smallest ones", which means nothing (because it is probably simultaneously true that "the classes with the worst average scores are the smallest ones" -- the smallest classes are overrepresented at both extremes). I suspect that most psychologists just treat these two as synonyms, and use the latter as evidence of the former.
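A quick simulation of the dice analogy makes the selection effect concrete (a sketch with arbitrary class sizes, not real data):

```python
import random

# Every student "rolls a die"; we check which class sizes produce the most
# extreme class averages. Class sizes 5 and 30 are arbitrary choices.
random.seed(0)

def class_average(size):
    return sum(random.randint(1, 6) for _ in range(size)) / size

classes = [(size, class_average(size)) for size in [5, 30] * 10_000]

top_100 = sorted(classes, key=lambda c: c[1], reverse=True)[:100]
bottom_100 = sorted(classes, key=lambda c: c[1])[:100]

print("small classes among the 100 highest averages:",
      sum(size == 5 for size, _ in top_100))
print("small classes among the 100 lowest averages:",
      sum(size == 5 for size, _ in bottom_100))
# Both counts come out at or near 100: the small classes dominate both extremes,
# even though small and large classes have the same expected average.
```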

Expand full comment

> For example, you probably don't care whether oxy helps pain in relation to variation in pain reporting over time.

I don't know, I think I do. However, the trick is how we define variance. I don't care how much Oxy reduces pain compared to the variance of pain in the population as a whole (encompassing some pain-free people and some people with arthritis and cluster headaches); I care how much it reduces pain compared to the day-to-day variability in pain reporting for a single person.

Suppose I have arthritis, and oxy reduces pain about half as much as the difference between an (unmedicated) good day and bad day; maybe it's not worth the side effects and potential addiction. If it reduces pain about three times that difference, however, then it would likely feel like a wonder drug.

Expand full comment

I agree with your point. I believe the value Scott noted above was the first version you discussed. AKA, the difference in average reported pain before and after "intervention" among individuals who received oxy against individuals who received a placebo, scaled by the variance.

The second thing that you describe would be very relevant, but that's not the variance that Cohen's d is being scaled with.

Expand full comment

I do care about the effect tutoring would give my child relative to the variation in learning among children - and of course I do. I would want to know how much he could move up the bell curve. And the background variation in interest in engineering seems like a good thing to compare women's and men's interest to. You have successfully convinced me of the opposite point.

Expand full comment

Fair enough, though I think there are plenty of situations where the goal is, "My kid needs a 3.0 GPA to qualify for X" or "My kid needs at least a passing grade on their AP test to get college credit" or "My kid needs at least X score on their SAT to qualify for a scholarship."

Expand full comment

As a professional researcher, I would say that standardized effect sizes may not be useful to report but are immensely useful when planning studies. When applying for funding one must always report what sort of effect size one will be able to detect given the anticipated sample size, and, absent great estimates of precision, one is reduced to reporting this as a standardized effect. Often, whether or not the NIH gives you a million bucks to do a study depends on whether the reviewer thinks that detecting a standardized effect size of 0.3 is useful or not, so I always reference something concrete as Scott has handily provided here. I'll probably refer back here in the future for just that purpose.

Expand full comment

Absolutely. When you're planning a study, signal-to-noise ratio is absolutely something you should consider. But the SNRs you should consider are those relevant to the context you're working in, not SNRs from unrelated fields or contexts.

Expand full comment

For something like pain medication, where you have both placebo effects and where the condition can naturally get better or worse for unrelated reasons, effect size matches up pretty well with our intuition for "this medicine helps often and a lot" vs "this medicine helps a little, or only sometimes". You can do better by providing more than a single number, but that can also make it harder to understand without a background in statistics.

Expand full comment

I think presenting Cohen's d will be more confusing to people without a background in statistics than saying "It reduced pain on average by 3 on a scale from 0 to 10", but YMMV.

FWIW, when I've taught courses on experimental design and effect sizes, students (intelligent professionals without statistics backgrounds) have almost always been more comfortable with absolute units than generic scaled ones. This also holds when working with subject-matter experts to scope tests.

Expand full comment

I don't think a layperson would know what Cohen's d is, but we often call them small effect size (d = 0.2) or large effect size (d = 0.8), which is fairly intuitive.

"Reduced pain on average by 3 on a scale from 0 to 10" works too but has its own problems. If you're looking the effects of ibuprofen on headache pain after 2 hours, the control group might have a pain reduction of 3 or more on a scale from 0 to 10. Even if ibuprofen reduced pain by 3 points beyond the control group, it helps to contextualize it by whether this is a big effect compared to normal headache variance.

A 3-point improvement vs. the control group on a 0-10 pain scale is also, I think, a much bigger difference than you'd see in an ibuprofen headache study. Unless you select for only the most severe headaches, you'd probably start with an average pain around 4 out of 10 before treatment (guessing mild headaches are somewhat more common). The control group would improve a few points, and ibuprofen isn't particularly effective and doesn't work on everyone (hence the ~0.2 effect size). So you might end up with numbers more like a 0.6-point improvement out of 10, which is harder to understand intuitively. It sounds insignificant, when it might actually be small but decent.
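To make that mapping concrete, here is a toy version of the arithmetic; every number below is invented purely for illustration, not taken from an actual ibuprofen trial:

```python
# Hypothetical numbers on a 0-10 pain scale, chosen only to illustrate the arithmetic.
control_improvement = 2.4   # mean pain drop in the placebo/control group
drug_improvement = 3.0      # mean pain drop in the ibuprofen group
sd_of_changes = 3.0         # pooled standard deviation of individual pain changes

cohens_d = (drug_improvement - control_improvement) / sd_of_changes
print(cohens_d)  # 0.2 -- a "small" standardized effect from a 0.6-point raw difference
```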

Expand full comment

Could you give an example of a situation where taking the effect size seriously (e.g., rho=.70 indicates pretty substantial correlation, rho = .20 indicates not much correlation) is misleading unless you consider the subject matter?

Expand full comment

My point was about effect sizes, not correlation. I think what Scott is proposing is much more reasonable when talking about correlation.

Expand full comment

Also, standardized effect sizes are sensitive to the scale/measure used, because it can change the variance. You can get a larger standardized effect size, of the same treatment, for "can this child read a paragraph" than for "words read per minute", because the latter has higher variance.
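A sketch of that mechanism with invented numbers (whether the reversal actually happens depends on the real data; this only shows how the denominator drives the standardized number):

```python
# The same hypothetical reading intervention, reported on two different outcome measures.
# Only the spread of the measure differs between the two blocks, not the intervention.
gain_wpm, sd_wpm = 5.0, 25.0     # words per minute: raw gain of 5 against a wide spread
d_wpm = gain_wpm / sd_wpm        # 0.2

gain_par, sd_par = 0.05, 0.10    # share who "can read a paragraph": small gain, small spread
d_par = gain_par / sd_par        # 0.5 -- a larger standardized effect from the same program

print(d_wpm, d_par)
```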

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

After 20 years of disuse, I have lost all the statistical knowledge and ability I once had in college. However, this post reminds me of a Twitter post of Taleb's from last year stating, "Perhaps the most misunderstood idea in the 'empirical' sciences is that a correlation of 90% is closer to 0 than 1." https://twitter.com/nntaleb/status/1553688258738995200?s=20. I did not really understand the post then, and I can't say I really understand it now, but I have a vague intuition that this blog post is related to the concept Taleb was referencing?

Expand full comment

That looks like a typo or exaggeration. The image says rho=.5 is much closer to 0 than to rho=1. He means the corresponding scatter plot is spread out, not closely following the line. At rho=.5, there will be a lot of variation in the output, at the same input. I’m not sure how universally-relevant that is.

Expand full comment

Seems like his point is that any probability is closer to absolute ignorance than it is to perfect knowledge. Like, if you have a coin that comes up 90% heads and 10% tails, it will feel more like having a random coin than like having a coin that *reliably* comes up heads.

(To put it uncharitably: unless the probability is literally 100%, I am free to ignore it; IQ does not 100% correlate with anything important, therefore it is debunked; buy my books!)

Expand full comment

As far as I can tell, the things he actually thinks about IQ aren't wrong. He just disagrees with people on Twitter without figuring out what they're saying, which causes him to disagree with things that are also not wrong.

Expand full comment

That is, I think the question with IQ isn't how significant the effect is or how close the correlation is, but how close or how good these would have to be to justify departing from an ideology that excludes prejudice. If literally every blue person had an IQ of 80 and literally every green person had an IQ of 130, obviously we would need to design a society that separates them into different castes. Hopefully there is also a small group of orange people who all have IQs of 180.

Expand full comment

I don't think it's necessary to have 0% overlap in order to recognize caste differences. Some women are stronger/taller/faster than some men, but no reasonable person argues that we shouldn't have gender-segregated sports.

>justify departing from an ideology that excludes prejudice

I don't think that admitting the possibility that some groups have higher IQs than others necessitates prejudice.

Expand full comment

That's why I said the question is "how close or how good". You appear to set the bar lower than Taleb does. That's an ideological decision, and that's fine.

Expand full comment

I don't think any level of accuracy/confidence in the measurement of group differences necessitates prejudice. That's because the former is an empirical question and the latter a philosophical one. They are categorically different questions, regardless of "how close or how good" the empirics are.

Expand full comment

>no reasonable person argues that we shouldn't have gender-segregated sports.

Plenty of allegedly reasonable people disagree about what this pesky "gender" business actually means though, to the extent that it becomes unclear what the purpose of segregation is supposed to be in the first place.

Expand full comment

I disagree that they're reasonable.

Expand full comment

Obviously? Why?

Expand full comment

If you're confused, try reading the comments to which I was responding.

Expand full comment

That sounds like Taleb. He thinks every shower thought he has should be made into a book.

Expand full comment

Those scatter plots are slightly weird, because they've normalised X and Y differently. To get a sense of how closely correlated two variables are from a picture that can be compared to other pictures, you want to rescale X and Y so that they have the same variance - i.e. so that the major and minor axes of the "oval of best fit" are the lines X=Y and X=-Y.

These pictures have all had X stretched compared to Y, which I think intuitively increases your estimate of "how much information about Y does X give you" and reduces your estimate of "How much information about X does Y give you".

Expand full comment

Do you have the labels back to front in your last paragraph? I tried looking at the graphs sideways, and I thought, “wow, Y has a really big effect on X.” If you rotate it, X seems to change a lot in response to Y. If anything, I’d say the current layout is designed to minimise the apparent effect of X on Y, to support Taleb’s claim that X has a small effect on Y.

Expand full comment

No, I do mean it that way round, but "how much information about Y does X give you"? is not the same thing as "how much does Y change in response to X?" - the former feels as though it's about the width of a vertical slice through the graph, and the latter as though it's about the slope of the graph.

Expand full comment

What the hell does he mean by a correlation of 90%? Correlations aren't measured in percents. They're measured by the statistic rho, the correlation coefficient. You can talk about what percent of the variance in x is accounted for by y, and that number is a percent. You get it by squaring rho. To get a rho squared of 90% you'd need a correlation coefficient rho of about .949. That's a *really* high correlation.
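Spelling out the arithmetic behind that last number:

$$\rho^2 = 0.90 \;\Rightarrow\; \rho = \sqrt{0.90} \approx 0.949.$$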

Expand full comment

90% is just another way of writing .9.

Expand full comment

Yes, I understand they are mathematically equivalent, but they are not equivalent when you are talking about statistics. Rho is never expressed as a percent. And the math behind calculating it does not fit with the logic of percent - you are not figuring out what portion of one thing another thing represents. PLUS there is a closely related statistic, rho squared, the percent of variance in A that B accounts for, that *is* expressed as a percent, so using a percent when speaking of rho is confusing. It's sort of like saying a car is going 6,500% mph. 6,500% mph also equals 65 mph, but that doesn't mean there's no problem with using a percent sign when indicating the speed.

Expand full comment

The (one?) way to put a statistic in context is to show it being used in a cost benefit analysis, assuming it can be shown as cause. A "small" effect size might be important in deciding to deploy a vaccine or not. A large effect size might not be important in choosing your pet's food.

Expand full comment

I'd phrase it as "is it decision-relevant? How so?" for which CBA is one of many approaches.

Expand full comment

Since learning it many years ago from lawyers who I was working with, I've always found the courtroom concept of "material" (as in "material to the case") the most useful way of grappling with that point. And yes, going way beyond literal CBA.

Expand full comment

Out of curiosity: I suspect that northwestern European countries are a key reason why the correlation between temperature and latitude isn't higher. The British Isles, the Netherlands, Denmark, etc. are quite high latitude, but not particularly cold. Almost all of Europe is noticeably warmer than the equivalent latitudes in Asia and North America, and Europe packs a lot of data points (a lot of small-to-medium countries) into a small space.

Expand full comment

Northwestern Europe being the global exception-- many such cases

Expand full comment

Every time you have a good thing going, the British come and ruin it. Sad!

Expand full comment

VERY high elevations in South America are near the equator, yet very cold.

Anchorage, Alaska is much warmer in winter than Minot, North Dakota.

Expand full comment

It's often put down to the Gulf Stream as to why the British Isles are not colder. And now, seemingly, there's a 'cold blob' in the North Atlantic which is why Ireland isn't being affected as badly as next door by the higher summer temperatures:

https://www.rte.ie/news/environment/2023/0601/1386813-climate-change-tourism/

"He said this remarkable cold patch is being caused by a significant slowdown in the Gulf Stream. That is the flow of warm water that travels in a northeastward direction across the Atlantic Ocean from coast of Florida.

It operates like a central heating system, warming the waters to the west of Ireland and ensuring the country's climate remains temperate.

Dr McCarthy said every place in the world, except this patch of the Atlantic, has been getting warmer.

...The "cold blob" [Professor Stefan Rahmstorf] said would lead to a big increase in extreme summer heat in Europe, but that because of where Ireland is positioned, he showed how it would escape the worst of Europe’s extreme heat.

In short, he said that the cold patch of water in the Atlantic Ocean cools the air above it, creating a low-pressure zone. This low-pressure zone then operates a bit like a roundabout, causing the air to flow anticlockwise around it, driving it further south where it gets far warmer. It then travels onwards around the low-pressure "roundabout" and heads straight for continental Europe and the southeast of England."

Expand full comment

Europe is actually an instance of a more general phenomenon. The west coasts of continents are milder than the east coasts, because the coriolis effect makes northern currents spin clockwise and southern currents spin counterclockwise. There is a distinctive climate on subtropical west coasts with a dry summer and a wet winter that is often considered one of the nicest climates on earth, shared by California, Portugal, the populated parts of Chile, Cape Town, and Perth.

Expand full comment

Wait, but isn't the coriolis effect reversed in the southern hemisphere?

Expand full comment

Yes - in the north it spins clockwise, while in the south it spins counterclockwise. Both ways, the east coast of continents has warm equatorial currents, while the west coast gets cool polar currents.

Expand full comment

There are also a fair number of heavily populated cities at high altitudes in the tropics (Bogotá, Nairobi, Mexico City) that are all quite cool.

Expand full comment

You forgot the Andes and the Himalayas. Countries high in the mountains, but near the equator, have less tendency to be hot places. Another similar effect is caused by small islands that are countries.

Expand full comment

This was a fun one

Expand full comment

>the correlation between which college majors have higher IQ, and which college majors have skewed gender balance

I looked at your link on this, and they admit, in a note at the bottom, that their source didn't actually look at IQ scores, but rather "pre-2011 GRE scores" (I'm not sure of the significance of the year). Their source then did some kind of statistical translation to get approximate IQ scores from that. Anyway, this means the r value doesn't actually relate IQ to gender skew of majors.

Expand full comment

Which maybe kind of even further supports Scott's point: it's easy to cheat these kinds of comparisons. That comparison isn't even comparing what it sounds like it's comparing, but the headline is the same.

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

I mean, yeah, but "it's easy to convey misleading info if one of your sources is lying about what it's presenting" isn't exactly revelatory. (Of course, Scott will probably have some way to move the goalposts such that it won't be considered the media lying /hj)

Expand full comment

That article further notes that when you break it down and split the scores into verbal and quantitative, the verbal scores are basically uncorrelated with gender skew and the quantitative scores have an even stronger correlation. Which makes sense, since those male-dominated fields all look very quantitative, and the female-dominated ones not at all.

What I want to know is if anyone here knows whether the purported SAT/GRE => IQ correlation holds true for both the verbal and quantitative portions individually (the 0.72 correlation between the math and verbal implies it might?), or only for the quantitative portions. The answer there determines whether it's a valid proxy for IQ here or not. (If both are correlated, it's not, since clearly only half of the correlation appears)

Expand full comment

I think my favourite new piece of information that I'll take away from this is that brain size correlates with IQ about as much as parental social class correlates with children's grades.

The latter is considered common wisdom, the former is considered phrenological claptrap, but they're both equally real.

Expand full comment

I think people like to play up the social influences on intelligence because they recognise how useful intelligence is at the individual level, and they're egalitarians who want that intelligence to be shared equally.

Never mind that intelligence isn't at all like money, you can't share it out equally among the population no matter how strong the political will to do so. Just pretend you can because redistribution is the only tool you've got, and when all you have is a hammer everything looks like a nail.

Expand full comment

One of my favorite short reads of the year 😊Thanks as always.

Expand full comment

Are spots on dogs correlated with friskiness?

I know spots on foxes are correlated with gentleness, and spots on horses anti-correlated with gentleness.

Expand full comment

Do spots include disease spots and injury spots? I'd assume they're anti-correlated to friskiness.

Expand full comment

In foxes: in the Russian fox farm experiment, where the farmer was trying to breed fur foxes to be less wild, the finding was that increases in spots and floppy ears are tied to a reduction in stress hormones and an increase in companionship hormones.

In horses, painted horses (splashy white) and appaloosas are known as flighty and hard-to-ride horses. A cowboy-culture saying about painted horses is that George Custer rode up against Indians on painted horses ... that wasn't going to turn out well.

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

Why not just write the correlation values directly? When writing about something like GDP growth, people usually feel no need to write in similes... Or maybe this is what you're saying?

Expand full comment

Because most people feel like they have a vague idea what a dollar or a million dollars is worth, but no idea what correlation values mean. Remember, exceedingly few people study statistics in any formal way.

Expand full comment

Except most people really have no intuitive idea what a million dollars means, and how it differs from a billion dollars. I think most articles about budgets that mention millions or billions really should add a couple comparisons, like the cost of a parking garage or a subway system or the moon landing.

Expand full comment

There's a Twitter account called "NHS minutes" or something like that, which, every time someone complains about an item of British government spending and says the money should go towards the NHS instead, helpfully points out just how little it would contribute to a healthcare programme which costs £150 billion per year (generally expressed as "this could fund the NHS for X minutes"). It's a really useful service.

Expand full comment

The real issue with effect sizes is that they depend so much on error in measurement.

Switching a study to a more reliable (test-retest) measurement would substantially increase the effect size, without changing the real size of the effect at all.

So, this makes it even easier to cheat. Just find a field where the effects are expected to be small, but where the measurement happens to be good.
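One standard way to state that attenuation, as a sketch under classical test theory assumptions (the measurement error inflates the observed variance but doesn't bias the mean difference):

$$d_{\text{observed}} \approx d_{\text{true}} \times \sqrt{r_{xx}},$$

where \(r_{xx}\) is the test-retest reliability of the outcome measure. Moving from a measure with reliability 0.5 to one with reliability 0.9 inflates the observed effect size by a factor of about \(\sqrt{0.9/0.5} \approx 1.34\), with no change in the underlying effect.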

Expand full comment

The specific example of human height differences is a psychologically weird one, because we keep our eyes close to the tops of our bodies and assign high salience to looking slightly up or down. A 5'6" person barely comes up to the chin of a 6'2" person, making them feel *dramatically* shorter, rather than about 90% as tall like a 66-inch-long table vs. a 74-incher.

People are constantly meeting men and women and noticing which is taller, so I would use male vs. female height if I wanted to draw attention to pairwise comparisons (i.e. the probability that a random person from the SSRI group is feeling better than a random placebo taker); this seems to have been the intention and can be a good way to look at what Cohen's D is measuring. But it's not necessarily clarifying to talk about the absolute difference ("just a couple of inches") when these particular inches are perceived so differently from inches of table or even inches of inseam.
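One way to unpack that pairwise-comparison framing (a sketch assuming both groups are roughly normal with equal spread):

```python
from statistics import NormalDist

# "Common language effect size": the probability that a randomly chosen member of
# the treatment group scores higher than a randomly chosen member of the control
# group, given a standardized mean difference d and equal normal spreads.
def prob_superiority(d):
    return NormalDist().cdf(d / 2 ** 0.5)

for d in [0.2, 0.5, 0.8, 2.0]:
    print(d, round(prob_superiority(d), 2))
# 0.2 -> 0.56, 0.5 -> 0.64, 0.8 -> 0.71, 2.0 -> 0.92 (roughly the male/female height case)
```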

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

Doesn't correlation remain the same under scaling/shifting though? I'm not sure if your point is valid.

Expand full comment

I'm talking about the effect size (in the Cohen's D sense, mean difference expressed in pooled standard deviations), not correlation in the linear regression sense.

My point is that in social contexts, "tallness" is perceived nonlinearly. So this *might* be a great intuitive metaphor for a situation where two means differ by ~7% with each mean having a SD of about 10%, but only if there's a reason to assign a nonlinear normative meaning to the bottom and top of the overall distribution, analogous to the way a 5% reduction in height reduces "intuitive tallness" by more like 50%.

It's not clear to me that clinical drug trials generally have this feature—reducing your viral load or tumor size by 5% might or might not take you ~50% of the way to recovery.

Expand full comment

That makes sense, thanks for clarification.

Expand full comment

"each mean having a SD of about 10%"

More like 4%?

Expand full comment

There's a catch though. I feel we use a different bar to intuit the height of men vs. women. I am average height among men, and I remember a few situations where women who were shorter than me but tall among women thought they were taller than me until proven otherwise.

Expand full comment

> I sort of agree, but I can see some use for these. We can’t express how much less depressed someone gets in comprehensible units - it will either be stuff like this or “5 points on the HAM-D”, which is even more arcane and harder for laymen to interpret.

So _how_ does the layman interpret this? If he takes the effect size as "calibrated relative to the variation I experience in my own moods", he's just wrong. If he understands enough to take it as "calibrated relative to population variation on some measurement scale", AND he also has some idea what this population variance is and what the measurement scale means, he's not a layman (neither as a statistician nor as a psychiatrist).

It seems to me that presenting effect size to a layman is not a problem that always has an easy answer. And often the best answer you can give (if any) will be domain dependent and may even (horror of horrors) require you to educate your layman a bit first. It's not going to come from pressing a generic button applying to a table of numbers. Even if the output has a label saying 'effect size' suggesting it is answering your problem. Even if you can't think of anything else. Maybe that's just the way of things.

Expand full comment

^ a far less annoying way of saying what I said above

Expand full comment

I agree, but I also think this post pointed out something meaningful, even though it may have been obvious to you (not being sarcastic).

Expand full comment

But, as you see it, what is the point?

I would say that the premise of this post is that we will want to summarize some effect by a single number (correlation coefficient, Cohen's D) in a way that a casual non-specialist can get something useful out of. The 'get something out of it' implies that the number can be sensibly (even if roughly) compared across diverse topics - because otherwise you need to draw on domain knowledge.

And Scott is showing that this is very hard to do consistently, and prone to abuse.

But perhaps the right answer is not "be careful out there" (my read on Scott's attitude) but "this whole effort is misguided". Yes, it would be nice: nice for scientists, nice for communication, nice for reasoning, ... if it were possible. But perhaps it isn't remotely possible. And "well, we can't do any better, so let's live with the least bad solution" is a wrong approach if the least bad solution is awful enough.

This doesn't condemn science or even statistics. But it strikes me that every topic mentioned here would support real findings, meaningful answers to useful/concrete questions, solid communication, so long as we take the area on its own terms. But the "and here's one number that will convey useful information with very little subject-specific context" wish may not be one the universe grants. And we do not, in fact, need it to be so.

Expand full comment

Since it's become my thing, I suppose I'll just put it out there: IQ is a test score, not a thing that can "determine" other things. It's an effect, not a cause.

I know it's conventional to discuss correlations and R2 in terms of "explaining x% of y effect" but I hate that convention, because tautologies aren't explanations but statistics don't tell you about endogeneity by themselves. You have to use those hated CONCEPTS instead of nice clean numbers. Ugh! What am I, a filthy English major?

Expand full comment

I’ve always felt the same way about the ”explains” language. It feels like a way to nudge your audience toward inferring a causal connection while maintaining plausible deniability.

Expand full comment

How do you measure a concept?

Expand full comment

Fairly often, you can't. Many are irreducibly non-quantifiable. The shape rotators among us will argue that means they aren't real, don't exist, and/or don't matter.

Expand full comment

But that's the point. The vast majority of philosophical disagreements about irreducibly non-quantifiable stuff have been around for thousands of years, with no promising approaches to resolving them emerging. Tangible progress only happens when people manage to quantify some of that. (Unless you're a pomo and deny any such progress, but there's nothing to talk about then.)

Expand full comment

"The vast majority of philosophical disagreements about irreducibly non-quantifiable stuff have been around for thousand of years, with no promising approaches to resolving them emerging."

I don't think this is even close to being true.

Expand full comment

What are some counterexamples? Platonists still profoundly disagree with Aristotelians, far as I can tell.

Expand full comment

That would be true by definition, it's not an interesting statement.

Expand full comment

> R2 in terms of "explaining x% of y effect" but I hate that convention

This is misleading to the point of being wrong. Effect sizes are the right way to measure stuff, even just R is better than R^2, which is far far more likely to give people a misleading idea of what's going on, see this (short and sweet) post: https://emilkirkegaard.dk/en/2022/10/variance-explained-is-mostly-bad/

Expand full comment

Even "effect size" is a bit tricky because it does strongly imply a direct cause-and-effect relationship which may or may not exist.

Expand full comment

When Scott directly asked the community about this before, I believe the consensus was "R is better for predicting datums, but the linearity of R^2 is better for discussing causality". I wouldn't necessarily privilege one over the other until I knew what I wanted.

I don't feel like searching for it at this moment, but maybe i will later.

Expand full comment

i want to say that the relevant post was shortly after

https://slatestarcodex.com/2015/08/02/stalin-and-summary-statistics/

but i can't find it.

Expand full comment

As a sympathetic social scientist and media theorist, I feel that your intellectual honesty and the community's fundamental optimism is approaching the absurd. My conclusion is the same as your statistician friends', from the opposite direction.

The intellectual move you're making here is indeed *the kind of thing* that must be done in order to have the kind of discussions -- here defined as exchanges of typed text and hyperlinks in general pseudonymous web forums -- that we want to have. The issue is illustrated by dwelling on the phrase "put statistics in context."

"Context" here means the typed text (or perhaps spoken language) that is the medium of communication. This medium is linear, logical, progressive -- and extremely old.

"Statistics" are a *radically* new media technology. 99% of all statistics were calculated in the past 10 years (speculatively). The methodology used to produce these statistics is both highly varied and rapidly changing. I wanted this sentence to be about what statistics "mean," but I can't think of a sentence that could accomplish this: it's hard to even define "statistics"!

This post asks the question: "Given that we are committed to combining the media technology of typed paragraphs (with hyperlinks) with statistics, how can we best do this?" Again: noble! honest! but my conclusion is that the conclusion reached in this post is an indictment of the premise, not of the logic.

To rephrase the problem information-theoretically: How much information is necessary to put a statistic in context? This formulation emphasizes that "in context" implies a binary -- and if we're going to impose a binary filter at the end of our knowledge communication, why bother with these continuous statistics?

The full inversion of the logic of the post, then, is the question: how much information can possibly be conveyed in a phrase like "about the same as the degree to which the number of spots on a dog affects its friskiness”? And it seems like the answer is: not enough!

Expand full comment

I find that when statistics are involved, our brain has a strong tendency - oops! I just did it - to look for the low-energy shortcut, in several ways:

1) Causation instead of correlation.

2) Generalization: “All conservatives support Trump” or “this drug is ineffective in treating this condition”, but trying to avoid anything in between.

3) A slightly more nuanced way of 2) such as “correlation is significant” or “not significant”.

4) Completely ignoring the statistical figures and picking the satisfying narrative instead (maybe associated with 1, 2 and 3). E.g. "class sizes are a determinant of academic results", which favors a popular policy amongst teachers and parents (and politicians).

Expand full comment

The claim that 99% of statistics are calculated in the last 10 years itself needs statistical verification.

Expand full comment

Scott's post probably wasn't a standalone post, but was rather a sequel to "All Medications Are Insignificant In The Eyes Of God And Traditional Effect Size Criteria" from about a week ago. It's not colloquial text that's ultimately desired, it's an interpretation that's ultimately desired. (Reminder: knowing the mathematical definition of x isn't necessarily the same as meaningfully interpreting x.)

https://astralcodexten.substack.com/p/all-medications-are-insignificant

Expand full comment

But men and women differ by much more than a couple of inches (= 5 cm). According to Swedish statistics, the difference is 14 cm (180 cm vs 166 cm). If the difference were only 5 cm, we'd see a lot more overlap.

Expand full comment

The average difference in height between men and women in Sweden might be different than the same measurement in other countries.

Expand full comment

Not so much, it's that "couple" can also mean "a small number of" and not just "two".

Expand full comment

An inch is about a couple of inches

Expand full comment

I eyeballed it recently, I can't remember where, and as I recall the average seemed like 5 inches or a bit more, so more like 12-13 cm. The lowest that I recall were down in the 4 inch range.

Expand full comment

Which reminds me of a time when I asked at a restaurant in Cornwall when it was open for evening meals. It was about 3pm. Conversation went like this.

Me: when are you open?

Him: about a couple of hours.

Me: right. But exactly when.

Him. Couple of hours, like I said.

Me: but exactly when, like I said?

Him: 5pm like I said.

I sheepishly said “ok”.

Expand full comment

We do see a lot of overlap. Consider Dutch women compared with Ecuadoran men.

Expand full comment

I think it's unusual to meet enough Dutch women and Ecuadorian men that one would see a lot of overlap. Probably most people live in communities where the probability of a random man being taller than a random woman is .9 or so.

Expand full comment

IMO, the problem comes from trying to reduce everything to a single number (or small set of numbers). A picture is worth a thousand words. Show a graph of the distribution of grades under teaching method A and under teaching method B (on the same graph). Eyeballing that picture will immediately tell you how much better one method is than another (if at all).

Expand full comment

Not always. Graphs can be hard to read, too.

Expand full comment

And/or framed/constructed deceptively.

It really depends on the data. There are super important relationships that are “invisible” to the “eyeball” method, but some things which are much more obvious with a graph than just the figures.

Expand full comment

I think a lot of it is just about reducing things to the wrong number in some quest for generalizability. "Men commit X times more murders than women" and "SSRIs cause Y% to fall below the threshold of clinical depression" are useful reductions to a single number.

Expand full comment

There's a broken link in the "Children tutored individually learn more than in a classroom: 2.0" line (it looks like a typo; the intended link seems to be https://en.m.wikipedia.org/wiki/Bloom%27s_2_sigma_problem )

Expand full comment

Possibly dumb question, but I was of the impression that effect size was between -1 and 1?

Is there anyone who could give me a quick reminder of how to interpret or how is calculated an effect size that goes up to 2.0?

Expand full comment

You are likely mixing that up with correlation.

Effect size is the size of your effect (eg difference between men and women) divided by the standard deviation (of eg height in the population).

Basically, how much does the intervention rise above the background noise.

Effect size can go arbitrarily large.

For example the effect size of being an astronaut on distance from the ground is likely above 100?

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

For two equal-sized groups, if the standard deviation is based on the whole population, then the effect size would be a maximum of 2, right?

If all men were 70 inches tall, and all women were 64 inches tall, then the mean would be 67, and the SD 3, so the effect would be (70 - 64)/3 = 2.

If the treatment group is smaller, then yes, you can get a large effect size.

When computing standardized effect size for a treatment, is the standard deviation normally that of the untreated population?

Expand full comment

See https://en.wikipedia.org/wiki/Effect_size

Apparently there's no universally agreed way to do it.

For the men/women example, you could take something like the (weighted) average of the standard deviations within the sub population as your denominator.

In more sophisticated terms, that would be the expected standard deviation left in the conditional probability distribution after someone already tells you whether a person is in the treatment or control group.

(Or you could take the bigger of the two standard deviations?)
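A sketch of the toy height example from upthread, showing why the whole-population convention caps out at 2 for equal groups while the usual within-group convention doesn't:

```python
import statistics

# Toy version of the example above: every man exactly 70 inches, every woman
# exactly 64 inches, equal group sizes. Purely illustrative.
men = [70.0] * 1000
women = [64.0] * 1000

diff = statistics.mean(men) - statistics.mean(women)   # 6 inches

sd_whole = statistics.pstdev(men + women)               # SD of the combined population = 3
d_whole = diff / sd_whole                                # 6 / 3 = 2, the cap discussed above
print(d_whole)

# With the more common convention (pooled *within-group* SD), the denominator in this
# degenerate example is 0, so d is unbounded; with realistic within-group spread,
# that convention is what lets effect sizes go well above 2.
```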

Expand full comment

Galton introduced the correlation coefficient as recently as 1888, a couple of centuries after Newton's big book. I've often been struck by how long it took humanity to get interested in statistics compared to the harder subject of physics.

It's not as if people didn't have any data to work with in the past: the Bible mentions at least three censuses and even has a "Book of Numbers." But the urge to nerd out with data seems to be fairly recent: e.g., William Playfair invented most of the main types of statistical graphs in the late 18th Century.

An interesting question is whether humans just innately aren't that interested in or adept at statistical thinking. Or is this more a socially constructed problem which could improve over time?

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

Statistics was initially heavily associated with (being good at) gambling, which traditional cultures were highly disapproving of.

Expand full comment

Where did you get the idea that physics is a harder subject than statistics? Seems to me that physics is much easier, which is why it’s possible for people to even learn something as fundamental as relativity or quantum mechanics. (Those individual things may be harder to learn than normal distributions and p values, but the equivalently fundamental concepts in statistical sciences are still waiting to be discovered.)

Expand full comment

Statistics is easier than physics. The types of stuff and effects people generally study with statistics is harder. That is how I would put it.

Expand full comment

There's not really anything about statistics that the Ancient Greeks couldn't have worked out if they'd decided it would be worth investigating - by contrast, "physics" (study of the material world from stars to atoms) was something they seemed to be very interested in but made limited progress on due to the difficulty in obtaining the necessary data.

Expand full comment
Jun 11, 2023·edited Jun 11, 2023

Would their system of numbers have been a major problem? It seems hard to do much practically useful calculation with probabilities if multiplying fractions is just barely possible.

It's always seemed to me like probabilities, and even more so statistics, require a kind of weird change of mindset. Instead of thinking about things that will happen (X causes Y which causes Z), you end up needing to think about things that may or may not happen, and uncertainty. My very low confidence intuition is that the difficulty of making this kind of leap was likely one reason why probability didn't get figured out earlier. I'm sure another part of it was just social stuff about what problems were considered interesting and what problems were considered low status to think about.

I guess one other difficulty is that if some genius figured out a whole bunch of probability theory, he might've preferred using it to win a lot of money gambling rather than writing a treatise telling everybody else what he was doing. But I don't know how likely this would've been to actually work, rather than just get you stabbed by someone who's somehow sure that you're cheating, even though he doesn't understand how, because you keep winning his money from him.

Expand full comment

How would we actually measure the difference in difficulty? Personally, I would guess that the difference in when they were discovered just has to do with random variation in which very rare geniuses happened to get interested in which problems at different times.

Expand full comment

The kind of thinking that results in being good at physics is arguably downstream from the kind of thinking that makes you good at knapping flint hand axes. The kind of thinking that results in being good at statistics is upstream from the kind of thinking that makes you good at lying to yourself and others for social benefit.

And yes, for the literal bloodyminded in the back, that's a humorous oversimplification for the purpose of making a point.

Expand full comment

Even though I’m generally aware that human attributes come normally distributed, I don’t have a general intuition for the standard deviation of each. For example, how much smarter is a 1 in 100 person than the average person, compared to how much more attractive a 1 in 100 person is compared to the average. Any thoughts on this?

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

In terms of standard deviations 1 in 100 is always about 2.33 SD higher than the average by definition.

I guess what you're really asking is something different, like how many percent taller a 1-in-100 person is compared to the average person, as compared to other traits. However, that doesn't really work for things like IQ or attractiveness that aren't measured on absolute scales.

For example, on a 0-10 scale, a 10 is twice as attractive as the average on that scale, a 5. However, if the scale is -10 to 10, then 10 is infinitely more attractive than the average on that scale, 0. So it depends entirely on the scale.
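For reference, the percentile-to-SD conversion itself is the same no matter what the trait is; a quick sketch:

```python
from statistics import NormalDist

# How many standard deviations above the mean "1 in N" is, for a normal distribution.
for n in [10, 100, 1000, 1_000_000]:
    z = NormalDist().inv_cdf(1 - 1 / n)
    print(f"1 in {n}: about {z:.2f} SD above the mean")
# 1 in 100 comes out to ~2.33 SD, as noted above; 1 in a million is ~4.75 SD.
```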

Expand full comment

Isn’t that for a standard normal distribution of mean 0 and variance 1? What if it’s a normal distribution of SD 100? I’m pretty sure 1/100 will look different in that case

Expand full comment

It's still 2.33 SD above the mean, it's just that 2.33 SD will equal 233 units of whatever you're measuring.

Expand full comment

Ok say attractiveness had a standard deviation of 1 beaut vs 100 beauts. Then I would expect to see a lot more stunning people at random, walking on the streets on the latter, and be more surprised in the former scenario, assuming I am not desensitized to the distribution.

Can you help me visualize how reality changes when an attribute is more volatile than another? Intuitively, there must be something.

Expand full comment

https://www.geogebra.org/m/np7cpzfg lets you visualize two normal distributions.

Expand full comment

"an attribute is more volatile than another"

How could one tell whether attractiveness has more variation than intelligence when they're on relative scales? One could say that weight is more variable than height, because there are tons of people twice the average weight but not people twice the average height, but that only works for absolute scales.

Expand full comment

On some level, this isn’t even really a meaningful question. How much taller than the average building is the Empire State Building compared to how much more famous the Mona Lisa is compared to the average painting?

Expand full comment

Yeah you are asking less about SD than about absolute magnitude. Which is absolutely a question worth asking, but can be hard to get at. You can imagine a situation where the shortest people are 1’ and the tallest 10’ in a normal distribution, and that is a very different situation than one where the tallest is 6’ and the shortest 5’10”.

Additionally a lot of distributions are not normal.

Expand full comment

Right, the difficulty is that a percentile or IQ score tells you where someone falls in the distribution, but not what that means. For an attribute like height, it's easy to figure out what those differences mean, and a one sigma difference is always the same number of inches or centimeters. But for an attribute like intelligence, things don't work so easily. Someone with a -2 sigma intelligence in the modern US is classified as intellectually disabled, and we assume they basically can't take care of themselves or handle an ordinary job. It's not at all clear the same kind of difference exists between someone with an average and +2 sigma intelligence, or between +2 sigma and +4 sigma.

Expand full comment

i feel like “5 points on the HAM-D” would be easier to interpret

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

Based on Scott's earlier description of the HAM-D, that is certainly not the case. It's a scale with more than 50 points, of which the first 5 comprise the ordinary non-depressed population and the remaining 40-plus points are devoted to fine distinctions in exactly how depressed you are.

There is no guarantee that reducing HAM-D score from 35 (severely depressed) to 30 (still severely depressed) represents the same amount of improvement as reducing HAM-D score from 9 (depressed, but at least not "severely depressed") to 4 (normal person).

Expand full comment

Similar to Mantic Monday, it would be useful to get a Fact Friday or something like that with a list similar to the one at the end of this post. It would help develop everyone's intuition.

Expand full comment

Median effect size in psychology papers is r = 0.36, but r = 0.16 if the study is pre-registered. Either we need to change public understanding of what counts as a good correlation or we need to admit that academics are too interested in things that don't matter.

https://www.frontiersin.org/articles/10.3389/fpsyg.2019.00813/full

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

I've had an understanding that it's commonly accepted as a fact that academics in general are too interested in things that don't matter. What turns out to be important later on represents a minuscule part of all academic output (pulling this out of my ass).

Expand full comment

One could just as well say that this is evidence that academics aren’t interested *enough* in things that don’t matter, because it turns out that the big discoveries all start as interest in things that don’t matter.

Expand full comment

Yes, I agree - the first part of my comment was intended partly as a joke on the quintessential absent-minded professor.

Expand full comment

I'm an academic, and I'm interested in your comment.

Expand full comment

Another potentially misleading thing is what you control for. Like if you say the "correlation between life outcomes and education is low" that sounds very significant until you add "controlling for income", as income mediates most of the effect you're interested in. (That's a made up example so may not reflect the real stats).

Expand full comment

Idea: for each quantity of interest, familiarize some people with the nature of the quantity, and have them all come up with a proposal for what the smallest difference is that they would consider meaningful. Call this a "Small Meaningful Difference".

Then instead of reporting effect sizes in terms of variance-standardized numbers, report them in terms of SMDs.
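A minimal sketch of what that reporting convention could look like, assuming some invented SMD values:

```python
# Hypothetical "Smallest Meaningful Difference" for each outcome, in the
# outcome's own raw units, as elicited from people familiar with the domain.
# These values are invented placeholders, not real elicited SMDs.
SMD = {
    "HAM-D points": 3.0,
    "SAT points": 30.0,
}

def report_in_smds(outcome: str, raw_effect: float) -> str:
    """Express a raw effect as a multiple of the outcome's SMD."""
    return f"{raw_effect / SMD[outcome]:.1f} SMDs on {outcome}"

print(report_in_smds("HAM-D points", 5.0))   # "1.7 SMDs on HAM-D points"
print(report_in_smds("SAT points", 45.0))    # "1.5 SMDs on SAT points"
```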

Expand full comment

My education is in engineering and physics. I have zero idea what these numbers mean. Maybe if you showed me some graphs that would be better.

Expand full comment

Scott, correlation does not equal identity. Just because students who are good at reading are also good at math, it doesn't follow that there is a "general" intelligence. There could be multiple intelligences that correlate.

I have a hard time understanding how you can be so bought-into the g factor when you yourself are 1-in-a-million talented at writing while worse than most of your peers at math. How does a "general" intelligence explain this?

Just let go of general intelligence and simply say that different talents correlate very strongly (without being literally the same thing as each other).

Expand full comment

What makes you think Scott or anyone would argue with

>different talents correlate very strongly

That’s a pretty bland uninteresting claim. And the fact remains for day-to-day life there is something people mean when they say someone is “intelligent”, and most everyone who isn’t pursuing some political project knows exactly what that means. Even if it is a bit messier in the gory details.

Expand full comment

What it means is context-dependent. But sure, since different abilities correlate, it is ok for many purposes to round it off to "intelligence".

What's not OK is to pretend "general intelligence" is a thing that exists any more than "physical fitness" exists. Some people are more physically fit than others! And all sports-abilities correlate. But it does not follow that talent at running is the same thing as talent at weightlifting. "Physical fitness" is an imperfect shorthand that's fine for many purposes but shouldn't be reified.

Expand full comment

I don’t think we disagree much, I just think there is a lot of pushback in this community (for good reason) for the “different intelligences” nonsense. Jimmy who can’t add three plus three but is great at baseball isn’t “just as intelligent but in different ways” from Sally the chess team captain. This is true in the vast vast vast majority of cases.

It’s not that Michael Jordan and I are equally athletic and I am simply much more elite at the “sitting in an office chair like a blob” form of athleticism. Michael Jordan is just more athletic than me. That’s ok.

Expand full comment

What point are you making exactly? You say it's not ok to pretend that physical fitness exists, and then in the next sentence say that some people are more physically fit. Huh?

>But it does not follow that talent at running is the same thing as talent at weightlifting.

Agreed. But do you think it's somehow wrong to look at, say, Lebron James and say "hey, that guy's really athletic" and to therefore predict that he'd probably be good at most physical activities?

Expand full comment

So long as you only use "intelligence" in the same way you'd use "fitness", I have no gripe with you.

But this means: no debating whether there's truly a "general factor of intelligence" as a scientific matter (nobody does this for fitness). No saying "this particular intellectual task is not very g-loaded" (nobody does this for fitness). No saying "all IQ tests increased over time in that country, but the method of correlated vectors says the increases negatively correlated with g-loading, so the increase wasn't *true* intelligence" (nobody does this for fitness).

Expand full comment

Why should that be prohibited?

First of all, I'm not sure nobody does that for fitness. All of the Armed Services have physical fitness entrance tests - I'm sure the effectiveness of those tests has been studied. And I'm *positive* that professional sports teams have analyzed how best to predict athletic performance - I would bet dollars to donuts that more than one NBA team has a team of data scientists creating Athletic Performance Models that rival anything in psychometrics for complexity.

And if it's less-studied than IQ it's probably because a) intelligence is more interesting and b) intellectual output is far more valuable than physical output.

Why are you so opposed to accepting the notion that intelligence a) exists b) is important and c) varies considerably between individuals? There is a veritable mountain of both empirical data and conventional wisdom supporting all of those things.

Expand full comment

You can indeed predict athletic performance. But what predicts NBA performance is not the same combination of fitness measurements as what predicts football performance or baseball performance! There is more than one talent, even if they all correlate!

None of your examples for fitness are anywhere near the g factor nonsense that's done for IQ (mentioned in my previous comment). Maybe you don't understand the statistical claims made about the g factor? I promise you there are absolutely NO analogous claims made about a general factor of fitness. Nobody says fitness is secretly one measure, that basketball talent and baseball talent are literally the same thing.

To your question: I believe math talent exists, is important, and varies considerably between individuals. I believe writing talent exists, is important, and varies considerably between individuals. I also believe that math talent and writing talent positively correlate, so "intelligence" can be a useful shorthand.

Why are *you* so opposed to the notion that there's more than one possible talent? Why must math talent be literally the same thing as writing talent? It's clearly not! There's more than one intelligence, and both empirical data and conventional wisdom support this!

(Yes, the different intelligences positively correlate. No, that does not mean they're all secretly the same. Correlation does not equal identity!)

Expand full comment

> What's not OK is to pretend "general intelligence" is a thing that exists any more than "physical fitness" exists.

Speaking as a couch potato, I fully expect that a fit person will be able to beat me at just about any sport I have ever done with minimal training. I would assume that "physical fitness" is sufficient to explain most of the outcome differences for a PE class. By contrast, I guess that within the population of pro soccer players, fitness is much less indicative, because all of them will be in the top fitness percentiles of the general population.

The best advice for me, if I wanted to improve in any sport (except chess or mini-golf), would be to get more fit. Unfortunately, increasing one's general intelligence is harder than getting physically fit, so telling a student that their under-performance in math, languages and science is probably correlated is less helpful.

Expand full comment

no such thing as temperature either. it's just vibrations that correlate.

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

If most objects day to day had one hot corner and one cold corner, with the variations important enough that the coldest object you could find had a pretty warm part to it and the hottest had some cold parts -- then I would indeed say "the object's overall temperature" is a simplistic way of thinking about the situation.

In this case, the best-at-math person you know is probably not great at writing, and the best-at-writing person around (Scott Alexander, for example) is bad at math. The abilities correlate, sure, but correlation is not identity: just because two things correlate doesn't make them the same thing.

Expand full comment

This really isn’t how it is though. The best at math person you know tends to be significantly above average at writing. Like very significantly.

Expand full comment

Literally the best person at writing on this forum is Scott Alexander (who has one of the top blogs on the internet), and he literally says he sucks at math and got a "C" in high school calculus despite putting in a lot of effort.

It would be one thing if the top writers were math geniuses and vice versa, but we live in a world in which that's just not true. The top writers are above average, I guess, at math; and mathematicians write above average too; but they're both pretty pathetic compared to what you might imagine if "general intelligence" was just one thing.

Expand full comment

You're focusing on the exceptions, and not the pattern.

Expand full comment

Even if so, don't the exceptions prove that "general intelligence" is not a single thing? If it were a single thing -- if smartness was literally ONE THING, which governed all mental talents -- then exceptions wouldn't be possible: nobody could be world-class at writing while being mediocre at math (what's their "general intelligence"?)

Expand full comment

Some things that correlate with IQ: Academic performance in primary education, Educational attainment, Job performance (supervisory rating), Occupational attainment, Job performance (work sample), Skill acquisition in work training, Degree attainment speed in graduate school, Group leadership success (group productivity), Promotions at work, Interview success (interviewer rating of applicant), Reading performance among problem children, Becoming a leader in group, Academic performance in secondary education, Academic performance in tertiary education, Income, Having anorexia nervosa, Research productivity in graduate school, Participation in group activities, Group leadership success (group member rating), Creativity, Popularity among group members, Happiness, Procrastination (needless delay of action), Changing jobs, Physical attractiveness, Recidivism (repeated criminal behavior), Number of children, Traffic accident involvement, Conformity to persuasion, Communication anxiety, Having schizophrenia.

Even skill with firearms (from handguns to cannons) correlates with IQ.

At what point is the list long enough that we can talk about "general" intelligence?

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

Correlation is not identity. If two things correlate, it does not make them equal.

Some other things that correlate with each other: longevity, wealth, happiness, height (stature), physical strength, all measures of IQ, social skills, conscientiousness, beauty.

Shall we declare a "general factor of being awesome"? At what point does the list become long enough that we can start pretending longevity, wealth, height, and strength are all secretly the same thing?
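To put a number on the "correlate strongly but aren't identical" point, here's a toy simulation (all numbers invented): traits that share a latent factor correlate at around 0.5, and the first principal component explains a large chunk of the variance, yet each trait keeps plenty of variance of its own.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Toy model: each of 5 "talents" = shared latent factor + its own specific part.
g = rng.standard_normal(n)
talents = np.column_stack(
    [0.7 * g + 0.7 * rng.standard_normal(n) for _ in range(5)]
)

corr = np.corrcoef(talents, rowvar=False)
print("typical pairwise correlation ~", round(corr[0, 1], 2))   # around 0.5

# Share of total variance captured by the first principal component.
eigvals = np.linalg.eigvalsh(corr)[::-1]
print("variance explained by 1st PC ~", round(eigvals[0] / eigvals.sum(), 2))
```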

Expand full comment

"And the second effect might sound immense, but it is only r = 0.64." This result appears to be limited to the USA - where New Orleans is a 29 degrees north and Duluth is 46 degrees north. So there is a real range restriction issue.

Expand full comment

I smell a rat in your countries-near-the-equator vs. country-hotness correlation. Not nearly high enough.

Chasing the reference leads to:

9. Nearness to the equator and daily temperature in the U.S.A. (National Oceanic and Atmospheric Administration, 1999; data reflect the average of the daily correlations for latitude with maximum temperature and latitude with minimum temperature across 187 U.S. recording stations for the time period from January 1, 1970, to December 31, 1996).

Which is not quite the same thing! That's states on a continental landmass, and it doesn't include the extremes of 'on the equator' or 'near the poles'.

Expand full comment
Jun 8, 2023·edited Jun 8, 2023

Any figures for variance need to consider how many other things are allowed to vary. In a world where everyone was raised in a 100% identical environment, IQ would explain 100% of the variance; in a world where everyone had identical IQ, IQ would explain none of the variance.

Explaining variance almost never explains what you really want to know.

Expand full comment

"Any figures for variance need to consider how many other things are allowed to vary. In a world where everyone was raised in a 100% identical environment, IQ would explain 100% of the variance;" 100% of the variance in *what*? Certainly would not explain 100% of the variance in weight, social success with peers, athletic prowess or even things correlated with intelligence like success in getting hired, number of papers published etc.

Expand full comment

> only as big, in terms of effect size, as the average height difference between men and women - just a couple of inches

The average height difference between men and women is more like five inches. With a standard deviation of about three inches, that's d ≈ 1.7.

Expand full comment

>Children tutored individually learn more than in a classroom: 2.0

This is not actually true. Peer tutoring (students teaching each other) is about d = 0.4:

https://www.sciencedirect.com/science/article/pii/S2405844019361511

Expert tutoring was about d = 0.8, but it looks like AI tutoring is about the same, and that study is from 2011. So GPT4 tutoring will surely be better, and we basically don't need teachers much anymore for most stuff. Humans need not apply.

https://www.tandfonline.com/doi/abs/10.1080/00461520.2011.611369

Expand full comment

Why would GPT4 tutoring make any difference? The students will still have an individual genetic endowment which will predispose them to a given performance band, and students remain in their performance band across lifespan to a remarkable degree, even in the face of major interventions. Educational technology has failed to move the needle for decades for precisely this reason - it doesn't have any impact on where the difference happens, which is in baseline cognitive ability. So what problem will AI tutoring solve?

Expand full comment

My point being that 2011 AI tutoring was 0.8 d, and GPT4 >> 2011 AI, so it must be more effective. GPT4 is pretty good at tutoring, it can solve pretty much any coding issue you get while learning. You can also just ask it to teach you German or French grammar.

AI tutoring solves the price and availability of tutoring, but not the lack of desire to learn.

Expand full comment

I suspect there are serious diminishing returns to quality of tutoring. Even if GPT4 tutor is 100x smarter than 2011 AI, there may not be much improvement in effect size.

Expand full comment

"We can’t express how much less depressed someone gets in comprehensible units - it will either be stuff like this or “5 points on the HAM-D”, which is even more arcane and harder for laymen to interpret."

Is there really no useful equivalent of mean/median/mode? No way we can just indicate "What we call depression isn't fundamentally just one thing, and people differ in ways we can't currently predict in advance, even though symptoms cluster. So, responses are highly variable, and most people won't respond, but if this helps you at all, it will help about this much"? Do our institutions not have the ability in principle to report and interpret an ordered triplet of numbers?

By the time I was in 5th grade my math tests asked questions like, "Which measure of central tendency best describes this data?" And then the data would be a list like "1,1,1,1,1,2,50" or "1,2,5,6,7,7,8,8,12." I realize that real-world data and its use are far more complicated, and there's value in having a common metric across many data sets, but still, this is the best we've got?

Expand full comment

The only context most anybody needs for anything statistical or involving studies is 'how much does this agree with what I already think'.

Expand full comment

>The correlation between reading scores and math scores is lower than the correlation between which college majors have higher IQ, and which college majors have skewed gender balance.

Am I the only one that experiences this as a wrenching, mind-jarring typo?

I don't even know if it's grammatically incorrect technically, but it feels unpleasant to read.

Expand full comment

I think there might be an omitted comparative between "have" & "skewed".

Expand full comment

The infinite boundary between mathematics and psychology.

On the correlation of 0.59 between IQ and grade, this actually says nothing about the effect of IQ on grades since grade is a variable that is highly subject to the bias inherent in the method of assessing it. If the assessment method is heavily recall-based, then the correlation with IQ would necessarily be moderate. But if the assessment method is heavily logic-based or critical-thinking-oriented, then the correlation would be very high. Hence, I would expect IQ/Grade correlation to be very high in STEM subjects, moderate in Arts, and low in social sciences.

Expand full comment

>the infinite boundary between mathematics and psychology.

I refer to it as the boundary between reason and motion, but I think we're on the same page. I'm not qualified to address the rest of your post. "Recall-based" makes me think you mean relying on people's self-reporting, and you don't have to be a scientist to know that that is a rabbit hole.

Expand full comment

I'm not sure it matters, but nothing is sourced, so it's all anecdotal. It's like astrology from Parade magazine, or Harper's fractured anecdotes. AI could have written it.

Expand full comment

Every underlined number leads to a linked article for me, so I don't understand your comment.

Expand full comment

> The correlation between reading scores and math scores is lower than the correlation between which college majors have higher IQ, and which college majors have skewed gender balance.

Scott, this has been commented on, but I'm curious what you even mean here. Reading scores and math scores are both continuous variables that have a total order. It is obvious what correlation is in this case, just Pearson's coefficient.

But what is a correlation between an unordered nominal variable like "college major" and continuous variables like IQ and gender balance? There is a good discussion of this topic here: https://stats.stackexchange.com/questions/119835/correlation-between-a-nominal-iv-and-a-continuous-dv-variable#124618. There are possible approaches, but it's not clear how you would meaningfully compare these in terms of predictive strength to something more straightforward like Pearson's coefficient.

Expand full comment

Write down the proportion of male students taking a given major and the avg IQ of all students taking that major. Then measure the correlation between the gender column and the IQ column across all majors. You’ll see that the higher the percentage of males, the higher the paired IQ value. Scott should have referred to this as “correlation between gender and IQ, paired by major”.
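Concretely, the calculation looks something like this (the majors and numbers are invented, just to show the pairing):

```python
import numpy as np

# Each data point is one major: (proportion male, average IQ). Invented numbers.
majors = {
    "Major A": (0.80, 112),
    "Major B": (0.65, 108),
    "Major C": (0.50, 105),
    "Major D": (0.35, 103),
    "Major E": (0.20, 101),
}

pct_male = [v[0] for v in majors.values()]
avg_iq = [v[1] for v in majors.values()]

# One correlation, computed across majors.
r = np.corrcoef(pct_male, avg_iq)[0, 1]
print(f"correlation between % male and average IQ, paired by major: {r:.2f}")
```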

Expand full comment

It probably reads better if he just removes the comma before the "and" but, okay, correlation between percentage of males and average IQ where the data points are college majors makes sense. The way it was phrased originally made me think he was talking about two different correlations.

Expand full comment

One general strategy in comparing models in machine learning is to always have a baseline (that you're trying to beat on some performance benchmark) and, depending on the problem, an "oracle", which represents a meaningful upper bound (typically human performance, or a model that cheats by having access to additional information that would not realistically be available to a model used in the real world). The key is to put predictive performance into meaningful context by comparing with end points in the same problem space, rather than cherry-picking performance on other data sets (which can be useful in some contexts, but is unhelpful if the two distributions are completely unrelated, e.g., SSRI effect sizes vs. male-female height differences).

I think in the case of the depression example, laypeople won't typically understand "2 points on the HAM-D," but will understand "15% of the difference between the means of the healthy/typical and depressed populations". Generally, making comparisons within the same problem space in this way is also constructive: it naturally points at "here's where we are vs. where we were vs. where we want to go" in a way that comparing totally unrelated distributions won't.
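The "fraction of the baseline-to-oracle gap closed" framing is easy to compute; here's a sketch with invented HAM-D-style numbers:

```python
def gap_closed(model_score: float, baseline: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap closed by the model/intervention."""
    return (model_score - baseline) / (oracle - baseline)

# Illustrative only: untreated mean score 25 (baseline), healthy mean 5 (oracle),
# treated mean 22 after the intervention. All numbers invented.
print(f"{gap_closed(22, baseline=25, oracle=5):.0%} of the gap closed")  # 15%
```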

Expand full comment

Cognitive dissonance

Generally, this means refusing to accept as true something one knows to be true, or being unwilling to deal with the discomfort of admitting one doesn't have a clue. The latter discomfort seems to invariably lead to cherry-picking performance on other datasets.

Expand full comment

I think scatterplots are a good way to get a feel for what a correlation is really telling you. Here's a little graphic showing a bunch of correlations and scatterplots of the data that produced them: https://imgur.com/JuDikNA

It's from here: https://www.westga.edu/academics/research/vrc/assets/docs/scatterplots_and_correlation_notes.pdf
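If you want to generate your own versions, here's a minimal numpy/matplotlib sketch that draws samples at a few target correlations:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fig, axes = plt.subplots(1, 4, figsize=(12, 3))

# Draw bivariate normal samples with a few target correlations and plot them.
for ax, r in zip(axes, [0.1, 0.3, 0.6, 0.9]):
    cov = [[1, r], [r, 1]]
    x, y = rng.multivariate_normal([0, 0], cov, size=500).T
    ax.scatter(x, y, s=5)
    ax.set_title(f"r = {r}")

plt.tight_layout()
plt.show()
```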

Expand full comment

Should these be reported without a confidence interval or standard deviation? For a lot of the more controversial things, what people usually doubt isn't whether the effect is positive but whether the effect is reliable.

Expand full comment

Finally, a chance to share a funny anecdote that involves myself!

So I took the SAT twice because I applied to college early in Junior year of high school, but failed to get in. And I had to retake SAT the next year for the "real" college applications.

I used that one sketchy online table for converting one's SAT score(s?) to IQ score.

By taking the SAT twice, I """improved my IQ""" by about 10 points in 1 year.

Expand full comment

Great, super helpful. Thanks.

Expand full comment

I would have thought that comparing men and women's height would be the most familiar and thus best-all-around comparison, but on further thought I see three problems:

- Up until the end of high school, the sex difference in height isn't stable, so it can be confusing for, say, 10th graders taking stats.

- Among adults, women often wear heels in public, so many people might underestimate the gap.

- Lots of people these days will consider mentioning the difference to be sexist.

Are there any other go-to comparisons that don't have problems like these?

Expand full comment

"Children tutored individually learn more than in a classroom" I thought the consensus was that while Bloom did find d = 2.0 in his original studies, later studies with larger sample size and better methodology got closer to d = 0.5 - 0.8 for tutoring, which is still amazing for an educational intervention but not quite as extreme?

A confounding variable here is that Bloom was not just measuring "human tutoring" but "human tutoring with the Mastery Learning method". Hattie, which you also linked, has "Mastery Learning" at d = 0.57 even without the one-on-one tutoring part.

A good starting point for this is https://nintil.com/bloom-sigma/ which also links to VanLehn (Ed. Psych. 2011) https://www.public.asu.edu/%7Ekvanlehn/Stringent/PDF/EffectivenessOfTutoring_Vanlehn.pdf which discusses the two sigma effect from p. 210. Key quote: "At any rate, the 1.95 effect sizes of both the Anania study and first Evens and Michael study [on which Bloom's analysis is based] were much higher than any other study of human tutoring versus no tutoring. The next highest effect size was 0.82. In short, it seems that human tutoring is not usually 2 sigmas more effective than classroom instruction, as the six studies presented by Bloom (1984) invited us to believe. Instead, it is closer to the mean effect size found here, 0.79. This is still a large effect size, of course."

And " ... Bloom’s 2 sigma article now appears to be a demonstration of the power of mastery learning rather than human tutoring ..."

Expand full comment
Jun 9, 2023·edited Jun 9, 2023

Another educational point to beware of from the Hattie analysis linked in the post: "Whole-Language Instruction" has a positive d = 0.06. As "Sold a Story" explains, the Whole-Language people came up with a method, did studies, found statistical significance (I will be charitable and assume no p-hacking) and a positive effect size, and then went around promoting this as the Scientific Way (TM) to help disadvantaged kids to read.

The problem is that Phonics, which the Whole-Language people explicitly were out to replace, has d = 0.70.

Standardised effect sizes may be hard to derive meaning from in general, but "Phonics >>> Whole Language" for kids who don't pick up reading "naturally" seems to be established beyond reasonable doubt, and effect sizes are part of the evidence for that.

Expand full comment

I get sent Bloom's two sigma paper approximately once a month. Unsurprisingly, it fails to replicate, although the literature is a mess. https://nintil.com/bloom-sigma/

Expand full comment

Isn't 0.8 still a huge deal?

Expand full comment

Can you add some effect sizes for things we are pretty sure deterministically almost always work, to give an idea for how big the scale is supposed to go? Like, effect size for getting shot in the head to be lethal, or for a fresh-out-of-the-factory battery to cause an electric current in a circuit.

Expand full comment

A little late to the party, but this 2011 meta-review finds that the effect size of individual tutoring over classroom learning is closer to 0.8 than 2.0, and that Bloom's famous 2-sigma studies were outliers likely caused by factors other than the effect of one-on-one tutoring.

https://www.public.asu.edu/~kvanlehn/Stringent/PDF/EffectivenessOfTutoring_Vanlehn.pdf

Expand full comment

Political liberalism vs. concern about COVID: 0.33 surprised me. Then I realised how being really shouty on social media can create a false impression around entire groups.

Expand full comment

Let's say you have made a new teaching app and you try testing it in two ways. Method A: you get the students to use the app and test them immediately (while they are still in the psych lab). Method B: you test the participants by looking at the improvement in their exam scores at the end of the semester. Even with zero drop-outs and the same questions on the immediate test and the exam, you would expect Method B to have a much lower effect size. When comparing effect sizes in education I often run into this problem: if, rather than two methods of studying the same thing, these were two separate studies, should I go with teaching intervention 1, which uses Method A and has an effect size of 0.5, or teaching intervention 2, which uses Method B and has an effect size of 0.2? From this data I have almost no usable information about which one to pick.

That said, I find them a very useful metric.
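Here's a toy simulation of that attenuation (all numbers invented): the app adds the same raw gain either way, but the end-of-semester exam carries much more unrelated variance, so the standardized effect size shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
true_gain = 5.0                      # raw points added by the app (invented)

def cohens_d(treated, control):
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

# Method A: immediate test in the lab, little extra noise.
control_a = rng.normal(50, 8, n)
treated_a = rng.normal(50 + true_gain, 8, n)

# Method B: end-of-semester exam, lots of unrelated variation piled on top.
control_b = rng.normal(50, 20, n)
treated_b = rng.normal(50 + true_gain, 20, n)

print("immediate test d:", round(cohens_d(treated_a, control_a), 2))  # ~0.6
print("semester exam d:", round(cohens_d(treated_b, control_b), 2))   # ~0.25
```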

Expand full comment

StrongerByScience released a post that's an instance of this - the researchers said "tiny effect size of using creatine for muscle growth," but the research really says "makes your workouts 1/3 more effective," which is pretty huge! https://www.strongerbyscience.com/creatine-effect-size/

Expand full comment