The Subjects Under Question (courtesy Bob)
For our very first experiment we asked our IGORs to tackle a fairly simple question: can tasters detect a difference between the same wort fermented with the classics Wyeast 1056 American Ale (née Chico) and White Labs WLP001 California Ale? See the link above for the full writeup on the parameters of the experiment.
Here are the basics - IGORs brewed and split a batch of our Magnum Blonde ale, chilled it and then pitched one part with a pack of Wyeast 1056 and the other with a vial/pack of WLP001. We asked the IGORs to grab yeast samples of roughly the same manufacture date and to pitch without making starters to reduce possible variations. (Thoughts on that towards the end!) After fermentation, the IGORs were instructed to package the beers in the same manner and run a basic triangle test to determine if tasters could reliably detect the different beer. We gave no instruction on whether the odd sample in each set should be the Wyeast or the White Labs beer.
Seven IGORs conducted the experiment in time for our recap episode and this report. We'd like to thank Andy Turlington, Bob_In_So_CA, Casey Price, Jason Click, Jason Mundy, Nicki Forster and The Mossy Owl for tackling this effort! (You can always see how many experiments people have participated in here).
Mike's Brew Ingredients (notice the small amount of hops there!) (courtesy of Mike O'Toole)
What's the story with the Magnum Blonde? It's one of my favorite beers. That's right, the guy known for putting clams in a beer really loves stupidly simple beer. In this case, the recipe was originally called "California Magnum" because I used it to test Great Western's California State Select 2 Row Malt. It's tasty and super cheap to make so you don't have to turn to those tall boys of PBR! Anyhoo.. back to what should be a really easy brew day! For proof, Andy Turlington wrote up his brew day right here, so go check it out - http://gallowspolebrewing.com/igor-smash-blonde-ale/
This is what it's all about, right? (courtesy Mike O'Toole)
Looking through the brewing notes from our reporters, nothing seems amiss about their brew days. Every brewer who reported gravities hit an original gravity pretty much dead on the target of 1.047. Every batch also reported terminal gravities pretty much in line between strains. In other words, each brewer's Wyeast 1056 batch fermented to the same terminal gravity as the White Labs WLP001 batch (or within a point).
The Gravity of Jason's Situation (courtesy Jason)
Interestingly, the range of final gravities across brewers was pretty broad. Of those that were reported, we had one batch come in at 1.012 on the high side (The Mossy Owl), with the low side repped by a pretty dry 1.003 (Jason Click).
Bob's Fermenting Buckets - neatly labelled - sure beats my labelling methods! (courtesy Bob)
Fermenters Under Way (courtesy Jason)
Mike's Fermenters Under Way (courtesy Mike O'Toole)
When they reported on it, IGORs pretty consistently said the Wyeast 1056 batches started showing krausen faster than the WLP001 batches. Otherwise, everything looked and acted the same, though one tester (Bob) did report that the WLP001 batch threw a larger krausen. Could the Wyeast "speed" be from that last minute smack giving the Wyeast cells a bit of a leg up?
Basically, with a straight pitch of White Labs without a starter, you're completely at the mercy of your yeast's viability in the storage medium. With the smack pack, you get a boost of yeast vitality - aka the yeast are primed and ready for fermentation. It could be that Wyeast has an advantage here, but that's probably negated by the practice of making a starter. You can read up on Ray Found's experiment about short term starters designed to maximize vitality at Brulosophy.com. You should know that neither Denny nor I perceive a value in the fine macho art of treating your lag times like they're quarter mile launch times. In our experience, it feels like effort without effect.
And Now For Nicki's Musical Interlude
Our testers then packaged their beers up in a mix of bottles (with corn sugar) and kegs. Then they got down to the hard business of tasting beer!
Side by Side Samples (note - not how they were poured for the tasters) (courtesy Jason)
Here's where we really think our farming out process works like a charm. Denny and I could do these experiments ourselves, but we'd only be getting the one data point. And since we're sloppy process controllers (even Denny and the average "uptight" homebrewer are sloppy in comparison to an honest science experiment), having multiple teams tackle the project can help smooth out some of the experimental wrinkles that might creep in. After all, we can't all screw this up in the same way! (or can we - maybe we can - sounds like a challenge!)
Our seven IGORs ran a total of 12 tasting sessions. The smallest group had 5 tasters, which is our minimum for these crowd sourced experiments. The largest panels had 15 and 16 participants, which feels like a great time. In all, the panels averaged out to 6.25 tasters. (There were a total of 75 tasters.)
A number of IGORs took advantage of their local homebrew club to serve as a source of tasters. We love it and think that's a great thing. You can fully expect to see members of the Maltose Falcons and the Cascade Brewers Society in the mix for some of our future experiments! The experience level of the tasters was reported as a healthy mix of experienced brewers and beer geeks along with the beer curious. We asked testers to keep the question in question under wraps - but naturally there are people who listen to the podcast that know what's going on. Expect that to be a question of much debate in the near future!
So how did we do... Well first..
Outliers - A Matter of Science
People tend to think of science as a machine. Execute an experiment, get results, feed results into an algorithm, spit out the answer - voila - 42. But... we are human beings and humans share one fantastic super power - the ability to mess things up. Things go wrong - something doesn't ferment right, we don't hit the calendar time right, someone does something to make the tasting go all pear shaped.
Scientists have debated for years, because scientists have been screwing up for years, what to do when something goes wrong or the data you collect is so far out of whack that it makes absolutely zero sense. From a "pure" perspective - shouldn't the data get blended in? After all, the universe did provide it to you and it could possibly be valid. From a "practical" perspective - a mis-execution in an experiment means you're no longer testing your question (e.g. "Do Wyeast 1056 and WLP001 produce detectably different beers?"), you're testing a new one ("Can tasters detect a flawed beer?") and therefore your data isn't for the correct question.
Naturally, this is a very sensitive question. Get too cavalier with tossing results that don't meet your expectations and you're not really looking for answers. You're looking for a way to confirm your beliefs. This means you're not performing science, you're performing politics! Get too stringent with including all the results and you run the risk of getting the right answer to the wrong question, or at least muddying the waters.
There are accepted methodologies to reject "outlier" data. The first two I can think of are Chauvenet's Criterion and Peirce's Criterion. Both use standard deviations and statistical analysis to provide firm mathematical underpinnings for rejecting data. For tasting results like ours, there are iterative tests like Grubbs' that can help as well. (To see Grubbs in Action) For this experiment though, I don't think we need to worry about the math because it's pretty clear one test had a misfire.
The batch produced by Andy Turlington, who was brewing in a hurry, developed a rather noticeable phenol character in the WLP001 portion. When presented to tasters, all 11 correctly picked out the different beer. In this particular case, that seems pretty clearly a non-standard test result, and Denny, Marshall and I all agree that the tasting panel results are answering the wrong question. Bummer - it happens - and we know how to deal with it. Don't worry, we will always be good little scientists and reveal when we decide to strike data from the record, and if we have the time, we'll show you the results with and without the outlier data.
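For the curious, here's roughly what the math would look like if we did lean on it. This is a minimal Python sketch of Chauvenet's Criterion, not something we actually ran to make the call. It rests on a big assumption of ours: that each panel's fraction of correct picks (taken from the numeric results table in this report) can be treated as one roughly normal data point, even though the panels differ in size. The function name `chauvenet_outliers` is made up for illustration.

```python
import math
import statistics

# (tasters, correct) per panel, copied from the numeric results table.
panels = {
    "Jason Click": (8, 2),
    "Andy Turlington": (11, 11),
    "Casey Price": (5, 3),
    "Nicki Forster": (10, 4),
    "The Mossy Owl": (10, 5),
    "Jason Mundy": (16, 9),
    "Bob In So Cal": (15, 6),
}


def chauvenet_outliers(values):
    """Return the names whose value fails Chauvenet's Criterion.

    A point is rejected when N * P(|Z| >= z) < 0.5, where z is the point's
    distance from the mean in sample standard deviations and N is the
    number of points.
    """
    xs = list(values.values())
    n = len(xs)
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)  # sample standard deviation
    flagged = []
    for name, x in values.items():
        z = abs(x - mean) / sd
        two_tailed = math.erfc(z / math.sqrt(2))  # P(|Z| >= z), normal curve
        if n * two_tailed < 0.5:
            flagged.append(name)
    return flagged


# Assumption: one data point per panel = fraction of tasters who were correct.
fractions = {name: correct / tasters for name, (tasters, correct) in panels.items()}
flagged = chauvenet_outliers(fractions)
print(flagged)
```

Under those assumptions the only panel flagged is Andy's, which lines up with the judgment call we made by eye. With only seven data points this is a cross-check at best, not proof.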
Executive Summary
Ok, here's what you really want, you info junkies - what did our tasters find? Can testers reliably detect that one beer is done with 1056 and one with WLP001? Crunch the numbers on our 64 tasters given the non-anomalous samples and we find that 29 of them correctly identified the odd sample. In other words, 45% of the time, a taster could correctly choose the beer made with the different yeast. This is right over the line of what the p-value calculation would tell you is significant (28 out of 64). Compared to the expectations of random chance (i.e. 33%), that seems pretty interesting! Looking at the calculated p-value, we get a value of 0.021, well below the normal threshold of 0.05 to be considered significant. (When you use the anomalous results, that drops even further to 0.000001 thanks to the pool being 40 out of 75 or 53%.)
For openness about the numbers, we're taking our cue from our good friends over at Brulosophy.com and using a single tailed t-test function. Just to keep everything on a level playing field, we're using the same calculator as well. The calculator was provided by Justin Angevaare and can be found here.
In other words, side by side - tasters were reliably able to tell which beer was different, but does that mean they could tell which beer was 1056 or WLP001? Could they all agree on common differences or just "hey, these are different!"
The Details
Here's a listing of the results we see from our individual panels. We've included the thoughts and observations of both the successful tasters and the experimenters. Let's see what they say! (N.B. As number nerds will tell you - the actual magnitude of the p-value difference from 0.05 is, in theory, meaningless.)
Tasting Panel Numeric Data
|IGOR||Tasters||Correct||p-value|
|Jason Click||8||2||0.691 (NOT significant)|
|Andy Turlington||11||11||0.00 (VERY significant - but also flawed - see above)|
|Casey Price||5||3||0.103 (NOT significant)|
|Nicki Forster||10||4||0.327 (NOT significant)|
|The Mossy Owl||10||5||0.132 (NOT significant)|
|Jason Mundy||16||9||0.026 (significant)|
|Bob In So Cal||15||6||0.292 (NOT significant)|
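If you want to check our arithmetic at home, here's a minimal Python sketch. It is not the calculator we used. It assumes a one-tailed normal approximation to the binomial, a 1-in-3 chance of picking the odd beer by guessing, and no continuity correction. The function name `triangle_p_value` is ours. Under those assumptions it lands on the same rounded figures as the table.

```python
import math


def triangle_p_value(tasters, correct, chance=1 / 3):
    """One-tailed p-value for a triangle test via the normal approximation.

    Assumes the number of correct picks is roughly normal with mean n*p and
    variance n*p*(1-p); no continuity correction is applied.
    """
    expected = tasters * chance
    sd = math.sqrt(tasters * chance * (1 - chance))
    z = (correct - expected) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)


# (IGOR, tasters, correct) copied from the table, minus the flawed panel.
panels = [
    ("Jason Click", 8, 2),
    ("Casey Price", 5, 3),
    ("Nicki Forster", 10, 4),
    ("The Mossy Owl", 10, 5),
    ("Jason Mundy", 16, 9),
    ("Bob In So Cal", 15, 6),
]

for name, tasters, correct in panels:
    print(f"{name}: {triangle_p_value(tasters, correct):.3f}")

# Pool every taster from the non-anomalous panels into one big test.
total_tasters = sum(t for _, t, _ in panels)
total_correct = sum(c for _, _, c in panels)
pooled_p = triangle_p_value(total_tasters, total_correct)
print(f"Pooled {total_correct}/{total_tasters}: {pooled_p:.3f}")
```

One caveat: an exact binomial test gives noticeably different numbers for the smallest panels, where the normal approximation is at its weakest, so treat the small-panel p-values as rough.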
Now the interesting thing to me - on a panel by panel basis, all but one of the panels return a p-value that says "Not Significant", but when the analysis is applied across the whole data set (i.e. 29/64), we get a significant finding. The question is - is this sort of data stacking correct or are we skewing the numbers by putting the trials together this way? Hopefully a real scientist type can help us out here and tell us we're ok or we're horribly messing things up and should feel shame at our efforts. In discussing amongst the team (Denny, Marshall and I), there are a few ways to look at this:
Aggregate Results Are Good: There's value in the larger data pool, as the more results you have, the less sensitive the numbers are to the whims/abilities of a few tasters. With some of the smaller tasting panels, you're looking at a 1 vote difference swinging the p-value around a fair amount.
Aggregate Results Are Bad And We Are Bad People: On the other hand, the dyed in the wool number fiend could easily argue that our trials aren't rigorous enough to provide repeatability. That same thing we claimed earlier as an advantage to having multiple teams (smoothing out individual "unknown" variances) makes it easy to dismiss the aggregate results, because you can't say that every panel was testing the same thing.
We admit, this is sloppy Citizen Science. We're not looking at winning the Nobel Prize for Beerology with our experiments, but instead want to point out things we think are interesting and keep trying new things!
Tasting Panels Qualitative Data
|IGOR||Beer Thoughts||Experiment Thoughts|
|Jason Click||WLP001 drier; fruitier and more bitter - "I find that the 1056 has a little more flavor... 001 is more muted. Also the 1056 seem to drop clearer."||"All in all both yeast are almost identical. I believe I like the flavor and clarity of the WY1056 a little more."|
|Andy Turlington||All tasters were successful. "The WLP001 batch had a phenol that I have never experienced with this yeast before. I believe it is because I ran the experiment at an accelerated pace."||"Tough to say. The WLP001 had a phenol that I haven't experienced with that yeast before. It was way too easy to identify the odd beer because of this."|
|Casey Price||No difference in aroma, Wyeast (WY) beer was hazier, white labs (WL) beers had thicker mouthfeel. WY beer had more head retention. The WY beer had a thinner mouthfeel. The WY beer had more bitterness in the back of the throat.|
|Nicki Forster||- WYEAST 1056 sample is softer, milder and less crisp than the WLP 001. Sample WLP 001 was light, crisp and had more of a lager flavor, slightly sweeter up front, lager on the backside. Preferred the WYEAST sample. - WYEAST 1056 was slightly cloudier than WLP 001 sample. Preferred WLP 001: little sweeter, crisp, nice & clear color, mild after taste. WYEAST 1056 was tangy, a little bitter by comparison. - WYEAST 1056 sample was mellow, drinkable. WLP 001 had a slightly different aroma which helped it stand out from the other two samples. Winner, winner chicken dinner. - Good carbonation. WLP 001 was slightly fruitier and sweeter than WYEAST 1056. Also was slightly smoother and has a mildly fuller mouth feel. Preferred WLP 001.||"I personally enjoyed the flavor profile better with the WLP 001 and thought the overall recipe was a good design and taste success. "|
|The Mossy Owl||Reactions weren't confident. They stated beers were very similar. Some said 1056 seemed a bit more bitter, perhaps brighter.|
|Jason Mundy||1056 More malt flavor, 001 Rubbery, 001 Butterscotch, 1056 clean biscuit flavor, 1056 more malt flavor, 001 a little dryer||"I think that the order of tasting can play a role in this. But I think that all these were really close and made great beers. Even though it appears that we can tell a difference between the yeasts, I will freely substitute one for the other."|
|Bob In So Cal||Lighter in flavor, More malt flavor (Wyeast 1056)||"There is a slight difference between the two yeast strains, but not enough for this test to determine that they are different strains. 1056 produced a larger amount of yeast slurry, started faster and was highly active before the 001 started to get going."|
Looking through these taster comments (which are only the comments from successful tasters and the IGOR experimenter), do we see any consistent trends to Wyeast 1056's character vs. WLP001's? Here's what I see in the comments:
Wyeast 1056
- Samples tended to be hazier
- Samples tended to emphasize malt character more than WLP001
White Labs WLP001
- Drier and crisper, more lager like
- Dropped clearer than 1056
Now here's the rub though - I've picked those from the more common reactions in the limited tasting data returned from our panels. Other comments from tasters seem to contradict them - "1056 seem to drop clearer", for instance. Or "WLP 001 was slightly fruitier and sweeter than WYEAST 1056." So who's right about those yeast characteristics? I think we can't safely say without more data, so keep brewing and keep getting us results!
In the meantime, our general brewing recommendation stands - you can treat Wyeast 1056 and WLP001 as interchangeable - until you put them side by side!
What do you think, experimenters and brewers? Did we call this correctly? Does it match with your experience? Did we screw something up horribly? Let us know below or by email.