Writeup: Yeast Comparison of the "same" strain - Wyeast 1056 / WLP001


Experiment: 

Recipe: 

[Photo: The Subjects Under Question (courtesy Bob)]

For our very first experiment we asked our IGORs to tackle a fairly simple question: can tasters detect a difference between the same wort fermented with the classics, Wyeast 1056 American Ale (née Chico) and White Labs WLP001 California Ale? See the link above for the full writeup on the parameters of the experiment.

The Experiment

Here are the basics: IGORs brewed and split a batch of our Magnum Blonde ale, chilled it, and then pitched one part with a pack of Wyeast 1056 and the other with a vial/pack of WLP001. We asked the IGORs to grab yeast samples of roughly the same manufacture date and to pitch without making starters, to reduce possible variations. (Thoughts on that towards the end!) After fermentation, the IGORs were instructed to package the beers in the same manner and run a basic triangle test to determine if tasters could reliably detect the different beer. We gave no instruction on weighting the samples in favor of Wyeast or White Labs.
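If you've never run one, a triangle test gives each taster three blind samples - two pours of one beer and one of the other - and asks them to pick the odd one out. Purely as illustration (the cup labels and function name here are made up, not part of our instructions to the IGORs), a minimal Python sketch of how one flight could be randomized:

    import random

    def triangle_flight(beer_a="Wyeast 1056", beer_b="WLP001"):
        """Build one taster's flight: three blind cups, two of one beer and one of the other."""
        odd_beer, base_beer = random.choice([(beer_a, beer_b), (beer_b, beer_a)])
        pours = [base_beer, base_beer, odd_beer]
        random.shuffle(pours)                     # randomize which position holds the odd sample
        labels = random.sample("ABCDEFGH", 3)     # arbitrary blind cup labels
        return list(zip(labels, pours)), odd_beer

    flight, odd = triangle_flight()
    print(flight)            # e.g. [('D', 'WLP001'), ('A', 'Wyeast 1056'), ('G', 'WLP001')]
    print("Odd beer:", odd)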

The Experimenters

Seven IGORs conducted the experiment in time for our recap episode and this report. We'd like to thank Andy Turlington, Bob_In_So_CA, Casey Price, Jason Click, Jason Mundy, Nicki Forster and The Mossy Owl for tackling this effort! (You can always see how many experiments people have participated in here).

The Brews

[Photo: Mike's Brew Ingredients - notice the small amount of hops there! (courtesy Mike O'Toole)]

What's the story with the Magnum Blonde? It's one of my favorite beers. That's right, the guy known for putting clams in a beer really loves stupidly simple beer. In this case, the recipe was originally called "California Magnum" because I used it to test Great Western's California State Select 2 Row Malt. It's tasty and super cheap to make, so you don't have to turn to those tall boys of PBR! Anyhoo... back to what should be a really easy brew day! For proof, Andy Turlington wrote up his brew day right here, so go check it out - http://gallowspolebrewing.com/igor-smash-blonde-ale/

[Photo: This is what it's all about, right? (courtesy Mike O'Toole)]

Looking through the brewing notes from our reporters, nothing seems very amiss about their brew days. Every brewer who reported gravities hit an original gravity pretty much dead on the target of 1.047, and final gravities were pretty much in line between strains. In other words, each brewer's Wyeast 1056 batch fermented to the same terminal gravity as the White Labs WLP001 batch (or within a point).

[Photo: The Gravity of Jason's Situation (courtesy Jason)]

Interestingly, the range of final gravities across brewers was pretty broad. Of those that were reported, we had one batch come in at 1.012 on the high side (The Mossy Owl), with the low side repped by a pretty dry 1.003 (Jason Click).

[Photo: Bob's Fermenting Buckets - neatly labelled - sure beats my labelling methods! (courtesy Bob)]
[Photo: Fermenters Under Way (courtesy Jason)]
[Photo: Mike's Fermenters Under Way (courtesy Mike O'Toole)]

When they reported it, IGORs noted pretty consistently that the Wyeast 1056 batches started showing krausen faster than the WLP001 batches. Otherwise, everything looked and acted the same - one tester did report that the WLP001 batch threw a larger krausen (Bob). Could the Wyeast "speed" be from that last minute smack giving the Wyeast cells a bit of a leg up? Basically, with a straight pitch of White Labs and no starter, you're completely at the mercy of your yeast's viability in the storage medium. With the smack pack, you get a boost of yeast vitality - aka the yeast are primed and ready for fermentation. Could be that Wyeast has an advantage here, but that's probably negated by the practice of making a starter. You can read up on Ray Found's experiment about short term starters designed to maximize vitality at Brulosophy.com. You should know that neither Denny nor I perceive a value in the fine macho art of treating your lag times like they're quarter mile launch times. In our experience, it feels like effort without effect.

And Now For Nicki's Musical Interlude

Our testers then packaged their beers up in a mix of bottles (with corn sugar) and kegs. They got down to the hard business of tasting beer!

The Tastings

[Photo: Side by Side Samples (note - not how they were poured for the tasters) (courtesy Jason)]

Here's where we really think our farming-out process works like a charm. Denny and I could do these experiments, but we'd only be getting the one data point, and since we're sloppy process controllers (even Denny and the average "uptight" homebrewer is sloppy in comparison to an honest science experiment), having multiple teams tackle the project can help smooth out some of the experimental wrinkles that might creep in. After all, we can't all screw this up in the same way! (Or can we - maybe we can - sounds like a challenge!)

Our seven IGORs ran a total of 12 tasting sessions. The smallest group had 5 tasters, which is our minimum for these crowd-sourced experiments. The largest panels had 15 and 16 participants, which feels like a great time. In all, the panels averaged out to 6.25 tasters, for a total of 75 tasters. A number of IGORs took advantage of their local homebrew club as a source of tasters. We love it and think that's a great thing. You can fully expect to see members of the Maltose Falcons and the Cascade Brewers Society in the mix for some of our future experiments! The experience level of the tasters was reported as a healthy mix of experienced brewers and beer geeks, along with the beer curious. We asked testers to keep the question in question under wraps - but naturally there are people who listen to the podcast and know what's going on. Expect that to be a question of much debate in the near future!

So how did we do? Well, first...

Outliers - A Matter of Science

People tend to think of science as a machine. Execute an experiment, get results, feed results into an algorithm, spit out the answer - voila - 42. But... we are human beings, and humans share one fantastic super power - the ability to mess things up. Things go wrong - something doesn't ferment right, we don't hit the calendar time right, someone does something to make the tasting go all pear shaped. Scientists have debated for years, because scientists have been screwing up for years, what to do when something goes wrong or the data you collect is just so far out of whack as to make absolutely zero sense.

From a "pure" perspective - shouldn't the data get blended in? After all, the universe did provide it to you and it could possibly be valid. From a "practical" perspective - a mis-execution in an experiment means you're no longer testing your question (e.g. "Do Wyeast 1056 and WLP001 produce detectably different beers?"), you're testing a new one ("Can tasters detect a flawed beer?"), and therefore your data isn't for the correct question.

Naturally, this is a very sensitive question. Get too cavalier with tossing results that don't meet your expectations and you're not really looking for answers - you're looking for a way to confirm your beliefs. That means you're not performing science, you're performing politics! Get too stringent with including all the results and you run the risk of getting the right answer to the wrong question, or at least muddying the waters sufficiently.

There are accepted methodologies for rejecting "outlier" data. The first two I can think of are Chauvenet's criterion and Peirce's criterion. Both use standard deviations and statistical analysis to provide firm mathematical underpinnings for rejecting data. For tasting results like ours, there are iterative tests like Grubbs' that can help as well. (To see Grubbs in Action)

For this experiment, though, I don't think we need to worry about the math because it's pretty clear one test had a misfire. The batch produced by Andy Turlington, who was brewing in a hurry, developed a rather noticeable phenol character in the WLP001 portion. When presented to tasters, all 11 correctly picked out the different beer. In this particular case, that seems pretty clearly a non-standard test result, and I (and Denny and Marshall) all agree that the tasting panel results are answering the wrong question. Bummer - it happens - and we know how to deal with it. Don't worry, we will always be good little scientists and reveal when we decide to strike data from the record, and if we have the time, we'll show you the results with and without the outlier data.
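For the curious, here's roughly what a single pass of Grubbs' test looks like in code. This is just a sketch (it assumes NumPy and SciPy are available, and the panel numbers below are invented, not our data); the real iterative procedure re-runs it until nothing else gets flagged.

    import numpy as np
    from scipy import stats

    def grubbs_outlier(values, alpha=0.05):
        """Return (index, value) of a single two-sided Grubbs' outlier, or None."""
        x = np.asarray(values, dtype=float)
        n = len(x)
        mean, sd = x.mean(), x.std(ddof=1)
        idx = int(np.argmax(np.abs(x - mean)))
        g = abs(x[idx] - mean) / sd                    # Grubbs' statistic
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)    # critical t value
        g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
        return (idx, float(x[idx])) if g > g_crit else None

    # Made-up proportions of correct picks from six panels, one of them suspiciously perfect
    panels = [0.30, 0.35, 0.40, 0.33, 0.38, 1.00]
    print(grubbs_outlier(panels))    # (5, 1.0) - the perfect-score panel gets flagged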

The Results

Executive Summary

Ok, here's what you really want, you info junkies - what did our tasters find? Can tasters reliably detect that one beer is made with 1056 and one with WLP001?

Crunch the numbers on our 64 tasters given the non-anomalous samples and we find that 29 of them correctly identified the odd sample. In other words, 45% of the time, a taster could correctly choose the beer made with the different yeast. This is right over the line of what a p-value calculation would tell you is significant (28 out of 64). Compared to the expectation of random chance (i.e. 33%), that seems pretty interesting! Looking at the calculated p-value, we get a value of 0.021, well below the normal threshold of 0.05 to be considered significant. (When you include the anomalous results, that drops even further to 0.000001, thanks to the pool being 40 out of 75, or 53%.)

For openness about the numbers, we're following the cue of our good friends over at Brulosophy.com and using a single tailed t-test function. Just to keep everything on a level playing field, we're using the same calculator as well. The calculator was provided by Justin Angevaare and can be found here.

In other words, side by side, tasters were reliably able to tell which beer was different - but does that mean they could tell which beer was 1056 or WLP001? Could they all agree on common differences, or just "hey, these are different!"?

The Details

Here's a listing of the results we see from our individual panels. We've included the thoughts and observations of both the successful tasters and the experimenters. Let's see what they say! (N.B. As number nerds will tell you - the actual magnitude of the p-value difference from 0.05 is, in theory, meaningless.)
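If you want to check the arithmetic yourself, the 0.021 above (and the per-panel numbers in the table below) line up with a one-sided test of a proportion against the 1-in-3 guessing rate of a triangle test. A minimal sketch, assuming SciPy (an exact binomial test gives slightly different, somewhat more conservative numbers):

    from math import sqrt
    from scipy.stats import norm

    def triangle_p_value(correct, tasters, p0=1/3):
        """One-sided z-test: chance of seeing at least this many correct picks by guessing."""
        p_hat = correct / tasters
        z = (p_hat - p0) / sqrt(p0 * (1 - p0) / tasters)
        return norm.sf(z)                          # upper-tail probability

    print(round(triangle_p_value(29, 64), 3))      # 0.021 - the pooled non-anomalous panels
    print(round(triangle_p_value(28, 64), 3))      # 0.039 - the smallest count still under 0.05
    print(round(triangle_p_value(2, 8), 3))        # 0.691 - Jason Click's panel, matching the table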

Tasting Panel Numeric Data

IGOR              Tasters   Successful IDs   p-Value
Jason Click       8         2                0.691 (NOT significant)
Andy Turlington   11        11               0.00 (VERY significant - but also flawed - see above)
Casey Price       5         3                0.103 (NOT significant)
Nicki Forster     10        4                0.327 (NOT significant)
The Mossy Owl     10        5                0.132 (NOT significant)
Jason Mundy       16        9                0.026 (significant)
Bob In So Cal     15        6                0.292 (NOT significant)

Now, the interesting thing to me: on a panel by panel basis, what we see is a p-value returned that says "not significant," but when the analysis is applied across the whole data set (i.e. 29/64), we get a return that gives us a significant finding. The question is - is this sort of data stacking correct, or are we skewing the numbers by putting the trials together this way? Hopefully a real scientist type can help us out here and tell us we're ok, or that we're horribly messing things up and should feel shame at our efforts. In discussing amongst the team (Denny, Marshall and I), there are a few ways to look at this (see the sketch after this list for one alternative way to combine the panels):

  • Aggregate Results Are Good: There's value in the larger data pool, as the more results we have, the less sensitive the numbers are to the whims/abilities of a few tasters. With some of the smaller tasting panels, you're looking at a 1 vote difference swinging the p-value around a fair amount.
  • Aggregate Results Are Bad And We Are Bad People: On the other hand, the dyed in the wool number fiend could easily argue that our trials aren't rigorous enough to provide repeatability. That same thing we claimed earlier as an advantage to having multiple teams (smoothing out individual "unknown" variances) makes it easy to dismiss the aggregate results, because you can't say everything tested was the same.

We admit, this is sloppy Citizen Science. We're not looking at winning the Nobel Prize for Beerology with our experiments, but instead to point out things we think are interesting and keep trying new things!
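For what it's worth, here's one textbook alternative to simply pooling the raw counts - not the method we used above, just a sketch assuming SciPy: treat each panel as its own little experiment and combine the per-panel p-values with Fisher's method.

    from scipy.stats import combine_pvalues

    # One-sided p-values from the six non-anomalous panels in the table above
    panel_pvalues = [0.691, 0.103, 0.327, 0.132, 0.026, 0.292]

    stat, p_combined = combine_pvalues(panel_pvalues, method="fisher")
    print(round(p_combined, 3))    # ~0.046 - weaker than the pooled 0.021, but still (barely) under 0.05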

Tasting Panel Qualitative Data

Jason Click
  Beer Thoughts: WLP001 drier, fruitier and more bitter. "I find that the 1056 has a little more flavor... 001 is more muted. Also the 1056 seem to drop clearer."
  Experiment Thoughts: "All in all both yeast are almost identical. I believe I like the flavor and clarity of the WY1056 a little more."

Andy Turlington
  Beer Thoughts: All tasters were successful. "The WLP001 batch had a phenol that I have never experienced with this yeast before. I believe it is because I ran the experiment at an accelerated pace."
  Experiment Thoughts: "Tough to say. The WLP001 had a phenol that I haven't experienced with that yeast before. It was way too easy to identify the odd beer because of this."

Casey Price
  Beer Thoughts: No difference in aroma. The Wyeast (WY) beer was hazier, while the White Labs (WL) beers had a thicker mouthfeel. The WY beer had more head retention, a thinner mouthfeel, and more bitterness in the back of the throat.

Nicki Forster
  Beer Thoughts:
  - WYEAST 1056 sample is softer, milder and less crisp than the WLP 001. Sample WLP 001 was light, crisp and had more of a lager flavor, slightly sweeter up front, lager on the backside. Preferred the WYEAST sample.
  - WYEAST 1056 was slightly cloudier than WLP 001 sample. Preferred WLP 001: little sweeter, crisp, nice & clear color, mild after taste. WYEAST 1056 was tangy, a little bitter by comparison.
  - WYEAST 1056 sample was mellow, drinkable. WLP 001 had a slightly different aroma which helped it stand out from the other two samples. Winner, winner chicken dinner.
  - Good carbonation. WLP 001 was slightly fruitier and sweeter than WYEAST 1056. Also was slightly smoother and has a mildly fuller mouth feel. Preferred WLP 001.
  Experiment Thoughts: "I personally enjoyed the flavor profile better with the WLP 001 and thought the overall recipe was a good design and taste success."

The Mossy Owl
  Beer Thoughts: Reactions weren't confident. They stated the beers were very similar. Some said 1056 seemed a bit more bitter, perhaps brighter.

Jason Mundy
  Beer Thoughts: 1056 more malt flavor; 001 rubbery; 001 butterscotch; 1056 clean biscuit flavor; 1056 more malt flavor; 001 a little drier.
  Experiment Thoughts: "I think that the order of tasting can play a role in this. But I think that all these were really close and made great beers. Even though it appears that we can tell a difference between the yeasts, I will freely substitute one for the other."

Bob In So Cal
  Beer Thoughts: Lighter in flavor; more malt flavor (Wyeast 1056).
  Experiment Thoughts: "There is a slight difference between the two yeast strains, but not enough for this test to determine that they are different strains. 1056 produced a larger amount of yeast slurry, started faster and was highly active before the 001 started to get going."

Looking through these taster comments (which are only the comments from successful tasters and the IGOR experimenters), do we see any consistent trends in Wyeast 1056's character vs. WLP001's? Here's what I see in the comments:

Wyeast 1056

  • Samples tended to be hazier
  • Samples tended to emphasize malt character more than WLP001

White Labs WLP001

  • Drier and crisper, more lager like
  • Dropped clearer than 1056

Now here's the rub, though - I've picked those out as the more common reactions from the limited tasting data returned by our panels. Other comments from tasters seem to contradict them - "1056 seem to drop clearer," for instance, or "WLP 001 was slightly fruitier and sweeter than WYEAST 1056." So who's right about those yeast characteristics? I don't think we can safely say without more data, so keep brewing and keep getting us results! In the meanwhile, our general brewing recommendation stands - you can treat Wyeast 1056 and WLP001 as interchangeable - until you put them side by side! What do you think, experimenters and brewers? Did we call this correctly? Does it match with your experience? Did we screw something up horribly? Let us know below or at podcast@experimentalbrew.com

denny
Thanks to all the IGORs and

Thanks to all the IGORs and tasters for their time! As I said on the podcast, I'm not too surprised by the results. I've always felt that there was a difference in the beer produced by the 2 yeasts, although I've never done an extensive test like this. Hope you'll all participate in the next experiment.

Life begins at 60....1.060, that is!

CA_Mouse
It was a nice experiment to

It was a nice experiment to get me out of the 'White Labs is my go to' rut and try something else. Since White Labs is less than 90 minutes from me, most of my vials tend to be very fresh, so that is what I've gone with. Seeing the slight difference in how these two worked has me thinking about trying other Wyeast products to see if there are other variations that can improve my other beers.

Can't wait for the FWH experiment, since that gives me a reason to brew a double batch of a nice Pale Ale!

*Thanks for the photo credits too! Kind of cool seeing my photos were worth taking.

Bob

denny
Thanks back atcha, Bob!

Thanks back atcha, Bob!

Life begins at 60....1.060, that is!

stickyfinger
"aggregate" stats analysis

Hi, awesome stuff. I hope a lot more is coming!

How did you aggregate the data for this exbeeriment? If you did the same analysis as for an individual, but with the "summed" data, I am pretty sure that this would be inappropriate in this case. I think it would be more appropriate to statistically take into account the fact that each brewer introduces his/her own bias to the separate exbeeriments.

I will ask my stats-geek friend and see if she can shed some light on the required method of analysis here.

drew
We'll take all the commentary

We'll take all the commentary we can - neither Denny nor I are statisticians - although this whole effort is forcing me to relearn!

pweis909
Stats...

Yeah, using a t-test for this reminds me of the time my thesis committee ripped me a new one during a public defense. I think some sort of multi-factor anova would be a more appropriate test, and I don't think you would end up being able to say there is a difference between the yeasts. More like the difference between the yeasts is dependent on the experimenter. Not very satisfying if looking for general truths. Still, I applaud the effort (plus, my stats knowledge has gone stale in recent years, so there likely is a better approach).

Peter

kramerog
Binomial proportions test

I believe that the binomial proportions test was used, not the t-test.  Use of binomial proportions test is described in the blog http://onbrewing.com/triangle-test/.  The author of that blog is referenced in the article here.

KramerOG

timbower
Methodological Issues

Very interesting hypothesis: can people tell the difference between very similar yeast strains?
To address one of your concerns, the way the data was pooled was not correct, given that you had 7 different experiments instead of 1 experiment with 7 different populations (of tasters). If you had all 7 populations try each of the 7 sets of beers (or only one set of beer), using the triangle testing as you did, and then pooled the results, you would have done it correctly. The thing with experiments is that you need a common and consistent manipulation (to answer "can participants tell the difference between x1 and x2," you need each subject exposed to the same x's).

I am not sure why you used the t-test; t-tests are used to compare sample means of normally distributed data from a randomly drawn population. I believe a simple presentation of percentages (% correctly identifying vs. % not correctly identifying) would have sufficed here. Determining a p-value is only to let you know whether your findings are more a product of chance, and should only be used when the sample drawn is representative (i.e. randomly drawn); it is not appropriate to use inferential statistics with non-probability sampling designs.

That being said, I have used both 1056 and 001 numerous times and have found slight differences in clarity (001 better flocculation) mouth feel (1056 full-bodied and smoother) and if 1056 is fermented at cooler temps you can bring out more fruit/citrus flavor while warmer temp lets the malt come more forward.

Thanks for conducting this quasi-experiment, it made me think.

Pietro
pH

Did anyone happen to measure post-boil and finished beer pH for both yeasts? I wonder how much this may have affected the tasters' perceptions; it's something I am following closely, particularly when it comes to the new NE IPAs (Alchemist, Hill, Foley, Singlecut, Treehouse, etc.)

Maybe next time?!

denny
That's a really good idea!  I

That's a really good idea!  I hope we have an opportunity to look into it.

Life begins at 60....1.060, that is!

CA_Mouse
I just redid this experiment.

I just redid this experiment, this time with a kettle soured IPA. There is a very slight difference this time around. The Wyeast started faster and took longer to finish active fermentation, but finished a little lower than the White Labs. I did a 2 step starter for each (Denny can add me to his yeast abuser list - the Wyeast smackpacks got moved around and both actually had chunks of ice in them). I ended up with a very tart and dry IPA. The Wyeast has a fruitier nose and the White Labs has a slightly sweeter taste (even though the gravity difference was 0.003). I think that for a clean style ale the Wyeast 1056 is a hardier strain, but the difference between the two is completely negligible.

Bob

drew
Interesting find. More ammo

Interesting find. More ammo for those dang Oregonians!

CA_Mouse
Not really a lot of

Not really a lot of difference from the first time around. The differences are so small that I would call this the same beer if I didn't know. I think the difference could even have been a small pitching rate difference, since I didn't do a cell count on the different starters. I did cold crash and decant off 95% of the starter wort; there were very similar volumes of slurry from both flasks, so there is a little wiggle room as to exact cell counts, but both should have been near 450B cells (yes, an overpitch for an ale, but because of the low pH it is needed).
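Roughly, using the commonly cited 0.75 million cells per mL per degree Plato rule of thumb for ales (batch size and gravity below are just assumed for illustration, not measured):

    def target_cells_billion(volume_liters, plato, rate=0.75):
        """Ballpark ale pitch target; rate is in million cells per mL per degree Plato."""
        return rate * 1e6 * volume_liters * 1000 * plato / 1e9

    # e.g. a 20 L batch at 14 degrees Plato (about 1.057):
    print(round(target_cells_billion(20, 14)))   # ~210 billion, so 450B is roughly a 2x overpitch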

Bob

denny
Thanks for your diligence, Bob

Thanks for your diligence, Bob!

Life begins at 60....1.060, that is!

Todd H.
Stats

I once asked a statistician friend about compiling results from repeated behavioral studies at work into one.  She said it was okay only if I did an analysis of covariance to show that the individual studies all varied in the same way before compiling them into one result.

I'd assume the same holds here for all the IGOR studies.

On the other hand, how anal do you want to get over beer analysis?

jdpils
Standardizing Data and Objectives with Yeast strains

Thanks to all the IGORs who ran this experiment.  For years I used to split my ESB with WLP002 and WY1968, until after numerous observations I found that they performed and tasted the same, and with some research found they are of the same origin.  So now, as with most 11 gallon batches, I use two yeasts but rarely ever the same strain.  For me WY1056 is my go-to and I love to compare it with others, but I have never done so with WLP001.  I like the idea of a blonde ale chosen to eliminate hop variations and masking of yeast flavors.  I have a couple points I would like to make before proposing some ideas for future multi-site yeast experiments.  As someone pointed out above, the experiment should be viewed as 7 repeated experiments with different sample sizes.  Also, it would be logical to eliminate Jason's results due to the phenol noted in WLP001; it suggests something went wrong.  It was also mentioned that some beers were bottle conditioned and some kegged.  That might add a large variable.  Lastly, this experiment seemed to focus on the final outcome comparison in flavor, and I would be interested in more comparison of how the yeasts processed or performed.

So my proposal, which addresses both understanding experimental variations and yeast performance, is to define a cost effective list of data each IGOR should take.  This might include mash temp and times, pH at various points, boil time, chill method, fermentation SG versus time profile, final pH, oxygenation method, etc.  If a standardized list could be created, then specific to each experiment, block out anything that does not make sense.  In this manner it may be possible to identify second or third order effects.  I would be happy to participate in such an effort.  Thanks again for all the information and effort.

Cheers, 

Jim Dunlap