Brewing the Perfect Pot of Office Coffee with Design-Expert® Software, Version 10
Stat-Ease Staff (from L to R): Joe Carriere, Martin Bezener (author), Neal Vaughn, Hank Anderson, Mark Anderson
Motivation The idea for this experiment began while our programming lead, Hank Anderson, was at a cabin in northern Wisconsin, where his host made him some whole-bean coffee using a burr grinder. Hank had only started drinking coffee after his son was born (for reasons that will be immediately obvious to any parent out there), so he hadn't really explored many options beyond the different brands of pre-ground coffee at the grocery store. He immediately started thinking about the possibility of experimenting to improve the Stat-Ease office coffee, which, to put it plainly, was disgusting.
Choosing Factors When Hank returned to the office, he enlisted me as his partner, and we gathered up the other coffee drinkers and started to plan an experiment. It was immediately apparent that the most important and interesting factor was going to be the blend of coffee beans. The Stat-Ease world headquarters is located across the street from Up Coffee Roasters (www.upcoffeeroasters.com), so we obviously had to use them. We selected three different beans, with three different roast profiles corresponding to light, medium, and dark roasts. Rather than limiting ourselves to pots brewed from a single bean type, we hoped to try blends of two beans as well as mixtures of all three.
Another factor that the group thought was important was the amount of coffee used to brew each pot. After some quick range-finding tests, we decided that we'd use between 2.5 and 4.0 ounces of coffee per coffee pot. We had a feeling that the ideal amount of coffee may not be the same for all blends of beans.
Initially we were going to test the effect of blade vs burr grinders on the coffee taste, but after grinding some beans with the blade grinder, we quickly abandoned the idea. The blade grinder had several deficiencies. First off, the grind size is determined by the amount of time you run the grinder, so the operator would have to carefully time each grind, potentially increasing the amount of noise in the experiment. The blade grinder was also inconsistent—some beans were completely pulverized, while others remained largely intact. We settled on using three different burr grinder settings: low, medium, and high. We had hoped to quantify the grind size to make it a numeric factor, but after trying several methods we decided each would be either too time-consuming (digital image analysis), too inaccurate (measuring volume with a graduated cylinder), or too troublesome to use (sieve analysis).
In the end, we settled on the following experimental factors (lowercase ones [a, b, c] being hard-to-change):
The Responses The last thing we had to do was decide how to measure the response. There are many opinions on the right way to taste coffee. We decided that we'd ask a core group of five tasters to rate each pot of coffee on a number of characteristics, with "overall liking" being the most important characteristic. The overall liking would be rated on a scale from 1 to 9, using the current office coffee as a benchmark score of 5. Testing would have to be done blindly so that knowledge of the experimental settings would not affect the subjective rating of each coffee pot.
After collecting all five testers' scores, the plan was to create two responses for each pot of coffee:
The average overall liking of the five individual scores
The minimum overall liking of the five individual scores
Later we'll explain why we collected both the average and minimum of the five tasters' overall liking scores.
The Experimental Design It was clear that this was going to be a mixture-amount experiment with an additional categorical factor for grind size. For more information on mixture-amount experiments, we encourage you to read the original 1985 article by Greg Piepel and John Cornell, listed in the references. However, we noticed an immediate practical difficulty: a fully randomized experiment would require an independent freshly roasted blend of coffee for each pot. This was not going to be feasible, since coffee beans could only be purchased in large bags, all roasted as part of one large batch. We wanted to split up each 1 lb bag (or a mix of bags) of coffee to use at several amount-grind size combinations. Running the experiment in this way would force us to use a split-plot type design. We were in luck, as the new combined mixture-process split-plot feature in Design-Expert® software, version 10 (DX10), could be used to help us create the perfect pot of office coffee.
A design consisting of 74 runs at 16 blends of coffee was created. Each blend of coffee beans would be tested at, on average, 4 amount-grind size combinations. We also randomly interspersed six runs of the current office coffee throughout the experiment to serve as a control.
The first 6 pots of coffee in the experimental design. Note the Group column, which clearly indicates the convenient split-plot arrangement of hard-to-change factors (a, b, c) in the experiment.
Check out the fun video below that we made of setting up and running the experiment.
Figure 1-1 Figure 1-2
The predicted average (left) and minimum overall liking (right) at a (1/3, 1/3, 1/3) blend of light, medium, and dark beans. Even though the average overall liking looks okay at 4 ounces, the minimum drops off steadily at the finest grind setting.
After nearly three months of coffee tasting, it was finally time to analyze our data. After discarding a few problematic runs and double checking the spreadsheet, we needed to come up with a model to use for optimization. Due to the large number of model terms, we used DX10's new automatic model selection feature, and immediately noted the following:
Pots of coffee brewed with all (or mostly) light beans were poorly rated at nearly all amount-grind size combinations. Stat-Ease coffee drinkers want it dark!
There was a clear amount-grind size interaction, as coarser grinds generally required more coffee to achieve high average overall likings. See Figure 1-1 above for an illustration.
Using lots of coffee generally increased the overall average liking, but lowered the minimum overall liking at the finest grind setting. This suggested that those pots were too strong for one or two tasters to handle, even though the group as a whole liked them. See Figure 1-2 above for an illustration—note the sharply downsloped red line!
Optimization Once we settled on a model, we needed to make a recommendation on how to brew the coffee. We spent a lot of time playing with the optimization parameters, trying to maximize the taste while keeping costs low. We decided that a coffee blend that simultaneously met the following objectives would be ideal:
Maximize the expected average overall liking (good coffee).
Ensure that the expected minimum overall liking would be at least 5 out of 9 (coffee that everyone liked at least as much as the current coffee).
Minimize the amount of coffee used (save money).
It turns out we were in luck: 2.5 ounces of an approximately 50/50 blend of medium and dark coffee ground at the fine grind setting gave an expected average likeability of 5.7 and an expected minimum likeability of 5.2. If we had ignored objectives 2 and 3, we would have gotten a coarsely ground 4.0 ounce pure blend of dark coffee with an expected overall likeability of 6.4, but an expected minimum likeability of 4.2, which would be unacceptable.
Confirmation To ensure that our results would be reproducible in the future, we performed several confirmation runs. Confirmation is a bit complicated in our situation, since knowing that we would be confirming the "best" coffee would likely bias the confirmation results. To circumvent this issue, we performed the following nine confirmation runs in a random order: 2 of the current office coffee, 3 of the "best" blend from the previous section, 2 of the "worst" blend, and 2 runs of a blend that was somewhere in the middle. Fortunately, all of the confirmation runs were within their respective prediction intervals.
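The prediction-interval check can be sketched generically. The prediction, noise estimate, leverage, and t value below are hypothetical placeholders, not numbers from the coffee experiment (Design-Expert computes the actual intervals from the fitted model):

```python
import math

# Hypothetical placeholder values -- not the experiment's actual numbers.
y_pred = 5.7      # model-predicted average overall liking at the test settings
s = 0.8           # root mean squared error of the fitted model
leverage = 0.15   # hat value of the prediction point
t_crit = 2.086    # two-sided 95% t critical value for 20 df (from tables)

# A new observation varies by both model uncertainty (s*sqrt(leverage))
# and fresh run-to-run noise (s), hence the "1 +" under the square root.
half_width = t_crit * s * math.sqrt(1 + leverage)
lo, hi = y_pred - half_width, y_pred + half_width

confirmation = 5.2  # hypothetical confirmation-run result
print(lo <= confirmation <= hi)  # True: the run falls inside the interval
```

A confirmation run landing outside its interval would suggest the model does not predict well at those settings.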
Conclusion After several months of experimenting and several hours of analyzing and optimizing, the group was happy with the results, and there was a noticeable uptick in productivity. This experiment was only possible using the powerful combined mixture-process split-plot tools in DX10 software.
If you are teaching a class on the basics of DOE or RSM, consider using either the DOE Simplified or RSM Simplified books as a text for your course. They make statistics interesting and come complete with practice problems. Click here if you are interested in purchasing the books or learning more about them. To request a complimentary instructor copy, click here.
Shock-ing Improvement: An Experimental Design to Improve Mountain Bike Riding
Jim Anderson, Guest writer
Hey mountain bikers, have you ever spent time setting up your shocks and tire pressure beyond riding off the curb a few times to set sag? Or have you ever paid good money for a front shock and decided it had poor performance so it must be cheap junk? Maybe you should have spent more time experimenting with the settings to give you the best ride possible.
The opportunity for this experiment came about when my 29er hardtail front shock lost pressure and was due for a rebuild. Instead of rebuilding it, I upgraded to a new shock, a RockShox® Recon Gold Solo Air 100 mm shock (see photo below). I never really had my original shock set up properly, so I resolved to get it right this time. I read discussions online where the consensus advice was to experiment with different settings. Since I have training and experience in statistical experimental design through my profession, I decided to use my skills to run an experiment in an effort to improve the riding performance. I don't race mountain bikes, but do enjoy intermediate level, cross country riding in the Twin Cities area on trails that include Lebanon Hills, Murphy Hanrehan, and Elm Creek, as well as the awesome Cuyuna trails further north and the excellent Cable, Wisconsin area trails.
In preparation for the experiment, I considered the criteria that would be needed for a valid experiment. I wanted to use an actual trail segment in the local area since it would provide the best test conditions for the type of riding that I like to do. Additional trail criteria included access to an easy route of return to the trailhead for ease of changing settings and also a trail that wasn’t too busy in order to minimize interruptions in the testing. I considered eight Twin Cities trails that had at least some technically challenging sections to them. The trail chosen for the experiment was the Bertram Chain of Lakes Regional Park mountain bike trail near Monticello, Minnesota. I chose this trail since it had some technical terrain with several curves and berms in the first mile. There was also easy access to a paved road close to the trail which allowed me to return to the trailhead and make setting changes to ensure that each treatment was done on an identical section of trail.
“Beautiful scenery, cool lakes, steep hills and tight turns. Old school mountain biking at its best!”
—Joe Switek, Bertram Chain of Lakes trail rider
The objective of this design was to find better settings for factors that involve front suspension performance and tire pressure. The factors investigated included the shock pressure, rebound setting, and tire pressure. Front and rear tire pressure were combined into one factor. Tire pressure can affect the handling and speed of a mountain bike through curves, rock gardens, and small drops. The tires on the bike, Geax Saguaro 29 x 2.2 inch tires, were set up as tubeless on Stan's NoTubes Flow Ex rims.
There are many different types of experimental designs that could have been used, including one-factor-at-a-time, full factorial, fractional factorial, or optimization designs like central composite or Box-Behnken designs. I chose a full factorial of the three factors at two levels each (2³ = 8) plus duplicate center points for a total of ten runs. This design allows for clear interpretation of the main factors and interactions. With center points, it can also give an indication of curvature if any of the responses aren't linear.
To determine the levels of the settings for the experiment, I found the manufacturer's suggested shock pressures, which topped out at a listed rider weight of 220+ lb, so I extrapolated to arrive at a recommended shock pressure of 165 psi for my body weight. There was also some discussion online that lower shock pressures could be used to take advantage of more of the shock travel distance than typically recommended, so 135 psi was chosen for the low level. Tire pressures for tubeless tires run lower than with tubes, so front and rear levels were chosen to be 24/27 psi for the low end and 34/37 for the high setting. The differential between front and rear tire pressure was based on the equations front = x - 1 and rear = x + 2 found in several bike forums. The low and high rebound settings were 1 (rabbit) and 5 (turtle) clicks, respectively. Center points were set at 150 psi for shock pressure, 29/32 for tire pressure, and 3 for the rebound setting. See Table 1 below.
Table 1: Design
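Using the levels above, the 2³ factorial plus duplicated center points can be sketched as a design matrix. This is a minimal illustration, not Design-Expert's output; the front/rear tire pressures follow the forum equations front = x - 1 and rear = x + 2:

```python
from itertools import product
import random

# Factor levels from the article.
shock_psi = [135, 165]  # front shock pressure, low/high
rebound = [1, 5]        # clicks: 1 = rabbit (fast), 5 = turtle (slow)
tire_x = [25, 35]       # nominal tire pressure x; front = x - 1, rear = x + 2

# Full 2^3 factorial: every combination of the two levels of each factor.
runs = [dict(shock=s, rebound=r, front=x - 1, rear=x + 2)
        for s, r, x in product(shock_psi, rebound, tire_x)]

# Duplicated center point: 150 psi, 3 clicks, 29/32 psi tires.
for _ in range(2):
    runs.append(dict(shock=150, rebound=3, front=29, rear=32))

random.shuffle(runs)  # randomize the run order before riding
print(len(runs))      # 10 runs total
```

The duplicated center points supply both a curvature check and a pure-error estimate.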
The responses for the experiment included fork travel, ride quality, and elapsed time. For fork travel, the Recon Gold fork has a maximum of 100 mm of travel, or about 4 inches. I wanted to use a good portion of it but didn't expect to use all of it, since there were no drop-offs in this section that would require much travel. It's best to keep some travel in reserve so the fork doesn't bottom out. For ride quality, I was seeking smoothness of the ride and good traction through the course. Elapsed time was also measured, although I didn't want to push the speed envelope; to run a fair comparison of treatments, the plan was to ride at a similar intensity on each run.
Running the Experiment On a beautiful fall day in October, I ran the experiment. There were just a few riders on the trail which was a bit surprising given the awesome weather. Each trail run was 0.8 miles, and it was another 0.9 back to the trailhead via pavement, so after an initial test run and 10 treatment runs, it added up to more than 18 miles. With the riding and the work to change settings between runs, I was definitely getting tired going into the final runs. Changing the settings between runs took more time than I expected and involved using a shock pump, a tire pump and the lever for resetting rebound. I wish I would have had enough endurance to ride the entire trail that day, but when my experiment was done I was tired and ready to go home to run the results through the statistical software.
The Results The 3 responses evaluated are shown in the Table 2 below.
Table 2: Results
The maximum travel used in the experiment was just over half of the travel available at 2.5 inches out of a possible 4. Front shock pressure and rebound were the factors that had an effect on the amount of travel used. The graph in Figure 1 below shows this relationship. With high shock pressure, 165 psi, and fast rebound, not as much travel was used as at low shock pressure and slow rebound. I didn’t expect all of the travel to be used since there weren’t any drop-offs on this segment of the trail.
Figure 1: Travel used vs. front shock pressure and rebound
Ride quality (or 'ridability') was a subjective response based on my rating following each run (1 = low ride quality to 10 = high ride quality). I was looking for a smooth-flowing, well-controlled ride. Ride quality was highly related to tire pressure, as seen in Figure 2. For low tire pressure, the average ride quality score was 6.5, while it was only 3.25 for high tire pressure. This was because the higher tire pressure settings gave such a jerky, uncontrollable ride compared to the lowest settings. As you'll see later in this article, high tire pressures also hurt normalized elapsed time: less control and more lost time from too much deflection off of bumps, rocks, and roots, resulting in poor traction. I definitely learned the benefits of lower tire pressures, a well-known advantage of the tubeless tire setup, but this really helped to hammer it home.
Figure 2: Ride quality vs. tire pressure
Even though I tried to ride with the same intensity each run to control for improvement in riding skill, my effort at equalizing intensity didn't work, as you can see in the graph in Figure 3. Run number was the actual order in which treatments were performed. Running the treatments in this randomized order typically helps reduce the effect of any nuisance factors. Even though I thought I was giving a similar effort each time, elapsed times gradually decreased due to subconscious improvement and familiarity with the trail section. This demonstrates that it can be difficult to control for improvement in skill as treatments proceed. Let this be a lesson to any mountain bike racer who thinks that pre-riding race courses has little value.
Figure 3: Elapsed time vs run number
To correct the elapsed time for this underlying trend, a regression was done and the trend was subtracted from the results to generate the normalized elapsed time response seen in Table 3 and Figure 4. When statistical analysis was done on normalized elapsed time, the main effects of tire pressure and front shock pressure were found to be statistically significant as well as two interactions, one between tire pressure and rebound (shown in Figure 5) and the other between tire pressure and front shock pressure. The highest R-squared value, 0.82, of the three responses was obtained with this model which included two main effects and two interactions. This means that the other two responses, ride quality and amount of travel used, likely have other factors or noise factors involved that resulted in less variation being explained by their respective models.
Table 3: Normalized elapsed time
Figure 4: Elapsed time and normalized elapsed time
Figure 5: Normalized elapsed time vs. tire pressure and rebound
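The detrending step described above can be sketched with a simple least-squares line of elapsed time against run number. The times here are made-up illustrative values showing a downward learning trend like Figure 3, not the article's data:

```python
# Hypothetical elapsed times (seconds) by run number -- illustrative only,
# drifting downward like the learning trend in Figure 3.
times = [312, 308, 305, 306, 301, 299, 300, 296, 294, 293]
order = list(range(1, len(times) + 1))

# Ordinary least-squares slope and intercept of time vs. run number.
n = len(times)
mx = sum(order) / n
my = sum(times) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(order, times))
         / sum((x - mx) ** 2 for x in order))
intercept = my - slope * mx

# Subtract the fitted trend, adding the mean back so the normalized
# times stay on the original scale for analysis.
normalized = [y - (intercept + slope * x) + my for x, y in zip(order, times)]

print(slope < 0)  # True: times were drifting down run to run
```

The normalized response keeps the same average as the raw times, so the factor effects can then be analyzed without the learning trend masking them.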
Conclusions In summary, lower tire pressure gave better ride quality and lower elapsed times, while rebound and front shock pressure affected the amount of fork travel used. Based on numerical optimization, the best combination of settings for my style of riding on this bike would be low tire pressures of 24 psi in the front and 27 in the rear, slow rebound at 5 clicks or 'turtle', and high front shock pressure at 165 psi. Confirmation runs on longer trail segments should be done with these settings to ensure that they are acceptable. Further optimization could be done using an optimization design to find the best settings overall, since this experiment only looked at three levels of each factor and doesn't support strong predictions between those levels.
More detailed summary of the evaluation:
Faster rebound and higher front shock pressure resulted in less travel while slower rebound and low front shock pressure resulted in the most travel.
Lower tire pressure resulted in better ride quality, i.e., a smoother and better flowing ride.
Lower tire pressure resulted in the fastest normalized elapsed times.
An interaction between tire pressure and rebound also affected normalized elapsed times with lower tire pressure and slower rebound resulting in the lowest normalized elapsed times.
I definitely learned some new lessons firsthand, including that experimentation is the way to go, that low tire pressure has its advantages, and that shock pressure and rebound setting will affect the amount of travel. Each rider can use this type of experimentation to find the unique settings that give them the best mountain bike riding of their lives.
Get Up to Speed on DOE with Our Instructor-Led Workshops
Whether you are just starting out or are a practiced experimenter, Stat-Ease has a workshop for you. Find a list of our upcoming public workshops below. We also offer a large variety of private on-site workshops, including industry-specific classes. This is a cost-effective and convenient option if you have 5 or more people to train. For more information on private workshops, click here.
Workshops are limited to 16 attendees. Receive a $200 quantity discount per class when you enroll 2 or more students, or when a single student enrolls in multiple workshops. For more information, contact Rachel via e-mail or at 612.746.2030.
Recap of the 6th European DOE User Meeting & Workshops in Leuven, Belgium
The Grand Béguinage of Leuven
Stat-Ease, Inc. and CQ Consultancy hosted the 6th European DOE User Meeting & Workshops in Leuven, Belgium on May 18th-20th, 2016. There were two design of experiments (DOE) workshop tracks on the 18th, and then a two-day DOE User Meeting on the 19th-20th. The venue at the historic Faculty Club was charming—complete with cobblestone streets, flowers in full bloom, and delicious artistically-arranged cuisine. The Faculty Club is located together with the 13th century Grand Béguinage, which is a beautiful UNESCO World Heritage site.
The technical program was very informative and the presenters were excellent. We learned about the latest design of experiments (DOE) techniques from well-known speakers in the field, and experimenters from a variety of industries spoke about their DOE successes and failures, and what they learned from them. View the program here.
Mark Anderson, Stat-Ease, Inc. (left), and Sebastian Hoffmeister, STATCON (right)
Pat Whitcomb, Stat-Ease, Inc. (left), and Peter Goos, University of Antwerp & KU Leuven (right)
Baard Buttingsrud, CAMO Software AS
Attendees had the chance to network with each other and consult with experts about their DOE questions. A highlight of the meeting was our special event on the 19th. It included a mesmerizing performance by the choral group, Currende, a delicious dinner at Improvisio restaurant, and then beer sampling under the stars at The Capital (which has the largest beer selection in the world!). A good time was had by all. Hopefully you can join us for the 7th DOE User Meeting in 2018. Look for details to come!
Currende Choral Group (left), Chapel where the concert was held (right)
Pat Whitcomb and Mark Anderson relaxing at The Capital (left), and Leuven City Hall (right)
Have you ever explored the factorial design options that are outside the standard red/yellow/green design selection tool? There is a special class of designs that bear extra attention—the Minimum-Run designs. These designs have been created to address the problem of regular fractional factorial designs that may require an excessive number of runs relative to the number of coefficients that really need to be estimated.
Consider the need to fully characterize 7 factors, meaning that you want to estimate both main effects and two-factor interactions (2FI's). The standard resolution V (green) fractional factorial design requires 64 runs. This is enough work to send most people scrambling towards the resolution IV (yellow) designs. However, this reduction to 32 runs means that at least some of the 2FI's will be aliased with each other. Instead of choosing either of these, why not look for other design options? The Min-Run Characterize design for 7 factors requires only 30 runs, more than a 50% savings! The design can estimate all main effects and 2FI's.
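The arithmetic behind that run count is easy to verify: a full characterization model for k factors needs an intercept, k main effects, and k-choose-2 two-factor interactions. A quick sketch:

```python
from math import comb

def coefficients_to_estimate(k):
    """Intercept + k main effects + k*(k-1)/2 two-factor interactions."""
    return 1 + k + comb(k, 2)

p = coefficients_to_estimate(7)
print(p)  # 29 coefficients for 7 factors

# The standard resolution V fraction needs 64 runs; the Min-Run
# Characterize design uses 30, just one more than the 29
# coefficients it must estimate.
print(1 - 30 / 64)  # fraction of runs saved, a bit over 50%
```

With only one run to spare above the number of coefficients, losing even a couple of runs can make the model inestimable, which is why the padding mentioned below matters.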
While this is a great design, in all fairness there is no free lunch. Along with the highly attractive benefit of reduced runs come two shortcomings to recognize—partial aliasing and non-orthogonality. Partial aliasing: This is a trait encountered in other designs, like Plackett-Burman (PB) designs. However, unlike PBs, whose main effects are partially aliased with 2FI's (rendering them Resolution III designs), the Minimum-Run Characterization (Resolution V) designs have main effects and 2FI's partially aliased only with 3FI's. These high-order interactions are generally assumed to be negligible and thus can be ignored. The resulting design can easily estimate both main effects and all two-factor interactions!
Non-orthogonality: Textbooks and old concepts may have convinced you that a design must be orthogonal to provide good results—NOT TRUE!! That was the case when we were using pencil and paper to do the calculations, but modern DOEs are analyzed with software using sophisticated numerical methods. Effects can now be estimated even with non-orthogonal designs. The estimates WILL depend on the other terms in the model, so it is absolutely necessary that the software you are using can account for this. Design-Expert is fully capable of handling non-orthogonal analysis. As you choose effects on the half-normal plot, notice that the remaining effects will change position on the graph. Although this may be a bit unnerving at first, the important thing is to focus on the largest effects and always use subject matter knowledge to help make decisions. Ultimately, every DOE should end with some confirmation runs to verify the results of the analysis.
The Min-Run Screening designs are Resolution IV and should also be considered highly effective screening designs. The point of screening is to sift through a large number of factors, most of them likely insignificant, to discover the vital few that are important. A good screening design will correctly estimate main effects (meaning they should NOT be aliased with 2FI's). The Min-Run Screening designs also have partial aliasing, but main effects are partially aliased only with 3FI's.
For example, if you want to screen 9 factors, the Min-Run Screen option uses only 20 runs! This is 2 runs per factor, plus a default of 2 additional runs added as “padding” just in case a run or two is lost during the experiment. This adds to the robustness of the design for real-world experimentation. Just like the Min-Run Characterize designs, the small drawback is partial aliasing and non-orthogonality. But the advantage is the reduction in the time and money needed to do the experiment.
One last time—always confirm your experimental results by doing a few confirmation runs!