Chapter 12 Designing your study

Investing time into designing your study pays off in the long run. Testing participants is tedious, hard work. During some periods and for some kinds of participants, it is expensive. You want to be as sure as possible that the experiments you run are capable of addressing the points that you are trying to address. What’s more, you want your experiments to address these points thoroughly and well.

We always try to design experiments according to the following principles, unless we have a very good reason to deviate from them.

12.1 Use computers whenever possible to control stimulus presentation and response collection

Back in the late 1990s when I was an undergraduate student interested in cognitive psychology, I worked in several labs trying to figure out what kind of research I wanted to specialize in. One thing became crystal-clear with experience, and it is even more true now than it was then: if you are serious about becoming a cognitive psychologist, you must be proficient at getting computers to work for you. Without decent tech skills, your work will always be sub-par. Yes, some classic papers were written in which the researchers presented their stimuli in low-tech ways, and the quality of the results was based on their great ideas and commitment to conscientiousness. But even if you have a comparable commitment to conscientiousness and comparably great ideas, there will be someone else who has those things and also has sufficiently good technical skills to automate their research designs. And in 99 cases out of 100, that automation will improve the project. Why is that?

12.1.1 People make more mistakes

Suppose you want to show participants a stimulus for 1 second. For our research, it may not be extremely important whether the participant is exposed to the stimulus for 900 ms versus 1000 ms versus 1200 ms, the fluctuations in timing you might expect if a human is reading out stimuli or manually displaying them on flashcards. But you don’t really know, do you? Suppose a reviewer asks for some confirmation about timing, or thinks that a particular pattern in the data might make sense for stimuli presented for 1100 ms, but would be surprising if they were presented for < 1000 ms. You will need to be reasonably certain about major elements of your methods, and reasonable certainty won’t be possible with manual, human-paced presentation.

Computers have brief hiccups that can cause events to be displaced slightly from their stated timeline. These displacements are usually on the order of 7 ms, and a good presentation program records these blips. People deviate far more. Suppose you are showing a participant stimuli on a series of flashcards, aiming for exposures of 1 second each, and you need to sneeze. There is going to be an irregularity of much more than 7 ms. What’s more, you will have to decide, in real time, whether to record the presence of that irregularity in the trial somewhere, which will again take more than 7 ms. If you decide not to, then we will not know from the data that the irregularity occurred, and that trial may be analyzed as though it were normal. There isn’t a great solution for this. But with a good experimental presentation program, we can always check whether our assumptions about timing were met, and if not, how much deviation there was. We can quickly exclude any trial with massive irregularities if we decide it is important to do so.
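To make this concrete, here is a minimal sketch, assuming PsychoPy, of how a program can log every stimulus onset and offset so timing assumptions can be audited after the fact. The word list and the 1-second target are placeholders, not a lab standard.

```python
# A minimal timing-audit sketch, assuming PsychoPy; the stimuli and the
# 1 s exposure are placeholders. The point is that every exposure is logged.
from psychopy import visual, core

win = visual.Window(fullscr=True)
stim = visual.TextStim(win)
clock = core.Clock()
onsets, offsets = [], []

for word in ["dog", "cat", "sun"]:          # placeholder stimulus list
    stim.text = word
    stim.draw()
    win.callOnFlip(lambda: onsets.append(clock.getTime()))
    win.flip()                              # stimulus appears on this refresh
    core.wait(1.0)                          # intended 1000 ms exposure
    win.callOnFlip(lambda: offsets.append(clock.getTime()))
    win.flip()                              # blank screen; stimulus gone

win.close()
# Afterwards, check how far each exposure deviated from the intended 1 s:
for on, off in zip(onsets, offsets):
    print(f"exposure lasted {(off - on) * 1000:.1f} ms")
```

With a log like this, a question from a reviewer about whether exposures ever drifted past 1100 ms becomes a one-line check rather than a guess.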

Another example: people are notoriously terrible at judging randomness. If we want some variable to be presented in random order, a human will have trouble distributing the levels of that variable in a genuinely random way. What’s more, different humans will have somewhat different ideas about which patterns look random and which don’t. It’s better to let machines sort this out, based on rules we specify.
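For instance, here is one simple way to have the machine produce an order that satisfies an explicit constraint. The rejection-sampling approach and the no-more-than-three-in-a-row rule are illustrative assumptions, not a lab convention:

```python
import random

def constrained_shuffle(trials, max_run=3):
    """Reshuffle until no condition appears more than max_run times in a row.

    Simple rejection sampling: fine for short lists and loose rules, though a
    very restrictive rule would call for a cleverer generator.
    """
    trials = list(trials)
    while True:
        random.shuffle(trials)
        if all(len(set(trials[i:i + max_run + 1])) > 1
               for i in range(len(trials) - max_run)):
            return trials

# e.g., 20 trials of each of two conditions, max three identical in a row
order = constrained_shuffle(["congruent", "incongruent"] * 20)
```

The rule is written down in the code, so everyone applies exactly the same notion of “random enough,” on every list, every time.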

12.1.2 People are biased

The first rule of objective research is that your hypothesis should not be allowed to affect the outcome of a test. This becomes impossible to regulate when the human delivering the stimuli and recording the responses has a hypothesis about how participants will perform under various conditions. Effects of experimenter bias on test outcomes have been demonstrated in many settings. An example: Tryon ran a research program in the 1930s and 1940s aiming to test the genetic inheritance of intelligence in rats. He tested rats on completing a maze, and segregated them into groups based on how well they got through it. He let the ones with the fewest errors interbreed and the ones with the most errors interbreed, and tested their offspring on mazes for generations. Over the generations following the first selection, the rats in the “bright” population performed better and better relative to the rats in the “dull” population.

This may not be so surprising: nearly everyone believes that intelligence is at least partially genetic. But there was more to Tryon’s results than smart rats breeding with other smart rats to produce smart baby rats. Tryon’s studies were not blind: the researchers working with the rats knew which ones were “bright” and which were “dull”. Rosenthal and Fode (1963) replicated the research design, but randomly assigned rats from the same population to “bright” and “dull” groups. There were in fact no systematic differences between their “bright” and “dull” rats, except that the researchers working with them believed that there were. Rosenthal and Fode found the same pattern as Tryon: the supposedly “bright” rats performed better in the mazes than the supposedly “dull” rats.

One can imagine how this played out, and the many points at which bias might have been introduced, affecting the rats’ behavior itself or even just the recordings. The researchers would have cared for the rats, making sure they had food, water, stimulation, and clean bedding. The researchers would have put the rats in the maze, counted their errors, and timed their performance. The “bright” rats might have been petted more, or had their food and water refilled more often or more quickly. When a rat was being put in the maze, there could have been differences in judging what counted as an error, or systematic differences in how quickly the stopwatch was started and stopped. If you expect the “bright” rat to go very fast, perhaps you are especially attentive to its run, and extra vigilant about stopping the timer.

This bias isn’t restricted to rats. It can happen any time a person has to make a judgment, however minor, about what a participant did, or make a decision about when to deliver a stimulus, stop a clock, record an error, etc. Can this be counteracted? We could take care to always ensure data collectors do not know the hypothesis. But then who would collect data? Usually, students collect data for their own project, which they have contributed to developing. They can hardly keep themselves from forming expectations about the results. The best and most practical way to restrict experimenter bias is to take as many decisions as possible out of the hands of the human experimenters during data collection.

12.1.3 It’s more work in the end not to automate

All of our data will eventually need to be quantitatively analyzed. Even data that begin as qualitative, like responses to questions about how participants performed a task, will eventually need to be coded as falling into one response category or another for some kind of quantitative analysis. So, if you plan to have participants write responses by hand, or if you code their spoken responses in real time yourself, you will still need to transfer all of those data to a .csv file. Errors will certainly be introduced in transcription, so it is not enough for one person to transcribe the data: they must be transcribed independently by two people, the two versions compared, and any discrepancies transcribed again.
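A short script can handle the comparison step. This is only a sketch, and the two file names are hypothetical stand-ins for your two coders’ independent transcriptions:

```python
import csv

def discrepancies(file_a, file_b):
    """List (row, column, value_a, value_b) wherever two transcriptions disagree."""
    with open(file_a, newline="") as fa, open(file_b, newline="") as fb:
        rows_a, rows_b = list(csv.reader(fa)), list(csv.reader(fb))
    return [
        (r, c, a, b)
        for r, (row_a, row_b) in enumerate(zip(rows_a, rows_b))
        for c, (a, b) in enumerate(zip(row_a, row_b))
        if a.strip() != b.strip()
    ]

# Hypothetical file names for the two independent transcriptions
for r, c, a, b in discrepancies("coder1.csv", "coder2.csv"):
    print(f"row {r}, column {c}: {a!r} vs {b!r}")
```

Even with the comparison automated, notice how much human labor the double-entry workflow still demands; automated response collection avoids the whole problem.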

Properly programmed and tested, the computer cannot mistake which button the participant pressed, or where the participant clicked the mouse, or what the participant typed. The computer’s “judgments” need to be checked at the start, to be sure we know it is systematically recording what we meant for it to record, but once we are satisfied, we can be confident that it is recording these responses the same way every time. It will not get tired and make random errors. It will not forget which button is which. The responses it records do not usually need the time-consuming checking that human-coded data need. A little time spent upfront will therefore save you hours of painstaking work later.

12.2 Work from a good example

Whether you are a skilled programmer or totally new to it, your first step in programming your study should be to find an example program that does part of what you need, and work from that. Even though we are currently switching from closed-source E-Prime to open-source PsychoPy, we already have programs or programs-in-progress that execute spatial and verbal memory span tasks, display spatial locations around a circular perimeter, and administer complex span tasks. Don’t start from nothing: break down what you need your program to do, and ask other lab members for advice on finding examples.

Examples can come from inside or outside the lab, but if you are programming something that is incrementally different from something we already have, you should work from the lab version. This is important because we want to make sure that different iterations of our studies differ only in ways we are aware of. If, when exploring programs from other labs, you discover a better way of implementing something, put it on the agenda for the next lab meeting, and explain what we would gain from adopting the other method. Then (1) we all learn something new, (2) we decide as a group whether the method you discovered is worth implementing across projects, and (3) we are all at least aware that one iteration of a task is going to differ in implementation from otherwise similar ones.

Another reason to work from an in-lab example when implementing your project is that it increases the likelihood that programming conventions some of us may have previously agreed upon persist in new programs. It is always rough figuring out just what someone else’s code does. The platforms we favor have some features that help with this, and by working from an in-lab example, you may be able to continue in the style that the rest of us understand.

To be clear, we don’t have a formal ban on closed-source software. We may sometimes use E-Prime, MatLab, or whatever else when it is convenient to do so. Good reasons for this include carrying on from existing projects that use those platforms (e.g., avoiding “changing horses mid-stream”), or collaborating with colleagues who insist on them (though we will generally try to persuade them to let us provide open-source materials). If you are starting a new project, though, assume that you will be using open-source software. We do not want to use closed-source software indefinitely, and starting anything new with it perpetuates its use. Consider (especially if you are a student) that your next job may not include access to E-Prime or MatLab, so learning an open-source solution makes it more likely that you can keep working later with the platform you have become familiar with, rather than being forced by a new employer to learn something else.

We favor open-source programming platforms that make use of some sort of graphical user interface, like PsychoPy and OpenSesame. Even if you are an expert coder, you should plan to implement as much as possible using the GUI framework. This is because our lab will never consist exclusively of expert coders who can follow along with your code. Our group will always include students who are brand-new to programming, and we need our general-purpose materials to be comprehensible to new users who may wish to use them as a starting point later. Many (maybe most?) of our programs will include some code components, but when it is possible to build something in the interface, that’s what we do. And because we need newbies to be able to comprehend things, code components should be obsessively commented, along the lines of the sketch below.
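To give a feel for that standard, here is a hypothetical “Begin Routine” snippet from a PsychoPy Builder code component, commented the way we aim for. The variable names (`n_items`, `memory_list`) are invented for this illustration:

```python
# Hypothetical "Begin Routine" code component; variable names are invented.

# Build this trial's memory list: n_items digits, drawn without
# replacement so no digit repeats within a single list.
n_items = 5                      # list length for this block
digits = list(range(10))
shuffle(digits)                  # shuffle() is available in Builder-generated scripts
memory_list = digits[:n_items]

# Save the list to the data file NOW, so it is preserved even if the
# participant quits partway through the trial.
thisExp.addData('memory_list', ''.join(str(d) for d in memory_list))
```

Note that the comments explain *why* each step exists (sampling without replacement, saving early), not just what the line does; that is what a newbie opening the file actually needs.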

12.3 Record as much as you can

Though we are committed to documenting our hypotheses and analysis pipelines before collecting data, it is inevitable that we will sometimes think of a great way to test a hypothesis later. There is nothing inherently wrong with data-driven analysis, so long as we are honest about what we predicted before seeing outcomes, so we should try to facilitate getting as much out of each precious data collection opportunity as possible. Programmers must of course ensure that the dependent variables needed for planned analyses are recorded properly, but they should also consider whether other elements of the response may also be conveniently collected. Examples:

  1. Your main interest may be in response accuracies, but that should not prevent you from also collecting response speeds (and checking that you know how your program records them).

  2. Your main dependent variable might be whether a whole sequential response is 100% correct or not, but that should not prevent you from collecting the components of the response (e.g., that the participant responded 8675309, not just that they got a list “right” or “wrong”).

There could be good reasons to make audio recordings of spoken responses, to capture response speeds for individual elements in a list response, or to record elements I’ve not yet imagined. You should record as much as you ethically can, and try to think in advance about the ways in which various response elements would work together to converge on a hypothesis (and include those in your preregistration when possible). If you want an element later and did not record it, the only solution is to collect data again, which costs time and (sometimes) money. A sketch of what this kind of rich recording can look like follows.
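As one illustration of point 2 above, a serial-recall trial can store the whole typed response and its per-element timings, not just a correctness flag. This is a hedged sketch assuming PsychoPy’s keyboard class; `target_list` and the commented-out column names are placeholders:

```python
# A sketch of recording a full serial-recall response, assuming PsychoPy;
# target_list and the data column names below are placeholders.
from psychopy.hardware import keyboard

kb = keyboard.Keyboard()
target_list = ['8', '6', '7', '5', '3', '0', '9']

kb.clock.reset()                 # response times measured from recall onset
pressed = []
while len(pressed) < len(target_list):
    pressed.extend(kb.getKeys(keyList=[str(d) for d in range(10)],
                              waitRelease=False))

response = [k.name for k in pressed]       # every element, in order
rts = [round(k.rt, 3) for k in pressed]    # per-element response times
correct = int(response == target_list)     # the whole-list score

# Save everything, not just `correct` (placeholder column names):
# thisExp.addData('response', ''.join(response))
# thisExp.addData('rts', rts)
# thisExp.addData('correct', correct)
```

Recording the full response costs a few extra columns now; reconstructing it later, once you realize you want serial-position curves or error analyses, costs a whole new round of data collection.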

12.4 There are always exceptions

There will sometimes be projects for which we don’t follow these principles. Sometimes we will have a good reason for wanting participants to give natural, vocal responses, which currently cannot be evaluated in real time by a computer. There are bound to be other exceptions. Expedience is not a good reason, though. If you have no idea how to implement something, ask the group for support.