Linguistic Experiments with WebExp2
General Design Issues
I am by no means an expert in experimental design, but I have tried to collect some basic thoughts on the topic on this page. As sources I used Kimmel (1970), Cliff (1996), personal communications with Lera Boroditsky and Ted Gibson, and my own common sense (yes, p.c.). None of them is to blame for whatever you find misleading or even disconcerting on this page.
The basic idea of experiments
The basic idea of experiments is that you come up with a hypothesis (or you take someone else's hypothesis) and you want to test whether this hypothesis holds. Usually you will consider a hypothesis that states that one difference (variation along some kind of dimension) stands in some kind of causal relationship to another difference (variation along another dimension). For the purpose of testing, the former is called your independent variable (or variables, if you're interested in the effect of variations along several dimensions) and the latter is called the dependent variable. NB: Never forget that these variables usually are operationalizations of what you are really interested in! For example, you may have a hypothesis that certain things make a language harder to process. In your experiment you may operationalize "harder" in terms of reading times in a self-paced reading study, but never forget that the reading times you measure are themselves merely a behavioral correlate of some cognitive mechanism, and that it is that mechanism that you're making hypotheses about.
It is important to understand that at least the most interesting hypotheses state causal relationships. Any statistical test we conduct to test such a hypothesis, however, will only be able to detect the presence of correlations, which by definition are non-directional. Another thing to keep in mind is that we will never be able to actually prove a hypothesis. The only thing that can be done is to reject competing hypotheses. This is done by showing that these hypotheses make the wrong predictions about the relation between your independent variable(s) and the dependent variable. There are basically three ways in which the predictions of competing hypotheses can be rejected (here I ignore cases where your hypothesis is that "nothing is going on", i.e. where you're predicting a null effect; accepting a null effect, or rather rejecting the hypothesis of inequality, is possible under certain assumptions, but slightly more complicated than the cases considered here; I refer you to Shravan Vasishth's homepage, which contains instructions on how to conduct such tests).
A competing hypothesis may predict the opposite correlation from yours. For example, a competing hypothesis may predict an increase in acceptability (your dependent variable) given a certain manipulation (of your independent variable(s)), whereas your hypothesis predicts a decrease in acceptability. If you find a significant decrease in acceptability in your experiment, you can reject the competing hypothesis.
Another competing hypothesis that can be rejected in such a case is the null hypothesis, i.e. the hypothesis that "nothing is going on" with regard to the relation between the dependent and the independent variables. The null hypothesis always exists. Even if no other researcher has suggested a competing hypothesis for your phenomenon, you can test your hypothesis by rejecting the null hypothesis (of course, this doesn't prove your hypothesis; it merely puts it out there until someone rejects it).
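As a concrete illustration of testing against the null hypothesis, here is a minimal sketch using only the Python standard library; the data are invented acceptability ratings and the variable names are my own. A real analysis would also compute a p-value (or use dedicated statistics software), but the t statistic already conveys the idea:

```python
# Sketch: a two-sample (Welch's) t statistic computed from scratch.
# The ratings below are invented for illustration only.
from statistics import mean, variance
from math import sqrt

cond_a = [4.1, 3.8, 4.5, 4.0, 4.2, 3.9]   # e.g. baseline condition
cond_b = [2.9, 3.1, 2.7, 3.3, 3.0, 2.8]   # e.g. manipulated condition

def welch_t(x, y):
    """Welch's t statistic for two independent samples."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

t = welch_t(cond_a, cond_b)
print(round(t, 2))  # a large |t| lets you reject the null hypothesis
```

With ratings this clearly separated, the statistic comes out far from zero, which is exactly the situation in which the null hypothesis ("nothing is going on") can be rejected.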
Finally, and this is where things get more complicated, your hypothesis may 'outperform' competing hypotheses quantitatively. I won't say much about this case, since it is still a rather uncommon test in research on the psychology of language (but I think it will become more important). The general idea is that one hypothesis predicts the observed relation between dependent and independent variables better, in the sense that it predicts more of the observed variation. How this can be determined is described in the section on Comparing Models: Goodness-of-fit.
The remainder of this page briefly deals with the following topics:
Finally, I highly recommend reading some papers by Reips, who has laid out the trade-offs of internet-based experimentation. Reips (2002), "Standards for Internet-Based Experimenting", Experimental Psychology, 49(4), pp. 243-256, summarizes the drawbacks and advantages of online experiments, as well as some ways around common problems.
The typical psycholinguistic experiment consists of a number of conditions, which result from, e.g., a factorial design. For example, there may be two factors, one with two levels (i.e. the different values a factor can take) and one with three levels. In a full-factorial design, each level of each factor is crossed with all levels of all other factors. So in the example just mentioned, there would be 2 x 3 = 6 conditions.
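A minimal sketch of how such a full-factorial design can be enumerated (the factor names and levels here are invented for illustration):

```python
# Sketch: enumerating the conditions of a 2 x 3 full-factorial design.
from itertools import product

factors = {
    "NP_form": ["pronoun", "common NP"],    # factor 1: two levels
    "length": ["short", "medium", "long"],  # factor 2: three levels
}

# Cross every level of every factor with all levels of all other factors.
conditions = list(product(*factors.values()))
print(len(conditions))  # 2 x 3 = 6 conditions
for cond in conditions:
    print(dict(zip(factors, cond)))
```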
Since the goal of experiments is to generalize from the observed sample to the overall population, we test the effect of many different instantiations of the conditions (items) on many different people (subjects/participants). For language experiments, an item can be thought of as a lexicalization of the conditions. That is, if an experiment has n conditions, an item will consist of n different stimuli (e.g. sentences) which preferably differ only with regard to the factors that you intend to manipulate. Here is an example.
Let's say we are interested in it-clefts ("It is NP-X, who NP-Y ..."). More precisely, we want to know how the acceptability of it-clefts is influenced by (a) the form of the clefted NP-X and (b) the form of the subject NP-Y. We come up with a 2 (NP-X is a pronoun vs. a common noun phrase) x 2 (NP-Y is a pronoun vs. a common noun phrase) design resulting in 4 conditions. An example item in its four conditions is given below. Note that the four stimuli are identical except for the form of NP-X and NP-Y. That is what it means for the stimuli of an item to "differ only with regard to the factors that you intend to manipulate".
[PRO, PRO] It is you, who I dislike.
[PRO, CNP] It is you, who men dislike.
[CNP, PRO] It is cowards, who I dislike.
[CNP, CNP] It is cowards, who men dislike.
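The way such an item is assembled can be sketched as follows (a hypothetical snippet; the condition labels and fillers follow the example above, with a shared template ensuring that the stimuli differ only in the manipulated factors):

```python
# Sketch: generating one item of the 2 x 2 it-cleft design from a template.
np_x = {"PRO": "you", "CNP": "cowards"}   # clefted NP-X, by condition label
np_y = {"PRO": "I", "CNP": "men"}         # subject NP-Y, by condition label

# One stimulus per condition; everything outside the slots is held constant.
item = {
    (x, y): f"It is {np_x[x]}, who {np_y[y]} dislike."
    for x in np_x
    for y in np_y
}

for cond, sentence in sorted(item.items()):
    print(cond, sentence)
```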
Below, I discuss the following guidelines for good experimental design:
What is a good hypothesis?
Keep it simple and focus on your hypothesis
While it is good to control for possible confounds, there is a trade-off in terms of how much energy you put into the first experiment on a new topic. You don't have to address all possible objections to the interpretation you intend to make (given certain results) in the first experiment. As a rule of thumb, I suggest controlling or balancing for two types of things:
First of all, you should make sure that your hypothesis predicts a different outcome of your experiment than the competing hypotheses do. Your experiment makes no sense if it doesn't distinguish between your hypothesis and competing hypotheses.
Second, you should control or balance differences that are known to affect the type of dependent variable you are measuring (e.g. reaction times, acceptability judgments, etc.). For example, it is well known that plausibility, dependency length, and lexical bias can affect reading times. Similarly, complexity of the stimulus and probably plausibility affect people's judgments of how natural a sentence is. So, if none of these things is what you're interested in, be sure to control or balance them across items.
You can run additional experiments. In fact, most journals require several experiments per publication (at least 2, often more). So, running follow-up experiments isn't only better in terms of getting clean results, it'll also make publishing much easier ;-).
Bias against your hypothesis
For language experiments, we don't only want to randomly sample across subjects. We also want to generalize beyond the sample of language that we used. This means we should randomly sample our items (sentences) from the population (e.g. English). This point isn't trivial. When we sit down and think about items, e.g. verbs, for our experiment, it will usually be the high-frequency words that come to mind first. Depending on your research questions, this can be highly problematic. Use databases (a lexicon, or e.g. the MRC psycholinguistic database) to overcome this bias (e.g. you could use every third verb that meets your criterion in alphabetic order). As with subjects, random sampling of items matters in two ways: (a) you want to be able to generalize beyond your sample (external validity); (b) you want to make sure that differences between the conditions in your experiment are not due to violations of random sampling (which would lead to confounding; internal validity).
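The "every third verb" strategy can be sketched like this (the candidate list is an invented stand-in for a real query against a lexical database such as the MRC database):

```python
# Sketch: systematic sampling from an alphabetized candidate list, instead of
# picking whatever high-frequency verbs come to mind first.
candidate_verbs = ["admire", "annoy", "avoid", "blame", "criticize",
                   "dislike", "fear", "hate", "praise", "respect"]

def systematic_sample(words, step=3, start=0):
    """Take every `step`-th word from an alphabetically sorted list."""
    return sorted(words)[start::step]

print(systematic_sample(candidate_verbs))
```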
The latter, internal validity, is often violated. Suppose you hypothesize that verbs with more participants are more complex to process, and that this leads to lower acceptability scores for sentences with ditransitive verbs than for sentences with intransitive verbs. Now let's say that the intransitive verbs you sample are more frequent than the ditransitive verbs you sample. In that case, any difference you observe could be due to frequency, and you cannot conclude anything from the results. Note further that random sampling may not always solve this problem: in the current example, if intransitive verbs are on average more frequent than ditransitive verbs, random sampling would lead to the same confound. In such cases, you may consider balancing your data (for frequency).
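As an illustration of checking such a balance, here is a sketch that compares the mean log frequency of two verb samples; the frequency counts are invented for illustration:

```python
# Sketch: a quick frequency-balance check between two verb samples.
from statistics import mean
from math import log

intransitive = {"sleep": 5200, "arrive": 3100, "vanish": 400}   # invented counts
ditransitive = {"give": 9800, "send": 4100, "award": 350}       # invented counts

def mean_log_freq(verbs):
    """Mean natural-log frequency of a verb sample."""
    return mean(log(f) for f in verbs.values())

diff = mean_log_freq(intransitive) - mean_log_freq(ditransitive)
print(round(diff, 2))  # a large |diff| signals a frequency confound
```

If the difference is large, one would swap items in and out (or stratify the sampling) until the two sets are matched for frequency.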
Items per condition
You may find this summary of common problems (and solutions) in web experimentation useful (it is based on Reips plus some of my own thoughts).
If you want to run an experiment, and especially if you intend to publish the results, you will have to check whether the experiment needs approval by the Stanford IRB (Institutional Review Board, Human Subjects Research). The department has made an agreement (or, in fact, several agreements) with the IRB that some of the standard experimental methods used by linguists are exempt from IRB approval. Talk to Penny Eckert about this. In case of doubt, or if required, make sure that you get IRB approval for your study. This is a time-consuming process, and the IRB only meets every other month, so make sure to apply early on. You will have to fill out questionnaires about your study and, if it is the first time you register a study, you will have to take an online tutorial on running experiments with human subjects and the ethical constraints/considerations involved.
One of the requirements the IRB has for running experiments is that you get consent from your participants. This is done via an online form that informs subjects about the risks, payment, duration, and tasks involved in the experiment. I strongly recommend asking for informed consent even if you are not required to do so by the IRB. Also, if not required otherwise, I recommend restricting participation in your experiments to adults (whatever the legal age is in the country your participants will come from).
The IRB will inform you of most things that have to be considered in running (online) experiments, but I have summarized some that I find very important below. Many of these points clearly serve your own interest in addition to being the right thing to do.