Analyzing Findings
Interpreting Experimental Findings
Once data is collected from both the experimental and the control groups, a statistical analysis is conducted to find out if there are meaningful differences between the two groups. A statistical analysis estimates how likely it is that a difference as large as the one found would arise by chance alone (and thus not be meaningful). For example, if an experiment is done on the effectiveness of a nutritional supplement, and those taking a placebo pill (and not the supplement) have the same result as those taking the supplement, then the experiment has shown that the nutritional supplement is not effective. Generally, psychologists consider differences to be statistically significant if there is less than a five percent chance of observing them if the groups did not actually differ from one another. Stated another way, psychologists want to limit the chances of making “false positive” claims to five percent or less.
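To make the five percent threshold concrete, the following minimal sketch uses a permutation test, which is one simple way (among many) to estimate how often a difference between two groups would arise by chance alone. The scores, the function name permutation_p_value, and the choice of a permutation test are all assumptions made for illustration; they are not taken from any actual study.

```python
import random
from statistics import mean


def permutation_p_value(treatment, control, n_shuffles=10_000, seed=0):
    """Estimate how often a difference in group means at least as large as
    the observed one appears when group labels are assigned at random."""
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    extreme = 0
    for _ in range(n_shuffles):
        # Shuffling mimics a world in which the groups do not actually differ.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treat]) - mean(pooled[n_treat:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_shuffles


# Hypothetical outcome scores, invented purely for illustration.
supplement = [14, 15, 13, 16, 15, 17, 14, 16]
placebo = [12, 13, 11, 14, 12, 13, 12, 14]

p = permutation_p_value(supplement, placebo)
print(f"Estimated p = {p:.4f}; below the .05 threshold: {p < 0.05}")
```

Because the group labels are shuffled at random, the resulting proportion is only an estimate. A value below .05 would meet the conventional threshold for statistical significance described above.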
The greatest strength of experiments is the ability to assert that any significant differences in the findings are caused by the independent variable. This occurs because random selection, random assignment, and a design that limits the effects of both experimenter bias and participant expectancy should create groups that are similar in composition and treatment. Therefore, any difference between the groups is attributable to the independent variable, and now we can finally make a causal statement. If we find that observing aggressive behaviour results in more imitated aggression than observing non-aggressive behaviour, we can safely say that exposure to aggressive behaviour causes an increase in imitated aggression.
Reporting Research
When psychologists complete a research project, they generally want to share their findings with other scientists. The American Psychological Association (APA) publishes a manual detailing how to write a paper for submission to scientific journals. Unlike an article that might be published in a magazine like Psychology Today, which targets a general audience with an interest in psychology, scientific journals generally publish peer-reviewed journal articles aimed at an audience of professionals and scholars who are actively involved in research themselves.
A peer-reviewed journal article is read by several other scientists (generally anonymously) with expertise in the subject matter. These peer reviewers provide feedback—to both the author and the journal editor—regarding the quality of the draft. Peer reviewers look for a strong rationale for the research being described, a clear description of how the research was conducted, and evidence that the research was conducted in an ethical manner. They also look for flaws in the study’s design, methods, and statistical analyses. They check that the conclusions drawn by the authors seem reasonable given the observations made during the research. Peer reviewers also comment on how valuable the research is in advancing the discipline’s knowledge. This helps prevent unnecessary duplication of research findings in the scientific literature and, to some extent, ensures that each research article provides new information. Ultimately, the journal editor will compile all of the peer reviewer feedback and determine whether the article will be published in its current state (a rare occurrence), published with revisions, or not accepted for publication.
Peer review provides some degree of quality control for psychological research. Poorly conceived or executed studies can be weeded out, well-designed research can be improved, and ideally, studies can be described clearly enough to allow other scientists to replicate them, which helps to determine reliability.
So why would we want to replicate a study? Imagine that our version of the Bobo doll study is done exactly the same as the original, only using a different set of participants and researchers. We use the same operational definitions, manipulations, measurements, and procedures, and our groups are equivalent in terms of their baseline levels of aggression. In our replication, however, we receive completely different results, and the children do not imitate aggressive behaviours any more than would be expected by chance. If our experimental manipulation is exactly the same, then the difference in results must be attributable to something else that is different between our study and the original, which might include the researchers, participants, and location. If, on the other hand, we were able to replicate the results of the original experiment using different researchers and participants at a different location, then this would provide support for the idea that the results were due to the manipulation and not to any of these other variables. The more we can replicate a result with different samples, the more reliable it is.
In recent years, there has been increasing concern about a “replication crisis” that has affected a number of scientific fields, including psychology. One study found that only about 62% of social science studies reviewed were replicable, and even then their effect sizes were reduced by half (Camerer et al., 2018). In fact, even a famous Nobel Prize-winning scientist has recently retracted a published paper because she had difficulty replicating her results (Nobel Prize-winning scientist Frances Arnold retracts paper, 2020 January 3). These kinds of outcomes have prompted some scientists to begin working together more openly, and some would argue that the current “crisis” is actually improving the ways in which science is conducted and how its results are shared with others (Aschwanden, 2018). One example of this more collaborative approach is the Psychological Science Accelerator, a network of over 500 laboratories, representing 82 countries. This network allows researchers to pre-register their study designs, which minimizes any cherry-picking that might happen along the way to boost results. The network also facilitates data collection across multiple labs, allowing for the use of large, diverse samples and more widespread sharing of results. Hopefully, with a more collaborative approach, we can develop a better process for replicating and quality checking research.
DIG DEEPER
The Vaccine-Autism Myth and Retraction of Published Studies
Some scientists have claimed that routine childhood vaccines cause some children to develop autism, and, in fact, several peer-reviewed publications published research making these claims. Since the initial reports, large-scale epidemiological research has suggested that vaccinations are not responsible for causing autism and that it is much safer to have your child vaccinated than not. Furthermore, several of the original studies making this claim have since been retracted.
A published piece of work can be rescinded when data is called into question because of falsification, fabrication, or serious research design problems. Once a publication is rescinded, the scientific community is informed that there are serious problems with the original work. Retractions can be initiated by the researcher who led the study, by research collaborators, by the institution that employed the researcher, or by the editorial board of the journal in which the article was originally published. In the vaccine-autism case, the retraction was made because of a significant conflict of interest in which the leading researcher had a financial interest in establishing a link between childhood vaccines and autism (Offit, 2008). Unfortunately, the initial studies received so much media attention that many parents around the world became hesitant to have their children vaccinated (Figure 2.19). Continued reliance on such debunked studies has significant consequences. For instance, between January and October of 2019, there were 22 measles outbreaks across the United States and more than a thousand cases of individuals contracting measles (Patel et al., 2019). This is likely due to the anti-vaccination movements that have arisen from the debunked research. For more information about how the vaccine/autism story unfolded, as well as the repercussions of this story, take a look at Paul Offit’s book, Autism’s False Prophets: Bad Science, Risky Medicine, and the Search for a Cure.
Reliability and Validity
Reliability and validity are two important considerations that must be made with any type of data collection. Reliability refers to the ability to consistently produce a given result. In the context of psychological research, this would mean that any instruments or tools used to collect data do so in consistent, reproducible ways. There are a number of different types of reliability. Some of these include inter-rater reliability (the degree to which two or more different observers agree on what has been observed), internal consistency (the degree to which different items on a survey that measure the same thing correlate with one another), and test-retest reliability (the degree to which the outcomes of a particular measure remain consistent over multiple administrations).
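The three types of reliability named above can each be expressed as a number. The sketch below is one illustrative way to do so, using simple percent agreement for inter-rater reliability, Cronbach's alpha for internal consistency, and a Pearson correlation for test-retest reliability. These particular statistics are assumptions chosen for simplicity (researchers use several alternatives), and all of the data and function names are hypothetical.

```python
from math import sqrt
from statistics import variance


def percent_agreement(rater_a, rater_b):
    """Inter-rater reliability: share of observations coded identically."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cronbach_alpha(items):
    """Internal consistency. `items` holds one list of scores per survey
    item, with each list ordered by respondent."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance / variance(totals))


def pearson_r(x, y):
    """Test-retest reliability: correlation between two administrations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical data, invented purely for illustration.
rater_a = ["hit", "hit", "none", "kick", "none", "hit"]
rater_b = ["hit", "none", "none", "kick", "none", "hit"]
survey_items = [[4, 5, 2, 3, 4], [4, 4, 2, 3, 5], [5, 5, 1, 3, 4]]
time_1 = [10, 12, 9, 15, 11]
time_2 = [11, 12, 10, 14, 11]

agreement = percent_agreement(rater_a, rater_b)
alpha = cronbach_alpha(survey_items)
test_retest = pearson_r(time_1, time_2)
print(f"Inter-rater agreement: {agreement:.2f}")
print(f"Cronbach's alpha: {alpha:.2f}")
print(f"Test-retest correlation: {test_retest:.2f}")
```

In each case, values closer to 1 indicate more consistent measurement.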
Unfortunately, being consistent in measurement does not necessarily mean that you have measured something correctly. To illustrate this concept, consider a kitchen scale that would be used to measure the weight of cereal that you eat in the morning. If the scale is not properly calibrated, it may consistently under- or overestimate the amount of cereal that’s being measured. While the scale is highly reliable in producing consistent results (e.g., the same amount of cereal poured onto the scale produces the same reading each time), those results are incorrect. This is where validity comes into play. Validity refers to the extent to which a given instrument or tool accurately measures what it’s supposed to measure, and once again, there are a number of ways in which validity can be expressed. Ecological validity (the degree to which research results generalize to real-world applications), construct validity (the degree to which a given variable actually captures or measures what it is intended to measure), and face validity (the degree to which a given variable seems valid on the surface) are just a few types that researchers consider. While any valid measure is by necessity reliable, the reverse is not necessarily true. Researchers strive to use instruments that are both highly reliable and valid.
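The kitchen scale example can be worked through with a few numbers. In this minimal sketch, the readings, the true weight, and the function name describe_scale are all hypothetical: the spread of repeated readings stands in for reliability, and the gap between the average reading and the true weight stands in for how far the measurement is from being accurate.

```python
from statistics import mean, pstdev


def describe_scale(readings, true_weight):
    """Return (spread, bias) for repeated weighings of the same object.
    A small spread means consistent readings; a large bias means the
    readings are consistently wrong."""
    spread = pstdev(readings)
    bias = mean(readings) - true_weight
    return spread, bias


# Hypothetical: the same 40 g of cereal weighed five times
# on a miscalibrated scale.
true_weight_g = 40.0
readings_g = [45.1, 44.9, 45.0, 45.0, 45.1]

spread, bias = describe_scale(readings_g, true_weight_g)
print(f"Spread of readings: {spread:.2f} g")
print(f"Average error: {bias:+.2f} g")
```

Here the readings barely vary from one weighing to the next, yet every one of them overstates the true weight by about the same amount, which is the pattern of a reliable but inaccurate instrument.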
To illustrate how complicated it can be to determine the validity of a measure, let’s look again at the original Bobo doll study. Bandura and colleagues were not only interested in whether children would imitate aggressive behaviours but also wanted to know if observing same-sex adults would have a greater impact on children’s behaviour than observing adults of a different sex. So how did they define sex? The children involved in the study were nursery school aged, so it’s likely that the researchers simply did a visual assessment or asked the children’s parents. Generally we now use ‘sex’ to refer to the different biological categories people might fit into, while ‘gender’ refers to the socially constructed characteristics we assign to those categories, and is something an individual must define for themselves. In this case, the researchers were really categorizing their participants based on assumed sex, rather than actual biological sex.
A visual assessment of biological sex might seem to have clear face validity to many people, but as a measurement it is low in construct validity. That is, while it is often assumed that sex can be determined by looking at an individual’s appearance, that approach has little ability to accurately measure biological sex. The assumption that biological sex is both binary and visually obvious results in a lack of research in populations of people who exist outside of those assumptions. This in turn means that the measure has little ecological validity, as these people do exist in real-world populations. To illustrate further, let’s consider some of the ways biological sex has been traditionally assessed:
- Visual assessment: In this approach, researchers would record ‘sex’ based on their visual assessment of the clothed participant. This might work for many individuals, but it’s based on an assumption about the individual’s reproductive biology, which is not reliably assessed by external appearance. There are female humans with beards, male humans without Adam’s apples, and of course, transgender, non-binary, and intersex people whose bodies may not fit these assumed categories.
- Medical records or birth certificates: In medical research, data is routinely collected about patients’ demographics, so sex may be assessed simply by looking at a participant’s medical record. However, biological sex assignment of infants at birth is based on a visual assessment of the infant’s external genitalia, which is used to categorize the infant as “male” or “female.” While visually assessing the genitals of an individual may appear to be enough to determine their biological sex, this is unreliable for the reasons outlined above (visual assessment). A larger problem is that this method of sex assignment at birth has been incredibly harmful to intersex people.
- Self-report: To avoid the issues described above, it may seem reasonable to ask participants to self-identify. This can present two issues. (1) Most people’s understanding of their own biological sex is based on their medical record or birth certificate, and not further testing (thus their beliefs about their biological sex may not be congruent with their actual biology). This may seem unlikely, but remember that birth certificates and the determination of an infant’s biological sex at birth are not foolproof ways of getting this information. (2) As with all self-report measures, it’s reliant upon the researchers providing appropriate categories (for example, whether intersex has been included as an option) and the participant being truthful. Participants may not readily volunteer this information for reasons of privacy, safety, or simply because it makes navigating the world easier for them.
While it may seem like something that ought to be easy, truly determining the biological sex of an individual can be difficult. The biological sex of an individual is determined by more than their external genitals, or whether they have a penis/vagina. Some other determinants of biological sex include the internal gonads (ovaries or testes), predominant hormones (testosterone or estrogen), and chromosomal DNA (e.g., XX, XY). It is often assumed that the chromosomal DNA of a person is the truest indicator of their biological sex; however, this does not always “match” gonadal, hormonal, or genital sex. Intersex is a general term applied to a variety of conditions that result in a person being born with anatomy that defies traditional male-female categorization. This could result in a person having atypical external genitalia, genitals that don’t align with internal reproductive organs, or atypical chromosomal structures (despite typical anatomy). For instance, a person could be born with mosaic genetics, with both XX and XY chromosomes. These conditions may be present at birth, but often a person has no reason to suspect they are intersex until they reach puberty. In some cases, the condition may not even be discovered until post-mortem during a medical autopsy.
It’s not impossible to determine an individual’s biological sex, but sex should be operationally defined and data should be collected accordingly. As you learn more about the scientific process and research, it’s important to critically evaluate methods and findings. Participants are frequently divided into male-female categories, and it’s worth exploring the details of the methods to understand how scientists have actually measured biological sex and whether their approach may impact the validity of their conclusions.
Now that we understand the complexity of measuring biological sex, let’s consider how it was assessed in the original Bobo doll study. There is no mention of biological measures being taken, so we can assume that sex categorization was based on visual assessment or on reports from the children’s parents. We know now that this is not an accurate way to measure biological sex, so we should consider why sex was included in the experiment. Given that children were found to perceive their parents as having preferences for them to behave in ‘sex appropriate’ ways (e.g., girls playing with dolls and boys playing with trucks), the researchers hypothesized their participants would be more likely to imitate the behaviour of a same-sex model than a model of a different sex. They were interested in the effect that socially reinforced gender roles would have on imitated behaviour, so the assumed sex of the children was all that mattered, as this would determine what behaviours would be discouraged or reinforced by adults. It’s ok that no biological measures were taken for this study because biological sex wasn’t actually relevant, but terms like sex and gender have often been used interchangeably in research, so it’s important to think critically about how constructs are being operationally defined and measured. If Bandura and colleagues were to replicate this study today, we should hope that they would be more accurate and specific with their terminology and use something like ‘assumed sex/gender’ rather than ‘sex’ to refer to the variable they were interested in. After all, this kind of specificity in language allows for more accurate understanding and replication, which can increase our confidence in their original conclusion. As you move forward in your degree and engage in research more directly, consider carefully what variables you are interested in, and do your best to choose language that is both specific and accurate.