Critical appraisal of study validity (Systematic Reviews)
Last updated: August 11th 2020
CEE Standards for conduct and reporting
- An effort should be made to identify all relevant sources of bias (threats to internal and external validity).
- Each relevant type of bias (threat to internal and external validity) should be assessed individually for all included studies.
- Results should be reported using a critical appraisal sheet constructed and tested at the protocol stage.
- Critical appraisal criteria should be consistent between the a priori Protocol and the review, or differences fully explained.
- At least two people should have independently critically appraised each study with disagreements and process of resolution reported.
- A description should be provided of how the information from critical appraisal was used in synthesis.
Some primary studies provide evidence of higher reliability and relevance than others with respect to the review question. Assessing the comparative validity of the included studies (often referred to as critical appraisal) is of key importance to the resulting value of the Systematic Review (see examples in Box 8.1 and Table 8.1). It can form a basis for the differential weighting of studies in later synthesis, or for partitioning studies into subgroups for separate analyses.
Study validity assessment requires a number of decisions about the absolute and relative importance of different sources of bias and data validity elements common to environmental data, particularly the appropriateness of temporal and spatial scales. It is therefore vital that the assessment process be standardised and as transparent and repeatable as possible. This challenge has been extensively covered in the Planning Section (Section 3). Some extra points are made below that may help in the conduct as well as the planning stages.
8.2 Internal validity
In an ideal world, each data set included in a SR should be of high internal validity, thus ensuring that the potential for error and bias is minimised and that any differences in the outcome measure between experimental groups can be attributed to the exposure or intervention of interest. To determine the level of confidence that may be placed in selected data sets, the methodology employed to generate each one must be critically appraised, using a transparent and consistent framework, to assess the extent to which it is likely to prevent systematic errors or bias (Moher et al. 1995). However, the nature of the critical appraisal and the hierarchy employed is dependent on the nature of the question and the ‘theory of change’. The Review Team should be able to justify their approach and not blindly follow an established methodology.
In the health sciences, a hierarchy of research methodology is recognised that scores the value of data in terms of scientific rigour, that is, the extent to which the methodology seeks to minimise error and bias (Stevens & Milne 1997). The hierarchy of methodological design can be viewed as generic and has been translated from medicine to environmental sciences (Pullin & Knight 2003). However, these generic hierarchies are crude tools: they are usually just a starting point and can rarely be used without modification to ensure relevance to individual review questions. Where a number of well-designed, high-validity studies are available, others with inferior methodology may be demoted from subsequent quantitative analysis to narrative tabulation, or rejected from the SR entirely. There are dangers in the rigid application of hierarchies, because the importance of various methodological dimensions within studies will vary depending on the study system to which an intervention is being applied. For example, under a rigid hierarchy a rigorous methodology, such as a randomised controlled trial (RCT), applied over inadequately short time and small spatial scales could be ranked above a time series experiment providing data over longer time and larger spatial scales that were more appropriate to the question. The former has high internal validity but low external validity, or generalisability, in comparison to the latter. This problem carries with it the threat of misinterpretation of evidence. Potential pitfalls of this kind need to be considered at this stage and explored in covariate analyses (e.g. experimental duration or study area: see Downing et al. 1999 and Côté et al. 2001, respectively) or by judicious use of sensitivity analysis at the synthesis stage (see below).
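The sensitivity analysis mentioned above can be illustrated with a minimal sketch. It assumes a fixed-effect, inverse-variance weighted mean as the pooling method, which is one common choice and not something this guidance prescribes. The `Study` class, the function names and all numbers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Study:
    name: str
    effect: float    # effect size estimate (hypothetical value)
    variance: float  # sampling variance of the effect size
    validity: str    # "high", "medium" or "low", taken from critical appraisal


def pooled_effect(studies):
    """Fixed-effect inverse-variance weighted mean of the effect sizes."""
    if not studies:
        raise ValueError("no studies to pool")
    weights = [1.0 / s.variance for s in studies]
    weighted_sum = sum(w * s.effect for w, s in zip(weights, studies))
    return weighted_sum / sum(weights)


def sensitivity_analysis(studies, exclude=("low",)):
    """Pool all studies, then pool again without the excluded validity categories."""
    retained = [s for s in studies if s.validity not in exclude]
    return {
        "all_studies": pooled_effect(studies),
        "restricted": pooled_effect(retained),
    }


# Hypothetical data: three studies with equal variances.
studies = [
    Study("A", 0.50, 0.04, "high"),
    Study("B", 0.30, 0.04, "medium"),
    Study("C", 1.10, 0.04, "low"),
]
result = sensitivity_analysis(studies)
```

If the two pooled estimates differ markedly, the conclusions depend on the low-validity studies, and that dependence should be reported alongside the synthesis.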
Because generic hierarchies rarely fit a review question without modification, authors may use existing critical appraisal checklists or tools as a basis for their own exercise. They should either explain why they use them unmodified (and why no modification was considered necessary) or adapt them to their own review, in which case the decisions made must be stated and justified (see Gough et al. 2012).
We suggest that review-specific a priori criteria for appraising internal validity be included in the Protocol and that two or more assessors appraise each study. The subjective decisions involved may be a focus of criticism; we therefore advocate consultation with subject experts and relevant stakeholders when planning your approach. Pragmatic grouping of studies into high, medium and low validity, based on simple but discriminatory checklists of “desirable” study features, may be necessary if sample sizes are small and do not allow investigation of all the study features individually (for example, Felton et al. 2010 and Isasi-Catalá 2010).
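As a rough sketch of how such a checklist-based grouping and the comparison of two assessors might be recorded, the example below maps the number of checklist features met to a validity category and lists the studies on which two assessors disagree. The checklist items, the thresholds and the appraisal data are all hypothetical; a real review would define and justify its own in the Protocol.

```python
# Hypothetical checklist of "desirable" study features (an assumption,
# not a CEE-recommended list).
CHECKLIST = (
    "randomised_allocation",
    "adequate_replication",
    "matched_comparator",
    "appropriate_duration",
)


def validity_category(features, high_min=4, medium_min=2):
    """Map the number of checklist features met to a validity category.

    The thresholds are illustrative assumptions only.
    """
    met = sum(1 for item in CHECKLIST if features.get(item, False))
    if met >= high_min:
        return "high"
    if met >= medium_min:
        return "medium"
    return "low"


def find_disagreements(assessor_a, assessor_b):
    """Return {study: (category_a, category_b)} where the two assessors differ."""
    differing = {}
    for study in assessor_a.keys() & assessor_b.keys():
        cat_a = validity_category(assessor_a[study])
        cat_b = validity_category(assessor_b[study])
        if cat_a != cat_b:
            differing[study] = (cat_a, cat_b)
    return differing


# Hypothetical appraisals of two studies by two independent assessors.
assessor_a = {
    "study_1": {"randomised_allocation": True, "adequate_replication": True,
                "matched_comparator": True, "appropriate_duration": True},
    "study_2": {"matched_comparator": True, "appropriate_duration": True},
}
assessor_b = {
    "study_1": {"randomised_allocation": True, "adequate_replication": True,
                "matched_comparator": True, "appropriate_duration": True},
    "study_2": {"matched_comparator": True},
}

to_resolve = find_disagreements(assessor_a, assessor_b)
```

Here `to_resolve` contains only `study_2`, which one assessor rates as medium and the other as low. Such cases would go to the resolution process, and both the disagreement and its resolution would be reported.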
The scope of CEE Systematic Reviews is broad and often interdisciplinary. We therefore seek to be inclusive of different forms of evidence, provided that their strengths and weaknesses are properly appraised and comparative study weightings are appropriate. Alongside this inclusivity, we expect high levels of transparency: reviews should provide details of the critical appraisal criteria, how they were applied, and the judgements made on the validity of each study. Normally the full dataset will be provided as an additional supplementary file (see Section 10).
8.3 External validity
External validity is often considered in terms of the relevance of the study: how transferable is it to the context of the question? As noted above, some studies can be of high internal validity (low risk of bias) but may be misleading on account of low external validity (low relevance). A simple example is a high-validity study that has been conducted outside the geographical region of interest, or in a slightly different ecosystem from the one of interest.
Appraisal of study relevance can be a more subjective exercise than appraisal of study reliability. Estimating the external validity of a study may require the construction of review-specific criteria formed by fit to the question elements or similar subjective measures (see Gough et al. 2012 for examples).
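One way to make "fit to the question elements" concrete is sketched below. It assumes that each study is rated separately against the population, intervention, comparator and outcome elements of the question, and that overall relevance is taken as the weakest of those ratings. Both the rating labels and the "weakest element" rule are illustrative assumptions, not a CEE requirement; a Review Team might reasonably prefer a different rule and should justify whichever it uses.

```python
# Rating labels and their ordering are assumptions made for this sketch.
RATING_ORDER = {"low": 0, "medium": 1, "high": 2}
QUESTION_ELEMENTS = ("population", "intervention", "comparator", "outcome")


def external_validity(ratings):
    """Return the weakest rating across the question elements.

    `ratings` maps each question element to "low", "medium" or "high".
    """
    missing = [e for e in QUESTION_ELEMENTS if e not in ratings]
    if missing:
        raise ValueError(f"unrated question elements: {missing}")
    return min(
        (ratings[e] for e in QUESTION_ELEMENTS),
        key=lambda rating: RATING_ORDER[rating],
    )


# Hypothetical study: a good match to the question except that the
# intervention only partly matches the one of interest.
example_ratings = {
    "population": "high",
    "intervention": "medium",
    "comparator": "high",
    "outcome": "high",
}
overall = external_validity(example_ratings)
```

Recording the per-element ratings, and not only the overall judgement, keeps the appraisal transparent and lets readers see which question element limits a study's relevance.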
For transparency of reporting, tables of study validity assessment should be included as an appendix or supplementary material. The data validity assessment can be incorporated in narrative synthesis tables if appropriate. Sufficient text should be provided to enable the reader to navigate the tables and understand the coding and appraisal methods used.
Table 8.1. Elements of a data validity assessment of two studies included in a SR examining impacts of land management on carbon cycling and greenhouse gas fluxes in boreal and temperate lowland peats.
|Element|Study in Slovenia|Study in Finland|
|---|---|---|
|Methods|Site comparison, GHG flux measured weekly for whole year using closed chambers.|Site comparison, GHG flux measured once using closed chambers.|
|Population|Forested peatlands in Slovenia.|Ombrotrophic bog and minerotrophic fen in Finland.|
|Intervention(s)|Drained plot (19th Century).|Drained plots (30 years previously).|
|Comparator-matching|Comparator plots close to intervention but distances not disclosed. Soil types moderately different (intervention=rheic hemic histosol (dystric), control=rheic fibric histosol (dystric)).|pH, %N and water table depth measured in all plots and appear similar.|
|Outcomes|N2O, CO2, and CH4.|CO2 and CH4.|
|Study design|CI (comparator-intervention).|CI (comparator-intervention).|
|Level of replication|Plot-level (1 treatment, 1 control), 3 pseudoreplicate samples per plot.|Plot-level (1 treatment, 1 control); two regions, one with only ombrotrophic bog, the other with ombrotrophic bog and minerotrophic fen. Each site has drained and undrained counterparts. Each site must be treated as a separate study due to substantial differences in plot soil characteristics.|
|Sampling precision|Weekly measurements 60 minutes each with 3 samples per hour (regression modelling), time=zero measurement.|One sample per plot taken between two and five times over a seven-month period (exact number unspecified).|
|Confounding variables|Permanent collars account for soil disturbance, foil-covered chambers reduce temperature effects.|Drained and undrained plots actually differ only very slightly in water table depth, so stated exposure difference may have no real impact. Data extrapolated from very low degree of pseudoreplication (2 to 5 samples over a 7-month period).|
|Conclusions|Small effective sample size, but good outcome measurement precision. High external validity to SR question. Include in review accounting for low replication.|Drained and undrained plots compared in study but also shown to have minimal differences in water table depth (external validity questionable).|
At the end of this stage (if not before) it should become clear what form or forms of synthesis will be possible with the available data. There are a number of different pathways from this point and therefore the following sections become more diverse in terms of the guidance given. They also become more reliant on guiding the reader to more detailed information sources.