Section 3

Planning a CEE Evidence Synthesis

Last updated: March 15th 2021.

To meet CEE standards for the conduct of Evidence Syntheses, the Review Team will need to establish an a priori Protocol detailing how they will conduct each stage of the Evidence Synthesis. The Protocol sets out how the question was formulated and how each stage of the synthesis will be conducted, and is submitted for approval and registration by CEE in advance of conducting the synthesis. The steps that aid planning the conduct of each stage are described in this section, followed by guidance on the structure of the Protocol itself (Section 4). In addition, a set of checklists (ROSES) is available for use during the preparation of any systematic evidence synthesis. Throughout the writing of the protocol or final map/review, the checklists indicate the correct level of detail to be reported, so that high standards of replicability are achieved.

3.1 Scoping the evidence

Before the commencement of an Evidence Synthesis, it is essential that some ‘scoping’ is undertaken to guide the construction of a comprehensive and appropriate Protocol, and to provide an indication of the likely form of the synthesis and thus facilitate resource planning. In certain circumstances, it may not be efficient to commit to a synthesis without some prior estimation of its value in terms of the likely extent and reliability of its findings. In addition, when scoping a Systematic Review, an estimate of the type of data (quantitative, qualitative) may be desirable to inform the type of data synthesis that might be appropriate.

Scoping may be undertaken by the commissioning organisation, by the Review Team, or a combination of the two. A thorough scoping exercise might entail:

  • The development and testing of a search strategy (see below).
  • An estimate of the volume of relevant literature and the volume of material likely to be unavailable in easily-accessible format (see below).
  • An estimate of the resources required based on the above, including the time and personnel needed to search and sort the literature, possible financial resources to obtain some articles, contact some authors or use translators, and the possible need for statisticians if quantitative data are identified during this scoping stage.
  • An estimate of the study types likely to be found (as identified through focused data extraction of a small subset of relevant papers). This may indicate whether a meta-analysis will be possible (for Systematic Review only).

The expected output from a scoping exercise is an estimate of the quantity of evidence, and a characterisation of the likely evidence base, pertaining to the question (see Box 3.1 for an example). The extent of investment in scoping required to meet CEE standards will differ with each Evidence Synthesis. We detail below the steps of a full scoping exercise.

3.2 Developing and testing a search strategy

Systematic and comprehensive searching for relevant studies is essential to minimise bias (see Section 5). The searching step requires more planning and preparation than other stages and so most of this Section is devoted to this task. Enlisting an information specialist in the review team is recommended so that an efficient search strategy can be established. Aside from validity, a good search strategy can make a substantial difference to the time and cost of a synthesis. A step-by-step overview of the search process for evidence synthesis is illustrated in Figure 3.1.

Figure 3.1 A guide to the planning, conduct, management and reporting of the searching phase of systematic reviews and systematic maps (after Livoreil et al. 2017).

In practice, it is unlikely that absolutely all of the relevant literature can be identified during an evidence synthesis search, for several reasons: (1) literature is often searched and examined only in those languages known to the project team; (2) some articles may not be accessible because of paywalls or confidentiality; (3) others lack an abstract or have unhelpful titles, which makes them difficult to identify; (4) others may simply not be indexed in a searchable database. Within these constraints, searches conducted for evidence synthesis should be as comprehensive as possible, and they should be documented so that they can be repeated and readers can appreciate their strengths and weaknesses. Reporting any limitations to searches, such as unavoidable gaps in coverage (e.g. lack of access to some literature), is an important part of the search process, to ensure that readers have confidence in the review methods, and to qualify the interpretation of the evidence synthesis findings.

Steps involved in planning a search are presented in chronological order, bearing in mind that some of the process may be iterative. We also highlight the methods that enable the project team to identify, minimise and report any risks of bias that may affect the search and how this can affect the findings of an evidence synthesis.

We use the following terminology: search terms are the individual or compound words used in a search to find relevant articles. A search string is a set of search terms combined using Boolean operators. A search strategy is the whole search methodology, including search terms, search strings, the bibliographic sources searched, and enough information to ensure the reproducibility of the search. Bibliographic sources (see below for more details) capture any source of references, including electronic bibliographic databases, those sources which would not be classified as databases (e.g. the Internet via search engines), hand searched journals, and personal contacts.

Preventing errors and biases

Conducting a rigorous evidence synthesis means trying to minimise the risks of errors and biases, which may arise at all stages. Errors that can occur during the search include: missing search terms, unintentional misspelling of search terms, errors in the search syntax (e.g. inappropriate use of Boolean operators, see below) and inappropriate search terms. Such problems may be minimised when the search term identification process is conducted rigorously, and by peer-reviewing the search strategy, both within and outside the project team, during development of the Protocol (see Section 4).

Biases (systematic errors) in the search strategy may affect the search outcomes (Song et al. 2010). The methods used to minimise bias should be reported in the Protocol and the final review or map. Minimising bias may require (1) looking for evidence outside traditional academic electronic bibliographic sources (e.g. grey literature); (2) using multiple databases and search tools to reduce the possibility of bias in the retrieved results; and (3) contacting organisations or individuals who may have relevant material (Bayliss & Beyer 2015). Some biases have been listed in Bayliss & Beyer (2015) and a few of them are reported here to be considered by project teams as appropriate:

  • Language bias (Song et al., 2010) means that studies with significant or ‘interesting’ results are more likely to be published in the English language and are easier to access than results published in other languages. The impacts of this on synthesis outcomes have been evaluated, and the consequences of omitting non-English-language studies could be serious (e.g. providing a different direction of mean effect; Konno et al. 2020). The way to reduce the risk of language bias is to look beyond the English language literature.
  • Prevailing paradigm bias (Bayliss & Beyer, 2015) suggests that studies relating to or supporting the prevailing paradigm or topic (for example climate change) are more likely to be published and hence discoverable. To reduce this bias, Review Teams should not rely only on finding well known relevant studies.
  • Temporal bias includes the risk that studies supporting a hypothesis are more likely to be published first (Bayliss & Beyer, 2015). The results may not be supported by later studies (Leimu and Koricheva, 2004). Due to the culture of ‘the latest is best’, older articles may be overlooked and misinterpretations perpetuated. The ways to reduce this bias include searching older publications, considering updating the search in the future, or testing statistically whether this bias significantly affects the results of studies.
  • Publication bias (Dickersin, 2005; Hopewell et al. 2007; Song et al., 2010) refers to asymmetry in the likelihood of publishing results: statistically significant results (positive results) are more likely to be accepted for publication than non-significant ones (negative results). This has been a source of major concern for systematic reviews and meta-analysis as it might lead to overestimating an effect/impact of an Intervention or Exposure on a Population (e.g. Gurevitch & Hedges, 1999; Rothstein et al. 2005; Lortie et al. 2007). To minimise this bias, searches for studies reporting non-significant results (most probably found in grey literature and studies in languages other than English) should be conducted in all systematic reviews and maps (Leimu & Koricheva 2005).

Structuring the search with PICO/PECO elements

An evidence synthesis process starts with a question that is usually structured into “building blocks” (concepts or elements), some of which are then used to develop the search strategy. The search strategy illustrated below is based on PICO/PECO elements, which are commonly used in CEE evidence synthesis. Other elements and question structures exist (see Section 2). In any of these question structures it is possible to narrow the question (and the search) by adding additional search terms defining the Context or Setting of the question (e.g. “tropical”, “experimental”, or “Pleistocene”). Searching for geographic location is not recommended because location names may be difficult to list or duplicate when the geographical range is broad. Geographical elements (e.g. name of the country) may, instead, be more efficiently used as eligibility screening criteria (see below).

Use of multiple languages

Identifying which languages are most relevant for the search may depend on the topic of the evidence synthesis. There are two main challenges with languages for an evidence synthesis: translating search terms into various languages to capture as many relevant articles as possible, and then being able to select and use papers that are not written in a language spoken by the project team members. In many electronic bibliographic sources, articles written in languages other than English can be discovered using English search terms. However, a large literature in languages other than English remains to be discovered in national and regional databases, e.g. CiNii for Japanese research. Searching is likely to require a range of languages when relevant articles are produced at national level, as many of them will be published in the official language of those nations (Corlett, 2011). Reporting the choice of language(s) in the Protocol and in the final synthesis report is important to enable repetition and updating when appropriate.

Human resources needed for searching

Each evidence synthesis is conducted by a project team. It may be composed of a project leader and associated experts (thematic and methodological). Because of the systematic aspect of the searching and the need to keep careful track of the findings, project teams should, when possible, include librarians or information specialists. Subject specialist librarians are conversant with bibliographic sources, and are often very familiar with the nuances of different transdisciplinary and subject-specific resources (Zhang et al. 2006). They are aware of the broad range of tools available for undertaking literature searches and they are aware of recent improvements in the range and use of those tools. They are also expert in converting research questions into search strategies. Such experts can themselves benefit by contributing to a project team since their institutions may require demonstration of collaborative work (Holst et al. 2005).

3.2.1 Planning the search strategy

The first step in planning a search is to design a strategy to maximise the probability of identifying relevant articles whilst minimizing the time spent doing so.  Planning may also include discussions about eligibility criteria for subsequent screening (Frampton et al. 2017) as they are often linked to search terms. Planning should also include discussions about decision criteria defining when to stop the search as resource constraints (such as time, manpower, skills) may be a major reason to limit the search and should be anticipated and explained in the Protocol.

Establishing a test-list

A test-list is a set of articles that have been identified as relevant to answer the question of the evidence synthesis (e.g. are within the scope and provide some evidence to answer the question). The test-list can be created by asking experts, researchers and stakeholders (i.e. anyone who has an interest in the review question) for suggestions and by perusing existing reviews. The project team should read the articles of the test-list to make sure they are relevant to the synthesis question. Establishing a test-list is independent of the search itself and is used to help develop the search strategy and to assess the performance of the search strategy. The performance of a search strategy should be reported, i.e. whether the search strategy correctly retrieves relevant articles and whether all available relevant literature to answer the evidence synthesis question is likely to have been identified. The test-list may be presented in the Protocol submitted for peer-review.

The test-list should ideally cover the range of authors, journals, and research projects within the scope of the question. In order to be an effective tool it needs to reflect the range of the evidence likely to be encountered in the review. The number of articles to include in the test-list is a case-by-case decision and may also depend on the breadth of the question. When using a very small test-list, the project team may inappropriately conclude that the search is effective when it is not. Checking the search against the test-list may show the project team that the search strategy needs improving, or help them decide when to stop the search.

Identifying search terms

An efficient search string finds as many of the relevant papers as possible, so that the project team will not have to run the search again during the conduct of the evidence synthesis. Moreover, it may be re-used as it stands when amending or updating the search in the future, saving time and resources. Initial search terms can usually be generated from the question elements and by looking at the articles in the test-list. However, authors of articles may not always describe the full range of the PICO/PECO criteria in the few words available in the title and abstract. As a consequence, building search strings from search terms requires project teams to draw upon their scientific expertise, a certain degree of imagination, and an analysis of titles and abstracts to consider how authors might use different terminologies to describe their research.

Reading the articles of the test-list as well as existing relevant reviews often helps to identify search terms describing the population, intervention/exposure, outcome(s), and the context of interest. Synonyms can also be looked for in dictionaries. An advantage of involving librarians in the project team and among the peer-reviewers is that they bring their knowledge of specialist thesauri to the creation of search term lists. For example, for questions in agriculture, CAB Abstracts provides a thesaurus whose terms are added to database records. The thesaurus terms can offer broad or narrow concepts for the search term of interest, and can provide additional ways to capture articles or to discover overlooked words. As well as database thesauri that offer terms that can be used within individual databases, there are other thesauri that are independent of databases. For example, the Terminological Resource for Plant Functional Diversity offers terms for 700 plant characteristics, plant traits and environmental associations. Experts and stakeholders may suggest additional keywords, for instance when an intervention is related to a special device (e.g. technical name of an engine, chemical names of pollutants) or a population is very specific (e.g. taxonomic names which have been changed over time, technical terminology of genetically-modified organisms). Other approaches can be used to identify search terms and facilitate eligibility screening (e.g. text-mining, citation screening, cluster analysis and semantic analysis) and are likely to be helpful for CEE evidence synthesis.

The search terms identified using these various methods should be presented as part of the draft evidence synthesis Protocol so that additional terms may be suggested by peer-reviewers. Once the list is finalised in the published Protocol it should not be changed, unless justification is provided in the final evidence synthesis report.

Developing search strings

The development of effective search strings (combinations of key words and phrases) for searching should take place largely during the planning stage, and will most likely be an iterative process, testing search strings using selected databases, recording numbers of references identified and sampling titles for proportional relevance or specificity (the proportion of the sample that appears to be relevant to the Evidence Synthesis question). Sensitivity (the proportion of potentially relevant articles identified as estimated using the test list) should improve as testing progresses and reach 100% when results from databases are combined. The iterative process may include considering synonyms, alternative spellings, and non-English language terms within the search strategy. An initial list of search terms may be compiled with the help of the commissioning organisation and stakeholders. All iterations of tested terms should be recorded, along with the number of references (hits) they return. This should be accompanied by an assessment of proportional relevance, so that the usefulness of individual search terms can easily be examined. Comparing search results when you include or exclude particular terms will allow you to identify superfluous or ineffective terms, and work out whether any should be removed from your search strategy.  It is important to remember, however, that the functionality of different literature databases may vary considerably and terms that are apparently useful in one source will not always be appropriate in others: thus, search strings may need to be modified to suit each one.
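The record-keeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name SearchTest, its field names, the source name "Database A" and all of the numbers are hypothetical and are not part of the CEE guidance. Each tested iteration stores the number of hits and the result of screening a small sample of titles, from which proportional relevance and a rough estimate of the number of relevant records can be derived.

```python
from dataclasses import dataclass


@dataclass
class SearchTest:
    """One tested iteration of a search string (all fields are illustrative)."""
    search_string: str
    source: str           # bibliographic source in which the string was tested
    date: str             # date of the test search
    hits: int             # number of references returned
    sample_size: int      # number of sampled titles that were checked
    sample_relevant: int  # sampled titles judged relevant to the question

    @property
    def proportional_relevance(self) -> float:
        """Proportion of the sample that appears relevant."""
        if self.sample_size == 0:
            return 0.0
        return self.sample_relevant / self.sample_size

    @property
    def estimated_relevant(self) -> float:
        """Crude extrapolation of the number of relevant records among all hits."""
        return self.hits * self.proportional_relevance


# Hypothetical log of two iterations tested in the same source.
log = [
    SearchTest("forest AND fire", "Database A", "2021-03-01", 12000, 100, 5),
    SearchTest("(forest OR woodland) AND (fire OR burn*)", "Database A",
               "2021-03-02", 20000, 100, 4),
]

for test in log:
    print(f"{test.date}  {test.source}  hits={test.hits}  "
          f"relevance={test.proportional_relevance:.2f}  "
          f"estimated relevant={test.estimated_relevant:.0f}  "
          f"{test.search_string}")
```

Keeping every iteration in one structure of this kind makes it straightforward to compare the effect of adding or removing a term, and to export the log alongside the Protocol.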

Boolean operators (AND, OR, NOT) specify logic functions. They are used to group search terms into blocks according to the PICO or PECO elements, so that the search is structured and easy to understand, review and amend, if necessary. AND and OR are at the core of the structure of the search string. Using AND decreases the number of articles retrieved whilst using OR enlarges it, so combining these two operators will change the exhaustivity and precision of the search.

OR is used to identify bibliographic articles in which at least one of the search terms is present. OR is used to combine terms within one of the PICO elements, for example all search terms related to the Population. Using “forest OR woodland OR mangrove” will identify documents mentioning at least one of the three search terms.

AND is used to narrow the search as it requires articles to include at least one search term from the lists given on each side of the AND operator. Using AND identifies articles which contain, for example, both a Population AND an Intervention (or Exposure) search term. For instance, a search about a population of butterflies exposed to various toxic compounds and then observed for the outcomes of interest can be structured as three sets of search terms combined with AND as follows: “(lepidoptera OR butterfly OR coleoptera OR beetle) AND (toxi* OR cry* OR vip3* OR Bacillus OR bt) AND (suscept* OR resist*)”. Note: truncating words with * (see Box 3.1) at 3 characters (e.g. cry* in this example) may find many irrelevant words and may not be recommended.
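As a minimal sketch of this block structure, the Python snippet below assembles the example string from three lists of terms: terms within a PICO/PECO element are joined with OR, and the elements are then joined with AND. The function names or_block and build_search_string are hypothetical, and wrapping multi-word terms in double quotes is an assumption about the target database syntax, which varies between sources.

```python
def or_block(terms):
    """Join the terms of one PICO/PECO element with OR, inside parentheses."""
    # Assumption: multi-word phrases are quoted; check each source's help pages.
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_search_string(blocks):
    """Combine the OR blocks of all elements with AND, in the order given."""
    return " AND ".join(or_block(terms) for terms in blocks.values())


# Terms taken from the butterfly example in the text above.
blocks = {
    "population": ["lepidoptera", "butterfly", "coleoptera", "beetle"],
    "exposure": ["toxi*", "cry*", "vip3*", "Bacillus", "bt"],
    "outcome": ["suscept*", "resist*"],
}

search_string = build_search_string(blocks)
print(search_string)
```

Holding the terms in lists rather than in one long string makes it easier to review each element separately and to adapt the syntax for each bibliographic source.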

NOT is used to exclude specified search terms or PICO elements from search results. However, it can have unanticipated results and may exclude relevant records. For this reason, it should not usually be used in search strategies for evidence synthesis. For example, searching for ‘rural NOT urban’ will remove records with the word ‘urban’, but will also remove records which mention both ‘rural’ AND ‘urban’.
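The effect of NOT can be illustrated with a toy set of records in Python. The three titles below are invented for the illustration, and matching whole words in a title is a simplification of how a real database matches terms.

```python
# Hypothetical records: identifier -> title.
records = {
    1: "Bird diversity in rural landscapes",
    2: "Bird diversity in urban parks",
    3: "Bird diversity along a rural to urban gradient",
}


def mentions(term, text):
    """True if the term appears as a whole word in the text (case-insensitive)."""
    return term in text.lower().split()


rural = {rid for rid, title in records.items() if mentions("rural", title)}
urban = {rid for rid, title in records.items() if mentions("urban", title)}

rural_not_urban = rural - urban  # what 'rural NOT urban' retrieves
lost = rural & urban             # rural records removed because they also mention urban

print("rural NOT urban:", sorted(rural_not_urban))
print("relevant records lost:", sorted(lost))
```

Record 3 concerns rural areas but is excluded because it also contains the word ‘urban’, which is the unanticipated loss described above.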

Proximity operators (e.g. SAME, NEAR, ADJ, depending on the source) can be used to constrain the search by defining the number of words between the appearance of two search terms. For example, in the Ovid interface “pollinators adj4 decline*” will find records where the two search terms “pollinators” and “decline” are within four words of each other. Proximity operators are more precise than using AND, so may be helpful when a large volume of search results is being returned.

Box 3.1. Example of test search

All test searches should be carefully recorded (including the date of the search) and saved so that they may be accessed later, removing duplication of effort where possible. However, since the test searches are conducted in advance of the actual search, it will be necessary to update the search again in order to check whether any recent literature has become available. In the larger bibliographic databases and services it is possible to save searches and set up an alert service that will periodically run the saved searches and return new records. This can be useful if the testing occurs well in advance of the synthesis, or if the synthesis runs over a long period of time.

A high-sensitivity and low-specificity approach is often necessary to capture all or most of the relevant articles available, and reduce bias and increase repeatability in capture (see below). Typically, large numbers of articles are therefore identified but rejected at the title and/or abstract screening stage.

A final step in the development of the search terms and strings is to test the strategy with the test list. A comprehensive set of terms and strings with an appropriate balance of specificity and sensitivity SHOULD retrieve these relevant articles without returning an unmanageable number of irrelevant articles. Reasons why any articles from the test list were not retrieved should be investigated so that the search strategy can be appropriately modified to capture them.

The Review Team should report the performance of the search strategy in the Protocol, with an update in the final report, for example as the percentage of the test-list finally retrieved by the search strategy when applied in each electronic bibliographic source (e.g. Söderström et al. 2014, Haddaway et al. 2015). A high percentage is one indicator that the search has been optimised and that the conclusions of the review rely on a range of available relevant articles that reflect at least those provided by the test-list. A low percentage would indicate that the conclusion of the review could be susceptible to change if other ‘missed’ articles are added. The test list should be fully captured when searches from all bibliographic sources are combined.
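One way to compute this percentage is sketched below in Python. The identifiers, the source names and the function name percent_retrieved are hypothetical. The sketch assumes that each article can be matched between the test-list and the search results by a shared identifier, which in practice may need some cleaning of titles or DOIs first.

```python
# Hypothetical test-list of four articles known to be relevant.
test_list = {"ref-01", "ref-02", "ref-03", "ref-04"}

# Hypothetical results of running the search strategy in two sources.
retrieved_by_source = {
    "Source A": {"ref-01", "ref-02", "ref-90"},
    "Source B": {"ref-02", "ref-03"},
}


def percent_retrieved(test_list, retrieved):
    """Percentage of the test-list found among the retrieved records."""
    return 100 * len(test_list & retrieved) / len(test_list)


# Performance in each source taken separately.
per_source = {
    source: percent_retrieved(test_list, found)
    for source, found in retrieved_by_source.items()
}

# Performance when the results from all sources are combined.
combined = set().union(*retrieved_by_source.values())
combined_percent = percent_retrieved(test_list, combined)

# Test-list articles that no source retrieved; these should be investigated.
missed = sorted(test_list - combined)

print(per_source, combined_percent, missed)
```

In this invented case each source retrieves half of the test-list, the combined search retrieves three of the four articles, and the missed article would prompt a revision of the search terms or the addition of another source.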

Assessing the volume of literature

The volume of literature arising from test searches may be used as a predictor of the extent of the evidence base and a crude predictor of its strength (number of rigorous studies). For example, it may indicate whether the review question is likely to identify a knowledge gap (very few articles), whether it seems too broad and should be broken down or targeted toward a systematic map approach (very many diverse articles covering a range of populations, interventions and/or outcomes), or whether it has the potential to provide an answer to the question as it is currently phrased and with the resources highlighted by the scoping exercise (nothing needs to be changed). This has implications in terms of the time and resources required to complete the review. Note, however, that the total number of returned articles is likely to reflect the specificity of the chosen search terms (and possibly the searching skills of the Review Team) and is only an indicator. This number can then be used to extrapolate the likely quantity (but not quality) of articles relevant to the review question. The volume of literature that is likely to be difficult to access (in languages unfamiliar to the Review Team, or in publications that are not available electronically or not readily available in libraries) should, if possible, be assessed at this stage.

Identifying relevant sources of articles

Various sources of articles relevant to the question may exist. Understanding the coverage, the functions and limitations of information sources can be time-consuming, so involving a librarian or information specialist at this stage is highly recommended. We will use bibliography to refer to a list of articles generally described by authorship, title, year of publication, place of publication, editor, and often, keywords as well as, more recently, DOI identifiers. A bibliographic source allows these bibliographies to be created by providing a search and retrieval interface. Much of the information today is likely to come from searches of electronic bibliographic sources, which are becoming increasingly comprehensive with the passage of time as more material is digitised. Here we use the term electronic bibliographic source in the broad sense. It includes individual electronic bibliographic sources (e.g. Biological Abstracts) as well as platforms that allow simultaneous searches of several sources of information (e.g. Web of Science) or could be accessed through search engines (such as Google). Platforms are a way to access databases.

Coverage and accessibility

Several sources should be searched to ensure that as many relevant articles as possible are identified (Avenell et al., 2001; Grindlay et al. 2012). A decision needs to be made as to which sources would be the most appropriate for the question. This mostly depends on the disciplines addressed by the question (e.g. biology, social sciences, other disciplines) and the identification of sources that may provide the greatest quantity of relevant articles for a limited number of searches and their contribution in reducing the various biases described earlier in the paper (see 1.3). The quantity of results given by an electronic bibliographic source is NOT a good indicator of the relevance of the articles identified and thus should not be a criterion to select or discard a source. Information about access to databases and articles (coverage) can be obtained directly from the project team by sharing knowledge and experience, asking librarians and information experts and, if needed, stakeholders. Peer-review of the evidence synthesis Protocol may also provide extra feedback and information regarding the relevance of searching in some other sources.

Some sources are open-access, such as Google Scholar, whereas others require subscription such as Scopus. Therefore, access to electronic bibliographic sources may depend on institutional library subscriptions, and so availability to project teams will vary across organisations. A diverse project team from a range of institutions may therefore be beneficial to ensure adequate breadth of search strategies. When the project team does not have access to all the relevant bibliographic sources, it should explain its approach and list the sources that were available but not searchable and acknowledge these limitations. This may include indications as to how to further upgrade the evidence synthesis at a later stage.

Types of sources

In this subsection we first present bibliographic sources which allow the use of search strings, mostly illustrated from the environmental sciences. An extensive list of searchable databases for the social sciences is available in Kugley et al. (2016). Other sources and methods mentioned below (such as searches on Google) are complementary but cannot be the core strategy of the search process of an evidence-synthesis as they are less reproducible and transparent.

Bibliographic sources may vary in the search tools provided by their platforms. Help pages give information on search capabilities and these should be read carefully. Involving librarians who keep up-to-date with developments in information sources and platforms is likely to save considerable time.

Electronic bibliographic sources

The platforms which provide access to bibliographic information sources may vary according to:

A) Platform issues

  • the syntax needed within search strings (see 2.2) and the complexity of search strings that they will accept
  • access: not all bibliographic sources are completely accessible. It depends on the subscriptions available to the project team members in their institutions. The Web of Science platform, for example, contains several databases, and it is important to check and document which ones are accessible to the project team via that platform.

B) Database issues

  • disciplines: subject-based bibliographic sources (CAB ebooks; applied life sciences, agriculture, environment, veterinary sciences, applied economics, food science and nutrition) versus multidisciplinary sources (Scopus, Web of Science);
  • geographical regions (e.g. Latin America, HAPI-Hispanic American Periodicals Index, or Europe CORDIS). It may be necessary to search region-specific bibliographic sources if the evidence-synthesis question has a regional focus (Bayliss & Beyer, 2015);
  • document types: scientific papers, conference or proceedings, chapters, books, theses. Many university libraries hold digital copies of their theses, such as the EThOS British Library thesis database. Conference papers may be a source of unpublished results relevant for the synthesis, and may be found through the BIOSIS Citation index or the Conference Proceedings Citation Index (Thomson Reuters 2016, in Glanville, in press);
  • durations (at the time of writing, in the Web of Science Core Collection some articles may be accessible from 1900, although by no means all; in Scopus they may date from 1960).

Publishers’ databases

The websites of individual commercial publishers may be valuable sources of evidence, since they can also offer access to books, chapters of books, and other material (e.g. datasets). Using their respective search tools and related help pages allows the retrieval of relevant articles based on search terms. For example, Elsevier’s ScienceDirect and Wiley Interscience are publishers’ platforms that give access to their journals, their tables of contents and (depending on licence) abstracts and the ability to download the article.

Web-based search engines

Google is one example of a web-based search engine that searches the Internet for content including articles, books, theses, reports and grey literature (see 1.5 and 2.5 Grey literature). It also provides its own search tools and help pages. Such resources are typically not transparent (i.e. they order results using an unknown and often changing algorithm; Giustini & Boulos, 2013) and are restricted in their scope or in the number of results that can be viewed by the user (Google Scholar). Google Scholar has been shown not to be suitable as a standalone resource in systematic reviews, but it remains a valuable tool for supplementing bibliographic searches (Bramer et al. 2013; Haddaway et al. 2015) and for obtaining full-text PDFs of articles. BASE (Bielefeld Academic Search Engine) is developed by the University of Bielefeld (Germany) and gives access to a wide range of information, including academic articles, audio files, maps, theses, newspaper articles, and datasets. It lists sources of data and displays detailed search results so that transparent reporting is facilitated (Ortega 2004).

Full-text documents will be needed only once the search results have been screened for eligibility on title and abstract and the retained records need to be screened at full text (see Frampton et al. 2017). Limited access to full texts might be a source of bias in the synthesis, and finding documents may be time-consuming as it may involve inter-library loans or direct contact with authors. Documents can be obtained directly if the articles (a) are open-access, (b) have been placed on an author’s personal webpage, or (c) are included in the project team’s institutional subscriptions. Checking institutional access when listing the bibliographic sources may help the project team anticipate the need for extra support.

Choosing bibliographic management software

Bibliographic searches may produce thousands or sometimes tens of thousands of references that require screening for eligibility and so it is important to ensure that search results are organised in such a way that they can be screened efficiently for their eligibility for an evidence synthesis. Key actions that will be necessary before screening can commence are to assemble the references into a library, using one or more bibliographic reference management tool(s); and to identify and remove any duplicate references.
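As an illustration of the duplicate-removal step, the following is a minimal sketch rather than the behaviour of any particular reference management tool. It assumes that references have been exported as records with `title`, `year` and (optionally) `doi` fields; these field names, the example records and the matching rule are all hypothetical.

```python
import re

def normalise_title(title):
    """Lower-case a title and collapse punctuation and whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def reference_key(ref):
    """Prefer the DOI as an identifier; fall back to normalised title plus year."""
    doi = (ref.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("title", normalise_title(ref["title"]), ref.get("year"))

def remove_duplicates(references):
    """Return (unique, duplicates), keeping the first occurrence of each key."""
    seen = set()
    unique, duplicates = [], []
    for ref in references:
        key = reference_key(ref)
        if key in seen:
            duplicates.append(ref)
        else:
            seen.add(key)
            unique.append(ref)
    return unique, duplicates

# Hypothetical records, as might be exported from two bibliographic sources.
library = [
    {"title": "Effects of grazing on birds.", "year": 2010, "doi": "10.1000/abc"},
    {"title": "Effects of Grazing on Birds", "year": 2010, "doi": "10.1000/ABC"},
    {"title": "Hedgerow management: a review", "year": 2012, "doi": None},
    {"title": "Hedgerow management - a review", "year": 2012, "doi": None},
]
unique, duplicates = remove_duplicates(library)
print(len(unique), "unique;", len(duplicates), "duplicates removed")
```

A rule this simple will miss some duplicates (for example, the same article exported once with and once without a DOI), so the removed records should be kept and the numbers reported, and near-matches may still need checking by hand.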

Assembling references

A range of bibliographic reference management tools are available into which search results may be downloaded directly from bibliographic databases or imported manually, and these vary in their complexity and functionality. Some tools, such as Eppi Reviewer (Social Science Research Unit, 2016) and Abstrackr (Rathbone et al. 2015) include text mining and machine learning functionality to assist with some aspects of eligibility screening. According to recently-published evidence syntheses and Protocols, the most frequently-used reference management tools in CEE evidence syntheses are Endnote and Eppi Reviewer (sometimes used in combination with Microsoft Excel), although others such as Mendeley and Abstrackr are also used.  Given that reference management tools have diverse functionality and are continually being developed and upgraded, it is not possible to recommend any one tool as being ‘better’ than the others. An efficient reference management tool should:

  • enable easy removal of duplicate articles, which can reduce substantially the number of articles;
  • readily locate and import abstracts and full-text versions for articles where available;
  • enable the review team to record their screening decisions for each article;
  • enable articles, and any screening decisions accompanying them, to be grouped and analysed to assist the team in checking progress with eligibility screening and in identifying any disagreements between screeners.

Other features of reference management tools that review teams may find helpful to consider are: whether the software is openly accessible (e.g. Mendeley) or may require payment (e.g. Endnote, Eppi Reviewer); the number of references that can be accommodated; the number of screeners who can use the software simultaneously; and how well suited the tool is for project management tasks, such as allocating eligibility screening tasks among the review team members and monitoring project progress.

Addressing the need for grey literature

“Grey literature” relates to documents that may be difficult to locate because they are not indexed in usual bibliographic sources (Konno & Pullin 2020). It has been defined as “manifold document types produced on all levels of government, academics, business and industry in print and electronic formats that are protected by intellectual property rights, of sufficient quality to be collected and preserved by libraries and institutional repositories, but not controlled by commercial publishers; i.e. where publishing is not the primary activity of the producing body” (12th Int Conf On Grey Lit. Prague 2010, but see Mahood et al., 2014). Grey literature includes reports, proceedings, theses and dissertations, newsletters, technical notes, white papers, etc. This literature may not be as easily found by internet and bibliographic searches, and may need to be identified by other means (e.g. asking experts), which may be time-consuming and requires careful planning (Saleh et al. 2014).

Searches for grey literature should normally be included in evidence synthesis for two main reasons: 1) to try to minimize possible publication bias (Hopewell et al. 2007), where ‘positive’ (i.e. confirmative, statistically significant) results are more likely to be published in academic journals (Leimu and Koricheva 2005); and 2) to include studies not intended for the academic domain, such as practitioner reports and consultancy documents, which may nevertheless contain relevant information such as details on study methods or results not reported in journal articles, which are often limited by word length.

Deciding when to stop

If time and resources were unlimited, the project team would be able to identify all published articles relevant to the evidence-synthesis question. In the real world this is rarely possible. Deciding when to stop a search should be based on explicit criteria, and these should be explained in the Protocol and/or synthesis report. Often, reaching the budget limit (in terms of project team time) is the key reason for stopping the search (Saleh et al. 2014), but justification for stopping should rely primarily on whether the project team judges the performance of the search to be acceptable. Searching only one database is not considered adequate (Kugley et al. 2016). Observing a high rate of article retrieval for the test-list should not preclude conducting additional searches in other sources to check whether new relevant papers are identified. Practically, when searching in electronic bibliographic sources, search terms and search strings are modified progressively, based on what is retrieved at each iteration, using the “test-list” as one indicator of performance. When each additional unit of time spent in searching returns fewer relevant references, this may be a good indication that it is time to stop the search (Booth 2010). Statistical techniques, such as capture-recapture and the relative recall method, exist to guide decisions about when to stop searching, although to our knowledge they have not been used in CEE evidence-synthesis to date (reviewed in Glanville, in press).
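Two of the indicators mentioned above can be sketched as follows. The first function computes the proportion of the test-list retrieved by a search. The second applies Chapman's version of the two-source capture-recapture (Lincoln-Petersen) estimator to the relevant records found by two sources. The choice of this particular estimator, the function names and the example identifiers are assumptions made for illustration; the guidance does not prescribe a specific method, and the estimator assumes that the two sources sample the literature independently, which is rarely strictly true.

```python
def recall_on_test_list(retrieved_ids, test_list_ids):
    """Proportion of the test-list articles that the search retrieved."""
    test_list = set(test_list_ids)
    if not test_list:
        raise ValueError("the test-list is empty")
    return len(set(retrieved_ids) & test_list) / len(test_list)

def chapman_estimate(relevant_in_a, relevant_in_b):
    """Chapman's capture-recapture estimate of the total number of relevant articles,
    based on the overlap between the relevant records found by two sources."""
    a, b = set(relevant_in_a), set(relevant_in_b)
    overlap = len(a & b)
    return (len(a) + 1) * (len(b) + 1) / (overlap + 1) - 1

# Hypothetical example: a test-list of 10 articles, 8 of which were retrieved.
test_list = [f"t{i}" for i in range(1, 11)]
retrieved = [f"t{i}" for i in range(1, 9)] + ["x1", "x2", "x3"]
print("test-list recall:", recall_on_test_list(retrieved, test_list))

# Hypothetical example: 40 relevant records in source A, 30 in source B, 24 in both.
source_a = set(range(0, 40))
source_b = set(range(16, 46))
found = len(source_a | source_b)
estimated_total = chapman_estimate(source_a, source_b)
print("found:", found, "estimated total:", round(estimated_total, 2))
```

A small gap between the number of relevant records found and the estimated total would support a decision to stop, but it should be reported alongside the other stopping criteria and not used alone.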

For web-searches (e.g. using Google) it is difficult to provide specific guidance on how much searching effort is acceptable. In some evidence syntheses, authors have chosen a “first 50 hits” approach (hits meaning articles, e.g. Smart & Burling 2001) or a ‘first 200 hits’ approach (Ojanen et al. 2014), but the CEE does not encourage such arbitrary cut-offs. What should be reported is whether stopping the screening after the first 50 (or more) retrieved articles is justified by a decline in the relevance of new articles. As long as relevant articles are being identified, the project team should ideally keep on screening the list of results.
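One way to make such a justification explicit is to record screening decisions in rank order and stop only when a run of consecutive hits has yielded no relevant articles. The sketch below is a hypothetical rule, not a CEE requirement: the window of 50 hits and the threshold of one relevant article are arbitrary assumptions that a project team would need to choose and report.

```python
def screen_until_relevance_declines(hits, is_relevant, window=50, min_relevant=1):
    """Screen ranked hits in order and stop once the most recent `window` hits
    contain fewer than `min_relevant` relevant articles.
    Returns the list of screening decisions (True = relevant) made before stopping."""
    decisions = []
    for hit in hits:
        decisions.append(bool(is_relevant(hit)))
        if len(decisions) >= window and sum(decisions[-window:]) < min_relevant:
            break
    return decisions

# Hypothetical web search returning 200 ranked hits, of which five are relevant.
relevant_ranks = {1, 3, 10, 42, 60}
decisions = screen_until_relevance_declines(range(200), lambda rank: rank in relevant_ranks)
print("screened", len(decisions), "hits; relevant found:", sum(decisions))
```

Reporting the number of hits screened and the point at which relevant articles stopped appearing gives readers the evidence that a fixed cut-off lacks.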


3.3 Planning study eligibility criteria and eligibility screening

3.3.1 The Eligibility Criteria

Rationale for eligibility criteria

The use of pre-specified and explicit eligibility criteria ensures that the inclusion or exclusion of articles or studies from a systematic review or systematic map is done in a transparent manner, and as objectively as possible. This reduces the risk of introducing errors or bias which could occur if decisions on inclusion or exclusion are selective, subjective, or inconsistent. An objective and transparent approach also helps to ensure reproducibility of eligibility screening. Failing to consistently apply eligibility criteria, or using criteria which are not relevant to the evidence synthesis question, can lead to inconsistent conclusions from different evidence syntheses (e.g. illustrated by Englund et al. 1999 for stream predation experiments and McDonagh et al. 2014 for health research studies).

The eligibility criteria for a systematic review or systematic map should reflect the question being asked and therefore follow logically from the ‘key elements’ that describe the question structure. Many environmental questions are of the ‘PICO’ type, where the interest is in determining the effects of an intervention within a specified population. For a PICO-type question the key elements (P, I, C, O) would specify which population(s), intervention(s), comparator(s) and outcome(s) must be reported in an article describing a primary research study in order for that article to be eligible for inclusion in the evidence synthesis (examples of PICO and other types of question structure are given by EFSA, 2010; Aiassa et al., 2016; and James et al., 2016).

Developing your search strategy can in turn help define or refine eligibility criteria that will be used for the screening of the literature once the full search is conducted (see Section 6). Titles and abstracts and full text found during scoping can form a sample of the literature within which papers that are not relevant (ineligible) for different reasons (including unexpected use of synonyms, or use of similar wording in other disciplines) may be identified and appropriate eligibility criteria developed. Planning eligibility criteria allows for discussion with the commissioner about the scope and scale of the articles that will be retained and the finalised eligibility criteria will be reported later on in the evidence synthesis Protocol.

An example of eligibility criteria for an environmental intervention (i.e. PICO-type) systematic review question is shown in Box 3.2, for the question ‘What are the environmental and socioeconomic effects of China’s Conversion of Cropland to Forest Programme (CCFP) after the first 15 years of implementation?’ (Rodríguez et al. 2016). As the example illustrates, eligibility criteria may be expressed as inclusion criteria and, if helpful, also as exclusion criteria.


Box 3.2 Example systematic review eligibility criteria in relation to question key elements for an intervention (PICO-type) environmental systematic review question (from Rodríguez et al., 2016)

Ideally, the eligibility criteria should be specified in such a way that they are easy to interpret and apply by the review team with minimal disagreement. For some systematic review or systematic map questions the eligibility criteria may be very similar to or identical to the question key elements and the question itself, whereas in other cases the eligibility criteria may need to be more specific, to provide adequate information for the review team to make selection decisions.

In the example systematic review question (Box 3.2) it is clear that if an article describing a primary research study did not provide information on the intervention (i.e. the Conversion of Cropland to Forest Programme) then it would not be appropriate for answering the review question. As such, the article could be excluded. Similarly, an article that did not report any environmental or socioeconomic outcomes would not be relevant and could be excluded. The example question illustrates that articles can be efficiently excluded if they fail to meet one or more inclusion criteria; they are included only if they meet all the eligibility criteria.

Keeping the list of eligibility criteria short and explicit, and specifying the criteria such that an article is excluded if it fails one or more of them, is a useful approach. It minimises the range of information that members of the review team need to locate in an article, and it means that once an article is clearly seen not to meet one criterion the remaining criteria do not have to be considered. Since a single failed eligibility criterion is sufficient for an article to be excluded from an evidence synthesis, it may be helpful to assess the eligibility criteria in order of importance (or ease of finding them within articles), so that the first ‘no’ response can be used as the primary reason for exclusion of the study, and the remaining criteria need not be assessed (Higgins & Green, 2011).
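The ordered, first-‘no’ approach can be sketched as below. This is an illustration only: the two criteria loosely follow the intervention and outcome elements of the CCFP example, and the article fields (`studies_ccfp`, `reports_outcome`) are hypothetical names, not part of the cited review.

```python
from typing import Callable, NamedTuple, Optional

class Criterion(NamedTuple):
    name: str
    # Returns True (met), False (not met) or None (unclear from the information available).
    is_met: Callable[[dict], Optional[bool]]

def screen_article(article, criteria):
    """Apply the criteria in order; the first clear 'no' is the primary exclusion reason."""
    unclear = []
    for criterion in criteria:
        verdict = criterion.is_met(article)
        if verdict is False:
            return {"decision": "exclude", "reason": criterion.name}
        if verdict is None:
            unclear.append(criterion.name)
    if unclear:
        return {"decision": "unclear", "reason": ", ".join(unclear)}
    return {"decision": "include", "reason": None}

# Hypothetical criteria, listed in the order in which they are assessed.
criteria = [
    Criterion("intervention: CCFP studied", lambda a: a.get("studies_ccfp")),
    Criterion("outcome: environmental or socioeconomic outcome reported",
              lambda a: a.get("reports_outcome")),
]

articles = {
    "A1": {"studies_ccfp": True, "reports_outcome": True},
    "A2": {"studies_ccfp": False, "reports_outcome": True},
    "A3": {"studies_ccfp": True},  # outcome information missing from the abstract
}
results = {article_id: screen_article(a, criteria) for article_id, a in articles.items()}
for article_id, result in results.items():
    print(article_id, result)
```

Recording the failed criterion alongside each decision also gives the review team the exclusion reasons that are usually reported for full-text screening.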

The example in Box 3.2 is for a relatively broad systematic review question. For a systematic map the question may be even broader since the objective of a map is to provide a descriptive output.  Irrespective of how broad the question is, the process for developing eligibility criteria which we have outlined here applies both to systematic reviews and systematic maps (James et al., 2016).

Study design as an eligibility criterion

The types of primary research study design (e.g. observational or experimental; controlled or uncontrolled) that can answer an evidence synthesis question will vary according to the type of question. The study design is sometimes made explicit in the key elements (e.g. ‘PICO’- type questions may be stated as ‘PICOD’ or ‘PICOS’ in the scientific literature, where ‘D’ (design) or ‘S’ (study) indicates that study design is being considered) (e.g. Rooney et al., 2014). Even if study design is not explicit in the question structure it should be considered as an eligibility criterion. This is particularly important for systematic reviews since the designs of studies that are included should be compatible with the planned approach for the data synthesis step (e.g. some meta-analysis methods may specifically require controlled studies). The type of study design may also be indicative of the likely validity of the evidence, since some study designs may be more prone to bias than others (see Box 3.3). Note that in systematic reviews the full assessment of risks of bias and other threats to validity takes place at the critical appraisal step, and this should always be conducted irrespective of whether any quality-related eligibility criteria have been specified.

Box 3.3 Overview of research designs


3.3.2 Pilot testing the eligibility criteria and screening process

The eligibility screening procedure should be pilot-tested and refined by arranging for several reviewers (at least two per article) to apply the agreed study inclusion (eligibility) criteria to the subset of identified relevant articles. A typical approach is to develop an eligibility screening form that lists the inclusion and exclusion criteria, together with instructions for the reviewers, to ensure that each reviewer follows the same procedure. A standard approach is to develop a form that guides the reviewers to make simple decisions, for example: to include the article; to exclude it; or to mark it as unclear. Reviewers screen the titles and/or abstracts of the subset of articles and then compare their screening decisions to identify whether they are adequately consistent. If necessary, the form should be refined and re-tested until an acceptable level of agreement is reached.  Once the suitability of the eligibility form has been tested on titles and/or abstracts, it should be tested on full-text versions of articles in the identified subset using a similar approach. The finally agreed draft eligibility screening criteria and form should then be provided when the Protocol is submitted (see below).

Pilot testing is important for validating reproducibility and reliability of the method. Pilot testing can:

  • check that the eligibility criteria correctly classify studies;
  • provide an indication of how long the screening process takes, thereby assisting with planning the full evidence synthesis;
  • enable agreement between screeners to be checked; if agreement is poor this should lead to a revision of the eligibility criteria or the instructions for applying them;
  • provide training for the review team in how to interpret and apply the eligibility criteria, to ensure consistency of understanding and application;
  • identify unanticipated issues and enable these to be dealt with before the methods are finalised.

The eligibility screening process should be tested on a sample of articles. There is no firm ‘rule’ about how many articles should be tested, but the review team will need to satisfy themselves that the eligibility criteria will correctly identify articles that can answer the evidence synthesis question without needing any further amendments. Higgins & Green (2011) suggested using around 10-12 articles, including ones thought by one screener to be definitely eligible, definitely ineligible, and doubtful; these can then be screened by one or more further members of the review team to assess consistency. Pilot testing should be performed for each separate step of the screening process that will be conducted, i.e. the title, abstract (or title plus abstract) and full-text screening steps.

If relevant articles are found to have been excluded, irrelevant articles are included, or a large number of ‘unclear’ judgements are being made by the review team, then the eligibility criteria should be revised and re-tested until an acceptable discrimination between relevant and irrelevant articles is achieved. The finally-agreed eligibility criteria should then be specified in the evidence synthesis Protocol.
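Agreement between two screeners in a pilot test can be summarised in several ways. The sketch below uses Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. Using kappa is an assumption made for illustration: this guidance does not prescribe a particular statistic or threshold, and the decisions shown are hypothetical.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two screeners' decisions on the same articles."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both screeners must rate the same, non-empty set of articles")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1:
        return 1.0  # both screeners used a single, identical category throughout
    return (observed - expected) / (1 - expected)

def disagreements(article_ids, ratings_a, ratings_b):
    """List the articles on which the two screeners disagreed, for discussion."""
    return [i for i, a, b in zip(article_ids, ratings_a, ratings_b) if a != b]

# Hypothetical pilot test on ten articles.
ids = [f"A{i}" for i in range(1, 11)]
screener_1 = ["include"] * 4 + ["exclude"] * 5 + ["unclear"]
screener_2 = ["include"] * 3 + ["exclude"] * 5 + ["include", "unclear"]

kappa = cohens_kappa(screener_1, screener_2)
to_discuss = disagreements(ids, screener_1, screener_2)
print("kappa:", round(kappa, 3), "articles to discuss:", to_discuss)
```

Whatever statistic is used, the list of articles on which screeners disagreed is usually more informative than the number itself, since it shows which criteria or instructions need revising.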


3.4 Planning for data coding (Systematic Reviews and Maps) and data extraction (Systematic Reviews)

Data coding and data extraction refer to the process of systematically extracting relevant information from the articles included in the Evidence Synthesis. Data coding is the recording of relevant characteristics (meta-data) of the study such as when and where the study was conducted and by whom, as well as aspects of the study design and conduct. Data coding is undertaken in both Systematic Reviews and Systematic Maps. Data extraction refers to the recording of the results of the study (e.g. in terms of effect size means and variances). Data extraction is undertaken in Systematic Reviews only. A standard data coding or extraction form or table (e.g. spreadsheet) is usually developed and pilot-tested on full-text copies of the relevant subset of identified articles. The table contains prompts to the reviewers to record all relevant information necessary to address the synthesis question, plus any additional information required for critical appraisal (see below) and any contextual information that will be required when writing the final Evidence Synthesis report. As with the eligibility screening step, the pilot test should involve at least two reviewers per article, so that any inconsistencies can be identified and corrected. Any issues with data presentation should be noted at this point, so that they may inform synthesis planning. For example, Review Teams may find that data are not consistently presented in a suitable format and that they may need to contact original authors for missing or raw data. The finally agreed draft data coding or extraction table should then be provided when the Evidence Synthesis Protocol is submitted (see Section 4). Data coding and extraction tables for Systematic Reviews are likely to be more detailed than Data coding tables for Systematic Maps, reflecting the different principles of these Evidence Synthesis methods (as explained in Section 2). 
Data coding in Systematic Reviews should also capture information on potential reasons for heterogeneity in outcomes (effect modifiers).
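A coding and extraction form can be thought of as a fixed set of prompts completed independently by each reviewer. The sketch below shows one possible structure, with meta-data fields (data coding) and result fields (data extraction), together with checks for missing entries and for differences between two reviewers. All field names and values are hypothetical; a real form would be developed from the synthesis question.

```python
from dataclasses import dataclass, asdict, field
from typing import Optional

@dataclass
class CodingRecord:
    article_id: str
    reviewer: str
    # Data coding (Systematic Reviews and Systematic Maps): study meta-data.
    study_year: Optional[int] = None
    study_location: Optional[str] = None
    study_design: Optional[str] = None
    effect_modifiers: dict = field(default_factory=dict)
    # Data extraction (Systematic Reviews only): study results.
    outcome_mean: Optional[float] = None
    outcome_variance: Optional[float] = None
    sample_size: Optional[int] = None

def missing_fields(record):
    """Fields left empty, e.g. data that may need to be requested from the authors."""
    return [name for name, value in asdict(record).items() if value is None]

def inconsistent_fields(record_a, record_b):
    """Fields on which two reviewers' records for the same article differ."""
    a, b = asdict(record_a), asdict(record_b)
    return [name for name in a if name != "reviewer" and a[name] != b[name]]

# Hypothetical pilot test: two reviewers code the same article.
first = CodingRecord("A1", "reviewer 1", 2012, "Shaanxi", "before-after control-impact",
                     {"slope": "steep"}, 3.2, 0.8, 24)
second = CodingRecord("A1", "reviewer 2", 2012, "Shaanxi", "control-impact",
                      {"slope": "steep"}, 3.2, None, 24)

print("missing for reviewer 2:", missing_fields(second))
print("to reconcile:", inconsistent_fields(first, second))
```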

3.5 Developing Critical appraisal criteria (Systematic Reviews only)

3.5.1 Why is critical appraisal necessary?

Not all research is conducted to the best standards of scientific rigour and therefore not all information available about a particular topic may be correct. A key challenge is to identify information which is likely to be correct and that which is not. If a systematic review is based on incorrect evidence then the results of the review will also be incorrect.  The critical appraisal step is a crucial part of a systematic review since this is where the “correctness” of the evidence is ascertained and decisions are made as to which evidence is permitted to inform the review’s conclusions. For this process to work effectively, two key criteria have to be met: First, the critical appraisal should focus on aspects of research study conduct that influence whether the resulting information will be correct or not (bearing in mind that some aspects of study design may be more important than others). Second, to have any bearing on the review’s conclusions the critical appraisal step has to directly inform the data synthesis step of the systematic review. It is implicit from this that critical appraisal should be not only a structured process but one that has to be planned a priori. As with the other key steps of a systematic review, the methods of critical appraisal should be pre-specified in the review Protocol.

Critical appraisal refers to the process of assessing whether the evidence is valid for answering the Review question. Key aspects of validity are “internal validity” which is the extent to which evidence is free from bias or confounding, and “external validity” which is the extent to which the evidence is relevant to the question being asked (i.e. whether it can be generalised from the original study to address the review question). Other aspects of evidence “quality” can also be assessed if considered important. The critical appraisal process requires reviewers to use pre-specified criteria to make judgements about whether validity and other quality criteria are met (often “yes”, “no” or “unclear” judgements). Review teams are advised to use the CEE Critical Appraisal Tool to assist in this process.

In developing a checklist using the tool, review teams may find it useful to think of a theoretical gold standard methodology that a primary study might adopt to minimise bias and maximise analytical power. The gold standard may be practically impossible but nevertheless possible to describe in theory. The checklist can then be based on the impact of elements of the gold standard being missing (e.g. measurements at baseline or randomization). Ideally, the type of bias (see below) that each missing element potentially introduces should be listed.

This checklist should be pilot tested on the full-text version of each article in the sample of potentially relevant references, by at least two reviewers per article. Reviewers can then compare their judgements and inconsistencies or disagreements can be taken into account when improving the critical appraisal process and checklist.  The finally agreed draft critical appraisal checklist should then be provided in the Evidence Synthesis Protocol (see Section 4).

3.5.2 Internal validity: Understanding bias

Bias is defined as a systematic deviation in study results from their true value, i.e. it means either an underestimation or overestimation of the true value. The magnitude of bias can range from trivial to substantial. Bias should not be confused with statistical uncertainty as a result of random error, which is present in all research studies. Random error reflects inaccuracy of estimation that is distributed randomly around the true result. Often, random error can be reduced by increasing the sample size in a research study, or by quantitatively combining the results of similar studies in a meta-analysis (subject to the studies being adequately comparable), hence improving the precision of the result (Glass, 1976). Bias, on the other hand, refers to a systematic error which cannot be reduced by increasing the sample size or by pooling study results in a meta-analysis. If bias is present in primary research studies their results will be incorrect. It is generally acknowledged that bias is an important threat to the validity of research findings across scientific disciplines, and it has been argued that bias is one of several factors that collectively contribute to the majority of research findings being incorrect (Ioannidis, 2005). Traditional non-systematic reviews of evidence which do not formally assess the rigour of primary research studies would not be able to detect bias.

A misleading result from an evidence synthesis could occur where a precise but wrong answer is generated (e.g. a point estimate that is incorrect but has a narrow confidence interval). This could arise for example if the included studies in a meta-analysis exhibit consistent systematic error with relatively low random error (Figure 3.2). To avoid this kind of misleading result, it is clearly important that risks of bias are sought and if possible identified before the data synthesis step of a systematic review takes place.
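The difference between random and systematic error can be illustrated with a small simulation. The sketch below is deliberately simplified: each simulated study estimates a true effect with random noise plus a constant bias, and the studies are pooled with a simple unweighted mean in place of a formal meta-analysis. The true effect, the size of the bias and the noise level are arbitrary assumptions.

```python
import random
import statistics

def pooled_estimate(true_effect, bias, n_studies, sd=1.0, seed=1):
    """Pool simulated study estimates that share the same systematic error (bias).
    Returns the pooled mean and an approximate 95% confidence interval."""
    rng = random.Random(seed)
    estimates = [true_effect + bias + rng.gauss(0, sd) for _ in range(n_studies)]
    mean = statistics.mean(estimates)
    standard_error = statistics.stdev(estimates) / n_studies ** 0.5
    return mean, (mean - 1.96 * standard_error, mean + 1.96 * standard_error)

TRUE_EFFECT = 1.0   # assumed true value of the outcome
BIAS = 0.5          # assumed systematic over-estimation shared by all studies

for n in (10, 100, 10_000):
    mean, (lower, upper) = pooled_estimate(TRUE_EFFECT, BIAS, n)
    print(f"{n:>6} studies: pooled = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

As more studies are pooled the confidence interval narrows, but it narrows around the biased value and not around the true effect, which is the "precise but wrong" situation described above.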

Figure 3.2. Schematic illustration of the potential influence of random error and systematic error (bias) on a study outcome

Bias in research studies can arise for a variety of reasons. Poor design of a research study may mean that it consistently underestimates or overestimates the true value of an outcome and the study researchers may not be aware of this. In some cases researchers may have a vested interest in a particular outcome and this could lead, either intentionally or unintentionally, to various types of bias. Considerable experience from evidence synthesis in health research has shown that where bias is present it often leads to over-estimation of beneficial outcomes, e.g. exaggerating the actual benefits of an intervention such as a drug treatment (Higgins et al., 2011).

The concept of “risk of bias”

Evidence for the existence of bias comes from meta-epidemiological health research that has assessed large numbers of studies to determine whether outcomes differ systematically between studies that have a particular design feature and those that do not (e.g. Wortman 1994; Schulz et al. 1995; Chan et al. 2004; Wood et al. 2008; Liberati et al. 2009; Kirkham et al. 2010; Higgins & Green, 2011; Holman et al. 2015). However, it is usually impossible to directly measure bias within individual primary research studies. Instead, an indirect approach is to infer the “risk of bias” by examining the study design and methods to determine whether adequate steps were taken to protect against bias. Studies that fail to meet specified criteria for mitigating known types of bias may be referred to as being at “high risk of bias”  whilst studies with adequate methodology to protect against bias are considered to be at “low risk of bias” (Higgins et al. 2011).

Meta-epidemiological studies on randomised controlled trials of interventions in health research have identified five main types of bias that the trials need to protect against to ensure that their results would be unbiased. These are selection bias, performance bias, detection bias, attrition bias, and reporting bias. These and other types of bias are explained in more detail later in this section. To understand and be able to identify the different types of bias that may arise in research studies, Review Teams should be familiar with the concepts of confounding and effect modification.

Confounding and effect modification

To assess whether there might be a risk of bias, it is important to understand the interrelationships between the explanatory variables and dependent variables that are present in a study. In a well-conducted research study of causation the intervention/exposure would be the explanatory variable, the effect (and hence the measured outcome) would be the dependent variable, and these would be linked by one or more clearly-specified cause-effect pathways. Numerous other variables, which could act as covariables in relation to the study hypothesis, are likely to be present in the system under study and these would need to be controlled for in the study design to ensure that inferences based on the measured outcomes accurately reflect the hypothesised effect of the intervention/exposure. The variables are often categorised as prognostic variables, effect modifiers and confounding variables (or confounders) in the evidence synthesis literature, although sometimes these may be referred to by a variety of synonyms (Peat 2001).

  • A prognostic variable is a variable that is known (e.g. based on knowledge from previous empirical research), or considered very likely (e.g. based on plausibility and probability) to predict the outcome of interest.
  • An effect modifier is a variable which differentially (positively or negatively) influences an outcome by interacting with the cause-effect pathway, but is not a causal factor in itself (i.e. it does not modify the intervention/exposure). The observed cause-effect association will be correct in principle, but the outcome will be biased (systematically under- or over-estimated) if the effect modifier is not controlled for in the study. An effect modifier may also be a prognostic variable.
  • A confounder is a variable external to the cause-effect pathway that interacts with both the intervention/exposure and the outcome. A confounder would meet these three criteria: (1) it is a predictor of the outcome, independent of the intervention/exposure; (2) it is associated with the intervention/exposure; and (3) it is not in the causal pathway between the intervention/exposure and outcome. Presence of a confounder means that the observed cause-effect association is not correct and so the outcome will be biased if the confounder is not controlled for in the study.

Figure 3.3 provides a schematic summary of how these types of variable interact with the intervention/exposure and outcome of interest.


Figure 3.3. Schematic illustration of interactions to look for when investigating potential sources of bias in research studies

Whether a variable is a prognostic variable, effect modifier, and/or confounder will depend on the outcome and exposure being assessed. If a study is asking whether a pesticide influences the fecundity of an organism, then age would almost certainly be a prognostic variable since it is known from empirical research across a wide range of organisms that fecundity varies strongly with age.

A prognostic variable can be defined in isolation of any intervention/exposure (i.e. the prognostic influence of age upon fecundity does not require there to be an intervention/exposure). An effect modifier on the other hand can only be defined in the context of a putative effect of interest, meaning that a putative cause-effect pathway for an intervention/exposure would need to have been specified.  As such, effect modifiers are treatment-specific. Supposing that the effect that a pesticide has on fecundity varies with an organism’s age, then age would be both a prognostic variable and an effect modifier.
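The definitions above can be written down as a simple decision aid. The sketch below is a simplified encoding of those definitions for use when drafting a conceptual model, not a statistical test: the function and argument names are hypothetical, and the answers to each question still rest on the review team's subject knowledge.

```python
def variable_roles(predicts_outcome, associated_with_exposure,
                   on_causal_pathway, modifies_effect):
    """Return the roles a covariable may play, following the definitions in the text."""
    roles = set()
    if predicts_outcome:
        roles.add("prognostic variable")
    if modifies_effect:
        roles.add("effect modifier")
    # The three confounder criteria: predicts the outcome, is associated with the
    # intervention/exposure, and is not on the causal pathway between them.
    if predicts_outcome and associated_with_exposure and not on_causal_pathway:
        roles.add("confounder")
    return roles

# Age in the pesticide and fecundity example: fecundity varies with age, and the
# effect of the pesticide is supposed to vary with age.
age = variable_roles(predicts_outcome=True, associated_with_exposure=False,
                     on_causal_pathway=False, modifies_effect=True)
print(sorted(age))

# Hypothetical variation: older organisms are also more likely to have been exposed.
age_if_exposure_differs = variable_roles(predicts_outcome=True, associated_with_exposure=True,
                                         on_causal_pathway=False, modifies_effect=True)
print(sorted(age_if_exposure_differs))
```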

The term “confounder”, or “confounding variable”, is sometimes used in the scientific literature in a general sense to mean any covariate that could predict the intervention/exposure or outcome (i.e. referring to both confounders and effect modifiers as defined above). However, in statistical analysis it is important to distinguish between confounding variables and effect modifiers. This is because confounders exhibit collinearity with both the intervention/exposure and outcome whilst effect modifiers exhibit collinearity with the outcome but not the intervention/exposure. A challenge in environmental research studies is to understand which of the many biotic and abiotic variables and their interactions that are present in ecological systems could be confounding variables or effect modifiers in relation to the study hypothesis. A conceptual model can be a helpful means of visualising the key variables that relate to the intervention/exposure and outcome, so as to clarify which may be confounders or effect modifiers.

The relationships shown in Figure 3.3 highlight the types of variables and interactions that review teams should look for in the system to which the review question relates and may form a useful basis for developing a conceptual model to help ensure that key variables and interactions have not been missed.

Principles for assessing risk of bias

Extensive experience of conducting critical appraisal of studies in systematic reviews of health topics has identified several core principles that should guide how risk of bias is assessed (e.g. Higgins et al. 2011):

Assessment should focus on internal validity

Internal validity indicates whether or not the results are correct (i.e. free from bias). This should be distinguished from random error (precision), external validity (i.e. generalisability), and quality of reporting, which do not themselves indicate whether bias is present. Note that some aspects of study “quality”, such as whether sample size was calculated, are not related directly to the risk of bias (Higgins et al. 2011). Critical appraisal assessments which mix up these different aspects of study “quality” or reporting would not be able to clearly detect threats to internal validity. External validity, which is explained below, should be assessed separately from internal validity.

Risk of bias should be assessed separately for each outcome

Risks of bias are likely to differ according to the result being assessed (Page & Higgins 2016) and should therefore be assessed separately for each outcome rather than for the study as a whole (unless it can be justified that outcomes are similar enough that they would be subject to the same risks of bias).
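One way of keeping judgements separate for each outcome is to key each record on the combination of study and outcome rather than on the study alone. The sketch below is illustrative only: the class name, the rating labels ("low", "high", "unclear") and the example study are assumptions, not a CEE recording template.

```python
from dataclasses import dataclass, field

# Core bias domains referred to in this section, plus a catch-all category.
DOMAINS = ("selection", "performance", "detection", "attrition", "reporting", "other")
RATINGS = ("low", "high", "unclear")  # assumed labels, for illustration only


@dataclass
class OutcomeAssessment:
    """Risk-of-bias judgements for ONE outcome of ONE study."""

    study: str
    outcome: str
    judgements: dict = field(default_factory=dict)  # domain -> (rating, rationale)

    def rate(self, domain, rating, rationale):
        """Record a judgement and its rationale for one bias domain."""
        if domain not in DOMAINS:
            raise ValueError(f"unknown domain: {domain}")
        if rating not in RATINGS:
            raise ValueError(f"unknown rating: {rating}")
        self.judgements[domain] = (rating, rationale)

    def unassessed(self):
        """Domains that still need a judgement for this outcome."""
        return [d for d in DOMAINS if d not in self.judgements]


# Hypothetical field study reporting two outcomes measured in different ways.
abundance = OutcomeAssessment("Hypothetical study A", "invertebrate abundance")
abundance.rate("detection", "high", "pitfall traps used; vegetation differs between plots")

crop_yield = OutcomeAssessment("Hypothetical study A", "crop yield")
crop_yield.rate("detection", "low", "yield weighed by machine for all plots")

for record in (abundance, crop_yield):
    print(record.study, "|", record.outcome, "|", record.judgements, "|", record.unassessed())
```

Because each record holds a single outcome, the same study can carry different judgements in the same domain, each with its own rationale.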

3.5.3 Criteria for identifying risks of bias in environmental research studies

In this section we provide lists of scenarios that can help to identify risks of bias in each of the core domains, i.e. selection bias (Box 3.4), performance bias (Box 3.5), detection bias (Box 3.6), attrition bias (Box 3.7) and reporting bias (Box 3.8). Where possible, we have contextualised the scenarios with actual or hypothetical examples from environmental management research (see the descriptive text below for each domain of bias). Unless stated otherwise, the scenarios are likely to be broadly applicable across a range of study designs.

The aim of this section is to guide review teams on where to look for risks of bias in environmental management studies, but the lists of scenarios are not exhaustive. Review teams should check whether further confounding variables or effect modifiers are present in addition to those listed. The identification of risks of bias is an iterative process and pilot testing of the process is essential to enable the review team to become adept at identifying risks of bias.

Selection bias (Box 3.4)

Selection bias is an inherent concern in all types of study design and can only be controlled by ensuring that the study units (e.g. people or animals being assigned to intervention or comparator groups or chemicals being assigned to plots in a field experiment) are allocated randomly. A key challenge that review teams face when looking for similarity among study groups is to know which of the factors might or might not be potential confounding variables or effect modifiers, although sometimes these may be obvious. For instance, in human and animal studies age and health status are very likely to be effect modifiers (i.e. they are likely to systematically influence the outcome if not balanced between study groups). In agricultural field experiments soil type is very likely to be a confounding variable or effect modifier given that it is a key determinant of biotic and abiotic diversity (soil type is correlated with other factors such as geographical location and vegetation type and so these factors would also likely be confounders or effect modifiers).

Key issues to look for are:

  • Lack of randomisation (i.e. no randomisation, or randomisation is stated but not appropriately implemented). All types of study that lack random allocation are inherently at risk of selection bias since unmeasured confounding variables cannot be controlled for.
  • In randomised studies: Study investigators may be able to influence the allocation process, preventing it from being truly random (e.g. preferentially selecting which participants are assigned to each study group, or which interventions or exposures are assigned to study plots or areas). Concealment of the intervention or exposure allocation should always be feasible in well-conducted studies (though it may not be commonly implemented in environmental research).

Non-randomised studies are inherently prone to selection bias, but steps can be taken in some types of study design to control for selection bias as much as possible. These include selecting populations (or study areas) that appear to be as similar as possible such that the comparator (or control) group is sampled from the same population as the intervention (or case) group, and/or using statistical correction to ensure that the groups are matched on all the known variables that could influence outcomes. This cannot account for any imbalances in unmeasured variables, but study investigators may make a pragmatic assumption that the measured variables are likely to be the most important confounders or effect modifiers.

A common problem in environmental research studies is that the baseline characteristics of the populations or study areas of interest are not always reported, which may preclude an assessment of the comparability of study groups. For example, Stewart, Coles & Pullin (2005) found that baseline data to confirm whether study sites were homogeneous before a vegetation burning intervention was applied were generally lacking. Similarly, Mackenzie Ross et al. (2016) found that several studies assessing neurotoxicity of low level exposure of people to organophosphate insecticides did not provide any information on prior exposure before the study, which could be a confounding factor. This illustrates the importance of considering not only the characteristics of study groups at the start of a study but also any historical differences between groups that could introduce selection bias. If review scoping suggests that studies are likely to generally be deficient in reporting baseline information then review teams should consider whether it would be feasible to contact study authors for this information.

Box 3.4 Scenarios indicative of risk of selection bias

Depending upon the study design, imbalances in study groups may be quite subtle to detect. An example is provided by Duffy et al. (2014) in which the use of a standard test system would result in unnaturally healthy controls in ecotoxicological testing of pharmaceutical effects on fish.

Note that the allocation of study groups can have implications both for selection bias and external validity (see Section 3.5.4). For example, if a cross-sectional study sampled a range of geographical sites there could be a risk of selection bias if the sites were not selected randomly, but also a threat to external validity if the randomly-selected sites were only a subset of those relevant to the review question.

Performance bias (Box 3.5)

Performance bias is a systematic error in the effect attributed to an intervention or exposure caused by the influence of a confounding factor. Performance bias may arise for several reasons, which may or may not occur together in the same study.

Study investigators who are aware of the allocations (e.g. of people or animals to groups, or crop treatments to plots in a field study) may be prone to inconsistency in how they manage the study groups, potentially favouring one group over the other (e.g. by being more meticulous in their adherence to the Protocol for one group). These “observer biases” are strongest when researchers expect a particular result, are measuring subjective variables, and have an incentive to produce data that confirm predictions. For example, students who believed their test rats had been selectively bred for maze-solving ability recorded better maze performance than did students told their rats were bred for poor maze-solving ability, despite both groups possessing randomly assigned, normal rats. This type of performance bias can be prevented by blinding study investigators to the group allocations. Although it is known that non-blind studies tend to report higher effect sizes and more statistically significant results, blinding is uncommon in the life sciences. It is not always feasible to blind researchers in environmental management studies. For example, where vegetation characteristics are likely to differ between study plots or areas (as with agri-environment or vegetation control interventions) the plant species composition or density would likely indicate the intervention that was allocated.

In environmental field studies performance bias may relate to the scale of the study. For example, in studies with insecticides the use of small plots can lead to an overestimation of the recolonization rate of invertebrates (Duffield & Aebischer 1994).

Box 3.5 Scenarios indicative of risk of performance bias

Detection bias (Box 3.6)

Detection bias may arise if there are systematic differences in the way outcomes are assessed among the study groups being compared. Possible sources of detection bias are: systematic misclassification of the exposure, intervention or outcome (e.g. because of variable definitions or timings of assessments), inconsistent application of diagnostic thresholds across study groups, the need for recall from memory (e.g. in surveys or questionnaires), inadequate assessor blinding (such that the investigator’s knowledge of the study group allocations could influence how they measure and/or record outcomes), and faulty measurement techniques.

Where organisms are being sampled in their natural environment, detection bias might arise if there is a systematic difference between study groups in the investigators who are assigned to do the sampling. For example, bias might be introduced if one group of investigators always sampled the exposure plots whilst a different group of investigators always sampled the comparator plots; or if the investigators assigned to sampling the exposure plots had different training in sampling, or other relevant expertise, compared to those who sampled the comparator plots. Random assignment of outcome assessors to study groups, and blinding of the outcome assessors so that they are unaware of the group allocations, are ways to reduce the risk of detection bias although, as mentioned above, blinding is uncommon in the life sciences.

Sampling devices could introduce bias if their capture efficiency is variable and differs systematically between study groups. It is well-known, for example, that the capture efficiency of pitfall traps and suction samplers for sampling terrestrial invertebrates is dependent upon vegetation characteristics and habitat structure. Another issue with pitfall traps is that they depend upon organisms’ activity (which is related to temperature and body size) and therefore they provide a measure of “activity-abundance” rather than an estimate of abundance. If studies using pitfall traps are claimed to be providing abundance estimates without accounting for between-group differences in activity then bias could be introduced. Usually, a range of methods is available for sampling organisms, e.g. for reptiles or invertebrates, and these can differ in which taxa they sample, so it is important that the review team is experienced enough to know which sampling methods would be most appropriate for answering the review question without introducing bias. Given that many environmental management studies will involve manipulations of vegetation or habitat structure (e.g. studies involving herbicides, fertilisers, comparisons of crops, agri-environment schemes, or other environmental management prescriptions such as burning or drainage), there is considerable scope in research studies for sampling efficiency to be confounded with these factors if they differ systematically between study groups and are not accounted for in the study design.
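The pitfall trap issue can be shown with simple arithmetic. The sketch below assumes, purely for illustration, that the expected catch is proportional to the product of population density and a relative capture rate reflecting activity and habitat structure. This multiplicative model, the function name, the taxon and all numbers are assumptions, not an established calibration.

```python
def expected_catch(density_per_m2, relative_capture_rate, scaling=10):
    """Expected number of individuals caught per trap under the assumed model."""
    return density_per_m2 * relative_capture_rate * scaling


# Hypothetical plots with the SAME true density of ground beetles.
true_density = {"herbicide plot": 4, "control plot": 4}

# Assumed capture rates: beetles move more freely over sparse vegetation, so
# traps in the herbicide plot intercept a larger share of the population.
capture_rate = {"herbicide plot": 1.0, "control plot": 0.5}

catch = {
    plot: expected_catch(true_density[plot], capture_rate[plot])
    for plot in true_density
}

true_ratio = true_density["herbicide plot"] / true_density["control plot"]
apparent_ratio = catch["herbicide plot"] / catch["control plot"]

print(f"true density ratio: {true_ratio:.1f}")
print(f"apparent ratio from trap catches: {apparent_ratio:.1f}")
```

Under these assumptions the trap catches suggest twice as many beetles in the herbicide plot even though the true densities are identical, which is the kind of systematic difference a review team would record as a risk of detection bias.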

Box 3.6 Scenarios indicative of risk of detection bias

Attrition bias (Box 3.7)

Attrition bias may occur where there are systematic differences between groups in the loss of participants, organisms, or samples from a study and the missing observations are related to the intervention/exposure and/or the outcome of interest (i.e. the missing observations are systematically different from those which remain in the analysis). Attrition bias can potentially change the collective (group) characteristics of the study groups and their observed outcomes in ways that affect study results by confounding and spurious associations. For example, if animals which die following exposure to a chemical are excluded from an analysis, this would create an imbalance between groups in the sensitivity of those animals that remain in the study (since the most sensitive animals have been excluded, those remaining in the analysis would be unrepresentatively insensitive in the exposure group). In cross-sectional studies such as surveys, non-response of participants could introduce systematic error if the reasons for non-response are related to the intervention/exposure or the outcome being assessed.
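The animal exclusion example can be made concrete with invented numbers. In the sketch below every value is an assumption: both groups start with the same spread of sensitivity scores, the outcome falls linearly with sensitivity in the exposed group, the most sensitive exposed animals die, and an outcome value is assumed to be available for every animal so that the two analyses can be compared.

```python
from statistics import mean

# Hypothetical sensitivity scores (higher = more sensitive to the chemical).
# Both groups start with identical distributions of sensitivity.
sensitivity = list(range(1, 11))  # animals scored 1 to 10

BASELINE = 100        # assumed outcome (e.g. growth) without exposure
EFFECT_PER_UNIT = 5   # assumed reduction in outcome per unit of sensitivity
LETHAL_FROM = 8       # assumed: exposed animals scoring 8 or more die

control_outcomes = [BASELINE for _ in sensitivity]
exposed_outcomes = {s: BASELINE - EFFECT_PER_UNIT * s for s in sensitivity}

# Effect if every exposed animal contributes a measurement.
full_effect = mean(exposed_outcomes.values()) - mean(control_outcomes)

# Effect if animals that died are excluded from the exposed group only.
survivors = [s for s in sensitivity if s < LETHAL_FROM]
observed_effect = mean(exposed_outcomes[s] for s in survivors) - mean(control_outcomes)

print("mean sensitivity, control group:", mean(sensitivity))
print("mean sensitivity, exposed survivors:", mean(survivors))
print("effect with all animals:", full_effect)
print("effect after excluding deaths:", observed_effect)
```

With these numbers the surviving exposed animals have a mean sensitivity of 4 compared with 5.5 in the control group, and the estimated effect shrinks from -27.5 to -20 once deaths are excluded.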

The risk of bias may be less, and perhaps could be considered trivial by the review team, if the proportion of missing data is small and/or the reasons for data being missing do not differ systematically between the study groups (e.g. if they can be assumed to be missing at random). However, clear justification should be provided for any assumptions made about missing data.

Box 3.7 Scenarios indicative of risk of attrition bias

Reporting bias (Box 3.8)

Reporting bias refers to selective disclosure of results such that the outcomes that are reported do not provide a true reflection of the results that would have been observed had all measured outcomes been reported. In order to check for reporting bias the review team will need to have access to a statement of which outcomes were measured in the study. Ideally, this would be found in the study protocol. However, protocols are not commonly provided for environmental research studies and the review team may therefore need to consult the methods section (and possibly other sections) of the study report to ascertain which outcomes were measured.

Types of selective disclosure of results that review teams should be aware of are: reporting results for selected sampling times; reporting results for selected species or other taxa from among a wider list of taxa sampled (e.g. preferentially reporting the most or least sensitive species to an exposure); reporting the most or least sensitive of a range of biomarkers or other outcomes measured; reporting incomplete data for outcomes (e.g. continuous data presented as categorical data with arbitrary cut-offs); and preferential reporting of only statistically significant (or statistically non-significant) results. Selective reporting could be a problem in studies that use multiple ways of assessing the same outcome but do not report all of these (e.g. if diversity is being measured using various different indices such as species richness, Shannon-Wiener, Simpson, and Berger-Parker indices), or in studies that employ multiple sampling methods but do not report results from all of them.
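To see why reporting only some of several diversity indices matters, the following minimal sketch computes the four indices named above for two invented communities. The species counts are hypothetical, and the Simpson index is computed in its 1 - sum(p^2) form, which is one of several forms in use and is chosen here only for illustration.

```python
import math


def proportions(counts):
    """Relative abundance of each species that was recorded."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def richness(counts):
    """Number of species recorded."""
    return sum(1 for c in counts if c > 0)


def shannon_wiener(counts):
    """H' = -sum(p * ln p)."""
    return -sum(p * math.log(p) for p in proportions(counts))


def simpson(counts):
    """Simpson index in its 1 - sum(p^2) form (higher = more diverse)."""
    return 1 - sum(p * p for p in proportions(counts))


def berger_parker(counts):
    """Proportion made up by the most abundant species (higher = more dominated)."""
    return max(proportions(counts))


# Hypothetical counts of individuals per species in two treatments.
treatment_a = [90, 2, 2, 2, 2, 2]   # six species, one strongly dominant
treatment_b = [25, 25, 25, 25]      # four species, evenly represented

for name, index in [
    ("species richness", richness),
    ("Shannon-Wiener", shannon_wiener),
    ("Simpson (1 - D)", simpson),
    ("Berger-Parker dominance", berger_parker),
]:
    print(f"{name}: A = {index(treatment_a):.3f}, B = {index(treatment_b):.3f}")
```

Here species richness ranks treatment A as the more diverse, whereas the other three indices rank treatment B as more diverse (or less dominated), so a report giving only one index could convey a different impression from a report giving all four.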

The review team should consider carefully whether non-reporting of outcomes would likely introduce bias, since there may be cases where non-disclosure of outcomes might be considered inconsequential or relatively unimportant. For example, non-reporting of short-term measurements in a long-term study may be considered less likely to misrepresent the true findings of the study than if the long-term measurements are not reported and only short-term measurements given. The review team should provide a clear rationale for their judgements made about the risk of reporting bias.

Box 3.8 Scenarios indicative of risk of reporting bias

Other bias

“Other bias” refers to the presence of any further factors that could lead, directly or indirectly, to systematic underestimation or overestimation of outcomes or effect estimates but do not appear to be readily classifiable as one of the core bias types above. According to the original Cochrane Risk of Bias tool, the five core bias domains are independent. If additional confounding variables or effect modifiers are identified and are suggestive of a risk of bias then these should only be grouped under one of the core domains in the recording template if they clearly relate to that domain. Where there is doubt as to whether an identified risk of bias can clearly be classified as selection bias, performance bias, detection bias, attrition bias or reporting bias, it should be listed under “Other” risk of bias.

Bias arising from study sponsorship is an example of a type of bias that does not relate to any of the five core domains of bias and would therefore be appropriate to list in the “Other bias” category. For example, in studies on non-human animals, non-industry sponsored studies were found to be more likely to conclude that the herbicide atrazine was harmful compared to industry sponsored studies (Bero et al. 2016).

3.5.4 External validity

External validity refers to whether the information obtained from a scientific research study is generalizable (i.e. directly applicable) to how the answer to the question being addressed would be applied in practice (Higgins et al., 2011). Experimental studies are typically conducted under controlled conditions that may not fully resemble those of the ‘real world’. An experimental study of an intervention may demonstrate that the intervention can work under the specific conditions of the study, but we also need to know whether it would work in real-life field conditions where it is intended to be used. An intervention’s performance in a study is termed “efficacy” whilst its performance in the real world is termed “effectiveness”. Studies designed to reflect real-world conditions are referred to as “pragmatic” studies. External validity is important as it relates to how well efficacy predicts effectiveness (Khorsan & Crawford, 2014). Another element of external validity concerns whether the setting of a primary study included within a systematic review is appropriate to that of the review question being asked, for example whether the population, intervention, exposure, or outcomes in the primary study are comparable to those of the setting in which the answer to the review question is intended to be applied. Note that external validity is sometimes referred to in the literature as “generalisability”, “applicability” or “directness”.

The extent to which external validity should be assessed within a systematic review depends on whether the interest is in experimental or pragmatic studies and how the review question is framed (i.e. whether it is broad or narrow, and whether it captures efficacy and/or pragmatic studies). However, review teams should always consider two aspects of external validity: (1) whether the studies included in the review are appropriate for answering the review question; and (2) whether the answer to the review question can be applied directly by the intended end-user (which might, depending on the purpose of the review, be a conservation manager or other environmental practitioner; a policymaker; or a statistical model or process for which the review has generated a specified parameter).

The first aspect needs to be assessed for each individual primary study in the systematic review during the critical appraisal step of the review and the review team should specify in the Protocol the process that will be used if studies are judged to have low external validity (e.g. whether such studies would be excluded from data synthesis, or included in subgroup analyses or sensitivity analyses). The second aspect relates to how appropriate the review question is in relation to its intended purpose, and this should be considered during the development of the review question (rather than in the critical appraisal step of the systematic review).

3.5.5 Criteria for assessing the external validity of environmental research studies

There are two aspects of external validity: the extent to which the studies included in a systematic review are generalizable to answer the review question; and the extent to which the answer to the review question is generalizable to the setting in which the results of the review will be applied. The first of these is relevant to the critical appraisal step of a systematic review and the second is considered at the question development step.

The extent to which external validity of individual included studies will need to be assessed depends upon the breadth of the review’s eligibility criteria and the nature of the included studies. For reviews with very narrow and clearly defined eligibility criteria it is unlikely that the studies included would lack external validity for answering the review question. However, sometimes it may not be clear how relevant studies are until they have been included and carefully scrutinised. This may be the case for studies on complex behavioural interventions for example, or studies that may be conducted at a range of different spatial and temporal scales.

A pragmatic way to assess the external validity of studies included in a systematic review is to consider systematically how well the key elements of the studies (e.g. PICO elements and study design) match those of the review question. Criteria that the review team should consider are: the relevance to the review question of the population; intervention/exposure; comparator; outcome; setting; geographical location; temporal scale; spatial scale; and study design.
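The criteria above could be applied as a simple per-study checklist. The sketch below is one hypothetical way of doing so: the function name, the use of True/False/None to record judgements, and the example study are assumptions for illustration rather than a prescribed CEE procedure.

```python
# Criteria listed in the text above.
CRITERIA = (
    "population",
    "intervention/exposure",
    "comparator",
    "outcome",
    "setting",
    "geographical location",
    "temporal scale",
    "spatial scale",
    "study design",
)


def external_validity_concerns(judgements):
    """Return the criteria on which a study does not clearly match the review question.

    judgements maps each criterion to True (matches), False (does not match)
    or None (not yet assessed, or unclear from the study report).
    """
    missing = [c for c in CRITERIA if c not in judgements]
    if missing:
        raise ValueError(f"criteria not recorded: {missing}")
    return [c for c in CRITERIA if judgements[c] is not True]


# Hypothetical study: a one-season, small-plot experiment assessed against a
# review question about long-term, landscape-scale effects.
study = {criterion: True for criterion in CRITERIA}
study["temporal scale"] = False
study["spatial scale"] = False
study["setting"] = None

concerns = external_validity_concerns(study)
print(concerns)
```

Requiring a judgement for every criterion, and treating unclear entries as concerns, keeps the assessment systematic; how flagged studies are then handled (e.g. exclusion, subgroup or sensitivity analysis) remains a decision to be specified in the Protocol.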

3.6 Developing data synthesis methods (Systematic Reviews only)

Data synthesis refers to the collation of all relevant evidence identified in the Systematic Review in order to answer the review question. A narrative synthesis of the data should always be planned involving listing of eligible studies and tabulation of their key characteristics and outcomes. For Systematic Reviews, if evidence is available in a suitable format and quantity then a quantitative synthesis, such as aggregating by meta-analysis, may also be planned. The likely form of the data synthesis may be informed by the previous pilot-testing of data extraction and critical appraisal steps. For example, the Review Team may identify whether the studies reported in the articles are likely to be of sufficient quality to allow relatively robust statistical synthesis and what sorts of study designs are appropriate to include. This pilot-testing process should also inform the approach to the synthesis by allowing, for example: the identification of the range of data types and methodological approaches; the determination of appropriate effect size metrics and analytical approaches (e.g. meta-analysis or qualitative synthesis); and the identification of study covariates (see Section 9).

3.7 Estimating resource requirements

Whilst the process of scoping may seem like a time-consuming one, the benefits can be considerable and this early investment will allow the development of a comprehensive Protocol as well as improve the focus and efficiency of the review. Scoping should provide an estimate of the timeline of the review and team effort required so that a realistic budget can be prepared or the likely costs compared with the available resources.

