Bojke L, Soares M, Claxton K, et al. Developing a reference protocol for structured expert elicitation in health-care decision-making: a mixed-methods study. Southampton (UK): NIHR Journals Library; 2021 Jun. (Health Technology Assessment, No. 25.37.)
Over the last few decades, SEE has been used in areas such as natural hazards, environmental management, food safety, health care, security and counterterrorism, economic and geopolitical forecasting, and risk and reliability analysis. All of these areas require consequential decisions be taken in the face of significant uncertainty about future events or scientific knowledge.
How judgements are elicited is critical to the quality of the resulting judgements and, hence, the ultimate decisions and policies. Methods for SEE should be suitable for specific contexts and understood by content experts to be useful to decision-makers. Example applications and recommended practices do exist in certain fields, but the specifics vary.
In developing a reference protocol for SEE specific to the needs of HCDM, the methodological recommendations and choices that exist in other fields need to be understood. This chapter surveyed the existing best practices for SEE, as reflected in published elicitation guidance, to identify areas of consensus, places where no consensus exists and other gaps. Identifying areas of commonality across current guidance can support elicitation practice in areas that lack context-specific guidance, such as HCDM. The recommendations and choices for the SEE process identified in this chapter are further explored in Chapters 5–8 and their suitability for HCDM is considered in Chapter 9.
To identify areas of agreement and disagreement in elicitation practice, both domain-specific and generic elicitation guidelines were systematically reviewed according to the search strategy and screening process detailed in Report Supplementary Material 1. A SEE guideline is defined as a document, either peer reviewed or in the grey literature, that advises on the design, preparation, conduct and analysis of a structured elicitation exercise. The review focused on SEE guidelines rather than applications to determine a full list of the possible methodological options, rather than relying on the partial reporting available in applications.
To constrain the scope of this review, guidelines needed to concern explicitly probabilistic judgements and offer guidance on more than one stage of the elicitation process. Literature relating to only one element of elicitation is considered in the targeted searches discussed in Chapters 5 and 6. When the same or overlapping author groups published multiple guidance documents making similar recommendations, only one version was included. An extraction template was used to collect information from each guideline. The extracted data were analysed to create an overview of all of the stages, elements and choices involved in an elicitation, and to understand where current advice across guidelines conflicts or agrees. When the guidelines agreed, we assumed that this represented best practice that could be taken forward within the HCDM context, as applicable. When the guidelines disagreed, we sought additional evidence to support the development of a reference protocol for HCDM (see Chapters 3–8).
The searches identified 16 unique SEE guidelines (see Report Supplementary Material 1, Table 2). Five of the guidelines are generic and aim to inform practice across disciplines, and 11 focus on specific domains. Six of the domain-specific guidelines are agency white papers or agency-sponsored peer-reviewed articles and are tailored to the specific decision-making processes the agencies govern. Agencies issuing guidelines include the European Food Safety Authority (EFSA), the US Environmental Protection Agency (EPA), the Institute and Faculty of Actuaries and the US Nuclear Regulatory Commission. Both the Institute and Faculty of Actuaries and the Nuclear Regulatory Commission have published two distinct guidelines. The 10 guidelines not connected to agencies are based on reviews of existing evidence and practice about elicitation methods (two guidelines), reflections on personal experience and practice (three guidelines), or combinations of review and reflection (five guidelines) (see Report Supplementary Material 1 for details).
Two of the agency SEE guidelines were included with caveats. First, the EFSA guideline covers three distinct elicitation methods, but the classical model and SHELF are presented in other guidelines, so only the portions of the EFSA document related to the EFSA Delphi method are included in this review. 16 Second, the EPA guideline is a white paper released for public review that was not intended to be the final agency report on the subject. 17 However, a final version was never released; the document is widely cited in the elicitation literature and has served as a de facto guideline because nothing has superseded it.
Although the characterisation of the process, including the number and categorisation of steps, differed among the 16 guidelines, the underlying elicitation process described, depicted in Figure 3, was remarkably similar.
FIGURE 3 The elicitation process. a, These steps are described as post-elicitation in some guidelines.
At each step of the elicitation process, analysts are faced with a variety of methodological choices. Table 1 provides the full list of choices described in the 16 guidelines and Table 2 summarises the level of agreement in the recommendations and choices discussed for each element. The following sections discuss the variety of methodological recommendations for each stage made across the guidelines (see Report Supplementary Material 1, Tables 4–15, for further detail).
TABLE 1 Summary of the elicitation elements, components and choices described in SEE guidelines
TABLE 2 Level of agreement on recommendations and choices in SEE guidelines
Structured expert elicitation is often undertaken in areas with many relevant uncertainties, so a decision has to be made about what will be elicited. Only one 18 of the 16 guidelines does not provide advice on selecting what quantities to elicit. Recommendations and choices from the other guidelines are summarised in Report Supplementary Material 1, Table 3.
Five guidelines recommend that elicited variables should be limited to quantities that are, at least in principle, observable. 16 , 20 – 23 This includes probabilities that can be conceptualised as frequencies of an event in a sample of data (even if such data may in practice not be directly available to the expert). However, three guidelines 20 , 24 , 25 argue that elicited quantities can be ‘unobservable’ model parameters, such as odds ratios, provided that they are well defined and understood by the participating experts.
Parameters are here described as ‘unobservable’ if they are complex functions of observable data, such as odds ratios. The guidelines list many types of quantities or parameters that can be elicited, including physical quantities, proportions, frequencies, probabilities and odds ratios. Beyond this, the guidelines give few recommendations; however, aside from Cooke and Goossens, 21 they recommend that experts should not be asked about their uncertainty regarding probabilities, and that such questions should instead be reframed as uncertainty about frequencies in a large population. Choy et al. 22 also recommend against eliciting probabilities directly, but two other guidelines 25 , 26 list it as a possible choice. Chapter 7 further considers the possible types of quantities relevant for HCDM.
Three of the guidelines 16 , 24 , 27 recommend formal processes for selecting what to elicit, and several guidelines 16 , 17 , 21 – 24 , 27 – 31 describe principles the elicited quantities should adhere to. Principles discussed include that questions should be clear and well defined, have neutral wording, be asked in a manner consistent with how experts express their knowledge, and be elicited only when the uncertainty affects the final model and/or decision.
Some SEE guidelines describe two issues related to the quantities to elicit: disaggregation and dependence. Five guidelines 16 , 17 , 23 , 25 , 26 suggest that disaggregating or decomposing a variable makes the questions clearer and the elicitation easier for experts. Five guidelines 20 , 21 , 23 , 28 , 30 also discuss the importance of considering dependence between variables. When dependence is discussed, guidelines recommend reframing dependent items in terms of independent variables wherever possible. If dependence cannot be avoided, the elicitation task will be more complicated, but they recommend assessing conditional scenarios or using other elicitation framing and related techniques to estimate dependence.
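As a purely illustrative example of disaggregation (not drawn from any of the reviewed guidelines, and using hypothetical distributions), a quantity such as the probability that a patient both responds to treatment and remains relapse free at 12 months can be decomposed into an unconditional probability and a conditional probability, each of which may be easier for experts to assess. The sketch below recombines the two elicited components by simulation, under the assumption that, once framed this way, the components can be judged approximately independently.

```python
# Illustrative sketch only: recombining a disaggregated quantity by simulation.
# The beta parameters below are hypothetical stand-ins for elicited distributions.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Component 1: probability of initial response to treatment (hypothetical elicited distribution).
p_response = rng.beta(12, 8, size=n)

# Component 2: probability of remaining relapse free GIVEN a response,
# elicited as a conditional scenario (hypothetical parameters).
p_relapse_free_given_response = rng.beta(15, 5, size=n)

# Recombined target quantity: P(response AND relapse free at 12 months),
# assuming the two components are judged independently once framed this way.
p_target = p_response * p_relapse_free_given_response

print(f"median: {np.median(p_target):.3f}")
print(f"90% interval: {np.percentile(p_target, [5, 95]).round(3)}")
```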
In addition to choosing what questions to put to experts in an elicitation, analysts must also choose how questions will be put to experts. That is, how will experts be asked to assess their uncertainty about the unknown quantities?
Three guidelines 24 , 26 , 28 – all agency documents – either do not discuss methods for encoding judgements at all 26 , 28 or do not offer advice (i.e. neither recommendations nor a list of choices) on the matter. 24 Report Supplementary Material 1, Table 4, summarises the recommendations and choices described by the other 13 guidelines.
Most approaches can be classified as either fixed interval or variable interval. Fixed interval techniques (discussed in six 16 , 17 , 20 , 22 , 25 , 30 of the 16 guidelines) present experts with a specific set of ranges, and the experts provide the probability that the quantity falls within each range. A popular fixed interval technique is the roulette or ‘chips and bins’ method, in which experts construct histograms that represent their beliefs. In contrast, variable interval methods (VIMs) (recommended by five guidelines 16 , 21 , 23 , 27 , 31 and discussed in another five 17 , 20 , 22 , 25 , 30 ) give the experts set probabilities and ask for the corresponding values. Popular VIMs include the bisection and other quantile techniques. These methods are described further in Chapter 8.
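To make the distinction concrete, the following sketch (an illustration with assumed inputs, not taken from any specific guideline) shows how responses from the two families of techniques are typically recorded: chips allocated to analyst-defined bins are normalised into a probability histogram, whereas variable interval answers directly supply points on the expert's cumulative distribution.

```python
# Illustrative sketch of recording fixed interval and variable interval responses;
# the bin edges, chip counts and quantile answers below are hypothetical.
import numpy as np

# Fixed interval ('chips and bins'/roulette): the analyst fixes the bins and the
# expert allocates chips; normalising the counts gives the probability of each bin.
bin_edges = np.array([0, 10, 20, 30, 40, 50])   # analyst-chosen ranges
chips = np.array([1, 4, 8, 5, 2])               # expert's chip allocation
bin_probs = chips / chips.sum()
print("P(quantity falls in each bin):", bin_probs.round(3))

# Variable interval (quantile/bisection): the analyst fixes the probabilities and
# the expert supplies the corresponding values, i.e. points on their own CDF.
probs = [0.05, 0.25, 0.50, 0.75, 0.95]          # analyst-chosen probabilities
values = [12.0, 18.0, 23.0, 29.0, 38.0]         # expert's stated quantiles
for p, v in zip(probs, values):
    print(f"expert judges P(quantity <= {v}) = {p}")
```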
Two guidelines recommend methods that cannot be classified as either fixed interval or variable interval. The Investigate, Discuss, Estimate, Aggregate (IDEA) protocol utilises a combination approach, asking experts to provide a minimum, maximum and best guess for each quantity, as well as a ‘degree of belief’ that reflects the probability that the true value falls between the minimum and the maximum. Experts may all provide assessments for different credible ranges, and the analyst standardises them to an 80% or 90% credible interval (CrI) using linear extrapolation. 32
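One commonly described form of this standardisation, sketched below, stretches (or shrinks) the stated interval around the best guess by the ratio of the target credibility level to the expert's stated degree of belief. This is an illustration only, with hypothetical inputs; the exact procedure, including any truncation at natural bounds, should be taken from the IDEA literature itself.

```python
# Sketch of one common form of linear extrapolation for standardising an expert's
# (minimum, best guess, maximum, degree of belief) response to a target credible
# interval. Illustrative only; the example inputs are hypothetical.
def standardise_interval(low, best, high, stated_belief, target=0.90):
    """Rescale the distances from the best guess to each bound by target/stated_belief."""
    scale = target / stated_belief
    return best - (best - low) * scale, best + (high - best) * scale

# Hypothetical response: minimum 20, best guess 35, maximum 60, 70% degree of belief.
low90, high90 = standardise_interval(20, 35, 60, stated_belief=0.70)
print(f"approximate 90% CrI: ({low90:.1f}, {high90:.1f})")
```

Under this form, an interval stated with less than the target degree of belief is widened, and one stated with more is narrowed, so that all experts' responses can be compared and aggregated on a common credibility level.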
Kaplan’s method takes a very different approach. 18 Rather than asking experts to encode their beliefs in a way that can be transformed or interpreted as a probability distribution, the method requires that experts only discuss evidence related to the quantity of interest before a facilitator creates a probability distribution that reflects the existing evidence and uncertainty.
In addition to the core encoding method, three guidelines 17 , 20 , 29 also note that physical or visual aids can be used by the elicitor(s) to assist with the encoding process.
Despite the variety of encoding methods discussed, none of the guidelines present empirical or anecdotal evidence or other justification for their recommendations or choices. Chapter 8 provides new evidence relating to the choice of encoding method.
Recommendations and choices related to identifying and selecting experts are summarised in Report Supplementary Material 1, Tables 5 and 6. Only one guideline 28 does not discuss the number of experts to include in an elicitation. The others either explicitly recommend or imply that judgements will be elicited from multiple experts. Recommendations for how many experts to include range from four 21 to 20. 32 The EPA white paper 17 is the only guideline that gives considerations beyond practical concerns for how many experts to include in an exercise. It observes that, if opinions vary widely among experts, more experts may be needed. On the other hand, if the experts in a field are highly dependent (e.g. based on similar training or experiences), adding more experts has limited value. The risk of dependence between experts is discussed in only three other guidelines. 20 , 23 , 26
Most guidelines do not address how many facilitators or analysts should be involved in an elicitation. The few that do so state that two or three facilitators are ideal, with the facilitators having different backgrounds or managing different tasks during the elicitation. 17 , 21 , 24 , 27 , 30
Identifying and selecting experts is discussed in all but three guidelines. 18 , 22 , 23 Recommendations from the other 13 guidelines overlap considerably. Common criteria relate to reputation in the field, relevant experience, the number and quality of publications, and the expert’s willingness and availability to participate. Normative expertise is listed as desired by five guidelines, 16 , 24 – 26 , 30 but three 16 , 24 , 30 specify that it is not a requirement.
Five guidelines 17 , 20 , 26 , 28 , 30 recommend that all potential experts disclose a list of their personal and financial interests, often noting that interests should be recorded but will not automatically disqualify an expert from participating, as that may impose too extreme a limit on the pool of possible experts. Eight guidelines recommend that the group of experts included in an elicitation reflects the diversity of opinions and range of fields relevant to the elicitation topic. The agency guidelines tend to provide more details on identifying and selecting experts, with four describing optional procedures for producing a longlist of possible experts that is then winnowed down based on agreed-on selection criteria. Although many guidelines suggest identifying experts through peer nomination, Meyer and Booker 25 caution that this process can, if not well managed, lead to issues related to experts nominating only other people with similar views. Chapter 5 considers the broader literature on selecting and identifying experts.
Recommendations and choices related to piloting and training are summarised in Report Supplementary Material 1, Table 7. Eight guidelines 16 , 17 , 20 – 22 , 25 , 26 , 29 explicitly recommend piloting the elicitation protocol with a subject matter expert not participating in the exercise, and one 32 implies that piloting will be done. The remaining seven guidelines do not discuss piloting. 18 , 23 , 24 , 27 , 28 , 30 , 31
Only one guideline 18 offers training as a choice; the other 15 guidelines all require at least some form of training. Recommendations and suggestions for what should be included in expert training are largely consistent across the guidelines and cover issues related to elicitation generally and the subject matter at hand specifically. Commonly recommended aspects of training include an introduction to probability and uncertainty, an overview of the elicitation process, an introduction to heuristics and biases, the aim and motivation for the elicitation, information on how elicitation will be used, relevant background information, and details of any assumptions or definitions used in the elicitation. Five guidelines 25 – 27 , 29 , 30 recommend using practice questions to ensure that experts understand the elicitation process.
Most guidelines do not discuss what, if any, training should be provided to the elicitation facilitator(s) or other roles involved in conducting an elicitation. Five guidelines, including four generic guidelines, provide material that is meant to assist the facilitator, including sample text and forms. 16 , 21 , 25 , 30 , 32
Recommendations and choices about the mode of administration and the level of elicitation (group or individual) are summarised in Report Supplementary Material 1, Table 8.
Elicitations can be conducted in person, in either individual interviews or group workshops, or remotely via the internet, e-mail, mail, telephone, video conferencing or other means. Nine guidelines 17 , 18 , 21 , 23 , 24 , 26 , 27 , 29 , 30 recommend in-person elicitation and only one guideline 16 recommends remote elicitation. Eight guidelines 17 , 22 , 23 , 25 , 28 , 29 , 31 , 32 list remote elicitation as a choice, recognising that it may be logistically easier to arrange than an in-person elicitation.
The mode of administration may be governed by whether or not a method elicits judgements from individual experts (i.e. each expert provides an individual assessment) or groups (i.e. a group of experts provides a single assessment). Of the 16 guidelines, only that by Choy et al. 22 does not discuss the level of elicitation. Group-level elicitation is only recommended by Kaplan, 18 who recommends a process in which experts discuss the evidence relevant to an elicitation variable and then the facilitator proposes a probability distribution that matches the input provided by all of the experts. Individual-level elicitation is recommended by five guidelines, 16 , 21 , 26 , 27 , 32 and two guidelines 24 , 30 recommend a combination approach wherein individual assessments are elicited first, after which the group works to provide a communal assessment that reflects the diversity of opinion in the group. Chapter 5 provides more detail on individual-level compared with group-level elicitation.
All guidelines but one 25 discuss the importance of feedback and revision, but three guidelines 20 , 28 , 29 do not provide information on how it should be done. The other guidelines discuss a range of possible feedback methods, which can provide information on an individual’s judgements, the aggregated group judgements or a summary of what the other experts provided. Recommendations and choices about feedback and revision are summarised in Report Supplementary Material 1, Table 9.
Only the guideline by Knol et al. 29 warns of a possible negative impact of feedback and revision, cautioning that it can cause unwanted regression to the mean in the experts’ revised assessments. None of the guidelines recommends against providing feedback and opportunities for revision in any form. The feedback of group summary judgements is investigated in Chapter 8.
Recommendations and choices regarding interaction and rationales are summarised in Report Supplementary Material 1, Table 10. Three guidelines did not explicitly discuss interaction between the experts. 21 , 22 , 31 Although no guidelines recommended avoiding interaction, seven guidelines 17 , 20 , 23 , 25 , 27 – 29 say that no interaction is a possible choice. Interaction is closely related to the level of elicitation, with guidelines recommending group discussion prior to individual elicitation, group discussion prior to and during a group elicitation, and group discussion following an individual elicitation. One guideline 16 recommended that interaction should be limited to a remote, anonymous, facilitated process. Other guidelines also described these options as choices. 17 , 20 , 25 , 32
Although the guidelines disagreed about whether and how interaction should be managed in an elicitation, many do present more justification for the recommendations or choices around interaction than they do for other methodological choices. The benefits of interaction between experts are that it minimises the differences in assessments that are due to different information or interpretation 29 and allows analysts to explore correlation between experts. 23 The drawbacks, however, are that it can allow strong personalities to carry too much weight, 20 , 23 , 29 the experts may feel pressure to reach a consensus, 20 there may be a risk of confrontation 23 and interaction can encourage groupthink, resulting in the experts being overconfident. 28 Practical considerations can also guide the choice of whether and how to include interaction, as individual interviews may take more time, but a group workshop may be more expensive. 29 These issues are further discussed in Chapter 5.
Only one guideline 25 presented collecting the experts’ rationales during an elicitation as a choice rather than a recommendation. The other 15 guidelines all recommend collecting rationales because they help analysts and decision-makers understand what an answer is based on, 20 , 23 , 28 provide a check of the internal consistency of an expert’s responses, 20 record any assumptions 27 and may help limit biases. 22 The information collected in rationales can also be useful for peer review or for future updating of the judgements. 28
One guideline 31 also recommended collecting rationales from the decision-maker about how they use the expert judgement results.
Even when eliciting judgements from multiple experts, it can be important to have a single distribution, for use in modelling, that reflects the beliefs of the experts. Recommendations and choices on aggregation methods are summarised in Report Supplementary Material 1, Table 11. Five guidelines 17 , 22 , 26 , 28 , 29 presented aggregation as a choice, but the remaining 11 recommended that aggregation always be done. 16 , 18 , 20 , 21 , 23 – 25 , 27 , 30 – 32
Aggregation can be behavioural or mathematical. In behavioural aggregation, experts interact with the goal of producing a single, consensus distribution. Mathematical aggregation involves the facilitator(s) eliciting individual assessments from the experts and then combining them into a single distribution through a mathematical process. Two guidelines recommend behavioural aggregation. Kaplan 18 recommends a process that includes group-level elicitation and behavioural aggregation: the experts discuss the evidence relevant to an elicitation variable, the facilitator suggests a probability distribution that reflects the diversity of evidence on the subject and then the process concludes when there is consensus from the experts about the proposed distribution. The SHELF method recommends an initial round of individual-level elicitations followed by expert discussion designed to produce a single distribution that represents how a ‘rational independent observer’ would summarise the range of expert opinions. 30
Four 16 , 21 , 26 , 32 of the guidelines recommended variations on mathematical aggregation. Three guidelines 16 , 26 , 32 recommended combining expert judgements in a linear opinion pool that equally weights all of the experts. The guideline by Cooke and Goossens 21 is the only one to recommend mathematical aggregation with differential weights for the experts. Cooke and Goossens 21 suggested a method whereby the experts are scored and weighted according to their performance in assessing a set of seed questions, which are quantities whose true values are unknown to the experts but known to the facilitator.
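As a minimal sketch of mathematical aggregation (illustrative only; the experts' distributions and the differential weights below are hypothetical), a linear opinion pool is simply a weighted average of the individual densities, with equal weights as one special case and differential weights, such as performance-based weights, entering the same formula.

```python
# Minimal sketch of a linear opinion pool: the combined density is a weighted
# average of the experts' densities. Distributions and weights are hypothetical.
import numpy as np
from scipy import stats

x = np.linspace(0, 1, 501)   # grid over the elicited quantity (here a proportion)
dx = x[1] - x[0]

# Hypothetical fitted distributions for three experts.
expert_pdfs = [
    stats.beta(8, 12).pdf(x),
    stats.beta(10, 10).pdf(x),
    stats.beta(5, 15).pdf(x),
]

def linear_pool(pdfs, weights):
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                               # normalise the weights
    return sum(wi * p for wi, p in zip(w, pdfs))  # weighted average of densities

equal_pool = linear_pool(expert_pdfs, [1, 1, 1])        # equal weights
diff_pool = linear_pool(expert_pdfs, [0.5, 0.3, 0.2])   # differential (e.g. performance-based) weights

print("pooled mean (equal weights):", round(float(np.sum(x * equal_pool) * dx), 3))
print("pooled mean (differential weights):", round(float(np.sum(x * diff_pool) * dx), 3))
```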
Budnitz et al. 24 recommend a unique approach wherein the analysts determine the aggregation method during an elicitation, based on an evaluation of how the process is unfolding and determining what is most appropriate. They recommend that a behavioural aggregation-based consensus is the best choice, but believe it is not appropriate in all situations. The analysts can also decide to use mathematical aggregation with equal weights or analyst-determined weights or a process similar to that recommended by Kaplan, 18 in which the analysts supply a distribution that they believe captures the discussion and evidence presented by the experts.
As with interaction, several of the guidelines give more background to help guide an analyst in his or her choice of method. The main drawback of aggregation, according to Tredger et al., 28 is that it can lead to a result that no one believes. Two guidelines 20 , 24 warn that expert selection is of increased importance if an elicitation will use mathematical aggregation with an opinion pool, particularly equal weights, as increasing the number of experts with similar beliefs will result in those beliefs having more influence in the final, aggregated distribution. Garthwaite et al. 20 also suggest that opinion pools may be problematic because the result does not represent any one person or group’s opinion, whereas Bayesian weighting requires a lot of information on the decision-maker’s views of the experts’ opinions. Finally, several guidelines 16 , 20 , 23 , 25 , 26 , 28 , 30 , 31 discuss that the possible issues around behavioural aggregation are linked to the challenge of properly managing group interactions, discussed above. The broader literature on aggregation is discussed in Chapter 5.
Recommendations and choices on fitting to distribution are summarised in Report Supplementary Material 1, Table 12. Analysts can fit the elicited data to a probability distribution either as part of the elicitation or during post-elicitation analysis of the data. Possible choices, discussed in about half of the guidelines, include fitting to a parametric distribution, using non-parametric approaches or just using the information directly elicited from the experts.
None of the guidelines recommended specific distributions to be used in fitting, but they say that analysts should choose based on the nature of the elicited quantity and the information provided by the experts. Cooke and Goossens 21 describe probabilistic inversion, a method that can be used when the observable elicited variable needs to be transformed into a distribution on an unobservable model parameter. Chapter 5 explores issues of fitting judgements to distributions in more detail.
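As an illustration of the kind of parametric fitting described here, the sketch below fits a distribution to elicited quantiles by least squares; the quantiles are hypothetical and the choice of a beta family is an assumption suited to a quantity that is a proportion, not a recommendation from any of the guidelines.

```python
# Sketch of fitting a parametric distribution to elicited quantiles by least squares.
# The elicited quantiles and the choice of a beta family are illustrative assumptions.
import numpy as np
from scipy import optimize, stats

probs = np.array([0.05, 0.50, 0.95])    # probabilities put to the expert
quants = np.array([0.15, 0.30, 0.55])   # the expert's stated values (a proportion)

def loss(log_params):
    a, b = np.exp(log_params)            # keep the shape parameters positive
    return np.sum((stats.beta(a, b).ppf(probs) - quants) ** 2)

result = optimize.minimize(loss, x0=np.log([2.0, 2.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(result.x)
print(f"fitted beta({a_hat:.2f}, {b_hat:.2f})")
print("implied quantiles:", stats.beta(a_hat, b_hat).ppf(probs).round(3))
```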
Recommendations and choices related to the other post-elicitation components are summarised in Report Supplementary Material 1, Table 13. Only two guidelines discussed obtaining feedback from the experts on the elicitation process. Walls and Quigley 23 recommended that analysts ask experts what could have been done differently if new data are later collected that differ from the experts’ judgements. The EFSA Delphi 16 recommended that analysts give experts a questionnaire with the opportunity to provide general comments on the elicitation questions and process.
None of the guidelines recommended that analysts should adjust experts’ assessments, but five describe related choices, such as manually adjusting assessments 16 , 20 or dropping an expert from the panel. 23 , 24 Adjusting assessments to make them more accurate is explicitly recommended against by two guidelines. 20 , 25
Documenting the elicitation process and results is the only elicitation element discussed by all 16 guidelines. Although the specific recommendations regarding what to include in the final documentation vary across the guidelines, they do not conflict. The guidelines typically recommend that documentation includes the elicitation questions, experts’ individual (if elicited) and aggregated responses, experts’ rationales and a detailed description of the procedures and design of the elicitation, including the reasoning behind any methodological decisions. Many of the agency guidelines are more prescriptive about what documentation should entail, and some provide detailed templates. 16 , 17 , 31
Expert judgements are affected by a variety of heuristics and biases. 33 , 34 Morgan 35 argues that these biases cannot be completely eliminated, but that the elicitation process is designed to minimise their influence on the results. The 16 reviewed guidelines discussed 11 different cognitive biases and eight motivational biases that can affect an elicitation. A list of the biases discussed and possible actions to minimise them can be found in Report Supplementary Material 1, Table 14.
Most of the bias-reducing actions mentioned by SEE guidelines are discussed in only one or two guidelines, but the actions do not conflict with one another. The most frequently recommended actions are to frame questions in a way that minimises biases (discussed in five guidelines 16 , 22 , 23 , 28 , 32 ) and to ask for the upper and lower bound first, to avoid anchoring (discussed in three guidelines 26 , 30 , 32 ). Although most guidelines offer some recommendations for mitigating and managing biases, they present little to no empirical evidence to support that their recommended actions have the intended effect. The broader literature on heuristics and biases is reviewed in Chapter 6.
Four guidelines 16 , 18 , 25 , 30 do not discuss how to ensure the validity of elicited results and the other 12 guidelines present a range of perspectives on what is meant by validity, summarised in Report Supplementary Material 1, Table 15. Validity can mean that the exercise captured what the experts believe (even if that is later proven false). 20 It can also refer to whether the expressed quantities correspond to reality, 20 , 21 , 23 , 32 are consistent with the laws of probability 20 , 23 or are internally consistent. 26 , 29 Some guidelines – all agency documents – also view validity as mostly concerned with the process, rather than the results, and suggest that an elicitation is valid if it has been subjected to peer review. 17 , 24 , 31 Recommendations and choices for handling validity differ across the guidelines and can involve actions at any stage of the elicitation process, depending on what definition of validity the guideline seeks to achieve.
The SEE guideline review reveals a developing body of work designed to guide elicitation practice. Although the guidelines evolved separately in different fields, they largely agree on issues around what quantities to elicit, expert selection, the importance of piloting the exercise and training experts, face-to-face elicitation being preferable to remote modes, the importance of collecting rationales from the experts alongside the quantitative assessments, fitting assessments to distributions, the key role documentation plays in supporting and communicating an elicitation exercise, and how to manage heuristics and biases. The guidelines recommend different approaches for encoding judgements, using individual- or group-level elicitation, aggregating judgements and managing interaction between the experts. Although the guidelines agree that validation is important, they disagree on what actions an analyst can take to encourage or demonstrate validity. Finally, some areas seem underdiscussed. Dependence between questions, for example, is a complicated issue that could be critically important when interpreting elicitation results, but little guidance exists on the topic.
The elicitation choices identified in this review are further considered in Chapters 5–8, and their suitability for use in the HCDM context is evaluated in Chapter 9.
Copyright © Queen’s Printer and Controller of HMSO 2021. This work was produced by Bojke et al. under the terms of a commissioning contract issued by the Secretary of State for Health and Social Care. This issue may be freely reproduced for the purposes of private research and study and extracts (or indeed, the full report) may be included in professional journals provided that suitable acknowledgement is made and the reproduction is not associated with any form of advertising. Applications for commercial reproduction should be addressed to: NIHR Journals Library, National Institute for Health Research, Evaluation, Trials and Studies Coordinating Centre, Alpha House, University of Southampton Science Park, Southampton SO16 7NS, UK.