Determining sample size for progression criteria for pragmatic pilot RCTs: the hypothesis test strikes back!
Pilot and Feasibility Studies volume 7, Article number: 40 (2021)
Abstract
Background
The current CONSORT guidelines for reporting pilot trials do not recommend hypothesis testing of clinical outcomes, on the basis that a pilot trial is underpowered to detect such differences and that detecting them is the aim of the main trial. They state that primary evaluation should focus on descriptive analysis of feasibility/process outcomes (e.g. recruitment, adherence, treatment fidelity). Whilst the argument for not testing clinical outcomes is justifiable, the same does not necessarily apply to feasibility/process outcomes, where differences may be large and detectable with small samples. Moreover, there remains much ambiguity around sample size for pilot trials.
Methods
Many pilot trials adopt a ‘traffic light’ system for evaluating progression to the main trial determined by a set of criteria set up a priori. We construct a hypothesis testing approach for binary feasibility outcomes focused around this system that tests against being in the RED zone (unacceptable outcome) based on an expectation of being in the GREEN zone (acceptable outcome) and choose the sample size to give high power to reject being in the RED zone if the GREEN zone holds true. Pilot point estimates falling in the RED zone will be statistically nonsignificant and in the GREEN zone will be significant; the AMBER zone designates potentially acceptable outcome and statistical tests may be significant or nonsignificant.
Results
For example, in relation to treatment fidelity, if we assume the upper boundary of the RED zone is 50% and the lower boundary of the GREEN zone is 75% (designating unacceptable and acceptable treatment fidelity, respectively), the sample size required for analysis given 90% power and one-sided 5% alpha would be around n = 34 (intervention group alone). Observed treatment fidelity in the range of 0–17 participants (0–50%) will fall into the RED zone and be statistically nonsignificant, 18–25 (51–74%) fall into AMBER and may or may not be significant and 26–34 (75–100%) fall into GREEN and will be significant indicating acceptable fidelity.
Discussion
In general, several key process outcomes are assessed for progression to a main trial; a composite approach would require appraising the rules of progression across all these outcomes. This methodology provides a formal framework for hypothesis testing and sample size indication around process outcome evaluation for pilot RCTs.
Background
The importance and need for pilot and feasibility studies is clear: “A well-conducted pilot study, giving a clear list of aims and objectives … will encourage methodological rigour … and will lead to higher quality RCTs” [1]. The CONSORT extension to external pilot and feasibility trials was published in 2016 [2] with the following key methodological recommendations: (i) investigate areas of uncertainty about the future definitive RCT; (ii) ensure primary aims/objectives are about feasibility, which should guide the methodology used; (iii) include assessments to address the feasibility objectives which should be the main focus of data collection and analysis; and (iv) build decision processes into the pilot design whether or how to proceed to the main study. Given that many trials incur process problems during implementation, particularly with regard to recruitment [3,4,5], the need for pilot and feasibility studies is evident.
One aspect of pilot and feasibility studies that remains unclear is the required sample size. There is no consensus but recommendations vary from 10 to 12 per group through to 60–75 per group depending on the main objective of the study. Sample size may be based on precision of a feasibility parameter [6, 7]; precision of a clinical parameter which may inform main trial sample size—particularly the standard deviation (SD) [8,9,10,11] but also event rate [12] and effect size [13, 14]; or, to a lesser degree, for clinical scale evaluation [9, 15]. Billingham et al. [16] reported that the median sample size of pilot and feasibility studies is around 30–36 per group but there is wide variation. Herbert et al. [17] reported that targets within internal as opposed to external pilots are often slightly larger and somewhat different, being based on percentages of the total sample size and timeline rather than any fixed sample requirement.
The need for a clear directive on sample size of studies is of utmost relevance. The CONSORT extension [2] reports that “Pilot size should be based on feasibility objectives and some rationale given” and states that a “confidence interval approach may be used to calculate and justify the sample size based on key feasibility objective(s)”. Specifically, item 7a (How sample size was determined: Rationale for numbers in the pilot trial) qualifies: “Many pilot trials have key objectives related to estimating rates of acceptance, recruitment, retention, or uptake … for these sorts of objectives, numbers required in the study should ideally be set to ensure a desired degree of precision around the estimated rate”. Item 7b (When applicable, explanation of any interim analyses and stopping guidelines) is generally an uncommon scenario for pilot and feasibility studies and is not given consideration here.
A key aspect of pilot and feasibility studies is to inform progression to the main trial, which has important implications for all key stakeholders (funders, researchers, clinicians and patients). The CONSORT extension [2] states that “decision processes about how to proceed needs to be built into the pilot design (which might involve formal progression criteria to decide whether to proceed, proceed with amendments, or not to proceed)” and authors should present “if applicable, the prespecified criteria used to judge whether or how to proceed with a future definitive RCT; … implications for progression from pilot to future definitive RCT, including any proposed amendments”. Avery et al. [18] published recommendations for internal pilots emphasising a traffic light (stop-amend-go/red-amber-green) approach to progression with focus on process assessment (recruitment, protocol adherence, follow-up) and transparent reporting around the choice of trial design and the decision-making processes for stopping, amending or proceeding to a main trial. The review of Herbert et al. [17] reported that the use of progression criteria (including recruitment rate) and traffic light stop-amend-go as opposed to simple stop-go is increasing for internal pilot studies.
A common misuse of pilot and feasibility studies has been the application of hypothesis testing for clinical outcomes in small underpowered studies. Arain et al. [19] claimed that pilot studies were often poorly reported with inappropriate emphasis on hypothesis testing. They reviewed 54 pilot and feasibility studies published in 2007–2008, of which 81% incorporated hypothesis testing of clinical outcomes. Similarly, Leon et al. [20] stated that a pilot is not a hypothesis testing study: safety, efficacy and effectiveness should not be evaluated. Despite this, hypothesis testing has been commonly performed for clinical effectiveness/efficacy without reasonable justification. Horne et al. [21] reviewed 31 pilot trials published in physical therapy journals between 2012 and 2015 and found that only 4/31 (13%) carried out a valid sample size calculation on effectiveness/efficacy outcomes but 26/31 (84%) used hypothesis testing. Wilson et al. [22] acknowledged a number of statistical challenges in assessing potential efficacy of complex interventions in pilot and feasibility studies. The CONSORT extension [2] reaffirmed many researchers’ views that formal hypothesis testing for effectiveness/efficacy is not recommended in pilot/feasibility studies since they are underpowered to do so. Sim’s commentary [23] further contests such testing of clinical outcomes stating that treatment effects calculated from pilot or feasibility studies should not be the basis of a sample size calculation for a main trial.
However, when the focus of analysis is on confidence interval estimation for process outcomes, this does not give a definitive basis for acceptance/rejection of progression criteria linked to formal powering. The issue in this regard is that precision focuses on alpha (α, type I error) without clear consideration of beta (β, type II error) and may therefore not reasonably capture true differences if a study is underpowered. Further, it could be argued that hypothesis testing of feasibility outcomes (as well as addressing both alpha and beta) is justified on the grounds that moderate-to-large differences (‘process-effects’) may be expected rather than small differences that would require large sample numbers. Moore et al. [24] previously stated that some pilot studies require hypothesis testing to guide decisions about whether larger subsequent studies can be undertaken, giving the following example of how this could be done for feasibility outcomes: asking the question “Is taste of dietary supplement acceptable to at least 95% of the target population?”, they showed that sample sizes of 30, 50 and 70 provide 48%, 78% and 84% power to reject an acceptance rate of 85% or lower if the true acceptance rate is 95% using a 1-sided α = 0.05 binomial test. Schoenfeld [25] advocates that, even for clinical outcomes, there may be a place for testing at the level of clinical ‘indication’ rather than ‘clinical evidence’. He suggested that preliminary hypothesis testing for efficacy could be conducted with high alpha (up to 0.25), not to provide definitive evidence but as an indication as to whether a larger study should be conducted. Lee et al. [14] also reported how type I error levels other than the traditional 5% could be considered to provide preliminary evidence for efficacy, although they stopped short of recommending this, concluding that a confidence interval approach is preferable.
Current recommendations for sample sizes of pilot/feasibility studies vary, have a single rather than a multicriterion basis, and do not necessarily link directly to formal progression criteria. The purpose of this article is to introduce a simple methodology that allows sample size derivation and formal testing of proposed progression cutoffs, whilst offering suggestions for multicriterion assessment, thereby giving clear guidance and signposting for researchers embarking on a pilot/feasibility study to assess uncertainty in feasibility parameters prior to a main trial. The suggestions within the article do not directly apply to internal pilot studies built into the design of a main trial, but given the similarities to external randomised pilot and feasibility studies, many of the principles outlined here for external pilots might also extend to some degree to internal pilots of randomised and nonrandomised studies.
Methods
The proposed approach focuses on estimation and hypothesis testing of progression criteria for feasibility outcomes that are potentially modifiable (e.g. recruitment, treatment fidelity/adherence, level of follow-up). Thus, it aligns with the main aims and objectives of pilot and feasibility studies and with the progression stop-amend-go recommendations of Eldridge et al. [2] and Avery et al. [18].
Hypothesis concept
Let R_{UL} denote the upper RED zone cutoff and G_{LL} denote the lower GREEN zone cutoff. The concept is to set up hypothesis testing around progression criteria that tests against being in the RED zone (designating unacceptable feasibility—‘STOP’) based on an alternative of being in the GREEN zone (designating acceptable feasibility—‘GO’). This is analogous to the zero difference (null) and clinically important difference (alternative) in a main superiority trial. Specifically, we are testing against R_{UL} when G_{LL} is hypothesised to be true:

Null hypothesis: True feasibility outcome (ε) not greater than the upper “RED” stop limit (R_{UL})

Alternative hypothesis: True feasibility outcome (ε) is greater than R_{UL}
The test is a 1-tailed test with suggested alpha (α) of 0.05 and beta (β) of 0.05, 0.1 or 0.2, dependent on the required strength of evidence of the test. An example of a feasibility outcome might be percentage recruitment uptake.
Progression rules
Let E denote the observed point estimate (ranging from 0 to 1 for proportions, or for percentages 0–100%). Simple 3-tiered progression criteria would follow as:

E ≤ R_{UL} [P value nonsignificant (P ≥ α)] → RED (unacceptable: STOP)

R_{UL} < E < G_{LL} → AMBER (potentially acceptable: AMEND)

E ≥ G_{LL} [P value significant (P < α)] → GREEN (acceptable: GO)
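The three rules above can be sketched as a small function. This is a minimal illustration only; the function name `traffic_light` and its arguments are hypothetical and not part of the article.

```python
def traffic_light(successes, n, r_ul, g_ll):
    """Classify an observed proportion E = successes / n into a zone.

    r_ul is the upper RED cut-off and g_ll the lower GREEN cut-off,
    both expressed as proportions with r_ul < g_ll.
    """
    e = successes / n
    if e <= r_ul:
        return "RED"      # unacceptable: STOP
    if e >= g_ll:
        return "GREEN"    # acceptable: GO
    return "AMBER"        # potentially acceptable: AMEND


# Treatment fidelity illustration: R_UL = 0.50, G_LL = 0.75, n = 34
for count in (17, 18, 25, 26):
    print(count, traffic_light(count, 34, 0.50, 0.75))
```

With n = 34 this gives RED for 0–17, AMBER for 18–25 and GREEN for 26–34 participants.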
Sample size
Table 1 displays a quick lookup grid for sample size across a range of anticipated proportions for R_{UL} and G_{LL} for one-sample one-sided 5% alpha with typical 80% and 90% (as well as 95%) power for the normal approximation method with continuity correction (see Appendix for corresponding mathematical expression; derived from Fleiss et al. [26]). Table 2 is the same lookup grid relating to the Binomial exact approach with sample sizes derived using G*Power version 3.1.9.7 [27]. Clearly, as the difference between proportions R_{UL} and G_{LL} increases the sample size requirement is reduced.
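As the Appendix expression is not reproduced in this section, the following is a sketch under an assumption: it uses a commonly cited one-sample normal approximation, n' = [(z_α√(R(1−R)) + z_β√(G(1−G))) / (G−R)]², with a Fleiss-type continuity correction n = (n'/4)(1 + √(1 + 2/(n'|G−R|)))². The function name and the exact form of the correction are assumptions, and the rounding convention used in Tables 1 and 2 is not stated here, so the unrounded value is returned.

```python
from math import sqrt
from statistics import NormalDist


def pilot_sample_size(r_ul, g_ll, alpha=0.05, power=0.90):
    """Return (uncorrected n, continuity-corrected n) as unrounded floats.

    Assumed formula (not taken from the article's Appendix): one-sample,
    one-sided normal approximation with a Fleiss-type continuity correction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    delta = abs(g_ll - r_ul)
    n_plain = ((z_alpha * sqrt(r_ul * (1 - r_ul))
                + z_beta * sqrt(g_ll * (1 - g_ll))) / delta) ** 2
    n_cc = n_plain / 4 * (1 + sqrt(1 + 2 / (n_plain * delta))) ** 2
    return n_plain, n_cc


n_plain, n_cc = pilot_sample_size(0.50, 0.75)
print(round(n_plain, 1), round(n_cc, 1))
```

For R_{UL} = 0.5 and G_{LL} = 0.75 at 90% power the corrected value is a little above 34, which is consistent with the figure of around n = 34 quoted for the treatment fidelity illustration.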
Multicriteria assessment
We recommend that progression for all key feasibility criteria should be considered separately, and hence overall progression would be determined by the worst-performing criterion, e.g. RED if at least one signal is RED, AMBER if none of the signals fall into RED but at least one falls into AMBER and GREEN if all signals fall into the GREEN zone. Hence, the GREEN signal to ‘GO’ across the set of individual criteria will give indication that progression to a main trial can take place without any necessary changes. A signal to ‘STOP’ and not proceed to a main trial is recommended if any of the observed estimates are ‘unacceptably’ low (i.e. fall within the RED zone). Otherwise, where neither ‘GO’ nor ‘STOP’ are signalled, the design of the trial will need amending by indication of subpar performance on one or more of the criteria.
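The worst-performing-criterion rule can be sketched as follows. The names `SEVERITY` and `overall_signal` are illustrative only and do not come from the article.

```python
# Higher number = worse signal; names are illustrative, not from the article.
SEVERITY = {"GREEN": 0, "AMBER": 1, "RED": 2}


def overall_signal(signals):
    """Return the worst signal across a non-empty list of per-criterion signals."""
    return max(signals, key=SEVERITY.__getitem__)


print(overall_signal(["GREEN", "AMBER", "GREEN"]))  # AMBER
print(overall_signal(["GREEN", "RED", "AMBER"]))    # RED
```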
Sample size requirements across multicriteria will vary according to the designated parameters linked to the progression criteria, which may be set at different stages of the study on different numbers of patients (e.g. those screened, eligible, recruited and randomised, allocated to the intervention arm, total followed up). The overall size needed will be dictated by the requirement to power each of the multicriteria statistical tests. Since these tests will yield separate conclusions in regard to the decision to ‘STOP’, ‘AMEND’ or ‘GO’ across all individual feasibility criteria there is no need to consider a multiple testing correction with respect to alpha. However, researchers may wish to increase power (and hence, sample size) to ensure adequate power to detect ‘GO’ signals across the collective set of feasibility criteria. For example, powering at 90% across three criteria (assumed independent) will ensure a collective power of 73% (i.e. 0.9^{3}), which may be considered reasonable, but 80% power across five criteria will reduce the power of the combined test to 33%. The final three columns of Table 1 cover the sample sizes required for 95% power, which may address collective multicriteria assessment when considering keeping a high overall statistical power.
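The collective power figures quoted above follow from multiplying the individual powers, which is valid only under the stated independence assumption. A minimal sketch (the function name is hypothetical):

```python
from math import prod


def collective_power(powers):
    """Probability that every test rejects RED, assuming independent criteria."""
    return prod(powers)


print(round(collective_power([0.90] * 3), 2))  # 0.73
print(round(collective_power([0.80] * 5), 2))  # 0.33
```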
Further expansion of AMBER zone
Within the same sample size framework, the AMBER zone may be further split to indicate whether ‘minor’ or ‘major’ amendments are required according to the significance of the p value. Consider a 2-way split in the AMBER zone denoted by cutoff A_{C}, which indicates the threshold for statistical significance, where an observed estimate below the cutpoint will result in a nonsignificant result and an estimate at or above the cutpoint a significant result. Let AMBER_{R} denote the region of the AMBER zone adjacent to the RED zone between R_{UL} and A_{C}, and AMBER_{G} denote the region of the AMBER zone between A_{C} and G_{LL} adjacent to the GREEN zone. This would draw on two possible levels of amendment (‘major’ AMEND and ‘minor’ AMEND) and the reconfigured approach would follow as:

E ≤ R_{UL} [P value nonsignificant (P ≥ α)] → RED (unacceptable: STOP)

R_{UL} < E < G_{LL} → AMBER (potentially acceptable: AMEND)

R_{UL} < E < G_{LL} and P ≥ α {R_{UL} < E < A_{C}} → AMBER_{R} (major AMEND)

R_{UL} < E < G_{LL} and P < α {A_{C} ≤ E < G_{LL}} → AMBER_{G} (minor AMEND)


E ≥ G_{LL} [P value significant (P < α)] → GREEN (acceptable: GO)
In Tables 1 and 2 in relation to designated sample sizes for different R_{UL} and G_{LL} and specified α and β, we show the corresponding cutpoints for statistical significance (p < 0.05) both in absolute terms of sample number (n) [A_{C}] and as a percentage of the total sample sizes [A_{C}%].
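The 4-tiered rules can be sketched by combining the zone boundaries with a one-sided test against R_{UL}. The sketch below uses an exact binomial p value, which is one of the two approaches tabulated; the function names are hypothetical, and the cutpoint implied for a given n may differ from the A_{C} values of the normal approximation approach.

```python
from math import comb


def exact_p_value(successes, n, r_ul):
    """One-sided exact binomial p value: P(X >= successes | n, true rate r_ul)."""
    return sum(comb(n, k) * r_ul ** k * (1 - r_ul) ** (n - k)
               for k in range(successes, n + 1))


def four_tier(successes, n, r_ul, g_ll, alpha=0.05):
    """Classify into RED, AMBER_R (major AMEND), AMBER_G (minor AMEND) or GREEN."""
    e = successes / n
    if e <= r_ul:
        return "RED"
    if e >= g_ll:
        return "GREEN"
    # Inside the AMBER zone the split depends on statistical significance.
    return "AMBER_G" if exact_p_value(successes, n, r_ul) < alpha else "AMBER_R"


# Treatment fidelity illustration: R_UL = 0.50, G_LL = 0.75, n = 34
for count in (17, 22, 23, 26):
    print(count, four_tier(count, 34, 0.50, 0.75))
```

Under these assumptions, 22 of 34 is nonsignificant (AMBER_{R}) whereas 23 of 34 is significant (AMBER_{G}).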
Results
A motivating example (aligned to the normal approximation approach) is presented in Table 3, which illustrates a pilot trial with three progression criteria. Table 4 presents the sample size calculations for the example scenario following the 3-tiered approach, and Table 5 gives the sample size calculations for the example scenario using the extended 4-tiered approach. Cutpoints for the feasibility outcomes relating to the shown sample sizes are also presented to show RED, AMBER and GREEN zones for each of the three progression criteria.
Overall sample size requirement should be dictated by the multicriteria approach. This is illustrated in Table 4 where we have three progression criteria each with a different denominator population. For recruitment uptake, the denominator denotes the total number of children screened and the numerator the number of children randomised; for follow-up, the denominator is the number of children randomised with the numerator being the number of those randomised who are successfully followed up; and lastly for treatment fidelity, the denominator is the number allocated to the intervention arm with the numerator being the number of children who were administered the treatment correctly by the dietician. In the example, in order to meet the individual ≥ 90% power requirement for all three criteria we would need: (i) for recruitment, the number to be screened to be 78; (ii) for treatment fidelity, the number in the intervention arm to be 34; and (iii) for follow-up, the number randomised to be 44. In order to determine the overall sample size for the whole study, we base our decision on the criterion that requires the largest numbers, which is the treatment fidelity criterion: 34 in the intervention arm requires 68 to be randomised in a 1:1 pilot. We cannot base our decision on the 78 required to be screened for recruitment because this would give only an expected number of 28 randomised (i.e. 35% of 78). If we expect 35% recruitment uptake, then we need to inflate the total 68 (randomised) to be 195 (1/0.35 × 68) children to be screened (rounded to 200). This would give 99.9%, 90% and 98.8% power for criteria (i), (ii) and (iii), respectively (assuming 68 of the 200 screened are randomised), giving a very reasonable collective 88.8% power of rejecting the null hypotheses over the three criteria if the alternative hypotheses (for acceptable feasibility outcomes) are true in each case.
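The arithmetic behind the overall sample size can be traced in a few lines. This is an illustrative sketch using only the figures quoted in the paragraph above; the function name and the 1:1 allocation default are assumptions made for the example.

```python
from math import ceil, prod


def screening_target(n_intervention, uptake, arms=2):
    """Return (number to randomise, number to screen) for equal-sized arms."""
    n_randomised = arms * n_intervention
    n_screened = ceil(n_randomised / uptake)
    return n_randomised, n_screened


n_randomised, n_screened = screening_target(34, 0.35)
print(n_randomised, n_screened)  # 68 195

# Collective power over the three criteria, assuming independence
print(round(prod([0.999, 0.90, 0.988]), 3))  # 0.888
```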
Inherent in our approach are the probabilities around sample size, power and hypothesised feasibility parameters. For example, taking the cutoffs from treatment fidelity as a feasibility outcome from Table 4 (ii), we set a lower GREEN zone limit of G_{LL} = 0.75 (“acceptable” (hypothesised alternative value)) and an upper RED zone limit of R_{UL} = 0.5 (“not acceptable” (hypothesised null value)) for rejecting the null for this criterion based on 90% power and a 1-sided 5% significance level (alpha). Figure 1 presents the normal probability density functions for ε, for the null and alternative hypotheses. In the illustration this would imply through normal sampling theory that if G_{LL} holds true (i.e. true feasibility outcome (ε) = G_{LL}) there would be the following:

A probability of 0.1 (type II error probability β) of the estimate falling within RED/AMBER_{R} zones (i.e. blue shaded area under the curve to the left of A_{C} where the test result will be nonsignificant (p ≥ 0.05))

Probability of 0.4 of it falling in the AMBER_{G} zone (i.e. area under the curve to the right of A_{C} but below G_{LL})

Probability of 0.5 of the estimate falling in the GREEN zone (i.e. G_{LL} and above).
If R_{UL} (the null) holds true (i.e. true feasibility outcome (ε) = R_{UL}), there would be the following:

A probability of 0.05 (one-tailed type I error probability α) of the statistic/estimate falling in the AMBER_{G}/GREEN zones (i.e. pink shaded area under the curve to the right of A_{C} where the test result will be significant (p < 0.05) as shown within Fig. 1)

Probability of 0.45 of it falling in the AMBER_{R} zone (i.e. to the left of A_{C} but above R_{UL})

Probability of 0.5 of the estimate falling in the RED zone (i.e. R_{UL} and below)
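The zone probabilities listed above can be reproduced with a short sketch based on normal sampling theory. This is an idealised illustration, not the R code of Additional file 1: it omits the continuity correction and uses the (non-integer) sample size at which power is exactly 90%, and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def zone_probabilities(true_p, n, r_ul, g_ll, alpha=0.05):
    """Probability of the estimate landing in each zone (plain normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    a_c = r_ul + z_alpha * sqrt(r_ul * (1 - r_ul) / n)   # significance cutpoint
    est = NormalDist(mu=true_p, sigma=sqrt(true_p * (1 - true_p) / n))
    below_red, below_ac, below_green = est.cdf(r_ul), est.cdf(a_c), est.cdf(g_ll)
    return {
        "RED": below_red,
        "AMBER_R": below_ac - below_red,
        "AMBER_G": below_green - below_ac,
        "GREEN": 1 - below_green,
    }


def uncorrected_n(r_ul, g_ll, alpha=0.05, power=0.90):
    """Sample size giving exactly the requested power without continuity correction."""
    z_a, z_b = NormalDist().inv_cdf(1 - alpha), NormalDist().inv_cdf(power)
    return ((z_a * sqrt(r_ul * (1 - r_ul))
             + z_b * sqrt(g_ll * (1 - g_ll))) / (g_ll - r_ul)) ** 2


n = uncorrected_n(0.50, 0.75)
under_green = zone_probabilities(0.75, n, 0.50, 0.75)
under_red = zone_probabilities(0.50, n, 0.50, 0.75)
print({k: round(v, 2) for k, v in under_green.items()})
print({k: round(v, 2) for k, v in under_red.items()})
```

If G_{LL} is true, RED and AMBER_{R} together have probability 0.1, AMBER_{G} 0.4 and GREEN 0.5; if R_{UL} is true, RED has probability 0.5, AMBER_{R} 0.45 and AMBER_{G} and GREEN together 0.05.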
Figure 1 also illustrates how changing the sample size affects the sampling distribution and power of the analysis around the set null value (at R_{UL}) when the hypothesised alternative (G_{LL}) is true. The figure emphasises the need for a large enough sample to safeguard against underpowering of the pilot analysis (as shown in the last plot which has a wider bell-shape than the first two plots and where the size of the beta probability is increased).
Figure 2 plots the probabilities of making each type of traffic light decision as functions of the true parameter value (focused on the recruitment uptake example from Table 5 (i)). Additional file 1 presents the R code for reproducing these probabilities and enables readers to insert different parameter values.
Discussion
The methodology introduced in this article provides an innovative formal framework and approach to sample size derivation, aligning sample size requirement to progression criteria with the intention of providing greater transparency to the progression process and full engagement with the standard aims and objectives of pilot/feasibility studies. Through the use of both alpha and beta parameters (rather than alpha alone), the method ensures rigour and capacity to address the progression criteria by ensuring there is adequate power to detect an acceptable threshold for moving forward to the main trial. As several key process outcomes are assessed in parallel and in combination, the method embraces a composite multicriterion approach that appraises signals for progression across all the targeted feasibility measures. The methodology extends beyond the requirement for ‘sample size justification but not necessarily sample size calculation’ [28].
The focus of the strategy reported here is on process outcomes, which align with the recommended key objectives of primary feasibility evaluation for pilot and feasibility studies [2, 24] and necessary targets to address key issues of uncertainty [29]. The concept of justifying progression is key. Charlesworth et al. [30] developed a checklist for intended use in decision-making on whether pilot data could be carried forward to a main trial. Our approach builds on this philosophy by introducing a formalised hypothesis test approach to address the key objectives and pilot sample size. Though the suggested sample size derivation focuses around the key process objectives, it may also be the case that other objectives are important, e.g. assessment of precision of clinical outcome parameters. In this case, researchers may also wish to ensure that the size of the study suitably covers the needs of those evaluations, e.g. to estimate the SD of the intended clinical outcome, the overall sample size may be boosted to cover this additional objective [10]. This tallies with the review by Blatch-Jones et al. [31] who reported that testing recruitment, determining the sample size and numbers available, and the intervention feasibility were the most commonly used targets of pilot evaluations.
Hypothesis testing in pilot studies, particularly in the context of effectiveness/efficacy of clinical outcomes, has been widely criticised due to the improper purpose and lack of statistical power of such evaluations [2, 20, 21, 23]. Hence, pilot evaluations of clinical outcomes are not expected to include hypothesis testing. Since the main focus is on feasibility the scope of the testing reported here is different and importantly relates back to the recommended objectives of the study whilst also aligning with nominated progression criteria [2]. Hence, there is clear justification for this approach. Further, for the simple 3-tiered approach hypothesis testing is somewhat hypothetical: there is no need to physically carry out a test since the zonal positioning of the observed sample statistic estimate for the feasibility outcome will determine the decision in regard to progression; thus adding to the simplicity of the approach.
The link between the sample size and need to adequately power the study to detect a meaningful feasibility outcome gives this approach the extra rigour over the confidence interval approach. It is this sample size-power linkage that is key to the determination of the respective probabilities of falling into the different zones and is a fundamental underpinning to the methodological approach. In the same way as for a key clinical outcome in a main trial, where the emphasis is not just on alpha but also on beta (thereby addressing the capacity to detect a clinically significant difference), our approach is to ensure there is sufficient capacity to detect a meaningful signal for progression to a main trial if it truly exists. A statistically significant finding in this context will at least provide evidence to reject RED (signifying a decision to STOP) and in the 4-tiered case it would fall above AMBER_{R} (decision to major-AMEND); hence, the estimate will fall into AMBER_{G} or GREEN (signifying a decision to minor-AMEND or GO, respectively). The importance of adequately powering the pilot trial to address a feasibility criterion can be simply illustrated. For example, if we take R_{UL} as 50% and G_{LL} as 75% with two different sample sizes of n = 25 and n = 50, the former would have 77.5% power of rejecting RED on the basis of a 1-sided 5% alpha level whereas the larger sample size would have 97.8% power of rejecting RED. So, if G_{LL} holds true, the probability of rejecting the null and being in the AMBER_{G}/GREEN zone would be about 20 percentage points higher for the larger sample, giving an increased chance of progressing to the main trial. It will be necessary to carry out the hypothesis test for the extended 4-tier approach if the observed statistic (E) falls in the AMBER zone to determine statistical significance or not, which will inform whether the result falls into the ‘minor’ or ‘major’ AMBER subzones.
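The two power figures in this illustration can be approximated with a continuity-corrected normal calculation. The sketch below assumes the cutpoint is R_{UL} + z_α√(R_{UL}(1−R_{UL})/n) + 1/(2n); this form is an assumption chosen because it gives values close to those quoted, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_to_reject_red(n, r_ul, g_ll, alpha=0.05):
    """Approximate power of the one-sided test of R_UL when G_LL is true.

    Assumed method: normal approximation with a 1/(2n) continuity correction.
    """
    nd = NormalDist()
    cutpoint = (r_ul + nd.inv_cdf(1 - alpha) * sqrt(r_ul * (1 - r_ul) / n)
                + 1 / (2 * n))
    se_alt = sqrt(g_ll * (1 - g_ll) / n)
    return 1 - nd.cdf((cutpoint - g_ll) / se_alt)


for n in (25, 50):
    print(n, round(100 * power_to_reject_red(n, 0.50, 0.75), 1))
```

Under these assumptions the function returns roughly 77.5% for n = 25 and 97.8% for n = 50.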
We provide recommended sample sizes within a lookup grid relating to perceived likely progression cutpoints to aid quick access and retrievable sample sizes for researchers. For a likely set difference in proportions between hypothesised null and alternative parameters of 0.15 to 0.25 when α = 0.05 and β = 0.1 the corresponding total sample size requirements for the approach of normal approximation with continuity correction take the range of 33 to 100 (median 56) [similarly these are 33–98 (median 54) for the binomial exact method]. Note, for treatment fidelity/adherence/compliance particularly, the marginal difference could be higher, e.g. ≥ 25%, since in most situations we would anticipate and hope to attain a high value for the outcome whilst being prepared to make necessary changes within a wide interval of below par values (and providing the value is not unacceptably low). As this relates to an arm-specific objective (relating to evaluation of the intervention only), a usual 1:1 pilot will require twice the size; hence, the arm-specific sample size powered for detecting a ≥ 25% difference from the null would be about 34 (or lower), as depicted in our illustration (Table 4 (ii)), equating to n ≤ 68 overall for a 1:1 pilot (intervention and control arms). Hence, we expect that typical pilot sizes of around 30–40 randomised per arm [16] would likely fit with the proposed methodology within this manuscript (the number needed for screening being extrapolated upward of this figure) but if a smaller marginal difference (e.g. ≤ 15%) is to be tested then these sample sizes may fall short. We stress that the overall required sample size needs to be carefully considered and determined in line with the hypothesis testing approach across all criteria ensuring sufficiently high power.
In our paper, we have made recommendations regarding various sample sizes based on both the normal approximation (with continuity correction) and binomial exact approaches; these are conservative compared to the Normal approximation (without continuity correction).
Importantly, the methodology outlines the necessary multicriterion approach to the evaluation of pilot and feasibility studies. If all progression criteria are performing as well as anticipated (highlighting ‘GO’ according to all criteria), then the recommendation of the pilot/feasibility study is that all criteria meet their desired levels with no need for adjustment and the main trial can proceed without amendment. However, if the worst signal (across all measured criteria) is an AMBER signal, then adjustment will be required against those criteria that fall within that signal. Consequently, there is the possibility that the criteria may need subsequent reassessment to reevaluate processes in line with updated performance for the criteria in question. If one or more of the feasibility statistics fall within the RED zone then this signals ‘STOP’ and concludes that a main trial is not feasible based on those criteria. This approach to collectively appraising progression based on the results of all feasibility outcomes assessed against their criteria will be conservative as the power of the collective will be lower than the individual power of the separate tests; hence, it is recommended that the power of the individual tests is set high enough (for example, 90–95%) to ensure the collective power is high enough (e.g. at least 70 or 80%) to detect true ‘GO’ signals across all the feasibility criteria.
In this article, we also expand the possibilities for progression criterion and hypothesis testing where the AMBER zone is subdivided arbitrarily based on the significance of the p value. This may work well when the AMBER zone has a wide range and is intended to provide a useful and workable indication of the level of amendment (‘minor’ (nonsubstantive) or ‘major’ (substantive)) required to progress to the main trial. Examples of substantial amendments include study redesign with possible reappraisal and change of statistical parameters, inclusion of several additional sites, adding further data recruitment methods, significant reconfiguration of exclusions, major change to the method of delivery of trial intervention to ensure enhanced treatment fidelity/adherence, enhanced measures to systematically ensure greater patient compliance with allocated treatment, additional mode(s) of collecting and retrieving data (e.g. use of electronic data collection methods in addition to postal questionnaires). Minor amendments include small changes to the protocol and methodology, e.g. addition of one or two sites for attaining a slightly higher recruitment rate, use of occasional reminders in regard to treatment protocol and adding a further reminder process for boosting follow up. For the most likely parametrisation of α = 0.05/β = 0.1, the AMBER zone division will be roughly at the midpoint. However, researchers can choose this point (the major/minor cutpoint) based on decisive arguments around how major and minor amendments would align to the outcome in question. This should be factored within the process of sample size determination for the pilot. In this regard, a smaller sample size will move A_{C} upwards (due to increased standard error/reduced precision) and hence increase the size of the AMBER_{R} zone in relation to AMBER_{G} (whereas a larger sample size will shift A_{C} downwards and do the opposite, increasing the ratio of AMBER_{G}:AMBER_{R}). 
From Table 1, for smaller sample sizes (related to 80% power) the AMBER_{R} zone makes up 56–69% of the total AMBER zone across the presented scenarios, whereas this falls to 47–61% for samples related to 90% power and 41–56% for larger samples (related to 95% power) for the same scenarios. Beyond our proposed 4-tier approach, other ways of providing an indication of the level of amendment could include evaluation and review of the point and interval estimates, or evaluating posterior probabilities via a Bayesian approach [14, 32].
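The dependence of A_{C} on sample size can be illustrated numerically. The sketch below is an assumption-laden approximation rather than the calculation behind Table 1: it places the cutpoint on the proportion scale using a one-sided normal approximation without continuity correction, so an exact binomial test may shift the threshold by a count. The function names and the sample sizes 25 and 42 are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def amber_cutpoint(n, r_ul, alpha=0.05):
    """Approximate A_C as a proportion.

    ASSUMPTION: one-sided z-test of H0: true proportion = r_ul,
    normal approximation, no continuity correction.
    """
    z = NormalDist().inv_cdf(1 - alpha)
    return r_ul + z * sqrt(r_ul * (1 - r_ul) / n)


def amber_r_share(n, r_ul, g_ll, alpha=0.05):
    """Fraction of the AMBER zone (r_ul to g_ll) that lies below A_C."""
    return (amber_cutpoint(n, r_ul, alpha) - r_ul) / (g_ll - r_ul)


if __name__ == "__main__":
    # RED upper limit 50%, GREEN lower limit 75% (the fidelity example).
    for n in (25, 34, 42):
        cut = amber_cutpoint(n, 0.50, alpha=0.05)
        share = amber_r_share(n, 0.50, 0.75, alpha=0.05)
        print(f"n={n}: A_C ~ {cut:.1%}, AMBER_R share ~ {share:.0%}")
```

With n = 34, R_{UL} = 50% and G_{LL} = 75%, this approximation puts A_{C} near 64% and the AMBER_{R} share near 56%, i.e. roughly at the midpoint; reducing n raises both figures and increasing n lowers them.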
The methodology illustrated here focuses on feasibility outcomes presented as percentages/proportions, which is likely to be the most common form for progression criteria under consideration. However, the steps that have been introduced can be readily adapted to any feasibility outcomes taking a numerical format, e.g. rate of recruitment per month per centre, count of centres taking part in the study. Also, we point out that in the examples presented in the paper (recruitment, treatment fidelity and percent follow-up), high proportions are acceptable and low ones are not. This would not be true for, say, adverse events, where a reverse scale is required.
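A reverse scale only changes the direction of the comparisons. As a hedged sketch (the function name and the adverse-event limits of 10% and 25% are hypothetical, not taken from the paper), a single classifier can infer the direction from the ordering of the two limits:

```python
def classify_zone(estimate, red_limit, green_limit):
    """Classify an observed feasibility estimate as RED, AMBER or GREEN.

    If green_limit > red_limit, high values are desirable (e.g. fidelity):
    RED at or below red_limit, GREEN at or above green_limit.
    If green_limit < red_limit, low values are desirable (e.g. adverse
    events): GREEN at or below green_limit, RED at or above red_limit.
    """
    if green_limit == red_limit:
        raise ValueError("RED and GREEN limits must differ")
    if green_limit > red_limit:  # conventional scale
        if estimate >= green_limit:
            return "GREEN"
        if estimate <= red_limit:
            return "RED"
    else:  # reverse scale
        if estimate <= green_limit:
            return "GREEN"
        if estimate >= red_limit:
            return "RED"
    return "AMBER"


if __name__ == "__main__":
    # Treatment fidelity: 26 of 34 participants, limits 50% / 75%.
    print(classify_zone(26 / 34, red_limit=0.50, green_limit=0.75))
    # Adverse events (hypothetical limits): 15% observed, GREEN <= 10%, RED >= 25%.
    print(classify_zone(0.15, red_limit=0.25, green_limit=0.10))
```

The same comparison logic applies to rates or counts (for example, recruits per month per centre), provided the limits are expressed on the same scale as the estimate; the accompanying significance test would then need a model suited to that outcome rather than the binomial.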
Biased sample estimates are a concern as they may result in a wrong decision being made. This systematic error is over and above the possibility of an erroneous decision being made on the basis of sampling error; the latter may be reduced through an increased pilot sample size. Any positive bias will inflate/overestimate the feasibility sample estimate in favour of progressing, whereas a negative bias will deflate/underestimate it towards the null and stopping. Both are problematic for opposite reasons; for example, the former may inform researchers that the main trial can ‘GO’ ahead when in fact it will struggle to meet key feasibility targets, whereas the latter may caution against progression when in reality the feasibility targets of a main trial would be met. For example, in regard to the choice of centres (and hence practitioners and participants), a common concern is that the selection of feasibility trial centres might not be a fair and representative sample of the ‘population’ of centres to be used for the main trial. It may be that the host centre (likely used in pilot studies) recruits far better than others (positive bias), thus exaggerating the signal to progress and subsequent recruitment to the main trial. Beets et al. [33] ‘define “risk of generalizability biases” as the degree to which features of the intervention and sample in the pilot study are NOT scalable or generalizable to the next stage of testing in a larger, efficacy/effectiveness trial … whether aspects like who delivers an intervention, to whom it is delivered, or the intensity and duration of the intervention during the pilot study are sustained in the larger, efficacy/effectiveness trial.’ As in other types of studies, safeguards regarding bias should be addressed through appropriate pilot study design and conduct.
Issues relating to progression criteria for internal pilots may be different to those for external pilots and non-randomised feasibility studies. The consequence of a ‘STOP’ within an internal pilot may be more serious for stakeholders (researchers, funders, patients) as it would bring an end to the planned continuation into the main trial phase, whereas there would be less at stake for a negative external pilot. By contrast, the consequence of a ‘GO’ signal may work the other way, with a clear and immediate gain for the internal pilot, whereas for an external pilot the researchers would still need to apply for and obtain the necessary funding and approvals to undertake an intended main trial. The chances of falling into the different traffic light zones are likely to be quite different between the two designs. External pilot and feasibility studies are possibly more likely than internal pilots to have estimates falling in and around the RED zone, reflecting the greater uncertainty in the processes for the former and greater confidence in the mechanisms for trial delivery for the latter. However, to counter this, there are often large challenges with recruitment within internal pilot studies, where the target population is usually spread over more diverse sites than may be expected for an external pilot. Despite this possible imbalance, the interpretation of zonal indications remains consistent for external and internal pilot studies. As such, our focus with regard to the recommendations in this article is aligned to the requirements for external pilots, though this methodology may to a degree also hold for internal pilots (and, further, for non-randomised studies that can include progression criteria, including longitudinal observational cohorts with the omission of the treatment fidelity criterion).
Conclusions
We propose a novel framework that provides a paradigm shift towards formally testing feasibility progression criteria in pilot and feasibility studies. The outlined approach ensures rigorous and transparent reporting in line with CONSORT recommendations for evaluation of STOP-AMEND-GO criteria and presents clear progression signposting which should help decision-making and inform stakeholders. Targeted progression criteria are focused on recommended pilot and feasibility objectives, particularly recruitment uptake, treatment fidelity and participant retention, and these criteria guide the methodology for sample size derivation and statistical testing. This methodology is intended to provide a more definitive and rounded structure to pilot and feasibility design and evaluation than currently exists. Sample size recommendations will be dependent on the nature and cutpoints for multiple key predefined progression criteria and should ensure a sufficient sample size for other feasibility outcomes such as review of the precision of clinical parameters to better inform main trial size.
Availability of data and materials
Not applicable.
Abbreviations
Alpha (α): Significance level (Type I error probability)
AMBER_{G}: AMBER subzone adjacent to the GREEN zone (within the 4-tiered approach)
AMBER_{R}: AMBER subzone adjacent to the RED zone (within the 4-tiered approach)
A_{C}: AMBER statistical significance threshold (within the AMBER zone), where an observed estimate below the cutpoint will result in a non-significant result (p ≥ 0.05) and figures at or above the cutpoint will be significant (p < 0.05)
A_{C}%: A_{C} expressed as a percentage of the sample size
Beta (β): Type II error probability
E: Estimate of feasibility outcome
ε: True feasibility parameter
G_{LL}: Lower limit of GREEN zone
n: Sample size (n_{s} = number of patients screened; n_{r} = number of patients randomised; n_{i} = number of patients randomised to the intervention arm only)
Power (1 − β): 1 − Type II error probability
R_{UL}: Upper limit of RED zone
References
Lancaster GA, Dodd S, Williamson PR. Design and analysis of pilot studies: recommendations for good practice. J Eval Clin Pract. 2004;10(2):307–12.
Eldridge SM, Chan CL, Campbell MJ, Bond CM, Hopewell S, Thabane L, et al. CONSORT 2010 statement: extension to randomised pilot and feasibility trials. Pilot Feasibility Stud. 2016;2:64.
McDonald AM, Knight RC, Campbell MK, Entwistle VA, Grant AM, Cook JA, et al. What influences recruitment to randomised controlled trials? A review of trials funded by two UK funding agencies. Trials. 2006;7:9.
Sully BG, Julious SA, Nicholl J. A reinvestigation of recruitment to randomised, controlled, multicenter trials: a review of trials funded by two UK funding agencies. Trials. 2013;14:166.
Walters SJ, Bonacho Dos Anjos Henriques-Cadby I, Bortolami O, Flight L, Hind D, Jacques RM, et al. Recruitment and retention of participants in randomised controlled trials: a review of trials funded and published by the United Kingdom Health Technology Assessment Programme. BMJ Open. 2017;7(3):e015276.
Julious SA. Sample size of 12 per group rule of thumb for a pilot study. Pharm Stat. 2005;4:287–91.
Thabane L, Ma J, Chu R, Cheng J, Ismaila A, Rios LP, et al. A tutorial on pilot studies: the what, why and how. BMC Med Res Methodol. 2010;10:1.
Browne RH. On the use of a pilot sample for sample size determination. Stat Med. 1995;14:1933–40.
Hertzog MA. Considerations in determining sample size for pilot studies. Res Nurs Health. 2008;31(2):180–91.
Sim J, Lewis M. The size of a pilot study for a clinical trial should be calculated in relation to considerations of precision and efficiency. J Clin Epidemiol. 2012;65(3):301–8.
Whitehead AL, Julious SA, Cooper CL, Campbell MJ. Estimating the sample size for a pilot randomised trial to minimise the overall trial sample size for the external pilot and main trial for a continuous outcome variable. Stat Methods Med Res. 2016;25(3):1057–73.
Teare MD, Dimairo M, Shephard N, Hayman A, Whitehead A, Walters SJ. Sample size requirements to estimate key design parameters from external pilot randomised controlled trials: a simulation study. Trials. 2014;15:264.
Cocks K, Torgerson DJ. Sample size calculations for pilot randomized trials: a confidence interval approach. J Clin Epidemiol. 2013;66(2):197–201.
Lee EC, Whitehead AL, Jacques RM, Julious SA. The statistical interpretation of pilot trials: should significance thresholds be reconsidered? BMC Med Res Methodol. 2014;14:41.
Johanson GA, Brooks GP. Initial scale development: sample size for pilot studies. Edu Psychol Measurement. 2010;70(3):394–400.
Billingham SA, Whitehead AL, Julious SA. An audit of sample sizes for pilot and feasibility trials being undertaken in the United Kingdom registered in the United Kingdom Clinical Research Network database. BMC Med Res Methodol. 2013;13:104.
Herbert E, Julious SA, Goodacre S. Progression criteria in trials with an internal pilot: an audit of publicly funded randomised controlled trials. Trials. 2019;20(1):493.
Avery KN, Williamson PR, Gamble C, O’Connell Francischetto E, Metcalfe C, Davidson P, et al. Informing efficient randomised controlled trials: exploration of challenges in developing progression criteria for internal pilot studies. BMJ Open. 2017;7(2):e013537.
Arain M, Campbell MJ, Cooper CL, Lancaster GA. What is a pilot or feasibility study? A review of current practice and editorial policy. BMC Med Res Methodol. 2010;10:67.
Leon AC, Davis LL, Kraemer HC. The role and interpretation of pilot studies in clinical research. J Psychiatr Res. 2011;45(5):626–9.
Horne E, Lancaster GA, Matson R, Cooper A, Ness A, Leary S. Pilot trials in physical activity journals: a review of reporting and editorial policy. Pilot Feasibility Stud. 2018;4:125.
Wilson DT, Walwyn RE, Brown J, Farrin AJ, Brown SR. Statistical challenges in assessing potential efficacy of complex interventions in pilot or feasibility studies. Stat Methods Med Res. 2016;25(3):997–1009.
Sim J. Should treatment effects be estimated in pilot and feasibility studies? Pilot Feasibility Stud. 2019;5:107.
Moore CG, Carter RE, Nietert PJ, Stewart PW. Recommendations for planning pilot studies in clinical and translational research. Clin Transl Sci. 2011;4(5):332–7.
Schoenfeld D. Statistical considerations for pilot studies. Int J Radiat Oncol Biol Phys. 1980;6(3):371–4.
Fleiss JL, Levin B, Paik MC. Statistical methods for rates and proportions, Third Edition. New York: John Wiley & Sons; 2003. p. 32.
Faul F, Erdfelder E, Lang AG, Buchner A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods. 2007;39:175–91.
Julious SA. Pilot studies in clinical research. Stat Methods Med Res. 2016;25(3):995–6.
Lancaster GA. Pilot and feasibility studies come of age! Pilot Feasibility Stud. 2015;1(1):1.
Charlesworth G, Burnell K, Hoe J, Orrell M, Russell I. Acceptance checklist for clinical effectiveness pilot trials: a systematic approach. BMC Med Res Methodol. 2013;13:78.
Blatch-Jones AJ, Pek W, Kirkpatrick E, Ashton-Key M. Role of feasibility and pilot studies in randomised controlled trials: a cross-sectional study. BMJ Open. 2018;8(9):e022233.
Willan AR, Thabane L. Bayesian methods for pilot studies. Clin Trials. 2020;17(4):414–9.
Beets MW, Weaver RG, Ioannidis JPA, Geraci M, Brazendale K, Decker L, et al. Identification and evaluation of risk of generalizability biases in pilot versus efficacy/effectiveness trials: a systematic review and meta-analysis. Int J Behav Nutr Phys Act. 2020;17:19.
Acknowledgements
We thank Professor Julius Sim, Dr Ivonne Solis-Trapala, Dr Elaine Nicholls and Marko Raseta for their feedback on the initial study abstract.
Funding
KB was supported by a UK 2017 NIHR Research Methods Fellowship Award (ref RMFI201708006).
Author information
Contributions
ML and CJS conceived the original methodological framework for the paper. ML prepared draft manuscripts. KB and GMcC provided examples and illustrations. All authors contributed to the writing and provided feedback on drafts and steer and suggestions for article updating. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
R codes used for Fig. 2.
Appendix
Mathematical formulae for derivation of sample size
The required sample size may be derived using a normal approximation to binary response data with a continuity correction, following Fleiss et al. [26], provided the convention of np > 5 and n(1 − p) > 5 holds true:
where R_{UL} = upper limit of RED zone; G_{LL} = lower limit of GREEN zone; z_{1−α} = standard normal quantile corresponding to the one-sided significance level α (type I error probability); z_{1−β} = standard normal quantile corresponding to power 1 − β (β = type II error probability)
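A minimal sketch of the calculation follows, under two stated assumptions. First, the uncorrected sample size is taken to be the standard one-sample normal-approximation form, n′ = [z_{1−α}√(R_{UL}(1 − R_{UL})) + z_{1−β}√(G_{LL}(1 − G_{LL}))]² / (G_{LL} − R_{UL})². Second, the continuity correction is assumed to take the simple additive form n = n′ + 1/|G_{LL} − R_{UL}|; the exact expression should be checked against Fleiss et al. [26]. The function name and argument names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def pilot_sample_size(r_ul, g_ll, alpha=0.05, power=0.90, continuity=True):
    """Sample size to reject the RED zone when the GREEN zone holds.

    r_ul: upper limit of the RED zone (null proportion)
    g_ll: lower limit of the GREEN zone (alternative proportion)
    ASSUMPTION: additive continuity correction 1/|g_ll - r_ul|.
    Returns an unrounded value; round as required by the design.
    """
    if not 0 < r_ul < g_ll < 1:
        raise ValueError("need 0 < r_ul < g_ll < 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided test
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(r_ul * (1 - r_ul))
                 + z_beta * sqrt(g_ll * (1 - g_ll))) ** 2
    n = numerator / (g_ll - r_ul) ** 2
    if continuity:
        n += 1 / abs(g_ll - r_ul)
    return n


if __name__ == "__main__":
    # Treatment fidelity example: RED <= 50%, GREEN >= 75%,
    # 90% power, one-sided 5% alpha.
    n = pilot_sample_size(0.50, 0.75, alpha=0.05, power=0.90)
    print(f"n ~ {n:.1f}")
    # Check the np > 5 and n(1 - p) > 5 convention at both limits.
    print(all(n * p > 5 and n * (1 - p) > 5 for p in (0.50, 0.75)))
```

For the treatment fidelity example (R_{UL} = 50%, G_{LL} = 75%, 90% power, one-sided 5% alpha), this sketch returns a value of about 34.4, in line with the sample size of around n = 34 quoted for that scenario.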
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Lewis, M., Bromley, K., Sutton, C.J. et al. Determining sample size for progression criteria for pragmatic pilot RCTs: the hypothesis test strikes back!. Pilot Feasibility Stud 7, 40 (2021). https://doi.org/10.1186/s40814-021-00770-x
Keywords
 Outcome and process assessment
 Pilots
 Sample size, Statistics