The methodology introduced in this article provides an innovative formal framework for sample size derivation, aligning the sample size requirement with the progression criteria, with the intention of providing greater transparency to the progression process and full engagement with the standard aims and objectives of pilot/feasibility studies. Through the use of both alpha and beta parameters (rather than alpha alone), the method ensures rigour and the capacity to address the progression criteria, ensuring there is adequate power to detect an acceptable threshold for moving forward to the main trial. As several key process outcomes are assessed in parallel and in combination, the method embraces a composite multi-criterion approach that appraises signals for progression across all the targeted feasibility measures. The methodology extends beyond the requirement for ‘sample size justification but not necessarily sample size calculation’ [28].

The focus of the strategy reported here is on process outcomes, which align with the recommended key objectives of primary feasibility evaluation for pilot and feasibility studies [2, 24] and are the necessary targets for addressing key issues of uncertainty [29]. The concept of justifying progression is key. Charlesworth et al. [30] developed a checklist intended for use in deciding whether pilot data could be carried forward to a main trial. Our approach builds on this philosophy by introducing a formalised hypothesis test approach to address the key objectives and the pilot sample size. Though the suggested sample size derivation focuses on the key process objectives, other objectives may also be important, e.g. assessment of the precision of clinical outcome parameters. In this case, researchers may also wish to ensure that the size of the study suitably covers the needs of those evaluations; e.g. if the SD of the intended clinical outcome is to be estimated, the overall sample size may be boosted to cover this additional objective [10]. This tallies with the review by Blatch-Jones et al. [31], who reported that testing recruitment, determining the sample size and numbers available, and intervention feasibility were the most commonly used targets of pilot evaluations.

Hypothesis testing in pilot studies, particularly in the context of the effectiveness/efficacy of clinical outcomes, has been widely criticised due to the improper purpose and lack of statistical power of such evaluations [2, 20, 21, 23]. Hence, pilot evaluations of clinical outcomes are not expected to include hypothesis testing. Since the main focus is on feasibility, the scope of the testing reported here is different and, importantly, relates back to the recommended objectives of the study whilst also aligning with the nominated progression criteria [2]. Hence, there is clear justification for this approach. Further, for the simple 3-tiered approach, the hypothesis testing is somewhat notional: there is no need to formally carry out a test, since the zonal position of the observed sample estimate for the feasibility outcome determines the progression decision; this adds to the simplicity of the approach.

The link between the sample size and the need to adequately power the study to detect a meaningful feasibility outcome gives this approach extra rigour over the confidence interval approach. It is this sample size-power linkage that determines the respective probabilities of falling into the different zones and is a fundamental underpinning of the methodological approach. Just as the sample size for a key clinical outcome in a main trial addresses not only alpha but also beta, thereby ensuring the capacity to detect a clinically significant difference, our approach ensures there is sufficient capacity to detect a meaningful signal for progression to a main trial if one truly exists. A statistically significant finding in this context will at least provide evidence to reject RED (signifying a decision to STOP), and in the 4-tiered case it would fall above AMBER_{R} (decision to major-AMEND); hence, the estimate will fall into AMBER_{G} or GREEN (signifying a decision to minor-AMEND or GO, respectively). The importance of adequately powering the pilot trial to address a feasibility criterion can be simply illustrated. For example, if we take *R*_{UL} as 50% and *G*_{LL} as 75% but with two different sample sizes of *n* = 25 and *n* = 50, the former would have 77.5% power of rejecting RED on the basis of a 1-sided 5% alpha level whereas the larger sample size would have 97.8% power of rejecting RED. So, if *G*_{LL} holds true, there would be a 20% higher probability of rejecting the null and being in the AMBER_{G}/GREEN zone for the larger sample, giving an increased chance of progressing to the main trial. For the extended 4-tier approach, it will be necessary to carry out the hypothesis test if the observed statistic (*E*) falls in the AMBER zone, to determine statistical significance and hence whether the result falls into the ‘minor’ or ‘major’ AMBER sub-zone.
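The power figures in this illustration can be reproduced directly. The sketch below (the function name and argument names are ours, purely illustrative) computes the one-sided power to reject RED under the normal approximation with continuity correction, for the 50%/75% cut-points used above:

```python
from statistics import NormalDist
import math

def power_reject_red(r_ul, g_ll, n, alpha=0.05):
    """One-sided power to reject RED (H0: p <= r_ul) when the true
    proportion equals g_ll, via the normal approximation with
    continuity correction."""
    z = NormalDist()
    # rejection boundary for the observed proportion under H0,
    # shifted upward by the continuity correction 1/(2n)
    crit = (r_ul
            + z.inv_cdf(1 - alpha) * math.sqrt(r_ul * (1 - r_ul) / n)
            + 1 / (2 * n))
    se_alt = math.sqrt(g_ll * (1 - g_ll) / n)  # SE under the alternative
    return z.cdf((g_ll - crit) / se_alt)

print(round(power_reject_red(0.50, 0.75, 25), 3))  # 0.775
print(round(power_reject_red(0.50, 0.75, 50), 3))  # 0.978
```

With these inputs the gap in power (97.8% vs 77.5%) matches the roughly 20 percentage-point difference described above.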

We provide recommended sample sizes within a look-up grid, relating to perceived likely progression cut-points, to give researchers quick access to retrievable sample sizes. For a likely set difference in proportions between the hypothesised null and alternative parameters of 0.15 to 0.25, when *α* = 0.05 and β = 0.1, the corresponding total sample size requirements under the normal approximation with continuity correction range from 33 to 100 (median 56) [similarly, 33–98 (median 54) for the binomial exact method]. Note, for treatment fidelity/adherence/compliance particularly, the marginal difference could be higher, e.g. ≥ 25%, since in most situations we would anticipate and hope to attain a high value for the outcome whilst being prepared to make necessary changes within a wide interval of below-par values (providing the value is not unacceptably low). As this relates to an arm-specific objective (evaluation of the intervention only), a usual 1:1 pilot will require twice the size; hence, the arm-specific sample size powered for detecting a ≥ 25% difference from the null would be about 34 (or lower), as shown in our illustration (Table 4 (ii)), equating to *n* ≤ 68 overall for a 1:1 pilot (intervention and control arms). Hence, we expect that typical pilot sizes of around 30–40 randomised per arm [16] would likely fit with the methodology proposed in this manuscript (the number needed for screening being extrapolated upward of this figure), but if a smaller marginal difference (e.g. ≤ 15%) is to be tested then these sample sizes may fall short. We stress that the overall required sample size needs to be carefully considered and determined in line with the hypothesis testing approach across all criteria, ensuring sufficiently high power.
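For readers wishing to reproduce figures of the binomial exact kind, a minimal search over *n* can be sketched as follows (the function names are ours; the usage example applies the illustrative 50% vs 75% cut-points from earlier in the Discussion):

```python
from math import comb

def exact_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def exact_sample_size(p0, p1, alpha=0.05, beta=0.10, n_max=200):
    """Smallest n at which the one-sided exact binomial test of
    H0: p = p0 achieves power >= 1 - beta at p = p1."""
    for n in range(1, n_max + 1):
        # smallest critical count whose exact size does not exceed alpha
        k_crit = next((k for k in range(n + 1)
                       if exact_upper_tail(n, k, p0) <= alpha), None)
        if k_crit is not None and exact_upper_tail(n, k_crit, p1) >= 1 - beta:
            return n
    return None

n = exact_sample_size(0.50, 0.75)  # 0.25 difference, alpha = 0.05, beta = 0.10
print(n)
```

For this 0.25-difference scenario the search returns a value near the lower end of the quoted 33–98 range.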
In our paper, we have made recommendations regarding various sample sizes based on both the normal approximation (with continuity correction) and binomial exact approaches; both are conservative compared with the normal approximation without continuity correction.

Importantly, the methodology outlines the necessary multi-criterion approach to the evaluation of pilot and feasibility studies. If all progression criteria are performing as well as anticipated (highlighting ‘GO’ according to all criteria), then the recommendation of the pilot/feasibility study is that all criteria meet their desired levels with no need for adjustment and the main trial can proceed without amendment. However, if the worst signal (across all measured criteria) is an AMBER signal, then adjustment will be required for those criteria that fall within that signal. Consequently, the criteria may need subsequent re-assessment to re-evaluate processes in line with updated performance for the criteria in question. If one or more of the feasibility statistics fall within the RED zone, then this signals ‘STOP’ and concludes that a main trial is not feasible based on those criteria. This approach to collectively appraising progression, based on the results of all feasibility outcomes assessed against their criteria, will be conservative, as the power of the collective will be lower than the individual power of the separate tests; hence, it is recommended that the power of the individual tests is set sufficiently high (for example, 90–95%) to ensure adequate collective power (e.g. at least 70 or 80%) to detect true ‘GO’ signals across all the feasibility criteria.
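The conservativeness of the collective appraisal can be illustrated under a simplifying assumption of independent criteria (a sketch of ours, not part of the formal methodology): the collective power to detect true ‘GO’ signals on every criterion is then the product of the individual powers.

```python
from math import prod

def collective_power(individual_powers):
    """Probability that all criteria simultaneously show a true 'GO'
    signal, assuming the individual tests are independent."""
    return prod(individual_powers)

# three criteria, each powered at 90% vs each powered at 95%
print(round(collective_power([0.90] * 3), 3))  # 0.729
print(round(collective_power([0.95] * 3), 3))  # 0.857
```

Hence, with three criteria, individual powers of 90% give a collective power of about 73%, in line with the 70–80% collective target noted above, while 95% individual power lifts the collective to about 86%.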

In this article, we also expand the possibilities for progression criteria and hypothesis testing, whereby the AMBER zone is sub-divided based on the significance of the *p* value. This may work well when the AMBER zone has a wide range, and it is intended to provide a useful and workable indication of the level of amendment (‘minor’ (non-substantive) or ‘major’ (substantive)) required to progress to the main trial. Examples of major (substantive) amendments include: study re-design with possible re-appraisal and change of statistical parameters; inclusion of several additional sites; adding further recruitment methods; significant reconfiguration of the exclusion criteria; major change to the method of delivery of the trial intervention to ensure enhanced treatment fidelity/adherence; enhanced measures to systematically ensure greater patient compliance with allocated treatment; and additional mode(s) of collecting and retrieving data (e.g. use of electronic data collection methods in addition to postal questionnaires). Minor amendments include small changes to the protocol and methodology, e.g. the addition of one or two sites to attain a slightly higher recruitment rate, use of occasional reminders in regard to the treatment protocol, and adding a further reminder process to boost follow-up. For the most likely parametrisation of *α* = 0.05/β = 0.1, the AMBER zone division will be roughly at the midpoint. However, researchers can choose this point (the major/minor cut-point) based on decisive arguments around how major and minor amendments would align to the outcome in question. This should be factored into the process of sample size determination for the pilot. In this regard, a smaller sample size will move *A*_{C} upwards (due to increased standard error/reduced precision) and hence increase the size of the AMBER_{R} zone relative to AMBER_{G}, whereas a larger sample size will shift *A*_{C} downwards and do the opposite, increasing the ratio of AMBER_{G}:AMBER_{R}.
From Table 1, for smaller sample sizes (related to 80% power) the AMBER_{R} zone makes up 56–69% of the total AMBER zone across the presented scenarios, whereas this falls to 47–61% for moderate sample sizes (related to 90% power) and 41–56% for larger sample sizes (related to 95% power) for the same scenarios. Beyond our proposed 4-tier approach, other ways of providing an indication of the level of amendment could include evaluation and review of the point and interval estimates, or evaluation of posterior probabilities via a Bayesian approach [14, 32].
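The dependence of the AMBER_{R} share on sample size can be checked numerically. The sketch below (our own illustration; function and variable names are not from the formal methodology) takes *A*_{C} as the continuity-corrected one-sided significance boundary and reports the fraction of the AMBER zone lying below it, using the illustrative 50%/75% cut-points from earlier in the Discussion:

```python
from statistics import NormalDist
import math

def amber_split(r_ul, g_ll, n, alpha=0.05):
    """Return A_C (the one-sided significance boundary with continuity
    correction) and the share of the AMBER zone below it (AMBER_R)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    a_c = r_ul + z_alpha * math.sqrt(r_ul * (1 - r_ul) / n) + 1 / (2 * n)
    return a_c, (a_c - r_ul) / (g_ll - r_ul)

# increasing sample sizes for this scenario: the AMBER_R share shrinks
for n in (31, 39, 46):
    a_c, share = amber_split(0.50, 0.75, n)
    print(n, round(a_c, 3), round(share, 2))
```

For these cut-points, the AMBER_{R} share falls from roughly two-thirds at the smallest of the three sizes to roughly a half at the largest, consistent with the pattern reported from Table 1.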

The methodology illustrated here focuses on feasibility outcomes presented as percentages/proportions, which is likely to be the most common form for progression criteria under consideration. However, the steps introduced can be readily adapted to any feasibility outcome taking a numerical format, e.g. the rate of recruitment per month per centre or the count of centres taking part in the study. We also point out that in the examples presented in the paper (recruitment, treatment fidelity and percent follow-up), high proportions are acceptable and low ones are not. This would not be true for, say, adverse events, where the scale is reversed (low values are desirable).

Biased sample estimates are a concern as they may result in a wrong decision being made. This systematic error is over and above the possibility of an erroneous decision arising from sampling error; the latter may be reduced through an increased pilot sample size. A positive bias will inflate/overestimate the feasibility sample estimate in favour of progressing, whereas a negative bias will deflate/underestimate it towards the null and stopping. Both are problematic for opposite reasons: the former may inform researchers that the main trial can ‘GO’ ahead when in fact it will struggle to meet key feasibility targets, whereas the latter may caution against progression when in reality the feasibility targets of a main trial would be met. For example, in regard to the choice of centres (and hence practitioners and participants), a common concern is that the selection of feasibility trial centres might not be a fair and representative sample of the ‘population’ of centres to be used for the main trial. It may be that the host centre (likely used in pilot studies) recruits far better than others (positive bias), thus exaggerating the signal to progress and the subsequent recruitment to the main trial. Beets et al. [33] ‘define “risk of generalizability biases” as the degree to which features of the intervention and sample in the pilot study are NOT scalable or generalizable to the next stage of testing in a larger, efficacy/effectiveness trial … whether aspects like who delivers an intervention, to whom it is delivered, or the intensity and duration of the intervention during the pilot study are sustained in the larger, efficacy/effectiveness trial.’ As in other types of studies, safeguards against bias should be addressed through appropriate pilot study design and conduct.

Issues relating to progression criteria for internal pilots may differ from those for external pilots and non-randomised feasibility studies. The consequence of a ‘STOP’ within an internal pilot may be more serious for stakeholders (researchers, funders, patients), as it would bring an end to the planned continuation into the main trial phase, whereas there is less at stake for a negative external pilot. By contrast, the consequence of a ‘GO’ signal may work the other way, with a clear and immediate gain for the internal pilot, whereas for an external pilot the researchers would still need to apply for and obtain the necessary funding and approvals to undertake an intended main trial. The chances of falling into the different traffic light zones are also likely to differ between the two designs. External pilot and feasibility studies are possibly more likely to have estimates falling in and around the RED zone than internal pilots, reflecting the greater uncertainty in the processes for the former and the greater confidence in the mechanisms for trial delivery for the latter. To counter this, however, there are often large challenges with recruitment within internal pilot studies, where the target population is usually spread over more diverse sites than may be expected for an external pilot. Despite this possible imbalance, the interpretation of zonal indications remains consistent for external and internal pilot studies. As such, our focus with regard to the recommendations in this article is aligned to the requirements for external pilots, though the methodology may to a degree similarly hold for internal pilots (and further, for non-randomised studies that can include progression criteria, including longitudinal observational cohorts, with the omission of the treatment fidelity criterion).