Those who have worked in drug development know the drill. Turning a promising biomarker into something you can rely on in the clinic is one of the most critical steps, and one of the most failure-prone. A candidate can look brilliant in controlled studies, then fall apart when it meets real-world biology, workflows and heterogeneous patient populations. The phrase “valley of death” captures the gap between early discovery and real clinical application, where many biomarkers stall. In other words, researchers validated the assay one way, then tried to use it another. The validation strategy didn’t match the job the biomarker actually needed to do.
In January 2025, the FDA issued a guidance for industry on bioanalytical method validation for biomarkers (nonbinding recommendations). The document formalizes a fit-for-purpose idea: validation rigor should scale to intended use. If biomarker measurements will support regulatory submissions, approval decisions, labeling claims, or dosing recommendations, FDA points sponsors to full validation aligned with ICH M10. If biomarkers are used mainly for internal decisions such as candidate selection, go or no-go calls, or proof-of-concept, the guidance leaves much more to sponsor judgment. In other words, there’s a clear ceiling, and a fuzzier floor.
1. Fit-for-purpose validation: a clear bar at the top, discretion everywhere else
The FDA guidance is explicit about why it leans on M10. M10 is the harmonized international standard for bioanalytical method validation and study sample analysis when measurements are expected to support regulatory decisions, but M10 excludes biomarkers from its scope. The January 2025 biomarker guidance exists partly to bridge that gap by telling sponsors to use M10 as the starting point for biomarker assay validation, while recognizing that some M10 characteristics or criteria may not apply to some biomarker platforms.
In practice, the FDA is directing sponsors to use drug assay validation standards as the reference point for biomarkers. The characteristics at stake include method selectivity, accuracy and precision, sensitivity and reportable range, and reproducibility and stability. The agency also emphasizes how sample collection, handling, storage and shipping affect reliability. The guidance sets a clear expectation for regulatory-facing work and leaves the rest to fit-for-purpose justification by sponsors.
One way teams operationalize this discretion is through tiered, risk-based validation frameworks that map assay work to decision weight:
- High-consequence use (patient selection, dose justification, safety-triggering decisions, pivotal enrichment): validation plans often resemble M10-style discipline, with predefined acceptance criteria, formal stability and robustness assessments, controlled change management and documentation designed to withstand audit.
- Medium-consequence use (Phase 2 decision support, strong mechanistic claims, intensive stratification): teams often validate key performance characteristics and pre-analytical requirements, but may accept narrower operating ranges or more limited stability claims if the workflow is controlled.
- Exploratory use (hypothesis generation, early signal finding): teams may prioritize feasibility and signal detection, characterize variability rather than eliminate it and defer parts of robustness testing, as long as limitations are explicit and downstream claims stay constrained.
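The tiering above can be sketched as a simple lookup that teams might embed in a validation-planning tool. This is a hypothetical illustration of the mapping, not a regulatory template; the tier names, element names and descriptions are assumptions drawn from the bullets above.

```python
# Hypothetical sketch of a tiered, risk-based validation plan lookup.
# Tier and element names are illustrative, not a regulatory template.
VALIDATION_TIERS = {
    "high": {  # patient selection, dose justification, safety triggers
        "acceptance_criteria": "predefined, M10-style",
        "stability": "formal stability and robustness assessment",
        "change_control": "controlled, audit-ready documentation",
    },
    "medium": {  # Phase 2 decision support, intensive stratification
        "acceptance_criteria": "key performance characteristics",
        "stability": "limited claims within a controlled workflow",
        "change_control": "documented rationale for narrower ranges",
    },
    "exploratory": {  # hypothesis generation, early signal finding
        "acceptance_criteria": "feasibility and signal detection",
        "stability": "characterize variability rather than eliminate it",
        "change_control": "limitations explicit, claims constrained",
    },
}

def plan_for(decision_weight: str) -> dict:
    """Return the validation elements mapped to a decision weight."""
    return VALIDATION_TIERS[decision_weight]
```

The point of encoding the tiers is less the code than the forcing function: each biomarker use gets assigned a decision weight explicitly, and the rationale for anything skipped is recorded.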
This shift elevates the Context of Use (COU) statement from a narrative add-on to the central design document. A COU that is actually useful specifies the decision and population, but it also forces operational commitments that determine whether a biomarker will replicate outside the development dataset: specimen type and handling requirements, site training assumptions, cutpoint strategy and what “indeterminate” means, missing-data rules and how discordant results will be adjudicated.
The same FDA guidance that increases the importance of COU does not standardize it. There is no shared regulatory template for COU structure outside formal qualification programs, and there is no universal checklist for what “enough” looks like for biomarkers used only for internal decisions. That is why fit-for-purpose validation reduces avoidable ambiguity without solving reproducibility. It makes trade-offs explicit. It does not remove the trade-offs.
2. p-tau217 case study: the Lumipulse ratio as a COU stress test
Blood-based Alzheimer’s biomarkers provide a current example of how operational context can dominate performance. On May 16, 2025, the FDA cleared Fujirebio’s Lumipulse G pTau217/β-Amyloid 1-42 Plasma Ratio as the first in vitro diagnostic device that tests blood to aid in diagnosing Alzheimer’s disease. The clearance positioned the test for adults aged 55 years and older with signs and symptoms of the disease in specialized care. The FDA also noted practical constraints: the test is not intended as a screening tool or a stand-alone diagnostic, and results must be interpreted in the context of other clinical information.
The immediate translational question after clearance was transportability: do the FDA-cleared cutpoints behave the same way outside the conditions that generated the clearance dataset? A study presented at the 18th Clinical Trials on Alzheimer’s Disease (CTAD) conference (December 1 to 4, 2025, San Diego) directly tested that by applying FDA-cleared ratio cutpoints across two cohorts, a risk-enriched research cohort and a real-world memory clinic cohort, while also comparing alternative interpretation strategies.
The headline result, reported from a CTAD 2025 conference presentation and summarized by Practical Neurology, was a performance split:
- Research cohort (n=101): FDA-cleared ratio cutpoints yielded 99% sensitivity and 85% specificity (95% accuracy).
- Real-world cohort (n=252): FDA-cleared ratio cutpoints yielded 100% sensitivity but 23% specificity (77% accuracy).
- Same real-world cohort, alternative strategies: in-house ratio cutpoints yielded 96% sensitivity and 79% specificity (94% accuracy); p-tau217 concentration thresholds alone yielded 98% sensitivity and 84% specificity (94% accuracy).
Those numbers are useful because they localize the problem. Sensitivity held up, suggesting the assay could still identify many amyloid-positive individuals. The collapse was in specificity in routine clinical populations using FDA-cleared cutpoints, producing a high false-positive burden that changes how the test behaves as a triage tool.
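The false-positive burden can be made concrete with standard predictive-value arithmetic. The sensitivity and specificity figures below are the cohort results reported above; the 50% amyloid-positivity prevalence is an assumption for illustration, not a reported cohort statistic.

```python
# Sketch: how specificity drives positive predictive value (PPV).
# Sensitivity/specificity pairs come from the CTAD real-world cohort
# figures cited above; the prevalence value is an assumed illustration.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV = TP / (TP + FP) for a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# FDA-cleared ratio cutpoints in the real-world cohort: 100% sens, 23% spec
print(round(ppv(1.00, 0.23, 0.5), 3))  # → 0.565
# p-tau217 concentration thresholds alone: 98% sens, 84% spec
print(round(ppv(0.98, 0.84, 0.5), 3))  # → 0.86
```

Under this assumed prevalence, nearly half of positive calls from the 23%-specificity strategy would be false positives, which is what makes the test behave poorly as a triage tool even though sensitivity is intact.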
A plausible contributor is the ratio format itself: dividing by plasma Aβ42 means the result inherits any instability in Aβ42 measurement. Plasma Aβ peptides are sensitive to pre-analytical handling, and the literature is clear that collection, preservation and processing can shift measured Aβ40 and Aβ42. A pre-analytical study in Scientific Reports (2024) addresses the issue directly, proposing a sample-handling procedure that uses a HISCL (High Sensitivity Chemiluminescence Immunoassay) analyzer to shed light on “the individual/combined effects of pre-analytical parameters on plasma Aβ42/Aβ40 levels.” The study notes that plasma Aβ peptides are unstable molecules whose measurement is influenced by sample handling conditions and other pre-analytical parameters. In that study’s experiments, Aβ40 and Aβ42 could shift by more than 10% under some storage conditions, even when a ratio readout remained within the authors’ acceptability criteria. A late-2025 review in the Journal of Neurology similarly emphasizes that plasma Aβ42 and Aβ40 levels are influenced by sample collection and preservation, along with pre-analytical processing, detection methods and analysis choices.
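The arithmetic of denominator instability is simple but worth making explicit. A minimal sketch, assuming the p-tau217 numerator is unaffected: if pre-analytical handling depresses measured Aβ42 by 10% (a magnitude consistent with the storage effects described above), the ratio inflates by about 11%, which can push a borderline sample across a fixed cutpoint.

```python
# Sketch: why a ratio readout inherits denominator instability.
# Inputs are relative shifts (e.g. -0.10 for a 10% loss); output is the
# resulting relative change in the pTau217/Abeta42 ratio.
def ratio_shift(ptau_shift: float, abeta42_shift: float) -> float:
    return (1 + ptau_shift) / (1 + abeta42_shift) - 1

# 10% Abeta42 loss with p-tau217 unchanged inflates the ratio ~11%
print(round(ratio_shift(0.0, -0.10), 3))  # → 0.111
```

A single-analyte threshold has no denominator to destabilize, which is one way to read the CTAD finding that p-tau217 concentration thresholds held up better in the real-world cohort.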
COU is where that operational fragility becomes a concrete design decision. If the intended workflow is a tightly managed study with strict pre-analytical controls, a ratio approach may be appropriate and may deliver strong discrimination. If the intended workflow is trial prescreening across many sites, or routine clinic use where tubes, processing times, shipping windows, and freeze-thaw histories vary, the “best” marker may be the one that fails more gracefully under imperfect handling. The CTAD comparison suggests that, in at least one real-world cohort, a single-analyte strategy based on p-tau217 thresholds performed better than applying the FDA-cleared ratio cutpoints as written.
The translational lesson is narrow and practical. “Best biomarker” is not a property of the analyte alone. It is a property of analyte, assay format, pre-analytical control, and the decision being made. A COU statement that names the decision and the operating environment forces teams to confront that reality early, before a cutpoint becomes a fragile asset that does not travel.
3. NIS2+ case study: validation that starts with the comorbidity
MASH drug development illustrates a different fit-for-purpose problem: trial logistics. Many mid- to late-stage trials still rely on histology to confirm “at-risk” disease, and liver biopsy is invasive, variable and operationally expensive. The result is often a high biopsy failure rate during screening, which becomes an enrollment constraint independent of therapeutic hypothesis.
NIS2+ was developed with that screening funnel as the target use case. In a prospective derivation and validation study, investigators describe NIS2+ as an optimization of the earlier NIS4 technology for detecting at-risk NASH in patients with metabolic risk factors, with an emphasis on robustness across patient characteristics that commonly drive heterogeneity in real cohorts, including age, sex, BMI and type 2 diabetes status (Journal of Hepatology, 2023 (PDF)).
The comorbidity constraint is not background context in that paper. It is a design input. The authors note that NIS4 requires a multi-platform analysis of four biomarkers, including HbA1c and they state that antidiabetic medications that reduce HbA1c can affect NIS4 scores and complicate interpretation, including practical ambiguity about whether the test should be performed before or after antidiabetic treatment changes. They describe NIS2+ as an optimization intended to address those issues, including a move to an exclusively serum-based test built on fewer biomarkers.
The COU is deliberately narrow: improve trial screening by identifying which metabolically at-risk patients should be referred to biopsy, while avoiding recruitment bias tied to common stratification variables. A retrospective simulation analysis in the RESOLVE-IT screening pathway cohort, published open access in the Journal of Hepatology (2024), quantified what “improvement” means in operational terms. Using cost-optimized cutoffs, an NIS2+ pathway reduced liver biopsy failure rate to 39%, compared with 60% under the RESOLVE-IT screening pathway and 58% under a FIB-4 strategy. In the same analysis, the NIS2+ pathway reduced unnecessary biopsies and lowered estimated screening costs, while increasing the number of patients needed to screen only moderately.
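The operational meaning of those failure rates can be sketched with simple funnel arithmetic. This is an illustrative simplification (it ignores pre-biopsy attrition and cost structure); the failure rates are the figures from the cited simulation.

```python
# Sketch of the screening-funnel arithmetic implied by biopsy failure
# rates: expected biopsies performed per successfully enrolled patient
# is 1 / (1 - failure_rate). A deliberate simplification for intuition.
def biopsies_per_enrollee(biopsy_failure_rate: float) -> float:
    return 1.0 / (1.0 - biopsy_failure_rate)

print(round(biopsies_per_enrollee(0.60), 2))  # RESOLVE-IT pathway → 2.5
print(round(biopsies_per_enrollee(0.39), 2))  # NIS2+ pathway → 1.64
```

Cutting the failure rate from 60% to 39% means roughly one fewer biopsy per enrolled patient, which is where the reduction in unnecessary biopsies and screening cost comes from.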
This is fit-for-purpose validation behaving as intended. The claim is not that biopsy is obsolete, or that a single score “diagnoses” MASH across all settings. The claim is that a specific screening decision in a comorbid population can be made more efficient without shifting the enrolled population in undesirable ways.
4. Implications and what “enhancement” looks like operationally
Across both case studies, the direction of travel is consistent. Enhancing biomarker validation has become less about chasing a universal validation state and more about building evidence that matches a defined decision in a defined operating environment. Translational teams that are doing this well tend to standardize a small set of moves:
- Write the COU early enough to shape the work. Treat COU as a specification, not a paragraph. Include population, setting, sample type, required handling constraints, interpretation rules (including indeterminate zones), and what decision the result will and will not support.
- Tier validation to the decision and record the rationale. Use M10-aligned rigor when biomarker results will support regulatory decisions, as the FDA guidance recommends. For internal decision support, define the minimum acceptable performance and the failure modes that are tolerable, then document what you chose not to validate yet and why.
- Stress-test pre-analytics as part of “validation,” not as an afterthought. The FDA guidance explicitly treats sample collection, handling, storage and shipping as determinants of data integrity. Design bridging work that matches the reality of the intended deployment. If the biomarker must travel across sites, validate against site variability rather than assuming it away.
- Run COU audits at milestones. Before Phase transitions and before locking pivotal protocols, check whether the biomarker is being used for higher-stakes decisions than the COU originally described, whether the cutpoint strategy drifted and whether site logistics still satisfy the original pre-analytical assumptions.
- Build a lifecycle plan. For biomarkers that will be deployed beyond the original development cohort, define monitoring metrics, recalibration triggers and revalidation thresholds in advance. Treat drift as expected and manageable, not as a surprise.
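“COU as a specification, not a paragraph” can be made literal by encoding the fields named above as a structured record. This is a hypothetical sketch: the field names and the example values are illustrative, not a regulatory template.

```python
# Hypothetical "COU as a specification" record. Field names and example
# values are illustrative assumptions, not a regulatory template.
from dataclasses import dataclass, field

@dataclass
class ContextOfUse:
    decision: str                  # the decision the result will support
    population: str
    setting: str                   # e.g. specialized care, multi-site trial
    sample_type: str
    handling_constraints: list[str] = field(default_factory=list)
    indeterminate_rule: str = "repeat once; adjudicate if still indeterminate"
    unsupported_decisions: list[str] = field(default_factory=list)

# Example instance loosely modeled on the NIS2+ screening use case above
cou = ContextOfUse(
    decision="refer metabolically at-risk patients to biopsy",
    population="adults with metabolic risk factors",
    setting="multi-site trial screening",
    sample_type="serum",
    handling_constraints=["single freeze-thaw", "documented processing window"],
    unsupported_decisions=["stand-alone diagnosis", "treatment response"],
)
```

A record like this is auditable at milestones: a COU audit becomes a diff between the fields as written and how the biomarker is actually being used.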
Fit-for-purpose validation, as formalized in the FDA’s January 2025 guidance, raises the floor for how explicitly teams should connect biomarker evidence to biomarker use. It does not remove uncertainty. It makes the uncertainty audit-able, which is the core requirement for translational work that has to survive beyond a single dataset.