This page is a reference describing our methodology for assigning levels of evidence to regimens.

Important note: Our intent is not to provide clinical decision support. Rather, our goal is to faithfully reproduce findings of clinical trials. Efficacy and toxicity information, in particular, is sometimes presented by authors in a confusing or ambivalent manner. As such, we try to illustrate ambiguities when they happen, and take no responsibility for your decision to choose a particular treatment regimen. Please read our disclaimer for further information.

Note for colorblind users: We are aware that the color scales we use are not colorblind-safe and are not compliant with Section 508. We have no current plans to change the overall color scheme but welcome feedback on this particular point.

See the sections below for a discussion of the various metrics we use. Feedback is welcome!

Evidence

Generally, a regimen should be evaluated in a randomized fashion with an adequate patient sample to be considered strong evidence. We have defined adequate as 20 or more patients per arm. Non-randomized studies with 20 or more patients enrolled and randomized studies with fewer than 20 patients per arm are considered to be moderate evidence. Finally, case reports, retrospective series, and non-randomized studies with fewer than 20 patients enrolled are considered to be weak evidence. Of course, there are finer gradations of the quality of evidence, such as whether an RCT was blinded, so this simplified scheme should be taken with a grain of salt.

Evidence is thus reported using one of three color-coded labels, using a sequential ColorBrewer scale:

Strength of evidence	Wiki code
Strong evidence	style="background-color:#1a9851"
Moderate evidence	style="background-color:#91cf61"
Weak evidence	style="background-color:#ffffbe"
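
The rules in this section can be sketched in a few lines of Python. This is only an illustration of the thresholds described here, not the code used to label the site; the names `Evidence`, `classify_evidence`, and `wiki_style`, and the design strings they accept, are hypothetical.

```python
from enum import Enum


class Evidence(Enum):
    """Evidence levels with the background colors from the table above."""
    STRONG = "#1a9851"
    MODERATE = "#91cf61"
    WEAK = "#ffffbe"


MIN_PATIENTS = 20  # "adequate" sample: 20 or more patients


def classify_evidence(design: str, patients: int) -> Evidence:
    """Classify a study arm.

    design   -- 'randomized', 'non-randomized', 'retrospective', or
                'case report' (hypothetical labels chosen for this sketch)
    patients -- patients per arm for randomized trials, otherwise the
                number of patients enrolled
    """
    if design == "randomized":
        return Evidence.STRONG if patients >= MIN_PATIENTS else Evidence.MODERATE
    if design == "non-randomized":
        return Evidence.MODERATE if patients >= MIN_PATIENTS else Evidence.WEAK
    if design in ("retrospective", "case report"):
        # Always weak, no matter how large the series is.
        return Evidence.WEAK
    raise ValueError(f"unknown study design: {design!r}")


def wiki_style(level: Evidence) -> str:
    """Return the wiki style attribute for an evidence level."""
    return f'style="background-color:{level.value}"'
```

For example, `classify_evidence("randomized", 15)` yields `Evidence.MODERATE`, because the arm has fewer than 20 patients.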

Examples

A trial with strong evidence: R-CHOP for untreated follicular lymphoma

Study	Evidence
Flinn et al. 2014 (BRIGHT)	Phase III

A trial with moderate evidence: bortezomib & rituximab for untreated follicular lymphoma

Study	Evidence
Evens et al. 2014	Phase II

A trial with weak evidence: cladribine for aggressive systemic mastocytosis

Study	Evidence
Lim et al. 2009	Retrospective

Frequently asked questions

Q: What is the current status of evidence labeling on hemonc.org?
A: Nearly 100% of chemotherapy regimens and their variants now have a level of evidence label.

Q: Are all arms of a randomized trial labeled the same?
A: No, it depends on how many patients are in each arm of the trial. For arms that have at least 20 patients, the evidence is labeled as strong. For arms with fewer than 20, the evidence is labeled as moderate.

Q: Are non-randomized trials all labeled the same?
A: No, it depends on how many patients are in the trial. For trials that have at least 20 patients, the evidence is labeled as moderate. For trials with fewer than 20, the evidence is labeled as weak.

Q: Some retrospective analyses are very large; will they be labeled as moderate evidence?
A: No, currently we label all retrospective analyses as weak evidence, no matter how large. Although we are major proponents of the secondary use of data, including automated methods of EHR data extraction, the risk of unknown bias and confounding is currently too high to label regimens derived from retrospective data as anything other than weak evidence. Likewise, a trial that reports on a comparison to historic or contemporary controls not enrolled in that trial will be considered non-randomized.

Q: How can I tell which arm of an RCT was the experimental arm?
A: Currently, we do not distinguish between experimental and control arms. However, starting with phase III trials, we will begin labeling the arms using one of the following:

Phase III (E): an experimental arm of the RCT
Phase III (C): the control arm of the RCT

Efficacy

Defined generally, efficacy is the presence of a positive effect on the study population. Conversely, lack of efficacy is the absence of an expected positive effect, or the failure to achieve expected outcomes in adequate numbers of patients. Efficacy can be reported ranging from a weak surrogate measure (e.g., response rate) to a direct measure of overall survival; see our responses to treatment page for more information. Currently, we are focusing on adding information on statistical comparative efficacy for randomized trials, and primary endpoints for non-randomized trials with >20 participants. Many non-randomized trials report efficacy compared to historical controls. However, in the rapidly developing field of oncology, this approach is rife with bias and as such we do not report on comparison to historical controls. Future work will involve reporting on effect sizes as well as statistical comparative efficacy (see Miksad et al. 2007 for a discussion of why this is important).

Non-comparative efficacy

Non-comparative efficacy is reported using a multi-hue sequential ColorBrewer scale in which lighter colors indicate higher response rates. At this time, if a non-comparative trial reports a time interval as the primary endpoint (e.g., PFS), we will still report a response rate, given that time intervals cannot be interpreted except in comparison to historic controls (see above):

Response rate	Wiki code
0-12.5%	style="background-color:#6e016b; color:white"
12.5-25%	style="background-color:#88419d; color:white"
25-37.5%	style="background-color:#8c6bb1"
37.5-50%	style="background-color:#8c96c6"
50-62.5%	style="background-color:#9ebcda"
62.5-75%	style="background-color:#bfd3e6"
75-87.5%	style="background-color:#e0ecf4"
87.5-100%	style="background-color:#f7fcfd"
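
As an illustration, the eight bins above can be looked up by dividing the response rate by the bin width of 12.5 percentage points. This is a sketch only; the function name `response_rate_style` is hypothetical. The table does not say which bin a boundary value such as exactly 12.5% belongs to, so assigning it to the upper bin is an assumption.

```python
# Background colors for the eight response-rate bins, lowest bin first.
RESPONSE_COLORS = [
    "#6e016b", "#88419d", "#8c6bb1", "#8c96c6",
    "#9ebcda", "#bfd3e6", "#e0ecf4", "#f7fcfd",
]
BIN_WIDTH = 12.5  # percentage points per bin
WHITE_TEXT_BINS = 2  # the two darkest bins use white text


def response_rate_style(rate_percent: float) -> str:
    """Return the wiki style attribute for a response rate in percent."""
    if not 0 <= rate_percent <= 100:
        raise ValueError("response rate must be between 0 and 100")
    # Assumption: a boundary value falls in the upper bin; 100% is
    # clipped into the last bin.
    index = min(int(rate_percent // BIN_WIDTH), len(RESPONSE_COLORS) - 1)
    style = f"background-color:{RESPONSE_COLORS[index]}"
    if index < WHITE_TEXT_BINS:
        style += "; color:white"
    return f'style="{style}"'
```

A response rate of 60% falls in the 50-62.5% bin, so `response_rate_style(60)` returns `style="background-color:#9ebcda"`.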

Comparative efficacy

Comparative efficacy is reported using a red-yellow-green divergent ColorBrewer scale, with hue as a function of p-value.

Superior and inferior findings

Comparative efficacy	p-value	Narrative description	Wiki code
Superiority	≤ 0.01	Superior endpoint	style="background-color:#1a9850"
Superiority	> 0.01 and ≤ 0.05	Seems to have superior endpoint	style="background-color:#91cf60"
Superiority	> 0.05 and ≤ 0.1	Might have superior endpoint	style="background-color:#d9ef8b"
Inferiority	> 0.05 and ≤ 0.1	Might have inferior endpoint	style="background-color:#fee08b"
Inferiority	> 0.01 and ≤ 0.05	Seems to have inferior endpoint	style="background-color:#fc8d59"
Inferiority	≤ 0.01	Inferior endpoint	style="background-color:#d73027"
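
The table above amounts to a lookup on the direction of the finding and three p-value thresholds. The sketch below is one way to express it; `comparative_label` is a hypothetical name. Returning `None` for p-values above 0.1 is an assumption made for this sketch, since such findings are not covered by this table.

```python
from typing import Optional, Tuple

THRESHOLDS = [0.01, 0.05, 0.1]  # upper bounds of the three p-value tiers
PHRASES = ["{} endpoint", "Seems to have {} endpoint", "Might have {} endpoint"]
COLORS = {
    "superior": ["#1a9850", "#91cf60", "#d9ef8b"],
    "inferior": ["#d73027", "#fc8d59", "#fee08b"],
}


def comparative_label(p_value: float, direction: str) -> Optional[Tuple[str, str]]:
    """Return (narrative description, hex color), or None if p > 0.1.

    direction -- 'superior' or 'inferior'
    """
    if direction not in COLORS:
        raise ValueError("direction must be 'superior' or 'inferior'")
    if not 0 <= p_value <= 1:
        raise ValueError("p-value must be between 0 and 1")
    for tier, upper in enumerate(THRESHOLDS):
        if p_value <= upper:
            phrase = PHRASES[tier].format(direction)
            # Capitalize only the first letter ("superior endpoint" case).
            return phrase[0].upper() + phrase[1:], COLORS[direction][tier]
    return None  # not covered by the superior/inferior table
```

For instance, `comparative_label(0.03, "inferior")` returns `("Seems to have inferior endpoint", "#fc8d59")`.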

Negative and non-inferior findings

The distinction between a negative superiority trial and a positive non-inferiority or equivalence study is crucial. In a superiority trial, a treatment has been hypothesized to be better than another, but in the end the null hypothesis was not rejected. There is still a distinct possibility that one treatment is superior to the other (type II error), but the signal is not observed due to underpowering issues, excess crossover, attrition such that the intention-to-treat population is small, or obfuscation of a subgroup by the larger population. Non-inferiority trials use a one-sided test to determine whether a new intervention is no worse than a standard intervention. Equivalence trials have a similar design but use a two-sided test, allowing for the possibility that the new intervention is no better than the standard one. These designs require much greater numbers of participants, and are often used to evaluate a new treatment that is likely to have comparable efficacy but has an improved (or different) toxicity profile.

For this reason, we use intensity as well as hue to distinguish between these concepts:

Type of outcome	Narrative description	Wiki code
A "negative" superiority trial	Seems not superior	style="background-color:#ffffbf"
A "negative" non-inferiority trial	Inconclusive whether non-inferior	style="background-color:#ffffbf"
A "negative" equivalence trial	Inconclusive whether equivalent	style="background-color:#ffffbf"
A "positive" non-inferiority trial	Non-inferior endpoint	style="background-color:#eeee01"
A "positive" equivalence trial	Equivalent endpoint	style="background-color:#eeee01"
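
The same table can be written as a lookup keyed on the trial design and whether the trial met its hypothesis. This is an illustrative sketch; `negative_or_noninferior_label` and the design strings are hypothetical. A superiority trial that met its hypothesis is deliberately absent, because it is labeled by p-value instead.

```python
from typing import Dict, Tuple

# (trial design, hypothesis met?) -> (narrative description, hex color)
OUTCOMES: Dict[Tuple[str, bool], Tuple[str, str]] = {
    ("superiority", False): ("Seems not superior", "#ffffbf"),
    ("non-inferiority", False): ("Inconclusive whether non-inferior", "#ffffbf"),
    ("equivalence", False): ("Inconclusive whether equivalent", "#ffffbf"),
    ("non-inferiority", True): ("Non-inferior endpoint", "#eeee01"),
    ("equivalence", True): ("Equivalent endpoint", "#eeee01"),
}


def negative_or_noninferior_label(design: str, hypothesis_met: bool) -> Tuple[str, str]:
    """Return (narrative description, hex color) for the table above."""
    try:
        return OUTCOMES[(design, hypothesis_met)]
    except KeyError:
        raise ValueError(
            f"no label in this table for design={design!r}, "
            f"hypothesis_met={hypothesis_met}"
        ) from None
```

Note that all three "negative" outcomes share the pale color `#ffffbf`, while the two "positive" outcomes share the more intense `#eeee01`.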

More details

What we are really interested in is whether efficacy findings from a clinical trial will hold for our patient. To that end, we as a medical community have historically relied on the cutoff of p=0.05 to decide whether or not a finding is significant and true. Of course, this means that when there is truly no difference between treatments, approximately 1 in 20 comparisons will still cross the threshold by chance and be reported as a false positive. This "holy grail" cutoff has led to significant publication bias, which is well summarized by Dr. John Ioannidis in his seminal 2005 paper "Why Most Published Research Findings Are False." One potential solution, which we have adopted, is to report comparative efficacy "in plain English" as shown in the graphic below (link to original 2009 article).

Here is another, only slightly tongue-in-cheek, way of considering p-values, from XKCD.

Examples

1. A treatment regimen with superior efficacy: BR for untreated follicular lymphoma

Study	Evidence	Comparator	Comparative Efficacy
Rummel et al. 2013 (StiL NHL1)	Phase III	R-CHOP	Superior PFS

2. A treatment regimen which failed to demonstrate a difference in primary endpoint: gemcitabine for metastatic pancreatic cancer

Study	Evidence	Comparator	Comparative Efficacy
Hong et al. 2013	Randomized Phase II	Gemcitabine & Simvastatin	Seems not superior

3. A treatment regimen that is non-inferior to its comparator: capecitabine & bevacizumab for metastatic breast cancer

Study	Evidence	Comparator	Comparative Efficacy
Zielinski et al. 2016 (TURANDOT)	Phase III	Paclitaxel & Bevacizumab	Non-inferior OS

4. A treatment regimen with inferior efficacy: dexamethasone for relapsed/refractory multiple myeloma

Study	Evidence	Comparator	Comparative Efficacy
Rajkumar et al. 2008	Phase III	TD	Inferior TTP

Frequently asked questions

Q: What is the current status of efficacy labeling on hemonc.org?
A: All phase III trials and most randomized phase II trials are now labeled for efficacy. Future work includes labeling non-randomized trials with overall response rates (ORR).

Q: How do you choose to label efficacy when multiple outcomes are reported?
A: Often, a trial will report on multiple outcomes, such as overall response rate, progression-free survival, and overall survival. We generally look to the PRIMARY endpoint, as defined in the published methods. If the PRIMARY endpoint is negative but the trial reports a positive secondary finding that is more of a surrogate, we do NOT label by that secondary finding (for example, the primary endpoint was OS, for which no difference was shown, but the investigational arm did have better PFS). If a secondary endpoint shows differential efficacy and is LESS "surrogate" than the primary endpoint (see below), we will label by that endpoint.
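
The endpoint-selection rule in this answer can be sketched as follows. This is a hedged illustration rather than our actual workflow: the `Endpoint` class, the `endpoint_to_label` function, and the numeric `surrogacy` ranks (0 for the least surrogate endpoint, larger numbers for more surrogate ones) are all hypothetical, and the ranks must be supplied by the caller. The sketch also assumes that a significant, less surrogate secondary endpoint takes precedence whether or not the primary endpoint was positive.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Endpoint:
    name: str
    surrogacy: int     # 0 = least surrogate; higher = more surrogate (caller-assigned)
    significant: bool  # did the arms differ on this endpoint?


def endpoint_to_label(primary: Endpoint, secondaries: List[Endpoint]) -> Endpoint:
    """Pick the endpoint that the efficacy label is based on."""
    # Only secondary endpoints that show a difference AND are less
    # surrogate than the primary endpoint can replace it.
    candidates = [
        e for e in secondaries
        if e.significant and e.surrogacy < primary.surrogacy
    ]
    if candidates:
        return min(candidates, key=lambda e: e.surrogacy)
    # Otherwise label by the primary endpoint, even when it is negative
    # and a more surrogate secondary endpoint is positive.
    return primary


# The example from the answer above: OS was the primary endpoint and was
# not shown, while PFS (more surrogate) was better. The label stays on OS.
os_primary = Endpoint("OS", surrogacy=0, significant=False)
pfs_secondary = Endpoint("PFS", surrogacy=1, significant=True)
chosen = endpoint_to_label(os_primary, [pfs_secondary])
```

Here `chosen.name` is `"OS"`, so the trial would be labeled as negative for its primary endpoint rather than positive for PFS.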

Q: How do you handle changes in efficacy when trial results are updated?
A: Many trials are published at the time of preliminary findings, and have subsequent publications as the results mature. At this time, most of these updates are added to the bibliography section of the regimen. If the strength of the efficacy assertion changes, or if a less surrogate endpoint becomes significant, we will update the efficacy label and mark it with a (*) to denote that the efficacy has been updated.

Q: Do you account for effect size?
A: Not currently, although this is certainly something we may want to include in the future, given that a highly statistically significant but extremely small effect size could be considered clinically meaningless. Likewise, regimens that have a borderline p-value due to power issues but have a large effect size, such as a 6-month improvement in overall survival in stage IV cancer, may be highly clinically meaningful.

Q: Do you have a hierarchy of surrogacy?
A: Yes, we use a three-level hierarchy to determine the strength of an outcome measure: strong endpoints, intermediate surrogate endpoints, and weak surrogate endpoints. Please see the dedicated response to treatment page for more details. Note that this hierarchy is NOT used to inform the coloration of the efficacy label, at this time.

Q: What about exceptional responders?
A: It is increasingly recognized, especially with newer therapies such as immunotherapy, that some patients may experience a remarkable response to a drug that otherwise appears to lack efficacy in the population. These patients are usually referred to as "exceptional responders" and may provide significant insights into rational treatment selection (a.k.a. precision medicine). At this time we do not make a particular effort to identify exceptional responders, nor do we consider a regimen for inclusion in HemOnc.org if the manuscript states that it generally lacks efficacy.

Q: Do you consider health-related quality of life (HRQoL) measures in efficacy?
A: Very few RCTs report on QoL measures, but when they do, we plan to include these as a measure of toxicity, not efficacy.

Toxicity

Defined generally, toxicity is the presence or absence of a negative effect (harm) on the study population. This is often also referred to as safety. As with efficacy, we only report comparative toxicity. Much of the focus on HemOnc.org has been on efficacy, due in large part to the fact that, while standardized, toxicity reporting tends to be limited in granularity and interpretability. We are currently taking two approaches to toxicity:

Toxicity information from the primary literature

Due to less granular and more subjective reporting in the primary literature, toxicity is currently reported using one of three labels:

Decreased toxicity	Equivalent toxicity	Increased toxicity

Example

A treatment regimen with increased toxicity: R-CHOP for untreated follicular lymphoma

Study	Evidence	Comparator	Comparative Efficacy	Comparative Toxicity
Hiddemann et al. 2005	Phase III	CHOP	Seems to have superior OS	Increased toxicity

Toxicity information from companion or post hoc analyses

If a dedicated analysis of toxicity is performed, we will report using the standard coloring used above for efficacy. In particular, we consider health-related quality of life (HRQoL) analyses to be a rigorous surrogate of toxicity, and are actively adding these to the site.

Example

A treatment regimen with worse HRQoL: placebo for metastatic castrate-resistant prostate cancer

Study	Evidence	Comparator	Comparative Efficacy	Comparative Toxicity
Beer et al. 2014 (PREVAIL)	Phase III	Enzalutamide	Inferior OS	Worse HRQoL

Frequently asked questions

Q: What is the current status of toxicity labeling on hemonc.org?
A: A few regimens are currently labeled for toxicity; we are focusing current efforts on labeling efficacy.

Q: Are you basing the label on the reported CTCAE measures?
A: CTCAE measures are extremely valuable in that they are structured and thus reproducible. However, it is often hard to compare them directly. For example, if one regimen has grade 4 lab-based toxicity and the other has grade 2 gastrointestinal toxicity, which is the more toxic? In general, we plan to use the authors' interpretation of overall toxicity and tolerability when labeling from primary literature - or better yet, prospectively-gathered quality-of-life data (see above).

Q: Do you plan to incorporate patient-reported outcomes?
A: As shown in numerous publications, patient reports of toxicity are more accurate than clinician assessments. However, they have not until recently been standardized. Now that the PRO-CTCAE is available, we expect to see more of these in the future and will incorporate them into the toxicity assessment.

Q: Do you plan to incorporate financial toxicity?
A: This is an incredibly important topic, and the reader is encouraged to read the publications of Dr. Yousuf Zafar for some excellent background. For now, we are focused primarily on clinical measures but are open to adding financial toxicity information.

Value frameworks

Introduction

Several leading hematology/oncology societies and organizations have put forth value frameworks. Conceptually, these are approaches which are intended to combine measures of efficacy, toxicity, quality of life, and financial toxicity into a metric of value that can be used to influence treatment decisions. Our coverage of value frameworks will expand over time.

ASCO Value Framework

ESMO Magnitude of Clinical Benefit Scale

2015 (Version 1.0): A standardised, generic, validated approach to stratify the magnitude of clinical benefit that can be anticipated from anti-cancer therapies: the European Society for Medical Oncology Magnitude of Clinical Benefit Scale (ESMO-MCBS) PubMed
2017 (Version 1.1): ESMO-Magnitude of Clinical Benefit Scale version 1.1

NCCN Evidence Blocks

2016: NCCN Evidence Blocks PubMed

Comparisons

ASCO Value Framework vs. ESMO-MCBS

2017: Do the American Society of Clinical Oncology Value Framework and the European Society of Medical Oncology Magnitude of Clinical Benefit Scale Measure the Same Construct of Clinical Benefit? PubMed

ASCO Value Framework vs. NCCN Evidence Blocks

2017: Value Frameworks for the Patient-Provider Interaction: A Comparison of the ASCO Value Framework Versus NCCN Evidence Blocks in Determining Value in Oncology PubMed

Frequently asked questions

Q: Do you plan to add value measures to HemOnc.org?
A: At some point in the future, we may add value measures to some regimens. However, given that these frameworks were introduced recently and are undergoing frequent revision, for now we will continue to monitor their development.
