Difference between revisions of "Levels of Evidence"
Revision as of 01:41, 17 October 2017
The purpose of this page is to create a reference to describe our methodology for assigning levels of evidence to regimens.
Important note: Our intent is not to provide clinical decision support. Rather, our goal is to faithfully reproduce findings of clinical trials. Efficacy and toxicity information, in particular, is sometimes presented by authors in a confusing or ambivalent manner. As such, we try to illustrate ambiguities when they happen, and take no responsibility for your decision to choose a particular treatment regimen. Please read our disclaimer for further information.
Note for colorblind users: We are aware that the color scales we use are not colorblind-safe and are not compliant with Section 508. We have no current plans to change the overall coloring schema but welcome feedback on this particular point.
See the sections below for a discussion of the various metrics we use. Feedback is welcome!
Evidence
Generally, a regimen should be evaluated in a randomized fashion with an adequate patient sample to be considered a "green" regimen. We have defined adequate as 20 or more patients per arm. Non-randomized studies and randomized studies with fewer than 20 patients per arm are considered to be "yellow" regimens. Finally, case reports, retrospective series, and non-randomized studies with fewer than 20 patients enrolled are considered to be "red" regimens. Of course, there are finer gradations of the quality of evidence so this simplified scheme should be taken with a grain of salt.
Evidence is thus reported using one of the three color-coded labels:
Strong evidence | Moderate evidence | Weak evidence |
Wiki codes for these colors are:
style="background-color:#00cd00" | style="background-color:#eeee00" | style="background-color:#ff0000" |
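The rules above can be sketched as a small lookup. This is only an illustration of the stated thresholds, not code used by the site; the design names (`"randomized"`, `"non-randomized"`, `"retrospective"`, `"case report"`) and the function names are assumptions made for this example.

```python
# Wiki style codes listed above, keyed by evidence label.
EVIDENCE_STYLES = {
    "Strong evidence": 'style="background-color:#00cd00"',
    "Moderate evidence": 'style="background-color:#eeee00"',
    "Weak evidence": 'style="background-color:#ff0000"',
}

MIN_PATIENTS = 20  # "adequate" sample size as defined in the text


def evidence_level(design: str, patients: int) -> str:
    """Return the evidence label for one arm or one trial.

    design   -- hypothetical category name (see the lead-in above)
    patients -- patients in the arm (randomized) or in the trial (otherwise)
    """
    if design in ("retrospective", "case report"):
        # Weak evidence regardless of size.
        return "Weak evidence"
    if design == "randomized":
        return "Strong evidence" if patients >= MIN_PATIENTS else "Moderate evidence"
    if design == "non-randomized":
        return "Moderate evidence" if patients >= MIN_PATIENTS else "Weak evidence"
    raise ValueError(f"unknown design: {design!r}")


def wiki_cell(label: str) -> str:
    """Format an evidence label as a colored wikitext table cell."""
    return f"|{EVIDENCE_STYLES[label]}|{label}"
```

For a randomized trial, the function would be called once per arm, so arms of different sizes can receive different labels.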
Examples
A trial with strong evidence: R-CHOP for untreated follicular lymphoma
Study | Evidence |
Flinn et al. 2014 (BRIGHT) | Phase III |
A trial with moderate evidence: bortezomib & rituximab for untreated follicular lymphoma
Study | Evidence |
Evens et al. 2014 | Phase II |
A trial with weak evidence: cladribine for aggressive systemic mastocytosis
Study | Evidence |
Lim et al. 2009 | Retrospective |
Frequently asked questions
Q: What is the current status of evidence labeling on hemonc.org?
A: Nearly 100% of chemotherapy regimens and their variants now have a level of evidence label.
Q: If a randomized trial has more than two arms, will they all be labeled the same?
A: No, it depends on how many patients are in each arm of the trial. For arms that have at least 20 patients, the label is green. For arms with fewer than 20, the label is yellow.
Q: Are non-randomized trials all labeled the same?
A: No, it depends on how many patients are in the trial. For trials that have at least 20 patients, the label is yellow. For trials with fewer than 20, the label is red.
Q: Some retrospective analyses are very large; will they be labeled yellow?
A: No, currently we label all retrospective analyses as red (weak evidence), no matter how large. Although we are major proponents of the secondary use of data, including automated methods of EHR data extraction, the level of unknown bias and confounding is currently too high to label regimens derived from retrospective data as anything other than weak evidence. Likewise, a trial that reports on a comparison to historic or contemporary controls not enrolled in that trial will be considered non-randomized.
Efficacy
Defined generally, efficacy is the presence of a positive effect on the study population. Conversely, lack of efficacy is the absence of an expected positive effect, or the failure to achieve expected outcomes in adequate numbers of patients. Efficacy can be reported ranging from a weak surrogate measure (e.g., response rate) to a direct measure of overall survival. Currently, we are focusing on adding information on statistical comparative efficacy for randomized trials, and overall response rates (ORRs) for non-randomized trials with >20 participants. Many non-randomized trials report efficacy compared to historical controls. However, in the rapidly developing field of oncology, this approach is rife with bias and as such we do not report on comparison to historical controls. Future work will involve reporting on effect sizes as well as statistical comparative efficacy (see Miksad et al. 2007 for a discussion of why this is important).
Efficacy is reported using a tri-color labeling, with a modifier based on statistical significance of the finding:
Superior findings
Regimens with superior comparative efficacy are labeled using a green divergent ColorBrewer scale, with intensity as a function of p-value:
Strong signal: p-value ≤ 0.01
No modifier is used; efficacy is simply labeled as superior, e.g., superior PFS
Study | Evidence | Comparator | Efficacy |
[xx] | Phase III | [xx] | Superior endpoint |
Wiki code:
style="background-color:#1a9850" |
Moderate signal: p-value > 0.01 and ≤ 0.05
The finding is modified by "seems to have"; e.g., seems to have superior PFS
Study | Evidence | Comparator | Efficacy |
[yy] | Phase III | [yy] | Seems to have superior endpoint |
Wiki code:
style="background-color:#91cf60" |
Weak signal: p-value > 0.05 and ≤ 0.1
The finding is modified by "might have"; e.g., might have superior PFS
Study | Evidence | Comparator | Efficacy |
[zz] | Phase III | [zz] | Might have superior endpoint |
Wiki code:
style="background-color:#d9ef8b" |
Inferior findings
Regimens with inferior comparative efficacy are labeled using a red divergent ColorBrewer scale, with intensity as a function of p-value:
Strong signal: p-value ≤ 0.01
No modifier is used; efficacy is simply labeled as inferior, e.g., inferior OS
Study | Evidence | Comparator | Efficacy |
[xx] | Phase III | [xx] | Inferior endpoint |
Wiki code:
style="background-color:#d73027" |
Moderate signal: p-value > 0.01 and ≤ 0.05
The finding is modified by "seems to have"; e.g., seems to have inferior OS
Study | Evidence | Comparator | Efficacy |
[yy] | Phase III | [yy] | Seems to have inferior endpoint |
Wiki code:
style="background-color:#fc8d59" |
Weak signal: p-value > 0.05 and ≤ 0.1
The finding is modified by "might have"; e.g., might have inferior OS
Study | Evidence | Comparator | Efficacy |
[zz] | Phase III | [zz] | Might have inferior endpoint |
Wiki code:
style="background-color:#fee08b" |
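As a rough sketch, the p-value tiers for superior and inferior findings can be written as one function. The thresholds, modifiers, and color codes are taken from the text above; the function name, its arguments, and the choice to return `None` for p > 0.1 are assumptions made for illustration.

```python
# Background colors from the "Wiki code" entries above, strongest signal first.
EFFICACY_COLORS = {
    "superior": ["#1a9850", "#91cf60", "#d9ef8b"],
    "inferior": ["#d73027", "#fc8d59", "#fee08b"],
}
# Modifier for each tier: strong, moderate, weak.
MODIFIERS = ["", "Seems to have ", "Might have "]


def efficacy_label(direction: str, endpoint: str, p_value: float):
    """Return (label, color) for a comparative finding, or None if p > 0.1."""
    if direction not in EFFICACY_COLORS:
        raise ValueError(f"direction must be 'superior' or 'inferior', got {direction!r}")
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= 0.01:
        tier = 0  # strong signal, no modifier
    elif p_value <= 0.05:
        tier = 1  # moderate signal, "seems to have"
    elif p_value <= 0.1:
        tier = 2  # weak signal, "might have"
    else:
        return None  # not covered by the superior/inferior scales
    text = f"{MODIFIERS[tier]}{direction} {endpoint}"
    # Capitalize only the first letter so endpoint abbreviations stay intact.
    return text[0].upper() + text[1:], EFFICACY_COLORS[direction][tier]
```

For example, `efficacy_label("inferior", "OS", 0.03)` gives the label "Seems to have inferior OS" with the color `#fc8d59`.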
Negative and non-inferior findings
The distinction between a negative superiority trial and a positive non-inferiority or equivalence study is crucial. In a superiority trial, a treatment has been hypothesized to be better than another, but in the end the null hypothesis was not rejected. There is still a distinct possibility that one treatment is superior to the other (type II error), but the signal is not observed due to underpowering issues, excess crossover, attrition such that the intention-to-treat population is small, or obfuscation of a subgroup by the larger population. Non-inferiority trials use a one-sided test to determine whether a new intervention is no worse than a standard intervention. Equivalence trials have a similar design but use a two-sided test, allowing for the possibility that the new intervention is no better than the standard one. These designs require much greater numbers of participants, and are often used to evaluate a new treatment that is likely to have comparable efficacy but has an improved (or different) toxicity profile.
A "negative" superiority trial
Study | Evidence | Comparator | Efficacy |
[xx] | Phase III | [xx] | Seems not superior |
A "negative" non-inferiority trial
Study | Evidence | Comparator | Efficacy |
[yy] | Phase III | [yy] | Inconclusive whether non-inferior |
A "positive" non-inferiority trial
Study | Evidence | Comparator | Efficacy |
[zz] | Phase III | [zz] | Non-inferior endpoint |
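The three labels above depend on the trial design and on whether its primary hypothesis was met. A minimal sketch of that mapping follows; the design names and the function name are hypothetical, and equivalence designs are left out because the text does not give a separate label for them.

```python
from typing import Optional


def null_result_label(design: str, hypothesis_met: bool, endpoint: str = "endpoint") -> Optional[str]:
    """Return the label for negative and non-inferior findings.

    design         -- "superiority" or "non-inferiority" (names assumed for this sketch)
    hypothesis_met -- True if the trial rejected its null hypothesis
    endpoint       -- endpoint abbreviation, e.g. "OS"
    """
    if design == "superiority":
        if hypothesis_met:
            # Positive superiority trials use the p-value graded labels instead.
            return None
        return "Seems not superior"
    if design == "non-inferiority":
        if hypothesis_met:
            return f"Non-inferior {endpoint}"
        return "Inconclusive whether non-inferior"
    raise ValueError(f"unknown design: {design!r}")
```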
More details
What we are really interested in is whether efficacy findings from a clinical trial will work for our patient. As such, we have historically relied on the cutoff of p=0.05 to accept whether or not a finding is significant and true. Of course, this means that when there is truly no difference between treatments, approximately 1 in 20 comparisons will still produce a falsely positive finding. This "holy grail" cutoff has led to significant publication bias, which is well summarized by Dr. John Ioannidis in his seminal 2005 paper "Why Most Published Research Findings Are False." One potential solution is to report comparative efficacy "in plain English" as shown in the graphic below (link to original 2009 article).
Here is another way of considering P-values, only just a bit tongue-in-cheek from XKCD.
Examples
1. A treatment regimen with superior efficacy: BR for untreated follicular lymphoma
Study | Evidence | Comparator | Efficacy |
Rummel et al. 2013 (StiL NHL1) | Phase III | R-CHOP | Superior PFS |
2. A treatment regimen which failed to demonstrate a difference in primary endpoint: gemcitabine for metastatic pancreatic cancer
Study | Evidence | Comparator | Efficacy |
Hong et al. 2013 | Randomized Phase II | Gemcitabine & Simvastatin | Seems not superior |
3. A treatment regimen that is non-inferior to its comparator: capecitabine & bevacizumab for metastatic breast cancer
Study | Evidence | Comparator | Efficacy |
Zielinski et al. 2016 (TURANDOT) | Phase III | Paclitaxel & Bevacizumab | Non-inferior OS |
4. A treatment regimen with inferior efficacy: dexamethasone for relapsed/refractory multiple myeloma
Study | Evidence | Comparator | Efficacy |
Rajkumar et al. 2008 | Phase III | TD | Inferior TTP |
Frequently asked questions
Q: What is the current status of efficacy labeling on hemonc.org?
A: All phase III trials and most randomized phase II trials are now labeled for efficacy. Future work includes labeling non-randomized trials with overall response rates (ORR).
Q: How do we choose to label efficacy when multiple outcomes are reported?
A: Often, a trial will report on multiple outcomes, such as overall response rate, progression-free survival, and overall survival. We generally look to the PRIMARY endpoint, as defined in the published methods. If the PRIMARY endpoint is negative but the trial reports a positive secondary finding that is MORE "surrogate", we do NOT label the secondary finding (for example, the primary endpoint was OS, for which no difference was shown, but the investigational arm did have better PFS). If a secondary endpoint shows differential efficacy and is LESS "surrogate" than the primary endpoint (see below), we will label by that endpoint.
Q: How do you handle changes in efficacy when trial results are updated?
A: Many trials are published at the time of preliminary findings, and have subsequent publications as the results mature. At this time, most of these updates are added to the bibliography section of the regimen. If the strength of the efficacy assertion changes, or if a less surrogate endpoint becomes significant, we will update the efficacy label with a (*) to denote that it is an updated efficacy.
Q: How do you distinguish between a failed superiority trial and a successful non-inferiority or equivalence study?
A: Both of them would be labeled yellow, but the language used is slightly different, and the intensity is different. Here is an example of epirubicin followed by capecitabine in the adjuvant treatment of breast cancer, where one arm failed to demonstrate superiority, whereas the other arm demonstrated statistical non-inferiority.
Study | Evidence | Comparator | Efficacy |
Cameron et al. 2017 (TACT2) | Phase III | Accelerated Epirubicin, then Capecitabine | Seems not superior |
Cameron et al. 2017 (TACT2) | Phase III | Accelerated Epirubicin, then CMF; Epirubicin, then CMF | Non-inferior TTR |
Q: Do you have a hierarchy of surrogacy?
A: Yes, we use a three-level hierarchy to determine the strength of an outcome measure. Note that this hierarchy is NOT used to inform the coloration of the efficacy label, at this time.
Strong outcomes
- Overall survival (OS)
- All-cause mortality
- Disease-specific mortality
Intermediate outcomes
- Distant disease-free survival (DDFS)
- Disease-free interval (DFI)
- Disease-free survival (DFS)
- Durable response rate (DRR)
- Duration of response (DOR)
- Event-free survival (EFS): Events are sometimes defined differently, but usually include relapse, progression, and death from any cause.
- Freedom from first progression (FFFP)
- Failure-free survival (FFS): Defined as the absence of an additional systemic therapy, relapse, or non-relapse mortality.
- Freedom from treatment failure (FFTF)
- Invasive disease free survival (IDFS)
- Primary refractory disease: In leukemia, generally refers to the failure to achieve CR or CRi after two courses of intensive induction.
- Progression-free survival (PFS): The most commonly used surrogate time-based measure.
- Progression-free survival rate at 6 months (PFS6)
- Relapse-free interval (RFI): Not commonly used outside of the adjuvant setting.
- Relapse-free survival (RFS): Not commonly used outside of the adjuvant setting.
- Time to next treatment (TTNT)
- Time to treatment failure (TTTF)
Weak outcomes
- Response rate (RR) Definitions of the below may vary across cancer subtypes:
- Complete response rate (CR rate)
- Complete response with incomplete hematologic recovery rate (CRi rate)
- Complete response without minimal residual disease rate (CRMRD- rate)
- Minimal response rate (MR rate)
- Near complete response rate (nCR rate)
- Partial response rate (PR rate)
- Stable disease rate (SD rate)
- Stringent complete response rate (sCR rate)
- Unconfirmed complete response rate (CRu rate)
- Very good partial response rate (VGPR rate)
- Other miscellaneous response rates
- Major cytogenetic response (MCyR)
- Major hematologic response (MaHR)
- Minor hematologic response (MiHR)
- Morphologic leukemia-free state (MLFS)
- No evidence of leukemia (NEL)
- Overall hematologic response (OHR)
- Overall response rate (ORR): Definition may vary across cancer subtypes, but usually this is the sum of the CR + PR rates.
- Disease control rate (DCR): Usually this is the sum of the CR + PR + SD rates.
- Clinical feasibility rate: Defined as no grade 4 neutropenia/thrombocytopenia or thrombocytopenia with bleeding; no grade 3/4 febrile neutropenia or non-hematological toxicity; no premature withdrawal/death.
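One way to picture how the hierarchy interacts with the choice of endpoint to label is the sketch below. It is an interpretation, not the site's actual procedure: the strength mapping lists only a handful of the endpoints above, the function name is hypothetical, and it assumes that a significant, less surrogate secondary endpoint takes precedence whether or not the primary endpoint was met.

```python
from typing import Sequence

# Strength tiers from the hierarchy above (3 = strong, 2 = intermediate, 1 = weak).
# Only a few endpoints are included here; extend the mapping as needed.
OUTCOME_STRENGTH = {
    "OS": 3,
    "PFS": 2, "DFS": 2, "EFS": 2, "TTNT": 2,
    "ORR": 1, "DCR": 1, "CR rate": 1,
}


def endpoint_to_label(primary: str, significant_secondary: Sequence[str] = ()) -> str:
    """Pick the endpoint whose result should carry the efficacy label.

    primary               -- primary endpoint as defined in the published methods
    significant_secondary -- secondary endpoints that showed differential efficacy
    """
    primary_strength = OUTCOME_STRENGTH[primary]
    # Keep only secondary endpoints that are LESS surrogate (stronger) than the primary.
    stronger = [e for e in significant_secondary if OUTCOME_STRENGTH[e] > primary_strength]
    if stronger:
        return max(stronger, key=OUTCOME_STRENGTH.__getitem__)
    # More surrogate secondary findings are ignored; label by the primary endpoint.
    return primary
```

With a primary endpoint of OS and a positive PFS finding, the function returns "OS", so the positive PFS result is not labeled. An endpoint missing from the mapping raises a `KeyError`.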
Q: What about exceptional responders?
A: It is increasingly recognized, especially with newer therapies such as immunotherapy, that some patients may experience a remarkable response to a drug that otherwise appears to lack efficacy in the population. These patients are usually referred to as "exceptional responders" and may provide significant insights into rational treatment selection, a.k.a. precision medicine. At this time we do not make a particular effort to identify exceptional responders, nor do we consider a regimen for inclusion in HemOnc.org if the manuscript states that it generally lacks efficacy.
Q: Do you consider quality of life (QoL) measures in efficacy?
A: Very few RCTs report on QoL measures, and as such we do not currently include them in the consideration. This may change in the future.
Toxicity
Defined generally, toxicity is the presence or absence of a negative effect (harm) on the study population. This is often also referred to as safety. As with efficacy, we only report comparative toxicity.
Due to less granular and more subjective reporting, toxicity is currently reported using one of three labels:
Decreased toxicity | Equivalent toxicity | Increased toxicity |
Example
A treatment regimen with increased toxicity: R-CHOP for untreated follicular lymphoma
Study | Evidence | Comparator | Efficacy | Toxicity |
Hiddemann et al. 2005 | Phase III | CHOP | Seems to have superior OS | Increased toxicity |
Frequently asked questions
Q: What is the current status of toxicity labeling on hemonc.org?
A: A few regimens are currently labeled for toxicity; we are focusing current efforts on labeling efficacy.
Q: Are you basing the label on the reported CTCAE measures?
A: CTCAE measures are extremely valuable in that they are structured and thus reproducible. However, it is often hard to compare them directly. For example, if one regimen has grade 4 lab-based toxicity and the other has grade 2 gastrointestinal toxicity, which is the more toxic? In general, we plan to use the authors' interpretation of overall toxicity and tolerability when labeling - or better yet, prospectively-gathered quality-of-life data (see below).
Q: Do you plan to incorporate patient-reported outcomes?
A: As shown in numerous publications, patient reports of toxicity are more accurate than clinician assessments. However, they have not until recently been standardized. Now that the PRO-CTCAE is available, we expect to see more of these in the future and will incorporate them into the toxicity assessment.