Example 1 - Describing treatment sequences in ER+ HER2- secondary breast cancer patients diagnosed between 2013 and 2017: Analyst’s Guide

Introduction
Methods
Results and Visualisation

Introduction

The methods in this report have been developed as part of the LCDI project, for information on further methods, data sources and definitions please see the Cancer Wiki page: https://www.wiki.ed.ac.uk/display/CAN

This report shows the methods and subsequent results from deriving line of treatment in 277 ER+, HER2- Secondary breast cancer (SBC) patients diagnosed between 2013 and 2017.

SBC patients were identified through the Secondary Breast Cancer CNS database which is maintained by the cancer nurses. This dataset comes in two formats from the Western General and St. John’s hospitals, the data are manually inputted and substantial data cleaning is required before reading the data into any analytical software.

There may also be typos in Date of Birth, Date of Metastases/Date Seen and also the patient identifier which appears to be Case Reference Number (CRN). In order to get the CHI, the data has to be linked to other sources by CRN. In cases where the CRN has not matched, the record is checked against name and date of birth in SMR06 as there may be a typo in the CRN.

“Date seen” in the data from the Western General is considered to be the date of Metastases.

Various datasets used in this example have different completion dates e.g. PIS has a lag of 3 months but SMR06 has a lag of one year.

Completion data for SMR06 2017 incidence is 31st December 2018 in accordance with UKIACR guidelines.

Since the datasets used in this analysis are updated at different times, the cut-off point of the data is the end of 2017 to avoid incorrect line of treatment. E.g. classifying a patient’s 3rd line of treatment as Chemotherapy when in reality it is Endocrine but the PIS data has not been updated at time of calculation.

Therefore there may be a number patients who have “No Treatment” as their last line of treatment because there are no more data within the time period to continue the pathway.

In addition to the SBC CNS database here are four datasets used in this analysis:

South East Scotland Cancer Database (SESCD) - Access database, although this is to be moved to an SQL server
- Main Table
- Hormone Table
Chemocare
Prescribing Information System (PIS)
- e-Prescribed data
Fulvestrant Audit data
NRS Deaths data

Methods

Each dataset is filtered to the patient cohort and records which occurred after the date of metastatic diagnosis. The datasets are joined together, ordered by date of treatment and ranked such that the first occurrence of a drug is taken as a new line of treatment, any further treatments of the same drug are not included.

SESCD

The SESCD Main table contains information on the site of the tumour, the code for breast cancer is 1749. By filtering the SESCD Main table to show the cohort and selecting a site of 1749, the Primary number can be obtained.

The Primary Number shows the number of primary tumours that the patient has, by finding the primary number associated with the breast tumour the data is linked to the Hormone table - which only has CHI and primary number.

Please note that there are other tables in the SESCD that can be linked in this way.

By using the Patient Number (CHI) and Primary Number the hormone therapy administered to breast cancer tumours can be determined.

For the purposes of this example the hormone therapies of interest in the SESCD are:

Tamoxifen
Faslodex (Fulvestrant)
Arimidex (Anastrozole)
Letrozole
Exemestane
Herceptin

#reading in the SESCD datasets----

read_excel("Z:/Oncology Data/Cancer LCDI/SES Oncology - Access db copy/Hormone.xlsx",
           guess_max=150000 ) -> Onc_Hormone

read_excel("Z:/Oncology Data/Cancer LCDI/SES Oncology - Access db copy/MainData.xlsx",
           guess_max=150000 ) -> Onc_Main


left_join(Patient_cohort,Onc_Hormone, by=c("UPI_NUMBER"="PATIENT"))->Onc_Hormone

#filtering for the materials we're interested in 
Onc_Hormone %>% filter(MATERIAL %in% c(71,72,82,83,85,804)) %>% 
select(UPI_NUMBER,Met_inc_date,PRIMARY,MATERIAL,STDATE,ENDDATE)->Onc_Hormone


left_join(Patient_cohort,Onc_Main, by=c("UPI_NUMBER"="PATIENT"))->Onc_Main

#Selecting records for breast cancer (site 1749)

Onc_Main %>% filter(SITE==1749) %>% 
  select(UPI_NUMBER,Met_inc_date,PRIMARY,OESTROGEN_RECEPTOR_STATUS,HER2_STATUS)->Onc_Main

#Joining the main dataset to the hormone dataset in order to have hormone therapy records that are for breast cancer only

left_join(Onc_Main,Onc_Hormone) ->Onc_Hormone

#After looking at the data, there's no records that have a missing material entry but have a start date. 
Onc_Hormone$`STDATE` %>% as.character() %>%  as.Date("%Y-%m-%d") ->Onc_Hormone$`STDATE`
Onc_Hormone$`ENDDATE` %>% as.character() %>% as.Date("%Y-%m-%d") ->Onc_Hormone$`ENDDATE`

Onc_Hormone %>% filter(is.na(MATERIAL)==0) %>% select(UPI_NUMBER) %>% n_distinct() ->Onc_Hormone_pats_anydate

First the dataset is filtered to show the cohort. Irrespective of the therapy Start Date there are 223 ER+ve, HER2-ve patients with a record of hormone therapy in the SESCD; this is 80.5% of the cohort.

Of the patients who have a valid SESCD record there are 0 (0%) patients who have a missing Start Date of hormone therapy. However there are 197 (88.3%) patients who do not have a valid End Date for treatment.

Therefore Start Date will be used to identify treatment that occurs after the date of metastases and will be used to determine the order of line of treatment.

#Selecting therapies that only occur after the date of metastasis 


Onc_Hormone %>% filter(STDATE>=Met_inc_date) ->Onc_Hormone

Onc_Hormone %>% select(UPI_NUMBER) %>% n_distinct()->Onc_Hormone_pats_afterdiag

Onc_Hormone$MATERIAL %>% as.character()->Onc_Hormone$MATERIAL

Onc_Hormone$MATERIAL %>% as.factor() %>%
  fct_recode("Tamoxifen" = "71",
             "Fulvestrant" = "72",
             "Anastrozole" = "82",
             "Letrozole" = "83",
             "Exemestane" = "85") %>%
  as.character() -> Onc_Hormone$MATERIAL

#if the Material is NA it means that there was no record after met inc date

There are 77 patients (27.8%) who have a record of endocrine therapy occurring on or after the date of metastasis.

Onc_Hormone %>% group_by(MATERIAL) %>% 
  summarise(`Number of patients` = n_distinct(UPI_NUMBER)) %>% 
  ungroup() %>% 
  mutate(Total_pats=sum(`Number of patients`),
         `% of records*`=round2((`Number of patients`/Total_pats*100),1)) %>% 
  select(-Total_pats) %>% 
  kable() %>%   kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive","bordered"),full_width=F, position="left") %>% 
  footnote(general="*The denominator for this calculation is the total number of records since patients can have more than one treatment.")

MATERIAL	Number of patients	% of records*
Anastrozole	2	2.2
Exemestane	20	21.5
Fulvestrant	1	1.1
Letrozole	50	53.8
Tamoxifen	20	21.5
Note:
*The denominator for this calculation is the total number of records since patients can have more than one treatment.

Chemocare

#Adding in chemocare----
read_excel("Z:/Oncology Data/Cancer LCDI/Chemocare/MedicineUtilisationReport_20-12-2018_09-36-24.xlsx")->Chemocare

left_join(Patient_cohort,Chemocare,by=c("UPI_NUMBER"="CHI"))->Chemocare

#renaming variables to get rid of spaces

Chemocare %>% rename("Drug_Name"="Drug Name",
                     "Drug_Type"="Drug Type",
                     "Drug_Status"="Drug Status",
                     "Appointment_Date"="Appointment Date")-> Chemocare

#formatting to date type

Chemocare$Appointment_Date %>% as.Date("%d-%m-%Y") ->Chemocare$Appointment_Date

#Filtering for endocrine and cytotoxic drugs
#Also filtering for chemocare records occuring after 2012 since cohort starts from metastatic diagnosis dates 2013 onwards.
#Chemocare also started recording data in the latter half of 2012 and so there will be incomplete data for that year.

Chemocare %>% filter(Drug_Status%in% c("Given","Authorised"),
                     (Drug_Type=="Cytotoxic"|Drug_Name %in% c("LETROZOLE","EVEROLIMUS","PALBOCICLIB",
                                                                  "RIBOCICLIB (TRIAL B178)")),
                      year(Appointment_Date)>=2013) -> Chemocare

Chemocare %>% filter(is.na(Regime)==0 & is.na(Drug_Name)==0 & is.na(Drug_Type)==0) %>% select(UPI_NUMBER) %>% n_distinct()-> Chemocare_pats_anydate

The Chemocare data used in this analysis is from a Medicines Utilisation report where the tumour type is Breast and covers the time period 2013 to 2017. Please note that Chemocare data started recording in the latter half of 2012.

The Medicines Utilisation report is filtered to show a Drug Type of “Cytotoxic” and Drug Names: Letrozole, Everolimus, Palbociclib, Ribociclib (Trial B178).

Some drugs are given as part of a regime and so should not count as separate lines of treatment. For this project chemotherapy regimes were used rather than individual drug names.

There are 137 (49.5%) Secondary breast cancer patients who have a record for hormone therapy or chemotherapy in the Chemocare extract regardless of when the treatment was administered.

Appointment Date is the variable that will be used to include treatments administered after the date of metastatic diagnosis. There are 0 (0%) patients who have a missing Appointment date but have either a valid Regime, Drug Name or Drug Type.

Chemocare %>% filter(Appointment_Date >= Met_inc_date) ->Chemocare
n_distinct(Chemocare$UPI_NUMBER)-> Chemocare_pats_afterdiag

After removing records occurring before the date of metastasis there are now 127 (45.8%) patients who have a record for endocrine therapy or chemotherapy.

The variable Drug Status is used to determine whether the patient had treatment or not. Treatment can only proceed once the Drug Status has been changed to “Authorised”, the status can then be changed to “Given” but this is only if the system is manually updated. Therefore it is assumed that patients have received treatment if the Drug Status is either “Authorised” or “Given”.

“Given” can be “True” and “Authorised” if the dose is modified from the planned dose on the day. In this case it is not possible for “Drug Status” to change to “Given” as full dose was not given. This happens rarely though.

Take care if using the variable “Given” as it can be misleading. It is only “True” for a Drug Status of “Given” but not “Authorised” as demonstrated by the table below.

#Filtering for records for cytotoxic(chemotherapy) and hormone therapies that are given or authorised
#there can be cases where the drug is given but this is not reflected in the database as users forget to update

Chemocare %>% filter(Drug_Type=="Cytotoxic"|Drug_Name %in% c("LETROZOLE","EVEROLIMUS","PALBOCICLIB","RIBOCICLIB (TRIAL B178)")) %>% 
  group_by(Drug_Status,Given) %>% 
  summarise(`Number of patients` = n_distinct(UPI_NUMBER)) %>% 
  ungroup() %>% 
  mutate(Total_pats=sum(`Number of patients`),
         `% of records*`= round2((`Number of patients`/Total_pats*100),1)) %>% 
  rename(`Drug status` = Drug_Status) %>% 
  select(-Total_pats) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive","bordered"),full_width=F, position="left") %>% 
  footnote(general="*The denominator for this calculation is the total number of records since patients can have more than one treatment.")

Drug status	Given	Number of patients	% of records*
Authorised	False	85	40.1
Given	True	127	59.9
Note:
*The denominator for this calculation is the total number of records since patients can have more than one treatment.

Regimes

In order to simplify the analysis the following regimes were reclassified:

#recode regimes
#cdk4/6 inhibitors - "PALBOCICLIB / AI" is Palbociclib
# and ribociclib is part of the T COMPLEEMENT-1 regime 

Chemocare$Regime %>% as.factor() %>% 
  fct_recode("Trial/other"="ARISTACAT STUDY",
             "Capecitabine"="CAPECITABINE",
             "Capecitabine"="CAPECITABINE LD",
             "Carboplatin"="CARBO AUC 6",
             "Carboplatin + Taxane" ="CARBO&PACLI WKLY",
             "CMF"="CMF ADJ",
             "Taxane" = "DOCETAXEL BREAST",
             "Taxane" = "DOCETAXEL&CYCLO",
             "Anthracycline"="EPI/CYCLO ADJ",
             "Anthracycline"="EPI/CYCLO MET",
             "Anthracycline"="EPIRUBICIN WKLY",
             "Eribulin"="ERIBULIN",
             "Everolimus + Exemestane" = "Evero/Exemes",
             "Taxane" ="PACLITAX WKLY",
             "Taxane" ="PACLITAXEL R682",
             "Vinorelbine" ="VINORELBINE IV 1",
             "Vinorelbine" ="VINORELBINE IV 2",
             "Vinorelbine" ="VINORELBINE PO 1",
             "Vinorelbine" ="VINORELBINE PO 2",
             "Trial/other"="ARISTACAT STUDY",
             "Taxane"="FEC-D (D)",
             "Anthracycline"="FEC-D (FEC)",
             "Taxane"="FEC-D NEO (D)",
             "Anthracycline"="FEC-D NEO (FEC)",
             "Anthracycline"="FEC 100",
             "Anthracycline"="FEC 80",
             "CDK4/6i+AI"="PALBOCICLIB / AI",
             "CDK4/6i+AI"="T COMPLEEMENT-1",
             "Trial/other"="T PAKT STUDY 552",
             "CDK4/6i+AI"="T PALOMA 2") %>% 
  as.character()->Chemocare$Regime_grouped

Chemocare %>% group_by(Regime,Regime_grouped) %>% 
  summarise(`Number of patients` = n_distinct(UPI_NUMBER)) %>% 
  ungroup() %>% 
  mutate(Total_pats=sum(`Number of patients`),
         `% of records*`= round2((`Number of patients`/Total_pats*100),1)) %>% 
  select(-Total_pats,-`Number of patients`,-`% of records*`) %>% 
  rename(`Regime grouped`=Regime_grouped) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width=F, position="left")

Regime	Regime grouped
ARISTACAT STUDY	Trial/other
CAPECITABINE	Capecitabine
CAPECITABINE LD	Capecitabine
CARBO AUC 6	Carboplatin
CARBO&PACLI WKLY	Carboplatin + Taxane
DOCETAXEL BREAST	Taxane
DOCETAXEL&CYCLO	Taxane
EPI/CYCLO ADJ	Anthracycline
EPI/CYCLO MET	Anthracycline
EPIRUBICIN WKLY	Anthracycline
ERIBULIN	Eribulin
Evero/Exemes	Everolimus + Exemestane
FEC-D NEO (D)	Taxane
FEC-D NEO (FEC)	Anthracycline
FEC 100	Anthracycline
FEC 80	Anthracycline
PACLITAX WKLY	Taxane
PACLITAXEL R682	Taxane
PALBOCICLIB / AI	CDK4/6i+AI
T COMPLEEMENT-1	CDK4/6i+AI
T PAKT STUDY 552	Trial/other
T PALOMA 2	CDK4/6i+AI
VINORELBINE IV 1	Vinorelbine
VINORELBINE IV 2	Vinorelbine
VINORELBINE PO 1	Vinorelbine
VINORELBINE PO 2	Vinorelbine

Chemocare %>% group_by(Regime_grouped) %>% 
  summarise(`Number of patients` = n_distinct(UPI_NUMBER)) %>% 
  ungroup() %>% 
  mutate(Total_pats=sum(`Number of patients`),
         `% of records*`= round2((`Number of patients`/Total_pats*100),1)) %>% 
  select(-Total_pats) %>% 
  rename(`Regime grouped`=Regime_grouped) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive","bordered"),full_width=F, position="left") %>% 
  footnote(general="*The denominator for this calculation is the total number of records since patients can have more than one treatment.")

Regime grouped	Number of patients	% of records*
Anthracycline	47	20.1
Capecitabine	66	28.2
Carboplatin	1	0.4
Carboplatin + Taxane	1	0.4
CDK4/6i+AI	13	5.6
Eribulin	10	4.3
Everolimus + Exemestane	9	3.8
Taxane	70	29.9
Trial/other	3	1.3
Vinorelbine	14	6.0
Note:
*The denominator for this calculation is the total number of records since patients can have more than one treatment.

#Cannot find anastrazole,fulvestrant,faslodex,exemestane or tamoxifen 

Chemocare %>% select(UPI_NUMBER,Met_inc_date,OESTROGEN_RECEPTOR_STATUS,
                     HER2_STATUS,Regime_grouped,Drug_Type,Appointment_Date) ->Chemocare


#Taking the 1st date of each Regime as the start date of each treatment

Chemocare %>% group_by(UPI_NUMBER,Regime_grouped) %>% 
  summarise(Met_inc_date=min(Met_inc_date),
            OESTROGEN_RECEPTOR_STATUS=max(OESTROGEN_RECEPTOR_STATUS),
            HER2_STATUS=max(HER2_STATUS),
            Appointment_Date=min(Appointment_Date))->Chemocare


colnames(Chemocare)[2] <- "MATERIAL"
colnames(Chemocare)[6] <- "STDATE"

Chemocare$STDATE %>% as.Date("%d-%m-%Y") ->Chemocare$STDATE

Chemocare$Source <- "Chemocare"

#Combining SESCD and Chemocare data 

bind_rows(Onc_Hormone,Chemocare)  %>% 
  select(UPI_NUMBER,Met_inc_date,MATERIAL,STDATE,ENDDATE,Source)->Onc_Chemo

Onc_Chemo %>% arrange(UPI_NUMBER,STDATE) ->Onc_Chemo

Prescribing information System (PIS)

## reading in PIS hormone therapy data----

read_excel("LCDI_hormone_therapies_15Jan2019_V2.xlsx", sheet="Report 2") -> PIS_hormone
PIS_hormone$Source <- "PIS"

PIS_hormone %>% rename(Presc_HB_name = `Presc Health Board Name`,
                       Calendar_year = `APT Calendar Year`,
                       APT_Date = `APT Date`,
                       Drug_Name = `ePR DMD Drug Name`,
                       Quantity = `ePR Mapped Quantity`,
                       Number_of_items = `ePR Number of e-Prescribed Items`) -> PIS_hormone


left_join(Patient_cohort,PIS_hormone, by=c("UPI_NUMBER"="Pat UPI [C]")) %>% 
  select(UPI_NUMBER,Met_inc_date,`APT_Date`,Calendar_year,`Drug_Name`,Source)->PIS_hormone

PIS_hormone %>% filter(is.na(Drug_Name)==0) %>% select(UPI_NUMBER) %>% n_distinct()->PIS_pats_anydate

The data extract from PIS is an e-Prescribed report covering the time period 2011 to 2017. The e-Prescribed report contains information on prescriptions issued from GP Practices only, an e-message is generated when the prescription is issued; there is no information as to whether the prescription made it to the pharmacy.

The reason why Paid data is not used from PIS is due to the Paid date defaulting to the end of the month as this is when the pharmacy submits their prescriptions for payment. Unfortunately there are cases where the patient has passed away before the Paid Date therefore in terms of treatment it appears that a patient has received medication after passing away. There may also be delays in the pharmacy submitting the prescription for payment.

When comparing the Paid data to the e-Prescribed data, there is one less patient missing in e-Prescribed compared to the Paid data.

There are 254 (91.7%) patients who have records in the PIS. Since the data is electronically generated there are no records which have a valid drug name but missing prescribed date.

#102 patients missing from paid data
#101 patients missing from eprescribed 

#Changing column names in order to bind data to the SESCD and Chemocare data 

colnames(PIS_hormone)[1] <- "UPI_NUMBER"
colnames(PIS_hormone)[3] <- "STDATE"
colnames(PIS_hormone)[5] <- "MATERIAL"


PIS_hormone$STDATE %>% as.character()%>%  as.Date("%Y-%m-%d") -> PIS_hormone$STDATE

#Filtering to select prescriptions issued after the date of metastatic diagnosis 

PIS_hormone %>% filter(STDATE>=Met_inc_date)->PIS_hormone

n_distinct(PIS_hormone$UPI_NUMBER) -> PIS_pats_afterdiag

#Joining the Chemocare and SESCD dataset to the PIS Data. 

bind_rows(Onc_Chemo,PIS_hormone)->Onc_Chemo_PIS

Onc_Chemo_PIS %>% arrange(UPI_NUMBER,STDATE) ->Onc_Chemo_PIS

Onc_Chemo_PIS %>% select( -ENDDATE) ->Onc_Chemo_PIS

After filtering the data to include only records after date of metastatic diagnosis there are 215 (77.6%) patients who have a record in PIS.

Fulvestrant Audit Data

#Add in fulvestrant data ---- 
Fulvestrant<-read_excel("Z:/Oncology Data/Cancer LCDI/Fulvestrant Audit/Fulvestrant Only MM.xlsx")

Fulvestrant$`STDATE` %>% as.Date() ->Fulvestrant$`STDATE`



#Joining the patient cohort and removing patients who don't have a record for Fulvestrant

left_join(Patient_cohort,Fulvestrant) %>% 
  select(UPI_NUMBER,Met_inc_date,MATERIAL,STDATE) %>% 
  filter(is.na(MATERIAL)!=1)->Fulvestrant

Fulvestrant$Source<-"Fulvestrant Audit"

n_distinct(Fulvestrant$UPI_NUMBER)->Fulvestrant_pats_anydate

#Selecting Fulvestrant records that occur after the date of diagnosis.

Fulvestrant %>% filter(STDATE>=Met_inc_date)->Fulvestrant 

n_distinct(Fulvestrant$UPI_NUMBER) -> Fulvestrant_pats_afterdiag

Fulvestrant data has not been collected in Chemocare until March 2019 and so any data Fulvestrant before then has been collected on a manually inputted spreadsheet owned by the Edinburgh Cancer Centre.

Please note that the data is in wide format, there are spelling errors and inconsistencies in the format of entries in each variable.

This dataset contains:

Patient CHI number
Age
Site of metastases at diagnosis
Adjuvant endocrine treatment
Neo-adjuvant endocrine treatment
Pre endocrine treatment (metastatic setting)
1st - 10th line endocrine (metastatic)
Start date of 1st - 10th line of endocrine treatment

For the purposes of this analysis the dataset has been filtered for records containing Fulvestrant only and the start date of treatment. The column names in the dataset have been changed to match the columns in the combined dataset in order to join the datasets together.

CHI’s need a leading zero attached to records containing only 9 digits.

The start dates are either in Month/Year format or Year format. For the purposes of this analysis the start date of the treatment is assumed to be the end of the month.

There are 19 (6.9%) patients who have a record in the Fulvestrant audit dataset. After filtering for records that occur after the date of diagnosis there are 19 (6.9%) patients.

#adding fulvestrant to the rest of the onc database

bind_rows(Onc_Chemo_PIS,Fulvestrant) ->Onc_Chemo_PIS
Onc_Chemo_PIS %>% arrange(UPI_NUMBER,STDATE) ->Onc_Chemo_PIS



Onc_Chemo_PIS$MATERIAL %>% as.factor() %>% 
  fct_recode("Exemestane"="EXEMESTANE 25MG TABLETS",
             "Fulvestrant"="FULVESTRANT 250MG/5ML SOLUTION FOR INJECTION PRE-FILLED SYRINGES",
             "Letrozole/Anastrozole"="Anastrozole",
             "Letrozole/Anastrozole"="Letrozole",
             "Letrozole/Anastrozole"="ANASTROZOLE 1MG TABLETS",
             "Letrozole/Anastrozole"="LETROZOLE 2.5MG TABLETS",
             "Tamoxifen"="TAMOXIFEN 10MG TABLETS",
             "Tamoxifen"="TAMOXIFEN 20MG TABLETS")%>% 
  as.character()->Onc_Chemo_PIS$MATERIAL

#Exemestane and Everolimus 
#Always get Exemestane with Everolimus but can get Exemestane on it's own.


#Creating a new simplified treatment category 

Onc_Chemo_PIS$MATERIAL %>% as.factor() %>% 
  fct_recode("Chemotherapy"="Anthracycline",
             "Chemotherapy"="Capecitabine",
             "Chemotherapy"="Carboplatin",
             "Chemotherapy"="Carboplatin + Taxane",
             "Chemotherapy"="Eribulin",
             "Chemotherapy"="Taxane",
             "Chemotherapy"="Vinorelbine",
             "Endocrine"="Everolimus + Exemestane",
             "Endocrine" ="Exemestane",
             "Endocrine"="Fulvestrant",
             "Endocrine"="Letrozole/Anastrozole",
             "Endocrine"="Tamoxifen",
             "Endocrine"="CDK4/6i+AI") ->Onc_Chemo_PIS$Treatment_category


n_distinct(Onc_Chemo_PIS$UPI_NUMBER) -> Onc_Chemo_PIS_pats_afterdiag

Dataset	Patients with records found in Dataset	Patients records found in Dataset after date of diagnosis
SESCD	223 (80.5%)	77 (27.8%)
Chemocare	137 (49.5%)	127 (45.8%)
PIS	254 (91.7%)	215 (77.6%)
Fulvestrant	19 (6.9%)	19 (6.9%)

Combined dataset

Now that the datasets have been combined there are 252 (91%) patients who have had treatment after metastatic diagnosis. Treatment has been classified into two different categories, please see the table below for a breakdown.

Onc_Chemo_PIS %>% group_by(MATERIAL,Treatment_category) %>% 
  summarise(`Number of patients`=n_distinct(UPI_NUMBER)) %>% 
  ungroup() %>% 
  arrange(Treatment_category,MATERIAL) %>% 
  rename("Treatment category 1" = "MATERIAL",
         "Treatment category 2" = "Treatment_category") %>% 
  mutate(Total_records=sum(`Number of patients`)) %>% 
  mutate(`% of records` = round(`Number of patients`/Total_records*100,1)) %>% 
  select(-Total_records) %>%  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive","bordered"), full_width=F, position="left")

Treatment category 1	Treatment category 2	Number of patients	% of records
Anthracycline	Chemotherapy	47	8.1
Capecitabine	Chemotherapy	66	11.3
Carboplatin	Chemotherapy	1	0.2
Carboplatin + Taxane	Chemotherapy	1	0.2
Eribulin	Chemotherapy	10	1.7
Taxane	Chemotherapy	70	12.0
Vinorelbine	Chemotherapy	14	2.4
CDK4/6i+AI	Endocrine	13	2.2
Everolimus + Exemestane	Endocrine	9	1.5
Exemestane	Endocrine	96	16.5
Fulvestrant	Endocrine	20	3.4
Letrozole/Anastrozole	Endocrine	161	27.6
Tamoxifen	Endocrine	72	12.3
Trial/other	Trial/other	3	0.5

Identifying Line of Treatment

National Records Scotland (NRS) Deaths Data

####Deaths ----

NRS_headers<-read_excel("Z:/Oncology Data/Cancer LCDI/SMR/NRS deaths col names pos.xlsx")
PosFrom <- NRS_headers$Position_From
PosTo <- NRS_headers$Position_To
NRS_header<-NRS_headers$Field_name


#REMEMBER TO SPECIFY GUESS_MAX to avoid warnings!.

NRS_Deaths <- read_fwf("N:/Upload/Linked Data/Raw Data/S_mrl_deaths.dat", fwf_positions(PosFrom,PosTo,
                                                                                        col_names = NRS_header),guess_max = 200000)

#Select deaths up to end of 2017 to keep time periods consistent with the rest of the datasets 
as.Date(as.character(NRS_Deaths$DATE_OF_DEATH), format="%Y%m%d") ->NRS_Deaths$DATE_OF_DEATH

NRS_Deaths %>% filter(DATE_OF_DEATH<="2017-12-31") ->NRS_Deaths

#Selecting Patients in the cohort 

left_join(Patient_cohort,NRS_Deaths)->Patient_deaths


#Selecting patients who have a valid date of death 

Patient_deaths %>% filter(is.na(DATE_OF_DEATH)==0)-> Patient_deaths

#Creating an empty data frame with the same variable names as the joint dataset 
#Populating empty data frame with the patients who have a valid date of death
#adding the record for patient death to the joint dataset

Pat_Deaths = data.frame(matrix(vector(), 138, 8,
                                     dimnames=list(c(), c("UPI_NUMBER", "Met_inc_date", "MATERIAL","STDATE","ENDDATE","Source","timediff","Treatment_category"))),
                              stringsAsFactors=F)

Pat_Deaths$UPI_NUMBER<-Patient_deaths$UPI_NUMBER
Pat_Deaths$Met_inc_date<-Patient_deaths$Met_inc_date
Pat_Deaths$STDATE<-Patient_deaths$DATE_OF_DEATH
Pat_Deaths$Treatment_category<-"Death"
Pat_Deaths$Source<-"NRS"

bind_rows(Onc_Chemo_PIS,Pat_Deaths)->Onc_Chemo_PIS

rm(Patient_deaths)

#There will be patients who don't get a 1st treatment at all 
#The criteria used to identify these patients is those who don't have a record for any treatment and who also do not have a record for death

#1st find out who didn't get a 1st treatment at all and also does not have a record for death

ifelse(Patient_cohort$UPI_NUMBER %in% Onc_Chemo_PIS$UPI_NUMBER,1,0) -> Patient_cohort$flag

Patient_cohort %>% filter(flag==0) ->Patient_Notreat_1st

Pat_Notreat_1st = data.frame(matrix(vector(), 10, 8,
                               dimnames=list(c(), c("UPI_NUMBER", "Met_inc_date", "MATERIAL","STDATE","ENDDATE","Source","timediff","Treatment_category"))),
                        stringsAsFactors=F)

Pat_Notreat_1st$UPI_NUMBER<-Patient_Notreat_1st$UPI_NUMBER
Pat_Notreat_1st$Met_inc_date<-Patient_Notreat_1st$Met_inc_date
#Setting the date for no treatment record as an arbitrary date outside the time period 
Pat_Notreat_1st$STDATE<-"2000-01-01"
Pat_Notreat_1st$STDATE %>% as.Date() ->Pat_Notreat_1st$STDATE
Pat_Notreat_1st$Treatment_category<-"No treatment"

bind_rows(Onc_Chemo_PIS,Pat_Notreat_1st) ->Onc_Chemo_PIS

In order to clarify the treatment pathway, patient death records are included in order to distinguish between patients who have no further record of treatment and patients who have passed away.

Death records are from an NRS data extract up to the end of 2017. There are 138 (49.8%) patients with a death record.

Ranking each treatment

The first occurrence for each type of treatment is taken as first date the treatment occurred; e.g. if a Patient received Letrozole, Exemestane and then Letrozole the first occurence of Letrozole is taken as the first line of treatment, any subsequent records are not counted in the treatment pathway.

CDK 4/6 inhibitors are given concurrently with another AI treatment, however CDK4/6 inhibitors show as one row and are therefore counted as separate line of treatment to the concurrent AI.

To get around this, the patients who receive CDK 4/6 inhibitors are removed from the dataset, the AI treatment after the CKD4/6 record is removed and the CDK4/6 record is changed to “CDK4/6 + AI”. These patients are then re-introduced back into the main dataset.

#Ranking each line of treatment ----

Onc_Chemo_PIS %>% arrange(UPI_NUMBER,STDATE)->Onc_Chemo_PIS


#This ranks each start date for each patient - if a patient 1st has 20 records for tamoxifen the records will have 
#a rank of 1-20 based on start date

Onc_Chemo_PIS %>% group_by(UPI_NUMBER) %>% 
  mutate(rank=rank(STDATE,ties.method = "first")) %>% 
  arrange(UPI_NUMBER,STDATE) %>%
  ungroup() ->Onc_Chemo_PIS_ranked



#This then takes the minimum rank for each treatment for each patient - so each type of treatment should have 1 unique number 
#What we want is the 1st occurence of the treatment, e.g. if a patient received letrozole,Exemestane then letrozole we only
#want the start date for the 1st occurence of letrozole

Onc_Chemo_PIS_ranked %>% group_by(UPI_NUMBER,MATERIAL) %>% 
  mutate(rank=min(rank)) %>% 
  arrange(UPI_NUMBER,rank) %>% ungroup() -> Onc_Chemo_PIS_ranked

#this groups by each patient,material and by the rank they appear in 
#it aggregates showing the minimum start date in each material for each rank 
#since theres now only one rank per material the minimum start date will be for each material
#then it creates a new rank based on this 1st start date 
#now each rank is in order, 1,2,3 etc.


Onc_Chemo_PIS_ranked %>% 
  group_by(UPI_NUMBER,Treatment_category,MATERIAL,rank) %>% 
  summarise(first_STDATE = min(STDATE),Met_inc_date=min(Met_inc_date),Source=min(Source)) %>%
  arrange(UPI_NUMBER,first_STDATE,rank) %>% 
  group_by(UPI_NUMBER) %>% 
  mutate(rank=rank(first_STDATE,ties.method = "first")) %>% 
  select(UPI_NUMBER,Met_inc_date,MATERIAL,Treatment_category,Source,first_STDATE,rank) ->Onc_Chemo_PIS_ranked

#Checking for patients who have treatments after the date of death
#This might be due to the rank ties.method="1st"


Onc_Chemo_PIS_ranked %>%
  filter(Treatment_category=="Death") %>%
  select(UPI_NUMBER) %>% left_join(Onc_Chemo_PIS_ranked) %>% 
  mutate(after_death=ifelse(UPI_NUMBER== shift(UPI_NUMBER,1L,type="lag")& shift(Treatment_category,1L,type="lag")=="Death",1,0))->check_after_death


#no patients have a record after death 

#Creating a column with the total amount of patients 
n_distinct(Patient_cohort$UPI_NUMBER)->Onc_Chemo_PIS_ranked$total_pats


#creating a rank for patients who have not died and and therefore has a "no treatment" category
Onc_Chemo_PIS_ranked %>% group_by(UPI_NUMBER) %>% 
  filter(!(UPI_NUMBER %in% Pat_Deaths$UPI_NUMBER)&Treatment_category!="No treatment") %>% 
  summarise(No_treat_rank=max(rank)+1) ->Pat_Notreat

#Creating a separate dataframe for the "no treatment" records

Pat_Notreat_rank = data.frame(matrix(vector(), 129, 8,
                                     dimnames=list(c(), c("UPI_NUMBER", "Met_inc_date", "MATERIAL","Treatment_category","Source","first_STDATE","rank","total_pats"))),
                              stringsAsFactors=F)


Pat_Notreat_rank$UPI_NUMBER<-Pat_Notreat$UPI_NUMBER
Pat_Notreat_rank$rank<-Pat_Notreat$No_treat_rank
Pat_Notreat_rank$Treatment_category<-"No treatment"

#Binding the no treatment records to the main dataset and arranging by UPI number and rank

bind_rows(Onc_Chemo_PIS_ranked,Pat_Notreat_rank) ->Onc_Chemo_PIS_ranked
Onc_Chemo_PIS_ranked %>% arrange(UPI_NUMBER,rank) ->Onc_Chemo_PIS_ranked

n_distinct(Patient_cohort$UPI_NUMBER)->Onc_Chemo_PIS_ranked$total_pats

# CDK4/6 patients ----
# CDK4/6 patients will get CDK4/6 inhibitors with another treatment so need to remove these patients and change the combinations
# After looking at the data it appears that each record of a CDK4/6 inhibitor belongs to a regime which administers it with letrozole
# Change all CDK4/6 records need to be CDK4/6 + Letrozole/Anstrozole and then remove the letrozole/anstrozole record


Onc_Chemo_PIS_ranked %>% filter(MATERIAL=="CDK4/6i+AI") %>%
  select(UPI_NUMBER) %>%
  left_join(Onc_Chemo_PIS_ranked) ->CDK_ranked

#remove the CDK patients from the main cohort and add them back in again afterwards
Onc_Chemo_PIS_ranked%>% filter(!(UPI_NUMBER %in% CDK_ranked$UPI_NUMBER)) ->Onc_Chemo_PIS_ranked

#Now back to CDK patients - create a new material column that is empty
CDK_ranked$MATERIAL2<-NA

#For the records that are CDK4/6 - change the material category to Letrozole + CDK4/6
ifelse(CDK_ranked$MATERIAL=="CDK4/6i+AI",
       "Letrozole + CDK4/6", CDK_ranked$MATERIAL) ->CDK_ranked$MATERIAL2

#remove records for Letrozole/Anastrozole as these will be the same line of treatment as the CDK 4/6

CDK_ranked %>% filter(MATERIAL!="Letrozole/Anastrozole"|is.na(MATERIAL)==1) ->CDK_ranked


#The start date will be the date that CDK4/6 was administered in Chemocare - this will be more accurate than the date in PIS
#Overwrite the orginal Material with Material 2
CDK_ranked$MATERIAL<-CDK_ranked$MATERIAL2


CDK_ranked %>% group_by(UPI_NUMBER) %>%
  mutate(rank=rank(first_STDATE,ties.method = "first")) %>%
  arrange(UPI_NUMBER,first_STDATE) ->CDK_ranked

#Remove Material2
CDK_ranked %>% select(-MATERIAL2) ->CDK_ranked

#Now add CDK_ranked back to the main dataset

bind_rows(Onc_Chemo_PIS_ranked,CDK_ranked)->Onc_Chemo_PIS_ranked

Onc_Chemo_PIS_ranked %>% arrange(UPI_NUMBER,rank)->Onc_Chemo_PIS_ranked


Onc_Chemo_PIS_ranked %>% mutate(Treatment_category2=MATERIAL) ->Onc_Chemo_PIS_ranked
ifelse(Onc_Chemo_PIS_ranked$Treatment_category=="Death","Death",
       ifelse(Onc_Chemo_PIS_ranked$Treatment_category=="No treatment","No treatment",
              Onc_Chemo_PIS_ranked$Treatment_category2))->Onc_Chemo_PIS_ranked$Treatment_category2


#Creating treatment summary table ----

Onc_Chemo_PIS_ranked%>% 
  group_by(Treatment_category,rank) %>% 
  mutate(`1st_line_treatment`=ifelse(rank ==1,n_distinct(UPI_NUMBER),0)) %>% 
  mutate(`1st_line_treatment_%`=ifelse(rank ==1,round2(`1st_line_treatment`/total_pats*100,1),0)) %>% 
  
  mutate(`2nd_line_treatment`=ifelse(rank ==2,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`2nd_line_treatment_%`=ifelse(rank ==2,round2(`2nd_line_treatment`/total_pats*100,1),0)) %>%
  
  mutate(`3rd_line_treatment`=ifelse(rank ==3,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`3rd_line_treatment_%`=ifelse(rank ==3,round2(`3rd_line_treatment`/total_pats*100,1),0)) %>% 
  
  mutate(`4th_line_treatment`=ifelse(rank ==4,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`4th_line_treatment_%`=ifelse(rank ==4,round2(`4th_line_treatment`/total_pats*100,1),0)) %>% 
  
  mutate(`5th_line_treatment`=ifelse(rank ==5,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`5th_line_treatment_%`=ifelse(rank ==5,round2(`5th_line_treatment`/total_pats*100,1),0)) %>%
  
  mutate(`6th_line_treatment`=ifelse(rank ==6,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`6th_line_treatment_%`=ifelse(rank ==6,round2(`6th_line_treatment`/total_pats*100,1),0)) %>%
  
  mutate(`7th_line_treatment`=ifelse(rank ==7,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`7th_line_treatment_%`=ifelse(rank ==7,round2(`7th_line_treatment`/total_pats*100,1),0)) %>%
  
  mutate(`8th_line_treatment`=ifelse(rank ==8,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`8th_line_treatment_%`=ifelse(rank ==8,round2(`8th_line_treatment`/total_pats*100,1),0)) %>% 
  
  mutate(`9th_line_treatment`=ifelse(rank ==9,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`9th_line_treatment_%`=ifelse(rank ==9,round2(`9th_line_treatment`/total_pats *100,1),0)) %>% 

  group_by(Treatment_category) %>% 
  summarise(`1st line`=max(`1st_line_treatment`,na.rm=TRUE),
            `1st line %`=max(`1st_line_treatment_%`,na.rm=TRUE),
            `2nd line`=max(`2nd_line_treatment`,na.rm=TRUE),
            `2nd line %`=max(`2nd_line_treatment_%`,na.rm=TRUE),
            `3rd line`=max(`3rd_line_treatment`,na.rm=TRUE),
            `3rd line %`=max(`3rd_line_treatment_%`,na.rm=TRUE),
            `4th line` =max(`4th_line_treatment`,na.rm=TRUE),
            `4th line %`=max(`4th_line_treatment_%`,na.rm=TRUE),
            `5th line`=max(`5th_line_treatment`,na.rm=TRUE),
            `5th line %`=max(`5th_line_treatment_%`,na.rm=TRUE),
            `6th line`=max(`6th_line_treatment`,na.rm=TRUE),
            `6th line %`=max(`6th_line_treatment_%`,na.rm=TRUE),
            `7th line`=max(`7th_line_treatment`,na.rm=TRUE),
            `7th line %`=max(`7th_line_treatment_%`,na.rm=TRUE),
            `8th line`=max(`8th_line_treatment`,na.rm=TRUE),
            `8th line %`=max(`8th_line_treatment_%`,na.rm=TRUE),
            `9th line`=max(`9th_line_treatment`,na.rm=TRUE),
            `9th line %`=max(`9th_line_treatment_%`,na.rm=TRUE))->Treatment_summary

#Adding in a line for total number of patients treated 
No_treatment <- c("No of patients treated")
as.data.frame(No_treatment) ->No_treatment

colnames(No_treatment)[1] <- "Treatment_category"

bind_rows(Treatment_summary,No_treatment) ->Treatment_summary

#Reordering the order of each treatment 

Treatment_summary$Treatment_category %>% as.factor() %>% 
  fct_relevel("Endocrine","Chemotherapy","Trial/other","No of patients treated","No treatment","Death") ->Treatment_summary$Treatment_category

Treatment_summary %>% arrange(Treatment_category) ->Treatment_summary

Treatment_summary$Treatment_category %>% as.character() ->Treatment_summary$Treatment_category
#Creating the number of patients treated and % of patients treated 

sum(Treatment_summary$`1st line`[1:3],na.rm=TRUE) -> Treatment_summary[4,"1st line"]
sum(Treatment_summary$`1st line %`[1:3],na.rm=TRUE) -> Treatment_summary[4,"1st line %"]
sum(Treatment_summary$`2nd line`[1:3],na.rm=TRUE) -> Treatment_summary[4,"2nd line"]
sum(Treatment_summary$`2nd line %`[1:3],na.rm=TRUE) -> Treatment_summary[4,"2nd line %"]
sum(Treatment_summary$`3rd line`[1:3],na.rm=TRUE) -> Treatment_summary[4,"3rd line"]
sum(Treatment_summary$`3rd line %`[1:3],na.rm=TRUE) -> Treatment_summary[4,"3rd line %"]
sum(Treatment_summary$`4th line`[1:3],na.rm=TRUE) -> Treatment_summary[4,"4th line"]
sum(Treatment_summary$`4th line %`[1:3],na.rm=TRUE) -> Treatment_summary[4,"4th line %"]
sum(Treatment_summary$`5th line`[1:3],na.rm=TRUE) -> Treatment_summary[4,"5th line"]
sum(Treatment_summary$`5th line %`[1:3],na.rm=TRUE) -> Treatment_summary[4,"5th line %"]
sum(Treatment_summary$`6th line`[1:3],na.rm=TRUE) -> Treatment_summary[4,"6th line"]
sum(Treatment_summary$`6th line %`[1:3],na.rm=TRUE) -> Treatment_summary[4,"6th line %"]
sum(Treatment_summary$`7th line`[1:3],na.rm=TRUE) -> Treatment_summary[4,"7th line"]
sum(Treatment_summary$`7th line  %`[1:3],na.rm=TRUE) -> Treatment_summary[4,"7th line %"]
sum(Treatment_summary$`8th line`[1:3],na.rm=TRUE) -> Treatment_summary[4,"8th line"]
sum(Treatment_summary$`8th line %`[1:3],na.rm=TRUE) -> Treatment_summary[4,"8th line %"]
sum(Treatment_summary$`9th line`[1:3],na.rm=TRUE) -> Treatment_summary[4,"9th line"]
sum(Treatment_summary$`9th line %`[1:3],na.rm=TRUE) -> Treatment_summary[4,"9th line %"]

Treatment_summary %>% rename("Treatment"="Treatment_category") ->Treatment_summary

Treatment_summary %>% write_xlsx("Adv BC Line of treatment1.xlsx")

#### More Granular level summary table

Onc_Chemo_PIS_ranked %>% 
  group_by(Treatment_category2,rank) %>% 
  mutate(`1st_line_treatment`=ifelse(rank ==1,n_distinct(UPI_NUMBER),0)) %>% 
  mutate(`1st_line_treatment_%`=ifelse(rank ==1,round2(`1st_line_treatment`/total_pats*100,1),0)) %>% 
  
  mutate(`2nd_line_treatment`=ifelse(rank ==2,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`2nd_line_treatment_%`=ifelse(rank ==2,round(`2nd_line_treatment`/total_pats*100,1),0)) %>%
  
  mutate(`3rd_line_treatment`=ifelse(rank ==3,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`3rd_line_treatment_%`=ifelse(rank ==3,round(`3rd_line_treatment`/total_pats*100,1),0)) %>% 
  
  mutate(`4th_line_treatment`=ifelse(rank ==4,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`4th_line_treatment_%`=ifelse(rank ==4,round(`4th_line_treatment`/total_pats*100,1),0)) %>% 
  
  mutate(`5th_line_treatment`=ifelse(rank ==5,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`5th_line_treatment_%`=ifelse(rank ==5,round(`5th_line_treatment`/total_pats*100,1),0)) %>%
  
  mutate(`6th_line_treatment`=ifelse(rank ==6,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`6th_line_treatment_%`=ifelse(rank ==6,round(`6th_line_treatment`/total_pats*100,1),0)) %>%
  
  mutate(`7th_line_treatment`=ifelse(rank ==7,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`7th_line_treatment_%`=ifelse(rank ==7,round(`7th_line_treatment`/total_pats*100,1),0)) %>%
  
  mutate(`8th_line_treatment`=ifelse(rank ==8,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`8th_line_treatment_%`=ifelse(rank ==8,round(`8th_line_treatment`/total_pats*100,1),0)) %>%
  
  mutate(`9th_line_treatment`=ifelse(rank ==8,n_distinct(UPI_NUMBER),0)) %>% 
  
  mutate(`9th_line_treatment_%`=ifelse(rank ==8,round(`9th_line_treatment`/total_pats*100,1),0)) %>%
  
  group_by(Treatment_category2) %>% 
  summarise(`1st line`=max(`1st_line_treatment`,na.rm=TRUE),
            `1st line %`=max(`1st_line_treatment_%`,na.rm=TRUE),
            `2nd line`=max(`2nd_line_treatment`,na.rm=TRUE),
            `2nd line %`=max(`2nd_line_treatment_%`,na.rm=TRUE),
            `3rd line`=max(`3rd_line_treatment`,na.rm=TRUE),
            `3rd line %`=max(`3rd_line_treatment_%`,na.rm=TRUE),
            `4th line` =max(`4th_line_treatment`,na.rm=TRUE),
            `4th line %`=max(`4th_line_treatment_%`,na.rm=TRUE),
            `5th line`=max(`5th_line_treatment`,na.rm=TRUE),
            `5th line %`=max(`5th_line_treatment_%`,na.rm=TRUE),
            `6th line`=max(`6th_line_treatment`,na.rm=TRUE),
            `6th line %`=max(`6th_line_treatment_%`,na.rm=TRUE),
            `7th line`=max(`7th_line_treatment`,na.rm=TRUE),
            `7th line %`=max(`7th_line_treatment_%`,na.rm=TRUE),
            `8th line`=max(`8th_line_treatment`,na.rm=TRUE),
            `8th line %`=max(`8th_line_treatment_%`,na.rm=TRUE),
            `9th line`=max(`9th_line_treatment`,na.rm=TRUE),
            `9th line %`=max(`9th_line_treatment_%`,na.rm=TRUE)) ->Treatment_summary2


No_treatment <- c("No of patients treated")
as.data.frame(No_treatment) ->No_treatment

colnames(No_treatment)[1] <- "Treatment_category2"

bind_rows(Treatment_summary2,No_treatment) ->Treatment_summary2

Treatment_summary2$Treatment_category2 %>% as.factor() %>% 
  fct_relevel("Anthracycline","Capecitabine","Carboplatin","Carboplatin + Taxane","Letrozole + CDK4/6",
              "Eribulin","Everolimus + Exemestane","Exemestane","Fulvestrant","Letrozole/Anastrozole",
              "Tamoxifen","Taxane","Trial/other","Vinorelbine",
              "No of patients treated","No treatment","Death") ->Treatment_summary2$Treatment_category2

Treatment_summary2 %>% arrange(Treatment_category2) ->Treatment_summary2

Treatment_summary2$Treatment_category2 %>% as.character() ->Treatment_summary2$Treatment_category2

sum(Treatment_summary2$`1st line`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"1st line"]
sum(Treatment_summary2$`1st line %`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"1st line %"]
sum(Treatment_summary2$`2nd line`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"2nd line"]
sum(Treatment_summary2$`2nd line %`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"2nd line %"]
sum(Treatment_summary2$`3rd line`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"3rd line"]
sum(Treatment_summary2$`3rd line %`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"3rd line %"]
sum(Treatment_summary2$`4th line`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"4th line"]
sum(Treatment_summary2$`4th line %`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"4th line %"]
sum(Treatment_summary2$`5th line`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"5th line"]
sum(Treatment_summary2$`5th line %`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"5th line %"]
sum(Treatment_summary2$`6th line`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"6th line"]
sum(Treatment_summary2$`6th line %`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"6th line %"]
sum(Treatment_summary2$`7th line`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"7th line"]
sum(Treatment_summary2$`7th line %`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"7th line %"]
sum(Treatment_summary2$`8th line`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"8th line"]
sum(Treatment_summary2$`8th line %`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"8th line %"]
sum(Treatment_summary2$`9th line`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"9th line"]
sum(Treatment_summary2$`9th line %`[1:14],na.rm=TRUE) -> Treatment_summary2[15,"9th line %"]

Treatment_summary2 %>% rename("Treatment"="Treatment_category2") ->Treatment_summary2

Treatment_summary2 %>% write_xlsx("Adv BC Line of treatment2.xlsx")



###### Creating list of UPIs for example 1 cohort ---- 
#check the numbers match 1st
# Onc_Chemo_PIS_ranked %>% filter(rank==1 & Treatment_category2=="Letrozole/Anastrozole") %>% 
#   select(UPI_NUMBER,Met_inc_date,Treatment_category2) %>%
#   group_by(Treatment_category2) %>% 
#   summarise(pats=n_distinct(UPI_NUMBER))


Onc_Chemo_PIS_ranked %>% filter(rank==1 &  Treatment_category2=="Letrozole/Anastrozole") %>% 
  select(UPI_NUMBER,Met_inc_date) -> Example1_let1st_cohort

Example1_let1st_cohort %>% write_csv("Example1_let1st_cohort.csv")

Results and Visualisation

In addition to tables, Sankey, Alluvial plots and Decision Trees were explored in order to visualise the patient pathway.

The following r packages were used for plotting:

NetworkD3
Collapsibletree
Ggalluvial

First line therapy for the 277 ER+ HER2- secondary breast cancer patients identified between 2013 and 2017 included Endocrine 189 (68.2%), Chemotherapy 61 (22.0%), 10 (3.6%) received no treatment and 2 (0.7%) received Trial/other.

Treatment_summary %>%kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive","bordered")) %>% 
  scroll_box(width = "100%", height = "100%")

Treatment	1st line	1st line %	2nd line	2nd line %	3rd line	3rd line %	4th line	4th line %	5th line	5th line %	6th line	6th line %	7th line	7th line %	8th line	8th line %	9th line	9th line %
Endocrine	189	68.2	98	35.4	45	16.2	14	5.1	7	2.5	3	1.1	3	1.1	0	0.0	0	0.0
Chemotherapy	61	22.0	52	18.8	45	16.2	30	10.8	13	4.7	6	2.2	1	0.4	1	0.4	0	0.0
Trial/other	2	0.7	0	0.0	1	0.4	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0
No of patients treated	252	90.9	150	54.2	91	32.8	44	15.9	20	7.2	9	3.3	4	0.0	1	0.4	0	0.0
No treatment	10	3.6	62	22.4	28	10.1	24	8.7	5	1.8	5	1.8	3	1.1	1	0.4	1	0.4
Death	15	5.4	40	14.4	31	11.2	23	8.3	19	6.9	6	2.2	2	0.7	2	0.7	0	0.0

#Format data for alluvial and Sankey plots ----- 
#max(Onc_Chemo_PIS_ranked$rank)

#There are a maximum of 9 lines of treatment that a patient can have in this cohort 
 
Onc_Chemo_PIS_ranked %>% group_by(UPI_NUMBER) %>% 
  mutate(`1st Treatment`=ifelse(rank==1,Treatment_category,NA)) %>% 
  mutate(`2nd Treatment`=ifelse(rank==2,Treatment_category,NA)) %>% 
  mutate(`3rd Treatment`=ifelse(rank==3,Treatment_category,NA)) %>% 
  mutate(`4th Treatment`=ifelse(rank==4,Treatment_category,NA)) %>% 
  mutate(`5th Treatment`=ifelse(rank==5,Treatment_category,NA)) %>% 
  mutate(`6th Treatment`=ifelse(rank==6,Treatment_category,NA)) %>%
  mutate(`7th Treatment`=ifelse(rank==7,Treatment_category,NA)) %>% 
  mutate(`8th Treatment`=ifelse(rank==8,Treatment_category,NA)) %>% 
  mutate(`9th Treatment`=ifelse(rank==9,Treatment_category,NA)) -> Onc_Chemo_PIS_ranked



Onc_Chemo_PIS_ranked %>% group_by(UPI_NUMBER) %>% 
  summarise(Met_inc_date=max(Met_inc_date,na.rm=TRUE),
            `1st Treatment`=first(`1st Treatment`),
            `2nd Treatment`=nth(`2nd Treatment`,2),
            `3rd Treatment`=nth(`3rd Treatment`,3),
            `4th Treatment`=nth(`4th Treatment`,4),
            `5th Treatment`=nth(`5th Treatment`,5),
            `6th Treatment`=nth(`6th Treatment`,6),
            `7th Treatment`=nth(`7th Treatment`,7),
            `8th Treatment`=nth(`8th Treatment`,8),
            `9th Treatment`=nth(`9th Treatment`,9)) %>% 
  group_by(`1st Treatment`,`2nd Treatment`,`3rd Treatment`,`4th Treatment`,`5th Treatment`,`6th Treatment`,`7th Treatment`,`8th Treatment`,`9th Treatment`,`9th Treatment`) %>% 
  summarise(No_of_pats=n_distinct(UPI_NUMBER)) -> Onc_Chemo_PIS_tree_1


# Recording rank to factor for alluvial plots

Onc_Chemo_PIS_ranked$rank %>% as.factor() %>% 
  fct_recode("1st Treatment" ="1",
             "2nd Treatment" = "2",
             "3rd Treatment" = "3",
             "4th Treatment" ="4",
             "5th Treatment" = "5",
             "6th Treatment" = "6",
             "7th Treatment" = "7",
             "8th Treatment" = "8",
             "9th Treatment" = "9") ->Onc_Chemo_PIS_ranked$`Line of Treatment`

## Prepping data for the D3 Network sankey plot  ---- 
Onc_Chemo_PIS_ranked %>% group_by(UPI_NUMBER) %>% 
  summarise(Met_inc_date=max(Met_inc_date,na.rm=TRUE),
            `1st Treatment`=first(`1st Treatment`),
            `2nd Treatment`=nth(`2nd Treatment`,2),
            `3rd Treatment`=nth(`3rd Treatment`,3),
            `4th Treatment`=nth(`4th Treatment`,4),
            `5th Treatment`=nth(`5th Treatment`,5),
            `6th Treatment`=nth(`6th Treatment`,6),
            `7th Treatment`=nth(`7th Treatment`,7),
            `8th Treatment`=nth(`8th Treatment`,8),
            `9th Treatment`=nth(`9th Treatment`,9)) ->Onc_Chemo_PIS_sankey_1

Onc_Chemo_PIS_sankey_1 %>% arrange(desc(UPI_NUMBER)) ->Onc_Chemo_PIS_sankey_1

#Adding 1st, 2nd, 3rd, to treatment description

ifelse(is.na(Onc_Chemo_PIS_sankey_1$`1st Treatment`)==0,paste(Onc_Chemo_PIS_sankey_1$`1st Treatment`,"1st"),NA)->Onc_Chemo_PIS_sankey_1$`1st Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_1$`2nd Treatment`)==0,paste(Onc_Chemo_PIS_sankey_1$`2nd Treatment`,"2nd"),NA)->Onc_Chemo_PIS_sankey_1$`2nd Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_1$`3rd Treatment`)==0,paste(Onc_Chemo_PIS_sankey_1$`3rd Treatment`,"3rd"),NA)->Onc_Chemo_PIS_sankey_1$`3rd Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_1$`4th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_1$`4th Treatment`,"4th"),NA)->Onc_Chemo_PIS_sankey_1$`4th Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_1$`5th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_1$`5th Treatment`,"5th"),NA)->Onc_Chemo_PIS_sankey_1$`5th Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_1$`6th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_1$`6th Treatment`,"6th"),NA)->Onc_Chemo_PIS_sankey_1$`6th Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_1$`7th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_1$`7th Treatment`,"7th"),NA)->Onc_Chemo_PIS_sankey_1$`7th Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_1$`8th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_1$`8th Treatment`,"8th"),NA)->Onc_Chemo_PIS_sankey_1$`8th Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_1$`9th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_1$`9th Treatment`,"9th"),NA)->Onc_Chemo_PIS_sankey_1$`9th Treatment`



#In order to make sure that all the nodes flow correctly, remove the 1st nodes that don't go anywhere. 

Onc_Chemo_PIS_sankey_1 %>% filter(`1st Treatment`!="No treatment 1st") %>% 
                                    filter(`1st Treatment`!="Death 1st")->Onc_Chemo_PIS_sankey_1

Sankey_nodes=data.frame(name=c(as.character(Onc_Chemo_PIS_sankey_1$`1st Treatment`), as.character(Onc_Chemo_PIS_sankey_1$`2nd Treatment`),
                               as.character(Onc_Chemo_PIS_sankey_1$`3rd Treatment`), as.character(Onc_Chemo_PIS_sankey_1$`4th Treatment`),
                               as.character(Onc_Chemo_PIS_sankey_1$`5th Treatment`), as.character(Onc_Chemo_PIS_sankey_1$`6th Treatment`),
                               as.character(Onc_Chemo_PIS_sankey_1$`7th Treatment`), as.character(Onc_Chemo_PIS_sankey_1$`8th Treatment`),
                               as.character(Onc_Chemo_PIS_sankey_1$`9th Treatment`)) %>% unique())

Sankey_nodes %>% filter(is.na(name)==0) ->Sankey_nodes

# With networkD3, connection must be provided using id, not using real name like in the links dataframe.. So we need to reformat it.
Onc_Chemo_PIS_sankey_1$`1st treatment`=match(Onc_Chemo_PIS_sankey_1$`1st Treatment`, Sankey_nodes$name)-1 
Onc_Chemo_PIS_sankey_1$`2nd treatment`=match(Onc_Chemo_PIS_sankey_1$`2nd Treatment`, Sankey_nodes$name)-1 
Onc_Chemo_PIS_sankey_1$`3rd treatment`=match(Onc_Chemo_PIS_sankey_1$`3rd Treatment`, Sankey_nodes$name)-1 
Onc_Chemo_PIS_sankey_1$`4th treatment`=match(Onc_Chemo_PIS_sankey_1$`4th Treatment`, Sankey_nodes$name)-1 
Onc_Chemo_PIS_sankey_1$`5th treatment`=match(Onc_Chemo_PIS_sankey_1$`5th Treatment`, Sankey_nodes$name)-1 
Onc_Chemo_PIS_sankey_1$`6th treatment`=match(Onc_Chemo_PIS_sankey_1$`6th Treatment`, Sankey_nodes$name)-1 
Onc_Chemo_PIS_sankey_1$`7th treatment`=match(Onc_Chemo_PIS_sankey_1$`7th Treatment`, Sankey_nodes$name)-1 
Onc_Chemo_PIS_sankey_1$`8th treatment`=match(Onc_Chemo_PIS_sankey_1$`8th Treatment`, Sankey_nodes$name)-1 
Onc_Chemo_PIS_sankey_1$`9th treatment`=match(Onc_Chemo_PIS_sankey_1$`9th Treatment`, Sankey_nodes$name)-1 



#Find out how many go from 1st to 2nd, 2nd to 3rd, 3rd to 4th etc.

Onc_Chemo_PIS_sankey_1 %>% group_by(`1st treatment`,`2nd treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_1st_2nd

names(Sankey_1st_2nd)[1]<-"source"
names(Sankey_1st_2nd)[2]<-"target"


Onc_Chemo_PIS_sankey_1 %>% group_by(`2nd treatment`,`3rd treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_2nd_3rd

names(Sankey_2nd_3rd)[1]<-"source"
names(Sankey_2nd_3rd)[2]<-"target"


Onc_Chemo_PIS_sankey_1 %>% group_by(`3rd treatment`,`4th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_3rd_4th

names(Sankey_3rd_4th)[1]<-"source"
names(Sankey_3rd_4th)[2]<-"target"


Onc_Chemo_PIS_sankey_1 %>% group_by(`4th treatment`,`5th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_4th_5th

names(Sankey_4th_5th)[1]<-"source"
names(Sankey_4th_5th)[2]<-"target"


Onc_Chemo_PIS_sankey_1 %>% group_by(`5th treatment`,`6th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_5th_6th

names(Sankey_5th_6th)[1]<-"source"
names(Sankey_5th_6th)[2]<-"target"


Onc_Chemo_PIS_sankey_1 %>% group_by(`6th treatment`,`7th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_6th_7th

names(Sankey_6th_7th)[1]<-"source"
names(Sankey_6th_7th)[2]<-"target"


Onc_Chemo_PIS_sankey_1 %>% group_by(`7th treatment`,`8th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_7th_8th

names(Sankey_7th_8th)[1]<-"source"
names(Sankey_7th_8th)[2]<-"target"


Onc_Chemo_PIS_sankey_1 %>% group_by(`8th treatment`,`9th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER)) %>% 
  ungroup()->Sankey_8th_9th

names(Sankey_8th_9th)[1]<-"source"
names(Sankey_8th_9th)[2]<-"target"


bind_rows(Sankey_1st_2nd,Sankey_2nd_3rd,Sankey_3rd_4th,Sankey_4th_5th,Sankey_5th_6th,Sankey_6th_7th,Sankey_7th_8th,Sankey_8th_9th) ->Onc_Chemo_PIS_sankey_1



##Sankey plots high level data ---- 
#Removing any nodes that don't go anywhere

Onc_Chemo_PIS_sankey_1 %>% 
  filter(is.na(source)!=1) %>% 
  filter(is.na(target)!=1) %>% 
  as.data.frame()->Onc_Chemo_PIS_sankey_1


Onc_Chemo_PIS_sankey_1$treatment_type <- sub(' .*', '',
                                             Sankey_nodes[Onc_Chemo_PIS_sankey_1$source + 1, 'name'])

Data visualization using Sankey plots and Decision Trees

Patient pathways are very complex and sometimes deviate from what is expected. The Sankey, Decision tree and Alluvial plots allow patients to be tracked through their pathway.

The Sankey and Decision tree plots are both interactive, hovering over each node will show the number of patients at each point in the line of treatment. Clicking on the nodes of the Decision tree will expand to the next line of treatment, click the nodes of interest to follow the line of treatment.

Subsequent treatment decision including best supportive care and death have been tracked to identify up to 65 or 122 unique pathways with up to 8 lines of treatment.

sankeyNetwork(Links = Onc_Chemo_PIS_sankey_1, Nodes = Sankey_nodes,
              Source = "source", Target = "target",
              Value = "no_of_pats", NodeID = "name",
              LinkGroup = "treatment_type",units = "Patients",NodeGroup = NULL,
              sinksRight=FALSE,fontSize = 12,iterations=40,
              fontFamily = "calibri",nodePadding = 20, nodeWidth = 15,
              colourScale = JS("d3.scaleOrdinal(d3.schemeCategory10);"),
              height=750)

Sankey Network diagram of line of treatment in HER2+ ER- Secondary Breast Cancer Patients

##Alluvial and Tree Plot - High level Data

#alpha is for colour transparency
#geom_flow(aes.flow="backward") puts the flow going backwards from rank5 to rank 1

str_remove(Onc_Chemo_PIS_ranked$`Line of Treatment`,"Treatment") -> Onc_Chemo_PIS_ranked$`Line of Treatment`

ggplot(Onc_Chemo_PIS_ranked,
       aes(x = `Line of Treatment`, stratum = Treatment_category, alluvium = UPI_NUMBER,fill=Treatment_category, colour=Treatment_category)) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow(alpha=.9, colour="Black") +
  geom_stratum(alpha = .9,linetype=1,colour="Black") +
  scale_color_brewer(palette="Spectral", aesthetics = c("colour","fill"),direction = -1) +
  theme_minimal()+
  theme(legend.position="bottom")+
  labs(fill='Treatment')

Alluvial diagram of line of treatment in HER2+ ER- Secondary Breast Cancer Patients

Decision Tree

From the Decision Tree out of the 189 patients who had Endocrine therapy as a first line treatment 73 (39%) patients continued to receive a different Endocrine therapy as a second line therapy, 29 (15%) patients went on to have Chemotherapy, 54 (29%) patients received No Treatment and 33 (18%) patients died on the treatment.

#Collapsible Tree diagram ----
colour_tree<-c("seashell","#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#2c7fb8",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#2c7fb8",
"#253494",
"#ffffcc",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#ffffcc",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#a1dab4",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#41b6c4",
"#253494",
"#ffffcc",
"#a1dab4",
"#41b6c4",
"#ffffcc",
"#41b6c4",
"#ffffcc",
"#41b6c4",
"#253494",
"#ffffcc",
"#253494",
"#253494",
"#ffffcc",
"#41b6c4",
"#253494",
"#ffffcc",
"#41b6c4",
"#ffffcc",
"#a1dab4",
"#253494",
"#ffffcc",
"#253494",
"#ffffcc",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#253494",
"#41b6c4",
"#a1dab4",
"#253494",
"#ffffcc",
"#41b6c4",
"#253494",
"#41b6c4",
"#ffffcc",
"#ffffcc",
"#253494",
"#ffffcc",
"#a1dab4",
"#253494",
"#ffffcc",
"#a1dab4",
"#253494",
"#a1dab4",
"#41b6c4",
"#ffffcc",
"#41b6c4",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#a1dab4",
"#a1dab4",
"#a1dab4",
"#253494",
"#ffffcc",
"#ffffcc",
"#a1dab4")


Cohort<-Onc_Chemo_PIS_tree_1

collapsibleTree(Cohort,c("1st Treatment","2nd Treatment","3rd Treatment","4th Treatment","5th Treatment","6th Treatment","7th Treatment","8th Treatment","9th Treatment"),
                tooltip = TRUE,
                attribute = "No_of_pats",
                fill=colour_tree,
                fontSize = 15,collapsed = TRUE,
                height=750)

Decision tree of line of treatment in HER2+ ER- Secondary Breast Cancer Patients

# Saving out patients and their first line of treatment and start date for that treatment for patient characteristics report
# Onc_Chemo_PIS_ranked %>% filter(`Line of Treatment`=="1st ") %>% 
#   select(UPI_NUMBER,Met_inc_date,Treatment_category, first_STDATE) %>% 
#   write_xlsx("1st_line_treatment.xlsx")

Onc_Chemo_PIS_ranked %>%
  select(UPI_NUMBER,Met_inc_date,Treatment_category,Treatment_category2, first_STDATE,`Line of Treatment`) %>%
  write_xlsx("line_treatment_advbcpats.xlsx")

#Allvial and Sankey data formats for more granular data ----


Onc_Chemo_PIS_ranked %>% group_by(UPI_NUMBER) %>% 
  mutate(`1st Treatment`=ifelse(rank==1,Treatment_category2,NA)) %>% 
  mutate(`2nd Treatment`=ifelse(rank==2,Treatment_category2,NA)) %>% 
  mutate(`3rd Treatment`=ifelse(rank==3,Treatment_category2,NA)) %>% 
  mutate(`4th Treatment`=ifelse(rank==4,Treatment_category2,NA)) %>% 
  mutate(`5th Treatment`=ifelse(rank==5,Treatment_category2,NA)) %>% 
  mutate(`6th Treatment`=ifelse(rank==6,Treatment_category2,NA)) %>%
  mutate(`7th Treatment`=ifelse(rank==7,Treatment_category2,NA)) %>% 
  mutate(`8th Treatment`=ifelse(rank==8,Treatment_category2,NA)) %>% 
  mutate(`9th Treatment`=ifelse(rank==9,Treatment_category2,NA)) -> Onc_Chemo_PIS_ranked_2



Onc_Chemo_PIS_ranked_2 %>% group_by(UPI_NUMBER) %>% 
  summarise(Met_inc_date=max(Met_inc_date,na.rm=TRUE),
            `1st Treatment`=first(`1st Treatment`),
            `2nd Treatment`=nth(`2nd Treatment`,2),
            `3rd Treatment`=nth(`3rd Treatment`,3),
            `4th Treatment`=nth(`4th Treatment`,4),
            `5th Treatment`=nth(`5th Treatment`,5),
            `6th Treatment`=nth(`6th Treatment`,6),
            `7th Treatment`=nth(`7th Treatment`,7),
            `8th Treatment`=nth(`8th Treatment`,8),
            `9th Treatment`=nth(`9th Treatment`,9)) ->Onc_Chemo_PIS_tree_2


Onc_Chemo_PIS_tree_2 %>% group_by(`1st Treatment`,`2nd Treatment`,`3rd Treatment`,`4th Treatment`,`5th Treatment`,`6th Treatment`,`7th Treatment`,`8th Treatment`,`9th Treatment`) %>% 
  summarise(No_of_pats=n_distinct(UPI_NUMBER)) -> Onc_Chemo_PIS_tree_2


# Recording rank to factor for alluvial plots

Onc_Chemo_PIS_ranked_2$rank %>% as.factor() %>% 
  fct_recode("1st Treatment" ="1",
             "2nd Treatment" = "2",
             "3rd Treatment" = "3",
             "4th Treatment" ="4",
             "5th Treatment" = "5",
             "6th Treatment" = "6",
             "7th Treatment" = "7",
             "8th Treatment" = "8",
             "9th Treatment" = "9") ->Onc_Chemo_PIS_ranked_2$`Line of Treatment`


## Prepping data for the D3 Network sankey plot  ---- 
Onc_Chemo_PIS_ranked_2 %>% group_by(UPI_NUMBER) %>% 
  summarise(Met_inc_date=max(Met_inc_date,na.rm=TRUE),
            `1st Treatment`=first(`1st Treatment`),
            `2nd Treatment`=nth(`2nd Treatment`,2),
            `3rd Treatment`=nth(`3rd Treatment`,3),
            `4th Treatment`=nth(`4th Treatment`,4),
            `5th Treatment`=nth(`5th Treatment`,5),
            `6th Treatment`=nth(`6th Treatment`,6),
            `7th Treatment`=nth(`7th Treatment`,7),
            `8th Treatment`=nth(`8th Treatment`,8),
            `9th Treatment`=nth(`9th Treatment`,9)) ->Onc_Chemo_PIS_sankey_2

Onc_Chemo_PIS_sankey_2 %>% arrange(desc(UPI_NUMBER)) ->Onc_Chemo_PIS_sankey_2


ifelse(is.na(Onc_Chemo_PIS_sankey_2$`1st Treatment`)==0,paste(Onc_Chemo_PIS_sankey_2$`1st Treatment`,"1st"),NA)->Onc_Chemo_PIS_sankey_2$`1st Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_2$`2nd Treatment`)==0,paste(Onc_Chemo_PIS_sankey_2$`2nd Treatment`,"2nd"),NA)->Onc_Chemo_PIS_sankey_2$`2nd Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_2$`3rd Treatment`)==0,paste(Onc_Chemo_PIS_sankey_2$`3rd Treatment`,"3rd"),NA)->Onc_Chemo_PIS_sankey_2$`3rd Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_2$`4th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_2$`4th Treatment`,"4th"),NA)->Onc_Chemo_PIS_sankey_2$`4th Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_2$`5th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_2$`5th Treatment`,"5th"),NA)->Onc_Chemo_PIS_sankey_2$`5th Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_2$`6th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_2$`6th Treatment`,"6th"),NA)->Onc_Chemo_PIS_sankey_2$`6th Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_2$`7th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_2$`7th Treatment`,"7th"),NA)->Onc_Chemo_PIS_sankey_2$`7th Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_2$`8th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_2$`8th Treatment`,"8th"),NA)->Onc_Chemo_PIS_sankey_2$`8th Treatment`
ifelse(is.na(Onc_Chemo_PIS_sankey_2$`9th Treatment`)==0,paste(Onc_Chemo_PIS_sankey_2$`9th Treatment`,"9th"),NA)->Onc_Chemo_PIS_sankey_2$`9th Treatment`



# Onc_Chemo_PIS_sankey_2$`1st Treatment` %>% table()
# Onc_Chemo_PIS_sankey_2$`2nd Treatment` %>% table()
# Onc_Chemo_PIS_sankey_2$`3rd Treatment` %>% table()
# Onc_Chemo_PIS_sankey_2$`4th Treatment` %>% table()
# Onc_Chemo_PIS_sankey_2$`5th Treatment` %>% table()
# Onc_Chemo_PIS_sankey_2$`6th Treatment` %>% table()
# Onc_Chemo_PIS_sankey_2$`7th Treatment` %>% table()
# Onc_Chemo_PIS_sankey_2$`8th Treatment` %>% table()
# Onc_Chemo_PIS_sankey_2$`9th Treatment` %>% table()


#In order to make sure that all the nodes flow correctly, remove the 1st nodes that don't go anywhere. 

Onc_Chemo_PIS_sankey_2 %>% filter(`1st Treatment`!="No treatment 1st") %>% 
  filter(`1st Treatment`!="Death 1st")->Onc_Chemo_PIS_sankey_2

Sankey_nodes2=data.frame(name=c(as.character(Onc_Chemo_PIS_sankey_2$`1st Treatment`), as.character(Onc_Chemo_PIS_sankey_2$`2nd Treatment`),
                               as.character(Onc_Chemo_PIS_sankey_2$`3rd Treatment`), as.character(Onc_Chemo_PIS_sankey_2$`4th Treatment`),
                               as.character(Onc_Chemo_PIS_sankey_2$`5th Treatment`), as.character(Onc_Chemo_PIS_sankey_2$`6th Treatment`),
                               as.character(Onc_Chemo_PIS_sankey_2$`7th Treatment`), as.character(Onc_Chemo_PIS_sankey_2$`8th Treatment`),
                               as.character(Onc_Chemo_PIS_sankey_2$`9th Treatment`)) %>% unique())


Sankey_nodes2 %>% filter(is.na(name)==0) ->Sankey_nodes2



# With networkD3, connection must be provided using id, not using real name like in the links dataframe.. So we need to reformat it.
Onc_Chemo_PIS_sankey_2$`1st treatment`=match(Onc_Chemo_PIS_sankey_2$`1st Treatment`, Sankey_nodes2$name)-1 
Onc_Chemo_PIS_sankey_2$`2nd treatment`=match(Onc_Chemo_PIS_sankey_2$`2nd Treatment`, Sankey_nodes2$name)-1 
Onc_Chemo_PIS_sankey_2$`3rd treatment`=match(Onc_Chemo_PIS_sankey_2$`3rd Treatment`, Sankey_nodes2$name)-1 
Onc_Chemo_PIS_sankey_2$`4th treatment`=match(Onc_Chemo_PIS_sankey_2$`4th Treatment`, Sankey_nodes2$name)-1 
Onc_Chemo_PIS_sankey_2$`5th treatment`=match(Onc_Chemo_PIS_sankey_2$`5th Treatment`, Sankey_nodes2$name)-1 
Onc_Chemo_PIS_sankey_2$`6th treatment`=match(Onc_Chemo_PIS_sankey_2$`6th Treatment`, Sankey_nodes2$name)-1 
Onc_Chemo_PIS_sankey_2$`7th treatment`=match(Onc_Chemo_PIS_sankey_2$`7th Treatment`, Sankey_nodes2$name)-1 
Onc_Chemo_PIS_sankey_2$`8th treatment`=match(Onc_Chemo_PIS_sankey_2$`8th Treatment`, Sankey_nodes2$name)-1 
Onc_Chemo_PIS_sankey_2$`9th treatment`=match(Onc_Chemo_PIS_sankey_2$`9th Treatment`, Sankey_nodes2$name)-1 






#Find out how many go from 1st to 2nd, 2nd to 3rd, 3rd to 4th etc.

Onc_Chemo_PIS_sankey_2 %>% group_by(`1st treatment`,`2nd treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_1st_2nd2

names(Sankey_1st_2nd2)[1]<-"source"
names(Sankey_1st_2nd2)[2]<-"target"


Onc_Chemo_PIS_sankey_2 %>% group_by(`2nd treatment`,`3rd treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_2nd_3rd2

names(Sankey_2nd_3rd2)[1]<-"source"
names(Sankey_2nd_3rd2)[2]<-"target"


Onc_Chemo_PIS_sankey_2 %>% group_by(`3rd treatment`,`4th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_3rd_4th2

names(Sankey_3rd_4th2)[1]<-"source"
names(Sankey_3rd_4th2)[2]<-"target"


Onc_Chemo_PIS_sankey_2 %>% group_by(`4th treatment`,`5th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_4th_5th2

names(Sankey_4th_5th2)[1]<-"source"
names(Sankey_4th_5th2)[2]<-"target"


Onc_Chemo_PIS_sankey_2 %>% group_by(`5th treatment`,`6th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_5th_6th2

names(Sankey_5th_6th2)[1]<-"source"
names(Sankey_5th_6th2)[2]<-"target"


Onc_Chemo_PIS_sankey_2 %>% group_by(`6th treatment`,`7th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_6th_7th2

names(Sankey_6th_7th2)[1]<-"source"
names(Sankey_6th_7th2)[2]<-"target"


Onc_Chemo_PIS_sankey_2 %>% group_by(`7th treatment`,`8th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER))%>% 
  ungroup() ->Sankey_7th_8th2

names(Sankey_7th_8th2)[1]<-"source"
names(Sankey_7th_8th2)[2]<-"target"


Onc_Chemo_PIS_sankey_2 %>% group_by(`8th treatment`,`9th treatment`) %>% 
  summarise(no_of_pats=n_distinct(UPI_NUMBER)) %>% 
  ungroup()->Sankey_8th_9th2

names(Sankey_8th_9th2)[1]<-"source"
names(Sankey_8th_9th2)[2]<-"target"


bind_rows(Sankey_1st_2nd2,Sankey_2nd_3rd2,Sankey_3rd_4th2,Sankey_4th_5th2,Sankey_5th_6th2,Sankey_6th_7th2,Sankey_7th_8th2,Sankey_8th_9th2) ->Onc_Chemo_PIS_sankey_2

Granular Data

Of the 189 patients who received Endocrine therapy as a first line treatment between 2013 and 2017, 111 (58.7%) received Letrozole/Anastrozole, 29 (15.3%) received Tamoxifen, 34 (18.0%) received Exemestane, 13 (6.9%) received CDK 4/6 inhibitors with Letrozole and 2 (1.1%) received Fulvestrant.

Of the 61 patients that received Chemotherapy as a first line treatment between 2013 and 2017, 25 (41.0%) received Taxane Chemotherapy, 20 (32.8%) received Anthracycline, 15 (24.6%) received Capecitabine and 1 (1.6%) received Carboplatin.

Treatment_summary2 %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = TRUE) %>% 
  scroll_box(width = "100%", height = "100%")

Treatment	1st line	1st line %	2nd line	2nd line %	3rd line	3rd line %	4th line	4th line %	5th line	5th line %	6th line	6th line %	7th line	7th line %	8th line	8th line %	9th line	9th line %
Anthracycline	20	7.2	9	3.2	11	4.0	6	2.2	1	0.4	0	0.0	0	0.0	0	0.0	0	0.0
Capecitabine	15	5.4	17	6.1	13	4.7	17	6.1	4	1.4	0	0.0	0	0.0	0	0.0	0	0.0
Carboplatin	1	0.4	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0
Carboplatin + Taxane	0	0.0	0	0.0	1	0.4	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0
Letrozole + CDK4/6	13	4.7	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0
Eribulin	0	0.0	0	0.0	1	0.4	1	0.4	4	1.4	3	1.1	0	0.0	1	0.4	1	0.4
Everolimus + Exemestane	0	0.0	2	0.7	5	1.8	1	0.4	0	0.0	1	0.4	0	0.0	0	0.0	0	0.0
Exemestane	34	12.3	46	16.6	7	2.5	7	2.5	1	0.4	0	0.0	1	0.4	0	0.0	0	0.0
Fulvestrant	2	0.7	6	2.2	6	2.2	3	1.1	2	0.7	1	0.4	0	0.0	0	0.0	0	0.0
Letrozole/Anastrozole	111	40.1	19	6.9	12	4.3	1	0.4	3	1.1	1	0.4	2	0.7	0	0.0	0	0.0
Tamoxifen	29	10.5	25	9.0	15	5.4	2	0.7	1	0.4	0	0.0	0	0.0	0	0.0	0	0.0
Taxane	25	9.0	22	7.9	14	5.1	3	1.1	3	1.1	2	0.7	1	0.4	0	0.0	0	0.0
Trial/other	2	0.7	0	0.0	1	0.4	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0
Vinorelbine	0	0.0	4	1.4	5	1.8	3	1.1	1	0.4	1	0.4	0	0.0	0	0.0	0	0.0
No of patients treated	252	91.0	150	54.0	91	33.0	44	16.0	20	7.3	9	3.4	4	1.5	1	0.4	1	0.4
No treatment	10	3.6	62	22.4	28	10.1	24	8.7	5	1.8	5	1.8	3	1.1	1	0.4	1	0.4
Death	15	5.4	40	14.4	31	11.2	23	8.3	19	6.9	6	2.2	2	0.7	2	0.7	2	0.7

##Sankey plots granular level data ---- 

Onc_Chemo_PIS_sankey_2 %>% as.data.frame() %>% 
  filter(is.na(source)!=1) %>% 
  filter(is.na(target)!=1) ->Onc_Chemo_PIS_sankey_2

# Make the Network


Onc_Chemo_PIS_sankey_2$treatment_type <- sub(' .*', '',
                                             Sankey_nodes2[Onc_Chemo_PIS_sankey_2$source + 1, 'name'])


sankeyNetwork(Links = Onc_Chemo_PIS_sankey_2, Nodes = Sankey_nodes2,
              Source = "source", Target = "target",
              Value = "no_of_pats", NodeID = "name",
              LinkGroup = "treatment_type",units = "Patients",NodeGroup = NULL,
              sinksRight=FALSE,fontSize = 12,iterations=40,
              fontFamily = "calibri",nodePadding = 20, nodeWidth = 15,
              colourScale = JS("d3.scaleOrdinal(d3.schemeCategory10);"),
              height=750)

Sankey Network diagram of line of treatment in HER2+ ER- Secondary Breast Cancer Patients - granular data

#Collapsible Tree diagram ----
colour_tree2<-c("seashell","#253494",
"#253494",
"#253494",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#41b6c4",
"#41b6c4",
"#a1dab4",
"#41b6c4",
"#253494",
"#2c7fb8",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#a1dab4",
"#41b6c4",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#253494",
"#253494",
"#253494",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#a1dab4",
"#41b6c4",
"#253494",
"#41b6c4",
"#41b6c4",
"#a1dab4",
"#41b6c4",
"#253494",
"#253494",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#41b6c4",
"#a1dab4",
"#41b6c4",
"#253494",
"#253494",
"#253494",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#41b6c4",
"#a1dab4",
"#253494",
"#253494",
"#253494",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#41b6c4",
"#253494",
"#253494",
"#41b6c4",
"#253494",
"#253494",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#a1dab4",
"#253494",
"#41b6c4",
"#253494",
"#41b6c4",
"#253494",
"#ffffcc",
"#41b6c4",
"#253494",
"#ffffcc",
"#41b6c4",
"#253494",
"#a1dab4",
"#253494",
"#ffffcc",
"#41b6c4",
"#253494",
"#ffffcc",
"#41b6c4",
"#253494",
"#41b6c4",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#a1dab4",
"#a1dab4",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#253494",
"#ffffcc",
"#41b6c4",
"#253494",
"#ffffcc",
"#253494",
"#253494",
"#253494",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#a1dab4",
"#41b6c4",
"#253494",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#a1dab4",
"#253494",
"#253494",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#41b6c4",
"#253494",
"#253494",
"#253494",
"#ffffcc",
"#253494",
"#2c7fb8",
"#ffffcc",
"#253494",
"#a1dab4",
"#253494",
"#253494",
"#253494",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#a1dab4",
"#253494",
"#ffffcc",
"#253494",
"#253494",
"#253494",
"#253494",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#41b6c4",
"#253494",
"#253494",
"#a1dab4",
"#253494",
"#ffffcc",
"#253494",
"#253494",
"#ffffcc",
"#253494",
"#a1dab4",
"#41b6c4",
"#253494",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#ffffcc",
"#a1dab4",
"#41b6c4",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#253494",
"#41b6c4",
"#253494",
"#253494",
"#a1dab4",
"#ffffcc",
"#ffffcc",
"#a1dab4",
"#253494",
"#a1dab4",
"#ffffcc",
"#a1dab4",
"#a1dab4",
"#253494",
"#41b6c4",
"#41b6c4",
"#253494",
"#a1dab4",
"#253494",
"#ffffcc",
"#a1dab4",
"#253494",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#253494",
"#253494",
"#253494",
"#a1dab4",
"#ffffcc",
"#41b6c4",
"#ffffcc",
"#253494",
"#253494",
"#a1dab4",
"#253494",
"#253494",
"#253494",
"#ffffcc",
"#41b6c4",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#ffffcc",
"#41b6c4",
"#253494",
"#253494",
"#a1dab4",
"#253494",
"#a1dab4",
"#ffffcc",
"#253494",
"#41b6c4",
"#253494",
"#ffffcc",
"#ffffcc",
"#ffffcc",
"#a1dab4",
"#ffffcc",
"#253494",
"#253494",
"#253494",
"#253494",
"#ffffcc",
"#ffffcc",
"#253494",
"#ffffcc",
"#253494",
"#253494",
"#41b6c4",
"#a1dab4",
"#41b6c4",
"#a1dab4",
"#41b6c4",
"#253494",
"#253494",
"#ffffcc",
"#ffffcc",
"#253494",
"#41b6c4",
"#ffffcc",
"#ffffcc",
"#ffffcc",
"#253494",
"#253494",
"#a1dab4",
"#ffffcc",
"#ffffcc",
"#ffffcc",
"#ffffcc",
"#253494",
"#ffffcc",
"#41b6c4",
"#253494",
"#253494",
"#253494",
"#ffffcc",
"#a1dab4",
"#ffffcc",
"#ffffcc",
"#41b6c4",
"#ffffcc",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#a1dab4",
"#253494",
"#41b6c4",
"#ffffcc",
"#253494",
"#253494",
"#ffffcc",
"#a1dab4",
"#253494",
"#a1dab4",
"#253494",
"#a1dab4",
"#ffffcc",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#253494",
"#ffffcc",
"#ffffcc",
"#41b6c4",
"#41b6c4",
"#253494",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#a1dab4",
"#41b6c4",
"#a1dab4",
"#ffffcc",
"#253494",
"#ffffcc",
"#a1dab4",
"#ffffcc",
"#a1dab4")


Cohort<-Onc_Chemo_PIS_tree_2

collapsibleTree(Cohort,c("1st Treatment","2nd Treatment","3rd Treatment","4th Treatment","5th Treatment","6th Treatment","7th Treatment","8th Treatment","9th Treatment"),
                tooltip = TRUE,
                attribute = "No_of_pats",
                fontSize = 15,collapsed = TRUE,
                fill=colour_tree2,
                height=1000)

Decision tree of line of treatment in HER2+ ER- Secondary Breast Cancer Patients - granular data