58-seer.Rmd

# Surveillance Epidemiology and End Results (SEER) {-}

[![Build Status](https://travis-ci.org/asdfree/seer.svg?branch=master)](https://travis-ci.org/asdfree/seer) [![Build status](https://ci.appveyor.com/api/projects/status/github/asdfree/seer?svg=TRUE)](https://ci.appveyor.com/project/ajdamico/seer)

The Surveillance Epidemiology and End Results (SEER) aggregates person-level information for more than a quarter of cancer incidence in the United States.

* A series of both individual- and population-level tables, grouped by site of cancer diagnosis.

* A registry covering various geographies across the US population, standardized by SEER*Stat to produce nationally-representative estimates.

* Updated every spring based on the previous November's submission of data.

* Maintained by the United States [National Cancer Institute (NCI)](http://www.cancer.gov/)

## Simplified Download and Importation {-}

The R `lodown` package easily downloads and imports all available SEER microdata by simply specifying `"seer"` with an `output_dir =` parameter in the `lodown()` function. Depending on your internet connection and computer processing speed, you might prefer to run this step overnight.

```{r eval = FALSE }
library(lodown)
lodown( "seer" , output_dir = file.path( path.expand( "~" ) , "SEER" ) , 
	your_username = "username" , 
	your_password = "password" )
```

## Analysis Examples with base R \ {-}

Load a data frame:

```{r eval = FALSE }
available_files <-
	list.files( 
		file.path( path.expand( "~" ) , "SEER" ) , 
		recursive = TRUE , 
		full.names = TRUE 
	)

seer_df <- 
	readRDS( grep( "incidence(.*)yr1973(.*)LYMYLEUK" , available_files , value = TRUE ) )
```

```{r eval = FALSE }

```

### Variable Recoding {-}

Add new columns to the data set:
```{r eval = FALSE }
seer_df <- 
	transform( 
		seer_df , 
		
		survival_months = ifelse( srv_time_mon == 9999 , NA , as.numeric( srv_time_mon ) ) ,
		
		female = as.numeric( sex == 2 ) ,
		
		race_ethnicity =
			ifelse( race1v == 99 , "unknown" ,
			ifelse( nhiade > 0 , "hispanic" , 
			ifelse( race1v == 1 , "white non-hispanic" ,
			ifelse( race1v == 2 , "black non-hispanic" , 
				"other non-hispanic" ) ) ) ) ,
		
		marital_status_at_dx =
			factor( 
				as.numeric( mar_stat ) , 
				levels = c( 1:6 , 9 ) ,
				labels =
					c(
						"single (never married)" ,
						"married" ,
						"separated" ,
						"divorced" ,
						"widowed" ,
						"unmarried or domestic partner or unregistered" ,
						"unknown"
					)
			)
	)
```

### Unweighted Counts {-}

Count the unweighted number of records in the table, overall and by groups:
```{r eval = FALSE , results = "hide" }
nrow( seer_df )

table( seer_df[ , "race_ethnicity" ] , useNA = "always" )
```

### Descriptive Statistics {-}

Calculate the mean (average) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
mean( seer_df[ , "survival_months" ] , na.rm = TRUE )

tapply(
	seer_df[ , "survival_months" ] ,
	seer_df[ , "race_ethnicity" ] ,
	mean ,
	na.rm = TRUE 
)
```

Calculate the distribution of a categorical variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
prop.table( table( seer_df[ , "marital_status_at_dx" ] ) )

prop.table(
	table( seer_df[ , c( "marital_status_at_dx" , "race_ethnicity" ) ] ) ,
	margin = 2
)
```

Calculate the sum of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
sum( seer_df[ , "survival_months" ] , na.rm = TRUE )

tapply(
	seer_df[ , "survival_months" ] ,
	seer_df[ , "race_ethnicity" ] ,
	sum ,
	na.rm = TRUE 
)
```

Calculate the median (50th percentile) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
quantile( seer_df[ , "survival_months" ] , 0.5 , na.rm = TRUE )

tapply(
	seer_df[ , "survival_months" ] ,
	seer_df[ , "race_ethnicity" ] ,
	quantile ,
	0.5 ,
	na.rm = TRUE 
)
```

### Subsetting {-}

Limit your `data.frame` to inpatient hospital reporting source:
```{r eval = FALSE , results = "hide" }
sub_seer_df <- subset( seer_df , rept_src == 1 )
```
Calculate the mean (average) of this subset:
```{r eval = FALSE , results = "hide" }
mean( sub_seer_df[ , "survival_months" ] , na.rm = TRUE )
```

### Measures of Uncertainty {-}

Calculate the variance, overall and by groups:
```{r eval = FALSE , results = "hide" }
var( seer_df[ , "survival_months" ] , na.rm = TRUE )

tapply(
	seer_df[ , "survival_months" ] ,
	seer_df[ , "race_ethnicity" ] ,
	var ,
	na.rm = TRUE 
)
```

### Regression Models and Tests of Association {-}

Perform a t-test:
```{r eval = FALSE , results = "hide" }
t.test( survival_months ~ female , seer_df )
```

Perform a chi-squared test of association:
```{r eval = FALSE , results = "hide" }
this_table <- table( seer_df[ , c( "female" , "marital_status_at_dx" ) ] )

chisq.test( this_table )
```

Perform a generalized linear model:
```{r eval = FALSE , results = "hide" }
glm_result <- 
	glm( 
		survival_months ~ female + marital_status_at_dx , 
		data = seer_df
	)

summary( glm_result )
```

## Analysis Examples with `dplyr` \ {-}

The R `dplyr` library offers an alternative grammar of data manipulation to base R and SQL syntax. [dplyr](https://github.com/tidyverse/dplyr/) offers many verbs, such as `summarize`, `group_by`, and `mutate`, the convenience of pipe-able functions, and the `tidyverse` style of non-standard evaluation. [This vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) details the available features. As a starting point for SEER users, this code replicates previously-presented examples:

```{r eval = FALSE , results = "hide" }
library(dplyr)
seer_tbl <- tbl_df( seer_df )
```
Calculate the mean (average) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
seer_tbl %>%
	summarize( mean = mean( survival_months , na.rm = TRUE ) )

seer_tbl %>%
	group_by( race_ethnicity ) %>%
	summarize( mean = mean( survival_months , na.rm = TRUE ) )
```