Gleb Satyukov
Senior Research Engineer | Data Science Instructor
R Basics 1: https://environ-175.com/basics/1
R Basics 2: https://environ-175.com/basics/2
R Basics 3: https://environ-175.com/basics/3
R Basics 4: https://environ-175.com/basics/4
R Basics 5: https://environ-175.com/basics/5
R Advanced 1: https://environ-175.com/advanced/1
R Advanced 2: https://environ-175.com/advanced/2
R Advanced 3: https://environ-175.com/advanced/3
R Advanced 4: https://environ-175.com/advanced/4
R Advanced 5: https://environ-175.com/advanced/5
R Spatial 1: https://environ-175.com/spatial/1
R Spatial 2: https://environ-175.com/spatial/2
R Spatial 3: https://environ-175.com/spatial/3
R Spatial 4: https://environ-175.com/spatial/4
R Spatial 5: https://environ-175.com/spatial/5
Review Date Formatting
Adding regression lines in ggplot
Review Full Join (with 2 variables)
Reshape data from long to wide format
Why do we need to perform a weighted collapse?
New functions!
geom_smooth(data=..., method = "...")
write_csv(data, path)
pivot_wider(...)
Attention to detail
Clean your environment
Use proper file paths
Use proper code spacing
Use inline and block comments!!
Use correct variable names (lowercase and underscores)
Save charts programmatically with ggsave
Using Global Variables
Set a directory using path_main
Keep your data in a dedicated data folder
Inspect the data after loading using head()
Be consistent with your use of quotes (' vs ")
Make sure to export both graphs and final data (using write_csv(data, path))
Follow instructions in the assignments exactly
Our data is originally stored as raw text
It is converted to an internal object in R
as.character(...)
-> "character" type
as.numeric(...)
-> "numeric" type
New addition: as.Date(...)
-> "Date" type!
Some examples of date formatting:
as.Date("03/14/2021", format = "%m/%d/%Y")
as.Date("14-Mar-21", format = "%d-%b-%y")
as.Date("Sunday, March 14, 2021", format = "%A, %B %d, %Y")
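As a quick check of the conversions above, the sketch below parses a numeric date string and confirms the resulting class. The names `pi_day` and `bad_date` are made up for illustration. Formats that contain month or weekday names (`%b`, `%B`, `%A`) depend on the system locale, so only a numeric format is used here.

```r
# Parse a numeric month/day/year string into a Date object
pi_day <- as.Date("03/14/2021", format = "%m/%d/%Y")

class(pi_day)                 # "Date"
format(pi_day, "%Y-%m-%d")    # "2021-03-14"

# Date arithmetic works once the text is a real Date
pi_day + 7                    # one week later: 2021-03-21

# A format that does not match the text gives NA rather than an error
bad_date <- as.Date("03/14/2021", format = "%Y-%m-%d")
is.na(bad_date)               # TRUE
```

An `NA` after parsing is usually a sign that the format string does not match the raw text.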
Note: as.Date() can only handle dates, not times!
library(lubridate)
date_string <- "Sunday, March 14, 2021 3:30pm PT"
# Remove the time zone abbreviation to parse the time first
date_clean <- sub(" PT$", "", date_string)
# Parse using appropriate format
parsed_date <- parse_date_time(date_clean, orders = "A, B d, Y I:Mp")
# Optionally, assign time zone
parsed_date <- force_tz(parsed_date, tzone = "America/Los_Angeles")
Once our date and time are parsed, the result is a date-time ("POSIXct") type rather than a plain "Date" type
class(parsed_date)
[1] "POSIXct" "POSIXt"
We can now use functions like:
format(parsed_date, "%A, %B %d, %Y at %I:%M %p %Z")
And strftime:
strftime(parsed_date, "%Y-%m-%d %H:%M")
Specifier | Meaning | Example |
---|---|---|
%Y | 4-digit year | 2021 |
%y | 2-digit year | 21 |
%B | Full month name | March |
%b | Abbreviated month name | Mar |
%m | Month number | 03 |
%d | Day of the month | 14 |
%A | Full weekday name | Sunday |
%a | Abbreviated weekday name | Sun |
%I | Hour (12-hour clock) | 03 |
%H | Hour (24-hour clock) | 15 |
%M | Minute | 30 |
%p | AM/PM | PM |
%Z | Time zone abbreviation | PDT |
strptime stands for string parse time; strftime stands for string format time
Strptime/strftime is universal across languages
This is a very common way to format dates and time
Check out this tool for help with date formats: https://www.pythonmorsels.com/strptime/
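The specifiers in the table can be tried out with base R alone. The sketch below is only an illustration: the name `dst_start` is made up, and only numeric specifiers are used because `%B`, `%A` and `%p` depend on the system locale.

```r
# Parse a date-time string with strptime-style specifiers (base R)
dst_start <- as.POSIXct("2021-03-14 15:30",
                        format = "%Y-%m-%d %H:%M",
                        tz = "America/Los_Angeles")

# Format the same moment in different ways
format(dst_start, "%Y-%m-%d %H:%M")   # 24-hour clock: "2021-03-14 15:30"
format(dst_start, "%m/%d/%y")         # 2-digit year:  "03/14/21"
format(dst_start, "%I:%M")            # 12-hour clock: "03:30"
```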
Friendly reminder to use more spaces
Let's revisit Advanced 4 Data Assignment
geom_smooth()
geom_smooth(data=.....)
geom_smooth(data=....., method = "lm")
geom_smooth(data=....., method = "gam")
geom_smooth(data=....., method = "loess")
geom_smooth(method="lm")
geom_smooth(method = "lm")
Fits a straight line to the data
Assumes a linear relationship between x and y
Great with large datasets (more efficient than loess)
geom_smooth(method="loess")
geom_smooth(method = "loess")
Stands for LOcal regrESSion
Fits a smooth, flexible curve to the data
The relationship between variables appears nonlinear
If you want a visual trend, not a predictive model
Great with small to medium datasets (< 1000 points)
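To see the two methods side by side, here is a minimal sketch. It uses `mtcars`, a dataset that ships with R, as a stand-in for the course data; the name `p_compare` and the colors are arbitrary choices.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the course data here
p_compare <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # straight-line fit
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  # flexible local regression curve
  geom_smooth(method = "loess", se = FALSE, color = "red")

p_compare
```

Where the red curve bends away from the blue line, the relationship is probably not well described by a straight line.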
# uses all of the data by default
geom_smooth()
# using all of our data
geom_smooth(data = fbi_nowkd)
# filter to crime in 2021 only
geom_smooth(data = filter(fbi_nowkd, year == 2021))
# using the local polynomial regression method for smoothing
geom_smooth(data = filter(fbi_nowkd, year == 2021), method = "loess")
# split regression lines into before and after the DST change
geom_smooth(
  data = filter(fbi_nowkd, year == 2021 & days_away < 0),
  method = "loess"
)
geom_smooth(
  data = filter(fbi_nowkd, year == 2021 & days_away > 0),
  method = "loess"
)
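The split-line pattern above can be tried without the course files. The sketch below simulates a small data frame; `toy_crime`, its `crimes` column and the size of the jump at the cutoff are all invented for illustration, and only `days_away` mirrors a name from the slides.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the course data: one row per day around a cutoff
set.seed(175)
toy_crime <- tibble(
  days_away = -30:30,
  crimes    = 100 + 0.5 * days_away + 10 * (days_away > 0) + rnorm(61, sd = 5)
)

p_split <- ggplot(toy_crime, aes(x = days_away, y = crimes)) +
  geom_point() +
  # smooth fitted only to the days before the cutoff
  geom_smooth(data = filter(toy_crime, days_away < 0), method = "loess") +
  # smooth fitted only to the days after the cutoff
  geom_smooth(data = filter(toy_crime, days_away > 0), method = "loess")

p_split
```

As in the slide code, day 0 itself belongs to neither fit because the filters use `< 0` and `> 0`.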
Number of maternal deaths by country, 1985 to 2020
https://ourworldindata.org/maternal-mortality
CDC: Maternal Mortality Rates in the US 2021
https://www.cdc.gov/nchs/data/hestat/maternal-mortality/2021/maternal-mortality-rates-2021.htm
We're going to be looking at data from the 1960s
It's right before the Civil Rights Act of 1964
And in 1966, Southern hospitals were barred from participating in Medicare unless they discontinued their longstanding practice of racial segregation
Plot of maternal mortality rates for black and white women in 1960 by Census Division
this line represents where two groups would be equal
all of the data points are above the 45-degree line
we can quickly discern that black maternal mortality rates are higher than white mortality rates
(data points are bigger for divisions with more births)
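A plot of this kind can be sketched roughly as follows. The values are toy state-level numbers reused from the reshaping examples in this deck, not the assignment's division-level results, and `toy_points` and `births_total` are hypothetical names.

```r
library(ggplot2)

# Toy values only (not the assignment results)
toy_points <- data.frame(
  state        = c("Alabama", "Texas"),
  mort_white   = c(2.3, 3.6),
  mort_black   = c(5.0, 4.4),
  births_total = c(430, 662)
)

p_equal <- ggplot(toy_points, aes(x = mort_white, y = mort_black)) +
  # bigger points for rows with more births
  geom_point(aes(size = births_total)) +
  # 45-degree line: where the two rates would be equal
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  coord_fixed(xlim = c(0, 6), ylim = c(0, 6))

# TRUE for every row means every point sits above the 45-degree line
toy_points$mort_black > toy_points$mort_white
```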
List of regions of the United States
think of these divisions as being a collection of geographically proximate states
The U.S. Census Bureau defines the regions and divisions (see image on the next slide)
Maternal deaths from the Multiple Causes of Death
https://www.nber.org/research/data/mortality-data-vital-statistics-nchs-multiple-cause-death-data
Birth counts from the Natality files
https://www.nber.org/research/data/vital-statistics-natality-birth-data
We need both the deaths and birth data so we can create our maternal mortality rate variable
Maternal mortality rate is defined as:
the number of deaths from maternal causes, like infections, per 100,000 live births
Similar to our full joins before, except using c() to combine 2 variables for the join to work correctly:
#################################################
# STEP 4. JOIN DATA FILES #
# full join using both state and race variables #
#################################################
all_data <- full_join(data_deaths, data_births,
by = c("state", "race"))
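A small made-up example shows what joining on two variables does. The tables `toy_deaths` and `toy_births` are hypothetical, the death counts are invented, and one state/race pair is deliberately missing from the deaths table.

```r
library(dplyr)

# Toy deaths table: note there is no row for Texas / black
toy_deaths <- tibble(
  state  = c("Alabama", "Alabama", "Texas"),
  race   = c("white", "black", "white"),
  deaths = c(3, 4, 6)
)

toy_births <- tibble(
  state  = c("Alabama", "Alabama", "Texas", "Texas"),
  race   = c("white", "black", "white", "black"),
  births = c(255, 175, 390, 272)
)

# Rows match only when BOTH state and race agree
toy_joined <- full_join(toy_deaths, toy_births, by = c("state", "race"))
toy_joined
```

Because it is a full join, the Texas / black row is kept with `NA` in `deaths` instead of being dropped.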
This is what that new variable looks like in code:
########################################
# STEP 5. CREATE DEATH RATE PER BIRTHS #
########################################
all_data <- mutate(all_data,
mort_rate = deaths / births * 100000)
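To sanity-check the formula, it can be run on numbers that are easy to verify by hand. The counts below are invented, and `toy_counts` and `toy_rates` are hypothetical names.

```r
library(dplyr)

# Invented counts chosen so the rates are easy to verify by hand
toy_counts <- tibble(
  state  = c("A", "B"),
  deaths = c(5, 12),
  births = c(20000, 24000)
)

toy_rates <- mutate(toy_counts,
                    mort_rate = deaths / births * 100000)

toy_rates$mort_rate   # 25 and 50 deaths per 100,000 live births
```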
What does c() stand for?
Slido: What is the average income?
Weighted average:
(4 x $60,000 + 2 x $30,000) / 6 = $50,000
Sometimes not all samples are equal
Sometimes not all data points are equal
For example, a region with a larger population might carry more weight in environmental data
We need to account for the fact that some data points deserve more weight than others
We will be using a weighted collapse in our assignment
Here we are using number of births to weight the data:
sum_div <- collap(all_data, mort_rate ~ division + race,
w = ~births, keep.w = TRUE, FUN = fmean)
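One way to check what a weighted collapse does is to reproduce it with base R's `weighted.mean()` inside a dplyr `summarise()`. This is a swapped-in technique for illustration, not the `collap()` call used in the assignment, and `toy_states` holds invented numbers.

```r
library(dplyr)

# The slide's income example: 4 people at $60,000 and 2 people at $30,000
weighted.mean(c(60000, 30000), w = c(4, 2))   # 50000
mean(c(60000, 30000))                         # 45000 (unweighted)

# Invented state-level rows inside a single division
toy_states <- tibble(
  division  = c("South", "South", "South", "South"),
  race      = c("white", "white", "black", "black"),
  mort_rate = c(20, 40, 60, 80),
  births    = c(100, 300, 50, 150)
)

# Collapse to one row per division and race, weighting rates by births
toy_div <- summarise(
  group_by(toy_states, division, race),
  mort_rate = weighted.mean(mort_rate, w = births),
  births    = sum(births),
  .groups   = "drop"
)
toy_div
```

The weighted rates come out to 35 (white) and 75 (black), while the unweighted means would be 30 and 70, because the states with more births pull the average toward their own rates.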
The data we work with is in a long format when we import it
After a weighted collapse we need to convert our data to wide format
This is to make sure that we have the variables that we are interested in
Example of our data in long format:
state | race | mort |
---|---|---|
Alabama | white | 2.3 |
Alabama | black | 5.0 |
Texas | white | 3.6 |
Texas | black | 4.4 |
Example of our data in wide format:
state | white | black |
---|---|---|
Alabama | 2.3 | 5.0 |
Texas | 3.6 | 4.4 |
What if we want to reshape the data from long to wide but also keep multiple variables?
Example of our data in long format:
state | race | mort | births |
---|---|---|---|
Alabama | white | 2.3 | 255 |
Alabama | black | 5.0 | 175 |
Texas | white | 3.6 | 390 |
Texas | black | 4.4 | 272 |
An example where we "Glue" variable names together:
#####################
# STEP 7. RESHAPE WIDE
#####################
wide_data <- pivot_wider(sum_div,
names_from = race,
values_from = c("births", "mort_rate"),
names_glue = "{.value}_{race}")
* best practice is to put the value name first, then the names variable next, e.g. mort_white
Example of our new data in wide format:
state | mort_white | mort_black | births_white | births_black |
---|---|---|---|---|
Alabama | 2.3 | 5.0 | 255 | 175 |
Texas | 3.6 | 4.4 | 390 | 272 |
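The long and wide tables above can be reproduced with a short sketch. It types in the toy values from the long-format table by hand; `toy_long` and `toy_wide_glued` are hypothetical names, and the column is called `mort` here to match the example tables.

```r
library(tidyr)

# Toy values copied from the long-format example table
toy_long <- data.frame(
  state  = c("Alabama", "Alabama", "Texas", "Texas"),
  race   = c("white", "black", "white", "black"),
  mort   = c(2.3, 5.0, 3.6, 4.4),
  births = c(255, 175, 390, 272)
)

# One row per state; value name first, then the race label
toy_wide_glued <- pivot_wider(toy_long,
                              names_from  = race,
                              values_from = c(mort, births),
                              names_glue  = "{.value}_{race}")

names(toy_wide_glued)
```

Each state ends up on a single row, with one column per combination of value and race.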
1. Clean up the environment, load libraries
2. Import births file and deaths file
3. Clean up data (rename variable names)
4. Join data objects by state and race
5. Create deaths per births variable
6. Collapse by division with weights
7. Reshape from long-to-wide
8. Make weighted scatterplot (and export it!)
9. Export our final wide format data as well
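The two export steps (8 and 9) might look roughly like the sketch below. Here `path_main` points at a temporary folder so the example is harmless to run; in the assignment it would be your own project directory. The file names and the toy data frame are made up.

```r
library(ggplot2)
library(readr)

# Stand-in for the path_main global; tempdir() is only for this sketch
path_main <- tempdir()

# Toy wide-format data (not the assignment results)
toy_final <- data.frame(
  state      = c("Alabama", "Texas"),
  mort_white = c(2.3, 3.6),
  mort_black = c(5.0, 4.4)
)

toy_plot <- ggplot(toy_final, aes(x = mort_white, y = mort_black)) +
  geom_point()

# Step 8: export the chart programmatically
plot_path <- file.path(path_main, "toy_scatter.png")
ggsave(plot_path, plot = toy_plot, width = 6, height = 4)

# Step 9: export the final data
data_path <- file.path(path_main, "toy_final.csv")
write_csv(toy_final, data_path)
```

Building paths with `file.path()` and a single `path_main` variable keeps the script portable between computers.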
Going to be published on Canvas today
Assignment is going to be due this Friday
Due Date: Friday May 16, 2025 at 11:59 pm PT