ENVIRON-175

Programming with Big Environmental Datasets



Gleb Satyukov
Senior Research Engineer | Data Science Instructor


https://app.sli.do/event/naLBvxc6gj7Qzp4R5ec66A

Slides


R Basics 1: https://environ-175.com/basics/1

R Basics 2: https://environ-175.com/basics/2

R Basics 3: https://environ-175.com/basics/3

R Basics 4: https://environ-175.com/basics/4

R Basics 5: https://environ-175.com/basics/5

Slides


R Advanced 1: https://environ-175.com/advanced/1

R Advanced 2: https://environ-175.com/advanced/2

R Advanced 3: https://environ-175.com/advanced/3

R Advanced 4: https://environ-175.com/advanced/4

R Advanced 5: https://environ-175.com/advanced/5

Slides


R Spatial 1: https://environ-175.com/spatial/1

R Spatial 2: https://environ-175.com/spatial/2

R Spatial 3: https://environ-175.com/spatial/3

R Spatial 4: https://environ-175.com/spatial/4

R Spatial 5: https://environ-175.com/spatial/5

Schedule

Agenda for today

Review Date Formatting

Adding regression lines in ggplot

Review Full Join (with 2 variables)

Reshape data from long to wide format

Why do we need to perform a weighted collapse?

New functions!


            geom_smooth(data=..., method = "...")

            write_csv(data, path)

            pivot_wider(...)
            

Reminder about the best practices

Attention to detail

Clean your environment

Use proper file paths

Use proper code spacing

Use inline and block comments!!

Use correct variable names (lowercase and underscores)

Save charts programmatically with ggsave

Best Practices 2.0

Use global variables

Set a directory using path_main

Keep your data in a dedicated data folder

Inspect the data after loading using head()

Be consistent with your use of quotes (' vs ")

Make sure to export both graphs and final data (using write_csv(data, path))

Follow instructions in the assignments exactly
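
A minimal sketch pulling several of these practices together; the folder layout and file names here are hypothetical, not from the assignment:

library(readr)

# global variable for the project root (hypothetical path)
path_main <- "~/environ-175"

# keep raw data in a dedicated data folder
data <- read_csv(file.path(path_main, "data", "example.csv"))

# inspect the data right after loading
head(data)

# export the final data (graphs are exported separately with ggsave)
write_csv(data, file.path(path_main, "output", "example_clean.csv"))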

Slido


Best Practices









Internal Representation


Our data is originally stored as raw text

It is converted to an internal object in R


as.character(...) -> "character" type

as.numeric(...) -> "numeric" type

New addition: as.Date(...) -> "Date" type!
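
A quick sketch of these conversions (the values are illustrative):

x_chr  <- as.character(42)        # "42"
x_num  <- as.numeric("3.14")      # 3.14
x_date <- as.Date("2021-03-14")   # ISO dates parse without a format string

class(x_chr)   # "character"
class(x_num)   # "numeric"
class(x_date)  # "Date"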

Date Formats in R

📅 Date Formats


Some examples of date formatting:


as.Date("03/14/2021", format = "%m/%d/%Y")

as.Date("14-Mar-21", format = "%d-%b-%y")

as.Date("Sunday, March 14, 2021", format = "%A, %B %d, %Y")
          

Note: as.Date() can only handle dates, not times!
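
For example, any time-of-day information is simply dropped:

as.Date("2021-03-14 15:30")   # returns "2021-03-14" -- the time is lost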

Parsing Dates with Lubridate



library(lubridate)

date_string <- "Sunday, March 14, 2021 3:30pm PT"

# Remove the time zone abbreviation so the date-time parses cleanly
date_clean <- sub(" PT$", "", date_string)

# Parse using appropriate format
parsed_date <- parse_date_time(date_clean, orders = "A, B d, Y I:Mp")

# Optionally, assign time zone
parsed_date <- force_tz(parsed_date, tzone = "America/Los_Angeles")
          

FORMAT Function


Once we have our date parsed, it is stored as a date-time ("POSIXct") object


class(parsed_date)
[1] "POSIXct" "POSIXt"
          

We can now use functions like:


format(parsed_date, "%A, %B %d, %Y at %I:%M %p %Z")
          

And strftime:


strftime(parsed_date, "%Y-%m-%d %H:%M")
          
Specifier   Meaning                    Example
%Y          4-digit year               2021
%y          2-digit year               21
%B          Full month name            March
%b          Abbreviated month name     Mar
%m          Month number               03
%d          Day of the month           14
%A          Full weekday name          Sunday
%a          Abbreviated weekday name   Sun
%I          Hour (12-hour clock)       03
%H          Hour (24-hour clock)       15
%M          Minute                     30
%p          AM/PM                      PM
%Z          Time zone abbreviation     PDT

STRPTIME


Stands for "string parse time" (strftime = "string format time")

The strptime/strftime format specifiers are used across many languages

This is a very common way to format dates and times


Check out this tool for help with date formats: https://www.pythonmorsels.com/strptime/

Smooth









Smoothing Methods



geom_smooth()

geom_smooth(data = ...)

geom_smooth(data = ..., method = "lm")
geom_smooth(data = ..., method = "loess")
          

Spacing!


Friendly reminder to use more spaces




Let's revisit Advanced 4 Data Assignment

Smoothing Methods



geom_smooth()

geom_smooth(data = ...)

geom_smooth(data = ..., method = "lm")
geom_smooth(data = ..., method = "gam")
geom_smooth(data = ..., method = "loess")
          

Linear regression (lm)

Generalized additive models with integrated smoothness estimation (gam)

Local polynomial regression (loess)

Linear vs Local Polynomial

When to use geom_smooth(method="lm")



geom_smooth(method = "lm")
          

Fits a straight line to the data

Assumes a linear relationship between x and y

Great with large datasets (more efficient than loess)

When to use geom_smooth(method="loess")



geom_smooth(method = "loess")
          

Stands for LOcal regrESSion

Fits a smooth, flexible curve to the data

The relationship between variables appears nonlinear

If you want a visual trend, not a predictive model

Great with small to medium datasets (< 1000 points)
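
A minimal, self-contained sketch comparing the two methods side by side; it uses R's built-in mtcars data, which is our choice for illustration, not the assignment data:

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +     # straight line
    geom_smooth(method = "loess", se = FALSE, color = "red")     # flexible local curve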

ggplot2 Smoothing Methods


# uses all of the data by default
geom_smooth()

# using all of our data explicitly
geom_smooth(data = fbi_nowkd)

# filter to crime in 2021 only
geom_smooth(data = filter(fbi_nowkd, year == 2021))

# using the local polynomial regression method for smoothing
geom_smooth(data = filter(fbi_nowkd, year == 2021), method = "loess")

# split regression lines to before and after the DST change
geom_smooth(
    data = filter(fbi_nowkd, year == 2021 & days_away < 0),
    method = "loess"
)
geom_smooth(
    data = filter(fbi_nowkd, year == 2021 & days_away > 0),
    method = "loess"
)
          

Maternal

Mortality

Maternal Mortality


Number of maternal deaths by country, 1985 to 2020

https://ourworldindata.org/maternal-mortality


CDC: Maternal Mortality Rates in the US 2021

https://www.cdc.gov/nchs/data/hestat/maternal-mortality/2021/maternal-mortality-rates-2021.htm

Maternal Mortality


We're going to be looking at data from the 1960s

It's right before the Civil Rights Act of 1964

And in 1966, Southern hospitals were barred from participating in Medicare unless they discontinued their longstanding practice of racial segregation

Plot of maternal mortality rates for black and white women in 1960 by Census Division

45-Degree Line


this line represents where the two groups would be equal

all of the data points are above the 45-degree line

we can quickly discern that black maternal mortality rates are higher than white maternal mortality rates

(data points are bigger for divisions with more births)
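
One way to build such a plot, sketched with ggplot2; the wide_data column names are assumed from the reshape later in this deck, and geom_abline() draws the 45-degree reference line:

library(ggplot2)

ggplot(wide_data, aes(x = mort_rate_white, y = mort_rate_black)) +
    geom_point(aes(size = births_white + births_black)) +  # bigger points = more births
    geom_abline(slope = 1, intercept = 0) +                # the 45-degree line
    labs(x = "White maternal mortality rate",
         y = "Black maternal mortality rate")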

Alternative Visualization

Census Divisions


List of regions of the United States

think of each division as a collection of geographically proximate states

The U.S. Census Bureau defines the regions and divisions (see the map on the next slide)

There are 2 different data files


Maternal deaths from the Multiple Causes of Death

https://www.nber.org/research/data/mortality-data-vital-statistics-nchs-multiple-cause-death-data

Birth counts from the Natality files

https://www.nber.org/research/data/vital-statistics-natality-birth-data

Mortality Rate Variable


We need both the deaths and birth data so we can create our maternal mortality rate variable


Maternal mortality rate is defined as:

the number of deaths from maternal causes, like infections, per 100,000 live births
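
For example, 20 maternal deaths among 400,000 live births gives 20 / 400,000 x 100,000 = 5 deaths per 100,000 live births.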

Multi-Variable Joins


Similar to our full joins before, except we use c() to combine two variables so the join matches correctly:


#################################################
#           STEP 4. JOIN DATA FILES             #
# full join using both state and race variables #
#################################################

all_data <- full_join(data_deaths, data_births,
                by = c("state", "race"))
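
A toy illustration of the same two-variable join (the numbers are made up, not the assignment data):

library(dplyr)

data_deaths <- tibble(state  = c("Alabama", "Alabama", "Texas"),
                      race   = c("white", "black", "white"),
                      deaths = c(6, 9, 14))

data_births <- tibble(state  = c("Alabama", "Alabama", "Texas"),
                      race   = c("white", "black", "white"),
                      births = c(255000, 175000, 390000))

# rows match only when BOTH state and race agree
full_join(data_deaths, data_births, by = c("state", "race"))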
          

Mortality Rate Variable


This is what that new variable looks like in code:


########################################
# STEP 5. CREATE DEATH RATE PER BIRTHS #
########################################

all_data <- mutate(all_data,
      mort_rate = deaths / births * 100000)
          

Slido


What does c() stand for?









Weighted

Averages









Weighted Average

Population Sizes


Slido: What is the average income?

Population Sizes


Weighted average:

(4 x $60,000 + 2 x $30,000) / 6 = $50,000
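
In R, base weighted.mean() does this calculation directly:

income <- c(60000, 30000)
people <- c(4, 2)                    # 4 people earn $60k, 2 earn $30k
weighted.mean(income, w = people)    # 50000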

Weighted Averages


Sometimes not all samples are equal

Sometimes not all data points are equal

For example, a region with a larger population might carry more weight in environmental data

We need to account for the fact that some data points deserve more weight than others

Weighted Averages


We will be using a weighted collapse in our assignment

Here we are using number of births to weight the data:


# collap() comes from the {collapse} package
sum_div <- collap(all_data, mort_rate ~ division + race,
                  w = ~births, keep.w = TRUE, FUN = fmean)
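
If collap() is unfamiliar, here is a dplyr sketch of the same weighted group means; this is our rewrite for illustration, not the assignment code:

library(dplyr)

sum_div <- all_data %>%
    group_by(division, race) %>%
    summarise(mort_rate = weighted.mean(mort_rate, w = births),
              births    = sum(births),    # keep the weights, like keep.w = TRUE
              .groups   = "drop")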
          

Reshaping the data


The data we work with is in a long format when we import it

After a weighted collapse we need to convert our data to wide format

This gives each group its own column, so the variables we want to compare sit side by side

Long Format


Example of our data in long format:

state    race   mort
Alabama  white  2.3
Alabama  black  5.0
Texas    white  3.6
Texas    black  4.4

Wide Format

Example of our data in wide format:

state    white  black
Alabama  2.3    5.0
Texas    3.6    4.4

Side-By-Side


state    race   mort
Alabama  white  2.3
Alabama  black  5.0
Texas    white  3.6
Texas    black  4.4

state    white  black
Alabama  2.3    5.0
Texas    3.6    4.4
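
The reshape that turns the long table into the wide one, sketched with pivot_wider() on the toy data above:

library(tidyr)
library(tibble)

long_data <- tibble(state = c("Alabama", "Alabama", "Texas", "Texas"),
                    race  = c("white", "black", "white", "black"),
                    mort  = c(2.3, 5.0, 3.6, 4.4))

# race values become column names; mort values fill them in
pivot_wider(long_data, names_from = race, values_from = mort)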



Glue options for multiple variables


what if we want to reshape the data from long to wide

but we also want to keep multiple variables?

Reshape with multiple variables


Example of our data in long format:

state    race   mort  births
Alabama  white  2.3   255
Alabama  black  5.0   175
Texas    white  3.6   390
Texas    black  4.4   272

Reshape with multiple variables


An example where we "Glue" variable names together:


########################
# STEP 7. RESHAPE WIDE #
########################

wide_data <- pivot_wider(sum_div,
                         names_from = race,
                         values_from = c("births", "mort_rate"),
                         names_glue = "{.value}_{race}")
          

* best practice is to put the value name first, then the names variable next, e.g. mort_rate_white

Reshape with multiple variables


Example of our new data in wide format:

state    mort_rate_white  mort_rate_black  births_white  births_black
Alabama  2.3              5.0              255           175
Texas    3.6              4.4              390           272

Workplan / Checklist


1. Clean up the environment, load libraries

2. Import births file and deaths file

3. Clean up data (rename variables)

4. Join data objects by state and race

5. Create deaths per births variable

6. Collapse by division with weights

7. Reshape from long-to-wide

8. Make weighted scatterplot (and export it!)

9. Export our final wide format data as well

Final Result

R Advanced 5: Assignment 10


Will be published on Canvas today

The assignment is due this Friday

Due date: Friday, May 16, 2025 at 11:59 pm PT