ENVIRON-175

Programming with Big Environmental Datasets



Gleb Satyukov
Senior Research Engineer | Data Science Instructor


https://app.sli.do/event/naLBvxc6gj7Qzp4R5ec66A

Slides


R Basics 1: https://environ-175.com/basics/1

R Basics 2: https://environ-175.com/basics/2

R Basics 3: https://environ-175.com/basics/3

R Basics 4: https://environ-175.com/basics/4

R Basics 5: https://environ-175.com/basics/5

Slides


R Advanced 1: https://environ-175.com/advanced/1

R Advanced 2: https://environ-175.com/advanced/2

R Advanced 3: https://environ-175.com/advanced/3

R Advanced 4: https://environ-175.com/advanced/4

R Advanced 5: https://environ-175.com/advanced/5

Slides


R Spatial 1: https://environ-175.com/spatial/1

R Spatial 2: https://environ-175.com/spatial/2

R Spatial 3: https://environ-175.com/spatial/3

R Spatial 4: https://environ-175.com/spatial/4

R Spatial 5: https://environ-175.com/spatial/5

Schedule

Agenda for today

Review Date Formatting

Adding regression lines in ggplot

Review Full Join (with 2 variables)

Reshape data from long to wide format

Why do we need to perform a weighted collapse?

New functions!


            geom_smooth(data=..., method = "...")

            write_csv(data, path)

            pivot_wider(...)
            

Reminder about the best practices

Attention to detail

Clean your environment

Use proper file paths

Use proper code spacing

Use inline and block comments!!

Use correct variable names (lowercase and underscores)

Save charts programmatically with ggsave

Best Practices 2.0

Use global variables

Set a directory using path_main

Keep your data in a dedicated data folder

Inspect the data after loading using head()

Be consistent with your use of quotes (' vs ")

Make sure to export both graphs and final data (using write_csv(data, path))

Follow instructions in the assignments exactly
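
A minimal sketch pulling several of these practices together; the folder layout and file names here are hypothetical, not from the assignment:

library(readr)

# global variable for the project root (hypothetical path)
path_main <- "~/environ-175"

# keep raw data in a dedicated data folder
data <- read_csv(file.path(path_main, "data", "example.csv"))

# inspect the data right after loading
head(data)

# export the final data (graphs are exported separately with ggsave)
write_csv(data, file.path(path_main, "output", "example_clean.csv"))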

Slido


Best Practices









Internal Representation


Our data is originally stored as raw text

It is converted to an internal object in R


as.character(...) -> "character" type

as.numeric(...) -> "numeric" type

New addition: as.Date(...) -> "Date" type!
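
A quick sketch of these conversions (the values are illustrative):

x_chr  <- as.character(42)        # "42"
x_num  <- as.numeric("3.14")      # 3.14
x_date <- as.Date("2021-03-14")   # ISO dates parse without a format string

class(x_chr)   # "character"
class(x_num)   # "numeric"
class(x_date)  # "Date"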

Date Formats in R

📅 Date Formats


Some examples of date formatting:


as.Date("03/14/2021", format = "%m/%d/%Y")

as.Date("14-Mar-21", format = "%d-%b-%y")

as.Date("Sunday, March 14, 2021", format = "%A, %B %d, %Y")
          

Note: as.Date() can only handle dates, not times!
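
For example, any time-of-day information is simply dropped:

as.Date("2021-03-14 15:30")   # returns "2021-03-14" -- the time is lost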

Parsing Dates with Lubridate



library(lubridate)

date_string <- "Sunday, March 14, 2021 3:30pm PT"

# Remove the time zone abbreviation so the date-time parses cleanly
date_clean <- sub(" PT$", "", date_string)

# Parse using appropriate format
parsed_date <- parse_date_time(date_clean, orders = "A, B d, Y I:Mp")

# Optionally, assign time zone
parsed_date <- force_tz(parsed_date, tzone = "America/Los_Angeles")
          

FORMAT Function


Once we have our date parsed, it is stored as a date-time ("POSIXct") object


class(parsed_date)
[1] "POSIXct" "POSIXt"
          

We can now use functions like:


format(parsed_date, "%A, %B %d, %Y at %I:%M %p %Z")
          

And strftime:


strftime(parsed_date, "%Y-%m-%d %H:%M")
          
Specifier   Meaning                    Example
%Y          4-digit year               2021
%y          2-digit year               21
%B          Full month name            March
%b          Abbreviated month name     Mar
%m          Month number               03
%d          Day of the month           14
%A          Full weekday name          Sunday
%a          Abbreviated weekday name   Sun
%I          Hour (12-hour clock)       03
%H          Hour (24-hour clock)       15
%M          Minute                     30
%p          AM/PM                      PM
%Z          Time zone abbreviation     PDT

STRPTIME


Stands for "string parse time" (strftime = "string format time")

The strptime/strftime format specifiers are used across many languages

This is a very common way to format dates and times


Check out this tool for help with date formats: https://www.pythonmorsels.com/strptime/

Smooth









Smoothing Methods



geom_smooth()

geom_smooth(data = ...)

geom_smooth(data = ..., method = "lm")
geom_smooth(data = ..., method = "loess")
          

Spacing!


Friendly reminder to use more spaces




Let's revisit Advanced 4 Data Assignment

Smoothing Methods



geom_smooth()

geom_smooth(data = ...)

geom_smooth(data = ..., method = "lm")
geom_smooth(data = ..., method = "gam")
geom_smooth(data = ..., method = "loess")
          

Linear regression (lm)

Generalized additive models with integrated smoothness estimation (gam)

Local polynomial regression (loess)

Linear vs Local Polynomial

When to use geom_smooth(method="lm")



geom_smooth(method = "lm")
          

Fits a straight line to the data

Assumes a linear relationship between x and y

Great with large datasets (more efficient than loess)

When to use geom_smooth(method="loess")



geom_smooth(method = "loess")
          

Stands for LOcal regrESSion

Fits a smooth, flexible curve to the data

The relationship between variables appears nonlinear

If you want a visual trend, not a predictive model

Great with small to medium datasets (< 1000 points)
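
A minimal, self-contained sketch comparing the two methods side by side; it uses R's built-in mtcars data, which is our choice for illustration, not the assignment data:

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +     # straight line
    geom_smooth(method = "loess", se = FALSE, color = "red")     # flexible local curve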

ggplot2 Smoothing Methods


# uses all of the data by default
geom_smooth()

# using all of our data explicitly
geom_smooth(data = fbi_nowkd)

# filter to crime in 2021 only
geom_smooth(data = filter(fbi_nowkd, year == 2021))

# using the local polynomial regression method for smoothing
geom_smooth(data = filter(fbi_nowkd, year == 2021), method = "loess")

# split regression lines to before and after the DST change
geom_smooth(
    data = filter(fbi_nowkd, year == 2021 & days_away < 0),
    method = "loess"
)
geom_smooth(
    data = filter(fbi_nowkd, year == 2021 & days_away > 0),
    method = "loess"
)
          

Maternal

Mortality

Maternal Mortality


Number of maternal deaths by country, 1985 to 2020

https://ourworldindata.org/maternal-mortality


CDC: Maternal Mortality Rates in the US 2021

https://www.cdc.gov/nchs/data/hestat/maternal-mortality/2021/maternal-mortality-rates-2021.htm

Maternal Mortality


We're going to be looking at data from the 1960s

It's right before the Civil Rights Act of 1964

And in 1966, Southern hospitals were barred from participating in Medicare unless they discontinued their longstanding practice of racial segregation

Plot of maternal mortality rates for black and white women in 1960 by Census Division

45-Degree Line


this line represents where the two groups would be equal

all of the data points are above the 45-degree line

we can quickly discern that black maternal mortality rates are higher than white maternal mortality rates

(data points are bigger for divisions with more births)
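
One way to build such a plot, sketched with ggplot2; the wide_data column names are assumed from the reshape later in this deck, and geom_abline() draws the 45-degree reference line:

library(ggplot2)

ggplot(wide_data, aes(x = mort_rate_white, y = mort_rate_black)) +
    geom_point(aes(size = births_white + births_black)) +  # bigger points = more births
    geom_abline(slope = 1, intercept = 0) +                # the 45-degree line
    labs(x = "White maternal mortality rate",
         y = "Black maternal mortality rate")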

Alternative Visualization

Census Divisions


List of regions of the United States

think of each division as a collection of geographically proximate states

The U.S. Census Bureau defines the regions and divisions (see the map on the next slide)

There are 2 different data files


Maternal deaths from the Multiple Causes of Death

https://www.nber.org/research/data/mortality-data-vital-statistics-nchs-multiple-cause-death-data

Birth counts from the Natality files

https://www.nber.org/research/data/vital-statistics-natality-birth-data

Mortality Rate Variable


We need both the deaths and birth data so we can create our maternal mortality rate variable


Maternal mortality rate is defined as:

the number of deaths from maternal causes, like infections, per 100,000 live births
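
For example, 20 maternal deaths among 400,000 live births gives 20 / 400,000 x 100,000 = 5 deaths per 100,000 live births.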

Multi-Variable Joins


Similar to our full joins before, except we use c() to combine two variables so the join matches correctly:


#################################################
#           STEP 4. JOIN DATA FILES             #
# full join using both state and race variables #
#################################################

all_data <- full_join(data_deaths, data_births,
                by = c("state", "race"))
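
A toy illustration of the same two-variable join (the numbers are made up, not the assignment data):

library(dplyr)

data_deaths <- tibble(state  = c("Alabama", "Alabama", "Texas"),
                      race   = c("white", "black", "white"),
                      deaths = c(6, 9, 14))

data_births <- tibble(state  = c("Alabama", "Alabama", "Texas"),
                      race   = c("white", "black", "white"),
                      births = c(255000, 175000, 390000))

# rows match only when BOTH state and race agree
full_join(data_deaths, data_births, by = c("state", "race"))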
          

Mortality Rate Variable


This is what that new variable looks like in code:


########################################
# STEP 5. CREATE DEATH RATE PER BIRTHS #
########################################

all_data <- mutate(all_data,
      mort_rate = deaths / births * 100000)
          

Slido


What does c() stand for?









Weighted

Averages









Weighted Average

Population Sizes


Slido: What is the average income?

Population Sizes


Weighted average:

(4 x $60,000 + 2 x $30,000) / 6 = $50,000
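
In R, base weighted.mean() does this calculation directly:

income <- c(60000, 30000)
people <- c(4, 2)                    # 4 people earn $60k, 2 earn $30k
weighted.mean(income, w = people)    # 50000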

Weighted Averages


Sometimes not all samples are equal

Sometimes not all data points are equal

For example, a region with a larger population might carry more weight in environmental data

We need to account for the fact that some data points deserve more weight than others

Weighted Averages


We will be using a weighted collapse in our assignment

Here we are using number of births to weight the data:


# collap() comes from the {collapse} package
sum_div <- collap(all_data, mort_rate ~ division + race,
                  w = ~births, keep.w = TRUE, FUN = fmean)
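
If collap() is unfamiliar, here is a dplyr sketch of the same weighted group means; this is our rewrite for illustration, not the assignment code:

library(dplyr)

sum_div <- all_data %>%
    group_by(division, race) %>%
    summarise(mort_rate = weighted.mean(mort_rate, w = births),
              births    = sum(births),    # keep the weights, like keep.w = TRUE
              .groups   = "drop")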
          

Reshaping the data


The data we work with is in a long format when we import it

After a weighted collapse we need to convert our data to wide format

This gives each group its own column, so the variables we want to compare sit side by side

Long Format


Example of our data in long format:

state    race   mort
Alabama  white  2.3
Alabama  black  5.0
Texas    white  3.6
Texas    black  4.4

Wide Format

Example of our data in wide format:

state    white  black
Alabama  2.3    5.0
Texas    3.6    4.4

Side-By-Side


state    race   mort
Alabama  white  2.3
Alabama  black  5.0
Texas    white  3.6
Texas    black  4.4

state    white  black
Alabama  2.3    5.0
Texas    3.6    4.4
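
The reshape that turns the long table into the wide one, sketched with pivot_wider() on the toy data above:

library(tidyr)
library(tibble)

long_data <- tibble(state = c("Alabama", "Alabama", "Texas", "Texas"),
                    race  = c("white", "black", "white", "black"),
                    mort  = c(2.3, 5.0, 3.6, 4.4))

# race values become column names; mort values fill them in
pivot_wider(long_data, names_from = race, values_from = mort)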



Glue options for multiple variables


what if we want to reshape the data from long to wide

but we also want to keep multiple variables?

Reshape with multiple variables


Example of our data in long format:

state    race   mort  births
Alabama  white  2.3   255
Alabama  black  5.0   175
Texas    white  3.6   390
Texas    black  4.4   272

Reshape with multiple variables


An example where we "Glue" variable names together:


########################
# STEP 7. RESHAPE WIDE #
########################

wide_data <- pivot_wider(sum_div,
                         names_from = race,
                         values_from = c("births", "mort_rate"),
                         names_glue = "{.value}_{race}")
          

* best practice is to put the value name first, then the names variable next, e.g. mort_rate_white

Reshape with multiple variables


Example of our new data in wide format:

state    mort_rate_white  mort_rate_black  births_white  births_black
Alabama  2.3              5.0              255           175
Texas    3.6              4.4              390           272

Workplan / Checklist


1. Clean up the environment, load libraries

2. Import births file and deaths file

3. Clean up data (rename variables)

4. Join data objects by state and race

5. Create deaths per births variable

6. Collapse by division with weights

7. Reshape from long-to-wide

8. Make weighted scatterplot (and export it!)

9. Export our final wide format data as well

Final Result

R Advanced 5: Assignment 10


Will be published on Canvas today

The assignment is due this Friday

Due date: Friday, May 16, 2025 at 11:59 pm PT