ENVIRON-175

Programming with Big Environmental Datasets

Gleb Satyukov
Senior Research Engineer | Data Science Instructor

Slido

Attendance etc

Slido

https://app.sli.do/event/tTrBqrBFyTcL4xBSudvJjZ

Schedule

Gleb's Office Hours

Wednesdays between 3pm and 4pm on Zoom

Fridays between 3pm and 4pm on Zoom

Gleb’s personal Zoom link is: https://ucla.zoom.us/j/6935808910

Kaitlynn's Office Hours

Mondays between 12pm (noon) and 1pm

Wednesdays between 11am and 12pm (noon)

Kaitlynn’s personal Zoom link is: https://ucla.zoom.us/j/8321830416

Join Our Slack Community

Important announcements in the #general channel

click here to join

Slack #classroom

Class announcements are in #general

#classroom is used during class

#tech-support for anything tech related

#team-1, #team-2, #team-3, #team-4, #team-5

Slides

R Advanced 1: https://environ-175.com/advanced/1

R Advanced 2: https://environ-175.com/advanced/2

R Advanced 3: https://environ-175.com/advanced/3

R Advanced 4: https://environ-175.com/advanced/4

R Advanced 5: https://environ-175.com/advanced/5

Slides

R Spatial 1: https://environ-175.com/spatial/1

R Spatial 2: https://environ-175.com/spatial/2

R Spatial 3: https://environ-175.com/spatial/3

R Spatial 4: https://environ-175.com/spatial/4

R Spatial 5: https://environ-175.com/spatial/5

What did we learn so far?

File management

Following instructions

Reading csv, tsv, txt files

Working with large datasets

Creating checklists and workplans

Breaking work down into smaller parts

Collaborating and solving problems together

Slido

How difficult did you find Project 1?

Slido

How are you going to score on the next midterm project?

Agenda for today

Join Operations

Data Assignment

New functions!


left_join()

right_join()

full_join()

New library!


# install janitor library
install.packages("janitor")

So far..

we have only worked with 1 dataset at a time

everything is about to change

with multiple datasets

Join

combine two datasets based on a key

matching rows from one table to another

useful when working with two or more datasets

Venn Diagram

Example A



# Table A
students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)

Example A

Example B



# Table B
grades <- data.frame(
  student_id = c(1, 2, 4),
  grade = c("A", "B", "C")
)

Example B

Side by side

Venn Diagram

Slido

Which grade did Charlie get?

DPLYR

Join functions in dplyr package

left_join()

a left join keeps all observations in A

right_join()

a right join keeps all observations in B

full_join()

a full join keeps all observations in A and B

Left Join


library(dplyr)

#                      ( A )    ( B )
#                        ^        ^
#                        |        |

combined <- left_join(students, grades, by = "student_id")

View(combined)

Left Join Result

Whose grade is this?

Left Join

Left Join Result

Right Join


library(dplyr)

#                       ( A )    ( B )
#                         ^        ^
#                         |        |

combined <- right_join(students, grades, by = "student_id")

View(combined)

Right Join Result

Where's Charlie's grade?

Right Join

Right Join Result

Full Join


library(dplyr)

#                      ( A )    ( B )
#                        ^        ^
#                        |        |

combined <- full_join(students, grades, by = "student_id")

View(combined)

Full Join Result

Full Outer Join

Full Join Result
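
The three joins can be compared side by side on the students and grades tables. This is a minimal sketch; the object names left, right, and full are chosen here only for illustration.

```r
library(dplyr)

# Table A and Table B from the examples
students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)
grades <- data.frame(
  student_id = c(1, 2, 4),
  grade = c("A", "B", "C")
)

left  <- left_join(students, grades, by = "student_id")   # keeps ids 1, 2, 3
right <- right_join(students, grades, by = "student_id")  # keeps ids 1, 2, 4
full  <- full_join(students, grades, by = "student_id")   # keeps ids 1, 2, 3, 4

nrow(left)   # 3
nrow(right)  # 3
nrow(full)   # 4
```

Charlie (id 3) has no row in grades, so his grade is NA in the left and full joins, and he is absent from the right join.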

NA

What is NA?

NA in R: Stands for "Not Available"
This is how R marks a missing value

Missing Values and `NA` in R

Missing Values: Sometimes data is incomplete:
a value might be missing for any reason
Why does this happen?
- Data entry errors
- Information was never collected
- Different sources of data don't match up perfectly
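
A short sketch of how NA behaves in calculations. The numbers are made up for illustration and the variable name tree_cover is only an example.

```r
# A small vector with one missing value (made-up numbers)
tree_cover <- c(0.12, NA, 0.30)

# is.na() tests each value and returns TRUE where the value is missing
missing_flags <- is.na(tree_cover)

# Summing the TRUE/FALSE flags counts the missing values
n_missing <- sum(missing_flags)

# Any calculation that involves NA returns NA ...
mean_with_na <- mean(tree_cover)

# ... unless the missing values are explicitly removed with na.rm = TRUE
mean_without_na <- mean(tree_cover, na.rm = TRUE)
```

Here n_missing is 1, mean_with_na is NA, and mean_without_na is the average of the two remaining values.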

Inner Join

Inner Join Result
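
An inner join keeps only the observations whose key appears in both A and B. dplyr provides inner_join() for this; the sketch below applies it to the students and grades tables.

```r
library(dplyr)

students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)
grades <- data.frame(
  student_id = c(1, 2, 4),
  grade = c("A", "B", "C")
)

# Only ids 1 and 2 exist in both tables, so only those rows are kept
combined <- inner_join(students, grades, by = "student_id")

nrow(combined)   # 2
```

Because unmatched rows are dropped, the result of an inner join contains no NA values introduced by the join itself.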

Best Practice

Keep all of the mismatched data

You want to be able to see where data is missing in order to determine the scope of the problem

ggplot2 skips rows with missing values when plotting (and prints a warning), so there is no need to drop them beforehand

Use full_join() to keep all the data
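
One way to see the scope of a mismatch, sketched on the students and grades tables: after a full join, rows with NA are the ones that did not match. dplyr's anti_join() returns the rows of the first table that have no match in the second. The object names below are chosen for illustration.

```r
library(dplyr)

students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)
grades <- data.frame(
  student_id = c(1, 2, 4),
  grade = c("A", "B", "C")
)

# Keep everything, then look at the rows where one side is missing
all_rows <- full_join(students, grades, by = "student_id")
unmatched <- filter(all_rows, is.na(name) | is.na(grade))

# Students without a grade, and grades without a student
no_grade   <- anti_join(students, grades, by = "student_id")
no_student <- anti_join(grades, students, by = "student_id")
```

Here no_grade contains Charlie and no_student contains the grade recorded for id 4.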

Environmental Justice

Distributional Justice
Where environmental benefits and harms are located (big data can help by mapping and quantifying disparities)
Procedural Justice
Who has influence in decision-making processes (big data here is not enough - requires qualitative assessment)

Residential Segregation

LA county has a high degree of residential segregation

https://en.wikipedia.org/wiki/Residential_segregation_in_the_United_States

Tree Canopy

We're going to look at tree canopy cover

Because of the connection to climate change

More tree canopy cover means more shade and therefore more protection from hot weather

NLCD Data

The US Forest Service produces the tree canopy cover layer of the National Land Cover Database (NLCD)

30 meter x 30 meter pixel maps of tree canopy cover

https://www.usgs.gov/centers/eros/science/national-land-cover-database

Census Tracts

a geographic region defined for the purpose of taking a census: https://en.wikipedia.org/wiki/Census_tract

2020 Census Tract Identifier

LA County Census Tracts

Joining the data

What if there are missing values?

Checklist

1. Clean environment, load libraries

2. Import NLCD and Census data

3. Clean up our data / fix variable names

4. Create a consistent ID variable in both data frames

5. Join the Census and NLCD data by tract ID

6. Make a scatterplot (and export/save it!)

Janitor Library

Helps with all things concerned with cleaning


# install janitor library
install.packages("janitor")

# load janitor library
library(janitor)

# clean variable names
trees_clean <- clean_names(trees)
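
A small sketch of what clean_names() does. The data frame below is hypothetical (it is not the real NLCD file) and only serves to show messy column names being converted to lowercase names with underscores.

```r
library(janitor)

# Hypothetical data frame with spaces and capitals in the column names
trees <- data.frame(
  "Tract ID" = c("06037101110", "06037101122"),
  "Tree Cover" = c(0.12, 0.30),
  check.names = FALSE   # keep the messy names exactly as written
)

names(trees)          # "Tract ID" "Tree Cover"

trees_clean <- clean_names(trees)

names(trees_clean)    # "tract_id" "tree_cover"
```

Clean names are easier to type and can be used after `$` without backticks.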

Paste Function

The paste() function concatenates (combines) objects after converting them to character vectors


paste("hello", "world")

[1] "hello world"

Sep Argument

sep is the character string placed between the terms; the default is a single space (sep = " ")


# check the documentation
?paste

# paste function signature
paste(..., sep = " ", collapse = NULL, recycle0 = FALSE)

# paste0 is used when you don't want a separator
paste0(..., collapse = NULL, recycle0 = FALSE)
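
A few illustrative calls showing how sep and collapse change the result. The strings are arbitrary examples.

```r
# A custom separator between the terms
date_string <- paste("2025", "05", "02", sep = "-")     # "2025-05-02"

# paste0() is the same as paste() with sep = ""
no_sep_a <- paste("tract", "id", sep = "")              # "tractid"
no_sep_b <- paste0("tract", "id")                       # "tractid"

# Both functions are vectorized: one result per element
channels <- paste0("#team-", 1:5)                       # "#team-1" ... "#team-5"

# collapse joins the elements of one vector into a single string
one_string <- paste(c("a", "b", "c"), collapse = ", ")  # "a, b, c"
```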

Building a path

We are going to build a path to our data files

Create a separate directory for this assignment

Locate the directory where you stored the data

Append filenames to the main directory path

Helps us keep variable definitions short

Example path:

Path to the folder with data files for the assignment:


# Path leading to the main assignment directory

path_main <- "/Users/gleb/Dropbox/UCLA/ENVIRON-175/Advanced-1/"


Example path:

Combined path to the specific data files:


# Combined path to the tree data file
tree_data_path <- paste0(path_main, "nlcd_trees_LA.csv")

# Combined path to the census data file
census_data_path <- paste0(path_main, "census_poverty_LA.txt")
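
A runnable sketch of the same idea. Since the assignment folder is different on every computer, a temporary directory stands in for it here, and the file example_trees.csv is a made-up file written by the example itself.

```r
# Hypothetical main directory: a temporary folder instead of the real one
path_main <- paste0(tempdir(), "/")

# Combined path to a made-up data file
example_path <- paste0(path_main, "example_trees.csv")

# Write a tiny CSV so there is something to read back
write.csv(
  data.frame(tractid = c("06037101110", "06037101122"),
             tree_cover = c(0.12, 0.30)),
  example_path,
  row.names = FALSE
)

file.exists(example_path)   # TRUE

# colClasses keeps tractid as text so the leading zero is not dropped
example_trees <- read.csv(example_path,
                          colClasses = c(tractid = "character"))
```

Note that path_main ends with a slash; without it, paste0() would glue the folder name and the filename together.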

Combining Variables

Here we'll combine 3 values into 1, creating a new variable:


# Use paste0 to combine state, county and tract into 1 ID

poverty_clean <- mutate(poverty, tractid=paste0(state, county, tract))
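
A 2020 census tract identifier has 11 digits: 2 for the state, 3 for the county, and 6 for the tract. If the three columns were imported as numbers, the leading zeros are lost and paste0() alone produces a shorter ID. The sketch below assumes numeric columns and uses illustrative rows; whether padding is needed depends on how the real file was imported. The column name tractid_plain is hypothetical.

```r
library(dplyr)

# Illustrative rows, assuming the codes were read in as numbers
poverty <- data.frame(
  state  = c(6, 6),
  county = c(37, 37),
  tract  = c(101110, 101122)
)

poverty_clean <- mutate(
  poverty,
  # Without padding the zeros are missing: "637101110" (9 characters)
  tractid_plain = paste0(state, county, tract),
  # sprintf() pads each part with leading zeros to a fixed width
  tractid = paste0(sprintf("%02d", as.integer(state)),
                   sprintf("%03d", as.integer(county)),
                   sprintf("%06d", as.integer(tract)))
)

nchar(poverty_clean$tractid_plain)   # 9 9
nchar(poverty_clean$tractid)         # 11 11
```

If the columns were imported as character with the zeros intact, the plain paste0() call is enough.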


Check data types

The ID column must have the same data type in both data frames for the join to work


# Check tractid classes in both data sets
class(trees_clean$tractid)
class(poverty_clean$tractid)

# Or using `str`
str(trees_clean$tractid)
str(poverty_clean$tractid)
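
If the classes differ, one of the columns has to be converted before joining; dplyr refuses to join a character key to a numeric key. A minimal sketch with made-up values, assuming the poverty table holds the ID as a number:

```r
library(dplyr)

# Illustrative data: the same ID stored as text in one table, as a number in the other
trees_clean   <- data.frame(tractid = c("101110", "101122"),
                            tree_cover = c(0.12, 0.30))
poverty_clean <- data.frame(tractid = c(101110, 101122),
                            povrate = c(0.25, 0.10))

class(trees_clean$tractid)     # "character"
class(poverty_clean$tractid)   # "numeric"

# Convert the numeric ID to character so both keys have the same type
poverty_clean <- mutate(poverty_clean, tractid = as.character(tractid))

# Now the join works
alldata <- full_join(trees_clean, poverty_clean, by = "tractid")
```

Character is usually the safer choice for ID columns, because it preserves leading zeros.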

Check number of rows

Use nrow() to check the number of rows (see ?nrow for the documentation):


nrow(trees)
nrow(poverty)
nrow(alldata)

nrow(filter(alldata, is.na(povrate)))
nrow(filter(alldata, is.na(tree_cover)))
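
A self-contained sketch of how to read these counts. The tract IDs "A" to "D" and all values are made up; only the column names follow the ones used above.

```r
library(dplyr)

# Made-up data: tracts B and C appear in both tables, A and D in only one
trees   <- data.frame(tractid = c("A", "B", "C"),
                      tree_cover = c(0.12, 0.30, 0.05))
poverty <- data.frame(tractid = c("B", "C", "D"),
                      povrate = c(0.25, 0.10, 0.40))

alldata <- full_join(trees, poverty, by = "tractid")

nrow(trees)     # 3
nrow(poverty)   # 3
nrow(alldata)   # 4: more rows than either input means some tracts did not match

# Rows that came only from trees have no poverty rate, and vice versa
n_no_poverty <- nrow(filter(alldata, is.na(povrate)))      # 1 (tract A)
n_no_trees   <- nrow(filter(alldata, is.na(tree_cover)))   # 1 (tract D)
```

If the joined table has exactly as many rows as each input and both NA counts are zero, every tract matched.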

Tree coverage vs poverty rates

Don't forget to use ggsave!


library(ggplot2)

ggplot(alldata, aes(y=tree_cover, x=povrate)) +
  geom_point(shape=21,
             color="blue",
             fill = "darkgray",
             alpha=0.3,
             size = 2) +
  labs(y = "Fraction tree canopy",
       x = "Poverty rate") +
  theme(panel.background = element_rect(fill="white"),
        axis.line = element_line(color = "gray"),
        axis.ticks = element_line(color = "gray"))

ggsave("/Users/gleb/Dropbox/UCLA/ENVIRON-175/Advanced-1/scatter_canopy_poverty.png",
       height = 4, width = 6)

Final Result

R Advanced 1: Assignment 6

Will be published on Canvas today

The assignment is due this Friday

Due Date: Friday May 2, 2025 at 11:59 pm PT
