Gleb Satyukov
Senior Research Engineer | Data Science Instructor
Wednesdays between 3pm and 4pm on Zoom
Fridays between 3pm and 4pm on Zoom
Gleb’s personal Zoom link is: https://ucla.zoom.us/j/6935808910
Mondays between 12pm (noon) and 1pm
Wednesdays between 11am and 12pm (noon)
Kaitlynn’s personal Zoom link is: https://ucla.zoom.us/j/8321830416
Class announcements are in #general
#classroom is used during class
#tech-support for anything tech related
#team-1, #team-2, #team-3, #team-4, #team-5
R Advanced 1: https://environ-175.com/advanced/1
R Advanced 2: https://environ-175.com/advanced/2
R Advanced 3: https://environ-175.com/advanced/3
R Advanced 4: https://environ-175.com/advanced/4
R Advanced 5: https://environ-175.com/advanced/5
R Spatial 1: https://environ-175.com/spatial/1
R Spatial 2: https://environ-175.com/spatial/2
R Spatial 3: https://environ-175.com/spatial/3
R Spatial 4: https://environ-175.com/spatial/4
R Spatial 5: https://environ-175.com/spatial/5
File management
Following instructions
Reading csv, tsv, txt files
Working with large datasets
Creating checklists and workplans
Breaking work down into smaller parts
Collaborating and solving problems together
Join Operations
Data Assignment
New functions!
left_join()
right_join()
full_join()
New library!
# install janitor library
install.packages("janitor")
So far we have only worked with 1 dataset at a time
Everything is about to change
Now we will work with multiple datasets
combine two datasets based on a key
matching rows from one table to another
useful when working with two or more datasets
# Table A
students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)
# Table B
grades <- data.frame(
  student_id = c(1, 2, 4),
  grade = c("A", "B", "C")
)
left_join()
a left join keeps all observations in A
right_join()
a right join keeps all observations in B
full_join()
a full join keeps all observations in A and B
library(dplyr)
# first argument is table A (students), second argument is table B (grades)
combined <- left_join(students, grades, by = "student_id")
View(combined)
library(dplyr)
# first argument is table A (students), second argument is table B (grades)
combined <- right_join(students, grades, by = "student_id")
View(combined)
library(dplyr)
# first argument is table A (students), second argument is table B (grades)
combined <- full_join(students, grades, by = "student_id")
View(combined)
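To compare the three joins side by side, here is a minimal sketch using the same toy tables A and B. The result names (left_result, right_result, full_result) are made up for this example; the row counts in the comments follow from which student_id values appear in each table.

```r
library(dplyr)

# Toy tables A and B (columns defined with `=` so they get clean names)
students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)
grades <- data.frame(
  student_id = c(1, 2, 4),
  grade = c("A", "B", "C")
)

left_result  <- left_join(students, grades, by = "student_id")
right_result <- right_join(students, grades, by = "student_id")
full_result  <- full_join(students, grades, by = "student_id")

nrow(left_result)   # 3 rows: ids 1, 2, 3 (grade is NA for Charlie)
nrow(right_result)  # 3 rows: ids 1, 2, 4 (name is NA for id 4)
nrow(full_result)   # 4 rows: ids 1, 2, 3, 4
```

Rows without a match in the other table are kept by the join that protects them, and the missing columns are filled with NA.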
What is NA?
NA in R: stands for "Not Available"
This is how R marks a missing value
NA in R
Keep all of the mismatched data
You want to be able to see where data might be missing in order to determine scope of problem
R does not plot missing values, so there is no need to drop them
Use full_join() to keep all the data
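A small sketch of how NA behaves in base R. The vector tree_cover_pct and its values are made up for illustration and are not from the assignment data.

```r
# Hypothetical vector with one missing value
tree_cover_pct <- c(10, NA, 30)

is.na(tree_cover_pct)                 # FALSE  TRUE FALSE
n_missing <- sum(is.na(tree_cover_pct))
n_missing                             # 1 missing value

mean(tree_cover_pct)                  # NA, because one value is missing
mean_known <- mean(tree_cover_pct, na.rm = TRUE)
mean_known                            # 20, the missing value is skipped
```

Counting NA values like this is one way to see the scope of the problem before deciding what to do about it.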
LA county has a high degree of residential segregation
https://en.wikipedia.org/wiki/Residential_segregation_in_the_United_States
We're going to look at tree canopy cover
Because of the connection to climate change
More tree canopy cover means more shade and therefore more protection from hot weather
The US Forest Service produces the tree canopy cover layer of the National Land Cover Database (NLCD)
30 meter x 30 meter pixel maps of tree canopy cover
https://www.usgs.gov/centers/eros/science/national-land-cover-database
A census tract is a geographic region defined for the purpose of taking a census: https://en.wikipedia.org/wiki/Census_tract
1. Clean environment, load libraries
2. Import NLCD and Census data
3. Clean up our data / fix variable names
4. Create a consistent ID variable in both data frames
5. Join the Census and NLCD data by tract ID
6. Make a scatterplot (and export/save it!)
The janitor library helps with all things concerned with data cleaning
# install janitor library
install.packages("janitor")
# load janitor library
library(janitor)
# clean variable names
trees_clean <- clean_names(trees)
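To see what clean_names() does, here is a minimal sketch on a made-up data frame. The names messy and tidy and the column names are hypothetical and are not the columns of the assignment data.

```r
library(janitor)

# Hypothetical data frame with messy column names
messy <- data.frame(
  `Tract ID` = c(1, 2),
  `Pov Rate` = c(0.21, 0.08),
  check.names = FALSE  # keep the spaces in the column names
)

names(messy)   # "Tract ID" "Pov Rate"

# clean_names() converts the column names to lowercase snake_case
tidy <- clean_names(messy)
names(tidy)    # "tract_id" "pov_rate"
```

Consistent lowercase names without spaces are easier to type and less error-prone in later steps such as mutate() and the joins.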
Paste function concatenates (combines) objects together after converting them to character vectors
paste("hello", "world")
[1] "hello world"
sep = " " is the character string used to separate the terms
# check the documentation
?paste
# paste function signature
paste(..., sep = " ", collapse = NULL, recycle0 = FALSE)
# paste0 is used when you don't want a separator
paste0(..., collapse = NULL, recycle0 = FALSE)
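A few quick examples of how sep and collapse change the result. The values are made up for illustration.

```r
# sep goes between the pieces being combined
with_sep <- paste("tract", 101, sep = "_")
with_sep       # "tract_101"

# paste0 uses no separator at all
no_sep <- paste0("tract", 101)
no_sep         # "tract101"

# collapse squashes a vector into one single string
collapsed <- paste(c("a", "b", "c"), collapse = "+")
collapsed      # "a+b+c"

# paste0 is vectorized: one result per element
fips <- paste0("06", c("037", "059"))
fips           # "06037" "06059"
```

Note that numbers are converted to character before they are combined, so the result is always a character vector.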
We are going to build a path to our data files
Create a separate directory for this assignment
Locate the directory where you stored the data
Append filenames to the main directory path
Helps us keep variable definitions short
Path to the folder with data files for the assignment:
# Path leading to the main assignment directory
path_main <- "/Users/gleb/Dropbox/UCLA/ENVIRON-175/Advanced-1/"
Combined path to the specific data files:
# Combined path to the tree data file
tree_data_path <- paste0(path_main, "nlcd_trees_LA.csv")
# Combined path to the census data file
census_data_path <- paste0(path_main, "census_poverty_LA.txt")
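As an alternative sketch, base R's file.path() builds the same kind of path and adds the "/" between the pieces for you, so the folder path does not need a trailing slash. This is an optional variation and is not required for the assignment; the folder below is the same example folder and will be different on your computer.

```r
# Folder path without a trailing slash
path_main <- "/Users/gleb/Dropbox/UCLA/ENVIRON-175/Advanced-1"

# file.path() joins the pieces with "/"
tree_data_path <- file.path(path_main, "nlcd_trees_LA.csv")
tree_data_path

# Check that the file is really there before trying to read it
file.exists(tree_data_path)  # TRUE only if the file exists on this computer
```

If file.exists() returns FALSE, the path or the filename most likely has a typo.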
Here we'll combine 3 values into 1, creating a new var:
# Use paste0 to combine state, county and tract into 1 ID
poverty_clean <- mutate(poverty, tractid=paste0(state, county, tract))
Data type needs to match for join to work
# Check tractid classes in both data sets
class(trees_clean$tractid)
class(poverty_clean$tractid)
# Or using `str`
str(trees_clean$tractid)
str(poverty_clean$tractid)
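If the two classes do not match, one possible fix is to convert one of the columns before joining. The sketch below uses two made-up toy tables (trees_toy and poverty_toy, not the assignment data) and assumes that turning the numeric ID into character is acceptable for your data.

```r
library(dplyr)

# Hypothetical toy tables: tractid is numeric in one and character in the other
trees_toy <- data.frame(tractid = c(101110, 101122),
                        tree_cover = c(0.12, 0.30))
poverty_toy <- data.frame(tractid = c("101110", "101122"),
                          povrate = c(0.21, 0.08))

class(trees_toy$tractid)    # "numeric"
class(poverty_toy$tractid)  # "character"

# Convert the numeric ID to character so both columns have the same type
trees_toy <- mutate(trees_toy, tractid = as.character(tractid))
class(trees_toy$tractid)    # "character"

# Now the join can match the IDs
joined_toy <- full_join(trees_toy, poverty_toy, by = "tractid")
nrow(joined_toy)            # 2 rows, one per tract
```

Be careful with IDs that start with a zero: if they were read in as numbers, the leading zero is already gone and converting to character will not bring it back.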
Use nrow() to check the number of rows (see ?nrow for the documentation):
nrow(trees)
nrow(poverty)
nrow(alldata)
nrow(filter(alldata, is.na(povrate)))
nrow(filter(alldata, is.na(tree_cover)))
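The same checks can be tried on a small example first. This sketch assumes the combined table is created with full_join(); the toy tables and the name alldata_toy are made up for illustration.

```r
library(dplyr)

# Hypothetical toy tables: tract 103 has no poverty data, tract 104 has no tree data
trees_toy <- data.frame(tractid = c("101", "102", "103"),
                        tree_cover = c(0.12, 0.30, 0.05))
poverty_toy <- data.frame(tractid = c("101", "102", "104"),
                          povrate = c(0.21, 0.08, 0.15))

# Keep all rows from both tables
alldata_toy <- full_join(trees_toy, poverty_toy, by = "tractid")

nrow(alldata_toy)                              # 4 tracts in total
nrow(filter(alldata_toy, is.na(povrate)))      # 1 tract without poverty data
nrow(filter(alldata_toy, is.na(tree_cover)))   # 1 tract without tree data
```

Comparing these counts with the row counts of the two original tables shows how many tracts failed to match.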
Don't forget to use ggsave!
# load ggplot2 library
library(ggplot2)
# scatterplot of tree canopy against poverty rate
ggplot(alldata, aes(y = tree_cover, x = povrate)) +
  geom_point(shape = 21,
             color = "blue",
             fill = "darkgray",
             alpha = 0.3,
             size = 2) +
  labs(y = "Fraction tree canopy",
       x = "Poverty rate") +
  theme(panel.background = element_rect(fill = "white"),
        axis.line = element_line(color = "gray"),
        axis.ticks = element_line(color = "gray"))
# save the most recent plot as a PNG (height and width in inches)
ggsave("/Users/gleb/Dropbox/UCLA/ENVIRON-175/Advanced-1/scatter_canopy_poverty.png",
       height = 4, width = 6)
The assignment will be published on Canvas today
The assignment is due this Friday
Due Date: Friday May 2, 2025 at 11:59 pm PT