ENVIRON-175

Programming with Big Environmental Datasets



Gleb Satyukov
Senior Research Engineer | Data Science Instructor


Slido


Attendance etc





Slido


https://app.sli.do/event/tTrBqrBFyTcL4xBSudvJjZ

Schedule


Gleb's Office Hours


Wednesdays between 3pm and 4pm on Zoom

Fridays between 3pm and 4pm on Zoom


Gleb’s personal Zoom link is: https://ucla.zoom.us/j/6935808910

Kaitlynn's Office Hours


Mondays between 12pm (noon) and 1pm

Wednesdays between 11am and 12pm (noon)


Kaitlynn’s personal Zoom link is: https://ucla.zoom.us/j/8321830416

Join Our Slack Community

Important announcements in the #general channel

click here to join

Slack #classroom


Class announcements are in #general

#classroom is used during class

#tech-support for anything tech related

#team-1, #team-2, #team-3, #team-4, #team-5

Slides


R Advanced 1: https://environ-175.com/advanced/1

R Advanced 2: https://environ-175.com/advanced/2

R Advanced 3: https://environ-175.com/advanced/3

R Advanced 4: https://environ-175.com/advanced/4

R Advanced 5: https://environ-175.com/advanced/5

Slides


R Spatial 1: https://environ-175.com/spatial/1

R Spatial 2: https://environ-175.com/spatial/2

R Spatial 3: https://environ-175.com/spatial/3

R Spatial 4: https://environ-175.com/spatial/4

R Spatial 5: https://environ-175.com/spatial/5

What have we learned so far?


File management

Following instructions

Reading csv, tsv, txt files

Working with large datasets

Creating checklists and workplans

Breaking work down into smaller parts

Collaborating and solving problems together

Slido


How difficult did you find Project 1?





Slido


How are you going to score on the next midterm project?





Agenda for today

Join Operations

Data Assignment

New functions!


left_join()

right_join()

full_join()
            

New library!


# install janitor library
install.packages("janitor")
          

So far...


we have only worked with 1 dataset at a time

everything is about to change

with multiple datasets

Join


combine two datasets based on a key

matching rows from one table to another

useful when working with two or more datasets

Venn Diagram

Example A




# Table A
students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)

          


Example B




# Table B
grades <- data.frame(
  student_id = c(1, 2, 4),
  grade = c("A", "B", "C")
)

          


Side by side


Venn Diagram

Slido

Which grade did Charlie get?






DPLYR

Join functions in dplyr package


left_join()
a left join keeps all observations in A

right_join()
a right join keeps all observations in B

full_join()
a full join keeps all observations in A and B
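The three joins can be compared on the toy tables from Example A and Example B. This is a minimal sketch, assuming dplyr is installed; the object names `left`, `right` and `full` are only for illustration.

```r
library(dplyr)

# Same toy tables as in Example A and Example B
students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)
grades <- data.frame(
  student_id = c(1, 2, 4),
  grade = c("A", "B", "C")
)

left  <- left_join(students, grades, by = "student_id")   # keeps ids 1, 2, 3
right <- right_join(students, grades, by = "student_id")  # keeps ids 1, 2, 4
full  <- full_join(students, grades, by = "student_id")   # keeps ids 1, 2, 3, 4

nrow(left)   # 3
nrow(right)  # 3
nrow(full)   # 4
```

Wherever a row has no match in the other table, the columns coming from that table are filled with NA.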

Left Join



library(dplyr)

#                      ( A )    ( B )
#                        ^        ^
#                        |        |

combined <- left_join(students, grades, by = "student_id")

View(combined)
            

Left Join Result

Whose grade is this?


Right Join



library(dplyr)

#                       ( A )    ( B )
#                         ^        ^
#                         |        |

combined <- right_join(students, grades, by = "student_id")

View(combined)
            

Right Join Result

Where's Charlie's grade?


Full Join



library(dplyr)

#                      ( A )    ( B )
#                        ^        ^
#                        |        |

combined <- full_join(students, grades, by = "student_id")

View(combined)
            

Full Join Result

Full Outer Join


NA


What is NA?

NA in R: Stands for "Not Available"
This is how R marks a missing value

Missing Values and NA in R


  • Missing Values: Sometimes data is incomplete:
                    a value might be missing for any reason
  • Why does this happen?
    • Data entry errors
    • Information was never collected
    • Different sources of data don't match up perfectly
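A short base R sketch of how NA behaves. The `scores` vector is a hypothetical example, not part of the class data.

```r
grade <- c("A", "B", NA)      # the third student has no recorded grade
is.na(grade)                  # FALSE FALSE  TRUE
sum(is.na(grade))             # 1 missing value

scores <- c(90, 85, NA)       # hypothetical numeric scores
mean(scores)                  # NA: one missing value makes the mean unknown
mean(scores, na.rm = TRUE)    # 87.5, computed from the two known values
```

Use `is.na()` to test for missing values; comparing with `== NA` returns NA rather than TRUE or FALSE.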

Inner Join


Inner Join Result
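An inner join keeps only the rows whose key appears in both tables. A minimal sketch using dplyr's `inner_join()` on the same toy tables; the name `matched` is only for illustration.

```r
library(dplyr)

students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)
grades <- data.frame(
  student_id = c(1, 2, 4),
  grade = c("A", "B", "C")
)

# Only student_id 1 and 2 exist in both tables
matched <- inner_join(students, grades, by = "student_id")

matched$name   # "Alice" "Bob"
nrow(matched)  # 2
```

Charlie (no grade) and student 4 (no name) are silently dropped, so an inner join never produces NA from mismatched keys.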


Best Practice


Keep all of the mismatched data

You want to see where data is missing in order to determine the scope of the problem

ggplot2 leaves rows with missing values out of the plot (with a warning), so there is no need to drop them first

Use full_join() to keep all the data
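One way to see the scope of the problem after a full join is to filter for the rows where one side is missing. A sketch on the toy tables, assuming dplyr is loaded; the names `no_grade` and `no_student` are hypothetical.

```r
library(dplyr)

students <- data.frame(
  student_id = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie")
)
grades <- data.frame(
  student_id = c(1, 2, 4),
  grade = c("A", "B", "C")
)

combined <- full_join(students, grades, by = "student_id")

# Students with no grade on record
no_grade <- filter(combined, is.na(grade))
no_grade$student_id    # 3

# Grades with no matching student
no_student <- filter(combined, is.na(name))
no_student$student_id  # 4
```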

Environmental Justice








Environmental Justice

  • Distributional Justice
    Where environmental benefits and harms are located (big data can help by mapping and quantifying disparities)
  • Procedural Justice
    Who has influence in decision-making processes (big data here is not enough - requires qualitative assessment)

Residential Segregation


LA County has a high degree of residential segregation

https://en.wikipedia.org/wiki/Residential_segregation_in_the_United_States

Tree Canopy


We're going to look at tree canopy cover

Because of the connection to climate change

More tree canopy cover means more shade and therefore more protection from hot weather

NLCD Data


The US Forest Service produces the tree canopy cover layer of the National Land Cover Database (NLCD)

30 meter x 30 meter pixel maps of tree canopy cover

https://www.usgs.gov/centers/eros/science/national-land-cover-database

Census Tracts


a geographic region defined for the purpose of taking a census: https://en.wikipedia.org/wiki/Census_tract



2020 Census Tract Identifier

LA County Census Tracts

Joining the data


What if there are missing values?

Checklist


1. Clean environment, load libraries

2. Import NLCD and Census data

3. Clean up our data / fix variable names

4. Create a consistent ID variable in both data frames

5. Join the Census and NLCD data by tract ID

6. Make a scatterplot (and export/save it!)

Janitor Library


Helps with all things related to data cleaning


# install janitor library
install.packages("janitor")

# load janitor library
library(janitor)

# clean variable names
trees_clean <- clean_names(trees)
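To see what `clean_names()` does, here is a sketch on a small made-up table. The column names and values are illustrative assumptions, not the real NLCD file.

```r
library(janitor)

# Hypothetical raw table with spaces and capitals in the column names
trees <- data.frame(
  `Tract ID` = c("06037101110", "06037101122"),
  `Tree Cover` = c(0.12, 0.08),
  check.names = FALSE   # keep the awkward names exactly as typed
)

names(trees)         # "Tract ID"   "Tree Cover"

trees_clean <- clean_names(trees)
names(trees_clean)   # "tract_id"   "tree_cover"
```

Only the column names change; the values in the table are left as they are.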
          

Paste Function


The paste() function concatenates (combines) objects after converting them to character vectors


paste("hello", "world")

[1] "hello world"
          

Sep Argument


sep = " " is the character string placed between the terms (a single space by default)


# check the documentation
?paste

# paste function signature (as shown in the documentation, not meant to be run):
# paste(..., sep = " ", collapse = NULL, recycle0 = FALSE)

# paste0 is used when you don't want a separator:
# paste0(..., collapse = NULL, recycle0 = FALSE)
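A few calls that show how `sep`, `paste0()` and `collapse` differ. The strings are arbitrary examples.

```r
paste("tree", "canopy")                    # "tree canopy"
paste("tree", "canopy", sep = "_")         # "tree_canopy"
paste0("tree", "canopy")                   # "treecanopy"

# paste is vectorized: the shorter argument is recycled
paste0("team-", 1:3)                       # "team-1" "team-2" "team-3"

# collapse joins the elements of one vector into a single string
paste(c("a", "b", "c"), collapse = ", ")   # "a, b, c"
```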
          

Building a path


We are going to build a path to our data files

Create a separate directory for this assignment

Locate the directory where you stored the data

Append filenames to the main directory path

Helps us keep variable definitions short

Example path:


Path to the folder with data files for the assignment:


# Path leading to the main assignment directory

path_main <- "/Users/gleb/Dropbox/UCLA/ENVIRON-175/Advanced-1/"

          

Example path:


Combined path to the specific data files:


# Combined path to the tree data file
tree_data_path <- paste0(path_main, "nlcd_trees_LA.csv")

# Combined path to the census data file
census_data_path <- paste0(path_main, "census_poverty_LA.txt")
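Before reading a file, it helps to confirm that the combined path points at something real. The sketch below uses a temporary folder and a hypothetical file name (`demo_trees.csv`) as stand-ins for the assignment directory and data files, so it runs on any machine.

```r
# Assumption: a temporary folder stands in for the real assignment directory
path_main <- paste0(tempdir(), "/")

# Write a tiny stand-in file so there is something to read
demo_path <- paste0(path_main, "demo_trees.csv")   # hypothetical file name
write.csv(data.frame(tractid = c("A", "B"), tree_cover = c(0.1, 0.2)),
          demo_path, row.names = FALSE)

file.exists(demo_path)   # TRUE when the path was built correctly

demo <- read.csv(demo_path)
nrow(demo)               # 2
```

If `file.exists()` returns FALSE, the usual causes are a missing trailing slash in the main path or a typo in the filename. Base R's `file.path()` inserts the separator itself, so it is an alternative when the main path has no trailing slash.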
          

Combining Variables


Here we'll combine three values into one, creating a new variable:


# mutate() comes from the dplyr package
library(dplyr)

# Use paste0 to combine state, county and tract into 1 ID
poverty_clean <- mutate(poverty, tractid = paste0(state, county, tract))
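A full census tract identifier has 11 digits: 2 for the state, 3 for the county and 6 for the tract. If the codes were imported as numbers, leading zeros are lost and a plain `paste0()` gives an ID that is too short. The sketch below pads each part with `sprintf()`. Whether the class files need this is an assumption, and the rows are made up for illustration.

```r
library(dplyr)

# Hypothetical rows: codes stored as numbers have lost their leading zeros
poverty <- data.frame(
  state  = c(6, 6),
  county = c(37, 37),
  tract  = c(101110, 920336)
)

poverty_clean <- mutate(
  poverty,
  tractid = paste0(
    sprintf("%02d", as.integer(state)),    # 2-digit state code
    sprintf("%03d", as.integer(county)),   # 3-digit county code
    sprintf("%06d", as.integer(tract))     # 6-digit tract code
  )
)

poverty_clean$tractid          # "06037101110" "06037920336"
nchar(poverty_clean$tractid)   # 11 11
```

If the columns were imported as text with their zeros intact, a plain `paste0()` of the three columns already gives the 11-digit ID.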

          

Check data types


The ID column must have the same data type in both data frames for the join to work


# Check tractid classes in both data sets
class(trees_clean$tractid)
class(poverty_clean$tractid)

# Or using `str`
str(trees_clean$tractid)
str(poverty_clean$tractid)
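If the classes differ (for example numeric in one table and character in the other), convert one side before joining. A sketch with made-up tables and short IDs; the values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical tables: the same IDs stored as numbers in one, text in the other
trees_clean   <- data.frame(tractid = c(101, 102), tree_cover = c(0.12, 0.08))
poverty_clean <- data.frame(tractid = c("101", "103"), povrate = c(0.21, 0.15))

class(trees_clean$tractid)     # "numeric"
class(poverty_clean$tractid)   # "character"

# Convert the numeric key to character so both sides match, then join
trees_clean <- mutate(trees_clean, tractid = as.character(tractid))
alldata <- full_join(trees_clean, poverty_clean, by = "tractid")

nrow(alldata)                  # 3 (tracts 101, 102 and 103)
```

Converting to character rather than to numeric keeps any leading zeros that are present in the text version of the ID.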
          

Check number of rows


Use nrow() to check the number of rows (run ?nrow to read its documentation):


nrow(trees)
nrow(poverty)
nrow(alldata)

nrow(filter(alldata, is.na(povrate)))
nrow(filter(alldata, is.na(tree_cover)))
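The row counts can be used as a sanity check on the join. The sketch below builds `alldata` with a full join on small made-up tables (the IDs and values are assumptions) and checks that the counts add up. The arithmetic holds when each tract ID appears at most once per table.

```r
library(dplyr)

# Hypothetical stand-ins for the tree and poverty tables
trees   <- data.frame(tractid = c("101", "102", "103"),
                      tree_cover = c(0.12, 0.08, 0.30))
poverty <- data.frame(tractid = c("101", "103", "104"),
                      povrate = c(0.21, 0.15, 0.09))

alldata <- full_join(trees, poverty, by = "tractid")

nrow(trees)     # 3
nrow(poverty)   # 3
nrow(alldata)   # 4: two matched tracts plus one unmatched tract from each side

nrow(filter(alldata, is.na(povrate)))      # 1 (tract 102 has no poverty data)
nrow(filter(alldata, is.na(tree_cover)))   # 1 (tract 104 has no tree data)

# Full join rows = rows in A + rows in B - rows matched in both
n_matched <- nrow(inner_join(trees, poverty, by = "tractid"))
nrow(trees) + nrow(poverty) - n_matched == nrow(alldata)   # TRUE
```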
          

Tree coverage vs poverty rates

Don't forget to use ggsave!


library(ggplot2)

ggplot(alldata, aes(y=tree_cover, x=povrate)) +
  geom_point(shape=21,
             color="blue",
             fill = "darkgray",
             alpha=0.3,
             size = 2) +
  labs(y = "Fraction tree canopy",
       x = "Poverty rate") +
  theme(panel.background = element_rect(fill="white"),
        axis.line = element_line(color = "gray"),
        axis.ticks = element_line(color = "gray"))

ggsave("/Users/gleb/Dropbox/UCLA/ENVIRON-175/Advanced-1/scatter_canopy_poverty.png",
       height = 4, width = 6)
          

Final Result

R Advanced 1: Assignment 6


Going to be published on Canvas today

Assignment is going to be due this Friday

Due Date: Friday May 2, 2025 at 11:59 pm PT