ENVIRON-175

Programming with Big Environmental Datasets



Gleb Satyukov
Senior Research Engineer | Data Science Instructor


Slido


Attendance etc





Slides


R Basics 1: https://environ-175.com/basics/1

R Basics 2: https://environ-175.com/basics/2

R Basics 3: https://environ-175.com/basics/3

R Basics 4: https://environ-175.com/basics/4

R Basics 5: https://environ-175.com/basics/5

Slides


R Advanced 1: https://environ-175.com/advanced/1

R Advanced 2: https://environ-175.com/advanced/2

R Advanced 3: https://environ-175.com/advanced/3

R Advanced 4: https://environ-175.com/advanced/4

R Advanced 5: https://environ-175.com/advanced/5

Slides


R Spatial 1: https://environ-175.com/spatial/1

R Spatial 2: https://environ-175.com/spatial/2

R Spatial 3: https://environ-175.com/spatial/3

R Spatial 4: https://environ-175.com/spatial/4

R Spatial 5: https://environ-175.com/spatial/5

Schedule

Gleb's Office Hours


Wednesdays between 3pm and 4pm on Zoom

Fridays between 3pm and 4pm on Zoom


Gleb’s personal Zoom link is: https://ucla.zoom.us/j/6935808910

Kaitlynn's Office Hours


Mondays between 12pm (noon) and 1pm

Wednesdays between 11am and 12pm (noon)


Kaitlynn’s personal Zoom link is: https://ucla.zoom.us/j/8321830416

Agenda for today

Best Practices (again)

Global Variables

String Operations

If/Else Logic

New functions! Add this to best practices


head()

ifelse()
            

Reminder about the best practices


Attention to detail

Clean your environment

Use proper file paths

Use proper code spacing

Use inline and block comments!!

Use correct variable names (lowercase)

Save charts programmatically with ggsave

Best Practices 2.0


Using Global Variables

Set a directory using path_main

Inspecting the data using head()

Keep data in a dedicated data folder

Be consistent with your use of quotes (' vs ")

And more best practices coming soon!

Follow instructions in the assignments

Tidyverse Style Guide


Highly recommend checking it out!

https://style.tidyverse.org/files.html

Paste Function


The paste() function concatenates (combines) objects after converting them to character vectors


paste("hello", "world")

[1] "hello world"
          

Sep Argument


The sep argument is the character string placed between the terms (the default is a single space, sep = " ")


# check the documentation
?paste

# paste function signature (for reference only, not meant to be run as-is)
# paste(..., sep = " ", collapse = NULL, recycle0 = FALSE)

# paste0 is used when you don't want a separator
# paste0(..., collapse = NULL, recycle0 = FALSE)
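
A minimal sketch of how sep and collapse differ. The years vector is made up for illustration and is not part of the assignment data:

```r
# Hypothetical values used only for illustration
years <- c(1992, 1993, 1994)

# sep is placed between the arguments within each pasted element
labels <- paste("year", years, sep = "_")
print(labels)       # "year_1992" "year_1993" "year_1994"

# collapse joins all resulting elements into one single string
one_string <- paste(years, collapse = ", ")
print(one_string)   # "1992, 1993, 1994"

# paste0() behaves like paste() with sep = ""
same <- identical(paste0("a", "b"), paste("a", "b", sep = ""))
print(same)         # TRUE
```

Note that paste() works element by element on vectors, which is why one call produces three labels.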
          

Building a path


We are going to build a path to our data files

Create a separate directory for this assignment

Locate the directory where you stored the data

Append filenames to the main directory path

Helps us keep variable definitions short

Example directories (Mac/Linux):



/Users/gleb/Documents/Environ-175/Advanced-1/
          

/Users/gleb/Documents/Environ-175/Advanced-2/
          

/Users/gleb/Documents/Environ-175/Advanced-3/
          

/Users/gleb/Documents/Environ-175/Advanced-4/
          

/Users/gleb/Documents/Environ-175/Advanced-5/
          

Example directories (Windows):



C:/Users/gleb/Documents/Environ-175/Advanced-1/
          

C:/Users/gleb/Documents/Environ-175/Advanced-2/
          

C:/Users/gleb/Documents/Environ-175/Advanced-3/
          

C:/Users/gleb/Documents/Environ-175/Advanced-4/
          

C:/Users/gleb/Documents/Environ-175/Advanced-5/
          

Example path:


Path to the folder with data files for the assignment:


# Path leading to the main assignment directory
path_main <- "/Users/gleb/Dropbox/UCLA/ENVIRON-175/Advanced-2/"

# document 1 file path
document_path_1 <- paste0(path_main, "document_1.csv")

# document 2 file path
document_path_2 <- paste0(path_main, "document_2.csv")
          

Example path:


Path to the folder with data in a dedicated data folder:


# Path leading to the main assignment directory
path_main <- "/Users/gleb/Dropbox/UCLA/ENVIRON-175/Advanced-2/"

# document 1 file path
document_path_1 <- paste0(path_main, "data/document_1.csv")

# document 2 file path
document_path_2 <- paste0(path_main, "data/document_2.csv")
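
As an alternative sketch, base R's file.path() joins path pieces with "/" for you, so you do not have to keep track of trailing slashes. This assumes the same example directory as above, written without the trailing slash:

```r
# Same example directory as above, but without the trailing slash
path_main <- "/Users/gleb/Dropbox/UCLA/ENVIRON-175/Advanced-2"

# file.path() inserts "/" between each of its arguments
document_path_1 <- file.path(path_main, "data", "document_1.csv")
print(document_path_1)

# file.exists() returns TRUE or FALSE, so you can check a path before importing
path_found <- file.exists(document_path_1)
print(path_found)
```

Either approach is fine as long as you use it consistently throughout your script.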
          

Global Variables

Global variables are set up at the top of your script

They are then reused multiple times throughout the script


# Setup Global Variables
main_color <- "blue"
accent_color <- "lightblue"

# Graph 1 with YVAR1
ggplot(DATA, aes(x = XVAR, y = YVAR1)) +
    geom_point(color = main_color) +
    geom_line(color = accent_color)

# Graph 2 with YVAR2
ggplot(DATA, aes(x = XVAR, y = YVAR2)) +
    geom_point(color = main_color) +
    geom_line(color = accent_color)
          

Slido


What do we call variables that are reused throughout our R script?





Other String Operations

Other common string operations include:

  • searching within a string
  • searching and replacing within a string
  • getting a slice of a string (a substring)
  • splitting a string into separate pieces
  • getting the length of a string
  • capitalizing / uppercasing / lowercasing

Length of a string


Use nchar to find the length of a string, like this:


####################
# Length of a string
####################

nchar("Hello World")        # 11
          

Concatenate two strings


Use paste or paste0 to merge two strings together:


#####################
# Concatenate strings
#####################

paste("Hello", "World")              # "Hello World"
paste("Hello", "World", sep = ", ")  # "Hello, World"

paste0("Hello", "World")             # "HelloWorld"
          

Splitting a string


Split a string based on a split parameter, like so:


################
# Split a string
################

strsplit("apple,banana,kiwi", ",")[[1]]
# [1] "apple" "banana" "kiwi"
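
The [[1]] is needed because strsplit() returns a list with one element per input string. A small sketch with made-up input shows this:

```r
# Made-up input: two strings to split
fruit_lists <- c("apple,banana,kiwi", "mango,pear")

# strsplit() returns a list: one character vector per input string
pieces <- strsplit(fruit_lists, ",")

first  <- pieces[[1]]    # "apple" "banana" "kiwi"
second <- pieces[[2]]    # "mango" "pear"

# lengths() counts how many pieces each string was split into
counts <- lengths(pieces)
print(counts)            # 3 2
```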
          

Find and Replace


Find and replace specific parts of a string:


##################
# Find and replace
##################

gsub("dog", "cat", "The quick brown fox jumps over the lazy dog")
# "The quick brown fox jumps over the lazy cat"
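
gsub() replaces every match, while its sibling sub() replaces only the first one. A short sketch with a made-up sentence:

```r
# Made-up sentence containing the word "cat" twice
sentence <- "the cat sat next to the other cat"

# sub() replaces only the first match
first_only <- sub("cat", "dog", sentence)
print(first_only)    # "the dog sat next to the other cat"

# gsub() replaces all matches
all_matches <- gsub("cat", "dog", sentence)
print(all_matches)   # "the dog sat next to the other dog"

# fixed = TRUE treats the pattern as plain text, not as a regular expression
# (without it, "." would match any character)
commas <- gsub(".", ",", "1.250.000", fixed = TRUE)
print(commas)        # "1,250,000"
```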
          

Source: The quick brown fox jumps over the lazy dog

Uppercase / Lowercase / Capitalize


Change the case of a string with toupper(), tolower(), and tools::toTitleCase():


####################################
# Uppercase / Lowercase / Capitalize
####################################

toupper("hello")            # "HELLO"
tolower("HELLO")            # "hello"

tools::toTitleCase("hello world")  # "Hello World"
          

Trim Whitespace


This is really useful when your imported data has any leading or trailing whitespace that you don't want:


#################
# Trim whitespace
#################

trimws("    no space pls   ")  # "no space pls"
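
trimws() also works on a whole vector at once, which is how you would typically apply it to a column. The state codes below are made up for illustration:

```r
# Made-up column values with stray spaces
states <- c(" CA", "NV ", "  AZ  ")

# Trim both sides of every element (the default)
clean <- trimws(states)
print(clean)       # "CA" "NV" "AZ"

# The which argument limits trimming to one side
left_only <- trimws("  AZ  ", which = "left")
print(left_only)   # "AZ  "
```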
          

Substring / slice


Getting a specific slice of a string:


###################
# Substring / slice
###################

substr("Environment", 1, 7)       # "Environ"
substring("Environment", 1, 7)    # "Environ"
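
Combining substr() with nchar() lets you slice relative to the end of a string. This is a sketch with hypothetical monitor IDs; the "ID" prefix is an assumption and the real assignment data may look different:

```r
# Hypothetical monitor IDs with an unwanted two-character prefix
monitor_id <- c("ID06037", "ID32003")

# Keep everything from the 3rd character to the end of each string
fixed_id <- substr(monitor_id, 3, nchar(monitor_id))
print(fixed_id)     # "06037" "32003"

# Keep only the last three characters of each string
last_three <- substr(monitor_id, nchar(monitor_id) - 2, nchar(monitor_id))
print(last_three)   # "037" "003"
```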
          

Regular Expressions

Regular expressions are used to find matching patterns in text

regexpr() returns -1 when the specified pattern is not found in the text


#####################
# Regular Expressions
#####################

regexpr("@", "gleb@ucla.edu")
# [1] 5
          


############################
# Global Regular Expressions
############################

gregexpr("n", "Environment")
# [1]  2  7 10
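
A short sketch of the -1 case and of using the returned position. The as.integer() calls just drop the extra match information that R prints alongside the position:

```r
email <- "gleb@ucla.edu"

# Position of the first match
at_position <- as.integer(regexpr("@", email))
print(at_position)   # 5

# No match: regexpr() returns -1
not_found <- as.integer(regexpr("#", email))
print(not_found)     # -1

# gregexpr() returns a list, so use [[1]] to get the positions
n_positions <- as.integer(gregexpr("n", "Environment")[[1]])
print(n_positions)   # 2 7 10

# Use the position to slice out everything before the "@"
username <- substr(email, 1, at_position - 1)
print(username)      # "gleb"
```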
          

Global Regular Expressions


* grep stands for global regular expression print


# Check if there are any numbers in this text
grepl("[0-9]", "Environ 175")    # TRUE

# Replace every digit with the letter R
gsub("[0-9]", "R", "Environ 101") # "Environ RRR"

# Find index of all strings that start with the letter A
grep("^A", c("Apple", "Banana", "Avocado"))   # 1, 3

# Find index of all strings that end in 'ing'
grep("ing$", c("Run", "Swimming", "Eating"))  # 2, 3
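
grep() returns positions by default. A small sketch of the ways to get the matching strings themselves, or a TRUE/FALSE value per element:

```r
activities <- c("Run", "Swimming", "Eating")

# Default: index of each matching element
idx <- grep("ing$", activities)
print(idx)          # 2 3

# value = TRUE returns the matching strings instead of their index
matches <- grep("ing$", activities, value = TRUE)
print(matches)      # "Swimming" "Eating"

# grepl() returns one TRUE/FALSE per element, handy for subsetting
is_match <- grepl("ing$", activities)
print(is_match)     # FALSE TRUE TRUE

subset_matches <- activities[is_match]
```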
          

More about Regex


Check out the wikipedia page here:

https://en.wikipedia.org/wiki/Regular_expression

If you want to test your regex patterns:

https://regex101.com/

Slido


What does the regexpr() function return when the specified pattern is not found in the text?











If / Else Logic


If/Else logic is typically used when you need to set a value based on a certain condition

Conditional logic is one of the building blocks of any programming language

In R we can use it to create categorical variables in our data, for example if a value is over a certain threshold

Using if/else in R


We can use if, else if, and else to classify numeric values, for example:


# Temperature reading
temp <- 78

if (temp < 60) {
  category <- "Cold"
} else if (temp < 80) {
  category <- "Warm"
} else {
  category <- "Hot"
}
          

Used when turning numeric data into categories
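
The if/else chain above checks one value at a time. A minimal sketch that wraps the same logic in a function so it can be reused; classify_temp is a made-up name, not a built-in R function:

```r
# Hypothetical helper: same thresholds as the example above
classify_temp <- function(temp) {
  if (temp < 60) {
    "Cold"
  } else if (temp < 80) {
    "Warm"
  } else {
    "Hot"
  }
}

warm <- classify_temp(78)   # "Warm"
cold <- classify_temp(45)   # "Cold"
hot  <- classify_temp(80)   # "Hot" (80 is not less than 80)

# sapply() applies the function to each value of a vector in turn
categories <- sapply(c(45, 78, 95), classify_temp)
print(categories)           # "Cold" "Warm" "Hot"
```

The conditions are checked from top to bottom, and only the first one that is TRUE runs.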

Filter Function (recap)


Recall how our filter function works:


####################
# FILTER BY DISTANCE
####################

data <- filter(data, distance < 50000) # 50km


          

IFELSE() Function


Example of ifelse() being used as a function:


#############################
# IFELSE() USED AS A FUNCTION
#############################

ifelse(<SOME CONDITION>, <RESULT IF TRUE>, <RESULT IF FALSE>)
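
A runnable sketch of the template above. The distances are made up; the point is that ifelse() checks every element of a vector in one call:

```r
# Made-up distances in meters
distance <- c(12000, 50000, 73500)

# One result per element of the condition
dist_group <- ifelse(distance < 50000, "Close", "Far")
print(dist_group)   # "Close" "Far" "Far"

# A missing value in the condition gives a missing value in the result
with_na <- ifelse(c(10, NA) < 50, "low", "high")
print(with_na)      # "low" NA
```

Note that exactly 50000 falls in "Far", because the condition is strictly less than.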


          

Adding Categories


A monitor is considered to be "Close" if the distance to the power plant is less than 50km (50000 meters)


##################################
# ADD CATEGORIES BASED ON DISTANCE
##################################

data50km <- mutate(all_data,
        dist_group = ifelse(distance < 50000, "Close", "Far"))
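
For more than two categories, ifelse() calls can be nested. This is a sketch only: the "Medium" category, the 100km cutoff, and the data frame contents are assumptions for illustration. It is written in base R so it runs without loading any packages; inside mutate() the ifelse() part would look the same.

```r
# Hypothetical monitors and distances in meters
all_data <- data.frame(
  monitor_id = c("A", "B", "C", "D"),
  distance   = c(12000, 49999, 50000, 150000)
)

# The second ifelse() only decides the rows where the first condition is FALSE
all_data$dist_group <- ifelse(all_data$distance < 50000, "Close",
                       ifelse(all_data$distance < 100000, "Medium", "Far"))

print(all_data)

# table() counts how many monitors fall in each category
group_counts <- table(all_data$dist_group)
print(group_counts)
```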


          

Slido


Which category does this temperature fall in?











ACID RAIN


Sulfur dioxide (SO2)








ARP - Acid Rain Program


https://www.epa.gov/acidrain/acid-rain-program

ARP was enacted by the Federal Government in 1990

1995 - Phase 1 begins: ARP regulates SO2 emissions of the 110 largest coal-fired power plants

We'll be using EPA data from 1992 to 1998 which includes Phase 1 power plants

Free Market approach


Power plants were allocated a certain number of permits

Each permit allowed them to emit 1 ton of SO2

After reaching their limit, power plants could buy more permits from other plants that had leftovers

ARP had bipartisan support


ARP passed easily in both the House and the Senate:

https://www.govtrack.us/congress/votes/101-1990/h137

https://www.govtrack.us/congress/votes/101-1990/s55

Acid Rain Program (ARP) Phase I

110 biggest SO2 polluting plants

EPA Data

Clean EPA Data

Distance Data (in meters)

Side by side

Full Join

All Data

All Data with Categories

Final Result

Slido


In what year was the ARP enacted?











Checklist

1. Clean environment, load libraries

2. Import distance data and EPA data

3. Clean up data and substring to fix ID variable

4. Filter to only include SO2 readings

5. Join EPA and distance data by monitor ID

6. Create distance categories with if else logic

7. Collapse our SO2 data by year and distance category

8. Make a connected scatterplot (and save/export it!)
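
Steps 3 to 7 of the checklist can be sketched on a tiny made-up dataset. Everything here is hypothetical (column names, the "M-" prefix, the values), and it uses base R only so it runs without packages; the assignment itself uses the tidyverse functions from class.

```r
# Hypothetical EPA readings: the id column carries an extra "M-" prefix
epa <- data.frame(
  id        = c("M-001", "M-001", "M-002", "M-002", "M-001", "M-002"),
  year      = c(1992, 1993, 1992, 1993, 1992, 1992),
  pollutant = c("SO2", "SO2", "SO2", "SO2", "NO2", "NO2"),
  value     = c(10, 8, 6, 5, 30, 25),
  stringsAsFactors = FALSE
)

# Hypothetical distance data (in meters), one row per monitor
distance_data <- data.frame(
  monitor_id = c("001", "002"),
  distance   = c(20000, 80000),
  stringsAsFactors = FALSE
)

# Step 3: substring to fix the ID variable (drop the first two characters)
epa$monitor_id <- substr(epa$id, 3, nchar(epa$id))

# Step 4: keep only the SO2 readings
so2 <- epa[epa$pollutant == "SO2", ]

# Step 5: join by monitor ID (all = TRUE keeps unmatched rows, like a full join)
all_data <- merge(so2, distance_data, by = "monitor_id", all = TRUE)

# Step 6: distance categories with ifelse()
all_data$dist_group <- ifelse(all_data$distance < 50000, "Close", "Far")

# Step 7: collapse to the mean SO2 value per year and distance category
collapsed <- aggregate(value ~ year + dist_group, data = all_data, FUN = mean)
print(collapsed)
```

The collapsed data frame has one row per year and category, which is the shape needed for the connected scatterplot in step 8.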

R Advanced 2: Assignment 7


Will be published on Canvas today

The assignment is due next Monday

Due Date: Monday May 5, 2025 at 11:59 pm PT