ENVIRON-175

Programming with Big Environmental Datasets



Gleb Satyukov
Senior Research Engineer | Data Science Instructor


Slido


https://app.sli.do/event/sfm2atzbPr2bbJoEdR8tjV

New office hours

Gleb's new Office Hours


Wednesdays between 3pm and 4pm on Zoom

Fridays between 3pm and 4pm on Zoom


Gleb’s personal Zoom link is: https://ucla.zoom.us/j/6935808910

Just this week


This Wednesday between 11am and 12pm (noon) on Zoom

Other Wednesdays between 3pm and 4pm on Zoom

Slides


R Basics 1: https://environ-175.com/basics/1

R Basics 2: https://environ-175.com/basics/2

R Basics 3: https://environ-175.com/basics/3

R Basics 4: https://environ-175.com/basics/4

R Basics 5: https://environ-175.com/basics/5

Slides


R Advanced 1: https://environ-175.com/advanced/1

R Advanced 2: https://environ-175.com/advanced/2

R Advanced 3: https://environ-175.com/advanced/3

R Advanced 4: https://environ-175.com/advanced/4

R Advanced 5: https://environ-175.com/advanced/5

Slides


R Spatial 1: https://environ-175.com/spatial/1

R Spatial 2: https://environ-175.com/spatial/2

R Spatial 3: https://environ-175.com/spatial/3

R Spatial 4: https://environ-175.com/spatial/4

R Spatial 5: https://environ-175.com/spatial/5

Schedule


Project 1

Submit your project R script on Canvas, just like the data assignments, i.e. as project-1.R

We will not be answering project-related questions

You can discuss it with your classmates on Slack

You will have 48 hours to complete the project

Make sure you follow all of the best practices

Make sure your code runs from start to finish without any errors or interruptions

Reminder about the best practices


Attention to detail

Clean your environment

Use proper file paths

Use proper code spacing

Use inline and block comments!!

Use correct variable names (lowercase)

Save charts programmatically with ggsave
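
As a rough illustration of the checklist above, here is a minimal script skeleton. It is only a sketch: the built-in mtcars data, the object names, and the use of tempdir() as an output folder are stand-ins chosen for this example, not part of the course material.

```r
# Clean the environment so the script does not rely on leftover objects
rm(list = ls())

# Load packages at the top of the script
library(ggplot2)

# Build file paths with file.path() instead of typing separators by hand
# (tempdir() is a stand-in here; use your own project folder)
output_dir <- tempdir()
plot_path <- file.path(output_dir, "scatter_plot.png")

# Lowercase, descriptive variable names (mtcars is a built-in example dataset)
car_data <- mtcars

# Build the chart and keep it in an object
scatter_plot <- ggplot(data = car_data) +
  geom_point(aes(x = wt, y = mpg))

# Save the chart programmatically instead of exporting it by hand
ggsave(plot_path, plot = scatter_plot, width = 6, height = 4)
```

Running the whole script in a fresh R session is a quick way to confirm that it works from start to finish.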

Agenda for today


Surprise Quiz ✏️ Spot the Error! ❌

New function!

sample <- filter(data, variable == "keyword")

Logical Operators

Data Assignment

Best Practices

Project 1 Info

Quiz ✏️






1. Where is the error?



meps_data << read_csv("/Users/gleb/ENVIRON-175/meps_2019.csv")

    ^      ^     ^              ^                  ^
    |      |     |              |                  |

   (1)    (2)   (3)            (4)                (5)
          

2. Where is the error?



10_age <- mutate(meps_data, 10_age = floor(age / 10) * 10)

  ^     ^            ^                 ^             ^
  |     |            |                 |             |

 (1)   (2)          (3)               (4)           (5)
          

3. Where is the error?



new_data <- mutate(data, new_var == var_1 / var_2)

   ^      ^         ^       ^     ^
   |      |         |       |     |

  (1)    (2)       (3)     (4)   (5)
          

4. Where is the error?



ggplot(data="expense_by_age10") * geom_point(aes(x = age10, y = emergency)

  ^        ^        ^           ^             ^
  |        |        |           |             |

 (1)      (2)      (3)         (4)           (5)
          

5. Where is the error?



gsave("~/Documents/scatter_plot.png", height = 4, width = 6)

  ^          ^            ^                  ^      ^
  |          |            |                  |      |

 (1)        (2)          (3)                (4)    (5)
          

👏 🙌 🎉 Quiz ✏️ 🎊 ✅






You all win
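
For reference, one possible corrected version of each quiz line is sketched below. The small data frame and the temporary CSV file are stand-ins so the example can run without the original meps_2019.csv, and names such as age10_data and cost_per_year are made up for this sketch.

```r
library(readr)
library(dplyr)
library(ggplot2)

# Stand-in data written to a temporary file (the real MEPS file is not included here)
csv_path <- file.path(tempdir(), "meps_example.csv")
example_rows <- data.frame(
  age = c(23, 37, 41, 58),
  emergency = c(120, 0, 310, 95)
)
write_csv(example_rows, csv_path)

# 1. The assignment operator is <- (not <<)
meps_data <- read_csv(csv_path)

# 2. Object and column names cannot start with a digit
age10_data <- mutate(meps_data, age10 = floor(age / 10) * 10)

# 3. Inside mutate(), a single = creates the new column (== only compares)
new_data <- mutate(meps_data, cost_per_year = emergency / age)

# 4. data takes the object itself (no quotes), layers are added with +,
#    and every opening parenthesis needs a closing one
expense_by_age10 <- summarise(group_by(age10_data, age10), emergency = mean(emergency))
scatter_plot <- ggplot(data = expense_by_age10) +
  geom_point(aes(x = age10, y = emergency))

# 5. The function is called ggsave()
ggsave(file.path(tempdir(), "scatter_plot.png"), plot = scatter_plot, height = 4, width = 6)
```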

Rubber Duck Debugging

https://en.wikipedia.org/wiki/Rubber_duck_debugging

Comparison Operators


Comparison Operators

  • ==           Equal to
  • !=           Not equal to
  • >             Greater than
  • <             Less than
  • >=           Greater than or equal to
  • <=           Less than or equal to
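
A quick way to get a feel for these operators is to try them in the console. The values below are made up for illustration.

```r
# A single made-up value
gpa <- 250

gpa == 250  # TRUE
gpa != 250  # FALSE
gpa > 200   # TRUE
gpa < 200   # FALSE
gpa >= 250  # TRUE
gpa <= 199  # FALSE

# Comparisons are vectorised: one TRUE/FALSE per element
scores <- c(180, 250, 310)
at_least_250 <- scores >= 250
at_least_250  # FALSE TRUE TRUE
```

Note that a single = assigns or names an argument, while == asks a question and returns TRUE or FALSE.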

Slido

What do we use to check for equality in R?






Filter Function


Slido


We want a sample where the city is equal to Los Angeles, but the code has an error. Where is it?


la_data <- filter(data, city == Los Angeles )

  ^      ^    ^              ^       ^
  |      |    |              |       |

 (1)    (2)  (3)            (4)     (5)
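
One way to write this filter so that it runs is sketched below. The tiny data frame is invented for the example; the key point is that text values need quotes, otherwise R looks for objects named Los and Angeles.

```r
library(dplyr)

# Made-up data for illustration
data <- data.frame(
  city = c("Los Angeles", "San Diego", "Los Angeles"),
  violations = c(3, 1, 5)
)

# Text values go in quotes
la_data <- filter(data, city == "Los Angeles")
la_data
```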
          

Comparison Operators


Write the logical condition for this filter:

Comparison Operators

Correct Answer:


mom_hs == "Yes"
          

mom_hs != "No"
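
The two answers give the same rows only when the column contains nothing but "Yes" and "No". The sketch below uses an invented data frame with a hypothetical third value, "Unknown", to show where they differ.

```r
library(dplyr)

# Made-up data; "Unknown" is a hypothetical third category
students <- data.frame(
  id = 1:4,
  mom_hs = c("Yes", "No", "Yes", "Unknown")
)

# Keeps only rows that are exactly "Yes"
yes_only <- filter(students, mom_hs == "Yes")

# Keeps everything except "No", which here includes "Unknown"
not_no <- filter(students, mom_hs != "No")

nrow(yes_only)  # 2
nrow(not_no)    # 3
```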
          

Comparison Operators


Write the logical condition for this filter:

Comparison Operators

Correct Answer:


gpa <= 199
          

gpa < 200
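
These two conditions match only if gpa is always a whole number. A small check with made-up values, including a hypothetical fractional score, shows the difference.

```r
# Made-up values; 199.5 is a hypothetical fractional score
gpa <- c(150, 199, 199.5, 200)

at_most_199 <- gpa <= 199
below_200 <- gpa < 200

at_most_199  # TRUE TRUE FALSE FALSE
below_200    # TRUE TRUE TRUE FALSE
```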
          

Comparison Operators


Write the logical condition for this filter:

Comparison Operators

Correct Answer:


mom_hs == dad_hs
          

dad_hs == mom_hs
          

Logical Operators


AND Operator


gpa > 200 & gpa < 300
          

OR Operator


state == "Florida" | state == "California"
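
Both operators can be used directly inside filter(). The sketch below assumes a small invented data frame with state and gpa columns.

```r
library(dplyr)

# Made-up data for illustration
students <- data.frame(
  state = c("Florida", "California", "Texas", "Florida"),
  gpa = c(250, 320, 210, 180)
)

# AND: both conditions must be TRUE for a row to be kept
mid_gpa <- filter(students, gpa > 200 & gpa < 300)

# OR: a row is kept if at least one condition is TRUE
fl_or_ca <- filter(students, state == "Florida" | state == "California")

nrow(mid_gpa)   # 2
nrow(fl_or_ca)  # 3
```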
          

Logical Operators


Correct way of combining comparison operators:


gpa > 200 & gpa < 300
          

✅ OK



200 < gpa < 300
          

❌ NOT OK
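
To see why the chained form is not OK, the sketch below asks R to parse it as text, so the script itself keeps running. The gpa values are made up.

```r
# Made-up values
gpa <- c(150, 250, 350)

# Two comparisons joined with & work element by element
in_range <- gpa > 200 & gpa < 300
in_range  # FALSE TRUE FALSE

# The chained form is not valid R; parsing it as text lets us catch the error
chained <- tryCatch(
  parse(text = "200 < gpa < 300"),
  error = function(e) e
)
inherits(chained, "error")  # TRUE
```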

FORCATS

Forcats Library


https://forcats.tidyverse.org/


install.packages("forcats")
          

Categorical Data in R

Factors are how R stores categorical data, e.g.


c("Low", "Medium", "High")
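
Note that c() on its own creates a character vector; wrapping it in factor() turns it into a factor. A minimal sketch in base R, with object names chosen just for this example:

```r
# A plain character vector
risk_text <- c("Low", "Medium", "High")
class(risk_text)  # "character"

# The same values stored as a factor
risk_factor <- factor(risk_text)
class(risk_factor)  # "factor"

# By default the levels are sorted alphabetically, not by meaning
levels(risk_factor)  # "High" "Low" "Medium"
```

The alphabetical default is one reason to reorder levels explicitly before plotting.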
          

With the forcats package you can:

  • Shuffle levels randomly
  • Clean and reorder levels
  • Group rare categories together
  • Reorder based on frequency or another variable
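
Each item in the list above maps onto a forcats function. The sketch below is one possible pairing, using an invented factor; the choice of fct_lump_min() for grouping rare categories is just one option among the fct_lump_*() family.

```r
library(forcats)

# Made-up factor for illustration
sizes <- factor(c("Low", "High", "Medium", "High", "Low", "High", "Rare"))

# Shuffle levels randomly
shuffled <- fct_shuffle(sizes)

# Reorder levels by hand (levels not mentioned stay at the end)
releveled <- fct_relevel(sizes, "Low", "Medium", "High")

# Group categories that appear fewer than 2 times into "Other"
lumped <- fct_lump_min(sizes, min = 2)

# Reorder levels by how often they appear, most frequent first
by_frequency <- fct_infreq(sizes)

levels(releveled)
levels(lumped)
levels(by_frequency)
```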

Sort categorical data with forcats

Imagine that our data is survey responses like this:


responses <- factor(c(
  "High", "Extreme", "Low", "Very High", "Extreme",
  "Low", "Medium", "High", "Very High", "Extreme",
  "High", "Low", "Extreme", "Very High", "Medium"
))
          

Count categorical data with forcats



# Get a table with counts
fct_count(responses)
          

Sort categorical data with forcats



# Set a meaningful order
ordered_responses <- fct_relevel(responses,
  "Low", "Medium", "High", "Very High", "Extreme"
)

# Check the levels
levels(ordered_responses)
          

Collapse categorical data with forcats



grouped <- fct_collapse(responses,
  "High+" = c("Very High", "Extreme")
)
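
The steps above can also be combined. As a sketch, collapsing the releveled factor instead of the original one should keep the meaningful order, and fct_count() then shows how many responses ended up in each group.

```r
library(forcats)

# Same survey responses as above
responses <- factor(c(
  "High", "Extreme", "Low", "Very High", "Extreme",
  "Low", "Medium", "High", "Very High", "Extreme",
  "High", "Low", "Extreme", "Very High", "Medium"
))

# Put the levels in a meaningful order first
ordered_responses <- fct_relevel(responses,
  "Low", "Medium", "High", "Very High", "Extreme"
)

# Then merge the two highest categories into one
grouped <- fct_collapse(ordered_responses,
  "High+" = c("Very High", "Extreme")
)

# Count the responses in each collapsed category
grouped_counts <- fct_count(grouped)
grouped_counts
```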
          

FORCATS Cheatsheet

https://forcats.tidyverse.org/

Drinking Water












https://www.pnas.org/
https://www.pnas.org/doi/10.1073/pnas.1719805115

Violations

Drinking water contaminants pose a harm to public health

16 million cases of acute gastroenteritis occur each year

9–45 million people are possibly affected

Relatively few community water systems (3–10%) incur health-based violations

Improved compliance is needed to ensure safe drinking water nationwide

Checklist


1. Load packages up

2. Import fixed-width data

3. Clean up variable names

4. Filter to rural southern counties

5. Collapse down violations by state

6. Round violations for graphing label

7. Make a bar plot (and save/export it!)
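
The checklist above can be sketched end to end on toy data. Everything specific here is an assumption made for illustration: the fixed-width layout, the column names (State, Region, Rural, Violations), the definition of "rural southern", and the use of a mean when collapsing. The actual assignment data and instructions take priority.

```r
# 1. Load packages
library(readr)
library(dplyr)
library(ggplot2)

# Toy fixed-width file: widths 3, 6, 2, 2 (hypothetical layout)
fwf_path <- file.path(tempdir(), "water_example.txt")
writeLines(c(
  "AL South 1 12",
  "AL South 1  7",
  "GA South 1 20",
  "GA South 0  5",
  "CA West  1  9"
), fwf_path)

# 2. Import fixed-width data
water <- read_fwf(
  fwf_path,
  fwf_widths(c(3, 6, 2, 2), c("State", "Region", "Rural", "Violations"))
)

# 3. Clean up variable names (lowercase)
water <- rename_with(water, tolower)

# 4. Filter to rural southern rows
rural_south <- filter(water, region == "South" & rural == 1)

# 5. Collapse violations by state (mean per state in this sketch)
by_state <- summarise(group_by(rural_south, state), violations = mean(violations))

# 6. Round violations for the graph label
by_state <- mutate(by_state, label = round(violations, 1))

# 7. Make a bar plot and save it
bar_plot <- ggplot(data = by_state, aes(x = state, y = violations)) +
  geom_col() +
  geom_text(aes(label = label), vjust = -0.5)

bar_path <- file.path(tempdir(), "violations_by_state.png")
ggsave(bar_path, plot = bar_plot, width = 6, height = 4)
```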

Final Bar Plot

R Basics 5: Assignment


The assignment is probably already published on Canvas

The assignment is due this Friday

Due Date: Friday April 25, 2025 at 11:59 pm PT