ENVIRON-175

Programming with Big Environmental Datasets



Gleb Satyukov
Senior Research Engineer | Data Science Instructor


Slides


R Basics 1: https://environ-175.com/basics/1

R Basics 2: https://environ-175.com/basics/2

R Basics 3: https://environ-175.com/basics/3

R Basics 4: https://environ-175.com/basics/4

R Basics 5: https://environ-175.com/basics/5

Slides


R Advanced 1: https://environ-175.com/advanced/1

R Advanced 2: https://environ-175.com/advanced/2

R Advanced 3: https://environ-175.com/advanced/3

R Advanced 4: https://environ-175.com/advanced/4

R Advanced 5: https://environ-175.com/advanced/5

Slides


R Spatial 1: https://environ-175.com/spatial/1

R Spatial 2: https://environ-175.com/spatial/2

R Spatial 3: https://environ-175.com/spatial/3

R Spatial 4: https://environ-175.com/spatial/4

R Spatial 5: https://environ-175.com/spatial/5

Slido


https://app.sli.do/event/nAihDtN5jwg2LRes9tvRpg

Updated Syllabus


Changed the order of chapters (slightly)

Additional resources for outside of class

We updated the schedule to match Fall 24 class

Syllabus


Check the syllabus!!

https://environ-175.com/syllabus

Schedule



Slack Teams


We will be randomly(?) adding folks to teams

There will be 5 teams in total

Roughly 20 people in each team

Join Our Slack Community

click here to join

Checklists


There is a checklist.pdf

Use offline checklists to stay on track

It is rewarding to check things off

It is a To-Do as well as Ta-Da! 🙌

Slido

Everything everywhere all at once






Quick Recap


R vs RStudio

R is the programming language

RStudio is the frontend, or IDE

Yes, RStudio has different themes!

Filepaths


Each OS has its own file system (FS) conventions

UNIX-based systems such as macOS and Linux use forward slashes ('/')

Windows uses backslashes ('\')
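
A minimal sketch of what this means in practice: R accepts forward slashes on every OS, and file.path() builds such paths for you. The folder and file names below are made up for illustration.

```r
# Hypothetical folder and file names, used only for illustration
data_path <- file.path("environ-175", "data", "pollution.csv")
print(data_path)

# A Windows-style path needs every backslash doubled inside an R string
windows_path <- "C:\\Users\\alice\\environ-175\\data\\pollution.csv"

# Swapping backslashes for forward slashes gives a path R accepts on any OS
portable_path <- gsub("\\", "/", windows_path, fixed = TRUE)
print(portable_path)
```

Building paths with file.path() avoids typing separators by hand, which is where most path typos come from.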

Where not to save your data


Agenda Today - Best Practices

Please, create a designated folder for this class

In other words, DO NOT save class-related files in your Downloads folder (or on the Desktop)

Space out your code to make it more readable

Use comments to clarify your code

Use proper variable names

Be consistent with your quotes (' vs ")

More on this later in this lecture

Get Help


            help()
          

or


            ?help
          

for example:


            ?read_csv
          

or:


            ?summary
          

Slido

How can we get help in RStudio?








Libraries / Packages


Pre-Installed libraries

User Contributed libraries

Loading Libraries / Packages


          library(dplyr)
          

-> loads the package; stops with an error if it is not installed



          require(dplyr)
          

-> also loads the package, but returns TRUE or FALSE instead of stopping

Quotes are optional here: library(dplyr) and library("dplyr") both work



          detach('package:dplyr')
          

-> removes or unloads libraries
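
The difference between library() and require() is easiest to see with a package that does not exist. The sketch below uses a made-up package name ("notARealPackage123") purely as an assumption for the demo.

```r
# Quoted and unquoted package names both work
library(stats)
library("stats")

# "notARealPackage123" is a made-up name that should not be installed anywhere
library_result <- tryCatch(
  {
    library("notARealPackage123")
    "loaded"
  },
  error = function(e) "library() stopped with an error"
)
print(library_result)

# require() returns FALSE (plus a warning) instead of stopping the script
require_result <- suppressWarnings(require("notARealPackage123"))
print(require_result)
```

In a script you usually want library(), so that a missing package fails loudly at the top rather than causing confusing errors later.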

Clear your environment

Tip Of The Day

How do you clear the environment?


rm(list=ls())
          

How do you clear the console?


cat("\014")
          

How do you clear any plots?


dev.off()
          

Slido

How can you clear the environment in RStudio?






Functions = Code Reuse




# ────────────────────────────────────────────────
# Clears Out Everything
# ────────────────────────────────────────────────
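clear_all <- function() {
  # clears the global environment; a bare ls() here would only list
  # the function's own (empty) environment and remove nothing
  rm(list = ls(envir = globalenv()), envir = globalenv())
  cat('\014') # clears the console
  if (!is.null(dev.list())) dev.off() # clears the plots, if any are open
}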

# calling the function
clear_all()

          

Capitalization Matters


Use lower case variable names

Separate words with underscores

This is known as snake_case

❌ DON'T DO THIS


ALLCAPS

dot.case

camelCase

kebab-case

Variable Names


first_name <- "Alice"
          

✅ OK



temperature_2025 <- 76.5
          

✅ OK



2025_temperature <- 112.7
          

❌ NOT OK
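
One rough way to check a name before using it is base R's make.names(), which rewrites anything that is not a syntactically valid name. The sketch below compares each candidate with its rewritten form; the station variables at the end are illustrative only.

```r
# Candidate variable names taken from the examples above
candidate_names <- c("first_name", "temperature_2025", "2025_temperature", "kebab-case")

# A name is syntactically valid if make.names() leaves it unchanged
is_valid <- candidate_names == make.names(candidate_names)
print(data.frame(name = candidate_names, valid = is_valid))

# R is case sensitive: these are two different variables
station <- "Downtown"
Station <- "Valley"
same_value <- station == Station
print(same_value)
```

Note that a valid name is not automatically a good one: dot.case and camelCase pass this check but still break the snake_case convention used in this class.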

Spacing


Make sure that you have empty lines between meaningful blocks of code

This makes your code much easier to understand

Especially by someone else reading it!

Or even you yourself a year later 📅

❌ No Spacing Example - Bad Style



library(dplyr)
library(ggplot2)
data <- mtcars
summary <- data %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))
plot <- ggplot(summary, aes(x = factor(cyl), y = avg_mpg)) + geom_col()
print(plot)
          

✅ Proper Spacing Example


# ────────────────────────────────────────────────
# Load libraries
# ────────────────────────────────────────────────
library(dplyr)
library(ggplot2)



# ────────────────────────────────────────────────
# Prepare data
# ────────────────────────────────────────────────
data <- mtcars
grouped <- group_by(data, cyl)
summary <- summarise(grouped, avg_mpg = mean(mpg), .groups = "drop")



# ────────────────────────────────────────────────
# Create plot
# ────────────────────────────────────────────────
plot <- ggplot(summary, aes(x = factor(cyl), y = avg_mpg)) +
  geom_col() + labs(x = "Cylinders", y = "Average MPG")



# ────────────────────────────────────────────────
# Display plot
# ────────────────────────────────────────────────
print(plot)



          

Comments in R


We use # symbol for comments in R

Make sure that the comments are clarifying

Avoid comments that don't add anything new

Avoid very long comments that go off screen

Slido

Which symbol do we use for comments in R?






About Piping


Examples



speed <- 15 # mph
          

✅ OK



value <- 10 # assign 10
          

❌ NOT OK



color <- "turquoise" # make sure the color is turquoise because that is my favorite color and i love everything turquoise
          

❌ NOT OK

Collapsing


Grouping

Condensing

Summarizing

Aggregating

When is it used?


You want to summarize large datasets

You need to prepare data for visualizations

You want to reduce complexity for statistical analysis

You want to understand patterns instead of individual data points

Library Imports



# Step 1: Load libraries
library(readr)
library(dplyr)
library(ggplot2)
          

New Library



library(collapse)
          

Used for data manipulation, aggregation, and transformation


collap()
          

Main function for collapsing/aggregating data

Collapse Library

  • Fast data manipulation for large datasets
  • Efficient group objects using GRP()
  • Collapse data by groups using collap()
  • Aggregation functions like fmean() and fsum()
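
A small sketch of these functions on the built-in mtcars data, with a base R result alongside for comparison. The collapse part only runs if the package is installed, and the variable names are illustrative.

```r
# Base R reference: mean mpg per cylinder count in the built-in mtcars data
base_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(base_means)

# collapse versions, run only when the package is installed
if (requireNamespace("collapse", quietly = TRUE)) {
  library(collapse)

  groups <- GRP(mtcars, ~ cyl)          # reusable grouping object
  print(fmean(mtcars$mpg, g = groups))  # grouped mean
  print(fsum(mtcars$mpg, g = groups))   # grouped sum

  # collap() does the grouping and the aggregation in one call
  print(collap(mtcars, mpg ~ cyl, FUN = fmean))
}
```

Creating the GRP() object once and reusing it is what makes repeated grouped calculations cheap on large datasets.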

Measuring Pollution


🏢 Downtown LA

🏡 San Fernando Valley

🛫 LAX (the airport)

🚢 Port of Los Angeles

Daily Pollution Data

Date        Station     PM2.5  Ozone
2025-01-01  Downtown    12.5   0.040
2025-01-01  Valley      20.3   0.050
2025-01-01  LAX         15.0   0.045
2025-01-01  Port of LA  22.1   0.055
2025-01-02  Downtown    14.2   0.039

Monthly Average Pollution

Station     Month    Avg PM2.5  Avg Ozone
Downtown    2025-01  13.4       0.0395
Valley      2025-01  20.3       0.0500
LAX         2025-01  15.0       0.0450
Port of LA  2025-01  22.1       0.0550
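
Going from the daily table to the monthly one can be sketched as follows. The column names (date, station, pm25, ozone) are my own choice, the second Downtown reading is assumed to be from 2025-01-02, and the collap() call is only run if the collapse package is installed.

```r
# Sample rows from the daily table (second Downtown reading assumed to be 2025-01-02)
daily <- data.frame(
  date    = as.Date(c("2025-01-01", "2025-01-01", "2025-01-01", "2025-01-01", "2025-01-02")),
  station = c("Downtown", "Valley", "LAX", "Port of LA", "Downtown"),
  pm25    = c(12.5, 20.3, 15.0, 22.1, 14.2),
  ozone   = c(0.040, 0.050, 0.045, 0.055, 0.039)
)

# Derive the month each reading belongs to
daily$month <- format(daily$date, "%Y-%m")

# Base R: one row per station and month, holding the mean of each pollutant
monthly <- aggregate(cbind(pm25, ozone) ~ station + month, data = daily, FUN = mean)
print(monthly)

# Same aggregation with collapse, run only if the package is installed
if (requireNamespace("collapse", quietly = TRUE)) {
  monthly_fast <- collapse::collap(daily, pm25 + ozone ~ station + month,
                                   FUN = collapse::fmean)
  print(monthly_fast)
}
```

Five daily rows collapse into four monthly rows; only Downtown has more than one reading, so only its values change.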

R Basics 2: Assignment Question


How did our parents' education affect our own educational outcomes?

A sample of the NLSY Data

childid  child_gpa  mom_hsgrad  mom_schoolyrs
102      309        No          8
204      217        Yes         15
307      253        Yes         12
511      162        Yes         12
1124     234        No          6
1433     300        Yes         15
1433 300 Yes 15

Collapsing NLSY Data

Collapsing data means summarizing individual records into group-level statistics.

  • Reducing many rows to just a few rows per group
  • Identify patterns, trends, or differences in categories
  • Useful for comparing group means or running statistical models

Example


Average GPA by Mother's High School Graduation

mom_hsgrad  mean_gpa  n (children)
No          238.1     14
Yes         286.8     85
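
The same kind of collapse can be sketched in base R on the six sample rows shown earlier. Because this is only a six-row sample, the numbers will not match the table above, which comes from the full dataset. Object names such as nlsy_sample are illustrative.

```r
# The six sample rows from the NLSY slide (not the full dataset)
nlsy_sample <- data.frame(
  childid       = c(102, 204, 307, 511, 1124, 1433),
  child_gpa     = c(309, 217, 253, 162, 234, 300),
  mom_hsgrad    = c("No", "Yes", "Yes", "Yes", "No", "Yes"),
  mom_schoolyrs = c(8, 15, 12, 12, 6, 15)
)

# Group-level mean and group size, one row per value of mom_hsgrad
mean_gpa <- aggregate(child_gpa ~ mom_hsgrad, data = nlsy_sample, FUN = mean)
counts   <- aggregate(child_gpa ~ mom_hsgrad, data = nlsy_sample, FUN = length)

# Combine both summaries into a single collapsed table
gpa_by_mom <- data.frame(
  mom_hsgrad = mean_gpa$mom_hsgrad,
  mean_gpa   = mean_gpa$child_gpa,
  n          = counts$child_gpa
)
print(gpa_by_mom)
```

With so few rows per group the means are not reliable, which is why the group size n is worth reporting next to every collapsed statistic.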

Considerations


Missing values

Weighting data


(out of scope)
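
Why missing values matter when collapsing: a single NA turns a base R mean into NA unless you ask for it to be skipped. The values below are illustrative, not taken from the assignment data.

```r
# Illustrative GPA values with one missing entry
gpa_with_gap <- c(309, NA, 253)

# By default, a single NA makes the whole mean NA
plain_mean <- mean(gpa_with_gap)
print(plain_mean)

# na.rm = TRUE drops missing values before averaging
clean_mean <- mean(gpa_with_gap, na.rm = TRUE)
print(clean_mean)

# Count the missing values so you know how much data was dropped
n_missing <- sum(is.na(gpa_with_gap))
print(n_missing)
```

It is worth checking how many values were dropped in each group, since a group mean based on very few observations can be misleading.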

Checklist!


Step 1. Load the necessary packages

Step 2. Use original NLSY data

Step 3. Collapse GPA by Mom's education

Step 4. Visualize results with a bar chart

Step 5. Save and submit that assignment-2.R

Step 6. ....... profit?
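
Step 4 could look roughly like the sketch below, which uses base R's barplot() and the two group means from the example table. The axis labels and title are my own wording, and ggplot2's geom_col(), used in the spacing example, would work just as well.

```r
# Group means copied from the "Average GPA by Mother's High School Graduation" table
gpa_summary <- data.frame(
  mom_hsgrad = c("No", "Yes"),
  mean_gpa   = c(238.1, 286.8)
)

# One bar per group; barplot() returns the x position of each bar
bar_positions <- barplot(
  height    = gpa_summary$mean_gpa,
  names.arg = gpa_summary$mom_hsgrad,
  xlab      = "Mother graduated high school",
  ylab      = "Average child GPA",
  main      = "Average GPA by mother's education"
)
```

For the assignment itself, compute the summary from the original NLSY data rather than typing the numbers in by hand.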

Remember the best practices!


Correct file paths

Clean environment

Use code spacing

Use comments

Be precise

R Basics 2: Assignment


Published after the class today

Assignment is going to be due this Friday

Friday April 11, 2025 at 11:59 pm PT