ENVIRON-175

Programming with Big Environmental Datasets



Gleb Satyukov
Senior Research Engineer | Data Science Instructor


Slido


https://app.sli.do/event/fTKKGgcVXW4HMi4YpB2fkd

The Dress


Slides


R Basics 1: https://environ-175.com/basics/1

R Basics 2: https://environ-175.com/basics/2

R Basics 3: https://environ-175.com/basics/3

R Basics 4: https://environ-175.com/basics/4

R Basics 5: https://environ-175.com/basics/5

Slides


R Advanced 1: https://environ-175.com/advanced/1

R Advanced 2: https://environ-175.com/advanced/2

R Advanced 3: https://environ-175.com/advanced/3

R Advanced 4: https://environ-175.com/advanced/4

R Advanced 5: https://environ-175.com/advanced/5

Slides


R Spatial 1: https://environ-175.com/spatial/1

R Spatial 2: https://environ-175.com/spatial/2

R Spatial 3: https://environ-175.com/spatial/3

R Spatial 4: https://environ-175.com/spatial/4

R Spatial 5: https://environ-175.com/spatial/5

Agenda for today

Reminder about the best practices

Troubleshooting / Debugging

Reading other types of files

Rounding numbers in R

MEPS Data Assignment

More about ggplot2

New function!

ggsave(filepath, height = 4, width = 6)

Join Our Slack Community

click here to join

Slack #classroom


Class announcements are in #general

#classroom is used during class

#team-1 through #team-5

Reminder about the best practices


Use proper file paths

Clean your environment

Use proper code spacing

Use inline and block comments

Save charts programmatically

Attention to detail
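A minimal sketch of how a few of these practices might look at the top of a script. The folder names `data` and `output`, the chart filename, and the object names `meps_path` and `chart_path` are assumptions for illustration; only `meps_2019.csv` comes from the assignment.

```r
# ---- Setup ----------------------------------------------------------
# Block comment: clean the environment so objects left over from an
# earlier session cannot leak into this script.
rm(list = ls())

# Build file paths with file.path() instead of typing separators by hand.
# The "data" and "output" folder names are assumptions.
meps_path  <- file.path("data", "meps_2019.csv")
chart_path <- file.path("output", "expenses_by_age.png")

print(meps_path)   # inline comment: confirm the path looks right
print(chart_path)
```

Keeping all paths in one place near the top makes it easier to spot a wrong folder name before the import step fails.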

Troubleshooting


Review your raw or clean data

Inspect variables in the console

Read the error messages in the console

Inspect the variables in the Environment panel

Don't be afraid of error messages

Check the documentation with ?function_name, for example:

?mean
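As a rough illustration of inspecting a variable from the console, the sketch below uses `CO2`, a dataset that ships with R, so it runs without any course files. Which checks are most useful depends on the error you are chasing.

```r
# CO2 is a built-in R dataset, so no import step is needed.
str(CO2)             # column names, types, and the first few values
head(CO2, 3)         # first three rows
summary(CO2$uptake)  # range and quartiles of one column
class(CO2$conc)      # check the type when a function rejects its input

# To open the documentation, run one of these in the console:
# ?mean
# help("CO2")
```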

View() Function

Rubber Duck Debugging

https://en.wikipedia.org/wiki/Rubber_duck_debugging

Review: Function Decomposition


my_function <- function(arg1, arg2 = "default") {
  # Do something with arg1 and arg2
}
          

my_function – the function name

arg1 – a required input

arg2 = "default" – an optional input with a default value
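To make the three parts concrete, here is a small sketch. `round_down` is a hypothetical helper written for this example, not a base R function.

```r
# round_down - the function name (hypothetical, for illustration only)
# x          - a required input: a numeric vector
# to = 10    - an optional input with a default value
round_down <- function(x, to = 10) {
  floor(x / to) * to
}

round_down(47)          # uses the default to = 10, returns 40
round_down(47, to = 5)  # overrides the default, returns 45
```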

What is a Function Signature?

It is the part of the function that tells you:

  • What the function is called
  • Which inputs or arguments are expected
  • What the function returns (if anything)

        mean(x, trim = 0, na.rm = FALSE)


This is the signature for the mean() function

For Example:


        mean(x, trim = 0, na.rm = FALSE) -> numeric


This is the signature for the built-in mean() function.

It tells you:

  • What the function is called
  • What necessary inputs (arguments) it expects
  • What optional inputs (arguments) it expects
  • What it returns (most of the time)
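A short sketch of how the optional arguments in this signature change the result. The vector `values` is made up for the example.

```r
values <- c(1, 2, 3, 100, NA)

mean(values)                             # NA, because na.rm defaults to FALSE
mean(values, na.rm = TRUE)               # 26.5, the NA is dropped first
mean(values, trim = 0.25, na.rm = TRUE)  # 2.5, lowest and highest value trimmed

# Print the argument names and defaults of the default method
args(mean.default)
```

Reading the signature first tells you that `trim` and `na.rm` exist and what they default to, which explains the `NA` in the first call.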

📦 Importing Data in RStudio

Different ways to load different types of data

The right function depends on the type and amount of data

read_csv()

  • From the readr package
  • Best for: CSV files
  • Fast, returns a tibble
  • Example:

library(readr)
data <- read_csv("data/environment.csv")
          

read_delim()

  • From the readr package
  • Best for: Custom-delimited text files
  • Example:

library(readr)
data <- read_delim("data/environment.txt", delim = "\t")
          

read_excel()

  • From the readxl package
  • Best for: Excel files (.xls, .xlsx)
  • Example:

library(readxl)
data <- read_excel("data/environment_data.xlsx", sheet = "Sheet1")
          

Base R: read.csv()

  • Slower, more manual setup
  • Most basic way of importing data
  • Example:

data <- read.csv("data/environment.csv")
          

sf::st_read()

  • From the sf package
  • Best for: Shapefiles and spatial vector data
  • Example:

library(sf)
shape_data <- st_read("data/shapefile.shp")
          

Base R Plot function

GGPlot2 Example

raster::raster()

  • From the raster package
  • Best for: GeoTIFF and raster layers
  • Example:

library(raster)
raster_data <- raster("data/satellite.tif")
          

Visual Comparison

Function       File Type      Package   Notes
read_csv()     .csv           readr     Faster
read_delim()   .txt, .tsv     readr     Custom delimiters
read_excel()   .xls, .xlsx    readxl    For Excel sheets
read.csv()     .csv           base R    Slower
st_read()      .shp           sf        Spatial data
raster()       .tif           raster    Raster data
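One way to read the comparison is as a lookup from file extension to reader. The sketch below wraps that lookup in `read_any()`, a hypothetical helper that is not part of any package. It assumes `.txt` files are tab-delimited, and it only needs readxl, sf, or raster installed when a file of that type is actually read.

```r
library(readr)

# read_any() is a hypothetical helper: it picks a reader from the
# file extension, following the comparison of import functions.
read_any <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv  = readr::read_csv(path, show_col_types = FALSE),
    tsv  = ,  # falls through to the txt reader
    txt  = readr::read_delim(path, delim = "\t", show_col_types = FALSE),
    xls  = ,  # falls through to the xlsx reader
    xlsx = readxl::read_excel(path),
    shp  = sf::st_read(path),
    tif  = raster::raster(path),
    stop("No reader configured for extension: ", ext)
  )
}

# Try it on a temporary CSV so the example does not depend on course files.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(site = c("A", "B"), temp_c = c(12.5, 14.1)),
          tmp, row.names = FALSE)
env <- read_any(tmp)
print(env)
```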

Tidy Tuesday


Publishing data samples every Tuesday since 2018

https://github.com/rfordatascience/tidytuesday/tree/main/data/2025/2025-04-15

Slido

Which function do we use to import XLSX files?






Slido

Which package do you need to import for the previous function to work?






GGPlot: Grammar of Graphics


The first parameter is always data

aes = aesthetic

geom = geometry

labs = labels

...
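The pieces above could be combined as in the following sketch. It uses the built-in `CO2` dataset; the title and axis labels are illustrative choices, not taken from the slides.

```r
library(ggplot2)

# data first, then aesthetics, then geometry, then labels
p <- ggplot(CO2, aes(x = conc, y = uptake, color = Type)) +
  geom_point() +
  labs(
    title = "CO2 uptake by concentration",
    x     = "CO2 concentration (mL/L)",
    y     = "Uptake (umol/m^2 sec)",
    color = "Origin"
  )

print(p)
```

Each `+` adds one layer or setting, so a chart can be built up and checked one line at a time.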

Basics


ggplot2 cheatsheet


https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf

CO2 Uptake






Basics 4 Data Assignment


Healthcare in the U.S.

Our World In Data: Healthcare

https://ourworldindata.org/financing-healthcare

Medical Expenditure Panel Survey


data from 2019

data in CSV format

filename: meps_2019.csv

https://meps.ahrq.gov/mepsweb/

Medical Expenditure Panel Survey


Different Categories of expenses

We will look at these categories by age

Checklist

1. Clean the environment

2. Import packages/libraries

3. Review and clean up the original data using

rename()

4. Add a new field using

mutate()

5. Collapse the data using


collap()
          

Checklist

6. Visualize and save the chart with:


ggplot(data, aes(x, y, ...)) +
  geom_point(...) +
  geom_line(...) +
  labs(...) +
  theme...

ggsave(filepath, width = number, height = number)
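As a filled-in version of this template, the sketch below plots the built-in `CO2` dataset and saves the chart to a temporary folder. The dataset, labels, theme, and the filename `co2_uptake.png` are assumptions for illustration; the assignment itself uses the MEPS data.

```r
library(ggplot2)

p <- ggplot(CO2, aes(x = conc, y = uptake)) +
  geom_point() +
  geom_line(aes(group = Plant)) +  # one line per plant
  labs(x = "Concentration", y = "Uptake") +
  theme_minimal()

# Save programmatically; width and height are in inches by default.
chart_path <- file.path(tempdir(), "co2_uptake.png")
ggsave(chart_path, plot = p, width = 6, height = 4)

file.exists(chart_path)  # confirm the file was written
```

Passing `plot = p` explicitly avoids relying on whichever plot happened to be displayed last.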
          

Review the original data

Cleaning the data

Review the clean data

Option 1: Separate data points

Option 2: Bin age variable

Notice there are 2 charts here

Multiple plots


What are the dimensions of this data?


This is one object with two Y variables: inpatient and emergency.

The X variable is age, grouped into 9 bins.

Slido


Question: What are the dimensions of this data?

Correct answer:

3 columns and 9 rows
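You can check this answer in the console with `dim()`. The sketch below builds a stand-in data frame with the same shape. The bin labels 0 to 80 and the column name `age10` are assumptions, and the expense columns are left as `NA` placeholders rather than real MEPS values.

```r
age_bins <- data.frame(
  age10     = seq(0, 80, by = 10),  # 9 age bins (assumed labels)
  inpatient = NA_real_,             # placeholder, not real data
  emergency = NA_real_              # placeholder, not real data
)

dim(age_bins)   # rows first, then columns
nrow(age_bins)
ncol(age_bins)
```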

How did we get separate bins?


1. Add new column using

mutate()

2. Group the data using

collap()

Try it yourself

Understanding collap() shorthand

Formula Syntax:

[Summarize these columns] ~ [Group by]

Example:

inpatient + emergency ~ age10 
  • + adds more variables to summarize
  • ~ separates grouping from summarized variables
  • / nests subgroupings (e.g., homeworld / gender)
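A small sketch of the formula in use, assuming `collap()` here is the function from the collapse package and that the summary wanted is the mean. The numbers in `toy` are made up for illustration.

```r
library(collapse)

# Toy data; the values are invented for this example.
toy <- data.frame(
  age10     = c(20, 20, 30, 30),
  inpatient = c(100, 300, 200, 400),
  emergency = c(10, 30, 20, 40)
)

# Left of ~ : columns to summarize. Right of ~ : grouping column.
by_age <- collap(toy, inpatient + emergency ~ age10, FUN = fmean)
print(by_age)
```

The result has one row per value of `age10`, with the mean of each summarized column in that group.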

Rounding


We need to round each age to a multiple of 10 to get its age bin

Rounding is usually easy for people, not so much for computers

If I told you to round the number 42 to the nearest 10, you'd know it is 40

Rounding in R


Rounding down and up is done with the floor() and ceiling() functions

floor(4.2) = 4
ceiling(4.2) = 5

Slido


What is the result of floor(42)?


Correct answer: 42

Slido


What is the result of floor(42/10)?


Correct answer: 4

One more attempt


Multiply by 10 if we want the function to return 40

floor(42/10) * 10 = 40
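The same trick works on a whole column at once. The sketch below uses made-up ages and assumes the new column is called `age10`; it also shows `ceiling()` and `round()` for comparison, since `round()` sends 48 up to 50 rather than keeping it in the 40s bin.

```r
library(dplyr)

ages <- c(7, 18, 42, 48, 83)  # made-up ages for illustration

floor(ages / 10) * 10     # 0 10 40 40 80  (always rounds down)
ceiling(ages / 10) * 10   # 10 20 50 50 90 (always rounds up)
round(ages, -1)           # 10 20 40 50 80 (nearest 10)

# Add the age bin as a new column with mutate()
people <- data.frame(age = ages)
people <- mutate(people, age10 = floor(age / 10) * 10)
print(people)
```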

Try it yourself

Multiple plots


R Basics 4: Assignment


The assignment is already published on Canvas

It is due this Monday

Due Date: Monday April 21, 2025 at 11:59 pm PT