Let’s face it—some of the most interesting data today never touches a database or CSV file. It lives out in the wild: real-time stock prices, social media trends, weather patterns, and news headlines. This data is dynamic, constantly updating, and available for the taking if you know how to ask for it properly.
Think of APIs as polite conversations with data services, and web scraping as carefully extracting information from public websites. Both skills transform R from an analytical tool into a live data collection engine.
The Art of API Conversation
APIs (Application Programming Interfaces) are essentially structured ways for applications to talk to each other. When you use an API, you’re asking a service to give you specific information in a predictable format.
Making Your First API Call
Let’s start with a practical example—fetching current cryptocurrency prices:
```r
library(httr)
library(jsonlite)

# Basic GET request to a crypto API
response <- GET("https://api.coingecko.com/api/v3/simple/price?ids=bitcoin,ethereum&vs_currencies=usd")

# Always check if the request was successful
if (status_code(response) == 200) {
  crypto_data <- fromJSON(content(response, "text"))
  bitcoin_price <- crypto_data$bitcoin$usd
  ethereum_price <- crypto_data$ethereum$usd
  print(paste("Bitcoin: $", bitcoin_price, " | Ethereum: $", ethereum_price))
} else {
  warning("API request failed with status: ", status_code(response))
}
```
This pattern—make request, check status, parse response—is the foundation of all API work.
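To avoid repeating that boilerplate in every script, you can wrap the pattern in a small helper. Here is a minimal sketch; the function name `get_json` is just illustrative, and it reuses the httr and jsonlite calls from above:

```r
# Minimal wrapper for the request -> check status -> parse pattern
get_json <- function(url, query = list()) {
  response <- GET(url, query = query)
  stop_for_status(response)  # error on any non-2xx status code
  fromJSON(content(response, "text", encoding = "UTF-8"))
}

# The same crypto request as above, expressed through the helper
prices <- get_json(
  "https://api.coingecko.com/api/v3/simple/price",
  query = list(ids = "bitcoin,ethereum", vs_currencies = "usd")
)
prices$bitcoin$usd
```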
Handling Authentication Like a Pro
Many APIs require authentication to prevent abuse and track usage. Here’s how to handle different security approaches:
```r
# API key authentication (this service expects the key as a query parameter)
api_key <- Sys.getenv("WEATHER_API_KEY")  # Store keys in environment variables
weather_response <- GET(
  "https://api.weatherapi.com/v1/current.json",
  query = list(q = "New York", key = api_key)
)

# Bearer token authentication (sent in the Authorization header)
token <- Sys.getenv("TWITTER_BEARER_TOKEN")
twitter_response <- GET(
  "https://api.twitter.com/2/tweets/search/recent",
  add_headers(Authorization = paste("Bearer", token)),
  query = list(query = "data science", max_results = 10)
)

# OAuth2 flow for user-specific data
# (client credentials also come from environment variables; the names are illustrative)
my_app <- oauth_app(
  "my_github_app",
  key = Sys.getenv("GITHUB_CLIENT_ID"),
  secret = Sys.getenv("GITHUB_CLIENT_SECRET")
)
github_token <- oauth2.0_token(oauth_endpoints("github"), my_app)
github_response <- GET("https://api.github.com/user/repos", config(token = github_token))
```
Golden Rule: Never hardcode API keys or secrets in your scripts. Use environment variables or secure credential stores.
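In practice, the easiest way to follow this rule in R is a user-level .Renviron file that never gets committed to version control. One option is the usethis helper shown in this quick sketch (the variable name matches the placeholder used above):

```r
# One-time setup: open your user-level .Renviron and add lines such as
#   WEATHER_API_KEY=your-key-here
#   NEWS_API_KEY=your-other-key-here
usethis::edit_r_environ()

# After restarting R, read the key at runtime instead of hardcoding it
api_key <- Sys.getenv("WEATHER_API_KEY")
if (identical(api_key, "")) stop("WEATHER_API_KEY is not set")
```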
Taming Paginated APIs
Many APIs split results across multiple pages to avoid overwhelming their servers. Here’s how to collect all the data systematically:
```r
library(httr)
library(jsonlite)
library(dplyr)  # for bind_rows()

fetch_all_news_articles <- function(query, max_pages = 10) {
  all_articles <- list()
  page <- 1

  while (page <= max_pages) {
    response <- GET(
      "https://newsapi.org/v2/everything",
      query = list(
        q = query,
        page = page,
        apiKey = Sys.getenv("NEWS_API_KEY")
      )
    )

    # Check if we got a successful response
    if (status_code(response) != 200) break

    page_data <- fromJSON(content(response, "text"))
    articles <- page_data$articles

    # Stop if we've reached the last page
    if (NROW(articles) == 0) break

    # Store this page's articles (a data frame) as one list element
    all_articles[[page]] <- articles

    # Be polite: don't overwhelm the API
    Sys.sleep(0.5)
    page <- page + 1
  }

  return(bind_rows(all_articles))
}

# Usage
climate_articles <- fetch_all_news_articles("climate change")
```
Building Resilient API Calls
APIs can be flaky—servers go down, rate limits get hit, networks fail. Professional code handles these gracefully:
```r
safe_api_call <- function(url, max_retries = 3, timeout_sec = 10) {
  for (attempt in 1:max_retries) {
    tryCatch({
      response <- GET(url, timeout(timeout_sec))

      if (status_code(response) == 429) {  # Rate limited
        reset_time <- as.numeric(response$headers$`x-ratelimit-reset`)
        wait_time <- reset_time - as.numeric(Sys.time())
        if (wait_time > 0) {
          message("Rate limited. Waiting ", wait_time, " seconds.")
          Sys.sleep(wait_time)
          next
        }
      }

      stop_for_status(response)  # Throw an error for other bad status codes
      return(response)
    }, error = function(e) {
      message("Attempt ", attempt, " failed: ", e$message)
      if (attempt == max_retries) stop("All retries exhausted")
      Sys.sleep(2 ^ attempt)  # Exponential backoff
    })
  }
}

# Usage with error handling: fall back to cached data if the API call fails
processed_data <- tryCatch({
  stock_data <- safe_api_call("https://api.twelvedata.com/stocks?symbol=AAPL")
  fromJSON(content(stock_data, "text"))
}, error = function(e) {
  message("Failed to fetch stock data: ", e$message)
  # Load cached data as fallback
  readRDS("cached_stock_data.rds")
})
```
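The fallback above assumes a cache file already exists. A minimal companion sketch, refreshing that cache after each successful fetch (the file name matches the one used above):

```r
# Refresh the local cache whenever we have data, so the fallback stays current
if (exists("processed_data")) {
  saveRDS(processed_data, "cached_stock_data.rds")
}
```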
Web Scraping: When APIs Don’t Exist
Sometimes the data you need isn’t available through APIs—it’s sitting on public websites. Web scraping lets you extract this information, but it comes with responsibilities.
Ethical Scraping 101
Before scraping any website:
- Check robots.txt (e.g., example.com/robots.txt); see the sketch after this list
- Respect rate limits—don’t hammer servers
- Only scrape publicly available data
- Follow the website’s terms of service
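For the robots.txt check, the robotstxt package can query a site's rules directly. A small sketch, using the placeholder domain from the next example:

```r
library(robotstxt)

# Ask whether a generic bot may crawl the listing pages we plan to scrape
paths_allowed(
  paths = "/books",
  domain = "example-books.com",
  bot = "*"
)
# TRUE means the site's robots.txt permits scraping that path
```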
Extracting Data with rvest
Here’s how to scrape product information from an e-commerce site:
```r
library(rvest)
library(polite)
library(purrr)   # for map_df()
library(readr)   # for parse_number()

# Always be polite: use the polite package to introduce yourself and respect crawl delays
session <- bow(
  "https://example-books.com",
  delay = 2,  # Wait 2 seconds between requests
  user_agent = "Academic Research Bot 1.0"
)

scrape_books <- function(page_num = 1) {
  # Point the polite session at the books listing and pass the page as a query parameter
  scraped_page <- session %>%
    nod(path = "books") %>%
    scrape(query = list(page = page_num))

  books <- scraped_page %>%
    html_elements(".book-item") %>%
    map_df(function(book) {
      data.frame(
        title = book %>% html_element(".title") %>% html_text2(),
        author = book %>% html_element(".author") %>% html_text2(),
        price = book %>% html_element(".price") %>% html_text2() %>% parse_number(),
        rating = book %>% html_element(".stars") %>% html_attr("data-rating") %>% as.numeric(),
        stringsAsFactors = FALSE
      )
    })

  return(books)
}

# Scrape multiple pages responsibly
all_books <- map_df(1:5, scrape_books)  # Only 5 pages, to be respectful
```
Handling Dynamic Content
Some websites load data dynamically with JavaScript. For these, you might need RSelenium:
```r
library(RSelenium)
library(rvest)

# Start a Selenium server and browser (rsDriver manages the driver binaries)
rD <- rsDriver(browser = "chrome", port = 4445L)
driver <- rD$client

# Navigate to a JavaScript-heavy site
driver$navigate("https://example-spa.com/dynamic-data")

# Wait for content to load
Sys.sleep(3)

# Get the page source after JavaScript execution
page_content <- driver$getPageSource()[[1]]

# Parse with rvest as usual
dynamic_data <- read_html(page_content) %>%
  html_elements(".data-item") %>%
  html_text2()

# Clean up: close the browser and stop the Selenium server
driver$close()
rD$server$stop()
```
Transforming Raw Web Data into Analysis-Ready Formats
API responses and scraped data often need significant cleaning:
```r
library(tidyverse)

# Clean and transform API data
clean_weather_data <- function(raw_api_response) {
  weather_df <- fromJSON(raw_api_response) %>%
    pluck("data") %>%
    as_tibble() %>%
    mutate(
      timestamp = as.POSIXct(timestamp, origin = "1970-01-01"),
      temperature = as.numeric(temp_c),
      humidity = as.numeric(humidity),
      condition = as.factor(condition$text)
    ) %>%
    select(timestamp, temperature, humidity, condition, wind_kph = wind_speed) %>%
    arrange(timestamp)

  return(weather_df)
}

# Handle nested JSON structures
flatten_complex_api_response <- function(nested_data) {
  flattened <- nested_data %>%
    unnest_wider(user, names_sep = "_") %>%
    unnest_longer(comments, indices_to = "comment_index") %>%
    mutate(
      across(where(is.list), ~map_chr(., ~paste(., collapse = ", "))),
      publication_date = as.Date(created_at)
    )

  return(flattened)
}
```
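To make the flattener concrete, here is a toy call with made-up nested data; the column names user, comments, and created_at match the ones the function expects:

```r
# Hypothetical nested API response: one row per post, with nested user info
# and a variable-length list of comments
nested_data <- tibble(
  id = 1:2,
  user = list(list(name = "Ada", id = 101), list(name = "Grace", id = 102)),
  comments = list(c("Great post", "Very useful"), "Thanks!"),
  created_at = c("2024-05-01", "2024-05-02")
)

flatten_complex_api_response(nested_data)
# One row per comment, with user_name / user_id columns and a proper Date column
```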
Building Production-Grade Data Pipelines
When you’re ready to move from experimentation to production:
```r
library(targets)

# Define a pipeline for daily weather data collection
# (assumes fetch_weather_data(), clean_weather_data(), analyze_weather_trends(),
#  and generate_daily_report() are defined or sourced in the pipeline script)
tar_script({
  list(
    tar_target(weather_raw, {
      fetch_weather_data(Sys.Date())
    }, format = "qs"),  # the "qs" format requires the qs package
    tar_target(weather_clean, {
      clean_weather_data(weather_raw)
    }),
    tar_target(weather_analysis, {
      analyze_weather_trends(weather_clean)
    }),
    tar_target(weather_report, {
      generate_daily_report(weather_analysis)
    }, format = "file")
  )
})

# Run the pipeline (see the scheduling sketch below for making this a daily job)
tar_make()
```
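tar_make() runs the pipeline once; running it every day means handing it to a scheduler. One option on Unix-like systems is the cronR package. A sketch, assuming the pipeline lives in a hypothetical script called run_pipeline.R that calls tar_make():

```r
library(cronR)

# Build the command that runs the pipeline script, then register a daily cron job
cmd <- cron_rscript("~/projects/weather/run_pipeline.R")
cron_add(cmd, frequency = "daily", at = "06:00", id = "daily_weather_pipeline")
```

On Windows, the taskscheduleR package plays the same role.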
Conclusion: The Web as Your Living Dataset
Mastering APIs and web scraping fundamentally changes your relationship with data. Instead of being limited to static files and databases, you can now:
- Monitor real-time events as they happen
- Build living dashboards that update automatically
- Create unique datasets nobody else has
- Respond to changes in near real-time
But with great power comes great responsibility. Always be respectful—treat websites and APIs as you’d want your own services treated. Use rate limiting, cache data when appropriate, and respect terms of service.
The most successful data scientists aren’t just great analysts; they’re skilled data collectors who know how to find fresh, relevant data. By adding web APIs and ethical scraping to your toolkit, you’re not just analyzing the world—you’re connecting to its pulse.
Now go forth and build something amazing with the wealth of data waiting at your fingertips. The internet is your oyster, and R is your pearl-harvesting tool.