The Web as Your Data Playground: Mastering APIs and Web Scraping

Let’s face it—some of the most interesting data today never touches a database or CSV file. It lives out in the wild: real-time stock prices, social media trends, weather patterns, and news headlines. This data is dynamic, constantly updating, and available for the taking if you know how to ask for it properly.

Think of APIs as polite conversations with data services, and web scraping as carefully extracting information from public websites. Both skills transform R from an analytical tool into a live data collection engine.

The Art of API Conversation

APIs (Application Programming Interfaces) are essentially structured ways for applications to talk to each other. When you use an API, you’re asking a service to give you specific information in a predictable format.

Making Your First API Call

Let’s start with a practical example—fetching current cryptocurrency prices:

r

library(httr)
library(jsonlite)

# Basic GET request to a crypto API
response <- GET("https://api.coingecko.com/api/v3/simple/price?ids=bitcoin,ethereum&vs_currencies=usd")

# Always check if the request was successful
if (status_code(response) == 200) {
  crypto_data <- fromJSON(content(response, "text"))
  bitcoin_price <- crypto_data$bitcoin$usd
  ethereum_price <- crypto_data$ethereum$usd
  print(paste("Bitcoin: $", bitcoin_price, " | Ethereum: $", ethereum_price))
} else {
  warning("API request failed with status: ", status_code(response))
}

This pattern—make request, check status, parse response—is the foundation of all API work.
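
If you find yourself repeating that pattern, it can be wrapped in a small helper. Below is a minimal sketch; get_json() is just a convenience wrapper written for this example, not a function from httr.

r

library(httr)
library(jsonlite)

# Minimal request -> check -> parse helper.
# get_json() is our own wrapper for this example, not part of httr.
get_json <- function(url, ...) {
  response <- GET(url, ...)
  stop_for_status(response)  # errors on any non-2xx status code
  fromJSON(content(response, "text", encoding = "UTF-8"))
}

# Same CoinGecko endpoint as above
prices <- get_json("https://api.coingecko.com/api/v3/simple/price?ids=bitcoin&vs_currencies=usd")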

Handling Authentication Like a Pro

Many APIs require authentication to prevent abuse and track usage. Here’s how to handle the most common authentication schemes:

r

# API key as a query parameter (common for simple APIs)
api_key <- Sys.getenv("WEATHER_API_KEY")  # Store keys in environment variables
weather_response <- GET(
  "https://api.weatherapi.com/v1/current.json",
  query = list(q = "New York", key = api_key)
)

# Bearer token authentication
token <- Sys.getenv("TWITTER_BEARER_TOKEN")
twitter_response <- GET(
  "https://api.twitter.com/2/tweets/search/recent",
  add_headers(Authorization = paste("Bearer", token)),
  query = list(query = "data science", max_results = 10)
)

# OAuth2 flow for user-specific data
my_app <- oauth_app("my_github_app", key = "your_key", secret = "your_secret")
github_token <- oauth2.0_token(oauth_endpoints("github"), my_app)
github_response <- GET("https://api.github.com/user/repos", config(token = github_token))

Golden Rule: Never hardcode API keys or secrets in your scripts. Use environment variables or secure credential stores.
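
One lightweight approach is a user-level .Renviron file, which R reads at startup. A minimal sketch follows; the key names and values are placeholders.

r

# In ~/.Renviron (one NAME=value pair per line; keep this file out of version control):
# WEATHER_API_KEY=your_key_here
# NEWS_API_KEY=your_other_key_here

# After restarting R, the keys are available through Sys.getenv()
api_key <- Sys.getenv("WEATHER_API_KEY")
if (!nzchar(api_key)) stop("WEATHER_API_KEY is not set")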

Taming Paginated APIs

Many APIs split results across multiple pages to avoid overwhelming their servers. Here’s how to collect all the data systematically:

r

library(dplyr)  # for bind_rows()

fetch_all_news_articles <- function(query, max_pages = 10) {
  all_articles <- list()
  page <- 1

  while (page <= max_pages) {
    response <- GET(
      "https://newsapi.org/v2/everything",
      query = list(
        q = query,
        page = page,
        apiKey = Sys.getenv("NEWS_API_KEY")
      )
    )

    # Check if we got a successful response
    if (status_code(response) != 200) break

    page_data <- fromJSON(content(response, "text"))
    articles <- page_data$articles

    # Stop if we've reached the last page
    if (is.null(articles) || nrow(articles) == 0) break

    # Store each page's data frame and combine them at the end
    all_articles[[page]] <- articles

    # Be polite - don't overwhelm the API
    Sys.sleep(0.5)
    page <- page + 1
  }

  return(bind_rows(all_articles))
}

# Usage
climate_articles <- fetch_all_news_articles("climate change")

Building Resilient API Calls

APIs can be flaky—servers go down, rate limits get hit, networks fail. Professional code handles these gracefully:

r

safe_api_call <- function(url, max_retries = 3, timeout_sec = 10) {
  for (attempt in 1:max_retries) {
    tryCatch({
      response <- GET(url, timeout(timeout_sec))

      if (status_code(response) == 429) {  # Rate limited
        reset_time <- as.numeric(response$headers$`x-ratelimit-reset`)
        wait_time <- reset_time - as.numeric(Sys.time())
        if (wait_time > 0) {
          message("Rate limited. Waiting ", wait_time, " seconds.")
          Sys.sleep(wait_time)
          next
        }
      }

      stop_for_status(response)  # Throw error for other bad status codes
      return(response)
    }, error = function(e) {
      message("Attempt ", attempt, " failed: ", e$message)
      if (attempt == max_retries) stop("All retries exhausted")
      Sys.sleep(2 ^ attempt)  # Exponential backoff
    })
  }
}

# Usage with error handling: assign the tryCatch() result so the
# cached fallback actually ends up in processed_data
processed_data <- tryCatch({
  stock_data <- safe_api_call("https://api.twelvedata.com/stocks?symbol=AAPL")
  fromJSON(content(stock_data, "text"))
}, error = function(e) {
  message("Failed to fetch stock data: ", e$message)
  # Load cached data as fallback
  readRDS("cached_stock_data.rds")
})

Web Scraping: When APIs Don’t Exist

Sometimes the data you need isn’t available through APIs—it’s sitting on public websites. Web scraping lets you extract this information, but it comes with responsibilities.

Ethical Scraping 101

Before scraping any website:

  • Check robots.txt (e.g., example.com/robots.txt); a programmatic check is sketched after this list
  • Respect rate limits—don’t hammer servers
  • Only scrape publicly available data
  • Follow the website’s terms of service
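
For the first point, the robotstxt package can run that check programmatically. A brief sketch, using the hypothetical example-books.com domain that appears in the scraping examples below:

r

library(robotstxt)

# Ask whether a generic bot may fetch /books before scraping it
# (example-books.com is the hypothetical site used in the examples below)
paths_allowed(paths = "/books", domain = "example-books.com", bot = "*")
#> Returns TRUE or FALSE depending on the site's robots.txt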

Extracting Data with rvest

Here’s how to scrape product information from an e-commerce site (a hypothetical bookstore at example-books.com):

r

library(rvest)
library(polite)
library(purrr)  # map_df()
library(readr)  # parse_number()

# Always be polite - use the polite package
session <- bow(
  "https://example-books.com",
  delay = 2,  # Wait 2 seconds between requests
  user_agent = "Academic Research Bot 1.0"
)

scrape_books <- function(page_num = 1) {
  # Agree the /books path with the host, then request the given page
  page_session <- nod(session, path = "books")
  scraped_page <- scrape(page_session, query = list(page = page_num))

  books <- scraped_page %>%
    html_elements(".book-item") %>%
    map_df(function(book) {
      data.frame(
        title = book %>% html_element(".title") %>% html_text2(),
        author = book %>% html_element(".author") %>% html_text2(),
        price = book %>% html_element(".price") %>% html_text2() %>% parse_number(),
        rating = book %>% html_element(".stars") %>% html_attr("data-rating") %>% as.numeric(),
        stringsAsFactors = FALSE
      )
    })

  return(books)
}

# Scrape multiple pages responsibly
all_books <- map_df(1:5, scrape_books)  # Only 5 pages to be respectful

Handling Dynamic Content

Some websites load data dynamically with JavaScript. For these, you might need RSelenium:

r

library(RSelenium)
library(rvest)

# Start a Selenium server (requires Docker or a standalone server)
rD <- rsDriver(browser = "chrome", port = 4445L)
driver <- rD$client

# Navigate to a JavaScript-heavy site
driver$navigate("https://example-spa.com/dynamic-data")

# Wait for content to load
Sys.sleep(3)

# Get the page source after JavaScript execution
page_content <- driver$getPageSource()[[1]]

# Parse with rvest as usual
dynamic_data <- read_html(page_content) %>%
  html_elements(".data-item") %>%
  html_text2()

# Clean up: close the browser and stop the server
driver$close()
rD$server$stop()

Transforming Raw Web Data into Analysis-Ready Formats

API responses and scraped data often need significant cleaning:

r

library(tidyverse)
library(jsonlite)  # fromJSON()

# Clean and transform API data
clean_weather_data <- function(raw_api_response) {
  weather_df <- fromJSON(raw_api_response) %>%
    pluck("data") %>%
    as_tibble() %>%
    mutate(
      timestamp = as.POSIXct(timestamp, origin = "1970-01-01"),
      temperature = as.numeric(temp_c),
      humidity = as.numeric(humidity),
      condition = as.factor(condition$text)
    ) %>%
    select(timestamp, temperature, humidity, condition, wind_kph = wind_speed) %>%
    arrange(timestamp)

  return(weather_df)
}

# Handle nested JSON structures
flatten_complex_api_response <- function(nested_data) {
  flattened <- nested_data %>%
    unnest_wider(user, names_sep = "_") %>%
    unnest_longer(comments, indices_to = "comment_index") %>%
    mutate(
      across(where(is.list), ~ map_chr(., ~ paste(., collapse = ", "))),
      publication_date = as.Date(created_at)
    )

  return(flattened)
}

Building Production-Grade Data Pipelines

When you’re ready to move from experimentation to production:

r

library(targets)

# Define a pipeline for daily weather data collection.
# tar_script() writes this definition to _targets.R; the helper functions
# (fetch_weather_data(), clean_weather_data(), etc.) must be defined or
# source()'d inside that script as well.
tar_script({
  list(
    tar_target(weather_raw, {
      fetch_weather_data(Sys.Date())
    }, format = "qs"),
    tar_target(weather_clean, {
      clean_weather_data(weather_raw)
    }),
    tar_target(weather_analysis, {
      analyze_weather_trends(weather_clean)
    }),
    tar_target(weather_report, {
      generate_daily_report(weather_analysis)
    }, format = "file")
  )
})

# Run the pipeline (schedule this daily with cron, Task Scheduler, or similar)
tar_make()

Conclusion: The Web as Your Living Dataset

Mastering APIs and web scraping fundamentally changes your relationship with data. Instead of being limited to static files and databases, you can now:

  • Monitor real-time events as they happen
  • Build living dashboards that update automatically
  • Create unique datasets nobody else has
  • Respond to changes in near real-time

But with great power comes great responsibility. Always be respectful—treat websites and APIs as you’d want your own services treated. Use rate limiting, cache data when appropriate, and respect terms of service.
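
For the caching piece, one lightweight option is the memoise package, shown here as a sketch layered on the safe_api_call() helper from earlier (assuming memoise 2.x with a cachem backend):

r

library(memoise)

# Cache API responses in memory so repeated calls within a session
# don't hit the server again; entries expire after an hour
cached_api_call <- memoise(safe_api_call, cache = cachem::cache_mem(max_age = 3600))

btc <- cached_api_call("https://api.coingecko.com/api/v3/simple/price?ids=bitcoin&vs_currencies=usd")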

The most successful data scientists aren’t just great analysts; they’re skilled data collectors who know how to find fresh, relevant data. By adding web APIs and ethical scraping to your toolkit, you’re not just analyzing the world—you’re connecting to its pulse.

Now go forth and build something amazing with the wealth of data waiting at your fingertips. The internet is your oyster, and R is your pearl-harvesting tool.
