Geocoding with ArcGIS and Nominatim

Geocoding is the process of converting a physical address or location into geographic coordinates. It is useful in a wide range of applications and industries that involve spatial data. This report introduces two basic geocoding tasks, geocoding physical addresses and identifying their corresponding census tracts, and includes instructions for using ArcGIS (paid license) and the Nominatim API in R (free).

For further instruction, see the ArcGIS and Nominatim documentation.

Getting started

Prep, clean, and standardize addresses

Before geocoding, clean the addresses and convert them to standard postal format. This improves the reliability of the geocodes regardless of the method used, and it matters even more when using the Nominatim API. While ArcGIS can handle minor spelling errors, certain spelling and formatting errors can still lead to inaccurate geocodes. Common cleaning and preparation steps include:

  • Correcting common spelling errors (e.g., city names)
  • Abbreviating address suffixes (e.g., drive should be dr)
  • Replacing incorrect zip codes with five zeros (i.e., 00000)
  • Miscellaneous cleaning (e.g., removing random symbols, multiple spaces, etc.)
  • Saving new address file (for ArcGIS, addresses can either be stored in a single field with address components separated by a comma or in multiple fields)
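library(dplyr)  # provides the %>% pipe

# long or misspelled forms (e.g., "drv", "highwat", "circlr") and their standard abbreviations
address_long <- c("lane", "road", "drive", "drv", "street", "apartment", "avenue", "boulevard", 
                  "expressway", "highway", "highwat", "suite", "court", "circle", "circlr")
address_abrv <- c("ln", "rd", "dr", "dr", "st", "apt", "ave", "blvd", "expy", "hwy", "hwy", "ste", 
                  "ct", "cir", "cir")

# Expects df to contain columns named st_address, city, and zip.
# The column arguments are kept in the signature but are never evaluated.
clean_st_address <- function(df, st_address, city, zip) {
  df <- df %>%
    tidyr::drop_na(st_address) %>%
    dplyr::mutate(st_address = tolower(st_address)) %>%
    # keep zips that contain five consecutive digits; otherwise use "00000"
    dplyr::mutate(zip = ifelse(stringr::str_detect(zip, "\\d{5}"), zip, "00000")) %>%
    dplyr::mutate(st_address = gsub("\\s+", " ", st_address),
                  # remove periods, commas, hyphens, "#", apt/unit designators with
                  # everything after them, and digits that follow a space
                  st_address = gsub("\\.|\\,|\\sapt.*|\\sunit.*|\\s#.*|\\#|\\-|\\s\\d+", "", st_address),
                  # replace whole words only, so that "lanewood" does not become "lnwood"
                  st_address = stringi::stri_replace_all_regex(st_address,
                                                               paste0("\\b", address_long, "\\b"),
                                                               address_abrv, vectorize_all = FALSE)) %>%
    dplyr::mutate(city = tolower(gsub("\\s+", " ", city)),
                  city = gsub("\\,.*|\\stx|\\.", "", city)) 
  return(df)
}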
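
The last step in the list above, storing each address in a single comma-separated field, can be sketched in base R as follows. This is a minimal illustration rather than part of the original workflow: the helper name build_single_line, the state column, the example rows, and the output file name are all assumptions.

```r
# Hypothetical helper: combine cleaned address components into one
# comma-separated field.
build_single_line <- function(df) {
  paste(df$st_address, df$city, df$state, df$zip, sep = ", ")
}

# Made-up example rows that have already been cleaned
addresses <- data.frame(
  st_address = c("100 main st", "25 oak dr"),
  city       = c("austin", "houston"),
  state      = c("tx", "tx"),
  zip        = c("78701", "00000"),
  stringsAsFactors = FALSE
)

addresses$address <- build_single_line(addresses)
print(addresses$address)

# Save the new address file (file name is a placeholder)
# write.csv(addresses, "addresses_clean.csv", row.names = FALSE)
```

Keeping the individual components alongside the combined field leaves both ArcGIS input options (single field or multiple fields) available.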

Geocoding with ArcGIS

1. Geocode addresses

To begin, log in to ArcGIS Pro with NetID/username and password. Follow the steps below to obtain latitude and longitude coordinates.

  • New project → Map
  • Go to Geoprocessing pane on right side of window → Toolboxes → click on Geocoding Tools → select Geocode Addresses
    • Fill in/select parameters
      • Input table = addresses to be geocoded
      • Input address locator = “ArcGIS World Geocoding Service”
      • Output feature class = folder and file name for saving
      • Preferred location type = “Address location”
    • Click Run

1.1. Identify corresponding census tracts

If you need to identify corresponding census tracts, follow the instructions below. Otherwise, continue to step 2.

  • Download TIGER/Line shapefile for state
  • Add shapefile data to map: Map → Add Data → Add data to the map
    • Shapefile must be saved in same folder as address data
  • Go to Geoprocessing pane on right → Toolboxes → Analysis tools → Overlay → Spatial Join
    • Fill in/select parameters
      • Target features = geocoded addresses
      • Join features = shapefile
      • Output feature class = folder and file name for saving
      • Join operator = “join one to one”
      • Match option = “within”
    • Click Run

2. Export geo data

  • To view resulting table, right click on table name in Contents pane → select Attribute Table
  • To export data, click on the three lines in upper right corner of attribute table → select Export → save data with chosen file name + .csv (e.g., file_name.csv)

3. Remove unreliable geocodes

  • Remove addresses identified as being outside of the state
  • Remove addresses that did not return one of the following Addr_type values: PointAddress, StreetAddress, or Subaddress
  • Remove addresses with a match score (Score) below 90
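library(dplyr)  # provides %>%, filter(), and rename()

# Assumes the spatial join from step 1.1 was run, which adds the Join_Count field.
clean_arcgis_data <- function(df, state) {
  
  df <- df %>%
    dplyr::filter(Region == state,      # keep addresses located in the state
                  Join_Count == 1,      # matched to exactly one census tract
                  Addr_type %in% c("PointAddress", "StreetAddress", "Subaddress"),
                  Score >= 90) %>%      # keep reliable matches only
    dplyr::rename(address = "IN_SingleLine")
  
  return(df)
  
}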
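
Before dropping rows, it can help to see how many addresses fail each criterion. The sketch below does this in base R on a made-up table; the example values and the state name are invented, and only the column names Region, Addr_type, and Score are taken from the function above.

```r
# Made-up geocoding output with the columns used by the filters above
geo <- data.frame(
  Region    = c("Texas", "Texas", "Oklahoma", "Texas"),
  Addr_type = c("PointAddress", "Locality", "StreetAddress", "Subaddress"),
  Score     = c(100, 95, 98, 82),
  stringsAsFactors = FALSE
)

ok_types <- c("PointAddress", "StreetAddress", "Subaddress")

# One logical column per criterion: TRUE means the row passes
checks <- data.frame(
  in_state   = geo$Region == "Texas",
  good_type  = geo$Addr_type %in% ok_types,
  good_score = geo$Score >= 90
)

# Number of rows failing each criterion
failed <- colSums(!checks)
print(failed)

# Rows that pass every criterion
keep <- geo[rowSums(checks) == ncol(checks), ]
print(keep)
```

A large count for a single criterion (for example, many low scores) may point to a cleaning problem worth fixing before geocoding again.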

Geocoding with Nominatim API

While Nominatim may not be as accurate as ArcGIS, it is a decent alternative if you do not have access to an ArcGIS license and are looking for a free option. Below are the functions you will need to geocode your addresses. Be sure to check Nominatim’s Usage Policy for information on usage and call limits.

1. Generate API call

This function will construct the search request.

nominatim_search <- function(search_query_url, country_url, language_url, email_url) {
  
  # search endpoint; the address is passed as a free-form query in the q parameter
  nominatim_search_api <- "https://nominatim.openstreetmap.org/search"
  
  # percent-encode each query, including reserved characters such as commas
  search_query_url <- vapply(as.character(search_query_url), utils::URLencode,
                             character(1), reserved = TRUE, USE.NAMES = FALSE)
  
  # restrict results to the given country code(s), if supplied
  if (!is.null(country_url)) {
    country_url <- paste0("&countrycodes=", country_url)
  }
  parameters_url <- paste0("&format=json",
                           "&addressdetails=1&extratags=1&limit=1", country_url, "&accept-language=", 
                           language_url, "&email=", email_url)
  
  # one request URL per query
  nominatim_search_call <- paste0(nominatim_search_api, "?q=", search_query_url, parameters_url)
  
  return(nominatim_search_call)
  
}
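
To see what a single request looks like, the sketch below percent-encodes one made-up address with utils::URLencode and passes it in the q parameter, the free-form query parameter of the Nominatim search endpoint. The address and variable names are illustrative only, and most optional parameters are left out for brevity.

```r
# Made-up address used only for illustration
query <- "100 main st, austin, tx"

# reserved = TRUE also encodes reserved characters such as commas
encoded <- utils::URLencode(query, reserved = TRUE)
print(encoded)

# Minimal request URL: free-form query, JSON output, one result
example_url <- paste0("https://nominatim.openstreetmap.org/search",
                      "?q=", encoded, "&format=json&limit=1")
print(example_url)
```

Pasting a URL like this into a browser is a quick way to check the raw JSON for one address before running a full batch.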

2. Extract data from json

This function will extract the relevant data from the JSON output by converting it to an R object and creating a data frame.

get_geodata_from_json <- function(geodata_json) {
  
  # convert json output into r object
  geodata <- lapply(geodata_json, jsonlite::fromJSON, simplifyVector = FALSE)
 
  # extract coordinates, address, county, and importance
  extracted_data <- data.frame(lat = NA, lng = NA, address = NA, county = NA, importance = NA)
  
  for (i in seq_along(geodata)) {
    if (length(geodata[[i]]) != 0) {
      # get data from the first result (the only one when limit=1)
      result <- geodata[[i]][[1]]
      info <- list(result$lat, result$lon, result$display_name,
                   result$address$county, result$importance)
      
      # get rid of NULLs (fields missing from the response)
      info <- lapply(info, function(x) if (is.null(x)) NA else x)
      
      # add row to output data frame
      extracted_data[i, ] <- info
    } else {
      # no match was returned for this query
      extracted_data[i, ] <- NA
    }
  }
  return(extracted_data)
}

3. Main function

This is the main function. It combines the two functions above: it builds the request URLs, queries Nominatim, and returns a data frame with the geocoded data.

geocode_nominatim <- function(search_query, country = NULL, language = "en",
                              fields = "coordinates", email) {
  
  library(httr)  # GET() and content()
  
  # construct url for geocoding
  url_geocode <- nominatim_search(search_query, country, language, email)
  
  # get data from nominatim, one request at a time
  geodata_json <- vector("list", length(url_geocode))
  
  pb <- txtProgressBar(min = 0, max = length(url_geocode), initial = 0)
  for (i in seq_along(url_geocode)) {
    setTxtProgressBar(pb, i)
    response <- GET(url_geocode[i])
    # store the response body as JSON text
    geodata_json[[i]] <- content(response, as = "text", encoding = "UTF-8")
    # pause between requests to stay within the usage policy
    Sys.sleep(1.5)
  }
  close(pb)
  
  # get data from json output
  geodata_df <- get_geodata_from_json(geodata_json)

  # return data frame with the geodata
  geodata_result <- data.frame(search_query = as.character(search_query),
                               stringsAsFactors = FALSE)
  
  if ("all" %in% fields) {
    # lat, lng, address, county, and importance
    geodata_result <- cbind(geodata_result, geodata_df)
  } else if ("coordinates" %in% fields) {
    # lat and lng only
    geodata_result <- cbind(geodata_result, geodata_df[, c("lat", "lng")])
  }

  return(geodata_result)
}

4. Run

Run the code below to generate the geocodes. Here, address is a character vector of the cleaned addresses to geocode.

email <- "your-email"

defaultW <- getOption("warn")
options(warn = -1)

coords_nominatim <- geocode_nominatim(address, country = "us", fields = "all", 
                                      email = email)

options(warn = defaultW)
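
Nominatim's JSON output stores coordinates as text, so the lat and lng columns may need converting to numeric before mapping or further analysis. The sketch below uses a made-up result table in place of coords_nominatim (all values are invented) and also counts the addresses that returned no match.

```r
# Made-up stand-in for the data frame returned by geocode_nominatim()
coords_example <- data.frame(
  search_query = c("100 main st, austin, tx", "not a real address"),
  lat = c("30.2672", NA),
  lng = c("-97.7431", NA),
  stringsAsFactors = FALSE
)

# Convert coordinates from text to numeric
coords_example$lat <- as.numeric(coords_example$lat)
coords_example$lng <- as.numeric(coords_example$lng)

# Count addresses with no match and keep the matched ones
n_unmatched <- sum(is.na(coords_example$lat))
matched <- coords_example[!is.na(coords_example$lat), ]

print(n_unmatched)
print(matched)
```

Unmatched addresses are often worth a second look at the cleaning step, since a small formatting fix may be enough to obtain a match.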