Geocoding with ArcGIS and Nominatim
Geocoding is the process of converting a physical address or location into geographical coordinates and is useful in a wide range of applications and industries that involve spatial data. This report provides an introduction to two basic geocoding tasks–geocoding physical addresses and identifying corresponding census tracts–and includes instructions for using ArcGIS ($) and Nominatim API in R (free).
For further instruction, see the ArcGIS and Nominatim documentation.
Getting started
Prep, clean, and standardize addresses
Before geocoding, addresses should be cleaned and prepared to standard postal format in order to obtain reliable geocodes, regardless of the method used (though it is undoubtedly more important when using the Nominatim API). While ArcGIS can handle minor spelling errors, certain spelling and formatting errors can still lead to inaccurate geocodes. Common cleaning and preparation steps may include:
- Correcting common spelling errors (e.g., city names)
- Abbreviating address suffixes (e.g., drive should be dr)
- Replacing incorrect zip codes with five zeros (i.e., 00000)
- Miscellaneous cleaning (e.g., removing random symbols, multiple spaces, etc.)
- Saving new address file (for ArcGIS, addresses can either be stored in a single field with address components separated by a comma or in multiple fields)
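library(magrittr)  # provides the %>% pipe

# long-form suffixes (including common misspellings), paired by position with their abbreviations
address_long <- c("lane", "road", "drive", "drv", "street", "apartment", "avenue", "boulevard",
                  "expressway", "highway", "highwat", "suite", "court", "circle", "circlr")
address_abrv <- c("ln", "rd", "dr", "dr", "st", "apt", "ave", "blvd", "expy", "hwy", "hwy", "ste",
                  "ct", "cir", "cir")

# df must contain columns named st_address, city, and zip
clean_st_address <- function(df, st_address, city, zip) {
  df <- df %>%
    tidyr::drop_na(st_address) %>%  # drop_na() is exported by tidyr, not dplyr
    dplyr::mutate(st_address = tolower(st_address)) %>%
    # replace zip codes without five consecutive digits with the placeholder "00000"
    dplyr::mutate(zip = ifelse(stringr::str_detect(zip, '\\d\\d\\d\\d\\d'), zip, "00000")) %>%
    # collapse repeated spaces; strip punctuation, unit designators, and standalone numbers
    # after the house number (\\b keeps ordinals such as "5th" intact); abbreviate suffixes
    dplyr::mutate(st_address = gsub("\\s+", " ", st_address),
                  st_address = gsub("\\.|\\,|\\sapt.*|\\sunit.*|\\s#.*|\\#|\\-|\\s\\d+\\b", "", st_address),
                  st_address = stringi::stri_replace_all_fixed(st_address, address_long, address_abrv,
                                                               vectorize_all = FALSE)) %>%
    # tidy the city name; "\\stx" drops a state abbreviation and is specific to Texas data
    dplyr::mutate(city = tolower(gsub("\\s+", " ", city)),
                  city = gsub("\\,.*|\\stx|\\.", "", city))
  return(df)
}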
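The last preparation step, storing each address in a single comma-separated field, might look like the sketch below. The data frame, its values, the state column, and the output file name are all made up for illustration; only the st_address, city, and zip column names follow the function above.

```r
# Toy data standing in for cleaned addresses (all values are made up)
addresses <- data.frame(
  st_address = c("123 main st", "456 oak dr"),
  city = c("austin", "houston"),
  state = c("tx", "tx"),          # hypothetical column, not produced by clean_st_address()
  zip = c("78701", "00000"),
  stringsAsFactors = FALSE
)

# Build one comma-separated field per address
addresses$address <- paste(addresses$st_address, addresses$city,
                           addresses$state, addresses$zip, sep = ", ")

# Write the file that will serve as the geocoder's input table
out_file <- file.path(tempdir(), "addresses_clean.csv")
write.csv(addresses, out_file, row.names = FALSE)
```

Keeping the components in separate columns as well makes it possible to switch to a multiple-field input later without re-cleaning.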
Geocoding with ArcGIS
1. Geocode addresses
To begin, log in to ArcGIS Pro with NetID/username and password. Follow the steps below to obtain latitude and longitude coordinates.
- New project → Map
- Go to Geoprocessing pane on right side of window → Toolboxes → click on Geocoding Tools → select Geocode Addresses
- Fill in/select parameters
  - Input table = addresses to be geocoded
  - Input address locator = “ArcGIS World Geocoding Service”
  - Output feature class = folder and file name for saving
  - Preferred location type = “Address location”
- Click Run
1.1. Identify corresponding census tracts
If you need to identify corresponding census tracts, follow the instructions below. Otherwise, continue to step 2.
- Download TIGER/Line shapefile for state
- Add shapefile data to map: Map → Add Data → Add data to the map
- Shapefile must be saved in same folder as address data
- Go to Geoprocessing pane on right → Toolboxes → Analysis tools → Overlay → Spatial Join
- Fill in/select parameters
  - Target features = geocoded addresses
  - Join features = shapefile
  - Output feature class = folder and file name for saving
  - Join operator = “join one to one”
  - Match option = “within”
- Click Run
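For readers working outside ArcGIS, the same point-in-polygon join can be sketched in R with the sf package. This is an illustration under stated assumptions rather than part of the ArcGIS workflow: the two square "tracts", the GEOID values, and the points are all made up, and real tracts would be read from the TIGER/Line shapefile with sf::st_read().

```r
library(sf)

# Helper that builds a closed square polygon from its corner coordinates
square <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(c(xmin, ymin), c(xmax, ymin), c(xmax, ymax),
                        c(xmin, ymax), c(xmin, ymin))))
}

# Two made-up "tracts" in longitude/latitude (WGS84)
tracts <- st_sf(
  GEOID = c("tract_a", "tract_b"),
  geometry = st_sfc(square(-98, 30, -97, 31), square(-96, 29, -95, 30), crs = 4326)
)

# Made-up geocoded addresses; the third lies outside both tracts
points <- st_as_sf(
  data.frame(id = 1:3, lng = c(-97.74, -95.37, -100), lat = c(30.27, 29.76, 35)),
  coords = c("lng", "lat"), crs = 4326
)

# Attach the tract each point falls within; unmatched points keep NA
joined <- st_join(points, tracts, join = st_within)
joined$GEOID
```

TIGER/Line files are distributed in NAD83, so with real data the points would first need sf::st_transform() to the CRS of the tracts.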
2. Export geo data
- To view resulting table, right click on table name in Contents pane → select Attribute Table
- To export data, click on the three lines in upper right corner of attribute table → select Export → save data with chosen file name + .csv (e.g., file_name.csv)
3. Remove unreliable geocodes
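- Remove addresses identified as being outside of the state (Region)
- If you ran the spatial join in step 1.1, remove addresses that were not matched to a census tract (Join_Count other than 1)
- Remove addresses that did not return one of the following Addr_type values: PointAddress, StreetAddress, or Subaddress
- Drop records with a match score (Score) below 90 (scores range from 0 to 100)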
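library(dplyr)  # provides %>%, filter(), and rename()

# state should match the values in the Region field of the exported table
clean_arcgis_data <- function(df, state) {
  df <- df %>%
    filter(Region == state,    # keep matches located in the expected state
           Join_Count == 1,    # field exists only after the Spatial Join in step 1.1
           Addr_type %in% c("PointAddress", "StreetAddress", "Subaddress"),
           Score >= 90) %>%    # Score ranges from 0 to 100
    # IN_SingleLine holds the input address when it was supplied as a single field
    rename(address = "IN_SingleLine")
  return(df)
}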
Geocoding with Nominatim API
While Nominatim may not be as accurate as ArcGIS, it is a decent alternative if you do not have access to an ArcGIS license and are looking for a free option. Below are the functions you will need to geocode your addresses. Be sure to check Nominatim’s Usage Policy for information on usage and call limits.
1. Generate API call
This function will construct the search request.
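nominatim_search <- function(search_query_url, country_url, language_url, email_url) {
  # free-form search endpoint; the address goes in the q parameter
  # (the older /search/<query> path form is deprecated)
  nominatim_search_api <- "https://nominatim.openstreetmap.org/search"
  # percent-encode each address, including reserved characters such as "#" and ","
  search_query_url <- vapply(as.character(search_query_url), utils::URLencode,
                             character(1), reserved = TRUE, USE.NAMES = FALSE)
  # optionally restrict results to the given country code(s)
  if (!is.null(country_url)) {
    country_url <- paste0("&countrycodes=", country_url)
  }
  # limit=1 asks for the single best match per address
  parameters_url <- paste0("&format=json",
                           "&addressdetails=1&extratags=1&limit=1", country_url, "&accept-language=",
                           language_url, "&email=", email_url)
  nominatim_search_call <- paste0(nominatim_search_api, "?q=", search_query_url, parameters_url)
  return(nominatim_search_call)
}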
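As a small illustration of the encoding step, the sketch below percent-encodes one made-up address with utils::URLencode(). Setting reserved = TRUE is a suggestion here: it also encodes characters such as "#" and ",", which could otherwise be read as URL syntax (for example, everything after a "#" in a unit number would be dropped from the query).

```r
# A made-up address used only for illustration
address <- "123 main st #4, austin, tx 78701"

# Percent-encode the address so spaces, commas, and "#" are safe inside a URL
encoded <- utils::URLencode(address, reserved = TRUE)
print(encoded)

# Decoding recovers the original string, so no information is lost
decoded <- utils::URLdecode(encoded)
identical(decoded, address)
```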
2. Extract data from json
This function will extract the relevant data from the JSON output by converting it to an R object and creating a data frame.
get_geodata_from_json <- function(geodata_json) {
  # convert each JSON string into a nested R list
  geodata <- lapply(geodata_json, jsonlite::fromJSON, simplifyVector = FALSE)
  # one row per query; a row stays NA when the query returned no match
  n <- length(geodata)
  extracted_data <- data.frame(lat = rep(NA_real_, n), lng = rep(NA_real_, n),
                               address = rep(NA_character_, n), county = rep(NA_character_, n),
                               importance = rep(NA_real_, n), stringsAsFactors = FALSE)
  # replace a missing (NULL) field with NA
  null_to_na <- function(x) if (is.null(x)) NA else x
  for (i in seq_along(geodata)) {
    if (length(geodata[[i]]) != 0) {
      # the request sets limit=1, so the first element is the only match
      result <- geodata[[i]][[1]]
      # Nominatim returns lat and lon as strings, so convert them to numbers
      extracted_data$lat[i] <- as.numeric(null_to_na(result$lat))
      extracted_data$lng[i] <- as.numeric(null_to_na(result$lon))
      extracted_data$address[i] <- as.character(null_to_na(result$display_name))
      extracted_data$county[i] <- as.character(null_to_na(result$address$county))
      extracted_data$importance[i] <- as.numeric(null_to_na(result$importance))
    }
  }
  return(extracted_data)
}
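To make the extraction concrete, the sketch below parses a hand-written response in the shape Nominatim returns with format=json and addressdetails=1. Every value in the sample is invented for illustration; only the field names (lat, lon, display_name, importance, address$county) are the ones the function above reads.

```r
library(jsonlite)

# Hand-written sample response; all values are invented
sample_json <- '[{"lat": "30.2711", "lon": "-97.7437",
  "display_name": "123, Main Street, Austin, Travis County, Texas, 78701, United States",
  "importance": 0.42,
  "address": {"road": "Main Street", "city": "Austin", "county": "Travis County"}}]'

# simplifyVector = FALSE keeps the result as nested lists, one element per match
parsed <- fromJSON(sample_json, simplifyVector = FALSE)
first_match <- parsed[[1]]

lat <- as.numeric(first_match$lat)   # coordinates arrive as strings
lng <- as.numeric(first_match$lon)
county <- first_match$address$county

# An address with no match comes back as an empty JSON array
no_match <- fromJSON("[]", simplifyVector = FALSE)
length(no_match) == 0
```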
3. Main function
This function combines the two functions above: it builds the request URLs, calls the API once per address, and returns a data frame with the geocoded data.
geocode_nominatim <- function(search_query, country = NULL, language = "en",
                              fields = "coordinates", email) {
  # construct one request URL per address
  url_geocode <- nominatim_search(search_query, country, language, email)
  # request each URL in turn and keep the raw JSON text
  geodata_json <- vector("list", length(url_geocode))
  pb <- txtProgressBar(min = 0, max = length(url_geocode), initial = 0)
  for (i in seq_along(url_geocode)) {
    setTxtProgressBar(pb, i)
    # GET() and content() come from httr
    response <- httr::GET(url_geocode[i])
    # treat a failed request as "no match" (an empty JSON array)
    geodata_json[[i]] <- if (httr::http_error(response)) {
      "[]"
    } else {
      httr::content(response, as = "text", encoding = "UTF-8")
    }
    # pause between calls to stay under Nominatim's limit of one request per second
    Sys.sleep(1.5)
  }
  close(pb)
  # start from the original queries, one row per address
  geodata_result <- data.frame(search_query = as.character(search_query),
                               stringsAsFactors = FALSE)
  # "all" and "coordinates" both append the same five extracted columns
  if (any(c("all", "coordinates") %in% fields)) {
    geodata_result <- cbind(geodata_result, get_geodata_from_json(geodata_json))
  }
  # return data frame with the geodata
  return(geodata_result)
}
4. Run
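Run the code below to generate the geocodes. Here address is a character vector of cleaned addresses, and email is a contact address that Nominatim asks you to include when making large numbers of requests. Warnings are suppressed during the call and restored afterwards.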
email <- "your-email"
defaultW <- getOption("warn")
options(warn = -1)
coords_nominatim <- geocode_nominatim(address, country = "us", fields = "all",
email = email)
options(warn = defaultW)
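As with the ArcGIS output, it may be worth screening the results before use. The sketch below is one possible check, not part of the functions above: it uses a toy data frame in the shape geocode_nominatim returns (all values made up), computes the share of queries that returned coordinates, and keeps rows whose returned address mentions the expected state. The state name "Texas" is an illustrative assumption.

```r
# Toy results in the shape geocode_nominatim() returns (values are made up)
coords_example <- data.frame(
  search_query = c("123 main st, austin, tx", "not a real address"),
  lat = c(30.27, NA),
  lng = c(-97.74, NA),
  address = c("123, Main Street, Austin, Travis County, Texas, United States", NA),
  county = c("Travis County", NA),
  importance = c(0.42, NA),
  stringsAsFactors = FALSE
)

# Share of queries that returned coordinates
match_rate <- mean(!is.na(coords_example$lat))

# Keep matched rows whose returned address mentions the expected state
in_state <- !is.na(coords_example$address) &
  grepl("Texas", coords_example$address, fixed = TRUE)
coords_clean <- coords_example[in_state, ]
```

A low match rate usually points back to the cleaning step, since Nominatim is less forgiving of spelling and formatting errors than ArcGIS.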