My first R package: zipcode

You may know that I am a fan of the CivicSpace US ZIP Code Database compiled by Schuyler Erle of Mapping Hacks fame. It contains nearly 10,000 more records than the ZIP Code Tabulation Areas file from the U.S. Census Bureau upon which it is based, so a lot of work has gone into it.

I have been using the database a lot recently to correlate with survey respondents, so I have saved it as an R data.frame. Since others may find it useful, too, I have packaged it into the ‘zipcode’ package now available on CRAN.

Once you load the package, the database is available in the ‘zipcode’ data.frame:

> library(zipcode)
> data(zipcode)

> nrow(zipcode)
[1] 43191

> head(zipcode)
    zip       city state latitude longitude timezone  dst
1 00210 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
2 00211 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
3 00212 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
4 00213 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
5 00214 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
6 00215 Portsmouth    NH 43.00590  -71.0132       -5 TRUE

Note that the ‘zip’ column is a string, not an integer, in order to preserve leading zeroes — a sensitive topic for those of us in the Northeast… 🙂

The package also includes a clean.zipcodes() function to help clean up zip codes in your data. It strips off “ZIP+4” suffixes, attempts to restore missing leading zeroes, and replaces anything with non-digits (like non-U.S. postal codes) with NAs:

> library(zipcode)
> data(zipcode)

> somedata = data.frame(postal = c(2061, "02142", 2043, "20210", "2061-2203", "SW1P 3JX", "210", '02199-1880'))
> somedata
      postal
1       2061
2      02142
3       2043
4      20210
5  2061-2203
6   SW1P 3JX
7        210
8 02199-1880

> somedata$zip = clean.zipcodes(somedata$postal)
> somedata
      postal   zip
1       2061 02061
2      02142 02142
3       2043 02043
4      20210 20210
5  2061-2203 02061
6   SW1P 3JX  <NA>
7        210 00210
8 02199-1880 02199

> data(zipcode)
> somedata = merge(somedata, zipcode, by.x='zip', by.y='zip')
> somedata
    zip     postal       city state latitude longitude timezone  dst
1 00210        210 Portsmouth    NH 43.00590 -71.01320       -5 TRUE
2 02043       2043    Hingham    MA 42.22571 -70.88764       -5 TRUE
3 02061       2061    Norwell    MA 42.15243 -70.82050       -5 TRUE
4 02061  2061-2203    Norwell    MA 42.15243 -70.82050       -5 TRUE
5 02142      02142  Cambridge    MA 42.36230 -71.08412       -5 TRUE
6 02199 02199-1880     Boston    MA 42.34713 -71.08234       -5 TRUE
7 20210      20210 Washington    DC 38.89331 -77.01465       -5 TRUE
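
For anyone curious what that cleanup involves, here is a minimal base R sketch of the three steps described above: strip the “ZIP+4” suffix, turn anything that still contains non-digits into NA, and pad with leading zeroes. This is not the package's actual source. The name clean_zip_sketch and the regular expressions are my own assumptions about what reasonable rules look like.

```r
# Hypothetical re-implementation for illustration only;
# NOT the source of zipcode::clean.zipcodes().
clean_zip_sketch <- function(x) {
  z <- trimws(as.character(x))

  # 1. drop a "ZIP+4" suffix such as "-2203"
  z <- sub("-[0-9]{4}$", "", z)

  # 2. anything that is not 1 to 5 digits (e.g. "SW1P 3JX") becomes NA
  z[!grepl("^[0-9]{1,5}$", z)] <- NA_character_

  # 3. left-pad the remaining values with zeroes to five characters
  ok <- !is.na(z)
  z[ok] <- paste0(strrep("0", 5 - nchar(z[ok])), z[ok])
  z
}

clean_zip_sketch(c("2061", "02199-1880", "SW1P 3JX"))
# "02061" "02199" NA
```

Everything stays a character vector throughout, so the leading zeroes are never at risk of being dropped by a numeric conversion.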

Now we wouldn’t be R users if we didn’t try to do something with data, even if it’s just a lookup table of zip codes. So let’s take a look at how they’re distributed by first digit:

library(zipcode)
library(ggplot2)

data(zipcode)
zipcode$region = substr(zipcode$zip, 1, 1)

g = ggplot(data=zipcode) + geom_point(aes(x=longitude, y=latitude, colour=region))

# simplify display and limit to the "lower 48"
# (breaks = NULL hides ticks and gridlines; current ggplot2 rejects breaks = NA)
g = g + theme_bw() + scale_x_continuous(limits = c(-125,-66), breaks = NULL)
g = g + scale_y_continuous(limits = c(25,50), breaks = NULL)

# don't need axis labels
g = g + labs(x=NULL, y=NULL)

# draw the plot
print(g)

If we make the points smaller, cities and interstates become clearly visible, at least once you leave the Northeast Megalopolis.
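
One way to shrink the points is to pass a constant size to geom_point(). The sketch below wraps the plot in a small function so it can be tried on any data frame with zip, longitude, and latitude columns. The name plot_zip_regions and the default size of 0.5 are my own choices for illustration, not something taken from the package.

```r
library(ggplot2)

# Hypothetical helper: `zips` is any data frame with character `zip`
# plus numeric `longitude` and `latitude` columns.
plot_zip_regions <- function(zips, point_size = 0.5) {
  # colour by the first digit of the ZIP code
  zips$region <- substr(zips$zip, 1, 1)

  ggplot(data = zips) +
    # size is set outside aes(), so it applies to every point
    geom_point(aes(x = longitude, y = latitude, colour = region),
               size = point_size) +
    theme_bw() +
    # limit to the "lower 48" and hide ticks and gridlines
    scale_x_continuous(limits = c(-125, -66), breaks = NULL) +
    scale_y_continuous(limits = c(25, 50), breaks = NULL) +
    labs(x = NULL, y = NULL)
}
```

With the package data loaded, plot_zip_regions(zipcode) should draw the map. Lowering point_size further may help in dense areas.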

Data source to map Zip codes to Latitude and Longitude

[Update: For R users, I have since bundled this database into an R package, ‘zipcode’, now available on CRAN.]

When I need positions for zip codes, I use the “CivicSpace US ZIP Code Database by Schuyler Erle, August 2004”. I first found it thanks to Tom Boutell’s site (http://www.boutell.com/zipcodes/).

According to the README, it contains “over 98% of the ZIP Codes in current use in the United States” as of 2004. The ZIP file includes the data in CSV and a PostGIS-friendly SQL definition file. Schuyler Erle co-authored O’Reilly’s excellent Mapping Hacks, so the zipcode.zip file is also now mirrored on the Mapping Hacks website (http://mappinghacks.com/data/).

In addition to latitude and longitude, the data include city and state name and time zone:

"zip","city","state","latitude","longitude","timezone","dst"
"00210","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00211","Portsmouth","NH","43.005895","-71.013202","-5","1"
"00212","Portsmouth","NH","43.005895","-71.013202","-5","1"
[...]
"99928","Ward Cove","AK","55.395359","-131.67537","-9","1"
"99929","Wrangell","AK","56.409507","-132.33822","-9","1"
"99950","Ketchikan","AK","55.875767","-131.46633","-9","1"
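
If you load the CSV into R yourself, the main trap is that the zip column gets read as a number and loses its leading zeroes. A small sketch of one way around that, using two of the rows shown above as inline text so it runs without the file (for the real thing you would pass the path to zipcode.csv instead of text =):

```r
# Two rows copied from the listing above, inlined so the example is self-contained
csv_text <- '"zip","city","state","latitude","longitude","timezone","dst"
"00210","Portsmouth","NH","43.005895","-71.013202","-5","1"
"99950","Ketchikan","AK","55.875767","-131.46633","-9","1"'

# Force `zip` to stay character; the other columns are type-converted as usual
zips <- read.csv(text = csv_text,
                 colClasses = c(zip = "character"),
                 stringsAsFactors = FALSE)

# dst is stored as 0/1 in the file; turn it into a logical
zips$dst <- zips$dst == 1

str(zips)
```

Naming the column in colClasses leaves every other column to read.csv's normal type detection, so latitude and longitude still come back numeric.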

The database is based on the 1999-2000 U.S. Census Gazetteer files. While the ZIP Code Tabulation Areas fixed-width ASCII file lacks niceties like place names and time zone info, it does contain some basic population and geographic statistics:

  • Columns 1-2: United States Postal Service State Abbreviation
  • Columns 3-66: Name (e.g. 35004 5-Digit ZCTA – there are no post office names)
  • Columns 67-75: Total Population (2000)
  • Columns 76-84: Total Housing Units (2000)
  • Columns 85-98: Land Area (square meters) – Created for statistical purposes only.
  • Columns 99-112: Water Area (square meters) – Created for statistical purposes only.
  • Columns 113-124: Land Area (square miles) – Created for statistical purposes only.
  • Columns 125-136: Water Area (square miles) – Created for statistical purposes only.
  • Columns 137-146: Latitude (decimal degrees) – First character is blank or “-”, denoting North or South latitude respectively.
  • Columns 147-157: Longitude (decimal degrees) – First character is blank or “-”, denoting East or West longitude respectively.
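
The column layout above translates fairly directly into widths for read.fwf(). The following is only a sketch: read_zcta5 and the column names are names I made up, and it assumes the name field always starts with the five-character code, as in the “35004 5-Digit ZCTA” example.

```r
# Field widths derived from the column ranges listed above (they sum to 157)
zcta_widths <- c(2, 64, 9, 9, 14, 14, 12, 12, 10, 11)

# Column names are my own labels, not part of the Census file
zcta_names <- c("state", "name", "population", "housing_units",
                "land_sq_m", "water_sq_m", "land_sq_mi", "water_sq_mi",
                "latitude", "longitude")

# Hypothetical reader for the fixed-width ZCTA file (e.g. zcta5.txt)
read_zcta5 <- function(path) {
  zcta <- read.fwf(path,
                   widths = zcta_widths,
                   col.names = zcta_names,
                   colClasses = c("character", "character", rep("numeric", 8)),
                   strip.white = TRUE,   # trim the padding around each field
                   quote = "",           # treat quote characters as plain text
                   comment.char = "")    # do not treat '#' as a comment
  # Assumption: the first five characters of the name are the ZCTA code
  zcta$zip <- substr(zcta$name, 1, 5)
  zcta
}
```

Something like read_zcta5("zcta5.txt") would then give a data frame that can be merged against the CivicSpace data on the zip column.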

The clincher for me is that the CivicSpace database contains nearly 10,000 more entries than the base Census file:

$ wc -l zipcode.csv
43205 zipcode.csv
$ wc -l zcta5.txt
33233 zcta5.txt