My first R package: zipcode

You may know that I am a fan of the CivicSpace US ZIP Code Database compiled by Schuyler Erle of Mapping Hacks fame. It contains nearly 10,000 more records than the ZIP Code Tabulation Areas file from the U.S. Census Bureau upon which it is based, so a lot of work has gone into it.

I have been using the database a lot recently to correlate with survey respondents, so I have saved it as an R data.frame. Since others may find it useful, too, I have packaged it into the ‘zipcode’ package now available on CRAN.

One you load the package, the database is available in the ‘zipcode’ data.frame:

> library(zipcode)
> data(zipcode)

> nrow(zipcode)
[1] 43191

> head(zipcode)
    zip       city state latitude longitude timezone  dst
1 00210 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
2 00211 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
3 00212 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
4 00213 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
5 00214 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
6 00215 Portsmouth    NH 43.00590  -71.0132       -5 TRUE

Note that the ‘zip’ column is a string, not an integer, in order to preserve leading zeroes — a sensitive topic for those of us in the Northeast… 🙂

The package also includes a clean.zipcodes() function to help clean up zip codes in your data. It strips off “ZIP+4” suffixes, attempts to restore missing leading zeroes, and replaces anything with non-digits (like non-U.S. postal codes) with NAs:

> library(zipcode)
> data(zipcode)

> somedata = data.frame(postal = c(2061, "02142", 2043, "20210", "2061-2203", "SW1P 3JX", "210", '02199-1880'))
> somedata
      postal
1       2061
2      02142
3       2043
4      20210
5  2061-2203
6   SW1P 3JX
7        210
8 02199-1880

> somedata$zip = clean.zipcodes(somedata$postal)
> somedata
      postal   zip
1       2061 02061
2      02142 02142
3       2043 02043
4      20210 20210
5  2061-2203 02061
6   SW1P 3JX  <NA>
7        210 00210
8 02199-1880 02199

> data(zipcode)
> somedata = merge(somedata, zipcode, by.x='zip', by.y='zip')
> somedata
    zip     postal       city state latitude longitude timezone  dst
1 00210        210 Portsmouth    NH 43.00590 -71.01320       -5 TRUE
2 02043       2043    Hingham    MA 42.22571 -70.88764       -5 TRUE
3 02061       2061    Norwell    MA 42.15243 -70.82050       -5 TRUE
4 02061  2061-2203    Norwell    MA 42.15243 -70.82050       -5 TRUE
5 02142      02142  Cambridge    MA 42.36230 -71.08412       -5 TRUE
6 02199 02199-1880     Boston    MA 42.34713 -71.08234       -5 TRUE
7 20210      20210 Washington    DC 38.89331 -77.01465       -5 TRUE

Now we wouldn’t be R users if we didn’t try to do something with data, even if it’s just a lookup table of zip codes. So let’s take a look at how they’re distributed by first digit:

library(zipcode)
library(ggplot2)

data(zipcode)
zipcode$region = substr(zipcode$zip, 1, 1)

g = ggplot(data=zipcode) + geom_point(aes(x=longitude, y=latitude, colour=region))

# simplify display and limit to the "lower 48"
g = g + theme_bw() + scale_x_continuous(limits = c(-125,-66), breaks = NA)
g = g + scale_y_continuous(limits = c(25,50), breaks = NA)

# don't need axis labels
g = g + labs(x=NULL, y=NULL)

If we make the points smaller, cities and interstates are clearly visible, at least once you leave the Northeast Megalopolis:

Incremental improvements to Nightlights mapping thanks to R-Bloggers

My recent post Nightlights: cool data, bad geocoding highlighted some of the geocoding challenges Steve Mosher has been finding as he works with this interesting “light pollution” data set.

It was also my first article reposted on Tal Galili’s fantastic R-Bloggers site which I have been following for a while. But even better than the surge of new visitors were the great comments and suggestions posted by members of the community. In this post, I’m going to walk through each suggestion to illustrate just how generous and helpful this community can be.

Our starting point is where we ended up in my first post, using ggplot2 to display the raster nightlights data and map overlay:

library(RCurl)
library(R.utils)
library(rgdal)
library(raster)

url_radianceCalibrated = "ftp://ftp.ngdc.noaa.gov/DMSP/web_data/x_rad_cal/rad_cal.tar"
calibratedLights = "rad_cal.tar"
hiResTif = "world_avg.tif"

download.file(url_radianceCalibrated,calibratedLights,mode="wb")
untar(calibratedLights)
gunzip( paste(hiResTif, '.gz', sep='') )

hiResLights = raster( hiResTif )

# Eastern Mass., Cape Cod & Islands:
e = extent(-71.5, -69.5, 41, 43)

r = crop(hiResLights,e)
p = rasterToPoints(r)
df = data.frame(p)
colnames(df) = c("lon", "lat", "dn")

g = ggplot( data=df) + geom_point(aes(x=lon, y=lat, color=dn) )
g = g + scale_colour_gradient(low="black", high="white", trans="sqrt")
g = g + theme_bw() + xlim(c(-71.5,-69.5)) + ylim(c(41,43))
g = g + opts( axis.text.x = theme_blank(), axis.text.y = theme_blank() )
g = g + borders("state", colour="yellow", alpha=0.5)

Note the mismatch between the data and map overlay and the weirdness in the map where points are missing on the North Shore:

original data, positioning, and borders in ggplot2

Ben Bolker suggested a way to eliminate the artifacts which led me to this discussion on R-sig-Geo between Hadley Wickham and Paul Hiemstra which tipped me off to the existence of geom_path layer in addition to the geom_polygon layer which borders() usually produces. Polygons are closed but paths need not be, so that helps. And ggplot2’s map_data() function seems to grab the same data as borders():

b = map_data("state")

g = ggplot( data=df) + geom_point(aes(x=lon, y=lat, color=dn) )
g = g + scale_colour_gradient(low="black", high="white", trans="sqrt")
g = g + theme_bw() + xlim(c(-71.5,-69.5)) + ylim(c(41,43))
g = g + opts( axis.text.x = theme_blank(), axis.text.y = theme_blank() )
g = g + geom_path(data=b, aes(x=long,y=lat,group=group), colour="yellow", alpha=0.5)

Bonus: geom_path() obeys the “alpha=0.5” directive to set the transparency:

Worst artifacts solved by switching to map_data() and geom_path()

But Robert Hijmans really hit it out the park with two great suggestions. First, he pointed me towards a much, much better source of coastline data by using raster’s getData() function to grab data from the GADM database of Global Administrative Areas:

usa = getData('GADM', country="USA", level=0)

Level 0 will get you country boundaries, Level 1 for state/province, and so on. So we’ll lose state boundaries, but these files are pretty big to start with and can take a lot longer to plot.

Also, be warned: apparently somebody sinned against The Church of GNU, so you may need to run gpclibPermit() manually before running fortify() on the SpatialPolygonsDataFrame:

> f_usa = fortify(usa)
Using GADMID to define regions.

	Note: polygon geometry computations in maptools
 	depend on the package gpclib, which has a
 	restricted licence. It is disabled by default;
 	to enable gpclib, type gpclibPermit()

Checking rgeos availability as gpclib substitute:
FALSE 
Error: isTRUE(gpclibPermitStatus()) is not TRUE
> gpclibPermit()
[1] TRUE

With that hoop cleared, we can fortify() and plot this new layer:

f_usa = fortify(usa)

g = ggplot( data=df) + geom_point(aes(x=lon, y=lat, color=dn) )
g = g + scale_colour_gradient(low="black", high="white", trans="sqrt")
g = g + theme_bw() + xlim(c(-71.5,-69.5)) + ylim(c(41,43))
g = g + opts( axis.text.x = theme_blank(), axis.text.y = theme_blank() )
g = g + geom_path(data=f_usa, aes(x=long,y=lat,group=group), colour="yellow", alpha=0.5)

The coordinates are still shifted, but—wow—what a beautiful coast line. Cape Ann on the North Shore is really there now:

beautiful coastline data from GADM

Robert also points out an important mismatch in that GDAL returns the top left corner and resolution, so we could be off by a pixel or so. A quick call to xmax() and ymax() will fix this in our original raster:

xmax(hiResLights) = 180
ymin(hiResLights) = -90

r = crop(hiResLights,e)
p = rasterToPoints(r)
df = data.frame(p)
colnames(df) = c("lon", "lat", "dn")

g = ggplot( data=df) + geom_point(aes(x=lon, y=lat, color=dn) )
g = g + scale_colour_gradient(low="black", high="white", trans="sqrt")
g = g + theme_bw() + xlim(c(-71.5,-69.5)) + ylim(c(41,43))
g = g + opts( axis.text.x = theme_blank(), axis.text.y = theme_blank() )
g = g + geom_path(data=f_usa, aes(x=long,y=lat,group=group), colour="yellow", alpha=0.5)

Hey, that’s not bad at all:

final try, after adjusting GDAL pixel shift

Looking at the final version leads me to wonder how much of the geocoding problem is position, and how much is resolution/blurring/smearing. The lights of Provincetown, for instance, look pretty good. Maybe the blob is too north by a few pixels, but at least it’s well contained by land now. On Nantucket, the blur is half in the harbor. Then again, on Nantucket, most of the lights are right on the harbor, from the ferry terminal and running east to main street. So the lights are just about where they should be. Perhaps they’re just blurred and therefore spill into the harbor?

But the real point of the post is to highlight the generosity of this community. For that, thanks. And again: welcome R-Bloggers readers!

rgdal and other GIS-related packages for Mac OS X

CRAN contains ready-made binary packages for nearly all of its packages, but rgdal is one which I keep finding myself trying to install from source whenever I upgrade R.

Compiled versions of rgdal, along with prerequisites and complements like the GDAL framework, GRASS, and even the old FFTW3 can be found at KyngChaos’s Wiki:


Update 10/10/10:
I don’t remember needing to do this before, but

$ ln -s /Library/Frameworks/GDAL.framework/Programs/gdal-config /usr/local/bin

makes the rgdal source package installation work on R 2.11.1.


Posted in GIS, Tips. Tags: , , . Leave a Comment »

Don’t forget altitude when geocoding

or

A funny thing happened as I walked down State Street: I fell into the Big Dig

Let’s go for a walk down State Street in Boston in Google’s Street View.

So far, so good:

Better be careful crossing this intersection:

ahhhh! Was there an open manhole or something?

It could be worse — at least we’re in an exit-only lane…

…and nobody’s coming!

Just one more click and… yup — that’s better!

Left as an exercise for the reader: If you keep walking towards the Aquarium, you’ll fall into the northbound tunnel, too.

Be careful!

Where to find GDAL, PROJ, FreeType for Mac OS

Posted in Sysadmin. Tags: , . Leave a Comment »