R reminds me a lot of English. It’s easy to get started, but very difficult to master. So for all those times I’ve spent… well, forever… trying to figure out the “R way” of doing something, I’m glad to share these quick wins.
My recent R tutorial on mining Twitter for consumer sentiment wouldn’t have been possible without Jeff Gentry’s amazing twitteR package (available on CRAN). It does so much of the behind-the-scenes heavy lifting to access Twitter’s REST APIs that one line of code is all you need to perform a search and retrieve the results, even when they are paginated:
library(twitteR)
tweets = searchTwitter("#rstats", n=1500)
You can search for anything, of course; “#rstats” is just an example. (And if you’re really into that hashtag, the twitteR package even provides an Rtweets() function which hardcodes that search string for you.) The n=1500 argument requests the maximum number of tweets supported by the Search API, though you may retrieve fewer, as Twitter’s search indices contain only a couple of days’ tweets.
What you get back is a list of tweets (technically “status updates”):
> head(tweets)
[[1]]
[1] "Cloudnumberscom: CloudNumbers.com \023 #Rstats gets real in the cloud http://t.co/Vw4Gupr via @AddToAny"

[[2]]
[1] "0_h_r_1: CloudNumbers.com \023 #Rstats gets real in the cloud via DecisionStats - I came across Cloudnumbers.com . ... http://tinyurl.com/5sjagjg"

[[3]]
[1] "cmprsk: RT I just joined the beta to run #Rstats in the cloud with cloudnumbers.com http://t.co/lvVp0YJ via @cloudnumberscom http://bit.ly/lbSruR"

[[4]]
[1] "0_h_r_1: I just joined the beta to run #Rstats in the cloud with cloudnumbers.com http://t.co/lvVp0YJ via @cloudnumberscom"

[[5]]
[1] "cmprsk: RT man, the #rstats think people I am too soft on #sas, the #sas people think I am too soft on #wps, the #wps pe... http://bit.ly/innEv8"

[[6]]
[1] "keepstherainoff: Thanks to @cmprsk @geoffjentry and @MikeKSmith for colour-coded #Rstats GUI advice"

> class(tweets[[1]])
[1] "status"
attr(,"package")
[1] "twitteR"
Now that you have some tweets, the fun really begins. To get you started, the status class includes a very handy toDataFrame() accessor method (see ?status):
> library(plyr)
> tweets.df = ldply(tweets, function(t) t$toDataFrame() )
> str(tweets.df)
'data.frame': 131 obs. of 10 variables:
 $ text        : Factor w/ 122 levels "CloudNumbers.com \023 #Rstats gets real in the cloud http://t.co/Vw4Gupr via @AddToAny",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ favorited   : logi NA NA NA NA NA NA ...
 $ replyToSN   : logi NA NA NA NA NA NA ...
 $ created     : POSIXct, format: "2011-07-04 13:50:39" "2011-07-04 13:48:10" "2011-07-04 13:29:00" "2011-07-04 13:23:42" ...
 $ truncated   : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ replyToSID  : logi NA NA NA NA NA NA ...
 $ id          : Factor w/ 131 levels "87941406873751552",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ replyToUID  : logi NA NA NA NA NA NA ...
 $ statusSource: Factor w/ 17 levels "<a href="http://twitter.com/tweetbutton" rel="nofollow">Tweet Button</a>",..: 1 2 3 1 3 4 5 5 3 4 ...
 $ screenName  : Factor w/ 64 levels "Cloudnumberscom",..: 1 2 3 2 3 4 2 5 3 6 ...
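Notice in the str() output that text, id, and screenName come back as factors. For text mining you will usually want plain character strings instead. Here is a minimal sketch of that conversion. Since the real data frame needs a live Twitter search, it uses a tiny hand-made stand-in whose rows are invented for illustration:

```r
# Stand-in for tweets.df; these values are made up, not real tweets.
tweets.df = data.frame(
  text       = c("first #rstats tweet", "second #rstats tweet"),
  id         = c("1001", "1002"),
  screenName = c("user_a", "user_b"),
  stringsAsFactors = TRUE   # mimic the factor columns shown by str() above
)

# Convert every factor column to character and leave other columns alone.
is.fac = sapply(tweets.df, is.factor)
tweets.df[is.fac] = lapply(tweets.df[is.fac], as.character)

str(tweets.df)
```

I would keep id as character rather than converting it to a number: an id like 87941406873751552 is larger than 2^53, so it cannot be stored exactly as a double.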
You can pull a particular user’s tweets just as easily with the userTimeline() function. Heck, the package even lets you tweet from R if you use Jeff’s companion ROAuth package, but that requires more than one line….
Enjoy!
July 21, 2011 at 11:35 PM
tweets.df = ldply(tweets, function(t) t$toDataFrame() )
The line above doesn’t work. ERROR: t$toDataFrame : $ operator not defined for this S4 class
July 22, 2011 at 9:46 PM
Replying to myself: only twitteR 0.99.9 on R 2.12 works.
July 25, 2011 at 10:21 AM
Hi Chengjun:
I hadn’t realized toDataFrame() was such a recent addition, but I’m glad you got it working.
Jeffrey
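If you want to check before you hit that error, something like the following might help. It is only a sketch: the 0.99.9 threshold comes straight from the comment above (I have not verified it against the package changelog), and has.min.version() is a made-up helper name, not part of twitteR.

```r
# Minimum version taken from the comment above; treat it as an assumption.
min.version = "0.99.9"

# has.min.version() is a hypothetical helper, not part of twitteR.
has.min.version = function(installed, required) {
  package_version(installed) >= package_version(required)
}

if (requireNamespace("twitteR", quietly = TRUE)) {
  installed = as.character(packageVersion("twitteR"))
  if (has.min.version(installed, min.version)) {
    message("twitteR ", installed, " meets the version reported to work")
  } else {
    message("twitteR ", installed, " may be too old for toDataFrame()")
  }
} else {
  message("twitteR is not installed")
}
```

Comparing with package_version() rather than as plain strings matters here, because a string comparison would sort "0.99.10" before "0.99.9".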
July 23, 2012 at 8:31 AM
The data frame doesn’t appear to contain the tweet geocode. Any idea how to access it? Thanks for a helpful post.
August 18, 2012 at 10:53 AM
You’re right — and I just double-checked the latest version as well.
When I looked into the geocode situation last year, only a small fraction of tweets had one. That may have changed, but at the time, I remember considering using the user’s location as a proxy. It’s the text string entered on the user’s profile, so it needs to be parsed and geocoded, and obviously doesn’t account for travel:
> u = getUser('JeffreyBreen')
> u$location
[1] "Cambridge, MA"
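To give a feel for the parsing step, here is a rough sketch. parse.location() is a hypothetical helper of my own, not part of twitteR, and it assumes a tidy "City, ST" string. Profile locations are free text, so plenty of real values will not fit and simply come back as NA. Turning the result into a lat/lon would still need a separate geocoding service, which is not shown.

```r
# parse.location() is a hypothetical helper, not part of twitteR.
# It assumes a "City, ST" style string; anything else yields NAs.
parse.location = function(loc) {
  parts = trimws(strsplit(loc, ",", fixed = TRUE)[[1]])
  if (length(parts) != 2 || any(parts == "")) {
    return(c(city = NA_character_, region = NA_character_))
  }
  c(city = parts[1], region = parts[2])
}

parse.location("Cambridge, MA")
parse.location("somewhere over the rainbow")
```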
If you know what location you’re interested in, however, you can specify a lat/lon and radius to search using the twitteR package. Here are recent R-related tweets from the Boston area:
> l = searchTwitter('#rstats', geocode='42.375,-71.1061111,50mi')
> l[[1]]
[1] "luizpcfreitas: @JeffreyBreen do you plan to give the same prez in Boston? Q suggested the Meetup lobby you to do it. #rstats #hadoop"
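The geocode argument is just a "latitude,longitude,radius" string, with the radius given in mi or km. If you build these often, a small wrapper can catch typos early. This is only a sketch, and make.geocode() is a hypothetical helper, not something twitteR provides:

```r
# make.geocode() is a hypothetical helper, not part of twitteR.
# It builds the "lat,lon,radius" string that the geocode argument expects.
make.geocode = function(lat, lon, radius, units = c("mi", "km")) {
  units = match.arg(units)
  stopifnot(lat >= -90, lat <= 90, lon >= -180, lon <= 180, radius > 0)
  paste0(lat, ",", lon, ",", radius, units)
}

boston = make.geocode(42.375, -71.1061111, 50, "mi")
boston

# With a live connection you could then run:
# l = searchTwitter('#rstats', geocode = boston)
```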
See http://stackoverflow.com/questions/11674842/how-to-extract-tweet-geocode-in-twitter-package-in-r for a discussion.
If you’re really ambitious, you could help Jeff extend the twitteR package — the geo location appears in Twitter’s JSON results as `coordinates`. See https://dev.twitter.com/docs/platform-objects/tweets
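As a starting point, here is roughly what pulling a point out of that field could look like once the JSON has been parsed into an R list. This is a sketch under assumptions: the two tweets below are hand-built stand-ins with invented values, tweet.latlon() is a hypothetical helper rather than anything in twitteR, and I am relying on the field being a GeoJSON point, which lists longitude before latitude.

```r
# Hand-built stand-ins for parsed tweet JSON; the values are invented.
with.geo = list(
  text = "hello from Cambridge",
  coordinates = list(type = "Point", coordinates = c(-71.1061111, 42.375))
)
without.geo = list(text = "no location here", coordinates = NULL)

# tweet.latlon() is a hypothetical helper, not part of twitteR.
tweet.latlon = function(tweet) {
  geo = tweet$coordinates
  if (is.null(geo)) {
    return(c(lat = NA_real_, lon = NA_real_))   # tweet has no geo information
  }
  # GeoJSON order is longitude first, then latitude.
  c(lat = geo$coordinates[2], lon = geo$coordinates[1])
}

tweet.latlon(with.geo)
tweet.latlon(without.geo)
```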
Good luck!
Jeffrey