Slides from today’s Big Data Step-by-Step Tutorials: Infrastructure series and Intro to R+Hadoop with RHadoop’s rmr

March 10, 2012 — Jeffrey Breen

Here are my presentations from today’s Boston Predictive Analytics Big Data Workshop.

All code and config files are available on GitHub: https://github.com/jeffreybreen/tutorial-201203-big-data

My portion of the workshop had four parts. The first three cover different infrastructure scenarios, and the last is a deep dive into the rmr R package:

Big Data Step-by-Step: Infrastructure 1/3: Local VM

Starting small.

Big Data Step-by-Step: Infrastructure 2/3: Running R and RStudio on EC2

Not everyone has Big Data. Some of us have an occasional need to analyze a data set larger than comfortably fits in our existing analysis environment, whether due to disk, CPU, or memory constraints. For these times, launching a single, large machine in the cloud may fit the bill. This part of the presentation walks through how to launch just such a machine using Amazon’s EC2 cloud computing platform. Since I tend to run R and RStudio on Linux, that’s the focus of this tutorial, but the general outline may be helpful to others as well.

Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud… easily… with Whirr

Scale up using the cloud. The Apache Whirr cloud management tool makes it easy to launch a Hadoop cluster on EC2. We use the Cloudera VM from presentation #1 as a launching point for the cluster and, thanks to a Whirr-generated proxy script, submit jobs and fetch results from our local VM just as before. For extra credit, we see how Whirr can save us money by bidding for excess capacity via EC2’s spot instances.
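As a rough sketch of what a Whirr recipe for such a cluster can look like, the script below writes a small properties file. The cluster name, node count, instance type, and spot price are all hypothetical values picked for illustration, not the ones used in the slides; the actual config files are in the GitHub repository linked above.

```shell
#!/bin/sh
# Write a minimal Whirr recipe for a Hadoop cluster on EC2.
# All values below (name, sizes, price) are illustrative assumptions.
config_file="${TMPDIR:-/tmp}/hadoop-ec2.properties"

cat > "$config_file" <<'EOF'
# Hypothetical cluster name
whirr.cluster-name=hadoop-ec2
# One master node and three workers (counts are an assumption)
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
# Credentials are read from the environment rather than stored in the file
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
# Instance type and spot bid are placeholders; check current EC2 pricing
whirr.hardware-id=m1.large
whirr.aws-ec2-spot-price=0.10
EOF

echo "Wrote $config_file"

# With Whirr installed and AWS credentials exported, the usual flow is:
#   whirr launch-cluster --config "$config_file"
#   sh ~/.whirr/hadoop-ec2/hadoop-proxy.sh    # leave running in its own terminal
#   whirr destroy-cluster --config "$config_file"
```

Leaving out the spot price line should make Whirr request regular on-demand instances instead.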

Big Data Step-by-Step: Using R & Hadoop (with RHadoop’s rmr package)

Crunching Big Data with R. Hadoop began as a Java-only ecosystem, but Hadoop Streaming allows mappers, reducers, and combiners to be written in any language that can read stdin and write stdout. That doesn’t mean you want to write code to manage I/O at that level, though. After a quick (and undoubtedly incomplete) survey of Hadoop-related R packages, we look at some of the abstractions and features of RHadoop’s rmr package which make it easier for R developers to get started. We then walk through a sample mapper and reducer, demonstrating and documenting the native R objects which carry the data from step to step.
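To give a feel for the stdin/stdout contract that Hadoop Streaming relies on (and that rmr hides from you), here is a minimal word-count sketch in plain shell. It is not taken from the slides: the function names are made up, and the `sort` in the middle only stands in for Hadoop's shuffle phase so the whole thing can run on a laptop without a cluster.

```shell
#!/bin/sh
# Mapper: read text on stdin, emit one "word<TAB>1" pair per word.
mapper() {
  tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t1" }'
}

# Reducer: input arrives sorted by key, so all pairs for a word are adjacent.
# Sum the counts for each key and emit "word<TAB>total".
reducer() {
  awk -F '\t' '
    $1 != key { if (NR > 1) print key "\t" total; key = $1; total = 0 }
    { total += $2 }
    END { if (NR > 0) print key "\t" total }
  '
}

# Local stand-in for map -> shuffle/sort -> reduce.
wordcount() {
  mapper | LC_ALL=C sort | reducer
}

printf 'big data\nbig cluster\n' | wordcount
# prints (tab-separated): big 2, cluster 1, data 1, one pair per line
```

Even for something this small, the parsing and key bookkeeping are all manual, which is the kind of plumbing the rmr package is meant to take off your hands.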

Thank you to the session’s sponsors, all the speakers, and to an interesting and engaged audience. Special thanks to John Versotek for arranging such an informative and enjoyable day, and for the opportunity to take part.

Posted in Tutorials. Tags: airlines, Amazon EC2, Big Data, cloud computing, Cloudera, Hadoop, R, rstats, VMware, Whirr.

Find Dell Service Tag from Linux, ESX console

January 20, 2009 — Jeffrey Breen

Try this first:

# dmidecode -s system-serial-number

The “-s” option didn’t work on the kernel 2.4-based ESX service console, but this should:

# dmidecode | grep --extended-regexp 'Serial[[:space:]]Number:[[:space:]]*[A-Z0-9]{7}$' | uniq

Credit to Noah@Noah.org
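If you want to check the grep pattern without root access or a Dell box handy, you can feed it canned text. The sketch below uses a made-up excerpt in the general shape of dmidecode output; the service tag ABC1234 and the board serial are invented, and the function names are only for illustration.

```shell
#!/bin/sh
# Hypothetical dmidecode-style excerpt; all serial numbers are made up.
sample_dmidecode() {
  cat <<'EOF'
System Information
    Manufacturer: Dell Inc.
    Serial Number: ABC1234
Base Board Information
    Serial Number: ..CN1234567890AB.
Chassis Information
    Serial Number: ABC1234
EOF
}

# Same filter as the pipeline above: keep only "Serial Number:" lines whose
# value is exactly seven uppercase letters or digits, then drop repeats.
find_service_tag() {
  grep -E 'Serial[[:space:]]Number:[[:space:]]*[A-Z0-9]{7}$' | uniq
}

sample_dmidecode | find_service_tag
```

The longer board serial is filtered out because it is not exactly seven characters of `[A-Z0-9]`, and `uniq` collapses the system and chassis lines, which carry the same tag, into one.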

Posted in Sysadmin. Tags: Dell, ESX, hardware, linux, PowerEdge, server, service tag, VMware.
