
Locating Michigan Lottery Retailers with the U.S. Census Bureau Geocoder API in R, part 1

This post records the process of cleaning and geocoding U.S. address records. The addresses are official records of Michigan lottery retailers. 


Michigan contracts with one of the multinational gambling companies to run the State lottery, and participates in the multi-state Powerball lottery as well. The data described here are retailer-based records of lottery sales from 2003 to 2016.

The purpose of the geocoding was to explore the distribution of sales across the State and to prepare to merge the dataset with various Census Bureau measures of geographic place characteristics. Of course, there is a lengthy literature on individual-level correlates of lottery ticket purchases, but I have been curious about where retailers tend to cluster throughout cities and townships in the State. In what follows I describe the process of geocoding latitude and longitude coordinates, as well as U.S. Census geographies, of Michigan lottery retailers. The lottery sales data were provided to me in October 2016 through a project with a colleague in investigative journalism, who FOIA'd the Michigan Lottery Commission for the data. (In addition to the data described here, the Commission also provided about 6 GB of CSV records of lottery winners, containing only winners' first names, game played, amount won, and retailer name and address. More on that in a future post.)



While the geocoding is more interesting than the data cleaning, I have recorded the cleaning steps here for my own later reference.


Information on retailers, by lottery sales license, was stored separately from the lottery sales figures:

## Read in the retailerinfo.csv file, keeping addresses and names as characters instead of factors
retailerinfo<-read.csv(file=".../retailerinfo.csv", stringsAsFactors = FALSE, header=TRUE)

## Read in the retailersales.csv file, keeping addresses and names as characters instead of factors
retailersales<-read.csv(file=".../retailersales.csv", stringsAsFactors = FALSE,  header=TRUE)
colnames(retailersales)[1] <- 'Retailer.Id' # rename variable X in retailersales as the retailer ID number


## We will create a merged file
retailer<-merge(retailerinfo, retailersales, by="Retailer.Id", all=TRUE)
colnames(retailer)[10] <- 'Total.sales' # rename total sales variable


The records on lottery sales begin with the year 2003 and extend to the date in 2016 when the FOIA request was fulfilled.


head(retailer)


To verify that the reported total column equals a sum calculated across the yearly sales columns:

retailer$calctotal<-rowSums(retailer[11:24], na.rm=TRUE)
retailer$verifytotals <- retailer$calctotal == retailer$Total.sales
table(retailer$verifytotals) # all TRUE; the total equals the sum of 2003 to 2016 sales

Some address records were missing. Observations missing addresses were dropped from the geocoding dataset.


## is.na() returns TRUE where Complete.Address is missing
summary(is.na(retailer$Complete.Address))
retailgeo<-subset(retailer, !is.na(Complete.Address))

For the purpose of illustrating the geocoding process, I subsetted the data to Kent County, which encompasses Michigan’s second largest city, Grand Rapids.

The Zip code field in the dataset is stored as an integer. While a bit tedious, I used the subset() function with a condition for each Kent County Zip (a more concise alternative is sketched after the block):


Kentretailgeo<-subset(retailgeo, retailgeo$Zip==49301 | retailgeo$Zip==49302 | retailgeo$Zip==49306 
                      | retailgeo$Zip==49315 | retailgeo$Zip==49316 | retailgeo$Zip==49317 
                      | retailgeo$Zip==49319 | retailgeo$Zip==49321 | retailgeo$Zip==49326 
                      | retailgeo$Zip==49330 | retailgeo$Zip==49331 | retailgeo$Zip==49341 
                      | retailgeo$Zip==49343 | retailgeo$Zip==49345 | retailgeo$Zip==49418 
                      | retailgeo$Zip==49468 | retailgeo$Zip==49501 | retailgeo$Zip==49503 
                      | retailgeo$Zip==49504 | retailgeo$Zip==49505 | retailgeo$Zip==49506 
                      | retailgeo$Zip==49507 | retailgeo$Zip==49508 | retailgeo$Zip==49509 
                      | retailgeo$Zip==49510 | retailgeo$Zip==49512 | retailgeo$Zip==49514 
                      | retailgeo$Zip==49515 | retailgeo$Zip==49516 | retailgeo$Zip==49518 
                      | retailgeo$Zip==49523 | retailgeo$Zip==49525 | retailgeo$Zip==49544 
                      | retailgeo$Zip==49546 | retailgeo$Zip==49548 | retailgeo$Zip==49588 
                      | retailgeo$Zip==49599)
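
A more concise alternative (a sketch of my own, using the same set of Zip codes and base R's %in% operator) avoids the long chain of conditions:

## Collect the Kent County Zips in a vector
kentzips <- c(49301, 49302, 49306, 49315, 49316, 49317, 49319, 49321, 49326,
              49330, 49331, 49341, 49343, 49345, 49418, 49468, 49501, 49503,
              49504, 49505, 49506, 49507, 49508, 49509, 49510, 49512, 49514,
              49515, 49516, 49518, 49523, 49525, 49544, 49546, 49548, 49588,
              49599)

## %in% tests each Zip against the vector in a single condition
Kentretailgeo<-subset(retailgeo, Zip %in% kentzips)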

There are over 1,000 observations. Because the Census geocoder API processes at most 1,000 addresses per batch, I split the Kent records into two parts.

Preparing a file for geocoding with the U.S. Census geocoder API


The benefit of using the U.S. Census geocoder is that it is free and returns address coordinates along with Census geographies: the corresponding block, tract, and county FIPS codes. (The API documentation for the geocoder is here, along with instructions for requesting an API key.)



A frequently asked question is, "Can I post addresses to the API from an R dataframe?" The answer is "No." The API requires a comma-delimited external file with specific address fields: a record ID, house number and street name, city, state, and Zip code.
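
For illustration, the records of such a file look like the following two lines (these addresses are invented, only to show the field order of ID, street address, city, state, and Zip):

1001,"123 N Main St","Grand Rapids","MI",49503
1002,"456 Burton St SE","Grand Rapids","MI",49507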



From the Kent address records, I prepared two address files, each remaining below the 1,000-record limit:



## Keep only the Retailer.Id and the address fields, including Zip code and Zip plus 4
## Create a csv file with the first 700 addresses
write.csv(Kentretailgeo[1:700,c(1, 3:7) ], file="addressesKent1.csv", row.names=FALSE)

## and then the remaining addresses
write.csv(Kentretailgeo[701:1329,c(1, 3:7) ], file="addressesKent2.csv", row.names=FALSE)
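
To generalize beyond two files, a minimal sketch (my own, assuming the same columns) splits the dataframe into batches of at most 1,000 rows and writes one address file per batch:

## Assign each row to a batch of at most 1000 rows, then write one CSV per batch
batches <- split(Kentretailgeo, ceiling(seq_len(nrow(Kentretailgeo)) / 1000))
for (i in seq_along(batches)) {
  write.csv(batches[[i]][, c(1, 3:7)],
            file=paste0("addressesKent", i, ".csv"), row.names=FALSE)
}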

The httr package is used to submit the request. A quick-start guide describes its basics at https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html

To submit the request, use the POST() function.

The body of the request contains the name of the comma-separated value file, in this case addressesKent1.csv, and two options for the Census API: the benchmark and the vintage. The benchmark could be the current Census address records ('Public_AR_Current') or the most recent American Community Survey ('Public_AR_ACS2015'). With the years of the addresses beginning in 2007, I thought I would find the most address matches with the 2010 Census. The results are stored in an R object, censusgeoKent1:


library(httr)

censusgeoKent1<- POST(url="https://geocoding.geo.census.gov/geocoder/geographies/addressbatch", 
                    body =
                      list(addressFile = upload_file("addressesKent1.csv"), 
                           benchmark = "Public_AR_Census2010",
                           vintage = "Census2010_Census2010"),
                    encode = "multipart", 
                    verbose())

After POST(), check for 'Success' in the request with

http_status(censusgeoKent1)
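
Alternatively, httr's stop_for_status() raises an R error on anything other than success, which is useful in a script:

stop_for_status(censusgeoKent1) # halts with an error if the request failed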

Once we have the data returned from the API, there are a few different ways to work with it and prepare it for merging with the lottery sales records; the easiest is the cat() function, which writes the response text to a CSV file:

cat(content(censusgeoKent1, "text"), file="censusgeoKentfile1output.csv")

Next, we need to read the CSV file back in. One quirk of the geocoder API is that the returned header row names are incomplete, so skip that row and supply column names instead:

censusgeoKentfile1data<-read.csv(file="censusgeoKentfile1output.csv", skip=1,
                             col.names = c('Retailer.Id', 'Complete.Address', 'Match', 'MatchType', 'MatchedAddress', 'LongLat', 'census.id',
                                           'Streetside', 'State', 'County', 'Tract', 'Block'))

There are two small obstacles to get over. One is that it appears not all address records were matched:

table(censusgeoKentfile1data$Match) # list number of matching records
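
A sketch for setting the unmatched records aside, assuming the geocoder marks them 'No_Match' in the Match column (the object name Kentnomatch1 is my own):

## Keep the rows the geocoder could not match, for later re-checking
## (assumes unmatched rows are labeled 'No_Match')
Kentnomatch1<-subset(censusgeoKentfile1data, Match == "No_Match")
nrow(Kentnomatch1)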


I’ll come back to the subset of non-matching records in Part 2 of this post, trying out different Census address files. For now, we’ll move on to the second set of Kent lottery retailers, recorded in addressesKent2.csv:

censusgeoKent2<- POST(url="https://geocoding.geo.census.gov/geocoder/geographies/addressbatch", 
                    body =
                      list(addressFile = upload_file("addressesKent2.csv"), 
                           benchmark = "Public_AR_Census2010",
                           vintage = "Census2010_Census2010"),
                    encode = "multipart", 
                    verbose())

http_status(censusgeoKent2)
cat(content(censusgeoKent2, "text"), file="censusgeoKentfile2output.csv")
censusgeoKentfile2data<-read.csv(file="censusgeoKentfile2output.csv", skip=1,
                             col.names = c('Retailer.Id', 'Complete.Address', 'Match', 'MatchType', 'MatchedAddress', 'LongLat', 'census.id',
                                           'Streetside', 'State', 'County', 'Tract', 'Block'))
table(censusgeoKentfile2data$Match) # list number of matching records

Appending the two sets of Kent address records:

## Append the two files with rbind(); 'censusgeoKent' will be the combined set of records:
censusgeoKent<-rbind(censusgeoKentfile2data, censusgeoKentfile1data)
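
A quick sanity check on the combined file, since each retailer should appear in only one of the two batches:

anyDuplicated(censusgeoKent$Retailer.Id) # returns 0 if no retailer ID repeats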

Another obstacle in the analysis is the formatting of the longitude and latitude coordinates, which are currently stored in a single field. We can separate the two with the 'tidyr' package, and make sure each is stored as a numeric:

library(tidyr)
censusgeoKent<-separate(data=censusgeoKent, col=LongLat, into=c("Long", "Lat"), sep=",")
censusgeoKent$Long<-as.numeric(censusgeoKent$Long)
censusgeoKent$Lat<-as.numeric(censusgeoKent$Lat)
str(censusgeoKent)
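
As a rough sanity check on the parsed coordinates, the matched points should fall inside Kent County's approximate bounding box (the bounds in the comments are my own rough figures):

## Matched points should fall roughly within Kent County
range(censusgeoKent$Lat, na.rm=TRUE)  # expect roughly 42.8 to 43.3
range(censusgeoKent$Long, na.rm=TRUE) # expect roughly -85.8 to -85.3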



I merged the geocoded addresses with the lottery sales by Retailer.Id; the resulting dataset is Kentretail:

Kentretail<-merge(Kentretailgeo, censusgeoKent, by="Retailer.Id", all=TRUE)
head(Kentretail)
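
Because the merge used all=TRUE, retailers without a geocoder record carry NAs in the geocoded columns; a quick count flags how many:

sum(is.na(Kentretail$Match)) # retailers left without a geocoder record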



There are several analyses I would like to do with the lottery sales. For now, though, let’s simply map out the retailer locations. 

With the lat/long coordinates, mapping has been relatively easy with the 'ggmap' package. In preparing this post, however, the most recent ggmap package from CRAN produced an error when pulling maps from Google: "Error: GeomRasterAnn was built with an incompatible version of ggproto…". Commenters at Stack Overflow recommend installing the version on GitHub to resolve the problem. This procedure worked for now:

library(devtools)
devtools::install_github("dkahle/ggmap")
library(ggmap)
GrandRapidsMap2 <- qmap("grand rapids, mi", zoom = 12)
plot(GrandRapidsMap2)

Then plotting the locations of lottery retailers:

GrandRapidsMap2 + ggtitle("Michigan Lottery retailers in Grand Rapids, 2015") +
    geom_point(aes(x = Long, y = Lat,fill = calctotal), alpha = .4, 
             data = Kentretail) + theme(legend.position="none")

results in this map:

[Map: Michigan Lottery retailers in Grand Rapids, 2015]

The map shows that lottery retailers tend to cluster along the wider, busier streets in and out of the city, such as east to west at the southern end of the city, and north to south on the eastern side. There’s a small concentration of retailers west of the river, on the western downtown side of the city. These concentrations are a bit more visible with a density plot.


GrandRapidsMap2 + ggtitle("Michigan Lottery retailers in Grand Rapids, 2015") +
 stat_density2d(aes(x=Long, y=Lat, fill = ..level..,
                     alpha = ..level..),
                 size = 2, bins =5, data = Kentretail,
                 geom = "polygon") + theme(legend.position="none")


One interesting feature of the map is a concentration of retailers on the southeast side of the city, near the intersection of Madison and Burton streets. The surrounding neighborhood is relatively poor, compared to other parts of the city. I’ll follow up on this angle to the lottery sales in a future post.
