Using the survey package in R to analyze the European Social Survey, part 1

For future reference, I’d like to have a record of tools for analyzing the European Social Survey via the “survey” package by Lumley (https://cran.r-project.org/web/packages/survey/survey.pdf). In this post, I simply set up the survey object and demonstrate the tabulation of responses.

The examples below require the survey, dplyr, and forcats packages:

library(survey)
library(dplyr)
library(forcats)

Below I load a version of the 8th round of the European Social Survey dataset (http://europeansocialsurvey.org):

load(file=url("https://github.com/whittkilburn/ESSdata/raw/master/ess%20round%208%20workspace.rdata"))

The data frame within the workspace is ess8. It was imported from a Stata data file with the foreign package. Factor labels were preserved for available columns, with one exception: the sampling weight column was replaced with a numeric version. The factor labels are there to make it easier for students to tabulate variables without having to check the attributes of each one, as would be the case if the dataset were imported via the haven package. For example, the variable recording an individual’s left-right self-placement is preserved with factor labels:

table(ess8$lrscale)
## 
##       Left          1          2          3          4          5 
##       1463        846       2121       3754       3780      12389 
##          6          7          8          9      Right    Refusal 
##       4105       4269       3265        999       1592       1487 
## Don't know  No answer 
##       4305         12
str(ess8$lrscale)
##  Factor w/ 14 levels "Left","1","2",..: 1 2 6 1 6 6 5 6 6 6 ...

We will use the mutate() function to add a new variable, called left_right, as the last column of the dataset. The following code simply creates an identical copy of lrscale. Assigning the result back with ess8 <- stores the output of mutate() in the original dataset.

ess8 <- ess8 %>%
  mutate(left_right = lrscale)

## mutate(NEW = OLD): the name of the new variable = the name of the original variable


table(ess8$left_right)
## 
##       Left          1          2          3          4          5 
##       1463        846       2121       3754       3780      12389 
##          6          7          8          9      Right    Refusal 
##       4105       4269       3265        999       1592       1487 
## Don't know  No answer 
##       4305         12

For simplicity, let’s just work with the sample from Austria. We use filter() to select the Austrian sample.

ess8_at<-ess8 %>%
  filter(cntry=="AT")

Creating the survey design object

The ESS data contain several weights. Two matter here: pweight is the population size weight, and pspwght is the post-stratification weight, which already incorporates the design weight. Since I’m working with just the Austrian sample, the population size weight is not needed, so we’ll use pspwght on its own. In a future post I’ll construct some multi-level analyses using both.

The code below uses the svydesign() function to create an object, ess_design_at, which contains the information we need to account for the survey design. Running it doesn’t show any results; it just stores the design in ess_design_at:

ess_design_at <- 
  svydesign(
    ids = ~0 ,
    strata=NULL,
    weights=~pspwght,
    data = ess8_at)

The part of the statement that says weights=~pspwght sets the sampling weight. With your own data, you would substitute your own weight variable for pspwght and name the design object whatever you like.

The other parts of the statement: ids = ~0 means we don’t have a variable for sampling clusters, and strata=NULL means we specify no stratification. Then we list the weight and the dataset. The weight variable must be numeric. If you pass a qualitative (factor) variable as the weight, you will see the error Error in 1/as.matrix(weights) : non-numeric argument to binary operator.
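If a weight column does arrive as a factor, one way to avoid that error is to convert it to numeric before building the design. Here is a minimal sketch on made-up data; the data frame toy and its columns are hypothetical, not ESS variables:

```r
library(survey)

# Hypothetical toy data: the weight column was read in as a factor by mistake
toy <- data.frame(
  answer = factor(c("yes", "no", "yes", "no")),
  wt     = factor(c("0.5", "1.5", "2", "1"))
)

# Convert via character first; as.numeric() applied directly to a factor
# would return the internal level codes (1, 2, 3, ...) instead of the weights
toy$wt_num <- as.numeric(as.character(toy$wt))

# The numeric weight can now be passed to svydesign()
toy_design <- svydesign(ids = ~0, strata = NULL, weights = ~wt_num, data = toy)
toy_table  <- svytable(~answer, toy_design)
toy_table
```

Going through as.character() is the important step here, since the factor stores the weights as labels rather than as numbers.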

Survey analysis

The survey analysis commands take both a variable to analyze and the survey design object. svytable() produces frequency counts. In this case, the variable to analyze is left_right, written as a one-sided formula (~ left_right), and the survey design object ess_design_at comes next.

svytable(~ left_right, ess_design_at ) 
## left_right
##       Left          1          2          3          4          5 
##   68.88656   38.72109  123.95784  170.99772  215.48897  651.09398 
##          6          7          8          9      Right    Refusal 
##  220.82195  135.83134  107.26632   30.88962   56.81019   82.46806 
## Don't know  No answer 
##  106.76636    0.00000

The counts have decimal places because they are sums of survey weights rather than counts of respondents. We can round entries to the nearest integer with round=TRUE:

svytable(~ left_right, ess_design_at, round=TRUE ) 
## left_right
##       Left          1          2          3          4          5 
##         69         39        124        171        215        651 
##          6          7          8          9      Right    Refusal 
##        221        136        107         31         57         82 
## Don't know  No answer 
##        107          0
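To see where the decimals come from, here is a minimal sketch on made-up data (the data frame toy and its columns are hypothetical, not ESS variables). Assuming a design with weights only, each svytable() entry should equal the sum of the weights of the respondents in that category:

```r
library(survey)

# Hypothetical toy data: five respondents with unequal weights
toy <- data.frame(
  answer = factor(c("left", "left", "right", "center", "right")),
  wt     = c(0.8, 1.3, 0.6, 1.1, 1.2)
)
toy_design <- svydesign(ids = ~0, strata = NULL, weights = ~wt, data = toy)

# Weighted frequencies from the survey package
weighted_counts <- svytable(~answer, toy_design)

# The same quantities computed by hand: sum the weights within each category
sums_of_weights <- tapply(toy$wt, toy$answer, sum)

weighted_counts
sums_of_weights
```

The two printouts should agree category by category, which is why weighted tables rarely contain whole numbers.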

These are frequencies. We can calculate percentages by dividing each entry by the table total and multiplying by 100. To shorten the code, we store the result of svytable() as table_a:

table_a<-svytable(~ left_right, ess_design_at)

table_a/sum(table_a)*100 
## left_right
##       Left          1          2          3          4          5 
##   3.427192   1.926422   6.167057   8.507349  10.720844  32.392735 
##          6          7          8          9      Right    Refusal 
##  10.986167   6.757778   5.336633   1.536797   2.826378   4.102889 
## Don't know  No answer 
##   5.311759   0.000000
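As an alternative to dividing by the total by hand, base R's prop.table() should give the same proportions, since svytable() returns a table object. A small sketch on made-up data (toy and its columns are hypothetical):

```r
library(survey)

# Hypothetical toy data with equal weights, so the percentages are easy to check
toy <- data.frame(
  answer = factor(c("left", "left", "right", "center")),
  wt     = c(1, 1, 1, 1)
)
toy_design <- svydesign(ids = ~0, strata = NULL, weights = ~wt, data = toy)
toy_table  <- svytable(~answer, toy_design)

# Percentages by hand, as in the text
by_hand <- toy_table / sum(toy_table) * 100

# The same percentages via prop.table()
with_prop <- prop.table(toy_table) * 100

by_hand
with_prop
```

Either form works; prop.table() mainly pays off with two-way tables, where its margin argument gives row or column percentages.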

Creating better formatted tables

Usually, the purpose of running these svytable() functions is to calculate various statistics and then create a neat table in a word processor. One additional command neatens up the results: knitr::kable(, digits=2), where the table goes before the comma. Here are two examples. The first is a set of simple frequencies; the second is the percentages:

knitr::kable(table_a, digits = 2)
|left_right |   Freq|
|:----------|------:|
|Left       |  68.89|
|1          |  38.72|
|2          | 123.96|
|3          | 171.00|
|4          | 215.49|
|5          | 651.09|
|6          | 220.82|
|7          | 135.83|
|8          | 107.27|
|9          |  30.89|
|Right      |  56.81|
|Refusal    |  82.47|
|Don't know | 106.77|
|No answer  |   0.00|
# While the table header shows "Freq", these are actually percentages
knitr::kable(table_a/sum(table_a)*100 , digits=2)
|left_right |  Freq|
|:----------|-----:|
|Left       |  3.43|
|1          |  1.93|
|2          |  6.17|
|3          |  8.51|
|4          | 10.72|
|5          | 32.39|
|6          | 10.99|
|7          |  6.76|
|8          |  5.34|
|9          |  1.54|
|Right      |  2.83|
|Refusal    |  4.10|
|Don't know |  5.31|
|No answer  |  0.00|

We can also change the column headers with col.names (a caption can be added with kable()’s caption argument):

knitr::kable(table_a/sum(table_a)*100 , digits=2, col.names=c("left to right ID", "Percent"))
|left to right ID | Percent|
|:----------------|-------:|
|Left             |    3.43|
|1                |    1.93|
|2                |    6.17|
|3                |    8.51|
|4                |   10.72|
|5                |   32.39|
|6                |   10.99|
|7                |    6.76|
|8                |    5.34|
|9                |    1.54|
|Right            |    2.83|
|Refusal          |    4.10|
|Don't know       |    5.31|
|No answer        |    0.00|
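For completeness, here is a sketch that sets both the column headers and a caption. The numbers and the caption text are made up for illustration, and format = "pipe" is used only so that the result prints as plain text:

```r
library(knitr)

# Hypothetical percentages, not taken from the ESS data
toy_pct <- data.frame(
  position = c("left", "center", "right"),
  pct      = c(31.456, 40.123, 28.421)
)

# col.names relabels the columns; caption adds a title line to the table
toy_kable <- kable(
  toy_pct,
  format    = "pipe",
  digits    = 2,
  col.names = c("Left-right position", "Percent"),
  caption   = "Hypothetical left-right self-placement (percent)"
)
toy_kable
```

The same col.names and caption arguments can be added to the kable() calls above.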

We can calculate statistics for a subset, such as younger people under 30 years old. There is an age variable in the dataset, called agea:

svytable(~ left_right, subset(ess_design_at, agea<30), round=TRUE) 
## left_right
##       Left          1          2          3          4          5 
##         16         14         51         61         35        117 
##          6          7          8          9      Right    Refusal 
##         30         26         22          6         14         15 
## Don't know  No answer 
##         23          0

These are frequency counts, rounded to the nearest integer, per round=TRUE.
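The same pattern can be tried on a small made-up example, where the result is easy to check by hand. The data frame toy and its columns are hypothetical, not ESS variables:

```r
library(survey)

# Hypothetical toy data: five respondents with an age and a weight
toy <- data.frame(
  answer = factor(c("left", "right", "left", "right", "left")),
  age    = c(22, 45, 28, 67, 31),
  wt     = c(1.2, 0.9, 0.7, 1.1, 1.0)
)
toy_design <- svydesign(ids = ~0, strata = NULL, weights = ~wt, data = toy)

# Subset the design object itself, not the underlying data frame
young_design <- subset(toy_design, age < 30)

# Only the two respondents under 30 contribute (weights 1.2 and 0.7, both "left")
young_table <- svytable(~answer, young_design)
young_table
```

Categories with no respondents in the subset, like "right" here, still appear in the table with a count of zero.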

When subsetting, the survey design goes inside the subset() function. Next, we can calculate mean self-placement on the left-right scale by party. But first let’s create a left-right variable with just three categories, left, center, and right:

ess8_at <- ess8_at %>%
  mutate(left_right3cat = fct_collapse(left_right,
         left = c("Left", "1" , "2", "3", "4"),
         center = "5",
         right = c("6", "7", "8", "9", "Right") ))

table(ess8_at$left_right3cat)
## 
##       left     center      right    Refusal Don't know  No answer 
##        575        677        576         87         95          0
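Note that ess_design_at was created before left_right3cat existed, and a design object keeps its own copy of the data, so the new variable is not yet available to the survey functions. One option is to rerun svydesign() on the updated data frame. Another, sketched here on made-up data (toy, lr, and lr3 are hypothetical names), is the survey package's update() method, which adds variables to an existing design:

```r
library(survey)
library(forcats)

# Hypothetical toy data with a few left-right labels
toy <- data.frame(
  lr = factor(c("Left", "1", "5", "9", "Right", "5")),
  wt = c(1.0, 0.8, 1.2, 0.9, 1.1, 1.0)
)
toy_design <- svydesign(ids = ~0, strata = NULL, weights = ~wt, data = toy)

# Add the collapsed variable directly to the design object
toy_design <- update(
  toy_design,
  lr3 = fct_collapse(lr,
                     left   = c("Left", "1"),
                     center = "5",
                     right  = c("9", "Right"))
)

# The new variable can now be tabulated with the survey functions
toy_lr3 <- svytable(~lr3, toy_design)
toy_lr3
```

With the ESS data, the analogous step would be to rebuild or update ess_design_at after creating left_right3cat.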

Not to get bogged down in more recoding, but we also need to recode a measure of party support.

Let’s look at the example of the political party an Austrian respondent reports voting for in the last election:

table(ess8_at$prtvtbat)
## 
##                      SP\xd6                      \xd6VP 
##                         475                         358 
##                      FP\xd6                      BZ\xd6 
##                         250                          10 
##                    Gr\xfcne                      KP\xd6 
##                         185                           6 
##                        NEOS Piratenpartei \xd6sterreich 
##                          36                           4 
##         Team Frank Stronach                       Other 
##                          18                           7 
##              Not applicable                     Refusal 
##                         418                         205 
##                  Don't know                   No answer 
##                          38                           0

The labels contain escape sequences such as \xd6 and \xfc where the umlauts Ö and ü were not converted correctly on import, and we would want to clean that up and provide more descriptive labels in English. So we will use fct_recode() to alter the existing labels.
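As a preview of how fct_recode() works, here is a minimal sketch on a made-up factor; the labels are hypothetical stand-ins, not the actual ESS party labels. Each argument has the form "new label" = "old label", and NULL = "old label" sets that level to missing:

```r
library(forcats)

# Hypothetical factor with awkward labels
party <- factor(c("Party A?", "Party B?", "Other", "Party A?"))

party_clean <- fct_recode(party,
  "Party A (clean label)" = "Party A?",   # new label = old label
  "Party B (clean label)" = "Party B?",
  NULL = "Other"                          # drop this level (becomes NA)
)

table(party_clean, useNA = "ifany")
```

Because the old labels must match exactly, it is usually safest to copy and paste them from the output of levels() or table().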

See part 2 of this post, next:
