SPSS data :: Journalism with R (2024)

SPSS is similar to Excel in that it’s proprietary software that stores data in a very specific format and provides a graphical interface useful for even deeper analysis.

It stands for Statistical Package for the Social Sciences and is owned by IBM. It’s also very expensive and usually only large businesses or organizations own licenses.

But it’s possible to bring data saved from SPSS into R.

In this example, we’ll be working with case-level data from the FBI’s Supplementary Homicide Report. It has data on more than 27,000 homicides and was obtained via Freedom of Information Act by the Murder Accountability Project.

Here’s a New Yorker article about the Murder Accountability Project and its founder Thomas Hargrove, a journalist who tries to find serial killers with data and algorithms. We’ll be digging through the data and applying the algorithm ourselves in the next chapter.

Zipped, the data is 30 megabytes. Unzipped, the file is almost 200 MB (good luck opening that in Excel).

But R can handle big data (to an extent). R loads data into the computer’s memory, so if your computer has 16 gigabytes of memory, that’s roughly the upper limit on what you can import. I don’t recommend pushing it to that point because it still takes a lot of memory to run R’s functions. If you get to the point of working with big data, there are strategies like putting the data into a MySQL database.
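A rough way to see how much memory an object takes up is base R’s object.size() function. The sketch below uses a made-up data frame called toy (not the homicide data) purely to illustrate the call.

```r
# Toy data frame standing in for a large data set (made-up values)
toy <- data.frame(id = 1:100000, value = rnorm(100000))

# object.size() reports the approximate memory the object occupies
size_bytes <- object.size(toy)

# format() can express that size in megabytes
print(format(size_bytes, units = "MB"))
```

Running the same call on a real data frame after importing it gives a sense of how much headroom is left.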

First, download the data and unzip it into the “data” subdirectory of this working directory.

If you’re working with the local data downloaded from this course’s repo, just run the lines of code below.

# Unzip the archive into the data folder, overwriting any earlier copy
unzip("data/SHR76_16.sav.zip", exdir="data", overwrite=TRUE)
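If the import fails later on, a quick check like the sketch below can show whether the archive and the extracted file are where the code expects them. The paths come from the unzip() call. The variable names zip_path, sav_path and found are just for illustration.

```r
# Paths taken from the unzip() call in this chapter
zip_path <- "data/SHR76_16.sav.zip"
sav_path <- "data/SHR76_16.sav"

# file.exists() returns TRUE or FALSE for each path
found <- file.exists(c(zip_path, sav_path))
names(found) <- c(zip_path, sav_path)
print(found)

if (found[[sav_path]]) {
  # file.size() is in bytes, so convert to megabytes
  print(round(file.size(sav_path) / 1024^2, 1))
}
```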

If you have the SHR76_16.sav file in your data directory, we can now use the read.spss() function from the foreign package to import the data.

Here’s the thing about SPSS files.

They’re layered.

There’s a label for the data and the value for the data.

So you need to anticipate that when working with R.
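R’s closest built-in analog to this label-plus-value layering is the factor, which is what read.spss() produces by default for variables that carry value labels. Here is a minimal sketch with made-up codes and labels (they are not taken from the homicide file):

```r
# Hypothetical SPSS-style codes and labels (not from the SHR file)
codes <- c(1, 2, 1)
sex <- factor(codes, levels = c(1, 2), labels = c("Male", "Female"))

# The label layer
print(as.character(sex))

# The factor's underlying integer codes (level positions, which
# happen to match the original codes in this toy example)
print(as.integer(sex))
```

Note that a factor’s integer codes are positions among its levels, so they won’t always match the numeric values stored in the SPSS file. That’s one reason to import the raw values separately.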

## If you don't have foreign yet installed, uncomment and run the line below
#install.packages("foreign")
library(foreign)
data_labels <- read.spss("data/SHR76_16.sav", to.data.frame=TRUE)

Check out what the data frame looks like and scroll all the way to the right of it.

View(data_labels)


data_only <- read.spss("data/SHR76_16.sav", to.data.frame=TRUE, use.value.labels=FALSE)

Check out this data frame and scroll all the way to the right.

View(data_only)


Can you spot the difference?

The data_labels dataframe has the state and metropolitan area columns on the far right spelled out.

The data_only dataframe has states and metropolitan area columns represented as numbers.
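Rather than spotting the differences by eye, you can ask R which columns differ between two data frames. This is a sketch on two tiny made-up data frames (labels_df and values_df are stand-ins, and the numeric state codes are invented). The same idea should carry over to data_labels and data_only.

```r
# Toy stand-ins for the two imports (made-up rows and codes)
labels_df <- data.frame(State = factor(c("Alaska", "Texas")),
                        Year = c(1976, 1980))
values_df <- data.frame(State = c(2, 48),
                        Year = c(1976, 1980))

# Compare the data frames column by column
same <- mapply(identical, labels_df, values_df)

# Names of the columns that are not identical
differs <- names(labels_df)[!same]
print(differs)
```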

This is a big deal because we need both sets of data for our analysis later on.

The two data frames are mostly duplicates of each other, but some columns in one are mirrored in the other as labels instead of numeric values.

Combine data frames

This is what we need to do.

  1. Bring in the dplyr package
  2. Rename the columns that are duplicated but have different data
  3. Drop the columns in one data set that are identical in the other
  4. Bring them together (join) as one big happy data frame

This is the first time you will be introduced to the concept of joining data sets, which is one of the most powerful and important things you can do in data analysis. We’ll go over it in the next chapter in more detail.

We’ll use the select() function from the dplyr package that lets you pick and rename specific columns.
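Before running this on the full data, here is the same idea as a small sketch on made-up data frames. The toy_ names and the Solved codes are invented for illustration. It assumes dplyr is installed.

```r
library(dplyr)

# Toy stand-ins for data_labels and data_only (made-up rows)
toy_labels <- data.frame(ID = c("a1", "a2"),
                         Year = c(1976, 1980),
                         Solved = factor(c("Yes", "No")))
toy_values <- data.frame(ID = c("a1", "a2"),
                         Year = c(1976, 1980),
                         Solved = c(1, 0))

# Keep the shared columns once and rename the column that differs
toy_new_labels <- select(toy_labels, ID, Year, Solved_label = Solved)

# From the other data frame, keep only the differing column, renamed
toy_new_values <- select(toy_values, Solved_value = Solved)

# cbind() assumes both data frames have the same rows in the same order
stopifnot(nrow(toy_new_labels) == nrow(toy_new_values))
toy_combined <- cbind(toy_new_labels, toy_new_values)
print(names(toy_combined))
```

Writing new_name = old_name inside select() both keeps the column and renames it in one step.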

library(dplyr)

## OK, we're keeping ID, CNTYFIPS, Ori, State, Agency, and AGENCY_A columns
## And we're going to rename the other ones so that we know they're labels
new_labels <- select(data_labels,
                     ID, CNTYFIPS, Ori, State, Agency, AGENCY_A,
                     Agentype_label=Agentype, Source_label=Source, Solved_label=Solved, Year,
                     Month_label=Month, Incident, ActionType, Homicide_label=Homicide,
                     Situation_label=Situation, VicAge, VicSex_label=VicSex, VicRace_label=VicRace,
                     VicEthnic, OffAge, OffSex_label=OffSex, OffRace_label=OffRace, OffEthnic,
                     Weapon_label=Weapon, Relationship_label=Relationship,
                     Circumstance_label=Circumstance, Subcircum, VicCount, OffCount, FileDate,
                     fstate_label=fstate, MSA_label=MSA)

## OK, we're dropping ID, CNTYFIPS, Ori, State, Agency, and AGENCY_A columns
## And we're going to rename the other ones so that we know they're specifically values
new_data_only <- select(data_only,
                        Agentype_value=Agentype, Source_value=Source, Solved_value=Solved,
                        Month_value=Month, Homicide_value=Homicide, Situation_value=Situation,
                        VicSex_value=VicSex, VicRace_value=VicRace, OffSex_value=OffSex,
                        OffRace_value=OffRace, Weapon_value=Weapon,
                        Relationship_value=Relationship, Circumstance_value=Circumstance,
                        fstate_value=fstate, MSA_value=MSA)

# cbind() means column binding-- it only works if the number of rows are the same
new_data <- cbind(new_labels, new_data_only)

# Now we're going to use the select() function to reorder the columns so labels and values are next to each other
new_data <- select(new_data,
                   ID, CNTYFIPS, Ori, State, Agency, AGENCY_A,
                   Agentype_label, Agentype_value, Source_label, Source_value,
                   Solved_label, Solved_value, Year, Month_label, Month_value,
                   Incident, ActionType, Homicide_label, Homicide_value,
                   Situation_label, Situation_value, VicAge, VicSex_label, VicSex_value,
                   VicRace_label, VicRace_value, VicEthnic, OffAge,
                   OffSex_label, OffSex_value, OffRace_label, OffRace_value, OffEthnic,
                   Weapon_label, Weapon_value, Relationship_label, Relationship_value,
                   Circumstance_label, Circumstance_value, Subcircum, VicCount, OffCount,
                   FileDate, fstate_label, fstate_value, MSA_label, MSA_value)

# remove the old data frames because they're huge and we want to free up memory
rm(data_labels)
rm(data_only)
rm(new_labels)
rm(new_data_only)

How’s it look at the end of the data frame now?

View(new_data)


There are now 47 columns total and it looks like the values are next to labels.
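If you’d rather not trust your eyes, a check along these lines can confirm that every label column is followed by its matching value column. The sketch runs on a short made-up vector of column names. On the real data you would presumably swap in names(new_data), and ncol(new_data) gives the column count.

```r
# A short, made-up sample of column names in the combined order
cols <- c("ID", "Solved_label", "Solved_value",
          "Year", "Month_label", "Month_value")

# Positions of the label columns
label_pos <- grep("_label$", cols)

# Each label column should be followed by the matching value column
expected <- sub("_label$", "_value", cols[label_pos])
adjacent <- cols[label_pos + 1] == expected
print(all(adjacent))
```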

Wonderful.

Let’s move on to the next chapter so we can start wrangling this data.

© Copyright 2018, Andrew Ba Tran
