SPSS data :: Journalism with R (2024)

SPSS is similar to Excel in that it’s proprietary software that stores data in a very specific format and provides a graphical interface useful for even deeper analysis.

It stands for Statistical Package for the Social Sciences and is owned by IBM. It’s also very expensive and usually only large businesses or organizations own licenses.

But it’s possible to bring data saved from SPSS into R.
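Here is a minimal sketch of what that import could look like, using the read.spss() function from the foreign package. It assumes the two data frames used later in this chapter, data_labels and data_only, come from reading the same file twice: once with the text labels and once with the raw codes. The helper name import_spss_twice and the file path are hypothetical.

```r
library(foreign)  # a recommended package that ships with standard R installs

# Hypothetical helper: read the same .sav file twice
import_spss_twice <- function(path) {
  list(
    # Coded variables come in as factors showing their text labels
    labels = read.spss(path, to.data.frame = TRUE, use.value.labels = TRUE),
    # Coded variables keep their raw codes
    values = read.spss(path, to.data.frame = TRUE, use.value.labels = FALSE)
  )
}

# Assumed location of the unzipped file; change it to match your folders
sav_path <- "data/homicides.sav"

# Only run the import if the file is actually there
if (file.exists(sav_path)) {
  imported <- import_spss_twice(sav_path)
  data_labels <- imported$labels
  data_only <- imported$values
}
```

Reading the file twice costs extra memory, but it gives you both the human-readable labels and the original codes to work with.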

In this example, we’ll be working with case-level data from the FBI’s Supplementary Homicide Report. It has data on more than 27,000 homicides and was obtained via the Freedom of Information Act by the Murder Accountability Project.

Here’s a New Yorker article about the Murder Accountability Project and its founder Thomas Hargrove, a journalist who tries to find serial killers with data and algorithms. We’ll be digging through the data and applying the algorithm ourselves in the next chapter.

Zipped, the data is 30 megabytes. Unzipped, the file is almost 200 MB (good luck opening that in Excel).

But R can handle big data (to an extent). R loads data into the computer’s memory, so if your computer has 16 gigabytes of memory, that’s the theoretical maximum file size you can import. I don’t recommend pushing it to that point because R’s functions also need a lot of memory to run. If you get to the point of working with big data, there are strategies like putting the data into a MySQL database.
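If you want a rough idea of how much memory a data frame is taking up, base R can tell you. This is a small sketch with made-up data (the toy data frame is hypothetical, not the homicide file).

```r
# A made-up data frame standing in for a much larger import
toy <- data.frame(id = 1:100000, value = rnorm(100000))

# object.size() reports how much memory an object occupies
size_bytes <- object.size(toy)
print(format(size_bytes, units = "MB"))

# rm() drops the object; gc() asks R to hand back the freed memory
rm(toy)
invisible(gc())
```

Checking sizes like this before and after an import is an easy way to see how close you are getting to your computer’s limit.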

Combine data frames

This is what we need to do.

Bring in the dplyr package
Rename the columns that are duplicated but have different data
Drop the columns in one data set that are duplicated but are the same in the other
Bring them together (join) as one big happy data frame

This is the first time you will be introduced to the concept of joining data sets, which is one of the most powerful and important things you can do in data analysis. We’ll go over it in the next chapter in more detail.

We’ll use the select() function from the dplyr package, which lets you pick and rename specific columns.
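To see how select() behaves before pointing it at the big file, here is a tiny sketch on made-up data. The toy data frame and its column values are hypothetical.

```r
library(dplyr)

# Made-up data, not the homicide file
toy <- data.frame(
  ID     = 1:3,
  Solved = c("Yes", "No", "Yes"),
  Extra  = c("a", "b", "c")
)

# Keep ID as is, keep Solved under a new name,
# and drop Extra simply by leaving it out
picked <- select(toy, ID, Solved_label = Solved)

names(picked)
```

The pattern new_name = old_name inside select() is what does the renaming.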

library(dplyr)

## OK, we're keeping ID, CNTYFIPS, Ori, State, Agency, and AGENCY_A columns
## And we're going to rename the other ones so that we know they're labels
new_labels <- select(data_labels,
                     ID, CNTYFIPS, Ori, State, Agency, AGENCY_A,
                     Agentype_label=Agentype,
                     Source_label=Source,
                     Solved_label=Solved,
                     Year,
                     Month_label=Month,
                     Incident, ActionType,
                     Homicide_label=Homicide,
                     Situation_label=Situation,
                     VicAge,
                     VicSex_label=VicSex,
                     VicRace_label=VicRace,
                     VicEthnic, OffAge,
                     OffSex_label=OffSex,
                     OffRace_label=OffRace,
                     OffEthnic,
                     Weapon_label=Weapon,
                     Relationship_label=Relationship,
                     Circumstance_label=Circumstance,
                     Subcircum, VicCount, OffCount, FileDate,
                     fstate_label=fstate,
                     MSA_label=MSA)

## OK, we're dropping ID, CNTYFIPS, Ori, State, Agency, and AGENCY_A columns
## And we're going to rename the other ones so that we know they're specifically values
new_data_only <- select(data_only,
                        Agentype_value=Agentype,
                        Source_value=Source,
                        Solved_value=Solved,
                        Month_value=Month,
                        Homicide_value=Homicide,
                        Situation_value=Situation,
                        VicSex_value=VicSex,
                        VicRace_value=VicRace,
                        OffSex_value=OffSex,
                        OffRace_value=OffRace,
                        Weapon_value=Weapon,
                        Relationship_value=Relationship,
                        Circumstance_value=Circumstance,
                        fstate_value=fstate,
                        MSA_value=MSA)

# cbind() means column binding-- it only works if the number of rows are the same
new_data <- cbind(new_labels, new_data_only)

# Now we're going to use the select() function to reorder the columns so labels and values are next to each other
new_data <- select(new_data,
                   ID, CNTYFIPS, Ori, State, Agency, AGENCY_A,
                   Agentype_label, Agentype_value,
                   Source_label, Source_value,
                   Solved_label, Solved_value,
                   Year,
                   Month_label, Month_value,
                   Incident, ActionType,
                   Homicide_label, Homicide_value,
                   Situation_label, Situation_value,
                   VicAge,
                   VicSex_label, VicSex_value,
                   VicRace_label, VicRace_value,
                   VicEthnic, OffAge,
                   OffSex_label, OffSex_value,
                   OffRace_label, OffRace_value,
                   OffEthnic,
                   Weapon_label, Weapon_value,
                   Relationship_label, Relationship_value,
                   Circumstance_label, Circumstance_value,
                   Subcircum, VicCount, OffCount, FileDate,
                   fstate_label, fstate_value,
                   MSA_label, MSA_value)

# remove the old data frames because they're huge and we want to free up memory
rm(data_labels)
rm(data_only)
rm(new_labels)
rm(new_data_only)
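Because cbind() simply glues columns side by side, it trusts that both data frames have their rows in the same order. Here is a small sketch on made-up data showing one way to sanity-check that. The names labels_toy, values_toy, combined and pairs are hypothetical.

```r
library(dplyr)

# Two made-up data frames describing the same three rows:
# one holds text labels, the other holds the matching codes
labels_toy <- data.frame(ID = 1:3, Solved = c("Yes", "No", "Yes"))
values_toy <- data.frame(ID = 1:3, Solved = c(1, 0, 1))

new_labels_toy <- select(labels_toy, ID, Solved_label = Solved)
new_values_toy <- select(values_toy, Solved_value = Solved)

# Stop early if the row counts differ, since cbind() needs them to match
stopifnot(nrow(new_labels_toy) == nrow(new_values_toy))

combined <- cbind(new_labels_toy, new_values_toy)

# Each label should pair up with exactly one code;
# extra combinations here would suggest the rows are out of step
pairs <- count(combined, Solved_label, Solved_value)
print(pairs)

# How many columns did we end up with, and what are they called?
ncol(combined)
names(combined)
```

On the real data, the same count() check on a label column and its value column is a quick way to confirm the two imports lined up.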

How’s it look at the end of the data frame now?

View(new_data)

There are now 47 columns total and it looks like the values are next to labels.

Wonderful.

Let’s move on to the next chapter so we can start wrangling this data.

FAQs

Is SPSS easier to use than R?

SPSS has a relatively low learning curve and offers user-friendly tutorials and documentation. It also has a support team and community forum for troubleshooting and answering questions. R, on the other hand, has a steeper learning curve and requires some programming skills.

What can R do that SPSS cannot?

R graphics are more advanced than those in SPSS. R has at least three different graphics systems (base graphics, lattice and ggplot2), and it can handle very complex statistical analyses. An advantage of SPSS is that it can perform some computations in parallel and can work from the hard disk rather than holding everything in memory.

How to read SPSS data into R?

Method 1 - foreign R package

Select Data Sources > Plus (+) > R.
Enter a name for the data set under Name.
Paste the below R code where it states "Enter your R code here": library(foreign) location = "https://wiki.q-researchsoftware.com/images/3/35/Technology_2018.sav" ...
Note this will import using the variable names.

What are the advantages of R programming over SPSS?

Both R and Python are open-source languages that are freely available for everyone to use. In contrast, SPSS is proprietary software that requires a license. R and Python have an exceptionally broad range of functions (well over 2,000 packages), and new statistical methods are quickly implemented.

Is SPSS enough for data analysis?

No matter what your business objectives are, if you have a bunch of data that you want to analyze, SPSS is one of the best statistical analysis tools you can use.

What are the disadvantages of R programming?

Pros and cons of R programming

Advantages	Disadvantages
Leading language when it comes to comprehensive statistical analysis packages	Memory-intensive since objects are stored in physical memory
Community-developed code enhancements and bug fixes	Lacking in security features, cannot be embedded in a web application.

Is SPSS difficult to learn?

SPSS's interface resembles that of Excel spreadsheets, which makes it easy to learn. If you have never come across SPSS before, it will be helpful to have previously worked with a spreadsheet program, such as OpenOffice or MS Excel.

Is it easy to use SPSS?

SPSS is an easy-to-use and powerful data management and analysis software package that performs a wide variety of statistical procedures. The original acronym stands for 'Statistical Package for the Social Sciences'. SPSS runs on Windows, Macintosh and UNIX platforms.

What is the advantage of using SPSS?

One of the key advantages of using SPSS in monitoring and evaluation is its ability to handle large datasets. SPSS can easily handle datasets with thousands of variables and millions of cases, making it suitable for large-scale evaluations and research projects.

Which is easier to learn, R or Stata?

R is a reasonable choice if you have a basic understanding of coding or are familiar with a coding environment. Stata, on the other hand, may be preferable if you have little or no coding experience, because it is simple to use.