Welcome!¶
Welcome to the midterm example tutorial for Python packages! In this tutorial, we are going to learn how to use the PRAW package to access data from the website Reddit.
What is Reddit?¶
Reddit is a social news and discussion website commonly known as a "community platform" where users can share content, rate it, and comment on it. The main features of Reddit consist of:
Content: Users can submit links, text, images, and videos to Reddit.
Voting: Other users can upvote or downvote content. Posts with more upvotes appear higher in their subreddit and on the site's front page.
Subreddits: Users can create boards called "subreddits" to organize content by topic.
For example, r/InternationalNews, r/CPlusPlus, and r/AppleMusic are all subreddits.
Moderation: Volunteer moderators manage individual subreddits, while Reddit administrators enforce the site-wide rules.
Real-time events: During major events, users can work together to create a real-time timeline of events.
What does PRAW do?¶
PRAW stands for Python Reddit API Wrapper. It's a Python library that allows you to easily interact with the Reddit API, enabling you to:
Read data from Reddit: Access posts, comments, user profiles, subreddits, and more.
Write data to Reddit: Submit posts, comments, vote, and perform other actions.
Build Reddit bots: Automate tasks on Reddit, such as monitoring subreddits or replying to comments.
Note: For the purposes of this tutorial, we'll focus specifically on collecting and reading data from Reddit.
Getting Started¶
Let's get started! There are many steps involved, but the first is an easy one: installing PRAW.
Installing PRAW¶
The first step is to install the PRAW package. Go to your terminal and type: pip install praw
You should now be able to see that the package has been installed. Depending on your system, you may need to use pip3 to install packages for Python 3.
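If you'd like to confirm the install from inside Python rather than the terminal, here is a minimal sketch using only the standard library. The helper names is_installed and installed_version are made up for this example; they are not part of PRAW.

```python
from importlib import metadata, util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return util.find_spec(module_name) is not None


def installed_version(dist_name: str):
    """Return the installed version string, or None if no such package is recorded."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if is_installed("praw"):
    print("PRAW version:", installed_version("praw"))
else:
    print("PRAW not found; try `pip install praw` (or `pip3 install praw`).")
```

If the second message prints even though pip reported success, pip may have installed into a different Python than the one you are running.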
Importing PRAW¶
The next step is to import PRAW. In order to use PRAW, you'll need to first have access to the following:
A Reddit Account -- you can create an account here.
Client ID & Client Secret (these are needed to access an API) -- Your Client ID & Client Secret are obtained when creating an app on Reddit. You can follow Reddit's First Steps Guide to create them here.
User Agent -- To use Reddit's API, you need a unique user agent.
API access -- you can request access to Reddit's API here.
Once you have all of the above items, we can go ahead and import PRAW. We'll use my account to access Reddit via the following code chunk:
import praw

reddit = praw.Reddit(
    client_id="XXX",
    client_secret="XXX",
    user_agent="XXX",
)
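Since the "XXX" placeholders stand in for private credentials, one option is to keep them out of the notebook entirely. The sketch below assumes you have stored them in environment variables named REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET (names chosen for this example), and it builds a user agent in the "platform:app ID:version (by /u/username)" pattern that Reddit's API rules suggest. The helper functions are hypothetical, not part of PRAW.

```python
import os


def build_user_agent(platform: str, app_id: str, version: str, username: str) -> str:
    """Build a descriptive user agent string in Reddit's suggested pattern."""
    return f"{platform}:{app_id}:{version} (by /u/{username})"


def load_credentials(env=os.environ) -> dict:
    """Read the client ID and secret from a mapping (the environment by default)."""
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")  # assumed variable names
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
    }


def make_reddit(user_agent: str):
    """Create a read-only Reddit instance from the stored credentials."""
    import praw  # imported here so the helpers above work without PRAW installed

    return praw.Reddit(user_agent=user_agent, **load_credentials())


# "your_username" is a placeholder; substitute your own Reddit username.
print(build_user_agent("script", "beetlejuice-tutorial", "0.1", "your_username"))
```

With the variables set, reddit = make_reddit(build_user_agent(...)) would take the place of the hard-coded call above.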
Now that we've imported PRAW, we can take a look at its many features! One of the most common uses is to pull data from subreddits.
Creating a Subreddit Instance (aka Accessing a Subreddit)¶
We can create a subreddit instance for any subreddit r/[insert name]. For the sake of this tutorial (and in keeping with the Halloween season), let's take a look at the submissions within the niche subreddit for Beetlejuice, r/Beetlejuice. To create a subreddit instance, we can use the following code:
subreddit = reddit.subreddit("Beetlejuice")
Now that we have a Subreddit instance, we can iterate through some of its submissions. Some of the most commonly used sort methods for submissions are:
controversial
gilded
hot
new
rising
top
Let's take a look at the titles of the 15 most controversial posts within r/Beetlejuice to see how this subreddit is currently responding to Beetlejuice content. We can view the titles like this:
for submission in reddit.subreddit("Beetlejuice").controversial(limit=15):
    print(submission.title)
Anyone feel that Beetlejuice Beetlejuice was rushed and mid?
I didn’t really like it
I liked Beetlejuice... but after a second watch
Here is the thing. I am a little bit sceptic to watch the second part because I enjoy watching the original movie from the 80'. Can you please share your experience with the 2nd part? I am prepared (I read the post below) and I know everything that I need to know but I am scared of disappointments..
Why are people justifying Beetlejuice x Lydia?
Is Winona Ryder okay? (Press Junket)
Depp
The art direction sucks
Does Beetlejuice have the hots for Astrid?
I genderswapped the movie characters
Love the sequel but we got to talk about its biggest issue (Spoilers)
Everything is turning into a Beetlejuice ad!!!!
Did you catch this in the movie?
I need yall help
Is Beetlejuice the same age as Lydia in the cartoon series?
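Because each sort in the list above is just a method on the Subreddit instance, switching between them can be wrapped in a small helper. This is only a sketch: fetch_titles is a name invented for this example, and the fake subreddit at the bottom is a stand-in so the chunk runs without credentials. With PRAW you would pass the real subreddit object instead.

```python
from types import SimpleNamespace

# The sort methods listed above.
SORTS = ("controversial", "gilded", "hot", "new", "rising", "top")


def fetch_titles(subreddit, sort="hot", limit=15):
    """Return up to `limit` submission titles using the chosen sort method."""
    if sort not in SORTS:
        raise ValueError(f"Unknown sort {sort!r}; expected one of {SORTS}")
    listing = getattr(subreddit, sort)  # e.g. subreddit.hot
    return [submission.title for submission in listing(limit=limit)]


# Stand-in data so the example runs offline (not real Reddit posts).
fake_posts = [SimpleNamespace(title="First post"), SimpleNamespace(title="Second post")]
fake_subreddit = SimpleNamespace(hot=lambda limit=None: iter(fake_posts[:limit]))

print(fetch_titles(fake_subreddit, sort="hot", limit=1))
```

With a live connection, fetch_titles(subreddit, sort="controversial", limit=15) should return the same titles that the loop above prints.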
It looks like this subreddit is a bit mixed in terms of its collective thoughts about Beetlejuice. To get an even more in-depth understanding of who these people are and how they might feel about the content, we can also take a look at the comments.
Accessing Comments within a Subreddit¶
Accessing subreddit comments is much like accessing subreddit submissions, with one caveat: listing methods such as top(), hot(), and controversial() always yield submissions, no matter what the loop variable is called. To iterate over a subreddit's most recent comments, PRAW provides subreddit.comments() instead. Rather than looking at titles this time, though, let's see who the authors are. The chunk below prints the authors of up to 25 of the past day's top submissions in r/Beetlejuice:
for submission in reddit.subreddit("Beetlejuice").top(time_filter='day', limit=25):
    print(submission.author)
heathashmil
Glittering-Tax-2734
SlySonic
GeometricTroops_49
dogoodfresh
Ulichan_cos
bil-sabab
ParadoxRadiant
MrManTrashCan
Accomplished_Wait446
Lower-Goose-9796
Forsaken_Strain8651
Oiyouinthebushes
DarkFox160
MacGrath1994
Wookie9991
The limit is an upper bound, so this run printed 16 authors rather than 25, presumably because fewer than 25 posts were submitted that day. Cool, right? Now that we've covered the basics of working with a subreddit's listings, let's crank it up a notch.
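To pull actual comments (rather than submissions) from a subreddit, one option is subreddit.comments(limit=...), which PRAW documents as yielding the subreddit's most recent comments. The sketch below wraps that call in a hypothetical helper, recent_comment_authors, and uses stand-in objects so it runs without a Reddit connection. Comments from deleted accounts have an author of None, which the helper relabels.

```python
from collections import Counter
from types import SimpleNamespace


def recent_comment_authors(subreddit, limit=25):
    """Return the author names of the most recent comments in a subreddit."""
    return [
        str(comment.author) if comment.author is not None else "[deleted]"
        for comment in subreddit.comments(limit=limit)
    ]


# Stand-in comments so the example runs offline (not real Reddit users).
fake_comments = [
    SimpleNamespace(author="user_a"),
    SimpleNamespace(author="user_b"),
    SimpleNamespace(author="user_a"),
    SimpleNamespace(author=None),
]
fake_subreddit = SimpleNamespace(comments=lambda limit=25: iter(fake_comments[:limit]))

authors = recent_comment_authors(fake_subreddit, limit=25)
# Count how often each author appears among the recent comments.
print(Counter(authors).most_common())
```

With PRAW, recent_comment_authors(reddit.subreddit("Beetlejuice")) would be the live equivalent.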
What if we want to look at comments from one specific post?
We can access information about the comments on one particular post by looking it up by its "post ID." Let's take a deeper look at the comments on the "Why are people justifying Beetlejuice x Lydia?" post. To do so, we'll fetch the submission, then print its title and its first 5 top-level comments, like so:
# Get a specific post by its ID
submission = reddit.submission(id='1fcml9l')

# Print the title and the first 5 comments
print(submission.title)
submission.comments.replace_more(limit=0)
for comment in submission.comments[:5]:
    print(f"Comment: {comment.body}")
Why are people justifying Beetlejuice x Lydia?
Comment: It's wildly improper. I mean maybe Beetlejuice, a man from the middle ages doesn't see it that way, but others really should. But now in the new movie, where Lydia is fifty one...
Comment: I bet you’re fun at parties.
Comment: No one’s shipping them in the first film, it’s about the second, now she’s an adult. Her looking at him like that is good for comedic effect but out of character to her laid back, moody teenage self. He presumably stole that photo from their home and obvs wasn’t able to take a pic of her as an adult cause he was in the underworld. Winona and other cast ship it. 10 year gap is pretty normal. Also she’s a grown woman who can say what she wants, she knows her own mind.
Comment: IIRC even Winona said in interviews she wants them to get together
Comment: I'm a shipper of the two. While I did watch the cartoon growing up, I barely remember it (and loathe the idea of paying $25 for the full series at Walmart when it's only *dvd quality* ugh). I also never watched the full musical and only recently began looking up clips on tiktok. I've said in many comments why I think they'd be perfect together. That said? Your take is totally valid too. I'm all for a *discussion* on this, if you want, but I'm not here to debate or "change your mind". I respect your opinion as long as you don't actively attack those who *do* ship them. We all like what we like.
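The slice above only reaches top-level comments. If replies are wanted too, PRAW's comment forest offers list(), which returns every comment in the thread as a flat list once replace_more() has cleared the "load more comments" placeholders. Here is a minimal sketch; flatten_comments is a helper name invented for this example, and the fake submission exists only so the chunk runs offline.

```python
from types import SimpleNamespace


def flatten_comments(submission):
    """Return one dict per comment in the thread, replies included."""
    # limit=0 drops the "load more comments" placeholders without fetching them.
    submission.comments.replace_more(limit=0)
    return [
        {"author": str(comment.author), "score": comment.score, "body": comment.body}
        for comment in submission.comments.list()
    ]


# Stand-in thread so the example runs offline (not real Reddit comments).
fake_thread = [
    SimpleNamespace(author="user_a", score=5, body="Top-level comment"),
    SimpleNamespace(author="user_b", score=2, body="A reply to it"),
]
fake_submission = SimpleNamespace(
    comments=SimpleNamespace(
        replace_more=lambda limit=32: [],
        list=lambda: list(fake_thread),
    )
)

for row in flatten_comments(fake_submission):
    print(row)
```

A list of dictionaries like this can be handed directly to pandas.DataFrame.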
Now we're cooking! (And based on the comments, so is this post, apparently).
It's nice to see individual attributes like "title" and "author," but what if we wanted to see multiple attributes of multiple submissions at once? One way is to manually add the specific attributes you'd like, e.g. title, score, and author. This code chunk uses an f-string to make it clear which element is which:
for post in subreddit.controversial(limit=5):
    print(f"Title: {post.title}, Score: {post.score}, Author: {post.author}")
Title: Anyone feel that Beetlejuice Beetlejuice was rushed and mid? , Score: 0, Author: Arjun_SagarMarchanda
Title: I didn’t really like it, Score: 4, Author: angelface1212
Title: I liked Beetlejuice... but after a second watch, Score: 0, Author: Werewolf_Knight
Title: Here is the thing. I am a little bit sceptic to watch the second part because I enjoy watching the original movie from the 80'. Can you please share your experience with the 2nd part? I am prepared (I read the post below) and I know everything that I need to know but I am scared of disappointments.., Score: 0, Author: tamara_994
Title: Why are people justifying Beetlejuice x Lydia?, Score: 3, Author: Important_Ad_1049
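If the set of attributes changes often, typing each label by hand gets error-prone. One alternative, sketched below, is to build the line from a tuple of attribute names with getattr. The helper describe_post is hypothetical, and the fake post is a stand-in so the chunk runs without a Reddit connection.

```python
from types import SimpleNamespace


def describe_post(post, fields=("title", "score", "author")):
    """Format the chosen attributes as 'Label: value' pairs on one line."""
    parts = []
    for name in fields:
        label = name.replace("_", " ").title()  # e.g. num_comments -> Num Comments
        value = getattr(post, name, "n/a")      # fall back if the attribute is missing
        parts.append(f"{label}: {value}")
    return ", ".join(parts)


# Stand-in post so the example runs offline (not a real Reddit submission).
fake_post = SimpleNamespace(title="Example title", score=3, author="user_a", num_comments=7)

print(describe_post(fake_post))
print(describe_post(fake_post, fields=("title", "num_comments")))
```

Because each label is derived from the attribute name, the label and the value cannot drift apart.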
Creating a Data Frame¶
Another method is to create a data frame! To do so, we'll first import pandas, then create an empty list to hold the information we'd like to collect. For each post, we'll build a dictionary of its attributes and append it to the list. From there, we create the DataFrame, assign it to a variable, and print the results. Let's see what happens:
import pandas as pd

# Create an empty list to store post data
posts_data = []

# Scrape the top 5 posts from the subreddit r/Beetlejuice
for post in subreddit.top(limit=5):
    post_info = {
        'title': post.title,
        'score': post.score,
        'url': post.url,
        'author': str(post.author),
        'num_comments': post.num_comments,
        'created_utc': post.created_utc,
        'selftext': post.selftext,
    }
    posts_data.append(post_info)

# Create a DataFrame from the list of post data
df = pd.DataFrame(posts_data)

# Print the DataFrame to verify the result
print(df)
                                               title  score  \
0                                           Not mine    567
1         I suddenly love this meme... (IG: enabuns)    506
2    I turned my broken gutter drain into a sandworm    486
3                    turned the popcorn bucket into…    469
4  They made a puppet just for a 5 seconds of scr...    463

                                    url                author  num_comments  \
0   https://i.redd.it/0v76jvj8zdl71.jpg   InsertOrigionalName            17
1   https://i.redd.it/4dzxo4vvwiq31.jpg                 lbpaz             4
2  https://i.redd.it/mwv3aoune0qd1.jpeg  PersonalChipmunk3605            17
3  https://i.redd.it/yuv2h0jt6hnd1.jpeg   placebeyondthepalms            19
4       https://v.redd.it/vzgyxwvj28sd1    GeometricTroops_49            44

    created_utc        selftext
0  1.630718e+09
1  1.570194e+09
2  1.726858e+09
3  1.725753e+09  new planter 🌱
4  1.727822e+09
That looks a little more organized, doesn't it? We can adjust the code if needed in order to fit the attributes we're specifically looking for.
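One thing that is hard to read in that output is created_utc, which is a Unix timestamp (seconds since January 1, 1970, UTC). Here is a small sketch for turning it into a readable date using only the standard library; utc_to_iso is a helper name made up for this example, and the sample value is the first row's timestamp as rounded in the printout, so the exact time of day is approximate.

```python
from datetime import datetime, timezone


def utc_to_iso(created_utc: float) -> str:
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()


# 1.630718e+09 as displayed above; the real value has more digits.
print(utc_to_iso(1630718000))

# With pandas, the whole column can be converted in one step:
# df['created'] = pd.to_datetime(df['created_utc'], unit='s', utc=True)
```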
What if we want to collect data from multiple subreddits simultaneously?
Keyword Searches for Multiple Subreddits¶
We can access more than one subreddit at a time. There are several methods for doing so, but arguably the most useful (and easiest) way is to search for specific keywords across all subreddits via r/all. We can create a list of the keywords we're looking for, join them with OR into a single search query, and collect the results in a data frame:
import praw
import pandas as pd

keyword_list = ['Beetlejuice', 'Beetlejuice 2', '80s', 'Tim Burton',
                'Winona Ryder', 'Michael Keaton', 'movie']
search_query = ' OR '.join(keyword_list)  # Join keywords with 'OR' for broader search

# Searching in all subreddits with a time filter for broader results
all_subreddits = reddit.subreddit("all")

# creating lists for storing scraped data
titles = []
scores = []
ids = []

# looping over posts and scraping data
for submission in all_subreddits.search(search_query, time_filter='all', limit=100):  # Add 'time_filter' and 'limit'
    titles.append(submission.title)
    scores.append(submission.score)  # upvotes
    ids.append(submission.id)

# Creating a DataFrame with the scraped data
new_df = pd.DataFrame({
    'Title': titles,
    'Id': ids,
    'Upvotes': scores
})

# Display the DataFrame to verify the result
print(new_df)

# Check if DataFrame is empty
if new_df.empty:
    print("No data found or API limits reached.")
                                                Title       Id  Upvotes
0   Lauren Boebert escorted out of “Beetlejuice” m...  16h60x7    30619
1   "Do you know who I am?" - Part 2. Lauren Boebe...  16hn3vc    12711
2   Tim MacMahon on Chicago fans booing Jerry Krau...  199v9cn     5624
3                                That's just Tim Walz  1emme31     8490
4             Fall and Christmas Horror Schedule 2024  1f7cyco        8
..                                                ...      ...      ...
95  Movies were better when production values were...  1b028l8     1216
96  LeVar Burton will receive a Lifetime Achieveme...   udvimn    11352
97       Beetlejuice Beetlejuice | Official Trailer 2  1e6auqs      808
98  WB allegedly “haven’t been this excited for a ...  102oea2     2228
99                Keaton as old Batman would be sick!  1bd4p9a     2377

[100 rows x 3 columns]
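Notice that several results (Tim Walz, LeVar Burton) have little to do with Beetlejuice. One likely reason is that multi-word keywords are joined unquoted, so "Tim Burton" may be matched as two separate words, and broad terms like "movie" match almost anything. The sketch below is one possible way to build a tighter query; build_query is a hypothetical helper, and it assumes Reddit's search treats double-quoted text as an exact phrase, which is worth verifying against your own results.

```python
def build_query(keywords):
    """Join keywords with OR, quoting multi-word phrases and skipping duplicates."""
    seen = set()
    terms = []
    for keyword in keywords:
        keyword = keyword.strip()
        key = keyword.lower()
        if not keyword or key in seen:
            continue  # skip blanks and case-insensitive repeats
        seen.add(key)
        # Assumption: quotes ask the search engine for an exact phrase match.
        terms.append(f'"{keyword}"' if " " in keyword else keyword)
    return " OR ".join(terms)


print(build_query(["Beetlejuice", "Tim Burton", "Winona Ryder", "Michael Keaton"]))
```

The resulting string can be passed to search() in place of search_query.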
We can make our keywords as specific or as general as needed. We can also adjust the time filter and the limit to narrow or broaden the results. This type of search is great for looking at how multiple subreddit communities interact with the same content, and/or how the subject matter changes between subreddit groups.
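To compare communities, it helps to know which subreddit each search result came from. Every PRAW submission has a subreddit attribute whose display_name is the subreddit's name, so a tally is a short step away. The sketch below uses a hypothetical helper, count_by_subreddit, and stand-in results so it runs offline.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_subreddit(submissions):
    """Tally how many of the given submissions came from each subreddit."""
    return Counter(submission.subreddit.display_name for submission in submissions)


def fake_result(subreddit_name):
    """Build a stand-in search result (not a real Reddit submission)."""
    return SimpleNamespace(subreddit=SimpleNamespace(display_name=subreddit_name))


fake_results = [fake_result("movies"), fake_result("Beetlejuice"), fake_result("movies")]

for name, count in count_by_subreddit(fake_results).most_common():
    print(f"r/{name}: {count}")
```

With PRAW, the argument would be the search listing itself, e.g. count_by_subreddit(all_subreddits.search(search_query, limit=100)).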
Okay, so we've successfully installed PRAW, accessed a subreddit, browsed its submissions and their authors, and pulled the comments for a specific post within that subreddit. We've also learned how to gather different attributes into data frames to collect multiple aspects of the data at once.
And that's our overview of PRAW! Looking at how people communicate online, both individually and in a community setting, is a crucial element of online psychological research, so PRAW is an invaluable tool for examining naturalistic communication within and between online collectives. Thank you!