TikTok is a popular video sharing platform where users can upload content on a variety of topics. In this tutorial, we will use traktok, an R package that allows us to access TikTok data by simulating a browser accessing the API endpoints used by TikTok itself.
First, install and load the packages used in this tutorial: traktok, cookiemonster, and tidyverse. traktok scrapes data from TikTok and is installed from GitHub using the devtools package. cookiemonster handles authentication with browser cookies, so we can access the TikTok API without an API key. Finally, we use tidyverse to analyze the data obtained through traktok. The code below installs traktok and cookiemonster before loading them; it assumes tidyverse is already installed, so we only load it. If it is not, run install.packages("tidyverse") first.
library(devtools)
## Loading required package: usethis
devtools::install_github("JBGruber/traktok")
## Using GitHub PAT from the git credential store.
## Skipping install of 'traktok' from a github remote, the SHA1 (c2a88432) has not changed since last install.
## Use `force = TRUE` to force installation
library(traktok)
install.packages("cookiemonster")
## Installing package into '/Users/benjaminsilver/Library/R/x86_64/4.2/library'
## (as 'lib' is unspecified)
##
## The downloaded binary packages are in
## /var/folders/pp/lk19kb4d6_g2vn0f30qf_zv40000gn/T//RtmpShYf2s/downloaded_packages
library(cookiemonster)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
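Re-running the install commands every time the script is sourced is unnecessary once the packages are present. The following is a minimal sketch of one way to install only what is missing. The helper name missing_packages() is hypothetical (it is not part of any of the packages above), and the interactive() guard is an assumption made here so that the script never downloads anything when run non-interactively.

```r
# Hypothetical helper (not part of traktok or cookiemonster):
# return the names in `pkgs` that are not currently installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# CRAN packages used in this tutorial.
cran_missing <- missing_packages(c("devtools", "cookiemonster", "tidyverse"))

# Only install in an interactive session, so sourcing the script
# never triggers downloads on its own.
if (interactive() && length(cran_missing) > 0) {
  install.packages(cran_missing)
}

# traktok is installed from GitHub, as in the tutorial.
if (interactive() && length(missing_packages("traktok")) > 0) {
  devtools::install_github("JBGruber/traktok")
}
```

After this runs, the library() calls shown above load the packages as usual.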
To scrape TikTok data, you need to authenticate yourself using browser cookies. Here’s how to do that.
1. Log into TikTok on your browser.
2. Export cookies: After logging in, extract your TikTok cookies using a browser extension.
   * For Chrome/Edge: Get cookies.txt extension
   * For Firefox: cookies.txt extension
3. Save cookies: After installing the extension, export your cookies to a file and save it in your R working directory in RStudio.
4. Import cookies into R: Use the add_cookies() function to read the file and store the cookies in cookiemonster's cookie jar. Because add_cookies() reads the cookies file directly, the file name in the code must exactly match the name of the file you saved.
cookiemonster::add_cookies("cookies.txt")
## ✔ Cookies for tiktok.com, www.tiktok.com put in the jar!
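If add_cookies() fails or the later search returns nothing, a common cause is a cookies file that was exported from the wrong tab or saved under a different name. The sketch below checks which domains a cookies file contains before importing it. It assumes the file is in the Netscape cookies.txt format (seven tab-separated fields per line, with the domain first); cookie_domains() is a hypothetical helper written for this illustration, and the demo file contains made-up values.

```r
# Hypothetical helper: list the domains found in a Netscape-format cookies.txt.
cookie_domains <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # HttpOnly cookies are written with a "#HttpOnly_" prefix; keep them.
  lines <- sub("^#HttpOnly_", "", lines)
  # Drop blank lines and comment lines.
  lines <- lines[nzchar(trimws(lines)) & !startsWith(lines, "#")]
  fields <- strsplit(lines, "\t", fixed = TRUE)
  # A cookie line has 7 tab-separated fields; the first one is the domain.
  fields <- fields[lengths(fields) == 7]
  unique(vapply(fields, `[`, character(1), 1))
}

# Demo on a temporary file with made-up cookie names and values.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "# Netscape HTTP Cookie File",
  ".tiktok.com\tTRUE\t/\tTRUE\t1760000000\tdemo_name\tdemo_value",
  "#HttpOnly_www.tiktok.com\tFALSE\t/\tTRUE\t1760000000\tdemo_id\tabc123"
), demo_file)

domains <- cookie_domains(demo_file)
has_tiktok <- any(grepl("tiktok\\.com$", domains))
print(domains)
```

With a real export, cookie_domains("cookies.txt") should list at least one tiktok.com domain; if it does not, re-export the cookies while logged in on the TikTok tab.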
Once you’re authenticated, you can search for TikTok videos using traktok. The function tt_search_hidden() searches for videos by keyword or hashtag. The call below returns a data frame with one row per top video found for the hashtag #mentalhealth, fetching up to 10 pages of results.
videos_all <- tt_search_hidden("#mentalhealth", max_pages = 10)
## ℹ Getting page 1⏲ waiting 9 seconds ℹ Getting page 1✔ Got page 1. Found 12 videos. [10.2s]
## ℹ Getting page 2⏲ waiting 1.9 seconds ℹ Getting page 2✔ Got page 2. Found 13 videos. [2.4s]
## ℹ Getting page 3⏲ waiting 0.1 seconds ℹ Getting page 3✔ Got page 3. Found 2 videos. [1.8s]
## ℹ Getting page 4⏲ waiting 6.3 seconds ℹ Getting page 4✔ Got page 4. Found 12 videos. [7.4s]
## ℹ Getting page 5⏲ waiting 1.4 seconds ℹ Getting page 5✔ Got page 5. Found 12 videos. [2.6s]
## ℹ Getting page 6⏲ waiting 4.4 seconds ℹ Getting page 6✔ Got page 6. Found 12 videos. [5.4s]
## ℹ Getting page 7⏲ waiting 1.6 seconds ℹ Getting page 7✔ Got page 7. Found 11 videos. [2.7s]
## ℹ Getting page 8⏲ waiting 0.4 seconds ℹ Getting page 8✔ Got page 8. Found 12 videos. [1.4s]
## ℹ Getting page 9⏲ waiting 0.7 seconds ℹ Getting page 9✔ Got page 9. Found 11 videos. [2s]
## ℹ Getting page 10✔ Got page 10. Found 11 videos. [2s]
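Each search takes time and the results can differ between runs, so it can help to save the result once and reuse it. The following is a minimal sketch of that idea using saveRDS() and readRDS(). The helper cached() is hypothetical, and fake_search() is a stand-in for the real tt_search_hidden() call so that the example runs offline.

```r
# Hypothetical helper: run `fetch()` once and reuse the saved result afterwards.
cached <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch()
  saveRDS(result, path)
  result
}

# Stand-in for the real search so the sketch runs offline. In the tutorial
# this would be: function() tt_search_hidden("#mentalhealth", max_pages = 10)
calls <- 0
fake_search <- function() {
  calls <<- calls + 1
  data.frame(video_id = c("1", "2"), video_playcount = c(10L, 20L))
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached(cache_file, fake_search)  # runs the search and saves it
second <- cached(cache_file, fake_search)  # reads the saved copy from disk
```

Deleting the cache file forces a fresh search the next time the code runs.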
Now you can display the data frame you’ve created. It includes metadata such as each video’s unique ID, upload timestamp, caption, and number of views, likes, comments, and shares, along with other related details.
videos_all
## # A tibble: 107 × 20
## video_id video_timestamp video_url video_length video_title
## <chr> <dttm> <glue> <int> <chr>
## 1 7361432231664946475 2024-04-24 13:57:21 https://www… 78 "What you …
## 2 7369283535779204394 2024-05-15 17:44:09 https://www… 26 "You make …
## 3 7336314465840876842 2024-02-16 21:27:26 https://www… 77 "Check on …
## 4 6989349036272798982 2021-07-26 21:23:56 https://www… 58 "#stitch w…
## 5 7146756532196396334 2022-09-24 01:45:38 https://www… 42 "You are n…
## 6 7247875528772848923 2023-06-23 13:38:55 https://www… 44 "Mental he…
## 7 7335112292155051306 2024-02-13 15:42:47 https://www… 5 "Mental he…
## 8 7241323192969645354 2023-06-05 21:52:30 https://www… 23 "#duet wit…
## 9 6866581785229184261 2020-08-30 01:24:27 https://www… 12 "Love all …
## 10 7341414883096825130 2024-03-01 15:19:25 https://www… 5 "Mental he…
## # ℹ 97 more rows
## # ℹ 15 more variables: video_diggcount <int>, video_sharecount <int>,
## # video_commentcount <int>, video_playcount <int>, video_is_ad <lgl>,
## # author_name <chr>, author_nickname <chr>, author_followercount <int>,
## # author_followingcount <int>, author_heartcount <int>,
## # author_videocount <int>, author_diggcount <int>, music <list>,
## # challenges <list>, download_url <chr>
tt_search_hidden() returns the full timestamp (date and time) of each video, but we are only interested in the year. We use mutate() to create a new column in our data frame called Year, and substr() to extract the first four characters of the video_timestamp column, which contain the year the video was posted. This will allow us to analyze videos by the year they were uploaded.
videos_all <- videos_all %>%
  mutate(Year = substr(video_timestamp, 1, 4))
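substr() works here because R converts the timestamp to text before taking the first four characters. A slightly more explicit alternative is to format the timestamp with "%Y". The sketch below compares the two on made-up timestamps; the demo data frame and its column names Year_substr and Year_format are illustrative only.

```r
library(dplyr)

# Made-up timestamps standing in for the video_timestamp column.
demo <- tibble(
  video_timestamp = as.POSIXct(
    c("2024-04-24 13:57:21", "2021-07-26 21:23:56", "2023-06-23 13:38:55"),
    tz = "UTC"
  )
)

demo <- demo %>%
  mutate(
    Year_substr = substr(video_timestamp, 1, 4),  # approach used in the tutorial
    Year_format = format(video_timestamp, "%Y")   # explicit date formatting
  )

demo
```

Both columns are character vectors, so comparisons such as Year == "2023" behave the same way with either. If a numeric year is preferred, lubridate::year() is another option.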
We filter the data to include only videos from 2023 and 2024 to focus our analysis on the relevant time periods.
videos_filtered <- filter(videos_all, Year == "2023" | Year == "2024")
You can now display the filtered data frame, which contains only videos from 2023 and 2024.
videos_filtered
## # A tibble: 93 × 21
## video_id video_timestamp video_url video_length video_title
## <chr> <dttm> <glue> <int> <chr>
## 1 7361432231664946475 2024-04-24 13:57:21 https://www… 78 "What you …
## 2 7369283535779204394 2024-05-15 17:44:09 https://www… 26 "You make …
## 3 7336314465840876842 2024-02-16 21:27:26 https://www… 77 "Check on …
## 4 7247875528772848923 2023-06-23 13:38:55 https://www… 44 "Mental he…
## 5 7335112292155051306 2024-02-13 15:42:47 https://www… 5 "Mental he…
## 6 7241323192969645354 2023-06-05 21:52:30 https://www… 23 "#duet wit…
## 7 7341414883096825130 2024-03-01 15:19:25 https://www… 5 "Mental he…
## 8 7389347367180455200 2024-07-08 19:21:50 https://www… 66 "Mental He…
## 9 7392728409350081834 2024-07-17 22:01:58 https://www… 10 "#depresse…
## 10 7403335582262234399 2024-08-15 12:03:19 https://www… 27 "Felt bro …
## # ℹ 83 more rows
## # ℹ 16 more variables: video_diggcount <int>, video_sharecount <int>,
## # video_commentcount <int>, video_playcount <int>, video_is_ad <lgl>,
## # author_name <chr>, author_nickname <chr>, author_followercount <int>,
## # author_followingcount <int>, author_heartcount <int>,
## # author_videocount <int>, author_diggcount <int>, music <list>,
## # challenges <list>, download_url <chr>, Year <chr>
We use group_by() to group the data by year (2023 and 2024) and summarize() to calculate the average views and likes for each year, providing the necessary summary statistics for comparison.
avg_data <- videos_filtered %>%
  group_by(Year) %>%
  summarize(
    avg_views = mean(video_playcount, na.rm = TRUE),
    avg_likes = mean(video_diggcount, na.rm = TRUE)
  )
Now, you’ll see that we’ve created a new data frame that contains the average views and likes from these top videos from 2023 and 2024.
avg_data
## # A tibble: 2 × 3
## Year avg_views avg_likes
## <chr> <dbl> <dbl>
## 1 2023 5121652. 653352.
## 2 2024 2911695. 379741.
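View counts are often dominated by a few very popular videos, so it can be useful to report the number of videos and the median next to the mean. The sketch below shows the same group_by() and summarize() pattern with those two extra columns. The data are made up for illustration and are not from the TikTok search; n_videos and median_views are names chosen here.

```r
library(dplyr)

# Made-up view counts: one very popular 2023 video pulls that year's mean up.
demo <- tibble(
  Year = c("2023", "2023", "2023", "2024", "2024"),
  video_playcount = c(100, 200, 9000, 300, 500)
)

demo_summary <- demo %>%
  group_by(Year) %>%
  summarize(
    n_videos     = n(),                                   # videos per year
    avg_views    = mean(video_playcount, na.rm = TRUE),   # sensitive to outliers
    median_views = median(video_playcount, na.rm = TRUE)  # robust to outliers
  )

demo_summary
```

In this toy data the 2023 mean is 3100 while the median is 200, which shows how much a single video can move the average.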
We create two bar plots using ggplot2 to visually compare the average views and the average likes of TikTok videos from 2023 and 2024, with a different color for each year. Visualizing this data can help us think about the psychological implications of why the data looks the way it does.
# Bar plot for average views
ggplot(avg_data, aes(x = Year, y = avg_views, fill = Year)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Views for Top #MentalHealth Videos in 2023 vs 2024",
       x = "Year", y = "Average Views") +
  theme_minimal()
# Bar plot for average likes
ggplot(avg_data, aes(x = Year, y = avg_likes, fill = Year)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Likes for Top #MentalHealth Videos in 2023 vs 2024",
       x = "Year", y = "Average Likes") +
  theme_minimal()
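If a single chart with one color per metric is preferred, one possible approach is to reshape the averages into long format and plot the metrics side by side. The sketch below rebuilds the averages by hand from the avg_data output printed earlier so that it runs on its own; the names avg_demo, avg_long, metric, and average are chosen here for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Averages copied from the avg_data output shown earlier.
avg_demo <- tibble(
  Year = c("2023", "2024"),
  avg_views = c(5121652, 2911695),
  avg_likes = c(653352, 379741)
)

# One row per Year and metric, so that `metric` can be mapped to fill.
avg_long <- avg_demo %>%
  pivot_longer(c(avg_views, avg_likes),
               names_to = "metric", values_to = "average")

p <- ggplot(avg_long, aes(x = Year, y = average, fill = metric)) +
  geom_col(position = "dodge") +
  labs(title = "Average Views and Likes for Top #MentalHealth Videos",
       x = "Year", y = "Average", fill = "Metric") +
  theme_minimal()

p
```

Because views are roughly an order of magnitude larger than likes, the likes bars look small on a shared axis; the two separate plots avoid that problem.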
Finally, a very similar data frame can be obtained by passing specific video URLs to the tt_videos_hidden() function, which retrieves data for those TikTok videos. This is useful when we already have particular videos we want more information about.
video_data <- tt_videos_hidden("https://www.tiktok.com/@matthiasjbarker/video/7361432231664946475?q=%23mentalhealth&t=1728676930915")
## ℹ Getting 1 unique link
## ℹ Getting video 7361432231664946475✔ Got video 7361432231664946475 (1/1). File size: 6.3 Mb. [1.5s]
You can now see the data about your chosen video in a data frame that looks very similar to our original one.
video_data
## # A tibble: 1 × 25
## video_id video_url video_timestamp video_length video_title
## <chr> <chr> <dttm> <int> <chr>
## 1 7361432231664946475 https://www.… 2024-04-24 13:57:21 78 "What you …
## # ℹ 20 more variables: video_locationcreated <chr>, video_diggcount <int>,
## # video_sharecount <int>, video_commentcount <int>, video_playcount <int>,
## # author_id <chr>, author_secuid <chr>, author_username <chr>,
## # author_nickname <chr>, author_bio <chr>, download_url <chr>,
## # html_status <int>, music <list>, challenges <list>, is_secret <lgl>,
## # is_for_friend <lgl>, is_slides <lgl>, video_status <chr>,
## # video_status_code <int>, video_fn <chr>
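URLs copied from the browser often carry search parameters after a question mark, as the one above does. The call above works with the full URL, so cleaning is optional, but it makes lists of URLs easier to read and compare. The sketch below strips the query string and pulls out the video ID with base R; strip_query() and video_id_from_url() are hypothetical helpers, not traktok functions.

```r
# Hypothetical helpers (not part of traktok).

# Remove everything from the first "?" onwards.
strip_query <- function(url) {
  sub("\\?.*$", "", url)
}

# Extract the digits that follow "/video/"; NA if the URL has no such part.
video_id_from_url <- function(url) {
  pattern <- "^.*/video/([0-9]+).*$"
  ifelse(grepl(pattern, url), sub(pattern, "\\1", url), NA_character_)
}

url <- "https://www.tiktok.com/@matthiasjbarker/video/7361432231664946475?q=%23mentalhealth&t=1728676930915"

clean_url <- strip_query(url)
video_id  <- video_id_from_url(url)

print(clean_url)
print(video_id)
```

The ID is kept as a character string, matching the video_id column in the output above; converting it to a number would lose precision because the value has 19 digits.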