Welcome! You are about to learn how to use some of the functions in the Rvest software package.
Today we will show you how to use Rvest to collect data from the online real estate website “Redfin”.

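In this tutorial, we will install and load Rvest, read a Redfin neighborhood page into R, extract prices, addresses, square footage, bathrooms, and bedrooms, combine them into a data frame, and then clean and filter that data.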

What is Rvest?

Rvest is a package in R designed for web scraping, which allows users to easily collect data from websites and convert it into usable formats for analysis. Web scraping involves extracting specific information from the underlying HTML of a webpage, such as text, tables, or images. With Rvest, you can point to parts of a webpage using CSS selectors or XPath, and pull data like prices, names, or dates directly into your R environment. This is particularly useful when data you need isn’t available in a convenient downloadable format, but is instead displayed on a website. For instance, you might use Rvest to gather product prices from an e-commerce site, or property listings from a real estate site.

What is Redfin?

Redfin is a real estate platform that provides users with detailed information about homes for sale, rental properties, and market trends in various locations. The site offers listings that include key details like prices, square footage, number of bedrooms, and historical sales data, which makes it a rich resource for anyone studying the housing market. For example, a data analyst might scrape Redfin data using Rvest to analyze property trends in a specific neighborhood, identify pricing patterns, or calculate average home sizes. Redfin’s real estate data can be used to gain insights into housing trends and buyer behavior, making it a valuable source for anyone interested in the real estate sector.

Installation of the software

The Rvest package in R helps you collect data from websites directly in your R code. To use it, you first install it by typing install.packages("rvest") in your R console. After installing, you load the package with library(rvest). This package allows you to extract specific information from web pages, like text, images, or tables, by reading the webpage’s HTML code. It’s useful when you need data that is only available on a website and not in a downloadable file format.

install.packages("rvest")
## Installing package into '/Users/benjaminsilver/Library/R/x86_64/4.2/library'
## (as 'lib' is unspecified)
## 
##   There is a binary version available but the source version is later:
##       binary source needs_compilation
## rvest  1.0.3  1.0.4             FALSE
## installing the source package 'rvest'
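If you rerun this tutorial often, you may not want to reinstall packages every time. Here is a minimal sketch of one way to check first. The helper name missing_packages is our own invention for illustration and is not part of Rvest; it only reports which packages are not installed yet, so you can decide what to pass to install.packages().

```r
# Hypothetical helper (not part of rvest): returns the names in `pkgs`
# that are not installed in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("rvest", "tidyverse"))

# Tell the user what still needs install.packages(); stays silent if
# everything is already installed.
if (length(to_install) > 0) {
  message("Run install.packages() for: ", paste(to_install, collapse = ", "))
}
```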

“Activating” the functions of Rvest

After you have installed Rvest you need to use the “library” function. This function will enable you to use the tools that are contained in the software package. Make sure that every time you are starting a new R session you run all the code that “activates” the tools of your software package.

library(rvest)

“Activating” the functions of Tidyverse

Because we are going to manipulate data, you should also load “tidyverse”. We have seen tidyverse in other examples before. It is a meta-package, which means that it bundles several software packages (such as dplyr, readr, and tidyr), and loading it with library(tidyverse) “activates” all of its core packages at once. These packages include functions that will help you later on when you organize or filter your data.

**In this demonstration we have omitted the installation of tidyverse, but make sure you install it if you haven’t already. The code to install it looks like this: install.packages("tidyverse")

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Extracting the data yay!

Now we can start with some of the functions from Rvest.

Where are we getting the data from? –> read_html

The first function is read_html, which downloads a webpage and reads its HTML into R so that you can extract data from it. In this case we have decided to go with the Redfin page for SoHo, which listed 40 properties available for sale or rent when we ran this code.

soho_page <- read_html("https://www.redfin.com/neighborhood/498402/NY/Manhattan/SoHo")

***Coding vocab: Variables In this example we created a variable called soho_page that stores everything read_html got from the link on the right. The assignment arrow (<-) is a way of saying “put the data from this HTML page into a variable called soho_page”.
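Websites change often, and some may refuse automated requests, so the link above might not always give you the same page. If you just want to practice, read_html also accepts a string of HTML instead of a link. The sketch below uses a made-up one-line page (the content and the class name “price” are invented for illustration) and shows what kind of object read_html gives back.

```r
library(rvest)

# read_html() treats a string that contains HTML tags as the page itself,
# so no internet connection is needed for this practice example.
practice_page <- read_html(
  '<html><body><p class="price">$1,000,000</p></body></html>'
)

# The result is a parsed HTML document, the same kind of object as soho_page
class(practice_page)
```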

Extracting the data –> html_elements & html_text

The following code is used to extract property price information from a webpage. First, the soho_page object, which contains the HTML structure of the webpage, is passed into a chain of functions using the pipe operator (%>%). The html_elements() function is then used to search for specific elements in the HTML that match the CSS selector “.bp-Homecard__Price--value” (note the two hyphens before “value”). This selector corresponds to the parts of the webpage where the property prices are displayed. After locating those elements, html_text() extracts the text inside them, which in this case is the actual price values. These values are stored in the property_prices variable as a character vector. Finally, the print(property_prices) command displays the prices that were scraped from the webpage, allowing you to see the extracted data directly in your R console. This process is part of web scraping, where data from a website is collected and used for analysis in R.

property_prices <- soho_page %>% 
  html_elements(".bp-Homecard__Price--value") %>% 
  html_text()

print(property_prices)
##  [1] "$4,975,000"  "$27,000,000" "$4,150,000"  "$4,995,000"  "$4,490,000" 
##  [6] "$2,250,000"  "$3,500,000"  "$2,650,000"  "$2,250,000"  "$4,200,000" 
## [11] "$1,350,000"  "$2,650,000"  "$3,125,000"  "$2,000,000"  "$675,000"   
## [16] "$4,595,000"  "$1,795,000"  "$2,175,000"  "$2,850,000"  "$2,137,250" 
## [21] "$1,174,200"  "$1,313,250"  "$4,480,500"  "$4,295,000"  "$7,850,000" 
## [26] "$2,299,000"  "$15,995,000" "$3,000,000"  "$925,000"    "$5,500,000" 
## [31] "$5,499,000"  "$5,595,000"  "$29,995,000" "$10,500,000" "$2,300,000" 
## [36] "$4,200,000"  "$5,650,000"  "$4,200,000"  "$1,375,000"  "$6,750,000"

***Coding vocab: (%>%) “The pipe” Think of the pipe as a way of saying “take this result, and then do this function to it, and then do this other function”. It keeps your code cleaner because the result of one function is passed straight into the next one.
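To see that the pipe only changes how the code reads, here is a small sketch that writes the same extraction twice, once with nested calls and once with the pipe. The practice page is made up with Rvest’s minimal_html() function; only the class name is borrowed from this tutorial.

```r
library(rvest)

# Made-up page for practice; only the class name comes from the tutorial.
practice_page <- minimal_html('
  <div>
    <span class="bp-Homecard__Price--value">$1,000,000</span>
    <span class="bp-Homecard__Price--value">$2,500,000</span>
  </div>
')

# Nested calls: you have to read from the inside out
nested_prices <- html_text(
  html_elements(practice_page, ".bp-Homecard__Price--value")
)

# Piped calls: you read from top to bottom
piped_prices <- practice_page %>%
  html_elements(".bp-Homecard__Price--value") %>%
  html_text()

# Both versions give exactly the same result
identical(nested_prices, piped_prices)
```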

Extracting data (Address)

This code uses html_elements() with the CSS selector “.bp-Homecard__Address” to locate the specific HTML elements that contain the addresses of the properties. Once the elements are found, html_text() extracts the actual address text from those elements. The extracted addresses are then stored in the property_address variable, which now holds the property addresses from the webpage. This process allows you to gather addresses for further analysis directly in R.

property_address <- soho_page %>% 
  html_elements(".bp-Homecard__Address") %>% 
  html_text()


print(property_address)
##  [1] "83 Thompson St Unit 3W, New York, NY 10012"     
##  [2] "145 6th Ave, New York, NY 10013"                
##  [3] "311 W Broadway Unit 3A, New York, NY 10013"     
##  [4] "83 Thompson St Unit 3E, New York, NY 10012"     
##  [5] "113 Prince St Unit 6E, New York, NY 10012"      
##  [6] "11 Charlton St Unit 1A, New York, NY 10014"     
##  [7] "255 Hudson St Unit TH3, New York, NY 10013"     
##  [8] "77 Charlton St Unit S11C, New York, NY 10014"   
##  [9] "17 Thompson St #5, New York, NY 10013"          
## [10] "46 Mercer St Unit 4W, New York, NY 10013"       
## [11] "2 Charlton St Unit 5G, New York, NY 10013"      
## [12] "118 Wooster St Ph -6C, New York, NY 10012"      
## [13] "121 Mercer St Unit 4W, New York, NY 10012"      
## [14] "196 6th Ave Unit 4/5B, New York, NY 10013"      
## [15] "185 W Houston St Unit 6K, New York, NY 10014"   
## [16] "219 Hudson St Unit PHB, New York, NY 10013"     
## [17] "196 6th Ave Unit 5A, New York, NY 10013"        
## [18] "170 Mercer St Unit 5E, New York, NY 10012"      
## [19] "451 W Broadway Unit 4S, New York, NY 10012"     
## [20] "110 Charlton St Unit 3D, New York, NY 10014"    
## [21] "110 Charlton St Unit 8E, New York, NY 10014"    
## [22] "110 Charlton St Unit 15G, New York, NY 10014"   
## [23] "110 Charlton St Unit 19D, New York, NY 10014"   
## [24] "255 Hudson St Unit TH1, New York, NY 10013"     
## [25] "40 Mercer St #26, New York, NY 10013"           
## [26] "90 Prince St Unit 2S, New York, NY 10012"       
## [27] "459 W Broadway Unit PHSOUTH, New York, NY 10012"
## [28] "255 Hudson St Unit PHB, New York, NY 10013"     
## [29] "131 Thompson St Unit 4C, New York, NY 10012"    
## [30] "115 Mercer St Unit 4A, New York, NY 10012"      
## [31] "554 Broome St, New York, NY 10013"              
## [32] "10 Sullivan St Unit 2B, New York, NY 10012"     
## [33] "62 Wooster St, New York, NY 10013"              
## [34] "111 Wooster St Unit PHBC, New York, NY 10012"   
## [35] "477 Broome St #44, New York, NY 10013"          
## [36] "515 Broadway Unit 4FL, New York, NY 10012"      
## [37] "470 Broome St Unit 4S, New York, NY 10012"      
## [38] "515 Broadway #4, New York, NY 10012"            
## [39] "124 Thompson St #24, New York, NY 10012"        
## [40] "100 Greene St, New York, NY 10012"

Children?

If you are having trouble locating where the information is in the HTML code, you should consider using the function html_children. It returns all the nodes that sit directly inside a parent element. In this case we are examining what is contained in the parent element with the class “.bp-Homecard__Stats”. After using the html_elements function we use the html_children function to see what is inside that element. The output shows that each child has its own class, so in order to get the data on the number of bedrooms, bathrooms or square feet, we need selectors of the form “.bp-Homecard__Stats--sqft”. In other words, the square footage is stored in a child of “.bp-Homecard__Stats”.

soho_page %>% 
  html_elements(".bp-Homecard__Stats") %>% 
  html_children()
## {xml_nodeset (120)}
##  [1] <span class="bp-Homecard__Stats--beds text-nowrap">2 beds</span>
##  [2] <span class="bp-Homecard__Stats--baths text-nowrap">2.5 baths</span>
##  [3] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
##  [4] <span class="bp-Homecard__Stats--beds text-nowrap">7 beds</span>
##  [5] <span class="bp-Homecard__Stats--baths text-nowrap">9 baths</span>
##  [6] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
##  [7] <span class="bp-Homecard__Stats--beds text-nowrap">3 beds</span>
##  [8] <span class="bp-Homecard__Stats--baths text-nowrap">3.5 baths</span>
##  [9] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
## [10] <span class="bp-Homecard__Stats--beds text-nowrap">2 beds</span>
## [11] <span class="bp-Homecard__Stats--baths text-nowrap">2.5 baths</span>
## [12] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
## [13] <span class="bp-Homecard__Stats--beds text-nowrap">2 beds</span>
## [14] <span class="bp-Homecard__Stats--baths text-nowrap">2 baths</span>
## [15] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
## [16] <span class="bp-Homecard__Stats--beds text-nowrap">2 beds</span>
## [17] <span class="bp-Homecard__Stats--baths text-nowrap">2 baths</span>
## [18] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
## [19] <span class="bp-Homecard__Stats--beds text-nowrap">2 beds</span>
## [20] <span class="bp-Homecard__Stats--baths text-nowrap">2.5 baths</span>
## ...
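When the printed nodes get cut off, it can be easier to list only the class names. The sketch below does that with html_attr() on a made-up listing card whose class names mirror the output shown above; the content itself is invented for practice.

```r
library(rvest)

# Made-up listing card; the class names mirror the tutorial's output.
practice_page <- minimal_html('
  <div class="bp-Homecard__Stats">
    <span class="bp-Homecard__Stats--beds">2 beds</span>
    <span class="bp-Homecard__Stats--baths">2.5 baths</span>
    <span class="bp-Homecard__Stats--sqft">1,782 sq ft</span>
  </div>
')

stat_children <- practice_page %>%
  html_elements(".bp-Homecard__Stats") %>%
  html_children()

# The class of each child is what you put after the dot in a CSS selector
child_classes <- html_attr(stat_children, "class")
print(child_classes)

# The text of each child is the data you are looking for
child_text <- html_text(stat_children)
print(child_text)
```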

Extracting data (SquareFoot)

The html_elements() function is used to search for the HTML elements that match the CSS selector “.bp-Homecard__Stats--sqft”, which corresponds to the part of the webpage displaying the square footage information for each property. The html_text() function then extracts the square footage as text from those elements. The results are stored in the property_sqft variable, and the print(property_sqft) command outputs the square footage values to the console. This allows you to view and use the square footage data directly in R.

property_sqft <- soho_page %>% 
  html_elements(".bp-Homecard__Stats--sqft") %>% 
  html_text()

print(property_sqft)
##  [1] "1,782 sq ft"   "6,300 sq ft"   "2,195 sq ft"   "1,736 sq ft"  
##  [5] "— sq ft"       "— sq ft"       "1,836 sq ft"   "1,100 sq ft"  
##  [9] "— sq ft"       "2,210 sq ft"   "172,836 sq ft" "1,300 sq ft"  
## [13] "1,804 sq ft"   "1,254 sq ft"   "— sq ft"       "2,013 sq ft"  
## [17] "1,371 sq ft"   "1,225 sq ft"   "2,100 sq ft"   "1,209 sq ft"  
## [21] "499 sq ft"     "525 sq ft"     "1,578 sq ft"   "2,553 sq ft"  
## [25] "2,370 sq ft"   "1,400 sq ft"   "4,315 sq ft"   "1,421 sq ft"  
## [29] "— sq ft"       "2,170 sq ft"   "2,300 sq ft"   "2,195 sq ft"  
## [33] "6,900 sq ft"   "3,530 sq ft"   "1,278 sq ft"   "4,500 sq ft"  
## [37] "2,000 sq ft"   "4,500 sq ft"   "— sq ft"       "2,400 sq ft"

Extracting data (Bathrooms and Bedrooms)

The code looks for HTML elements that match the CSS selectors “.bp-Homecard__Stats--baths” and “.bp-Homecard__Stats--beds”, which correspond to the parts of the webpage where the number of bathrooms and bedrooms is displayed for each property. The html_text() function then pulls the bathroom and bedroom data as text from those selected elements. The results are stored in the property_baths variable and the property_beds variable, and when you call print(property_baths) and print(property_beds), R outputs the number of bathrooms and bedrooms for each property, allowing you to view and work with this data.

property_baths <- soho_page %>% 
  html_elements(".bp-Homecard__Stats--baths") %>% 
  html_text()

print(property_baths)
##  [1] "2.5 baths" "9 baths"   "3.5 baths" "2.5 baths" "2 baths"   "2 baths"  
##  [7] "2.5 baths" "2 baths"   "2 baths"   "3 baths"   "1 bath"    "1.5 baths"
## [13] "2.5 baths" "1.5 baths" "1 bath"    "3.5 baths" "2 baths"   "1 bath"   
## [19] "1 bath"    "1.5 baths" "1 bath"    "1 bath"    "2.5 baths" "3.5 baths"
## [25] "4 baths"   "2 baths"   "4.5 baths" "2.5 baths" "1 bath"    "2.5 baths"
## [31] "3 baths"   "3.5 baths" "6.5 baths" "4 baths"   "1 bath"    "2 baths"  
## [37] "2.5 baths" "2 baths"   "2 baths"   "2.5 baths"
property_beds <- soho_page %>% 
  html_elements(".bp-Homecard__Stats--beds") %>% 
  html_text()

print(property_beds)
##  [1] "2 beds" "7 beds" "3 beds" "2 beds" "2 beds" "2 beds" "2 beds" "2 beds"
##  [9] "2 beds" "3 beds" "2 beds" "2 beds" "2 beds" "1 bed"  "0 beds" "3 beds"
## [17] "2 beds" "1 bed"  "2 beds" "0 beds" "0 beds" "1 bed"  "3 beds" "3 beds"
## [25] "3 beds" "2 beds" "4 beds" "2 beds" "2 beds" "2 beds" "3 beds" "3 beds"
## [33] "6 beds" "5 beds" "2 beds" "3 beds" "2 beds" "2 beds" "2 beds" "2 beds"

Data frame –> how do you organize all the data? –> data.frame()

This code creates a data frame called property_data, which combines several pieces of property information: prices, addresses, square footage, bathrooms, and bedrooms. Each column is filled using the previously scraped data stored in the variables property_prices, property_address, property_sqft, property_baths, and property_beds. For this to work, all five variables must have the same length (here, 40 values each). The stringsAsFactors = FALSE argument ensures that character data (like addresses) is not automatically converted into categorical variables (factors); since R 4.0 this is already the default, but writing it out does no harm. Finally, the print(property_data) command outputs the complete data frame, showing all the property details together in a structured table. This data frame can now be used for further analysis or manipulation in R.

property_data <- data.frame(
  Price = property_prices,
  Address = property_address,
  SquareFootage = property_sqft,
  Bathrooms = property_baths,
  Bedrooms = property_beds,
  stringsAsFactors = FALSE
)

print(property_data)
##          Price                                         Address SquareFootage
## 1   $4,975,000      83 Thompson St Unit 3W, New York, NY 10012   1,782 sq ft
## 2  $27,000,000                 145 6th Ave, New York, NY 10013   6,300 sq ft
## 3   $4,150,000      311 W Broadway Unit 3A, New York, NY 10013   2,195 sq ft
## 4   $4,995,000      83 Thompson St Unit 3E, New York, NY 10012   1,736 sq ft
## 5   $4,490,000       113 Prince St Unit 6E, New York, NY 10012       — sq ft
## 6   $2,250,000      11 Charlton St Unit 1A, New York, NY 10014       — sq ft
## 7   $3,500,000      255 Hudson St Unit TH3, New York, NY 10013   1,836 sq ft
## 8   $2,650,000    77 Charlton St Unit S11C, New York, NY 10014   1,100 sq ft
## 9   $2,250,000           17 Thompson St #5, New York, NY 10013       — sq ft
## 10  $4,200,000        46 Mercer St Unit 4W, New York, NY 10013   2,210 sq ft
## 11  $1,350,000       2 Charlton St Unit 5G, New York, NY 10013 172,836 sq ft
## 12  $2,650,000       118 Wooster St Ph -6C, New York, NY 10012   1,300 sq ft
## 13  $3,125,000       121 Mercer St Unit 4W, New York, NY 10012   1,804 sq ft
## 14  $2,000,000       196 6th Ave Unit 4/5B, New York, NY 10013   1,254 sq ft
## 15    $675,000    185 W Houston St Unit 6K, New York, NY 10014       — sq ft
## 16  $4,595,000      219 Hudson St Unit PHB, New York, NY 10013   2,013 sq ft
## 17  $1,795,000         196 6th Ave Unit 5A, New York, NY 10013   1,371 sq ft
## 18  $2,175,000       170 Mercer St Unit 5E, New York, NY 10012   1,225 sq ft
## 19  $2,850,000      451 W Broadway Unit 4S, New York, NY 10012   2,100 sq ft
## 20  $2,137,250     110 Charlton St Unit 3D, New York, NY 10014   1,209 sq ft
## 21  $1,174,200     110 Charlton St Unit 8E, New York, NY 10014     499 sq ft
## 22  $1,313,250    110 Charlton St Unit 15G, New York, NY 10014     525 sq ft
## 23  $4,480,500    110 Charlton St Unit 19D, New York, NY 10014   1,578 sq ft
## 24  $4,295,000      255 Hudson St Unit TH1, New York, NY 10013   2,553 sq ft
## 25  $7,850,000            40 Mercer St #26, New York, NY 10013   2,370 sq ft
## 26  $2,299,000        90 Prince St Unit 2S, New York, NY 10012   1,400 sq ft
## 27 $15,995,000 459 W Broadway Unit PHSOUTH, New York, NY 10012   4,315 sq ft
## 28  $3,000,000      255 Hudson St Unit PHB, New York, NY 10013   1,421 sq ft
## 29    $925,000     131 Thompson St Unit 4C, New York, NY 10012       — sq ft
## 30  $5,500,000       115 Mercer St Unit 4A, New York, NY 10012   2,170 sq ft
## 31  $5,499,000               554 Broome St, New York, NY 10013   2,300 sq ft
## 32  $5,595,000      10 Sullivan St Unit 2B, New York, NY 10012   2,195 sq ft
## 33 $29,995,000               62 Wooster St, New York, NY 10013   6,900 sq ft
## 34 $10,500,000    111 Wooster St Unit PHBC, New York, NY 10012   3,530 sq ft
## 35  $2,300,000           477 Broome St #44, New York, NY 10013   1,278 sq ft
## 36  $4,200,000       515 Broadway Unit 4FL, New York, NY 10012   4,500 sq ft
## 37  $5,650,000       470 Broome St Unit 4S, New York, NY 10012   2,000 sq ft
## 38  $4,200,000             515 Broadway #4, New York, NY 10012   4,500 sq ft
## 39  $1,375,000         124 Thompson St #24, New York, NY 10012       — sq ft
## 40  $6,750,000               100 Greene St, New York, NY 10012   2,400 sq ft
##    Bathrooms Bedrooms
## 1  2.5 baths   2 beds
## 2    9 baths   7 beds
## 3  3.5 baths   3 beds
## 4  2.5 baths   2 beds
## 5    2 baths   2 beds
## 6    2 baths   2 beds
## 7  2.5 baths   2 beds
## 8    2 baths   2 beds
## 9    2 baths   2 beds
## 10   3 baths   3 beds
## 11    1 bath   2 beds
## 12 1.5 baths   2 beds
## 13 2.5 baths   2 beds
## 14 1.5 baths    1 bed
## 15    1 bath   0 beds
## 16 3.5 baths   3 beds
## 17   2 baths   2 beds
## 18    1 bath    1 bed
## 19    1 bath   2 beds
## 20 1.5 baths   0 beds
## 21    1 bath   0 beds
## 22    1 bath    1 bed
## 23 2.5 baths   3 beds
## 24 3.5 baths   3 beds
## 25   4 baths   3 beds
## 26   2 baths   2 beds
## 27 4.5 baths   4 beds
## 28 2.5 baths   2 beds
## 29    1 bath   2 beds
## 30 2.5 baths   2 beds
## 31   3 baths   3 beds
## 32 3.5 baths   3 beds
## 33 6.5 baths   6 beds
## 34   4 baths   5 beds
## 35    1 bath   2 beds
## 36   2 baths   3 beds
## 37 2.5 baths   2 beds
## 38   2 baths   2 beds
## 39   2 baths   2 beds
## 40 2.5 baths   2 beds
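The data frame above lines up correctly only because every listing had a value for every field. If one listing had no square footage element at all, property_sqft would be shorter than the other variables and the rows would no longer match. One way around this, sketched below, is to select each listing card first and then use html_element() (singular), which returns exactly one result per card and NA when the field is missing. The page and the class names “card”, “price”, and “sqft” are made up for this example; we have not checked what the card container is called on Redfin.

```r
library(rvest)

# Made-up page: the second listing has no square footage element.
practice_page <- minimal_html('
  <div class="card">
    <span class="price">$1,000,000</span>
    <span class="sqft">900 sq ft</span>
  </div>
  <div class="card">
    <span class="price">$2,500,000</span>
  </div>
')

# Step 1: one node per listing
cards <- html_elements(practice_page, ".card")

# Step 2: html_element() looks inside each card and keeps one slot per card
practice_data <- data.frame(
  Price = cards %>% html_element(".price") %>% html_text(),
  SquareFootage = cards %>% html_element(".sqft") %>% html_text(),
  stringsAsFactors = FALSE
)

print(practice_data)
```

The second row keeps its price and gets NA for the square footage, so the columns stay aligned.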

“Activating” functions again

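Similar to what you did at the start, you want to make sure that you have “activated” the functions from “readr”, which will enable you to work with your data in more depth. Readr is one of the core tidyverse packages, so library(tidyverse) has already loaded it; running library(readr) again is harmless and makes it clear where the next function comes from.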

library(readr)

What if my data is not numerical? … Don’t panic, you can still analyze it

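After making the data frame you could encounter a problem: your data might be of “character” type, meaning it is read as text. Since we want to create a filter for the price, we need that data to be read as numbers. You can use the parse_number function to rebuild the data frame with the same information, but now with numbers instead of characters. We applied this function to the four variables that needed this adjustment (price, square footage, bathrooms, and bedrooms). The warning about parsing failures comes from the listings whose square footage is shown as a dash instead of a number: parse_number cannot find a number there, so it stores NA (missing) for those rows.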

property_data <- data.frame(
  Price = parse_number(property_prices),       
  Address = property_address,                
  SquareFootage = parse_number(property_sqft),  
  Bathrooms = parse_number(property_baths),     
  Bedrooms = parse_number(property_beds),      
  stringsAsFactors = FALSE
)
## Warning: 6 parsing failures.
## row col expected  actual
##   5  -- a number — sq ft
##   6  -- a number — sq ft
##   9  -- a number — sq ft
##  15  -- a number — sq ft
##  29  -- a number — sq ft
## ... ... ........ .......
## See problems(...) for more details.
print(property_data)
##       Price                                         Address SquareFootage
## 1   4975000      83 Thompson St Unit 3W, New York, NY 10012          1782
## 2  27000000                 145 6th Ave, New York, NY 10013          6300
## 3   4150000      311 W Broadway Unit 3A, New York, NY 10013          2195
## 4   4995000      83 Thompson St Unit 3E, New York, NY 10012          1736
## 5   4490000       113 Prince St Unit 6E, New York, NY 10012            NA
## 6   2250000      11 Charlton St Unit 1A, New York, NY 10014            NA
## 7   3500000      255 Hudson St Unit TH3, New York, NY 10013          1836
## 8   2650000    77 Charlton St Unit S11C, New York, NY 10014          1100
## 9   2250000           17 Thompson St #5, New York, NY 10013            NA
## 10  4200000        46 Mercer St Unit 4W, New York, NY 10013          2210
## 11  1350000       2 Charlton St Unit 5G, New York, NY 10013        172836
## 12  2650000       118 Wooster St Ph -6C, New York, NY 10012          1300
## 13  3125000       121 Mercer St Unit 4W, New York, NY 10012          1804
## 14  2000000       196 6th Ave Unit 4/5B, New York, NY 10013          1254
## 15   675000    185 W Houston St Unit 6K, New York, NY 10014            NA
## 16  4595000      219 Hudson St Unit PHB, New York, NY 10013          2013
## 17  1795000         196 6th Ave Unit 5A, New York, NY 10013          1371
## 18  2175000       170 Mercer St Unit 5E, New York, NY 10012          1225
## 19  2850000      451 W Broadway Unit 4S, New York, NY 10012          2100
## 20  2137250     110 Charlton St Unit 3D, New York, NY 10014          1209
## 21  1174200     110 Charlton St Unit 8E, New York, NY 10014           499
## 22  1313250    110 Charlton St Unit 15G, New York, NY 10014           525
## 23  4480500    110 Charlton St Unit 19D, New York, NY 10014          1578
## 24  4295000      255 Hudson St Unit TH1, New York, NY 10013          2553
## 25  7850000            40 Mercer St #26, New York, NY 10013          2370
## 26  2299000        90 Prince St Unit 2S, New York, NY 10012          1400
## 27 15995000 459 W Broadway Unit PHSOUTH, New York, NY 10012          4315
## 28  3000000      255 Hudson St Unit PHB, New York, NY 10013          1421
## 29   925000     131 Thompson St Unit 4C, New York, NY 10012            NA
## 30  5500000       115 Mercer St Unit 4A, New York, NY 10012          2170
## 31  5499000               554 Broome St, New York, NY 10013          2300
## 32  5595000      10 Sullivan St Unit 2B, New York, NY 10012          2195
## 33 29995000               62 Wooster St, New York, NY 10013          6900
## 34 10500000    111 Wooster St Unit PHBC, New York, NY 10012          3530
## 35  2300000           477 Broome St #44, New York, NY 10013          1278
## 36  4200000       515 Broadway Unit 4FL, New York, NY 10012          4500
## 37  5650000       470 Broome St Unit 4S, New York, NY 10012          2000
## 38  4200000             515 Broadway #4, New York, NY 10012          4500
## 39  1375000         124 Thompson St #24, New York, NY 10012            NA
## 40  6750000               100 Greene St, New York, NY 10012          2400
##    Bathrooms Bedrooms
## 1        2.5        2
## 2        9.0        7
## 3        3.5        3
## 4        2.5        2
## 5        2.0        2
## 6        2.0        2
## 7        2.5        2
## 8        2.0        2
## 9        2.0        2
## 10       3.0        3
## 11       1.0        2
## 12       1.5        2
## 13       2.5        2
## 14       1.5        1
## 15       1.0        0
## 16       3.5        3
## 17       2.0        2
## 18       1.0        1
## 19       1.0        2
## 20       1.5        0
## 21       1.0        0
## 22       1.0        1
## 23       2.5        3
## 24       3.5        3
## 25       4.0        3
## 26       2.0        2
## 27       4.5        4
## 28       2.5        2
## 29       1.0        2
## 30       2.5        2
## 31       3.0        3
## 32       3.5        3
## 33       6.5        6
## 34       4.0        5
## 35       1.0        2
## 36       2.0        3
## 37       2.5        2
## 38       2.0        2
## 39       2.0        2
## 40       2.5        2
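To get a feel for what parse_number does, you can try it on a few values by hand. The sketch below uses values copied from the scraped output in this tutorial. The second part is an optional variation, not something we ran above: the na argument of parse_number lets you declare which strings mean “missing”, and we would expect that to give NA for the dash entries without reporting them as parsing failures.

```r
library(readr)

# Values copied from the scraped output in this tutorial
raw_values <- c("$4,975,000", "1,782 sq ft", "2.5 baths", "1 bed")

# parse_number() drops the symbols, commas, and words and keeps the number
parsed_values <- parse_number(raw_values)
print(parsed_values)

# "\u2014" is the long dash the page shows when the square footage is missing
raw_sqft <- c("1,782 sq ft", "\u2014 sq ft", "499 sq ft")

# Optional: declare the dash entry as a missing value up front
clean_sqft <- parse_number(raw_sqft, na = c("", "NA", "\u2014 sq ft"))
print(clean_sqft)
```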

***Potential errors: If some pieces of your data are missing, you might need to consider removing those rows.

Cleaning data if needed:

In order to remove the rows where there is no data available, you can use the filter() function. Here is a breakdown of how to use it: First, write the name of the new variable that will hold the result of the filter (cleaned_property_data). Then, write the source you want to filter (property_data), followed by the pipe. After that, create the filter by writing filter(!is.na(SquareFootage)). The function is.na() is TRUE for missing values and the exclamation mark means “not”, so this line keeps only the rows where “SquareFootage” is not NA. Finally, print the result with print(cleaned_property_data).

cleaned_property_data <- property_data %>% 
  filter(!is.na(SquareFootage))

print(cleaned_property_data)
##       Price                                         Address SquareFootage
## 1   4975000      83 Thompson St Unit 3W, New York, NY 10012          1782
## 2  27000000                 145 6th Ave, New York, NY 10013          6300
## 3   4150000      311 W Broadway Unit 3A, New York, NY 10013          2195
## 4   4995000      83 Thompson St Unit 3E, New York, NY 10012          1736
## 5   3500000      255 Hudson St Unit TH3, New York, NY 10013          1836
## 6   2650000    77 Charlton St Unit S11C, New York, NY 10014          1100
## 7   4200000        46 Mercer St Unit 4W, New York, NY 10013          2210
## 8   1350000       2 Charlton St Unit 5G, New York, NY 10013        172836
## 9   2650000       118 Wooster St Ph -6C, New York, NY 10012          1300
## 10  3125000       121 Mercer St Unit 4W, New York, NY 10012          1804
## 11  2000000       196 6th Ave Unit 4/5B, New York, NY 10013          1254
## 12  4595000      219 Hudson St Unit PHB, New York, NY 10013          2013
## 13  1795000         196 6th Ave Unit 5A, New York, NY 10013          1371
## 14  2175000       170 Mercer St Unit 5E, New York, NY 10012          1225
## 15  2850000      451 W Broadway Unit 4S, New York, NY 10012          2100
## 16  2137250     110 Charlton St Unit 3D, New York, NY 10014          1209
## 17  1174200     110 Charlton St Unit 8E, New York, NY 10014           499
## 18  1313250    110 Charlton St Unit 15G, New York, NY 10014           525
## 19  4480500    110 Charlton St Unit 19D, New York, NY 10014          1578
## 20  4295000      255 Hudson St Unit TH1, New York, NY 10013          2553
## 21  7850000            40 Mercer St #26, New York, NY 10013          2370
## 22  2299000        90 Prince St Unit 2S, New York, NY 10012          1400
## 23 15995000 459 W Broadway Unit PHSOUTH, New York, NY 10012          4315
## 24  3000000      255 Hudson St Unit PHB, New York, NY 10013          1421
## 25  5500000       115 Mercer St Unit 4A, New York, NY 10012          2170
## 26  5499000               554 Broome St, New York, NY 10013          2300
## 27  5595000      10 Sullivan St Unit 2B, New York, NY 10012          2195
## 28 29995000               62 Wooster St, New York, NY 10013          6900
## 29 10500000    111 Wooster St Unit PHBC, New York, NY 10012          3530
## 30  2300000           477 Broome St #44, New York, NY 10013          1278
## 31  4200000       515 Broadway Unit 4FL, New York, NY 10012          4500
## 32  5650000       470 Broome St Unit 4S, New York, NY 10012          2000
## 33  4200000             515 Broadway #4, New York, NY 10012          4500
## 34  6750000               100 Greene St, New York, NY 10012          2400
##    Bathrooms Bedrooms
## 1        2.5        2
## 2        9.0        7
## 3        3.5        3
## 4        2.5        2
## 5        2.5        2
## 6        2.0        2
## 7        3.0        3
## 8        1.0        2
## 9        1.5        2
## 10       2.5        2
## 11       1.5        1
## 12       3.5        3
## 13       2.0        2
## 14       1.0        1
## 15       1.0        2
## 16       1.5        0
## 17       1.0        0
## 18       1.0        1
## 19       2.5        3
## 20       3.5        3
## 21       4.0        3
## 22       2.0        2
## 23       4.5        4
## 24       2.5        2
## 25       2.5        2
## 26       3.0        3
## 27       3.5        3
## 28       6.5        6
## 29       4.0        5
## 30       1.0        2
## 31       2.0        3
## 32       2.5        2
## 33       2.0        2
## 34       2.5        2
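Before you drop rows, it is worth counting how many you are about to lose. A minimal sketch with three rows copied from the table before cleaning (one of them has a missing square footage); the names practice_data and cleaned_practice_data are just for this example.

```r
library(dplyr)

# Three rows copied from property_data, one with a missing square footage
practice_data <- data.frame(
  Price = c(4975000, 4490000, 3500000),
  SquareFootage = c(1782, NA, 1836)
)

# How many rows are missing a square footage?
n_missing <- sum(is.na(practice_data$SquareFootage))
print(n_missing)

# Keep only the rows where SquareFootage is not missing
cleaned_practice_data <- practice_data %>%
  filter(!is.na(SquareFootage))

# The number of rows removed should match the count of missing values
nrow(practice_data) - nrow(cleaned_practice_data)
```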

Filtering your data

After you have fixed your data, you are ready to ask a more specific question with the filter. In this case we wanted to get all the properties priced below 5 million dollars. We used the same logic of variables and inputs to create a new “filtered_property_data” variable out of “cleaned_property_data”. Inside filter() you specify the column (Price) and the condition (< 5000000). Finally, you print it!

filtered_property_data <- cleaned_property_data %>%
  filter(Price < 5000000)

print(filtered_property_data)
##      Price                                      Address SquareFootage Bathrooms
## 1  4975000   83 Thompson St Unit 3W, New York, NY 10012          1782       2.5
## 2  4150000   311 W Broadway Unit 3A, New York, NY 10013          2195       3.5
## 3  4995000   83 Thompson St Unit 3E, New York, NY 10012          1736       2.5
## 4  3500000   255 Hudson St Unit TH3, New York, NY 10013          1836       2.5
## 5  2650000 77 Charlton St Unit S11C, New York, NY 10014          1100       2.0
## 6  4200000     46 Mercer St Unit 4W, New York, NY 10013          2210       3.0
## 7  1350000    2 Charlton St Unit 5G, New York, NY 10013        172836       1.0
## 8  2650000    118 Wooster St Ph -6C, New York, NY 10012          1300       1.5
## 9  3125000    121 Mercer St Unit 4W, New York, NY 10012          1804       2.5
## 10 2000000    196 6th Ave Unit 4/5B, New York, NY 10013          1254       1.5
## 11 4595000   219 Hudson St Unit PHB, New York, NY 10013          2013       3.5
## 12 1795000      196 6th Ave Unit 5A, New York, NY 10013          1371       2.0
## 13 2175000    170 Mercer St Unit 5E, New York, NY 10012          1225       1.0
## 14 2850000   451 W Broadway Unit 4S, New York, NY 10012          2100       1.0
## 15 2137250  110 Charlton St Unit 3D, New York, NY 10014          1209       1.5
## 16 1174200  110 Charlton St Unit 8E, New York, NY 10014           499       1.0
## 17 1313250 110 Charlton St Unit 15G, New York, NY 10014           525       1.0
## 18 4480500 110 Charlton St Unit 19D, New York, NY 10014          1578       2.5
## 19 4295000   255 Hudson St Unit TH1, New York, NY 10013          2553       3.5
## 20 2299000     90 Prince St Unit 2S, New York, NY 10012          1400       2.0
## 21 3000000   255 Hudson St Unit PHB, New York, NY 10013          1421       2.5
## 22 2300000        477 Broome St #44, New York, NY 10013          1278       1.0
## 23 4200000    515 Broadway Unit 4FL, New York, NY 10012          4500       2.0
## 24 4200000          515 Broadway #4, New York, NY 10012          4500       2.0
##    Bedrooms
## 1         2
## 2         3
## 3         2
## 4         2
## 5         2
## 6         3
## 7         2
## 8         2
## 9         2
## 10        1
## 11        3
## 12        2
## 13        1
## 14        2
## 15        0
## 16        0
## 17        1
## 18        3
## 19        3
## 20        2
## 21        2
## 22        2
## 23        3
## 24        2
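You can also combine several conditions in one filter and then summarise what is left. The sketch below uses four rows copied from cleaned_property_data; the variable names and the choice of conditions are only an illustration.

```r
library(dplyr)

# Four rows copied from cleaned_property_data
practice_data <- data.frame(
  Price = c(4975000, 27000000, 4150000, 1174200),
  SquareFootage = c(1782, 6300, 2195, 499),
  Bedrooms = c(2, 7, 3, 0)
)

# Conditions separated by commas must all be true for a row to be kept
under_5m_2bed <- practice_data %>%
  filter(Price < 5000000, Bedrooms >= 2)

print(under_5m_2bed)

# Summarise the rows that passed the filter
practice_summary <- under_5m_2bed %>%
  summarise(
    listings = n(),
    mean_price = mean(Price),
    mean_sqft = mean(SquareFootage)
  )

print(practice_summary)
```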

What is this all useful for?

Socio-economic factors, like the cost of living and housing affordability, play a significant role in stress, anxiety and overall mental well-being. By gathering data from Redfin, particularly with a neighborhood focus on Manhattan, we can take a sample of listings and examine housing trends, such as rising rents or the increasing costs of homeownership, and then examine how financial strain and housing impact people’s mental health.

Researchers can also use the tools we used to scrape this data to explore psychology at a community level. Neighborhood changes, such as gentrification or housing shortages, affect community cohesion, social support systems, and one’s personal space, all of which are critical to one’s well-being.