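Welcome! You are about to learn how to use some functions of the rvest
software package.
Today we are showing you how to use rvest to access data from the online
real estate website “Redfin”.
In this tutorial, we will install and load rvest, read a Redfin neighborhood page into R, extract prices, addresses, and property details, combine them into a data frame, and then clean and filter that data.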
The rvest package is designed for web scraping in R, which allows users to easily collect data from websites and convert it into usable formats for analysis. Web scraping involves extracting specific information from the underlying HTML of a webpage, such as text, tables, or images. With rvest, you can point to parts of a webpage using CSS selectors or XPath, and pull data like prices, names, or dates directly into your R environment. This is particularly useful when data you need isn’t available in a convenient downloadable format, but is instead displayed on a website. For instance, you might use rvest to gather product prices from an e-commerce site, or property listings from a real estate site.
Redfin is a real estate platform that provides users with detailed information about homes for sale, rental properties, and market trends in various locations. The site offers listings that include key details like prices, square footage, number of bedrooms, and historical sales data, which makes it a rich resource for anyone studying the housing market. For example, a data analyst might scrape Redfin data using rvest to analyze property trends in a specific neighborhood, identify pricing patterns, or calculate average home sizes. Redfin’s real estate data can be used to gain insights into housing trends and buyer behavior, making it a valuable source for anyone interested in the real estate sector.
To use rvest, you first install it by typing install.packages("rvest") in your R console. After installing, you load the package with library(rvest). Both steps are shown below.
install.packages("rvest")
## Installing package into '/Users/benjaminsilver/Library/R/x86_64/4.2/library'
## (as 'lib' is unspecified)
##
## There is a binary version available but the source version is later:
## binary source needs_compilation
## rvest 1.0.3 1.0.4 FALSE
## installing the source package 'rvest'
After you have installed rvest, you need to load it with the library() function. This makes the tools contained in the package available to use. Every time you start a new R session, run the library() calls again to “activate” the packages you need.
library(rvest)
Because we are going to manipulate data, you should consider installing “tidyverse”. We have seen tidyverse in other examples before. It is a meta-package, which means that it bundles multiple packages: loading tidyverse “activates” all of its core packages at once, and their functions will help you later on when you organize or filter your data.
**In this demonstration we have omitted the installation part of tidyverse, but make sure you install it if you haven’t already. The code chunk to install it should look like this: install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Now we can start with some of the functions from rvest.
The first function is read_html(), which reads a web page and parses its HTML so that you can extract data from it. In this case we chose the Redfin page for the SoHo neighborhood, which listed 40 properties when we ran this code.
soho_page <- read_html("https://www.redfin.com/neighborhood/498402/NY/Manhattan/SoHo")
***Coding vocab: Variables. In this example we created a variable called soho_page that stores everything read_html() returned for the URL on the right-hand side of the arrow (<-). The line is a way of saying “put the parsed HTML of this page into a variable called soho_page”.
The following code extracts property prices from the webpage. First, the soho_page object, which contains the HTML structure of the webpage, is passed into a chain of functions using the pipe operator (%>%). The html_elements() function then searches the HTML for every element that matches the CSS selector “.bp-Homecard__Price--value”, which corresponds to the parts of the webpage where the property prices are displayed. After locating those elements, html_text() extracts the text inside them, which in this case is the price of each property. These values are stored in the property_prices variable. Finally, print(property_prices) displays the scraped prices in your R console.
property_prices <- soho_page %>%
  html_elements(".bp-Homecard__Price--value") %>%
  html_text()
print(property_prices)
## [1] "$4,975,000" "$27,000,000" "$4,150,000" "$4,995,000" "$4,490,000"
## [6] "$2,250,000" "$3,500,000" "$2,650,000" "$2,250,000" "$4,200,000"
## [11] "$1,350,000" "$2,650,000" "$3,125,000" "$2,000,000" "$675,000"
## [16] "$4,595,000" "$1,795,000" "$2,175,000" "$2,850,000" "$2,137,250"
## [21] "$1,174,200" "$1,313,250" "$4,480,500" "$4,295,000" "$7,850,000"
## [26] "$2,299,000" "$15,995,000" "$3,000,000" "$925,000" "$5,500,000"
## [31] "$5,499,000" "$5,595,000" "$29,995,000" "$10,500,000" "$2,300,000"
## [36] "$4,200,000" "$5,650,000" "$4,200,000" "$1,375,000" "$6,750,000"
***Coding vocab: (%>%) “The pipe”. Think of the pipe as a way of saying “take this result, and then apply this function to it, and then apply this other function”. It passes the result on its left to the function on its right as that function’s first argument, which keeps your code cleaner than nesting one function call inside another.
The next chunk uses the CSS selector “.bp-Homecard__Address” to locate the HTML elements that contain the addresses of the properties. Once the elements are found, html_text() extracts the address text from them. The extracted addresses are then stored in the property_address variable, which now holds one address per property on the page.
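To try the selector-and-pipe pattern without contacting the live site, here is a minimal sketch that runs on a tiny hand-written page. The markup and the class name “card__price--value” are made up for illustration and are not Redfin’s; minimal_html() is an rvest helper that builds a small HTML document from a string.

```r
library(rvest)

# A tiny hand-written page standing in for a real listing page (hypothetical markup).
demo_page <- minimal_html('
  <div class="card"><span class="card__price--value">$1,000</span></div>
  <div class="card"><span class="card__price--value">$2,500</span></div>
')

# Piped form: read left to right, one step per line.
piped <- demo_page %>%
  html_elements(".card__price--value") %>%
  html_text()

# Nested form: the same two calls, read from the inside out.
nested <- html_text(html_elements(demo_page, ".card__price--value"))

# Both forms give the same character vector.
print(identical(piped, nested))
print(piped)
```

The two forms do the same work; the piped version simply lists the steps in the order they happen.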
property_address <- soho_page %>%
  html_elements(".bp-Homecard__Address") %>%
  html_text()
print(property_address)
## [1] "83 Thompson St Unit 3W, New York, NY 10012"
## [2] "145 6th Ave, New York, NY 10013"
## [3] "311 W Broadway Unit 3A, New York, NY 10013"
## [4] "83 Thompson St Unit 3E, New York, NY 10012"
## [5] "113 Prince St Unit 6E, New York, NY 10012"
## [6] "11 Charlton St Unit 1A, New York, NY 10014"
## [7] "255 Hudson St Unit TH3, New York, NY 10013"
## [8] "77 Charlton St Unit S11C, New York, NY 10014"
## [9] "17 Thompson St #5, New York, NY 10013"
## [10] "46 Mercer St Unit 4W, New York, NY 10013"
## [11] "2 Charlton St Unit 5G, New York, NY 10013"
## [12] "118 Wooster St Ph -6C, New York, NY 10012"
## [13] "121 Mercer St Unit 4W, New York, NY 10012"
## [14] "196 6th Ave Unit 4/5B, New York, NY 10013"
## [15] "185 W Houston St Unit 6K, New York, NY 10014"
## [16] "219 Hudson St Unit PHB, New York, NY 10013"
## [17] "196 6th Ave Unit 5A, New York, NY 10013"
## [18] "170 Mercer St Unit 5E, New York, NY 10012"
## [19] "451 W Broadway Unit 4S, New York, NY 10012"
## [20] "110 Charlton St Unit 3D, New York, NY 10014"
## [21] "110 Charlton St Unit 8E, New York, NY 10014"
## [22] "110 Charlton St Unit 15G, New York, NY 10014"
## [23] "110 Charlton St Unit 19D, New York, NY 10014"
## [24] "255 Hudson St Unit TH1, New York, NY 10013"
## [25] "40 Mercer St #26, New York, NY 10013"
## [26] "90 Prince St Unit 2S, New York, NY 10012"
## [27] "459 W Broadway Unit PHSOUTH, New York, NY 10012"
## [28] "255 Hudson St Unit PHB, New York, NY 10013"
## [29] "131 Thompson St Unit 4C, New York, NY 10012"
## [30] "115 Mercer St Unit 4A, New York, NY 10012"
## [31] "554 Broome St, New York, NY 10013"
## [32] "10 Sullivan St Unit 2B, New York, NY 10012"
## [33] "62 Wooster St, New York, NY 10013"
## [34] "111 Wooster St Unit PHBC, New York, NY 10012"
## [35] "477 Broome St #44, New York, NY 10013"
## [36] "515 Broadway Unit 4FL, New York, NY 10012"
## [37] "470 Broome St Unit 4S, New York, NY 10012"
## [38] "515 Broadway #4, New York, NY 10012"
## [39] "124 Thompson St #24, New York, NY 10012"
## [40] "100 Greene St, New York, NY 10012"
If you are having trouble locating where the information is in the HTML code, consider using the html_children() function. It returns the elements directly contained in a parent element. In this case we examine what is contained in the parent elements selected by “.bp-Homecard__Stats”: after html_elements() selects those parents, html_children() shows what is inside each one. The output shows that each parent holds three child elements, with the classes bp-Homecard__Stats--beds, bp-Homecard__Stats--baths, and bp-Homecard__Stats--sqft. So to get the number of bedrooms, bathrooms, or square feet, we need selectors of the form “.bp-Homecard__Stats--sqft”.
soho_page %>%
  html_elements(".bp-Homecard__Stats") %>%
  html_children()
## {xml_nodeset (120)}
## [1] <span class="bp-Homecard__Stats--beds text-nowrap">2 beds</span>
## [2] <span class="bp-Homecard__Stats--baths text-nowrap">2.5 baths</span>
## [3] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
## [4] <span class="bp-Homecard__Stats--beds text-nowrap">7 beds</span>
## [5] <span class="bp-Homecard__Stats--baths text-nowrap">9 baths</span>
## [6] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
## [7] <span class="bp-Homecard__Stats--beds text-nowrap">3 beds</span>
## [8] <span class="bp-Homecard__Stats--baths text-nowrap">3.5 baths</span>
## [9] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
## [10] <span class="bp-Homecard__Stats--beds text-nowrap">2 beds</span>
## [11] <span class="bp-Homecard__Stats--baths text-nowrap">2.5 baths</span>
## [12] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
## [13] <span class="bp-Homecard__Stats--beds text-nowrap">2 beds</span>
## [14] <span class="bp-Homecard__Stats--baths text-nowrap">2 baths</span>
## [15] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
## [16] <span class="bp-Homecard__Stats--beds text-nowrap">2 beds</span>
## [17] <span class="bp-Homecard__Stats--baths text-nowrap">2 baths</span>
## [18] <span class="bp-Homecard__Stats--sqft text-nowrap"><span class="bp-Homec ...
## [19] <span class="bp-Homecard__Stats--beds text-nowrap">2 beds</span>
## [20] <span class="bp-Homecard__Stats--baths text-nowrap">2.5 baths</span>
## ...
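The same inspection step can be sketched on a small hand-written snippet, which makes it easier to see what html_children() returns. The markup and class names below are hypothetical stand-ins; html_attr() is an rvest function that reads an attribute (here, class) from each element.

```r
library(rvest)

# Hypothetical markup: one parent element with three children.
demo_card <- minimal_html('
  <div class="stats">
    <span class="stats--beds">2 beds</span>
    <span class="stats--baths">1 bath</span>
    <span class="stats--sqft">900 sq ft</span>
  </div>
')

# Select the parent, then list the elements directly inside it.
children <- demo_card %>%
  html_elements(".stats") %>%
  html_children()

# Reading the class of each child tells us which selectors to use next.
child_classes <- html_attr(children, "class")
print(child_classes)
```

Putting a dot in front of any of the printed class names turns it into a CSS selector that html_elements() accepts.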
The html_elements() function searches for the HTML elements that match the CSS selector “.bp-Homecard__Stats--sqft”, which corresponds to the part of the webpage displaying the square footage for each property. The html_text() function then extracts the square footage as text from those elements. The results are stored in the property_sqft variable, and print(property_sqft) outputs the square footage values to the console.
property_sqft <- soho_page %>%
  html_elements(".bp-Homecard__Stats--sqft") %>%
  html_text()
print(property_sqft)
## [1] "1,782 sq ft" "6,300 sq ft" "2,195 sq ft" "1,736 sq ft"
## [5] "— sq ft" "— sq ft" "1,836 sq ft" "1,100 sq ft"
## [9] "— sq ft" "2,210 sq ft" "172,836 sq ft" "1,300 sq ft"
## [13] "1,804 sq ft" "1,254 sq ft" "— sq ft" "2,013 sq ft"
## [17] "1,371 sq ft" "1,225 sq ft" "2,100 sq ft" "1,209 sq ft"
## [21] "499 sq ft" "525 sq ft" "1,578 sq ft" "2,553 sq ft"
## [25] "2,370 sq ft" "1,400 sq ft" "4,315 sq ft" "1,421 sq ft"
## [29] "— sq ft" "2,170 sq ft" "2,300 sq ft" "2,195 sq ft"
## [33] "6,900 sq ft" "3,530 sq ft" "1,278 sq ft" "4,500 sq ft"
## [37] "2,000 sq ft" "4,500 sq ft" "— sq ft" "2,400 sq ft"
The next two chunks look for HTML elements that match the CSS selectors “.bp-Homecard__Stats--baths” and “.bp-Homecard__Stats--beds”, which correspond to the parts of the webpage where the number of bathrooms and bedrooms is displayed for each property. The html_text() function then pulls the bathroom and bedroom data as text from the selected elements. The results are stored in the property_baths and property_beds variables, and print(property_baths) and print(property_beds) output the number of bathrooms and bedrooms for each property.
property_baths <- soho_page %>%
  html_elements(".bp-Homecard__Stats--baths") %>%
  html_text()
print(property_baths)
## [1] "2.5 baths" "9 baths" "3.5 baths" "2.5 baths" "2 baths" "2 baths"
## [7] "2.5 baths" "2 baths" "2 baths" "3 baths" "1 bath" "1.5 baths"
## [13] "2.5 baths" "1.5 baths" "1 bath" "3.5 baths" "2 baths" "1 bath"
## [19] "1 bath" "1.5 baths" "1 bath" "1 bath" "2.5 baths" "3.5 baths"
## [25] "4 baths" "2 baths" "4.5 baths" "2.5 baths" "1 bath" "2.5 baths"
## [31] "3 baths" "3.5 baths" "6.5 baths" "4 baths" "1 bath" "2 baths"
## [37] "2.5 baths" "2 baths" "2 baths" "2.5 baths"
property_beds <- soho_page %>%
  html_elements(".bp-Homecard__Stats--beds") %>%
  html_text()
print(property_beds)
## [1] "2 beds" "7 beds" "3 beds" "2 beds" "2 beds" "2 beds" "2 beds" "2 beds"
## [9] "2 beds" "3 beds" "2 beds" "2 beds" "2 beds" "1 bed" "0 beds" "3 beds"
## [17] "2 beds" "1 bed" "2 beds" "0 beds" "0 beds" "1 bed" "3 beds" "3 beds"
## [25] "3 beds" "2 beds" "4 beds" "2 beds" "2 beds" "2 beds" "3 beds" "3 beds"
## [33] "6 beds" "5 beds" "2 beds" "3 beds" "2 beds" "2 beds" "2 beds" "2 beds"
The next chunk creates a data frame called property_data, which combines the pieces of property information scraped so far: prices, addresses, square footage, bathrooms, and bedrooms. Each column is filled from one of the variables created earlier (property_prices, property_address, property_sqft, property_baths, and property_beds). The stringsAsFactors = FALSE argument ensures that character data (like addresses) is not converted into categorical variables (factors); this has been the default since R 4.0.0, so here it simply makes the intent explicit. Finally, print(property_data) outputs the complete data frame, showing all the property details together in a structured table that can be used for further analysis in R.
property_data <- data.frame(
  Price = property_prices,
  Address = property_address,
  SquareFootage = property_sqft,
  Bathrooms = property_baths,
  Bedrooms = property_beds,
  stringsAsFactors = FALSE
)
print(property_data)
## Price Address SquareFootage
## 1 $4,975,000 83 Thompson St Unit 3W, New York, NY 10012 1,782 sq ft
## 2 $27,000,000 145 6th Ave, New York, NY 10013 6,300 sq ft
## 3 $4,150,000 311 W Broadway Unit 3A, New York, NY 10013 2,195 sq ft
## 4 $4,995,000 83 Thompson St Unit 3E, New York, NY 10012 1,736 sq ft
## 5 $4,490,000 113 Prince St Unit 6E, New York, NY 10012 — sq ft
## 6 $2,250,000 11 Charlton St Unit 1A, New York, NY 10014 — sq ft
## 7 $3,500,000 255 Hudson St Unit TH3, New York, NY 10013 1,836 sq ft
## 8 $2,650,000 77 Charlton St Unit S11C, New York, NY 10014 1,100 sq ft
## 9 $2,250,000 17 Thompson St #5, New York, NY 10013 — sq ft
## 10 $4,200,000 46 Mercer St Unit 4W, New York, NY 10013 2,210 sq ft
## 11 $1,350,000 2 Charlton St Unit 5G, New York, NY 10013 172,836 sq ft
## 12 $2,650,000 118 Wooster St Ph -6C, New York, NY 10012 1,300 sq ft
## 13 $3,125,000 121 Mercer St Unit 4W, New York, NY 10012 1,804 sq ft
## 14 $2,000,000 196 6th Ave Unit 4/5B, New York, NY 10013 1,254 sq ft
## 15 $675,000 185 W Houston St Unit 6K, New York, NY 10014 — sq ft
## 16 $4,595,000 219 Hudson St Unit PHB, New York, NY 10013 2,013 sq ft
## 17 $1,795,000 196 6th Ave Unit 5A, New York, NY 10013 1,371 sq ft
## 18 $2,175,000 170 Mercer St Unit 5E, New York, NY 10012 1,225 sq ft
## 19 $2,850,000 451 W Broadway Unit 4S, New York, NY 10012 2,100 sq ft
## 20 $2,137,250 110 Charlton St Unit 3D, New York, NY 10014 1,209 sq ft
## 21 $1,174,200 110 Charlton St Unit 8E, New York, NY 10014 499 sq ft
## 22 $1,313,250 110 Charlton St Unit 15G, New York, NY 10014 525 sq ft
## 23 $4,480,500 110 Charlton St Unit 19D, New York, NY 10014 1,578 sq ft
## 24 $4,295,000 255 Hudson St Unit TH1, New York, NY 10013 2,553 sq ft
## 25 $7,850,000 40 Mercer St #26, New York, NY 10013 2,370 sq ft
## 26 $2,299,000 90 Prince St Unit 2S, New York, NY 10012 1,400 sq ft
## 27 $15,995,000 459 W Broadway Unit PHSOUTH, New York, NY 10012 4,315 sq ft
## 28 $3,000,000 255 Hudson St Unit PHB, New York, NY 10013 1,421 sq ft
## 29 $925,000 131 Thompson St Unit 4C, New York, NY 10012 — sq ft
## 30 $5,500,000 115 Mercer St Unit 4A, New York, NY 10012 2,170 sq ft
## 31 $5,499,000 554 Broome St, New York, NY 10013 2,300 sq ft
## 32 $5,595,000 10 Sullivan St Unit 2B, New York, NY 10012 2,195 sq ft
## 33 $29,995,000 62 Wooster St, New York, NY 10013 6,900 sq ft
## 34 $10,500,000 111 Wooster St Unit PHBC, New York, NY 10012 3,530 sq ft
## 35 $2,300,000 477 Broome St #44, New York, NY 10013 1,278 sq ft
## 36 $4,200,000 515 Broadway Unit 4FL, New York, NY 10012 4,500 sq ft
## 37 $5,650,000 470 Broome St Unit 4S, New York, NY 10012 2,000 sq ft
## 38 $4,200,000 515 Broadway #4, New York, NY 10012 4,500 sq ft
## 39 $1,375,000 124 Thompson St #24, New York, NY 10012 — sq ft
## 40 $6,750,000 100 Greene St, New York, NY 10012 2,400 sq ft
## Bathrooms Bedrooms
## 1 2.5 baths 2 beds
## 2 9 baths 7 beds
## 3 3.5 baths 3 beds
## 4 2.5 baths 2 beds
## 5 2 baths 2 beds
## 6 2 baths 2 beds
## 7 2.5 baths 2 beds
## 8 2 baths 2 beds
## 9 2 baths 2 beds
## 10 3 baths 3 beds
## 11 1 bath 2 beds
## 12 1.5 baths 2 beds
## 13 2.5 baths 2 beds
## 14 1.5 baths 1 bed
## 15 1 bath 0 beds
## 16 3.5 baths 3 beds
## 17 2 baths 2 beds
## 18 1 bath 1 bed
## 19 1 bath 2 beds
## 20 1.5 baths 0 beds
## 21 1 bath 0 beds
## 22 1 bath 1 bed
## 23 2.5 baths 3 beds
## 24 3.5 baths 3 beds
## 25 4 baths 3 beds
## 26 2 baths 2 beds
## 27 4.5 baths 4 beds
## 28 2.5 baths 2 beds
## 29 1 bath 2 beds
## 30 2.5 baths 2 beds
## 31 3 baths 3 beds
## 32 3.5 baths 3 beds
## 33 6.5 baths 6 beds
## 34 4 baths 5 beds
## 35 1 bath 2 beds
## 36 2 baths 3 beds
## 37 2.5 baths 2 beds
## 38 2 baths 2 beds
## 39 2 baths 2 beds
## 40 2.5 baths 2 beds
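Building the data frame this way works because every vector has the same length (40 here). If a listing lacked one of the fields entirely, the vectors could end up with different lengths or fall out of step with each other. One way to guard against that, sketched below on hypothetical markup, is to select each listing card first and then use html_element() (singular) inside each card, which returns one result per card and NA where nothing matches. The class names “card”, “price”, and “sqft” are invented for this sketch.

```r
library(rvest)

# Hypothetical markup: the second card has no square footage element.
demo_page <- minimal_html('
  <div class="card">
    <span class="price">$1,000</span><span class="sqft">900 sq ft</span>
  </div>
  <div class="card">
    <span class="price">$2,500</span>
  </div>
')

# Step 1: one element per listing.
cards <- html_elements(demo_page, ".card")

# Step 2: html_element() looks inside each card and keeps one slot per card,
# so the columns stay aligned row by row.
demo_data <- data.frame(
  Price = cards %>% html_element(".price") %>% html_text(),
  SquareFootage = cards %>% html_element(".sqft") %>% html_text(),
  stringsAsFactors = FALSE
)
print(demo_data)
```

Whether this is needed on the real page depends on how the site renders listings with missing fields; in the output above, Redfin shows a placeholder instead of omitting the element, so the simpler approach lines up.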
The next step uses the parse_number() function from the “readr” package. readr is part of the core tidyverse, so library(tidyverse) has already “activated” it; calling library(readr) again is harmless and is only required if you did not load tidyverse.
library(readr)
After making the data frame you may notice a problem: every column is of type “character”, meaning it is read as text. Since we want to filter by price, we need that data to be read as numbers. The parse_number() function drops the surrounding text and symbols (such as “$”, commas, or “sq ft”) and keeps the number, so you can use it to rebuild the data frame with numeric columns. We applied this function to the four variables that needed this adjustment: prices, square footage, bathrooms, and bedrooms. The warning shown after the code appears because some listings display a dash instead of a number for square footage; parse_number() cannot read those entries and stores them as NA.
property_data <- data.frame(
  Price = parse_number(property_prices),
  Address = property_address,
  SquareFootage = parse_number(property_sqft),
  Bathrooms = parse_number(property_baths),
  Bedrooms = parse_number(property_beds),
  stringsAsFactors = FALSE
)
## Warning: 6 parsing failures.
## row col expected actual
## 5 -- a number — sq ft
## 6 -- a number — sq ft
## 9 -- a number — sq ft
## 15 -- a number — sq ft
## 29 -- a number — sq ft
## ... ... ........ .......
## See problems(...) for more details.
print(property_data)
## Price Address SquareFootage
## 1 4975000 83 Thompson St Unit 3W, New York, NY 10012 1782
## 2 27000000 145 6th Ave, New York, NY 10013 6300
## 3 4150000 311 W Broadway Unit 3A, New York, NY 10013 2195
## 4 4995000 83 Thompson St Unit 3E, New York, NY 10012 1736
## 5 4490000 113 Prince St Unit 6E, New York, NY 10012 NA
## 6 2250000 11 Charlton St Unit 1A, New York, NY 10014 NA
## 7 3500000 255 Hudson St Unit TH3, New York, NY 10013 1836
## 8 2650000 77 Charlton St Unit S11C, New York, NY 10014 1100
## 9 2250000 17 Thompson St #5, New York, NY 10013 NA
## 10 4200000 46 Mercer St Unit 4W, New York, NY 10013 2210
## 11 1350000 2 Charlton St Unit 5G, New York, NY 10013 172836
## 12 2650000 118 Wooster St Ph -6C, New York, NY 10012 1300
## 13 3125000 121 Mercer St Unit 4W, New York, NY 10012 1804
## 14 2000000 196 6th Ave Unit 4/5B, New York, NY 10013 1254
## 15 675000 185 W Houston St Unit 6K, New York, NY 10014 NA
## 16 4595000 219 Hudson St Unit PHB, New York, NY 10013 2013
## 17 1795000 196 6th Ave Unit 5A, New York, NY 10013 1371
## 18 2175000 170 Mercer St Unit 5E, New York, NY 10012 1225
## 19 2850000 451 W Broadway Unit 4S, New York, NY 10012 2100
## 20 2137250 110 Charlton St Unit 3D, New York, NY 10014 1209
## 21 1174200 110 Charlton St Unit 8E, New York, NY 10014 499
## 22 1313250 110 Charlton St Unit 15G, New York, NY 10014 525
## 23 4480500 110 Charlton St Unit 19D, New York, NY 10014 1578
## 24 4295000 255 Hudson St Unit TH1, New York, NY 10013 2553
## 25 7850000 40 Mercer St #26, New York, NY 10013 2370
## 26 2299000 90 Prince St Unit 2S, New York, NY 10012 1400
## 27 15995000 459 W Broadway Unit PHSOUTH, New York, NY 10012 4315
## 28 3000000 255 Hudson St Unit PHB, New York, NY 10013 1421
## 29 925000 131 Thompson St Unit 4C, New York, NY 10012 NA
## 30 5500000 115 Mercer St Unit 4A, New York, NY 10012 2170
## 31 5499000 554 Broome St, New York, NY 10013 2300
## 32 5595000 10 Sullivan St Unit 2B, New York, NY 10012 2195
## 33 29995000 62 Wooster St, New York, NY 10013 6900
## 34 10500000 111 Wooster St Unit PHBC, New York, NY 10012 3530
## 35 2300000 477 Broome St #44, New York, NY 10013 1278
## 36 4200000 515 Broadway Unit 4FL, New York, NY 10012 4500
## 37 5650000 470 Broome St Unit 4S, New York, NY 10012 2000
## 38 4200000 515 Broadway #4, New York, NY 10012 4500
## 39 1375000 124 Thompson St #24, New York, NY 10012 NA
## 40 6750000 100 Greene St, New York, NY 10012 2400
## Bathrooms Bedrooms
## 1 2.5 2
## 2 9.0 7
## 3 3.5 3
## 4 2.5 2
## 5 2.0 2
## 6 2.0 2
## 7 2.5 2
## 8 2.0 2
## 9 2.0 2
## 10 3.0 3
## 11 1.0 2
## 12 1.5 2
## 13 2.5 2
## 14 1.5 1
## 15 1.0 0
## 16 3.5 3
## 17 2.0 2
## 18 1.0 1
## 19 1.0 2
## 20 1.5 0
## 21 1.0 0
## 22 1.0 1
## 23 2.5 3
## 24 3.5 3
## 25 4.0 3
## 26 2.0 2
## 27 4.5 4
## 28 2.5 2
## 29 1.0 2
## 30 2.5 2
## 31 3.0 3
## 32 3.5 3
## 33 6.5 6
## 34 4.0 5
## 35 1.0 2
## 36 2.0 3
## 37 2.5 2
## 38 2.0 2
## 39 2.0 2
## 40 2.5 2
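If you would rather mark the unreadable square footage entries as missing yourself before parsing, one option is sketched below. It assumes that any entry with no digits at all should be treated as missing; the three sample strings mimic values from the output above, with "\u2014" being the escape code for the dash character Redfin displays.

```r
library(readr)

# Sample values in the same format as the scraped square footage.
raw_sqft <- c("1,782 sq ft", "\u2014 sq ft", "499 sq ft")

# Entries without any digit cannot hold a number, so replace them with NA
# up front; parse_number() then has nothing to fail on.
has_digits <- grepl("[0-9]", raw_sqft)
clean_sqft <- ifelse(has_digits, raw_sqft, NA_character_)

parsed_sqft <- parse_number(clean_sqft)
print(parsed_sqft)
```

The result is the same as before, numbers where a value exists and NA elsewhere, but the handling of missing entries is now explicit in the code.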
***Potential errors: If some pieces of your data are missing, you might need to consider removing those rows.
To remove the rows where no data is available, you can use the filter() function. Here is a breakdown of how to use it. First write the name of the new variable that will hold the filtered result (cleaned_property_data). Then write the source you want to filter (property_data), followed by the pipe. After that, add the filter itself: filter(!is.na(SquareFootage)). is.na() is TRUE for missing values and ! negates it, so this line keeps only the rows whose SquareFootage is not NA. Finally, print the result with print(cleaned_property_data).
cleaned_property_data <- property_data %>%
  filter(!is.na(SquareFootage))
print(cleaned_property_data)
## Price Address SquareFootage
## 1 4975000 83 Thompson St Unit 3W, New York, NY 10012 1782
## 2 27000000 145 6th Ave, New York, NY 10013 6300
## 3 4150000 311 W Broadway Unit 3A, New York, NY 10013 2195
## 4 4995000 83 Thompson St Unit 3E, New York, NY 10012 1736
## 5 3500000 255 Hudson St Unit TH3, New York, NY 10013 1836
## 6 2650000 77 Charlton St Unit S11C, New York, NY 10014 1100
## 7 4200000 46 Mercer St Unit 4W, New York, NY 10013 2210
## 8 1350000 2 Charlton St Unit 5G, New York, NY 10013 172836
## 9 2650000 118 Wooster St Ph -6C, New York, NY 10012 1300
## 10 3125000 121 Mercer St Unit 4W, New York, NY 10012 1804
## 11 2000000 196 6th Ave Unit 4/5B, New York, NY 10013 1254
## 12 4595000 219 Hudson St Unit PHB, New York, NY 10013 2013
## 13 1795000 196 6th Ave Unit 5A, New York, NY 10013 1371
## 14 2175000 170 Mercer St Unit 5E, New York, NY 10012 1225
## 15 2850000 451 W Broadway Unit 4S, New York, NY 10012 2100
## 16 2137250 110 Charlton St Unit 3D, New York, NY 10014 1209
## 17 1174200 110 Charlton St Unit 8E, New York, NY 10014 499
## 18 1313250 110 Charlton St Unit 15G, New York, NY 10014 525
## 19 4480500 110 Charlton St Unit 19D, New York, NY 10014 1578
## 20 4295000 255 Hudson St Unit TH1, New York, NY 10013 2553
## 21 7850000 40 Mercer St #26, New York, NY 10013 2370
## 22 2299000 90 Prince St Unit 2S, New York, NY 10012 1400
## 23 15995000 459 W Broadway Unit PHSOUTH, New York, NY 10012 4315
## 24 3000000 255 Hudson St Unit PHB, New York, NY 10013 1421
## 25 5500000 115 Mercer St Unit 4A, New York, NY 10012 2170
## 26 5499000 554 Broome St, New York, NY 10013 2300
## 27 5595000 10 Sullivan St Unit 2B, New York, NY 10012 2195
## 28 29995000 62 Wooster St, New York, NY 10013 6900
## 29 10500000 111 Wooster St Unit PHBC, New York, NY 10012 3530
## 30 2300000 477 Broome St #44, New York, NY 10013 1278
## 31 4200000 515 Broadway Unit 4FL, New York, NY 10012 4500
## 32 5650000 470 Broome St Unit 4S, New York, NY 10012 2000
## 33 4200000 515 Broadway #4, New York, NY 10012 4500
## 34 6750000 100 Greene St, New York, NY 10012 2400
## Bathrooms Bedrooms
## 1 2.5 2
## 2 9.0 7
## 3 3.5 3
## 4 2.5 2
## 5 2.5 2
## 6 2.0 2
## 7 3.0 3
## 8 1.0 2
## 9 1.5 2
## 10 2.5 2
## 11 1.5 1
## 12 3.5 3
## 13 2.0 2
## 14 1.0 1
## 15 1.0 2
## 16 1.5 0
## 17 1.0 0
## 18 1.0 1
## 19 2.5 3
## 20 3.5 3
## 21 4.0 3
## 22 2.0 2
## 23 4.5 4
## 24 2.5 2
## 25 2.5 2
## 26 3.0 3
## 27 3.5 3
## 28 6.5 6
## 29 4.0 5
## 30 1.0 2
## 31 2.0 3
## 32 2.5 2
## 33 2.0 2
## 34 2.5 2
After you have cleaned your data, you are ready to filter for something more specific. In this case we wanted all the properties priced below 5 million dollars. We used the same logic of variables and inputs to create a new filtered_property_data variable out of cleaned_property_data. Inside filter() you name the column (Price) and the condition (< 5000000). Finally, you print it!
filtered_property_data <- cleaned_property_data %>%
  filter(Price < 5000000)
print(filtered_property_data)
## Price Address SquareFootage Bathrooms
## 1 4975000 83 Thompson St Unit 3W, New York, NY 10012 1782 2.5
## 2 4150000 311 W Broadway Unit 3A, New York, NY 10013 2195 3.5
## 3 4995000 83 Thompson St Unit 3E, New York, NY 10012 1736 2.5
## 4 3500000 255 Hudson St Unit TH3, New York, NY 10013 1836 2.5
## 5 2650000 77 Charlton St Unit S11C, New York, NY 10014 1100 2.0
## 6 4200000 46 Mercer St Unit 4W, New York, NY 10013 2210 3.0
## 7 1350000 2 Charlton St Unit 5G, New York, NY 10013 172836 1.0
## 8 2650000 118 Wooster St Ph -6C, New York, NY 10012 1300 1.5
## 9 3125000 121 Mercer St Unit 4W, New York, NY 10012 1804 2.5
## 10 2000000 196 6th Ave Unit 4/5B, New York, NY 10013 1254 1.5
## 11 4595000 219 Hudson St Unit PHB, New York, NY 10013 2013 3.5
## 12 1795000 196 6th Ave Unit 5A, New York, NY 10013 1371 2.0
## 13 2175000 170 Mercer St Unit 5E, New York, NY 10012 1225 1.0
## 14 2850000 451 W Broadway Unit 4S, New York, NY 10012 2100 1.0
## 15 2137250 110 Charlton St Unit 3D, New York, NY 10014 1209 1.5
## 16 1174200 110 Charlton St Unit 8E, New York, NY 10014 499 1.0
## 17 1313250 110 Charlton St Unit 15G, New York, NY 10014 525 1.0
## 18 4480500 110 Charlton St Unit 19D, New York, NY 10014 1578 2.5
## 19 4295000 255 Hudson St Unit TH1, New York, NY 10013 2553 3.5
## 20 2299000 90 Prince St Unit 2S, New York, NY 10012 1400 2.0
## 21 3000000 255 Hudson St Unit PHB, New York, NY 10013 1421 2.5
## 22 2300000 477 Broome St #44, New York, NY 10013 1278 1.0
## 23 4200000 515 Broadway Unit 4FL, New York, NY 10012 4500 2.0
## 24 4200000 515 Broadway #4, New York, NY 10012 4500 2.0
## Bedrooms
## 1 2
## 2 3
## 3 2
## 4 2
## 5 2
## 6 3
## 7 2
## 8 2
## 9 2
## 10 1
## 11 3
## 12 2
## 13 1
## 14 2
## 15 0
## 16 0
## 17 1
## 18 3
## 19 3
## 20 2
## 21 2
## 22 2
## 23 3
## 24 2
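Once the columns are numeric, the same dplyr tools can be chained to start summarizing the listings. The sketch below is only an illustration: it rebuilds three rows from the output above by hand so that it runs on its own, and the column name PricePerSqFt and the choice of summary statistics are ours, not part of the original analysis.

```r
library(dplyr)

# Three rows copied by hand from the tables above (price, square footage, bedrooms).
demo_data <- data.frame(
  Price = c(4975000, 4150000, 1174200),
  SquareFootage = c(1782, 2195, 499),
  Bedrooms = c(2, 3, 0)
)

demo_summary <- demo_data %>%
  # Add a derived column: dollars per square foot.
  mutate(PricePerSqFt = Price / SquareFootage) %>%
  # Keep listings with at least two bedrooms.
  filter(Bedrooms >= 2) %>%
  # Collapse the remaining rows into one row of summary statistics.
  summarise(
    Listings = n(),
    MedianPrice = median(Price),
    MeanPricePerSqFt = mean(PricePerSqFt)
  )
print(demo_summary)
```

On the full data set you would start the chain from filtered_property_data or cleaned_property_data instead of the hand-built demo_data.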
Socio-economic factors, like the cost of living and housing affordability, play a significant role in stress, anxiety, and overall mental well-being. By gathering data from Redfin, particularly with a neighborhood focus on Manhattan, we can take a sample of listings and examine housing trends, such as rising rents or the increasing costs of homeownership, and consider how financial strain and housing costs affect people’s mental health.
Researchers can also use the tools we used to scrape this data to explore psychology at a community level. Neighborhood changes, such as gentrification or housing shortages, affect community cohesion, social support systems, and personal space, all of which are critical to well-being.