It will be a long article, so I added a table of contents. Fancy, right?
- Crawl an entire website with Rcrawler
- So how to extract metadata while crawling?
- Explore Crawled Data with rpivottable
- Extract more data without having to recrawl
- Categorize URLs using Regex
- What if I want to follow robots.txt rules?
- What if I want to limit crawling speed?
- What if I want to crawl only a subfolder?
- How to change user-agent?
- What if my IP is banned?
- Where are the internal Links?
- Count Links
- Compute "Internal Page Rank"
- What if a website is using a JavaScript framework like React or Angular?
- So what's the catch?
This tutorial relies on a package called Rcrawler by Salim Khalil (Khalil & Fakir, 2017). It's a very handy crawler with some nice native functionality.
Once R is installed and RStudio is launched, we'll install and load our package, same as always:
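If it's not installed yet, a standard install from CRAN does the trick:

```r
# Install Rcrawler from CRAN (only needed once), then load it
install.packages("Rcrawler")
library(Rcrawler)
```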
Crawl an entire website with Rcrawler
To launch a simple website analysis, you only need this line of code:
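Something along these lines, with example.com standing in for the site you want to analyze:

```r
# Crawl an entire website (placeholder domain)
Rcrawler(Website = "https://www.example.com/")
```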
It will crawl the entire website and provide you with the data.

Once the crawl is done, you'll have access to:
The INDEX variable
It's a data frame. If you don't know what a data frame is, think of it as an Excel file. Please note that it will be overwritten on every crawl, so export it if you want to keep it!
To take a look at it, just run:
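For instance, straight from the console:

```r
# Open the INDEX data frame in the viewer
View(INDEX)
```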

Most of the columns are self-explanatory. Usually, the most interesting ones are "Http Resp" and "Level".
The Level is what SEOs call "crawl depth" or "page depth". With it, you can easily check how far from the homepage some webpages are.
Quick example with the BrightonSEO website: let's do a quick ggplot and we'll be able to see the distribution of pages by level.
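A minimal sketch of that plot, assuming ggplot2 is installed and using the Level column straight from INDEX:

```r
library(ggplot2)

# Distribution of crawled pages by crawl depth (Level)
ggplot(INDEX, aes(x = Level)) +
  geom_bar() +
  labs(x = "Crawl depth (Level)", y = "Number of pages")
```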
HTML Files
By default, the Rcrawler function also stores HTML files in your working directory. You can change the location by running the setwd() function.
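For example, with a hypothetical folder:

```r
getwd()            # check where HTML files will currently be stored
setwd("~/crawls")  # hypothetical path; set it before launching the crawl
```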

Let's go deeper into the options by answering the most common questions.
So how to extract metadata while crawling?
It's possible to extract any element from webpages using a CSS or XPath selector. We'll have to use two new parameters:
- PatternsNames to name the extracted elements
- ExtractXpathPat or ExtractCSSPat to set up where to grab them in the web page
Let's take an example:
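A sketch of what that call could look like, using the page title and h1 as example targets on a placeholder domain:

```r
# Crawl and scrape the <title> and <h1> of every page
Rcrawler(Website = "https://www.example.com/",
         ExtractXpathPat = c("//title", "//h1"),
         PatternsNames = c("title", "h1"))
```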
You can access the scraped data in two ways:
- option 1 = DATA: it's an environment variable that you can access directly from the console. A small warning: it's a list, which is a little less easy to read.

If you want to convert it to a data frame, which is easier to deal with, here's the code:
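One common way to do it, assuming DATA has the list-per-page structure produced by the crawl above:

```r
# Flatten the DATA list into a data frame, one row per crawled page
NEWDATA <- data.frame(do.call("rbind", DATA))
```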
- option 2 = extracted_data.csv
It's a CSV file that has been saved inside your working directory along with the HTML files.
It might be useful to merge the INDEX and NEWDATA files; here's the code:
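A simple approach, assuming both objects have one row per crawled page, in the same order (FULLDATA is just a name I made up):

```r
# Bind the crawl index and the scraped data side by side
FULLDATA <- cbind(INDEX, NEWDATA)
```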
As an example, let's try to collect the webpage type using the scraped body class.
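Something like this; the //body/@class XPath is the key part, the rest is the same call as before:

```r
# Scrape the <body> class attribute of every page (placeholder domain)
Rcrawler(Website = "https://www.example.com/",
         ExtractXpathPat = c("//body/@class"),
         PatternsNames = c("body_class"))
```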

Let's extract the first word and feed it into a new column.
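A minimal sketch, assuming the scraped column ended up named body_class as above (adjust to whatever your NEWDATA actually contains):

```r
library(stringr)

# Keep only the first word of the body class as the page type
NEWDATA$pagetype <- word(as.character(NEWDATA$body_class), 1)
```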
A little bit of cleaning to make the labels easier to read:
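For example, stripping a theme-specific prefix; the patterns are purely illustrative, since the actual class names depend on the site's CMS and theme:

```r
# Tidy the labels: drop a hypothetical "page-template-" prefix, fill blanks
NEWDATA$pagetype <- gsub("^page-template-", "", NEWDATA$pagetype)
NEWDATA$pagetype[NEWDATA$pagetype == ""] <- "unknown"
```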

And then a quick ggplot:
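Along the lines of (pagetype being the column created above):

```r
library(ggplot2)

# Number of crawled pages per page type
ggplot(NEWDATA, aes(x = pagetype)) +
  geom_bar() +
  labs(x = "Page type", y = "Number of pages")
```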
Want to see something even cooler?
This is a static HTML file that can be stored anywhere, even on my shared hosting.
Explore Crawled Data with rpivottable
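Assuming the rpivotTable package is installed, the call is a one-liner:

```r
library(rpivotTable)

# Launch an interactive pivot table built from the crawl index
rpivotTable(INDEX)
```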
This creates a drag-and-drop pivot explorer.
It's also possible to make some quick data viz.
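For instance, by picking one of the widget's built-in renderers (Level is a column from INDEX):

```r
# Bar chart of pages per crawl depth, rendered inside the pivot widget
rpivotTable(INDEX, rows = "Level", rendererName = "Bar Chart")
```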
Extract more data without having to recrawl
All the HTML files are on your hard drive, so if you need more data extracted, it's entirely possible.
You can list your recent crawls by using the ListProjects() function:
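For example:

```r
# List the crawl projects stored in the current working directory
ListProjects()
```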

First, we're going to load the crawling project's HTML files:
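Something like this, where the project name is whatever ListProjects() returned for your crawl (the one below is made up):

```r
# Load the stored HTML files of a crawl project into a character vector
pages <- LoadHTMLFiles("example.com-010120", type = "vector")
```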
Let's say you forgot to grab the h2s and h3s; you can extract them again using the ContentScraper() function, also included in the Rcrawler package.
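A sketch of that second extraction pass, reusing the pages object loaded above (check ?ContentScraper for the exact argument names in your version):

```r
# Re-extract h2 and h3 headings from the stored HTML, no recrawl needed
headings <- ContentScraper(HTmlText = pages,
                           XpathPatterns = c("//h2", "//h3"),
                           PatternsName = c("h2", "h3"),
                           ManyPerPattern = TRUE)
```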

Categorize URLs using Regex
For those not afraid of regex, here is a bonus script to categorize URLs (sketched below). Be careful: the regex order is important, as some values can overwrite others. Usually, it's a good idea to place the home page last.
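A sketch of such a script, with made-up URL patterns you would replace with your own; since each ifelse overwrites the previous match, the broad home-page regex comes last:

```r
library(dplyr)
library(stringr)

# Categorize URLs by regex; the patterns below are placeholders
INDEX <- INDEX %>%
  mutate(Category = "other") %>%
  mutate(Category = ifelse(str_detect(Url, "/blog/"), "blog", Category)) %>%
  mutate(Category = ifelse(str_detect(Url, "/product/"), "product", Category)) %>%
  mutate(Category = ifelse(str_detect(Url, "^https?://[^/]+/?$"), "home", Category))

# Quick sanity check of the categories
table(INDEX$Category)
```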
What if I want to follow robots.txt rules?
Just add the Obeyrobots parameter:
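For example:

```r
# Respect the site's robots.txt rules while crawling
Rcrawler(Website = "https://www.example.com/", Obeyrobots = TRUE)
```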
What if I want to limit crawling speed?
By default, this crawler is rather quick and can grab a lot of webpages in no time. Every advantage has its drawback: it's fairly easy to get wrongly flagged as a DoS attack. To limit the risk, I suggest you use the RequestsDelay parameter. It's the time interval between each round of parallel HTTP requests, in seconds. Example:
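With a placeholder domain:

```r
# Wait 2 seconds between each round of parallel requests
Rcrawler(Website = "https://www.example.com/", RequestsDelay = 2)
```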
Other interesting limitation options:
no_cores: the number of clusters (logical CPUs) used for parallel crawling; by default, it's the number of available cores.
no_conn: the number of concurrent connections per core; by default, it takes the same value as no_cores.
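Put together, a deliberately gentle crawl might look like this:

```r
# A slow, low-concurrency crawl: one core, one connection, 2-second delays
Rcrawler(Website = "https://www.example.com/",
         no_cores = 1, no_conn = 1, RequestsDelay = 2)
```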
What if I want to crawl only a subfolder?
Two parameters help you do that: crawlUrlfilter will limit the crawl, and dataUrlfilter will tell the crawler which URLs data should be extracted from.
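For instance, to stay inside a hypothetical /blog/ subfolder:

```r
# Crawl and extract data only from URLs containing /blog/
Rcrawler(Website = "https://www.example.com/",
         crawlUrlfilter = "/blog/",
         dataUrlfilter = "/blog/")
```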
How to change user-agent?
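Pass the Useragent parameter (the string below is just an example):

```r
# Identify the crawler with a custom user-agent string
Rcrawler(Website = "https://www.example.com/",
         Useragent = "Mozilla/5.0 (compatible; MyRcrawlerBot)")
```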
What if my IP is banned?
Option 1: use a VPN on your computer
Option 2: use a proxy
Use the httr package to set up a proxy and use it:
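A sketch, assuming your Rcrawler version accepts a use_proxy argument; the proxy address and port are placeholders:

```r
library(httr)

# Define a proxy (placeholder address and port), then crawl through it
proxy_conf <- use_proxy(url = "190.90.100.205", port = 41000)
Rcrawler(Website = "https://www.example.com/", use_proxy = proxy_conf)
```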
Where to find a proxy? It's been a while since I last needed one, so I don't know.
Where are the internal Links?
By default, Rcrawler doesn't save internal links; you have to ask for them explicitly by using the NetworkData option, like this:
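With a placeholder domain; NetwExtLinks additionally records external links, which is what makes the Type column discussed below useful:

```r
# Collect the internal (and external) link graph while crawling
Rcrawler(Website = "https://www.example.com/",
         NetworkData = TRUE, NetwExtLinks = TRUE)
```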
Then you'll have two new variables available at the end of the crawl:
- NetwIndex, a variable that simply contains all the webpage URLs. The row numbers are the same as the locally stored HTML files, so
row n°1 = homepage = 1.html
- NetwEdges, with all the links. It's a bit confusing, so let me explain:

Each row is a link. The From and To columns indicate from which page to which page each link goes.
On the image above:
row n°1 is a link from the homepage (page n°1) to the homepage
row n°2 is a link from the homepage to webpage n°2. According to the NetwIndex variable, page n°2 is the article about rvest.
etc.
Weight is the depth level at which the link connection was discovered. All the first rows are from the homepage, so Level 0.
Type is either 1 for internal hyperlinks or 2 for external hyperlinks.
Count Links
I guess you guys are interested in counting links. Here is the code to do it. I won't go into too many explanations, as it would take too long; if you are interested (and motivated), go check out the dplyr package and specifically its data wrangling functions.
Count outbound links
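Roughly along these lines, using dplyr on the NetwEdges edge list described above:

```r
library(dplyr)

# Outbound internal links: count edges grouped by their source page
outbound <- NetwEdges %>%
  filter(Type == 1) %>%            # keep internal links only
  distinct(From, To) %>%           # ignore duplicated links
  group_by(From) %>%
  summarise(outbound_links = n()) %>%
  arrange(desc(outbound_links))
```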

To make it more readable, let's replace the page IDs with URLs.
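Since NetwIndex simply holds the URLs in page-ID order, a simple lookup does it (assuming it behaves as a plain vector):

```r
# Swap numeric page IDs for their URLs
outbound$From <- NetwIndex[outbound$From]
```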

Count inbound links
The same thing, but the other way around:
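Grouping by the To column this time:

```r
# Inbound internal links: count edges grouped by their target page
inbound <- NetwEdges %>%
  filter(Type == 1) %>%
  distinct(From, To) %>%
  group_by(To) %>%
  summarise(inbound_links = n()) %>%
  arrange(desc(inbound_links))
```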

Again, to make it more readable:
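Same lookup as before:

```r
# Swap numeric page IDs for their URLs
inbound$To <- NetwIndex[inbound$To]
```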

So the useless "author page" has 14 links pointing at it, as many as the homepage... Maybe I should fix this one day.
Compute "Internal Page Rank"
Many SEOs I spoke to seem to be very interested in this, so I might as well add the tutorial here. It is very much an adaptation of Paul Shapiro's awesome script.
But instead of using a Screaming Frog export file, we will use the previously extracted links.
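A sketch of that adaptation using the igraph package; filtering on Type == 1 to keep internal links is an assumption based on the edge list described earlier:

```r
library(dplyr)
library(igraph)

# Build a directed graph from the internal links only
edges <- NetwEdges %>%
  filter(Type == 1) %>%
  select(From, To) %>%
  distinct()

g <- graph_from_data_frame(edges)

# Compute PageRank on the internal link graph
pr <- page_rank(g)

# Map the vertex names (page IDs) back to URLs
internal_pr <- data.frame(
  page      = NetwIndex[as.numeric(names(pr$vector))],
  pagerank  = pr$vector,
  row.names = NULL
)
```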

Let's make it more readable: we're going to put the numbers on a ten-point scale, just like when PageRank was still a thing.
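For instance (column names follow the sketch above):

```r
# Rescale raw PageRank scores to a 0-10 range for readability
internal_pr$pagerank_10 <- round(internal_pr$pagerank / max(internal_pr$pagerank) * 10, 1)
```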
On a 15-page website, it's not very impressive, but I encourage you to try it on a bigger website.
What if a website is using a JavaScript framework like React or Angular?
Rcrawler handily includes PhantomJS, the classic headless browser.
Here is how to use it:
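The setup looks roughly like this:

```r
# Download PhantomJS (once) and start a headless browser session
install_browser()
br <- run_browser()
```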
After that, reference it as an option:
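For example; the exact argument name may differ between Rcrawler versions, so double-check ?Rcrawler (rendering is heavy, hence the reduced concurrency):

```r
# Crawl with pages rendered by the headless browser session started above
Rcrawler(Website = "https://www.example.com/",
         Browser = br, no_cores = 1, no_conn = 1)
```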
It's entirely possible to run two crawls, one with and one without, and compare the data afterwards.
This Browser option can also be used with the other Rcrawler functions.
⚠️ Rendering webpages means every JavaScript file will be run, including web analytics tags. If you don't take the necessary precautions, it will distort your web analytics data.
So what's the catch?
Rcrawler is a great tool, but it's far from perfect. SEOs will definitely miss a couple of things: there is no internal dead links report, it doesn't grab nofollow attributes on links, and there are always a couple of bugs here and there. But overall, it's a great tool to have.
Another concern is the Git repo, which is quite inactive.
That's it. I hope you found this article useful. Reach out to me for (slow) support, bug reports/corrections, or ideas for new articles. Take care.
Reference:
Khalil, S., & Fakir, M. (2017). RCrawler: An R package for parallel web crawling and scraping. SoftwareX, 6, 98-106.