Dec 11, 2020. Learn web scraping with Python with this step-by-step tutorial. We will cover almost all of the tools Python offers to scrape the web, from Requests to BeautifulSoup, Scrapy, Selenium and more. Beautiful Soup is a pure Python library for extracting structured data from a website. It parses HTML and XML files and lets you interact with the HTML in a way similar to, and often more convenient than, inspecting a web page with browser developer tools.
Web crawling is a powerful technique to collect data from the web by finding all the URLs for one or multiple domains. Python has several popular web crawling libraries and frameworks.
Jul 15, 2020. Web scraping is an automatic way to retrieve unstructured data from websites and store it in a structured format. For example, if you want to analyse which kinds of face masks sell best in Singapore, you may want to scrape all the face mask listings on an e-commerce website like Lazada.
In this article, we will first introduce different crawling strategies and use cases. Then we will build a simple web crawler from scratch in Python using two libraries: requests and Beautiful Soup. Next, we will see why it’s better to use a web crawling framework like Scrapy. Finally, we will build an example crawler with Scrapy to collect film metadata from IMDb and see how Scrapy scales to websites with several million pages.
What is a web crawler?
Web crawling and web scraping are two different but related concepts. Web crawling is a component of web scraping: the crawler logic finds URLs to be processed by the scraper code.
A web crawler starts with a list of URLs to visit, called the seed. For each URL, the crawler finds links in the HTML, filters those links based on some criteria and adds the new links to a queue. All the HTML or some specific information is extracted to be processed by a different pipeline.
Web crawling strategies
In practice, web crawlers only visit a subset of pages depending on the crawler budget, which can be a maximum number of pages per domain, depth or execution time.
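The loop described above (seed, link extraction, filtering, queue) combined with a crawl budget can be sketched with the standard library alone. This is a minimal illustration, not the article's crawler: the names crawl, get_links, keep and the in-memory SITE dictionary are assumptions that stand in for real HTTP requests.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seed: Iterable[str],
          get_links: Callable[[str], Iterable[str]],
          keep: Callable[[str], bool],
          max_pages: int = 100,
          max_depth: int = 3) -> List[str]:
    """Visit URLs breadth-first until the queue is empty or the budget is spent."""
    queue = deque((url, 0) for url in seed)   # (url, depth) pairs
    seen = {url for url, _ in queue}          # URLs already queued or visited
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue                          # depth budget: do not expand further
        for link in get_links(url):
            if link not in seen and keep(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# A tiny in-memory "website" standing in for real HTTP requests (made up).
SITE = {
    "/": ["/a", "/b", "/login"],
    "/a": ["/a/1", "/"],
    "/b": ["/a"],
    "/a/1": [],
}

pages = crawl(["/"], lambda u: SITE.get(u, []), lambda u: u != "/login", max_pages=3)
print(pages)  # ['/', '/a', '/b']
```

With max_pages=3 the crawl stops before reaching /a/1, which is the page budget at work; an execution-time budget could be added in the same loop with a deadline check.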
Most popular websites provide a robots.txt file to indicate which areas of the website each user agent is not allowed to crawl. The counterpart of the robots file is the sitemap.xml file, which lists the pages that can be crawled.
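As a small illustration of how a crawler can consult these rules, Python's standard library ships urllib.robotparser. The robots.txt content and the user-agent names below are made up for the example; a real crawler would load the file from the target site instead.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (made up for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules from text instead of fetching them

print(parser.can_fetch("MyCrawler", "https://example.com/title/123"))  # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/x"))  # False
print(parser.can_fetch("BadBot", "https://example.com/title/123"))     # False
```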
Popular web crawler use cases include:
- Search engines (Googlebot, Bingbot, Yandex Bot…) collect all the HTML for a significant part of the Web. This data is indexed to make it searchable.
- SEO analytics tools collect the HTML plus metadata such as the response time and response status to detect broken pages, and the links between different domains to collect backlinks.
- Price monitoring tools crawl e-commerce websites to find product pages and extract metadata, notably the price. Product pages are then periodically revisited.
- Common Crawl maintains an open repository of web crawl data. For example, the archive from October 2020 contains 2.71 billion web pages.
Next, we will compare three different strategies for building a web crawler in Python. First, using only standard libraries, then third party libraries for making HTTP requests and parsing HTML and finally, a web crawling framework.
Building a simple web crawler in Python from scratch
To build a simple web crawler in Python, we need at least one library to download the HTML from a URL and an HTML parsing library to extract links. Python provides the standard libraries urllib for making HTTP requests and html.parser for parsing HTML. An example Python crawler built only with standard libraries can be found on GitHub.
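To give an idea of what the standard-library route involves, here is a minimal sketch of link extraction with html.parser. It is not the GitHub example mentioned above; the class and function names (LinkParser, extract_links) are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


# To download a real page, the standard library could be used as well:
#   from urllib.request import urlopen
#   html = urlopen("https://example.com").read().decode("utf-8", errors="replace")
html = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'
print(extract_links(html, "https://example.com/index.html"))
# ['https://example.com/about', 'https://example.org/x']
```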
The standard Python libraries for requests and HTML parsing are not very developer-friendly. Other popular libraries like requests, branded as HTTP for humans, and Beautiful Soup provide a better developer experience. You can install the two libraries locally.
A basic crawler can be built following the previous architecture diagram.
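The original listing is not reproduced here, so what follows is a reconstruction sketch under stated assumptions. It uses the method names described in this section (download_url, get_linked_urls, add_url_to_visit) and two plain lists; the logging format, the timeout and the seed URL in the usage comment are guesses.

```python
import logging
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

logging.basicConfig(format="%(asctime)s %(levelname)s:%(message)s", level=logging.INFO)


class Crawler:
    def __init__(self, urls=None):
        self.visited_urls = []                 # URLs already processed
        self.urls_to_visit = list(urls or [])  # pending URLs, used as a simple queue

    def download_url(self, url):
        return requests.get(url, timeout=10).text

    def get_linked_urls(self, url, html):
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("a"):
            path = link.get("href")
            if path:
                yield urljoin(url, path)       # resolve relative links

    def add_url_to_visit(self, url):
        if url not in self.visited_urls and url not in self.urls_to_visit:
            self.urls_to_visit.append(url)

    def crawl(self, url):
        html = self.download_url(url)
        for linked_url in self.get_linked_urls(url, html):
            self.add_url_to_visit(linked_url)

    def run(self):
        while self.urls_to_visit:
            url = self.urls_to_visit.pop(0)
            logging.info("Crawling: %s", url)
            try:
                self.crawl(url)
            except Exception:
                logging.exception("Failed to crawl: %s", url)
            finally:
                self.visited_urls.append(url)


# Example run (performs real network requests and has no page limit):
# Crawler(urls=["https://www.imdb.com/"]).run()
```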
The code above defines a Crawler class with helper methods to download_url using the requests library, get_linked_urls using the Beautiful Soup library and add_url_to_visit to filter URLs. The URLs to visit and the visited URLs are stored in two separate lists. You can run the crawler on your terminal.
The crawler logs one line for each visited URL.
The code is very simple but there are many performance and usability issues to solve before successfully crawling a complete website.
- The crawler is slow and supports no parallelism. As can be seen from the timestamps, it takes about one second to crawl each URL. Each time the crawler makes a request it waits for the request to be resolved and no work is done in between.
- The download URL logic has no retry mechanism, and the URL queue is not a real queue, so it is inefficient with a high number of URLs.
- The link extraction logic doesn’t support standardizing URLs by removing URL query string parameters, doesn’t handle URLs starting with #, doesn’t support filtering URLs by domain or filtering out requests to static files.
- The crawler doesn’t identify itself and ignores the robots.txt file.
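Some of the link-handling gaps listed above can be narrowed with the standard library. The following is a sketch under assumptions, not a complete solution: normalize_link and STATIC_EXTENSIONS are hypothetical names, and dropping every query string can merge pages that are actually different.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit, urlunsplit

STATIC_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".pdf")


def normalize_link(base_url: str, href: str, allowed_domain: str) -> Optional[str]:
    """Return a cleaned absolute URL, or None if the link should be skipped."""
    if not href or href.startswith("#"):
        return None                                  # same-page anchor
    parts = urlsplit(urljoin(base_url, href))
    if parts.scheme not in ("http", "https"):
        return None                                  # mailto:, javascript:, ...
    if parts.netloc != allowed_domain:
        return None                                  # off-site link
    if parts.path.lower().endswith(STATIC_EXTENSIONS):
        return None                                  # static file
    # Drop the query string and fragment so duplicates collapse to one URL.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(normalize_link("https://example.com/a/", "b?x=1#top", "example.com"))
# https://example.com/a/b
```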
Next, we will see how Scrapy provides all these functionalities and makes it easy to extend for your custom crawls.
Web crawling with Scrapy
Scrapy is the most popular web scraping and crawling Python framework, with 40k stars on GitHub. One of the advantages of Scrapy is that requests are scheduled and handled asynchronously. This means that Scrapy can send another request before the previous one is completed or do some other work in between. Scrapy can handle many concurrent requests but can also be configured to respect the websites with custom settings, as we’ll see later.
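To make the idea of asynchronous request handling concrete, here is a small standard-library sketch. It is only an analogy: Scrapy itself is built on Twisted rather than on this code, and fake_download, download_all and the delay values are assumptions that simulate network latency.

```python
import asyncio


async def fake_download(url: str, delay: float) -> str:
    """Stand-in for an HTTP request: waits without blocking other tasks."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def download_all(urls, max_concurrency: int = 2):
    semaphore = asyncio.Semaphore(max_concurrency)   # cap on in-flight requests

    async def bounded(url):
        async with semaphore:
            return await fake_download(url, delay=0.05)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


pages = asyncio.run(download_all(["/a", "/b", "/c", "/d"]))
print(len(pages))  # 4
```

While one simulated request is waiting, another can start, which is the property that lets an asynchronous crawler avoid idling on each response.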
Scrapy has a multi-component architecture. Normally, you will implement at least two different classes: Spider and Pipeline. Web scraping can be thought of as an ETL where you extract data from the web and load it to your own storage. Spiders extract the data and pipelines load it into the storage. Transformation can happen both in spiders and pipelines, but I recommend that you set a custom Scrapy pipeline to transform each item independently of each other. This way, failing to process an item has no effect on other items.
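A Scrapy item pipeline is a plain Python class with a process_item method, so the transformation step can be sketched without any imports. The class name, the title field and the module path in the closing comment are hypothetical.

```python
class CleanTitlePipeline:
    """Hypothetical item pipeline: normalizes one field of each scraped item."""

    def process_item(self, item, spider):
        # Scrapy calls process_item once per item yielded by a spider.
        title = item.get("title")
        if title is not None:
            item["title"] = " ".join(title.split())   # collapse whitespace
        return item                                    # hand the item to the next pipeline


# Enabled in settings.py with something like (module path is an assumption):
# ITEM_PIPELINES = {"scrapy_crawler.pipelines.CleanTitlePipeline": 300}

print(CleanTitlePipeline().process_item({"title": "  The   Matrix \n"}, spider=None))
# {'title': 'The Matrix'}
```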
On top of all that, you can add spider and downloader middlewares in between components, as can be seen in the diagram below.
Scrapy Architecture Overview [source]
If you have used Scrapy before, you know that a web scraper is defined as a class that inherits from the base Spider class and implements a parse method to handle each response. If you are new to Scrapy, you can read this article for easy scraping with Scrapy.
Scrapy also provides several generic spider classes: CrawlSpider, XMLFeedSpider, CSVFeedSpider and SitemapSpider. The CrawlSpider class inherits from the base Spider class and provides an extra rules attribute to define how to crawl a website. Each rule uses a LinkExtractor to specify which links are extracted from each page. Next, we will see how to use each one of them by building a crawler for IMDb, the Internet Movie Database.
Building an example Scrapy crawler for IMDb
Before trying to crawl IMDb, I checked IMDb's robots.txt file to see which URL paths are allowed. The robots file only disallows 26 paths for all user-agents. Scrapy reads the robots.txt file beforehand and respects it when the ROBOTSTXT_OBEY setting is set to True. This is the case for all projects generated with the Scrapy command startproject.
This command creates a new project with the default Scrapy project folder structure.
Then you can create a spider in scrapy_crawler/spiders/imdb.py with a rule to extract all links.
You can launch the crawler in the terminal.
You will get lots of logs, including one log for each request. Exploring the logs I noticed that even if we set allowed_domains to only crawl web pages under https://www.imdb.com, there were requests to external domains, such as amazon.com.
IMDb redirects from URL paths under whitelist-offsite and whitelist to external domains. There is an open Scrapy GitHub issue that shows that external URLs don’t get filtered out when the OffsiteMiddleware is applied before the RedirectMiddleware. To fix this issue, we can configure the link extractor to deny URLs matching two regular expressions.
Rule and LinkExtractor classes support several arguments to filter out URLs. For example, you can ignore specific URL extensions and reduce the number of duplicate URLs by sorting query strings. If you don’t find a specific argument for your use case, you can pass a custom function to process_links in Rule or process_value in LinkExtractor.
For example, IMDb has two different URLs with the same content.
To limit the number of crawled URLs, we can remove all query strings from URLs with the url_query_cleaner function from the w3lib library and use it in process_links.
Now that we have limited the number of requests to process, we can add a parse_item method to extract data from each page and pass it to a pipeline to store it. For example, we can either extract the whole response.text to process it in a different pipeline or select the HTML metadata. To select the HTML metadata in the head tag we can write our own XPath expressions, but I find it better to use a library, extruct, that extracts all metadata from an HTML page. You can install it with pip install extruct.
I set the follow attribute to True so that Scrapy still follows all links from each response even if we provided a custom parse method. I also configured extruct to extract only Open Graph metadata and JSON-LD, a popular method for encoding linked data using JSON on the Web, used by IMDb. You can run the crawler and store items in JSON lines format to a file.
The output file imdb.jl contains one line for each crawled item. For example, the extracted Open Graph metadata for a movie taken from the <meta> tags in the HTML looks like this.
The JSON-LD for a single item is too long to be included in the article; here is a sample of what Scrapy extracts from the <script type='application/ld+json'> tag.
Exploring the logs, I noticed another common issue with crawlers. By sequentially clicking on filters, the crawler generates URLs with the same content, only that the filters were applied in a different order.
Long filter and search URLs are a difficult problem that can be partially solved by limiting the length of URLs with a Scrapy setting, URLLENGTH_LIMIT.
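For the filter-order problem specifically, sorting the query parameters makes both orderings collapse to one URL, the same idea as the query string sorting mentioned earlier. The sketch below uses only the standard library; canonicalize and the example URLs are assumptions, not code from the crawler.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Sort query parameters so that filter order no longer matters."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


a = canonicalize("https://example.com/search?genre=drama&year=2020")
b = canonicalize("https://example.com/search?year=2020&genre=drama")
print(a == b)  # True
```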
I used IMDb as an example to show the basics of building a web crawler in Python. I didn’t let the crawler run for long as I didn’t have a specific use case for the data. In case you need specific data from IMDb, you can check the IMDb Datasets project that provides a daily export of IMDb data and IMDbPY, a Python package for retrieving and managing the data.
Web crawling at scale
If you attempt to crawl a big website like IMDb, which has over 45M pages according to Google, it’s important to crawl responsibly by configuring the following settings. You can identify your crawler and provide contact details in the USER_AGENT setting. To limit the pressure you put on the website's servers, you can increase DOWNLOAD_DELAY, limit CONCURRENT_REQUESTS_PER_DOMAIN, or enable AUTOTHROTTLE_ENABLED, which adapts the request rate dynamically based on the server's response times.
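As an illustration, a settings.py fragment along these lines might look as follows. All values are arbitrary assumptions rather than recommendations, and the contact URL is a placeholder; USER_AGENT is the setting that controls the User-Agent header Scrapy sends.

```python
# settings.py (fragment): values are illustrative assumptions
BOT_NAME = "scrapy_crawler"             # project name, also used in logs
USER_AGENT = "example-crawler (+https://example.com/contact)"  # hypothetical contact URL
ROBOTSTXT_OBEY = True

DOWNLOAD_DELAY = 1.0                    # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4      # parallel requests allowed per domain

AUTOTHROTTLE_ENABLED = True             # adapt the delay to observed latency
AUTOTHROTTLE_START_DELAY = 1.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0           # upper bound for the delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average parallel requests to aim for
```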
Notice that Scrapy crawls are optimized for a single domain by default. If you are crawling multiple domains, check these settings to optimize for broad crawls, including changing the default crawl order from depth-first to breadth-first. To limit your crawl budget, you can limit the number of requests with the CLOSESPIDER_PAGECOUNT setting of the close spider extension.
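A sketch of the corresponding settings.py fragment is shown below. The three scheduler settings are the ones the Scrapy documentation gives for breadth-first order; the page count is an arbitrary number chosen for the example.

```python
# settings.py (fragment): breadth-first order and a crawl budget
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

CLOSESPIDER_PAGECOUNT = 10_000   # close the spider after this many responses (arbitrary)
```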
With the default settings, Scrapy crawls about 600 pages per minute for a website like IMDb. To crawl 45M pages it will take more than 50 days for a single robot. If you need to crawl multiple websites it can be better to launch separate crawlers for each big website or group of websites. If you are interested in distributed web crawls, you can read how a developer crawled 250M pages with Python in 40 hours using 20 Amazon EC2 machine instances.
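The 50-day estimate follows directly from the two figures quoted above, as this short calculation shows.

```python
pages = 45_000_000          # approximate number of IMDb pages quoted above
pages_per_minute = 600      # crawl rate quoted above

days = pages / pages_per_minute / 60 / 24
print(round(days, 1))  # 52.1
```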
In some cases, you may run into websites that require you to execute JavaScript code to render all the HTML. Fail to do so, and you may not collect all links on the website. Because it’s now very common for websites to render content dynamically in the browser, I wrote a Scrapy middleware for rendering JavaScript pages using ScrapingBee’s API.
Conclusion
We compared the code of a Python crawler using third-party libraries for downloading URLs and parsing HTML with a crawler built using a popular web crawling framework. Scrapy is a very performant web crawling framework and it’s easy to extend with your custom code. But you need to know all the places where you can hook your own code and the settings for each component.
Configuring Scrapy properly becomes even more important when crawling websites with millions of pages. If you want to learn more about web crawling I suggest that you pick a popular website and try to crawl it. You will definitely run into new issues, which makes the topic fascinating!
In this article, we will see how to extract structured information from a web page using BeautifulSoup and CSS selectors.
Web Scraping with BeautifulSoup
Pulling the HTML out
BeautifulSoup is not a web scraping library per se. It is a library that allows you to efficiently and easily pull out information from HTML. In the real world, it is often used for web scraping projects.
So, to begin, we'll need HTML. We will pull out HTML from the HackerNews landing page using the requests Python package.
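A minimal sketch of that download step, assuming the requests package is installed. The helper name fetch_html and the 10-second timeout are choices made for this example.

```python
import requests


def fetch_html(url: str) -> bytes:
    """Download a page and return its raw body, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content


# Example call (performs a real network request):
# html = fetch_html("https://news.ycombinator.com")
```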
Parsing the HTML with BeautifulSoup
Now that the HTML is accessible, we will use BeautifulSoup to parse it. If you haven't already, you can install the package with a simple pip install beautifulsoup4. In the rest of this article, we will refer to BeautifulSoup4 as BS4.
We now need to parse the HTML and load it into a BS4 structure.
This soup object is very handy and allows us to easily access many useful pieces of information such as:
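Here is a minimal sketch of loading HTML into a soup object and reading a few of its properties. The inline HTML string is made up and stands in for the downloaded page so that the example runs offline.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded page (made up for illustration).
html = """
<html><head><title>Hacker News</title></head>
<body>
  <a href="https://example.com/a">First story</a>
  <a href="https://example.com/b">Second story</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title)                # the <title> element
print(soup.title.string)         # its text content
print(len(soup.find_all("a")))   # number of links in the page
print(soup.get_text(" ", strip=True))  # all text in the page, whitespace-trimmed
```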
Targeting DOM elements
You might begin to see a pattern in how to use this library. It allows you to quickly and elegantly target the DOM elements you need.
If you need to select DOM elements by their tag (<p>, <a>, <span>, …), you can simply do soup.<tag> to select it. The caveat is that it will only select the first HTML element with that tag.
For example, if I want the first link, I just have to do:
This element will also have many useful methods to quickly extract information:
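A short sketch of both steps, selecting the first link and reading information from it, on a made-up HTML fragment:

```python
from bs4 import BeautifulSoup

html = ('<p>Read <a href="https://example.com/a" class="story">the first story</a> '
        'and <a href="/b">more</a>.</p>')
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                  # first <a> in the document
print(first_link.name)               # tag name
print(first_link.get_text())         # text inside the tag
print(first_link["href"])            # attribute access; raises KeyError if missing
print(first_link.get("target"))      # .get() returns None if the attribute is missing
print(first_link.attrs)              # all attributes as a dict
```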
This is a simple example. If you want to select the first element based on its id or class, it is not much more difficult:
And if you don't want the first matching element but instead all matching elements, just replace find with find_all.
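The following sketch shows find and find_all side by side; the HTML, ids and class names are made up for the example.

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <p class="intro">Hello</p>
  <p class="body">First</p>
  <p class="body">Second</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

main = soup.find(id="main")                    # first element with this id
first_body = soup.find(class_="body")          # first element with this class
all_body = soup.find_all("p", class_="body")   # every matching element

print(main.name, first_body.get_text(), len(all_body))  # div First 2
```

The trailing underscore in class_ is needed because class is a reserved word in Python.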
This simple and elegant interface allows you to quickly write short and powerful Python snippets.
For example, let's say that I want to extract all links in this page and find the top three links that appear the most on the page. All I have to do is this:
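One way to write this, sketched on a made-up HTML fragment, is to combine find_all with collections.Counter from the standard library:

```python
from collections import Counter

from bs4 import BeautifulSoup

html = """
<a href="/a">1</a> <a href="/b">2</a> <a href="/a">3</a>
<a href="/c">4</a> <a href="/a">5</a> <a href="/b">6</a> <a>no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True skips anchors that have no href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
top_three = Counter(hrefs).most_common(3)
print(top_three)  # [('/a', 3), ('/b', 2), ('/c', 1)]
```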
Advanced usage
BeautifulSoup is a great example of a library that is both easy to use and powerful.
You can do much more to select elements using BeautifulSoup. Although we won't cover those cases in this article, here are a few examples of advanced things you can do:
- Select elements with regexp
- Select elements with a custom function (links that have Google in them for example)
- Iterating over siblings elements
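For a rough idea of what these three techniques look like, here is a brief sketch on made-up HTML. The helper name is_google_link is hypothetical.

```python
import re

from bs4 import BeautifulSoup

html = """
<ul>
  <li><a href="https://www.google.com/search">Google</a></li>
  <li><a href="https://example.com/page-1">Page 1</a></li>
  <li><a href="https://example.com/page-2">Page 2</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. Regular expression on an attribute value.
pages = soup.find_all("a", href=re.compile(r"/page-\d+$"))


# 2. Custom function: keep only links whose href mentions Google.
def is_google_link(tag):
    return tag.name == "a" and "google" in tag.get("href", "").lower()


google_links = soup.find_all(is_google_link)

# 3. Siblings: the <li> elements that follow the first one.
first_item = soup.find("li")
following = first_item.find_next_siblings("li")

print(len(pages), len(google_links), len(following))  # 2 1 2
```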
We also only covered how to target elements but there is also a whole section about updating and writing HTML. Again, we won't cover this in this article.
Let's now talk about CSS selectors.
CSS selectors
Why learn about CSS selectors if BeautifulSoup can select all elements with its pre-made method?
Well, you'll soon understand.
Hard DOM
Sometimes, the HTML document won't have a useful class and id. Selecting elements with BS4 without relying on that information can be quite verbose.
For example, let's say that you want to extract the score of a post on the HN homepage, but you can't use a class name or id in your code. Here is how you could do it:
Not that great right?
If you rely on CSS selectors, it becomes easier.
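The two approaches can be compared on a small table. The HTML below is a simplified, made-up stand-in for the HN markup, and the selector is one plausible choice rather than the one from the original listing.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>1.</td><td><a href="https://example.com/a">Story A</a></td></tr>
  <tr><td></td><td><span>120 points</span> by <a href="/user?id=x">x</a></td></tr>
  <tr><td>2.</td><td><a href="https://example.com/b">Story B</a></td></tr>
  <tr><td></td><td><span>45 points</span> by <a href="/user?id=y">y</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Navigation only: walk every row and inspect its cells by position.
scores_nav = []
for row in soup.find_all("tr"):
    cells = row.find_all("td", recursive=False)
    if len(cells) == 2 and not cells[0].get_text(strip=True):
        span = cells[1].find("span")
        if span is not None:
            scores_nav.append(span.get_text())

# CSS selector: a similar structural rule expressed as one string.
scores_css = [s.get_text() for s in soup.select("tr > td:nth-child(2) > span:nth-child(1)")]

print(scores_nav == scores_css, scores_css)  # True ['120 points', '45 points']
```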
This is much clearer and simpler, right? Of course, this example artificially highlights the usefulness of the CSS selector. But, you will quickly see that the DOM structure of a page is more reliable than the class name.
Easily debuggable
Another thing that makes CSS selectors great for web scraping is that they are easily debuggable. I'll show you how. Open Chrome, then open the developer tools (right-click -> “Inspect”), click on the Elements panel, and use Ctrl-F or Cmd-F to enter search mode.
In the search bar, you'll be able to write any CSS expression you want, and Chrome will instantly find all elements matching it.
Iterate over the results by pressing Enter to check that you are correctly getting everything you need.
What is great with Chrome is that it works the other way around too. You can also right-click on an element, click “Copy -> Copy Selector”, and the selector will be copied to your clipboard.
Powerful
CSS selectors, and particularly pseudo-classes, allow you to select any elements you want with one simple string.
Child and descendants
You can select direct child and descendant with:
And you can mix them together:
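The selector listings are missing from this copy, so here is a hedged sketch that runs child, descendant and mixed selectors through BS4's select method on made-up HTML.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p>direct child</p>
  <section><p>nested descendant</p></section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

children = soup.select("div > p")          # p elements that are direct children of a div
descendants = soup.select("div p")         # p elements anywhere inside a div
mixed = soup.select("div > section p")     # child and descendant combinators together

print(len(children), len(descendants), len(mixed))  # 1 2 1
```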
This will totally work.
Siblings
This one is one of my favorites because it allows you to select elements based on the elements on the same level in the DOM hierarchy, hence the sibling expression.
To select all p elements coming after an h2, you can use the h2 ~ p selector (it will match two p elements). You can also use h2 + p if you only want to select the p coming directly after an h2 (it will match only one p).
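The HTML that these counts refer to is not included in this copy. The made-up fragment below reproduces the same behaviour, with two p elements after the h2 and one immediately after it.

```python
from bs4 import BeautifulSoup

html = """
<section>
  <p>before</p>
  <h2>Title</h2>
  <p>first after</p>
  <p>second after</p>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

general = soup.select("h2 ~ p")    # every p sibling that comes after the h2
adjacent = soup.select("h2 + p")   # only the p immediately after the h2

print(len(general), len(adjacent))  # 2 1
```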
Attribute selectors
Attribute selectors allow you to select elements with particular attribute values. So, p[data-test='foo'] will match any p element whose data-test attribute equals foo.
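A quick sketch of this selector in BS4, on a made-up fragment:

```python
from bs4 import BeautifulSoup

html = '<p data-test="foo">match</p><p data-test="bar">other value</p><p>no attribute</p>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.select("p[data-test='foo']")   # attribute equals foo
present = soup.select("p[data-test]")       # attribute exists, any value

print(len(exact), len(present))  # 1 2
```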
Position pseudo classes
If you want to select the last p inside a section, you can also do it in “pure” CSS by leveraging position pseudo-classes. For this particular example, you just need this selector: section p:last-child. If you want to learn more about this, I suggest you take a look at this article.
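As a sketch, the selector behaves as follows on a made-up section. Note that :last-child only matches when the p is the final element of its parent; if other elements follow it, p:last-of-type is the closer fit.

```python
from bs4 import BeautifulSoup

html = """
<section>
  <p>one</p>
  <p>two</p>
  <p>three</p>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

last = soup.select("section p:last-child")
print([p.get_text() for p in last])  # ['three']
```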
Maintainable code
I also think that CSS expressions are easier to maintain. For example, at ScrapingBee, when we do custom web scraping tasks all of our scripts begin like this:
This makes it easy and quick to fix scripts when DOM changes appear. The laziest way to do it is to simply copy/paste what Chrome gives you when you right-click on an element. If you do this, be careful: Chrome tends to add a lot of useless selectors when you use this trick. So do not hesitate to clean them up a bit before using them in your script.
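The actual script header is not shown in this copy. The sketch below only illustrates the general idea of keeping selectors as named constants at the top of a script; the selector values, constant names and sample HTML are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors, kept together at the top of the script so they are
# the only lines to touch when the page layout changes.
TITLE_SELECTOR = "article > h1"
PRICE_SELECTOR = "article span[data-role='price']"


def scrape(html):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one(TITLE_SELECTOR)
    price = soup.select_one(PRICE_SELECTOR)
    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }


print(scrape('<article><h1> Face mask </h1><span data-role="price">4.90</span></article>'))
# {'title': 'Face mask', 'price': '4.90'}
```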
Conclusion
In the end, everything you do with pure CSS selectors you can also do with BeautifulSoup4. But I think choosing the former is the best way to go.
I hope you liked this article about web scraping in Python and that it will make your life easier.
If you'd like to read more about web scraping in Python do not hesitate to check out our extensive Python web scraping guide.
You might also be interested in our XPath tutorial.
Happy Scraping,
Pierre de Wulf