{"id":17881,"date":"2024-10-01T12:33:33","date_gmt":"2024-10-01T12:33:33","guid":{"rendered":"https:\/\/www.scrapingdog.com\/?p=17881"},"modified":"2025-08-25T07:42:00","modified_gmt":"2025-08-25T07:42:00","slug":"web-crawling-with-python","status":"publish","type":"post","link":"https:\/\/www.scrapingdog.com\/blog\/web-crawling-with-python\/","title":{"rendered":"Build Web Crawler with Python (Complete Guide)"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"17881\" class=\"elementor elementor-17881\" data-elementor-post-type=\"post\">\n\t\t\t\t<div class=\"elementor-element elementor-element-67328a9 e-flex e-con-boxed e-con e-parent\" data-id=\"67328a9\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-2aec25b elementor-widget elementor-widget-html\" data-id=\"2aec25b\" data-element_type=\"widget\" data-widget_type=\"html.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<!-- Gutenberg \u201cCustom HTML\u201d block -->\r\n<div style=\"\r\n  background:#d9f4e5;\r\n  border-left:4px solid #1d9b6c;\r\n  padding:18px 24px;\r\n  margin:24px 0;\r\n  border-radius:6px;\r\n  font-family:'Montserrat',sans-serif;\r\n  font-size:18px;\r\n  line-height:1.65;\r\n  color:#1a1a1a;\">\r\n  <p style=\"margin:0 0 8px 0; font-weight:600;\">TL;DR<\/p>\r\n\r\n  <ul style=\"margin:0; padding-left:20px;\">\r\n    <li>Explains crawling vs scraping; start from a seed URL and follow links recursively to collect pages.<\/li>\r\n    <li>Builds a mini crawler in Python with <code>requests<\/code> + <code>BeautifulSoup<\/code> (<em>Books to Scrape<\/em> demo).<\/li>\r\n    <li>Scales up with <code>Scrapy<\/code> (<code>CrawlSpider<\/code> + link rules) and exports results to JSON.<\/li>\r\n    <li><strong>Scaling tips:<\/strong> use proxies (<strong>Scrapingdog<\/strong> datacenter proxy; 1k free credits), tune delays \/ concurrency, and respect 
<code>robots.txt<\/code>.<\/li>\r\n  <\/ul>\r\n<\/div>\r\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-48b8379 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"48b8379\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Web crawling is a technique by which you can automatically navigate through multiple URLs and collect a tremendous amount of data. You can find all the URLs of multiple domains and extract information from them.<\/p><p>This technique is mainly used by search engines like Google, Yahoo, and Bing to rank websites and suggest results to the user based on the query one makes.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-7b999eb e-con-full e-flex e-con e-child\" data-id=\"7b999eb\" data-element_type=\"container\">\n\t\t\t\t<div class=\"elementor-element elementor-element-97bc23e font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"97bc23e\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In this article, we are going to first understand the main\u00a0<a href=\"https:\/\/www.scrapingdog.com\/blog\/crawling-vs-scraping\/\" target=\"_blank\" rel=\"noopener\" data-type=\"link\" data-id=\"https:\/\/www.scrapingdog.com\/blog\/crawling-vs-scraping\/\">difference between web crawling and web scraping<\/a>. 
This will help you draw a clear line in your mind between web scraping and web crawling.<\/p><p>Before we build a full-fledged web crawler, I will show you how to create a small one using\u00a0<a href=\"https:\/\/requests.readthedocs.io\/\" target=\"_blank\" rel=\"nofollow noopener\"><strong>requests<\/strong><\/a>\u00a0and\u00a0<a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/\" target=\"_blank\" rel=\"nofollow noopener\"><strong>BeautifulSoup<\/strong><\/a>. This will give you a clear idea of what exactly a web crawler is. Then we will create a production-ready web crawler using\u00a0<a href=\"https:\/\/scrapy.org\/\" target=\"_blank\" rel=\"nofollow noopener\"><strong>Scrapy<\/strong><\/a>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-46ed3a6 elementor-widget elementor-widget-heading\" data-id=\"46ed3a6\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">What is web crawling?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5a81612 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"5a81612\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"74d8\">A web crawler is an automated bot whose job is to visit multiple URLs on a single website or across multiple websites and download content from those pages. This data can then be used for purposes such as price analysis, search engine indexing, and monitoring changes on websites.<\/p><p id=\"d04e\">It all starts with the\u00a0<strong>seed URL<\/strong>,\u00a0which is the entry point of any web crawler. The web crawler then downloads HTML content from the page by making a GET request. 
The data downloaded is now parsed using various html parsing libraries to extract the most valuable data from it.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-83c1600 elementor-widget elementor-widget-image\" data-id=\"83c1600\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"800\" height=\"425\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/2024-10-01_13-31-1024x544.png\" class=\"attachment-large size-large wp-image-17883\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/2024-10-01_13-31-1024x544.png 1024w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/2024-10-01_13-31-300x159.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/2024-10-01_13-31-768x408.png 768w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/2024-10-01_13-31.png 1344w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d914004 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"d914004\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"74d8\">Extracted data might contain links to other pages on the same website. Now, the crawler will make GET requests to these pages as well to repeat the same process that it did with the seed URL. 
This process is recursive, which enables the script to visit every reachable URL on the domain and gather all the information available.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-be1b6aa elementor-widget elementor-widget-heading\" data-id=\"be1b6aa\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">How is web crawling different from web scraping?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d6a3385 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"d6a3385\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Web scraping and web crawling\u00a0might sound similar, but there is a fine line between them that sets them apart.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6e53dbd elementor-widget elementor-widget-image\" data-id=\"6e53dbd\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"800\" height=\"334\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/2024-10-01_13-49-1024x428.png\" class=\"attachment-large size-large wp-image-17884\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/2024-10-01_13-49-1024x428.png 1024w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/2024-10-01_13-49-300x125.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/2024-10-01_13-49-768x321.png 768w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/2024-10-01_13-49.png 1300w\" sizes=\"(max-width: 800px) 100vw, 800px\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a0e8b30 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"a0e8b30\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Web scraping involves making a GET request to just one single page and extracting the data present on the page. It will not look for other URLs available on the page. Of course, web scraping is comparatively fast because it works on a single page only.<\/p><p><strong><em>Read More:\u00a0<a href=\"https:\/\/www.scrapingdog.com\/blog\/what-is-web-scraping\/\" target=\"_blank\" rel=\"noopener\" data-type=\"link\" data-id=\"https:\/\/www.scrapingdog.com\/blog\/what-is-web-scraping\/\">What is Web Scraping<\/a>?<\/em><\/strong><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-38bad91 elementor-widget elementor-widget-heading\" data-id=\"38bad91\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Web Crawling using Requests &amp; BeautifulSoup<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-de2d0de font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"de2d0de\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In my experience, the combination of requests and BS4 is the best when it comes to downloading and parsing the raw HTML. 
If you want to learn more, check out this guide to the\u00a0<a href=\"https:\/\/www.scrapingdog.com\/blog\/best-python-web-scraping-libraries\/\">best libraries for\u00a0web scraping with Python<\/a>.<\/p><p id=\"7127\">In this section, we will create a small crawler for this\u00a0<a href=\"https:\/\/books.toscrape.com\/\" target=\"_blank\" rel=\"nofollow noopener\">website<\/a>. Following the flowchart shown above, the crawler will look for links starting from the seed URL. It will then visit each link and extract data.<\/p><p id=\"a6ef\">Let\u2019s first install these two libraries in the coding environment.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9aa0755 elementor-widget elementor-widget-code-highlight\" data-id=\"9aa0755\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>pip install requests\r\npip install bs4<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7728ca7 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"7728ca7\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"8c13\">We will also use\u00a0<code>urllib.parse<\/code>,\u00a0but since it is part of the Python standard library, there is no need to install it.<\/p><p id=\"cb77\">A basic Python crawler for our target website will look like this.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d14171f 
elementor-widget elementor-widget-code-highlight\" data-id=\"d14171f\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>import requests\r\nfrom bs4 import BeautifulSoup\r\nfrom urllib.parse import urljoin\r\n\r\n# URL of the website to crawl\r\nbase_url = \"https:\/\/books.toscrape.com\/\"\r\n\r\n# Set to store visited URLs\r\nvisited_urls = set()\r\n\r\n# List to store URLs to visit next\r\nurls_to_visit = [base_url]\r\n\r\n# Function to crawl a page and extract links\r\ndef crawl_page(url):\r\n    try:\r\n        # A timeout keeps the crawler from hanging on an unresponsive server\r\n        response = requests.get(url, timeout=10)\r\n        response.raise_for_status()  # Raise an exception for HTTP errors\r\n\r\n        soup = BeautifulSoup(response.content, \"html.parser\")\r\n        \r\n        # Extract links and convert them to absolute URLs\r\n        links = []\r\n        for link in soup.find_all(\"a\", href=True):\r\n            next_url = urljoin(url, link[\"href\"])\r\n            # Keep only links that stay on the target site\r\n            if next_url.startswith(base_url):\r\n                links.append(next_url)\r\n        \r\n        return links\r\n\r\n    except requests.exceptions.RequestException as e:\r\n        print(f\"Error crawling {url}: {e}\")\r\n        return []\r\n\r\n# Crawl the website\r\nwhile urls_to_visit:\r\n    current_url = urls_to_visit.pop(0)  # Dequeue the first URL\r\n\r\n    if current_url in visited_urls:\r\n        continue\r\n\r\n    print(f\"Crawling: {current_url}\")\r\n\r\n    new_links = crawl_page(current_url)\r\n    visited_urls.add(current_url)\r\n    urls_to_visit.extend(new_links)\r\n\r\nprint(\"Crawling finished.\")<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2d287ca font-color-green elementor-widget 
elementor-widget-text-editor\" data-id=\"2d287ca\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"157d\">It is a very simple code but let me break it down and explain it to you.<\/p><ol><li>We import the required libraries:\u00a0<code>requests<\/code>,\u00a0<code>BeautifulSoup<\/code>, and\u00a0<code>urljoin<\/code>\u00a0from\u00a0<code>urllib.parse<\/code>.<\/li><li>We define the\u00a0<code>base_url<\/code>\u00a0of the website and initialize a set\u00a0<code>visited_urls<\/code>\u00a0to store visited URLs.<\/li><li>We define a\u00a0<code>urls_to_visit<\/code>\u00a0list to store URLs that need to be crawled. We start with the base URL.<\/li><li>We define the\u00a0<code>crawl_page()<\/code>\u00a0function to fetch a web page, parse its HTML content, and extract links from it.<\/li><li>Inside the function, we use\u00a0<code>requests.get()<\/code>\u00a0to fetch the page and\u00a0<code>BeautifulSoup<\/code>\u00a0to parse its content.<\/li><li>We iterate through each\u00a0<code>&lt;a&gt;<\/code>\u00a0tag to extract links, and convert them to absolute URLs using\u00a0<code>urljoin()<\/code>, and add them to the\u00a0<code>links<\/code>\u00a0list.<\/li><li>The\u00a0<code>while<\/code>\u00a0loop continues as long as there are URLs in the\u00a0<code>urls_to_visit<\/code>\u00a0list. For each URL, we:<\/li><\/ol><ul><li>Dequeue the URL and check if it has been visited before.<\/li><li>Call the\u00a0<code>crawl_page()<\/code>\u00a0function to fetch the page and extract links.<\/li><li>Add the current URL to the\u00a0<code>visited_urls<\/code>\u00a0set and enqueue the new links to\u00a0<code>urls_to_visit<\/code>.<\/li><\/ul><p id=\"4f60\">8. Once the crawling process is complete, we print a message indicating that the process has finished.<\/p><p id=\"ade9\">To run this code you can type this command on bash. 
I have named my file crawl.py.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c5d5cd2 elementor-widget elementor-widget-code-highlight\" data-id=\"c5d5cd2\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>python crawl.py<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4a33a00 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"4a33a00\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Once your crawler starts, this will appear on your screen.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-74d5c6d elementor-widget elementor-widget-image\" data-id=\"74d5c6d\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"800\" height=\"408\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-21.png\" class=\"attachment-large size-large wp-image-17895\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-21.png 828w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-21-300x153.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-21-768x391.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-04e3e50 font-color-green 
elementor-widget elementor-widget-text-editor\" data-id=\"04e3e50\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f811\">This code might give you an idea of how web crawling works. However, there are certain limitations and potential disadvantages to this code.<\/p><ul><li><strong>No Parallelism<\/strong>: The code does not utilize parallel processing, meaning that only one request is processed at a time. Parallelizing the crawling process can significantly improve the speed of crawling.<\/li><li><strong>Lack of Error Handling<\/strong>: The code lacks detailed error handling for various scenarios, such as handling specific HTTP errors, connection timeouts, and more. Proper error handling is crucial for robust crawling.<\/li><li><strong>Depth-First Crawling<\/strong>: The code uses a breadth-first approach, but in certain cases, depth-first crawling might be more efficient. This depends on the structure of the website and the goals of the crawling operation. If you want to learn more about BFS and DFS then read this\u00a0<a href=\"https:\/\/www.geeksforgeeks.org\/difference-between-bfs-and-dfs\/\" target=\"_blank\" rel=\"nofollow noopener\">guide<\/a>. 
BFS looks for the shortest path to reach the destination.<\/li><\/ul><p>In the next section, we are going to create a web crawler using Scrapy which will help us eliminate these limitations.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7aa2969 elementor-widget elementor-widget-heading\" data-id=\"7aa2969\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Web Crawler using Scrapy<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a16a37a font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"a16a37a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f811\">Again we are going to use the same\u00a0<a href=\"https:\/\/books.toscrape.com\/\" target=\"_blank\" rel=\"nofollow noopener\">site<\/a>\u00a0for crawling with Scrapy.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-66916f0 elementor-widget elementor-widget-image\" data-id=\"66916f0\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"407\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-22.png\" class=\"attachment-large size-large wp-image-17896\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-22.png 828w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-22-300x153.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-22-768x390.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fc66bbd font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"fc66bbd\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f811\">This page has a lot of information like warnings, titles, categories, etc. Our task would be to find links that match a certain pattern. For example, when we click on any of the categories we can see a certain pattern in the URL.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9c18069 elementor-widget elementor-widget-image\" data-id=\"9c18069\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"641\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-23.png\" class=\"attachment-large size-large wp-image-17897\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-23.png 828w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-23-300x240.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-23-768x615.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-074cd2f font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"074cd2f\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f811\">Every URL will have\u00a0<strong><em>\/catalogue<\/em><\/strong>,\u00a0<strong><em>\/category<\/em><\/strong>\u00a0and\u00a0<strong><em>\/books<\/em><\/strong>. 
But when we click on a book we only see\u00a0<strong><em>\/catalogue<\/em><\/strong>\u00a0and nothing else.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-281f66f elementor-widget elementor-widget-image\" data-id=\"281f66f\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"315\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-24.png\" class=\"attachment-large size-large wp-image-17898\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-24.png 828w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-24-300x118.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-24-768x302.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ed19702 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"ed19702\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6d41\">So, one task would be to instruct our web crawler to find all the links that have this pattern. The web crawler would then follow to find all the available links with\u00a0<strong>\/catalogue\/category<\/strong>\u00a0patterns in them. 
That would be the mission of our web crawler.<\/p><p id=\"ddff\">In this case, it would be quite trivial because we have a sidebar where all the categories are listed. In a real-world project, however, you will often have something like the top 10 categories that you can click, plus a hundred more that you can only find by, for example, going into a book page that lists another 10 sub-categories.<\/p><p id=\"ddff\">So, you can instruct the crawler to go into all the different book pages to find all the secondary categories and then collect all the pages that are category pages.<\/p><p id=\"ddff\">This could be the web crawling task, and the web scraping task could be to collect the titles and prices of the books from each dedicated book page. I hope you got the idea now. Let\u2019s proceed with the coding part!<\/p><p>We will start by installing Scrapy. It is not just a simple library but a full web scraping and crawling framework: when you create a project with it, it generates multiple Python files in your folder. 
You can type the following command in your cmd to install it.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9fb1733 elementor-widget elementor-widget-code-highlight\" data-id=\"9fb1733\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>pip install scrapy<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d7ae7fa font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"d7ae7fa\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Once you install it you can go back to your working directory and run this command.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-358d685 elementor-widget elementor-widget-code-highlight\" data-id=\"358d685\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>scrapy startproject learncrawling<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ac8d05f font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"ac8d05f\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>You can of course use whatever name you like. I have used\u00a0<strong>learncrawling<\/strong>. Once the project is created you can see that in your chosen directory you have a new directory with the project name inside it.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-22d0be3 elementor-widget elementor-widget-image\" data-id=\"22d0be3\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"383\" height=\"370\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-25.png\" class=\"attachment-large size-large wp-image-17899\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-25.png 383w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-25-300x290.png 300w\" sizes=\"(max-width: 383px) 100vw, 383px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-116d9a6 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"116d9a6\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>You will see a bunch of other Python files as well. We will cover some of these files later in this blog. But the most important directory here is the spider\u2019s directory. These are actually the constructs that we use for the web crawling process.<\/p><p>We can create our own custom spiders to define our own crawling process. 
We are going to create a new Python file inside this.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c93d883 elementor-widget elementor-widget-image\" data-id=\"c93d883\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"280\" height=\"103\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-26.png\" class=\"attachment-large size-large wp-image-17900\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a756a70 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"a756a70\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In this file, we are going to write some code.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-10c5281 elementor-widget elementor-widget-code-highlight\" data-id=\"10c5281\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>from scrapy.spiders import CrawlSpider, Rule\r\nfrom scrapy.linkextractors import LinkExtractor<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1c6d503 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"1c6d503\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ol><li><code>from scrapy.spiders import CrawlSpider, Rule<\/code>: This line imports the\u00a0<code>CrawlSpider<\/code>\u00a0and\u00a0<code>Rule<\/code>\u00a0classes from the\u00a0<code>scrapy.spiders<\/code>\u00a0module.\u00a0<code>CrawlSpider<\/code>\u00a0is a subclass of the base\u00a0<code>Spider<\/code>\u00a0class provided by Scrapy. It is used to create spider classes specifically designed for crawling websites by following links.\u00a0<code>Rule<\/code>\u00a0is used to define rules for link extraction and following.<\/li><li><code>from scrapy.linkextractors import LinkExtractor<\/code>: This line imports the\u00a0<code>LinkExtractor<\/code>\u00a0class from the\u00a0<code>scrapy.linkextractors<\/code>\u00a0module.\u00a0<code>LinkExtractor<\/code>\u00a0is a utility class provided by Scrapy to extract links from web pages based on specified rules and patterns.<\/li><\/ol><p>Then we want to create a new class, which is going to be our custom spider class. I\u2019m going to call this class\u00a0<code>CrawlingSpider<\/code>, and it is going to inherit from the\u00a0<code>CrawlSpider<\/code>\u00a0class.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9c57d04 elementor-widget elementor-widget-code-highlight\" data-id=\"9c57d04\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>from scrapy.spiders import CrawlSpider, Rule\r\nfrom scrapy.linkextractors import LinkExtractor\r\n\r\n\r\n\r\nclass CrawlingSpider(CrawlSpider):\r\n    name = \"mycrawler\"\r\n    allowed_domains = [\"toscrape.com\"]\r\n    start_urls = [\"https:\/\/books.toscrape.com\/\"]\r\n\r\n\r\n    rules = (\r\n        
Rule(LinkExtractor(allow=\"catalogue\/category\")),\r\n    )<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a82fa8f font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"a82fa8f\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><code>name = \"mycrawler\"<\/code>: This attribute specifies the name of the spider. The name is used to uniquely identify the spider when running Scrapy commands.<\/p><p><code>allowed_domains = [\"toscrape.com\"]<\/code>: This attribute defines a list of domain names that the spider is allowed to crawl. In this case, it specifies that the spider should only crawl within the domain \u201ctoscrape.com\u201d.<\/p><p><code>start_urls = [\"https:\/\/books.toscrape.com\/\"]<\/code>: This attribute provides a list of starting URLs for the spider. The spider will begin crawling from these URLs.<\/p><p><code>Rule(LinkExtractor(allow=\"catalogue\/category\")),<\/code>: This line defines a rule using the\u00a0<code>Rule<\/code>\u00a0class. It utilizes a\u00a0<code>LinkExtractor<\/code>\u00a0to extract links based on the provided rule. The\u00a0<code>allow<\/code>\u00a0parameter specifies a regular expression pattern that is used to match URLs. 
In this case, it\u2019s looking for URLs containing the text \u201ccatalogue\/category\u201d.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0754afb elementor-widget elementor-widget-code-highlight\" data-id=\"0754afb\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>scrapy crawl mycrawler<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-08a39e3 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"08a39e3\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Once the spider starts running, you will see output like this in your terminal.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-88fe4be elementor-widget elementor-widget-image\" data-id=\"88fe4be\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"268\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-27.png\" class=\"attachment-large size-large wp-image-17904\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-27.png 828w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-27-300x100.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-27-768x257.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0c3a9e2 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"0c3a9e2\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3321\">Our crawler is finding all the category URLs. But what if we also want to find the links to individual book pages and scrape certain information from them?<\/p><p id=\"19a4\">Now we will extend our spider to extract some data from each individual book page. To find the book pages, I will define a new rule.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f9b6066 elementor-widget elementor-widget-code-highlight\" data-id=\"f9b6066\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>Rule(LinkExtractor(allow=\"catalogue\", deny=\"category\"), callback=\"parse_item\")<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6a72e82 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"6a72e82\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>This rule matches all the URLs with\u00a0<strong>catalogue<\/strong>\u00a0in them but skips the ones with\u00a0<strong>category<\/strong>\u00a0in them. Each matching page is then passed to the callback function named in the rule. 
This function will then handle the web scraping part.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ad15927 elementor-widget elementor-widget-heading\" data-id=\"ad15927\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">What Will We Scrape<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8cf6503 elementor-widget elementor-widget-image\" data-id=\"8cf6503\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"381\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/1_DODH5LaKqcuVacfXCcNVDg.png\" class=\"attachment-large size-large wp-image-17905\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/1_DODH5LaKqcuVacfXCcNVDg.png 875w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/1_DODH5LaKqcuVacfXCcNVDg-300x143.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/1_DODH5LaKqcuVacfXCcNVDg-768x366.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b45e79 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"1b45e79\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"97e6\">We are going to scrape:<\/p><ul><li><strong><em>Title of the book<\/em><\/strong><\/li><li><strong><em>Price of the book<\/em><\/strong><\/li><li><strong><em>Availability<\/em><\/strong><\/li><\/ul><p id=\"bef5\">Let\u2019s find out their DOM locations one by 
one.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ef844fc elementor-widget elementor-widget-image\" data-id=\"ef844fc\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"257\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-28-1.png\" class=\"attachment-large size-large wp-image-17910\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-28-1.png 828w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-28-1-300x96.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-28-1-768x247.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5e46b79 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"5e46b79\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"97e6\">The title can be seen under the class\u00a0<code>product_main<\/code>\u00a0with\u00a0<code>h1<\/code>\u00a0tags.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7c99c5e elementor-widget elementor-widget-image\" data-id=\"7c99c5e\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"287\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-29.png\" class=\"attachment-large size-large wp-image-17911\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-29.png 828w, 
https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-29-300x108.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-29-768x275.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-56a2fec font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"56a2fec\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7eb5\">Pricing can be seen under the\u00a0<code>p<\/code>\u00a0tag with class\u00a0<code>price_color<\/code>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-95c232c elementor-widget elementor-widget-image\" data-id=\"95c232c\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"172\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-30.png\" class=\"attachment-large size-large wp-image-17912\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-30.png 828w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-30-300x64.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-30-768x165.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a0174f2 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"a0174f2\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a584\">Availability can be seen under the\u00a0<code>p<\/code>\u00a0tag with 
class\u00a0<code>availability<\/code>.<\/p><p id=\"b2f4\">Let\u2019s put all of this into the\u00a0<code>parse_item()<\/code>\u00a0function.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6c26825 elementor-widget elementor-widget-code-highlight\" data-id=\"6c26825\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>def parse_item(self,response):\r\n\r\n        yield {\r\n            \"title\":response.css(\".product_main h1::text\").get(),\r\n            \"price\":response.css(\".price_color::text\").get(),\r\n            \"availability\":response.css(\".availability::text\")[1].get().strip()\r\n        }<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-98aee57 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"98aee57\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li><code>yield { ... }<\/code>: This line yields a dictionary literal enclosed within curly braces. Scrapy treats each yielded dictionary as a scraped item, which is how the extracted data reaches Scrapy\u2019s output pipeline.<\/li><li><code>\"title\": response.css(\".product_main h1::text\").get()<\/code>: This line extracts the text content of the\u00a0<code>&lt;h1&gt;<\/code>\u00a0element inside the element with the\u00a0<code>.product_main<\/code>\u00a0class using a CSS selector. The\u00a0<code>::text<\/code>\u00a0pseudo-element is used to select the text content. 
The\u00a0<code>.get()<\/code>\u00a0method retrieves the extracted text.<\/li><li><code>\"price\": response.css(\".price_color::text\").get()<\/code>: This line extracts the text content of the element with the\u00a0<code>.price_color<\/code>\u00a0class, similar to the previous line.<\/li><li><code>\"availability\": response.css(\".availability::text\")[1].get().strip()<\/code>: This line extracts the second text node matched by the\u00a0<code>.availability<\/code>\u00a0selector on the page.\u00a0<code>[1]<\/code>\u00a0indicates that we\u2019re selecting the second match (remember that indexing is zero-based). The\u00a0<code>.get()<\/code>\u00a0method retrieves the text content. The\u00a0<code>strip()<\/code>\u00a0method is used to remove the surrounding whitespace.<\/li><\/ul><p id=\"d965\">Our spider is now ready and we can run it from our terminal. Once again we are going to use the same command.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-14185b8 elementor-widget elementor-widget-code-highlight\" data-id=\"14185b8\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>scrapy crawl mycrawler<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cd610ef font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"cd610ef\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Let\u2019s run it and see the results. 
I am pretty excited.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a641436 elementor-widget elementor-widget-image\" data-id=\"a641436\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"233\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-31.png\" class=\"attachment-large size-large wp-image-17913\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-31.png 828w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-31-300x87.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-31-768x224.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2326c87 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"2326c87\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ec40\">Our spider scrapes all the books and goes through all the instances that it can find. It will probably take some time but at the end, you will see all the extracted data. 
You can also see the dictionaries being printed; they contain all the data we wanted to scrape.<\/p><p id=\"9f68\">What if I want to save this data?<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-308f34e elementor-widget elementor-widget-heading\" data-id=\"308f34e\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Saving the data in a JSON file<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a7c6631 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"a7c6631\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ec40\">You can save this data into a JSON file very easily. You can do it directly from the terminal, with no need to make any changes in the code.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f5aa37f elementor-widget elementor-widget-code-highlight\" data-id=\"f5aa37f\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>scrapy crawl mycrawler -o results.json<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7c540b1 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"7c540b1\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>This 
will then take all the scraped information and save it in a JSON file.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-93411b0 elementor-widget elementor-widget-image\" data-id=\"93411b0\" data-element_type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"238\" src=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-32.png\" class=\"attachment-large size-large wp-image-17914\" alt=\"\" srcset=\"https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-32.png 828w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-32-300x89.png 300w, https:\/\/www.scrapingdog.com\/wp-content\/uploads\/2024\/10\/image-32-768x228.png 768w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8eba850 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"8eba850\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>We have got the title, price, and availability. You can of course play with it a little and extract the integer from the availability string using regex. 
But my purpose here was to show you how it can be done quickly and smoothly.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b392228 elementor-widget elementor-widget-heading\" data-id=\"b392228\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Web Crawling with Proxy<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ee2ae5f font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"ee2ae5f\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c2cb\">One problem you might face in your web crawling journey is getting blocked from accessing the website. This happens when you send too many requests to the website, which may then ban your IP. This will break your data pipeline.<\/p><p id=\"d5f3\">To prevent this, you can use proxy services that handle IP rotation and retries, and even pass appropriate headers to the website so that you look like a legitimate visitor rather than a data-hungry bot (which, of course, we are).<\/p><p id=\"09ac\">You can sign up for the free\u00a0<a href=\"https:\/\/www.scrapingdog.com\/datacenter-proxies\/\" target=\"_blank\" rel=\"noopener\">datacenter proxy<\/a>. You will get 1,000 free credits to run a small crawler. Let\u2019s see how you can integrate this proxy into your Scrapy environment. 
There are three main steps involved in integrating the proxy.<\/p><ol><li>Define a constant\u00a0<code>PROXY_SERVER<\/code>\u00a0in your\u00a0<code>crawling_spider.py<\/code>\u00a0file.<\/li><\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4a31324 elementor-widget elementor-widget-code-highlight\" data-id=\"4a31324\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>from scrapy.spiders import CrawlSpider, Rule\r\nfrom scrapy.linkextractors import LinkExtractor\r\n\r\n\r\n\r\nclass CrawlingSpider(CrawlSpider):\r\n    name = \"mycrawler\"\r\n    allowed_domains = [\"toscrape.com\"]\r\n    start_urls = [\"https:\/\/books.toscrape.com\/\"]\r\n\r\n    PROXY_SERVER = \"http:\/\/scrapingdog:Your-API-Key@proxy.scrapingdog.com:8081\"\r\n    \r\n    rules = (\r\n        Rule(LinkExtractor(allow=\"catalogue\/category\")),\r\n        Rule(LinkExtractor(allow=\"catalogue\", deny=\"category\"), callback=\"parse_item\")\r\n    )\r\n\r\n    def parse_item(self,response):\r\n\r\n        yield {\r\n            \"title\":response.css(\".product_main h1::text\").get(),\r\n            \"price\":response.css(\".price_color::text\").get(),\r\n            \"availability\":response.css(\".availability::text\")[1].get().strip()\r\n        }<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ec53c7a font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"ec53c7a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>2. 
Then we will move to the <code>settings.py<\/code>\u00a0file. Find the\u00a0<strong><em>downloader middlewares<\/em><\/strong>\u00a0section in this file, uncomment it, and add the\u00a0<code>HttpProxyMiddleware<\/code>\u00a0entry shown below.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3f8dd14 elementor-widget elementor-widget-code-highlight\" data-id=\"3f8dd14\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>DOWNLOADER_MIDDLEWARES = {\r\n   'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware':1,\r\n   'learncrawling.middlewares.LearncrawlingDownloaderMiddleware': 543,\r\n}<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0ae67b4 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"0ae67b4\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"e703\">This will enable the use of a proxy.<\/p><p id=\"3084\">3. Then the final step would be to make changes in our\u00a0<code>middlewares.py<\/code>\u00a0file. In this file, you will find a class by the name\u00a0<code>LearncrawlingDownloaderMiddleware<\/code>. 
From here we can manipulate the process of request-sending by adding the proxy server.<\/p><p id=\"87c1\">Here you will find a function\u00a0<code>process_request()<\/code>, inside which you have to add the lines below.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cb32ffc elementor-widget elementor-widget-code-highlight\" data-id=\"cb32ffc\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>request.meta['proxy'] = \"http:\/\/scrapingdog:Your-API-Key@proxy.scrapingdog.com:8081\"\r\nreturn None<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b8e5502 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"b8e5502\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"cf00\">Now, every request will go through the proxy, and your data pipeline is much less likely to get blocked.<\/p><p id=\"cd14\">Of course, changing just the IP will not help you bypass the anti-scraping wall of every website, so this proxy also passes custom headers to help you get past that wall.<\/p><p id=\"21d4\">Now, Scrapy has certain limits when it comes to crawling. Let\u2019s say you are crawling a website like Amazon using Scrapy; you can then scrape around\u00a0350 pages per minute (according to my own experiment). That means about\u00a0504,000 pages per day. This speed is not enough if you want to scrape millions of pages in just a few days. 
I came across this\u00a0article\u00a0where the author scraped more than 250 million pages within 40 hours. I would recommend reading this article.<\/p><p id=\"a124\">On some websites, such as\u00a0<strong>zoominfo.com<\/strong>, you might have to wait between requests. For that, you can use\u00a0<code>DOWNLOAD_DELAY<\/code>\u00a0to give your crawler a little rest. You can read more about it\u00a0<a href=\"https:\/\/docs.scrapy.org\/en\/latest\/topics\/settings.html#download-delay\" target=\"_blank\" rel=\"nofollow noopener\">here<\/a>. This is how you can add this to your code.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4a6c330 elementor-widget elementor-widget-code-highlight\" data-id=\"4a6c330\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp>import scrapy\r\n\r\n\r\nclass MySpider(scrapy.Spider):\r\n    name = 'my_spider'\r\n    start_urls = ['https:\/\/example.com']\r\n\r\n    download_delay = 1  # Wait 1 second between consecutive requests to the same site<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0961568 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"0961568\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a866\">Then you can use\u00a0<code>CONCURRENT_REQUESTS<\/code>\u00a0to control the number of requests you want to send at a time. 
You can read more about it\u00a0<a href=\"https:\/\/docs.scrapy.org\/en\/latest\/topics\/settings.html?highlight=concurrent+#concurrent-requests\" target=\"_blank\" rel=\"nofollow noopener\">here<\/a>.<\/p><p id=\"3ea4\">You can also use\u00a0<code>ROBOTSTXT_OBEY<\/code>\u00a0to obey the rules set by the domain owners about data collection. Of course, as a data collector, you should respect their boundaries. You can read more about it\u00a0<a href=\"https:\/\/docs.scrapy.org\/en\/latest\/topics\/settings.html?highlight=ROBOTSTXT_OBEY#robotstxt-obey\" target=\"_blank\" rel=\"nofollow noopener\">here<\/a>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4179a9c elementor-widget elementor-widget-heading\" data-id=\"4179a9c\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Complete Code<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d12f244 font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"d12f244\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a866\">There are multiple data points available on this website which can also be scraped. 
But for now, the complete code for this tutorial looks like this.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c910395 elementor-widget elementor-widget-code-highlight\" data-id=\"c910395\" data-element_type=\"widget\" data-widget_type=\"code-highlight.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"prismjs-default copy-to-clipboard \">\n\t\t\t<pre data-line=\"\" class=\"highlight-height language-python line-numbers\">\n\t\t\t\t<code readonly=\"true\" class=\"language-python\">\n\t\t\t\t\t<xmp># crawling_spider.py\r\n\r\nfrom scrapy.spiders import CrawlSpider, Rule\r\nfrom scrapy.linkextractors import LinkExtractor\r\n\r\n\r\nclass CrawlingSpider(CrawlSpider):\r\n    name = \"mycrawler\"\r\n    allowed_domains = [\"toscrape.com\"]\r\n    start_urls = [\"http:\/\/books.toscrape.com\/\"]\r\n\r\n    rules = (\r\n        # Follow category pages to discover more links (no callback)\r\n        Rule(LinkExtractor(allow=\"catalogue\/category\")),\r\n        # Send book pages (catalogue URLs outside the categories) to parse_item\r\n        Rule(LinkExtractor(allow=\"catalogue\", deny=\"category\"), callback=\"parse_item\"),\r\n    )\r\n\r\n    def parse_item(self, response):\r\n        # Scrape the title, price, and availability from a book page\r\n        yield {\r\n            \"title\": response.css(\".product_main h1::text\").get(),\r\n            \"price\": response.css(\".price_color::text\").get(),\r\n            \"availability\": response.css(\".availability::text\")[1].get().strip(),\r\n        }<\/xmp>\n\t\t\t\t<\/code>\n\t\t\t<\/pre>\n\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-60bacb5 elementor-widget elementor-widget-heading\" data-id=\"60bacb5\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Conclusion<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-35aa0e9 font-color-green elementor-widget elementor-widget-text-editor\" 
data-id=\"35aa0e9\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"d73e\">In this blog, we built a crawler with requests and with Scrapy. Both can get the job done, but Scrapy completes the task faster. Scrapy also gives you the flexibility to crawl a large number of websites efficiently. Beginners might find Scrapy a little intimidating, but once you get the hang of it, you will be able to crawl websites very easily.<\/p><p id=\"7cc1\">I hope you now clearly understand the difference between web scraping and web crawling. The\u00a0<code>parse_item()<\/code>\u00a0function does the web scraping once the URLs have been crawled.<\/p><p id=\"c104\">I think you are now capable of crawling websites whose data matters. A good first target is Amazon, and this guide on\u00a0<a href=\"https:\/\/www.scrapingdog.com\/blog\/scrape-amazon\/\" target=\"_blank\" rel=\"noopener\">web scraping Amazon with Python<\/a> will help you get started.<\/p><p id=\"dbfa\">I hope you liked this tutorial. If you did, please do not forget to share it with your friends and on social media.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-409ad04 elementor-widget elementor-widget-heading\" data-id=\"409ad04\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Additional Resources<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0e0dbbf font-color-green elementor-widget elementor-widget-text-editor\" data-id=\"0e0dbbf\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li><a 
href=\"https:\/\/www.scrapingdog.com\/blog\/javascript-web-crawler-nodejs\/\" target=\"_blank\" rel=\"noreferrer noopener\" data-type=\"link\" data-id=\"https:\/\/www.scrapingdog.com\/blog\/javascript-web-crawler-nodejs\/\">Web Crawling using Javascript &amp; NodeJs<\/a><\/li><li><a href=\"https:\/\/www.scrapingdog.com\/webscraping-problems\/python\/which-is-better-for-web-scraping-python-or-javascript\" target=\"_blank\" rel=\"noopener\" data-type=\"link\" data-id=\"https:\/\/www.scrapingdog.com\/webscraping-problems\/python\/which-is-better-for-web-scraping-python-or-javascript\">Web Scraping Python vs Nodejs<\/a><\/li><li><a href=\"https:\/\/www.scrapingdog.com\/blog\/find-all-the-urls-on-a-domain-website\/\" target=\"_blank\" rel=\"noopener\" data-type=\"link\" data-id=\"https:\/\/www.scrapingdog.com\/blog\/find-all-the-urls-on-a-domain-website\/\">4 Best Methods to Find All The URLs on a Domain\u2019s Website<\/a><\/li><\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>TL;DR Explains crawling vs scraping; start from a seed URL and follow links recursively to collect pages. Builds a mini crawler in Python with requests + BeautifulSoup (Books to Scrape demo). Scales up with Scrapy (CrawlSpider + link rules) and exports results to JSON. 
Scaling tips: use proxies (Scrapingdog datacenter proxy; 1k free credits), tune [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":17920,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"footnotes":""},"categories":[25,84],"tags":[],"class_list":["post-17881","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog","category-python"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.scrapingdog.com\/wp-json\/wp\/v2\/posts\/17881","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.scrapingdog.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.scrapingdog.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.scrapingdog.com\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/www.scrapingdog.com\/wp-json\/wp\/v2\/comments?post=17881"}],"version-history":[{"count":0,"href":"https:\/\/www.scrapingdog.com\/wp-json\/wp\/v2\/posts\/17881\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.scrapingdog.com\/wp-json\/wp\/v2\/media\/17920"}],"wp:attachment":[{"href":"https:\/\/www.scrapingdog.com\/wp-json\/wp\/v2\/media?parent=17881"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.scrapingdog.com\/wp-json\/wp\/v2\/categories?post=17881"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.scrapingdog.com\/wp-json\/wp\/v2\/tags?post=17881"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}