1. First Impressions of Web Crawlers#
In the vast world of the internet, data is scattered everywhere like treasure. A web crawler is like a diligent treasure hunter: following specific rules, it moves automatically from page to page and captures the information we need. Simply put, a web crawler is a program or script that automatically retrieves web data; it is also commonly called a web spider or web robot.
Crawlers play a crucial role in data acquisition. Data has become a key basis for decision-making for businesses and researchers, and crawlers are an efficient means of obtaining it in large quantities. Their application scenarios are extremely broad. In data analysis, collecting user comments and behavior data from social media platforms gives businesses precise insight into user needs and market trends, providing strong support for product optimization and marketing strategy. In competitive research, crawlers can regularly capture competitors' product information, price changes, and promotional activities, helping businesses adjust their strategies in time and stay competitive. Crawlers also play an irreplaceable role in academic research, news aggregation, and more.
2. Environment Setup and Basic Syntax#
2.1 Python Installation and Configuration#
Python is the preferred language for crawler development: its concise syntax and rich ecosystem of libraries greatly simplify the development process. First, download the installer for your operating system from the official Python website (https://www.python.org/downloads/). The download page lists many versions; choose the latest stable release. After downloading, run the installer and be sure to check the "Add Python to PATH" option in the installation wizard. This step is crucial because it lets you invoke Python from any path on the system.
Once the installation is complete, you can verify it by entering `python --version` at the command prompt (CMD); if the installation succeeded, Python's version number is printed. You can also enter `python` to open the interactive interpreter, where you can write and run Python code directly and get a first feel for the language.
2.2 Choosing Development Tools#
A good development tool can significantly improve development efficiency. In crawler development, PyCharm is a popular integrated development environment (IDE). It has powerful code editing features, such as code auto-completion, syntax highlighting, and code navigation, making it easier for you to write code. At the same time, PyCharm has good support for various Python libraries and frameworks, facilitating project management and debugging.
In addition to PyCharm, there are other good options, such as Visual Studio Code (VS Code). It is a lightweight yet powerful code editor that can achieve efficient Python development by installing Python plugins. VS Code has good extensibility and cross-platform capabilities, making it suitable for developers who prefer a simple development environment.
2.3 Review of Basic Python Syntax#
Before officially starting crawler development, it is necessary to review the basic syntax of Python, including data types, control statements, and functions.
Python has several basic data types, such as integers (int), floating-point numbers (float), strings (str), and booleans (bool). Each type has its own characteristics for storing and manipulating data. For example, strings store text, and indexing and slicing can be used to obtain specific characters or substrings. Example code is as follows:
name = "Crawler Expert"
print(name[0]) # Output the first character
print(name[2:5]) # Output the substring from the third character to the fifth character
Control statements are an important part of program logic, commonly including conditional statements (if-elif-else) and loop statements (for, while). Conditional statements execute different code blocks based on different conditions, while loop statements repeatedly execute a segment of code. For example, using a for loop to iterate over a list:
fruits = ["Apple", "Banana", "Orange"]
for fruit in fruits:
    print(fruit)
Functions encapsulate reusable code, improving code reusability and readability. In Python, you define a function with the `def` keyword, for example:
def add_numbers(a, b):
    return a + b
result = add_numbers(3, 5)
print(result) # Output 8
By mastering these basic syntax elements, we lay a solid foundation for subsequent crawler development.
3. Basic Principles and Workflow of Crawlers#
3.1 HTTP Protocol Analysis#
The HTTP protocol, or Hypertext Transfer Protocol, is the foundation for communication between crawlers and web servers. It acts as a common language between crawlers and servers, specifying how both parties request and transmit data.
An HTTP request mainly consists of a request line, request headers, and an optional request body. The request line includes the request method, URL, and HTTP version. Common request methods are GET and POST. The GET method is typically used to retrieve resources from the server; for example, when you open a webpage, the browser sends a GET request to fetch the page's HTML. With a GET request, parameters are appended to the URL as key-value pairs, such as `https://example.com/search?q=crawler&page=1`. Because the parameters are visible in the URL, GET is less suitable for transmitting sensitive information.
The POST method is commonly used to submit data to the server, such as login forms or comments. Unlike GET, the parameters of a POST request are placed in the request body, so they are not visible in the URL and are relatively more secure. For example, during login, the username and password are sent via a POST request, and the request body may contain something like `username=admin&password=123456`.
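A minimal sketch with the `requests` library illustrates the difference; the example.com URLs and credentials are placeholders reused from the examples above:

```python
import requests

# GET: parameters passed via params become the query string in the URL
get_resp = requests.get("https://example.com/search",
                        params={"q": "crawler", "page": 1})
print(get_resp.url)  # e.g. https://example.com/search?q=crawler&page=1

# POST: parameters passed via data travel in the request body, not the URL
post_resp = requests.post("https://example.com/login",
                          data={"username": "admin", "password": "123456"})
print(post_resp.status_code)
```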
An HTTP response consists of a status line, response headers, and a response body. The status code in the status line directly reflects the outcome of the request: 200 means the request succeeded and the server returned the requested resource; 404 means the requested resource does not exist, perhaps because the URL is wrong or the page was deleted; 500 indicates an internal server error. The response headers carry metadata about the response, such as the content type (e.g., `text/html` indicates an HTML page) and the content length. The response body is the actual data we want to retrieve, such as the page's HTML code or JSON data.
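All three parts can be inspected directly on a `requests` response object; a quick sketch (example.com is a placeholder):

```python
import requests

response = requests.get("https://example.com")

print(response.status_code)                  # status code, e.g. 200
print(response.headers.get("Content-Type"))  # a response header, e.g. text/html; charset=UTF-8
print(response.text[:200])                   # the beginning of the response body
```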
3.2 Crawler Workflow#
The workflow of a crawler can be summarized in several key steps: initiating requests, obtaining responses, parsing data, and saving data.
First, the crawler uses an HTTP library (such as Python's `requests` library) to send a request to the target server for the specified URL. To simulate a real user's access, request headers such as `User-Agent` may be set; this header tells the server the type of browser and operating system the visitor is using. For example, the following code uses the `requests` library to send a GET request:
import requests
url = "https://example.com"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers)
After receiving the request, the server returns a response. If the request is successful, we get a response object containing the status code, response headers, and response body. The crawler then extracts the data it needs from the response body, choosing a parsing method that matches the data format. If the response body is HTML, libraries such as `BeautifulSoup` or `lxml` can be used. For example, using `BeautifulSoup` to parse an HTML page and extract all links:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
If the response body is JSON, Python's built-in `json` module can parse it. An example is as follows:
import json
data = json.loads(response.text)
print(data)
Finally, the parsed data is saved to a local file or database for subsequent analysis and use. Common file formats include plain text (`.txt`), CSV (`.csv`), and JSON (`.json`). For example, saving data as a JSON file:
import json
data = [{"name": "Crawler", "info": "Data collection tool"}]
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
If saving to a database is required, the `pymysql` library can be used for relational databases (like MySQL), while the `pymongo` library can be used for non-relational databases (like MongoDB).
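As a rough illustration of the database route, the following `pymysql` sketch assumes a local MySQL server with a database `crawler_db` and a table `items(name, info)` that already exist; both names are made up for the example:

```python
import pymysql

# Assumes a local MySQL server with a database `crawler_db` and a
# table `items(name VARCHAR(255), info VARCHAR(255))` created beforehand.
conn = pymysql.connect(host="localhost", user="root", password="password",
                       db="crawler_db", charset="utf8mb4")
try:
    with conn.cursor() as cursor:
        cursor.execute("INSERT INTO items (name, info) VALUES (%s, %s)",
                       ("Crawler", "Data collection tool"))
    conn.commit()
finally:
    conn.close()
```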
4. Common Libraries for Crawlers#
4.1 Requests Library#
The Requests library is a powerful tool for sending HTTP requests in Python. Its API is simple and clear, making it easy to interact with web servers. Before using it, make sure it is installed, which can be done with `pip install requests`.
Once installed, you can use it in your code. For example, sending a simple GET request to retrieve webpage content:
import requests
url = "https://www.example.com"
response = requests.get(url)
if response.status_code == 200:
    print(response.text)
else:
    print(f"Request failed, status code: {response.status_code}")
In this example, `requests.get(url)` sends a GET request to the specified URL and returns a response object, `response`. By checking `response.status_code`, we can determine whether the request succeeded: a status code of 200 indicates success, and `response.text` then contains the page's HTML content.
In addition to GET requests, the Requests library also supports POST requests for submitting data to the server. For example, simulating a login form submission:
import requests
url = "https://www.example.com/login"
data = {
"username": "your_username",
"password": "your_password"
}
response = requests.post(url, data=data)
print(response.text)
In this code, the `data` dictionary contains the username and password required for login, and `requests.post(url, data=data)` sends this data to the specified login URL.
Additionally, request headers can be set to simulate different browser accesses, handle JSON response data, set request timeouts, etc. For example, setting request headers:
import requests
url = "https://www.example.com"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers)
print(response.text)
In this example, the `headers` dictionary sets the `User-Agent`, telling the server that we are using the Chrome browser. This can help avoid being rejected by websites that detect non-browser access.
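The timeout and JSON handling mentioned above can look like the sketch below; the API URL is a placeholder, and this is an illustration rather than a definitive recipe:

```python
import requests

# timeout limits how long we wait for the server before raising an exception
response = requests.get("https://api.example.com/data", timeout=10)
response.raise_for_status()  # turn 4xx/5xx status codes into exceptions

# If the server returns JSON, response.json() parses it into Python objects
data = response.json()
print(data)
```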
4.2 BeautifulSoup Library#
The BeautifulSoup library is a powerful tool for parsing HTML and XML. It converts a complex HTML or XML document into a tree structure that is easy to navigate and manipulate, making it simple to extract the data we need. Before use, install it with `pip install beautifulsoup4`.
Assuming we have already obtained the HTML content of a webpage using the Requests library, we can then use the BeautifulSoup library for parsing. For example:
from bs4 import BeautifulSoup
import requests
url = "https://www.example.com"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract all links
    links = soup.find_all('a')
    for link in links:
        print(link.get('href'))
else:
    print(f"Request failed, status code: {response.status_code}")
In this code, `BeautifulSoup(response.text, 'html.parser')` parses the downloaded HTML into a `BeautifulSoup` object, `soup`. Here `html.parser` is Python's built-in HTML parser, but other parsers such as `lxml` can be chosen as needed. `soup.find_all('a')` finds all `<a>` tags, i.e., links, and `link.get('href')` retrieves each link's `href` attribute value.
In addition to searching for elements by tag name, you can also search by class name, ID, and other attributes. For example, finding elements with a specific class name:
from bs4 import BeautifulSoup
import requests
url = "https://www.example.com"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find elements with class name "special-class"
    special_elements = soup.find_all(class_='special-class')
    for element in special_elements:
        print(element.get_text())
else:
    print(f"Request failed, status code: {response.status_code}")
In this example, `soup.find_all(class_='special-class')` finds all elements with the class name `special-class`, and `element.get_text()` retrieves their text content.
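Elements can also be located by id or with CSS selectors; a small self-contained sketch (the HTML snippet is invented for the example):

```python
from bs4 import BeautifulSoup

html = '<div id="main"><p class="intro">Hello</p><p>World</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Find a single element by its id attribute
main_div = soup.find(id='main')
print(main_div.name)  # div

# CSS selectors combine tag, id, and class conditions in one expression
intro = soup.select_one('div#main p.intro')
print(intro.get_text())  # Hello
```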
4.3 XPath Syntax#
XPath is a language for locating and extracting elements in XML and HTML documents. With path expressions it can precisely select nodes or node sets in a document, which makes it important in crawler development. In Python, XPath is typically used together with the `lxml` library; install it first with `pip install lxml`.
Here is an example of using XPath syntax to extract webpage data:
from lxml import etree
import requests
url = "https://www.example.com"
response = requests.get(url)
if response.status_code == 200:
    html = etree.HTML(response.text)
    # Extract all text content from <p> tags
    p_texts = html.xpath('//p/text()')
    for text in p_texts:
        print(text)
else:
    print(f"Request failed, status code: {response.status_code}")
In this code, `etree.HTML(response.text)` converts the downloaded HTML into an lxml `Element` object, `html`. `html.xpath('//p/text()')` uses the XPath expression `//p/text()` to select the text content of all `<p>` tags. Here `//` selects matching nodes anywhere in the document, regardless of position; `p` is the tag name; and `/text()` selects the text inside that tag.
XPath syntax also supports locating elements by attributes. For example, extracting elements with specific attribute values:
from lxml import etree
import requests
url = "https://www.example.com"
response = requests.get(url)
if response.status_code == 200:
    html = etree.HTML(response.text)
    # Extract all links within <div> tags with class "article-content"
    links = html.xpath('//div[@class="article-content"]//a/@href')
    for link in links:
        print(link)
else:
    print(f"Request failed, status code: {response.status_code}")
In this example, `//div[@class="article-content"]` selects all `<div>` tags whose class attribute is `article-content`, and `//a/@href` then selects the `href` attribute values of all `<a>` tags inside those `<div>` tags.
4.4 Regular Expressions#
Regular expressions are a powerful tool for matching and processing strings and are often used for data extraction in crawlers. They define rules for matching strings that conform to specific patterns. In Python, regular expressions are supported by the built-in `re` module.
For example, if we want to extract all phone numbers from a piece of text, we can use the following regular expression:
import re
text = "Contact number: 13888888888, another phone: 15666666666"
pattern = r'\d{11}'
phones = re.findall(pattern, text)
for phone in phones:
    print(phone)
In this code, `r'\d{11}'` is the regular expression pattern. The `r` prefix marks a raw string, preventing backslashes from being escaped; `\d` matches any digit character (0-9); and `{11}` means the preceding element (a digit) must occur exactly 11 times in a row. `re.findall(pattern, text)` finds all substrings of `text` that match `pattern` and returns them as a list.
Regular expressions can also be used to match more complex patterns, such as email addresses and URLs. For example, matching email addresses:
import re
text = "Email: [email protected], another email: [email protected]"
pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
emails = re.findall(pattern, text)
for email in emails:
    print(email)
In this example, `[a-zA-Z0-9_.+-]+` matches one or more letters, digits, underscores, dots, plus signs, or hyphens; `@` is the fixed separator in email addresses; `[a-zA-Z0-9-]+` matches one or more letters, digits, or hyphens; and `\.[a-zA-Z0-9-.]+` matches a dot followed by one or more letters, digits, dots, or hyphens. This pattern matches common email address formats.
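For URLs, a deliberately simplified pattern (not a full RFC-compliant matcher) might look like this:

```python
import re

text = "Docs: https://example.com/docs, mirror at http://mirror.example.org/page?id=1"
# Matches http or https, then any run of characters that are not whitespace or commas
pattern = r'https?://[^\s,]+'
for url in re.findall(pattern, text):
    print(url)
```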
5. Advanced Crawler Techniques#
5.1 Handling Anti-Crawler Mechanisms#
In the course of crawling, we often run into anti-crawler mechanisms set up by websites, much like obstacles on the treasure hunt. Don't worry, though: there is a set of effective counter-strategies.
Setting request headers is a simple yet effective method. Websites often decide whether a request comes from a crawler by checking headers such as `User-Agent`. By imitating a real browser's request headers, we can make the website treat the request as coming from a regular user. For example, in Python's `requests` library, request headers can be set like this:
import requests
url = "https://example.com"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers)
Here, the `User-Agent` string imitates the Chrome browser, which increases the credibility of the request.
Using proxy IPs is another important way around anti-crawler restrictions. When our IP address is banned by a website for sending requests too frequently, a proxy IP acts as a "stand-in" that lets us keep accessing the site. Free and paid proxy services are both available. In Python, a proxy can be set with the `requests` library as follows:
import requests
url = "https://example.com"
proxies = {
"http": "http://your_proxy_ip:your_proxy_port",
"https": "https://your_proxy_ip:your_proxy_port"
}
response = requests.get(url, proxies=proxies)
When using proxy IPs, it is important to choose reliable proxy sources to ensure the stability and availability of the proxy IPs.
Additionally, controlling the request frequency is crucial. Sending a large number of requests in a short time makes it easy to be identified as a crawler. We can add time intervals to make the requests gentler, for example by calling the `time` module's `sleep` function after each request:
import requests
import time
url_list = ["https://example1.com", "https://example2.com", "https://example3.com"]
for url in url_list:
    response = requests.get(url)
    time.sleep(5)  # Pause for 5 seconds
This way, sending a request every 5 seconds reduces the risk of being detected by anti-crawler mechanisms.
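A common refinement, sketched below, is to randomize the delay with the `random` module so the request rhythm looks less mechanical:

```python
import random
import time

import requests

url_list = ["https://example1.com", "https://example2.com", "https://example3.com"]
for url in url_list:
    response = requests.get(url)
    # Sleep for a random 3-7 seconds instead of a fixed interval
    time.sleep(random.uniform(3, 7))
```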
5.2 Dynamic Web Page Crawling#
With the development of web technologies, more and more web pages use dynamic loading techniques, which pose new challenges for crawlers. Traditional methods of directly obtaining HTML content may not be able to retrieve dynamically loaded data. However, we have specialized tools to deal with this situation.
Selenium is a powerful automation testing tool that can drive a real browser and simulate user actions, allowing it to retrieve the fully rendered content of dynamic pages. To use Selenium, first install the matching browser driver; for Chrome, download a ChromeDriver build that matches your browser version from the ChromeDriver website. Then install the Selenium library with `pip install selenium`. Here is an example of using Selenium to open a webpage and retrieve its content:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://example.com")
# Get webpage content
page_source = driver.page_source
print(page_source)
driver.quit()
In this example, `webdriver.Chrome()` creates a Chrome browser driver instance, `driver.get(url)` opens the specified webpage, `driver.page_source` retrieves the rendered page source, and finally `driver.quit()` closes the browser.
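Because dynamic pages may finish rendering only after a delay, it is often safer to wait explicitly for a specific element rather than reading `page_source` immediately. A sketch using Selenium's `WebDriverWait`; the element id `content` is a placeholder:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com")
try:
    # Wait up to 10 seconds for an element with id "content" to appear
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "content"))
    )
    print(element.text)
finally:
    driver.quit()
```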
In addition to Selenium, Scrapy-Splash is a good choice. It is a plugin for the Scrapy framework designed specifically for dynamic web pages. Splash is a JavaScript rendering service: it runs the page's JavaScript on the server side and returns the rendered HTML to the crawler. To use Scrapy-Splash, install and run the Splash service first, then configure it in your Scrapy project. For example, add the following configuration to the project's `settings.py` file:
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Then, in the spider code, send requests with `SplashRequest` instead of a normal request:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 5})

    def parse(self, response):
        # Parse webpage content
        pass
In this example, `SplashRequest` routes the request through the Splash service, and `args={'wait': 5}` tells Splash to wait 5 seconds so that the page's JavaScript is fully rendered before the content is returned.
5.3 Multithreading and Asynchronous Crawling#
In cases where the data volume is large, single-threaded crawlers may have relatively low efficiency. At this point, we can utilize multithreading and asynchronous programming techniques to improve crawler efficiency.
A multithreaded crawler runs several threads at once, each handling requests and data extraction for one or more URLs, which significantly speeds up data collection. In Python, the `threading` module can be used to implement a multithreaded crawler. For example:
import threading
import requests
def fetch_url(url):
    response = requests.get(url)
    print(response.text)

url_list = ["https://example1.com", "https://example2.com", "https://example3.com"]
threads = []
for url in url_list:
    t = threading.Thread(target=fetch_url, args=(url,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
In this code, `threading.Thread(target=fetch_url, args=(url,))` creates a new thread, where `target` specifies the function the thread runs and `args` passes its arguments. Multiple threads can then send requests to different URLs at the same time, improving crawler efficiency.
Asynchronous crawling uses Python's asynchronous programming features to perform non-blocking I/O within a single thread: while a network request is waiting, the program can work on other tasks, avoiding blocked threads and improving resource utilization. `asyncio` is Python's standard library for asynchronous programming, and combined with the `aiohttp` library (for asynchronous HTTP requests) it enables efficient asynchronous crawling. Example code is as follows:
import asyncio
import aiohttp
async def fetch_url(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        for result in results:
            print(result)

if __name__ == "__main__":
    asyncio.run(main())
In this example, `async def` defines an asynchronous function and `await` waits for an asynchronous operation to complete. `aiohttp.ClientSession()` creates an HTTP session, the `tasks` list holds all the asynchronous tasks, and `asyncio.gather(*tasks)` runs them concurrently and waits for all of them to finish. This achieves efficient asynchronous crawling.
6. Application of Crawler Frameworks#
6.1 Scrapy Framework#
Scrapy is a powerful and widely used Python crawler framework that provides efficient and convenient solutions for crawler development.
The architecture of Scrapy is like a precision machine, with multiple core components working together. The engine is the core hub of the entire framework, responsible for coordinating communication and data flow between various components. It retrieves URLs to be crawled from the scheduler, sends requests to the downloader, which downloads webpage content from the internet and returns the response to the engine. The engine then passes the response to the spider for data parsing. The pipeline is responsible for processing the data extracted by the spider, performing operations such as data cleaning and storage. The scheduler acts like an intelligent task allocator, maintaining a queue of URLs and managing the scheduling of URLs to be crawled, ensuring that each URL is reasonably arranged in the crawling order while also deduplicating repeated URLs.
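To make the pipeline component concrete, a minimal item pipeline could look like the sketch below (the `title` field is illustrative); it would be registered via `ITEM_PIPELINES` in `settings.py`:

```python
class CleanTitlePipeline:
    """A minimal Scrapy item pipeline: tidy up a 'title' field before storage."""

    def process_item(self, item, spider):
        title = item.get('title')
        if title:
            item['title'] = title.strip()
        return item
```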
When using the Scrapy framework, the first step is to create a Scrapy project. Entering `scrapy startproject project_name` on the command line generates the project skeleton. For example, to create a project named `my_crawler`, the command is as follows:
scrapy startproject my_crawler
After entering the project directory, you can create a spider with `scrapy genspider spider_name target_domain`. Suppose you want to crawl data from the `example.com` website; the commands to create the spider are as follows:
cd my_crawler
scrapy genspider example_spider example.com
In the generated spider file, you need to define the spider's logic. For example, here is a simple spider example that crawls the title information from a specified webpage:
import scrapy
class ExampleSpider(scrapy.Spider):
    name = "example_spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        titles = response.css('title::text').extract()
        for title in titles:
            yield {'title': title}
In this example, the `parse` method contains the spider's core logic: it uses a CSS selector to extract the page title and yields the data to the pipeline for subsequent processing.
6.2 PySpider Framework#
PySpider is a lightweight yet powerful crawler framework with unique features and applicable scenarios.
One of the highlights of PySpider is its intuitive and easy-to-use WebUI interface. Through this interface, developers can easily manage projects, monitor tasks, and view results. In the WebUI, you can easily create, edit, and start crawler tasks, get real-time updates on the crawler's running status, and view the crawled data results, greatly improving development and debugging efficiency.
PySpider supports various data storage methods, including common databases like MySQL, MongoDB, and Redis. This allows developers to flexibly choose the most suitable data storage solution based on the actual needs of the project, facilitating subsequent data processing and analysis.
It also has powerful distributed crawling capabilities. By configuring multiple crawler nodes, PySpider can achieve high-concurrency crawling tasks, significantly improving data collection speed and efficiency. This feature makes it perform exceptionally well in handling large-scale data crawling tasks.
As an example, let's use PySpider to crawl article content from a news website. First, install PySpider with `pip install pyspider`. After installation, start the PySpider service by running `pyspider all`, then open the PySpider WebUI in your browser at http://localhost:5000. In the WebUI, create a new crawler project and write the following crawler code:
from pyspider.libs.base_handler import *
class NewsSpider(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://news.example.com', callback=self.index_page)

    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        title = response.doc('title').text()
        content = response.doc('.article-content').text()
        return {'title': title, 'content': content}
In this example, the `on_start` method defines the crawler's starting URL; the `index_page` method parses the list page, extracts article links, and hands them to the `detail_page` method, which parses each detail page and extracts the article's title and content. Through PySpider's WebUI we can easily start, monitor, and manage this crawler task.
7. Practical Project Exercises#
7.1 Small Crawler Project#
To give everyone a more intuitive feel for the charm of crawlers, we will take crawling Douban Movie Top 250 as an example to demonstrate a complete crawler implementation process. Douban Movie Top 250 is a popular list among movie enthusiasts, containing rich movie information.
First, we need to analyze the webpage structure. Using the browser's developer tools (for example, press F12 in Chrome), we can inspect the page's HTML source and find the tags and attributes that hold the movie information. For example, the movie title is usually inside a `<span class="title">` tag, and the rating inside a `<span class="rating_num">` tag.
Next, we write the crawler code using Python's `requests` and `BeautifulSoup` libraries. The code is as follows:
import requests
from bs4 import BeautifulSoup
import csv
def get_movie_info(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        movie_list = soup.find_all('div', class_='item')
        for movie in movie_list:
            title = movie.find('span', class_='title').text
            rating = movie.find('span', class_='rating_num').text
            quote = movie.find('span', class_='inq')
            quote = quote.text if quote else 'None'
            yield {
                'title': title,
                'rating': rating,
                'quote': quote
            }
    else:
        print(f"Request failed, status code: {response.status_code}")

def save_to_csv(data, filename='douban_movies.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['title', 'rating', 'quote']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for movie in data:
            writer.writerow(movie)

if __name__ == "__main__":
    base_url = "https://movie.douban.com/top250?start={}&filter="
    all_movie_info = []
    for start in range(0, 250, 25):
        url = base_url.format(start)
        movie_info = get_movie_info(url)
        all_movie_info.extend(movie_info)
    save_to_csv(all_movie_info)
In this code, the `get_movie_info` function sends the request, parses the webpage, and extracts each movie's title, rating, and quote. The `save_to_csv` function saves the extracted data to a CSV file. By looping through the page offsets, we collect all the information from Douban Movie Top 250.
7.2 Large Comprehensive Crawler Project#
Now, we move on to a more challenging task—crawling product information from an e-commerce platform and implementing data persistence and distributed crawling functionality. Taking the example of crawling mobile product information from a certain e-commerce platform, this process involves multiple complex steps.
First, we need to address anti-crawler issues. E-commerce platforms usually have strict anti-crawler mechanisms, so we disguise our requests by setting request headers and using proxy IPs. To achieve data persistence, we store the data in a MySQL database. Below is a partial code example implemented with the `Scrapy` framework:
import scrapy
import pymysql
class MobileSpider(scrapy.Spider):
    name = "mobile_spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/mobiles"]

    def parse(self, response):
        for mobile in response.css('.mobile-item'):
            item = {
                'title': mobile.css('.title::text').get(),
                'price': mobile.css('.price::text').get(),
                'rating': mobile.css('.rating::text').get()
            }
            yield item
        next_page = response.css('.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

class MySQLPipeline:
    def __init__(self):
        self.conn = pymysql.connect(
            host='localhost',
            user='root',
            password='password',
            db='ecommerce',
            charset='utf8'
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = "INSERT INTO mobiles (title, price, rating) VALUES (%s, %s, %s)"
        values = (item['title'], item['price'], item['rating'])
        self.cursor.execute(sql, values)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
In this code, the `MobileSpider` class defines the crawler logic, including the starting URL, data parsing, and pagination handling, while the `MySQLPipeline` class inserts the crawled data into the `mobiles` table of the MySQL database.
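One detail worth noting: Scrapy only calls a pipeline that has been registered in `settings.py`. Assuming the pipeline class lives in the project's `pipelines.py` (the module path below reuses the earlier project name `my_crawler` and should be adjusted to your own project), the registration would look like:

```python
# settings.py -- register the pipeline so Scrapy actually calls it
ITEM_PIPELINES = {
    'my_crawler.pipelines.MySQLPipeline': 300,
}
```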
For distributed crawling, we can use the `Scrapy-Redis` framework, which distributes crawling tasks across multiple nodes for parallel execution, greatly improving crawling efficiency. First, install the library with `pip install scrapy-redis`. Then add the following configuration to the `settings.py` file:
# Enable Redis to store the request queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Ensure all spiders share the same duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Redis connection settings
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
In the spider class, inherit from the `RedisSpider` class, as shown below:
from scrapy_redis.spiders import RedisSpider
class DistributedMobileSpider(RedisSpider):
    name = "distributed_mobile_spider"
    redis_key = "mobile:start_urls"

    def parse(self, response):
        # Parsing logic is similar to a regular Spider
        pass
In this configuration, `redis_key` specifies the Redis key that holds the starting URLs. By pushing starting URLs into this Redis queue, different crawler nodes can pull tasks from it and process them, achieving distributed crawling.
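For instance, with the `redis` Python package (or the `redis-cli LPUSH` command) a starting URL could be pushed into that key as sketched below, assuming Redis runs locally on the default port:

```python
import redis

# Push a starting URL into the list that redis_key points at; idle crawler
# nodes will pop it from the queue and begin crawling
r = redis.Redis(host='localhost', port=6379)
r.lpush('mobile:start_urls', 'https://example.com/mobiles')
```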
8. Summary and Outlook#
In this journey through crawlers, we started with basic syntax and gradually delved into the core principles of crawlers, commonly used libraries, advanced techniques, framework applications, and practical projects. Through learning, we mastered how to set up a crawler environment using Python, utilize various libraries and tools to send HTTP requests, parse webpage data, handle anti-crawler mechanisms, and achieve efficient data collection. At the same time, we also learned about the characteristics and advantages of different crawler frameworks and how to apply them in practical projects.
Looking ahead, crawler technology will continue to evolve with the development of the internet. With the booming development of big data and artificial intelligence technologies, crawler technology is expected to play a more critical role in data collection and analysis. In the future, crawlers may become more intelligent, capable of automatically recognizing changes in webpage structures and flexibly adjusting crawling strategies to improve the accuracy and efficiency of data collection. Additionally, in the field of distributed crawling, with continuous technological improvements, the efficiency of multi-node collaboration will further enhance, enabling faster processing of large-scale data crawling tasks. Furthermore, as awareness of data security and privacy protection continues to grow, crawler technology will also pay more attention to compliance and security, ensuring data collection is conducted legally and safely.
The development prospects of crawler technology are broad and full of infinite possibilities. I hope everyone can continue to explore and innovate in their future learning and practice, fully leveraging the advantages of crawler technology to contribute to the development of various fields.