Python asyncio in Practice: A High-Performance Web Crawler

write_bot

2026-2-14

Environment Setup

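Make sure Python 3.7 or later is installed; asyncio, including asyncio.run(), is fully supported from that version on [Source #1]. Recent aiohttp releases may require a newer Python than 3.7, so check the aiohttp documentation if installation fails. Check the version with: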

python --version
# or
python3 --version

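If the version is below 3.7, download and install a current release from the official Python website. Create a virtual environment to isolate dependencies (Linux/macOS):

python3 -m venv crawler_env
source crawler_env/bin/activate

On Windows, the activation command is:

crawler_env\Scripts\activate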

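Install the asynchronous HTTP client aiohttp:

pip install aiohttp

Verify the installation:

pip list | grep aiohttp

The expected output shows the aiohttp version number (for example, aiohttp 3.8.4), which indicates the environment is ready. On Windows, where grep is not available by default, pip show aiohttp gives the same information.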
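
The same check can also be scripted from Python. The following is a minimal sketch, assuming the 3.7 minimum stated above; check_environment is a hypothetical helper name introduced here for illustration, not part of asyncio or aiohttp.

```python
import importlib.util
import sys

# Minimum version stated in this article (an assumption for the sketch);
# newer aiohttp releases may require more.
MIN_VERSION = (3, 7)


def check_environment(min_version=MIN_VERSION, package="aiohttp"):
    """Return (python_ok, package_found) without importing the package."""
    python_ok = sys.version_info >= min_version
    package_found = importlib.util.find_spec(package) is not None
    return python_ok, package_found


if __name__ == "__main__":
    python_ok, package_found = check_environment()
    print(f"Python {sys.version.split()[0]} meets minimum: {python_ok}")
    print(f"aiohttp importable: {package_found}")
```

Run it inside the activated virtual environment; if the second line reports False, pip install aiohttp was probably run in a different environment.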

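Step-by-Step: Core Concepts and Code

At the core of asyncio are coroutines and the event loop. Coroutines are defined with async/await syntax, which lets a function suspend at an I/O operation and resume later, giving non-blocking concurrency [Source #2]. For a web crawler, this means many coroutines can have HTTP requests in flight at once instead of blocking on each one in turn.

Define asynchronous functions: use async def to create coroutine functions.
Use aiohttp.ClientSession: it manages HTTP connections and supports concurrent requests.
Create tasks: use asyncio.create_task() or asyncio.gather() to schedule multiple coroutines.
Run the event loop: use asyncio.run() to start the main coroutine.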
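
The four steps above can be tried without any network access. The sketch below uses only the standard library; fake_io is a hypothetical stand-in for an HTTP call, with asyncio.sleep playing the role of the network wait.

```python
import asyncio


async def fake_io(name, delay):
    """Stand-in for a network call: yields to the event loop while 'waiting'."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # create_task() schedules the coroutine on the running loop and
    # returns a Task handle that can be awaited later.
    task = asyncio.create_task(fake_io("task", 0.05))

    # gather() runs its awaitables concurrently and returns the results in
    # argument order, regardless of which one finishes first.
    gathered = await asyncio.gather(fake_io("slow", 0.1), fake_io("fast", 0.01))

    return await task, gathered


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Note that "fast" finishes before "slow", yet gather() still reports them in the order they were passed in. This is why the crawler can pair results with its URL list by position.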

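The complete asynchronous crawler below fetches data from http://httpbin.org/get to simulate concurrent requests. It runs as-is, provided aiohttp is installed and the network is reachable.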

import asyncio
import aiohttp
import time

async def fetch_url(session, url):
    """Fetch the content of a single URL asynchronously."""
    try:
        async with session.get(url) as response:
            if response.status == 200:
                data = await response.text()
                print(f"Fetched {url}: {len(data)} characters")
                return data
            else:
                print(f"Request failed {url}: status {response.status}")
                return None
    except Exception as e:
        print(f"Exception {url}: {e}")
        return None

async def main():
    """Main coroutine: request several URLs concurrently."""
    urls = [
        "http://httpbin.org/get",
        "http://httpbin.org/get?param1=value1",
        "http://httpbin.org/get?param2=value2",
        "http://httpbin.org/get?param3=value3",
        "http://httpbin.org/get?param4=value4"
    ]

    async with aiohttp.ClientSession() as session:
        # These are coroutine objects; gather() wraps them in tasks and runs them concurrently.
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

        # Verify the results: count the successful requests
        successful = [r for r in results if r is not None]
        print(f"\nTotal requests: {len(urls)}")
        print(f"Successful requests: {len(successful)}")
        print(f"Failed requests: {len(urls) - len(successful)}")

if __name__ == "__main__":
    start_time = time.time()
    asyncio.run(main())
    end_time = time.time()
    print(f"\nTotal time: {end_time - start_time:.2f} s")

Save it as async_crawler.py and run:

python async_crawler.py

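The expected output shows the fetch status for each URL and the total elapsed time. Because the requests run concurrently, the total should be well below the time five serial requests would take (often under 1 second, depending on network conditions), which illustrates asyncio's advantage on I/O-bound work [Source #2].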
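
To see the serial-versus-concurrent gap without depending on httpbin.org, the comparison can be reproduced with simulated latency. This is a sketch under stated assumptions: fake_request, run_serial, run_concurrent and timed are hypothetical names, and the 0.1 s latency is an arbitrary illustrative value, not a measurement.

```python
import asyncio
import time

SIMULATED_LATENCY = 0.1  # assumed per-request latency in seconds (illustrative)


async def fake_request(i):
    """Pretend to perform request i by sleeping for the simulated latency."""
    await asyncio.sleep(SIMULATED_LATENCY)
    return i


async def run_serial(n):
    # Each request starts only after the previous one has finished.
    return [await fake_request(i) for i in range(n)]


async def run_concurrent(n):
    # All requests are in flight at the same time.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))


def timed(coro_fn, n):
    """Run coro_fn(n) in a fresh event loop and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, serial = timed(run_serial, 5)
    _, concurrent = timed(run_concurrent, 5)
    print(f"serial: {serial:.2f}s, concurrent: {concurrent:.2f}s")
```

With five simulated requests, the serial run should take roughly five times the latency, while the concurrent run should take roughly one.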

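Verifying the Results

The key to verifying crawler performance is measuring concurrency and response time. The code above already prints the total elapsed time and the request counts. For a closer look, add logging or use the timeit module. For example, modify the code to time each request individually: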

import asyncio
import aiohttp
import time

async def fetch_url(session, url, index):
    """Asynchronous fetch that reports the request index and its duration."""
    start = time.time()
    try:
        async with session.get(url) as response:
            if response.status == 200:
                data = await response.text()
                end = time.time()
                print(f"Request {index}: {url} took {end - start:.2f} s")
                return data
            else:
                print(f"Request {index} failed: status {response.status}")
                return None
    except Exception as e:
        print(f"Request {index} raised: {e}")
        return None

async def main():
    urls = ["http://httpbin.org/get" for _ in range(10)]  # 10 concurrent requests
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url, i) for i, url in enumerate(urls)]
        results = await asyncio.gather(*tasks)
        successful = [r for r in results if r is not None]
        print(f"\nSuccessful requests: {len(successful)}/{len(urls)}")

if __name__ == "__main__":
    start = time.time()
    asyncio.run(main())
    print(f"Total time: {time.time() - start:.2f} s")

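Run this code (save it as validate_crawler.py and execute python validate_crawler.py). The expected output shows the elapsed time of each request, and the total should be close to the time of a single request (roughly 0.5 to 1 second), which shows that the requests overlap. A significantly larger total may point to network problems or server-side rate limiting.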
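
Rather than reading the printed lines by eye, the per-request durations can be collected and summarized. The sketch below is one way to do that, with asyncio.sleep standing in for session.get(); timed_call and measure are hypothetical helper names, and the 0.05 s delay is an assumed latency.

```python
import asyncio
import statistics
import time


async def timed_call(make_coro):
    """Await the coroutine produced by make_coro() and return its duration."""
    start = time.perf_counter()
    await make_coro()
    return time.perf_counter() - start


async def measure(n, delay=0.05):
    """Run n simulated requests concurrently and summarize their timings."""
    wall_start = time.perf_counter()
    durations = await asyncio.gather(
        *(timed_call(lambda: asyncio.sleep(delay)) for _ in range(n))
    )
    wall = time.perf_counter() - wall_start
    return {
        "count": len(durations),
        "mean": statistics.mean(durations),
        "max": max(durations),
        "sum": sum(durations),  # what a serial run would roughly cost
        "wall": wall,           # what the concurrent run actually cost
    }


if __name__ == "__main__":
    print(asyncio.run(measure(10)))
```

If "wall" is close to "max", the requests overlapped; if it drifts toward "sum", something is serializing them.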

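Common Errors and Troubleshooting

Common errors when building a crawler with asyncio include event loop conflicts, connection pool exhaustion, and poor exception handling. Typical problems and their solutions:

Error: RuntimeError: This event loop is already running (or, from asyncio.run() itself, RuntimeError: asyncio.run() cannot be called from a running event loop). Cause: trying to start an event loop from code that is already running inside one, for example calling asyncio.run() from within a coroutine or a Jupyter notebook cell. Solution: call asyncio.run() only at the top level of the main script and use await inside coroutines; avoid nested calls [Source #1].
Error: too many connections (this may surface as OSError: Too many open files, connection resets, or HTTP 429 responses). Cause: too many concurrent requests, exceeding server or client limits. Solution: limit concurrency with a Semaphore, for example semaphore = asyncio.Semaphore(5), and use async with semaphore in the fetch function. aiohttp.TCPConnector also accepts a limit argument (default 100) that caps simultaneous connections.
Error: SSL/TLS certificate problems. Cause: certificate verification fails on an HTTPS request. Solution: for testing only, pass ssl=False to session.get() or to aiohttp.TCPConnector; otherwise update the trusted certificates, for example with the certifi library.
Error: a coroutine is not awaited. Cause: the await is missing on a call to an async function, so the call only creates a coroutine object (Python reports RuntimeWarning: coroutine ... was never awaited). Solution: always use await, for example data = await fetch_url(session, url).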
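
The event loop conflict is easy to reproduce with the standard library alone. The following sketch shows the failing call and the fix side by side; inner and outer are hypothetical names used only for this demonstration.

```python
import asyncio


async def inner():
    return "ok"


async def outer():
    coro = inner()
    try:
        # Wrong: a loop is already running here, so asyncio.run()
        # refuses to start another one and raises RuntimeError.
        asyncio.run(coro)
        message = None
    except RuntimeError as exc:
        coro.close()  # avoids the "coroutine ... was never awaited" warning
        message = str(exc)

    # Right: inside a coroutine, await the other coroutine directly.
    result = await inner()
    return message, result


if __name__ == "__main__":
    print(asyncio.run(outer()))
```

The only asyncio.run() call that succeeds is the one at the top level of the script; everything below it uses await.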

A snippet that adds a Semaphore to limit concurrency:

import asyncio
import aiohttp

async def fetch_with_semaphore(session, url, semaphore):
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def main():
    semaphore = asyncio.Semaphore(5)  # limit concurrency to 5
    urls = ["http://httpbin.org/get" for _ in range(20)]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_semaphore(session, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks)
        print(f"Completed {len(results)} requests")

if __name__ == "__main__":
    asyncio.run(main())

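Running this code keeps at most five requests in flight at a time, which avoids the errors caused by too many connections. If you run into SSL problems, you can temporarily change the ClientSession line to: async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False)) as session:. In production, always keep certificate verification enabled and use valid certificates.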
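
As an alternative to switching verification off, a verifying SSL context can be built explicitly with the standard library. This is a minimal sketch; build_verifying_context is a hypothetical helper name, and the cafile parameter is optional (with certifi installed, certifi.where() is one possible value for it).

```python
import ssl


def build_verifying_context(cafile=None):
    """Create an SSLContext that verifies certificates and hostnames.

    cafile optionally points at a CA bundle file; when it is None,
    the system's default trust store is used.
    """
    return ssl.create_default_context(cafile=cafile)


if __name__ == "__main__":
    ctx = build_verifying_context()
    print(ctx.verify_mode, ctx.check_hostname)
```

The resulting context can be handed to aiohttp in place of ssl=False, for example aiohttp.TCPConnector(ssl=ctx), so that requests stay verified while using the chosen CA bundle.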

References
