As a second step in learning async programming, I developed a very simple spider with Python 3.6 and asyncio.
In this case the spider requests a bunch of URLs. The server it talks to waits before answering each request.
http://localhost:8000/1
The number '1' indicates how many seconds the server should wait:
http://localhost:8000/2 -> makes server wait for 2 seconds
http://localhost:8000/5 -> makes server wait for 5 seconds
I'm testing with a server that makes my consumer wait; feel free to use random waits or whatever you prefer.
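The delay server itself is not included in this post. A minimal sketch of one that behaves as described, written with the same aiohttp library (the file, handler and route names are my own choice):

# delay_server.py -- responds to /N after sleeping N seconds (hypothetical helper)
import asyncio
from aiohttp import web

async def wait_handler(request):
    seconds = int(request.match_info['seconds'])
    await asyncio.sleep(seconds)      # simulate a slow upstream response
    return web.Response(text=f'waited {seconds} seconds')

app = web.Application()
app.router.add_get('/{seconds}', wait_handler)
web.run_app(app, port=8000)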
The consumer accepts two queues: one with the URLs to retrieve and one to store the results.
At the moment the consumer doesn't put anything into the URLs queue; it only retrieves the URLs configured in the hardcoded list.
More functionality will be added in the future.
In this example aiohttp==2.3.10 is used.
(iospider) $ pip install "aiohttp==2.3.10"
# iospider.py
import asyncio
from contextlib import closing
from time import perf_counter

import aiohttp

SPIDER_WORKERS = 16


async def consume(client: aiohttp.ClientSession,
                  queue_results: asyncio.Queue,
                  queue_urls: asyncio.Queue):
    # Each worker keeps pulling URLs until the queue is empty, then exits.
    while True:
        if queue_urls.empty():
            break
        url = await queue_urls.get()
        print(f'consumed {url}')
        with aiohttp.Timeout(10):
            async with client.get(url) as response:
                if response.status == 200:
                    page = await response.text()
                    await queue_results.put(page)


def run(queue_results: asyncio.Queue, queue_urls: asyncio.Queue, workers: int):
    # One shared client session, one consume() task per worker.
    with closing(asyncio.get_event_loop()) as loop:
        with aiohttp.ClientSession() as client:
            tasks = [consume(client, queue_results, queue_urls)
                     for i in range(workers)]
            loop.run_until_complete(asyncio.gather(*tasks))


urls = ['http://localhost:8000/1', 'http://localhost:8000/2',
        'http://localhost:8000/3', 'http://localhost:8000/4'] * 6

start = perf_counter()
queue_urls = asyncio.Queue()
queue_results = asyncio.Queue()
[queue_urls.put_nowait(url) for url in urls]
# Never start more workers than there are URLs to fetch.
run(queue_results, queue_urls,
    SPIDER_WORKERS if queue_urls.qsize() > SPIDER_WORKERS else queue_urls.qsize())
print(f'Retrieved {queue_results.qsize()} pages in {perf_counter() - start}')
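The script only reports how many pages were retrieved. If you want to inspect the bodies themselves, you can drain the results queue after run() returns; a minimal sketch (my own addition, not part of the script above):

# queue_results is already filled once run() has returned,
# so the non-blocking get_nowait() is safe here.
while not queue_results.empty():
    page = queue_results.get_nowait()
    print(f'retrieved a page of {len(page)} characters')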
Example output for 24 URLs, with the spider configured with 16 workers (the same default concurrency as Scrapy).
# IOSpider output
(iospider) jesus@laptop:~/iospider$ time python iospider.py
Creating a client session outside of coroutine
client_session: <aiohttp.client.ClientSession object at 0x7f25c55ea208>
consumed http://localhost:8000/1
consumed http://localhost:8000/4
...
consumed http://localhost:8000/2
consumed http://localhost:8000/3
...
consumed http://localhost:8000/1
consumed http://localhost:8000/4
Retrieved 24 pages in 6.038133026999731
real 0m6,222s
user 0m0,217s
sys 0m0,024s
Using Scrapy to retrieve the same number of URLs with the same number of workers:
(scrapy) jesus@laptop:~/scrapy$ time scrapy runspider -s CONCURRENT_REQUESTS=16 -s CONCURRENT_REQUESTS_PER_DOMAIN=16 client_scrapy.py
# Scrapy output
...
real 0m6,870s
user 0m0,860s
sys 0m0,036s
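client_scrapy.py is not listed here; a minimal spider along the following lines would generate an equivalent workload (the class name and yielded fields are assumptions):

# client_scrapy.py (sketch)
import scrapy

class ClientSpider(scrapy.Spider):
    name = 'client'
    start_urls = ['http://localhost:8000/1', 'http://localhost:8000/2',
                  'http://localhost:8000/3', 'http://localhost:8000/4'] * 6

    def parse(self, response):
        yield {'url': response.url, 'size': len(response.body)}

Scrapy's default start_requests() sends the start URLs with dont_filter=True, so the repeated URLs are not removed by the duplicate filter.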
The times above show IOSpider finishing somewhat faster (6.2 s vs 6.9 s of real time) and using noticeably less CPU (0.22 s vs 0.86 s of user time) than Scrapy, but we need to consider that IOSpider is a very simplistic approach with only one main feature.