
A simple Python async spider (async programming with Python 3.6, step 2)

As a second step in learning async programming, I developed a very simple spider with Python 3.6 and asynchronous code.

In this case the spider requests a bunch of URLs from a server that waits before answering each request:

http://localhost:8000/1

The number '1' is the number of seconds the server should wait before responding:

http://localhost:8000/2 -> makes server wait for 2 seconds
http://localhost:8000/5 -> makes server wait for 5 seconds

I'm testing against a server that makes the consumer wait; feel free to use random waits or whatever you prefer.
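For reference, a minimal stand-in for that sleepy server can be written with just the standard library. This is my own sketch, not part of the original post: the handler reads the request path, sleeps for that many seconds, and answers with a plain HTTP response.

```python
import asyncio


async def handle(reader, writer):
    # Read the request line, e.g. "GET /3 HTTP/1.1"
    request_line = (await reader.readline()).decode()
    path = request_line.split()[1]
    seconds = int(path.lstrip('/'))
    # Drain the remaining request headers
    while (await reader.readline()) not in (b'\r\n', b''):
        pass
    # Wait the number of seconds given in the path before answering
    await asyncio.sleep(seconds)
    body = ('slept %s seconds' % seconds).encode()
    headers = 'HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n' % len(body)
    writer.write(headers.encode() + body)
    await writer.drain()
    writer.close()


def main(host='localhost', port=8000):
    # Serve until interrupted
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.start_server(handle, host, port))
    loop.run_forever()
```

Call `main()` to serve on `http://localhost:8000/`, matching the URLs used below.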

The consumer accepts two queues: one with the URLs to retrieve and one to store the results.

At the moment the consumer doesn't put anything into the URLs queue; it only retrieves the URLs configured in the hardcoded list.

More functionality will be added in the future.

In this example aiohttp==2.3.10 is used.

(iospider) $ pip install "aiohttp==2.3.10"


import asyncio
from contextlib import closing
from time import perf_counter

import aiohttp

SPIDER_WORKERS = 16

async def consume(client: aiohttp.ClientSession, queue_results: asyncio.Queue, queue_urls: asyncio.Queue):
    # There is no await between the empty() check and get_nowait(),
    # so another worker cannot steal the url in between.
    while not queue_urls.empty():
        url = queue_urls.get_nowait()
        print(f'consumed {url}')
        with aiohttp.Timeout(10):
            async with client.get(url) as response:
                if response.status == 200:
                    page = await response.text()
                    await queue_results.put(page)


def run(queue_results: asyncio.Queue, queue_urls: asyncio.Queue, workers: int):
    with closing(asyncio.get_event_loop()) as loop:
        # aiohttp warns when the session is created outside a coroutine,
        # as the output below shows, but it works for this example
        with aiohttp.ClientSession() as client:
            tasks = [consume(client, queue_results, queue_urls) for _ in range(workers)]
            loop.run_until_complete(asyncio.gather(*tasks))

urls = ['http://localhost:8000/1', 'http://localhost:8000/2', 'http://localhost:8000/3', 'http://localhost:8000/4'] * 6
start = perf_counter()
queue_urls = asyncio.Queue()
queue_results = asyncio.Queue()
for url in urls:
    queue_urls.put_nowait(url)
# Never start more workers than there are urls to consume
run(queue_results, queue_urls, min(SPIDER_WORKERS, queue_urls.qsize()))
print(f'Retrieved {queue_results.qsize()} pages in {perf_counter() - start}')
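The final print only reports the queue size; if you want the pages themselves, a tiny helper like this (my addition, not in the original script) can drain the results queue once all workers have finished:

```python
import asyncio


def drain(queue: asyncio.Queue) -> list:
    # Empty the queue synchronously; safe here because no worker
    # is running anymore when this is called
    items = []
    while not queue.empty():
        items.append(queue.get_nowait())
    return items
```

For example, `pages = drain(queue_results)` after `run(...)` returns the retrieved pages as a list.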

Example output for 24 URLs, with the spider configured with 16 workers (the same default as Scrapy):

# IOSpider output

(iospider) jesus@laptop:~/iospider$ time python iospider.py
Creating a client session outside of coroutine
client_session: <aiohttp.client.ClientSession object at 0x7f25c55ea208>
consumed http://localhost:8000/1
consumed http://localhost:8000/4
...
consumed http://localhost:8000/2
consumed http://localhost:8000/3
...
consumed http://localhost:8000/1

consumed http://localhost:8000/4
Retrieved 24 pages in 6.038133026999731

real    0m6,222s
user    0m0,217s
sys    0m0,024s

Using Scrapy to retrieve the same number of URLs with the same number of workers:

(scrapy) jesus@laptop:~/scrapy$ time scrapy runspider -s CONCURRENT_REQUESTS=16 -s CONCURRENT_REQUESTS_PER_DOMAIN=16 client_scrapy.py

# Scrapy output
...

real    0m6,870s
user    0m0,860s
sys    0m0,036s

The average times show that IOSpider is faster than Scrapy, but keep in mind that IOSpider is a very simplistic approach with only one main feature.
