Scrapy 2.6 documentation¶
Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Getting help¶
Having trouble? We’d like to help!
Try the FAQ – it’s got answers to some common questions.
Looking for specific information? Try the Index or Module Index.
Ask or search questions on StackOverflow using the scrapy tag.
Ask or search questions in the Scrapy subreddit.
Search for questions on the archives of the scrapy-users mailing list.
Ask a question in the #scrapy IRC channel.
Report bugs with Scrapy in our issue tracker.
Join the Discord community Scrapy Discord.
First steps¶
Scrapy at a glance¶
Scrapy (/ˈskreɪpaɪ/) is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.
Walk-through of an example spider¶
In order to show you what Scrapy brings to the table, we’ll walk you through an example of a Scrapy Spider using the simplest way to run a spider.
Here’s the code for a spider that scrapes famous quotes from website https://quotes.toscrape.com, following the pagination:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Put this in a text file, name it something like quotes_spider.py and run the spider using the runspider command:
scrapy runspider quotes_spider.py -o quotes.jl
When this finishes you will have in the quotes.jl file a list of the quotes in JSON Lines format, containing text and author, looking like this:
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
...
What just happened?¶
When you ran the command scrapy runspider quotes_spider.py, Scrapy looked for a Spider definition inside it and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page and schedule another request using the same parse method as callback.
Here you notice one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn't need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.

While this enables you to do very fast crawls (sending multiple concurrent requests at the same time, in a fault-tolerant way), Scrapy also gives you control over the politeness of the crawl through a few settings. You can do things like setting a download delay between each request, limiting the amount of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure these out automatically.
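For instance, a project's settings.py might tune these politeness settings along the following lines (a rough sketch; the values are illustrative, but all of the setting names are standard Scrapy settings):

# settings.py (illustrative values)
DOWNLOAD_DELAY = 2                   # wait 2 seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap simultaneous requests per domain
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt delays to server load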
Note
This uses feed exports to generate the JSON file. You can easily change the export format (XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example). You can also write an item pipeline to store the items in a database.
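For example, because feed exports infer the format from the output file extension, exporting the same spider's items as CSV is just a matter of changing the file name:

scrapy runspider quotes_spider.py -o quotes.csv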
What else?¶
You’ve seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:
Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders.
Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
Wide range of built-in extensions and middlewares for handling:
cookies and session handling
HTTP features like compression, authentication, caching
user-agent spoofing
robots.txt
crawl depth restriction
and more
A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more!
What’s next?¶
The next steps for you are to install Scrapy, follow through the tutorial to learn how to create a full-blown Scrapy project and join the community. Thanks for your interest!
Installation guide¶
Supported Python versions¶
Scrapy requires Python 3.6+, either the CPython implementation (default) or the PyPy 7.2.0+ implementation (see Alternate Implementations).
Installing Scrapy¶
If you’re using Anaconda or Miniconda, you can install the package from the conda-forge channel, which has up-to-date packages for Linux, Windows and macOS.
To install Scrapy using conda, run:
conda install -c conda-forge scrapy
Alternatively, if you’re already familiar with installation of Python packages, you can install Scrapy and its dependencies from PyPI with:
pip install Scrapy
We strongly recommend that you install Scrapy in a dedicated virtualenv, to avoid conflicting with your system packages.
Note that sometimes this may require solving compilation issues for some Scrapy dependencies depending on your operating system, so be sure to check the Platform specific installation notes.
For more detailed and platform-specific instructions, as well as troubleshooting information, read on.
Things that are good to know¶
Scrapy is written in pure Python and depends on a few key Python packages (among others):
lxml, an efficient XML and HTML parser
parsel, an HTML/XML data extraction library written on top of lxml,
w3lib, a multi-purpose helper for dealing with URLs and web page encodings
twisted, an asynchronous networking framework
cryptography and pyOpenSSL, to deal with various network-level security needs
The minimal versions which Scrapy is tested against are:
Twisted 14.0
lxml 3.4
pyOpenSSL 0.14
Scrapy may work with older versions of these packages but it is not guaranteed it will continue working because it’s not being tested against them.
Some of these packages themselves depend on non-Python packages that might require additional installation steps depending on your platform. Please check the platform-specific guides below.
In case of any trouble related to these dependencies, please refer to their respective installation instructions:
Using a virtual environment (recommended)¶
TL;DR: We recommend installing Scrapy inside a virtual environment on all platforms.
Python packages can be installed either globally (a.k.a. system wide), or in user-space. We do not recommend installing Scrapy system wide.

Instead, we recommend that you install Scrapy within a so-called "virtual environment" (venv). Virtual environments allow you to not conflict with already-installed Python system packages (which could break some of your system tools and scripts), and still install packages normally with pip (without sudo and the likes).
See Virtual Environments and Packages on how to create your virtual environment.
Once you have created a virtual environment, you can install Scrapy inside it with pip, just like any other Python package. (See the platform-specific guides below for non-Python dependencies that you may need to install beforehand.)
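For example, on Linux or macOS the whole sequence might look like this (a sketch; the environment name scrapy-env is arbitrary):

python3 -m venv scrapy-env
source scrapy-env/bin/activate    # on Windows: scrapy-env\Scripts\activate
pip install scrapy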
Platform specific installation notes¶
Windows¶
Though it's possible to install Scrapy on Windows using pip, we recommend that you install Anaconda or Miniconda and use the package from the conda-forge channel, which will avoid most installation issues.
Once you’ve installed Anaconda or Miniconda, install Scrapy with:
conda install -c conda-forge scrapy
To install Scrapy on Windows using pip:
Warning
This installation method requires “Microsoft Visual C++” for installing some Scrapy dependencies, which demands significantly more disk space than Anaconda.
Download and execute Microsoft C++ Build Tools to install the Visual Studio Installer.
Run the Visual Studio Installer.
Under the Workloads section, select C++ build tools.
Check the installation details and make sure the following packages are selected as optional components:
MSVC (e.g. MSVC v142 - VS 2019 C++ x64/x86 build tools (v14.23))
Windows SDK (e.g. Windows 10 SDK (10.0.18362.0))
Install the Visual Studio Build Tools.
Now, you should be able to install Scrapy using pip.
Ubuntu 14.04 or above¶
Scrapy is currently tested with recent-enough versions of lxml, twisted and pyOpenSSL, and is compatible with recent Ubuntu distributions. But it should support older versions of Ubuntu too, like Ubuntu 14.04, albeit with potential issues with TLS connections.
Don't use the python-scrapy package provided by Ubuntu; it is typically too old and slow to catch up with the latest Scrapy.
To install Scrapy on Ubuntu (or Ubuntu-based) systems, you need to install these dependencies:
sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
python3-dev, zlib1g-dev, libxml2-dev and libxslt1-dev are required for lxml
libssl-dev and libffi-dev are required for cryptography
Inside a virtualenv, you can install Scrapy with pip after that:
pip install scrapy
Note
The same non-Python dependencies can be used to install Scrapy in Debian Jessie (8.0) and above.
macOS¶
Building Scrapy’s dependencies requires the presence of a C compiler and development headers. On macOS this is typically provided by Apple’s Xcode development tools. To install the Xcode command line tools open a terminal window and run:
xcode-select --install
There's a known issue that prevents pip from updating system packages. This has to be addressed to successfully install Scrapy and its dependencies. Here are some proposed solutions:
(Recommended) Don’t use system Python. Install a new, updated version that doesn’t conflict with the rest of your system. Here’s how to do it using the homebrew package manager:
Install homebrew following the instructions in https://brew.sh/
Update your PATH variable to state that homebrew packages should be used before system packages (change .bashrc to .zshrc accordingly if you're using zsh as your default shell):
echo "export PATH=/usr/local/bin:/usr/local/sbin:$PATH" >> ~/.bashrc
Reload .bashrc to ensure the changes have taken place:
source ~/.bashrc
Install python:
brew install python
Latest versions of python have pip bundled with them so you won't need to install it separately. If this is not the case, upgrade python:
brew update; brew upgrade python
(Optional) Install Scrapy inside a Python virtual environment.
This method is a workaround for the above macOS issue, but it’s an overall good practice for managing dependencies and can complement the first method.
After any of these workarounds you should be able to install Scrapy:
pip install Scrapy
PyPy¶
We recommend using the latest PyPy version. The version tested is 5.9.0. For PyPy3, only Linux installation was tested.
Most Scrapy dependencies now have binary wheels for CPython, but not for PyPy.
This means that these dependencies will be built during installation.
On macOS, you are likely to face an issue with building the Cryptography dependency. The solution to this problem is described here, that is, to brew install openssl and then export the flags that this command recommends (only needed when installing Scrapy). Installing on Linux has no special issues besides installing build dependencies.
Installing Scrapy with PyPy on Windows is not tested.
You can check that Scrapy is installed correctly by running scrapy bench. If this command gives errors such as TypeError: ... got 2 unexpected keyword arguments, this means that setuptools was unable to pick up one PyPy-specific dependency. To fix this issue, run pip install 'PyPyDispatcher>=2.1.0'.
Troubleshooting¶
AttributeError: ‘module’ object has no attribute ‘OP_NO_TLSv1_1’¶
After you install or upgrade Scrapy, Twisted or pyOpenSSL, you may get an exception with the following traceback:
[…]
File "[…]/site-packages/twisted/protocols/tls.py", line 63, in <module>
from twisted.internet._sslverify import _setAcceptableProtocols
File "[…]/site-packages/twisted/internet/_sslverify.py", line 38, in <module>
TLSVersion.TLSv1_1: SSL.OP_NO_TLSv1_1,
AttributeError: 'module' object has no attribute 'OP_NO_TLSv1_1'
The reason you get this exception is that your system or virtual environment has a version of pyOpenSSL that your version of Twisted does not support.
To install a version of pyOpenSSL that your version of Twisted supports, reinstall Twisted with the tls extra option:
pip install twisted[tls]
For details, see Issue #2473.
Scrapy Tutorial¶
In this tutorial, we’ll assume that Scrapy is already installed on your system. If that’s not the case, see Installation guide.
We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.
This tutorial will walk you through these tasks:
Creating a new Scrapy project
Writing a spider to crawl a site and extract data
Exporting the scraped data using the command line
Changing the spider to recursively follow links
Using spider arguments
Scrapy is written in Python. If you’re new to the language you might want to start by getting an idea of what the language is like, to get the most out of Scrapy.
If you’re already familiar with other languages, and want to learn Python quickly, the Python Tutorial is a good resource.
If you’re new to programming and want to start with Python, the following books may be useful to you:
You can also take a look at this list of Python resources for non-programmers, as well as the suggested resources in the learnpython-subreddit.
Creating a project¶
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:
scrapy startproject tutorial
This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Our first Spider¶
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

name: identifies the Spider. It must be unique within a project, that is, you can't set the same name for different Spiders.

start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
How to run our spider¶
To put our spider to work, go to the project’s top level directory and run:
scrapy crawl quotes
This command runs the spider with name quotes that we've just added, that will send some requests for the quotes.toscrape.com domain. You will get an output similar to this:
... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...
Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.
Note
If you are wondering why we haven’t parsed the HTML yet, hold on, we will cover that soon.
What just happened under the hood?¶
Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method) passing the response as argument.
A shortcut to the start_requests method¶
Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
The parse() method will be called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so. This happens because parse() is Scrapy's default callback method, which is called for requests without an explicitly assigned callback.
Extracting data¶
The best way to learn how to extract data with Scrapy is trying selectors using the Scrapy shell. Run:
scrapy shell 'https://quotes.toscrape.com/page/1/'
Note
Remember to always enclose urls in quotes when running Scrapy shell from the command line, otherwise urls containing arguments (i.e. the & character) will not work.
On Windows, use double quotes instead:
scrapy shell "https://quotes.toscrape.com/page/1/"
You will see something like:
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s] item {}
[s] request <GET https://quotes.toscrape.com/page/1/>
[s] response <200 https://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7fa91d888c10>
[s] spider <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Using the shell, you can try selecting elements using CSS with the response object:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.
To extract the text from the title above, you can do:
>>> response.css('title::text').getall()
['Quotes to Scrape']
There are two things to note here: one is that we've added ::text to the CSS query, to mean we want to select only the text elements directly inside the <title> element. If we don't specify ::text, we'd get the full title element, including its tags:
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']
The other thing is that the result of calling .getall() is a list: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:
>>> response.css('title::text').get()
'Quotes to Scrape'
As an alternative, you could’ve written:
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
Accessing an index on a SelectorList instance will raise an IndexError exception if there are no results:
>>> response.css('noelement')[0].get()
Traceback (most recent call last):
...
IndexError: list index out of range
You might want to use .get() directly on the SelectorList instance instead, which returns None if there are no results:
>>> response.css("noelement").get()
There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.
Besides the getall() and get() methods, you can also use the re() method to extract using regular expressions:
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
In order to find the proper CSS selectors to use, you might find it useful to open the response page from the shell in your web browser using view(response).
You can use your browser’s developer tools to inspect the HTML and come up
with a selector (see Using your browser’s Developer Tools for scraping).
Selector Gadget is also a nice tool to quickly find the CSS selector for visually selected elements, which works in many browsers.
XPath: a brief intro¶
Besides CSS, Scrapy selectors also support using XPath expressions:
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under-the-hood. You can see that if you read closely the text representation of the selector objects in the shell.
While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you're able to select things like the link that contains the text "Next Page". This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors: it will make scraping much easier.
We won’t cover much of XPath here, but you can read more about using XPath with Scrapy Selectors here. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to think in XPath”.
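As a small illustration of selecting by content, something plain CSS selectors cannot express, the pagination link could be picked by its text in the Scrapy shell (a sketch; on the quotes page used above the matching href is /page/2/):

>>> response.xpath('//a[contains(text(), "Next")]/@href').get()
'/page/2/'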
Extracting data in our spider¶
Let’s get back to our spider. Until now, it doesn’t extract any data in particular, just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.
A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
If you run this spider, it will output the extracted data with the log:
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
Storing the scraped data¶
The simplest way to store the scraped data is by using Feed exports, with the following command:
scrapy crawl quotes -O quotes.json
That will generate a quotes.json file containing all scraped items, serialized in JSON.

The -O command-line switch overwrites any existing file; use -o instead to append new content to any existing file. However, appending to a JSON file makes the file contents invalid JSON. When appending to a file, consider using a different serialization format, such as JSON Lines:
scrapy crawl quotes -o quotes.jl
The JSON Lines format is useful because it's stream-like; you can easily append new records to it. It doesn't have the same problem as JSON when you run the command twice. Also, as each record is a separate line, you can process big files without having to fit everything in memory; there are tools like JQ to help do that at the command line.
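As a small sketch of that line-by-line processing in Python (assuming the quotes.jl file produced above, whose records contain text and author keys):

import json

with open('quotes.jl') as f:
    for line in f:
        quote = json.loads(line)          # each line is one complete JSON record
        print(quote['author'], '-', quote['text'])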
In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. Though you don't need to implement any item pipelines if you just want to store the scraped items.
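To give a rough idea of what such a pipeline could look like, here is a minimal sketch (the class name and output file name are just examples); it writes each item to a JSON Lines file and would be enabled by listing it in the ITEM_PIPELINES setting in settings.py:

# tutorial/pipelines.py (illustrative)
import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # serialize the item and keep passing it along to any later pipelines
        self.file.write(json.dumps(dict(item)) + '\n')
        return item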
Following links¶
Let’s say, instead of just scraping the stuff from the first two pages from https://quotes.toscrape.com, you want quotes from all the pages in the website.
Now that you know how to extract data from pages, let’s see how to follow links from them.
First thing is to extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>
We can try extracting it in the shell:
>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
This gets the anchor element, but we want the attribute href. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
There is also an attrib property available (see Selecting element attributes for more):
>>> response.css('li.next a').attrib['href']
'/page/2/'
Let’s see now our spider modified to recursively follow the link to the next page, extracting data from it:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.
What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.
Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page it’s visiting.
In our example, it creates a sort of loop, following all the links to the next page until it doesn’t find one – handy for crawling blogs, forums and other sites with pagination.
A shortcut for creating Requests¶
As a shortcut for creating Request objects you can use response.follow:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.

You can also pass a selector to response.follow instead of a string; this selector should extract necessary attributes:
for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)
For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:
for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)
To create multiple requests from an iterable, you can use response.follow_all instead:
anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)
or, shortening it further:
yield from response.follow_all(css='ul.pager a', callback=self.parse)
More examples and patterns¶
Here is another spider that illustrates callbacks and following links, this time for scraping author information:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
This spider will start from the main page, it will follow all the links to the authors pages calling the parse_author callback for each of them, and also the pagination links with the parse callback as we saw before.

Here we're passing callbacks to response.follow_all as positional arguments to make the code shorter; it also works for Request.
The parse_author callback defines a helper function to extract and cleanup the data from a CSS query and yields the Python dict with the author data.
Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.
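If you ever need the opposite behaviour, revisiting a URL even though it has already been seen, an individual request can opt out of the duplicate filter with dont_filter=True (a small sketch; url here stands for whatever URL you want to revisit):

# inside a spider callback: bypass the duplicate filter for this one request
yield scrapy.Request(url, callback=self.parse, dont_filter=True)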
Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with Scrapy.
As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class for a generic spider that implements a small rules engine that you can use to write your crawlers on top of it.
Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.
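One way to do that, sketched below with made-up selectors and callback names, is to pass the partially built item on to the next callback through cb_kwargs:

def parse(self, response):
    # start the item on the listing page (h1 and a.details are hypothetical selectors)
    item = {'title': response.css('h1::text').get()}
    details_url = response.css('a.details::attr(href)').get()
    yield response.follow(details_url, self.parse_details, cb_kwargs={'item': item})

def parse_details(self, response, item):
    # finish the item with data from the second page, then yield it
    item['description'] = response.css('div.description::text').get()
    yield item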
Using spider arguments¶
You can provide command line arguments to your spiders by using the -a option when running them:
scrapy crawl quotes -O quotes-humor.json -a tag=humor
These arguments are passed to the Spider's __init__ method and become spider attributes by default.

In this example, the value provided for the tag argument will be available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'https://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
If you pass the tag=humor argument to this spider, you'll notice that it will only visit URLs from the humor tag, such as https://quotes.toscrape.com/tag/humor.
Next steps¶
This tutorial covered only the basics of Scrapy, but there’s a lot of other features not mentioned here. Check the What else? section in Scrapy at a glance chapter for a quick overview of the most important ones.
You can continue from the section Basic concepts to know more about the command-line tool, spiders, selectors and other things the tutorial hasn’t covered like modeling the scraped data. If you prefer to play with an example project, check the Examples section.
Examples¶
The best way to learn is with examples, and Scrapy is no exception. For this reason, there is an example Scrapy project named quotesbot, that you can use to play and learn more about Scrapy. It contains two spiders for https://quotes.toscrape.com, one using CSS selectors and another one using XPath expressions.
The quotesbot project is available at: https://github.com/scrapy/quotesbot. You can find more information about it in the project’s README.
If you’re familiar with git, you can checkout the code. Otherwise you can download the project as a zip file by clicking here.
- Scrapy at a glance
Understand what Scrapy is and how it can help you.
- Installation guide
Get Scrapy installed on your computer.
- Scrapy Tutorial
Write your first Scrapy project.
- Examples
Learn more by playing with a pre-made Scrapy project.
Basic concepts¶
Command line tool¶
Scrapy is controlled through the scrapy command-line tool, to be referred here as the "Scrapy tool" to differentiate it from the sub-commands, which we just call "commands" or "Scrapy commands".
The Scrapy tool provides several commands, for multiple purposes, and each one accepts a different set of arguments and options.
(The scrapy deploy command has been removed in 1.0 in favor of the standalone scrapyd-deploy. See Deploying your project.)
Configuration settings¶
Scrapy will look for configuration parameters in ini-style scrapy.cfg files in standard locations:
/etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-wide),
~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) for global (user-wide) settings, and
scrapy.cfg inside a Scrapy project's root (see next section).
Settings from these files are merged in the listed order of preference: user-defined values have higher priority than system-wide defaults and project-wide settings will override all others, when defined.
Scrapy also understands, and can be configured through, a number of environment variables. Currently these are:
SCRAPY_SETTINGS_MODULE (see Designating the settings)
SCRAPY_PROJECT (see Sharing the root directory between projects)
SCRAPY_PYTHON_SHELL (see Scrapy shell)
Default structure of Scrapy projects¶
Before delving into the command-line tool and its sub-commands, let’s first understand the directory structure of a Scrapy project.
Though it can be modified, all Scrapy projects have the same file structure by default, similar to this:
scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
The directory where the scrapy.cfg file resides is known as the project root directory. That file contains the name of the python module that defines the project settings. Here is an example:
[settings]
default = myproject.settings
Using the scrapy tool¶
You can start by running the Scrapy tool with no arguments and it will print some usage help and the available commands:
Scrapy X.Y - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
crawl Run a spider
fetch Fetch a URL using the Scrapy downloader
[...]
The first line will print the currently active project if you’re inside a Scrapy project. In this example it was run from outside a project. If run from inside a project it would have printed something like this:
Scrapy X.Y - project: myproject
Usage:
scrapy <command> [options] [args]
[...]
Creating projects¶
The first thing you typically do with the scrapy tool is create your Scrapy project:
scrapy startproject myproject [project_dir]
That will create a Scrapy project under the project_dir directory. If project_dir wasn't specified, project_dir will be the same as myproject.
Next, you go inside the new project directory:
cd project_dir
And you're ready to use the scrapy command to manage and control your project from there.
Controlling projects¶
You use the scrapy tool from inside your projects to control and manage them.
For example, to create a new spider:
scrapy genspider mydomain mydomain.com
Some Scrapy commands (like crawl) must be run from inside a Scrapy project. See the commands reference below for more information on which commands must be run from inside projects, and which not.

Also keep in mind that some commands may have slightly different behaviours when running them from inside projects. For example, the fetch command will use spider-overridden behaviours (such as the user_agent attribute to override the user-agent) if the url being fetched is associated with some specific spider. This is intentional, as the fetch command is meant to be used to check how spiders are downloading pages.
Available tool commands¶
This section contains a list of the available built-in commands with a description and some usage examples. Remember, you can always get more info about each command by running:
scrapy <command> -h
And you can see all available commands with:
scrapy -h
There are two kinds of commands, those that only work from inside a Scrapy project (Project-specific commands) and those that also work without an active Scrapy project (Global commands), though they may behave slightly differently when running from inside a project (as they would use the project overridden settings).
Global commands:
Project-only commands:
startproject¶
Syntax:
scrapy startproject <project_name> [project_dir]
Requires project: no
Creates a new Scrapy project named project_name, under the project_dir directory. If project_dir wasn't specified, project_dir will be the same as project_name.
Usage example:
$ scrapy startproject myproject
genspider¶
Syntax:
scrapy genspider [-t template] <name> <domain or URL>
Requires project: no
New in version 2.6.0: The ability to pass a URL instead of a domain.
Create a new spider in the current folder or in the current project's spiders folder, if called from inside a project. The <name> parameter is set as the spider's name, while <domain or URL> is used to generate the allowed_domains and start_urls spider attributes.
Note
Even if an HTTPS URL is specified, the protocol used in start_urls is always HTTP. This is a known issue: issue 3553.
Usage example:
$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
$ scrapy genspider example example.com
Created spider 'example' using template 'basic'
$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
This is just a convenience shortcut command for creating spiders based on pre-defined templates, but certainly not the only way to create spiders. You can just create the spider source code files yourself, instead of using this command.
crawl¶
Syntax:
scrapy crawl <spider>
Requires project: yes
Start crawling using a spider.
Usage examples:
$ scrapy crawl myspider
[ ... myspider starts crawling ... ]
check¶
Syntax:
scrapy check [-l] <spider>
Requires project: yes
Run contract checks.
Usage examples:
$ scrapy check -l
first_spider
* parse
* parse_item
second_spider
* parse
* parse_item
$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing
[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
list¶
Syntax:
scrapy list
Requires project: yes
List all available spiders in the current project. The output is one spider per line.
Usage example:
$ scrapy list
spider1
spider2
edit¶
Syntax:
scrapy edit <spider>
Requires project: yes
Edit the given spider using the editor defined in the EDITOR environment variable or (if unset) the EDITOR setting.
This command is provided only as a convenience shortcut for the most common case; the developer is of course free to choose any tool or IDE to write and debug spiders.
Usage example:
$ scrapy edit spider1
fetch¶
Syntax:
scrapy fetch <url>
Requires project: no
Downloads the given URL using the Scrapy downloader and writes the contents to standard output.
The interesting thing about this command is that it fetches the page how the spider would download it. For example, if the spider has a USER_AGENT attribute which overrides the User Agent, it will use that one.
So this command can be used to “see” how your spider would fetch a certain page.
If used outside a project, no particular per-spider behaviour would be applied and it will just use the default Scrapy downloader settings.
Supported options:
--spider=SPIDER: bypass spider autodetection and force use of specific spider
--headers: print the response's HTTP headers instead of the response's body
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
Usage examples:
$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]
$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
'Age': ['1263 '],
'Connection': ['close '],
'Content-Length': ['596'],
'Content-Type': ['text/html; charset=UTF-8'],
'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
'Etag': ['"573c1-254-48c9c87349680"'],
'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
'Server': ['Apache/2.2.3 (CentOS)']}
view¶
Syntax:
scrapy view <url>
Requires project: no
Opens the given URL in a browser, as your Scrapy spider would “see” it. Sometimes spiders see pages differently from regular users, so this can be used to check what the spider “sees” and confirm it’s what you expect.
Supported options:
--spider=SPIDER: bypass spider autodetection and force use of specific spider
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them)
Usage example:
$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]
shell¶
Syntax:
scrapy shell [url]
Requires project: no
Starts the Scrapy shell for the given URL (if given) or empty if no URL is given. Also supports UNIX-style local file paths, either relative with ./ or ../ prefixes or absolute file paths.
See Scrapy shell for more info.
Supported options:
--spider=SPIDER: bypass spider autodetection and force use of specific spider
-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them); this only affects the URL you may pass as argument on the command line; once you are inside the shell, fetch(url) will still follow HTTP redirects by default.
Usage example:
$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]
$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')
# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')
# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
parse¶
Syntax:
scrapy parse <url> [options]
Requires project: yes
Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if not given.
Supported options:
--spider=SPIDER: bypass spider autodetection and force use of specific spider
--a NAME=VALUE: set spider argument (may be repeated)
--callback or -c: spider method to use as callback for parsing the response
--meta or -m: additional request meta that will be passed to the callback request. This must be a valid json string. Example: --meta='{"foo" : "bar"}'
--cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid json string. Example: --cbkwargs='{"foo" : "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
--output or -o: dump scraped items to a file. New in version 2.3.
Usage example:
$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items ------------------------------------------------------------
[{'name': 'Example item',
'category': 'Furniture',
'length': '12 cm'}]
# Requests -----------------------------------------------------------------
[]
settings¶
Syntax:
scrapy settings [options]
Requires project: no
Get the value of a Scrapy setting.
If used inside a project it’ll show the project setting value, otherwise it’ll show the default Scrapy value for that setting.
Example usage:
$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
runspider¶
Syntax:
scrapy runspider <spider_file.py>
Requires project: no
Run a spider self-contained in a Python file, without having to create a project.
Example usage:
$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
version¶
Syntax:
scrapy version [-v]
Requires project: no
Prints the Scrapy version. If used with -v it also prints Python, Twisted and Platform info, which is useful for bug reports.
bench¶
Syntax:
scrapy bench
Requires project: no
Run a quick benchmark test. See Benchmarking.
Custom project commands¶
You can also add your custom project commands by using the COMMANDS_MODULE setting. See the Scrapy commands in scrapy/commands for examples on how to implement your commands.
COMMANDS_MODULE¶
Default: '' (empty string)
A module to use for looking up custom Scrapy commands. This is used to add custom commands for your Scrapy project.
Example:
COMMANDS_MODULE = 'mybot.commands'
Register commands via setup.py entry points¶
You can also add Scrapy commands from an external library by adding a scrapy.commands section in the entry points of the library setup.py file. The following example adds the my_command command:
from setuptools import setup, find_packages

setup(
    name='scrapy-mymodule',
    entry_points={
        'scrapy.commands': [
            'my_command=my_scrapy_module.commands:MyCommand',
        ],
    },
)
Spiders¶
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).
For spiders, the scraping cycle goes through something like this:
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.

In the callback function, you parse the response (web page) and return item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled into Scrapy for different purposes. We will talk about those types here.
scrapy.Spider¶
- class scrapy.spiders.Spider¶
- class scrapy.Spider¶
This is the simplest spider, and the one from which every other spider must inherit (including spiders that come bundled with Scrapy, as well as spiders that you write yourself). It doesn't provide any special functionality. It just provides a default start_requests() implementation which sends requests from the start_urls spider attribute and calls the spider's method parse for each of the resulting responses.

- name¶
A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it’s required.
If the spider scrapes a single domain, a common practice is to name the spider after the domain, with or without the TLD. So, for example, a spider that crawls mywebsite.com would often be called mywebsite.
- allowed_domains¶
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won't be followed if OffsiteMiddleware is enabled.

Let's say your target url is https://www.example.com/1.html, then add 'example.com' to the list.
- start_urls¶
A list of URLs where the spider will begin to crawl from, when no particular URLs are specified. So, the first pages downloaded will be those listed here. The subsequent Request will be generated successively from data contained in the start URLs.
- custom_settings¶
A dictionary of settings that will be overridden from the project wide configuration when running this spider. It must be defined as a class attribute since the settings are updated before instantiation.
For a list of available built-in settings see: Built-in settings reference.
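For instance, a spider could override a couple of throttling settings for itself regardless of the project-wide values (a sketch; the spider name is arbitrary, the setting names are standard Scrapy settings):

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    # applied on top of the project settings when this spider runs
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
    }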
- crawler¶
This attribute is set by the from_crawler() class method after initializing the class, and links to the Crawler object to which this spider instance is bound.

Crawlers encapsulate a lot of components in the project for their single entry access (such as extensions, middlewares, signals managers, etc). See Crawler API to know more about them.
- settings¶
Configuration for running this spider. This is a Settings instance; see the Settings topic for a detailed introduction on this subject.
- logger¶
Python logger created with the Spider's name. You can use it to send log messages, as described in Logging from Spiders.
- state¶
A dict you can use to persist some spider state between batches. See Keeping persistent state between batches to know more about it.
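As a small sketch of how this could be used, a callback might keep a running counter that survives between batches (the key name here is arbitrary):

def parse(self, response):
    # stored in self.state, which is persisted between batches when that feature is enabled
    self.state['pages_crawled'] = self.state.get('pages_crawled', 0) + 1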
- from_crawler(crawler, *args, **kwargs)¶
This is the class method used by Scrapy to create your spiders.
You probably won't need to override this directly because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

Nonetheless, this method sets the crawler and settings attributes in the new instance so they can be accessed later inside the spider's code.

- Parameters
crawler (Crawler instance) – crawler to which the spider will be bound
args (list) – arguments passed to the __init__() method
kwargs (dict) – keyword arguments passed to the __init__() method
- start_requests()¶
This method must return an iterable with the first Requests to crawl for this spider. It is called by Scrapy when the spider is opened for scraping. Scrapy calls it only once, so it is safe to implement start_requests() as a generator.

The default implementation generates Request(url, dont_filter=True) for each url in start_urls.

If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in using a POST request, you could do:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
- parse(response)¶
This is the default callback used by Scrapy to process downloaded responses, when their requests don’t specify a callback.
The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Requests callbacks have the same requirements as the Spider class.

This method, as well as any other Request callback, must return an iterable of Request and/or item objects.

- Parameters
response (Response) – the response to parse
- log(message[, level, component])¶
Wrapper that sends a log message through the Spider's logger, kept for backward compatibility. For more information see Logging from Spiders.
- closed(reason)¶
Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.
Let’s see an example:
import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
Return multiple Requests and items from a single callback:
import scrapy
class MySpider(scrapy.Spider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = [
'http://www.example.com/1.html',
'http://www.example.com/2.html',
'http://www.example.com/3.html',
]
def parse(self, response):
for h3 in response.xpath('//h3').getall():
yield {"title": h3}
for href in response.xpath('//a/@href').getall():
yield scrapy.Request(response.urljoin(href), self.parse)
Instead of start_urls
you can use start_requests()
directly;
to give data more structure you can use Item
objects:
import scrapy
from myproject.items import MyItem
class MySpider(scrapy.Spider):
name = 'example.com'
allowed_domains = ['example.com']
def start_requests(self):
yield scrapy.Request('http://www.example.com/1.html', self.parse)
yield scrapy.Request('http://www.example.com/2.html', self.parse)
yield scrapy.Request('http://www.example.com/3.html', self.parse)
def parse(self, response):
for h3 in response.xpath('//h3').getall():
yield MyItem(title=h3)
for href in response.xpath('//a/@href').getall():
yield scrapy.Request(response.urljoin(href), self.parse)
Spider arguments¶
Spiders can receive arguments that modify their behaviour. Some common uses for spider arguments are to define the start URLs or to restrict the crawl to certain sections of the site, but they can be used to configure any functionality of the spider.
Spider arguments are passed through the crawl
command using the
-a
option. For example:
scrapy crawl myspider -a category=electronics
Spiders can access arguments in their __init__ methods:
import scrapy
class MySpider(scrapy.Spider):
name = 'myspider'
def __init__(self, category=None, *args, **kwargs):
super(MySpider, self).__init__(*args, **kwargs)
self.start_urls = [f'http://www.example.com/categories/{category}']
# ...
The default __init__ method will take any spider arguments and copy them to the spider as attributes. The above example can also be written as follows:
import scrapy
class MySpider(scrapy.Spider):
name = 'myspider'
def start_requests(self):
yield scrapy.Request(f'http://www.example.com/categories/{self.category}')
If you are running Scrapy from a script, you can
specify spider arguments when calling
CrawlerProcess.crawl
or
CrawlerRunner.crawl
:
process = CrawlerProcess()
process.crawl(MySpider, category="electronics")
Keep in mind that spider arguments are only strings.
The spider will not do any parsing on its own.
If you were to set the start_urls
attribute from the command line,
you would have to parse it on your own into a list
using something like ast.literal_eval()
or json.loads()
and then set it as an attribute.
Otherwise, you would cause iteration over a start_urls
string
(a very common Python pitfall)
resulting in each character being seen as a separate URL.
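A minimal sketch of that approach, assuming the list is passed as a JSON string (the argument name start_urls and the JSON format are just one possible convention):
import json

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, start_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if start_urls:
            # the -a value arrives as a plain string, so decode it into a list here
            self.start_urls = json.loads(start_urls)
It could then be run with something like:
scrapy crawl myspider -a start_urls='["http://www.example.com/1.html", "http://www.example.com/2.html"]'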
A valid use case is to set the http auth credentials
used by HttpAuthMiddleware
or the user agent
used by UserAgentMiddleware
:
scrapy crawl myspider -a http_user=myuser -a http_pass=mypassword -a user_agent=mybot
Spider arguments can also be passed through the Scrapyd schedule.json
API.
See Scrapyd documentation.
Generic Spiders¶
Scrapy comes with some useful generic spiders that you can use to subclass your spiders from. Their aim is to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from Sitemaps, or parsing an XML/CSV feed.
For the examples used in the following spiders, we’ll assume you have a project
with a TestItem
declared in a myproject.items
module:
import scrapy
class TestItem(scrapy.Item):
id = scrapy.Field()
name = scrapy.Field()
description = scrapy.Field()
CrawlSpider¶
- class scrapy.spiders.CrawlSpider[source]¶
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best fit for your particular websites or project, but it's generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.
Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:
- rules¶
Which is a list of one (or more)
Rule
objects. Each Rule
defines a certain behaviour for crawling the site. Rule objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.
This spider also exposes an overridable method:
- parse_start_url(response, **kwargs)[source]¶
This method is called for each response produced for the URLs in the spider’s
start_urls
attribute. It allows parsing the initial responses and must return either an item object, a Request
object, or an iterable containing any of them.
Crawling rules¶
- class scrapy.spiders.Rule(link_extractor=None, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None, errback=None)[source]¶
link_extractor
is a Link Extractor object which defines how links will be extracted from each crawled page. Each produced link will be used to generate a Request
object, which will contain the link's text in its meta
dictionary (under the link_text
key). If omitted, a default link extractor created with no arguments will be used, resulting in all links being extracted.
callback
is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link extractor. This callback receives a Response
as its first argument and must return either a single instance or an iterable of item objects and/or Request
objects (or any subclass of them). As mentioned above, the received Response
object will contain the text of the link that produced the Request
in its meta
dictionary (under the link_text
key).
cb_kwargs
is a dict containing the keyword arguments to be passed to the callback function.
follow
is a boolean which specifies if links should be followed from each response extracted with this rule. If callback
is None, follow
defaults to True
, otherwise it defaults to False
.
process_links
is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor
. This is mainly used for filtering purposes.
process_request
is a callable (or a string, in which case a method from the spider object with that name will be used) which will be called for every Request
extracted by this rule. This callable should take said request as first argument and the Response
from which the request originated as second argument. It must return a Request
object or None
(to filter out the request).
errback
is a callable or a string (in which case a method from the spider object with that name will be used) to be called if any exception is raised while processing a request generated by the rule. It receives a Twisted Failure
instance as first parameter.
Warning
Because of its internal implementation, you must explicitly set callbacks for new requests when writing
CrawlSpider
-based spiders; unexpected behaviour can occur otherwise.
New in version 2.0: The errback parameter.
CrawlSpider example¶
Let’s now take a look at an example CrawlSpider with rules:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class MySpider(CrawlSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com']
rules = (
# Extract links matching 'category.php' (but not matching 'subsection.php')
# and follow links from them (since no callback means follow=True by default).
Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
# Extract links matching 'item.php' and parse them with the spider's method parse_item
Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
)
def parse_item(self, response):
self.logger.info('Hi, this is an item page! %s', response.url)
item = {}
item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = response.xpath('//td[@id="item_name"]/text()').get()
item['description'] = response.xpath('//td[@id="item_description"]/text()').get()
item['link_text'] = response.meta['link_text']
url = response.xpath('//td[@id="additional_data"]/@href').get()
return response.follow(url, self.parse_additional_page, cb_kwargs=dict(item=item))
def parse_additional_page(self, response, item):
item['additional_data'] = response.xpath('//p[@id="additional_data"]/text()').get()
return item
This spider would start crawling example.com’s home page, collecting category
links, and item links, parsing the latter with the parse_item
method. For
each item response, some data will be extracted from the HTML using XPath, and
an Item
will be filled with it.
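The rules above only use link extractors and callbacks. As a further illustration, here is a hedged sketch of how the process_links hook described earlier could be used to drop some of the extracted links before they are scheduled (the spider and the filtering condition are made up for the example):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FilteringSpider(CrawlSpider):
    name = 'filtering.example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'item\.php', )),
             callback='parse_item',
             # called with the list of links extracted by this rule
             process_links='filter_links'),
    )

    def filter_links(self, links):
        # keep only links that do not look like print-friendly duplicates
        return [link for link in links if 'print=' not in link.url]

    def parse_item(self, response):
        self.logger.info('Item page: %s', response.url)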
XMLFeedSpider¶
- class scrapy.spiders.XMLFeedSpider[source]¶
XMLFeedSpider is designed for parsing XML feeds by iterating through them by a certain node name. The iterator can be chosen from:
iternodes
, xml
, and html
. It's recommended to use the iternodes
iterator for performance reasons, since the xml
and html
iterators generate the whole DOM at once in order to parse it. However, using html
as the iterator may be useful when parsing XML with bad markup.
To set the iterator and the tag name, you must define the following class attributes:
- iterator¶
A string which defines the iterator to use. It can be either:
'iternodes'
- a fast iterator based on regular expressions
'html'
- an iterator which uses Selector
. Keep in mind this uses DOM parsing and must load all DOM in memory which could be a problem for big feeds
'xml'
- an iterator which uses Selector
. Keep in mind this uses DOM parsing and must load all DOM in memory which could be a problem for big feeds
It defaults to:
'iternodes'
.
- itertag¶
A string with the name of the node (or element) to iterate in. Example:
itertag = 'product'
- namespaces¶
A list of
(prefix, uri)
tuples which define the namespaces available in that document that will be processed with this spider. The prefix
and uri
will be used to automatically register namespaces using the register_namespace()
method. You can then specify nodes with namespaces in the
itertag
attribute.
Example:
class YourSpider(XMLFeedSpider):
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    # ...
Apart from these new attributes, this spider has the following overridable methods too:
- adapt_response(response)[source]¶
A method that receives the response as soon as it arrives from the spider middleware, before the spider starts parsing it. It can be used to modify the response body before parsing it. This method receives a response and also returns a response (it could be the same or another one).
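For example, a feed that prefixes every tag with a vendor namespace prefix could be cleaned up before parsing; this is only a sketch, and the prefix being stripped is hypothetical:
from scrapy.spiders import XMLFeedSpider

class FixedFeedSpider(XMLFeedSpider):
    name = 'fixedfeed.example.com'
    start_urls = ['http://www.example.com/feed.xml']
    itertag = 'item'

    def adapt_response(self, response):
        # strip a (hypothetical) 'ns:' prefix from every tag before parsing
        body = response.body.replace(b'<ns:', b'<').replace(b'</ns:', b'</')
        return response.replace(body=body)

    def parse_node(self, response, node):
        yield {'id': node.xpath('@id').get()}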
- parse_node(response, selector)[source]¶
This method is called for the nodes matching the provided tag name (
itertag
). Receives the response and a Selector
for each node. Overriding this method is mandatory. Otherwise, your spider won't work. This method must return an item object, a Request
object, or an iterable containing any of them.
- process_results(response, results)[source]¶
This method is called for each result (item or request) returned by the spider, and it's intended to perform any last-minute processing required before returning the results to the framework core, for example setting the item IDs. It receives a list of results and the response which originated those results. It must return a list of results (items or requests).
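A sketch of what such a hook might look like on an XMLFeedSpider subclass, here tagging every item with the URL of the feed it came from (the field name is made up for the example):
from scrapy.spiders import XMLFeedSpider

class AnnotatingFeedSpider(XMLFeedSpider):
    name = 'annotatingfeed.example.com'
    start_urls = ['http://www.example.com/feed.xml']
    itertag = 'item'

    def parse_node(self, response, node):
        yield {'id': node.xpath('@id').get()}

    def process_results(self, response, results):
        processed = []
        for result in results:
            if isinstance(result, dict):
                # last-minute processing: remember which feed produced this item
                result['feed_url'] = response.url
            processed.append(result)
        return processed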
Warning
Because of its internal implementation, you must explicitly set callbacks for new requests when writing
XMLFeedSpider
-based spiders; unexpected behaviour can occur otherwise.
XMLFeedSpider example¶
These spiders are pretty easy to use, let’s have a look at one example:
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem
class MySpider(XMLFeedSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com/feed.xml']
iterator = 'iternodes' # This is actually unnecessary, since it's the default value
itertag = 'item'
def parse_node(self, response, node):
self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.getall()))
item = TestItem()
item['id'] = node.xpath('@id').get()
item['name'] = node.xpath('name').get()
item['description'] = node.xpath('description').get()
return item
Basically what we did up there was to create a spider that downloads a feed from
the given start_urls
, and then iterates through each of its item
tags,
prints them out, and stores some random data in an Item
.
CSVFeedSpider¶
- class scrapy.spiders.CSVFeedSpider[source]¶
This spider is very similar to the XMLFeedSpider, except that it iterates over rows, instead of nodes. The method that gets called in each iteration is
parse_row()
.- delimiter¶
A string with the separator character for each field in the CSV file. Defaults to
','
(comma).
- quotechar¶
A string with the enclosure character for each field in the CSV file. Defaults to
'"'
(quotation mark).
- headers¶
A list of the column names in the CSV file.
CSVFeedSpider example¶
Let’s see an example similar to the previous one, but using a
CSVFeedSpider
:
from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem
class MySpider(CSVFeedSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com/feed.csv']
delimiter = ';'
quotechar = "'"
headers = ['id', 'name', 'description']
def parse_row(self, response, row):
self.logger.info('Hi, this is a row!: %r', row)
item = TestItem()
item['id'] = row['id']
item['name'] = row['name']
item['description'] = row['description']
return item
SitemapSpider¶
- class scrapy.spiders.SitemapSpider[source]¶
SitemapSpider allows you to crawl a site by discovering the URLs using Sitemaps.
It supports nested sitemaps and discovering sitemap urls from robots.txt.
- sitemap_urls¶
A list of urls pointing to the sitemaps whose urls you want to crawl.
You can also point to a robots.txt and it will be parsed to extract sitemap urls from it.
- sitemap_rules¶
A list of tuples
(regex, callback)
where:regex
is a regular expression to match urls extracted from sitemaps. regex
can be either a str or a compiled regex object.
callback is the callback to use for processing the urls that match the regular expression.
callback
can be a string (indicating the name of a spider method) or a callable.
For example:
sitemap_rules = [('/product/', 'parse_product')]
Rules are applied in order, and only the first one that matches will be used.
If you omit this attribute, all urls found in sitemaps will be processed with the
parse
callback.
- sitemap_follow¶
A list of regexes of sitemap URLs that should be followed. This is only for sites that use Sitemap index files that point to other sitemap files.
By default, all sitemaps are followed.
- sitemap_alternate_links¶
Specifies if alternate links for one
url
should be followed. These are links for the same website in another language passed within the same url
block.
For example:
<url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>
With
sitemap_alternate_links
set, this would retrieve both URLs. With sitemap_alternate_links
disabled, only http://example.com/
would be retrieved.
Default is
sitemap_alternate_links
disabled.
- sitemap_filter(entries)[source]¶
This is a filter function that could be overridden to select sitemap entries based on their attributes.
For example:
<url>
    <loc>http://example.com/</loc>
    <lastmod>2005-01-01</lastmod>
</url>
We can define a
sitemap_filter
function to filter entries
by date:
from datetime import datetime

from scrapy.spiders import SitemapSpider

class FilteredSitemapSpider(SitemapSpider):
    name = 'filtered_sitemap_spider'
    allowed_domains = ['example.com']
    sitemap_urls = ['http://example.com/sitemap.xml']

    def sitemap_filter(self, entries):
        for entry in entries:
            date_time = datetime.strptime(entry['lastmod'], '%Y-%m-%d')
            if date_time.year >= 2005:
                yield entry
This would retrieve only
entries
modified in 2005 and the following years.
Entries are dict objects extracted from the sitemap document. Usually, the key is the tag name and the value is the text inside it.
It’s important to notice that:
as the loc attribute is required, entries without this tag are discarded
alternate links are stored in a list with the key
alternate
(see sitemap_alternate_links
)
namespaces are removed, so lxml tags named as
{namespace}tagname
become only tagname
If you omit this method, all entries found in sitemaps will be processed, observing other attributes and their settings.
SitemapSpider examples¶
Simplest example: process all urls discovered through sitemaps using the
parse
callback:
from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
sitemap_urls = ['http://www.example.com/sitemap.xml']
def parse(self, response):
pass # ... scrape item here ...
Process some urls with certain callback and other urls with a different callback:
from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
sitemap_urls = ['http://www.example.com/sitemap.xml']
sitemap_rules = [
('/product/', 'parse_product'),
('/category/', 'parse_category'),
]
def parse_product(self, response):
pass # ... scrape product ...
def parse_category(self, response):
pass # ... scrape category ...
Follow sitemaps defined in the robots.txt file and only follow sitemaps
whose url contains /sitemap_shop
:
from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
sitemap_urls = ['http://www.example.com/robots.txt']
sitemap_rules = [
('/shop/', 'parse_shop'),
]
sitemap_follow = ['/sitemap_shops']
def parse_shop(self, response):
pass # ... scrape shop here ...
Combine SitemapSpider with other sources of urls:
import scrapy
from scrapy.spiders import SitemapSpider
class MySpider(SitemapSpider):
sitemap_urls = ['http://www.example.com/robots.txt']
sitemap_rules = [
('/shop/', 'parse_shop'),
]
other_urls = ['http://www.example.com/about']
def start_requests(self):
requests = list(super(MySpider, self).start_requests())
requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
return requests
def parse_shop(self, response):
pass # ... scrape shop here ...
def parse_other(self, response):
pass # ... scrape other here ...
Selectors¶
When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this, such as:
BeautifulSoup is a very popular web scraping library among Python programmers which constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it’s slow.
lxml is an XML parsing library (which also parses HTML) with a pythonic API based on
ElementTree
. (lxml is not part of the Python standard library.)
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.
Note
Scrapy Selectors is a thin wrapper around parsel library; the purpose of this wrapper is to provide better integration with Scrapy Response objects.
parsel is a stand-alone web scraping library which can be used without Scrapy. It uses lxml library under the hood, and implements an easy API on top of lxml API. It means Scrapy selectors are very similar in speed and parsing accuracy to lxml.
Using selectors¶
Constructing selectors¶
Response objects expose a Selector
instance
on .selector
attribute:
>>> response.selector.xpath('//span/text()').get()
'good'
Querying responses using XPath and CSS is so common that responses include two
more shortcuts: response.xpath()
and response.css()
:
>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'
Scrapy selectors are instances of Selector
class
constructed by passing either TextResponse
object or
markup as a string (in text
argument).
Usually there is no need to construct Scrapy selectors manually:
response
object is available in Spider callbacks, so in most cases
it is more convenient to use response.css()
and response.xpath()
shortcuts. By using response.selector
or one of these shortcuts
you can also ensure the response body is parsed only once.
But if required, it is possible to use Selector
directly.
Constructing from text:
>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'
Constructing from response - HtmlResponse
is one of
TextResponse
subclasses:
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').get()
'good'
Selector
automatically chooses the best parsing rules
(XML vs HTML) based on input type.
Using selectors¶
To explain how to use the selectors we’ll use the Scrapy shell
(which
provides interactive testing) and an example page located in the Scrapy
documentation server:
https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
For the sake of completeness, here’s its full HTML code:
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
</div>
</body>
</html>
First, let’s open the shell:
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
Then, after the shell loads, you’ll have the response available as response
shell variable, and its attached selector in response.selector
attribute.
Since we’re dealing with HTML, the selector will automatically use an HTML parser.
So, by looking at the HTML code of that page, let’s construct an XPath for selecting the text inside the title tag:
>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
To actually extract the textual data, you must call the selector .get()
or .getall()
methods, as follows:
>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'
.get()
always returns a single result; if there are several matches,
the content of the first match is returned; if there are no matches, None
is returned. .getall()
returns a list with all results.
Notice that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:
>>> response.css('title::text').get()
'Example website'
As you can see, .xpath()
and .css()
methods return a
SelectorList
instance, which is a list of new
selectors. This API can be used for quickly selecting nested data:
>>> response.css('img').xpath('@src').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
If you want to extract only the first matched element, you can call the
selector .get()
(or its alias .extract_first()
commonly used in
previous Scrapy versions):
>>> response.xpath('//div[@id="images"]/a/text()').get()
'Name: My image 1 '
It returns None
if no element was found:
>>> response.xpath('//div[@id="not-exists"]/text()').get() is None
True
A default return value can be provided as an argument, to be used instead
of None
:
>>> response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
'not-found'
Instead of using e.g. '@src'
XPath it is possible to query for attributes
using .attrib
property of a Selector
:
>>> [img.attrib['src'] for img in response.css('img')]
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
As a shortcut, .attrib
is also available on SelectorList directly;
it returns attributes for the first matching element:
>>> response.css('img').attrib['src']
'image1_thumb.jpg'
This is most useful when only a single result is expected, e.g. when selecting by id, or selecting unique elements on a web page:
>>> response.css('base').attrib['href']
'http://example.com/'
Now we’re going to get the base URL and some image links:
>>> response.xpath('//base/@href').get()
'http://example.com/'
>>> response.css('base::attr(href)').get()
'http://example.com/'
>>> response.css('base').attrib['href']
'http://example.com/'
>>> response.xpath('//a[contains(@href, "image")]/@href').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']
>>> response.css('a[href*=image]::attr(href)').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']
>>> response.xpath('//a[contains(@href, "image")]/img/@src').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
>>> response.css('a[href*=image] img::attr(src)').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']
Extensions to CSS Selectors¶
Per W3C standards, CSS selectors do not support selecting text nodes or attribute values. But selecting these is so essential in a web scraping context that Scrapy (parsel) implements a couple of non-standard pseudo-elements:
to select text nodes, use
::text
to select attribute values, use
::attr(name)
where name is the name of the attribute that you want the value of
Warning
These pseudo-elements are Scrapy-/Parsel-specific. They will most probably not work with other libraries like lxml or PyQuery.
Examples:
title::text
selects children text nodes of a descendant<title>
element:
>>> response.css('title::text').get()
'Example website'
*::text
selects all descendant text nodes of the current selector context:
>>> response.css('#images *::text').getall()
['\n ',
'Name: My image 1 ',
'\n ',
'Name: My image 2 ',
'\n ',
'Name: My image 3 ',
'\n ',
'Name: My image 4 ',
'\n ',
'Name: My image 5 ',
'\n ']
foo::text
returns no results if foo
element exists, but contains no text (i.e. text is empty):
>>> response.css('img::text').getall()
[]
This means
.css('foo::text').get()
could return None even if an element exists. Usedefault=''
if you always want a string:
>>> response.css('img::text').get()
>>> response.css('img::text').get(default='')
''
a::attr(href)
selects the href attribute value of descendant links:
>>> response.css('a::attr(href)').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']
Note
See also: Selecting element attributes.
Note
You cannot chain these pseudo-elements. But in practice it would not make much sense: text nodes do not have attributes, and attribute values are string values already and do not have children nodes.
Nesting selectors¶
The selection methods (.xpath()
or .css()
) return a list of selectors
of the same type, so you can call the selection methods for those selectors
too. Here’s an example:
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> for index, link in enumerate(links):
... href_xpath = link.xpath('@href').get()
... img_xpath = link.xpath('img/@src').get()
... print(f'Link number {index} points to url {href_xpath!r} and image {img_xpath!r}')
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'
Selecting element attributes¶
There are several ways to get a value of an attribute. First, one can use XPath syntax:
>>> response.xpath("//a/@href").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
XPath syntax has a few advantages: it is a standard XPath feature, and
@attributes
can be used in other parts of an XPath expression - e.g.
it is possible to filter by attribute value.
Scrapy also provides an extension to CSS selectors (::attr(...)
)
which allows you to get attribute values:
>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
In addition to that, there is a .attrib
property of Selector.
You can use it if you prefer to look up attributes in Python
code, without using XPaths or CSS extensions:
>>> [a.attrib['href'] for a in response.css('a')]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
This property is also available on SelectorList; it returns a dictionary with the attributes of the first matching element. It is convenient to use when a selector is expected to give a single result (e.g. when selecting by element ID, or when selecting a unique element on a page):
>>> response.css('base').attrib
{'href': 'http://example.com/'}
>>> response.css('base').attrib['href']
'http://example.com/'
.attrib
property of an empty SelectorList is empty:
>>> response.css('foo').attrib
{}
Using selectors with regular expressions¶
Selector
also has a .re()
method for extracting
data using regular expressions. However, unlike using .xpath()
or
.css()
methods, .re()
returns a list of strings. So you
can’t construct nested .re()
calls.
Here’s an example used to extract image names from the HTML code above:
>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
'My image 2',
'My image 3',
'My image 4',
'My image 5']
There’s an additional helper reciprocating .get()
(and its
alias .extract_first()
) for .re()
, named .re_first()
.
Use it to extract just the first matching string:
>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'
extract() and extract_first()¶
If you’re a long-time Scrapy user, you’re probably familiar
with .extract()
and .extract_first()
selector methods. Many blog posts
and tutorials are using them as well. These methods are still supported
by Scrapy; there are no plans to deprecate them.
However, Scrapy usage docs are now written using .get()
and
.getall()
methods. We feel that these new methods result in more concise
and readable code.
The following examples show how these methods map to each other.
SelectorList.get()
is the same as SelectorList.extract_first()
:
>>> response.css('a::attr(href)').get()
'image1.html'
>>> response.css('a::attr(href)').extract_first()
'image1.html'
SelectorList.getall()
is the same as SelectorList.extract()
:
>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css('a::attr(href)').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
Selector.get()
is the same as Selector.extract()
:
>>> response.css('a::attr(href)')[0].get()
'image1.html'
>>> response.css('a::attr(href)')[0].extract()
'image1.html'
For consistency, there is also
Selector.getall()
, which returns a list:
>>> response.css('a::attr(href)')[0].getall()
['image1.html']
So, the main difference is that the output of the .get()
and .getall()
methods
is more predictable: .get()
always returns a single result, .getall()
always returns a list of all extracted results. With .extract()
method
it was not always obvious if a result is a list or not; to get a single
result either .extract()
or .extract_first()
should be called.
Working with XPaths¶
Here are some tips which may help you to use XPath with Scrapy selectors effectively. If you are not very familiar with XPath yet, you may want to take a look first at this XPath tutorial.
Note
Some of the tips are based on this post from Zyte’s blog.
Working with relative XPaths¶
Keep in mind that if you are nesting selectors and use an XPath that starts
with /
, that XPath will be absolute to the document and not relative to the
Selector
you’re calling it from.
For example, suppose you want to extract all <p>
elements inside <div>
elements. First, you would get all <div>
elements:
>>> divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as
it actually extracts all <p>
elements from the document, not only those
inside <div>
elements:
>>> for p in divs.xpath('//p'): # this is wrong - gets all <p> from the whole document
... print(p.get())
This is the proper way to do it (note the dot prefixing the .//p
XPath):
>>> for p in divs.xpath('.//p'): # extracts all <p> inside
... print(p.get())
Another common case would be to extract all direct <p>
children:
>>> for p in divs.xpath('p'):
... print(p.get())
For more details about relative XPaths see the Location Paths section in the XPath specification.
When querying by class, consider using CSS¶
Because an element can contain multiple CSS classes, the XPath way to select elements by class is rather verbose:
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
If you use @class='someclass'
you may end up missing elements that have
other classes, and if you just use contains(@class, 'someclass')
to make up
for that you may end up with more elements than you want, if they have a different
class name that shares the string someclass
.
As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can just select by class using CSS and then switch to XPath when needed:
>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']
This is cleaner than using the verbose XPath trick shown above. Just remember
to use the .
in the XPath expressions that will follow.
Beware of the difference between //node[1] and (//node)[1]¶
//node[1]
selects all the nodes occurring first under their respective parents.
(//node)[1]
selects all the nodes in the document, and then gets only the first of them.
Example:
>>> from scrapy import Selector
>>> sel = Selector(text="""
....: <ul class="list">
....: <li>1</li>
....: <li>2</li>
....: <li>3</li>
....: </ul>
....: <ul class="list">
....: <li>4</li>
....: <li>5</li>
....: <li>6</li>
....: </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()
This gets all first <li>
elements under whatever their parent is:
>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']
And this gets the first <li>
element in the whole document:
>>> xp("(//li)[1]")
['<li>1</li>']
This gets all first <li>
elements under an <ul>
parent:
>>> xp("//ul/li[1]")
['<li>1</li>', '<li>4</li>']
And this gets the first <li>
element under an <ul>
parent in the whole document:
>>> xp("(//ul/li)[1]")
['<li>1</li>']
Using text nodes in a condition¶
When you need to use the text content as argument to an XPath string function,
avoid using .//text()
and use just .
instead.
This is because the expression .//text()
yields a collection of text elements – a node-set.
And when a node-set is converted to a string, which happens when it is passed as argument to
a string function like contains()
or starts-with()
, it results in the text for the first element only.
Example:
>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
Converting a node-set to string:
>>> sel.xpath('//a//text()').getall() # take a peek at the node-set
['Click here to go to the ', 'Next Page']
>>> sel.xpath("string(//a[1]//text())").getall() # convert it to string
['Click here to go to the ']
A node converted to a string, however, puts together the text of itself plus of all its descendants:
>>> sel.xpath("//a[1]").getall() # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall() # convert it to string
['Click here to go to the Next Page']
So, using the .//text()
node-set won’t select anything in this case:
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]
But using the .
to mean the node works:
>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
Variables in XPath expressions¶
XPath allows you to reference variables in your XPath expressions, using
the $somevariable
syntax. This is somewhat similar to parameterized
queries or prepared statements in the SQL world where you replace
some arguments in your queries with placeholders like ?
,
which are then substituted with values passed with the query.
Here’s an example to match an element based on its “id” attribute value, without hard-coding it (that was shown previously):
>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').get()
'Name: My image 1 '
Here’s another example, to find the “id” attribute of a <div>
tag containing
five <a>
children (here we pass the value 5
as an integer):
>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'
All variable references must have a binding value when calling .xpath()
(otherwise you’ll get a ValueError: XPath error:
exception).
This is done by passing as many named arguments as necessary.
parsel, the library powering Scrapy selectors, has more details and examples on XPath variables.
Removing namespaces¶
When dealing with scraping projects, it is often quite convenient to get rid of
namespaces altogether and just work with element names, to write simpler and
more convenient XPaths. You can use the
Selector.remove_namespaces()
method for that.
Let’s show an example that illustrates this with the Python Insider blog atom feed.
First, we open the shell with the url we want to scrape:
$ scrapy shell https://feeds.feedburner.com/PythonInsider
This is how the file starts:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet ...
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
xmlns:blogger="http://schemas.google.com/blogger/2008"
xmlns:georss="http://www.georss.org/georss"
xmlns:gd="http://schemas.google.com/g/2005"
xmlns:thr="http://purl.org/syndication/thread/1.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
...
You can see several namespace declarations including a default “http://www.w3.org/2005/Atom” and another one using the “gd:” prefix for “http://schemas.google.com/g/2005”.
Once in the shell we can try selecting all <link>
objects and see that it
doesn’t work (because the Atom XML namespace is obfuscating those nodes):
>>> response.xpath("//link")
[]
But once we call the Selector.remove_namespaces()
method, all
nodes can be accessed directly by their names:
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data='<link rel="alternate" type="text/html" h'>,
<Selector xpath='//link' data='<link rel="next" type="application/atom+'>,
...
If you wonder why the namespace removal procedure isn’t always called by default instead of having to call it manually, this is because of two reasons, which, in order of relevance, are:
Removing namespaces requires iterating over and modifying all nodes in the document, which is a reasonably expensive operation to perform by default for all documents crawled by Scrapy
There could be some cases where using namespaces is actually required, in case some element names clash between namespaces. These cases are very rare though.
Using EXSLT extensions¶
Being built atop lxml, Scrapy selectors support some EXSLT extensions and come with these pre-registered namespaces to use in XPath expressions:
prefix | namespace | usage
---|---|---
re | http://exslt.org/regular-expressions | regular expressions
set | http://exslt.org/sets | set manipulation
Regular expressions¶
The test()
function, for example, can prove quite useful when XPath’s
starts-with()
or contains()
are not sufficient.
Example selecting links in list items with a "class" attribute ending with a digit:
>>> from scrapy import Selector
>>> doc = """
... <div>
... <ul>
... <li class="item-0"><a href="link1.html">first item</a></li>
... <li class="item-1"><a href="link2.html">second item</a></li>
... <li class="item-inactive"><a href="link3.html">third item</a></li>
... <li class="item-1"><a href="link4.html">fourth item</a></li>
... <li class="item-0"><a href="link5.html">fifth item</a></li>
... </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').getall()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
['link1.html', 'link2.html', 'link4.html', 'link5.html']
Warning
C library libxslt
doesn’t natively support EXSLT regular
expressions so lxml’s implementation uses hooks to Python’s re
module.
Thus, using regexp functions in your XPath expressions may add a small
performance penalty.
Set operations¶
These can be handy for excluding parts of a document tree before extracting text elements for example.
Example extracting microdata (sample content taken from https://schema.org/Product) with groups of itemscopes and corresponding itemprops:
>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
... <span itemprop="name">Kenmore White 17" Microwave</span>
... <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
... <div itemprop="aggregateRating"
... itemscope itemtype="http://schema.org/AggregateRating">
... Rated <span itemprop="ratingValue">3.5</span>/5
... based on <span itemprop="reviewCount">11</span> customer reviews
... </div>
...
... <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
... <span itemprop="price">$55.00</span>
... <link itemprop="availability" href="http://schema.org/InStock" />In stock
... </div>
...
... Product description:
... <span itemprop="description">0.7 cubic feet countertop microwave.
... Has six preset cooking categories and convenience features like
... Add-A-Minute and Child Lock.</span>
...
... Customer reviews:
...
... <div itemprop="review" itemscope itemtype="http://schema.org/Review">
... <span itemprop="name">Not a happy camper</span> -
... by <span itemprop="author">Ellie</span>,
... <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
... <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
... <meta itemprop="worstRating" content = "1">
... <span itemprop="ratingValue">1</span>/
... <span itemprop="bestRating">5</span>stars
... </div>
... <span itemprop="description">The lamp burned out and now I have to replace
... it. </span>
... </div>
...
... <div itemprop="review" itemscope itemtype="http://schema.org/Review">
... <span itemprop="name">Value purchase</span> -
... by <span itemprop="author">Lucas</span>,
... <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
... <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
... <meta itemprop="worstRating" content = "1"/>
... <span itemprop="ratingValue">4</span>/
... <span itemprop="bestRating">5</span>stars
... </div>
... <span itemprop="description">Great microwave for the price. It is small and
... fits in my apartment.</span>
... </div>
... ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath('//div[@itemscope]'):
... print("current scope:", scope.xpath('@itemtype').getall())
... props = scope.xpath('''
... set:difference(./descendant::*/@itemprop,
... .//*[@itemscope]/*/@itemprop)''')
... print(f" properties: {props.getall()}")
... print("")
current scope: ['http://schema.org/Product']
properties: ['name', 'aggregateRating', 'offers', 'description', 'review', 'review']
current scope: ['http://schema.org/AggregateRating']
properties: ['ratingValue', 'reviewCount']
current scope: ['http://schema.org/Offer']
properties: ['price', 'availability']
current scope: ['http://schema.org/Review']
properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']
current scope: ['http://schema.org/Rating']
properties: ['worstRating', 'ratingValue', 'bestRating']
current scope: ['http://schema.org/Review']
properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']
current scope: ['http://schema.org/Rating']
properties: ['worstRating', 'ratingValue', 'bestRating']
Here we first iterate over itemscope
elements, and for each one,
we look for all itemprops
elements and exclude those that are themselves
inside another itemscope
.
Other XPath extensions¶
Scrapy selectors also provide a sorely missed XPath extension function
has-class
that returns True
for nodes that have all of the specified
HTML classes.
For the following HTML:
<p class="foo bar-baz">First</p>
<p class="foo">Second</p>
<p class="bar">Third</p>
<p>Fourth</p>
You can use it like this:
>>> response.xpath('//p[has-class("foo")]')
[<Selector xpath='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
<Selector xpath='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]
>>> response.xpath('//p[has-class("foo", "bar-baz")]')
[<Selector xpath='//p[has-class("foo", "bar-baz")]' data='<p class="foo bar-baz">First</p>'>]
>>> response.xpath('//p[has-class("foo", "bar")]')
[]
So XPath //p[has-class("foo", "bar-baz")]
is roughly equivalent to CSS
p.foo.bar-baz
. Please note that it is slower in most cases,
because it’s a pure-Python function that’s invoked for every node in question
whereas the CSS lookup is translated into XPath and thus runs more efficiently,
so performance-wise its uses are limited to situations that are not easily
described with CSS selectors.
Parsel also simplifies adding your own XPath extensions.
- parsel.xpathfuncs.set_xpathfunc(fname, func)[source]¶
Register a custom extension function to use in XPath expressions.
The function
func
registered under fname
identifier will be called for every matching node, being passed a context
parameter as well as any parameters passed from the corresponding XPath expression.
If
func
is None
, the extension function will be removed.
See more in lxml documentation.
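A small sketch of registering, using and removing a custom function (the function name and its behaviour are invented for the example; the context argument is required by lxml even if unused):
from parsel.xpathfuncs import set_xpathfunc
from scrapy.selector import Selector

def starts_with_digit(context, text_nodes):
    # text_nodes is the node-set passed in from the XPath expression
    return bool(text_nodes) and text_nodes[0][:1].isdigit()

set_xpathfunc('starts-with-digit', starts_with_digit)

sel = Selector(text='<ul><li>1 apple</li><li>two pears</li></ul>')
print(sel.xpath('//li[starts-with-digit(text())]').getall())
# only the '<li>1 apple</li>' element should match

set_xpathfunc('starts-with-digit', None)  # unregister the function again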
Built-in Selectors reference¶
Selector objects¶
- class scrapy.selector.Selector(*args, **kwargs)[source]¶
An instance of
Selector
is a wrapper over response to select certain parts of its content.
response
is an HtmlResponse
or an XmlResponse
object that will be used for selecting and extracting data.
text
is a unicode string or utf-8 encoded text for cases when a response
isn't available. Using text
and response
together is undefined behavior.
type
defines the selector type, it can be "html"
, "xml"
or None
(default).
If
type
is None
, the selector automatically chooses the best type based on response
type (see below), or defaults to "html"
in case it is used together with text
.
If
type
is None
and a response
is passed, the selector type is inferred from the response type as follows:
"html"
for HtmlResponse
type
"xml"
for XmlResponse
type
"html"
for anything else
Otherwise, if
type
is set, the selector type will be forced and no detection will occur.
- xpath(query, namespaces=None, **kwargs)[source]¶
Find nodes matching the xpath
query
and return the result as a SelectorList
instance with all elements flattened. List elements implement the Selector
interface too.
query
is a string containing the XPath query to apply.
namespaces
is an optional prefix: namespace-uri
mapping (dict) for additional prefixes to those registered with register_namespace(prefix, uri)
. Contrary to register_namespace()
, these prefixes are not saved for future calls.
Any additional named arguments can be used to pass values for XPath variables in the XPath expression, e.g.:
selector.xpath('//a[href=$url]', url="http://www.example.com")
Note
For convenience, this method can be called as
response.xpath()
- css(query)[source]¶
Apply the given CSS selector and return a
SelectorList
instance.
query
is a string containing the CSS selector to apply.
In the background, CSS queries are translated into XPath queries using the cssselect library and run using the
.xpath()
method.
Note
For convenience, this method can be called as
response.css()
- get()[source]¶
Serialize and return the matched nodes in a single unicode string. Percent encoded content is unquoted.
See also: extract() and extract_first()
- attrib¶
Return the attributes dictionary for underlying element.
See also: Selecting element attributes.
- re(regex, replace_entities=True)[source]¶
Apply the given regex and return a list of unicode strings with the matches.
regex
can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex)
.
By default, character entity references are replaced by their corresponding character (except for
&
and <
). Passing replace_entities
as False
switches off these replacements.
- re_first(regex, default=None, replace_entities=True)[source]¶
Apply the given regex and return the first unicode string which matches. If there is no match, return the default value (
None
if the argument is not provided).
By default, character entity references are replaced by their corresponding character (except for
&
and <
). Passing replace_entities
as False
switches off these replacements.
- register_namespace(prefix, uri)[source]¶
Register the given namespace to be used in this
Selector
. Without registering namespaces you can’t select or extract data from non-standard namespaces. See Selector examples on XML response.
- remove_namespaces()[source]¶
Remove all namespaces, allowing you to traverse the document using namespace-less xpaths. See Removing namespaces.
- __bool__()[source]¶
Return
True
if there is any real content selected orFalse
otherwise. In other words, the boolean value of aSelector
is given by the contents it selects.
- getall()[source]¶
Serialize and return the matched node in a 1-element list of unicode strings.
This method is added to Selector for consistency; it is more useful with SelectorList. See also: extract() and extract_first()
SelectorList objects¶
- class scrapy.selector.SelectorList(iterable=(), /)[source]¶
The
SelectorList
class is a subclass of the builtin list
class, which provides a few additional methods.
- xpath(xpath, namespaces=None, **kwargs)[source]¶
Call the
.xpath()
method for each element in this list and return their results flattened as another SelectorList
.
query
is the same argument as the one in Selector.xpath()
namespaces
is an optional prefix: namespace-uri
mapping (dict) for additional prefixes to those registered with register_namespace(prefix, uri)
. Contrary to register_namespace()
, these prefixes are not saved for future calls.
Any additional named arguments can be used to pass values for XPath variables in the XPath expression, e.g.:
selector.xpath('//a[href=$url]', url="http://www.example.com")
- css(query)[source]¶
Call the
.css()
method for each element in this list and return their results flattened as another SelectorList
.
query
is the same argument as the one in Selector.css()
- getall()[source]¶
Call the
.get()
method for each element in this list and return their results flattened, as a list of unicode strings.
See also: extract() and extract_first()
- get(default=None)[source]¶
Return the result of
.get()
for the first element in this list. If the list is empty, return the default value.
See also: extract() and extract_first()
- re(regex, replace_entities=True)[source]¶
Call the
.re()
method for each element in this list and return their results flattened, as a list of unicode strings.
By default, character entity references are replaced by their corresponding character (except for
&
and <
). Passing replace_entities
as False
switches off these replacements.
- re_first(regex, default=None, replace_entities=True)[source]¶
Call the
.re()
method for the first element in this list and return the result in a unicode string. If the list is empty or the regex doesn't match anything, return the default value (None
if the argument is not provided).
By default, character entity references are replaced by their corresponding character (except for
&
and <
). Passing replace_entities
as False
switches off these replacements.
- attrib¶
Return the attributes dictionary for the first element. If the list is empty, return an empty dict.
See also: Selecting element attributes.
Examples¶
Selector examples on HTML response¶
Here are some Selector
examples to illustrate several concepts.
In all cases, we assume there is already a Selector
instantiated with
an HtmlResponse
object like this:
sel = Selector(html_response)
Select all
<h1>
elements from an HTML response body, returning a list of Selector
objects (i.e. a SelectorList
object):
sel.xpath("//h1")
Extract the text of all
<h1>
elements from an HTML response body, returning a list of strings:
sel.xpath("//h1").getall()         # this includes the h1 tag
sel.xpath("//h1/text()").getall()  # this excludes the h1 tag
Iterate over all
<p>
tags and print their class attribute:
for node in sel.xpath("//p"):
    print(node.attrib['class'])
Selector examples on XML response¶
Here are some examples to illustrate concepts for Selector
objects
instantiated with an XmlResponse
object:
sel = Selector(xml_response)
Select all
<product>
elements from an XML response body, returning a list of Selector
objects (i.e. a SelectorList
object):
sel.xpath("//product")
Extract all prices from a Google Base XML feed which requires registering a namespace:
sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").getall()
Items¶
The main goal in scraping is to extract structured data from unstructured sources, typically, web pages. Spiders may return the extracted data as items, Python objects that define key-value pairs.
Scrapy supports multiple types of items. When you create an item, you may use whichever type of item you want. When you write code that receives an item, your code should work for any item type.
Item Types¶
Scrapy supports the following types of items, via the itemadapter library: dictionaries, Item objects, dataclass objects, and attrs objects.
Dictionaries¶
As an item type, dict
is convenient and familiar.
Item objects¶
Item
provides a dict
-like API plus additional features that
make it the most feature-complete item type:
- class scrapy.item.Item([arg])¶
- class scrapy.Item([arg])¶
Item
objects replicate the standard dict
API, including its __init__
method.
Item
allows defining field names, so that:
KeyError
is raised when using undefined field names (i.e. prevents typos going unnoticed)
Item exporters can export all fields by default even if the first scraped object does not have values for all of them
Item
also allows defining field metadata, which can be used to customize serialization.
trackref
tracks Item
objects to help find memory leaks (see Debugging memory leaks with trackref).
Item
objects also provide the following additional API members:
- Item.copy()¶
- Item.deepcopy()¶
Return a
deepcopy()
of this item.
- fields¶
A dictionary containing all declared fields for this Item, not only those populated. The keys are the field names and the values are the
Field
objects used in the Item declaration.
Example:
from scrapy.item import Item, Field
class CustomItem(Item):
one_field = Field()
another_field = Field()
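Using the CustomItem declared above, the difference between copy() and deepcopy() only shows up for mutable field values (a small sketch; the field values are arbitrary):
item = CustomItem(one_field='value', another_field=['a', 'b'])

shallow = item.copy()      # top-level copy; nested objects are shared
deep = item.deepcopy()     # nested objects are copied as well

item['another_field'].append('c')
print(shallow['another_field'])  # ['a', 'b', 'c'] - shares the same list
print(deep['another_field'])     # ['a', 'b'] - independent copy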
Dataclass objects¶
New in version 2.2.
dataclass()
allows defining item classes with field names,
so that item exporters can export all fields by
default even if the first scraped object does not have values for all of them.
Additionally, dataclass
items also allow you to:
define the type and default value of each defined field.
define custom field metadata through
dataclasses.field()
, which can be used to customize serialization.
They work natively in Python 3.7 or later, or using the dataclasses backport in Python 3.6.
Example:
from dataclasses import dataclass
@dataclass
class CustomItem:
one_field: str
another_field: int
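For instance, a default value and per-field metadata could be declared like this (a sketch; whether the 'serializer' key is honoured depends on the component that reads the metadata):
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CustomItemWithDefaults:
    one_field: str = 'default value'
    # metadata is free-form; 'serializer' mirrors the Field metadata convention
    another_field: Optional[int] = field(default=None, metadata={'serializer': str})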
Note
Field types are not enforced at run time.
attr.s objects¶
New in version 2.2.
attr.s()
allows defining item classes with field names,
so that item exporters can export all fields by
default even if the first scraped object does not have values for all of them.
Additionally, attr.s items also allow you to:
define the type and default value of each defined field.
define custom field metadata, which can be used to customize serialization.
In order to use this type, the attrs package needs to be installed.
Example:
import attr
@attr.s
class CustomItem:
    one_field = attr.ib()
    another_field = attr.ib()
Working with Item objects¶
Declaring Item subclasses¶
Item subclasses are declared using a simple class definition syntax and
Field
objects. Here is an example:
import scrapy
class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
Note
Those familiar with Django will notice that Scrapy Items are declared similarly to Django Models, except that Scrapy Items are much simpler as there is no concept of different field types.
Declaring fields¶
Field
objects are used to specify metadata for each field. For
example, the serializer function for the last_updated
field illustrated in
the example above.
You can specify any kind of metadata for each field. There is no restriction on
the values accepted by Field
objects. For this same
reason, there is no reference list of all available metadata keys. Each key
defined in Field
objects could be used by a different component, and
only those components know about it. You can also define and use any other Field key in your project, for your own needs. The main goal of
Field
objects is to provide a way to define all field metadata in one
place. Typically, those components whose behaviour depends on each field use
certain field keys to configure that behaviour. You must refer to their
documentation to see which metadata keys are used by each component.
It’s important to note that the Field
objects used to declare the item
do not stay assigned as class attributes. Instead, they can be accessed through
the Item.fields
attribute.
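For illustration, here is a brief interactive sketch (assuming the Product item declared above) of reading declared metadata back through Item.fields:
>>> Product.fields['last_updated'].get('serializer')  # metadata declared in the item above
<class 'str'>
>>> Product.fields['name']  # no metadata was declared for this field
{}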
- class scrapy.item.Field([arg])¶
- class scrapy.Field([arg])¶
The Field class is just an alias to the built-in dict class and doesn't provide any extra functionality or attributes. In other words, Field objects are plain-old Python dicts. A separate class is used to support the item declaration syntax based on class attributes.
Note
Field metadata can also be declared for dataclass
and attrs
items. Please refer to the documentation for dataclasses.field and
attr.ib for additional information.
Working with Item objects¶
Here are some examples of common tasks performed with items, using the
Product
item declared above. You will
notice the API is very similar to the dict
API.
Creating items¶
>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)
Getting field values¶
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC
>>> product['price']
1000
>>> product['last_updated']
Traceback (most recent call last):
...
KeyError: 'last_updated'
>>> product.get('last_updated', 'not set')
not set
>>> product['lala'] # getting unknown field
Traceback (most recent call last):
...
KeyError: 'lala'
>>> product.get('lala', 'unknown field')
'unknown field'
>>> 'name' in product # is name field populated?
True
>>> 'last_updated' in product # is last_updated populated?
False
>>> 'last_updated' in product.fields # is last_updated a declared field?
True
>>> 'lala' in product.fields # is lala a declared field?
False
Setting field values¶
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today
>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Accessing all populated values¶
To access all populated values, just use the typical dict
API:
>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
Copying items¶
To copy an item, you must first decide whether you want a shallow copy or a deep copy.
If your item contains mutable values like lists or dictionaries, a shallow copy will keep references to the same mutable values across all different copies.
For example, if you have an item with a list of tags, and you create a shallow copy of that item, both the original item and the copy have the same list of tags. Adding a tag to the list of one of the items will add the tag to the other item as well.
If that is not the desired behavior, use a deep copy instead.
See copy
for more information.
To create a shallow copy of an item, you can either call
copy()
on an existing item
(product2 = product.copy()
) or instantiate your item class from an existing
item (product2 = Product(product)
).
To create a deep copy, call deepcopy()
instead
(product2 = product.deepcopy()
).
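For example, a short interactive sketch using the Product item declared earlier (the tag values are made up for illustration):
>>> product = Product(name='Desktop PC', tags=['pc'])
>>> shallow = product.copy()          # shallow copy: the tags list is shared
>>> shallow['tags'].append('desktop')
>>> product['tags']
['pc', 'desktop']
>>> deep = product.deepcopy()         # deep copy: the tags list is duplicated
>>> deep['tags'].append('gaming')
>>> product['tags']
['pc', 'desktop']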
Other common tasks¶
Creating dicts from items:
>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}
Creating items from dicts:
>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Extending Item subclasses¶
You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item.
For example:
class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()
You can also extend field metadata by using the previous field metadata and appending more values, or changing existing values, like this:
class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)
That adds (or replaces) the serializer
metadata key for the name
field,
keeping all the previously existing metadata values.
Supporting All Item Types¶
In code that receives an item, such as methods of item pipelines or spider middlewares, it is a good practice to use the
ItemAdapter
class and the
is_item()
function to write code that works for
any supported item type:
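For example, a minimal sketch of such a component (the price field and its default value are assumptions made for this illustration, and the item is assumed to declare that field):
from itemadapter import ItemAdapter, is_item

def ensure_price(item):
    # Works the same way for dicts, Item subclasses, dataclass and attrs items.
    if is_item(item):
        adapter = ItemAdapter(item)
        if not adapter.get('price'):   # hypothetical field used for illustration
            adapter['price'] = 0.0
    return item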
Item Loaders¶
Item Loaders provide a convenient mechanism for populating scraped items. Even though items can be populated directly, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.
In other words, items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.
Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding different field parsing rules, either by spider, or by source format (HTML, XML, etc) without becoming a nightmare to maintain.
Note
Item Loaders are an extension of the itemloaders library that makes it easier to work with Scrapy by adding support for responses.
Using Item Loaders to populate items¶
To use an Item Loader, you must first instantiate it. You can either
instantiate it with an item object or without one, in which
case an item object is automatically created in the
Item Loader __init__
method using the item class
specified in the ItemLoader.default_item_class
attribute.
Then, you start collecting values into the Item Loader, typically using Selectors. You can add more than one value to the same item field; the Item Loader will know how to “join” those values later using a proper processing function.
Note
Collected data is internally stored as lists, allowing you to add several values to the same field.
If an item
argument is passed when creating a loader,
each of the item’s values will be stored as-is if it’s already
an iterable, or wrapped with a list if it’s a single value.
Here is a typical Item Loader usage in a Spider, using the Product item declared in the Items chapter:
from scrapy.loader import ItemLoader
from myproject.items import Product
def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()
By quickly looking at that code, we can see the name
field is being
extracted from two different XPath locations in the page:
//div[@class="product_name"]
//div[@class="product_title"]
In other words, data is being collected by extracting it from two XPath
locations, using the add_xpath()
method. This is the
data that will be assigned to the name
field later.
Afterwards, similar calls are used for price
and stock
fields
(the latter using a CSS selector with the add_css()
method),
and finally the last_updated
field is populated directly with a literal value
(today
) using a different method: add_value()
.
Finally, when all data is collected, the ItemLoader.load_item()
method is
called which actually returns the item populated with the data
previously extracted and collected with the add_xpath()
,
add_css()
, and add_value()
calls.
Working with dataclass items¶
By default, dataclass items require all fields to be
passed when created. This could be an issue when using dataclass items with
item loaders: unless a pre-populated item is passed to the loader, fields
will be populated incrementally using the loader’s add_xpath()
,
add_css()
and add_value()
methods.
One approach to overcome this is to define items using the
field()
function, with a default
argument:
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class InventoryItem:
    name: Optional[str] = field(default=None)
    price: Optional[float] = field(default=None)
    stock: Optional[int] = field(default=None)
Input and Output processors¶
An Item Loader contains one input processor and one output processor for each
(item) field. The input processor processes the extracted data as soon as it’s
received (through the add_xpath()
, add_css()
or
add_value()
methods) and the result of the input processor is
collected and kept inside the ItemLoader. After collecting all data, the
ItemLoader.load_item()
method is called to populate and get the populated
item object. That’s when the output processor is
called with the data previously collected (and processed using the input
processor). The result of the output processor is the final value that gets
assigned to the item.
Let’s see an example to illustrate how the input and output processors are called for a particular field (the same applies for any other field):
l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath1) # (1)
l.add_xpath('name', xpath2) # (2)
l.add_css('name', css) # (3)
l.add_value('name', 'test') # (4)
return l.load_item() # (5)
So what happens is:
1. Data from xpath1 is extracted, and passed through the input processor of the name field. The result of the input processor is collected and kept in the Item Loader (but not yet assigned to the item).
2. Data from xpath2 is extracted, and passed through the same input processor used in (1). The result of the input processor is appended to the data collected in (1) (if any).
3. This case is similar to the previous ones, except that the data is extracted from the css CSS selector, and passed through the same input processor used in (1) and (2). The result of the input processor is appended to the data collected in (1) and (2) (if any).
4. This case is also similar to the previous ones, except that the value to be collected is assigned directly, instead of being extracted from an XPath expression or a CSS selector. However, the value is still passed through the input processors. In this case, since the value is not iterable it is converted to an iterable of a single element before passing it to the input processor, because input processors always receive iterables.
5. The data collected in steps (1), (2), (3) and (4) is passed through the output processor of the name field. The result of the output processor is the value assigned to the name field in the item.
It’s worth noticing that processors are just callable objects, which are called with the data to be parsed, and return a parsed value. So you can use any function as input or output processor. The only requirement is that they must accept one (and only one) positional argument, which will be an iterable.
Changed in version 2.0: Processors no longer need to be methods.
Note
Both input and output processors must receive an iterable as their first argument. The output of those functions can be anything. The result of input processors will be appended to an internal list (in the Loader) containing the collected values (for that field). The result of the output processors is the value that will be finally assigned to the item.
The other thing you need to keep in mind is that the values returned by input processors are collected internally (in lists) and then passed to output processors to populate the fields.
Last, but not least, itemloaders comes with some commonly used processors built-in for convenience.
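As a brief illustration (a sketch, not part of the example project), any callable that accepts a single iterable can serve as a processor, and it combines freely with built-in ones such as MapCompose:
>>> from itemloaders.processors import MapCompose
>>> def strip_all(values):
...     # A processor is just a callable taking one iterable and returning a parsed value.
...     return [v.strip() for v in values]
...
>>> strip_all(['  color tv ', ' dvd player'])
['color tv', 'dvd player']
>>> MapCompose(str.strip, str.title)(['  color tv ', ' dvd player'])  # built-in, applied per value
['Color Tv', 'Dvd Player']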
Declaring Item Loaders¶
Item Loaders are declared using a class definition syntax. Here is an example:
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.loader import ItemLoader
class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

    name_in = MapCompose(str.title)
    name_out = Join()

    price_in = MapCompose(str.strip)
    # ...
As you can see, input processors are declared using the _in
suffix while
output processors are declared using the _out
suffix. You can also declare default input/output processors using the
ItemLoader.default_input_processor
and
ItemLoader.default_output_processor
attributes.
Declaring Input and Output Processors¶
As seen in the previous section, input and output processors can be declared in the Item Loader definition, and it’s very common to declare input processors this way. However, there is one more place where you can specify the input and output processors to use: in the Item Field metadata. Here is an example:
import scrapy
from itemloaders.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags
def filter_price(value):
    if value.isdigit():
        return value

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )
>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item=Product())
>>> il.add_value('name', ['Welcome to my', '<strong>website</strong>'])
>>> il.add_value('price', ['€', '<span>1000</span>'])
>>> il.load_item()
{'name': 'Welcome to my website', 'price': '1000'}
The precedence order, for both input and output processors, is as follows:
1. Item Loader field-specific attributes: field_in and field_out (most precedence)
2. Field metadata (input_processor and output_processor keys)
3. Item Loader defaults: ItemLoader.default_input_processor() and ItemLoader.default_output_processor() (least precedence)
See also: Reusing and extending Item Loaders.
Item Loader Context¶
The Item Loader Context is a dict of arbitrary key/values which is shared among all input and output processors in the Item Loader. It can be passed when declaring, instantiating or using an Item Loader, and it is used to modify the behaviour of the input/output processors.
For example, suppose you have a function parse_length
which receives a text
value and extracts a length from it:
def parse_length(text, loader_context):
    unit = loader_context.get('unit', 'm')
    # ... length parsing code goes here ...
    return parsed_length
By accepting a loader_context
argument the function is explicitly telling
the Item Loader that it’s able to receive an Item Loader context, so the Item
Loader passes the currently active context when calling it, and the processor
function (parse_length
in this case) can thus use them.
There are several ways to modify Item Loader context values:
By modifying the currently active Item Loader context (context attribute):
loader = ItemLoader(product)
loader.context['unit'] = 'cm'
On Item Loader instantiation (the keyword arguments of the Item Loader __init__ method are stored in the Item Loader context):
loader = ItemLoader(product, unit='cm')
On Item Loader declaration, for those input/output processors that support instantiating them with an Item Loader context. MapCompose is one of them:
class ProductLoader(ItemLoader):
    length_out = MapCompose(parse_length, unit='cm')
ItemLoader objects¶
- class scrapy.loader.ItemLoader(item=None, selector=None, response=None, parent=None, **context)[source]¶
A user-friendly abstraction to populate an item with data by applying field processors to scraped data. When instantiated with a selector or a response it supports data extraction from web pages using selectors.
- Parameters
item (scrapy.item.Item) – The item instance to populate using subsequent calls to add_xpath(), add_css(), or add_value().
selector (Selector object) – The selector to extract data from, when using the add_xpath(), add_css(), replace_xpath(), or replace_css() method.
response (Response object) – The response used to construct the selector using the default_selector_class, unless the selector argument is given, in which case this argument is ignored.
If no item is given, one is instantiated automatically using the class in default_item_class.
The item, selector, response and remaining keyword arguments are assigned to the Loader context (accessible through the context attribute).
- item¶
The item object being parsed by this Item Loader. This is mostly used as a property so, when attempting to override this value, you may want to check out
default_item_class
first.
- default_item_class¶
An item class (or factory), used to instantiate items when not given in the
__init__
method.
- default_input_processor¶
The default input processor to use for those fields which don’t specify one.
- default_output_processor¶
The default output processor to use for those fields which don’t specify one.
- default_selector_class¶
The class used to construct the selector of this ItemLoader, if only a response is given in the __init__ method. If a selector is given in the __init__ method this attribute is ignored. This attribute is sometimes overridden in subclasses.
- selector¶
The Selector object to extract data from. It's either the selector given in the __init__ method or one created from the response given in the __init__ method using the default_selector_class. This attribute is meant to be read-only.
- add_css(field_name, css, *processors, **kw)[source]¶
Similar to ItemLoader.add_value() but receives a CSS selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.
See get_css() for kwargs.
- Parameters
css (str) – the CSS selector to extract data from
Examples:
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_css('name', 'p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_css('price', 'p#price', re='the price is (.*)')
- add_value(field_name, value, *processors, **kw)[source]¶
Process and then add the given value for the given field.
The value is first passed through get_value() by giving the processors and kwargs, and then passed through the field input processor and its result appended to the data collected for that field. If the field already contains collected data, the new data is added.
The given field_name can be None, in which case values for multiple fields may be added. And the processed value should be a dict with field_name mapped to values.
Examples:
loader.add_value('name', 'Color TV')
loader.add_value('colours', ['white', 'blue'])
loader.add_value('length', '100')
loader.add_value('name', 'name: foo', TakeFirst(), re='name: (.+)')
loader.add_value(None, {'name': 'foo', 'sex': 'male'})
- add_xpath(field_name, xpath, *processors, **kw)[source]¶
Similar to ItemLoader.add_value() but receives an XPath instead of a value, which is used to extract a list of strings from the selector associated with this ItemLoader.
See get_xpath() for kwargs.
- Parameters
xpath (str) – the XPath to extract data from
Examples:
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')
- get_css(css, *processors, **kw)[source]¶
Similar to ItemLoader.get_value() but receives a CSS selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.
- Parameters
Examples:
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_css('p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_css('p#price', TakeFirst(), re='the price is (.*)')
- get_output_value(field_name)[source]¶
Return the collected values parsed using the output processor, for the given field. This method doesn’t populate or modify the item at all.
- get_value(value, *processors, **kw)[source]¶
Process the given value by the given processors and keyword arguments.
Available keyword arguments:
- Parameters
re (str or Pattern) – a regular expression to use for extracting data from the given value using
extract_regex()
method, applied before processors
Examples:
>>> from itemloaders import ItemLoader
>>> from itemloaders.processors import TakeFirst
>>> loader = ItemLoader()
>>> loader.get_value('name: foo', TakeFirst(), str.upper, re='name: (.+)')
'FOO'
- get_xpath(xpath, *processors, **kw)[source]¶
Similar to ItemLoader.get_value() but receives an XPath instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.
- Parameters
Examples:
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_xpath('//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')
- load_item()[source]¶
Populate the item with the data collected so far, and return it. The data collected is first passed through the output processors to get the final value to assign to each item field.
- nested_css(css, **context)[source]¶
Create a nested loader with a css selector. The supplied selector is applied relative to the selector associated with this ItemLoader. The nested loader shares the item with the parent ItemLoader so calls to add_xpath(), add_value(), replace_value(), etc. will behave as expected.
- nested_xpath(xpath, **context)[source]¶
Create a nested loader with an xpath selector. The supplied selector is applied relative to the selector associated with this ItemLoader. The nested loader shares the item with the parent ItemLoader so calls to add_xpath(), add_value(), replace_value(), etc. will behave as expected.
- replace_css(field_name, css, *processors, **kw)[source]¶
Similar to
add_css()
but replaces collected data instead of adding it.
- replace_value(field_name, value, *processors, **kw)[source]¶
Similar to
add_value()
but replaces the collected data with the new value instead of adding it.
- replace_xpath(field_name, xpath, *processors, **kw)[source]¶
Similar to
add_xpath()
but replaces collected data instead of adding it.
Nested Loaders¶
When parsing related values from a subsection of a document, it can be useful to create nested loaders. Imagine you’re extracting details from a footer of a page that looks something like:
Example:
<footer>
<a class="social" href="https://facebook.com/whatever">Like Us</a>
<a class="social" href="https://twitter.com/whatever">Follow Us</a>
<a class="email" href="mailto:whatever@example.com">Email Us</a>
</footer>
Without nested loaders, you need to specify the full xpath (or css) for each value that you wish to extract.
Example:
loader = ItemLoader(item=Item())
# load stuff not in the footer
loader.add_xpath('social', '//footer/a[@class = "social"]/@href')
loader.add_xpath('email', '//footer/a[@class = "email"]/@href')
loader.load_item()
Instead, you can create a nested loader with the footer selector and add values relative to the footer. The functionality is the same but you avoid repeating the footer selector.
Example:
loader = ItemLoader(item=Item())
# load stuff not in the footer
footer_loader = loader.nested_xpath('//footer')
footer_loader.add_xpath('social', 'a[@class = "social"]/@href')
footer_loader.add_xpath('email', 'a[@class = "email"]/@href')
# no need to call footer_loader.load_item()
loader.load_item()
You can nest loaders arbitrarily and they work with either xpath or css selectors. As a general guideline, use nested loaders when they make your code simpler but do not go overboard with nesting or your parser can become difficult to read.
Reusing and extending Item Loaders¶
As your project grows bigger and acquires more and more spiders, maintenance becomes a fundamental problem, especially when you have to deal with many different parsing rules for each spider, having a lot of exceptions, but also wanting to reuse the common processors.
Item Loaders are designed to ease the maintenance burden of parsing rules, without losing flexibility and, at the same time, providing a convenient mechanism for extending and overriding them. For this reason Item Loaders support traditional Python class inheritance for dealing with differences of specific spiders (or groups of spiders).
Suppose, for example, that some particular site encloses their product names in
three dashes (e.g. ---Plasma TV---
) and you don’t want to end up scraping
those dashes in the final product names.
Here’s how you can remove those dashes by reusing and extending the default
Product Item Loader (ProductLoader
):
from itemloaders.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
def strip_dashes(x):
    return x.strip('-')

class SiteSpecificLoader(ProductLoader):
    name_in = MapCompose(strip_dashes, ProductLoader.name_in)
Another case where extending Item Loaders can be very helpful is when you have
multiple source formats, for example XML and HTML. In the XML version you may
want to remove CDATA
occurrences. Here’s an example of how to do it:
from itemloaders.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
from myproject.utils.xml import remove_cdata
class XmlProductLoader(ProductLoader):
    name_in = MapCompose(remove_cdata, ProductLoader.name_in)
And that’s how you typically extend input processors.
As for output processors, it is more common to declare them in the field metadata, as they usually depend only on the field and not on each specific site parsing rule (as input processors do). See also: Declaring Input and Output Processors.
There are many other possible ways to extend, inherit and override your Item Loaders, and different Item Loaders hierarchies may fit better for different projects. Scrapy only provides the mechanism; it doesn’t impose any specific organization of your Loaders collection - that’s up to you and your project’s needs.
Scrapy shell¶
The Scrapy shell is an interactive shell where you can try and debug your scraping code very quickly, without having to run the spider. It’s meant to be used for testing data extraction code, but you can actually use it for testing any kind of code as it is also a regular Python shell.
The shell is used for testing XPath or CSS expressions and seeing how they work and what data they extract from the web pages you're trying to scrape. It allows you to interactively test your expressions while you're writing your spider, without having to run the spider to test every change.
Once you get familiarized with the Scrapy shell, you’ll see that it’s an invaluable tool for developing and debugging your spiders.
Configuring the shell¶
If you have IPython installed, the Scrapy shell will use it (instead of the standard Python console). The IPython console is much more powerful and provides smart auto-completion and colorized output, among other things.
We highly recommend you install IPython, especially if you're working on Unix systems (where IPython excels). See the IPython installation guide for more info.
Scrapy also has support for bpython, and will try to use it where IPython is unavailable.
Through Scrapy’s settings you can configure it to use any one of
ipython
, bpython
or the standard python
shell, regardless of which
are installed. This is done by setting the SCRAPY_PYTHON_SHELL
environment
variable; or by defining it in your scrapy.cfg:
[settings]
shell = bpython
Launch the shell¶
To launch the Scrapy shell you can use the shell
command like
this:
scrapy shell <url>
Where the <url>
is the URL you want to scrape.
shell
also works for local files. This can be handy if you want
to play around with a local copy of a web page. shell
understands
the following syntaxes for local files:
# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html
# File URI
scrapy shell file:///absolute/path/to/file.html
Note
When using relative file paths, be explicit and prepend them
with ./
(or ../
when relevant).
scrapy shell index.html
will not work as one might expect (and
this is by design, not a bug).
Because shell favors HTTP URLs over File URIs, and index.html is syntactically similar to example.com, shell will treat index.html as a domain name and trigger a DNS lookup error:
$ scrapy shell index.html
[ ... scrapy shell starts ... ]
[ ... traceback ... ]
twisted.internet.error.DNSLookupError: DNS lookup failed:
address 'index.html' not found: [Errno -5] No address associated with hostname.
shell
will not test beforehand if a file called index.html
exists in the current directory. Again, be explicit.
Using the shell¶
The Scrapy shell is just a regular Python console (or IPython console if you have it available) which provides some additional shortcut functions for convenience.
Available Shortcuts¶
shelp() - print a help with the list of available objects and shortcuts
fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly. You can optionally ask for HTTP 3xx redirections to not be followed by passing redirect=False
fetch(request) - fetch a new response from the given request and update all related objects accordingly.
view(response) - open the given response in your local web browser, for inspection. This will add a <base> tag to the response body in order for external links (such as images and style sheets) to display properly. Note, however, that this will create a temporary file in your computer, which won't be removed automatically.
Available Scrapy objects¶
The Scrapy shell automatically creates some convenient objects from the
downloaded page, like the Response
object and the
Selector
objects (for both HTML and XML
content).
Those objects are:
crawler - the current Crawler object.
spider - the Spider which is known to handle the URL, or a Spider object if there is no spider found for the current URL
request - a Request object of the last fetched page. You can modify this request using replace() or fetch a new request (without leaving the shell) using the fetch shortcut.
response - a Response object containing the last fetched page
settings - the current Scrapy settings
Example of shell session¶
Here's an example of a typical shell session where we start by scraping the https://scrapy.org page, and then proceed to scrape the https://old.reddit.com/ page. Finally, we modify the (Reddit) request method to POST and re-fetch it, getting an error. We end the session by typing Ctrl-D (in Unix systems) or Ctrl-Z (in Windows).
Keep in mind that the data extracted here may not be the same when you try it, as those pages are not static and could have changed by the time you test this. The only purpose of this example is to get you familiarized with how the Scrapy shell works.
First, we launch the shell:
scrapy shell 'https://scrapy.org' --nolog
Note
Remember to always enclose URLs in quotes when running the Scrapy shell from
the command line, otherwise URLs containing arguments (i.e. the &
character)
will not work.
On Windows, use double quotes instead:
scrapy shell "https://scrapy.org" --nolog
Then, the shell fetches the URL (using the Scrapy downloader) and prints the
list of available objects and useful shortcuts (you’ll notice that these lines
all start with the [s]
prefix):
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f07395dd690>
[s] item {}
[s] request <GET https://scrapy.org>
[s] response <200 https://scrapy.org/>
[s] settings <scrapy.settings.Settings object at 0x7f07395dd710>
[s] spider <DefaultSpider 'default' at 0x7f0735891690>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
After that, we can start playing with the objects:
>>> response.xpath('//title/text()').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
>>> fetch("https://old.reddit.com/")
>>> response.xpath('//title/text()').get()
'reddit: the front page of the internet'
>>> request = request.replace(method="POST")
>>> fetch(request)
>>> response.status
404
>>> from pprint import pprint
>>> pprint(response.headers)
{'Accept-Ranges': ['bytes'],
'Cache-Control': ['max-age=0, must-revalidate'],
'Content-Type': ['text/html; charset=UTF-8'],
'Date': ['Thu, 08 Dec 2016 16:21:19 GMT'],
'Server': ['snooserv'],
'Set-Cookie': ['loid=KqNLou0V9SKMX4qb4n; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
'loidcreated=2016-12-08T16%3A21%3A19.445Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
'loid=vi0ZVe4NkxNWdlH7r7; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
'loidcreated=2016-12-08T16%3A21%3A19.459Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure'],
'Vary': ['accept-encoding'],
'Via': ['1.1 varnish'],
'X-Cache': ['MISS'],
'X-Cache-Hits': ['0'],
'X-Content-Type-Options': ['nosniff'],
'X-Frame-Options': ['SAMEORIGIN'],
'X-Moose': ['majestic'],
'X-Served-By': ['cache-cdg8730-CDG'],
'X-Timer': ['S1481214079.394283,VS0,VE159'],
'X-Ua-Compatible': ['IE=edge'],
'X-Xss-Protection': ['1; mode=block']}
Invoking the shell from spiders to inspect responses¶
Sometimes you want to inspect the responses that are being processed at a certain point of your spider, if only to check that the response you expect is getting there.
This can be achieved by using the scrapy.shell.inspect_response
function.
Here’s an example of how you would call it from your spider:
import scrapy
class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)
        # Rest of parsing code.
When you run the spider, you will get something similar to this:
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x1e16b50>
...
>>> response.url
'http://example.org'
Then, you can check if the extraction code is working:
>>> response.xpath('//h1[@class="fn"]')
[]
Nope, it doesn’t. So you can open the response in your web browser and see if it’s the response you were expecting:
>>> view(response)
True
Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling:
>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...
Note that you can’t use the fetch
shortcut here since the Scrapy engine is
blocked by the shell. However, after you leave the shell, the spider will
continue crawling where it stopped, as shown above.
Item Pipeline¶
After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.
Each item pipeline component (sometimes referred to simply as an "Item Pipeline") is a Python class that implements a simple method. They receive an item and perform an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.
Typical uses of item pipelines are:
cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database
Writing your own item pipeline¶
Each item pipeline component is a Python class that must implement the following method:
- process_item(self, item, spider)¶
This method is called for every item pipeline component.
item is an item object, see Supporting All Item Types.
process_item() must either: return an item object, return a Deferred or raise a DropItem exception.
Dropped items are no longer processed by further pipeline components.
- Parameters
item (item object) – the scraped item
spider (
Spider
object) – the spider which scraped the item
Additionally, they may also implement the following methods:
- open_spider(self, spider)¶
This method is called when the spider is opened.
- Parameters
spider (
Spider
object) – the spider which was opened
- close_spider(self, spider)¶
This method is called when the spider is closed.
- Parameters
spider (
Spider
object) – the spider which was closed
- from_crawler(cls, crawler)¶
If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components like settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.
- Parameters
crawler (
Crawler
object) – crawler that uses this pipeline
Item pipeline example¶
Price validation and dropping items with no prices¶
Let’s take a look at the following hypothetical pipeline that adjusts the
price
attribute for those items that do not include VAT
(price_excludes_vat
attribute), and drops those items which don’t
contain a price:
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem
class PricePipeline:

    vat_factor = 1.15

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get('price'):
            if adapter.get('price_excludes_vat'):
                adapter['price'] = adapter['price'] * self.vat_factor
            return item
        else:
            raise DropItem(f"Missing price in {item}")
Write items to a JSON file¶
The following pipeline stores all scraped items (from all spiders) into a
single items.jl
file, containing one item per line serialized in JSON
format:
import json
from itemadapter import ItemAdapter
class JsonWriterPipeline:

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item
Note
The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.
Write items to MongoDB¶
In this example we'll write items to MongoDB using pymongo. The MongoDB address and database name are specified in Scrapy settings; the MongoDB collection is named after the item class.
The main point of this example is to show how to use the from_crawler() method and how to clean up the resources properly:
import pymongo
from itemadapter import ItemAdapter
class MongoPipeline:

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item
Take screenshot of item¶
This example demonstrates how to use coroutine syntax in
the process_item()
method.
This item pipeline makes a request to a locally-running instance of Splash to render a screenshot of the item URL. After the request response is downloaded, the item pipeline saves the screenshot to a file and adds the filename to the item.
import hashlib
from urllib.parse import quote
import scrapy
from itemadapter import ItemAdapter
from scrapy.utils.defer import maybe_deferred_to_future
class ScreenshotPipeline:
    """Pipeline that uses Splash to render screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    async def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        encoded_item_url = quote(adapter["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        response = await maybe_deferred_to_future(spider.crawler.engine.download(request, spider))

        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = adapter["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = f"{url_hash}.png"
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        adapter["screenshot_filename"] = filename
        return item
Duplicates filter¶
A filter that looks for duplicate items, and drops those items that were already processed. Let's say that our items have a unique id, but our spider returns multiple items with the same id:
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem
class DuplicatesPipeline:

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
            return item
Activating an Item Pipeline component¶
To activate an Item Pipeline component you must add its class to the
ITEM_PIPELINES
setting, like in the following example:
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
The integer values you assign to classes in this setting determine the order in which they run: items go through from lower valued to higher valued classes. It’s customary to define these numbers in the 0-1000 range.
Feed exports¶
One of the most frequently required features when implementing scrapers is being able to store the scraped data properly and, quite often, that means generating an “export file” with the scraped data (commonly called “export feed”) to be consumed by other systems.
Scrapy provides this functionality out of the box with the Feed Exports, which allows you to generate feeds with the scraped items, using multiple serialization formats and storage backends.
Serialization formats¶
For serializing the scraped data, the feed exports use the Item exporters. These formats are supported out of the box: JSON, JSON lines, CSV, XML, Pickle and Marshal (each described below).
But you can also extend the supported formats through the FEED_EXPORTERS setting.
JSON¶
Value for the format key in the FEEDS setting: json
Exporter used:
JsonItemExporter
See this warning if you’re using JSON with large feeds.
JSON lines¶
Value for the format key in the FEEDS setting: jsonlines
Exporter used:
JsonLinesItemExporter
CSV¶
Value for the format key in the FEEDS setting: csv
Exporter used:
CsvItemExporter
To specify columns to export and their order use
FEED_EXPORT_FIELDS
. Other feed exporters can also use this option, but it is important for CSV because unlike many other export formats CSV uses a fixed header.
XML¶
Value for the format key in the FEEDS setting: xml
Exporter used:
XmlItemExporter
Pickle¶
Value for the format key in the FEEDS setting: pickle
Exporter used:
PickleItemExporter
Marshal¶
Value for the format key in the FEEDS setting: marshal
Exporter used:
MarshalItemExporter
Storages¶
When using the feed exports you define where to store the feed using one or multiple URIs (through the FEEDS setting). The feed exports support multiple storage backend types which are defined by the URI scheme.
The storage backends supported out of the box are:
Local filesystem
FTP
S3 (requires botocore)
Google Cloud Storage (GCS) (requires google-cloud-storage)
Standard output
Some storage backends may be unavailable if the required external libraries are not available. For example, the S3 backend is only available if the botocore library is installed.
Storage URI parameters¶
The storage URI can also contain parameters that get replaced when the feed is being created. These parameters are:
%(time)s - gets replaced by a timestamp when the feed is being created
%(name)s - gets replaced by the spider name
Any other named parameter gets replaced by the spider attribute of the same
name. For example, %(site_id)s
would get replaced by the spider.site_id
attribute the moment the feed is being created.
Here are some examples to illustrate:
Store in FTP using one directory per spider:
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
Store in S3 using one directory per spider:
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
Note
Spider arguments become spider attributes, hence they can also be used as storage URI parameters.
Storage backends¶
Local filesystem¶
The feeds are stored in the local filesystem.
URI scheme:
file
Example URI:
file:///tmp/export.csv
Required external libraries: none
Note that for the local filesystem storage (only) you can omit the scheme if
you specify an absolute path like /tmp/export.csv
. This only works on Unix
systems though.
FTP¶
The feeds are stored on an FTP server.
URI scheme:
ftp
Example URI:
ftp://user:pass@ftp.example.com/path/to/export.csv
Required external libraries: none
FTP supports two different connection modes: active or passive. Scrapy uses the passive connection
mode by default. To use the active connection mode instead, set the
FEED_STORAGE_FTP_ACTIVE
setting to True
.
This storage backend uses delayed file delivery.
S3¶
The feeds are stored on Amazon S3.
URI scheme:
s3
Example URIs:
s3://mybucket/path/to/export.csv
s3://aws_key:aws_secret@mybucket/path/to/export.csv
Required external libraries: botocore >= 1.4.87
The AWS credentials can be passed as user/password in the URI, or they can be passed through the following settings:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN (only needed for temporary security credentials)
You can also define a custom ACL and custom endpoint for exported feeds using dedicated settings, such as FEED_STORAGE_S3_ACL (documented below).
This storage backend uses delayed file delivery.
Google Cloud Storage (GCS)¶
New in version 2.3.
The feeds are stored on Google Cloud Storage.
URI scheme:
gs
Example URIs:
gs://mybucket/path/to/export.csv
Required external libraries: google-cloud-storage.
For more information about authentication, please refer to Google Cloud documentation.
You can set a Project ID and Access Control List (ACL) through the GCS_PROJECT_ID and FEED_STORAGE_GCS_ACL settings.
This storage backend uses delayed file delivery.
Standard output¶
The feeds are written to the standard output of the Scrapy process.
URI scheme:
stdout
Example URI:
stdout:
Required external libraries: none
Delayed file delivery¶
As indicated above, some of the described storage backends use delayed file delivery.
These storage backends do not upload items to the feed URI as those items are scraped. Instead, Scrapy writes items into a temporary local file, and only once all the file contents have been written (i.e. at the end of the crawl) is that file uploaded to the feed URI.
If you want item delivery to start earlier when using one of these storage
backends, use FEED_EXPORT_BATCH_ITEM_COUNT
to split the output items
in multiple files, with the specified maximum item count per file. That way, as
soon as a file reaches the maximum item count, that file is delivered to the
feed URI, allowing item delivery to start way before the end of the crawl.
Item filtering¶
New in version 2.6.0.
You can filter items that you want to allow for a particular feed by using the
item_classes
option in feeds options. Only items of
the specified types will be added to the feed.
The item_classes
option is implemented by the ItemFilter
class, which is the default value of the item_filter
feed option.
You can create your own custom filtering class by implementing ItemFilter
’s
method accepts
and taking feed_options
as an argument.
For instance:
class MyCustomFilter:

    def __init__(self, feed_options):
        self.feed_options = feed_options

    def accepts(self, item):
        if "field1" in item and item["field1"] == "expected_data":
            return True
        return False
You can assign your custom filtering class to the item_filter
option of a feed.
See FEEDS
for examples.
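For instance, a minimal sketch of such an assignment (the feed URI is illustrative; MyCustomFilter is the class defined above):
FEEDS = {
    'filtered_items.json': {
        'format': 'json',
        'item_filter': MyCustomFilter,
    },
}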
ItemFilter¶
Post-Processing¶
New in version 2.6.0.
Scrapy provides an option to activate plugins to post-process feeds before they are exported to feed storages. In addition to using builtin plugins, you can create your own plugins.
These plugins can be activated through the postprocessing
option of a feed.
The option must be passed a list of post-processing plugins in the order you want
the feed to be processed. These plugins can be declared either as an import string
or with the imported class of the plugin. Parameters to plugins can be passed
through the feed options. See feed options for examples.
Built-in Plugins¶
- class scrapy.extensions.postprocessing.GzipPlugin(file: BinaryIO, feed_options: Dict[str, Any])[source]¶
Compresses received data using gzip.
Accepted feed_options parameters:
gzip_compresslevel
gzip_mtime
gzip_filename
See gzip.GzipFile for more info about parameters.
- class scrapy.extensions.postprocessing.LZMAPlugin(file: BinaryIO, feed_options: Dict[str, Any])[source]¶
Compresses received data using lzma.
Accepted feed_options parameters:
lzma_format
lzma_check
lzma_preset
lzma_filters
Note
lzma_filters cannot be used in pypy version 7.3.1 and older.
See lzma.LZMAFile for more info about parameters.
Custom Plugins¶
Each plugin is a class that must implement the following methods:
- __init__(self, file, feed_options)¶
Initialize the plugin.
- write(self, data)¶
Process and write data (bytes or memoryview) into the plugin's target file. It must return the number of bytes written.
- close(self)¶
Close the target file object.
To pass a parameter to your plugin, use feed options. You
can then access those parameters from the __init__
method of your plugin.
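For illustration, a minimal sketch of a plugin implementing this interface (the uppercase_body feed option name is made up for this example):
class UppercasePlugin:

    def __init__(self, file, feed_options):
        self.file = file
        # Hypothetical option read from the feed options, for this example only.
        self.enabled = feed_options.get('uppercase_body', True)

    def write(self, data):
        if self.enabled:
            data = bytes(data).upper()
        self.file.write(data)
        return len(data)

    def close(self):
        self.file.close()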
Settings¶
These are the settings used for configuring the feed exports:
FEEDS (mandatory)
FEED_EXPORT_ENCODING
FEED_EXPORT_FIELDS
FEED_EXPORT_INDENT
FEED_STORE_EMPTY
FEED_STORAGES
FEED_STORAGE_FTP_ACTIVE
FEED_STORAGE_S3_ACL
FEED_STORAGES_BASE
FEED_EXPORTERS
FEED_EXPORTERS_BASE
FEED_EXPORT_BATCH_ITEM_COUNT
FEED_URI_PARAMS
FEEDS¶
New in version 2.1.
Default: {}
A dictionary in which every key is a feed URI (or a pathlib.Path
object) and each value is a nested dictionary containing configuration
parameters for the specific feed.
This setting is required for enabling the feed export feature.
See Storage backends for supported URI schemes.
For instance:
{
    'items.json': {
        'format': 'json',
        'encoding': 'utf8',
        'store_empty': False,
        'item_classes': [MyItemClass1, 'myproject.items.MyItemClass2'],
        'fields': None,
        'indent': 4,
        'item_export_kwargs': {
            'export_empty_fields': True,
        },
    },
    '/home/user/documents/items.xml': {
        'format': 'xml',
        'fields': ['name', 'price'],
        'item_filter': MyCustomFilter1,
        'encoding': 'latin1',
        'indent': 8,
    },
    pathlib.Path('items.csv.gz'): {
        'format': 'csv',
        'fields': ['price', 'name'],
        'item_filter': 'myproject.filters.MyCustomFilter2',
        'postprocessing': [MyPlugin1, 'scrapy.extensions.postprocessing.GzipPlugin'],
        'gzip_compresslevel': 5,
    },
}
The following is a list of the accepted keys and the setting that is used as a fallback value if that key is not provided for a specific feed definition:
format: the serialization format. This setting is mandatory, there is no fallback value.
batch_item_count: falls back to FEED_EXPORT_BATCH_ITEM_COUNT. New in version 2.3.0.
encoding: falls back to FEED_EXPORT_ENCODING.
fields: falls back to FEED_EXPORT_FIELDS.
item_classes: list of item classes to export. If undefined or empty, all items are exported. New in version 2.6.0.
item_filter: a filter class to filter items to export. ItemFilter is used by default. New in version 2.6.0.
indent: falls back to FEED_EXPORT_INDENT.
item_export_kwargs: dict with keyword arguments for the corresponding item exporter class. New in version 2.4.0.
overwrite: whether to overwrite the file if it already exists (True) or append to its content (False). The default value depends on the storage backend:
Local filesystem: False
FTP: True (note that some FTP servers may not support appending to files, i.e. the APPE FTP command)
S3: True (appending is not supported)
Standard output: False (overwriting is not supported)
New in version 2.4.0.
store_empty: falls back to FEED_STORE_EMPTY.
uri_params: falls back to FEED_URI_PARAMS.
postprocessing: list of plugins to use for post-processing. The plugins will be used in the order of the list passed. New in version 2.6.0.
FEED_EXPORT_ENCODING¶
Default: None
The encoding to be used for the feed.
If unset or set to None
(default) it uses UTF-8 for everything except JSON output,
which uses safe numeric encoding (\uXXXX
sequences) for historic reasons.
Use utf-8
if you want UTF-8 for JSON too.
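For example, a one-line settings.py sketch:
FEED_EXPORT_ENCODING = 'utf-8'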
FEED_EXPORT_FIELDS¶
Default: None
A list of fields to export, optional.
Example: FEED_EXPORT_FIELDS = ["foo", "bar", "baz"]
.
Use FEED_EXPORT_FIELDS option to define fields to export and their order.
When FEED_EXPORT_FIELDS is empty or None (default), Scrapy uses the fields defined in item objects yielded by your spider.
If an exporter requires a fixed set of fields (this is the case for CSV export format) and FEED_EXPORT_FIELDS is empty or None, then Scrapy tries to infer field names from the exported data - currently it uses field names from the first item.
FEED_EXPORT_INDENT¶
Default: 0
Amount of spaces used to indent the output on each level. If FEED_EXPORT_INDENT
is a non-negative integer, then array elements and object members will be pretty-printed
with that indent level. An indent level of 0
(the default), or negative,
will put each item on a new line. None
selects the most compact representation.
Currently implemented only by JsonItemExporter
and XmlItemExporter
, i.e. when you are exporting
to .json
or .xml
.
FEED_STORE_EMPTY¶
Default: False
Whether to export empty feeds (i.e. feeds with no items).
FEED_STORAGES¶
Default: {}
A dict containing additional feed storage backends supported by your project. The keys are URI schemes and the values are paths to storage classes.
FEED_STORAGE_FTP_ACTIVE¶
Default: False
Whether to use the active connection mode when exporting feeds to an FTP server
(True
) or use the passive connection mode instead (False
, default).
For information about FTP connection modes, see What is the difference between active and passive FTP?.
FEED_STORAGE_S3_ACL¶
Default: ''
(empty string)
A string containing a custom ACL for feeds exported to Amazon S3 by your project.
For a complete list of available values, access the Canned ACL section on Amazon S3 docs.
FEED_STORAGES_BASE¶
Default:
{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
A dict containing the built-in feed storage backends supported by Scrapy. You
can disable any of these backends by assigning None
to their URI scheme in
FEED_STORAGES
. E.g., to disable the built-in FTP storage backend
(without replacement), place this in your settings.py
:
FEED_STORAGES = {
    'ftp': None,
}
FEED_EXPORTERS¶
Default: {}
A dict containing additional exporters supported by your project. The keys are serialization formats and the values are paths to Item exporter classes.
FEED_EXPORTERS_BASE¶
Default:
{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
A dict containing the built-in feed exporters supported by Scrapy. You can disable any of these exporters by assigning None to their serialization format in FEED_EXPORTERS. E.g., to disable the built-in CSV exporter (without replacement), place this in your settings.py:
FEED_EXPORTERS = {
'csv': None,
}
FEED_EXPORT_BATCH_ITEM_COUNT¶
New in version 2.3.0.
Default: 0
If assigned an integer number higher than 0, Scrapy generates multiple output files storing up to the specified number of items in each output file.
When generating multiple output files, you must use at least one of the following placeholders in the feed URI to indicate how the different output file names are generated:
%(batch_time)s - gets replaced by a timestamp when the feed is being created (e.g. 2020-03-28T14-45-08.237134)
%(batch_id)d - gets replaced by the 1-based sequence number of the batch. Use printf-style string formatting to alter the number format. For example, to make the batch ID a 5-digit number by introducing leading zeroes as needed, use %(batch_id)05d (e.g. 3 becomes 00003, 123 becomes 00123).
For instance, if your settings include:
FEED_EXPORT_BATCH_ITEM_COUNT = 100
And your crawl command line is:
scrapy crawl spidername -o "dirname/%(batch_id)d-filename%(batch_time)s.json"
The command line above can generate a directory tree like:
->projectname
-->dirname
--->1-filename2020-03-28T14-45-08.237134.json
--->2-filename2020-03-28T14-45-09.148903.json
--->3-filename2020-03-28T14-45-10.046092.json
Where the first and second files contain exactly 100 items. The last one contains 100 items or fewer.
FEED_URI_PARAMS¶
Default: None
A string with the import path of a function to set the parameters to apply with printf-style string formatting to the feed URI.
The function signature should be as follows:
- scrapy.extensions.feedexport.uri_params(params, spider)¶
Return a dict of key-value pairs to apply to the feed URI using printf-style string formatting.
- Parameters
params (dict) – default key-value pairs. Specifically:
batch_id: ID of the file batch. See FEED_EXPORT_BATCH_ITEM_COUNT. If FEED_EXPORT_BATCH_ITEM_COUNT is 0, batch_id is always 1. (New in version 2.3.0.)
batch_time: UTC date and time, in ISO format with : replaced with -. See FEED_EXPORT_BATCH_ITEM_COUNT. (New in version 2.3.0.)
time: batch_time, with microseconds set to 0.
spider (scrapy.Spider) – source spider of the feed items
Caution
The function should return a new dictionary; modifying the received params in-place is deprecated.
For example, to include the name of the source spider in the feed URI:
Define the following function somewhere in your project:
# myproject/utils.py
def uri_params(params, spider):
    return {**params, 'spider_name': spider.name}
Point FEED_URI_PARAMS to that function in your settings:
# myproject/settings.py
FEED_URI_PARAMS = 'myproject.utils.uri_params'
Use %(spider_name)s in your feed URI:
scrapy crawl <spider_name> -o "%(spider_name)s.jl"
Requests and Responses¶
Scrapy uses Request
and Response
objects for crawling web
sites.
Typically, Request
objects are generated in the spiders and pass
across the system until they reach the Downloader, which executes the request
and returns a Response
object which travels back to the spider that
issued the request.
Both Request
and Response
classes have subclasses which add
functionality not required in the base classes. These are described
below in Request subclasses and
Response subclasses.
Request objects¶
- class scrapy.http.Request(*args, **kwargs)[source]¶
Represents an HTTP request, which is usually generated in a Spider and executed by the Downloader, thus generating a
Response
.- Parameters
url (str) – the URL of this request. If the URL is invalid, a ValueError exception is raised.
callback (collections.abc.Callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn't specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
method (str) – the HTTP method of this request. Defaults to 'GET'.
meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
body (bytes or str) – the request body. If a string is passed, then it's encoded as bytes using the encoding passed (which defaults to utf-8). If body is not given, an empty bytes object is stored. Regardless of the type of this argument, the final value stored will be a bytes object (never a string or None).
headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP header will not be sent at all.
Caution
Cookies set via the Cookie header are not considered by the CookiesMiddleware. If you need to set cookies for a request, use the Request.cookies parameter. This is a known current limitation that is being worked on.
cookies (dict or list) – the request cookies. These can be sent in two forms.
Using a dict:
request_with_cookies = Request(url="http://www.example.com", cookies={'currency': 'USD', 'country': 'UY'})
Using a list of dicts:
request_with_cookies = Request(url="http://www.example.com", cookies=[{'name': 'currency', 'value': 'USD', 'domain': 'example.com', 'path': '/currency'}])
The latter form allows for customizing the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.
When some site returns cookies (in a response) those are stored in the cookies for that domain and will be sent again in future requests. That's the typical behaviour of any regular web browser.
To create a request that does not send stored cookies and does not store received cookies, set the dont_merge_cookies key to True in request.meta.
Example of a request that sends manually-defined cookies and ignores cookie storage:
Request(
    url="http://www.example.com",
    cookies={'currency': 'USD', 'country': 'UY'},
    meta={'dont_merge_cookies': True},
)
For more info see CookiesMiddleware.
Caution
Cookies set via the Cookie header are not considered by the CookiesMiddleware. If you need to set cookies for a request, use the Request.cookies parameter. This is a known current limitation that is being worked on.
encoding (str) – the encoding of this request (defaults to 'utf-8'). This encoding will be used to percent-encode the URL and to convert the body to bytes (if given as a string).
priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low priority.
dont_filter (bool) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
errback (collections.abc.Callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Failure as first parameter. For more information, see Using errbacks to catch exceptions in request processing below.
Changed in version 2.0: The callback parameter is no longer required when the errback parameter is specified.
flags (list) – Flags sent to the request, can be used for logging or similar purposes.
cb_kwargs (dict) – A dict with arbitrary data that will be passed as keyword arguments to the Request’s callback.
- url¶
A string containing the URL of this request. Keep in mind that this attribute contains the escaped URL, so it can differ from the URL passed in the
__init__
method.This attribute is read-only. To change the URL of a Request use
replace()
.
- method¶
A string representing the HTTP method in the request. This is guaranteed to be uppercase. Example: "GET", "POST", "PUT", etc.
- headers¶
A dictionary-like object which contains the request headers.
- body¶
The request body as bytes.
This attribute is read-only. To change the body of a Request use
replace()
.
- meta¶
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.
See Request.meta special keys for a list of special meta keys recognized by Scrapy.
This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.
- cb_kwargs¶
A dictionary that contains arbitrary metadata for this request. Its contents will be passed to the Request's callback as keyword arguments. It is empty for new Requests, which means by default callbacks only get a Response object as argument.
This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.cb_kwargs attribute.
In case of a failure to process the request, this dict can be accessed as failure.request.cb_kwargs in the request's errback. For more information, see Accessing additional data in errback functions.
- attributes: Tuple[str, ...] = ('url', 'callback', 'method', 'headers', 'body', 'cookies', 'meta', 'encoding', 'priority', 'dont_filter', 'errback', 'flags', 'cb_kwargs')¶
A tuple of str objects containing the name of all public attributes of the class that are also keyword parameters of the __init__ method.
Currently used by Request.replace(), Request.to_dict() and request_from_dict().
- copy()[source]¶
Return a new Request which is a copy of this Request. See also: Passing additional data to callback functions.
- replace([url, method, headers, body, cookies, meta, flags, encoding, priority, dont_filter, callback, errback, cb_kwargs])[source]¶
Return a Request object with the same members, except for those members given new values by whichever keyword arguments are specified. The Request.cb_kwargs and Request.meta attributes are shallow copied by default (unless new values are given as arguments). See also Passing additional data to callback functions.
- classmethod from_curl(curl_command: str, ignore_unknown_options: bool = True, **kwargs) scrapy.http.request.RequestTypeVar [source]¶
Create a Request object from a string containing a cURL command. It populates the HTTP method, the URL, the headers, the cookies and the body. It accepts the same arguments as the
Request
class, taking preference and overriding the values of the same arguments contained in the cURL command.Unrecognized options are ignored by default. To raise an error when finding unknown options call this method by passing
ignore_unknown_options=False
.Caution
Using
from_curl()
fromRequest
subclasses, such asJSONRequest
, orXmlRpcRequest
, as well as having downloader middlewares and spider middlewares enabled, such asDefaultHeadersMiddleware
,UserAgentMiddleware
, orHttpCompressionMiddleware
, may modify theRequest
object.To translate a cURL command into a Scrapy request, you may use curl2scrapy.
- to_dict(*, spider: Optional[scrapy.spiders.Spider] = None) dict [source]¶
Return a dictionary containing the Request’s data.
Use request_from_dict() to convert back into a Request object.
If a spider is given, this method will try to find out the name of the spider methods used as callback and errback and include them in the output dict, raising an exception if they cannot be found.
Passing additional data to callback functions¶
The callback of a request is a function that will be called when the response
of that request is downloaded. The callback function will be called with the
downloaded Response
object as its first argument.
Example:
def parse_page1(self, response):
return scrapy.Request("http://www.example.com/some_page.html",
callback=self.parse_page2)
def parse_page2(self, response):
# this would log http://www.example.com/some_page.html
self.logger.info("Visited %s", response.url)
In some cases you may be interested in passing arguments to those callback
functions so you can receive the arguments later, in the second callback.
The following example shows how to achieve this by using the
Request.cb_kwargs
attribute:
def parse(self, response):
request = scrapy.Request('http://www.example.com/index.html',
callback=self.parse_page2,
cb_kwargs=dict(main_url=response.url))
request.cb_kwargs['foo'] = 'bar' # add more arguments for the callback
yield request
def parse_page2(self, response, main_url, foo):
yield dict(
main_url=main_url,
other_url=response.url,
foo=foo,
)
Caution
Request.cb_kwargs
was introduced in version 1.7
.
Prior to that, using Request.meta
was recommended for passing
information around callbacks. After 1.7
, Request.cb_kwargs
became the preferred way for handling user information, leaving Request.meta
for communication with components like middlewares and extensions.
Using errbacks to catch exceptions in request processing¶
The errback of a request is a function that will be called when an exception is raised while processing it.
It receives a Failure
as first parameter and can
be used to track connection establishment timeouts, DNS errors etc.
Here’s an example spider logging all errors and catching some specific errors if needed:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError
class ErrbackSpider(scrapy.Spider):
name = "errback_example"
start_urls = [
"http://www.httpbin.org/", # HTTP 200 expected
"http://www.httpbin.org/status/404", # Not found error
"http://www.httpbin.org/status/500", # server issue
"http://www.httpbin.org:12345/", # non-responding host, timeout expected
"https://example.invalid/", # DNS error expected
]
def start_requests(self):
for u in self.start_urls:
yield scrapy.Request(u, callback=self.parse_httpbin,
errback=self.errback_httpbin,
dont_filter=True)
def parse_httpbin(self, response):
self.logger.info('Got successful response from {}'.format(response.url))
# do something useful here...
def errback_httpbin(self, failure):
# log all failures
self.logger.error(repr(failure))
# in case you want to do something special for some errors,
# you may need the failure's type:
if failure.check(HttpError):
# these exceptions come from HttpError spider middleware
# you can get the non-200 response
response = failure.value.response
self.logger.error('HttpError on %s', response.url)
elif failure.check(DNSLookupError):
# this is the original request
request = failure.request
self.logger.error('DNSLookupError on %s', request.url)
elif failure.check(TimeoutError, TCPTimedOutError):
request = failure.request
self.logger.error('TimeoutError on %s', request.url)
Accessing additional data in errback functions¶
In case of a failure to process the request, you may be interested in
accessing arguments to the callback functions so you can process further
based on the arguments in the errback. The following example shows how to
achieve this by using Failure.request.cb_kwargs
:
def parse(self, response):
request = scrapy.Request('http://www.example.com/index.html',
callback=self.parse_page2,
errback=self.errback_page2,
cb_kwargs=dict(main_url=response.url))
yield request
def parse_page2(self, response, main_url):
pass
def errback_page2(self, failure):
yield dict(
main_url=failure.request.cb_kwargs['main_url'],
)
Request.meta special keys¶
The Request.meta
attribute can contain any arbitrary data, but there
are some special keys recognized by Scrapy and its built-in extensions.
Those are:
ftp_password (See FTP_PASSWORD for more info)
ftp_user (See FTP_USER for more info)
bindaddress¶
The outgoing IP address to use for performing the request.
download_timeout¶
The amount of time (in secs) that the downloader will wait before timing out.
See also: DOWNLOAD_TIMEOUT
.
download_latency¶
The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
download_fail_on_dataloss¶
Whether or not to fail on broken responses. See:
DOWNLOAD_FAIL_ON_DATALOSS
.
max_retry_times¶
This meta key is used to set the maximum number of retries per request. When set, the max_retry_times meta key takes precedence over the RETRY_TIMES setting.
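For illustration, several of these keys can be set per request through Request.meta inside a spider callback (the values below are arbitrary):
yield scrapy.Request(
    url,
    meta={
        'download_timeout': 30,      # seconds before this request times out
        'max_retry_times': 5,        # per-request retry limit
        'dont_merge_cookies': True,  # do not use or store cookies for this request
    },
)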
Stopping the download of a Response¶
Raising a StopDownload
exception from a handler for the
bytes_received
or headers_received
signals will stop the download of a given response. See the following example:
import scrapy
class StopSpider(scrapy.Spider):
name = "stop"
start_urls = ["https://docs.scrapy.org/en/latest/"]
@classmethod
def from_crawler(cls, crawler):
spider = super().from_crawler(crawler)
crawler.signals.connect(spider.on_bytes_received, signal=scrapy.signals.bytes_received)
return spider
def parse(self, response):
# 'last_chars' show that the full response was not downloaded
yield {"len": len(response.text), "last_chars": response.text[-40:]}
def on_bytes_received(self, data, request, spider):
raise scrapy.exceptions.StopDownload(fail=False)
which produces the following output:
2020-05-19 17:26:12 [scrapy.core.engine] INFO: Spider opened
2020-05-19 17:26:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-19 17:26:13 [scrapy.core.downloader.handlers.http11] DEBUG: Download stopped for <GET https://docs.scrapy.org/en/latest/> from signal handler StopSpider.on_bytes_received
2020-05-19 17:26:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/> (referer: None) ['download_stopped']
2020-05-19 17:26:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://docs.scrapy.org/en/latest/>
{'len': 279, 'last_chars': 'dth, initial-scale=1.0">\n \n <title>Scr'}
2020-05-19 17:26:13 [scrapy.core.engine] INFO: Closing spider (finished)
By default, resulting responses are handled by their corresponding errbacks. To
call their callback instead, like in this example, pass fail=False
to the
StopDownload
exception.
Request subclasses¶
Here is the list of built-in Request
subclasses. You can also subclass
it to implement your own custom functionality.
FormRequest objects¶
The FormRequest class extends the base Request
with functionality for
dealing with HTML forms. It uses lxml.html forms to pre-populate form
fields with form data from Response
objects.
- class scrapy.http.request.form.FormRequest¶
- class scrapy.http.FormRequest¶
- class scrapy.FormRequest(url[, formdata, ...])¶
The FormRequest class adds a new keyword parameter to the __init__ method. The remaining arguments are the same as for the Request class and are not documented here.
- Parameters
formdata (dict or collections.abc.Iterable) – is a dictionary (or iterable of (key, value) tuples) containing HTML Form data which will be url-encoded and assigned to the body of the request.
The FormRequest objects support the following class method in addition to the standard Request methods:
- classmethod FormRequest.from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])¶
Returns a new FormRequest object with its form field values pre-populated with those found in the HTML <form> element contained in the given response. For an example see Using FormRequest.from_response() to simulate a user login.
The policy is to automatically simulate a click, by default, on any form control that looks clickable, like a <input type="submit">. Even though this is quite convenient, and often the desired behaviour, sometimes it can cause problems which could be hard to debug. For example, when working with forms that are filled and/or submitted using javascript, the default from_response() behaviour may not be the most appropriate. To disable this behaviour you can set the dont_click argument to True. Also, if you want to change the control clicked (instead of disabling it) you can also use the clickdata argument.
Caution
Using this method with select elements which have leading or trailing whitespace in the option values will not work due to a bug in lxml, which should be fixed in lxml 3.8 and above.
- Parameters
response (Response object) – the response containing a HTML form which will be used to pre-populate the form fields
formname (str) – if given, the form with name attribute set to this value will be used.
formid (str) – if given, the form with id attribute set to this value will be used.
formxpath (str) – if given, the first form that matches the xpath will be used.
formcss (str) – if given, the first form that matches the css selector will be used.
formnumber (int) – the number of form to use, when the response contains multiple forms. The first one (and also the default) is 0.
formdata (dict) – fields to override in the form data. If a field was already present in the response <form> element, its value is overridden by the one passed in this parameter. If a value passed in this parameter is None, the field will not be included in the request, even if it was present in the response <form> element.
clickdata (dict) – attributes to lookup the control clicked. If it's not given, the form data will be submitted simulating a click on the first clickable element. In addition to html attributes, the control can be identified by its zero-based index relative to other submittable inputs inside the form, via the nr attribute.
dont_click (bool) – If True, the form data will be submitted without clicking in any element.
The other parameters of this class method are passed directly to the
FormRequest
__init__
method.
Request usage examples¶
Using FormRequest to send data via HTTP POST¶
If you want to simulate an HTML Form POST in your spider and send a couple of
key-value fields, you can return a FormRequest
object (from your
spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
formdata={'name': 'John Doe', 'age': '27'},
callback=self.after_post)]
Using FormRequest.from_response() to simulate a user login¶
It is usual for web sites to provide pre-populated form fields through <input
type="hidden">
elements, such as session related data or authentication
tokens (for login pages). When scraping, you’ll want these fields to be
automatically pre-populated and only override a couple of them, such as the
user name and password. You can use the FormRequest.from_response()
method for this job. Here’s an example spider which uses it:
import scrapy
def authentication_failed(response):
# TODO: Check the contents of the response and return True if it failed
# or False if it succeeded.
pass
class LoginSpider(scrapy.Spider):
name = 'example.com'
start_urls = ['http://www.example.com/users/login.php']
def parse(self, response):
return scrapy.FormRequest.from_response(
response,
formdata={'username': 'john', 'password': 'secret'},
callback=self.after_login
)
def after_login(self, response):
if authentication_failed(response):
self.logger.error("Login failed")
return
# continue scraping with authenticated session...
JsonRequest¶
The JsonRequest class extends the base Request
class with functionality for
dealing with JSON requests.
- class scrapy.http.JsonRequest(url[, ... data, dumps_kwargs])[source]¶
The JsonRequest class adds two new keyword parameters to the __init__ method. The remaining arguments are the same as for the Request class and are not documented here.
Using the JsonRequest will set the Content-Type header to application/json and the Accept header to application/json, text/javascript, */*; q=0.01
- Parameters
data (object) – any JSON serializable object that needs to be JSON encoded and assigned to the body. If the Request.body argument is provided, this parameter will be ignored. If the Request.body argument is not provided and the data argument is provided, Request.method will be set to 'POST' automatically.
automatically.dumps_kwargs (dict) – Parameters that will be passed to underlying
json.dumps()
method which is used to serialize data into JSON format.
- attributes: Tuple[str, ...] = ('url', 'callback', 'method', 'headers', 'body', 'cookies', 'meta', 'encoding', 'priority', 'dont_filter', 'errback', 'flags', 'cb_kwargs', 'dumps_kwargs')¶
A tuple of str objects containing the name of all public attributes of the class that are also keyword parameters of the __init__ method.
Currently used by Request.replace(), Request.to_dict() and request_from_dict().
JsonRequest usage example¶
Sending a JSON POST request with a JSON payload:
data = {
'name1': 'value1',
'name2': 'value2',
}
yield JsonRequest(url='http://www.example.com/post/action', data=data)
Response objects¶
- class scrapy.http.Response(*args, **kwargs)[source]¶
An object that represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.
- Parameters
url (str) – the URL of this response
status (int) – the HTTP status of the response. Defaults to
200
.headers (dict) – the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
body (bytes) – the response body. To access the decoded text as a string, use
response.text
from an encoding-aware Response subclass, such asTextResponse
.flags (list) – is a list containing the initial values for the
Response.flags
attribute. If given, the list will be shallow copied.request (scrapy.Request) – the initial value of the
Response.request
attribute. This represents theRequest
that generated this response.certificate (twisted.internet.ssl.Certificate) – an object representing the server’s SSL certificate.
ip_address (
ipaddress.IPv4Address
oripaddress.IPv6Address
) – The IP address of the server from which the Response originated.protocol (
str
) – The protocol that was used to download the response. For instance: “HTTP/1.0”, “HTTP/1.1”, “h2”
New in version 2.0.0: The
certificate
parameter.New in version 2.1.0: The
ip_address
parameter.New in version 2.5.0: The
protocol
parameter.- url¶
A string containing the URL of the response.
This attribute is read-only. To change the URL of a Response use
replace()
.
- status¶
An integer representing the HTTP status of the response. Example: 200, 404.
- headers¶
A dictionary-like object which contains the response headers. Values can be accessed using get() to return the first header value with the specified name or getlist() to return all header values with the specified name. For example, this call will give you all cookies in the headers:
response.headers.getlist('Set-Cookie')
- body¶
The response body as bytes.
If you want the body as a string, use TextResponse.text (only available in TextResponse and subclasses).
This attribute is read-only. To change the body of a Response use replace().
- request¶
The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all Downloader Middlewares. In particular, this means that:
HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).
Response.request.url doesn't always equal Response.url
This attribute is only available in the spider code, and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of the response_downloaded signal.
- meta¶
A shortcut to the Request.meta attribute of the Response.request object (i.e. self.request.meta).
Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.
See also: Request.meta attribute
- cb_kwargs¶
New in version 2.0.
A shortcut to the Request.cb_kwargs attribute of the Response.request object (i.e. self.request.cb_kwargs).
Unlike the Response.request attribute, the Response.cb_kwargs attribute is propagated along redirects and retries, so you will get the original Request.cb_kwargs sent from your spider.
See also: Request.cb_kwargs attribute
- flags¶
A list that contains flags for this response. Flags are labels used for tagging Responses. For example: 'cached', 'redirected', etc. And they're shown on the string representation of the Response (__str__ method) which is used by the engine for logging.
- certificate¶
New in version 2.0.0.
A twisted.internet.ssl.Certificate object representing the server's SSL certificate.
Only populated for https responses, None otherwise.
- ip_address¶
New in version 2.1.0.
The IP address of the server from which the Response originated.
This attribute is currently only populated by the HTTP 1.1 download handler, i.e. for http(s) responses. For other handlers, ip_address is always None.
- protocol¶
New in version 2.5.0.
The protocol that was used to download the response. For instance: “HTTP/1.0”, “HTTP/1.1”
This attribute is currently only populated by the HTTP download handlers, i.e. for http(s) responses. For other handlers, protocol is always None.
- attributes: Tuple[str, ...] = ('url', 'status', 'headers', 'body', 'flags', 'request', 'certificate', 'ip_address', 'protocol')¶
A tuple of str objects containing the name of all public attributes of the class that are also keyword parameters of the __init__ method.
Currently used by Response.replace().
- replace([url, status, headers, body, request, flags, cls])[source]¶
Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute
Response.meta
is copied by default.
- urljoin(url)[source]¶
Constructs an absolute url by combining the Response's url with a possible relative url.
This is a wrapper over urljoin(); it's merely an alias for making this call:
urllib.parse.urljoin(response.url, url)
- follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None) scrapy.http.request.Request [source]¶
Return a Request instance to follow a link url. It accepts the same arguments as the Request.__init__ method, but url can be a relative URL or a scrapy.link.Link object, not only an absolute URL.
TextResponse provides a follow() method which supports selectors in addition to absolute/relative URLs and Link objects.
New in version 2.0: The flags parameter.
- follow_all(urls, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None) Generator[scrapy.http.request.Request, None, None] [source]¶
New in version 2.0.
Return an iterable of Request instances to follow all links in urls. It accepts the same arguments as the Request.__init__ method, but elements of urls can be relative URLs or Link objects, not only absolute URLs.
TextResponse provides a follow_all() method which supports selectors in addition to absolute/relative URLs and Link objects.
Response subclasses¶
Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.
TextResponse objects¶
- class scrapy.http.TextResponse(url[, encoding[, ...]])[source]¶
TextResponse objects add encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds or any media file.
objects support a new__init__
method argument, in addition to the baseResponse
objects. The remaining functionality is the same as for theResponse
class and is not documented here.- Parameters
encoding (str) – is a string which contains the encoding to use for this response. If you create a
TextResponse
object with a string as body, it will be converted to bytes encoded using this encoding. If encoding isNone
(default), the encoding will be looked up in the response headers and body instead.
TextResponse
objects support the following attributes in addition to the standardResponse
ones:- text¶
Response body, as a string.
The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.
Note
str(response.body) is not a correct way to convert the response body into a string:
>>> str(b'body')
"b'body'"
- encoding¶
A string with the encoding of this response. The encoding is resolved by trying the following mechanisms, in order:
the encoding passed in the __init__ method encoding argument
the encoding declared in the Content-Type HTTP header. If this encoding is not valid (i.e. unknown), it is ignored and the next resolution mechanism is tried.
the encoding declared in the response body. The TextResponse class doesn't provide any special functionality for this. However, the HtmlResponse and XmlResponse classes do.
- selector¶
A
Selector
instance using the response as target. The selector is lazily instantiated on first access.
- attributes: Tuple[str, ...] = ('url', 'status', 'headers', 'body', 'flags', 'request', 'certificate', 'ip_address', 'protocol', 'encoding')¶
A tuple of str objects containing the name of all public attributes of the class that are also keyword parameters of the __init__ method.
Currently used by Response.replace().
TextResponse
objects support the following methods in addition to the standardResponse
ones:- follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding=None, priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None) scrapy.http.request.Request [source]¶
Return a Request instance to follow a link url. It accepts the same arguments as the Request.__init__ method, but url can be not only an absolute URL, but also
a relative URL
a Link object, e.g. the result of Link Extractors
a Selector object for a <link> or <a> element, e.g. response.css('a.my_link')[0]
an attribute Selector (not SelectorList), e.g. response.css('a::attr(href)')[0] or response.xpath('//img/@src')[0]
See A shortcut for creating Requests for usage examples.
- follow_all(urls=None, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding=None, priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None, css=None, xpath=None) Generator[scrapy.http.request.Request, None, None] [source]¶
A generator that produces Request instances to follow all links in urls. It accepts the same arguments as the Request's __init__ method, except that each urls element does not need to be an absolute URL, it can be any of the following:
a relative URL
a Link object, e.g. the result of Link Extractors
a Selector object for a <link> or <a> element, e.g. response.css('a.my_link')[0]
an attribute Selector (not SelectorList), e.g. response.css('a::attr(href)')[0] or response.xpath('//img/@src')[0]
In addition, css and xpath arguments are accepted to perform the link extraction within the follow_all method (only one of urls, css and xpath is accepted).
Note that when passing a SelectorList as argument for the urls parameter or using the css or xpath parameters, this method will not produce requests for selectors from which links cannot be obtained (for instance, anchor tags without an href attribute).
HtmlResponse objects¶
- class scrapy.http.HtmlResponse(url[, ...])[source]¶
The
HtmlResponse
class is a subclass ofTextResponse
which adds encoding auto-discovering support by looking into the HTML meta http-equiv attribute. SeeTextResponse.encoding
.
XmlResponse objects¶
- class scrapy.http.XmlResponse(url[, ...])[source]¶
The
XmlResponse
class is a subclass ofTextResponse
which adds encoding auto-discovering support by looking into the XML declaration line. SeeTextResponse.encoding
.
Link Extractors¶
A link extractor is an object that extracts links from responses.
The __init__
method of
LxmlLinkExtractor
takes settings that
determine which links may be extracted. LxmlLinkExtractor.extract_links
returns a
list of matching Link
objects from a
Response
object.
Link extractors are used in CrawlSpider
spiders
through a set of Rule
objects.
You can also use link extractors in regular spiders. For example, you can instantiate
LinkExtractor
into a class
variable in your spider, and use it from your spider callbacks:
def parse(self, response):
for link in self.link_extractor.extract_links(response):
yield Request(link.url, callback=self.parse)
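A fuller sketch of that pattern, with the extractor stored as a class variable (the spider name and start URL are illustrative):
from scrapy import Request, Spider
from scrapy.linkextractors import LinkExtractor

class FollowAllSpider(Spider):
    name = 'follow_all'                           # illustrative name
    start_urls = ['https://quotes.toscrape.com/']
    link_extractor = LinkExtractor()              # instantiated once as a class variable

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            yield Request(link.url, callback=self.parse)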
Link extractor reference¶
The link extractor class is
scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor
. For convenience it
can also be imported as scrapy.linkextractors.LinkExtractor
:
from scrapy.linkextractors import LinkExtractor
LxmlLinkExtractor¶
- class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=False, unique=True, process_value=None, strip=True)[source]¶
LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml’s robust HTMLParser.
- Parameters
allow (str or list) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
deny (str or list) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty) it won't exclude any links.
allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links
deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links
deny_extensions (list) –
a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to
scrapy.linkextractors.IGNORED_EXTENSIONS
Changed in version 2.0: IGNORED_EXTENSIONS now includes 7z, 7zip, apk, bz2, cdr, dmg, ico, iso, tar, tar.gz, webm, and xz.
restrict_xpaths (str or list) – is an XPath (or list of XPath's) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPath will be scanned for links. See examples below.
restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.
restrict_text (str or list) – a single regular expression (or list of regular expressions) that the link's text must match in order to be extracted. If not given (or empty), it will match all links. If a list of regular expressions is given, the link will be extracted if it matches at least one.
tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)
canonicalize (bool) – canonicalize each extracted url (using w3lib.url.canonicalize_url). Defaults to False. Note that canonicalize_url is meant for duplicate checking; it can change the URL visible at server side, so the response can be different for requests with canonicalized and raw URLs. If you're using LinkExtractor to follow links it is more robust to keep the default canonicalize=False.
.unique (bool) – whether duplicate filtering should be applied to extracted links.
process_value (collections.abc.Callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
For example, to extract links from this code:
<a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
You can use the following function in process_value:
import re

def process_value(value):
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
strip (bool) – whether to strip whitespaces from extracted attributes. According to HTML5 standard, leading and trailing whitespaces must be stripped from href attributes of <a>, <area> and many other elements, src attribute of <img>, <iframe> elements, etc., so LinkExtractor strips space chars by default. Set strip=False to turn it off (e.g. if you're extracting urls from elements or attributes which allow leading/trailing whitespaces).
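For instance, a sketch of an extractor that only follows pagination links under /tag/ (the regular expression and CSS selector are illustrative, and response is assumed to be an HtmlResponse):
from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor(
    allow=r'/tag/',            # only URLs matching this regular expression
    restrict_css='ul.pager',   # only look for links inside this region of the page
)
for link in link_extractor.extract_links(response):  # extract_links returns Link objects
    print(link.url, link.text)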
Link¶
- class scrapy.link.Link(url, text='', fragment='', nofollow=False)[source]¶
Link objects represent an extracted link by the LinkExtractor.
Using the anchor tag sample below to illustrate the parameters:
<a href="https://example.com/nofollow.html#foo" rel="nofollow">Dont follow this one</a>
- Parameters
url – the absolute url being linked to in the anchor tag. From the sample, this is https://example.com/nofollow.html.
text – the text in the anchor tag. From the sample, this is Dont follow this one.
fragment – the part of the url after the hash symbol. From the sample, this is foo.
nofollow – an indication of the presence or absence of a nofollow value in the rel attribute of the anchor tag.
Settings¶
The Scrapy settings allow you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and spiders themselves.
The infrastructure of the settings provides a global namespace of key-value mappings that the code can use to pull configuration values from. The settings can be populated through different mechanisms, which are described below.
The settings are also the mechanism for selecting the currently active Scrapy project (in case you have many).
For a list of available built-in settings see: Built-in settings reference.
Designating the settings¶
When you use Scrapy, you have to tell it which settings you’re using. You can
do this by using an environment variable, SCRAPY_SETTINGS_MODULE
.
The value of SCRAPY_SETTINGS_MODULE
should be in Python path syntax, e.g.
myproject.settings
. Note that the settings module should be on the
Python import search path.
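For example, in a POSIX shell:
export SCRAPY_SETTINGS_MODULE=myproject.settings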
Populating the settings¶
Settings can be populated using different mechanisms, each of which has a different precedence. Here is the list of them in decreasing order of precedence:
Command line options (most precedence)
Settings per-spider
Project settings module
Default settings per-command
Default global settings (less precedence)
The population of these settings sources is taken care of internally, but a manual handling is possible using API calls. See the Settings API topic for reference.
These mechanisms are described in more detail below.
1. Command line options¶
Arguments provided by the command line are the ones that take most precedence,
overriding any other options. You can explicitly override one (or more)
settings using the -s
(or --set
) command line option.
Example:
scrapy crawl myspider -s LOG_FILE=scrapy.log
2. Settings per-spider¶
Spiders (See the Spiders chapter for reference) can define their
own settings that will take precedence and override the project ones. They can
do so by setting their custom_settings
attribute:
class MySpider(scrapy.Spider):
name = 'myspider'
custom_settings = {
'SOME_SETTING': 'some value',
}
3. Project settings module¶
The project settings module is the standard configuration file for your Scrapy
project; it's where most of your custom settings will be populated. For a
standard Scrapy project, this means you’ll be adding or changing the settings
in the settings.py
file created for your project.
4. Default settings per-command¶
Each Scrapy tool command can have its own default
settings, which override the global default settings. Those custom command
settings are specified in the default_settings
attribute of the command
class.
5. Default global settings¶
The global defaults are located in the scrapy.settings.default_settings
module and documented in the Built-in settings reference section.
Import paths and classes¶
New in version 2.4.0.
When a setting references a callable object to be imported by Scrapy, such as a class or a function, there are two different ways you can specify that object:
As a string containing the import path of that object
As the object itself
For example:
from mybot.pipelines.validate import ValidateMyItem
ITEM_PIPELINES = {
# passing the classname...
ValidateMyItem: 300,
# ...equals passing the class path
'mybot.pipelines.validate.ValidateMyItem': 300,
}
Note
Passing non-callable objects is not supported.
How to access settings¶
In a spider, the settings are available through self.settings
:
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['http://example.com']
def parse(self, response):
print(f"Existing settings: {self.settings.attributes.keys()}")
Note
The settings
attribute is set in the base Spider class after the spider
is initialized. If you want to use the settings before the initialization
(e.g., in your spider’s __init__()
method), you’ll need to override the
from_crawler()
method.
Settings can be accessed through the scrapy.crawler.Crawler.settings
attribute of the Crawler that is passed to from_crawler
method in
extensions, middlewares and item pipelines:
class MyExtension:
def __init__(self, log_is_enabled=False):
if log_is_enabled:
print("log is enabled!")
@classmethod
def from_crawler(cls, crawler):
settings = crawler.settings
return cls(settings.getbool('LOG_ENABLED'))
The settings object can be used like a dict (e.g.,
settings['LOG_ENABLED']
), but it’s usually preferred to extract the setting
in the format you need it to avoid type errors, using one of the methods
provided by the Settings
API.
Rationale for setting names¶
Setting names are usually prefixed with the component that they configure. For
example, proper setting names for a fictional robots.txt extension would be
ROBOTSTXT_ENABLED
, ROBOTSTXT_OBEY
, ROBOTSTXT_CACHEDIR
, etc.
Built-in settings reference¶
Here’s a list of all available Scrapy settings, in alphabetical order, along with their default values and the scope where they apply.
The scope, where available, shows where the setting is being used, if it’s tied to any particular component. In that case the module of that component will be shown, typically an extension, middleware or pipeline. It also means that the component must be enabled in order for the setting to have any effect.
AWS_ACCESS_KEY_ID¶
Default: None
The AWS access key used by code that requires access to Amazon Web services, such as the S3 feed storage backend.
AWS_SECRET_ACCESS_KEY¶
Default: None
The AWS secret key used by code that requires access to Amazon Web services, such as the S3 feed storage backend.
AWS_SESSION_TOKEN¶
Default: None
The AWS security token used by code that requires access to Amazon Web services, such as the S3 feed storage backend, when using temporary security credentials.
AWS_ENDPOINT_URL¶
Default: None
Endpoint URL used for S3-like storage, for example Minio or s3.scality.
AWS_USE_SSL¶
Default: None
Use this option if you want to disable SSL connection for communication with S3 or S3-like storage. By default SSL will be used.
AWS_VERIFY¶
Default: None
Verify SSL connection between Scrapy and S3 or S3-like storage. By default SSL verification will occur.
AWS_REGION_NAME¶
Default: None
The name of the region associated with the AWS client.
ASYNCIO_EVENT_LOOP¶
Default: None
Import path of a given asyncio
event loop class.
If the asyncio reactor is enabled (see TWISTED_REACTOR
) this setting can be used to specify the
asyncio event loop to be used with it. Set the setting to the import path of the
desired asyncio event loop class. If the setting is set to None
the default asyncio
event loop will be used.
If you are installing the asyncio reactor manually using the install_reactor()
function, you can use the event_loop_path
parameter to indicate the import path of the event loop
class to be used.
Note that the event loop class must inherit from asyncio.AbstractEventLoop.
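A sketch of that manual installation (uvloop.Loop is an assumption; any class inheriting from asyncio.AbstractEventLoop works):
from scrapy.utils.reactor import install_reactor

install_reactor(
    'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    event_loop_path='uvloop.Loop',  # assumed event loop class; requires uvloop to be installed
)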
Caution
Please be aware that, when using a non-default event loop
(either defined via ASYNCIO_EVENT_LOOP
or installed with
install_reactor()
), Scrapy will call
asyncio.set_event_loop()
, which will set the specified event loop
as the current loop for the current OS thread.
BOT_NAME¶
Default: 'scrapybot'
The name of the bot implemented by this Scrapy project (also known as the project name). This name will be used for the logging too.
It’s automatically populated with your project name when you create your
project with the startproject
command.
CONCURRENT_ITEMS¶
Default: 100
Maximum number of concurrent items (per response) to process in parallel in item pipelines.
CONCURRENT_REQUESTS¶
Default: 16
The maximum number of concurrent (i.e. simultaneous) requests that will be performed by the Scrapy downloader.
CONCURRENT_REQUESTS_PER_DOMAIN¶
Default: 8
The maximum number of concurrent (i.e. simultaneous) requests that will be performed to any single domain.
See also: AutoThrottle extension and its
AUTOTHROTTLE_TARGET_CONCURRENCY
option.
CONCURRENT_REQUESTS_PER_IP¶
Default: 0
The maximum number of concurrent (i.e. simultaneous) requests that will be
performed to any single IP. If non-zero, the
CONCURRENT_REQUESTS_PER_DOMAIN
setting is ignored, and this one is
used instead. In other words, concurrency limits will be applied per IP, not
per domain.
This setting also affects DOWNLOAD_DELAY
and
AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP
is non-zero, download delay is enforced per IP, not per domain.
DEFAULT_ITEM_CLASS¶
Default: 'scrapy.Item'
The default class that will be used for instantiating items in the Scrapy shell.
DEFAULT_REQUEST_HEADERS¶
Default:
{
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
The default headers used for Scrapy HTTP Requests. They’re populated in the
DefaultHeadersMiddleware
.
Caution
Cookies set via the Cookie
header are not considered by the
CookiesMiddleware. If you need to set cookies for a request, use the
Request.cookies
parameter. This is a known
current limitation that is being worked on.
DEPTH_LIMIT¶
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
DEPTH_PRIORITY¶
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
An integer that is used to adjust the priority
of
a Request
based on its depth.
The priority of a request is adjusted as follows:
request.priority = request.priority - ( depth * DEPTH_PRIORITY )
As depth increases, positive values of DEPTH_PRIORITY
decrease request
priority (BFO), while negative values increase request priority (DFO). See
also Does Scrapy crawl in breadth-first or depth-first order?.
Note
This setting adjusts priority in the opposite way compared to
other priority settings REDIRECT_PRIORITY_ADJUST
and RETRY_PRIORITY_ADJUST
.
DEPTH_STATS_VERBOSE¶
Default: False
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
Whether to collect verbose depth stats. If this is enabled, the number of requests for each depth is collected in the stats.
DNSCACHE_ENABLED¶
Default: True
Whether to enable DNS in-memory cache.
DNSCACHE_SIZE¶
Default: 10000
DNS in-memory cache size.
DNS_RESOLVER¶
New in version 2.0.
Default: 'scrapy.resolver.CachingThreadedResolver'
The class to be used to resolve DNS names. The default scrapy.resolver.CachingThreadedResolver
supports specifying a timeout for DNS requests via the DNS_TIMEOUT
setting,
but works only with IPv4 addresses. Scrapy provides an alternative resolver,
scrapy.resolver.CachingHostnameResolver
, which supports IPv4/IPv6 addresses but does not
take the DNS_TIMEOUT
setting into account.
DNS_TIMEOUT¶
Default: 60
Timeout for processing of DNS queries in seconds. Float is supported.
DOWNLOADER¶
Default: 'scrapy.core.downloader.Downloader'
The downloader to use for crawling.
DOWNLOADER_HTTPCLIENTFACTORY¶
Default: 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
Defines a Twisted protocol.ClientFactory
class to use for HTTP/1.0
connections (for HTTP10DownloadHandler
).
Note
HTTP/1.0 is rarely used nowadays so you can safely ignore this setting,
unless you really want to use HTTP/1.0 and override
DOWNLOAD_HANDLERS
for http(s)
scheme accordingly,
i.e. to 'scrapy.core.downloader.handlers.http.HTTP10DownloadHandler'
.
DOWNLOADER_CLIENTCONTEXTFACTORY¶
Default: 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'
Represents the classpath to the ContextFactory to use.
Here, “ContextFactory” is a Twisted term for SSL/TLS contexts, defining the TLS/SSL protocol version to use, whether to do certificate verification, or even enable client-side authentication (and various other things).
Note
Scrapy default context factory does NOT perform remote server certificate verification. This is usually fine for web scraping.
If you do need remote server certificate verification enabled,
Scrapy also has another context factory class that you can set,
'scrapy.core.downloader.contextfactory.BrowserLikeContextFactory'
,
which uses the platform’s certificates to validate remote endpoints.
If you do use a custom ContextFactory, make sure its __init__
method
accepts a method
parameter (this is the OpenSSL.SSL
method mapping
DOWNLOADER_CLIENT_TLS_METHOD
), a tls_verbose_logging
parameter (bool
) and a tls_ciphers
parameter (see
DOWNLOADER_CLIENT_TLS_CIPHERS
).
DOWNLOADER_CLIENT_TLS_CIPHERS¶
Default: 'DEFAULT'
Use this setting to customize the TLS/SSL ciphers used by the default HTTP/1.1 downloader.
The setting should contain a string in the OpenSSL cipher list format,
these ciphers will be used as client ciphers. Changing this setting may be
necessary to access certain HTTPS websites: for example, you may need to use
'DEFAULT:!DH'
for a website with weak DH parameters or enable a
specific cipher that is not included in DEFAULT
if a website requires it.
DOWNLOADER_CLIENT_TLS_METHOD¶
Default: 'TLS'
Use this setting to customize the TLS/SSL method used by the default HTTP/1.1 downloader.
This setting must be one of these string values:
'TLS': maps to OpenSSL's TLS_method() (a.k.a SSLv23_method()), which allows protocol negotiation, starting from the highest supported by the platform; default, recommended
'TLSv1.0': this value forces HTTPS connections to use TLS version 1.0; set this if you want the behavior of Scrapy<1.1
'TLSv1.1': forces TLS version 1.1
'TLSv1.2': forces TLS version 1.2
'SSLv3': forces SSL version 3 (not recommended)
DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING¶
Default: False
Setting this to True
will enable DEBUG level messages about TLS connection
parameters after establishing HTTPS connections. The kind of information logged
depends on the versions of OpenSSL and pyOpenSSL.
This setting is only used for the default
DOWNLOADER_CLIENTCONTEXTFACTORY
.
DOWNLOADER_MIDDLEWARES¶
Default: {}
A dict containing the downloader middlewares enabled in your project, and their orders. For more info see Activating a downloader middleware.
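For example, to enable a hypothetical middleware of your own (the class path and order value are illustrative):
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomProxyMiddleware': 543,  # hypothetical middleware class
}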
DOWNLOADER_MIDDLEWARES_BASE¶
Default:
{
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
A dict containing the downloader middlewares enabled by default in Scrapy. Low
orders are closer to the engine, high orders are closer to the downloader. You
should never modify this setting in your project, modify
DOWNLOADER_MIDDLEWARES
instead. For more info see
Activating a downloader middleware.
DOWNLOADER_STATS¶
Default: True
Whether to enable downloader stats collection.
DOWNLOAD_DELAY¶
Default: 0
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example:
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY
setting (which is enabled by default). By default, Scrapy doesn’t wait a fixed
amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY
and 1.5 * DOWNLOAD_DELAY
.
When CONCURRENT_REQUESTS_PER_IP is non-zero, delays are enforced
per IP address instead of per domain.
You can also change this setting per spider by setting the download_delay
spider attribute.
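For example, a minimal sketch of a per-spider override (spider name is hypothetical):
import scrapy

class SlowSpider(scrapy.Spider):
    name = 'slow'
    download_delay = 1.5  # wait ~1.5 seconds between requests to the same site
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        self.logger.info('Fetched %s', response.url)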
DOWNLOAD_HANDLERS¶
Default: {}
A dict containing the request downloader handlers enabled in your project.
See DOWNLOAD_HANDLERS_BASE
for example format.
DOWNLOAD_HANDLERS_BASE¶
Default:
{
'data': 'scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler',
'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}
A dict containing the request download handlers enabled by default in Scrapy.
You should never modify this setting in your project, modify
DOWNLOAD_HANDLERS
instead.
You can disable any of these download handlers by assigning None
to their
URI scheme in DOWNLOAD_HANDLERS
. E.g., to disable the built-in FTP
handler (without replacement), place this in your settings.py
:
DOWNLOAD_HANDLERS = {
'ftp': None,
}
The default HTTPS handler uses HTTP/1.1. To use HTTP/2:
Install Twisted[http2]>=17.9.0 to install the packages required to enable HTTP/2 support in Twisted.
Update DOWNLOAD_HANDLERS as follows:
DOWNLOAD_HANDLERS = {
    'https': 'scrapy.core.downloader.handlers.http2.H2DownloadHandler',
}
Warning
HTTP/2 support in Scrapy is experimental, and not yet recommended for production environments. Future Scrapy versions may introduce related changes without a deprecation period or warning.
Note
Known limitations of the current HTTP/2 implementation of Scrapy include:
No support for HTTP/2 Cleartext (h2c), since no major browser supports HTTP/2 unencrypted (refer to the http2 FAQ).
No setting to specify a maximum frame size larger than the default value, 16384. Connections to servers that send a larger frame will fail.
No support for server pushes, which are ignored.
No support for the
bytes_received
andheaders_received
signals.
DOWNLOAD_TIMEOUT¶
Default: 180
The amount of time (in secs) that the downloader will wait before timing out.
Note
This timeout can be set per spider using download_timeout
spider attribute and per-request using download_timeout
Request.meta key.
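For instance, a sketch of the per-request form, used inside a spider callback or start_requests (the URL is hypothetical):
yield scrapy.Request(
    'https://example.com/slow-report',
    callback=self.parse,
    meta={'download_timeout': 30},  # 30 seconds for this request only
)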
DOWNLOAD_MAXSIZE¶
Default: 1073741824
(1024MB)
The maximum response size (in bytes) that downloader will download.
If you want to disable it set to 0.
Note
This size can be set per spider using download_maxsize
spider attribute and per-request using download_maxsize
Request.meta key.
DOWNLOAD_WARNSIZE¶
Default: 33554432
(32MB)
The response size (in bytes) that downloader will start to warn.
If you want to disable it set to 0.
Note
This size can be set per spider using download_warnsize
spider attribute and per-request using download_warnsize
Request.meta key.
DOWNLOAD_FAIL_ON_DATALOSS¶
Default: True
Whether or not to fail on broken responses, that is, when the declared
Content-Length does not match the content sent by the server or a chunked
response was not properly finished. If True, these responses raise a
ResponseFailed([_DataLoss]) error. If False, these responses
are passed through and the flag dataloss is added to the response, i.e.:
'dataloss' in response.flags is True.
Optionally, this can be set on a per-request basis by using the
download_fail_on_dataloss Request.meta key, set to False.
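For example, a minimal sketch of the per-request form (the URL is hypothetical):
yield scrapy.Request(
    'https://example.com/flaky-endpoint',
    callback=self.parse,
    meta={'download_fail_on_dataloss': False},
)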
Note
A broken response, or data loss error, may happen under several
circumstances, from server misconfiguration to network errors to data
corruption. It is up to the user to decide if it makes sense to process
broken responses considering they may contain partial or incomplete content.
If RETRY_ENABLED
is True
and this setting is set to True
,
the ResponseFailed([_DataLoss])
failure will be retried as usual.
Warning
This setting is ignored by the
H2DownloadHandler
download handler (see DOWNLOAD_HANDLERS
). In case of a data loss
error, the corresponding HTTP/2 connection may be corrupted, affecting other
requests that use the same connection; hence, a ResponseFailed([InvalidBodyLengthError])
failure is always raised for every request that was using that connection.
DUPEFILTER_CLASS¶
Default: 'scrapy.dupefilters.RFPDupeFilter'
The class used to detect and filter duplicate requests.
The default (RFPDupeFilter
) filters based on request fingerprint using
the scrapy.utils.request.request_fingerprint
function. In order to change
the way duplicates are checked you could subclass RFPDupeFilter
and
override its request_fingerprint
method. This method should accept
scrapy Request
object and return its fingerprint
(a string).
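As an illustration, here is a minimal sketch of such a subclass, not the built-in behaviour: it treats two requests as duplicates when they share the same method and URL with the query string and fragment removed (the project path used with DUPEFILTER_CLASS below is hypothetical):
import hashlib
from urllib.parse import urlparse, urlunparse

from scrapy.dupefilters import RFPDupeFilter


class QueryInsensitiveDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        # Drop the query string and fragment before hashing.
        parts = urlparse(request.url)
        normalized = urlunparse((parts.scheme, parts.netloc, parts.path, '', '', ''))
        return hashlib.sha1(f'{request.method} {normalized}'.encode()).hexdigest()

It would then be enabled with DUPEFILTER_CLASS = 'myproject.dupefilters.QueryInsensitiveDupeFilter'.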
You can disable filtering of duplicate requests by setting
DUPEFILTER_CLASS
to 'scrapy.dupefilters.BaseDupeFilter'
.
Be very careful about this however, because you can get into crawling loops.
It’s usually a better idea to set the dont_filter
parameter to
True
on the specific Request
that should not be
filtered.
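For example (a sketch; the URL is hypothetical):
yield scrapy.Request(
    'https://example.com/ping',
    callback=self.parse,
    dont_filter=True,  # bypass the duplicate filter for this request only
)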
DUPEFILTER_DEBUG¶
Default: False
By default, RFPDupeFilter
only logs the first duplicate request.
Setting DUPEFILTER_DEBUG
to True
will make it log all duplicate requests.
EDITOR¶
Default: vi
(on Unix systems) or the IDLE editor (on Windows)
The editor to use for editing spiders with the edit
command.
Additionally, if the EDITOR
environment variable is set, the edit
command will prefer it over the default setting.
EXTENSIONS¶
Default: {}
A dict containing the extensions enabled in your project, and their orders.
EXTENSIONS_BASE¶
Default:
{
'scrapy.extensions.corestats.CoreStats': 0,
'scrapy.extensions.telnet.TelnetConsole': 0,
'scrapy.extensions.memusage.MemoryUsage': 0,
'scrapy.extensions.memdebug.MemoryDebugger': 0,
'scrapy.extensions.closespider.CloseSpider': 0,
'scrapy.extensions.feedexport.FeedExporter': 0,
'scrapy.extensions.logstats.LogStats': 0,
'scrapy.extensions.spiderstate.SpiderState': 0,
'scrapy.extensions.throttle.AutoThrottle': 0,
}
A dict containing the extensions available by default in Scrapy, and their orders. This setting contains all stable built-in extensions. Keep in mind that some of them need to be enabled through a setting.
For more information see the extensions user guide and the list of available extensions.
FEED_TEMPDIR¶
The Feed Temp dir allows you to set a custom folder to save crawler temporary files before uploading with FTP feed storage and Amazon S3.
FEED_STORAGE_GCS_ACL¶
The Access Control List (ACL) used when storing items to Google Cloud Storage. For more information on how to set this value, please refer to the column JSON API in Google Cloud documentation.
FTP_PASSIVE_MODE¶
Default: True
Whether or not to use passive mode when initiating FTP transfers.
FTP_PASSWORD¶
Default: "guest"
The password to use for FTP connections when there is no "ftp_password"
in Request
meta.
Note
Paraphrasing RFC 1635, although it is common to use either the password “guest” or one’s e-mail address for anonymous FTP, some FTP servers explicitly ask for the user’s e-mail address and will not allow login with the “guest” password.
FTP_USER¶
Default: "anonymous"
The username to use for FTP connections when there is no "ftp_user"
in Request
meta.
GCS_PROJECT_ID¶
Default: None
The Project ID that will be used when storing data on Google Cloud Storage.
ITEM_PIPELINES¶
Default: {}
A dict containing the item pipelines to use, and their orders. Order values are arbitrary, but it is customary to define them in the 0-1000 range. Lower orders process before higher orders.
Example:
ITEM_PIPELINES = {
'mybot.pipelines.validate.ValidateMyItem': 300,
'mybot.pipelines.validate.StoreMyItem': 800,
}
ITEM_PIPELINES_BASE¶
Default: {}
A dict containing the pipelines enabled by default in Scrapy. You should never
modify this setting in your project, modify ITEM_PIPELINES
instead.
JOBDIR¶
Default: ''
A string indicating the directory for storing the state of a crawl when pausing and resuming crawls.
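For example, a sketch of enabling persistence for a single crawl, either in settings.py or from the command line (the directory name and spider name are examples only):
JOBDIR = 'crawls/myspider-1'
or, equivalently:
scrapy crawl myspider -s JOBDIR=crawls/myspider-1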
LOG_ENABLED¶
Default: True
Whether to enable logging.
LOG_ENCODING¶
Default: 'utf-8'
The encoding to use for logging.
LOG_FILE¶
Default: None
File name to use for logging output. If None
, standard error will be used.
LOG_FILE_APPEND¶
Default: True
If False
, the log file specified with LOG_FILE
will be
overwritten (discarding the output from previous runs, if any).
LOG_FORMAT¶
Default: '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
String for formatting log messages. Refer to the Python logging documentation for the whole list of available placeholders.
LOG_DATEFORMAT¶
Default: '%Y-%m-%d %H:%M:%S'
String for formatting date/time, expansion of the %(asctime)s
placeholder
in LOG_FORMAT
. Refer to the
Python datetime documentation for the
whole list of available directives.
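As an illustration, a settings.py sketch with a more compact layout (the values shown are examples only):
LOG_FORMAT = '%(asctime)s %(levelname)s [%(name)s]: %(message)s'
LOG_DATEFORMAT = '%H:%M:%S'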
LOG_FORMATTER¶
Default: scrapy.logformatter.LogFormatter
The class to use for formatting log messages for different actions.
LOG_LEVEL¶
Default: 'DEBUG'
Minimum level to log. Available levels are: CRITICAL, ERROR, WARNING, INFO, DEBUG. For more info see Logging.
LOG_STDOUT¶
Default: False
If True
, all standard output (and error) of your process will be redirected
to the log. For example if you print('hello')
it will appear in the Scrapy
log.
LOG_SHORT_NAMES¶
Default: False
If True, the logs will just contain the root path. If it is set to False,
it displays the component responsible for the log output.
LOGSTATS_INTERVAL¶
Default: 60.0
The interval (in seconds) between each logging printout of the stats
by LogStats
.
MEMDEBUG_ENABLED¶
Default: False
Whether to enable memory debugging.
MEMDEBUG_NOTIFY¶
Default: []
When memory debugging is enabled a memory report will be sent to the specified addresses if this setting is not empty, otherwise the report will be written to the log.
Example:
MEMDEBUG_NOTIFY = ['user@example.com']
MEMUSAGE_ENABLED¶
Default: True
Scope: scrapy.extensions.memusage
Whether to enable the memory usage extension. This extension keeps track of
a peak memory used by the process (it writes it to stats). It can also
optionally shutdown the Scrapy process when it exceeds a memory limit
(see MEMUSAGE_LIMIT_MB
), and notify by email when that happened
(see MEMUSAGE_NOTIFY_MAIL
).
MEMUSAGE_LIMIT_MB¶
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before shutting down Scrapy (if MEMUSAGE_ENABLED is True). If zero, no check will be performed.
MEMUSAGE_CHECK_INTERVAL_SECONDS¶
Default: 60.0
Scope: scrapy.extensions.memusage
The Memory usage extension
checks the current memory usage, versus the limits set by
MEMUSAGE_LIMIT_MB
and MEMUSAGE_WARNING_MB
,
at fixed time intervals.
This sets the length of these intervals, in seconds.
MEMUSAGE_NOTIFY_MAIL¶
Default: False
Scope: scrapy.extensions.memusage
A list of emails to notify if the memory limit has been reached.
Example:
MEMUSAGE_NOTIFY_MAIL = ['user@example.com']
MEMUSAGE_WARNING_MB¶
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before sending a warning email notifying about it. If zero, no warning will be produced.
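Taken together, a sketch of a settings.py block using the memory-usage options above (the limits and address are examples only):
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048          # shut the crawl down above ~2 GB
MEMUSAGE_WARNING_MB = 1536        # send a warning email earlier
MEMUSAGE_NOTIFY_MAIL = ['user@example.com']
MEMUSAGE_CHECK_INTERVAL_SECONDS = 30.0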
NEWSPIDER_MODULE¶
Default: ''
Module where to create new spiders using the genspider
command.
Example:
NEWSPIDER_MODULE = 'mybot.spiders_dev'
RANDOMIZE_DOWNLOAD_DELAY¶
Default: True
If enabled, Scrapy will wait a random amount of time (between 0.5 * DOWNLOAD_DELAY
and 1.5 * DOWNLOAD_DELAY
) while fetching requests from the same
website.
This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests.
The randomization policy is the same used by wget --random-wait
option.
If DOWNLOAD_DELAY
is zero (default) this option has no effect.
REACTOR_THREADPOOL_MAXSIZE¶
Default: 10
The maximum limit for the Twisted Reactor thread pool size. This is a common multi-purpose thread pool used by various Scrapy components (the threaded DNS resolver, BlockingFeedStorage and S3FilesStore, to name a few). Increase this value if you're experiencing problems with insufficient blocking IO.
REDIRECT_PRIORITY_ADJUST¶
Default: +2
Scope: scrapy.downloadermiddlewares.redirect.RedirectMiddleware
Adjust redirect request priority relative to original request:
a positive priority adjust (default) means higher priority.
a negative priority adjust means lower priority.
ROBOTSTXT_OBEY¶
Default: False
Scope: scrapy.downloadermiddlewares.robotstxt
If enabled, Scrapy will respect robots.txt policies. For more information see RobotsTxtMiddleware.
Note
While the default value is False
for historical reasons,
this option is enabled by default in settings.py file generated
by scrapy startproject
command.
ROBOTSTXT_PARSER¶
Default: 'scrapy.robotstxt.ProtegoRobotParser'
The parser backend to use for parsing robots.txt
files. For more information see
RobotsTxtMiddleware.
ROBOTSTXT_USER_AGENT¶
Default: None
The user agent string to use for matching in the robots.txt file. If None
,
the User-Agent header you are sending with the request or the
USER_AGENT
setting (in that order) will be used for determining
the user agent to use in the robots.txt file.
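For example, a settings.py sketch combining the robots.txt options above (the token is hypothetical):
ROBOTSTXT_OBEY = True
ROBOTSTXT_USER_AGENT = 'mybot'   # match robots.txt rules against this token instead of USER_AGENT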
SCHEDULER¶
Default: 'scrapy.core.scheduler.Scheduler'
The scheduler class to be used for crawling. See the Scheduler topic for details.
SCHEDULER_DEBUG¶
Default: False
Setting to True
will log debug information about the requests scheduler.
This currently logs (only once) if the requests cannot be serialized to disk.
Stats counter (scheduler/unserializable
) tracks the number of times this happens.
Example entry in logs:
1956-01-31 00:00:00+0800 [scrapy.core.scheduler] ERROR: Unable to serialize request:
<GET http://example.com> - reason: cannot serialize <Request at 0x9a7c7ec>
(type Request)> - no more unserializable requests will be logged
(see 'scheduler/unserializable' stats counter)
SCHEDULER_DISK_QUEUE¶
Default: 'scrapy.squeues.PickleLifoDiskQueue'
Type of disk queue that will be used by scheduler. Other available types are
scrapy.squeues.PickleFifoDiskQueue
, scrapy.squeues.MarshalFifoDiskQueue
,
scrapy.squeues.MarshalLifoDiskQueue
.
SCHEDULER_MEMORY_QUEUE¶
Default: 'scrapy.squeues.LifoMemoryQueue'
Type of in-memory queue used by scheduler. Other available type is:
scrapy.squeues.FifoMemoryQueue
.
SCHEDULER_PRIORITY_QUEUE¶
Default: 'scrapy.pqueues.ScrapyPriorityQueue'
Type of priority queue used by the scheduler. Another available type is
scrapy.pqueues.DownloaderAwarePriorityQueue
.
scrapy.pqueues.DownloaderAwarePriorityQueue
works better than
scrapy.pqueues.ScrapyPriorityQueue
when you crawl many different
domains in parallel. But currently scrapy.pqueues.DownloaderAwarePriorityQueue
does not work together with CONCURRENT_REQUESTS_PER_IP
.
SCRAPER_SLOT_MAX_ACTIVE_SIZE¶
New in version 2.0.
Default: 5_000_000
Soft limit (in bytes) for response data being processed.
While the sum of the sizes of all responses being processed is above this value, Scrapy does not process new requests.
SPIDER_CONTRACTS¶
Default: {}
A dict containing the spider contracts enabled in your project, used for testing spiders. For more info see Spiders Contracts.
SPIDER_CONTRACTS_BASE¶
Default:
{
'scrapy.contracts.default.UrlContract' : 1,
'scrapy.contracts.default.ReturnsContract': 2,
'scrapy.contracts.default.ScrapesContract': 3,
}
A dict containing the Scrapy contracts enabled by default in Scrapy. You should
never modify this setting in your project, modify SPIDER_CONTRACTS
instead. For more info see Spiders Contracts.
You can disable any of these contracts by assigning None
to their class
path in SPIDER_CONTRACTS
. E.g., to disable the built-in
ScrapesContract
, place this in your settings.py
:
SPIDER_CONTRACTS = {
'scrapy.contracts.default.ScrapesContract': None,
}
SPIDER_LOADER_CLASS¶
Default: 'scrapy.spiderloader.SpiderLoader'
The class that will be used for loading spiders, which must implement the SpiderLoader API.
SPIDER_LOADER_WARN_ONLY¶
Default: False
By default, when Scrapy tries to import spider classes from SPIDER_MODULES
,
it will fail loudly if there is any ImportError
exception.
But you can choose to silence this exception and turn it into a simple
warning by setting SPIDER_LOADER_WARN_ONLY = True
.
Note
Some scrapy commands already run with this setting set to True
(i.e. they will only issue a warning and will not fail)
since they do not actually need to load spider classes to work:
scrapy runspider
,
scrapy settings
,
scrapy startproject
,
scrapy version
.
SPIDER_MIDDLEWARES¶
Default: {}
A dict containing the spider middlewares enabled in your project, and their orders. For more info see Activating a spider middleware.
SPIDER_MIDDLEWARES_BASE¶
Default:
{
'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}
A dict containing the spider middlewares enabled by default in Scrapy, and their orders. Low orders are closer to the engine, high orders are closer to the spider. For more info see Activating a spider middleware.
SPIDER_MODULES¶
Default: []
A list of modules where Scrapy will look for spiders.
Example:
SPIDER_MODULES = ['mybot.spiders_prod', 'mybot.spiders_dev']
STATS_CLASS¶
Default: 'scrapy.statscollectors.MemoryStatsCollector'
The class to use for collecting stats, which must implement the Stats Collector API.
STATS_DUMP¶
Default: True
Dump the Scrapy stats (to the Scrapy log) once the spider finishes.
For more info see: Stats Collection.
STATSMAILER_RCPTS¶
Default: []
(empty list)
Send Scrapy stats after spiders finish scraping. See
StatsMailer
for more info.
TELNETCONSOLE_ENABLED¶
Default: True
A boolean which specifies if the telnet console will be enabled (provided its extension is also enabled).
TEMPLATES_DIR¶
Default: templates
dir inside scrapy module
The directory where to look for templates when creating new projects with
startproject
command and new spiders with genspider
command.
The project name must not conflict with the name of custom files or directories
in the project
subdirectory.
TWISTED_REACTOR¶
New in version 2.0.
Default: None
Import path of a given reactor
.
Scrapy will install this reactor if no other reactor is installed yet, such as
when the scrapy
CLI program is invoked or when using the
CrawlerProcess
class.
If you are using the CrawlerRunner
class, you also
need to install the correct reactor manually. You can do that using
install_reactor()
:
- scrapy.utils.reactor.install_reactor(reactor_path, event_loop_path=None)[source]¶
Installs the
reactor
with the specified import path. Also installs the asyncio event loop with the specified import path if the asyncio reactor is enabled.
If a reactor is already installed,
install_reactor()
has no effect.
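For instance, a sketch of installing the asyncio reactor before creating a CrawlerRunner (the import path shown is the asyncio reactor shipped with Twisted):
from scrapy.utils.reactor import install_reactor

install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')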
CrawlerRunner.__init__
raises
Exception
if the installed reactor does not match the
TWISTED_REACTOR
setting; therefore, having top-level
reactor
imports in project files and imported
third-party libraries will make Scrapy raise Exception
when
it checks which reactor is installed.
In order to use the reactor installed by Scrapy:
import scrapy
from twisted.internet import reactor

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def __init__(self, *args, **kwargs):
        self.timeout = int(kwargs.pop('timeout', '60'))
        super(QuotesSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        reactor.callLater(self.timeout, self.stop)

        urls = ['https://quotes.toscrape.com/page/1']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

    def stop(self):
        self.crawler.engine.close_spider(self, 'timeout')
which raises Exception
, becomes:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def __init__(self, *args, **kwargs):
        self.timeout = int(kwargs.pop('timeout', '60'))
        super(QuotesSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        from twisted.internet import reactor
        reactor.callLater(self.timeout, self.stop)

        urls = ['https://quotes.toscrape.com/page/1']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

    def stop(self):
        self.crawler.engine.close_spider(self, 'timeout')
The default value of the TWISTED_REACTOR
setting is None
, which
means that Scrapy will install the default reactor defined by Twisted for the
current platform. This is to maintain backward compatibility and avoid possible
problems caused by using a non-default reactor.
For additional information, see Choosing a Reactor and GUI Toolkit Integration.
URLLENGTH_LIMIT¶
Default: 2083
Scope: spidermiddlewares.urllength
The maximum URL length to allow for crawled URLs.
This setting can act as a stopping condition in case of URLs of ever-increasing
length, which may be caused for example by a programming error either in the
target server or in your code. See also REDIRECT_MAX_TIMES
and
DEPTH_LIMIT
.
Use 0
to allow URLs of any length.
The default value is copied from the Microsoft Internet Explorer maximum URL length, even though this setting exists for different reasons.
USER_AGENT¶
Default: "Scrapy/VERSION (+https://scrapy.org)"
The default User-Agent to use when crawling, unless overridden. This user agent is
also used by RobotsTxtMiddleware
if ROBOTSTXT_USER_AGENT
setting is None
and
there is no overriding User-Agent header specified for the request.
Settings documented elsewhere:¶
The following settings are documented elsewhere; please check each specific case to see how to enable and use them.
Exceptions¶
Built-in Exceptions reference¶
Here’s a list of all exceptions included in Scrapy and their usage.
CloseSpider¶
- exception scrapy.exceptions.CloseSpider(reason='cancelled')[source]¶
This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:
- Parameters
reason (str) – the reason for closing
For example:
def parse_page(self, response):
    if b'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')
DontCloseSpider¶
This exception can be raised in a spider_idle
signal handler to
prevent the spider from being closed.
DropItem¶
The exception that must be raised by item pipeline stages to stop processing an Item. For more information see Item Pipeline.
IgnoreRequest¶
This exception can be raised by the Scheduler or any downloader middleware to indicate that the request should be ignored.
NotConfigured¶
This exception can be raised by some components to indicate that they will remain disabled. Those components include:
Extensions
Item pipelines
Downloader middlewares
Spider middlewares
The exception must be raised in the component’s __init__
method.
NotSupported¶
This exception is raised to indicate an unsupported feature.
StopDownload¶
New in version 2.2.
Raised from a bytes_received
or headers_received
signal handler to indicate that no further bytes should be downloaded for a response.
The fail
boolean parameter controls which method will handle the resulting
response:
If fail=True (default), the request errback is called. The response object is available as the response attribute of the StopDownload exception, which is in turn stored as the value attribute of the received Failure object. This means that in an errback defined as def errback(self, failure), the response can be accessed through failure.value.response.
If fail=False, the request callback is called instead.
In both cases, the response could have its body truncated: the body contains
all bytes received up until the exception is raised, including the bytes
received in the signal handler that raises the exception. Also, the response
object is marked with "download_stopped"
in its Response.flags
attribute.
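As an illustration, a minimal sketch of a spider that stops every download as soon as the first bytes arrive and lets its regular callback handle the truncated response (the spider name and URL are hypothetical):
import scrapy
from scrapy import signals
from scrapy.exceptions import StopDownload


class HeadersOnlySpider(scrapy.Spider):
    name = 'headers_only'
    start_urls = ['https://example.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_bytes_received,
                                signal=signals.bytes_received)
        return spider

    def on_bytes_received(self, data, request, spider):
        # Stop downloading; the callback still receives the truncated response.
        raise StopDownload(fail=False)

    def parse(self, response):
        self.logger.info('Got %d bytes, flags: %s', len(response.body), response.flags)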
Note
fail
is a keyword-only parameter, i.e. raising
StopDownload(False)
or StopDownload(True)
will raise
a TypeError
.
See the documentation for the bytes_received
and
headers_received
signals
and the Stopping the download of a Response topic for additional information and examples.
- Command line tool
Learn about the command-line tool used to manage your Scrapy project.
- Spiders
Write the rules to crawl your websites.
- Selectors
Extract the data from web pages using XPath.
- Scrapy shell
Test your extraction code in an interactive environment.
- Items
Define the data you want to scrape.
- Item Loaders
Populate your items with the extracted data.
- Item Pipeline
Post-process and store your scraped data.
- Feed exports
Output your scraped data using different formats and storages.
- Requests and Responses
Understand the classes used to represent HTTP requests and responses.
- Link Extractors
Convenient classes to extract links to follow from pages.
- Settings
Learn how to configure Scrapy and see all available settings.
- Exceptions
See all available exceptions and their meaning.
Built-in services¶
Logging¶
Note
scrapy.log
has been deprecated alongside its functions in favor of
explicit calls to the Python standard logging. Keep reading to learn more
about the new logging system.
Scrapy uses logging
for event logging. We’ll
provide some simple examples to get you started, but for more advanced
use-cases it’s strongly suggested to read thoroughly its documentation.
Logging works out of the box, and can be configured to some extent with the Scrapy settings listed in Logging settings.
Scrapy calls scrapy.utils.log.configure_logging()
to set some reasonable
defaults and handle those settings in Logging settings when
running commands, so it’s recommended to manually call it if you’re running
Scrapy from scripts as described in Run Scrapy from a script.
Log levels¶
Python’s builtin logging defines 5 different levels to indicate the severity of a given log message. Here are the standard ones, listed in decreasing order:
logging.CRITICAL - for critical errors (highest severity)
logging.ERROR - for regular errors
logging.WARNING - for warning messages
logging.INFO - for informational messages
logging.DEBUG - for debugging messages (lowest severity)
How to log messages¶
Here’s a quick example of how to log a message using the logging.WARNING
level:
import logging
logging.warning("This is a warning")
There are shortcuts for issuing log messages on any of the standard 5 levels,
and there’s also a general logging.log
method which takes a given level as
argument. If needed, the last example could be rewritten as:
import logging
logging.log(logging.WARNING, "This is a warning")
On top of that, you can create different “loggers” to encapsulate messages. (For example, a common practice is to create different loggers for every module). These loggers can be configured independently, and they allow hierarchical constructions.
The previous examples use the root logger behind the scenes, which is a top level
logger where all messages are propagated to (unless otherwise specified). Using
logging
helpers is merely a shortcut for getting the root logger
explicitly, so this is also an equivalent of the last snippets:
import logging
logger = logging.getLogger()
logger.warning("This is a warning")
You can use a different logger just by getting its name with the
logging.getLogger
function:
import logging
logger = logging.getLogger('mycustomlogger')
logger.warning("This is a warning")
Finally, you can ensure having a custom logger for any module you’re working on
by using the __name__
variable, which is populated with current module’s
path:
import logging
logger = logging.getLogger(__name__)
logger.warning("This is a warning")
Logging from Spiders¶
Scrapy provides a logger
within each Spider
instance, which can be accessed and used like this:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapy.org']

    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)
That logger is created using the Spider’s name, but you can use any custom Python logger you want. For example:
import logging
import scrapy

logger = logging.getLogger('mycustomlogger')

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapy.org']

    def parse(self, response):
        logger.info('Parse function called on %s', response.url)
Logging configuration¶
Loggers on their own don’t manage how messages sent through them are displayed. For this task, different “handlers” can be attached to any logger instance and they will redirect those messages to appropriate destinations, such as the standard output, files, emails, etc.
By default, Scrapy sets and configures a handler for the root logger, based on the settings below.
Logging settings¶
These settings can be used to configure the logging:
The first couple of settings define a destination for log messages. If
LOG_FILE
is set, messages sent through the root logger will be
redirected to a file named LOG_FILE
with encoding
LOG_ENCODING
. If unset and LOG_ENABLED
is True
, log
messages will be displayed on the standard error. If LOG_FILE
is set
and LOG_FILE_APPEND
is False
, the file will be overwritten
(discarding the output from previous runs, if any). Lastly, if
LOG_ENABLED
is False
, there won’t be any visible log output.
LOG_LEVEL
determines the minimum level of severity to display, those
messages with lower severity will be filtered out. It ranges through the
possible levels listed in Log levels.
LOG_FORMAT
and LOG_DATEFORMAT
specify formatting strings
used as layouts for all messages. Those strings can contain any placeholders
listed in logging’s logrecord attributes docs and
datetime’s strftime and strptime directives
respectively.
If LOG_SHORT_NAMES
is set, then the logs will not display the Scrapy
component that prints the log. It is unset by default, hence logs contain the
Scrapy component responsible for that log output.
Command-line options¶
There are command-line arguments, available for all commands, that you can use to override some of the Scrapy settings regarding logging.
--logfile FILE
Overrides LOG_FILE
--loglevel/-L LEVEL
Overrides LOG_LEVEL
--nolog
Sets LOG_ENABLED to False
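For example, a run combining these options (the spider name is hypothetical):
scrapy crawl myspider --logfile run.log --loglevel INFO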
See also
- Module
logging.handlers
Further documentation on available handlers
Custom Log Formats¶
A custom log format can be set for different actions by extending
LogFormatter
class and making
LOG_FORMATTER
point to your new class.
- class scrapy.logformatter.LogFormatter[source]¶
Class for generating log messages for different actions.
All methods must return a dictionary listing the parameters level, msg and args which are going to be used for constructing the log message when calling logging.log.
Dictionary keys for the method outputs:
level is the log level for that action; you can use those from the python logging library: logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR and logging.CRITICAL.
msg should be a string that can contain different formatting placeholders. This string, formatted with the provided args, is going to be the long message for that action.
args should be a tuple or dict with the formatting placeholders for msg. The final log message is computed as msg % args.
Users can define their own LogFormatter class if they want to customize how each action is logged or if they want to omit it entirely. In order to omit logging an action the method must return None.
Here is an example on how to create a custom log formatter to lower the severity level of the log message when an item is dropped from the pipeline:
import logging
import os

from scrapy import logformatter

class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': logging.INFO,  # lowering the level from logging.WARNING
            'msg': "Dropped: %(exception)s" + os.linesep + "%(item)s",
            'args': {
                'exception': exception,
                'item': item,
            }
        }
- download_error(failure, request, spider, errmsg=None)[source]¶
Logs a download error message from a spider (typically coming from the engine).
New in version 2.0.