Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
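To make that concrete, here is a minimal, self-contained spider sketch (quotes.toscrape.com is a public demo site; the class and field names are illustrative):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # quotes.toscrape.com is a sandbox site built for scraping demos
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Select each quote block and yield one structured item per block
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Saved as quotes_spider.py, it can be run without a full project via `scrapy runspider quotes_spider.py -o quotes.json`.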
Getting help
Having trouble? We’d like to help!
- Try the FAQ – it’s got answers to some common questions.
- Looking for specific information? Try the Index or Module Index.
- Ask or search questions on Stack Overflow using the scrapy tag.
- Ask or search questions in the Scrapy subreddit.
- Search for questions in the archives of the scrapy-users mailing list.
- Ask a question in the #scrapy IRC channel.
- Report bugs with Scrapy in our issue tracker.
First steps
- Understand what Scrapy is and how it can help you.
- Get Scrapy installed on your computer.
- Write your first Scrapy project.
- Learn more by playing with a pre-made Scrapy project.
Basic concepts
- Learn about the command-line tool used to manage your Scrapy project.
- Write the rules to crawl your websites.
- Extract the data from web pages using XPath.
- Test your extraction code in an interactive environment.
- Define the data you want to scrape (a short items-and-selectors sketch follows this list).
- Populate your items with the extracted data.
- Post-process and store your scraped data.
- Output your scraped data using different formats and storages.
- Understand the classes used to represent HTTP requests and responses.
- Use convenient classes to extract links to follow from pages.
- Learn how to configure Scrapy and see all available settings.
- See all available exceptions and their meaning.
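As a small illustration of how items and selectors fit together, here is a sketch (the Product item, its fields, and the HTML are made up for the example):

```python
import scrapy
from scrapy.selector import Selector

# A minimal Item declaring the fields to scrape (field names are illustrative)
class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

# Selectors work on plain HTML text as well as on live responses
sel = Selector(text='<div><h2>Widget</h2><span class="price">9.99</span></div>')
product = Product(
    name=sel.xpath("//h2/text()").get(),      # XPath extraction
    price=sel.css("span.price::text").get(),  # the equivalent CSS route
)
print(dict(product))  # {'name': 'Widget', 'price': '9.99'}
```

In a real spider the same .xpath() and .css() calls are available directly on the response object.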
Built-in services
- Learn how to use Python’s built-in logging with Scrapy.
- Collect statistics about your crawls (a logging-and-stats sketch follows this list).
- Send email notifications when certain events occur.
- Inspect a running crawler using a built-in Python console.
- Monitor and control a crawler using a web service.
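Logging and stats, for instance, are reachable straight from spider code; a minimal sketch (the spider name and stat key are invented):

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # self.logger is a stdlib logging.Logger named after the spider
        self.logger.info("Parsed %s", response.url)
        # The stats collector is exposed on the crawler object
        self.crawler.stats.inc_value("example/pages_parsed")
```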
Solving specific problems
- Get answers to most frequently asked questions.
- Learn how to debug common problems in your Scrapy spiders.
- Learn how to use contracts for testing your spiders.
- Get familiar with some Scrapy common practices.
- Tune Scrapy for crawling many domains in parallel.
- Learn how to scrape with your browser’s Developer Tools.
- Read webpage data that is loaded dynamically.
- Learn how to find and get rid of memory leaks in your crawler.
- Download files and/or images associated with your scraped items (see the settings sketch after this list).
- Deploy your Scrapy spiders and run them on a remote server.
- Adjust the crawl rate dynamically based on load.
- Check how Scrapy performs on your hardware.
- Learn how to pause and resume crawls for large spiders.
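Several of the features above are switched on through project settings rather than code; a minimal settings.py sketch, with purely illustrative values:

```python
# settings.py -- illustrative values, not recommendations

# Enable the built-in images pipeline and tell it where to store files
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = "images"  # hypothetical local directory

# AutoThrottle adjusts the crawl rate based on observed load
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Persist scheduler state so a crawl can be paused and resumed later
JOBDIR = "crawls/example-1"  # hypothetical directory
```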
Extending Scrapy
- Understand the Scrapy architecture.
- Customize how pages get requested and downloaded.
- Customize the input and output of your spiders.
- Extend Scrapy with your custom functionality (a minimal extension sketch follows this list).
- Use the core API from extensions and middlewares to extend Scrapy’s functionality.
- See all available signals and how to work with them.
- Quickly export your scraped items to a file (XML, CSV, etc.).
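As a sketch of what a custom extension can look like (the class name is invented, and it would still need to be listed in the EXTENSIONS setting):

```python
from scrapy import signals

class SpiderLifecycleLogger:
    """Illustrative extension: log when a spider opens and closes."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # The core API lets extensions hook into built-in signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s", spider.name)

    def spider_closed(self, spider):
        spider.logger.info("Spider closed: %s", spider.name)
```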
All the rest
- See what has changed in recent Scrapy versions.
- Learn how to contribute to the Scrapy project.
- Understand Scrapy versioning and API stability.