Topics -> scrapy-framework, python, scraping
Source Code Link -> GitHub
What are we going to do?
- Setting up the Scrapy Project
- Writing a spider to scrape quotes
- Cleaning the data and pipelining it into an SQLite3 database
- Setting up fake user agents and proxies
Step 1 -> Setting up the Scrapy Project
Creating a QuotesTutorial project
Before creating it, we should know a little about Scrapy.
Scrapy is a free and open-source web-crawling framework written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler.
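If you don't already have Scrapy installed, it is available on PyPI (assuming pip is set up):

pip install scrapy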
To initialize the project (Scrapy project names must be valid Python identifiers, so we use an underscore rather than a hyphen):

scrapy startproject quotes_tutorial
This will create a quotes_tutorial directory with the following contents:
quotes_tutorial/
    scrapy.cfg            # deploy configuration file
    quotes_tutorial/      # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Step 2 -> Writing our scraper
It will extract quotes from the Goodreads website.
Before moving ahead, we should understand selectors.
What are selectors/locators?
A CSS Selector is a combination of an element selector and a value which identifies the web element within a web page.
The choice of locator depends largely on the structure of the page you are scraping.
Id
An element's id in XPath is defined using “[@id='example']” and in CSS using “#”. IDs must be unique within the DOM.
Examples:
XPath: //div[@id='example'] CSS: #example
Element Type
The previous example showed //div in the XPath. That is the element type, which could be input for a text box or button, img for an image, or "a" for a link.
XPath: //input CSS: input
Direct Child
HTML pages are structured like XML, with children nested inside of parents. If you can locate, for example, the first link within a div, you can construct a string to reach it. A direct child in XPath is defined by the use of a “/”, while in CSS it's defined using “>”.
Examples:
XPath: //div/a CSS: div > a
Child or Sub-Child
Writing nested divs can get tiring, and it results in brittle code. Sometimes you expect the code to change, or want to skip layers. If an element could be inside another or one of its children, it's defined in XPath using “//” and in CSS just by a whitespace.
Examples:
XPath: //div//a CSS: div a
Class
For classes, things are pretty similar in XPATH: “[@class='example']” while in CSS it’s just “.”
Examples:
XPath: //div[@class='example'] CSS: .example
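To see how these selectors map onto Scrapy, here is a small sketch you can try in a Python shell; the HTML fragment and selector strings below are made up purely for illustration:

from scrapy.selector import Selector

# A tiny, made-up HTML fragment to demonstrate the selectors above.
html = """
<div id="example" class="quoteText">
    Stay hungry, stay foolish.
    <a href="/author/1">Author</a>
</div>
"""
sel = Selector(text=html)

# Id: XPath //div[@id='example'] and CSS #example select the same element
print(sel.xpath("//div[@id='example']").get() == sel.css("#example").get())  # True

# Direct child ("div > a") vs. any descendant ("div a")
print(sel.css("div > a::text").get())  # Author
print(sel.css("div a::text").get())    # Author

# Class: XPath //div[@class='quoteText'] is equivalent to CSS .quoteText here
print(sel.css(".quoteText::text").get())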
Our spider inherits from Scrapy's Spider class. page_number holds the number of the next page to be scraped.
start_urls provides the entry point to the scraper if no URL is explicitly given.
The _parse method is the callback where the scraping starts.
We have used the CSS selectors explained above.
import scrapy

from ..items import QuotesItem


class QuotesScraper(scrapy.Spider):
    page_number = 2
    name = "QuotesScraper"
    start_urls = ["https://www.goodreads.com/quotes/tag/inspirational"]

    def _parse(self, response, **kwargs):
        item = QuotesItem()
        for quote in response.css(".quote"):
            title = quote.css(".quoteText::text").extract_first()
            author = quote.css(".authorOrTitle::text").extract_first()
            item["title"] = title
            item["author"] = author
            yield item

        # Uncomment the lines below (and comment out the next_page block
        # that follows) if you want to scrape every page on the website.
        # next_btn = response.css("a.next_page::attr(href)").get()
        # if next_btn is not None:
        #     yield response.follow(next_btn, callback=self._parse)

        next_page = f"https://www.goodreads.com/quotes/tag/inspirational?page={QuotesScraper.page_number}"
        if QuotesScraper.page_number < 3:
            QuotesScraper.page_number += 1
            yield response.follow(next_page, callback=self._parse)
But what is QuotesItem here?
Items are containers that will be loaded with the scraped data; they work like simple Python dicts but provide additional protection against populating undeclared fields, preventing typos.
import scrapy


class QuotesItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
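With the spider and item in place, you can already do a quick test run from the project directory; a minimal example (the JSON output file name is arbitrary):

scrapy crawl QuotesScraper -o quotes.json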
Step 3 -> Pipelining to store the data in sqlite3 database
import sqlite3


class QuotesPipeline:
    def __init__(self):
        self.create_connection()
        self.create_table()

    def process_item(self, item, spider):
        self.db_store(item)
        return item

    def create_connection(self):
        self.conn = sqlite3.connect("quotes.db")
        self.curr = self.conn.cursor()

    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS quote_table""")
        self.curr.execute("""create table quote_table(
            title text,
            author text
        )""")

    def db_store(self, item):
        self.curr.execute("""insert into quote_table values (?, ?)""", (
            item["title"],
            item["author"]
        ))
        self.conn.commit()
We define a QuotesPipeline class.
The __init__ method creates a connection between the SQLite3 database and the program via the create_connection function. It is also responsible for creating the quotes table via the create_table function.
The process_item function is called for every item the spider yields and stores it in the database with the help of the db_store function.
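For the pipeline to actually run, it has to be enabled in settings.py; a minimal sketch, assuming the project module is named quotes_tutorial (the priority 300 is just a common default):

ITEM_PIPELINES = {
    "quotes_tutorial.pipelines.QuotesPipeline": 300,
}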
Step 4 -> Setting up Fake user agents and proxies
As we are scraping at a larger scale, we need to avoid getting our IP banned.
We are using two libraries:
1. scrapy-proxy-pool
It provides a pool of proxies so that our real IP address is never exposed directly.
Enable this middleware by adding the following settings to your settings.py:
PROXY_POOL_ENABLED = True
Then add the scrapy_proxy_pool middlewares to your DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    # ...
}
After this, all requests will be routed through proxies from the pool.
2. scrapy-user-agents
Random User-Agent middleware picks up User-Agent strings based on Python User Agents and MDN.
Turn off the built-in UserAgentMiddleware and add RandomUserAgentMiddleware.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
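Putting the two libraries together, the relevant part of settings.py ends up looking roughly like this (the priorities are simply those from the two snippets above):

PROXY_POOL_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
}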
Deployment
Running Scrapy spiders in your local machine is very convenient for the (early) development stage, but not so much when you need to execute long-running spiders or move spiders to run in production continuously. This is where the solutions for deploying Scrapy spiders come in.
Popular choices for deploying Scrapy spiders are:
- Scrapyd (open source)
- Zyte Scrapy Cloud (cloud-based)
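If you go with Scrapyd, deployment is mostly a matter of pointing the [deploy] section of the generated scrapy.cfg at your Scrapyd server and running scrapyd-deploy from the scrapyd-client package; a minimal sketch, assuming Scrapyd is running locally on its default port:

[deploy]
url = http://localhost:6800/
project = quotes_tutorial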
Web Preview / Output
Placeholder text by Praveen Chaudhary · Images by Binary Beast