Creating a paginator

In this example we will scrape https://books.toscrape.com/ page by page.

Before we write the paginator, we need to extract each product.

If we look at the HTML in the inspector, each product is contained in an article element with the class product_pod. So we first navigate to the URL and select every matching article.

main {
  new_page {
    goto "https://books.toscrape.com/"
    $$ article.product_pod {
      extract "books[]" {
        // TODO
      }
    }
  }
}

The full title of the book appears in the title attribute of the a tag inside the article's topmost h3 child node. We can extract it with the CSS selector h3 a and then read the title attribute. This is accomplished by chaining the $ and attr evaluators.

extract "books[]" {
  title { $ "h3 a"; attr title }
}
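
To illustrate why the attribute is read rather than the link text, here is a small Python sketch using only the standard library. The markup in SAMPLE is a simplified, hand-written stand-in for one product card (the href is a placeholder), and TitleParser is a hypothetical helper, not part of tadpole.

```python
from html.parser import HTMLParser

# Simplified stand-in for one product card; not the site's full markup.
SAMPLE = """
<article class="product_pod">
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
"""


class TitleParser(HTMLParser):
    """Collects the title attribute and visible text of each <a> inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.in_link = False
        self.titles = []
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            self.in_link = True
            # Equivalent in spirit to: $ "h3 a"; attr title
            self.titles.append(dict(attrs).get("title"))

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.link_texts.append(data)


parser = TitleParser()
parser.feed(SAMPLE)
print(parser.titles)      # ['A Light in the Attic']
print(parser.link_texts)  # ['A Light in the ...']
```

In this sample the visible link text is shortened, while the title attribute carries the whole name.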

The rating is a bit more complicated. The number of stars appears as a class name on the p.star-rating element, written out as a word: One, Two, Three, Four, or Five. We need to select this element, get its class attribute, and extract the word using a regular expression. We can then use a custom func to map the word to a number.

We explicitly ignore case so that a simple change in capitalization would not break our data extraction.

rating {
  $ ".star-rating";
  attr "class";
  extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true;
  func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)"
}
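
The same regex-then-map logic can be tried outside tadpole. The Python sketch below mirrors the steps above; it assumes the extract evaluator hands the first capture group to func, which is what the func body suggests. The name rating_from_class is hypothetical.

```python
import re

# Same pattern as in the rating block, compiled case-insensitively.
PATTERN = re.compile(r"star-rating (One|Two|Three|Four|Five)", re.IGNORECASE)
WORD_TO_NUMBER = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def rating_from_class(class_attr):
    """Map a class attribute such as 'star-rating Three' to 3, or None if no match."""
    match = PATTERN.search(class_attr)
    if match is None:
        return None
    # Lowercase the captured word before the lookup, as the func does.
    return WORD_TO_NUMBER.get(match.group(1).lower())


print(rating_from_class("star-rating Three"))  # 3
print(rating_from_class("STAR-RATING five"))   # 5
print(rating_from_class("star-rating"))        # None
```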

The price is simple. It is contained in the p.price_color element. We need to get the text of the element and then use as_float, which also cleans the value before converting it to a float.

price { $ "p.price_color"; text; as_float }
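
As a rough idea of what cleaning before conversion can mean, here is a Python sketch. It is an assumption about the behaviour, not tadpole's actual as_float implementation: it drops every character that is not a digit, a dot, or a minus sign, then converts the rest. The name as_float_sketch is hypothetical.

```python
import re


def as_float_sketch(text):
    """Strip everything except digits, dots, and minus signs, then convert."""
    cleaned = re.sub(r"[^0-9.-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        # Nothing numeric was left after cleaning.
        return None


print(as_float_sketch("£51.77"))  # 51.77
print(as_float_sketch("n/a"))     # None
```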

The availability status is contained in the p.availability element. We need to extract the text and check whether it matches "In stock". Again, we allow the match to be case insensitive.

in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
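
The check can be pictured with the Python sketch below. It assumes that matches performs a search anywhere in the text rather than requiring the whole string to match, which matters because element text often carries surrounding whitespace. The name in_stock_sketch is hypothetical.

```python
import re


def in_stock_sketch(text):
    """Return True if 'In stock' occurs anywhere in the text, ignoring case."""
    return re.search(r"In stock", text, re.IGNORECASE) is not None


print(in_stock_sketch("\n    In stock\n"))  # True
print(in_stock_sketch("IN STOCK"))          # True
print(in_stock_sketch("Out of stock"))      # False
```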

If we put this together we have:

main {
  new_page {
    goto "https://books.toscrape.com/"
    $$ article.product_pod {
      extract "books[]" {
        title { $ "h3 a"; attr title }
        rating {
          $ ".star-rating";
          attr "class";
          extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true;
          func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)"
        }
        price { $ "p.price_color"; text; as_float }
        in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
      }
    }
  }
}

We can verify it by running:

tadpole run pagination.kdl --auto --headless

Now for the paginator itself. We will use the loop action to keep extracting and clicking li.next a until li.next no longer exists.

The extraction needs to happen before the li.next check, so that the last page, which has no next link, is still scraped. Clicking the navigation link needs to happen after the check.

loop {
  do {
    // TODO: Add extract here
  }
  while { $ "li.next" }
  next {
    $ "li.next a" { click }
    wait_until // Waits until the page has loaded
  }
}
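
The order of do, while, and next can be modelled in a few lines of Python. This is only a sketch of the control flow described above, run against made-up in-memory pages instead of a browser; paginate and the has_next flag are hypothetical stand-ins for the extraction and the li.next check.

```python
def paginate(pages):
    """Collect books from every page, following the do / while / next order."""
    collected = []
    if not pages:
        return collected
    index = 0
    while True:
        page = pages[index]
        collected.extend(page["books"])  # do: extract the current page
        if not page["has_next"]:         # while: stop when li.next is missing
            break
        index += 1                       # next: click through to the next page
    return collected


# Made-up data: three pages, the last one without a next link.
pages = [
    {"books": ["book 1", "book 2"], "has_next": True},
    {"books": ["book 3", "book 4"], "has_next": True},
    {"books": ["book 5"], "has_next": False},
]
print(paginate(pages))  # ['book 1', 'book 2', 'book 3', 'book 4', 'book 5']
```

Because the extraction runs before the check, the books on the final page are collected even though that page has no next link.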

Placing the extraction inside the do block gives the complete script:

main {
  new_page {
    goto "https://books.toscrape.com/"
    loop {
      do {
        $$ article.product_pod {
          extract "books[]" {
            title { $ "h3 a"; attr title }
            rating {
              $ ".star-rating";
              attr "class";
              extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true;
              func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)"
            }
            price { $ "p.price_color"; text; as_float }
            in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
          }
        }
      }
      while { $ "li.next" }
      next {
        $ "li.next a" { click }
        wait_until
      }
    }
  }
}