Creating a paginator
We are going to write an example scraping https://books.toscrape.com/.
Extracting each product
Before we write the paginator, we need to extract each product.
If we look at the HTML in the inspector, each product is contained within an article element with
the class product_pod. So we first need to navigate to the URL and select each article.

    main {
      new_page {
        goto "https://books.toscrape.com/"
        $$ article.product_pod {
          extract "books[]" {
            // TODO
          }
        }
      }
    }

The full title of the book is stored in the title attribute of the a tag inside the topmost h3 child of the article.
We can extract it by using the CSS selector h3 a and reading the title attribute. This is accomplished
by chaining the $ and attr evaluators.

    extract "books[]" {
      title { $ "h3 a"; attr title }
    }

star_rating
The star rating is a bit more complicated. The number of stars appears as a class name on the p.star-rating
element as a string: One, Two, Three, Four or Five. We need to select this element, get the class attribute,
and extract the star count using a regular expression. We can then use a custom func to map the string to a number.
We explicitly ignore case so that a simple change in capitalization wouldn’t break our data extraction.
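
The regular expression and the mapping function can be tried on their own in plain JavaScript before they go into the scraper. The sketch below copies the pattern and the arrow function from this tutorial; parseRating is a hypothetical helper, and it assumes the func step receives the first capture group as its argument.

```javascript
// Standalone check of the rating logic; not part of the scraper itself.
// The pattern and mapping are copied from the tutorial's extract and func steps.
const ratingPattern = /star-rating (One|Two|Three|Four|Five)/i;

const toStars = (v) =>
  ({ one: 1, two: 2, three: 3, four: 4, five: 5 }[v.toLowerCase()] || null);

// Hypothetical helper: takes a raw class attribute, returns a number or null.
// Assumption: the capture group is what gets handed to the mapping function.
function parseRating(classAttr) {
  const match = ratingPattern.exec(classAttr);
  return match ? toStars(match[1]) : null;
}

console.log(parseRating("star-rating Three")); // 3
console.log(parseRating("STAR-RATING five"));  // 5
console.log(parseRating("star-rating"));       // null
```

Returning null for anything unexpected keeps a markup change visible in the output instead of silently producing a wrong number.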

    rating {
      $ ".star-rating"; attr "class"
      extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true
      func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)"
    }

The price is simple. It is contained in the p.price_color element. We need to get the text of the
element and then use as_float, which also cleans the value before converting it to a float.
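
The exact cleaning rules of as_float are not described here. As a rough illustration only, assuming that cleaning means dropping currency symbols and other non-numeric characters, the idea could be sketched in JavaScript like this. cleanToFloat is a hypothetical name, not a Tadpole function.

```javascript
// Hypothetical stand-in for the cleaning step; Tadpole's actual rules may differ.
function cleanToFloat(text) {
  // Keep digits, the decimal point, and a minus sign; drop everything else.
  const cleaned = text.replace(/[^0-9.\-]/g, "");
  const value = Number.parseFloat(cleaned);
  // Report unparseable input as null rather than NaN.
  return Number.isNaN(value) ? null : value;
}

console.log(cleanToFloat("£51.77")); // 51.77
console.log(cleanToFloat("n/a"));    // null
```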

    price { $ "p.price_color"; text; as_float }

in_stock
The availability status is contained in the p.availability element. We need to extract the text
and check whether it matches "In stock". Again, we make the match case insensitive.
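
For intuition, the check behaves roughly like a case-insensitive search in the element's text. The JavaScript sketch below assumes that matches looks for the pattern anywhere in the text rather than requiring the whole string to be equal; isInStock is a hypothetical helper.

```javascript
// Hypothetical helper mirroring: matches "In stock" caseInsensitive=#true
// Assumption: the pattern may appear anywhere in the text, so surrounding
// whitespace or extra words do not prevent a match.
function isInStock(text) {
  return /In stock/i.test(text);
}

console.log(isInStock("\n    In stock\n")); // true
console.log(isInStock("IN STOCK"));         // true
console.log(isInStock("Out of stock"));     // false
```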

    in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }

If we put this together we have:

    main {
      new_page {
        goto "https://books.toscrape.com/"
        $$ article.product_pod {
          extract "books[]" {
            title { $ "h3 a"; attr title }
            rating {
              $ ".star-rating"; attr "class"
              extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true
              func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)"
            }
            price { $ "p.price_color"; text; as_float }
            in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
          }
        }
      }
    }

We can verify it by running:

    tadpole run pagination.kdl --auto --headless

Adding Pagination
We will use the loop action to keep extracting and clicking li.next a until li.next no longer exists.
The extraction needs to happen before the check, and the click on the navigation link needs to happen after it.

    loop {
      do {
        // TODO: Add extract here
      }
      while {
        $ "li.next"
      }
      next {
        $ "li.next a" {
          click
        }
        wait_until // Waits until the page has loaded
      }
    }

Putting it Together

    main {
      new_page {
        goto "https://books.toscrape.com/"
        loop {
          do {
            $$ article.product_pod {
              extract "books[]" {
                title { $ "h3 a"; attr title }
                rating {
                  $ ".star-rating"; attr "class"
                  extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true
                  func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)"
                }
                price { $ "p.price_color"; text; as_float }
                in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
              }
            }
          }
          while {
            $ "li.next"
          }
          next {
            $ "li.next a" {
              click
            }
            wait_until
          }
        }
      }
    }
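
To make the order of the three loop blocks concrete, here is a minimal JavaScript sketch of the same control flow over made-up data. It assumes that each iteration runs do first, then evaluates while, and only then runs next, as described in the pagination section. The pages array and scrapeAll are hypothetical and stand in for the real site and the Tadpole runtime.

```javascript
// Mock site: each page lists some books and whether an li.next element exists.
// This is made-up data for illustration, not real scrape results.
const pages = [
  { books: ["A", "B"], hasNext: true },
  { books: ["C"], hasNext: true },
  { books: ["D", "E"], hasNext: false },
];

function scrapeAll(pages) {
  const books = [];
  let current = 0;
  while (true) {
    // do: extract every product on the current page
    books.push(...pages[current].books);
    // while: stop once the "next" link is gone
    if (!pages[current].hasNext) break;
    // next: follow the link and move on to the following page
    current += 1;
  }
  return books;
}

console.log(scrapeAll(pages)); // [ 'A', 'B', 'C', 'D', 'E' ]
```

Because the extraction runs before the check, the last page, which has no next link, is still scraped.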