
Parse Only What's Needed

Frank Casanova

June 1, 2022


The bigger the page, the more slowly this model is created.


    One option to tune the performance a bit is to tell Beautiful Soup which part of the whole page you will need, and it will create the object model from the relevant part. To do this, you can use a SoupStrainer object.

    A SoupStrainer tells Beautiful Soup which parts to extract, and the parse tree will consist only of these elements. This speeds up the process a bit if you can narrow down the required information to a smaller portion of the HTML.


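from bs4 import BeautifulSoup, SoupStrainer

# content holds the HTML of the page downloaded earlier
strainer = SoupStrainer(name='ul', attrs={'class': 'productLister gridView'})

soup = BeautifulSoup(content, 'html.parser', parse_only=strainer)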


    The preceding code creates a simple SoupStrainer that limits the parse tree to unordered lists whose class attribute is 'productLister gridView', which reduces the page to the required parts, and it uses this strainer to create the soup.
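
    To see what the strainer keeps, here is a minimal sketch that parses the same markup with and without it. The HTML fragment and its URLs are made up for illustration; only the class value comes from the text above.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page fragment: only the product list is of interest.
html = """
<html><body>
<div id="header"><a href="/home">Home</a></div>
<ul class="productLister gridView">
  <li><a href="/product/1">Product 1</a></li>
  <li><a href="/product/2">Product 2</a></li>
</ul>
<div id="footer"><a href="/about">About</a></div>
</body></html>
"""

strainer = SoupStrainer(name="ul", attrs={"class": "productLister gridView"})

full_soup = BeautifulSoup(html, "html.parser")
strained_soup = BeautifulSoup(html, "html.parser", parse_only=strainer)

# The full tree sees every link; the strained tree only those inside the list.
full_links = [a["href"] for a in full_soup.find_all("a")]
strained_links = [a["href"] for a in strained_soup.find_all("a")]

print(full_links)
print(strained_links)
```

    The header and footer links never make it into the strained tree, so later searches only have to walk the product list.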

    Because you already have a working scraper, you can replace its soup calls with ones that use a strainer to speed things up.
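
    Whether the strainer pays off depends on the page, so it can be worth measuring. The following is a rough sketch using timeit on a synthetic page; the page content, the helper names parse_full and parse_strained, and the repetition count are all assumptions for illustration, and the numbers will differ on real pages.

```python
import timeit

from bs4 import BeautifulSoup, SoupStrainer

# Synthetic page (an assumption): a lot of irrelevant markup around one
# small product list.
filler = "".join(f"<div><p>Paragraph {i}</p></div>" for i in range(2000))
page = (
    "<html><body>"
    + filler
    + '<ul class="productLister gridView">'
    + '<li><a href="/product/1">Product 1</a></li></ul>'
    + filler
    + "</body></html>"
)

strainer = SoupStrainer(name="ul", attrs={"class": "productLister gridView"})


def parse_full():
    """Build the object model for the whole page."""
    return BeautifulSoup(page, "html.parser")


def parse_strained():
    """Build the object model only for the matching list."""
    return BeautifulSoup(page, "html.parser", parse_only=strainer)


# Parse the page a few times each way and compare the total time.
full_seconds = timeit.timeit(parse_full, number=5)
strained_seconds = timeit.timeit(parse_strained, number=5)

print(f"full parse:     {full_seconds:.3f} s")
print(f"strained parse: {strained_seconds:.3f} s")
```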

    The following piece of information is hard to find on the Internet:

you can use multiple attribute values in the strainer to parse the website. For example, if you extract the links to product pages, you have three options based on the level of the current department link:

    -  The link leads to product pages.

    -  The link leads to a first-level sublist.

    -  The link leads from a first-level sublist to a second-level sublist.

    In this case, you have three different classes but want to create the soup if any of them is present. You can do something like this:


strainer = SoupStrainer(name='ul', attrs={'class': ['productLister gridView', 'categories shelf']})
soup = BeautifulSoup(content, 'html.parser', parse_only=strainer)


    Here, you have listed the versions of the list that can occur, and the soup contains all the relevant information.
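
    As a small check of this behavior, the sketch below passes a list of class values to a SoupStrainer and parses a made-up fragment with three lists. The markup, the 'navigation' class, and the URLs are assumptions for illustration; a list is kept if its class attribute equals any of the listed values.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical fragment with two wanted lists and one unwanted list.
html = """
<ul class="productLister gridView"><li><a href="/product/1">Product</a></li></ul>
<ul class="categories shelf"><li><a href="/shelf/1">Shelf</a></li></ul>
<ul class="navigation"><li><a href="/home">Home</a></li></ul>
"""

# A list of values means "match any of these" for the class attribute.
strainer = SoupStrainer(
    name="ul",
    attrs={"class": ["productLister gridView", "categories shelf"]},
)

soup = BeautifulSoup(html, "html.parser", parse_only=strainer)

# Only links from the two matching lists are part of the parse tree.
links = [a["href"] for a in soup.find_all("a")]
print(links)
```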