Soup Sieve

Overview

Soup Sieve is a CSS selector library designed to be used with Beautiful Soup 4. It aims to provide selecting, matching, and filtering using modern CSS selectors. Soup Sieve currently provides selectors from the CSS level 1 specifications up through the latest CSS level 4 drafts (though some are not yet implemented).

Soup Sieve was written with the intent to replace Beautiful Soup's built-in select feature, and as of Beautiful Soup version 4.7.0, it has 🎊. Soup Sieve can also be imported in order to use its API directly for more controlled, specialized parsing.

Soup Sieve has implemented most of the CSS selectors up through the level 4 drafts, though there are a number that don't make sense in a non-browser environment. Selectors that cannot provide meaningful functionality simply do not match anything. Some of the supported selectors are:

  • .classes
  • #ids
  • [attributes=value]
  • parent child
  • parent > child
  • sibling ~ sibling
  • sibling + sibling
  • :not(element.class, element2.class)
  • :is(element.class, element2.class)
  • parent:has(> child)
  • and many more
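
As a rough illustration of a few of the selectors listed above, the following sketch runs them through Beautiful Soup's select method. The markup and the `texts` helper are made up for this example, and the standard library's `html.parser` is assumed in place of `html5lib` so that no extra parser is needed.

```python
import bs4

# Hypothetical markup, used only for this illustration.
html = """
<ul id="pets">
<li class="a">Cat</li>
<li class="b">Dog</li>
<li class="c">Mouse</li>
</ul>
<p>No list here</p>
"""
soup = bs4.BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the text of every element matched by the selector."""
    return [el.get_text() for el in soup.select(selector)]


children = texts("#pets > li")        # parent > child
adjacent = texts(".a + li")           # sibling + sibling
following = texts(".a ~ li")          # sibling ~ sibling
negated = texts("li:not(.a, .b)")     # :not() with a selector list
parents = [el.name for el in soup.select("ul:has(> li)")]  # parent:has(> child)

print(children, adjacent, following, negated, parents)
```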

Installation

You must have Beautiful Soup already installed:

pip install beautifulsoup4

In most cases, assuming you've installed version 4.7.0 or later, that should be all you need to do. If you've installed Beautiful Soup via some alternative method and Soup Sieve was not automatically installed for you, you can install it directly:

pip install soupsieve

If you want to manually install it from source, navigate to the root of the project and run:

python setup.py build
python setup.py install

Usage

To use Soup Sieve, you must create a BeautifulSoup object:

>>> import bs4

>>> text = """
... <div>
... <!-- These are animals -->
... <p class="a">Cat</p>
... <p class="b">Dog</p>
... <p class="c">Mouse</p>
... </div>
... """
>>> soup = bs4.BeautifulSoup(text, 'html5lib')

For most people, the Beautiful Soup 4.7.0+ API may be more than sufficient. Beautiful Soup offers two methods that employ Soup Sieve: select and select_one. Beautiful Soup's select API is identical to Soup Sieve's, except that you don't have to hand it the tag object; the calling object passes itself to Soup Sieve:

>>> soup = bs4.BeautifulSoup(text, 'html5lib')
>>> soup.select_one('p:is(.a, .b, .c)')
<p class="a">Cat</p>
>>> soup = bs4.BeautifulSoup(text, 'html5lib')
>>> soup.select('p:is(.a, .b, .c)')
[<p class="a">Cat</p>, <p class="b">Dog</p>, <p class="c">Mouse</p>]

You can also use the Soup Sieve API directly to get access to the full range of possibilities that Soup Sieve offers. You can select a single tag:

>>> import soupsieve as sv
>>> sv.select_one('p:is(.a, .b, .c)', soup)
<p class="a">Cat</p>

You can select all tags:

>>> import soupsieve as sv
>>> sv.select('p:is(.a, .b, .c)', soup)
[<p class="a">Cat</p>, <p class="b">Dog</p>, <p class="c">Mouse</p>]

You can select the closest ancestor:

>>> import soupsieve as sv
>>> el = sv.select_one('.c', soup)
>>> sv.closest('div', el)
<div>
<!-- These are animals -->
<p class="a">Cat</p>
<p class="b">Dog</p>
<p class="c">Mouse</p>
</div>

You can filter a tag's children (or an iterable of tags):

>>> sv.filter('p:not(.b)', soup.div)
[<p class="a">Cat</p>, <p class="c">Mouse</p>]
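
The example above passes a tag, so its children are filtered. As a minimal sketch of the other case, the snippet below passes an iterable of tags instead. It assumes the same animal markup as the rest of this section and uses the standard library's `html.parser` rather than `html5lib`; the variable names are chosen just for this illustration.

```python
import bs4
import soupsieve as sv

html = """
<div>
<p class="a">Cat</p>
<p class="b">Dog</p>
<p class="c">Mouse</p>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Collect the tags by some other means first, then filter that list.
paragraphs = list(soup.find_all("p"))
kept = sv.filter("p:not(.b)", paragraphs)

# Only the tags in the iterable that match the selector are returned.
kept_text = [p.get_text() for p in kept]
print(kept_text)
```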

You can match a single tag:

>>> els = sv.select('p:is(.a, .b, .c)', soup)
>>> sv.match('p:not(.b)', els[0])
True
>>> sv.match('p:not(.b)', els[1])
False

Or even just extract comments:

>>> sv.comments(soup)
[' These are animals ']

Selectors do not have to be constrained to one line either. You can span selectors over multiple lines just like you would in a CSS file.

>>> selector = """
... .a,
... .b,
... .c
... """
>>> sv.select(selector, soup)
[<p class="a">Cat</p>, <p class="b">Dog</p>, <p class="c">Mouse</p>]

You can even use comments to annotate a particularly complex selector.

>>> selector = """
... /* This isn't complicated, but we're going to annotate it anyways.
...    This is the a class */
... .a,
... /* This is the b class */
... .b,
... /* This is the c class */
... .c
... """
>>> sv.select(selector, soup)
[<p class="a">Cat</p>, <p class="b">Dog</p>, <p class="c">Mouse</p>]

If you've ever used Python's re library for regular expressions, you may know that it is often useful to pre-compile a regular expression pattern, especially if you plan to use it more than once. The same is true for Soup Sieve's matchers, though it is not required. If you have a pattern that you want to use more than once, it may be wise to pre-compile it early on:

>>> selector = sv.compile('p:is(.a, .b, .c)')
>>> selector.filter(soup.div)
[<p class="a">Cat</p>, <p class="b">Dog</p>, <p class="c">Mouse</p>]

A compiled object has all the same methods, though the parameters will be slightly different as they don't need things like the pattern or flags once compiled. See API documentation for more info.
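
As a small sketch of what this looks like in practice, the following calls a few methods on compiled patterns, passing only the tag each time. It assumes the same animal markup used earlier and the standard library's `html.parser`; the variable names are illustrative only.

```python
import bs4
import soupsieve as sv

html = """
<div>
<p class="a">Cat</p>
<p class="b">Dog</p>
<p class="c">Mouse</p>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Compile once; the pattern is not passed again to the methods below.
pattern = sv.compile("p:not(.b)")

selected = [p.get_text() for p in pattern.select(soup)]
first = pattern.select_one(soup)

# match() tests a single tag against the compiled pattern.
matches_first = pattern.match(first)
matches_dog = pattern.match(soup.find("p", class_="b"))

# closest() returns the nearest ancestor (or the tag itself) that matches.
container = sv.compile("div").closest(first)

print(selected, matches_first, matches_dog, container.name)
```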

Compiled patterns are cached, so if for any reason you need to clear the cache, simply issue the purge command.

>>> sv.purge()

Beautiful Soup Differences

Soup Sieve is the official CSS "select" implementation of Beautiful Soup 4.7.0+. While the inclusion of Soup Sieve fixes many issues and greatly expands CSS support in Beautiful Soup, it does introduce some differences which may surprise those who have become accustomed to the old "select" implementation.

Beautiful Soup's old select method had numerous limitations and quirks that do not align with the actual CSS specification. Most are insignificant, but there are a couple of differences that people over the years had come to rely on. Soup Sieve, which aims to follow the CSS specification closely, does not support these differences.

  1. Beautiful Soup was very relaxed when it came to attribute values in selectors: [attribute=value]. Beautiful Soup would allow almost anything for a valid unquoted value. Soup Sieve, on the other hand, follows the CSS specification and requires that a value be a valid identifier, or it must be quoted. If you get an error complaining about an invalid attribute, you may need to quote the value.

    For instance, if you previously used a selector like this:

    soup.select('[div={}]')
    

    You would need to quote the value as {} is not a valid CSS identifier, so it must be quoted:

    soup.select('[div="{}"]')
    
  2. Whether on purpose or by accident, Beautiful Soup used to allow relative selectors:

    soup.select('> div')
    

    The above is not a valid CSS selector according to the CSS specifications. Relative selector lists have only recently been added to the CSS specifications, and they are only allowed in a :has() pseudo-class:

    article:has(> div)
    

    But, in the level 4 CSS specifications, the :scope pseudo-class has been added which allows for the same feel as using > div. Since Soup Sieve supports the :scope pseudo-class, it can be used to produce the same behavior as the legacy select method.

    So, if you used to have selectors such as:

    soup.select('> div')
    

    You can simply add :scope, and it should work the same:

    soup.select(':scope > div')
    
  3. Another quirk of Beautiful Soup's old select implementation was that it returned the HTML nodes in the order of how the selectors were defined. For instance, Beautiful Soup, if given the pattern article, body would first return <article> and then <body>.

    Soup Sieve does not, and frankly cannot, honor Beautiful Soup's old ordering convention due to the way it is designed. Soup Sieve returns the nodes in the order they are defined in the document. The Soup Sieve project views this change in behavior as for the best, as it is more efficient and more in line with how browsers implement querySelectorAll, which our select is analogous to. There are no plans to mimic the old behavior.

    For those who are curious, Soup Sieve, when given a selector, begins crawling the HTML tree from the node that is specified. It crawls the tree in an orderly fashion and matches each element against the provided selector pattern. It does not sort them or build up a list; it simply yields each element as it finds a match. Since the elements are crawled in the order they appear in the document, they are yielded in that order as well. So, given the earlier selector pattern of article, body, Soup Sieve would return the element <body> and then <article>, as that is how they are ordered in the HTML document.
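
The three differences above can be sketched in a few lines. The markup below, including the `data-x` attribute, is invented for this illustration, and the standard library's `html.parser` is assumed; treat it as a demonstration of the behavior described here rather than a migration recipe.

```python
import bs4

# Hypothetical document: <article> is nested inside <body>.
html = (
    "<html><body>"
    '<article><div data-x="{}">inner</div></article>'
    "<div>outer</div>"
    "</body></html>"
)
soup = bs4.BeautifulSoup(html, "html.parser")

# 1. A value that is not a valid CSS identifier must be quoted.
quoted = [el.get_text() for el in soup.select('[data-x="{}"]')]

# 2. ':scope > div' stands in for the legacy '> div' relative selector:
#    only direct <div> children of the calling tag are returned.
direct = [el.get_text() for el in soup.body.select(":scope > div")]
all_divs = [el.get_text() for el in soup.body.select("div")]

# 3. Results follow document order, not the order of the selector list.
names = [el.name for el in soup.select("article, body")]

print(quoted, direct, all_divs, names)
```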