Plugin API
Filters
When writing a Filter plugin, there are two classes to be aware of: Filter and SourceText. Both classes are found in
pyspelling.filters.
Each chunk returned by a filter is a SourceText object. These objects contain the desired, filtered text from the
previous filter along with some metadata: encoding, display context, and a category that describes what kind of text the
data is. After all filters have processed the text, each SourceText's content is finally passed to the spell checker.
The text data in a SourceText object is always Unicode, but during the filtering process, a filter can encode the
text to bytes if required, as long as it is returned as Unicode at the end of the step.
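As a rough illustration of that round trip, the sketch below encodes the text to bytes, works on the bytes, and decodes back to Unicode before returning. The strip_nul_bytes helper is hypothetical and not part of PySpelling.

```python
def strip_nul_bytes(text, encoding='utf-8'):
    """Hypothetical helper: drop NUL bytes by working on the encoded form."""
    raw = text.encode(encoding)        # Unicode -> bytes for byte-level work
    raw = raw.replace(b'\x00', b'')    # operate on the bytes
    return raw.decode(encoding)        # bytes -> Unicode before returning


cleaned = strip_nul_bytes('spell\x00ing')
```

Whatever happens in between, the value handed back is a str again, which is what the next step in the chain expects.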
Filter
Filter plugins are subclassed from the Filter class. You'll often want to specify the default value for
default_encoding in the __init__. Simply give it a default value as shown below.
from .. import filters


class MyFilter(filters.Filter):
    """Spelling Filter."""

    def __init__(self, options, default_encoding='utf-8'):
        """Initialization."""

        super().__init__(options, default_encoding)
Filter.get_default_config
get_default_config is where you should specify your default configuration. This should contain all accepted
options and their default values. Any user options that are passed in will override the defaults. If an option is passed
in that is not found in the defaults, an error will be raised.
def get_default_config(self):
    """Get default configuration."""

    return {
        "enable_something": True,
        "list_of_stuff": ['some', 'stuff']
    }
New 2.0
get_default_config was added in version 2.0.
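The override-and-reject behavior described above can be pictured with the following minimal sketch. merge_options is a hypothetical stand-in, not PySpelling's implementation, and the use of KeyError is an assumption, since the exception type is not stated here.

```python
def merge_options(defaults, user_options):
    """Hypothetical stand-in for merging user options over defaults."""
    merged = dict(defaults)
    for key, value in user_options.items():
        if key not in defaults:
            # Unknown options are rejected rather than silently accepted
            raise KeyError("Unknown option '{}'".format(key))
        merged[key] = value
    return merged


defaults = {"enable_something": True, "list_of_stuff": ['some', 'stuff']}
config = merge_options(defaults, {"enable_something": False})
```

Because every accepted key has to appear in the defaults, a typo in a user's configuration shows up as an error instead of being ignored.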
Filter.validate_options
validate_options is where you can specify validation of your options. By default, basic validation is done on incoming
options. For instance, if you specify a default as a bool, the default validator will ensure the user option passed in
is also a bool. Checking is performed on bool, str, list, dict, int, and float types. Nothing beyond simple type
checking is performed, so if you need custom validation, or simply want to bypass the default validator with your
own, you should override validate_options.
def validate_options(self, k, v):
    """Validate options."""

    # Call the basic validator
    super().validate_options(k, v)

    # Perform custom validation
    if k == "foo" and v != "bar":
        raise ValueError("Value should be 'bar' for 'foo'")
New 2.0
validate_options was added in version 2.0.
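To give a feel for what simple type checking against a default might look like, here is a small sketch. basic_type_check is hypothetical, and comparing exact types is an assumption of this sketch; how the real default validator treats subclasses is not covered here.

```python
def basic_type_check(key, value, defaults):
    """Hypothetical sketch of type checking an option against its default."""
    expected = type(defaults[key])
    # bool is a subclass of int in Python, so compare exact types
    if type(value) is not expected:
        raise ValueError(
            "Option '{}' should be of type {}".format(key, expected.__name__)
        )


defaults = {"enable_something": True, "threshold": 3}
basic_type_check("enable_something", False, defaults)  # passes silently
```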
Filter.setup
setup is where basic setup can be performed post-validation. At this point, you can access the merged and validated
configuration via self.config.
def setup(self):
    """Setup."""

    self.enable_foo = self.config['foo']
New 2.0
setup was added in version 2.0.
Filter.reset
reset is called on every new call to the plugin. It allows you to clean up states from previous calls.
def reset(self):
    """Reset"""

    self.counter = 0
    self.tracked_stuff = []
New 2.0
reset was added in version 2.0.
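The following self-contained sketch shows why clearing state matters between calls. CountingPlugin and its process method are hypothetical and only mimic the shape of a plugin; they do not derive from the real Filter class.

```python
class CountingPlugin:
    """Hypothetical stand-in showing per-call state and `reset`."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Clear state left over from a previous call."""
        self.counter = 0
        self.tracked_stuff = []

    def process(self, words):
        """Hypothetical per-call work that accumulates state."""
        for word in words:
            self.counter += 1
            self.tracked_stuff.append(word)
        return self.counter


plugin = CountingPlugin()
first = plugin.process(['a', 'b'])
plugin.reset()  # stands in for the reset performed before the next call
second = plugin.process(['c'])
```

Without the reset, the second call would start counting from the first call's total.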
Filter.has_bom
has_bom takes a file stream and is usually used to check the first few bytes. While BOM checking could be performed in
header_check, this is mainly provided because UTF BOMs are quite common in many file types, so a specific test was dedicated
to it. Additionally, this replaces the old, less flexible CHECK_BOM attribute that was deprecated in version 1.2.
This is useful if you want to handle binary parsing, or a file type that has a custom BOM in the header. When returning
an encoding from any of the encoding check functions, None means no encoding was detected, an empty string means binary
data (encoding validation is skipped), and anything else will be validated and passed through. Just be sure to include a
sensible encoding in your SourceText object when your plugin returns file content.
def has_bom(self, filestream):
    """Check if has BOM."""

    # The zip signature is four bytes long, so read four bytes
    content = filestream.read(4)
    if content == b'PK\x03\x04':
        # Zip file found.
        # Return `BINARY_ENCODE` as content is binary type,
        # but don't return None which means we don't know what we have.
        return filters.BINARY_ENCODE
    # Not a zip file, so pass it on to the normal file checker.
    # Rewind first, assuming the base check reads from the current position.
    filestream.seek(0)
    return super().has_bom(filestream)
New 2.0
has_bom was added in version 2.0.
Deprecation 2.0
CHECK_BOM has been deprecated since 2.0.
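For reference, a generic UTF BOM test can be sketched with the standard library alone. detect_bom and the BOMS table are hypothetical and are not PySpelling's own check; the sketch assumes a binary stream and follows the convention of returning None when nothing is detected.

```python
import codecs
import io

# Longer BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM
BOMS = (
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)


def detect_bom(filestream):
    """Hypothetical sketch: return an encoding name, or None if no BOM."""
    head = filestream.read(4)
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding
    return None


utf8_result = detect_bom(io.BytesIO(codecs.BOM_UTF8 + b'hello'))
utf16_result = detect_bom(io.BytesIO(codecs.BOM_UTF16_LE + 'hi'.encode('utf-16-le')))
plain_result = detect_bom(io.BytesIO(b'hello'))
```

The ordering of the table matters: testing the two-byte UTF-16 marks before the four-byte UTF-32 marks would misreport UTF-32 little-endian files.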
Filter.header_check
header_check is a function that receives the first 1024 characters of the file via content that can be scanned for
an encoding header. A string with the encoding name should be returned or None if a valid encoding header cannot be
found.
def header_check(self, content):
    """Special encode check."""

    return None
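A non-trivial header check might scan for an encoding declaration. The sketch below looks for a Python-style coding comment; find_coding_declaration and RE_CODING are hypothetical, and the sketch assumes content arrives as text. If it arrives as bytes, a bytes pattern would be needed instead.

```python
import re

# Matches declarations such as `# -*- coding: latin-1 -*-`
RE_CODING = re.compile(r'coding[:=]\s*([-\w.]+)')


def find_coding_declaration(content):
    """Hypothetical sketch: return the declared encoding or None."""
    # Only the first two lines are checked, mirroring Python's own rule
    for line in content.splitlines()[:2]:
        match = RE_CODING.search(line)
        if match:
            return match.group(1)
    return None


declared = find_coding_declaration('# -*- coding: latin-1 -*-\nprint("hi")\n')
missing = find_coding_declaration('print("hi")\n')
```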
Filter.content_check
content_check receives a file object which allows you to check the entire file buffer to determine encoding. A string
with the encoding name should be returned, or None if a valid encoding cannot be determined.
def content_check(self, filestream):
    """File content check."""

    return None
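One simple whole-buffer strategy is to try decoding everything and report an encoding only if that succeeds. guess_utf8 is a hypothetical sketch that assumes the stream yields bytes; it is not how PySpelling itself decides.

```python
import io


def guess_utf8(filestream):
    """Hypothetical sketch: report 'utf-8' only if the whole buffer decodes."""
    data = filestream.read()
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return None  # no encoding could be determined
    return 'utf-8'


valid = guess_utf8(io.BytesIO('café'.encode('utf-8')))
invalid = guess_utf8(io.BytesIO(b'\xff\xfe\xfa'))
```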
Filter.filter
filter is called when the Filter object is the first in the chain. This means the file has not been read from disk
yet, so we must handle opening the file before applying the filter and then return a list of SourceText objects. The
first filter in the chain is handled differently in order to give the opportunity to handle files that require more
complex methods to acquire the Unicode strings. You can read the file in binary format or directly to Unicode. You can
run parsers or anything else you need in order to get the required Unicode text for the SourceText objects. You can
create as many SourceText objects as you desire and assign them categories so that other Filter objects can avoid
them if desired. Below is the default which reads the entire file into a single object providing the file name as the
context, the encoding, and the category text.
import codecs


def filter(self, source_file, encoding):  # noqa A001
    """Open and filter the file from disk."""

    # Read the whole file as Unicode using the detected encoding
    with codecs.open(source_file, 'r', encoding=encoding) as f:
        text = f.read()
    return [SourceText(text, source_file, encoding, 'text')]
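A first-in-chain filter that produces several objects could look roughly like the sketch below, which emits one chunk per paragraph. Chunk is a stand-in namedtuple with the same first four fields as SourceText, and split_paragraphs and the 'paragraph' category are hypothetical names chosen for illustration.

```python
import os
import tempfile
from collections import namedtuple

# Stand-in with the same first four fields as SourceText; not the real class
Chunk = namedtuple('Chunk', ['text', 'context', 'encoding', 'category'])


def split_paragraphs(source_file, encoding):
    """Hypothetical first-in-chain filter: one chunk per paragraph."""
    with open(source_file, 'r', encoding=encoding) as f:
        text = f.read()
    chunks = []
    for index, paragraph in enumerate(text.split('\n\n'), 1):
        if paragraph.strip():
            context = '{}: paragraph {}'.format(source_file, index)
            chunks.append(Chunk(paragraph, context, encoding, 'paragraph'))
    return chunks


# Usage with a temporary file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('First paragraph.\n\nSecond paragraph.\n')
chunks = split_paragraphs(tmp.name, 'utf-8')
os.remove(tmp.name)
```

Giving each chunk its own context makes it easier to tell which part of the file a misspelling came from.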
Filter.sfilter
sfilter is called for all Filter objects following the first. The function is passed a SourceText object from
which the text, context, and encoding can all be extracted. Here you can convert the text back to bytes if needed, wrap
the text in an io.StringIO object to act as a file stream, run parsers, or anything you need to manipulate the buffer
to filter the Unicode text for the SourceText objects.
def sfilter(self, source):
    """Execute filter."""

    return [SourceText(source.text, source.context, source.encoding, 'text')]
If a filter only works as the first in the chain, or only as a secondary filter in the chain, you could raise an
exception if needed. In most cases, you should be able to provide an appropriate filter and sfilter, but there are
most likely cases (particularly when dealing with binary data) where only a filter method could be provided.
Check out the default filter plugins provided with the source to see real world examples.
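For a quick, self-contained picture of a secondary filter, the sketch below removes URLs from an incoming chunk and appends to its context rather than replacing it. Chunk is a stand-in namedtuple rather than the real SourceText, and strip_urls and the 'no-urls' category are hypothetical.

```python
import re
from collections import namedtuple

# Stand-in with the same first four fields as SourceText; not the real class
Chunk = namedtuple('Chunk', ['text', 'context', 'encoding', 'category'])

RE_URL = re.compile(r'https?://\S+')


def strip_urls(source):
    """Hypothetical secondary filter: remove URLs from an incoming chunk."""
    text = RE_URL.sub('', source.text)
    # Append to the existing context instead of replacing it
    context = source.context + ' (urls removed)'
    return [Chunk(text, context, source.encoding, 'no-urls')]


incoming = Chunk('See https://example.com for details', 'foo.txt', 'utf-8', 'text')
result = strip_urls(incoming)
```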
get_plugin
And don't forget to provide a function in the file called get_plugin! get_plugin is the entry point and should
return your Filter object.
def get_plugin():
    """Return the filter."""

    return HtmlFilter
SourceText
As previously mentioned, filters must return a list of SourceText objects.
class SourceText(namedtuple('SourceText', ['text', 'context', 'encoding', 'category', 'error'])):
    """Source text."""
Each object should contain a Unicode string (text), some context on the given text hunk (context), the encoding
which the Unicode text was originally in (encoding), and a category that is used to omit certain hunks from other
filters in the chain (category). SourceText objects should not contain byte strings, and if they do, they will not be passed
to additional filters. error is optional and should only be given a message when something goes wrong.
When receiving a SourceText object in your plugin, you can access the content via attributes with the same name as
the parameters above:
>>> source.text
'Some Text'
>>> source.context
'foo.txt'
>>> source.encoding
'utf-8'
>>> source.category
'some-category'
Be mindful when adjusting the context in subsequent items in the pipeline chain. Generally you should only append additional context so as not to wipe out previous contextual data. It may not always make sense to append additional data, so some filters might just pass the previous context as the new context.
If you have a particular chunk of text that has a problem, you can return an error in the SourceText object. Errors
really only need a context and the error message, and are only used to alert the user that something went wrong.
SourceText objects with errors will not be passed down the chain and will not be passed to the spell checker.
if error:
    content = [SourceText('', source_file, '', '', error)]
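A fuller, self-contained sketch of the same idea follows: it tries to read a file and falls back to a single error chunk on failure. Chunk is a stand-in namedtuple mirroring the five SourceText fields, and read_or_report is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in mirroring the five SourceText fields; not the real class
Chunk = namedtuple('Chunk', ['text', 'context', 'encoding', 'category', 'error'])


def read_or_report(source_file, encoding):
    """Hypothetical sketch: return text chunks, or one error chunk on failure."""
    try:
        with open(source_file, 'r', encoding=encoding) as f:
            return [Chunk(f.read(), source_file, encoding, 'text', None)]
    except (OSError, UnicodeDecodeError) as exc:
        # Only context and error matter for an error chunk
        return [Chunk('', source_file, '', '', str(exc))]


failed = read_or_report('/nonexistent/path/to/file.txt', 'utf-8')
```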
Flow Control
FlowControl plugins are simple plugins that take the category from a SourceText object and return one of
the directives HALT, SKIP, or ALLOW. This controls whether the associated SourceText object's progress is halted
in the pipeline, skips the next filter, or is explicitly allowed in the next filter. The directives and FlowControl
class are found in pyspelling.flow_control.
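The effect of the three directives on a single pipeline step can be sketched as follows. The constants here are hypothetical stand-ins whose values may differ from the real ones in pyspelling.flow_control, and step is an illustrative helper rather than the actual pipeline code.

```python
# Hypothetical stand-ins for the directives in pyspelling.flow_control;
# the real constants may have different values
ALLOW, SKIP, HALT = 'allow', 'skip', 'halt'


def step(text, directive, next_filter):
    """Hypothetical sketch: apply a directive to a single pipeline step.

    Returns the resulting text and whether it stays in the pipeline.
    """
    if directive == HALT:
        return text, False            # no further filters see this text
    if directive == SKIP:
        return text, True             # bypass only the next filter
    return next_filter(text), True    # ALLOW: run the next filter


allowed = step('Hello', ALLOW, str.upper)
skipped = step('Hello', SKIP, str.upper)
halted = step('Hello', HALT, str.upper)
```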
FlowControl
FlowControl plugins should be subclassed from FlowControl. If you need to, you can override the __init__, but
remember to call the original with super to ensure options are handled.
class MyFlowControl(flow_control.FlowControl):
    """Flow control plugin."""

    def __init__(self, config):
        """Initialization."""

        super().__init__(config)
FlowControl.get_default_config
get_default_config is where you should specify your default configuration. This should contain all accepted
options and their default values. Any user options that are passed in will override the defaults. If an option is passed
in that is not found in the defaults, an error will be raised.
def get_default_config(self):
    """Get default configuration."""

    return {
        "enable_something": True,
        "list_of_stuff": ['some', 'stuff']
    }
New 2.0
get_default_config was added in version 2.0.
FlowControl.validate_options
validate_options is where you can specify validation of your options. By default, basic validation is done on incoming
options. For instance, if you specify a default as a bool, the default validator will ensure the user option passed in
is also a bool. Checking is performed on bool, str, list, dict, int, and float types. Nothing beyond simple type
checking is performed, so if you need custom validation, or simply want to bypass the default validator with your
own, you should override validate_options.
def validate_options(self, k, v):
    """Validate options."""

    # Call the basic validator
    super().validate_options(k, v)

    # Perform custom validation
    if k == "foo" and v != "bar":
        raise ValueError("Value should be 'bar' for 'foo'")
New 2.0
validate_options was added in version 2.0.
FlowControl.setup
setup is where basic setup can be performed post-validation. At this point, you can access the merged and validated
configuration via self.config.
def setup(self):
    """Setup."""

    self.enable_foo = self.config['foo']
New 2.0
setup was added in version 2.0.
FlowControl.reset
reset is called on every new call to the plugin. It allows you to clean up states from previous calls.
def reset(self):
    """Reset"""

    self.counter = 0
    self.tracked_stuff = []
New 2.0
reset was added in version 2.0.
FlowControl.adjust_flow
After handling the options, there is only one other function available to override: adjust_flow. adjust_flow
receives the category from the SourceText being passed down the pipeline. Here the decision is made as to what must
be done with the object. Simply return HALT, SKIP, or ALLOW to control the flow for that SourceText object.
def adjust_flow(self, category):
    """Adjust the flow of source control objects."""

    status = flow_control.SKIP
    for allow in self.allow:
        if fnmatch.fnmatch(category, allow, flags=self.FNMATCH_FLAGS):
            status = flow_control.ALLOW
            for skip in self.skip:
                if fnmatch.fnmatch(category, skip, flags=self.FNMATCH_FLAGS):
                    status = flow_control.SKIP
            for halt in self.halt:
                if fnmatch.fnmatch(category, halt, flags=self.FNMATCH_FLAGS):
                    status = flow_control.HALT
            if status != flow_control.ALLOW:
                break
    return status
Check out the default flow control plugins provided with the source to see real world examples.
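For experimenting outside of PySpelling, the same allow, skip, and halt precedence can be approximated with the standard library. This is a simplified sketch: the constants are hypothetical stand-ins, the standalone adjust_flow takes its patterns as arguments, and it uses the stdlib fnmatch module, which does not accept the flags argument seen in the snippet above.

```python
import fnmatch

# Hypothetical stand-ins for the directives in pyspelling.flow_control
ALLOW, SKIP, HALT = 'allow', 'skip', 'halt'


def adjust_flow(category, allow=('*',), skip=(), halt=()):
    """Hypothetical sketch: map a category to a directive with glob patterns."""
    status = SKIP
    if any(fnmatch.fnmatchcase(category, p) for p in allow):
        status = ALLOW
        # skip and halt narrow what allow let through; halt wins over skip
        if any(fnmatch.fnmatchcase(category, p) for p in skip):
            status = SKIP
        if any(fnmatch.fnmatchcase(category, p) for p in halt):
            status = HALT
    return status


comment_status = adjust_flow('py-comment', allow=('py-*',))
other_status = adjust_flow('html-content', allow=('py-*',))
```

Categories that match no allow pattern fall through to SKIP, so only explicitly allowed categories reach the next filter.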
get_plugin
And don't forget to provide a function in the file called get_plugin! get_plugin is the entry point and should
return your FlowControl object.
def get_plugin():
    """Get flow controller."""

    return WildcardFlowControl