Plugin API
Filters
When writing a Filter plugin, there are two classes to be aware of: Filter and SourceText. Both classes are found in pyspelling.filters.
Each chunk returned by a filter is a SourceText object. These objects contain the desired, filtered text from the previous filter along with some metadata: encoding, display context, and a category that describes what kind of text the data is. After all filters have processed the text, each SourceText's content is finally passed to the spell checker.
The text data in a SourceText object is always Unicode. During the filtering process, a filter can encode the Unicode to bytes if required, as long as the text is returned as Unicode at the end of the step.
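As a minimal sketch of that round trip, the helper below encodes Unicode text to bytes, performs a byte-level edit, and decodes back to Unicode before returning. The function name strip_nul_bytes and the NUL-stripping step are hypothetical and only illustrate the pattern; they are not part of the library.

```python
def strip_nul_bytes(text, encoding="utf-8"):
    """Encode Unicode text to bytes, edit the bytes, and decode back to Unicode."""
    raw = text.encode(encoding)      # Unicode -> bytes for byte-level work
    raw = raw.replace(b"\x00", b"")  # hypothetical byte-level processing step
    return raw.decode(encoding)      # bytes -> Unicode before handing the text back


cleaned = strip_nul_bytes("spell\x00ing")
```

Whatever happens in the middle, the value handed to the next step is a str again.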
Filter
Filter plugins are subclassed from the Filter class. You'll often want to specify the default value for default_encoding in __init__. Simply give it a default value as shown below.
from .. import filters


class MyFilter(filters.Filter):
    """Spelling Filter."""

    def __init__(self, options, default_encoding='utf-8'):
        """Initialization."""

        super().__init__(options, default_encoding)
Filter.get_default_config
get_default_config is where you should specify your default configuration. It should return all accepted options and their default values. All user options that are passed in will override the defaults. If an option is passed in that is not found in the defaults, an error will be raised.
    def get_default_config(self):
        """Get default configuration."""

        return {
            "enable_something": True,
            "list_of_stuff": ['some', 'stuff']
        }
New 2.0
get_default_config was added in version 2.0.
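The override behavior described above can be sketched roughly as follows. This is not the library's implementation: merge_options is a hypothetical helper, and raising KeyError for an unknown option is an assumption, since the text only says that an error is raised.

```python
def merge_options(defaults, user_options):
    """Overlay user options on the defaults, rejecting unknown keys."""
    merged = dict(defaults)
    for key, value in user_options.items():
        if key not in defaults:
            # Assumed exception type; the library may raise something else.
            raise KeyError("Unknown option: {!r}".format(key))
        merged[key] = value
    return merged


defaults = {"enable_something": True, "list_of_stuff": ["some", "stuff"]}
config = merge_options(defaults, {"enable_something": False})
```

Options the user does not mention keep their default values, which is why every accepted option needs an entry in the defaults.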
Filter.validate_options
validate_options is where you can specify validation of your options. By default, basic validation is done on incoming options. For instance, if you specify a default as a bool, the default validator will ensure the passed user option is also a bool. Checking is performed on bool, str, list, dict, int, and float types. Nothing beyond simple type checking is performed, so if you have custom validation, or simply want to bypass the default validator with your own, you should override validate_options.
    def validate_options(self, k, v):
        """Validate options."""

        # Call the basic validator
        super().validate_options(k, v)

        # Perform custom validation
        if k == "foo" and v != "bar":
            raise ValueError("Value should be 'bar' for 'foo'")
New 2.0
validate_options was added in version 2.0.
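To illustrate what "simple type checking" against the defaults might look like, here is a small sketch. The helper check_option_type is hypothetical, and the exact-type comparison is an assumption; the library's default validator may be more lenient or report errors differently.

```python
def check_option_type(key, value, defaults):
    """Raise if `value` does not share the type of the default for `key`."""
    expected = type(defaults[key])
    # `bool` is a subclass of `int`, so compare exact types instead of using isinstance.
    if type(value) is not expected:
        raise ValueError(
            "Option {!r} expects {}, got {}".format(
                key, expected.__name__, type(value).__name__
            )
        )


defaults = {"enable_something": True, "max_items": 10}
check_option_type("max_items", 5, defaults)  # same type as the default, so no error
```

Anything beyond this kind of check, such as value ranges or allowed strings, belongs in your own validate_options override.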
Filter.setup
setup is where basic setup can be performed post-validation. At this point, you can access the merged and validated configuration via self.config.
    def setup(self):
        """Setup."""

        self.enable_foo = self.config['foo']
New 2.0
setup was added in version 2.0.
Filter.reset
reset is called on every new call to the plugin. It allows you to clean up state from previous calls.
    def reset(self):
        """Reset."""

        self.counter = 0
        self.tracked_stuff = []
New 2.0
reset was added in version 2.0.
Filter.has_bom
has_bom takes a file stream and is usually used to check the first few bytes. While BOM checking could be performed in header_check, has_bom is provided mainly because UTF BOMs are quite common in many file types, so a specific test was dedicated to them. Additionally, this replaces the old, less flexible CHECK_BOM attribute that was deprecated in version 1.2.
This is useful if you want to handle binary parsing, or a file type that has a custom BOM in the header. When returning an encoding from any of the encoding check functions, None means no encoding was detected, an empty string means binary data (encoding validation is skipped), and anything else will be validated and passed through. Just be sure to include a sensible encoding in your SourceText object when your plugin returns file content.
    def has_bom(self, filestream):
        """Check if has BOM."""

        # Read four bytes, the length of the zip signature compared below.
        content = filestream.read(4)
        if content == b'PK\x03\x04':
            # Zip file found.
            # Return `BINARY_ENCODE` as content is binary type,
            # but don't return None which means we don't know what we have.
            return filters.BINARY_ENCODE
        # Not a zip file, so pass it on to the normal file checker.
        return super().has_bom(filestream)
New 2.0
has_bom was added in version 2.0.
Deprecation 2.0
CHECK_BOM has been deprecated since 2.0.
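For the common case of Unicode BOMs, a standalone detection routine can be sketched with the standard-library codecs constants. This is only an illustration of the idea, not the library's default has_bom; the function name detect_bom, the returned encoding names, and the seek back to the start of the stream are all assumptions.

```python
import codecs
import io

# Check the longer UTF-32 marks first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_bom(filestream):
    """Return an encoding name if the binary stream starts with a known BOM, else None."""
    head = filestream.read(4)
    filestream.seek(0)  # leave the stream where we found it (assumes a seekable stream)
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding
    return None


detected = detect_bom(io.BytesIO(codecs.BOM_UTF8 + b"hello"))
```

Returning None here corresponds to "no encoding detected", which lets later checks such as header_check take over.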
Filter.header_check
header_check is a function that receives the first 1024 characters of the file via content, which can be scanned for an encoding header. A string with the encoding name should be returned, or None if a valid encoding header cannot be found.
    def header_check(self, content):
        """Special encode check."""

        return None
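As an illustration of a non-trivial header check, the sketch below looks for a PEP 263 style coding declaration, as used in Python source files. It is written as a plain function and assumes content arrives as a str; both are simplifications for the example, and the pattern is looser than PEP 263 itself, which only considers the first two lines.

```python
import codecs
import re

# Loosely based on the PEP 263 pattern for `# -*- coding: name -*-` style headers.
RE_CODING = re.compile(r"coding[:=]\s*([-\w.]+)")


def header_check(content):
    """Return the encoding named in a coding declaration, or None."""
    match = RE_CODING.search(content)
    if match is None:
        return None
    name = match.group(1)
    try:
        codecs.lookup(name)  # reject names Python does not recognize
    except LookupError:
        return None
    return name


found = header_check("# -*- coding: latin-1 -*-\nprint('hi')\n")
```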
Filter.content_check
content_check receives a file object which allows you to check the entire file buffer to determine the encoding. A string with the encoding name should be returned, or None if the encoding cannot be determined.
    def content_check(self, filestream):
        """File content check."""

        return None
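One simple whole-buffer strategy is to try decoding the content and see whether it succeeds. The sketch below does this for UTF-8 only; it assumes the file object is opened in binary mode and is seekable, neither of which is stated above, and it is written as a plain function for brevity.

```python
import io


def content_check(filestream):
    """Return 'utf-8' if the whole buffer decodes as UTF-8, else None."""
    data = filestream.read()
    filestream.seek(0)  # rewind so later readers see the full buffer
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


guess = content_check(io.BytesIO("caf\u00e9".encode("utf-8")))
```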
Filter.filter
filter is called when the Filter object is the first in the chain. This means the file has not been read from disk yet, so we must handle opening the file before applying the filter and then return a list of SourceText objects. The first filter in the chain is handled differently in order to give the opportunity to handle files that require more complex methods to acquire the Unicode strings. You can read the file in binary format or directly to Unicode. You can run parsers or anything else you need in order to get the required Unicode text for the SourceText objects. You can create as many SourceText objects as you desire and assign them categories so that other Filter objects can avoid them if desired. Below is the default, which reads the entire file into a single object, providing the file name as the context, the encoding, and the category text.
    def filter(self, source_file, encoding):  # noqa A001
        """Open and filter the file from disk."""

        with codecs.open(source_file, 'r', encoding=encoding) as f:
            text = f.read()
        return [SourceText(text, source_file, encoding, 'text')]
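To show what returning several categorized chunks could look like, here is a sketch that splits text into comment lines and everything else. The SourceText defined here is a local stand-in that mirrors the fields described in this document so the example runs on its own; the helper split_comments and the category names comment and code are hypothetical.

```python
from collections import namedtuple

# Stand-in mirroring the documented fields; not imported from the library.
SourceText = namedtuple(
    "SourceText", ["text", "context", "encoding", "category", "error"], defaults=(None,)
)


def split_comments(text, source_file, encoding):
    """Return one chunk for `#` comment lines and one for everything else."""
    comments, other = [], []
    for line in text.splitlines():
        (comments if line.lstrip().startswith("#") else other).append(line)
    return [
        SourceText("\n".join(comments), source_file, encoding, "comment"),
        SourceText("\n".join(other), source_file, encoding, "code"),
    ]


chunks = split_comments("# a note\nx = 1\n# more\n", "example.py", "utf-8")
```

A later filter or flow control plugin can then act on the comment chunk while leaving the code chunk alone.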
Filter.sfilter
sfilter is called for all Filter objects following the first. The function is passed a SourceText object from which the text, context, and encoding can all be extracted. Here you can convert the text back to bytes if needed, wrap the text in an io.StringIO object to act as a file stream, run parsers, or do anything else you need to manipulate the buffer and filter the Unicode text for the SourceText objects.
    def sfilter(self, source):
        """Execute filter."""

        return [SourceText(source.text, source.context, source.encoding, 'text')]
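As a rough example of a secondary filter that treats the incoming text as a stream, the sketch below wraps the text in io.StringIO and drops blank lines. SourceText is again a local stand-in so the example is self-contained, and drop_blank_lines is a hypothetical function rather than a method of a real plugin.

```python
import io
from collections import namedtuple

# Stand-in mirroring the documented fields; not imported from the library.
SourceText = namedtuple(
    "SourceText", ["text", "context", "encoding", "category", "error"], defaults=(None,)
)


def drop_blank_lines(source):
    """Read the chunk as a text stream and keep only non-blank lines."""
    stream = io.StringIO(source.text)
    kept = [line for line in stream if line.strip()]
    # Context and encoding are passed through unchanged.
    return [SourceText("".join(kept), source.context, source.encoding, "text")]


result = drop_blank_lines(SourceText("one\n\ntwo\n", "foo.txt", "utf-8", "text"))
```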
If a filter only works as the first in the chain, or only as a secondary filter in the chain, you could raise an exception if needed. In most cases, you should be able to provide an appropriate filter and sfilter, but there are most likely cases (particularly when dealing with binary data) where only a filter method can be provided.
Check out the default filter plugins provided with the source to see real world examples.
get_plugin
And don't forget to provide a function in the file called get_plugin! get_plugin is the entry point and should return your Filter class.
def get_plugin():
    """Return the filter."""

    return HtmlFilter
SourceText
As previously mentioned, filters must return a list of SourceText objects.
class SourceText(namedtuple('SourceText', ['text', 'context', 'encoding', 'category', 'error'])):
    """Source text."""
Each object should contain a Unicode string (text), some context on the given text hunk (context), the encoding which the Unicode text was originally in (encoding), and a category that is used to omit certain hunks from other filters in the chain (category). SourceText should not contain byte strings, and if it does, it will not be passed to additional filters. error is optional and only carries a message when something goes wrong.
When receiving a SourceText object in your plugin, you can access the content via attributes with the same name as the parameters above:
>>> source.text
'Some Text'
>>> source.context
'foo.txt'
>>> source.encoding
'utf-8'
>>> source.category
'some-category'
Be mindful when adjusting the context in subsequent items in the pipeline chain. Generally you should only append additional context so as not to wipe out previous contextual data. It may not always make sense to append additional data, so some filters might just pass the previous context as the new context.
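One way to follow this advice is sketched below. The helper extend_context and the ": " separator are hypothetical choices for the example; the library does not prescribe a format for context strings here.

```python
def extend_context(previous, detail):
    """Append detail to the previous context, or pass it through unchanged."""
    return "{}: {}".format(previous, detail) if detail else previous


context = extend_context("foo.txt", "paragraph 3")
```

Passing an empty detail returns the previous context as is, which covers filters that have nothing useful to add.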
If you have a particular chunk of text that has a problem, you can return an error in the SourceText object. Errors really only need a context and the error message, and they are only used to alert the user that something went wrong. SourceText objects with errors will not be passed down the chain and will not be passed to the spell checker.
if error:
    content = [SourceText('', source_file, '', '', error)]
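The effect of an error on a chunk can be pictured with the following sketch, which separates chunks that carry an error from those that would continue on. It uses a local stand-in for SourceText and a hypothetical partition_errors helper; it illustrates the described behavior and is not the library's pipeline code.

```python
from collections import namedtuple

# Stand-in mirroring the documented fields; not imported from the library.
SourceText = namedtuple(
    "SourceText", ["text", "context", "encoding", "category", "error"], defaults=(None,)
)


def partition_errors(chunks):
    """Split chunks into those safe to process and (context, error) reports."""
    ok = [c for c in chunks if c.error is None]
    problems = [(c.context, c.error) for c in chunks if c.error is not None]
    return ok, problems


ok, problems = partition_errors([
    SourceText("Some text", "foo.txt", "utf-8", "text"),
    SourceText("", "bar.bin", "", "", "Could not decode file"),
])
```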
Flow Control
FlowControl plugins are simple plugins that take the category from a SourceText object and return one of the directives HALT, SKIP, or ALLOW. This controls whether the associated SourceText object's progress is halted in the pipeline, skips the next filter, or is explicitly allowed in the next filter. The directives and the FlowControl class are found in pyspelling.flow_control.
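The three directives can be pictured with a small routing sketch. Everything here is a stand-in: the directive values, the (category, text) tuples, and the apply_step helper are hypothetical and only model the behavior described above, not how the library actually drives its pipeline.

```python
# Stand-in directive values; the real constants live in pyspelling.flow_control.
ALLOW, SKIP, HALT = "allow", "skip", "halt"


def apply_step(chunks, directive_for, transform):
    """Run `transform` on allowed chunks; return (continuing, halted) lists."""
    continuing, halted = [], []
    for category, text in chunks:
        directive = directive_for(category)
        if directive == HALT:
            halted.append((category, text))      # leaves the pipeline here
        elif directive == SKIP:
            continuing.append((category, text))  # bypasses the next filter untouched
        else:
            continuing.append((category, transform(text)))
    return continuing, halted


rules = {"comment": ALLOW, "code": SKIP, "binary": HALT}
continuing, halted = apply_step(
    [("comment", "some text"), ("code", "x = 1"), ("binary", "??")],
    rules.get,
    str.upper,
)
```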
FlowControl
FlowControl plugins should be subclassed from FlowControl. If you need to, you can override __init__, but remember to call the original with super to ensure options are handled.
from .. import flow_control


class MyFlowControl(flow_control.FlowControl):
    """Flow control plugin."""

    def __init__(self, config):
        """Initialization."""

        super().__init__(config)
FlowControl.get_default_config
get_default_config is where you should specify your default configuration. It should return all accepted options and their default values. All user options that are passed in will override the defaults. If an option is passed in that is not found in the defaults, an error will be raised.
    def get_default_config(self):
        """Get default configuration."""

        return {
            "enable_something": True,
            "list_of_stuff": ['some', 'stuff']
        }
New 2.0
get_default_config was added in version 2.0.
FlowControl.validate_options
validate_options is where you can specify validation of your options. By default, basic validation is done on incoming options. For instance, if you specify a default as a bool, the default validator will ensure the passed user option is also a bool. Checking is performed on bool, str, list, dict, int, and float types. Nothing beyond simple type checking is performed, so if you have custom validation, or simply want to bypass the default validator with your own, you should override validate_options.
    def validate_options(self, k, v):
        """Validate options."""

        # Call the basic validator
        super().validate_options(k, v)

        # Perform custom validation
        if k == "foo" and v != "bar":
            raise ValueError("Value should be 'bar' for 'foo'")
New 2.0
validate_options was added in version 2.0.
FlowControl.setup
setup is where basic setup can be performed post-validation. At this point, you can access the merged and validated configuration via self.config.
    def setup(self):
        """Setup."""

        self.enable_foo = self.config['foo']
New 2.0
setup was added in version 2.0.
FlowControl.reset
reset is called on every new call to the plugin. It allows you to clean up state from previous calls.
    def reset(self):
        """Reset."""

        self.counter = 0
        self.tracked_stuff = []
New 2.0
reset was added in version 2.0.
FlowControl.adjust_flow
After handling the options, there is only one other function available for overrides: adjust_flow. adjust_flow receives the category from the SourceText being passed down the pipeline. Here the decision is made as to what must be done with the object. Simply return HALT, SKIP, or ALLOW to control the flow for that SourceText object.
    def adjust_flow(self, category):
        """Adjust the flow of source control objects."""

        # Note: `fnmatch` here must accept a `flags` argument, which the
        # standard-library `fnmatch` module does not.
        status = flow_control.SKIP
        for allow in self.allow:
            if fnmatch.fnmatch(category, allow, flags=self.FNMATCH_FLAGS):
                status = flow_control.ALLOW
                # An allowed category can still be skipped or halted.
                for skip in self.skip:
                    if fnmatch.fnmatch(category, skip, flags=self.FNMATCH_FLAGS):
                        status = flow_control.SKIP
                for halt in self.halt:
                    if fnmatch.fnmatch(category, halt, flags=self.FNMATCH_FLAGS):
                        status = flow_control.HALT
                if status != flow_control.ALLOW:
                    break
        return status
Check out the default flow control plugins provided with the source to see real world examples.
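For experimenting with this kind of precedence logic outside the library, the following standalone sketch mirrors the allow, skip, and halt pattern matching using only the standard library. The directive values, the function signature with pattern tuples as arguments, and the use of fnmatch.fnmatchcase are assumptions made for the example, so its matching rules may differ from those of the bundled plugins.

```python
import fnmatch

# Stand-in directive values; the real constants live in pyspelling.flow_control.
ALLOW, SKIP, HALT = "allow", "skip", "halt"


def adjust_flow(category, allow=("*",), skip=(), halt=()):
    """Return a directive for `category`: skip and halt patterns override allow."""
    status = SKIP  # anything not explicitly allowed is skipped
    for pattern in allow:
        if fnmatch.fnmatchcase(category, pattern):
            status = ALLOW
            if any(fnmatch.fnmatchcase(category, p) for p in skip):
                status = SKIP
            if any(fnmatch.fnmatchcase(category, p) for p in halt):
                status = HALT  # halt wins over skip
            if status != ALLOW:
                break
    return status


directive = adjust_flow("html-script", allow=("html-*",), halt=("*-script",))
```

A category must first match an allow pattern before skip or halt patterns are consulted, so a category that matches nothing is simply skipped.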
get_plugin
And don't forget to provide a function in the file called get_plugin! get_plugin is the entry point and should return your FlowControl class.
def get_plugin():
    """Get flow controller."""

    return WildcardFlowControl