# Configuration

## Configuration File
PySpelling requires a YAML configuration file. The file defines the various spelling tasks along with their individual filters and options.
You can optionally specify the preferred spell checker as a global option (`aspell` is the default if not specified). This can be overridden on the command line.

```yaml
spellchecker: hunspell
```
You can specify the number of parallel jobs to use by setting the global option `jobs`. This creates parallel jobs to process the files in a given task.

```yaml
jobs: 4
```
**New in 2.10:** Parallel processing was added in version 2.10.
All of the spelling tasks are contained under the keyword `matrix` and are organized in a list:

```yaml
matrix:
- task1
- task2
```
Each task requires, at the very least, a `name` and `sources` to search. Depending on your setup, you may also need to set the dictionary to use. Each spell checker specifies its dictionary/language differently, which is covered in more detail in Spell Checker Options.
```yaml
matrix:
- name: Python Source
  aspell:
    lang: en
    d: en_US
  sources:
  - pyspelling/**/*.py
```
You can also define more complicated tasks that, via a custom pipeline, run your text through various filters before performing the spell check. You can also add your own custom wordlists to extend the dictionary.
````yaml
matrix:
- name: Python Source
  sources:
  - pyspelling/**/*.py
  aspell:
    lang: en
    d: en_US
  dictionary:
    wordlists:
    - docs/src/dictionary/en-custom.txt
    output: build/dictionary/python.dic
  pipeline:
  - pyspelling.filters.python:
  - pyspelling.filters.context:
      context_visible_first: true
      escapes: \\[\\`~]
      delimiters:
      # Ignore multiline content between fences (fences can have 3 or more back ticks)
      # ```
      # content
      # ```
      - open: '(?s)^(?P<open> *`{3,})$'
        close: '^(?P=open)$'
      # Ignore text between inline back ticks
      - open: '(?P<open>`+)'
        close: '(?P=open)'
````
## Name

Each spelling task should have a unique name, defined with the `name` key. When using the command line `--name` option, the task with the matching name will be run.
```yaml
matrix:
- name: python
```
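For example, to run only the task named `python` from the command line:

```
$ pyspelling --name python
```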
**New behavior in 2.0:** In 1.0, names doubled as identifiers and groups. It became apparent for certain features that a unique name is desirable for targeting different tasks, while a group specifier should be implemented separately. In 2.0, if multiple tasks have the same name, the last defined one will be the targeted task when requesting a named task. Use groups to target multiple grouped tasks.
## Groups

Each task can be assigned to a group. The group name can be shared by multiple tasks. All tasks in a group can be run by specifying the `--group` option with the name of the group on the command line. This option is only available in version 1.1 of the configuration file.
```yaml
matrix:
- name: python
  group: some_name
```
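Similarly, all tasks sharing a group can be run together:

```
$ pyspelling --group some_name
```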
**New in 2.0:** `group` was added in version 2.0.
## Hidden

All tasks in a configuration file will be run if no `name` is specified. In version 1.1 of the configuration file, if a task enables the option `hidden` by setting it to `true`, that task will not be run automatically when no `name` is specified. `hidden` tasks will only be run if they are specifically mentioned by `name`.
```yaml
matrix:
- name: python
  hidden: true
```
**New in 2.0:** `hidden` was added in version 2.0.
## Default Encoding

When parsing a file, encoding detection and translation of the data into Unicode is performed by the first filter in the pipeline. For instance, if the HTML filter is first, it may check for a BOM or look at the file's header to find the `meta` tag that specifies the file's encoding. If all encoding checks fail, the filter will usually apply an appropriate default encoding for the file's content type (usually UTF-8, but check the specific filter's documentation to be sure). If needed, the filter's default encoding can be overridden in the task via the `default_encoding` key. After the first step in the pipeline, the text is passed around as Unicode, which requires no encoding detection.
```yaml
matrix:
- name: markdown
  pipeline:
  - pyspelling.filters.text
  sources:
  - '**/*.md'
  default_encoding: utf-8
```
Once all filtering is complete, the text will be passed to the spell checker as byte strings, usually in the originally detected encoding (unless a filter specifically alters the encoding). The supported spell checkers are limited to very specific encodings, so if your file uses an unsupported encoding, it will fail.

UTF-16 and UTF-32 are not really supported by Aspell and Hunspell, so at the end of the spell check pipeline, Unicode strings associated with UTF-16 or UTF-32 will be encoded with the compatible UTF-8. This does not apply to files processed with the pipeline disabled. When the pipeline is disabled, files are sent directly to the spell checker with no modifications.
### Unsupported Encodings

If you are trying to spell check a file in an unsupported encoding, you can use the builtin text filter to convert the content to a more appropriate encoding. In general, it is recommended to work in, or convert to, UTF-8.
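As a sketch, a task like the following could convert content to UTF-8 before spell checking (the task name and source pattern are illustrative, and `convert_encoding` is assumed here; check the Text filter's documentation for its exact option names):

```yaml
matrix:
- name: utf16_text
  sources:
  - '**/*.txt'
  pipeline:
  - pyspelling.filters.text:
      convert_encoding: utf-8
```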
## Sources

Each spelling task must define a list of sources to search via the `sources` key. Each source should be a glob pattern that matches one or more files. PySpelling will perform a search with these patterns to determine which files should be spell checked.

You can also have multiple patterns on one line, separated by `|`. When multiple patterns are defined like this, they are evaluated simultaneously. This is useful if you'd like to provide an exclusion pattern along with your file pattern. For instance, if we wanted to scan all Python files in our folder, but exclude any in the build folder, we could provide the following pattern: `**/*.py|!build/*`.
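In a task, such a combined pattern would look like this (a minimal sketch; the task name is illustrative):

```yaml
matrix:
- name: python
  sources:
  - '**/*.py|!build/*'
```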
PySpelling uses Wildcard Match's `glob` library to perform the file globbing. By default, it uses the `NEGATE`, `GLOBSTAR`, and `BRACE` flags, but you can override the flag options with the `glob_flags` option. You can specify the flags by either their long name (`GLOBSTAR`) or their short name (`G`). See Wildcard Match's documentation for more information on the available flags and what they do.
```yaml
matrix:
- name: python
  pipeline:
  - pyspelling.filters.python:
      comments: false
  glob_flags: N|G|B
  sources:
  - pyspelling/**/*.py
```
To protect against really large pattern sets, such as when using brace expansion like `{1..10000000}`, there is a default pattern limit of `1000`. This can be changed by setting `glob_pattern_limit` to some other number. Setting it to `0` disables the pattern limit entirely.
**New in 2.6:** `glob_pattern_limit` is new in version 2.6 and only works with `wcmatch` version 6.0.
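Since `glob_pattern_limit` is a global option, raising the limit might look like this (the value shown is illustrative):

```yaml
glob_pattern_limit: 10000

matrix:
- name: python
  sources:
  - pyspelling/**/*.py
```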
## Expect Match

When processing the `sources` field, it is expected that at least one file will match each pattern. If no files are found, it can be helpful to raise an error, and this is the default behavior. If a match is not always expected, the `expect_match` option can be used to suppress the error.
```yaml
matrix:
- name: markdown
  pipeline:
  - pyspelling.filters.text
  sources:
  - '**/*.md'
  expect_match: false
  default_encoding: utf-8
```
## Pipeline

**Note:** PySpelling's `pipeline` is designed to provide advanced, custom filtering above and beyond what spell checkers normally provide, but it may often be the case that what the spell checker provides is more than sufficient. It should be noted that `pipeline` filters are processed before sending the buffer to Aspell or Hunspell. By default, we disable any special modes of the spell checkers.

Spell checkers like Aspell have builtin filtering. If all you need is the builtin filters from Aspell, the `pipeline` configuration can be omitted. For instance, to use Aspell's builtin Markdown mode, simply set the Aspell option directly and omit the pipeline.
```yaml
- name: markdown
  group: docs
  sources:
  - README.md
  aspell:
    lang: en
    d: en_US
    mode: markdown
  dictionary:
    wordlists:
    - .spell-dict
    output: build/dictionary/markdown.dic
```
You can also use PySpelling `pipeline` filters and enable special modes of the underlying spell checker if desired.

PySpelling allows you to define tasks that outline what kind of files you want to spell check, and then sends them down a pipeline that filters the content, returning chunks of text with some associated context. Each chunk is sent down each step of the pipeline until it reaches the final step: the spell check step. Between filter steps, you can also insert flow control steps that allow certain text chunks to skip specific steps. All of this is done with pipeline plugins.

Let's say you had some Markdown files and wanted to convert them to HTML, and then filter out specific tags. You could use the Markdown filter to convert the file to HTML and then pass it through the HTML filter to extract the text from the HTML tags.
```yaml
matrix:
- name: markdown
  sources:
  - README.md
  pipeline:
  - pyspelling.filters.markdown:
  - pyspelling.filters.html:
      comments: false
      attributes:
      - title
      - alt
      ignores:
      - code
      - pre
```
If needed, you can also insert flow control steps before certain filter steps. Each text chunk that is passed between filters has a category assigned to it from the previous filter. Flow control steps allow you to restrict the next filter to specific categories, or exclude specific categories from the next step. This is covered in more depth in Flow Control.
If for some reason you need to send a file directly to the spell checker without using PySpelling's pipeline, simply set `pipeline` to `null`. This sends the file directly to the spell checker without evaluating the encoding or passing it through any filters. Specifically with Hunspell, it also sends the spell checker the filename instead of piping the content, as Hunspell has certain features that don't work when piping the data, such as OpenOffice ODF input.

Below is an example where we send an OpenOffice ODF file directly to Hunspell in order to use Hunspell's `-O` option to parse the ODF file. Keep in mind that when doing this, no encoding is sent to the spell checker unless you define `default_encoding`. If `default_encoding` is not defined, PySpelling will decode the returned content with the terminal's encoding (or what it thinks the terminal's encoding is).
```yaml
matrix:
- name: openoffice_ODF
  sources:
  - file.odt
  hunspell:
    d: en_US
    O: true
  pipeline: null
```
## Languages

Languages in both Aspell and Hunspell are controlled by the `-d` option. In the YAML configuration, we remove any leading `-` and just use `d`.
For Aspell:

```yaml
matrix:
- name: python
  aspell:
    lang: en
    d: en_US
```
For Hunspell:

```yaml
matrix:
- name: python
  hunspell:
    d: en_US
```
**Tip:** Notice above that Aspell sets `lang` to `en` and `d` to `en_US`. This isn't strictly necessary just to spell check, but it is often needed to compile wordlists of words to ignore when spell checking. `lang` points to the actual `.dat` file used to compile wordlists in Aspell, and Aspell needs that information to work. There is usually one `.dat` file that covers a language and its variants, so `en_US` and `en_GB` will both build their wordlists against the `en.dat` file.
Since spell checker options vary between Aspell and Hunspell, spell checker specific options are handled under special keys named `aspell` and `hunspell`. To learn more, check out Spell Checker Options.
By default, PySpelling sets your main dictionary to `en` for Aspell and `en_US` for Hunspell. If you do not want an American English dictionary, or these dictionaries are not installed in their expected default locations, you will need to configure PySpelling so it can find your preferred dictionary. Since dictionary configuration varies for each spell checker, the main dictionary (and virtually any spell checker specific option) is set via Spell Checker Options.
### International Languages

Some languages use special Unicode characters. The spell checker in use may be particular about how those Unicode characters are normalized. When PySpelling passes the content to be spell checked, you may want to normalize the Unicode content if the spell checker is having trouble. This can be done with the Text filter.

For instance, here is how to do so via Aspell with a Czech dictionary.
```yaml
matrix:
- name: czechstuff
  sources:
  - '*.txt'
  aspell:
    lang: cs
    d: cs
  dictionary:
    wordlists:
    - .dictionary
    output: build/czech.dict
  pipeline:
  - pyspelling.filters.text:
      normalize: nfd
```
## Dictionaries and Personal Wordlists

While the provided dictionaries cover a number of commonly used words, you may need to specify additional words that are not covered in the default dictionaries. Luckily, both Aspell and Hunspell allow adding custom wordlists. You can have as many wordlists as you like, and they are included in a list under the key `wordlists`, which is found under the key `dictionary`.

All the wordlists are combined into one custom dictionary file whose output name and location is defined via the `output` key, also found under the `dictionary` key.
While Hunspell doesn't directly compile the wordlists, Aspell does, and it uses the `.dat` file for the dictionary you are using. While you may be specifying a region specific version of English with `en_US` or `en_GB`, both of these use the `en.dat` file. So in Aspell, it is recommended to specify both the `--lang` option (or the alias `-l`) as well as `-d`. If `lang` is not specified, the assumed data file will be `en`.
```yaml
matrix:
- name: python
  sources:
  - pyspelling/**/*.py
  aspell:
    lang: en
    d: en_US
  dictionary:
    wordlists:
    - docs/src/dictionary/en-custom.txt
    output: build/dictionary/python.dic
  pipeline:
  - pyspelling.filters.python:
      comments: false
```
Hunspell, on the other hand, does not require an additional `lang` option, as custom wordlists are handled differently than under Aspell:
```yaml
matrix:
- name: python
  sources:
  - pyspelling/**/*.py
  hunspell:
    d: en_US
  dictionary:
    wordlists:
    - docs/src/dictionary/en-custom.txt
    output: build/dictionary/python.dic
  pipeline:
  - pyspelling.filters.python:
      comments: false
```
Lastly, you can set the encoding used during compilation via the `encoding` key under `dictionary`. The encoding should generally match the encoding of your main dictionary. The default encoding is `utf-8`, and only Aspell uses this option.
```yaml
matrix:
- name: python
  sources:
  - pyspelling/**/*.py
  aspell:
    lang: en
    d: en_US
  dictionary:
    wordlists:
    - docs/src/dictionary/en-custom.txt
    output: build/dictionary/python.dic
    encoding: utf-8
  pipeline:
  - pyspelling.filters.python:
      comments: false
```
## Spell Checker Options

Since PySpelling is a wrapper around both Aspell and Hunspell, there are a number of spell checker specific options. These can be set under the keywords `aspell` and `hunspell` for Aspell and Hunspell respectively. Here you can set options like the default dictionary and search options.

We will not list all available options here. In general, we expose any and all options, excluding only those we are aware of that could be problematic. For instance, we do not have an interface for interactive suggestions, so such options are not allowed with PySpelling.
Spell checker specific options basically translate directly to the spell checker's command line options; you only need to remove the leading `-`s you would normally specify on the command line. For instance, a short form option such as `-l` would simply be represented with the keyword `l`, and the long name form of the same option, `--lang`, would be represented as `lang`. Following the key, you would provide the appropriate value depending on its requirement.

Boolean flags would be set to `true`.
```yaml
matrix:
- name: html
  sources:
  - docs/**/*.html
  aspell:
    H: true
  pipeline:
  - pyspelling.filters.html
```
Other options would be set to a string or an integer value.
```yaml
matrix:
- name: python
  sources:
  - pyspelling/**/*.py
  aspell:
    lang: en
    d: en_US
  pipeline:
  - pyspelling.filters.python:
      strings: false
      comments: false
```
Lastly, if you have an option that can be used multiple times, just set the value up as an array, and the option will be added once for each value in the array. Assuming you had multiple pre-compiled dictionaries, you could add them under Aspell's `--add-extra-dicts` option:
```yaml
matrix:
- name: Python Source
  sources:
  - pyspelling/**/*.py
  aspell:
    add-extra-dicts:
    - my-dictionary.dic
    - my-other-dictionary.dic
  pipeline:
```
The above options would be equivalent to doing this from the command line:

```
$ aspell --add-extra-dicts my-dictionary.dic --add-extra-dicts my-other-dictionary.dic
```