CPP

Usage

The CPP filter is designed to find and return C/C++ style comments and strings. It accepts a text buffer and will return one or more text buffers containing content from comments and/or strings.

When first in the chain, the CPP filter uses no special encoding detection. It assumes UTF-8 if no encoding BOM is found and the user has not overridden the fallback encoding. Text is returned in chunks based on the context of the text: block, inline, or string (if enabled).

When the strings option is enabled, content will be extracted from strings (not character constants). Support is available for all the modern C++ strings shown below. CPP also decodes string escapes, but because string character width and encoding can depend on implementation and configuration, some additional setup may be required via the options. Strings are returned with the specified encoding, even if it differs from the file's encoding (this is the associated encoding specified in the SourceText object; the content itself is still Unicode).

    auto s0 =    "Narrow character string";                   // char
    auto s1 =   L"Wide character string";                     // wchar_t
    auto s2 =  u8"UTF-8 strings";                             // char
    auto s3 =   u"UTF-16 strings";                            // char16_t
    auto s4 =   U"UTF-32 strings";                            // char32_t
    auto R0 =   R"("Raw strings")";                           // const char*
    auto R1 =   R"delim("Raw strings with delimiters")delim"; // const char*
    auto R3 =  LR"("Raw wide character strings")";            // const wchar_t*
    auto R4 = u8R"("Raw UTF-8 strings")";                     // const char*, encoded as UTF-8
    auto R5 =  uR"("Raw UTF-16 strings")";                    // const char16_t*, encoded as UTF-16
    auto R6 =  UR"("Raw UTF-32 strings")";                    // const char32_t*, encoded as UTF-32
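
As a sketch of the additional setup mentioned above, a build where wchar_t is 2 bytes wide and wide strings are UTF-16 (an assumption that is typical of MSVC on Windows) might be described to the filter as follows. The option names come from the Options table; the task name, the source glob, and the choice of utf-16 as the wide encoding are illustrative assumptions, not values taken from this page.

```yaml
matrix:
- name: cpp-strings              # hypothetical task name
  pipeline:
  - pyspelling.filters.cpp:
      strings: true              # string extraction is off by default
      decode_escapes: true       # default, shown for clarity
      charset_size: 1            # narrow character width in bytes (default)
      exec_charset: utf-8        # narrow string encoding (default)
      wide_charset_size: 2       # assumption: 2-byte wchar_t
      wide_exec_charset: utf-16  # assumption: UTF-16 wide strings
  sources:
  - src/**/*.{cpp,hpp}           # hypothetical path
```

If the defaults (1-byte utf-8 narrow strings, 4-byte utf-32 wide strings) already match the target platform, only strings: true should be needed.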

As C++ style comments are a fairly common convention in other languages, this filter can often be used for those languages as well by enabling generic_mode. In Generic Mode, many C/C++ specific considerations and options are disabled. See Generic Mode for more information.

matrix:
- name: cpp
  pipeline:
  - pyspelling.filters.cpp:
      line_comments: false
  sources:
  - js_files/**/*.{cpp,hpp,c,h}

Filtering String types

When strings is enabled, you can specify which strings you want to allow via the string_types option. Valid string types are S for standard, L for long/wide, U for Unicode (all variants), and R for raw. Case is not important, and the default value is sul.

If specifying R, you must also specify U, L, or S, as raw strings are also either U, L, or S strings. Selecting UR will select both Unicode strings and Unicode raw strings. If you need to target just raw strings, you can use R*, which targets all raw string types: raw Unicode, raw wide, and raw standard. You can use * with other types as well. You can also specify * by itself to target all string types.
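
A minimal sketch of these selection rules, assuming a task that should spell check only raw strings and nothing else. The task name and source glob are hypothetical. The value is quoted because an asterisk has special meaning in YAML: a bare * at the start of a value is read as an alias indicator, so "*" on its own must always be quoted.

```yaml
matrix:
- name: cpp-raw-strings      # hypothetical task name
  pipeline:
  - pyspelling.filters.cpp:
      block_comments: false  # skip comments so only strings are returned
      line_comments: false
      strings: true
      string_types: "r*"     # all raw strings: standard, wide, and Unicode
  sources:
  - src/**/*.cpp             # hypothetical path
```

Following the same rules, "ur" would select Unicode strings and Unicode raw strings, and "*" would select every string type.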

Generic Mode

C/C++ style comments are not exclusive to C/C++. Many different file types have adopted similar style comments. The CPP filter has a generic mode which allows C/C++ style comment extraction without all the C/C++ specific considerations. Simply enable generic_mode via the options.

Generic Mode disables the C/C++ specific nuance of allowing multiline comments via escaping newlines. This is a very C/C++ specific thing that is rarely carried over by others that have adopted C/C++ style comments:

// Generic mode will \
   not allow this.

Generic Mode will not decode any character escapes in strings. C/C++ has very specific rules for handling string escapes, only a handful of which may translate to other languages. Generic Mode is mainly meant for comments and not strings, but will return the content of single quoted and double quoted strings if strings is enabled. None of the escape decoding options apply in Generic Mode.

Trigraphs are very C/C++ specific, and will never be evaluated in Generic Mode.

Lastly, when using this filter in Generic Mode, you can also adjust the category prefix from cpp to whatever you would like via the prefix option.
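
Put together, a Generic Mode task might look like the sketch below, here assuming Java sources. The task name, the java prefix, and the source glob are hypothetical; generic_mode and prefix are the options described above.

```yaml
matrix:
- name: java-comments     # hypothetical task name
  pipeline:
  - pyspelling.filters.cpp:
      generic_mode: true  # no escaped-newline comments, escape decoding, or trigraphs
      prefix: java        # hypothetical prefix replacing the default cpp
  sources:
  - src/**/*.java         # hypothetical path
```

With this prefix, the categories listed under Categories would be expected to read java-block-comment and java-line-comment instead of their cpp- equivalents.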

Options

| Option | Type | Default | Description |
|---|---|---|---|
| block_comments | bool | True | Return SourceText entries for each block comment. |
| line_comments | bool | True | Return SourceText entries for each line comment. |
| strings | bool | False | Return SourceText entries for each string. |
| group_comments | bool | False | Group consecutive inline comments as one SourceText entry. |
| trigraphs | bool | False | Account for trigraphs in C/C++ code. Trigraphs are never evaluated in Generic Mode. |
| generic_mode | bool | False | Parses files with a generic C++ like mode for parsing C++ style comments from non C++ files. See Generic Mode for more info. |
| decode_escapes | bool | True | Enable/disable string escape decoding. Strings are never decoded in Generic Mode. |
| charset_size | int | 1 | Set normal string character byte width. |
| exec_charset | string | 'utf-8' | Set normal string encoding. |
| wide_charset_size | int | 4 | Set wide string character byte width. |
| wide_exec_charset | string | 'utf-32' | Set wide string encoding. |
| string_types | string | "sul" | Set the allowed string types to capture: standard strings (s), wide (l), Unicode (u), and raw (r). * captures all strings, or when used with a type, captures all variants of that type r*. |
| prefix | string | 'cpp' | Change the category prefix. |

Categories

CPP returns text with the following categories. The cpp prefix can be changed via the prefix option.

| Category | Description |
|---|---|
| cpp-block-comment | Text captured from C++ style block comments. |
| cpp-line-comment | Text captured from C++ style line comments. |
| cpp-string | Text captured from strings. |