This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Spell check tool

Overview

The kata-spell-check.sh tool is used to check a markdown file for typographical (spelling) mistakes.

Approach

The spell check tool is based on hunspell. It uses standard Hunspell English dictionaries and supplements these with a custom Hunspell dictionary. The document is cleaned of several entities before the spell-check begins. These entities include the following:

  • URLs
  • Email addresses
  • Code blocks
  • Most punctuation
  • GitHub userids
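The cleaning step can be pictured with a small filter. This is only a sketch under stated assumptions, not the tool's actual implementation: the function name `clean_markdown` is hypothetical, the patterns are simplified, and the real script may strip more or fewer entities.

```shell
# Hypothetical helper (not part of kata-spell-check.sh): remove text that
# should not be spell checked. Reads markdown on stdin, writes to stdout.
clean_markdown() {
    # Build the three-backtick fence marker without typing it literally.
    local fence
    fence="$(printf '\140\140\140')"

    # Drop fenced code blocks, including the fence lines themselves.
    awk -v fence="$fence" 'index($0, fence) == 1 { in_code = !in_code; next } !in_code' |
    sed -E \
        -e 's/`[^`]*`//g' \
        -e 's#https?://[^[:space:])>]+##g' \
        -e 's/[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]+//g' \
        -e 's/(^|[[:space:]])@[[:alnum:]-]+/\1/g' \
        -e 's/[][(){}<>.,;:!?*_#"]+/ /g'
    # The sed expressions, in order: inline code spans, URLs, email
    # addresses, GitHub user IDs written as @name, and most punctuation
    # (apostrophes and hyphens are kept so words such as "it's" survive).
}

# Example: clean_markdown < /path/to/file
```

The order matters in this sketch: URLs and email addresses are removed before punctuation, because stripping the dots and slashes first would leave fragments that look like misspelled words.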

Custom words

A custom dictionary is required to accept specific words that are either well understood by the community or are defined in various document files, but do not appear in standard dictionaries. The custom dictionaries allow those words to be accepted as correct. The following lists common examples of such words:

  • Abbreviations
  • Acronyms
  • Company names
  • Product names
  • Project names
  • Technical terms

Spell check a document file

$ ./kata-spell-check.sh check /path/to/file

Note: If you have made local edits to the dictionaries, you may re-create the master dictionary files, as documented in the Adding a new word section, for your local edits to take effect.

Other options

List all available options and commands:

$ ./kata-spell-check.sh -h

Technical details

Hunspell dictionary format

A Hunspell dictionary comprises two text files:

  • A word list file

    This file defines a list of words (one per line). The list includes optional references to one or more rules defined in the rules file as well as optional comments. Specify fixed words (e.g. company names) verbatim. Enter “normal” words in their root form.

    The root form of a "normal" word is the simplest and shortest form of that word. For example, the following words are all formed from the root word "computer":

    • Computers
    • Computer’s
    • Computing
    • Computed

    Each word in the previous list is built from the root word "computer" by combining the following manipulations:

    • Remove one or more characters from the end of the word.
    • Add a new ending.

    Therefore, you list the root word "computer" in the word list file.

  • A rules file

    This file defines named manipulations to apply to root words to form new words. For example, rules that make a root word plural.
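The two manipulations described above (remove characters from the end of the root, then add a new ending) can be illustrated with a short sketch. The function `apply_suffix` is a hypothetical name used only for this illustration; it mimics what a suffix rule does and is not how Hunspell itself is invoked.

```shell
# Hypothetical illustration of a suffix rule: strip STRIP from the end of
# ROOT (if present), then append ADD.
# A Hunspell rules file expresses the same idea with SFX entries;
# see man 5 hunspell for the real syntax.
apply_suffix() {
    local root="$1" strip="$2" add="$3"
    printf '%s%s\n' "${root%"$strip"}" "$add"
}

apply_suffix computer ""  s     # prints "computers"
apply_suffix computer ""  "'s"  # prints "computer's"
apply_suffix computer er  ing   # prints "computing"
apply_suffix computer r   d     # prints "computed"
```

Because every derived form comes from the same root, only "computer" needs to appear in the word list, with references to the rules that produce the other forms.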

Source files

The rules file and the word list file for the custom dictionary are generated from "source" fragment files in the data directory.

All the fragment files allow comments using the hash (#) comment symbol and all files contain a comment header explaining their content.

Word list file fragments

The *.txt files are word list file fragments. Splitting the word list into fragments makes updates easier and clearer as each fragment is a grouping of related terms. The name of the file gives a clue as to the contents but the comments at the top of each file provide further detail.

Every line that does not start with a comment symbol contains a single word. An optional comment for a word may appear after the word and is separated from the word by whitespace followed by the comment symbol:

word		# This is a comment explaining this particular word list entry.

You may suffix each word with a forward slash followed by one or more upper-case letters. Each letter refers to a rule name in the rules file:

word/AC		# This word references the 'A' and 'C' rules.
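To make the fragment format concrete, the following sketch reads a fragment and separates each entry from its rule letters. The function names `list_entries` and `show_rules` are hypothetical, and the sketch assumes that no dictionary word itself contains a `#` or `/` character; the real tool may process fragments differently.

```shell
# Hypothetical helpers for inspecting a word list fragment.

# Print only the entries: strip comments (and the whitespace before them),
# then drop lines that are left empty.
list_entries() {
    sed -E -e 's/[[:space:]]*#.*$//' -e '/^[[:space:]]*$/d' "$@"
}

# Print each word followed by its rule letters, or "-" if it has none.
show_rules() {
    list_entries "$@" | awk -F/ '{ printf "%s\t%s\n", $1, (NF > 1 ? $2 : "-") }'
}

# Example with an inline fragment (the words are made up for illustration):
printf 'hypervisor/AC\t# uses rules A and C\nKubernetes\t# fixed name\n' | show_rules
# prints:
#   hypervisor	AC
#   Kubernetes	-
```

Both functions also accept one or more fragment file names as arguments instead of reading standard input.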

Rules file

The rules file contains a set of general rules that can be applied to one or more root words in the word list files. You can make comments in the rules file.

For an explanation of the format of this file, see man 5 hunspell.

Adding a new word

Update the word list fragment

To add a new word to the dictionary:

  • Check to ensure you do need to add the word

    Is the word valid and correct? If the word is a project, product, or company name, is the capitalization correct?

  • Add the new word to the appropriate word list fragment file.

    Specifically, if it is a general word, add the root of the word to the appropriate fragment file.

  • To add rule references, append a / suffix to the word, followed by the letter of each rule to apply.

Optionally update the rules file

It should not generally be necessary to update the rules file since it already contains rules for most scenarios. However, if you need to update the file, read the documentation carefully.

Create the master dictionary files

Every time you change the dictionary files you must recreate the master dictionary files:

$ ./kata-spell-check.sh make-dict

As a convenience, checking a file will automatically create the master dictionary files.

Test the changes

You must test any changes to the word list file fragments or the rules file by doing the following:

  1. Recreate the master dictionary files.

  2. Run the spell checker on a file containing the words you have added to the dictionary.
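The two test steps can be combined into one helper. This is a sketch under assumptions: `verify_new_words` is a hypothetical function name, the `make-dict` and `check` commands are the ones shown earlier in this document, and the sketch assumes the tool exits with a non-zero status when it finds a spelling mistake, which this document does not confirm.

```shell
# Hypothetical helper: rebuild the dictionaries, then spell check a
# throwaway file containing the newly added words.
# Usage: verify_new_words ./kata-spell-check.sh WORD [WORD...]
verify_new_words() {
    local tool="$1" dir doc status=0
    shift

    dir="$(mktemp -d)" || return 1
    doc="$dir/new-words.md"

    # One word per line is enough for a quick check.
    printf '%s\n' "$@" > "$doc"

    # Step 1: recreate the master dictionary files.
    # Step 2: run the spell checker on the temporary file.
    "$tool" make-dict && "$tool" check "$doc" || status=$?

    rm -rf "$dir"
    return "$status"
}
```

Passing the tool path as the first argument keeps the helper independent of the current directory.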