PreValidators and PostValidators

A Detector consists of 3 core components:

  1. Pre validators: these are rules that are applied before anything else. They must return true for the detection process to actually happen. See below for a list of pre validators.
  2. Matchers: matchers are string extractors. Each matcher can return a list of matched strings, each of which we call a “match”. Usually they simply match a regular expression, although they can be more complex (e.g. detecting a connection string).
  3. Post validators: these are rules that are applied after matchers. They validate that the matches found have certain properties (a certain entropy, not present in a banlist, etc.). Note that post validators can be applied either to a single match or to all matches found by the matcher. See below for a list of post validators.
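
The three stages above can be sketched as a small pipeline. This is only an illustration of the control flow, not the engine's actual code: the function name run_detector, the callable signatures and the sample rules are all hypothetical, and post validators are applied per match only (the all-matches case is left out).

```python
import re
from typing import Callable, List

# Hypothetical types for illustration; not the engine's real interfaces.
PreValidator = Callable[[str, str], bool]   # (filename, content) -> bool
Matcher = Callable[[str], List[str]]        # content -> matched strings
PostValidator = Callable[[str], bool]       # match -> bool


def run_detector(filename, content, pre_validators, matcher, post_validators):
    """Return the matches that survive all three stages."""
    # 1. Every pre validator must return True, otherwise nothing is scanned.
    if not all(pre(filename, content) for pre in pre_validators):
        return []
    # 2. The matcher extracts candidate strings.
    matches = matcher(content)
    # 3. A match is kept only if every post validator accepts it.
    return [m for m in matches if all(post(m) for post in post_validators)]


# Sample rules, made up for this sketch.
pre = [lambda name, text: "datadog" in text.lower()]
matcher = lambda text: re.findall(r"\b[a-f0-9]{32}\b", text)
post = [lambda m: m != "0" * 32]

doc = "DATADOG_KEY=0123456789abcdef0123456789abcdef\nOTHER=" + "0" * 32
found = run_detector("settings.py", doc, pre, matcher, post)
```

Here found contains only the first value: the second candidate is extracted by the matcher but rejected by the post validator.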

Pre-Validation

Here is the list of the pre validators used by our secrets detection engine.

BanMinifiedPreValidator

This pre validator bans files suspected of being minified JavaScript. It only evaluates files with a JavaScript extension and bans files with a score higher than the given threshold.

Usage:

type: BanMinifiedPreValidator
threshold_minified: 0.6

ContentWhitelistPreValidator

This pre validator returns true if either the filename or the content contains one of the given patterns. It was created to limit the number of files scanned by detectors that look for broad patterns. For instance, Datadog API keys are not prefixed, so we would add a ContentWhitelistPreValidator with the following patterns: ['datadog', 'dogapi', 'dd[_-]?key'] to only include documents with a Datadog context.

Usually, a detector should include one ContentWhitelistPreValidator with one or more keywords.

Usage:

type: ContentWhitelistPreValidator
patterns: ["datadog", "dogapi", "dd[_-]?key"]
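
As a rough sketch of the check described above, the patterns can be treated as regular expressions searched in both the filename and the content. The function name content_whitelist is hypothetical, and the case-insensitive matching is an assumption rather than documented behavior.

```python
import re


def content_whitelist(filename, content, patterns):
    """Return True if any pattern is found in the filename or the content."""
    # Assumption: patterns are regular expressions matched case-insensitively.
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return any(rx.search(filename) or rx.search(content) for rx in compiled)


patterns = ["datadog", "dogapi", "dd[_-]?key"]
in_scope = content_whitelist("app.py", "DD_KEY = '...'", patterns)
out_of_scope = content_whitelist("app.py", "print('hello')", patterns)
```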

FilenameWhitelistPreValidator

This pre validator returns true if the filename has an allowlisted name or extension. It is used when we know where to find a secret, for example when we only want to detect secrets in .env files.

Extensions and filenames are compared as strings, i.e. the files that will match are those that have exactly this name or extension. For more complex cases, you may use whitelist_filepaths. The strings you give in this parameter will be evaluated as regular expressions, and FilenameWhitelistPreValidator will look for a match anywhere in the document's filepath. For example, [r"/config/"] will validate any filepath containing /config/.

Usage:

type: FilenameWhitelistPreValidator
whitelist_extensions: [".txt"] # optional
whitelist_filenames: [".env"] # optional
whitelist_filepaths: ['config/test\.xml'] # optional
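
The matching rules can be sketched as follows: extensions and filenames are compared as exact strings, while filepaths are regular expressions searched anywhere in the path. The function name filename_whitelist is hypothetical, and treating the three parameters as alternatives (any one of them is enough) is an assumption.

```python
import os
import re


def filename_whitelist(filepath, extensions=(), filenames=(), filepaths=()):
    """Return True if the file is allowlisted by extension, name or path."""
    name = os.path.basename(filepath)
    ext = os.path.splitext(name)[1]
    # Exact string comparison for extensions and filenames.
    if ext in extensions:
        return True
    if name in filenames:
        return True
    # Regular expression search anywhere in the full filepath.
    return any(re.search(p, filepath) for p in filepaths)


is_env = filename_whitelist("/app/.env", filenames=[".env"])
is_env_local = filename_whitelist("/app/.env.local", filenames=[".env"])
in_config = filename_whitelist("src/config/test.xml", filepaths=[r"config/test\.xml"])
```

With exact comparison, .env.local does not match the .env filename entry, which is the case where whitelist_filepaths becomes useful.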

FilenameBanlistPreValidator

This pre validator returns false if the filename has a banlisted name, path or extension. It is especially useful to avoid scanning binary files and other types of extensions.

Usage:

type: FilenameBanlistPreValidator
banlist_extensions: ["svg", "gzip", "zip"]
banlist_filenames: ["sample_value", "test_*"]
check_binaries: false

After a thorough analysis by the GitGuardian team on a significant amount of real data, some extensions are banned by default if the flag include_default_banlist_extensions is set to True:

BANLIST_EXTENSIONS_DEFAULT = ("html", "css", "lock", "storyboard", "xib")
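
A minimal sketch of how the banlist and the default extensions might combine is shown below. The function name filename_banlist is hypothetical, and interpreting filename entries such as "test_*" as shell-style globs is an assumption. Path banning and check_binaries are not covered.

```python
import fnmatch
import os

BANLIST_EXTENSIONS_DEFAULT = ("html", "css", "lock", "storyboard", "xib")


def filename_banlist(filepath, banlist_extensions=(), banlist_filenames=(),
                     include_default_banlist_extensions=False):
    """Return False if the file is banned, True otherwise."""
    name = os.path.basename(filepath)
    ext = os.path.splitext(name)[1].lstrip(".")
    banned_exts = set(banlist_extensions)
    if include_default_banlist_extensions:
        banned_exts.update(BANLIST_EXTENSIONS_DEFAULT)
    if ext in banned_exts:
        return False
    # Assumption: filename entries such as "test_*" are shell-style globs.
    return not any(fnmatch.fnmatchcase(name, pat) for pat in banlist_filenames)


svg_ok = filename_banlist("logo.svg", banlist_extensions=["svg", "gzip", "zip"])
html_ok = filename_banlist("index.html", banlist_extensions=["svg"])
html_ok_with_defaults = filename_banlist(
    "index.html", banlist_extensions=["svg"],
    include_default_banlist_extensions=True)
```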

Post-Validation

Here is the list of the post validators used by our secrets detection engine.

AssignmentBanlistPostValidator

This post validator is used to discard matches based on the pattern of their assignment variables. As a reminder, we refer to an assignment as any statement of the form {assigned_variable} {assignment_token} {value}, for instance: my_variable = "HelloWorld".

Usage:

type: AssignmentBanlistPostValidator

CommonHostBanlistPostValidator

This post validator is a banlist of commonly used false positive hosts such as localhost, LAN IPs, dummy hostnames, etc.

Usage:

type: CommonHostBanlistPostValidator

CommonPasswordBanlistPostValidator

This post validator is a banlist of commonly used false positive passwords such as passwd, changeme, etc.

Usage:

type: CommonPasswordBanlistPostValidator

CommonUsernameBanlistPostValidator

This post validator is a banlist of commonly used false positive usernames such as user, username, etc.

Usage:

type: CommonUsernameBanlistPostValidator

CommonHighEntropyBanlistPostValidator

This post validator is a banlist of commonly used placeholders or example generic high entropy values.

Usage:

type: CommonHighEntropyBanlistPostValidator

CommonValueBanlistPostValidator

This post validator is a banlist of commonly used false positive generic values, such as ${VAR}, ${SECRET}, etc.

Usage:

type: CommonValueBanlistPostValidator

DictFilterPostValidator

This post validator looks for common words of a dictionary inside the matched string. It returns false if it finds too many of those, that is to say more than threshold_words_pct_matched of the characters in the matched string. This is useful to ban expected high entropy matches that actually contain English words.
The upper limit for the time complexity of the algorithm is O(n^2) with n the length of the secret to validate. We also need to load the dictionary of words (a Python set) in memory.

Usage:

type: DictFilterPostValidator
wordset: ["hello"] # optional; if not provided, a default dictionary is used.
min_word_length: 4 # optional[4] search for words that are at least 4 characters.
max_length: 12 # optional[None] search for words that are at most 12 characters.
threshold_words_pct_matched: 0.4 # optional[0.4] ban match if more than 40% of the string is composed of common words.
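
One way to picture the check is to enumerate the substrings of the match, mark the characters covered by dictionary words, and compare the covered fraction to the threshold. The sketch below enumerates O(n^2) candidate substrings. The function name dict_filter, the lowercasing step and the tiny wordset are assumptions made for illustration, and the real implementation may differ.

```python
def dict_filter(match, wordset, min_word_length=4, max_length=None,
                threshold_words_pct_matched=0.4):
    """Return False if too much of `match` is made of dictionary words."""
    text = match.lower()  # assumption: matching is case-insensitive
    n = len(text)
    if n == 0:
        return True
    covered = [False] * n
    for start in range(n):
        remaining = n - start
        longest = remaining if max_length is None else min(max_length, remaining)
        for length in range(min_word_length, longest + 1):
            if text[start:start + length] in wordset:
                # Mark every character belonging to a dictionary word.
                for i in range(start, start + length):
                    covered[i] = True
    return sum(covered) / n <= threshold_words_pct_matched


wordset = {"password", "secret", "hello"}  # stand-in for the default dictionary
wordy = dict_filter("mysecretpassword", wordset)   # 14 of 16 characters covered
random_like = dict_filter("x9Fq2LzP0aVb7Kd3", wordset)
```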

EntropyPostValidator

This post validator ensures that the Shannon entropy of the considered match is greater than a given threshold. It should be used when looking for API keys and generated tokens that have high entropy because they are random.

Usage:

type: EntropyPostValidator
entropy: 3
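
For reference, the Shannon entropy of a string can be computed from its character frequencies. The sketch below assumes entropy is measured in bits per character (log base 2) and that the comparison is strict. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(value):
    """Shannon entropy of `value`, in bits per character."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_post_validator(match, entropy=3.0):
    """Return True if the match is random-looking enough to be kept."""
    return shannon_entropy(match) > entropy


keeps_token = entropy_post_validator("0123456789abcdef")  # 16 distinct chars: 4.0 bits
keeps_word = entropy_post_validator("password")           # 2.75 bits
```

With a threshold of 3, a 16-character value with no repeated character passes, while a common word such as password does not.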

HeuristicPostValidator

This post validator applies a few heuristics and returns false if one of the heuristics returns false. It was introduced because some regular expressions are broad and it can be useful to banlist strings that are numbers, URLs or look like a file path.

Heuristics:

  • url: checks if the match starts with 'http', 'ftp' or 'www'
  • number: checks if the match contains only digits
  • heuristic_path: checks if more than 13% of the match's characters are '/'
  • file_path: checks if the match starts with '/'
  • upper: checks if the match contains only uppercase characters
  • lower: checks if the match contains only lowercase characters
  • date: checks if the match is an ISO formatted date
  • hex_color: checks if the match is a hex color

Usage:

type: HeuristicPostValidator
filters: ["number"] # can be "url", "number", "heuristic_path", "file_path"
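
The heuristics listed above could be approximated as simple predicates, with the validator rejecting a match as soon as one selected heuristic fires. This is a sketch under assumptions: the exact rules used by the engine are not documented here, so details such as requiring a leading '#' for hex colors or accepting only date-like ISO strings are guesses, and the names HEURISTICS and heuristic_post_validator are hypothetical.

```python
import re
from datetime import date


def _is_iso_date(s):
    try:
        date.fromisoformat(s)
        return True
    except ValueError:
        return False


# Each predicate returns True when the match looks like a false positive.
HEURISTICS = {
    "url": lambda s: s.startswith(("http", "ftp", "www")),
    "number": lambda s: s.isdigit(),
    "heuristic_path": lambda s: len(s) > 0 and s.count("/") / len(s) > 0.13,
    "file_path": lambda s: s.startswith("/"),
    "upper": lambda s: s.isalpha() and s.isupper(),
    "lower": lambda s: s.isalpha() and s.islower(),
    "date": _is_iso_date,
    # Assumption: a hex color is '#' followed by 3 or 6 hex digits.
    "hex_color": lambda s: re.fullmatch(
        r"#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})", s) is not None,
}


def heuristic_post_validator(match, filters):
    """Return False if any of the selected heuristics flags the match."""
    return not any(HEURISTICS[name](match) for name in filters)


rejected = heuristic_post_validator("1234567890", ["number"])
accepted = heuristic_post_validator("s3cr3tT0ken", list(HEURISTICS))
```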

MatchesPostValidator

This post validator applies other post validators to only a subset of matches. A matcher can return multiple matches, and you may want to apply post validation rules to only some of them. For example, a ConnectionUriMatcher returns host, username and password matches, and you may want to apply different post validators to each of these matches.

Usage:

type: MatchesPostValidator
names: ["username"] # names of the matches to post validate. Other matches are automatically validated
post_validators: # List of post validators to apply
  - type: CommonUsernameBanlistPostValidator
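
The idea can be sketched with matches stored in a dictionary keyed by match name. The data layout, the function names and the two-entry username banlist are hypothetical stand-ins. Only the named matches are checked, and all other matches are accepted as is.

```python
COMMON_USERNAMES = {"user", "username"}  # tiny stand-in for the real banlist


def common_username_banlist(value):
    """Return False for commonly used placeholder usernames."""
    return value.lower() not in COMMON_USERNAMES


def matches_post_validator(matches, names, post_validators):
    """Apply `post_validators` only to the matches listed in `names`."""
    for name in names:
        if name not in matches:
            continue
        if not all(validate(matches[name]) for validate in post_validators):
            return False
    # Matches that are not named are automatically validated.
    return True


placeholder = {"host": "db.internal", "username": "user", "password": "x8!kQ"}
real = {"host": "db.internal", "username": "svc_billing", "password": "user"}

placeholder_ok = matches_post_validator(placeholder, ["username"], [common_username_banlist])
real_ok = matches_post_validator(real, ["username"], [common_username_banlist])
```

In the second case the password happens to be "user", but it is not checked because only the username match is named.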

MinimumDigitsPostValidator

This post validator verifies that at least n digits are present in the given values.

Usage:

type: MinimumDigitsPostValidator
digits: 3
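
A minimal sketch of this check, assuming only ASCII digits are counted (the function name minimum_digits is hypothetical):

```python
import string


def minimum_digits(value, digits=3):
    """Return True if `value` contains at least `digits` ASCII digits."""
    return sum(ch in string.digits for ch in value) >= digits


enough = minimum_digits("abc123")
too_few = minimum_digits("abc12")
```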

ValueBanlistPostValidator

This post validator bans value patterns. It returns false if a pattern is found.

Usage:

type: ValueBanlistPostValidator
patterns: ["ab[0-9]{2}"] # returns false if this pattern is found

ValueSimilarityPostValidator

This post validator bans groups of matches whose similarity is strictly greater than the maximum similarity given. The similarity is the average of the similarities between each pair of matches. With this method, you can easily remove placeholders in multi-match secrets.

Usage:

type: ValueSimilarityPostValidator
max_similarity: 0.8
similarity: difflib
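
The averaging over pairs can be sketched with Python's difflib. Using SequenceMatcher.ratio() as the similarity measure is an assumption about what the difflib option refers to, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def average_similarity(values):
    """Average pairwise similarity (0.0 to 1.0) between all values."""
    pairs = list(combinations(values, 2))
    if not pairs:
        return 0.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def value_similarity(values, max_similarity=0.8):
    """Return False if the values are strictly more similar than allowed."""
    return average_similarity(values) <= max_similarity


placeholder_pair = value_similarity(["changeme", "changeme"])
distinct_pair = value_similarity(["AKIA1234", "zQ9x-77w"])
```

A pair made of the same placeholder repeated twice has a similarity of 1.0 and is banned, while two values with no characters in common have a similarity of 0.0 and are kept.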

ContextWindowBanlistPostValidator

This post validator bans value patterns in a window around the matched string. It returns false if one of the patterns is found in the window.
You can choose between three types of window with the window_type parameter: "left", "center" or "right".

A "left" window will span from window_width characters before the matched string to the end of the matched string.
A "right" window will span from the beginning of the matched string to window_width after the matched string.
A "center" window will span from window_width characters before the matched string to window_width characters after it, including the matched string itself.

The window_type default value is "center".
When window_width is not set, the window extends to the document boundaries: the whole document if window_type is "center", from the beginning of the document to the end of the matched string if window_type is "left", and from the beginning of the matched string to the end of the document if window_type is "right".

Usage:

type: ContextWindowBanlistPostValidator
patterns: ["sha", "sum"] # returns false if one of those patterns is found
window_width: 25 # default None
window_type: "left" # default center
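
The window rules above can be sketched as a slice of the document around the match. The function names and the (start, end) offsets used to locate the match are hypothetical, and patterns are assumed to be regular expressions.

```python
import re


def context_window(document, start, end, window_type="center", window_width=None):
    """Return the text around the match located at document[start:end]."""
    if window_type not in ("left", "center", "right"):
        raise ValueError("window_type must be 'left', 'center' or 'right'")
    left, right = start, end
    if window_type in ("left", "center"):
        # No width means: go back to the beginning of the document.
        left = 0 if window_width is None else max(0, start - window_width)
    if window_type in ("right", "center"):
        # No width means: go on until the end of the document.
        right = (len(document) if window_width is None
                 else min(len(document), end + window_width))
    return document[left:right]


def context_window_banlist(document, start, end, patterns,
                           window_type="center", window_width=None):
    """Return False if one of the patterns is found in the window."""
    window = context_window(document, start, end, window_type, window_width)
    return not any(re.search(p, window) for p in patterns)


doc = "sha256sum: 9f86d081884c7d65 token=abcd1234efgh5678"
value = "9f86d081884c7d65"
start = doc.index(value)
checksum_ok = context_window_banlist(
    doc, start, start + len(value), ["sha", "sum"],
    window_type="left", window_width=25)
```

Here the first value is rejected because "sha" appears within 25 characters to its left, which is the kind of checksum context this validator is meant to filter out.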