nacc-form-validator

Data Quality Rule Definition Guidelines

Table of Contents

Introduction

The form validator uses QC rules defined as JSON or YAML data objects to check data, based on Cerberus. In a nutshell:

Example:

YAML Rule Definition JSON Rule Definition When Validating
ptid:
  type: integer
  required: true
birthmo:
  type: integer
  required: true
  min: 1
  max: 12
{
    "ptid": {
        "type": "integer",
        "required": true
    },
    "birthmo": {
        "type": "integer",
        "required": true,
        "min": 1,
        "max": 12
    }
}
{"ptid": 101, "birthmo": 12}    # passes
{"ptid": 102, "birthmo": 15}    # fails
{"ptid": 103}                   # fails

Validation Rules

Check the full list of built-in Cerberus rules here.

Keywords frequently used in UDS rules are described in the table below:

Keyword Description Example
allowed Specify the list of allowed values. Validation will fail if any other value is given in the data record.
schema = {
    "limit": {
        "type": "integer",
        "allowed": [-1, 10, 100]
    }
}

data = {"limit": 10}    # passes
data = {"limit": 20}    # fails
forbidden Specify the list of forbidden values. Validation will fail if values in this list are included in the record.
schema = {
    "user": {
        "type": "string",
        "forbidden": ["viewer", "editor"]
    }
}

data = {"user": "admin"}    # passes
data = {"user": "viewer"}   # fails
min, max Minimum and maximum value allowed (only applicable to object types which support comparison operations like integers or floats). Each keyword can be used independently. Use together to define a range.
schema = {
    "length": {
        "type": "float",
        "min": 10.5,
        "max": 20.5
    }
}

data = {"length": 14}       # passes
data = {"length": 20.8}     # fails
nullable If set to "true", the field value is allowed to be empty. This rule will be checked on every field, regardless of if it's defined or not. The rule's constraints defaults it to "false". In other words, if neither `nullable` nor `required` are set, the field is required.
schema = {
    "country": {
        "type": "string",
        "nullable": True
    }
}

data = {"country": "USA"}   # passes
data = {"country": ""}      # passes

schema = {
    "country": {
        "type": "string"
    }
}

data = {"country": ""}      # fails
required If set to "true", the field is mandatory. Validation will fail when it is missing.
schema = {
    "name": {
        "type": "string",
        "required": True
    },
    "age": {
        "type": "integer",
        "nullable": True
    }
}

data = {"name": "Steve", "age": 50}     # passes
data = {"name": "Debby"}                # passes
data = {"age": 40}                      # fails
type Data type allowed for the value. See the Cerberus documentation for the list of type names. If multiple types are allowed, you can specify the types as a list.
schema = {
    "limit": {
        "type": "integer"
    }
}

data = {"limit": 10}        # passes
data = {"limit": 11.5}      # fails

schema = {
    "limit": {
        "type": ["integer", "float"]
    }
}

data = {"limit": 10}        # passes
data = {"limit": 11.5}      # passes
data = {"limit": "one"}     # fails
anyof Allows to define different sets of rules to validate against, supplied in a list of dicts. Field will be considered valid if any of the provided constraints validates the field.
schema = {
    "age": {
        "type": "integer",
        "anyof": [
            {
                "min": 0,
                "max": 120
            },
            {
                "allowed": [999]
            }
        ]
    }
}

data = {"age": 40}      # passes
data = {"age": 999}     # passes
data = {"age": 200}     # fails
regex Regex to validate against; only valid for string values.
schema = {
    "email": {
        "type": "string",
        "regex": "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    }
}

data = {"email": "john@example.com"}            # passes
data = {"email": "john_at_example_dot_com"}     # fails

Custom Rules Defined for UDS

These are defined in the NACCValidator class.

compare_with

Used to validate the field based on comparison with another field, with optional adjustments.

The value to compare to (base) can be another field in the schema OR a special keywords either related to the current date (i.e. the exact time/date at time of validation) or a previous record

The rule definition for compare_with should follow the following format:

{
    "field_name": {
        "compare_with": {
            "comparator": "comparator, one of >, <, >=, <=, ==, !=",
            "base": "field or value to compare field_name to",
            "adjustment": "(optional) the adjustment field or value",
            "op": "(optional) operation, one of +, -, *, /, abs",
            "previous_record": "(optional) boolean, whether or not to compare to base in the previous record",
            "ignore_empty": "(optional) boolean, whether or not to ignore previous records where this field is empty"
        }
}

Example:

birthyr <= current_year - 15, e.g. birthyr must be at least 15 years prior to the current year.

YAML Rule Definition JSON Rule Definition When Validating
birthyr:
  type: integer
  required: true
  compare_with:
    comparator: <=
    base: current_year
    adjustment: 15
    op: "-"
{
    "birthyr": {
        "type": "integer",
        "required": true,
        "compare_with": {
            "comparator": "<=",
            "base": "current_year",
            "adjustment": 15,
            "op": "-"
        }
    }
}
{"birthyr": 1995}   # passes
{"birthyr": 2030}   # fails until 2045

Absolute Value Example

abs(waist1 - waist2) <= 0.5, e.g. the difference between waist1 and waist2 cannot be more than 0.5

YAML Rule Definition JSON Rule Definition When Validating
waist1:
  type: float
  required: true
  compare_with:
    comparator: <=
    base: waist2
    adjustment: 0.5
    op: abs
waist2:
  type: float
  required: true
{
        "waist1": {
            "type": "float",
            "required": true,
            "compare_with": {
                "comparator": "<=",
                "base": "waist2",
                "op": "abs",
                "adjustment": 0.5
            }
        },
        "waist2": {
            "type": "float",
            "required": true
        }
    }
{'waist1': 5, 'waist2': 5.25}   # passes
{'waist1': 5, 'waist2': 4.4}    # fails

compare_age

Used to compare ages. Takes the following parameters:

Age is calculated by combining and converting the birth_* fields into a date, and then calculating the age at the field (assuming the field’s value is also a valid date). The exact calculation is age = (field_value - birth_date).days / 365.25. This result must then satisfy the comparison formula against the compare_to field, e.g. age <= compare_to.

Additional things to note:

The rule definition for compare_age should follow the following format:

{
    "field_name": {
        "compare_age": {
            "comparator": "comparator, one of >, <, >=, <=, ==, !=",
            "birth_year": "the birth year (or year the age should start)",
            "birth_month": "the birth month (or month the age should start) - optional, defaults to 1 (first month year)",
            "birth_day": "the birth day (or day the age should start) - optional, defaults to 1 (first day of year)",
            "compare_to": "variable name, integer, or list of variable names/integers to compare to"
        }
    }
}

Example:

age at frmdate >= behage

YAML Rule Definition JSON Rule Definition When Validating
frmdate:
  type: string
  formatting: date
  compare_age:
    comparator: ">="
    birth_year: birthyr
    birth_month: birthmo
    compare_to: behage
birthmo:
  type: integer
  min: 1
  max: 12
birthyr:
  type: integer
behage:
  type: integer
{
    "frmdate": {
        "type": "string",
        "formatting": "date",
        "compare_age": {
            "comparator": ">=",
            "birth_year": "birthyr",
            "birth_month": "birthmo",
            "compare_to": "behage"
        }
    },
    "birthmo": {
        "type": "integer",
        "min": 1,
        "max": 12
    },
    "birthyr": {
        "type": "integer"
    },
    "behage": {
        "type": "integer"
    }
}
{'frmdate': '2024/02/02', 'birthmo': 6, 'birthyr': 1950, 'behage': 50}  # passes
{'frmdate': '2024/02/02', 'birthmo': 1, 'birthyr': 2024, 'behage': 50}  # fails

compatibility

Used to specify the list of compatibility (if-then-else) constraints for a given field with other fields within the form or across multiple forms. A field will only pass validation if none of the compatibility constraints are violated.

Each constraint specifies if, then, and (optionally) else attributes to allow the validation of a set of fields/subschemas based on the outcome of other fields/subschemas (i.e. when the schema(s) defined under if evaluates to true for a given record, then the schema(s) specified under then will be evaluated).

Each if/then/else attribute can have several fields which need to be satisifed, with the *_op attribute specifying the boolean operation in which to compare the different fields. For example, if if_op = or, then as long as any of the fields satsify their schema, the then attribute will be evaluated. The default *_op is and.

The rule definition for compatibility follows the following format:

{
    "<field_name>": {
        "compatibility": [
            {
                "if": {
                    "<field_name>": "subschema to be satisifed"
                },
                "then": {
                    "<field_name>": "subschema to be satisifed"
                }
            },
            {
                "if_op": "and",
                "then_op": "or",
                "else_op": "or",

                "if": {
                    "<field_name>": "subschema to be satisifed",
                    "<field_name>": "subschema to be satisifed"
                },
                "then": {
                    "<field_name>": "subschema to be satisifed",
                    "<field_name>": "subschema to be satisifed"
                },
                "else": {
                    "<field_name>": "subschema to be satisifed",
                    "<field_name>": "subschema to be satisifed"
                }
            }
        ]
    }
}

One additional nuance is the evaluation against None/null values. Because Cerberus always evaluates "nullable": False by default, the application of a subschema in this case must explicitly set "nullable": True if the attributes evaluate or result in null values. For example

# if case: If PARENTVAR is blank or 88, then VAR1 must be blank
"if": {
    "parentvar": {
        "nullable": True,  # <--- external nullable flag for the if clause
        "anyof": [
            {
                "nullable": True,
                "filled": False
            },
            {
                "allowed": [88]
            }
        ]
    }
},
"then": {
    "var1": {
        "nullable": True,
        "filled": False}
}

# then case: if PARENTVAR is blank, then the following must be blank: var1, var2, var3
"if": {
    "parentvar": {
        "nullable": True,
        "filled": False
    }
},
"then": {
    "var1": {
        "nullable": True,  # <--- nullable flag for the then clause
        "logic": {
            "formula": {
                "and": [
                    {"==": [None, {"var": "var1"}]},
                    {"==": [None, {"var": "var2"}]},
                    {"==": [None, {"var": "var3"}]}
                ]
            }
        }
    }
}

Examples:

If field incntmod (primary contact mode with participant) is 6, then field incntmdx (specify primary contact mode with participant) cannot be blank.

YAML Rule Definition JSON Rule Definition When Validating
incntmod:
type: integer
  required: true
incntmdx:
  type: integer
  nullable: true
  compatibility:
    - if:
      incntmod:
        allowed:
          - 6
      then:
        nullable: false
{
    "incntmod": {
        "type": "integer",
        "required": true
    },
    "incntmdx": {
        "type": "integer",
        "nullable": true,
        "compatibility": [
            {
                "if": {
                    "incntmod": {
                        "allowed": [6]
                    }
                },
                "then": {
                    "nullable": false
                }
            }
        ]
    }
}
{"incntmod": 1, "incntmdx": None}   # passes
{"incntmod": 6, "incntmdx": 1}      # passes
{"incntmod": 6, "incntmdx": None}   # fails

This rule can also be used to define the “if not, then” case. For this, we use forbidden instead of allowed.

So if field incntmod (primary contact mode with participant) is NOT 6, then field incntmdx (specify primary contact mode with participant) must be blank.

YAML Rule Definition JSON Rule Definition When Validating
incntmod:
  type: integer
  required: true
incntmdx:
  type: string
  nullable: true
  compatibility:
    - if: null
      incntmod:
        forbidden:
          - 6
      then:
        nullable: true
        filled: false
{
    "incntmod": {
        "type": "integer",
        "required": true
    },
    "incntmdx": {
        "type": "string",
        "nullable": true,
        "compatibility": [
            {
                "if": {
                    "incntmod": {
                        "forbidden": [6]
                    }
                },
                "then": {
                    "nullable": true,
                    "filled": false
                }
            }
        ]
    }
}
{"incntmod": 1, "incntmdx": None}   # passes
{"incntmod": 6, "incntmdx": 1}      # fails
{"incntmod": 6, "incntmdx": None}   # passes
{"incntmod": 1, "incntmdx": 1}      # fails

logic

Used to specify a mathematical formula/expression to validate against, and utilizes json-logic-py (saved as json_logic.py). This rule overlaps with compare_with, but allows for comparison between multiple fields as well as more complex, nested mathematical expressions. That being said, it does not account for the same special keywords like current_year.

The rule definition for logic should follow the following format:

{
    "<field_name>": {
        "logic": {
            "formula": {
                "operator": "list of arguments for the operator"
            },
            "errormsg": "<optional error message to supply if validation fails>"
        }
}

Example:

One of var1, var2, or var3 must be 1.

YAML Rule Definition JSON Rule Definition When Validating
var1:
  type: integer
  nullable: true
var2:
  type: integer
  nullable: true
var3:
  type: integer
  nullable: true
  logic:
    formula:
      or:
        - ==:
          - 1
          - var: var1
        - ==:
          - 1
          - var: var2
        - ==:
          - 1
          - var: var3
{
    "var1": {
        "type": "integer",
        "nullable": true,
    },
    "var2": {
        "type": "integer",
        "nullable": true,
    },
    "var3": {
        "type": "integer",
        "nullable": true,
        "logic": {
            "formula": {
                "or": [
                    {
                        "==": [1, {"var": "var1"}]
                    },
                    {
                        "==": [1, {"var": "var2"}]
                    },
                    {
                        "==": [1, {"var": "var3"}]
                    }
                ]
            }
        }
    }
}
{"var1": 1, "var2": 1, "var3": 1}            # passes
{"var1": 1, "var2": None, "var3": None}      # passes
{"var1": None, "var2": None, "var3": None}   # fails

Custom Operations

The validator also has custom operators in addition to the ones provided by json-logic-py:

Operator Arguments Description
count [var1, var2, var3...] Counts how many valid variables are in the list, ignoring null and 0 values
count_exact [base, var1, var2, var3, ...] Counts how many values in the list equal the base. The first value is always considered the base, and the rest of the list is compared to it, so this operator requires at least 2 items.

temporalrules

Used to specify the list of checks to be performed against the previous visit for the same participant.

Each constraint specifies previous and current attributes. If conditions specified under the previous subschema are satisfied by the previous visit record, then the current visit record must satisfy the conditions specified under current subschema.

Each constraint also has optional fields that can be set:

NOTE: To validate temporalrules, the validator should have a Datastore instance which will be used to retrieve the previous visit record(s) for the participant.

The rule definition for temporalrules should follow the following format:

{
    "<parent_field_name>": {
        "temporalrules": [
            {
                "previous": {
                    "<field_name>": "subschema to be satisfied for the previous record"
                },
                "current": {
                    "<field_name>": "subschema to be satisfied for the current record, will be evaluated if the previous subschema is satisfied"
                }
            },            {
                "ignore_empty": ["<field_name_1>", "<field_name_2>"],
                "prev_op": "or",
                "previous": {
                    "<field_name_1>": "subschema to be satisfied for the previous record",
                    "<field_name_2>": "subschema to be satisfied for the previous record"
                },
                "current": {
                    "<field_name>": "subschema to be satisfied for the current record, will be evaluated if either of the previous subschemas are satisfied"
                }
            },
            {
                "swap_order": true,
                "previous": {
                    "<field_name>": "subschema to be satisfied for the previous record, will be evaluated if the current subschema is satisfied"
                },
                "current": {
                    "<field_name>": "subschema to be satisfied for the current record"
                }
            }

        ]
    }
}

Example:

If field taxes (difficulty with taxes, business, and other papers) is 0 (normal) at a previous visit, then taxes cannot be 8 (not applicable/never did) at the follow-up visit.

YAML Rule Definition JSON Rule Definition When Validating
taxes:
  type: integer
  temporalrules:
    - previous:
      taxes:
        allowed:
          - 0
      current:
        taxes:
          forbidden:
            - 8
{
    "taxes": {
        "type": "integer",
        "temporalrules": [            
            {
                "previous": {
                    "taxes": {
                        "allowed": [0]
                    }
                },
                "current": {
                    "taxes": {
                        "forbidden": [8]
                    }
                }
            }
        ]
    }
}
# assume this record already exists
# {"visit_date": 1, "taxes": 0}

{"visit_date": 2, "taxes": 1}   # passes
{"visit_date": 2, "taxes": 8}   # fails

check_adcid

Used to check whether a specified ADCID is valid.

This validation is implemented using the function rule with custom check_adcid function in the NACCValidator. The rule definition should be in the following format:

{
    "<adcid_variable>": {
        "function": {
            "name": "check_adcid",
            "args": {"own": "<bool, whether to validate against own ADCID or list of current ADCIDs; defaults to True>"}
        }
    }
}

NOTE: To validate check_adcid, the validator should have a Datastore instance which implements the is_valid_adcid function (which should have access to center’s ADCID and the list of current ADCIDs).

Example:

The adcid must match the center’s own ADCID, whereas oldadcid should be a valid ADCID in the current ADCIDs list.

YAML Rule Definition JSON Rule Definition When Validating
adcid:
  type: integer
  function:
    name: check_adcid
oldadcid:
  type: integer
  function:
    name: check_adcid
    args:
      own: false
{
    "adcid": {
        "type": "integer",
        "function": {
            "name": "check_adcid"
        }
    },
    "oldadcid": {
        "type": "integer",
        "function": {
            "name": "check_adcid",
            "args": {"own": False}
    }
}
# assume the center's own ADCID is 0, and ADCIDs 0-5 inclusive are valid

{"adcid": 0, "oldadcid": 5}   # passes
{"adcid": 2, "oldadcid": 5}   # fails
{"adcid": 0, "oldadcid": 9}   # fails

compute_gds

Custom rule defined to validate the Geriatric Depression Scale (GDS) score computation. Only be used for validating the gds field in UDS Form B6.

The rule definition for compute_gds should follow the following format:

{
    "gds": {
        "compute_gds": ["list of fields used in GDS score computation"]
    }
}

rxnorm

Custom rule defined to check whether a given Drug ID is valid RXCUI code.

This function uses the check_with rule from Cerberus. The rule definition should be in the following format:

{
    "<rxnormid variable>": {
        "check_with": "rxnorm"
    }
}

NOTE: To validate rxnorm, the validator should have a Datastore instance which implements the is_valid_rxcui function which will check if the given rxnormid value is a valid RXCUI code