Data Processor

Modularisation / Include

Consider an input YAML file containing the following data:

hello:
  - INCLUDE: earth.yaml
  - INCLUDE: mars.yaml

If earth.yaml contains:

location: earth
targets:
  - human
  - cat
  - dog

And mars.yaml contains:

location: mars
targets:
  - martian

Running the processor on the input YAML file will yield the following output:

hello:
  - location: earth
    targets:
      - human
      - cat
      - dog
  - location: mars
    targets:
      - martian

If an include file is specified as a relative location, the processor searches for it in the following locations, in order:

The folder/directory containing the parent file.

The current working folder/directory.

The list of include folders/directories. This list can be specified:

Using the environment variable YP_INCLUDE_PATH. Use the same syntax as the PATH environment variable on your platform to specify multiple folders/directories.

Using the command line option --include=DIR. This option can be used multiple times. The utility will append each DIR to the list.

In the include_paths (list) attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance.
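
The search order above can be sketched in plain Python. This is only an illustration of the documented order, not the library's actual code: the function names include_paths_from_env and find_include, and the cwd and exists parameters, are hypothetical and exist only to keep the sketch self-contained.

```python
import os


def include_paths_from_env(environ=None):
    """Split YP_INCLUDE_PATH the same way PATH is split on this platform."""
    environ = os.environ if environ is None else environ
    value = environ.get("YP_INCLUDE_PATH", "")
    return [item for item in value.split(os.pathsep) if item]


def find_include(name, parent_file, include_paths, cwd=None,
                 exists=os.path.exists):
    """Return the first existing candidate for a relative include name."""
    if os.path.isabs(name):
        return name
    roots = [
        os.path.dirname(os.path.abspath(parent_file)),  # 1. parent's folder
        os.getcwd() if cwd is None else cwd,            # 2. working folder
        *include_paths,                                 # 3. include folders
    ]
    for root in roots:
        candidate = os.path.join(root, name)
        if exists(candidate):
            return candidate
    raise FileNotFoundError(name)
```

The exists argument defaults to os.path.exists and is only there so that the order can be checked without touching the file system.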

(In Python only.) The include_dict (dict) attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance can be populated with keys to match INCLUDE names. On a matching key, the value will be inserted as if it were the content loaded from an include file. The processor will always attempt to find a match from this attribute before looking for matching include files from the file system. Suppose we use the following Python logic with the above files:

from yamlprocessor.dataprocess import DataProcessor
# ...
processor = DataProcessor()
processor.include_dict.update({
    'earth.yaml': {'location': 'earth', 'targets': ['dinosaur']},
})
processor.process_data(['hello.yaml'])

We’ll get:

hello:
  - location: earth
    targets:
      - dinosaur
  - location: mars
    targets:
      - martian

LIMITATIONS

YAML anchors/references will only work within files, so an include file will not see anchors in the parent file, and vice versa.

Since INCLUDE is part of a map/dict, keys in the same map/dict that are not recognised will not be processed.

Modularisation / Include with Merge

A common use case for include is to merge a list read from an include file into the current list (or, similarly, to merge a map/object read from an include file into the current map/object). We can tell the processor to do this by adding the MERGE: true option to the INCLUDE: ... instruction.

The following example shows how to merge a list read from an include file into a list in place:

hello:
  - location: Earth
    targets:
      - Human
      - Dolphin
  - INCLUDE: hello-list.yaml
    MERGE: true
  - location: Mars
    targets:
      - Martians

Where hello-list.yaml contains:

- location: Endor
  targets:
    - Ewok
    - Dulok
- location: Pandora
  targets:
    - "Na'vi"
    - Avatar

Running the processor on the first input YAML file will yield the following output, where the content of the include file will be inserted into the original list in place:

hello:
  - location: Earth
    targets:
      - Human
      - Dolphin
  - location: Endor
    targets:
      - Ewok
      - Dulok
  - location: Pandora
    targets:
      - "Na'vi"
      - Avatar
  - location: Mars
    targets:
      - Martians

The following example shows how to merge a map/object read from an include file into a map/object in place:

targets:
  human:
    say: Hello World
  others:
    INCLUDE: sayings.yaml
    MERGE: true
  martians:
    say: Greeting Earthlings

Where sayings.yaml contains:

cat:
  say: miaow
dog:
  say: woof woof

Running the processor on the first input YAML file will yield the following output, where the content of the include file will be inserted into the original map/object. Note: map/object keys do not have an order, so items from the include file will override items in the original map/object that have the same keys, regardless of where the items appear in the original map/object.

targets:
  human:
    say: Hello World
  martians:
    say: Greeting Earthlings
  cat:
    say: miaow
  dog:
    say: woof woof
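
The two merge behaviours can be modelled with ordinary Python lists and dicts. The sketch below only mirrors the outputs shown in this section; the helper names merge_into_list and merge_into_map are hypothetical and are not part of yamlprocessor.

```python
def merge_into_list(items, index, included):
    """Replace the INCLUDE entry at `index` with the included list's items."""
    return items[:index] + list(included) + items[index + 1:]


def merge_into_map(parent, key, included):
    """Drop the INCLUDE entry at `key` and merge in the included map."""
    merged = {name: value for name, value in parent.items() if name != key}
    merged.update(included)  # included items win on duplicate keys
    return merged
```

In the list case the position of the INCLUDE entry is preserved. In the map case the included items override existing items with the same keys, wherever those items appear.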

Modularisation / Include with Query

Consider an example where we want to include only a subset of the data structure from the include file. We can use a JMESPath query to achieve this.

For example, we may have something like this in hello-root.yaml:

hello:
  INCLUDE: planets.yaml
  QUERY: "[?type=='rocky']"

Where planets.yaml contains:

- location: earth
  type: rocky
  targets:
    - human
    - cat
    - dog
- location: mars
  type: rocky
  targets:
    - martian
- location: jupiter
  type: gaseous
  targets:
    - ...

The processor will select the planets of rocky type, and the output will look like:

hello:
- location: earth
  type: rocky
  targets:
    - human
    - cat
    - dog
- location: mars
  type: rocky
  targets:
    - martian
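
For readers unfamiliar with JMESPath, the filter [?type=='rocky'] behaves roughly like a list comprehension. The following is a plain-Python equivalent for this particular query only; it does not use a JMESPath library, and the function name select_by_type is hypothetical.

```python
# The content of planets.yaml, written out as Python data.
planets = [
    {"location": "earth", "type": "rocky", "targets": ["human", "cat", "dog"]},
    {"location": "mars", "type": "rocky", "targets": ["martian"]},
    {"location": "jupiter", "type": "gaseous", "targets": ["..."]},
]


def select_by_type(items, wanted):
    """Keep only the items whose `type` field equals `wanted`."""
    return [item for item in items if item.get("type") == wanted]


rocky = select_by_type(planets, "rocky")
```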

Multiple Input Files Concatenation

You can specify multiple input files in both command line and Python usage. The input files will be concatenated together (as text) before being parsed as a whole YAML document. For example, suppose we have part1.yaml with:

hello:

And part2.yaml with:

- earth
- mars

And part3.yaml with:

- jupiter
- saturn

Running yp-data -o- part1.yaml part2.yaml part3.yaml will give:

hello:
- earth
- mars
- jupiter
- saturn

You can achieve the same results by running:

from yamlprocessor.dataprocess import DataProcessor
# ...
processor = DataProcessor()
processor.process_data(['part1.yaml', 'part2.yaml', 'part3.yaml'])

String Value Variable Substitution

Consider:

key: ${SWEET_HOME}/sugar.txt

If SWEET_HOME is defined in the environment and has the value /home/sweet, then passing the above input to the processor will give the following output:

key: /home/sweet/sugar.txt

Note:

The processor recognises both $SWEET_HOME and ${SWEET_HOME}.

The processor is not implemented using a shell, so shell syntax beyond these two forms won’t work.

You can configure what variables are available for substitution.

On the command line:

Use the --define=KEY=VALUE (-D KEY=VALUE) option to define new variables or override the value of an existing one.

Use the --undefine=KEY (-U KEY) option to remove a variable.

Use the --no-environment (-i) option if you do not want to use any variables defined in the environment for substitution. (So only those specified with --define=KEY=VALUE will work.)

In Python, simply manipulate the variable_map (dict) attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance. The dict is a copy of os.environ at initialisation.

Finally, if you reference a variable in YAML that is not defined, you will normally get an unbound variable error. You can modify this behaviour by setting a placeholder. On the command line, use the --unbound-placeholder=VALUE option. In Python, set the unbound_placeholder attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance to a string value. If you want to leave the original syntax unchanged for unbound variables, set the placeholder VALUE to YP_ORIGINAL.
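
The substitution rules in this section can be approximated with a regular expression. This is a minimal sketch, not the processor's implementation: the function name substitute, the exact pattern, and the use of KeyError for the unbound variable error are all assumptions made for illustration.

```python
import re

# Matches $NAME or ${NAME}. The exact set of characters allowed in a
# name is an assumption of this sketch.
VARIABLE_RE = re.compile(
    r"\$(?:(?P<plain>[A-Za-z_]\w*)|\{(?P<braced>[A-Za-z_]\w*)\})"
)
KEEP_ORIGINAL = "YP_ORIGINAL"


def substitute(text, variable_map, unbound_placeholder=None):
    """Replace variable references in `text` using `variable_map`."""

    def replace(match):
        name = match.group("plain") or match.group("braced")
        if name in variable_map:
            return variable_map[name]
        if unbound_placeholder is None:
            raise KeyError(f"unbound variable: {name}")
        if unbound_placeholder == KEEP_ORIGINAL:
            return match.group(0)  # leave the original syntax unchanged
        return unbound_placeholder

    return VARIABLE_RE.sub(replace, text)
```

Passing a plain dict as variable_map mirrors the variable_map attribute described above, which starts as a copy of os.environ.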

Variable Substitution Include Scope

It is possible to define or override the values of the variables for substitution in include files. The scope of the change will be local to the include file (and files that it includes). The following is an example of how to specify include scope variables.

Suppose we have a file called hello.yaml with:

hello:
  - INCLUDE: world.yaml
    VARIABLES:
      WORLD_NAME: venus
  - INCLUDE: world.yaml
    VARIABLES:
      WORLD_NAME: mars
  - INCLUDE: world.yaml

And a file called world.yaml with:

name: ${WORLD_NAME}
is_rocky: true

Running yp-data --define=WORLD_NAME=earth hello.yaml will give:

hello:
  - name: venus
    is_rocky: true
  - name: mars
    is_rocky: true
  - name: earth
    is_rocky: true

This can even be nested. For example, suppose we have main.yaml:

hello:
- INCLUDE: building.yaml
  VARIABLES:
    building: Castle
    car: Porsche

And a file called building.yaml with:

property: ${building}
car:
  INCLUDE: cars.yaml

And a file called cars.yaml with:

type: ${car}

Running yp-data main.yaml will give:

hello:
- property: Castle
  car:
    type: Porsche
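
One way to picture include scope is as a copy of the variable mapping that is extended on the way down and discarded on the way back up. The sketch below models the three files above as in-memory Python data. The FILES table and the process function are hypothetical, and only whole-value ${NAME} references are handled.

```python
# The three files from the nested example, as already-parsed data.
FILES = {
    "main.yaml": {
        "hello": [
            {
                "INCLUDE": "building.yaml",
                "VARIABLES": {"building": "Castle", "car": "Porsche"},
            },
        ],
    },
    "building.yaml": {"property": "${building}", "car": {"INCLUDE": "cars.yaml"}},
    "cars.yaml": {"type": "${car}"},
}


def process(data, variables):
    """Expand INCLUDE entries, giving each include its own variable scope."""
    if isinstance(data, dict):
        if "INCLUDE" in data:
            # New scope: the outer variables plus this include's VARIABLES.
            scope = {**variables, **data.get("VARIABLES", {})}
            return process(FILES[data["INCLUDE"]], scope)
        return {key: process(value, variables) for key, value in data.items()}
    if isinstance(data, list):
        return [process(item, variables) for item in data]
    if isinstance(data, str) and data.startswith("${") and data.endswith("}"):
        return variables[data[2:-1]]
    return data


result = process(FILES["main.yaml"], {})
```

Because each include receives a new dict, a value set for one include never leaks into its siblings, while nested includes inherit it.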

It is possible to pass variables of any type via the include scope, and reference them in a substitution. However, only variables of string type can be used in substitution that involves a string concatenation.

Consider:

hello:
- INCLUDE: greet.yaml
  VARIABLES:
    HELLO: greet
    TARGETS:
      - Humans
      - Martians

You can reference TARGETS in greet.yaml on its own but not in a string substitution.

For example, this causes a ValueError:

# greet.yaml
say:
  - ${HELLO} ${TARGETS}  # Bad, cannot concatenate a list to a string
  # ...

But this is fine:

# greet.yaml
say:
  - hello: ${HELLO}
    targets: ${TARGETS}  # Good, value used on its own
  # ...
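
The difference between the two cases can be sketched as follows: a value that is exactly one reference is replaced by the variable's value whatever its type, while a reference embedded in a longer string must resolve to a string. The function name substitute_value is hypothetical and the logic is a simplified model, not the processor's code.

```python
import re

VARIABLE_RE = re.compile(r"\$\{(\w+)\}")


def substitute_value(text, variables):
    """Substitute ${NAME} references in one scalar value."""
    whole = VARIABLE_RE.fullmatch(text)
    if whole:
        # The value is a single reference: any type may be returned.
        return variables[whole.group(1)]

    def replace(match):
        value = variables[match.group(1)]
        if not isinstance(value, str):
            raise ValueError(
                f"cannot concatenate {type(value).__name__} to a string"
            )
        return value

    return VARIABLE_RE.sub(replace, text)
```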

String Value Date-Time Substitution

The YAML processor utility also supports date-time substitution using a similar syntax, for variable names starting with:

YP_TIME_NOW (current time, time when yp-data starts running or set on initialisation of a yamlprocessor.dataprocess.DataProcessor instance).

YP_TIME_REF (reference time, specified using the YP_TIME_REF_VALUE environment variable, the --time-ref=VALUE command line option, or the time_ref attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance in Python). If no value is set for the reference time, any reference to the reference time will simply use the current time.

You can use one or more of these trailing suffixes to apply deltas to the date-time:

_PLUS_xxx: adds the duration to the date-time.

_MINUS_xxx: subtracts the duration from the date-time.

_AT_xxx: sets individual fields of the date-time. E.g., _AT_T0H will set the hour of the day part of the date-time to 00.

where xxx is date-time duration-like syntax in the form nYnMnDTnHnMnS, e.g.:

12Y is 12 years.

1M2D is 1 month and 2 days.

1DT12H is 1 day and 12 hours.

T12H30M is 12 hours and 30 minutes.

Examples (for argument’s sake, let’s assume the current time is 2022-02-01T10:11:18Z and we have set the reference time to 2024-12-25T11:11:11Z):

Variable	Output
${YP_TIME_NOW}	2022-02-01T10:11:18Z
${YP_TIME_NOW_AT_T0H0M0S}	2022-02-01T00:00:00Z
${YP_TIME_NOW_AT_T0H0M0S_PLUS_T12H}	2022-02-01T12:00:00Z
${YP_TIME_REF}	2024-12-25T11:11:11Z
${YP_TIME_REF_AT_1DT18H}	2024-12-01T18:11:11Z
${YP_TIME_REF_PLUS_T6H30M}	2024-12-25T17:41:11Z
${YP_TIME_REF_MINUS_1D}	2024-12-24T11:11:11Z
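
The arithmetic behind the table can be reproduced with Python's datetime module. The sketch below is an approximation under stated assumptions: the helper names parse_fields, apply_at and apply_delta are hypothetical, and year and month deltas are left out because timedelta cannot express them.

```python
import re
from datetime import datetime, timedelta, timezone

# Duration-like syntax nYnMnDTnHnMnS, every part optional.
DURATION_RE = re.compile(
    r"(?:(?P<year>\d+)Y)?(?:(?P<month>\d+)M)?(?:(?P<day>\d+)D)?"
    r"(?:T(?:(?P<hour>\d+)H)?(?:(?P<minute>\d+)M)?(?:(?P<second>\d+)S)?)?"
)


def parse_fields(text):
    """Turn e.g. '1DT18H' into {'day': 1, 'hour': 18}."""
    match = DURATION_RE.fullmatch(text)
    fields = {}
    if match:
        fields = {k: int(v) for k, v in match.groupdict().items() if v}
    if not fields:
        raise ValueError(f"bad duration syntax: {text!r}")
    return fields


def apply_at(when, text):
    """_AT_xxx: set individual fields of the date-time."""
    return when.replace(**parse_fields(text))


def apply_delta(when, text, sign=1):
    """_PLUS_xxx (sign=1) or _MINUS_xxx (sign=-1), days and smaller only."""
    fields = parse_fields(text)
    if "year" in fields or "month" in fields:
        raise NotImplementedError("calendar deltas are outside this sketch")
    delta = timedelta(
        days=fields.get("day", 0),
        hours=fields.get("hour", 0),
        minutes=fields.get("minute", 0),
        seconds=fields.get("second", 0),
    )
    return when + sign * delta


ref = datetime(2024, 12, 25, 11, 11, 11, tzinfo=timezone.utc)
```

With ref set as above, apply_at(ref, "1DT18H") corresponds to ${YP_TIME_REF_AT_1DT18H}, and suffixes chain by nesting the calls.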

You can specify different date-time output formats using:

Environment variables YP_TIME_FORMAT[_<NAME>].

The command line option --time-format=[NAME=]FORMAT.

The time_formats (dict) attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance in Python. The default format is %FT%T%:z.

For example, if you set:

--time-format='%FT%T%:z' (default)

--time-format=CTIME='%a %e %b %T %Z %Y' or export YP_TIME_FORMAT_CTIME='%a %e %b %T %Z %Y'

--time-format=ABBR='%Y%m%dT%H%M%S%z' or export YP_TIME_FORMAT_ABBR='%Y%m%dT%H%M%S%z'

Then:

Variable	Output
${YP_TIME_REF}	2024-12-25T11:11:11Z
${YP_TIME_REF_FORMAT_CTIME}	Wed 25 Dec 11:11:11 GMT 2024
${YP_TIME_REF_PLUS_T12H_FORMAT_ABBR}	20241225T231111Z

See strftime, for example, for a list of date-time format codes. The processor also supports the following format codes for numeric time zone:

%:z +hh:mm numeric time zone (e.g., -08:00, +05:45).
%::z +hh:mm:ss numeric time zone (e.g., -08:00:00, +05:45:00).
%:::z numeric time zone with : to the necessary precision (e.g., -08, +05:45).

In addition, for all numeric time zone format codes (including %z), the processor will use Z to denote UTC time zone (instead of for example +00:00) to save space.
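
The numeric time zone codes can be illustrated with a small formatter for a UTC offset. This is a sketch of the documented output shapes only; the function name format_utc_offset is hypothetical and the processor's own implementation may differ.

```python
from datetime import timedelta


def format_utc_offset(offset, code):
    """Format a UTC offset (a timedelta) for %z, %:z, %::z or %:::z."""
    total = int(offset.total_seconds())
    if total == 0:
        return "Z"  # UTC is always written as Z
    sign = "-" if total < 0 else "+"
    hours, remainder = divmod(abs(total), 3600)
    minutes, seconds = divmod(remainder, 60)
    if code == "%z":
        return f"{sign}{hours:02d}{minutes:02d}"
    if code == "%:z":
        return f"{sign}{hours:02d}:{minutes:02d}"
    if code == "%::z":
        return f"{sign}{hours:02d}:{minutes:02d}:{seconds:02d}"
    if code == "%:::z":
        # Only as much precision as the offset needs.
        text = f"{sign}{hours:02d}"
        if minutes or seconds:
            text += f":{minutes:02d}"
        if seconds:
            text += f":{seconds:02d}"
        return text
    raise ValueError(f"unsupported format code: {code}")
```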

Finally, if a variable name is already in the variable substitution mapping, e.g., defined in the environment or in a --define=... option, then the defined value takes precedence. For example, if you have already exported YP_TIME_REF=whatever, then you will get the value whatever instead of the reference time.

Cast Value Variable Substitution

Environment variables are strings by nature, but YAML scalars can be numbers or booleans. Therefore, for non-string scalar values, i.e. integers, floats and booleans, the YAML processor utility supports casting the value to the correct type before using it for substitution:

${NAME.int}: Cast value of NAME to an integer.
${NAME.float}: Cast value of NAME to a float.
${NAME.bool}: Cast value of NAME to a boolean. Value of NAME must be one of the supported case-insensitive strings: yes, true and 1 will be cast to the boolean true, and no, false and 0 will be cast to the boolean false.

For example, suppose we have main.yaml:

version: ${ITEM_VERSION.int}
speed: ${ITEM_SPEED.float}

Running yp-data -D ITEM_VERSION=4 -D ITEM_SPEED=3.14 main.yaml will give:

version: 4
speed: 3.14

Note: The processor casts integers and floats using Python’s built-in int() and float() functions. The exact behaviour may change with the version of Python you are using.

However, a value that uses a cast must consist of exactly one substitution and nothing else:

- ${NUM2.int}             # good
- xyz${NUM2.int}          # bad
- ${NUM2.int}${NUM3.int}  # bad
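
The three casts can be sketched as a single helper. The int and float branches follow the note above about Python's built-in functions. The helper name cast_value and the choice of ValueError for unsupported input are assumptions of this sketch.

```python
TRUE_STRINGS = {"yes", "true", "1"}
FALSE_STRINGS = {"no", "false", "0"}


def cast_value(text, kind):
    """Cast the string `text` according to `kind`: int, float or bool."""
    if kind == "int":
        return int(text)
    if kind == "float":
        return float(text)
    if kind == "bool":
        lowered = text.lower()  # the supported strings are case insensitive
        if lowered in TRUE_STRINGS:
            return True
        if lowered in FALSE_STRINGS:
            return False
        raise ValueError(f"not a supported boolean string: {text!r}")
    raise ValueError(f"unknown cast: {kind!r}")
```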

Turn Off Processing

To turn off processing of INCLUDE syntax:

On the command line, use the --no-process-include option.

In Python, set the is_process_include attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance to False.

To turn off processing of variable and date-time substitution:

On the command line, use the --no-process-variable option.

In Python, set the is_process_variable attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance to False.

Validation with JSON Schema

You can tell the processor to look for a JSON schema file and validate the current YAML file by adding a schema association line to the beginning of the YAML file, which can be one of:

#!<SCHEMA-URI>

# yaml-language-server: $schema=<SCHEMA-URI>

Where the SCHEMA-URI is a string pointing to the location of a JSON schema file. Some simple assumptions apply:

If SCHEMA-URI is a normal URI with a leading scheme, e.g., https://, it is used as-is.

If SCHEMA-URI does not have a leading scheme and exists in the local file system, then it is also used as-is.

Otherwise, a schema URI prefix can be specified to add to the value of SCHEMA-URI using:

The YP_SCHEMA_PREFIX environment variable.

On the command line, the --schema-prefix=PREFIX option.

In Python, the schema_prefix attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance.

For example, if we have export YP_SCHEMA_PREFIX=file:///etc/ in the environment, both of the following examples will result in a validation against the JSON schema in file:///etc/world/hello.schema.json.

#!file:///etc/world/hello.schema.json
greet: earth
# ...

#!world/hello.schema.json
greet: earth
# ...
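
The lookup rules for SCHEMA-URI can be sketched as two small functions: one reads the association line, the other applies the three assumptions in order. The names schema_uri_from_text and resolve_schema_uri, the regular expression, and the exists parameter are hypothetical and only illustrate the documented behaviour.

```python
import os
import re
from urllib.parse import urlparse

# Either "#!<SCHEMA-URI>" or "# yaml-language-server: $schema=<SCHEMA-URI>".
SCHEMA_LINE_RE = re.compile(
    r"#(?:!|\s*yaml-language-server:\s*\$schema=)(?P<uri>\S+)\s*"
)


def schema_uri_from_text(text):
    """Return the SCHEMA-URI on the first line of `text`, or None."""
    first_line = text.split("\n", 1)[0]
    match = SCHEMA_LINE_RE.fullmatch(first_line)
    return match.group("uri") if match else None


def resolve_schema_uri(uri, prefix="", exists=os.path.exists):
    """Apply the rules above: scheme, then local file, then prefix."""
    if urlparse(uri).scheme:
        return uri  # a normal URI with a leading scheme is used as-is
    if exists(uri):
        return uri  # an existing local file is also used as-is
    return prefix + uri
```

With prefix set to file:///etc/, both example files above resolve to file:///etc/world/hello.schema.json, provided world/hello.schema.json does not exist relative to the working directory.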