Data Processor
Modularisation / Include
Consider an input YAML file containing the following data:
hello:
- INCLUDE: earth.yaml
- INCLUDE: mars.yaml
If earth.yaml contains:
location: earth
targets:
- human
- cat
- dog
And mars.yaml contains:
location: mars
targets:
- martian
Running the processor on the input YAML file will yield the following output:
hello:
  - location: earth
    targets:
      - human
      - cat
      - dog
  - location: mars
    targets:
      - martian
If an include file is specified as a relative location, the processor searches for it from these locations in order:
The folder/directory containing the parent file.
The current working folder/directory.
The list of include folders/directories. This list can be specified:
Using the environment variable YP_INCLUDE_PATH. Use the same syntax as the PATH environment variable on your platform to specify multiple folders/directories.
Using the command line option --include=DIR. This option can be used multiple times. The utility will append each DIR to the list.
In the include_paths (list) attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance.
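The search order above can be sketched in a few lines of Python. This is only an illustration of the documented order, not the processor's own code: the function name find_include and its parameters are hypothetical.

```python
from pathlib import Path


def find_include(name, parent_file, include_paths=(), cwd=None):
    """Return the first existing candidate for an include name, or None.

    Hypothetical helper that mirrors the documented search order only.
    """
    path = Path(name)
    if path.is_absolute():
        return path if path.exists() else None
    # 1. folder of the parent file, 2. current working folder,
    # 3. each folder in the include list, in order.
    folders = [Path(parent_file).parent]
    folders.append(Path(cwd) if cwd is not None else Path.cwd())
    folders.extend(Path(item) for item in include_paths)
    for folder in folders:
        candidate = folder / path
        if candidate.exists():
            return candidate
    return None
```

The cwd argument exists only so the sketch can be exercised without changing the real working directory.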
(In Python only.) The include_dict (dict) attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance can be populated with keys to match INCLUDE names. On a matching key, the value will be inserted as if it were the content loaded from an include file. The processor will always attempt to find a match from this attribute before looking for matching include files from the file system. Suppose we use the following Python logic with the above files:
from yamlprocessor.dataprocess import DataProcessor
# ...
processor = DataProcessor()
processor.include_dict.update({
    'earth.yaml': {'location': 'earth', 'targets': ['dinosaur']},
})
processor.process_data(['hello.yaml'])
We’ll get:
hello:
  - location: earth
    targets:
      - dinosaur
  - location: mars
    targets:
      - martian
LIMITATIONS
YAML anchors/references will only work within files, so an include file will not see anchors in the parent file, and vice versa.
Since INCLUDE is part of a map/dict, keys in the same map/dict that are not recognised will not be processed.
Modularisation / Include with Merge
A common use case for include is to merge a list read from an include file into the current list (or, similarly, to merge a map/object read from an include file into the current map/object). We can tell the processor to do this by adding the MERGE: true option to the INCLUDE: ... instruction.
The following example shows how to merge a list read from an include file into a list in place:
hello:
  - location: Earth
    targets:
      - Human
      - Dolphin
  - INCLUDE: hello-list.yaml
    MERGE: true
  - location: Mars
    targets:
      - Martians
Where hello-list.yaml contains:
- location: Endor
  targets:
    - Ewok
    - Dulok
- location: Pandora
  targets:
    - "Na'vi"
    - Avatar
Running the processor on the first input YAML file will yield the following output, where the content of the include file will be inserted into the original list in place:
hello:
  - location: Earth
    targets:
      - Human
      - Dolphin
  - location: Endor
    targets:
      - Ewok
      - Dulok
  - location: Pandora
    targets:
      - "Na'vi"
      - Avatar
  - location: Mars
    targets:
      - Martians
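The in-place list merge can be pictured as a splice. The sketch below works on already-loaded Python data and is not the processor's implementation: merge_list_include and the load_include callback are hypothetical names, and the include content is an abbreviated stand-in for hello-list.yaml.

```python
def merge_list_include(items, load_include):
    """Splice the content of merged includes into a list, in place order.

    Hypothetical helper: load_include maps an include name to loaded data.
    """
    merged = []
    for item in items:
        if isinstance(item, dict) and "INCLUDE" in item and item.get("MERGE"):
            # The included list replaces the INCLUDE item, element by element.
            merged.extend(load_include(item["INCLUDE"]))
        else:
            merged.append(item)
    return merged


# Abbreviated stand-in for the content of hello-list.yaml.
included = {"hello-list.yaml": [{"location": "Endor"}, {"location": "Pandora"}]}
hello = [
    {"location": "Earth"},
    {"INCLUDE": "hello-list.yaml", "MERGE": True},
    {"location": "Mars"},
]
result = merge_list_include(hello, included.__getitem__)
print([item["location"] for item in result])
# ['Earth', 'Endor', 'Pandora', 'Mars']
```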
The following example shows how to merge a map/object read from an include file into a map/object in place:
targets:
  human:
    say: Hello World
  others:
    INCLUDE: sayings.yaml
    MERGE: true
  martians:
    say: Greeting Earthlings
Where sayings.yaml contains:
cat:
  say: miaow
dog:
  say: woof woof
Running the processor on the first input YAML file will yield the following output, where the content of the include file will be inserted into the original map/object. Note: map/object keys do not have an order, so items from the include file will override items in the original map/object that have the same keys, regardless of where the items appear in the original map/object.
targets:
  human:
    say: Hello World
  martians:
    say: Greeting Earthlings
  cat:
    say: miaow
  dog:
    say: woof woof
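The override rule for maps can be sketched as follows, again on already-loaded Python data. The helper name merge_map_include and the load_include callback are hypothetical, and the processor may well implement this differently.

```python
def merge_map_include(mapping, load_include):
    """Merge included maps into their parent map.

    Hypothetical helper. Included items are applied last, so they override
    original items with the same key wherever those items appear.
    """
    result = {}
    overrides = {}
    for key, value in mapping.items():
        if isinstance(value, dict) and "INCLUDE" in value and value.get("MERGE"):
            overrides.update(load_include(value["INCLUDE"]))
        else:
            result[key] = value
    result.update(overrides)
    return result


sayings = {"sayings.yaml": {"cat": {"say": "miaow"}, "dog": {"say": "woof woof"}}}
targets = {
    "human": {"say": "Hello World"},
    "others": {"INCLUDE": "sayings.yaml", "MERGE": True},
    "martians": {"say": "Greeting Earthlings"},
}
print(list(merge_map_include(targets, sayings.__getitem__)))
# ['human', 'martians', 'cat', 'dog']
```

Note that the key holding the INCLUDE instruction (others in this example) does not appear in the result.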
Modularisation / Include with Query
Consider an example where we want to include only a subset of the data structure from the include file. We can use a JMESPath query to achieve this.
For example, we may have something like this in hello-root.yaml:
hello:
  INCLUDE: planets.yaml
  QUERY: "[?type=='rocky']"
Where planets.yaml contains:
- location: earth
  type: rocky
  targets:
    - human
    - cat
    - dog
- location: mars
  type: rocky
  targets:
    - martian
- location: jupiter
  type: gaseous
  targets:
    - ...
The processor will select the planets of rocky type, and the output will look like:
hello:
  - location: earth
    type: rocky
    targets:
      - human
      - cat
      - dog
  - location: mars
    type: rocky
    targets:
      - martian
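For readers new to JMESPath, the filter expression [?type=='rocky'] keeps the list elements whose type field equals rocky. A plain-Python equivalent of this one query, on abbreviated stand-in data, looks like the sketch below. It only illustrates what the query selects; it is not how the processor evaluates queries.

```python
# Abbreviated stand-in for the content of planets.yaml.
planets = [
    {"location": "earth", "type": "rocky"},
    {"location": "mars", "type": "rocky"},
    {"location": "jupiter", "type": "gaseous"},
]

# Same selection as the JMESPath filter [?type=='rocky'].
# Elements without a "type" key are dropped, as they would be by the filter.
rocky = [planet for planet in planets if planet.get("type") == "rocky"]
print([planet["location"] for planet in rocky])
# ['earth', 'mars']
```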
Multiple Input Files Concatenation
You can specify multiple input files in both command line and Python usage. The input files will be concatenated together (as text) before being parsed as a whole YAML document. For example, suppose we have part1.yaml with:
hello:
And part2.yaml with:
- earth
- mars
And part3.yaml with:
- jupiter
- saturn
Running yp-data -o- part1.yaml part2.yaml part3.yaml will give:
hello:
- earth
- mars
- jupiter
- saturn
You can achieve the same results by running:
from yamlprocessor.dataprocess import DataProcessor
# ...
processor = DataProcessor()
processor.process_data(['part1.yaml', 'part2.yaml', 'part3.yaml'])
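Because the files are joined as text before parsing, a fragment does not need to be a meaningful YAML document on its own. The sketch below shows only the text-joining step, using in-memory strings as hypothetical stand-ins for the three files; the concatenate helper is not part of the library.

```python
# Hypothetical in-memory stand-ins for the three input files.
parts = {
    "part1.yaml": "hello:\n",
    "part2.yaml": "- earth\n- mars\n",
    "part3.yaml": "- jupiter\n- saturn\n",
}


def concatenate(names, read_text):
    """Join the text of each input, in the order given."""
    return "".join(read_text(name) for name in names)


document = concatenate(["part1.yaml", "part2.yaml", "part3.yaml"], parts.__getitem__)
print(document, end="")
```

The joined text is the single YAML document that would then be parsed.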
String Value Variable Substitution
Consider:
key: ${SWEET_HOME}/sugar.txt
If SWEET_HOME is defined in the environment and has the value /home/sweet, then passing the above input to the processor will give the following output:
key: /home/sweet/sugar.txt
Note:
The processor recognises both $SWEET_HOME and ${SWEET_HOME}.
The processor is not implemented using a shell, so shell-specific syntax beyond these two forms won’t work.
You can configure what variables are available for substitution.
On the command line:
Use the --define=KEY=VALUE (-D KEY=VALUE) option to define new variables or override the value of an existing one.
Use the --undefine=KEY (-U KEY) option to remove a variable.
Use the --no-environment (-i) option if you do not want to use any variables defined in the environment for substitution. (So only those specified with --define=KEY=VALUE will work.)
In Python, simply manipulate the variable_map (dict) attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance. The dict is a copy of os.environ at initialisation.
Finally, if you reference a variable in YAML that is not defined, you will normally get an unbound variable error. You can modify this behaviour by setting a placeholder. On the command line, use the --unbound-placeholder=VALUE option. In Python, set the unbound_placeholder attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance to a string value. If you want to leave the original syntax unchanged for unbound variables, set the placeholder VALUE to YP_ORIGINAL.
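The substitution rules in this section can be approximated with a regular expression. The sketch below is an assumption-laden illustration, not the processor's code: the substitute function, its pattern for variable names, and the use of KeyError as a stand-in for the unbound variable error are all hypothetical.

```python
import re

# Matches ${NAME} or $NAME; the name pattern is an assumption of this sketch.
_VARIABLE = re.compile(
    r"\$(?:\{(?P<braced>[A-Za-z_]\w*)\}|(?P<plain>[A-Za-z_]\w*))"
)
YP_ORIGINAL = "YP_ORIGINAL"


def substitute(text, variable_map, unbound_placeholder=None):
    """Replace variable references in text using variable_map."""

    def replace(match):
        name = match.group("braced") or match.group("plain")
        if name in variable_map:
            return variable_map[name]
        if unbound_placeholder is None:
            # Stand-in for the processor's unbound variable error.
            raise KeyError(f"unbound variable: {name}")
        if unbound_placeholder == YP_ORIGINAL:
            return match.group(0)  # leave the original syntax unchanged
        return unbound_placeholder

    return _VARIABLE.sub(replace, text)


print(substitute("${SWEET_HOME}/sugar.txt", {"SWEET_HOME": "/home/sweet"}))
# /home/sweet/sugar.txt
```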
Variable Substitution Include Scope
It is possible to define or override the values of the variables for substitution in include files. The scope of the change will be local to the include file (and files that it includes). The following is an example of how to specify include scope variables.
Suppose we have a file called hello.yaml with:
hello:
  - INCLUDE: world.yaml
    VARIABLES:
      WORLD_NAME: venus
  - INCLUDE: world.yaml
    VARIABLES:
      WORLD_NAME: mars
  - INCLUDE: world.yaml
And a file called world.yaml with:
name: ${WORLD_NAME}
is_rocky: true
Running yp-data --define=WORLD_NAME=earth hello.yaml will give:
hello:
  - name: venus
    is_rocky: true
  - name: mars
    is_rocky: true
  - name: earth
    is_rocky: true
This can even be nested. For example, suppose we have main.yaml:
hello:
  - INCLUDE: building.yaml
    VARIABLES:
      building: Castle
      car: Porsche
And a file called building.yaml with:
property: ${building}
car:
  INCLUDE: cars.yaml
And a file called cars.yaml with:
type: ${car}
Running yp-data main.yaml will give:
hello:
  - property: Castle
    car:
      type: Porsche
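One way to picture include scope is as a chain of mappings, where each include file gets a child scope that falls back to its parent. The sketch below uses collections.ChainMap to model that idea. It is an analogy under that assumption, not a description of the processor's internals, and include_scope is a hypothetical helper.

```python
from collections import ChainMap


def include_scope(parent, variables=None):
    """Return a child scope: local values first, then the parent's values."""
    return ChainMap(dict(variables or {}), parent)


# First example: WORLD_NAME is earth at the top level (as with --define).
base = {"WORLD_NAME": "earth"}
scopes = [
    include_scope(base, {"WORLD_NAME": "venus"}),
    include_scope(base, {"WORLD_NAME": "mars"}),
    include_scope(base),
]
print([scope["WORLD_NAME"] for scope in scopes])
# ['venus', 'mars', 'earth']

# Nested example: cars.yaml sees the variables given to building.yaml.
top = {}
building = include_scope(top, {"building": "Castle", "car": "Porsche"})
cars = include_scope(building)
print(cars["car"])
# Porsche
print("car" in top)
# False
```

The last line shows the local nature of the scope: the parent mapping is left untouched.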
It is possible to pass variables of any type via the include scope, and reference them in a substitution. However, only variables of string type can be used in substitution that involves a string concatenation.
Consider:
hello:
  - INCLUDE: greet.yaml
    VARIABLES:
      HELLO: greet
      TARGETS:
        - Humans
        - Martians
You can reference TARGETS in greet.yaml on its own but not in a string substitution.
For example, this causes a ValueError:
# greet.yaml
say:
  - ${HELLO} ${TARGETS}  # Bad, cannot concatenate a list to a string
# ...
But this is fine:
# greet.yaml
say:
  - hello: ${HELLO}
    targets: ${TARGETS}  # Good, value used on its own
# ...
String Value Date-Time Substitution
The YAML processor utility also supports date-time substitution using a similar syntax, for variable names starting with:
YP_TIME_NOW (current time: the time when yp-data starts running, or the time set on initialisation of a yamlprocessor.dataprocess.DataProcessor instance).
YP_TIME_REF (reference time, specified using the YP_TIME_REF_VALUE environment variable, the --time-ref=VALUE command line option, or the time_ref attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance in Python). If no value is set for the reference time, any reference to the reference time will simply use the current time.
You can use one or more of these trailing suffixes to apply deltas for the date-time:
_PLUS_xxx: adds the duration to the date-time.
_MINUS_xxx: subtracts the duration from the date-time.
_AT_xxx: sets individual fields of the date-time. E.g., _AT_T0H will set the hour of the day part of the date-time to 00 hour.
where xxx is date-time duration-like syntax in the form nYnMnDTnHnMnS, e.g.:
12Y is 12 years.
1M2D is 1 month and 2 days.
1DT12H is 1 day and 12 hours.
T12H30M is 12 hours and 30 minutes.
Examples (for the sake of argument, let’s assume the current time is 2022-02-01T10:11:18Z and we have set the reference time to 2024-12-25T11:11:11Z):
Variable | Output |
---|---|
${YP_TIME_NOW} | 2022-02-01T10:11:18Z |
${YP_TIME_NOW_AT_T0H0M0S} | 2022-02-01T00:00:00Z |
${YP_TIME_NOW_AT_T0H0M0S_PLUS_T12H} | 2022-02-01T12:00:00Z |
${YP_TIME_REF} | 2024-12-25T11:11:11Z |
${YP_TIME_REF_AT_1DT18H} | 2024-12-01T18:11:11Z |
${YP_TIME_REF_PLUS_T6H30M} | 2024-12-25T17:41:11Z |
${YP_TIME_REF_MINUS_1D} | 2024-12-24T11:11:11Z |
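The arithmetic behind the table can be reproduced with Python's datetime module. The sketch below is a simplified illustration, not the processor's implementation: parse_fields, apply_at and apply_delta are hypothetical helpers, and year and month deltas are deliberately left out because they need calendar arithmetic.

```python
import re
from datetime import datetime, timedelta, timezone

# Duration-like syntax nYnMnDTnHnMnS, every part optional.
_FIELDS = re.compile(
    r"(?:(?P<year>\d+)Y)?(?:(?P<month>\d+)M)?(?:(?P<day>\d+)D)?"
    r"(?:T(?:(?P<hour>\d+)H)?(?:(?P<minute>\d+)M)?(?:(?P<second>\d+)S)?)?"
)


def parse_fields(text):
    """Return the fields present in text, e.g. '1DT18H' -> day=1, hour=18."""
    match = _FIELDS.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"bad duration-like value: {text!r}")
    return {k: int(v) for k, v in match.groupdict().items() if v is not None}


def apply_at(moment, text):
    """Sketch of the _AT_xxx suffix: set individual fields."""
    return moment.replace(**parse_fields(text))


def apply_delta(moment, text, sign=1):
    """Sketch of _PLUS_xxx (sign=1) and _MINUS_xxx (sign=-1)."""
    fields = parse_fields(text)
    if "year" in fields or "month" in fields:
        raise NotImplementedError("calendar arithmetic omitted from this sketch")
    delta = timedelta(
        days=fields.get("day", 0),
        hours=fields.get("hour", 0),
        minutes=fields.get("minute", 0),
        seconds=fields.get("second", 0),
    )
    return moment + sign * delta


ref = datetime(2024, 12, 25, 11, 11, 11, tzinfo=timezone.utc)
print(apply_at(ref, "1DT18H").isoformat())
# 2024-12-01T18:11:11+00:00
```

Suffixes chain left to right, so _AT_T0H0M0S_PLUS_T12H corresponds to apply_delta(apply_at(now, "T0H0M0S"), "T12H").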
You can specify different date-time output formats using:
Environment variables YP_TIME_FORMAT[_<NAME>].
The command line option --time-format=[NAME=]FORMAT.
The time_formats (dict) attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance in Python.
The default format is %FT%T%:z.
For example, if you set:
--time-format='%FT%T%:z' (default)
--time-format=CTIME='%a %e %b %T %Z %Y' or export YP_TIME_FORMAT_CTIME='%a %e %b %T %Z %Y'
--time-format=ABBR='%Y%m%dT%H%M%S%z' or export YP_TIME_FORMAT_ABBR='%Y%m%dT%H%M%S%z'
Then:
Variable | Output |
---|---|
${YP_TIME_REF} | 2024-12-25T11:11:11Z |
${YP_TIME_REF_FORMAT_CTIME} | Wed 25 Dec 11:11:11 GMT 2024 |
${YP_TIME_REF_PLUS_T12H_FORMAT_ABBR} | 20241225T231111Z |
See strftime, for example, for a list of date-time format codes. The processor also supports the following format codes for numeric time zone:
%:z +hh:mm numeric time zone (e.g., -08:00, +05:45).
%::z +hh:mm:ss numeric time zone (e.g., -08:00:00, +05:45:00).
%:::z numeric time zone with : to the necessary precision (e.g., -08, +05:45).
In addition, for all numeric time zone format codes (including %z), the processor will use Z to denote UTC time zone (instead of, for example, +00:00) to save space.
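The numeric time zone codes can be illustrated with a small formatter. This is a sketch of the documented output shapes only; format_offset is a hypothetical helper and the processor's own formatting code may differ.

```python
from datetime import timedelta


def format_offset(offset, code="%:z"):
    """Format a UTC offset (a timedelta) for one numeric time zone code."""
    total = int(offset.total_seconds())
    if total == 0:
        return "Z"  # UTC is written as Z for every numeric code
    sign = "-" if total < 0 else "+"
    hours, rest = divmod(abs(total), 3600)
    minutes, seconds = divmod(rest, 60)
    if code == "%z":
        return f"{sign}{hours:02d}{minutes:02d}"
    if code == "%:z":
        return f"{sign}{hours:02d}:{minutes:02d}"
    if code == "%::z":
        return f"{sign}{hours:02d}:{minutes:02d}:{seconds:02d}"
    if code == "%:::z":
        # Only as much precision as the offset needs.
        text = f"{sign}{hours:02d}"
        if minutes or seconds:
            text += f":{minutes:02d}"
        if seconds:
            text += f":{seconds:02d}"
        return text
    raise ValueError(f"unsupported format code: {code}")


print(format_offset(timedelta(hours=5, minutes=45), "%::z"))
# +05:45:00
```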
Finally, if a variable name is already in the variable substitution mapping, e.g., defined in the environment or in a --define=... option, then the defined value takes precedence. So if you have already run export YP_TIME_REF=whatever, then you will get the value whatever instead of the reference time.
Cast Value Variable Substitution
Environment variables are strings by nature, but YAML scalars can be numbers or booleans. Therefore, for non-string scalar values, i.e. integers, floats and booleans, the YAML processor utility supports casting the value to the correct type before using it for substitution:
${NAME.int} Cast value of NAME to an integer.
${NAME.float} Cast value of NAME to a float.
${NAME.bool} Cast value of NAME to a boolean. Value of NAME must be one of the supported case insensitive strings: yes, true and 1 will be cast to the boolean true, and no, false and 0 will be cast to the boolean false.
For example, suppose we have main.yaml:
version: ${ITEM_VERSION.int}
speed: ${ITEM_SPEED.float}
Running yp-data -D ITEM_VERSION=4 -D ITEM_SPEED=3.14 main.yaml will give:
version: 4
speed: 3.14
Note: The processor casts integers and floats using Python’s built-in int() and float() functions. The exact behaviour may change with the version of Python you are using.
However, a single value can only have a single substitution with a cast:
- ${NUM2.int} # good
- xyz${NUM2.int} # bad
- ${NUM2.int}${NUM3.int} # bad
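The casting rules can be sketched as below. The helper names cast_value and substitute_cast are hypothetical, and raising ValueError for the "bad" cases is an assumption of this sketch; the text above does not say how the processor reports them.

```python
import re

_TRUE = {"yes", "true", "1"}
_FALSE = {"no", "false", "0"}
# A single cast reference such as ${NAME.int}; name pattern is an assumption.
_CAST = re.compile(r"\$\{(?P<name>[A-Za-z_]\w*)\.(?P<kind>int|float|bool)\}")


def cast_value(value, kind):
    """Cast a string to int, float or bool following the rules above."""
    if kind == "int":
        return int(value)
    if kind == "float":
        return float(value)
    if kind == "bool":
        lowered = value.lower()
        if lowered in _TRUE:
            return True
        if lowered in _FALSE:
            return False
        raise ValueError(f"not a supported boolean string: {value!r}")
    raise ValueError(f"unknown cast: {kind}")


def substitute_cast(text, variable_map):
    """Cast only when the whole value is one cast reference."""
    match = _CAST.fullmatch(text)
    if match is None:
        if _CAST.search(text):
            # Assumed error for casts mixed with other text or casts.
            raise ValueError(f"cast must be the whole value: {text!r}")
        return text
    return cast_value(variable_map[match.group("name")], match.group("kind"))


print(substitute_cast("${ITEM_VERSION.int}", {"ITEM_VERSION": "4"}))
# 4
```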
Turn Off Processing
If you need to turn off processing of INCLUDE syntax, you can do:
On the command line, use the --no-process-include option.
In Python, set the is_process_include attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance to False.
If you need to turn off processing of variable and date-time substitution, you can do:
On the command line, use the --no-process-variable option.
In Python, set the is_process_variable attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance to False.
Validation with JSON Schema
You can tell the processor to look for a JSON schema file and validate the current YAML file by adding a schema association line to the beginning of the YAML file, which can be one of:
#!<SCHEMA-URI>
# yaml-language-server: $schema=<SCHEMA-URI>
Where the SCHEMA-URI is a string pointing to the location of a JSON schema file. Some simple assumptions apply:
If SCHEMA-URI is a normal URI with a leading scheme, e.g., https://, it is used as-is.
If SCHEMA-URI does not have a leading scheme and exists in the local file system, then it is also used as-is.
Otherwise, a schema URI prefix can be specified to add to the value of SCHEMA-URI using:
The YP_SCHEMA_PREFIX environment variable.
On the command line, the --schema-prefix=PREFIX option.
In Python, the schema_prefix attribute of the relevant yamlprocessor.dataprocess.DataProcessor instance.
For example, if we have export YP_SCHEMA_PREFIX=file:///etc/ in the environment, both of the following examples will result in a validation against the JSON schema in file:///etc/world/hello.schema.json.
#!file:///etc/world/hello.schema.json
greet: earth
# ...
#!world/hello.schema.json
greet: earth
# ...
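The three resolution rules for SCHEMA-URI can be written as a short function. This is a sketch of the rules as described above, with a hypothetical function name resolve_schema_uri; scheme detection via urllib.parse is an assumption and is not necessarily how the processor decides.

```python
import os
from urllib.parse import urlsplit


def resolve_schema_uri(schema_uri, schema_prefix=""):
    """Return the location that a schema association line points to."""
    if urlsplit(schema_uri).scheme:
        return schema_uri  # leading scheme: used as-is
    if os.path.exists(schema_uri):
        return schema_uri  # existing local file: used as-is
    return schema_prefix + schema_uri  # otherwise add the prefix


print(resolve_schema_uri("file:///etc/world/hello.schema.json", "file:///etc/"))
# file:///etc/world/hello.schema.json
```

With the prefix file:///etc/, a relative value such as world/hello.schema.json resolves to the same location, provided no file of that name exists relative to the current working directory.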