PyYAML "include file" and YAML aliases (anchors/references)

Crucial for the handling of anchors and aliases in PyYAML is the dict anchors that is part of the Composer. It maps anchors to nodes so that aliases can be looked up. Its existence is limited by the existence of the Composer, which is a composite element of the Loader that you use.

That Loader instance only exists for the duration of the call to yaml.load(), so there is no trivial way to extract the anchors afterwards: first you would have to make the Loader instance persist, and then make sure that the normal compose_document() method is not called (among other things it does self.anchors = {}, to be clean for the next document in a single stream).
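The reset is easy to observe with PyYAML. The following is a minimal sketch, not part of the original code: PeekLoader and its seen_anchors attribute are hypothetical names, and the overridden method mirrors what the stock compose_document() does, with one extra line that snapshots the dict just before it is cleared.

```python
import yaml


class PeekLoader(yaml.SafeLoader):
    """SafeLoader that remembers the anchors of the last composed document."""

    def compose_document(self):
        self.get_event()                        # consume DocumentStartEvent
        node = self.compose_node(None, None)    # fills self.anchors
        self.get_event()                        # consume DocumentEndEvent
        self.seen_anchors = dict(self.anchors)  # snapshot (hypothetical attribute)
        self.anchors = {}                       # the reset the stock method performs
        return node


loader = PeekLoader("base: &b {k: 1}\ncopy: *b\n")
try:
    data = loader.get_single_data()
finally:
    loader.dispose()

print(sorted(loader.seen_anchors))  # anchor names seen while composing
print(loader.anchors)               # what is left once the document is done
```

The snapshot holds the anchor name mapped to its node, while loader.anchors itself is empty again by the time get_single_data() returns.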

To complicate things further, if you have warehouse.yaml:

warehouse:
  obj1: &obj1
    key1: 1
    key2: 2

and specific.yaml:

warehouse: !include warehouse.yaml
specific:
  spec1:
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10

you would never get this to work with your snippet, even if you could preserve, extract and pass on the anchor information, because the whole document is composed into nodes before any constructor runs: the composer handling specific.yaml will encounter an undefined alias long before the tag !include gets used for construction (and for filling anchors).
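This ordering can be checked with a small experiment. The sketch below assumes PyYAML; DemoLoader and fake_include are hypothetical names, and the constructor only records that it was called instead of opening a file.

```python
import yaml

calls = []  # records every invocation of the !include constructor


class DemoLoader(yaml.SafeLoader):
    """Private subclass so the global SafeLoader is left untouched."""


def fake_include(loader, node):
    # Stand-in for a real include: note the call and return an empty mapping.
    calls.append(node.value)
    return {}


DemoLoader.add_constructor("!include", fake_include)

document = """\
warehouse: !include warehouse.yaml
specific:
  spec1:
    <<: *obj1
"""

problem = None
try:
    yaml.load(document, Loader=DemoLoader)
except yaml.MarkedYAMLError as exc:
    problem = exc.problem   # message raised while composing

print(problem)
print(calls)
```

The load fails on the undefined alias during composition, and calls stays empty, so nothing done inside the !include constructor can supply the anchor in time.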

What you can do to circumvent this problem is to include specific.yaml

specific:
  spec1:
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10

from warehouse.yaml:

warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
specific: !include specific.yaml

Alternatively, include both in a third file. Please note that the key specific is in both files.

With those two files run:

import sys
from ruamel import yaml

def my_compose_document(self):
    self.get_event()
    node = self.compose_node(None, None)
    self.get_event()
    # self.anchors = {}    # <<<< commented out
    return node

yaml.SafeLoader.compose_document = my_compose_document

# adapted from http://code.activestate.com/recipes/577613-yaml-include-support/
def yaml_include(loader, node):
    with open(node.value) as inputfile:
        return list(my_safe_load(inputfile, master=loader).values())[0]
#              leave out the [0] if your include file drops the key ^^^

yaml.add_constructor("!include", yaml_include, Loader=yaml.SafeLoader)


def my_safe_load(stream, Loader=yaml.SafeLoader, master=None):
    loader = Loader(stream)
    if master is not None:
        loader.anchors = master.anchors
    try:
        return loader.get_single_data()
    finally:
        loader.dispose()

with open('warehouse.yaml') as fp:
    data = my_safe_load(fp)
yaml.safe_dump(data, sys.stdout, default_flow_style=False)

which gives:

specific:
  spec1:
    key1: 1
    key2: 2
  spec2:
    key1: 10
    key2: 2
warehouse:
  obj1:
    key1: 1
    key2: 2

If your specific.yaml does not have the top-level key specific:

spec1:
  <<: *obj1
spec2:
  <<: *obj1
  key1: 10

then replace the last line of yaml_include() with:

return my_safe_load(inputfile, master=loader)

The above was done with ruamel.yaml (disclaimer: I am the author of that package) and tested on Python 2.7 and 3.6. By changing the import it will work with PyYAML as well.
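For PyYAML, the same idea can be written without patching the global SafeLoader. This is a sketch under assumptions rather than the tested code above: IncludeLoader and load_sharing_anchors are hypothetical names, the two files are written to a temporary directory, and the !include value is an absolute path so the example does not depend on the current working directory. It uses the variant of specific.yaml without the top-level key.

```python
import os
import tempfile

import yaml


class IncludeLoader(yaml.SafeLoader):
    """SafeLoader whose anchors survive the end of a document."""

    def compose_document(self):
        self.get_event()                      # consume DocumentStartEvent
        node = self.compose_node(None, None)
        self.get_event()                      # consume DocumentEndEvent
        # deliberately no `self.anchors = {}` here
        return node


def load_sharing_anchors(stream, master=None):
    loader = IncludeLoader(stream)
    if master is not None:
        loader.anchors = master.anchors       # share the same dict, no copy
    try:
        return loader.get_single_data()
    finally:
        loader.dispose()


def yaml_include(loader, node):
    # node.value is the path written after the !include tag
    with open(node.value) as inputfile:
        return load_sharing_anchors(inputfile, master=loader)


IncludeLoader.add_constructor("!include", yaml_include)

with tempfile.TemporaryDirectory() as tmp:
    specific = os.path.join(tmp, "specific.yaml")
    warehouse = os.path.join(tmp, "warehouse.yaml")
    with open(specific, "w") as fp:
        fp.write("spec1:\n  <<: *obj1\nspec2:\n  <<: *obj1\n  key1: 10\n")
    with open(warehouse, "w") as fp:
        fp.write(
            "warehouse:\n  obj1: &obj1\n    key1: 1\n    key2: 2\n"
            "specific: !include '%s'\n" % specific
        )
    with open(warehouse) as fp:
        data = load_sharing_anchors(fp)

print(yaml.safe_dump(data, default_flow_style=False))
```

Registering the constructor on a subclass keeps the changed compose_document() away from other code in the same process that uses yaml.SafeLoader directly.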


With the new ruamel.yaml API the above can be simplified considerably, because the loader handed to the yaml_include() constructor knows about the YAML instance, but of course you still need an adapted compose_document that doesn't destroy anchors. Assuming the specific.yaml without the top-level key specific, the following gives the same output as before.

import sys
from ruamel.std.pathlib import Path
from ruamel.yaml import YAML

yaml = YAML(typ='safe', pure=True)
yaml.default_flow_style = False


def my_compose_document(self):
    self.parser.get_event()
    node = self.compose_node(None, None)
    self.parser.get_event()
    # self.anchors = {}    # <<<< commented out
    return node

yaml.Composer.compose_document = my_compose_document

# adapted from http://code.activestate.com/recipes/577613-yaml-include-support/
def yaml_include(loader, node):
    y = loader.loader
    yaml = YAML(typ=y.typ, pure=y.pure)  # same values as including YAML
    yaml.composer.anchors = loader.composer.anchors
    return yaml.load(Path(node.value))

yaml.Constructor.add_constructor("!include", yaml_include)

data = yaml.load(Path('warehouse.yaml'))
yaml.dump(data, sys.stdout)

It seems that someone has now solved this problem as an extension of ruamel.yaml.

pip install ruamel.yaml.include (source on GitHub)

To get the desired output above:

warehouse.yml

obj1: &obj1
  key1: 1
  key2: 2

specific.yml

specific:
  spec1: 
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10

Your code would be:

from ccorp.ruamel.yaml.include import YAML

yaml = YAML(typ='safe', pure=True)
yaml.allow_duplicate_keys = True

with open('specific.yml', 'r') as ymlfile:
    data = yaml.load(ymlfile)

It also includes a handy !exclude function if you do not want the warehouse key in your output. If you only want the specific key, your specific.yml could begin with:

!exclude includes:
- !include warehouse.yml

In that case, your warehouse.yml could also include the top-level warehouse: key.