Normalization
Overview
Normalization is the process of compiling libs to a namespace. Remember that every def is modeled as a dict (just a normal collection of name/value pairs). The defs packaged inside libs are called declared defs. Normalization will modify the tags of each def yielding a new dict which we call the effective def. Effective defs are used to compute documentation, reflection, def aware filters, etc.
Note: normalization is only required for software that wishes to build a namespace from source libs. Pre-normalized defs can be downloaded from Project-Haystack. These downloads may also be used as test cases to verify your own normalization software.
Pipeline
The process to compile libs to a namespace is composed of the following ordered steps:
- Scan: traverse input libs to find trio files
- Parse: parse each trio file to discover def and defx dicts
- Resolve: ensure every symbol tag maps to a def
- Taxonify: compute taxonomy tree of supertype/subtypes
- Defx: add defx tags to each declared def
- Normalize Tags: normalize def tags
- Inherit: recursively apply supertype tags into subtypes
- Validation: perform additional sanity checks
- Generate: output Namespace
Each of these steps is discussed in detail in the following sections.
Scan
The input to normalization is a list of lib files. These are the zip files that contain the declared defs. Each zip file is opened and searched for applicable Trio files in the "lib/" directory. There must be exactly one dict defined in "lib/lib.trio" with the lib's meta.
Parse
Every input Trio file found during scan is parsed to discover the declared def dicts identified with the def
tag. This phase is also used to discover our def extensions which are identified by the defx
tag. Every dict parsed must have either the def
or defx
tag and the value must be a symbol. During this phase, we detect and report illegal duplicate symbols - a given symbol must be mapped to only one def in all the input libs.
Resolve
During this phase, we walk every tag in every def and defx to resolve symbolic references. For each def, the following must resolve in its lib namespace:
- Every tag name must resolve to a tag def
- Every symbol used as a tag value must resolve
- Every list of symbols in a tag value must individually resolve
We compute each lib namespace by computing only the symbols in the lib's scope. These are the symbols defined by the lib itself and those imported via its depends
. Dependencies are not transitive - a given depend does not imply including the referenced lib's dependencies.
Example:
// lib alpha def: ^lib:alpha depends: [^lib:ph] -- def: ^alphaTag is: ^marker // lib beta def: ^lib:beta depends: [^lib:ph, ^lib:alpha] -- def: ^betaTag is: ^marker // lib gamma def: ^lib:beta depends: [^lib:ph, ^lib:beta]
The lib namespace to use for resolution would be as follows:
alpha: all symbols in ph, alphaTag (in its own lib) beta: all symbols in ph, betaTag (in its own lib), alphaTag (from include) gamma: all symbols in ph, betaTag (from include)
Note that gamma does not have visibility to alphaTag because alpha is not explicitly included.
Resolution is performed on def and defx dicts separately. This allows defx dicts to reference symbols that aren't in scope of the source def.
Taxonify
All of the following steps require knowing the taxonomy tree to determine each tag's supertypes. This phase should walk through each def and recusively derive its supertypes based on each def's is
tag. Specifically:
- compute the supertype/subtype tree based on the
is
tag - infer the
is
on feature key defs - verify every non-feature key has
is
tag with exception of following root tags:marker
,val
, andfeature
Example:
// before def: ^filetype:json mime: "application/json" // after def: ^filetype:json mime: "application/json" is: [^filetype]
Defx
After the resolve and taxonify steps complete, we can apply our defx extensions to their respective defs. Each defx is used to add one or more tags to a source def identified by the defx
tag. The defx mechanism provides for late binding of def metadata. Just like RDF allows anyone to add a triple to a given subject, anyone can add tags to a def using a defx.
Every defx must reference a def within its scope and must only add new tags. It is illegal for a defx to specify a tag declared by the def itself or by another defx. The exception to this rule is tags annotated as accumulate
which should be aggregated into a list.
Here is example to illustrate:
// source def def: ^date is: ^scalar // two defx dicts declared in separate libs defx: ^date acmeTerm: "ISODate" -- defx: ^date wombatFormatter: "DateFormatter" // after defx are merged, effective def is def: ^date is: ^scalar acmeTerm: "ISODate" wombatFormatter: "DateFormatter"
Normalize Tags
This step normalizes specific tags according to the following rules:
- Add the inferrred
lib
tag to each def - Normalize def tags which subtype from list
Every def has an inferred lib
tag that is a reference to its declaring lib. It is invalid for any def to declare its own lib
tag. The lib meta itself also receives a lib
tag referencing itself. Example:
// before (within phIoT lib) def: ^equip is: ^entity // after def: ^equip is: ^entity lib: ^lib:phIoT
As a convenience, tags such as is
and tagOn
can use a single symbol instead of list of symbols. However, their normalized representation must always be a list. This phase should iterate every def and normalize any tag that subtypes from list
. Example:
// before def: ^geoCity is: ^str tagOn: ^geoPlace // after def: ^geoCity is: [^str] tagOn: [^geoPlace]
Inherit
The inherit phase applies tag inheritance from supertypes:
- Start with the def tags from previous steps
- Each supertype is processed in order of declaration in the
is
tag - The supertype must recursively have its own inheritance normalized
- Filter out any tags from the supertype marked as
notInherited
- Inherit all tags from previous step that are not yet declared nor inherited from other supertypes.
- If the tag is marked as
accumulate
then inheritance is aggregated.
In the case of ambiguity via multiple inheritance, a subtype should explicitly declare the tag value.
Here is a fictitious example for an El Camino, which is a hybrid between a car and a pickup truck:
// declarations def: ^car numDoors: 4 color: "red" engine: "V8" ---- def: ^pickup numDoors: 2 color: "blue" bedLength: 80in ---- def: ^elCamino is: [^pickup, ^car] color: "purple"
The normalized definition with inheritance would be:
def: ^elCamino // declared is: [^pickup, ^car] // declared color: "purple" // declared numDoors: 2 // inherited from pickup first engine: "V8" // inherited from car bedLength: 80in // inherited from pickup
Validation
Before generation, normalization software should validate tag rules to flag errors not caught in previous steps:
- Verify every
lib
meta def has required tags - Verify tag values match their def's declared types
- Verify no def named
index
, which is reserved for documentation purposes - Verify every term in a conjunct is a marker
- Verify no defs declare a
computed
tag - Verify choice
of
is subtype of marker - Verify
tagOn
only used on a tag defs (not conjuncts or feature keys) - Verify
relationship
tags are only used on defs that subtype from ref
Generate
Once all previous steps have completed successfully, we can generate our namespace. From a logical perspective, a namespace is a map of symbols to effective def dicts. In actual implementations, this probably yields specific data structures. These data structures can then be used to perform additional computations such as reflection.