Zinc
Overview
Zinc stands for "Zinc Is Not CSV". Zinc is a plaintext syntax for serializing Haystack grids using an enhanced CSV format. Unlike CSV, Zinc supports typed scalar values (such as Bool, Int, Float, Str, Date, etc.) and arbitrary meta-data at the grid and column level. Compared with JSON, Zinc gives a much more compact encoding of tabular data.
Zinc is represented by the def ZincFile.
Literals
The basic syntax of Zinc uses a custom literal syntax for each type:
- Null: N
- Marker: M
- Remove: R
- NA: NA
- Bool: T or F (for true, false)
- Number: 1, -34, 10_000, 5.4e-45, 9.23kg, 74.2°F, 4min, INF, -INF, NaN
- Str: "hello", "foo\nbar\" (uses all standard escape chars as C like languages)
- Uri: `http://project-haystack.com/`
- Ref: @17eb0f3a-ad607713, @xyz "Display Name"
- Symbol: ^hot-water
- Date: 2010-03-13 (YYYY-MM-DD)
- Time: 08:12:05 (hh:mm:ss.FFF)
- DateTime: 2010-03-11T23:55:00-05:00 New_York or 2009-11-09T15:39:00Z
- Coord: C(37.55,-77.45)
- XStr: Type("value")
- List: [1, 2, 3]
- Dict: {dis:"Building" site area:35000ft²}
- Grid: <<ver:"3.0" ... >>
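As an illustration of the literal syntax, here is a minimal Python sketch that encodes a few scalar types. The function name encode_scalar and the mapping from Python types to Haystack types are assumptions made for this example; units, Ref, Uri, Symbol, Coord, XStr, and DateTime are deliberately left out.

```python
import datetime as dt
import math


def encode_scalar(val):
    """Return the Zinc literal for a small subset of scalar values.

    Hypothetical helper: covers None, bool, int/float (no units),
    str, date, and naive time only.
    """
    if val is None:
        return "N"
    # bool must be tested before int because bool is a subclass of int
    if isinstance(val, bool):
        return "T" if val else "F"
    if isinstance(val, (int, float)):
        if isinstance(val, float):
            if math.isnan(val):
                return "NaN"
            if math.isinf(val):
                return "INF" if val > 0 else "-INF"
        return repr(val)
    if isinstance(val, str):
        body = (val.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r")
                   .replace("\t", "\\t"))
        return '"' + body + '"'
    # datetime is a subclass of date, so reject it explicitly
    if isinstance(val, dt.datetime):
        raise TypeError("DateTime is not covered by this sketch")
    if isinstance(val, (dt.date, dt.time)):
        return val.isoformat()
    raise TypeError(f"unsupported type: {type(val).__name__}")
```

For example, encode_scalar(True) gives T and encode_scalar(dt.date(2010, 3, 13)) gives 2010-03-13.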
Syntax
Every grid has one line of meta-data applied to the entire grid, followed by one line of column definitions, then zero or more lines of rows. Each line is separated by a "\n" newline character.
The meta-data line must always begin with a ver tag and a value of "3.0".
Let's look at a simple example:
ver:"3.0"
firstName,bday
"Jack",1973-07-23
"Jill",1975-11-15
Note the first line defines the grid meta-data, which is just the version
tag. The second line defines two columns named firstName and bday. There
are two data rows each with a Str value for firstName and a Date value for
bday. Every row must define a cell value for each column.
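The line structure described above can be sketched in a few lines of Python. The helper write_grid is hypothetical; it assumes the cells have already been encoded as Zinc literals and only assembles the version line, the column line, and the rows.

```python
def write_grid(col_names, rows):
    """Assemble Zinc text from column names and pre-encoded cell literals.

    Hypothetical helper: grid and column meta-data beyond the ver tag
    are not handled here.
    """
    lines = ['ver:"3.0"', ",".join(col_names)]
    for row in rows:
        # Every row must define a cell for each column
        if len(row) != len(col_names):
            raise ValueError("row must have one cell per column")
        lines.append(",".join(row))
    # Each line, including the last, ends with a newline
    return "\n".join(lines) + "\n"


text = write_grid(
    ["firstName", "bday"],
    [['"Jack"', "1973-07-23"], ['"Jill"', "1975-11-15"]],
)
```

Here text reproduces the four-line example grid shown above.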
Metadata may be specified on the grid itself or on each column as a set of name/value tags. Each tag is written as "name:val"; if the value is omitted, the tag is a marker tag. Tags are separated by a space. Here is an example:
ver:"3.0" database:"test" dis:"Site Energy Summary"
siteName dis:"Sites", val dis:"Value" unit:"kW"
"Site 1", 356.214kW
"Site 2", 463.028kW
It is common to have sparse tables where rows have a null value for a given
column. This is indicated either by using the N literal or by omitting
the cell entirely. For example these two rows are semantically identical:
"a",N,2,N,N,"z"
"a",,2,,,"z"
If there is only one column, then a null row must be represented with the N
character.
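A writer can apply both rules with a small helper. This is a sketch only; encode_row is a hypothetical name, and it assumes non-null cells arrive as already-encoded Zinc literals with None standing in for null.

```python
def encode_row(cells):
    """Join cell literals into one row, omitting null cells.

    cells is a list of Zinc literal strings, with None for null.
    """
    # A single-column null row cannot be an empty line, so write N
    if len(cells) == 1 and cells[0] is None:
        return "N"
    return ",".join("" if cell is None else cell for cell in cells)


sparse = encode_row(['"a"', None, "2", None, None, '"z"'])
single = encode_row([None])
```

With these inputs, sparse is the second of the two rows shown above and single is N.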
Nested lists, dicts, or grids may be used for any meta-data value or cell:
ver:"3.0"
type,val
"list",[1,2,3]
"dict",{dis:"Dict!" foo}
"grid",<<
ver:"3.0"
a,b
1,2
3,4
>>
"scalar","simple string"
Nested dicts are optionally allowed to use a comma between name value pairs. However, commas are not allowed for grid and column meta-data.
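The difference between nested dicts and meta-data tags can be sketched as follows. All names here (MARKER, encode_val, encode_tags) are hypothetical, and only ints, strings, lists, and dicts are handled. The comma flag is meant for nested dicts only; grid and column meta-data should always use the default space separator.

```python
MARKER = object()  # hypothetical sentinel standing in for the Marker value


def encode_tags(tags, comma=False):
    """Encode name/value tags; a MARKER value is written as the bare name."""
    parts = [
        name if val is MARKER else f"{name}:{encode_val(val)}"
        for name, val in tags.items()
    ]
    # Commas are optional in nested dicts, never used for grid/column meta
    return (", " if comma else " ").join(parts)


def encode_val(val):
    """Recursively encode a small subset of values."""
    if val is MARKER:
        return "M"
    if isinstance(val, int) and not isinstance(val, bool):
        return str(val)
    if isinstance(val, str):
        body = val.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        return '"' + body + '"'
    if isinstance(val, list):
        return "[" + ",".join(encode_val(v) for v in val) + "]"
    if isinstance(val, dict):
        return "{" + encode_tags(val) + "}"
    raise TypeError(f"unsupported type: {type(val).__name__}")
```

For instance, encode_val({"dis": "Dict!", "foo": MARKER}) yields the dict cell used in the example above.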
Grammar
Grammar legend:
:= is defined as
<x> non-terminal
"x" literal
[x] optional
(x) grouping
x+ one or more times
x* zero or more times
x|x or
The formal grammar for Zinc:
<grid> := <gridMeta> <cols> <row>*
<gridMeta> := <ver> <tagsNoComma> <nl>
<ver> := "ver:" <str> // must be "3.0"
<tagsNoComma> := <tag>* // separated by one space (0x20)
<tagsCommaOk> := (<tag> [","])* // trailing comma allowed/optional
<tag> := <tagMarker> | <tagPair>
<tagMarker> := <id> // val is assumed to be Marker
<tagPair> := <id> ":" <val>
<cols> := <col> ("," <col>)* <nl>
<col> := <id> <tagsNoComma>
<row> := <cell> ("," <cell>)* <nl>
<cell> := <val> // empty cell is same as null
<val> := <scalar> | <list> | <dict> | <grid>
<list> := "[" (<val> ",")* "]" // trailing comma allowed/optional
<dict> := "{" <tagsCommaOk> "}"
<grid> := "<<" <grid> ">>"
Zinc tokens:
<id> := <alphaLo> (<alphaLo> | <alphaHi> | <digit> | '_')*
<scalar> := <null> | <marker> | <remove> | <na> | <bool> | <ref> | <symbol> | <str> |
<uri> | <number> | <date> | <time> | <dateTime> | <coord> | <xstr>
<null> := "N"
<marker> := "M"
<remove> := "R"
<na> := "NA"
<bool> := "T" | "F"
<symbol> := "^" <refChar>+
<ref> := "@" <refChar>+ [ " " <str> ]
<refChar> := <alpha> | <digit> | "_" | ":" | "-" | "." | "~"
<str> := """ <strChar>* """
<uri> := "`" <uriChar>* "`"
<strChar> := <unicodeChar> | <strEscChar>
<uriChar> := <unicodeChar> | <uriEscChar>
<unicodeChar> := any 16-bit Unicode char >= 0x20 (except str/uri quote)
<strEscChar> := "\b" | "\f" | "\n" | "\r" | "\t" | "\"" | "\\" | "\$" | <uEscChar>
<uriEscChar> := "\:" | "\/" | "\?" | "\#" | "\[" | "\]" | "\@" | "\`" | "\\" | "\&" | "\=" | "\;" | <uEscChar>
<uEscChar> := "\u" <hexDigit> <hexDigit> <hexDigit> <hexDigit>
<xstr> := <xstrType> "(" <str> ")"
<xstrType> := <alphaHi> (<alphaLo> | <alphaHi> | <digit> | '_')*
<number> := <decimal> | "INF" | "-INF" | "NaN"
<decimal> := ["-"] <digits> ["." <digits>] [<exp>] [<unit>]
<exp> := ("e"|"E") ["+"|"-"] <digits>
<unit> := <unitChar>*
<unitChar> := <alpha> | "%" | "_" | "/" | "$" | any char > 128 // see Units
<date> := YYYY-MM-DD
<time> := hh:mm:ss.FFFFFFFFF
<dateTime> := YYYY-MM-DD'T'hh:mm:ss.FFFFFFFFFz zzzz
<coord> := "C(" <coordDeg> "," <coordDeg> ")"
<coordDeg> := ["-"] <digits> ["." <digits>]
<alphaLo> := ('a' - 'z')
<alphaHi> := ('A' - 'Z')
<alpha> := <alphaLo> | <alphaHi>
<digit> := ('0' - '9')
<digits> := <digit> (<digit> | "_")*
<hexDigit> := ('a'-'f') | ('A'-'F') | <digit>
The space character 0x20 is allowed between tokens.
Notes
The following are notes for implementers:
Identifiers vs Keywords
Identifiers must start with a lower case letter. Keywords begin with an upper case letter: "N", "T", "F", "M", "NA", "INF", "NaN", etc.
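A tokenizer can use this rule to tell the two apart after reading a run of word characters. The sketch below is an assumption-laden illustration: classify_word and KEYWORDS are hypothetical names, and the keyword set lists only the literals defined in the grammar (XStr type names and the C of a Coord also start with an upper case letter and fall into "other" here).

```python
import re

# <id> from the grammar: lower case letter, then letters, digits, or '_'
ID_RE = re.compile(r"[a-z][a-zA-Z0-9_]*")

# Keyword literals taken from the grammar; "-INF" starts with '-' and is
# handled by number parsing instead
KEYWORDS = {"N", "M", "R", "NA", "T", "F", "INF", "NaN"}


def classify_word(word):
    """Return "id", "keyword", or "other" for a run of word characters."""
    if ID_RE.fullmatch(word):
        return "id"
    if word in KEYWORDS:
        return "keyword"
    return "other"
```

For example, classify_word("firstName") returns "id" and classify_word("NA") returns "keyword".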
URIs
Escape chars in URIs are used to remove special meaning for reserved
characters. For example if a filename contains the # character, then
it must be escaped so that the # is not treated as a fragment identifier:
`file \#2`
Parsers should be prepared to encounter and preserve the backslash in these cases.
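One way to do this is to copy each backslash and the character after it into the result unchanged while scanning for the closing backtick. The sketch below assumes that approach; read_uri_literal is a hypothetical helper and does not decode \uXXXX escapes.

```python
def read_uri_literal(text, pos=0):
    """Read a backtick-quoted Uri starting at text[pos].

    Returns (body, next_pos), where body keeps every backslash escape
    exactly as written and next_pos is the index after the closing quote.
    """
    if text[pos:pos + 1] != "`":
        raise ValueError("Uri literal must start with a backtick")
    chars = []
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            if i + 1 >= len(text):
                raise ValueError("dangling backslash in Uri literal")
            # Preserve the backslash together with the escaped char
            chars.append(text[i:i + 2])
            i += 2
        elif ch == "`":
            return "".join(chars), i + 1
        else:
            chars.append(ch)
            i += 1
    raise ValueError("unterminated Uri literal")


body, end = read_uri_literal("`file \\#2`")
```

Here body still contains the backslash before the # character.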
Number Tokens
When parsing, a token with a leading digit may be a number, date, time, or datetime. You can use the following technique to consume these scalars:
- consume all the various chars into a string
- if dashes and no colons, must be date
- if colons and no dashes, must be time
- if colons and dashes, must be dateTime; check for Z or timezone
- otherwise, must be number with optional unit
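The technique above can be sketched in Python once the token has been consumed into a string. The function name classify_digit_token is hypothetical. One detail is an assumption added for this example: a dash directly after an exponent marker (as in 5.4e-45) is ignored, so that such a token is still treated as a number.

```python
import re


def classify_digit_token(token):
    """Classify a consumed token that starts with a digit.

    Returns "date", "time", "dateTime", or "number".
    """
    has_colon = ":" in token
    # Assumption: a '-' between a digit+e and a digit belongs to an
    # exponent and does not count as a date dash
    without_exp = re.sub(r"(?<=\d)[eE]-(?=\d)", "e", token)
    has_dash = "-" in without_exp
    if has_dash and not has_colon:
        return "date"
    if has_colon and not has_dash:
        return "time"
    if has_colon and has_dash:
        return "dateTime"
    return "number"
```

The result only selects which scalar parser to run next; that parser still has to validate the token.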
DateTime
DateTime scalars are encoded using both offset and the timezone name:
2010-11-28T07:23:02.773-08:00 Los_Angeles // negative offset and timezone
2010-11-28T23:19:29.741+08:00 Taipei // positive offset and timezone
2010-11-28T18:21:58+03:00 GMT-3 // timezone may include '-'
2010-11-28T12:22:27-03:00 GMT+3 // timezone may include '+'
2010-01-08T05:00:00Z UTC // UTC example
2010-01-08T05:00:00Z // UTC may omit timezone name
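The following Python sketch shows one way to split such a scalar into a timestamp and a timezone name. The names DATETIME_RE and parse_datetime are hypothetical, fractional seconds are truncated to microseconds because that is the resolution of Python's datetime, and the timezone name is returned as a plain string rather than being resolved against a timezone database.

```python
import re
from datetime import datetime, timedelta, timezone

DATETIME_RE = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})"
    r"(?:\.(\d{1,9}))?"          # optional fractional seconds
    r"(Z|[+-]\d{2}:\d{2})"       # Z or numeric offset
    r"(?: (\S+))?"               # optional timezone name
)


def parse_datetime(text):
    """Return (datetime with fixed offset, timezone name or None)."""
    m = DATETIME_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a Zinc DateTime: {text!r}")
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    frac = m.group(7) or ""
    micros = int(frac.ljust(6, "0")[:6])  # pad or truncate to 6 digits
    off = m.group(8)
    if off == "Z":
        offset = timedelta(0)
    else:
        sign = -1 if off[0] == "-" else 1
        offset = sign * timedelta(hours=int(off[1:3]), minutes=int(off[4:6]))
    # A bare Z without a name is treated as UTC
    tz_name = m.group(9) or ("UTC" if off == "Z" else None)
    stamp = datetime(year, month, day, hour, minute, second, micros,
                     tzinfo=timezone(offset))
    return stamp, tz_name


stamp, tz_name = parse_datetime("2010-11-28T07:23:02.773-08:00 Los_Angeles")
```

Note that the sign in a name such as GMT-3 is part of the timezone name and is independent of the numeric offset that precedes it.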
Version History
Zinc 1.0
- initial version
- Bin format: Bin mime:"text/plain"
Zinc 2.0
- change hex RecId syntax to @ Ref syntax
- remove support for cell display strings and metadata
- remove support for column display strings (use dis metadata tag)
- update Bin format: Bin(text/plain)
Zinc 3.0
- add nested lists, dicts, grids
- add NA
- add XStr
- remove Bin format in favor of XStr syntax
Zinc 3.0 Haystack 4 features
- Version remains the same "3.0"
- Symbol literals
- Allow commas in nested dict literals