Compare commits

4 Commits

Author SHA1 Message Date
4690f0b808
parser: add option for case-insensitive scalar comparison
This does not support unicode case folding, which is very much a
sorry-not-sorry situation because unicode is a disgusting labyrinthine
chaotic hellformat. Actually, our unicode support isn't very good from
the standpoint that we don't do any form of normalization, so
specifying non-ASCII values for scalar comparisons is probably asking
for trouble.
2023-10-18 21:34:07 -07:00
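
A minimal sketch of what ASCII-only case folding means for scalar matching, assuming a recent Zig toolchain. `matchesAny` is a hypothetical helper written for illustration and is not part of the parser; it only relies on `std.mem.eql` and `std.ascii.eqlIgnoreCase`, the two calls that appear in the diff below.

```zig
const std = @import("std");

// Hypothetical helper (not part of the parser): reports whether `scalar`
// equals any of `candidates`, optionally ignoring ASCII case.
fn matchesAny(scalar: []const u8, candidates: []const []const u8, ignore_case: bool) bool {
    for (candidates) |check| {
        const hit = if (ignore_case)
            std.ascii.eqlIgnoreCase(scalar, check)
        else
            std.mem.eql(u8, scalar, check);
        if (hit) return true;
    }
    return false;
}

test "ASCII letters fold, non-ASCII letters do not" {
    const truthy = [_][]const u8{ "true", "yes", "on" };
    try std.testing.expect(matchesAny("TRUE", &truthy, true));
    try std.testing.expect(!matchesAny("TRUE", &truthy, false));

    // "É" and "é" differ at the byte level and are outside ASCII, so the
    // case-insensitive comparison still treats them as different.
    const accented = [_][]const u8{"été"};
    try std.testing.expect(!matchesAny("ÉTÉ", &accented, true));
}
```

The last assertion is the "asking for trouble" case from the commit message: without Unicode case folding or normalization, non-ASCII scalars only match when their bytes are identical.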
1f75ff6b8a
readme: continue reading me
I'm going to have to come up with a license for this code also.
2023-10-18 21:29:58 -07:00
c83558de3e
start adding the readme
2023-10-18 21:15:29 -07:00
4c966ca9d0
parser: reintroduce space strings and change token parsing strategy
Once again I have entangled two conceptually distinct changes into a
single commit because demuxing them from the diff is too much work.
Alas. Let's break it down.

The simpler part of this change is to reintroduce "space strings" with
a slightly fresh coat of paint. We now have 3 different types of
string leaders that can be used together. So we now have:

    | directly concatenates this line with the previous line
    > prepends an LF character before concatenation
    + (NEW) prepends a single space character before concatenation

The `+` leader enables more æsthetic soft line wrapping than `|`
because it doesn't require the use of leading or trailing
whitespace to separate words, as long as lines are broken at word
boundaries. Perhaps this is not as common a usecase as I am making it,
but I do like to hard wrap paragraphs in documents, so if anything,
it's a feature for me.
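
The three leaders can be sketched as a small join routine. This is an illustration under assumptions, not the parser's code: `joinStringLines` is a hypothetical helper that writes into a caller-supplied buffer, and it assumes every input line already looks like `| text`, `> text`, or `+ text` (or a bare leader for an empty line). As in the parser diff, the separator is only emitted in front of continuation lines, never before the first line.

```zig
const std = @import("std");

// Hypothetical helper: joins string lines according to their leaders,
// writing the result into `out` and returning the written slice.
fn joinStringLines(lines: []const []const u8, out: []u8) error{ NoSpaceLeft, BadLeader }![]const u8 {
    var len: usize = 0;
    for (lines, 0..) |line, idx| {
        if (line.len == 0) return error.BadLeader;
        // The character (if any) placed between the previous line and this one.
        const sep: ?u8 = switch (line[0]) {
            '|' => null, // direct concatenation
            '>' => '\n', // line break
            '+' => ' ', // single space
            else => return error.BadLeader,
        };
        if (idx > 0) {
            if (sep) |char| {
                if (len == out.len) return error.NoSpaceLeft;
                out[len] = char;
                len += 1;
            }
        }
        // Skip the leader and the single space that follows it.
        const text = line[@min(2, line.len)..];
        if (out.len - len < text.len) return error.NoSpaceLeft;
        @memcpy(out[len..][0..text.len], text);
        len += text.len;
    }
    return out[0..len];
}

test "soft wrapping with the + leader" {
    var buf: [64]u8 = undefined;
    const lines = [_][]const u8{ "| soft", "+ wrapped", "> next", "| line" };
    const joined = try joinStringLines(&lines, &buf);
    try std.testing.expectEqualStrings("soft wrapped\nnextline", joined);
}
```

Note how `+` lets the second line start at a word boundary without carrying its own leading space, which is the soft-wrapping case described above.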

As I was considering what character to use for this leader, I realized
that I wanted to be able to support numeric map keys, a la:

    -1: negative one
    0:  zero
    +1: positive one

But previously this would not parse correctly, as the tokenizer would
find `-` and expect it to be followed by a space to indicate a list
item (and the additional string leader would cause the same problem
with `+`). I wanted to support this use case, so the parser was
changed to take a second pass on lines starting with the string
leaders (`|`, `+`, and `>`) and the list item leader (`-`) if the
leader has a non-space character following it. Note that this does not
apply to the comment leader (`#` not followed by a space or a newline
is a tokenization error) or to the inline list/map leaders (since those
do not respect internal whitespace, there is no way to treat them
unambiguously).
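
A simplified sketch of the two-pass decision described above. `LineKind` and `classify` are hypothetical names made up for this illustration; the real tokenizer also tracks indentation, fills in diagnostics, and rejects whitespace before the `:` separator, all of which are left out here.

```zig
const std = @import("std");

const LineKind = enum { comment, string, list_item, inline_flow, map_item, scalar };

// Hypothetical, simplified classifier for a single non-empty, unindented line.
fn classify(line: []const u8) error{BadToken}!LineKind {
    std.debug.assert(line.len > 0);
    const followed_by_space = line.len == 1 or line[1] == ' ';
    switch (line[0]) {
        '#' => {
            // The comment leader gets no second pass.
            if (!followed_by_space) return error.BadToken;
            return .comment;
        },
        // Inline leaders do not require a following space, so they are
        // never reinterpreted as the start of a map key.
        '[', '{' => return .inline_flow,
        '|', '>', '+' => if (followed_by_space) return .string,
        '-' => if (followed_by_space) return .list_item,
        else => {},
    }
    // Second pass: the line is a map item if it has a key-value separator,
    // and a scalar otherwise.
    for (line, 0..) |char, idx| {
        if (char != ':') continue;
        if (idx + 1 == line.len or line[idx + 1] == ' ') return .map_item;
        return error.BadToken;
    }
    return .scalar;
}

test "numeric keys take the second pass" {
    try std.testing.expectEqual(LineKind.map_item, try classify("-1: negative one"));
    try std.testing.expectEqual(LineKind.map_item, try classify("+1: positive one"));
    try std.testing.expectEqual(LineKind.list_item, try classify("- item"));
    try std.testing.expectEqual(LineKind.string, try classify("+ text"));
    try std.testing.expectError(error.BadToken, classify("#nope"));
}
```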

To reduce the likelihood of confusing documents, scalars are no longer
allowed to occupy their own line (the exception to this is if the
document consists only of a scalar value). Inline lists and maps can
still occupy their own line, though I am considering changing this as
well to force them to truly be inline. I think this change makes
sense, as scalars are generally intended to represent an unbroken
single item serialization of some non-string value. In other words,

    # these two lines used to parse the same way
    key: 9001
    # but now the following line is a parse error due to the scalar
    # occupying its own line
    key:
        9001
    # also, this still works, but it may be changed to be an error in
    # the future
    key:
        [ 9, 0, 0, 1 ]

Inline maps have also been changed so that their keys can start with the
now-unforbidden string leaders and list item leader characters.
2023-10-18 21:15:29 -07:00
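
The relaxed rule for inline map keys in commit 4c966ca9d0 can be sketched as a single predicate. `inlineKeyStartIsValid` is a hypothetical name used only for this illustration; the character sets are copied from the corresponding hunk of the diff, where the same check is done inline and reports `error.BadToken`.

```zig
const std = @import("std");

// Hypothetical predicate: may an inline map key start at contents[idx]?
fn inlineKeyStartIsValid(contents: []const u8, idx: usize) bool {
    return switch (contents[idx]) {
        // Always forbidden as the first character of a key.
        '{', '[', '#', ',' => false,
        // Forbidden only when followed by a space, because that sequence
        // reads as a string or list item leader.
        '-', '>', '+', '|' => !(idx + 1 < contents.len and contents[idx + 1] == ' '),
        else => true,
    };
}

test "leader characters are allowed unless followed by a space" {
    try std.testing.expect(inlineKeyStartIsValid("-1: negative one", 0));
    try std.testing.expect(!inlineKeyStartIsValid("- a: b", 0));
    try std.testing.expect(!inlineKeyStartIsValid("#key: value", 0));
}
```
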
5 changed files with 250 additions and 125 deletions

67
readme.md Normal file
View File

@ -0,0 +1,67 @@
Have you ever wished someone would walk up to you and say, in a tremendously exaggerated, stereotypical surfer voice, "nice data, dude"? Well, wish no longer because now your data can be Nice by definition, due to our patented Manipulative Marketing Naming technique. Introducing!
# Nice Data: There's no Escape
```nice
# this is an example of some Nice data.
project:
    name: Nice data
    description:
        | A file format for storing structured data. Nice uses syntactic whitespace
        + to represent the data structure. It defines two types of data, scalars and
        + strings, which are used to compose its two data structures, lists and maps.
        >
        > Nice to write, Nice to read.
    inspiration:
        - { name: NestedText, url: https://nestedtext.org }
        - { name: YAML, url: https://yaml.org }
        - A fervent dislike of TOML
    non-goals: [ general-purpose data serialization, world domination ]
    epic freaking funny number lol: 42069580089001421337666
```
Nice Data is a format for storing structured data in a file. It's pleasant to read and adheres to the philosophy that form should match structure. It's heavily inspired by [NestedText], though it also looks similar to [YAML].
## Syntax
For the purposes of illustration, the following syntax examples are accompanied by their corresponding JSON representation. If you are not already familiar with JSON syntax, I would certainly like to know how you got here.
- structured indentation using tabs or spaces
- scalars
- strings
- line, space, and concat strings
- lists
- inline lists
- maps
- inline maps
## Restrictions
Nice documents must be encoded in valid UTF-8. They must use `LF`-only newlines (`CR` characters are forbidden). Tabs and spaces cannot be mixed for indentation. Indentation *must* adhere to a consistent quantum. Nonprinting ASCII characters are forbidden (specifically, any character less than `0x20` (space) except for `0x09` (horizontal tab) and `0x0A` (newline)). Trailing whitespace, including lines consisting only of whitespace, is forbidden, although empty lines are permitted. Some keys and values cannot be represented (for example, map keys cannot start with the character `#`, though map values can).
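The byte-level restrictions can be sketched as a pre-flight check. This is an illustration under assumptions, not the parser's validation: `Violation` and `checkRestrictions` are hypothetical names, and the sketch deliberately ignores the indentation rules (mixed tabs and spaces, consistent quantum) and the key/value restrictions.

```zig
const std = @import("std");

const Violation = error{ InvalidUtf8, ForbiddenCharacter, TrailingWhitespace };

// Hypothetical pre-flight check covering only the encoding, control
// character, and trailing whitespace restrictions.
fn checkRestrictions(doc: []const u8) Violation!void {
    if (!std.unicode.utf8ValidateSlice(doc)) return error.InvalidUtf8;
    var prev: u8 = '\n';
    for (doc) |byte| {
        // Anything below 0x20 other than tab and LF is rejected; this also
        // rejects CR, so CRLF newlines fail here.
        if (byte < 0x20 and byte != '\t' and byte != '\n') return error.ForbiddenCharacter;
        // Whitespace right before a newline is trailing whitespace. Empty
        // lines pass because their previous byte is itself a newline.
        if (byte == '\n' and (prev == ' ' or prev == '\t')) return error.TrailingWhitespace;
        prev = byte;
    }
    // The final line may not end with a newline, so check it separately.
    if (prev == ' ' or prev == '\t') return error.TrailingWhitespace;
}

test "restrictions" {
    try checkRestrictions("key: value\n\nother: value\n");
    try std.testing.expectError(error.TrailingWhitespace, checkRestrictions("key: value \n"));
    try std.testing.expectError(error.ForbiddenCharacter, checkRestrictions("key: value\r\n"));
    try std.testing.expectError(error.InvalidUtf8, checkRestrictions("\xff"));
}
```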
## Philosophy
### Let the Application Interpret Data Types (Bring Your Own Schema)
An arbitrarily structured data format with strict types adds complexity to the parser and cannot possibly cover all necessary types needed for every possible application. For example, numbers in JSON are represented by a sequence of ASCII characters, but they are defined by the format to be restricted to specifying double precision floating point numbers. Of course, it is possible to generate a numeric ASCII sequence that does not fit into a double precision floating point number. If an application needs to represent a 64-bit integer in JSON without producing technically invalid JSON, the value must be serialized as a string, which places the burden of decoding it on the application, since the format cannot represent it as a direct numeric value. The same is true of an RFC 3339 datetime. It's not possible for a format to account for every possible data type that an application may need, so don't bother. Users are encouraged to parse Nice documents directly into well-defined, typed structures.
Nice explicitly differentiates between bare scalars and strings so that `null` may be disambiguated and interpreted differently from `"null"`.
### Fewer Rules over Flexibility
Nice is not, and does not try to be, a general-purpose data serialization format. There are, in fact, many values that simply cannot be represented Nicely. For example, map keys cannot start with a variety of characters, including `#`, `{`, `[`, or whitespace, which is a conscious design choice. In general, Nice is not a format designed with any emphasis placed on ease of programmatic production. While creating software that produces valid Nice data is certainly possible, this reference implementation has no functionality to do so.
### There's No Need to Conquer the World
Nice has no exhaustive specification or formal grammar. The parser is handwritten, and there are pretty much guaranteed to be some strange edge cases that weren't considered when writing it. Standardization is a good thing, generally speaking, but it's not a goal here. Perhaps this is driven by the author's indolence more than deep philosophical zealotry. On the other hand, this paragraph is under the philosophy section.
# The Implementation
# Disclaimer
Yeah, it's entirely possible you hate this and think it's not in fact a nice format. That's fine, but, unfortunately, you forgot to make a time machine and make me name it something else. And yeah, this is probably impossible to search for.
[NestedText]: https://nestedtext.org
[YAML]: https://yaml.org

View File

@ -64,15 +64,30 @@ pub const Options = struct {
coerce_strings: bool = false,
// Only used by the parseTo family of functions.
// Two lists of strings. Truthy strings will be parsed to boolean true. Falsy
// strings will be parsed to boolean false. All other strings will raise an
// error.
boolean_strings: struct { truthy: []const []const u8, falsy: []const []const u8 } = .{
// Two lists of strings. Scalars in a document that match any of the truthy values
// will be parsed to boolean true. Scalars in the document that match any of the
// falsy values will be parsed to boolean false. All other scalar values will raise
// an error if the destination is a boolean type. By default, these comparisons are
// case-sensitive. See the `case_insensitive_scalar_coersion` option to change
// this.
boolean_scalars: struct { truthy: []const []const u8, falsy: []const []const u8 } = .{
.truthy = &.{ "true", "True", "yes", "on" },
.falsy = &.{ "false", "False", "no", "off" },
},
null_strings: []const []const u8 = &.{ "null", "nil", "None" },
// Only used by the parseTo family of functions.
// A list of strings. Scalars in the document that match any of the values listed
// will be parsed to optional `null`. Any other scalar value will be parsed as the
// optional child type if the destination type is an optional. By default, these
// comparisons are case-sensitive. See the `case_insensitive_scalar_coersion`
// option to change this.
null_scalars: []const []const u8 = &.{ "null", "nil", "None" },
// Only used by the parseTo family of functions.
// Perform ASCII-case-insensitive comparisons for scalars (i.e. `TRUE` in a document
// will match `true` in the boolean scalars). Unicode case folding is not currently
// supported.
case_insensitive_scalar_coersion: bool = false,
// Only used by the parseTo family of functions.
// If true, document scalars that appear to be numbers will attempt to convert into

View File

@ -59,7 +59,7 @@ pub const State = struct {
},
},
.value => switch (state.value_stack.getLast().*) {
// remove the final trailing newline or space
// we have an in-progress string, finish it.
.string => |*string| string.* = try state.string_builder.toOwnedSlice(arena_alloc),
// if we have a dangling -, attach an empty scalar to it
.list => |*list| if (state.expect_shift == .indent) try list.append(Value.emptyScalar()),
@ -104,7 +104,7 @@ pub const State = struct {
state.document.root = try Value.fromScalar(arena_alloc, str);
state.mode = .done;
},
.line_string, .concat_string => |str| {
.line_string, .space_string, .concat_string => |str| {
state.document.root = Value.emptyString();
try state.string_builder.appendSlice(arena_alloc, str);
try state.value_stack.append(&state.document.root);
@ -128,7 +128,7 @@ pub const State = struct {
switch (value) {
.empty => state.expect_shift = .indent,
.scalar => |str| try rootlist.append(try Value.fromScalar(arena_alloc, str)),
.line_string, .concat_string => |str| try rootlist.append(try Value.fromString(arena_alloc, str)),
.line_string, .space_string, .concat_string => |str| try rootlist.append(try Value.fromString(arena_alloc, str)),
.inline_list => |str| try rootlist.append(try state.parseFlow(str, .inline_list, dkb)),
.inline_map => |str| try rootlist.append(try state.parseFlow(str, .inline_map, dkb)),
}
@ -146,7 +146,7 @@ pub const State = struct {
state.dangling_key = dupekey;
},
.scalar => |str| try rootmap.put(dupekey, try Value.fromScalar(arena_alloc, str)),
.line_string, .concat_string => |str| try rootmap.put(dupekey, try Value.fromString(arena_alloc, str)),
.line_string, .space_string, .concat_string => |str| try rootmap.put(dupekey, try Value.fromString(arena_alloc, str)),
.inline_list => |str| try rootmap.put(dupekey, try state.parseFlow(str, .inline_list, dkb)),
.inline_map => |str| try rootmap.put(dupekey, try state.parseFlow(str, .inline_map, dkb)),
}
@ -188,9 +188,11 @@ pub const State = struct {
.comment => unreachable,
.in_line => |in_line| switch (in_line) {
.empty => unreachable,
inline .line_string, .concat_string => |str, tag| {
inline .line_string, .space_string, .concat_string => |str, tag| {
if (tag == .line_string)
try state.string_builder.append(arena_alloc, '\n');
if (tag == .space_string)
try state.string_builder.append(arena_alloc, ' ');
try state.string_builder.appendSlice(arena_alloc, str);
},
else => {
@ -249,10 +251,14 @@ pub const State = struct {
state.expect_shift = .dedent;
switch (in_line) {
.empty => unreachable,
.scalar => |str| try list.append(try Value.fromScalar(arena_alloc, str)),
.scalar => {
state.diagnostics.length = 1;
state.diagnostics.message = "the document may not contain a scalar value on its own line";
return error.UnexpectedValue;
},
.inline_list => |str| try list.append(try state.parseFlow(str, .inline_list, dkb)),
.inline_map => |str| try list.append(try state.parseFlow(str, .inline_map, dkb)),
.line_string, .concat_string => |str| {
.line_string, .space_string, .concat_string => |str| {
const new_string = try appendListGetValue(list, Value.emptyString());
try state.string_builder.appendSlice(arena_alloc, str);
try state.value_stack.append(new_string);
@ -266,7 +272,7 @@ pub const State = struct {
switch (value) {
.empty => state.expect_shift = .indent,
.scalar => |str| try list.append(try Value.fromScalar(arena_alloc, str)),
.line_string, .concat_string => |str| try list.append(try Value.fromString(arena_alloc, str)),
.line_string, .space_string, .concat_string => |str| try list.append(try Value.fromString(arena_alloc, str)),
.inline_list => |str| try list.append(try state.parseFlow(str, .inline_list, dkb)),
.inline_map => |str| try list.append(try state.parseFlow(str, .inline_map, dkb)),
}
@ -291,7 +297,7 @@ pub const State = struct {
if (state.expect_shift != .indent or line.shift != .indent) {
state.diagnostics.length = 1;
state.diagnostics.message = "the document contains an invalid map key in a list";
state.diagnostics.message = "the document contains a map item where a list item is expected";
return error.UnexpectedValue;
}
@ -348,12 +354,16 @@ pub const State = struct {
switch (in_line) {
.empty => unreachable,
.scalar => |str| try state.putMap(map, state.dangling_key.?, try Value.fromScalar(arena_alloc, str), dkb),
.scalar => {
state.diagnostics.length = 1;
state.diagnostics.message = "the document may not contain a scalar value on its own line";
return error.UnexpectedValue;
},
.inline_list => |str| try state.putMap(map, state.dangling_key.?, try state.parseFlow(str, .inline_list, dkb), dkb),
.inline_map => |str| {
try state.putMap(map, state.dangling_key.?, try state.parseFlow(str, .inline_map, dkb), dkb);
},
.line_string, .concat_string => |str| {
.line_string, .space_string, .concat_string => |str| {
// string pushes the stack
const new_string = try state.putMapGetValue(map, state.dangling_key.?, Value.emptyString(), dkb);
try state.string_builder.appendSlice(arena_alloc, str);
@ -375,7 +385,7 @@ pub const State = struct {
if (state.expect_shift != .indent or line.shift != .indent or state.dangling_key == null) {
state.diagnostics.length = 1;
state.diagnostics.message = "the document contains an invalid list item in a map";
state.diagnostics.message = "the document contains a list item where a map item is expected";
return error.UnexpectedValue;
}
@ -395,7 +405,7 @@ pub const State = struct {
state.dangling_key = dupekey;
},
.scalar => |str| try state.putMap(map, dupekey, try Value.fromScalar(arena_alloc, str), dkb),
.line_string, .concat_string => |str| try state.putMap(map, dupekey, try Value.fromString(arena_alloc, str), dkb),
.line_string, .space_string, .concat_string => |str| try state.putMap(map, dupekey, try Value.fromString(arena_alloc, str), dkb),
.inline_list => |str| try state.putMap(map, dupekey, try state.parseFlow(str, .inline_list, dkb), dkb),
.inline_map => |str| try state.putMap(map, dupekey, try state.parseFlow(str, .inline_map, dkb), dkb),
}
@ -567,7 +577,12 @@ pub const State = struct {
// forbid these characters so that inline dictionary keys cannot start
// with characters that regular dictionary keys cannot start with
// (even though they're unambiguous in this specific context).
'{', '[', '#', '-', '>', '|', ',' => return {
'{', '[', '#', ',' => return {
state.diagnostics.length = 1;
state.diagnostics.message = "this document contains an inline map key that starts with an invalid character";
return error.BadToken;
},
'-', '>', '+', '|' => if ((idx + 1) < contents.len and contents[idx + 1] == ' ') {
state.diagnostics.length = 1;
state.diagnostics.message = "this document contains an inline map key that starts with an invalid sequence";
return error.BadToken;

View File

@ -66,10 +66,17 @@ pub const Value = union(enum) {
switch (self) {
inline .scalar, .string => |str, tag| {
if (tag == .string and !options.coerce_strings) return error.BadValue;
for (options.boolean_strings.truthy) |check|
if (std.mem.eql(u8, str, check)) return true;
for (options.boolean_strings.falsy) |check|
if (std.mem.eql(u8, str, check)) return false;
if (options.case_insensitive_scalar_coersion) {
for (options.boolean_scalars.truthy) |check|
if (std.ascii.eqlIgnoreCase(str, check)) return true;
for (options.boolean_scalars.falsy) |check|
if (std.ascii.eqlIgnoreCase(str, check)) return false;
} else {
for (options.boolean_scalars.truthy) |check|
if (std.mem.eql(u8, str, check)) return true;
for (options.boolean_scalars.falsy) |check|
if (std.mem.eql(u8, str, check)) return false;
}
return error.BadValue;
},
@ -252,8 +259,13 @@ pub const Value = union(enum) {
switch (self) {
inline .scalar, .string => |str, tag| {
if (tag == .string and !options.coerce_strings) return error.BadValue;
for (options.null_strings) |check|
if (std.mem.eql(u8, str, check)) return null;
if (options.case_insensitive_scalar_coersion) {
for (options.null_scalars) |check|
if (std.ascii.eqlIgnoreCase(str, check)) return null;
} else {
for (options.null_scalars) |check|
if (std.mem.eql(u8, str, check)) return null;
}
return try self.convertTo(opt.child, allocator, options);
},

View File

@ -23,6 +23,7 @@ pub const InlineItem = union(enum) {
empty: void,
scalar: []const u8,
line_string: []const u8,
space_string: []const u8,
concat_string: []const u8,
inline_list: []const u8,
@ -162,104 +163,113 @@ pub fn LineTokenizer(comptime Buffer: type) type {
// this should not be possible, as empty lines are caught earlier.
if (line.len == 0) return error.Impossible;
switch (line[0]) {
'#' => {
// force comments to be followed by a space. This makes them
// behave the same way as strings, actually.
if (line.len > 1 and line[1] != ' ') {
self.buffer.diag().line_offset += 1;
self.buffer.diag().length = 1;
self.buffer.diag().message = "this line is missing a space after the start of comment character '#'";
return error.BadToken;
}
// simply lie about indentation when the line is a comment.
quantized = self.last_indent;
return .{
.shift = .none,
.contents = .{ .comment = line[1..] },
.raw = line,
};
},
'|', '>', '[', '{' => {
return .{
.shift = shift,
.contents = .{ .in_line = try self.detectInlineItem(line) },
.raw = line,
};
},
'-' => {
if (line.len > 1 and line[1] != ' ') {
self.buffer.diag().line_offset += 1;
self.buffer.diag().length = 1;
self.buffer.diag().message = "this line is missing a space after the list entry character '-'";
return error.BadToken;
}
// blindly add 2 here because an empty item cannot fail in
// the value, only if a bogus dedent has occurred
self.buffer.diag().line_offset += 2;
return if (line.len == 1) .{
.shift = shift,
.contents = .{ .list_item = .empty },
.raw = line,
} else .{
.shift = shift,
.contents = .{ .list_item = try self.detectInlineItem(line[2..]) },
.raw = line,
};
},
else => {
for (line, 0..) |char, idx| {
if (char == ':') {
if (idx > 0 and (line[idx - 1] == ' ' or line[idx - 1] == '\t')) {
self.buffer.diag().line_offset += idx - 1;
self.buffer.diag().length = 1;
self.buffer.diag().message = "this line contains space before the map key-value separator character ':'";
return error.TrailingWhitespace;
}
if (idx + 1 == line.len) {
self.buffer.diag().line_offset += idx + 1;
return .{
.shift = shift,
.contents = .{ .map_item = .{ .key = line[0..idx], .val = .empty } },
.raw = line,
};
}
if (line[idx + 1] != ' ') {
self.buffer.diag().line_offset += idx + 1;
self.buffer.diag().length = 1;
self.buffer.diag().message = "this line is missing a space after the map key-value separator character ':'";
return error.BadToken;
}
return .{
.shift = shift,
.contents = .{ .map_item = .{
.key = line[0..idx],
.val = try self.detectInlineItem(line[idx + 2 ..]),
} },
.raw = line,
};
sigil: {
switch (line[0]) {
'#' => {
// Force comments to be followed by a space. We could
// allow #: to be interpreted as a map key, but I'm going
// to specifically forbid it instead.
if (line.len > 1 and line[1] != ' ') {
self.buffer.diag().line_offset += 1;
self.buffer.diag().length = 1;
self.buffer.diag().message = "this line is missing a space after the start of comment character '#'";
return error.BadToken;
}
}
return .{
.shift = shift,
.contents = .{ .in_line = .{ .scalar = line } },
.raw = line,
};
},
// simply lie about indentation when the line is a comment.
quantized = self.last_indent;
return .{
.shift = .none,
.contents = .{ .comment = line[1..] },
.raw = line,
};
},
'|', '>', '+' => {
if (line.len > 1 and line[1] != ' ') {
// we want to try parsing this as a map key
break :sigil;
}
return .{
.shift = shift,
.contents = .{ .in_line = try self.detectInlineItem(line) },
.raw = line,
};
},
'[', '{' => {
// these don't require being followed by a space, so they
// cannot be interpreted as starting a map key in any way.
return .{
.shift = shift,
.contents = .{ .in_line = try self.detectInlineItem(line) },
.raw = line,
};
},
'-' => {
if (line.len > 1 and line[1] != ' ') {
// we want to try parsing this as a map key
break :sigil;
}
// blindly add 2 here because an empty item cannot fail in
// the value, only if a bogus dedent has occurred
self.buffer.diag().line_offset += 2;
return if (line.len == 1) .{
.shift = shift,
.contents = .{ .list_item = .empty },
.raw = line,
} else .{
.shift = shift,
.contents = .{ .list_item = try self.detectInlineItem(line[2..]) },
.raw = line,
};
},
else => break :sigil,
}
}
// somehow everything else has failed
self.buffer.diag().line_offset = 0;
self.buffer.diag().length = raw_line.len;
self.buffer.diag().message = "this document contains an unknown error. Please report this.";
return error.Impossible;
for (line, 0..) |char, idx| {
if (char == ':') {
if (idx > 0 and (line[idx - 1] == ' ' or line[idx - 1] == '\t')) {
self.buffer.diag().line_offset += idx - 1;
self.buffer.diag().length = 1;
self.buffer.diag().message = "this line contains space before the map key-value separator character ':'";
return error.TrailingWhitespace;
}
if (idx + 1 == line.len) {
self.buffer.diag().line_offset += idx + 1;
return .{
.shift = shift,
.contents = .{ .map_item = .{ .key = line[0..idx], .val = .empty } },
.raw = line,
};
}
if (line[idx + 1] != ' ') {
self.buffer.diag().line_offset += idx + 1;
self.buffer.diag().length = 1;
self.buffer.diag().message = "this line is missing a space after the map key-value separator character ':'";
return error.BadToken;
}
return .{
.shift = shift,
.contents = .{ .map_item = .{
.key = line[0..idx],
.val = try self.detectInlineItem(line[idx + 2 ..]),
} },
.raw = line,
};
}
}
return .{
.shift = shift,
.contents = .{ .in_line = .{ .scalar = line } },
.raw = line,
};
}
return null;
}
@ -281,8 +291,12 @@ pub fn LineTokenizer(comptime Buffer: type) type {
};
switch (buf[start]) {
'>', '|' => |char| {
if (buf.len - start > 1 and buf[start + 1] != ' ') return error.BadToken;
'>', '|', '+' => |char| {
if (buf.len - start > 1 and buf[start + 1] != ' ') {
self.buffer.diag().length = 1;
self.buffer.diag().message = "this line is missing a space after the string start character";
return error.BadToken;
}
const slice: []const u8 = switch (buf[buf.len - 1]) {
' ', '\t' => {
@ -295,10 +309,12 @@ pub fn LineTokenizer(comptime Buffer: type) type {
else => buf[start + @min(2, buf.len - start) .. buf.len],
};
return if (char == '>')
.{ .line_string = slice }
else
.{ .concat_string = slice };
return switch (char) {
'>' => .{ .line_string = slice },
'+' => .{ .space_string = slice },
'|' => .{ .concat_string = slice },
else => unreachable,
};
},
'[' => {
if (buf.len - start < 2 or buf[buf.len - 1] != ']') {