nice-data/src/tokenizer.zig


const std = @import("std");

const Diagnostics = @import("./parser.zig").Diagnostics;

pub const Error = error{
    BadToken,
    ExtraContent,
    MixedIndentation,
    TooMuchIndentation,
    UnquantizedIndentation,
    TrailingWhitespace,
    // state/tokenizer: go completely the opposite direction re: whitespace
    //
    // This commit makes both the parser and tokenizer a lot more willing to
    // accept whitespace in places where it would previously cause strange
    // behavior. Also, whitespace is now ignored preceding and following all
    // values and keys in flow-style objects (in regular objects, trailing
    // whitespace is an error, and it is also an error for non-flow map keys
    // to have whitespace before the colon).
    //
    // Tabs are no longer allowed as whitespace in the line. They can be
    // inside scalar values, though, including map keys. Also strings allow
    // tabs inside of them.
    //
    // The primary motivation here is to apply the principle of least
    // astonishment. For example, the following
    //
    //     -  [hello, there]
    //
    // would previously have been parsed as the scalar " [hello, there]" due
    // to the presence of an additional space after the "-" list item
    // indicator. This obviously looks like a flow list, and the way it was
    // previously parsed was very visually confusing (this change does mean
    // that scalars cannot start with [, but strings can, so this is not a
    // real limitation). Note that strings still allow leading whitespace, so
    //
    //     >  hello
    //
    // will produce the string " hello" due to the additional space after the
    // string designator.
    //
    // For flow lists, [ a, b ] would have been parsed as ["a", "b "], which
    // was obviously confusing. The previous commit fixed this by making
    // whitespace rules more strict. This commit fixes this by making
    // whitespace rules more relaxed. In particular, all whitespace preceding
    // and following flow items is now stripped. The main motivation for going
    // in this direction is to allow aligning list items over multiple lines,
    // visually, which can make data much easier to read for people, an
    // explicit design goal. For example
    //
    //     key:   [  1,  2,  3 ]
    //     other: [ 10, 20, 30 ]
    //
    // is now allowed. The indentation rules do not allow right-aligning "key"
    // to "other", but I think that is acceptable (if we forced using tabs for
    // indentation, we could actually allow this, which I think is worth
    // consideration, at least). Flow maps are more generous:
    //
    //     foo:  { bar:  baz  }
    //     fooq: { barq: bazq }
    //
    // is allowed because flow maps do not use whitespace as a structural
    // designator.
    //
    // These changes do affect how some things can be represented. Scalar
    // values can no longer contain leading or trailing whitespace (previously
    // they could contain leading whitespace). Map keys cannot contain
    // trailing whitespace (they could before; this also means that keys
    // consisting of whitespace cannot be represented at all). Ultimately,
    // given the other restrictions the format imposes on keys and values, I
    // find these to be acceptable and consistent with the goal of the format.
    //
    // 2023-10-04 22:54:53 -07:00
    IllegalTabWhitespaceInLine,
    Impossible,
};

pub const DetectedIndentation = union(enum) {
    unknown: void,
    spaces: usize,
    tabs: void,
};

pub const InlineItem = union(enum) {
    empty: void,
    scalar: []const u8,
    line_string: []const u8,
    space_string: []const u8,
    // parser: change string and | semantics and expose slices in Value
    //
    // The way I implemented these changes ended up being directly coupled and
    // I am not interested in trying to decouple them, so instead here's a
    // single commit that makes changes to both the API and the format. Let's
    // go over these.
    //
    // | now acts as a direct concatenation operator, rather than
    // concatenating with a space. This is because the format allows the
    // specification of a trailing space (by using | to fence the string just
    // before the newline). So it's now possible to spread a long string
    // without spaces over multiple lines, which couldn't be done before. This
    // does have the downside that the common pattern of concatenating strings
    // with a space now requires some extra trailing line noise. I may
    // introduce a THIRD type of concatenating string (thinking of using + as
    // the prefix) because I am a jerk. We will see.
    //
    // The way multi-line strings are concatenated has changed. Partially this
    // has to do with increasing the simplicity of the aforementioned
    // implementation change (the parser forgets the string type from the
    // tokenizer. This worked before because there would always be a trailing
    // character that could be popped off. But since one type now appends no
    // character, this would have to be tracked through the parsing to
    // determine if a character would need to be popped at the end). But I was
    // also not terribly satisfied with the semantics of multiline strings
    // before. I wrote several words about this in
    // 429734e6e813b225654aa71c283f4a8b4444609f, where I reached the opposite
    // conclusion from what is implemented in this commit.
    //
    // Basically, when different types of string concatenation are mixed, the
    // results may be surprising. The previous approach would append the line
    // terminator at the end of the line specified. The new approach prepends
    // the line terminator at the beginning of the line specified. Since the
    // specifier character is at the beginning of the line, I feel like this
    // reads a little better simply due to the colocation of information. As
    // an example:
    //
    //     > first
    //     | second
    //     > third
    //
    // would previously have resulted in "first\nsecondthird" but it will now
    // result in "firstsecond\nthird". The only mildly baffling part about
    // this is that the string signifier on the first line has absolutely no
    // impact on the string. In the old design, it was the last line that had
    // no impact.
    //
    // Finally, this commit also changes Value so that it uses []const u8
    // slices directly to store strings instead of ArrayLists. This is because
    // everything downstream of the value was just reaching into string.items
    // to access the slice directly, so cut out the middleman. It was
    // unintuitive to access a field named .string and get an arraylist rather
    // than a slice, anyway.
    //
    // 2023-10-08 16:57:52 -07:00
    concat_string: []const u8,
    inline_list: []const u8,
    inline_map: []const u8,
};
pub const LineContents = union(enum) {
    comment: []const u8,
    in_line: InlineItem,
    list_item: InlineItem,
    map_item: struct { key: []const u8, val: InlineItem },
};

pub const ShiftDirection = enum { indent, dedent, none };

pub const LineShift = union(ShiftDirection) {
    indent: void,
    // we can dedent multiple levels at once.
    dedent: usize,
    none: void,
};
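
// Example (a sketch added for illustration, not part of the original
// source): `dedent` is the only shift that carries a payload, because a
// single line can close several nested levels at once, while `indent` may
// only ever open one level.
test "example: LineShift dedent carries a level count" {
    const shift: LineShift = .{ .dedent = 2 };
    try std.testing.expectEqual(ShiftDirection.dedent, std.meta.activeTag(shift));
    try std.testing.expectEqual(@as(usize, 2), shift.dedent);
}
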
pub const Line = struct {
    shift: LineShift,
    contents: LineContents,
    raw: []const u8,
};

// buffer is expected to be either LineBuffer or FixedLineBuffer, but can
// technically be anything with a conformant interface.
pub fn LineTokenizer(comptime Buffer: type) type {
    return struct {
        buffer: Buffer,
        index: usize = 0,
        indentation: DetectedIndentation = .unknown,
        last_indent: usize = 0,

        pub fn finish(self: @This()) !void {
            if (!self.buffer.empty()) {
                self.buffer.diag().line_offset = 0;
                self.buffer.diag().length = 1;
                self.buffer.diag().message = "the document has extra content or is missing the final LF character";
                return error.ExtraContent;
            }
        }

        pub fn next(self: *@This()) !?Line {
            lineloop: while (try self.buffer.nextLine()) |raw_line| {
                var indent: usize = 0;
                for (raw_line, 0..) |char, idx| {
                    switch (char) {
                        ' ' => {
                            switch (self.indentation) {
                                // There's a weird coupling here because we can't set this until
                                // all spaces have been consumed. I also thought about ignoring
                                // spaces on comment lines since those don't affect the
                                // relative indent/dedent, but then we would allow comments
                                // to ignore our indent quantum, which I dislike due to it making
                                // ugly documents.
                                .unknown => self.indentation = .{ .spaces = 0 },
                                .spaces => {},
                                .tabs => {
                                    self.buffer.diag().line_offset = idx;
                                    self.buffer.diag().length = 1;
                                    self.buffer.diag().message = "the document contains mixed tab/space indentation";
                                    return error.MixedIndentation;
                                },
                            }
                        },
                        '\t' => {
                            switch (self.indentation) {
                                .unknown => self.indentation = .tabs,
                                .spaces => {
                                    self.buffer.diag().line_offset = idx;
                                    self.buffer.diag().length = 1;
                                    self.buffer.diag().message = "the document contains mixed tab/space indentation";
                                    return error.MixedIndentation;
                                },
                                .tabs => {},
                            }
                        },
                        '\r' => {
                            return error.BadToken;
                        },
                        else => {
                            indent = idx;
                            break;
                        },
                    }
                } else {
                    if (raw_line.len > 0) {
                        self.buffer.diag().line_offset = raw_line.len - 1;
                        self.buffer.diag().length = 1;
                        self.buffer.diag().message = "this line contains trailing whitespace";
                        return error.TrailingWhitespace;
                    }
                    continue :lineloop;
                }

                var quantized: usize = if (self.indentation == .spaces) quant: {
                    if (self.indentation.spaces == 0) {
                        self.indentation.spaces = indent;
                    }
                    if (@rem(indent, self.indentation.spaces) != 0) {
                        self.buffer.diag().line_offset = 0;
                        self.buffer.diag().length = indent;
                        self.buffer.diag().message = "this line contains incorrectly quantized indentation";
                        return error.UnquantizedIndentation;
                    }
                    break :quant @divExact(indent, self.indentation.spaces);
                } else indent;

                const shift: LineShift = if (quantized > self.last_indent) rel: {
                    if ((quantized - self.last_indent) > 1) {
                        self.buffer.diag().line_offset = 0;
                        self.buffer.diag().length = indent;
                        self.buffer.diag().message = "this line contains too much indentation";
                        return error.TooMuchIndentation;
                    }
                    break :rel .indent;
                } else if (quantized < self.last_indent)
                    .{ .dedent = self.last_indent - quantized }
                else
                    .none;

                defer {
                    self.last_indent = quantized;
                }

                // update the diagnostics so that the parser can use them without
                // knowing about the whitespace.
                self.buffer.diag().line_offset = indent;
                const line = raw_line[indent..];

                // this should not be possible, as empty lines are caught earlier.
                if (line.len == 0) return error.Impossible;

                sigil: {
                    switch (line[0]) {
                        '#' => {
                            // Force comments to be followed by a space. We could
                            // allow #: to be interpreted as a map key, but I'm going
                            // to specifically forbid it instead.
                            if (line.len > 1 and line[1] != ' ') {
                                self.buffer.diag().line_offset += 1;
                                self.buffer.diag().length = 1;
                                self.buffer.diag().message = "this line is missing a space after the start of comment character '#'";
                                return error.BadToken;
                            }

                            // simply lie about indentation when the line is a comment.
                            quantized = self.last_indent;
                            return .{
                                .shift = .none,
                                .contents = .{ .comment = line[1..] },
                                .raw = line,
                            };
                        },
                        '|', '>', '+' => {
                            if (line.len > 1 and line[1] != ' ') {
                                // we want to try parsing this as a map key
                                break :sigil;
                            }

                            return .{
                                .shift = shift,
                                .contents = .{ .in_line = try self.detectInlineItem(line) },
                                .raw = line,
                            };
                        },
                        '[', '{' => {
                            // these don't require being followed by a space, so they
                            // cannot be interpreted as starting a map key in any way.
                            return .{
                                .shift = shift,
                                .contents = .{ .in_line = try self.detectInlineItem(line) },
                                .raw = line,
                            };
                        },
                        '-' => {
                            if (line.len > 1 and line[1] != ' ') {
                                // we want to try parsing this as a map key
                                break :sigil;
                            }

                            // blindly add 2 here because an empty item cannot fail in
                            // the value, only if a bogus dedent has occurred
                            self.buffer.diag().line_offset += 2;

                            return if (line.len == 1) .{
                                .shift = shift,
                                .contents = .{ .list_item = .empty },
                                .raw = line,
                            } else .{
                                .shift = shift,
                                .contents = .{ .list_item = try self.detectInlineItem(line[2..]) },
                                .raw = line,
                            };
                        },
                        else => break :sigil,
                    }
                }

                // no sigil matched: scan for the key-value separator to decide
                // between a map item and a bare scalar.
                for (line, 0..) |char, idx| {
                    if (char == ':') {
                        if (idx > 0 and (line[idx - 1] == ' ' or line[idx - 1] == '\t')) {
                            self.buffer.diag().line_offset += idx - 1;
                            self.buffer.diag().length = 1;
                            self.buffer.diag().message = "this line contains whitespace before the map key-value separator character ':'";
                            return error.TrailingWhitespace;
                        }

                        if (idx + 1 == line.len) {
                            self.buffer.diag().line_offset += idx + 1;
                            return .{
                                .shift = shift,
                                .contents = .{ .map_item = .{ .key = line[0..idx], .val = .empty } },
                                .raw = line,
                            };
                        }

                        if (line[idx + 1] != ' ') {
                            self.buffer.diag().line_offset += idx + 1;
                            self.buffer.diag().length = 1;
                            self.buffer.diag().message = "this line is missing a space after the map key-value separator character ':'";
                            return error.BadToken;
                        }

                        return .{
                            .shift = shift,
                            .contents = .{ .map_item = .{
                                .key = line[0..idx],
                                .val = try self.detectInlineItem(line[idx + 2 ..]),
                            } },
                            .raw = line,
                        };
                    }
                }

                return .{
                    .shift = shift,
                    .contents = .{ .in_line = .{ .scalar = line } },
                    .raw = line,
                };
            }
            return null;
        }

        // TODO: it's impossible to get the right diagnostic offset in this function at the moment
        fn detectInlineItem(self: @This(), buf: []const u8) Error!InlineItem {
            if (buf.len == 0) return .empty;

            const start = start: {
                for (buf, 0..) |chr, idx|
                    if (chr == ' ')
                        continue
                    else if (chr == '\t')
                        return error.IllegalTabWhitespaceInLine
                    else
                        break :start idx;

                return error.TrailingWhitespace;
            };

            switch (buf[start]) {
                '>', '|', '+' => |char| {
                    if (buf.len - start > 1 and buf[start + 1] != ' ') {
                        self.buffer.diag().length = 1;
                        self.buffer.diag().message = "this line is missing a space after the string start character";
                        return error.BadToken;
                    }

                    const slice: []const u8 = switch (buf[buf.len - 1]) {
                        ' ', '\t' => {
                            self.buffer.diag().line_offset = 0;
                            self.buffer.diag().length = 1;
                            self.buffer.diag().message = "this line contains trailing whitespace";
                            return error.TrailingWhitespace;
                        },
                        '|' => buf[start + @min(2, buf.len - start) .. buf.len - @intFromBool(buf.len - start > 1)],
                        else => buf[start + @min(2, buf.len - start) .. buf.len],
                    };

                    return switch (char) {
                        '>' => .{ .line_string = slice },
                        '+' => .{ .space_string = slice },
                        '|' => .{ .concat_string = slice },
                        else => unreachable,
                    };
                },
                '[' => {
                    if (buf.len - start < 2 or buf[buf.len - 1] != ']') {
                        self.buffer.diag().line_offset = 0;
                        self.buffer.diag().length = 1;
                        self.buffer.diag().message = "this line contains an inline list but does not end with the closing character ']'";
                        return error.BadToken;
                    }

                    // keep the closing ] for the inline parser
                    return .{ .inline_list = buf[start + 1 ..] };
                },
                '{' => {
                    if (buf.len - start < 2 or buf[buf.len - 1] != '}') {
                        self.buffer.diag().line_offset = 0;
                        self.buffer.diag().length = 1;
                        self.buffer.diag().message = "this line contains an inline map but does not end with the closing character '}'";
                        return error.BadToken;
                    }

                    // keep the closing } for the inline parser
                    return .{ .inline_map = buf[start + 1 ..] };
                },
                else => {
                    if (buf[buf.len - 1] == ' ' or buf[buf.len - 1] == '\t') {
                        self.buffer.diag().line_offset = 0;
                        self.buffer.diag().length = 1;
                        self.buffer.diag().message = "this line contains trailing whitespace";
                        return error.TrailingWhitespace;
                    }

                    return .{ .scalar = buf[start..] };
                },
            }
        }
    };
}
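
// Example tests: a minimal sketch of how the tokenizer above can be driven.
// `ExampleDiag` and `ExampleBuffer` are hypothetical stand-ins written only
// for these tests; they are not the project's LineBuffer, FixedLineBuffer, or
// Diagnostics types. They assume nothing beyond the interface LineTokenizer
// uses in this file: `nextLine()`, `empty()`, and `diag()` returning a
// pointer to something with `line_offset`, `length`, and `message` fields.
const ExampleDiag = struct {
    line_offset: usize = 0,
    length: usize = 0,
    message: []const u8 = "",
};

const ExampleBuffer = struct {
    data: []const u8,
    pos: usize = 0,
    diagnostics: *ExampleDiag,

    pub fn diag(self: ExampleBuffer) *ExampleDiag {
        return self.diagnostics;
    }

    pub fn empty(self: ExampleBuffer) bool {
        return self.pos >= self.data.len;
    }

    // Yields one LF-terminated line at a time, without the terminator. An
    // unterminated tail stays in the buffer so that finish() can report it.
    // The error set is a placeholder; this sketch never returns an error.
    pub fn nextLine(self: *ExampleBuffer) error{ExampleUnused}!?[]const u8 {
        const rest = self.data[self.pos..];
        const end = std.mem.indexOfScalar(u8, rest, '\n') orelse return null;
        self.pos += end + 1;
        return rest[0..end];
    }
};

test "example: map items and indentation shifts" {
    var diagnostics = ExampleDiag{};
    var tok = LineTokenizer(ExampleBuffer){
        .buffer = .{ .data = "a:\n  b: 1\nc: 2\n", .diagnostics = &diagnostics },
    };

    const first = (try tok.next()).?;
    try std.testing.expectEqualStrings("a", first.contents.map_item.key);
    try std.testing.expectEqual(ShiftDirection.none, std.meta.activeTag(first.shift));

    // the first indented line fixes the indent quantum (two spaces here)
    const second = (try tok.next()).?;
    try std.testing.expectEqual(ShiftDirection.indent, std.meta.activeTag(second.shift));
    try std.testing.expectEqualStrings("1", second.contents.map_item.val.scalar);

    const third = (try tok.next()).?;
    try std.testing.expectEqual(@as(usize, 1), third.shift.dedent);

    try std.testing.expect((try tok.next()) == null);
    try tok.finish();
}

test "example: string prefixes and flow lists" {
    var diagnostics = ExampleDiag{};
    var tok = LineTokenizer(ExampleBuffer){
        .buffer = .{ .data = "> hello\n| a |\n- [1, 2]\n", .diagnostics = &diagnostics },
    };

    const line_str = (try tok.next()).?;
    try std.testing.expectEqualStrings("hello", line_str.contents.in_line.line_string);

    // a trailing '|' fences the string, preserving the space before it
    const concat = (try tok.next()).?;
    try std.testing.expectEqualStrings("a ", concat.contents.in_line.concat_string);

    // the closing ']' is kept for the inline parser
    const item = (try tok.next()).?;
    try std.testing.expectEqualStrings("1, 2]", item.contents.list_item.inline_list);
}

test "example: whitespace and termination errors" {
    var diagnostics = ExampleDiag{};
    var tok = LineTokenizer(ExampleBuffer){
        .buffer = .{ .data = "a : b\n", .diagnostics = &diagnostics },
    };
    // whitespace before ':' is rejected; the diagnostic points at the space
    try std.testing.expectError(error.TrailingWhitespace, tok.next());
    try std.testing.expectEqual(@as(usize, 1), diagnostics.line_offset);

    // with this example buffer, a document lacking the final LF yields no
    // lines and is reported by finish()
    var unterminated = LineTokenizer(ExampleBuffer){
        .buffer = .{ .data = "a", .diagnostics = &diagnostics },
    };
    try std.testing.expect((try unterminated.next()) == null);
    try std.testing.expectError(error.ExtraContent, unterminated.finish());
}
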