Compare commits

...

4 Commits

Author SHA1 Message Date
63ee3867be config: dupe map keys
I didn't do an exhaustive search, but it seems that the managed
hashmaps only allocate space for the structure of the map itself, not
for their keys or values. This mostly makes sense, but it also means
that this was only working because I am currently not freeing the
input buffer until after iterating through the parse result.
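As an illustration (a hypothetical sketch, not the committed change itself): since the map stores key slices by reference, a key sliced out of a transient input buffer has to be duplicated into longer-lived memory before insertion.

```zig
// Sketch: std.StringHashMap stores the key slice as-is; it does not copy it.
// If pair.key points into an input buffer that will be freed, duplicate it
// into the document's arena first. (arena_alloc, pair, map, and value are
// stand-ins for the names used elsewhere in this change.)
const owned_key = try arena_alloc.dupe(u8, pair.key);
try map.put(owned_key, value);
```

With the copy owned by the arena, the input buffer can be freed as soon as parsing finishes.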

Looking through this, I'm also reasonably surprised by how many times
this is assigned in the normal parsing vs the flow parsing. There is a
lot more repetition in the code of the normal parser, I think because
it does not have a granular state machine. It may be worth revisiting
the structure to see if a more detailed state machine, like the one
used for parsing the flow-style objects, would reduce the amount of
code repetition here. I suspect it could be better than it currently
is, since it seems unlikely that there really are four different
scenarios where we need to be parsing a dictionary key. At a quick
glance, it also looks like I could be taking better advantage of the
flipflop loop on indent as well as dedent. This might be a bit less
efficient due to essentially doing less loop unrolling, but it could
also make the code more maintainable by reducing manual repetition.
2023-09-22 00:53:26 -07:00
a3c0935f1e config: use std.StringArrayHashMap for the map type
As I was thinking about this, I realized that data serialization is
much more of a bear than deserialization. Or, more accurately, trying
to make stable round trip serialization a goal puts heavier demands on
deserialization, including preserving input order.

I think there may be a mountain hiding under this molehill, though,
because the goals of having a format that is designed to be
handwritten and also machine written are at odds with each other.
Right now, the parser does not preserve comments at all. But even if
we did (they could easily become a special type of string), comment
indentation is ignored. Comments are not directly a child of any other
part of the document, they're awkward text that exists interspersed
throughout it.

With the current design, there are some essentially unsolvable
problems, like comments interspersed throughout multiline strings. The
string is processed into a single object in the output, so there can't
be weird magic data interleaved with it, because it loses the concept
of being interleaved entirely. (This is a bigger issue for space
strings, which don't even preserve a unique way to reserialize them.
Line strings at least contain a character, the newline, that can
appear nowhere else but at a break in the string.) Obviously this
isn't technically impossible, but it would require a change to the way
that values are modeled.

And even if we did take the approach of associating a comment with,
say, the value that follows it (which I think is a reasonable thing to
do, ignoring the interleaved comment situation described above), if
software reads in data, changes it, and writes it back out, how do we
account for deleted items? Does the comment get deleted with the item?
Does it become a dangling comment that just gets shoved somewhere in
the document? How are comments that come after everything else in the
document handled?

From a pure data perspective, it's fairly obvious why JSON omits
comments: they're trivial to parse, but there's no strategy for
emitting them that will always be correct, especially in a format that
doesn't give a hoot about linebreaks. It may be interesting to look at
fancy TOML (barf) parsers to see how they handle comments, though I
assume the general technique is to store their row position in the
original document and track when lines are added or removed.

Ultimately, I think the use case of a format written by humans and
read by computers is still useful. That's my intended use case for
this and why I started it, but its application as a configuration file
format is probably hamstrung quite a bit by software not being able to
write it back. On the other hand, there's a lot of successful software
I use where the config files are not written by the software at all,
so maybe it's entirely fine to declare this out of scope and not worry
about it further. At the very least it's almost certainly less of an
issue than erroring on carriage returns, or the fact that certain keys
are simply unrepresentable.

As a side note, I guess what they say about commit message length being
inversely proportional to the change length is true. Hope you enjoyed
the blog over this 5 character change.
2023-09-22 00:53:26 -07:00
a88e890974 config: refactor LineTokenizer to use an internal line buffer
The goal here is to support a streaming parser. However, I did decide
to leave the flow item parser state machine as fully buffered
(i.e. not streaming). This is not JSON, and in general documents
should consist of many, shorter lines, so this buffering strategy
should work reasonably well. I have not actually tried the streaming
implementation of this yet.
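A rough usage sketch of the streaming side (hypothetical and untested, as noted above; `process` is a stand-in for whatever consumes lines):

```zig
var lines = try LineBuffer.init(allocator);
try lines.feed("key: value\npartial");
while (lines.nextLine()) |line| {
    // yields "key: value"; "partial" stays buffered until its newline arrives
    process(line);
}
try lines.feed(" line\n"); // now nextLine() can yield "partial line"
```

The scan window advances past each yielded line, and rehome/realloc keep unconsumed bytes contiguous across feeds.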
2023-09-22 00:53:26 -07:00
64dac2fd51 config: differentiate fields in Value
This makes handling Value slightly more work, but it provides useful
metadata that can be used to perform better conversion and
serialization.

The motivation behind the "scalar" type is that in general, only
scalars can be coerced to other types. For example, a scalar `null`
and a string `> null` have the same in-memory representation. If they
are treated identically, this precludes unambiguously converting an
optional string whose contents are "null". With the two disambiguated,
we can choose to convert `null` to the null object and `> null` to a
string of contents "null". This ambiguity does not necessarily exist for
the standard boolean values `true` and `false`, but it does allow the
conversion to be more strict, and it will theoretically result in
documents that read more naturally.
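To illustrate the distinction (a sketch of the two spellings in the format):

```
key: null
key: > null
```

The first parses as a scalar, which conversion may turn into the null object; the second parses as a string and unambiguously remains the four-character string "null".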

The motivation behind exposing flow_list and flow_map is that it will
allow preserving document formatting round trip (well, this isn't
strictly true: single line explicit strings don't remember whether
they were line strings or space strings, nor whether they were
indented. However, that is much less information to lose).

The following formulations will parse to the same indistinguishable
value:

  key: > value
  key:
    > value
  key: | value
  key:
    | value

I think that's okay. It's a lot easier to choose a canonical form for
this case than it is for a map/list without any hints regarding its
origin.
2023-09-22 00:53:26 -07:00


@@ -63,29 +63,114 @@
const std = @import("std");
pub const IndexSlice = struct { start: usize, len: usize };
pub const Diagnostics = struct {
row: usize,
span: struct { absolute: usize, line_offset: usize, length: usize },
message: []const u8,
};
pub const LineTokenizer = struct {
pub const LineBuffer = struct {
allocator: std.mem.Allocator,
buffer: []u8,
used: usize,
window: IndexSlice,
pub const default_capacity: usize = 4096;
pub const Error = std.mem.Allocator.Error;
pub fn init(allocator: std.mem.Allocator) Error!LineBuffer {
return initCapacity(allocator, default_capacity);
}
pub fn initCapacity(allocator: std.mem.Allocator, capacity: usize) Error!LineBuffer {
return .{
.allocator = allocator,
.buffer = try allocator.alloc(u8, capacity),
.used = 0,
.window = .{ .start = 0, .len = 0 },
};
}
pub fn feed(self: *LineBuffer, data: []const u8) Error!void {
if (data.len == 0) return;
// TODO: check for usize overflow here if we want Maximum Robustness
const new_window_len = self.window.len + data.len;
// data cannot fit in the buffer with our scan window, so we have to realloc
if (new_window_len > self.buffer.len) {
// TODO: adopt an overallocation strategy? Will potentially avoid allocating
// on every invocation but will cause the buffer to oversize
self.buffer = try self.allocator.realloc(self.buffer, new_window_len);
self.rehome();
@memcpy(self.buffer[self.used..].ptr, data);
self.used = new_window_len;
self.window.len = new_window_len;
}
// data will fit, but needs to be moved in the buffer
else if (self.window.start + new_window_len > self.buffer.len) {
self.rehome();
@memcpy(self.buffer[self.used..].ptr, data);
self.used = new_window_len;
self.window.len = new_window_len;
}
// data can simply be appended
else {
@memcpy(self.buffer[self.used..].ptr, data);
self.used += data.len;
self.window.len = new_window_len;
}
}
/// The memory returned by this function is valid until the next call to `feed`.
/// The resulting slice does not include the newline character.
pub fn nextLine(self: *LineBuffer) ?[]const u8 {
if (self.window.start >= self.buffer.len or self.window.len == 0)
return null;
const window = self.buffer[self.window.start..][0..self.window.len];
const split = std.mem.indexOfScalar(u8, window, '\n') orelse return null;
self.window.start += split + 1;
self.window.len -= split + 1;
return window[0..split];
}
fn rehome(self: *LineBuffer) void {
if (self.window.start == 0) return;
const window = self.buffer[self.window.start..][0..self.window.len];
if (self.window.len > self.window.start)
std.mem.copyForwards(u8, self.buffer, window)
else
@memcpy(self.buffer.ptr, window);
self.window.start = 0;
self.used = window.len;
}
};
pub const FixedLineBuffer = struct {
buffer: []const u8,
index: usize = 0,
indentation: IndentationType = .immaterial,
last_indent: usize = 0,
diagnostics: *Diagnostics,
window: IndexSlice,
row: usize = 0,
pub fn init(data: []const u8) FixedLineBuffer {
return .{ .buffer = data, .window = .{ .start = 0, .len = data.len } };
}
const Error = error{
BadToken,
MixedIndentation,
UnquantizedIndentation,
TooMuchIndentation,
MissingNewline,
TrailingWhitespace,
Impossible,
pub fn nextLine(self: *FixedLineBuffer) ?[]const u8 {
if (self.window.start >= self.buffer.len or self.window.len == 0)
return null;
const window = self.buffer[self.window.start..][0..self.window.len];
const split = std.mem.indexOfScalar(u8, window, '\n') orelse return null;
self.window.start += split + 1;
self.window.len -= split + 1;
return window[0..split];
}
};
const IndentationType = union(enum) {
@@ -148,13 +233,29 @@ pub const LineTokenizer = struct {
raw: []const u8,
};
pub fn next(self: *LineTokenizer) Error!?Line {
if (self.index == self.buffer.len) return null;
pub fn LineTokenizer(comptime Buffer: type) type {
return struct {
buffer: Buffer,
index: usize = 0,
indentation: IndentationType = .immaterial,
last_indent: usize = 0,
diagnostics: *Diagnostics,
row: usize = 0,
const Error = error{
BadToken,
MixedIndentation,
UnquantizedIndentation,
TooMuchIndentation,
MissingNewline,
TrailingWhitespace,
Impossible,
};
pub fn next(self: *@This()) Error!?Line {
lineloop: while (self.buffer.nextLine()) |raw_line| {
var indent: usize = 0;
var offset: usize = 0;
for (self.buffer[self.index..], 0..) |char, idx| {
for (raw_line, 0..) |char, idx| {
switch (char) {
' ' => {
switch (self.indentation) {
@@ -168,7 +269,6 @@ pub const LineTokenizer = struct {
.spaces => {},
.tabs => return error.MixedIndentation,
}
indent += 1;
},
'\t' => {
switch (self.indentation) {
@@ -176,40 +276,28 @@ pub const LineTokenizer = struct {
.spaces => return error.MixedIndentation,
.tabs => {},
}
indent += 1;
},
'\r' => {
return error.BadToken;
},
'\n' => {
// don't even emit anything for empty rows.
self.row += 1;
offset = idx + 1;
// if it's too hard to deal with, Just Make It An Error!!!
// an empty line with whitespace on it is garbage. It can mess with
// the indentation detection grossly in a way that is annoying to
// deal with. Besides, having whitespace-only lines in a document
// is essentially terrorism, with which negotiations are famously
// not permitted.
if (indent > 0) return error.TrailingWhitespace;
else => {
indent = idx;
break;
},
else => break,
}
} else {
std.debug.assert(self.buffer.len == self.index + indent + offset + 1);
self.index = self.buffer.len;
// this prong will get hit when the document only consists of whitespace
return null;
if (raw_line.len > 0) return error.TrailingWhitespace;
continue :lineloop;
}
var quantized: usize = if (self.indentation == .spaces) blk: {
var quantized: usize = if (self.indentation == .spaces) quant: {
if (self.indentation.spaces == 0) {
self.indentation.spaces = indent;
}
if (@rem(indent, self.indentation.spaces) != 0)
return error.UnquantizedIndentation;
break :blk @divExact(indent, self.indentation.spaces);
break :quant @divExact(indent, self.indentation.spaces);
} else indent;
const relative: RelativeIndent = if (quantized > self.last_indent) rel: {
@@ -221,16 +309,12 @@ pub const LineTokenizer = struct {
else
.none;
offset += indent;
defer {
self.row += 1;
self.last_indent = quantized;
self.index += offset;
}
const line = try consumeLine(self.buffer[self.index + offset ..]);
offset += line.len + 1;
const line = raw_line[indent..];
// this should not be possible, as empty lines are caught earlier.
if (line.len == 0) return error.Impossible;
@@ -294,6 +378,11 @@ pub const LineTokenizer = struct {
};
},
}
// somehow everything else has failed
return error.Impossible;
}
return null;
}
fn detectInlineItem(buf: []const u8) Error!InlineItem {
@@ -336,35 +425,40 @@ pub const LineTokenizer = struct {
},
}
}
fn consumeLine(buf: []const u8) ![]const u8 {
for (buf, 0..) |char, idx| {
switch (char) {
'\n' => return buf[0..idx],
'\r' => return error.BadToken,
else => {},
}
}
return error.MissingNewline;
}
};
}
pub const Value = union(enum) {
pub const String = std.ArrayList(u8);
pub const Map = std.StringHashMap(Value);
pub const Map = std.StringArrayHashMap(Value);
pub const List = std.ArrayList(Value);
pub const TagType = @typeInfo(Value).Union.tag_type.?;
scalar: String,
string: String,
list: List,
flow_list: List,
map: Map,
flow_map: Map,
pub inline fn fromScalar(alloc: std.mem.Allocator, input: []const u8) !Value {
return try _fromScalarOrString(alloc, .scalar, input);
}
pub inline fn fromString(alloc: std.mem.Allocator, input: []const u8) !Value {
var res: Value = .{ .string = try String.initCapacity(alloc, input.len) };
res.string.appendSliceAssumeCapacity(input);
return try _fromScalarOrString(alloc, .string, input);
}
inline fn _fromScalarOrString(alloc: std.mem.Allocator, comptime classification: TagType, input: []const u8) !Value {
var res = @unionInit(Value, @tagName(classification), try String.initCapacity(alloc, input.len));
@field(res, @tagName(classification)).appendSliceAssumeCapacity(input);
return res;
}
pub inline fn newScalar(alloc: std.mem.Allocator) Value {
return .{ .scalar = String.init(alloc) };
}
pub inline fn newString(alloc: std.mem.Allocator) Value {
return .{ .string = String.init(alloc) };
}
@@ -373,10 +467,18 @@ pub const Value = union(enum) {
return .{ .list = List.init(alloc) };
}
pub inline fn newFlowList(alloc: std.mem.Allocator) Value {
return .{ .flow_list = List.init(alloc) };
}
pub inline fn newMap(alloc: std.mem.Allocator) Value {
return .{ .map = Map.init(alloc) };
}
pub inline fn newFlowMap(alloc: std.mem.Allocator) Value {
return .{ .flow_map = Map.init(alloc) };
}
pub fn printDebug(self: Value) void {
self.printRecursive(0);
std.debug.print("\n", .{});
@@ -384,7 +486,7 @@ pub const Value = union(enum) {
fn printRecursive(self: Value, indent: usize) void {
switch (self) {
.string => |str| {
.scalar, .string => |str| {
if (std.mem.indexOfScalar(u8, str.items, '\n')) |_| {
var lines = std.mem.splitScalar(u8, str.items, '\n');
std.debug.print("\n", .{});
@@ -403,7 +505,7 @@ pub const Value = union(enum) {
std.debug.print("{s}", .{str.items});
}
},
.list => |list| {
.list, .flow_list => |list| {
if (list.items.len == 0) {
std.debug.print("[]", .{});
return;
@@ -420,7 +522,7 @@ pub const Value = union(enum) {
.{ .empty = "", .indent = indent },
);
},
.map => |map| {
.map, .flow_map => |map| {
if (map.count() == 0) {
std.debug.print("{{}}", .{});
return;
@@ -465,7 +567,7 @@ pub const Parser = struct {
DuplicateKey,
BadMapEntry,
Fail,
} || LineTokenizer.Error || FlowParser.Error || std.mem.Allocator.Error;
} || LineTokenizer(FixedLineBuffer).Error || FlowParser.Error || std.mem.Allocator.Error;
pub const DuplicateKeyBehavior = enum {
use_first,
@@ -506,18 +608,43 @@ pub const Parser = struct {
}
};
pub const State = struct {
pub const Stack = std.ArrayList(*Value);
document: Document,
value_stack: Stack,
state: ParseState = .initial,
expect_shift: ShiftDirection = .none,
dangling_key: ?[]const u8 = null,
pub fn init(alloc: std.mem.Allocator) State {
return .{
.document = Document.init(alloc),
.value_stack = Stack.init(alloc),
};
}
pub fn deinit(self: State) void {
self.value_stack.deinit();
}
};
pub fn parseBuffer(self: *Parser, buffer: []const u8) Error!Document {
var document = Document.init(self.allocator);
errdefer document.deinit();
const arena_alloc = document.arena.allocator();
var state: ParseState = .initial;
var expect_shift: LineTokenizer.ShiftDirection = .none;
var expect_shift: ShiftDirection = .none;
var dangling_key: ?[]const u8 = null;
var stack = std.ArrayList(*Value).init(arena_alloc);
defer stack.deinit();
var tok: LineTokenizer = .{ .buffer = buffer, .diagnostics = &self.diagnostics };
var tok: LineTokenizer(FixedLineBuffer) = .{
.buffer = FixedLineBuffer.init(buffer),
.diagnostics = &self.diagnostics,
};
while (try tok.next()) |line| {
if (line.contents == .comment) continue;
@@ -536,7 +663,7 @@ pub const Parser = struct {
// empty scalars are only emitted for a list_item or a map_item
.empty => unreachable,
.scalar => |str| {
document.root = try valueFromString(arena_alloc, str);
document.root = try Value.fromScalar(arena_alloc, str);
// this is a cheesy hack. If the document consists
// solely of a scalar, the finalizer will try to
// chop a line ending off of it, so we need to add
@@ -546,7 +673,7 @@ pub const Parser = struct {
state = .done;
},
.line_string, .space_string => |str| {
document.root = try valueFromString(arena_alloc, str);
document.root = try Value.fromString(arena_alloc, str);
try document.root.string.append(in_line.lineEnding());
try stack.append(&document.root);
state = .value;
@@ -561,7 +688,7 @@ pub const Parser = struct {
},
},
.list_item => |value| {
document.root = .{ .list = Value.List.init(arena_alloc) };
document.root = Value.newList(arena_alloc);
try stack.append(&document.root);
switch (value) {
@@ -569,8 +696,12 @@ pub const Parser = struct {
expect_shift = .indent;
state = .value;
},
.line_string, .space_string, .scalar => |str| {
try document.root.list.append(try valueFromString(arena_alloc, str));
.scalar => |str| {
try document.root.list.append(try Value.fromScalar(arena_alloc, str));
state = .value;
},
.line_string, .space_string => |str| {
try document.root.list.append(try Value.fromString(arena_alloc, str));
state = .value;
},
.flow_list => |str| {
@@ -584,7 +715,7 @@ pub const Parser = struct {
}
},
.map_item => |pair| {
document.root = .{ .map = Value.Map.init(arena_alloc) };
document.root = Value.newMap(arena_alloc);
try stack.append(&document.root);
switch (pair.val) {
@@ -596,13 +727,19 @@ pub const Parser = struct {
// key somewhere until we can consume the
// value. More parser state to lug along.
dangling_key = pair.key;
dangling_key = try arena_alloc.dupe(u8, pair.key);
state = .value;
},
.line_string, .space_string, .scalar => |str| {
.scalar => |str| {
// we can do direct puts here because this is
// the very first line of the document
try document.root.map.put(pair.key, try valueFromString(arena_alloc, str));
try document.root.map.put(pair.key, try Value.fromScalar(arena_alloc, str));
state = .value;
},
.line_string, .space_string => |str| {
// we can do direct puts here because this is
// the very first line of the document
try document.root.map.put(pair.key, try Value.fromString(arena_alloc, str));
state = .value;
},
.flow_list => |str| {
@@ -618,6 +755,14 @@ pub const Parser = struct {
}
},
.value => switch (stack.getLast().*) {
// these three states are never reachable here. flow_list and
// flow_map are parsed with a separate state machine. These
// value types can only be present by themselves as the first
// line of the document, in which case the document consists
// only of that single line: this parser jumps immediately into
// the .done state, bypassing the .value state in which this
// switch is embedded.
.scalar, .flow_list, .flow_map => unreachable,
.string => |*string| {
if (line.indent == .indent)
return error.UnexpectedIndent;
@@ -655,7 +800,7 @@ pub const Parser = struct {
// the first line here creates the expect_shift, but the second line
// is a valid continuation of the list despite not being indented
if (expect_shift == .indent and line.indent != .indent)
try list.append(try valueFromString(arena_alloc, ""));
try list.append(Value.newScalar(arena_alloc));
// Consider:
//
@@ -687,12 +832,12 @@ pub const Parser = struct {
expect_shift = .dedent;
switch (in_line) {
.empty => unreachable,
.scalar => |str| try list.append(try valueFromString(arena_alloc, str)),
.scalar => |str| try list.append(try Value.fromScalar(arena_alloc, str)),
.flow_list => |str| try list.append(try parseFlowList(arena_alloc, str, self.dupe_behavior)),
.flow_map => |str| try list.append(try parseFlowMap(arena_alloc, str, self.dupe_behavior)),
.line_string, .space_string => |str| {
// string pushes the stack
const new_string = try appendListGetValue(list, try valueFromString(arena_alloc, str));
const new_string = try appendListGetValue(list, try Value.fromString(arena_alloc, str));
try new_string.string.append(in_line.lineEnding());
@@ -708,7 +853,8 @@ pub const Parser = struct {
expect_shift = .none;
switch (value) {
.empty => expect_shift = .indent,
.line_string, .space_string, .scalar => |str| try list.append(try valueFromString(arena_alloc, str)),
.scalar => |str| try list.append(try Value.fromScalar(arena_alloc, str)),
.line_string, .space_string => |str| try list.append(try Value.fromString(arena_alloc, str)),
.flow_list => |str| try list.append(try parseFlowList(arena_alloc, str, self.dupe_behavior)),
.flow_map => |str| try list.append(try parseFlowMap(arena_alloc, str, self.dupe_behavior)),
}
@@ -718,13 +864,14 @@ pub const Parser = struct {
if (expect_shift != .indent)
return error.UnexpectedIndent;
const new_list = try appendListGetValue(list, .{ .list = Value.List.init(arena_alloc) });
const new_list = try appendListGetValue(list, Value.newList(arena_alloc));
try stack.append(new_list);
expect_shift = .none;
switch (value) {
.empty => expect_shift = .indent,
.line_string, .space_string, .scalar => |str| try new_list.list.append(try valueFromString(arena_alloc, str)),
.scalar => |str| try new_list.list.append(try Value.fromScalar(arena_alloc, str)),
.line_string, .space_string => |str| try new_list.list.append(try Value.fromString(arena_alloc, str)),
.flow_list => |str| try new_list.list.append(try parseFlowList(arena_alloc, str, self.dupe_behavior)),
.flow_map => |str| try new_list.list.append(try parseFlowMap(arena_alloc, str, self.dupe_behavior)),
}
@@ -744,16 +891,17 @@ pub const Parser = struct {
if (line.indent != .indent)
return error.UnexpectedValue;
const new_map = try appendListGetValue(list, .{ .map = Value.Map.init(arena_alloc) });
const new_map = try appendListGetValue(list, Value.newMap(arena_alloc));
try stack.append(new_map);
expect_shift = .none;
switch (pair.val) {
.empty => {
dangling_key = pair.key;
dangling_key = try arena_alloc.dupe(u8, pair.key);
expect_shift = .indent;
},
.line_string, .space_string, .scalar => |str| try new_map.map.put(pair.key, try valueFromString(arena_alloc, str)),
.scalar => |str| try new_map.map.put(pair.key, try Value.fromScalar(arena_alloc, str)),
.line_string, .space_string => |str| try new_map.map.put(pair.key, try Value.fromString(arena_alloc, str)),
.flow_list => |str| try new_map.map.put(pair.key, try parseFlowList(arena_alloc, str, self.dupe_behavior)),
.flow_map => |str| try new_map.map.put(pair.key, try parseFlowMap(arena_alloc, str, self.dupe_behavior)),
}
@@ -772,7 +920,7 @@ pub const Parser = struct {
try putMap(
map,
dangling_key orelse return error.Fail,
try valueFromString(arena_alloc, ""),
Value.newScalar(arena_alloc),
self.dupe_behavior,
);
dangling_key = null;
@@ -799,14 +947,14 @@ pub const Parser = struct {
switch (in_line) {
.empty => unreachable,
.scalar => |str| try putMap(map, dangling_key.?, try valueFromString(arena_alloc, str), self.dupe_behavior),
.scalar => |str| try putMap(map, dangling_key.?, try Value.fromScalar(arena_alloc, str), self.dupe_behavior),
.flow_list => |str| try putMap(map, dangling_key.?, try parseFlowList(arena_alloc, str, self.dupe_behavior), self.dupe_behavior),
.flow_map => |str| {
try putMap(map, dangling_key.?, try parseFlowMap(arena_alloc, str, self.dupe_behavior), self.dupe_behavior);
},
.line_string, .space_string => |str| {
// string pushes the stack
const new_string = try putMapGetValue(map, dangling_key.?, try valueFromString(arena_alloc, str), self.dupe_behavior);
const new_string = try putMapGetValue(map, dangling_key.?, try Value.fromString(arena_alloc, str), self.dupe_behavior);
try new_string.string.append(in_line.lineEnding());
try stack.append(new_string);
expect_shift = .none;
@@ -827,14 +975,15 @@ pub const Parser = struct {
if (expect_shift != .indent or line.indent != .indent or dangling_key == null)
return error.UnexpectedValue;
const new_list = try putMapGetValue(map, dangling_key.?, .{ .list = Value.List.init(arena_alloc) }, self.dupe_behavior);
const new_list = try putMapGetValue(map, dangling_key.?, Value.newList(arena_alloc), self.dupe_behavior);
try stack.append(new_list);
dangling_key = null;
expect_shift = .none;
switch (value) {
.empty => expect_shift = .indent,
.line_string, .space_string, .scalar => |str| try new_list.list.append(try valueFromString(arena_alloc, str)),
.scalar => |str| try new_list.list.append(try Value.fromScalar(arena_alloc, str)),
.line_string, .space_string => |str| try new_list.list.append(try Value.fromString(arena_alloc, str)),
.flow_list => |str| try new_list.list.append(try parseFlowList(arena_alloc, str, self.dupe_behavior)),
.flow_map => |str| try new_list.list.append(try parseFlowMap(arena_alloc, str, self.dupe_behavior)),
}
@@ -846,9 +995,10 @@ pub const Parser = struct {
.none, .dedent => switch (pair.val) {
.empty => {
expect_shift = .indent;
dangling_key = pair.key;
dangling_key = try arena_alloc.dupe(u8, pair.key);
},
.line_string, .space_string, .scalar => |str| try putMap(map, pair.key, try valueFromString(arena_alloc, str), self.dupe_behavior),
.scalar => |str| try putMap(map, pair.key, try Value.fromScalar(arena_alloc, str), self.dupe_behavior),
.line_string, .space_string => |str| try putMap(map, pair.key, try Value.fromString(arena_alloc, str), self.dupe_behavior),
.flow_list => |str| try putMap(map, pair.key, try parseFlowList(arena_alloc, str, self.dupe_behavior), self.dupe_behavior),
.flow_map => |str| try putMap(map, pair.key, try parseFlowMap(arena_alloc, str, self.dupe_behavior), self.dupe_behavior),
},
@@ -856,16 +1006,17 @@ pub const Parser = struct {
.indent => {
if (expect_shift != .indent or dangling_key == null) return error.UnexpectedValue;
const new_map = try putMapGetValue(map, dangling_key.?, .{ .map = Value.Map.init(arena_alloc) }, self.dupe_behavior);
const new_map = try putMapGetValue(map, dangling_key.?, Value.newMap(arena_alloc), self.dupe_behavior);
try stack.append(new_map);
dangling_key = null;
switch (pair.val) {
.empty => {
expect_shift = .indent;
dangling_key = pair.key;
dangling_key = try arena_alloc.dupe(u8, pair.key);
},
.line_string, .space_string, .scalar => |str| try new_map.map.put(pair.key, try valueFromString(arena_alloc, str)),
.scalar => |str| try new_map.map.put(pair.key, try Value.fromScalar(arena_alloc, str)),
.line_string, .space_string => |str| try new_map.map.put(pair.key, try Value.fromString(arena_alloc, str)),
.flow_list => |str| try new_map.map.put(pair.key, try parseFlowList(arena_alloc, str, self.dupe_behavior)),
.flow_map => |str| try new_map.map.put(pair.key, try parseFlowMap(arena_alloc, str, self.dupe_behavior)),
}
@@ -887,17 +1038,18 @@ pub const Parser = struct {
switch (state) {
.initial => switch (self.default_object) {
.string => document.root = .{ .string = std.ArrayList(u8).init(arena_alloc) },
.list => document.root = .{ .list = Value.List.init(arena_alloc) },
.map => document.root = .{ .map = Value.Map.init(arena_alloc) },
.list => document.root = Value.newList(arena_alloc),
.map => document.root = Value.newMap(arena_alloc),
.fail => return error.EmptyDocument,
},
.value => switch (stack.getLast().*) {
// remove the final trailing newline or space
.string => |*string| _ = string.popOrNull(),
.scalar, .string => |*string| _ = string.popOrNull(),
// if we have a dangling -, attach an empty string to it
.list => |*list| if (expect_shift == .indent) try list.append(try valueFromString(arena_alloc, "")),
.list => |*list| if (expect_shift == .indent) try list.append(Value.newScalar(arena_alloc)),
// if we have a dangling "key:", attach an empty string to it
.map => |*map| if (dangling_key) |dk| try putMap(map, dk, try valueFromString(arena_alloc, ""), self.dupe_behavior),
.map => |*map| if (dangling_key) |dk| try putMap(map, dk, Value.newScalar(arena_alloc), self.dupe_behavior),
.flow_list, .flow_map => {},
},
.done => {},
}
@@ -905,12 +1057,6 @@ pub const Parser = struct {
return document;
}
fn valueFromString(alloc: std.mem.Allocator, buffer: []const u8) Error!Value {
var result: Value = .{ .string = try std.ArrayList(u8).initCapacity(alloc, buffer.len) };
result.string.appendSliceAssumeCapacity(buffer);
return result;
}
fn parseFlowList(alloc: std.mem.Allocator, contents: []const u8, dupe_behavior: DuplicateKeyBehavior) Error!Value {
var parser = try FlowParser.initList(alloc, contents);
defer parser.deinit();
@@ -1067,8 +1213,8 @@ pub const FlowParser = struct {
const parent = self.stack.getLastOrNull() orelse return .done;
return switch (parent.value.*) {
.list => .want_list_separator,
.map => .want_map_separator,
.flow_list => .want_list_separator,
.flow_map => .want_map_separator,
else => return error.BadState,
};
}
@@ -1077,12 +1223,12 @@ pub const FlowParser = struct {
// prime the stack:
switch (self.state) {
.want_list_item => {
self.root = Value.newList(self.alloc);
self.root = Value.newFlowList(self.alloc);
self.stack = try FlowStack.initCapacity(self.alloc, 1);
self.stack.appendAssumeCapacity(.{ .value = &self.root });
},
.want_map_key => {
self.root = Value.newMap(self.alloc);
self.root = Value.newFlowMap(self.alloc);
self.stack = try FlowStack.initCapacity(self.alloc, 1);
self.stack.appendAssumeCapacity(.{ .value = &self.root });
},
@@ -1101,15 +1247,15 @@ pub const FlowParser = struct {
',' => {
// empty value
const tip = try getStackTip(self.stack);
try tip.value.list.append(try Value.fromString(self.alloc, ""));
try tip.value.flow_list.append(Value.newScalar(self.alloc));
tip.item_start = idx + 1;
},
'{' => {
const tip = try getStackTip(self.stack);
const new_map = try Parser.appendListGetValue(
&tip.value.list,
Value.newMap(self.alloc),
&tip.value.flow_list,
Value.newFlowMap(self.alloc),
);
tip.item_start = idx;
@@ -1120,8 +1266,8 @@ pub const FlowParser = struct {
const tip = try getStackTip(self.stack);
const new_list = try Parser.appendListGetValue(
&tip.value.list,
Value.newList(self.alloc),
&tip.value.flow_list,
Value.newFlowList(self.alloc),
);
tip.item_start = idx;
@@ -1130,10 +1276,8 @@ pub const FlowParser = struct {
},
']' => {
const finished = self.stack.getLastOrNull() orelse return error.BadState;
if (finished.value.list.items.len > 0 or idx > finished.item_start)
try finished.value.list.append(
try Parser.valueFromString(self.alloc, ""),
);
if (finished.value.flow_list.items.len > 0 or idx > finished.item_start)
try finished.value.flow_list.append(Value.newScalar(self.alloc));
self.state = try self.popStack();
},
else => {
@@ -1145,8 +1289,8 @@ pub const FlowParser = struct {
',' => {
const tip = try getStackTip(self.stack);
try tip.value.list.append(
try Value.fromString(self.alloc, self.buffer[tip.item_start..idx]),
try tip.value.flow_list.append(
try Value.fromScalar(self.alloc, self.buffer[tip.item_start..idx]),
);
tip.item_start = idx + 1;
@@ -1154,11 +1298,8 @@ pub const FlowParser = struct {
},
']' => {
const finished = self.stack.getLastOrNull() orelse return error.BadState;
try finished.value.list.append(
try Parser.valueFromString(
self.alloc,
self.buffer[finished.item_start..idx],
),
try finished.value.flow_list.append(
try Value.fromScalar(self.alloc, self.buffer[finished.item_start..idx]),
);
self.state = try self.popStack();
},
@@ -1193,7 +1334,7 @@ pub const FlowParser = struct {
.consuming_map_key => switch (char) {
':' => {
const tip = try getStackTip(self.stack);
dangling_key = self.buffer[tip.item_start..idx];
dangling_key = try self.alloc.dupe(u8, self.buffer[tip.item_start..idx]);
self.state = .want_map_value;
},
@@ -1204,9 +1345,9 @@ pub const FlowParser = struct {
',' => {
const tip = try getStackTip(self.stack);
try Parser.putMap(
&tip.value.map,
&tip.value.flow_map,
dangling_key.?,
try Parser.valueFromString(self.alloc, ""),
Value.newScalar(self.alloc),
dupe_behavior,
);
@@ -1217,9 +1358,9 @@ pub const FlowParser = struct {
const tip = try getStackTip(self.stack);
const new_list = try Parser.putMapGetValue(
&tip.value.map,
&tip.value.flow_map,
dangling_key.?,
Value.newList(self.alloc),
Value.newFlowList(self.alloc),
dupe_behavior,
);
@@ -1231,9 +1372,9 @@ pub const FlowParser = struct {
const tip = try getStackTip(self.stack);
const new_map = try Parser.putMapGetValue(
&tip.value.map,
&tip.value.flow_map,
dangling_key.?,
Value.newMap(self.alloc),
Value.newFlowMap(self.alloc),
dupe_behavior,
);
@@ -1245,9 +1386,9 @@ pub const FlowParser = struct {
// the value is an empty string and this map is closed
const tip = try getStackTip(self.stack);
try Parser.putMap(
&tip.value.map,
&tip.value.flow_map,
dangling_key.?,
try Parser.valueFromString(self.alloc, ""),
Value.newScalar(self.alloc),
dupe_behavior,
);
@@ -1263,9 +1404,9 @@ pub const FlowParser = struct {
',', '}' => |term| {
const tip = try getStackTip(self.stack);
try Parser.putMap(
&tip.value.map,
&tip.value.flow_map,
dangling_key.?,
try Parser.valueFromString(self.alloc, self.buffer[tip.item_start..idx]),
try Value.fromScalar(self.alloc, self.buffer[tip.item_start..idx]),
dupe_behavior,
);
dangling_key = null;