I guess arrays don't need special handling because their memory is
explicitly accounted for, but it would probably be good to check that
a sentinel-terminated array initialized as `undefined` does get the
correct sentinel value.
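A quick test along these lines would settle it. This is only a sketch, and the expectation it encodes is exactly the thing in question, so a failure here would itself be the answer:

```zig
const std = @import("std");

test "undefined sentinel-terminated array keeps its sentinel" {
    // The array's payload is left undefined; the question is whether the
    // sentinel slot at index 8 is still written with 0.
    var buf: [8:0]u8 = undefined;
    buf[0] = 'x'; // touch the payload; only the sentinel slot matters here
    try std.testing.expectEqual(@as(u8, 0), buf[8]);
}
```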
The errors in the line buffer and tokenizer now have diagnostics. The
line number is trivial to keep track of thanks to the line buffer, but
the column index requires quite a bit of juggling, since we pass
successively trimmed-down buffers to the internals of the parser;
there will probably be column-counting bugs in the future. Handling
the diagnostics is also a bit awkward, since they are now a mandatory
out-parameter of the parse functions: the user must provide a valid
diagnostics object that stays alive for the life of the parser.
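To make that ownership rule concrete, here is a rough sketch of the pattern; `Diagnostics`, `parseValue`, and the tab check are made-up stand-ins rather than the parser's real API:

```zig
const std = @import("std");

// Made-up stand-in for the real diagnostics type.
pub const Diagnostics = struct {
    line: usize = 0,
    col: usize = 0,
    message: []const u8 = "",
};

// The diagnostics pointer is a mandatory, caller-owned out-parameter and
// has to stay valid for as long as the parser is in use.
pub fn parseValue(diag: *Diagnostics, line_no: usize, line: []const u8) !void {
    // The inner code only ever sees the trimmed slice, so any column it
    // reports has to be offset by however much was trimmed away.
    var start: usize = 0;
    while (start < line.len and line[start] == ' ') : (start += 1) {}
    const trimmed = line[start..];
    for (trimmed, 0..) |c, i| {
        if (c == '\t') {
            diag.line = line_no;
            diag.col = start + i;
            diag.message = "tab characters are not allowed";
            return error.IllegalCharacter;
        }
    }
}

test "caller-provided diagnostics report line and column" {
    var diag: Diagnostics = .{};
    try std.testing.expectError(
        error.IllegalCharacter,
        parseValue(&diag, 7, "  a\tb"),
    );
    try std.testing.expectEqual(@as(usize, 7), diag.line);
    try std.testing.expectEqual(@as(usize, 3), diag.col);
}
```

Taking the diagnostics as a pointer rather than returning them by value is what forces the caller to think about the object's lifetime in the first place.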
When the buffer was separated from the tokenizer, we lost some
validation, including the really aggressive carriage return detection.
This change brings that back in full force and adds some additional
validation on top of it.
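For reference, the shape of the check is roughly this, assuming (and this is an assumption) that a `\r` is only tolerated as part of a trailing CRLF; the real rules may well be stricter:

```zig
const std = @import("std");

// Strip a trailing CRLF remnant and reject any other carriage return.
fn stripLineEnding(line: []const u8) error{IllegalCarriageReturn}![]const u8 {
    const body = if (std.mem.endsWith(u8, line, "\r")) line[0 .. line.len - 1] else line;
    if (std.mem.indexOfScalar(u8, body, '\r') != null)
        return error.IllegalCarriageReturn;
    return body;
}

test "carriage returns are only tolerated as part of a CRLF line ending" {
    try std.testing.expectEqualStrings("ok", try stripLineEnding("ok\r"));
    try std.testing.expectError(error.IllegalCarriageReturn, stripLineEnding("ba\rd"));
}
```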
With my pathological 50MiB, 10_000-line nested list test, this is
definitely slower than the one-shot parser, but peak memory usage is
5MiB compared to 120MiB for one-shot parsing. Not bad.
Obviously this result depends heavily on the fact that this particular
benchmark is 99% whitespace, which never gets copied into the
resulting document; the improvement will be (significantly) smaller
for files that are mostly data, with little indentation and few empty
lines.
But a win is a win.