testing: Add tests for tokenizer #10

Open Jookia opened this issue on 24 Aug 2021 - 23 comments

@Jookia Jookia commented on 24 Aug 2021

Now that the language syntax is described, we need some tests to validate the implementation.

Hopefully this would be done with pytest and hypothesis to provide some solid verification of behaviour.
Tests that state obvious conditions should be good too.

This should start with the tokenizer's mapping between a string and tokens.
It should test that the mapping works according to the structures written in the syntax documentation.
It should also validate that errors are correctly raised and reported.

@Jookia Jookia add the testing label on 24 Aug 2021
@Jookia Jookia change priority from default to highest on 24 Aug 2021
@Jookia Jookia change milestone from No milestone to Tested and documented on 24 Aug 2021

I've rigged up pytest and hypothesis with a simple fuzz test for the parser to start with. This has found at least one bug, which is great news.

I've decided we're going to have to hit four types of tests:

  1. Fuzzing tests that just make sure the code doesn't bleed random errors
  2. Property tests that run against random data of our choosing
  3. Unit tests that check specific edge cases and regressions
  4. Integration tests that check the binaries and build systems themselves.

This should be good enough for now. The fuzz test is already added as mentioned, though we're not using something fancy like python-afl, just hypothesis. This is suboptimal but I picture something like dedicated fuzzing should be run with integration tests. Eventually we should have some dedicated hardware just running these fuzzy tests for hours at a time to get some actually good random coverage.

We also have one unit test for a specific bug I've fixed that I found from using the little fuzzer. We're going to have a ton of unit tests for each bug we fix.

Anyway, focusing on 2 for now, I've thought about some properties to test the tokenizer and how. The input will be:

  • Random tokens. These will consist of language keywords, shebangs, symbols with random names, and notes and text containing random data.
  • Random whitespace to separate the tokens. This will be mixes of new lines, spaces, tabs, etcetera.

These will be converted to text for the tokenizer to read, and hopefully give us back the same input if it's done it's job of tokenizing properly.

The following properties should apply for valid input:

  • Input tokens separated by whitespace should equal output tokens for the most part;
  • Note tokens should be skipped
  • Empty tokens should be skipped
  • The starting shebang should be skipped
  • There should be a final EOF token added by the tokenizer
  • The line and column should match the input tokens

The following properties should apply for invalid input:

  • Multiple shebangs should error
  • Nested texts and notes should error
  • Unclosed texts and notes should error
  • The line and column of the error should match the error tokens

Currently the 'parse_file' method isn't testable since it opens a file and prints errors in a friendly way. It should be refactored to be testable.

Today I added tests (and fixed a bunch of bugs) relating to the text literal parser, and then added tests for the note parser.

For these I still have to add:

  • Filters to exclude BeginText/EndText from the data I'm testing
  • Whitespace fuzzing
  • Location checking

Then I have to add error checking for:

  • Nested tokens
  • Unterminated tokens

The following different types of tokens need to be generated and tested:

  • Shebang
  • Keywords
  • Bool
  • Symbols

I've decided to structure the tests as testing one thing at a time, and with that I've almost finished testing all the tokens standalone, with some minor extra things to do exclude certain input data. So hopefully today I'll have tests for all the working conditions done.

Next will probably be a soup test where I generate a sequence of valid tokens and ensure it outputs the correct data. I'm not too sure how this will work, but it should also test that things like shebangs and notes and text can contain keywords and weird whitespace without breaking.

Then after that I'll have some tests for bad input.

Then it's basically the same story for the parser but with tokens and an AST instead of characters and tokens.

So it's a little while later. I had to drop the standalone tests and just create a soup (a model) of tokens and expected output. Right now as far as I can tell the model tests all valid inputs.

The next step is testing error checking, which is going to be interesting. I'm planning to do this by inserting 'trouble' tokens in to the soup and confirming that the lexer catches these with the correct error.

I almost had these finished, but hit an unexpected roadblock that requires rewriting a lot of code.

First up: My code to parse Text and Note tokens was pretty standard. The approach I took was for the opening token (StartText) to switch the parser to a mode for reading data until the ending token (EndText). This seems reasonable enough, and it works! But in what situations will missing a token cause an error?

The answer is 'extremely specific ones', because if you miss writing an EndText then the parser will continue reading until the next EndText that's started by another Text. It's not keeping track of inadvertent nesting. The same thing goes for adding a StartText randomly in your code. It might be closed by an EndText, it might be inside a text.

Even worse things happen if you mix compounds. The fuzz tester brought out this case: If you take code along the lines of 'StartText StartNote EndText StartNote EndNote' it is parsed as a text containing 'StartNote' then an empty note. But if you insert a premature 'EndText' after 'StartText', the code still parses fine! It parses as an empty text, then a note containing 'EndText StartNote'.

Now, this issue isn't specific to NewLang. This headache of a statement actually translates straightforward to C as this piece of code: "/*"/**/ which upon inserting the text terminator turns in to ""/*"/**/ , also valid C. This works in most languages actually!

So when does this error? My initial idea was the following situations:

  • When there's a stray StartText or EndText
  • When you insert an EndText or remove a StartText
  • When you remove a StartText or add an EndText

These related to the kind of 'balancing' that you'd intuitively expect when using start and end tokens.
But this code won't give you errors.

Instead the errors it will give are:

  • There's no EndText somewhere after a StartText in the file
  • There's a stray EndText that's parsed as a token

The first error is related only to text in a file. It's not on the level of tokens or anything, the EndText could be anywhere, even inside other Texts or unrelated compounds like Notes as we saw above.

The second one applies if there's an EndText that's parsed as a token, so NOT inside a Text or Note definition. This makes sense kind of, but it's actually a little more insidious: It depends on the state of the parser.

So in order to test the second we have to implement a parser of our own. What the fuck.

...

The second issue is that my testing model doesn't deal with compound statements well at all. It generates a soup of tokens, each with the expected token output and the code fed to the parser. So for an example, we generate a SampleToken that contains the input code value "StartText Hello EndText" and the output code value "Text object with value of Hello".

This works fairly well, when it's time to test it joins all the code values together, passes that to the parser, then gets the tokens back and checks if they match.

The issue comes in when injecting errors: The current approach is to take a set of good tokens, then make an error. Say, adding an 'EndText' in to a Text. This requires writing code that basically parses the Text code to an extent so it knows where to insert the EndText. In this case the parsing is mostly just 'split the data in to words, insert a word, then join it back up', but the code for this is much too messy and complex and shouldn't be something we do in the first place.

...

So there's a few ideas I have to fix this mess. Step one is to make Text, Note and other compound statements saner: All StartTexts must have a matching EndText and vice-versa. If you miss one, you get an error. This does have a side-effect: You can no longer put 'StartText', 'EndText', 'StartNote', 'EndNote' inside texts or notes. This means you can no longer write code like "StartNote Make sure to use StartText and EndText together EndNote" in code. At some point down the track we'll have to figure out conventions and ways to work around this in specific situations. But whatever, fuck it.

Step two is to make the tests a bit lower level. Having them generate the final value of the token and the code to give to the parser is a bit of a jump, since that's testing both splitting things in to tokens correctly and that compound tokens are read properly. Having the generator represent the code as a list of tokens and whitespace would make it a lot easier to do modifications for tetsing.

Step three is to try and split up the parser a bit more. The tests at the moment test a lot of things at once, and this is really slowing test performance done and pushing complexity up. One idea I have here is to split the parser again so one step is to split things in to tokens and whitespace, and the other step is to take a list of tokens and whitespace and give back a list of just tokens, with the compound tokens being made here.

That's all for now.

@Jookia Jookia commented on 8 Feb

Alright, as of today the new tokenizer is done. I still need to add another test but as far as I'm concerned it's done. I've managed to get in to a test-driven design workflow which is working a lot better. Next up is the rewrite of the parser. I thought I might be able to salvage this code but it's going to have to get rewritten too. The good news is that I'm making excellent progress and have clear ideas about where to go from here.

@Jookia Jookia commented on 8 Feb

One thing I forgot to mention is that I've dropped debug logging. At this point it doesn't serve any purpose given the code is less of a state machine and more of a set of passes.

Dropping back in for an update: Today I've worked out an initial way for testing errors for parsing simple things. Compounds are next which is kinda annoying as it involves recursion. Not sure how I'll do it, I might have to explore the space a little bit. I'll either split things out so each compound expression is tested then have them all integrated somehow, or just integrated from the start.

Okay, popping back in to my nightmare development log to give an update: Yesterday I realized why compound statements are so hard to test.

The first issue is that some error conditions are impossible to meet in practice. For a simple example parse_bool can error in a few ways:

  • If there's no tokens to parse
  • If the token it parses isn't True or False

But it turns out parse_bool is ONLY called if there's a True/False token based on lookahead. So it's impossible to test errors in a compound. In general lookahead means a class of errors depending on the first token being a specific value can't be tested.

The next issue is with my assumption that we can fuzz errors by changing tokens. I hit this earlier with the note/text intertwining. If you fuzz things by changing a True to some random word you don't get the 'tried to parse bool' error, because we're no longer parsing a bool in the first place. This gets messier when parsing variable streams of tokens, such as text, notes or even argument lists for function calls: Changing a True or even removing it can still result in a valid program.

The reality here is that the effect of changing a token depends on the state of the parser, so fuzzing like that isn't going to work. Well, another hard lesson learned. I'm going to have to think about this.

I understand I'm flooding the issue tracker with this nonsense, but I'm having to learn this the hard way and part of learning involves explaining what you've learned.

What are errors? With a paser errors happen when you're parsing something and encounter some unexpected token. So one way to trigger errors is to modify a token so it becomes unexpected. This is what my existing tests do. You can think about this as a state machine where you have a current state that dictates what next acceptable tokens are.

For instance with, 'StartText Hello world EndText' we know there's that StartText isn't allowed within a text, so if we write 'StartText StartText Hello world EndText' we will get a FOUND_STARTTEXT error.

You can think about this as a state machine where you have a current state that dictates what next acceptable tokens are. Compound expressions complicate this a little bit by introducing multiple levels of state, but it's still the same idea: The current state dictates if the next token is an error or not.

This is all very matter of fact and obvious. But testing it turns out to be a lot, lot trickier than I'd like. The main problem being that generating random data with specific errors is tricky.

For instance with this code statement in a wider random piece of code: 'StartText Set To EndSet EndText'
If you remove StartText then the failure is now whether Set is a valid token here. If it is, then the failure is that there's a To after the Set.
if you remove EndText then the failure could be that there's another StartText later, or that you've reached the end of the file without finding an EndText.
In really weird situations it might contain syntax that counteracts the error altogether, such as intertwining comments and strings.
Furthermore, how do you verify that the error is actually correct? Without having some kind of parser emulator to compare to, you just have to trust that it's correct. Not great for testing.

I think a better approach is to start categorizing errors as behaviours themselves, like token X can cause a FOUND_STARTTEXT error if there is another StartText behind it with no EndText between them. Or token Y can cause a NOT_TOKEN error, or NO_TOKEN if removed. Then finding a token and causing the error by going back in the token stream and creating the situation. A bit like time travel.

Not too sure how this is going to work in practice. But this approach seems the sanest at this point: Finding where an error can happen then causing it.

It's me again. Possibly final post about parser testing. I think describing errors as behaviors and creating those errors in order to test the parser is a cool idea, but I think it's an alternative to testing each parser state individually like I am now. It's a very top down approach compared to the bottom up one I have now. Looking at parsers in general, this makes sense because my recursive descent parser is basically an LL(1) parser.

Unfortunately right now I have to test each parser state individually. In the next parser revision it would be good to have things be rule-based and only having to test the rule runner. That's a bit like a domain-specific language.

@Jookia Jookia commented on 5 May

Okay, so today I wrote my first test for a compound expression and it works. So I've figured out this issue, I just need to continue writing code.

Alright. I've added support for a few literals (bool, text), a note pass, and error reporting where the parest has a context per task it does. All this is tested.

Yesterday I started refactoring my tests so they're clearer. There's still more to do but I have some preliminary observations about tests and their patterns.

Most parser tests involve creating a valid specimen, checking that it parses properly, then mutating various tokens in the specimen and checking that errors report as expected.

Compound tests have the additional requirements that it must verify sub-expressions are parsed correctly and have errors reported. This can be simplified by redefining the sub-expression using mocking to be a single token, and controlling whether it throws an error or not depending on the test.

@Jookia Jookia commented on 9 Jun

Just another update from me as I write more tests. I've hit a snag today where I have a test that checks if variable names are valid. The logic right now is that they're invalid if they're a keyword, otherwise they're valid. This seems fairly simple to test, but it's a headache because it doesn't test the values consistently. Which makes no sense. I run 100 tests and have like 20 values to test against.

The problem I found is that Hypothesis is spending a lot of time testing and generating various combinations of things I don't care about: The variable's file location, the parse context. I'm blowing the 100 test budget ensuring I test if the 'is this a valid name' works with different location values.

Yes, in theory these could affect the output of the function. Maybe 'is this a valid name' function will give an exemption if the keyword is on line 1 of the file, or the filename is 'spaghetti'. But it doesn't, and I can prove that by showing you the source code.

Not only is this giving me poorer test coverage, this is also giving worse performance. I try not to be too picky about performance, but the test suite is only going to take longer and longer to run the more code is added.

So I'm not entirely sure what to do here. I think I need to use randomness a lot more sparingly.

Just as an update, I replaced random line numbers and parse contexts with static values. It has greatly sped up tests and made them more stable in areas where the search area is small. It seems to also have solved the issue of memory leaking with tests. It now takes 15 minutes to do a Windows and Linux Jenkins build concurrently down from 60, and long tests now take 3 hours and don't get use up all my system memory.

A quick development test takes 30 seconds on my machine down from 60. However if I disable the old parser tests it only takes 8 seconds, likely because those tests have the same issues. Once the new parser is done removing those should have significant speedup.

@Jookia Jookia commented on 2 Jul

I managed to delete my notes for this post I wrote a while ago by accidentally opening the file with 'rm' instead of 'vim. 'rm' is not a good editor, would not recommend.

Anyway, Almost a month later I have the tests for the Action syntax done. I've realized a a few more things doing this:

  • Python's decorators are a headache and hard to understand
  • My own decorators are too high level and need to be replaced

These are mostly fixable by a refactor, but I've realized something a bit heavier: I haven't been writing enough documentation. In fact, I've missed the point of documentation a lot. My code works, you can navigate it, edit it fairly easily, even write it easily. But I don't have any documentation about what the code is supposed to do or what it actually implements. So for example, if you look through my tests you can understand what it tests, but you have to guess why it's being tested.

Now to be fair on myself, most projects do this. Part of being a developer to some extent requires reading other people's code and trying to understand what they were doing and why. Then you decide if the code accomplishes what you think they wanted to write. It's like a game of detective, but kind of awful.

But the problem is it doesn't have to be this way. I've seen projects with GOOD documentation. The documentation doesn't have to be fancy or anything, it just has to give me an overview of what this code SHOULD be doing and WHY. This makes it a dozen times easier to understand and read the code.

In this case, I think I need to dump a paragraph or so of text at the start of my test files explaining the following:

  • What I'm testing
  • How I'm testing it
  • What areas I'm covering in my tests

That way people can tell if I've forgotten something or missed something. In this case the other person reading it is future me: I'm spending too much time having to re-read code to load it in to my brain and understand those points I just listed.

What's even sadder is that before writing the tests I'm writing a lot of these things down in my to-do list. Just pasting that at the top of my file would save me so much work.

Okay, I've settled in to a workflow that works like this:

  • Create new test file copied from a similar test
  • Edit the documentation at the top of the file to explain what I'm testing
  • Writing code to generate valid data for the tests
  • Writing the valid test
  • Writing code to make it pass
  • Writing invalid tests
  • Writing code to make them pass
  • Run docheck to check code for style and lint errors
  • Run domut for mutation testing
  • Commit to Git

This workflow seems to be working very well for me and giving me a lot of confidence in the code I write.

I have to go back and refactor the existing test code to work along these guidelines as well as dump the template decorations I created.

So what next? I still haven't hooked up the new parser, there's a few more things to do:

  • Write an error printer that takes a machine-generated error
  • Write code that will read in text, parse it and then either print error or pass is to the interpreter

But wait, there's more! I'm still not satisfied with the tests I have. They test each component of the parser, but not the parser as a whole. So I want to test the following:

  • That the components fit together properly
  • Real code and cases that I write either work or fail correctly
  • The new parser isn't unexpectedly different from the old parser

The first goal is easy to solve. Whatever test method we pick, it will do it.

The second goal of having real cases is also fairly easy: Just write some manual test cases. However it's kind of a question as to how 'deep' this should test. Should it test just that the code doesn't error? Should it test that the code gives an exact syntax tree output complete with line numbers? The deeper you test the more fragile these tests become as you have to update them as well as any other change.

The third goal is also a little weird. To be comprehensive it would need fuzzing of valid and invalid input and to compare it. Differences in data structures between the two parsers would need to be translated and checked. But this is also test code that is written just once before the old parser is thrown out. So I'm not sure if it's worth me writing it.

I think having a library of handwritten tests for language syntax and errors that can be used to test the old parser and the new parser would be a good compromise.

@Jookia Jookia commented on 2 Aug

Okay, so not much has happened for 10 days. I've been trying to wrap my head around how to go about testing this parser error reporting. It seems simple: Just text that it outputs the correct text. But it quickly becomes a can of worms.

Consider this: We create a function that greets a user by name.

It has the following behaviour:

  • It outputs a message in the currently set language
  • The message includes the user's name
  • The message includes the user's pronouns/gender

This is fairly easy right? A function that does this would work fine:

  • Look up user's pronouns
  • Look up user's name
  • Look up the currently set language we're using
  • Get the greeting for the language
  • Get the pronouns for the language
  • Fill in the pronouns and name in the greeting
  • Return the greeting

This is great, but how testing it is a headache as by the time we get the output it's just text. The only way to test this is to construct our own translated greeting and check that it matches.

The other issue is the tight coupling to the translation service. The function is asking for things to be translated according to the current global language. That's a headache to pain as it means we have to track that or switching language would break the tests.

My proposal is that the function would instead do this:

  • Look up the user's pronouns
  • Look up the user's name
  • Return the untranslated greeting with the pronouns and name

Then later some code would do this:

  • Get the current language
  • Translate the greeting
  • Fill in the greeting with the pronouns and name

This separates creating data and translating data.

A harder question is: What should the untranslated greeting look like? Should it be English? In Linux land the answer is generally 'yes': GNU gettext is how most translations get done and the untranslated strings are English. All translations are translations of English, and this makes it easier for translators to handle. Outside Linux land the answer is generally 'no': Machine-readble are used instead. The upside of this is that you no longer store the English translation in your program and updating the English translation doesn't break anything that relies on that translation.

Another option is to just ignore the translation key in our tests and just check that the pronouns and name are correct. This is a bad idea since we also want to test any kind of localization decision process, such as different translation strings based on plurality.

It's a controversial choice that can always change later, but I think I'm going to use machine-readable identifiers instead of English translations as keys. We can always write a tool that maps between the identifiers, English and another translation if we want to make translation easier. The reason for this is testing: If the translation is always a machine-readable identifier then the test won't need to be updated with the English translation update.

Another reason that's smaller but seems significant is that English takes up more space than identifiers which kind of sucks for microcontrollers, and we can't just use untranslated assets like images or sounds as keys, those would need identifiers anyway. Just using identifiers from the start will work for any kind of asset.

Okay, that's all for now. I'll hopefully get back to working on this soon.

@Jookia Jookia commented on 2 Dec

Hello again ... me? I don't know who else is reading this. While working on translations in October I found I needed to parse the translations. That would seem like a simple task: Copy the current parser, refactor, etcetera.

But the amount of work writing tests for the parser has really burned me out. So obviously we need to do some refactoring and simplify the tests somehow. I started on that but quickly hit a wall and decided to hide from the issue for a few months. That didn't seem to really work.

The most immediate problem is that I can't refactor any of my parser code because the tests use mocking and test the internal implementation. That sucks, so now I have to fix that. I'm taking the route of having tests also define data used by other tests for error testing. Things are much better this way, but it's going to take a long time to refactor. Probably a task for next week.

I'm going to be honest here: I think I fucked up. I focused on writing testable code instead of writing code that doesn't need to be tested. Not that code doesn't have to be tested, but I feel like I would have had a much better result writing a set of parser primitives that I can reason about, test those primitives, write code that uses the parser, then test those using static examples as well as fuzz testing of some sort.

Now one might think that the solution to this is write another parser. Maybe. I don't want three incomplete parsers in the source tree. But I'm not sure. It's certainly tempting. I think a better approach is to clean up the current tests a bit and implement this plan while writing the translations parser. Then a solid comparison can be done.

@Jookia Jookia commented on 3 Dec

I'm back the next day just to note down what would improve test complexity.

The first is that I have tests that only exist to check for error propagation working correctly. This is because I can't check which context was used for each parse step. One way I can think to avoid testing this is to somehow unify success and error conditions. So if we parse something successfully we see the parse context used for error handling, and if we pass something failure it uses the same details. But this assumes that all parse contexts can be attached to parse outputs. Is this true? Can it be made true?

The second is that I have tests that check that the NO_TOKEN error gets returned. I think this is basically the same issue as before: We have to check this only because the context needs to be checked.

Unifying the error case would help move towards more automated generation as we could now define error cases as just input and output cases instead of modifications of good cases.

...

All that said, I have some more thoughts:

  • Should I be writing property tests manually? Would it not make more sense to write some kind of test schema that is used to generate inputs and outputs?
  • Should I just build a schema based parser then? How would I test the schema?
  • Should I be testing the subparsers or just the entire parser at once? If I test the entire parser at once, how would I ensure coverage? How would tests find a context to put their statement in for testing?
  • Is parser-wide fuzzing worth it if we do property tests already? What about unit tests? We still need integration tests to actually test features.

I have mixed thoughts about all this, and no other projects for reference.

I guess I just wonder here: Is property testing the correct tool for this job? I feel like what I'm doing here could be done using simple unit tests and an easy to inspect parser schema. Both would be human readable and easier to debug or extend.

I'm spending a lot of time writing complex tests for bugs that would be easy to find and debug using simpler means and better tools. I'm also not fully convinced these tests are correct due to the complexity. So ... what am I gaining?

Hello again testing nightmare friends, it's me still shovelling away in this hole. "Dig up!" I hear you cry. Sorry but we're going even deeper.

So let me re-address the problem: I have a function that is supposed to take a list of tokens and turn them in to a tree of useful data. Now to be clear, there are tons of ways to do this. I've already written two parsers. Hell I'll be writing more parsers. At this point parsers are my life. Parsing is fine.

The problem is testing the parser. The easiest way to do this is to provide some test input source code and check that it gets parsed properly, or errors properly. This is a manual process and I'm guessing this is what most people do. But it has some significant drawbacks:

  1. You have to write a ton of manual tests. You will miss stuff.
  2. You can't test specific parts of the parser at a time. You have to test the entire parser.

For instance, if the only way to test the Bool parser is to use it in a function, you now have to always wrap it in a function in your test. What if you want to change functions sometime? All your tests break.

Now to be fair, most of the time parsers deal with a format that doesn't change over time. So it is acceptable to do things this way. But in development of formats and languages you're experimenting and changing things. For instance, I haven't added functions yet. So once I do that, suddenly all my tests break. Not because bool parsing is broken, but because I'm testing the whole parser which has changed.

The problem here I realized is that parsers are chaotic systems. Changing one thing will affect everything else. This is also why random testing is so hard: The input for the parser is exponential. An analogy here is like testing that you can pick things up in a text game by recording a gameplay session to a log file. If something happens like you need to get an ability to pick up hot items then your tests break, but the issue isn't picking stuff up, it's the abilities system. Characters can still jump, just not in the current game state.

My conclusion that testing the parser in its entirety is a doomed approach. Instead various parts of the parser need to be tested in isolation then combined. This is contrary to all the advice I can read about testing parsers. So I don't know what's up with my brain, but this is how it's going to work.

...

So how do you test a parser in isolation? Well, I'm already kind of doing it but badly: We take the rule for parsing a Bool and put it in our own test parser rule set where everything else is controlled. This is basically what I'm already doing with mocking, and it's why I've managed to actually test the entire parser so far. But it's nasty code.

The reason for this is that my parser code does too many things at once. I need to split it up in to two components:

  1. The parser runner
  2. The parser rule set

The parser runner is shared between all parsers and basically interprets and runs the rule set. It ensures that errors get propagated and managed properly.

I don't know exactly the form the ruleset is going to take yet, but it's basically going to describe the expressions, and the parser runner will parse it. To test a specific expression we can copy it from the ruleset and place it in a controlled environment and perform tests on it. This would mean for each expression we only have to test the things the expression does, not the things the parser does.

...

So how do I actually implement this? I don't know yet. It's going to involve a new parser: parser3. I figure if I just start adding numbers to modules I can keep them side by side and all is well.

The main interesting problem I can think of here is whether the ruleset should be data or code.

Having it as data makes things more predictable and easier to think of: The parser interprets data, like a mini programming language with rules and laws we decide on.

Having it as code means I'm layering the parser language on top of Python and have to make sure my code only does sane thing the parser does. This means only calling parser functions and telling it to parse expressions, not parsing other expressions. This is what I refer to as a 'fucking headache'.

...

More and more I'm having very mixed feelings towards embeddeding more specific languages in other languages. It almost never seems to work well, like specifying configuration in Python. It's just not fun. You get two different languages you have to understand to solve a problem.

Hopefully in NewLang we can find an expressive way to embed data or rules used by algorithms like parsers without having to write an entirely new language. I guess the solution here would be to have some kind of data or configuration language come with NewLang that can be used to import data? I don't know. This is the stuff people would use JSON or XML for, but these are deeply nested and annoying to deal with. I guess that's a project for another day especially since I've been stuck in this issue for A YEAR AND A HALF.

But there is light at the end of the tunnel, I can feel it. I feel it.

Alrighty, time for the weekly parser check-in. At least that's what I'm telling myself.

In order to reduce parser complexity it's looking like I'm making some kind of
parser engine that takes some grammer and parsers. This actually seems to work well so far!

However it's brought up some interesting problems. Let's start with a simple example:

Issue 1

Bool = True | False

In this case a Bool can be the token 'True' or 'False'.
To do this we have to have some way for the parser to try both and pick the winning one.
The current algorithm works like this:

  1. Try option 1
  2. If it fails, try option 2
  3. If that fails, return a specific error

This is simple, but let's look at a more complicated example:

Greeting = Hello world | Hi

Here the first option is the two tokens "Hello world" and the second option is "Hi".
So what should happen if you write "Hello word"?
In this case the parser won't recognize that you meant to write Hello world!
So now you get confused because instead of telling you it got 'word' but expected 'world'
it says 'Couldn't see any greeting'.

The main takeaway I get from this is that we want to limit backtracking to some
extent and decide to commit to an option if it makes some kind of progress.

The current parser has a rule that if the first token matches we commit to that path,
so going with that might be the best idea here.

Issue 2

That's the first issue. The second issue relates to variable-length syntax. For example:

Greeting = Hello | Error if it is Goodbye
Long Greeting = StartGreeting [Greeting] EndGreeting

Ideally we would parse it like this:

  1. Read StartGreeting
  2. Read next greeting
  3. If it's EndGreeting, finish parsing
  4. If it's 'Goodbye', error
  5. If it's 'Hello', repeat

My first attempt at this was simple: Repeat parsing a Greeting as long as possible possible and then parse EndGreeting.

This worked! But it had a weird side effect: If we wrote this "StartGreeting Goodbye EndGreeting"
it would say it expected EndGreeting but got Goodbye. What?

This is because the repeat function didn't tell the difference between these two cases:

  1. Failed to parse a Greeting because it isn't Hello or Goodbye
  2. Failed to parse a Greeting because it is Goodbye

It would see both as a 'failed to parse, move on to EndGreeting'.

This seems like a contrived example, but it actually shows up when adding error messages for things like 'you may not put StartText between StartText and EndText'. For example:

"StartText Hello StartText world EndText" it would complain that "StartText" was not "EndText", not that you can't put "StartText" in an EndText.

My next attempt for this was to read all tokens up to a terminator then parse inside that. That worked much better. Except...

Issue 3

Let's say we have some syntax like this:

Command = Subject [Options...]
IfStatement = If Command Then Command Else Command EndIf

The if statement will find 'Then', then parse a command in it, then the same with 'Else', then the same with 'EndIf'. But there is a problem: What if we modify the Else to be optional?

IfStatement = If Command Then Command [Else Command] EndIf

Now we need to parse up to either Else or EndIf, then depending on the terminator we find we have to decide if we're going to end or try parsing an else.

So we need to have the 'parse until terminator' thing not parse the terminator itself but only up to the terminator, and support multiple terminators. Then the parser could say 'try parsing an optional Else, then parse EndIf'.

But at this point we more have a kind of 'parse until a valid terminator' thing. It's tempting to say that we should say that Command will only parse non-terminators, because terminators aren't part of command syntax. So something like this:

Terminator = Then | Else | EndIf
Option = Anything that isn't Terminator
Command = Subject [Options...]
IfStatement = If Command Then Command [Else Command] EndIf

This syntax would parse correctly with the rule of 'parse Option until we have an error then continue on'. Then the IfStatement would handle the actual keywords and error if the wrong one was used. So it's tempting to go with this solution.

However this would also swallow valid errors, like malformed multi-token options like StartText Hello EndText. So my conclusion at this point is that the 'parse until any of these terminators' is the solution.

Issue 4

The last issue is with lexing. Or rather, the lack of lexing.

The original parser I wrote (now named oldparse) had a lexing pass where it would convert words in the file to tokens that had a type and value. This is needed for most programming languages that aren't just a list of whitespace-separated words, but I decided to chuck it in the current parser (formerly newparse, now known as parse) and just replace it with a whitespace splitter. This simplified code and tests as tokens were much simpler.

If we're going to be parsing more syntaxes than this we need to take lexers seriously. For example:

[User]
Name=Jookia
Description=A cool person
Money=100

This syntax needs to be split up in to symbols, words and whitespace so the parser can read it properly. Just splitting by spaces makes no sense.

One approach to this is to make the parser work on characters instead of tokens, then read each character at a time and build up expressions. So we would say something like:

Group = OpenBracket [Whitespace...] Letter... [Whitespace...] EndBracket [Whitespace...] NewLine
Variable = Letter... [Whitespace...] Equals Byte... NewLine
EmptyLine = [Whitespace...] NewLine 
Line = Group | Variable

While with tokens it would look like this:

Printable = Word | Whitespace
Sentence = Printable...
Variable = Word [Whitespace] Equals Sentence NewLine
Group = OpenBracket [Whitespace] Word [Whitespace] EndBracket [Whitespace] NewLine
Line = Group | Variable

You would also have an extra pass and code for lexing in to tokens. It's added complexity and there's many many arguments online about what is the right approach.

I'm going to give my opinion here: I think merging a lexer or tokenizer and parser would be a nightmare for the following reasons:

The first is that they are two different jobs. Lexing is very simple: Go through a string of bytes and put each byte in a token, and also categorize tokens if you can. Attach line numbers and position information to the tokens while you're at it. The parser's job is much more complicated: Turn these bytes in to a tree of things where each token has a different meaning based on the current parser state. Merging these would make testing much more of a headache simply because there's more things happening at once.

The second is that skipping lexing is slower: You're dealing with many more units of data than just tokens, which means running more code. How much slower is this? I'm not going to speculate too much given there's other more important issues, but I have seen numbers floating around like a 30% performance increase with a separate lexing pass.

The third is that parsing characters means backtracking can't easily be controlled. A rule like 'if a choice makes any progress don't backtrack' works for NewLang's syntax fairly well I think. This wouldn't work with characters as we couldn't backtrack from 'BeginTime' if it tried to parse 'BeginText' and found it was instead a variable name.

The fourth issue is that sometimes lexing needs to handle things that aren't 'part' of syntax such as strings, comments and indentation. This stuff is pretty hard to handle in a parser.

Conclusion

Alright, that's about it for now. This week I'm going to try implementing a prototype lexer and more parser primitives, then parse some more syntax.

After a day of thinking I decided once again I'm being fooled by trying to box things in to words: My impression that a lexer would never error is wrong because we might end up with things like '1324xyz' which is a number but not a number. So what's the difference between a lexer and a parser? I don't know. I guess complexity.

So at this point I'm going to just say that all the junk that gets a file in to an in-memory structure is the parser, and it runs in these stages:

  1. Split bytes in to tokens of three categories: Whitespace, new lines and words
  2. Attach line information to the tokens
  3. Strip the whitespace tokens out as we only needed them for line information
  4. Strip out notes
  5. Parse texts
  6. Categorize all remaining tokens, such as bool, keywords, etc
  7. Do the nested parsing to build an abstract syntax tree

You could do steps 1-6 all at once in a single function, but that would be a headache to test, and more importantly, hard to understand, and I do not want that.

I don't really care about the performance impact of a multi-pass system at the moment. If anything I think it might be better if after splitting categorization is done immediately, so I might change that.

Labels

Priority
highest
Milestone
Tested and documented
Assignee
No one
1 participant
@Jookia