testing: Add tests for tokenizer #10

Open Jookia opened this issue on 24 Aug 2021 - 18 comments

@Jookia Jookia commented on 24 Aug 2021

Now that the language syntax is described, we need some tests to validate the implementation.

Hopefully this would be done with pytest and hypothesis to provide some solid verification of behaviour.
Tests that state obvious conditions should be good too.

This should start with the tokenizer's mapping between a string and tokens.
It should test that the mapping works according to the structures written in the syntax documentation.
It should also validate that errors are correctly raised and reported.

@Jookia Jookia added the testing label on 24 Aug 2021
@Jookia Jookia changed priority from default to highest on 24 Aug 2021
@Jookia Jookia changed the milestone from No milestone to Tested and documented on 24 Aug 2021

I've rigged up pytest and hypothesis with a simple fuzz test for the parser to start with. This has found at least one bug, which is great news.

I've decided we're going to have to hit four types of tests:

  1. Fuzzing tests that just make sure the code doesn't bleed random errors
  2. Property tests that run against random data of our choosing
  3. Unit tests that check specific edge cases and regressions
  4. Integration tests that check the binaries and build systems themselves.

This should be good enough for now. The fuzz test is already added as mentioned, though we're not using anything fancy like python-afl, just hypothesis. This is suboptimal, but I picture dedicated fuzzing being run alongside the integration tests. Eventually we should have some dedicated hardware just running these fuzz tests for hours at a time to get genuinely good random coverage.
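
For reference, the fuzz test is roughly this shape. This is a minimal sketch: the import path, the tokenize() entry point and the ParseError type are assumptions, not the repo's actual names.

```python
# Minimal hypothesis fuzz sketch: feed random text to the tokenizer and make
# sure nothing other than the expected parse error escapes.
from hypothesis import given, strategies as st

from newlang.tokenize import tokenize, ParseError  # hypothetical names


@given(st.text())
def test_fuzz_tokenize(source):
    # The only acceptable outcomes are a result or a reported ParseError;
    # any other exception reaching here is a bug.
    try:
        tokenize(source)
    except ParseError:
        pass
```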

We also have one unit test for a specific bug I fixed after finding it with the little fuzzer. We're going to end up with a ton of unit tests like this, one for each bug we fix.

Anyway, focusing on type 2 for now, I've thought about some properties to test on the tokenizer and how to test them. The input will be:

  • Random tokens. These will consist of language keywords, shebangs, symbols with random names, and notes and text containing random data.
  • Random whitespace to separate the tokens. This will be mixes of new lines, spaces, tabs, etcetera.

These will be converted to text for the tokenizer to read, and should hopefully give us back the same input if it's done its job of tokenizing properly.

The following properties should apply for valid input:

  • Input tokens separated by whitespace should equal output tokens for the most part
  • Note tokens should be skipped
  • Empty tokens should be skipped
  • The starting shebang should be skipped
  • There should be a final EOF token added by the tokenizer
  • The line and column should match the input tokens
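
As a sketch of what the first of these properties might look like as a hypothesis test — the import path, the shape of tokenize()'s output, and the keyword list below are all assumptions, not the repo's actual API:

```python
# Round-trip property sketch: random symbols separated by random whitespace
# should come back out as the same symbols, plus a trailing EOF token.
from hypothesis import given, strategies as st

from newlang.tokenize import tokenize  # hypothetical import path

# Assumed keyword set, based on the token names discussed in this issue.
KEYWORDS = {"StartText", "EndText", "StartNote", "EndNote", "True", "False"}

# Random symbol names; a real test would draw from a much wider alphabet.
symbols = st.text(alphabet="abcdefghijklmnopqrstuvwxyz", min_size=1).filter(
    lambda word: word not in KEYWORDS
)

# Random whitespace runs used to separate the tokens.
spacing = st.text(alphabet=" \t\n", min_size=1)


@given(st.lists(st.tuples(symbols, spacing), min_size=1))
def test_symbols_round_trip(pairs):
    source = "".join(word + space for word, space in pairs)
    tokens = tokenize(source)
    # Everything except the trailing EOF should match the input words.
    assert [token.value for token in tokens[:-1]] == [word for word, _ in pairs]
    assert tokens[-1].type == "EOF"
```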

The following properties should apply for invalid input:

  • Multiple shebangs should error
  • Nested texts and notes should error
  • Unclosed texts and notes should error
  • The line and column of the error should match the error tokens

Currently the 'parse_file' method isn't testable since it opens a file and prints errors in a friendly way. It should be refactored to be testable.
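
One way the refactor could look — a sketch only, with tokenize_text() and TokenizeError as assumed names rather than the project's actual API — is to split the pure tokenizing from the file handling and error printing:

```python
# Sketch: a pure, testable core plus a thin I/O wrapper.
class TokenizeError(Exception):
    """Assumed error type carrying a message and a location."""

    def __init__(self, message, line, column):
        super().__init__(message)
        self.message, self.line, self.column = message, line, column


def tokenize_text(text):
    """Pure function: takes source text, returns tokens or raises TokenizeError."""
    ...  # the existing tokenizing logic, minus any file or print calls


def parse_file(path):
    """Thin wrapper: file I/O and friendly error printing live here."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    try:
        return tokenize_text(text)
    except TokenizeError as error:
        print(f"{path}:{error.line}:{error.column}: {error.message}")
        return None
```

Tests would then call tokenize_text() directly with strings and inspect the raised errors, without touching the filesystem.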

Today I added tests (and fixed a bunch of bugs) relating to the text literal parser, and then added tests for the note parser.

For these I still have to add:

  • Filters to exclude BeginText/EndText from the data I'm testing
  • Whitespace fuzzing
  • Location checking

Then I have to add error checking for:

  • Nested tokens
  • Unterminated tokens

The following different types of tokens need to be generated and tested:

  • Shebang
  • Keywords
  • Bool
  • Symbols

I've decided to structure the tests so they test one thing at a time, and with that I've almost finished testing all the tokens standalone, with some minor extra work to exclude certain input data. So hopefully today I'll have tests for all the working conditions done.

Next will probably be a soup test where I generate a sequence of valid tokens and ensure it outputs the correct data. I'm not too sure how this will work, but it should also test that things like shebangs and notes and text can contain keywords and weird whitespace without breaking.

Then after that I'll have some tests for bad input.

Then it's basically the same story for the parser but with tokens and an AST instead of characters and tokens.

@Jookia Jookia commented on 2 Dec

So it's a little while later. I had to drop the standalone tests and just create a soup (a model) of tokens and expected output. Right now as far as I can tell the model tests all valid inputs.

The next step is testing error checking, which is going to be interesting. I'm planning to do this by inserting 'trouble' tokens into the soup and confirming that the lexer catches these with the correct error.

I almost had these finished, but hit an unexpected roadblock that requires rewriting a lot of code.

First up: My code to parse Text and Note tokens was pretty standard. The approach I took was for the opening token (StartText) to switch the parser to a mode for reading data until the ending token (EndText). This seems reasonable enough, and it works! But in what situations will missing a token cause an error?

The answer is 'extremely specific ones', because if you miss writing an EndText then the parser will continue reading until the next EndText that's started by another Text. It's not keeping track of inadvertent nesting. The same thing goes for adding a StartText randomly in your code. It might be closed by an EndText, it might be inside a text.

Even worse things happen if you mix compounds. The fuzz tester brought out this case: If you take code along the lines of 'StartText StartNote EndText StartNote EndNote' it is parsed as a text containing 'StartNote' then an empty note. But if you insert a premature 'EndText' after 'StartText', the code still parses fine! It parses as an empty text, then a note containing 'EndText StartNote'.

Now, this issue isn't specific to NewLang. This headache of a statement actually translates straightforwardly to C as this piece of code: "/*"/**/ which upon inserting the text terminator turns into ""/*"/**/ , also valid C. This works in most languages, actually!

So when does this error? My initial idea was the following situations:

  • When there's a stray StartText or EndText
  • When you insert a StartText or remove an EndText
  • When you remove a StartText or add an EndText

These relate to the kind of 'balancing' that you'd intuitively expect when using start and end tokens.
But this code won't give you those errors.

Instead the errors it will give are:

  • There's no EndText somewhere after a StartText in the file
  • There's a stray EndText that's parsed as a token

The first error is related only to the raw text of the file. It's not at the level of tokens at all: the EndText could be anywhere, even inside other Texts or unrelated compounds like Notes, as we saw above.

The second one applies if there's an EndText that's parsed as a token, so NOT inside a Text or Note definition. This kind of makes sense, but it's actually a little more insidious: it depends on the state of the parser.

So in order to test the second we have to implement a parser of our own. What the fuck.

...

The second issue is that my testing model doesn't deal with compound statements well at all. It generates a soup of tokens, each with the expected token output and the code fed to the parser. So, for example, we generate a SampleToken that contains the input code value "StartText Hello EndText" and the expected output value "Text object with value of Hello".

This works fairly well: when it's time to test, it joins all the code values together, passes that to the parser, then gets the tokens back and checks whether they match.
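
In sketch form, the model is roughly this — SampleToken, sample_tokens() and tokenize() are illustrative names here, not the actual test code:

```python
# Soup model sketch: each sample carries the code fed to the tokenizer and
# the token it should produce; the test joins and compares.
from dataclasses import dataclass
from typing import Any

from hypothesis import given, strategies as st


@dataclass
class SampleToken:
    code: str      # text fed to the tokenizer, e.g. 'StartText Hello EndText'
    expected: Any  # token it should produce, e.g. a Text object valued "Hello"


def sample_tokens():
    # Placeholder strategy: the real code generates every kind of token.
    return st.sampled_from([
        SampleToken("True", ("Bool", True)),
        SampleToken("StartText Hello EndText", ("Text", "Hello")),
    ])


@given(st.lists(sample_tokens(), min_size=1))
def test_soup(samples):
    source = " ".join(sample.code for sample in samples)
    output = tokenize(source)  # assumed entry point; EOF handling omitted
    assert output == [sample.expected for sample in samples]
```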

The issue comes in when injecting errors: the current approach is to take a set of good tokens, then introduce an error. Say, adding an 'EndText' into a Text. This requires writing code that basically parses the Text code to an extent, so it knows where to insert the EndText. In this case the parsing is mostly just 'split the data into words, insert a word, then join it back up', but the code for this is much too messy and complex and shouldn't be something we do in the first place.

...

So there are a few ideas I have to fix this mess. Step one is to make Text, Note and other compound statements saner: all StartTexts must have a matching EndText and vice-versa. If you miss one, you get an error. This does have a side-effect: you can no longer put 'StartText', 'EndText', 'StartNote', 'EndNote' inside texts or notes. This means you can no longer write code like "StartNote Make sure to use StartText and EndText together EndNote". At some point down the track we'll have to figure out conventions and ways to work around this in specific situations. But whatever, fuck it.

Step two is to make the tests a bit lower level. Having them generate the final value of the token and the code to give to the parser is a bit of a jump, since that's testing both that things are split into tokens correctly and that compound tokens are read properly. Having the generator represent the code as a list of tokens and whitespace would make it a lot easier to do modifications for testing.

Step three is to try and split up the parser a bit more. The tests at the moment test a lot of things at once, and this is really slowing test performance down and pushing complexity up. One idea I have here is to split the parser again so that one step splits things into tokens and whitespace, and the other step takes that list of tokens and whitespace and gives back a list of just tokens, with the compound tokens being built here.
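
A sketch of that two-pass idea, with illustrative names rather than the project's actual code: pass one only cuts raw text into words and whitespace runs, pass two drops the whitespace and folds compound tokens like Text.

```python
import re


def split_words(text):
    """Pass one: return the alternating words and whitespace runs, in order."""
    return [piece for piece in re.split(r"(\s+)", text) if piece]


def fold_compounds(pieces):
    """Pass two: drop whitespace and fold StartText ... EndText into one token."""
    tokens, index = [], 0
    while index < len(pieces):
        piece = pieces[index]
        if piece.isspace():
            index += 1
        elif piece == "StartText":
            if "EndText" not in pieces[index + 1:]:
                raise ValueError("unterminated text")  # real code: its own error type
            end = pieces.index("EndText", index + 1)
            words = [p for p in pieces[index + 1:end] if not p.isspace()]
            tokens.append(("Text", " ".join(words)))
            index = end + 1
        else:
            tokens.append(("Symbol", piece))
            index += 1
    return tokens
```

Each pass is then small enough to test on its own, which is the point of the split.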

That's all for now.

@Jookia Jookia commented on 8 Feb

Alright, as of today the new tokenizer is done. I still need to add another test, but as far as I'm concerned it's done. I've managed to get into a test-driven design workflow which is working a lot better. Next up is the rewrite of the parser. I thought I might be able to salvage this code, but it's going to have to be rewritten too. The good news is that I'm making excellent progress and have clear ideas about where to go from here.

@Jookia Jookia commented on 8 Feb

One thing I forgot to mention is that I've dropped debug logging. At this point it doesn't serve any purpose given the code is less of a state machine and more of a set of passes.

Dropping back in for an update: Today I've worked out an initial way of testing errors when parsing simple things. Compounds are next, which is kind of annoying as it involves recursion. I'm not sure how I'll do it; I might have to explore the space a little. I'll either split things out so each compound expression is tested on its own and then have them all integrated somehow, or just test them integrated from the start.

Okay, popping back into my nightmare development log to give an update: Yesterday I realized why compound statements are so hard to test.

The first issue is that some error conditions are impossible to meet in practice. For a simple example parse_bool can error in a few ways:

  • If there are no tokens to parse
  • If the token it parses isn't True or False

But it turns out parse_bool is ONLY called if there's a True/False token coming up, based on lookahead. So it's impossible to trigger its errors from inside a compound. In general, lookahead means a whole class of errors that depend on the first token being a specific value can't be tested.
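
To make the lookahead problem concrete, the shape of the code is roughly this (illustrative names, not the actual parser):

```python
# Sketch of why parse_bool's error paths are unreachable from a compound:
# the caller only dispatches to it after peeking at a True/False token.
class NoTokenError(Exception):
    pass


class NotABoolError(Exception):
    pass


def parse_bool(stream):
    if not stream:
        raise NoTokenError()          # error 1: no tokens to parse
    token = stream.pop(0)
    if token not in ("True", "False"):
        raise NotABoolError(token)    # error 2: token isn't True or False
    return token == "True"


def parse_value(stream):
    # Dispatch by lookahead: parse_bool is only reached when the next token
    # already is True or False, so neither error above can fire from here.
    if stream and stream[0] in ("True", "False"):
        return parse_bool(stream)
    raise NotImplementedError("other value kinds go here")
```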

The next issue is with my assumption that we can fuzz errors by changing tokens. I hit this earlier with the note/text intertwining. If you fuzz things by changing a True to some random word you don't get the 'tried to parse bool' error, because we're no longer parsing a bool in the first place. This gets messier when parsing variable streams of tokens, such as text, notes or even argument lists for function calls: Changing a True or even removing it can still result in a valid program.

The reality here is that the effect of changing a token depends on the state of the parser, so fuzzing like that isn't going to work. Well, another hard lesson learned. I'm going to have to think about this.

I understand I'm flooding the issue tracker with this nonsense, but I'm having to learn this the hard way and part of learning involves explaining what you've learned.

What are errors? With a parser, errors happen when you're parsing something and encounter an unexpected token. So one way to trigger errors is to modify a token so it becomes unexpected. This is what my existing tests do.

For instance, with 'StartText Hello world EndText' we know that StartText isn't allowed within a text, so if we write 'StartText StartText Hello world EndText' we will get a FOUND_STARTTEXT error.

You can think about this as a state machine where the current state dictates which tokens are acceptable next. Compound expressions complicate this a little by introducing multiple levels of state, but it's still the same idea: the current state dictates whether the next token is an error or not.

This is all very matter-of-fact and obvious. But testing it turns out to be a lot, lot trickier than I'd like. The main problem is that generating random data containing specific errors is tricky.

For instance, take this statement inside a wider random piece of code: 'StartText Set To EndSet EndText'

  • If you remove StartText, then the failure is now whether Set is a valid token here. If it is, then the failure is that there's a To after the Set.
  • If you remove EndText, then the failure could be that there's another StartText later, or that you've reached the end of the file without finding an EndText.
  • In really weird situations the surrounding code might contain syntax that counteracts the error altogether, such as intertwined comments and strings.

Furthermore, how do you verify that the error is actually correct? Without some kind of parser emulator to compare against, you just have to trust that it's right. Not great for testing.

I think a better approach is to start categorizing errors as behaviours themselves: token X can cause a FOUND_STARTTEXT error if there is another StartText behind it with no EndText between them, or token Y can cause a NOT_TOKEN error, or a NO_TOKEN error if removed. Then a test finds a token and causes the error by going back in the token stream and creating that situation. A bit like time travel.

Not too sure how this is going to work in practice. But this approach seems the sanest at this point: Finding where an error can happen then causing it.

It's me again. Possibly final post about parser testing. I think describing errors as behaviors and creating those errors in order to test the parser is a cool idea, but I think it's an alternative to testing each parser state individually like I am now. It's a very top down approach compared to the bottom up one I have now. Looking at parsers in general, this makes sense because my recursive descent parser is basically an LL(1) parser.

Unfortunately, right now I have to test each parser state individually. In the next parser revision it would be good to have things be rule-based and only have to test the rule runner. That's a bit like a domain-specific language.

@Jookia Jookia commented on 5 May

Okay, so today I wrote my first test for a compound expression and it works. So I've figured out this issue, I just need to continue writing code.

Alright. I've added support for a few literals (bool, text), a note pass, and error reporting where the parser has a context per task it does. All of this is tested.

Yesterday I started refactoring my tests so they're clearer. There's still more to do but I have some preliminary observations about tests and their patterns.

Most parser tests involve creating a valid specimen, checking that it parses properly, then mutating various tokens in the specimen and checking that errors are reported as expected.

Compound tests have the additional requirement that they must verify sub-expressions are parsed correctly and have their errors reported. This can be simplified by using mocking to redefine the sub-expression as a single token, and controlling whether it throws an error or not depending on the test.
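
A sketch of that pattern, with assumed names throughout: parse_set() stands in for a compound parser that calls parse_value() for its sub-expression, and ParseError for the parser's error type. None of these are confirmed API names.

```python
# Mock the sub-expression parser so the compound logic is tested in isolation.
from unittest.mock import patch

import pytest

from newlang import parse  # hypothetical module


def test_compound_uses_subexpression():
    with patch.object(parse, "parse_value", return_value="SUBVALUE") as sub:
        result = parse.parse_set(["Set", "name", "To", "True", "EndSet"])
    assert sub.called
    assert result.value == "SUBVALUE"


def test_compound_reports_subexpression_error():
    with patch.object(parse, "parse_value", side_effect=parse.ParseError("boom")):
        with pytest.raises(parse.ParseError):
            parse.parse_set(["Set", "name", "To", "True", "EndSet"])
```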

@Jookia Jookia commented on 9 Jun

Just another update from me as I write more tests. I've hit a snag today with a test that checks whether variable names are valid. The logic right now is that they're invalid if they're a keyword, otherwise they're valid. This seems fairly simple to test, but it's a headache because the values don't get tested consistently. Which makes no sense at first glance: I run 100 tests and have only about 20 values to test against.

The problem I found is that Hypothesis is spending a lot of time generating and testing combinations of things I don't care about: the variable's file location, the parse context. I'm blowing the 100-test budget making sure the 'is this a valid name' check works with different location values.

Yes, in theory these could affect the output of the function. Maybe the 'is this a valid name' function gives an exemption if the keyword is on line 1 of the file, or the filename is 'spaghetti'. But it doesn't, and I can prove that by showing you the source code.

Not only is this giving me poorer test coverage, this is also giving worse performance. I try not to be too picky about performance, but the test suite is only going to take longer and longer to run the more code is added.

So I'm not entirely sure what to do here. I think I need to use randomness a lot more sparingly.

Just as an update, I replaced random line numbers and parse contexts with static values. It has greatly sped up the tests and made them more stable in areas where the search space is small. It also seems to have solved the issue of tests leaking memory. It now takes 15 minutes to do a Windows and Linux Jenkins build concurrently, down from 60, and long tests now take 3 hours and don't use up all my system memory.
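
Concretely, the change looks something like this sketch, where is_valid_name() and the location/context stand-ins are assumed names: only the piece under test is drawn randomly, so Hypothesis spends its whole budget on names.

```python
from hypothesis import given, strategies as st

# Assumed keyword set, per the token names discussed in this issue.
KEYWORDS = {"StartText", "EndText", "StartNote", "EndNote", "True", "False"}

STATIC_LOCATION = ("test", 1, 1)      # stand-in for the real location type
STATIC_CONTEXT = "parsing a symbol"   # stand-in for the real parse context

# Draw either a keyword or a random word, so both branches get exercised.
names = st.sampled_from(sorted(KEYWORDS)) | st.text(
    alphabet="abcdefghijklmnopqrstuvwxyz", min_size=1
)


# Before: @given(names, locations(), contexts()) spread the ~100 examples
# over inputs that never change the outcome. After: only the name varies.
@given(names)
def test_keywords_are_invalid_names(name):
    assert is_valid_name(name, STATIC_LOCATION, STATIC_CONTEXT) == (name not in KEYWORDS)
```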

A quick development test run takes 30 seconds on my machine, down from 60. However, if I disable the old parser tests it only takes 8 seconds, likely because those tests have the same issues. Once the new parser is done, removing those should give a significant speedup.

@Jookia Jookia commented on 2 Jul

I managed to delete the notes for this post, which I wrote a while ago, by accidentally opening the file with 'rm' instead of 'vim'. 'rm' is not a good editor, would not recommend.

Anyway, almost a month later I have the tests for the Action syntax done. I've realized a few more things doing this:

  • Python's decorators are a headache and hard to understand
  • My own decorators are too high level and need to be replaced

These are mostly fixable by a refactor, but I've realized something a bit heavier: I haven't been writing enough documentation. In fact, I've missed the point of documentation a lot. My code works; you can navigate it, edit it fairly easily, even write it easily. But I don't have any documentation about what the code is supposed to do or what it actually implements. So for example, if you look through my tests you can understand what they test, but you have to guess why they're being tested.

Now to be fair on myself, most projects do this. Part of being a developer to some extent requires reading other people's code and trying to understand what they were doing and why. Then you decide if the code accomplishes what you think they wanted to write. It's like a game of detective, but kind of awful.

But the problem is it doesn't have to be this way. I've seen projects with GOOD documentation. The documentation doesn't have to be fancy or anything, it just has to give me an overview of what this code SHOULD be doing and WHY. This makes it a dozen times easier to understand and read the code.

In this case, I think I need to dump a paragraph or so of text at the start of my test files explaining the following:

  • What I'm testing
  • How I'm testing it
  • What areas I'm covering in my tests

That way people can tell if I've forgotten or missed something. In this case the other person reading it is future me: I'm spending too much time re-reading code to load it into my brain and understand the points I just listed.

What's even sadder is that before writing the tests I'm writing a lot of these things down in my to-do list. Just pasting that at the top of my file would save me so much work.
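
For example, a test file header might read something like this (an illustrative sketch, not a file from the repo):

```python
"""Tests for tokenizing text literals.

What is tested: the StartText ... EndText form and the Text token it produces.
How: generate random words, wrap them in StartText/EndText, join them with
random whitespace, and check the resulting token's value and location.
Coverage: valid texts, unterminated texts, and stray StartText/EndText errors.
"""
```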

Okay, I've settled in to a workflow that works like this:

  • Create new test file copied from a similar test
  • Edit the documentation at the top of the file to explain what I'm testing
  • Write code to generate valid data for the tests
  • Write the valid test
  • Write code to make it pass
  • Write invalid tests
  • Write code to make them pass
  • Run docheck to check code for style and lint errors
  • Run domut for mutation testing
  • Commit to Git

This workflow seems to be working very well for me and giving me a lot of confidence in the code I write.

I have to go back and refactor the existing test code to follow these guidelines, as well as dump the template decorators I created.

So what next? I still haven't hooked up the new parser, there's a few more things to do:

  • Write an error printer that takes a machine-generated error
  • Write code that will read in text, parse it, and then either print an error or pass it to the interpreter (rough sketch below)
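
A rough sketch of that second item, with assumed names for every function involved (parse_text, ParseError, format_error, the interpreter):

```python
import sys


def run_file(path, interpreter):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    try:
        tree = parse_text(text)           # assumed new-parser entry point
    except ParseError as error:
        print(format_error(path, error))  # assumed machine-error printer
        return 1
    return interpreter.run(tree)


if __name__ == "__main__":
    sys.exit(run_file(sys.argv[1], Interpreter()))  # Interpreter is assumed too
```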

But wait, there's more! I'm still not satisfied with the tests I have. They test each component of the parser, but not the parser as a whole. So I want to test the following:

  • That the components fit together properly
  • Real code and cases that I write either work or fail correctly
  • The new parser isn't unexpectedly different from the old parser

The first goal is easy to solve: whatever test method we pick will cover it.

The second goal of having real cases is also fairly easy: just write some manual test cases. However, it's an open question how 'deep' these should test. Should they test just that the code doesn't error? Should they test that the code gives an exact syntax tree output, complete with line numbers? The deeper you test, the more fragile the tests become, since you have to update them alongside every other change.

The third goal is also a little weird. To be comprehensive it would need fuzzing of valid and invalid input and a way to compare the results. Differences in data structures between the two parsers would need to be translated and checked. But this is also test code that would be written just once before the old parser is thrown out, so I'm not sure it's worth writing.

I think having a library of handwritten tests for language syntax and errors that can be used to test the old parser and the new parser would be a good compromise.

@Jookia Jookia commented on 2 Aug

Okay, so not much has happened for 10 days. I've been trying to wrap my head around how to go about testing this parser error reporting. It seems simple: just test that it outputs the correct text. But it quickly becomes a can of worms.

Consider this: We create a function that greets a user by name.

It has the following behaviour:

  • It outputs a message in the currently set language
  • The message includes the user's name
  • The message includes the user's pronouns/gender

This is fairly easy right? A function that does this would work fine:

  • Look up user's pronouns
  • Look up user's name
  • Look up the currently set language we're using
  • Get the greeting for the language
  • Get the pronouns for the language
  • Fill in the pronouns and name in the greeting
  • Return the greeting

This is great, but testing it is a headache, as by the time we get the output it's just text. The only way to test this is to construct our own translated greeting and check that it matches.

The other issue is the tight coupling to the translation service. The function is asking for things to be translated according to the current global language. That's a pain, as it means we have to track that state or switching languages would break the tests.

My proposal is that the function would instead do this:

  • Look up the user's pronouns
  • Look up the user's name
  • Return the untranslated greeting with the pronouns and name

Then later some code would do this:

  • Get the current language
  • Translate the greeting
  • Fill in the greeting with the pronouns and name

This separates creating data and translating data.

A harder question is: what should the untranslated greeting look like? Should it be English? In Linux land the answer is generally 'yes': GNU gettext is how most translations get done, and the untranslated strings are English. All translations are translations of English, and this makes it easier for translators to handle. Outside Linux land the answer is generally 'no': machine-readable identifiers are used instead. The upside of this is that you no longer store the English translation in your program, and updating the English translation doesn't break anything that relies on that key.

Another option is to just ignore the translation key in our tests and just check that the pronouns and name are correct. This is a bad idea since we also want to test any kind of localization decision process, such as different translation strings based on plurality.

It's a controversial choice that can always change later, but I think I'm going to use machine-readable identifiers instead of English translations as keys. We can always write a tool that maps between the identifiers, English, and another translation if we want to make translation easier. The reason for this is testing: if the key is always a machine-readable identifier then the tests won't need to be updated whenever the English translation changes.

Another smaller but seemingly significant reason is that English takes up more space than identifiers, which kind of sucks for microcontrollers, and we can't use untranslated assets like images or sounds as keys anyway; those would need identifiers regardless. Just using identifiers from the start will work for any kind of asset.
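
As a sketch of the split, with illustrative names and catalogue layout (nothing here is the project's actual code): the greeter returns an untranslated message, a machine-readable identifier plus parameters, and rendering into a language happens in a separate step.

```python
from dataclasses import dataclass


@dataclass
class Message:
    key: str      # machine-readable identifier, e.g. "greeting.hello"
    params: dict  # data to fill into the translated template


def greet(user):
    # Creates data only: tests can check the key and params directly,
    # without constructing a translated string or tracking a global language.
    return Message("greeting.hello", {"name": user.name, "pronouns": user.pronouns})


def render(message, catalogue, language):
    # Translation step: look up the identifier in the chosen language's
    # catalogue and fill in the parameters.
    template = catalogue[language][message.key]
    return template.format(**message.params)
```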

Okay, that's all for now. I'll hopefully get back to working on this soon.

Labels
testing
Priority
highest
Milestone
Tested and documented
Assignee
No one
1 participant
@Jookia