testing: Add tests for tokenizer #10

Open Jookia opened this issue on 24 Aug - 5 comments

@Jookia Jookia commented on 24 Aug

Now that the language syntax is described, we need some tests to validate the implementation.

Hopefully this would be done with pytest and hypothesis to provide some solid verification of behaviour.
Tests that state obvious conditions should be good too.

This should start with the tokenizer's mapping between a string and tokens.
It should test that the mapping works according to the structures written in the syntax documentation.
It should also validate that errors are correctly raised and reported.

@Jookia Jookia added the testing label on 24 Aug
@Jookia Jookia changed priority from default to highest on 24 Aug
@Jookia Jookia changed milestone from No milestone to Tested and documented on 24 Aug

I've rigged up pytest and hypothesis with a simple fuzz test for the parser to start with. This has found at least one bug, which is great news.

I've decided we're going to have to hit four types of tests:

  1. Fuzzing tests that just make sure the code doesn't bleed random errors
  2. Property tests that run against random data of our choosing
  3. Unit tests that check specific edge cases and regressions
  4. Integration tests that check the binaries and build systems themselves.

This should be good enough for now. The fuzz test is already added as mentioned, though we're not using something fancy like python-afl, just hypothesis. This is suboptimal, but I picture dedicated fuzzing being run as part of the integration tests. Eventually we should have some dedicated hardware just running these fuzz tests for hours at a time to get some actually good random coverage.
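The shape of that first fuzz test can be sketched as follows. This is a stdlib-only illustration: `tokenize()` and `TokenizeError` here are stand-ins, not NewLang's actual API, and the real suite drives this with hypothesis rather than a hand-rolled `random` loop.

```python
import random
import string

class TokenizeError(Exception):
    """Stand-in for the only exception the tokenizer is allowed to raise."""

def tokenize(source):
    # Stand-in tokenizer: split on whitespace, reject unbalanced text markers.
    if source.count("StartText") != source.count("EndText"):
        raise TokenizeError("unbalanced text")
    return source.split()

def fuzz(iterations=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        source = "".join(rng.choice(string.printable)
                         for _ in range(rng.randint(0, 80)))
        try:
            tokenize(source)
        except TokenizeError:
            pass  # a controlled failure is fine
        # anything else (IndexError, RecursionError, ...) propagates
        # and fails the whole run

fuzz()
```

The point is purely the "doesn't bleed random errors" property: any exception other than the tokenizer's own error type escapes the loop and fails the test.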

We also have one unit test for a specific bug I've fixed that I found from using the little fuzzer. We're going to have a ton of unit tests for each bug we fix.

Anyway, focusing on 2 for now, I've thought about some properties to test the tokenizer and how. The input will be:

  • Random tokens. These will consist of language keywords, shebangs, symbols with random names, and notes and text containing random data.
  • Random whitespace to separate the tokens. This will be mixes of new lines, spaces, tabs, etcetera.

These will be converted to text for the tokenizer to read, which should give us back the same tokens if it has done its job of tokenizing properly.
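A minimal sketch of that round trip, using stdlib `random` in place of hypothesis strategies. The keyword list and `tokenize()` are hypothetical stand-ins for illustration only:

```python
import random

KEYWORDS = ["If", "Then", "EndIf", "Set", "To"]   # hypothetical keyword set
WHITESPACE = [" ", "\t", "\n", "  ", "\n\n"]

def tokenize(source):
    # Stand-in tokenizer: whitespace-separated words.
    return source.split()

def generate_case(rng):
    # Random tokens joined by random whitespace; returns both the
    # expected token list and the source text fed to the tokenizer.
    tokens = [rng.choice(KEYWORDS) for _ in range(rng.randint(1, 10))]
    pieces = []
    for tok in tokens:
        pieces.append(tok)
        pieces.append(rng.choice(WHITESPACE))
    return tokens, "".join(pieces)

def test_round_trip(iterations=200, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        tokens, source = generate_case(rng)
        assert tokenize(source) == tokens

test_round_trip()
```

With hypothesis the generator would instead be a strategy, e.g. built from `st.lists(st.sampled_from(KEYWORDS))`, and `@given` would drive the shrinking for us.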

The following properties should apply for valid input:

  • Input tokens separated by whitespace should equal output tokens for the most part
  • Note tokens should be skipped
  • Empty tokens should be skipped
  • The starting shebang should be skipped
  • There should be a final EOF token added by the tokenizer
  • The line and column should match the input tokens
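A couple of these valid-input properties, checked against a stand-in tokenizer (the note-skipping and EOF behaviour here is a sketch of what's described above, not NewLang's real implementation):

```python
def tokenize(source):
    # Stand-in: drops note spans, appends a final EOF token.
    out, in_note = [], False
    for w in source.split():
        if w == "StartNote":
            in_note = True
        elif w == "EndNote":
            in_note = False
        elif not in_note:
            out.append(w)
    out.append("EOF")
    return out

# Note tokens are skipped, everything else survives:
assert tokenize("Set x StartNote ignore me EndNote To") == ["Set", "x", "To", "EOF"]
# Even empty input gets the tokenizer's final EOF token:
assert tokenize("")[-1] == "EOF"
```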

The following properties should apply for invalid input:

  • Multiple shebangs should error
  • Nested texts and notes should error
  • Unclosed texts and notes should error
  • The line and column of the error should match the error tokens
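And the shape of an invalid-input test: the error type carries a line and column, and the test asserts both that the error fires and that its location matches. `TokenizeError` and the double-shebang rule here are illustrative stand-ins:

```python
class TokenizeError(Exception):
    def __init__(self, message, line, column):
        super().__init__(message)
        self.line, self.column = line, column

def tokenize(source):
    # Stand-in: a shebang is only legal on line 1.
    for lineno, line in enumerate(source.split("\n"), start=1):
        if line.startswith("#!") and lineno != 1:
            raise TokenizeError("unexpected shebang", lineno, 1)
    return source.split()

# A second shebang must error, and the location must point at it:
try:
    tokenize("#!lang\n#!again\n")
    assert False, "expected TokenizeError"
except TokenizeError as e:
    assert (e.line, e.column) == (2, 1)
```

With pytest this becomes `with pytest.raises(TokenizeError) as e:` followed by assertions on `e.value.line` and `e.value.column`.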

Currently the 'parse_file' method isn't testable since it opens a file and prints errors in a friendly way. It should be refactored to be testable.
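One way that refactor could look, sketched with a stand-in parser: move all the real work into a pure function that takes a string, and leave `parse_file` as a thin shell doing only I/O and friendly error reporting.

```python
def parse_string(source):
    # All the real work lives here; pure and trivially testable.
    # (split() is a stand-in for the actual tokenizer/parser.)
    return source.split()

def parse_file(path):
    # Thin shell: just file I/O and friendly error reporting.
    try:
        with open(path, encoding="utf-8") as f:
            return parse_string(f.read())
    except OSError as e:
        print(f"couldn't read {path}: {e}")
        return None
```

The test suite then targets `parse_string` directly and never touches the filesystem.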

Today I added tests (and fixed a bunch of bugs) relating to the text literal parser, and then added tests for the note parser.

For these I still have to add:

  • Filters to exclude BeginText/EndText from the data I'm testing
  • Whitespace fuzzing
  • Location checking
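The first item, filtering structural tokens out of generated data, maps directly onto hypothesis's `.filter()` (e.g. `st.text().filter(lambda s: s not in RESERVED)`). The same idea in a stdlib-only sketch, with hypothetical names throughout:

```python
import random
import string

RESERVED = {"StartText", "EndText", "StartNote", "EndNote"}

def random_word(rng):
    return "".join(rng.choice(string.ascii_letters)
                   for _ in range(rng.randint(1, 12)))

def random_text_content(rng, n=5):
    # Filter step: regenerate any word that collides with a
    # structural token, so it can't terminate the text early.
    words = []
    while len(words) < n:
        w = random_word(rng)
        if w not in RESERVED:
            words.append(w)
    return words

rng = random.Random(0)
assert not RESERVED & set(random_text_content(rng))
```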

Then I have to add error checking for:

  • Nested tokens
  • Unterminated tokens

The following different types of tokens need to be generated and tested:

  • Shebang
  • Keywords
  • Bool
  • Symbols

I've decided to structure the tests as testing one thing at a time, and with that I've almost finished testing all the tokens standalone, with some minor extra work to exclude certain input data. So hopefully today I'll have tests for all the working conditions done.

Next will probably be a soup test where I generate a sequence of valid tokens and ensure it outputs the correct data. I'm not too sure how this will work, but it should also test that things like shebangs and notes and text can contain keywords and weird whitespace without breaking.

Then after that I'll have some tests for bad input.

Then it's basically the same story for the parser but with tokens and an AST instead of characters and tokens.

@Jookia Jookia commented on 2 Dec

So it's a little while later. I had to drop the standalone tests and just create a soup (a model) of tokens and expected output. Right now as far as I can tell the model tests all valid inputs.

The next step is testing error checking, which is going to be interesting. I'm planning to do this by inserting 'trouble' tokens into the soup and confirming that the lexer catches these with the correct error.
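The trouble-token idea could look something like this sketch: take a known-good token sequence, splice an error-inducing token in at a random position, and assert the (stand-in) tokenizer always reports an error.

```python
import random

class TokenizeError(Exception):
    pass

def tokenize(source):
    # Stand-in: notes must be balanced and may not nest.
    in_note = False
    for w in source.split():
        if w == "StartNote":
            if in_note:
                raise TokenizeError("nested note")
            in_note = True
        elif w == "EndNote":
            if not in_note:
                raise TokenizeError("stray EndNote")
            in_note = False
    if in_note:
        raise TokenizeError("unterminated note")

def test_trouble(iterations=100, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        good = ["StartNote", "hello", "EndNote"] * rng.randint(1, 3)
        # Inject a 'trouble' token at a random position.
        pos = rng.randint(0, len(good))
        bad = good[:pos] + ["EndNote"] + good[pos:]
        try:
            tokenize(" ".join(bad))
            assert False, "expected TokenizeError"
        except TokenizeError:
            pass

test_trouble()
```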

I almost had these finished, but hit an unexpected roadblock that requires rewriting a lot of code.

First up: My code to parse Text and Note tokens was pretty standard. The approach I took was for the opening token (StartText) to switch the parser to a mode for reading data until the ending token (EndText). This seems reasonable enough, and it works! But in what situations will missing a token cause an error?

The answer is 'extremely specific ones', because if you miss writing an EndText then the parser will just keep reading until the next EndText, even one that belongs to another Text. It's not keeping track of inadvertent nesting. The same goes for adding a StartText randomly in your code: it might be closed by a later EndText, or it might land inside an existing text.

Even worse things happen if you mix compounds. The fuzz tester brought out this case: If you take code along the lines of 'StartText StartNote EndText StartNote EndNote' it is parsed as a text containing 'StartNote' then an empty note. But if you insert a premature 'EndText' after 'StartText', the code still parses fine! It parses as an empty text, then a note containing 'EndText StartNote'.

Now, this issue isn't specific to NewLang. This headache of a statement translates straightforwardly to C as this piece of code: "/*"/**/ which upon inserting the text terminator turns into ""/*"/**/ , also valid C. This works in most languages, actually!

So when does this error? My initial idea was the following situations:

  • When there's a stray StartText or EndText
  • When you insert an EndText or remove a StartText
  • When you insert a StartText or remove an EndText

These relate to the kind of 'balancing' that you'd intuitively expect when using start and end tokens.
But the parser as written won't give you these errors.

Instead the errors it will give are:

  • There's no EndText somewhere after a StartText in the file
  • There's a stray EndText that's parsed as a token

The first error is related only to text in a file. It's not on the level of tokens or anything: the EndText could be anywhere, even inside other Texts or unrelated compounds like Notes, as we saw above.

The second one applies if there's an EndText that's parsed as a token, so NOT inside a Text or Note definition. This kind of makes sense, but it's actually a little more insidious: it depends on the state of the parser.

So in order to test the second we have to implement a parser of our own. What the fuck.


The second issue is that my testing model doesn't deal with compound statements well at all. It generates a soup of tokens, each with the expected token output and the code fed to the parser. So, for example, we generate a SampleToken that contains the input code "StartText Hello EndText" and the expected output "a Text object with the value Hello".

This works fairly well: when it's time to test, it joins all the code values together, passes that to the parser, then gets the tokens back and checks that they match.

The issue comes in when injecting errors: The current approach is to take a set of good tokens, then make an error. Say, adding an 'EndText' into a Text. This requires writing code that basically parses the Text code to an extent, so it knows where to insert the EndText. In this case the parsing is mostly just 'split the data into words, insert a word, then join it back up', but the code for this is much too messy and complex, and shouldn't be something we do in the first place.


So there's a few ideas I have to fix this mess. Step one is to make Text, Note and other compound statements saner: All StartTexts must have a matching EndText and vice versa. If you miss one, you get an error. This does have a side-effect: You can no longer put 'StartText', 'EndText', 'StartNote', 'EndNote' inside texts or notes. This means you can no longer write code like "StartNote Make sure to use StartText and EndText together EndNote". At some point down the track we'll have to figure out conventions and ways to work around this in specific situations. But whatever, fuck it.
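The stricter rule could be implemented along these lines (a sketch, not the real tokenizer): while reading a text, any structural token other than the matching EndText is an immediate error, so the silent re-pairing from the fuzzer case above becomes impossible.

```python
class TokenizeError(Exception):
    pass

def read_text(words, i):
    # words[i] is StartText; consume until the matching EndText.
    # Under the stricter rule, Start/End tokens may not appear inside.
    content = []
    i += 1
    while i < len(words):
        w = words[i]
        if w == "EndText":
            return " ".join(content), i + 1
        if w in ("StartText", "StartNote", "EndNote"):
            raise TokenizeError(f"{w} inside a text")
        content.append(w)
        i += 1
    raise TokenizeError("unterminated text")

# The ambiguous fuzzer case now errors instead of silently re-pairing:
try:
    read_text("StartText StartNote EndText".split(), 0)
    assert False
except TokenizeError:
    pass

assert read_text("StartText Hello world EndText".split(), 0) == ("Hello world", 4)
```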

Step two is to make the tests a bit lower level. Having them generate the final value of the token and the code to give to the parser is a bit of a jump, since that's testing both splitting things into tokens correctly and that compound tokens are read properly. Having the generator represent the code as a list of tokens and whitespace would make modifications for testing a lot easier.

Step three is to try and split up the parser a bit more. The tests at the moment test a lot of things at once, and this is really slowing test performance down and pushing complexity up. One idea I have here is to split the parser again, so one step splits things into tokens and whitespace, and the other step takes that list of tokens and whitespace and gives back a list of just tokens, with the compound tokens being built here.
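That two-stage split might look roughly like this (a sketch with hypothetical names): stage one keeps whitespace so location tracking stays trivial, stage two folds compound tokens and discards the whitespace.

```python
import re

def scan(source):
    # Stage one: split into words and whitespace runs, keeping both,
    # so line/column tracking can be computed from the pieces.
    return [p for p in re.split(r"(\s+)", source) if p]

def fold(pieces):
    # Stage two: drop whitespace and fold compound tokens (here, notes).
    out, in_note = [], False
    for p in pieces:
        if p.isspace():
            continue
        if p == "StartNote":
            in_note = True
        elif p == "EndNote":
            in_note = False
        elif not in_note:
            out.append(p)
    return out

pieces = scan("Set x StartNote hi EndNote To 1")
assert fold(pieces) == ["Set", "x", "To", "1"]
```

Each stage can then be tested in isolation: round-tripping whitespace through `scan`, and compound folding through `fold` with hand-built piece lists.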

That's all for now.

