testing: Add tests for tokenizer #10

Open Jookia opened this issue on 24 Aug - 1 comment

@Jookia Jookia commented on 24 Aug

Now that the language syntax is described, we need some tests to validate the implementation.

Hopefully this will be done with pytest and hypothesis to provide some solid verification of behaviour.
Tests that state obvious conditions should be good too.

This should start with the tokenizer's mapping between a string and tokens.
It should test that the mapping works according to the structures written in the syntax documentation.
It should also validate that errors are correctly raised and reported.
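For a sense of the scale intended, an "obvious condition" test could look something like the minimal sketch below; tokenize and the token's value attribute are placeholder names rather than the real API.

    # Minimal sketch only: 'tokenize' and '.value' are placeholder names for
    # whatever the tokenizer module actually exposes.
    def test_two_words_become_two_tokens():
        values = [t.value for t in tokenize("foo bar")]
        # The slice ignores any trailing end-of-file token the tokenizer may add.
        assert values[:2] == ["foo", "bar"]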

@Jookia Jookia added the testing label on 24 Aug
@Jookia Jookia changed the priority from default to highest on 24 Aug
@Jookia Jookia changed the milestone from No milestone to Tested and documented on 24 Aug

I've rigged up pytest and hypothesis with a simple fuzz test for the parser to start with. This has found at least one bug, which is great news.
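For reference, that fuzz test is roughly the following shape, sketched here with hypothetical parse and ParseError names (the real module exposes its own):

    from hypothesis import given, strategies as st

    @given(st.text())
    def test_parser_does_not_crash(source):
        # Any input may be rejected, but only with our own error type,
        # never with an unexpected exception like IndexError or TypeError.
        try:
            parse(source)      # hypothetical entry point
        except ParseError:     # hypothetical error type
            pass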

I've decided we're going to have to hit four types of tests:

  1. Fuzzing tests that just make sure the code doesn't bleed random errors
  2. Property tests that run against random data of our choosing
  3. Unit tests that check specific edge cases and regressions
  4. Integration tests that check the binaries and build systems themselves.

This should be good enough for now. The fuzz test is already added as mentioned, though we're not using anything fancy like python-afl, just hypothesis. This is suboptimal, but I picture dedicated fuzzing being run as part of the integration tests. Eventually we should have some dedicated hardware just running these fuzz tests for hours at a time to get genuinely good random coverage.

We also have one unit test for a specific bug I found using the little fuzzer and have since fixed. We're going to end up with a ton of these unit tests, one for each bug we fix.
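The pattern for those regression tests is simple, something like the sketch below; the input string and the ParseError name are placeholders, since each real test pins down the exact string the fuzzer produced:

    import pytest

    def test_fuzzer_regression_raises_clean_error():
        # Placeholder input and error type: the real test uses the exact
        # failing string and asserts whatever the fix made it do.
        with pytest.raises(ParseError):
            parse("\x00")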

Anyway, focusing on type 2 for now, I've thought about which properties to test on the tokenizer and how. The input will be:

  • Random tokens. These will consist of language keywords, shebangs, symbols with random names, and notes and text containing random data.
  • Random whitespace to separate the tokens. This will be a mix of new lines, spaces, tabs, etcetera.

These will be converted to text for the tokenizer to read, which should hopefully give us back the same tokens if it's done its job of tokenizing properly.
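One way to build that generator with hypothesis might look like the sketch below; the keyword list, the tokenize name, and the token attributes are stand-ins for whatever the real tokenizer defines:

    from hypothesis import given, strategies as st

    keywords = st.sampled_from(["If", "Then", "Else"])   # placeholder keyword list
    symbols = st.text(alphabet="abcdefghijklmnopqrstuvwxyz", min_size=1)
    tokens = st.one_of(keywords, symbols)
    whitespace = st.text(alphabet=" \t\n", min_size=1)

    @given(st.lists(st.tuples(tokens, whitespace)))
    def test_tokens_round_trip(pairs):
        source = "".join(token + space for token, space in pairs)
        expected = [token for token, _ in pairs]
        result = tokenize(source)                 # hypothetical entry point
        # Drop the final EOF token described in the property list below.
        assert [t.value for t in result[:-1]] == expected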

The following properties should apply for valid input (see the sketch after this list):

  • Input tokens separated by whitespace should equal output tokens, apart from the exceptions below
  • Note tokens should be skipped
  • Empty tokens should be skipped
  • The starting shebang should be skipped
  • There should be a final EOF token added by the tokenizer
  • The line and column of each output token should match its position in the input
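A couple of these could be checked along the following lines, again using placeholder names (tokenize, .type, .line) and assuming 1-based line numbering:

    from hypothesis import given, strategies as st

    words = st.lists(st.text(alphabet="abcdefghijklmnopqrstuvwxyz", min_size=1),
                     min_size=1)

    @given(words)
    def test_final_token_is_eof(names):
        tokens = tokenize(" ".join(names))        # hypothetical entry point
        assert tokens[-1].type == "EOF"

    @given(words)
    def test_line_numbers_match_input(names):
        tokens = tokenize("\n".join(names))       # one word per line
        for expected_line, token in enumerate(tokens[:-1], start=1):
            assert token.line == expected_line    # assumes 1-based lines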

The following properties should apply for invalid input (again, see the sketch after the list):

  • Multiple shebangs should error
  • Nested texts and notes should error
  • Unclosed texts and notes should error
  • The line and column reported by the error should match the token that caused it
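The error side might be checked like this, assuming a TokenizeError that carries line and column information and the usual #! spelling for shebangs; both are assumptions about the implementation:

    import pytest

    def test_multiple_shebangs_error():
        with pytest.raises(TokenizeError):
            tokenize("#!/bin/one\n#!/bin/two\n")

    def test_error_location_matches_offending_token():
        with pytest.raises(TokenizeError) as excinfo:
            tokenize("#!/bin/one\n#!/bin/two\n")
        assert excinfo.value.line == 2            # assumed error attributes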

Currently the 'parse_file' method isn't testable since it opens a file and prints errors in a friendly way. It should be refactored to be testable.
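One possible shape for that refactor, sketched with made-up names: keep the file handling and friendly error printing in a thin wrapper, and move the pure string-to-tokens work into a function the tests can call directly.

    def tokenize_source(text):
        """Pure string-to-tokens function the tests exercise directly."""
        ...

    def parse_file(path):
        """Thin untested wrapper: file I/O plus friendly error printing."""
        with open(path) as handle:
            text = handle.read()
        try:
            return tokenize_source(text)
        except TokenizeError as error:
            print_friendly_error(error)           # hypothetical helper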

Labels
testing
Priority
highest
Milestone
Tested and documented
Assignee
No one
1 participant
@Jookia