-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Lean 4 #6616
Add support for Lean 4 #6616
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is Lean 4 really so syntactically different from the Lean we already support that it warrants its own entry, samples and grammar?
If this is just a case of getting improved syntax support for Lean 4, does the newer Lean 4 grammar also support earlier Lean versions? If so, it would be better to switch out just the grammar for the newer grammar.
If not, we can add support for Lean 4 as its own entry, but I recommend grouping it with the current Lean (add group: Lean
), but note, this PR doesn't meet usage requirements so won't be merged until it does.
You also need to finish off the section you're adding to the languages.yml
file as per the CONTRIBUTING.md file.
You will also need to ensure there are at least two samples for both Leans to train the classifier and ensure it can correctly identify the languages of the 4+ samples.
I'd argue yes. The syntax changes are much greater than the change from Python 2 to Python 3 (to give a hopefully more familiar example).
No, nor is there any intent that this will be supported in future. Lean 3 -> Lean 4 is an almost complete rewrite of the language that was years in the making, and it is not expected for this to happen again any time soon.
Can you explain what the requirement is that we do not meet? https://github.com/topics/lean4 suggests that we have >200 repos, though perhaps that counts forks.
I've added one more of each; could you point me to any tests I should add for this? Also note that CI will not run on this PR without you hitting the button that might as well read "this user is not trying to hack our CI". |
#5756 For the latest, as referenced in the CONTRIBUTING.md file.
The only additional tests would be for the heuristics which you've already added. The testing of the classifier is this one which can be run locally (or in Codespaces/devcontainer) too.
This is only because you're a new contributor but you should still be able to run the tests locally with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the template to list the source and licenses of all the samples you've added.
The classifier test failure is because your Lean and Lean 4 samples are not sufficiently different for the classifier to differentiate between the two languages for This could take a bit of time but you can speed up the local testing using The other test failure is clear. |
Thanks! Regarding the classification failure; is this using either the heuristics or the textmate grammars to tell the two apart? Or will adjusting those have no effect on the success rate? (and if it's not using the heuristics, what is the point of these?) |
The classifier is using neither. It is tokenising the contents of the samples and then using that to perform a bayesian classification against the samples.
It's to catch those situations where the heuristics don't find any matches. If you're 💯 certain your heuristics will correctly identify all instances of I see you've updated the search string in the OP which shows much wider usage than your original. Thanks. |
Is this tokenization done using the language grammars, or is it a single tokenizer for all languages?
I am certain that the Do the heuristics/classifier also run on
Yes, I was going to bring this up once I fixed the other issues. It took me a while to work out how to perform a case-sensitive search! |
Single tokenizer for all. The only thing that uses the grammars is the syntax highlighting enginer which is completely independent of Linguist.
99% is good enough for me 😁
No. No analysis of content in codeblocks is ever performed. The author is expected to specify the language they want if they want syntax highlighting. |
Co-authored-by: Colin Seymour <[email protected]>
Do you want me to squash the commits? |
No need; we squash on merge. |
This reduces the numbers of samples down to one, avoiding the statistical classifier.
@lildude: this should hopefully pass CI now. |
Why are you deleting the |
Running |
If I don't delete it, then the classifier test still runs. Should I add |
Oooof. This is what I anticipated would happen initially when asking about how different things are between the two languages. So whilst the classifier isn't being used to assess the Lean 4 files (as there is only one sample), your Lean 4 sample is still being used to train the classifier. As there are now two
No. There isn't one. The only option is to add more samples. Adding another Lean sample that definitely has no Lean 4 tokens should hopefully swing things back in Lean's favour for its sample.
This is deliberately not the case as the classifier is designed to be the last guess attempt for things that don't match the heuristics, hence they're not considered by the classifier at all. It is of course far from perfect and is really just a last ditch attempt. |
To be clear, this doesn't happy any more now that the
If the classifier is only considered as a last-ditch attempt, why is the test checking that it works on samples that always are solved by the first ditch and never make it to the last ditch? |
How about this one? linguist/test/test_classifier.rb Lines 47 to 50 in e211cdc
|
We're trying to avoid adding to that as any new additions should not result in exclusions as it defeats the point of the classifier. That case is one where the sample existed before the most recent improvements so we added the exception rather than removing the sample. |
Would you mind restarting CI one more time to check that everything is ok, except for the fact that you aren't happy with my deletion of the |
I've worked out what the issue is... the comments in the Lean 4 samples are being considered because the tokenizer doesn't recognise the If you remove the comments from all the samples things start behaving correctly. I'll create a PR to add support for this format of comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be good to revert bc21901 now
This reverts commit bc21901.
Done. |
Thanks for all your help! Is it too late for this to make it into #6627? |
Nope. I'll be merging master into that PR and performing one last update of the grammars before making the release tomorrow morning. |
Missed opportunity to name the language LE4N, just saying. 😉 |
Just to check how exactly do I do that here? -- with `Lean 4` as the infostring
abbreviation a_lean_3_line := tt
abbrev aLean4Line := true seems to incorrectly highlight as Lean 3 because everything after the
gives no highighting |
Correct. You can see the Lean 3 grammar is used in the HTML source too: It'll be
You need to replace the spaces in the language name with hyphens: -- with `Lean-4` as the infostring
abbreviation a_lean_3_line := tt
abbrev aLean4Line := true Case also doesn't matter either: -- with `lean-4` as the infostring
abbreviation a_lean_3_line := tt
abbrev aLean4Line := true I've just checked the Markup docs and this isn't mentioned, however we do mention it in one of the comments in the first example in our overrides doc. |
Description
Checklist:
#RRGGBB