Bazel files misclassified as Python #182
Comments
For BUILD files, the tool could presumably check the filename: if it looks like Python and is named […]. But Starlark also occurs in https://github.com/google/copybara, and will probably start to infect other tools over time. So I guess the question is to what extent it is important to always get this right, vs. catching the common cases?
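A minimal sketch of the filename check described above, assuming the classifier has already returned a language name for the file. The function name `refine_language` and the set of file names are illustrative assumptions based on common Bazel conventions, not something taken from Enry or Linguist.

```python
import posixpath

# Assumed file names: these follow common Bazel conventions and are
# not taken from the thread, Enry, or Linguist.
STARLARK_BASENAMES = {"BUILD", "BUILD.bazel", "WORKSPACE", "WORKSPACE.bazel"}
STARLARK_EXTENSIONS = (".bzl",)


def refine_language(path: str, detected: str) -> str:
    """Relabel Python-looking files as Starlark when the name looks like Bazel."""
    if detected != "Python":
        # Only second-guess files the classifier already called Python.
        return detected
    name = posixpath.basename(path)
    if name in STARLARK_BASENAMES or name.endswith(STARLARK_EXTENSIONS):
        return "Starlark"
    return detected


print(refine_language("cmd/kubeadm/app/discovery/https/BUILD", "Python"))
print(refine_language("scripts/build.py", "Python"))
```

This only catches the common cases: Starlark embedded in other tools under arbitrary file names would still be reported as Python.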
That's a good question, indeed. Maybe the way it is now is already ok since Skylark is a dialect of Python.
There are a few shibboleths one could look for:
and so on. Some of those would be reasonably easy to automate with regular expressions, but I'm not sure how far down that rabbit hole we need to go.
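To illustrate what a regular-expression check of this kind might look like, here is a rough sketch. The specific patterns are my own assumptions, not the list from this comment: they rely on Starlark rejecting statements such as `import`, `class`, `while`, `try` and `with`, and on Starlark's `load(...)` statement. The name `guess_dialect` is hypothetical.

```python
import re

# Hypothetical shibboleths, chosen for illustration only.
# Statements that are valid Python but are not accepted by Starlark.
PYTHON_ONLY = re.compile(
    r"^\s*(import\s|from\s+\S+\s+import\s|class\s|while\s|try\s*:|with\s)",
    re.MULTILINE,
)
# Starlark's load() statement, e.g. load("//tools:defs.bzl", "my_rule").
STARLARK_LOAD = re.compile(r"^load\(\s*[\"']", re.MULTILINE)


def guess_dialect(source: str) -> str:
    """Return "Python", "Starlark", or "unknown" from textual hints alone."""
    if PYTHON_ONLY.search(source):
        return "Python"
    if STARLARK_LOAD.search(source):
        return "Starlark"
    return "unknown"


build_file = 'load("@io_bazel_rules_go//go:def.bzl", "go_library")\n\ngo_library(\n    name = "go_default_library",\n)\n'
print(guess_dialect(build_file))
print(guess_dialect("import os\nprint(os.getcwd())\n"))
```

Heuristics like this are fragile: a multi-line string that happens to contain a line starting with `import ` would be misread, which is part of the reason to be cautious about baking them into a general classifier.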
But if precision isn't crucial, you could more simply say:
Sounds good to me, but I'm not the leader of the language analysis team hehe
I'm just trying to understand the parameters of the request. I'm a little bit reluctant to bake fragile heuristics into a very general classifier, unless it's pretty harmless.

For Babelfish, the question "what language is this text?" really means "what does the syntax look like?" for which the distinction between S(tar|ky)lark and Python isn't important, and in fact probably distracting unless we teach the drivers to negotiate for custody. For UI tools, however, it may be more important to distinguish.

It's possible a better solution is to identify an idea of "dialects", where two dialects share a syntax but have different embeddings (e.g., browser JS vs. Node JS, WoW Lua vs. SQL UDF Lua, Starkylark vs. Python, etc.). Then a classification may consist of one language and zero or more dialects, which might help.

That's a much bigger task, however, so for now I'm just trying to get a sense of how much impact that work might have. If an analysis like the one you described above could work around the problem with some filename heuristics, it's not necessary to create a new scheme.
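The "one language and zero or more dialects" idea could be modeled roughly as follows. This is only a sketch of the data shape; the `Classification` type and its `matches` helper are hypothetical and do not correspond to any existing Enry or Babelfish API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Classification:
    """One syntactic language plus zero or more dialects (hypothetical type)."""

    language: str
    dialects: Tuple[str, ...] = ()

    def matches(self, name: str) -> bool:
        # True if the name is either the language or one of its dialects.
        return name == self.language or name in self.dialects


# A BUILD file: Python syntax, Starlark embedding.
build_file = Classification("Python", ("Starlark",))

# A syntax-oriented consumer only needs the language.
print(build_file.language)
# A UI-oriented consumer can ask about the dialect.
print(build_file.matches("Starlark"))
```

With this shape, a parser driver can keep dispatching on `language` alone, while tools that care about the embedding look at `dialects`.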
As Enry right now just mirrors Linguist, I would say everything works as expected, and this issue belongs either to Linguist upstream or to the umbrella issue #181, where we track such requests on our side. @campoy
Well, so far we have not baked any heuristics into Enry at all on top of what is in upstream Linguist, and there has been no discussion that I'm aware of about changing this approach. On
I would suggest keeping the separation of responsibility between Enry (Linguist's clone in Go, under the src-d org) and downstream Bblfsh: there is already a discussion of a new feature, dialect support on the bblfsh side (alas, not yet moved to a proper place in the SDK), at bblfsh/bash-driver#39 (comment). Maybe we could take this as input for a broader issue in downstream bblfsh on designing better dialect/multiple-language support, but I'm not sure if or how this fits the 2019Q1 OKRs.
I'm totally fine with this not being solved for now. It's definitely an interesting thing to take into account in the future, whether different dialects can be detected easily and whether it's worth the effort. Thanks for the info!
Will close for now, as there seems to be no further discussion, but please feel free to re-open if something is not clear, etc.
As part of the Kubernetes analysis, we analyzed all of the languages in the Kubernetes codebase:
We found Python reappearing during 2017, but after a deeper analysis it turns out those are all BUILD files with Bazel content. For instance, github.com/kubernetes/kubernetes/tree/master/cmd/kubeadm/app/discovery/https/BUILD appears as Python while it's actually Bazel.
Linguist makes the same mistake, as shown here