Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bazel files misclassified as Python #182

Closed
campoy opened this issue Dec 13, 2018 · 9 comments
Closed

Bazel files misclassified as Python #182

campoy opened this issue Dec 13, 2018 · 9 comments

Comments

@campoy
Copy link
Contributor

campoy commented Dec 13, 2018

As part of the kubernetes analysis we analyzed all of the languages in the kubernetes codebase:

We found Python reappearing during 2017 but after a deeper analysis it turns out those are all BUILD files with Bazel content.

For instance, github.com/kubernetes/kubernetes/tree/master/cmd/kubeadm/app/discovery/https/BUILD appears as Python while it's actually Bazel.

This mistake is also done by linguist, as shown here

@creachadair
Copy link
Contributor

For BUILD files, the tool could presumably check the filename—if it looks like Python and is named BUILD, BUILD.bazel, or WORKSPACE, then it's probably a Bazel file.

But Starlark also occurs in https://github.com/google/copybara, and will probably start to infect other tools over time. So I guess the question is to what extent is it important to always get this right, vs. catching the common cases?

@campoy
Copy link
Contributor Author

campoy commented Dec 17, 2018

That's a good question, indeed.

Maybe the way it is now is already ok since Skylark is a dialect of Python.
Unless there's a very obvious way to detect it's indeed Skylark and not Python?

@creachadair
Copy link
Contributor

creachadair commented Dec 17, 2018

There are a few shibboleths one could look for: load("[^"]*", …) combined with lack of any import statements. Use of depset. Left-justified function calls with nothing after the opening bracket (for rules):

cc_library(
  name = "…",

and so on. Some of those would be reasonably easy to automate with regular expressions, but I'm not sure how far down that rabbit hole we need to go.

@creachadair
Copy link
Contributor

creachadair commented Dec 17, 2018

But if precision isn't crucial, you could more simply say:

  • If it looks like Python,
    • If the filename is BUILD , BUILD.bazel, or matches *.bzl, it's Bazel.
    • Otherwise if the filename matches *.sky it's some other Skylark application.
    • Otherwise it's Python.

@campoy
Copy link
Contributor Author

campoy commented Dec 17, 2018

Sounds good to me, but I'm not the leader of the language analysis team hehe

@creachadair
Copy link
Contributor

creachadair commented Dec 17, 2018

Sounds good to me, but I'm not the leader of the language analysis team hehe

I'm just trying to understand the parameters of the request. I'm a little bit reluctant to bake fragile heuristics into a very general classifier, unless it's pretty harmless. For Babelfish, the question "what language is this text?" really means "what does the syntax look like?" for which the distinction between S(tar|ky)lark and Python isn't important—and in fact probably distracting unless we teach the drivers to negotiate for custody. For UI tools, however, it may be more important to distinguish.

It's possible a better solution is to identify an idea of "dialects", where two dialects share a syntax but have different embeddings (e.g., browser JS vs. Node JS, WoW Lua vs. SQL UDF Lua, Starkylark vs. Python, etc.). Then a classification may consist of one language and zero or more dialects, which might help.

That's a much bigger task, however, so for now I'm just trying to get a sense of how much impact that work might have. If the analysis like the one you described above could work around the problem with some filename heuristics, it's not necessary to create a new scheme.

@bzz
Copy link
Contributor

bzz commented Dec 26, 2018

This mistake is also done by linguist, as shown here

as Enry right now just mirrors the Linguist - I would say everything works as expected and this issue belongs either to Linguist upstream (or the umbrella issue #181 where we track such requests on our side).


@campoy
From what I can tell, Linguist does intentionally recognize these as Python, as the same file/syntax/extension can be either a Pants build (that was open sourced quite some yeas ago) or Bazel build systems, both being a sub-set of Python.

@creachadair

I'm a little bit reluctant to bake fragile heuristics into a very general classifier, unless it's pretty harmless.

Well, so far we did not bake any heuristics to Enry at all, on top of what is in upstream's Linguist and there was no discussion that I'm aware of about changing this approach.

On

For Babelfish, the question "what language is this text?" really means "what does the syntax look like?"

I would suggest keeping the separation of responsibility between Enry (linguist's clone in Go, under src-d org) and downstream Bblfsh: there already is a discussion on a new feature - dialect support on bblfsh side (alas, not moved to a proper place in SDK yet) bblfsh/bash-driver#39 (comment).

May be we could take this as an input for a broader issue in downstream bblfsh on designing a better dialect/multiple language support, but not sure if/how this fits 2019Q1 OKRs.

@campoy
Copy link
Contributor Author

campoy commented Jan 7, 2019

I'm totally fine with this not being solved for now.

It's definitely an interesting thing to take into account in the future, whether different dialects can be detected easily and whether it's worth the effort.

Thanks for the info!

@bzz
Copy link
Contributor

bzz commented Jan 24, 2019

Will close for now, as there seems to be no further discussion but please feel free to re-open if something is not clear, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants