Bazel files misclassified as Python #182

campoy · 2018-12-13T18:59:16Z

As part of the kubernetes analysis we analyzed all of the languages in the kubernetes codebase:

We found Python reappearing during 2017 but after a deeper analysis it turns out those are all BUILD files with Bazel content.

For instance, github.com/kubernetes/kubernetes/tree/master/cmd/kubeadm/app/discovery/https/BUILD appears as Python while it's actually Bazel.

This mistake is also done by linguist, as shown here

The text was updated successfully, but these errors were encountered:

creachadair · 2018-12-14T00:58:49Z

For BUILD files, the tool could presumably check the filename—if it looks like Python and is named BUILD, BUILD.bazel, or WORKSPACE, then it's probably a Bazel file.

But Starlark also occurs in https://github.com/google/copybara, and will probably start to infect other tools over time. So I guess the question is to what extent is it important to always get this right, vs. catching the common cases?

campoy · 2018-12-17T22:01:22Z

That's a good question, indeed.

Maybe the way it is now is already ok since Skylark is a dialect of Python.
Unless there's a very obvious way to detect it's indeed Skylark and not Python?

creachadair · 2018-12-17T22:05:59Z

There are a few shibboleths one could look for: load("[^"]*", …) combined with lack of any import statements. Use of depset. Left-justified function calls with nothing after the opening bracket (for rules):

cc_library(
  name = "…",

and so on. Some of those would be reasonably easy to automate with regular expressions, but I'm not sure how far down that rabbit hole we need to go.

creachadair · 2018-12-17T22:09:40Z

But if precision isn't crucial, you could more simply say:

If it looks like Python,
- If the filename is BUILD , BUILD.bazel, or matches *.bzl, it's Bazel.
- Otherwise if the filename matches *.sky it's some other Skylark application.
- Otherwise it's Python.

campoy · 2018-12-17T22:27:26Z

Sounds good to me, but I'm not the leader of the language analysis team hehe

creachadair · 2018-12-17T23:05:32Z

Sounds good to me, but I'm not the leader of the language analysis team hehe

I'm just trying to understand the parameters of the request. I'm a little bit reluctant to bake fragile heuristics into a very general classifier, unless it's pretty harmless. For Babelfish, the question "what language is this text?" really means "what does the syntax look like?" for which the distinction between S(tar|ky)lark and Python isn't important—and in fact probably distracting unless we teach the drivers to negotiate for custody. For UI tools, however, it may be more important to distinguish.

It's possible a better solution is to identify an idea of "dialects", where two dialects share a syntax but have different embeddings (e.g., browser JS vs. Node JS, WoW Lua vs. SQL UDF Lua, Starkylark vs. Python, etc.). Then a classification may consist of one language and zero or more dialects, which might help.

That's a much bigger task, however, so for now I'm just trying to get a sense of how much impact that work might have. If the analysis like the one you described above could work around the problem with some filename heuristics, it's not necessary to create a new scheme.

bzz · 2018-12-26T11:20:19Z

This mistake is also done by linguist, as shown here

as Enry right now just mirrors the Linguist - I would say everything works as expected and this issue belongs either to Linguist upstream (or the umbrella issue #181 where we track such requests on our side).

@campoy
From what I can tell, Linguist does intentionally recognize these as Python, as the same file/syntax/extension can be either a Pants build (that was open sourced quite some yeas ago) or Bazel build systems, both being a sub-set of Python.

@creachadair

I'm a little bit reluctant to bake fragile heuristics into a very general classifier, unless it's pretty harmless.

Well, so far we did not bake any heuristics to Enry at all, on top of what is in upstream's Linguist and there was no discussion that I'm aware of about changing this approach.

On

For Babelfish, the question "what language is this text?" really means "what does the syntax look like?"

I would suggest keeping the separation of responsibility between Enry (linguist's clone in Go, under src-d org) and downstream Bblfsh: there already is a discussion on a new feature - dialect support on bblfsh side (alas, not moved to a proper place in SDK yet) bblfsh/bash-driver#39 (comment).

May be we could take this as an input for a broader issue in downstream bblfsh on designing a better dialect/multiple language support, but not sure if/how this fits 2019Q1 OKRs.

campoy · 2019-01-07T21:53:24Z

I'm totally fine with this not being solved for now.

It's definitely an interesting thing to take into account in the future, whether different dialects can be detected easily and whether it's worth the effort.

Thanks for the info!

bzz · 2019-01-24T10:32:46Z

Will close for now, as there seems to be no further discussion but please feel free to re-open if something is not clear, etc.

bzz closed this as completed Jan 24, 2019

bzz mentioned this issue Jan 24, 2019

language argument in client.parse changes number of files signigicantly bblfsh/python-client#126

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bazel files misclassified as Python #182

Bazel files misclassified as Python #182

campoy commented Dec 13, 2018

creachadair commented Dec 14, 2018

campoy commented Dec 17, 2018

creachadair commented Dec 17, 2018 •

edited

Loading

creachadair commented Dec 17, 2018 •

edited

Loading

campoy commented Dec 17, 2018

creachadair commented Dec 17, 2018 •

edited

Loading

bzz commented Dec 26, 2018 •

edited

Loading

campoy commented Jan 7, 2019

bzz commented Jan 24, 2019 •

edited

Loading

Bazel files misclassified as Python #182

Bazel files misclassified as Python #182

Comments

campoy commented Dec 13, 2018

creachadair commented Dec 14, 2018

campoy commented Dec 17, 2018

creachadair commented Dec 17, 2018 • edited Loading

creachadair commented Dec 17, 2018 • edited Loading

campoy commented Dec 17, 2018

creachadair commented Dec 17, 2018 • edited Loading

bzz commented Dec 26, 2018 • edited Loading

campoy commented Jan 7, 2019

bzz commented Jan 24, 2019 • edited Loading

creachadair commented Dec 17, 2018 •

edited

Loading

creachadair commented Dec 17, 2018 •

edited

Loading

creachadair commented Dec 17, 2018 •

edited

Loading

bzz commented Dec 26, 2018 •

edited

Loading

bzz commented Jan 24, 2019 •

edited

Loading