Language analysis wishlist #1063
spenserblack started this conversation in Ideas
Replies: 2 comments 11 replies
-
I came across an interesting project called hyperpolyglot, which aims to replicate the functionality of GitHub Linguist in Rust. Checking out this project might give you some ideas!
-
Just an FYI that I've started a project to hit the things on this wishlist (productively procrastinating from my other personal projects) 😉 It's basically going to be "linguist but in Rust," but I'm also adding language detection by filename pattern.
-
Inspired by a few issues that have been raised, I'm making this discussion so we have a list of things that we wish our language analysis tool (currently tokei) would do. This discussion can be referenced if we decide to switch tools, fork tokei, or write our own from scratch (the latter is in my ever-growing list of projects I want to do but get side-tracked from by other to-dos 🙃).
Want
I'm putting things here that I think we definitely want.
Classification (#26)
Currently, tokei analyzes the file extension and shebang, and it looks like there is some interest for using modelines. However, there seems to be little to no interest in analyzing the actual code contents for classification, as the maintainer doesn't consider this deterministic -- see XAMPPRocky/tokei#708, XAMPPRocky/tokei#305, and XAMPPRocky/tokei#764 for example.
These are reasonable metrics to use, but I don't think they're enough for our usage. I think that language classification should work "out of the box," and manually overriding with modelines or a configuration file should be the exception, not the rule. For this, I think it's necessary to analyze the source code and make a best guess as to which language it is. IIRC github-linguist uses heuristics (regexes of syntax unique to the language) and Bayesian classification from code samples.
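To illustrate the heuristic idea, here is a minimal content-based sketch. This is not onefetch's or linguist's actual implementation: real tools use per-language regexes and Bayesian classifiers trained on samples, while this uses naive substring checks (and a made-up function name) to stay dependency-free.

```rust
/// Illustrative only: guess a language from telltale snippets in the source.
/// Real heuristics would be regexes, and ties would fall through to a
/// Bayesian classifier trained on code samples.
fn guess_language(source: &str) -> Option<&'static str> {
    // Each heuristic pairs a telltale snippet with a language name.
    const HEURISTICS: &[(&str, &str)] = &[
        ("#include <stdio.h>", "C"),
        ("fn main()", "Rust"),
        ("def __init__", "Python"),
    ];
    HEURISTICS
        .iter()
        .copied()
        .find(|(needle, _)| source.contains(needle))
        .map(|(_, lang)| lang)
}

fn main() {
    assert_eq!(guess_language("fn main() { println!(\"hi\"); }"), Some("Rust"));
    // No heuristic matches: a real tool would fall back to other signals.
    assert_eq!(guess_language("plain text"), None);
}
```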
Language Categories
E.g. programming, data, etc.
We've implemented this in onefetch, but I believe it would be better if this was implemented in the language analysis tool.
It's also possible that tokei will eventually add language categories: XAMPPRocky/tokei#962 (comment)
Maybe
Here are things that are pending further discussion.
Analyze revs (#1033)
Currently we analyze the contents of the filesystem. This can be confusing when there is a large number of untracked files. For example, a user will likely have every single project in some subfolder of `$HOME`, and if they have a dotfiles repo at `$HOME/.git`, then onefetch can return wildly inaccurate results by including all of the untracked files in subfolders.

Since we do require the existence of a git repository, I don't think it's unreasonable to analyze a git rev, defaulting to HEAD, instead of the directory contents. This can give better insight into what the project is, instead of what the project could be, if that makes sense.
As an added bonus, if we analyze revs instead of directory contents, we could probably start supporting bare repos.
Don't use LOC for language distribution
With the following project:
Onefetch will consider this 86% TypeScript and 14% JavaScript, when, syntactically, it's more like 50-50. Lines of code might not be the best metric, as code style can severely influence the LOC count without adding or removing actual code.
github-linguist uses blob size, and returns 56% TypeScript, 43% JavaScript, which is a more accurate distribution in this example.
There are a few things we can do to make things even more accurate. Counting uncommented tokens might be the most accurate, though this might be too computationally intensive.
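As a sketch of the blob-size approach described above (the helper name and the byte counts are invented for illustration, and per-file language detection is assumed to have already happened):

```rust
use std::collections::HashMap;

/// Compute a percentage distribution from (language, blob size in bytes)
/// pairs, in the spirit of github-linguist's size-based breakdown.
fn distribution_by_size(files: &[(&str, u64)]) -> HashMap<String, f64> {
    let mut bytes_per_lang: HashMap<String, u64> = HashMap::new();
    for &(lang, bytes) in files {
        *bytes_per_lang.entry(lang.to_string()).or_insert(0) += bytes;
    }
    let total: u64 = bytes_per_lang.values().sum();
    bytes_per_lang
        .into_iter()
        .map(|(lang, bytes)| (lang, 100.0 * bytes as f64 / total as f64))
        .collect()
}

fn main() {
    // Made-up blob sizes roughly matching the 56/43 example above.
    let files = [("TypeScript", 570), ("JavaScript", 430)];
    let dist = distribution_by_size(&files);
    assert!((dist["TypeScript"] - 57.0).abs() < 1e-9);
    assert!((dist["JavaScript"] - 43.0).abs() < 1e-9);
}
```

Counting uncommented tokens would slot into the same shape: replace the byte count per file with a token count.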
Detect by filename (excluding extension)
This is something that both tokei and github-linguist currently can't do! Some examples of this would be detecting `Dockerfile.node` as Dockerfile, or `Makefile.amd64` as Makefile. I haven't seen any complaints here yet, but this could be a nice-to-have. The biggest hurdle would be what happens with `Dockerfile.js`. Is that a Dockerfile, or a JavaScript file? Even with classification, one should probably take priority over the other.
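A naive sketch of this kind of basename matching (the function name and pattern table are hypothetical, not an existing tokei or linguist API):

```rust
/// Illustrative only: classify by well-known base name, so "Dockerfile.node"
/// resolves to Dockerfile and "Makefile.amd64" to Makefile.
fn detect_by_basename(filename: &str) -> Option<&'static str> {
    const BASENAMES: &[(&str, &str)] = &[
        ("Dockerfile", "Dockerfile"),
        ("Makefile", "Makefile"),
    ];
    BASENAMES
        .iter()
        .copied()
        .find(|(base, _)| filename == *base || filename.starts_with(&format!("{base}.")))
        .map(|(_, lang)| lang)
}

fn main() {
    assert_eq!(detect_by_basename("Dockerfile.node"), Some("Dockerfile"));
    assert_eq!(detect_by_basename("Makefile.amd64"), Some("Makefile"));
    // The ambiguous case from above: this naive rule answers Dockerfile,
    // but a real implementation might let the .js extension win instead.
    assert_eq!(detect_by_basename("Dockerfile.js"), Some("Dockerfile"));
}
```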