Language analysis wishlist #1063
spenserblack started this conversation in Ideas
Replies: 2 comments 11 replies
-
I came across an interesting project called hyperpolyglot, which aims to replicate the functionality of GitHub Linguist in Rust. Checking out this project might give you some ideas!
-
Just an FYI that I've started a project to hit the things on this wishlist (productively procrastinating from my other personal projects) 😉 It's basically going to be "linguist but in Rust," but I'm also adding language detection by filename pattern.
-
Inspired by a few issues that have been raised, I'm making this discussion so we have a list of things that we wish our language analysis tool (currently tokei) would do. This discussion can be referenced if we decide to switch tools, fork tokei, or write our own from scratch (the latter is in my ever-growing list of projects I want to do but get side-tracked from by other to-dos 🙃).
Want
I'm putting things here that I think we definitely want.
Classification (#26)
Currently, tokei analyzes the file extension and shebang, and it looks like there is some interest for using modelines. However, there seems to be little to no interest in analyzing the actual code contents for classification, as the maintainer doesn't consider this deterministic -- see XAMPPRocky/tokei#708, XAMPPRocky/tokei#305, and XAMPPRocky/tokei#764 for example.
These are reasonable metrics to use, but I don't think they're enough for our usage. I think that language classification should work "out of the box," and manually overriding with modelines or a configuration file should be the exception, not the rule. For this, I think it's necessary to analyze the source code and make a best guess as to which language it is. IIRC github-linguist uses heuristics (regexes of syntax unique to the language) and Bayesian classification from code samples.
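To illustrate the heuristic idea, here is a minimal content-based sketch. This is not onefetch's or linguist's actual implementation: real tools use per-language regexes and Bayesian classifiers trained on samples, while this uses naive substring checks (and a made-up function name) to stay dependency-free.

```rust
/// Illustrative only: guess a language from telltale snippets in the source.
/// Real heuristics would be regexes, and ties would fall through to a
/// Bayesian classifier trained on code samples.
fn guess_language(source: &str) -> Option<&'static str> {
    // Each heuristic pairs a telltale snippet with a language name.
    const HEURISTICS: &[(&str, &str)] = &[
        ("#include <stdio.h>", "C"),
        ("fn main()", "Rust"),
        ("def __init__", "Python"),
    ];
    HEURISTICS
        .iter()
        .copied()
        .find(|(needle, _)| source.contains(needle))
        .map(|(_, lang)| lang)
}

fn main() {
    assert_eq!(guess_language("fn main() { println!(\"hi\"); }"), Some("Rust"));
    // No heuristic matches: a real tool would fall back to other signals.
    assert_eq!(guess_language("plain text"), None);
}
```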
Language Categories
E.g. programming, data, etc.
We've implemented this in onefetch, but I believe it would be better if this was implemented in the language analysis tool.
It's also possible that tokei will eventually add language categories: XAMPPRocky/tokei#962 (comment)
Maybe
Here are things that are pending further discussion.
Analyze revs (#1033)
Currently we analyze the contents of the filesystem. This can be confusing when there is a large number of untracked files. For example, a user will likely have every single project in some subfolder of `$HOME`, and if they have a dotfiles repo at `$HOME/.git`, then onefetch can return wildly inaccurate results by including all of the untracked files in subfolders.

Since we do require the existence of a git repository, I don't think it's unreasonable to analyze a git rev, defaulting to HEAD, instead of the directory contents. This can give better insight into what the project is, instead of what the project could be, if that makes sense.
As an added bonus, if we analyze revs instead of directory contents, we could probably start supporting bare repos.
Don't use LOC for language distribution
With the following project:
Onefetch will consider this 86% TypeScript and 14% JavaScript, when, syntactically, it's more like 50-50. Lines of code might not be the best metric, as code style can severely influence the LOC count without adding or removing actual code.
github-linguist uses blob size, and returns 56% TypeScript, 43% JavaScript, which is a more accurate distribution in this example.
There are a few things we can do to make things even more accurate. Counting uncommented tokens might be the most accurate, though this might be too computationally intensive.
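As a sketch of the blob-size approach described above (the helper name and the byte counts are invented for illustration, and per-file language detection is assumed to have already happened):

```rust
use std::collections::HashMap;

/// Compute a percentage distribution from (language, blob size in bytes)
/// pairs, in the spirit of github-linguist's size-based breakdown.
fn distribution_by_size(files: &[(&str, u64)]) -> HashMap<String, f64> {
    let mut bytes_per_lang: HashMap<String, u64> = HashMap::new();
    for &(lang, bytes) in files {
        *bytes_per_lang.entry(lang.to_string()).or_insert(0) += bytes;
    }
    let total: u64 = bytes_per_lang.values().sum();
    bytes_per_lang
        .into_iter()
        .map(|(lang, bytes)| (lang, 100.0 * bytes as f64 / total as f64))
        .collect()
}

fn main() {
    // Made-up blob sizes roughly matching the 56/43 example above.
    let files = [("TypeScript", 570), ("JavaScript", 430)];
    let dist = distribution_by_size(&files);
    assert!((dist["TypeScript"] - 57.0).abs() < 1e-9);
    assert!((dist["JavaScript"] - 43.0).abs() < 1e-9);
}
```

Counting uncommented tokens would slot into the same shape: replace the byte count per file with a token count.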
Detect by filename (excluding extension)
This is something that both tokei and github-linguist currently can't do! Some examples of this would be detecting `Dockerfile.node` as Dockerfile, or `Makefile.amd64` as Makefile. I haven't seen any complaints here yet, but this could be a nice-to-have. The biggest hurdle would be what happens with `Dockerfile.js`. Is that a Dockerfile, or a JavaScript file? Even with classification, one should probably take priority over the other.
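A naive sketch of this kind of basename matching (the function name and pattern table are hypothetical, not an existing tokei or linguist API):

```rust
/// Illustrative only: classify by well-known base name, so "Dockerfile.node"
/// resolves to Dockerfile and "Makefile.amd64" to Makefile.
fn detect_by_basename(filename: &str) -> Option<&'static str> {
    const BASENAMES: &[(&str, &str)] = &[
        ("Dockerfile", "Dockerfile"),
        ("Makefile", "Makefile"),
    ];
    BASENAMES
        .iter()
        .copied()
        .find(|(base, _)| filename == *base || filename.starts_with(&format!("{base}.")))
        .map(|(_, lang)| lang)
}

fn main() {
    assert_eq!(detect_by_basename("Dockerfile.node"), Some("Dockerfile"));
    assert_eq!(detect_by_basename("Makefile.amd64"), Some("Makefile"));
    // The ambiguous case from above: this naive rule answers Dockerfile,
    // but a real implementation might let the .js extension win instead.
    assert_eq!(detect_by_basename("Dockerfile.js"), Some("Dockerfile"));
}
```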