Raw text (not markdown) formatting from html #91
Comments
Using your |
Hi, |
I was surprised to find this wasn't the default. Then I spent a couple hours looking for any way to do this, both inside and outside this crate. Then I gave up and checked this crate's issues just to see if anyone had suggested it as a feature since it seems like such a trivial and common use case. Only then did I finally discover from these comments that there's already a decorator for it in this crate. I'd really argue this should be the default, considering the crate's name. It's not HTML to markdown, it's HTML to text. I thought that meant plain text, no formatting or syntax of any sort. The fact that the default config is called "plain" misled me further.

There are countless use cases for HTML to plain text: converting HTML emails to plain text emails (since it's good practice to send both in a multipart body), generating preview blurbs for rich content, generating an OpenGraph description meta tag from a snippet of rich content, giving a simplified view of a document without formatting, indexing and searching rich content as text... Many of these are things I've had to do in the past, and I could go on. I don't think use cases for the current default--converting HTML to text with some markdown formatting--are nearly as common, considering those use cases have the additional requirement that something be taking that output and processing it as markdown. I believe plain text is a lot more general and therefore versatile. Why is it not the default? (I can open a separate issue for this if you'd like.)

Also, by the way, you'll notice none of the use cases I exemplified have any sensible line length limit for text wrapping, which this crate's API requires. I pass in |
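To make the point about the mandatory wrap width concrete, here is a minimal sketch of what an optional width could look like. The `wrap_text` function is hypothetical and is not part of this crate's API; it only illustrates the design choice of letting `None` mean "do not wrap".

```rust
/// Hypothetical helper (not part of html2text): greedy word wrap with an
/// optional width. `None` means no limit, so each input line stays on one
/// output line. Runs of whitespace inside a line are collapsed to one space.
fn wrap_text(text: &str, width: Option<usize>) -> Vec<String> {
    let mut lines = Vec::new();
    for paragraph in text.lines() {
        let mut current = String::new();
        for word in paragraph.split_whitespace() {
            // Length of the line if this word were appended (+1 for the space).
            let needed = current.chars().count() + word.chars().count() + 1;
            let fits = match width {
                None => true,
                // A word always goes on an empty line, even if it is too long.
                Some(w) => current.is_empty() || needed <= w,
            };
            if !fits {
                // Emit the finished line and start a new one.
                lines.push(std::mem::take(&mut current));
            }
            if !current.is_empty() {
                current.push(' ');
            }
            current.push_str(word);
        }
        lines.push(current);
    }
    lines
}
```

With a signature like this, callers generating preview blurbs or search indexes would pass `None`, while a terminal renderer would pass `Some(columns)`.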
Also worth noting, |
I agree the documentation needs improving!
I guess it's a matter of perspective, and naming is hard! The original reason for writing this crate was to display HTML e-mails in a fixed width (text) terminal, which I continue to do every day, so that's still an important use case. 😃 So that is why things are as they are.

But there isn't really a "default" as such - there's a configurable API (

(To be clear the default output is not valid markdown - it's aiming to do a reasonable job of showing readable text in a text-only format, and markdown-style formatting is often great, but not always. This is most obvious when tables are involved.)
I think this shows that terms like "plain text" and "formatting" are ambiguous. So what kind of output do you actually want? It sounds like you don't want line wrapping, but you do want markdown-style bullets/quotes, right? How do you want tables to be handled, or hyperlinks? |
Appreciate your reply! Personally, my aim (and what I assumed from the README and names of this crate and some of its items) is for the text simply to look like the HTML. This means formatting lists but treating hyperlinks like plain text. Tables should probably be formatted too, but there shouldn't be any syntax that only makes sense and is well-known in markdown. |
By this do you mean pretending they're not links at all (discarding the target etc.)?
What do you mean by this? |
Yes! In all of the example use cases I listed earlier, discarding the link target makes the most sense in general.
I mean that I'd argue markdown-specific syntax should be avoided unless the developer explicitly asks for it with an API for that purpose. For a library called "html2text", I (and apparently some others in the issues here) am surprised by the output conforming to any particular language like markdown. I just expect plain text. No obligations to act on any of these arguments, of course! I'm sure this is just something you work on in your free time. This is just my two cents. |
To work around this I created my own TextDecorator (copied from

#[derive(Constructor)]
struct GitHubHtmlDecorator;

impl TextDecorator for GitHubHtmlDecorator {
    type Annotation = ();

    fn decorate_link_start(&mut self, _url: &str) -> (String, Self::Annotation) {
        (String::new(), ())
    }

    fn decorate_link_end(&mut self) -> String {
        String::new()
    }

    fn decorate_em_start(&self) -> (String, Self::Annotation) {
        (String::new(), ())
    }

    fn decorate_em_end(&self) -> String {
        String::new()
    }

    fn decorate_strong_start(&self) -> (String, Self::Annotation) {
        (String::new(), ())
    }

    fn decorate_strong_end(&self) -> String {
        String::new()
    }

    fn decorate_strikeout_start(&self) -> (String, Self::Annotation) {
        (String::new(), ())
    }

    fn decorate_strikeout_end(&self) -> String {
        String::new()
    }

    fn decorate_code_start(&self) -> (String, Self::Annotation) {
        (String::new(), ())
    }

    fn decorate_code_end(&self) -> String {
        String::new()
    }

    fn decorate_preformat_first(&self) -> Self::Annotation {}

    fn decorate_preformat_cont(&self) -> Self::Annotation {}

    fn decorate_image(&mut self, _src: &str, title: &str) -> (String, Self::Annotation) {
        (title.to_string(), ())
    }

    fn header_prefix(&self, _level: usize) -> String {
        String::new()
    }

    fn quote_prefix(&self) -> String {
        String::from("> ")
    }

    fn unordered_item_prefix(&self) -> String {
        String::from("- ")
    }

    fn ordered_item_prefix(&self, i: i64) -> String {
        format!("{i}. ")
    }

    fn make_subblock_decorator(&self) -> Self {
        Self::new()
    }

    fn finalise(&mut self, _links: Vec<String>) -> Vec<TaggedLine<()>> {
        Vec::new()
    }
}

It does feel like an awful lot of boilerplate though to achieve something quite simple. I think a builder function would be nice here to minimise that. E.g:

let decorator = TextDecoratorBuilder()
    .decorate_image(|(_src, title)| title.to_string())
    .quote_prefix("> ")
    .ordered_item_prefix(|i| format!("{i}. "))
    .unordered_item_prefix("- ")
    .build();

Then anything else not specified would act like |
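For illustration only, the builder idea proposed above could be sketched roughly as follows. Nothing here exists in the crate: `MiniDecorator` is a hypothetical, cut-down stand-in for the real `TextDecorator` trait (three hooks instead of the full set), and `MiniDecoratorBuilder` and `BuiltDecorator` are made-up names. The assumption is that any hook left unset simply emits an empty string.

```rust
// Hypothetical, cut-down stand-in for the crate's decorator trait.
trait MiniDecorator {
    fn quote_prefix(&self) -> String;
    fn unordered_item_prefix(&self) -> String;
    fn ordered_item_prefix(&self, i: i64) -> String;
}

// Decorator produced by the builder; each hook is a stored value or closure.
struct BuiltDecorator {
    quote: String,
    unordered: String,
    ordered: Box<dyn Fn(i64) -> String>,
}

impl MiniDecorator for BuiltDecorator {
    fn quote_prefix(&self) -> String {
        self.quote.clone()
    }
    fn unordered_item_prefix(&self) -> String {
        self.unordered.clone()
    }
    fn ordered_item_prefix(&self, i: i64) -> String {
        (self.ordered)(i)
    }
}

// Hypothetical builder: every hook starts out as "emit nothing".
struct MiniDecoratorBuilder {
    inner: BuiltDecorator,
}

impl MiniDecoratorBuilder {
    fn new() -> Self {
        Self {
            inner: BuiltDecorator {
                quote: String::new(),
                unordered: String::new(),
                ordered: Box::new(|_| String::new()),
            },
        }
    }

    fn quote_prefix(mut self, prefix: &str) -> Self {
        self.inner.quote = prefix.to_string();
        self
    }

    fn unordered_item_prefix(mut self, prefix: &str) -> Self {
        self.inner.unordered = prefix.to_string();
        self
    }

    fn ordered_item_prefix(mut self, f: impl Fn(i64) -> String + 'static) -> Self {
        self.inner.ordered = Box::new(f);
        self
    }

    fn build(self) -> BuiltDecorator {
        self.inner
    }
}
```

A caller would then write something like `MiniDecoratorBuilder::new().quote_prefix("> ").ordered_item_prefix(|i| format!("{i}. ")).build()` and get empty output for every hook they did not override. A real version would need to cover the remaining trait methods and the `Annotation` type.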
I agree with this - it feels unidiomatic (and presumptive that it's being outputted to a terminal) to have to specify |
Sorry if I did not find the relevant information by perusing the documentation, but is there an easy way to produce raw text output (without any markdown formatting at all)? It seems to me I can implement my own thing based on traversing a RenderNode or by implementing a custom TextDecorator? But it also seems a very common use case, especially when pre-processing documents from the web for NLP pipelines.

If it does not yet exist, can I contribute it to this library? If so, do you have any guidance on the correct way to do so?