Raw text (not markdown) formatting from html #91

Open
Yomguithereal opened this issue Oct 31, 2023 · 10 comments

Comments

@Yomguithereal

Sorry if I missed the relevant information while perusing the documentation, but is there an easy way to produce raw text output (without any markdown formatting at all)? It seems I could implement my own thing by traversing a RenderNode or by implementing a custom TextDecorator. But it also seems like a very common use case, especially when pre-processing documents from the web for NLP pipelines.

If it does not yet exist, can I contribute it to this library? If so, do you have any guidance on the correct way to do so?

@Yomguithereal
Author

Using your TrivialDecorator seems to do the trick, when used with from_read_with_decorator. Do you want me to add some documentation about this for people, or add a dedicated high-level function?

@jugglerchris
Owner

Hi,
Glad you found TrivialDecorator. Some documentation would be great if you have the time - it's probably not worth adding another top level function.

@GrantGryczan

GrantGryczan commented Sep 21, 2024

I was surprised to find this wasn't the default. Then I spent a couple hours looking for any way to do this, both inside and outside this crate. Then I gave up and checked this crate's issues just to see if anyone had suggested it as a feature since it seems like such a trivial and common use case. Only then did I finally discover from these comments that there's already a decorator for it in this crate.

I'd really argue this should be the default, considering the crate's name. It's not HTML to markdown, it's HTML to text. I thought that meant plain text, no formatting or syntax of any sort. The fact that the default config is called "plain" misled me further. There are countless use cases for HTML to plain text: converting HTML emails to plain text emails (since it's good practice to send both in a multipart body), generating preview blurbs for rich content, generating an OpenGraph description meta tag from a snippet of rich content, giving a simplified view of a document without formatting, indexing and searching rich content as text... Many of these are things I've had to do in the past, and I could go on.

I don't think use cases for the current default--converting HTML to text with some markdown formatting--are nearly as common, considering those use cases have the additional requirement that something be taking that output and processing it as markdown. I believe plain text is a lot more general and therefore versatile. Why is it not the default? (I can open a separate issue for this if you'd like.)

Also, by the way, you'll notice none of the use cases I exemplified have any sensible line length limit for text wrapping, which this crate's API requires. I pass in usize::MAX to effectively disable this functionality, but it's still getting processed, which is a waste of performance and memory. And the fact that there's a line limit also adds more errors to the result type the API returns--errors that can never happen with my usage. But this is a more minor complaint.

@GrantGryczan

GrantGryczan commented Sep 22, 2024

Also worth noting, TrivialDecorator doesn't include bullets/numbers from lists, and quotes aren't prefixed with >, which might not be desirable if a plain text form of the HTML is what you're after.

@jugglerchris
Owner

> I was surprised to find this wasn't the default. Then I spent a couple hours looking for any way to do this, both inside and outside this crate. Then I gave up and checked this crate's issues just to see if anyone had suggested it as a feature since it seems like such a trivial and common use case. Only then did I finally discover from these comments that there's already a decorator for it in this crate.

I agree the documentation needs improving!

> I don't think use cases for the current default--converting HTML to text with some markdown formatting--are nearly as common, considering those use cases have the additional requirement that something be taking that output and processing it as markdown. I believe plain text is a lot more general and therefore versatile. Why is it not the default? (I can open a separate issue for this if you'd like.)

I guess it's a matter of perspective, and naming is hard! The original reason for writing this crate was to display HTML e-mails in a fixed width (text) terminal, which I continue to do every day, so that's still an important use case. 😃 So that is why things are as they are.

But there isn't really a "default" as such - there's a configurable API (config) and some top level functions aiming to cover common use cases. I think you're right that there are some common use cases which should be covered by some simpler to use top level functions (making them better documented and discoverable as well as simpler).

(To be clear the default output is not valid markdown - it's aiming to do a reasonable job of showing readable text in a text-only format, and markdown-style formatting is often great, but not always. This is most obvious when tables are involved.)

> Also worth noting, TrivialDecorator doesn't include bullets/numbers from lists, and quotes aren't prefixed with >, which might not be desirable if a plain text form of the HTML is what you're after.

I think this shows that terms like "plain text" and "formatting" are ambiguous. So what kind of output do you actually want? It sounds like you don't want line wrapping, but you do want markdown-style bullets/quotes, right? How do you want tables to be handled, or hyperlinks?

@GrantGryczan

GrantGryczan commented Sep 22, 2024

Appreciate your reply!

Personally, my aim (and what I assumed from the README and names of this crate and some of its items) is for the text simply to look like the HTML. This means formatting lists but treating hyperlinks like plain text. Tables should probably be formatted too, but there shouldn't be any syntax that only makes sense and is well-known in markdown.

@jugglerchris
Owner

> This means formatting lists but treating hyperlinks like plain text.

By this do you mean pretending they're not links at all (discarding the target etc.)?

> there shouldn't be any syntax that only makes sense and is well-known in markdown

What do you mean by this?

@GrantGryczan

GrantGryczan commented Sep 25, 2024

> > This means formatting lists but treating hyperlinks like plain text.
>
> By this do you mean pretending they're not links at all (discarding the target etc.)?

Yes! In all of the example use cases I listed earlier, discarding the link target makes the most sense in general.

> > there shouldn't be any syntax that only makes sense and is well-known in markdown
>
> What do you mean by this?

I mean that I'd argue markdown-specific syntax should be avoided unless the developer explicitly asks for it with an API for that purpose. For a library called "html2text", I (and apparently some others in the issues here) am surprised by the output conforming to any particular language like markdown. I just expect plain text.

No obligations to act on any of these arguments, of course! I'm sure this is just something you work on in your free time. This is just my two cents.

@russellbanks

> Also worth noting, TrivialDecorator doesn't include bullets/numbers from lists, and quotes aren't prefixed with >, which might not be desirable if a plain text form of the HTML is what you're after.

To work around this, I created my own TextDecorator (copied from TrivialDecorator) so that the output matches this as closely as possible:

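// `Constructor` is assumed to be the derive macro from the derive_more crate;
// it generates the `GitHubHtmlDecorator::new()` used in `make_subblock_decorator`.
#[derive(Constructor)]
struct GitHubHtmlDecorator;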

impl TextDecorator for GitHubHtmlDecorator {
    type Annotation = ();

    fn decorate_link_start(&mut self, _url: &str) -> (String, Self::Annotation) {
        (String::new(), ())
    }

    fn decorate_link_end(&mut self) -> String {
        String::new()
    }

    fn decorate_em_start(&self) -> (String, Self::Annotation) {
        (String::new(), ())
    }

    fn decorate_em_end(&self) -> String {
        String::new()
    }

    fn decorate_strong_start(&self) -> (String, Self::Annotation) {
        (String::new(), ())
    }

    fn decorate_strong_end(&self) -> String {
        String::new()
    }

    fn decorate_strikeout_start(&self) -> (String, Self::Annotation) {
        (String::new(), ())
    }

    fn decorate_strikeout_end(&self) -> String {
        String::new()
    }

    fn decorate_code_start(&self) -> (String, Self::Annotation) {
        (String::new(), ())
    }

    fn decorate_code_end(&self) -> String {
        String::new()
    }

    fn decorate_preformat_first(&self) -> Self::Annotation {}

    fn decorate_preformat_cont(&self) -> Self::Annotation {}

    fn decorate_image(&mut self, _src: &str, title: &str) -> (String, Self::Annotation) {
        (title.to_string(), ())
    }

    fn header_prefix(&self, _level: usize) -> String {
        String::new()
    }

    fn quote_prefix(&self) -> String {
        String::from("> ")
    }

    fn unordered_item_prefix(&self) -> String {
        String::from("- ")
    }

    fn ordered_item_prefix(&self, i: i64) -> String {
        format!("{i}. ")
    }

    fn make_subblock_decorator(&self) -> Self {
        Self::new()
    }

    fn finalise(&mut self, _links: Vec<String>) -> Vec<TaggedLine<()>> {
        Vec::new()
    }
}

It does feel like an awful lot of boilerplate, though, to achieve something quite simple. I think a builder would be nice here to minimise that. For example:

let decorator = TextDecoratorBuilder::new()
    .decorate_image(|_src, title| title.to_string())
    .quote_prefix("> ")
    .ordered_item_prefix(|i| format!("{i}. "))
    .unordered_item_prefix("- ")
    .build();

Then anything else not specified would act like TrivialDecorator.
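
As a rough illustration of that idea, here is a minimal, standalone sketch of such a builder. It does not use html2text at all: `SimpleDecorator`, `SimpleDecoratorBuilder` and `example_decorator` are hypothetical names invented for this sketch, and only four of the hooks are modelled. The defaults produce empty strings, which is an assumption about what "act like TrivialDecorator" would mean.

```rust
// Hypothetical sketch only: none of these types exist in html2text.

/// A cut-down stand-in for a decorator, holding just four of the hooks.
struct SimpleDecorator {
    quote_prefix: String,
    unordered_item_prefix: String,
    ordered_item_prefix: Box<dyn Fn(i64) -> String>,
    decorate_image: Box<dyn Fn(&str, &str) -> String>,
}

/// Builder whose defaults emit nothing; callers override only what they need.
struct SimpleDecoratorBuilder {
    inner: SimpleDecorator,
}

impl SimpleDecoratorBuilder {
    fn new() -> Self {
        Self {
            inner: SimpleDecorator {
                quote_prefix: String::new(),
                unordered_item_prefix: String::new(),
                ordered_item_prefix: Box::new(|_| String::new()),
                decorate_image: Box::new(|_src, _title| String::new()),
            },
        }
    }

    fn quote_prefix(mut self, prefix: &str) -> Self {
        self.inner.quote_prefix = prefix.to_string();
        self
    }

    fn unordered_item_prefix(mut self, prefix: &str) -> Self {
        self.inner.unordered_item_prefix = prefix.to_string();
        self
    }

    fn ordered_item_prefix(mut self, f: impl Fn(i64) -> String + 'static) -> Self {
        self.inner.ordered_item_prefix = Box::new(f);
        self
    }

    fn decorate_image(mut self, f: impl Fn(&str, &str) -> String + 'static) -> Self {
        self.inner.decorate_image = Box::new(f);
        self
    }

    fn build(self) -> SimpleDecorator {
        self.inner
    }
}

/// Builds the same configuration as the snippet proposed above.
fn example_decorator() -> SimpleDecorator {
    SimpleDecoratorBuilder::new()
        .decorate_image(|_src, title| title.to_string())
        .quote_prefix("> ")
        .ordered_item_prefix(|i| format!("{i}. "))
        .unordered_item_prefix("- ")
        .build()
}

fn main() {
    let decorator = example_decorator();
    println!("{}first item", (decorator.ordered_item_prefix)(1));
    println!("{}a bullet", decorator.unordered_item_prefix);
    println!("{}a quote", decorator.quote_prefix);
    println!("{}", (decorator.decorate_image)("cat.png", "A cat"));
}
```

A real version would have to cover every hook in the trait and decide how annotations and sub-block decorators are handled, which this sketch leaves out.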

@russellbanks

> Also, by the way, you'll notice none of the use cases I exemplified have any sensible line length limit for text wrapping, which this crate's API requires. I pass in usize::MAX to effectively disable this functionality, but it's still getting processed, which is a waste of performance and memory.

I agree with this - it feels unidiomatic (and presumptive that it's being outputted to a terminal) to have to specify usize::MAX to essentially disable functionality. An Option here would be more fitting so that if you specifically needed a width, you could specify Some(50), for example.
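
To make the suggestion concrete, here is a small standalone sketch of an optional width. `wrap_text` is a hypothetical helper written for this illustration, not a function from html2text, and it measures width in bytes rather than display columns to stay short.

```rust
/// Greedy word wrap. `None` means "do not wrap at all".
/// Hypothetical helper for illustration; widths are counted in bytes.
fn wrap_text(text: &str, width: Option<usize>) -> Vec<String> {
    let Some(width) = width else {
        // No width requested: keep each input line as-is and skip wrapping entirely.
        return text.lines().map(str::to_string).collect();
    };

    let mut lines = Vec::new();
    for paragraph in text.lines() {
        let mut current = String::new();
        for word in paragraph.split_whitespace() {
            // Start a new line when this word (plus a separating space) would overflow.
            if !current.is_empty() && current.len() + 1 + word.len() > width {
                lines.push(std::mem::take(&mut current));
            }
            if !current.is_empty() {
                current.push(' ');
            }
            current.push_str(word);
        }
        lines.push(current);
    }
    lines
}

fn main() {
    // A caller that needs a width asks for one explicitly...
    for line in wrap_text("one two three", Some(7)) {
        println!("{line}");
    }
    // ...and a caller that does not simply passes None.
    for line in wrap_text("one two three", None) {
        println!("{line}");
    }
}
```

With this shape, the `None` path never touches the wrapping logic, so there is no need to pass a sentinel such as usize::MAX.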


4 participants