Improvements... #50

Open
scruffynerf opened this issue Dec 9, 2024 · 27 comments

@scruffynerf

So in reviewing your code and playing with this, I think massive improvements can be made with the following changes:

  1. use the new Auralis tts. It's so much faster. Night and Day...
    What took VoxNovel quite a while for a very short epub (6 or so pages), I had in mere seconds with https://github.com/JohnZolton/Fast-Audiobook which is a super simple implementation

  2. booknlp, for some ungodly reason, never outputs the text in a reasonable json format. So you're forced to parse the ugly html... let's fix that, and solve it with a strike at the root: fork booknlp to add a json output of the book text, not the html mess it generates which you have to reverse engineer, with just a speaker attribute, so it's all just a clean set of ordered lines to process, each with a speaker.

[
...
{"line": 783, "text": "I really like you", "speaker": "Jane"},
{"line": 784, "text": "said Jane, smiling shyly,", "speaker": "Narrator"},
{"line": 785, "text": "but... I have to say No.", "speaker": "Jane"},
...
]

(The speakers would be listed in json also; the above is just more human-readable as an example.
{speakerid: 0, name: "Narrator"} would be better as an index, with speaker: 0 in the lines above.)
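
A minimal sketch of the indexed layout described above. The top-level `speakers` and `lines` keys are assumptions for illustration, not something booknlp emits today:

```python
import json

# Hypothetical layout: a speaker table plus ordered lines that refer to it by id.
book = {
    "speakers": [
        {"speakerid": 0, "name": "Narrator"},
        {"speakerid": 1, "name": "Jane"},
    ],
    "lines": [
        {"line": 783, "text": "I really like you", "speaker": 1},
        {"line": 784, "text": "said Jane, smiling shyly,", "speaker": 0},
        {"line": 785, "text": "but... I have to say No.", "speaker": 1},
    ],
}

# Round-trip through JSON to show the structure serializes cleanly.
book = json.loads(json.dumps(book))

# Resolve ids back to names for a human-readable view.
names = {s["speakerid"]: s["name"] for s in book["speakers"]}
script = [(names[entry["speaker"]], entry["text"]) for entry in book["lines"]]

for name, text in script:
    print(f"{name}: {text}")
```

Keeping names in one table means a speaker can be renamed or merged in a single place without touching any line.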

  3. There needs to be pre-booknlp text processing, and then post-booknlp text processing. That will solve the number issues, the punctuation, the mispronounced words, weird timing/spacing issues, and more.

Simple list(s) of substitutions (likely with some regex as well, e.g. for number handling, but also for other cases like the weird -- issues that can be seen in https://github.com/booknlp/booknlp/tree/main/examples/158_emma ) would allow people to customize, solve a problem once, and even share the fix.
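
One possible shape for such lists, as a sketch only. The rule names and the specific rules are made up for illustration; a real list would be built up per book:

```python
import re

# Hypothetical rule lists; each entry is (regex pattern, replacement).
PRE_RULES = [
    (r"\s*--\s*", ", "),           # turn "--" dashes into a plain pause before parsing
]
POST_RULES = [
    (r"\bDr\.", "Doctor"),         # expand an abbreviation for the TTS
    (r"\b(\d+)%", r"\1 percent"),  # spell out a percent sign
]

def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

print(apply_rules("Get his feet up -- roll up your coat", PRE_RULES))
print(apply_rules("Dr. Who is 50% sure", POST_RULES))
```

Because the rules are plain data, they could live in a shared json or text file rather than in code.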

  4. Once a line is generated, if the line isn't right and it can be fixed by tweaking the text or voice, that change can be rolled into the list(s) above, or applied as a one-off on the fly, so we regen it and solve it everywhere.
    So this would be easy to fix:

Oh, it mispronounced this, let's tweak that spelling so it sounds right....
[This change will affect other 17 lines, Y/N? Regenning 17 lines] [Add this to the list of TTS rewrites? Y/N]
Oh that voice just isn't right, let's change that to a different voice sample....
[This change will regen 38 lines Y/N?]
The default voice sample doesn't quite work here, but I like it in 99% of her speaking... let's just tweak THIS line with a new voice sample that sounds a bit more scared/anxious to get the tone right...
[selected ScaredJane.wav, added as new Speaker "JaneAnxious". Line regened]

Again, it all being in json helps here for all of these examples.
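
A rough sketch of the "this change will affect N lines" step, assuming each line is a dict with `text` and `index` keys. The function names and the "Ho-loo" respelling are hypothetical:

```python
import re

def affected_lines(lines, pattern):
    """Return the entries whose text matches and would need regenerating."""
    rx = re.compile(pattern)
    return [entry for entry in lines if rx.search(entry["text"])]

def apply_rewrite(lines, pattern, replacement):
    """Rewrite matching entries in place; return their indexes for regeneration."""
    rx = re.compile(pattern)
    changed = []
    for entry in lines:
        new_text = rx.sub(replacement, entry["text"])
        if new_text != entry["text"]:
            entry["text"] = new_text
            changed.append(entry["index"])
    return changed

lines = [
    {"index": 410, "text": "Take his head,", "speaker_id": 376},
    {"index": 413, "text": "To Jolu she said,", "speaker_id": 0},
    {"index": 415, "text": "Jolu moved quickly.", "speaker_id": 0},
]

hits = affected_lines(lines, r"\bJolu\b")
print(f"This change will affect {len(hits)} lines")

# Respell the name phonetically (made-up spelling) and collect what to regen.
to_regen = apply_rewrite(lines, r"\bJolu\b", "Ho-loo")
print(to_regen)
```

The returned indexes are what a UI would hand to the TTS step after the user confirms.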

  5. UI improvements... Is there a reason you're not using Gradio/etc?
    https://github.com/quantumlump/eBook_to_Audiobook_with_F5-TTS (which again, isn't as complex and is single speaker focused) is SO nice... I'd love to see something like this with VoxNovel.
    Let me see a list of speakers and drop and drag voices, preview, etc. Let me preview each line and make the tweaks above, etc...

Happy to help implement some of these, just let me know.

@scruffynerf
Author

in support of item 2, I've begun https://github.com/scruffynerf/book2jsonofnlp

@DrewThomasson
Owner

oh? 👀

if you can figure out how to make the output not have to be like de-tokenized then that would fix a lot of issues that are hard to mess with in booknlp

👍

also sorry about the late response, been busy lately with finals and all

@scruffynerf
Author

progress made, still debugging...
example output:

        {
            "text": "Vanessa took off her jean jacket and then pulled off the cotton hoodie she was wearing underneath it. She wadded it up and pressed it to Darryl 's side.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 409
        },
        {
            "text": "Take his head,",
            "speaker_id": 376,
            "speaker_name": "Van",
            "index": 410
        },
        {
            "text": "she said to me.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 411
        },
        {
            "text": "Keep it elevated.",
            "speaker_id": 376,
            "speaker_name": "Van",
            "index": 412
        },
        {
            "text": "To Jolu she said,",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 413
        },
        {
            "text": "Get his feet up -- roll up your coat or something.",
            "speaker_id": 376,
            "speaker_name": "Van",
            "index": 414
        },
        {
            "text": "Jolu moved quickly. Vanessa 's mother is a nurse and she 'd had first aid training every summer at camp. She loved to watch people in movies get their first aid wrong and make fun of them.
 I was so glad to have her with us.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 415
        },

@DrewThomasson
Owner

Oh dang!

And you're not having any issues with words like

"Don't"

Being written as

"Don 't"?

@scruffynerf
Author

scruffynerf commented Dec 13, 2024

fixing those spacing issues is on my todo list. Still working on it... the booknlp code is a bit crufty, so I'm cleaning it up, learning to understand it, adding json output, and figuring out issues beyond booknlp's reach

As mentioned, pre-booknlp processing (so it doesn't struggle) and post-booknlp processing (to make it better for TTS) are likely both needed. I'll try to make the json output as clean as possible, and I suspect this will iterate well. The above text is Little Brother by Cory Doctorow: a good book, and also a gutenberg text, but modern, with lots of weird formatting like > texting and * shouting * which are probably good test cases to solve.

The Emma text used by bookNLP as an example doesn't even get parsed entirely correctly (look early and you'll see a 3-way convo with Emma, her dad, and Mr Knightley, and it's incorrect). So rather than figure out why, I figured I'd try a different text with more modern structures.
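
For the split contractions and possessives ("Darryl 's", "she 'd", "Don 't"), one post-processing pass might look like the sketch below. It assumes straight apostrophes and covers only the common English clitics; `rejoin_clitics` is a hypothetical helper, not part of booknlp:

```python
import re

# Clitics that tokenizers commonly split off: n't, 's, 'd, 'll, 're, 've, 'm, 't.
# The trailing \b keeps quoted words such as 'stop' from matching.
_CLITIC = re.compile(r"\s+(n't|'(?:s|d|ll|re|ve|m|t))\b", re.IGNORECASE)

def rejoin_clitics(text):
    """Glue a split-off clitic back onto the word before it."""
    return _CLITIC.sub(r"\1", text)

print(rejoin_clitics("She wadded it up and pressed it to Darryl 's side."))
print(rejoin_clitics("this was n't quiet"))
```

A pass like this can still misfire on a single-quoted one-letter word, so it is probably best kept as an editable rule rather than hard-coded.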

@scruffynerf
Author

If we end up with a better sounding audiobook than these 4 AI voices, with actual multiple speakers, victory is ours.

https://hackernoon.com/dedicated-to-borderlands-books

(that contains the text section above)

@DrewThomasson
Owner

DrewThomasson commented Dec 13, 2024

Oh yeah and about the gradio

I was actually looking into turning it into a gradio app, it's just time consuming and I kinda forgot about it

😅😅

But for ref here's what I was getting at a couple months ago

auto styleTTS2 version

https://huggingface.co/spaces/drewThomasson/Auto-VoxNovel-Demo-StyleTTS

testing how to make the character selections in gradio

https://huggingface.co/spaces/drewThomasson/Dynamic-Gradio-Dropdowns

headless voxnovel gradio test space

https://huggingface.co/spaces/drewThomasson/Headless-VoxNovel-Demo-testing_grounds

xtts auto VoxNovel testing space

https://huggingface.co/spaces/drewThomasson/Headless-VoxNovel-Demo

@DrewThomasson
Owner

I was looking at slapping them onto ebook2audiobook as an extra beta feature

Or at least getting these out to replace the crappy docker images of VoxNovel

But it got complex and to be honest VoxNovel was not nearly as popular as I thought it was

So mostly I was throwing my time into ebook2audiobook V2.0

@DrewThomasson
Owner

DrewThomasson commented Dec 13, 2024

I think a good chunk of those links are fully functional tho

But like

The fine controls and such

Yeah lol

Anyway hope that helps you out in some way with their code

@scruffynerf
Author

Yeah, I found the many different programs a bit confusing.... unsure which is which (ie your efforts, adding features, etc).
There are lots of ebook->audio programs that do a single narrator... that space is crowded. The 'cast recording' space is far more open, and actually more useful.

For example... take a Doctor Who novel that Big Finish hasn't (yet) adapted, and give it a few distinct well known voice sample wavs and suddenly it's a full audio experience. Then you take the above style json with some extra tweaks (location, etc which booknlp can do), and suddenly it's an audio track for a video script, with lip sync-ed voices, moving images, and so on... and that's just one example. With the rapid AI video development, music and so on... having a decent book->json breakdown just makes one more potential resource to connect in. Retheming? Rewriting? Recasting? etc..

@DrewThomasson
Owner

Well yeah I wanted to eventually have a local LLM also go through and change how things are said depending on the context surrounding them,

So like have a LLM prompt other audio generation models to generate background sounds when a scene is described in the book

Or have it change the emotion in how things are said through stuff like Facebook's Spirit LM

And such till we basically get a radio show out of a book generated locally

@DrewThomasson
Owner

DrewThomasson commented Dec 13, 2024

That was my ultimate goal 😅😓

@scruffynerf
Author

scruffynerf commented Dec 13, 2024

"Hey AI, take my favorite book, parse it, retheme it as a space western, add some musical soundtrack in the background in the style of Morricone meets Space Opera (NOT https://www.youtube.com/watch?v=YXJiIqJ9_tQ which is awful...), use my voice cast favorites, and give me some visual samples of outfits and crew to decide on..."
"Ok, these 5 picks look good, now make it into a 3 hour video I can watch this evening."

real video (not AI) https://www.youtube.com/watch?v=4SpX8bVEmJo
but seriously, we can do THIS today now.

@DrewThomasson
Owner

Ok yeah making it into a video locally tho that'll probs take the next 5-10 years but yes 😭

@DrewThomasson
Owner

At least we have the same kind of goals in mind for this

@scruffynerf
Author

> Ok yeah making it into a video locally tho that'll probs take the next 5-10 years but yes 😭

Nah, we're almost at realtime video... Suno/Udio is doing 3+ minute songs, static images are getting higher quality and faster every few months, and video models are already lightyears better than a year ago.

But text->audio is totally doable right now, and it'll be easy enough to adapt to do video stuff next. (I do a lot with ComfyUI already, and that's also on my todo list: make booknlp work with ComfyUI and generate images.)

@scruffynerf
Author

scruffynerf commented Dec 15, 2024

progress:

         {
            "text": "She held up a camera and snapped a picture of me and my crew.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 232
        },
        {
            "text": "Cheese,",
            "speaker_id": 938,
            "speaker_name": "Another Kid My Age",
            "index": 233
        },
        {
            "text": "she said.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 234
        },
        {
            "text": "You're on candid snitch - cam.",
            "speaker_id": 938,
            "speaker_name": "Another Kid My Age",
            "index": 235
        },
        {
            "text": "No way,\"I said.\"You would n't --",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 236
        },
        {
            "text": "I will,",
            "speaker_id": 938,
            "speaker_name": "Another Kid My Age",
            "index": 237
        },
        {
            "text": "she said.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 238
        },
        {
            "text": "I will send this photo to truant watch in thirty seconds unless you four back off from this clue and let me and my friends here run it down. You can come back in one hour and it'll be all yours. I think that's more than fair.",
            "speaker_id": 938,
            "speaker_name": "Another Kid My Age",
            "index": 239
        },
        {
            "text": "I looked behind her and noticed three other girls in similar garb -- one with blue hair, one with green, and one with purple.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 240
        },
        {
            "text": "Who are you supposed to be, the Popsicle Squad?",
            "speaker_id": 944,
            "speaker_name": "One With Purple",
            "index": 241
        },
        {
            "text": "We're the team that's going to kick your team's ass at Harajuku Fun Madness,",
            "speaker_id": 938,
            "speaker_name": "Another Kid My Age",
            "index": 242
        },
        {
            "text": "she said.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 243
        },
        {
            "text": "And I'm the one who's * right this second * about to upload your photo and get you in * so much trouble * --",
            "speaker_id": 938,
            "speaker_name": "Another Kid My Age",
            "index": 244
        },
        {
            "text": "Behind me I felt Van start forward. Her all - girls school was notorious for its brawls, and I was pretty sure she was ready to knock this chick's block off.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 245
        },
        {
            "text": "Then the world changed forever.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 246
        },
        {
            "text": "We felt it first, that sickening lurch of the cement under your feet that every Californian knows instinctively -- * earthquake *. My first inclination, as always, was to get away :\"when in trouble or in doubt, run in circles, scream and shout.\"But the fact was, we were already in the safest place we could be, not in a building that could fall in on us, not out toward the middle of the road where bits of falling cornice could brain us.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 247
        },
        {
            "text": "Earthquakes are eerily quiet -- at first, anyway -- but this was n't quiet. This was loud, an incredible roaring sound that was louder than anything I'd ever heard before. The sound was so punishing it drove me to my knees, and I was n't the only one. Darryl shook my arm and pointed over the buildings and we saw it then : a huge black cloud rising from the northeast, from the direction of the Bay.",
            "speaker_id": 0,
            "speaker_name": "Narrator",
            "index": 248
        },

so there is a bit of cleanup left to do: the split contractions like "was n't", and the stray "s (unsure how best to handle these; always split on "s?).
The " - " stuff can likely be fixed by stripping the spaces and joining the words with the dash.
How to handle emphasis markers like * right this second * so the TTS adds some sort of emphasis? Maybe split on those, and alter the voice params slightly for those segments? Or just figure out how to tell the TTS to do that?
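
A sketch of two of those cleanups, assuming the spaced " - " and "* ... *" forms shown in the output above. Both helper names are hypothetical:

```python
import re

def join_hyphens(text):
    """Collapse 'snitch - cam' into 'snitch-cam'; ' -- ' dashes are left alone."""
    return re.sub(r"(?<=\w) - (?=\w)", "-", text)

def split_emphasis(text):
    """Split text into (segment, emphasized) pairs on '* ... *' markers."""
    parts = re.split(r"\*\s*([^*]+?)\s*\*", text)
    # re.split with one capture group alternates: plain, emphasized, plain, ...
    return [(seg.strip(), i % 2 == 1) for i, seg in enumerate(parts) if seg.strip()]

print(join_hyphens("You're on candid snitch - cam."))
for segment, emphasized in split_emphasis(
    "about to upload your photo and get you in * so much trouble * --"
):
    print(emphasized, segment)
```

Each emphasized segment could then be generated with slightly different voice parameters, or wrapped in whatever emphasis markup the chosen TTS accepts.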

also: "speaker_name": "One With Purple",
I think that's Narrator misattributed... so we'd still want a way to find stray lines and reattribute, and do so in bulk.

yeah, it is: original text:
I looked behind her and noticed three other girls in similar garb -- one with blue hair, one with green, and one with purple. "Who are you supposed to be, the Popsicle Squad?"

also, the "she said."s could be removed, IF the voices are now distinct... there are arguments both ways (text accurate, versus Audio cleanup)... obviously only the bare "she said" by narrator, and not "she said, warily, looking him over" sort of stuff. That could be an option to 'hide' those and not generate them.
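
Detecting only the bare tags could be sketched like this, assuming the narrator is `speaker_id` 0 as in the output above. The pattern is deliberately narrow and the names are hypothetical:

```python
import re

# Bare tag: a pronoun or one capitalized name, then "said", then optional punctuation.
BARE_TAG = re.compile(r"^(?:he|she|they|I|[A-Z][a-z]+) said[.,]?$")

def is_bare_tag(entry, narrator_id=0):
    """True for narrator lines that are nothing but a dialogue tag."""
    return entry["speaker_id"] == narrator_id and bool(BARE_TAG.match(entry["text"].strip()))

lines = [
    {"text": "she said.", "speaker_id": 0, "index": 234},
    {"text": "she said to me.", "speaker_id": 0, "index": 411},
    {"text": "To Jolu she said,", "speaker_id": 0, "index": 413},
]

# Mark rather than delete, so the option stays reversible.
hidden = [entry["index"] for entry in lines if is_bare_tag(entry)]
print(hidden)
```

Marking lines as hidden instead of removing them keeps the json text-accurate while letting the audio skip them.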

@DrewThomasson
Owner

lol yeah you're running into the same issues I ran into

I ended up doing a bunch of manual reformatting

Should be around the top area, within the BOOKNLP part of my code

You can probs pass it through chatgpt to pull out the parts you want

It's a mess 😅😭😓

@scruffynerf
Author

https://github.com/scruffynerf/book2jsonofnlp has the code for above

It still uses booknlp as the actual python package name, so no code changes are needed externally.
Just uninstall the current booknlp with pip, then
pip install git+https://github.com/scruffynerf/book2jsonofnlp
and it should work. The new file created is .book.json
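
Reading the result back might look like the sketch below. The top-level layout of the file is not shown in this thread, so this assumes either a bare list of entries or a dict holding that list under a key passed as `lines_key`; both the helper and that parameter are hypothetical:

```python
import json
from collections import Counter

def lines_per_speaker(path, lines_key=None):
    """Count how many lines each speaker has in a .book.json file."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumption: entries are either the whole file or nested under lines_key.
    entries = data[lines_key] if lines_key else data
    return Counter(entry["speaker_name"] for entry in entries)

# Usage, with a placeholder path:
# for name, count in lines_per_speaker("mybook.book.json").most_common(10):
#     print(count, name)
```

Sorting speakers by line count is a quick way to spot which voices matter most and which stray speakers are probably misattributions.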

Happy for fresh eyeballs. Still in progress, but this should help if you want to start using this.

@scruffynerf
Author

scruffynerf commented Dec 15, 2024

> lol yeah you're running into the same issues I ran into

Which issues?

> I ended up doing a bunch of manual reformatting

as I said, small substitutions are to be expected... What sort of manual reformatting?

> Should be around the top area, within the BOOKNLP part of my code

not sure what you mean? Beyond the number stuff?

Reassigning speakers would now be ultra easy, thanks to the json... the character list is there, the ids are there (the names are likely to be removed/ignored, especially if we alter...)

In the above case, your current gui lets us reassign "Purple" back to Narrator. The 'improvement' would be search/select all/etc. (all of which should be easier with json-ed info)
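
With the entries in json, a bulk reassignment could be as small as the sketch below. It uses the `speaker_id`, `speaker_name`, and `index` fields from the output above; the function itself is hypothetical:

```python
def reassign_speaker(lines, from_id, to_id, to_name, only_indexes=None):
    """Move lines from one speaker to another; return the indexes that changed.

    only_indexes, if given, limits the change to those line indexes.
    """
    changed = []
    for entry in lines:
        if entry["speaker_id"] != from_id:
            continue
        if only_indexes is not None and entry["index"] not in only_indexes:
            continue
        entry["speaker_id"] = to_id
        entry["speaker_name"] = to_name
        changed.append(entry["index"])
    return changed

lines = [
    {"text": "Who are you supposed to be, the Popsicle Squad?",
     "speaker_id": 944, "speaker_name": "One With Purple", "index": 241},
    {"text": "We're the team that's going to kick your team's ass at Harajuku Fun Madness,",
     "speaker_id": 938, "speaker_name": "Another Kid My Age", "index": 242},
]

# Fold every "One With Purple" line back into the Narrator.
changed = reassign_speaker(lines, from_id=944, to_id=0, to_name="Narrator")
print(changed)
```

The same call with `only_indexes` covers the "select some, not all" case from a search result.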

@DrewThomasson
Owner

Honestly probs just gonna rebuild the whole thing at this point

Like

I'll say it: my code for VoxNovel is garbage

Idk how it's even functioning XD

@scruffynerf
Author

to be clear, regardless of voxnovel or whatever the next gen is, or if you roll it into ebook2audiobook...

I'm doing the book->json cause that's the key piece missing for all of this (for whoever wants to do better multi-voice TTS)

@DrewThomasson
Owner

DrewThomasson commented Dec 15, 2024

Yes yes yes this will be very helpful in any direction I go or anyone else goes with this

@scruffynerf
Author

did you and Robert start a discord for this stuff?

@DrewThomasson
Owner

No but we probs should lol

Cause the next ebook2audiobook will have 1107 languages.... so that'll send a lota people running at our work ._.

@DrewThomasson
Owner

Here I'll rush one out but be warned I've never hosted a server lol

@DrewThomasson
Owner

Join Our Discord Server!

Click the badge below to join the Ebook2audiobook Discord Server!
Discord
