Adding draft extension for host-provided scratch memory #423

Open
wants to merge 9 commits into
base: next

Conversation

jatinchowdhury18
Contributor

No description provided.

@abique
Contributor

abique commented Oct 15, 2024

Hi @jatinchowdhury18

I have some questions with the interface.

  1. Is this scratch memory thread-local?
  2. Can you get the scratch memory pointer from the thread pool? If so, is it shared among all the tasks, or will each one use the current thread-local one?

@defiantnerd
Contributor

> Hi @jatinchowdhury18
>
> I have some questions with the interface.
>
>   1. Is this scratch memory thread-local?
>   2. Can you get the scratch memory pointer from the thread pool? If so, is it shared among all the tasks, or will each one use the current thread-local one?

The only constraint is that the memory is valid for a plugin during its process call, like the event queue or the audio buffers. No further lifetime guarantees are given.

@abique
Contributor

abique commented Oct 15, 2024

> > Hi @jatinchowdhury18
> > I have some questions with the interface.
> >
> >   1. Is this scratch memory thread-local?
> >   2. Can you get the scratch memory pointer from the thread pool? If so, is it shared among all the tasks, or will each one use the current thread-local one?
>
> The only constraint is that the memory is valid for a plugin during its process call, like the event queue or the audio buffers. No further lifetime guarantees are given.

Then consider this:

  • each voice needs 10K of scratch buffer
  • you process 32 voices in parallel

If you have a single scratch buffer for process(), you can't process your voices in parallel, or you need 32 * 10K.
If you have a thread-local scratch that you can query from the thread pool, then you can process your voices in parallel with a scratch buffer of 10K.

I believe this needs to be clarified in the spec.
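
To put numbers on the two options, here is a minimal sketch of the arithmetic. It reads "10K" as 10 KiB, and every function name is hypothetical; only the 32 voices and the 10K per voice come from the example above.

```cpp
#include <cassert>
#include <cstdint>

// Figures from the example above ("10K" is assumed to mean 10 KiB).
constexpr std::uint64_t kVoiceScratchBytes = 10 * 1024;
constexpr std::uint64_t kNumVoices = 32;

// Option A: a single scratch buffer per process() call. To run all voices
// in parallel, the plugin has to reserve one chunk per voice.
constexpr std::uint64_t single_buffer_bytes(std::uint64_t voices, std::uint64_t per_voice) {
    return voices * per_voice;
}

// Option B: a thread-local scratch. The plugin only asks for one voice's
// worth; the host provides that amount on each of its worker threads.
constexpr std::uint64_t thread_local_request_bytes(std::uint64_t per_voice) {
    return per_voice;
}

// What the host ends up holding under option B for a given thread count
// (the thread count is the host's choice, not something the plugin knows).
constexpr std::uint64_t host_total_bytes(std::uint64_t num_threads, std::uint64_t per_thread) {
    return num_threads * per_thread;
}

static_assert(single_buffer_bytes(kNumVoices, kVoiceScratchBytes) == 327680, "32 * 10 KiB");
static_assert(thread_local_request_bytes(kVoiceScratchBytes) == 10240, "10 KiB");
```

Under option A the plugin must reserve 320 KiB to keep 32 voices independent; under option B it reserves 10 KiB and the total scales with the host's thread count rather than the voice count.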

@defiantnerd
Contributor

> Then consider this:
>
>   • each voice needs 10K of scratch buffer
>   • you process 32 voices in parallel
>
> If you have a single scratch buffer for process(), you can't process your voices in parallel, or you need 32 * 10K. If you have a thread-local scratch that you can query from the thread pool, then you can process your voices in parallel with a scratch buffer of 10K.
>
> I believe this needs to be clarified in the spec.

You can only request/register one buffer per instance with a predefined size during activation. The host signals through the return value of request_size whether you can access it.

In the process call you just call `auto *scratchmem = ext_scratch_memory->access(host);` and if `scratchmem` is not null you can use it during the process call.

But there is only one buffer and the size does not change as long as the plugin is active.
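
A minimal sketch of that plugin-side flow, including a fallback for hosts that decline the request. `MockScratchHost`, `Plugin`, and their methods are hypothetical stand-ins: the real draft extension takes a `const clap_host_t *`, and its method names were still under review in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the host side of the draft extension.
struct MockScratchHost {
    bool supports_scratch = true;
    std::vector<unsigned char> storage;

    // Called during activation; returns false if the host declines.
    bool reserve(std::uint32_t bytes) {
        if (!supports_scratch) return false;
        storage.resize(bytes);
        return true;
    }

    // Called during process(); returns nullptr if nothing was reserved.
    void *access() { return storage.empty() ? nullptr : storage.data(); }
};

// Hypothetical plugin that asks for one fixed-size buffer per instance.
struct Plugin {
    static constexpr std::uint32_t kScratchBytes = 10000;
    MockScratchHost *host = nullptr;
    bool host_scratch_ok = false;
    std::vector<unsigned char> fallback;

    void activate() {
        assert(host != nullptr);
        host_scratch_ok = host->reserve(kScratchBytes);
        // If the host said no, allocate our own buffer here, off the audio thread.
        if (!host_scratch_ok) fallback.resize(kScratchBytes);
    }

    // Buffer to use for the current block: host scratch if granted,
    // otherwise the plugin-owned fallback.
    void *scratch_for_block() {
        if (host_scratch_ok) {
            if (void *p = host->access()) return p;
        }
        return fallback.empty() ? nullptr : fallback.data();
    }
};
```

The size is fixed at activation in both paths, so nothing in `scratch_for_block()` allocates.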

@baconpaul
Collaborator

> > Then consider this:
> >
> >   • each voice needs 10K of scratch buffer
> >   • you process 32 voices in parallel
> >
> > If you have a single scratch buffer for process(), you can't process your voices in parallel, or you need 32 * 10K. If you have a thread-local scratch that you can query from the thread pool, then you can process your voices in parallel with a scratch buffer of 10K.
> > I believe this needs to be clarified in the spec.
>
> You can only request/register one buffer per instance with a predefined size during activation. The host signals through the return value of request_size whether you can access it.
>
> In the process call you just call `auto *scratchmem = ext_scratch_memory->access(host);` and if `scratchmem` is not null you can use it during the process call.
>
> But there is only one buffer and the size does not change as long as the plugin is active.

I think Alex's point is that if you use the thread-pool extension to schedule jobs, those jobs will effectively run in parallel under process().

Can a thread-pool extension job access the scratch memory? If so, is it distinct per thread or is it a single memory location?

My guess is: the thread-pool and scratch-memory extensions need some careful co-consideration. And in those cases, the patterns where people use the scratch memory will require scratch-per-thread-voice, not scratch-per-process-block.

@jatinchowdhury18
Contributor Author

Yeah, thanks for bringing this up, I hadn't considered the inter-operation of scratch-memory and the thread-pool extension.

I think the simplest solution would be to have the plugin request the total amount of scratch memory that it needs across all possible threads. Since thread_pool.request_exec is blocking, I think it should be safe for the thread pool jobs to use the scratch memory, provided the plugin makes sure that each thread is using an independent "chunk" of the memory. So a simple example would look like:

struct My_Plugin
{
    size_t scratch_mem_per_voice = 10'000;
    size_t num_voices = 32;
    char* scratch_memory_data = nullptr;

    void activate(...) {
        scratch_memory_ext.pre_reserve(host, scratch_mem_per_voice * num_voices);
    }

    void process(...) {
        // Get all the scratch memory here
        scratch_memory_data = static_cast<char*>(scratch_memory_ext.access(host)); // access() returns void*
        thread_pool_ext.request_exec(host, num_voices);
    }

    void thread_pool_callback(uint32_t task_index) {
        // Get a partition of the scratch memory for this voice to use
        char* this_voice_scratch_mem = scratch_memory_data + scratch_mem_per_voice * task_index;
        // do the actual DSP work here...
    }
};

However, I can see a few reasons why this might not be an ideal solution... For example, if the host thread pool only has 8 threads, then reserving enough scratch memory for 32 voices to be processed in parallel is a bit wasteful. Maybe the best solution is to have two scratch memory mechanisms... one scratch buffer that is intended to be accessed during the process() callback, and another to be accessed during the thread_pool_exec() callback?

typedef struct clap_host_scratch_memory {
    bool(CLAP_ABI *pre_reserve_process_scratch)(const clap_host_t *host, size_t scratch_size_bytes);
    void*(CLAP_ABI *access_process_scratch)(const clap_host_t *host);

    bool(CLAP_ABI *pre_reserve_thread_pool_scratch)(const clap_host_t *host, size_t scratch_size_per_thread_bytes);
    void*(CLAP_ABI *access_thread_pool_scratch)(const clap_host_t *host, uint32_t task_index);
} clap_host_scratch_memory_t;

All that said, I don't have much experience with the thread-pool extension, so I'll defer to your more informed opinions :).

@abique
Contributor

abique commented Oct 16, 2024

Here is my proposal:

The scratch is a thread-local pointer: if you retrieve it from the process call, you get a pointer that you can share with all the jobs. If you retrieve it from the thread pool, you get a pointer that is only for the current job.

I think this is the correct direction because it corresponds to how the host will implement this feature: each audio thread will have a single (thread-local) scratch buffer whose size is greater than or equal to the maximum size requested by any plugin instance.

The total scratch memory is: num_threads * max_scratch_size.
You definitely don't want max_scratch_size = max(plug_num_voices * plug_voice_scratch_size); instead we want max_scratch_size = max(plug_voice_scratch_size).
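
A host-side sketch of this proposal, under stated assumptions. `ScratchPool` and its methods are hypothetical names, not CLAP API; the sketch assumes one pool per process and that each audio or worker thread calls `prepare_current_thread()` outside the real-time path, after all plugins have been activated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side bookkeeping for thread-local scratch memory.
class ScratchPool {
public:
    // Called while a plugin is being activated: track the largest request
    // across all plugin instances (max_scratch_size in the text above).
    bool reserve(std::uint32_t scratch_size_bytes) {
        max_scratch_size_ = std::max(max_scratch_size_, scratch_size_bytes);
        return true;
    }

    // Called by each audio/worker thread before it runs any plugin code.
    // This may allocate, so it must not run inside the real-time path.
    void prepare_current_thread() { buffer().resize(max_scratch_size_); }

    // Called from process() or from a thread-pool job: returns the buffer
    // that belongs to whichever thread is calling.
    void *access() { return buffer().empty() ? nullptr : buffer().data(); }

    std::uint32_t max_scratch_size() const { return max_scratch_size_; }

    // Total footprint: num_threads * max_scratch_size.
    std::uint64_t total_bytes(std::uint32_t num_threads) const {
        return std::uint64_t{num_threads} * max_scratch_size_;
    }

private:
    // One buffer per thread, created lazily on first use by that thread.
    static std::vector<unsigned char> &buffer() {
        thread_local std::vector<unsigned char> buf;
        return buf;
    }

    std::uint32_t max_scratch_size_ = 0;
};
```

Because the buffer is sized by the largest single request rather than by voice count, a plugin that needs 10,000 bytes per voice costs the host 10,000 bytes per thread no matter how many voices it schedules.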

// host when the plugin is de-activated.
//
// [main-thread & being-activated]
bool(CLAP_ABI *pre_reserve)(const clap_host_t *host, size_t scratch_size_bytes);
Contributor

Do we ever use size_t in any other extension? -> uint?_t ?

Contributor

Good point, uint32_t would do the job I think.

Collaborator

Why not use uint64_t, which is size_t on most of our systems? The host has the option to return false for out-of-bounds values.

Contributor Author

I don't have a preference for uint32_t vs uint64_t... but Trinitou is right that CLAP doesn't use size_t anywhere else, so I don't think we should use it here.

Contributor

abique commented Oct 17, 2024

A scratch size bigger than 4 GB would be problematic I suppose; remember the total is nthreads * max_scratch_size.
Anyway, regardless of the type, many hosts will likely have their own threshold.
uint32_t seems sufficient to me, but I'm happy with uint64_t as well.
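
A minimal sketch of how a host might apply its own threshold whichever type is chosen. The budget value and the function name are hypothetical; the point is only that the per-thread size is widened before it is multiplied by the thread count, so the product cannot wrap.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host policy: total scratch allowed across all threads.
constexpr std::uint64_t kHostScratchBudgetBytes = 256ull * 1024 * 1024; // 256 MiB

// Returns true if `num_threads` buffers of `scratch_size_bytes` each fit
// within the host's budget. Both inputs are 32-bit, so the 64-bit product
// below is always exact.
constexpr bool host_accepts_reserve(std::uint32_t scratch_size_bytes, std::uint32_t num_threads) {
    return std::uint64_t{scratch_size_bytes} * std::uint64_t{num_threads}
           <= kHostScratchBudgetBytes;
}

static_assert(host_accepts_reserve(10000, 8), "a small request fits");
static_assert(!host_accepts_reserve(0xFFFFFFFFu, 8), "nearly 4 GiB per thread is rejected");
```

With a check like this the host simply returns false for oversized requests, and the plugin falls back to its own allocation.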

@jatinchowdhury18
Contributor Author

> The scratch is a thread-local pointer, so if you retrieve it from the process call, you'll get a pointer that you can share with all the jobs. If you retrieve it from the thread pool, you get a pointer that is only for the current job.

This makes sense! I've added some comments intended to clarify this point, but please let me know if there are ways I can improve my explanation :).

Contributor

@Trinitou left a comment

It looks pretty good to me now. Any further details can be discussed on next IMO.


// This extension lets the plugin request "scratch" memory
// from the host. Scratch memory can be accessed during the
// `process()` callback, but is not persistent between callbacks.
Contributor

"but its content isn't persistent between callbacks."

Otherwise it could be read as saying that the allocation itself isn't persistent.

//
// On memory-constrained platforms, this optimization may allow for
// more plugins to be used simultaneously. On platforms with lots
// of memory, this optimization may improve CPU cache usage.
Contributor

Both optimizations, space and CPU cache usage, are valid regardless of the available memory on the system.

// then this method must return a pointer to a memory block at least
// as large as the reserved size. If the host returned "false"
// when scratch memory was requested, then this method must not
// be called.
Contributor

then this method must not be called and will return NULL.

// thread can independently provide the requested amount
// of scratch memory.
//
// [main-thread & being-activated]

suggestion: I think [being-activated] should be documented somewhere, because I believe it is necessary for correctness here (assuming it means that the plugin may only call this whilst it is in the call stack of a call to clap_plugin.activate by the host): otherwise there could be a race between calls to reserve and access?

Comment on lines +47 to +49
// Note that any memory the host allocates to satisfy
// the requested scratch size can be de-allocated by the
// host when the plugin is de-activated.

question: does this imply that if a plugin transitions activated -> deactivated -> being-activated, in the final being-activated state it must re-request scratch memory via a call to reserve? Or is the prior reservation still valid?

Comment on lines +55 to +57
// thread. Accordingly, the host must ensure that each
// thread can independently provide the requested amount
// of scratch memory.

thought: If I understand correctly, I think this could be challenging for a host to implement efficiently, as I'm not sure the host necessarily knows during the call to reserve how many concurrent jobs a plugin is going to request during process, and therefore how much memory needs to be allocated if it wants to be able to run every job in parallel. Consequently, the host would be forced to allocate the worst-case amount of requested memory for every thread pool thread it creates, which might be much more memory than is actually required.

As an extreme example, if the plugin knows it is never going to create thread pool jobs, then all of the per-thread scratch buffers allocated by the host for that plugin are wasted.

I wonder if it would be worth augmenting reserve to have a uint32_t concurrent_buffers argument, which:

  • If 0, means that all tasks receive the same buffer from access, and it is up to the plugin to deal with that as it wishes (perhaps even a feature)
  • If n > 0, means that task indices up to n - 1 will successfully receive a buffer pointer upon calling access, and task indices n and upwards may receive NULL?
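
The suggested semantics can be modelled roughly as follows. This is a sketch under assumptions: `ConcurrentScratch` is a hypothetical class, `access` is given an explicit `task_index` parameter (the draft `access` in this PR has none), and indices at or above n always receive NULL here, although the suggestion only says they may.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of reserve(size, concurrent_buffers) / access(task_index).
class ConcurrentScratch {
public:
    bool reserve(std::uint32_t scratch_size_bytes, std::uint32_t concurrent_buffers) {
        size_ = scratch_size_bytes;
        concurrent_ = concurrent_buffers;
        // 0 means "one buffer shared by every task"; n > 0 means n distinct buffers.
        const std::size_t count = concurrent_buffers == 0 ? 1 : concurrent_buffers;
        storage_.assign(count * static_cast<std::size_t>(scratch_size_bytes),
                        static_cast<unsigned char>(0));
        return true;
    }

    void *access(std::uint32_t task_index) {
        if (storage_.empty()) return nullptr;            // nothing reserved
        if (concurrent_ == 0) return storage_.data();    // shared by all tasks
        if (task_index >= concurrent_) return nullptr;   // beyond the reserved count
        return storage_.data() + static_cast<std::size_t>(task_index) * size_;
    }

private:
    std::uint32_t size_ = 0;
    std::uint32_t concurrent_ = 0;
    std::vector<unsigned char> storage_;
};
```

A plugin that never creates thread-pool jobs would pass 0 and cost the host a single buffer, which addresses the wasted-memory case described above.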

6 participants