Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for generic distance calculation delegate #9

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

amaid
Copy link

@amaid amaid commented Sep 24, 2023

Closes #8
This pull request appears to be making changes to a C# codebase, likely related to a machine learning or data analysis library. The changes seem to be focused on introducing generic types and refining the code structure. Below is a description of the changes made in this pull request:

  1. DistanceCalculation Delegate Change:

    • A delegate called DistanceCalculation is modified to become a generic delegate DistanceCalculation<T> where T must implement the IUmapDataPoint interface. This change allows for more flexible distance calculations that can work with different types of data points.
  2. IUmapDataPoint Interface Addition:

    • A new interface named IUmapDataPoint is added to the UMAP namespace. This interface represents a single data point that will be processed by the UMAP algorithm. It requires implementing classes to provide a Data property, likely representing the data associated with the data point.
  3. NNDescent Changes:

    • The NNDescent class, which appears to be related to nearest neighbor descent, is updated to be a generic class NNDescent<T> where T must implement the IUmapDataPoint interface. This change ensures that the class can work with different types of data points.
  4. SIMD and SIMDInt Changes:

    • The SIMD and SIMDInt classes are updated to become generic classes SIMD<T> and SIMDInt<T>. This change likely reflects the need to work with different data types (possibly floating-point and integer) depending on the data points used.
  5. Tree Changes:

    • The Tree class is updated to become a generic class Tree<T> where T must implement the IUmapDataPoint interface. This change ensures that the class can work with different types of data points.
  6. Umap Class Changes:

    • The Umap class is updated to become a generic class Umap<T> where T must implement the IUmapDataPoint interface. This change allows the UMAP algorithm to operate on different types of data points.
    • The DistanceCalculation field is changed to DistanceCalculation<T> to reflect the use of a generic distance calculation delegate.
    • Various methods and internal structures within the Umap class are adjusted to accommodate the use of generic data types.
  7. Distance Functions:

    • The DistanceFunctions class defines different distance calculation functions such as cosine and Euclidean distance. These functions are updated to accept generic data types (T) instead of float arrays.
  8. OptimizationState Class:

    • The OptimizationState class appears to store various parameters and state information related to the UMAP algorithm. It is not directly impacted by the generic changes but is part of the UMAP class.

In summary, this pull request introduces generic type support for the UMAP algorithm, allowing it to work with different data point types while maintaining flexibility in distance calculations and optimizations. Additionally, it adds an interface (IUmapDataPoint) for representing data points processed by UMAP.

@amaid
Copy link
Author

amaid commented Sep 24, 2023

@theolivenbaum - Pull request created to address issue #8 (Distance calculation using additional data)

@Arlodotexe
Copy link

Arlodotexe commented Sep 26, 2023

👋 @amaid You've essentially abstracted away each separate vector in the embedding, rather than abstracting away the vector array float[]. I'm going to open a PR to your fork to fix this, which will show here once merged.

This was abstracting away each separate vector in the embedding, rather than abstracting away the vector array float[]
@Arlodotexe
Copy link

@amaid I've got a PR opened on your fork. Once merged, the changes should show in this PR automatically.

Needed for https://github.com/Arlodotexe/OwlCore.AI.Exocortex

Implemented IUmapDataPoint, cleanup IUmapDistanceParameter.
@amaid
Copy link
Author

amaid commented Sep 27, 2023

@amaid I've got a PR opened on your fork. Once merged, the changes should show in this PR automatically.

Needed for https://github.com/Arlodotexe/OwlCore.AI.Exocortex

Awesome! merged.

@Arlodotexe
Copy link

Arlodotexe commented Sep 27, 2023

@theolivenbaum I recommend squashing this PR when merging/closing.

I'm also aware that this is a breaking change to the library, which will need a bit of discussion and possibly a migration guide.

Or, maybe we should just ship an inbox non-generic implementation to polyfill what we had before? Something like:

public class RawVectorArrayUmapDataPoint : IUmapDataPoint
{
    public RawVectorArrayUmapDataPoint(float[] data) => Data = data;

     public float[] Data { get; }

    /// <summary>
    /// Define an implicit conversion operator from <see cref="float[]"/>.
    /// </summary>
    public static implicit operator RawVectorArrayUmapDataPoint(float[] data)  => new(x);   
        
    /// <summary>
    /// Implicit conversation back to <see cref="float[]"/>.
    /// </summary>
    public static implicit operator float[](RawVectorArrayUmapDataPoint x) => x.Data;   
}

public class Umap : Umap<RawVectorArrayUmapDataPoint>
{
    public Umap(
            DistanceCalculation<RawVectorArrayUmapDataPoint> distance = null,
            IProvideRandomValues random = null,
            int dimensions = 2,
            int numberOfNeighbors = 15,
            int? customNumberOfEpochs = null,
            ProgressReporter progressReporter = null)
                : base(distance, random, dimensions, numberOfNeighbors, customNumberOfEpochs, progressReporter)
        {
             // ...
        }
}

The implicit conversion in RawVectorArrayUmapDataPoint should allow the library consumer to use the library as they were before, passing a float[] directly to a new Umap(...).InitializeFit(someFloatArray); without the generics.

@amaid
Copy link
Author

amaid commented Nov 19, 2023

@theolivenbaum I recommend squashing this PR when merging/closing.

I'm also aware that this is a breaking change to the library, which will need a bit of discussion and possibly a migration guide.

Or, maybe we should just ship an inbox non-generic implementation to polyfill what we had before? Something like:

public class RawVectorArrayUmapDataPoint : IUmapDataPoint
{
    public RawVectorArrayUmapDataPoint(float[] data) => Data = data;

     public float[] Data { get; }

    /// <summary>
    /// Define an implicit conversion operator from <see cref="float[]"/>.
    /// </summary>
    public static implicit operator RawVectorArrayUmapDataPoint(float[] data)  => new(x);   
        
    /// <summary>
    /// Implicit conversation back to <see cref="float[]"/>.
    /// </summary>
    public static implicit operator float[](RawVectorArrayUmapDataPoint x) => x.Data;   
}

public class Umap : Umap<RawVectorArrayUmapDataPoint>
{
    public Umap(
            DistanceCalculation<RawVectorArrayUmapDataPoint> distance = null,
            IProvideRandomValues random = null,
            int dimensions = 2,
            int numberOfNeighbors = 15,
            int? customNumberOfEpochs = null,
            ProgressReporter progressReporter = null)
                : base(distance, random, dimensions, numberOfNeighbors, customNumberOfEpochs, progressReporter)
        {
             // ...
        }
}

The implicit conversion in RawVectorArrayUmapDataPoint should allow the library consumer to use the library as they were before, passing a float[] directly to a new Umap(...).InitializeFit(someFloatArray); without the generics.

This is implemented.

@Arlodotexe
Copy link

Arlodotexe commented Nov 20, 2023

You changed all the internal T to a hardcoded type (the new one), the code I provided works without making further changes to Umap<T>. We'll need to correct this.

@amaid
Copy link
Author

amaid commented Nov 20, 2023

You changed all the internal T to a hardcoded type (the new one), the code I provided works without making further changes to Umap<T>. We'll need to correct this.

This has been fixed, the implicit typecasting has been implemented to support explicit float[][] data to support existing class consumers without changing the existing implementation of Umap

@theolivenbaum The PR is ready to be merged. The changes have been tested for existing class consumers as well. The unit tests are passing.

image

@Arlodotexe
Copy link

Arlodotexe commented May 27, 2024

Great, thanks @amaid!

@theolivenbaum Bumping

@Arlodotexe
Copy link

You changed all the internal T to a hardcoded type (the new one), the code I provided works without making further changes to Umap<T>. We'll need to correct this.

This has been fixed, the implicit typecasting has been implemented to support explicit float[][] data to support existing class consumers without changing the existing implementation of Umap

@theolivenbaum The PR is ready to be merged. The changes have been tested for existing class consumers as well. The unit tests are passing.

@theolivenbaum Any chance we could get eyes on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Distance calculation with additional data
2 participants