#5 Feat: Implement Pipeline Builder for ease of pipeline creation #37

Open
wants to merge 11 commits into
base: main

Conversation

botirk38
Collaborator

Why?

This feature streamlines the creation of machine learning pipelines from per-operation configuration files. The PipelineBuilder class loads configurations dynamically and creates pipelines for tasks such as text-to-embedding, embedding-to-text, text segmentation, and metric analysis. This makes pipeline creation more modular, maintainable, and extensible, and standardizes it across datasets and operations.

Use Case:

  • To facilitate the automated creation of different types of ML pipelines based on configuration files.
  • To allow easy extension and customization for new operations and datasets.
  • To improve maintainability by centralizing the pipeline creation logic.

How?

Technical Decisions:

  1. Directory Structure:

    • The config_dir defaults to huggingface_pipelines/datacards. This directory is assumed to contain subdirectories for each dataset, where operation-specific YAML configuration files are stored.
    • Example structure:
      huggingface_pipelines/datacards/
      ├── dataset_name1/
      │   ├── text_to_embedding.yaml
      │   └── embedding_to_text.yaml
      └── dataset_name2/
          ├── text_segmentation.yaml
          └── analyze_metric.yaml
      
  2. Factory Pattern:

    • A factory pattern is used to create pipelines. Each operation has a corresponding factory (e.g., TextToEmbeddingPipelineFactory) that is responsible for creating the pipeline based on the configuration.
    • This pattern allows for easy extension; new operations can be supported by simply adding a new factory class and updating the pipeline_factories dictionary.
  3. Configuration Loading:

    • Configuration files are loaded based on the dataset name and operation type. If no configuration file exists for a given dataset and operation, an error message is logged and a FileNotFoundError is raised.
    • YAML is used for configuration files for readability and ease of editing.
  4. Pipeline Creation:

    • The create_pipeline method dynamically loads the configuration and uses the appropriate factory to create the pipeline.
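
The four decisions above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the injectable `loader` parameter, the `load_yaml` helper, the dictionary returned by the stand-in factory, and the `ValueError` for unsupported operations are all assumptions made for the example. Only the default directory, the `pipeline_factories` dictionary, the `create_pipeline` method, the config path layout, and the `FileNotFoundError` come from the description.

```python
import logging
from pathlib import Path
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def load_yaml(path: Path) -> Dict[str, Any]:
    """Parse one YAML card. PyYAML is imported lazily (assumed to be installed)."""
    import yaml  # third-party dependency: PyYAML

    with path.open("r", encoding="utf-8") as handle:
        return yaml.safe_load(handle) or {}


class TextToEmbeddingPipelineFactory:
    """Stand-in factory: returns a plain dict instead of a real pipeline."""

    def create_pipeline(self, config: Dict[str, Any]) -> Dict[str, Any]:
        return {"operation": "text_to_embedding", "config": config}


class PipelineBuilder:
    def __init__(
        self,
        config_dir: str = "huggingface_pipelines/datacards",
        loader: Callable[[Path], Dict[str, Any]] = load_yaml,  # hypothetical hook
    ) -> None:
        self.config_dir = Path(config_dir)
        self.loader = loader
        # Supporting a new operation means adding one entry here.
        self.pipeline_factories = {
            "text_to_embedding": TextToEmbeddingPipelineFactory(),
        }

    def load_config(self, dataset_name: str, operation: str) -> Dict[str, Any]:
        config_file = self.config_dir / f"{dataset_name}/{operation}.yaml"
        if not config_file.exists():
            logger.error("No configuration file found at %s", config_file)
            raise FileNotFoundError(f"No configuration file found at {config_file}")
        return self.loader(config_file)

    def create_pipeline(self, dataset_name: str, operation: str) -> Any:
        # Assumption: unsupported operations raise ValueError.
        if operation not in self.pipeline_factories:
            logger.error("Unsupported operation: %s", operation)
            raise ValueError(f"Unsupported operation: {operation}")
        config = self.load_config(dataset_name, operation)
        return self.pipeline_factories[operation].create_pipeline(config)
```

Passing the loader in as a parameter is only a convenience for the sketch: it lets a test swap in a different parser without touching the path resolution or factory dispatch.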

Work In Progress:

  • If additional operations are needed in the future (e.g., audio_preprocessing), corresponding factory classes and YAML configurations must be created.
  • Additional error handling might be necessary for more robust operation (e.g., validating configuration content).
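
One possible shape for the configuration validation mentioned above is a small check run on the loaded config before it reaches a factory. The function name, the exception class, and the example keys (`model_name`, `batch_size`) are hypothetical and do not come from this PR.

```python
from typing import Any, Dict, Iterable


class ConfigValidationError(ValueError):
    """Raised when a loaded configuration is malformed (hypothetical name)."""


def validate_config(config: Any, required_keys: Iterable[str]) -> Dict[str, Any]:
    """Return the config unchanged if it is a mapping with all required keys."""
    if not isinstance(config, dict):
        raise ConfigValidationError(
            f"Expected a mapping at the top level, got {type(config).__name__}"
        )
    missing = sorted(key for key in required_keys if key not in config)
    if missing:
        raise ConfigValidationError("Missing required keys: " + ", ".join(missing))
    return config


# Usage with illustrative keys; real cards may require different fields.
checked = validate_config(
    {"model_name": "example", "batch_size": 8},
    required_keys=["model_name", "batch_size"],
)
```

Each factory could declare its own required keys, so a malformed card fails at load time with a clear message rather than deep inside pipeline construction.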

Test Plan

Unit Testing:

  • Created unit tests for the PipelineBuilder class to verify:
    1. Successful loading of configuration files.
    2. Correct creation of pipelines for each supported operation.
    3. Proper error handling when configuration files are missing or operations are unsupported.

Command Line Testing:

  • Example snippet to test pipeline creation for a text-to-embedding operation:
    builder = PipelineBuilder(config_dir="path/to/configs")
    pipeline = builder.create_pipeline(dataset_name="sample_dataset", operation="text_to_embedding")
    assert pipeline is not None, "Pipeline creation failed"
  • Tested edge cases like missing configuration files and unsupported operations to ensure the class behaves as expected.

Integration Testing:

  • Integrated the PipelineBuilder into the main application workflow and verified that pipelines are correctly built and executed for real datasets.
  • Verified logging output to ensure errors and important information are correctly logged.

@facebook-github-bot added the CLA Signed label on Aug 23, 2024
@botirk38 changed the title from "Feat: Implement Pipeline Builder for ease of pipeline creation" to "#5 Feat: Implement Pipeline Builder for ease of pipeline creation" on Aug 23, 2024
FileNotFoundError: If the configuration file is not found.

"""
config_file = self.config_dir / f"{dataset_name}/{operation}.yaml"
Contributor

Can you add an example of such cards?
Also, please look at how this is done in SONAR with model cards (where all the loading logic is already implemented).

@avidale
Contributor

avidale commented Oct 22, 2024

Are we going to merge this?

Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
4 participants