#5 Feat: Implement Pipeline Builder for ease of pipeline creation #37

botirk38 · 2024-08-23T17:32:52Z

Why?

This feature streamlines the creation of machine learning pipelines from per-operation configuration files. The PipelineBuilder class loads configurations dynamically and creates pipelines for specific tasks such as text-to-embedding, embedding-to-text, text segmentation, and metric analysis. This makes pipeline creation more modular, maintainable, and extensible, and standardizes it across datasets and operations.

Use Case:

To facilitate the automated creation of different types of ML pipelines based on configuration files.
To allow easy extension and customization for new operations and datasets.
To improve maintainability by centralizing the pipeline creation logic.

How?

Technical Decisions:

Directory Structure:

The config_dir defaults to huggingface_pipelines/datacards. This directory is assumed to contain subdirectories for each dataset, where operation-specific YAML configuration files are stored.

Example structure:

huggingface_pipelines/datacards/
├── dataset_name1/
│   ├── text_to_embedding.yaml
│   └── embedding_to_text.yaml
└── dataset_name2/
    ├── text_segmentation.yaml
    └── analyze_metric.yaml
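
As a rough illustration of how a dataset name and operation map onto this layout, here is a minimal sketch of the path lookup. The helper name `resolve_config_path` is hypothetical and not part of this PR; only the `config_dir / dataset_name / operation.yaml` layout and the FileNotFoundError come from the description.

```python
import tempfile
from pathlib import Path


def resolve_config_path(config_dir: Path, dataset_name: str, operation: str) -> Path:
    """Return the datacard path for a dataset/operation pair, or raise if absent.

    Hypothetical helper: the PR performs this lookup inside PipelineBuilder.
    """
    config_file = config_dir / dataset_name / f"{operation}.yaml"
    if not config_file.is_file():
        raise FileNotFoundError(f"No configuration file found at {config_file}")
    return config_file


# Build a throwaway directory that mirrors the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dataset_name1").mkdir()
    (root / "dataset_name1" / "text_to_embedding.yaml").write_text("# placeholder\n")

    # An existing datacard resolves to its path.
    print(resolve_config_path(root, "dataset_name1", "text_to_embedding").name)

    # A missing datacard raises FileNotFoundError.
    try:
        resolve_config_path(root, "dataset_name1", "analyze_metric")
    except FileNotFoundError as err:
        print(f"missing: {err}")
```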

Factory Pattern:

- A factory pattern is used to create pipelines. Each operation has a corresponding factory (e.g., TextToEmbeddingPipelineFactory) that creates the pipeline from the configuration.
- This pattern allows for easy extension: a new operation is supported by adding a new factory class and registering it in the pipeline_factories dictionary.

Configuration Loading:

- Configuration files are loaded based on the dataset name and operation type. If no configuration file exists for a given dataset and operation, a FileNotFoundError is raised and an error message is logged.
- YAML is used for configuration files for readability and ease of editing.

Pipeline Creation:

- The create_pipeline method dynamically loads the configuration and uses the appropriate factory to create the pipeline.
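
The factory dispatch described above could be sketched roughly as follows. This is a simplified stand-in, not the code in this PR: `ToyPipelineBuilder`, `create_pipeline_from_config`, the `dim` setting, and the choice of ValueError for unsupported operations are all assumptions made for illustration, and the configuration is passed in as a plain dict so the sketch does not depend on a YAML parser.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List

# For this sketch, a "pipeline" is just a callable applied to one text input.
Pipeline = Callable[[str], List[float]]


class PipelineFactory(ABC):
    """Base class: one concrete factory per supported operation."""

    @abstractmethod
    def create_pipeline(self, config: Dict[str, Any]) -> Pipeline:
        ...


class TextToEmbeddingPipelineFactory(PipelineFactory):
    def create_pipeline(self, config: Dict[str, Any]) -> Pipeline:
        # "dim" is a hypothetical config key; a real factory would load a model.
        dim = config.get("dim", 4)

        def pipeline(text: str) -> List[float]:
            # Stand-in "embedding": the text length repeated dim times.
            return [float(len(text))] * dim

        return pipeline


class ToyPipelineBuilder:
    """Simplified builder that dispatches on the operation name."""

    def __init__(self) -> None:
        # Supporting a new operation means adding one entry here.
        self.pipeline_factories: Dict[str, PipelineFactory] = {
            "text_to_embedding": TextToEmbeddingPipelineFactory(),
        }

    def create_pipeline_from_config(
        self, operation: str, config: Dict[str, Any]
    ) -> Pipeline:
        factory = self.pipeline_factories.get(operation)
        if factory is None:
            # Assumption: the exception type used by the PR is not stated.
            raise ValueError(f"Unsupported operation: {operation}")
        return factory.create_pipeline(config)


builder = ToyPipelineBuilder()
pipeline = builder.create_pipeline_from_config("text_to_embedding", {"dim": 3})
print(pipeline("abcd"))
```

Keeping the registry as a plain dictionary means the builder itself never changes when an operation is added; only the mapping grows.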

Work In Progress:

If additional operations are needed in the future (e.g., audio_preprocessing), corresponding factory classes and YAML configurations must be created.
Additional error handling might be necessary for more robust operation (e.g., validating configuration content).
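
One possible shape for the configuration-content validation mentioned above is a check of required keys and their types before a factory is invoked. This is only a sketch: `validate_config` is a hypothetical helper, and the keys `columns` and `batch_size` are invented for the example rather than taken from the datacards in this PR.

```python
from typing import Any, List, Mapping


def validate_config(config: Mapping[str, Any], required: Mapping[str, type]) -> None:
    """Raise ValueError listing every missing or wrongly typed key."""
    problems: List[str] = []
    for key, expected_type in required.items():
        if key not in config:
            problems.append(f"missing key '{key}'")
        elif not isinstance(config[key], expected_type):
            problems.append(
                f"key '{key}' should be {expected_type.__name__}, "
                f"got {type(config[key]).__name__}"
            )
    if problems:
        raise ValueError("Invalid configuration: " + "; ".join(problems))


# Hypothetical schema for one operation's datacard.
REQUIRED_KEYS = {"columns": list, "batch_size": int}

# A well-formed config passes silently.
validate_config({"columns": ["text"], "batch_size": 8}, REQUIRED_KEYS)

# A malformed config reports all problems at once.
try:
    validate_config({"columns": "text"}, REQUIRED_KEYS)
except ValueError as err:
    print(err)
```

Collecting all problems before raising lets a datacard author fix a file in one pass instead of one error at a time.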

Test Plan

Unit Testing:

Created unit tests for the PipelineBuilder class to verify:
1. Successful loading of configuration files.
2. Correct creation of pipelines for each supported operation.
3. Proper error handling when configuration files are missing or operations are unsupported.

Command Line Testing:

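Example snippet to test pipeline creation for a text-to-embedding operation:

# Import path assumed from the file touched in this PR (huggingface_pipelines/builder.py).
from huggingface_pipelines.builder import PipelineBuilder

builder = PipelineBuilder(config_dir="path/to/configs")
pipeline = builder.create_pipeline(dataset_name="sample_dataset", operation="text_to_embedding")
assert pipeline is not None, "Pipeline creation failed"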

Tested edge cases like missing configuration files and unsupported operations to ensure the class behaves as expected.

Integration Testing:

Integrated the PipelineBuilder into the main application workflow and verified that pipelines are correctly built and executed for real datasets.
Verified logging output to ensure errors and important information are correctly logged.

artemru · 2024-09-03T15:56:09Z

huggingface_pipelines/builder.py

+            FileNotFoundError: If the configuration file is not found.
+
+        """
+        config_file = self.config_dir / f"{dataset_name}/{operation}.yaml"


Can you add an example of such cards?
By the way, please look at how it's done in sonar with model cards (where all the loading logic is already done).

avidale · 2024-10-22T10:56:34Z

Are we going to merge this?

Implement pipeline builder interface

d2e8414

facebook-github-bot added the CLA Signed label Aug 23, 2024

botirk38 changed the title ~~Feat: Implement Pipeline Builder for ease of pipeline creation~~ #5 Feat: Implement Pipeline Builder for ease of pipeline creation Aug 23, 2024

botirk38 added 5 commits August 26, 2024 15:38

Fix: Align code with best practices

fbd6b4f

Create factory for audio to embedding pipeline

3ef816c

Update supported operations type to include audio_to_embedding

5e9268f

Add docs to pipeline builder

58a27ed

Add docs to pipeline builder

e10cd4d

artemru reviewed Sep 3, 2024

botirk38 added 5 commits September 3, 2024 17:43

Add analyze metric datacard

2caaa48

Add embedding to text datacard

3b5257b

Add text to embedding datacard

ba0cc76

Add text segmentation datacard

18a386e

Add embedding to text datacard

ddee4ca
