🧠 MLfromScratch


MLfromScratch is a library designed to help you learn and understand machine learning algorithms by building them from scratch using only NumPy! No black-box libraries, no hidden magic—just pure Python and math. It's perfect for beginners who want to see what's happening behind the scenes of popular machine learning models.

🔗 Explore the Documentation


📦 Package Structure

Our package structure is designed to look like scikit-learn, so if you're familiar with that, you'll feel right at home!

🔧 Modules and Algorithms (Explained for Beginners)

📈 1. Linear Models (linear_model)

  • LinearRegression: Imagine drawing a straight line through a set of points to predict future values. Linear Regression predicts a continuous value, such as a house price based on its size.

  • SGDRegressor: A fast way to fit a Linear Regression using Stochastic Gradient Descent. Well suited to large datasets.

  • SGDClassifier: A linear classification algorithm that predicts categories like "spam" or "not spam."
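
To make the two regression ideas above concrete, here is a minimal NumPy sketch. It is not the library's code: the function names `fit_linear_regression` and `fit_sgd_regression`, the learning rate, and the toy data are all assumptions chosen for illustration. The first function solves the least-squares problem directly; the second reaches roughly the same answer with per-sample gradient steps.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares fit (hypothetical helper); returns (weights, intercept)."""
    # Append a column of ones so the intercept is learned as an extra weight.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef[:-1], coef[-1]

def fit_sgd_regression(X, y, lr=0.05, epochs=500, seed=0):
    """Stochastic gradient descent on squared error, one sample per update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a fresh random order every epoch.
        for i in rng.permutation(X.shape[0]):
            error = X[i] @ w + b - y[i]
            w -= lr * error * X[i]
            b -= lr * error
    return w, b

# Toy data generated from y = 3x + 2 (no noise).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X[:, 0] + 2.0

w_exact, b_exact = fit_linear_regression(X, y)
w_sgd, b_sgd = fit_sgd_regression(X, y)
print(w_exact, b_exact, w_sgd, b_sgd)
```

The direct solve is simplest for small data, while the SGD version only ever touches one row at a time, which is why it scales to datasets that do not fit comfortably in memory.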

🌳 2. Decision Trees (tree)

  • DecisionTreeClassifier: Think of this as playing 20 questions to guess something. A decision tree asks a sequence of yes/no questions about the features to classify data.

  • DecisionTreeRegressor: Predicts a continuous number (like tomorrow's temperature) based on input features.
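
The "yes/no question" a tree asks is usually a threshold on one feature. As a rough sketch (Gini impurity is one common splitting criterion and is an assumption here, as are the helper names `gini` and `best_split`), this is how a single question could be chosen:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Find the threshold on one feature that minimises weighted Gini impurity."""
    best_threshold, best_score = None, np.inf
    values = np.unique(x)
    # Candidate thresholds sit halfway between consecutive distinct values.
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_split(x, y)
print(threshold, impurity)
```

A full tree would apply this search to every feature and then recurse into the left and right subsets until the nodes are pure or a depth limit is reached.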

👥 3. K-Nearest Neighbors (neighbors)

  • KNeighborsClassifier: Classifies a new point by looking at its 'k' nearest neighbors and taking the most common label among them.

  • KNeighborsRegressor: Instead of classifying, it predicts a number based on nearby data points, typically by averaging their values.
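
Both variants share the same neighbor search and differ only in how the neighbors are combined. The sketch below assumes Euclidean distance and a hypothetical helper named `knn_predict`; it illustrates the idea and is not the library's implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, regression=False):
    # Euclidean distance from the new point to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    if regression:
        # Regression: average the neighbors' target values.
        return y_train[nearest].mean()
    # Classification: return the most common label among the neighbors.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

label = knn_predict(X_train, y_class, np.array([0.5, 0.5]))
value = knn_predict(X_train, y_value, np.array([5.5, 5.5]), regression=True)
print(label, value)
```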

🧮 4. Naive Bayes (naive_bayes)

  • GaussianNB: Works well for continuous features that roughly follow a normal distribution (bell-shaped curve).

  • MultinomialNB: Ideal for count-based features, as in text classification tasks like spam detection.
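
For the Gaussian case, the model boils down to a prior, a mean, and a variance per class. The following is a minimal sketch under that assumption; `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical names, and the small variance floor is an illustrative choice.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A small constant keeps every variance strictly positive.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    classes, priors, means, variances = model
    # diff has shape (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    # Sum of per-feature Gaussian log-densities for every (sample, class) pair.
    log_likelihood = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None], axis=2
    )
    # Pick the class with the highest log prior + log likelihood.
    return classes[np.argmax(np.log(priors) + log_likelihood, axis=1)]

X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y = np.array([0, 0, 0, 1, 1, 1])
model = fit_gaussian_nb(X, y)
predictions = predict_gaussian_nb(model, np.array([[1.0, 2.0], [5.0, 6.0]]))
print(predictions)
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow for high-dimensional data.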

📊 5. Clustering (cluster)

  • KMeans: Groups data into 'k' clusters based on similarity.

  • AgglomerativeClustering: Starts with every point as its own cluster and repeatedly merges the most similar clusters until the requested number of clusters remains (or, if never stopped, a single large cluster).

  • DBSCAN: Groups points that are densely packed together and labels isolated points as noise. No need to specify the number of clusters!

  • MeanShift: Shifts data points toward areas of high density to find clusters.
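
As one example from this group, k-means alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The sketch below assumes random initialisation from the data and a hypothetical function name `kmeans`; the library's own version may differ.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Which cluster gets index 0 depends on the random start, so only the grouping is meaningful, not the label numbers themselves.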

🌲 6. Ensemble Methods (ensemble)

  • RandomForestClassifier: Combines many decision trees, each trained on a random sample of the data, and lets them vote on the class.

  • RandomForestRegressor: Predicts continuous values by averaging the outputs of an ensemble of decision trees.

  • GradientBoostingClassifier: Builds trees sequentially, each correcting errors made by the previous ones.

  • VotingClassifier: Combines the predictions of multiple models to make a final prediction.
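
The combining step shared by voting ensembles and random forest classifiers can be sketched as a simple majority vote. The helper name `majority_vote` and the hard-coded model outputs below are assumptions for illustration, and class labels are assumed to be non-negative integers.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: array of shape (n_models, n_samples) with integer labels."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # Row c of `votes` counts how many models chose class c for each sample.
    votes = np.array([(predictions == c).sum(axis=0) for c in range(n_classes)])
    # Ties go to the lowest class index because argmax returns the first maximum.
    return votes.argmax(axis=0)

# Pretend outputs from three already-trained models on four samples.
model_outputs = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])
final = majority_vote(model_outputs)
print(final)  # [0 1 1 0]
```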

📐 7. Metrics (metrics)

Measure your model’s performance:

  • accuracy_score: The fraction of predictions your model got right.

  • f1_score: Combines precision and recall into a single score (their harmonic mean).

  • roc_curve: Shows the trade-off between the true positive rate and the false positive rate as the decision threshold changes.
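
Accuracy and F1 are short enough to write out in full. This sketch assumes binary labels with 1 as the positive class; the function names `accuracy` and `f1` are hypothetical and are not the library's `accuracy_score` or `f1_score`.

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction matches the truth.
    return np.mean(y_true == y_pred)

def f1(y_true, y_pred, positive=1):
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
acc = accuracy(y_true, y_pred)
score = f1(y_true, y_pred)
print(acc, score)
```

Here four of six predictions are correct, and with two true positives, one false positive, and one false negative, precision and recall are both 2/3, so F1 is also 2/3.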

⚙️ 8. Model Selection (model_selection)

  • train_test_split: Splits your data into training and test sets.

  • KFold: Splits the data into 'k' folds and trains the model 'k' times, each time holding out a different fold for validation.
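
Both utilities are index bookkeeping. The sketch below uses hypothetical names (`split_train_test`, `kfold_indices`) and an unshuffled k-fold for simplicity; it shows the idea rather than the library's actual signatures.

```python
import numpy as np

def split_train_test(X, y, test_size=0.25, seed=0):
    rng = np.random.default_rng(seed)
    # Shuffle the row indices, then cut them into a test part and a train part.
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = split_train_test(X, y, test_size=0.3)
splits = list(kfold_indices(10, k=5))
print(len(X_train), len(X_test), len(splits))
```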

🔍 9. Preprocessing (preprocessing)

  • StandardScaler: Standardizes each feature so it has a mean of 0 and a standard deviation of 1.

  • LabelEncoder: Converts text labels into numerical labels (e.g., "cat" becomes 0 and "dog" becomes 1).
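
Both transformations take only a few lines of NumPy. In this sketch the function names `standardize` and `encode_labels` are hypothetical, and labels are numbered in sorted order, which is an assumption about the encoding rather than a statement about the library.

```python
import numpy as np

def standardize(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Avoid dividing by zero for constant columns.
    std[std == 0] = 1.0
    return (X - mean) / std

def encode_labels(labels):
    # np.unique returns the sorted classes and each label's index in that list.
    classes, encoded = np.unique(labels, return_inverse=True)
    return classes, encoded

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_scaled = standardize(X)

classes, encoded = encode_labels(np.array(["dog", "cat", "dog", "bird"]))
print(X_scaled)
print(classes, encoded)  # ['bird' 'cat' 'dog'] [2 1 2 0]
```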

🧩 10. Dimensionality Reduction (decomposition)

Dimensionality Reduction helps in simplifying data while retaining most of its valuable information. By reducing the number of features (dimensions) in a dataset, it makes data easier to visualize and speeds up machine learning algorithms.

  • PCA (Principal Component Analysis): Reduces the number of dimensions by finding new uncorrelated variables called principal components. It projects your data onto a lower-dimensional space while retaining as much variance as possible.

    • How It Works: PCA finds the axes (principal components) that maximize the variance in your data. The first principal component captures the most variance, and each subsequent component captures progressively less.
    • Use Case: Use PCA when you have many features, and you want to simplify your dataset for better visualization or faster computation. It is particularly useful when features are highly correlated.
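
The description above can be sketched with a singular value decomposition of the centered data. This is one standard way to compute PCA and is an assumption here; the function name `pca` and the synthetic data are illustrative only.

```python
import numpy as np

def pca(X, n_components):
    # Center the data so the components describe variance around the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = singular_values ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    # Project the centered data onto the leading components.
    return X_centered @ components.T, components, ratio[:n_components]

# Two highly correlated features: the second is about twice the first.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200)])

projected, components, ratio = pca(X, n_components=1)
print(projected.shape, ratio)
```

Because the two features are almost perfectly correlated, a single component captures nearly all of the variance, which is the situation the use case above describes.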

🎯 Why Use This Library?

  • Learning-First Approach: If you're a beginner and want to understand machine learning, this is the library for you. No hidden complexity, just code.
  • No Hidden Magic: Everything is written from scratch, so you can see exactly how each algorithm works.
  • Lightweight: Depends only on NumPy, so it is quick to install and easy to run.

🚀 Getting Started

# Clone the repository
git clone https://github.com/adityajn105/MLfromScratch.git

# Navigate to the project directory
cd MLfromScratch

# Install the required dependencies
pip install -r requirements.txt



👨‍💻 Author

This project is maintained by Aditya Jain

🧑‍💻 Contributors

Contributor: Subrahmanya Gaonkar

We welcome contributions from everyone, especially beginners! If you're new to open-source, don’t worry—feel free to ask questions, open issues, or submit a pull request.

🤝 How to Contribute

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Make your changes and commit (git commit -m "Added new feature").
  4. Push the changes (git push origin feature-branch).
  5. Submit a pull request and explain your changes.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.