A project showcasing image captioning with two approaches: a custom CNN-based model for visual feature extraction and BLIP, a state-of-the-art multimodal captioning model. The repository includes model comparisons, results, and code for training, inference, and evaluation on custom datasets.
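
As a minimal sketch of the BLIP inference path (assuming the Hugging Face `transformers` library and the public `Salesforce/blip-image-captioning-base` checkpoint; the repo's own scripts may use a different checkpoint or loading code):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a pretrained BLIP captioning checkpoint (assumed here; swap in
# the checkpoint this project actually trains or fine-tunes).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# "example.jpg" is a placeholder path for any RGB input image.
image = Image.open("example.jpg").convert("RGB")

# Preprocess the image, then generate and decode a caption.
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```

For evaluation on a custom dataset, the same generate-and-decode loop can be batched over images and the outputs scored against reference captions with standard metrics such as BLEU or CIDEr.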