Talk
Intermediate

Exploring the Multimodal LLM Inference Capabilities of PaliGemma with Keras

Rejected

PaliGemma is a recently announced, versatile and lightweight vision-language model (VLM) inspired by PaLI-3 and built on open components such as the SigLIP vision model and the Gemma language model. It takes both an image and text as input and generates text as output, supporting multiple languages.


PaliGemma is designed as a model for transfer to a wide range of vision-language tasks such as image and short-video captioning, visual question answering, text reading, object detection, and object segmentation. This session explores the multimodal capabilities of PaliGemma and covers how you can use it with Keras to set up a simple model that infers information about supplied images and answers questions about them.
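As a taste of what the session will show, here is a minimal inference sketch using the KerasNLP PaliGemma API. The preset name `pali_gemma_3b_mix_224`, the 224x224 input size, and the `answer en <question>` prompt prefix follow the published KerasNLP documentation, but treat this as an illustrative sketch rather than the session's exact code; downloading the preset requires accepting the model license.

```python
# Sketch: visual question answering with PaliGemma via KerasNLP.
# Assumes keras-nlp with the PaliGemma preset available; treat names
# outside the documented API as illustrative.

def build_prompt(question: str, lang: str = "en") -> str:
    """PaliGemma expects task-prefixed prompts, e.g. 'answer en <question>\\n'."""
    return f"answer {lang} {question}\n"

def answer_about_image(image_path: str, question: str) -> str:
    # Imported lazily so the prompt helper can be used (and tested)
    # without the heavy dependencies installed.
    import keras
    import keras_nlp

    model = keras_nlp.models.PaliGemmaCausalLM.from_preset(
        "pali_gemma_3b_mix_224"
    )
    # The mix_224 preset expects 224x224 RGB input.
    image = keras.utils.img_to_array(
        keras.utils.load_img(image_path, target_size=(224, 224))
    )
    return model.generate(
        inputs={"images": image, "prompts": build_prompt(question)}
    )

# Usage (downloads the ~3B-parameter weights on first call):
# print(answer_about_image("cow.jpg", "where is the cow standing?"))
```

The same `generate` call handles other tasks by swapping the prompt prefix, for example `caption en` for captioning or `detect <object>` for detection.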


Later, we will also explore how you can fine-tune PaliGemma with JAX.

FOSS

Shivay Lamba
TensorFlowJS SIG & WG Lead

Approvability: 0 %
Approvals: 0
Rejections: 1
Not sure: 0
Reviewer #1 (Rejected): Seems like a demo.