Giving Open Source Eyes: Real-World Vision Applications with PaliGemma


Time: 

Venue: LT2

Language: English

Level: Intermediate

Target Audience: General User

Most open-source AI discussions focus purely on text generation, but the physical world is visual. PaliGemma is Google's lightweight, open Vision-Language Model (VLM), designed to bridge the gap between pixels and language.

Instead of building another text chatbot, this session explores how to integrate PaliGemma into open-source pipelines to process visual data. We’ll walk through practical, less common use cases: real-time anomaly detection on CCTV streams in open-source pipelines, "OCR-less" data extraction from complex UI screenshots, and automated generation of visual accessibility tags. We will cover the mechanics of projecting image embeddings into the language model's embedding space and how to deploy this VLM effectively on local machines. Attendees will walk away knowing how to build multimodal applications that can actually "see," without relying on expensive, proprietary cloud APIs.

Frankie WU / Hong Kong

GDG Hong Kong


As a GDG Hong Kong Organizer and the Founder of Nexamind AI, Frankie helps forward-thinking organizations bridge the gap between complex data science and practical product design. Through his consultancy, he partners with businesses to build scalable machine learning solutions and autonomous AI agents that deliver real-world ROI. As a seasoned educator and public speaker, he has led numerous technical bootcamps and is a recognized voice in the tech community, frequently speaking on the practical applications of AI and the future of intelligent automation.