Talk Intermediate MIT First Talk

Few-Shot Action Recognition Using Open Foundation Models Locally

Review Pending

Session Description

When I started looking into action recognition, I found that most existing solutions required huge datasets and expensive cloud-based GPUs. That felt out of reach for hobbyists and students. So I set out to build a lightweight system that could recognize simple actions such as "pick" and "drop" from just a few training examples, all while running on a regular personal computer.

This talk is about SnapAction, the system I ended up building by gluing together a few open-source models:

VideoMAE-base as the backbone. Image models look at a single frame and guess; VideoMAE actually looks at motion across frames, which matters here because a "pick" and a "drop" can look identical in a still frame and only differ in which direction things are moving.

LoRA for fine-tuning, because there's no way I'm retraining all ~80M parameters of VideoMAE on a laptop GPU with a tiny dataset. LoRA freezes the base model and only trains small adapter matrices in the attention layers, so it's fast and doesn't destroy what the model already knows about video.
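To make the LoRA idea concrete, here is a minimal pure-Python sketch of a single linear layer with a low-rank adapter. It is an illustration only, not the talk's actual training code: the class name `LoRALinear`, the helper `lora_param_count`, and the defaults `rank=8` and `alpha=16` are all assumptions chosen for the example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_param_count(d_in, d_out, rank):
    """Number of trainable values in the two adapter matrices A and B."""
    return rank * (d_in + d_out)


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=8, alpha=16, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen: never updated during fine-tuning
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapter
        # contributes nothing until training moves B away from zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]
```

For a 768x768 projection (the hidden size of a ViT-Base style backbone), the full matrix has 589,824 weights, while `lora_param_count(768, 768, 8)` gives 12,288 trainable values for the adapter.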

SAM2, used optionally to mask out the background so the model isn't getting distracted by whatever's going on behind the person doing the action.
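The masking step itself boils down to keeping foreground pixels and blanking everything else. The sketch below assumes a binary mask per frame is already available (for example, one produced by SAM2) and uses plain nested lists to stay dependency-free. The function name `apply_mask` and the `fill` parameter are hypothetical, and this does not call the SAM2 API.

```python
def apply_mask(frame, mask, fill=0):
    """Keep pixels where mask is truthy; replace the rest with `fill`.

    frame: H x W nested list of pixel values (ints or RGB tuples).
    mask:  H x W nested list of booleans, True for foreground.
    """
    same_height = len(frame) == len(mask)
    same_width = all(len(fr) == len(mr) for fr, mr in zip(frame, mask))
    if not (same_height and same_width):
        raise ValueError("frame and mask must have the same height and width")
    return [
        [px if keep else fill for px, keep in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]
```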

I'll walk through the whole pipeline — pulling 16 frames out of each video, the SAM2 masking step, the augmentations I used (temporal jitter, mixup, cropping/flipping/color jitter), fine-tuning VideoMAE with LoRA, and finally a small web app that runs the whole thing locally and shows you the model's confidence for each class. All of this runs offline on Fedora, no internet required.
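As a rough picture of the first step, here is one way to pick 16 frame indices from a clip of any length, with optional temporal jitter for augmentation. This is a sketch under assumptions: the function name `sample_frame_indices`, the segment-center strategy, and the `jitter` parameter are illustrative and may differ from what SnapAction actually does.

```python
import random


def sample_frame_indices(num_frames, clip_len=16, jitter=0, rng=None):
    """Return `clip_len` frame indices spread evenly over the video.

    The video is split into `clip_len` equal segments and the center of each
    segment is picked. With jitter > 0, each index is shifted by a random
    offset in [-jitter, jitter] (temporal jitter), then clamped to the valid
    range. Short videos simply repeat frames.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    rng = rng or random
    segment = num_frames / clip_len
    indices = []
    for i in range(clip_len):
        idx = int(segment * (i + 0.5))
        if jitter:
            idx += rng.randint(-jitter, jitter)
        indices.append(min(max(idx, 0), num_frames - 1))
    # Sorting keeps the frames in temporal order even after jitter.
    return sorted(indices)
```

Jitter would normally be enabled only for training clips, so that validation always sees the same frames.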

Honestly, the more interesting part of this talk is everything that went wrong along the way:

Training was stuck for a while, and it took some digging to figure out it was two separate issues: SAM2 running live during training (which was both slow and giving inconsistent masks), and label smoothing accidentally being applied during validation when it should only happen during training.
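The label smoothing issue is easy to reproduce in a few lines. The sketch below is a dependency-free cross-entropy with a `training` switch that turns smoothing off for validation. The function names and the smoothing value of 0.1 are assumptions for illustration. The smoothed target spreads the smoothing mass uniformly over all classes, which is one common convention.

```python
import math


def cross_entropy(logits, target, smoothing=0.0):
    """Cross-entropy for one example, with optional label smoothing."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    num_classes = len(logits)
    loss = 0.0
    for i, logit in enumerate(logits):
        # Smoothed target: (1 - s) on the true class plus s / K on every class.
        q = smoothing / num_classes
        if i == target:
            q += 1.0 - smoothing
        loss -= q * (logit - log_z)
    return loss


def batch_loss(batch_logits, batch_targets, training, smoothing=0.1):
    """Mean loss over a batch; smoothing is applied only when training."""
    s = smoothing if training else 0.0
    losses = [cross_entropy(l, t, s) for l, t in zip(batch_logits, batch_targets)]
    return sum(losses) / len(losses)
```

With smoothing left on during validation, even a confident and correct prediction reports a noticeably higher loss, which can make a healthy run look stuck.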

Once I fixed the LR schedule and switched to cosine annealing with early stopping, things got noticeably more stable.
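For reference, cosine annealing and early stopping can each be written in a handful of lines. This is a minimal sketch, not the exact schedule used in the talk: `cosine_lr`, `EarlyStopping`, and the default `patience` and `min_delta` values are assumptions made for the example.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate that decays from base_lr to min_lr along a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `cosine_lr` would be called once per step to set the optimizer's learning rate, and `EarlyStopping.step` once per epoch after validation.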

Even after all this, I ran into a real data ceiling. It looks like I needed around 60+ clips per class before "few-shot" actually started behaving like few-shot.

Key Takeaways

How VideoMAE, SAM2, and LoRA fit together for few-shot action recognition, and why each one earns its place in the pipeline

Debugging stories: chasing down a training stall caused by live SAM2 masking and label smoothing misapplied during validation, and what fixing the LR schedule actually changed

Where few-shot approaches hit a wall in practice — the data ceiling I ran into and what it means for similar projects

Practical notes on running multiple modern vision models locally without blowing through your VRAM

References

https://github.com/facebookresearch/sam2

https://github.com/MCG-NJU/VideoMAE

https://huggingface.co/docs/peft/main/en/conceptual_guides/lora

Session Categories

Tutorial about using a FOSS project

Talk License: MIT

Speakers

Chahat Patel Student

A passionate and curious computer science student who loves exploring open-source AI and building practical systems that run locally.
