Session description:
When I started looking into action recognition, I found that most existing solutions required huge datasets and expensive cloud-based GPUs. That felt out of reach for hobbyists and students. So I set out to build a lightweight system that could recognize simple actions such as pick and drop from just a few training examples, all while running on a regular personal computer.
This talk is about SnapAction, the system I ended up building by gluing together a few open-source models:
VideoMAE-base as the backbone. Image models look at a single frame and guess; VideoMAE actually looks at motion across frames, which matters here because a "pick" and a "drop" can look identical in a still frame and only differ in which direction things are moving.
LoRA for fine-tuning, because there's no way I'm retraining all ~80M parameters of VideoMAE on a laptop GPU with a tiny dataset. LoRA freezes the base model and only trains small adapter matrices in the attention layers, so it's fast and doesn't destroy what the model already knows about video.
SAM2, used optionally to mask out the background so the model isn't getting distracted by whatever's going on behind the person doing the action.
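To put a number on why LoRA fits on a laptop GPU, here is a back-of-the-envelope count of the adapter parameters it adds. This is a sketch under stated assumptions: a ViT-Base-sized encoder (12 layers, hidden size 768), adapters on the query and value projections only, and rank 8. The rank and the choice of projections are illustrative, not necessarily what SnapAction uses, and `lora_trainable_params` is a hypothetical helper. The count leaves out the classification head, which is also trained.

```python
def lora_trainable_params(hidden_size, num_layers, rank, adapted_per_layer):
    """Count the parameters in LoRA adapters attached to square
    (hidden_size x hidden_size) attention projections."""
    # Each adapted projection gets two small matrices:
    # A with shape (rank, hidden_size) and B with shape (hidden_size, rank).
    per_projection = 2 * hidden_size * rank
    return num_layers * adapted_per_layer * per_projection


# Assumed values: 12 layers, hidden size 768, rank 8, query + value adapted.
added = lora_trainable_params(hidden_size=768, num_layers=12, rank=8,
                              adapted_per_layer=2)
print(added)                                  # 294912
print(f"{added / 80e6:.2%} of ~80M weights")  # 0.37% of ~80M weights
```

Under those assumptions the adapters come to roughly 0.3M trainable parameters, well under 1% of the frozen base model.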
I'll walk through the whole pipeline — pulling 16 frames out of each video, the SAM2 masking step, the augmentations I used (temporal jitter, mixup, cropping/flipping/color jitter), fine-tuning VideoMAE with LoRA, and finally a small web app that runs the whole thing locally and shows you the model's confidence for each class. All of this runs offline on Fedora, no internet required.
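One common way to pull 16 frames out of a clip, with optional temporal jitter, is to split the video into 16 equal segments and take one frame from each. The sketch below assumes that segment-based scheme. `sample_frame_indices` is a hypothetical helper written for illustration, not the actual SnapAction code, and it only picks indices; decoding the frames is left to whatever video library is in use.

```python
import random


def sample_frame_indices(total_frames, num_samples=16, jitter=False, rng=None):
    """Pick num_samples frame indices spread evenly across a video.

    Without jitter, the centre of each segment is used (deterministic,
    suitable for validation). With jitter, a random position inside each
    segment is used (a simple temporal augmentation for training).
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    rng = rng or random.Random()
    segment = total_frames / num_samples
    indices = []
    for i in range(num_samples):
        start, end = i * segment, (i + 1) * segment
        position = rng.uniform(start, end) if jitter else (start + end) / 2
        # Clamp so the last segment can never index past the final frame.
        indices.append(min(int(position), total_frames - 1))
    return indices


print(sample_frame_indices(64))  # [2, 6, 10, ..., 62]
print(sample_frame_indices(64, jitter=True, rng=random.Random(0)))
```

Videos shorter than 16 frames still work with this scheme; some frames are simply repeated.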
Honestly, the more interesting part of this talk is everything that went wrong along the way:
Training was stuck for a while, and it took some digging to figure out it was two separate issues: SAM2 running live during training (which was both slow and giving inconsistent masks) and label smoothing accidentally being applied during validation when it should only happen during training.
Once I fixed the LR schedule and switched to cosine annealing with early stopping, things got noticeably more stable.
Even after all this, I ran into a real data ceiling. It looks like I needed roughly 60 or more clips per class before "few-shot" actually started behaving like few-shot.
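Two of the fixes above are easy to show in a few lines of plain Python: keeping label smoothing out of validation, and a cosine-annealed learning rate paired with early stopping. This is a framework-free sketch, not the SnapAction training loop. All the names here (`loss_for_phase`, `cosine_lr`, `EarlyStopping`) and the default values (smoothing 0.1, patience 5) are assumptions for illustration.

```python
import math


def smoothed_cross_entropy(probs, target, smoothing=0.0):
    """Cross-entropy between predicted class probabilities and a
    (possibly smoothed) one-hot target."""
    n = len(probs)
    loss = 0.0
    for k, p in enumerate(probs):
        one_hot = 1.0 if k == target else 0.0
        q = (1.0 - smoothing) * one_hot + smoothing / n
        loss -= q * math.log(p)
    return loss


def loss_for_phase(probs, target, training, smoothing=0.1):
    # Smooth the targets only while training; validation uses plain
    # cross-entropy so its numbers stay comparable across runs.
    return smoothed_cross_entropy(probs, target, smoothing if training else 0.0)


def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Cosine annealing from base_lr at epoch 0 down to min_lr."""
    progress = min(epoch, total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


class EarlyStopping:
    """Stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With smoothing applied, even a confident correct prediction gets a higher loss, which is why leaking it into validation makes the validation curve look worse than it really is.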