We’re predicting pasture biomass (5 targets) from field images. The training data includes metadata (NDVI, height, species, date), but the test set has images only. Current approaches reach R² ≈ 0.2-0.4:
- v1-v4: hand-crafted features (green coverage, vegetation indices, texture) → MLP with auxiliary heads (predicting height/species from images) → biomass
- Multi-task learning with teacher forcing during training
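For concreteness, here is a minimal sketch of the multi-task setup described above: a shared trunk over hand-crafted image features, auxiliary heads for height and species (metadata available only at train time), and a biomass head. All layer sizes, the feature dimension, and the species count are illustrative assumptions, not the actual v1-v4 code, and the teacher-forcing step is omitted.

```python
import torch
import torch.nn as nn

class MultiTaskBiomassNet(nn.Module):
    """Sketch: shared MLP trunk over hand-crafted features, with
    auxiliary heads (height regression, species classification)
    alongside the main 5-target biomass head."""
    def __init__(self, n_features=32, n_species=6, n_targets=5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.height_head = nn.Linear(64, 1)           # auxiliary: height
        self.species_head = nn.Linear(64, n_species)  # auxiliary: species logits
        self.biomass_head = nn.Linear(64, n_targets)  # main: 5 biomass targets

    def forward(self, x):
        h = self.trunk(x)
        return self.biomass_head(h), self.height_head(h), self.species_head(h)

model = MultiTaskBiomassNet()
x = torch.randn(4, 32)  # batch of 4 feature vectors
biomass, height, species = model(x)
```

During training the auxiliary losses (MSE on height, cross-entropy on species) would be summed with the biomass loss; at inference only the biomass head is used, which is what lets the model run without metadata.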
Key constraints:
- ~300 training images, 800 test images
- No metadata at inference
- Pretrained models allowed (DINOv2, etc.)
Question: what architecture would you recommend?
- CNN backbone (ResNet/EfficientNet) vs Vision Transformer (ViT/DINOv2)?
- End-to-end training vs two-stage (feature extraction → regression)?
- A single multi-head model (predicting all 5 targets) vs separate models per target?
- Should the auxiliary tasks (height/species prediction) be kept, or does a strong pretrained backbone make them redundant?
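For reference, the two-stage option above can be sketched as: embeddings from a frozen pretrained backbone, followed by a small, heavily regularized regressor that is hard to overfit on ~300 samples. The embeddings below are random placeholders standing in for backbone features (the 768-d size and the Ridge alpha are assumptions, and DINOv2 would normally be loaded separately, e.g. via torch.hub).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Stage 1 (assumed done elsewhere): a frozen pretrained backbone maps
# each image to an embedding. Random data stands in for those features
# here, just to show the shape of the pipeline.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 768))  # ~300 training embeddings (768-d assumed)
y = rng.normal(size=(300, 5))    # 5 biomass targets

# Stage 2: multi-output ridge regression as the lightweight head.
reg = Ridge(alpha=10.0).fit(X, y)
preds = reg.predict(X)           # shape (300, 5)
```

With so few labeled images, the appeal of this route is that only the linear head is fit to your data; swapping Ridge for a small MLP is a drop-in change if the linear head underfits.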
A sample of 40 images, ranked from largest biomass (top left) to smallest (bottom right). Legend:
| Position | Symbol | Meaning |
|---|---|---|
| Top-left | Colored shape | State (color) + species (shape) + NDVI (size) |
| Top-right | Yellow * ** *** | NDVI level (low/medium/high) |
| Left | Cyan ^ ^^ ^^^ | Height (short/medium/tall) |
| Bottom-right | White number | Biomass in grams |
COLORS = STATE
- Red = NSW (New South Wales)
- Blue = Vic (Victoria)
- Green = Tas (Tasmania)
- Orange = WA (Western Australia)
SHAPES = SPECIES (Plant Type)
- ■ Square = Ryegrass
- ● Circle = Clover
- ▲ Triangle = Lucerne
- ◆ Diamond = Phalaris
- ⬠ Pentagon = Fescue
- ✕ X = Other (mixed species)
SIZE = NDVI (Satellite Vegetation Index)
- Bigger shape = higher NDVI = more green detected from satellite
- Smaller shape = lower NDVI = less vegetation
