Best ways to predict biomass?

We’re predicting pasture biomass (5 targets) from field images. The training data includes metadata (NDVI, height, species, date), but the test set has images only. Our current approaches reach R² of roughly 0.2-0.4:

v1-v4: Hand-crafted features (green coverage, vegetation indices, texture) → MLP with auxiliary heads (predict height/species from images) → biomass
Multi-task learning with teacher forcing during training
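A hand-crafted feature such as green coverage could be computed roughly as follows. This is a minimal sketch, assuming the Excess Green index (ExG = 2g - r - b on normalised channels) as the vegetation index. The function names and the 0.1 threshold are hypothetical and are not taken from the v1-v4 pipeline.

```python
import numpy as np

def excess_green(rgb):
    """Excess Green index ExG = 2g - r - b on chromaticity-normalised channels."""
    rgb = np.asarray(rgb, dtype=np.float64)
    total = rgb.sum(axis=-1, keepdims=True)
    total[total == 0] = 1.0  # avoid division by zero on pure black pixels
    r, g, b = np.moveaxis(rgb / total, -1, 0)
    return 2.0 * g - r - b

def green_coverage(rgb, threshold=0.1):
    """Fraction of pixels whose ExG exceeds the threshold (0.1 is an assumed value)."""
    return float((excess_green(rgb) > threshold).mean())

# Toy 2x2 image: two green pixels, one brown pixel, one grey pixel.
toy = np.array([[[30, 200, 40], [20, 180, 30]],
                [[120, 90, 60], [128, 128, 128]]], dtype=np.uint8)
print(green_coverage(toy))  # 0.5: only the two green pixels pass the threshold
```

On real field images the threshold would need tuning per lighting condition, which is one reason such features tend to be noisy predictors on their own.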
Key constraints:

~300 training images, 800 test images
No metadata at inference
Pretrained models allowed (DINOv2, etc.)
Question: What architecture would you recommend?

CNN backbone (ResNet/EfficientNet) vs Vision Transformer (ViT/DINOv2)?
End-to-end vs two-stage (feature extraction → regression)?
Single multi-head model (predict all 5 targets) vs separate models per target?
Should auxiliary tasks (height/species prediction) be kept, or does a strong pretrained backbone make them redundant?
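To make the two-stage option concrete, here is a minimal sketch of its second stage: a multi-output ridge regression fitted on precomputed image embeddings and scored with cross-validated R² per target. Everything here is an assumption for illustration. The embeddings are synthetic random vectors standing in for features from a frozen pretrained backbone (e.g. DINOv2), and the embedding size of 64, `alpha=1.0`, and the helper names are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form multi-output ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, y_mean - x_mean @ W

def r2_per_target(Y_true, Y_pred):
    """Coefficient of determination, computed separately for each target column."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

def cross_val_r2(X, Y, alpha=1.0, n_folds=5, seed=0):
    """Pool out-of-fold predictions over k folds, then score each target."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)
    preds = np.empty_like(Y, dtype=np.float64)
    for k, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        W, b = fit_ridge(X[train], Y[train], alpha)
        preds[held_out] = X[held_out] @ W + b
    return r2_per_target(Y, preds)

# Synthetic stand-in: 300 "images", 64-dim embeddings, 5 biomass targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))
Y = X @ rng.normal(size=(64, 5)) + 0.1 * rng.normal(size=(300, 5))
scores = cross_val_r2(X, Y)
print(scores.shape)  # one R² value per target
```

With only ~300 training images, a setup like this has few trainable parameters, and the same cross-validation loop can be reused to compare one shared multi-output model against separate per-target models. The synthetic scores say nothing about how real embeddings would perform.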

Sample of 40 images, ranked from largest biomass (top left) to smallest (bottom right).

  • *: NDVI level (low/medium/high)
  • ^: Height level (short/medium/tall)

| Position | Symbol | Meaning |
|---|---|---|
| Top-left | Colored shape | State (color) + Species (shape) + NDVI (size) |
| Top-right | Yellow * ** *** | NDVI level |
| Left | Cyan ^ ^^ ^^^ | Height |
| Bottom-right | White number | Biomass in grams |

COLORS = STATE

  • :red_circle: Red = NSW (New South Wales)
  • :blue_circle: Blue = Vic (Victoria)
  • :green_circle: Green = Tas (Tasmania)
  • :orange_circle: Orange = WA (Western Australia)

SHAPES = SPECIES (Plant Type)

  • ■ Square = Ryegrass
  • ● Circle = Clover
  • ▲ Triangle = Lucerne
  • ◆ Diamond = Phalaris
  • ⬠ Pentagon = Fescue
  • ✕ X = Other (mixed species)

SIZE = NDVI (Satellite Vegetation Index)

  • Bigger shape = Higher NDVI = More green detected from satellite
  • Smaller shape = Lower NDVI = Less vegetation