We train representation models with procedural data only and apply them to visual similarity, classification, and semantic segmentation tasks
without further training by using visual memory—an explicit database of reference image embeddings. Unlike prior work on visual memory, our
approach achieves full separation of perception and knowledge with respect to all real-world images while retaining strong performance.
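To make the inference procedure concrete, the sketch below labels a query by nearest-neighbour lookup over the visual memory, an explicit database of reference embeddings. The cosine-similarity metric, the value of k, and the function names are illustrative assumptions, not the exact configuration used here.

```python
import numpy as np

def classify_with_visual_memory(query_emb, memory_embs, memory_labels, k=5):
    """Label a query embedding by a k-nearest-neighbour vote over the
    explicit database (visual memory) of reference image embeddings."""
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to every reference
    top = np.argsort(-sims)[:k]       # indices of the k closest references
    votes = memory_labels[top]
    label = np.bincount(votes).argmax()
    return label, top                 # `top` traces the decision to examples
```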
In standard approaches, perceptual skills and knowledge are entangled: there is no boundary between what the model is able to do
(e.g., classification, segmentation) and what it knows (e.g., people, dogs). Even in prior visual memory approaches, the embedding model
is trained on real data, which leaks knowledge into the representation. Our method addresses these limitations while offering efficient training
(no real data), easy editability (data addition/removal through database modification), and explainability (decisions traced to specific examples).
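As an illustration of the editability claim, a minimal sketch of such a database is given below; the class and method names are hypothetical and only show that adding or forgetting knowledge amounts to editing the stored arrays, with no retraining of the embedding model.

```python
import numpy as np

class VisualMemory:
    """Explicit database of reference embeddings and labels; editing what the
    system knows means editing these arrays, not retraining the encoder."""
    def __init__(self, dim):
        self.embs = np.empty((0, dim), dtype=np.float32)
        self.labels = np.empty((0,), dtype=np.int64)

    def add(self, embs, labels):
        # Teach new concepts by appending reference embeddings and labels.
        self.embs = np.concatenate([self.embs, embs.astype(np.float32)])
        self.labels = np.concatenate([self.labels, labels.astype(np.int64)])

    def remove(self, indices):
        # Forget specific examples by deleting their database entries.
        keep = np.setdiff1d(np.arange(len(self.labels)), indices)
        self.embs, self.labels = self.embs[keep], self.labels[keep]
```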