Our biologically inspired ARTSCAN model of visual object learning and recognition with active eye movements proposes answers to these questions. The model explains how surface attention interacts with eye-movement-generating modules and object recognition modules so that views belonging to the same object are selectively clustered together. This interaction does not require prior knowledge of the object's identity. The modules in the model correspond to modules in the What and Where streams of the visual system. The What stream learns a spatially invariant and size-invariant representation of an object, using bottom-up filtering and top-down attentional mechanisms. The Where stream computes indices of object location and guides eye movements. Preprocessing is assumed to happen in the primary visual areas and includes log-polar compression of the periphery, contrast enhancement, and parallel processing of boundary and surface properties. The What stream is a variant of the ARTMAP classifiers designed by Gail Carpenter and Stephen Grossberg.
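To make the log-polar compression step concrete, the following is a minimal sketch of a generic log-polar mapping around a fixation point. It is not the ARTSCAN preprocessing itself: the function names (`to_log_polar`, `log_polar_sample`), the parameters (`r_min`, `r_max`, `n_rings`, `n_wedges`), and the nearest-neighbour sampling are all assumptions chosen for illustration.

```python
import math


def to_log_polar(x, y, r_min=1.0):
    """Map an offset (x, y) from the fixation point to (log-eccentricity, angle).

    r_min is a hypothetical foveal radius; offsets closer than r_min are
    clamped so that the logarithm stays defined at the fixation point.
    """
    r = math.hypot(x, y)
    rho = math.log(max(r, r_min) / r_min)
    theta = math.atan2(y, x)
    return rho, theta


def log_polar_sample(image, cx, cy, n_rings=8, n_wedges=16, r_min=1.0, r_max=None):
    """Resample a 2-D list `image` on a log-polar grid centred at (cx, cy).

    Ring radii grow geometrically from r_min to r_max, so the periphery is
    sampled more coarsely than the centre. Assumes n_rings >= 2 and that a
    circle of radius r_max around (cx, cy) fits inside the image.
    """
    h, w = len(image), len(image[0])
    if r_max is None:
        r_max = min(cx, cy, w - 1 - cx, h - 1 - cy)
    out = []
    for i in range(n_rings):
        r = r_min * (r_max / r_min) ** (i / (n_rings - 1))
        row = []
        for j in range(n_wedges):
            theta = 2 * math.pi * j / n_wedges
            # Nearest-neighbour lookup; a real front end would interpolate.
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            row.append(image[y][x])
        out.append(row)
    return out


# Toy 33x33 "retina" whose pixel value equals its column index.
image = [[col for col in range(33)] for _ in range(33)]
compressed = log_polar_sample(image, cx=16, cy=16)
```

In this sketch, doubling the eccentricity of a point always shifts its radial coordinate by the same amount, `log 2`, which is why a fixed number of output rings can cover a rapidly widening peripheral region. This property is also what makes changes in object size appear as shifts along the radial axis.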