For tracking multiple targets in a scene, the most common approach is to represent each target by a bounding box and track the whole box as a single entity. However, in the case of humans, the body undergoes complex articulation and occlusion that severely degrade the tracking
performance. In this paper, we argue that better tracking results can be achieved by focusing on a relatively rigid body part, the head, instead of tracking the whole body of a target. Based on this assumption, we follow the tracking-by-detection paradigm and generate target hypotheses only from
the spatial locations of heads in every frame. After the heads are localized, a constant velocity motion model is used for the temporal evolution of the targets in the visual scene. To associate the targets in consecutive frames, combinatorial optimization is used that associates
the corresponding targets in a greedy fashion. The method is evaluated qualitatively on four challenging video surveillance datasets and achieves promising results.
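
To make the two tracking steps concrete, the following is a minimal sketch of a constant velocity prediction followed by a greedy nearest-first association. It is only one plausible reading of the description above, not the authors' implementation: the function names `predict` and `greedy_associate`, the gating threshold `max_dist`, the use of Euclidean distance between head centers, and the pixel coordinates in the example are all assumptions introduced for illustration.

```python
import math


def predict(position, velocity, dt=1.0):
    """Constant velocity motion model: new position = position + velocity * dt."""
    return (position[0] + velocity[0] * dt, position[1] + velocity[1] * dt)


def greedy_associate(predicted, detections, max_dist=50.0):
    """Greedily pair predicted track positions with detected head positions.

    All (track, detection) pairs are sorted by Euclidean distance and accepted
    closest-first, skipping any track or detection that is already matched.
    `max_dist` is a hypothetical gating threshold, not a value from the paper.
    Returns a dict mapping track index -> detection index.
    """
    pairs = sorted(
        (math.dist(p, d), ti, di)
        for ti, p in enumerate(predicted)
        for di, d in enumerate(detections)
    )
    matches, used_detections = {}, set()
    for dist, ti, di in pairs:
        if dist > max_dist:
            break  # remaining pairs are even farther apart
        if ti in matches or di in used_detections:
            continue
        matches[ti] = di
        used_detections.add(di)
    return matches


# Illustrative data: two tracks as (position, velocity), then detections
# of head centers in the next frame.
tracks = [((10.0, 10.0), (5.0, 0.0)), ((100.0, 50.0), (0.0, -5.0))]
detections = [(101.0, 44.0), (16.0, 10.0)]

predicted = [predict(pos, vel) for pos, vel in tracks]
matches = greedy_associate(predicted, detections)
print(matches)
```

A greedy pass of this kind is simple and fast but does not guarantee a globally minimal total distance; an optimal assignment solver could be substituted if that property is needed.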