Recognition and understanding of group activities can significantly improve situational awareness in surveillance systems. To maximize the reliability and effectiveness of Persistent Surveillance Systems, annotations of sequential images gathered from video streams (i.e., imagery and acoustic features) must be fused together to generate semantic messages describing group activities (GA). To facilitate efficient fusion of features extracted from heterogeneous physical sensors, a common data structure is needed to ease integration of processed data into new comprehension. In this paper, we describe a framework for the extraction and management of pertinent features/attributes vital for reliable annotation of group activities. A robust technique is proposed for fusing events and entity attributes generated from video streams. A modified Transducer Markup Language (TML) is introduced for semantic annotation of event and entity attributes. By aggregating multi-attribute TML messages, we demonstrate that salient group activities can be reliably annotated in space and time. This paper discusses our experimental results and our analysis of a set of simulated group activities performed under different contexts, and demonstrates the efficiency and effectiveness of the proposed modified TML data structure, which facilitates seamless fusion of information extracted from video streams.
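To make the idea of a multi-attribute TML message concrete, the following is a minimal sketch of how one such annotation message might be assembled. The element and attribute names (`tml`, `entity`, `event`, `attribute`) are illustrative assumptions only; the abstract does not specify the modified TML schema, which may differ in practice.

```python
# Hypothetical sketch of a multi-attribute TML-style annotation message.
# NOTE: the element/attribute names here are assumptions, not the actual
# modified TML schema described in the paper.
import xml.etree.ElementTree as ET

def build_tml_message(sensor_id, timestamp, entity_id, activity, attributes):
    """Assemble one semantic annotation message for a detected entity/event."""
    root = ET.Element("tml", sensor=sensor_id, time=timestamp)
    entity = ET.SubElement(root, "entity", id=entity_id)
    event = ET.SubElement(entity, "event", activity=activity)
    for name, value in attributes.items():
        ET.SubElement(event, "attribute", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")

# Example: annotate one entity participating in a group activity, with
# spatiotemporal attributes attached for later aggregation/fusion.
msg = build_tml_message(
    sensor_id="cam-01",
    timestamp="2010-06-01T12:00:00Z",
    entity_id="person-7",
    activity="group-meeting",
    attributes={"lat": 38.90, "lon": -77.03, "confidence": 0.87},
)
print(msg)
```

Messages of this shape, one per detected entity/event, could then be aggregated across sensors and time windows to produce the group-activity annotations described above.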