Block diagram of AVIVA. Video localization is based on face and head detection. The location of each speaker is approximated by processing the 2D image information obtained from at least two synchronized color video cameras, using the calibration parameters and an optimization method. The position of the microphone array and the output of the visual localizer are then used to calculate the direction of arrival (DOA) of each speaker. Based on this information, a smart initialization is set for the FastIVA algorithm.
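As an illustration of the direction-of-arrival step described in the caption, the following minimal sketch computes the azimuth and elevation of a speaker as seen from the microphone array, given the speaker's estimated 3D position. The function name `direction_of_arrival`, the angle convention, the shared room coordinate frame, and the example positions are all assumptions made for illustration; the source does not specify them, nor how the resulting angles are mapped to the FastIVA initialization.

```python
import numpy as np


def direction_of_arrival(speaker_pos, array_center):
    """Return (azimuth, elevation) in degrees of a speaker seen from the array.

    Assumed convention (not taken from the source): azimuth is measured in the
    x-y plane from the +x axis, elevation from the x-y plane towards +z.
    Both positions are 3D points in the same coordinate frame.
    """
    d = np.asarray(speaker_pos, dtype=float) - np.asarray(array_center, dtype=float)
    dist = np.linalg.norm(d)
    if dist == 0.0:
        raise ValueError("speaker and array center coincide; DOA is undefined")
    azimuth = np.degrees(np.arctan2(d[1], d[0]))
    elevation = np.degrees(np.arcsin(d[2] / dist))
    return azimuth, elevation


# Hypothetical positions in meters, standing in for the visual localizer output.
array_center = [0.0, 0.0, 1.0]
speakers = {"speaker_1": [1.0, 1.0, 1.0], "speaker_2": [-1.0, 1.0, 1.0]}

# One DOA per speaker; these angles would feed the separation initialization.
doas = {name: direction_of_arrival(pos, array_center) for name, pos in speakers.items()}
```

In this sketch both speakers are at the height of the array, so their elevations are zero and only the azimuths differ.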