Semantic segmentation is a popular task that has piqued the interest of many industries and research communities. However, acquiring segmentation labels is costly, as it often requires carefully annotating the boundaries of the objects of interest. This has triggered research on weakly-supervised methods that rely on image-level labels, which are far cheaper to obtain. Existing methods leverage pseudo-labels produced from class activation maps (CAMs) generated with models pre-trained on ImageNet. Using CAMs as pseudo-labels introduces two challenges. First, ImageNet pre-training biases models toward predicting a single object per image. Second, the pseudo-labels are noisy. In this work, we address the first problem by pre-training the backbone with multi-label iterated learning. In the literature, the second problem is usually alleviated by introducing an additional consistency loss during backbone pre-training or as an additional CAM refinement step. Here, we propose a generalization of Puzzle-CAM's consistency loss that supports multiple augmentations and tiling resolutions, which further reduces the noise in CAMs and improves the final segmentation performance. Experiments on both PASCAL VOC and COCO show improved mIoU scores over existing weakly-supervised methods.
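To make the proposed consistency term concrete, the following is a minimal sketch of a Puzzle-CAM-style loss generalized to several augmentations and tiling resolutions, as described above. It is an illustration under assumptions, not the paper's implementation: the `model` callable, the pointwise toy models, and the L1 distance between the full-image CAM and the re-stitched tiled CAMs are all choices made here for clarity.

```python
import numpy as np

def cam(model, image):
    """Stand-in for a CAM head: in practice this would be the class
    activation map of a network; here it is just `model(image)`."""
    return model(image)

def tiled_cam(model, image, n):
    """Compute CAMs on an n x n tiling of the image and stitch the
    per-tile maps back into a full-resolution map."""
    h, w = image.shape[:2]
    th, tw = h // n, w // n
    out = np.zeros_like(cam(model, image))
    for i in range(n):
        for j in range(n):
            tile = image[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            out[i * th:(i + 1) * th, j * tw:(j + 1) * tw] = cam(model, tile)
    return out

def consistency_loss(model, image, augmentations, tilings):
    """Mean L1 distance between the full-image CAM and CAMs recomposed
    from tiles, averaged over several augmentations and several tiling
    resolutions (the generalization over Puzzle-CAM's single 2x2 tiling)."""
    losses = []
    for aug in augmentations:
        aug_img = aug(image)
        full = cam(model, aug_img)
        for n in tilings:
            losses.append(np.abs(full - tiled_cam(model, aug_img, n)).mean())
    return float(np.mean(losses))
```

For a purely pointwise model the loss is zero, since tiling commutes with the map; a model that pools global context (e.g. one that outputs the image mean everywhere) incurs a positive loss, which is exactly the inconsistency the regularizer penalizes.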