Multi-modal image fusion is a key component of computer vision applications such as satellite remote sensing and medical diagnosis. Numerous multi-modal image fusion techniques exist, each with its own advantages and disadvantages. This paper proposes a new method based on multi-scale guided filtering. First, each source image is decomposed into coarse and fine layers at multiple scales using a guided filter. The coarse and fine layers are then fused using two different saliency maps: an energy saliency map for the coarse layers and a modified spatial-frequency saliency map for the fine layers. Simulation results on a standard brain atlas database show that the proposed method outperforms other state-of-the-art techniques in quantitative quality evaluations.
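The pipeline above can be illustrated with a minimal single-scale sketch in Python. This is not the paper's implementation: it assumes a standard gray-scale guided filter (He et al.), a simple windowed-energy measure for the coarse layers, and a basic spatial-frequency measure for the fine layers; the paper's multi-scale decomposition and its modified saliency definitions may differ.

```python
import numpy as np

def box_filter(img, r):
    """Mean filter over a (2r+1)x(2r+1) window, edge-padded, via integral image."""
    k = 2 * r + 1
    p = np.pad(img, r, mode='edge')
    c = np.cumsum(np.cumsum(p, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))  # zero row/column so window sums index cleanly
    return (c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]) / (k * k)

def guided_filter(I, p, r, eps):
    """Edge-preserving guided filter (He et al.): guide I, filtering input p."""
    mI, mp = box_filter(I, r), box_filter(p, r)
    var_I = box_filter(I * I, r) - mI * mI
    cov_Ip = box_filter(I * p, r) - mI * mp
    a = cov_Ip / (var_I + eps)
    b = mp - a * mI
    return box_filter(a, r) * I + box_filter(b, r)

def energy_saliency(layer, r=3):
    # Local energy: windowed mean of squared intensities.
    return box_filter(layer ** 2, r)

def sf_saliency(layer, r=3):
    # Local spatial frequency: windowed mean of squared row/column differences
    # (a plain, unmodified variant used here for illustration).
    rf = np.zeros_like(layer)
    cf = np.zeros_like(layer)
    rf[:, 1:] = (layer[:, 1:] - layer[:, :-1]) ** 2
    cf[1:, :] = (layer[1:, :] - layer[:-1, :]) ** 2
    return np.sqrt(box_filter(rf + cf, r))

def fuse(img1, img2, r=4, eps=1e-3):
    """Single-scale fusion: guided-filter decomposition, then per-pixel
    saliency-based selection of coarse (base) and fine (detail) layers."""
    base1 = guided_filter(img1, img1, r, eps)
    base2 = guided_filter(img2, img2, r, eps)
    det1, det2 = img1 - base1, img2 - base2
    fused_base = np.where(energy_saliency(base1) >= energy_saliency(base2),
                          base1, base2)
    fused_det = np.where(sf_saliency(det1) >= sf_saliency(det2), det1, det2)
    return fused_base + fused_det

rng = np.random.default_rng(0)
a = rng.random((32, 32))
b = rng.random((32, 32))
fused = fuse(a, b)
```

A full version would repeat the decomposition at several filter radii (the multiple scales) and merge the fused layers across scales; fusing an image with itself should reconstruct the original, which is a useful sanity check.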