Multi-modal image fusion is a key component of computer vision applications such as satellite remote sensing and medical diagnosis. Numerous multi-modal image fusion techniques exist, each with its own strengths and weaknesses. This paper proposes a new method based on multi-scale guided filtering. First, each source image is decomposed into coarse and fine layers at multiple scales using a guided filter. These layers are then fused using two different saliency maps: an energy saliency map for the coarse layers and a modified spatial-frequency energy saliency map for the fine layers. Simulation results show that the proposed method outperforms state-of-the-art techniques in quantitative quality metrics. All simulations were conducted on a standard brain-atlas database.
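The decomposition step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses the standard guided filter (He et al.) realized with box filtering, and hypothetical scale radii; the paper's specific saliency-map fusion rules are not reproduced here. At each scale, the guided-filter output forms a coarse layer and the residual forms the corresponding fine layer, so the source image is exactly the sum of the last coarse layer and all fine layers.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(I, p, r=4, eps=1e-2):
    """Standard guided filter: guide image I, input p, radius r."""
    size = 2 * r + 1  # box-filter window
    mean_I = uniform_filter(I, size)
    mean_p = uniform_filter(p, size)
    corr_Ip = uniform_filter(I * p, size)
    corr_II = uniform_filter(I * I, size)
    var_I = corr_II - mean_I ** 2
    cov_Ip = corr_Ip - mean_I * mean_p
    # local linear model q = a*I + b within each window
    a = cov_Ip / (var_I + eps)
    b = mean_p - a * mean_I
    return uniform_filter(a, size) * I + uniform_filter(b, size)

def decompose(img, scales=(4, 8, 16)):
    """Multi-scale decomposition into coarse and fine layers.

    The scale radii here are illustrative assumptions, not the
    paper's parameters. Each pass smooths the previous coarse
    layer; the difference is the fine (detail) layer at that scale.
    """
    coarse, fine, cur = [], [], img.astype(np.float64)
    for r in scales:
        smoothed = guided_filter(cur, cur, r=r)
        fine.append(cur - smoothed)
        coarse.append(smoothed)
        cur = smoothed
    return coarse, fine
```

Because each fine layer is a residual, the decomposition is perfectly invertible: `coarse[-1] + sum(fine)` reconstructs the source image, which makes per-layer fusion rules (such as the saliency maps above) straightforward to apply before recombination.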