I have a multi-agent RL problem in which agents navigate an environment searching for an object. I use PPO with convolutional actor and critic networks. Each agent observes its own location, the search history, and the other agents' locations. These observations take the form of grid maps (the search area is represented as a grid map, as shown below). The actor network for an agent takes these 3 maps as input and produces an action.
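
To make the observation format concrete, here is a minimal, dependency-free sketch of how the 3 maps might be built for one agent. The function name `build_observation`, the binary 0/1 encoding, and the 4x5 grid are assumptions for illustration only, not the actual setup; in practice the stacked maps would be converted to a tensor of shape (3, H, W) before being passed to the convolutional actor.

```python
def build_observation(grid_size, own_pos, visited, other_positions):
    """Return three H x W maps: own location, search history, other agents.

    Hypothetical helper: the binary encoding is an assumption, the real
    maps may use different values (e.g. visit counts or agent ids).
    """
    rows, cols = grid_size

    def empty_map():
        # A fresh rows x cols grid filled with zeros.
        return [[0.0] * cols for _ in range(rows)]

    # Map 1: a single 1 at this agent's own cell.
    own_map = empty_map()
    own_map[own_pos[0]][own_pos[1]] = 1.0

    # Map 2: 1 for every cell that has already been searched.
    history_map = empty_map()
    for r, c in visited:
        history_map[r][c] = 1.0

    # Map 3: 1 at the cell of every other agent.
    others_map = empty_map()
    for r, c in other_positions:
        others_map[r][c] = 1.0

    # Stacked as channels, i.e. a nested list of shape (3, rows, cols).
    return [own_map, history_map, others_map]


obs = build_observation(
    grid_size=(4, 5),
    own_pos=(1, 2),
    visited={(0, 0), (0, 1), (1, 2)},
    other_positions=[(3, 4), (2, 0)],
)
```

Stacking the maps as channels keeps the spatial alignment between them, which is what lets a convolutional layer relate the agent's position to nearby searched cells and nearby agents.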

