GPT4 is a multi-modal model. They haven't exposed a way to use the image embeddings, so consumers can't utilize it, but the model accepts image input and it was trained on images and text. Yes, there are other modals that can be incorporated, and reinforcement learning is still pretty nascent/very much unsolved.
Well, there's multi-modal training. There's tons of untapped audiovisual data.