
You are working on a text modeling problem and need to choose an architecture for sequence data. The candidate should explain how CNNs, RNNs, and Transformers process text, what each one learns well, and where each tends to break down.
What are the main differences between CNNs, RNNs, and Transformers when they are applied to text?
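
As a starting point for an answer, the three processing styles can be contrasted on a toy sequence. This is a minimal sketch, not a real model: the function names (`conv1d`, `rnn`, `self_attention`), the scalar "embeddings", and the fixed weights are all assumptions made for illustration. In particular, the attention function uses the inputs directly as queries, keys, and values, whereas a real Transformer layer uses learned projections, vector-valued tokens, score scaling, and positional information.

```python
import math

def conv1d(xs, kernel):
    """CNN-style: each output depends only on a fixed local window of the input."""
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, xs[i:i + k]))
            for i in range(len(xs) - k + 1)]

def rnn(xs, w_in=0.5, w_rec=0.5):
    """RNN-style: tokens are consumed one at a time; a hidden state carries history."""
    h = 0.0
    states = []
    for x in xs:  # inherently sequential: step t needs the state from step t-1
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def self_attention(xs):
    """Transformer-style: every position looks at every position in a single step.

    Simplification (assumption): queries, keys, and values are the raw inputs.
    """
    out = []
    for q in xs:
        scores = [q * k for k in xs]              # similarity to every position
        m = max(scores)                           # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]       # softmax over the whole sequence
        out.append(sum(w * v for w, v in zip(weights, xs)))
    return out

tokens = [1.0, 2.0, 3.0, 4.0]  # toy scalar "embeddings" (hypothetical data)

conv_out = conv1d(tokens, [0.5, 0.5])  # local windows of width 2
rnn_out = rnn(tokens)                  # one hidden state per time step
attn_out = self_attention(tokens)      # one context-mixed value per position
```

The sketch points at the usual trade-offs: the convolution sees only a fixed window per layer, so long-range context requires stacking layers or widening kernels; the recurrent loop can in principle carry information across the whole sequence but must run step by step, which limits parallelism; self-attention connects all positions directly, at a cost that grows quadratically with sequence length.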