June 28, 2023, 2:16 a.m. | /u/Secure_Equivalent_46

Deep Learning www.reddit.com

Hi! I am doing a video captioning task using an LSTM-RNN encoder-decoder architecture. I notice that after training for a few thousand epochs, the output I get for each video makes sense only for the first three words. From the 4th word onward, it does not sound like human language to me.


Any suggestion to solve this?
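Not OP's code, but one common cause of captions degrading after the first few words is exposure bias: if the decoder is trained on its own sampled outputs, early mistakes compound at each step. Training with teacher forcing (feeding the ground-truth previous word at each decoding step) often fixes exactly this. A minimal sketch, assuming PyTorch; the class name and all dimensions here are illustrative, not from the post:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    # Hypothetical minimal caption decoder; vocab/embedding/hidden
    # sizes are placeholders, not values from the original post.
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, captions, state):
        # Teacher forcing: condition each step on the ground-truth
        # previous word (captions[:, :-1]) rather than the model's
        # own previous prediction. The logits returned here are
        # scored against the shifted targets captions[:, 1:].
        emb = self.embed(captions[:, :-1])
        hidden, _ = self.lstm(emb, state)
        return self.out(hidden)

# Usage: the video encoder's final (h, c) state seeds the decoder.
batch, seq_len, hidden_dim, vocab = 2, 5, 128, 1000
decoder = CaptionDecoder(vocab_size=vocab, hidden_dim=hidden_dim)
video_state = (torch.zeros(1, batch, hidden_dim),
               torch.zeros(1, batch, hidden_dim))
captions = torch.randint(0, vocab, (batch, seq_len))
logits = decoder(captions, video_state)   # (batch, seq_len - 1, vocab)
```

At inference time you still decode step by step from the model's own outputs, so it can also help to replace greedy decoding with beam search, and to check that "a few thousand epochs" isn't simply overfitting a small training set.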

