Image Captioning
Using a convolutional neural network (CNN) and a Long Short-Term Memory (LSTM) network, create an image caption generator and generate captions for your own images.
A convolutional neural network can be used to create a dense feature vector. This dense vector, also called an embedding, can be used as feature input into other algorithms or networks. For an image captioning model, this embedding serves as a dense representation of the image and is used as the initial state of the LSTM.
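As a rough illustration, the sketch below extracts an image embedding with a pretrained Keras InceptionV3 encoder and projects it to the LSTM's state size. The encoder choice and the 512-unit state size are assumptions made for illustration, not the assignment's required architecture.

```python
import tensorflow as tf

EMBED_DIM = 512  # assumed LSTM state size; the assignment does not fix this value

# CNN encoder: global average pooling collapses spatial features into one dense vector.
cnn = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
project = tf.keras.layers.Dense(EMBED_DIM)  # maps the 2048-dim CNN features to EMBED_DIM

def image_embedding(image_path):
    """Return a (1, EMBED_DIM) embedding for a single image file."""
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=(299, 299))
    x = tf.keras.preprocessing.image.img_to_array(img)[None, ...]
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return project(cnn(x, training=False))

# The embedding seeds the decoder: pass it as the LSTM's initial hidden and cell state.
decoder_lstm = tf.keras.layers.LSTM(EMBED_DIM, return_sequences=True, return_state=True)
# outputs, h, c = decoder_lstm(word_embeddings, initial_state=[img_emb, img_emb])
```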
To generate captions, you'll first create a caption generator. This caption generator uses beam search to improve the quality of the generated sentences.
At each iteration, the generator feeds the previous LSTM state (the initial state is the image embedding) and the partial caption sequence into the model to produce a softmax vector over the vocabulary for the next word.
The top N most probable candidates are kept and used in the next inference step. This process continues until either the maximum sentence length is reached or all candidate sentences have produced the end-of-sentence token.
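The following is a minimal, framework-agnostic sketch of that beam-search loop. The `step_fn` callable is an assumption standing in for a single LSTM-plus-softmax decoding step (previous state and word in, next-word log probabilities and new state out); it is not part of the provided code.

```python
import heapq
import numpy as np

def beam_search(step_fn, initial_state, start_id, end_id, beam_size=3, max_len=20):
    """Return (log_prob, token_ids) pairs for the best beam_size sentences."""
    # Each beam entry: (cumulative log prob, token sequence, LSTM state).
    beams = [(0.0, [start_id], initial_state)]

    for _ in range(max_len):
        candidates = []
        all_finished = True
        for log_prob, seq, state in beams:
            if seq[-1] == end_id:              # finished sentence: carry it forward
                candidates.append((log_prob, seq, state))
                continue
            all_finished = False
            # One decoder step: previous state + previous word -> next-word log probs.
            word_log_probs, new_state = step_fn(state, seq[-1])
            # Expand only the beam_size most probable next words for this beam.
            for tok in np.argsort(word_log_probs)[-beam_size:]:
                candidates.append((log_prob + word_log_probs[tok],
                                   seq + [int(tok)], new_state))
        # Keep the top N candidate sentences for the next inference step.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        if all_finished:                       # every beam emitted the end-of-sentence token
            break

    beams.sort(key=lambda c: c[0], reverse=True)
    return [(log_prob, seq) for log_prob, seq, _ in beams]
```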
Next, you’ll load the Show and Tell model and use it with the caption generator above to create candidate sentences. These sentences will be printed along with their log probabilities in your Colab notebook.
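A hedged sketch of how the pieces might fit together is shown below. The names `load_model`, `model.inference_step`, and `vocab` are hypothetical placeholders, not the actual Show and Tell API; only `beam_search` and `image_embedding` refer to the sketches above.

```python
import math

model = load_model("show_and_tell_checkpoint")   # hypothetical: restore pretrained weights
img_emb = image_embedding("my_photo.jpg")        # image embedding from the earlier sketch
vocab = model.vocab                              # hypothetical id <-> word lookup

captions = beam_search(model.inference_step, initial_state=img_emb,
                       start_id=vocab.start_id, end_id=vocab.end_id,
                       beam_size=3, max_len=20)

# Print each candidate sentence with its log probability.
for log_prob, token_ids in captions:
    words = [vocab.id_to_word(t) for t in token_ids[1:-1]]   # drop start/end tokens
    print("%s (log prob: %.4f)" % (" ".join(words), log_prob))
```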
To receive credit for this assignment, submit a link to a functioning, self-contained Colab notebook for review. You will need to include your Python code, relevant graphs/plots, and an explanation of your logic in Markdown cells.