Transformers in Vision

Hi, everyone!

I am sure most of you are already familiar with Transformers (Vaswani et al.). They have been one of the most remarkable breakthroughs in Deep Learning in recent years. Originally designed for Natural Language Processing, they have also improved the performance of neural networks in computer vision, in tasks such as object detection, semantic segmentation, clustering, and 3D analysis.

Kan et al. give a great example of this in their survey, which includes a comparison between these architectures, a review of their pros and cons, and a discussion of future challenges.

Have you already used any of these architectures? How did they perform? Let us know in the comments!

I have pre-trained and fine-tuned the RoBERTa transformer architecture myself. The model performs well on NLP tasks compared to earlier models.

We also used several other transformer architectures to generate sentence embeddings.


Hi, @akkefa!

Sounds great!
Are those sentence embeddings similar to GloVe embeddings, but for sentences instead of words?

Yes. SentenceTransformers is an excellent library for that task.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of strings.',
    'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

# Print each sentence alongside its embedding vector
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
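
And to the GloVe comparison: just like word vectors, these sentence embeddings can be compared with cosine similarity. Here is a minimal sketch using NumPy; the toy vectors below merely stand in for real model.encode() outputs.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for two sentence embeddings from model.encode().
emb_a = np.array([0.1, 0.3, 0.5])
emb_b = np.array([0.2, 0.1, 0.4])

print("Similarity:", cosine_similarity(emb_a, emb_b))
```

SentenceTransformers also ships a helper for this (sentence_transformers.util.cos_sim), which works directly on the arrays returned by model.encode().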