How to cluster similar sentences using BERT

You can use Sentence Transformers to generate the sentence embeddings. These embeddings are much more meaningful than those obtained from bert-as-service, as they have been fine-tuned so that semantically similar sentences get higher similarity scores. If the number of sentences to be clustered is in the millions or more, consider a FAISS-based clustering algorithm, since vanilla clustering algorithms such as K-means become prohibitively slow at that scale.
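As a minimal sketch of this approach (assuming the sentence-transformers and scikit-learn packages are installed; the model name all-MiniLM-L6-v2, the toy sentences, and the cluster count are just illustrative choices):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "The cat sits on the mat.",
    "A cat is resting on a rug.",
    "Stocks fell sharply on Monday.",
    "Markets dropped at the start of the week.",
]

# Encode the sentences into dense vectors fine-tuned for semantic similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

# Cluster the embeddings; n_clusters=2 is a placeholder for this toy data
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(embeddings)

for sentence, label in zip(sentences, labels):
    print(label, sentence)

For very large corpora you could swap the scikit-learn K-means step for faiss.Kmeans from the FAISS library, training it on the same embedding matrix converted to float32.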


You will need to generate BERT embeddings for the sentences first. bert-as-service provides a very easy way to generate embeddings for sentences.

This is how you can generate BERT vectors for a list of sentences you need to cluster. It is explained very well in the bert-as-service repository: https://github.com/hanxiao/bert-as-service

Installations:

pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of `bert-serving-server`

Download one of the pre-trained models available at https://github.com/google-research/bert

Start the service:

bert-serving-start -model_dir /your_model_directory/ -num_worker=4 

Generate the vectors for the list of sentences:

from bert_serving.client import BertClient

# Connect to the running bert-serving-start service and encode the sentences
bc = BertClient()
vectors = bc.encode(your_list_of_sentences)

This would give you a list of vectors. You could write them to a CSV file and use any clustering algorithm, since the sentences are now reduced to numeric vectors.
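A rough sketch of that last step, assuming numpy and scikit-learn are available (the output filename and the choice of 5 clusters are placeholders):

import numpy as np
from sklearn.cluster import KMeans

vectors = np.asarray(vectors)

# Optionally persist the embeddings as CSV for later use
np.savetxt("sentence_vectors.csv", vectors, delimiter=",")

# Cluster the embeddings with K-means; 5 clusters is an arbitrary example
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
labels = kmeans.fit_predict(vectors)
print(labels)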