On January 15, I attended the Watson Warriors event in snowy Seattle, hosted by Tech Data. Watson Warriors is a multi-challenge game, developed by Launch Consulting, that allows data scientists to compete against each other to solve AI problems using Watson Studio Cloud. In this first Warrior challenge, we used weather data and images to predict where fires were occurring. Several more challenges will be hosted around the country in the coming months.
I was very pleased to see so many people at the event; by my estimate, about 80 people attended. I teamed up with a couple of friends from IBM and another business partner in the competition. We didn't win, but we did have a great time. I'm looking forward to the next one in San Antonio in March!
If you read my blog from December 20 about answering questions from long passages using BERT, you know how excited I am about the impact BERT is having on natural language processing. BERT (Bidirectional Encoder Representations from Transformers), developed by Google, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
The BERT pre-trained models can be used for more than just question/answer tasks. They can also be used to determine how similar two sentences are to each other. In this post, I am going to show how to find these similarities using a measure known as cosine similarity. I do some very simple testing using 3 sentences that I tokenized manually. If you are using a larger corpus, you will definitely want to tokenize the sentences with something like nltk.tokenize. The first two sentences (0 and 1) come from the same blog entry, while the third (2) comes from a separate blog entry, so sentences 0 and 1 should be more similar to each other than either is to sentence 2 (I'll explain later why I numbered them 0-2 instead of 1-3). Let's see if that is the case. The three tokenized sentences appear in the script below.
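Before getting to BERT itself, it helps to pin down the measure. Cosine similarity is the cosine of the angle between two vectors: 1.0 for vectors pointing in the same direction, 0.0 for orthogonal ones. Here is a minimal pure-Python sketch of the measure (the script later uses scikit-learn's cosine_similarity, which computes the same quantity on the BERT embedding vectors):

```python
import math

def cosine_similarity_manual(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical directions score 1.0, orthogonal directions score 0.0.
print(cosine_similarity_manual([1.0, 0.0], [1.0, 0.0]))            # 1.0
print(cosine_similarity_manual([1.0, 0.0], [0.0, 1.0]))            # 0.0
print(round(cosine_similarity_manual([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```

With 768-dimensional BERT sentence embeddings the arithmetic is the same, just with longer vectors.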
In this testing, I used the BERT-as-a-Service server and client. The testing was done two different ways: first with the BERT server and BERT client both running on the same physical server, and second with the BERT server on a different physical server than the BERT client. Where the two differ, I will list the syntax needed for each. Note: If you run the BERT server in a Docker container on one physical server and the BERT client on a different physical server or PC, you must explicitly publish ports 5555 and 5556 for the container the BERT server is running in. If you do not, the client won't be able to connect. To do this, add the -p switch to your docker run command (-p 5555:5555 -p 5556:5556). This is not necessary if the BERT server and BERT client are both running on the same physical server.
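For the cross-server case, the docker run invocation can look like the sketch below. The image tag bert-as-service and the host model path are assumptions (the image would be built from the Dockerfile in the bert-as-service repo); only the -p port mappings are the essential part:

```shell
# Publish the two ports the BERT server listens on (5555 for requests, 5556 for results).
# 'bert-as-service' is a hypothetical image tag; /models/... is an arbitrary host path.
docker run -p 5555:5555 -p 5556:5556 \
  -v /models/uncased_L-12_H-768_A-12:/model \
  bert-as-service
```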
Installing the BERT Server and Client
The BERT server and client require TensorFlow version 1.10 or greater. I used version 1.14.0-rc0 in this testing. To install each, run the following:
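The server and client are separate packages on PyPI. A typical install, assuming pip points at a compatible Python environment, looks like this (the exact TensorFlow pin is up to you, as long as it is 1.10 or greater):

```shell
# A TensorFlow build that satisfies the >= 1.10 requirement
pip install tensorflow==1.14.0
# The BERT server and client are packaged separately
pip install bert-serving-server
pip install bert-serving-client
```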
Create a directory for the models to be saved in:
Get the pre-trained models (https://github.com/google-research/bert#pre-trained-models). I used BERT-Base, both the Uncased and Cased versions for testing.
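Assuming /models as the target directory (any path works, as long as the server is pointed at it later), the two steps above can be sketched as follows; the download URLs are the ones linked from the pre-trained models table in the google-research/bert repo:

```shell
# Create a directory to hold the pre-trained models
mkdir -p /models && cd /models
# BERT-Base, Uncased and Cased checkpoints
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
unzip cased_L-12_H-768_A-12.zip
```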
Starting the BERT Server
When starting the BERT server, you must specify which pre-trained model to use with the -model_dir switch. The -num_worker switch specifies how many clients can be served concurrently at one time. The -show_tokens_to_client switch is not necessary, but I use it here so the encoded sentences can be seen.
If using the Uncased model, run:
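Assuming the uncased checkpoint was unzipped to /models/uncased_L-12_H-768_A-12 (the path is arbitrary), the start command looks like:

```shell
bert-serving-start -model_dir /models/uncased_L-12_H-768_A-12 -num_worker=1 -show_tokens_to_client
```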
If using the Cased model, run:
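Assuming the cased checkpoint was unzipped to /models/cased_L-12_H-768_A-12, the command is the same apart from the model directory:

```shell
bert-serving-start -model_dir /models/cased_L-12_H-768_A-12 -num_worker=1 -show_tokens_to_client
```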
When the server is started, you will see something similar to this:
Connecting to the BERT Server with the BERT Client and Finding Sentence Similarity
The following script (BERT_sentence_similarity.py) can be found on my GitHub at https://github.com/pacejohn/BERT-Cosine-Similarities.
# Connect to a BERT server from a BERT client and determine the cosine similarity between 2 sentences
from bert_serving.client import BertClient
from sklearn.metrics.pairwise import cosine_similarity
# Uncomment the following line if the BERT server is running locally (on the same physical server that the client will be running on).
#client = BertClient()
# Uncomment the following line if the BERT server is running remotely (on a different physical server than the client will be running on).
# You must specify the IP of the remote server and the ports
client = BertClient(ip='10.3.50.61', port=5555, port_out=5556)
# Save tokenized sentences to variables. This makes it easier later.
# I started the numbering at 0 rather than 1 so it matches the indexes of the arrays that are created when the encoding happens.
sentence0 = ['bert', 'was', 'developed', 'by', 'google', 'and', 'nvidia', 'has', 'created', 'an', 'optimized', 'version', 'that', 'uses', 'tensorrt']
sentence1 = ['one', 'drawback', 'of', 'bert', 'is', 'that', 'only', 'short', 'passages', 'can', 'be', 'queried']
sentence2 = ['i', 'attended', 'a', 'conference', 'in', 'denver']
# Specify which 2 sentences to compare.
first_sentence = 0
second_sentence = 1
# Encode the sentences using the BERT client. With show_tokens=True, encode() returns a tuple:
# the encoded vectors and the tokens the server actually used.
vectors, tokens = client.encode([sentence0, sentence1, sentence2], show_tokens=True, is_tokenized=True)
# If you print 'tokens', you can see which words the model did not recognize. They are denoted by [UNK].
# Calculate the cosine similarity between the 2 sentences you specified.
cos_sim = cosine_similarity(vectors[first_sentence].reshape(1,-1), vectors[second_sentence].reshape(1,-1))
# Show the sentences and their cosine similarity. Printing the original token lists leaves off
# the [CLS] at the beginning and the [SEP] at the end.
tokenized_sentences = [sentence0, sentence1, sentence2]
print("\n\nThe cosine similarity between sentence:\n" + ' '.join(tokenized_sentences[first_sentence]) + "\n\nand sentence:\n\n" + ' '.join(tokenized_sentences[second_sentence]) + "\nis " + str(cos_sim))
# Show the tokenized versions of the sentences for comparison
print("\n******\nThe encoded sentences are")
print(tokens)
As a final note, Tables 1 to 4 below show how the cosine similarities differ when the input sentences do or do not contain capital letters, and when the Uncased or Cased model is used. In my next post, I will discuss these differences in more detail.
I welcome your feedback and suggestions. Follow me on Twitter (@pacejohn) and on LinkedIn (https://www.linkedin.com/in/john-pace-phd-20b87070/). Also, follow my company, Mark III Systems on Twitter (@markiiisystems).
This post is also on Medium at https://medium.com/@johnpace_32927/finding-cosine-similarity-between-sentences-using-bert-as-a-service-6bbcd02defcf.
Comparison of Cosine Similarities Using BERT Uncased and Cased Models
Thanks to Anirudh S and Han Xiao for code snippets and ideas.