Graph to connect sentences
Didn't implement NLP for verb / noun separation, just added a list of good words.
They can be extracted and normalized with spacy relatively easy.
Please note that walk
occurs in 1,2,5 sentences and forms a triad.
import re
import networkx as nx
import matplotlib.pyplot as plt
plt.style.use("ggplot")
sentences = [
"I went out for a walk or walking.",
"When I was walking, I saw a cat. ",
"The cat was injured. ",
"My mum's name is Marylin.",
"While I was walking, I met John. ",
"Nothing has happened.",
]
G = nx.Graph()
# set of possible good words
good_words = {"went", "walk", "cat", "walking"}
# remove punctuation and keep only good words inside sentences
words = list(
map(
lambda x: set(re.sub(r"[^\w\s]", "", x).lower().split()).intersection(
good_words
),
sentences,
)
)
# convert sentences to dict for furtehr labeling
sentences = {k: v for k, v in enumerate(sentences)}
# add nodes
for i, sentence in sentences.items():
G.add_node(i)
# add edges if two nodes have the same word inside
for i in range(len(words)):
for j in range(i + 1, len(words)):
for edge_label in words[i].intersection(words[j]):
G.add_edge(i, j, r=edge_label)
# compute layout coords
coord = nx.spring_layout(G)
plt.figure(figsize=(20, 14))
# set label coords a bit upper the nodes
node_label_coords = {}
for node, coords in coord.items():
node_label_coords[node] = (coords[0], coords[1] + 0.04)
# draw the network
nodes = nx.draw_networkx_nodes(G, pos=coord)
edges = nx.draw_networkx_edges(G, pos=coord)
edge_labels = nx.draw_networkx_edge_labels(G, pos=coord)
node_labels = nx.draw_networkx_labels(G, pos=node_label_coords, labels=sentences)
plt.title("Sentences network")
plt.axis("off")
Update
If you want to measure the similarity between different sentences, you may want to calculate the difference between sentence embedding.
This gives you an opportunity to find semantic similarity between sentences with different words like "A soccer game with multiple males playing" and "Some men are playing a sport". Almost SoTA approach using BERT can be found here, more simple approaches are here.
Since you have similarity measure, just replace add_edge block to add new edge only if similarity measure is greater than some threshold. Resulting add edges code will look like this:
# add edges if two nodes have the same word inside
tresold = 0.90
for i in range(len(words)):
for j in range(i + 1, len(words)):
# suppose you have some similarity function using BERT or PCA
similarity = check_similarity(sentences[i], sentences[j])
if similarity > tresold:
G.add_edge(i, j, r=similarity)