Quick Start
Installation
You can easily install Semnet with pip.
pip install semnet
Basic Usage
All you need to start building your network is a set of embeddings and (optionally) some labels.
from semnet import SemanticNetwork, to_pandas
from sentence_transformers import SentenceTransformer
import networkx as nx
# Your documents
docs = [
"The cat sat on the mat",
"A cat was sitting on a mat",
"The dog ran in the park",
"I love Python",
"Python is a great programming language",
]
# Generate embeddings (use any embedding provider)
embedding_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
embeddings = embedding_model.encode(docs)
# Create and configure semantic network
sem = SemanticNetwork(thresh=0.3, distance="angular")
# Build the semantic graph from your embeddings
G = sem.fit_transform(embeddings, labels=docs)
# Analyze the graph
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Connected components: {nx.number_connected_components(G)}")
# Export to pandas
nodes_df, edges_df = to_pandas(G)
# Find similar document groups
for component in nx.connected_components(G):
if len(component) > 1:
similar_docs = [G.nodes[i]["label"] for i in component]
print(f"Similar documents: {similar_docs}")
# Calculate centrality measures
centrality = nx.degree_centrality(G)
for node, cent_value in centrality.items():
print(f"Document: {G.nodes[node]['label']}, Degree Centrality: {cent_value:.4f}")
G.nodes[node]["degree_centrality"] = cent_value
Performance at Scale
The Scaling Problem
Graph construction entails finding pairwise relationships (edges) between entities (nodes) in a dataset. For large corpora, scaling problems rapidly become apparent as the number of possible pairs in a set scales quadratically.
If we were to naively attempt to construct a graph from a modestly-sized set of documents we encounter problems early on. Building a graph from 10,000 documents would entail operations across 50 million pairs, for 100,000 it’s around 5 billion!
Naive Approach Problems
from sklearn.metrics import DistanceMetric
import numpy as np
dist = DistanceMetric.get_metric("euclidean")
# Generate 100,000 random embeddings
embeddings = np.random.rand(100_000, 768)
dist_scores = dist.pairwise(embeddings)
>> MemoryError: Unable to allocate 74.5 GiB for an array
with shape (100000, 100000) and data type float64
Semnet’s Solution
Semnet solves the scaling problem using Approximate Nearest Neighbours search with Annoy.
Instead of making comparisons between each document in the corpus, Semnet indexes the embeddings, iterates over each one, and returns a top_k best matches from within their neighbourhood.
from semnet import SemanticNetwork
import time
start_time = time.time()
# Make 100,000 embeddings
embeddings = np.random.rand(100_000, 768)
# Build semantic network
semnet = SemanticNetwork(thresh=0.4, top_k=5)
G = semnet.fit_transform(embeddings)
end_time = time.time()
print(f"Processing time: {end_time - start_time:.2f} seconds")
>> Processing time: 24.26 seconds
We’re not only able to process all the embeddings without crashing our kernel, but it’s done in under 30 seconds.
Working with Node Data
You can pass arbitrary metadata during graph construction:
# Create node data dictionary
node_data = {}
for i, doc in enumerate(docs):
node_data[i] = {
"category": "python" if "python" in doc.lower() else "other",
"length": len(doc),
"custom_field": "example"
}
# Build graph with metadata
G = sem.fit_transform(
embeddings=embeddings,
labels=docs,
node_data=node_data
)
# Access node attributes
for node_id, node_attrs in G.nodes(data=True):
print(f"Node {node_id}: {node_attrs}")
Graph Analysis
Once you have your graph, you can leverage the full power of NetworkX:
import networkx as nx
# Community detection
communities = nx.community.louvain_communities(G)
print(f"Found {len(communities)} communities")
# Centrality measures
degree_centrality = nx.degree_centrality(G)
closeness_centrality = nx.closeness_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)
# Path finding
if G.number_of_nodes() > 2:
nodes = list(G.nodes())
try:
path = nx.shortest_path(G, nodes[0], nodes[1])
print(f"Shortest path: {path}")
except nx.NetworkXNoPath:
print("No path found between nodes")