Semnet: Semantic networks from embeddings

Semnet efficiently constructs network structures from embeddings, enabling graph-based analysis and operations over embedded documents, images, and more.

The name “Semnet” derives from semantic network, as it was initially designed for an NLP use-case, but it will work well with any form of embedded document (e.g., images, audio, even or graphs).

Features

Semnet is a relatively small library, which does just one thing: efficiently construct network structures from embeddings.

Its key features are:

Rapid conversion of large embedding collections to graph format, with nodes as documents and edges as document similarity.
It’s fast and memory efficient: Semnet uses Annoy under the hood to perform efficient pair-wise distance calculations, allowing for million-node networks to be constructed in minutes on consumer hardware.
Graphs are returned as NetworkX objects, opening up a wide range of algorithms for your project.
Pass arbitrary metadata during graph construction or update it later in NetworkX.
Control graph construction by setting distance measures, similarity cut-offs, and limits on outbound edges per node.
Easily convert to pandas for downstream use.

Use cases

Semnet may be used for:

Graph algorithms: enrich your data with communities, centrality and much more for use in NLP, search, RAG and context engineering.
Deduplication: remove duplicate records (e.g., “Donald Trump”, “Donald J. Trump”) from datasets.
Exploratory data analysis and visualisation, Cosmograph works brilliantly for large corpora.

Quick start

You can easily install Semnet with pip.

pip install semnet

All you need to start building your network is a set of embeddings and (optionally) some labels.

from semnet import SemanticNetwork, to_pandas
from sentence_transformers import SentenceTransformer

# Your documents
docs = [
    "The cat sat on the mat",
    "A cat was sitting on a mat",
    "The dog ran in the park",
    "I love Python",
    "Python is a great programming language",
]

# Generate embeddings (use any embedding provider)
embedding_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
embeddings = embedding_model.encode(docs)

# Create and configure semantic network
sem = SemanticNetwork(thresh=0.3, distance="angular")

# Build the semantic graph from your embeddings
G = sem.fit_transform(embeddings, labels=docs)

# Analyze the graph
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")

# Export to pandas
nodes_df, edges_df = to_pandas(G)

What problem does Semnet solve?

Graph construction entails finding pairwise relationships (edges) between entities (nodes) in a dataset.

For large corpora, scaling problems rapidly become apparent as the number of possible pairs in a set scales quadratically.

Naive approach

If we were to naively attempt to construct a graph from a modestly-sized set of documents we encounter problems early on with modestly-sized corpora. Building a graph from 10,000 documents would entail operations across 50 million pairs, for 100,000 it’s around 5 billion!

Iterating over each pair is very slow, and faster approaches via scikit-learn run into memory problems:

from sklearn.metrics import DistanceMetric
import numpy as np

dist = DistanceMetric.get_metric("euclidean")

# Generate 100,000 random embeddings
embeddings = np.random.rand(100_000, 768)
dist_scores = dist.pairwise(embeddings)

>> MemoryError: Unable to allocate 74.5 GiB for an array
with shape (100000, 100000) and data type float64

With Semnet

Semnet solves the scaling problem using Approximate Nearest Neighbours search with Annoy.

Instead of making comparisons between each document in the corpus, Semnet indexes the embeddings, iterates over each one, and returns a top_k best matches from within their neighbourhood.

from semnet import SemanticNetwork
import time

start_time = time.time()

# Make 100,000 embeddings
embeddings = np.random.rand(100_000, 768)

# Build semantic network
semnet = SemanticNetwork(thresh=0.4, top_k=5)
G = semnet.fit_transform(embeddings)

end_time = time.time()
print(f"Processing time: {end_time - start_time:.2f} seconds")

>> Processing time: 24.26 seconds

We’re not only able to process all the embeddings without crashing our kernel, but it’s done in under 30 seconds.

Why use graph structures?

By opening up the NetworkX API to embedded documents, Semnet provides a new suite of tools and metrics for workflows in domains such as NLP, RAG, search and context engineering.

For most use cases, Semnet will work best as a complement to traditional spatial workflows, rather than as a replacement. Its power lies in encoding information about relationships between data points, which can be used as features in downstream tasks.

Approaches will vary depending on your use case, but benefits include:

Accessing network structures like paths, local neighbourhoods and subgraphs
Using centrality measures and communities that capture relationships and structure
Making some very pretty visualisations