API Reference
Semnet: Semantic Network Deduplication
A Python package for building semantic networks using embeddings and graph clustering to perform intelligent deduplication of text data.
- class semnet.SemanticNetwork(metric: Literal['angular', 'euclidean', 'manhattan', 'hamming', 'dot'] = 'angular', n_trees: int = 10, thresh: float = 0.3, top_k: int = 20, search_k: int | None = None, verbose: bool = False)[source]
Bases: object
A semantic network builder for creating graphs from document embeddings.
This class follows the scikit-learn pattern with fit() and transform() methods. Users must provide pre-computed embeddings during the fit process.
The fitting process builds an approximate nearest neighbor index from embeddings. The transformation process constructs a graph where edges represent semantic similarity.
- Key Methods:
fit(): Build the similarity index from provided embeddings
transform(): Construct and return a networkx object
fit_transform(): Combined fit and transform in one step
to_pandas(): Export graph structure to pandas DataFrames for analysis
- metric
Distance metric for the Annoy index
- n_trees
Number of trees for the Annoy index
- thresh
Similarity threshold for connecting documents
- top_k
Maximum neighbors to check per document
- verbose
Whether to show progress bars and detailed logging
- is_fitted_
Whether the model has been fitted
- embeddings_
Document embeddings array (available after fitting)
- index_
Annoy index for similarity search (available after fitting)
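The fit-then-transform flow described above can be illustrated with a small pure-Python sketch. It is not the library's implementation: an exhaustive cosine-similarity search stands in for the Annoy index, the helper names (cosine_similarity, build_similarity_edges) are hypothetical, and treating the threshold as a strict "greater than" comparison is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_similarity_edges(embeddings, thresh=0.3, top_k=20):
    """Hypothetical brute-force stand-in for the index lookup.

    For each document, look at its top_k most similar neighbours and keep
    an undirected edge for every neighbour whose similarity exceeds thresh.
    """
    edges = {}
    n_docs = len(embeddings)
    for i in range(n_docs):
        scored = [
            (cosine_similarity(embeddings[i], embeddings[j]), j)
            for j in range(n_docs)
            if j != i
        ]
        scored.sort(reverse=True)  # most similar neighbours first
        for sim, j in scored[:top_k]:
            if sim > thresh:  # assumption: strict comparison
                edges[(min(i, j), max(i, j))] = sim  # edge weight = similarity
    return edges

# Three toy 2-d "embeddings": the first two point in nearly the same direction.
embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
]
edges = build_similarity_edges(embeddings, thresh=0.8, top_k=2)
```

With these toy vectors only documents 0 and 1 end up connected. An approximate index trades this exhaustive search for speed, so its neighbour lists may differ slightly from the brute-force result.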
- fit(embeddings: numpy.ndarray) SemanticNetwork[source]
Build the index from document embeddings.
This method uses provided embeddings to create an Annoy index for fast nearest neighbor search.
- Parameters:
embeddings – Pre-computed embeddings array with shape (n_docs, embedding_dim).
labels – Optional list of text labels/documents for the embeddings. If not provided, string indices are used as labels.
node_data – Optional dictionary of additional data to attach to nodes. Format: {node_index: {attribute_name: value, …}, …} or {node_index: single_value, …}; a single value is stored as {'value': single_value}. Only nodes present in the dictionary receive additional attributes.
- Returns:
Returns the fitted estimator
- Return type:
self
- Raises:
ValueError – If labels provided but length doesn’t match embeddings
ValueError – If ids provided but length doesn’t match embeddings
ValueError – If node_data values don’t match embeddings length
- transform(thresh: float | None = None, top_k: int | None = None, search_k: int | None = None, labels: List[str] | None = None, node_data: Dict | None = None) networkx.Graph[source]
Build and return a weighted graph from the fitted embeddings.
- Parameters:
thresh – The similarity threshold for edge inclusion. If None, uses the threshold from initialization.
top_k – Optional max neighbors override for this transform. If None, uses the top_k from initialization.
search_k – Optional parameter for Annoy index search_k, controlling the number of nodes to inspect during search. If None, uses the search_k from initialization.
labels – Optional list of text labels/documents for the embeddings. If not provided, string indices are used as labels.
node_data – Optional dictionary of additional data to attach to nodes. Format: {node_index: {attribute_name: value, …}, …} or {node_index: single_value, …}; a single value is stored as {'value': single_value}. Only nodes present in the dictionary receive additional attributes.
- Returns:
NetworkX graph where nodes represent documents and edges represent similarities above the threshold.
- Raises:
ValueError – If the model hasn’t been fitted yet
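The two accepted node_data formats and the default labels can be pictured with the sketch below. The helpers normalize_node_data and default_labels are hypothetical names used only for illustration, and the exact form of the default labels (str(i) for each row index) is an assumption based on the phrase "string indices".

```python
def normalize_node_data(node_data):
    """Return node_data with every value expressed as an attribute dict.

    Dict values are copied as-is; any other value is wrapped as
    {'value': single_value}, mirroring the format described above.
    """
    normalized = {}
    for node_index, attrs in node_data.items():
        if isinstance(attrs, dict):
            normalized[node_index] = dict(attrs)
        else:
            normalized[node_index] = {"value": attrs}
    return normalized

def default_labels(n_docs):
    """Assumed default: one string index per document."""
    return [str(i) for i in range(n_docs)]

# Node 0 uses the attribute-dict form, node 2 the single-value form.
# Node 1 is absent, so it would receive no additional attributes.
node_data = {0: {"source": "news", "year": 2021}, 2: "blog"}
normalized = normalize_node_data(node_data)
labels = default_labels(3)
```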
- fit_transform(embeddings: numpy.ndarray, labels: List[str] | None = None, node_data: Dict | None = None, thresh: float | None = None, top_k: int | None = None) networkx.Graph[source]
Fit the model and transform the embeddings in one step.
- Parameters:
embeddings – Pre-computed embeddings array with shape (n_docs, embedding_dim).
labels – Optional list of text labels/documents for the embeddings.
node_data – Optional dictionary containing additional data to attach to nodes.
thresh – Optional similarity threshold override for this transform.
top_k – Optional max neighbors override for this transform.
- Returns:
NetworkX graph representing the semantic network
- semnet.to_pandas(graph: networkx.Graph) Tuple[pandas.DataFrame, pandas.DataFrame][source]
Export a NetworkX graph to pandas DataFrames.
This function operates on any NetworkX graph, making it useful for analyzing graphs from SemanticNetwork or any other NetworkX graph.
- Parameters:
graph – NetworkX graph to export
- Returns:
- A tuple containing:
nodes (pd.DataFrame): Node attributes with index as node ID. Columns include all node attributes from the graph.
edges (pd.DataFrame): Edge list with columns ‘source’, ‘target’, and any edge attributes (e.g., ‘weight’).
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Examples
>>> # Export any NetworkX graph
>>> import networkx as nx
>>> from semnet import to_pandas
>>>
>>> # Create or load any graph
>>> G = nx.karate_club_graph()
>>> nodes, edges = to_pandas(G)
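>>> # Use with SemanticNetwork
>>> # (embeddings and docs are assumed to be defined already)
>>> from semnet import SemanticNetwork
>>> network = SemanticNetwork(thresh=0.8)
>>> graph = network.fit_transform(embeddings, labels=docs)
>>> nodes, edges = to_pandas(graph)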
>>> # Export a subgraph
>>> subgraph = graph.subgraph([0, 1, 2])
>>> sub_nodes, sub_edges = to_pandas(subgraph)
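One possible way to turn an exported edge list into duplicate groups is to treat each connected component as a set of near-duplicates. The sketch below is an assumption about downstream usage, not part of the semnet API: duplicate_groups is a hypothetical helper, and it works on plain (source, target) pairs so it runs without pandas or networkx.

```python
def duplicate_groups(edge_rows):
    """Group nodes into connected components with a small union-find.

    edge_rows is an iterable of (source, target) pairs, for example the
    'source' and 'target' columns of an edge list zipped together.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target in edge_rows:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_t] = root_s  # merge the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())

# Documents 0, 1, 2 are chained together; 5 and 6 form a second group.
edge_rows = [(0, 1), (1, 2), (5, 6)]
groups = duplicate_groups(edge_rows)
```

Documents that have no edges never appear in the edge list, so they are absent from the groups and can be treated as unique.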