API Reference

Semnet: Semantic Network Deduplication

A Python package for building semantic networks using embeddings and graph clustering to perform intelligent deduplication of text data.

class semnet.SemanticNetwork(metric: Literal['angular', 'euclidean', 'manhattan', 'hamming', 'dot'] = 'angular', n_trees: int = 10, thresh: float = 0.3, top_k: int = 20, search_k: int | None = None, verbose: bool = False)

Bases: object

A semantic network builder for creating graphs from document embeddings.

This class follows the scikit-learn pattern with fit() and transform() methods. Users must provide pre-computed embeddings during the fit process.

The fitting process builds an approximate nearest neighbor index from embeddings. The transformation process constructs a graph where edges represent semantic similarity.
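The two stages described above can be illustrated with a minimal pure-Python sketch. It assumes cosine similarity as the similarity measure and replaces the approximate nearest neighbor index with a brute-force comparison of all pairs, so it is a stand-in for the idea and not the package's implementation. The helper names `cosine_similarity` and `build_similarity_edges` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_similarity_edges(embeddings, thresh=0.3):
    """Return (i, j, similarity) for every pair with similarity above thresh.

    Brute force over all pairs: an assumption made for illustration,
    standing in for the approximate nearest neighbor lookup.
    """
    edges = []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        if sim > thresh:
            edges.append((i, j, sim))
    return edges


embeddings = [
    [1.0, 0.0],  # doc 0
    [0.9, 0.1],  # doc 1, nearly parallel to doc 0
    [0.0, 1.0],  # doc 2, orthogonal to doc 0
]
edges = build_similarity_edges(embeddings, thresh=0.8)
```

With a threshold of 0.8, only documents 0 and 1 are connected. An index-based approach avoids the quadratic pair comparison by checking only a limited number of candidate neighbors per document.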

Key Methods:

  • fit(): Build the similarity index from provided embeddings

  • transform(): Construct and return a networkx object

  • fit_transform(): Combined fit and transform in one step

  • to_pandas(): Export graph structure to pandas DataFrames for analysis

metric

Distance metric for the Annoy index

n_trees

Number of trees for the Annoy index

thresh

Similarity threshold for connecting documents

top_k

Maximum neighbors to check per document

verbose

Whether to show progress bars and detailed logging

is_fitted_

Whether the model has been fitted

embeddings_

Document embeddings array (available after fitting)

index_

Annoy index for similarity search (available after fitting)
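The default metric, 'angular', is worth a short illustration. In Annoy, angular distance is the Euclidean distance between the normalized vectors, which equals sqrt(2 * (1 - cos)). The sketch below computes that distance and inverts it to recover cosine similarity. How this package converts index distances into the similarity compared against thresh is not stated here, so the conversion is shown as an assumption; both helper names are hypothetical.

```python
import math


def angular_distance(u, v):
    """Annoy-style angular distance: Euclidean distance between unit vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    unit_u = [a / norm_u for a in u]
    unit_v = [b / norm_v for b in v]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(unit_u, unit_v)))


def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


d_same = angular_distance([1.0, 0.0], [2.0, 0.0])  # parallel vectors
d_orth = angular_distance([1.0, 0.0], [0.0, 3.0])  # orthogonal vectors
cos_same = angular_to_cosine(d_same)
cos_orth = angular_to_cosine(d_orth)
```

Parallel vectors have distance 0 and orthogonal vectors have distance sqrt(2), so the recovered similarities are 1 and 0 respectively.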

fit(embeddings: numpy.ndarray) → SemanticNetwork

Build the index from document embeddings.

This method uses provided embeddings to create an Annoy index for fast nearest neighbor search.

Parameters:
  • embeddings – Pre-computed embeddings array with shape (n_docs, embedding_dim).

Note: the signature above accepts only embeddings. labels and node_data are parameters of transform() and fit_transform(); see those methods for their formats.

Returns:

The fitted estimator (self)

Return type:

SemanticNetwork

Raises:
  • ValueError – If labels are provided but their length doesn’t match the embeddings

  • ValueError – If ids are provided but their length doesn’t match the embeddings

  • ValueError – If the node_data values don’t match the length of the embeddings

transform(thresh: float | None = None, top_k: int | None = None, search_k: int | None = None, labels: List[str] | None = None, node_data: Dict | None = None) → networkx.Graph

Build and return a weighted graph from the fitted embeddings.

Parameters:
  • thresh – The similarity threshold for edge inclusion. If None, uses the threshold from initialization.

  • top_k – Optional max neighbors override for this transform. If None, uses the top_k from initialization.

  • search_k – Optional parameter for Annoy index search_k, controlling the number of nodes to inspect during search. If None, uses the search_k from initialization.

  • labels – Optional list of text labels/documents for the embeddings. If not provided, will use string indices as labels.

  • node_data – Optional dictionary containing additional data to attach to nodes. Format: {node_index: {attribute_name: value, …}, …} or {node_index: single_value, …}; a single value is stored as {'value': single_value}. Only nodes present in the dictionary get additional attributes.

Returns:

NetworkX graph where nodes represent documents and edges represent similarities above the threshold.

Raises:

ValueError – If the model hasn’t been fitted yet
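The two accepted node_data formats can be illustrated with a small sketch. `normalize_node_data` is a hypothetical helper, not part of the package; it only mirrors the documented rule that a single value is stored as {'value': single_value} and that nodes missing from the dictionary get no extra attributes.

```python
def normalize_node_data(node_data):
    """Map each node index to an attribute dict.

    Hypothetical helper: dict values are copied through unchanged, and any
    other value is wrapped as {'value': value}, mirroring the documented format.
    """
    normalized = {}
    for node_index, value in node_data.items():
        if isinstance(value, dict):
            normalized[node_index] = dict(value)
        else:
            normalized[node_index] = {"value": value}
    return normalized


node_data = {
    0: {"source": "news", "year": 2021},  # full attribute dictionary
    2: "forum",                           # single value
}
attrs = normalize_node_data(node_data)
```

Node 1 is absent from the input, so it receives no additional attributes in this sketch.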

fit_transform(embeddings: numpy.ndarray, labels: List[str] | None = None, node_data: Dict | None = None, thresh: float | None = None, top_k: int | None = None) → networkx.Graph

Fit the model and transform the embeddings in one step.

Parameters:
  • embeddings – Pre-computed embeddings array with shape (n_docs, embedding_dim).

  • labels – Optional list of text labels/documents for the embeddings.

  • node_data – Optional dictionary containing additional data to attach to nodes.

  • thresh – Optional similarity threshold override for this transform.

  • top_k – Optional max neighbors override for this transform.

Returns:

NetworkX graph representing the semantic network
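Once the graph is built, one simple way to turn it into duplicate groups is to take its connected components. The sketch below does this with union-find over a plain edge list. This is an assumption about how the graph might be used for deduplication, not a description of the package's own clustering; `duplicate_groups` is a hypothetical name.

```python
def duplicate_groups(n_docs, edges):
    """Group document indices into connected components with union-find."""
    parent = list(range(n_docs))

    def find(i):
        # Walk to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups = {}
    for doc in range(n_docs):
        groups.setdefault(find(doc), []).append(doc)
    # Keep only components that contain more than one document.
    return sorted(g for g in groups.values() if len(g) > 1)


edges = [(0, 1), (1, 3), (4, 5)]
groups = duplicate_groups(6, edges)
```

Documents 0, 1, and 3 form one group and documents 4 and 5 another, while document 2 stays on its own. On a networkx.Graph, networkx.connected_components yields the same kind of grouping.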

semnet.to_pandas(graph: networkx.Graph) → Tuple[pandas.DataFrame, pandas.DataFrame]

Export a NetworkX graph to pandas DataFrames.

This function operates on any NetworkX graph, making it useful for analyzing graphs from SemanticNetwork or any other NetworkX graph.

Parameters:

graph – NetworkX graph to export

Returns:

A tuple containing:
  • nodes (pd.DataFrame): Node attributes with index as node ID. Columns include all node attributes from the graph.

  • edges (pd.DataFrame): Edge list with columns 'source', 'target', and any edge attributes (e.g., 'weight').

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Examples

>>> # Export any NetworkX graph
>>> import networkx as nx
>>> from semnet import SemanticNetwork, to_pandas
>>>
>>> # Create or load any graph
>>> G = nx.karate_club_graph()
>>> nodes, edges = to_pandas(G)
>>>
>>> # Use with SemanticNetwork; `embeddings` is a pre-computed array of shape
>>> # (n_docs, embedding_dim) and `docs` is the matching list of labels
>>> network = SemanticNetwork(thresh=0.8)
>>> graph = network.fit_transform(embeddings, labels=docs)
>>> nodes, edges = to_pandas(graph)
>>>
>>> # Export a subgraph
>>> subgraph = graph.subgraph([0, 1, 2])
>>> sub_nodes, sub_edges = to_pandas(subgraph)