Semsi

Semsi is about associating your files (or any files) with each other through language, about non-linear file handling, about creative group research and avoiding hierarchies, and about a dropbox that doesn't need to be curated yet still lets you pull meaningful parts from it.

"You shall know a word by the company it keeps" (Firth, J. R. 1957:11)

Intro

Semsi explores the semantic proximity of tagged artefacts so that files can be sorted and browsed by "the company they keep". The original experiment lived in Colab notebooks that downloaded GloVe vectors (see the GloVe section further down) and manually wrangled the resulting similarity matrix. This repository now exposes a lightweight Python module and command line interface that reproduce the workflow in a more reliable and reusable way, without external dependencies.

Workflow

Parse the contents.txt file describing your artefacts and their tags.
Convert the tags into TF-IDF embeddings (implemented with the standard library so it runs anywhere).
Build a cosine similarity matrix and explore the most related files.
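
The last two steps can be sketched with nothing but the standard library. This is a minimal illustration, not the code in semsi/embedding.py or semsi/similarity.py: the function names, the input shape (a dict mapping identifiers to tag lists) and the smoothed IDF formula are all assumptions, and the sample artefacts are made up.

```python
import math
from collections import Counter


def tfidf_vectors(tagged):
    """Return {identifier: {tag: weight}} for a {identifier: [tags]} mapping."""
    n_docs = len(tagged)
    # Document frequency: in how many artefacts does each tag appear?
    df = Counter(tag for tags in tagged.values() for tag in set(tags))
    vectors = {}
    for ident, tags in tagged.items():
        counts = Counter(tags)
        total = sum(counts.values())
        # Term frequency times a smoothed inverse document frequency
        # (an assumed variant; the package may weight tags differently).
        vectors[ident] = {
            tag: (count / total) * (math.log((1 + n_docs) / (1 + df[tag])) + 1.0)
            for tag, count in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {tag: weight} vectors."""
    dot = sum(weight * b.get(tag, 0.0) for tag, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Return a nested dict with the pairwise cosine similarity of all vectors."""
    ids = list(vectors)
    return {i: {j: cosine(vectors[i], vectors[j]) for j in ids} for i in ids}


# Hypothetical artefacts and tags, invented for illustration only.
tagged = {
    "sketch.png": ["drawing", "forest", "ink"],
    "notes.txt": ["forest", "walk", "ink"],
    "budget.csv": ["finance", "spreadsheet"],
}
matrix = similarity_matrix(tfidf_vectors(tagged))
```

Artefacts that share tags end up with a score above zero, and artefacts with no tags in common score exactly zero, which is what lets files be browsed by "the company they keep".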

The logic behind these steps lives in the semsi package (semsi/data.py, semsi/embedding.py, semsi/similarity.py). You can reuse the components in scripts or notebooks without having to redo the original notebook plumbing.

Quick start

Run the CLI directly with python -m from the repository root:

python -m semsi.cli example_data/contents.txt --preview 4

--output controls where the similarity matrix is stored (CSV, pickle or JSON).
--top prints the N closest files to the --target identifier directly in the terminal.
--list emits the parsed identifiers without computing a matrix. This is handy for spotting typos in the metadata file.
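
For readers who want to store a matrix from their own scripts, the three formats named for --output can all be written with the standard library. The sketch below is hypothetical: save_matrix is not part of the semsi package, the nested-dict matrix shape is an assumption, and choosing the format from the file suffix is a guess at a convenient convention rather than a description of how the CLI decides.

```python
import csv
import json
import pickle
from pathlib import Path


def save_matrix(matrix, path):
    """Write a {row_id: {col_id: score}} matrix in the format implied by the suffix.

    Hypothetical helper for illustration; not the semsi CLI implementation.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        path.write_text(json.dumps(matrix, indent=2), encoding="utf-8")
    elif suffix in (".pkl", ".pickle"):
        with path.open("wb") as handle:
            pickle.dump(matrix, handle)
    elif suffix == ".csv":
        ids = list(matrix)
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            # Header row: an empty corner cell followed by the column identifiers.
            writer.writerow([""] + ids)
            for row_id in ids:
                writer.writerow([row_id] + [matrix[row_id][col_id] for col_id in ids])
    else:
        raise ValueError(f"Unsupported output format: {suffix!r}")
```

CSV is the easiest format to open in a spreadsheet, JSON keeps the nested structure readable, and pickle round-trips Python objects exactly but should only be loaded from sources you trust.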

Python usage

from semsi import parse_contents_file, TagEmbeddingModel, build_similarity_matrix, get_top_similar

# Parse the metadata file
documents = parse_contents_file("example_data/contents.txt")

# Fit a TF-IDF model on the tags and construct the similarity matrix
model = TagEmbeddingModel()
embeddings = model.fit_transform(documents)
similarity = build_similarity_matrix(documents, embeddings)

# Inspect the nearest neighbours for a particular document
top = get_top_similar(similarity, documents[0].identifier, top_n=5)
for label, score in top:
    print(label, score)

Tests

Run the unit tests with pytest:

pytest

Notebook workflow

A ready-to-run notebook lives under notebooks/Semsi_Workflow.ipynb. It uses the package API shipped in this repository and stays in sync with the CLI logic. Install the optional dependencies and open it in Jupyter Lab or VS Code:

pip install -e .[notebook]

Browser UI

Prefer a point-and-click interface? Launch the Streamlit app and explore the similarity matrix without touching the command line:

pip install -e .[ui]
streamlit run semsi/ui_app.py

The app lets you load a contents.txt file (or the bundled example), preview the parsed documents, inspect top matches per identifier, and download the matrix as CSV or JSON.

GloVe

GloVe, short for Global Vectors, is an unsupervised learning algorithm for obtaining distributed vector representations of words. It maps words into a vector space in which the distance between words reflects their semantic similarity. Relationships between concepts show up as approximately linear offsets between word vectors: the offset from "man" to "woman", for example, roughly matches the offset from "king" to "queen".
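
To make the idea of distance in a vector space concrete, here is a small sketch that reads vectors in the plain-text layout used by the pretrained GloVe files (one word per line, followed by its space-separated components) and compares words by cosine similarity. The helper names are assumptions, and the three-dimensional vectors are invented toy values, not real GloVe data; pretrained vectors have many more dimensions.

```python
import math


def parse_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} mapping."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors, made up for illustration only.
toy_lines = [
    "forest 0.9 0.1 0.0",
    "tree 0.8 0.2 0.1",
    "invoice 0.0 0.1 0.9",
]
vectors = parse_glove_lines(toy_lines)

# Related words point in similar directions, unrelated words do not.
related = cosine(vectors["forest"], vectors["tree"])
unrelated = cosine(vectors["forest"], vectors["invoice"])
print(related > unrelated)
```

With real vectors the same comparison is done by passing an open file handle to parse_glove_lines instead of the toy list.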

Legacy notebooks

The original Colab notebooks remain under semsi_jupyter/ for reference. They have not been deleted, but the heavy lifting is now handled by the importable package which can be exercised from notebooks without duplicating setup cells. For day-to-day work prefer notebooks/Semsi_Workflow.ipynb or the Streamlit UI outlined above.
