Semsi is about associating your (or any) files with each other through language, about non-linear file handling, about creative group research that avoids hierarchies, and about a dropbox that doesn't need to be curated yet still lets you pull meaningful parts from it.
"You shall know a word by the company it keeps" (Firth, J. R. 1957:11)
Semsi explores the semantic proximity of tagged artefacts so that files can be sorted and browsed by "the company they keep". The original experiment lived in Colab notebooks that downloaded GloVe vectors (more on GloVe below) and manually wrangled the resulting similarity matrix. This repository now exposes a lightweight Python module and command-line interface that reproduce the workflow in a more reliable and reusable way, without external dependencies.
- Parse the contents.txt file describing your artefacts and their tags.
- Convert the tags into TF-IDF embeddings (implemented with the standard library so it runs anywhere).
- Build a cosine similarity matrix and explore the most related files.
The logic behind these steps lives in the semsi package (semsi/data.py,
semsi/embedding.py, semsi/similarity.py). You can reuse the components in
scripts or notebooks without having to redo the original notebook plumbing.
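To make the TF-IDF and cosine steps concrete, here is a minimal standard-library sketch. It is not the code in semsi/embedding.py or semsi/similarity.py: the tag data is invented, the helper names (tfidf_vectors, cosine, similarity_matrix, top_similar) are hypothetical, and the smoothed IDF formula is one common variant that may differ from the weighting semsi actually uses.

```python
import math
from collections import Counter

# Hypothetical artefacts and tags, invented purely for illustration.
tagged = {
    "sketch.png": ["drawing", "paper", "city"],
    "notes.txt": ["writing", "paper", "city"],
    "field.wav": ["sound", "forest"],
}


def tfidf_vectors(docs):
    """Return (vocabulary, {identifier: vector}) using a smoothed TF-IDF."""
    vocab = sorted({tag for tags in docs.values() for tag in tags})
    n_docs = len(docs)
    # Document frequency: in how many artefacts each tag occurs.
    doc_freq = Counter(tag for tags in docs.values() for tag in set(tags))
    # Assumed smoothing: log((1 + N) / (1 + df)) + 1, so no weight is zero.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = {}
    for ident, tags in docs.items():
        counts = Counter(tags)
        total = len(tags) or 1  # guard against artefacts without tags
        vectors[ident] = [counts[t] / total * idf[t] for t in vocab]
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Nested dict holding the cosine similarity for every pair of artefacts."""
    ids = list(vectors)
    return {a: {b: cosine(vectors[a], vectors[b]) for b in ids} for a in ids}


def top_similar(matrix, target, top_n=5):
    """The top_n most similar artefacts to target, excluding target itself."""
    others = [(other, s) for other, s in matrix[target].items() if other != target]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:top_n]


vocab, vectors = tfidf_vectors(tagged)
matrix = similarity_matrix(vectors)
for label, score in top_similar(matrix, "sketch.png", top_n=2):
    print(label, round(score, 3))
```

Artefacts that share tags (here sketch.png and notes.txt) end up close together, while artefacts with no tags in common score zero.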
Run the CLI directly with python -m from the repository root:
python -m semsi.cli example_data/contents.txt --preview 4

- --output controls where the similarity matrix is stored (CSV, pickle or JSON).
- --top prints the N closest files to the --target identifier directly in the terminal.
- --list emits the parsed identifiers without computing a matrix. This is handy for spotting typos in the metadata file.
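For readers who want to see how flags like these typically fit together, the following is a hypothetical argparse sketch. It is not the source of semsi/cli.py: the argument types, defaults, help texts, and the rule that --top requires --target are all assumptions made for illustration, and only the flag names come from the description above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flag names (not semsi's own)."""
    parser = argparse.ArgumentParser(prog="semsi-sketch")
    parser.add_argument("contents", help="path to the contents.txt metadata file")
    # Assumption: --preview takes an integer; its exact meaning is not specified here.
    parser.add_argument("--preview", type=int, default=None)
    parser.add_argument("--output", default=None, help="where to store the matrix")
    parser.add_argument("--top", type=int, default=None, help="number of neighbours")
    parser.add_argument("--target", default=None, help="identifier to compare against")
    parser.add_argument("--list", action="store_true", dest="list_only")
    return parser


def plan(args):
    """Describe what a run with these arguments would do (illustrative only)."""
    if args.list_only:
        return "list identifiers only"
    if args.top is not None:
        if args.target is None:
            # Assumed rule: a neighbour query needs an identifier to start from.
            raise SystemExit("--top needs --target")
        return f"print the {args.top} closest files to {args.target}"
    if args.output:
        return f"write the similarity matrix to {args.output}"
    return "build the similarity matrix"


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(
    ["example_data/contents.txt", "--top", "3", "--target", "sketch.png"]
)
print(plan(args))
```

Run python -m semsi.cli --help to see the options the real CLI actually accepts.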
from semsi import parse_contents_file, TagEmbeddingModel, build_similarity_matrix, get_top_similar
# Parse the metadata file
documents = parse_contents_file("example_data/contents.txt")
# Fit a TF-IDF model on the tags and construct the similarity matrix
model = TagEmbeddingModel()
embeddings = model.fit_transform(documents)
similarity = build_similarity_matrix(documents, embeddings)
# Inspect the nearest neighbours for a particular document
top = get_top_similar(similarity, documents[0].identifier, top_n=5)
for label, score in top:
    print(label, score)

Run the unit tests with pytest:

pytest

A ready-to-run notebook lives under notebooks/Semsi_Workflow.ipynb. It uses the
package API shipped in this repository and stays in sync with the CLI logic.
Install the optional dependencies and open it in Jupyter Lab or VS Code:
pip install -e .[notebook]

Prefer a point-and-click interface? Launch the Streamlit app and explore the similarity matrix without touching the command line:
pip install -e .[ui]
streamlit run semsi/ui_app.py

The app lets you load a contents.txt file (or the bundled example), preview the
parsed documents, inspect top matches per identifier, and download the matrix as
CSV or JSON.
GloVe, coined from Global Vectors, is a model for distributed word representation: an unsupervised learning algorithm for obtaining vector representations of words. It maps words into a space where the distance between two words is related to their semantic similarity. Relationships between concepts show up as quasi-linear offsets between word vectors.
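The quasi-linear structure can be illustrated with a toy example. The three-dimensional vectors below are hand-made for this sketch and are not real GloVe values (pretrained GloVe vectors have many more dimensions), and the analogy helper is a hypothetical name. The point is only to show that an offset such as king - man + woman can land near queen when the space is organised this way.

```python
import math

# Toy vectors, invented for illustration only (NOT real GloVe values).
# The axes loosely stand for "royalty", "gender" and "fruit".
toy_vectors = {
    "king": [0.9, 0.7, 0.1],
    "queen": [0.9, -0.7, 0.1],
    "man": [0.1, 0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in {a, b, c})
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy(toy_vectors, "king", "man", "woman"))
```

With these toy values the nearest remaining word is queen. Real embeddings are noisier, which is why the relationship is described as quasi-linear.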
The original Colab notebooks remain under semsi_jupyter/ for reference. They
have not been deleted, but the heavy lifting is now handled by the importable
package which can be exercised from notebooks without duplicating setup cells.
For day-to-day work prefer notebooks/Semsi_Workflow.ipynb or the Streamlit UI
outlined above.