feat: Cloud storage compatibility with build, save and from_filespace vectorstore operations #170

Open
frayle-ons wants to merge 1 commit into main from 169-build-from-cloud

@frayle-ons
Contributor

✨ Summary

These changes introduce the use of the fsspec library to enable:

  • creating a VectorStore directly from a CSV file that lives in cloud storage,
  • saving created vectorstores (including vectors and metadata) to cloud storage if the user specifies a cloud URI for the output_dir,
  • loading a vectorstore from cloud storage if a cloud URI is passed as the folder_path argument of from_filespace.

This has been validated with Google Cloud Storage URIs (gs://bucket-name/foldername/filename.csv) but should also extend to other cloud storage supported by the fsspec library, such as AWS storage, which uses the s3 protocol.

Finally, the existing behaviour on local filespace is unchanged. Users can also mix and match protocols: read a CSV from cloud storage and save to a local directory, read from local and save to cloud, or work fully in the cloud or fully locally. When no output directory is specified, the system attempts to save to a local folder with the same name as the input file.
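As a rough illustration of the mix-and-match idea (not the code in this PR), the sketch below copies a CSV between a local temporary directory and fsspec's built-in in-memory filesystem. The `memory://` protocol is only a stand-in for `gs://` or `s3://` so the example runs without credentials, and `copy_file` and the `demo-bucket` name are hypothetical.

```python
import os
import tempfile

import fsspec


def copy_file(src_url: str, dst_url: str) -> None:
    """Copy one file between any two fsspec-supported locations (hypothetical helper)."""
    with fsspec.open(src_url, "rb") as src, fsspec.open(dst_url, "wb") as dst:
        dst.write(src.read())


with tempfile.TemporaryDirectory() as tmp_dir:
    # create a small local CSV to act as the input file
    local_csv = os.path.join(tmp_dir, "testdata.csv")
    with open(local_csv, "w", encoding="utf-8") as f:
        f.write("id,text\n1,hello\n")

    # local in -> "cloud" out (memory:// stands in for gs:// or s3://)
    copy_file(local_csv, "memory://demo-bucket/testdata.csv")

    # "cloud" in -> local out
    round_trip_csv = os.path.join(tmp_dir, "testdata_copy.csv")
    copy_file("memory://demo-bucket/testdata.csv", round_trip_csv)

    with open(round_trip_csv, encoding="utf-8") as f:
        round_trip_text = f.read()
```

The same `copy_file` call should work with a real `gs://` URI once gcsfs is installed and authenticated, since fsspec picks the backend from the URL prefix.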

The primary fsspec function used in these changes is fsspec.core.url_to_fs(path), where path is a file or folder path. It inspects the path, detects which protocol to use (local, gs, s3, etc.) and sets up a filesystem object for interacting with that storage. It also returns a path string identifying the file or directory within that filesystem to operate on.

Example extract from the from_filespace() class method:

in_fs, in_path = fsspec.core.url_to_fs(folder_path)  # detects the filesystem, returning a filesystem object and a path

# `in_fs` provides the API connection to Google Cloud (or another backend), through which file operations can be made
# `in_path` is the specific path to look at through that connection, e.g. `classifai-bucket/testdata.csv`

# check that the folder exists in the filesystem, using the in_fs filesystem object on in_path
if not in_fs.isdir(in_path):
    raise DataValidationError(
        "folder_path must be an existing directory.", context={"folder_path": folder_path}
    )
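To see what `url_to_fs` returns without needing a bucket, here is a minimal standalone sketch. It assumes fsspec's in-memory filesystem (`memory://`) as a stand-in for a cloud protocol; `describe_location` and the `demo-bucket` paths are hypothetical names used only for illustration.

```python
import tempfile

from fsspec.core import url_to_fs


def describe_location(url: str) -> tuple[str, str, bool]:
    """Return the filesystem class name, the stripped path, and whether it is a directory."""
    fs, path = url_to_fs(url)
    return type(fs).__name__, path, fs.isdir(path)


# memory:// stands in for gs:// or s3://; create a directory there first
mem_fs, mem_path = url_to_fs("memory://demo-bucket/your_vectorstore_vdb")
mem_fs.makedirs(mem_path, exist_ok=True)
memory_result = describe_location("memory://demo-bucket/your_vectorstore_vdb")

# a plain local path resolves to the local filesystem
with tempfile.TemporaryDirectory() as tmp_dir:
    local_result = describe_location(tmp_dir)

# a path that was never created is not a directory
missing_result = describe_location("memory://demo-bucket/does-not-exist")

print(memory_result[0], local_result[0], missing_result[2])
```

The `isdir` check behaves the same way on both filesystems, which is what lets one validation branch cover local and remote folders.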

📜 Changes Introduced

  • Added the fsspec library to the toml file, with the corresponding uv lock changes.
  • Modified the vectorstore init method to use fsspec for filespace protocol detection and file operations.
  • Modified the vectorstore from_filespace method to support loading from cloud.
  • Added more explicit error handling around fsspec issues, for example informative error messages when the user specifies a protocol that doesn't exist in fsspec or needs to install an extra dependency.
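A minimal sketch of how such error handling could look, assuming fsspec's usual behaviour of raising `ValueError` for an unknown protocol and `ImportError` when a known protocol's backend package is missing. `StorageProtocolError` and `resolve_filesystem` are hypothetical names, not the exception classes or functions used in this PR.

```python
from fsspec.core import url_to_fs


class StorageProtocolError(Exception):
    """Hypothetical stand-in for the project's own error type."""


def resolve_filesystem(url: str):
    """Resolve a URL to (filesystem, path), translating fsspec failures into one error type."""
    try:
        return url_to_fs(url)
    except ImportError as exc:
        # known protocol, but its backend package (e.g. gcsfs for gs://) is not installed
        raise StorageProtocolError(
            f"An extra dependency is needed to access {url!r}: {exc}"
        ) from exc
    except ValueError as exc:
        # protocol prefix is not registered with fsspec at all
        raise StorageProtocolError(
            f"Unrecognised storage protocol in {url!r}: {exc}"
        ) from exc


try:
    resolve_filesystem("zl23q://your-test-bucket/testdata.csv")
except StorageProtocolError as err:
    print(err)
```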

✅ Checklist

All precommit checks passed.

🔍 How to Test

Using the following generic code that sets up a VectorStore, the tester can trial the different ways of loading and saving vector stores. To use Google buckets the user will need to install an extra dependency, gcsfs; if it's not installed, the resulting failure is currently captured and presented by ClassifaiErrors.

from classifai.vectorisers import HuggingFaceVectoriser
from classifai.indexers import VectorStore

my_vectoriser = HuggingFaceVectoriser(model_name="sentence-transformers/all-MiniLM-L6-v2")

my_vector_store = VectorStore(
    file_name="./DEMO/data/testdata.csv",  # local input
    # file_name="gs://your-test-bucket/testdata.csv",  # cloud input (swap with the line above)
    data_type="csv",
    vectoriser=my_vectoriser,
    overwrite=False,
    # uncomment one of the following to choose where the vectorstore is saved
    # output_dir="./your_vectorstore_vdb/",
    # output_dir="gs://your-test-bucket/your_vectorstore_vdb/",
)

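# VectorStoreSearchInput also needs importing from classifai (import path not shown in this snippet)
input_data = VectorStoreSearchInput(
    {"id": ["1", "2"], "query": ["What is the colour of the sky?", "What language is spoken in Brazil?"]}
)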

results = my_vector_store.search(input_data, n_results=2)
print(results)

In the above code I've included the local path for test data that's available in this repo, as well as an example path that could be used to load the data from a bucket. To test the cloud path you would need to upload the data to a test bucket in a Google Cloud environment and authenticate with the correct project.

Additionally, I've shown how the output dir can be either local or remote, under the same principles as the input file. I would recommend trying several combinations: local in -> cloud out, cloud in -> cloud out, cloud in -> local out, etc.
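One way to keep track of the combinations is to enumerate them up front. The sketch below only builds the argument pairs from the example paths used in this description. It does not call VectorStore itself, and the dictionary layout is an illustrative assumption.

```python
from itertools import product

# example paths taken from the snippet above
inputs = {
    "local": "./DEMO/data/testdata.csv",
    "cloud": "gs://your-test-bucket/testdata.csv",
}
outputs = {
    "local": "./your_vectorstore_vdb/",
    "cloud": "gs://your-test-bucket/your_vectorstore_vdb/",
}

# every input/output pairing: local->local, local->cloud, cloud->local, cloud->cloud
combinations = [
    {"label": f"{i} in -> {o} out", "file_name": inputs[i], "output_dir": outputs[o]}
    for i, o in product(inputs, outputs)
]

for combo in combinations:
    print(combo["label"], combo["file_name"], combo["output_dir"])
```

Each entry's `file_name` and `output_dir` can then be passed to the VectorStore constructor in turn.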

Also try to test some error/edge cases:

  • omit the output directory: the system should attempt to save to a local folder with the same name as the input file
  • try input files and output dirs that don't exist
  • try a protocol other than gs that does not exist, zl23q for example

It would also be good to test the from_filespace method by loading a vectorstore that was saved to cloud. Code snippet:

my_vector_store = VectorStore.from_filespace(
    folder_path="gs://your-test-bucket/your_vectorstore_vdb/",
    vectoriser=my_vectoriser,
)

@frayle-ons frayle-ons requested a review from a team as a code owner May 14, 2026 14:11
@frayle-ons frayle-ons linked an issue May 14, 2026 that may be closed by this pull request
@github-actions github-actions Bot added the enhancement New feature or request label May 14, 2026

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cloud storage uris cause DataValidationError on VectorStore init

1 participant