feat: Cloud storage compatibility with build, save and from_filespace vectorstore operations #170
Open
frayle-ons wants to merge 1 commit into
✨ Summary
These changes introduce the use of the fsspec library to enable cloud storage locations to be used with the vectorstore build, save and `from_filespace` operations, including the `output_dir` argument.

This has been validated to work with Google Cloud Storage URIs (`gs://bucket-name/foldername/filename.csv`) but should also extend to other cloud storage supported by the fsspec library, such as AWS storage, which uses the `s3` protocol.

Finally, the existing functionality of operating on the local filespace is unchanged. Users can also mix and match protocol types: reading a CSV from cloud and saving to a local directory, reading from local and saving to cloud, or working fully in the cloud or fully locally. When no output directory is specified, the system attempts to save to a local folder with the same name as the input file.
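As a rough illustration of the mix-and-match behaviour described above, the sketch below reads a file through one protocol and writes it through another using fsspec. It is not code from this PR: `copy_between` is a hypothetical helper, and the in-memory `memory://` filesystem stands in for a `gs://` or `s3://` bucket so the example runs without cloud credentials.

```python
import posixpath
import tempfile
from pathlib import Path

import fsspec
from fsspec.core import url_to_fs


def copy_between(src: str, dst: str) -> None:
    """Copy one file from src to dst, where each may use a different protocol.

    Hypothetical helper for illustration only; not part of this PR.
    """
    # url_to_fs returns a filesystem object and the path within that filesystem.
    src_fs, src_path = url_to_fs(src)
    dst_fs, dst_path = url_to_fs(dst)
    # Create the destination folder if it does not exist yet.
    dst_fs.makedirs(posixpath.dirname(dst_path), exist_ok=True)
    with src_fs.open(src_path, "rb") as fin, dst_fs.open(dst_path, "wb") as fout:
        fout.write(fin.read())


# Assumption: memory:// stands in for a cloud bucket such as gs://bucket-name/...
source_uri = "memory://bucket-name/foldername/filename.csv"
with fsspec.open(source_uri, "wb") as f:
    f.write(b"id,text\n1,hello\n")

# "Cloud" in -> local out: the destination is an ordinary local path.
local_target = Path(tempfile.mkdtemp()) / "filename" / "filename.csv"
copy_between(source_uri, str(local_target))
copied = local_target.read_bytes()
```

Swapping either argument for a `gs://` or `s3://` URI should follow the same pattern, provided the matching fsspec backend is installed and credentials are configured.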
The primary use of fsspec in these changes is the function `fsspec.core.url_to_fs(file/folder/path)`. This function analyses the specified path, detects which protocol to use (local, gs, s3, etc.) and sets up a filesystem object to interact with it. It also returns the path within that filesystem that should be operated on.

Example extract from the `from_filespace()` class method:

📜 Changes Introduced
- `vectorstore` init method to work with fsspec to handle filespace protocol detection and operations
- `vectorstore` `from_filespace` method to work with loading from cloud

✅ Checklist
All precommit checks passed.
🔍 How to Test
Using the following generic code that sets up a VectorStore, the tester can trial the different ways of loading and saving vector stores. The user will need to install an extra dependency, `gcsfs`; if it is not installed before attempting to use Google buckets, this is currently captured and presented by ClassifaiErrors.

In the above code I've included the local path for test data that's available in this repo, as well as an example path that could be used to load the data from a bucket. To test this you would need to upload the data to a test bucket in a GCloud environment and authenticate with the correct project.

Additionally, I've shown how the output dir can be either local or remote under the same principles as the input file. I would recommend trying several combinations: local in -> cloud out, cloud in -> cloud out, cloud in -> local out, etc.
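Before working through those combinations, it may help to confirm that fsspec can resolve the protocol of each path in the current environment. The sketch below is a minimal pre-flight check under stated assumptions: `backend_status` is a hypothetical helper (not part of this PR or of classifai), and the local paths and output folder names are placeholders. It relies on `fsspec.get_filesystem_class`, which raises `ImportError` when a known protocol's backend (for example `gcsfs` for `gs://`) is not installed and `ValueError` when the protocol is not recognised.

```python
from itertools import product

import fsspec
from fsspec.core import split_protocol


def backend_status(path: str) -> str:
    """Report whether fsspec can provide a filesystem for this path's protocol.

    Hypothetical helper for illustration only.
    """
    protocol, _ = split_protocol(path)
    protocol = protocol or "file"  # paths without "scheme://" are treated as local
    try:
        fsspec.get_filesystem_class(protocol)
    except ImportError:
        return "missing dependency"  # e.g. gs:// without gcsfs installed
    except ValueError:
        return "unknown protocol"
    return "ok"


# Placeholder paths (assumptions): substitute the repo's test data and your own bucket.
inputs = ["local_data/filename.csv", "gs://bucket-name/foldername/filename.csv"]
outputs = ["local_output", "gs://bucket-name/output_folder"]

# Enumerate every input/output pairing: local->local, local->cloud, and so on.
for src, dst in product(inputs, outputs):
    print(f"{src} [{backend_status(src)}] -> {dst} [{backend_status(dst)}]")

# A made-up protocol, to see how an unrecognised scheme is reported.
print("zl23q://bucket-name/filename.csv", backend_status("zl23q://bucket-name/filename.csv"))
```

This only checks that a backend can be imported; it does not verify authentication or that the bucket exists, so the actual load and save runs are still needed.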
And also try to test some error/edge cases:
`zl23q`, for example

Also, it would be good to test the `from_filespace` method to load in a vectorstore that was saved to cloud. Code snippet: