Conversation
storage/google/cloud/storage/blob.py
Outdated
    def upload_parallel(
        self, path, content_type=None, client=None, predefined_acl=None
    ):
        """Upload this blob's contents parallel from the content of file in directory.
"Upload this blob's contents in parallel, from the contents of a file in the directory."
storage/google/cloud/storage/blob.py
Outdated
    ):
        """Upload this blob's contents parallel from the content of file in directory.

        The content type of the upload will be determined in order
"The type of the uploaded content will be determined in the order"
    thread = threading.Thread(
        target=self._upload_from_list,
        args=(files_list, total_files, content_type, client, predefined_acl),
    )
This doesn't seem right. Here you are creating multiple threads that essentially try to upload the same set of files. That is, _upload_from_list uses the same files_list over and over again. There should be distribution of individual file uploads among separate threads, not duplicating the uploading procedure.
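One way to distribute individual file uploads among separate threads, rather than handing every thread the whole list, is to partition the list into disjoint slices. This is only a sketch of the idea, not the PR's actual code; `upload_one` is a hypothetical stand-in for the per-file upload call (e.g. something like `blob.upload_from_filename`):

```python
import threading

def upload_many(files_list, num_threads, upload_one):
    """Partition files_list into disjoint slices, one per thread.

    upload_one is a placeholder for the per-file upload call.
    Each thread only iterates over its own slice, so no file is
    ever uploaded twice and no shared counter is needed.
    """
    def worker(slice_):
        for path in slice_:
            upload_one(path)

    # Round-robin partition: thread i gets files i, i+n, i+2n, ...
    threads = [
        threading.Thread(target=worker, args=(files_list[i::num_threads],))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because the slices are computed up front and never overlap, the duplication problem disappears without any locking.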
I have made changes so that files_list and total_files are no longer passed to every thread separately; only files_list is passed to each thread as an argument. The self._files_list[self._file_count] line prevents duplicate uploads, because the count is incremented with every upload and each file is taken from the list at that particular index.
What about racing? The index is updated only after the upload. Therefore we cannot eliminate the possibility that, while a file is being uploaded by one thread, another thread starts uploading the same file. This can also lead to an out-of-range exception.
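The read-then-increment race on a shared index can be eliminated entirely by handing files out through a thread-safe queue, since each pop is atomic. A sketch of that approach, again using a hypothetical `upload_one` callable in place of the real upload:

```python
import queue
import threading

def upload_via_queue(files_list, num_threads, upload_one):
    """Workers atomically pop the next file from a Queue.

    No two threads can ever claim the same file, and there is no
    shared index that could run out of range.
    """
    q = queue.Queue()
    for path in files_list:
        q.put(path)

    def worker():
        while True:
            try:
                # get_nowait() is an atomic pop; raises queue.Empty when done
                path = q.get_nowait()
            except queue.Empty:
                return
            upload_one(path)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```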
    content_type,
    client,
    predefined_acl,
    )
Now there is a different kind of problem. If an upload fails, the counter is still increased. There should be some sort of retry mechanism to recover from such a state, or per-file index tracking, so that each file is handled individually.
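A minimal sketch of per-file tracking with retries, addressing the failure case above: a failed upload is re-queued (up to a bounded number of attempts) instead of silently advancing a counter. `upload_one` and `max_attempts` are hypothetical names, not part of the PR:

```python
import queue
import threading

def upload_with_retry(files_list, num_threads, upload_one, max_attempts=3):
    """Track each file individually; re-queue on failure.

    Returns the list of files that still failed after max_attempts.
    """
    q = queue.Queue()
    for path in files_list:
        q.put((path, 1))  # (file, attempt number)
    failed = []  # list.append is atomic under CPython's GIL

    def worker():
        while True:
            try:
                path, attempt = q.get_nowait()
            except queue.Empty:
                return
            try:
                upload_one(path)
            except Exception:
                if attempt < max_attempts:
                    q.put((path, attempt + 1))  # retry this file later
                else:
                    failed.append(path)  # give up after max_attempts

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failed
```

A thread that re-queues a file keeps looping until the queue is truly empty, so the retried file is always picked up even if the other workers have already exited.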
I think the Storage module doesn't implement a retry mechanism yet; that's why there is a separate task to implement retries in Storage.
Please refer to issue [7907].
Regardless of that, the possibility of racing should be eliminated.
Force-pushed from 213e374 to 42f1d9e
issue [4684]