
add parallel blob object upload/4684 #6

Draft

HemangChothani wants to merge 3 commits into master from feature/storage_parallel_operations_copying_objects

Conversation

@HemangChothani
Owner

issue [4684]

def upload_parallel(
self, path, content_type=None, client=None, predefined_acl=None
):
"""Upload this blob's contents parallel from the content of file in directory.

"Upload this blob's contents in parallel, from the contents of a file in the directory."

):
"""Upload this blob's contents parallel from the content of file in directory.

The content type of the upload will be determined in order

"The type of the uploaded content will be determined in the order"

thread = threading.Thread(
target=self._upload_from_list,
args=(files_list, total_files, content_type, client, predefined_acl),
)

This doesn't seem right. Here you are creating multiple threads that each try to upload the same set of files; that is, _upload_from_list reuses the same files_list over and over. Individual file uploads should be distributed among separate threads, rather than duplicating the whole uploading procedure in every thread.
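The distribution the reviewer describes could be sketched as follows. This is a hypothetical illustration, not the PR's actual code: `upload_one` is a stand-in for the real per-file upload call (e.g. Blob.upload_from_filename), and each thread receives a disjoint slice of the file list.

```python
import threading

uploaded = []                 # shared result list, guarded by a lock
_lock = threading.Lock()

def upload_one(path):
    # Stand-in for the real per-file upload call.
    with _lock:
        uploaded.append(path)

def upload_parallel(files_list, num_threads=4):
    # Give each thread a disjoint slice of the file list, so no file can be
    # uploaded twice and no shared counter needs synchronizing.
    def worker(chunk):
        for path in chunk:
            upload_one(path)

    threads = [
        threading.Thread(target=worker, args=(files_list[i::num_threads],))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because the slices are disjoint, the threads never contend over which file to upload next; only the result list needs a lock.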

Owner Author

I have made changes so that total_files is no longer passed to every thread; only files_list is passed to each thread as an argument. The line self._files_list[self._file_count] doesn't allow duplicate uploads, because the count is incremented with every upload and each file is taken from the list at that particular index.


What about racing? The index is updated only after the upload, so we cannot rule out that while one thread is uploading a file, another thread starts uploading the same file. This can also lead to an out-of-range exception.
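One common way to eliminate this race is to hand files out through a thread-safe queue instead of a shared index. The sketch below uses hypothetical names (it is not the PR's code); `queue.Queue.get_nowait()` pops each item atomically, so two threads can never receive the same file and there is no index to run out of range.

```python
import queue
import threading

def upload_all(files_list, upload_fn, num_threads=4):
    # Put every file on a thread-safe queue; get_nowait() hands each item
    # to exactly one thread, so no shared index and no race.
    q = queue.Queue()
    for path in files_list:
        q.put(path)

    done = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                path = q.get_nowait()   # atomic pop: no duplicate uploads
            except queue.Empty:
                return                  # queue drained, thread exits cleanly
            upload_fn(path)
            with lock:
                done.append(path)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

The queue replaces both the shared counter and the bounds check: a thread that finds the queue empty simply returns.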

content_type,
client,
predefined_acl,
)

Now there is a different kind of problem. If an upload fails, the counter is still incremented. There should be some sort of retry mechanism to recover from such a state, or per-file index tracking, so that each file is handled individually.
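The per-file tracking the reviewer suggests could look like this. This is a hedged sketch with hypothetical names, not the PR's implementation: failed uploads are recorded instead of silently skipped, so a caller can retry just those files.

```python
import queue
import threading

def upload_with_tracking(files_list, upload_fn, num_threads=4):
    # Hand out files via a queue and record failures per file, instead of
    # blindly advancing a shared counter when an upload raises.
    q = queue.Queue()
    for path in files_list:
        q.put(path)

    failed = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                path = q.get_nowait()
            except queue.Empty:
                return
            try:
                upload_fn(path)
            except Exception:
                with lock:
                    failed.append(path)   # kept for a later retry pass

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failed
```

Returning the failed list keeps the parallel uploader agnostic about retry policy, which matters here since storage-level retries are tracked separately (issue [7907] below).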

Owner Author

I think the Storage module doesn't implement a retry mechanism, which is why there is a separate task on GitHub to implement one for storage.
Please refer to issue [7907].


Regardless of that, the possibility of racing should be eliminated.

@HemangChothani HemangChothani force-pushed the feature/storage_parallel_operations_copying_objects branch from 213e374 to 42f1d9e Compare June 27, 2019 09:09