Conversation
* Added owlbot templeted files * updated repo-metadata * Fixed Kokoro CI errors * Fixed failing tests * added test file to documentai_toolbox to test docs * changed docs files * updated code * updated code and removed 3.6 constraints
* Added owlbot templeted files * updated repo-metadata * Fixed Kokoro CI errors * Fixed failing tests * added test file to documentai_toolbox to test docs * changed docs files * Added DocumentWrapper, EntityWrapper,PageWrapper * Fixed code per comments * Refactored code * updated code * updated code * refactored imports * added storage dependency to setup.py * fixed lint issues * refactored code and added tests * refactored code and tests * removed samples contents
* feat: add get_document and list_document functions * fixed tests * lint fix * added tests and changed DocumentWrapper * fixed failing test * updated tests * changed DocStrings and tests * fixed failing test * changed name and return type of list_documents * updated failing tests * lint fix * updated get_document name to get_shards
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 3.20.1 to 3.20.2. - [Release notes](https://github.com/protocolbuffers/protobuf/releases) - [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/generate_changelog.py) - [Commits](protocolbuffers/protobuf@v3.20.1...v3.20.2) --- updated-dependencies: - dependency-name: protobuf dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Yu-Han Liu <dizcology@hotmail.com>
* feat: add get_document and list_document functions * fixed tests * lint fix * added tests and changed DocumentWrapper * fixed failing test * updated tests * changed DocStrings and tests * fixed failing test * changed name and return type of list_documents * updated failing tests * lint fix * chore: updated comments * updating naming for get_document to get_shards * revert get_document changes * lint fix * added code-block to comments * edited comments
* feat: add get_document and list_document functions * fixed tests * lint fix * added tests and changed DocumentWrapper * fixed failing test * updated tests * changed DocStrings and tests * fixed failing test * changed name and return type of list_documents * updated failing tests * lint fix * chore: updated comments * updating naming for get_document to get_shards * revert get_document changes * lint fix * added code-block to comments * feat: add TableWrapper and helper functions * wrapped lines and paragraphs * added tests for new features * lint fix * removed functions not related to TableWrapper * changed format *_wrapper to wrapped_* * changed wrapped_* format to *
* Chore: fix missing changes * lint fix * table fix
* feat: add get_document and list_document functions * fixed tests * lint fix * added tests and changed DocumentWrapper * fixed failing test * updated tests * changed DocStrings and tests * fixed failing test * changed name and return type of list_documents * updated failing tests * lint fix * chore: updated comments * updating naming for get_document to get_shards * revert get_document changes * lint fix * added code-block to comments * feat: add TableWrapper and helper functions * wrapped lines and paragraphs * added tests for new features * lint fix * feat: added helper functions to DocumentWrapper * lint fix * fixed failing test * refactored code * lint fix * lint fix * refactored code * fixed failing test * refactored code * added text fixture to simplify testing
* feat: add get_document and list_document functions * fixed tests * lint fix * added tests and changed DocumentWrapper * fixed failing test * updated tests * changed DocStrings and tests * fixed failing test * changed name and return type of list_documents * updated failing tests * lint fix * chore: updated comments * updating naming for get_document to get_shards * revert get_document changes * lint fix * added code-block to comments * feat: add TableWrapper and helper functions * wrapped lines and paragraphs * added tests for new features * lint fix * chore: added client_info to storage client * lint fix * removed helper functions * created single source of truth for library version * lint fix * lint fix * failing test fix * refactored code * lint fix
* Chore : Updated readme links * updated readme * updated links * updated readme
* feat: add get_document and list_document functions * fixed tests * lint fix * added tests and changed DocumentWrapper * fixed failing test * updated tests * changed DocStrings and tests * fixed failing test * changed name and return type of list_documents * updated failing tests * lint fix * chore: updated comments * updating naming for get_document to get_shards * revert get_document changes * lint fix * added code-block to comments * feat: add TableWrapper and helper functions * wrapped lines and paragraphs * added tests for new features * lint fix * removed functions not related to TableWrapper * changed format *_wrapper to wrapped_* * changed wrapped_* format to * * chore: refactored classes * fixed failing test * removed samples file * changed from_* function from self to cls
* feat: refactor code * linter * fix typo. comment out CODEOWNER for @googleapis/python-samples-reviewers
* chore: made print_gcs_document_tree accessible * updated init files
* chore: add unit test for Entity
* chore: fix to_dataframe header issue
* chore(main): release 0.1.0 * feat: change release version to alpha * changed changelog version Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
* chore: updated readme * changed rst text * added disclaimer * update readme * update readme * update readme * update readme * Update README.rst * Update README.rst Co-authored-by: Anthonios Partheniou <partheniou@google.com> Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com> Co-authored-by: Anthonios Partheniou <partheniou@google.com>
* chore: added tests for page.py * added test fixture for page.py tests * changed fixture name
Bumps [certifi](https://github.com/certifi/python-certifi) from 2022.9.24 to 2022.12.7. - [Release notes](https://github.com/certifi/python-certifi/releases) - [Commits](certifi/python-certifi@2022.09.24...2022.12.07) --- updated-dependencies: - dependency-name: certifi dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* chore: update repo-metadata.json * syntax
…a.json (#31) * chore: update client_documentation and issue_tracker in .repo-metadata.json * remove client_documentation value
* chore: changed gcs_prefix pattern comment * changed functions using gcs_prefix * fixed failing tests * updated comments * updated comments
* chore: documentation changes * addressed comments
* chore: updated testing constraints * updates 3.7 constraints * updated test constraints * updated constrainst 3.7 * updated setup.py deps * removed setup.py dep * removed google-common-proto * updated pandas deps * changes pandas in setup.py * revertes setup changes added deps to3.7constraints * updated storage deps * removed numpy from 3.7 constraint * added numpy * changed constraints * added numpy * fixed dependency error * changed setup.py * updated numpy constraints * changed min dep for numpy * changed api core * fixed lint issue * removed get_bytes test * testing changes * removed test issue
* docs: fix docs arrangement * updated documentation * lint fix --------- Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
* chore: minor refactoring of GCS Functions in document wrapper - Simplified `print_gcs_document_tree` for readaibility/maintainability (And to resolve linter errors) - Added constants for reused values - Added `ignore_unknown_values` to `Document.from_json()` to avoid exceptions with new Document Proto versions between client library updates * chore: minor refactoring of GCS Functions in document wrapper - Simplified `print_gcs_document_tree` for readaibility/maintainability (And to resolve linter errors) - Added constants for reused values - Added `ignore_unknown_values` to `Document.from_json()` to avoid exceptions with new Document Proto versions between client library updates * chore: Fix to allow tests to pass
* chore: update docs url * update README
* chore: fixed documentation devsite issues * fixed comments * Update README.rst * updated readme
Source-Link: googleapis/synthtool@0da1658 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:5cddfe2fb5019bbf78335bc55f15bc13e18354a56b3ff46e1834f8e540807f05 Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Source-Link: googleapis/synthtool@6357517 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:91d0075c6f2fd6a073a06168feee19fa2a8507692f2519a1dc7de3366d157e99 Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Source-Link: googleapis/synthtool@59171c8 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:2ed982f884312e4883e01b5ab8af8b6935f0216a5a2d82928d273081fc3be562 Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
* chore(deps): update all dependencies * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md --------- Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com> Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
Source-Link: googleapis/synthtool@e808c98 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:8e3e7e18255c22d1489258d0374c901c01f9c4fd77a12088670cd73d580aa737 Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
#368) Source-Link: googleapis/synthtool@de3def6 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:a1c5112b81d645f5bbc4d4bbc99d7dcb5089a52216c0e3fb1203a0eeabadd7d5 Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
…fig (#371) Source-Link: googleapis/synthtool@106d292 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:8ff1efe878e18bd82a0fb7b70bb86f77e7ab6901fed394440b6135db0ba8d84a Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5. - [Release notes](https://github.com/pallets/jinja/releases) - [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst) - [Commits](pallets/jinja@3.1.4...3.1.5) --- updated-dependencies: - dependency-name: jinja2 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
Source-Link: googleapis/synthtool@bd9ede2 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:04c35dc5f49f0f503a306397d6d043685f8d2bb822ab515818c4208d7fb2db3a Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Bumps [cryptography](https://github.com/pyca/cryptography) from 43.0.1 to 44.0.1. - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](pyca/cryptography@43.0.1...44.0.1) --- updated-dependencies: - dependency-name: cryptography dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Source-Link: googleapis/synthtool@aa69fb7 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:f016446d6e520e5fb552c45b110cba3f217bffdd3d06bdddd076e9e6d13266cf Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
* chore(deps): update all dependencies * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * Update version of toolbox --------- Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com> Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com> Co-authored-by: Holt Skinner <holtskinner@google.com>
* chore: remove unused files * update post processor image * update post processor image
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
…lbox/main' into migration.python-documentai-toolbox.migration.2026-03-02_16-59-45.migrate
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request focuses on migrating the Document AI Toolbox code from the googleapis/python-documentai-toolbox repository into a dedicated package within the google-cloud-documentai-toolbox repository. This involves moving the source code, tests, and documentation, as well as setting up the necessary infrastructure for continuous integration, linting, and release management. The goal is to consolidate the codebase and improve maintainability.
Code Review
This PR migrates the python-documentai-toolbox library into this repository. The changes consist of adding a large number of files for the new package, including source code, tests, samples, and configuration. The code is generally well-structured and tested. My feedback focuses on making the code more robust and maintainable, addressing a potential bug in file handling, and improving code clarity and type correctness.
Note: Security Review is unavailable for this PR.
    try:
        for blob in blobs:
            if blob.name.endswith("/"):
                continue
            file_name = os.path.basename(blob.name)
            if annotation_file_prefix in file_name:
                annotation_blob = blob
            elif config_file_prefix in file_name:
                metadata_blob = blob
            elif constants.PDF_EXTENSION in file_name:
                doc_blob = blob

        if config_path:
            metadata_blob = gcs_utilities.get_blob(config_path)

        directory_name = os.path.basename(gcs_uri)
        print(f"Downloaded: {directory_name}", end="\r")

        return (
            annotation_blob.download_as_bytes(),
            doc_blob.download_as_bytes(),
            metadata_blob.download_as_bytes(),
            directory_name,
        )
    except Exception as e:
        raise e
The _get_bytes function does not handle cases where one of the expected files (annotation, config, or PDF) is missing in the GCS directory. If a file is not found, an UnboundLocalError will be raised when trying to access variables like annotation_blob, metadata_blob, or doc_blob. It would be more robust to initialize these variables to None before the loop and add a check to ensure all required files are found before proceeding. This would provide a clearer FileNotFoundError.
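The fix described in this comment could look roughly like the sketch below. It is not the toolbox's code: `select_blobs`, `FakeBlob`, and the module-level `PDF_EXTENSION` are hypothetical stand-ins so the example runs without GCS access, and the `config_path` override from the original function is left out for brevity.

```python
import os
from typing import Iterable, Optional, Tuple

PDF_EXTENSION = ".pdf"  # assumed stand-in for constants.PDF_EXTENSION


class FakeBlob:
    """Hypothetical stand-in for a GCS blob (name plus downloadable bytes)."""

    def __init__(self, name: str, data: bytes = b"") -> None:
        self.name = name
        self._data = data

    def download_as_bytes(self) -> bytes:
        return self._data


def select_blobs(
    blobs: Iterable[FakeBlob],
    annotation_file_prefix: str,
    config_file_prefix: str,
) -> Tuple[FakeBlob, FakeBlob, FakeBlob]:
    """Pick the annotation, PDF, and config blobs, failing clearly if any is absent."""
    # Initialize to None so a missing file cannot surface as UnboundLocalError.
    annotation_blob: Optional[FakeBlob] = None
    doc_blob: Optional[FakeBlob] = None
    metadata_blob: Optional[FakeBlob] = None

    for blob in blobs:
        if blob.name.endswith("/"):
            continue  # skip directory placeholders
        file_name = os.path.basename(blob.name)
        if annotation_file_prefix in file_name:
            annotation_blob = blob
        elif config_file_prefix in file_name:
            metadata_blob = blob
        elif PDF_EXTENSION in file_name:
            doc_blob = blob

    # Report every missing file at once instead of failing on the first access.
    found = (("annotation", annotation_blob), ("PDF", doc_blob), ("config", metadata_blob))
    missing = [label for label, blob in found if blob is None]
    if missing:
        raise FileNotFoundError(f"Missing required file(s): {', '.join(missing)}")
    return annotation_blob, doc_blob, metadata_blob
```

With this shape, the caller only reaches `download_as_bytes()` once all three blobs are known to exist, and the error message names what is missing.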
    y_min = _convert_bbox_units(
        block.bounding_box[f"{block.bounding_y}"],
        input_bbox_units=block.bounding_unit,
        width=block.page_height,
        multiplier=y_multiplier,
    )
In the call to _convert_bbox_units for the y_min coordinate, the width parameter is used to pass block.page_height. This is confusing as the function signature for _convert_bbox_units includes a height parameter. Using the height parameter would make the code more readable and less prone to errors.
-    y_min = _convert_bbox_units(
-        block.bounding_box[f"{block.bounding_y}"],
-        input_bbox_units=block.bounding_unit,
-        width=block.page_height,
-        multiplier=y_multiplier,
-    )
+    y_min = _convert_bbox_units(
+        block.bounding_box[f"{block.bounding_y}"],
+        input_bbox_units=block.bounding_unit,
+        height=block.page_height,
+        multiplier=y_multiplier,
+    )
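To illustrate why the keyword matters for readability, here is a minimal sketch using a hypothetical `convert_bbox_units`. It is not the toolbox's `_convert_bbox_units`, whose body is not shown in this review. The sketch assumes that `width` and `height` are separate optional parameters and that each simply normalizes the coordinate along its axis.

```python
from typing import Optional


def convert_bbox_units(
    coordinate: float,
    input_bbox_units: str,
    *,
    width: Optional[float] = None,
    height: Optional[float] = None,
    multiplier: float = 1.0,
) -> float:
    """Hypothetical converter: normalize a coordinate by the page dimension given."""
    if input_bbox_units == "normalized":
        return coordinate * multiplier
    if width is not None:
        return coordinate / width * multiplier
    if height is not None:
        return coordinate / height * multiplier
    raise ValueError("width or height is required for non-normalized units")


page_height = 800.0
# Works, but the call site claims a width while passing a height.
via_width = convert_bbox_units(200.0, "pxl", width=page_height)
# Same value under this sketch's assumptions, and the intent is explicit.
via_height = convert_bbox_units(200.0, "pxl", height=page_height)
```

Under these assumptions both calls return the same number, so the change is about intent at the call site. If the real function treats the two parameters differently, the keyword would also affect the result.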
    """
    entity_annotations: List[EntityAnnotation] = []
    for token in page_info.page.tokens:
        v: vision.Vertex = []
The type hint for the v variable is vision.Vertex, but it is initialized as a list []. This is incorrect as vision.Vertex represents a single vertex object, not a list of them. To improve code clarity and correctness for static analysis tools, the type hint should be changed to List[Dict[str, int]] or List[vision.Vertex].
-        v: vision.Vertex = []
+        v: List[Dict[str, int]] = []
See #11026.
This PR should be merged with a merge-commit, not a squash-commit, in order to preserve the git history.