chore(migration): Migrate code from googleapis/python-documentai-toolbox into packages/google-cloud-documentai-toolbox#16010

Draft
parthea wants to merge 262 commits into main from migration.python-documentai-toolbox.migration.2026-03-02_16-59-45.migrate

Conversation

@parthea (Contributor) commented Mar 2, 2026

See #11026.

This PR should be merged with a merge-commit, not a squash-commit, in order to preserve the git history.

galz10 and others added 30 commits September 14, 2022 13:24
* Added owlbot templeted files

* updated repo-metadata

* Fixed Kokoro CI errors

* Fixed failing tests

* added test file to documentai_toolbox to test docs

* changed docs files

* updated code

* updated code and removed 3.6 constraints
* Added owlbot templeted files

* updated repo-metadata

* Fixed Kokoro CI errors

* Fixed failing tests

* added test file to documentai_toolbox to test docs

* changed docs files

* Added DocumentWrapper, EntityWrapper,PageWrapper

* Fixed code per comments

* Refactored code

* updated code

* updated code

* refactored imports

* added storage dependency to setup.py

* fixed lint issues

* refactored code and added tests

* refactored code and tests

* removed samples contents
* feat: add get_document and list_document functions

* fixed tests

* lint fix

* added tests and changed DocumentWrapper

* fixed failing test

* updated tests

* changed DocStrings and tests

* fixed failing test

* changed name and return type of list_documents

* updated failing tests

* lint fix

* updated get_document name to get_shards
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 3.20.1 to 3.20.2.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/generate_changelog.py)
- [Commits](protocolbuffers/protobuf@v3.20.1...v3.20.2)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Yu-Han Liu <dizcology@hotmail.com>
* feat: add get_document and list_document functions

* fixed tests

* lint fix

* added tests and changed DocumentWrapper

* fixed failing test

* updated tests

* changed DocStrings and tests

* fixed failing test

* changed name and return type of list_documents

* updated failing tests

* lint fix

* chore: updated comments

* updating naming for get_document to get_shards

* revert get_document changes

* lint fix

* added code-block to comments

* edited comments
* feat: add get_document and list_document functions

* fixed tests

* lint fix

* added tests and changed DocumentWrapper

* fixed failing test

* updated tests

* changed DocStrings and tests

* fixed failing test

* changed name and return type of list_documents

* updated failing tests

* lint fix

* chore: updated comments

* updating naming for get_document to get_shards

* revert get_document changes

* lint fix

* added code-block to comments

* feat: add TableWrapper and helper functions

* wrapped lines and paragraphs

* added tests for new features

* lint fix

* removed functions not related to TableWrapper

* changed format *_wrapper to wrapped_*

* changed wrapped_* format to *
* Chore: fix missing changes

* lint fix

* table fix
* feat: add get_document and list_document functions

* fixed tests

* lint fix

* added tests and changed DocumentWrapper

* fixed failing test

* updated tests

* changed DocStrings and tests

* fixed failing test

* changed name and return type of list_documents

* updated failing tests

* lint fix

* chore: updated comments

* updating naming for get_document to get_shards

* revert get_document changes

* lint fix

* added code-block to comments

* feat: add TableWrapper and helper functions

* wrapped lines and paragraphs

* added tests for new features

* lint fix

* feat: added helper functions to DocumentWrapper

* lint fix

* fixed failing test

* refactored code

* lint fix

* lint fix

* refactored code

* fixed failing test

* refactored code

* added text fixture to simplify testing
* feat: add get_document and list_document functions

* fixed tests

* lint fix

* added tests and changed DocumentWrapper

* fixed failing test

* updated tests

* changed DocStrings and tests

* fixed failing test

* changed name and return type of list_documents

* updated failing tests

* lint fix

* chore: updated comments

* updating naming for get_document to get_shards

* revert get_document changes

* lint fix

* added code-block to comments

* feat: add TableWrapper and helper functions

* wrapped lines and paragraphs

* added tests for new features

* lint fix

* chore: added client_info to storage client

* lint fix

* removed helper functions

* created single source of truth for library version

* lint fix

* lint fix

* failing test fix

* refactored code

* lint fix
* Chore : Updated readme links

* updated readme

* updated links

* updated readme
* feat: add get_document and list_document functions

* fixed tests

* lint fix

* added tests and changed DocumentWrapper

* fixed failing test

* updated tests

* changed DocStrings and tests

* fixed failing test

* changed name and return type of list_documents

* updated failing tests

* lint fix

* chore: updated comments

* updating naming for get_document to get_shards

* revert get_document changes

* lint fix

* added code-block to comments

* feat: add TableWrapper and helper functions

* wrapped lines and paragraphs

* added tests for new features

* lint fix

* removed functions not related to TableWrapper

* changed format *_wrapper to wrapped_*

* changed wrapped_* format to *

* chore: refactored classes

* fixed failing test

* removed samples file

* changed from_* function from self to cls
* feat: refactor code

* linter

* fix typo. comment out CODEOWNER for @googleapis/python-samples-reviewers
* chore: made print_gcs_document_tree accessible

* updated init files
* chore: add unit test for Entity
* chore: fix to_dataframe header issue
* chore(main): release 0.1.0

* feat: change release version to alpha

* changed changelog version

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
* chore: updated readme

* changed rst text

* added disclaimer

* update readme

* update readme

* update readme

* update readme

* Update README.rst

* Update README.rst

Co-authored-by: Anthonios Partheniou <partheniou@google.com>

Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
Co-authored-by: Anthonios Partheniou <partheniou@google.com>
* chore: added tests for page.py

* added test fixture for page.py tests

* changed fixture name
Bumps [certifi](https://github.com/certifi/python-certifi) from 2022.9.24 to 2022.12.7.
- [Release notes](https://github.com/certifi/python-certifi/releases)
- [Commits](certifi/python-certifi@2022.09.24...2022.12.07)

---
updated-dependencies:
- dependency-name: certifi
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* chore: update repo-metadata.json

* syntax
…a.json (#31)

* chore: update client_documentation and issue_tracker in .repo-metadata.json

* remove client_documentation value
* chore: changed gcs_prefix pattern comment

* changed functions using gcs_prefix

* fixed failing tests

* updated comments

* updated comments
* chore: documentation changes

* addressed comments
* chore: updated testing constraints

* updates 3.7 constraints

* updated test constraints

* updated constrainst 3.7

* updated setup.py deps

* removed setup.py dep

* removed google-common-proto

* updated pandas deps

* changes pandas in setup.py

* revertes setup changes added deps to3.7constraints

* updated storage deps

* removed numpy from 3.7 constraint

* added numpy

* changed constraints

* added numpy

* fixed dependency error

* changed setup.py

* updated numpy constraints

* changed min dep for numpy

* changed api core

* fixed lint issue

* removed get_bytes test

* testing changes

* removed test issue
* docs: fix docs arrangement

* updated documentation

* lint fix

---------

Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
* chore: minor refactoring of GCS Functions in document wrapper
- Simplified `print_gcs_document_tree` for readaibility/maintainability (And to resolve linter errors)
- Added constants for reused values
- Added `ignore_unknown_values` to `Document.from_json()` to avoid exceptions with new Document Proto versions between client library updates

* chore: minor refactoring of GCS Functions in document wrapper
- Simplified `print_gcs_document_tree` for readaibility/maintainability (And to resolve linter errors)
- Added constants for reused values
- Added `ignore_unknown_values` to `Document.from_json()` to avoid exceptions with new Document Proto versions between client library updates

* chore: Fix to allow tests to pass
* chore: update docs url

* update README
* chore: fixed documentation devsite issues

* fixed comments

* Update README.rst

* updated readme
gcf-owl-bot bot and others added 25 commits October 31, 2024 10:07
Source-Link: googleapis/synthtool@0da1658
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:5cddfe2fb5019bbf78335bc55f15bc13e18354a56b3ff46e1834f8e540807f05

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Source-Link: googleapis/synthtool@6357517
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:91d0075c6f2fd6a073a06168feee19fa2a8507692f2519a1dc7de3366d157e99

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Source-Link: googleapis/synthtool@59171c8
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:2ed982f884312e4883e01b5ab8af8b6935f0216a5a2d82928d273081fc3be562

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
* chore(deps): update all dependencies

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
Source-Link: googleapis/synthtool@e808c98
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:8e3e7e18255c22d1489258d0374c901c01f9c4fd77a12088670cd73d580aa737

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
#368)

Source-Link: googleapis/synthtool@de3def6
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:a1c5112b81d645f5bbc4d4bbc99d7dcb5089a52216c0e3fb1203a0eeabadd7d5

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
…fig (#371)

Source-Link: googleapis/synthtool@106d292
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:8ff1efe878e18bd82a0fb7b70bb86f77e7ab6901fed394440b6135db0ba8d84a

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](pallets/jinja@3.1.4...3.1.5)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
Source-Link: googleapis/synthtool@bd9ede2
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:04c35dc5f49f0f503a306397d6d043685f8d2bb822ab515818c4208d7fb2db3a

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Bumps [cryptography](https://github.com/pyca/cryptography) from 43.0.1 to 44.0.1.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](pyca/cryptography@43.0.1...44.0.1)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Source-Link: googleapis/synthtool@aa69fb7
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:f016446d6e520e5fb552c45b110cba3f217bffdd3d06bdddd076e9e6d13266cf

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
* chore(deps): update all dependencies

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* Update version of toolbox

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
Co-authored-by: Holt Skinner <holtskinner@google.com>
* chore: remove unused files

* update post processor image

* update post processor image
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
…lbox/main' into migration.python-documentai-toolbox.migration.2026-03-02_16-59-45.migrate
@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request migrates the Document AI Toolbox code from the googleapis/python-documentai-toolbox repository into the packages/google-cloud-documentai-toolbox directory of this repository. This involves moving the source code, tests, and documentation, as well as setting up the necessary infrastructure for continuous integration, linting, and release management. The goal is to consolidate the codebase and improve maintainability.

Highlights

  • Code Migration: Migrated the googleapis/python-documentai-toolbox code into the packages/google-cloud-documentai-toolbox directory.
  • Dependency Management: Introduced new files for dependency management, linting, testing, and release automation.
  • Repository Configuration: Added various configuration files for GitHub actions, code owners, issue templates, and repository synchronization.

Activity

  • The PR description requests a merge-commit to preserve git history.
  • The PR introduces a large number of new files, primarily for configuration and automation.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This PR migrates the python-documentai-toolbox library into this repository. The changes consist of adding a large number of files for the new package, including source code, tests, samples, and configuration. The code is generally well-structured and tested. My feedback focuses on making the code more robust and maintainable, addressing a potential bug in file handling, and improving code clarity and type correctness.

Note: Security Review is unavailable for this PR.

Comment on lines +233 to +258
    try:
        for blob in blobs:
            if blob.name.endswith("/"):
                continue
            file_name = os.path.basename(blob.name)
            if annotation_file_prefix in file_name:
                annotation_blob = blob
            elif config_file_prefix in file_name:
                metadata_blob = blob
            elif constants.PDF_EXTENSION in file_name:
                doc_blob = blob

        if config_path:
            metadata_blob = gcs_utilities.get_blob(config_path)

        directory_name = os.path.basename(gcs_uri)
        print(f"Downloaded: {directory_name}", end="\r")

        return (
            annotation_blob.download_as_bytes(),
            doc_blob.download_as_bytes(),
            metadata_blob.download_as_bytes(),
            directory_name,
        )
    except Exception as e:
        raise e

high

The _get_bytes function does not handle cases where one of the expected files (annotation, config, or PDF) is missing in the GCS directory. If a file is not found, an UnboundLocalError will be raised when trying to access variables like annotation_blob, metadata_blob, or doc_blob. It would be more robust to initialize these variables to None before the loop and add a check to ensure all required files are found before proceeding. This would provide a clearer FileNotFoundError.
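
One way to apply this suggestion is sketched below. It is a minimal illustration, not the toolbox's implementation: `FakeBlob`, `pick_blobs`, and the value of `PDF_EXTENSION` are hypothetical stand-ins for the GCS blobs, the selection loop in `_get_bytes`, and `constants.PDF_EXTENSION`.

```python
import os
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

PDF_EXTENSION = ".pdf"  # assumed value of constants.PDF_EXTENSION


@dataclass
class FakeBlob:
    """Hypothetical stand-in for a GCS blob; only the name is modeled."""

    name: str


def pick_blobs(
    blobs: Iterable[FakeBlob],
    annotation_file_prefix: str,
    config_file_prefix: str,
) -> Tuple[FakeBlob, FakeBlob, FakeBlob]:
    """Return (annotation, document, metadata) blobs or raise FileNotFoundError."""
    # Initialize to None so a missing file cannot cause an UnboundLocalError.
    annotation_blob: Optional[FakeBlob] = None
    doc_blob: Optional[FakeBlob] = None
    metadata_blob: Optional[FakeBlob] = None

    for blob in blobs:
        if blob.name.endswith("/"):
            continue  # skip directory placeholders
        file_name = os.path.basename(blob.name)
        if annotation_file_prefix in file_name:
            annotation_blob = blob
        elif config_file_prefix in file_name:
            metadata_blob = blob
        elif PDF_EXTENSION in file_name:
            doc_blob = blob

    # Collect the label of every required file that was not found.
    found = {
        "annotation": annotation_blob,
        "document": doc_blob,
        "config": metadata_blob,
    }
    missing = [label for label, blob in found.items() if blob is None]
    if missing:
        raise FileNotFoundError(f"Missing required file(s): {', '.join(missing)}")

    return annotation_blob, doc_blob, metadata_blob
```

In the real function, the `config_path` override would presumably be applied before the missing-file check, so that a config supplied by path still satisfies it.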

Comment on lines +233 to +238
    y_min = _convert_bbox_units(
        block.bounding_box[f"{block.bounding_y}"],
        input_bbox_units=block.bounding_unit,
        width=block.page_height,
        multiplier=y_multiplier,
    )

medium

In the call to _convert_bbox_units for the y_min coordinate, the width parameter is used to pass block.page_height. This is confusing as the function signature for _convert_bbox_units includes a height parameter. Using the height parameter would make the code more readable and less prone to errors.

Suggested change
- y_min = _convert_bbox_units(
-     block.bounding_box[f"{block.bounding_y}"],
-     input_bbox_units=block.bounding_unit,
-     width=block.page_height,
-     multiplier=y_multiplier,
- )
+ y_min = _convert_bbox_units(
+     block.bounding_box[f"{block.bounding_y}"],
+     input_bbox_units=block.bounding_unit,
+     height=block.page_height,
+     multiplier=y_multiplier,
+ )
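
To illustrate why separate `width` and `height` keywords help readability, here is a rough sketch of a conversion helper. It is hypothetical: `convert_bbox_units`, the `"pxl"` unit name, and the way `multiplier` is applied are assumptions made for this example and are not taken from the toolbox's `_convert_bbox_units`.

```python
from typing import Optional


def convert_bbox_units(
    coordinate: float,
    input_bbox_units: str,
    width: Optional[float] = None,
    height: Optional[float] = None,
    multiplier: float = 1.0,
) -> float:
    """Hypothetical helper: scale a coordinate into the normalized 0..1 range."""
    if input_bbox_units == "normalized":
        # Already normalized; only the multiplier applies.
        return coordinate * multiplier
    if width is not None and height is not None:
        raise ValueError("Pass width for x coordinates or height for y coordinates")
    # x coordinates are divided by the page width, y coordinates by the height.
    dimension = width if width is not None else height
    if dimension is None:
        raise ValueError("A page width or height is required for non-normalized units")
    return coordinate / dimension * multiplier


# A y coordinate reads naturally when the page height is passed as `height`.
y_min = convert_bbox_units(50.0, "pxl", height=200.0)
```

With a signature like this, passing the page height through `height=` keeps the call site consistent with the axis being converted.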

    """
    entity_annotations: List[EntityAnnotation] = []
    for token in page_info.page.tokens:
        v: vision.Vertex = []

medium

The type hint for the v variable is vision.Vertex, but it is initialized as a list []. This is incorrect as vision.Vertex represents a single vertex object, not a list of them. To improve code clarity and correctness for static analysis tools, the type hint should be changed to List[Dict[str, int]] or List[vision.Vertex].

Suggested change
- v: vision.Vertex = []
+ v: List[Dict[str, int]] = []
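
As a small illustration of the suggested annotation, the sketch below builds a list of vertex dictionaries. The function `to_vertex_dicts` and the `"x"`/`"y"` keys are assumptions for this example; the surrounding code in the PR may populate `v` differently.

```python
from typing import Dict, List, Tuple


def to_vertex_dicts(points: List[Tuple[int, int]]) -> List[Dict[str, int]]:
    """Hypothetical helper: convert (x, y) pairs into vertex dictionaries."""
    # The annotation now matches the value: a list of dicts, not a single vertex.
    vertices: List[Dict[str, int]] = []
    for x, y in points:
        vertices.append({"x": x, "y": y})
    return vertices


corners = to_vertex_dicts([(0, 0), (10, 0), (10, 5), (0, 5)])
```

A static type checker would accept this annotation, whereas annotating the same list as a single vertex object would be flagged.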

@parthea self-assigned this Mar 2, 2026
@parthea added the "do not merge" label Mar 2, 2026

Labels

do not merge: Indicates a pull request not ready for merge, due to either quality or timing.

10 participants