digitization: proposal for refactor file_import#29

Merged
namollayo merged 7 commits into cern-sis:main from namollayo:#24-file-matching-logic
Apr 22, 2026

Conversation

Contributor

@namollayo namollayo commented Apr 14, 2026

Implement Boite-to-S3 file matcher

* Implement fetch_boite_files to replace manual fetching
* ref cern-sis#15
* Match Boite .xlsx records to S3 paths case-insensitively
* Support flat and subfolder layouts for PDF and PDF_LATEX
* Log unmatched files and output data for XML generation
* ref cern-sis#24
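
A minimal sketch of the case-insensitive matching described above, not the code merged in this PR. The helper name `match_records_to_keys`, the plain-list inputs, and the optional `suffix` argument are assumptions made for illustration; the real `fetch_boite_files` and the .xlsx parsing are not shown.

```python
import logging
from pathlib import PurePosixPath

logger = logging.getLogger(__name__)


def match_records_to_keys(record_names, s3_keys, suffix=""):
    """Match record names to S3 keys, ignoring case.

    Hypothetical helper: record_names stands in for values read from the
    Boite .xlsx, s3_keys for object keys already listed from the bucket.
    """
    # Index keys by lower-cased file stem so lookups ignore case.
    # Using the stem also means flat and subfolder layouts index the same way.
    index = {}
    for key in s3_keys:
        stem = PurePosixPath(key).stem.lower()
        index.setdefault(stem, key)  # keep the first key seen for a stem

    matched, unmatched = {}, []
    for name in record_names:
        key = index.get((name + suffix).lower())
        if key is None:
            # Unmatched records are logged and returned for later review.
            logger.warning("No S3 object found for record %s", name)
            unmatched.append(name)
        else:
            matched[name] = key
    return matched, unmatched
```

The `matched` mapping is the kind of data that could feed XML generation, while `unmatched` corresponds to the logged files.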

@namollayo namollayo force-pushed the #24-file-matching-logic branch from 24cc87a to 8bd8602 on April 15, 2026 11:32
Collaborator

@PascalEgn PascalEgn left a comment

I checked most of the code and left some comments. I also noticed that the matching of normal PDFs doesn't seem to work for me, although it does work for PDF_LATEX. I suspect it's because for PDF_LATEX the PDF files on S3 are stored directly in the box folder (for example PDF_LATEX/BOITE_O0126/300-BT-HK-D-1_latex.pdf), whereas for the 'normal' PDF they sit in a subfolder named after the file (for example PDF/BOITE_O0126/300-BT-HK-D-1/300-BT-HK-D-1.pdf).
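
To illustrate the two layouts, here is a rough sketch that reduces both example keys to the same record identifier. The helper `record_id_from_key` is hypothetical and is not the code in this PR; the `_latex` suffix and the three- or four-segment key shape are assumptions taken from the two example paths only.

```python
from pathlib import PurePosixPath


def record_id_from_key(key, latex_suffix="_latex"):
    """Return (category, box, record_id) for an S3 key in either layout.

    Assumed layouts (from the two example paths):
      flat:      CATEGORY/BOX/<record>_latex.pdf
      subfolder: CATEGORY/BOX/<record>/<record>.pdf
    """
    path = PurePosixPath(key)
    parts = path.parts
    if len(parts) not in (3, 4):
        raise ValueError(f"Unexpected key layout: {key}")

    category, box = parts[0], parts[1]
    # Lower-case the stem so comparisons are case-insensitive.
    stem = path.stem.lower()
    if latex_suffix and stem.endswith(latex_suffix):
        stem = stem[: -len(latex_suffix)]
    return category, box, stem


flat = record_id_from_key("PDF_LATEX/BOITE_O0126/300-BT-HK-D-1_latex.pdf")
nested = record_id_from_key("PDF/BOITE_O0126/300-BT-HK-D-1/300-BT-HK-D-1.pdf")
print(flat)    # ('PDF_LATEX', 'BOITE_O0126', '300-bt-hk-d-1')
print(nested)  # ('PDF', 'BOITE_O0126', '300-bt-hk-d-1')
```

Deriving the identifier from the file name rather than from a fixed path depth is one way to make the matcher indifferent to whether the extra subfolder is present.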

Comment thread refactory/storage_connection.py Outdated
Comment thread refactory/storage_connection.py Outdated
Comment thread refactory/storage_connection.py Outdated
Comment thread refactory/cli.py Outdated
Comment thread refactory/cli.py
Comment thread refactory/file_import/boite_matcher.py Outdated
Comment thread refactory/file_import/utils.py Outdated
Comment thread refactory/cli.py Outdated
Comment thread refactory/check_files/main.py Outdated
Comment thread refactory/cli.py Outdated
@namollayo namollayo merged commit 9fdf017 into cern-sis:main Apr 22, 2026
1 check passed
2 participants