Skip to content

Add abstract __len__ method to MultilingualCorpus#39

Open
menicacci wants to merge 4 commits intomainfrom
features/corpus-len
Open

Add abstract __len__ method to MultilingualCorpus#39
menicacci wants to merge 4 commits intomainfrom
features/corpus-len

Conversation

@menicacci
Copy link
Copy Markdown
Contributor

No description provided.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a standardized __len__ contract for MultilingualCorpus and adds an implementation + tests for ParallelCorpus to support len(corpus).

Changes:

  • Add an abstract __len__ method to MultilingualCorpus.
  • Implement ParallelCorpus.__len__ using a line-counting helper.
  • Add unit tests validating len(ParallelCorpus) for empty and populated corpora.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
src/larakit/corpus/_base.py Adds __len__ as an abstract requirement for all corpora.
src/larakit/corpus/_parallel.py Implements ParallelCorpus.__len__ using a cached line count.
tests/corpus/parallel.py Adds tests asserting len() behavior for empty and written corpora.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/larakit/corpus/_base.py Outdated
Comment thread src/larakit/corpus/_parallel.py
Comment thread src/larakit/corpus/_parallel.py
Comment thread src/larakit/corpus/_parallel.py
def writer(self) -> ParallelCorpusWriter:
return ParallelCorpusWriter(self._language, self._source, self._target)

def __len__(self) -> int:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot cache length like this: if the user writes lines to the parallel corpus, the _size field becomes obsolete.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, but JTMCorpus and TMXCorpus both assume the file won't change after the corpus object is constructed — they cache length the same way and go stale after a write too. I agree it's not a good implementation, but if we want to change this behavior we should plan to do it for all of them.
Do you agree?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed offline, let's put a cache invalidation at writer creation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menicacci menicacci changed the title Add abstract method __len__ to MultilingualCorpus Add abstract __len__ method to MultilingualCorpus Apr 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants