Add abstract __len__ method to MultilingualCorpus#39
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a standardized __len__ contract for MultilingualCorpus and adds an implementation + tests for ParallelCorpus to support len(corpus).
Changes:
- Add an abstract
__len__method toMultilingualCorpus. - Implement
ParallelCorpus.__len__using a line-counting helper. - Add unit tests validating
len(ParallelCorpus)for empty and populated corpora.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/larakit/corpus/_base.py |
Adds __len__ as an abstract requirement for all corpora. |
src/larakit/corpus/_parallel.py |
Implements ParallelCorpus.__len__ using a cached line count. |
tests/corpus/parallel.py |
Adds tests asserting len() behavior for empty and written corpora. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def writer(self) -> ParallelCorpusWriter: | ||
| return ParallelCorpusWriter(self._language, self._source, self._target) | ||
|
|
||
| def __len__(self) -> int: |
There was a problem hiding this comment.
You cannot cache length like this: if the user writes lines to the parallel corpus, the _size field becomes obsolete.
There was a problem hiding this comment.
That's true, but JTMCorpus and TMXCorpus both assume the file won't change after the corpus object is constructed — they cache length the same way and go stale after a write too. I agree it's not a good implementation, but if we want to change this behavior we should plan to do it for all of them.
Do you agree?
There was a problem hiding this comment.
As we discussed offline, let's put a cache invalidation at writer creation
__len__ to MultilingualCorpus__len__ method to MultilingualCorpus
No description provided.