modify usfm for chapter-level drafting to avoid import issues; move remarks to chapters#285
modify usfm for chapter-level drafting to avoid import issues; move remarks to chapters#285mshannon-sil wants to merge 6 commits intomainfrom
Conversation
…emarks to chapters
ddaspit
left a comment
There was a problem hiding this comment.
@ddaspit reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on Enkidu93 and mshannon-sil).
machine/corpora/update_usfm_parser_handler.py line 345 at r1 (raw file):
tokens = list(self._tokens) if chapters is not None: tokens = self._get_incremental_draft_tokens(tokens, chapters)
I think we can do something similar, but before we parse instead of after. Instead of calling parse_usfm in update_usfm, we can do something like this:
tokenizer = UsfmTokenizer(self._settings.stylesheet)
tokens = tokenizer.tokenize(usfm)
tokens = filter_tokens_by_chapter(tokens, chapters)
parser = UsfmParser(tokens, handler, self._settings.stylesheet, self._settings.versification)
parser.process_tokens()This would avoid updating the whole book.
|
Previously, ddaspit (Damien Daspit) wrote…
I updated it accordingly, how does it look now? If we change parse_usfm to accept a Sequence[UsfmToken] as well as str, we could call it here and let it instantiate the parser and process the tokens to avoid some code duplication. But not sure if there's a reason parse_usfm only accepts str. |
Enkidu93
left a comment
There was a problem hiding this comment.
@Enkidu93 reviewed 4 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on ddaspit and mshannon-sil).
machine/corpora/paratext_project_text_updater_base.py line 97 at r2 (raw file):
in_id_marker = False elif token.type == UsfmTokenType.CHAPTER: if token.data and int(token.data) in chapters:
I think this may be safe now after some recent changes, but you may want to double check what happens with bad chapter references like \c 1. if you haven't already/isn't already covered by tests.
machine/corpora/update_usfm_parser_handler.py line 348 at r2 (raw file):
remark_tokens.append(UsfmToken(UsfmTokenType.TEXT, text=remark)) if len(tokens) > 0: for index, token in enumerate(tokens):
Don't we want to preserve the ability to add book-level remarks? Am I reading this correctly that we'd no longer be able to do so with this change? Peter has also put in a PR related to the per-chapter remarks; are you guys coordinating on this sillsdev/machine#408? I think if we did something like he has there, we would have a bit more flexibility.
This PR addresses issue #284.
Mostly looking for high-level feedback about the approach at the moment. As we were discussing, is the right place for this functionality in the
get_usfm()method as essentially a post-processing step? Or should we look to implement this feature inprocess_tokens()(and maybe move the remark logic here as well)?Some initial thoughts:
Pros for putting it in
get_usfm():process_token().Pros for putting it in
process_token():This change is