Add first_block API and automatic identification of front elements #62
eduranm wants to merge 11 commits into scieloorg:main
Conversation
Pull request overview
This PR adds base support for automatically identifying front elements when processing DOCX files in markup_doc, and exposes a first_block endpoint to extract metadata from the initial block, reusing the LLM utilities already present in the project.
Changes:
- Register the first_block endpoint in the API router and create the corresponding ViewSet.
- Add first_block processing plus detection/structuring of `<abstract>`, `<kwd-group>`, `<date-accepted>`, `<date-received>` in the markup_doc/tasks.py flow.
- Introduce new utilities/models/admin hooks to persist and manage the marked-up structure (Wagtail StreamFields) and to sync collections/journals via API.
Reviewed changes
Copilot reviewed 19 out of 28 changed files in this pull request and generated 23 comments.
| File | Description |
|---|---|
| model_ai/llama.py | Adjusts the Gemini flow (includes a fixed pause after generating content). |
| markuplib/function_docx.py | New DOCX parser to extract content/first_block and detect front elements. |
| markup_doc/wagtail_hooks.py | Registers SnippetViewSets/admin actions for upload/processing/sync. |
| markup_doc/tasks.py | Integrates first_block + front-element processing into the Celery pipeline. |
| markup_doc/sync_api.py | New collection/journal sync functions backed by the SciELO Core API. |
| markup_doc/models.py | New models + StreamFields for front/body/back and collection/journal entities. |
| markup_doc/labeling_utils.py | LLM utilities + keyword extraction + reference/citation processing. |
| markup_doc/marker.py | Helper functions to mark the article/references using LlamaService. |
| markup_doc/api/v1/views.py | New first_block endpoint (ViewSet) that invokes mark_article. |
| markup_doc/api/v1/serializers.py | Base serializer for ArticleDocx (currently unused in the ViewSet flow). |
| markup_doc/choices.py | Front labels + order/regex/style rules for labeling. |
| markup_doc/apps.py | AppConfig for markup_doc. |
| markup_doc/forms.py | Base Wagtail form import (placeholder). |
| markup_doc/admin.py | Admin placeholder. |
| markup_doc/tests.py | Placeholder test file. |
| markup_doc/migrations/0001_initial.py | Initial migration for the markup_doc app. |
| markup_doc/migrations/0002_alter_articledocx_estatus_and_more.py | Adjusts the estatus field to use choices. |
| fixtures/e14790.docx | DOCX fixture for tests/manual testing. |
| config/settings/base.py | Enables markup_doc and markuplib in INSTALLED_APPS. |
| config/api_router.py | Registers the first_block endpoint on the /api/v1/ router. |
```python
response_gemini = model.generate_content(user_input).text
time.sleep(15)
return response_gemini
```
The unconditional time.sleep(15) after every Gemini call will block the request/task thread even on successful responses, significantly reducing throughput and potentially causing timeouts under load. If this is meant for rate-limiting, switch to conditional backoff only on retryable errors (e.g., 429/503) and/or use the provider SDK retry options rather than sleeping on the happy path.
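The backoff-only-on-retryable-errors pattern the comment suggests can be sketched as below. This is a minimal illustration, not the project's code: `RetryableError` is a hypothetical stand-in for the SDK's rate-limit/unavailable exceptions (HTTP 429/503), and `generate_with_backoff` wraps whatever callable performs the Gemini request.

```python
import time


class RetryableError(Exception):
    """Stand-in for the provider's rate-limit/unavailable errors (HTTP 429/503)."""


def generate_with_backoff(call, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()` and return its result immediately on success.

    Sleeps (exponential backoff) only when a retryable error is raised,
    so the happy path is never blocked by a fixed delay.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RetryableError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```

A successful first call returns with zero sleeps; only simulated 429/503 failures trigger the delay.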
```python
from markup_doc.tasks import get_labels, task_sync_journals_from_api
from django.urls import path, reverse
from django.utils.html import format_html
from wagtail.admin import messages
from wagtail.admin.views import generic
```
messages is imported from both django.contrib (line 3) and wagtail.admin (line 28), and the latter overwrites the former. This makes it unclear which API is being used and can lead to subtle behavior differences. Rename one import (e.g., from django.contrib import messages as django_messages) or remove the unused one to avoid shadowing.
```python
class JournalModelCreateView(CreateView):
    def get_context_data(self, **kwargs):
        context = super().get_context_data(**kwargs)
        task_sync_journals_from_api
```
JournalModelCreateView.get_context_data() references task_sync_journals_from_api but never calls it, so the journals sync won’t run. If the intent is to trigger a Celery task, this should likely be task_sync_journals_from_api.delay() (or call the sync function directly), otherwise this line has no effect.
```diff
-        task_sync_journals_from_api
+        task_sync_journals_from_api.delay()
```
```python
try:
    obj = cls.get(title=title)
except (cls.DoesNotExist, ValueError):
    pass
```
The except ...: pass branch leaves obj undefined when the record doesn’t exist, but the method then uses obj.estatus below, which will raise UnboundLocalError. Return early (or re-raise) when the object is missing, or use update_or_create() / filter(...).update(...) to avoid needing an instance.
```diff
-            pass
+            return None
```
```python
obj['type'] = 'aff_paragraph'

if re.search(r"^(translation)", item.get('text').lower()):
    state['label'] = '<translate-fron>'
```
The label is set to '<translate-fron>', which doesn’t match any of the declared front_labels (it looks like it should be '<translate-front>'). This mismatch will prevent the value from being valid in the StreamField ChoiceBlock and may break downstream logic that expects the canonical label.
```diff
-    state['label'] = '<translate-fron>'
+    state['label'] = '<translate-front>'
```
```python
if metadata == 'affiliation':
    messages, response_format = LlamaInputSettings.get_affiliations()
if metadata == 'doi':
    messages, response_format = LlamaInputSettings.get_doi_and_section()
if metadata == 'title':
    messages, response_format = LlamaInputSettings.get_titles()
```
mark_article() uses independent if statements without an else/default, so an unexpected metadata value will leave messages/response_format undefined and crash at LlamaService(messages, ...). Consider using an if/elif/else chain and returning a clear error (or raising) when metadata is not one of the supported values.
```diff
-if metadata == 'affiliation':
-    messages, response_format = LlamaInputSettings.get_affiliations()
-if metadata == 'doi':
-    messages, response_format = LlamaInputSettings.get_doi_and_section()
-if metadata == 'title':
-    messages, response_format = LlamaInputSettings.get_titles()
+elif metadata == 'affiliation':
+    messages, response_format = LlamaInputSettings.get_affiliations()
+elif metadata == 'doi':
+    messages, response_format = LlamaInputSettings.get_doi_and_section()
+elif metadata == 'title':
+    messages, response_format = LlamaInputSettings.get_titles()
+else:
+    raise ValueError(
+        f"Unsupported metadata value: {metadata!r}. "
+        "Expected one of: 'author', 'affiliation', 'doi', 'title'."
+    )
```
```python
# Determine whether it is part of a list
is_numPr = paragraph.find('.//w:numPr', namespaces=paragraph.nsmap) is not None

# Get list id and level
if is_numPr:
    numPr = paragraph.find('.//w:numPr', namespaces=paragraph.nsmap)
    numId = numPr.find('.//w:numId', namespaces=paragraph.nsmap).get(namespaces_p + 'val')
    type = [(key, objt) for key, objt in list_types.items() if objt['numId'] == numId]
```
list_types can be None when the DOCX has no word/numbering.xml (see extract_numbering_info), which makes list_types.items() fail. Also, the list comprehension may return an empty list for unknown numId, so later type[0]... will raise IndexError. Consider defaulting list_types to {} and handling the “unknown numbering” case (e.g., treat as bullet or skip list typing) before indexing.
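The defensive lookup the comment describes can be sketched as a small helper. This is an illustration only: `resolve_list_type` is a hypothetical function name, and treating an unknown `numId` as a bullet list is one of the fallback options the comment mentions, not the project's actual choice.

```python
def resolve_list_type(list_types, num_id):
    """Look up a numbering definition without ever raising.

    - Defaults a missing numbering map (no word/numbering.xml) to {}.
    - Falls back to a generic 'bullet' entry for unknown numId values,
      instead of indexing into an empty match list.
    """
    list_types = list_types or {}
    matches = [(key, objt) for key, objt in list_types.items()
               if objt.get('numId') == num_id]
    if matches:
        return matches[0]
    # Unknown numbering: synthesize a bullet-style default entry.
    return (None, {'numId': num_id, 'type': 'bullet'})
```

Both failure modes (a `None` map and an unmatched `numId`) then degrade to a usable default rather than crashing the parser.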
```python
if not is_numPr:
    content.append(obj)
```
is_numPr and obj are defined only inside the if isinstance(element, CT_P): block, but are used afterwards for every element in the body. If the document body contains a table (CT_Tbl) or any non-paragraph element, this will raise UnboundLocalError. Move the content.append(obj) logic inside the CT_P branch (and add explicit CT_Tbl handling if needed).
```diff
-    if not is_numPr:
-        content.append(obj)
+        if not is_numPr:
+            content.append(obj)
+    elif isinstance(element, CT_Tbl):
+        if len(current_list) > 0:
+            current_list.append('[/list]')
+            objl = {}
+            objl['type'] = 'list'
+            objl['list'] = '\n'.join(current_list)
+            current_list = []
+            current_num_id = None
+            content.append(objl)
```
```python
    return str(self)

def __str__(self):
    return self.title
```
JournalModel.title is nullable, but __str__ returns self.title directly; Django expects __str__ to always return a string, so None will raise TypeError. Consider returning self.title or '' (or another fallback).
```diff
-        return self.title
+        return self.title or ''
```
```python
# Split on semicolon or comma
keywords = re.split(r'\s*[;,]\s*', content)
clean_keywords = [p.strip() for p in keywords if p.strip()]
clean_keywords = ", ".join(keywords)
```
clean_keywords is computed as a stripped list, but then immediately overwritten with ", ".join(keywords) (the unstripped original list), discarding the cleaning step. This likely produces trailing spaces and empty entries. Join clean_keywords (the filtered/stripped list) instead of keywords.
```diff
-    clean_keywords = ", ".join(keywords)
+    clean_keywords = ", ".join(clean_keywords)
```
What does this PR do?
Adds the base for automatically identifying front elements within markup_doc and exposes a first_block API to process the article's initial block. It includes:

- the first_block endpoint;
- DOCX processing in markup_doc;
- first_block processing and keyword extraction;
- detection of `<abstract>`, `<kwd-group>`, `<date-accepted>`, `<date-received>` during first_block processing.

Where could the review start?
By commits

How could this be tested manually?

Bring up the environment;

Load a DOCX in markup_doc;

Verify that the front elements are identified and structured in the processed output.
Any context you want to provide?
N/A
Screenshots
N/A
What are the relevant tickets?
#61
References