Add first_block API and automatic identification of front elements #62
eduranm wants to merge 11 commits into scieloorg:main
Conversation
Pull request overview
This PR adds base support for automatically identifying front elements when processing DOCX files in markup_doc, and exposes a first_block endpoint to extract metadata from the initial block, reusing the LLM utilities already present in the project.
Changes:
- Register the first_block endpoint in the API router and create the corresponding ViewSet.
- Add first_block processing plus detection/structuring of `<abstract>`, `<kwd-group>`, `<date-accepted>`, `<date-received>` in the markup_doc/tasks.py flow.
- Introduce new utilities/models/admin hooks to persist and manage the marked-up structure (Wagtail StreamFields) and to sync collections/journals via API.
Reviewed changes
Copilot reviewed 19 out of 28 changed files in this pull request and generated 23 comments.
| File | Description |
|---|---|
| model_ai/llama.py | Adjusts the Gemini flow (includes a fixed pause after generating content). |
| markuplib/function_docx.py | New DOCX parser to extract content/first_block and detect front elements. |
| markup_doc/wagtail_hooks.py | Registers SnippetViewSets/admin actions for upload/processing/sync. |
| markup_doc/tasks.py | Integrates first_block + front-element processing into the Celery pipeline. |
| markup_doc/sync_api.py | New collection/journal sync functions backed by the SciELO Core API. |
| markup_doc/models.py | New models + StreamFields for front/body/back and collection/journal entities. |
| markup_doc/labeling_utils.py | LLM utilities + keyword extraction + reference/citation processing. |
| markup_doc/marker.py | Helper functions to mark the article/references using LlamaService. |
| markup_doc/api/v1/views.py | New first_block endpoint (ViewSet) that invokes mark_article. |
| markup_doc/api/v1/serializers.py | Base serializer for ArticleDocx (currently unused in the ViewSet flow). |
| markup_doc/choices.py | Front labels + order/regex/style rules for labeling. |
| markup_doc/apps.py | AppConfig for markup_doc. |
| markup_doc/forms.py | Base Wagtail form import (placeholder). |
| markup_doc/admin.py | Admin placeholder. |
| markup_doc/tests.py | Placeholder test file. |
| markup_doc/migrations/0001_initial.py | Initial migration for the markup_doc app. |
| markup_doc/migrations/0002_alter_articledocx_estatus_and_more.py | Adjusts the estatus field to use choices. |
| fixtures/e14790.docx | DOCX fixture for tests/manual testing. |
| config/settings/base.py | Enables markup_doc and markuplib in INSTALLED_APPS. |
| config/api_router.py | Registers the first_block endpoint on the /api/v1/ router. |
```python
response_gemini = model.generate_content(user_input).text
time.sleep(15)
return response_gemini
```
The unconditional time.sleep(15) after every Gemini call will block the request/task thread even on successful responses, significantly reducing throughput and potentially causing timeouts under load. If this is meant for rate-limiting, switch to conditional backoff only on retryable errors (e.g., 429/503) and/or use the provider SDK retry options rather than sleeping on the happy path.
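The backoff-only-on-retryable-errors pattern the comment suggests can be sketched as below. This is a minimal illustration, not the project's code: `RetryableError` is a hypothetical stand-in for the SDK's rate-limit/unavailable exceptions (HTTP 429/503), and `generate_with_backoff` wraps whatever callable performs the Gemini request.

```python
import time


class RetryableError(Exception):
    """Stand-in for the provider's rate-limit/unavailable errors (HTTP 429/503)."""


def generate_with_backoff(call, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()` and return its result immediately on success.

    Sleeps (exponential backoff) only when a retryable error is raised,
    so the happy path is never blocked by a fixed delay.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RetryableError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```

A successful first call returns with zero sleeps; only simulated 429/503 failures trigger the delay.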
```python
from markup_doc.tasks import get_labels, task_sync_journals_from_api
from django.urls import path, reverse
from django.utils.html import format_html
from wagtail.admin import messages
from wagtail.admin.views import generic
```
messages is imported from both django.contrib (line 3) and wagtail.admin (line 28), and the latter overwrites the former. This makes it unclear which API is being used and can lead to subtle behavior differences. Rename one import (e.g., from django.contrib import messages as django_messages) or remove the unused one to avoid shadowing.
```python
class JournalModelCreateView(CreateView):
    def get_context_data(self, **kwargs):
        context = super().get_context_data(**kwargs)
        task_sync_journals_from_api
```
JournalModelCreateView.get_context_data() references task_sync_journals_from_api but never calls it, so the journals sync won’t run. If the intent is to trigger a Celery task, this should likely be task_sync_journals_from_api.delay() (or call the sync function directly), otherwise this line has no effect.
```diff
-        task_sync_journals_from_api
+        task_sync_journals_from_api.delay()
```
```python
try:
    obj = cls.get(title=title)
except (cls.DoesNotExist, ValueError):
    pass
```
The except ...: pass branch leaves obj undefined when the record doesn’t exist, but the method then uses obj.estatus below, which will raise UnboundLocalError. Return early (or re-raise) when the object is missing, or use update_or_create() / filter(...).update(...) to avoid needing an instance.
```diff
-            pass
+            return None
```
```python
obj['type'] = 'aff_paragraph'

if re.search(r"^(translation)", item.get('text').lower()):
    state['label'] = '<translate-fron>'
```
The label is set to '<translate-fron>', which doesn’t match any of the declared front_labels (it looks like it should be '<translate-front>'). This mismatch will prevent the value from being valid in the StreamField ChoiceBlock and may break downstream logic that expects the canonical label.
```diff
-    state['label'] = '<translate-fron>'
+    state['label'] = '<translate-front>'
```
```python
if metadata == 'affiliation':
    messages, response_format = LlamaInputSettings.get_affiliations()
if metadata == 'doi':
    messages, response_format = LlamaInputSettings.get_doi_and_section()
if metadata == 'title':
    messages, response_format = LlamaInputSettings.get_titles()
```
mark_article() uses independent if statements without an else/default, so an unexpected metadata value will leave messages/response_format undefined and crash at LlamaService(messages, ...). Consider using an if/elif/else chain and returning a clear error (or raising) when metadata is not one of the supported values.
```diff
-if metadata == 'affiliation':
-    messages, response_format = LlamaInputSettings.get_affiliations()
-if metadata == 'doi':
-    messages, response_format = LlamaInputSettings.get_doi_and_section()
-if metadata == 'title':
-    messages, response_format = LlamaInputSettings.get_titles()
+elif metadata == 'affiliation':
+    messages, response_format = LlamaInputSettings.get_affiliations()
+elif metadata == 'doi':
+    messages, response_format = LlamaInputSettings.get_doi_and_section()
+elif metadata == 'title':
+    messages, response_format = LlamaInputSettings.get_titles()
+else:
+    raise ValueError(
+        f"Unsupported metadata value: {metadata!r}. "
+        "Expected one of: 'author', 'affiliation', 'doi', 'title'."
+    )
```
```python
# Determine whether it is part of a list
is_numPr = paragraph.find('.//w:numPr', namespaces=paragraph.nsmap) is not None

# Get list id and level
if is_numPr:
    numPr = paragraph.find('.//w:numPr', namespaces=paragraph.nsmap)
    numId = numPr.find('.//w:numId', namespaces=paragraph.nsmap).get(namespaces_p + 'val')
    type = [(key, objt) for key, objt in list_types.items() if objt['numId'] == numId]
```
list_types can be None when the DOCX has no word/numbering.xml (see extract_numbering_info), which makes list_types.items() fail. Also, the list comprehension may return an empty list for unknown numId, so later type[0]... will raise IndexError. Consider defaulting list_types to {} and handling the “unknown numbering” case (e.g., treat as bullet or skip list typing) before indexing.
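The defensive lookup the comment describes can be sketched as a small helper. This is an illustration only: `resolve_list_type` is a hypothetical function name, and treating an unknown `numId` as a bullet list is one of the fallback options the comment mentions, not the project's actual choice.

```python
def resolve_list_type(list_types, num_id):
    """Look up a numbering definition without ever raising.

    - Defaults a missing numbering map (no word/numbering.xml) to {}.
    - Falls back to a generic 'bullet' entry for unknown numId values,
      instead of indexing into an empty match list.
    """
    list_types = list_types or {}
    matches = [(key, objt) for key, objt in list_types.items()
               if objt.get('numId') == num_id]
    if matches:
        return matches[0]
    # Unknown numbering: synthesize a bullet-style default entry.
    return (None, {'numId': num_id, 'type': 'bullet'})
```

Both failure modes (a `None` map and an unmatched `numId`) then degrade to a usable default rather than crashing the parser.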
```python
if not is_numPr:
    content.append(obj)
```
is_numPr and obj are defined only inside the if isinstance(element, CT_P): block, but are used afterwards for every element in the body. If the document body contains a table (CT_Tbl) or any non-paragraph element, this will raise UnboundLocalError. Move the content.append(obj) logic inside the CT_P branch (and add explicit CT_Tbl handling if needed).
```diff
-    if not is_numPr:
-        content.append(obj)
+        if not is_numPr:
+            content.append(obj)
+    elif isinstance(element, CT_Tbl):
+        if len(current_list) > 0:
+            current_list.append('[/list]')
+            objl = {}
+            objl['type'] = 'list'
+            objl['list'] = '\n'.join(current_list)
+            current_list = []
+            current_num_id = None
+            content.append(objl)
```
```python
    return str(self)

def __str__(self):
    return self.title
```
JournalModel.title is nullable, but __str__ returns self.title directly; Django expects __str__ to always return a string, so None will raise TypeError. Consider returning self.title or '' (or another fallback).
```diff
-        return self.title
+        return self.title or ''
```
```python
# Split on semicolon or comma
keywords = re.split(r'\s*[;,]\s*', content)
clean_keywords = [p.strip() for p in keywords if p.strip()]
clean_keywords = ", ".join(keywords)
```
clean_keywords is computed as a stripped list, but then immediately overwritten with ", ".join(keywords) (the unstripped original list), discarding the cleaning step. This likely produces trailing spaces and empty entries. Join clean_keywords (the filtered/stripped list) instead of keywords.
```diff
-    clean_keywords = ", ".join(keywords)
+    clean_keywords = ", ".join(clean_keywords)
```
What does this PR do?
Adds the base for automatically identifying front elements within markup_doc and exposes a first_block API to process the article's initial block. It includes:

- the first_block endpoint;
- DOCX processing in markup_doc;
- first_block processing and keyword extraction;
- detection of `<abstract>`, `<kwd-group>`, `<date-accepted>`, `<date-received>` during first_block processing.

Where could the review start?
By commits

How could this be tested manually?

Bring up the environment;

Load a DOCX in markup_doc;

Verify that the front elements are identified and structured in the processed output.
Any context you want to provide?
N/A
Screenshots
N/A
What are the relevant tickets?
#61
References