Skip to content

Agregar API first_block e identificación automática de elementos del front#62

Open
eduranm wants to merge 11 commits intoscieloorg:mainfrom
eduranm:issue-04
Open

Agregar API first_block e identificación automática de elementos del front#62
eduranm wants to merge 11 commits intoscieloorg:mainfrom
eduranm:issue-04

Conversation

@eduranm
Copy link
Copy Markdown
Contributor

@eduranm eduranm commented Apr 21, 2026

O que esse PR faz?

Agrega la base para identificar automáticamente los elementos del front dentro de markup_doc y expone una API first_block para procesar el bloque inicial del artículo.

Incluye:

  • registro del endpoint first_block;
  • incorporación de la API de markup_doc;
  • utilidades para consultar first_block y extraer palabras clave;
  • integración del procesamiento de:
    • first_block
    • <abstract>
    • <kwd-group>
    • <date-accepted>
    • <date-received>

Onde a revisão poderia começar?

Por commits

Como este poderia ser testado manualmente?

Levantar el entorno;
Cargar un DOCX en markup_doc;
Verificar que se identifiquen y estructuren los elementos del front dentro del resultado procesado.

Algum cenário de contexto que queira dar?

N/A

Screenshots

N/A

Quais são tickets relevantes?

#61

Referências

  • N/A

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Este PR agrega soporte base para identificar automáticamente elementos del front al procesar DOCX en markup_doc, y expone un endpoint first_block para extraer metadatos del bloque inicial reutilizando LLM/utilidades existentes en el proyecto.

Changes:

  • Registro del endpoint first_block en el router API y creación del ViewSet correspondiente.
  • Incorporación del procesamiento de first_block + detección/estructuración de <abstract>, <kwd-group>, <date-accepted>, <date-received> en el flujo de markup_doc/tasks.py.
  • Introducción de nuevas utilidades/modelos/admin hooks para persistir y administrar la estructura marcada (Wagtail StreamFields) y sincronización de colecciones/journals vía API.

Reviewed changes

Copilot reviewed 19 out of 28 changed files in this pull request and generated 23 comments.

Show a summary per file
File Description
model_ai/llama.py Ajuste del flujo Gemini (incluye una pausa fija tras generar contenido).
markuplib/function_docx.py Nuevo parser DOCX para extraer contenido/first_block y detectar elementos del front.
markup_doc/wagtail_hooks.py Registro de SnippetViewSets/acciones de admin para carga/procesamiento/sync.
markup_doc/tasks.py Integración del procesamiento de first_block + front elements en el pipeline Celery.
markup_doc/sync_api.py Nuevas funciones de sync de colecciones/journals desde SciELO Core API.
markup_doc/models.py Nuevos modelos + StreamFields para front/body/back y entidades de colección/journal.
markup_doc/labeling_utils.py Utilidades de LLM + extracción de keywords + procesamiento de referencias/citas.
markup_doc/marker.py Funciones auxiliares para marcar artículo/referencias usando LlamaService.
markup_doc/api/v1/views.py Nuevo endpoint first_block (ViewSet) para invocar mark_article.
markup_doc/api/v1/serializers.py Serializer base para ArticleDocx (actualmente no usado en el flujo del ViewSet).
markup_doc/choices.py Labels del front + reglas de orden/regex/estilos para etiquetado.
markup_doc/apps.py AppConfig para markup_doc.
markup_doc/forms.py Import base de form de Wagtail (placeholder).
markup_doc/admin.py Admin placeholder.
markup_doc/tests.py Archivo de tests placeholder.
markup_doc/migrations/0001_initial.py Migración inicial del app markup_doc.
markup_doc/migrations/0002_alter_articledocx_estatus_and_more.py Ajuste del campo estatus para usar choices.
fixtures/e14790.docx Fixture DOCX para pruebas/manual testing.
config/settings/base.py Habilita markup_doc y markuplib en INSTALLED_APPS.
config/api_router.py Registra el endpoint first_block en el router /api/v1/.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread model_ai/llama.py
Comment on lines +101 to +103
response_gemini = model.generate_content(user_input).text
time.sleep(15)
return response_gemini
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unconditional time.sleep(15) after every Gemini call will block the request/task thread even on successful responses, significantly reducing throughput and potentially causing timeouts under load. If this is meant for rate-limiting, switch to conditional backoff only on retryable errors (e.g., 429/503) and/or use the provider SDK retry options rather than sleeping on the happy path.

Copilot uses AI. Check for mistakes.
Comment on lines +25 to +29
from markup_doc.tasks import get_labels, task_sync_journals_from_api
from django.urls import path, reverse
from django.utils.html import format_html
from wagtail.admin import messages
from wagtail.admin.views import generic
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

messages is imported from both django.contrib (line 3) and wagtail.admin (line 28), and the latter overwrites the former. This makes it unclear which API is being used and can lead to subtle behavior differences. Rename one import (e.g., from django.contrib import messages as django_messages) or remove the unused one to avoid shadowing.

Copilot uses AI. Check for mistakes.
class JournalModelCreateView(CreateView):
def get_context_data(self, **kwargs):
context = super().get_context_data(**kwargs)
task_sync_journals_from_api
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JournalModelCreateView.get_context_data() references task_sync_journals_from_api but never calls it, so the journals sync won’t run. If the intent is to trigger a Celery task, this should likely be task_sync_journals_from_api.delay() (or call the sync function directly), otherwise this line has no effect.

Suggested change
task_sync_journals_from_api
task_sync_journals_from_api.delay()

Copilot uses AI. Check for mistakes.
Comment thread markup_doc/models.py
try:
obj = cls.get(title=title)
except (cls.DoesNotExist, ValueError):
pass
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The except ...: pass branch leaves obj undefined when the record doesn’t exist, but the method then uses obj.estatus below, which will raise UnboundLocalError. Return early (or re-raise) when the object is missing, or use update_or_create() / filter(...).update(...) to avoid needing an instance.

Suggested change
pass
return None

Copilot uses AI. Check for mistakes.
obj['type'] = 'aff_paragraph'

if re.search(r"^(translation)", item.get('text').lower()):
state['label'] = '<translate-fron>'
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The label is set to '<translate-fron>', which doesn’t match any of the declared front_labels (it looks like it should be '<translate-front>'). This mismatch will prevent the value from being valid in the StreamField ChoiceBlock and may break downstream logic that expects the canonical label.

Suggested change
state['label'] = '<translate-fron>'
state['label'] = '<translate-front>'

Copilot uses AI. Check for mistakes.
Comment thread markup_doc/marker.py
Comment on lines +11 to +16
if metadata == 'affiliation':
messages, response_format = LlamaInputSettings.get_affiliations()
if metadata == 'doi':
messages, response_format = LlamaInputSettings.get_doi_and_section()
if metadata == 'title':
messages, response_format = LlamaInputSettings.get_titles()
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark_article() uses independent if statements without an else/default, so an unexpected metadata value will leave messages/response_format undefined and crash at LlamaService(messages, ...). Consider using an if/elif/else chain and returning a clear error (or raising) when metadata is not one of the supported values.

Suggested change
if metadata == 'affiliation':
messages, response_format = LlamaInputSettings.get_affiliations()
if metadata == 'doi':
messages, response_format = LlamaInputSettings.get_doi_and_section()
if metadata == 'title':
messages, response_format = LlamaInputSettings.get_titles()
elif metadata == 'affiliation':
messages, response_format = LlamaInputSettings.get_affiliations()
elif metadata == 'doi':
messages, response_format = LlamaInputSettings.get_doi_and_section()
elif metadata == 'title':
messages, response_format = LlamaInputSettings.get_titles()
else:
raise ValueError(
f"Unsupported metadata value: {metadata!r}. "
"Expected one of: 'author', 'affiliation', 'doi', 'title'."
)

Copilot uses AI. Check for mistakes.
Comment on lines +234 to +242
# Determina si es parte de una lista
is_numPr = paragraph.find('.//w:numPr', namespaces=paragraph.nsmap) is not None

# obtiene id y nivel
if is_numPr:
numPr = paragraph.find('.//w:numPr', namespaces=paragraph.nsmap)
numId = numPr.find('.//w:numId', namespaces=paragraph.nsmap).get(namespaces_p + 'val')
type = [(key, objt) for key, objt in list_types.items() if objt['numId'] == numId]

Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list_types can be None when the DOCX has no word/numbering.xml (see extract_numbering_info), which makes list_types.items() fail. Also, the list comprehension may return an empty list for unknown numId, so later type[0]... will raise IndexError. Consider defaulting list_types to {} and handling the “unknown numbering” case (e.g., treat as bullet or skip list typing) before indexing.

Copilot uses AI. Check for mistakes.
Comment on lines +419 to +420
if not is_numPr:
content.append(obj)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_numPr and obj are defined only inside the if isinstance(element, CT_P): block, but are used afterwards for every element in the body. If the document body contains a table (CT_Tbl) or any non-paragraph element, this will raise UnboundLocalError. Move the content.append(obj) logic inside the CT_P branch (and add explicit CT_Tbl handling if needed).

Suggested change
if not is_numPr:
content.append(obj)
if not is_numPr:
content.append(obj)
elif isinstance(element, CT_Tbl):
if len(current_list) > 0:
current_list.append('[/list]')
objl = {}
objl['type'] = 'list'
objl['list'] = '\n'.join(current_list)
current_list = []
current_num_id = None
content.append(objl)

Copilot uses AI. Check for mistakes.
Comment thread markup_doc/models.py
return str(self)

def __str__(self):
return self.title
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JournalModel.title is nullable, but __str__ returns self.title directly; Django expects __str__ to always return a string, so None will raise TypeError. Consider returning self.title or '' (or another fallback).

Suggested change
return self.title
return self.title or ''

Copilot uses AI. Check for mistakes.
# Separar por punto y coma o coma
keywords = re.split(r'\s*[;,]\s*', content)
clean_keywords = [p.strip() for p in keywords if p.strip()]
clean_keywords = ", ".join(keywords)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clean_keywords is computed as a stripped list, but then immediately overwritten with ", ".join(keywords) (the unstripped original list), discarding the cleaning step. This likely produces trailing spaces and empty entries. Join clean_keywords (the filtered/stripped list) instead of keywords.

Suggested change
clean_keywords = ", ".join(keywords)
clean_keywords = ", ".join(clean_keywords)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants