Skip to content

Getty supplierProcessor cleans more#4663

Open
paperboyo wants to merge 10 commits into
mainfrom
mk-more-cleaning
Open

Getty supplierProcessor cleans more#4663
paperboyo wants to merge 10 commits into
mainfrom
mk-more-cleaning

Conversation

@paperboyo
Copy link
Copy Markdown
Contributor

@paperboyo paperboyo commented Mar 14, 2026

Co-authored by Claude Opus 4.6. PR Description written by a human, 48.

What does this change?

Following from #4615, this attempts cleanup of Getty metadata:

  • it removes (Photo by/Photo credit should read [Byline]/[Credit(s)]) template from Description only when Byline and Credit(s) match in their respective fields (we ignore diacritics in matching)
  • it removes variations of Location1, Location2 - human date: from the Description. It only does it when ONE of the locations match (to account for TX/Texas and Turkey/Turkiye). It matches date strings to dateTaken and allows for one day variation within same month/year (to account for timezones, but not when month or year would clash)
  • finally (finally!), it EXTRACTS Byline from Description string for when Getty only provides it there and runs usual Byline cleaners on it (because normally they run earlier on the usual Byline fields)

The idea is: better not to clean than clean too greedily. This should first and foremost be safe.

Let's hear from a human if increasingly involved commits do not pass the vibe-craze threshold. I think it's all useful, maybe extractor even most useful (those missing Bylines are a common occurrence on the site).

image

How should a reviewer test this change?

Here are good media ids to test on:
search?nonFree=true&ids=0d458bc8819e806691f3f5283a29921359accbbb,4760414a441b59b452357fe7f284f239be858b4e,d1b0e4d58b99339d389c24cce68148fec5e0ad45,f2564b9840b462bacc9d13998845903dde3fb247,684a161fc6229aa11423090556cc71fbc56c6e4c,4f18808b8a0b9164e498f8e9cd811461905253e5,3fdbb73ddf1dba0aa9a547c917658c86716e1f09,3d61535686e1b6ed27dbf54eff18760e85d2c965,39c2163e54d5b1563347d78fa8d660670d9322f3,b4869673db4902274e50a6db8d4e1b14bf18b05e,65538756af81c296f0cc29068226975ec7c01580,f06c9abe4495803a68faef8410a91f433cadaeee,6c4fed0921e56dc8ad7ef21ae95b173e6810618a

How can success be measured?

Less manual cleanup of Descriptions in Composer.
Finally, no need to manually extract Bylines for certain intermediaries, which almost nobody does anyway (so: less missing photographer names).

Who should look at this?

You asked for it, robot: assigning.

Tested? Documented?

  • locally by committer
  • locally by Guardian reviewer
  • on the Guardian's TEST environment
  • relevant documentation added or amended (if needed)

@paperboyo paperboyo requested a review from Copilot March 14, 2026 02:18
@paperboyo paperboyo requested a review from a team as a code owner March 14, 2026 02:18
@paperboyo paperboyo added the feature Departmental tracking: work on a new feature label Mar 14, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates Getty-specific supplier processing to more safely clean up noisy Getty caption metadata (description/byline/credit), reducing manual caption editing in Composer while avoiding overly-greedy changes.

Changes:

  • Strip trailing “(Photo by … / …)” / “(Photo credit should read … / …)” from descriptions only when the byline+credit already match metadata.
  • Remove leading “LOCATION1, LOCATION2 - MONTH DAY[, YEAR] …” prefixes when at least one location matches metadata and the date aligns with dateTaken (±1 day within the same month).
  • Extract/fix photographer byline from the description when the existing byline is actually a credit component, and run standard byline cleaners on the extracted value.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
common-lib/src/main/scala/com/gu/mediaservice/lib/cleanup/SupplierProcessors.scala Adds Getty description cleanup, location/date prefix stripping, and byline/credit correction logic in GettyXmpParser.
common-lib/src/test/scala/com/gu/mediaservice/lib/cleanup/SupplierProcessorsTest.scala Adds extensive unit coverage for the new Getty cleanup behaviors and edge cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 14, 2026

// Matches: "LOC1, LOC2 - MONTH DAY:" or "LOC1, LOC2 - MONTH DAY, YEAR:" or "LOC1, LOC2 - MON DAY, YEAR - "
private val LocationDatePrefix = """(?i)^([^,]+),\s+(.+?)\s+-\s+([A-Za-z]+)\s+(\d{1,2})(?:,\s*(\d{4}))?\s*(?::\s*|-\s+)""".r

private val monthNumbers: Map[String, Int] = Map(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely this is available in Joda?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Available in Joda this surely is. But when 🤖 applied it, it was longer and it told me it’s 100% identical functionality. So I decided it’s clearer as silly strings, if only coz we can add another silly string more easily should we find it not being cleaned.

🤖:

// Joda version — needs two new imports (DateTimeFormat, Locale) and still
// requires a manual entry for "sept". Produces identical output, not shorter:

private val monthNumbers: Map[String, Int] = {
  val fullFormat = DateTimeFormat.forPattern("MMMM").withLocale(Locale.ENGLISH)
  val shortFormat = DateTimeFormat.forPattern("MMM").withLocale(Locale.ENGLISH)
  (1 to 12).flatMap { m =>
    val dt = new DateTime(2000, m, 1, 0, 0)
    List(
      fullFormat.print(dt).toLowerCase -> m,
      shortFormat.print(dt).toLowerCase -> m
    )
  }.toMap + ("sept" -> 9)
}

// vs. the explicit map — no extra imports, self-documenting, same 24 entries:

private val monthNumbers: Map[String, Int] = Map(
  "january" -> 1, "jan" -> 1, "february" -> 2, "feb" -> 2,
  "march" -> 3, "mar" -> 3, "april" -> 4, "apr" -> 4,
  "may" -> 5, "june" -> 6, "jun" -> 6, "july" -> 7, "jul" -> 7,
  "august" -> 8, "aug" -> 8, "september" -> 9, "sep" -> 9, "sept" -> 9,
  "october" -> 10, "oct" -> 10, "november" -> 11, "nov" -> 11,
  "december" -> 12, "dec" -> 12
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously, can do Joda if you still think that’s better!

dt <- dateTaken
month <- monthNumbers.get(monthStr)
} yield {
// Allow ±1 day for timezone differences, but only within the same month
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do timezones suddenly not exist at the start/end of the month?

If I take a picture on 2025-12-31T23:00:00Z but am in Sydney, it is entirely reasonable for the year, month and day not to match.

Copy link
Copy Markdown
Contributor Author

@paperboyo paperboyo May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I can even reconstruct my (always) convoluted train of thought, this was to prevent removal of eg. human-supplied date bit from January 1 2077 – Cyberpunk 8 debuts in the New Year… and leaving Date Taken which is potentially in the old year from the users’ POV (remember, Grid is stupid about timezones: we flatten than to UTC on arrival which is destructive in maybe exactly similar context and then display in user’s browser tz. That latter part I may be wrong on). Or the other way around. Or something.

Will take another look. And massive thanks!

@paperboyo
Copy link
Copy Markdown
Contributor Author

@davidfurey Thank you so much for the comments. I hope I addressed all. Wait. Not I. Robot addressed (most of) them. I ran tests. Will check TEST too.

BTW, we should allow robots talking to robots here directly, because I felt like a robot having to copy paste only because we castrated them from having github access.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Departmental tracking: work on a new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants