Getty supplierProcessor cleans more#4663
Conversation
...only one, because TX-Texas and Turkey-Turkiye etc
…er when they match then)
...for when they include the year, have month abbreviated and have hyphen instead of colon
There was a problem hiding this comment.
Pull request overview
Updates Getty-specific supplier processing to more safely clean up noisy Getty caption metadata (description/byline/credit), reducing manual caption editing in Composer while avoiding overly-greedy changes.
Changes:
- Strip trailing “(Photo by … / …)” / “(Photo credit should read … / …)” from descriptions only when the byline+credit already match metadata.
- Remove leading “LOCATION1, LOCATION2 - MONTH DAY[, YEAR] …” prefixes when at least one location matches metadata and the date aligns with
dateTaken(±1 day within the same month). - Extract/fix photographer byline from the description when the existing byline is actually a credit component, and run standard byline cleaners on the extracted value.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| common-lib/src/main/scala/com/gu/mediaservice/lib/cleanup/SupplierProcessors.scala | Adds Getty description cleanup, location/date prefix stripping, and byline/credit correction logic in GettyXmpParser. |
| common-lib/src/test/scala/com/gu/mediaservice/lib/cleanup/SupplierProcessorsTest.scala | Adds extensive unit coverage for the new Getty cleanup behaviors and edge cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| // Matches: "LOC1, LOC2 - MONTH DAY:" or "LOC1, LOC2 - MONTH DAY, YEAR:" or "LOC1, LOC2 - MON DAY, YEAR - " | ||
| private val LocationDatePrefix = """(?i)^([^,]+),\s+(.+?)\s+-\s+([A-Za-z]+)\s+(\d{1,2})(?:,\s*(\d{4}))?\s*(?::\s*|-\s+)""".r | ||
|
|
||
| private val monthNumbers: Map[String, Int] = Map( |
There was a problem hiding this comment.
Surely this is available in Joda?
There was a problem hiding this comment.
Available in Joda this surely is. But when 🤖 applied it, it was longer and it told me it’s 100% identical functionality. So I decided it’s clearer as silly strings, if only coz we can add another silly string more easily should we find it not being cleaned.
🤖:
// Joda version — needs two new imports (DateTimeFormat, Locale) and still
// requires a manual entry for "sept". Produces identical output, not shorter:
private val monthNumbers: Map[String, Int] = {
val fullFormat = DateTimeFormat.forPattern("MMMM").withLocale(Locale.ENGLISH)
val shortFormat = DateTimeFormat.forPattern("MMM").withLocale(Locale.ENGLISH)
(1 to 12).flatMap { m =>
val dt = new DateTime(2000, m, 1, 0, 0)
List(
fullFormat.print(dt).toLowerCase -> m,
shortFormat.print(dt).toLowerCase -> m
)
}.toMap + ("sept" -> 9)
}
// vs. the explicit map — no extra imports, self-documenting, same 24 entries:
private val monthNumbers: Map[String, Int] = Map(
"january" -> 1, "jan" -> 1, "february" -> 2, "feb" -> 2,
"march" -> 3, "mar" -> 3, "april" -> 4, "apr" -> 4,
"may" -> 5, "june" -> 6, "jun" -> 6, "july" -> 7, "jul" -> 7,
"august" -> 8, "aug" -> 8, "september" -> 9, "sep" -> 9, "sept" -> 9,
"october" -> 10, "oct" -> 10, "november" -> 11, "nov" -> 11,
"december" -> 12, "dec" -> 12
)There was a problem hiding this comment.
Obviously, can do Joda if you still think that’s better!
| dt <- dateTaken | ||
| month <- monthNumbers.get(monthStr) | ||
| } yield { | ||
| // Allow ±1 day for timezone differences, but only within the same month |
There was a problem hiding this comment.
Do timezones suddenly not exist at the start/end of the month?
If I take a picture on 2025-12-31T23:00:00Z but am in Sydney, it is entirely reasonable for the year, month and day not to match.
There was a problem hiding this comment.
If I can even reconstruct my (always) convoluted train of thought, this was to prevent removal of eg. human-supplied date bit from January 1 2077 – Cyberpunk 8 debuts in the New Year… and leaving Date Taken which is potentially in the old year from the users’ POV (remember, Grid is stupid about timezones: we flatten than to UTC on arrival which is destructive in maybe exactly similar context and then display in user’s browser tz. That latter part I may be wrong on). Or the other way around. Or something.
Will take another look. And massive thanks!
|
@davidfurey Thank you so much for the comments. I hope I addressed all. Wait. Not I. Robot addressed (most of) them. I ran tests. Will check TEST too. BTW, we should allow robots talking to robots here directly, because I felt like a robot having to copy paste only because we castrated them from having github access. |
Co-authored by Claude Opus 4.6. PR Description written by a human, 48.
What does this change?
Following from #4615, this attempts cleanup of Getty metadata:
(Photo by/Photo credit should read [Byline]/[Credit(s)])template from Description only when Byline and Credit(s) match in their respective fields (we ignore diacritics in matching)Location1, Location2 - human date:from the Description. It only does it when ONE of the locations match (to account for TX/Texas and Turkey/Turkiye). It matches date strings to dateTaken and allows for one day variation within same month/year (to account for timezones, but not when month or year would clash)The idea is: better not to clean than clean too greedily. This should first and foremost be safe.
Let's hear from a human if increasingly involved commits do not pass the vibe-craze threshold. I think it's all useful, maybe extractor even most useful (those missing Bylines are a common occurrence on the site).
How should a reviewer test this change?
Here are good media ids to test on:
search?nonFree=true&ids=0d458bc8819e806691f3f5283a29921359accbbb,4760414a441b59b452357fe7f284f239be858b4e,d1b0e4d58b99339d389c24cce68148fec5e0ad45,f2564b9840b462bacc9d13998845903dde3fb247,684a161fc6229aa11423090556cc71fbc56c6e4c,4f18808b8a0b9164e498f8e9cd811461905253e5,3fdbb73ddf1dba0aa9a547c917658c86716e1f09,3d61535686e1b6ed27dbf54eff18760e85d2c965,39c2163e54d5b1563347d78fa8d660670d9322f3,b4869673db4902274e50a6db8d4e1b14bf18b05e,65538756af81c296f0cc29068226975ec7c01580,f06c9abe4495803a68faef8410a91f433cadaeee,6c4fed0921e56dc8ad7ef21ae95b173e6810618aHow can success be measured?
Less manual cleanup of Descriptions in Composer.
Finally, no need to manually extract Bylines for certain intermediaries, which almost nobody does anyway (so: less missing photographer names).
Who should look at this?
You asked for it, robot: assigning.
Tested? Documented?