Conversation
|
@agalitsyna I've added a warning when sorting by columns not defined in pairsam_format.DTYPES_PAIRSAM, defaulting them to string type, and updated the --extra-col help text to reflect this behavior. I've also added a test to verify warnings for undefined custom columns |
|
@golobor @agalitsyna @Phlya |
pairtools/lib/headerops.py
Outdated
|
|
||
|
|
||
|
|
||
| def canonicalize_columns(columns): |
There was a problem hiding this comment.
"canonicalize" -> "standardize"
There was a problem hiding this comment.
Make it a function with a single argument.
pairtools/lib/headerops.py
Outdated
|
|
||
| def canonicalize_columns(columns): | ||
| """Convert between common column name variants.""" | ||
| canonical_map = { |
There was a problem hiding this comment.
Move the dictionary to https://github.com/open2c/pairtools/blob/master/pairtools/lib/pairsam_format.py
There was a problem hiding this comment.
Remove identities (when key equals the value)
pairtools/lib/headerops.py
Outdated
| } | ||
| return [canonical_map.get(col.lower(), col) for col in columns] | ||
|
|
||
| def get_column_index(column_names, column_spec): |
There was a problem hiding this comment.
"column_spec" -> "column/col"
pairtools/lib/headerops.py
Outdated
| 'pt': 'pair_type', | ||
| 'pair_type': 'pair_type' | ||
| } | ||
| return [canonical_map.get(col.lower(), col) for col in columns] |
There was a problem hiding this comment.
canonical_map.get(col.lower(), col.lower())
pairtools/lib/headerops.py
Outdated
| except ValueError: | ||
| pass | ||
|
|
||
| # Try case-insensitive |
There was a problem hiding this comment.
Remove the case-insensittive law or explain why it's needed (we have not found the usecase; Open2C mtg 30 June 2025)
pairtools/cli/dedup.py
Outdated
|
|
||
| # Get column indices with fallbacks | ||
| try: | ||
| col1 = headerops.get_column_index(column_names, c1) |
There was a problem hiding this comment.
the names like "col1/col2/cols1" are not descriptive enough, so maybe use "col_c1/col_c2/col_s1" instead: (1) use underscore, (2) preserve the recognizable name that was used before (c1/c2/p1/p2/etc.).
Closes #264
Flexible and Canonicalized Column Handling for
dedupandsortOverview
This pull request enhances the flexibility and robustness of column handling in
pairtools, with a primary focus on improving CLI usability, internal consistency, and resilience to variations in input column names.Note: This PR also fixes some flake8 linting issues.
Key Enhancements
✅ 1. Unified Column Lookup via
headeropsheaderops.get_column_index()to allow CLI options like--c1,--c2,--p1,--p2,--s1,--s2, and--ptto accept both:1,3) — with bounds and type checks."chr1","chrom1") — supporting canonicalization and case-insensitivity.headerops.canonicalize_columns()to standardize commonly used aliases (e.g.,chr1→chrom1,pt→pair_type) across all CLI tools and internal logic.✅ 2. Improved
dedupandsortCLI Behaviorget_column_index().extra_col_pairandextra_coloptions with warnings for missing columns instead of hard failures.--pt(pair_type) option optional insort, skipping it gracefully when not present in the header.--c2indedup(was "Chrom 1 column", now corrected to "Chrom 2 column").✅ 3. Column Defaults Remain String-Based
--c1,--c2, etc.) are still defined using canonical string names (e.g.,"chr1","pos1"), not integer indices as initially planned.get_column_index()'s flexibility.✅ 4. Code Cleanup and Readability
l→line) for better readability across modules.✅ 5. Comprehensive Testing
test_headerops.pyto validate:Summary
This PR lays the groundwork for robust and user-friendly CLI interactions in
pairtools, reducing the brittleness of column name handling and allowing greater flexibility for users working with varied input formats. It introduces modular utilities (canonicalize_columns,get_column_index) that can be reused across future tools and extensions.Follow-Up Considerations
--c1,--c2, etc.) to use integer indices as per the original plan.