feat: harden CSV analysis parsing and schema inference #6

Merged
rad1092 merged 2 commits into main from codex/start-project-according-to-readme.md
Feb 13, 2026

@rad1092 rad1092 commented Feb 13, 2026

Motivation

  • Improve robustness of CSV ingestion and address incorrect dtype detection observed with pasted or semicolon/tab-delimited CSVs.
  • Provide more accurate schema inference (distinguish int, float, date, string) so generated prompts contain better numeric context.
  • Add standard deviation and unify file/text CSV parsing paths to eliminate inconsistencies between CLI and UI flows.

Description

  • Added a dedicated _parse_csv_text that uses csv.Sniffer to auto-detect delimiters (,, ;, \t, |) with a safe fallback to , and used it for both text and file-based ingestion in bitnet_tools/analysis.py.
  • Implemented stronger dtype helpers _to_int and _to_iso_date and expanded inference to emit int, float, date, or string per column, and compute std (population std) along with existing count/mean/min/q1/median/q3/max stats.
  • Kept top-value ranking and missing counts logic, and ensured numeric stats are only emitted when all non-missing values are numeric.
  • Updated tests (tests/test_analysis.py) to validate integer/date inference, presence of std, and semicolon-delimited CSV handling, and wired the CLI to use the improved parser (verified with an integration smoke run).
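
The delimiter detection described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the function name `parse_csv_text`, its signature, the 4096-character sample size, and the choice to return a list of dicts are all assumptions, since the actual `_parse_csv_text` body is not shown on this page. Only `csv.Sniffer` and the `,` fallback come from the description.

```python
import csv
import io


def parse_csv_text(text, delimiters=",;\t|"):
    """Hypothetical stand-in for _parse_csv_text (real signature not shown in the PR page)."""
    sample = text[:4096]  # assumed sample size; sniffing a prefix is enough for most files
    try:
        # Restrict the sniffer to the four candidate delimiters named in the description.
        delimiter = csv.Sniffer().sniff(sample, delimiters=delimiters).delimiter
    except csv.Error:
        # Sniffer could not decide (e.g. single-column input): fall back to a comma.
        delimiter = ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


rows = parse_csv_text("id;value;dt\n1;10;2026-01-01\n2;20;2026-01-02\n")
single = parse_csv_text("a\n1\n2\n")
```

Routing both file contents and pasted text through one function like this is what keeps the CLI and UI paths consistent: the file path only needs to read the file into a string first.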

Testing

  • Ran the unit test suite with python -m pytest -q, which reported 3 passed.
  • Performed a CLI smoke test with a semicolon-delimited CSV via python -m bitnet_tools.cli analyze /tmp/next.csv --question '요약' --out /tmp/next_payload.json and verified inferred dtypes were {'id': 'int', 'value': 'int', 'dt': 'date'}.
  • Verified payload saving and prompt generation from both file and CSV-text code paths (payload JSON written and contains expected prompt template).
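
For readers who want to see how dtypes such as {'id': 'int', 'value': 'int', 'dt': 'date'} could be derived, here is a rough sketch of per-column inference plus a population standard deviation. The helper names (`to_int`, `to_float`, `to_iso_date`, `infer_dtype`, `population_std`) and the int, float, date, string precedence are assumptions for illustration; the PR's own `_to_int` and `_to_iso_date` implementations are not shown on this page and may differ.

```python
import statistics
from datetime import date


def to_int(v):
    """Return int(v), or None if v is not an integer literal (hypothetical helper)."""
    try:
        return int(v)
    except ValueError:
        return None


def to_float(v):
    """Return float(v), or None if v does not parse as a float (hypothetical helper)."""
    try:
        return float(v)
    except ValueError:
        return None


def to_iso_date(v):
    """Return a date for ISO strings such as 2026-01-01, else None (hypothetical helper)."""
    try:
        return date.fromisoformat(v)
    except ValueError:
        return None


def infer_dtype(values):
    """Infer one dtype per column; a type is chosen only if every non-missing cell matches."""
    present = [v.strip() for v in values if v and v.strip()]
    if not present:
        return "string"
    for name, parse in (("int", to_int), ("float", to_float), ("date", to_iso_date)):
        if all(parse(v) is not None for v in present):
            return name
    return "string"


def population_std(values):
    """Population standard deviation (divides by N, not N - 1) of the non-missing cells."""
    numbers = [float(v) for v in values if v and v.strip()]
    return statistics.pstdev(numbers)


columns = {"id": ["1", "2"], "value": ["10", "20"], "dt": ["2026-01-01", "2026-01-02"]}
dtypes = {name: infer_dtype(cells) for name, cells in columns.items()}
```

Requiring every non-missing cell to match mirrors the rule in the description that numeric stats are emitted only when all non-missing values are numeric.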

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe775e6d32

Comment thread bitnet_tools/analysis.py
    if not v:
        return None
    try:
        return float(v)

P1: Reject non-finite floats during CSV numeric parsing

Parsing values with float(v) currently accepts tokens like NaN, inf, and -inf as numeric values, but those later cause statistics.pstdev to raise ValueError (e.g., "cannot convert NaN to integer ratio") when building numeric stats. This makes build_analysis_payload* fail for common CSVs that encode missing numeric values as NaN, breaking both CLI analyze and the /api/analyze UI flow instead of returning a summary.
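
One way to address this, sketched under the assumption that the helper receives a raw string cell and returns None for anything that should not count as numeric, is to check the parsed value with `math.isfinite`. The name `to_finite_float` is hypothetical; the helper's real name is not visible in the excerpt above.

```python
import math


def to_finite_float(v):
    """Parse a CSV cell as a float, rejecting NaN and infinities (hypothetical helper)."""
    if v is None:
        return None
    v = v.strip()
    if not v:
        return None
    try:
        x = float(v)
    except ValueError:
        return None
    # float() accepts "NaN", "inf" and "-inf"; treat them as non-numeric here.
    return x if math.isfinite(x) else None


parsed = [to_finite_float(s) for s in ["1.5", "NaN", "inf", "-inf", "", "abc"]]
```

With this variant a column containing NaN tokens would no longer be classified as numeric. Whether such tokens should instead be counted as missing values, so the rest of the column still gets numeric stats, is a separate design choice left to the author.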

@rad1092 rad1092 merged commit 4e58ee9 into main Feb 13, 2026
0 of 4 checks passed
@rad1092 rad1092 deleted the codex/start-project-according-to-readme.md branch February 13, 2026 08:40