
🏛️ Snowflake Cortex Intelligence Pipeline

Document AI · Automated Compliance Checking · 67 Counties · Data Never Leaves Snowflake



🧭 The Story — What Problem This Solves

Florida has 67 counties. Each county submits an annual property tax ratio study (Report FF09) to the Department of Revenue. Each report is a complex document — tables, statistical ratios, compliance thresholds, confidence intervals, assessor notes.

Every year, DOR analysts must:

  1. Open each document manually
  2. Find the COD (Coefficient of Dispersion) and PRD (Price-Related Differential) values
  3. Check them against IAAO standards (different thresholds per property type)
  4. Write a compliance determination
  5. Generate an executive summary memo for county officials

For 67 counties × 10 years of historical data = 670 documents. All manual.

This pipeline automates the entire workflow using Snowflake Cortex AI — and the data never leaves Snowflake's security boundary.


🏗️ The 11-Module Pipeline

MODULE 00: Environment Check
  SET ROLE · SET WAREHOUSE · Verify Cortex accessible
         ↓
MODULE 01: Session Schema & Stage Setup
  Unique session ID → SESSION_XXXXXXXX schema
  Encrypted internal stage (SNOWFLAKE_SSE)
  Copy documents from shared stage
         ↓
MODULE 02: Parse Document — OCR Mode
  SNOWFLAKE.CORTEX.PARSE_DOCUMENT → plain text
  Good for: simple extraction
  Limitation: no table structure preserved
         ↓
MODULE 03: Parse Document — Layout Mode
  SNOWFLAKE.CORTEX.PARSE_DOCUMENT → Markdown
  Tables preserved with | column | separators |
  Best choice for AI downstream prompts
         ↓
MODULE 04: Parse Document — Page Split
  page_split: TRUE → one row per page
  Enables parallel processing at scale
  67 counties × 10 years = parallelized
         ↓
MODULE 05: Store & Catalog
  CREATE TABLE parsed_documents
  county · report_year · text_content · page_count
  CLUSTER BY (county, report_year)
         ↓
MODULE 06: AI Extract — Structured Metrics
  CORTEX.COMPLETE → JSON extraction
  county · total_parcels · overall_ratio · strata data
         ↓
MODULE 07: AI Analysis — Compliance Check
  CORTEX.COMPLETE → IAAO standards check
  COD thresholds · PRD bounds · Risk level
         ↓
MODULE 08: AI Classification
  AI_CLASSIFY → property focus type
  Residential · Commercial · Agricultural · Mixed
         ↓
MODULE 09: AI Executive Summary
  CORTEX.COMPLETE → government memo format
  Overview · Key Findings · Concerns · Recommendations
         ↓
MODULE 10: Enrichment Pipeline
  All AI combined into enriched_ratio_study table
  compliance_status · risk_level · key_metrics_json
         ↓
MODULE 11: Cleanup
  DROP SCHEMA → removes all session data

💻 Key Code Patterns

Parse a Government Document

-- LAYOUT mode preserves table structure for AI
SELECT
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
        '@document_stage',
        'county_ratio_study.docx',
        {'mode': 'LAYOUT'}
    ):content::VARCHAR AS document_markdown;

AI Compliance Check

-- Claude Sonnet checks against IAAO standards
SELECT
    county,
    TRIM(SNOWFLAKE.CORTEX.COMPLETE(
        'claude-4-sonnet',
        CONCAT(
            'Based on IAAO standards (COD < 15% residential, ',
            '< 20% commercial; PRD between 0.98 and 1.03), ',
            'is this county IN COMPLIANCE? ',
            'Answer with EXACTLY one word: ',
            'COMPLIANT, NEEDS_REVIEW, or NON_COMPLIANT.\n\n',
            'Document:\n', LEFT(text_content, 4000)
        )
    )) AS compliance_status
FROM parsed_documents;

AI Executive Summary Memo

-- Generates professional government memo automatically
SELECT
    SNOWFLAKE.CORTEX.COMPLETE(
        'claude-4-sonnet',
        CONCAT(
            'You are a senior analyst at the Florida Department of Revenue. ',
            'Generate an executive summary memo with sections: ',
            'OVERVIEW · KEY FINDINGS · STRATUM PERFORMANCE · ',
            'AREAS OF CONCERN · RECOMMENDATIONS\n\n',
            'Document:\n', text_content
        )
    ) AS executive_summary
FROM parsed_documents;

Full Enrichment Pipeline (1 query, 4 AI outputs)

CREATE OR REPLACE TABLE enriched_ratio_study AS
SELECT
    county, report_year,
    AI_CLASSIFY(text_content, [
        'Residential Focus', 'Commercial Focus',
        'Mixed Property Types', 'Agricultural Focus'
    ]):labels[0]::VARCHAR                AS primary_property_focus,
    SNOWFLAKE.CORTEX.COMPLETE('claude-4-sonnet', ...) AS key_metrics_json,
    TRIM(SNOWFLAKE.CORTEX.COMPLETE(...)) AS compliance_status,
    TRIM(SNOWFLAKE.CORTEX.COMPLETE(...)) AS risk_level
FROM parsed_documents;
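
Because `key_metrics_json` is free-form model output, it is worth validating immediately after the enrichment run. A minimal sketch using `TRY_PARSE_JSON`, which returns NULL instead of raising an error on malformed JSON:

```sql
-- Flag enriched rows whose AI-generated JSON failed to parse
SELECT county, report_year, key_metrics_json
FROM enriched_ratio_study
WHERE TRY_PARSE_JSON(key_metrics_json) IS NULL;
```

Any row this query returns carried a silent extraction failure into the enriched table.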

🔒 Security Analysis — What's Safe & What Needs Fixing

This is the most valuable part of this project — not just building it, but auditing it.

✅ What The Lab Gets Right

| Control | Implementation | Why It Matters |
|---|---|---|
| RBAC | `USE ROLE DOR_LAB_ROLE` | Least-privilege access |
| Encryption at rest | `SNOWFLAKE_SSE` on stage | Documents encrypted in storage |
| Data residency | Cortex runs inside Snowflake | Data never reaches Anthropic/Meta servers |
| Session isolation | Unique `SESSION_XXXXXXXX` schema | Users cannot see each other's data |

❌ Five Gaps for Production Government Use

| Gap | Risk Level | Fix Required |
|---|---|---|
| No PII masking | 🔴 HIGH | Dynamic data masking on the `text_content` column |
| No network policy | 🔴 HIGH | Agency IP allowlist — restrict access to VPN/agency network only |
| No MFA enforcement | 🔴 HIGH | Account-level MFA policy (FISMA requirement) |
| Incomplete audit logging | 🟡 MEDIUM | Access History + Object Tagging + SIEM integration |
| No AI model governance | 🟡 MEDIUM | Model cards + version logging + human review gate |

The Fix for Gap 1 — PII Masking

-- Add this before going to production:
CREATE MASKING POLICY mask_document_text AS
(val STRING) RETURNS STRING ->
  CASE 
    WHEN CURRENT_ROLE() IN ('DOR_ADMIN_ROLE') THEN val
    ELSE '*** DOCUMENT RESTRICTED — INSUFFICIENT PRIVILEGES ***'
  END;

ALTER TABLE parsed_documents
  MODIFY COLUMN text_content
  SET MASKING POLICY mask_document_text;
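
The Fix for Gap 2 — Network Policy

A sketch of the agency IP allowlist — the CIDR range below is a documentation-reserved placeholder and must be replaced with the agency's actual VPN/egress range:

```sql
-- Restrict account access to the agency network only
-- (203.0.113.0/24 is a placeholder range, not a real agency address)
CREATE NETWORK POLICY dor_agency_only
  ALLOWED_IP_LIST = ('203.0.113.0/24');

ALTER ACCOUNT SET NETWORK_POLICY = dor_agency_only;
```

Once applied at the account level, logins from outside the allowlist are refused regardless of credentials.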

The Hidden Risk Nobody Talks About — Time Travel

-- Module 11 runs: DROP SCHEMA IF EXISTS
-- But Snowflake Time Travel retains data for 90 days by default!
-- "Deleted" data is still recoverable by an ACCOUNTADMIN.

-- The correct production cleanup:
ALTER SCHEMA IDENTIFIER($lab_schema)
  SET DATA_RETENTION_TIME_IN_DAYS = 0;  -- disable Time Travel FIRST
DROP SCHEMA IF EXISTS IDENTIFIER($lab_schema);  -- now truly deleted

The Model Versioning Problem

-- Current code: no model version recorded
-- Risk: the same query run in July vs. August may give a different
--       compliance answer if Snowflake silently updates the Claude model

-- Production fix: always log model + timestamp
ALTER TABLE enriched_ratio_study ADD COLUMN
  model_version VARCHAR DEFAULT 'claude-4-sonnet-20250514';
-- Snowflake only allows constant defaults when adding a column,
-- so record the inference timestamp at write time instead:
ALTER TABLE enriched_ratio_study ADD COLUMN
  inference_timestamp TIMESTAMP_NTZ;
UPDATE enriched_ratio_study
  SET inference_timestamp = CURRENT_TIMESTAMP()
  WHERE inference_timestamp IS NULL;

🧠 The Toughest Questions This Project Raised

These are questions that stump even senior engineers:

  1. Output validation gap: TRIM(CORTEX.COMPLETE(...)) expects exactly one word back. What if the model returns "The county appears to be COMPLIANT" instead of "COMPLIANT"? The enriched table stores a wrong label silently — no error, no alert.

  2. Context truncation risk: LEFT(text_content, 4000) cuts the document at 4,000 characters. What if the critical COD/PRD values for Stratum 4 are on page 3 — past the cutoff? The compliance decision was made on incomplete data.

  3. Legal liability: If AI incorrectly labels a county as COMPLIANT when it should be NON_COMPLIANT, and DOR uses that to make a legal determination — who is liable? There is no human review gate in the current pipeline.

  4. Model drift: The same query run today vs. in 3 months may give a different compliance result if Snowflake updates the Claude model. Government records must be reproducible. How do you version-control AI outputs?
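
Question 1 has a straightforward mitigation: never trust the raw completion as a label. A sketch of a whitelist guard around the Module 07 compliance prompt — here `compliance_prompt` is a hypothetical stand-in for the CONCAT(...) prompt shown earlier:

```sql
-- Whitelist-validate the model's one-word answer; route anything
-- else to human review instead of storing a wrong label silently
SELECT
    county,
    CASE
        WHEN raw_status IN ('COMPLIANT', 'NEEDS_REVIEW', 'NON_COMPLIANT')
            THEN raw_status
        ELSE 'VALIDATION_FAILED'   -- alert + human review queue
    END AS compliance_status
FROM (
    SELECT
        county,
        -- compliance_prompt = the CONCAT(...) prompt from Module 07
        TRIM(SNOWFLAKE.CORTEX.COMPLETE('claude-4-sonnet', compliance_prompt)) AS raw_status
    FROM parsed_documents
);
```

The `ELSE` branch turns a silent mislabel into a visible, queryable failure state.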


📁 Project Files

| File | Description |
|---|---|
| `fl_dor_lab.sql` | Complete 11-module Snowflake SQL lab |

🔮 Production Roadmap

CURRENT (Lab):
  1 county · 1 document · Manual trigger · No masking

NEXT (Pilot):
  3 counties · Batch processing · Masked PII
  Network policy · MFA enforced · Model version logging

FULL PRODUCTION:
  67 counties × 10 years = 670 documents
  Snowflake Task (automated trigger on new file upload)
  Stream on parsed_documents → detect new inserts
  Power BI / Tableau connected to enriched_ratio_study
  Full audit trail · Human review gate · SIEM integrated
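
The Stream + Task trigger in the roadmap can be sketched as below — the warehouse name and schedule are illustrative, and the INSERT body would be the Module 10 enrichment query:

```sql
-- Detect new parsed documents and enrich them automatically
CREATE OR REPLACE STREAM new_documents_stream
  ON TABLE parsed_documents;

CREATE OR REPLACE TASK enrich_new_documents
  WAREHOUSE = dor_wh                 -- illustrative warehouse name
  SCHEDULE  = '5 MINUTE'             -- polls; runs only when the stream has rows
  WHEN SYSTEM$STREAM_HAS_DATA('NEW_DOCUMENTS_STREAM')
AS
  INSERT INTO enriched_ratio_study
  SELECT ...                         -- body elided: Module 10 enrichment query
  FROM new_documents_stream;

ALTER TASK enrich_new_documents RESUME;  -- tasks are created suspended
```

Consuming the stream inside the task advances its offset, so each document is enriched exactly once.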

💡 Key Learnings

"AI inside your secure environment is fundamentally different from AI in the public cloud. But 'inside' doesn't mean 'safe' — it means you've solved one of five problems."

  1. SNOWFLAKE_SSE encryption solves storage security — not access control, not audit, not AI governance
  2. Data residency (Cortex running inside Snowflake) is the most important security property for government use
  3. TRY_PARSE_JSON returning NULL on AI output failure is silent — always validate AI outputs
  4. Time Travel is a compliance liability if you DROP schemas containing confidential documents without setting DATA_RETENTION_TIME_IN_DAYS = 0 first
  5. Model versioning isn't optional in government AI — it's a reproducibility and legal requirement

Author: Ashok Ankalla — Enterprise Data & AI Transformation Leader. Analyzed and extended with production security design · 2026
