CM08

Supporting Document Upload

Journey: Customer Mobile

Duration: ~1 minute

AI Agent: Document Classifier

External APIs: AWS S3, AWS Textract (OCR)


User Interface Layer

Drag-and-drop file upload with document checklist

📋 Document Checklist

Business License (Optional): Trading license or permits (PDF, JPG, PNG, max 10MB)
Business Plan (Optional): Expansion or growth plans (PDF, DOCX, max 10MB)
Proof of Address (Optional): Utility bill or bank statement (PDF, JPG, max 10MB)
Additional Documents (Optional): Any other supporting documents (PDF, JPG, DOCX, max 10MB)
All Optional: No documents are strictly required; the customer can skip this screen

💡 Why All Optional?

Since we've already collected comprehensive data through CM02 (financial review via TrueLayer + Xero) and CM04 (identity verification via Onfido), additional documents are supplementary only.

Purpose: Speed up approval or provide additional context, but not required for decision.

📤 Upload Interface

Drag-and-Drop Zone: Large drop zone for each document category with hover effects
Click to Upload: Alternative click action to open file picker dialog
File Type Badges: Visual indicators of accepted formats (PDF, JPG, PNG, DOCX)
File Size Limit: "Max 10MB per file" displayed prominently
Mobile Optimization: Camera capture option for photos on mobile devices

✅ Upload Confirmation UI

Success State: Green checkmark badge replaces "Optional" badge when file uploaded
File Preview Card: Shows filename, file size, file type icon, upload timestamp
Remove Option: "×" button to delete uploaded file and restore upload zone
View Option: "View" button to preview uploaded document in modal
Upload Counter: Button text updates to "Continue with X documents"

Upload Success UI Example

<!-- After successful upload -->
<div class="file-preview">
  <div class="file-info">
    <div class="file-icon">📄</div>
    <div class="file-details">
      <div class="file-name">business-plan.pdf</div>
      <div class="file-meta">2.4 MB • Uploaded just now</div>
    </div>
  </div>
  <div class="file-actions">
    <button onclick="viewFile()">View</button>
    <button onclick="removeFile()">×</button>
  </div>
</div>

🔄 Progress & Feedback

Upload Progress Bar: Shows percentage while file is uploading
Processing Indicator: "Analyzing document..." spinner while AI classifies
Success Toast: "✓ Document uploaded successfully" notification
Error Message: Clear error messages for failed uploads (file too large, wrong type, etc.)
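
The size and type errors above can be caught in the browser before any bytes are sent. The following is a minimal sketch, assuming the per-category formats from the Document Checklist; the `proof_of_address` and `additional` keys, the `precheckFile` name, and the treatment of `.jpeg` as JPG are illustrative assumptions, not part of the documented API.

```javascript
// Accepted extensions per checklist category (from the Document Checklist).
// The keys `proof_of_address` and `additional` are hypothetical names.
const ACCEPTED = {
  business_license: ['pdf', 'jpg', 'png'],
  business_plan: ['pdf', 'docx'],
  proof_of_address: ['pdf', 'jpg'],
  additional: ['pdf', 'jpg', 'docx'],
};
const MAX_BYTES = 10 * 1024 * 1024; // "Max 10MB per file"

// Returns null when the file may be uploaded, otherwise a user-facing message.
// `file` only needs `name` and `size`, so a browser File object works as-is.
function precheckFile(category, file) {
  const accepted = ACCEPTED[category];
  if (!accepted) return 'Unknown document category';
  if (file.size > MAX_BYTES) return 'File too large. Maximum size is 10MB';

  const dot = file.name.lastIndexOf('.');
  let ext = dot === -1 ? '' : file.name.slice(dot + 1).toLowerCase();
  if (ext === 'jpeg') ext = 'jpg'; // assumption: .jpeg is accepted as JPG

  if (!accepted.includes(ext)) {
    return `Invalid file type. Please upload ${accepted.join(', ').toUpperCase()}`;
  }
  return null;
}
```

A check like this only improves feedback speed; the server-side validation remains the authority because extensions are trivially spoofed.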

🎬 Action Buttons

Continue Button: Large primary CTA, updates text based on upload count
Skip Link: "Skip - I'll add documents later" secondary link below button
Back Navigation: "←" button in header to return to CM07

API / Backend-for-Frontend Layer

Multipart file upload and document management endpoints

📤 POST /api/v1/documents/upload

Purpose: Upload document file and metadata
Content-Type: multipart/form-data (supports file uploads)
Authentication: JWT Bearer token from login session
Rate Limit: 10 uploads/minute per user
Max File Size: 10MB per file (enforced at API gateway)
Response Time: <2 seconds (includes S3 upload + OCR trigger)

Multipart Request Example

POST /api/v1/documents/upload
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
Content-Type: multipart/form-data; boundary=----WebKitFormBoundary

------WebKitFormBoundary
Content-Disposition: form-data; name="file"; filename="business-plan.pdf"
Content-Type: application/pdf

[Binary file data]
------WebKitFormBoundary
Content-Disposition: form-data; name="application_id"

APP-2025-001234
------WebKitFormBoundary
Content-Disposition: form-data; name="document_type"

business_plan
------WebKitFormBoundary
Content-Disposition: form-data; name="category"

supporting_documents
------WebKitFormBoundary--

Response Example (200 OK)

{
  "request_id": "req_cm08_20251123_150000",
  "status": "success",
  "document": {
    "id": "DOC-2025-001234-001",
    "filename": "business-plan.pdf",
    "file_size": 2457600,
    "file_type": "application/pdf",
    "document_type": "business_plan",
    "category": "supporting_documents",
    "uploaded_at": "2025-11-23T15:00:00Z",
    "view_url": "/api/v1/documents/DOC-2025-001234-001/view"
  },
  "classification": {
    "status": "processing",
    "message": "Document is being analyzed..."
  },
  "processing_time_ms": 1847
}
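
On the client, the multipart request above could be assembled roughly as follows. This is a sketch, not the screen's actual code: `buildUploadForm`, `uploadDocument`, and the `baseUrl` parameter are hypothetical names, and it relies on the `FormData`, `Blob`, and `fetch` globals available in browsers and Node.js 18+.

```javascript
// Builds the multipart body for POST /api/v1/documents/upload.
// Field names mirror the request example: file, application_id,
// document_type, category.
function buildUploadForm(file, filename, applicationId, documentType) {
  const form = new FormData();
  form.append('file', file, filename);
  form.append('application_id', applicationId);
  form.append('document_type', documentType);
  form.append('category', 'supporting_documents');
  return form;
}

// Sends the form. Content-Type is deliberately not set by hand so the
// runtime can add the multipart boundary itself.
async function uploadDocument(baseUrl, token, form) {
  const res = await fetch(`${baseUrl}/api/v1/documents/upload`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${token}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Upload failed with status ${res.status}`);
  return res.json();
}
```

Setting `Content-Type: multipart/form-data` manually is a common mistake because it omits the boundary parameter that the server needs to split the parts.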

📋 GET /api/v1/documents/list

Purpose: Retrieve all documents for an application
Query Params: application_id, category (optional)
Response: Array of document objects with metadata
Caching: 5-minute cache with cache invalidation on upload

List Request Example

GET /api/v1/documents/list?application_id=APP-2025-001234

Response:
{
  "application_id": "APP-2025-001234",
  "total_documents": 2,
  "documents": [
    {
      "id": "DOC-2025-001234-001",
      "filename": "business-plan.pdf",
      "document_type": "business_plan",
      "file_size": 2457600,
      "uploaded_at": "2025-11-23T15:00:00Z",
      "classification": "business_plan",
      "confidence": 0.95
    },
    {
      "id": "DOC-2025-001234-002",
      "filename": "trading-license.jpg",
      "document_type": "business_license",
      "file_size": 1534720,
      "uploaded_at": "2025-11-23T15:02:15Z",
      "classification": "license",
      "confidence": 0.89
    }
  ]
}
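
The 5-minute cache with invalidation on upload could look something like the sketch below. It is an in-memory illustration only: the `DocumentListCache` class is hypothetical, it caches the unfiltered list per application (ignoring the optional `category` filter), and a multi-instance deployment would presumably need a shared store instead.

```javascript
// In-memory cache for GET /api/v1/documents/list responses.
class DocumentListCache {
  constructor(ttlMs = 5 * 60 * 1000, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;           // injectable clock, handy for tests
    this.entries = new Map(); // application_id -> { value, expiresAt }
  }

  // Returns the cached list, or undefined when missing or expired.
  get(applicationId) {
    const entry = this.entries.get(applicationId);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(applicationId);
      return undefined;
    }
    return entry.value;
  }

  set(applicationId, value) {
    this.entries.set(applicationId, {
      value,
      expiresAt: this.now() + this.ttlMs,
    });
  }

  // Call after a successful upload for the application.
  invalidate(applicationId) {
    this.entries.delete(applicationId);
  }
}
```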

👁️ GET /api/v1/documents/{id}/view

Purpose: Generate pre-signed URL for secure document viewing
Security: Temporary URL expires after 1 hour
Authorization: Only document owner or assigned banker can view
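
A rough sketch of the view endpoint's core logic follows, assuming the AWS SDK for JavaScript v2 (the same SDK the other samples use). `createViewUrl`, `canView`, and the `assigned_banker_id` field are hypothetical; the documented schema does not say how the assigned banker is recorded. The S3 client is passed in so the function can be exercised with a stub.

```javascript
const VIEW_URL_TTL_SECONDS = 60 * 60; // 1-hour expiry

// Only the uploader or the banker assigned to the application may view.
// `assigned_banker_id` is an assumed field, not part of the documented schema.
function canView(user, doc) {
  return user.id === doc.uploaded_by || user.id === doc.assigned_banker_id;
}

// `s3` is expected to be an AWS SDK v2 client, e.g. new AWS.S3().
// getSignedUrl returns the URL synchronously when no callback is given.
function createViewUrl(s3, user, doc) {
  if (!canView(user, doc)) {
    const err = new Error('Forbidden');
    err.statusCode = 403;
    throw err;
  }
  return s3.getSignedUrl('getObject', {
    Bucket: doc.s3_bucket,
    Key: doc.s3_key,
    Expires: VIEW_URL_TTL_SECONDS,
  });
}
```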

🗑️ DELETE /api/v1/documents/{id}

Purpose: Remove uploaded document
Behavior: Soft delete (marks the record as deleted; the S3 file is kept for 30 days, then archived to Glacier)
Audit: Logs deletion event with reason and timestamp

🔐 Security & Validation

File Type Whitelist: Only allow PDF, JPG, PNG, DOCX (block executables, scripts)
Virus Scanning: All files scanned with ClamAV before storage
File Size Limit: 10MB enforced at API gateway + application layer
Ownership Check: Verify user owns the application before upload
Rate Limiting: Prevent abuse with 10 uploads/minute per user

⚠️ Security Note: Files are stored in S3 with encryption at rest (AES-256) and accessed only via pre-signed URLs with 1-hour expiry. Never expose direct S3 URLs to clients.
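
The 10 uploads/minute limit could be enforced with a sliding window, sketched below. This is an in-memory illustration under the assumption of a single process; the `UploadRateLimiter` class is hypothetical, and a real deployment would more likely enforce the limit at the API gateway or in a shared store.

```javascript
// Sliding-window limiter: at most `limit` uploads per `windowMs` per user.
class UploadRateLimiter {
  constructor(limit = 10, windowMs = 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map(); // userId -> timestamps (ms) of recent uploads
  }

  // Returns true and records the upload if the user is under the limit;
  // returns false (caller should answer 429) otherwise.
  tryAcquire(userId, nowMs = Date.now()) {
    const cutoff = nowMs - this.windowMs;
    const recent = (this.hits.get(userId) || []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs);
    this.hits.set(userId, recent);
    return allowed;
  }
}
```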

Business Logic Layer

AI document classification and OCR extraction

🤖 AI Document Classifier

Model: Fine-tuned DistilBERT on financial documents (87% accuracy)
Input: First page of document + filename
Output: Document category + confidence score
Categories: Business plan, license, invoice, bank statement, contract, other
Processing Time: <500ms per document

Classification Logic

// mlModel and Document are application modules defined elsewhere.
// filename is the original upload name (e.g. business-plan.pdf).
async function classifyDocument(documentId, s3Bucket, s3Key, filename) {
  // Extract text from first page using OCR
  // (positional arguments: JavaScript has no named parameters)
  const firstPageText = await extractTextFromPDF(s3Bucket, s3Key, 1);

  // Call ML model API
  const classification = await mlModel.classify({
    text: firstPageText,
    filename: filename, // filename gives the model additional context
    max_length: 512 // Token limit for DistilBERT
  });

  // Store classification result
  await Document.update(documentId, {
    classification: classification.category,
    classification_confidence: classification.confidence,
    classified_at: new Date()
  });

  return classification;
}

// Example output:
// { category: "business_plan", confidence: 0.95 }

📄 OCR Text Extraction (AWS Textract)

Service: AWS Textract for OCR and form extraction
Purpose: Extract text from images and PDFs for classification
Features: Detect text, tables, forms, key-value pairs
Accuracy: 99%+ for printed documents, 85%+ for handwritten
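Cost: $1.50 per 1,000 pages for plain text detection (DetectDocumentText); AnalyzeDocument with TABLES and FORMS is billed at a higher per-page rate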

Textract Integration

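const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const textract = new AWS.Textract();

// Returns the text of one page as a single space-separated string.
// analyzeDocument is a synchronous Textract call and handles single-page
// input; multi-page PDFs need the asynchronous StartDocumentAnalysis API.
async function extractTextFromPDF(s3Bucket, s3Key, page = 1) {
  const params = {
    Document: {
      S3Object: {
        Bucket: s3Bucket,
        Name: s3Key
      }
    },
    FeatureTypes: ['TABLES', 'FORMS']
  };

  const response = await textract.analyzeDocument(params).promise();

  // Keep LINE blocks from the requested page; a missing Page value is
  // treated as page 1
  const pageBlocks = response.Blocks.filter(
    block => block.BlockType === 'LINE' && (block.Page ?? 1) === page
  );

  return pageBlocks.map(block => block.Text).join(' ');
}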

✅ Document Validation Rules

File Type Check: MIME type must match file extension (prevent disguised executables)
Size Validation: File must be ≤10MB (reject if larger)
Virus Scan: ClamAV scan must return clean status
Image Quality: For images, minimum 800×600 resolution for OCR
PDF Integrity: PDF must be readable (not corrupted)
Duplicate Check: Warn if same file hash already uploaded

Validation Flow

const crypto = require('crypto');

const MAX_FILE_SIZE = 10 * 1024 * 1024; // 10MB
const ALLOWED_TYPES = [
  'application/pdf',
  'image/jpeg',
  'image/png',
  // DOCX is on the whitelist too
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
];

async function validateDocument(file) {
  // 1. Check file size
  if (file.size > MAX_FILE_SIZE) {
    throw new ValidationError("File too large (max 10MB)");
  }

  // 2. Check file type against the whitelist
  if (!ALLOWED_TYPES.includes(file.mimetype)) {
    throw new ValidationError("Invalid file type");
  }

  // 3. Virus scan
  const scanResult = await clamav.scanFile(file.path);
  if (scanResult.isInfected) {
    throw new SecurityError("File contains malware");
  }

  // 4. Check duplicate (warn only; the upload is still accepted)
  const fileHash = crypto.createHash('sha256').update(file.buffer).digest('hex');
  const existingDoc = await Document.findByHash(fileHash);
  if (existingDoc) {
    return { valid: true, warning: "This file was already uploaded" };
  }

  return { valid: true };
}
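
The File Type Check rule (content must match the declared type) is not shown in the flow above. One common approach is to compare the file's leading "magic" bytes with the declared MIME type, as in this sketch. `sniffMimeType` and `contentMatchesDeclaredType` are hypothetical helpers; note that the ZIP signature only shows that a file is a ZIP container, so a DOCX match is a necessary check rather than proof.

```javascript
const DOCX_MIME =
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document';

// Leading bytes for the whitelisted formats.
const SIGNATURES = [
  { mime: 'application/pdf', bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
  { mime: 'image/jpeg', bytes: [0xff, 0xd8, 0xff] },
  { mime: 'image/png', bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: DOCX_MIME, bytes: [0x50, 0x4b, 0x03, 0x04] }, // ZIP container "PK\x03\x04"
];

// Returns the MIME type implied by the leading bytes, or null if unknown.
function sniffMimeType(buffer) {
  const match = SIGNATURES.find((sig) =>
    sig.bytes.every((byte, i) => buffer[i] === byte)
  );
  return match ? match.mime : null;
}

// True when the declared MIME type agrees with the file's leading bytes.
function contentMatchesDeclaredType(buffer, declaredMime) {
  return sniffMimeType(buffer) === declaredMime;
}
```

With this in place, an executable renamed to `report.pdf` would fail the check because its first bytes are not `%PDF-`.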

🎯 Business Rules

No Documents Required: Application can proceed without any uploads (all optional)
Max Documents: 10 documents per application (prevent spam)
Classification Threshold: If confidence <60%, mark as "unclassified" for manual review
Auto-categorization: Documents auto-tagged based on classification (speeds up RM review)
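
The document cap and the classification threshold reduce to two small checks. The sketch below assumes the classifier output shape shown earlier (`{ category, confidence }`); the function names and the `needsManualReview` flag are illustrative.

```javascript
const MAX_DOCUMENTS_PER_APPLICATION = 10;
const MIN_CONFIDENCE = 0.6; // below this, route to manual review

// True while the application is still under the 10-document cap.
function canUploadAnother(existingCount) {
  return existingCount < MAX_DOCUMENTS_PER_APPLICATION;
}

// Maps a raw classifier result to the category that gets stored.
function resolveCategory(classification) {
  if (classification.confidence < MIN_CONFIDENCE) {
    return { category: 'unclassified', needsManualReview: true };
  }
  return { category: classification.category, needsManualReview: false };
}
```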

Integration & Middleware Layer

S3 storage, async processing, and notifications

☁️ AWS S3 Storage

Bucket Structure: s3://aina-docs/{application_id}/{document_id}.{ext}
Encryption: Server-side encryption (SSE-S3) with AES-256
Access: Private bucket, pre-signed URLs for viewing (1-hour expiry)
Lifecycle: Soft-deleted files moved to Glacier after 30 days, purged after 7 years
Versioning: Enabled for audit trail and recovery

S3 Upload Flow

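const path = require('path');
const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const s3 = new AWS.S3();

async function uploadToS3(file, applicationId, documentId) {
  // Key layout: {application_id}/{document_id}.{ext}
  const s3Key = `${applicationId}/${documentId}${path.extname(file.originalname)}`;

  const uploadParams = {
    Bucket: 'aina-docs',
    Key: s3Key,
    Body: file.buffer,
    ContentType: file.mimetype,
    ServerSideEncryption: 'AES256', // SSE-S3
    Metadata: {
      // S3 user metadata values must be strings
      application_id: applicationId,
      uploaded_by: String(file.userId),
      uploaded_at: new Date().toISOString()
    }
  };

  const result = await s3.upload(uploadParams).promise();

  return {
    s3_url: result.Location, // object URL; keep server-side only
    s3_key: s3Key,
    etag: result.ETag
  };
}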

⚡ Async Processing Queue

Upload Flow: File uploaded to S3 → API returns immediately → Background job processes classification
Queue: AWS SQS for reliable async processing
Workers: Lambda functions triggered by SQS messages
Retry Logic: 3 retries with exponential backoff for failed classifications
Status Updates: WebSocket notifications when classification completes
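
A worker for this queue might be shaped like the sketch below. It assumes each SQS message body is JSON describing one classification job (the exact message format is not specified here) and that the event source mapping has partial batch responses (`ReportBatchItemFailures`) enabled, so only failed messages return to the queue. `createHandler` and the injected `classify` function are hypothetical; the retry count would be governed by the queue's redrive settings rather than by this code.

```javascript
// Builds a Lambda handler for an SQS event source.
// `classify` is the job function, e.g. a wrapper around classifyDocument.
function createHandler(classify) {
  return async function handler(event) {
    const batchItemFailures = [];

    for (const record of event.Records) {
      try {
        // Assumed body shape: { "document_id": ..., "s3_bucket": ..., "s3_key": ... }
        const job = JSON.parse(record.body);
        await classify(job);
      } catch (err) {
        // Reporting the message ID tells SQS to redeliver only this message.
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    }

    return { batchItemFailures };
  };
}
```

Injecting `classify` keeps the handler free of AWS dependencies, which makes the success and failure paths easy to test locally.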

💡 Why Async Processing?

Speed: Don't make the customer wait 2-3 seconds for OCR + classification

Scalability: Process multiple documents in parallel

Reliability: SQS ensures classification happens even if a worker is temporarily down

🔔 Real-time Notifications

WebSocket: Push classification results to client when complete
Event: "document_classified" with document ID and category
UI Update: Auto-update document badge from "Processing..." to category name
Fallback: If the WebSocket is disconnected, poll GET /api/v1/documents/list every 5 seconds

WebSocket Event

{
  "event": "document_classified",
  "document_id": "DOC-2025-001234-001",
  "classification": {
    "category": "business_plan",
    "confidence": 0.95,
    "processing_time_ms": 487
  },
  "timestamp": "2025-11-23T15:00:03Z"
}
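
On the client, the event above can be folded into the document list with a small pure function, which also drives the badge text. This is a sketch: `applyClassificationEvent` and `badgeLabel` are hypothetical names, and the document objects are assumed to follow the list endpoint's shape.

```javascript
// Returns a new document list with the classified document updated.
// Unknown events and unknown document IDs leave the list unchanged.
function applyClassificationEvent(documents, event) {
  if (event.event !== 'document_classified') return documents;
  return documents.map((doc) =>
    doc.id === event.document_id
      ? {
          ...doc,
          classification: event.classification.category,
          confidence: event.classification.confidence,
        }
      : doc
  );
}

// Badge text: "Processing..." until a classification arrives.
function badgeLabel(doc) {
  return doc.classification ? doc.classification : 'Processing...';
}
```

The same function can be reused by the polling fallback, since both paths end with the same document fields being set.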

📊 Analytics & Monitoring

Upload Metrics: Track success rate, average file size, popular document types
Classification Accuracy: Monitor confidence scores, flag low-confidence docs
Performance: Track upload time, S3 latency, OCR processing time
Errors: Alert on high error rates, virus detections, failed uploads

🔐 Security Measures

Pre-signed URLs: Generate temporary S3 URLs that expire after 1 hour
CORS Policy: Restrict S3 bucket to only accept requests from AINA domain
IAM Roles: Least-privilege access for Lambda functions
Audit Logging: All document access logged to CloudTrail

External Systems Integration

AWS services for storage and OCR, plus self-hosted virus scanning

☁️ AWS S3 (Simple Storage Service)

Purpose: Secure document storage with encryption
Bucket: aina-docs (private, encrypted)
Pricing: $0.023/GB/month storage + $0.09/GB transfer out
Availability: S3 Standard is designed for 99.99% availability
Features: Versioning, lifecycle policies, encryption at rest

📄 AWS Textract

Purpose: OCR text extraction from PDFs and images
API: AnalyzeDocument for text + DetectDocumentText for simple OCR
Pricing: $1.50 per 1,000 pages (first page only for classification)
Accuracy: 99%+ for printed text, 85%+ for handwritten
Response Time: 1-3 seconds per page

Textract API Call

const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const textract = new AWS.Textract();

const params = {
  Document: {
    S3Object: {
      Bucket: 'aina-docs',
      Name: 'APP-2025-001234/DOC-001.pdf'
    }
  },
  FeatureTypes: ['TABLES', 'FORMS']
};

const result = await textract.analyzeDocument(params).promise();

// Extract text blocks
const textBlocks = result.Blocks
  .filter(b => b.BlockType === 'LINE')
  .map(b => b.Text);
  
const fullText = textBlocks.join(' ');

🛡️ ClamAV (Virus Scanner)

Purpose: Scan uploaded files for malware/viruses
Deployment: Self-hosted ClamAV on EC2 or Lambda layer
Database: Virus definitions updated daily
Processing: <1 second per file (10MB)
Action: Reject upload if virus detected, log security event

🤖 ML Model Service (Internal)

Model: Fine-tuned DistilBERT for document classification
Hosting: AWS SageMaker endpoint (ml.t3.medium instance)
Input: First 512 tokens of document text
Output: Category + confidence score
Latency: <500ms per classification
Cost: $0.05/hour compute (~$35/month for always-on)

🔄 No Other External Dependencies

Note: Unlike CM03 (TrueLayer, Xero) or CM04 (Onfido), this screen uses only internal AWS services
Benefit: Full control over data, no third-party API limits
Cost: Pay-as-you-go AWS pricing, no per-transaction fees

Data Persistence Layer

Document metadata storage and audit logging

📊 documents Table Schema

Column	Type	Description	Example
`id`	VARCHAR(50)	Primary key	DOC-2025-001234-001
`application_id`	VARCHAR(50)	Foreign key to applications	APP-2025-001234
`filename`	VARCHAR(255)	Original filename	business-plan.pdf
`file_size`	INTEGER	Size in bytes	2457600
`file_type`	VARCHAR(100)	MIME type	application/pdf
`document_type`	VARCHAR(50)	User-selected category	business_plan
`s3_bucket`	VARCHAR(100)	S3 bucket name	aina-docs
`s3_key`	VARCHAR(500)	S3 object key	APP-2025-001234/DOC-001.pdf
`s3_etag`	VARCHAR(100)	S3 ETag for integrity	"a8b3c5d7..."
`file_hash`	VARCHAR(64)	SHA-256 hash	7f4b8c...
`classification`	VARCHAR(50)	AI-determined category	business_plan
`classification_confidence`	DECIMAL(4,3)	Confidence score	0.950
`ocr_text`	TEXT	Extracted text (first page)	"Business Plan 2025..."
`virus_scan_status`	VARCHAR(20)	Clean/Infected/Pending	clean
`uploaded_by`	VARCHAR(50)	User ID	CUST-001234
`uploaded_at`	TIMESTAMP	Upload timestamp	2025-11-23 15:00:00
`classified_at`	TIMESTAMP	When AI classified	2025-11-23 15:00:03
`deleted_at`	TIMESTAMP	Soft delete timestamp	NULL

SQL: Create Table

CREATE TABLE documents (
    id VARCHAR(50) PRIMARY KEY,
    application_id VARCHAR(50) NOT NULL REFERENCES applications(id),
    filename VARCHAR(255) NOT NULL,
    file_size INTEGER NOT NULL,
    file_type VARCHAR(100) NOT NULL,
    document_type VARCHAR(50),
    s3_bucket VARCHAR(100) NOT NULL,
    s3_key VARCHAR(500) NOT NULL,
    s3_etag VARCHAR(100),
    file_hash VARCHAR(64) NOT NULL,
    classification VARCHAR(50),
    classification_confidence DECIMAL(4,3),
    ocr_text TEXT,
    virus_scan_status VARCHAR(20) DEFAULT 'pending',
    uploaded_by VARCHAR(50) NOT NULL,
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    classified_at TIMESTAMP,
    deleted_at TIMESTAMP,
    CONSTRAINT check_file_size CHECK (file_size <= 10485760) -- 10MB
);

CREATE INDEX idx_documents_app ON documents(application_id);
CREATE INDEX idx_documents_hash ON documents(file_hash);
CREATE INDEX idx_documents_uploaded ON documents(uploaded_at);

💾 Insert Document Record

SQL: Insert Document

INSERT INTO documents (
    id,
    application_id,
    filename,
    file_size,
    file_type,
    document_type,
    s3_bucket,
    s3_key,
    s3_etag,
    file_hash,
    virus_scan_status,
    uploaded_by
) VALUES (
    'DOC-2025-001234-001',
    'APP-2025-001234',
    'business-plan.pdf',
    2457600,
    'application/pdf',
    'business_plan',
    'aina-docs',
    'APP-2025-001234/DOC-001.pdf',
    'a8b3c5d7...',
    '7f4b8c...',
    'clean',
    'CUST-001234'
);

🔄 Update Classification

SQL: Update After Classification

UPDATE documents
SET
    classification = 'business_plan',
    classification_confidence = 0.950,
    ocr_text = 'Business Plan 2025 - Smith''s Artisan Café...',
    classified_at = CURRENT_TIMESTAMP
WHERE id = 'DOC-2025-001234-001';

📋 Audit Log (Elasticsearch)

JSON: Document Upload Event

{
  "event_type": "document_uploaded",
  "application_id": "APP-2025-001234",
  "document_id": "DOC-2025-001234-001",
  "customer_id": "CUST-001234",
  "timestamp": "2025-11-23T15:00:00Z",
  "document_details": {
    "filename": "business-plan.pdf",
    "file_size": 2457600,
    "file_type": "application/pdf",
    "document_type": "business_plan"
  },
  "classification": {
    "category": "business_plan",
    "confidence": 0.950,
    "classified_at": "2025-11-23T15:00:03Z"
  },
  "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0)",
  "ip_address": "86.134.x.x"
}

📊 Data Retention Policy

Active Documents: Retained permanently in S3 + PostgreSQL
Soft Deleted: Kept in S3 for 30 days, then moved to Glacier
Glacier Storage: Retained for 7 years (regulatory compliance)
Audit Logs: Retained for 7 years in Elasticsearch
OCR Text: Retained for search/classification, can be purged after 90 days

📊 Document Upload & Classification Sequence

1. Customer → Browser: Selects file
2. Browser → API Gateway: POST multipart/form-data
3. API Gateway → Validator: Validate file (size, type)
4. Validator → ClamAV: Virus scan
5. ClamAV → AWS S3: Upload to S3 (encrypted)
6. AWS S3 → PostgreSQL: Save metadata to DB
7. PostgreSQL → Customer: Return success (2s)
8. PostgreSQL → SQS Queue: Queue classification job
9. SQS Queue → Lambda: Trigger Lambda worker
10. Lambda → Textract: OCR first page
11. Textract → ML Model: Classify document
12. ML Model → PostgreSQL: Update classification
13. PostgreSQL → Customer: WebSocket notification

⚠️ Error Scenarios & Handling

File Too Large

File exceeds 10MB limit

Response: 413 Payload Too Large
UI: "File too large. Maximum size is 10MB"
Action: Reject upload, suggest compressing file

Invalid File Type

File type not in whitelist (e.g., .exe, .zip)

Response: 400 Bad Request
UI: "Invalid file type. Please upload PDF, JPG, PNG, or DOCX"
Action: Reject upload, show accepted formats

Virus Detected

ClamAV detects malware in file

Response: 403 Forbidden
UI: "Security threat detected. File cannot be uploaded"
Action: Reject upload, log security incident, alert admin

S3 Upload Failure

Network error or S3 service unavailable

Response: 503 Service Unavailable
UI: "Upload failed. Please try again"
Action: Retry 3 times with exponential backoff
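
A retry helper with exponential backoff might look like this sketch. The base delay of 200 ms and the names `backoffDelays` and `withRetry` are assumptions; only the count of 3 retries comes from the text above.

```javascript
// Delays before each retry: base, 2x base, 4x base, ...
function backoffDelays(retries, baseDelayMs) {
  return Array.from({ length: retries }, (_, i) => baseDelayMs * 2 ** i);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `operation` once, then retries up to `retries` more times on failure.
// Rethrows the last error if every attempt fails.
async function withRetry(operation, { retries = 3, baseDelayMs = 200 } = {}) {
  const delays = backoffDelays(retries, baseDelayMs);
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(delays[attempt]);
    }
  }
  throw lastError;
}
```

Usage would be along the lines of `await withRetry(() => uploadToS3(file, applicationId, documentId))`, returning 503 to the client only after the final attempt fails.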

Duplicate File

File hash matches existing document

Response: 200 OK (with warning)
UI: "⚠️ This file was already uploaded"
Action: Allow upload but show warning, link to existing doc

Classification Failed

ML model error or OCR extraction failed

Response: Document saved, classification = "unclassified"
UI: Show "Unclassified" badge instead of category
Action: Flag for manual review by RM

Low Confidence Classification

ML model confidence <60%

Response: Classification saved but flagged
UI: Show category with "?" icon
Action: Allow RM to manually reclassify if needed

Corrupted PDF

PDF cannot be opened or read

Response: 400 Bad Request
UI: "File appears corrupted. Please re-upload"
Action: Reject upload, suggest re-creating PDF

Too Many Documents

Application already has 10 documents

Response: 429 Too Many Requests
UI: "Maximum 10 documents allowed"
Action: Suggest deleting old docs before uploading new

Session Expired

JWT token expired during upload

Response: 401 Unauthorized
UI: "Session expired. Please log in again"
Action: Redirect to login, preserve file for retry
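
The scenarios above that reject the request map naturally onto a lookup table. The sketch below covers only the status codes listed in this section; the error code strings, the `toErrorResponse` name, and the error body shape are hypothetical, since the error response format is not documented here.

```javascript
// HTTP status per rejecting scenario, as listed in this section.
const UPLOAD_ERROR_STATUS = {
  file_too_large: 413,
  invalid_file_type: 400,
  virus_detected: 403,
  storage_unavailable: 503,
  corrupted_pdf: 400,
  too_many_documents: 429,
  session_expired: 401,
};

// Builds a response descriptor; unknown codes fall back to 500.
// The body shape is an assumption modelled on the success response.
function toErrorResponse(code, requestId) {
  const status = UPLOAD_ERROR_STATUS[code] ?? 500;
  return {
    status,
    body: { request_id: requestId, status: 'error', error: { code } },
  };
}
```

Duplicate files, failed classifications, and low-confidence results are not in the table because the upload itself still succeeds in those cases.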
