CM08

Supporting Document Upload

Journey: Customer Mobile
Duration: ~1 minute
AI Agent: Document Classifier
External APIs: AWS S3, AWS Textract (OCR)
1. User Interface Layer

Drag-and-drop file upload with document checklist

📋 Document Checklist

  • Business License (optional): Trading license or permits (PDF, JPG, PNG, max 10MB)
  • Business Plan (optional): Expansion or growth plans (PDF, DOCX, max 10MB)
  • Proof of Address (optional): Utility bill or bank statement (PDF, JPG, max 10MB)
  • Additional Documents (optional): Any other supporting documents (PDF, JPG, DOCX, max 10MB)
  • All Optional: No documents are strictly required; the customer can skip this screen
💡 Why All Optional?

Since we've already collected comprehensive data through CM02 (financial review via TrueLayer + Xero) and CM04 (identity verification via Onfido), additional documents are supplementary only.

Purpose: Speed up approval or provide additional context, but not required for decision.

📤 Upload Interface

  • Drag-and-Drop Zone: Large drop zone for each document category with hover effects
  • Click to Upload: Alternative click action to open file picker dialog
  • File Type Badges: Visual indicators of accepted formats (PDF, JPG, PNG, DOCX)
  • File Size Limit: "Max 10MB per file" displayed prominently
  • Mobile Optimization: Camera capture option for photos on mobile devices

✅ Upload Confirmation UI

  • Success State: Green checkmark badge replaces "Optional" badge when file uploaded
  • File Preview Card: Shows filename, file size, file type icon, upload timestamp
  • Remove Option: "×" button to delete uploaded file and restore upload zone
  • View Option: "View" button to preview uploaded document in modal
  • Upload Counter: Button text updates to "Continue with X documents"
Upload Success UI Example
<!-- After successful upload -->
<div class="file-preview">
  <div class="file-info">
    <div class="file-icon">📄</div>
    <div class="file-details">
      <div class="file-name">business-plan.pdf</div>
      <div class="file-meta">2.4 MB • Uploaded just now</div>
    </div>
  </div>
  <div class="file-actions">
    <button onclick="viewFile()">View</button>
    <button onclick="removeFile()">×</button>
  </div>
</div>

🔄 Progress & Feedback

  • Upload Progress Bar: Shows percentage while file is uploading
  • Processing Indicator: "Analyzing document..." spinner while AI classifies
  • Success Toast: "✓ Document uploaded successfully" notification
  • Error Message: Clear error messages for failed uploads (file too large, wrong type, etc.)

🎬 Action Buttons

  • Continue Button: Large primary CTA, updates text based on upload count
  • Skip Link: "Skip - I'll add documents later" secondary link below button
  • Back Navigation: "←" button in header to return to CM07
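
The Continue button text depends on how many files have been uploaded. A minimal sketch of that rule follows. The helper name `continueButtonLabel` is hypothetical, and the zero-upload and singular wordings are assumptions, since the spec only gives the "Continue with X documents" pattern.

```javascript
// Hypothetical helper: returns the primary CTA text for a given upload count.
// Only the "Continue with X documents" pattern comes from the spec; the
// zero-upload and singular wordings are assumptions.
function continueButtonLabel(uploadCount) {
  if (!Number.isInteger(uploadCount) || uploadCount < 0) {
    throw new RangeError('uploadCount must be a non-negative integer');
  }
  if (uploadCount === 0) return 'Continue without documents';
  if (uploadCount === 1) return 'Continue with 1 document';
  return `Continue with ${uploadCount} documents`;
}
```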
2. API / Backend-for-Frontend Layer

Multipart file upload and document management endpoints

📤 POST /api/v1/documents/upload

  • Purpose: Upload document file and metadata
  • Content-Type: multipart/form-data (supports file uploads)
  • Authentication: JWT Bearer token from login session
  • Rate Limit: 10 uploads/minute per user
  • Max File Size: 10MB per file (enforced at API gateway)
  • Response Time: <2 seconds (includes S3 upload + OCR trigger)
Multipart Request Example
POST /api/v1/documents/upload
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
Content-Type: multipart/form-data; boundary=----WebKitFormBoundary

------WebKitFormBoundary
Content-Disposition: form-data; name="file"; filename="business-plan.pdf"
Content-Type: application/pdf

[Binary file data]
------WebKitFormBoundary
Content-Disposition: form-data; name="application_id"

APP-2025-001234
------WebKitFormBoundary
Content-Disposition: form-data; name="document_type"

business_plan
------WebKitFormBoundary
Content-Disposition: form-data; name="category"

supporting_documents
------WebKitFormBoundary--
Response Example (200 OK)
{
  "request_id": "req_cm08_20251123_150000",
  "status": "success",
  "document": {
    "id": "DOC-2025-001234-001",
    "filename": "business-plan.pdf",
    "file_size": 2457600,
    "file_type": "application/pdf",
    "document_type": "business_plan",
    "category": "supporting_documents",
    "uploaded_at": "2025-11-23T15:00:00Z",
    "view_url": "/api/v1/documents/DOC-2025-001234-001/view"
  },
  "classification": {
    "status": "processing",
    "message": "Document is being analyzed..."
  },
  "processing_time_ms": 1847
}
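
On the client side, the multipart request could be assembled roughly as follows. This is a sketch under assumptions: the function names `buildUploadForm` and `uploadDocument` are hypothetical, the field names mirror the request example, and a runtime with global `FormData`, `Blob`, and `fetch` is assumed (browsers, Node.js 18+).

```javascript
// Builds the multipart body for POST /api/v1/documents/upload.
// Field names mirror the request example; the function name is hypothetical.
function buildUploadForm(fileBlob, filename, applicationId, documentType) {
  const form = new FormData();
  form.append('file', fileBlob, filename);
  form.append('application_id', applicationId);
  form.append('document_type', documentType);
  form.append('category', 'supporting_documents');
  return form;
}

// Sends the form with the session JWT. Content-Type is not set manually:
// fetch adds the multipart boundary itself.
async function uploadDocument(form, jwt) {
  const response = await fetch('/api/v1/documents/upload', {
    method: 'POST',
    headers: { Authorization: `Bearer ${jwt}` },
    body: form
  });
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
  return response.json();
}
```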

📋 GET /api/v1/documents/list

  • Purpose: Retrieve all documents for an application
  • Query Params: application_id, category (optional)
  • Response: Array of document objects with metadata
  • Caching: 5-minute cache with cache invalidation on upload
List Request Example
GET /api/v1/documents/list?application_id=APP-2025-001234

Response:
{
  "application_id": "APP-2025-001234",
  "total_documents": 2,
  "documents": [
    {
      "id": "DOC-2025-001234-001",
      "filename": "business-plan.pdf",
      "document_type": "business_plan",
      "file_size": 2457600,
      "uploaded_at": "2025-11-23T15:00:00Z",
      "classification": "business_plan",
      "confidence": 0.95
    },
    {
      "id": "DOC-2025-001234-002",
      "filename": "trading-license.jpg",
      "document_type": "business_license",
      "file_size": 1534720,
      "uploaded_at": "2025-11-23T15:02:15Z",
      "classification": "license",
      "confidence": 0.89
    }
  ]
}
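
The 5-minute cache with invalidation on upload could look like the sketch below. It is an in-memory illustration only: the class name `DocumentListCache` and the injectable clock are assumptions, and a deployment with several API instances would need a shared store instead.

```javascript
// In-memory TTL cache for document lists, keyed by application ID.
// The class name and the injectable clock are illustrative assumptions.
class DocumentListCache {
  constructor(ttlMs = 5 * 60 * 1000, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  // Returns the cached list, or undefined if missing or expired.
  get(applicationId) {
    const entry = this.entries.get(applicationId);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= this.ttlMs) {
      this.entries.delete(applicationId);
      return undefined;
    }
    return entry.documents;
  }

  set(applicationId, documents) {
    this.entries.set(applicationId, { documents, storedAt: this.now() });
  }

  // Call after a successful upload so the next list request is fresh.
  invalidate(applicationId) {
    this.entries.delete(applicationId);
  }
}
```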

👁️ GET /api/v1/documents/{id}/view

  • Purpose: Generate pre-signed URL for secure document viewing
  • Security: Temporary URL expires after 1 hour
  • Authorization: Only document owner or assigned banker can view
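
The authorization rule can be expressed as a small predicate that runs before any pre-signed URL is generated. This is a sketch under assumptions: the object shapes (`user.id`, `user.role`, `application.assigned_banker_id`) and the role value `'banker'` are hypothetical; only `uploaded_by` appears in this document's schema.

```javascript
// Returns true when the user may view the document.
// Assumed shapes (not from the spec): user = { id, role },
// doc = { uploaded_by }, application = { assigned_banker_id }.
function canViewDocument(user, doc, application) {
  const isOwner = doc.uploaded_by === user.id;
  const isAssignedBanker =
    user.role === 'banker' && application.assigned_banker_id === user.id;
  return isOwner || isAssignedBanker;
}

// Expiry to request when generating the pre-signed URL (1 hour per the spec).
const PRESIGNED_URL_TTL_SECONDS = 60 * 60;
```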

🗑️ DELETE /api/v1/documents/{id}

  • Purpose: Remove uploaded document
  • Behavior: Soft delete (marks the record as deleted; the S3 object is kept for 30 days, then archived to Glacier)
  • Audit: Logs deletion event with reason and timestamp

🔐 Security & Validation

  • File Type Whitelist: Only allow PDF, JPG, PNG, DOCX (block executables, scripts)
  • Virus Scanning: All files scanned with ClamAV before storage
  • File Size Limit: 10MB enforced at API gateway + application layer
  • Ownership Check: Verify user owns the application before upload
  • Rate Limiting: Prevent abuse with 10 uploads/minute per user
⚠️ Security Note: Files are stored in S3 with encryption at rest (AES-256) and accessed only via pre-signed URLs with 1-hour expiry. Never expose direct S3 URLs to clients.
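
The 10 uploads/minute limit could be enforced with a sliding-window counter such as the sketch below. The class name `UploadRateLimiter` is hypothetical, and the in-memory map is an assumption that only suits a single API instance; a shared store would be needed otherwise.

```javascript
// Sliding-window limiter: at most `limit` uploads per `windowMs` per user.
// In-memory only; a shared store would be needed across API instances.
class UploadRateLimiter {
  constructor(limit = 10, windowMs = 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.timestampsByUser = new Map();
  }

  // Returns true (and records the upload) if it is allowed at time `nowMs`.
  tryAcquire(userId, nowMs = Date.now()) {
    const cutoff = nowMs - this.windowMs;
    const recent = (this.timestampsByUser.get(userId) || []).filter(
      (t) => t > cutoff
    );
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs);
    this.timestampsByUser.set(userId, recent);
    return allowed;
  }
}
```

A request that gets `false` back would be answered with 429 Too Many Requests.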
3. Business Logic Layer

AI document classification and OCR extraction

🤖 AI Document Classifier

  • Model: Fine-tuned DistilBERT on financial documents (87% accuracy)
  • Input: First page of document + filename
  • Output: Document category + confidence score
  • Categories: Business plan, license, invoice, bank statement, contract, other
  • Processing Time: <500ms per document
Classification Logic
const path = require('path');

async function classifyDocument(documentId, s3Bucket, s3Key) {
  // Extract text from the first page using OCR (Textract reads from S3)
  const firstPageText = await extractTextFromPDF(s3Bucket, s3Key, 1);

  // Use the object name as additional context for the model
  const filename = path.posix.basename(s3Key);

  // Call ML model API
  const classification = await mlModel.classify({
    text: firstPageText,
    filename: filename,
    max_length: 512 // Token limit for DistilBERT
  });

  // Store classification result
  await Document.update(documentId, {
    classification: classification.category,
    classification_confidence: classification.confidence,
    classified_at: new Date()
  });

  return classification;
}

// Example output:
// { category: "business_plan", confidence: 0.95 }

📄 OCR Text Extraction (AWS Textract)

  • Service: AWS Textract for OCR and form extraction
  • Purpose: Extract text from images and PDFs for classification
  • Features: Detect text, tables, forms, key-value pairs
  • Accuracy: 99%+ for printed documents, 85%+ for handwritten
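  • Cost: $1.50 per 1,000 pages for plain text detection (DetectDocumentText); AnalyzeDocument with tables/forms is billed at a higher rate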
Textract Integration
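const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const textract = new AWS.Textract();

async function extractTextFromPDF(s3Bucket, s3Key, page = 1) {
  const params = {
    Document: {
      S3Object: {
        Bucket: s3Bucket,
        Name: s3Key
      }
    },
    FeatureTypes: ['TABLES', 'FORMS']
  };

  // Note: the synchronous AnalyzeDocument call is documented for single-page
  // input; multi-page PDFs need the asynchronous StartDocumentAnalysis API.
  const response = await textract.analyzeDocument(params).promise();

  // Keep LINE blocks from the requested page. Page may be absent on
  // synchronous responses, so a missing value is treated as page 1.
  const pageBlocks = response.Blocks.filter(
    block => block.BlockType === 'LINE' && (block.Page ?? 1) === page
  );

  return pageBlocks.map(block => block.Text).join(' ');
}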

✅ Document Validation Rules

  • File Type Check: MIME type must match file extension (prevent disguised executables)
  • Size Validation: File must be ≤10MB (reject if larger)
  • Virus Scan: ClamAV scan must return clean status
  • Image Quality: For images, minimum 800×600 resolution for OCR
  • PDF Integrity: PDF must be readable (not corrupted)
  • Duplicate Check: Warn if same file hash already uploaded
Validation Flow
async function validateDocument(file) {
  // 1. Check file size
  if (file.size > 10 * 1024 * 1024) {
    throw new ValidationError("File too large (max 10MB)");
  }

  // 2. Check file type against the whitelist (PDF, JPG, PNG, DOCX)
  const allowedTypes = [
    'application/pdf',
    'image/jpeg',
    'image/png',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
  ];
  if (!allowedTypes.includes(file.mimetype)) {
    throw new ValidationError("Invalid file type");
  }

  // 3. Virus scan
  const scanResult = await clamav.scanFile(file.path);
  if (scanResult.isInfected) {
    throw new SecurityError("File contains malware");
  }

  // 4. Check duplicate (warn only; the upload is still allowed)
  const fileHash = calculateSHA256(file.buffer);
  const existingDoc = await Document.findByHash(fileHash);
  if (existingDoc) {
    return { valid: true, warning: "This file was already uploaded" };
  }

  return { valid: true };
}
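
The File Type Check rule compares the MIME type with the file extension, but both values are supplied by the client. One way to strengthen it, shown here as a swapped-in technique rather than the documented rule, is to inspect the file's leading bytes. The function names are hypothetical. DOCX files are ZIP containers, so this sketch can only confirm the ZIP signature for them.

```javascript
// Known file signatures ("magic bytes") for the whitelisted formats.
const SIGNATURES = [
  { mime: 'application/pdf', bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // %PDF-
  { mime: 'image/png', bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: 'image/jpeg', bytes: [0xff, 0xd8, 0xff] },
  {
    mime: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    bytes: [0x50, 0x4b, 0x03, 0x04] // "PK", ZIP local file header
  }
];

// Returns the MIME type implied by the leading bytes, or null if unknown.
function sniffMimeType(buffer) {
  const match = SIGNATURES.find(({ bytes }) =>
    bytes.every((byte, i) => buffer[i] === byte)
  );
  return match ? match.mime : null;
}

// True when the declared MIME type agrees with the file's actual content.
function contentMatchesDeclaredType(buffer, declaredMime) {
  return sniffMimeType(buffer) === declaredMime;
}
```

A Windows executable renamed to `plan.pdf` starts with `MZ`, so `sniffMimeType` returns `null` and the upload can be rejected before the virus scan runs.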

🎯 Business Rules

  • No Documents Required: Application can proceed without any uploads (all optional)
  • Max Documents: 10 documents per application (prevent spam)
  • Classification Threshold: If confidence <60%, mark as "unclassified" for manual review
  • Auto-categorization: Documents auto-tagged based on classification (speeds up RM review)
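
The classification threshold and the document cap from the rules above can be sketched as two small helpers. The function names and the `needs_manual_review` field are hypothetical. The sketch follows the Business Rules bullet, which stores low-confidence results as "unclassified".

```javascript
const CONFIDENCE_THRESHOLD = 0.6; // below this, route to manual review
const MAX_DOCUMENTS_PER_APPLICATION = 10;

// Maps a raw model result to the category stored on the document.
function resolveCategory({ category, confidence }) {
  const needsReview = confidence < CONFIDENCE_THRESHOLD;
  return {
    category: needsReview ? 'unclassified' : category,
    confidence,
    needs_manual_review: needsReview
  };
}

// True when the application can accept one more upload.
function canUploadAnother(currentDocumentCount) {
  return currentDocumentCount < MAX_DOCUMENTS_PER_APPLICATION;
}
```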
4. Integration & Middleware Layer

S3 storage, async processing, and notifications

☁️ AWS S3 Storage

  • Bucket Structure: s3://aina-docs/{application_id}/{document_id}.{ext}
  • Encryption: Server-side encryption (SSE-S3) with AES-256
  • Access: Private bucket, pre-signed URLs for viewing (1-hour expiry)
  • Lifecycle: Soft-deleted files moved to Glacier after 30 days, purged after 7 years
  • Versioning: Enabled for audit trail and recovery
S3 Upload Flow
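const path = require('path');
const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const s3 = new AWS.S3();

async function uploadToS3(file, applicationId, documentId) {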
  const s3Key = `${applicationId}/${documentId}${path.extname(file.originalname)}`;
  
  const uploadParams = {
    Bucket: 'aina-docs',
    Key: s3Key,
    Body: file.buffer,
    ContentType: file.mimetype,
    ServerSideEncryption: 'AES256',
    Metadata: {
      application_id: applicationId,
      uploaded_by: file.userId,
      uploaded_at: new Date().toISOString()
    }
  };
  
  const result = await s3.upload(uploadParams).promise();
  
  return {
    s3_url: result.Location,
    s3_key: s3Key,
    etag: result.ETag
  };
}
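
Because the object key is built from caller-supplied values, it may be worth validating the IDs before interpolating them. A minimal sketch, assuming an ID alphabet of letters, digits, and hyphens (inferred from the example IDs, not stated in the spec) and a hypothetical helper name `buildS3Key`:

```javascript
const path = require('node:path');

// Assumed ID alphabet, inferred from the example IDs in this document.
const ID_PATTERN = /^[A-Za-z0-9-]+$/;

// Builds the object key {application_id}/{document_id}.{ext}.
// Lower-casing the extension is an assumption, not a stated rule.
function buildS3Key(applicationId, documentId, originalName) {
  if (!ID_PATTERN.test(applicationId) || !ID_PATTERN.test(documentId)) {
    throw new Error('Invalid application or document ID');
  }
  const ext = path.extname(originalName).toLowerCase();
  return `${applicationId}/${documentId}${ext}`;
}
```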

⚡ Async Processing Queue

  • Upload Flow: File uploaded to S3 → API returns immediately → Background job processes classification
  • Queue: AWS SQS for reliable async processing
  • Workers: Lambda functions triggered by SQS messages
  • Retry Logic: 3 retries with exponential backoff for failed classifications
  • Status Updates: WebSocket notifications when classification completes
💡 Why Async Processing?

Speed: Don't make the customer wait 2-3 seconds for OCR + classification

Scalability: Process multiple documents in parallel

Reliability: SQS ensures classification happens even if a worker is temporarily down
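
The retry rule (3 retries with exponential backoff) can be sketched as a generic wrapper. The 1-second base delay and the names `backoffDelayMs` and `withRetries` are assumptions. With SQS-triggered Lambda workers, redelivery is often configured on the queue itself rather than in code, so this is only one possible placement.

```javascript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ...
function backoffDelayMs(attempt, baseMs = 1000) {
  return baseMs * 2 ** (attempt - 1);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `task`, retrying up to `maxRetries` times after the initial attempt.
// `sleep` is injectable so tests can skip real waiting.
async function withRetries(
  task,
  { maxRetries = 3, baseMs = 1000, sleep = defaultSleep } = {}
) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (attempt > 0) await sleep(backoffDelayMs(attempt, baseMs));
    try {
      return await task();
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

With the defaults, a failing classification is retried after 1, 2, and 4 seconds before the error is surfaced.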

🔔 Real-time Notifications

  • WebSocket: Push classification results to client when complete
  • Event: "document_classified" with document ID and category
  • UI Update: Auto-update document badge from "Processing..." to category name
  • Fallback: If the WebSocket is disconnected, poll GET /api/v1/documents/list every 5 seconds
WebSocket Event
{
  "event": "document_classified",
  "document_id": "DOC-2025-001234-001",
  "classification": {
    "category": "business_plan",
    "confidence": 0.95,
    "processing_time_ms": 487
  },
  "timestamp": "2025-11-23T15:00:03Z"
}
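
On the client, the event can be applied to the local document list with a pure function, which keeps the badge update identical whether the data arrives over WebSocket or via the polling fallback. The function name and the `badge` field are hypothetical; the event shape follows the example above.

```javascript
// Returns a new document list with the classified document's badge updated.
// `badge` is a hypothetical UI field; unrelated events leave the list as-is.
function applyClassifiedEvent(documents, event) {
  if (event.event !== 'document_classified') return documents;
  return documents.map((doc) =>
    doc.id === event.document_id
      ? {
          ...doc,
          classification: event.classification.category,
          confidence: event.classification.confidence,
          badge: event.classification.category
        }
      : doc
  );
}
```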

📊 Analytics & Monitoring

  • Upload Metrics: Track success rate, average file size, popular document types
  • Classification Accuracy: Monitor confidence scores, flag low-confidence docs
  • Performance: Track upload time, S3 latency, OCR processing time
  • Errors: Alert on high error rates, virus detections, failed uploads

🔐 Security Measures

  • Pre-signed URLs: Generate temporary S3 URLs that expire after 1 hour
  • CORS Policy: Restrict S3 bucket to only accept requests from AINA domain
  • IAM Roles: Least-privilege access for Lambda functions
  • Audit Logging: All document access logged to CloudTrail
5. External Systems Integration

AWS services for storage, OCR, and virus scanning

☁️ AWS S3 (Simple Storage Service)

  • Purpose: Secure document storage with encryption
  • Bucket: aina-docs (private, encrypted)
  • Pricing: $0.023/GB/month storage + $0.09/GB transfer out
  • Availability: Designed for 99.99% (S3 Standard); the AWS SLA commitment is 99.9%
  • Features: Versioning, lifecycle policies, encryption at rest

📄 AWS Textract

  • Purpose: OCR text extraction from PDFs and images
  • API: AnalyzeDocument for text + DetectDocumentText for simple OCR
  • Pricing: $1.50 per 1,000 pages for DetectDocumentText (first page only for classification); AnalyzeDocument with tables/forms is billed at a higher rate
  • Accuracy: 99%+ for printed text, 85%+ for handwritten
  • Response Time: 1-3 seconds per page
Textract API Call
const textract = new AWS.Textract();

const params = {
  Document: {
    S3Object: {
      Bucket: 'aina-docs',
      Name: 'APP-2025-001234/DOC-001.pdf'
    }
  },
  FeatureTypes: ['TABLES', 'FORMS']
};

const result = await textract.analyzeDocument(params).promise();

// Extract text blocks
const textBlocks = result.Blocks
  .filter(b => b.BlockType === 'LINE')
  .map(b => b.Text);
  
const fullText = textBlocks.join(' ');

🛡️ ClamAV (Virus Scanner)

  • Purpose: Scan uploaded files for malware/viruses
  • Deployment: Self-hosted ClamAV on EC2 or Lambda layer
  • Database: Virus definitions updated daily
  • Processing: <1 second per file (10MB)
  • Action: Reject upload if virus detected, log security event

🤖 ML Model Service (Internal)

  • Model: Fine-tuned DistilBERT for document classification
  • Hosting: AWS SageMaker endpoint (ml.t3.medium instance)
  • Input: First 512 tokens of document text
  • Output: Category + confidence score
  • Latency: <500ms per classification
  • Cost: $0.05/hour compute (~$35/month for always-on)

🔄 No Other External Dependencies

  • Note: Unlike CM03 (TrueLayer, Xero) or CM04 (Onfido), this screen uses only internal AWS services
  • Benefit: Full control over data, no third-party API limits
  • Cost: Pay-as-you-go AWS pricing, no per-transaction fees
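
The unit prices quoted in this section can be combined into a rough monthly estimate. The sketch below is illustrative only: the function name and input shape are hypothetical, 730 hours per month is an assumption, and data transfer and ClamAV hosting are left out.

```javascript
// Unit prices quoted in this section (USD).
const PRICES = {
  s3StoragePerGbMonth: 0.023,
  textractPerThousandPages: 1.5,
  sagemakerPerHour: 0.05
};

// Rough monthly estimate. One Textract page per document, because only
// the first page is sent for classification.
function estimateMonthlyCostUsd({ documentsPerMonth, storedGb, hoursInMonth = 730 }) {
  const textract = (documentsPerMonth / 1000) * PRICES.textractPerThousandPages;
  const storage = storedGb * PRICES.s3StoragePerGbMonth;
  const sagemaker = hoursInMonth * PRICES.sagemakerPerHour;
  return { textract, storage, sagemaker, total: textract + storage + sagemaker };
}
```

For example, 10,000 documents per month with 100 GB stored comes to about $15 for Textract, $2.30 for storage, and $36.50 for the always-on endpoint.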
6. Data Persistence Layer

Document metadata storage and audit logging

📊 documents Table Schema

Column                    | Type          | Description                  | Example
id                        | VARCHAR(50)   | Primary key                  | DOC-2025-001234-001
application_id            | VARCHAR(50)   | Foreign key to applications  | APP-2025-001234
filename                  | VARCHAR(255)  | Original filename            | business-plan.pdf
file_size                 | INTEGER       | Size in bytes                | 2457600
file_type                 | VARCHAR(100)  | MIME type                    | application/pdf
document_type             | VARCHAR(50)   | User-selected category       | business_plan
s3_bucket                 | VARCHAR(100)  | S3 bucket name               | aina-docs
s3_key                    | VARCHAR(500)  | S3 object key                | APP-2025-001234/DOC-001.pdf
s3_etag                   | VARCHAR(100)  | S3 ETag for integrity        | "a8b3c5d7..."
file_hash                 | VARCHAR(64)   | SHA-256 hash                 | 7f4b8c...
classification            | VARCHAR(50)   | AI-determined category       | business_plan
classification_confidence | DECIMAL(4,3)  | Confidence score             | 0.950
ocr_text                  | TEXT          | Extracted text (first page)  | "Business Plan 2025..."
virus_scan_status         | VARCHAR(20)   | Clean/Infected/Pending       | clean
uploaded_by               | VARCHAR(50)   | User ID                      | CUST-001234
uploaded_at               | TIMESTAMP     | Upload timestamp             | 2025-11-23 15:00:00
classified_at             | TIMESTAMP     | When AI classified           | 2025-11-23 15:00:03
deleted_at                | TIMESTAMP     | Soft delete timestamp        | NULL
SQL: Create Table
CREATE TABLE documents (
    id VARCHAR(50) PRIMARY KEY,
    application_id VARCHAR(50) NOT NULL REFERENCES applications(id),
    filename VARCHAR(255) NOT NULL,
    file_size INTEGER NOT NULL,
    file_type VARCHAR(100) NOT NULL,
    document_type VARCHAR(50),
    s3_bucket VARCHAR(100) NOT NULL,
    s3_key VARCHAR(500) NOT NULL,
    s3_etag VARCHAR(100),
    file_hash VARCHAR(64) NOT NULL,
    classification VARCHAR(50),
    classification_confidence DECIMAL(4,3),
    ocr_text TEXT,
    virus_scan_status VARCHAR(20) DEFAULT 'pending',
    uploaded_by VARCHAR(50) NOT NULL,
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    classified_at TIMESTAMP,
    deleted_at TIMESTAMP,
    CONSTRAINT check_file_size CHECK (file_size <= 10485760) -- 10MB
);

CREATE INDEX idx_documents_app ON documents(application_id);
CREATE INDEX idx_documents_hash ON documents(file_hash);
CREATE INDEX idx_documents_uploaded ON documents(uploaded_at);

💾 Insert Document Record

SQL: Insert Document
INSERT INTO documents (
    id,
    application_id,
    filename,
    file_size,
    file_type,
    document_type,
    s3_bucket,
    s3_key,
    s3_etag,
    file_hash,
    virus_scan_status,
    uploaded_by
) VALUES (
    'DOC-2025-001234-001',
    'APP-2025-001234',
    'business-plan.pdf',
    2457600,
    'application/pdf',
    'business_plan',
    'aina-docs',
    'APP-2025-001234/DOC-001.pdf',
    'a8b3c5d7...',
    '7f4b8c...',
    'clean',
    'CUST-001234'
);

🔄 Update Classification

SQL: Update After Classification
UPDATE documents
SET
    classification = 'business_plan',
    classification_confidence = 0.950,
    ocr_text = 'Business Plan 2025 - Smith''s Artisan Café...',
    classified_at = CURRENT_TIMESTAMP
WHERE id = 'DOC-2025-001234-001';

📋 Audit Log (Elasticsearch)

JSON: Document Upload Event
{
  "event_type": "document_uploaded",
  "application_id": "APP-2025-001234",
  "document_id": "DOC-2025-001234-001",
  "customer_id": "CUST-001234",
  "timestamp": "2025-11-23T15:00:00Z",
  "document_details": {
    "filename": "business-plan.pdf",
    "file_size": 2457600,
    "file_type": "application/pdf",
    "document_type": "business_plan"
  },
  "classification": {
    "category": "business_plan",
    "confidence": 0.950,
    "classified_at": "2025-11-23T15:00:03Z"
  },
  "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0)",
  "ip_address": "86.134.x.x"
}
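
The audit event shows a partially masked IP address. A sketch of how the event might be assembled at upload time follows. The function names and the `request` shape are hypothetical, and the `classification` block is omitted because it is not yet known when the upload is logged.

```javascript
// Masks the last two octets of an IPv4 address for audit logs.
// Anything that is not a dotted IPv4 address is logged as "unknown".
function maskIpv4(ip) {
  const octets = String(ip).split('.');
  const valid =
    octets.length === 4 &&
    octets.every((o) => /^\d{1,3}$/.test(o) && Number(o) <= 255);
  return valid ? `${octets[0]}.${octets[1]}.x.x` : 'unknown';
}

// Builds the "document_uploaded" event; field names follow the JSON example.
// Assumed input: doc is a documents-table row, request = { userAgent, ip }.
function buildUploadAuditEvent(doc, request, now = new Date()) {
  return {
    event_type: 'document_uploaded',
    application_id: doc.application_id,
    document_id: doc.id,
    customer_id: doc.uploaded_by,
    timestamp: now.toISOString(),
    document_details: {
      filename: doc.filename,
      file_size: doc.file_size,
      file_type: doc.file_type,
      document_type: doc.document_type
    },
    user_agent: request.userAgent,
    ip_address: maskIpv4(request.ip)
  };
}
```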

📊 Data Retention Policy

  • Active Documents: Retained permanently in S3 + PostgreSQL
  • Soft Deleted: Kept in S3 for 30 days, then moved to Glacier
  • Glacier Storage: Retained for 7 years (regulatory compliance)
  • Audit Logs: Retained for 7 years in Elasticsearch
  • OCR Text: Retained for search/classification, can be purged after 90 days
📊 Document Upload & Classification Sequence
1. Customer → Browser: Selects file
2. Browser → API Gateway: POST multipart/form-data
3. API Gateway → Validator: Validate file (size, type)
4. Validator → ClamAV: Virus scan
5. ClamAV → AWS S3: Upload to S3 (encrypted)
6. AWS S3 → PostgreSQL: Save metadata to DB
7. PostgreSQL → Customer: Return success (2s)
8. PostgreSQL → SQS Queue: Queue classification job
9. SQS Queue → Lambda: Trigger Lambda worker
10. Lambda → Textract: OCR first page
11. Textract → ML Model: Classify document
12. ML Model → PostgreSQL: Update classification
13. PostgreSQL → Customer: WebSocket notification
⚠️ Error Scenarios & Handling

File Too Large

File exceeds 10MB limit

Response: 413 Payload Too Large
UI: "File too large. Maximum size is 10MB"
Action: Reject upload, suggest compressing file

Invalid File Type

File type not in whitelist (e.g., .exe, .zip)

Response: 400 Bad Request
UI: "Invalid file type. Please upload PDF, JPG, PNG, or DOCX"
Action: Reject upload, show accepted formats

Virus Detected

ClamAV detects malware in file

Response: 403 Forbidden
UI: "Security threat detected. File cannot be uploaded"
Action: Reject upload, log security incident, alert admin

S3 Upload Failure

Network error or S3 service unavailable

Response: 503 Service Unavailable
UI: "Upload failed. Please try again"
Action: Retry 3 times with exponential backoff

Duplicate File

File hash matches existing document

Response: 200 OK (with warning)
UI: "⚠️ This file was already uploaded"
Action: Allow upload but show warning, link to existing doc

Classification Failed

ML model error or OCR extraction failed

Response: Document saved, classification = "unclassified"
UI: Show "Unclassified" badge instead of category
Action: Flag for manual review by RM

Low Confidence Classification

ML model confidence <60%

Response: Classification saved but flagged
UI: Show category with "?" icon
Action: Allow RM to manually reclassify if needed

Corrupted PDF

PDF cannot be opened or read

Response: 400 Bad Request
UI: "File appears corrupted. Please re-upload"
Action: Reject upload, suggest re-creating PDF

Too Many Documents

Application already has the maximum of 10 documents

Response: 429 Too Many Requests
UI: "Maximum 10 documents allowed"
Action: Suggest deleting old docs before uploading new

Session Expired

JWT token expired during upload

Response: 401 Unauthorized
UI: "Session expired. Please log in again"
Action: Redirect to login, preserve file for retry
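
The rejecting scenarios above can be collected into a single lookup so that every handler returns the same status for the same failure. This is a sketch: the error-code keys, the `retryable` flag, and the error body shape are hypothetical, while the status codes are the ones listed in the scenarios.

```javascript
// HTTP status per error scenario. Keys and `retryable` are assumptions;
// only the S3 failure scenario mentions retrying.
const UPLOAD_ERRORS = {
  FILE_TOO_LARGE: { status: 413, retryable: false },
  INVALID_FILE_TYPE: { status: 400, retryable: false },
  VIRUS_DETECTED: { status: 403, retryable: false },
  S3_UPLOAD_FAILED: { status: 503, retryable: true },
  CORRUPTED_PDF: { status: 400, retryable: false },
  TOO_MANY_DOCUMENTS: { status: 429, retryable: false },
  SESSION_EXPIRED: { status: 401, retryable: false }
};

// Builds a uniform error response; unknown codes fall back to 500.
function toErrorResponse(code, requestId) {
  const known = Object.prototype.hasOwnProperty.call(UPLOAD_ERRORS, code);
  const entry = known ? UPLOAD_ERRORS[code] : { status: 500, retryable: false };
  return {
    status: entry.status,
    body: {
      request_id: requestId,
      status: 'error',
      error: { code: known ? code : 'INTERNAL_ERROR', retryable: entry.retryable }
    }
  };
}
```

Duplicate files, failed classifications, and low-confidence results are not in the table because the upload itself still succeeds in those cases.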