OCR & PDF Processing
The LoanPilot platform uses an OCR and document processing pipeline to transform unstructured PDF documents into structured financial data. The system combines client-side preprocessing with server-side AI extraction to improve accuracy across bank statements, financial records, and tax returns.
Processing Pipeline Overview
The document processing lifecycle follows a four-stage workflow:
- Ingestion & Storage: Files are uploaded via api/documents/upload and stored using a smart storage service (supporting S3 or local fallbacks).
- Preprocessing: PDFs are prepared for analysis. For scanned documents, the system utilizes pdf.js for rendering and Tesseract.js for initial text layer orientation.
- AI Extraction: Document buffers are sent to the AIProvider (e.g., OpenAI, Anthropic, or specialized OCR models) to extract specific financial schemas.
- Reconciliation: Extracted data is cross-referenced across different document types (e.g., verifying Bank Statement credits against GST Revenue).
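The ordering of the four stages can be pictured as a simple chain in which each stage receives the context produced by the previous one. The sketch below only illustrates that ordering: every type and function name in it (`StageName`, `PipelineContext`, `runPipeline`, and so on) is hypothetical and not taken from the LoanPilot codebase, and each stage merely records that it ran instead of doing real work.

```typescript
// Hypothetical names throughout; this is an illustration of stage ordering only.
type StageName = "ingestion" | "preprocessing" | "extraction" | "reconciliation";

interface PipelineContext {
  documentId: string;
  completed: StageName[];
}

interface Stage {
  name: StageName;
  run: (ctx: PipelineContext) => PipelineContext;
}

// Each placeholder stage appends its own name to the context. A real stage
// would upload the file, render pages, call the AI provider, and so on.
function makeStage(name: StageName): Stage {
  return {
    name,
    run: (ctx: PipelineContext): PipelineContext => ({
      ...ctx,
      completed: [...ctx.completed, name],
    }),
  };
}

const STAGES: Stage[] = [
  makeStage("ingestion"),
  makeStage("preprocessing"),
  makeStage("extraction"),
  makeStage("reconciliation"),
];

// Runs the stages in order, threading the context through each one.
function runPipeline(documentId: string, stages: Stage[] = STAGES): PipelineContext {
  const initial: PipelineContext = { documentId, completed: [] };
  return stages.reduce((ctx, stage) => stage.run(ctx), initial);
}
```

Calling `runPipeline("doc-1")` returns a context whose `completed` array lists the four stages in the order described above.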
Supported Document Types
The system includes specialized extractors for the following categories:
| Document Type | Purpose | Key Metrics Extracted |
| :--- | :--- | :--- |
| Bank Statements | Cash flow & Fraud analysis | Daily balances, transaction categories, EMI detection, fraud flags. |
| Financial Statements | Solvency & Performance | P&L (Revenue, EBITDA), Balance Sheet (Assets, Liabilities, Equity). |
| GST Returns | Revenue Verification | Monthly turnover, tax liability, input tax credits. |
| ITR (Income Tax) | Income Verification | Gross total income, tax paid, filing consistency. |
Analysis API Reference
1. Bank Statement Analysis
Processes bank statements to determine repayment capacity and detect fraudulent patterns.
Endpoint: POST /api/analyze/bank-statement
Request Body:
{
  "documentId": "uuid",
  "applicationId": "uuid",
  "organizationId": "uuid",
  "aiProvider": "openai" | "anthropic" // Optional
}
Key Response Fields:
- summary: Average balance, net income, total credits/debits.
- enhanced_analysis: Fraud risk scores, EMI-to-inflow ratios, and cashflow volatility.
- transactions: Structured list of categorized transactions.
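As a rough illustration of calling this endpoint, the sketch below builds the URL and request options for a JSON POST. The field names and the endpoint path come from this section; the helper `prepareBankStatementRequest`, the `PreparedRequest` type, the base URL, and the `Content-Type: application/json` header are assumptions, and any authentication the real API requires is not shown.

```typescript
// Field names mirror the request body documented above.
// The helper and its types are hypothetical, not part of LoanPilot.
type AIProviderName = "openai" | "anthropic";

interface BankStatementRequest {
  documentId: string;
  applicationId: string;
  organizationId: string;
  aiProvider?: AIProviderName; // optional, as in the documented body
}

interface PreparedRequest {
  url: string;
  init: {
    method: "POST";
    headers: Record<string, string>;
    body: string;
  };
}

function prepareBankStatementRequest(
  baseUrl: string,
  payload: BankStatementRequest
): PreparedRequest {
  // Drop trailing slashes so the joined URL has exactly one separator.
  const root = baseUrl.replace(/\/+$/, "");
  return {
    url: `${root}/api/analyze/bank-statement`,
    init: {
      method: "POST",
      // Assumption: the endpoint accepts a JSON body.
      headers: { "Content-Type": "application/json" },
      // JSON.stringify omits aiProvider when it is undefined.
      body: JSON.stringify(payload),
    },
  };
}
```

The returned `url` and `init` can be passed directly to `fetch(url, init)` in a browser or in a recent Node.js runtime; keeping request construction separate from the network call makes the payload easy to inspect in tests.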
2. Financial Statement Analysis
Extracts data from Balance Sheets, Profit & Loss statements, or ITR filings.
Endpoint: POST /api/analyze/financial-statement
Request Body:
{
  "documentId": "uuid",
  "applicationId": "uuid",
  "organizationId": "uuid",
  "statementType": "balance_sheet" | "profit_loss" | "itr"
}
Output Ratios: The API automatically calculates and returns:
- dscr (Debt Service Coverage Ratio)
- current_ratio & quick_ratio
- debt_to_equity_ratio
- net_profit_margin
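This section does not state how the API derives these ratios. The sketch below uses the common textbook definitions as an assumption, so the platform's actual formulas may differ (for example in what counts as debt service or which liabilities enter the debt-to-equity ratio). The input type `FinancialInputs` and the function `computeRatios` are hypothetical; only the output key names follow the list above.

```typescript
// Hypothetical input shape; the real extraction schema is not documented here.
interface FinancialInputs {
  netOperatingIncome: number;
  totalDebtService: number; // principal + interest due in the period
  currentAssets: number;
  inventory: number;
  currentLiabilities: number;
  totalLiabilities: number;
  shareholdersEquity: number;
  netProfit: number;
  revenue: number;
}

// Output keys follow the ratio names listed above.
interface Ratios {
  dscr: number | null;
  current_ratio: number | null;
  quick_ratio: number | null;
  debt_to_equity_ratio: number | null;
  net_profit_margin: number | null;
}

// Returns null instead of Infinity/NaN when the denominator is zero.
function safeDivide(numerator: number, denominator: number): number | null {
  return denominator === 0 ? null : numerator / denominator;
}

// Textbook definitions, used here as an assumption.
function computeRatios(f: FinancialInputs): Ratios {
  return {
    dscr: safeDivide(f.netOperatingIncome, f.totalDebtService),
    current_ratio: safeDivide(f.currentAssets, f.currentLiabilities),
    quick_ratio: safeDivide(f.currentAssets - f.inventory, f.currentLiabilities),
    debt_to_equity_ratio: safeDivide(f.totalLiabilities, f.shareholdersEquity),
    net_profit_margin: safeDivide(f.netProfit, f.revenue),
  };
}
```

Returning `null` for a zero denominator keeps an undefined ratio distinguishable from a genuinely low one, which matters when a statement is missing a line item such as current liabilities.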
3. GST Analysis
Analyzes GST return filings to validate business turnover.
Endpoint: POST /api/analyze/gst-returns
Usage: Supports both multipart/form-data for direct uploads and application/json for processing pre-uploaded documents via documentId.
Client-Side Processing
For document visualization and manual verification, LoanPilot utilizes a robust client-side stack:
- pdf.js: Used within the Application dashboard to render PDF documents without requiring external plugins.
- Tesseract.js: Handles browser-based OCR for quick text selection and "click-to-fill" functionality when users are verifying AI-extracted fields.
- Napis Canvas: Manages high-performance rendering of document overlays, highlighting detected entities (like account numbers or dates) directly on the document preview.
AI Provider Configuration
The extraction logic is abstracted through an AIProvider interface. This allows developers to toggle between different LLM engines based on the organization's requirements for speed vs. accuracy.
// Internal Usage Example
const provider = createAIProvider('openai');
const result = await provider.extractFinancialStatement(
  base64File,
  'profit_loss',
  'application/pdf'
);
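To make the abstraction concrete, here is a minimal sketch of what such an interface and factory could look like. Only the method name `extractFinancialStatement`, its three arguments, and the provider names `openai` and `anthropic` come from this section. The `ExtractionResult` shape, the stub providers, and the factory `createStubProvider` are hypothetical stand-ins, and the stubs return canned data instead of calling any model.

```typescript
type StatementType = "balance_sheet" | "profit_loss" | "itr";

// Hypothetical result shape; the real response schema is not shown here.
interface ExtractionResult {
  provider: string;
  statementType: StatementType;
  fields: Record<string, number>;
}

// Method name and argument order follow the internal usage example above.
interface AIProvider {
  readonly name: string;
  extractFinancialStatement(
    base64File: string,
    statementType: StatementType,
    mimeType: string
  ): Promise<ExtractionResult>;
}

// Stub provider: validates the MIME type and returns an empty field set.
function makeStubProvider(name: string): AIProvider {
  return {
    name,
    async extractFinancialStatement(
      _base64File: string,
      statementType: StatementType,
      mimeType: string
    ): Promise<ExtractionResult> {
      if (mimeType !== "application/pdf") {
        throw new Error(`Unsupported MIME type: ${mimeType}`);
      }
      return { provider: name, statementType, fields: {} };
    },
  };
}

// Registry of known providers; callers choose one by name.
const providerRegistry: Record<string, () => AIProvider> = {
  openai: () => makeStubProvider("openai"),
  anthropic: () => makeStubProvider("anthropic"),
};

// Hypothetical stand-in for a factory such as createAIProvider.
function createStubProvider(name: string): AIProvider {
  const factory = providerRegistry[name];
  if (!factory) {
    throw new Error(`Unknown AI provider: ${name}`);
  }
  return factory();
}
```

Because callers depend only on the `AIProvider` interface, swapping engines becomes a change to the name passed to the factory, and a stub like this one can replace a real engine in tests.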
Data Reconciliation
Once multiple documents are processed, the Reconciliation Engine (api/analyze/reconciliation) can be triggered. It compares:
- Bank vs. GST: Does the bank inflow match the reported GST turnover?
- P&L vs. Bank: Are the reported operating expenses reflected in the bank debits?
- ITR vs. Financials: Is the net profit consistent across tax filings and internal P&L?
The result is a "Confidence Score" for the entire loan application, based on how consistent the data is across documents.
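The scoring formula is not described here, so the following is only one plausible way to turn such comparisons into a single number. It assumes each comparison reduces to a pair of amounts, counts a check as passed when the two amounts differ by no more than a relative tolerance, and reports the share of passed checks as a 0 to 100 score. The names `ReconciliationCheck`, `relativeGap`, and `confidenceScore`, and the 10% default tolerance, are all hypothetical.

```typescript
// One cross-document comparison, e.g. bank inflow vs. GST turnover.
interface ReconciliationCheck {
  label: string;
  reported: number; // amount declared in one document
  observed: number; // corresponding amount found in another document
}

// Relative difference between two amounts, measured against the larger one.
function relativeGap(a: number, b: number): number {
  const denominator = Math.max(Math.abs(a), Math.abs(b));
  return denominator === 0 ? 0 : Math.abs(a - b) / denominator;
}

// Share of checks within tolerance, as an integer from 0 to 100.
// Assumption: all checks carry equal weight.
function confidenceScore(checks: ReconciliationCheck[], tolerance: number = 0.1): number {
  if (checks.length === 0) {
    return 0; // nothing to reconcile, so no evidence of consistency
  }
  const passed = checks.filter(
    (c) => relativeGap(c.reported, c.observed) <= tolerance
  ).length;
  return Math.round((passed / checks.length) * 100);
}
```

For example, with bank inflow 950 against GST turnover 1000 (5% gap, passes), P&L expenses 400 against bank debits 200 (50% gap, fails), and identical net profit in ITR and P&L (passes), two of three checks pass and the score is 67.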