Completed · Mar 1, 2024 (updated Jan 15, 2025)
AIRBUS DOCUMENT ANALYZER: AIR-GAPPED AI SYSTEM
Fully offline document analysis application with local LLM, OCR, and semantic search for high-confidentiality environments at Airbus Defence and Space.
Lines: 9.7K
python · llm · rag · pyside6 · faiss · ocr · desktop-app
Overview
A fully offline, air-gapped desktop application designed for high-confidentiality environments at Airbus Defence and Space. It enables document analysis, question answering, and semantic search using a local LLM — with zero network connectivity by design.
Architecture
┌─────────────────────────────────────────────────────────┐
│ DOCUMENT INPUT (14+ formats) │
│ PDF · DOCX · XLSX · PPTX · Images · Markdown · CSV │
└────────────────────────┬────────────────────────────────┘
▼
┌──────────────────────┬──────────────────────────────────┐
│ OCR (Tesseract) │ SEMANTIC CHUNKING │
│ Scanned docs & │ Recursive splitting with │
│ image extraction │ overlap preservation │
└──────────┬───────────┴──────────────┬───────────────────┘
▼ ▼
┌─────────────────────────────────────────────────────────┐
│ FAISS VECTOR STORE │
│ BGE embeddings → cosine similarity search │
│ Multi-query retrieval + contextual compression │
└────────────────────────┬────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ LLM (LLaMA 3.1 8B GGUF, 4-bit) │
│ LCEL chain composition → Pydantic structured output │
│ Conversation memory → session persistence │
└────────────────────────┬────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ PySide6 DESKTOP GUI │
│ Streaming responses → source citations → chat UI │
└─────────────────────────────────────────────────────────┘
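The "recursive splitting with overlap preservation" stage in the diagram can be illustrated with a small standard-library sketch. It is not the project's code: the function names, the separator order, and the size parameters are all assumptions chosen for illustration, and a LangChain-based pipeline would more plausibly rely on a library text splitter.

```python
def split_recursive(text, limit, separators=("\n\n", "\n", " ")):
    """Break text into pieces of at most `limit` characters.

    Coarse separators (paragraphs) are tried before finer ones (lines, words).
    """
    text = text.strip()
    if len(text) <= limit:
        return [text] if text else []
    for sep in separators:
        if sep in text:
            pieces = []
            for part in text.split(sep):
                pieces.extend(split_recursive(part, limit, separators))
            return pieces
    # No separator left: fall back to a hard cut.
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def chunk_with_overlap(text, max_len=500, overlap=50):
    """Greedily merge pieces into chunks of at most `max_len` characters.

    Each new chunk starts with the last `overlap` characters of the previous
    one, so context that straddles a boundary appears in both chunks.
    Pieces are re-joined with single spaces (original separators are lost).
    """
    if not 0 <= overlap < max_len - 1:
        raise ValueError("overlap must be >= 0 and smaller than max_len - 1")
    # Reserve room for the carried-over tail plus one joining space.
    pieces = split_recursive(text, max_len - overlap - 1)
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_len:
            current = candidate
            continue
        chunks.append(current)
        tail = current[-overlap:] if overlap > 0 else ""
        current = f"{tail} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```

The tail is cut by character count, so it may begin mid-word; a production splitter would usually align the overlap to word or sentence boundaries.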
Security Architecture
The system enforces complete network isolation. Here, "air-gapped" means:
- Process-level socket blocking — Python's socket module is monkey-patched at startup to refuse all outbound connections made through it, so no HTTP, DNS, or TCP call issued from Python code can leave the process
- Startup verification — Network isolation is cryptographically verified before any document processing begins
- Audit logging — Every blocked connection attempt is logged with timestamp, target, and calling module for compliance reporting
- Encrypted local cache — SQLite caching layer with encrypted storage for embeddings and conversation history
- Zero telemetry — No analytics, no cloud sync, no model phone-home, no update checks
- No external dependencies at runtime — All models, embeddings, and assets bundled into the binary
In practice: the application can run on a workstation with the network cable physically disconnected and function identically.
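As an illustration of the socket blocking and audit logging described above, here is a minimal sketch using only the standard library. The names `install_network_guard`, `remove_network_guard`, `NetworkBlockedError`, `AUDIT_RECORDS`, and the `network_audit` logger are hypothetical, and the set of patched calls is an assumption rather than the application's actual list.

```python
import logging
import socket
import sys
from datetime import datetime, timezone

logger = logging.getLogger("network_audit")  # hypothetical logger name
AUDIT_RECORDS = []  # in-memory copy of audit entries, for illustration only
_ORIGINALS = {}     # saved callables so the guard can be removed in tests


class NetworkBlockedError(OSError):
    """Raised whenever guarded code attempts a network operation."""


def _deny(operation, target):
    # Frame 0 is _deny, frame 1 is the guard wrapper, frame 2 is its caller.
    caller = sys._getframe(2).f_globals.get("__name__", "<unknown>")
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operation": operation,
        "target": repr(target),
        "caller": caller,
    }
    AUDIT_RECORDS.append(record)
    logger.warning("blocked %s to %s from %s", operation, record["target"], caller)
    raise NetworkBlockedError(f"{operation} to {target!r} blocked: offline mode")


def _guarded_connect(self, address):
    _deny("connect", address)


def _guarded_connect_ex(self, address):
    _deny("connect_ex", address)


def _guarded_sendto(self, *args):
    _deny("sendto", args[-1] if args else None)  # address is the last argument


def _guarded_getaddrinfo(host, *args, **kwargs):
    _deny("getaddrinfo", host)  # blocks DNS lookups made through Python


def install_network_guard():
    """Replace outbound entry points of the socket module with denying stubs."""
    if _ORIGINALS:
        return  # already installed
    _ORIGINALS.update(
        connect=socket.socket.connect,
        connect_ex=socket.socket.connect_ex,
        sendto=socket.socket.sendto,
        getaddrinfo=socket.getaddrinfo,
    )
    socket.socket.connect = _guarded_connect
    socket.socket.connect_ex = _guarded_connect_ex
    socket.socket.sendto = _guarded_sendto
    socket.getaddrinfo = _guarded_getaddrinfo


def remove_network_guard():
    """Restore the original callables (intended for tests only)."""
    if not _ORIGINALS:
        return
    socket.socket.connect = _ORIGINALS.pop("connect")
    socket.socket.connect_ex = _ORIGINALS.pop("connect_ex")
    socket.socket.sendto = _ORIGINALS.pop("sendto")
    socket.getaddrinfo = _ORIGINALS.pop("getaddrinfo")
```

A guard of this kind only covers traffic that passes through Python's `socket` module. Native extensions that open sockets directly would bypass it, which is why a physically disconnected workstation remains the stronger guarantee.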
Key Features
- 14+ File Formats — including PDF, DOCX, XLSX, PPTX, images (PNG, JPG, TIFF), Markdown, and CSV
- OCR Pipeline — Tesseract-based text extraction from scanned documents and images
- RAG Pipeline — Semantic chunking → BGE embeddings → FAISS search → LLM generation with source citations
- Conversation Memory — Multi-turn context with session persistence and token management
- Streaming Responses — Real-time LLM output display in the PySide6 GUI
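The retrieval step of the RAG pipeline (embeddings, cosine similarity search, source citations) can be sketched without any third-party library. This is a toy stand-in, not the application's code: the three-dimensional vectors replace real BGE embeddings, the brute-force loop replaces FAISS, and the chunk identifiers are invented for the example. With FAISS, cosine similarity is commonly obtained by L2-normalizing vectors and searching an inner-product index, which is the same arithmetic shown here.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def top_k(query_vector, chunk_vectors, k=2):
    """Return the k most similar chunks as (chunk_id, score), best first.

    `chunk_vectors` maps a chunk identifier (usable as a source citation)
    to its embedding vector.
    """
    query = normalize(query_vector)
    scored = []
    for chunk_id, vector in chunk_vectors.items():
        unit = normalize(vector)
        score = sum(a * b for a, b in zip(query, unit))
        scored.append((score, chunk_id))
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Hypothetical chunk identifiers with toy 3-dimensional "embeddings".
chunks = {
    "manual.pdf#p3": [1.0, 0.0, 0.0],
    "report.docx#p1": [0.0, 1.0, 0.0],
    "manual.pdf#p7": [0.9, 0.1, 0.0],
}
results = top_k([2.0, 0.0, 0.0], chunks, k=2)
```

Returning the chunk identifier alongside the score is what lets the generation step attach a source citation to each retrieved passage.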
Production Notes
- Packaging — PyInstaller single-binary distribution (~4GB with bundled LLaMA model weights)
- Deployment — Installed on air-gapped Airbus Defence and Space workstations via USB transfer
- User training — Documentation and onboarding guide for non-technical users
- Iterative development — 8-phase LangChain integration built incrementally, each phase validated before proceeding
Tech Stack
Python, PySide6, LLaMA 3.1 (8B GGUF), LangChain, FAISS, sentence-transformers, Tesseract OCR, PyInstaller