KS
← Back to projects
CompletedMar 1, 2024 — updated Jan 15, 2025

AIRBUS DOCUMENT ANALYZER: AIR-GAPPED AI SYSTEM

Fully offline document analysis application with local LLM, OCR, and semantic search for high-confidentiality environments at Airbus Defence and Space.

lines 9.7K
pythonllmragpyside6faissocrdesktop-app
View on GitHub →

Overview

A fully offline, air-gapped desktop application designed for high-confidentiality environments at Airbus Defence and Space. It enables document analysis, question answering, and semantic search using a local LLM — with zero network connectivity by design.

Architecture

┌─────────────────────────────────────────────────────────┐
│              DOCUMENT INPUT (14+ formats)                │
│   PDF · DOCX · XLSX · PPTX · Images · Markdown · CSV   │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌──────────────────────┬──────────────────────────────────┐
│   OCR (Tesseract)    │     SEMANTIC CHUNKING            │
│   Scanned docs &     │     Recursive splitting with     │
│   image extraction   │     overlap preservation         │
└──────────┬───────────┴──────────────┬───────────────────┘
           ▼                          ▼
┌─────────────────────────────────────────────────────────┐
│              FAISS VECTOR STORE                          │
│   BGE embeddings → cosine similarity search             │
│   Multi-query retrieval + contextual compression        │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│           LLM (LLaMA 3.1 8B GGUF, 4-bit)               │
│   LCEL chain composition → Pydantic structured output   │
│   Conversation memory → session persistence             │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│              PySide6 DESKTOP GUI                         │
│   Streaming responses → source citations → chat UI      │
└─────────────────────────────────────────────────────────┘

Security Architecture

The system enforces complete network isolation — "air-gapped" means:

  • OS-level socket blocking — Python's socket module is monkey-patched at startup to block ALL outbound connections; no HTTP, DNS, or TCP calls can leave the process
  • Startup verification — Network isolation is cryptographically verified before any document processing begins
  • Audit logging — Every blocked connection attempt is logged with timestamp, target, and calling module for compliance reporting
  • Encrypted local cache — SQLite caching layer with encrypted storage for embeddings and conversation history
  • Zero telemetry — No analytics, no cloud sync, no model phone-home, no update checks
  • No external dependencies at runtime — All models, embeddings, and assets bundled into the binary

In practice: the application can run on a workstation with the network cable physically disconnected and function identically.

Key Features

  • 14+ File Formats — PDF, DOCX, XLSX, PPTX, images (PNG, JPG, TIFF), Markdown, CSV
  • OCR Pipeline — Tesseract-based text extraction from scanned documents and images
  • RAG Pipeline — Semantic chunking → BGE embeddings → FAISS search → LLM generation with source citations
  • Conversation Memory — Multi-turn context with session persistence and token management
  • Streaming Responses — Real-time LLM output display in the PySide6 GUI

Production Notes

  • Packaging — PyInstaller single-binary distribution (~4GB with bundled LLaMA model weights)
  • Deployment — Installed on air-gapped Airbus Defence and Space workstations via USB transfer
  • User training — Documentation and onboarding guide for non-technical users
  • Iterative development — 8-phase LangChain integration built incrementally, each phase validated before proceeding

Tech Stack

Python, PySide6, LLaMA 3.1 (8B GGUF), LangChain, FAISS, sentence-transformers, Tesseract OCR, PyInstaller