Legal Transcription Platform - Server and Storage Design Plan
=================================================================

Version: 2026-05-26 (Updated)

Purpose
-------

This document defines the recommended server, storage, tenant-isolation, and backup design for the Legal Transcription and Document Automation Platform.

The platform must be built as one tenant-aware codebase that can run in multiple deployment modes:

1. Hosted multi-tenant mode
   - One server/platform hosted in the data centre.
   - Multiple law firms use the same system.
   - Each firm is a separate tenant.
   - Data, users, templates, recordings, transcripts, generated documents, AI settings, audit logs, and feature flags are isolated by firm.

2. Hosted single-tenant dedicated mode
   - The same platform runs for one firm only.
   - Useful for larger clients or firms that want dedicated infrastructure.
   - Still uses the same Platform -> Firm -> User structure.

3. Onsite single-tenant mode
   - The same platform can be installed onsite if a firm requires local control.
   - This must not be a separate product or separate codebase.
   - It is the same system deployed in a different location.

4. Hosted platform with optional local connector agent
   - Preferred for firms that need access to local SMB shares, recording folders, scanners, or local document stores.
   - A small local agent runs at the law firm site.
   - The agent securely uploads recordings/documents to the hosted platform.
   - The agent can download completed documents back to local folders.
   - See the Local Connector Agent section for full specification.

Core Rule
---------

Do not build separate hosted and on-prem products.

Build one multi-tenant, tenant-aware platform that can run in different deployment modes.

The system must always use this structure:

  Platform -> Firm / Tenant -> Users

This applies even if there is only one firm on the server. The code must never assume there is only one firm.

The platform must be deployed on a dedicated server. It must not share infrastructure with unrelated applications such as billing software or other business systems.

Recommended Starting Server
---------------------------

For the first proper MVP/development server:

  VM name:           legaltranscribe1
  Operating system:  Ubuntu 24.04 LTS

Suggested resources:

  vCPU:           8 cores
  RAM:            32 GB
  OS disk:        150 GB SSD/NVMe
  Data disk:      2 TB SSD/NVMe
  Backup target:  separate disk, NAS, SMB share, SFTP server, or backup server

Initial services:
- Nginx (with TLS from day one)
- PHP/Laravel or selected web framework
- PostgreSQL (recommended - see Database Design section)
- Redis (with persistence enabled - see Redis Configuration section)
- Queue worker
- Python worker environment
- LibreOffice/headless document tools
- ffmpeg
- ClamAV or equivalent virus scanner
- SMB/SFTP/FTP connector tools
- Backup scripts
- AI provider configuration
- Monitoring/logging tools
- logrotate configuration

This is enough for the MVP because the heavy AI processing will likely use external AI APIs at first, rather than local GPU-based inference.

Recommended Production Starting Server
--------------------------------------

For a hosted multi-tenant production server in the data centre:

Suggested resources:

  vCPU:           16 cores
  RAM:            64 GB
  OS disk:        150-250 GB SSD/NVMe
  Data disk:      2-4 TB SSD/NVMe
  Database disk:  separate high-speed SSD/NVMe volume (recommended)
  Backup target:  separate physical storage, backup server, NAS, SFTP target, or cloud/object storage

The exact size should be increased based on:
- number of firms
- number of users
- transcription minutes per day
- recording retention period
- document retention period
- AI usage volume
- live dictation usage
- backup retention policy

Minimum Disk Design
-------------------

The server should use a separate data disk from the beginning.

Minimum recommended layout:

Disk 1 - OS and Application
  /
  /opt/legaltranscribe/app
  /var/log
  /etc

Disk 2 - Application Data
  /data/legaltranscribe/

Backup Target - Separate Storage
  /backups/legaltranscribe/
  or an external SMB/SFTP/NAS/cloud backup destination

The data disk holds all legal recordings, transcripts, templates, generated documents, attachments, exports, tenant files, and temporary processing files.

The OS/application disk must not be used as the main legal data store.

Why Use a Separate Data Disk From Day One
-----------------------------------------

A separate data disk is strongly recommended because it gives:

1. Easier backups
2. Easier restores
3. Easier storage expansion
4. Easier tenant migration later
5. Cleaner separation between application code and legal data
6. Lower risk if the OS disk fills up
7. Easier encryption of legal data
8. Easier snapshots of legal data
9. Easier migration to another server
10. Better long-term scaling

Recommended Data Folder Layout
-------------------------------

Use a tenant-aware folder layout from day one.

Recommended structure:

/data/legaltranscribe/
  firms/
    firm_0001/
      recordings/
      transcripts/
      documents/
      templates/
      attachments/
      exports/
      temp/
      reports/
      backups/
    firm_0002/
      recordings/
      transcripts/
      documents/
      templates/
      attachments/
      exports/
      temp/
      reports/
      backups/
  shared/
  quarantine/
  system/
  imports/
  processing/

Each firm must have its own isolated storage path.

Example tenant folder paths:

  /data/legaltranscribe/firms/{firm_id}/recordings
  /data/legaltranscribe/firms/{firm_id}/transcripts
  /data/legaltranscribe/firms/{firm_id}/documents
  /data/legaltranscribe/firms/{firm_id}/templates
  /data/legaltranscribe/firms/{firm_id}/attachments
  /data/legaltranscribe/firms/{firm_id}/exports
  /data/legaltranscribe/firms/{firm_id}/temp
  /data/legaltranscribe/firms/{firm_id}/reports

Folder Definitions
------------------

Each top-level folder under /data/legaltranscribe/ has a defined purpose:

firms/
  Contains one sub-folder per tenant. Each firm's data is fully isolated within its own folder tree. No application process should ever read or write across firm boundaries.

shared/
  Platform-level assets that are not firm-specific. This may include global template starters, platform branding assets, and shared reference files distributed to firms. This folder must not contain any firm data or recordings.

quarantine/
  Files that have failed virus/malware scanning or format validation are moved here instead of being accepted into a firm's folder or deleted outright. Quarantined files must be reviewed by a platform admin before permanent removal. This folder must be restricted to platform admin access only. No firm user or application process other than the scanner and the admin tool should be able to read from this folder.

system/
  Platform-level operational files such as lock files, PID files, health check outputs, and system-level job state. Not for firm data.

imports/
  Staging area for files arriving from external connectors (SMB, SFTP, FTP, local agent) before they have been validated and assigned to a firm. Files should not remain here after processing completes. Stale files in imports indicate a failed or stuck import job and should trigger an alert.

processing/
  Temporary working area for files actively being processed by a queue worker (e.g. audio being transcribed, documents being merged). A file should only exist in processing while a worker job is actively working on it. Stale files here indicate a crashed or stuck worker and must trigger an alert. See the Temp and Processing Folder Lifecycle section.

firms/{firm_id}/temp/
  Per-firm temporary folder for intermediate files created during a job for that firm. Must be cleaned on schedule. See the Temp and Processing Folder Lifecycle section.

Temp and Processing Folder Lifecycle
-------------------------------------

The temp and processing folders can accumulate large audio files and intermediate documents if not actively managed. A cleanup job must run on a schedule.

Rules:
- Files in /data/legaltranscribe/firms/{firm_id}/temp/ that are older than a configurable threshold (default: 24 hours) must be automatically deleted by a scheduled cleanup job.
- Files in /data/legaltranscribe/processing/ should only exist while a queue worker is actively working on that job. If a file has been in processing longer than a configurable threshold (default: 2 hours) and its associated job is no longer active in the queue, this is treated as a stale file from a crashed or stuck worker. The cleanup job must flag it and notify the platform admin.
- Files in /data/legaltranscribe/imports/ should be cleared as soon as they are validated and moved to the correct firm folder. A file remaining in imports longer than a configurable threshold (default: 1 hour) after arrival indicates a failed import job and should trigger an alert.
- The cleanup job must be a scheduled queue job, visible in the admin diagnostics panel, with a log of what was cleaned and when.
- The cleanup job must never delete files from firm data folders (recordings, documents, transcripts). It only cleans temp, processing, and imports.

Storage Isolation Rule
----------------------

The application must never directly hard-code file paths inside business logic.

All file access must go through a storage abstraction layer.

The storage abstraction must understand:
- firm/tenant ID
- storage type
- base path
- permissions
- encryption rules
- retention rules
- quota rules

All file paths resolved by the storage abstraction must be validated against the allowed base path for that firm before any read or write operation is executed. This prevents path traversal attacks where a manipulated filename or folder name could cause the application to access files outside the firm's permitted path. If a resolved path does not begin with the firm's permitted base path, the operation must be rejected and the attempt must be logged in the audit log.
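
The base-path check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the platform's actual storage layer: resolve_firm_path, PathTraversalError, and the DATA_ROOT default are hypothetical names chosen for this example. A real implementation would also write the rejection to the audit log and would decode URL-encoded input before this step.

```python
from pathlib import Path


class PathTraversalError(Exception):
    """Raised when a path falls outside the firm's permitted base path."""


# Assumed root, taken from the folder layout in this document.
DATA_ROOT = Path("/data/legaltranscribe/firms")


def resolve_firm_path(firm_id: str, relative_path: str,
                      data_root: Path = DATA_ROOT) -> Path:
    """Resolve relative_path inside the firm's folder or raise PathTraversalError."""
    root = data_root.resolve()
    base = (root / firm_id).resolve()
    # The firm folder itself must be a direct child of the data root,
    # so a firm_id such as "../shared" is rejected.
    if base.parent != root:
        raise PathTraversalError(f"invalid firm id: {firm_id!r}")

    # resolve() collapses ".." segments and follows symlinks. An absolute
    # relative_path replaces the base entirely and is caught by the check below.
    candidate = (base / relative_path).resolve()

    # is_relative_to (Python 3.9+) compares whole path components, so
    # "firm_00010" is not mistaken for a child of "firm_0001" the way a
    # plain string-prefix comparison would be.
    if not candidate.is_relative_to(base):
        # Hypothetical hook: an audit log entry would be written here.
        raise PathTraversalError(f"path escapes firm base: {relative_path!r}")
    return candidate
```

Comparing resolved path components rather than raw strings is a deliberate choice here: a simple "starts with" test on strings would accept /data/legaltranscribe/firms/firm_00010/ as being inside firm_0001.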

Routing all file access through the storage abstraction allows a firm to be moved later from shared storage to its own disk, volume, SMB share, SFTP location, object bucket, or dedicated server without rewriting the transcription, document, email, or AI modules.

Filesystem Permissions
----------------------

Correct filesystem permissions are critical because this system handles confidential legal recordings and documents.

Required permission model:

Application user:
- Create a dedicated application user, for example: legaltranscribe
- The application and all queue workers must run under this user
- This user owns the application code directory and the data directory

Web server user:
- Nginx (www-data or equivalent) must have no direct access to the data disk
- The web server must only serve the application through the PHP/app runtime
- The web server must not be able to browse or read firm data folders directly

Firm folder permissions:
- Each firm folder must be readable and writable only by the application user
- No other OS user should have read access to firm folders
- Permissions should be set to 750 or stricter on firm data directories

Quarantine folder:
- Writable by the application user (for moving files into quarantine)
- Readable only by the application user and platform admin processes
- No firm-level process should be able to read from quarantine

Log folders:
- Application logs writable by the application user
- Log files must not be world-readable
- Log files must never contain secrets, API keys, or credential values

Upload validation:
- All uploaded file paths and connector-sourced file paths must be validated against the permitted base path for the receiving firm before acceptance
- Path traversal characters (../, ..\, encoded equivalents) must be rejected at the storage abstraction layer before any file operation

Summary of recommended permissions:

  Path                                 Mode  Owner:Group
  /opt/legaltranscribe/app             750   legaltranscribe:legaltranscribe
  /data/legaltranscribe/               750   legaltranscribe:legaltranscribe
  /data/legaltranscribe/firms/         750   legaltranscribe:legaltranscribe
  /data/legaltranscribe/quarantine/    750   legaltranscribe:legaltranscribe
  /data/legaltranscribe/processing/    750   legaltranscribe:legaltranscribe
  /backups/legaltranscribe/            750   legaltranscribe:legaltranscribe

Virus and Malware Scanning
--------------------------

Law firms upload files from many sources including client-supplied recordings, email attachments, scanned documents, and files arriving via SMB or SFTP connectors. All uploaded files must be scanned before they are accepted into a firm's storage path.

Requirements:

Scanner:
- ClamAV or an equivalent open-source virus/malware scanner must be installed and kept up to date.
- The scanner must be integrated into the file upload pipeline and the connector import pipeline.
- Scanning must run before the file is moved from the imports or temp area into the firm's permanent storage path.

On scan failure:
- Files that fail scanning must be moved to the quarantine folder.
- The associated job must be marked as failed with a clear error message.
- The firm admin and the submitting user must be notified.
- The quarantined file must not be deleted automatically. It must be reviewed by a platform admin before any action is taken.

On scan pass:
- The file proceeds normally through the upload or import pipeline.
- The scan result (clean, timestamp, scanner version) must be recorded in the audit log for that file.

Scope:
- Scanning applies to all uploaded audio files, Word documents, PDFs, attachments, and any file arriving via a connector.
- Email attachments processed through the email integration module must also be scanned before being accepted.

Updates:
- ClamAV virus definitions must be updated automatically on a daily schedule at minimum using freshclam or equivalent.
- The last successful definition update date must be visible in the admin diagnostics panel. A stale definition database must trigger an alert.
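
The scan-before-accept flow above could look roughly like the following Python sketch. It assumes the clamscan command-line tool is installed; accept_or_quarantine and clamscan_is_clean are hypothetical names for illustration, and the returned dictionary only stands in for a real audit log entry (which, per the requirements above, would also carry the scanner version). Job status updates and notifications are left out.

```python
import shutil
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def clamscan_is_clean(path: Path) -> bool:
    """Run clamscan on one file. Exit code 0 = clean, 1 = infected, 2 = error."""
    result = subprocess.run(
        ["clamscan", "--no-summary", str(path)],
        capture_output=True,
        text=True,
    )
    if result.returncode == 2:
        # Treat scanner errors as failures of the job, never as a pass.
        raise RuntimeError(f"clamscan error: {result.stderr.strip()}")
    return result.returncode == 0


def accept_or_quarantine(
    staged_file: Path,
    firm_dest_dir: Path,
    quarantine_dir: Path,
    is_clean: Callable[[Path], bool] = clamscan_is_clean,
) -> dict:
    """Scan a staged file, then move it to firm storage or to quarantine."""
    clean = is_clean(staged_file)
    target_dir = firm_dest_dir if clean else quarantine_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / staged_file.name
    # The file is moved, never deleted: quarantined files await admin review.
    shutil.move(str(staged_file), str(target))
    return {
        "file": staged_file.name,
        "result": "clean" if clean else "quarantined",
        "stored_at": str(target),
        "scanned_at": datetime.now(timezone.utc).isoformat(),
    }
```

Passing the scanner in as a callable keeps the pipeline testable without ClamAV installed and leaves room for an "equivalent" scanner, as the requirements allow.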

Storage Quota Enforcement
--------------------------

Storage quotas allow the platform admin to control how much disk space each firm can consume and to enforce plan limits in hosted mode.

Requirements:

Tracking:
- Storage usage per firm must be tracked in the database.
- The usage counter must be updated whenever files are written to or deleted from a firm's storage path.
- Usage must be calculated across all firm sub-folders: recordings, transcripts, documents, templates, attachments, exports, and temp.

Quota settings:
- The platform admin must be able to set a storage quota per firm, either as part of the firm's plan or as a per-firm override.
- A default quota can be set at the platform level and applied to new firms.

Warning threshold:
- When a firm's storage usage reaches a configurable warning threshold (default: 80% of quota), the platform admin and firm admin must be notified.
- The warning must appear in the admin dashboard and as a notification.

Quota enforcement:
- When a firm reaches 100% of its quota, further file uploads must be blocked with a clear, user-readable error message.
- The block must apply to all upload paths: web upload, connector import, email attachment import, and agent upload.
- Jobs must not silently fail when quota is exceeded. The error must be surfaced to the user and logged in the audit trail.

Admin override:
- Platform admins must be able to increase a firm's quota at any time.
- Platform admins must be able to view storage usage per firm in the admin panel.

Visibility:
- Storage usage and quota must be visible to firm admins in their firm settings area.
- A storage usage breakdown by folder type (recordings, documents, etc.) should be shown where practical.

Should Each Tenant Have a Separate Hard Disk?
---------------------------------------------

Not at the beginning.

The recommended approach is a progressive isolation model:

Tenant isolation level 1 (MVP):
  Shared database with tenant_id on every tenant-owned table.
  Separate tenant folders under /data/legaltranscribe/firms/{firm_id}/.
  Logical separation enforced by the application and storage abstraction.

Tenant isolation level 2:
  Separate tenant storage folder or volume with enforced quotas.

Tenant isolation level 3:
  Separate disk, volume, mount point, or object storage bucket for larger firms.
  Example: mount a dedicated disk directly at /data/legaltranscribe/firms/firm_0042/

Tenant isolation level 4:
  Dedicated VM or server for very large, high-security, or high-volume firms.

For the MVP, use logical tenant separation first. Design the storage abstraction so that large tenants can later be moved to their own disk, mount point, storage volume, object bucket, dedicated database, or dedicated server without rewriting application code.

Database Design
---------------

The system must be tenant-aware at the database level.

Recommended database: PostgreSQL

PostgreSQL is the recommended choice over MariaDB for the following reasons:
- Row-level security support, which can be used as an additional tenant isolation safeguard at the database layer.
- More mature full-text search, which may be used for transcript and document search.
- Better JSON/JSONB support for flexible settings and metadata storage.
- Stronger data integrity and constraint enforcement.
- Widely supported by Laravel, Python ORMs, and most deployment tooling.

MariaDB remains an acceptable fallback if PostgreSQL is not feasible for a specific deployment, but PostgreSQL should be the default for all new builds. The database choice must be recorded in the stack recommendation document before coding begins.

Recommended MVP approach:

Shared database with tenant_id / firm_id on every tenant-owned table.

Every tenant-owned table must include firm_id unless the table is truly global platform data.

Tenant-scoped tables include:
- users
- recordings
- transcription jobs
- transcripts
- transcript versions
- generated documents
- document versions
- templates
- template versions
- AI jobs
- AI usage logs
- AI review results
- email connections
- email drafts
- sent emails
- file connectors
- storage connectors
- audit logs
- feature flags (firm-level)
- firm settings
- backup jobs
- reports
- notifications
- matters / clients
- queue jobs (where firm context is applicable)
- failed jobs (with firm context)
- retention policies
- firm plan assignments
- storage quotas and usage tracking
- local connector agents

Global platform tables may include:
- platform admins
- global AI provider definitions
- global feature definitions
- global system settings
- platform health logs
- product plans and packages
- global feature flags

Future database isolation options:
- separate database per firm for high-security firms
- hybrid shared platform database plus dedicated firm database
- dedicated VM/server per firm

The MVP must not block any of these future options.

Optional Separate Database Disk
--------------------------------

For the MVP, the database can live on the OS disk or the data disk.

For production, consider separating:

OS disk:
- Ubuntu
- application code
- system packages

Database disk:
- PostgreSQL data directory

File data disk:
- recordings
- transcripts
- generated documents
- templates
- attachments

This is not mandatory for the MVP, but the system should be designed so that the database storage can be moved to a dedicated disk or server later without application changes.

Redis Configuration
-------------------

Redis is used as the queue backend and cache. For a legal workflow platform where queue jobs represent real billable work, Redis must be configured with persistence enabled so that a Redis restart or unexpected shutdown does not silently discard pending jobs.

Required configuration:

Persistence:
- Enable AOF (append-only file) persistence as the primary durability mechanism. Set appendonly yes in the Redis configuration.
- Alternatively, configure RDB snapshots with a short interval (e.g. every 60 seconds if 1000 keys have changed) as a minimum.
- For production, AOF with appendfsync set to everysec is the recommended balance between durability and performance.

Binding:
- Redis must only bind to localhost (127.0.0.1) or a private internal network interface.
- Redis must never be exposed on a public network interface or internet-facing port.

Authentication:
- Redis must be protected with a strong password (requirepass).
- The password must be stored in the application's encrypted environment configuration, not in plain text.

Memory limits:
- Set a maxmemory limit appropriate to the server's available RAM.
- Set maxmemory-policy to noeviction for the queue database so that Redis never silently drops queue jobs to free memory. Use a separate Redis database or instance with an eviction policy for caching if needed.

Monitoring:
- Redis health and memory usage must be included in the admin diagnostics panel.

TLS / HTTPS Configuration
--------------------------

The web portal must be served over HTTPS from the very first deployment, including the MVP/development server. Plain HTTP must not be used at any stage.

Requirements:

Development/MVP server:
- Obtain a TLS certificate from Let's Encrypt using Certbot or equivalent.
- Configure Nginx to redirect all HTTP traffic (port 80) to HTTPS (port 443).
- Set up automatic certificate renewal via a cron job or systemd timer.

Production server:
- Use Let's Encrypt for standard deployments.
- Use an internal certificate authority or a purchased certificate for environments with stricter requirements.
- Ensure certificate renewal is monitored and alerts are sent before expiry.

Nginx TLS settings:
- Use TLS 1.2 and TLS 1.3 only. Disable TLS 1.0 and TLS 1.1.
- Use a strong cipher suite. Follow current Mozilla SSL Configuration Generator recommendations for the Intermediate or Modern profile.
- Enable HSTS (HTTP Strict Transport Security) with a suitable max-age.
- Enable OCSP stapling where supported.

Custom domains:
- For hosted mode with per-firm custom domains or subdomains, a wildcard certificate or per-domain certificate must be provisioned and renewed for each firm domain.
- Certificate management for custom firm domains must be included in the firm onboarding process.

Certificate monitoring:
- Certificate expiry dates must be visible in the admin diagnostics panel.
- An alert must fire if any certificate is within 14 days of expiry.

Firewall Rules
--------------

The server must have a firewall configured from the moment it is provisioned. Only the minimum required ports should be open.

Required firewall rules:

Inbound - allowed:
- Port 80 (HTTP) - allowed from anywhere, for Let's Encrypt and HTTP redirect
- Port 443 (HTTPS) - allowed from anywhere, for the web portal
- Port 22 (SSH) - restricted to known management IP addresses only. Never open SSH to the world.

Inbound - blocked:
- Database port (PostgreSQL 5432) - must never be exposed externally. Database access must only be available on localhost or a private interface.
- Redis port (6379) - must never be exposed externally. Redis must only be available on localhost or a private interface.
- SMB ports (445, 139) - the application connects outbound to SMB shares. Inbound SMB must never be open on the platform server.
- All other ports not explicitly listed above must be blocked by default.

Outbound:
- The server must be able to reach external AI provider APIs over HTTPS.
- The server must be able to reach external SFTP, FTP, and SMB targets for connector jobs.
- The local connector agent connects outbound to the platform API on port 443. The agent must not require any inbound firewall rule on the platform server.
- Outbound access should be restricted to required destinations where the hosting environment supports outbound firewall rules.

Firewall tool:
- Use ufw (Uncomplicated Firewall) on Ubuntu as the default.
- All rules must be documented and version-controlled alongside the server build documentation.

Log Management
--------------

The platform generates audit logs, application logs, AI usage logs, queue worker logs, and connector logs continuously. Without active log management, log files will grow without bound and can fill the OS disk.

Requirements:

logrotate:
- logrotate must be configured for all application log files from day one.
- Logs should be rotated daily or when they reach a configurable size threshold.
- Compressed old logs should be retained for a configurable number of days.
- logrotate configuration must be included in the server build checklist.

Log retention:
- Application error logs must be retained for a minimum configurable period (default: 90 days).
- Audit logs must be retained according to each firm's retention policy, which may be longer than the application log retention period.
- Queue worker logs must be retained for a minimum configurable period (default: 30 days).

Log content rules:
- Logs must never contain secrets, API keys, passwords, or credential values at any log level (debug, info, warning, error).
- Logs must never contain full file paths that reveal the internal folder structure to unauthorised parties.
- Logs must not contain unredacted personal data beyond what is required for audit purposes.
- These rules apply to all log destinations: files, database audit logs, third-party log aggregators.

Log visibility:
- A tail of the most recent application errors must be available in the admin diagnostics panel.
- Log file sizes and last-rotation dates should be visible in the diagnostics panel so that log growth problems are caught early.

MFA:
- MFA support must be included in the application. It is not optional and must not be deferred to a future phase. Platform admin accounts must require MFA. Firm admin accounts should support MFA with the option to require it per firm.

Backup Design
-------------

Do not rely on VM snapshots as the only backup.

Snapshots are useful, but they are not sufficient on their own.

The platform must support:

1. App/source backup
2. Database dump
3. File/data backup
4. Per-tenant backup/export
5. Off-server backup copy
6. Restore test
7. Backup status reporting
8. Backup failure alerts

Backups must be stored away from the live data disk.

Recommended backup targets:
- separate VM disk
- backup server
- NAS
- SMB share
- SFTP server
- cloud/object storage (future option)

Backups must support encryption.

Backup Types
------------

The system must support these backup types:

1. Fast source backup
   - Used before code or UI-only changes.
   - Includes application source, config templates, scripts, changelog, and handover files.
   - Does not require a full database dump unless schema or data is being changed.

2. Database backup
   - Used when schema, migrations, tables, seed data, or production data are changed.
   - Must include a full database dump and schema summary.

3. File/data backup
   - Includes recordings, transcripts, documents, templates, attachments, exports, and reports.
   - May be full or incremental.

4. Per-tenant backup/export
   - Used to export one firm's settings, templates, documents, transcripts, and configuration.
   - Secrets must not be exported in plain text.

5. Full platform backup
   - Used for complete server and platform recovery.
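
For the per-tenant export (backup type 4 above), one possible way to keep secrets out of an exported settings file is to mask them before writing. The sketch below is an assumption-laden illustration: redact_secrets, the REDACTED placeholder, and the list of key-name markers are all hypothetical, and the document does not say whether secrets should be masked, omitted, or exported in encrypted form. Masking is shown only because it is the simplest option to demonstrate.

```python
from typing import Any

# Hypothetical key-name fragments that mark a setting as sensitive.
SENSITIVE_MARKERS = ("password", "api_key", "secret", "credential", "private_key", "token")

REDACTED = "***REDACTED***"


def _is_sensitive(key: Any) -> bool:
    """Return True if the setting name contains any sensitive marker."""
    name = str(key).lower()
    return any(marker in name for marker in SENSITIVE_MARKERS)


def redact_secrets(settings: Any) -> Any:
    """Return a copy of a nested settings structure with sensitive values masked.

    The input is not modified, so the live configuration keeps its real values.
    """
    if isinstance(settings, dict):
        return {
            key: REDACTED if _is_sensitive(key) else redact_secrets(value)
            for key, value in settings.items()
        }
    if isinstance(settings, list):
        return [redact_secrets(item) for item in settings]
    return settings
```

Matching on key names is a blunt instrument. A stricter design would mark sensitive fields explicitly in the settings schema rather than infer them from their names.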

Backup Folder Layout
--------------------

Recommended backup layout:

/backups/legaltranscribe/
  source/
  database/
  files/
  tenants/
    firm_0001/
    firm_0002/
  handovers/
  reports/
  restore-tests/

Backups must include:
- application source
- database
- uploaded recordings
- transcripts
- final documents
- Word templates
- attachments
- config/settings exports
- audit logs where appropriate
- changelog
- handover file
- route list
- schema summary
- smoke-test reports
- latest error log tail

Tenant Backup and Restore Requirements
--------------------------------------

In hosted mode, it must be possible to restore:

1. The whole server/platform
2. One firm/tenant
3. One firm's settings and templates
4. One firm's documents and transcripts where practical
5. One firm's data into a dedicated instance if required later

Per-firm backup retention policies must be configurable.

Sensitive items must never be exported in plain text:
- passwords
- API keys
- email credentials
- storage credentials
- private keys

Database Design
---------------

Recommended database: PostgreSQL

See the Database Design section above for the full recommendation and rationale.

Security Requirements
---------------------

The server will handle confidential legal recordings and documents.

Minimum requirements:
- tenant isolation at application, database, and filesystem layers
- encrypted credentials and secrets
- restricted filesystem permissions (see Filesystem Permissions section)
- audit logs
- role-based access control
- MFA support for platform admin and firm admin accounts
- feature flags and route-level blocking for disabled modules
- secure backup storage with encryption
- no secrets in any log at any log level
- no API keys in handover files or changelogs
- no cross-firm file access
- no cross-firm database query leakage
- background jobs must always run with explicit firm context
- all uploaded files scanned before acceptance (see Virus and Malware Scanning section)
- TLS from day one (see TLS / HTTPS Configuration section)
- firewall configured before any service is started (see Firewall Rules section)
- path traversal validation in the storage abstraction layer

Platform admins must be audited whenever they access or impersonate a firm user. Support access must be explicit, time-limited, and fully logged. The firm admin must be able to view the support access history for their firm. The system must make it visually clear in the portal when a platform admin is acting inside a firm context.

Local Connector Agent
---------------------

The local connector agent is the preferred solution for firms that need access to local SMB shares or network recording folders but do not want the full platform installed onsite. It corresponds to deployment mode 4 in the Purpose section.

Agent Responsibilities:
- Watch one or more configured local folders for new audio files or documents.
- Securely upload new files to the hosted platform for processing.
- Poll the hosted platform for completed documents and download them into the firm's configured output folders.
- Queue uploads and downloads locally so that network outages do not cause job loss.
- Retry uploads and downloads automatically when connectivity is restored.
- Log all upload and download actions locally and report them to the hosted platform for audit purposes.
- Report status and errors to the hosted platform.

Agent Communication:
- The agent must communicate with the hosted platform using a secure HTTPS API connection on port 443.
- Authentication must use per-agent tokens, registered and managed through the platform admin panel.
- All connections must use TLS. Plain HTTP must not be accepted.
- The agent must not expose any inbound ports. All communication must be initiated by the agent outbound to the platform. No inbound firewall rule should be required on the platform server for the agent to function.
- Agent tokens must be revocable from the platform admin panel with immediate effect.

Agent Installation and Updates:
- The agent must be installable on Windows (as a Windows Service) and Linux (as a systemd service) as a background process.
- The agent should support automatic updates pushed from the hosted platform, with a fallback to manual update if auto-update fails.
- The installation process must be documented clearly enough for a non-developer to complete it on a standard office Windows or Linux machine.
- The agent must have a simple local status page or command-line status command so that an administrator at the firm can confirm it is running, check its last sync time, and view recent errors without needing access to the hosted platform portal.

Offline and Resilience Behaviour:
- The agent must queue pending uploads locally if the hosted platform is unreachable.
- On reconnection, the agent must process the local queue in order.
- The agent must not attempt to re-upload a file that has already been successfully uploaded and acknowledged by the platform.
- If a file fails to upload after a configurable number of retries, it must be flagged in the local error log and the agent must notify the platform of the failure when connectivity is restored.
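
The offline and resilience rules above can be illustrated with a small in-memory queue. This is a sketch under stated assumptions, not the agent's actual design: UploadQueue and its fields are hypothetical names, the upload callable stands in for the real HTTPS call, and a real agent would persist the queue to disk so that it survives restarts. The sketch assumes the upload callable raises ConnectionError when the platform is unreachable and returns False when an attempt completes without an acknowledgement.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Set


@dataclass
class UploadQueue:
    """In-memory sketch of the agent's local upload queue."""

    # Returns True when the platform acknowledges the file.
    upload: Callable[[str], bool]
    max_retries: int = 3
    pending: Deque[str] = field(default_factory=deque)
    attempts: Dict[str, int] = field(default_factory=dict)
    acknowledged: Set[str] = field(default_factory=set)
    failed: List[str] = field(default_factory=list)

    def enqueue(self, path: str) -> None:
        # Never queue a file that was already acknowledged or is already waiting.
        if path in self.acknowledged or path in self.pending:
            return
        self.pending.append(path)

    def drain(self) -> None:
        """Process the queue in arrival order until it is empty or the platform is unreachable."""
        while self.pending:
            path = self.pending[0]
            try:
                ok = self.upload(path)
            except ConnectionError:
                # Platform unreachable: keep everything queued and try again later.
                return
            if ok:
                self.pending.popleft()
                self.acknowledged.add(path)
                continue
            # The attempt completed but was not acknowledged: count it as a retry.
            self.attempts[path] = self.attempts.get(path, 0) + 1
            if self.attempts[path] >= self.max_retries:
                # Give up on this file; it would be written to the local error
                # log and reported to the platform.
                self.pending.popleft()
                self.failed.append(path)
```

In this sketch a connectivity outage does not consume retries. Only attempts that reach the platform and are not acknowledged count toward max_retries, which is one reading of the requirement above.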

Agent Registration and Management:
- Each agent must be registered to exactly one firm in the platform admin panel.
- Platform admins and firm admins must be able to view registered agents, their last seen time, their installed version, and their current status.
- Platform admins must be able to revoke an agent's access token immediately.
- Each agent must have a unique identifier visible in both the local agent status output and the platform admin panel.
- Agent registration must be part of the firm onboarding process.

Agent Audit Logging:
- Every file the agent uploads must be recorded in the firm's audit log on the hosted platform, including filename, size, timestamp, and agent ID.
- Every file the agent downloads must be recorded in the firm's audit log, including filename, destination folder, timestamp, and agent ID.
- Agent errors and connectivity failures must be logged and visible in the admin diagnostics panel.
- Agent version and last-seen information must be visible in the diagnostics panel so that outdated agents can be identified and updated.

Important MVP Infrastructure Rules
------------------------------------

The MVP must include these foundations from day one:

1. Feature flags
2. Tenant settings
3. Storage abstraction
4. AI abstraction
5. Event-based modules

These must not be treated as future upgrades.

The first version can be simple, but the architecture must be ready to grow.

Feature flags must allow the system to start as a basic transcription portal and slowly reveal advanced modules later.

Tenant settings must allow each firm to have its own branding, users, permissions, storage, templates, AI settings, email settings, and enabled features.

Storage abstraction must allow files to be stored on local disk first, then later SMB, SFTP, object storage, or tenant-specific volumes.
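
As a rough illustration of the storage abstraction point, the sketch below maps each firm to a backend so that calling code never builds file paths itself. StorageBackend, LocalDiskBackend, and FirmStorage are hypothetical names, only a local-disk backend is shown, and path validation, quotas, encryption, and retention are deliberately left out to keep the example short.

```python
from pathlib import Path
from typing import Dict, Protocol


class StorageBackend(Protocol):
    """Hypothetical interface; SMB, SFTP, or object-storage backends would implement it too."""

    def write(self, relative_path: str, data: bytes) -> None: ...

    def read(self, relative_path: str) -> bytes: ...


class LocalDiskBackend:
    """Stores files under one base directory on local disk."""

    def __init__(self, base_path: Path) -> None:
        self.base_path = base_path

    def write(self, relative_path: str, data: bytes) -> None:
        # NOTE: a real backend must validate relative_path against base_path first.
        target = self.base_path / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    def read(self, relative_path: str) -> bytes:
        return (self.base_path / relative_path).read_bytes()


class FirmStorage:
    """Chooses the backend for a firm: a per-firm override, else the shared default root."""

    def __init__(self, default_root: Path) -> None:
        self._default_root = default_root
        self._overrides: Dict[str, StorageBackend] = {}

    def assign(self, firm_id: str, backend: StorageBackend) -> None:
        # Example use: move a large firm to its own volume or bucket.
        self._overrides[firm_id] = backend

    def for_firm(self, firm_id: str) -> StorageBackend:
        if firm_id in self._overrides:
            return self._overrides[firm_id]
        return LocalDiskBackend(self._default_root / firm_id)
```

With this shape, moving a firm to a dedicated volume becomes a configuration change (one assign call plus a data copy) rather than a change to the transcription or document modules.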

AI abstraction must allow different AI providers and models to be used for transcription, cleanup, document classification, template selection, legal review, email drafting, and second-AI checking. The abstraction must enforce the firm's permitted AI privacy tier.

Event-based modules must allow future features to be added without tightly coupling modules together.

Server Build Order
------------------

Recommended infrastructure build order:

Phase 0 - Provision Server
- Create new VM/server on dedicated infrastructure.
- Install Ubuntu 24.04 LTS.
- Attach OS disk (150 GB SSD/NVMe).
- Attach separate data disk (2 TB SSD/NVMe).
- Attach or configure backup target.
- Configure hostname and DNS.
- Configure firewall (ufw) - see Firewall Rules section. Apply rules before any service is started.
- Configure SSH access. Restrict to known management IPs only. Disable password authentication. Use key-based authentication only.
- Configure automatic system security updates (unattended-upgrades).
- Configure time zone (UTC recommended for servers).
- Create application user: legaltranscribe.

Phase 1 - Base Services
- Install Nginx.
- Configure TLS certificate (Let's Encrypt / Certbot).
- Configure Nginx HTTPS redirect and strong TLS settings.
- Install PHP/app runtime (if Laravel is chosen).
- Install PostgreSQL. Configure for localhost access only.
- Install Redis. Configure persistence, authentication, and localhost binding. See Redis Configuration section.
- Install Python environment and required packages.
- Install ffmpeg.
- Install LibreOffice/headless document tools.
- Install ClamAV. Configure freshclam for daily definition updates.
- Install SMB/SFTP/FTP client tools (smbclient, openssh-client, lftp).
- Configure logrotate for all application and service logs.
- Configure system services to start on boot.

Phase 2 - Storage Layout
- Mount data disk at /data.
- Create /data/legaltranscribe directory structure.
- Create tenant folder structure (firms/, shared/, quarantine/, system/, imports/, processing/).
- Set correct ownership and permissions. See Filesystem Permissions section.
- Create application user home directory if required.
- Create backup folders or connect and test remote backup target.
- Verify no cross-folder access is possible between firm directories.

Phase 3 - Application Skeleton
- Create app at /opt/legaltranscribe/app.
- Configure database connection and run initial migrations.
- Configure queue worker and process manager (Supervisor or systemd).
- Configure encrypted environment file (.env) for secrets.
- Add firm/tenant model with firm_id scoping.
- Add users/roles/permissions module.
- Add feature flags module.
- Add tenant settings module.
- Add storage abstraction layer with path traversal protection.
- Add AI provider abstraction with privacy tier enforcement.
- Add event system.
- Add notification system foundation (in-app).
- Add basic matter/client module.

Phase 4 - Admin Safety Tools
- Add backup scripts (source, database, file, tenant export).
- Add changelog file.
- Add handover file.
- Add route list report.
- Add schema summary report.
- Add latest error log tail report.
- Add smoke-test script (including tenant isolation checks).
- Add crawler/link checker.
- Add admin diagnostics page (including queue health, worker health, ClamAV definition age, certificate expiry, storage quotas, Redis status, log sizes, cleanup job status).

Phase 5 - Basic Transcription MVP
- Manual audio upload with virus scan on receipt.
- Create transcription job.
- Queue worker processes job.
- Save raw transcript.
- Show transcript viewer/editor.
- Store transcript under correct firm folder.
- Audit all important actions.

Phase 6 - Document Template MVP
- Upload Word templates (with virus scan on receipt).
- Assign templates to firm/document type.
- Select template manually or via AI classification.
- Generate Word document.
- Apply draft watermark to generated document.
- Save completed document into tenant data folder.
- Show/download final document.

Recommended First Build Specification
---------------------------------------

Server: legaltranscribe1
OS: Ubuntu 24.04 LTS
vCPU: 8 cores
RAM: 32 GB
OS disk: 150 GB SSD/NVMe
Data disk: 2 TB SSD/NVMe
Backup target: separate off-server backup target

Suggested paths:
  App: /opt/legaltranscribe/app
  Data: /data/legaltranscribe
  Backups: /backups/legaltranscribe or remote target

Database: PostgreSQL (recommended default)
Queue/cache: Redis (with AOF persistence and localhost binding)
Web server: Nginx (HTTPS only, TLS 1.2/1.3, strong cipher suite)
App framework: As determined by the stack recommendation document
Workers: Python worker environment for transcription, document processing, AI tasks, and background jobs.

Document tools:
  LibreOffice/headless
  python-docx or equivalent
  PDF/document conversion tools where required

Audio tools:
  ffmpeg

Security tools:
  ClamAV with freshclam (daily definition updates)
  ufw firewall
  Certbot / Let's Encrypt for TLS
  Fail2ban (recommended for SSH brute-force protection)

Logging:
  logrotate configured for all application logs
  Admin diagnostics panel with log tail and log size visibility

Security:
  Encrypted secrets (environment file, secrets manager, or vault)
  Firm-scoped filesystem permissions
  Tenant-scoped storage with path traversal validation
  Audit logging from day one
  MFA support included in application (not deferred)
  SSH key-based authentication only

Recommended Final Decision
--------------------------

Start with a fresh, dedicated server/VM. Do not build this platform on a server that hosts other applications.

Recommended first server:
- 8 vCPU
- 32 GB RAM
- 150 GB OS disk
- 2 TB data disk
- separate backup target
- Ubuntu 24.04 LTS

Build one tenant-aware platform.
Use PostgreSQL as the database.
Configure Redis with persistence enabled and localhost binding.
Set up TLS with Nginx before any application is deployed.
Configure the firewall before any service is started.
Install and configure ClamAV before any file upload is enabled.
Set correct filesystem permissions before any firm data is written.
Use a separate data disk from day one.
Use logical tenant separation first. Do not allocate a separate physical disk per tenant at the beginning.

Design the storage abstraction so that large tenants can later be moved to their own disk, mount point, storage volume, object bucket, dedicated database, or dedicated server without any application code changes.

The key principle is:

One codebase.
Always tenant-aware.
Dedicated server from day one.
Separate data disk from day one.
TLS and firewall before first deployment.
Virus scanning before first file upload.
Logical tenant separation first.
Optional per-tenant disk or dedicated server later.
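
Illustrative sketch (not part of the specification): the requirement that a large tenant can later be moved without application code changes suggests that the application should look up each firm's storage location rather than build paths itself. The Python example below is a minimal sketch of that idea. The StorageLocation and TenantStorageResolver names, the "kind" values, and the /mnt/tenants path are hypothetical; only the default /data/legaltranscribe root and the firms/ folder come from this plan. Copying the data during a move is outside the scope of the sketch.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass(frozen=True)
class StorageLocation:
    kind: str  # hypothetical backend type, e.g. "local" or "object"
    root: str  # directory, mount point, or bucket prefix for one firm


@dataclass
class TenantStorageResolver:
    """Maps a firm to its storage location: shared default first, override later."""

    default_root: str = "/data/legaltranscribe/firms"
    overrides: dict[str, StorageLocation] = field(default_factory=dict)

    def location_for(self, firm_id: str) -> StorageLocation:
        # A firm with an override lives on its own disk, volume, or bucket.
        if firm_id in self.overrides:
            return self.overrides[firm_id]
        # Every other firm uses logical separation under the shared data disk.
        return StorageLocation("local", str(PurePosixPath(self.default_root) / firm_id))

    def move_tenant(self, firm_id: str, location: StorageLocation) -> None:
        # Recording the new location is a settings change, not a code change.
        self.overrides[firm_id] = location
```

A caller would only ever use resolver.location_for(firm_id). Moving a firm to a dedicated volume would then be a data copy followed by resolver.move_tenant("firm-001", StorageLocation("local", "/mnt/tenants/firm-001")), with no change to the calling code. In practice the overrides would more likely be stored in the tenant settings than held in memory.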