storage_design_plan.txt
Legal Transcription Platform - Server and Storage Design Plan
=================================================================
Version: 2026-05-26 (Updated)
Purpose
-------
This document defines the recommended server, storage, tenant-isolation, and backup
design for the Legal Transcription and Document Automation Platform.
The platform must be built as one tenant-aware codebase that can run in multiple
deployment modes:
1. Hosted multi-tenant mode
- One server/platform hosted in the data centre.
- Multiple law firms use the same system.
- Each firm is a separate tenant.
- Data, users, templates, recordings, transcripts, generated documents, AI settings,
audit logs, and feature flags are isolated by firm.
2. Hosted single-tenant dedicated mode
- The same platform runs for one firm only.
- Useful for larger clients or firms that want dedicated infrastructure.
- Still uses the same Platform -> Firm -> User structure.
3. Onsite single-tenant mode
- The same platform can be installed onsite if a firm requires local control.
- This must not be a separate product or separate codebase.
- It is the same system deployed in a different location.
4. Hosted platform with optional local connector agent
- Preferred for firms that need access to local SMB shares, recording folders,
scanners, or local document stores.
- A small local agent runs at the law firm site.
- The agent securely uploads recordings/documents to the hosted platform.
- The agent can download completed documents back to local folders.
- See the Local Connector Agent section for full specification.
Core Rule
---------
Do not build separate hosted and on-prem products.
Build one multi-tenant, tenant-aware platform that can run in different deployment modes.
The system must always use this structure, even if there is only one firm on
the server:
  Platform
    -> Firm / Tenant
      -> Users
The code must never assume there is only one firm.
The platform must be deployed on a dedicated server. It must not share infrastructure
with unrelated applications such as billing software or other business systems.
Recommended Starting Server
---------------------------
For the first proper MVP/development server:
VM name: legaltranscribe1
Operating system: Ubuntu 24.04 LTS
Suggested resources:
vCPU: 8 cores
RAM: 32 GB
OS disk: 150 GB SSD/NVMe
Data disk: 2 TB SSD/NVMe
Backup target: separate disk, NAS, SMB share, SFTP server, or backup server
Initial services:
- Nginx (with TLS from day one)
- PHP/Laravel or selected web framework
- PostgreSQL (recommended - see Database Design section)
- Redis (with persistence enabled - see Redis Configuration section)
- Queue worker
- Python worker environment
- LibreOffice/headless document tools
- ffmpeg
- ClamAV or equivalent virus scanner
- SMB/SFTP/FTP connector tools
- Backup scripts
- AI provider configuration
- Monitoring/logging tools
- logrotate configuration
This is enough for the MVP because the heavy AI processing will likely use external
AI APIs at first, rather than local GPU-based inference.
Recommended Production Starting Server
--------------------------------------
For a hosted multi-tenant production server in the data centre:
Suggested resources:
vCPU: 16 cores
RAM: 64 GB
OS disk: 150-250 GB SSD/NVMe
Data disk: 2-4 TB SSD/NVMe
Database disk: separate high-speed SSD/NVMe volume (recommended)
Backup target: separate physical storage, backup server, NAS, SFTP target,
or cloud/object storage
The exact size should be increased based on:
- number of firms
- number of users
- transcription minutes per day
- recording retention period
- document retention period
- AI usage volume
- live dictation usage
- backup retention policy
Minimum Disk Design
-------------------
The server should use a separate data disk from the beginning.
Minimum recommended layout:
Disk 1 - OS and Application
/
/opt/legaltranscribe/app
/var/log
/etc
Disk 2 - Application Data
/data/legaltranscribe/
Backup Target - Separate Storage
/backups/legaltranscribe/
or an external SMB/SFTP/NAS/cloud backup destination
The data disk holds all legal recordings, transcripts, templates, generated documents,
attachments, exports, tenant files, and temporary processing files.
The OS/application disk must not be used as the main legal data store.
Why Use a Separate Data Disk From Day One
-----------------------------------------
A separate data disk is strongly recommended because it gives:
1. Easier backups
2. Easier restores
3. Easier storage expansion
4. Easier tenant migration later
5. Cleaner separation between application code and legal data
6. Lower risk if the OS disk fills up
7. Easier encryption of legal data
8. Easier snapshots of legal data
9. Easier migration to another server
10. Better long-term scaling
Recommended Data Folder Layout
-------------------------------
Use a tenant-aware folder layout from day one.
Recommended structure:
/data/legaltranscribe/
  firms/
    firm_0001/
      recordings/
      transcripts/
      documents/
      templates/
      attachments/
      exports/
      temp/
      reports/
      backups/
    firm_0002/
      recordings/
      transcripts/
      documents/
      templates/
      attachments/
      exports/
      temp/
      reports/
      backups/
  shared/
  quarantine/
  system/
  imports/
  processing/
Each firm must have its own isolated storage path.
Example tenant folder paths:
/data/legaltranscribe/firms/{firm_id}/recordings
/data/legaltranscribe/firms/{firm_id}/transcripts
/data/legaltranscribe/firms/{firm_id}/documents
/data/legaltranscribe/firms/{firm_id}/templates
/data/legaltranscribe/firms/{firm_id}/attachments
/data/legaltranscribe/firms/{firm_id}/exports
/data/legaltranscribe/firms/{firm_id}/temp
/data/legaltranscribe/firms/{firm_id}/reports
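The path scheme above can be sketched as a small helper. This is a minimal
illustration, not the platform's actual storage layer: the function names
firm_folder and create_firm_tree are hypothetical, and the firm ID pattern
(firm_ followed by at least four digits) is an assumption based on the
firm_0001 examples in the layout.

```python
import re
from pathlib import Path

DATA_ROOT = Path("/data/legaltranscribe")

# Sub-folder names taken from the example tenant folder paths above.
FIRM_SUBFOLDERS = (
    "recordings", "transcripts", "documents", "templates",
    "attachments", "exports", "temp", "reports",
)

# Assumption: firm IDs look like "firm_0001", as in the layout above.
FIRM_ID_PATTERN = re.compile(r"firm_[0-9]{4,}")


def firm_folder(firm_id: str, kind: str, root: Path = DATA_ROOT) -> Path:
    """Return the storage path for one folder type of one firm."""
    if not FIRM_ID_PATTERN.fullmatch(firm_id):
        raise ValueError(f"invalid firm id: {firm_id!r}")
    if kind not in FIRM_SUBFOLDERS:
        raise ValueError(f"unknown folder type: {kind!r}")
    return root / "firms" / firm_id / kind


def create_firm_tree(firm_id: str, root: Path = DATA_ROOT) -> list:
    """Create every sub-folder for a new firm and return the created paths."""
    created = []
    for kind in FIRM_SUBFOLDERS:
        path = firm_folder(firm_id, kind, root)
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Rejecting malformed firm IDs at this point means a value such as "../etc"
never reaches the filesystem as part of a path.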
Folder Definitions
------------------
Each top-level folder under /data/legaltranscribe/ has a defined purpose:
firms/
Contains one sub-folder per tenant. Each firm's data is fully isolated within
its own folder tree. No application process should ever read or write across
firm boundaries.
shared/
Platform-level assets that are not firm-specific. This may include global
template starters, platform branding assets, and shared reference files
distributed to firms. This folder must not contain any firm data or recordings.
quarantine/
Files that have failed virus/malware scanning or format validation are moved
here instead of being accepted into a firm's folder or deleted outright.
Quarantined files must be reviewed by a platform admin before permanent removal.
This folder must be restricted to platform admin access only. No firm user or
application process other than the scanner and the admin tool should be able
to read from this folder.
system/
Platform-level operational files such as lock files, PID files, health check
outputs, and system-level job state. Not for firm data.
imports/
Staging area for files arriving from external connectors (SMB, SFTP, FTP,
local agent) before they have been validated and assigned to a firm. Files
should not remain here after processing completes. Stale files in imports
indicate a failed or stuck import job and should trigger an alert.
processing/
Temporary working area for files actively being processed by a queue worker
(e.g. audio being transcribed, documents being merged). A file should only
exist in processing while a worker job is actively working on it. Stale
files here indicate a crashed or stuck worker and must trigger an alert.
See the Temp and Processing Folder Lifecycle section.
firms/{firm_id}/temp/
Per-firm temporary folder for intermediate files created during a job for
that firm. Must be cleaned on schedule. See the Temp and Processing Folder
Lifecycle section.
Temp and Processing Folder Lifecycle
-------------------------------------
The temp and processing folders can accumulate large audio files and intermediate
documents if not actively managed. A cleanup job must run on a schedule.
Rules:
- Files in /data/legaltranscribe/firms/{firm_id}/temp/ that are older than a
configurable threshold (default: 24 hours) must be automatically deleted by
a scheduled cleanup job.
- Files in /data/legaltranscribe/processing/ should only exist while a queue
worker is actively working on that job. If a file has been in processing
longer than a configurable threshold (default: 2 hours) and its associated
job is no longer active in the queue, this is treated as a stale file from
a crashed or stuck worker. The cleanup job must flag it and notify the
platform admin.
- Files in /data/legaltranscribe/imports/ should be cleared as soon as they
are validated and moved to the correct firm folder. A file remaining in
imports longer than a configurable threshold (default: 1 hour) after arrival
indicates a failed import job and should trigger an alert.
- The cleanup job must be a scheduled queue job, visible in the admin
diagnostics panel, with a log of what was cleaned and when.
- The cleanup job must never delete files from firm data folders (recordings,
documents, transcripts). It only cleans temp, processing, and imports.
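The cleanup rules above could be sketched roughly as follows. The function
names are hypothetical, the thresholds are the defaults stated in the rules,
and the set of active job file names is assumed to come from the queue
backend. Notification and logging of what was cleaned are left out.

```python
import time
from pathlib import Path
from typing import Iterable, List, Optional

TEMP_MAX_AGE_SECONDS = 24 * 3600        # default: 24 hours
PROCESSING_MAX_AGE_SECONDS = 2 * 3600   # default: 2 hours


def files_older_than(folder: Path, max_age_seconds: float,
                     now: Optional[float] = None) -> List[Path]:
    """Return files under folder whose modification time exceeds the age limit."""
    now = time.time() if now is None else now
    return sorted(
        p for p in folder.rglob("*")
        if p.is_file() and now - p.stat().st_mtime > max_age_seconds
    )


def clean_firm_temp(temp_dir: Path, now: Optional[float] = None) -> List[Path]:
    """Delete expired files from a firm temp folder and return what was deleted."""
    # Guard: this job must never touch recordings, documents, or transcripts.
    if temp_dir.name != "temp":
        raise ValueError("refusing to clean a folder that is not a temp folder")
    deleted = []
    for path in files_older_than(temp_dir, TEMP_MAX_AGE_SECONDS, now):
        path.unlink()
        deleted.append(path)
    return deleted


def stale_processing_files(processing_dir: Path, active_job_files: Iterable[str],
                           now: Optional[float] = None) -> List[Path]:
    """Return old processing files with no active job. They are flagged, not deleted."""
    active = set(active_job_files)
    return [
        p for p in files_older_than(processing_dir, PROCESSING_MAX_AGE_SECONDS, now)
        if p.name not in active
    ]
```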
Storage Isolation Rule
----------------------
The application must never directly hard-code file paths inside business logic.
All file access must go through a storage abstraction layer.
The storage abstraction must understand:
- firm/tenant ID
- storage type
- base path
- permissions
- encryption rules
- retention rules
- quota rules
All file paths resolved by the storage abstraction must be validated against the
allowed base path for that firm before any read or write operation is executed.
This prevents path traversal attacks where a manipulated filename or folder name
could cause the application to access files outside the firm's permitted path.
If a resolved path does not begin with the firm's permitted base path, the
operation must be rejected and the attempt must be logged in the audit log.
Routing all file access through the storage abstraction also allows a firm to be
moved later from shared storage to its own disk, volume, SMB share, SFTP location,
object bucket, or dedicated server without rewriting the transcription, document,
email, or AI modules.
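A minimal sketch of the base-path check, assuming local-disk storage. The
names resolve_firm_path and PathOutsideFirmError are hypothetical. The check
compares resolved path components rather than raw string prefixes, because a
plain prefix test would wrongly accept .../firm_00010/ as being inside
.../firm_0001.

```python
import logging
from pathlib import Path

audit_log = logging.getLogger("audit")


class PathOutsideFirmError(Exception):
    """Raised when a requested path resolves outside the firm's base path."""


def resolve_firm_path(firm_base: Path, requested: str) -> Path:
    """Resolve a requested path and reject it unless it stays inside firm_base."""
    base = firm_base.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below sees the real target. An absolute requested path replaces base.
    candidate = (base / requested).resolve()
    if not candidate.is_relative_to(base):  # requires Python 3.9+
        audit_log.warning("rejected path outside firm base: %r", requested)
        raise PathOutsideFirmError(requested)
    return candidate
```

Encoded traversal sequences should be decoded before this function is
called, so that the resolved path reflects what the filesystem would see.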
Filesystem Permissions
----------------------
Correct filesystem permissions are critical because this system handles confidential
legal recordings and documents.
Required permission model:
Application user:
- Create a dedicated application user, for example: legaltranscribe
- The application and all queue workers must run under this user
- This user owns the application code directory and the data directory
Web server user:
- Nginx (www-data or equivalent) must have no direct access to the data disk
- The web server must only serve the application through the PHP/app runtime
- The web server must not be able to browse or read firm data folders directly
Firm folder permissions:
- Each firm folder must be readable and writable only by the application user
- No other OS user should have read access to firm folders
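- Permissions should be set to 750 or stricter on firm data directories. Mode
750 also grants read access to members of the owning group, so use 700 where
no group access is required.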
Quarantine folder:
- Writable by the application user (for moving files into quarantine)
- Readable only by the application user and platform admin processes
- No firm-level process should be able to read from quarantine
Log folders:
- Application logs writable by the application user
- Log files must not be world-readable
- Log files must never contain secrets, API keys, or credential values
Upload validation:
- All uploaded file paths and connector-sourced file paths must be validated
against the permitted base path for the receiving firm before acceptance
- Path traversal characters (../, ..\, encoded equivalents) must be rejected
at the storage abstraction layer before any file operation
Summary of recommended permissions:
  Path                                 Mode  Owner:Group
  /opt/legaltranscribe/app             750   legaltranscribe:legaltranscribe
  /data/legaltranscribe/               750   legaltranscribe:legaltranscribe
  /data/legaltranscribe/firms/         750   legaltranscribe:legaltranscribe
  /data/legaltranscribe/quarantine/    750   legaltranscribe:legaltranscribe
  /data/legaltranscribe/processing/    750   legaltranscribe:legaltranscribe
  /backups/legaltranscribe/            750   legaltranscribe:legaltranscribe
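A permission audit for the directories above might look like the sketch
below. The function names are hypothetical. It only checks mode bits;
verifying the owner and group would be a separate step.

```python
import os
import stat


def is_750_or_stricter(path: str) -> bool:
    """Return True if the path has no permission bits beyond rwxr-x---."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Any bit set outside 0o750 (group write, any "other" access, or a
    # setuid/setgid/sticky bit) makes the path more permissive than allowed.
    return (mode & ~0o750) == 0


def too_permissive(paths) -> list:
    """Return the subset of paths that violate the 750-or-stricter rule."""
    return [p for p in paths if not is_750_or_stricter(p)]
```

A check like this could feed the admin diagnostics panel so that permission
drift is noticed early.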
Virus and Malware Scanning
--------------------------
Law firms upload files from many sources including client-supplied recordings,
email attachments, scanned documents, and files arriving via SMB or SFTP connectors.
All uploaded files must be scanned before they are accepted into a firm's storage path.
Requirements:
Scanner:
- ClamAV or an equivalent open-source virus/malware scanner must be installed
and kept up to date.
- The scanner must be integrated into the file upload pipeline and the
connector import pipeline.
- Scanning must run before the file is moved from the imports or temp area
into the firm's permanent storage path.
On scan failure:
- Files that fail scanning must be moved to the quarantine folder.
- The associated job must be marked as failed with a clear error message.
- The firm admin and the submitting user must be notified.
- The quarantined file must not be deleted automatically. It must be reviewed
by a platform admin before any action is taken.
On scan pass:
- The file proceeds normally through the upload or import pipeline.
- The scan result (clean, timestamp, scanner version) must be recorded in the
audit log for that file.
Scope:
- Scanning applies to all uploaded audio files, Word documents, PDFs,
attachments, and any file arriving via a connector.
- Email attachments processed through the email integration module must also
be scanned before being accepted.
Updates:
- ClamAV virus definitions must be updated automatically on a daily schedule
at minimum using freshclam or equivalent.
- The last successful definition update date must be visible in the admin
diagnostics panel. A stale definition database must trigger an alert.
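The scan-then-route step could be sketched as below. The clamscan exit codes
(0 for clean, 1 for a detected virus, anything else for a scanner error) are
ClamAV's documented behaviour; everything else, including the function names
and the decision to quarantine on scanner error rather than accept the file,
is an assumption for illustration.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable


def clamscan_exit_code(path: Path) -> int:
    """Run clamscan on one file and return its exit code (requires ClamAV)."""
    return subprocess.run(["clamscan", "--no-summary", str(path)]).returncode


def classify_exit_code(code: int) -> str:
    """Map a clamscan exit code to a verdict."""
    if code == 0:
        return "clean"
    if code == 1:
        return "infected"
    return "error"


def route_scanned_file(src: Path, firm_dest: Path, quarantine: Path,
                       scan: Callable[[Path], int] = clamscan_exit_code) -> str:
    """Scan src, then move it to the firm folder if clean, else to quarantine."""
    verdict = classify_exit_code(scan(src))
    # Assumption: fail closed. A scanner error is treated like a failed scan.
    target_dir = firm_dest if verdict == "clean" else quarantine
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(target_dir / src.name))
    return verdict
```

Passing the scanner in as a callable keeps the routing logic testable
without ClamAV installed. The caller would still need to mark the job as
failed, send notifications, and write the audit entry.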
Storage Quota Enforcement
--------------------------
Storage quotas allow the platform admin to control how much disk space each firm
can consume and to enforce plan limits in hosted mode.
Requirements:
Tracking:
- Storage usage per firm must be tracked in the database.
- The usage counter must be updated whenever files are written to or deleted
from a firm's storage path.
- Usage must be calculated across all firm sub-folders: recordings, transcripts,
documents, templates, attachments, exports, and temp.
Quota settings:
- The platform admin must be able to set a storage quota per firm, either as
part of the firm's plan or as a per-firm override.
- A default quota can be set at the platform level and applied to new firms.
Warning threshold:
- When a firm's storage usage reaches a configurable warning threshold
(default: 80% of quota), the platform admin and firm admin must be notified.
- The warning must appear in the admin dashboard and as a notification.
Quota enforcement:
- When a firm reaches 100% of its quota, further file uploads must be blocked
with a clear, user-readable error message.
- The block must apply to all upload paths: web upload, connector import,
email attachment import, and agent upload.
- Jobs must not silently fail when quota is exceeded. The error must be
surfaced to the user and logged in the audit trail.
Admin override:
- Platform admins must be able to increase a firm's quota at any time.
- Platform admins must be able to view storage usage per firm in the admin panel.
Visibility:
- Storage usage and quota must be visible to firm admins in their firm
settings area.
- A storage usage breakdown by folder type (recordings, documents, etc.)
should be shown where practical.
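The warning and enforcement thresholds can be expressed as one decision
function. This is a sketch with a hypothetical name, check_upload. It
assumes that an upload which would push usage past the quota is blocked
before it is written, which is one reasonable reading of the enforcement
rule above.

```python
WARNING_PERCENT = 80  # default warning threshold from the requirements above


def check_upload(used_bytes: int, quota_bytes: int, upload_bytes: int,
                 warning_percent: int = WARNING_PERCENT) -> str:
    """Return "blocked", "warning", or "ok" for a proposed upload."""
    projected = used_bytes + upload_bytes
    # Block when the firm is already at quota or the upload would exceed it.
    if used_bytes >= quota_bytes or projected > quota_bytes:
        return "blocked"
    # Integer arithmetic avoids rounding surprises at the threshold.
    if projected * 100 >= quota_bytes * warning_percent:
        return "warning"
    return "ok"
```

The same function would be called from every upload path (web, connector,
email attachment, agent) so that enforcement cannot be bypassed by one of
them.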
Should Each Tenant Have a Separate Hard Disk?
---------------------------------------------
Not at the beginning.
The recommended approach is a progressive isolation model:
Tenant isolation level 1 (MVP):
Shared database with tenant_id on every tenant-owned table.
Separate tenant folders under /data/legaltranscribe/firms/{firm_id}/.
Logical separation enforced by the application and storage abstraction.
Tenant isolation level 2:
Separate tenant storage folder or volume with enforced quotas.
Tenant isolation level 3:
Separate disk, volume, mount point, or object storage bucket for larger firms.
Example: mount a dedicated disk directly at /data/legaltranscribe/firms/firm_0042/
Tenant isolation level 4:
Dedicated VM or server for very large, high-security, or high-volume firms.
For the MVP, use logical tenant separation first.
Design the storage abstraction so that large tenants can later be moved to their own
disk, mount point, storage volume, object bucket, dedicated database, or dedicated
server without rewriting application code.
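One way the storage abstraction could support this progression is a per-firm
base path lookup. The sketch below is illustrative only: firm_base_path and
the dedicated_roots mapping are hypothetical names, and in practice the
mapping would come from firm settings in the database.

```python
from pathlib import Path
from typing import Mapping

SHARED_FIRMS_ROOT = Path("/data/legaltranscribe/firms")


def firm_base_path(firm_id: str, dedicated_roots: Mapping[str, str]) -> Path:
    """Return the firm's base path, honouring a dedicated location if one is set."""
    if firm_id in dedicated_roots:
        return Path(dedicated_roots[firm_id])
    return SHARED_FIRMS_ROOT / firm_id
```

Because callers only ever ask for the base path, moving a firm to another
volume becomes a settings change plus a data copy, not a code change.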
Database Design
---------------
The system must be tenant-aware at the database level.
Recommended database: PostgreSQL
PostgreSQL is the recommended choice over MariaDB for the following reasons:
- Row-level security support, which can be used as an additional tenant isolation
safeguard at the database layer.
- More mature full-text search, which may be used for transcript and document search.
- Better JSON/JSONB support for flexible settings and metadata storage.
- Stronger data integrity and constraint enforcement.
- Widely supported by Laravel, Python ORMs, and most deployment tooling.
MariaDB remains an acceptable fallback if PostgreSQL is not feasible for a specific
deployment, but PostgreSQL should be the default for all new builds.
The database choice must be recorded in the stack recommendation document before
coding begins.
Recommended MVP approach:
Shared database with tenant_id / firm_id on every tenant-owned table.
Every tenant-owned table must include firm_id unless the table is truly global
platform data.
Tenant-scoped tables include:
- users
- recordings
- transcription jobs
- transcripts
- transcript versions
- generated documents
- document versions
- templates
- template versions
- AI jobs
- AI usage logs
- AI review results
- email connections
- email drafts
- sent emails
- file connectors
- storage connectors
- audit logs
- feature flags (firm-level)
- firm settings
- backup jobs
- reports
- notifications
- matters / clients
- queue jobs (where firm context is applicable)
- failed jobs (with firm context)
- retention policies
- firm plan assignments
- storage quotas and usage tracking
- local connector agents
Global platform tables may include:
- platform admins
- global AI provider definitions
- global feature definitions
- global system settings
- platform health logs
- product plans and packages
- global feature flags
Future database isolation options:
- separate database per firm for high-security firms
- hybrid shared platform database plus dedicated firm database
- dedicated VM/server per firm
The MVP must not block any of these future options.
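The firm_id scoping rule can be illustrated with a small repository class
that cannot be constructed without a firm context. The sketch uses SQLite
from the Python standard library purely as a stand-in so that it runs
anywhere; the class name, table, and columns are hypothetical, and a real
build would target PostgreSQL through the chosen framework.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a tenant-owned table: firm_id is mandatory on every row."""
    conn.execute(
        "CREATE TABLE recordings ("
        "id INTEGER PRIMARY KEY, "
        "firm_id INTEGER NOT NULL, "
        "filename TEXT NOT NULL)"
    )


class FirmScopedRecordings:
    """Every query made through this class is bound to exactly one firm."""

    def __init__(self, conn: sqlite3.Connection, firm_id: int):
        if firm_id is None:
            raise ValueError("firm context is required")
        self._conn = conn
        self._firm_id = firm_id

    def add(self, filename: str) -> None:
        self._conn.execute(
            "INSERT INTO recordings (firm_id, filename) VALUES (?, ?)",
            (self._firm_id, filename),
        )

    def filenames(self) -> list:
        rows = self._conn.execute(
            "SELECT filename FROM recordings WHERE firm_id = ? ORDER BY id",
            (self._firm_id,),
        )
        return [row[0] for row in rows]
```

Business logic never writes its own WHERE firm_id clause, which reduces the
chance of a query that forgets the filter.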
Optional Separate Database Disk
--------------------------------
For the MVP, the database can live on the OS disk or the data disk.
For production, consider separating:
OS disk:
- Ubuntu
- application code
- system packages
Database disk:
- PostgreSQL data directory
File data disk:
- recordings
- transcripts
- generated documents
- templates
- attachments
This is not mandatory for the MVP, but the system should be designed so that the
database storage can be moved to a dedicated disk or server later without
application changes.
Redis Configuration
-------------------
Redis is used as the queue backend and cache. For a legal workflow platform where
queue jobs represent real billable work, Redis must be configured with persistence
enabled so that a Redis restart or unexpected shutdown does not silently discard
pending jobs.
Required configuration:
Persistence:
- Enable AOF (append-only file) persistence as the primary durability mechanism.
Set appendonly yes in the Redis configuration.
- Alternatively, configure RDB snapshots with a short interval (e.g. every
60 seconds if 1000 keys have changed) as a minimum.
- For production, AOF with appendfsync set to everysec is the recommended
balance between durability and performance.
Binding:
- Redis must only bind to localhost (127.0.0.1) or a private internal network
interface.
- Redis must never be exposed on a public network interface or internet-facing
port.
Authentication:
- Redis must be protected with a strong password (requirepass).
- The password must be stored in the application's encrypted environment
configuration, not in plain text.
Memory limits:
- Set a maxmemory limit appropriate to the server's available RAM.
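- Set maxmemory-policy to noeviction on the Redis instance that holds the
queue so that Redis never silently drops queue jobs to free memory. The
maxmemory-policy setting applies to the whole Redis instance, not to an
individual numbered database, so caching that needs an eviction policy must
use a separate Redis instance.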
Monitoring:
- Redis health and memory usage must be included in the admin diagnostics panel.
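The required settings can be verified automatically, for example as part of
the server build checklist. The sketch below parses the text of a redis.conf
file and reports missing or unsafe values. The directive names are real
Redis directives; the function names and the list of permitted bind networks
(loopback plus the RFC 1918 private ranges) are assumptions for
illustration. It requires each directive to be set explicitly, which is
stricter than relying on Redis defaults.

```python
import ipaddress

# Assumption: "private internal network" means loopback or RFC 1918 ranges.
ALLOWED_BIND_NETWORKS = [
    ipaddress.ip_network(n)
    for n in ("127.0.0.0/8", "::1/128", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def parse_redis_conf(text: str) -> dict:
    """Map each directive name to the argument list of its last occurrence."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        settings[parts[0].lower()] = parts[1:]
    return settings


def check_redis_conf(text: str) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    conf = parse_redis_conf(text)
    problems = []
    if conf.get("appendonly") != ["yes"]:
        problems.append("appendonly must be yes")
    if conf.get("appendfsync") != ["everysec"]:
        problems.append("appendfsync must be everysec")
    if conf.get("maxmemory-policy") != ["noeviction"]:
        problems.append("maxmemory-policy must be noeviction")
    if not conf.get("maxmemory"):
        problems.append("maxmemory must be set")
    if not conf.get("requirepass"):
        problems.append("requirepass must be set")
    addresses = conf.get("bind")
    if not addresses:
        problems.append("bind must be set explicitly")
    for addr in addresses or []:
        try:
            # Newer Redis versions allow a "-" prefix on optional addresses.
            ip = ipaddress.ip_address(addr.lstrip("-"))
            allowed = any(ip in net for net in ALLOWED_BIND_NETWORKS)
        except ValueError:
            allowed = False  # wildcards such as "*" are never acceptable
        if not allowed:
            problems.append(f"bind address {addr} is not loopback or private")
    return problems
```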
TLS / HTTPS Configuration
--------------------------
The web portal must be served over HTTPS from the very first deployment, including
the MVP/development server. Plain HTTP must not be used at any stage.
Requirements:
Development/MVP server:
- Obtain a TLS certificate from Let's Encrypt using Certbot or equivalent.
- Configure Nginx to redirect all HTTP traffic (port 80) to HTTPS (port 443).
- Set up automatic certificate renewal via a cron job or systemd timer.
Production server:
- Use Let's Encrypt for standard deployments.
- Use an internal certificate authority or a purchased certificate for
environments with stricter requirements.
- Ensure certificate renewal is monitored and alerts are sent before expiry.
Nginx TLS settings:
- Use TLS 1.2 and TLS 1.3 only. Disable TLS 1.0 and TLS 1.1.
- Use a strong cipher suite. Follow current Mozilla SSL Configuration
Generator recommendations for the Intermediate or Modern profile.
- Enable HSTS (HTTP Strict Transport Security) with a suitable max-age.
- Enable OCSP stapling where supported.
Custom domains:
- For hosted mode with per-firm custom domains or subdomains, a wildcard
certificate or per-domain certificate must be provisioned and renewed for
each firm domain.
- Certificate management for custom firm domains must be included in the
firm onboarding process.
Certificate monitoring:
- Certificate expiry dates must be visible in the admin diagnostics panel.
- An alert must fire if any certificate is within 14 days of expiry.
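The 14-day expiry alert reduces to a date comparison. The sketch below
assumes the expiry timestamps have already been collected per domain (for
example from the notAfter field of each certificate); the function names are
hypothetical. ssl.cert_time_to_seconds is the standard-library helper for
parsing that field's text format.

```python
import ssl
from datetime import datetime, timedelta, timezone

ALERT_WINDOW = timedelta(days=14)


def parse_not_after(not_after: str) -> datetime:
    """Convert a certificate notAfter string into an aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def certificates_to_alert(expiry_by_domain: dict, now: datetime) -> list:
    """Return domains whose certificate expires within 14 days or has expired."""
    return sorted(
        domain for domain, expires in expiry_by_domain.items()
        if expires - now <= ALERT_WINDOW
    )
```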
Firewall Rules
--------------
The server must have a firewall configured from the moment it is provisioned.
Only the minimum required ports should be open.
Required firewall rules:
Inbound - allowed:
- Port 80 (HTTP) - allowed from anywhere, for Let's Encrypt and HTTP redirect
- Port 443 (HTTPS) - allowed from anywhere, for the web portal
- Port 22 (SSH) - restricted to known management IP addresses only.
Never open SSH to the world.
Inbound - blocked:
- Database port (PostgreSQL 5432) - must never be exposed externally.
Database access must only be available on localhost or a private interface.
- Redis port (6379) - must never be exposed externally. Redis must only
be available on localhost or a private interface.
- SMB ports (445, 139) - the application connects outbound to SMB shares.
Inbound SMB must never be open on the platform server.
- All other ports not explicitly listed above must be blocked by default.
Outbound:
- The server must be able to reach external AI provider APIs over HTTPS.
- The server must be able to reach external SFTP, FTP, and SMB targets
for connector jobs.
- The local connector agent connects outbound to the platform API on port 443.
The agent must not require any inbound firewall rule on the platform server.
- Outbound access should be restricted to required destinations where the
hosting environment supports outbound firewall rules.
Firewall tool:
- Use ufw (Uncomplicated Firewall) on Ubuntu as the default.
- All rules must be documented and version-controlled alongside the server
build documentation.
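To keep the rules version-controlled, the inbound rule set can be generated
from a short list of management addresses. The sketch below only builds the
ufw command strings and does not run them; the function name is
hypothetical. It covers inbound rules only, since outbound restrictions
depend on the hosting environment.

```python
import ipaddress


def ufw_inbound_commands(management_sources: list) -> list:
    """Build ufw commands: deny by default, open 80/443, restrict SSH by source."""
    if not management_sources:
        raise ValueError("at least one management IP or network is required for SSH")
    commands = [
        "ufw default deny incoming",
        "ufw allow 80/tcp",
        "ufw allow 443/tcp",
    ]
    for source in management_sources:
        network = ipaddress.ip_network(source, strict=False)
        if network.prefixlen == 0:
            raise ValueError("SSH must never be open to the world")
        commands.append(f"ufw allow from {network} to any port 22 proto tcp")
    return commands
```

PostgreSQL, Redis, and SMB ports need no explicit rule here because the
default deny policy already blocks them.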
Log Management
--------------
The platform generates audit logs, application logs, AI usage logs, queue worker
logs, and connector logs continuously. Without active log management, log files
will grow without bound and can fill the OS disk.
Requirements:
logrotate:
- logrotate must be configured for all application log files from day one.
- Logs should be rotated daily or when they reach a configurable size threshold.
- Compressed old logs should be retained for a configurable number of days.
- logrotate configuration must be included in the server build checklist.
Log retention:
- Application error logs must be retained for a minimum configurable period
(default: 90 days).
- Audit logs must be retained according to each firm's retention policy, which
may be longer than the application log retention period.
- Queue worker logs must be retained for a minimum configurable period
(default: 30 days).
Log content rules:
- Logs must never contain secrets, API keys, passwords, or credential values
at any log level (debug, info, warning, error).
- Logs must never contain full file paths that reveal the internal folder
structure to unauthorised parties.
- Logs must not contain unredacted personal data beyond what is required for
audit purposes.
- These rules apply to all log destinations: files, database audit logs,
third-party log aggregators.
Log visibility:
- A tail of the most recent application errors must be available in the
admin diagnostics panel.
- Log file sizes and last-rotation dates should be visible in the diagnostics
panel so that log growth problems are caught early.
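As a safety net for the log content rules, a logging filter can redact
values that look like secrets before a record is written. This is a sketch
only: the class name and the pattern (key names followed by = or :) are
assumptions, and pattern matching cannot catch every secret, so it
complements rather than replaces the rule that secrets are never passed to
the logger.

```python
import logging
import re

# Assumption: secrets appear as "<name>=<value>" or "<name>: <value>".
SECRET_PATTERN = re.compile(
    r"(?i)\b(password|passwd|api[_-]?key|secret|token)\b(\s*[=:]\s*)(\S+)"
)


class RedactSecretsFilter(logging.Filter):
    """Replace secret-looking values in log messages with [REDACTED]."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so secrets passed as arguments are caught.
        message = record.getMessage()
        record.msg = SECRET_PATTERN.sub(r"\1\2[REDACTED]", message)
        record.args = ()
        return True  # always keep the record, only its content is changed
```

The filter would be attached to each handler with handler.addFilter(...), so
that it applies at every log level and destination that handler serves.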
MFA:
- MFA support must be included in the application. It is not optional and
must not be deferred to a future phase. Platform admin accounts must require
MFA. Firm admin accounts should support MFA with the option to require it
per firm.
Backup Design
-------------
Do not rely on VM snapshots as the only backup.
Snapshots are useful, but they are not sufficient on their own.
The platform must support:
1. App/source backup
2. Database dump
3. File/data backup
4. Per-tenant backup/export
5. Off-server backup copy
6. Restore test
7. Backup status reporting
8. Backup failure alerts
Backups must be stored away from the live data disk.
Recommended backup targets:
- separate VM disk
- backup server
- NAS
- SMB share
- SFTP server
- cloud/object storage (future option)
Backups must support encryption.
Backup Types
------------
The system must support these backup types:
1. Fast source backup
- Used before code or UI-only changes.
- Includes application source, config templates, scripts, changelog, and
handover files.
- Does not require a full database dump unless schema or data is being changed.
2. Database backup
- Used when schema, migrations, tables, seed data, or production data are changed.
- Must include a full database dump and schema summary.
3. File/data backup
- Includes recordings, transcripts, documents, templates, attachments,
exports, and reports.
- May be full or incremental.
4. Per-tenant backup/export
- Used to export one firm's settings, templates, documents, transcripts,
and configuration.
- Secrets must not be exported in plain text.
5. Full platform backup
- Used for complete server and platform recovery.
Backup Folder Layout
--------------------
Recommended backup layout:
/backups/legaltranscribe/
  source/
  database/
  files/
  tenants/
    firm_0001/
    firm_0002/
  handovers/
  reports/
  restore-tests/
Backups must include:
- application source
- database
- uploaded recordings
- transcripts
- final documents
- Word templates
- attachments
- config/settings exports
- audit logs where appropriate
- changelog
- handover file
- route list
- schema summary
- smoke-test reports
- latest error log tail
Tenant Backup and Restore Requirements
--------------------------------------
In hosted mode, it must be possible to restore:
1. The whole server/platform
2. One firm/tenant
3. One firm's settings and templates
4. One firm's documents and transcripts where practical
5. One firm's data into a dedicated instance if required later
Per-firm backup retention policies must be configurable.
Sensitive items must never be exported in plain text:
- passwords
- API keys
- email credentials
- storage credentials
- private keys
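A per-tenant export can enforce this rule by redacting sensitive values
before anything is written. The sketch below walks a nested settings
structure; the function name and the key markers are assumptions (the list
above names passwords, API keys, credentials, and private keys, and the
sketch adds secret and token as further guesses).

```python
# Assumption: sensitive settings can be recognised by these key fragments.
SENSITIVE_MARKERS = ("password", "api_key", "credential", "private_key", "secret", "token")


def redact_for_export(value, key: str = ""):
    """Return a copy of value with every sensitive entry replaced by a marker."""
    if any(marker in key.lower() for marker in SENSITIVE_MARKERS):
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact_for_export(v, str(k)) for k, v in value.items()}
    if isinstance(value, list):
        return [redact_for_export(item) for item in value]
    return value
```

Redacting by key name also hides a whole nested object stored under a
sensitive key, such as a storage_credentials block.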
Security Requirements
---------------------
The server will handle confidential legal recordings and documents.
Minimum requirements:
- tenant isolation at application, database, and filesystem layers
- encrypted credentials and secrets
- restricted filesystem permissions (see Filesystem Permissions section)
- audit logs
- role-based access control
- MFA support for platform admin and firm admin accounts
- feature flags and route-level blocking for disabled modules
- secure backup storage with encryption
- no secrets in any log at any log level
- no API keys in handover files or changelogs
- no cross-firm file access
- no cross-firm database query leakage
- background jobs must always run with explicit firm context
- all uploaded files scanned before acceptance (see Virus and Malware Scanning section)
- TLS from day one (see TLS / HTTPS Configuration section)
- firewall configured before any service is started (see Firewall Rules section)
- path traversal validation in the storage abstraction layer
Platform admins must be audited whenever they access or impersonate a firm user.
Support access must be explicit, time-limited, and fully logged. The firm admin
must be able to view the support access history for their firm.
The system must make it visually clear in the portal when a platform admin is
acting inside a firm context.
Local Connector Agent
---------------------
The local connector agent is the preferred solution for firms that need access to
local SMB shares or network recording folders but do not want the full platform
installed onsite. It corresponds to deployment mode 4 in the Purpose section.
Agent Responsibilities:
- Watch one or more configured local folders for new audio files or documents.
- Securely upload new files to the hosted platform for processing.
- Poll the hosted platform for completed documents and download them into the
firm's configured output folders.
- Queue uploads and downloads locally so that network outages do not cause job loss.
- Retry uploads and downloads automatically when connectivity is restored.
- Log all upload and download actions locally and report them to the hosted
platform for audit purposes.
- Report status and errors to the hosted platform.
Agent Communication:
- The agent must communicate with the hosted platform using a secure HTTPS API
connection on port 443.
- Authentication must use per-agent tokens, registered and managed through the
platform admin panel.
- All connections must use TLS. Plain HTTP must not be accepted.
- The agent must not expose any inbound ports. All communication must be
initiated by the agent outbound to the platform. No inbound firewall rule
should be required on the platform server for the agent to function.
- Agent tokens must be revocable from the platform admin panel with immediate
effect.
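Per-agent tokens with immediate revocation could be handled along these
lines. This is an in-memory sketch with a hypothetical class name; a real
platform would persist the hashes in the database. Storing only a hash of
each token means a leaked database row does not reveal a usable token.

```python
import hashlib
import hmac
import secrets


class AgentTokenRegistry:
    """Issue, verify, and revoke one token per agent (in-memory sketch)."""

    def __init__(self):
        self._token_hashes = {}  # agent_id -> SHA-256 hex digest of the token

    def issue(self, agent_id: str) -> str:
        """Create a new random token; only its hash is kept server-side."""
        token = secrets.token_urlsafe(32)
        self._token_hashes[agent_id] = hashlib.sha256(token.encode()).hexdigest()
        return token

    def verify(self, agent_id: str, token: str) -> bool:
        stored = self._token_hashes.get(agent_id)
        if stored is None:
            return False  # unknown or revoked agent
        presented = hashlib.sha256(token.encode()).hexdigest()
        # Constant-time comparison avoids leaking information through timing.
        return hmac.compare_digest(stored, presented)

    def revoke(self, agent_id: str) -> None:
        """Remove the token so the very next request from the agent fails."""
        self._token_hashes.pop(agent_id, None)
```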
Agent Installation and Updates:
- The agent must be installable on Windows (as a Windows Service) and Linux
(as a systemd service) as a background process.
- The agent should support automatic updates pushed from the hosted platform,
with a fallback to manual update if auto-update fails.
- The installation process must be documented clearly enough for a non-developer
to complete it on a standard office Windows or Linux machine.
- The agent must have a simple local status page or command-line status command
so that an administrator at the firm can confirm it is running, check its
last sync time, and view recent errors without needing access to the hosted
platform portal.
Offline and Resilience Behaviour:
- The agent must queue pending uploads locally if the hosted platform is
unreachable.
- On reconnection, the agent must process the local queue in order.
- The agent must not attempt to re-upload a file that has already been
successfully uploaded and acknowledged by the platform.
- If a file fails to upload after a configurable number of retries, it must
be flagged in the local error log and the agent must notify the platform
of the failure when connectivity is restored.
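The offline behaviour above can be sketched as an ordered local queue. The
class name is hypothetical, the upload callable is assumed to raise
ConnectionError when the platform is unreachable, and a real agent would
persist the queue to disk so that it survives a restart.

```python
from collections import deque
from typing import Callable


class UploadQueue:
    """Ordered upload queue with retry limits and duplicate protection."""

    def __init__(self, max_retries: int = 3):
        self.max_retries = max_retries
        self.pending = deque()
        self.acknowledged = set()  # files the platform has confirmed
        self.failed = []           # files that exhausted their retries
        self._attempts = {}

    def enqueue(self, file_id: str) -> None:
        # Never queue a file that is already acknowledged or already waiting.
        if file_id not in self.acknowledged and file_id not in self.pending:
            self.pending.append(file_id)

    def drain(self, upload: Callable[[str], object]) -> None:
        """Upload pending files in order; stop early if the platform is down."""
        while self.pending:
            file_id = self.pending[0]
            try:
                upload(file_id)
            except ConnectionError:
                self._attempts[file_id] = self._attempts.get(file_id, 0) + 1
                if self._attempts[file_id] >= self.max_retries:
                    self.pending.popleft()
                    self.failed.append(file_id)  # to be reported to the platform
                    continue
                return  # keep the file at the front and retry on the next drain
            self.pending.popleft()
            self.acknowledged.add(file_id)
```

Keeping a failed file at the front of the queue preserves upload order
across outages, at the cost of one stuck file delaying the files behind it
until its retries are exhausted.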
Agent Registration and Management:
- Each agent must be registered to exactly one firm in the platform admin panel.
- Platform admins and firm admins must be able to view registered agents,
their last seen time, their installed version, and their current status.
- Platform admins must be able to revoke an agent's access token immediately.
- Each agent must have a unique identifier visible in both the local agent
status output and the platform admin panel.
- Agent registration must be part of the firm onboarding process.
Agent Audit Logging:
- Every file the agent uploads must be recorded in the firm's audit log on
the hosted platform, including filename, size, timestamp, and agent ID.
- Every file the agent downloads must be recorded in the firm's audit log,
including filename, destination folder, timestamp, and agent ID.
- Agent errors and connectivity failures must be logged and visible in the
admin diagnostics panel.
- Agent version and last-seen information must be visible in the diagnostics
panel so that outdated agents can be identified and updated.
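As an illustration of the audit fields listed above, the following sketch shows one
possible shape for an agent audit record. The class name AgentAuditEntry, the field
names, and the JSON-lines serialisation are assumptions made for this example; the plan
only specifies which pieces of information must be recorded.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AgentAuditEntry:
    firm_id: int
    agent_id: str
    action: str  # "upload" or "download"
    filename: str
    timestamp: str  # ISO 8601, UTC
    size_bytes: Optional[int] = None  # recorded for uploads
    destination_folder: Optional[str] = None  # recorded for downloads


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


def upload_entry(firm_id: int, agent_id: str, filename: str, size_bytes: int) -> AgentAuditEntry:
    return AgentAuditEntry(firm_id, agent_id, "upload", filename, _utc_now(),
                           size_bytes=size_bytes)


def download_entry(firm_id: int, agent_id: str, filename: str,
                   destination_folder: str) -> AgentAuditEntry:
    return AgentAuditEntry(firm_id, agent_id, "download", filename, _utc_now(),
                           destination_folder=destination_folder)


def to_json_line(entry: AgentAuditEntry) -> str:
    """Serialise one entry as a single JSON line, suitable for an append-only log."""
    return json.dumps(asdict(entry), sort_keys=True)
```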
Important MVP Infrastructure Rules
------------------------------------
The MVP must include these foundations from day one:
1. Feature flags
2. Tenant settings
3. Storage abstraction
4. AI abstraction
5. Event-based modules
These must not be treated as future upgrades.
The first version can be simple, but the architecture must be ready to grow.
Feature flags must allow the system to start as a basic transcription portal and
enable advanced modules progressively, per firm, as they become ready.
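A minimal sketch of a per-firm feature flag check follows. The flag names, the
PLATFORM_DEFAULTS mapping, and the function is_feature_enabled are hypothetical; the
point is only that a firm-level override wins over the platform default and that an
unknown flag is treated as off.

```python
from typing import Mapping

# Hypothetical platform-wide defaults: only the basic portal is on at launch.
PLATFORM_DEFAULTS = {
    "transcription": True,
    "document_templates": False,
    "ai_legal_review": False,
}


def is_feature_enabled(feature: str, firm_flags: Mapping[str, bool],
                       defaults: Mapping[str, bool] = PLATFORM_DEFAULTS) -> bool:
    """Return the firm's setting if present, else the platform default, else False."""
    if feature in firm_flags:
        return firm_flags[feature]
    return defaults.get(feature, False)
```

For example, a firm with {"document_templates": True} stored in its tenant settings
would see the template module while other firms would not.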
Tenant settings must allow each firm to have its own branding, users, permissions,
storage, templates, AI settings, email settings, and enabled features.
Storage abstraction must allow files to be stored on local disk first, then later
SMB, SFTP, object storage, or tenant-specific volumes.
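One way to picture the storage abstraction is a small interface that application code
depends on, with local disk as the first backend. The sketch below is an assumption
about shape only: the names StorageBackend and LocalDiskStorage, the read/write methods,
and the key format are hypothetical. The local backend also rejects keys that would
resolve outside its root directory.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Interface the application codes against; backends can be swapped per tenant."""

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, key: str) -> bytes: ...


class LocalDiskStorage(StorageBackend):
    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()

    def _path(self, key: str) -> Path:
        candidate = (self.root / key).resolve()
        # Reject "../" sequences and absolute paths that escape the storage root.
        if candidate == self.root or not candidate.is_relative_to(self.root):
            raise ValueError(f"key escapes storage root: {key!r}")
        return candidate

    def write(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read(self, key: str) -> bytes:
        return self._path(key).read_bytes()
```

An SMB, SFTP, or object-storage backend would implement the same two methods, so callers
would not need to change when a tenant is moved.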
AI abstraction must allow different AI providers and models to be used for
transcription, cleanup, document classification, template selection, legal review,
email drafting, and second-AI checking. The abstraction must enforce the firm's
permitted AI privacy tier.
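The privacy tier enforcement could be sketched as a provider lookup that refuses any
provider above the firm's permitted tier. The tier names and their ordering below are
hypothetical placeholders, as are AIProvider, select_provider, and the provider names;
the actual tiers are whatever the platform's AI privacy policy defines.

```python
from dataclasses import dataclass
from enum import IntEnum


class PrivacyTier(IntEnum):
    # Hypothetical tiers for illustration: a lower value means stricter data handling.
    LOCAL_ONLY = 1
    PRIVATE_CLOUD = 2
    PUBLIC_API = 3


@dataclass(frozen=True)
class AIProvider:
    name: str
    tier: PrivacyTier


def select_provider(task: str, firm_max_tier: PrivacyTier,
                    providers: dict[str, list[AIProvider]]) -> AIProvider:
    """Return the first provider configured for the task that the firm's tier permits."""
    for provider in providers.get(task, []):
        if provider.tier <= firm_max_tier:
            return provider
    raise PermissionError(f"no permitted AI provider for task {task!r}")
```

Raising an error when nothing qualifies, rather than falling back to a less private
provider, keeps the check fail-closed.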
Event-based modules must allow future features to be added without tightly
coupling modules together.
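A very small in-process event bus illustrates the decoupling described above. This is a
sketch only: the EventBus class and the event name "transcript.completed" are
hypothetical, and a production system might dispatch through the Redis-backed queue
instead of calling handlers directly.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_name].append(handler)

    def publish(self, event_name: str, payload: dict) -> int:
        """Call every handler registered for the event; return how many were called."""
        handlers = list(self._handlers.get(event_name, []))
        for handler in handlers:
            handler(payload)
        return len(handlers)
```

The transcription module would publish "transcript.completed" without knowing whether a
document-generation or notification module is listening, so new modules can be added by
subscribing rather than by editing existing code.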
Server Build Order
------------------
Recommended infrastructure build order:
Phase 0 - Provision Server
- Create new VM/server on dedicated infrastructure.
- Install Ubuntu 24.04 LTS.
- Attach OS disk (150 GB SSD/NVMe).
- Attach separate data disk (2 TB SSD/NVMe).
- Attach or configure backup target.
- Configure hostname and DNS.
- Configure firewall (ufw) - see Firewall Rules section.
Apply rules before any service is started.
- Configure SSH access. Restrict to known management IPs only.
Disable password authentication. Use key-based authentication only.
- Configure automatic system security updates (unattended-upgrades).
- Configure time zone (UTC recommended for servers).
- Create application user: legaltranscribe.
Phase 1 - Base Services
- Install Nginx.
- Configure TLS certificate (Let's Encrypt / Certbot).
- Configure Nginx HTTPS redirect and strong TLS settings.
- Install PHP/app runtime (if Laravel is chosen).
- Install PostgreSQL. Configure for localhost access only.
- Install Redis. Configure persistence, authentication, and
localhost binding. See Redis Configuration section.
- Install Python environment and required packages.
- Install ffmpeg.
- Install LibreOffice/headless document tools.
- Install ClamAV. Configure freshclam for daily definition updates.
- Install SMB/SFTP/FTP client tools (smbclient, openssh-client, lftp).
- Configure logrotate for all application and service logs.
- Configure system services to start on boot.
Phase 2 - Storage Layout
- Mount data disk at /data.
- Create /data/legaltranscribe directory structure.
- Create tenant folder structure (firms/, shared/, quarantine/,
system/, imports/, processing/).
- Set correct ownership and permissions. See Filesystem Permissions section.
- Create application user home directory if required.
- Create backup folders or connect and test remote backup target.
- Verify no cross-folder access is possible between firm directories.
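The Phase 2 layout steps could be scripted roughly as follows. The top-level folder
names come from the list above; the per-firm folder name firm_<id>, the 0o750 mode, and
both function names are assumptions for this sketch. The check shown only detects
directories accessible to "other" users. Full verification of isolation between firms
depends on the ownership model in the Filesystem Permissions section.

```python
import os
import stat
from pathlib import Path

TOP_LEVEL = ("firms", "shared", "quarantine", "system", "imports", "processing")


def create_layout(root: Path, firm_ids, mode: int = 0o750) -> None:
    """Create the data root, top-level folders, and one folder per firm."""
    dirs = [root]
    dirs += [root / name for name in TOP_LEVEL]
    dirs += [root / "firms" / f"firm_{firm_id}" for firm_id in firm_ids]
    for directory in dirs:
        directory.mkdir(parents=True, exist_ok=True)
        # chmod explicitly: the mode passed to mkdir would be filtered by the umask.
        os.chmod(directory, mode)


def find_world_accessible(root: Path) -> list[Path]:
    """Return every path under root (inclusive) with any 'other' permission bit set."""
    candidates = [root, *root.rglob("*")]
    return [p for p in candidates if stat.S_IMODE(p.stat().st_mode) & 0o007]
```

A call such as create_layout(Path("/data/legaltranscribe"), ["101", "102"]) followed by
find_world_accessible(...) returning an empty list could form part of the smoke test.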
Phase 3 - Application Skeleton
- Create app at /opt/legaltranscribe/app.
- Configure database connection and run initial migrations.
- Configure queue worker and process manager (Supervisor or systemd).
- Configure encrypted environment file (.env) for secrets.
- Add firm/tenant model with firm_id scoping.
- Add users/roles/permissions module.
- Add feature flags module.
- Add tenant settings module.
- Add storage abstraction layer with path traversal protection.
- Add AI provider abstraction with privacy tier enforcement.
- Add event system.
- Add notification system foundation (in-app).
- Add basic matter/client module.
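The firm_id scoping item in Phase 3 could look something like the sketch below, where
every query goes through an object that is bound to one firm and always filters on
firm_id. SQLite is used here only so the example is self-contained; the plan's database
is PostgreSQL. The table layout and the FirmScopedMatters class are hypothetical.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE matters ("
        " id INTEGER PRIMARY KEY,"
        " firm_id INTEGER NOT NULL,"
        " title TEXT NOT NULL)"
    )
    return db


class FirmScopedMatters:
    """Data access bound to one firm; callers cannot omit the firm_id filter."""

    def __init__(self, db: sqlite3.Connection, firm_id: int) -> None:
        self._db = db
        self._firm_id = firm_id

    def add(self, title: str) -> int:
        cur = self._db.execute(
            "INSERT INTO matters (firm_id, title) VALUES (?, ?)", (self._firm_id, title)
        )
        self._db.commit()
        return cur.lastrowid

    def get(self, matter_id: int):
        row = self._db.execute(
            "SELECT title FROM matters WHERE id = ? AND firm_id = ?",
            (matter_id, self._firm_id),
        ).fetchone()
        return row[0] if row else None

    def titles(self) -> list:
        rows = self._db.execute(
            "SELECT title FROM matters WHERE firm_id = ? ORDER BY id", (self._firm_id,)
        ).fetchall()
        return [r[0] for r in rows]
```

Looking up another firm's matter ID returns None rather than the record, which is the
behaviour a tenant isolation smoke test would assert.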
Phase 4 - Admin Safety Tools
- Add backup scripts (source, database, file, tenant export).
- Add changelog file.
- Add handover file.
- Add route list report.
- Add schema summary report.
- Add latest error log tail report.
- Add smoke-test script (including tenant isolation checks).
- Add crawler/link checker.
- Add admin diagnostics page (including queue health, worker health,
ClamAV definition age, certificate expiry, storage quotas,
Redis status, log sizes, cleanup job status).
Phase 5 - Basic Transcription MVP
- Manual audio upload with virus scan on receipt.
- Create transcription job.
- Queue worker processes job.
- Save raw transcript.
- Show transcript viewer/editor.
- Store transcript under correct firm folder.
- Audit all important actions.
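The "virus scan on receipt" step could be wired up along these lines, using the clamscan
command-line tool, whose documented exit codes are 0 (no virus found), 1 (virus found),
and 2 (error). The ScanResult enum and function names are hypothetical, and a busy
server might prefer the clamd daemon over starting clamscan per file. The run parameter
exists only so the logic can be exercised without ClamAV installed.

```python
import subprocess
from enum import Enum


class ScanResult(Enum):
    CLEAN = "clean"
    INFECTED = "infected"
    ERROR = "error"


def classify_exit_code(code: int) -> ScanResult:
    """Map a clamscan exit code to a result; anything unexpected counts as an error."""
    if code == 0:
        return ScanResult.CLEAN
    if code == 1:
        return ScanResult.INFECTED
    return ScanResult.ERROR


def scan_upload(path: str, run=subprocess.run) -> ScanResult:
    """Scan one uploaded file. Only CLEAN should allow a transcription job to be created."""
    completed = run(["clamscan", "--no-summary", path], capture_output=True, text=True)
    return classify_exit_code(completed.returncode)
```

Treating ERROR the same as INFECTED (move the file to quarantine/ and do not queue it)
keeps the upload path fail-closed if the scanner is unavailable.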
Phase 6 - Document Template MVP
- Upload Word templates (with virus scan on receipt).
- Assign templates to firm/document type.
- Select template manually or via AI classification.
- Generate Word document.
- Apply draft watermark to generated document.
- Save completed document into tenant data folder.
- Show/download final document.
Recommended First Build Specification
---------------------------------------
Server: legaltranscribe1
OS: Ubuntu 24.04 LTS
vCPU: 8 cores
RAM: 32 GB
OS disk: 150 GB SSD/NVMe
Data disk: 2 TB SSD/NVMe
Backup target: separate off-server backup target
Suggested paths:
App: /opt/legaltranscribe/app
Data: /data/legaltranscribe
Backups: /backups/legaltranscribe or remote target
Database: PostgreSQL (recommended default)
Queue/cache: Redis (with AOF persistence and localhost binding)
Web server: Nginx (HTTPS only, TLS 1.2/1.3, strong cipher suite)
App framework: As determined by the stack recommendation document
Workers:
Python worker environment for transcription, document processing,
AI tasks, and background jobs.
Document tools:
LibreOffice/headless
python-docx or equivalent
PDF/document conversion tools where required
Audio tools:
ffmpeg
Security tools:
ClamAV with freshclam (daily definition updates)
ufw firewall
Certbot / Let's Encrypt for TLS
Fail2ban (recommended for SSH brute-force protection)
Logging:
logrotate configured for all application logs
Admin diagnostics panel with log tail and log size visibility
Security:
Encrypted secrets (environment file, secrets manager, or vault)
Firm-scoped filesystem permissions
Tenant-scoped storage with path traversal validation
Audit logging from day one
MFA support included in application (not deferred)
SSH key-based authentication only
Recommended Final Decision
--------------------------
Start with a fresh, dedicated server/VM.
Do not build this platform on a server that hosts other applications.
Recommended first server:
- 8 vCPU
- 32 GB RAM
- 150 GB OS disk
- 2 TB data disk
- separate backup target
- Ubuntu 24.04 LTS
Build one tenant-aware platform.
Use PostgreSQL as the database.
Configure Redis with persistence enabled and localhost binding.
Set up TLS with Nginx before any application is deployed.
Configure the firewall before any service is started.
Install and configure ClamAV before any file upload is enabled.
Set correct filesystem permissions before any firm data is written.
Use a separate data disk from day one.
Use logical tenant separation first.
Do not allocate a separate physical disk per tenant at the beginning.
Design the storage abstraction so that large tenants can later be moved to their
own disk, mount point, storage volume, object bucket, dedicated database, or
dedicated server without any application code changes.
The key principles are:
One codebase.
Always tenant-aware.
Dedicated server from day one.
Separate data disk from day one.
TLS and firewall before first deployment.
Virus scanning before first file upload.
Logical tenant separation first.
Optional per-tenant disk or dedicated server later.