Running a reliable, secure, and performant Revenue Recovery platform requires following operational best practices tailored to your deployment model. This guide helps you establish maintenance rhythms, implement security controls, plan capacity effectively, and prepare for disaster recovery. Whether you're managing a SaaS deployment focused on Relay operations or an On-Premises deployment with full infrastructure responsibility, these practices keep your platform running smoothly.

For monitoring responsibilities and health indicators, see System Monitoring & Health. For hands-on diagnostic procedures, see Component Diagnostics & Health Checks.

General Operational Best Practices

These foundational practices apply to all deployments, regardless of model.

💡
Tip: Automate operational checks using existing monitoring platforms (Nagios, Prometheus, Datadog). Contact [email protected] for metric endpoint guidance.

Core Infrastructure Monitoring

Monitor these foundational infrastructure elements to prevent service disruptions:

Time Synchronization:

Enable NTP on all customer-managed infrastructure (sudo timedatectl set-ntp true)
Verify sync status: timedatectl status
For VMs: use a single time source; if the hypervisor already synchronizes guest clocks, disable in-guest NTP to prevent conflicts
Why: Clock drift causes TLS authentication failures and log correlation issues

DNS Resolution:

Use redundant DNS servers; configure forward and reverse lookups
Test resolution weekly: nslookup <hostname> for both Ailevate (*.ailevate.com) and internal hosts
Why: DNS failures cause immediate connectivity loss

TLS Certificates:

Monitor certificate expiration dates; renew 30+ days before expiration
Check expiry: openssl s_client -connect <host>:<port> </dev/null 2>/dev/null | openssl x509 -noout -dates
Coordinate client certificate updates with Ailevate support (On-Premise Elasticsearch and AI Warehouse)
Why: Expired certificates cause immediate service outages
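The 30-day renewal rule above can be turned into a scripted check. The following is a minimal sketch, not a supported tool: the function names cert_days_left and cert_renewal_ok are hypothetical, and the date parsing assumes GNU date (date -d), as found on most Linux hosts.

```shell
#!/bin/sh
# Hypothetical helper: print the whole days remaining until a certificate expires.
# $1 = expiry date as printed by `openssl x509 -noout -enddate` (text after "notAfter=")
# $2 = optional "now" in epoch seconds (defaults to the current time)
# Assumption: GNU date is available for `date -d`.
cert_days_left() {
  cdl_end=$(date -d "$1" +%s) || return 2
  cdl_now=${2:-$(date +%s)}
  echo $(( (cdl_end - cdl_now) / 86400 ))
}

# Hypothetical helper: succeed only if the certificate is still outside the renewal window.
# $1 = days left, $2 = renewal window in days (default 30, matching the guidance above)
cert_renewal_ok() {
  [ "$1" -gt "${2:-30}" ]
}
```

One possible usage: end=$(openssl s_client -connect <host>:<port> </dev/null 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2), then cert_renewal_ok "$(cert_days_left "$end")" || echo "renew now". A monitoring platform can run this on a schedule and alert on a non-zero exit.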

Firewall Rules:

Document all changes in change management system; review quarterly
Monitor logs for unexpected blocked traffic
Test connectivity after any firewall changes
Why: Misconfigured rules can block required traffic or expose systems

See Also:

SaaS Deployment Guide - Outbound firewall requirements
On-Premises Deployment Guide - Inbound and outbound requirements
Relay Service Deployment Guide - Network flow matrix

Role-Based Access Control (RBAC)

Apply the principle of least privilege to all administrative access.

Best Practices:

Separate duties: infrastructure admins, application admins, security teams
Document access policies and review quarterly
Remove access immediately upon role change or termination
Use individual accounts (no shared credentials)
Require MFA for administrative access

Compliance Alignment: HIPAA and SOC2 require documented access controls and regular reviews.

For detailed user role management and permission assignment in the Revenue Recovery platform, see User Role Management.

Relay Service Best Practices

Applies to: All deployments (SaaS and On-Premise)

Deployment and Placement

Network Proximity:

Deploy Relay VM in same subnet/LAN as EHR SQL datastore
Minimizes latency for SQL queries (port 1433)
Test latency: ping <sql-hostname> (should be <5ms for optimal performance)
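To compare measured latency against the 5 ms target automatically, the ping summary line can be parsed. This is a rough sketch: ping_avg_ms and latency_ok are hypothetical names, and the parser assumes the "min/avg/max" summary line that Linux and BSD ping print at the end of a run.

```shell
#!/bin/sh
# Hypothetical helper: read `ping` output on stdin, print the average round-trip time in ms.
# Assumption: the summary line looks like "rtt min/avg/max/mdev = 0.045/0.052/0.061/0.007 ms".
ping_avg_ms() {
  awk -F'=' '/min\/avg\/max/ { split($2, t, "/"); gsub(/ /, "", t[2]); print t[2] }'
}

# Hypothetical helper: succeed if the average is below the limit.
# $1 = average latency in ms, $2 = limit in ms (default 5, per the guidance above)
latency_ok() {
  awk -v avg="$1" -v limit="${2:-5}" 'BEGIN { exit !(avg + 0 < limit + 0) }'
}
```

For example: avg=$(ping -c 5 <sql-hostname> | ping_avg_ms), then latency_ok "$avg" || echo "Relay-to-SQL latency is ${avg} ms". ICMP may be blocked on some networks, in which case this check reports nothing useful.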

Connectivity:

Ensure stable outbound HTTPS (port 443) to *.ailevate.com
Avoid proxy servers unless absolutely required (adds latency and failure points)
Validate: curl -vk https://<relay-endpoint>.ailevate.com

Operational Maintenance

Establish regular operational rhythms to maintain Relay health and catch issues early:

Apply OS Security Patches (Monthly):

Run: sudo apt update && sudo apt upgrade
Reboot if kernel updated (schedule during maintenance window)
Why: Prevents security exploits and ensures system stability

Review Service Logs (Weekly):

Check for errors: sudo journalctl -u ailevate-tunnel.service --since "7 days ago" | grep -i error
Export for support: sudo journalctl -u ailevate-tunnel.service --since "7 days ago" > relay-weekly.log
Why: Early detection of connectivity or SQL errors before they impact ingestion

Rotate SQL Credentials (60-90 days):

Generate new credentials → Update in Revenue Recovery platform → Test connectivity → Retire old credentials
Why: Meets security policies and compliance requirements (HIPAA)

Test SQL Connectivity (Monthly):

Verify port access: nc -vz <sql-hostname> 1433
Why: Proactively detects network or firewall issues

Verify Time Sync (Weekly):

Check status: timedatectl status
Why: TLS authentication depends on accurate time

Verify DNS Resolution (Weekly):

Test hostnames: nslookup <hostname> for both *.ailevate.com and SQL server
Why: DNS failures cause immediate connectivity loss

Review Disk Space (Weekly):

Check utilization: df -h (ensure ≥20% free)
Why: Prevents service failures due to disk exhaustion
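The weekly disk check lends itself to automation. Below is a small sketch, assuming POSIX-format df output (df -P) and mount points without spaces; the function name low_disk_mounts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` output on stdin and print "<mount> <free%>"
# for every filesystem whose free space is below the threshold.
# $1 = minimum free percentage (default 20, per the guidance above)
# Assumption: mount points contain no spaces, so the mount is field 6.
low_disk_mounts() {
  awk -v min_free="${1:-20}" 'NR > 1 {
    used = $5
    sub(/%/, "", used)          # "85%" -> "85"
    free = 100 - used
    if (free < min_free) print $6, free
  }'
}
```

Running df -P | low_disk_mounts 20 prints nothing when every filesystem has at least 20% free, so any output can be treated as an alert condition.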

Log Management:

Configure centralized logging for long-term retention (HIPAA compliance)
See Component Diagnostics & Troubleshooting for log locations and error patterns

Monitoring and Alerting

Configure alerts for:

Service failures (ailevate-tunnel.service, ailevate-proxy.service)
Extended connectivity loss (>15 minutes)
Disk space warnings (<20% free)
Time sync failures

Integration: Feed Relay alerts into your existing monitoring platform (Nagios, Prometheus, Datadog).

See Component Diagnostics & Troubleshooting for diagnostic commands.

Elasticsearch Best Practices (On-Premise Only)

📘
Note: Ailevate manages Elasticsearch in SaaS deployments. Skip this section if you're on SaaS.

Configuration and Sizing Requirements

Component	Minimum Requirement	Optimal Configuration	Critical Threshold	Impact if Exceeded
Cluster Size	3 nodes	3+ nodes for HA	N/A	Single-node = no redundancy
Master Nodes	Part of data nodes	Dedicated master-eligible nodes	For clusters >5,000 claims/day	Stability issues under load
Storage Type	SSD	NVMe	N/A	Spinning disks cannot meet IOPS
Disk Utilization	<85%	<70%	85% = low watermark; 90% = high watermark; 95% = flood stage (indices read-only)	New shard allocation stops at 85%; shards relocate away at 90%; affected indices become read-only at 95%
JVM Heap	≤32 GB/node	31 GB (for 64GB RAM node)	32 GB (compressed pointers limit)	Lost performance optimization
Shard Size	Target ~50 GB	30-50 GB per shard	>100 GB	Query performance degradation
Page Cache	Remaining RAM after heap	~50% of RAM (for 64GB node: 33GB)	N/A	Lucene file system cache for performance

See On-Premises Deployment Guide for detailed cluster sizing tables based on claim volume.

Network and Connectivity

Inbound from Ailevate Cloud:

Allow inbound TLS from Ailevate Cloud Services IP ranges to port 9200
Coordinate with Ailevate support for current IP ranges
Test connectivity after firewall changes

Inter-Node Communication:

Keep network latency <2ms between nodes (ideal)
Deploy nodes in same region/availability zone when possible
Ensure sufficient bandwidth for shard replication

Capacity Monitoring

Check disk allocation regularly:

curl -XGET "https://<elastic-host>:9200/_cat/allocation?v" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Plan capacity expansion when disk growth indicates approaching 85% within 3 months.
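The three-month rule can be estimated from the disk figures reported by _cat/allocation. The sketch below assumes linear growth, which is a simplification, and the function name months_to_threshold is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate months until disk usage reaches a threshold.
# $1 = total capacity (GB), $2 = currently used (GB),
# $3 = observed growth (GB per month), $4 = threshold percent (default 85)
# Assumption: growth is linear; real ingestion may be seasonal.
months_to_threshold() {
  awk -v total="$1" -v used="$2" -v growth="$3" -v pct="${4:-85}" 'BEGIN {
    limit = total * pct / 100
    if (used >= limit) { print 0; exit }          # already at or past the threshold
    if (growth <= 0)   { print "never"; exit }    # flat or shrinking usage
    printf "%.1f\n", (limit - used) / growth
  }'
}
```

For a node with 1000 GB of disk, 700 GB used, and 50 GB of growth per month, months_to_threshold 1000 700 50 reports 3.0, which is the point at which the guidance above says to start planning expansion.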

TLS Certificate Lifecycle

Maintain Certificates:

Node-to-node TLS certificates
Client-to-cluster TLS certificates
Monitor expiry dates; renew 30+ days before expiration

Coordinate with Ailevate:

Client certificate updates require coordination with Ailevate support
Test new certificates in non-production first

Backup Configuration

Elasticsearch snapshots are critical for disaster recovery. Configure automated snapshots to run at least once a day.

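One-Time Snapshot Repository Setup (for an "fs" repository, the location must be a shared filesystem mounted on every node and listed under path.repo in elasticsearch.yml):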

curl -XPUT "https://<elastic-host>:9200/_snapshot/backups" \
  -H 'Content-Type: application/json' \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key \
  -d '{
    "type": "fs",
    "settings": {
      "location": "/mnt/backups/elasticsearch"
    }
  }'

Daily Snapshot Automation:

# Take snapshot (run from a daily cron job; inside a crontab line, escape each % as \%)
curl -XPUT "https://<elastic-host>:9200/_snapshot/backups/snapshot_$(date +%Y%m%d)" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Best Practices:

Store snapshots in geographically separate location
Test restore procedures quarterly
Maintain 30-90 day retention (typical for HIPAA)
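Retention can be enforced by selecting snapshots whose names carry a date older than the cutoff. This is a sketch under the naming convention used above (snapshot_YYYYMMDD); the function name expired_snapshots is hypothetical, and it only lists candidates rather than deleting anything.

```shell
#!/bin/sh
# Hypothetical helper: read snapshot names on stdin (one per line) and print
# those named snapshot_YYYYMMDD whose date is older than the cutoff.
# $1 = cutoff date as YYYYMMDD; snapshots dated before it are printed.
# Names that do not match the convention are ignored, never selected.
expired_snapshots() {
  awk -v cutoff="$1" '/^snapshot_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
    stamp = substr($0, 10)      # text after the 9-character "snapshot_" prefix
    if (stamp + 0 < cutoff + 0) print
  }'
}
```

One way to use it, assuming GNU date: compute cutoff=$(date -d '30 days ago' +%Y%m%d), pipe the repository's snapshot names into expired_snapshots "$cutoff", review the list, and remove each entry with a DELETE request to /_snapshot/backups/<snapshot-name> using the same certificate options as the commands above. Elasticsearch's built-in snapshot lifecycle management is an alternative worth evaluating.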

See the "Backup & Disaster Recovery" section below for complete backup/DR strategy.

AI Warehouse Best Practices (On-Premise Only)

📘
Note: Ailevate manages AI Warehouse in SaaS deployments. Skip this section if you're on SaaS.

Hardware Comparison and Requirements

Requirement	TT-LoudBox (Air-Cooled)	TT-QuietBox (Liquid-Cooled)
Form Factor	4U Rack-mount	Desktop
Accelerators	4x Wormhole	4x Blackhole
Cooling System	Front-to-back airflow (fan-based)	Sealed liquid cooling (maintenance-free)
Cooling Requirements	Adequate rack airflow; verify fans operational	Ambient temp <35°C (95°F); coolant pre-filled
Power Configuration	Dual PSU (1+1 redundancy)	Single 20A dedicated circuit
Power Best Practices	Monitor PSU LEDs during inspections	Direct wall outlet only (no surge protectors/strips)
QSFP-DD Interconnect	2 cables	8 cables (mesh topology)
Link Status Indicators	Solid green LEDs = healthy	Solid green LEDs = healthy
Physical Inspection	Check PSU LEDs, verify airflow, remove obstructions	Check BMC for cooling status, verify ambient temp

Common to Both:

Weekly health checks: lspci -d 1e52: (verify 4 accelerators detected) and tt-smi (check status)
Monthly: Physical inspection of QSFP-DD cables, PSU status, thermal monitoring via BMC
TLS-secured vLLM API on port 8080
Inbound connectivity from Ailevate Cloud Services required

See Detailed Guides:

Tenstorrent LoudBox Guide - Rack-mount air-cooled setup
Tenstorrent QuietBox Guide - Desktop liquid-cooled setup

Regular Health Checks

Weekly:

# Verify accelerators detected
sudo update-pciids && lspci -d 1e52:

# Check accelerator status
tt-smi

# Test vLLM API
curl -k https://<ai-warehouse-host>:8080/v1/models
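The accelerator count can be verified in a script rather than by eye. A minimal sketch follows; the function name accelerators_ok is hypothetical, and it assumes each accelerator appears as exactly one line in the lspci output, in line with the "verify 4 accelerators detected" guidance.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -d 1e52:` output on stdin and succeed only
# if the number of listed devices equals the expected count.
# $1 = expected accelerator count (default 4)
# Assumption: one output line per accelerator.
accelerators_ok() {
  ao_found=$(grep -c .)         # count non-empty lines
  [ "$ao_found" -eq "${1:-4}" ]
}
```

Example: lspci -d 1e52: | accelerators_ok 4 || echo "unexpected accelerator count". A failure here is a prompt to run tt-smi and inspect the hardware, not a diagnosis by itself.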

Monthly:

Physical inspection: QSFP-DD cables, PSU LEDs, cooling
BMC event log review
Temperature trend analysis

Mesh Interconnect Links

Verify Topology:

QuietBox: 8 QSFP-DD cables (mesh topology)
LoudBox: 2 QSFP-DD cables

Check Link Status:

Inspect QSFP-DD port LEDs (solid green = healthy)
Reseat cables if link failures detected
Verify topology matches deployment guide diagrams

BMC Security and Monitoring

Harden BMC Access:

1. Change default password immediately
2. Use strong passwords (16+ characters)
3. Restrict BMC access to management VLAN if possible
4. Monitor BMC access logs for unauthorized attempts

Regular BMC Checks:

Access http://<bmc-ip>
Review System Health → Event Log
Monitor Sensors → Temperature readings
Check Power → PSU status

vLLM API Security

TLS Configuration:

Enforce TLS for vLLM API endpoint (port 8080)
Allow inbound connections only from Ailevate Cloud Services IP ranges
Monitor API access logs for unauthorized attempts

Certificate Management:

Rotate vLLM API certificates before expiration
Coordinate client certificate updates with Ailevate support

Security Best Practices

Healthcare data requires robust security practices aligned with HIPAA and SOC2.

Security Monitoring by Component

Component	What to Monitor	Where to Look	Red Flags
Platform Access (All Deployments)	• Failed login attempts • Unusual access patterns • Privilege escalation • Session anomalies	Admin → Audit Logs in Revenue Recovery UI	• Concurrent sessions from different locations • Unexpected location/time access • Repeated failed logins
Elasticsearch (On-Premise Only)	• TLS certificate validity • Unauthorized connections • Client cert authentication • Query patterns	• Elasticsearch access logs • Certificate expiry monitoring	• Non-Ailevate IP sources • Bulk exports • Certificate expiring <30 days
AI Warehouse (On-Premise Only)	• BMC access attempts • vLLM API access • TLS certificate validity	• BMC access logs • vLLM API logs	• Unauthorized BMC access attempts • TLS handshake failures • API access from unknown IPs
Relay VM (All Deployments)	• SSH access • sudo usage • Service changes • Outbound connections	• `/var/log/auth.log` • Service logs	• Unexpected outbound connections (should only be `*.ailevate.com`) • Unauthorized SSH access • Unusual sudo activity
Network (All Deployments)	• Firewall blocked attempts • Inbound connections (On-Premise)	• Firewall logs • Network monitoring	• Repeated connection attempts from unknown IPs • Unexpected inbound connections to Elasticsearch/AI Warehouse
Data Access (All Deployments)	• Bulk data exports • Unusual claim access • PHI access patterns • API usage	• Audit logs • API access logs	• Mass exports by single user • Access to claims outside normal scope • Automated scraping patterns

Infrastructure Security Best Practices (On-Premise):

Elasticsearch: Verify client certificate authentication enforced; monitor for unusual query patterns
AI Warehouse: Restrict BMC to management VLAN; monitor vLLM API access logs
Network: Monitor inbound connections (should only originate from Ailevate Cloud IP ranges)

⚠️
Warning: Security Incident Response
Report suspected security incidents to [email protected] immediately with "SECURITY INCIDENT" in subject line. Coordinate with your internal security team and Ailevate.

Capacity Planning

Proactive capacity planning prevents resource exhaustion before it impacts operations.

When to Expand Infrastructure

Use these triggers to plan expansion with appropriate lead times.

Component	Expansion Trigger	Lead Time	Contact Support When
Elasticsearch	• Disk trending >85% within 3 months • Query latency +25% over baseline • Cluster consistently YELLOW (shard constraints)	3 months	Trending toward any threshold
AI Warehouse	• Task queue depth >100 consistently • Processing time +50% over baseline • Accelerator utilization >80% sustained	6-8 weeks	Queue consistently high or performance degrading
Relay VM	• CPU/memory >75% consistently • Claim ingestion delays extending	1 week	Resource constraints detected

Establishing Performance Baselines (On-Premise):

When your deployment goes live, document these metrics to establish your baseline:

Elasticsearch:

Query response times for common searches
Indexing rates during peak windows
Disk growth rate (GB/month)
CPU/memory utilization during normal operation
Shard count per index

AI Warehouse:

AI task processing time per claim (average and 95th percentile)
vLLM API response times
Accelerator utilization during peak processing
Temperature ranges during normal operation

Relay VM:

CPU/memory during claim ingestion
Network throughput during peak sync
EHR query response times

Monthly Capacity Tracking:

Calculate growth rates: (current_usage - last_month_usage) / last_month_usage * 100
Project when critical thresholds will be reached
Plan expansion 3-6 months in advance
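The growth-rate formula above translates directly into a small helper for monthly tracking. This is an illustrative sketch; the function name growth_rate_pct is hypothetical, and the inputs can be in any unit as long as both use the same one.

```shell
#!/bin/sh
# Hypothetical helper: month-over-month growth as a percentage,
# i.e. (current_usage - last_month_usage) / last_month_usage * 100.
# $1 = current usage, $2 = last month's usage (same unit)
growth_rate_pct() {
  awk -v cur="$1" -v prev="$2" 'BEGIN {
    if (prev + 0 == 0) { print "n/a"; exit }    # avoid division by zero
    printf "%.2f\n", (cur - prev) / prev * 100
  }'
}
```

For example, growth_rate_pct 440 400 reports 10.00 for a disk that grew from 400 GB to 440 GB in a month. Recording this value each month makes it easier to see when growth accelerates before a threshold in the table above is reached.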

💡
Tip: Contact Ailevate support early when identifying capacity constraints. On-Premise Elasticsearch cluster expansion requires coordination; AI Warehouse hardware procurement takes 6-8 weeks.

Related Resources

For more detailed information:

System Monitoring & Health – Monitoring responsibilities and health indicators
Component Diagnostics & Troubleshooting – Diagnostic commands and troubleshooting
Common Error Messages – User-facing application errors
SaaS Deployment Guide – SaaS architecture and setup
On-Premises Deployment Guide – On-Premises infrastructure and requirements
Relay Service Deployment Guide – Relay installation and operations
Tenstorrent LoudBox Guide – Air-cooled AI Warehouse hardware
Tenstorrent QuietBox Guide – Liquid-cooled AI Warehouse hardware
Security Settings – Authentication and access controls

Need Help? For guidance on operational best practices specific to your deployment, contact Ailevate support at [email protected]. We can help with capacity planning, maintenance scheduling, and infrastructure optimization.
