Operational Best Practices

Reliability, security, and maintenance best practices for SaaS and On-Premises deployments

Running a reliable, secure, and performant Revenue Recovery platform requires following operational best practices tailored to your deployment model. This guide helps you establish maintenance rhythms, implement security controls, plan capacity effectively, and prepare for disaster recovery. Whether you're managing a SaaS deployment focused on Relay operations or an On-Premises deployment with full infrastructure responsibility, these practices keep your platform running smoothly.

For monitoring responsibilities and health indicators, see System Monitoring & Health. For hands-on diagnostic procedures, see Component Diagnostics & Health Checks.


General Operational Best Practices

These foundational practices apply to all deployments, regardless of model.

💡

Tip: Automate operational checks using existing monitoring platforms (Nagios, Prometheus, Datadog). Contact [email protected] for metric endpoint guidance.

Core Infrastructure Monitoring

Monitor these foundational infrastructure elements to prevent service disruptions:

Time Synchronization:

  • Enable NTP on all customer-managed infrastructure (sudo timedatectl set-ntp true)
  • Verify sync status: timedatectl status
  • For VMs: disable in-guest NTP to prevent conflicts with hypervisor time sync
  • Why: Clock drift causes TLS authentication failures and log correlation issues

DNS Resolution:

  • Use redundant DNS servers; configure forward and reverse lookups
  • Test resolution weekly: nslookup <hostname> for both Ailevate (*.ailevate.com) and internal hosts
  • Why: DNS failures cause immediate connectivity loss

TLS Certificates:

  • Monitor certificate expiration dates; renew 30+ days before expiration
  • Check expiry: openssl s_client -connect <host>:<port> -showcerts 2>/dev/null | openssl x509 -noout -dates
  • Coordinate client certificate updates with Ailevate support (On-Premise Elasticsearch and AI Warehouse)
  • Why: Expired certificates cause immediate service outages

Firewall Rules:

  • Document all changes in change management system; review quarterly
  • Monitor logs for unexpected blocked traffic
  • Test connectivity after any firewall changes
  • Why: Misconfigured rules can block required traffic or expose systems

See Also:

Role-Based Access Control (RBAC)

Implement principle of least privilege for all administrative access.

Best Practices:

  • Separate duties: infrastructure admins, application admins, security teams
  • Document access policies and review quarterly
  • Remove access immediately upon role change or termination
  • Use individual accounts (no shared credentials)
  • Require MFA for administrative access

Compliance Alignment: HIPAA and SOC2 require documented access controls and regular reviews.

For detailed user role management and permission assignment in the Revenue Recovery platform, see User Role Management.


Relay Service Best Practices

Applies to: All deployments (SaaS and On-Premise)

Deployment and Placement

Network Proximity:

  • Deploy Relay VM in same subnet/LAN as EHR SQL datastore
  • Minimizes latency for SQL queries (port 1433)
  • Test latency: ping <sql-hostname> (should be <5ms for optimal performance)

Connectivity:

  • Ensure stable outbound HTTPS (port 443) to *.ailevate.com
  • Avoid proxy servers unless absolutely required (adds latency and failure points)
  • Validate: curl -vk https://<relay-endpoint>.ailevate.com

Operational Maintenance

Establish regular operational rhythms to maintain Relay health and catch issues early:

Apply OS Security Patches (Monthly):

  • Run: sudo apt update && apt upgrade
  • Reboot if kernel updated (schedule during maintenance window)
  • Why: Prevents security exploits and ensures system stability

Review Service Logs (Weekly):

  • Check for errors: sudo journalctl -u ailevate-tunnel.service --since "7 days ago" | grep -i error
  • Export for support: sudo journalctl -u ailevate-tunnel.service --since "7 days ago" > relay-weekly.log
  • Why: Early detection of connectivity or SQL errors before they impact ingestion

Rotate SQL Credentials (60-90 days):

  • Generate new credentials → Update in Revenue Recovery platform → Test connectivity → Retire old credentials
  • Why: Meets security policies and compliance requirements (HIPAA)

Test SQL Connectivity (Monthly):

  • Verify port access: nc -vz <sql-hostname> 1433
  • Why: Proactively detects network or firewall issues

Verify Time Sync (Weekly):

  • Check status: timedatectl status
  • Why: TLS authentication depends on accurate time

Verify DNS Resolution (Weekly):

  • Test hostnames: nslookup <hostname> for both *.ailevate.com and SQL server
  • Why: DNS failures cause immediate connectivity loss

Review Disk Space (Weekly):

  • Check utilization: df -h (ensure ≥20% free)
  • Why: Prevents service failures due to disk exhaustion

Log Management:

Monitoring and Alerting

Configure alerts for:

  • Service failures (ailevate-tunnel.service, ailevate-proxy.service)
  • Extended connectivity loss (>15 minutes)
  • Disk space warnings (<20% free)
  • Time sync failures

Integration: Integrate Relay monitoring into existing infrastructure platforms (Nagios, Prometheus, Datadog).

See Component Diagnostics & Troubleshooting for diagnostic commands.


Elasticsearch Best Practices (On-Premise Only)

📘

Note: Ailevate manages Elasticsearch in SaaS deployments. Skip this section if you're on SaaS.

Configuration and Sizing Requirements

ComponentMinimum RequirementOptimal ConfigurationCritical ThresholdImpact if Exceeded
Cluster Size3 nodes3+ nodes for HAN/ASingle-node = no redundancy
Master NodesPart of data nodesDedicated master-eligible nodesFor clusters >5,000 claims/dayStability issues under load
Storage TypeSSDNVMeN/ASpinning disks cannot meet IOPS
Disk Utilization<85%<70%85% = high watermark
95% = flood stage (read-only)
Shard allocation stops; cluster unusable
JVM Heap≤32 GB/node31 GB (for 64GB RAM node)32 GB (compressed pointers limit)Lost performance optimization
Shard SizeTarget ~50 GB30-50 GB per shard>100 GBQuery performance degradation
Page CacheRemaining RAM after heap~50% of RAM (for 64GB node: 33GB)N/ALucene file system cache for performance

See On-Premises Deployment Guide for detailed cluster sizing tables based on claim volume.

Network and Connectivity

Inbound from Ailevate Cloud:

  • Allow inbound TLS from Ailevate Cloud Services IP ranges to port 9200
  • Coordinate with Ailevate support for current IP ranges
  • Test connectivity after firewall changes

Inter-Node Communication:

  • Keep network latency <2ms between nodes (ideal)
  • Deploy nodes in same region/availability zone when possible
  • Ensure sufficient bandwidth for shard replication

Capacity Monitoring

Check disk allocation regularly:

curl -XGET "https://<elastic-host>:9200/_cat/allocation?v" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Plan capacity expansion when disk growth indicates approaching 85% within 3 months.

TLS Certificate Lifecycle

Maintain Certificates:

  • Node-to-node TLS certificates
  • Client-to-cluster TLS certificates
  • Monitor expiry dates; renew 30+ days before expiration

Coordinate with Ailevate:

  • Client certificate updates require coordination with Ailevate support
  • Test new certificates in non-production first

Backup Configuration

Elasticsearch snapshots are critical for disaster recovery. Configure automated daily backups minimum.

One-Time Snapshot Repository Setup:

curl -XPUT "https://<elastic-host>:9200/_snapshot/backups" \
  -H 'Content-Type: application/json' \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key \
  -d '{
    "type": "fs",
    "settings": {
      "location": "/mnt/backups/elasticsearch"
    }
  }'

Daily Snapshot Automation:

# Take snapshot (add to daily cron job)
curl -XPUT "https://<elastic-host>:9200/_snapshot/backups/snapshot_$(date +%Y%m%d)" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Best Practices:

  • Store snapshots in geographically separate location
  • Test restore procedures quarterly
  • Maintain 30-90 day retention (typical for HIPAA)

See the "Backup & Disaster Recovery" section below for complete backup/DR strategy.


AI Warehouse Best Practices (On-Premise Only)

📘

Note: Ailevate manages AI Warehouse in SaaS deployments. Skip this section if you're on SaaS.

Hardware Comparison and Requirements

RequirementTT-LoudBox (Air-Cooled)TT-QuietBox (Liquid-Cooled)
Form Factor4U Rack-mountDesktop
Accelerators4x Wormhole4x Blackhole
Cooling SystemFront-to-back airflow (fan-based)Sealed liquid cooling (maintenance-free)
Cooling RequirementsAdequate rack airflow; verify fans operationalAmbient temp <35°C (95°F); coolant pre-filled
Power ConfigurationDual PSU (1+1 redundancy)Single 20A dedicated circuit
Power Best PracticesMonitor PSU LEDs during inspectionsDirect wall outlet only (no surge protectors/strips)
QSFP-DD Interconnect2 cables8 cables (mesh topology)
Link Status IndicatorsSolid green LEDs = healthySolid green LEDs = healthy
Physical InspectionCheck PSU LEDs, verify airflow, remove obstructionsCheck BMC for cooling status, verify ambient temp

Common to Both:

  • Weekly health checks: lspci -d 1e52: (verify 4 accelerators detected) and tt-smi (check status)
  • Monthly: Physical inspection of QSFP-DD cables, PSU status, thermal monitoring via BMC
  • TLS-secured vLLM API on port 8080
  • Inbound connectivity from Ailevate Cloud Services required

See Detailed Guides:

Regular Health Checks

Weekly:

# Verify accelerators detected
sudo update-pciids && lspci -d 1e52:

# Check accelerator status
tt-smi

# Test vLLM API
curl -k https://<ai-warehouse-host>:8080/v1/models

Monthly:

  • Physical inspection: QSFP-DD cables, PSU LEDs, cooling
  • BMC event log review
  • Temperature trend analysis

Mesh Interconnect Links

Verify Topology:

  • QuietBox: 8 QSFP-DD cables (mesh topology)
  • LoudBox: 2 QSFP-DD cables

Check Link Status:

  • Inspect QSFP-DD port LEDs (solid green = healthy)
  • Reseat cables if link failures detected
  • Verify topology matches deployment guide diagrams

BMC Security and Monitoring

Harden BMC Access:

1. Change default password immediately
2. Use strong passwords (16+ characters)
3. Restrict BMC access to management VLAN if possible
4. Monitor BMC access logs for unauthorized attempts

Regular BMC Checks:

  • Access http://<bmc-ip>
  • Review System Health → Event Log
  • Monitor Sensors → Temperature readings
  • Check Power → PSU status

vLLM API Security

TLS Configuration:

  • Enforce TLS for vLLM API endpoint (port 8080)
  • Allow inbound connections only from Ailevate Cloud Services IP ranges
  • Monitor API access logs for unauthorized attempts

Certificate Management:

  • Rotate vLLM API certificates before expiration
  • Coordinate client certificate updates with Ailevate support

Security Best Practices

Healthcare data requires robust security practices aligned with HIPAA and SOC2.

Security Monitoring by Component

ComponentWhat to MonitorWhere to LookRed Flags
Platform Access
(All Deployments)
• Failed login attempts
• Unusual access patterns
• Privilege escalation
• Session anomalies
Admin → Audit Logs in Revenue Recovery UI• Concurrent sessions from different locations
• Unexpected location/time access
• Repeated failed logins
Elasticsearch
(On-Premise Only)
• TLS certificate validity
• Unauthorized connections
• Client cert authentication
• Query patterns
• Elasticsearch access logs
• Certificate expiry monitoring
• Non-Ailevate IP sources
• Bulk exports
• Certificate expiring <30 days
AI Warehouse
(On-Premise Only)
• BMC access attempts
• vLLM API access
• TLS certificate validity
• BMC access logs
• vLLM API logs
• Unauthorized BMC access attempts
• TLS handshake failures
• API access from unknown IPs
Relay VM
(All Deployments)
• SSH access
• sudo usage
• Service changes
• Outbound connections
/var/log/auth.log
• Service logs
• Unexpected outbound connections (should only be *.ailevate.com)
• Unauthorized SSH access
• Unusual sudo activity
Network
(All Deployments)
• Firewall blocked attempts
• Inbound connections (On-Premise)
• Firewall logs
• Network monitoring
• Repeated connection attempts from unknown IPs
• Unexpected inbound connections to Elasticsearch/AI Warehouse
Data Access
(All Deployments)
• Bulk data exports
• Unusual claim access
• PHI access patterns
• API usage
• Audit logs
• API access logs
• Mass exports by single user
• Access to claims outside normal scope
• Automated scraping patterns

Infrastructure Security Best Practices (On-Premise):

  • Elasticsearch: Verify client certificate authentication enforced; monitor for unusual query patterns
  • AI Warehouse: Restrict BMC to management VLAN; monitor vLLM API access logs
  • Network: Monitor inbound connections (should only originate from Ailevate Cloud IP ranges)
⚠️

Warning: Security Incident Response
Report suspected security incidents to [email protected] immediately with "SECURITY INCIDENT" in subject line. Coordinate with your internal security team and Ailevate.


Capacity Planning

Proactive capacity planning prevents resource exhaustion before it impacts operations.

When to Expand Infrastructure

Proactive capacity planning prevents resource exhaustion. Use these triggers to plan expansion with appropriate lead times.

ComponentExpansion TriggerLead TimeContact Support When
Elasticsearch• Disk trending >85% within 3 months
• Query latency +25% over baseline
• Cluster consistently YELLOW (shard constraints)
3 monthsTrending toward any threshold
AI Warehouse• Task queue depth >100 consistently
• Processing time +50% over baseline
• Accelerator utilization >80% sustained
6-8 weeksQueue consistently high or performance degrading
Relay VM• CPU/memory >75% consistently
• Claim ingestion delays extending
1 weekResource constraints detected

Establishing Performance Baselines (On-Premise):

When your deployment goes live, document these metrics to establish your baseline:

Elasticsearch:

  • Query response times for common searches
  • Indexing rates during peak windows
  • Disk growth rate (GB/month)
  • CPU/memory utilization during normal operation
  • Shard count per index

AI Warehouse:

  • AI task processing time per claim (average and 95th percentile)
  • vLLM API response times
  • Accelerator utilization during peak processing
  • Temperature ranges during normal operation

Relay VM:

  • CPU/memory during claim ingestion
  • Network throughput during peak sync
  • EHR query response times

Monthly Capacity Tracking:

  • Calculate growth rates: (current_usage - last_month_usage) / last_month_usage * 100
  • Project when critical thresholds will be reached
  • Plan expansion 3-6 months in advance
💡

Tip: Contact Ailevate support early when identifying capacity constraints. On-Premise Elasticsearch cluster expansion requires coordination; AI Warehouse hardware procurement takes 6-8 weeks.


Related Resources

For more detailed information:

Need Help? For guidance on operational best practices specific to your deployment, contact Ailevate support at [email protected]. We can help with capacity planning, maintenance scheduling, and infrastructure optimization.