Operational Best Practices
Reliability, security, and maintenance best practices for SaaS and On-Premises deployments
Running a reliable, secure, and performant Revenue Recovery platform requires following operational best practices tailored to your deployment model. This guide helps you establish maintenance rhythms, implement security controls, plan capacity effectively, and prepare for disaster recovery. Whether you're managing a SaaS deployment focused on Relay operations or an On-Premises deployment with full infrastructure responsibility, these practices keep your platform running smoothly.
For monitoring responsibilities and health indicators, see System Monitoring & Health. For hands-on diagnostic procedures, see Component Diagnostics & Health Checks.
General Operational Best Practices
These foundational practices apply to all deployments, regardless of model.
Tip: Automate operational checks using existing monitoring platforms (Nagios, Prometheus, Datadog). Contact Ailevate support for metric endpoint guidance.
Core Infrastructure Monitoring
Monitor these foundational infrastructure elements to prevent service disruptions:
Time Synchronization:
- Enable NTP on all customer-managed infrastructure (`sudo timedatectl set-ntp true`)
- Verify sync status: `timedatectl status`
- For VMs: disable in-guest NTP to prevent conflicts with hypervisor time sync
- Why: Clock drift causes TLS authentication failures and log correlation issues
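The sync check can be scripted so a monitoring agent can alert on it. This is a minimal sketch, assuming a systemd host whose `timedatectl` supports the `show` subcommand (recent systemd versions); the function name `clock_synced` is illustrative and not part of any Ailevate tooling.

```bash
#!/usr/bin/env bash
# Succeeds when `timedatectl show` output (KEY=value lines on stdin)
# reports that the system clock is synchronized.
clock_synced() {
  grep -qx 'NTPSynchronized=yes'
}

# Example (hypothetical alert hook):
#   timedatectl show | clock_synced || echo "ALERT: clock not synchronized"
```

Reading the status from stdin keeps the check easy to test with captured output before wiring it into Nagios, Prometheus, or Datadog.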
DNS Resolution:
- Use redundant DNS servers; configure forward and reverse lookups
- Test resolution weekly: `nslookup <hostname>` for both Ailevate (`*.ailevate.com`) and internal hosts
- Why: DNS failures cause immediate connectivity loss
TLS Certificates:
- Monitor certificate expiration dates; renew 30+ days before expiration
- Check expiry: `openssl s_client -connect <host>:<port> -showcerts </dev/null 2>/dev/null | openssl x509 -noout -dates` (redirecting stdin from `/dev/null` lets `s_client` exit instead of waiting for input)
- Coordinate client certificate updates with Ailevate support (On-Premise Elasticsearch and AI Warehouse)
- Why: Expired certificates cause immediate service outages
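The 30-day renewal rule can be turned into a numeric check. The sketch below assumes GNU `date` (for `-d` parsing) and takes the `notAfter` value printed by `openssl x509 -noout -enddate`; the function name `cert_days_left` is hypothetical.

```bash
#!/usr/bin/env bash
# Print whole days between "now" and a certificate's notAfter date.
# $1: date string as printed by `openssl x509 -noout -enddate` (after "notAfter=")
# $2: optional "now" in epoch seconds (defaults to the current time)
cert_days_left() {
  local not_after="$1" now_epoch="${2:-$(date +%s)}" expiry_epoch
  expiry_epoch=$(date -d "$not_after" +%s) || return 2
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Example (placeholders as in the command above):
#   end=$(openssl s_client -connect <host>:<port> </dev/null 2>/dev/null \
#         | openssl x509 -noout -enddate | cut -d= -f2)
#   [ "$(cert_days_left "$end")" -ge 30 ] || echo "ALERT: renew certificate"
```

For a local certificate file, `openssl x509 -checkend <seconds>` offers a similar pass/fail check without any date arithmetic.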
Firewall Rules:
- Document all changes in change management system; review quarterly
- Monitor logs for unexpected blocked traffic
- Test connectivity after any firewall changes
- Why: Misconfigured rules can block required traffic or expose systems
See Also:
- SaaS Deployment Guide - Outbound firewall requirements
- On-Premises Deployment Guide - Inbound and outbound requirements
- Relay Service Deployment Guide - Network flow matrix
Role-Based Access Control (RBAC)
Implement the principle of least privilege for all administrative access.
Best Practices:
- Separate duties: infrastructure admins, application admins, security teams
- Document access policies and review quarterly
- Remove access immediately upon role change or termination
- Use individual accounts (no shared credentials)
- Require MFA for administrative access
Compliance Alignment: HIPAA and SOC2 require documented access controls and regular reviews.
For detailed user role management and permission assignment in the Revenue Recovery platform, see User Role Management.
Relay Service Best Practices
Applies to: All deployments (SaaS and On-Premise)
Deployment and Placement
Network Proximity:
- Deploy Relay VM in same subnet/LAN as EHR SQL datastore
- Minimizes latency for SQL queries (port 1433)
- Test latency: `ping <sql-hostname>` (should be <5ms for optimal performance)
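The latency target can be checked automatically by parsing the summary line that `ping` prints. This is a sketch under the assumption that the summary uses the usual `min/avg/max/...` format; the helper names `ping_avg_ms` and `latency_ok` are illustrative.

```bash
#!/usr/bin/env bash
# Extract the average round-trip time (ms) from ping's summary line on stdin.
# Linux prints "rtt min/avg/max/mdev = ..."; BSD/macOS prints "round-trip ...".
ping_avg_ms() {
  awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Succeed when the average ($1, ms) is below the limit ($2, default 5 ms).
latency_ok() {
  [ -n "$1" ] || return 2
  awk -v avg="$1" -v max="${2:-5}" 'BEGIN { exit !(avg + 0 < max + 0) }'
}

# Example:
#   avg=$(ping -c 5 <sql-hostname> | ping_avg_ms)
#   latency_ok "$avg" || echo "WARN: SQL latency ${avg} ms exceeds target"
```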
Connectivity:
- Ensure stable outbound HTTPS (port 443) to `*.ailevate.com`
- Avoid proxy servers unless absolutely required (adds latency and failure points)
- Validate: `curl -vk https://<relay-endpoint>.ailevate.com` (`-k` skips certificate verification; omit it to also validate the certificate chain)
Operational Maintenance
Establish regular operational rhythms to maintain Relay health and catch issues early:
Apply OS Security Patches (Monthly):
- Run: `sudo apt update && sudo apt upgrade`
- Reboot if kernel updated (schedule during maintenance window)
- Why: Prevents security exploits and ensures system stability
Review Service Logs (Weekly):
- Check for errors: `sudo journalctl -u ailevate-tunnel.service --since "7 days ago" | grep -i error`
- Export for support: `sudo journalctl -u ailevate-tunnel.service --since "7 days ago" > relay-weekly.log`
- Why: Early detection of connectivity or SQL errors before they impact ingestion
Rotate SQL Credentials (60-90 days):
- Generate new credentials → Update in Revenue Recovery platform → Test connectivity → Retire old credentials
- Why: Meets security policies and compliance requirements (HIPAA)
Test SQL Connectivity (Monthly):
- Verify port access: `nc -vz <sql-hostname> 1433`
- Why: Proactively detects network or firewall issues
Verify Time Sync (Weekly):
- Check status: `timedatectl status`
- Why: TLS authentication depends on accurate time
Verify DNS Resolution (Weekly):
- Test hostnames: `nslookup <hostname>` for both `*.ailevate.com` and SQL server
- Why: DNS failures cause immediate connectivity loss
Review Disk Space (Weekly):
- Check utilization: `df -h` (ensure ≥20% free)
- Why: Prevents service failures due to disk exhaustion
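The weekly disk review can be reduced to a filter that prints only the filesystems breaching the 20% free-space rule. A minimal sketch, assuming POSIX-format `df -P` output and mount points without spaces; `low_disk_mounts` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Read `df -P` output on stdin and print each mount point whose free space
# is below the given percentage (default 20).
low_disk_mounts() {
  awk -v min="${1:-20}" 'NR > 1 {
    used = $5; sub(/%/, "", used)      # "85%" -> "85"
    free = 100 - used
    if (free < min + 0) print $6, free "% free"
  }'
}

# Example:
#   df -P | low_disk_mounts 20
```

An empty result means every filesystem meets the threshold, which makes the function convenient to call from cron or a monitoring check.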
Log Management:
- Configure centralized logging for long-term retention (HIPAA compliance)
- See Component Diagnostics & Troubleshooting for log locations and error patterns
Monitoring and Alerting
Configure alerts for:
- Service failures (`ailevate-tunnel.service`, `ailevate-proxy.service`)
- Extended connectivity loss (>15 minutes)
- Disk space warnings (<20% free)
- Time sync failures
Integration: Integrate Relay monitoring into existing infrastructure platforms (Nagios, Prometheus, Datadog).
See Component Diagnostics & Troubleshooting for diagnostic commands.
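For the service-failure alert, a small wrapper around `systemctl is-active` may be enough to feed an existing monitoring platform. The sketch below is an assumption about how such a check could look, not an Ailevate-provided script; `inactive_units` and the `SYSTEMCTL` override are illustrative.

```bash
#!/usr/bin/env bash
# Print every unit from the argument list that is not currently active.
# SYSTEMCTL can be overridden (for example in tests); it defaults to systemctl.
inactive_units() {
  local unit
  for unit in "$@"; do
    "${SYSTEMCTL:-systemctl}" is-active --quiet "$unit" || echo "$unit"
  done
}

# Example:
#   down=$(inactive_units ailevate-tunnel.service ailevate-proxy.service)
#   [ -z "$down" ] || echo "ALERT: inactive Relay services: $down"
```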
Elasticsearch Best Practices (On-Premise Only)
Note: Ailevate manages Elasticsearch in SaaS deployments. Skip this section if you're on SaaS.
Configuration and Sizing Requirements
| Component | Minimum Requirement | Optimal Configuration | Critical Threshold | Impact if Exceeded |
|---|---|---|---|---|
| Cluster Size | 3 nodes | 3+ nodes for HA | N/A | Single-node = no redundancy |
| Master Nodes | Part of data nodes | Dedicated master-eligible nodes | For clusters >5,000 claims/day | Stability issues under load |
| Storage Type | SSD | NVMe | N/A | Spinning disks cannot meet IOPS |
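| Disk Utilization | <85% | <70% | 85% = low watermark; 90% = high watermark; 95% = flood stage (read-only), assuming Elasticsearch default watermarks | Shard allocation stops; at flood stage indices become read-only and the cluster is unusable for writes |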
| JVM Heap | <32 GB/node | 31 GB (for 64GB RAM node) | ~32 GB (compressed pointers limit) | Loses the compressed-pointers performance optimization |
| Shard Size | Target ~50 GB | 30-50 GB per shard | >100 GB | Query performance degradation |
| Page Cache | Remaining RAM after heap | ~50% of RAM (for 64GB node: 33GB) | N/A | Lucene file system cache for performance |
See On-Premises Deployment Guide for detailed cluster sizing tables based on claim volume.
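The heap and page cache rows in the table can be expressed as a simple rule: give the JVM about half of the node's RAM, capped below the compressed-pointers limit, and leave the rest to the file system cache. The sketch below assumes that half-of-RAM rule (a common Elasticsearch guideline, not a value stated by this guide) and a 31 GB cap taken from the table; `recommended_heap_gb` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Suggest a JVM heap size (GB) for a node with the given RAM (GB):
# half of RAM, capped at 31 GB to stay under the compressed-pointers limit.
recommended_heap_gb() {
  local ram_gb="$1" heap
  heap=$(( ram_gb / 2 ))
  (( heap > 31 )) && heap=31
  echo "$heap"
}

# Example: a 64 GB node gets a 31 GB heap, leaving 33 GB for page cache.
#   recommended_heap_gb 64
```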
Network and Connectivity
Inbound from Ailevate Cloud:
- Allow inbound TLS from Ailevate Cloud Services IP ranges to port 9200
- Coordinate with Ailevate support for current IP ranges
- Test connectivity after firewall changes
Inter-Node Communication:
- Keep network latency <2ms between nodes (ideal)
- Deploy nodes in same region/availability zone when possible
- Ensure sufficient bandwidth for shard replication
Capacity Monitoring
Check disk allocation regularly:
curl -XGET "https://<elastic-host>:9200/_cat/allocation?v" \
--cacert /path/to/ca.crt \
--cert /path/to/client.crt \
--key /path/to/client.key
Plan capacity expansion when the disk growth trend indicates utilization will reach 85% within 3 months.
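To judge whether 85% is less than 3 months away, a rough linear projection is usually sufficient. This sketch assumes steady growth measured in percentage points per month, which real clusters may not follow; `months_to_threshold` is an illustrative name.

```bash
#!/usr/bin/env bash
# Estimate months until disk utilization reaches a threshold, assuming
# linear growth.
# $1: current utilization (%), $2: growth per month (percentage points),
# $3: threshold (%), default 85
months_to_threshold() {
  awk -v cur="$1" -v rate="$2" -v limit="${3:-85}" 'BEGIN {
    if (rate + 0 <= 0)        { print "inf"; exit }
    if (cur + 0 >= limit + 0) { print "0.0"; exit }
    printf "%.1f\n", (limit - cur) / rate
  }'
}

# Example: 70% used, growing 5 points per month -> 3.0 months to 85%.
#   months_to_threshold 70 5
```

A result at or below 3 suggests that expansion planning should already be under way.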
TLS Certificate Lifecycle
Maintain Certificates:
- Node-to-node TLS certificates
- Client-to-cluster TLS certificates
- Monitor expiry dates; renew 30+ days before expiration
Coordinate with Ailevate:
- Client certificate updates require coordination with Ailevate support
- Test new certificates in non-production first
Backup Configuration
Elasticsearch snapshots are critical for disaster recovery. Configure automated daily backups at a minimum.
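One-Time Snapshot Repository Setup (for an `fs` repository, the location must be a shared filesystem mounted on every node and listed under `path.repo` in `elasticsearch.yml`):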
curl -XPUT "https://<elastic-host>:9200/_snapshot/backups" \
-H 'Content-Type: application/json' \
--cacert /path/to/ca.crt \
--cert /path/to/client.crt \
--key /path/to/client.key \
-d '{
"type": "fs",
"settings": {
"location": "/mnt/backups/elasticsearch"
}
}'
Daily Snapshot Automation:
# Take snapshot (add to daily cron job; inside a crontab entry, escape each % as \%)
curl -XPUT "https://<elastic-host>:9200/_snapshot/backups/snapshot_$(date +%Y%m%d)" \
--cacert /path/to/ca.crt \
--cert /path/to/client.crt \
--key /path/to/client.key
Best Practices:
- Store snapshots in geographically separate location
- Test restore procedures quarterly
- Maintain 30-90 day retention (typical for HIPAA)
See the "Backup & Disaster Recovery" section below for complete backup/DR strategy.
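Enforcing the 30-90 day retention window requires identifying snapshots that have aged out. The sketch below relies only on the `snapshot_YYYYMMDD` naming used by the daily job above; `expired_snapshots` is a hypothetical helper, and how the names are listed and deleted is left to your tooling.

```bash
#!/usr/bin/env bash
# Read snapshot names (one per line) on stdin and print those named
# snapshot_YYYYMMDD whose date is older than the cutoff ($1, YYYYMMDD).
# Names that do not match the pattern are ignored.
expired_snapshots() {
  local cutoff="$1" name
  while IFS= read -r name; do
    if [[ $name =~ ^snapshot_([0-9]{8})$ ]] \
       && (( 10#${BASH_REMATCH[1]} < 10#$cutoff )); then
      echo "$name"
    fi
  done
}

# Example (GNU date assumed for the relative date):
#   cutoff=$(date -d '30 days ago' +%Y%m%d)
#   printf '%s\n' snapshot_20240101 snapshot_20240301 | expired_snapshots "$cutoff"
```

Each name printed is a candidate for deletion through the snapshot API. Elasticsearch's built-in snapshot lifecycle management may be a better fit if your version and license support it.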
AI Warehouse Best Practices (On-Premise Only)
Note: Ailevate manages AI Warehouse in SaaS deployments. Skip this section if you're on SaaS.
Hardware Comparison and Requirements
| Requirement | TT-LoudBox (Air-Cooled) | TT-QuietBox (Liquid-Cooled) |
|---|---|---|
| Form Factor | 4U Rack-mount | Desktop |
| Accelerators | 4x Wormhole | 4x Blackhole |
| Cooling System | Front-to-back airflow (fan-based) | Sealed liquid cooling (maintenance-free) |
| Cooling Requirements | Adequate rack airflow; verify fans operational | Ambient temp <35°C (95°F); coolant pre-filled |
| Power Configuration | Dual PSU (1+1 redundancy) | Single 20A dedicated circuit |
| Power Best Practices | Monitor PSU LEDs during inspections | Direct wall outlet only (no surge protectors/strips) |
| QSFP-DD Interconnect | 2 cables | 8 cables (mesh topology) |
| Link Status Indicators | Solid green LEDs = healthy | Solid green LEDs = healthy |
| Physical Inspection | Check PSU LEDs, verify airflow, remove obstructions | Check BMC for cooling status, verify ambient temp |
Common to Both:
- Weekly health checks: `lspci -d 1e52:` (verify 4 accelerators detected) and `tt-smi` (check status)
- Monthly: Physical inspection of QSFP-DD cables, PSU status, thermal monitoring via BMC
- TLS-secured vLLM API on port 8080
- Inbound connectivity from Ailevate Cloud Services required
See Detailed Guides:
- Tenstorrent LoudBox Guide - Rack-mount air-cooled setup
- Tenstorrent QuietBox Guide - Desktop liquid-cooled setup
Regular Health Checks
Weekly:
# Verify accelerators detected
sudo update-pciids && lspci -d 1e52:
# Check accelerator status
tt-smi
# Test vLLM API
curl -k https://<ai-warehouse-host>:8080/v1/models
Monthly:
- Physical inspection: QSFP-DD cables, PSU LEDs, cooling
- BMC event log review
- Temperature trend analysis
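The weekly accelerator check can be automated by counting the devices that `lspci -d 1e52:` reports. A minimal sketch, assuming one output line per detected accelerator and the expected count of 4 stated above; `accelerators_ok` is an illustrative name.

```bash
#!/usr/bin/env bash
# Count PCI devices listed on stdin (one line per device, as printed by
# `lspci -d 1e52:`) and succeed only if the count matches the expected
# number of accelerators (default 4).
accelerators_ok() {
  local expected="${1:-4}" found
  found=$(grep -c .)
  echo "detected ${found} of ${expected} accelerators"
  [ "$found" -eq "$expected" ]
}

# Example:
#   lspci -d 1e52: | accelerators_ok 4 || echo "ALERT: accelerator missing"
```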
Mesh Interconnect Links
Verify Topology:
- QuietBox: 8 QSFP-DD cables (mesh topology)
- LoudBox: 2 QSFP-DD cables
Check Link Status:
- Inspect QSFP-DD port LEDs (solid green = healthy)
- Reseat cables if link failures detected
- Verify topology matches deployment guide diagrams
BMC Security and Monitoring
Harden BMC Access:
1. Change default password immediately
2. Use strong passwords (16+ characters)
3. Restrict BMC access to management VLAN if possible
4. Monitor BMC access logs for unauthorized attempts
Regular BMC Checks:
- Access `http://<bmc-ip>`
- Review System Health → Event Log
- Monitor Sensors → Temperature readings
- Check Power → PSU status
vLLM API Security
TLS Configuration:
- Enforce TLS for vLLM API endpoint (port 8080)
- Allow inbound connections only from Ailevate Cloud Services IP ranges
- Monitor API access logs for unauthorized attempts
Certificate Management:
- Rotate vLLM API certificates before expiration
- Coordinate client certificate updates with Ailevate support
Security Best Practices
Healthcare data requires robust security practices aligned with HIPAA and SOC2.
Security Monitoring by Component
| Component | What to Monitor | Where to Look | Red Flags |
|---|---|---|---|
| Platform Access (All Deployments) | • Failed login attempts • Unusual access patterns • Privilege escalation • Session anomalies | Admin → Audit Logs in Revenue Recovery UI | • Concurrent sessions from different locations • Unexpected location/time access • Repeated failed logins |
| Elasticsearch (On-Premise Only) | • TLS certificate validity • Unauthorized connections • Client cert authentication • Query patterns | • Elasticsearch access logs • Certificate expiry monitoring | • Non-Ailevate IP sources • Bulk exports • Certificate expiring <30 days |
| AI Warehouse (On-Premise Only) | • BMC access attempts • vLLM API access • TLS certificate validity | • BMC access logs • vLLM API logs | • Unauthorized BMC access attempts • TLS handshake failures • API access from unknown IPs |
| Relay VM (All Deployments) | • SSH access • sudo usage • Service changes • Outbound connections | • `/var/log/auth.log` • Service logs | • Unexpected outbound connections (should only be `*.ailevate.com`) • Unauthorized SSH access • Unusual sudo activity |
| Network (All Deployments) | • Firewall blocked attempts • Inbound connections (On-Premise) | • Firewall logs • Network monitoring | • Repeated connection attempts from unknown IPs • Unexpected inbound connections to Elasticsearch/AI Warehouse |
| Data Access (All Deployments) | • Bulk data exports • Unusual claim access • PHI access patterns • API usage | • Audit logs • API access logs | • Mass exports by single user • Access to claims outside normal scope • Automated scraping patterns |
Infrastructure Security Best Practices (On-Premise):
- Elasticsearch: Verify client certificate authentication enforced; monitor for unusual query patterns
- AI Warehouse: Restrict BMC to management VLAN; monitor vLLM API access logs
- Network: Monitor inbound connections (should only originate from Ailevate Cloud IP ranges)
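As one example of turning the Relay VM red flags into a routine check, failed SSH password attempts can be summarized per source address. The sketch assumes the standard OpenSSH "Failed password ... from <ip>" log format found in `/var/log/auth.log` on Debian and Ubuntu; `failed_ssh_by_ip` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Read sshd log lines on stdin and print "<count> <source-ip>" for failed
# password attempts, most frequent source first.
failed_ssh_by_ip() {
  awk '/sshd/ && /Failed password/ {
    for (i = 1; i < NF; i++)
      if ($i == "from") { count[$(i + 1)]++; break }
  }
  END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Example:
#   sudo cat /var/log/auth.log | failed_ssh_by_ip | head
```

Repeated failures from a single address match the "Unauthorized SSH access" red flag in the table and may warrant a firewall review.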
Warning: Security Incident Response
Report suspected security incidents to Ailevate immediately, with "SECURITY INCIDENT" in the subject line. Coordinate with your internal security team and Ailevate.
Capacity Planning
Proactive capacity planning prevents resource exhaustion before it impacts operations.
When to Expand Infrastructure
Use these triggers to plan expansion with appropriate lead times.
| Component | Expansion Trigger | Lead Time | Contact Support When |
|---|---|---|---|
| Elasticsearch | • Disk trending >85% within 3 months • Query latency +25% over baseline • Cluster consistently YELLOW (shard constraints) | 3 months | Trending toward any threshold |
| AI Warehouse | • Task queue depth >100 consistently • Processing time +50% over baseline • Accelerator utilization >80% sustained | 6-8 weeks | Queue consistently high or performance degrading |
| Relay VM | • CPU/memory >75% consistently • Claim ingestion delays extending | 1 week | Resource constraints detected |
Establishing Performance Baselines (On-Premise):
When your deployment goes live, document these metrics to establish your baseline:
Elasticsearch:
- Query response times for common searches
- Indexing rates during peak windows
- Disk growth rate (GB/month)
- CPU/memory utilization during normal operation
- Shard count per index
AI Warehouse:
- AI task processing time per claim (average and 95th percentile)
- vLLM API response times
- Accelerator utilization during peak processing
- Temperature ranges during normal operation
Relay VM:
- CPU/memory during claim ingestion
- Network throughput during peak sync
- EHR query response times
Monthly Capacity Tracking:
- Calculate growth rates: `(current_usage - last_month_usage) / last_month_usage * 100`
- Project when critical thresholds will be reached
- Plan expansion 3-6 months in advance
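The growth-rate formula above translates directly into a small helper for monthly tracking. This is a sketch only; `growth_rate_pct` is an illustrative name, and the inputs can be in any unit (for example GB) as long as both use the same one.

```bash
#!/usr/bin/env bash
# Month-over-month growth rate in percent:
# (current_usage - last_month_usage) / last_month_usage * 100
# $1: current usage, $2: last month's usage (same unit)
growth_rate_pct() {
  awk -v cur="$1" -v prev="$2" 'BEGIN {
    if (prev + 0 <= 0) { print "n/a"; exit 1 }
    printf "%.1f\n", (cur - prev) / prev * 100
  }'
}

# Example: 600 GB last month, 660 GB now.
#   growth_rate_pct 660 600
```

Recording the result each month makes it easier to spot acceleration in growth before a threshold is near.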
Tip: Contact Ailevate support early when identifying capacity constraints. On-Premise Elasticsearch cluster expansion requires coordination; AI Warehouse hardware procurement takes 6-8 weeks.
Related Resources
For more detailed information:
- System Monitoring & Health – Monitoring responsibilities and health indicators
- Component Diagnostics & Troubleshooting – Diagnostic commands and troubleshooting
- Common Error Messages – User-facing application errors
- SaaS Deployment Guide – SaaS architecture and setup
- On-Premises Deployment Guide – On-Premises infrastructure and requirements
- Relay Service Deployment Guide – Relay installation and operations
- Tenstorrent LoudBox Guide – Air-cooled AI Warehouse hardware
- Tenstorrent QuietBox Guide – Liquid-cooled AI Warehouse hardware
- Security Settings – Authentication and access controls
Need Help? For guidance on operational best practices specific to your deployment, contact Ailevate support. We can help with capacity planning, maintenance scheduling, and infrastructure optimization.
