System Monitoring & Health
Understanding monitoring responsibilities and health indicators for SaaS and On-Premises deployments
Keeping your Revenue Recovery platform healthy starts with understanding what to monitor and who's responsible for monitoring it. This guide helps you recognize the difference between SaaS and On-Premises monitoring responsibilities, identify health indicators, and know when to take action. Whether you're troubleshooting an issue or planning your monitoring strategy, this overview gives you the foundation you need.
For hands-on diagnostic commands and procedures, see the Component Diagnostics & Health Checks guide. For operational best practices and maintenance schedules, see Operational Best Practices.
Understanding Your Platform Architecture
Revenue Recovery operates on a four-layer architecture where monitoring responsibilities vary by deployment model. For complete architecture details, see the Deployment Architecture guide. For a summary of components and requirements, see System Requirements.
The four layers are:
- Cloud Services Layer – Core application platform (Ailevate-managed in all deployments)
- Relay Service Layer – Lightweight VM for EHR connectivity (customer-managed in all deployments)
- Database Storage Layer – Elasticsearch datastore (Ailevate-managed in SaaS; customer-managed in On-Premises)
- AI Compute Layer – Tenstorrent AI hardware (Ailevate-managed in SaaS; customer-managed in On-Premises)
Your deployment model determines who monitors each layer:
| Deployment Model | Ailevate Monitors | Customer Monitors |
|---|---|---|
| SaaS | Cloud Services, Elasticsearch, AI Warehouse | Relay VM, EHR connectivity |
| On-Premises | Cloud Services only | Relay VM, Elasticsearch, AI Warehouse, EHR connectivity |
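For runbooks or alert routing, the responsibility table above can be written down as a small lookup. This is only an illustrative sketch: the component names are copied from the table, while the `RESPONSIBILITY` mapping and the `who_monitors` helper are hypothetical names, not part of any Ailevate tooling.

```python
# Transcribed from the responsibility table above.
RESPONSIBILITY = {
    "SaaS": {
        "Ailevate": {"Cloud Services", "Elasticsearch", "AI Warehouse"},
        "Customer": {"Relay VM", "EHR connectivity"},
    },
    "On-Premises": {
        "Ailevate": {"Cloud Services"},
        "Customer": {"Relay VM", "Elasticsearch", "AI Warehouse", "EHR connectivity"},
    },
}


def who_monitors(deployment: str, component: str) -> str:
    """Return the party that monitors a component in a given deployment model."""
    for party, components in RESPONSIBILITY[deployment].items():
        if component in components:
            return party
    raise KeyError(f"unknown component: {component}")
```

For example, `who_monitors("On-Premises", "Elasticsearch")` returns `"Customer"`, while the same component under `"SaaS"` returns `"Ailevate"`.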
Platform Component Visibility by Deployment Model
Understanding what you can and cannot monitor directly is crucial for effective system administration. This table shows your visibility into each platform layer:
| Platform Layer | SaaS Visibility | On-Premise Visibility | Key Differences |
|---|---|---|---|
| Cloud Services Layer | View UI symptoms only (errors, latency, login issues) | Same as SaaS | Ailevate monitors all Cloud Services in both deployment models |
| Relay Service Layer | Monitor Relay VM locally; Ailevate monitors tunnel status via Azure Arc | Same as SaaS | Identical monitoring responsibilities—you manage the Relay VM in both models |
| Elastic Datastore | View UI symptoms only; Ailevate monitors fully | You monitor directly (cluster health, nodes, disk, TLS) | On-Premise requires you to manage the Elasticsearch cluster |
| AI Warehouse | View UI symptoms only; Ailevate monitors fully | You monitor directly (accelerators, vLLM API, cooling, power) | On-Premise requires you to manage Tenstorrent hardware and software |
What This Means for You
If you're on SaaS:
- Focus your monitoring efforts on the Relay VM and your EHR connectivity
- Ailevate will monitor and maintain Cloud Services, Elasticsearch and AI infrastructure
- Report application-level symptoms to Ailevate support when they occur
If you're running On-Premises:
- You're responsible for comprehensive infrastructure monitoring (Elastic, AI Warehouse, Relay)
- Ailevate can detect symptoms when your infrastructure fails but cannot directly monitor your systems
Warning: On-Premise deployments require inbound HTTPS connectivity from Ailevate Cloud to your Elasticsearch cluster (port 9200) and AI Warehouse (port 8080) so the Cloud Services layer can access your infrastructure. You must monitor these inbound connections as part of your responsibilities. See the On-Premise Deployment Guide for complete network requirements.
For Relay-specific monitoring details, see the Relay Service Deployment Guide.
Monitoring Responsibilities by Deployment Model
SaaS: What Ailevate Monitors
In a SaaS deployment, Ailevate operates and monitors most of your platform infrastructure:
- Cloud Services: API uptime, authentication services, workflow execution, error rates
- Elasticsearch: Cluster health, disk capacity, shard allocation, indexing performance
- AI Warehouse: Accelerator availability, model inference performance, thermal and power status
- Data Pipeline: Claim ingestion rates, AI task execution, data validation
- Relay Connectivity: TLS tunnel status via Azure Arc (whether the Relay can reach Ailevate Cloud)
SaaS: What You Monitor
Your monitoring responsibilities in SaaS focus on local infrastructure:
Relay VM:
- Operating system health (disk space, memory, CPU)
- Service status (`ailevate-tunnel.service` and `ailevate-proxy.service`)
- Outbound HTTPS connectivity to Ailevate Cloud (port 443)
- DNS resolution for `*.ailevate.com`
- Time synchronization (NTP)
EHR Datastore:
- SQL Server availability and performance
- Database connectivity on port 1433
- Network reachability from Relay VM
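The Relay VM and EHR checks above could be scripted roughly as follows. This is a minimal sketch using only the Python standard library, and it assumes the Relay VM runs systemd. The service names and ports come from the lists above. The function names are hypothetical, and the two hostnames passed to `relay_report` are placeholders you would replace with your own Ailevate Cloud endpoint and SQL Server host. Time synchronization is not covered here.

```python
import socket
import subprocess

# Service names taken from the Relay VM list above.
RELAY_SERVICES = ["ailevate-tunnel.service", "ailevate-proxy.service"]


def service_is_active(unit: str) -> bool:
    """Return True if systemd reports the unit as active."""
    try:
        # `systemctl is-active --quiet` exits 0 only when the unit is active.
        result = subprocess.run(["systemctl", "is-active", "--quiet", unit], check=False)
    except FileNotFoundError:
        # systemctl is not installed on this host.
        return False
    return result.returncode == 0


def dns_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def relay_report(cloud_host: str, sql_host: str) -> dict[str, bool]:
    """Collect the customer-side Relay and EHR checks into one dictionary."""
    report = {f"service:{unit}": service_is_active(unit) for unit in RELAY_SERVICES}
    report[f"dns:{cloud_host}"] = dns_resolves(cloud_host)
    report["tcp:443"] = tcp_reachable(cloud_host, 443)    # outbound HTTPS to Ailevate Cloud
    report["tcp:1433"] = tcp_reachable(sql_host, 1433)    # SQL Server from the Relay VM
    return report
```

A TCP connect only shows that the port is open. It does not prove that the TLS handshake or SQL authentication would succeed, so treat a passing report as a first-level check rather than a full diagnosis.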
Application-Level Symptoms:
- Connection error banners in the Revenue Recovery UI
- "Elasticsearch Unavailable" messages (rare in SaaS)
- Data processing failures or workflow delays
- Authentication and login issues
Tip: In SaaS, if you see application errors that aren't related to EHR connectivity, report them to Ailevate support. These typically indicate issues with Ailevate-managed infrastructure that we'll investigate and resolve.
On-Premise: What You Monitor
On-Premises deployments require comprehensive infrastructure monitoring:
Elasticsearch Cluster:
- Cluster health status (GREEN/YELLOW/RED)
- Node availability and resource utilization
- Disk utilization (keep below 85% to avoid Elasticsearch watermark issues)
- TLS certificate expiry dates
- Shard allocation and replication status
- Inbound connectivity from Ailevate Cloud (port 9200/TLS)
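Several of the Elasticsearch checks above can be derived from two standard Elasticsearch endpoints, `_cluster/health` and `_cat/allocation?format=json`. The sketch below shows one way to turn their responses into a list of findings. The 85% limit comes from the list above. The function names are hypothetical, and `fetch_json` deliberately omits authentication and CA configuration, which a real cluster on port 9200/TLS would need.

```python
import json
import urllib.request

DISK_LIMIT_PERCENT = 85  # threshold from the list above


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the JSON body (authentication and CA setup omitted)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def evaluate_cluster(health: dict, allocation: list[dict]) -> list[str]:
    """Turn _cluster/health and _cat/allocation responses into findings."""
    findings = []
    status = health.get("status", "unknown")
    if status != "green":
        findings.append(f"cluster status is {status.upper()}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        findings.append(f"{unassigned} unassigned shards")
    for node in allocation:
        percent = node.get("disk.percent")
        # The row that lists unassigned shards carries no disk figures, so skip it.
        if percent is None:
            continue
        # _cat APIs return numbers as strings when format=json is used.
        if int(percent) >= DISK_LIMIT_PERCENT:
            findings.append(f"node {node.get('node')} disk at {percent}%")
    return findings
```

Assuming a hypothetical base URL such as `https://es.example.internal:9200`, the call would be `evaluate_cluster(fetch_json(base + "/_cluster/health"), fetch_json(base + "/_cat/allocation?format=json"))`. An empty list means none of these three checks raised a concern.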
AI Warehouse (Tenstorrent):
- Accelerator detection via `lspci` and `tt-smi` tools
- vLLM API endpoint health (port 8080/TLS)
- QSFP-DD interconnect link status
- Power supply redundancy (dual PSU status)
- Cooling system operation (airflow for LoudBox, coolant for QuietBox)
- BMC (Baseboard Management Controller) accessibility
- Inbound connectivity from Ailevate Cloud (port 8080/TLS)
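Two of the AI Warehouse checks above lend themselves to a short script: counting accelerators in `lspci` output and probing the vLLM endpoint. The sketch below rests on two assumptions that you should verify on your own hardware: that Tenstorrent cards enumerate under PCI vendor ID `1e52`, and that your vLLM server exposes its `/health` route on the API port. The function names are hypothetical. `tt-smi`, BMC, power, and cooling checks are not covered.

```python
import re
import subprocess
import urllib.request

# Assumption: Tenstorrent devices enumerate under this PCI vendor ID.
TENSTORRENT_VENDOR_ID = "1e52"


def count_accelerators(lspci_output: str, vendor_id: str = TENSTORRENT_VENDOR_ID) -> int:
    """Count lines of `lspci -n` output whose vendor:device pair matches the vendor."""
    pattern = re.compile(rf"\b{vendor_id}:[0-9a-f]{{4}}\b", re.IGNORECASE)
    return sum(1 for line in lspci_output.splitlines() if pattern.search(line))


def detected_accelerators() -> int:
    """Run lspci in numeric mode and count matching devices."""
    result = subprocess.run(["lspci", "-n"], capture_output=True, text=True, check=True)
    return count_accelerators(result.stdout)


def vllm_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers GET /health with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError both derive from OSError, so this covers
        # refused connections, timeouts, TLS failures, and non-2xx responses.
        return False
```

Comparing `detected_accelerators()` against the number of cards you expect in the chassis gives a simple alert condition: a lower count suggests a card has dropped off the PCIe bus.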
Relay VM:
- Same as SaaS (services, connectivity, DNS, NTP)
Network Performance:
- Bandwidth and latency between Cloud Services, Elasticsearch, and AI Warehouse
See the On-Premise Deployment Guide for sizing recommendations and detailed network requirements.
Recommended Monitoring Stack for On-Premise
Ailevate uses Prometheus (metrics collection) and Grafana (visualization) to monitor our Cloud Services layer. We recommend On-Premise customers adopt a similar stack for consistency:
- Elasticsearch: Use Prometheus Elasticsearch exporter for cluster health, query latency, and disk metrics
- AI Warehouse: Expose hardware metrics (accelerator status, temperature, vLLM API response times) via node exporters
- Relay VM: Monitor service uptime, log error rates, and system resources
This alignment simplifies support conversations and provides industry-standard open-source tooling.
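One lightweight way to feed custom checks into this stack is the Prometheus node exporter's textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. The sketch below renders a gauge in the Prometheus text exposition format and writes it atomically. The metric name `relay_service_up`, the `unit` label, and both function names are hypothetical choices for illustration, not an Ailevate convention.

```python
import os
import tempfile


def render_gauge(name: str, help_text: str, samples: dict[str, float], label: str) -> str:
    """Render one gauge, with one labelled sample per entry, as exposition text."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for label_value, value in sorted(samples.items()):
        lines.append(f'{name}{{{label}="{label_value}"}} {value}')
    return "\n".join(lines) + "\n"


def write_atomically(path: str, content: str) -> None:
    """Write via a temp file and rename so the collector never reads a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # mkstemp creates the file as 0600; relax it so the exporter's user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

A cron job or systemd timer on the Relay VM could call `render_gauge` with `1` or `0` for each service and write the result to a file such as `relay.prom` in the collector directory. Grafana panels and alert rules can then use the metric like any other.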
On-Premise: What Ailevate Monitors
Ailevate monitors the Cloud Services layer and can detect symptoms when your infrastructure fails:
- Symptom Detection: Failed Elasticsearch queries indicate your cluster is unreachable; failed AI tasks indicate your AI Warehouse is unavailable
- Support Assistance: When symptoms occur, Ailevate support helps troubleshoot, but you must provide infrastructure diagnostics (cluster health, hardware status, logs)
Warning: In On-Premise deployments, Ailevate cannot directly monitor your Elasticsearch cluster, AI Warehouse, or Relay VM internals. We can only see symptoms when these components fail. You're responsible for proactive monitoring to catch issues before they impact operations.
Recognizing System Health Indicators
Revenue Recovery provides several ways to monitor system health directly from the application and through automated alerts.
In-Application Indicators
The platform communicates system health through various UI messages and notifications. Understanding these indicators helps you determine whether issues are infrastructure-related (covered in this guide) or application-level errors.
EHR Connection Error Banner:
Appears when the Relay cannot reach your EHR or SQL connectivity fails. This indicates a problem with:
- Relay VM services not running
- SQL Server unavailable
- Network blocking port 1433
"Elasticsearch Unavailable" Message:
- SaaS: Rare—indicates an Ailevate-managed infrastructure issue
- On-Premise: Your Elasticsearch cluster is unreachable or degraded; check cluster health immediately
Critical: New Elasticsearch Instance Data Loss (On-Premise Only)
If your On-Premise Elasticsearch cluster fails completely and cannot be recovered, Ailevate can guide you through onboarding a new ES instance via a dedicated workflow. However, replacing your Elasticsearch cluster results in complete data loss: all historical claims data, AI analysis results, and platform history will be permanently lost.
Prevention is critical: Implement daily automated snapshots (see Operational Best Practices) and test restore procedures quarterly. A properly maintained backup strategy is the only way to recover from catastrophic Elasticsearch failure without data loss.
AI Task Processing Failures:
Workflows show "AI processing failed" or similar errors when AI tasks cannot complete. The response depends on your deployment:
- SaaS: Ailevate investigates the AI Warehouse issue
- On-Premise: Check your AI Warehouse status, vLLM API health, and accelerator availability
Authentication Failures:
Users cannot log in, MFA doesn't work, or sessions time out unexpectedly. This typically indicates a Cloud Services issue (Ailevate-managed), so report it to support.
"Fetching Data Failed" Notifications:
Toast notifications indicating backend data retrieval issues:
- SaaS: May indicate Elasticsearch issue (Ailevate investigates)
- On-Premise: Check your Elasticsearch cluster health and connectivity
Tip: For user-facing errors like HTTP status codes (400, 401, 403, 404, 500, 503), toast notifications, or feature-specific messages, see the Common Error Messages guide. This guide focuses on infrastructure-level monitoring and diagnostics.
Email Alerts
Relay Tunnel Failure:
- Ailevate detects Relay tunnel disconnection via Azure Arc and sends an alert. Check your Relay VM status, services, and network connectivity.
Elasticsearch Unreachable (On-Premise):
- Ailevate detects failed Elasticsearch queries and alerts you to check your cluster. Run cluster health checks immediately.
AI Warehouse Connection Failure (On-Premise):
- Ailevate detects failed AI task routing and alerts you to check your AI Warehouse. Verify vLLM API health and accelerator status.
Repeated Ingestion Failures:
- Ongoing EHR data sync issues. Check Relay logs, SQL connectivity, and EHR datastore health.
Service Behavior During Degradation
Understanding how the platform behaves when components fail helps you prioritize remediation:
| Component Down | Effect on Operations | Priority |
|---|---|---|
| Relay VM | No new claims ingested; existing data accessible | High—stops data flow |
| Elasticsearch | UI incomplete/unreadable; searches fail; reports unavailable | Critical—platform unusable |
| AI Warehouse | AI-powered tasks unavailable (denial analysis, recommendations) | Medium—manual workflows still function |
| SQL Datastore | EHR claim ingestion halts; no new data synchronized | High—stops data flow |
| Cloud Services | Application unavailable or severely degraded | Critical—platform unusable |
Note: Revenue Recovery is designed to maintain read access to existing data even when ingestion or AI processing fails. Users can continue working with claims already in the system while you resolve infrastructure issues.
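When several components fail at once, the degradation table above implies a remediation order. The sketch below encodes the table and sorts failed components by priority. The priorities and effects are transcribed from the table, while the `Priority` enum, the `IMPACT` mapping, and the `triage` helper are hypothetical names used only for illustration.

```python
from enum import IntEnum


class Priority(IntEnum):
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# Transcribed from the degradation table above.
IMPACT = {
    "Relay VM": (Priority.HIGH, "No new claims ingested; existing data accessible"),
    "Elasticsearch": (Priority.CRITICAL, "UI incomplete/unreadable; searches fail; reports unavailable"),
    "AI Warehouse": (Priority.MEDIUM, "AI-powered tasks unavailable (denial analysis, recommendations)"),
    "SQL Datastore": (Priority.HIGH, "EHR claim ingestion halts; no new data synchronized"),
    "Cloud Services": (Priority.CRITICAL, "Application unavailable or severely degraded"),
}


def triage(down: list[str]) -> list[tuple[str, Priority, str]]:
    """Order failed components so the highest-priority one comes first."""
    rows = [(name, *IMPACT[name]) for name in down]
    # sorted() is stable, so components with equal priority keep their input order.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

For example, `triage(["AI Warehouse", "Relay VM", "Elasticsearch"])` puts Elasticsearch first, then the Relay VM, then the AI Warehouse, matching the Critical, High, Medium ordering in the table.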
Monitoring Responsibility Quick Reference
Use this table to quickly identify who monitors each component in your deployment:
| Component | SaaS | On-Premises | Monitoring Focus |
|---|---|---|---|
| Cloud Services | Ailevate | Ailevate | API health, authentication, workflows |
| Relay VM Services | Customer | Customer | Service status, logs, system resources |
| Relay Connectivity | Customer + Ailevate | Customer + Ailevate | Outbound 443, SQL 1433; Ailevate monitors tunnel status |
| Elasticsearch | Ailevate | Customer | Cluster health, disk, nodes, TLS, inbound connectivity |
| AI Warehouse | Ailevate | Customer | Accelerators, vLLM API, cooling, power, inbound connectivity |
| EHR Datastore | Customer | Customer | SQL Server health, connectivity, performance |
| Network Performance | Shared | Customer | Ailevate monitors Cloud; Customer monitors infrastructure |
Tip:
- SaaS: Focus monitoring on Relay VM and EHR connectivity
- On-Premise: Comprehensive monitoring of Relay VM, Elasticsearch, AI Warehouse, and EHR connectivity
When to Contact Ailevate Support
Immediate Escalation Required
Contact [email protected] immediately (mark subject "CRITICAL") for:
- Elasticsearch cluster status RED (data unavailable, On-Premise only)
- Revenue Recovery application completely unavailable for all users
- Suspected security breach or unauthorized access
- Data integrity concerns (missing claims, corrupted data)
- AI Warehouse complete failure (all accelerators offline, On-Premise only)
Note: Ailevate is also automatically notified of several critical issues requiring escalation and support.
Standard Support Requests
For SaaS Deployments:
Contact [email protected] for:
- Persistent Relay connectivity issues after completing troubleshooting steps
- Application errors or performance degradation
- Capacity planning guidance
- Questions about platform features or configuration
For On-Premise Deployments:
Contact [email protected] for:
- Persistent Relay connectivity issues after completing troubleshooting steps
- Elasticsearch performance degradation or cluster issues (you manage the infrastructure; we provide guidance)
- AI Warehouse hardware failures or performance issues (you manage the hardware; we provide guidance)
- Application errors not related to your infrastructure (Cloud Services issues)
- Capacity planning guidance
Important: In On-Premise deployments, you are responsible for managing your Elasticsearch cluster and AI Warehouse infrastructure. Ailevate provides troubleshooting guidance and software support, but infrastructure operations, monitoring, and maintenance are your responsibility.
Related Resources
For more detailed information on specific topics:
- Component Diagnostics & Troubleshooting – Hands-on diagnostic commands and procedures for all platform components
- Operational Best Practices – Maintenance schedules, security practices, capacity planning, and backup/DR guidance
- Common Error Messages – User-facing application errors and resolution steps
- SaaS Deployment Guide – Architecture, network requirements, and SaaS-specific setup
- On-Premises Deployment Guide – Complete On-Premises architecture, sizing, and connectivity requirements
- Relay Service Deployment Guide – Detailed Relay installation, configuration, and operational runbook
- Tenstorrent LoudBox Guide – Air-cooled AI Warehouse hardware setup and maintenance
- Tenstorrent QuietBox Guide – Liquid-cooled AI Warehouse hardware setup and maintenance
Need Help? If you have questions about monitoring your Revenue Recovery deployment or need assistance interpreting health indicators, contact Ailevate support at [email protected].
