Keeping your Revenue Recovery platform healthy starts with understanding what to monitor and who's responsible for monitoring it. This guide helps you recognize the difference between SaaS and On-Premises monitoring responsibilities, identify health indicators, and know when to take action. Whether you're troubleshooting an issue or planning your monitoring strategy, this overview gives you the foundation you need.

For hands-on diagnostic commands and procedures, see the Component Diagnostics & Health Checks guide. For operational best practices and maintenance schedules, see Operational Best Practices.

Understanding Your Platform Architecture

Revenue Recovery operates on a four-layer architecture where monitoring responsibilities vary by deployment model. For complete architecture details, see the Deployment Architecture guide. For a summary of components and requirements, see System Requirements.

The four layers are:

Cloud Services Layer – Core application platform (Ailevate-managed in all deployments)
Relay Service Layer – Lightweight VM for EHR connectivity (customer-managed in all deployments)
Database Storage Layer – Elasticsearch datastore (Ailevate-managed in SaaS; customer-managed in On-Premises)
AI Compute Layer – Tenstorrent AI hardware (Ailevate-managed in SaaS; customer-managed in On-Premises)

Your deployment model determines who monitors each layer:

Deployment Model	Ailevate Monitors	Customer Monitors
SaaS	Cloud Services, Elasticsearch, AI Warehouse	Relay VM, EHR connectivity
On-Premises	Cloud Services only	Relay VM, Elasticsearch, AI Warehouse, EHR connectivity

Platform Component Visibility by Deployment Model

Understanding what you can and cannot monitor directly is crucial for effective system administration. This table shows your visibility into each platform layer:

Platform Layer	SaaS Visibility	On-Premise Visibility	Key Differences
Cloud Services Layer	View UI symptoms only (errors, latency, login issues)	Same as SaaS	Ailevate monitors all Cloud Services in both deployment models
Relay Service Layer	Monitor Relay VM locally; Ailevate monitors tunnel status via Azure Arc	Same as SaaS	Identical monitoring responsibilities—you manage the Relay VM in both models
Elastic Datastore	View UI symptoms only; Ailevate monitors fully	You monitor directly (cluster health, nodes, disk, TLS)	On-Premise requires you to manage the Elasticsearch cluster
AI Warehouse	View UI symptoms only; Ailevate monitors fully	You monitor directly (accelerators, vLLM API, cooling, power)	On-Premise requires you to manage Tenstorrent hardware and software

What This Means for You

If you're on SaaS:

Focus your monitoring efforts on the Relay VM and your EHR connectivity
Ailevate will monitor and maintain Cloud Services, Elasticsearch and AI infrastructure
Report application-level symptoms to Ailevate support when they occur

If you're running On-Premises:

You're responsible for comprehensive infrastructure monitoring (Elastic, AI Warehouse, Relay)
Ailevate can detect symptoms when your infrastructure fails but cannot directly monitor your systems

⚠️
Warning: On-Premise deployments require inbound HTTPS connectivity from Ailevate Cloud to your Elasticsearch cluster (port 9200) and AI Warehouse (port 8080) so the Cloud Services layer can access your infrastructure.
You must monitor these inbound connections as part of your responsibilities. See the On-Premise Deployment Guide for complete network requirements.
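Because these inbound connections use HTTPS, certificate expiry is one condition worth watching alongside port reachability. The following is a minimal Python sketch, not an Ailevate-provided tool, that reads the certificate presented on a TLS port and reports the days remaining. The function names and the optional `cafile` parameter (for clusters signed by a private CA) are assumptions made for illustration.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a certificate's notAfter string into days remaining."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(host: str, port: int, cafile: Optional[str] = None,
                    timeout: float = 5.0) -> str:
    """Open a verified TLS connection and return the peer certificate's notAfter.

    Verification is left enabled, so an already expired or untrusted
    certificate raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

For example, `days_until_expiry(fetch_not_after(host, 9200))` could feed an alert that fires when the result drops below a threshold you choose; the same call works for the AI Warehouse endpoint on port 8080.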

For Relay-specific monitoring details, see the Relay Service Deployment Guide.

Monitoring Responsibilities by Deployment Model

SaaS: What Ailevate Monitors

In a SaaS deployment, Ailevate operates and monitors most of your platform infrastructure:

Cloud Services: API uptime, authentication services, workflow execution, error rates
Elasticsearch: Cluster health, disk capacity, shard allocation, indexing performance
AI Warehouse: Accelerator availability, model inference performance, thermal and power status
Data Pipeline: Claim ingestion rates, AI task execution, data validation
Relay Connectivity: TLS tunnel status via Azure Arc (whether the Relay can reach Ailevate Cloud)

SaaS: What You Monitor

Your monitoring responsibilities in SaaS focus on local infrastructure:

Relay VM:

Operating system health (disk space, memory, CPU)
Service status (ailevate-tunnel.service and ailevate-proxy.service)
Outbound HTTPS connectivity to Ailevate Cloud (port 443)
DNS resolution for *.ailevate.com
Time synchronization (NTP)
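The service, DNS, and outbound connectivity checks in this list can be sketched in Python as shown here. This is an illustrative sketch under stated assumptions: it assumes the Relay VM uses systemd (implied by the `.service` unit names), and the helper names are hypothetical. The Ailevate Cloud hostname is deliberately left as a parameter because this guide only specifies the `*.ailevate.com` pattern.

```python
import socket
import subprocess

# Unit names taken from this guide.
RELAY_SERVICES = ("ailevate-tunnel.service", "ailevate-proxy.service")


def service_is_active(unit: str) -> bool:
    """Return True when `systemctl is-active` reports the unit as active."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", "--quiet", unit], check=False
        )
    except FileNotFoundError:
        return False  # systemctl is not available on this host
    return result.returncode == 0


def resolves(hostname: str) -> bool:
    """Return True when the hostname resolves to at least one address."""
    try:
        return bool(socket.getaddrinfo(hostname, None))
    except socket.gaierror:
        return False


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True when a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def relay_report(cloud_host: str) -> dict:
    """Collect the Relay VM checks into one name -> pass/fail mapping."""
    report = {f"service:{unit}": service_is_active(unit) for unit in RELAY_SERVICES}
    report[f"dns:{cloud_host}"] = resolves(cloud_host)
    report[f"tcp:{cloud_host}:443"] = tcp_reachable(cloud_host, 443)
    return report
```

Call `relay_report(...)` with the Ailevate Cloud hostname from your deployment. `tcp_reachable` can also probe the SQL Server port (1433) from the Relay VM. Operating system health and time synchronization are not covered by this sketch.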

EHR Datastore:

SQL Server availability and performance
Database connectivity on port 1433
Network reachability from Relay VM

Application-Level Symptoms:

Connection error banners in the Revenue Recovery UI
"Elasticsearch Unavailable" messages (rare in SaaS)
Data processing failures or workflow delays
Authentication and login issues

💡
Tip: In SaaS, if you see application errors that aren't related to EHR connectivity, report them to Ailevate support. These typically indicate issues with Ailevate-managed infrastructure that we'll investigate and resolve.

On-Premise: What You Monitor

On-Premises deployments require comprehensive infrastructure monitoring:

Elasticsearch Cluster:

Cluster health status (GREEN/YELLOW/RED)
Node availability and resource utilization
Disk utilization (keep below 85%, the default low disk watermark; above it, Elasticsearch stops allocating new shards to the affected node)
TLS certificate expiry dates
Shard allocation and replication status
Inbound connectivity from Ailevate Cloud (port 9200/TLS)
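As a rough illustration of the cluster health and disk checks above, the following Python sketch evaluates the responses of two standard Elasticsearch endpoints, `_cluster/health` and `_cat/allocation?format=json`. The function names are hypothetical, authentication and custom CA handling are omitted, and the 85% threshold mirrors the figure in this guide. Note that the API reports status in lowercase (`green`, `yellow`, `red`).

```python
import json
import urllib.request

DISK_LIMIT_PERCENT = 85  # threshold from this guide


def evaluate_cluster(health: dict, allocation: list) -> list:
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} unassigned shards")
    for row in allocation:
        percent = row.get("disk.percent")
        if percent is None:
            continue  # the UNASSIGNED row carries no disk figures
        if int(percent) >= DISK_LIMIT_PERCENT:
            problems.append(f"node {row.get('node')} disk at {percent}%")
    return problems


def get_json(url: str, timeout: float = 10.0):
    """Fetch and decode a JSON document (authentication omitted in this sketch)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def check_cluster(base_url: str) -> list:
    """Query the cluster at base_url and evaluate the results."""
    health = get_json(f"{base_url}/_cluster/health")
    allocation = get_json(f"{base_url}/_cat/allocation?format=json")
    return evaluate_cluster(health, allocation)
```

Keeping the evaluation separate from the HTTP calls makes the rules easy to test against saved responses before pointing the script at a live cluster.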

AI Warehouse (Tenstorrent):

Accelerator detection via lspci and tt-smi tools
vLLM API endpoint health (port 8080/TLS)
QSFP-DD interconnect link status
Power supply redundancy (dual PSU status)
Cooling system operation (airflow for LoudBox, coolant for QuietBox)
BMC (Baseboard Management Controller) accessibility
Inbound connectivity from Ailevate Cloud (port 8080/TLS)
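Two of the checks above, accelerator detection and vLLM API health, lend themselves to a small script. The Python sketch below is illustrative only and rests on two assumptions: that `lspci` labels the accelerators with the vendor name "Tenstorrent" (this depends on the host's PCI ID database), and that your vLLM server exposes the `/health` route provided by vLLM's OpenAI-compatible server. The function names are hypothetical, and `tt-smi` is not invoked here.

```python
import subprocess
import urllib.request


def count_tenstorrent_devices(lspci_output: str) -> int:
    """Count lspci lines that mention the Tenstorrent vendor name."""
    return sum(
        1 for line in lspci_output.splitlines() if "tenstorrent" in line.lower()
    )


def read_lspci() -> str:
    """Run lspci and return its text output."""
    completed = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    )
    return completed.stdout


def vllm_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True when GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False
```

A monitoring job might compare `count_tenstorrent_devices(read_lspci())` against the number of accelerators installed in your system and alert on any shortfall. If the endpoint uses a certificate from a private CA, `urlopen` needs an `ssl.SSLContext` that trusts it.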

Relay VM:

Same as SaaS (services, connectivity, DNS, NTP)

Network Performance:

Bandwidth and latency between Cloud Services, Elasticsearch, and AI Warehouse

See the On-Premise Deployment Guide for sizing recommendations and detailed network requirements.

💡
Recommended Monitoring Stack for On-Premise
Ailevate uses Prometheus (metrics collection) and Grafana (visualization) to monitor our Cloud Services layer. We recommend On-Premise customers adopt a similar stack for consistency:

Elasticsearch: Use Prometheus Elasticsearch exporter for cluster health, query latency, and disk metrics

AI Warehouse: Expose hardware metrics (accelerator status, temperature, vLLM API response times) via node exporters

Relay VM: Monitor service uptime, log error rates, and system resources

This alignment simplifies support conversations and provides industry-standard open-source tooling.
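One lightweight way to feed custom checks into Prometheus is the node exporter's textfile collector, which reads `.prom` files from a directory set with `--collector.textfile.directory`. The sketch below renders gauge samples in the Prometheus text exposition format and writes them atomically. The metric name, label name, and output path in the usage note are assumptions chosen for illustration, not names Ailevate prescribes.

```python
import os
import tempfile


def render_gauges(name: str, help_text: str, label: str, samples: dict) -> str:
    """Render one gauge metric with a single label in exposition format.

    Label values are written verbatim, so they must not contain quotes,
    backslashes, or newlines (systemd unit names satisfy this).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for label_value, value in sorted(samples.items()):
        lines.append(f'{name}{{{label}="{label_value}"}} {value}')
    return "\n".join(lines) + "\n"


def write_textfile(path: str, content: str) -> None:
    """Write the file atomically so the exporter never reads a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter must read it
    os.replace(tmp_path, path)
```

For example, a cron job on the Relay VM could call `render_gauges("relay_service_up", "1 if the unit is active", "unit", {"ailevate-tunnel.service": 1, "ailevate-proxy.service": 1})` and pass the result to `write_textfile` with a path inside your textfile directory.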

On-Premise: What Ailevate Monitors

Ailevate monitors the Cloud Services layer and can detect symptoms when your infrastructure fails:

Symptom Detection: Failed Elasticsearch queries indicate your cluster is unreachable; failed AI tasks indicate your AI Warehouse is unavailable
Support Assistance: When symptoms occur, Ailevate support helps troubleshoot, but you must provide infrastructure diagnostics (cluster health, hardware status, logs)

⚠️
Warning: In On-Premise deployments, Ailevate cannot directly monitor your Elasticsearch cluster, AI Warehouse, or Relay VM internals.
We can only see symptoms when these components fail. You're responsible for proactive monitoring to catch issues before they impact operations.

Recognizing System Health Indicators

Revenue Recovery provides several ways to monitor system health directly from the application and through automated alerts.

In-Application Indicators

The platform communicates system health through various UI messages and notifications. Understanding these indicators helps you determine whether issues are infrastructure-related (covered in this guide) or application-level errors.

EHR Connection Error Banner:

Appears when the Relay cannot reach your EHR or SQL connectivity fails. Typical causes:

Relay VM services not running
SQL Server unavailable
Network blocking port 1433

"Elasticsearch Unavailable" Message:

SaaS: Rare—indicates an Ailevate-managed infrastructure issue
On-Premise: Your Elasticsearch cluster is unreachable or degraded; check cluster health immediately

⚠️
Critical: New Elasticsearch Instance Data Loss (On-Premise Only)
If your On-Premise Elasticsearch cluster fails completely and cannot be recovered, Ailevate can guide you through onboarding a new ES instance via a dedicated workflow. However, replacing your Elasticsearch cluster results in complete data loss—all historical claims data, AI analysis results, and platform history will be permanently lost.
Prevention is critical: Implement daily automated snapshots (see Operational Best Practices) and test restore procedures quarterly. A properly maintained backup strategy is the only way to recover from catastrophic Elasticsearch failure without data loss.
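One possible way to automate daily snapshots is Elasticsearch's snapshot lifecycle management (SLM) API. The Python sketch below builds an SLM policy body and the matching `PUT _slm/policy/<id>` request. It assumes a snapshot repository is already registered; the policy name, retention values, and function names are illustrative assumptions, and the procedure in Operational Best Practices takes precedence over this sketch. Authentication is omitted.

```python
import json
import urllib.request


def daily_snapshot_policy(repository: str, keep_days: int = 30) -> dict:
    """Build an SLM policy body that snapshots all indices once a day."""
    return {
        "schedule": "0 30 1 * * ?",        # every day at 01:30 (UTC by default)
        "name": "<daily-snap-{now/d}>",    # date-math name, one snapshot per day
        "repository": repository,          # must already be registered
        "config": {"indices": ["*"], "include_global_state": True},
        "retention": {
            "expire_after": f"{keep_days}d",
            "min_count": 5,
            "max_count": 50,
        },
    }


def build_put_request(base_url: str, policy_id: str,
                      policy: dict) -> urllib.request.Request:
    """Prepare (but do not send) the PUT request that creates the policy."""
    return urllib.request.Request(
        f"{base_url}/_slm/policy/{policy_id}",
        data=json.dumps(policy).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Sending the prepared request with `urllib.request.urlopen` creates or updates the policy. A scheduled snapshot is only useful if restores are tested, as recommended above.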

AI Task Processing Failures:

Workflows show "AI processing failed" or similar errors when the AI Warehouse cannot complete a task. The response depends on your deployment model:

SaaS: Ailevate investigates the AI Warehouse issue
On-Premise: Check your AI Warehouse status, vLLM API health, and accelerator availability

Authentication Failures:

Users cannot log in, MFA doesn't work, or sessions time out unexpectedly. This typically indicates a Cloud Services issue (Ailevate-managed); report it to support.

"Fetching Data Failed" Notifications:

Toast notifications indicating backend data retrieval issues:

SaaS: May indicate Elasticsearch issue (Ailevate investigates)
On-Premise: Check your Elasticsearch cluster health and connectivity

💡
Tip: For user-facing errors like HTTP status codes (400, 401, 403, 404, 500, 503), toast notifications, or feature-specific messages, see the Common Error Messages guide. This guide focuses on infrastructure-level monitoring and diagnostics.

Email Alerts

Relay Tunnel Failure:

Ailevate detects Relay tunnel disconnection via Azure Arc and sends an alert. Check your Relay VM status, services, and network connectivity.

Elasticsearch Unreachable (On-Premise):

Ailevate detects failed Elasticsearch queries and alerts you to check your cluster. Run cluster health checks immediately.

AI Warehouse Connection Failure (On-Premise):

Ailevate detects failed AI task routing and alerts you to check your AI Warehouse. Verify vLLM API health and accelerator status.

Repeated Ingestion Failures:

Indicates ongoing EHR data sync issues. Check Relay logs, SQL connectivity, and EHR datastore health.

Service Behavior During Degradation

Understanding how the platform behaves when components fail helps you prioritize remediation:

Component Down	Effect on Operations	Priority
Relay VM	No new claims ingested; existing data accessible	High—stops data flow
Elasticsearch	UI incomplete/unreadable; searches fail; reports unavailable	Critical—platform unusable
AI Warehouse	AI-powered tasks unavailable (denial analysis, recommendations)	Medium—manual workflows still function
SQL Datastore	EHR claim ingestion halts; no new data synchronized	High—stops data flow
Cloud Services	Application unavailable or severely degraded	Critical—platform unusable
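When several components fail at once, the table above implies a remediation order. As a small illustration, the Python sketch below encodes the table and sorts failing components by priority. The data comes from the table; the `Priority` enum and `triage` function are hypothetical names.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2


# Component -> (priority, effect on operations), transcribed from the table.
DEGRADATION = {
    "Relay VM": (Priority.HIGH, "No new claims ingested; existing data accessible"),
    "Elasticsearch": (Priority.CRITICAL, "UI incomplete/unreadable; searches fail; reports unavailable"),
    "AI Warehouse": (Priority.MEDIUM, "AI-powered tasks unavailable"),
    "SQL Datastore": (Priority.HIGH, "EHR claim ingestion halts; no new data synchronized"),
    "Cloud Services": (Priority.CRITICAL, "Application unavailable or severely degraded"),
}


def triage(down: list) -> list:
    """Return (component, priority, effect) tuples, most urgent first.

    sorted() is stable, so components with equal priority keep input order.
    """
    rows = [(name, *DEGRADATION[name]) for name in down]
    return sorted(rows, key=lambda row: row[1])
```

For instance, `triage(["AI Warehouse", "Relay VM", "Elasticsearch"])` lists Elasticsearch first, then the Relay VM, then the AI Warehouse.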

📘
Note: Revenue Recovery is designed to maintain read access to existing data even when ingestion or AI processing fails. Users can continue working with claims already in the system while you resolve infrastructure issues.

Monitoring Responsibility Quick Reference

Use this table to quickly identify who monitors each component in your deployment:

Component	SaaS	On-Premises	Monitoring Focus
Cloud Services	Ailevate	Ailevate	API health, authentication, workflows
Relay VM Services	Customer	Customer	Service status, logs, system resources
Relay Connectivity	Customer + Ailevate	Customer + Ailevate	Outbound 443, SQL 1433; Ailevate monitors tunnel status
Elasticsearch	Ailevate	Customer	Cluster health, disk, nodes, TLS, inbound connectivity
AI Warehouse	Ailevate	Customer	Accelerators, vLLM API, cooling, power, inbound connectivity
EHR Datastore	Customer	Customer	SQL Server health, connectivity, performance
Network Performance	Shared	Customer	Ailevate monitors Cloud; Customer monitors infrastructure
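If you maintain monitoring configuration as code, the quick reference can be captured as data. The Python sketch below transcribes the table and lists the components a customer must monitor for a given deployment model. The structure and function name are illustrative assumptions; "Shared" is modeled as both parties.

```python
AILEVATE = "Ailevate"
CUSTOMER = "Customer"

# Component -> deployment model -> set of responsible parties (from the table).
RESPONSIBILITY = {
    "Cloud Services": {"SaaS": {AILEVATE}, "On-Premises": {AILEVATE}},
    "Relay VM Services": {"SaaS": {CUSTOMER}, "On-Premises": {CUSTOMER}},
    "Relay Connectivity": {
        "SaaS": {CUSTOMER, AILEVATE},
        "On-Premises": {CUSTOMER, AILEVATE},
    },
    "Elasticsearch": {"SaaS": {AILEVATE}, "On-Premises": {CUSTOMER}},
    "AI Warehouse": {"SaaS": {AILEVATE}, "On-Premises": {CUSTOMER}},
    "EHR Datastore": {"SaaS": {CUSTOMER}, "On-Premises": {CUSTOMER}},
    "Network Performance": {
        "SaaS": {CUSTOMER, AILEVATE},  # "Shared" in the table
        "On-Premises": {CUSTOMER},
    },
}


def customer_monitored(deployment: str) -> list:
    """List components where the customer holds monitoring responsibility."""
    return [
        component
        for component, owners in RESPONSIBILITY.items()
        if CUSTOMER in owners[deployment]
    ]
```

A check such as `customer_monitored("On-Premises")` can then drive which alert rules a deployment script enables.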

💡
Tip:

SaaS: Focus monitoring on Relay VM and EHR connectivity

On-Premise: Comprehensive monitoring of Relay VM, Elasticsearch, AI Warehouse, and EHR connectivity

When to Contact Ailevate Support

Immediate Escalation Required

Contact [email protected] immediately (mark subject "CRITICAL") for:

Elasticsearch cluster status RED (data unavailable, On-Premise only)
Revenue Recovery application completely unavailable for all users
Suspected security breach or unauthorized access
Data integrity concerns (missing claims, corrupted data)
AI Warehouse complete failure (all accelerators offline, On-Premise only)

Note: Ailevate is also notified automatically of some critical conditions, such as Relay tunnel disconnections and failed Elasticsearch queries or AI task routing (see Email Alerts).

Standard Support Requests

For SaaS Deployments:

Contact [email protected] for:

Persistent Relay connectivity issues after completing troubleshooting steps
Application errors or performance degradation
Capacity planning guidance
Questions about platform features or configuration

For On-Premise Deployments:

Contact [email protected] for:

Persistent Relay connectivity issues after completing troubleshooting steps
Elasticsearch performance degradation or cluster issues (you manage the infrastructure; we provide guidance)
AI Warehouse hardware failures or performance issues (you manage the hardware; we provide guidance)
Application errors not related to your infrastructure (Cloud Services issues)
Capacity planning guidance

⚠️
Important: In On-Premise deployments, you are responsible for managing your Elasticsearch cluster and AI Warehouse infrastructure.
Ailevate provides troubleshooting guidance and software support, but infrastructure operations, monitoring, and maintenance are your responsibility.

Related Resources

For more detailed information on specific topics:

Component Diagnostics & Troubleshooting – Hands-on diagnostic commands and procedures for all platform components
Operational Best Practices – Maintenance schedules, security practices, capacity planning, and backup/DR guidance
Common Error Messages – User-facing application errors and resolution steps
SaaS Deployment Guide – Architecture, network requirements, and SaaS-specific setup
On-Premises Deployment Guide – Complete On-Premises architecture, sizing, and connectivity requirements
Relay Service Deployment Guide – Detailed Relay installation, configuration, and operational runbook
Tenstorrent LoudBox Guide – Air-cooled AI Warehouse hardware setup and maintenance
Tenstorrent QuietBox Guide – Liquid-cooled AI Warehouse hardware setup and maintenance

Need Help? If you have questions about monitoring your Revenue Recovery deployment or need assistance interpreting health indicators, contact Ailevate support at [email protected].
