System Monitoring & Health

Understanding monitoring responsibilities and health indicators for SaaS and On-Premises deployments

Keeping your Revenue Recovery platform healthy starts with understanding what to monitor and who's responsible for monitoring it. This guide helps you recognize the difference between SaaS and On-Premises monitoring responsibilities, identify health indicators, and know when to take action. Whether you're troubleshooting an issue or planning your monitoring strategy, this overview gives you the foundation you need.

For hands-on diagnostic commands and procedures, see the Component Diagnostics & Health Checks guide. For operational best practices and maintenance schedules, see Operational Best Practices.


Understanding Your Platform Architecture

Revenue Recovery operates on a four-layer architecture where monitoring responsibilities vary by deployment model. For complete architecture details, see the Deployment Architecture guide. For a summary of components and requirements, see System Requirements.

The four layers are:

  1. Cloud Services Layer – Core application platform (Ailevate-managed in all deployments)
  2. Relay Service Layer – Lightweight VM for EHR connectivity (customer-managed in all deployments)
  3. Database Storage Layer – Elasticsearch datastore (Ailevate-managed in SaaS; customer-managed in On-Premises)
  4. AI Compute Layer – Tenstorrent AI hardware (Ailevate-managed in SaaS; customer-managed in On-Premises)

Your deployment model determines who monitors each layer:

Deployment Model | Ailevate Monitors | Customer Monitors
SaaS | Cloud Services, Elasticsearch, AI Warehouse | Relay VM, EHR connectivity
On-Premises | Cloud Services only | Relay VM, Elasticsearch, AI Warehouse, EHR connectivity

Platform Component Visibility by Deployment Model

Understanding what you can and cannot monitor directly is crucial for effective system administration. This table shows your visibility into each platform layer:

Platform Layer | SaaS Visibility | On-Premise Visibility | Key Differences
Cloud Services Layer | View UI symptoms only (errors, latency, login issues) | Same as SaaS | Ailevate monitors all Cloud Services in both deployment models
Relay Service Layer | Monitor Relay VM locally; Ailevate monitors tunnel status via Azure Arc | Same as SaaS | Identical monitoring responsibilities: you manage the Relay VM in both models
Elastic Datastore | View UI symptoms only; Ailevate monitors fully | You monitor directly (cluster health, nodes, disk, TLS) | On-Premise requires you to manage the Elasticsearch cluster
AI Warehouse | View UI symptoms only; Ailevate monitors fully | You monitor directly (accelerators, vLLM API, cooling, power) | On-Premise requires you to manage Tenstorrent hardware and software

What This Means for You

If you're on SaaS:

  • Focus your monitoring efforts on the Relay VM and your EHR connectivity
  • Ailevate monitors and maintains Cloud Services, Elasticsearch, and the AI infrastructure
  • Report application-level symptoms to Ailevate support when they occur

If you're running On-Premises:

  • You're responsible for comprehensive infrastructure monitoring (Elastic, AI Warehouse, Relay)
  • Ailevate can detect symptoms when your infrastructure fails but cannot directly monitor your systems
⚠️

Warning: On-Premise deployments require inbound HTTPS connectivity from Ailevate Cloud to your Elasticsearch cluster (port 9200) and AI Warehouse (port 8080) so the Cloud Services layer can access your infrastructure.

You must monitor these inbound connections as part of your responsibilities. See the On-Premise Deployment Guide for complete network requirements.

For Relay-specific monitoring details, see the Relay Service Deployment Guide.


Monitoring Responsibilities by Deployment Model

SaaS: What Ailevate Monitors

In a SaaS deployment, Ailevate operates and monitors most of your platform infrastructure:

  • Cloud Services: API uptime, authentication services, workflow execution, error rates
  • Elasticsearch: Cluster health, disk capacity, shard allocation, indexing performance
  • AI Warehouse: Accelerator availability, model inference performance, thermal and power status
  • Data Pipeline: Claim ingestion rates, AI task execution, data validation
  • Relay Connectivity: TLS tunnel status via Azure Arc (whether the Relay can reach Ailevate Cloud)

SaaS: What You Monitor

Your monitoring responsibilities in SaaS focus on local infrastructure:

Relay VM:

  • Operating system health (disk space, memory, CPU)
  • Service status (ailevate-tunnel.service and ailevate-proxy.service)
  • Outbound HTTPS connectivity to Ailevate Cloud (port 443)
  • DNS resolution for *.ailevate.com
  • Time synchronization (NTP)
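The Relay VM checks above can be scripted. The following is a minimal sketch, not an Ailevate-supplied tool: it assumes the Relay VM runs systemd, uses the two unit names listed in this guide, and treats 85% disk usage as an example threshold. The function names and the hostname you pass in are illustrative; NTP and port 443 checks are left out for brevity.

```python
import shutil
import socket
import subprocess

# Unit names come from this guide; everything else here is an assumption.
RELAY_SERVICES = ("ailevate-tunnel.service", "ailevate-proxy.service")


def disk_used_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def service_is_active(unit, run=subprocess.run):
    """Return True if `systemctl is-active <unit>` prints "active"."""
    try:
        result = run(
            ["systemctl", "is-active", unit],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        # systemctl is not installed, so the unit cannot be confirmed as running.
        return False
    return result.stdout.strip() == "active"


def dns_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(resolver(hostname, 443)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def relay_report(hostname, disk_limit=0.85, path="/"):
    """Return a dict of check name -> bool, where True means healthy."""
    report = {f"service:{unit}": service_is_active(unit) for unit in RELAY_SERVICES}
    report["disk"] = disk_used_fraction(path) < disk_limit
    report["dns"] = dns_resolves(hostname)
    return report
```

Calling `relay_report("<your Ailevate Cloud hostname>")` from a scheduled job and alerting on any `False` value is one way to use it. The `run` and `resolver` parameters exist only so the checks can be exercised without a live system.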

EHR Datastore:

  • SQL Server availability and performance
  • Database connectivity on port 1433
  • Network reachability from Relay VM
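Network reachability from the Relay VM to the EHR datastore can be probed with a plain TCP connection. This is a sketch under the assumption that a successful connect to port 1433 is a useful first signal; it only shows that the port accepts connections, not that SQL Server accepts logins or answers queries. The function name is illustrative.

```python
import socket
import time


def tcp_check(host, port, timeout=3.0):
    """Attempt a TCP connection and return (reachable, seconds_taken)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        reachable = False
    return reachable, time.monotonic() - start
```

Run from the Relay VM, `tcp_check("<your SQL Server host>", 1433)` covers the EHR side, and the same function with port 443 covers outbound connectivity to Ailevate Cloud.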

Application-Level Symptoms:

  • Connection error banners in the Revenue Recovery UI
  • "Elasticsearch Unavailable" messages (rare in SaaS)
  • Data processing failures or workflow delays
  • Authentication and login issues
💡

Tip: In SaaS, if you see application errors that aren't related to EHR connectivity, report them to Ailevate support. These typically indicate issues with Ailevate-managed infrastructure that we'll investigate and resolve.

On-Premise: What You Monitor

On-Premises deployments require comprehensive infrastructure monitoring:

Elasticsearch Cluster:

  • Cluster health status (GREEN/YELLOW/RED)
  • Node availability and resource utilization
  • Disk utilization (keep each node below 85%, Elasticsearch's default low disk watermark, above which it stops allocating new shards to that node)
  • TLS certificate expiry dates
  • Shard allocation and replication status
  • Inbound connectivity from Ailevate Cloud (port 9200/TLS)
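Cluster health status can be read from Elasticsearch's `_cluster/health` API. The sketch below assumes API-key authentication and an optional custom CA file; your cluster may use a different authentication scheme. The function names and the mapping from status to severity are illustrative choices, not an Ailevate convention.

```python
import json
import ssl
import urllib.request


def fetch_cluster_health(base_url, api_key=None, ca_file=None, timeout=5.0):
    """GET <base_url>/_cluster/health and return the parsed JSON body."""
    request = urllib.request.Request(f"{base_url.rstrip('/')}/_cluster/health")
    if api_key:
        # Assumes API-key auth; adjust if your cluster uses another scheme.
        request.add_header("Authorization", f"ApiKey {api_key}")
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return json.load(response)


def summarize_health(health):
    """Map the cluster status to a severity label (the mapping is an assumption)."""
    status = str(health.get("status", "unknown")).lower()
    if status == "green":
        severity = "ok"          # all primary and replica shards assigned
    elif status == "yellow":
        severity = "warning"     # primaries assigned, some replicas are not
    else:
        severity = "critical"    # red or unknown: treat as data unavailable
    return {
        "status": status,
        "severity": severity,
        "nodes": health.get("number_of_nodes"),
        "unassigned_shards": int(health.get("unassigned_shards", 0)),
    }
```

A typical call would be `summarize_health(fetch_cluster_health("https://<your-es-host>:9200", api_key="<key>"))`. Treating an unknown status as critical is deliberate: a response you cannot interpret is safer to investigate than to ignore.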

AI Warehouse (Tenstorrent):

  • Accelerator detection via lspci and tt-smi tools
  • vLLM API endpoint health (port 8080/TLS)
  • QSFP-DD interconnect link status
  • Power supply redundancy (dual PSU status)
  • Cooling system operation (airflow for LoudBox, coolant for QuietBox)
  • BMC (Baseboard Management Controller) accessibility
  • Inbound connectivity from Ailevate Cloud (port 8080/TLS)
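The vLLM API endpoint health item can be checked with a simple HTTP probe. This sketch assumes your AI Warehouse exposes the `/health` route of vLLM's OpenAI-compatible server on port 8080; if a reverse proxy sits in front of it, the path may differ. The function name and the `opener` parameter (there only to allow testing without a live endpoint) are illustrative.

```python
import ssl
import time
import urllib.error
import urllib.request


def probe_http(url, timeout=5.0, ca_file=None, opener=urllib.request.urlopen):
    """GET url and return (healthy, status_code_or_None, seconds_taken)."""
    context = ssl.create_default_context(cafile=ca_file) if url.startswith("https") else None
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout, context=context) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with a 4xx/5xx status.
        status = error.code
    except OSError:
        # URLError, timeouts, and TLS failures are all OSError subclasses.
        status = None
    elapsed = time.monotonic() - start
    return status == 200, status, elapsed
```

For example, `probe_http("https://<your-ai-warehouse-host>:8080/health")` returns the response time alongside the status, which is also useful as a latency metric. A passing probe does not confirm that the accelerators are detected, so keep the `lspci` and `tt-smi` checks as a separate step.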

Relay VM:

  • Same as SaaS (services, connectivity, DNS, NTP)

Network Performance:

  • Bandwidth and latency between Cloud Services, Elasticsearch, and AI Warehouse

See the On-Premise Deployment Guide for sizing recommendations and detailed network requirements.

💡

Recommended Monitoring Stack for On-Premise

Ailevate uses Prometheus (metrics collection) and Grafana (visualization) to monitor our Cloud Services layer. We recommend On-Premise customers adopt a similar stack for consistency:

  • Elasticsearch: Use Prometheus Elasticsearch exporter for cluster health, query latency, and disk metrics
  • AI Warehouse: Expose hardware metrics (accelerator status, temperature, vLLM API response times) via node exporters
  • Relay VM: Monitor service uptime, log error rates, and system resources

This alignment simplifies support conversations and provides industry-standard open-source tooling.
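If you adopt Prometheus, one low-effort way to publish custom checks is the node exporter's textfile collector, which reads `.prom` files from a directory you configure. The sketch below renders gauges in the Prometheus text exposition format using only the standard library. The metric name `relay_service_up`, its `unit` label, and the output path are hypothetical examples, and samples for the same metric are assumed to be passed next to each other.

```python
import os
import tempfile


def _escape(value):
    """Escape a label value per the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauges(gauges):
    """Render (name, help_text, labels_dict, value) tuples as exposition text."""
    lines = []
    seen = set()
    for name, help_text, labels, value in gauges:
        if name not in seen:
            # HELP and TYPE are emitted once, before the metric's first sample.
            lines.append(f"# HELP {name} {help_text}")
            lines.append(f"# TYPE {name} gauge")
            seen.add(name)
        label_text = ",".join(f'{key}="{_escape(val)}"' for key, val in sorted(labels.items()))
        suffix = f"{{{label_text}}}" if label_text else ""
        lines.append(f"{name}{suffix} {float(value)}")
    return "\n".join(lines) + "\n"


def write_prom_file(text, path):
    """Write atomically so the collector never reads a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, path)
```

A scheduled job could call `render_gauges` with the results of your Relay, Elasticsearch, and AI Warehouse checks and pass the text to `write_prom_file`, pointing at whatever directory your node exporter's `--collector.textfile.directory` flag names.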

On-Premise: What Ailevate Monitors

Ailevate monitors the Cloud Services layer and can detect symptoms when your infrastructure fails:

  • Symptom Detection: Failed Elasticsearch queries indicate your cluster is unreachable; failed AI tasks indicate your AI Warehouse is unavailable
  • Support Assistance: When symptoms occur, Ailevate support helps troubleshoot, but you must provide infrastructure diagnostics (cluster health, hardware status, logs)
⚠️

Warning: In On-Premise deployments, Ailevate cannot directly monitor your Elasticsearch cluster, AI Warehouse, or Relay VM internals.

We can only see symptoms when these components fail. You're responsible for proactive monitoring to catch issues before they impact operations.


Recognizing System Health Indicators

Revenue Recovery provides several ways to monitor system health directly from the application and through automated alerts.

In-Application Indicators

The platform communicates system health through various UI messages and notifications. Understanding these indicators helps you determine whether issues are infrastructure-related (covered in this guide) or application-level errors.

EHR Connection Error Banner:

Appears when the Relay cannot reach your EHR or SQL connectivity fails. This indicates a problem with:

  • Relay VM services not running
  • SQL Server unavailable
  • Network blocking port 1433

"Elasticsearch Unavailable" Message:

  • SaaS: Rare—indicates an Ailevate-managed infrastructure issue
  • On-Premise: Your Elasticsearch cluster is unreachable or degraded; check cluster health immediately
⚠️

Critical: New Elasticsearch Instance Data Loss (On-Premise Only)

If your On-Premise Elasticsearch cluster fails completely and cannot be recovered, Ailevate can guide you through onboarding a new ES instance via a dedicated workflow. However, replacing your Elasticsearch cluster results in complete data loss—all historical claims data, AI analysis results, and platform history will be permanently lost.

Prevention is critical: Implement daily automated snapshots (see Operational Best Practices) and test restore procedures quarterly. A properly maintained backup strategy is the only way to recover from catastrophic Elasticsearch failure without data loss.
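A snapshot schedule is only useful if you notice when it stops. The sketch below checks the age of the newest successful snapshot, given the JSON returned by Elasticsearch's get-snapshot API (`GET /_snapshot/<repository>/_all`). The 26-hour threshold is an assumed allowance for a daily schedule, and the function names are illustrative.

```python
import time


def latest_successful_snapshot(snapshots_response):
    """Return the most recently finished SUCCESS snapshot, or None."""
    successes = [
        snap for snap in snapshots_response.get("snapshots", [])
        if snap.get("state") == "SUCCESS"
    ]
    return max(successes, key=lambda snap: snap["end_time_in_millis"], default=None)


def snapshot_is_stale(snapshots_response, max_age_hours=26.0, now=None):
    """Return True if no successful snapshot finished within max_age_hours."""
    now = time.time() if now is None else now
    latest = latest_successful_snapshot(snapshots_response)
    if latest is None:
        # No successful snapshot at all is the worst case.
        return True
    age_hours = (now - latest["end_time_in_millis"] / 1000.0) / 3600.0
    return age_hours > max_age_hours
```

This confirms only that snapshots are being taken. It does not replace the quarterly restore test, which is the only way to know a snapshot is actually usable.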

AI Task Processing Failures:

Workflows show "AI processing failed" or similar errors when the AI Warehouse is unavailable:

  • SaaS: Ailevate investigates the AI Warehouse issue
  • On-Premise: Check your AI Warehouse status, vLLM API health, and accelerator availability

Authentication Failures:

Users cannot log in, MFA doesn't work, or sessions time out unexpectedly. This typically indicates a Cloud Services issue (Ailevate-managed); report it to support.

"Fetching Data Failed" Notifications:

Toast notifications indicating backend data retrieval issues:

  • SaaS: May indicate an Elasticsearch issue (Ailevate investigates)
  • On-Premise: Check your Elasticsearch cluster health and connectivity
💡

Tip: For user-facing errors like HTTP status codes (400, 401, 403, 404, 500, 503), toast notifications, or feature-specific messages, see the Common Error Messages guide. This guide focuses on infrastructure-level monitoring and diagnostics.

Email Alerts

Relay Tunnel Failure:

  • Ailevate detects Relay tunnel disconnection via Azure Arc and sends an alert. Check your Relay VM status, services, and network connectivity.

Elasticsearch Unreachable (On-Premise):

  • Ailevate detects failed Elasticsearch queries and alerts you to check your cluster. Run cluster health checks immediately.

AI Warehouse Connection Failure (On-Premise):

  • Ailevate detects failed AI task routing and alerts you to check your AI Warehouse. Verify vLLM API health and accelerator status.

Repeated Ingestion Failures:

  • Ongoing EHR data sync issues. Check Relay logs, SQL connectivity, and EHR datastore health.

Service Behavior During Degradation

Understanding how the platform behaves when components fail helps you prioritize remediation:

Component Down | Effect on Operations | Priority
Relay VM | No new claims ingested; existing data accessible | High: stops data flow
Elasticsearch | UI incomplete/unreadable; searches fail; reports unavailable | Critical: platform unusable
AI Warehouse | AI-powered tasks unavailable (denial analysis, recommendations) | Medium: manual workflows still function
SQL Datastore | EHR claim ingestion halts; no new data synchronized | High: stops data flow
Cloud Services | Application unavailable or severely degraded | Critical: platform unusable
📘

Note: Revenue Recovery is designed to maintain read access to existing data even when ingestion or AI processing fails. Users can continue working with claims already in the system while you resolve infrastructure issues.
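When several alerts fire at once, the priorities in the degradation table can be encoded so that your alerting or runbook tooling orders them consistently. The following is a minimal sketch that copies the table into a lookup; the names `DEGRADATION`, `PRIORITY_RANK`, and `triage` are illustrative.

```python
# Lower rank means "work on this first".
PRIORITY_RANK = {"Critical": 0, "High": 1, "Medium": 2}

# Component -> (priority, effect on operations), copied from the table above.
DEGRADATION = {
    "Relay VM": ("High", "No new claims ingested; existing data accessible"),
    "Elasticsearch": ("Critical", "UI incomplete/unreadable; searches fail; reports unavailable"),
    "AI Warehouse": ("Medium", "AI-powered tasks unavailable"),
    "SQL Datastore": ("High", "EHR claim ingestion halts; no new data synchronized"),
    "Cloud Services": ("Critical", "Application unavailable or severely degraded"),
}


def triage(components_down):
    """Order failed components by priority; raises KeyError for unknown names.

    sorted() is stable, so components with equal priority keep their input order.
    """
    return sorted(components_down, key=lambda name: PRIORITY_RANK[DEGRADATION[name][0]])
```

For example, if the AI Warehouse, the Relay VM, and Elasticsearch are all reported down, `triage` places Elasticsearch first, then the Relay VM, then the AI Warehouse.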


Monitoring Responsibility Quick Reference

Use this table to quickly identify who monitors each component in your deployment:

Component | SaaS | On-Premises | Monitoring Focus
Cloud Services | Ailevate | Ailevate | API health, authentication, workflows
Relay VM Services | Customer | Customer | Service status, logs, system resources
Relay Connectivity | Customer + Ailevate | Customer + Ailevate | Outbound 443, SQL 1433; Ailevate monitors tunnel status
Elasticsearch | Ailevate | Customer | Cluster health, disk, nodes, TLS, inbound connectivity
AI Warehouse | Ailevate | Customer | Accelerators, vLLM API, cooling, power, inbound connectivity
EHR Datastore | Customer | Customer | SQL Server health, connectivity, performance
Network Performance | Shared | Customer | Ailevate monitors Cloud; Customer monitors infrastructure
💡

Tip:

  • SaaS: Focus monitoring on Relay VM and EHR connectivity
  • On-Premise: Comprehensive monitoring of Relay VM, Elasticsearch, AI Warehouse, and EHR connectivity

When to Contact Ailevate Support

Immediate Escalation Required

Contact [email protected] immediately (mark subject "CRITICAL") for:

  • Elasticsearch cluster status RED (data unavailable, On-Premise only)
  • Revenue Recovery application completely unavailable for all users
  • Suspected security breach or unauthorized access
  • Data integrity concerns (missing claims, corrupted data)
  • AI Warehouse complete failure (all accelerators offline, On-Premise only)

Note: Ailevate is also notified automatically of some critical conditions, such as failed Elasticsearch queries and failed AI task routing (see Email Alerts).

Standard Support Requests

For SaaS Deployments:

Contact [email protected] for:

  • Persistent Relay connectivity issues after completing troubleshooting steps
  • Application errors or performance degradation
  • Capacity planning guidance
  • Questions about platform features or configuration

For On-Premise Deployments:

Contact [email protected] for:

  • Persistent Relay connectivity issues after completing troubleshooting steps
  • Elasticsearch performance degradation or cluster issues (you manage the infrastructure; we provide guidance)
  • AI Warehouse hardware failures or performance issues (you manage the hardware; we provide guidance)
  • Application errors not related to your infrastructure (Cloud Services issues)
  • Capacity planning guidance
⚠️

Important: In On-Premise deployments, you are responsible for managing your Elasticsearch cluster and AI Warehouse infrastructure.

Ailevate provides troubleshooting guidance and software support, but infrastructure operations, monitoring, and maintenance are your responsibility.


Related Resources

For more detailed information on specific topics:

Need Help? If you have questions about monitoring your Revenue Recovery deployment or need assistance interpreting health indicators, contact Ailevate support at [email protected].