Component Diagnostics & Troubleshooting

Hands-on diagnostic commands and troubleshooting procedures for investigating issues

When you need to troubleshoot an issue or investigate a problem with your Revenue Recovery platform, this guide provides the specific diagnostic commands and procedures you need. Whether you're investigating a Relay connectivity problem, checking Elasticsearch cluster health, or diagnosing AI Warehouse hardware, you'll find step-by-step troubleshooting procedures here.

For understanding monitoring responsibilities and health indicators, see System Monitoring & Health. For operational best practices and maintenance schedules, see Operational Best Practices.


Relay Service Diagnostics

The Relay VM requires monitoring in both SaaS and On-Premises deployments. Use these diagnostic procedures when investigating connectivity issues or performing routine health checks.

Service Status Checks

Check Service Health:

sudo systemctl status ailevate-tunnel.service
sudo systemctl status ailevate-proxy.service

Both services should show active (running). If inactive, restart:

sudo systemctl restart ailevate-tunnel.service
sudo systemctl restart ailevate-proxy.service

Quick Health Check (All-in-One):

systemctl is-active ailevate-tunnel.service ailevate-proxy.service && curl -sk https://<relay-endpoint>.ailevate.com > /dev/null && echo "✓ Relay healthy" || echo "✗ Relay issue detected"

Connectivity Tests

Test Outbound HTTPS (Port 443):

curl -vk https://<your-relay-endpoint>.ailevate.com

The TLS handshake should complete successfully. Look for a line beginning with SSL connection using in the output.

⚠️

Warning: If you see Connection refused or timeout, check:

  • Firewall allows outbound port 443 to *.ailevate.com
  • DNS can resolve Ailevate domains
  • No proxy server blocking connections

Test SQL Datastore Reachability (Port 1433):

nc -vz <sql-hostname> 1433

Should report Connection to <sql-hostname> 1433 port [tcp/*] succeeded!

Test DNS Resolution:

# External DNS (Ailevate Cloud and Azure)
nslookup management.azure.com
nslookup <your-relay-endpoint>.ailevate.com

# Internal DNS (EHR SQL Server)
nslookup <sql-hostname>

All should resolve to valid IP addresses. If failing, check /etc/resolv.conf for correct nameservers.
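
If resolution fails, a quick way to confirm which nameservers the VM is using and to query one directly (hostname and nameserver IP below are placeholders):

# Show configured nameservers
cat /etc/resolv.conf

# Query a specific nameserver directly to isolate the failure
nslookup <your-relay-endpoint>.ailevate.com <nameserver-ip>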

Verify Time Synchronization:

timedatectl status

Must show System clock synchronized: yes. TLS requires accurate time (drift >5 minutes causes failures).

Enable NTP if disabled:

sudo timedatectl set-ntp true
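
To see how far the clock has actually drifted, one hedged option is below; it assumes the VM uses systemd-timesyncd (chrony-based images expose the same data via chronyc tracking):

# Show NTP server, poll interval, and measured offset
timedatectl timesync-status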

Log Analysis

View Real-Time Logs:

# Tunnel service
sudo journalctl -u ailevate-tunnel.service -f

# Proxy service
sudo journalctl -u ailevate-proxy.service -f

Export Logs for Support:

# Last 24 hours
sudo journalctl -u ailevate-tunnel.service --since "24 hours ago" > tunnel-logs.txt

# Specific time range
sudo journalctl -u ailevate-tunnel.service --since "2025-11-19 09:00" --until "2025-11-19 10:00" > tunnel-logs-morning.txt

Common Log Patterns:

Pattern | Meaning | Action
TLS handshake failed | Time sync or certificate issue | Check timedatectl status, verify system time is accurate
Connection refused | Firewall blocking outbound 443 | Verify firewall rules, test with curl -vk
SQL connection error | Database unreachable or bad credentials | Test with nc -vz, verify SQL Server is running
DNS resolution failed | DNS misconfiguration | Check /etc/resolv.conf, test with nslookup
Tunnel connected successfully | Normal operation | No action needed
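
To scan recent tunnel logs for these patterns in one pass, something like the following works (exact message text may differ slightly from the table above):

# Search the last hour of tunnel logs for known failure signatures
sudo journalctl -u ailevate-tunnel.service --since "1 hour ago" | grep -Ei "handshake failed|connection refused|sql connection|resolution failed"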

Log Retention:

  • Relay logs managed by systemd journald (default 7-day retention)
  • For long-term retention, configure log forwarding to your centralized logging system
  • Consider increasing journal size: edit /etc/systemd/journald.conf and set SystemMaxUse=500M
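
A minimal sketch of that journald change using a drop-in file (the file name is arbitrary; adjust values to your retention needs):

# Cap journal size and optionally enforce a 30-day retention window
sudo mkdir -p /etc/systemd/journald.conf.d
printf '[Journal]\nSystemMaxUse=500M\nMaxRetentionSec=30day\n' | sudo tee /etc/systemd/journald.conf.d/retention.conf

# Apply the change
sudo systemctl restart systemd-journald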

SQL Server / EHR Logs:

When diagnosing SQL connectivity issues, correlate Relay logs with SQL Server error logs:

Windows SQL Server:

  • Default location: C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Log\ERRORLOG
  • Use SQL Server Management Studio: Management → SQL Server Logs

Linux SQL Server:

  • Default location: /var/opt/mssql/log/errorlog

What to look for:

  • Failed login attempts (check SQL credentials)
  • Connection timeouts (check network/firewall)
  • Login failed for user errors (verify Relay SQL account permissions)

System Resource Checks

Disk Space:

df -h

Ensure root partition has ≥20% free space. Low disk space can cause service failures.

CPU and Memory:

top
# or for a snapshot
free -h && top -bn1 | head -20

Relay VM should typically use <50% CPU and <75% memory during normal operation.

For complete Relay operational procedures, see the Relay Service Deployment Guide.


Elasticsearch Diagnostics (On-Premise Only)

📘

Note: If you're on SaaS, Ailevate manages Elasticsearch. Skip this section.

On-Premise deployments require regular Elasticsearch health checks and troubleshooting capabilities.

Cluster Health Checks

Check Overall Cluster Status:

curl -XGET "https://<elastic-host>:9200/_cluster/health?pretty" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Status Interpretation:

Status | Meaning | Action Required
🟢 GREEN | All primary and replica shards allocated | Normal operation
🟡 YELLOW | All primaries allocated, some replicas missing | Investigate replica allocation; may be acceptable temporarily
🔴 RED | One or more primary shards unallocated | URGENT: data unavailable; investigate immediately
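
When triaging, it can help to pull just the fields that matter from the health response (a sketch; assumes jq is installed, certificate paths as above):

curl -s -XGET "https://<elastic-host>:9200/_cluster/health" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key \
  | jq '{status, number_of_nodes, unassigned_shards, active_shards_percent_as_number}'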

Verify Node Availability:

curl -XGET "https://<elastic-host>:9200/_cat/nodes?v" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Look for:

  • All expected nodes present
  • Heap usage <75%
  • CPU usage patterns
  • Master node marked with *

Disk Space Monitoring

Check Disk Allocation:

curl -XGET "https://<elastic-host>:9200/_cat/allocation?v" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

⚠️

Warning: Critical Thresholds
Elasticsearch enforces disk watermarks:

  • 85% (low watermark) → No new shards allocated to the node
  • 90% (high watermark) → Shards relocated away from the node
  • 95% (flood stage) → Indices with shards on the node become read-only; cluster stability at risk

⚠️ Action: Keep disk utilization below 85%. Free space immediately if approaching limits.

Quick Disk Check Per Node:

# On each Elasticsearch node
df -h /var/lib/elasticsearch
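
If space must be freed, listing indices by size helps identify candidates for deletion or archival (a sketch; confirm with Ailevate before deleting any index):

# List indices sorted largest first
curl -XGET "https://<elastic-host>:9200/_cat/indices?v&s=store.size:desc" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key | head -20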

Shard Health

Check for Unassigned Shards:

curl -XGET "https://<elastic-host>:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key | grep UNASSIGNED

Common Unassigned Reasons:

  • ALLOCATION_FAILED → Disk watermarks exceeded or allocation rules preventing assignment
  • NODE_LEFT → Node failed and shards not yet reallocated
  • REPLICA_ADDED → New replica being allocated (normal)
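
For a detailed explanation of why a specific shard is unassigned, the cluster allocation explain API is useful (with no request body it reports on the first unassigned shard it finds):

curl -XGET "https://<elastic-host>:9200/_cluster/allocation/explain?pretty" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key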

Force Shard Reallocation (if needed):

curl -XPOST "https://<elastic-host>:9200/_cluster/reroute?retry_failed=true" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

TLS Certificate Validation

Check Certificate Expiry:

openssl s_client -connect <elastic-host>:9200 -showcerts 2>/dev/null | openssl x509 -noout -dates

Look for notAfter date. Renew certificates 30+ days before expiration.

Test Client Certificate Authentication:

curl -XGET "https://<elastic-host>:9200/" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Should return cluster information JSON. Authentication failures indicate certificate issues.

Performance Diagnostics

Check Indexing Performance:

curl -XGET "https://<elastic-host>:9200/_stats/indexing?pretty" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Check Query Performance:

curl -XGET "https://<elastic-host>:9200/_stats/search?pretty" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Monitor query_time_in_millis and query_total for trends.
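
To track those two counters without scanning the full JSON, a hedged extraction (assumes jq is installed):

curl -s -XGET "https://<elastic-host>:9200/_stats/search" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key \
  | jq '._all.total.search | {query_total, query_time_in_millis}'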

Log Locations and Analysis

Elasticsearch Log Files (On-Premise installations):

# Main cluster log (default location)
/var/log/elasticsearch/<cluster-name>.log

# Slow query log (if enabled)
/var/log/elasticsearch/<cluster-name>_index_search_slowlog.log
/var/log/elasticsearch/<cluster-name>_index_indexing_slowlog.log

# Deprecation warnings
/var/log/elasticsearch/<cluster-name>_deprecation.log

Key Log Patterns to Monitor:

  • ERROR entries indicate failures requiring investigation
  • OutOfMemoryError → JVM heap exhausted; review heap configuration
  • ClusterBlockException → Disk watermarks exceeded; free disk space
  • NoNodeAvailableException → Cluster connectivity issues
  • Frequent WARN about shard allocation → Capacity or configuration issue
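
A quick scan of the main cluster log for the patterns above:

# Show the most recent matches for common failure signatures
grep -E "ERROR|OutOfMemoryError|ClusterBlockException|NoNodeAvailableException" /var/log/elasticsearch/<cluster-name>.log | tail -50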

Log Retention:

  • Default: Elasticsearch rotates logs daily, keeps 7 days
  • HIPAA compliance typically requires 30-90 day retention
  • Configure log rotation in /etc/elasticsearch/log4j2.properties or use centralized logging

Inbound Connectivity Tests

Work with Ailevate support to verify connectivity from Ailevate Cloud to your Elasticsearch endpoint:

What Ailevate Tests:

  • Firewall allows inbound HTTPS from Ailevate IP ranges to port 9200
  • TLS handshake succeeds with client certificate authentication
  • Query execution completes successfully
  • Response times acceptable (<500ms for health checks)

Your Verification:

# Check firewall rules allow Ailevate IP ranges
sudo iptables -L -n | grep 9200

# Check Elasticsearch is listening
sudo netstat -tlnp | grep 9200
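
A rough latency check from inside your network; it will not match what Ailevate measures over the WAN, but it helps rule out a slow cluster (certificate paths as above):

# Total request time for a cluster health call
curl -s -o /dev/null -w "health check: %{time_total}s\n" \
  "https://<elastic-host>:9200/_cluster/health" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key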

Common Issue Resolution

Symptom | Diagnostic Command | Resolution
Cluster RED | curl .../_cluster/health | Check node status, disk space, review logs
High disk (>85%) | curl .../_cat/allocation | Delete old indices, add nodes, or expand storage
Slow queries | curl .../_stats/search | Check node resources, review query patterns
Certificate expired | openssl s_client ... | Renew certificates, restart nodes
Node missing | curl .../_cat/nodes | Check node VM status, network, Elasticsearch service

See the On-Premises Deployment Guide for Elasticsearch sizing and configuration details.


AI Warehouse Diagnostics (On-Premise Only)

📘

Note: If you're on SaaS, Ailevate manages the AI Warehouse. Skip this section.

On-Premise AI Warehouse monitoring focuses on hardware health, vLLM API, and cooling systems.

Accelerator Detection

Verify Accelerators Detected:

sudo update-pciids
lspci -d 1e52:

Expected Output:

  • LoudBox: 4x Wormhole accelerators
  • QuietBox: 4x Blackhole accelerators

Check Driver Status:

lsmod | grep tt_kmd

Should show tt_kmd kernel module loaded. If missing:

sudo modprobe tt_kmd
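
If the module is missing again after a reboot, a sketch for loading it at boot on systemd-based distributions (file name is arbitrary):

echo tt_kmd | sudo tee /etc/modules-load.d/tt_kmd.conf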

Accelerator Health Monitoring

Check Accelerator Status:

tt-smi

Monitor:

  • Temperature (should be stable, varies by model)
  • Power consumption (should be consistent)
  • Utilization (varies based on workload)

⚠️

Warning: Thermal Alerts
If temperatures are rising or showing warnings:

  • LoudBox: Check airflow, verify fans running, remove obstructions
  • QuietBox: Verify ambient temperature <35°C, check BMC for cooling system status

Count Detected Accelerators:

lspci -d 1e52: | wc -l

Should return 4 (both LoudBox and QuietBox have 4 accelerators).

vLLM API Health

Test API Endpoint:

curl -k https://<ai-warehouse-host>:8080/v1/models

Expected Response: JSON list of loaded models

If API Unreachable:

# Check vLLM service status
sudo systemctl status vllm.service

# Restart if needed
sudo systemctl restart vllm.service

# View logs
sudo journalctl -u vllm.service -n 100

Test API Response Time:

time curl -k https://<ai-warehouse-host>:8080/v1/models

Should respond in <2 seconds. Slow responses indicate performance issues.
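
For an end-to-end inference check beyond listing models, a minimal sketch against the OpenAI-compatible completions endpoint that vLLM exposes; substitute a model name returned by /v1/models and adjust host, port, and any API key to your deployment:

curl -k -X POST "https://<ai-warehouse-host>:8080/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "prompt": "Say hello.", "max_tokens": 8}'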

Log Locations and Analysis

vLLM Service Logs:

# View recent logs
sudo journalctl -u vllm.service -n 100

# Follow logs in real-time
sudo journalctl -u vllm.service -f

# Export for troubleshooting
sudo journalctl -u vllm.service --since "24 hours ago" > vllm-logs.txt

Key Log Patterns:

  • Model loaded successfully → Normal startup
  • CUDA out of memory / Accelerator memory error → Model too large or memory leak
  • Connection refused → API binding issue; check port 8080 availability
  • Timeout waiting for accelerator → Hardware communication issue; check tt-smi

BMC Event Logs:

  • Access via BMC web interface: http://<bmc-ip> → System Health → Event Log
  • Look for: thermal warnings, PSU failures, memory errors, PCIe errors
  • Export event log for Ailevate support when reporting hardware issues

Log Retention:

  • vLLM logs managed by systemd journald
  • BMC event log capacity limited; export regularly for historical analysis

Interconnect Link Verification

QSFP-DD Link Status:

Physically inspect QSFP-DD ports on the hardware:

  • Link LEDs: Should be solid green for active links
  • Cable count: 8 cables (QuietBox mesh), 2 cables (LoudBox)

Check Link Topology:

Refer to the topology diagrams in the deployment guide.

Reseat Cables: If a link is down, power off the system, reseat the QSFP-DD cables, and power back on.

Power and Cooling Diagnostics

Check Power Supply Status:

Inspect PSU indicator lights:

  • Both PSUs should show green/operational (1+1 redundancy)
  • Amber/red indicates PSU failure—replace failed unit promptly

LoudBox (Air-Cooled) Checks:

  • Check fan operation
  • Listen for airflow from front to back
  • Visually inspect intake and exhaust for obstructions

Ambient temperature monitoring:

  • Measure temperature in server room/rack
  • Ensure adequate ventilation

QuietBox (Liquid-Cooled) Checks:

Access BMC interface: http://<bmc-ip> (default IPMI)

Monitor via BMC:

  • CPU and system temperatures
  • Coolant system status (pre-filled, sealed—no user maintenance)
  • Ambient temperature must be <35°C

⚠️

Warning: The QuietBox liquid cooling system is sealed and maintenance-free. Do not attempt to service coolant. If BMC shows cooling warnings, reduce ambient temperature or improve room ventilation.

BMC (Baseboard Management Controller)

Access BMC:

URL: http://<bmc-ip>
Default User: ADMIN (check hardware documentation)

Change default password immediately after first access.

Monitor via BMC:

  • System Health → Event Log (hardware errors)
  • Sensors → Temperature readings
  • Power → PSU status
  • System → Memory and CPU status

Common BMC Alerts:

Alert | Meaning | Action
Temperature warning | Cooling insufficient | Improve ventilation, check fans (LoudBox), reduce ambient (QuietBox)
PSU failure | Power supply failed | Replace failed PSU (redundancy maintains operation)
Memory error | RAM issue detected | Check BMC logs, reseat memory, replace if needed
PCIe error | Accelerator connection issue | Reseat accelerator cards, check PCIe slot

Inbound Connectivity Tests

Work with Ailevate support to verify connectivity from Ailevate Cloud to vLLM API:

What Ailevate Tests:

  • Firewall allows inbound HTTPS from Ailevate IP ranges to port 8080
  • TLS handshake succeeds
  • AI inference requests complete successfully
  • Response times acceptable

Your Verification:

# Check vLLM service listening
sudo netstat -tlnp | grep 8080

# Check firewall rules
sudo iptables -L -n | grep 8080
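
Newer distributions may ship ss instead of netstat and firewalld instead of raw iptables; equivalent checks in that case:

# Listening socket check with ss
sudo ss -tlnp | grep 8080

# Active firewall rules when firewalld is in use
sudo firewall-cmd --list-all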

Common Issue Resolution

Symptom | Diagnostic Steps | Resolution
Accelerators not detected | lspci -d 1e52: | Reseat cards, check tt_kmd module, verify PCIe connections
vLLM API unreachable | systemctl status vllm.service | Restart service, check logs, verify port 8080 open
Thermal warnings | Check BMC, inspect cooling | Improve ventilation, reduce ambient temp, verify fans
QSFP-DD link down | Inspect LEDs, cable connections | Reseat cables, verify topology matches docs
PSU failure | Inspect PSU LEDs | Replace failed PSU (system continues on redundant PSU)

For detailed hardware setup and troubleshooting, see the AI Warehouse hardware deployment documentation.


Cloud Services Diagnostics

Ailevate monitors all Cloud Services infrastructure in both SaaS and On-Premise deployments. You cannot directly diagnose Cloud Services components, but you can recognize symptoms and report them effectively.

Recognizing Cloud Services Issues

Common Symptoms:

Symptom | Likely Component | User Impact
HTTP 500/503 errors | API backend | Intermittent failures loading pages or data
Authentication failures | Auth service | Cannot log in, MFA not working
Slow page loads | API or database | Delays loading claims, reports, workflows
Workflow delays | Background job processing | AI tasks not completing, data not processing
Data sync failures | Ingestion pipeline | EHR claims not appearing in platform

What to Report

When experiencing Cloud Services issues, contact [email protected] with:

Required Information:

  • Timestamp: When did the issue start? (include timezone)
  • Affected users: All users, specific user(s), or specific role(s)?
  • Symptoms: Exact error messages, screenshots of errors
  • Actions triggering issue: What were users doing when issue occurred?
  • Frequency: Intermittent or continuous?

Example Report:

Subject: Application Error - HTTP 500 on Claim Search

Timestamp: 2025-11-19 14:30 EST
Affected Users: All users
Symptom: HTTP 500 error when searching for claims
Error Message: "An error occurred while processing your request"
Trigger: Performing any claim search in Search page
Frequency: Continuous since 14:30
Deployment: SaaS

Screenshot attached.

Ailevate monitors Cloud Services uptime, latency, error rates, and performance. We'll investigate and resolve platform issues.

💡

Tip: For troubleshooting HTTP status codes (400, 401, 403, 404, 500, 503), toast notifications, or feature-specific application errors, see the Common Error Messages guide. This guide focuses on infrastructure diagnostics.


Related Resources

For more information on monitoring and operations, see System Monitoring & Health and Operational Best Practices.

Need Help? If diagnostic procedures don't resolve your issue or you need assistance interpreting results, contact Ailevate support at [email protected]. Include diagnostic command output and any error messages you've encountered.