Component Diagnostics & Troubleshooting
Hands-on diagnostic commands and troubleshooting procedures for investigating issues
When you need to troubleshoot an issue or investigate a problem with your Revenue Recovery platform, this guide provides the specific diagnostic commands and procedures you need. Whether you're investigating a Relay connectivity problem, checking Elasticsearch cluster health, or diagnosing AI Warehouse hardware, you'll find step-by-step troubleshooting procedures here.
For understanding monitoring responsibilities and health indicators, see System Monitoring & Health. For operational best practices and maintenance schedules, see Operational Best Practices.
Relay Service Diagnostics
The Relay VM requires monitoring in both SaaS and On-Premises deployments. Use these diagnostic procedures when investigating connectivity issues or performing routine health checks.
Service Status Checks
Check Service Health:
sudo systemctl status ailevate-tunnel.service
sudo systemctl status ailevate-proxy.service
Both services should show active (running). If inactive, restart:
sudo systemctl restart ailevate-tunnel.service
sudo systemctl restart ailevate-proxy.service
Quick Health Check (All-in-One):
systemctl is-active ailevate-tunnel.service ailevate-proxy.service && curl -sk https://<relay-endpoint>.ailevate.com > /dev/null && echo "✓ Relay healthy" || echo "✗ Relay issue detected"
Connectivity Tests
Test Outbound HTTPS (Port 443):
curl -vk https://<your-relay-endpoint>.ailevate.com
Should complete the TLS handshake successfully. Look for SSL connection using in the output.
Warning: If you see Connection refused or a timeout, check:
- Firewall allows outbound port 443 to *.ailevate.com
- DNS can resolve Ailevate domains
- No proxy server blocking connections
Test SQL Datastore Reachability (Port 1433):
nc -vz <sql-hostname> 1433
Should report Connection to <sql-hostname> 1433 port [tcp/*] succeeded!
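A successful port test only proves the socket is reachable, not that the credentials work. If the mssql-tools package (sqlcmd) is installed on the Relay VM, a hedged sketch of an end-to-end authentication test follows; the hostname, user, password, and database are placeholders:
# Assumes sqlcmd (mssql-tools) is installed on the Relay VM; substitute real values
sqlcmd -S <sql-hostname>,1433 -U <relay-sql-user> -P '<password>' -d <database> -Q "SELECT 1"
A returned row confirms network reachability, authentication, and permissions in one step; a Login failed for user error points to the credential checks described under SQL Server / EHR Logs below.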
Test DNS Resolution:
# External DNS (Ailevate Cloud and Azure)
nslookup management.azure.com
nslookup <your-relay-endpoint>.ailevate.com
# Internal DNS (EHR SQL Server)
nslookup <sql-hostname>
All should resolve to valid IP addresses. If failing, check /etc/resolv.conf for correct nameservers.
Verify Time Synchronization:
timedatectl status
Must show System clock synchronized: yes. TLS requires accurate time (drift >5 minutes causes failures).
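If you need to see the actual clock offset rather than just the synchronized flag, a hedged sketch follows; which command applies depends on whether the VM uses systemd-timesyncd or chrony, and neither is guaranteed to be present:
# systemd-timesyncd (if in use) reports the measured offset
timedatectl timesync-status
# chrony (if installed) reports offset and drift via tracking
chronyc tracking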
Enable NTP if disabled:
sudo timedatectl set-ntp true
Log Analysis
View Real-Time Logs:
# Tunnel service
sudo journalctl -u ailevate-tunnel.service -f
# Proxy service
sudo journalctl -u ailevate-proxy.service -f
Export Logs for Support:
# Last 24 hours
sudo journalctl -u ailevate-tunnel.service --since "24 hours ago" > tunnel-logs.txt
# Specific time range
sudo journalctl -u ailevate-tunnel.service --since "2025-11-19 09:00" --until "2025-11-19 10:00" > tunnel-logs-morning.txtCommon Log Patterns:
| Pattern | Meaning | Action |
|---|---|---|
| TLS handshake failed | Time sync or certificate issue | Check timedatectl status, verify system time is accurate |
| Connection refused | Firewall blocking outbound 443 | Verify firewall rules, test with curl -vk |
| SQL connection error | Database unreachable or bad credentials | Test with nc -vz, verify SQL Server is running |
| DNS resolution failed | DNS misconfiguration | Check /etc/resolv.conf, test with nslookup |
| Tunnel connected successfully | Normal operation | No action needed |
Log Retention:
- Relay logs managed by systemd journald (default 7-day retention)
- For long-term retention, configure log forwarding to your centralized logging system
- Consider increasing the journal size: edit /etc/systemd/journald.conf and set SystemMaxUse=500M
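A minimal sketch of that journald change, assuming the default configuration file location; the 500M value is an example, not a sizing recommendation:
# Edit /etc/systemd/journald.conf and, under the [Journal] section, set:
#   SystemMaxUse=500M
# Then restart journald to apply the new limit:
sudo systemctl restart systemd-journald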
SQL Server / EHR Logs:
When diagnosing SQL connectivity issues, correlate Relay logs with SQL Server error logs:
Windows SQL Server:
- Default location: C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Log\ERRORLOG
- Use SQL Server Management Studio: Management → SQL Server Logs
Linux SQL Server:
- Default location: /var/opt/mssql/log/errorlog
What to look for:
- Failed login attempts (check SQL credentials)
- Connection timeouts (check network/firewall)
- Login failed for user errors (verify Relay SQL account permissions)
System Resource Checks
Disk Space:
df -h
Ensure the root partition has ≥20% free space. Low disk space can cause service failures.
CPU and Memory:
top
# or for a snapshot
free -h && top -bn1 | head -20
The Relay VM should typically use <50% CPU and <75% memory during normal operation.
For complete Relay operational procedures, see the Relay Service Deployment Guide.
Elasticsearch Diagnostics (On-Premise Only)
Note: If you're on SaaS, Ailevate manages Elasticsearch. Skip this section.
On-Premise deployments require regular Elasticsearch health checks and troubleshooting capabilities.
Cluster Health Checks
Check Overall Cluster Status:
curl -XGET "https://<elastic-host>:9200/_cluster/health?pretty" \
--cacert /path/to/ca.crt \
--cert /path/to/client.crt \
--key /path/to/client.key
Status Interpretation:
| Status | Meaning | Action Required |
|---|---|---|
| 🟢 GREEN | All primary and replica shards allocated | Normal operation |
| 🟡 YELLOW | All primaries allocated, some replicas missing | Investigate replica allocation; may be acceptable temporarily |
| 🔴 RED | One or more primary shards unallocated | URGENT - data unavailable; investigate immediately |
Verify Node Availability:
curl -XGET "https://<elastic-host>:9200/_cat/nodes?v" \
--cacert /path/to/ca.crt \
--cert /path/to/client.crt \
--key /path/to/client.key
Look for:
- All expected nodes present
- Heap usage <75%
- CPU usage patterns
- Master node marked with *
Disk Space Monitoring
Check Disk Allocation:
curl -XGET "https://<elastic-host>:9200/_cat/allocation?v" \
--cacert /path/to/ca.crt \
--cert /path/to/client.crt \
--key /path/to/client.key
Warning: Critical Thresholds
Elasticsearch enforces disk watermarks:
- 85% (high watermark) → No new shards allocated to node
- 90% (flood stage) → Indices become read-only
- 95% (critical) → Cluster stability at risk
⚠️ Action: Keep disk utilization below 85%. Free space immediately if approaching limits.
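To confirm the watermark thresholds your cluster is actually enforcing, and to clear the read-only block after freeing space, a hedged sketch follows (same certificate paths as the commands above; recent Elasticsearch versions remove the block automatically once usage drops below the high watermark):
# Show the effective disk watermark settings
curl -XGET "https://<elastic-host>:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key | grep watermark
# After freeing disk space, clear the read-only block that flood stage applies
curl -XPUT "https://<elastic-host>:9200/_all/_settings" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'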
Quick Disk Check Per Node:
# On each Elasticsearch node
df -h /var/lib/elasticsearch
Shard Health
Check for Unassigned Shards:
curl -XGET "https://<elastic-host>:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" \
--cacert /path/to/ca.crt \
--cert /path/to/client.crt \
--key /path/to/client.key | grep UNASSIGNED
Common Unassigned Reasons:
- ALLOCATION_FAILED → Disk watermarks exceeded or allocation rules preventing assignment
- NODE_LEFT → Node failed and shards not yet reallocated
- REPLICA_ADDED → New replica being allocated (normal)
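For the detailed reason behind a specific unassigned shard, the cluster allocation explain API is usually more informative than the _cat output; a hedged sketch (the index name and shard number are placeholders):
curl -XGET "https://<elastic-host>:9200/_cluster/allocation/explain?pretty" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key \
  -H 'Content-Type: application/json' \
  -d '{"index": "<index-name>", "shard": 0, "primary": true}'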
Force Shard Reallocation (if needed):
curl -XPOST "https://<elastic-host>:9200/_cluster/reroute?retry_failed=true" \
--cacert /path/to/ca.crt \
--cert /path/to/client.crt \
--key /path/to/client.key
TLS Certificate Validation
Check Certificate Expiry:
openssl s_client -connect <elastic-host>:9200 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -dates
Look for the notAfter date. Renew certificates 30+ days before expiration.
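To turn the expiry date into a pass/fail check suitable for a cron job, openssl's -checkend flag can be used; a minimal sketch that flags certificates expiring within 30 days (2592000 seconds):
# Exit status is non-zero if the certificate expires within 30 days
openssl s_client -connect <elastic-host>:9200 </dev/null 2>/dev/null | \
  openssl x509 -noout -checkend 2592000 || echo "✗ Certificate expires within 30 days"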
Test Client Certificate Authentication:
curl -XGET "https://<elastic-host>:9200/" \
--cacert /path/to/ca.crt \
--cert /path/to/client.crt \
--key /path/to/client.key
Should return cluster information JSON. Authentication failures indicate certificate issues.
Performance Diagnostics
Check Indexing Performance:
curl -XGET "https://<elastic-host>:9200/_stats/indexing?pretty" \
--cacert /path/to/ca.crt \
--cert /path/to/client.crt \
--key /path/to/client.key
Check Query Performance:
curl -XGET "https://<elastic-host>:9200/_stats/search?pretty" \
--cacert /path/to/ca.crt \
--cert /path/to/client.crt \
--key /path/to/client.key
Monitor query_time_in_millis and query_total for trends.
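To reduce those search stats to a single number you can trend over time, a hedged sketch that computes average query latency in milliseconds (assumes jq is installed; it will error if no queries have run yet):
# Average query latency (ms) = total query time / total query count
curl -sXGET "https://<elastic-host>:9200/_stats/search?filter_path=_all.total.search" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key | \
  jq '._all.total.search | .query_time_in_millis / .query_total'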
Log Locations and Analysis
Elasticsearch Log Files (On-Premise installations):
# Main cluster log (default location)
/var/log/elasticsearch/<cluster-name>.log
# Slow query log (if enabled)
/var/log/elasticsearch/<cluster-name>_index_search_slowlog.log
/var/log/elasticsearch/<cluster-name>_index_indexing_slowlog.log
# Deprecation warnings
/var/log/elasticsearch/<cluster-name>_deprecation.log
Key Log Patterns to Monitor:
- ERROR entries indicate failures requiring investigation
- OutOfMemoryError → JVM heap exhausted; review heap configuration
- ClusterBlockException → Disk watermarks exceeded; free disk space
- NoNodeAvailableException → Cluster connectivity issues
- Frequent WARN about shard allocation → Capacity or configuration issue
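The slow logs listed earlier are only written once per-index thresholds are configured; a hedged sketch of enabling a search slow-log threshold on one index (the index name and 5s threshold are examples):
# Log any query on <index-name> slower than 5 seconds to the search slowlog
curl -XPUT "https://<elastic-host>:9200/<index-name>/_settings" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key \
  -H 'Content-Type: application/json' \
  -d '{"index.search.slowlog.threshold.query.warn": "5s"}'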
Log Retention:
- Default: Elasticsearch rotates logs daily, keeps 7 days
- HIPAA compliance typically requires 30-90 day retention
- Configure in /etc/elasticsearch/elasticsearch.yml or use centralized logging
Inbound Connectivity Tests
Work with Ailevate support to verify connectivity from Ailevate Cloud to your Elasticsearch endpoint:
What Ailevate Tests:
- Firewall allows inbound HTTPS from Ailevate IP ranges to port 9200
- TLS handshake succeeds with client certificate authentication
- Query execution completes successfully
- Response times acceptable (<500ms for health checks)
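You can approximate the same latency check locally before involving support; a hedged sketch using curl's timing output, run from a host that can reach port 9200:
# Print total request time (seconds) for a cluster health call
curl -so /dev/null -w 'health check took %{time_total}s\n' \
  "https://<elastic-host>:9200/_cluster/health" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key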
Your Verification:
# Check firewall rules allow Ailevate IP ranges
sudo iptables -L -n | grep 9200
# Check Elasticsearch is listening
sudo netstat -tlnp | grep 9200
Common Issue Resolution
| Symptom | Diagnostic Command | Resolution |
|---|---|---|
| Cluster RED | curl .../_cluster/health | Check node status, disk space, review logs |
| High disk (>85%) | curl .../_cat/allocation | Delete old indices, add nodes, or expand storage |
| Slow queries | curl .../_stats/search | Check node resources, review query patterns |
| Certificate expired | openssl s_client... | Renew certificates, restart nodes |
| Node missing | curl .../_cat/nodes | Check node VM status, network, Elasticsearch service |
See the On-Premises Deployment Guide for Elasticsearch sizing and configuration details.
AI Warehouse Diagnostics (On-Premise Only)
Note: If you're on SaaS, Ailevate manages the AI Warehouse. Skip this section.
On-Premise AI Warehouse monitoring focuses on hardware health, vLLM API, and cooling systems.
Accelerator Detection
Verify Accelerators Detected:
sudo update-pciids
lspci -d 1e52:
Expected Output:
- LoudBox: 4x Wormhole accelerators
- QuietBox: 4x Blackhole accelerators
Check Driver Status:
lsmod | grep tt_kmd
Should show the tt_kmd kernel module loaded. If missing:
sudo modprobe tt_kmd
Accelerator Health Monitoring
Check Accelerator Status:
tt-smi
Monitor:
- Temperature (should be stable, varies by model)
- Power consumption (should be consistent)
- Utilization (varies based on workload)
Warning: Thermal Alerts
If temperatures are rising or showing warnings:
- LoudBox: Check airflow, verify fans running, remove obstructions
- QuietBox: Verify ambient temperature <35°C, check BMC for cooling system status
Count Detected Accelerators:
lspci -d 1e52: | wc -l
Should return 4 (both LoudBox and QuietBox have 4 accelerators).
vLLM API Health
Test API Endpoint:
curl -k https://<ai-warehouse-host>:8080/v1/models
Expected Response: JSON list of loaded models
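If jq is installed, a hedged sketch that extracts just the model IDs (assumes the OpenAI-compatible response schema that vLLM serves):
# List only the IDs of the loaded models
curl -sk https://<ai-warehouse-host>:8080/v1/models | jq -r '.data[].id'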
If API Unreachable:
# Check vLLM service status
sudo systemctl status vllm.service
# Restart if needed
sudo systemctl restart vllm.service
# View logs
sudo journalctl -u vllm.service -n 100
Test API Response Time:
time curl -k https://<ai-warehouse-host>:8080/v1/models
Should respond in <2 seconds. Slow responses indicate performance issues.
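To verify inference end to end rather than just the models endpoint, a hedged sketch of a minimal completion request; the model name is a placeholder and the payload assumes vLLM's OpenAI-compatible completions API:
# Single-token completion to confirm the model can actually serve requests
time curl -sk https://<ai-warehouse-host>:8080/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model-name>", "prompt": "ping", "max_tokens": 1}'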
Log Locations and Analysis
vLLM Service Logs:
# View recent logs
sudo journalctl -u vllm.service -n 100
# Follow logs in real-time
sudo journalctl -u vllm.service -f
# Export for troubleshooting
sudo journalctl -u vllm.service --since "24 hours ago" > vllm-logs.txt
Key Log Patterns:
- Model loaded successfully → Normal startup
- CUDA out of memory / Accelerator memory error → Model too large or memory leak
- Connection refused → API binding issue; check port 8080 availability
- Timeout waiting for accelerator → Hardware communication issue; check tt-smi
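A quick way to scan recent vLLM logs for these patterns in one pass (a minimal sketch; adjust the time window and patterns as needed):
# Scan the last 24 hours of vLLM logs for common failure patterns
sudo journalctl -u vllm.service --since "24 hours ago" | \
  grep -Ei "error|out of memory|connection refused|timeout" | tail -50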
BMC Event Logs:
- Access via BMC web interface: http://<bmc-ip> → System Health → Event Log
- Look for: thermal warnings, PSU failures, memory errors, PCIe errors
- Export event log for Ailevate support when reporting hardware issues
Log Retention:
- vLLM logs managed by systemd journald
- BMC event log capacity limited; export regularly for historical analysis
Interconnect Link Verification
QSFP-DD Link Status:
Physically inspect QSFP-DD ports on the hardware:
- Link LEDs: Should be solid green for active links
- Cable count: 8 cables (QuietBox mesh), 2 cables (LoudBox)
Check Link Topology:
Refer to deployment guide diagrams:
- Tenstorrent QuietBox Guide - 8-cable mesh topology
- Tenstorrent LoudBox Guide - 2-cable topology
Reseat Cables: If link down, power off system, reseat QSFP-DD cables, power on.
Power and Cooling Diagnostics
Check Power Supply Status:
Inspect PSU indicator lights:
- Both PSUs should show green/operational (1+1 redundancy)
- Amber/red indicates PSU failure—replace failed unit promptly
LoudBox (Air-Cooled) Checks:
# Check fan operation
# Listen for airflow from front to back
# Visual inspection: no obstructions to intake/exhaust
Ambient temperature monitoring:
- Measure temperature in server room/rack
- Ensure adequate ventilation
QuietBox (Liquid-Cooled) Checks:
Access BMC interface: http://<bmc-ip> (default IPMI)
Monitor via BMC:
- CPU and system temperatures
- Coolant system status (pre-filled, sealed—no user maintenance)
- Ambient temperature must be <35°C
Warning: The QuietBox liquid cooling system is sealed and maintenance-free. Do not attempt to service coolant. If BMC shows cooling warnings, reduce ambient temperature or improve room ventilation.
BMC (Baseboard Management Controller)
Access BMC:
URL: http://<bmc-ip>
Default User: ADMIN (check hardware documentation)
Change default password immediately after first access.
Monitor via BMC:
- System Health → Event Log (hardware errors)
- Sensors → Temperature readings
- Power → PSU status
- System → Memory and CPU status
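If you prefer command-line access to the same data, a hedged sketch using ipmitool over the network; it assumes ipmitool is installed on a management host, IPMI-over-LAN is enabled on the BMC, and you substitute your BMC credentials:
# Read the BMC event log (SEL) remotely
ipmitool -I lanplus -H <bmc-ip> -U ADMIN -P '<password>' sel list
# Read temperature sensor values
ipmitool -I lanplus -H <bmc-ip> -U ADMIN -P '<password>' sdr type temperature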
Common BMC Alerts:
| Alert | Meaning | Action |
|---|---|---|
| Temperature warning | Cooling insufficient | Improve ventilation, check fans (LoudBox), reduce ambient (QuietBox) |
| PSU failure | Power supply failed | Replace failed PSU (redundancy maintains operation) |
| Memory error | RAM issue detected | Check BMC logs, reseat memory, replace if needed |
| PCIe error | Accelerator connection issue | Reseat accelerator cards, check PCIe slot |
Inbound Connectivity Tests
Work with Ailevate support to verify connectivity from Ailevate Cloud to vLLM API:
What Ailevate Tests:
- Firewall allows inbound HTTPS from Ailevate IP ranges to port 8080
- TLS handshake succeeds
- AI inference requests complete successfully
- Response times acceptable
Your Verification:
# Check vLLM service listening
sudo netstat -tlnp | grep 8080
# Check firewall rules
sudo iptables -L -n | grep 8080
Common Issue Resolution
| Symptom | Diagnostic Steps | Resolution |
|---|---|---|
| Accelerators not detected | lspci -d 1e52: | Reseat cards, check tt_kmd module, verify PCIe connections |
| vLLM API unreachable | systemctl status vllm.service | Restart service, check logs, verify port 8080 open |
| Thermal warnings | Check BMC, inspect cooling | Improve ventilation, reduce ambient temp, verify fans |
| QSFP-DD link down | Inspect LEDs, cable connections | Reseat cables, verify topology matches docs |
| PSU failure | Inspect PSU LEDs | Replace failed PSU (system continues on redundant PSU) |
For detailed hardware setup and troubleshooting, see:
- Tenstorrent LoudBox Guide (air-cooled, rack-mount)
- Tenstorrent QuietBox Guide (liquid-cooled, desktop)
Cloud Services Diagnostics
Ailevate monitors all Cloud Services infrastructure in both SaaS and On-Premise deployments. You cannot directly diagnose Cloud Services components, but you can recognize symptoms and report them effectively.
Recognizing Cloud Services Issues
Common Symptoms:
| Symptom | Likely Component | User Impact |
|---|---|---|
| HTTP 500/503 errors | API backend | Intermittent failures loading pages or data |
| Authentication failures | Auth service | Cannot log in, MFA not working |
| Slow page loads | API or database | Delays loading claims, reports, workflows |
| Workflow delays | Background job processing | AI tasks not completing, data not processing |
| Data sync failures | Ingestion pipeline | EHR claims not appearing in platform |
What to Report
When experiencing Cloud Services issues, contact [email protected] with:
Required Information:
- Timestamp: When did the issue start? (include timezone)
- Affected users: All users, specific user(s), or specific role(s)?
- Symptoms: Exact error messages, screenshots of errors
- Actions triggering issue: What were users doing when issue occurred?
- Frequency: Intermittent or continuous?
Example Report:
Subject: Application Error - HTTP 500 on Claim Search
Timestamp: 2025-11-19 14:30 EST
Affected Users: All users
Symptom: HTTP 500 error when searching for claims
Error Message: "An error occurred while processing your request"
Trigger: Performing any claim search in Search page
Frequency: Continuous since 14:30
Deployment: SaaS
Screenshot attached.
Ailevate monitors Cloud Services uptime, latency, error rates, and performance. We'll investigate and resolve platform issues.
Tip: For troubleshooting HTTP status codes (400, 401, 403, 404, 500, 503), toast notifications, or feature-specific application errors, see the Common Error Messages guide. This guide focuses on infrastructure diagnostics.
Related Resources
For more information on monitoring and operations:
- System Monitoring & Health – Understanding monitoring responsibilities and health indicators
- Operational Best Practices – Maintenance schedules, security practices, and capacity planning
- Common Error Messages – User-facing application errors and resolution steps
- SaaS Deployment Guide – SaaS architecture and network requirements
- On-Premises Deployment Guide – On-Premises infrastructure and connectivity requirements
- Relay Service Deployment Guide – Relay installation and operational runbook
- Tenstorrent LoudBox Guide – Air-cooled AI Warehouse hardware
- Tenstorrent QuietBox Guide – Liquid-cooled AI Warehouse hardware
Need Help? If diagnostic procedures don't resolve your issue or you need assistance interpreting results, contact Ailevate support at [email protected]. Include diagnostic command output and any error messages you've encountered.