Component Diagnostics & Troubleshooting

Hands-on diagnostic commands and troubleshooting procedures for investigating issues

When you need to troubleshoot an issue or investigate a problem with your Revenue Recovery platform, this guide provides the specific diagnostic commands and procedures you need. Whether you're investigating a Relay connectivity problem, checking Elasticsearch cluster health, or diagnosing AI Warehouse hardware, you'll find step-by-step troubleshooting procedures here.

For understanding monitoring responsibilities and health indicators, see System Monitoring & Health. For operational best practices and maintenance schedules, see Operational Best Practices.


Relay Service Diagnostics

The Relay VM requires monitoring in both SaaS and On-Premises deployments. Use these diagnostic procedures when investigating connectivity issues or performing routine health checks.

Service Status Checks

Check Service Health:

sudo systemctl status ailevate-tunnel.service
sudo systemctl status ailevate-proxy.service

Both services should show active (running). If inactive, restart:

sudo systemctl restart ailevate-tunnel.service
sudo systemctl restart ailevate-proxy.service

Quick Health Check (All-in-One):

systemctl is-active ailevate-tunnel.service ailevate-proxy.service && curl -sk https://<relay-endpoint>.ailevate.com > /dev/null && echo "✓ Relay healthy" || echo "✗ Relay issue detected"

Connectivity Tests

Test Outbound HTTPS (Port 443):

curl -vk https://<your-relay-endpoint>.ailevate.com

The TLS handshake should complete successfully. Look for a line beginning with SSL connection using in the output.

⚠️

Warning: If you see Connection refused or timeout, check:

  • Firewall allows outbound port 443 to *.ailevate.com
  • DNS can resolve Ailevate domains
  • No proxy server blocking connections

Test SQL Datastore Reachability (Port 1433):

nc -vz <sql-hostname> 1433

Should report Connection to <sql-hostname> 1433 port [tcp/*] succeeded!

Test DNS Resolution:

# External DNS (Ailevate Cloud and Azure)
nslookup management.azure.com
nslookup <your-relay-endpoint>.ailevate.com

# Internal DNS (EHR SQL Server)
nslookup <sql-hostname>

All should resolve to valid IP addresses. If failing, check /etc/resolv.conf for correct nameservers.
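
If resolution fails, a quick way to confirm which nameservers the VM is using and to query one directly (hostname and nameserver IP below are placeholders):

# Show configured nameservers
cat /etc/resolv.conf

# Query a specific nameserver directly to isolate the failure
nslookup <your-relay-endpoint>.ailevate.com <nameserver-ip>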

Verify Time Synchronization:

timedatectl status

Must show System clock synchronized: yes. TLS requires accurate time (drift >5 minutes causes failures).

Enable NTP if disabled:

sudo timedatectl set-ntp true
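
To see how far the clock has actually drifted, one hedged option is below; it assumes the VM uses systemd-timesyncd (chrony-based images expose the same data via chronyc tracking):

# Show NTP server, poll interval, and measured offset
timedatectl timesync-status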

Log Analysis

View Real-Time Logs:

# Tunnel service
sudo journalctl -u ailevate-tunnel.service -f

# Proxy service
sudo journalctl -u ailevate-proxy.service -f

Export Logs for Support:

# Last 24 hours
sudo journalctl -u ailevate-tunnel.service --since "24 hours ago" > tunnel-logs.txt

# Specific time range
sudo journalctl -u ailevate-tunnel.service --since "2025-11-19 09:00" --until "2025-11-19 10:00" > tunnel-logs-morning.txt

Common Log Patterns:

Pattern | Meaning | Action
TLS handshake failed | Time sync or certificate issue | Check timedatectl status, verify system time is accurate
Connection refused | Firewall blocking outbound 443 | Verify firewall rules, test with curl -vk
SQL connection error | Database unreachable or bad credentials | Test with nc -vz, verify SQL Server is running
DNS resolution failed | DNS misconfiguration | Check /etc/resolv.conf, test with nslookup
Tunnel connected successfully | Normal operation | No action needed
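
To scan recent tunnel logs for these patterns in one pass, something like the following works (exact message text may differ slightly from the table above):

# Search the last hour of tunnel logs for known failure signatures
sudo journalctl -u ailevate-tunnel.service --since "1 hour ago" | grep -Ei "handshake failed|connection refused|sql connection|resolution failed"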

Log Retention:

  • Relay logs managed by systemd journald (default 7-day retention)
  • For long-term retention, configure log forwarding to your centralized logging system
  • Consider increasing journal size: edit /etc/systemd/journald.conf and set SystemMaxUse=500M
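
A minimal sketch of that journald change using a drop-in file (the file name is arbitrary; adjust values to your retention needs):

# Cap journal size and optionally enforce a 30-day retention window
sudo mkdir -p /etc/systemd/journald.conf.d
printf '[Journal]\nSystemMaxUse=500M\nMaxRetentionSec=30day\n' | sudo tee /etc/systemd/journald.conf.d/retention.conf

# Apply the change
sudo systemctl restart systemd-journald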

SQL Server / EHR Logs:

When diagnosing SQL connectivity issues, correlate Relay logs with SQL Server error logs:

Windows SQL Server:

  • Default location: C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Log\ERRORLOG
  • Use SQL Server Management Studio: Management → SQL Server Logs

Linux SQL Server:

  • Default location: /var/opt/mssql/log/errorlog

What to look for:

  • Failed login attempts (check SQL credentials)
  • Connection timeouts (check network/firewall)
  • Login failed for user errors (verify Relay SQL account permissions)

System Resource Checks

Disk Space:

df -h

Ensure root partition has ≥20% free space. Low disk space can cause service failures.

CPU and Memory:

top
# or for a snapshot
free -h && top -bn1 | head -20

Relay VM should typically use <50% CPU and <75% memory during normal operation.

For complete Relay operational procedures, see the Relay Service Deployment Guide.


Elasticsearch Diagnostics (On-Premise Only)

📘

Note: If you're on SaaS, Ailevate manages Elasticsearch. Skip this section.

On-Premise deployments require regular Elasticsearch health checks and troubleshooting capabilities.

Cluster Health Checks

Check Overall Cluster Status:

curl -XGET "https://<elastic-host>:9200/_cluster/health?pretty" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Status Interpretation:

Status | Meaning | Action Required
🟢 GREEN | All primary and replica shards allocated | Normal operation
🟡 YELLOW | All primaries allocated, some replicas missing | Investigate replica allocation; may be acceptable temporarily
🔴 RED | One or more primary shards unallocated | URGENT: data unavailable; investigate immediately
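
When triaging, it can help to pull just the fields that matter from the health response (a sketch; assumes jq is installed, certificate paths as above):

curl -s -XGET "https://<elastic-host>:9200/_cluster/health" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key \
  | jq '{status, number_of_nodes, unassigned_shards, active_shards_percent_as_number}'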

Verify Node Availability:

curl -XGET "https://<elastic-host>:9200/_cat/nodes?v" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Look for:

  • All expected nodes present
  • Heap usage <75%
  • CPU usage patterns
  • Master node marked with *

Disk Space Monitoring

Check Disk Allocation:

curl -XGET "https://<elastic-host>:9200/_cat/allocation?v" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

⚠️

Warning: Critical Thresholds
Elasticsearch enforces disk watermarks:

  • 85% (low watermark) → No new shards allocated to the node
  • 90% (high watermark) → Shards relocated away from the node
  • 95% (flood stage) → Indices with shards on the node become read-only; cluster stability at risk

⚠️ Action: Keep disk utilization below 85%. Free space immediately if approaching limits.

Quick Disk Check Per Node:

# On each Elasticsearch node
df -h /var/lib/elasticsearch
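
If space must be freed, listing indices by size helps identify candidates for deletion or archival (a sketch; confirm with Ailevate before deleting any index):

# List indices sorted largest first
curl -XGET "https://<elastic-host>:9200/_cat/indices?v&s=store.size:desc" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key | head -20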

Shard Health

Check for Unassigned Shards:

curl -XGET "https://<elastic-host>:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key | grep UNASSIGNED

Common Unassigned Reasons:

  • ALLOCATION_FAILED → Disk watermarks exceeded or allocation rules preventing assignment
  • NODE_LEFT → Node failed and shards not yet reallocated
  • REPLICA_ADDED → New replica being allocated (normal)
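
For a detailed explanation of why a specific shard is unassigned, the cluster allocation explain API is useful (with no request body it reports on the first unassigned shard it finds):

curl -XGET "https://<elastic-host>:9200/_cluster/allocation/explain?pretty" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key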

Force Shard Reallocation (if needed):

curl -XPOST "https://<elastic-host>:9200/_cluster/reroute?retry_failed=true" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

TLS Certificate Validation

Check Certificate Expiry:

openssl s_client -connect <elastic-host>:9200 -showcerts 2>/dev/null | openssl x509 -noout -dates

Look for notAfter date. Renew certificates 30+ days before expiration.

Test Client Certificate Authentication:

curl -XGET "https://<elastic-host>:9200/" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Should return cluster information JSON. Authentication failures indicate certificate issues.

Performance Diagnostics

Check Indexing Performance:

curl -XGET "https://<elastic-host>:9200/_stats/indexing?pretty" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Check Query Performance:

curl -XGET "https://<elastic-host>:9200/_stats/search?pretty" \
  --cacert /path/to/ca.crt \
  --cert /path/to/client.crt \
  --key /path/to/client.key

Monitor query_time_in_millis and query_total for trends.
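
To track those two counters without scanning the full JSON, a hedged extraction (assumes jq is installed):

curl -s -XGET "https://<elastic-host>:9200/_stats/search" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key \
  | jq '._all.total.search | {query_total, query_time_in_millis}'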

Log Locations and Analysis

Elasticsearch Log Files (On-Premise installations):

# Main cluster log (default location)
/var/log/elasticsearch/<cluster-name>.log

# Slow query log (if enabled)
/var/log/elasticsearch/<cluster-name>_index_search_slowlog.log
/var/log/elasticsearch/<cluster-name>_index_indexing_slowlog.log

# Deprecation warnings
/var/log/elasticsearch/<cluster-name>_deprecation.log

Key Log Patterns to Monitor:

  • ERROR entries indicate failures requiring investigation
  • OutOfMemoryError → JVM heap exhausted; review heap configuration
  • ClusterBlockException → Disk watermarks exceeded; free disk space
  • NoNodeAvailableException → Cluster connectivity issues
  • Frequent WARN about shard allocation → Capacity or configuration issue
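
A quick scan of the main cluster log for the patterns above:

# Show the most recent matches for common failure signatures
grep -E "ERROR|OutOfMemoryError|ClusterBlockException|NoNodeAvailableException" /var/log/elasticsearch/<cluster-name>.log | tail -50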

Log Retention:

  • Default: Elasticsearch rotates logs daily, keeps 7 days
  • HIPAA compliance typically requires 30-90 day retention
  • Configure log rotation in /etc/elasticsearch/log4j2.properties or use centralized logging

Inbound Connectivity Tests

Work with Ailevate support to verify connectivity from Ailevate Cloud to your Elasticsearch endpoint:

What Ailevate Tests:

  • Firewall allows inbound HTTPS from Ailevate IP ranges to port 9200
  • TLS handshake succeeds with client certificate authentication
  • Query execution completes successfully
  • Response times acceptable (<500ms for health checks)

Your Verification:

# Check firewall rules allow Ailevate IP ranges
sudo iptables -L -n | grep 9200

# Check Elasticsearch is listening
sudo netstat -tlnp | grep 9200
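
A rough latency check from inside your network; it will not match what Ailevate measures over the WAN, but it helps rule out a slow cluster (certificate paths as above):

# Total request time for a cluster health call
curl -s -o /dev/null -w "health check: %{time_total}s\n" \
  "https://<elastic-host>:9200/_cluster/health" \
  --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client.key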

Common Issue Resolution

Symptom | Diagnostic Command | Resolution
Cluster RED | curl .../_cluster/health | Check node status, disk space, review logs
High disk (>85%) | curl .../_cat/allocation | Delete old indices, add nodes, or expand storage
Slow queries | curl .../_stats/search | Check node resources, review query patterns
Certificate expired | openssl s_client ... | Renew certificates, restart nodes
Node missing | curl .../_cat/nodes | Check node VM status, network, Elasticsearch service

See the On-Premises Deployment Guide for Elasticsearch sizing and configuration details.


AI Warehouse Diagnostics (On-Premise Only)

📘

Note: If you're on SaaS, Ailevate manages the AI Warehouse. Skip this section.

On-Premise AI Warehouse monitoring focuses on hardware health, vLLM API, and cooling systems.

Accelerator Detection

Verify Accelerators Detected:

sudo update-pciids
lspci -d 1e52:

Expected Output:

  • LoudBox: 4x Wormhole accelerators
  • QuietBox: 4x Blackhole accelerators

Check Driver Status:

lsmod | grep tt_kmd

Should show tt_kmd kernel module loaded. If missing:

sudo modprobe tt_kmd
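
If the module is missing again after a reboot, a sketch for loading it at boot on systemd-based distributions (file name is arbitrary):

echo tt_kmd | sudo tee /etc/modules-load.d/tt_kmd.conf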

Accelerator Health Monitoring

Check Accelerator Status:

tt-smi

Monitor:

  • Temperature (should be stable, varies by model)
  • Power consumption (should be consistent)
  • Utilization (varies based on workload)

⚠️

Warning: Thermal Alerts
If temperatures are rising or showing warnings:

  • LoudBox: Check airflow, verify fans running, remove obstructions
  • QuietBox: Verify ambient temperature <35°C, check BMC for cooling system status

Count Detected Accelerators:

lspci -d 1e52: | wc -l

Should return 4 (both LoudBox and QuietBox have 4 accelerators).

vLLM API Health

Test API Endpoint:

curl -k https://<ai-warehouse-host>:8080/v1/models

Expected Response: JSON list of loaded models

If API Unreachable:

# Check vLLM service status
sudo systemctl status vllm.service

# Restart if needed
sudo systemctl restart vllm.service

# View logs
sudo journalctl -u vllm.service -n 100

Test API Response Time:

time curl -k https://<ai-warehouse-host>:8080/v1/models

Should respond in <2 seconds. Slow responses indicate performance issues.
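
For an end-to-end inference check beyond listing models, a minimal sketch against the OpenAI-compatible completions endpoint that vLLM exposes; substitute a model name returned by /v1/models and adjust host, port, and any API key to your deployment:

curl -k -X POST "https://<ai-warehouse-host>:8080/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "prompt": "Say hello.", "max_tokens": 8}'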

Log Locations and Analysis

vLLM Service Logs:

# View recent logs
sudo journalctl -u vllm.service -n 100

# Follow logs in real-time
sudo journalctl -u vllm.service -f

# Export for troubleshooting
sudo journalctl -u vllm.service --since "24 hours ago" > vllm-logs.txt

Key Log Patterns:

  • Model loaded successfully → Normal startup
  • CUDA out of memory / Accelerator memory error → Model too large or memory leak
  • Connection refused → API binding issue; check port 8080 availability
  • Timeout waiting for accelerator → Hardware communication issue; check tt-smi

BMC Event Logs:

  • Access via BMC web interface: http://<bmc-ip> → System Health → Event Log
  • Look for: thermal warnings, PSU failures, memory errors, PCIe errors
  • Export event log for Ailevate support when reporting hardware issues

Log Retention:

  • vLLM logs managed by systemd journald
  • BMC event log capacity limited; export regularly for historical analysis

Interconnect Link Verification

QSFP-DD Link Status:

Physically inspect QSFP-DD ports on the hardware:

  • Link LEDs: Should be solid green for active links
  • Cable count: 8 cables (QuietBox mesh), 2 cables (LoudBox)

Check Link Topology:

Refer to the topology diagrams in the deployment guide.

Reseat Cables: If a link is down, power off the system, reseat the QSFP-DD cables, and power back on.

Power and Cooling Diagnostics

Check Power Supply Status:

Inspect PSU indicator lights:

  • Both PSUs should show green/operational (1+1 redundancy)
  • Amber/red indicates PSU failure—replace failed unit promptly

LoudBox (Air-Cooled) Checks:

  • Check fan operation
  • Listen for airflow from front to back
  • Visually inspect intake and exhaust for obstructions

Ambient temperature monitoring:

  • Measure temperature in server room/rack
  • Ensure adequate ventilation

QuietBox (Liquid-Cooled) Checks:

Access BMC interface: http://<bmc-ip> (default IPMI)

Monitor via BMC:

  • CPU and system temperatures
  • Coolant system status (pre-filled, sealed—no user maintenance)
  • Ambient temperature must be <35°C

⚠️

Warning: The QuietBox liquid cooling system is sealed and maintenance-free. Do not attempt to service coolant. If BMC shows cooling warnings, reduce ambient temperature or improve room ventilation.

BMC (Baseboard Management Controller)

Access BMC:

URL: http://<bmc-ip>
Default User: ADMIN (check hardware documentation)

Change default password immediately after first access.

Monitor via BMC:

  • System Health → Event Log (hardware errors)
  • Sensors → Temperature readings
  • Power → PSU status
  • System → Memory and CPU status

Common BMC Alerts:

Alert | Meaning | Action
Temperature warning | Cooling insufficient | Improve ventilation, check fans (LoudBox), reduce ambient (QuietBox)
PSU failure | Power supply failed | Replace failed PSU (redundancy maintains operation)
Memory error | RAM issue detected | Check BMC logs, reseat memory, replace if needed
PCIe error | Accelerator connection issue | Reseat accelerator cards, check PCIe slot

Inbound Connectivity Tests

Work with Ailevate support to verify connectivity from Ailevate Cloud to vLLM API:

What Ailevate Tests:

  • Firewall allows inbound HTTPS from Ailevate IP ranges to port 8080
  • TLS handshake succeeds
  • AI inference requests complete successfully
  • Response times acceptable

Your Verification:

# Check vLLM service listening
sudo netstat -tlnp | grep 8080

# Check firewall rules
sudo iptables -L -n | grep 8080
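
Newer distributions may ship ss instead of netstat and firewalld instead of raw iptables; equivalent checks in that case:

# Listening socket check with ss
sudo ss -tlnp | grep 8080

# Active firewall rules when firewalld is in use
sudo firewall-cmd --list-all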

Common Issue Resolution

Symptom | Diagnostic Steps | Resolution
Accelerators not detected | lspci -d 1e52: | Reseat cards, check tt_kmd module, verify PCIe connections
vLLM API unreachable | systemctl status vllm.service | Restart service, check logs, verify port 8080 open
Thermal warnings | Check BMC, inspect cooling | Improve ventilation, reduce ambient temp, verify fans
QSFP-DD link down | Inspect LEDs, cable connections | Reseat cables, verify topology matches docs
PSU failure | Inspect PSU LEDs | Replace failed PSU (system continues on redundant PSU)

For detailed hardware setup and troubleshooting, see the AI Warehouse hardware deployment documentation.


Cloud Services Diagnostics

Ailevate monitors all Cloud Services infrastructure in both SaaS and On-Premise deployments. You cannot directly diagnose Cloud Services components, but you can recognize symptoms and report them effectively.

Recognizing Cloud Services Issues

Common Symptoms:

Symptom | Likely Component | User Impact
HTTP 500/503 errors | API backend | Intermittent failures loading pages or data
Authentication failures | Auth service | Cannot log in, MFA not working
Slow page loads | API or database | Delays loading claims, reports, workflows
Workflow delays | Background job processing | AI tasks not completing, data not processing
Data sync failures | Ingestion pipeline | EHR claims not appearing in platform

What to Report

When experiencing Cloud Services issues, contact [email protected] with:

Required Information:

  • Timestamp: When did the issue start? (include timezone)
  • Affected users: All users, specific user(s), or specific role(s)?
  • Symptoms: Exact error messages, screenshots of errors
  • Actions triggering issue: What were users doing when issue occurred?
  • Frequency: Intermittent or continuous?

Example Report:

Subject: Application Error - HTTP 500 on Claim Search

Timestamp: 2025-11-19 14:30 EST
Affected Users: All users
Symptom: HTTP 500 error when searching for claims
Error Message: "An error occurred while processing your request"
Trigger: Performing any claim search in Search page
Frequency: Continuous since 14:30
Deployment: SaaS

Screenshot attached.

Ailevate monitors Cloud Services uptime, latency, error rates, and performance. We'll investigate and resolve platform issues.

💡

Tip: For troubleshooting HTTP status codes (400, 401, 403, 404, 500, 503), toast notifications, or feature-specific application errors, see the Common Error Messages guide. This guide focuses on infrastructure diagnostics.


Related Resources

For more information on monitoring and operations, see System Monitoring & Health and Operational Best Practices.

Need Help? If diagnostic procedures don't resolve your issue or you need assistance interpreting results, contact Ailevate support at [email protected]. Include diagnostic command output and any error messages you've encountered.