emqx-and-deepseek.md

Introduction

The observability of IoT data is a practice that involves monitoring and managing data from platforms like connected vehicles and industrial IoT to ensure high data quality, availability, and reliability across complex systems, processes, and pipelines. It provides users with a comprehensive understanding of data status, helps quickly identify and analyze issues, and enhances system stability and operational efficiency.

This article will explore how to combine EMQX's observability data with DeepSeek's LLM (Large Language Model) services. By leveraging AI technologies such as vectorized knowledge bases, automated code generation, and natural language processing, users can rapidly resolve issues like data upload failures, device disconnections, increased connection latency, and slow data forwarding.

Limitations of Existing IoT Data Observability Tools

In scenarios like connected vehicles and industrial IoT, issues such as device disconnections, slow message subscriptions, message forwarding delays, and message losses are common due to network conditions and the complexity of applications. Without an efficient observability data collection, storage, and analysis system, operation teams spend excessive time pinpointing and analyzing these issues. This leads to increased Mean Time to Recovery (MTTR), a decline in user experience, and potential customer complaints or damage to brand reputation.

Generally, observability data analysis relies on three main data sources: metrics, tracing, and logs:

Metrics: Users can quickly assess the overall health of the system through time-series graphs, such as line charts.

  • Metrics like CPU, memory, and network usage help identify anomalies within a specific time frame.
  • EMQX system data can reveal information on connections, message sending, and forwarding.

Mature products like Prometheus and Grafana are already available in the market for easy storage and visualization of these metrics.

Tracing: Tracing helps identify the internal operational state of the system and pinpoint where issues lie.

  • It involves tracking the call chain and time consumption across system components using internal tracking points.

Products like Jaeger are used in the market for storing, analyzing, and visualizing tracing data.
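As a rough illustration of "time consumption across system components", the sketch below computes how much time each span spends in its own work, excluding its direct children. The span tuples, component names, and timings are hypothetical and are not EMQX's actual trace format; the calculation also assumes child spans do not overlap in time.

```python
from collections import defaultdict

# Hypothetical spans: (span_id, parent_id, component, start_ms, end_ms).
SPANS = [
    ("a", None, "client.publish", 0, 120),
    ("b", "a", "broker.route", 10, 100),
    ("c", "b", "rule_engine.forward", 30, 95),
]


def self_times(spans):
    """Return {component: ms spent in the span itself, excluding direct children}.

    Assumes the children of a span do not overlap each other in time.
    """
    child_total = defaultdict(int)
    for _span_id, parent_id, _component, start, end in spans:
        if parent_id is not None:
            child_total[parent_id] += end - start
    return {
        component: (end - start) - child_total[span_id]
        for span_id, _parent_id, component, start, end in spans
    }


if __name__ == "__main__":
    for component, ms in self_times(SPANS).items():
        print(f"{component}: {ms} ms")
```

A component with a large self time relative to its siblings is a natural first place to look when a call chain is slow.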

Logs: Logs are essential for accurately diagnosing faults.

  • Logs are generated by the system during execution, providing developers and operations teams with insights into the system's status and any errors or exceptions encountered.

Mature products like Elasticsearch can store, query, and visualize logs effectively.

However, most observability tools in the market today have the following limitations:

  • Preset Functionality: Most tools rely on vendor preset configurations, making it difficult to adapt to unforeseen anomalies.
  • Static Knowledge Base: They rely on text-based search for solutions, which can’t provide precise recommendations for related issues.
  • Lack of Intelligence: These tools often lack reasoning capabilities, making them ineffective in analyzing complex issues in challenging scenarios.

Leveraging AI for Smarter Observability Data Analysis

The reasoning capabilities provided by Large Language Models (LLMs) can significantly enhance the intelligence of observability data analysis:

  • Intelligent Reasoning: Instead of relying on hard-coded rules, AI uses context to reason and judge system anomalies.
  • Natural Language Processing: Users describe what they need in plain language, and AI-generated code processes the data flexibly to fit special scenarios.
  • Vectorized Knowledge Base: Retrieval from a vectorized knowledge base grounds the AI's reasoning in relevant documents, yielding more precise solutions.
  • AI Agent Framework: AI-driven automation acts on LLM-derived solutions to enable intelligent operations.

DeepSeek R1 is a reasoning-optimized model developed by DeepSeek, trained with reinforcement learning (RL) to reason and generate content efficiently in complex scenarios. DeepSeek V3 is a generative large language model built on a Mixture-of-Experts (MoE) architecture, which improves training efficiency as well as the speed and quality of content generation. Combining the R1 and V3 models makes it possible to efficiently process the large volumes of heterogeneous data and the interaction requirements typical of IoT scenarios.

To help users operate IoT systems more efficiently, the latest version of EMQX ECP integrates observability tools built on DeepSeek V3. Building on ECP's fast deployment, remote operation, and centralized management of EMQX clusters and edge services, users can apply AI reasoning to data-driven IoT operations.

This observability tool primarily consists of the following three components:

  1. Vector Knowledge Base Construction: Documents such as product manuals, operational knowledge, and incident analysis reports are vectorized, enhancing the LLM’s ability to efficiently search and apply relevant information to related issues.
  2. Data Source Collection: EMQX sends data such as metrics, tracing, and logs to the Datalayers database via protocols like OpenTelemetry, providing the necessary data sources for LLM analysis.
  3. Problem Resolution:
    • The system directly searches the vector knowledge base for relevant content as context and combines it with the prompt, returning the inference results (output) directly to the user.
    • Based on the user’s requirements, the system loads the relevant data from the Datalayers database and generates corresponding code to process the data. Additionally, problematic data and the results found in the vector knowledge base are sent as context to the LLM, which performs inference to generate a relevant solution. This solution is then organized into natural language and returned to the user.
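The retrieval step described above can be sketched roughly as follows. This is a minimal illustration, not the ECP implementation: the knowledge-base snippets are invented, `embed` is a toy bag-of-words stand-in for a real embedding model, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge-base snippets; real entries would come from product
# manuals, operational knowledge, and incident analysis reports.
KNOWLEDGE_BASE = [
    "Device disconnections are often caused by keepalive timeouts on unstable networks.",
    "Slow message forwarding can indicate a saturated rule engine or a slow downstream database.",
    "Authentication failures usually point to expired client certificates or wrong credentials.",
]


def embed(text):
    """Toy embedding: a term-frequency vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, docs, top_k=1):
    """Return the top_k documents most similar to the question."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]


def build_prompt(question, docs, top_k=1):
    """Combine the retrieved context with the user's question into one prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("Why is message forwarding slow?", KNOWLEDGE_BASE))
```

The resulting prompt would then be sent to the LLM; that call is omitted here because its details depend on the deployment.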

image.png

Based on customer needs, future scenarios can include automated AI Agent-driven operations orchestration. For example, certain situations could trigger automatic actions such as scaling up resources or sending notifications. Additionally, automated online maintenance inspections can be incorporated, generating and sending high-quality inspection reports.

AI Interaction Demo

Next, we will demonstrate how to use AI for interactive operations. After deploying EMQX ECP, users can access the dashboard and click on the "Trace" function in the left navigation bar to utilize EMQX's end-to-end tracing capabilities for analyzing and troubleshooting issues. While tracing provides powerful data support to help identify and locate problems, complex scenarios often require professional expertise to analyze the root causes. To improve efficiency, we have integrated DeepSeek's LLM into the advanced query page of the tracing feature. This integration leverages AI's reasoning and generation capabilities to help users quickly and intelligently pinpoint issues and provide solutions.

Data Analysis Overview Feature

First, navigate to the tracing page and click on the "Advanced Query" button in the top-right corner to enter the query page. Here, users need to select an EMQX cluster identifier and, if necessary, choose one or more Client IDs to locate the data source. Next, select the desired time range for analysis (the default is the entire time period), and then click the query button. The system will return all tracing data for the specified Client ID.

EMQX ECP

Once the query results are returned, users will see a list containing multiple tracing data entries, which can be large in volume. Manually identifying abnormal data can be challenging. At this point, users can use the AI Assistant feature. By clicking the “Ask AI” button in the bottom-right corner, a dialog box will pop up. Users can input their data analysis requirements into the dialog box, and the AI Assistant will generate data analysis results and provide optimization suggestions based on the input information.

image.png

When clicking the "Tracing Data Overview" shortcut button in the AI Assistant (bottom-right corner), the system will generate an overview analysis for all the tracing data retrieved in the current query. This analysis typically includes the following sections:

  • Overall Status: Displays the total number of trace entries, success rate, average response time, minimum response time, maximum response time, and P95 and P99 response times for all client IDs in the current query.
  • Anomalies: Shows clients with high error rates (above a defined threshold), clients with abnormal response times (average response time exceeding a threshold), and abnormal traces (such as those with unusually long durations).
  • Key Findings: Lists the main clients or links with abnormalities, helping users pinpoint potential fault areas.
  • Recommendations: Based on the data analysis results, the system provides targeted optimization suggestions and troubleshooting directions.
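To make the "Overall Status" figures concrete, here is a minimal sketch of how such statistics could be computed from a list of trace records. The record fields (`duration_ms`, `error`) and the sample values are assumptions for illustration, and the percentile uses the simple nearest-rank method, which may differ from what the product uses.

```python
import math
from statistics import mean

# Hypothetical trace records; field names and values are illustrative only.
DURATIONS = [10, 12, 15, 11, 13, 14, 200, 16, 12, 17]
TRACES = [
    {"client_id": "c1", "duration_ms": d, "error": d > 100}
    for d in DURATIONS
]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def overall_status(traces):
    """Summarize trace count, success rate, and response-time statistics."""
    durations = [t["duration_ms"] for t in traces]
    succeeded = sum(1 for t in traces if not t["error"])
    return {
        "total": len(traces),
        "success_rate": succeeded / len(traces),
        "avg_ms": mean(durations),
        "min_ms": min(durations),
        "max_ms": max(durations),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


if __name__ == "__main__":
    for key, value in overall_status(TRACES).items():
        print(f"{key}: {value}")
```

With only ten samples, a single slow trace dominates both P95 and P99, which is why tail percentiles are more informative on larger result sets.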

image.png

Based on the report generated by the overview feature, users can quickly identify abnormal trace data. For example, the system might highlight that a particular client's response time is too long, or that certain Client IDs have a high trace error rate. Using this information, users can search and filter to immediately locate the Trace ID of the abnormal trace. Afterward, by clicking on the Trace ID, the detailed information for that trace will appear below, displaying a timeline of the trace structure for related services and operations. Each Span represents an operation, and users can hover their mouse over any operation to view the specific steps and details.
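The filtering step described above, going from flagged clients to the Trace IDs worth opening, might look like the following sketch. The thresholds, field names, and sample records are hypothetical; the actual thresholds used by the overview feature are not stated in this article.

```python
from collections import defaultdict

# Hypothetical trace records; field names and values are illustrative only.
TRACES = [
    {"trace_id": "t1", "client_id": "vehicle-001", "duration_ms": 20, "error": False},
    {"trace_id": "t2", "client_id": "vehicle-001", "duration_ms": 25, "error": False},
    {"trace_id": "t3", "client_id": "vehicle-002", "duration_ms": 900, "error": True},
    {"trace_id": "t4", "client_id": "vehicle-002", "duration_ms": 30, "error": False},
    {"trace_id": "t5", "client_id": "vehicle-003", "duration_ms": 40, "error": False},
]


def abnormal_clients(traces, max_error_rate=0.2, max_avg_ms=200):
    """Return {client_id: [trace_ids]} for clients above either assumed threshold."""
    by_client = defaultdict(list)
    for t in traces:
        by_client[t["client_id"]].append(t)

    flagged = {}
    for client_id, items in by_client.items():
        error_rate = sum(1 for t in items if t["error"]) / len(items)
        avg_ms = sum(t["duration_ms"] for t in items) / len(items)
        if error_rate > max_error_rate or avg_ms > max_avg_ms:
            # Keep only the traces that look abnormal themselves.
            flagged[client_id] = [
                t["trace_id"]
                for t in items
                if t["error"] or t["duration_ms"] > max_avg_ms
            ]
    return flagged


if __name__ == "__main__":
    print(abnormal_clients(TRACES))
```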

image.png

Span attributes often help pinpoint an issue, but when only an error code is present, the cause may remain unclear. In such situations, users can use the AI Assistant feature by clicking the "Spans Data Overview" button for a quick analysis of the underlying causes.

The AI Assistant will provide a detailed analysis for each Span operation, including error information, potential causes, and suggested fixes. By integrating with our knowledge base, AI can more accurately analyze the root cause of errors and offer targeted troubleshooting plans or repair suggestions. Leveraging historical case studies and solutions stored in the knowledge base, AI can quickly identify the real issue, saving users from manual searching and troubleshooting, thus improving the accuracy and efficiency of issue resolution.

image.png

image.png

Script Mode Feature

EMQX ECP's AI Assistant also offers a custom script feature. Toggling the Script Mode switch at the top of the dialog box activates script mode, which draws on DeepSeek's code-generation and reasoning capabilities. Once script mode is enabled, users can send custom data requests through the dialog box. The AI automatically generates the corresponding data analysis script, executes it on the current data, and returns the analysis results. It can also generate charts or documents from the results, helping users visualize the data and quickly locate issues.
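Conceptually, the generate-then-execute loop can be sketched as below. This is an assumption about the general pattern, not a description of how ECP runs scripts: `GENERATED_SCRIPT` is a hard-coded stand-in for text an LLM would return, and `run_generated_script` is a hypothetical helper.

```python
# Stand-in for a script returned by the LLM; in practice this text would be generated.
GENERATED_SCRIPT = 'result = max(t["duration"] for t in traces)'

# Hypothetical trace data currently loaded in the query page.
TRACES = [
    {"trace_id": "t1", "duration": 120},
    {"trace_id": "t2", "duration": 80},
    {"trace_id": "t3", "duration": 100},
]


def run_generated_script(script, traces):
    """Execute a generated script against the traces and return its `result` variable.

    Restricting builtins only limits accidents; it is NOT a security sandbox.
    A production system would need real isolation (separate process, container, etc.).
    """
    namespace = {
        "__builtins__": {"max": max, "min": min, "sum": sum, "len": len},
        "traces": traces,
    }
    exec(script, namespace)
    return namespace.get("result")


if __name__ == "__main__":
    print(run_generated_script(GENERATED_SCRIPT, TRACES))
```

Passing the data in through a fixed variable name (`traces`) and reading the answer from another (`result`) keeps the contract between the generated script and the host simple to validate.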

Example 1: Calculating the Average Duration

In script mode, the user sends a request to "calculate the average value of the duration field (time taken) in the current trace data." The AI Assistant will automatically generate and execute the necessary script, calculate the average duration, and return the result. The result will be displayed directly in the dialog box, along with a summary report of the analysis.
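A script of the kind the assistant might generate for this request could look like the sketch below. The sample records are made up, and only the `duration` field name comes from the request itself.

```python
from statistics import mean

# Hypothetical trace data; only the "duration" field name comes from the request.
TRACE_DATA = [
    {"trace_id": "t1", "duration": 120},
    {"trace_id": "t2", "duration": 80},
    {"trace_id": "t3", "duration": 100},
    {"trace_id": "t4"},  # a record without a duration is skipped
]


def average_duration(traces, field="duration"):
    """Return the mean of `field` over records that have it, or None if none do."""
    values = [t[field] for t in traces if field in t]
    return mean(values) if values else None


if __name__ == "__main__":
    print(f"Average duration: {average_duration(TRACE_DATA)}")
```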

image.png

image.png

Example 2: Viewing the Proportion of Error Traces in the Total Traces

The AI Assistant will automatically determine whether a corresponding chart needs to be generated based on the user's request. For example, when the user asks, "View the proportion of error traces in the total traces," the AI Assistant will generate an analysis script based on the request, calculate the percentage of error traces, and automatically generate a chart to display the results. Along with the chart, an analysis report will be provided. This feature is particularly useful for users to evaluate the health status of the system through proportion-based analysis.
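For this request, a generated script might resemble the following sketch. The `status` field and sample records are assumptions, and the text bar is only a dependency-free stand-in for the chart the assistant produces.

```python
# Hypothetical trace data; the "status" field name is an assumption.
TRACES = [
    {"trace_id": "t1", "status": "ok"},
    {"trace_id": "t2", "status": "error"},
    {"trace_id": "t3", "status": "ok"},
    {"trace_id": "t4", "status": "ok"},
]


def error_proportion(traces):
    """Return the fraction of traces whose status is "error" (0.0 for no traces)."""
    total = len(traces)
    errors = sum(1 for t in traces if t.get("status") == "error")
    return errors / total if total else 0.0


def text_bar(label, fraction, width=20):
    """Render a fraction as a fixed-width text bar, a stand-in for a real chart."""
    filled = round(fraction * width)
    return f"{label:<6}{'#' * filled}{'.' * (width - filled)} {fraction:.0%}"


if __name__ == "__main__":
    share = error_proportion(TRACES)
    print(text_bar("error", share))
    print(text_bar("ok", 1 - share))
```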

image.png

image.png

image.png

Through script mode, AI not only assists users in completing complex data analysis tasks but also flexibly generates code based on user requirements. Users do not need to write code or manually calculate data; AI will automatically perform reasoning and calculations, reducing manual intervention and improving data analysis efficiency. Additionally, the generated charts and reports are more intuitive, helping users quickly grasp key information and optimize the decision-making process.

Conclusion

By combining EMQX's observability data with the reasoning capabilities of DeepSeek's LLMs, system maintenance workload and costs can be reduced while operational efficiency and quality improve. This integration also shortens the time required for fault detection and analysis and provides targeted solutions or recommendations, ultimately improving customer satisfaction.

In addition to DeepSeek, EMQX also offers integration capabilities with other prominent AI models like Grok, OpenAI, and Claude. Users can easily harness the power of AI through EMQX’s ready-to-use integration features.

With the rapid development of LLM technology, intelligent maintenance agents are enabling more complex automation capabilities, gradually relieving the workload of operations and support teams, and providing strong support for the digital transformation of enterprises.

If you are interested in learning more about EMQX's AI-based observability tools, please do not hesitate to reach out to us at [email protected]. We would be happy to provide you with further information.
