v6.6 EE Release Notes
This document was translated by ChatGPT
#1. Business and Applications
#1.1 Business Observability
- Data Association
- ⭐ The universal map now supports displaying alert events.
- Usability Enhancements
- Optimized topology style.
#1.2 Application Observability
- AutoTracing
- ⭐ Introduced TraceMap capability, which calculates the aggregate topology of all Traces matching the search criteria in real-time, helping users quickly organize software architecture and globally locate performance bottlenecks.
- ⭐ Supports using Wasm Plugin to enhance HTTP2/gRPC call logs (currently does not support enhancing eBPF uprobe data), documentation.
- ⭐ File read/write events now support collecting the full path of the file name and the offset of the read/write file.
- Supports collection and tracing of Memcached protocol, documentation.
- Supports collection and tracing of Tars protocol, documentation.
- Supports collection and tracing of Ping protocol, documentation.
- Supports collection and tracing of Dubbo protocol when using Fastjson serialization, documentation.
- Supports parsing MySQL Login Response statements and parsing truncated MySQL protocol content.
- Optimized parsing of unary type gRPC calls, documentation.
- Supports parsing multiple DNS requests in TCP Payload and parsing SRV type DNS call logs, documentation.
- Supports collection and tracing of Some/IP protocol.
- Introduced URL obfuscation capability for HTTP protocol, with Redis protocol obfuscation enabled by default, documentation.
- Optimized default values for NTP clock offset (`host_clock_offset_us`) and network transmission delay (`network_delay_us`) configuration parameters used in network Span tracing to reduce the probability of mismatches.
- Optimized mapping of schema/target fields in OTel Span to `l7_flow_log`, documentation.
- Supports collecting Unix Socket call logs and automatic tracing between TCP/UDP Socket call logs and Unix Socket call logs.
- Enriched eBPF Hook points for file read/write event collection to enhance adaptability.
- Supports parsing TraceID and SpanID from APMs like BoRui, Tingyun, and Cloudwise.
- Supports cross-thread analysis of the parent Span of the current Span (system Span at the client process location).
- Supports collecting multiple HTTP2/gRPC requests and responses in a single Packet.
- Supports obtaining the full file path for file read/write events:
- Can fully obtain NAS file paths, supporting NFS, SMB, CIFS, and other protocols.
- Can fully obtain the complete absolute path of file reads/writes inside a container Pod.
- AutoMetrics
- Supports aligning timestamps of request and response metrics within the same session to help AIOps systems better achieve root cause localization.
- AutoTagging
- Supports aggregating traffic of multiple member physical network cards of Open vSwitch bond interface, documentation.
- Correctly marks Universal Tag for loopback network card traffic on K8s Node.
- ⭐ Optimized the meaning of the `process_kname` field in call logs and file read/write event data from kernel thread name to system process name for better readability.
- ⭐ Optimized the meaning of the response status (`response_status`) field in call logs and improved page prompt information.
  - Normal: Response code is normal.
  - Client Error: Response code indicates a client-side error, such as HTTP 4XX.
  - Server Error: Response code indicates a server-side error, such as HTTP 5XX.
  - Timeout: If no response is collected within a certain time, the request is marked as timed out.
    - Collector `application session merge timeout setting` configuration: DNS and TLS default to 15s, other protocols default to 120s, documentation.
  - Unknown: When concurrent request volume exceeds the collector's cache capacity, the oldest request is marked as unknown.
    - Collector `session aggregation maximum entries` configuration: Default cache of 64K requests, documentation.
  - Parse Failure: A response was collected, but due to truncation or compression, the response code could not be parsed.
    - Collector `Payload truncation` configuration: Default parses the first 1024 bytes of Payload, documentation.
- Search Capability
- Introduced `application search` mode, allowing quick selection of services (app_service) and instances (app_instance), and quick input of endpoints (endpoint) and TraceID.
- Optimized the layout of the search box and the display style of the quick selection box.
- Performance Enhancement
- Optimized the performance of distributed tracing API.
- Usability Enhancements
- ⭐ Added `backend analysis` capability to distributed tracing, intelligently guiding users to quickly trace requests to the backend when eBPF AutoTracing is disconnected.
- ⭐ Supports quickly viewing continuous profiling and real-time profiling data corresponding to system Span.
- ⭐ Distributed tracing supports `waterfall list` display mode.
- ⭐ deepflow-agent supports directly receiving tracing data from SkyWalking and Datadog without forwarding through otel-collector.
- ⭐ Distributed tracing flame graph automatically corrects slight clock deviations between different machines.
- ⭐ Optimized the usability of the `network path` right slide page.
- Optimized the process configuration capability for enabling eBPF uprobe functionality, documentation.
- Optimized the distributed tracing page: adjusted the order of the quick filter box on the left, added trend analysis line charts, and optimized the call log details table.
- Resource analysis, path analysis, and topology analysis pages support automatically selecting the most appropriate metric time granularity for queries to optimize query speed.
- Resource analysis, path analysis, and topology analysis pages support quickly filtering the value range of metrics.
- The call log in the right slide page only displays abnormal entries by default.
- Optimized the display position of Tips in the topology map.
- Optimized the display of tag classification in both Chinese and English.
- Enriched the quick filter box on the left side of the page.
- Optimized the display order of search box candidates.
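
The six `response_status` values described in this section can be read as a decision procedure. The following is a minimal sketch for the HTTP case only. The function name, its parameters, and the `evicted_from_cache` flag are hypothetical and are not part of DeepFlow's API; the actual collector logic may differ.

```python
from typing import Optional


def classify_response_status(
    got_response: bool,
    status_code: Optional[int],
    evicted_from_cache: bool = False,
) -> str:
    """Map one HTTP call-log observation to a response_status label."""
    if not got_response:
        # Assumption: the request was dropped from the session cache
        # because concurrent requests exceeded its capacity.
        if evicted_from_cache:
            return "Unknown"
        # Otherwise the session merge timeout expired without a response.
        return "Timeout"
    if status_code is None:
        # A response was collected, but truncation or compression
        # prevented the response code from being parsed.
        return "Parse Failure"
    if 400 <= status_code < 500:
        return "Client Error"
    if 500 <= status_code < 600:
        return "Server Error"
    return "Normal"
```

For example, `classify_response_status(True, 404)` yields `Client Error`, while a request with no collected response and no cache eviction yields `Timeout`.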
#1.3 Code Observability
- AutoProfiling
- ⭐ Supports eBPF zero-intrusion collection of memory profiling data for Java and Rust processes, documentation.
- ⭐ Supports using DWARF to achieve stack unwinding in the absence of Frame Pointer, documentation.
- ⭐ Supports CPU performance profiling for Python and CUDA.
- AutoMetrics
- Supports aggregating and generating eBPF profiling metric data with 1s granularity to accelerate profiling metric queries.
- Performance Enhancement
- ⭐ Optimized Java process symbol table synchronization mechanism, reducing instantaneous CPU consumption introduced by business processes by about 50%.
- ⭐ Improved function stack merging efficiency, reducing resource overhead for function stack reporting, with significant performance improvement in scenarios with many threads of the same name.
- ⭐ Supports compressed transmission of Profiling data, reducing bandwidth consumption by 30%.
- ⭐ Supports compressed transmission of call logs and flow logs, with a compression ratio of up to 8:1 in test environments, documentation.
- Real-time Profiling
- ⭐ Supports using JVM API to obtain function stack, GC statistics, and Heap statistics of Java processes.
- Grafana
- Supports viewing DeepFlow eBPF On-CPU Profiling data in Grafana Panel.
- Usability Enhancements
- Optimized the process configuration capability for enabling Profile functionality, documentation.
- Optimized the display of function types in flame graphs.
- Optimized the text of memory profiling flame graphs.
- Enriched the quick filter box on the left side of the page.
- Optimized the display order of search box candidates.
- Default collection of OnCPU profiling data for `Java/Python`.
- Default collection of OnCPU profiling data for `deepflow-*`.
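
To illustrate what function stack merging means in the performance item above, here is a minimal sketch that collapses identical samples into a single counted entry. It is an illustration under assumptions, not DeepFlow's implementation: the `Sample` shape and the `merge_stacks` name are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical sample shape: (thread name, frames ordered from root to leaf).
Sample = Tuple[str, Tuple[str, ...]]


def merge_stacks(samples: Iterable[Sample]) -> Counter:
    """Count identical (thread name, folded stack) pairs so each is reported once."""
    merged: Counter = Counter()
    for thread_name, frames in samples:
        # Fold the frames into one string key, as flame graph tools commonly do.
        merged[(thread_name, ";".join(frames))] += 1
    return merged


samples = [
    ("worker", ("main", "run", "parse")),
    ("worker", ("main", "run", "parse")),
    ("worker", ("main", "run", "write")),
]
merged = merge_stacks(samples)
```

With many threads sharing one name, the number of reported entries depends on the number of distinct stacks rather than the number of threads.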
#2. Infrastructure
#2.1 Asset Observability
- ⭐ Introduced asset observability feature, supporting viewing observability data from the perspective of cloud host and container resources.
#2.2 Network Observability
- AutoTagging
- Supports aggregating traffic of multiple member physical network cards of Open vSwitch bond interface, documentation.
- Correctly marks Universal Tag for loopback network card traffic on K8s Node.
- PCAP
- ⭐ Supports online analysis of PCAP packet data.
- Performance Enhancement
- Optimized NAT tracing API performance.
- Usability Enhancements
- ⭐ Optimized the usability of the `network path` right slide page.
- Changed the default unit for all traffic rates on the page from bytes per second (`Bps`) to bits per second (`bps`).
- For non-TCP traffic in network flow logs (`l4_flow_log`), changed the end status (`close_type`) from timeout to normal end (1).
- Resource analysis, path analysis, and topology analysis pages support automatically selecting the most appropriate metric time granularity for queries to optimize query speed.
- Resource analysis, path analysis, and topology analysis pages support quickly filtering the value range of metrics.
- Traffic of `collector=other network cards` is included in resource metrics.
- Supports aggregation of tunnel and non-tunnel traffic, solving the problem of asymmetric path traffic aggregation.
- When collecting TCP packet headers or PCAP data, network flow log collection is automatically enabled.
- Optimized the display of information bar on the NAT tracing page.
- Optimized the display position of Tips in the topology map.
- The flow log in the right slide page only displays abnormal entries by default.
- Optimized the display of tag classification in both Chinese and English.
- Provides graphical interpretation of flow log end types (`close_type`).
- Enriched the quick filter box on the left side of the page.
- Optimized the display order of search box candidates.
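
Because the default traffic rate unit changed from bytes per second to bits per second, values read from the page are now 8 times larger for the same traffic. A minimal sketch of the conversion follows. The function names are hypothetical, and the use of decimal (1000-based) prefixes for display is an assumption.

```python
def to_bits_per_second(bytes_per_second: float) -> float:
    """Convert a rate in Bps (bytes per second) to bps (bits per second)."""
    return bytes_per_second * 8.0


def format_bps(bits_per_second: float) -> str:
    """Render a bps value with a decimal prefix (assumed 1000-based)."""
    units = ("bps", "Kbps", "Mbps", "Gbps")
    value = bits_per_second
    index = 0
    while value >= 1000 and index < len(units) - 1:
        value /= 1000
        index += 1
    return f"{value:.2f} {units[index]}"


rate = format_bps(to_bits_per_second(1_250_000))  # 1.25 MBps of traffic
```

Here a rate previously shown as 1.25 MBps would be rendered as `10.00 Mbps`.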
#2.3 Traffic Distribution
- ⭐ Traffic distribution supports ZMQ protocol.
- Distribution strategy supports specifying collector groups.
#3. Integration
#3.1 Probing Center
- Real-time Probing
- ⭐ deepflow-agent has built-in probing capability, eliminating the need to install binary files for probing commands.
- Supported commands include: ping, tcpping, curl, dig, traceroute.
- Supports executing probing commands within business Pods.
#4. Customization
#4.1 Dashboard
- Panel Enhancements
- ⭐ Aggregated metrics and overview charts support PromQL queries to enhance the display capability of Prometheus metrics in the dashboard.
- Optimized the display of Tips in Panels: query names are shown when there are multiple query conditions, and metric names are omitted for compactness when there is only a single metric.
- Topology maps support setting thresholds for the difference in path metrics (between adjacent hops).
- Tables support more extensive color settings.
- Panels support setting the color of legends.
- Optimized the customization capability of pie charts.
- Added the ability to zoom in for a closer look.
- Unified the method for setting metric units.
- Usability Enhancements
- Simplified the method for setting aliases, units, and thresholds for Panel metrics.
- Optimized the right slide detail page for resource change events and file read/write events.
- Optimized the display of Tab pages in the right slide page.
- Supports adding tags to dashboards.
- Supports copying entire dashboards.
- Optimized legend display.
#5. Integration
#5.1 Log Center
- Performance Enhancement
- ⭐ Application log data supports compressed transmission, reducing bandwidth consumption by 95% (CPU consumption increased by 3%).
- Usability Enhancements
- Optimized the search box, fixing the selection of application service (`app_service`) filter condition.
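
The 95% bandwidth reduction quoted for compressed application logs can be related to a compression ratio with simple arithmetic: a ratio of r:1 saves a fraction 1 - 1/r of the bandwidth. The sketch below shows this relation; the function names are hypothetical.

```python
def reduction_from_ratio(ratio: float) -> float:
    """Fraction of bandwidth saved at a compression ratio of ratio:1."""
    return 1.0 - 1.0 / ratio


def ratio_from_reduction(reduction: float) -> float:
    """Compression ratio (as ratio:1) implied by a fractional bandwidth saving."""
    return 1.0 / (1.0 - reduction)


low = reduction_from_ratio(5)    # 5:1 saves 80% of the bandwidth
high = reduction_from_ratio(20)  # 20:1 saves 95% of the bandwidth
```

Under this relation, a 95% reduction corresponds to a ratio of about 20:1, the upper end of the 5:1 to 20:1 range these notes give for application log compression.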
#6. Others
#6.1 Usability
- Supports copying and pasting the entire content of the search box.
#6.2 Alert Management
- Alert Strategy
- ⭐ Email push content supports Markdown format and supports Jinja2 syntax for referencing tags.
- The search module supports setting metric aliases and viewing units.
- Supports customizing the criteria for alert recovery: if there are no "fatal/error/warning/no data" monitoring events for N consecutive times, a recovery event is generated.
- Alert Events
- ⭐ Enriched alert event Tags to align with all observability data.
- Added `event analysis` page for statistical analysis of alert events.
- Push Endpoints
- When pushing to Kafka, supports `SCRAM-SHA-256` authentication method.
- Usability Enhancements
- Optimized system alert events to display detailed internal module names for DeepFlow process anomalies.
- Optimized the display of time range in the alert event right slide box.
- Performance Enhancement
- Optimized page loading time.
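
The custom alert recovery criterion in this section (no "fatal/error/warning/no data" monitoring events for N consecutive times) can be sketched as a streak counter. This is a minimal illustration under assumptions: the function name and the `"ok"` label for a clean check are hypothetical, and the sketch assumes a recovery event is only generated after an abnormal event has occurred.

```python
from typing import Iterable, List

ABNORMAL = {"fatal", "error", "warning", "no data"}


def recovery_points(events: Iterable[str], n: int) -> List[int]:
    """Return the indexes of checks at which a recovery event would be generated."""
    recoveries: List[int] = []
    clean_streak = 0
    alerting = False
    for index, level in enumerate(events):
        if level in ABNORMAL:
            # Any abnormal monitoring event resets the streak.
            alerting = True
            clean_streak = 0
            continue
        clean_streak += 1
        if alerting and clean_streak >= n:
            recoveries.append(index)
            alerting = False
    return recoveries


history = ["error", "ok", "ok", "warning", "ok", "ok", "ok"]
```

With `n=3`, `recovery_points(history, 3)` reports a recovery only at the last check, because the warning at index 3 interrupts the first clean streak.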
#6.3 Report Management
N/A
#7. Management
#7.1 Resource List
- AutoTagging
- ⭐ Supports synchronizing resource tags from Volcengine, documentation.
- Supports synchronizing LoadBalancer type container services.
- Enhanced process synchronization capability, documentation.
- ⭐ Supports synchronizing only processes inside containers.
- Supports not synchronizing Socket information (only synchronizing process information).
- Process Resources
- ⭐ Automatically records gprocess name as jar/py file name to avoid all displaying as java/python.
- Aggregates processes with the same `cmdline` on the same cloud host or the same K8s workload into a unique gprocess, reducing redundant process information.
- Optimized default values for process matchers, documentation.
  - By default, ignores the collection of `sleep/sh/bash/pause/runc` process information.
  - By default, collects process information for `Java/Python`.
  - By default, collects process information for `deepflow-*`.
  - By default, collects process information inside containers.
- Supports entering custom tags for cloud.tag on the page.
- Simplified process synchronization blacklist configuration, documentation.
- Adapted to K8s v1.32+ API.
- Management Capability
- When entering a K8s cluster, supports specifying ClusterID to reuse the old ClusterID when re-entering the cluster.
- Supports using Lua Plugin to customize K8s workload abstraction rules, documentation.
- Limits to only one `collector synchronization` type cloud platform per organization per region.
- Supports completing Alibaba Cloud resource synchronization using a regular account's AK/SK with ResourceGroupId.
- Predefined system alerts for resource synchronization lag and resource relationship anomalies.
- Performance Enhancement
- Cancels synchronization of K8s Pods in Evicted state to reduce resource overhead.
- Optimized storage performance of `genesis*` related MySQL tables.
- Adaptability Optimization
- When a cloud platform (Domain) is configured with a region whitelist, there is no need to call the Region API.
- Failure to obtain NAT gateways, route tables, and load balancers for Alibaba Cloud and Tencent Cloud does not affect the synchronization of other resource information.
#7.2 System Management
Server
- ⭐ Supports using OceanBase to replace MySQL.
- ⭐ Supports using ByConity to replace ClickHouse, documentation.
- ⭐ Supports using ClickHouse Enterprise Edition (currently only supports Alibaba Cloud), documentation.
- Supports terminating remote upgrades of collectors, optimizing CPU resource overhead during upgrades.
- By default, aggregates and generates network performance metrics and application performance metrics at 1h and 1d granularity.
- Filters for data export (Kafka/Prometheus/OTel) (`tag-filters-groups`) support filling in multiple groups to achieve logical OR semantics.
- Unified log format for each module in deepflow-server.
- Supports setting the maximum query duration to avoid excessive resource consumption for large time-scale queries.
- Performance Enhancement
- ClickHouse accesses MySQL through a proxy to obtain dictionary data, reducing MySQL connections and optimizing cross-region bandwidth consumption.
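
The logical OR semantics of multiple filter groups in `tag-filters-groups` can be sketched as follows. This is an illustration under assumptions and not the actual exporter code: the filter shape (`field` and `value` keys), equality-only matching, AND semantics within a group, and the `matches` name are all hypothetical.

```python
from typing import Dict, List

Filter = Dict[str, str]      # hypothetical shape: {"field": ..., "value": ...}
FilterGroup = List[Filter]


def matches(record: Dict[str, str], groups: List[FilterGroup]) -> bool:
    """Export a record if it satisfies every filter of at least one group."""
    if not groups:
        # Assumption: no configured groups means no filtering.
        return True
    return any(
        all(record.get(f["field"]) == f["value"] for f in group)
        for group in groups
    )


groups = [
    [{"field": "pod_ns", "value": "prod"}, {"field": "l7_protocol", "value": "HTTP"}],
    [{"field": "pod_ns", "value": "staging"}],
]
```

With these two groups, a record from the `staging` namespace matches regardless of protocol, while a `prod` record matches only when its protocol is HTTP.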
Agent
⭐ OneAgent: Supports using deepflow-agent to collect application logs, host system metrics, and K8s container system metrics.
⭐ OneAgent: Supports using deepflow-agent for continuous probing.
⭐ Security: Supports limiting the number of Sockets used by deepflow-agent, documentation.
⭐ Configuration refactoring, significantly improving usability, documentation.
⭐ Supports collecting traffic of virtual and physical network cards on DPDK KVM hosts that are not Open vSwitch, documentation.
Supports collecting traffic of internal network cards in Pods, suitable for scenarios where Pod network card traffic cannot be directly collected under the Root network namespace (e.g., Huawei Cloud CCE Turbo CNI), documentation.
Adapted to K8s CNI with the same MAC address for virtual network cards on the same host.
Supports specifying and disabling K8s List & Watch through environment variables, documentation.
Supports decapsulation of VXLAN type remote mirror traffic, documentation.
Dedicated collectors support setting to ignore PCP processing of mirror traffic, documentation.
Dedicated collectors support calculating network location (capture_network_type) based on QinQ inner VLAN, documentation.
Dedicated collectors by default do not limit the number of concurrent flows and the memory overhead of the policy module, documentation.
Idle memory circuit breaker mechanism supports using available memory metrics, documentation.
Limits the bandwidth consumption of data sent by the agent, allowing 100Mbps of data to be sent by default, documentation.
When the agent's traffic reaches the rate limit, it supports choosing between the `discard` and `wait` strategies. The default behavior is discard; it can be configured to wait to improve data transmission success rate, documentation.
Optimized resource overhead protection mechanism when application protocol recognition fails to avoid mistakenly prohibiting application protocol parsing, documentation.
Introduced a circuit breaker mechanism for the disk free space of the Agent runtime environment, documentation.
Supports prohibiting the Agent from using Swap memory, documentation.
Optimization: Reduced the work done by the Agent in a disabled state.
Reduced the number of Sockets used by deepflow-agent when sending data:
- Merged Sockets used for transmitting open_telemetry and open_telemetry_compressed data when integrating OpenTelemetry.
- Merged Sockets used for agent self-monitoring, transmitting deepflow_stats and agent_log data.
- Merged Sockets used for transmitting prometheus and telegraf metrics when integrating Prometheus and Telegraf.
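
The difference between the `discard` and `wait` strategies for the agent's send rate limit can be sketched with a token bucket. The token bucket is a substituted technique chosen for illustration; these notes do not state how deepflow-agent implements its limit. The class name, its parameters, and the injectable `clock` and `sleep` callables are hypothetical.

```python
import time
from typing import Callable


class RateLimitedSender:
    """Token bucket that either drops or delays data exceeding limit_bps."""

    def __init__(
        self,
        limit_bps: float,
        strategy: str = "discard",
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if strategy not in ("discard", "wait"):
            raise ValueError("strategy must be 'discard' or 'wait'")
        self.limit_bps = limit_bps
        self.strategy = strategy
        self._clock = clock
        self._sleep = sleep
        self._tokens = limit_bps  # start with one second of budget, in bits
        self._last = clock()
        self.sent = 0
        self.dropped = 0

    def _refill(self) -> None:
        now = self._clock()
        earned = (now - self._last) * self.limit_bps
        self._tokens = min(self.limit_bps, self._tokens + earned)
        self._last = now

    def send(self, payload: bytes) -> bool:
        bits = len(payload) * 8
        self._refill()
        if bits > self._tokens:
            if self.strategy == "discard":
                self.dropped += 1
                return False
            # "wait": block until enough budget has accumulated.
            self._sleep((bits - self._tokens) / self.limit_bps)
            self._refill()
        self._tokens -= bits
        self.sent += 1
        return True
```

A sender created with `RateLimitedSender(100_000_000)` would mirror the default 100Mbps limit with the default discard behavior; passing `strategy="wait"` trades latency for a higher transmission success rate.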
Performance Enhancement
⭐ Reduced eBPF kernel memory overhead of the Agent, reducing memory consumption by 60% under default configuration.
⭐ Supports using BPF FANOUT mechanism to improve collection performance, documentation.
⭐ Performance: Optimized memory usage of Cache used for application performance metrics in the Agent by timely cleaning up expired LRU entries, reducing overall memory consumption by 43% in test environments.
⭐ Performance: Aggregated storage of flow logs generated by LB health checks, reducing flow log storage overhead by nearly 50% in a certain production environment, documentation.
⭐ Performance: Improved the merge success rate of call logs on the agent side, significantly reducing the proportion of call logs with `response_status = Unknown`, with a 50% reduction in unknown proportion observed in test environments.
When TraceID is present in the protocol header, supports disabling eBPF syscall_trace_id calculation to reduce impact on business performance, documentation.
Supports completely disabling cBPF data collection (by configuring `inputs.cbpf.af_packet.interface_regex` to an empty string) to reduce memory overhead, documentation.
Supports deepflow-agent using a single Socket to transmit all observability data, documentation.
When Linux has BTF (BPF Type Format) enabled, and the kernel is greater than or equal to 5.5 on X86 architecture or greater than or equal to 6.0 on ARM architecture, the agent will automatically use fentry/fexit instead of kprobe/kretprobe, resulting in approximately 15% performance improvement.
Supports Watchdog mechanism to ensure circuit breakers can execute normally in extreme cases.
Supports compressed transmission of application logs, with a compression ratio between 5:1 and 20:1, documentation.
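
The conditions under which the agent switches from kprobe/kretprobe to fentry/fexit (BTF enabled, kernel at least 5.5 on X86 or at least 6.0 on ARM) can be checked for a given host with a short script. This is a minimal sketch, not the agent's own detection logic: the function name is hypothetical, and mapping "X86" to `x86_64` and "ARM" to `aarch64` is an assumption.

```python
import os
import platform
import re

# Minimum (major, minor) kernel version per machine type, from the text above.
MIN_KERNEL = {"x86_64": (5, 5), "aarch64": (6, 0)}


def uses_fentry(release: str, machine: str, has_btf: bool) -> bool:
    """Return True if the stated fentry/fexit conditions hold."""
    match = re.match(r"(\d+)\.(\d+)", release)
    minimum = MIN_KERNEL.get(machine)
    if not has_btf or match is None or minimum is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


# On Linux, kernels built with BTF expose it at /sys/kernel/btf/vmlinux.
local = uses_fentry(
    platform.release(),
    platform.machine(),
    os.path.exists("/sys/kernel/btf/vmlinux"),
)
```

For example, a `5.15.0-91-generic` kernel with BTF qualifies on `x86_64` but not on `aarch64`.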
Usability Improvements
- Significantly reduced the URL length of web pages, remembering the activation status of the right slide box in the URL.
- Displays the container cluster to which the collector belongs in the collector list.
- Optimized the display of the navigation bar in both Chinese and English.
#7.3 Account
- Multi-Tenant Support
- ⭐ Supports setting visible pages, resources, databases/tables/fields/field enumeration values for tenants.
- ⭐ Supports specifying the team to which a specific affiliated container cluster of a cloud platform belongs, allowing different teams to manage their own container clusters.
- Administrators are not visible to tenants within the organization.
- When a regular administrator joins a tenant organization, they default to guest status.
- Tenants are not allowed to create organizations.
#8. Incompatible Changes
- AutoTracing
- To reduce resource overhead and avoid misidentification, the agent by default only parses the following application protocols (to enable parsing of other protocols, please configure `l7-protocol-enabled`):
  - HTTP, HTTP2/gRPC, MySQL, Redis, Kafka, DNS, TLS.
  - Reminder: When using Wasm to parse private protocols, please add Custom to `l7-protocol-enabled`.
- Agent
- The original environment variable `ONLY_WATCH_K8S_RESOURCE` has been replaced with `K8S_WATCH_POLICY`, documentation.
- API
- Profiling API uses Dataframe return format to compress response size and improve API performance, PR, documentation.
| | #Functions | Response Size (Byte) | Download Time |
|---|---|---|---|
| Before | 450,000 | 21.9M | 6.16s |
| After | 450,000 | 3.07M | 0.78s |
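
The relative improvement shown in the table can be worked out directly from its figures. The short calculation below uses only the table's values; the variable names are illustrative, and sizes are kept in the table's own "M" unit.

```python
# Values copied from the table above (450,000 functions in both cases).
before_size_m, after_size_m = 21.9, 3.07
before_time_s, after_time_s = 6.16, 0.78

# Fraction by which the response size shrank.
size_reduction = 1 - after_size_m / before_size_m
# How many times faster the download completed.
speedup = before_time_s / after_time_s

print(f"Response size reduced by {size_reduction:.1%}")
print(f"Download is {speedup:.1f}x faster")
```

With these figures the response shrinks by about 86% and the download is roughly 7.9 times faster.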