Node memory leak #130

@masterOcean

Description

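A node leaked memory for 30 minutes until it went down; the retained objects are mainly PubMessage objects.

Our production cluster consists of 4 nodes (16c 32g). On one of them, memory kept rising for 30 minutes, GC could not reclaim it, and the node eventually became unavailable. This is the first time we have seen this. At the time the cluster held about 120,000 connections in total, about 30,000 per node. Each connection sends a message roughly every 5 s or 10 s, and the messages are forwarded to Kafka through shared connections.
GC log: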

[2025-05-20T09:40:43.490+0800][30530][gc] GC(168430) Garbage Collection (Allocation Rate) 6626M(54%)->1966M(16%)
[2025-05-20T09:40:58.462+0800][30530][gc] GC(168431) Garbage Collection (Allocation Rate) 7112M(58%)->2040M(17%)
.....
[2025-05-20T09:44:37.140+0800][30530][gc] GC(168446) Garbage Collection (Allocation Rate) 5252M(43%)->2862M(23%)
[2025-05-20T09:44:45.639+0800][30530][gc] GC(168447) Garbage Collection (Allocation Rate) 5222M(42%)->2422M(20%)
[2025-05-20T09:44:55.678+0800][30530][gc] GC(168448) Garbage Collection (Allocation Rate) 4706M(38%)->3060M(25%)
.....
[2025-05-20T09:48:01.016+0800][30530][gc] GC(168460) Garbage Collection (Allocation Rate) 7482M(61%)->3376M(27%)
[2025-05-20T09:48:15.690+0800][30530][gc] GC(168461) Garbage Collection (Allocation Rate) 7518M(61%)->3448M(28%)
[2025-05-20T09:48:30.689+0800][30530][gc] GC(168462) Garbage Collection (Allocation Rate) 7720M(63%)->3502M(28%)
.....
[2025-05-20T09:52:06.422+0800][30530][gc] GC(168483) Garbage Collection (Allocation Rate) 6834M(56%)->4232M(34%)
[2025-05-20T09:52:16.279+0800][30530][gc] GC(168484) Garbage Collection (Allocation Rate) 6762M(55%)->4284M(35%)
[2025-05-20T09:52:26.332+0800][30530][gc] GC(168485) Garbage Collection (Allocation Rate) 6868M(56%)->4376M(36%)
.....
[2025-05-20T10:03:21.311+0800][30530][gc] GC(168626) Garbage Collection (Allocation Rate) 7308M(59%)->7280M(59%)
[2025-05-20T10:03:25.278+0800][30530][gc] GC(168627) Garbage Collection (Allocation Rate) 7302M(59%)->7328M(60%)
.....
[2025-05-20T10:10:03.710+0800][30530][gc] GC(168717) Garbage Collection (Allocation Rate) 9672M(79%)->9866M(80%)
[2025-05-20T10:10:09.355+0800][30843][gc] Allocation Stall (crdt-service-scheduler) 176.979ms
[2025-05-20T10:10:09.355+0800][30755][gc] Allocation Stall (io-rpc-worker-elg-31) 107.565ms
[2025-05-20T10:10:09.355+0800][32197][gc] Allocation Stall (basekv-range-mutator) 560.162ms
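
The trend in the log above is easier to see when only the post-GC heap occupancy is extracted from each summary line. The following is a minimal sketch, assuming the log keeps the `before->after` format shown above; the class name `GcTrend` and the method `afterGcMb` are hypothetical helpers, not part of BifroMQ or the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts post-GC heap occupancy from GC summary lines.
final class GcTrend {
    // Matches the "before->after" part of a summary line, e.g. "6626M(54%)->1966M(16%)".
    private static final Pattern HEAP =
        Pattern.compile("(\\d+)M\\((\\d+)%\\)->(\\d+)M\\((\\d+)%\\)");

    /** Returns the post-GC heap size in MB for every line that carries a GC summary. */
    static List<Integer> afterGcMb(List<String> lines) {
        List<Integer> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = HEAP.matcher(line);
            if (m.find()) {
                out.add(Integer.parseInt(m.group(3))); // group 3 is the "after" value
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Lines copied from the log above; the Allocation Stall line is skipped by the parser.
        List<String> sample = List.of(
            "[2025-05-20T09:40:43.490+0800][30530][gc] GC(168430) Garbage Collection (Allocation Rate) 6626M(54%)->1966M(16%)",
            "[2025-05-20T09:52:26.332+0800][30530][gc] GC(168485) Garbage Collection (Allocation Rate) 6868M(56%)->4376M(36%)",
            "[2025-05-20T10:10:03.710+0800][30530][gc] GC(168717) Garbage Collection (Allocation Rate) 9672M(79%)->9866M(80%)",
            "[2025-05-20T10:10:09.355+0800][30843][gc] Allocation Stall (crdt-service-scheduler) 176.979ms");
        System.out.println(afterGcMb(sample)); // [1966, 4376, 9866]
    }
}
```

A post-GC value that rises steadily while the pre-GC value stays in the same range points to retained objects rather than to a higher allocation rate.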

Heap histogram of the bad node

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:       4387991      789541360  [Ljava.lang.Object; (java.base@17.0.10)
   2:       6318061      692352160  [B (java.base@17.0.10)
   3:       2070386      560754472  [J (java.base@17.0.10)
   4:      14831152      474596864  java.util.concurrent.CompletableFuture (java.base@17.0.10)
   5:       2057200      298501136  [I (java.base@17.0.10)
   6:       4197419      268634816  java.util.concurrent.CompletableFuture$UniWhenComplete (java.base@17.0.10)
   7:       2076515      215957560  com.baidu.bifromq.type.Message
   8:       3136953      175669368  java.util.concurrent.CompletableFuture$UniRelay (java.base@17.0.10)
   9:       2048731      163898352  [S (java.base@17.0.10)
  10:       2120880      135736320  java.util.concurrent.CompletableFuture$UniApply (java.base@17.0.10)
  11:       2076511      132896704  java.util.concurrent.CompletableFuture$UniExceptionally (java.base@17.0.10)
  12:       4110609      131539488  java.lang.String (java.base@17.0.10)
  13:       1078152      120753024  io.netty.buffer.PooledUnsafeDirectByteBuf
  14:       2076517       99672816  com.baidu.bifromq.basescheduler.CallTask
  15:       2076510       83060400  com.baidu.bifromq.dist.client.scheduler.DistServerCall
  16:       2161485       69167520  java.util.concurrent.ConcurrentLinkedQueue$Node (java.base@17.0.10)
  17:       1061848       67958272  io.netty.buffer.PooledSlicedByteBuf
  18:       1060441       67868224  java.util.concurrent.CompletableFuture$UniAccept (java.base@17.0.10)
  19:       2076518       66448576  com.baidu.bifromq.dist.client.scheduler.BatcherKey
  20:       1060435       59384360  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1774/0x00007fdddea97150
  21:       1060405       59382680  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1777/0x00007fdddea97808
  22:       1038502       58156112  java.util.LinkedHashMap$Entry (java.base@17.0.10)
  23:       1016076       56900256  java.util.concurrent.CancellationException (java.base@17.0.10)
  24:        279987       51446320  [Ljava.util.HashMap$Node; (java.base@17.0.10)
  25:       1060436       50900928  com.baidu.bifromq.plugin.authprovider.type.CheckResult
  26:       1060435       50900880  io.netty.handler.codec.mqtt.MqttPublishMessage
  27:       2088592       50126208  com.google.protobuf.ByteString$LiteralByteString
  28:       2076517       49836408  com.baidu.bifromq.basescheduler.BatchCallScheduler$$Lambda$1429/0x00007fddde92f5f0
  29:       1127869       45114760  java.util.HashMap$Node (java.base@17.0.10)
  30:       1060439       42417560  io.netty.handler.codec.mqtt.MqttFixedHeader
  31:        212802       37065952  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@17.0.10)
  32:        410806       36150928  io.netty.channel.DefaultChannelHandlerContext
  33:       1060437       33933984  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1693/0x00007fdddea5e250
  34:       1060435       33933920  io.netty.handler.codec.mqtt.MqttPublishVariableHeader
  35:        268707       25795872  java.util.concurrent.ConcurrentHashMap (java.base@17.0.10)
  36:        321774       25741920  java.util.LinkedHashMap (java.base@17.0.10)
  37:       1060435       25450440  com.baidu.bifromq.mqtt.handler.MQTTSessionHandler$$Lambda$1778/0x00007fdddea97c78
  38:        776047       24833504  io.netty.util.Recycler$DefaultHandle
  39:       1016077       24385848  java.util.concurrent.CompletableFuture$AltResult (java.base@17.0.10)
  40:        526340       21053600  java.util.concurrent.ConcurrentHashMap$Node (java.base@17.0.10)
  41:         45411       13804944  com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler
  42:        137358       10988640  java.util.TreeMap (java.base@17.0.10)
  43:         45864       10273536  io.netty.channel.epoll.EpollSocketChannel
  44:        393943        9454632  java.util.concurrent.atomic.AtomicLong (java.base@17.0.10)
  45:        194612        9341376  com.google.protobuf.MapField
  46:         45609        8756928  io.netty.handler.traffic.TrafficCounter
  47:         91987        8094856  io.netty.util.concurrent.ScheduledFutureTask
  48:        139228        7796768  com.baidu.bifromq.type.ClientInfo
  49:        182432        7297280  io.netty.util.DefaultAttributeMap$DefaultAttribute
  50:        220074        7042368  java.util.concurrent.ConcurrentHashMap$KeySetView (java.base@17.0.10)
  51:        138238        6635424  com.baidu.bifromq.inbox.storage.proto.TopicFilterOption
  52:        194612        6227584  com.google.protobuf.MapField$MutabilityAwareMap
  53:         96349        6166336  java.util.HashMap (java.base@17.0.10)
  54:         44444        6044384  com.baidu.bifromq.inbox.storage.proto.InboxMetadata
  55:          7600        5168000  io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
  56:         45605        4742920  com.baidu.bifromq.mqtt.handler.TenantSettings
  57:        195219        4685256  java.util.concurrent.atomic.AtomicReference (java.base@17.0.10)
  58:        194612        4670688  com.google.protobuf.MapField$ImmutableMessageConverter
  59:         45871        4403616  io.netty.channel.DefaultChannelPipeline$HeadContext

Heap histogram of a healthy node

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:        311863      113943464  [Ljava.lang.Object; (java.base@17.0.10)
   2:       2077519      107863216  [B (java.base@17.0.10)
   3:       1940738       62103616  java.lang.String (java.base@17.0.10)
   4:        906314       50753584  java.util.LinkedHashMap$Entry (java.base@17.0.10)
   5:        216048       37429824  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@17.0.10)
   6:        420222       36979536  io.netty.channel.DefaultChannelHandlerContext
   7:        263687       34769200  [Ljava.util.HashMap$Node; (java.base@17.0.10)
   8:        273474       26253504  java.util.concurrent.ConcurrentHashMap (java.base@17.0.10)
   9:        306730       24538400  java.util.LinkedHashMap (java.base@17.0.10)
  10:        525528       21021120  java.util.concurrent.ConcurrentHashMap$Node (java.base@17.0.10)
  11:         46471       14127184  com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler
  12:        140444       11235520  java.util.TreeMap (java.base@17.0.10)
  13:         46915       10508960  io.netty.channel.epoll.EpollSocketChannel
  14:        421213       10109112  java.util.concurrent.atomic.AtomicLong (java.base@17.0.10)
  15:         46653        8957376  io.netty.handler.traffic.TrafficCounter
  16:        176317        8463216  com.google.protobuf.MapField
  17:         94029        8274552  io.netty.util.concurrent.ScheduledFutureTask
  18:        186600        7464000  io.netty.util.DefaultAttributeMap$DefaultAttribute
  19:        223556        7153792  java.util.concurrent.ConcurrentHashMap$KeySetView (java.base@17.0.10)
  20:        141252        6780096  com.baidu.bifromq.inbox.storage.proto.TopicFilterOption
  21:        120211        6731816  com.baidu.bifromq.type.ClientInfo
  22:         98398        6297472  java.util.HashMap (java.base@17.0.10)
  23:         45324        6164064  com.baidu.bifromq.inbox.storage.proto.InboxMetadata
  24:        176317        5642144  com.google.protobuf.MapField$MutabilityAwareMap
  25:        230407        5529768  java.util.concurrent.atomic.AtomicReference (java.base@17.0.10)
  26:        164961        5278752  java.util.concurrent.CompletableFuture (java.base@17.0.10)
  27:          7440        5059200  io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
  28:         46640        4850560  com.baidu.bifromq.mqtt.handler.TenantSettings
  29:         46922        4504512  io.netty.channel.DefaultChannelPipeline$HeadContext
  30:         46653        4478688  io.netty.handler.codec.mqtt.MqttDecoder
  31:         46653        4478688  io.netty.handler.traffic.ChannelTrafficShapingHandler
  32:        176317        4231608  com.google.protobuf.MapField$ImmutableMessageConverter
  33:         46922        4129136  io.netty.channel.DefaultChannelPipeline$TailContext
  34:         46891        4126408  io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl
  35:         46640        4104320  com.baidu.bifromq.mqtt.handler.v3.MQTT3ConnectHandler
  36:        118016        3776512  com.baidu.bifromq.mqtt.service.LocalDistService$TopicFilter
  37:        118014        3776448  com.baidu.bifromq.mqtt.service.LocalDistService$LocalRoutes
  38:         93994        3759760  java.net.InetAddress$InetAddressHolder (java.base@17.0.10)
  39:         46922        3753760  io.netty.channel.DefaultChannelPipeline
  40:         46915        3753200  io.netty.channel.epoll.EpollSocketChannel$EpollSocketChannelUnsafe
  41:         46915        3753200  io.netty.channel.epoll.EpollSocketChannelConfig
  42:         93280        3731200  com.baidu.bifromq.mqtt.session.MQTTSessionAuthProvider
  43:        152437        3658488  java.util.LinkedHashMap$LinkedEntrySet (java.base@17.0.10)
  44:         62679        3510024  java.util.TreeMap$Entry (java.base@17.0.10)
  45:          8682        3504408  [I (java.base@17.0.10)
  46:         18591        3456320  java.lang.Class (java.base@17.0.10)
  47:         93293        3358496  [Lcom.baidu.bifromq.mqtt.handler.condition.Condition;
  48:         46640        3358080  com.baidu.bifromq.mqtt.handler.ConditionalSlowDownHandler
  2774:             2             96  io.netty.handler.codec.mqtt.MqttPublishMessage
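
Comparing the two histograms by hand is error-prone, so a small diff can list the classes whose instance count grew sharply on the bad node. This is a rough sketch, assuming both dumps use the `num: #instances #bytes class` row layout shown above; `HistoDiff`, `instances` and `grownBy` are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares instance counts between two heap histograms.
final class HistoDiff {
    // Row layout: rank, instance count, byte count, class name (module suffix ignored).
    private static final Pattern ROW =
        Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    /** Maps class name to instance count, skipping header and separator lines. */
    static Map<String, Long> instances(List<String> lines) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (m.find()) {
                out.put(m.group(3), Long.parseLong(m.group(1)));
            }
        }
        return out;
    }

    /** Returns classes with at least `factor` times more instances in `bad`, with the surplus. */
    static Map<String, Long> grownBy(Map<String, Long> bad, Map<String, Long> good, long factor) {
        Map<String, Long> out = new LinkedHashMap<>();
        bad.forEach((cls, n) -> {
            long base = good.getOrDefault(cls, 0L);
            if (n >= factor * Math.max(base, 1)) {
                out.put(cls, n - base);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        // Rows copied from the two histograms above.
        Map<String, Long> bad = instances(List.of(
            "   4:      14831152      474596864  java.util.concurrent.CompletableFuture (java.base@17.0.10)",
            "  26:       1060435       50900880  io.netty.handler.codec.mqtt.MqttPublishMessage",
            "  41:         45411       13804944  com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler"));
        Map<String, Long> good = instances(List.of(
            "  11:         46471       14127184  com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler",
            "  26:        164961        5278752  java.util.concurrent.CompletableFuture (java.base@17.0.10)",
            "  2774:             2             96  io.netty.handler.codec.mqtt.MqttPublishMessage"));
        // CompletableFuture and MqttPublishMessage are reported; the session handler is not.
        System.out.println(grownBy(bad, good, 10));
    }
}
```

On the rows above, the session handler count is about the same on both nodes (45411 vs 46471), while `MqttPublishMessage` goes from 2 to 1060435, which suggests the extra memory is in-flight publishes rather than extra connections.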

We suspect the cause is in DistServerCallScheduler: a gRPC call in the batcher times out and blocks, so every MqttPublishMessage is added to Batcher.callTaskBuffers, which is a ConcurrentLinkedQueue and therefore unbounded.
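
To illustrate the suspected failure mode, the sketch below wraps a `ConcurrentLinkedQueue` with an explicit capacity so that producers are rejected once the consumer stalls, instead of the queue growing without limit. This is not BifroMQ code: `BoundedCallBuffer`, `offer`, `poll` and the capacity value are hypothetical, and the real fix in the scheduler may look quite different.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a capacity-limited wrapper around an unbounded queue.
final class BoundedCallBuffer<T> {
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger size = new AtomicInteger();
    private final int capacity;

    BoundedCallBuffer(int capacity) {
        this.capacity = capacity;
    }

    /** Enqueues the task if there is room; returns false so the caller can shed load. */
    boolean offer(T task) {
        // Reserve a slot first, so the queue can never hold more than `capacity` tasks.
        if (size.incrementAndGet() > capacity) {
            size.decrementAndGet();
            return false;
        }
        queue.add(task);
        return true;
    }

    /** Removes the next task, or returns null if the buffer is empty. */
    T poll() {
        T task = queue.poll();
        if (task != null) {
            size.decrementAndGet();
        }
        return task;
    }

    int size() {
        return size.get();
    }

    public static void main(String[] args) {
        BoundedCallBuffer<String> buffer = new BoundedCallBuffer<>(2);
        System.out.println(buffer.offer("pub-1")); // true
        System.out.println(buffer.offer("pub-2")); // true
        System.out.println(buffer.offer("pub-3")); // false: consumer is stalled, buffer is full
        buffer.poll();                              // consumer drains one task
        System.out.println(buffer.offer("pub-4")); // true
    }
}
```

With a bound in place, a stalled downstream call surfaces as rejected publishes that the caller can fail fast or throttle, rather than as heap growth ending in allocation stalls.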

Environment

  • Version: [3.2.1]
  • JVM Version: [OpenJDK 17, startup flags -Xms12g -Xmx12g -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=512m -XX:MaxDirectMemorySize=12g]
  • Hardware Spec: [15c32g, 4 nodes]
  • OS: [Tencent Cloud OS]
