0%

一次线程池使用错误导致的问题

记录一次服务线程数量异常问题的排查过程

背景

通过监控发现一个服务的线程数异常多

同期CPU 内存 网络连接都没有什么异常。

排查

第一个反应就是查看线程栈

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
"pool-2493-thread-3" #3718833 prio=5 os_prio=0 tid=0x00007f1610041000 nid=0x38bff6 waiting on condition [0x00007f18ba29e000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007a044cce0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

"pool-2493-thread-2" #3718832 prio=5 os_prio=0 tid=0x00007f161003f800 nid=0x38bff5 waiting on condition [0x00007f18bd502000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007a044cce0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

"pool-2493-thread-1" #3718823 prio=5 os_prio=0 tid=0x00007f161003e000 nid=0x38bff3 waiting on condition [0x00007f18c2725000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007a044cce0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

本来想从线程栈中找到线程是从哪里创建的,但从线程栈现在只能看到线程是由ThreadPoolExecutor创建的,只能看到线程运行态的东西。

那么heap中会有吗?

名词
Object/Stack Frame java.lang.Thread @ 0x7a044af30
Name pool-2493-thread-3
Shallow Heap 120
Retained Heap 1,728
Max. Locals’ Retained Heap
Context Class Loader org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader @ 0x71d665028
Is Daemon false
State [alive, parked, waiting, waiting indefinitely]
State value 0x291

这里只是比线程栈多了一个Context Class Loader别的没有什么有用的信息。

回看下线程栈信息,发现一个规律[pool-2493-thread-1] pool后面的数字的递增的,但thread后面数字基本都是1 2 3

从这篇文章Naming Executor Service Threads and Thread Pool in Java 可以推断出,应该在代码中有位置初始化了线程池,并且核心线程数是3,在代码中搜索了一下,还真找到了这么一段代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
// import cn.hutool.core.thread.ThreadUtil;

@GetMapping(value = "/jiankunkingTransfer")
public Map<String, Object> jiankunkingTransfer(String date) {
Executor executor = ThreadUtil.newExecutor(3, 6, 30000);
// todo 执行一些逻辑处理
for (Object c : json.getJSONArray("data")) {
executor.execute(() -> {
// todo 执行一些逻辑处理
});
}
// 注意这里没有用CountDownLatch来await()
// 也没有shutdown()
}

一开始以为局部创建的线程会被GC 回收掉,然后通过Arthas vmtool强制GC了一把,发现线程并没有减少。

为啥没有回收呢?

=>
Does an ExecutorService get garbage collected when out of scope?

=> 这里提到了线程池的shutdown(),但此时的我还没有将重点放在这里。
=> 导致我认为没有被回收的线程是全局变量,一顿排查操作并没有发现全局且很大的线程池
=> 再次搜索找到查找答案:
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html

在JDK的官方文档中找到这么一段话

1
2
3
4
5
Finalization
A pool that is no longer referenced in a program AND has no remaining threads will be shutdown automatically.
If you would like to ensure that unreferenced pools are reclaimed even if users forget to call shutdown(),
then you must arrange that unused threads eventually die, by setting appropriate keep-alive times, using a
lower bound of zero core threads and/or setting allowCoreThreadTimeOut(boolean).

=>
程序中不再引用且没有剩余线程的地将自动shutdown。如果您想确保即使用户忘记调用shutdown也能回收未引用的地,那么您必须通过设置适当的保持活动时间、使用零核心线程的下限和/或设置allowCoreThreadTimeOut(boolean)来安排未使用的线程最终死亡allowCoreThreadTimeOut(boolean)

结论

通过官方文档可以看出解决这个问题有下面几种方式

  1. 线程池 设置成 成员变量 (推荐)
  2. 使用threadPoolExecutor.shutdown(); 方法关闭线程池(会等待任务执行结束)
  3. 设置核心线程超时关闭 allowCoreThreadTimeOut(true);
  4. 核心线程数设置为0, 但在使用上会带来别的麻烦(略)

问题的处理方式已经清楚了,下面再来看下shutdown()跟线程池默认的命名规则

拓展

shutdown()实现

shutdown就是将线程池状态设置为SHUTDOWN,然后中断所有空闲(空闲即阻塞在队列上)的线程,最终设置线程池状态为Terminated。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
/**
* Initiates an orderly shutdown in which previously submitted
* tasks are executed, but no new tasks will be accepted.
* Invocation has no additional effect if already shut down.
*
* <p>This method does not wait for previously submitted tasks to
* complete execution. Use {@link #awaitTermination awaitTermination}
* to do that.
*
* @throws SecurityException {@inheritDoc}
*/
public void shutdown() {

final ReentrantLock mainLock = this.mainLock;
// 加锁(全局锁)
mainLock.lock();
try {
// 权限校验,安全策略相关判断
checkShutdownAccess();
// 设置线程池状态为SHUTDOWN。
advanceRunState(SHUTDOWN);
// 中断所有的空闲的工作线程
interruptIdleWorkers();
// 空方法,留给子类实现
onShutdown(); // hook for ScheduledThreadPoolExecutor
} finally {
// 解锁
mainLock.unlock();
}
tryTerminate();
}

https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html#shutdown--

线程池默认的命名规则

java.util.concurrent.DefaultThreadFactory

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
static class DefaultThreadFactory implements ThreadFactory {
private static final AtomicInteger poolNumber = new AtomicInteger(1);
private final ThreadGroup group;
private final AtomicInteger threadNumber = new AtomicInteger(1);
private final String namePrefix;

DefaultThreadFactory() {
SecurityManager s = System.getSecurityManager();
group = (s != null) ? s.getThreadGroup() :
Thread.currentThread().getThreadGroup();
namePrefix = "pool-" +
poolNumber.getAndIncrement() +
"-thread-";
}

public Thread newThread(Runnable r) {
Thread t = new Thread(group, r,
namePrefix + threadNumber.getAndIncrement(),
0);
if (t.isDaemon())
t.setDaemon(false);
if (t.getPriority() != Thread.NORM_PRIORITY)
t.setPriority(Thread.NORM_PRIORITY);
return t;
}
}

从代码中可以看出默认线程的名字是按照=> pool-线程池编号(从1开始自增)-thread-线程池中线程数量(从1开始自增)

如何自定义线程名字可以参考下面这个
https://stackoverflow.com/questions/6113746/naming-threads-and-thread-pools-of-executorservice

欢迎关注我的其它发布渠道