Analyzing deadlocks in an Android native service

Creative Commons
This work is licensed under Creative Commons Attribution.

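This article briefly describes how to analyze a deadlock in an Android native service, and explains how the mutex implementations in bionic and glibc differ.

When threads in an Android native service contend for a mutex-protected resource and deadlock, we can attach gdb to the native service and find the deadlock chain by inspecting the state of the mutex.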

Procedure

1. Find the pid of the native service

Assume the native service's process is named test. Run:

ps -A | grep test

The output shows that the pid is 359:

root 359 1 226924 9496 binder_thread_read 0 S

2. Attach gdb to the test process

gdb -p 359

3. Run bt on every thread

After attaching, the process is paused. The following command lists the call stack of every thread in the test process:

(gdb) thread apply all bt

Look through each call stack for a thread that is waiting on a mutex. For example:

#0  0xe9f0d22c in syscall () from /apex/com.android.runtime/lib/bionic/libc.so
(gdb) bt
#0 0xe9f0d22c in syscall () from /apex/com.android.runtime/lib/bionic/libc.so
#1 0xe9f12392 in __futex_wait_ex(void volatile*, bool, int, bool, timespec const*) () from /apex/com.android.runtime/lib/bionic/libc.so
#2 0xe9f5ad12 in NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*) () from /apex/com.android.runtime/lib/bionic/libc.so
....
#11 0xe751e404 in TaskWrapFunc (arg=0xe82a0018 <g_stTask+56>)
#12 0xe9f5a12c in __pthread_start(void*) ()
from /apex/com.android.runtime/lib/bionic/libc.so
#13 0xe9f12fee in __start_thread ()
from /apex/com.android.runtime/lib/bionic/libc.so
(gdb)

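We now know that the thread running TaskWrapFunc is blocked because it cannot acquire the mutex. The next step is to find out which thread holds that mutex.

4. Load a libc with symbols

The libc on the Android board is stripped, so gdb cannot show the arguments and variables in each frame.
Copy out\target\product\xxx\symbols\apex\com.android.runtime\ from the build directory to the /data partition of the board, keeping the directory structure /data/symbols/apex/com.android.runtime/, then point gdb at the libc with symbols: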

(gdb) set solib-absolute-prefix /data/symbols
(gdb) set solib-search-path /data/symbols

5. Find the mutex owner

Running bt again now shows full information for every frame:

(gdb) bt
#0 syscall () at bionic/libc/arch-arm/bionic/syscall.S:44
#1 0xe9f12392 in __futex (ftx=0xe98989f0, op=137, value=301301762,
timeout=0x0, bitset=-1) at bionic/libc/private/bionic_futex.h:45
#2 FutexWithTimeout (ftx=<optimized out>, op=137, value=<optimized out>,
use_realtime_clock=<optimized out>, abs_timeout=<optimized out>,
bitset=-1) at bionic/libc/bionic/bionic_futex.cpp:58
#3 __futex_wait_ex (ftx=0xe98989f0, shared=<optimized out>, value=301301762,
use_realtime_clock=<optimized out>, abs_timeout=0x0)
at bionic/libc/bionic/bionic_futex.cpp:63
#4 0xe9f5ad12 in NonPI::RecursiveOrErrorcheckMutexWait (mutex=0xe98989f0,
shared=0, old_state=<optimized out>, use_realtime_clock=false,
abs_timeout=0x0) at bionic/libc/bionic/pthread_mutex.cpp:705
#5 NonPI::MutexLockWithTimeout (mutex=0xe98989f0, use_realtime_clock=false,
abs_timeout_or_null=0x0) at bionic/libc/bionic/pthread_mutex.cpp:784
#6 0xe751c94c in TD_OS_MutexLock (pMutex=0xe98989f0)
...
#14 0xe751e404 in TaskWrapFunc (arg=0xe82a0018 <g_stTask+56>)
at /home/cd00010/vestel_mp_idtv/vestel_n33007_mb250_mp_idtv/TV/code/platform/src/system/os/td_os.c:2827
#15 0xe9f5a12c in __pthread_start (arg=0xea0121c0)
at bionic/libc/bionic/pthread_create.cpp:347
#16 0xe9f12fee in __start_thread (fn=0xe9f5a103 <__pthread_start(void*)>,
arg=<optimized out>) at bionic/libc/bionic/clone.cpp:53

Switch to frame 4 and print the mutex the thread is waiting on. Its owner is the thread whose tid is 4597:

(gdb) f 4
#4 0xe9f5ad12 in NonPI::RecursiveOrErrorcheckMutexWait (mutex=0xe98989f0,
shared=0, old_state=<optimized out>, use_realtime_clock=false,
abs_timeout=0x0) at bionic/libc/bionic/pthread_mutex.cpp:705
705 bionic/libc/bionic/pthread_mutex.cpp: No such file or directory.
(gdb) p *mutex
$3 = {state = 32770, {owner_tid = 4597, pi_mutex_id = 4597}}
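The numbers in this session can also be decoded by hand. The sketch below assumes the 16-bit state layout described in the comments of bionic's pthread_mutex.cpp (bits 15-14 mutex type, bit 13 process-shared flag, bits 12-2 recursion counter, bits 1-0 lock state) and, for the futex word, a 32-bit little-endian build in which the 16-bit owner_tid is stored directly after state. The struct and function names are hypothetical and are not part of bionic.

```c
#include <stdint.h>

/* Hypothetical helper type, not part of bionic. */
struct mutex_state_fields {
    unsigned type;       /* 0 = normal, 1 = recursive, 2 = errorcheck */
    unsigned shared;     /* 1 if the mutex is process-shared */
    unsigned counter;    /* recursion depth minus one */
    unsigned lock_state; /* 0 = unlocked, 1 = locked, 2 = locked with waiters */
};

/* Split the 16-bit state value, assuming the bit layout named in the lead-in. */
static struct mutex_state_fields decode_mutex_state(uint16_t state)
{
    struct mutex_state_fields f;
    f.type = (state >> 14) & 0x3u;
    f.shared = (state >> 13) & 0x1u;
    f.counter = (state >> 2) & 0x7ffu;
    f.lock_state = state & 0x3u;
    return f;
}

/* Under the assumed 32-bit little-endian layout the futex word spans both
 * fields: low 16 bits = state, high 16 bits = owner_tid. */
static unsigned owner_tid_from_futex_word(uint32_t word)
{
    return word >> 16;
}

static uint16_t state_from_futex_word(uint32_t word)
{
    return (uint16_t)(word & 0xffffu);
}
```

With the values from this session, state = 32770 (0x8002) decodes to type 2 (errorcheck), not shared, counter 0, lock state 2 (locked with waiters). The futex value 301301762 (0x11F58002) seen in frame 1 splits into owner tid 4597 and state 32770, which agrees with the output of p *mutex.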

6. Find the owner thread (tid 4597) in the earlier output for all threads

Thread 70 (LWP 4597):

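Switch to the thread that holds the mutex and inspect its call path with bt. Here the thread took the mutex and has been sleeping ever since; combined with the source code, this is enough to find the root cause.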

(gdb) t 70
[Switching to thread 70 (LWP 4597)]
#0 nanosleep ()
at out/soong/.intermediates/bionic/libc/syscalls-arm.S/gen/syscalls-arm.S:1822
1822 out/soong/.intermediates/bionic/libc/syscalls-arm.S/gen/syscalls-arm.S: No such file or directory.

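Troubleshooting

If the owner_tid shown in step 5 is 0, the cause is that Android's bionic libc is implemented differently from glibc: bionic records owner_tid only when the mutex type is not PTHREAD_MUTEX_NORMAL.

Solution

Set the mutex type to PTHREAD_MUTEX_ERRORCHECK when initializing the mutex: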

pthread_mutexattr_t attr;
pthread_mutexattr_init(&attr);
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
pthread_mutex_init( pstMtx, &attr );
pthread_mutexattr_destroy(&attr);
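To see what the error-check type changes from the caller's side, the following sketch wraps the same initialization in a helper and locks one mutex twice from a single thread. The helper names are hypothetical; the EDEADLK result of the second lock is the behavior POSIX specifies for PTHREAD_MUTEX_ERRORCHECK.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: initialize *mtx as an error-check mutex.
 * Returns 0 on success or an errno value. */
static int init_errorcheck_mutex(pthread_mutex_t *mtx)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(mtx, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Lock the same error-check mutex twice from one thread and return the
 * result of the second lock (or an earlier init/lock error). */
static int relock_result(void)
{
    pthread_mutex_t mtx;
    int rc = init_errorcheck_mutex(&mtx);
    if (rc != 0)
        return rc;
    rc = pthread_mutex_lock(&mtx);
    if (rc == 0) {
        rc = pthread_mutex_lock(&mtx);  /* second lock by the current owner */
        pthread_mutex_unlock(&mtx);     /* release the first lock */
    }
    pthread_mutex_destroy(&mtx);
    return rc;
}
```

A self-deadlock that would hang silently with a normal mutex is reported as an EDEADLK return value, and because the type is no longer PTHREAD_MUTEX_NORMAL, bionic also records owner_tid for gdb to read.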

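Code walkthrough

Android's pthread implementation lives in bionic/libc/bionic/, and locking a mutex ends up in MutexLockWithTimeout.
If no type has been set, the type defaults to NORMAL. The mutex is locked directly and the function returns without recording owner_tid: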

if ( __predict_true(mtype == MUTEX_TYPE_BITS_NORMAL) ) {
    return NormalMutexLock(mutex, shared, use_realtime_clock, abs_timeout_or_null);
}

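Once a type is set, the owner is checked first. If the calling thread already owns the mutex, an error-check mutex does not support recursive locking and returns EDEADLK, whereas a recursive mutex increments its counter: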

pid_t tid = __get_thread()->tid;
if (tid == atomic_load_explicit(&mutex->owner_tid, memory_order_relaxed)) {
    if (mtype == MUTEX_TYPE_BITS_ERRORCHECK) {
        return EDEADLK;
    }
    return RecursiveIncrement(mutex, old_state);
}

owner_tid is saved when the lock is acquired:

if (old_state == unlocked) {
    // If exchanged successfully, an acquire fence is required to make
    // all memory accesses made by other threads visible to the current CPU.
    if (__predict_true(atomic_compare_exchange_strong_explicit(&mutex->state, &old_state,
                           locked_uncontended, memory_order_acquire, memory_order_relaxed))) {
        atomic_store_explicit(&mutex->owner_tid, tid, memory_order_relaxed);
        return 0;
    }
}
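The interplay of the owner check and the compare-and-exchange can be modelled outside bionic. The sketch below is a simplified, hypothetical try-lock built on C11 atomics: it takes the thread id as a parameter, uses only two states, and never blocks, so it illustrates the bookkeeping rather than reproducing bionic's code.

```c
#include <errno.h>
#include <stdatomic.h>

/* Toy model, not bionic's types. state: 0 = unlocked, 1 = locked.
 * owner_tid: 0 means "no owner", so callers must pass a nonzero tid. */
struct toy_mutex {
    atomic_int state;
    atomic_int owner_tid;
};

static void toy_init(struct toy_mutex *m)
{
    atomic_init(&m->state, 0);
    atomic_init(&m->owner_tid, 0);
}

/* Error-check style try-lock: 0 on success, EDEADLK if tid already owns
 * the mutex, EBUSY if another thread owns it. */
static int toy_trylock(struct toy_mutex *m, int tid)
{
    if (tid == atomic_load_explicit(&m->owner_tid, memory_order_relaxed))
        return EDEADLK;

    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(&m->state, &expected, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed)) {
        /* Record the owner only after the state change succeeded. */
        atomic_store_explicit(&m->owner_tid, tid, memory_order_relaxed);
        return 0;
    }
    return EBUSY;
}

/* Unlock: EPERM if tid is not the recorded owner, otherwise 0. */
static int toy_unlock(struct toy_mutex *m, int tid)
{
    if (tid != atomic_load_explicit(&m->owner_tid, memory_order_relaxed))
        return EPERM;
    /* Clear the owner before releasing the state word. */
    atomic_store_explicit(&m->owner_tid, 0, memory_order_relaxed);
    atomic_store_explicit(&m->state, 0, memory_order_release);
    return 0;
}
```

Because the owner is written only after the exchange succeeds, a debugger that reads owner_tid while the mutex is held sees the tid of the thread that won the exchange, which is the property the gdb session relies on.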

When the lock is contended, error-check and recursive mutexes wait through RecursiveOrErrorcheckMutexWait:

if (RecursiveOrErrorcheckMutexWait(mutex, shared, old_state, use_realtime_clock,
                                   abs_timeout_or_null) == -ETIMEDOUT) {
    return ETIMEDOUT;
}
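From the caller's side, the ETIMEDOUT path is reached through pthread_mutex_timedlock. The sketch below, with hypothetical helper names and an arbitrary 50 ms timeout, holds an error-check mutex in one thread and lets a second thread try to lock it with a deadline. Depending on the toolchain, it may need to be linked with -pthread.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t g_mtx;   /* hypothetical mutex shared by both threads */

/* Waiter thread: try to lock g_mtx, giving up 50 ms from now. */
static void *waiter(void *arg)
{
    int *result = arg;
    struct timespec deadline;

    /* pthread_mutex_timedlock takes an absolute CLOCK_REALTIME time. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_nsec += 50 * 1000 * 1000;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    *result = pthread_mutex_timedlock(&g_mtx, &deadline);
    if (*result == 0)
        pthread_mutex_unlock(&g_mtx);
    return NULL;
}

/* Hold g_mtx in the calling thread while the waiter runs and return what
 * the waiter's timed lock reported. */
static int timedlock_result_while_held(void)
{
    pthread_mutexattr_t attr;
    pthread_t thread;
    int result = -1;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&g_mtx, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&g_mtx);
    if (pthread_create(&thread, NULL, waiter, &result) == 0)
        pthread_join(thread, NULL);   /* the waiter gives up while we hold the lock */
    pthread_mutex_unlock(&g_mtx);
    pthread_mutex_destroy(&g_mtx);
    return result;
}
```

The waiter receives ETIMEDOUT because the owner never releases the mutex before the deadline. A service that locks with a timeout in this way can log the failure instead of blocking forever, at the cost of having to handle the error at every call site.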

References

http://aospxref.com/android-12.0.0_r3/xref/bionic/libc/bionic/pthread_mutex.cpp