[多线程] 剖析嵌套式死锁问题

在多线程模型中，锁是个复杂的东西，即便老司机有时也会翻车。

锁是配对出现的。锁上了，就要解锁，忘记解锁会产生死锁，一般这种低级错误很容易避免，然而在复杂的业务体系中，往往会产生嵌套式死锁问题，而这种问题有时藏得很深。

1. 嵌套死锁理解
2. 测试
- 2.1. Demo
- 2.2. Gdb 分析
3. 小结

1. 嵌套死锁理解

嵌套式死锁：系统中存在多个（不可重入）锁，跨线程相互调用。

下图展示了同时运行的两个线程，极有可能产生嵌套死锁问题。

设计图来源：《嵌套式死锁原理》

2. 测试

2.1. Demo

根据上述分析，写了个测试 Demo。

// g++ -g -O0 -std=c++11 test.cpp -lpthread -o t && ./t
#include <chrono>
#include <iostream>
#include <memory>
#include <mutex>
#include <thread>

std::mutex g_mtx1;
std::mutex g_mtx2;

int main() {
    std::thread t1([]() {
        std::lock_guard<std::mutex> lck(g_mtx1);
        std::cout << "thread: 1, locked by mtx1\n";
        std::this_thread::sleep_for(std::chrono::seconds(1));

        std::cout << "thread: 1, waitting to unlock mtx2\n";
        std::lock_guard<std::mutex> lck2(g_mtx2);
        std::cout << "thread: 1, locked by mtx2\n";

        std::this_thread::sleep_for(std::chrono::seconds(1));
        std::cout << "thread: 1, done!!!\n";
    });

    std::thread t2([]() {
        std::lock_guard<std::mutex> lck(g_mtx2);
        std::cout << "thread: 2, locked by mtx2\n";
        std::this_thread::sleep_for(std::chrono::seconds(1));

        std::cout << "thread: 2, waitting to unlock mtx1\n";
        std::lock_guard<std::mutex> lck2(g_mtx1);
        std::cout << "thread: 2, locked by mtx1\n";

        std::this_thread::sleep_for(std::chrono::seconds(1));
        std::cout << "thread: 2, done!!!\n";
    });

    t1.join();
    t2.join();
    std::cout << "finished!" << std::endl;
    return 0;
}

// 输出：
// thread: 1, locked by mtx1
// thread: 2, locked by mtx2
// thread: 1, waitting to unlock mtx2
// thread: 2, waitting to unlock mtx1

2.2. Gdb 分析

通过 Gdb 工具绑定死锁进程，然后查看死锁程序函数堆栈。死锁分别发生在：

test.cpp : 18
test.cpp : 31

(gdb) thread apply all bt

Thread 3 (Thread 0x7f8d9c793700 (LWP 2605)):
#0  __lll_lock_wait ()
#1  0x00007f8d9d38be9b in _L_lock_883 ()
#2  0x00007f8d9d38bd68 in __GI___pthread_mutex_lock ()
#3  0x00000000004010ad in __gthread_mutex_lock ()
#4  0x0000000000403284 in std::mutex::lock (this=0x607320 <g_mtx2>)
#5  0x0000000000403456 in std::lock_guard<std::mutex>::lock_guard ()
#6  0x00000000004011d5 in __lambda0::operator() at test.cpp:18
...

Thread 2 (Thread 0x7f8d9bf92700 (LWP 2606)):
#0  __lll_lock_wait ()
#1  0x00007f8d9d38be9b in _L_lock_883 ()
#2  0x00007f8d9d38bd68 in __GI___pthread_mutex_lock ()
#3  0x00000000004010ad in __gthread_mutex_lock ()
#4  0x0000000000403284 in std::mutex::lock ()
#5  0x0000000000403456 in std::lock_guard<std::mutex>::lock_guard ()
#6  0x00000000004012c5 in __lambda1::operator() at test.cpp:31
...

3. 小结

在一个已上锁的功能单元里，尽量不要再使用其它锁，或者调用其它有锁的函数。
锁的粒度（区域）应该尽量小；锁是锁数据的，不是锁逻辑的。在一个函数里，函数入口加锁，函数退出解锁，这种操作看似方便，实则隐藏了很多问题。假如锁住了插入数据库语句逻辑，刚好这个数据库堵了，那么整个多线程系统有可能卡住。