
A hang analysis of a .NET construction modeling application

2024-03-19

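I: Background

1. The story

A few days ago a friend reached out to me on WeChat: his software had hung, he had looked into it himself and could not work out why, and he asked me to take a look. As many of you know, I analyze dumps for free. Of course I cannot crack every dump; I just do my best to help people narrow the problem down. Since we have a dump, let's start the analysis.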

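II: WinDbg analysis

1. Why did it hang?

Different kinds of programs call for different approaches to a hang. My friend said this is a WinForms program, so the main thread is the first place to look, using the k command.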


0:000> k 25
 # Child-SP          RetAddr           Call Site
00 00000000`007fc8d8 00007ffd`87439b13 ntdll!NtWaitForAlertByThreadId+0x14
01 00000000`007fc8e0 00007ffd`87439a06 ntdll!RtlpWaitOnAddressWithTimeout+0x43
02 00000000`007fc910 00007ffd`8743987d ntdll!RtlpWaitOnAddress+0xae
03 00000000`007fc980 00007ffd`87435fdc ntdll!RtlpWaitOnCriticalSection+0xd9
04 00000000`007fc9f0 00007ffd`87435ef0 ntdll!RtlpEnterCriticalSectionContended+0xdc
05 00000000`007fca20 00007ffd`536839ea ntdll!RtlEnterCriticalSection+0x40
06 00000000`007fca50 00007ffd`5368470a AcLayers!NS_VirtualRegistry::CRegLock::CRegLock+0x1a
07 00000000`007fca90 00007ffd`536726d2 AcLayers!NS_VirtualRegistry::APIHook_RegOpenKeyExW+0x2a
08 00000000`007fcb10 00007ffd`778e550b AcLayers!NS_WRPMitigation::APIHook_RegOpenKeyExW+0x42
09 00000000`007fcb60 00007ffd`778e5437 xxx!GetCodePageForFont+0xa7
0a 00000000`007fcc90 00007ffd`778e5296 xxx!CToolTipsMgr::NewFont+0x113
0b 00000000`007fcda0 00007ffd`778e18f9 xxx!CToolTipsMgr::LoadTheme+0xb2
0c 00000000`007fcdd0 00007ffd`84b9ca66 xxx!CToolTipsMgr::s_ToolTipsWndProc+0x1b9
0d 00000000`007fce10 00007ffd`84b9c34b user32!UserCallWinProcCheckWow+0x266
0e 00000000`007fcf90 00007ffd`4f36b1cc user32!CallWindowProcW+0x8b
0f 00000000`007fcfe0 00007ffd`4f39ccac System_Windows_Forms_ni!System.Windows.Forms.NativeWindow.DefWndProc+0x9c
10 00000000`007fd090 00007ffd`4f39cc05 System_Windows_Forms_ni!System.Windows.Forms.ToolTip.WndProc+0x9c
11 00000000`007fd260 00007ffd`4f36a3a3 System_Windows_Forms_ni!System.Windows.Forms.ToolTip.ToolTipNativeWindow.WndProc+0x15
12 00000000`007fd290 00007ffd`4f9e1161 System_Windows_Forms_ni!System.Windows.Forms.NativeWindow.Callback+0xc3
13 00000000`007fd330 00007ffd`52c8222e System_Windows_Forms_ni+0x8d1161
14 00000000`007fd3a0 00007ffd`84b9ca66 clr!UMThunkStub+0x6e
15 00000000`007fd430 00007ffd`84b9c78c user32!UserCallWinProcCheckWow+0x266
16 00000000`007fd5b0 00007ffd`84bb3b32 user32!DispatchClientMessage+0x9c
17 00000000`007fd610 00007ffd`874c22c4 user32!__fnINLPCREATESTRUCT+0xa2
18 00000000`007fd670 00007ffd`836a1f24 ntdll!KiUserCallbackDispatcherContinue
19 00000000`007fd7e8 00007ffd`84ba15df win32u!NtUserCreateWindowEx+0x14
1a 00000000`007fd7f0 00007ffd`84ba11d4 user32!VerNtUserCreateWindowEx+0x20f
1b 00000000`007fdb80 00007ffd`84ba1012 user32!CreateWindowInternal+0x1b4
1c 00000000`007fdce0 00007ffd`4f3e8098 user32!CreateWindowExW+0x82
1d 00000000`007fdd70 00007ffd`4f3696f0 System_Windows_Forms_ni+0x2d8098
...

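Reading the hexagram, the main thread is clearly stuck in NtWaitForAlertByThreadId, which is not where a UI thread should be sitting. Let's read the thread stack more carefully.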

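  • DispatchClientMessage

    user32 is delivering a window message to a window procedure. The frames beneath it (CreateWindowExW, NtUserCreateWindowEx, __fnINLPCREATESTRUCT) suggest this is a creation-time message delivered while the ToolTip window is being created, rather than a message another thread sent over via Invoke.

  • LoadTheme

    The tooltip manager is loading its theme (and, via NewFont, a font) on the main thread.

  • APIHook_RegOpenKeyExW

    First, a word about AcLayers.dll. The technical term for it is a shim; see the book "Software Debugging" for details. Its main job is to paper over system-level compatibility problems. Here you can see that its registry-open hook takes a lock (CRegLock) before doing anything else.

In unmanaged code a lock is usually implemented with a critical section (CriticalSection). So who is holding the critical section this thread is waiting for?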
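The reading of the stack above can be done mechanically when there are many threads to check. Below is a minimal Python sketch, assuming the `k` output has been saved as text; `parse_frames` and `lock_caller` are hypothetical helper names, not WinDbg features. It finds the first non-ntdll frame above ntdll!RtlEnterCriticalSection, which is the code that asked for the lock.

```python
import re

# One frame line of WinDbg `k` output: frame no., Child-SP, RetAddr, Call Site.
FRAME_RE = re.compile(
    r"^\s*([0-9a-f]{2})\s+([0-9a-f`]+)\s+([0-9a-f`]+)\s+(\S+)\s*$", re.I)


def parse_frames(k_output):
    """Return a list of (frame_no, module, symbol) tuples, top of stack first."""
    frames = []
    for line in k_output.splitlines():
        m = FRAME_RE.match(line)
        if not m:
            continue  # header line, "..." or blank line
        call_site = m.group(4)
        if "!" in call_site:
            module, symbol = call_site.split("!", 1)
            symbol = symbol.split("+")[0]      # drop the +0x.. offset
        else:
            module, symbol = call_site.split("+")[0], ""
        frames.append((int(m.group(1), 16), module, symbol))
    return frames


def lock_caller(frames):
    """First non-ntdll frame above ntdll!RtlEnterCriticalSection, else None."""
    for i, (_, module, symbol) in enumerate(frames):
        if module == "ntdll" and symbol == "RtlEnterCriticalSection":
            for frame in frames[i + 1:]:
                if frame[1] != "ntdll":
                    return frame
    return None


SAMPLE = """\
 # Child-SP          RetAddr           Call Site
04 00000000`007fc9f0 00007ffd`87435ef0 ntdll!RtlpEnterCriticalSectionContended+0xdc
05 00000000`007fca20 00007ffd`536839ea ntdll!RtlEnterCriticalSection+0x40
06 00000000`007fca50 00007ffd`5368470a AcLayers!NS_VirtualRegistry::CRegLock::CRegLock+0x1a
"""

print(lock_caller(parse_frames(SAMPLE)))
```

For the three sample frames this reports frame 6 in AcLayers, the CRegLock constructor, which matches what we read off the stack by eye.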

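2. Who holds the critical section

To find out who holds a lock you can use !cs -l or !locks. A word of caution, though: in real-world dump analysis their output is sometimes inaccurate, so a more reliable approach is to pull the information off the thread stack. How? Just find the first argument passed to ntdll!RtlEnterCriticalSection. Its signature is: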


    VOID RtlEnterCriticalSection(
        PRTL_CRITICAL_SECTION CriticalSection
    );

On x64 the first argument is passed in rcx, so next we disassemble the code leading up to 00007ffd`536839ea and see how rcx is set up.


    0:000> ub 00007ffd`536839ea
    AcLayers!NS_VirtualRegistry::OPENKEY::AddEnumEntries<NS_VirtualRegistry::VIRTUALVAL>+0x11a:
    00007ffd`536839ce cc int3
    00007ffd`536839cf cc int3
    AcLayers!NS_VirtualRegistry::CRegLock::CRegLock:
    00007ffd`536839d0 48895c2408 mov qword ptr [rsp+8],rbx
    00007ffd`536839d5 57 push rdi
    00007ffd`536839d6 4883ec30 sub rsp,30h
    00007ffd`536839da 488bf9 mov rdi,rcx
    00007ffd`536839dd 488d0d4c7f0300 lea rcx,[AcLayers!NS_VirtualRegistry::csRegCriticalSection (00007ffd`536bb930)]
    00007ffd`536839e4 ff15ae660100 call qword ptr [AcLayers!_imp_EnterCriticalSection (00007ffd`5369a098)]
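The bracketed addresses WinDbg prints for the lea and the call can be checked by hand, because both instructions use RIP-relative addressing: target = address of the next instruction + signed 32-bit displacement. Here is a small Python sketch of that arithmetic; `rip_relative_target` is a hypothetical helper, and it assumes the displacement is the last four bytes of the instruction (true for these two, not for instructions that carry a trailing immediate).

```python
import struct


def rip_relative_target(insn_addr, insn_bytes):
    """Resolve [rip+disp32] for an instruction whose last 4 bytes are disp32."""
    disp = struct.unpack("<i", insn_bytes[-4:])[0]   # signed, little-endian
    next_insn = insn_addr + len(insn_bytes)          # RIP points past the instruction
    return next_insn + disp


# Bytes copied from the disassembly above.
lea_rcx = bytes.fromhex("488d0d4c7f0300")   # lea rcx,[rip+disp32]
call_imp = bytes.fromhex("ff15ae660100")    # call qword ptr [rip+disp32]

print(hex(rip_relative_target(0x7ffd536839dd, lea_rcx)))   # 0x7ffd536bb930
print(hex(rip_relative_target(0x7ffd536839e4, call_imp)))  # 0x7ffd5369a098
```

The first result is the value loaded into rcx, the address of the critical section itself. The second is the address of the import table slot that the call reads its target from, not the address of the function being called.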

Reading the hexagram, this is auspicious: rcx turns out to be the address of a global variable, AcLayers!NS_VirtualRegistry::csRegCriticalSection. Next, use !cs to see who is holding it.


    0:000> !cs AcLayers!NS_VirtualRegistry::csRegCriticalSection
    -----------------------------------------
    Critical section = 0x00007ffd536bb930 (AcLayers!NS_VirtualRegistry::csRegCriticalSection+0x0)
    DebugInfo = 0x000000001c4e58e0
    LOCKED
    LockCount = 0x2
    WaiterWoken = No
    OwningThread = 0x0000000000001d20
    RecursionCount = 0x1
    LockSemaphore = 0xFFFFFFFF
    SpinCount = 0x00000000020007ce

Another auspicious hexagram: the current owner is thread 1d20. So what is that thread doing?
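When several locks need chasing, the step from !cs output to the next command is easy to script. The sketch below is an illustration only: `parse_cs` and `owner_switch_command` are hypothetical helpers, and it assumes the !cs text has the "Name = value" layout shown above. It pulls out OwningThread and builds the `~~[tid]s` command that switches to that thread.

```python
import re


def parse_cs(cs_output):
    """Collect 'Name = value' pairs from !cs output into a dict of strings."""
    fields = {}
    for line in cs_output.splitlines():
        m = re.match(r"\s*([A-Za-z ]+?)\s*=\s*(\S+)", line)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields


def owner_switch_command(cs_output):
    """Return the WinDbg command that switches to the owning thread, or None."""
    fields = parse_cs(cs_output)
    if "OwningThread" not in fields:
        return None
    tid = int(fields["OwningThread"], 16)
    if tid == 0:
        return None          # nobody owns the critical section
    return "~~[%x]s" % tid


SAMPLE_CS = """\
-----------------------------------------
Critical section = 0x00007ffd536bb930 (AcLayers!NS_VirtualRegistry::csRegCriticalSection+0x0)
DebugInfo = 0x000000001c4e58e0
LOCKED
LockCount = 0x2
WaiterWoken = No
OwningThread = 0x0000000000001d20
RecursionCount = 0x1
"""

print(owner_switch_command(SAMPLE_CS))
```

The thread id printed by !cs is hexadecimal, and `~~[...]` also takes a hexadecimal id, so no conversion to decimal is needed when typing the command by hand.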

3. Why thread 1d20 holds the lock without releasing it

The case has moved forward a step. Let's switch to that thread and look at its stack.


    0:000> ~~[1d20]s
    ntdll!NtDelayExecution+0x14:
    00007ffd`874bec14 c3 ret
    0:028> kL
    # Child-SP RetAddr Call Site
    00 00000000`33ccd948 00007ffd`83955381 ntdll!NtDelayExecution+0x14
    01 00000000`33ccd950 00007ffd`6d4a2361 KERNELBASE!SleepEx+0xa1
    02 00000000`33ccd9f0 00007ffd`8520a75c perfts!CloseLagPerfData+0x21
    03 00000000`33ccda30 00007ffd`85209ccd advapi32!CloseExtObjectLibrary+0xec
    04 00000000`33ccda90 00007ffd`8396dc6a advapi32!PerfRegCloseKey+0x15d
    05 00000000`33ccdae0 00007ffd`839715e6 KERNELBASE!BaseRegCloseKeyInternal+0x72
    06 00000000`33ccdb10 00007ffd`83935209 KERNELBASE!ClosePredefinedHandle+0x96
    07 00000000`33ccdb40 00007ffd`53685d71 KERNELBASE!RegCloseKey+0x149
    08 00000000`33ccdba0 00007ffd`53683ae5 AcLayers!NS_VirtualRegistry::CVirtualRegistry::CloseKey+0xbd
    09 00000000`33ccdbf0 00007ffd`51c7737e AcLayers!NS_VirtualRegistry::APIHook_RegCloseKey+0x25
    0a 00000000`33ccdc30 00007ffd`51bf4be2 mscorlib_ni+0x58737e
    0b 00000000`33ccdce0 00007ffd`513c356a mscorlib_ni!Microsoft.Win32.RegistryKey.Dispose+0x72
    0c 00000000`33ccdd20 00007ffd`513c34b9 System_ni!System.Diagnostics.PerformanceCounterLib.GetStringTable+0x41a
    ...
    13 00000000`33cce050 00007ffd`513bfe3c System_ni!System.Diagnostics.PerformanceCounter..ctor+0xd7
    14 00000000`33cce0a0 00007ffc`f45cb2ce System_ni!System.Diagnostics.PerformanceCounter..ctor+0x1c
    15 00000000`33cce0d0 00007ffc`f45cb14c 0x00007ffc`f45cb2ce
    16 00000000`33cce120 00007ffc`f45cb023 0x00007ffc`f45cb14c
    ...

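From the hexagram, this thread appears to be sleeping inside CloseLagPerfData while it closes something down. We can disassemble the code leading up to 00007ffd`6d4a2361 to see how long it waits.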


    0:028> ub 00007ffd`6d4a2361
    perfts!CloseLagPerfData+0x5:
    00007ffd`6d4a2345 55 push rbp
    00007ffd`6d4a2346 488bec mov rbp,rsp
    00007ffd`6d4a2349 4883ec30 sub rsp,30h
    00007ffd`6d4a234d e8720e0000 call perfts!LagCounterManager::Cleanup (00007ffd`6d4a31c4)
    00007ffd`6d4a2352 33db xor ebx,ebx
    00007ffd`6d4a2354 eb0b jmp perfts!CloseLagPerfData+0x21 (00007ffd`6d4a2361)
    00007ffd`6d4a2356 b964000000 mov ecx,64h
    00007ffd`6d4a235b ff15c74e0000 call qword ptr [perfts!_imp_Sleep (00007ffd`6d4a7228)]
    ...

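The mov ecx,64h tells us the call is Sleep(100), that is, 100 milliseconds per call, and the jmp to CloseLagPerfData+0x21 just before it suggests the sleep sits inside a polling loop. I did not have time to chase the remaining details. Either way, the whole thing is triggered by the PerformanceCounter class higher up the stack, so here we borrow the 4S dealership approach (swap the part rather than repair it): I asked my friend whether he could avoid calling PerformanceCounter altogether. If we stay out of its way, we are fine.

[Screenshot from the original post, not reproduced here]

After removing it, my friend reported that the problem was gone.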
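The shape of this hang is easy to reproduce in miniature: one thread takes a lock and then polls with a short sleep for a condition that is slow to arrive, while the UI thread needs the same lock. The Python sketch below is only a simulation under that assumption. `registry_lock` stands in for csRegCriticalSection and `cleanup_done` for whatever CloseLagPerfData is waiting on; none of these names come from AcLayers or perfts.

```python
import threading
import time

registry_lock = threading.Lock()    # stands in for csRegCriticalSection
cleanup_done = threading.Event()    # the condition the worker polls for
lock_taken = threading.Event()      # lets the main thread know the lock is held


def worker():
    """Plays thread 1d20: holds the lock for the whole polling wait."""
    with registry_lock:
        lock_taken.set()
        while not cleanup_done.is_set():
            time.sleep(0.1)         # same shape as Sleep(100) in a loop


def ui_thread_try_open_key(timeout):
    """Plays the main thread: True if the lock was obtained within `timeout` s."""
    got = registry_lock.acquire(timeout=timeout)
    if got:
        registry_lock.release()
    return got


t = threading.Thread(target=worker)
t.start()
lock_taken.wait()

blocked = not ui_thread_try_open_key(0.3)   # worker is still polling
cleanup_done.set()                          # let the worker finish
t.join()
recovered = ui_thread_try_open_key(0.3)     # lock is free again

print(blocked, recovered)
```

In the real dump the main thread waits with no timeout, so as long as the polling thread never sees its condition, the window never repaints. That is why removing the PerformanceCounter call, and with it the slow close path, makes the hang disappear.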

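III: Summary

Oddly enough, I have recently run into two separate hangs caused by PerformanceCounter. I am leaving the experience here in the hope that those who come later step into fewer pits.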