[LU-6678] calling ldlm_revoke_export_locks repeatedly while running racer seems to cause deadlock Created: 03/Jun/15 Updated: 10/Oct/21 Resolved: 10/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Kit Westneat | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
John Hammond and Nathan Lavender recently reported an issue where making nodemap configuration changes while running racer can lead to a deadlock. After reproducing it on my machine, I found that the code seems to get stuck in ldlm_revoke_export_locks. Looking at the debug trace, it never leaves cfs_hash_for_each_empty:

...
00000001:00000040:2.0:1433251090.069004:0:24913:0:(hash.c:1704:cfs_hash_for_each_empty()) Try to empty hash: a9afca12-80f8-a, loop: 4390396
00000001:00000001:2.0:1433251090.069004:0:24913:0:(hash.c:1600:cfs_hash_for_each_relax()) Process entered
00010000:00000001:2.0:1433251090.069006:0:24913:0:(ldlm_lock.c:189:ldlm_lock_put()) Process entered
00010000:00000001:2.0:1433251090.069006:0:24913:0:(ldlm_lock.c:222:ldlm_lock_put()) Process leaving
00000001:00000040:2.0:1433251090.069007:0:24913:0:(hash.c:1704:cfs_hash_for_each_empty()) Try to empty hash: a9afca12-80f8-a, loop: 4390397
00000001:00000001:2.0:1433251090.069008:0:24913:0:(hash.c:1600:cfs_hash_for_each_relax()) Process entered
00010000:00000001:2.0:1433251090.069010:0:24913:0:(ldlm_lock.c:189:ldlm_lock_put()) Process entered
00010000:00000001:2.0:1433251090.069010:0:24913:0:(ldlm_lock.c:222:ldlm_lock_put()) Process leaving
00000001:00000040:2.0:1433251090.069011:0:24913:0:(hash.c:1704:cfs_hash_for_each_empty()) Try to empty hash: a9afca12-80f8-a, loop: 4390398
...

It looks like some locks eventually time out:

00010000:00000001:0.0:1433298387.132153:0:2831:0:(ldlm_request.c:97:ldlm_expired_completion_wait()) Process entered
00010000:02000400:0.0:1433298387.132154:0:2831:0:(ldlm_request.c:105:ldlm_expired_completion_wait()) lock timed out (enqueued at 1433298087, 300s ago)
00010000:00010000:0.0:1433298387.132162:0:2831:0:(ldlm_request.c:111:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1433298087, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-test-MDT0000_UUID lock: ffff880061a31540/0xf4d215ae26653a36 lrc: 3/1,0 mode: --/PR res: [0x20000e690:0x1:0x0].0 bits 0x13 rrc: 25 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 2831 timeout: 0 lvb_type: 0
00010000:00000001:0.0:1433298387.132167:0:2831:0:(ldlm_request.c:120:ldlm_expired_completion_wait()) Process leaving (rc=0 : 0 : 0)

On one test run, I crashed the system and took backtraces of the processes to see what was up. Here are some bits I found interesting:
I'm not sure what information is most useful to figure this out. Is it a matter of doing dlmtrace and then dumping the namespaces when it hangs? It's relatively easy to reproduce the hang:

i=0
lctl nodemap_add nm0
while true; do
    echo $i
    lctl nodemap_add_range --name nm0 --range 0@lo
    lctl nodemap_del_range --name nm0 --range 0@lo
    ((i++))
done

I've attached a namespace dump in case that's enough to figure it out. I'll keep digging, but I thought I'd post this in case anyone had any ideas, or in case I am way off track. |
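Regarding the dlmtrace-plus-namespace-dump question above, the following is a minimal capture sketch, assuming the standard lctl debug parameters (debug, debug_mb, ldlm.dump_namespaces) are available on this build; the parameter names should be checked against the running version:

    lctl set_param debug=+dlmtrace           # add DLM tracing to the debug mask
    lctl set_param debug_mb=512              # optionally enlarge the kernel debug buffer
    # ... reproduce the hang with the nodemap add/del range loop above ...
    lctl set_param ldlm.dump_namespaces=1    # dump all LDLM namespaces into the debug log
    lctl dk /tmp/lustre-debug.txt            # flush the debug buffer to a file for analysis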
| Comments |
| Comment by Oleg Drokin [ 03/Jun/15 ] |
|
Is this with Have you seen my additional comment in there? |
| Comment by Kit Westneat [ 03/Jun/15 ] |
|
I think this is a different bug; it looks like it's related to LDLM locks rather than mutex/spinlock problems. It also doesn't seem to be related to nodemap per se, just triggered by it. But I might be misreading the debug logs. I did see it; I'm working on a fix for the new issues in |
| Comment by Oleg Drokin [ 04/Jun/15 ] |
00000001:00000040:2.0:1433251090.069011:0:24913:0:(hash.c:1704:cfs_hash_for_each_empty()) Try to empty hash: a9afca12-80f8-a, loop: 4390398

This looks pretty much like a nodemap bug to me.
| Comment by Oleg Drokin [ 04/Jun/15 ] |
|
Actually no, 2831 is the thread that's hung. What needs to be done is to see which lock is blocking the granting of this lock, and then hunt that down.
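A possible way to hunt for the blocker, sketched on the assumption that ldlm.dump_namespaces is available and that the stuck resource is the one shown in the timed-out lock in the description ([0x20000e690:0x1:0x0]): dump the namespaces while the hang is in progress and look at the other locks on that resource; whatever holds a conflicting granted lock there is what is blocking the grant.

    lctl set_param ldlm.dump_namespaces=1        # dump every LDLM namespace to the debug log
    lctl dk /tmp/ns-dump.txt                     # write the kernel debug buffer out to a file
    grep '0x20000e690:0x1:0x0' /tmp/ns-dump.txt  # list all locks on the stuck resource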