[LU-10067] LBUG mdt_handler.c:222:mdt_lock_pdo_mode() Created: 03/Oct/17  Updated: 09/Nov/17  Resolved: 09/Nov/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug
Priority: Major
Reporter: James Casper
Assignee: Hongchao Zhang
Resolution: Fixed
Votes: 0
Labels: None
Environment:

trevis, full DNE
servers: el7.4, zfs, branch master, v2.10.53.1, b3642
clients: el7.4, branch master, v2.10.53.1, b3642


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/00511056-db38-4995-a4f8-b35856fa09f7

racer, test_1: test failed to respond and timed out

From MDS console:

06:18:07:[19681.903127] LustreError: 22680:0:(mdt_handler.c:222:mdt_lock_pdo_mode()) ASSERTION( lh->mlh_pdo_mode == LCK_MINMODE ) failed: 
06:18:07:[19681.908979] LustreError: 22680:0:(mdt_handler.c:222:mdt_lock_pdo_mode()) LBUG
06:18:07:[19681.912061] Pid: 22680, comm: mdt00_037
06:18:07:[19681.914860] 
06:18:07:[19681.914860] Call Trace:
06:18:07:[19681.920079]  [<ffffffffc06817ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
06:18:07:[19681.923062]  [<ffffffffc068183c>] lbug_with_loc+0x4c/0xb0 [libcfs]
06:18:07:[19681.926146]  [<ffffffffc11b59db>] mdt_object_local_lock+0xa4b/0xad0 [mdt]
06:18:07:[19681.928946]  [<ffffffffc11b56db>] ? mdt_object_local_lock+0x74b/0xad0 [mdt]
06:18:07:[19681.931741]  [<ffffffffc11a5f10>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
06:18:07:[19681.934380]  [<ffffffffc0e33770>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
06:18:07:[19681.937367]  [<ffffffffc11b5ad0>] mdt_object_lock_internal+0x70/0x330 [mdt]
06:18:07:[19681.940203]  [<ffffffffc11b5ad0>] ? mdt_object_lock_internal+0x70/0x330 [mdt]
06:18:07:[19681.943129]  [<ffffffffc11b5db0>] mdt_object_lock+0x20/0x30 [mdt]
06:18:07:[19681.945893]  [<ffffffffc11fa3f4>] mdt_lock_objects_in_linkea+0x748/0xa68 [mdt]
06:18:07:[19681.948705]  [<ffffffffc11ca148>] mdt_reint_migrate_internal.isra.38+0x8c8/0x16d0 [mdt]
06:18:07:[19681.951559]  [<ffffffffc0692002>] ? cfs_hash_bd_from_key+0x32/0xb0 [libcfs]
06:18:07:[19681.954250]  [<ffffffffc0c41788>] ? lu_object_put+0x148/0x3d0 [obdclass]
06:18:07:[19681.956946]  [<ffffffffc11cb1b5>] mdt_reint_rename_or_migrate.isra.39+0x265/0x860 [mdt]
06:18:07:[19681.959450]  [<ffffffffc11c0181>] ? mdt_root_squash+0x21/0x430 [mdt]
06:18:07:[19681.961948]  [<ffffffff8132c212>] ? strlcpy+0x42/0x60
06:18:07:[19681.964254]  [<ffffffffc11cb7c0>] mdt_reint_migrate+0x10/0x20 [mdt]
06:18:07:[19681.966606]  [<ffffffffc11cf790>] mdt_reint_rec+0x80/0x210 [mdt]
06:18:07:[19681.969134]  [<ffffffffc11b131b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
06:18:07:[19681.971704]  [<ffffffffc11bcda7>] mdt_reint+0x67/0x140 [mdt]
06:18:07:[19681.974204]  [<ffffffffc0ecc225>] tgt_request_handle+0x925/0x1370 [ptlrpc]
06:18:07:[19681.976507]  [<ffffffffc0e750c6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
06:18:07:[19681.979113]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
06:18:07:[19681.981580]  [<ffffffffc0e78862>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
06:18:07:[19681.983929]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
06:18:07:[19681.986432]  [<ffffffff816a8f00>] ? __schedule+0x330/0x8b0
06:18:07:[19681.988691]  [<ffffffffc0e77dd0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
06:18:07:[19681.990751]  [<ffffffff810b098f>] kthread+0xcf/0xe0
06:18:07:[19681.993085]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
06:18:07:[19681.995015]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
06:18:07:[19681.997259]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
06:18:07:[19681.999174] 
06:18:07:[19682.000888] Kernel panic - not syncing: LBUG
06:18:07:[19682.001870] CPU: 1 PID: 22680 Comm: mdt00_037 Tainted: P           OE  ------------   3.10.0-693.1.1.el7_lustre.x86_64 #1
06:18:07:[19682.001870] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
06:18:07:[19682.001870]  ffff8800664d1f00 0000000068d88604 ffff88004cf8b898 ffffffff816a3d6d
06:18:07:[19682.001870]  ffff88004cf8b918 ffffffff8169dc54 ffffffff00000008 ffff88004cf8b928
06:18:07:[19682.001870]  ffff88004cf8b8c8 0000000068d88604 0000000068d88604 ffff88007fd0f8b8
06:18:07:[19682.001870] Call Trace:
06:18:07:[19682.001870]  [<ffffffff816a3d6d>] dump_stack+0x19/0x1b
06:18:07:[19682.001870]  [<ffffffff8169dc54>] panic+0xe8/0x20d
06:18:07:[19682.001870]  [<ffffffffc0681854>] lbug_with_loc+0x64/0xb0 [libcfs]
06:18:07:[19682.001870]  [<ffffffffc11b59db>] mdt_object_local_lock+0xa4b/0xad0 [mdt]
06:18:07:[19682.001870]  [<ffffffffc11b56db>] ? mdt_object_local_lock+0x74b/0xad0 [mdt]
06:18:07:[19682.001870]  [<ffffffffc11a5f10>] ? mdt_obd_reconnect+0x1d0/0x1d0 [mdt]
06:18:07:[19682.001870]  [<ffffffffc0e33770>] ? ldlm_expired_completion_wait+0x240/0x240 [ptlrpc]
06:18:07:[19682.001870]  [<ffffffffc11b5ad0>] mdt_object_lock_internal+0x70/0x330 [mdt]
06:18:07:[19682.001870]  [<ffffffffc11b5ad0>] ? mdt_object_lock_internal+0x70/0x330 [mdt]
06:18:07:[19682.001870]  [<ffffffffc11b5db0>] mdt_object_lock+0x20/0x30 [mdt]
06:18:07:[19682.001870]  [<ffffffffc11fa3f4>] mdt_lock_objects_in_linkea+0x748/0xa68 [mdt]
06:18:07:[19682.001870]  [<ffffffffc11ca148>] mdt_reint_migrate_internal.isra.38+0x8c8/0x16d0 [mdt]
06:18:07:[19682.001870]  [<ffffffffc0692002>] ? cfs_hash_bd_from_key+0x32/0xb0 [libcfs]
06:18:07:[19682.001870]  [<ffffffffc0c41788>] ? lu_object_put+0x148/0x3d0 [obdclass]
06:18:07:[19682.001870]  [<ffffffffc11cb1b5>] mdt_reint_rename_or_migrate.isra.39+0x265/0x860 [mdt]
06:18:07:[19682.001870]  [<ffffffffc11c0181>] ? mdt_root_squash+0x21/0x430 [mdt]
06:18:07:[19682.001870]  [<ffffffff8132c212>] ? strlcpy+0x42/0x60
06:18:07:[19682.001870]  [<ffffffffc11cb7c0>] mdt_reint_migrate+0x10/0x20 [mdt]
06:18:07:[19682.001870]  [<ffffffffc11cf790>] mdt_reint_rec+0x80/0x210 [mdt]
06:18:07:[19682.001870]  [<ffffffffc11b131b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
06:18:07:[19682.001870]  [<ffffffffc11bcda7>] mdt_reint+0x67/0x140 [mdt]
06:18:07:[19682.001870]  [<ffffffffc0ecc225>] tgt_request_handle+0x925/0x1370 [ptlrpc]
06:18:07:[19682.001870]  [<ffffffffc0e750c6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
06:18:07:[19682.001870]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
06:18:07:[19682.001870]  [<ffffffffc0e78862>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
06:18:07:[19682.001870]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
06:18:07:[19682.001870]  [<ffffffff816a8f00>] ? __schedule+0x330/0x8b0
06:18:07:[19682.001870]  [<ffffffffc0e77dd0>] ? ptlrpc_register_service+0xe80/0xe80 [ptlrpc]
06:18:07:[19682.001870]  [<ffffffff810b098f>] kthread+0xcf/0xe0
06:18:07:[19682.001870]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
06:18:07:[19682.001870]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
06:18:07:[19682.001870]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40


 Comments   
Comment by James Nunez (Inactive) [ 03/Oct/17 ]

In the test_log, we see:

CMD: trevis-6vm4 /usr/sbin/lctl snapshot_create -F lustre -n lss_27457
trevis-6vm4: Warning: Permanently added 'trevis-6vm4' (ECDSA) to the list of known hosts.
trevis-6vm4: Warning: Permanently added 'trevis-6vm3' (ECDSA) to the list of known hosts.
CMD: trevis-6vm4 /usr/sbin/lctl snapshot_list -F lustre 2>/dev/null
CMD: trevis-6vm4 /usr/sbin/lctl snapshot_list -F lustre 2>/dev/null
trevis-6vm4: Warning: Permanently added 'trevis-6vm8' (ECDSA) to the list of known hosts.
CMD: trevis-6vm4 /usr/sbin/lctl snapshot_create -F lustre -n lss_29957
CMD: trevis-6vm4 /usr/sbin/lctl snapshot_list -F lustre 2>/dev/null
CMD: trevis-6vm4 /usr/sbin/lctl snapshot_destroy -F lustre -n lss_29957 -f
CMD: trevis-6vm4 /usr/sbin/lctl snapshot_create -F lustre -n lss_12954
trevis-6vm4: Can't lock the snapshot config file /etc/ldev.conf (2): Resource temporarily unavailable
CMD: trevis-6vm4 /usr/sbin/lctl snapshot_create -F lustre -n lss_2287
trevis-6vm4: Can't lock the snapshot config file /etc/ldev.conf (2): Resource temporarily unavailable
CMD: trevis-6vm4 /usr/sbin/lctl snapshot_create -F lustre -n lss_14963
trevis-6vm4: Can't lock the snapshot config file /etc/ldev.conf (2): Resource temporarily unavailable
CMD: trevis-6vm4 /usr/sbin/lctl snapshot_create -F lustre -n lss_12585
trevis-6vm4: Can't lock the snapshot config file /etc/ldev.conf (2): Resource temporarily unavailable
trevis-6vm4: Connection closed by UNKNOWN port 65535
…

racer test_1 frequently hangs during autotesting and fails with “Resource temporarily unavailable” messages in the test log, but previous racer runs do not show this LBUG in the MDS logs.

Comment by James Nunez (Inactive) [ 04/Oct/17 ]

Hongchao,

Would you please comment on this issue?

James

Comment by Gerrit Updater [ 13/Oct/17 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: https://review.whamcloud.com/29597
Subject: LU-10067 mdt: reinit lock when fail to try lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 24bd3ed58b9fd6a80f5e9c000119c15076d38053
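
Editorial sketch (not taken from the patch): the assertion in the console log requires the lock handle's PDO mode to be LCK_MINMODE on entry to mdt_lock_pdo_mode(), and the patch subject says the lock is reinitialized when a try-lock fails. One plausible reading, which this ticket does not confirm, is that a handle left marked as in use after a failed try-lock trips the assertion on the next lock attempt. The toy C model below illustrates that pattern only. Every name in it (toy_mode, toy_lock_handle, toy_handle_init, toy_try_lock, and the return codes) is hypothetical and is not the Lustre API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration; these are NOT the Lustre definitions. */
enum toy_mode { TOY_MINMODE = 0, TOY_EX = 1 };

struct toy_lock_handle {
    enum toy_mode pdo_mode;   /* TOY_MINMODE means "freshly initialized" */
};

/* Put the handle back into its initial state. */
static void toy_handle_init(struct toy_lock_handle *lh)
{
    assert(lh != NULL);
    lh->pdo_mode = TOY_MINMODE;
}

/*
 * Hypothetical try-lock.
 * Returns  0 if the lock was taken,
 *         -1 if the lock was contended and the attempt failed,
 *         -2 if the handle was not in its initial state on entry
 *            (the point where the real code would hit the assertion).
 */
static int toy_try_lock(struct toy_lock_handle *lh, bool contended,
                        bool reinit_on_fail)
{
    if (lh->pdo_mode != TOY_MINMODE)
        return -2;

    lh->pdo_mode = TOY_EX;          /* handle is now marked as in use */

    if (contended) {
        if (reinit_on_fail)
            toy_handle_init(lh);    /* restore initial state before returning */
        return -1;
    }
    return 0;
}
```

In this model, a failed try-lock without the reinit leaves pdo_mode set, so a retry on the same handle returns -2; with the reinit, the retry starts from a clean handle and can succeed.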

Comment by Gerrit Updater [ 09/Nov/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29597/
Subject: LU-10067 mdt: reinit lock when fail to try lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 37e4bcaad4b4cd1f539c257f7424850e51d685c1

Comment by Minh Diep [ 09/Nov/17 ]

Landed for 2.11
