Details
Type: Bug
Resolution: Fixed
Priority: Minor
Affects Version: Lustre 2.5.3
Environment: RHEL 6, Bull kernel 2.6.32-642.4.2.el6.Bull.100.x86_64; Lustre build based on 2.5.3.90
Severity: 3
Description
One Lustre client crashed right after a recovery with the following messages:
2016-10-29 16:00:01 Lustre: DEBUG MARKER: Sat Oct 29 16:00:01 2016
2016-10-29 16:00:01
2016-10-29 16:01:07 Lustre: 17766:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1477749640/real 1477749640] req@ffff8811e1b9a400 x1547883906151520/t0(0) o101->scratch3-MDT0000-mdc-ffff88205ec91400@JO.BOO.AL.IL@o2ib2:12/10 lens 632/1136 e 0 to 1 dl 1477749667 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
2016-10-29 16:01:07 Lustre: 17766:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
2016-10-29 16:01:07 Lustre: scratch3-MDT0000-mdc-ffff88205ec91400: Connection to scratch3-MDT0000 (at JO.BOO.AL.IL@o2ib2) was lost; in progress operations using this service will wait for recovery to complete
2016-10-29 16:01:07 Lustre: Skipped 12 previous similar messages
2016-10-29 16:01:07 Lustre: scratch3-MDT0000-mdc-ffff88205ec91400: Connection restored to scratch3-MDT0000 (at JO.BOO.AL.IL@o2ib2)
2016-10-29 16:01:07 Lustre: Skipped 12 previous similar messages
2016-10-29 16:01:07 LustreError: 17766:0:(namei.c:816:ll_create_node()) ASSERTION( list_empty(&inode->i_dentry) ) failed:
2016-10-29 16:01:07 LustreError: 17766:0:(namei.c:816:ll_create_node()) LBUG
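For anyone correlating the timestamps, here is a minimal sketch that pulls the epoch values out of the ptlrpc_expire_one_request() message. It assumes that "sent" is the time the request was sent and "dl" is its deadline; the regular expression and the helper name request_times are mine, not anything Lustre provides.

```python
import re

# Verbatim ptlrpc_expire_one_request() message from the log above.
SAMPLE = (
    "Request sent has timed out for slow reply: "
    "[sent 1477749640/real 1477749640] req@ffff8811e1b9a400 "
    "x1547883906151520/t0(0) "
    "o101->scratch3-MDT0000-mdc-ffff88205ec91400@JO.BOO.AL.IL@o2ib2:12/10 "
    "lens 632/1136 e 0 to 1 dl 1477749667 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1"
)

# Assumption: "sent"/"real" are send times and "dl" is the request deadline,
# all expressed in seconds since the Unix epoch.
_REQ_RE = re.compile(r"\[sent (\d+)/real (\d+)\].*?\bdl (\d+)\b")


def request_times(line):
    """Return the sent, real and deadline epochs plus the allowed window."""
    match = _REQ_RE.search(line)
    if match is None:
        raise ValueError("no request timing fields found in line")
    sent, real, deadline = (int(value) for value in match.groups())
    return {
        "sent": sent,
        "real": real,
        "deadline": deadline,
        "window": deadline - sent,  # seconds between send and deadline
    }


times = request_times(SAMPLE)
print(times["window"])
```

Under that reading, the request had a 27 second window, which fits the gap between the 16:00 messages and the expiry logged at 16:01:07.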
The stack is as follows:
PID: 17766  TASK: ffff8819ab5b0ab0  CPU: 6  COMMAND: "rsync"
 #0 [ffff88189c697b20] machine_kexec at ffffffff8103ff4b
 #1 [ffff88189c697b80] crash_kexec at ffffffff810cfce2
 #2 [ffff88189c697c50] panic at ffffffff81546ce9
 #3 [ffff88189c697cd0] lbug_with_loc at ffffffffa067beeb [libcfs]
 #4 [ffff88189c697cf0] ll_create_nd at ffffffffa0dbf854 [lustre]
 #5 [ffff88189c697d70] vfs_create at ffffffff811a7946
 #6 [ffff88189c697db0] do_filp_open at ffffffff811ab75e
 #7 [ffff88189c697f20] do_sys_open at ffffffff81194e87
 #8 [ffff88189c697f70] sys_open at ffffffff81194f90
 #9 [ffff88189c697f80] system_call_fastpath at ffffffff8100b0d2
    RIP: 0000003c88adb480  RSP: 00007fffd85b6bc8  RFLAGS: 00010246
    RAX: 0000000000000002  RBX: ffffffff8100b0d2  RCX: 0000000000000001
    RDX: 0000000000000180  RSI: 00000000000000c2  RDI: 00007fffd85b7d70
    RBP: 00007fffd85b7d9b   R8: 7a672e7761722e30   R9: 00000000ffffffff
    R10: 0000000000000001  R11: 0000000000000246  R12: ffffffff81194f90
    R13: ffff88189c697f78  R14: 00007fffd85b7d9c  R15: 0000000000000000
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b
Even though the intent is to create the inode with a dentry in the filesystem namespace (/somepath/.tmpfile), the inode structure field i_dentry is populated with one dentry pointing at /.lustre/fid/[0x298cd542a:0x3b3d:0x0].
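To make the failed check concrete, here is a toy user-space model of the invariant behind ASSERTION( list_empty(&inode->i_dentry) ). It is only an illustration: the Inode class and the create_node and attach_alias helpers are hypothetical stand-ins, and the real check is kernel C code in ll_create_node().

```python
class Inode:
    """Toy stand-in for a kernel inode; i_dentry holds its dentry aliases."""

    def __init__(self, fid):
        self.fid = fid
        self.i_dentry = []  # empty list plays the role of an empty list_head


def attach_alias(inode, path):
    """Hypothetical helper: attach a dentry (represented by its path)."""
    inode.i_dentry.append(path)


def create_node(inode):
    """Model of the check: a freshly created inode must have no alias yet."""
    if inode.i_dentry:
        raise AssertionError(
            "list_empty(&inode->i_dentry) failed, aliases: %r" % inode.i_dentry
        )
    return inode


# Expected case: nothing is attached before the create path runs.
fresh = create_node(Inode("[0x298cd542a:0x3b3d:0x0]"))

# Case from this report: an alias under /.lustre/fid is already attached.
stale = Inode("[0x298cd542a:0x3b3d:0x0]")
attach_alias(stale, "/.lustre/fid/[0x298cd542a:0x3b3d:0x0]")
try:
    create_node(stale)
    tripped = False
except AssertionError:
    tripped = True
```

In the model, as in the crash, the create path trips because the alias list is already non-empty by the time the check runs.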
Strangely enough, although the crash occurred in ll_create_nd() -> ll_create_it() -> ll_create_node(), that is, before ll_create_it() reaches its call to d_instantiate(), the FID already resolves to the requested location:
# lfs fid2path /mountpoint [0x298cd542a:0x3b3d:0x0]
/somepath/.tmpfile
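For reference, a small sketch of how the FID string breaks down and how the /.lustre/fid path seen in i_dentry is formed from it. A Lustre FID is written as [sequence:object id:version]; the Fid tuple and the parse_fid and fid_path helpers below are illustrative names of mine, not part of any Lustre tool.

```python
import re
from typing import NamedTuple


class Fid(NamedTuple):
    seq: int  # sequence number
    oid: int  # object id within the sequence
    ver: int  # version


_FID_RE = re.compile(
    r"^\[?(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]?$"
)


def parse_fid(text):
    """Parse a FID such as [0x298cd542a:0x3b3d:0x0] into its three fields."""
    match = _FID_RE.match(text.strip())
    if match is None:
        raise ValueError("not a FID: %r" % text)
    return Fid(*(int(part, 16) for part in match.groups()))


def fid_path(mountpoint, fid):
    """Build the by-FID path under the hidden .lustre/fid directory."""
    return "%s/.lustre/fid/[%#x:%#x:%#x]" % (
        mountpoint.rstrip("/"), fid.seq, fid.oid, fid.ver
    )


fid = parse_fid("[0x298cd542a:0x3b3d:0x0]")
print(fid_path("/mountpoint", fid))
```

Both this path and /somepath/.tmpfile name the same object, which is why fid2path can map one back to the other.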
I can't upload the crash dump due to site restrictions, but I will be happy to provide more information on request.