Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.5.3
-
RHEL 6 Bull kernel 2.6.32-642.4.2.el6.Bull.100.x86_64
Lustre build based on 2.5.3.90
-
3
Description
One Lustre client crashed right after a recovery with the following messages:
2016-10-29 16:00:01 Lustre: DEBUG MARKER: Sat Oct 29 16:00:01 2016
2016-10-29 16:00:01
2016-10-29 16:01:07 Lustre: 17766:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1477749640/real 1477749640] req@ffff8811e1b9a400 x1547883906151520/t0(0) o101->scratch3-MDT0000-mdc-ffff88205ec91400@JO.BOO.AL.IL@o2ib2:12/10 lens 632/1136 e 0 to 1 dl 1477749667 ref 2 fl Rpc:XP/0/ffffffff rc 0/-1
2016-10-29 16:01:07 Lustre: 17766:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
2016-10-29 16:01:07 Lustre: scratch3-MDT0000-mdc-ffff88205ec91400: Connection to scratch3-MDT0000 (at JO.BOO.AL.IL@o2ib2) was lost; in progress operations using this service will wait for recovery to complete
2016-10-29 16:01:07 Lustre: Skipped 12 previous similar messages
2016-10-29 16:01:07 Lustre: scratch3-MDT0000-mdc-ffff88205ec91400: Connection restored to scratch3-MDT0000 (at JO.BOO.AL.IL@o2ib2)
2016-10-29 16:01:07 Lustre: Skipped 12 previous similar messages
2016-10-29 16:01:07 LustreError: 17766:0:(namei.c:816:ll_create_node()) ASSERTION( list_empty(&inode->i_dentry) ) failed:
2016-10-29 16:01:07 LustreError: 17766:0:(namei.c:816:ll_create_node()) LBUG
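The timed-out request line in the log above carries two epoch values, "sent 1477749640" and "dl 1477749667". As a minimal sketch for cross-checking them against the console timestamps, the snippet below converts both to UTC and computes the gap. It assumes "dl" is the request deadline, and the variable and function names are illustrative only.

```python
from datetime import datetime, timezone

# Values copied from the ptlrpc_expire_one_request() line in the log above.
sent = 1477749640      # "sent 1477749640"
deadline = 1477749667  # "dl 1477749667" (assumed to be the request deadline)


def to_utc(epoch_seconds):
    """Convert a Unix epoch value to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Time between sending the request and its deadline, in seconds.
wait_seconds = deadline - sent

print("sent    :", to_utc(sent).isoformat())
print("deadline:", to_utc(deadline).isoformat())
print("waited  :", wait_seconds, "seconds")
```

The deadline converts to 14:01:07 UTC, which would line up with the 16:01:07 console timestamp if the client logs in UTC+2.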
The stack is as follows:
PID: 17766  TASK: ffff8819ab5b0ab0  CPU: 6  COMMAND: "rsync"
 #0 [ffff88189c697b20] machine_kexec at ffffffff8103ff4b
 #1 [ffff88189c697b80] crash_kexec at ffffffff810cfce2
 #2 [ffff88189c697c50] panic at ffffffff81546ce9
 #3 [ffff88189c697cd0] lbug_with_loc at ffffffffa067beeb [libcfs]
 #4 [ffff88189c697cf0] ll_create_nd at ffffffffa0dbf854 [lustre]
 #5 [ffff88189c697d70] vfs_create at ffffffff811a7946
 #6 [ffff88189c697db0] do_filp_open at ffffffff811ab75e
 #7 [ffff88189c697f20] do_sys_open at ffffffff81194e87
 #8 [ffff88189c697f70] sys_open at ffffffff81194f90
 #9 [ffff88189c697f80] system_call_fastpath at ffffffff8100b0d2
    RIP: 0000003c88adb480  RSP: 00007fffd85b6bc8  RFLAGS: 00010246
    RAX: 0000000000000002  RBX: ffffffff8100b0d2  RCX: 0000000000000001
    RDX: 0000000000000180  RSI: 00000000000000c2  RDI: 00007fffd85b7d70
    RBP: 00007fffd85b7d9b   R8: 7a672e7761722e30   R9: 00000000ffffffff
    R10: 0000000000000001  R11: 0000000000000246  R12: ffffffff81194f90
    R13: ffff88189c697f78  R14: 00007fffd85b7d9c  R15: 0000000000000000
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b
Even though the intent is to create the inode with a dentry in the filesystem namespace (/somepath/.tmpfile), the inode structure field i_dentry is populated with one dentry pointing at /.lustre/fid/[0x298cd542a:0x3b3d:0x0].
Strangely enough, even though the crash occurred in ll_create_nd() -> ll_create_it() -> ll_create_node(), before the call to d_instantiate() in ll_create_nd() -> ll_create_it(), the FID already resolves to the requested location:
# lfs fid2path /mountpoint [0x298cd542a:0x3b3d:0x0]
/somepath/.tmpfile
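For anyone comparing the dentry found in the crashdump against the lfs fid2path result, here is a small sketch that splits a FID string of the form [seq:oid:ver] into its three fields and rebuilds the /.lustre/fid/ alias path seen on i_dentry. The names Fid, parse_fid and fid_alias_path are hypothetical helpers written for this illustration; they are not part of any Lustre tool.

```python
import re
from typing import NamedTuple


class Fid(NamedTuple):
    """Three fields of a Lustre FID written as [seq:oid:ver]."""
    seq: int
    oid: int
    ver: int


# Accepts the FID with or without the surrounding square brackets.
_FID_RE = re.compile(
    r"^\[?(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]?$"
)


def parse_fid(text):
    """Parse a hexadecimal FID string into a Fid tuple (hypothetical helper)."""
    match = _FID_RE.match(text.strip())
    if match is None:
        raise ValueError("not a FID string: {!r}".format(text))
    return Fid(*(int(group, 16) for group in match.groups()))


def fid_alias_path(mountpoint, fid):
    """Build the <mountpoint>/.lustre/fid/[...] path for a FID (hypothetical helper)."""
    return "{}/.lustre/fid/[{:#x}:{:#x}:{:#x}]".format(mountpoint.rstrip("/"), *fid)


fid = parse_fid("[0x298cd542a:0x3b3d:0x0]")
print(fid)
print(fid_alias_path("/mountpoint", fid))
```

With the FID from this ticket and the placeholder mount point /mountpoint, the second helper yields the same /.lustre/fid/ name that was found attached to the inode.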
I can't upload the crashdump due to site restrictions, but I will be happy to provide more information on request.
Thanks Bruno.
The patch was backported to our branch.
This ticket can be closed.