Lustre / LU-7798

ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed:

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Environment: lola
      build: 2.8.50-6-gf9ca359; commit f9ca359284357d145819beb08b316e932f7a3060

    Description

      An error occurred during soak testing of build '20160218' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160218). DNE is enabled.
      MDTs have been formatted using ldiskfs, OSTs using ZFS.

      Event history:

      • 2016-02-18 16:24:27,115:fsmgmt.fsmgmt:INFO resetting MDS node lola-10
      • 2016-02-18 16:34:04 Lustre client lola-26 crashed with LBUG
      • 2016-02-18 16:34:06 Lustre client lola-29 crashed with LBUG

      The error message and stack trace are the same for both events:

      <6>Lustre: soaked-MDT0004-mdc-ffff880852d1a000: Connection restored to 192.168.1.110@o2ib10 (at 192.168.1.110@o2ib10)
      <0>LustreError: 23705:0:(llite_lib.c:2295:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed:
      <0>LustreError: 23705:0:(llite_lib.c:2295:ll_prep_inode()) LBUG
      <4>Pid: 23705, comm: pct
      <4>
      <4>Call Trace:
      <4> [<ffffffffa050b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa050be77>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0aa0192>] ll_prep_inode+0x752/0xc40 [lustre]
      <4> [<ffffffffa07f6d60>] ? lustre_swab_mdt_body+0x0/0x130 [ptlrpc]
      <4> [<ffffffffa0ab42b2>] ll_new_node+0x682/0x7f0 [lustre]
      <4> [<ffffffffa0ab6b04>] ll_mkdir+0x104/0x220 [lustre]
      <4> [<ffffffff8122ec0f>] ? security_inode_permission+0x1f/0x30
      <4> [<ffffffff8119d759>] vfs_mkdir+0xd9/0x140
      <4> [<ffffffff811a04e7>] sys_mkdirat+0xc7/0x1b0
      <4> [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
      <4> [<ffffffff811a05e8>] sys_mkdir+0x18/0x20
      <4> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 23705, comm: pct Not tainted 2.6.32-504.30.3.el6.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff815293fc>] ? panic+0xa7/0x16f
      <4> [<ffffffffa050becb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0aa0192>] ? ll_prep_inode+0x752/0xc40 [lustre]
      <4> [<ffffffffa07f6d60>] ? lustre_swab_mdt_body+0x0/0x130 [ptlrpc]
      <4> [<ffffffffa0ab42b2>] ? ll_new_node+0x682/0x7f0 [lustre]
      <4> [<ffffffffa0ab6b04>] ? ll_mkdir+0x104/0x220 [lustre]
      <4> [<ffffffff8122ec0f>] ? security_inode_permission+0x1f/0x30
      <4> [<ffffffff8119d759>] ? vfs_mkdir+0xd9/0x140
      <4> [<ffffffff811a04e7>] ? sys_mkdirat+0xc7/0x1b0
      <4> [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
      <4> [<ffffffff811a05e8>] ? sys_mkdir+0x18/0x20
      <4> [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
      
      • No errors on the Lustre server nodes could be correlated with the events.

      Console logs, syslog messages and vmcore-dmesg.txt from both clients are attached.
      Crash dump files are available.
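
      For context, the assertion fires in ll_prep_inode() when the mdt_body returned by the MDS carries a FID that fid_is_sane() rejects, e.g. a zeroed or otherwise empty reply body. The snippet below is an illustrative, self-contained approximation of that check, not the Lustre source: the struct layout and field names follow lu_fid, but the validity rules in the real fid_is_sane() are more involved (IGIF/IDIF and reserved sequences are also accepted).

      /* Illustrative sketch only -- not the actual Lustre source.  It shows
       * the kind of check behind ASSERTION( fid_is_sane(&md.body->mbo_fid1) ):
       * the FID in an MDT reply must have a plausible sequence/object id
       * before the client builds an inode from it. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct lu_fid {            /* same field names as the Lustre FID */
              uint64_t f_seq;    /* sequence number */
              uint32_t f_oid;    /* object id within the sequence */
              uint32_t f_ver;    /* version */
      };

      /* Simplified stand-in for fid_is_sane(): a zeroed FID, e.g. from an
       * empty or corrupted reply body (hypothetical), is rejected. */
      static bool example_fid_is_sane(const struct lu_fid *fid)
      {
              return fid != NULL && fid->f_seq != 0 && fid->f_oid != 0;
      }

      int main(void)
      {
              struct lu_fid ok  = { .f_seq = 0x200000401ULL, .f_oid = 1 };
              struct lu_fid bad = { 0 };

              printf("ok  fid sane: %d\n", example_fid_is_sane(&ok));
              printf("bad fid sane: %d\n", example_fid_is_sane(&bad));
              return 0;
      }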

      Attachments

        1. console-lola-26.log.bz2
          63 kB
        2. console-lola-29.log.bz2
          72 kB
        3. lola-26-vmcore-dmesg.txt.bz2
          30 kB
        4. lola-29-vmcore-dmesg.txt.bz2
          27 kB
        5. messages-lola-26.log.bz2
          235 kB
        6. messages-lola-29.log.bz2
          248 kB

        Issue Links

          Activity

            [LU-7798] ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed:
            pjones Peter Jones added a comment -

            seems to be a duplicate of LU-7422


            cliffw Cliff White (Inactive) added a comment -

            We appear to have hit this again while doing ZFS metadata testing:

            Oct  6 22:18:08 spirit-17 kernel: LustreError: 36451:0:(llite_lib.c:2465:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed:
            Oct  6 22:18:08 spirit-17 kernel: LustreError: 36451:0:(llite_lib.c:2465:ll_prep_inode()) LBUG
            Oct  6 22:18:08 spirit-17 kernel: Pid: 36451, comm: mdtest
             Oct  6 22:18:08 spirit-17 kernel: Call Trace:
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffffa08ab853>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffffa08abdf5>] lbug_with_loc+0x45/0xc0 [libcfs]
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffffa0de7002>] ll_prep_inode+0xb12/0xe60 [lustre]
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffffa0df6c3f>] ? ll_lookup_it+0x64f/0xe20 [lustre]
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffffa0df7c2c>] ll_atomic_open+0x81c/0x12c0 [lustre]
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811ec8a1>] do_last+0xa11/0x1270 
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811eede2>] path_openat+0xc2/0x490
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811f05ab>] do_filp_open+0x4b/0xb0
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811fd147>] ? __alloc_fd+0xa7/0x130
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811ddf53>] do_sys_open+0xf3/0x1f0
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811de06e>] SyS_open+0x1e/0x20
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffff811de0a6>] SyS_creat+0x16/0x20
            Oct  6 22:18:08 spirit-17 kernel: [<ffffffff81646a09>] system_call_fastpath+0x16/0x1b
            Oct  6 22:18:08 spirit-17 kernel:
            

            cliffw Cliff White (Inactive) added a comment -

            Rest of the kernel panic:

            Kernel panic - not syncing: LBUG
            Pid: 85659, comm: pct Not tainted 2.6.32-504.30.3.el6.x86_64 #1
            Call Trace:
             [<ffffffff815293fc>] ? panic+0xa7/0x16f
             [<ffffffffa045cecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
             [<ffffffffa09ff192>] ? ll_prep_inode+0x752/0xc40 [lustre]
             [<ffffffffa0747d60>] ? lustre_swab_mdt_body+0x0/0x130 [ptlrpc]
             [<ffffffffa0a132b2>] ? ll_new_node+0x682/0x7f0 [lustre]
             [<ffffffffa0a15b04>] ? ll_mkdir+0x104/0x220 [lustre]
             [<ffffffff8122ec0f>] ? security_inode_permission+0x1f/0x30
             [<ffffffff8119d759>] ? vfs_mkdir+0xd9/0x140
             [<ffffffff811a04e7>] ? sys_mkdirat+0xc7/0x1b0
             [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
             [<ffffffff811a05e8>] ? sys_mkdir+0x18/0x20
             [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
            Initializing cgroup subsys cpuset
            Initializing cgroup subsys cpu
            

            The crash dump has been copied to lola:/scratch/crashdumps/lu-7798/


            cliffw Cliff White (Inactive) added a comment -

            This bug has reoccurred in 2.8.0-RC5 testing:

            Lustre: 3790:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
            Lustre: 3779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1457583412/real 1457583412]  req@ffff880275d579c0 x1528367925364328/t0(0) o38->soaked-MDT0007-mdc-ffff88082f2f4800@192.168.1.111@o2ib10:12/10 lens 520/544 e 0 to 1 dl 1457583433 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Lustre: 3779:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
            LustreError: 11-0: soaked-MDT0006-mdc-ffff88082f2f4800: operation mds_reint to node 192.168.1.110@o2ib10 failed: rc = -19
            Lustre: soaked-MDT0007-mdc-ffff88082f2f4800: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
            LustreError: 3779:0:(client.c:2874:ptlrpc_replay_interpret()) @@@ request replay timed out.
              req@ffff88046cbfb6c0 x1528367924440432/t412317941034(412317941034) o36->soaked-MDT0006-mdc-ffff88082f2f4800@192.168.1.111@o2ib10:12/10 lens 616/192 e 1 to 1 dl 1457583622 ref 2 fl Interpret:EX/4/ffffffff rc -110/-1
            Lustre: soaked-MDT0006-mdc-ffff88082f2f4800: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
            LustreError: 85659:0:(llite_lib.c:2295:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed:
            LustreError: 85659:0:(llite_lib.c:2295:ll_prep_inode()) LBUG
            Pid: 85659, comm: pct

            Call Trace:
             [<ffffffffa045c875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
             [<ffffffffa045ce77>] lbug_with_loc+0x47/0xb0 [libcfs]
             [<ffffffffa09ff192>] ll_prep_inode+0x752/0xc40 [lustre]
             [<ffffffffa0747d60>] ? lustre_swab_mdt_body+0x0/0x130 [ptlrpc]
             [<ffffffffa0a132b2>] ll_new_node+0x682/0x7f0 [lustre]
             [<ffffffffa0a15b04>] ll_mkdir+0x104/0x220 [lustre]
             [<ffffffff8122ec0f>] ? security_inode_permission+0x1f/0x30
             [<ffffffff8119d759>] vfs_mkdir+0xd9/0x140
             [<ffffffff811a04e7>] sys_mkdirat+0xc7/0x1b0
             [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
             [<ffffffff811a05e8>] sys_mkdir+0x18/0x20
             [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
            

            adilger Andreas Dilger added a comment -

            The client should not crash when accessing bad data from the network. It is very likely that this is a result of bad data being sent from the server, but that is a separate issue. This bug should focus on adding proper error handling to ll_prep_inode() for this case instead of crashing.
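
            A minimal sketch of that kind of handling, assuming the 2.8-era llite code path shown in the traces above (illustrative fragment only, not the actual patch; the 'out' cleanup label is hypothetical): validate the FID from the reply and fail the operation with -EPROTO rather than asserting.

            /* Illustrative fragment for ll_prep_inode(), not the actual fix:
             * reject a reply whose mbo_fid1 fails fid_is_sane() and return an
             * error to the caller instead of calling LBUG(). */
            if (unlikely(!fid_is_sane(&md.body->mbo_fid1))) {
                    CERROR("Invalid FID "DFID" in MDT reply: rc = %d\n",
                           PFID(&md.body->mbo_fid1), -EPROTO);
                    rc = -EPROTO;
                    goto out;       /* hypothetical cleanup label */
            }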

            heckes Frank Heckes (Inactive) added a comment - - edited

            I forgot to mention that the OSS nodes were extended to operate in an active-active failover configuration
            for disk resources on February 17th, 2016. As a result, the failover partner node lola-4 can see all of node lola-5's disks and also has its ZFS pools imported.
            There is no start-up (boot) wrapper script that prevents the (primary) zpools of the other node from being imported.

            heckes Frank Heckes (Inactive) added a comment - - edited

            Crash files have been saved to lhn.lola.hpdd.intel.com:/scratch/crashdumps/lu-7798/


            People

              Assignee: wc-triage WC Triage
              Reporter: heckes Frank Heckes (Inactive)
              Votes: 0
              Watchers: 6
