Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7422

incorrect ENOENT handling in mdt_intent_reint causes panic on client side

Details

    • 3
    • 9223372036854775807

    Description

      In case of DISP_OPEN_CREATE client waits for valid fid value in reply when it_status == 0.
      When reint_open returns ENOENT fid is not set and client gets fid filled by 0. This may cause following panic:

      > LustreError: 19177:0:(llite_lib.c:2178:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->fid1) ) failed:
      > LustreError: 19177:0:(llite_lib.c:2178:ll_prep_inode()) LBUG
      > Pid: 19177, comm: cp
      > 
      > Call Trace:
      >  [<ffffffff81006109>] try_stack_unwind+0x169/0x1b0
      >  [<ffffffff81004b99>] dump_trace+0x89/0x440
      >  [<ffffffffa036b8c7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      >  [<ffffffffa036be27>] lbug_with_loc+0x47/0xc0 [libcfs]
      >  [<ffffffffa0a21184>] ll_prep_inode+0xc04/0xe40 [lustre]
      >  [<ffffffffa0a34721>] ll_create_nd+0x3d1/0xec0 [lustre]
      >  [<ffffffff81163fe9>] vfs_create+0xa9/0xf0
      >  [<ffffffff811646aa>] do_last+0x67a/0x7d0
      >  [<ffffffff81165358>] path_openat+0xc8/0x3d0
      >  [<ffffffff81165788>] do_filp_open+0x48/0xa0
      >  [<ffffffff8115651e>] do_sys_open+0x16e/0x240
      >  [<ffffffff81156630>] sys_open+0x20/0x30
      >  [<ffffffff815662eb>] system_call_fastpath+0x16/0x1b
      >  [<00007fc9c3726790>] 0x7fc9c3726790

      We faced the issue on DNE setup. For unknown reason(possibly failover) FLDB on master mdt didn't include OST seq ranges.
      We faced above panic every time after trying to create regular file in directory located on mdt1.

      Attachments

        1. console-lola-10.log.bz2
          386 kB
        2. console-lola-29.log.bz2
          44 kB
        3. lola-29-lustre-log.1448526588.65535.bz2
          0.3 kB
        4. messages-lola-10.log.bz2
          445 kB
        5. messages-lola-29.log.bz2
          409 kB
        6. soak.log.bz2
          358 kB

        Issue Links

          Activity

            [LU-7422] incorrect ENOENT handling in mdt_intent_reint causes panic on client side
            pjones Peter Jones added a comment -

            Landed for 2.9

            pjones Peter Jones added a comment - Landed for 2.9

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17985/
            Subject: LU-7422 llite: don't panic when fid is insane
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e4eb90ecdc09740d90834cb1e95b5693e6637173

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17985/ Subject: LU-7422 llite: don't panic when fid is insane Project: fs/lustre-release Branch: master Current Patch Set: Commit: e4eb90ecdc09740d90834cb1e95b5693e6637173

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17177/
            Subject: LU-7422 mdt: fix ENOENT handling in mdt_intent_reint
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ec9078afb635e8a64a4906b09dc99a2ab90e321b

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17177/ Subject: LU-7422 mdt: fix ENOENT handling in mdt_intent_reint Project: fs/lustre-release Branch: master Current Patch Set: Commit: ec9078afb635e8a64a4906b09dc99a2ab90e321b

            Sergey Cheremencev (sergey_cheremencev@xyratex.com) uploaded a new patch: http://review.whamcloud.com/17985
            Subject: LU-7422 llite: don't panic when fid is insane
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 77306b00cd1b3536f8d6b4e9b25f7b065f737ec7

            gerrit Gerrit Updater added a comment - Sergey Cheremencev (sergey_cheremencev@xyratex.com) uploaded a new patch: http://review.whamcloud.com/17985 Subject: LU-7422 llite: don't panic when fid is insane Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 77306b00cd1b3536f8d6b4e9b25f7b065f737ec7
            heckes Frank Heckes (Inactive) added a comment - - edited

            Used build '20160104' from branch master (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20160104).
            DNE is enabled and MDSes are in active-active HA configuration. MDTs have been formatted using ldiskfs, OSTs using zfs.

            Original error happens again for two client nodes at almost the same time during failback of MDS resources to primary node while soak testing the build specified above.

            <0>LustreError: 75738:0:(llite_lib.c:2295:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed: 
            <0>LustreError: 75738:0:(llite_lib.c:2295:ll_prep_inode()) LBUG
            <4>Pid: 75738, comm: mdtest
            ...
            <0>Kernel panic - not syncing: LBUG
            <4>Pid: 75738, comm: mdtest Not tainted 2.6.32-504.30.3.el6.x86_64 #1
            

            Crash dump files have been written for both nodes (lola-26,29) and have been saved to lola-1:/scratch/crashdumps/lu-7422/lola-26-127.0.0.1-2016-01-05-19:02:53 , lola-29-127.0.0.1-2016-01-05-19:02:56. Log files can be provided on demand.

            heckes Frank Heckes (Inactive) added a comment - - edited Used build '20160104' from branch master (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20160104 ). DNE is enabled and MDSes are in active-active HA configuration. MDTs have been formatted using ldiskfs , OSTs using zfs . Original error happens again for two client nodes at almost the same time during failback of MDS resources to primary node while soak testing the build specified above. <0>LustreError: 75738:0:(llite_lib.c:2295:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed: <0>LustreError: 75738:0:(llite_lib.c:2295:ll_prep_inode()) LBUG <4>Pid: 75738, comm: mdtest ... <0>Kernel panic - not syncing: LBUG <4>Pid: 75738, comm: mdtest Not tainted 2.6.32-504.30.3.el6.x86_64 #1 Crash dump files have been written for both nodes ( lola-26,29 ) and have been saved to lola-1:/scratch/crashdumps/lu-7422/lola-26-127.0.0.1-2016-01-05-19:02:53 , lola-29-127.0.0.1-2016-01-05-19:02:56 . Log files can be provided on demand.

            Maloo set -1 because there are 2 test failures:
            1. sanity 230f - it is marked as known but LU-7549
            2. conf-sanity 51. I don't see how it can be connected with my patch:

            15:32:55:LustreError: 3731:0:(lov_object.c:278:lov_init_raid0()) ASSERTION( subdev != NULL ) failed: not init ost 1
            15:32:55:LustreError: 3731:0:(lov_object.c:278:lov_init_raid0()) LBUG
            15:32:55:Pid: 3731, comm: touch
            15:32:55:
            15:32:55:Call Trace:
            15:32:55: [<ffffffffa03e3875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            15:32:55: [<ffffffffa03e3e77>] lbug_with_loc+0x47/0xb0 [libcfs]
            15:32:55: [<ffffffffa0ab1239>] lov_init_raid0+0xde9/0x1140 [lov]
            15:32:55: [<ffffffffa0aadea4>] lov_object_init+0x124/0x300 [lov]
            15:32:55: [<ffffffffa078621c>] ? lu_object_add+0x2c/0x30 [obdclass]
            15:32:55: [<ffffffffa0788e48>] lu_object_alloc+0xd8/0x320 [obdclass]
            15:32:55: [<ffffffffa078a1ad>] lu_object_find_try+0xc03e>] ? do_filp_open+0xa8e/0xd20
            15:32:55: [<ffffffff81156a00>] ? __vma_link_rb+0x30/0x40
            15:32:55: [<ffffffff81156aab>] ? vma_link+0x9b/0xf0
            15:32:55: [<ffffffff8129dc3a>] ? strncpy_from_user+0x4a/0x90
            15:32:55: [<ffffffff811b1d52>] ? alloc_fd+0x92/0x160
            15:32:55: [<ffffffff8118e847>] ? do_sys_open+0x67/0x130
            15:32:55: [<ffffffff8118e950>] ? sys_open+0x20/0x30
            15:32:55: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
            
            scherementsev Sergey Cheremencev added a comment - Maloo set -1 because there are 2 test failures: 1. sanity 230f - it is marked as known but LU-7549 2. conf-sanity 51. I don't see how it can be connected with my patch: 15:32:55:LustreError: 3731:0:(lov_object.c:278:lov_init_raid0()) ASSERTION( subdev != NULL ) failed: not init ost 1 15:32:55:LustreError: 3731:0:(lov_object.c:278:lov_init_raid0()) LBUG 15:32:55:Pid: 3731, comm: touch 15:32:55: 15:32:55:Call Trace: 15:32:55: [<ffffffffa03e3875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 15:32:55: [<ffffffffa03e3e77>] lbug_with_loc+0x47/0xb0 [libcfs] 15:32:55: [<ffffffffa0ab1239>] lov_init_raid0+0xde9/0x1140 [lov] 15:32:55: [<ffffffffa0aadea4>] lov_object_init+0x124/0x300 [lov] 15:32:55: [<ffffffffa078621c>] ? lu_object_add+0x2c/0x30 [obdclass] 15:32:55: [<ffffffffa0788e48>] lu_object_alloc+0xd8/0x320 [obdclass] 15:32:55: [<ffffffffa078a1ad>] lu_object_find_try+0xc03e>] ? do_filp_open+0xa8e/0xd20 15:32:55: [<ffffffff81156a00>] ? __vma_link_rb+0x30/0x40 15:32:55: [<ffffffff81156aab>] ? vma_link+0x9b/0xf0 15:32:55: [<ffffffff8129dc3a>] ? strncpy_from_user+0x4a/0x90 15:32:55: [<ffffffff811b1d52>] ? alloc_fd+0x92/0x160 15:32:55: [<ffffffff8118e847>] ? do_sys_open+0x67/0x130 15:32:55: [<ffffffff8118e950>] ? sys_open+0x20/0x30 15:32:55: [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
            heckes Frank Heckes (Inactive) added a comment - - edited

            Error happened also during soak testing of master branch (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151122). DNE is enabled. MDSes are in active-active HA failover configuration.

            Here the error occurred after the remount of the MDTs on MDS (lola-10) completed successful (2015-11-26 00:27:36).
            Pasted the stack trace ones more, as context seems to be different than for the one above.

            Nov 26 00:29:48 lola-29 kernel: LustreError: 65535:0:(llite_lib.c:2295:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed:
            Nov 26 00:29:48 lola-29 kernel: LustreError: 65535:0:(llite_lib.c:2295:ll_prep_inode()) LBUG
            Nov 26 00:29:48 lola-29 kernel: Pid: 65535, comm: pct
            Nov 26 00:29:48 lola-29 kernel:
            Nov 26 00:29:48 lola-29 kernel: Call Trace:
            Nov 26 00:29:48 lola-29 kernel: [<ffffffffa050b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Nov 26 00:29:48 lola-29 kernel: [<ffffffffa050be77>] lbug_with_loc+0x47/0xb0 [libcfs]
            Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0abdb62>] ll_prep_inode+0x752/0xc40 [lustre]
            Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0802c10>] ? lustre_swab_mdt_body+0x0/0x130 [ptlrpc]
            Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0ad29d2>] ll_new_node+0x682/0x7f0 [lustre]
            Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0ad5224>] ll_mkdir+0x104/0x220 [lustre]
            Nov 26 00:29:48 lola-29 kernel: [<ffffffff8122ec0f>] ? security_inode_permission+0x1f/0x30
            Nov 26 00:29:48 lola-29 kernel: [<ffffffff8119d759>] vfs_mkdir+0xd9/0x140
            Nov 26 00:29:48 lola-29 kernel: [<ffffffff811a04e7>] sys_mkdirat+0xc7/0x1b0
            Nov 26 00:29:48 lola-29 kernel: [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
            Nov 26 00:29:48 lola-29 kernel: [<ffffffff811a05e8>] sys_mkdir+0x18/0x20
            Nov 26 00:29:48 lola-29 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
            Nov 26 00:29:48 lola-29 kernel:
            Nov 26 00:29:48 lola-29 kernel: LustreError: dumping log to /tmp/lustre-log.1448526588.65535
            

            Chronologically this event can be correlated to the following error on lola-10:

            lola-10.log:Nov 26 00:29:26 lola-10 kernel: Lustre: 3875:0:(client.c:1994:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1448526555/real 1448526555]  req@ffff88041c873380 x1518889781449540/t0(
            0) o400->soaked-MDT0004-osp-MDT0005@0@lo:24/4 lens 224/224 e 1 to 1 dl 1448526560 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
            lola-10.log:Nov 26 00:29:28 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140549951, ql: 10, comp: 6, conn: 16, next: 992140549952
            , next_update 992140549968 last_committed: 992140549904)
            lola-10.log:Nov 26 00:29:28 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140549956, ql: 10, comp: 6, conn: 16, next: 992140549957
            , next_update 992140549968 last_committed: 992140549904)
            etc... etc...
            lola-10.log:Nov 26 00:29:34 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140550321, ql: 10, comp: 6, conn: 16, next: 992140550322
            , next_update 992140550332 last_committed: 992140550276)
            lola-10.log:Nov 26 00:29:34 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140550327, ql: 10, comp: 6, conn: 16, next: 992140550328
            , next_update 992140550332 last_committed: 992140550276)
            lola-10.log:Nov 26 00:29:35 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140550374, ql: 10, comp: 6, conn: 16, next: 992140550375
            , next_update 992140550385 last_committed: 992140550276)
            lola-10.log:Nov 26 00:29:35 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140550379, ql: 10, comp: 6, conn: 16, next: 992140550380
            , next_update 992140550385 last_committed: 992140550276)
            ...
            etc...
            ...
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: 5620:0:(ldlm_lib.c:1946:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: 5620:0:(ldlm_lib.c:1946:target_recovery_overseer()) Skipped 1 previous similar message
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: soaked-MDT0004: disconnecting 10 stale clients
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: 5620:0:(ldlm_lib.c:1580:abort_req_replay_queue()) @@@ aborted:  req@ffff8808216889c0 x1518687394127176/t0(992140550777) o36->a8187858-2e65-5add-6cdf-c58a76255990@192.168.1.126@o2ib100:210/0 lens 624/0 e 0 to 0 dl 1448526600 ref 1 fl Complete:/4/ffffffff rc 0/-1
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: 5620:0:(ldlm_lib.c:1580:abort_req_replay_queue()) Skipped 8 previous similar messages
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: soaked-MDT0004: Denying connection for new client 429ceaa9-f9a8-504b-0d85-c997f52f2036(at 192.168.1.130@o2ib100), waiting for 16 known clients (6 recovered, 0 in progress, and 10 evicted) to recover in 2:25
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: soaked-MDT0004: Recovery over after 2:04, of 16 clients 6 recovered and 10 were evicted.
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: Skipped 9 previous similar messages
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550816 batchid = 987843952824 flags = 0 ops = 5 params = 9
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550834 batchid = 987843952825 flags = 0 ops = 5 params = 4
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550870 batchid = 987843952826 flags = 0 ops = 5 params = 9
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550885 batchid = 987843952827 flags = 0 ops = 5 params = 4
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550923 batchid = 987843952828 flags = 0 ops = 6 params = 10
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550934 batchid = 987843952829 flags = 0 ops = 5 params = 4
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550960 batchid = 987843952830 flags = 0 ops = 5 params = 9
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550975 batchid = 987843952831 flags = 0 ops = 5 params = 4
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140551007 batchid = 987843952832 flags = 0 ops = 5 params = 9
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140551023 batchid = 987843952833 flags = 0 ops = 5 params = 4
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140551054 batchid = 987843952834 flags = 0 ops = 5 params = 9
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140551065 batchid = 987843952835 flags = 0 ops = 5 params = 4
            lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140551096 batchid = 987843952836 flags = 0 ops = 5 params = 9
            

            (Gap in trans. is an old bug INTL-156)

            Attached files: soak.log (to show events); client lola-29) kernel debug log file, messages, console logs; mds lola-10 messages, console logs

            heckes Frank Heckes (Inactive) added a comment - - edited Error happened also during soak testing of master branch (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151122 ). DNE is enabled. MDSes are in active-active HA failover configuration. Here the error occurred after the remount of the MDTs on MDS ( lola-10 ) completed successful (2015-11-26 00:27:36). Pasted the stack trace ones more, as context seems to be different than for the one above. Nov 26 00:29:48 lola-29 kernel: LustreError: 65535:0:(llite_lib.c:2295:ll_prep_inode()) ASSERTION( fid_is_sane(&md.body->mbo_fid1) ) failed: Nov 26 00:29:48 lola-29 kernel: LustreError: 65535:0:(llite_lib.c:2295:ll_prep_inode()) LBUG Nov 26 00:29:48 lola-29 kernel: Pid: 65535, comm: pct Nov 26 00:29:48 lola-29 kernel: Nov 26 00:29:48 lola-29 kernel: Call Trace: Nov 26 00:29:48 lola-29 kernel: [<ffffffffa050b875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] Nov 26 00:29:48 lola-29 kernel: [<ffffffffa050be77>] lbug_with_loc+0x47/0xb0 [libcfs] Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0abdb62>] ll_prep_inode+0x752/0xc40 [lustre] Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0802c10>] ? lustre_swab_mdt_body+0x0/0x130 [ptlrpc] Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0ad29d2>] ll_new_node+0x682/0x7f0 [lustre] Nov 26 00:29:48 lola-29 kernel: [<ffffffffa0ad5224>] ll_mkdir+0x104/0x220 [lustre] Nov 26 00:29:48 lola-29 kernel: [<ffffffff8122ec0f>] ? security_inode_permission+0x1f/0x30 Nov 26 00:29:48 lola-29 kernel: [<ffffffff8119d759>] vfs_mkdir+0xd9/0x140 Nov 26 00:29:48 lola-29 kernel: [<ffffffff811a04e7>] sys_mkdirat+0xc7/0x1b0 Nov 26 00:29:48 lola-29 kernel: [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60 Nov 26 00:29:48 lola-29 kernel: [<ffffffff811a05e8>] sys_mkdir+0x18/0x20 Nov 26 00:29:48 lola-29 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b Nov 26 00:29:48 lola-29 kernel: Nov 26 00:29:48 lola-29 kernel: LustreError: dumping log to /tmp/lustre-log.1448526588.65535 Chronologically this event can be correlated to the following error on lola-10 : lola-10.log:Nov 26 00:29:26 lola-10 kernel: Lustre: 3875:0:(client.c:1994:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1448526555/real 1448526555] req@ffff88041c873380 x1518889781449540/t0( 0) o400->soaked-MDT0004-osp-MDT0005@0@lo:24/4 lens 224/224 e 1 to 1 dl 1448526560 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1 lola-10.log:Nov 26 00:29:28 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140549951, ql: 10, comp: 6, conn: 16, next: 992140549952 , next_update 992140549968 last_committed: 992140549904) lola-10.log:Nov 26 00:29:28 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140549956, ql: 10, comp: 6, conn: 16, next: 992140549957 , next_update 992140549968 last_committed: 992140549904) etc... etc... lola-10.log:Nov 26 00:29:34 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140550321, ql: 10, comp: 6, conn: 16, next: 992140550322 , next_update 992140550332 last_committed: 992140550276) lola-10.log:Nov 26 00:29:34 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140550327, ql: 10, comp: 6, conn: 16, next: 992140550328 , next_update 992140550332 last_committed: 992140550276) lola-10.log:Nov 26 00:29:35 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140550374, ql: 10, comp: 6, conn: 16, next: 992140550375 , next_update 992140550385 last_committed: 992140550276) lola-10.log:Nov 26 00:29:35 lola-10 kernel: LustreError: 5620:0:(ldlm_lib.c:1883:check_for_next_transno()) soaked-MDT0004: waking for gap in transno, VBR is OFF (skip: 992140550379, ql: 10, comp: 6, conn: 16, next: 992140550380 , next_update 992140550385 last_committed: 992140550276) ... etc... ... lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: 5620:0:(ldlm_lib.c:1946:target_recovery_overseer()) recovery is aborted, evict exports in recovery lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: 5620:0:(ldlm_lib.c:1946:target_recovery_overseer()) Skipped 1 previous similar message lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: soaked-MDT0004: disconnecting 10 stale clients lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: 5620:0:(ldlm_lib.c:1580:abort_req_replay_queue()) @@@ aborted: req@ffff8808216889c0 x1518687394127176/t0(992140550777) o36->a8187858-2e65-5add-6cdf-c58a76255990@192.168.1.126@o2ib100:210/0 lens 624/0 e 0 to 0 dl 1448526600 ref 1 fl Complete:/4/ffffffff rc 0/-1 lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: 5620:0:(ldlm_lib.c:1580:abort_req_replay_queue()) Skipped 8 previous similar messages lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: soaked-MDT0004: Denying connection for new client 429ceaa9-f9a8-504b-0d85-c997f52f2036(at 192.168.1.130@o2ib100), waiting for 16 known clients (6 recovered, 0 in progress, and 10 evicted) to recover in 2:25 lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: soaked-MDT0004: Recovery over after 2:04, of 16 clients 6 recovered and 10 were evicted. lola-10.log:Nov 26 00:29:40 lola-10 kernel: Lustre: Skipped 9 previous similar messages lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550816 batchid = 987843952824 flags = 0 ops = 5 params = 9 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550834 batchid = 987843952825 flags = 0 ops = 5 params = 4 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550870 batchid = 987843952826 flags = 0 ops = 5 params = 9 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550885 batchid = 987843952827 flags = 0 ops = 5 params = 4 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550923 batchid = 987843952828 flags = 0 ops = 6 params = 10 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550934 batchid = 987843952829 flags = 0 ops = 5 params = 4 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550960 batchid = 987843952830 flags = 0 ops = 5 params = 9 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140550975 batchid = 987843952831 flags = 0 ops = 5 params = 4 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140551007 batchid = 987843952832 flags = 0 ops = 5 params = 9 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140551023 batchid = 987843952833 flags = 0 ops = 5 params = 4 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140551054 batchid = 987843952834 flags = 0 ops = 5 params = 9 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140551065 batchid = 987843952835 flags = 0 ops = 5 params = 4 lola-10.log:Nov 26 00:29:40 lola-10 kernel: LustreError: 5620:0:(update_records.c:72:update_records_dump()) master transno = 992140551096 batchid = 987843952836 flags = 0 ops = 5 params = 9 (Gap in trans. is an old bug INTL-156) Attached files: soak.log (to show events); client lola-29 ) kernel debug log file, messages, console logs; mds lola-10 messages, console logs

            The problem faced on lustre 2.5.1.

            The fid_is_sane() check should be skipped if -ENOENT is returned:

            Yes. But the problem here is that 0 returned instead of -ENOENT.

            static int mdt_intent_reint(enum mdt_it_code opcode,
                                        struct mdt_thread_info *info,
                                        struct ldlm_lock **lockp,
                                        __u64 flags)
            ...
                    if (rep->lock_policy_res2 == -ENOENT &&
                        mdt_get_disposition(rep, DISP_LOOKUP_NEG))
                            rep->lock_policy_res2 = 0;
            scherementsev Sergey Cheremencev added a comment - The problem faced on lustre 2.5.1. The fid_is_sane() check should be skipped if -ENOENT is returned: Yes. But the problem here is that 0 returned instead of -ENOENT. static int mdt_intent_reint(enum mdt_it_code opcode, struct mdt_thread_info *info, struct ldlm_lock **lockp, __u64 flags) ... if (rep->lock_policy_res2 == -ENOENT && mdt_get_disposition(rep, DISP_LOOKUP_NEG)) rep->lock_policy_res2 = 0;

            Sergey Cheremencev (sergey_cheremencev@xyratex.com) uploaded a new patch: http://review.whamcloud.com/17177
            Subject: LU-7422 mdt: fix ENOENT handling in mdt_intent_reint
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d1aeb6446540ef375bcade3f7d1215ad598d0675

            gerrit Gerrit Updater added a comment - Sergey Cheremencev (sergey_cheremencev@xyratex.com) uploaded a new patch: http://review.whamcloud.com/17177 Subject: LU-7422 mdt: fix ENOENT handling in mdt_intent_reint Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: d1aeb6446540ef375bcade3f7d1215ad598d0675

            What version of Lustre are you testing? The fid_is_sane() check should be skipped if -ENOENT is returned:

            int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
                              struct super_block *sb, struct lookup_intent *it)
            {
                    :
                    :
                    rc = md_get_lustre_md(sbi->ll_md_exp, req, sbi->ll_dt_exp,
                                          sbi->ll_md_exp, &md);
                    if (rc != 0)
                            GOTO(cleanup, rc);
                    :
                    :
                            /*
                             * At this point server returns to client's same fid as client
                             * generated for creating. So using ->fid1 is okay here.
                             */
                            LASSERT(fid_is_sane(&md.body->mbo_fid1));
            
            adilger Andreas Dilger added a comment - What version of Lustre are you testing? The fid_is_sane() check should be skipped if -ENOENT is returned: int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req, struct super_block *sb, struct lookup_intent *it) { : : rc = md_get_lustre_md(sbi->ll_md_exp, req, sbi->ll_dt_exp, sbi->ll_md_exp, &md); if (rc != 0) GOTO(cleanup, rc); : : /* * At this point server returns to client's same fid as client * generated for creating. So using ->fid1 is okay here. */ LASSERT(fid_is_sane(&md.body->mbo_fid1));

            People

              wc-triage WC Triage
              scherementsev Sergey Cheremencev
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: