Lustre / LU-6147

sanity-lfsck test_4: '(7) unexpected status'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: None
    • Labels: None
    • Severity: 3
    • Rank: 17167

    Description

      This issue was created by maloo for John Hammond <john.hammond@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/7626441a-a14b-11e4-807a-5254006e85c2.

      The sub-test test_4 failed with the following error:

      (7) unexpected status
      

      Please provide additional information about the failure here.

      Info required for matching: sanity-lfsck 4

      Attachments

        Activity

          [LU-6147] sanity-lfsck test_4: '(7) unexpected status'

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13493/
          Subject: LU-6147 lfsck: NOT purge object by OI scrub
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: d11360f4cc5d38cd748a97ca05e10121353ae616

          yujian Jian Yu added a comment -

          This is blocking patch review testing on the master branch:
          https://testing.hpdd.intel.com/test_sets/8e856a20-aa3a-11e4-9ffb-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/1c657568-aaa1-11e4-8eda-5254006e85c2

          yong.fan nasf (Inactive) added a comment -

          Many sanity-lfsck test_4 failures have appeared on Maloo recently:
          https://testing.hpdd.intel.com/sub_tests/query?utf8=✓&test_set%5Btest_set_script_id%5D=4f25830c-64fe-11e2-bfb2-52540035b04c&sub_test%5Bsub_test_script_id%5D=50859f34-64fe-11e2-bfb2-52540035b04c&sub_test%5Bstatus%5D=FAIL&sub_test%5Bquery_bugs%5D=&test_session%5Btest_host%5D=&test_session%5Btest_group%5D=&test_session%5Buser_id%5D=&test_session%5Bquery_date%5D=&test_session%5Bquery_recent_period%5D=&test_node%5Bos_type_id%5D=&test_node%5Bdistribution_type_id%5D=&test_node%5Barchitecture_type_id%5D=&test_node%5Bfile_system_type_id%5D=&test_node%5Blustre_branch_id%5D=&test_node_network%5Bnetwork_type_id%5D=&commit=Update+results

          So we need to land this patch ASAP to avoid blocking others.

          yong.fan nasf (Inactive) added a comment -

          Here is the patch: http://review.whamcloud.com/#/c/13493/

          yong.fan nasf (Inactive) added a comment -

          The LU-6109 failure has a more complex cause, and we need to investigate it further.
          But the cause of this failure (LU-6147) is known, so we will handle it separately.

          yong.fan nasf (Inactive) added a comment -

          Thanks John.

          jhammond John Hammond added a comment -

          Yes. It works for me.


          yong.fan nasf (Inactive) added a comment -

          John, would you please verify whether the patch http://review.whamcloud.com/#/c/13493/ works for you? Thanks!

          yong.fan nasf (Inactive) added a comment -

          This is another failure instance of LU-6109, caused by LFSCK deadlocking with itself when it tries to look up certain special objects, such as .lustre/lost+found/MDTxxxx.
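
          To make that deadlock pattern concrete, here is a minimal userspace sketch (not Lustre code; the struct, helper, and variable names are invented for illustration): a cache lookup that waits for a "dying" object to be released can never return when the waiting thread itself holds the reference that keeps the object alive. In the real trace below, the wait happens under lu_object_find_try()/lu_object_find_at().

          #include <stdbool.h>
          #include <stdio.h>

          struct obj {
              int  refcount;
              bool dying;     /* marked for purge, e.g. by OI scrub */
          };

          /* Simplified lookup: a dying object must leave the cache before
           * the lookup can succeed.  The real code sleeps on a waitqueue
           * in lu_object_find_try() until the last reference is dropped. */
          static struct obj *cache_find(struct obj *o)
          {
              while (o->dying) {
                  if (o->refcount > 0)
                      return NULL;    /* would sleep forever here */
              }
              o->refcount++;
              return o;
          }

          int main(void)
          {
              /* The scanning thread already pinned the object, then looks
               * it up again: the reference it holds is exactly what the
               * wait is waiting on, so it can never be woken. */
              struct obj lost_found = { .refcount = 1, .dying = true };

              if (cache_find(&lost_found) == NULL)
                  printf("lookup would block forever: self-deadlock\n");
              return 0;
          }
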
          jhammond John Hammond added a comment -

          When I run this locally I see an lfsck thread asleep in lu_object_find_at():

          [  515.234216] Lustre: DEBUG MARKER: == sanity-lfsck test 4: FID-in-dirent can be rebuilt after MDT file-level backup/restore == 10:23:00 (1421943780)
          [  521.780085] Lustre: Failing over lustre-MDT0000
          [  524.984187] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 192.168.122.201@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
          [  524.987971] LustreError: Skipped 13 previous similar messages
          [  527.956175] Lustre: 5093:0:(client.c:1942:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1421943787/real 1421943787]  req@ffff880064738188 x1491015708973216/t0(0) o251->MGC192.168.122.200@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1421943793 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
          [  527.961754] Lustre: 5093:0:(client.c:1942:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
          [  528.510534] Lustre: server umount lustre-MDT0000 complete
          [  529.308146] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
          [  531.180857] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
          [  531.902794] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
          [  537.656353] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
          [  539.324451] Lustre: *** cfs_fail_loc=1601, val=0***
          [  547.009622] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 3 clients reconnect
          [  547.011917] Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0000-lwp-MDT0001_UUID (at 192.168.122.201@tcp), waiting for all 3 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 1:06
          [  547.128862] Lustre: lustre-MDT0000: Recovery over after 0:01, of 3 clients 3 recovered and 0 were evicted.
          [  578.733248] Lustre: DEBUG MARKER: sanity-lfsck test_4: @@@@@@ FAIL: (7) unexpected status
          [  724.197393] INFO: task lfsck:5461 blocked for more than 120 seconds.
          [  724.198800]       Not tainted 2.6.32-431.29.2.el6.lustre.x86_64 #1
          [  724.200136] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          [  724.201748] lfsck         D 0000000000000000     0  5461      2 0x00000080
          [  724.203226]  ffff880067f37c00 0000000000000046 0000000000000000 ffff880065abe188
          [  724.204927]  ffff880067f37b70 ffffffffa03183fe ffff880067f37c00 ffffffffa04a5cd9
          [  724.206596]  ffff880062298cc0 ffff880067f37fd8 000000000000fcc8 ffff880062298cc0
          [  724.208431] Call Trace:
          [  724.208981]  [<ffffffffa03183fe>] ? cfs_hash_spin_unlock+0xe/0x10 [libcfs]
          [  724.210466]  [<ffffffffa04a5cd9>] ? lu_object_find_try+0x99/0x2b0 [obdclass]
          [  724.211890]  [<ffffffffa04a5f2d>] lu_object_find_at+0x3d/0xe0 [obdclass]
          [  724.212765]  [<ffffffff8155625b>] ? _read_unlock+0x2b/0x40
          [  724.213476]  [<ffffffff81061d90>] ? default_wake_function+0x0/0x20
          [  724.214280]  [<ffffffffa04a600f>] lu_object_find_slice+0x1f/0x80 [obdclass]
          [  724.215198]  [<ffffffffa09e0838>] lfsck_master_oit_engine+0x608/0x1f40 [lfsck]
          [  724.216154]  [<ffffffff810b74bd>] ? trace_hardirqs_on+0xd/0x10
          [  724.216999]  [<ffffffffa09e2da1>] lfsck_master_engine+0xc31/0x14b0 [lfsck]
          [  724.217933]  [<ffffffff81553065>] ? thread_return+0x4e/0x7e9
          [  724.218670]  [<ffffffff81061d90>] ? default_wake_function+0x0/0x20
          [  724.219474]  [<ffffffffa09e2170>] ? lfsck_master_engine+0x0/0x14b0 [lfsck]
          [  724.220382]  [<ffffffff8109e856>] kthread+0x96/0xa0
          [  724.220999]  [<ffffffff8100c30a>] child_rip+0xa/0x20
          [  724.221665]  [<ffffffff815562e0>] ? _spin_unlock_irq+0x30/0x40
          [  724.222430]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
          [  724.223110]  [<ffffffff8109e7c0>] ? kthread+0x0/0xa0
          [  724.223796]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
          

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 9
