
[LU-7372] replay-dual test_26: test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.8.0, Lustre 2.9.0, Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.12.4
    • Environment: Server/Client: master, build # 3225, RHEL 6.7
    • Severity: 3

    Description

      This issue was created by maloo for Saurabh Tandan <saurabh.tandan@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/1e79d2a6-7d21-11e5-a254-5254006e85c2.

      The sub-test test_26 failed with the following error:

      test failed to respond and timed out
      

      Client dmesg:

      Lustre: DEBUG MARKER: test_26 fail mds1 1 times
      LustreError: 980:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1445937610, 300s ago), entering recovery for MGS@10.2.4.140@tcp ns: MGC10.2.4.140@tcp lock: ffff88007bdd82c0/0x956ab2c8047544d6 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 1 type: PLN flags: 0x1000000000000 nid: local remote: 0x223a79061b204538 expref: -99 pid: 980 timeout: 0 lvb_type: 0
      Lustre: 29433:0:(client.c:2039:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1445937910/real 1445937910]  req@ffff880028347980 x1516173751413108/t0(0) o250->MGC10.2.4.140@tcp@10.2.4.140@tcp:26/25 lens 520/544 e 0 to 1 dl 1445937916 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 29433:0:(client.c:2039:ptlrpc_expire_one_request()) Skipped 67 previous similar messages
      

      MDS console:

      09:22:17:LustreError: 24638:0:(client.c:1138:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88004d92c980 x1516158358024328/t0(0) o101->lustre-MDT0000-lwp-MDT0000@0@lo:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
      09:25:19:LustreError: 24638:0:(client.c:1138:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
      09:25:19:LustreError: 24638:0:(qsd_reint.c:55:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
      09:25:19:LustreError: 24638:0:(qsd_reint.c:55:qsd_reint_completion()) Skipped 1 previous similar message
      09:25:19:INFO: task umount:24629 blocked for more than 120 seconds.
      09:25:19:      Not tainted 2.6.32-573.7.1.el6_lustre.x86_64 #1
      09:25:19:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      09:25:19:umount        D 0000000000000000     0 24629  24628 0x00000080
      09:25:19: ffff880059e2bb48 0000000000000086 0000000000000000 00000000000708b7
      09:25:20: 0000603500000000 000000ac00000000 00001c1fd9b9c014 ffff880059e2bb98
      09:25:20: ffff880059e2bb58 0000000101d3458a ffff880076ee3ad8 ffff880059e2bfd8
      09:25:20:Call Trace:
      09:25:20: [<ffffffff8153a756>] __mutex_lock_slowpath+0x96/0x210
      09:25:20: [<ffffffff8153a27b>] mutex_lock+0x2b/0x50
      09:25:20: [<ffffffffa02cb30d>] mgc_process_config+0x1dd/0x1210 [mgc]
      09:25:20: [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      09:25:20: [<ffffffffa07fe28d>] obd_process_config.clone.0+0x8d/0x2e0 [obdclass]
      09:25:20: [<ffffffffa0476b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      09:25:20: [<ffffffffa08024c2>] lustre_end_log+0x262/0x6a0 [obdclass]
      09:25:20: [<ffffffffa082efb1>] server_put_super+0x911/0xed0 [obdclass]
      09:25:20: [<ffffffff811b0116>] ? invalidate_inodes+0xf6/0x190
      09:25:20: [<ffffffff8119437b>] generic_shutdown_super+0x5b/0xe0
      09:25:20: [<ffffffff81194466>] kill_anon_super+0x16/0x60
      09:25:20: [<ffffffffa07fa096>] lustre_kill_super+0x36/0x60 [obdclass]
      09:25:20: [<ffffffff81194c07>] deactivate_super+0x57/0x80
      09:25:20: [<ffffffff811b4a7f>] mntput_no_expire+0xbf/0x110
      09:25:20: [<ffffffff811b55cb>] sys_umount+0x7b/0x3a0
      09:25:20: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
      

      Info required for matching: replay-dual test_26
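
      The MDS console trace above shows the shape of the hang: an MGC thread apparently holds the configuration-log mutex while waiting on an MGS lock enqueue that never completes (the client dmesg shows that enqueue timing out into recovery), so umount -> lustre_end_log -> mgc_process_config() blocks forever on the same mutex and trips the 120-second hung-task watchdog. Below is a minimal userspace sketch of that pattern only; it is not Lustre code, and all names (config_mutex, reply_cond, and the 5-second timed lock standing in for the watchdog) are illustrative.

      #include <pthread.h>
      #include <stdio.h>
      #include <time.h>
      #include <unistd.h>

      static pthread_mutex_t config_mutex = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  reply_cond   = PTHREAD_COND_INITIALIZER;
      static int reply_arrived;        /* never set: the server is going away */

      /* Stand-in for the MGC config thread: takes the config mutex, then
       * waits for a lock reply that will never be signalled. */
      static void *mgc_config_thread(void *arg)
      {
              (void)arg;
              pthread_mutex_lock(&config_mutex);
              while (!reply_arrived)
                      pthread_cond_wait(&reply_cond, &config_mutex);
              pthread_mutex_unlock(&config_mutex);
              return NULL;
      }

      int main(void)
      {
              pthread_t tid;
              struct timespec deadline;

              pthread_create(&tid, NULL, mgc_config_thread, NULL);
              sleep(1);  /* let the config thread take the mutex first */

              /* Stand-in for the umount path: it needs the same mutex and
               * can only wait; a 5s timed lock plays the role of the
               * kernel's 120s hung-task watchdog. */
              clock_gettime(CLOCK_REALTIME, &deadline);
              deadline.tv_sec += 5;
              if (pthread_mutex_timedlock(&config_mutex, &deadline) != 0)
                      printf("umount path blocked: config mutex never released\n");
              return 0;
      }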

      Attachments

        1. 1453855057.tgz
          24.15 MB
        2. log-7372
          65 kB


          Activity


jamesanunez James Nunez (Inactive) added a comment -

            We have another instance of replay-dual test 26 failing in umount. Interestingly, the test simply fails and the suite continues with the next test; it does not time out or hang as a result of this failure.

            The MDS (vm4) console logs show the same umount call trace as above.

            Logs are at: https://testing.whamcloud.com/test_sets/f594d332-8023-11e8-97ff-52540065bddc

            mdiep Minh Diep added a comment -

            +1 on b2_10: https://testing.hpdd.intel.com/test_sets/07ca0dd2-43d0-11e8-960d-52540065bddc
            yujian Jian Yu added a comment -

            By searching on Maloo, I found replay-dual test 26 hung at least 17 times in the last week.

            Here is a failure instance on master branch:
            https://testing.hpdd.intel.com/test_sets/358e9718-fda5-11e7-a7cd-52540065bddc

            Console log on MDS:

            [51864.393931] LustreError: 14617:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1010000:0x0], rc:-5
            [51864.395871] LustreError: 14617:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 1 previous similar message
            [51960.236441] INFO: task umount:14605 blocked for more than 120 seconds.
            [51960.237288] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [51960.238163] umount          D ffff88005c01cf10     0 14605  14604 0x00000080
            [51960.238987] Call Trace:
            [51960.239287]  [<ffffffff816ac5d9>] schedule_preempt_disabled+0x29/0x70
            [51960.240077]  [<ffffffff816aa407>] __mutex_lock_slowpath+0xc7/0x1d0
            [51960.241003]  [<ffffffff816a981f>] mutex_lock+0x1f/0x2f
            [51960.241719]  [<ffffffffc069c3e7>] mgc_process_config+0x207/0x13f0 [mgc]
            [51960.242608]  [<ffffffffc0950a26>] obd_process_config.constprop.14+0x76/0x230 [obdclass]
            [51960.243674]  [<ffffffffc0638d47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
            [51960.244490]  [<ffffffffc093d639>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
            [51960.245406]  [<ffffffffc0951cbf>] lustre_end_log+0x1ff/0x550 [obdclass]
            [51960.246218]  [<ffffffffc097d15e>] server_put_super+0x7de/0xcd0 [obdclass]
            [51960.247052]  [<ffffffff812054d2>] generic_shutdown_super+0x72/0x100
            [51960.247873]  [<ffffffff812058a2>] kill_anon_super+0x12/0x20
            [51960.248579]  [<ffffffffc09500c2>] lustre_kill_super+0x32/0x50 [obdclass]
            [51960.249417]  [<ffffffff81205c59>] deactivate_locked_super+0x49/0x60
            [51960.250180]  [<ffffffff812063c6>] deactivate_super+0x46/0x60
            [51960.250878]  [<ffffffff8122376f>] cleanup_mnt+0x3f/0x80
            [51960.251555]  [<ffffffff81223802>] __cleanup_mnt+0x12/0x20
            [51960.252269]  [<ffffffff810aee05>] task_work_run+0xc5/0xf0
            [51960.252967]  [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
            [51960.253708]  [<ffffffff816b8d37>] int_signal+0x12/0x17
            

            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30796
            Subject: LU-7372 test: Remove replay-dual test_26 from ALWAYS_EXCEPT
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 3872c328fa09d5fbde801e5e88e783ae27bf99a6


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30677/
            Subject: LU-7372 test: Remove replay-dual test_26 from ALWAYS_EXCEPT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: bda752429ad6e0d7be0eb76c8abe1bf8561ac32f


            gerrit Gerrit Updater added a comment -

            James Casper (jamesx.casper@intel.com) uploaded a new patch: https://review.whamcloud.com/30677
            Subject: LU-7372 test: Remove replay-dual test_26 from ALWAYS_EXCEPT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 09d341f5fb7b09283cb4515b17f10905564fdf4c


            jamesanunez James Nunez (Inactive) added a comment -

            replay-dual test 26 is still on the ALWAYS_EXCEPT list and needs to be removed from it to fully verify that the patches fix this issue.


            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28323/
            Subject: LU-7372 mgs: reprocess all locks at device fini
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 6df5f062727fc216072bc8daa7dfbc545876a2b0


            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28323
            Subject: LU-7372 mgs: reprocess all locks at device fini
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 3eed63b655e18de53540b7aaa33aa69e17f78257

            pjones Peter Jones added a comment -

            Landed for 2.11


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/17853/
            Subject: LU-7372 mgs: reprocess all locks at device fini
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 2dc19f20ba9fcc1bcac6ae7ee5169ce10caab882

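
            The merged fix's subject, "mgs: reprocess all locks at device fini", describes the shutdown-side remedy: when the MGS device is torn down, its lock queues are reprocessed so that waiting enqueues are not left blocked forever. The sketch below illustrates that general idea only, in userspace and with hypothetical names (ns_dying, enqueue_wait(), device_fini()); the actual patch operates on the MGS's LDLM state, not on pthread primitives.

            #include <errno.h>
            #include <pthread.h>
            #include <stdio.h>
            #include <unistd.h>

            static pthread_mutex_t ns_lock = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t  ns_wake = PTHREAD_COND_INITIALIZER;
            static int ns_dying;     /* set once fini begins */
            static int reply_ready;  /* would be set by a granted lock */

            /* Blocking enqueue: waits for a reply, but also bails out
             * once fini has begun. */
            static int enqueue_wait(void)
            {
                    int rc = 0;

                    pthread_mutex_lock(&ns_lock);
                    while (!reply_ready && !ns_dying)
                            pthread_cond_wait(&ns_wake, &ns_lock);
                    if (!reply_ready)
                            rc = -ESHUTDOWN; /* unwind; caller releases its mutexes */
                    pthread_mutex_unlock(&ns_lock);
                    return rc;
            }

            static void *waiter(void *arg)
            {
                    (void)arg;
                    printf("enqueue returned %d\n", enqueue_wait());
                    return NULL;
            }

            /* Device fini: "reprocess" the queue by waking every waiter,
             * so none is left blocked on a lock that will never be granted. */
            static void device_fini(void)
            {
                    pthread_mutex_lock(&ns_lock);
                    ns_dying = 1;
                    pthread_cond_broadcast(&ns_wake);
                    pthread_mutex_unlock(&ns_lock);
            }

            int main(void)
            {
                    pthread_t tid;

                    pthread_create(&tid, NULL, waiter, NULL);
                    sleep(1);                /* waiter is now blocked */
                    device_fini();           /* the previously missing step */
                    pthread_join(tid, NULL); /* now completes promptly */
                    return 0;
            }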

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 18
