Lustre / LU-10519

replay-vbr fails to start running tests due to MDS not restarting

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.3
    • Components: None
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      replay-vbr does not start running any tests because the clients cannot be unmounted at the end of the preceding replay-dual run. The following description is based on the replay-vbr results at
      https://testing.hpdd.intel.com/test_sets/22a8e51e-f750-11e7-a10a-52540065bddc

      In the replay-vbr suite_log, we see a problem contacting the MGS:

      onyx-42vm1: mount.lustre: mount onyx-42vm7:onyx-42vm8:/lustre at /mnt/lustre2 failed: Input/output error
      onyx-42vm1: Is the MGS running?
      onyx-42vm4: mount.lustre: mount onyx-42vm7:onyx-42vm8:/lustre at /mnt/lustre2 failed: Input/output error
      onyx-42vm4: Is the MGS running?
      onyx-42vm3: mount.lustre: mount onyx-42vm7:onyx-42vm8:/lustre at /mnt/lustre2 failed: Input/output error
      onyx-42vm3: Is the MGS running?
      Stopping clients: onyx-42vm1.onyx.hpdd.intel.com,onyx-42vm3,onyx-42vm4 /mnt/lustre (opts:)
      CMD: onyx-42vm1.onyx.hpdd.intel.com,onyx-42vm3,onyx-42vm4 running=\$(grep -c /mnt/lustre' ' /proc/mounts);
      if [ \$running -ne 0 ] ; then
      echo Stopping client \$(hostname) /mnt/lustre opts:;
      lsof /mnt/lustre || need_kill=no;
      if [ x != x -a x\$need_kill != xno ]; then
          pids=\$(lsof -t /mnt/lustre | sort -u);
          if [ -n \"\$pids\" ]; then
                   kill -9 \$pids;
          fi
      fi;
      while umount  /mnt/lustre 2>&1 | grep -q busy; do
          echo /mnt/lustre is still busy, wait one second && sleep 1;
      done;
      fi
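
      For readability, here is the same stop-clients fragment with the shell escaping removed, as a runnable sketch (the MOUNT variable is my addition; the rest is reconstructed from the log above). Note that the kill test begins with the literal comparison x != x, which is always false when no force option is substituted in, so the kill -9 branch is dead code and everything depends on the umount retry loop:

      MOUNT=/mnt/lustre
      running=$(grep -c "$MOUNT " /proc/mounts)
      if [ $running -ne 0 ]; then
          echo "Stopping client $(hostname) $MOUNT opts:"
          lsof "$MOUNT" || need_kill=no
          # "x != x" is always false, so this branch never runs unless the
          # framework substitutes a force option into the first operand.
          if [ x != x -a "x$need_kill" != xno ]; then
              pids=$(lsof -t "$MOUNT" | sort -u)
              if [ -n "$pids" ]; then
                  kill -9 $pids
              fi
          fi
          # In the traces below, umount blocks in the kernel in D state,
          # so this loop never gets the chance to complete.
          while umount "$MOUNT" 2>&1 | grep -q busy; do
              echo "$MOUNT is still busy, wait one second" && sleep 1
          done
      fi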
      

      In the console log of client 4 (vm4), we see the unmount of the file system hang at the end of replay-dual, and this carries over into replay-vbr:

      [ 3200.984760] Lustre: DEBUG MARKER: == replay-dual test complete, duration 2281 sec ====================================================== 00:16:26 (1515716186)
      [ 3201.183181] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts);
      [ 3201.183181] if [ $running -ne 0 ] ; then
      [ 3201.183181] echo Stopping client $(hostname) /mnt/lustre2 opts:;
      [ 3201.183181] lsof /mnt/lustre2 || need_kill=no;
      [ 3201.183181] if [ x != x -a x$need_kill != xno ]; then
      [ 3201.183181]     pids=$(lsof -t /mnt/lustre2 | sort -u);
      [ 3201.183181]    
      [ 3360.201681] INFO: task umount:17035 blocked for more than 120 seconds.
      [ 3360.202477] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 3360.203291] umount          D ffff880069f5bf40     0 17035  17028 0x00000080
      [ 3360.204040] Call Trace:
      [ 3360.204299]  [<ffffffff816ac5f9>] schedule_preempt_disabled+0x29/0x70
      [ 3360.204982]  [<ffffffff816aa427>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 3360.205732]  [<ffffffff816a983f>] mutex_lock+0x1f/0x2f
      [ 3360.206277]  [<ffffffffc061e147>] mgc_process_config+0x207/0x13f0 [mgc]
      [ 3360.206996]  [<ffffffffc0704a75>] obd_process_config.constprop.13+0x85/0x2d0 [obdclass]
      [ 3360.207837]  [<ffffffffc05babc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [ 3360.208560]  [<ffffffffc06f1de9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [ 3360.209290]  [<ffffffffc0705d9f>] lustre_end_log+0x1ff/0x550 [obdclass]
      [ 3360.210025]  [<ffffffffc0afcc0d>] ll_put_super+0x8d/0xaa0 [lustre]
      [ 3360.210672]  [<ffffffff81248c61>] ? fsnotify_destroy_marks+0x81/0x90
      [ 3360.211327]  [<ffffffff8113ae4d>] ? call_rcu_sched+0x1d/0x20
      [ 3360.211922]  [<ffffffffc0b2794c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
      [ 3360.212634]  [<ffffffff8121f5a8>] ? destroy_inode+0x38/0x60
      [ 3360.213206]  [<ffffffff8121f6da>] ? evict+0x10a/0x180
      [ 3360.213789]  [<ffffffff8121f78e>] ? dispose_list+0x3e/0x50
      [ 3360.214338]  [<ffffffff81220446>] ? evict_inodes+0xd6/0x140
      [ 3360.214903]  [<ffffffff812054d2>] generic_shutdown_super+0x72/0x100
      [ 3360.215628]  [<ffffffff812058a2>] kill_anon_super+0x12/0x20
      [ 3360.216207]  [<ffffffffc0704115>] lustre_kill_super+0x45/0x50 [obdclass]
      [ 3360.216925]  [<ffffffff81205c59>] deactivate_locked_super+0x49/0x60
      [ 3360.217570]  [<ffffffff812063c6>] deactivate_super+0x46/0x60
      [ 3360.218186]  [<ffffffff8122375f>] cleanup_mnt+0x3f/0x80
      [ 3360.218753]  [<ffffffff812237f2>] __cleanup_mnt+0x12/0x20
      [ 3360.219318]  [<ffffffff810aee05>] task_work_run+0xc5/0xf0
      [ 3360.219860]  [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
      [ 3360.220504]  [<ffffffff816b8d37>] int_signal+0x12/0x17
      [ 3365.065664] LustreError: 166-1: MGC10.2.8.151@tcp: Connection to MGS (at 10.2.8.152@tcp) was lost; in progress operations using this service will fail
      [ 3365.067092] LustreError: Skipped 2 previous similar messages
      [ 3365.067690] LustreError: 3103:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1515716050, 300s ago), entering recovery for MGS@10.2.8.152@tcp ns: MGC10.2.8.151@tcp lock: ffff88007bb0a000/0xa3346a06e12d0d8e lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x62460a664e0e6625 expref: -99 pid: 3103 timeout: 0 lvb_type: 0
      [ 3365.071478] LustreError: 17045:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.2.8.151@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff880069efcb40) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [ 3365.073441] LustreError: 17045:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x65727473756c:0x2:0x0].0x0 (ffff880069efcb40) refcount = 2
      [ 3365.074760] LustreError: 17045:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
      [ 3365.075569] LustreError: 17045:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: ?? lock: ffff88007bb0a000/0xa3346a06e12d0d8e lrc: 4/1,0 mode: --/CR res: ?? rrc=?? type: ??? flags: 0x1106400000000 nid: local remote: 0x62460a664e0e6625 expref: -99 pid: 3103 timeout: 0 lvb_type: 0
      [ 3365.092970] Lustre: Unmounted lustre-client
      [ 3377.067710] Lustre: lustre-MDT0000-mdc-ffff88007a24d800: Connection to lustre-MDT0000 (at 10.2.8.152@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 3377.069582] Lustre: Skipped 31 previous similar messages
      [ 3379.942361] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  auster : @@@@@@ FAIL: Restart of mds1 failed! 
      [ 3380.113928] Lustre: DEBUG MARKER: auster : @@@@@@ FAIL: Restart of mds1 failed!
      [ 3380.325749] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /tmp/test_logs/1515713768/auster..debug_log.$(hostname -s).1515716366.log;
      [ 3380.325749]          dmesg > /tmp/test_logs/1515713768/auster..dmesg.$(hostname -s).1515716366.log
      [ 3380.730891] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [ 3381.125455] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: replay-vbr ============----- Fri Jan 12 00:19:27 UTC 2018
      [ 3381.320973] Lustre: DEBUG MARKER: -----============= acceptance-small: replay-vbr ============----- Fri Jan 12 00:19:27 UTC 2018
      [ 3382.105085] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: 12a
      [ 3382.287300] Lustre: DEBUG MARKER: excepting tests: 12a
      [ 3382.470266] Lustre: DEBUG MARKER: 
      [ 3382.470266] running=$(mount | grep -c /mnt/lustre2' ');
      [ 3382.470266] rc=0;
      [ 3382.470266] if [ $running -eq 0 ] ; then
      [ 3382.470266] 	mkdir -p /mnt/lustre2;
      [ 3382.470266] 	mount -t lustre  -o user_xattr,flock onyx-42vm7:onyx-42vm8:/lustre /mnt/lustre2;
      [ 3382.470266] 	rc=$?;
      [ 3382.470266] fi;
      [ 3382.470266] exit $rc
      [ 3402.626707] LustreError: 15c-8: MGC10.2.8.151@tcp: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      [ 3402.629189] Lustre: Unmounted lustre-client
      [ 3402.629745] LustreError: 17479:0:(obd_mount.c:1506:lustre_fill_super()) Unable to mount  (-5)
      [ 3403.290421] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
      [ 3403.290421] if [ $running -ne 0 ] ; then
      [ 3403.290421] echo Stopping client $(hostname) /mnt/lustre opts:;
      [ 3403.290421] lsof /mnt/lustre || need_kill=no;
      [ 3403.290421] if [ x != x -a x$need_kill != xno ]; then
      [ 3403.290421]     pids=$(lsof -t /mnt/lustre | sort -u);
      [ 3403.290421]     if 
      [ 3600.220677] INFO: task umount:17558 blocked for more than 120 seconds.
      [ 3600.221454] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 3600.222245] umount          D ffff88007a1a5ee0     0 17558  17551 0x00000080
      [ 3600.223132] Call Trace:
      [ 3600.223389]  [<ffffffff816ac5f9>] schedule_preempt_disabled+0x29/0x70
      [ 3600.224028]  [<ffffffff816aa427>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 3600.224762]  [<ffffffff816a983f>] mutex_lock+0x1f/0x2f
      [ 3600.225339]  [<ffffffffc061e147>] mgc_process_config+0x207/0x13f0 [mgc]
      [ 3600.226023]  [<ffffffffc0704a75>] obd_process_config.constprop.13+0x85/0x2d0 [obdclass]
      [ 3600.226895]  [<ffffffffc05babc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [ 3600.227594]  [<ffffffffc06f1de9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [ 3600.228319]  [<ffffffffc0705d9f>] lustre_end_log+0x1ff/0x550 [obdclass]
      [ 3600.229078]  [<ffffffffc0afcc0d>] ll_put_super+0x8d/0xaa0 [lustre]
      [ 3600.229740]  [<ffffffff81248c61>] ? fsnotify_destroy_marks+0x81/0x90
      [ 3600.230428]  [<ffffffff8113ae4d>] ? call_rcu_sched+0x1d/0x20
      [ 3600.231006]  [<ffffffffc0b2794c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
      [ 3600.231737]  [<ffffffff8121f5a8>] ? destroy_inode+0x38/0x60
      [ 3600.232336]  [<ffffffff8121f6da>] ? evict+0x10a/0x180
      [ 3600.232858]  [<ffffffff8121f78e>] ? dispose_list+0x3e/0x50
      [ 3600.233474]  [<ffffffff81220446>] ? evict_inodes+0xd6/0x140
      [ 3600.234032]  [<ffffffff812054d2>] generic_shutdown_super+0x72/0x100
      [ 3600.234729]  [<ffffffff812058a2>] kill_anon_super+0x12/0x20
      [ 3600.235330]  [<ffffffffc0704115>] lustre_kill_super+0x45/0x50 [obdclass]
      [ 3600.236001]  [<ffffffff81205c59>] deactivate_locked_super+0x49/0x60
      [ 3600.236716]  [<ffffffff812063c6>] deactivate_super+0x46/0x60
      [ 3600.237317]  [<ffffffff8122375f>] cleanup_mnt+0x3f/0x80
      [ 3600.237858]  [<ffffffff812237f2>] __cleanup_mnt+0x12/0x20
      [ 3600.238460]  [<ffffffff810aee05>] task_work_run+0xc5/0xf0
      [ 3600.239006]  [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
      [ 3600.239647]  [<ffffffff816b8d37>] int_signal+0x12/0x17
      [ 3628.070881] Lustre: 3090:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1515716613/real 1515716613]  req@ffff88006a659e00 x1589341203865152/t0(0) o250->MGC10.2.8.151@tcp@10.2.8.151@tcp:26/25 lens 520/544 e 0 to 1 dl 1515716638 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      [ 3628.074097] Lustre: 3090:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 276 previous similar messages
      [ 3665.080686] LustreError: 3103:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1515716350, 300s ago), entering recovery for MGS@10.2.8.152@tcp ns: MGC10.2.8.151@tcp lock: ffff88007bb0b800/0xa3346a06e12d0db8 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x62460a664e0e6acb expref: -99 pid: 3103 timeout: 0 lvb_type: 0
      [ 3720.239678] INFO: task umount:17558 blocked for more than 120 seconds.
      [ 3720.240435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 3720.241277] umount          D ffff88007a1a5ee0     0 17558  17551 0x00000080
      [ 3720.242105] Call Trace:
      [ 3720.242379]  [<ffffffff816ac5f9>] schedule_preempt_disabled+0x29/0x70
      [ 3720.243024]  [<ffffffff816aa427>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 3720.243723]  [<ffffffff816a983f>] mutex_lock+0x1f/0x2f
      [ 3720.244271]  [<ffffffffc061e147>] mgc_process_config+0x207/0x13f0 [mgc]
      [ 3720.244954]  [<ffffffffc0704a75>] obd_process_config.constprop.13+0x85/0x2d0 [obdclass]
      [ 3720.245805]  [<ffffffffc05babc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [ 3720.246498]  [<ffffffffc06f1de9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [ 3720.247224]  [<ffffffffc0705d9f>] lustre_end_log+0x1ff/0x550 [obdclass]
      [ 3720.247956]  [<ffffffffc0afcc0d>] ll_put_super+0x8d/0xaa0 [lustre]
      [ 3720.248601]  [<ffffffff81248c61>] ? fsnotify_destroy_marks+0x81/0x90
      [ 3720.249241]  [<ffffffff8113ae4d>] ? call_rcu_sched+0x1d/0x20
      [ 3720.249876]  [<ffffffffc0b2794c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
      [ 3720.250558]  [<ffffffff8121f5a8>] ? destroy_inode+0x38/0x60
      [ 3720.251118]  [<ffffffff8121f6da>] ? evict+0x10a/0x180
      [ 3720.251684]  [<ffffffff8121f78e>] ? dispose_list+0x3e/0x50
      [ 3720.252278]  [<ffffffff81220446>] ? evict_inodes+0xd6/0x140
      [ 3720.252837]  [<ffffffff812054d2>] generic_shutdown_super+0x72/0x100
      [ 3720.253521]  [<ffffffff812058a2>] kill_anon_super+0x12/0x20
      [ 3720.254095]  [<ffffffffc0704115>] lustre_kill_super+0x45/0x50 [obdclass]
      [ 3720.254820]  [<ffffffff81205c59>] deactivate_locked_super+0x49/0x60
      [ 3720.255464]  [<ffffffff812063c6>] deactivate_super+0x46/0x60
      [ 3720.256028]  [<ffffffff8122375f>] cleanup_mnt+0x3f/0x80
      [ 3720.256624]  [<ffffffff812237f2>] __cleanup_mnt+0x12/0x20
      [ 3720.257206]  [<ffffffff810aee05>] task_work_run+0xc5/0xf0
      [ 3720.257816]  [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
      [ 3720.258412]  [<ffffffff816b8d37>] int_signal+0x12/0x17
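
      In both hung-task reports, umount is blocked in mutex_lock() inside mgc_process_config() while lustre_end_log() tears down the client configuration log, and the surrounding messages show why: the MGC's connection to the MGS was lost, and the config lock enqueue times out after 300 seconds. A quick way to check from a client whether the MGS is reachable at that point is something like the following (a diagnostic sketch; the NIDs are the ones from this run):

      # LNet-ping the MGS NIDs (primary and failover) seen in the log.
      lctl ping 10.2.8.151@tcp
      lctl ping 10.2.8.152@tcp

      # List the configured devices; the MGC device and its state appear here.
      lctl dl

      # Look for umount processes stuck in uninterruptible (D) sleep.
      ps -eo pid,stat,wchan:30,cmd | grep '[u]mount'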
      

      From client 1 (vm1) console log:

      [ 3077.173859] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-dual test complete, duration 2281 sec ====================================================== 00:16:26 \(1515716186\)
      [ 3077.357365] Lustre: DEBUG MARKER: == replay-dual test complete, duration 2281 sec ====================================================== 00:16:26 (1515716186)
      [ 3077.572239] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts);
      [ 3077.572239] if [ $running -ne 0 ] ; then
      [ 3077.572239] echo Stopping client $(hostname) /mnt/lustre2 opts:;
      [ 3077.572239] lsof /mnt/lustre2 || need_kill=no;
      [ 3077.572239] if [ x != x -a x$need_kill != xno ]; then
      [ 3077.572239]     pids=$(lsof -t /mnt/lustre2 | sort -u);
      [ 3077.572239]    
      [ 3240.203972] INFO: task umount:20148 blocked for more than 120 seconds.
      [ 3240.204906] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 3240.205860] umount          D ffff88007b438000     0 20148  20141 0x00000080
      [ 3240.206815] Call Trace:
      [ 3240.207101]  [<ffffffff816ac5f9>] schedule_preempt_disabled+0x29/0x70
      [ 3240.207823]  [<ffffffff816aa427>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 3240.208472]  [<ffffffff816a983f>] mutex_lock+0x1f/0x2f
      [ 3240.209093]  [<ffffffffc0a7b147>] mgc_process_config+0x207/0x13f0 [mgc]
      [ 3240.209783]  [<ffffffffc0750a75>] obd_process_config.constprop.13+0x85/0x2d0 [obdclass]
      [ 3240.210610]  [<ffffffffc05d9bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [ 3240.211394]  [<ffffffffc073dde9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [ 3240.212154]  [<ffffffffc0751d9f>] lustre_end_log+0x1ff/0x550 [obdclass]
      [ 3240.212914]  [<ffffffffc0bd9c0d>] ll_put_super+0x8d/0xaa0 [lustre]
      [ 3240.213548]  [<ffffffff81248c61>] ? fsnotify_destroy_marks+0x81/0x90
      [ 3240.214268]  [<ffffffff8113ae4d>] ? call_rcu_sched+0x1d/0x20
      [ 3240.214871]  [<ffffffffc0c0494c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
      [ 3240.215545]  [<ffffffff8121f5a8>] ? destroy_inode+0x38/0x60
      [ 3240.216205]  [<ffffffff8121f6da>] ? evict+0x10a/0x180
      [ 3240.216723]  [<ffffffff8121f78e>] ? dispose_list+0x3e/0x50
      [ 3240.217303]  [<ffffffff81220446>] ? evict_inodes+0xd6/0x140
      [ 3240.217941]  [<ffffffff812054d2>] generic_shutdown_super+0x72/0x100
      [ 3240.218626]  [<ffffffff812058a2>] kill_anon_super+0x12/0x20
      [ 3240.218951] LustreError: 166-1: MGC10.2.8.151@tcp: Connection to MGS (at 10.2.8.152@tcp) was lost; in progress operations using this service will fail
      [ 3240.218952] LustreError: Skipped 2 previous similar messages
      [ 3240.218977] LustreError: 23661:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1515716050, 300s ago), entering recovery for MGS@10.2.8.152@tcp ns: MGC10.2.8.151@tcp lock: ffff880062da5600/0x9eda9d6180716331 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x62460a664e0e661e expref: -99 pid: 23661 timeout: 0 lvb_type: 0
      [ 3240.221609] LustreError: 2809:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.2.8.151@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff880060c1d480) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [ 3240.221611] LustreError: 2809:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x65727473756c:0x2:0x0].0x0 (ffff880060c1d480) refcount = 2
      [ 3240.221612] LustreError: 2809:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
      [ 3240.221617] LustreError: 2809:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.2.8.151@tcp lock: ffff880062da5600/0x9eda9d6180716331 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0x62460a664e0e661e expref: -99 pid: 23661 timeout: 0 lvb_type: 0
      [ 3240.232538]  [<ffffffffc0750115>] lustre_kill_super+0x45/0x50 [obdclass]
      [ 3240.233322]  [<ffffffff81205c59>] deactivate_locked_super+0x49/0x60
      [ 3240.234090]  [<ffffffff812063c6>] deactivate_super+0x46/0x60
      [ 3240.234676]  [<ffffffff8122375f>] cleanup_mnt+0x3f/0x80
      [ 3240.235214]  [<ffffffff812237f2>] __cleanup_mnt+0x12/0x20
      [ 3240.235766]  [<ffffffff810aee05>] task_work_run+0xc5/0xf0
      [ 3240.236339]  [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
      [ 3240.236916]  [<ffffffff816b8d37>] int_signal+0x12/0x17
      [ 3240.246310] Lustre: Unmounted lustre-client
      [ 3252.218981] Lustre: lustre-MDT0000-mdc-ffff880060f37000: Connection to lustre-MDT0000 (at 10.2.8.152@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 3252.220948] Lustre: Skipped 30 previous similar messages
      [ 3256.312576] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  auster : @@@@@@ FAIL: Restart of mds1 failed! 
      [ 3256.492539] Lustre: DEBUG MARKER: auster : @@@@@@ FAIL: Restart of mds1 failed!
      [ 3256.704074] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /tmp/test_logs/1515713768/auster..debug_log.$(hostname -s).1515716366.log;
      [ 3256.704074]          dmesg > /tmp/test_logs/1515713768/auster..dmesg.$(hostname -s).1515716366.log
      [ 3257.096964] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [ 3257.506302] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: replay-vbr ============----- Fri Jan 12 00:19:27 UTC 2018
      [ 3257.698451] Lustre: DEBUG MARKER: -----============= acceptance-small: replay-vbr ============----- Fri Jan 12 00:19:27 UTC 2018
      [ 3257.983880] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-vbr test complete, duration -o sec ========================================================= 00:19:27 \(1515716367\)
      [ 3258.157503] Lustre: DEBUG MARKER: == replay-vbr test complete, duration -o sec ========================================================= 00:19:27 (1515716367)
      [ 3258.484817] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: 12a
      [ 3258.667646] Lustre: DEBUG MARKER: excepting tests: 12a
      [ 3258.852092] Lustre: DEBUG MARKER: 
      [ 3258.852092] running=$(mount | grep -c /mnt/lustre2' ');
      [ 3258.852092] rc=0;
      [ 3258.852092] if [ $running -eq 0 ] ; then
      [ 3258.852092] 	mkdir -p /mnt/lustre2;
      [ 3258.852092] 	mount -t lustre  -o user_xattr,flock onyx-42vm7:onyx-42vm8:/lustre /mnt/lustre2;
      [ 3258.852092] 	rc=$?;
      [ 3258.852092] fi;
      [ 3258.852092] exit $rc
      [ 3279.009972] LustreError: 15c-8: MGC10.2.8.151@tcp: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      [ 3279.012386] Lustre: Unmounted lustre-client
      [ 3279.012960] LustreError: 4457:0:(obd_mount.c:1506:lustre_fill_super()) Unable to mount  (-5)
      [ 3279.662897] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
      [ 3279.662897] if [ $running -ne 0 ] ; then
      [ 3279.662897] echo Stopping client $(hostname) /mnt/lustre opts:;
      [ 3279.662897] lsof /mnt/lustre || need_kill=no;
      [ 3279.662897] if [ x != x -a x$need_kill != xno ]; then
      [ 3279.662897]     pids=$(lsof -t /mnt/lustre | sort -u);
      [ 3279.662897]     if 
      [ 3304.220947] Lustre: 12857:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1515716398/real 1515716398]  req@ffff880060594600 x1589341091652896/t0(0) o38->lustre-MDT0000-mdc-ffff880060f37000@10.2.8.152@tcp:12/10 lens 520/544 e 0 to 1 dl 1515716414 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 3304.223822] Lustre: 12857:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 198 previous similar messages
      [ 3480.236958] INFO: task umount:4583 blocked for more than 120 seconds.
      [ 3480.237631] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 3480.238417] umount          D ffff8800368adee0     0  4583   4576 0x00000080
      [ 3480.239168] Call Trace:
      [ 3480.239436]  [<ffffffff816ac5f9>] schedule_preempt_disabled+0x29/0x70
      [ 3480.240083]  [<ffffffff816aa427>] __mutex_lock_slowpath+0xc7/0x1d0
      [ 3480.240822]  [<ffffffff816a983f>] mutex_lock+0x1f/0x2f
      [ 3480.241356]  [<ffffffffc0a7b147>] mgc_process_config+0x207/0x13f0 [mgc]
      [ 3480.242104]  [<ffffffffc0750a75>] obd_process_config.constprop.13+0x85/0x2d0 [obdclass]
      [ 3480.242891]  [<ffffffffc05d9bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [ 3480.243557]  [<ffffffffc073dde9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [ 3480.244358]  [<ffffffffc0751d9f>] lustre_end_log+0x1ff/0x550 [obdclass]
      [ 3480.245040]  [<ffffffffc0bd9c0d>] ll_put_super+0x8d/0xaa0 [lustre]
      [ 3480.245705]  [<ffffffff81248c61>] ? fsnotify_destroy_marks+0x81/0x90
      [ 3480.246346]  [<ffffffff8113ae4d>] ? call_rcu_sched+0x1d/0x20
      [ 3480.246993]  [<ffffffffc0c0494c>] ? ll_destroy_inode+0x1c/0x20 [lustre]
      [ 3480.247637]  [<ffffffff8121f5a8>] ? destroy_inode+0x38/0x60
      [ 3480.248208]  [<ffffffff8121f6da>] ? evict+0x10a/0x180
      [ 3480.248769]  [<ffffffff8121f78e>] ? dispose_list+0x3e/0x50
      [ 3480.249331]  [<ffffffff81220446>] ? evict_inodes+0xd6/0x140
      [ 3480.249954]  [<ffffffff812054d2>] generic_shutdown_super+0x72/0x100
      [ 3480.250612]  [<ffffffff812058a2>] kill_anon_super+0x12/0x20
      [ 3480.251193]  [<ffffffffc0750115>] lustre_kill_super+0x45/0x50 [obdclass]
      [ 3480.251914]  [<ffffffff81205c59>] deactivate_locked_super+0x49/0x60
      [ 3480.252539]  [<ffffffff812063c6>] deactivate_super+0x46/0x60
      [ 3480.253170]  [<ffffffff8122375f>] cleanup_mnt+0x3f/0x80
      [ 3480.253681]  [<ffffffff812237f2>] __cleanup_mnt+0x12/0x20
      [ 3480.254231]  [<ffffffff810aee05>] task_work_run+0xc5/0xf0
      [ 3480.254824]  [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
      [ 3480.255398]  [<ffffffff816b8d37>] int_signal+0x12/0x17
      [ 3540.237952] LustreError: 23661:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1515716350, 300s ago), entering recovery for MGS@10.2.8.151@tcp ns: MGC10.2.8.151@tcp lock: ffff880061b52800/0x9eda9d61807163e7 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x62460a664e0e6abd expref: -99 pid: 23661 timeout: 0 lvb_type: 0
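
      As an aside, the resource every client is waiting on, [0x65727473756c:0x2:0x0], is the configuration lock held against the MGS, and its first component is simply the fsname packed into an integer, stored little-endian. Decoding it (illustrative one-liner):

      # 0x65727473756c read as little-endian bytes is 6c 75 73 74 72 65:
      printf '\x6c\x75\x73\x74\x72\x65\n'    # prints "lustre"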
      

      From client 3 (vm3) console log:

      [ 3200.888758] Lustre: DEBUG MARKER: == replay-dual test complete, duration 2281 sec ====================================================== 00:16:26 (1515716186)
      [ 3201.086884] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts);
      [ 3201.086884] if [ $running -ne 0 ] ; then
      [ 3201.086884] echo Stopping client $(hostname) /mnt/lustre2 opts:;
      [ 3201.086884] lsof /mnt/lustre2 || need_kill=no;
      [ 3201.086884] if [ x != x -a x$need_kill != xno ]; then
      [ 3201.086884]     pids=$(lsof -t /mnt/lustre2 | sort -u);
      [ 3201.086884]    
      [ 3201.417836] Lustre: Unmounted lustre-client
      [ 3362.842127] LustreError: 166-1: MGC10.2.8.151@tcp: Connection to MGS (at 10.2.8.152@tcp) was lost; in progress operations using this service will fail
      [ 3362.843597] LustreError: Skipped 2 previous similar messages
      [ 3362.844193] LustreError: 3109:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1515716048, 300s ago), entering recovery for MGS@10.2.8.152@tcp ns: MGC10.2.8.151@tcp lock: ffff88007b650000/0x66776ab671d1d0f5 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x62460a664e0e6617 expref: -99 pid: 3109 timeout: 0 lvb_type: 0
      [ 3362.852523] LustreError: 28374:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.2.8.151@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (ffff88000010ac00) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [ 3362.854550] LustreError: 28374:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x65727473756c:0x2:0x0].0x0 (ffff88000010ac00) refcount = 2
      [ 3362.855911] LustreError: 28374:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
      [ 3362.856734] LustreError: 28374:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: ?? lock: ffff88007b650000/0x66776ab671d1d0f5 lrc: 4/1,0 mode: --/CR res: ?? rrc=?? type: ??? flags: 0x1106400000000 nid: local remote: 0x62460a664e0e6617 expref: -99 pid: 3109 timeout: 0 lvb_type: 0
      [ 3379.845411] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  auster : @@@@@@ FAIL: Restart of mds1 failed! 
      [ 3380.015373] Lustre: DEBUG MARKER: auster : @@@@@@ FAIL: Restart of mds1 failed!
      [ 3380.229705] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /tmp/test_logs/1515713768/auster..debug_log.$(hostname -s).1515716366.log;
      [ 3380.229705]          dmesg > /tmp/test_logs/1515713768/auster..dmesg.$(hostname -s).1515716366.log
      [ 3380.621370] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      [ 3381.043662] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: replay-vbr ============----- Fri Jan 12 00:19:27 UTC 2018
      [ 3381.222383] Lustre: DEBUG MARKER: -----============= acceptance-small: replay-vbr ============----- Fri Jan 12 00:19:27 UTC 2018
      [ 3382.008755] Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: 12a
      [ 3382.191345] Lustre: DEBUG MARKER: excepting tests: 12a
      [ 3382.375529] Lustre: DEBUG MARKER: 
      [ 3382.375529] running=$(mount | grep -c /mnt/lustre2' ');
      [ 3382.375529] rc=0;
      [ 3382.375529] if [ $running -eq 0 ] ; then
      [ 3382.375529] 	mkdir -p /mnt/lustre2;
      [ 3382.375529] 	mount -t lustre  -o user_xattr,flock onyx-42vm7:onyx-42vm8:/lustre /mnt/lustre2;
      [ 3382.375529] 	rc=$?;
      [ 3382.375529] fi;
      [ 3382.375529] exit $rc
      [ 3402.532825] LustreError: 15c-8: MGC10.2.8.151@tcp: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      [ 3402.535372] Lustre: Unmounted lustre-client
      [ 3402.536342] LustreError: 32090:0:(obd_mount.c:1506:lustre_fill_super()) Unable to mount  (-5)
      [ 3403.185319] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
      [ 3403.185319] if [ $running -ne 0 ] ; then
      [ 3403.185319] echo Stopping client $(hostname) /mnt/lustre opts:;
      [ 3403.185319] lsof /mnt/lustre || need_kill=no;
      [ 3403.185319] if [ x != x -a x$need_kill != xno ]; then
      [ 3403.185319]     pids=$(lsof -t /mnt/lustre | sort -u);
      [ 3403.185319]     if 
      [ 3410.455336] Lustre: Unmounted lustre-client
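
      For triaging similar runs, the whole failure sequence can be pulled out of a client console log in one pass (console.log is a placeholder for the saved log file):

      grep -E 'blocked for more than 120|Connection to MGS|lock timed out|FAIL: Restart|Unable to mount' console.log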
      

      There are several reported failures that look like this one, but I can't find any whose client-side symptoms match what we see here.

            People

              Assignee: Emoly Liu (emoly.liu)
              Reporter: James Nunez (jamesanunez) (Inactive)
              Votes: 0
              Watchers: 8
