Lustre / LU-657

recovery-mds-scale (FLAVOR=MDS): client load failed


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.4.0
    • Affects Version/s: Lustre 2.1.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 2.1.1, Lustre 2.1.4
    • Severity: 3
    • Rank: 4239

    Description

      After running for about 8.5 hours (30428 seconds, during which the MDS failed over 51 times), the recovery-mds-scale test failed as follows:

      Starting mds1: -o user_xattr,acl  /dev/disk/by-id/scsi-1IET_00010001 /mnt/mds1
      fat-amd-1-ib: debug=0x33f0404
      fat-amd-1-ib: subsystem_debug=0xffb7e3ff
      fat-amd-1-ib: debug_mb=48
      Started lustre-MDT0000
      ==== Checking the clients loads AFTER  failover -- failure NOT OK
      Client load failed on node client-12-ib, rc=1
      Client load failed during failover. Exiting
      Found the END_RUN_FILE file: /home/yujian/test_logs/end_run_file
      client-12-ib
      client-2-ib
      Client load failed on node client-12-ib
      
      client client-12-ib load stdout and debug files :
                    /tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib
                    /tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib.debug
      2011-09-01 08:15:46 Terminating clients loads ...
      Duration:                43200
      Server failover period: 600 seconds
      Exited after:           30428 seconds
      Number of failovers before exit:
      mds1: 51 times
      ost1: 0 times
      ost2: 0 times
      ost3: 0 times
      ost4: 0 times
      ost5: 0 times
      ost6: 0 times
      Status: FAIL: rc=1
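
      For context, the duration and failover period reported in the summary above are the parameters that drive this test. Below is a minimal invocation sketch; the variable names DURATION, SERVER_FAILOVER_PERIOD and LOADS are assumptions based on the Lustre test framework and are not taken from this log:

      # Hypothetical way to launch a run like the one above; the variable names
      # (DURATION, SERVER_FAILOVER_PERIOD, LOADS) are assumed and may differ
      # between Lustre versions.
      DURATION=43200 \
      SERVER_FAILOVER_PERIOD=600 \
      LOADS="dd tar dbench iozone" \
      bash recovery-mds-scale.sh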
      

      /tmp/recovery-mds-scale.log_run_dd.sh-client-12-ib.debug:

      2011-09-01 08:07:51: dd succeeded
      + cd /tmp
      + rm -rf /mnt/lustre/d0.dd-client-12-ib
      ++ date '+%F %H:%M:%S'
      + echoerr '2011-09-01 08:07:53: dd run finished'
      + echo '2011-09-01 08:07:53: dd run finished'
      2011-09-01 08:07:53: dd run finished
      + '[' '!' -e /home/yujian/test_logs/end_run_file ']'
      + true
      ++ date '+%F %H:%M:%S'
      + echoerr '2011-09-01 08:07:53: dd run starting'
      + echo '2011-09-01 08:07:53: dd run starting'
      2011-09-01 08:07:53: dd run starting
      + mkdir -p /mnt/lustre/d0.dd-client-12-ib
      + cd /mnt/lustre/d0.dd-client-12-ib
      + load_pid=26714
      + wait 26714
      + dd bs=4k count=1000000 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-client-12-ib/dd-file
      dd: writing `/mnt/lustre/d0.dd-client-12-ib/dd-file': No space left on device
      805559+0 records in
      805558+0 records out
      + '[' 1 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2011-09-01 08:09:10: dd failed'
      + echo '2011-09-01 08:09:10: dd failed'
      2011-09-01 08:09:10: dd failed
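
      The trace above corresponds to the client load loop in run_dd.sh, which keeps writing and removing a test directory until the end-run file appears. A simplified reconstruction of that loop, based only on the trace above (END_RUN_FILE and echoerr come from the test framework; their exact definitions here are assumptions):

      # Simplified reconstruction of the run_dd.sh client load loop seen in the trace.
      # END_RUN_FILE and echoerr are normally provided by the test framework; the
      # definitions below are assumptions for illustration only.
      END_RUN_FILE=${END_RUN_FILE:-/home/yujian/test_logs/end_run_file}
      echoerr() { echo "$@"; }

      TESTDIR=/mnt/lustre/d0.dd-$(hostname)
      while [ ! -e "$END_RUN_FILE" ]; do
          echoerr "$(date '+%F %H:%M:%S'): dd run starting"
          mkdir -p "$TESTDIR" && cd "$TESTDIR" || break
          dd bs=4k count=1000000 status=noxfer if=/dev/zero of="$TESTDIR/dd-file" &
          load_pid=$!
          wait $load_pid
          if [ $? -eq 0 ]; then
              echoerr "$(date '+%F %H:%M:%S'): dd succeeded"
              cd /tmp && rm -rf "$TESTDIR"
              echoerr "$(date '+%F %H:%M:%S'): dd run finished"
          else
              echoerr "$(date '+%F %H:%M:%S'): dd failed"
              echo "$(hostname)" >> "$END_RUN_FILE"   # assumption: failure is signalled to the harness via END_RUN_FILE
              break
          fi
      done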
      

      The syslog on client-12-ib showed:

      Sep  1 08:09:10 client-12 kernel: Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover
      Sep  1 08:09:10 client-12 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.4.135@o2ib. The ost_write operation failed with -28
      Sep  1 08:09:10 client-12 kernel: LustreError: 26714:0:(vvp_io.c:1001:vvp_io_commit_write()) Write page 805558 of inode ffff880312852a38 failed -28
      

      The dd error and the syslog on client-12-ib showed "No space left on device" (-28/ENOSPC) for the OSS node fat-amd-4 (192.168.4.135); however, df on that node showed:

      [root@fat-amd-4-ib ~]# lctl dl
        0 UP mgc MGC192.168.4.132@o2ib f176795b-1295-66f4-d018-e6c09ba5b112 5
        1 UP ost OSS OSS_uuid 3
        2 UP obdfilter lustre-OST0001 lustre-OST0001_UUID 21
        3 UP obdfilter lustre-OST0003 lustre-OST0003_UUID 21
        4 UP obdfilter lustre-OST0005 lustre-OST0005_UUID 21
      
      [root@fat-amd-4-ib ~]# df
      Filesystem           1K-blocks      Used Available Use% Mounted on
      /dev/sda1             20642428   2223180  17370672  12% /
      tmpfs                  8165072         0   8165072   0% /dev/shm
      /dev/sdk              15939896   6847620   8292276  46% /mnt/ost2
      /dev/sdg              15939896    460468  14679428   4% /mnt/ost4
      /dev/sdh              15939896    463104  14676792   4% /mnt/ost6
      

      "lfs df" and "lfs df -i" on the client nodes also showed that there were free enough spaces and inodes on the OSTs. What's more, for each load (dd, tar, dbench, iozone) running on different client node, the load dir would always be removed after a successful run and before a new run (please refer to the recovery-mds-scale.log_run_

      {dd,tar,dbench,iozone}

      .sh-*.debug files), so there should not be ENOSPC error.
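
      A minimal sketch of those checks, plus an OSS-side data point that could be compared against the df output (the obdfilter parameter names are an assumption and were not part of this run's logs):

      # Client-side checks described above: per-OST free space and free inodes.
      lfs df /mnt/lustre
      lfs df -i /mnt/lustre

      # Possible additional data point on the OSS: the space accounting kept by
      # the OST (obdfilter) versus what df reports.  The parameter names below
      # are assumptions for this Lustre version.
      lctl get_param obdfilter.*.kbytesfree obdfilter.*.kbytesavail
      lctl get_param obdfilter.*.tot_granted obdfilter.*.tot_dirty obdfilter.*.tot_pending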

      The syslog on fat-amd-4-ib showed:

      Sep  1 08:00:05 fat-amd-4-ib kernel: Pid: 4286, comm: ll_ost_io_127
      Sep  1 08:00:05 fat-amd-4-ib kernel:
      Sep  1 08:00:05 fat-amd-4-ib kernel: Call Trace:
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffff81247e8f>] ? __generic_unplug_device+0x1f/0x40
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffff814dc995>] schedule_timeout+0x215/0x2e0
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffff814dc613>] wait_for_common+0x123/0x180
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffff81249a5d>] ? submit_bio+0x8d/0x120
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffff814dc72d>] wait_for_completion+0x1d/0x20
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8124cedd>] __blkdev_issue_flush+0xad/0xe0
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8124cf26>] blkdev_issue_flush+0x16/0x20
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa041f8ab>] ldiskfs_sync_file+0x17b/0x250 [ldiskfs]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa0a15865>] filter_sync+0x285/0x3e0 [obdfilter]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8115a340>] ? cache_alloc_refill+0x1c0/0x240
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa04fe98c>] ? lprocfs_counter_add+0x12c/0x196 [lvfs]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa09cfe0c>] ost_blocking_ast+0x58c/0xe40 [ost]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa064db73>] ldlm_cancel_callback+0x63/0xf0 [ptlrpc]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa064dc59>] ldlm_lock_cancel+0x59/0x190 [ptlrpc]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa066fe36>] ldlm_request_cancel+0x256/0x420 [ptlrpc]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa09d9720>] ost_handle+0x3d60/0x4b90 [ost]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa049b6b1>] ? libcfs_debug_vmsg2+0x4d1/0xb50 [libcfs]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa0694104>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa069684c>] ? lustre_msg_get_status+0x3c/0xa0 [ptlrpc]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a4c7e>] ptlrpc_main+0xb8e/0x1900 [ptlrpc]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a40f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a40f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffffa06a40f0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
      Sep  1 08:00:05 fat-amd-4-ib kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      Sep  1 08:00:05 fat-amd-4-ib kernel:
      Sep  1 08:00:05 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889205.4286
      Sep  1 08:00:05 fat-amd-4-ib kernel: Lustre: Service thread pid 4210 was inactive for 156.04s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      Sep  1 08:00:05 fat-amd-4-ib kernel: Lustre: Skipped 2 previous similar messages
      Sep  1 08:00:05 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889205.4270
      Sep  1 08:00:05 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889205.4269
      Sep  1 08:00:06 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889206.4237
      Sep  1 08:00:06 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889206.4267
      Sep  1 08:00:06 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889206.4283
      Sep  1 08:00:12 fat-amd-4-ib kernel: Lustre: Service thread pid 4218 was inactive for 156.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      Sep  1 08:00:12 fat-amd-4-ib kernel: Lustre: Skipped 12 previous similar messages
      Sep  1 08:00:12 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889212.4218
      Sep  1 08:00:15 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889215.4285
      Sep  1 08:00:18 fat-amd-4-ib kernel: Lustre: Service thread pid 4201 was inactive for 156.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      Sep  1 08:00:18 fat-amd-4-ib kernel: Lustre: Skipped 2 previous similar messages
      Sep  1 08:00:18 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889218.4201
      Sep  1 08:00:25 fat-amd-4-ib kernel: Lustre: Service thread pid 4204 completed after 177.20s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Sep  1 08:00:25 fat-amd-4-ib kernel: Lustre: Skipped 27 previous similar messages
      Sep  1 08:00:35 fat-amd-4-ib kernel: Lustre: Service thread pid 4249 was inactive for 156.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      Sep  1 08:00:35 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889235.4249
      Sep  1 08:00:36 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889236.4275
      Sep  1 08:00:43 fat-amd-4-ib kernel: LustreError: dumping log to /tmp/lustre-log.1314889243.4222
      Sep  1 08:00:48 fat-amd-4-ib kernel: Lustre: Service thread pid 4226 completed after 200.23s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      

      Maloo report: https://maloo.whamcloud.com/test_sets/b4bac0f4-d523-11e0-8d02-52540025f9af

      Since the log files are too large, I could not upload them to this ticket. Please find /scratch/logs/2.1.0/recovery-mds-scale.1314890151.log.tar.bz2 on the node brent.whamcloud.com.


    People

      Assignee: Hongchao Zhang
      Reporter: Jian Yu
      Votes: 0
      Watchers: 8
