LU-1246: SANITY_QUOTA test_32 failed in cleanup_and_setup_lustre with LOAD_MODULES_REMOTE=true


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.6
    • Labels: None
    • Environment: One MGS/MDS, two OSSs, two clients, lustre-1.8.6.81.
      Servers are running RHEL 5.7, clients are running SLES 11 SP1.
      OFED 1.5.4.1.
    • Severity: 3
    • Rank (Obsolete): 6108

    Description

SANITY_QUOTA test_32 always failed. The test was started from service331 (which is actually a Lustre client):

      ...
      Formatting mgs, mds, osts
      ...
      Setup mgs, mdt, osts
      start mds /dev/sdb1 -o errors=panic,acl
      Starting mds: -o errors=panic,acl /dev/sdb1 /mnt/mds
      service360: Reading test skip list from /usr/lib64/lustre/tests/cfg/tests-to-skip.sh
      service360: #!/bin/bash
      service360: #SANITY_BIGFILE_EXCEPT="64b"
      service360: #export SANITY_EXCEPT="$SANITY_BIGFILE_EXCEPT"
      service360: MDSSIZE=2000000, OSTSIZE=2000000.
      service360: ncli_nas.sh: Before init_clients_lists
      service360: ncli_nas.sh: Done init_clients_lists
      service360: lnet.debug=0x33f1504
      service360: lnet.subsystem_debug=0xffb7e3ff
      service360: lnet.debug_mb=16
      Started lustre-MDT0000
      start ost1 /dev/sdb1 -o errors=panic,mballoc,extents
      Starting ost1: -o errors=panic,mballoc,extents /dev/sdb1 /mnt/ost1
      service361: mount.lustre: mount /dev/sdb1 at /mnt/ost1 failed: Cannot send after transport endpoint shutdown
      mount -t lustre /dev/sdb1 /mnt/ost1
      Start of /dev/sdb1 on ost1 failed 108
      ...
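
      Error 108 here is ESHUTDOWN, the errno behind the "Cannot send after transport endpoint shutdown" message above; Lustre returns it when an RPC is attempted over an import that has been invalidated. As an illustrative check (not part of the test output), the mapping can be confirmed from the errno headers on most Linux nodes:

      # ESHUTDOWN is 108 on Linux, matching the mount.lustre failure above
      grep -w ESHUTDOWN /usr/include/asm-generic/errno.h
      # => #define ESHUTDOWN 108 /* Cannot send after transport endpoint shutdown */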

It seems this test is the only one that sets LOAD_MODULES_REMOTE=true before calling cleanup_and_setup_lustre, and it is the only one failing this way. Sometimes only OST1 hit the error 108 problem; sometimes both OST1 and OST2 were hit. I put a "sleep 3" in setupall() after the MDS started but before trying to start the OSTs (see the sketch below), but it did not help.
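
      For reference, this is roughly where the delay was inserted. This is a minimal sketch assuming the 1.8-era setupall() layout in lustre/tests/test-framework.sh; the surrounding framework code is elided and the helper/variable names (start, ostdevname, MDSDEV, OSTCOUNT, the mount-option variables) follow that framework:

      # Sketch only: the attempted workaround inside setupall() (test-framework.sh)
      setupall() {
          # ... module loading and target formatting happen earlier in the real function ...
          start mds $MDSDEV "$MDS_MOUNT_OPTS"
          sleep 3    # attempted workaround: let the MGS/MDS settle first (did not help)
          for num in $(seq $OSTCOUNT); do
              start ost$num $(ostdevname $num) "$OST_MOUNT_OPTS"
          done
          # ... client mounts follow in the real function ...
      }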

The 'dmesg' from the MDS (service360) showed:
      Lustre: DEBUG MARKER: == test 32: check lqs hash(bug 21846) ========================================== == 11:05:01
      Lustre: MDT lustre-MDT0000 has stopped.
      LustreError: 28890:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
      LustreError: 28890:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
      Lustre: MGS has stopped.
      Lustre: server umount lustre-MDT0000 complete
      Lustre: Removed LNI 10.151.26.38@o2ib
      Lustre: OBD class driver, http://wiki.whamcloud.com/
      Lustre: Lustre Version: 1.8.6.81
      Lustre: Build Version: lustre/scripts-1.8.6
      Lustre: Listener bound to ib1:10.151.26.38:987:mlx4_0
      Lustre: Register global MR array, MR size: 0xffffffffffffffff, array size: 1
      Lustre: Added LNI 10.151.26.38@o2ib [8/64/0/180]
      Lustre: Filtering OBD driver; http://wiki.whamcloud.com/
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
      Lustre: MGS MGS started
      Lustre: MGC10.151.26.38@o2ib: Reactivating import
      Lustre: MGS: Logs for fs lustre were removed by user request. All servers must be restarted in order to regenerate the logs.
      Lustre: Setting parameter lustre-mdtlov.lov.stripesize in log lustre-MDT0000
      Lustre: Enabling user_xattr
      Lustre: Enabling ACL
      Lustre: lustre-MDT0000: new disk, initializing
      Lustre: lustre-MDT0000: Now serving lustre-MDT0000 on /dev/sdb1 with recovery enabled
      Lustre: 30206:0:(lproc_mds.c:271:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
      Lustre: MGS: Regenerating lustre-OSTffff log by user request.
      Lustre: lustre-MDT0000: temporarily refusing client connection from 10.151.25.182@o2ib
      LustreError: 30130:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (11) req@ffff8103f64cc000 x1397073513545734/t0 o38><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1332353120 ref 1 fl Interpret:/0/0 rc -11/0
      Lustre: 30308:0:(mds_lov.c:1155:mds_notify()) MDS lustre-MDT0000: add target lustre-OST0000_UUID
      Lustre: 29699:0:(quota_master.c:1718:mds_quota_recovery()) Only 0/1 OSTs are active, abort quota recovery
      Lustre: MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
      Lustre: DEBUG MARKER: Using TIMEOUT=20
      Lustre: DEBUG MARKER: sanity-quota test_32: @@@@@@ FAIL: Rehearsh didn't happen
      Lustre: DEBUG MARKER: == test 99: Quota off =============================== == 11:08:33

      The 'dmesg' from OST1 (service361) showed:
      Lustre: DEBUG MARKER: == test 32: check lqs hash(bug 21846) ========================================== == 11:05:01
      Lustre: OST lustre-OST0000 has stopped.
      LustreError: 5972:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
      LustreError: 5972:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
      Lustre: 5972:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1395992728675719 sent from MGC10.151.26.38@o2ib to NID 10.151.26.38@o2ib 6s ago has timed out (6s prior to deadline).
      req@ffff810407edd800 x1395992728675719/t0 o251->MGS@MGC10.151.26.38@o2ib_0:26/25 lens 192/384 e 0 to 1 dl 1332352932 ref 1 fl Rpc:N/0/0 rc 0/0
      Lustre: server umount lustre-OST0000 complete
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
      LDISKFS-fs (sdb1): mounted filesystem with ordered data mode
      Lustre: 20460:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1395992728675720 sent from MGC10.151.26.38@o2ib to NID 10.151.26.38@o2ib 0s ago has failed due to network error (5s prior to deadline).
      req@ffff8103ca5edc00 x1395992728675720/t0 o250->MGS@MGC10.151.26.38@o2ib_0:26/25 lens 368/584 e 0 to 1 dl 1332353093 ref 1 fl Rpc:N/0/0 rc 0/0
      LustreError: 7058:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8103b2ca7800 x1395992728675721/t0 o253->MGS@MGC10.151.26.38@o2ib_0:26/25 lens 4736/4928 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
      LustreError: 7058:0:(obd_mount.c:1112:server_start_targets()) Required registration failed for lustre-OSTffff: -108
      LustreError: 7058:0:(obd_mount.c:1670:server_fill_super()) Unable to start targets: -108
      LustreError: 7058:0:(obd_mount.c:1453:server_put_super()) no obd lustre-OSTffff
      LustreError: 7058:0:(obd_mount.c:147:server_deregister_mount()) lustre-OSTffff not registered
      Lustre: server umount lustre-OSTffff complete
      LustreError: 7058:0:(obd_mount.c:2065:lustre_fill_super()) Unable to mount (-108)
      Lustre: DEBUG MARKER: Using TIMEOUT=20
      Lustre: DEBUG MARKER: sanity-quota test_32: @@@@@@ FAIL: Rehearsh didn't happen
      Lustre: DEBUG MARKER: == test 99: Quota off =============================== == 11:08:33
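
      The OST-side trace suggests the MGC connection to the MGS was still down right after the remote module reload (the o250/o253 requests to the MGS fail, leaving the import IMP_INVALID, so target registration returns -108). A simple way to check whether LNET on the freshly reloaded OSS can reach the MGS NID at that point would be (an illustrative debugging step, not something the test performs):

      # On the OSS (service361), right after the modules are reloaded:
      lctl list_nids                  # confirm the local o2ib NID came back up
      lctl ping 10.151.26.38@o2ib     # verify the MGS/MDS NID is reachable over o2ib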

            People

              Assignee: Niu Yawei (Inactive)
              Reporter: Jay Lan (Inactive)
