Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3094

Sanity test_132: @@@@@@ FAIL: some glimpse RPC is expected and silent errors during mount

    XMLWordPrintable

Details

    • Bug
    • Resolution: Not a Bug
    • Minor
    • None
    • Lustre 2.4.0
    • None
    • The review queue in Maloo
    • 3
    • 7514

    Description

      Lustre Errors are reported in Sanity Test_132.

      I believe they are related to later errors that cannot read some /proc stats from the OST. LU-2902 (Sanity test 156) and LU-2735 (Sanity test 151) and example of the later issues.

      In general Test_132 remounts the OST and sometimes all it not well during the mount.

      Whenever I find missing data from the /proc filesystem I see Lustre Errors during test 132. The test does not allays fail 132 but there allways errors during the remount.

      I find the 132 issues by searching for Sanity 156/151 and looking for FAIL with " NOT IN CACHE: before: , after: " (the Null /proc reads)

      So what errors are seen?

      In the Sanity test 132 OST dmesg you see things like:

      Lustre: DEBUG MARKER: mkdir -p /mnt/ost3; mount -t lustre   		                   /dev/lvm-OSS/P3 /mnt/ost3
      LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts: 
      Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
      Lustre: DEBUG MARKER: e2label /dev/lvm-OSS/P3 2>/dev/null
      Lustre: lustre-OST0001: deleting orphan objects from 0x0:6257 to 6304
      LustreError: 8844:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 6272: rc -2
      LustreError: 8844:0:(ldlm_resource.c:1161:ldlm_resource_get()) Skipped 244 previous similar messages
      Lustre: DEBUG MARKER: mkdir -p /mnt/ost4
      
      Lustre: DEBUG MARKER: test -b /dev/lvm-OSS/P5
      Lustre: lustre-OST0001: deleting orphan objects from 0x0:6257 to 6304
      Lustre: Skipped 2 previous similar messages
      LustreError: 10424:0:(ldlm_resource.c:1161:ldlm_resource_get()) lvbo_init failed for resource 5376: rc -2
      LustreError: 10424:0:(ldlm_resource.c:1161:ldlm_resource_get()) Skipped 22 previous similar messages
      Lustre: DEBUG MARKER: mkdir -p /mnt/ost5; mount -t lustre   		                   /dev/lvm-OSS/P5 /mnt/ost5
      LDISKFS-fs (dm-4): mounted filesystem with ordered data mode. quota=on. Opts: 
      LustreError: 137-5: UUID 'lustre-OST0005_UUID' is not available for connect (no target)
      LustreError: 137-5: UUID 'lustre-OST0006_UUID' is not available for connect (no target)
      

      The Above strikes me a quite serious and it complains about finding the underlying storage target.

      Lustre: DEBUG MARKER: mkdir -p /mnt/ost3; mount -t lustre   		                   /dev/lvm-OSS/P3 /mnt/ost3
      LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts: 
      Lustre: 5288:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1364764845/real 1364764845]  req@ffff88030461c800 x1431056565766172/t0(0) o38->lustre-MDT0000-lwp-OST0001@192.168.4.20@o2ib:12/10 lens 400/544 e 0 to 1 dl 1364764855 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 5288:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      LustreError: 11293:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880328c36000 x1431056565766204/t0(0) o101->MGC192.168.4.20@o2ib@192.168.4.20@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 11293:0:(client.c:1052:ptlrpc_import_delay_req()) Skipped 3 previous similar messages
      LustreError: 12040:0:(obd_mount.c:1715:server_register_target()) Cannot talk to the MGS: -5, not fatal
      

      In general the remount caused by 132 is quite messy in dmesg but it does not always seem to be the same errors. There seem to be communication issues. When 132 has these Lustre errors several other tests will have issues.

      Most of the time when the mount has errors 132 does not fail but sanity tests 133, 151, and 156 all fail together.

      I opened this as a blocker as I think this may be a root cause for several issues and deserves a proper evaluation.

      The runs below are recent example of the issues and there are plenty more if you search maloo:
      https://maloo.whamcloud.com/test_sets/786cf564-9b82-11e2-bd87-52540035b04c
      https://maloo.whamcloud.com/test_sets/636517d2-98f3-11e2-af89-52540035b04c
      https://maloo.whamcloud.com/test_sets/016cbfcc-9816-11e2-879d-52540035b04c

      Attachments

        Issue Links

          Activity

            People

              niu Niu Yawei (Inactive)
              keith Keith Mannthey (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: