LU-15735

sanity-sec test_19: OSS not responding after "force_sync=1"

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      This issue was created by maloo for eaujames <eaujames@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/bdc21484-cfd2-46d9-a0a9-a46d1a5682b2

      test_19 failed with the following error:

      test_19 returned 1
      

      This failure appears in a review-ldiskfs-arm session, but the OSS was running on x86_64.

      Client test:

      CMD: onyx-124vm1 lctl set_param -n osd*.*OS*.force_sync=1
      ...
      CMD: onyx-81vm3 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
      CMD: onyx-81vm3 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: onyx-81vm3 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: onyx-81vm3 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      ...
      Delete is not completed in 37 seconds
      CMD: onyx-81vm3 /usr/sbin/lctl get_param osc.*MDT*.sync_*
      osc.lustre-OST0000-osc-MDT0000.sync_changes=0
      osc.lustre-OST0000-osc-MDT0000.sync_in_flight=0
      osc.lustre-OST0000-osc-MDT0000.sync_in_progress=1
      ...
      CMD: onyx-91vm5.onyx.whamcloud.com runas -u0 -g0 -G0 lfs quota -q /mnt/lustre
      running as uid/gid/euid/egid 0/0/0/0, groups: 0
       [lfs] [quota] [-q] [/mnt/lustre]
      /usr/lib64/lustre/tests/sanity-sec.sh: line 1293: [21448]: syntax error: operand expected (error token is "[21448]")
      CMD: onyx-81vm3 /usr/sbin/lctl nodemap_del c0
      CMD: onyx-81vm3 /usr/sbin/lctl nodemap_del c1
      CMD: onyx-81vm3 /usr/sbin/lctl nodemap_modify --name default --property admin --value 0
      CMD: onyx-81vm3 /usr/sbin/lctl get_param -n nodemap.active
      CMD: onyx-81vm3 /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
      On MGS 10.240.26.198, default.admin_nodemap = nodemap.default.admin_nodemap=0
      CMD: onyx-124vm1 /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
      onyx-124vm1: ssh: connect to host onyx-124vm1 port 22: No route to host
      pdsh@onyx-91vm5: onyx-124vm1: ssh exited with exit code 255
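The "operand expected" error at line 1293 is consistent with shell arithmetic receiving the bracketed token "[21448]" instead of a plain number. As far as I know, lfs quota wraps a value in square brackets when it considers it inaccurate (for example when an OST cannot be reached), which would fit the OSS being gone at that point. The following is a minimal sketch of a defensive parse, not the actual code in sanity-sec.sh; the helper name strip_brackets and the variable names are hypothetical.

```shell
# Hypothetical helper (not part of sanity-sec.sh): drop one leading "[" and
# one trailing "]" from a quota value, if present.
strip_brackets() {
    sb_value=$1
    sb_value=${sb_value#\[}
    sb_value=${sb_value%\]}
    printf '%s\n' "$sb_value"
}

raw='[21448]'    # token quoted in the error message above
used=$(strip_brackets "$raw")

# Only hand the value to shell arithmetic once it is plain digits.
case $used in
    ''|*[!0-9]*) echo "not a plain number: $raw" ;;
    *) echo "used: $((used + 0))" ;;
esac
```

Stripping the brackets only avoids the syntax error; a bracketed value should still be treated as unreliable, so a test would probably want to fail or retry rather than compare it.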
      

      MDS dmesg:

      [18117.644399] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
      [18118.434329] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      [18120.214180] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      ....
      [18133.582033] Lustre: 11600:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1649444731/real 1649444731]  req@000000009702997f x1729549387801536/t0(0) o13->lustre-OST0006-osc-MDT0000@10.240.30.32@tcp:7/4 lens 224/368 e 0 to 1 dl 1649444738 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-6-0.0'
      [18133.582035] Lustre: 11599:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1649444731/real 1649444731]  req@00000000f3577720 x1729549387801600/t0(0) o13->lustre-OST0002-osc-MDT0000@10.240.30.32@tcp:7/4 lens 224/368 e 0 to 1 dl 1649444738 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-2-0.0'
      [18133.582063] Lustre: lustre-OST0002-osc-MDT0000: Connection to lustre-OST0002 (at 10.240.30.32@tcp) was lost; in progress operations using this service will wait for recovery to complete
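To see how long the MDS waited before giving up on the OST, the "sent" and "dl" (deadline) epoch values in the ptlrpc_expire_one_request() lines can be subtracted. The sketch below does this for one line trimmed from the dmesg above; the function name rpc_wait_seconds is hypothetical, and reading "dl" as the request deadline is my interpretation of the log format.

```shell
# Hypothetical helper: print (deadline - sent) in seconds for one
# ptlrpc_expire_one_request() log line, or return 1 if a field is missing.
rpc_wait_seconds() {
    rw_sent=$(printf '%s\n' "$1" | sed -n 's/.*\[sent \([0-9][0-9]*\)\/real.*/\1/p')
    rw_dl=$(printf '%s\n' "$1" | sed -n 's/.* dl \([0-9][0-9]*\) .*/\1/p')
    [ -n "$rw_sent" ] && [ -n "$rw_dl" ] || return 1
    echo $((rw_dl - rw_sent))
}

# Line trimmed from the MDS dmesg above (OST0006 request).
line='Request sent has timed out for slow reply: [sent 1649444731/real 1649444731]  req@000000009702997f x1729549387801536/t0(0) o13->lustre-OST0006-osc-MDT0000@10.240.30.32@tcp:7/4 lens 224/368 e 0 to 1 dl 1649444738 ref 1'

rpc_wait_seconds "$line"    # 1649444738 - 1649444731 = 7
```

Both requests shown were sent at 1649444731 with a deadline of 1649444738, so each expired after 7 seconds without a reply from 10.240.30.32@tcp.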
      

      OST dmesg:

      [18108.151196] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
      [18108.941076] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.trusted_nodemap
      [18115.611844] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
      

      Nothing is logged on the OST after "force_sync=1", and no crash has been reported. The OSS seems to have disappeared.
      Hard reset? Misconfigured kdump? Network issues?

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-sec test_19 - test_19 returned 1

          People

            wc-triage WC Triage
            maloo Maloo