Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2205

recovery-small test 11 multiop eternal loop

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • None
    • Lustre 2.4.0
    • None
    • 3
    • 5248

    Description

      I started to hit tihs issue frequently on repeated runs of recovery-small.

      Multiop starts to loop outputting the samme message over and over again:

      short read: 0/1
      

      I logged into the node and here's what I see

      [root@centos6-7 ~]# dmesg | tail
      [27409.796429] LustreError: 138-a: lustre-OST0001: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
      [27409.798509] LustreError: Skipped 2 previous similar messages
      [27409.821171] Lustre: DEBUG MARKER: cancel_lru_locks osc start
      [27410.021536] LustreError: 26667:0:(ldlm_lockd.c:2177:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1350470383 with bad export cookie 1895698488629392860
      [27410.022477] LustreError: 26666:0:(ldlm_request.c:1169:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
      [27410.022985] LustreError: 167-0: lustre-OST0000-osc-ffff8800226a5bf0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
      [27410.023160] LustreError: 27694:0:(ldlm_resource.c:761:ldlm_resource_complain()) Namespace lustre-OST0000-osc-ffff8800226a5bf0 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      [27410.023163] LustreError: 27694:0:(ldlm_resource.c:767:ldlm_resource_complain()) Resource: ffff88001ab28e78 (2/0/0/0) (rc: 1)
      [27410.025030] LustreError: 26666:0:(ldlm_request.c:1795:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
      [27410.092031] Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      [root@centos6-7 ~]# ps ax | grep multi
      27707 pts/0    R+   291:38 multiop /mnt/lustre/f.recovery-small.11 or
      27878 pts/1    S+     0:00 grep multi
      

      strace on this process shows

      read(3, "", 1)                          = 0
      write(2, "short read: 0/1\n", 16)       = 16
      read(3, "", 1)                          = 0
      write(2, "short read: 0/1\n", 16)       = 16
      read(3, "", 1)                          = 0
      write(2, "short read: 0/1\n", 16)       = 16
      read(3, "", 1)                          = 0
      write(2, "short read: 0/1\n", 16)       = 16
      read(3, "", 1)                          = 0
      

      So I did another experiment:

      [root@centos6-7 ~]# ls -la /proc/27707/fd
      total 0
      dr-x------ 2 root root  0 Oct 17 10:56 .
      dr-xr-xr-x 7 root root  0 Oct 17 10:56 ..
      lrwx------ 1 root root 64 Oct 17 11:37 0 -> /dev/pts/0
      l-wx------ 1 root root 64 Oct 17 11:37 1 -> pipe:[1489661]
      l-wx------ 1 root root 64 Oct 17 10:56 2 -> pipe:[1489624]
      lr-x------ 1 root root 64 Oct 17 11:37 3 -> /mnt/lustre/f.recovery-small.11
      [root@centos6-7 ~]# ls -l /mnt/lustre/f.recovery-small.11
      -rw-r--r-- 1 root root 0 Oct 17 06:39 /mnt/lustre/f.recovery-small.11
      [root@centos6-7 ~]# cat /mnt/lustre/f.recovery-small.11
      cat: /mnt/lustre/f.recovery-small.11: Input/output error
      [root@centos6-7 ~]# touch /mnt/lustre/f.recovery-small.11
      [root@centos6-7 ~]# echo $?
      0
      [root@centos6-7 ~]# cat /mnt/lustre/f.recovery-small.11
      [root@centos6-7 ~]# 
      

      yet at this point multiop still did not finish its loop.

      At the very least I imagine we should actually make multiop terminate on error that repeats many times.

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              green Oleg Drokin
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: