
[LU-2030] LDLM timeout resulting in client eviction and the truncate system call failing with EINTR

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.2
    • Environment: Lustre Client: 2.1.2-4chaos
      Lustre Server: 2.1.2-3chaos
    • Severity: 3
    • 4159

    Description

      I see errors reported from the truncate system call when running SIMUL on 2 nodes with 16 processes each (32 processes total).

      The reproducer is:

      $ salloc -N2 -n32
      $ srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635
      
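      For context, below is a minimal standalone sketch (not the simul source) of the access pattern the "truncate, shared mode" test exercises: every MPI rank truncates the same shared file after a barrier. The file path is reused from the reproducer above; everything else is illustrative.

      /* shared_truncate.c: illustrative sketch only, not the simul source.
       * Every rank truncates the same file at once, mimicking the
       * "truncate, shared mode" test.  Build: mpicc shared_truncate.c
       */
      #include <mpi.h>
      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
          int rank;
          const char *path = "/p/lcrater2/surya1/simul-test/simul_truncate.0";

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank == 0) {                  /* rank 0 creates the shared file */
              int fd = open(path, O_CREAT | O_WRONLY, 0644);
              if (fd >= 0)
                  close(fd);
          }
          MPI_Barrier(MPI_COMM_WORLD);      /* all ranks truncate together */

          if (truncate(path, 1024) != 0)    /* concurrent shared truncate */
              fprintf(stderr, "rank %d: truncate failed: %s\n",
                      rank, strerror(errno));

          MPI_Finalize();
          return 0;
      }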

      Specifically, I see truncate return EINTR when running the test under strace:

      truncate("/p/lcrater2/surya1/simul-test/simul_truncate.0", 1024) = -1 EINTR (Interrupted system call)
      
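      An application can mask a transient EINTR by retrying, as in the minimal sketch below; note this only hides the symptom, since the eviction shown in the logs that follow is the underlying problem and a post-eviction retry may still fail with a different errno.

      #include <errno.h>
      #include <sys/types.h>
      #include <unistd.h>

      /* Retry truncate(2) while it fails with EINTR.  Illustrative only:
       * this papers over the symptom, and after an eviction the retry may
       * still fail with a different error (e.g. EIO). */
      static int truncate_retry(const char *path, off_t length)
      {
          int rc;
          do {
              rc = truncate(path, length);
          } while (rc != 0 && errno == EINTR);
          return rc;
      }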

      Console logs from the clients and servers for a run that hit this are below.

      Test:

      hype355@root:srun -- /g/g0/surya1/dev/simul-1.14/simul -d /p/lcrater2/surya1/simul-test -N 635 -f14 -l15
      Simul is running with 32 process(es)
      13:14:38: Set iteration 0
      13:14:38: Running test #14(iter 0): creat, shared mode.
      13:14:38: Running test #15(iter 0): truncate, shared mode.
      13:18:00: Process 3(hype214): FAILED in simul_truncate, truncate failed: Interrupted system call
      

      Client console messages:

      # hype355 /root > tail -f /var/log/conman/console.hype21{3,4}
      ==> /var/log/conman/console.hype213 <==
      2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
      2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
      2012-09-25 13:16:19 LustreError: 107596:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
      2012-09-25 13:16:19 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
      2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace lc2-OST000c-osc-ffff880833b85400 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
      2012-09-25 13:16:19 LustreError: 107628:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880748182d40 (167630808/0/0/0) (rc: 1)
      2012-09-25 13:16:19 Lustre: lc2-OST000c-osc-ffff880833b85400: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
      
      ==> /var/log/conman/console.hype214 <==
      2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection to lc2-OST000c (at 10.1.1.49@o2ib9) was lost; in progress operations using this service will wait for recovery to complete
      2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1179:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
      2012-09-25 13:18:00 LustreError: 75280:0:(ldlm_request.c:1807:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
      2012-09-25 13:18:00 LustreError: 167-0: This client was evicted by lc2-OST000c; in progress operations using this service will fail.
      2012-09-25 13:18:00 Lustre: lc2-OST000c-osc-ffff880831ce8800: Connection restored to lc2-OST000c (at 10.1.1.49@o2ib9)
      

      Server console messages:

      $ tail /var/log/conman/console.zwicky49
      ==> /var/log/conman/console.zwicky49 <==
      2012-09-25 13:16:19 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.14@o2ib2  ns: filter-lc2-OST000c_UUID lock: ffff88028aee6d80/0x8691e72961eb7429 lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xf19c08e9d42c3468 expref: 4 pid: 4358 timeout 7757255590
      2012-09-25 13:16:19 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.14@o2ib2 arrived at 1348604178 with bad export cookie 9696785637627868352
      2012-09-25 13:16:19 Lustre: 24547:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from 7f3b19c3-f206-6f48-4758-3716d5199ca1@192.168.121.14@o2ib2 t0 exp (null) cur 1348604179 last 0
      2012-09-25 13:17:59 LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.121.15@o2ib2  ns: filter-lc2-OST000c_UUID lock: ffff8801d0235b40/0x8691e72961eb744c lrc: 3/0,0 mode: PW/PW res: 167630808/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0xbd300c41b4abfe4a expref: 4 pid: 24547 timeout 7757356173
      2012-09-25 13:18:00 LustreError: 21861:0:(ldlm_lockd.c:2074:ldlm_cancel_handler()) ldlm_cancel from 192.168.121.15@o2ib2 arrived at 1348604279 with bad export cookie 9696785637609291136
      2012-09-25 13:18:00 Lustre: 25184:0:(ldlm_lib.c:933:target_handle_connect()) lc2-OST000c: connection from e0aa3722-987c-e30c-5c60-102f582dddae@192.168.121.15@o2ib2 t0 exp (null) cur 1348604280 last 0
      

      The client version of lustre on the nodes:

      hype355@root:pdsh -w hype[213-214] 'rpm -qa | grep lustre | sort' | dshbak -c
      ----------------
      hype[213-214]
      ----------------
      lustre-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-modules-2.1.2-4chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-tools-llnl-1.4-2.ch5.x86_64
      

      The server version of lustre:

      $ pdsh -w zwicky[48-49] 'rpm -qa | grep lustre | sort' | dshbak -c
      ----------------
      zwicky[48-49]
      ----------------
      lustre-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-modules-2.1.2-3chaos_2.6.32_220.23.1.1chaos.ch5.x86_64.x86_64
      lustre-tools-llnl-1.4-2.ch5.x86_64
      

      Attachments

        Activity


          jfc John Fuchs-Chesney (Inactive) added a comment - Thanks Chris.

          morrone Christopher Morrone (Inactive) added a comment - Prakash hasn't worked on Lustre in some time. This is probably another "Won't Fix" given its age.

          hongchao.zhang Hongchao Zhang added a comment - Hi Prakash, do you need any more work on this ticket, or can we close it? Thanks

          hongchao.zhang Hongchao Zhang added a comment - Hi, the same as before, +rpctrace and +dlmtrace, thanks a lot!

          prakash Prakash Surya (Inactive) added a comment - Yes, the issue was reproducible. I'll try to allocate some time to get on the machine and gather the logs from all 4 OSS nodes for this filesystem. Any debug flags of special interest?

          hongchao.zhang Hongchao Zhang added a comment - Hi Prakash, is the issue reproducible? There is still one more issue in the logs: why did OST0010 take so much time to enqueue this glimpse lock? Is the log of the node containing lc2-OST0010 available?

          hongchao.zhang Hongchao Zhang added a comment - the patch is tracked at http://review.whamcloud.com/#change,4382

          prakash Prakash Surya (Inactive) added a comment - Good to hear. Thanks!

          hongchao.zhang Hongchao Zhang added a comment - status update: a candidate patch is being created and tested

          hongchao.zhang Hongchao Zhang added a comment - In the LU-2030-llogs-1 logs (+dlmtrace, +rpctrace), the "simul_truncate.0" process (PID=112500) failed:

          00000002:00010000:9.0:1348694770.130724:0:112500:0:(mdc_locks.c:918:mdc_intent_lock()) (name: simul_truncate.0,[0x20097c3a8:0x16a98:0x0]) in obj [0x20095bc20:0x114c:0x0], intent: getattr flags 00

          It had acquired the LDLM lock "ffff8806636c1b40/0xf19c08e9d42c4459" at OST0013 and tried to get the LDLM lock "ffff8807ff4d3d80/0xf19c08e9d42c4460" at OST0010, which took more than 100s to grant; then the eviction occurred.

          People

            Assignee: hongchao.zhang Hongchao Zhang
            Reporter: prakash Prakash Surya (Inactive)
            Votes: 0
            Watchers: 10

            Dates

              Created:
              Updated:
              Resolved: