Lustre / LU-4381

clio deadlock from truncate

    Description

      If I run the following:

      # export OSTCOUNT=6
      # export MOUNT_2=y
      # ./lustre/tests/llmount.sh
      # lfs setstripe -c 6 /mnt/lustre/f0
      # 
      # (while true; do echo Hi > /mnt/lustre/f0; done) &
      # (while true; do echo Bye > /mnt/lustre2/f0; done) &
      

      Then, within a second of starting, both child tasks get stuck in cl_lock_state_wait():

      [<ffffffffa045cb75>] cl_lock_state_wait+0x1b5/0x320 [obdclass]
      [<ffffffffa045d35b>] cl_enqueue_locked+0x15b/0x1f0 [obdclass]
      [<ffffffffa045debe>] cl_lock_request+0x7e/0x270 [obdclass]
      [<ffffffffa0462e4c>] cl_io_lock+0x3cc/0x560 [obdclass]
      [<ffffffffa0463082>] cl_io_loop+0xa2/0x1b0 [obdclass]
      [<ffffffffa0dcabe8>] cl_setattr_ost+0x218/0x2f0 [lustre]
      [<ffffffffa0d96145>] ll_setattr_raw+0xa45/0x10c0 [lustre]
      [<ffffffffa0d9681d>] ll_setattr+0x5d/0xf0 [lustre]
      [<ffffffff811a0048>] notify_change+0x168/0x340
      [<ffffffff81180ad4>] do_truncate+0x64/0xa0
      [<ffffffff811949e1>] do_filp_open+0x851/0xdc0
      [<ffffffff8117f849>] do_sys_open+0x69/0x140
      [<ffffffff8117f960>] sys_open+0x20/0x30
      [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      [<ffffffffffffffff>] 0xffffffffffffffff
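The trace above has the shape of a Linux `/proc/<pid>/stack` dump, although the report does not say how it was collected. Assuming that source, a small check like the following could confirm whether a given task is parked in `cl_lock_state_wait()`. The helper name `stack_has_frame` is hypothetical and not part of Lustre.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): succeed if the kernel stack dump
# stored in file $1 contains a frame for the function named in $2.
# Frames are assumed to look like "[<addr>] func+0xOFF/0xLEN [module]".
stack_has_frame() {
    # -F: match the text literally; "] name+0x" avoids matching a longer
    # function name that merely ends with or contains the one requested.
    grep -qF -- "] $2+0x" "$1"
}
```

A possible use, assuming root privileges and a kernel that exposes `/proc/<pid>/stack`, is `stack_has_frame /proc/$pid/stack cl_lock_state_wait && echo "$pid is waiting on a cl_lock"`.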
      

      They stay stuck there until one client gets evicted by an OST:

      LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 0@lo  ns: filter-lustre-OST0002_UUID lock: ffff880217559100/0xb06606e6f58bd625 lrc: 3/0,0 mode: PW/PW res: [0x2:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 0@lo remote: 0xb06606e6f58bd61e expref: 4 pid: 14479 timeout: 4300627190 lvb_type: 0
      LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 0@lo  ns: filter-lustre-OST0004_UUID lock: ffff88019996f9c0/0xb06606e6f58bd58b lrc: 3/0,0 mode: PW/PW res: [0x2:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000080010020 nid: 0@lo remote: 0xb06606e6f58bd584 expref: 4 pid: 13781 timeout: 4300627191 lvb_type: 0
      LustreError: 11-0: lustre-OST0002-osc-ffff8801a13eb800: Communicating with 0@lo, operation obd_ping failed with -107.
      Lustre: lustre-OST0004-osc-ffff88019e033800: Connection to lustre-OST0004 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      LustreError: 167-0: lustre-OST0004-osc-ffff88019e033800: This client was evicted by lustre-OST0004; in progress operations using this service will fail.
      LustreError: 16413:0:(ldlm_resource.c:815:ldlm_resource_complain()) lustre-OST0004-osc-ffff88019e033800: namespace resource [0x2:0x0:0x0].0 (ffff8801a86f6980) refcount nonzero (1) after lock cleanup; forcing cleanup.
      LustreError: 16413:0:(ldlm_resource.c:1454:ldlm_resource_dump()) --- Resource: [0x2:0x0:0x0].0 (ffff8801a86f6980) refcount = 2
      Lustre: lustre-OST0004-osc-ffff88019e033800: Connection restored to lustre-OST0004 (at 0@lo)
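To see at a glance which OST namespaces expired a lock callback in a log like the one above, the `ns:` field can be pulled out of each "lock callback timer expired" line. This is a minimal sketch: `expired_lock_namespaces` is a hypothetical helper name, and the pattern assumes the message layout shown in this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): read console log lines on stdin and
# print the namespace ("ns:" field) of every "lock callback timer expired" line.
# Assumes the layout "... expired ... ns: <namespace> lock: ...".
expired_lock_namespaces() {
    sed -n 's/.*lock callback timer expired.* ns: \([^ ]*\) lock:.*/\1/p'
}
```

For example, `dmesg | expired_lock_namespaces | sort -u` would list each affected namespace once. For the log in this report that would be the namespaces for OST0002 and OST0004.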
      

            People

              bobijam Zhenyu Xu
              jhammond John Hammond