Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2094

LBUG in osc_req_attr_set along write path

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.2.0, Lustre 2.1.1
    • None
    • Seen on SLES11-SP1 with Lustre 2.1.1 and Lustre 2.2
      Also seen on SLES11-SP2 with Lustre 2.2
    • 3
    • 4377

    Description

      LBUG reported in LU-1650 brings down client nodes. But the LBUG occurs on a write path rather than a read path. This is the occurrence reported by Michael McKay - 17/Sep/12 in LU-1650.

      > 2012-04-20T22:00:48.036542-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) lock@ffff8803f4522358[3 2 0 1 1 00000006] P(0):[0, 18446744073709551615]@[0x100000000:0x4b1f5c0:0x0] {
      > 2012-04-20T22:00:48.065914-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set())     lovsub@ffff8803f525e4e0: [0 ffff880404198470 P(0):[0, 18446744073709551615]@[0x23d979800:0x1:0x0]]
      > 2012-04-20T22:00:48.065973-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set())     osc@ffff8803f4a3ab50: (null) 00001000 0x85ead42e54740750 1 (null) size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
      > 2012-04-20T22:00:48.065987-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) } lock@ffff8803f4522358
      > 2012-04-20T22:00:48.095709-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
      > 2012-04-20T22:00:48.199435-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) lock@ffff8803f4522dd8[1 1 0 1 1 00000000] W(2):[0, 0]@[0x100000000:0x4b1f5c0:0x0] {
      > 2012-04-20T22:00:48.199467-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set())     lovsub@ffff8803f7d0a4e0: [0 ffff8807e9ae1a88 W(2):[0, 18446744073709551615]@[0x23d979800:0x1:0x0]]
      > 2012-04-20T22:00:48.199478-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set())     osc@ffff8803f5c37b50: (null) 00000000 0x0 0 (null) size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
      > 2012-04-20T22:00:48.199489-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) } lock@ffff8803f4522dd8
      > 2012-04-20T22:00:48.199500-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
      > 2012-04-20T22:00:48.199509-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) page@ffff8803f4901680[3 ffff8807eb639b08:0 ^(null)_ffff8803f49015c0 2 0 1 (null) ffff8807e9b9c380 0x0]
      > 2012-04-20T22:00:48.199523-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) page@ffff8803f49015c0[1 ffff8807ea301b58:0 ^ffff8803f4901680_(null) 2 0 1 (null) (null) 0x0]
      > 2012-04-20T22:00:48.199543-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff8803f7438f50(0:0:0) vm@ffffea000dc16b30 200000000005019 4:0 ffff8803f4901680 0 lru
      > 2012-04-20T22:00:48.199556-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff8803f757d978
      > 2012-04-20T22:00:48.199600-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff8803f4b317a8: 1< 0x845fed 2 0 - - + > 2< 0 0 2016 0x3 0x420 | (null) ffff880401df2628 ffff8807e847c340 ffffffffa0516a20 ffff8803f4b317a8 > 3< + ffff8807ffdfe0c0 1 39 0 > 4< 0 0 8 33759232 - | - - - - > 5< - - - - | 0 - - | 0 - ->
      > 2012-04-20T22:00:48.226323-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff8803f49015c0
      > 2012-04-20T22:00:48.226361-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!
      > 2012-04-20T22:00:48.226372-05:00 c0-0c0s5n1 Pid: 7196, comm: flush-lustre-1
      > 2012-04-20T22:00:48.226381-05:00 c0-0c0s5n1 Call Trace:
      > 2012-04-20T22:00:48.226418-05:00 c0-0c0s5n1 [<ffffffff81007b11>] try_stack_unwind+0x1b1/0x200
      > 2012-04-20T22:00:48.226432-05:00 c0-0c0s5n1 [<ffffffff81006420>] dump_trace+0x90/0x300
      > 2012-04-20T22:00:48.255723-05:00 c0-0c0s5n1 [<ffffffffa014b8d2>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
      > 2012-04-20T22:00:48.255753-05:00 c0-0c0s5n1 [<ffffffffa0514536>] osc_req_attr_set+0x206/0x2a0 [osc]
      > 2012-04-20T22:00:48.255775-05:00 c0-0c0s5n1 [<ffffffffa024d0ae>] cl_req_attr_set+0xce/0x1f0 [obdclass]
      > 2012-04-20T22:00:48.255786-05:00 c0-0c0s5n1 [<ffffffffa0501e77>] osc_send_oap_rpc+0xc67/0x1c50 [osc]
      > 2012-04-20T22:00:48.255817-05:00 c0-0c0s5n1 [<ffffffffa0503136>] osc_check_rpcs+0x2d6/0x470 [osc]
      > 2012-04-20T22:00:48.255832-05:00 c0-0c0s5n1 [<ffffffffa0514bf0>] osc_io_submit+0x220/0x4d0 [osc]
      > 2012-04-20T22:00:48.285929-05:00 c0-0c0s5n1 [<ffffffffa024d3d8>] cl_io_submit_rw+0x78/0x1a0 [obdclass]
      > 2012-04-20T22:00:48.285958-05:00 c0-0c0s5n1 [<ffffffffa0575924>] lov_io_submit+0x254/0xb40 [lov]
      > 2012-04-20T22:00:48.311494-05:00 c0-0c0s5n1 [<ffffffffa024d3d8>] cl_io_submit_rw+0x78/0x1a0 [obdclass]
      > 2012-04-20T22:00:48.311524-05:00 c0-0c0s5n1 [<ffffffffa05e8424>] ll_writepage+0x2c4/0x480 [lustre]
      > 2012-04-20T22:00:48.311557-05:00 c0-0c0s5n1 [<ffffffff810d0b22>] __writepage+0x12/0x50
      > 2012-04-20T22:00:48.311580-05:00 c0-0c0s5n1 [<ffffffff810d1213>] write_cache_pages+0x1d3/0x3b0
      > 2012-04-20T22:00:48.311592-05:00 c0-0c0s5n1 [<ffffffff810d140f>] generic_writepages+0x1f/0x30
      > 2012-04-20T22:00:48.337108-05:00 c0-0c0s5n1 [<ffffffff810d143c>] do_writepages+0x1c/0x40
      > 2012-04-20T22:00:48.337139-05:00 c0-0c0s5n1 [<ffffffff8112c422>] writeback_single_inode+0x92/0x2c0
      > 2012-04-20T22:00:48.337195-05:00 c0-0c0s5n1 [<ffffffff8112cfca>] writeback_inodes_wb+0x26a/0x3e0
      > 2012-04-20T22:00:48.337213-05:00 c0-0c0s5n1 [<ffffffff8112d2e0>] wb_writeback+0x1a0/0x260
      > 2012-04-20T22:00:48.337259-05:00 c0-0c0s5n1 [<ffffffff8112d67c>] wb_do_writeback+0x1fc/0x210
      > 2012-04-20T22:00:48.337274-05:00 c0-0c0s5n1 [<ffffffff8112d6cb>] bdi_writeback_task+0x3b/0xc0
      > 2012-04-20T22:00:48.337286-05:00 c0-0c0s5n1 [<ffffffff810e2321>] bdi_start_fn+0x81/0xf0
      > 2012-04-20T22:00:48.362673-05:00 c0-0c0s5n1 [<ffffffff81063dae>] kthread+0x8e/0xa0
      > 2012-04-20T22:00:48.362703-05:00 c0-0c0s5n1 [<ffffffff81003d3a>] child_rip+0xa/0x20
      > 2012-04-20T22:00:48.362745-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:702:osc_req_attr_set()) LBUG
      
      These messages are the result of executing this code path in osc_req_attr
      

      This bug occurs frequently during system regression testing, but we have no explicit reproducer.

      The dumps are too large to attach here, so I will put them on the Cray anonymous ftp server.

      2.1 Dump: ftp.cray.com:/outbound/LU1650.21.osc_req_attr_set.lbug.tgz
      2.2 Dump: ftp.cray.com:/outbound/LU1650.22.osc_req_attr_set.lbug.tgz

      Let me know if you want other dumps or if you want me to look for other info in our collection of dumps. Note that dklogs are NOT available; performance issues require that debug tracing be kept to a minimum.

      Attachments

        Issue Links

          Activity

            [LU-2094] LBUG in osc_req_attr_set along write path

            duplicate of LU-2170

            jay Jinshan Xiong (Inactive) added a comment - duplicate of LU-2170

            This is the description of what I have seen from one of the cores
            > 2012-04-20T22:00:48.362745-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:702:osc_req_attr_set()) LBUG
            In this case the object has 2 locks:
            LOCK1: is a CLM_PHANTOM lock for the whole file that is being Cancelled
            LOCK2: is a WRITE lock for the index that this write request wants, but is currently in the CLS_QUEUING state and is blocked on LOCK1 while its being cancelled.

            From this information we can see that we are currently writing out an inode even though we do not have a write/read lock for this object. The CLM_PHANTOM lock is being cancelled at this point in time. This is most likely a case of trying to write something out without ever getting a write lock first, or we have an extra release the write lock.

            michael_mckay@xyratex.com Michael McKay (Inactive) added a comment - This is the description of what I have seen from one of the cores > 2012-04-20T22:00:48.362745-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:702:osc_req_attr_set()) LBUG In this case the object has 2 locks: LOCK1: is a CLM_PHANTOM lock for the whole file that is being Cancelled LOCK2: is a WRITE lock for the index that this write request wants, but is currently in the CLS_QUEUING state and is blocked on LOCK1 while its being cancelled. From this information we can see that we are currently writing out an inode even though we do not have a write/read lock for this object. The CLM_PHANTOM lock is being cancelled at this point in time. This is most likely a case of trying to write something out without ever getting a write lock first, or we have an extra release the write lock.

            People

              wc-triage WC Triage
              amk Ann Koehler (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: