Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13102

llog_cat_id2handle()) error opening log id rc = -2 osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING )

    XMLWordPrintable

Details

    • 3
    • 9223372036854775807

    Description

      During massive IO at cluster we could reproduce the next issue.

      Nov 24 12:25:15 snx11281n003 kernel: LustreError: 12929:0:(llog_cat.c:404:llog_cat_id2handle()) snx11281-OST0000-osc-MDT0001: error opening log id [0x22903:0x1:0x0]:0: rc = -2
      Nov 24 12:25:15 snx11281n003 kernel: LustreError: 12929:0:(osp_sync.c:1293:osp_sync_thread()) ASSERTION( thread->t_flags != SVC_RUNNING ) failed: 3 changes, 0 in progress, 0 in flight
      Nov 24 12:25:15 snx11281n003 kernel: LustreError: 12929:0:(osp_sync.c:1293:osp_sync_thread()) LBUG
      Nov 24 12:25:15 snx11281n003 kernel: Pid: 12929, comm: osp-syn-0-1 3.10.0-957.1.3957.1.3.x3.3.36.x86_64 #1 SMP Tue Nov 19 09:47:32 PST 2019
      Nov 24 12:25:15 snx11281n003 kernel: Call Trace:
      Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffc0a427cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffc0a4287c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffc18b5a1b>] osp_sync_thread+0xa1b/0xa70 [osp]
      Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffa72c1c31>] kthread+0xd1/0xe0
      Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffa7976c1d>] ret_from_fork_nospec_begin+0x7/0x21
      Nov 24 12:25:15 snx11281n003 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
      

      And with additional debug.

       llog_cat_set_first_idx())   catlog [0x6:0x1:0x0] first idx 34730, last_idx 34731
       osp_sync_process_queues()) 1 changes, 0 in progress, 0 in flight
       llog_process_thread()) stop processing plain 0x76941:1:0 index 64767 count 1
       llog_process_thread()) index: 34731, lh_last_idx: 34730 synced_idx: 34730 lgh_last_idx: 34731
       llog_cat_process_common()) processing log [0x2780f:0x1:0x0]:0   at index 34731 of catalog [0x6:0x1:0x0]
      llog_cat_id2handle()) snx11281-OST0001-osc-MDT0001:  error opening log id [0x2780f:0x1:0x0]:0:rc = -2
      

      RC
      The logic for rereading a llog buffer had an exception for a full catalog, when lgh_last_idx = llh_cat_idx and a first processing index is a llh_cat_idx+1. This check is based on a value lh_last_idx, which stays unchanged between buffer read. But llh_cat_idx could go forward, and this lead to a wrong check where reread doesn't happen. As a result Lustre got ENOENT for a record and stoped osp processing.

      Attachments

        Activity

          People

            aboyko Alexander Boyko
            aboyko Alexander Boyko
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: