livelock is possible in distribute_txn_commit_thread()

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0

    Description

      A recent patch to update_trans.c changed how distribute_txn_commit_thread() waits for more work to do.

      It previously had an explicit "wait_event()" which listed all the conditions to wait for. It would then recheck each condition and possibly perform an appropriate action.

      It was changed to check each condition only once per loop iteration. If a condition was true, the action would be performed and a flag set. If no conditions were true (as indicated by the flag), it would wait; otherwise it would loop and recheck all conditions.

      One of the "if (condition) { do work }" stanzas in the loop tested a condition that should not wake up the loop: "batchid" was not tested at all in the wait_event(). The flag mentioned above was nevertheless set when that condition tested true.
      This can cause the loop to spin indefinitely.

      The "__set_current_state(TASK_RUNNING);" should be removed so that the value
      of batchid cannot stop the loop from sleeping (calling 'schedule()').
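
      The behaviour described above can be modelled in user space. The following is a minimal sketch, not the code from update_trans.c: "struct model_thread", "model_passes_before_sleep()" and the "batchid_sets_running" switch are hypothetical names, the task state is reduced to a two-value enum, and it is assumed that batchid stays non-zero while the thread loops.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel task states. */
enum model_state { MODEL_RUNNING, MODEL_IDLE };

struct model_thread {
	enum model_state state;  /* plays the role of current->state */
	bool wake_condition;     /* a condition that a waker would signal */
	unsigned long batchid;   /* non-zero, but nobody wakes us for it */
};

/*
 * Run the loop body up to max_iters times and return the number of
 * passes made before the thread would block in schedule().
 * Returns max_iters if the thread never gets to sleep.
 */
int model_passes_before_sleep(struct model_thread *t,
			      bool batchid_sets_running,
			      int max_iters)
{
	assert(t);

	for (int i = 0; i < max_iters; i++) {
		/* Announce the intent to sleep before checking anything. */
		t->state = MODEL_IDLE;

		if (t->wake_condition) {
			/* Work was done, so recheck everything before sleeping. */
			t->wake_condition = false;
			t->state = MODEL_RUNNING;
		}

		if (t->batchid != 0) {
			/* Assumption: this work leaves batchid non-zero. */
			if (batchid_sets_running)
				t->state = MODEL_RUNNING;  /* the problematic line */
		}

		if (t->state == MODEL_IDLE)
			return i + 1;  /* schedule() would block here */
		/* Otherwise schedule() returns at once and the loop repeats. */
	}
	return max_iters;
}
```

      In this model, with wake_condition set once and batchid = 7, the variant that marks the task as running in the batchid stanza uses all 1000 allowed passes without sleeping, while the variant without that line sleeps on its second pass.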

          Activity

            [LU-13986] livelock is possible in distribute_txn_commit_thread()

            I guess the following warning is related:

            [   62.515602] ------------[ cut here ]------------
            [   62.519116] do not call blocking ops when !TASK_RUNNING; state=402 set at [<00000000a9f5c02a>] distribute_txn_commit_thread+0x5c/0x12f0 [ptlrpc]
            [   62.519405] WARNING: CPU: 0 PID: 5426 at kernel/sched/core.c:7438 __might_sleep+0x5d/0x70
            [   62.519509] Modules linked in: zfs(O) zunicode(O) zzstd(O) zlua(O) zcommon(O) znvpair(O) zavl(O) icp(O) spl(O) lustre(O) osp(O) ofd(O) lod(O) mdt(O) mdd(O) mgs(O) osd_ldiskfs(O) ldiskfs(O) lquota(O) lfsck(O) obdecho(O) mgc(O) mdc(O) lov(O) osc(O) lmv(O) fid(O) fld(O) ptlrpc(O) obdclass(O) ksocklnd(O) lnet(O) libcfs(O)
            [   62.519850] CPU: 0 PID: 5426 Comm: dist_txn-0 Tainted: G        W  O     --------- -  - 4.18.0 #11
            [   62.519959] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
            [   62.520075] RIP: 0010:__might_sleep+0x5d/0x70
            [   62.520141] Code: ee 48 89 df 5b 5d 41 5c e9 70 fe ff ff 48 8b 90 d0 1a 00 00 48 c7 c7 00 28 e2 a8 c6 05 2f a4 46 01 01 48 89 d1 e8 49 58 fd ff <0f> 0b eb ce 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 55 65 48
            [   62.520362] RSP: 0018:ffff914fbe7c3e10 EFLAGS: 00010282
            [   62.520426] RAX: 0000000000000084 RBX: ffffffffa8e38be3 RCX: 0000000000000001
            [   62.520518] RDX: 0000000080000001 RSI: ffffffffa8e39811 RDI: 00000000ffffffff
            [   62.520610] RBP: 00000000000000e2 R08: 0000000000000000 R09: 0000000000000000
            [   62.520701] R10: ffff914fbe7c3c58 R11: ffff914fbe7c3c50 R12: 0000000000000000
            [   62.520793] R13: 0000000000608040 R14: 0000000000000058 R15: ffff914fe377ad00
            [   62.520885] FS:  0000000000000000(0000) GS:ffff915073800000(0000) knlGS:0000000000000000
            [   62.520977] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [   62.521061] CR2: 00007fa2d06618a0 CR3: 0000000061011002 CR4: 0000000000370eb0
            [   62.521154] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [   62.521247] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            [   62.521338] Call Trace:
            [   62.521374]  kmem_cache_alloc_trace+0x1cd/0x2a0
            [   62.521443]  ? sub_updates_write+0x1360/0x1360 [ptlrpc]
            [   62.521600]  distribute_txn_commit_thread+0x64e/0x12f0 [ptlrpc]
            [   62.521757]  ? rcu_read_lock_sched_held+0xe/0x60
            [   62.521824]  ? lock_release+0x20c/0x2d0
            [   62.521874]  ? trace_hardirqs_on+0x1c/0xe0
            [   62.521926]  ? sub_updates_write+0x1360/0x1360 [ptlrpc]
            [   62.522074]  kthread+0x16e/0x1a0
            [   62.522127]  ? set_kthread_struct+0x40/0x40
            [   62.522177]  ret_from_fork+0x24/0x30
            
            bzzz Alex Zhuravlev added a comment
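
            The warning quoted in this comment is raised when a call that may block (here an allocation reached from distribute_txn_commit_thread()) is made after the task state has been changed away from TASK_RUNNING. The state=402 in the message is consistent with TASK_IDLE (TASK_UNINTERRUPTIBLE | TASK_NOLOAD). The following user-space sketch only illustrates that ordering problem; all "sketch_" names are hypothetical, and it does not claim to reproduce the actual Lustre code path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for TASK_RUNNING and TASK_IDLE. */
enum sketch_state { SKETCH_RUNNING, SKETCH_IDLE };

struct sketch_task {
	enum sketch_state state;  /* plays the role of current->state */
	int warnings;             /* how many times the check complained */
};

/* Analogue of the might_sleep() debug check. */
void sketch_might_sleep(struct sketch_task *t)
{
	assert(t);
	if (t->state != SKETCH_RUNNING)
		t->warnings++;
}

/* Stand-in for an allocation that is allowed to block. */
void sketch_blocking_alloc(struct sketch_task *t)
{
	sketch_might_sleep(t);
}

/* One pass of a wait loop; returns the number of warnings raised. */
int sketch_loop_pass(struct sketch_task *t, bool alloc_after_state_change)
{
	t->warnings = 0;
	t->state = SKETCH_RUNNING;

	if (!alloc_after_state_change)
		sketch_blocking_alloc(t);  /* fine: the task is still running */

	t->state = SKETCH_IDLE;            /* announce the intent to sleep */

	if (alloc_after_state_change)
		sketch_blocking_alloc(t);  /* the check complains here */

	t->state = SKETCH_RUNNING;         /* schedule() has returned */
	return t->warnings;
}
```

            In this model a pass that allocates after the state change records one warning, and a pass that allocates before it records none.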

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40043/
            Subject: LU-13986 target: fix possible liveloop in distribute_txn thd
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d05cab2fd7b9b38cc8414dcb03dbcc7b9ed31696

            gerrit Gerrit Updater added a comment

            Neil Brown (neilb@suse.de) uploaded a new patch: https://review.whamcloud.com/40043
            Subject: LU-13986 target: fix possible liveloop in distribute_txn thd
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a6caa6ac9d871e549b2b1bfdaa4118b53e161dd4

            gerrit Gerrit Updater added a comment

            People

              neilb Neil Brown
              neilb Neil Brown