Lustre / LU-16484

Linux kernel BUG when deleting and adding a peer and using a filesystem

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.9
    • None
    • 2-node cluster:
      Lustre Client / Rocky 8.5 with Lustre 2.12.9
      Lustre Server / CentOS 7.9 with Lustre 2.12.6
      Transport: OPA
    • 3

    Description

      I am using a small 2-node cluster with 1 server and 1 client.  On the client, I ran the following commands in loops, one loop per terminal window:

      Window #1:

              lnetctl peer del --prim_nid $PEER
              lnetctl peer add --prim_nid $PEER

      Where $PEER is the NID of the server.

      Window #2:

              touch $BASE_DIR/qq
              rm -f $BASE_DIR/*

      Where $BASE_DIR is a directory within a mounted Lustre file system from $PEER.

      Running either loop on its own causes no issues.  Running both at the same time results in a kernel BUG:

      [ 8763.523887] list_add corruption. prev->next should be next (ffff8e98555a72e0), but was ffff8e98056d2810. (prev=ffff8e98056d2810).
      [ 8763.536952] ------------[ cut here ]------------
      [ 8763.536953] kernel BUG at lib/list_debug.c:28!
      [ 8763.541933] invalid opcode: 0000 [#1] SMP PTI
      [ 8763.546809] CPU: 9 PID: 18262 Comm: lnet_discovery Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-348.el8.0.2.x86_64 #1
      [ 8763.560434] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.1029.090220201031 09/02/2020
      [ 8763.572117] RIP: 0010:__list_add_valid.cold.0+0x26/0x28
      [ 8763.577964] Code: 00 00 00 c3 48 89 d1 48 c7 c7 88 6e 51 91 48 89 c2 e8 a0 da ca ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 e0 6e 51 91 e8 8c da ca ff <0f> 0b 48 89 fe 48 89 c2 48 c7 c7 70 6f 51 91 e8 78 da ca ff 0f 0b
      [ 8763.598976] RSP: 0018:ffffb7c9658e7d80 EFLAGS: 00010246
      [ 8763.604823] RAX: 0000000000000075 RBX: ffff8e98555a72c0 RCX: 0000000000000000
      [ 8763.612808] RDX: 0000000000000000 RSI: ffff8ea73fa56818 RDI: ffff8ea73fa56818
      [ 8763.620794] RBP: ffff8e98056d2800 R08: 0000000000006475 R09: 0000000000aaaaaa
      [ 8763.628779] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8e98555a730c
      [ 8763.636765] R13: ffff8e98056d2810 R14: ffff8e9931905c00 R15: 0000000000000000
      [ 8763.644751] FS:  0000000000000000(0000) GS:ffff8ea73fa40000(0000) knlGS:0000000000000000
      [ 8763.653806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 8763.660235] CR2: 000056304a85f748 CR3: 0000000f35610005 CR4: 00000000001706e0
      [ 8763.668221] Call Trace:
      [ 8763.670972]  lnet_select_pathway+0xebd/0x12d0 [lnet]
      [ 8763.676538]  lnet_send+0x5d/0x1b0 [lnet]
      [ 8763.680936]  lnet_peer_discovery+0x277/0x11f0 [lnet]
      [ 8763.686491]  ? __schedule+0x2cc/0x700
      [ 8763.690591]  ? finish_wait+0x80/0x80
      [ 8763.694600]  ? lnet_peer_merge_data+0xd50/0xd50 [lnet]
      [ 8763.700350]  kthread+0x116/0x130
      [ 8763.703962]  ? kthread_flush_work_fn+0x10/0x10
      [ 8763.708933]  ret_from_fork+0x35/0x40
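
The two loops from the description could be combined into a single script along the following lines. This is only a sketch: the report does not say how the loops were written, so the `ITERATIONS` bound and the `LNETCTL` override are illustrative assumptions, while `PEER` and `BASE_DIR` have the meanings given above. Note that, as in the original commands, the second loop deletes every file in `$BASE_DIR`.

```shell
#!/bin/sh
# Sketch of the two loops from the description, combined into one script.
# Assumptions (not from the report): the ITERATIONS bound and the LNETCTL
# override are illustrative additions; PEER and BASE_DIR must be set by the
# caller as described in the report.

ITERATIONS=${ITERATIONS:-1000}
LNETCTL=${LNETCTL:-lnetctl}

# Window #1: repeatedly delete and re-add the server as a peer.
peer_loop() {
    i=0
    while [ "$i" -lt "$ITERATIONS" ]; do
        "$LNETCTL" peer del --prim_nid "$PEER"
        "$LNETCTL" peer add --prim_nid "$PEER"
        i=$((i + 1))
    done
}

# Window #2: repeatedly create and remove a file on the Lustre mount.
# ${BASE_DIR:?} aborts instead of expanding to "/*" if BASE_DIR is empty.
file_loop() {
    j=0
    while [ "$j" -lt "$ITERATIONS" ]; do
        touch "${BASE_DIR:?}/qq"
        rm -f "${BASE_DIR:?}"/*
        j=$((j + 1))
    done
}

# Run both loops concurrently, but only when both variables are set.
if [ -n "${PEER:-}" ] && [ -n "${BASE_DIR:-}" ]; then
    peer_loop &
    file_loop
    wait
fi
```

Saved under a hypothetical name such as `repro.sh`, it would be invoked on the client as `PEER=<server NID> BASE_DIR=<directory on the Lustre mount> sh repro.sh`. Running the two loops from one script rather than two windows is a convenience only; whether the failure reproduces within a given number of iterations is not stated in the report.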

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Dean Luick (luick)