LU-3306

layout.c:1946:__req_capsule_get()) @@@ Wrong buffer for field `name' (6 of 8) in format `LDLM_INTENT_OPEN': 3 vs. 0 (client)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
    • Fix Version/s: Lustre 2.4.0
    • Labels: None
    • Severity: 3
    • Rank: 8192

    Description

      Fresh master checkout, hit this in racer:

      [61845.122433] LustreError: 7885:0:(pack_generic.c:770:lustre_msg_string()) can't unpack short string in msg ffffc90006c54ff0 buffer[6] len 3: strlen 1
      [61845.131556] LustreError: 7885:0:(layout.c:1946:__req_capsule_get()) @@@ Wrong buffer for field `name' (6 of 8) in format `LDLM_INTENT_OPEN': 3 vs. 0 (client)
      [61845.131557]   req@ffff88003c01e7f0 x1434586131710596/t0(0) o101->13783025-666d-77a1-7c55-b4fef65827f1@0@lo:0/0 lens 576/0 e 0 to 0 dl 1368130054 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      [61845.133834] LustreError: 7885:0:(mdt_handler.c:1727:mdt_reint_internal()) Can't unpack reint, rc -14
      [61845.134310] LustreError: 7885:0:(lustre_mdt.h:66:err_serious()) ASSERTION( -rc < ESERIOUS ) failed: 
      [61845.134747] LustreError: 7885:0:(lustre_mdt.h:66:err_serious()) LBUG
      [61845.134992] Pid: 7885, comm: mdt00_008
      [61845.135199] 
      [61845.135199] Call Trace:
      [61845.135558]  [<ffffffffa0af28a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [61845.135817]  [<ffffffffa0af2ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
      [61845.136103]  [<ffffffffa05c32eb>] mdt_enqueue+0x10b/0x110 [mdt]
      [61845.140089]  [<ffffffffa05b4d28>] mdt_handle_common+0x648/0x1660 [mdt]
      [61845.140375]  [<ffffffffa05f0f25>] mds_regular_handle+0x15/0x20 [mdt]
      [61845.140681]  [<ffffffffa12a66c8>] ptlrpc_server_handle_request+0x3a8/0xc70 [ptlrpc]
      [61845.141158]  [<ffffffffa0af35ee>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [61845.141448]  [<ffffffffa0b04e9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      [61845.142006]  [<ffffffffa129de11>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
      [61845.142386]  [<ffffffff81054613>] ? __wake_up+0x53/0x70
      [61845.143683]  [<ffffffffa12a79d5>] ptlrpc_main+0xa45/0x1650 [ptlrpc]
      [61845.143981]  [<ffffffffa12a6f90>] ? ptlrpc_main+0x0/0x1650 [ptlrpc]
      [61845.144265]  [<ffffffff8100c10a>] child_rip+0xa/0x20
      [61845.144542]  [<ffffffffa12a6f90>] ? ptlrpc_main+0x0/0x1650 [ptlrpc]
      [61845.144854]  [<ffffffffa12a6f90>] ? ptlrpc_main+0x0/0x1650 [ptlrpc]
      [61845.145134]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
      [61845.145399] 
      [61845.220106] Kernel panic - not syncing: LBUG
      

      Crashdump and modules are in /exports/crashdumps/192.168.10.210-2013-05-09-16\:06\:54/
      source tag master-20130509
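
      For context, both errors above appear to come from the server rejecting a string field whose declared wire length does not match its contents. A minimal, purely illustrative sketch of that kind of check (assumed names and logic, not the actual lustre_msg_string() or __req_capsule_get() code): a name carried in a 3-byte buffer with strlen 1 cannot be a correctly terminated 2-character name, so the field is refused.

      /* Illustrative sketch only; not the actual Lustre unpacking code.
       * A NUL-terminated string field of wire length `len' must contain
       * exactly len - 1 characters; "len 3: strlen 1" violates that. */
      #include <stddef.h>
      #include <string.h>

      static const char *unpack_string_field(const char *buf, size_t len)
      {
              if (len == 0 || buf[len - 1] != '\0')
                      return NULL;            /* not NUL-terminated */
              if (strlen(buf) != len - 1)
                      return NULL;            /* the "short string" case */
              return buf;
      }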

      Attachments

        Activity

          [LU-3306] layout.c:1946:__req_capsule_get()) @@@ Wrong buffer for field `name' (6 of 8) in format `LDLM_INTENT_OPEN': 3 vs. 0 (client)

          jlevi Jodi Levi (Inactive) added a comment -

          Closing ticket as the patch has landed on master. Please let me know if more work is needed in this ticket and I will reopen it.

          jlevi Jodi Levi (Inactive) added a comment -

          Are more patches needed beyond Change 6383, which has landed on master? Or can this ticket be closed?
          green Oleg Drokin added a comment -

          http://review.whamcloud.com/6383 is the err_serious change, which I think is the most important one. Applying it prevents the server from crashing on data from the network.

          green Oleg Drokin added a comment -

          Well, I am hitting this pretty regularly, so I did some digging; here's what I uncovered.
          On the client, in the log (for the failed request):

          00000080:00200000:3.0:1368694813.256766:0:21370:0:(dcache.c:401:ll_revalidate_it()) VFS Op:name=13,intent=open
          ...
          00000002:00010000:3.0:1368694813.266750:0:21370:0:(mdc_locks.c:1083:mdc_intent_lock()) (name: 13,[0x200000401:0x330:0x0]) in obj [0x200000400:0x1:0x0], intent: open flags 02304001
          

          I found the op_data for this request, and here it is:

          crash> p *(struct md_op_data *)0xffff88000b5d7df0
          $3 = {
            op_fid1 = {
              f_seq = 8589935616, 
              f_oid = 1, 
              f_ver = 0
            }, 
            op_fid2 = {
              f_seq = 8589935617, 
              f_oid = 816, 
              f_ver = 0
            }, 
          ...
            op_name = 0xffff8800b5076fd0 "2",    <====
            op_namelen = 2,                      <==== !!
          ...
          

          So op_data has a different name than what was reported in the log, but the length is the same.
          What I think really happened is that some rename succeeded between the time we got the revalidate request from the kernel and the time we sent the message over the wire, and that rename replaced the name in the dentry (I also found the dentry).
          In fact here it is:

          00000080:00200000:1.0:1368694813.261175:0:21387:0:(namei.c:1306:ll_rename_generic()) VFS Op:oldname=13,src_dir=144115205255725057/33554436(ffff880048d9cb08),newname=2,tgt_dir=144115205255725057/33554436(ffff880048d9cb08)
          

          This is all because we do not store the name in op_data itself, only a pointer to some other location (the dentry in this case, but e.g. in LU-3270 there is a bug of a similar nature where op_data points to an sai entry that gets freed unexpectedly).
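
          For illustration, a minimal sketch of that aliasing hazard (fake_op_data, prepare_racy and prepare_copy are made-up names, not the llite code): keeping only a pointer to the dentry name means a concurrent rename can change both the bytes and their length after op_namelen has been recorded, which is exactly the op_name = "2" / op_namelen = 2 mismatch in the crash dump above.

          /* Illustrative only (made-up names); not the actual llite code.
           * op_name aliases memory owned by the dentry, so a concurrent
           * rename can rewrite the name after op_namelen was recorded. */
          #include <stddef.h>
          #include <string.h>

          struct fake_op_data {
                  const char *op_name;        /* points into dentry-owned memory */
                  int         op_namelen;     /* captured at prepare time */
          };

          /* Racy variant: stores only the pointer, as described above. */
          static void prepare_racy(struct fake_op_data *op, const char *dentry_name)
          {
                  op->op_name    = dentry_name;               /* alias, not a copy */
                  op->op_namelen = (int)strlen(dentry_name);  /* can go stale */
          }

          /* Safer variant: snapshot the name into caller-provided storage so a
           * later rename cannot change what gets packed into the request. */
          static void prepare_copy(struct fake_op_data *op, const char *dentry_name,
                                   char *buf, size_t buflen)
          {
                  size_t n = strlen(dentry_name);

                  if (n >= buflen)
                          n = buflen - 1;
                  memcpy(buf, dentry_name, n);
                  buf[n] = '\0';
                  op->op_name    = buf;
                  op->op_namelen = (int)n;
          }

          The point is only that the bytes and the length have to be captured together; whether that is done by copying, as sketched, or by holding the dentry name stable for the lifetime of the request is a separate design choice.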

          So after this malformed request is sent, the string check on the server side fails ("can't unpack short string") because we expect a string of length 2 but get only 1; this in turn returns an error to mdt_reint_internal:

                  rc = mdt_reint_unpack(info, op);
                  if (rc != 0) {
                          CERROR("Can't unpack reint, rc %d\n", rc);
                          RETURN(err_serious(rc));
                  }
          

          That sets err_serious on the rc.
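
          For reference, a rough sketch of what err_serious() does, reconstructed from the ASSERTION(-rc < ESERIOUS) in the stack trace above (the flag value and exact form are assumptions, not a copy of lustre_mdt.h): it folds a "serious" flag into the negative errno, and the assertion requires that the flag is not already present.

          /* Rough sketch; not the actual lustre_mdt.h definitions. */
          #include <assert.h>

          #define ESERIOUS 0x0001000           /* assumed flag value */

          static inline int err_serious(int rc)
          {
                  assert(rc < 0);
                  assert(-rc < ESERIOUS);      /* fires if rc is already marked */
                  return -(-rc | ESERIOUS);
          }

          With this shape, an rc that has already been marked fails the -rc < ESERIOUS precondition the next time it is passed in.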

          Now we return all the way through the stack to mdt_enqueue, and there:

                  return rc ? err_serious(rc) : req->rq_status;
          

          So we call err_serious(rc) AGAIN, which triggers the assertion.

          I think we should just remove the assertion and allow err_serious to be applied twice; otherwise we would need to go through all the call chains to ensure it is never called twice, while still making sure it is called exactly once everywhere it is needed, which might be quite cumbersome.
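
          One possible shape of that relaxation (purely illustrative, and not necessarily what the patch in http://review.whamcloud.com/6383 does) is to make the marking idempotent instead of asserting:

          /* Illustrative sketch of the proposal above: a second err_serious()
           * becomes a no-op rather than an LBUG. Not necessarily the landed change. */
          #define ESERIOUS 0x0001000           /* same assumed value as above */

          static inline int err_serious_idempotent(int rc)
          {
                  if (-rc & ESERIOUS)          /* already marked: leave as-is */
                          return rc;
                  return -(-rc | ESERIOUS);
          }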

          green Oleg Drokin added a comment -

          and another one, /exports/crashdumps/192.168.10.220-2013-05-13-06\:05\:21/

          green Oleg Drokin added a comment -

          Had another identical crash, so I guess this is more than just some sort of random memory corruption.
          Same codebase, crash in /exports/crashdumps/192.168.10.220-2013-05-11-01\:46\:08/


          People

            Assignee: green Oleg Drokin
            Reporter: green Oleg Drokin
            Votes: 0
            Watchers: 3
