[LU-2615] group of OSS crashed at umount Created: 14/Jan/13  Updated: 17/Dec/15  Resolved: 17/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Alexandre Louvet Assignee: Hongchao Zhang
Resolution: Done Votes: 0
Labels: ptr

Attachments: File ptlrpcd.c     File recov_thread.c    
Severity: 3
Rank (Obsolete): 6118

 Description   

We have 4 OSSes that crashed at the same time, at umount, with the following backtrace:

PID: 18173 TASK: ffff8803376dc040 CPU: 4 COMMAND: "umount"
#0 [ffff8802b115f8d0] machine_kexec at ffffffff8102895b
#1 [ffff8802b115f930] crash_kexec at ffffffff810a4622
#2 [ffff8802b115fa00] panic at ffffffff81484657
#3 [ffff8802b115fa80] lbug_with_loc at ffffffffa04ade5b [libcfs]
#4 [ffff8802b115faa0] llog_recov_thread_stop at ffffffffa072e55b [ptlrpc]
#5 [ffff8802b115fad0] llog_recov_thread_fini at ffffffffa072e593 [ptlrpc]
#6 [ffff8802b115faf0] filter_llog_finish at ffffffffa0c7d3dd [obdfilter]
#7 [ffff8802b115fb20] obd_llog_finish at ffffffffa057c2f8 [obdclass]
#8 [ffff8802b115fb40] filter_precleanup at ffffffffa0c7cdaf [obdfilter]
#9 [ffff8802b115fba0] class_cleanup at ffffffffa05a3ca7 [obdclass]
#10 [ffff8802b115fc20] class_process_config at ffffffffa05a5feb [obdclass]
#11 [ffff8802b115fcb0] class_manual_cleanup at ffffffffa05a6d29 [obdclass]
#12 [ffff8802b115fd70] server_put_super at ffffffffa05b2c0c [obdclass]
#13 [ffff8802b115fe40] generic_shutdown_super at ffffffff8116542b
#14 [ffff8802b115fe60] kill_anon_super at ffffffff81165546
#15 [ffff8802b115fe80] lustre_kill_super at ffffffffa05a8966 [obdclass]
#16 [ffff8802b115fea0] deactivate_super at ffffffff811664e0
#17 [ffff8802b115fec0] mntput_no_expire at ffffffff811826bf
#18 [ffff8802b115fef0] sys_umount at ffffffff81183188
#19 [ffff8802b115ff80] system_call_fastpath at ffffffff810030f2
RIP: 00007f62ddfbdd67 RSP: 00007fffab738308 RFLAGS: 00010202
RAX: 00000000000000a6 RBX: ffffffff810030f2 RCX: 0000000000000010
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f62deeb3bb0
RBP: 00007f62deeb3b80 R8: 00007f62deeb3bd0 R9: 0000000000000000
R10: 00007fffab738130 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 00007f62deeb3c10
ORIG_RAX: 00000000000000a6 CS: 0033 SS: 002b

This backtrace is identical to the one shown in LU-1194, which is supposed to be fixed in 2.1.3.

The site is classified, so I can't upload the binary crash dump, but I can export the content of some structures upon request.
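
For context, the LBUG in frame #4 fires in llog_recov_thread_stop() when llcds are still accounted for after the commit helper thread has been stopped. A simplified sketch of that check, based on 2.1-era lustre/ptlrpc/recov_thread.c (not verbatim):

void llog_recov_thread_stop(struct llog_commit_master *lcm, int force)
{
        /* Refuse any new llcds from now on. */
        cfs_set_bit(LLOG_LCM_FL_EXIT, &lcm->lcm_flags);

        /* Stop the ptlrpcd helper; this waits for pending RPCs. */
        ptlrpcd_stop(&lcm->lcm_pc, force);

        /* By now every llcd should have been freed by its completion
         * callback; a non-zero count means a leaked or stuck llcd. */
        if (cfs_atomic_read(&lcm->lcm_count) != 0) {
                CERROR("Busy llcds found (%d) on lcm %p\n",
                       cfs_atomic_read(&lcm->lcm_count), lcm);
                /* ... dump each llcd still linked on lcm->lcm_llcds ... */
                LBUG();        /* <- the panic in the backtrace above */
        }
}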



 Comments   
Comment by Peter Jones [ 14/Jan/13 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Hongchao Zhang [ 20/Jan/13 ]

Is the debug log from before the crash available? It will contain the content of the remaining llcd, and the addresses of the llog_ctxt and
llog_commit_master will be printed as well:

CDEBUG(D_RPCTRACE, "Llcd (%p) at %s:%d:\n", llcd, func, line);
CDEBUG(D_RPCTRACE, " size: %d\n", llcd->llcd_size);
CDEBUG(D_RPCTRACE, " ctxt: %p\n", llcd->llcd_ctxt);
CDEBUG(D_RPCTRACE, " lcm : %p\n", llcd->llcd_lcm);
CDEBUG(D_RPCTRACE, " cookiebytes : %d\n", llcd->llcd_cookiebytes);

Could you please print the content of those two structures? Thanks!
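
If debugging is disabled on the node, a minimal way to capture these traces (assuming a stock lctl; the output path below is just an example) would be to enable the rpctrace flag before the umount and dump the kernel debug buffer afterwards:

lctl set_param debug=+rpctrace
# ... stop/umount the targets to reproduce ...
lctl dk /tmp/lustre-debug.log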

Comment by Hongchao Zhang [ 21/Jan/13 ]

Another possible cause of this issue is that the ptlrpc_request created in llog_send did not complete normally, so its rq_interpret_reply callback (llcd_interpret)
was never called to free the llcd.

Hi, what patches are you using with 2.1.3?
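
For reference, a sketch of the wiring in question, based on 2.1-era recov_thread.c (not verbatim): llcd_send() hands the request to the lcm's ptlrpcd thread with llcd_interpret() as the completion callback, and only that callback frees the llcd:

/* in llcd_send(), once the cancel cookies are packed: */
req->rq_interpret_reply = llcd_interpret;
req->rq_async_args.pointer_arg[0] = llcd;
rc = ptlrpc_set_add_new_req(&lcm->lcm_pc, req);

static int llcd_interpret(const struct lu_env *env,
                          struct ptlrpc_request *req, void *args, int rc)
{
        struct llog_canceld_ctxt *llcd = req->rq_async_args.pointer_arg[0];

        CDEBUG(D_RPCTRACE, "Sent llcd %p (%d) - killing it\n", llcd, rc);
        llcd_free(llcd);   /* unlinks from lcm_llcds, decrements lcm_count */
        return 0;
}

So if a request never completes, lcm_count stays elevated and llog_recov_thread_stop() hits the LBUG.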

Comment by Alexandre Louvet [ 24/Jan/13 ]

Here is the list of patches we have on our production machine.

  • LU-1039 handle bulk IO errors correctly
  • LU-1650 find the lock by index of subpage: update with patch set 2
  • LU-2170 osc: set osc_lock attribute only once
  • ORNL-22 general ptlrpcd threads pool support
  • LU-1144 implement a NUMA aware ptlrpcd binding policy
  • LU-1110 MDS Oops in osd_xattr_get() during file open by FID
  • LU-1363 llite: Not held lock when calling security_d_instantiate
  • LU-645/BZ23978 getcwd failure
  • LU-1299 (patch set 11) loading large enough binary from lustre trigger OOM killer
  • LU-549 Improve statfs performance if selinux is disabled (complement)
  • LU-1749 mdt failing to open llog result in invalidating ost
  • LU-1331 changelogs: RNMTO record not always after RNMFRM
  • LU-1592 ldlm: protect obd_export:exp_imp_reverse's change
  • LU-1144 update patch to add ptlrpc option in modprobe.conf file
  • LU-1448 prevent NULL pointer dereference on disabled OSC
  • LU-1057 quota speed up lookup in osc_quota_chkdq.patch
  • LU-1714 properly initialyse sg_magic value.patch

Getting the CDEBUG log will take a little more time. By default our production machines run with the debug filter set to 0 (none), so I do not have those traces in the dmesg log; I will have to spend some time extracting them from the crash dump.

Comment by Hongchao Zhang [ 24/Jan/13 ]

What is the content of the two structures "llog_ctxt" and "llog_commit_master" of the llcd?

Comment by Alexandre Louvet [ 11/Feb/13 ]

Here is the content of llog_ctxt & llog_commit_master extracted from the crash dump:

 
crash> llog_ctxt 0xffff8802327f8900
struct llog_ctxt {
  loc_idx = 3,
  loc_gen = {
    mnt_cnt = 0,
    conn_cnt = 0
  },
  loc_obd = 0xffff88032adbc038,
  loc_olg = 0xffff88032adbc2d0,
  loc_exp = 0xffff8802ed751400,
  loc_imp = 0x0,
  loc_logops = 0xffffffffa0cb99e0,
  loc_handle = 0x0,
  loc_lcm = 0xffff88033614b400,
  loc_llcd = 0x0,
  loc_sem = {
    lock = {
      raw_lock = {
        slock = 458759
      }
    },
    count = 0,
    wait_list = {
      next = 0xffff8802327f8960,
      prev = 0xffff8802327f8960
    }
  },
  loc_refcount = {
    counter = 2
  },
  llog_proc_cb = 0xffffffffa0c95220,
  loc_flags = 2
}

crash> llog_commit_master 0xffff88033614b400
struct llog_commit_master {
  lcm_flags = 4,
  lcm_count = {
    counter = 1
  },
  lcm_refcount = {
    counter = 3
  },
  lcm_pc = {
    pc_flags = 0,
    pc_lock = {
      raw_lock = {
        slock = 65537
      }
    },
    pc_starting = {
      done = 0,
      wait = {
        lock = {
          raw_lock = {
            slock = 196611
          }
        },
        task_list = {
          next = 0xffff88033614b430,
          prev = 0xffff88033614b430
        }
      }
    },
    pc_finishing = {
      done = 0,
      wait = {
        lock = {
          raw_lock = {
            slock = 196611
          }
        },
        task_list = {
          next = 0xffff88033614b450,
          prev = 0xffff88033614b450
        }
      }
    },
    pc_set = 0x0,
    pc_name = "lcm_xxxxx1-OST0",
    pc_env = {
      le_ctx = {
        lc_tags = 2415919112,
        lc_thread = 0x0,
        lc_value = 0x0,
        lc_state = LCS_FINALIZED,
        lc_remember = {
          next = 0xffff88033614b498,
          prev = 0xffff88033614b498
        },
        lc_version = 15,
        lc_cookie = 0
      },
      le_ses = 0x0
    },
    pc_index = -1,
    pc_npartners = 0,
    pc_partners = 0x0,
    pc_cursor = 0
  },
  lcm_lock = {
    raw_lock = {
      slock = 155453764
    }
  },
  lcm_llcds = {
    next = 0xffff88021b2c2050,
    prev = 0xffff88021b2c2050
  },
  lcm_name = "lcm_xxxxx1-OST0002\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
} 
Comment by Hongchao Zhang [ 08/Mar/13 ]

The lcm->lcm_llcds list looks empty:

...
  lcm_llcds = {
    next = 0xffff88021b2c2050,
    prev = 0xffff88021b2c2050
  },
...

but lcm->lcm_count is "1", which is very strange!
Could you please attach the "lustre/ptlrpc/ptlrpcd.c" and "lustre/ptlrpc/recov_thread.c" from your source tree? Thanks very much!
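
For reference, kernel lists are circular, so a head is empty only when it points back at itself (standard <linux/list.h> semantics):

static inline int list_empty(const struct list_head *head)
{
        return head->next == head;   /* empty <=> head points to itself */
}

Here lcm_llcds.next == lcm_llcds.prev == 0xffff88021b2c2050, which is outside the lcm at 0xffff88033614b400, so the list is not self-pointing: it either holds exactly one llcd linked at that address or is corrupted, which would be consistent with lcm_count == 1.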

Comment by Sebastien Buisson (Inactive) [ 11/Mar/13 ]

Hi, here are the source files requested by Hongchao.

Comment by Hongchao Zhang [ 12/Mar/13 ]

the list "lcm_llcds" is corrupted for its value of "next" and "prev" is wrong (it's not in the address region of "struct llog_commit_master").
Could it be memory corrupt? there is no trace of the bug yet, sorry!

does the issue occur again recently?

Comment by Alexandre Louvet [ 11/Apr/13 ]

> Could it be memory corruption?
It's unexpected. The server is fully ECC-protected and, as it's an OSS, almost nothing but Linux and Lustre runs on this node.

> Has the issue occurred again recently?
It has occurred each of the last 4 times we stopped Lustre on the node.

Comment by Hongchao Zhang [ 26/Apr/13 ]

Hi,
Does the kernel dump referred to in the comment of 11/Feb/13 3:56 PM still exist? If so, could you please print the content at 0xffff88021b2c2050
as a "struct llog_canceld_ctxt"? Also, can the console output (just the part related to Lustre) be attached here? Thanks a lot!

Comment by Hongchao Zhang [ 02/May/13 ]

the remaining "llcd" should have been sent over ptlrpc_request for llog_ctxt->loc_llcd == NULL, and the request could not finish, then "llcd_interpret" wasn't
called to free the "llcd", there are 2 patches (ORNL-22 general ptlrpcd threads pool support; LU-1144 implement a NUMA aware ptlrpcd binding policy) among
the patches applied currently is related to it, could you please help to revert the 2 patches and test it, Thanks!

Comment by Sebastien Buisson (Inactive) [ 16/May/13 ]

Hi,

It might be difficult to get the opportunity to install packages with those 2 patches reverted at the customer site.
Instead, could we just set ptlrpcd_bind_policy=1 and max_ptlrpcds=2 as options for the ptlrpc kernel module, so that it behaves as if the patch from ORNL-22 were not applied?
Is that still a relevant test for you?
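
For reference, a sketch of the corresponding option line (the file name is an assumption; the parameter names are the ones from the patched ptlrpc module):

# e.g. in /etc/modprobe.d/lustre.conf
options ptlrpc max_ptlrpcds=2 ptlrpcd_bind_policy=1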

Thanks,
Sebastien.

Comment by Hongchao Zhang [ 07/Jun/13 ]

Hi,

Yes, that will disable the ptlrpcd thread pools (although it does not shake off the patch completely), and it should still be a relevant test.

Thanks

Comment by Hongchao Zhang [ 18/Jun/13 ]

Hi, what was the outcome of the test? Thanks

Comment by Sebastien Buisson (Inactive) [ 19/Jun/13 ]

Hi,

I have asked people on site for the results of the tests.

Cheers,
Sebastien.

Comment by Li Xi (Inactive) [ 02/Aug/13 ]

We hit the same problem on lustre-2.1.6 too.

After reading through the code, I am wondering whether it is possible for the following race to happen. Please correct me if I am wrong.

filter_llog_finish
--llog_recov_thread_fini
----llog_sync
------llog_obd_repl_sync
--------llog_cancel
----------llog_obd_repl_cancel
------------llcd_push
--------------llcd_send
----------------Sending async
----llog_recov_thread_stop
------LBUG, because llcd_send has sent an llcd and llcd_interpret() has not been called yet, since no reply has been received.

Thanks!

Comment by Hongchao Zhang [ 02/Aug/13 ]

What are the two threads involved in the race?
Normally, llog_recov_thread_stop is only called by llog_recov_thread_fini, and "llog_recov_thread_stop" is called in two places:
one is the cleanup after a failed llog_recov_thread_init call, the other is the normal cleanup phase during device cleanup
(called from filter_llog_finish). They can't be called simultaneously.

Could you please attach some more information about this issue? Can it be reproduced at your site?

Comment by Li Xi (Inactive) [ 02/Aug/13 ]

Hi Hongchao,

Sorry, maybe 'race' is not the right word to express my thought.

At the time llcd_send() returns, the completion handler llcd_interpret() might not have been called yet, right? If the llcd is still in use by an RPC in flight, llog_recov_thread_stop() will hit an LBUG. I can't find any code in filter_llog_finish() which waits for the RPC to finish, so I guess it is possible that when llog_recov_thread_stop() is called, the RPC is still in flight. Am I right?

Thanks
Li Xi

Comment by Hongchao Zhang [ 05/Aug/13 ]

In ptlrpcd_stop, cfs_wait_for_completion(&pc->pc_finishing) is called to wait for the pending RPCs to complete!
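
A simplified sketch of that shutdown path, based on 2.1-era lustre/ptlrpc/ptlrpcd.c (not verbatim):

void ptlrpcd_stop(struct ptlrpcd_ctl *pc, int force)
{
        cfs_set_bit(LIOD_STOP, &pc->pc_flags);      /* ask the thread to exit */
        if (force)
                cfs_set_bit(LIOD_FORCE, &pc->pc_flags);
        cfs_waitq_signal(&pc->pc_set->set_waitq);   /* wake the daemon */

        /* Blocks until the ptlrpcd thread has drained its request set
         * (every rq_interpret_reply has run) and signalled completion. */
        cfs_wait_for_completion(&pc->pc_finishing);
}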

Can the issue be reproduced at your site?

Comment by Li Xi (Inactive) [ 05/Aug/13 ]

Ah, I see. Thank you very much!
