Hi, Peter, in my former patches, I was trying to unpack the uid and gid from ptlrpc request to avoid change the protocol. When I look into the bugs that Mahmoud Hanafi mentioned, I noticed that, ptlrpc_request::rq_pill is initialized in ptlrpc_service_ops::so_hpreq_handler. However not every request use so_hpreq_handler.
mdt/mdt_mds.c: .so_hpreq_handler = ptlrpc_hpreq_handler,
mdt/mdt_mds.c: .so_hpreq_handler = NULL,
mdt/mdt_mds.c: .so_hpreq_handler = NULL,
mdt/mdt_mds.c: .so_hpreq_handler = NULL,
mdt/mdt_mds.c: .so_hpreq_handler = NULL,
mdt/mdt_mds.c: .so_hpreq_handler = NULL,
ost/ost_handler.c: .so_hpreq_handler = ptlrpc_hpreq_handler,
ost/ost_handler.c: .so_hpreq_handler = tgt_hpreq_handler,
ost/ost_handler.c: .so_hpreq_handler = NULL,
ost/ost_handler.c: .so_hpreq_handler = NULL,
That means, not every ptlrpc_request:rq_pill has been initialized before it is used in nrs_tbf.c. When the rq_pill is not initialized, using req_capsule_client_get() will trigger a LBUG. Ptlrpc requests from client to MDS, belongs to this case, So it caused a LBUG.
I think adding the uid and gid to ptlrpc protocol can easily address this issue and it is a better way to keep the logic simpler and easier to understand. Although it will introduce some duplication field in the ptlrpc request, and break the protocol, I think it is worth doing so.
. Our idea is as follows:
1. Add uid and gid to struct ptlrpc_body_v3 similar as jobid.
2. Combine the process logic of uid, gid, jobid together.
3. Handle uid and gid in nrs_tbf similar as jobid.
I uploaded the patch at https://review.whamcloud.com/#/c/27608/. And it need further improvement in coding to better merge the codes of jobid and uid/gid.
With the latest mater branch, it panic: