[LU-1778] Root Squash is not always properly enforced Created: 22/Aug/12  Updated: 28/Feb/23  Resolved: 09/May/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1, Lustre 2.1.2
Fix Version/s: Lustre 2.6.0, Lustre 2.5.4

Type: Bug Priority: Minor
Reporter: Alexandre Louvet Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File conf-sanity.test_43.console.wtm-14vm2.log    
Issue Links:
Blocker
Related
is related to LU-5142 Interop 2.5.1<->2.6 failure on test s... Resolved
is related to LU-6990 write error: Invalid argument when tr... Resolved
Severity: 3
Rank (Obsolete): 8532

 Description   

On a node with root_squash activated, if root tries to access the attributes of a file (fstat) which has not been previously accessed, the operation fails with Permission denied (EACCES).
If the file attributes were previously accessed by an authorized user, then root can access the attributes without trouble.

as root :
[root@clientae ~]# mount -t lustre 192.168.1.100:/scratch /scratch
[root@clientae ~]# cd /scratch/
[root@clientae scratch]# ls -la
total 16
drwxrwxrwx 4 root root 4096 Aug 21 18:03 .
dr-xr-xr-x. 28 root root 4096 Aug 22 15:53 ..
drwxr-xr-x 2 root root 4096 Jun 21 18:42 .lustre
drwx------ 2 slurm users 4096 Aug 21 18:03 test_dir
[root@clientae scratch]# cd test_dir/
[root@clientae test_dir]# ls -la
ls: cannot open directory .: Permission denied

then, as user 'slurm' :
[slurm@clientae ~]$ cd /scratch/test_dir
[slurm@clientae test_dir]# ls -la
total 16
drwx------ 2 slurm users 4096 Aug 21 18:03 .
drwxrwxrwx 4 root root 4096 Aug 22 16:47 ..
-rw-r--r-- 1 slurm users 7007 Aug 22 15:58 afile

now, come back as user root and replay the 'ls' command:
[root@clientae test_dir]# ls -la
total 16
drwx------ 2 slurm users 4096 Aug 21 18:03 .
drwxrwxrwx 4 root root 4096 Aug 22 16:47 ..
-rw-r--r-- 1 slurm users 7007 Aug 22 15:58 afile
[root@clientae test_dir]# stat afile
File: `afile'
Size: 7007 Blocks: 16 IO Block: 2097152 regular file
Device: d61f715ah/3592384858d Inode: 144115238826934275 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 500/ slurm) Gid: ( 100/ users)
Access: 2012-08-22 15:59:26.000000000 +0200
Modify: 2012-08-22 15:58:55.000000000 +0200
Change: 2012-08-22 15:58:55.000000000 +0200

At this point, if you try to look into the file as root, you get Permission denied:
[root@clientae test_dir]# cat afile
cat: afile: Permission denied
even if the content was already accessed by the authorized user.

But if the file is kept open by the user ('tail -f afile' for example), root gets access to the content of the file as well:
[root@clientae test_dir]# tail afile
coucou
coucou
coucou
coucou
coucou
coucou
coucou
coucou
coucou
coucou

As soon as the file is closed by the user, root loses access to the content (at least it can't open the file any more).

Alex.



 Comments   
Comment by Peter Jones [ 23/Aug/12 ]

Bob

Could you please look into this one?

Thanks

Peter

Comment by Diego Moreno (Inactive) [ 26/Sep/12 ]

Hi, any news on this ticket? Do you need some more information?

Comment by Bob Glossman (Inactive) [ 01/Oct/12 ]

I haven't been able to reproduce this failure in the current b2_1:

[root@centos53 ~]# mount -t lustre centos53:/lustre /mnt/lustre

[root@centos53 ~]# lctl get_param mdt/*/root_squash
mdt.lustre-MDT0000.root_squash=500:500

[root@centos53 ~]# cd /mnt/lustre
[root@centos53 lustre]# ls -la
total 16
drwxrwxrwx 4 root root 4096 Oct 1 09:06 .
drwxr-xr-x. 6 root root 4096 Oct 1 09:05 ..
drwx------ 2 bogl bogl 4096 Oct 1 09:07 bogl
drwxr-xr-x 2 root root 4096 Oct 1 09:05 .lustre
[root@centos53 lustre]# cd bogl
[root@centos53 bogl]# ls -la
total 12
drwx------ 2 bogl bogl 4096 Oct 1 09:07 .
drwxrwxrwx 4 root root 4096 Oct 1 09:06 ..
-rw------- 1 bogl bogl 4 Oct 1 09:07 f1
[root@centos53 bogl]# cat f1
foo

Am I doing something incorrect in my reproduction attempt? Is there some other precondition to making this happen?

Comment by Alexandre Louvet [ 02/Oct/12 ]

This is worse than in my case...

1/ as root has been remapped to something different from 0:0, I would expect that you would not be able to enter the bogl directory
2/ for the same reason, root shouldn't be able to see the content of f1

That said, I did a new test on a vanilla 2.1.3 (i.e. the RPM downloaded from Whamcloud, without recompilation) on top of an up-to-date CentOS 6.x to confirm that it still fails with the latest available version.

[root@server ~]# lctl get_param mdt/*/root_squash
mdt.scratch1-MDT0000.root_squash=0:0
mdt.scratch2-MDT0000.root_squash=0:0
mdt.scratch3-MDT0000.root_squash=0:0

=> set root_squash to an id which doesn't match my user id

[root@server ~]# lctl conf_param scratch1.mdt.root_squash="65535:65535"
[root@server ~]# lctl get_param mdt/*/root_squash
mdt.scratch1-MDT0000.root_squash=0:0
mdt.scratch2-MDT0000.root_squash=0:0
mdt.scratch3-MDT0000.root_squash=0:0

On the client, running as a simple user
[test@client scratch1]$ id
uid=500(test) gid=100(users) groups=100(users)
[test@client scratch1]$ pwd
/scratch1
[test@client scratch1]$ mkdir test
[test@client scratch1]$ chmod 700 test
[test@client scratch1]$ ls -la
total 16
drwxrwxrwx 3 root root 4096 Sep 11 22:15 .
dr-xr-xr-x. 29 root root 4096 Oct 2 09:22 ..
drwxr-xr-x 2 root root 4096 Sep 11 22:15 .lustre
drwx------ 2 test users 4096 Oct 2 10:37 test
[test@client scratch1]$ cd test/
[test@client test]$ echo coucou > afile
[test@client test]$ ls -la
total 9
drwx------ 2 test users 4096 Oct 2 10:37 .
drwxrwxrwx 3 root root 4096 Sep 11 22:15 ..
-rw-r--r-- 1 test users 7 Oct 2 10:37 afile
[test@client test]$ cat afile
coucou

now log as root on the client

[root@client scratch1]# pwd
/scratch1
[root@client scratch1]# ls -la
total 16
drwxrwxrwx 3 root root 4096 Sep 11 22:15 .
dr-xr-xr-x. 29 root root 4096 Oct 2 09:22 ..
drwxr-xr-x 2 root root 4096 Sep 11 22:15 .lustre
drwx------ 2 test users 4096 Oct 2 10:37 test
[root@client scratch1]# cd test/
[root@client test]# ls -la
total 12
drwx------ 2 test users 4096 Oct 2 10:37 .
drwxrwxrwx 3 root root 4096 Sep 11 22:15 ..
-rw-r--r-- 1 test users 7 Oct 2 10:37 afile

=> There is already something funny at this point. As root was mapped to 65535:65535, I expect not to be able to enter this directory (mode 700) [this also showed up in your test]. Flushing the cache on the client (i.e. echo 3 > /proc/sys/vm/drop_caches) changes the situation: root can still enter the 'test' directory, but can't stat the files:

[root@client test]# ls -la
ls: cannot access afile: Permission denied
total 8
drwx------ 2 test users 4096 Oct 2 10:37 .
drwxrwxrwx 3 root root 4096 Sep 11 22:15 ..
-????????? ? ? ? ? ? afile

I imagine this is due to the fact that the uid:gid translation is 'only' made on the MDT side and not on the client side, letting root access the attributes in the client-side cache without any problem. Am I right?

Anyway, return as the test user and stat 'afile' again:
[test@client test]$ ls -la
total 12
drwx------ 2 test users 4096 Oct 2 10:37 .
drwxrwxrwx 3 root root 4096 Sep 11 22:15 ..
-rw-r--r-- 1 test users 7 Oct 2 10:37 afile

Switching back to root and running 'ls' once again gives root access to the attributes again:

[root@client test]# ls -la
total 12
drwx------ 2 test users 4096 Oct 2 10:37 .
drwxrwxrwx 3 root root 4096 Sep 11 22:15 ..
-rw-r--r-- 1 test users 7 Oct 2 10:37 afile

At this point root can't access the content of 'afile':
[root@client test]# cat afile
cat: afile: Permission denied

unless an authorized user runs 'tail -f afile' and keeps it running:
[test@client test]$ tail -f afile
coucou

[root@client test]# cat afile
coucou

Comment by Bob Glossman (Inactive) [ 02/Oct/12 ]

In my case id 500 == bogl. With root squash set to 500 (bogl), root should be able to see into the bogl-owned dir and file, and it does.

I will retry with setting root squash to some other id.

Comment by Bob Glossman (Inactive) [ 02/Oct/12 ]

Have set root_squash to 65535:65535, shown by:

[root@centos54 bogl]# lctl set_param mdt/*/root_squash=65535:65535
mdt.lustre-MDT0000.root_squash=65535:65535

On client accessing as bogl, tree looks like:

[bogl@centos53 lustre-release]$ ll -R /mnt/lustre
/mnt/lustre:
total 4
drwx------ 2 bogl bogl 4096 Oct 1 15:18 bogl

/mnt/lustre/bogl:
total 4
-rw------- 1 bogl bogl 4 Oct 1 15:18 file
[bogl@centos53 lustre-release]$
[bogl@centos53 lustre-release]$ cat /mnt/lustre/bogl/file
foo

Note permissions on dir and file only for bogl (id==500).

Accessing as root, I consistently see no access for ls or file content to dir or file:

[root@centos53 ~]# ll -R /mnt/lustre
/mnt/lustre:
total 4
drwx------ 2 bogl bogl 4096 Oct 1 15:18 bogl
ls: cannot open directory /mnt/lustre/bogl: Permission denied
[root@centos53 bogl]# cat /mnt/lustre/bogl/file
cat: /mnt/lustre/bogl/file: Permission denied

I do see access being allowed for cd into the bogl-owned dir. A stat of the file is initially refused:

[root@centos53 bogl]# cd /mnt/lustre/bogl
[root@centos53 bogl]# stat file
stat: cannot stat `file': Permission denied

Then after doing a stat of the file as bogl:

[bogl@centos53 lustre-release]$ stat /mnt/lustre/bogl/file
File: `/mnt/lustre/bogl/file'
Size: 4 Blocks: 8 IO Block: 2097152 regular file
Device: 2c54f966h/743766374d Inode: 144115205255725058 Links: 1
Access: (0600/-rw-------) Uid: ( 500/ bogl) Gid: ( 500/ bogl)
Access: 2012-10-02 08:20:40.000000000 -0700
Modify: 2012-10-01 15:18:07.000000000 -0700
Change: 2012-10-01 15:18:46.000000000 -0700

A later stat of the file as root is allowed:

[root@centos53 bogl]# stat file
File: `file'
Size: 4 Blocks: 8 IO Block: 2097152 regular file
Device: 2c54f966h/743766374d Inode: 144115205255725058 Links: 1
Access: (0600/-rw-------) Uid: ( 500/ bogl) Gid: ( 500/ bogl)
Access: 2012-10-02 08:20:40.000000000 -0700
Modify: 2012-10-01 15:18:07.000000000 -0700
Change: 2012-10-01 15:18:46.000000000 -0700

I see no case where access to the file content as root is allowed:

[root@centos53 bogl]# cat file
cat: file: Permission denied
[root@centos53 bogl]# cat /mnt/lustre/bogl/file
cat: /mnt/lustre/bogl/file: Permission denied

This behavior looks consistent in all versions of 2.X right up to master.

Comment by Bob Glossman (Inactive) [ 02/Oct/12 ]

Correction: on another retry I do see incorrect access to file content being allowed. If I do a stat and then keep a persistent access open as bogl:

[bogl@centos53 lustre-release]$ stat /mnt/lustre/bogl/file
File: `/mnt/lustre/bogl/file'
Size: 4 Blocks: 8 IO Block: 2097152 regular file
Device: 2c54f966h/743766374d Inode: 144115205255725058 Links: 1
Access: (0600/-rw-------) Uid: ( 500/ bogl) Gid: ( 500/ bogl)
Access: 2012-10-02 08:35:56.000000000 -0700
Modify: 2012-10-01 15:18:07.000000000 -0700
Change: 2012-10-01 15:18:46.000000000 -0700
[bogl@centos53 lustre-release]$ tail -f /mnt/lustre/bogl/file
foo

After that a stat and access as root is allowed, at least for a while:

[root@centos53 bogl]# stat file
File: `file'
Size: 4 Blocks: 8 IO Block: 2097152 regular file
Device: 2c54f966h/743766374d Inode: 144115205255725058 Links: 1
Access: (0600/-rw-------) Uid: ( 500/ bogl) Gid: ( 500/ bogl)
Access: 2012-10-02 08:50:02.000000000 -0700
Modify: 2012-10-01 15:18:07.000000000 -0700
Change: 2012-10-01 15:18:46.000000000 -0700
[root@centos53 bogl]# cat file
foo

It seems to require both a (permitted) stat and file access as bogl before the access that should be forbidden as root gets allowed.

Comment by Peter Jones [ 30/Oct/12 ]

Niu is going to look into this one

Comment by Niu Yawei (Inactive) [ 30/Oct/12 ]

Hi, Alex

As you mentioned, root_squash is just a server-side id remapping (like NFS root_squash); it doesn't affect the client cache, so this looks like expected behaviour to me. You need to make sure the cache is cleared before you expect root_squash to be enforced. (I think it's the same for NFS, isn't it?) Thanks.
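For illustration, a minimal sketch of the workaround implied above, using the same command Alexandre runs later in this ticket; it only flushes the local caches so that the next access has to issue RPCs to the MDS, where root_squash is enforced:

# on the client, as root: drop cached dentries/inodes/pages so the next
# access goes to the MDS, where the squash setting is applied
echo 3 > /proc/sys/vm/drop_caches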

Comment by Alexandre Louvet [ 16/Nov/12 ]

Hi Niu,

I made some tests with NFS and it works as expected: root (under root_squash) never gets access to user data if the rights for 'others' are not set. It does not depend on the activity of an authorized user on the same client.
I should add that I can't make sure the cache is cleared before we want root_squash enforced: root_squash is expected to be enforced all the time, and the cache content depends on the activity of the authorized users.

Comment by Niu Yawei (Inactive) [ 28/Nov/12 ]

I made some tests with NFS and it works as expected: root (under root_squash) never gets access to user data if the rights for 'others' are not set. It does not depend on the activity of an authorized user on the same client.

I think the NFS client doesn't know whether the server is squashing root either, but there are several reasons I can think of why NFS root_squash is not affected much by the client cache:

  • NFS is a WCC (Weak Cache Consistency) filesystem; it revalidates the client cache every few seconds (default is 3?), and with some mount options the cache can be revalidated before every operation.
  • The NFS client sends an ACCESS RPC for the root user before operations.

Lustre is a strongly cache-coherent filesystem; we can't afford the extra ACCESS RPCs or cache revalidation that NFS does. Maybe we could make the client aware of the root_squash setting on the server, and let the user configure whether they always want the access check done on the server side (sacrificing performance), but I'm not sure we have enough resources to implement that at the moment. Anyway, I think we should state the root_squash caching problem clearly in the manual.

Alex, what do you think? Is this feature (enforcing root_squash regardless of caching) very important for you, or is just improving the manual OK for you? Thanks.

Comment by Sebastien Buisson (Inactive) [ 29/Nov/12 ]

Hi Niu,

We do not ask for setting or unsetting the root_squash parameter to be taken into account on the whole Lustre cluster in real time. We could live with unmounting and then remounting the clients if the root_squash parameter has changed on the server.
Our real issue is that a root user accessing a file from a client where the same file has already been accessed by a legitimate user will gain access to this file, whatever the root_squash parameter is, because the data will be read from the client cache.
I think it could be possible to store the root_squash information on the client at mount time. Then there would be no need to verify this on the server for every request, and there would be no impact on performance.

What do you think?

Sebastien.

Comment by Niu Yawei (Inactive) [ 29/Nov/12 ]

Hi, Sebastien

Yes, I agree with you on this. Adding a permission-checking hook for llite (and checking the squash setting there), and making llite aware of the root_squash setting, could save the RPCs to the server.

In my opinion, this could be a feature enhancement rather than a bug. I'm glad to implement it when time is available, and if you want to propose a patch for this, I'm glad to help with the review. Thank you.

Comment by Sebastien Buisson (Inactive) [ 05/Dec/12 ]

Hi,

I would like to propose a patch to address this issue, so I carried out some tests to try to understand which functions are involved in getting file permissions and granting or denying file access.
Unfortunately, my tests left me a little bit confused...

Here is what I did.
File owner is user buisso1s. root_squash is enforced.

  • accessing as user pichong:
    EACCES in ll_file_open() (file->private_data is not NULL)
  • accessing as root:
    EACCES in ll_file_open() (file->private_data is not NULL)
  • accessing as user pichong, while buisso1s runs 'tail -f file':
    -EACCES in ll_inode_permission()
  • accessing as root, while buisso1s runs 'tail -f file':
    Access granted, both ll_file_open() and ll_inode_permission() return 0 (file->private_data is NULL)

In the end, I could not figure out who is in charge of checking file permissions.
Can you shed some light on this?

TIA,
Sebastien.

Comment by Niu Yawei (Inactive) [ 05/Dec/12 ]

Hi,

When there is no cache on the client, permission checking is done on the server side during the open RPC (the first two cases); when there is cache on the client (no open RPC is needed), the permission check is done on the client by the permission-checking hook (ll_inode_permission, which is invoked by the kernel, see may_open()).
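For illustration only, a minimal sketch (not the actual patch) of the kind of check such a client-side hook could perform once llite knows the squash setting. It assumes the RHEL6-era kernel used in this ticket (plain integer uids) and that the squash_uid/squash_gid values have somehow been made available on the client; only ll_inode_permission() itself and the mode-bit logic (borrowed from generic_permission()) are real kernel code:

#include <linux/fs.h>    /* struct inode, MAY_READ/MAY_WRITE/MAY_EXEC */
#include <linux/cred.h>  /* current_fsuid() */

/* Sketch: would root, once squashed to squash_uid:squash_gid, still be allowed
 * to perform 'mask' on this cached inode?  (Simplified: no supplementary
 * groups, no ACLs.) */
static bool squashed_root_denied(struct inode *inode, int mask,
                                 uid_t squash_uid, gid_t squash_gid)
{
        umode_t mode = inode->i_mode;

        if (current_fsuid() != 0 || squash_uid == 0)
                return false;            /* caller is not root, or squashing is off */

        /* After squashing, root is just the squash uid:gid, so select the
         * owner, group or other permission bits accordingly. */
        if (inode->i_uid == squash_uid)
                mode >>= 6;
        else if (inode->i_gid == squash_gid)
                mode >>= 3;

        return ((mask & ~mode) & (MAY_READ | MAY_WRITE | MAY_EXEC)) != 0;
}

A hook like ll_inode_permission() could then return -EACCES when this returns true, instead of trusting the cached attributes.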

Comment by Sebastien Buisson (Inactive) [ 06/Dec/12 ]

So, is it OK if I propose to modify ll_inode_permission() to add a check for some kind of root_squash parameter that would be fetched by the client at mount time and stored somewhere?

Comment by Niu Yawei (Inactive) [ 06/Dec/12 ]

Hi Sebastien, I think it's doable. The current root_squash option is stored in the MDT config log (because it's an MDS-only option); we could probably populate this option into the client config log as well, and then the client would be notified whenever the option changes. Thanks.

Comment by Sebastien Buisson (Inactive) [ 18/Dec/12 ]

Hi,
I think I need some help regarding the way to store the root_squash option in the client config log. At the moment this is stored in the mdt config log, so how is it possible to pass it to the client config log? Via the mgs config log? What are the functions involved in that case?
TIA,
Sebastien.

Comment by Niu Yawei (Inactive) [ 18/Dec/12 ]

Hi, Sebastien

Please look at mgs_write_log_param(). root_squash is now a PARAM_MDT parameter, stored in the $FSNAME-mdt0001 config log; you might want it to be stored in the client log as well ($FSNAME-client). I think a simple way is to have the administrator run two configuration commands:
1. lctl conf_param $FSNAME.mdt.root_squash=$ID:$ID
2. lctl conf_param $FSNAME.llite.root_squash=$ID:$ID

The other options related to root_squash, such as nosquash_nids, should be treated carefully as well. Thanks.
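As a usage sketch of the two commands above (run on the MGS node; the llite-side parameter is only a proposal here and does not exist yet in the code being discussed), using the 'scratch1' filesystem name from earlier in this ticket:

lctl conf_param scratch1.mdt.root_squash="65535:65535"
lctl conf_param scratch1.llite.root_squash="65535:65535"   # proposed client-side copy
# then verify on an MDS and on a client respectively
lctl get_param mdt/*/root_squash
lctl get_param llite/*/root_squash                          # proposed client-side copy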

Comment by Gregoire Pichon [ 30/Jan/13 ]

I have posted a patch for b2_1 on gerrit: http://review.whamcloud.com/5212

Comment by Gregoire Pichon [ 31/Jan/13 ]

For information, here is the note from Andreas Dilger in the gerrit.

Patch Set 1: I would prefer that you didn't submit this

I don't think this patch introduces any useful security to the system. If the user is root on the client, then it is trivial to "su" to another user and bypass the client-side root squash entirely.

Comment by Alexandre Louvet [ 31/Jan/13 ]

This is also true for NFS, but that is not the problem. Lustre claims to support root_squash (at least there is a chapter in the documentation about this feature), and customers expect this functionality to prevent root from accessing files to which the root user does not have access. I agree that root can modify its credentials and access the file, but that is another story.

The only real interest of this feature is to prevent root from making careless actions that would damage the content of the filesystem, but the behaviour of the feature should be consistent over time and not change with the client state. Currently the root_squash behaviour is confusing, and the request is simply to make it consistent.

Comment by Gregoire Pichon [ 07/Feb/13 ]

Excerpt from Andreas' comment in Gerrit:

...
In summary, there is absolutely nothing to be gained except code complexity if the user already has root access on the client. This has to be enforced at the server, and at most root squash can only prevent the user from accessing files owned by root in the filesystem, or other root-only operations.

Only with Kerberos and/or the upcoming UID/GID mapping could root be denied access to new files from that client, and I can't think of any way that root could be denied access to cached files on the client. Even if the user's keys were only in memory and the kernel itself blocked access from root locally (in an irrevocable manner), the root user could replace the lustre kernel modules with an insecure version and reboot, and then wait until the user accessed secure data again.

The only way to avoid this is to never allow root access on the client in the first place.

Andreas,

The current implementation of the root squash feature in Lustre is not working as expected by the customer, nor as specified in the "Using Root Squash" section of the Lustre Operations Manual.

What do you propose to make progress on this issue?

If you think this feature is senseless, then why not reduce its scope to security configurations only (MDT sec-level), or even remove the feature completely?

My feeling is that we should be able to make it work properly. We could perform the root squashing on the client by overwriting the fsuid and fsgid of the task with the root_squash uid:gid specified on the MDS. These settings could be transmitted to the client either at mount time or each time file attributes are retrieved from the MDS (LDLM_INTENT_OPEN or LDLM_INTENT_GETATTR RPCs, for instance). The patch I proposed last week is not suitable. OK, let's find a better solution.
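As an illustration of the fsuid/fsgid overriding idea described above (a sketch only, not the patch that eventually landed; the helper name is hypothetical, and a pre-3.5 kernel credential API with plain integer uids, matching the kernels in this ticket, is assumed):

#include <linux/cred.h>    /* prepare_creds(), commit_creds(), current_fsuid() */
#include <linux/errno.h>

/* Hypothetical helper: make the current task act as the squashed uid:gid for
 * subsequent VFS permission checks. */
static int squash_current_fsugid(uid_t squash_uid, gid_t squash_gid)
{
        struct cred *cred;

        if (current_fsuid() != 0)        /* only root gets squashed */
                return 0;

        cred = prepare_creds();          /* writable copy of the task credentials */
        if (cred == NULL)
                return -ENOMEM;

        cred->fsuid = squash_uid;        /* the VFS uses fsuid/fsgid for permission checks */
        cred->fsgid = squash_gid;
        return commit_creds(cred);       /* install the modified credentials */
}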

Comment by Andreas Dilger [ 08/Feb/13 ]

Actually, the description in the user manual correctly describes how the code functions:

http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.50438221_64726

Root squash is a security feature which restricts super-user access rights to a Lustre file system. Without the root squash feature enabled, Lustre users on untrusted clients could access or modify files owned by root on the filesystem, including deleting them. Using the root squash feature restricts file access/modifications as the root user to only the specified clients. Note, however, that this does not prevent users on insecure clients from accessing files owned by other users.

Like I wrote in the Gerrit comment, there is nothing that root squash can do to prevent access to files when someone has root access on the client. In that case, the root user could "su - other_user" and immediately circumvent all of the checking that was added to squash the root user's access. The root squash feature is only there to prevent "root" on clients from being able to access and/or modify files owned by root on the filesystem. The same "su - other_user" hole is present for NFS, and the fact that "root" is denied direct access on NFS is like a sheet of paper protecting a bank vault.

The OpenSFS UID/GID mapping and shared-key authentication features being developed by IU could allow for much more robust protection in the future. This would allow mapping users from specific nodes to one set of UIDs that don't overlap with UIDs from other nodes, and with the shared-key node authentication it would be impossible for even root to access files for UIDs that are not mapped to that cluster. If you are interested to follow this design and development, please email me and I will provide meeting and list details.

Comment by Alexandre Louvet [ 09/Feb/13 ]

Andreas, I think we are moving away from this ticket's objective. I do agree with all the points about the security limitations of the root_squash feature, but that is not the problem here. The problem is that the manual says the root user is only granted access to objects for which it is allowed, and this is not always true.

A case where root tries to get read access to an object whose inode is already in the client cache does not get root_squash applied. The client code has no knowledge of root_squash and only applies the traditional permission checking. The result is that root gets access granted or denied depending on the cache content, which is very confusing for users. That is the only point of this Jira ticket.

Comment by Gregoire Pichon [ 13/Mar/13 ]

I have posted a patch for master on gerrit: http://review.whamcloud.com/#change,5700

Comment by Gregoire Pichon [ 25/Jul/13 ]

Tests on patch sets 7 and 8 have made the client hang after conf-sanity test_43 (the one for root squash). I have been able to reproduce the hang (after 16 successful runs) and took a dump.

It is available on ftp.whamcloud.com in /uploads/LU-1778
ftp> dir
227 Entering Passive Mode (72,18,218,227,205,178).
150 Here comes the directory listing.
-rw-r--r-- 1 123 114 3387608 Jul 25 07:22 lustre-2.4.51-2.6.32_358.el6.x86_64_g4c66dbd.x86_64.rpm
-rw-r--r-- 1 123 114 45824206 Jul 25 07:23 lustre-debuginfo-2.4.51-2.6.32_358.el6.x86_64_g4c66dbd.x86_64.rpm
-rw-r--r-- 1 123 114 181316 Jul 25 07:23 lustre-ldiskfs-4.1.0-2.6.32_358.el6.x86_64_g4c66dbd.x86_64.rpm
-rw-r--r-- 1 123 114 1674715 Jul 25 07:23 lustre-ldiskfs-debuginfo-4.1.0-2.6.32_358.el6.x86_64_g4c66dbd.x86_64.rpm
-rw-r--r-- 1 123 114 3312152 Jul 25 07:23 lustre-modules-2.4.51-2.6.32_358.el6.x86_64_g4c66dbd.x86_64.rpm
-rw-r--r-- 1 123 114 165060 Jul 25 07:24 lustre-osd-ldiskfs-2.4.51-2.6.32_358.el6.x86_64_g4c66dbd.x86_64.rpm
-rw-r--r-- 1 123 114 5067172 Jul 25 07:24 lustre-source-2.4.51-2.6.32_358.el6.x86_64_g4c66dbd.x86_64.rpm
-rw-r--r-- 1 123 114 4757320 Jul 25 07:24 lustre-tests-2.4.51-2.6.32_358.el6.x86_64_g4c66dbd.x86_64.rpm
-rw-r--r-- 1 123 114 100181834 Jul 25 07:27 vmcore

Here is the information I have extracted from the dump.

The umount command seems hung. The upper part of the stack is due to the dump signal.

crash> bt 2723
PID: 2723   TASK: ffff88046ab98040  CPU: 4   COMMAND: "umount"
 #0 [ffff880028307e90] crash_nmi_callback at ffffffff8102d2c6
 #1 [ffff880028307ea0] notifier_call_chain at ffffffff815131d5
 #2 [ffff880028307ee0] atomic_notifier_call_chain at ffffffff8151323a
 #3 [ffff880028307ef0] notify_die at ffffffff8109cbfe
 #4 [ffff880028307f20] do_nmi at ffffffff81510e9b
 #5 [ffff880028307f50] nmi at ffffffff81510760
    [exception RIP: page_fault]
    RIP: ffffffff815104b0  RSP: ffff880472c13bc0  RFLAGS: 00000082
    RAX: ffffc9001dd57008  RBX: ffff880470b27e40  RCX: 000000000000000f
    RDX: ffffc9001dd1d000  RSI: ffff880472c13c08  RDI: ffff880470b27e40
    RBP: ffff880472c13c48   R8: 0000000000000000   R9: 00000000fffffffe
    R10: 0000000000000001  R11: 5a5a5a5a5a5a5a5a  R12: ffff880472c13c08 = struct cl_site *
    R13: 00000000000000c4  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff880472c13bc0] page_fault at ffffffff815104b0
 #7 [ffff880472c13bc8] cfs_hash_putref at ffffffffa04305c1 [libcfs]
 #8 [ffff880472c13c50] lu_site_fini at ffffffffa0588841 [obdclass]
 #9 [ffff880472c13c70] cl_site_fini at ffffffffa0591d0e [obdclass]
#10 [ffff880472c13c80] ccc_device_free at ffffffffa0e6c16a [lustre]
#11 [ffff880472c13cb0] lu_stack_fini at ffffffffa058b22e [obdclass]
#12 [ffff880472c13cf0] cl_stack_fini at ffffffffa059132e [obdclass]
#13 [ffff880472c13d00] cl_sb_fini at ffffffffa0e703bd [lustre]
#14 [ffff880472c13d40] client_common_put_super at ffffffffa0e353d4 [lustre]
#15 [ffff880472c13d70] ll_put_super at ffffffffa0e35ef9 [lustre]
#16 [ffff880472c13e30] generic_shutdown_super at ffffffff8118326b
#17 [ffff880472c13e50] kill_anon_super at ffffffff81183356
#18 [ffff880472c13e70] lustre_kill_super at ffffffffa057d37a [obdclass]
#19 [ffff880472c13e90] deactivate_super at ffffffff81183af7
#20 [ffff880472c13eb0] mntput_no_expire at ffffffff811a1b6f
#21 [ffff880472c13ee0] sys_umount at ffffffff811a25db
#22 [ffff880472c13f80] system_call_fastpath at ffffffff8100b072
    RIP: 00007f0e6a971717  RSP: 00007fff17919878  RFLAGS: 00010206
    RAX: 00000000000000a6  RBX: ffffffff8100b072  RCX: 0000000000000010
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00007f0e6c3cfb90
    RBP: 00007f0e6c3cfb70   R8: 00007f0e6c3cfbb0   R9: 0000000000000000
    R10: 00007fff179196a0  R11: 0000000000000246  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 00007f0e6c3cfbf0
    ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b

ccc_device_free() is called on lu_device 0xffff880475ad06c0

crash> struct lu_device ffff880475ad06c0
struct lu_device {
  ld_ref = {
    counter = 1
  }, 
  ld_type = 0xffffffffa0ea22e0, 
  ld_ops = 0xffffffffa0e787a0, 
  ld_site = 0xffff880472cf05c0, 
  ld_proc_entry = 0x0, 
  ld_obd = 0x0, 
  ld_reference = {<No data fields>}, 
  ld_linkage = {
    next = 0xffff880472cf05f0, 
    prev = 0xffff880472cf05f0
  }
}

ld_type->ldt_tags
crash> rd -8 ffffffffa0ea22e0
ffffffffa0ea22e0:  04 = LU_DEVICE_CL

ld_type->ldt_name
crash> rd  ffffffffa0ea22e8
ffffffffa0ea22e8:  ffffffffa0e7d09f = "vvp"


lu_site=ffff880472cf05c0
crash> struct lu_site ffff880472cf05c0
struct lu_site {
  ls_obj_hash = 0xffff880470b27e40, 
  ls_purge_start = 0, 
  ls_top_dev = 0xffff880475ad06c0, 
  ls_bottom_dev = 0x0, 
  ls_linkage = {
    next = 0xffff880472cf05e0, 
    prev = 0xffff880472cf05e0
  }, 
  ls_ld_linkage = {
    next = 0xffff880475ad06f0, 
    prev = 0xffff880475ad06f0
  }, 
  ls_ld_lock = {
    raw_lock = {
      slock = 65537
    }
  }, 
  ls_stats = 0xffff880470b279c0, 
  ld_seq_site = 0x0
}

crash> struct cfs_hash 0xffff880470b27e40
struct cfs_hash {
  hs_lock = {
    rw = {
      raw_lock = {
        lock = 0
      }
    }, 
    spin = {
      raw_lock = {
        slock = 0
      }
    }
  }, 
  hs_ops = 0xffffffffa05edee0, 
  hs_lops = 0xffffffffa044e320, 
  hs_hops = 0xffffffffa044e400,
  hs_buckets = 0xffff880471e4f000, 
  hs_count = {
    counter = 0
  }, 
  hs_flags = 6184, = 0x1828 = CFS_HASH_SPIN_BKTLOCK | CFS_HASH_NO_ITEMREF | CFS_HASH_ASSERT_EMPTY | CFS_HASH_DEPTH 
  hs_extra_bytes = 48, 
  hs_iterating = 0 '\000', 
  hs_exiting = 1 '\001', 
  hs_cur_bits = 23 '\027', 
  hs_min_bits = 23 '\027', 
  hs_max_bits = 23 '\027', 
  hs_rehash_bits = 0 '\000', 
  hs_bkt_bits = 15 '\017', 
  hs_min_theta = 0, 
  hs_max_theta = 0, 
  hs_rehash_count = 0, 
  hs_iterators = 0, 
  hs_rehash_wi = {
    wi_list = {
      next = 0xffff880470b27e88, 
      prev = 0xffff880470b27e88
    }, 
    wi_action = 0xffffffffa04310f0 <cfs_hash_rehash_worker>, 
    wi_data = 0xffff880470b27e40, 
    wi_running = 0, 
    wi_scheduled = 0
  }, 
  hs_refcount = {
    counter = 0
  }, 
  hs_rehash_buckets = 0x0, 
  hs_name = 0xffff880470b27ec0 "lu_site_vvp"
}

I am going to attach the log of the Maloo test that hung (Jul 19 10:12 PM).

Comment by Gregoire Pichon [ 25/Jul/13 ]

client log from Maloo test on patchset 7 (Jul 19 10:12 PM)

Comment by Gregoire Pichon [ 04/Dec/13 ]

I have posted another patch that adds a service to print a nidlist: http://review.whamcloud.com/#/c/8479/ . After the review of patch set 11 of patch #5700, it seems to be a requirement.

Comment by Cliff White (Inactive) [ 13/Dec/13 ]

Thank you. Would it be possible for you to rebase this on current master? There are a few conflicts preventing merge.

Comment by Gregoire Pichon [ 11/Feb/14 ]

Patch #8479 has been landed and then reverted due to a conflict with a GNIIPLND patch.

I have posted a new version of the patch: http://review.whamcloud.com/9221

Comment by Jodi Levi (Inactive) [ 22/Apr/14 ]

Patch landed to Master

Comment by Gregoire Pichon [ 23/Apr/14 ]

This ticket has not been fixed yet.
The main patch http://review.whamcloud.com/#change,5700 is still in progress.

Comment by Peter Jones [ 09/May/14 ]

Now really landed for 2.6.

Comment by Gregoire Pichon [ 18/Jun/14 ]

I have backported the two patches to be integrated in 2.5 maintenance release.
http://review.whamcloud.com/10743
http://review.whamcloud.com/10744

Comment by Gregoire Pichon [ 01/Sep/14 ]

The two patches above, #10743 and #10744, have been posted and ready for review since the end of June.
Would it be possible to have them included in the next 2.5 maintenance release, 2.5.3?

Comment by Gerrit Updater [ 01/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/10743/
Subject: LU-1778 libcfs: add a service that prints a nidlist
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 57a8a6bec4dc965388b5bba48e7501f79bdab44b

Comment by Gerrit Updater [ 01/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/10744/
Subject: LU-1778 llite: fix inconsistencies of root squash feature
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: d82b4f54cbbe269519330e88639dd8e197636496

Comment by Gregoire Pichon [ 24/Aug/16 ]

Closing, as the issue has been fixed (several months ago) in master and in the 2.5 maintenance release.
