Is there an algorithm to decide if a symlink loops?
I don't fully understand what you're asking. If I didn't know any better I think you were asking if there was a way to detect this while in the midst of dealing with a file. I don't believe this is possible.
The only method I can conceive of is doing a find where you specifically start looking through a particular branch in the directory tree.
Example
$ tree
.
`-- a
`-- b
|-- c
| `-- d
| `-- e -> ../../../../a/b
`-- e -> e
5 directories, 1 file
The find
command will detect this loop but not really tell you a whole lot about it.
$ find -L . -mindepth 15
find: File system loop detected; `./a/b/c/d/e' is part of the same file system loop as `./a/b'.
find: `./a/b/e': Too many levels of symbolic links
I arbitrarily picked 15 levels so as to block any output being displayed by the find
. You can however drop that switch (-mindepth
) if you don't care about the directory tree being displayed. The find
command still detects the loop and stops:
$ find -L .
.
./a
./a/b
./a/b/c
./a/b/c/d
find: File system loop detected; `./a/b/c/d/e' is part of the same file system loop as `./a/b'.
find: `./a/b/e': Too many levels of symbolic links
Incidentally, if you want to override the default MAXSYMLINKS
which is apparently 40 on Linux (newer 3.x versions of the kernel) you can see this U&L Q&A titled: How do you increase MAXSYMLINKS.
Using the symlinks command
There is a tool that FTP site maintainers could use called symlinks
which will help expose issues with too long or dangling trees that were caused by symbolic links.
In certain cases the symlinks
tool could be used to delete offending links too.
Example
$ symlinks -srv a
lengthy: /home/saml/tst/99159/a/b/c/d/e -> ../../../../a/b
dangling: /home/saml/tst/99159/a/b/e -> e
The glibc library
The glibc library looks to offer some C functions around this, but I don't entirely know their role or how to actually use them. So I can only merely point them out to you.
The man page, man symlink
shows the function definition for a function called symlink()
. The description goes like this:
symlink() creates a symbolic link named newpath which contains the string oldpath.
One of the error states that this function returns:
ELOOP Too many symbolic links were encountered in resolving newpath.
I'll also direct you to the man page, man path_resolution
which discusses how Unix determines paths to items on disk. Specifically this paragraph.
If the component is found and is a symbolic link (symlink), we first
resolve this symbolic link (with the current lookup directory as starting
lookup directory). Upon error, that error is returned. If the result is
not a directory, an ENOTDIR error is returned. If the resolution of the
symlink is successful and returns a directory, we set the current lookup
directory to that directory, and go to the next component. Note that the
resolution process here involves recursion. In order to protect the
kernel against stack overflow, and also to protect against denial of
service, there are limits on the maximum recursion depth, and on the maximum
number of symbolic links followed. An ELOOP error is returned when the
maximum is exceeded ("Too many levels of symbolic links").
OK, after some more thought I think I have a clear solution.
The critical insight is that if every link that is part of a path resolves to something, then the entire path resolves. Or the other way around, if a path does not resolve then there must be a specific symlink that requires traversing that does not resolve.
While thinking about this problem previously I was using an algorithm that traversed elements of a path starting from the root, and when it encountered a symlink it replaced that path element with the contents of the symlink and then continued traversing. Since this approach doesn't remember which symlink it is currently resolving it cannot detect when it is in a nonresolving loop.
If the algorithm keeps track of which symlink it is currently resolving (or which symlinks in case of recursive links), it can detect if it is attempting to resolve a link again recursively which it is still busy resolving.
Algorithm:
initialize `location` to the current working directory
initialize `link_contents` to the path we want to resolve
initialize `active_symlinks` to the empty set
def resolve_symlink(location, link_contents, active_symlinks) :
loop forever:
next_location = location / [first element of link_contents]
see if next_location is a symlink.
if so:
if next_location in active_symlinks: abort, we have a loop
location = resolve_symlink(location, readlink(next_location), active_symlinks ∪ {next_location})
else:
location = next_location
strip first element of link_contents
if link_contents is empty:
return location
edit:
I have a working implementation of this in python at https://bitbucket.org/JanKanis/python-inotify/src/853ed903e870cbfa283e6ce7a5e41aeffe16d4e7/inotify/pathresolver.py?at=pathwatcher.
edit 2: Bitbucket stopped hosting mercurial repos. Here is the full file:
# pathresolver.py - This module contains an iterator that iterates over all
# elements of a path including any symlinks.
# Copyright 2012-2013 Jan Kanis
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
"""
This module contains a few functions that help traverse paths with
symlinks.
`resolve_symlinks` is a generator that yields pairs of paths, with one
yield for each path element that is traversed to reach the final
target of a path. This includes path elements and paths from symlinks.
`resolve_path` is a wrapper around `resolve_symlinks` that takes a
single path as an argument and sets up the other arguments to
`resolve_symlinks`.
`get_symlinkmax` is a function that determines the maximum number of
symlinks the system will traverse in a path.
Note: this module forms part of python-inotify, but is considered an
internal module. As such there are no stability guarantees regarding
the api's of these functions.
"""
__author__ = "Jan Kanis"
import sys, os, errno
import tempfile, shutil
from pathlib import PosixPath
_curdir = PosixPath('.')
_root = PosixPath('/')
_parentdir = PosixPath('..')
def resolve_path(path):
'''Resolve the symlinks in path, yielding all filesystem locations that are traversed.
The yielded value is a tuple, of which the first element is a symlink-free
path, and the second element is a path relative to the first element that
has not yet been traversed. This second element may contain more symlinks.
The resolution implementation will follow an unbounded number of symlinks
but will still detect symlink loops if they prevent a path from resolving.
path can be given as a string or as a pathlib object. The yielded values
are pathlib.PosixPath objects.
'''
linkcache = {}
linkcounter = [0]
for p in resolve_symlink(_curdir, PosixPath(path), set(),
linkcache, linkcounter):
yield p
def resolve_symlink(location, link_contents, active_links, known_links, linkcounter):
'''Recursively resolve a symlink (or another path) to the file or
directory it ultimately points to. This function handles an
unlimited number of symlinks, and correctly detects symlink
loops. All path parameters should be given as pathlib.PosixPath
instances.
location: The directory in which the currently to be resolved link resides.
link_contents: The path stored in the symlink as returned by readlink().
active_links: a set of symlinks that is currently being resolved.
linkcache: a dictionary of link location -> resolved target paths. This
cache prevents this function from having to resolve the same symlink
twice. (Note that having to traverse the same symlink multiple times
does not necessarily mean that the path does not resolve to anything.)
linkcounter: A list containing a single number. (We use a list so that the
value can be passed by reference.) This number is updated to indicate the
total number of symlinks that has been traversed.
'''
while True:
if link_contents.is_absolute():
location = _root
link_contents = link_contents.relative()
yield location, link_contents
if link_contents == _curdir:
return
if link_contents.parts[0] == '..':
# We need to choose here if we allow traversing of a path above
# the root or above the current directory. Going above CWD
# should be allowed as long as we don't go above / by doing
# so. The OS allows going to /.. (which just ends up at /
# again), so for consistency with that we also allow it,
# although a path that requires us to do this is probably a bug
# somewhere.
if all(p in ('/', '..') for p in location.parts):
location = location['..']
else:
location = location.parent()
# Strip the first part of link_contents off
link_contents = link_contents.parts[1:]
continue
try:
nextpath = location[link_contents.parts[0]]
newlink = PosixPath(os.readlink(str(nextpath)))
except OSError as e:
if e.errno == errno.EINVAL:
# The entry is not a symbolic link, assume it is a normal file
# or directory. If it is a file and we need it to be a
# directory that will be detected the next time through the
# loop in the os.errno.ENOTDIR check. Checking it here would be
# possible, but keeping the number of system calls at one per
# loop makes reasoning about race conditions easier.
location = nextpath
link_contents = link_contents.parts[1:]
continue
if e.errno == errno.ENOENT:
# The entry does not exist
raise FileNotFoundError(nextpath)
if e.errno == errno.ENOTDIR:
# At this point we know the path is not valid, but we can not
# determine with certainty what is wrong. If there were no
# concurrent modifications we can safely make an is_dir()
# call. If concurrent modifications did happen the is_dir()
# check may succeed erroneously but we can't detect all
# concurrent modifications anyway. If the check fails
# (erroneously or not) that indicates a concurrent modification
# so we fall through.
if not location.is_dir():
raise NotADirectoryError(location)
# We should not be able to get here, unless there is a bug
# or some relevant part of the file system was changed
# concurrently while we were resolving this link.
raise ConcurrentFilesystemModificationError(nextpath)
if e.errno == errno.ELOOP:
# Can only happen if a path component was changed concurrently
raise ConcurrentFilesystemModificationError(nextpath)
# For other OSErrors (such as in case of EACCESS) we propagate to
# the caller.
raise
# It is a symlink!
if nextpath in active_links:
raise SymlinkLoopError(nextpath)
link_contents = link_contents.parts[1:]
# We have not yet attempted traversing this symlink during the
# current call or any of its parents.
if nextpath in known_links:
# known_links stores fully resolved paths, so we don't need to
# traverse the cached path and can just continue our traversal from
# there.
location = known_links[nextpath]
continue
# An unknown link, resolve it recursively
linkcounter[0] += 1
# Don't yield the very last result of this recursive call immediately,
# we still want to process that further.
lastloc, lastlink = None, None
for loc, link in resolve_symlink(location, newlink,
active_links.union((nextpath,)), known_links, linkcounter):
if lastloc:
yield lastloc, lastlink[link_contents]
lastloc, lastlink = loc, link
# The last yielded location is the final resolution of the symlink. The
# last yielded link_contents is always '.' so we can ignore that.
known_links[nextpath] = loc
location = loc
continue
_symlinkmax = None
def get_symlinkmax():
'''Returns the maximum number of symlinks that this system will traverse in
resolving a file path.
'''
global _symlinkmax
if _symlinkmax is not None:
return _symlinkmax
try:
tempdir = tempfile.mkdtemp(prefix='inotify-symlinkmax-tmpdir-')
open(tempdir+'/testfile', 'w').close()
target = 'testfile'
for i in range(1, 60):
name = 'l'+str(i)
os.symlink(target, tempdir+'/'+name)
target = name
try:
open(tempdir+'/'+name).close()
except IOError as e:
if e.errno == errno.ELOOP:
_symlinkmax = i - 1
break
raise
finally:
if tempdir:
shutil.rmtree(tempdir)
return _symlinkmax
class InvalidPathError (OSError):
def __init__(self, msg, path, errno=None, *args):
self.filename = path
self.errno = errno
if errno:
self.strerror = os.strerror(errno)
OSError.__init__(self, msg, *args)
class SymlinkLoopError (InvalidPathError):
def __init__(self, path, *args):
msg = "Path not valid: The symlink at '{}' forms a symlink loop".format(path)
InvalidPathError.__init__(self, msg, path, errno=errno.ELOOP, *args)
class ConcurrentFilesystemModificationError (InvalidPathError):
def __init__(self, path, *args):
msg = "Path not valid: A concurrent change was detected while traversing '{}'".format(path)
InvalidPathError.__init__(self, msg, path, errno=None, *args)
# To be Python 2 and 3 compatible and also inherit from the right exception
# types in python 3, we need some boilerplate.
fnf_msg = "Path not valid: '{}' does not exist"
nad_msg = "Path not valid: '{}' is not a directory"
if sys.version_info >= (3, 3):
class FileNotFoundError (InvalidPathError, FileNotFoundError):
def __init__(self, path, *args):
InvalidPathError.__init__(self, fnf_msg.format(path), path,
errno=ENOENT, *args)
class NotADirectoryError (InvalidPathError, NotADirectoryError):
def __init__(self, path, *args):
InvalidPathError.__init__(self, nad_msg.format(path), path,
errno=errno.ENOTDIR, *args)
else:
class FileNotFoundError (InvalidPathError):
def __init__(self, path, *args):
InvalidPathError.__init__(self, fnf_msg.format(path), path,
errno=errno.ENOENT, *args)
class NotADirectoryError (InvalidPathError):
def __init__(self, path, *args):
InvalidPathError.__init__(self, nad_msg.format(path), path,
errno=errno.ENOTDIR, *args)
Python has a function called networkx.simple_cycles() that can be used for this. But yes it would need to read every file on the system.
>>> import networkx as nx
>>> G = nx.DiGraph()
>>> G.add_edge('A', 'B')
>>> G.add_edge('B', 'C')
>>> G.add_edge('C', 'D')
>>> G.add_edge('C', 'A')
>>> nx.simple_cycles(G)
[['A', 'B', 'C', 'A']]