Is there an algorithm to decide if a symlink loops?

I don't fully understand what you're asking. If I didn't know any better I think you were asking if there was a way to detect this while in the midst of dealing with a file. I don't believe this is possible.

The only method I can conceive of is doing a find where you specifically start looking through a particular branch in the directory tree.

Example

$ tree 
.
`-- a
    `-- b
        |-- c
        |   `-- d
        |       `-- e -> ../../../../a/b
        `-- e -> e

5 directories, 1 file

The find command will detect this loop but not really tell you a whole lot about it.

$ find -L . -mindepth 15
find: File system loop detected; `./a/b/c/d/e' is part of the same file system loop as `./a/b'.
find: `./a/b/e': Too many levels of symbolic links

I arbitrarily picked 15 levels so as to block any output being displayed by the find. You can however drop that switch (-mindepth) if you don't care about the directory tree being displayed. The find command still detects the loop and stops:

$ find -L . 
.
./a
./a/b
./a/b/c
./a/b/c/d
find: File system loop detected; `./a/b/c/d/e' is part of the same file system loop as `./a/b'.
find: `./a/b/e': Too many levels of symbolic links

Incidentally, if you want to override the default MAXSYMLINKS which is apparently 40 on Linux (newer 3.x versions of the kernel) you can see this U&L Q&A titled: How do you increase MAXSYMLINKS.

Using the symlinks command

There is a tool that FTP site maintainers could use called symlinks which will help expose issues with too long or dangling trees that were caused by symbolic links.

In certain cases the symlinks tool could be used to delete offending links too.

Example

$ symlinks -srv a
lengthy:  /home/saml/tst/99159/a/b/c/d/e -> ../../../../a/b
dangling: /home/saml/tst/99159/a/b/e -> e

The glibc library

The glibc library looks to offer some C functions around this, but I don't entirely know their role or how to actually use them. So I can only merely point them out to you.

The man page, man symlink shows the function definition for a function called symlink(). The description goes like this:

symlink() creates a symbolic link named newpath which contains the string oldpath.

One of the error states that this function returns:

ELOOP Too many symbolic links were encountered in resolving newpath.

I'll also direct you to the man page, man path_resolution which discusses how Unix determines paths to items on disk. Specifically this paragraph.

If  the component is found and is a symbolic link (symlink), we first 
resolve this symbolic link (with the current lookup directory as starting 
lookup directory).  Upon error, that error is returned.  If the result is 
not a directory, an ENOTDIR error is returned.  If the resolution of the 
symlink is successful and returns a directory, we set the current lookup
directory to that directory, and go to the next component.  Note that the 
resolution process here involves recursion.  In order  to  protect  the 
kernel against stack overflow, and also to protect against denial of 
service, there are limits on the maximum recursion depth, and on the maximum 
number of symbolic links followed.  An ELOOP error is returned  when  the
maximum is exceeded ("Too many levels of symbolic links").

OK, after some more thought I think I have a clear solution.

The critical insight is that if every link that is part of a path resolves to something, then the entire path resolves. Or the other way around, if a path does not resolve then there must be a specific symlink that requires traversing that does not resolve.

While thinking about this problem previously I was using an algorithm that traversed elements of a path starting from the root, and when it encountered a symlink it replaced that path element with the contents of the symlink and then continued traversing. Since this approach doesn't remember which symlink it is currently resolving it cannot detect when it is in a nonresolving loop.

If the algorithm keeps track of which symlink it is currently resolving (or which symlinks in case of recursive links), it can detect if it is attempting to resolve a link again recursively which it is still busy resolving.

Algorithm:

initialize `location` to the current working directory
initialize `link_contents` to the path we want to resolve
initialize `active_symlinks` to the empty set

def resolve_symlink(location, link_contents, active_symlinks) :
    loop forever:
        next_location = location / [first element of link_contents]
        see if next_location is a symlink.
        if so:
            if next_location in active_symlinks: abort, we have a loop
            location = resolve_symlink(location, readlink(next_location), active_symlinks ∪ {next_location})
        else:
            location = next_location
        strip first element of link_contents
        if link_contents is empty: 
            return location

edit:

I have a working implementation of this in python at https://bitbucket.org/JanKanis/python-inotify/src/853ed903e870cbfa283e6ce7a5e41aeffe16d4e7/inotify/pathresolver.py?at=pathwatcher.

edit 2: Bitbucket stopped hosting mercurial repos. Here is the full file:

# pathresolver.py - This module contains an iterator that iterates over all
# elements of a path including any symlinks. 

# Copyright 2012-2013 Jan Kanis

# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.


"""
This module contains a few functions that help traverse paths with
symlinks.

`resolve_symlinks` is a generator that yields pairs of paths, with one
yield for each path element that is traversed to reach the final
target of a path. This includes path elements and paths from symlinks.

`resolve_path` is a wrapper around `resolve_symlinks` that takes a
single path as an argument and sets up the other arguments to
`resolve_symlinks`.

`get_symlinkmax` is a function that determines the maximum number of
symlinks the system will traverse in a path.

Note: this module forms part of python-inotify, but is considered an
internal module. As such there are no stability guarantees regarding
the api's of these functions.
"""


__author__ = "Jan Kanis"


import sys, os, errno
import tempfile, shutil
from pathlib import PosixPath


_curdir = PosixPath('.')
_root = PosixPath('/')
_parentdir = PosixPath('..')


def resolve_path(path):
    '''Resolve the symlinks in path, yielding all filesystem locations that are traversed.

    The yielded value is a tuple, of which the first element is a symlink-free
    path, and the second element is a path relative to the first element that
    has not yet been traversed. This second element may contain more symlinks.
    
    The resolution implementation will follow an unbounded number of symlinks
    but will still detect symlink loops if they prevent a path from resolving.

    path can be given as a string or as a pathlib object. The yielded values
    are pathlib.PosixPath objects.

    '''
    linkcache = {}
    linkcounter = [0]
    for p in resolve_symlink(_curdir, PosixPath(path), set(),
                                  linkcache, linkcounter):
        yield p


def resolve_symlink(location, link_contents, active_links, known_links, linkcounter):
    '''Recursively resolve a symlink (or another path) to the file or
    directory it ultimately points to. This function handles an
    unlimited number of symlinks, and correctly detects symlink
    loops. All path parameters should be given as pathlib.PosixPath
    instances.

    location: The directory in which the currently to be resolved link resides.

    link_contents: The path stored in the symlink as returned by readlink().

    active_links: a set of symlinks that is currently being resolved.

    linkcache: a dictionary of link location -> resolved target paths. This
    cache prevents this function from having to resolve the same symlink
    twice. (Note that having to traverse the same symlink multiple times
    does not necessarily mean that the path does not resolve to anything.)

    linkcounter: A list containing a single number. (We use a list so that the
    value can be passed by reference.) This number is updated to indicate the
    total number of symlinks that has been traversed.

    '''

    while True:
        if link_contents.is_absolute():
            location = _root
            link_contents = link_contents.relative()

        yield location, link_contents
        if link_contents == _curdir:
            return

        if link_contents.parts[0] == '..':
            # We need to choose here if we allow traversing of a path above
            # the root or above the current directory. Going above CWD
            # should be allowed as long as we don't go above / by doing
            # so. The OS allows going to /.. (which just ends up at /
            # again), so for consistency with that we also allow it,
            # although a path that requires us to do this is probably a bug
            # somewhere.
            if all(p in ('/', '..') for p in location.parts):
                location = location['..']
            else:
                location = location.parent()
            # Strip the first part of link_contents off
            link_contents = link_contents.parts[1:]
            continue

        try:
            nextpath = location[link_contents.parts[0]]
            newlink = PosixPath(os.readlink(str(nextpath)))
        except OSError as e:
            if e.errno == errno.EINVAL:
                # The entry is not a symbolic link, assume it is a normal file
                # or directory. If it is a file and we need it to be a
                # directory that will be detected the next time through the
                # loop in the os.errno.ENOTDIR check. Checking it here would be
                # possible, but keeping the number of system calls at one per
                # loop makes reasoning about race conditions easier.
                location = nextpath
                link_contents = link_contents.parts[1:]
                continue
            if e.errno == errno.ENOENT:
                # The entry does not exist
                raise FileNotFoundError(nextpath)
            if e.errno == errno.ENOTDIR:
                # At this point we know the path is not valid, but we can not
                # determine with certainty what is wrong. If there were no
                # concurrent modifications we can safely make an is_dir()
                # call. If concurrent modifications did happen the is_dir()
                # check may succeed erroneously but we can't detect all
                # concurrent modifications anyway. If the check fails
                # (erroneously or not) that indicates a concurrent modification
                # so we fall through.
                if not location.is_dir():
                    raise NotADirectoryError(location)
                # We should not be able to get here, unless there is a bug
                # or some relevant part of the file system was changed
                # concurrently while we were resolving this link.
                raise ConcurrentFilesystemModificationError(nextpath)
            if e.errno == errno.ELOOP:
                # Can only happen if a path component was changed concurrently
                raise ConcurrentFilesystemModificationError(nextpath)
            # For other OSErrors (such as in case of EACCESS) we propagate to
            # the caller. 
            raise

        # It is a symlink!
        if nextpath in active_links:
            raise SymlinkLoopError(nextpath)

        link_contents = link_contents.parts[1:]
        # We have not yet attempted traversing this symlink during the
        # current call or any of its parents.
        if nextpath in known_links:
            # known_links stores fully resolved paths, so we don't need to
            # traverse the cached path and can just continue our traversal from
            # there.
            location = known_links[nextpath]
            continue
        
        # An unknown link, resolve it recursively
        linkcounter[0] += 1
        # Don't yield the very last result of this recursive call immediately,
        # we still want to process that further. 
        lastloc, lastlink = None, None
        for loc, link in resolve_symlink(location, newlink,
                          active_links.union((nextpath,)), known_links, linkcounter):
            if lastloc:
                yield lastloc, lastlink[link_contents]
            lastloc, lastlink = loc, link
        # The last yielded location is the final resolution of the symlink. The
        # last yielded link_contents is always '.' so we can ignore that.
        known_links[nextpath] = loc
        location = loc
        continue


_symlinkmax = None
def get_symlinkmax():
  '''Returns the maximum number of symlinks that this system will traverse in
  resolving a file path.

  '''
  global _symlinkmax
  if _symlinkmax is not None:
    return _symlinkmax

  try:
    tempdir = tempfile.mkdtemp(prefix='inotify-symlinkmax-tmpdir-')
    open(tempdir+'/testfile', 'w').close()

    target = 'testfile'
    for i in range(1, 60):
      name = 'l'+str(i)
      os.symlink(target, tempdir+'/'+name)
      target = name

      try:
        open(tempdir+'/'+name).close()
      except IOError as e:
        if e.errno == errno.ELOOP:
          _symlinkmax = i - 1
          break
        raise
    
  finally:
    if tempdir:
      shutil.rmtree(tempdir)
  return _symlinkmax



class InvalidPathError (OSError):
    def __init__(self, msg, path, errno=None, *args):
        self.filename = path
        self.errno = errno
        if errno:
            self.strerror = os.strerror(errno)
        OSError.__init__(self, msg, *args)

class SymlinkLoopError (InvalidPathError):
    def __init__(self, path, *args):
        msg = "Path not valid: The symlink at '{}' forms a symlink loop".format(path)
        InvalidPathError.__init__(self, msg, path, errno=errno.ELOOP, *args)

class ConcurrentFilesystemModificationError (InvalidPathError):
    def __init__(self, path, *args):
        msg = "Path not valid: A concurrent change was detected while traversing '{}'".format(path)
        InvalidPathError.__init__(self, msg, path, errno=None, *args)


# To be Python 2 and 3 compatible and also inherit from the right exception
# types in python 3, we need some boilerplate.

fnf_msg = "Path not valid: '{}' does not exist"
nad_msg = "Path not valid: '{}' is not a directory"

if sys.version_info >= (3, 3):
    class FileNotFoundError (InvalidPathError, FileNotFoundError):
        def __init__(self, path, *args):
            InvalidPathError.__init__(self, fnf_msg.format(path), path,
                                          errno=ENOENT, *args)

    class NotADirectoryError (InvalidPathError, NotADirectoryError):
        def __init__(self, path, *args):
            InvalidPathError.__init__(self, nad_msg.format(path), path,
                                          errno=errno.ENOTDIR, *args)

else:
    class FileNotFoundError (InvalidPathError):
        def __init__(self, path, *args):
            InvalidPathError.__init__(self, fnf_msg.format(path), path,
                                          errno=errno.ENOENT, *args)

    class NotADirectoryError (InvalidPathError):
        def __init__(self, path, *args):
            InvalidPathError.__init__(self, nad_msg.format(path), path,
                                          errno=errno.ENOTDIR, *args)

Python has a function called networkx.simple_cycles() that can be used for this. But yes it would need to read every file on the system.

>>> import networkx as nx
>>> G = nx.DiGraph()
>>> G.add_edge('A', 'B')
>>> G.add_edge('B', 'C')
>>> G.add_edge('C', 'D')
>>> G.add_edge('C', 'A')
>>> nx.simple_cycles(G)
[['A', 'B', 'C', 'A']]

Is there an algorithm to decide if a symlink loops?

Example

Using the symlinks command

Example

The glibc library

Tags:

Symlink

Algorithms

Related

Recent Posts