Detect EOL type using PHP

The here already given answers provide the user of enough information. The following code (based on the already given anwers) might help even more:

It provides a reference of the found EOL

The detection sets also a key which can be used by an application to this reference.

It shows how to use the reference in a utility class.

Shows how to use it for detection of a file returning the key name of the found EOL.

I hope this is of usage to all of you.

/**
Newline characters in different Operating Systems
The names given to the different sequences are:
============================================================================================
NewL  Chars       Name     Description
----- ----------- -------- ------------------------------------------------------------------
LF    0x0A        UNIX     Apple OSX, UNIX, Linux
CR    0x0D        TRS80    Commodore, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, etc
LFCR  0x0A 0x0D   ACORN    Acorn BBC and RISC OS spooled text output.
CRLF  0x0D 0x0A   WINDOWS  Microsoft Windows, DEC TOPS-10, RT-11 and most other early non-Unix
                          and non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC DOS, etc.), OS/2,
----- ----------- -------- ------------------------------------------------------------------
*/
const EOL_UNIX    = 'lf';        // Code: \n
const EOL_TRS80   = 'cr';        // Code: \r
const EOL_ACORN   = 'lfcr';      // Code: \n \r
const EOL_WINDOWS = 'crlf';      // Code: \r \n

then use the following code in a static class Utility to detect

/**
Detects the end-of-line character of a string.
@param string $str      The string to check.
@param string $key      [io] Name of the detected eol key.
@return string The detected EOL, or default one.
*/
public static function detectEOL($str, &$key) {
   static $eols = array(
     Util::EOL_ACORN   => "\n\r",  // 0x0A - 0x0D - acorn BBC
     Util::EOL_WINDOWS => "\r\n",  // 0x0D - 0x0A - Windows, DOS OS/2
     Util::EOL_UNIX    => "\n",    // 0x0A -      - Unix, OSX
     Util::EOL_TRS80   => "\r",    // 0x0D -      - Apple ][, TRS80
  );

  $key = "";
  $curCount = 0;
  $curEol = '';
  foreach($eols as $k => $eol) {
     if( ($count = substr_count($str, $eol)) > $curCount) {
        $curCount = $count;
        $curEol = $eol;
        $key = $k;
     }
  }
  return $curEol;
}  // detectEOL

and then for a file:

/**
Detects the EOL of an file by checking the first line.
@param string  $fileName    File to be tested (full pathname).
@return boolean false | Used key = enum('cr', 'lf', crlf').
@uses detectEOL
*/
public static function detectFileEOL($fileName) {
   if (!file_exists($fileName)) {
     return false;
   }

   // Gets the line length
   $handle = @fopen($fileName, "r");
   if ($handle === false) {
      return false;
   }
   $line = fgets($handle);
   $key = "";
   <Your-Class-Name>::detectEOL($line, $key);

   return $key;
}  // detectFileEOL

Change the Your-Class-Name into your name for the implementation Class (all static members).

My answer, because I could make neither ohaal's one or transilvlad's one work, is:

function detect_newline_type($content) {
    $arr = array_count_values(
               explode(
                   ' ',
                   preg_replace(
                       '/[^\r\n]*(\r\n|\n|\r)/',
                       '\1 ',
                       $content
                   )
               )
           );
    arsort($arr);
    return key($arr);
}

Explanation:

The general idea in both proposed solutions is good, but implementation details hinder the usefulness of those answers.

Indeed, the point of this function is to return the kind of newline used in a file, and that newline can either be one or two character long.

This alone renders the use of str_split() incorrect. The only way to cut the tokens correctly is to use a function that cuts a string with variable lengths, based on character detection instead. That is when explode() comes into play.

But to give useful markers to explode, it is necessary to replace the right characters, in the right amount, by the right match. And most of the magic happens in the regular expression.

3 points have to be considered:

using .* as suggested by ohaal will not work. While it is true that . will not match newline characters, on a system where \r is not a newline character, or part of a newline character, . will match it incorrectly (reminder: we are detecting newlines because they could be different from the ones on our system. Otherwise there is no point).
replacing /[^\r\n]*/ with anything will "work" to make the text vanish, but will be an issue as soon as we want to have a separator (since we remove all characters but the newlines, any character that isn't a newline will be a valid separator). Hence the idea to create a match with the newline, and use a backreference to that match in the replacement.
It is possible that in the content, multiple newlines will be in a row. However we do not want to group them in that case, since they will be seen by the rest of the code as different types of newlines. That is why the list of newlines is explicitly stated in the match for the backreference.

Wouldn't it be easier to just replace everything except new lines using regex?

_{The dot matches a single character, without caring what that character is. The only exception are newline characters.}

With that in mind, we do some magic:

$string = 'some string with new lines';
$newlines = preg_replace('/.*/', '', $string);
// $newlines is now filled with new lines, we only need one
$newline = substr($newlines, 0, 1);

Not sure if we can trust regex to do all this, but I don't have anything to test with.

enter image description here

/**
 * Detects the end-of-line character of a string.
 * @param string $str The string to check.
 * @param string $default Default EOL (if not detected).
 * @return string The detected EOL, or default one.
 */
function detectEol($str, $default=''){
    static $eols = array(
        "\0x000D000A", // [UNICODE] CR+LF: CR (U+000D) followed by LF (U+000A)
        "\0x000A",     // [UNICODE] LF: Line Feed, U+000A
        "\0x000B",     // [UNICODE] VT: Vertical Tab, U+000B
        "\0x000C",     // [UNICODE] FF: Form Feed, U+000C
        "\0x000D",     // [UNICODE] CR: Carriage Return, U+000D
        "\0x0085",     // [UNICODE] NEL: Next Line, U+0085
        "\0x2028",     // [UNICODE] LS: Line Separator, U+2028
        "\0x2029",     // [UNICODE] PS: Paragraph Separator, U+2029
        "\0x0D0A",     // [ASCII] CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
        "\0x0A0D",     // [ASCII] LF+CR: BBC Acorn, RISC OS spooled text output.
        "\0x0A",       // [ASCII] LF: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
        "\0x0D",       // [ASCII] CR: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <=v9, OS-9
        "\0x1E",       // [ASCII] RS: QNX (pre-POSIX)
        //"\0x76",       // [?????] NEWLINE: ZX80, ZX81 [DEPRECATED]
        "\0x15",       // [EBCDEIC] NEL: OS/390, OS/400
    );
    $cur_cnt = 0;
    $cur_eol = $default;
    foreach($eols as $eol){
        if(($count = substr_count($str, $eol)) > $cur_cnt){
            $cur_cnt = $count;
            $cur_eol = $eol;
        }
    }
    return $cur_eol;
}

Notes:

Needs to check encoding type
~~Needs to somehow know that we may be on an exotic system like ZX8x (since ASCII x76 is a regular letter)~~ @radu raised a good point, in my case, it's not worth the effort to handle ZX8x systems nicely.
Should I split the function into two? mb_detect_eol() (multibyte) and detect_eol()

Detect EOL type using PHP

Explanation:

Tags:

Php

Newline

Platform

Related

Recent Posts