How do I get a website's title using the command line?

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'

You can pipe it to GNU recode if there are things like &lt; in it:

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
  recode html..

To remove the - youtube part:

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
 perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)(?: - youtube)?\s*<\/title/si'

To point out some of the limitations:

portability

There is no standard/portable command to do HTTP queries. A few decades ago, I would have recommended lynx -source instead here. But nowadays, wget is more portable as it can be found by default on most GNU systems (including most Linux-based desktop/laptop operating systems). Other fairly portable ones include the GET command that comes with perl's libwww, which is often installed, lynx -source, and to a lesser extent curl. Other common ones include links -source, elinks -source, w3m -dump_source, lftp -c cat...
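
For instance (just as a sketch, assuming curl or lynx happens to be installed), the same title extraction can be fed from one of those alternative fetchers instead of wget:

# Same extraction as above, fetching with curl or with lynx instead of wget.
curl -s 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'

lynx -source 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'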

HTTP protocol and redirection handling

wget may not get the same page as the one that, for instance, firefox would display, because HTTP servers may choose to send a different page based on the information provided in the request sent by the client.

The request sent by wget/w3m/GET... is going to be different from the one sent by firefox. If that's an issue, you can alter the way wget sends the request with various options, though.

The most important request headers in this regard are (a sketch of how to set them with wget follows the list):

  • Accept and Accept-Language: these tell the server which language and charset the client would like the response in. wget doesn't send any by default, so the server will typically answer with its default settings. firefox, on the other hand, is likely configured to request your language.
  • User-Agent: this identifies the client application to the server. Some sites send different content based on the client (though that's mostly for differences between javascript language interpretations) and may refuse to serve you if you're using a robot-type user agent like wget's.
  • Cookie: if you've visited this site before, your browser may have permanent cookies for it. wget will not.
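
As a sketch (the header values, user agent string and cookie file below are made-up examples, not recommendations), here is how those could be set with wget options:

# The values below are only illustrative:
wget -qO- \
  --header='Accept: text/html' \
  --header='Accept-Language: en-GB,en;q=0.9' \
  --user-agent='Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0' \
  --load-cookies=cookies.txt \
  'http://www.youtube.com/watch?v=Dd7dQh8u4Hc'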

wget will follow redirections when they are done at the HTTP protocol level, but since it doesn't look at the content of the page, it won't follow the ones done by javascript or by things like <meta http-equiv="refresh" content="0; url=http://example.com/">.
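
wget won't handle that last kind by itself. If you needed to, you could extract the target URL with another crude regexp and fetch it in a second step. A rough sketch (the page URL is a made-up placeholder and the regexp only handles the simplest form of that tag):

# Made-up example URL of a page that redirects via <meta http-equiv="refresh">:
page='http://example.com/redirecting-page.html'

# Pull the url=... value out of the meta tag, then fetch the target page
# and extract its title as before.
url=$(wget -qO- "$page" |
  perl -l -0777 -ne 'print $1 if /<meta\s+http-equiv="refresh"[^>]*url=([^">]*)/si')
wget -qO- "$url" |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'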

Performance/Efficiency

Here, out of laziness, we have perl read the whole content in memory before starting to look for the <title> tag. Given that the title is found in the <head> section that is in the first few bytes of the file, that's not optimal. A better approach, if GNU awk is available on your system, could be:

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
  gawk -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,"");print;exit}'

That way, awk stops reading after the first </title, and by exiting, causes wget to stop downloading.

Parsing of the HTML

Here, wget writes the page to its output as it downloads it. At the same time, perl slurps that output (-0777 -n) whole in memory and then prints the HTML code found between the first occurrence of <title...> and the following </title.

That will work for most HTML pages that have a <title> tag, but there are cases where it won't work.

By contrast, coffeeMug's solution will parse the HTML page as XML and return the corresponding value for title. It is more correct if the page is guaranteed to be valid XML. However, HTML is not required to be valid XML (older versions of the language were not), and because most browsers out there are lenient and will accept incorrect HTML code, there's a lot of incorrect HTML code out there.
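
coffeeMug's exact command isn't reproduced here, but a comparable XML-parser approach could be sketched with xmllint (from libxml2); as discussed, it will only print something if the page happens to be well-formed XML:

# Only works if the page parses as well-formed XML; errors are discarded.
wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
  xmllint --xpath '//title/text()' - 2>/dev/null

(xmllint also has a --html mode that uses a lenient HTML parser instead, which sidesteps that particular limitation.)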

Both my solution and coffeeMug's will fail for a variety of corner cases, sometimes the same, sometimes not.

For instance, mine will fail on:

<html><head foo="<title>"><title>blah</title></head></html>

or:

<!-- <title>old</title> --><title>new</title>

While his will fail on:

<TITLE>foo</TITLE>

(valid html, not xml) or:

<title>...</title>
...
<script>a='<title>'; b='</title>';</script>

(again, valid html, missing <![CDATA[ parts to make it valid XML) or:

<title>foo <<<bar>>> baz</title>

(incorrect html, but still found out there and supported by most browsers)

interpretation of the code inside the tags

That solution outputs the raw text between <title> and </title>. Normally, there should not be any HTML tags in there; there may possibly be comments (though those are not handled by some browsers like firefox, so it's very unlikely). There may still be some HTML encoding:

$ wget -qO- 'http://www.youtube.com/watch?v=CJDhmlMQT60' |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'
Wallace &amp; Gromit - The Cheesesnatcher Part 1 (claymation) - YouTube

Which is taken care of by GNU recode:

$ wget -qO- 'http://www.youtube.com/watch?v=CJDhmlMQT60' |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
   recode html..
Wallace & Gromit - The Cheesesnatcher Part 1 (claymation) - YouTube

But a web client is also meant to do more transformations on that code when displaying the title (like condensing some of the blanks and removing the leading and trailing ones). However, it's unlikely that there'd be a need for that. So, as in the other cases, it's up to you to decide whether it's worth the effort.
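
If you did want to condense the blanks, a small variation on the same perl one-liner would do it (a sketch, squeezing runs of whitespace into single spaces):

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
  perl -l -0777 -ne 'if (/<title.*?>\s*(.*?)\s*<\/title/si) {($t = $1) =~ s/\s+/ /g; print $t}' |
  recode html..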

Character set

Before UTF-8, iso8859-1 used to be the preferred charset on the web for non-ASCII characters, though strictly speaking they had to be written as &eacute;. More recent versions of HTTP and of the HTML language added the possibility to specify the character set in the HTTP headers or in the HTML headers, and a client can specify the charsets it accepts. UTF-8 tends to be the default charset nowadays.

So, that means that out there, you'll find é written as &eacute;, as &#233;, as UTF-8 é (0xc3 0xa9), or as iso-8859-1 é (0xe9), with, for the last two, the information on the charset sometimes given in the HTTP headers or the HTML headers (in different formats), and sometimes not.

wget only gets the raw bytes; it doesn't care about their meaning as characters, and it doesn't tell the web server about the preferred charset.

recode html.. will take care to convert the &eacute; or &#233; into the proper sequence of bytes for the character set used on your system, but for the rest, that's trickier.
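
For instance, if you knew a page was served as iso-8859-1 and your locale was UTF-8, you could convert the bytes explicitly with iconv (a sketch; the URL is a made-up placeholder and the source charset is assumed, not detected):

# Made-up example URL; iso-8859-1 is assumed here, not detected.
wget -qO- 'http://example.com/latin1-page.html' |
  iconv -f iso-8859-1 -t utf-8 |
  perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
  recode html..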

If your system charset is utf-8, chances are it's going to be alright most of the time as that tends to be the default charset used out there nowadays.

$ wget -qO- 'http://www.youtube.com/watch?v=if82MGPJEEQ' |
 perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'
Noir Désir - L&#39;appartement - YouTube

That é above was a UTF-8 é.

But if you want to cover for other charsets, once again, it would have to be taken care of.

It should also be noted that this solution won't work at all for UTF-16 or UTF-32 encoded pages.

To sum up

Ideally, what you need here is a real web browser to give you the information. That is, you need something that does the HTTP request with the proper parameters, interprets the HTTP response correctly, fully interprets the HTML code as a browser would, and returns the title.

As I don't think that can be done on the command line with the browsers I know (though see now this trick with lynx), you have to resort to heuristics and approximations, and the one above is as good as any.

You may also want to take into consideration performance, security... For instance, to cover all the cases (e.g. a web page that pulls some javascript from a 3rd-party site that sets the title or redirects to another page in an onload hook), you may have to implement a real-life browser with its dom and javascript engines, which may have to do hundreds of queries for a single HTML page, some of which may be trying to exploit vulnerabilities...

While using regexps to parse HTML is often frowned upon, here is a typical case where it's good enough for the task (IMO).


You can also try hxselect (from HTML-XML-Utils) with wget as follows:

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' | hxselect -s '\n' -c  'title' 2>/dev/null

You can install hxselect on Debian-based distros using:

sudo apt-get install html-xml-utils

The STDERR redirection is there to avoid the "Input is not well-formed. (Maybe try normalize?)" message.

In order to get rid of "- YouTube", pipe the output of the above command to awk '{print substr($0, 1, length($0)-10)}', which drops the 10-character " - YouTube" suffix.
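
Putting it together (the awk part assumes the title really does end in " - YouTube"):

wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
  hxselect -s '\n' -c 'title' 2>/dev/null |
  awk '{print substr($0, 1, length($0)-10)}'   # drop the " - YouTube" suffix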


You can also use curl and grep to do this. You'll need to enlist the use of PCRE (Perl Compatible Regular Expressions) in grep to get the look-behind and look-ahead facilities so that we can find the <title>...</title> tags.

Example

$ curl 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' -so - | \
    grep -iPo '(?<=<title>)(.*)(?=</title>)'
Why Are Bad Words Bad? - YouTube

Details

The curl switches:

  • -s = silent
  • -o - = send output to STDOUT

The grep switches:

  • -i = case insensitivity
  • -o = Return only the portion that matches
  • -P = PCRE mode

The pattern to grep:

  • (?<=<title>) = the match must be immediately preceded by <title> (look-behind)
  • (?=</title>) = the match must be immediately followed by </title> (look-ahead)
  • (.*) = everything in between <title>..</title>.

More complex situations

If <title>...</title> spans multiple lines, then the above won't find it. You can mitigate this situation by using tr to delete any \n characters, i.e. tr -d '\n'.

Example

Sample file.

$ cat multi-line.html 
<html>
<title>
this is a \n title
</TITLE>
<body>
<p>this is a \n title</p>
</body>
</html>

And a sample run:

$ curl 'http://www.jake8us.org/~sam/multi-line.html' -so - | \
     tr -d '\n' | \
     grep -iPo '(?<=<title>)(.*)(?=</title>)'
this is a \n title

lang=...

If the <title> is set like this, <title lang="en">, then you'll need to remove it prior to grepping. The tool sed can be used to do this:

$ curl 'http://www.jake8us.org/~sam/multi-line.html' -so - | \
     tr -d '\n' | \
     sed 's/ lang="\w\+"//gi' | \
     grep -iPo '(?<=<title>)(.*)(?=</title>)'
this is a \n title

The above finds the case-insensitive string lang= followed by a sequence of word characters (\w\+). It is then stripped out.

A real HTML/XML Parser - using Ruby

At some point regex will fail in solving this type of problem. If that occurs then you'll likely want to use a real HTML/XML parser. One such parser is Nokogiri. It's available in Ruby as a Gem and can be used like so:

$ curl 'http://www.jake8us.org/~sam/multi-line.html' -so - | \
    ruby -rnokogiri -e \
     'puts Nokogiri::HTML(readlines.join).xpath("//title").map { |e| e.content }'

this is a \n title

The above parses the data that comes via curl as HTML (Nokogiri::HTML). The method xpath then looks for title nodes (tags) anywhere in the document (that's what // means). For each one found, we return its content (e.content). The puts then prints them out.

A real HTML/XML Parser - using Perl

You can also do something similar with Perl and the HTML::TreeBuilder::XPath module.

$ cat title_getter.pl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Fetch the URL given as the first argument and build a tree from it.
my $tree = HTML::TreeBuilder::XPath->new_from_url($ARGV[0]);

# Extract the text of the <title> node and strip leading whitespace.
(my $title = $tree->findvalue('//title')) =~ s/^\s+//;
print $title . "\n";

You can then run this script like so:

$ ./title_getter.pl http://www.jake8us.org/~sam/multi-line.html
this is a \n title