How do I get a website's title using the command line?
wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'
You can pipe it to GNU recode if there are things like &lt; in it:
wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
recode html..
To remove the "- youtube" part:
wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)(?: - youtube)?\s*<\/title/si'
To point out some of the limitations:
Portability
There is no standard/portable command to do HTTP queries. A few decades ago, I would have recommended lynx -source
instead here. But nowadays, wget
is more portable as it can be found by default on most GNU systems (including most Linux-based desktop/laptop operating systems). Other fairly portable ones include the GET
command that comes with perl
's libwww that is often installed, lynx -source
, and to a lesser extent curl
. Other common ones include links -source
, elinks -source
, w3m -dump_source
, lftp -c cat
...
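For illustration, here is roughly what the same title extraction could look like with two of those other clients. This is only a sketch; the exact options and behaviour may differ between versions of each tool:
# curl: -s silences the progress meter, -L follows HTTP-level redirections
curl -sL 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'
# lynx: -source dumps the raw HTML instead of rendering the page
lynx -source 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'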
HTTP protocol and redirection handling
wget
may not get the same page as the one that for instance firefox
would display. The reason is that HTTP servers may choose to send a different page based on the information provided in the request sent by the client.
The request sent by wget/w3m/GET... is going to be different from the one sent by firefox. If that's an issue, you can alter wget's behaviour with options to change the way it sends the request (see the sketch after the list below). The most important ones in this regard are:
Accept and Accept-Language: these tell the server in which language and charset the client would like to get the response. wget doesn't send any by default, so the server will typically send the response with its default settings. firefox, on the other hand, is likely configured to request your language.
User-Agent: this identifies the client application to the server. Some sites send different content based on the client (though that's mostly for differences between javascript language interpretations) and may refuse to serve you if you're using a robot-type user agent like wget.
Cookie: if you've visited this site before, your browser may have permanent cookies for it; wget will not.
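As a sketch of what altering those request headers can look like, wget accepts --header, --user-agent and --load-cookies options. The Accept-Language value, the user-agent string and the cookies.txt file below are only placeholders for your own settings:
# cookies.txt is a hypothetical Netscape-format cookie file exported from your browser
wget -qO- \
  --header='Accept-Language: en-GB,en;q=0.8' \
  --user-agent='Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0' \
  --load-cookies=cookies.txt \
  'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'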
wget will follow redirections when they are done at the HTTP protocol level, but since it doesn't look at the content of the page, it won't follow the ones done by javascript or by things like <meta http-equiv="refresh" content="0; url=http://example.com/">.
Performance/Efficiency
Here, out of laziness, we have perl
read the whole content in memory before starting to look for the <title>
tag. Given that the title is found in the <head>
section that is in the first few bytes of the file, that's not optimal. A better approach, if GNU awk
is available on your system, could be:
wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
gawk -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,"");print;exit}'
That way, awk stops reading after the first </title
, and by exiting, causes wget
to stop downloading.
Parsing of the HTML
Here, wget
writes the page as it downloads it. At the same time, perl
slurps its output (-0777 -n
) whole in memory and then prints the HTML code that is found between the first occurrences of <title...>
and </title
.
That will work for most HTML pages that have a <title>
tag, but there are cases where it won't work.
By contrast, coffeeMug's solution will parse the HTML page as XML and return the corresponding value for title. It is more correct if the page is guaranteed to be valid XML. However, HTML is not required to be valid XML (older versions of the language were not), and because most browsers out there are lenient and accept incorrect HTML code, there's a lot of incorrect HTML code out there.
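coffeeMug's exact command is not reproduced here, but as a rough sketch of the parser-based approach it represents, xmllint (from libxml2, assumed to be installed) can evaluate an XPath expression against the page. With --html it parses the page leniently as HTML; without it, the page would have to be well-formed XML:
wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
xmllint --html --xpath '//title/text()' - 2>/dev/null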
Both my solution and coffeeMug's will fail for a variety of corner cases, sometimes the same, sometimes not.
For instance, mine will fail on:
<html><head foo="<title>"><title>blah</title></head></html>
or:
<!-- <title>old</title> --><title>new</title>
While his will fail on:
<TITLE>foo</TITLE>
(valid html, not xml) or:
<title>...</title>
...
<script>a='<title>'; b='</title>';</script>
(again, valid html
, missing <![CDATA[
parts to make it valid XML).
<title>foo <<<bar>>> baz</title>
(incorrect html, but still found out there and supported by most browsers)
Interpretation of the code inside the tags
That solution outputs the raw text between <title>
and </title>
. Normally, there should not be any HTML tags in there; there may possibly be comments (though they are not handled by some browsers like firefox, so that's very unlikely). There may still be some HTML encoding:
$ wget -qO- 'http://www.youtube.com/watch?v=CJDhmlMQT60' |
perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'
Wallace &amp; Gromit - The Cheesesnatcher Part 1 (claymation) - YouTube
Which is taken care of by GNU recode
:
$ wget -qO- 'http://www.youtube.com/watch?v=CJDhmlMQT60' |
perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
recode html..
Wallace & Gromit - The Cheesesnatcher Part 1 (claymation) - YouTube
But a web client is also meant to do more transformations on that code when displaying the title (like condensing some of the blanks and removing the leading and trailing ones). However, it's unlikely that there'd be a need for that here. So, as in the other cases, it's up to you to decide whether it's worth the effort.
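If you did want to mimic that, a minimal sketch is to squeeze runs of whitespace with one more filter at the end of the pipeline (the leading and trailing blanks are already stripped by the \s* in the perl pattern); the sed invocation here is my own addition, not part of the original command:
wget -qO- 'http://www.youtube.com/watch?v=CJDhmlMQT60' |
perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
recode html.. |
sed 's/[[:space:]]\{1,\}/ /g'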
Character set
Before UTF-8, iso8859-1 used to be the preferred charset on the web for non-ASCII characters, though strictly speaking they had to be written as &eacute;
. More recent versions of HTTP and the HTML language have added the possibility to specify the character set in the HTTP headers or in the HTML headers, and a client can specify the charsets it accepts. UTF-8 tends to be the default charset nowadays.
So, that means that out there, you'll find é written as &eacute;, as &#233;, as UTF-8 é (0xc3 0xa9), or as iso-8859-1 é (0xe9), with, for the last two, sometimes the information on the charset in the HTTP headers or the HTML headers (in different formats), sometimes not.
wget
only gets the raw bytes, it doesn't care about their meaning as characters, and it doesn't tell the web server about the preferred charset.
recode html..
will take care to convert the &eacute; or &#233; into the proper sequence of bytes for the character set used on your system, but for the rest, that's trickier.
If your system charset is utf-8, chances are it's going to be alright most of the time as that tends to be the default charset used out there nowadays.
$ wget -qO- 'http://www.youtube.com/watch?v=if82MGPJEEQ' |
perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'
Noir Désir - L'appartement - YouTube
That é
above was a UTF-8 é
.
But if you want to cover for other charsets, once again, it would have to be taken care of.
It should also be noted that this solution won't work at all for UTF-16 or UTF-32 encoded pages.
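As a rough sketch of what covering another charset could look like, when you know (or have guessed from the HTTP or HTML headers) that a page is in a single-byte charset such as iso-8859-1 and your locale uses UTF-8, you can convert the bytes with iconv before extracting the title. The URL below is only a placeholder:
# hypothetical iso-8859-1 page; adjust -f to whatever charset the server declares
wget -qO- 'http://example.com/some-latin1-page.html' |
iconv -f iso-8859-1 -t utf-8 |
perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
recode html..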
To sum up
Ideally, what you need here is a real web browser to give you the information. That is, you need something to do the HTTP request with the proper parameters, interpret the HTTP response correctly, fully interpret the HTML code as a browser would, and return the title.
As I don't think that can be done on the command line with the browsers I know (though see now this trick with lynx
), you have to resort to heuristics and approximations, and the one above is as good as any.
You may also want to take performance and security into consideration... For instance, to cover all the cases (for example, a web page that has some javascript pulled from a 3rd-party site that sets the title or redirects to another page in an onload hook), you may have to implement a real-life browser with its DOM and javascript engines, which may have to do hundreds of queries for a single HTML page, some of them trying to exploit vulnerabilities...
While using regexps to parse HTML is often frowned upon, here is a typical case where it's good enough for the task (IMO).
You can also try hxselect
(from HTML-XML-Utils) with wget
as follows:
wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' | hxselect -s '\n' -c 'title' 2>/dev/null
You can install hxselect
in Debian-based distros using:
sudo apt-get install html-xml-utils
.
The STDERR redirection is there to avoid the "Input is not well-formed. (Maybe try normalize?)" message.
In order to get rid of "- YouTube", pipe the output of the above command to awk '{print substr($0, 1, length($0)-10)}'.
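Put together, and assuming the title really ends with the 10-character " - YouTube" suffix, the full pipeline might look like this sketch:
wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
hxselect -s '\n' -c 'title' 2>/dev/null |
awk '{print substr($0, 1, length($0)-10)}'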
You can also use curl
and grep
to do this. You'll need to enlist the use of PCRE (Perl Compatible Regular Expressions) in grep
to get the look behind and look ahead facilities so that we can find the <title>...</title>
tags.
Example
$ curl 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' -so - | \
grep -iPo '(?<=<title>)(.*)(?=</title>)'
Why Are Bad Words Bad? - YouTube
Details
The curl
switches:
-s = silent
-o - = send output to STDOUT
The grep
switches:
-i = case insensitivity
-o = return only the portion that matches
-P = PCRE mode
The pattern to grep
:
(?<=<title>) = look for a string that starts with this to the left of it
(?=</title>) = look for a string that ends with this to the right of it
(.*) = everything in between <title>..</title>.
More complex situations
If <title>...</title>
spans multiple lines, then the above won't find it. You can mitigate this situation by using tr
to delete any \n
characters, i.e. tr -d '\n'
.
Example
Sample file.
$ cat multi-line.html
<html>
<title>
this is a \n title
</TITLE>
<body>
<p>this is a \n title</p>
</body>
</html>
And a sample run:
$ curl 'http://www.jake8us.org/~sam/multi-line.html' -so - | \
tr -d '\n' | \
grep -iPo '(?<=<title>)(.*)(?=</title>)'
this is a \n title
lang=...
If the <title>
is set like this, <title lang="en">
then you'll need to remove this prior to grepping it. The tool sed
can be used to do this:
$ curl 'http://www.jake8us.org/~sam/multi-line.html' -so - | \
tr -d '\n' | \
sed -E 's/ lang="\w+"//gi' | \
grep -iPo '(?<=<title>)(.*)(?=</title>)'
this is a \n title
The above finds the case insensitive string lang=
followed by a word sequence (\w+
). It is then stripped out.
A real HTML/XML Parser - using Ruby
At some point regex will fail in solving this type of problem. If that occurs then you'll likely want to use a real HTML/XML parser. One such parser is Nokogiri. It's available in Ruby as a Gem and can be used like so:
$ curl 'http://www.jake8us.org/~sam/multi-line.html' -so - | \
ruby -rnokogiri -e \
'puts Nokogiri::HTML(readlines.join).xpath("//title").map { |e| e.content }'
this is a \n title
The above parses the data that comes from curl
as HTML (Nokogiri::HTML
). The method xpath
then looks for nodes (tags) anywhere in the HTML document (//) with the name title. For each one found, we want to return its content (e.content
). The puts
then prints them out.
A real HTML/XML Parser - using Perl
You can also do something similar with Perl and the HTML::TreeBuilder::XPath module.
$ cat title_getter.pl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
# Build a parse tree from the URL given as the first argument.
my $tree = HTML::TreeBuilder::XPath->new_from_url($ARGV[0]);
# Extract the <title> text and strip leading whitespace.
(my $title = $tree->findvalue('//title')) =~ s/^\s+//;
print $title . "\n";
You can then run this script like so:
$ ./title_getter.pl http://www.jake8us.org/~sam/multi-line.html
this is a \n title