Fetching data from HTML source

I recommend that you import as an XMLObject, which represents structured XML data in a Mathematica-based format.

info = Import[
   "http://area51.stackexchange.com/proposals/4470/martial-arts", 
   "XMLObject"];

You can access the parts of the XML using Mathematica patterns, like so:

labels = Cases[info, XMLElement[
  "div", {"class" -> "site-health-label"}, label_] :> 
  First[label], Infinity];
values = Cases[info, XMLElement[
  "div", {"class" -> "site-health-value"}, value_] :> 
   First[value], Infinity];
Grid[{labels, values}, Dividers -> All]
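
If you want to see how the pattern behaves without hitting the network, here is a minimal sketch on a hand-written fragment. The sample element below is made up for illustration and only imitates the page's markup. It also uses a looser attribute pattern, {___, "class" -> ..., ___}, on the assumption that a real div may carry attributes besides class, which an exact attribute list would not match.

(* hypothetical fragment, not copied from the real page *)
sample = XMLElement["div", {"class" -> "site-health-detail"}, {
    XMLElement["div", {"class" -> "site-health-value"}, {"97%"}],
    XMLElement["div", {"class" -> "site-health-label"}, {"answered"}]}];

(* pick the first child of every div whose attributes include the class *)
Cases[sample, 
 XMLElement["div", {___, "class" -> "site-health-value", ___}, {v_, ___}] :> v, 
 Infinity]
(* {"97%"} *)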



Alright, here are the regular expressions needed to solve this. For further information on how these things work, I would check out

  • the regular expression documentation on the Wolfram docs site.
  • As for how regular expressions themselves work, check out the Perl regular expression man page. This is where I learned how to write regexes back when I was in 8th grade or so ;).

Scroll down to the end of this answer for an explanation of the various parts of this regular expression.

Code

siteHealthRegexString[str_] := 
  "(?s)div class=\\\"site-health-value\\\">([.,0-9%]+)</div>\\s+<div \
class=\\\"site-health-label\\\">" <> str <> 
   ".*?<span class=\\\"site-health-status-([a-z]+)\\\"";

getSiteHealthValue[str_, siteText_] :=
 StringCases[siteText,
  RegularExpression[siteHealthRegexString[str]] -> {"$1", "$2"}]

getSiteHealth[url_String] :=
 With[{site = Import[url, "Source"]},
  Map[(getSiteHealthValue[#, site][[1]] &), {"questions", "answered", 
    "avid users", "total users", "answer ratio", "visits"}]]

Then look up the statistics for a given site by calling

getSiteHealth[
  "http://area51.stackexchange.com/proposals/4470/martial-arts"]

Which results in:

  {{"1.4", "worrying"}, {"97%", "excellent"}, {"36", "worrying"}, 
   {"344", "worrying"}, {"3.5", "excellent"}, {"82", "worrying"}}

If you would like to have the numerical results separate from the English results, just transpose the result:

Transpose@getSiteHealth[
 "http://area51.stackexchange.com/proposals/4470/martial-arts"]

Which results in:

 {{"1.4", "97%", "36", "344", "3.5", "82"}, 
  {"worrying", "excellent", "worrying", "worrying", "excellent", "worrying"}}
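
Note that the "numerical" results are still strings at this point. If you need actual numbers, one possible approach (a sketch, not part of the functions above) is to strip the percent signs and thousands separators and then convert. The vals list here is just the first row from the output above, typed in by hand.

vals = {"1.4", "97%", "36", "344", "3.5", "82"};

(* drop "%" and "," so that every string parses as a plain number *)
ToExpression[StringReplace[vals, {"%" -> "", "," -> ""}]]
(* {1.4, 97, 36, 344, 3.5, 82} *)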

Regex Explanation

Basically the interesting elements of my regex up there are:

  1. \\\" - inside a Mathematica string this escapes a single double quote character, so the regex matches a literal ".
  2. [.,0-9%] - a regex character class matching any one of the characters ., ',', 0, 1, ..., 9, %
  3. (...) - indicates that the pattern inside the parens is to be returned as a "match". The first set of parens goes to "$1", the second goes to "$2" and so forth.
  4. \\s - matches any whitespace character, including newlines, spaces and so forth.
  5. + - indicates to match 1 or more of the preceding element. You should note my use of this instruction twice in the regex. The first was to indicate that I wished to match more than one number/percent symbol/dot/comma at a time. The second use is to indicate that I wish to match one or more whitespace characters.
  6. . is a wildcard indicating any character.
  7. * instructs the parser to look for 0 or more of whatever pattern or group came immediately before it.
  8. ? placed after a quantifier such as * or + tells the regular expression engine to match as little as possible, stopping as soon as the rest of the pattern can match. (This is the "non-greedy" or "lazy" modifier.) Without it, the quantifier is greedy and the engine will continue until the very last match, "eating" up everything in between.
  9. .*? thus means to look for zero or more instances of any character, including newlines, and stop matching as soon as it finds a suitable match for the remainder of the pattern. Without this all the entries would read "worrying" as they would all match to the very last item (corresponding to "visits/day") in the given site box.
  10. (?s) is used to indicate that the . wildcard is to include newlines.
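
The difference between the greedy and lazy forms in items 7 to 9 is easiest to see on a toy string. This is just a sketch; the text below is invented and has nothing to do with the Area 51 markup.

text = "<b>one</b> and <b>two</b>";

(* greedy: .* runs to the last </b> it can find *)
StringCases[text, RegularExpression["<b>(.*)</b>"] -> "$1"]
(* {"one</b> and <b>two"} *)

(* lazy: .*? stops at the first </b> *)
StringCases[text, RegularExpression["<b>(.*?)</b>"] -> "$1"]
(* {"one", "two"} *)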

Side Question

I'm not directly familiar with Mathematica's symbolic regular expressions, so I asked about the side question in the MMA chat and was told by @acl that the implementation details listing states that the two are equivalent to the kernel. As such, I vastly prefer the Perl-compatible regular expressions over the symbolic ones, as nearly every programming language in existence understands them. That said, there may be compelling reasons to use the symbolic form; I'm just unaware of them.
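
For comparison, here is a minimal sketch of the same match written both ways. The input string is made up; DigitCharacter and the .. (Repeated) operator are the built-in symbolic counterparts of [0-9] and +.

s = "visits: 82 per day";

(* Perl-style regular expression *)
StringCases[s, RegularExpression["[0-9]+"]]
(* {"82"} *)

(* symbolic string pattern *)
StringCases[s, DigitCharacter ..]
(* {"82"} *)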


After much fiddling... This function gathers the "vital signs" of every proposal, and returns everything in a list.

proposalURL = "http://area51.stackexchange.com/?tab=beta&page=";

getProposalData[main_] := 
 Module[{b, srclist, pnumber, imp1, url, imp2, proplist, proplinks, 
   fullproplinks, propxml, health, propname, fulldata},
  srclist = {};
  pnumber = 1;
  imp1[x_] := Import[srclist[[x]], "Source"];
  b = 1;

  (*How many proposal pages?*)

  AppendTo[srclist, main <> ToString[b] <> "&pagesize=50"];
  For[b = 1, 
   Length[StringCases[imp1[pnumber], 
      RegularExpression["No\\s+proposals\\s+in\\s+beta"]]] == 0, 
   pnumber++, 
   AppendTo[srclist, main <> ToString[pnumber] <> "&pagesize=50"]];

  (*Gather proposal location from XML data*)

  url = DeleteDuplicates[srclist];
  imp2[x_] := Import[url[[x]], "XMLObject"];

  proplist = 
   Flatten[Table[
     Cases[imp2[i], 
      XMLElement["div", {"class" -> "a51-summary"}, ___], 
      Infinity], {i, 1, Length[url]}], 1];

  proplinks = 
   Table[proplist[[i]][[3]][[2]][[3]][[1]][[2]][[3]][[2]], {i, 1, 
     Length[proplist]}];

  (*Generate well-formed links and collect numerical values*)

  fullproplinks[x_] := 
   Import["http://area51.stackexchange.com" <> proplinks[[x]], 
    "XMLObject"];
  propxml = 
   Flatten[Table[
     Cases[fullproplinks[i], 
      XMLElement["div", {"class" -> "site-health-detail"}, ___], 
      Infinity], {i, 1, Length[proplinks]}], 1];
  health = 
   ToExpression[
    StringReplace[
     Flatten[Table[
       propxml[[i]][[3]][[2]][[3]][[2]][[3]], {i, 1, 
        5 Length[proplinks]}], 1], {"%" -> "", "," -> ""}]];

  (*Gather proposal names*)

  propname = 
   Flatten[Table[
     StringCases[proplinks[[i]], RegularExpression["[^/]+$"]], {i, 1, 
      Length[proplinks]}]];

  (*Riffle lists, tadda!*)

  fulldata = 
   Partition[
    Flatten[Riffle[health, propname, {1, 6 Length[proplinks], 6}]], 
    6]
  ]
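
The last two steps (name extraction and the final Riffle/Partition) can be tried out on small inputs. This is only a sketch with invented numbers and a hypothetical second proposal name, and it assumes five health values per proposal, as in the function above. The MapThread form is an alternative I would expect to give rows of the same shape as the Riffle version: the name first, then its five values.

(* last path segment of a proposal link *)
StringCases["/proposals/4470/martial-arts", RegularExpression["[^/]+$"]]
(* {"martial-arts"} *)

(* made-up data: two proposals, five values each *)
health = {1.4, 97, 36, 3.5, 82, 2.1, 90, 50, 2.8, 120};
propname = {"martial-arts", "example-site"};

(* split into groups of five and put each name in front of its group *)
MapThread[Prepend, {Partition[health, 5], propname}]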

To go from beta proposals to launched proposals, one only needs to change proposalURL to http://area51.stackexchange.com/?tab=launched&page= and change the first regex so that it matches "No launched proposals".

The function is a hybrid of the methods offered to me in the other answers. I don't quite know how I could have done it (as concisely) other than by using both XML and regex. Thank you!

Charts

I've made some charts, but they're not very elegant. Have fun making the data talk.

As of March 24, 2012 (chart images not reproduced here), there is one chart each for LAUNCHED and BETA proposals showing: questions per day, percentage of answered questions, avid users, answer ratio, and visits per day.