How to make a picture grid of all Nobel Laureates in physics?
Here's something to get you started down to path of scraping the somewhat larger individual pictures from the Nobel website:
links = Import[
"https://www.nobelprize.org/nobel_prizes/physics/laureates/index.html?images=yes", "Hyperlinks"];
individualpagelinks =
Select[
links,
StringMatchQ[
"https://www.nobelprize.org/nobel_prizes/physics/laureates/" ~~ NumberString ~~ "/" ~~ name__ ~~ "-facts.html"]
];
postcardpictures =
StringCases[
individualpagelinks,
"https://www.nobelprize.org/nobel_prizes/physics/laureates/" ~~ year : NumberString ~~ "/" ~~ name__ ~~ "-facts.html"
:>
"https://www.nobelprize.org/nobel_prizes/physics/laureates/" <> year <> "/" <> name <> "_postcard.jpg"
] // Flatten // DeleteDuplicates;
Import /@ postcardpictures[[1 ;; 5]]
I found it easier to extract the rationale for the prizes from the Wikipedia table of Nobel Prize winners in physics:
wikidata = Import[
"https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Physics",
"Data"
];
Cases[
wikidata,
{year_, name_, _, rationale_}
:>
{year, StringDelete[rationale, {Whitespace ~~ "[" ~~ NumberString ~~ "]", "\""}]},
Infinity
][[2 ;; -2]]
(* Out:
{
{1901, "in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him"},
{1902, "in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena"},
...
}
*)
Some manual cleanup will be necessary here: the somewhat naive method I proposed is confused by nested tables...
Using jSoupLink:
<< jSoupLink`
ParseHTML[
"https://www.nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-facts.html",
".laureate_info_wrapper p",
"text"
] // TableForm
It is possible to be more precise:
ParseHTML[
"https://www.nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-facts.html",
"span[itemprop=birthDate]",
"text"
]
{"14 March 1879"}
I don't intend to explain how I figure out the CSS rules but there are a lot of things you can do quite easily with this jSoupLink if you know how. You could write a script that starts from this directory of Nobel Prizes and recursively collect data from all laureates, for example.
Thanks to MarcoB and C. E. From them I learned how to deal with HTML contents using Mathematica.
I now summerize my final approach below (it is a bit long, so I make it an answer).
In this approach, I use information all from www.nobelprize.org and mathematica features that are all built-in.
individualpagelinks
is a list of all Nobel Laureates information page hyperlinks(I learned from MarcoB)
links = DeleteDuplicates@
Import["https://www.nobelprize.org/nobel_prizes/physics/laureates/\
index.html?images=yes", "Hyperlinks"];
individualpagelinks =
Select[links,
StringMatchQ[
"https://www.nobelprize.org/nobel_prizes/physics/laureates/" ~~
NumberString ~~ "/" ~~ __ ~~ "-facts.html"]];
To fetch essential information on "...-facts.html". The key is to import with option "XMLObject"
. like this
Import[individualpagelinks[[1]], "XMLObject"]
To know which expression contains the information you want, just Ctrl+F
and search, for example search "birth" in the output cell, and you can find XMLElement["span", {"itemprop" -> "birthDate"}, {"9 March 1959"}]
contains the information
Then use Cases
to get all information you need.
In getData
, The order of information: image, year, given name, family name, birth date, birth place, death date, death place, prize motivation, fields(if it exists)
Clear[getData];
getData[link_] := Module[{data},
data = Import[link, "XMLObject"];
{Import[StringReplace[link, "-facts.html" -> "_postcard.jpg"]],
StringTrim /@ {StringCases[link, NumberString][[1]],
Cases[data,
XMLElement["span", {"itemprop" -> "givenName"}, {x_}] -> x,
Infinity][[1]],
Cases[data,
XMLElement["span", {"itemprop" -> "familyName"}, {x_}] -> x,
Infinity][[1]],
Cases[data, {XMLElement["strong", {}, {"Born:"}], ___},
Infinity][[1, 3, -1, 1]],
StringSplit[
Cases[data, {XMLElement["strong", {}, {"Born:"}], ___},
Infinity][[1, -1]], ","],
Sequence@
If[tmp =
Cases[data, {XMLElement["strong", {}, {"Died:"}], ___},
Infinity]; tmp =!= {},
{tmp[[1, 3, -1, 1]], StringSplit[tmp[[1, -1]], ","]}, "live"],
Cases[
data, {XMLElement["strong", {}, {"Prize motivation:"}], x_} ->
x, Infinity][[1]],
If[tmp =
Cases[data, {XMLElement["strong", {}, {"Field:"}], x_} -> x,
Infinity]; tmp =!= {}, tmp[[1]], Nothing]}}]
labeledPicture
label image with year, name, country of birth
labeledPicture[dataEntry_] := Labeled[dataEntry[[1]],
Column[{dataEntry[[2, 1]],
dataEntry[[2, 2]] <> " " <> dataEntry[[2, 3]],
" (" <> dataEntry[[2, 5, -1]] <> ")"}, "Center"]]
Here is an example with recent 10 Nobel Laureates in physics
data = ParallelMap[getData, individualpagelinks[[1 ;; 10]]];
Grid[Partition[labeledPicture /@ data, 5, 5, {1, 1}, {}]]
This will give