Reading in HTML/XML and PDF file formats into R
A possible solution using pdfx
# download file to your home dir
download.file("https://mchb.hrsa.gov/whusa11/hstat/hsrmh/downloads/pdf/233ml.pdf","233ml.pdf")
# get packages
library(remotes)
remotes::install_github("sckott/extractr")
library(extractr)
# parse the PDF
pdfx(file = "233ml.pdf", what = "parsed")
Your xml_document ht includes 1x body and 13x html nodes; you can use html_node() or html_nodes() from rvest to extract the pieces you need.
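The difference between html_node() and html_nodes() is easiest to see on a small inline document. A self-contained sketch (the HTML string here is made up for illustration, not the real PDF output):

```r
library(rvest)  # re-exports read_html from xml2

# a tiny stand-in document (not the real parsed PDF)
doc <- read_html("<html><body><p>first</p><p>second</p></body></html>")

html_node(doc, "p")   # the first matching node only
html_nodes(doc, "p")  # all matching nodes (a nodeset of length 2)
```

html_node() always returns a single node (the first match), while html_nodes() returns a nodeset you can iterate over.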
library(xml2)
library(XML)
library(rvest)
library(dplyr)
html_string <- "https://mchb.hrsa.gov/whusa11/hstat/hsrmh/downloads/pdf/233ml.pdf"
ht <- read_html(html_string)
ht %>% html_nodes("html") # look at all html nodes
ht %>% html_node("body") # look at body node
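As an aside, if you want the text content with the tags stripped rather than the raw markup, rvest's html_text() does that. A self-contained sketch on a made-up snippet:

```r
library(rvest)

# stand-in snippet, not the real parsed PDF
doc <- read_html("<html><body><p>Hello</p><p>world</p></body></html>")

# text of the body with all tags stripped
body_text <- html_text(html_node(doc, "body"))
```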
According to your question, it looks like you would like to have the body node as text, right?
You can get it with:
text <- ht %>% html_node("body") %>% as.character() # get body node as text
text
[1] "<body><p>%PDF-1.6\r%\xe2ãÏÓ\r\n83 0 obj\r<&g...
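Note the %PDF-1.6 at the start of that output: every raw PDF file begins with the magic marker %PDF-, so seeing it inside the body means read_html() pulled in the binary PDF stream itself rather than converted HTML, which is why the pdfx conversion step above is useful. A self-contained check (the string below is a short stand-in for the real output, not the full data):

```r
# stand-in for the body text extracted above
text <- "%PDF-1.6\r% ... 83 0 obj ..."

# raw PDF data always begins with the "%PDF-" magic marker
is_raw_pdf <- startsWith(text, "%PDF-")  # TRUE here
```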