Discussion: View Thread

  • 1.  R Question - reading adobe pdf files in R

    Posted 04-02-2023 16:36


    First,
    My thanks to Rick Peterson for helping arrange the debugging of a software bug that prevented me from posting to the section.
    The bug has been resolved, and with this post I'm again able to 1) post to the section members and 2) reply to posts from section members.
    This is my first post to the section since the bug was fixed.

    My question here is in two parts.
    First, do any of the section members have experience using the R library that imports the contents of a PDF file into R? These are PDFs with lots of text. I have successfully accomplished that import.
    My second question: how best to find text strings within the imported files? Oversimplified, it seems that R imports the separate PDFs as items in a list, so with 39 files I get a list with 39 items. Finding text after selecting one item (corresponding to a specific PDF) is where I get stuck.

    The main commands needed are to load the following two libraries:
    library(pdftools)
    library(stringr)

    One creates a single working directory in R, stores the PDFs in that same directory, and then uses
    files <- list.files(pattern = "PDF$")
    files
    The list.files call looks in the current working directory for the PDFs.
    My project looks at thirty-nine separate PDF files, each with about 25-50 pages, of certain personal records.
    These are personal files that I cannot share.

    After this import there is an lapply of pdf_text:
    # myadobeimport is the text extracted from the PDFs, in a list
    myadobeimport <- lapply(files, pdf_text)



    It seems that each PDF is imported as a giant character vector, and I have been working with substrings and character-string searches (regexpr, etc.).
    What seems to work so far is converting these to a matrix with
    as.matrix(myadobeimport)

    and then I search for strings by using sink() on items in the list, exporting to a text file (".txt"), and then opening and searching the text file:
    sink("myadobeimport.txt")
    myadobeimportmatrix <- as.matrix(myadobeimport[[3]][3])
    myadobeimportmatrix
    sink()

    My last and main question: is there a shorter/faster/simpler way for me to search for and extract specific text strings from each PDF file?
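    As a possible shortcut, here is a minimal base-R sketch showing how grep can search the imported list directly, without the sink()/text-file round trip. The list contents below are invented stand-ins for the real pdf_text output, since the actual files can't be shared:

```r
# Hypothetical stand-in for the result of lapply(files, pdf_text):
# one character vector per PDF, one string per page.
myadobeimport <- list(
  c("Page 1: total cost of sales 1000", "Page 2: revenue 2000"),
  c("Page 1: nothing relevant here")
)

# Which pages of which PDFs mention the phrase? (integer page indices)
hits <- lapply(myadobeimport, function(pages) grep("cost of sales", pages))

# Pull out the matching page text directly, no sink() needed
found <- unlist(lapply(myadobeimport, function(pages)
  grep("cost of sales", pages, value = TRUE)))

print(hits)
print(found)
```

    grep() returns the indices of the matching pages; with value = TRUE it returns the matching text itself, so no export to a .txt file is required.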

    thank you



    ------------------------------
    Chris Barker, Ph.D.
    2023 Chair Statistical Consulting Section
    Consultant and
    Adjunct Associate Professor of Biostatistics
    www.barkerstats.com


    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy
    ------------------------------


  • 2.  RE: R Question - reading adobe pdf files in R

    Posted 04-02-2023 17:57
    Creating and sharing a minimal working example, with some meaningless
    text (lorem ipsum) would be helpful.

    You might find some of the packages in the CRAN taskview on Natural
    Language Processing helpful, particularly the tm and tidytext packages.
    I have no personal experience with them, but might be worth a look.

    https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

    Do you want to find if one particular text string (call it the needle)
    is present in any of the pdf files (call them the haystacks)? Or are
    you looking for several different needles amongst all the haystacks?

    Without understanding more about the problem, I don't know if these will
    help, but here are a couple code snippets I keep handy:

    ## Using dplyr and stringr, the lines below will keep rows in mytibble
    ## in which variable_in_my_tibble contains ANY of the strings in the
    ## vector named vector_of_strings_of_interest

    mytibble %>%
      filter(str_detect(variable_in_my_tibble,
                        paste(vector_of_strings_of_interest, collapse = "|")))

    ## matching or searching for string(s) across several variables in a
    ## dplyr dataframe

    dd %>% filter_all(any_vars(str_detect(., pattern = "Ca")))


    This is probably a more capable approach, with credit going to Jeff
    Newmiller on the R-help mailing list:

    ## see https://stat.ethz.ch/pipermail/r-help/2015-July/430269.html
    ## in the R-help archives
    ## For context, I use this for a dataframe where each record is a death
    ## certificate, and the free-text causes of death (of which there can
    ## be several for any given death cert) are in variables c(174, 184,
    ## 186, 188, 190, 192). The object oid.words is a character vector of
    ## the words I wish to detect.

    edrs.3$textwordoid.Newmiller <- grepl(
      paste0("\\b(", paste0(oid.words, collapse = "|"), ")\\b"),
      do.call(paste, edrs.3[, c(174, 184, 186, 188, 190, 192)])
    )
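    To make the pattern concrete, here is a self-contained toy version of the same idea; the data frame, column names, and word list are invented for illustration:

```r
# Toy data frame standing in for the death-certificate example: two
# free-text columns and a vector of words to detect.
dd <- data.frame(
  cause1 = c("cardiac arrest", "opioid overdose", "pneumonia"),
  cause2 = c("hypertension", "fentanyl toxicity", "influenza")
)
oid.words <- c("opioid", "fentanyl", "heroin")

# Paste the text columns together row-wise, then flag rows containing
# any of the target words as whole words.
dd$flag <- grepl(
  paste0("\\b(", paste0(oid.words, collapse = "|"), ")\\b"),
  do.call(paste, dd[, c("cause1", "cause2")])
)
dd$flag
# [1] FALSE  TRUE FALSE
```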

    --Chris Ryan








  • 3.  RE: R Question - reading adobe pdf files in R

    Posted 04-03-2023 07:48

    First, I'll second Chris Ryan's comment on a minimal working example.  That would help.  I haven't used the pdftools package, so references to the structure resulting from the pdf_text function are not very clear to me.

    Second, I recommend using the tidytext package.  I used it about a year ago to compare word use between two sources, and it was very helpful.  @Chris Barker , I think this package will help you to complete your search pretty easily.  Given what I think I understand about your data, use the output from

      lapply(files, pdf_text)

    and create a tibble or data.frame with two columns:

    * filenames
    * text from PDF

    Given that structure, I think the following code should get you pretty close to what you need.

    # Create output data where each row is a distinct sentence from a specific PDF
       pdf_unnested <- pdf_tibble %>% unnest_tokens(output = out_text, input = text_from_pdf, token = 'sentences')

    # Filter data to find strings of interest
       filter(pdf_unnested, str_detect(out_text, "your string of interest") )

    Splitting the pdf into different sentences seems like the approach you might need, but there are many options for the token argument that might be a better fit.  See the documentation for tidytext.  

    I also found the tidytext vignettes very helpful in providing a quick start to using the package.
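    For a self-contained illustration of the two steps above: the tibble contents below are invented, and in practice the text_from_pdf column would come from lapply(files, pdf_text).

```r
library(dplyr)
library(tidytext)
library(stringr)

# Hypothetical filename/text structure as described above
pdf_tibble <- tibble(
  filenames     = c("a.pdf", "b.pdf"),
  text_from_pdf = c("Revenue rose sharply. Total cost of sales fell.",
                    "No relevant figures appear here.")
)

# One row per sentence per PDF (unnest_tokens lowercases by default)
pdf_unnested <- pdf_tibble %>%
  unnest_tokens(output = out_text, input = text_from_pdf, token = "sentences")

# Keep only sentences mentioning the string of interest
res <- filter(pdf_unnested, str_detect(out_text, "cost of sales"))
res
```

    The filenames column survives the unnesting, so each matching sentence is still tied to the PDF it came from.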

    Hope that helps.



    ------------------------------
    Gregory Erkens
    ------------------------------



  • 4.  RE: R Question - reading adobe pdf files in R

    Posted 04-03-2023 10:52

    Thank you for the replies. At the moment I'm about to get on a train.

    These are personal financial documents that may become part of a lawsuit; I am not able to adequately anonymize them, so I can't provide them here.

    In lieu of that, permit me to offer the following.
    I found public-domain financial documents in PDF format that are probably close enough, and I itemize those below with some links: corporate profit-and-loss statements, 10-Ks, and income statements from well-known corporations. I'm looking across, say, 20 or so of each particular type, trying to find the same data element in each of these documents. The different corporations present the same information in similar, but not necessarily identical, formats.

    The Apple profit and loss may be the easiest document to work with, and I need to find the same element in the profit-and-loss statement across, say, 20 different companies; there might be minor variations among the companies. I didn't track down examples from 20 companies. I tracked down some well-known ones in the pharmaceutical industry (Gilead, Pfizer, and Biogen), tech (Apple), and manufacturing (Ford).

    For example, I might want to find the total cost of sales in 2021 for Apple. That particular year and row for other companies isn't necessarily located in the same cell in each table. I would need to text-search the row label, text-search the column header, and then find the cell at the intersection of the column and row. A reasonable assumption for my particular confidential financial statements, which come from the banking and brokerage industry, is that the row labels and column headers should be about the same.
    I may need a way of comparing 20 text column headers and saying that they're about the same. I know SAS has some character-string matching functions (soundex, etc.), and there likely are character-string matching functions in R.
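    Indeed, base R ships with approximate-matching tools: agrepl() does fuzzy (edit-distance) substring matching, and adist() computes the edit distances themselves. A small sketch with invented header strings:

```r
# Invented column-header strings of the kind one might scrape from
# several companies' financial statements
headers <- c("Total cost of sales", "Total Cost of Sales:",
             "Cost of sales, total", "Gross revenue")

# Flag headers within a small edit distance of a reference label
agrepl("total cost of sales", headers, max.distance = 2, ignore.case = TRUE)
# [1]  TRUE  TRUE FALSE FALSE

# Or inspect the edit distances directly
adist("total cost of sales", headers, ignore.case = TRUE)
```

    Note that agrepl() matches approximately as a substring, so reordered wording ("Cost of sales, total") still misses at a small max.distance; adist() is useful for choosing a sensible threshold.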

    I hope this better clarifies my question

    Profit and loss statements:
    Apple P&L
    Ford
    Pfizer
    Gilead





    ------------------------------
    Chris Barker, Ph.D.
    ------------------------------



  • 5.  RE: R Question - reading adobe pdf files in R

    Posted 04-03-2023 11:02

    First, PDF is a mess: it is a format for describing pages, not a mechanism for conveying data.  The descriptions of pages in some PDF files are extremely irregular and almost impossible to parse.

    Having said that, some PDF files are tractable, especially those created by a word processor (usually).  A little experimentation suggests that when `pdf_text` reads such a file, it renders it as a character vector with one string per page (in order).  Newline characters "\n" separate the visible lines.  A simple call to `strsplit` will render each page as a vector of strings, one per line.
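    A tiny illustration of that splitting step, using a made-up page string in place of real pdf_text output:

```r
# One element of a pdf_text() result: a whole page as a single string,
# with "\n" separating the visible lines
page <- "Line one\nLine two\nLine three"

# Split into one string per line, then search line by line
lines <- strsplit(page, "\n")[[1]]
lines
# [1] "Line one"   "Line two"   "Line three"

grep("two", lines, value = TRUE)
# [1] "Line two"
```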

    Base `R` comes with a powerful set of text searching and processing tools embodied in its regular-expression functions.  See the help page for `regular expression`.  If you're unfamiliar with the syntax of regular expressions and have more than the simplest text-processing needs, then stop right here and take the time to learn it.  It's a few hours of study and practice that will pay off forever in terms of the capabilities and concepts it gives you.  If you also learn some of the underlying theory (about finite-state automata) you will appreciate how incredibly efficient this technology can be.  For more about this, visit the source code page at https://github.com/laurikari/tre

    So, the basic workflow is to read a PDF file and loop over its pages to do your processing, perhaps splitting each page into individual lines.  As a simple example, here is code to read a file and output all occurrences of (base-10) numbers, each accompanied by any word that might precede it (possibly separated by blanks and punctuation).  The code first references and reads the file, then describes the pattern, and finally matches that pattern to the file's contents using `gregexec` and prints all the matches.

    library(pdftools)

    fn.in <- "https://example-files.online-convert.com/document/pdf/example.pdf"
    S <- pdf_text(fn.in)
    r <- "(^|([[:alpha:]]+[[:blank:][:punct:]]*))((([+-]?[[:digit:]]+[.]?)[[:digit:]]*)|([.][[:digit:]]+))"
    op <- gregexec(r, S)
    lapply(regmatches(S, op), \(x) if (length(x) > 0) x[1, ])

    This processes a sample one-page PDF file available online.  Its output is

    [[1]]
    [1] "Version: 1.0"   "a 2002"         "Doe #1"         "Doe #2"         "cited 21"       "ShareAlike 3.0"
    

    With a multiple-page file you will get a list of results, one per page.  (Any pages with no matches will simply be skipped in the output list.)

    In my experience, extensive testing of any complex regular expression is always worthwhile.  This version of the pattern is a relatively untested initial attempt to illustrate the kind of search one might do to scrape a PDF file of numerical data -- so please don't rely on it implicitly!



    ------------------------------
    William Huber (Bill)
    Quantitative Decisions / Analysis and Inference
    ------------------------------



  • 6.  RE: R Question - reading adobe pdf files in R

    Posted 04-05-2023 15:20

    Thank you very much for the replies.
    I found, and report here, two not-perfect workarounds. Note that the second is not free ($0.00): it reminded me that I pay a monthly fee for access to Adobe. In the "olden" days one simply bought Adobe Acrobat Pro on a CD or DVD and then paid for the upgrades; many software companies have since moved to charging on a recurring (e.g., monthly) basis. I find Adobe's features very useful, and Adobe is an essential tool in my consulting practice. For example, long ago I learned to routinely convert output to PDF before sending it to a client and to "toggle off" the feature that lets someone cut/paste into a Word or other document. As a consultant, one only needs exactly one time to find output clearly labelled "draft, do not forward" cut and pasted into a press release or similar, with a CEO sending and then retracting the statement and wondering exactly how and why the wrong results had been used. With a licensed version of Adobe, as long as one pays the monthly fee, one can manipulate the PDF files (markup, etc.) and export to different formats such as Excel, PowerPoint, Word, and a few others.

    I also contacted the author of the package (Dr. Ooms at UC Berkeley) with some questions, and I appreciate that he replied promptly (same day).

    Below are my two workarounds, with some useful help files and some additional R code.

    Two workarounds

    I. First workaround: text extraction using the pdftools R library

    R library for PDF extraction:

    https://docs.ropensci.org/pdftools/

    Key function of pdftools

    The ?pdftools manual page shows a brief overview of the main utilities. The most important function is pdf_text which returns a character vector of length equal to the number of pages in the pdf. Each string in the vector contains a plain text version of the text on that page.

    # r code example ###########################################################
    # extract and "cat" each page of the pdf

    library(pdftools)

    download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
    txt <- pdf_text("1403.2805.pdf")
    txt

    # see ?cat for printing a page with its line breaks preserved
    cat(txt[1])
    cat(txt[18])
    cat(txt[19])

    II. Second workaround: export of a large PDF into Excel, Word, and PowerPoint from the licensed ($) Windows version of Adobe Acrobat Pro

    The large PDF (CAUTION: 114 pages):

    https://www.grantthornton.global/globalassets/1.-member-firms/global/insights/article-pdfs/ifrs/ifrs-example-financial-statements-2021_2.pdf

    Extracts of the 114-page PDF (note: your browser may suggest these be downloaded rather than opened in the browser):

    www.barkerstats.com/PDFs/ASA/CNSL/ifrs-example-financial-statements-2021_2.docx

    www.barkerstats.com/PDFs/ASA/CNSL/ifrs-example-financial-statements-2021_2.pptx

    www.barkerstats.com/PDFs/ASA/CNSL/ifrs-example-financial-statements-2021_2.xlsx



    ------------------------------
    Chris Barker, Ph.D.
    ------------------------------