Discussion: View Thread

  • 1.  R Question - reading adobe pdf files in R

    Posted 04-02-2023 16:36


    First,
    My thanks to Rick Peterson for helping arrange the debugging of a software bug that prevented me from posting to the section.
    The bug has been resolved, and with this post I'm again able to 1) post to the section members and 2) reply to posts from section members.
    This is my first post to the section since the bug was fixed.

    My question here is in two parts.
    First, do any of the section members have experience using the R library that imports the contents of a PDF file into R? These are PDFs with lots of text. I have successfully accomplished that import.
    My second question: how best to find text strings within the imported files? Oversimplified, it seems that R imports the separate PDFs as items in a list, so with 39 files I get a list with 39 items. Finding text after selecting one item (corresponding to a specific PDF) is where I get stuck.

    The main commands needed are to load the following two libraries:
    library(pdftools)
    library(stringr)

    One creates a single working directory in R, stores the PDFs in that same directory, and then uses
    files <- list.files(pattern = "PDF$")
    files
    The list.files call looks in the current working directory for the PDFs.
    My project looks at thirty-nine separate PDF files, each with about 25-50 pages, of certain personal records.
    These are personal files that I cannot share.

    After this import there is an lapply of pdf_text:
    # myadobeimport is the text extracted from the PDFs, in a list
    myadobeimport <- lapply(files, pdf_text)



    It seems that each PDF is imported as a giant character vector, and I have been working with substrings and character-string searches (regexpr, etc.).
    What seems to work so far is converting these to a matrix with
    as.matrix(myadobeimport)

    and then I search for strings by using sink() on items in the list, exporting to a text file (".txt"), and then opening and searching the text file:
    sink("myadobeimport.txt")
    myadobeimportmatrix <- as.matrix(myadobeimport[[3]][3])
    myadobeimportmatrix
    sink()

    My last and main question: is there a shorter/faster/simpler way for me to search for and extract specific text strings from each PDF file?
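    As a possible shortcut, here is a minimal base-R sketch showing how grep can search the imported list directly, without the sink()/text-file round trip. The list contents below are invented stand-ins for the real pdf_text output, since the actual files can't be shared:

```r
# Hypothetical stand-in for the result of lapply(files, pdf_text):
# one character vector per PDF, one string per page.
myadobeimport <- list(
  c("Page 1: total cost of sales 1000", "Page 2: revenue 2000"),
  c("Page 1: nothing relevant here")
)

# Which pages of which PDFs mention the phrase? (integer page indices)
hits <- lapply(myadobeimport, function(pages) grep("cost of sales", pages))

# Pull out the matching page text directly, no sink() needed
found <- unlist(lapply(myadobeimport, function(pages)
  grep("cost of sales", pages, value = TRUE)))

print(hits)
print(found)
```

    grep() returns the indices of the matching pages; with value = TRUE it returns the matching text itself, so no export to a .txt file is required.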

    thank you



    ------------------------------
    Chris Barker, Ph.D.
    2023 Chair Statistical Consulting Section
    Consultant and
    Adjunct Associate Professor of Biostatistics
    www.barkerstats.com


    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy
    ------------------------------


  • 2.  RE: R Question - reading adobe pdf files in R

    Posted 04-02-2023 17:57
    Creating and sharing a minimal working example, with some meaningless
    text (lorem ipsum) would be helpful.

    You might find some of the packages in the CRAN taskview on Natural
    Language Processing helpful, particularly the tm and tidytext packages.
    I have no personal experience with them, but might be worth a look.

    https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

    Do you want to find if one particular text string (call it the needle)
    is present in any of the pdf files (call them the haystacks)? Or are
    you looking for several different needles amongst all the haystacks?

    Without understanding more about the problem, I don't know if these will
    help, but here are a couple code snippets I keep handy:

    ## Using dplyr and stringr, the lines below will keep rows in mytibble
    ## in which variable_in_my_tibble contains ANY of the strings in the
    ## vector named vector_of_strings_of_interest

    mytibble %>%
      filter(str_detect(variable_in_my_tibble,
                        paste(vector_of_strings_of_interest, collapse = "|")))

    ## matching or searching for string(s) across several variables in a
    ## dplyr dataframe

    dd %>% filter_all(any_vars(str_detect(., pattern = "Ca")))


    This is probably a more capable approach, with credit going to Jeff
    Newmiller on the R-help mailing list:

    ## see https://stat.ethz.ch/pipermail/r-help/2015-July/430269.html
    ## in the R-help archives
    ## For context, I use this for a dataframe where each record is a death
    ## certificate, and the free-text causes of death (of which there can
    ## be several for any given death cert) are in variables c(174, 184,
    ## 186, 188, 190, 192). The object oid.words is a character vector of
    ## the words I wish to detect.

    edrs.3$textwordoid.Newmiller <- grepl(
      paste0("\\b(", paste0(oid.words, collapse = "|"), ")\\b"),
      do.call(paste, edrs.3[, c(174, 184, 186, 188, 190, 192)])
    )
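    To make the pattern concrete, here is a self-contained toy version of the same idea; the data frame, column names, and word list are invented for illustration:

```r
# Toy data frame standing in for the death-certificate example: two
# free-text columns and a vector of words to detect.
dd <- data.frame(
  cause1 = c("cardiac arrest", "opioid overdose", "pneumonia"),
  cause2 = c("hypertension", "fentanyl toxicity", "influenza")
)
oid.words <- c("opioid", "fentanyl", "heroin")

# Paste the text columns together row-wise, then flag rows containing
# any of the target words as whole words.
dd$flag <- grepl(
  paste0("\\b(", paste0(oid.words, collapse = "|"), ")\\b"),
  do.call(paste, dd[, c("cause1", "cause2")])
)
dd$flag
# [1] FALSE  TRUE FALSE
```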

    --Chris Ryan








  • 3.  RE: R Question - reading adobe pdf files in R

    Posted 04-03-2023 07:48

    First, I'll second Chris Ryan's comment on a minimal working example.  That would help.  I haven't used the pdftools package, so references to the structure resulting from the pdf_text function are not very clear to me.

    Second, I recommend using the tidytext package.  I used it about a year ago to compare word use between two sources, and it was very helpful.  @Chris Barker , I think this package will help you to complete your search pretty easily.  Given what I think I understand about your data, use the output from

      lapply(files, pdf_text)

    and create a tibble or data.frame with two columns:

    * filenames
    * text from PDF

    Given that structure, I think the following code should get you pretty close to what you need.

    # Create output data where each row is a distinct sentence from a specific PDF
       pdf_unnested <- pdf_tibble %>% unnest_tokens(output = out_text, input = text_from_pdf, token = 'sentences')

    # Filter data to find strings of interest
       filter(pdf_unnested, str_detect(out_text, "your string of interest") )

    Splitting the pdf into different sentences seems like the approach you might need, but there are many options for the token argument that might be a better fit.  See the documentation for tidytext.  

    I also found the tidytext vignettes very helpful in providing a quick start to using the package.
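    For a self-contained illustration of the two steps above: the tibble contents below are invented, and in practice the text_from_pdf column would come from lapply(files, pdf_text).

```r
library(dplyr)
library(tidytext)
library(stringr)

# Hypothetical filename/text structure as described above
pdf_tibble <- tibble(
  filenames     = c("a.pdf", "b.pdf"),
  text_from_pdf = c("Revenue rose sharply. Total cost of sales fell.",
                    "No relevant figures appear here.")
)

# One row per sentence per PDF (unnest_tokens lowercases by default)
pdf_unnested <- pdf_tibble %>%
  unnest_tokens(output = out_text, input = text_from_pdf, token = "sentences")

# Keep only sentences mentioning the string of interest
res <- filter(pdf_unnested, str_detect(out_text, "cost of sales"))
res
```

    The filenames column survives the unnesting, so each matching sentence is still tied to the PDF it came from.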

    Hope that helps.



    ------------------------------
    Gregory Erkens
    ------------------------------



  • 4.  RE: R Question - reading adobe pdf files in R

    Posted 04-03-2023 10:52

    Thank you for the replies. At the moment I'm about to get on a train.

    These are personal financial documents that may become part of a lawsuit; I am not able to adequately anonymize them, so I can't provide them here.

    In lieu of that, permit me to offer the following.
    I found public-domain financial documents in PDF format that are probably close enough, and I itemize those below with some links: corporate profit-and-loss statements, 10-Ks, and income statements from well-known corporations. I'm looking across, say, 20 or so of each particular type, trying to find the same data element in each of these documents. The different corporations present the same information in similar, but not necessarily identical, formats.

    The Apple profit and loss may be the easiest document to work with, and I need to find the same element in the profit-and-loss statement across, say, 20 different companies; there might be minor variations among the companies. I didn't track down examples from 20 companies. I tracked down some well-known ones in the pharmaceutical industry (Gilead, Pfizer, and Biogen), tech (Apple), and manufacturing (Ford).

    For example, I might want to find the total cost of sales in 2021 for Apple. That particular year and row for other companies isn't necessarily located in the same cell in each table. I would need to text-search the row label, text-search the column header, and then find the cell at the intersection of the column and row. A reasonable assumption for my particular confidential financial statements, which come from the banking and brokerage industry, is that the row labels and column headers should be about the same.
    I may need a way of comparing 20 text column headers and saying that they're about the same. I know SAS has some character-string matching functions (soundex, etc.), and there likely are character-string matching functions in R.
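    Indeed, base R ships with approximate-matching tools: agrepl() does fuzzy (edit-distance) substring matching, and adist() computes the edit distances themselves. A small sketch with invented header strings:

```r
# Invented column-header strings of the kind one might scrape from
# several companies' financial statements
headers <- c("Total cost of sales", "Total Cost of Sales:",
             "Cost of sales, total", "Gross revenue")

# Flag headers within a small edit distance of a reference label
agrepl("total cost of sales", headers, max.distance = 2, ignore.case = TRUE)
# [1]  TRUE  TRUE FALSE FALSE

# Or inspect the edit distances directly
adist("total cost of sales", headers, ignore.case = TRUE)
```

    Note that agrepl() matches approximately as a substring, so reordered wording ("Cost of sales, total") still misses at a small max.distance; adist() is useful for choosing a sensible threshold.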

    I hope this better clarifies my question

    Profit and loss statements:
    Apple P&L
    Ford
    Pfizer
    Gilead





    ------------------------------
    Chris Barker, Ph.D.
    ------------------------------



  • 5.  RE: R Question - reading adobe pdf files in R

    Posted 04-03-2023 11:02

    First, PDF is a mess: it is a format for describing pages, not a mechanism for conveying data.  The descriptions of pages in some PDF files are extremely irregular and almost impossible to parse.

    Having said that, some PDF files are tractable, especially those created by a word processor (usually).  A little experimentation suggests that when `pdf_text` reads such a file, it renders it as a character vector with one string per page (in order).  Newline characters "\n" separate the visible lines.  A simple call to `strsplit` will render each page as a vector of strings, one per line.
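    A tiny illustration of that splitting step, using a made-up page string in place of real pdf_text output:

```r
# One element of a pdf_text() result: a whole page as a single string,
# with "\n" separating the visible lines
page <- "Line one\nLine two\nLine three"

# Split into one string per line, then search line by line
lines <- strsplit(page, "\n")[[1]]
lines
# [1] "Line one"   "Line two"   "Line three"

grep("two", lines, value = TRUE)
# [1] "Line two"
```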

    Base `R` comes with a powerful set of text searching and processing tools embodied in its regular-expression functions.  See the help page for `regular expression`.  If you're unfamiliar with the syntax of regular expressions and have more than the simplest text-processing needs, then stop right here and take the time to learn it.  It's a few hours of study and practice that will pay off forever in terms of the capabilities and concepts it gives you.  If you also learn some of the underlying theory (about finite-state automata) you will appreciate how incredibly efficient this technology can be.  For more about this, visit the source code page at https://github.com/laurikari/tre

    So, the basic workflow is to read a PDF file and loop over its pages to do your processing, perhaps splitting each page into individual lines.  As a simple example, here is code to read a file and output all occurrences of (base-10) numbers, each accompanied by any word that might precede it (possibly separated by blanks and punctuation).  The code first references and reads the file, then describes the pattern, and finally matches that pattern to the file's contents using `gregexec` and prints all the matches.

    library(pdftools)

    fn.in <- "https://example-files.online-convert.com/document/pdf/example.pdf"
    S <- pdf_text(fn.in)
    r <- "(^|([[:alpha:]]+[[:blank:][:punct:]]*))((([+-]?[[:digit:]]+[.]?)[[:digit:]]*)|([.][[:digit:]]+))"
    op <- gregexec(r, S)
    lapply(regmatches(S, op), \(x) if (length(x) > 0) x[1, ])

    This processes a sample one-page PDF file available online.  Its output is

    [[1]]
    [1] "Version: 1.0"   "a 2002"         "Doe #1"         "Doe #2"         "cited 21"       "ShareAlike 3.0"
    

    With a multiple-page file you will get a list of results, one per page.  (Any pages with no matches will simply be skipped in the output list.)

    In my experience, extensive testing of any complex regular expression is always worthwhile.  This version of the pattern is a relatively untested initial attempt to illustrate the kind of search one might do to scrape a PDF file of numerical data -- so please don't rely on it implicitly!



    ------------------------------
    William Huber (Bill)
    Quantitative Decisions / Analysis and Inference
    ------------------------------



  • 6.  RE: R Question - reading adobe pdf files in R

    Posted 04-05-2023 15:20

    Thank you very much for the replies.
    I found, and report here, two not-perfect workarounds. Note that the second is not free ($0.00): it reminded me that I pay a monthly fee for access to Adobe. In the "olden" days one simply bought Adobe Acrobat Pro on a CD or DVD and then paid for the upgrades; many software companies have since moved to charging on a recurring (e.g., monthly) basis. I find Adobe's features very useful, and Adobe is an essential tool in my consulting practice. For example, long ago I learned to routinely convert output to PDF before sending it to a client and to "toggle off" the feature that lets someone cut/paste into a Word or other document. As a consultant, one only needs exactly one time to find output clearly labelled "draft, do not forward" cut and pasted into a press release or similar, with a CEO sending and then retracting the statement and wondering exactly how and why the wrong results had been used. With a licensed version of Adobe, as long as one pays the monthly fee, one can manipulate the PDF files (markup, etc.) and export to different formats such as Excel, PowerPoint, Word, and a few others.

    I also contacted the author of the package (Dr. Ooms at UC Berkeley) with some questions, and I appreciate that he replied promptly (same day).

    Below are my two workarounds, with some useful help files and some additional R code.

    Two workarounds

    I. First workaround: text extraction using the pdftools R library

    R library for PDF extraction:

    https://docs.ropensci.org/pdftools/

    Key function of pdftools

    The ?pdftools manual page shows a brief overview of the main utilities. The most important function is pdf_text which returns a character vector of length equal to the number of pages in the pdf. Each string in the vector contains a plain text version of the text on that page.

    # r code example ###########################################################
    # extract and "cat" each page of the pdf

    library(pdftools)

    download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
    txt <- pdf_text("1403.2805.pdf")
    txt

    # see ?cat for printing a page with its line breaks preserved
    cat(txt[1])
    cat(txt[18])
    cat(txt[19])

    II. Second workaround: export of a large PDF into Excel, Word, and PowerPoint from the licensed ($) Windows version of Adobe Acrobat Pro

    The large PDF (CAUTION: 114 pages):

    https://www.grantthornton.global/globalassets/1.-member-firms/global/insights/article-pdfs/ifrs/ifrs-example-financial-statements-2021_2.pdf

    Extracts of the 114-page PDF (note: your browser may suggest these be downloaded rather than opened in the browser):

    www.barkerstats.com/PDFs/ASA/CNSL/ifrs-example-financial-statements-2021_2.docx

    www.barkerstats.com/PDFs/ASA/CNSL/ifrs-example-financial-statements-2021_2.pptx

    www.barkerstats.com/PDFs/ASA/CNSL/ifrs-example-financial-statements-2021_2.xlsx



    ------------------------------
    Chris Barker, Ph.D.
    ------------------------------