Creating and sharing a minimal working example, with some meaningless
text (lorem ipsum) in place of the real content, would be helpful.
You might find some of the packages in the CRAN task view on Natural
Language Processing helpful, particularly the tm and tidytext packages.
I have no personal experience with them, but they might be worth a look.
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Do you want to find whether one particular text string (call it the needle)
is present in any of the pdf files (call them the haystacks)? Or are
you looking for several different needles amongst all the haystacks?
Without understanding more about the problem, I don't know if these will
help, but here are a couple of code snippets I keep handy:
## Using dplyr and stringr, the lines below keep the rows of mytibble in
## which variable_in_my_tibble contains ANY of the strings in the vector
## named vector_of_strings_of_interest:
mytibble %>%
  filter(str_detect(variable_in_my_tibble,
                    paste(vector_of_strings_of_interest, collapse = "|")))
## Matching or searching for string(s) across several variables in a
## dplyr dataframe:
dd %>% filter_all(any_vars(str_detect(., pattern = "Ca")))
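For concreteness, here is a small self-contained illustration of the first snippet. The tibble, column name, and search strings are made up for the example, not from any real data:

```r
library(dplyr)
library(stringr)

## Toy data: one free-text field per row
mytibble <- tibble(id = 1:3,
                   cause = c("acute cardiac arrest",
                             "renal failure",
                             "opioid overdose"))
strings_of_interest <- c("cardiac", "opioid")

## Keep rows whose cause mentions ANY of the strings of interest
mytibble %>%
  filter(str_detect(cause, paste(strings_of_interest, collapse = "|")))
## rows with id 1 and 3 remain
```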
This is probably a more capable approach, with credit going to Jeff
Newmiller on the R-help mailing list:
## see https://stat.ethz.ch/pipermail/r-help/2015-July/430269.html
## in the R-help archives
## For context, I use this for a dataframe where each record is a death
## certificate, and the free-text causes of death (of which there can be
## several for any given death certificate) are in variables c(174, 184,
## 186, 188, 190, 192). The object oid.words is a character vector of
## the words I wish to detect.
edrs.3$textwordoid.Newmiller <-
  grepl(paste0("\\b(", paste0(oid.words, collapse = "|"), ")\\b"),
        do.call(paste, edrs.3[, c(174, 184, 186, 188, 190, 192)]))
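If it helps with the question in the quoted message: pdf_text() returns one character vector per pdf, with one string per page, so the same grepl idea can search the whole list without any sink()/export step. Here is an untested base-R sketch; the object names and the toy stand-in data are just illustrative:

```r
## Toy stand-in for lapply(files, pdf_text): a list with one element per
## pdf; each element is a character vector with one string per page.
myadobeimport <- list(c("page one text", "the needle is on page two"),
                      c("nothing here", "or here"))

needle <- "needle"

## Collapse each pdf to one long string, then flag pdfs containing the needle
whole_pdfs <- vapply(myadobeimport, paste, character(1), collapse = " ")
which(grepl(needle, whole_pdfs, fixed = TRUE))   ## pdf 1 matches

## Which page(s) within a given pdf match:
grep(needle, myadobeimport[[1]], fixed = TRUE)   ## page 2
```

With fixed = TRUE the needle is treated as a literal string; drop it if you want a regular expression instead.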
--Chris Ryan
Chris Barker via American Statistical Association wrote:
> Chris Barker
> Apr 2, 2023 4:36 PM
>
>
> First,
> My thanks to Rick Peterson, for helping arrange for the debugging of a
> software bug that prevented me from posting to the section.
> The bug has been resolved, and with this post I'm again able to 1) post
> to the section members and 2) reply to posts from section members.
> This is my first post to the section since the bug was fixed.
>
> My question here is in two parts.
> First, whether any of the section members have experience using the R
> library that permits one to import the contents of a pdf file into R.
> These are pdfs with lots of text.
> I have successfully accomplished that import.
>
> My second question: how best to find text strings within the imported
> files. Oversimplified, it seems that R imports the separate pdfs
> into items in a list, so I have 39 files and I get a list with 39 items.
> Finding text after selecting one item (corresponding to a specific
> pdf) is where I "get stuck".
>
> The main commands needed are to load the following two libraries:
> library(pdftools)
> library(stringr)
>
> One creates a single working directory in R, stores the pdfs in that
> same directory, and then uses
> files <- list.files(pattern = "PDF$")
> files
> The list.files call looks in the current working directory for the
> pdfs.
> My project is looking at thirty-nine separate pdf files; each pdf has
> about 25-50 pages of certain personal records.
> These are personal files that I cannot share.
>
> After this import there is an lapply of pdf_text:
> # myadobeimport is the extracted text from the pdfs, in a list
> myadobeimport <- lapply(files, pdf_text)
>
>
>
> It seems that each pdf is imported as a giant vector, and I have been
> working with substrings and character string search (regexpr, etc.).
> What seems to work so far is converting these to a matrix with
> as.matrix(myadobeimport)
> and then searching for strings by sink()-ing items in the list to a
> text file (".txt"), then opening and searching in the text file:
> sink("myadobeimport.txt")
> myadobeimportmatrix <- as.matrix(myadobeimport[[3]][3])
> myadobeimportmatrix
> sink()
>
> My last and main question: is there a shorter/faster/simpler way for
> me to search for and extract specific text strings from each pdf file?
>
> thank you
>
>
>
> ------------------------------
> Chris Barker, Ph.D.
> 2023 Chair Statistical Consulting Section
> Consultant and
> Adjunct Associate Professor of Biostatistics
>
> www.barkerstats.com
>
> ---
> "In composition you have all the time you want to decide what to say in
> 15 seconds, in improvisation you have 15 seconds."
> -Steve Lacy
> ------------------------------