Hi-
I keep the following list of text-parsing features for what you are trying to do, grouped into the categories below. Much of it comes from Google NLP, but some comes from other sources as well. I hope this helps add to your list:
Text Components (Google calls this a "Text Span"): an output piece of the overall text that has a central entity.
Token (either word or term) identifiers, with the following attributes that define the token:
- Parts of Speech: Adjective, Preposition, Postposition, Adverb, Conjunction, Determiner, Common Noun, Proper Noun, Cardinal number, Pronoun, Particle or other function word, Punctuation, Verb, Verb tense, Verb mode, Foreign word/term, Typo, Abbreviation, Emoticon, and Affix
- Time flow during an event (aspect): Perfective, Imperfective, and Progressive
- Tense: conditional, future, past, present, imperfect, pluperfect
- Noun or pronoun case: accusative, adverbial, complement, dative, genitive, instrumental, locative, nominative, oblique, partitive, prepositional, reflexive, reflexive_case, relative, relative_case, and vocative
- Grammatical mood: Conditional, Imperative, Indicative, Interrogative, Jussive, Subjunctive
- Person: first, second, third, reflexive
- Proper noun: yes/no flag
- Voice: active, causative, passive
Text, Text Component, Sentence, and Token calculations:
- lexical diversity = number of unique words divided by the total count of words in the overall text
- average and median number of characters or words within the specific text component
- ratio of each token identifier (for example, each part of speech) to all tokens in the text component
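To make these calculations concrete, here is a minimal Python sketch. It is only an illustration: the function names (tokenize, text_stats, tag_ratios) are my own, the tokenizer is a simple regular expression rather than a real NLP tokenizer, and tag_ratios assumes you already have one tag per token from whatever tagger you use.

```python
import re
from collections import Counter
from statistics import mean, median


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as words.

    Assumption: a crude regex stands in for a proper tokenizer.
    """
    return re.findall(r"[a-z']+", text.lower())


def text_stats(text):
    """Return lexical diversity plus mean and median word length."""
    words = tokenize(text)
    if not words:
        return {"lexical_diversity": 0.0,
                "mean_word_length": 0.0,
                "median_word_length": 0.0}
    lengths = [len(w) for w in words]
    return {
        # unique words divided by total words
        "lexical_diversity": len(set(words)) / len(words),
        "mean_word_length": mean(lengths),
        "median_word_length": median(lengths),
    }


def tag_ratios(tags):
    """Return each tag's share of all tokens, given one tag per token.

    Assumption: tags come from an external tagger; none is called here.
    """
    counts = Counter(tags)
    total = len(tags)
    return {tag: n / total for tag, n in counts.items()} if total else {}


stats = text_stats("The cat sat on the mat. The dog sat too.")
ratios = tag_ratios(["NOUN", "VERB", "NOUN", "ADJ"])
```

The same functions can be run per sentence or per text component instead of on the whole text, which gives the component-level versions of each measure.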
Sentence: If text has sentence structure, the sentences and the number of sentences in the overall text.
Token: Word or term identifiers:
- Entity Analysis (think proper nouns): Named Person, Location, Organization, Event, Artwork, Consumer Good, Brand
- Entities around Location details: phone, address (and all of its components), geo long-lat, other geographic markers
- Entities around specific Amounts: Date, Number, Currency
Saliency: A score that indicates the importance or centrality of a specific entity to the entire text.
Sentiment Score per Text Component
Text Classification: The name of the category and a score of confidence that the portion of text meets the classification criteria
Original Language: English, French, German, Spanish, Chinese - Mandarin, etc.
------------------------------
Carol Haney
Senior Research and Data Scientist, Distinguished
Qualtrics
------------------------------
Original Message:
Sent: 03-28-2020 16:01
From: Jerome Tuttle
Subject: How do writers differ quantitatively?
How do FICTION writers differ quantitatively? I have the following list. What can be added?
# characters per word (first used by De Morgan), # words per sentence
% unique words
use of frequent words
use of sensory adjectives
use of sentiment words
use of positive or negative words
verb/adjective ratio
complexity (grade level readability)
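As a rough illustration of how a few of these items could be computed, here is a small Python sketch covering characters per word, words per sentence, and a grade-level score. It uses the Flesch-Kincaid grade formula as one possible readability measure. The function names are hypothetical, the sentence splitter is naive, and the syllable counter is only a vowel-group heuristic, so treat the numbers as approximate.

```python
import re


def split_sentences(text):
    """Naive split on ., ! and ?; abbreviations and dialogue will confuse it."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def count_syllables(word):
    """Heuristic: count runs of vowels, with at least one syllable per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Return basic length measures and a Flesch-Kincaid grade estimate."""
    sentences = split_sentences(text)
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    words_per_sentence = len(words) / len(sentences)
    chars_per_word = sum(len(w) for w in words) / len(words)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    # Flesch-Kincaid grade level
    fk_grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return {
        "words_per_sentence": words_per_sentence,
        "chars_per_word": chars_per_word,
        "fk_grade": fk_grade,
    }


result = readability("The cat sat. The dog ran fast.")
```

Very short, simple passages can produce a negative grade, which is expected for this formula; it is more meaningful on samples of a few hundred words or more.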
Does anyone have new items? Does anyone have some quantitatively-oriented English teachers who can weigh in?
Jerry
------------------------------
Jerry Tuttle
Adjunct online math instructor
------------------------------