  • 1.  RE: Section discussion at JSM - Broader name

    Posted 08-12-2022 10:20
    Hi, all. 

    At our annual meeting, we had a brief discussion about our name. ASA's Council of Sections suggested that "text analysis" might be too narrow for a section. There is also the concern that "text analysis" might fit neatly under another section. FWIW, it didn't sound to me like either of these was a catastrophic blocker to us becoming a section. Even so, I want to pick up the discussion of naming here and put forward my suggestion for a section name.

    "Corpus Statistics"  or "The Section for Corpus Statistics"

    I think "Corpus Statistics" checks two boxes for us:
    1. The name presents the study of broad linguistic phenomena as a subfield of statistics (i.e., it's less narrow than just "text")
    2. The name distinguishes us from both Natural Language Processing (NLP) and Linguistics by marking our approach as uniquely statistical

    Natural Language Processing is a subfield of Artificial Intelligence and, as such, is inherently task-focused (extract entities, produce summaries, etc.). But it does not primarily focus on making inference on populations from samples, inherently the task of statistics. Meanwhile, there is a subfield of Linguistics called Corpus Linguistics that studies real-world samples of language to understand linguistic phenomena. While Corpus Linguistics is focused on language itself, Corpus Statistics studies samples of language as statistical phenomena. Here, language as statistical phenomena includes more than just text (though in practice, step 1 of speech analytics is usually converting speech to text).

    In addition to proposing the above name change, I suggest we use this thread for any discussion of re-naming, whether it's for "Corpus Statistics" or for another proposal. That way, we don't have to search through many threads to get at the broader discussion.

    Very interested to hear what you all think!


    ------------------------------
    Tommy Jones
    ------------------------------


  • 2.  RE: Section discussion at JSM - Broader name

    Posted 08-15-2022 07:16
    There is a suggestion from Tommy Jones to use "Corpus Statistics" or "The Section for Corpus Statistics" for the name of the new section.  Here's my concern about these names.  The Oxford English Dictionary says that the word "corpus" means "A body or complete collection of writings or the like; the whole body of literature on any subject."  I'm much more interested in, for instance, text analysis from disparate sources, which may be streaming and may be produced on the fly.  I worry that the word "corpus" tends to refer to a collection of texts (e.g., the corpus of all of Shakespeare's writings), which tends to have a fixed meaning.  I like the current nomenclature of "text analysis" because it strikes me as having a broader scope.


    ------------------------------
    Mark Daniel Ward
    ------------------------------



  • 3.  RE: Section discussion at JSM - Broader name

    Posted 08-16-2022 08:40
    I also like the name "Text Analysis" since it has a broader meaning.  I think that "Corpus Statistics" will confuse people and seems too narrow.

    ------------------------------
    Todd Sanger
    Eli Lilly and Company
    ------------------------------



  • 4.  RE: Section discussion at JSM - Broader name

    Posted 08-17-2022 13:21
    I have two concerns with not changing the name: 

    First, "text analytics" is a very generic term, albeit one in common usage. It doesn't indicate any specialty that we bring as statisticians and could easily be perceived as fitting neatly under the statistical learning section. (With apologies to Carol, I have a similar hesitation about "natural language understanding".)

    Second, the dictionary definition of "corpus" is overly narrow compared with how the word is used in Linguistics, specifically in "Corpus Linguistics". I'd encourage you to follow the link to the Wikipedia article on the topic and see how they use the word "corpus" in that context. 

    If the concern is the word "corpus", then perhaps we could say "natural language statistics", which broadens the group past just "text", for which we were criticized at the Council of Sections. I don't think it has the same ring, and it is tied closely to "natural language processing", which I think we want to avoid. But I don't hate "natural language statistics" :)






  • 5.  RE: Section discussion at JSM - Broader name

    Posted 08-18-2022 07:50
    I agree with Tommy - Natural Language Statistics keeps the title relevant for a very long time.

    ------------------------------
    Carol Haney
    Senior Research and Data Scientist, Distinguished
    ------------------------------



  • 6.  RE: Section discussion at JSM - Broader name

    Posted 08-18-2022 12:33
    Hello all,

    I have major concerns with "Natural Language Understanding" (as too narrow), and similar problems exist for "Corpus Statistics".

    I like reviewing the "papers with code" website to see how they categorize work in NLP: Papers with Code - Natural Language Processing. Of course, you'll notice that they don't have any section that includes the word "Statistics".

    My favorite suggestion so far is "Natural Language Statistics" as it balances relating us to "Natural Language Processing" (the most commonly used categorization) while keeping us distinct and emphasizing the fact that statistical rigor will be a requirement.

    Good suggestion, Tommy!

    ------------------------------
    Karl Pazdernik, Ph.D.
    Senior Data Scientist & Team Lead, Applied Statistics & Computational Modeling, Pacific Northwest National Laboratory
    Research Assistant Professor, Department of Statistics, North Carolina State University
    ------------------------------



  • 7.  RE: Section discussion at JSM - Broader name

    Posted 08-19-2022 12:58

    Dear All

    This has been a lively discussion. Based on what's been communicated so far, is it fair to say the three options below get the nod from most? If so, maybe we can put the list to a vote? My own preference is #1 (Natural Language Statistics). In fact, all of these choices seem to agree in spirit with other section names. I was also thinking "statistical methods for text analysis," but this is likely too verbose for a section name. The three choices below do send a signal that our section would be concerned with statistical analysis/inference of text (esp. nos. 1 and 2).

    1. Natural Language Statistics Section
    2. Natural Language Analysis Section
    3. Text Analysis Section

    I'm unsure how to set up an online poll, but this is likely feasible if we choose to do it.


    Thanks

    Ricky.

    ------------------------------
    B. Ricky Rambharat, Ph.D.
    Applied Statistician
    e-mail: rrambharat@gmail.com
    WWW: sites.google.com/view/drrambharat/
    ------------------------------



  • 8.  RE: Section discussion at JSM - Broader name

    Posted 08-22-2022 10:49
    I will be happy to set up a poll using Google Forms. But I want the executive committee to make a decision first and finalize the name choices before opening the poll.

    ------------------------------
    Tony An, Vice Chair of Council of Sections Governing Board 2023-25
    ------------------------------



  • 9.  RE: Section discussion at JSM - Broader name

    Posted 08-15-2022 21:28

    Dear Tommy and Colleagues

    First, it was very nice to meet many of you at the meeting at JSM last week. The present suggestion seems reasonable. An important point to bear in mind when labeling this section: we should ensure we communicate that a key objective is statistical inference based on text. I'd vote for "Section on Corpus Statistics" or something similar if "text analysis" isn't best. 

    I also look forward to the other replies. 

    Thanks,
    Ricky. 



    ------------------------------
    B. Ricky Rambharat, Ph.D.
    Applied Statistician
    e-mail: rrambharat@gmail.com
    WWW: sites.google.com/view/drrambharat/
    ------------------------------



  • 10.  RE: Section discussion at JSM - Broader name

    Posted 08-15-2022 22:33
    Hi! In the spirit of finding a name that will stay relevant for a long time, may I suggest the industry-standard term "Natural Language Understanding", which encompasses short text, corpora, and any other text that is statistically categorized?

    ------------------------------
    Carol Haney
    Senior Research and Data Scientist, Distinguished
    ------------------------------