Hosted by Chong Ho (Alex) Yu,    
SCASA Vice President for Statistics Education

Posted on March 17, 2023

In response to the challenge from ChatGPT, two days ago (March 15) China’s AI developer Baidu released "Wen Xin Yi Yan" at its Beijing headquarters. Its text generation mode is

similar to that of ChatGPT, but additionally, it can read out the answer in real-time, corresponding to various Chinese dialects, including Cantonese and Sichuan dialects. Moreover,

the content can be generated into pictures and videos in real-time, too. Robin Li, Chairman and CEO of Baidu, demonstrated the comprehensive capabilities of "Wen Xin Yi Yan" 

in five usage scenarios: literary creation, commercial copywriting, mathematical calculation, Chinese comprehension, and multi-modal generation. He admitted that in the internal

test, the experience of "Wen Xin Yi Yan" is not perfect, but seeing the strong demand in the market, he will release the product as soon as possible. At present, "Wen Xin Yi Yan"

has a better ability to support Chinese, and the English ability will be further improved in the future. Since the official announcement last month that "Wen Xin Yi Yan" will be

released, 650 partners have joined in, and more related products will appear in the short term. He emphasized that "Wen Xin Yi Yan" is not a tool for the technological confrontation

between China and the United States, but a brand-new platform for the group to serve hundreds of millions of users and empower thousands of industries. Starting today, the first

batch of users can try the product on the official website of "Wen Xin Yi Yan" with an invitation test code, and it will gradually be opened to more users.

There are more than 260 billion parameters in Baidu's chatbot model, which is more than in GPT-3, but some critics believe its performance is not as good as ChatGPT, partly due

to its lack of web-based Chinese information.

Full text:

That’s my take on it: Perhaps the biggest hurdle to China's chatbot development is not the technological issue; rather, there are too many red lines. When a tester entered a sensitive

question into China's chatbot, the system refused to answer: “The question could not pass a safety review. No response could be generated for you.” When the reporter tried 

to push it by asking, “Why did my question fail to pass the safety review?” The answer was: “Let’s change the topic and talk about something else.” In contrast, ChatGPT handles

sensitive or controversial questions differently: although the answer is usually vague and balanced, at least it gives the user objective facts and lets them decide. 

Posted on March 3, 2023

According to recent research conducted by two cognitive psychologists at the Max Planck Institute for Biological Cybernetics in Tübingen, GPT-3 is comparable to humans in some areas

but lags behind in others. One of the questions presented by the researchers to GPT-3 is the classic Linda problem (I use it in my statistics and probability class):

Linda is 31 years old. She majored in philosophy. She was deeply concerned with issues of social justice and discrimination. Which of the following statements is more probable?

A: Linda is a bank teller.

B: Linda is a bank teller and active in the feminist movement.

The correct answer is A because B is a subset of A. The probability of two events occurring together can never exceed the probability of either event alone. But most respondents picked B, which is a conjunction

fallacy. Interestingly, researchers at Max Planck Institute found that GPT-3 committed the same mistake as humans. Those researchers concluded that in searching for specific information or causal

reasoning, AI failed to use logic; rather, it only passively gets information from texts available on the Internet. When the majority is wrong, AI replicates the error.
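The conjunction rule behind the Linda problem can be checked numerically. The following is a minimal Python sketch; the probabilities are made-up values chosen only for illustration (they are not data from the study), and the function name is hypothetical.

```python
def conjunction_probability(p_a: float, p_b_given_a: float) -> float:
    """Return P(A and B) = P(A) * P(B | A)."""
    if not (0.0 <= p_a <= 1.0 and 0.0 <= p_b_given_a <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return p_a * p_b_given_a


# Hypothetical values, for illustration only.
p_teller = 0.05                 # P(A): Linda is a bank teller
p_feminist_given_teller = 0.90  # P(B | A): active feminist, given that she is a teller

p_both = conjunction_probability(p_teller, p_feminist_given_teller)
print(f"P(A) = {p_teller:.3f}, P(A and B) = {p_both:.3f}")

# Because P(B | A) can never exceed 1, the conjunction can never beat A alone.
assert p_both <= p_teller
```

No matter how strongly the description of Linda suggests feminism, P(B | A) is at most 1, so statement B can at best tie with statement A.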

Full paper:

That’s my take on it: Out of curiosity, I also entered the same question into ChatGPT. Unlike the result obtained by the two researchers at the Max Planck Institute, I received a more “typical” response.

If a student submits a vague answer like this, it might be sourced from AI! ChatGPT answered: “Based on the information provided, statement A or B cannot be determined as more probable. This is because Linda's age

and major in philosophy do not provide any indication of her profession or involvement in social movements. While her concern with social justice and discrimination suggests that statement B could

be possible, there is no clear evidence to support either statement. Therefore, it would be more accurate to say that both statements A and B are possible, but there is not enough information to determine

which one is more probable.”

Posted on March 3, 2023

Recently Harvard Business Review (HBR) reported that many retail companies have not taken advantage of advanced data analytics to improve their business. There are exceptions: Walmart,

Amazon, and a few others. The 25 best-performing retailers during the pandemic generated 83% more profit than laggards and captured more than 90% of the sector's market capitalization gains.

By interviewing 24 business leaders, HBR identified six sticking points as follows:

1.     Culture: Typically, companies have a risk aversion problem and lack a clear goal for implementing analytics.

2.     Organization: Many companies struggle to strike a balance between centralization and decentralization.

3.     People: Very often the analytics function is managed by people who have no understanding of the industry.

4.     Processes: Businesses do not have unlimited resources at their disposal.

5.     Systems: Legacy systems are still serving many firms today.

6.     Data: Data are often scattered throughout the firm in silos and not managed in an organized manner.

Full article:

That is my take on it: Some interviewees believe the bigger issue is people. When the people in charge understand all the other issues, they are willing to dedicate resources to solving them.

Sadly, this is not always the case. William Cleveland and John Chambers were pioneers in data science. Many years ago they both proposed that data science should be interdisciplinary,

incorporating domain knowledge. Agree!  

Posted on February 27, 2023

In response to the challenge of OpenAI, three days ago (Feb 24) Meta (Facebook) announced its flagship large language model: Large Language Model Meta AI (LLaMA). While OpenAI’s GPT-3

consists of 175 billion parameters, the size of LLaMA varies from 7 billion to 65 billion parameters only. In spite of this, Meta claimed that LLaMA is superior because it requires fewer computing

resources to test new approaches, validate existing models, and explore new scenarios. The model will be released under a noncommercial license in order to maintain its integrity and prevent

misuse. Researchers from academic institutions, government organizations, civil society groups, and industry research laboratories around the world will be allowed access on a case-by-case basis.


The announcement by Meta:

That’s my take on it: Several people argued that ChatGPT had opened Pandora's box because it had been released ahead of the development of ethical guidelines for AI applications. Due

to the controversy surrounding ChatGPT, it is understandable that Meta took a more cautious approach. However, even if Meta and others try their best to patch all ethical and legal loopholes

in AI and machine learning, someone will misuse or abuse the technology one way or another.

Posted on February 24, 2023

ChatGPT stories continue to dominate mass media and social media, and probably you already received these stories from many channels. Therefore, I would like to

draw your attention to something else. Two days ago Google unveiled its 2023 data and AI trends report. In addition to Google Cloud, Google also recommends a vast

array of technologies to companies that plan to enhance their AI and cloud computing infrastructure:

·      Aiven

·      C3AI

·      Confluent

·      Collibra

·      Databricks

·      Datametica

·      Elastic

·      Fivetran

·      MongoDB

·      Nvidia

·      Qlik

·      Quantiphi

·      Salesforce

·      SAP

·      Striim

·      ThoughtSpot

A month ago InsideBigData compiled the IMPACT 50 list for Quarter 1, 2023. According to InsideBigData, “These companies have proven their relevance by the way they’re

impacting the enterprise through leading-edge products and services.” The top 20 are:

·      OpenAI

·      Nvidia

·      Google AI

·      Amazon Web Services

·      Hugging Face


·      Databricks

·      Microsoft AI

·      Intel AI

·      Neural Magic

·      Snowflake

·      SAS

·      Qlik

·      Neo4j

·      Allen Institute for AI

·      TigerGraph

·      Anaconda

·      Domino Data Lab

·      Hewlett Packard Enterprise

·      Cloudera

The full report of Google:

The full article of InsideBigData:  

That’s my take on it: Although the selection criteria are subjective and might even be biased, data scientists and DSML educators should still take them seriously. As you can see, the

list of these most promising and most impactful tech companies consists of both fairly new companies and mature companies (e.g., Microsoft, Hewlett Packard, Intel, SAP, SAS…etc.).

However, some established tech giants are absent from the list (e.g., IBM, Oracle…etc.). Neither IBM nor Oracle is even among the top 50. It is understandable. Despite several

decades of development, some of their products have made little progress. The rule in academia is: publish or perish. In the era of AI and big data, the choice facing companies is: 

innovate or perish.

Posted on February 21, 2023

About a week ago the Data Science 4 Everyone coalition affiliated with the University of Chicago released a report indicating that data literacy skills among fourth- and eighth-graders in the US have dropped significantly
over the last decade, even though these skills are becoming more and more important in a data-driven world. Based on the National Assessment of Educational Progress (NAEP) data, the report implies that the
nation's educational system does not adequately prepare young people for a world reshaped by big data and artificial intelligence. Between 2019 and 2022, eighth-graders' scores in the data analysis,
statistics, and probability section of the NAEP math exam decreased by 10 points, while fourth-graders' scores decreased by 4 points. There has been a long-term trend of declining scores over the past decade, with
scores down 17 points for eighth-graders and 10 points for fourth-graders.


Full report:

That’s my take on it: It is not surprising. For the past two decades, I have been monitoring trends in science and math education as part of my research interests. All data I gathered suggest that the decline is real. Because

of the high demand for data scientists, there are many short-term certificate programs and boot camps available. However, though some trainees can throw out certain seemingly sophisticated jargon, they may not fully 

grasp the theories behind DSML due to a lack of a solid foundation. It could be dangerous! My teaching approach is: when there is a sign of misconception among students, trace the root cause and re-lecture the basics!

Posted on February 10, 2023

Facing pressure from OpenAI's ChatGPT, Google is trying to reassure the public that its AI technology is still promising. However, the performance of its own chatbot, Bard, was so embarrassing that investors lost confidence. Bard, which was introduced on Twitter on Monday, tried to answer an inquiry about discoveries from the James Webb Space Telescope. According to Bard, the telescope was the first to photograph a planet outside the solar system, but in fact this milestone was accomplished by the European Southern Observatory's Very Large Telescope in 2004. This mistake was spotted by astronomers on Twitter. Consequently, Alphabet's shares dropped more than 7% on Wednesday, wiping $100 billion off its market value.

Full article:

That’s my take on it: As a matter of fact, ChatGPT also makes many factual errors. For example, when a history professor asked ChatGPT to explain the Joseph Needham thesis, it responded: “the scientific and technological achievements of the West were only possible because of the transmission of scientific and technological knowledge from China to the West.” It is completely wrong! In fact, Joseph Needham was curious about why ancient China failed to develop modern science. My friend who is a math professor in Hong Kong also found that some answers offered by ChatGPT are unsatisfactory. I guess people are more forgiving of ChatGPT because it is the first of its kind.

Posted on February 4, 2023

On January 30, 2023, Retraction Watch published an exclusive report on Hao Li’s research misconduct. Hao Li, a pioneer of deepfake technology that can fabricate video,
has won numerous awards for his AI-based innovations in imaging technology. According to Retraction Watch, two of his articles published in ACM Transactions on Graphics
will be retracted due to the falsification of data. One of his articles is based on a presentation at the ACM computer graphics conference SIGGRAPH 2017 Real Time Live (the
recording is available on YouTube). In the presentation, Li and his colleagues showed that his software could generate a 3D image based on a picture taken with a webcam in just
a few seconds. However, later it was found that those 3D images were built and preloaded into the computer before the presentation. Li denied any wrongdoing, saying that
preloading the 3D images was allowed by the conference.

Full article:

YouTube video of Li’s presentation:

Li’s ACM articles: 

That’s my take on it: Despite winning the "Best in Show" award at the ACM conference, Li's presentation is a fraud! In fairness, Li's misconduct was not on the same scale
as Elizabeth Holmes'. Li had a working prototype and he made it appear to be more efficient, whereas Holmes lied about a promising blood-testing technology that never
existed and was physically impossible. Nonetheless, it is not unusual for high-tech companies to use the strategy of "fake it until you make it". For example, Microsoft in the
past announced several “vaporware” products that didn't exist in order to keep customers from buying well-developed technologies from competitors. In the same vein, many
companies use the buzzword "AI" in their product names, but whether the technology is truly AI remains to be determined.  

Posted on February 4, 2023

There has been a hot debate in academia about the use of ChatGPT. In December last year, ChatGPT was included as one of 12 authors on a preprint about using the tool
for medical education posted on the medical repository medRxiv. According to Nature, ChatGPT was cited as a bylined author in two preprints and two articles in science
and health published in January 2023. All of the articles list an affiliation for ChatGPT, and one even gives an email address for a supposed nonhuman "author".
Nature explained that the inclusion of ChatGPT as an author was a mistake and the journal will fix it soon. However, PubMed and Google Scholar have already indexed these
articles and these nonhuman "authors." Nature has since set forth a policy guiding how large-scale language models can be used in scientific publications, prohibiting naming
them as authors. To address this latest technological concern, recently the Journal of the American Medical Association (JAMA) updated its instructions for authors:
Artificial intelligence, language models, machine learning, and similar technologies are not eligible for authorship. When these tools are utilized to generate content or assist
in the writing or preparation of manuscripts, the authors are responsible for the integrity of the content generated by these tools and must clearly state the use of AI in the
Acknowledgment section or the Methods section of the paper.

That’s my take on it: It appears that faculty and student policies regarding ChatGPT are vastly different. The inclusion of any content generated by ChatGPT in a paper is
strictly prohibited by many universities and violation of the policy is treated as academic dishonesty. In contrast, JAMA accepts AI-generated content as long as the author
verifies the information and documents it in the Acknowledgment section or the Methods section of the paper. I guess it is based on the implicit assumption that mature adults
are more responsible than young students. In my opinion, it is not necessarily true. This type of "discriminatory" policy may eventually lead to discontent among students.
Rather than setting two sets of policies, it would be better to create one standardized policy for all and provide workshops on ethical AI use to both groups.

Posted on February 3, 2023

Yesterday (Feb. 2, 2023) an article posted on KDNuggets introduces ten free machine learning courses offered by top universities, including UC Berkeley, Carnegie Mellon
University, Stanford University, Caltech, Cornell University, University of Toronto, MIT…etc. It is noteworthy that these are not just one-hour seminars; rather, the duration
of these comprehensive courses is between 20 and 60 hours. More importantly, some of these courses are taught by very prominent scholars in the field, such as Andrew Ng.

Full article:

That’s my take on it: According to the May 2022 report compiled by the Institute for Advanced Analytics at North Carolina State University, there are about 353 graduate
programs in data science and machine learning in the US. Additionally, there are many free courses in the market and the preceding list is only the tip of the iceberg. No doubt
the competition is very intense, and therefore program designers must think outside the box to stay ahead of the curve. 

Posted on February 2, 2023

A week after ElevenLabs opened its voice-cloning platform to the public, the startup says it may need to rethink that openness amid increasing instances of voice-cloning misuse.
The ElevenLabs speech synthesis and voice cloning software modules can mimic any accent and speaking tone and can be used for newsletters, books, and videos. Piotr
Dabkowski, a former Google machine learning engineer, and Mati Staniszewski, an ex-Palantir deployment strategist, founded the company in 2022. After the software was found
to have been used to generate homophobic, transphobic, violent, and racist statements in the voices of celebrities, the company addressed the issue on Twitter.

Full article:

Posted on January 26, 2023

This morning I attended a seminar entitled “Debunking Data and Analytics Myths: Separating Fact from Fiction" hosted by the Ravit Show. The panel debunked ten
urban legends of data science with the following corrections:

1.     Big data is not just about volume, it's also about variety and velocity.

2.     Analytics is not just about finding insights, it's also about taking action on those insights.

3.     Data visualization is not just about making data look pretty, it's also about clearly communicating important information.

4.     Machine learning is not a magic solution for all problems, it's just one tool in the data scientist's toolbox.

5.     A/B testing is not just for online businesses, it can be used in offline settings as well.

6.     Data governance is not just about compliance, it's also about making sure data is accurate, accessible, and secure.

7.     Data privacy is not just about hiding data, it's also about giving individuals control over their own data.

8.     Predictive modeling is not just about forecasting the future, it's also about understanding the past and present.

9.     Data science is not just for tech companies, it's applicable to any industry.

10.  Data literacy is not just for data scientists, it's important for everyone in the organization to understand and use data effectively.

The panel also offered some valuable advice, such as “Think big, act small, and start fast! Don’t wait a month or three months!”


That’s my take on it: Even after debunking these misconceptions many times, I continue to encounter them in my teaching, research, and consulting work. In the past,
a researcher told me that big data analytics was irrelevant to his field because he equated big data with a larger sample size; his experiments used a small amount of
experimental data, not a large amount of observational data. My profession as a psychologist makes me aware of cognitive errors related to the baby duck syndrome:
a baby duck, when first exposed to another organism (e.g., its mother), tends to imprint on it and then follow it. Defending against misconceptions is like fighting a pandemic,
which means that people should be "vaccinated" as early as possible. Therefore, I recommend teaching data science concepts at the undergraduate level!

Posted on January 24, 2023

In spite of a mass layoff (10,000 employees), Microsoft recently announced a $10 billion investment in OpenAI, the company that developed ChatGPT and DALL-E 2.
Microsoft's investment will allow OpenAI to accelerate its research since all of its models are trained in Microsoft Azure. In return, Microsoft will receive a boost to its
Azure cloud and may even catch up with Amazon Web Services.

Full article:

That’s my take on it: Currently, Amazon Web Services dominates the cloud computing market. However, OpenAI can undoubtedly improve the functionality of Microsoft
Azure. While AWS does not have a powerful AI partner like OpenAI, its SageMaker provides powerful predictive modeling capabilities. A long time ago, Microsoft and
SAS Institute formed a partnership to offer cloud-based data analytics. It is my belief that this fierce competition in machine learning, cloud computing, and data science
will drastically change the landscape of these fields in the near future. Be sure to stay tuned!

Posted on January 20, 2023

Today I read an interesting article entitled “Is artificial intelligence a threat to Christianity?” posted on Patheos. The article contains many insightful points, and I will only
highlight one. According to Keith Giles, the author of the article, “In fact, this fear of creating an AI that is “more intelligent than humans” isn’t even what we should be
most afraid of. As one former top social media tech executive was quoted as saying in the excellent Netflix documentary, The Social Dilemma, we shouldn’t be afraid of
creating an AI that eventually exceeds human intelligence, what we should be afraid of is the fact that we’ve already created machine learning programs that know how
to overcome our human weaknesses.”


That’s my take on it: Last evening in my class I told my students that I like machine learning a lot. Machine learning has the ability to learn very quickly, as its name
implies. With the right data, the algorithm can improve, and it won't make the same error again. On the contrary, humans (including myself) are so stubborn that we let
our cognitive and emotional weaknesses affect our judgment and behavior. We fear AI partly because we are envious of it.   

Posted on January 20, 2023

With over 477 million items, Getty Images is one of the largest visual media companies in the world, offering stock images, videos, and music to business and individual clients.
Recently Getty Images announced that it is suing Stability AI, a company that enables users to generate images using its machine-learning software module, Stable Diffusion.
Getty Images accused Stability AI of training its algorithms by unlawfully extracting images from the Internet, including stock images owned by Getty. Getty claimed that the
company is not seeking financial damages or trying to stop the distribution of AI-art technology; rather, it attempts to push for laws and regulations that respect intellectual property.


That’s my take on it: Getty Images' reaction is understandable. It will not be necessary for illustrators or other users to buy stock images from Getty or other suppliers when
they are able to generate images using AI. For example, The Atlantic published a report by Charlie Warzel in 2022, right after the release of Midjourney, another AI art generation
program. The report was illustrated with two AI-generated images of Alex Jones, the founder of InfoWars. Later Warzel apologized. “This was entirely my fault…Instead of selecting a photo or illustration
from Getty Images to go with the story, as I do for most of my newsletters, I decided to try something different and use an AI art tool to come up with the story’s accompanying
image,” says Charlie Warzel.

It is interesting to note that Getty Images is not suing Midjourney and DALL-E 2. There is an obvious reason for omitting DALL-E 2. While Stability AI uses an open-source model,
OpenAI, which developed DALL-E 2, did not disclose its mechanics. In the absence of ample evidence, attorneys have a difficult time building a case. However, I don’t understand
why Getty Images is not targeting Midjourney. Do you know why?

Posted on January 18, 2023

Today Boston Dynamics, a leader in AI-enabled robotics, released a video clip on YouTube that shows how Atlas, an intelligent humanoid robot, navigates “his” environment.
“He” assisted “his” human partner by using available objects and modifying his path to reach “his” goal.


That’s my take on it: In this video, the robot is merely helping the construction worker, who is still doing the actual task. I believe that in the near future, the advancement
of AI and big data analytics will enable intelligent robots to replace humans in certain high-risk careers, such as monitoring the radiation levels in nuclear plants and sweeping
mines on battlefields. Last year the US Army provided one of its two robotic dogs, which was built by Boston Dynamics, to clear mines in Ukraine. You read that correctly. Only
one robotic dog! I guess it is still experimental. It would be great if this could be scaled up in the future so that no human lives would be lost. As shown in the video, Atlas' actions
indicate that an intelligent robot could evade threats better than humans.

Posted on January 16, 2023

A group of artists recently hired lawyers Matthew Butterick and Joseph Saveri to sue Stability AI and Midjourney, the developers of the artificial intelligence art generators
Stable Diffusion and Midjourney, respectively, as well as DeviantArt, which recently launched its own artificial intelligence art generator. They accused the AI generator
companies of profiting from their work by scraping their images from the web without their permission.
The law firms representing the artist group asserted that AI-generated art is a form of intellectual theft. “Even assuming nominal damages of $1 per image, the value of this
misappropriation would be roughly $5 billion (For comparison, the largest art heist ever was the 1990 theft of 13 artworks from the Isabella Stewart Gardner Museum, with a
current estimated value of $500 million)… Having copied the five billion images, without the consent of the original artists, Stable Diffusion relies on a mathematical process
called diffusion to store compressed copies of these training
images, which in turn are recombined to derive other images. It is, in short, a 21st-century collage tool,” says Matthew Butterick.

Full article:

That’s my take on it: Technically speaking, Stable Diffusion does not generate a picture by directly recombining existing images. The underlying principle of machine
learning is pattern recognition. In fact, AI art generators store no images whatsoever, but rather mathematical representations of patterns derived from images. In other
words, the software module does not stack together multiple images in the fashion of collaging. Rather, it creates pictures from scratch based on pattern generation.
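As a rough illustration of the idea, the sketch below shows the forward (noising) step commonly used in diffusion models, written in plain Python. It is a toy example on a short list of numbers, not Stable Diffusion's actual code; the function name `add_noise`, the parameter `alpha_bar`, and the four-value "image" are assumptions made for illustration. A trained model learns to reverse this step, so what it keeps are learned numerical weights rather than the pictures it was trained on.

```python
import math
import random


def add_noise(pixels, alpha_bar, rng):
    """Blend pixel values with Gaussian noise (forward diffusion step).

    alpha_bar = 1.0 leaves the image unchanged; alpha_bar = 0.0 yields pure noise.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * p + noise_scale * rng.gauss(0.0, 1.0) for p in pixels]


rng = random.Random(0)          # fixed seed so repeated runs match
image = [0.2, 0.8, 0.5, 0.1]    # hypothetical 4-pixel "image"
for alpha_bar in (1.0, 0.5, 0.0):
    print(alpha_bar, add_noise(image, alpha_bar, rng))
```

At `alpha_bar = 0.0` nothing of the original picture survives, which is why generation starts from random noise instead of from stored images.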

Even if AI art, as Butterick said, is just a 21st-century collage tool, collaging has been used by several well-known artists, such as Andy Warhol, and this practice is
widely accepted by the artist community. Warhol created art by recycling existing icons and images, including Marilyn Monroe, Prince, and Campbell's soup cans. Several years
ago, the Warhol Foundation was sued for allegedly infringing on copyright laws by basing a portrait of Prince on a prominent photographer's work. Nevertheless, a federal
district court judge ruled that Warhol's Prince series is "transformative" because it conveys a different message, and therefore is considered "fair use" under the Copyright Act.

Furthermore, Butterick and Saveri are also suing Microsoft, GitHub, and OpenAI over the Copilot AI programming model, which is trained by collecting source code from the
Web. Thus, this kind of dispute is not only about AI art, but also about the long-term development of the open-source community as a whole.   

Posted on January 13, 2023

Eight major Australian universities have announced that they have changed their assessment formats as a result of several cases in which students turned in papers
generated from ChatGPT. The University of Sydney, for instance, has revised its academic integrity policy to explicitly state that using artificial intelligence to create
content is cheating. The Australian National University has changed assessment designs, such as shifting emphasis to laboratory activities and fieldwork, as well as
using timed exams and oral presentations.

Full article:

That’s my take on it: This issue is not entirely new. Before the introduction of ChatGPT and other AI tools, Wolfram products, such as Mathematica and Wolfram Alpha,
were already capable of solving complex math problems. These tools are also used by students to cut corners, say math and statistics professors. The widespread availability of
Google and other search engines has led to many students turning in "instant" papers that reference many websites. Nonetheless, Wolfram, Google, and now ChatGPT
are here to stay. The solution is not to ban them. Instead, we should teach students how to use these tools ethically. 

Posted on January 9, 2023

ChatGPT, an OpenAI language model released on November 30, 2022, is capable of writing articles, generating code, and solving complex math problems. As
expected, the introduction of ChatGPT has triggered widespread resistance. On Jan 5, 2023, the International Conference on Machine Learning (ICML) announced
that it bans authors from using AI tools like ChatGPT to write scientific papers unless the produced text is a part of an experimental analysis. It is important
to point out that this ban applies only to the text generated entirely by AI-enabled language models, but does not apply to papers “coauthored” by humans and AI. 
In a similar vein, Stack Overflow also banned users from submitting answers created using ChatGPT last year, while the New York City Department of Education
blocked access to this tool just last week. 

“With a tool like this at their fingertips, it could muddy the waters when evaluating a student's actual writing capabilities because you're giving kids potentially
a tool where they could misrepresent their understanding of a prompt,” says Whitney Shashou, founder and advisor at educational consultancy Admit NY.

Full articles:

That’s my take on it: Any new technology could lead to some unintended consequences. As you might already know, some paper mill “companies” provide users
with “publishing” services. It is estimated that about 1% of articles archived in PubMed contain questionable content. With the advance of AI tools like ChatGPT,
it will be much easier for authors to produce instant articles. At the present time, I am unaware of any academic journal that prohibits submissions generated by AI.

ChatGPT also simplifies the process of writing for students. As of right now, my university does not have an academic honesty policy regarding artificial intelligence.

While Turnitin and SafeAssign can detect plagiarism, they cannot tell the difference between human-written and AI-generated text.

It is not my intention to oppose ChatGPT. As an initial research tool, I find this tool perfectly acceptable. Authors should, however, verify the information provided
rather than blindly trusting the results. I recommend that at least 80% of the final paper should be written by a human author to ensure its originality.

Posted on January 9, 2023

Why did Microsoft invest in R rather than Python or Octave?

That would be because of the Goldilocks principle in investing: if you invest in something, you want two things – potential (yield, profit, market share,…) and power (control).
R is relatively centralized. Beyond core R, which already comes with a pretty extensive amount of functionality, most of the things you use have been developed
by the same few dozens or so of highly prolific and amazingly skilled developers: Hadley Wickham and Dirk Eddelbuettel and Yihui Xie have pretty much developed
most of modern R as it is being used. It’s also a very widely used language, despite being a little clunky. Beauty is in the eye of the beholder, but I consider R to be
one of the uglier mainstream languages. In spite of this, it is very widely used in academia and enterprise settings, and while R itself isn’t particularly fast, you can
make it pretty impressively fast (but that’s a post for another day). R has pretty much displaced STATA, and as the slow generational change in science faculties
around the world plays out, students are increasingly encouraged to learn R instead of using slightly more digestible proprietary statistical packages like SPSS. It
helps that R has a spectacularly good front-end (RStudio) and its own way of literate programming with Rmd.

Octave is basically an open-sourced version of Matlab. It’s syntactically similar, which is why it has just about all the drawbacks of Matlab. What Octave doesn’t have
is an ecosystem that comes near R’s. When it comes to quantitative applications, if an algorithm or an analysis has been implemented at some point somewhere in
the known universe, there’s likely an R package for it. This includes some fairly esoteric stuff. You can’t say the same about Octave, sadly. Octave doesn’t have the
sophisticated package management infrastructure of R and CRAN. Its overall ecosystem is much smaller, by about two orders of magnitude (!). The potential in
embracing Octave, as well as the number of existing users, is quite small.

Python is the opposite. Python has immense potential, and everybody knows it. It’s just really, really hard to govern. While Python does have a central governing
body (the Python Software Foundation), a lot of quantitative tools are spread all over the place: NumFocus, Apache, Google, OpenAI, individual maintainers, and so
on. It’s also a much more general-purpose language: R is, deep down, about quantitative work. Python can be used to pretty much do anything you’d want a
modern computing language to do. It is, quite simply, too big and too diverse for any investment, even by a company as big as Microsoft, to have a noticeable impact.
It’s delightfully chaotic, which makes it fun, but hard to exert control over.

What it ultimately boils down to is the infectious population (because of course it does – ask an epidemiologist a question, expect a response in those terms!). Octave’s
population is just too small to create an Octave pandemic (thank the heavens). There aren’t enough people who know and love it to keep teaching it to others. Python’s
infectious population is too big: it’s like one of those commensal viral species like Epstein-Barr or CMV that pretty much everyone gets in their lifetime. R is “just right”
– it’s in the investment Goldilocks zone. It’s got potential, it’s still somewhat governable and you can make a meaningful investment in it with relatively reasonable resources.

That’s my take on it: Different corporations have different development strategies regarding open source. While Microsoft is investing in R, IBM focuses on Python. To be
more specific, although the extension hub of IBM SPSS Statistics allows users to download and install both R and Python packages, IBM incorporates only the Python library
into IBM SPSS Modeler. This Python library includes a plethora of tools, such as SMOTE, XGBoost, t-SNE, Gaussian Mixture, KDE, Random Forest, HDBSCAN, and Support
Vector Machine. But there is no R library in IBM Modeler. In addition, the IBM data science certification program is also Python-centric.

In spite of its popularity, the decentralization of Python, as Chris von Csefalvay pointed out, is a concern to me. First, you need to figure out which package you need for
a specific job and it could be confusing. Second, when you encounter issues in Python, it is very difficult to trace the source of the problem, especially when multiple packages
are involved.  

Last, I agree with Chris von Csefalvay that R is not pretty although RStudio provides users with a nicer front end. In my opinion, JASP, which is a graphical version of R, is
much more accessible. However, JASP is fairly new and its current version is 0.16.4. Not surprisingly, its graphical user interface is not as good as JMP Pro. JMP Pro is a
mature SAS product and its current version is 17.

Posted on December 18, 2022

Hi, all, today I delivered a talk on dynamic visualization (see below) at a conference. Statistical graphs are not new. The keyword for this presentation is "dynamic." 
Specifically, a good visualization system should enable the user to alter the display by asking "what-if" questions. There are hyperlinks to dynamic graphs on the PDF.
You can click on them to explore the data. Thank you for your attention. Merry Xmas and Happy New Year!
Yu, C. H. (2022, December). Dynamic data visualization for pattern-seeking and insightful discovery. Paper presented at 2022 IDEAS Global AI Conference. Los Angeles, CA.

Posted on December 16, 2022

In December 2021 an article in Forbes predicted the emerging trends of AI in the near future. A year later it was found that seven out of ten predictions were exactly right
or on the right track:

1.     Language AI will take center stage, with more startups getting funded in NLP than in any other category of AI: Right.

2.     Databricks, DataRobot, and Scale AI will all go public: Wrong.

3.     At least three climate AI startups will become unicorns: Wrong.

4.     Powerful new AI tools will be built for video: Right.

5.     An NLP model with over 10 trillion parameters will be built: Wrong.

6.     Collaboration and investment will all but cease between American and Chinese actors in the field of AI: Right.

7.     Multiple large cloud/data platforms will announce new synthetic data initiatives: Right.

8.     Toronto will establish itself as the most important AI hub in the world outside of Silicon Valley and China: Right.

9.     “Responsible AI” will begin to shift from a vague catch-all term to an operationalized set of enterprise practices: Rightish.

10.  Reinforcement learning will become an increasingly important and influential AI paradigm: Rightish.

Full article:

That’s my take on it: I could write a 10-page essay to respond to each of the preceding predictions, but in this short post I will focus on the fulfillment of Prediction # 10 only. Reinforcement learning was inspired by the reinforcement theory in behavioral psychology. According to behaviorism, our behaviors are governed by the stimulus-response (S-R) loop, meaning that how we act or respond depends on what stimulus or feedback we received from the environment. If the feedback is rewarding, it reinforces good behaviors. If not, we avoid detrimental behaviors. When I was a student, most scholars looked down upon behavioral psychology for its over-simplicity. However, a few decades later AlphaZero, AlphaGo, and AlphaStar (Google’s DeepMind), which defeated human Go and chess experts and video game players, are all based on this allegedly over-simplistic model. Nvidia, the leader in the GPU market, also used reinforcement learning to design its new cutting-edge H100 chips. The moral of the story is: We need to keep an open mind to alternate theories. 
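
To make the reward-feedback loop concrete, here is a minimal sketch of an epsilon-greedy agent on a three-armed bandit. It is a toy illustration only and not how AlphaGo or Nvidia's chip design works; the payout probabilities, the epsilon value, and the function name run_bandit are assumptions made for this example.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a multi-armed bandit (hypothetical toy setup)."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # how often each arm was pulled
    values = [0.0] * len(reward_probs)  # running estimate of each arm's reward
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(reward_probs))                 # explore
        else:
            arm = max(range(len(values)), key=values.__getitem__)  # exploit
        # The environment responds with a reward (the "stimulus").
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean: rewarded actions see their estimate rise.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

counts, values = run_bandit([0.2, 0.5, 0.8])
print("pulls per arm:", counts)
print("estimated rewards:", [round(v, 2) for v in values])
```

In this setup the agent should end up pulling the most rewarding arm far more often than the others, which is the reinforcement idea in miniature: behavior shifts toward whatever the feedback rewards.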

Posted on December 9, 2022

During the past month, Lensa AI, created by Prisma Labs, has taken social media by storm. This app has been around since 2018, but recently its new feature “Magic Avatars”
has drawn a lot of attention. In the past week, Lensa AI became the most popular app in the iOS App Store and has been downloaded 700,000 times in the past month. Why is it so
popular? The new feature is amazing! If you upload 10-20 photos of yourself to the system, the deep learning algorithm can create several digital versions of yourself. This
algorithm is based on Stable Diffusion, an AI-powered program trained on a data set consisting of over two billion images. It is important to point out that the app might collect
your behavioral data, and thus IT security experts suggest using it cautiously.

One-minute discussion about safety concerns on YouTube:

That’s my take on it: The impact of AI is beyond data analytics; instead, its influence can be found in every discipline, including visual arts and mass communication. Besides
privacy concerns, skeptics argue that these machine learning programs are trained with many existing images on the Internet, but those artists are not compensated at all.
While contributors to the open-source community voluntarily share their source codes with the whole world, un-compensated artists are forced to accept this quasi-open-source
model. However, for me, it is acceptable because these programs didn’t “plagiarize” anyone’s work. Rather, they “learn” from other images and then create a new one based on
the references. Isn’t that what we are doing in every type of work? When I write a research paper, I usually use 30 to 50 references but don’t pay those authors.  

Posted on December 8, 2022

Today I attended the last session of “Statistical wars and their casualties.” One of the speakers is Aris Spanos (Virginia Tech) and the title of his presentation is “Revisiting the two cultures
in statistical modeling and inference.” In the talk he outlined several statistical paradigms as follows:

1.     Karl Pearson’s descriptive statistics

2.     Fisher’s model-based statistical induction

3.     Graphical causal modeling

4.     Non-parametric statistics

5.     Data science and machine learning

At the end he discussed the difference between the Fisherian school and the data science approach: the paradigm shift from the Fisherian school to data science “reflects a new answer to
the fundamental question: What must we know a priori about unknown functional dependency in order to estimate it on the basis of observations? In Fisher’s paradigm the answer was
very restrictive – one must know almost everything…machine learning views statistical modeling as an optimization problem relating to how a machine can learn from the data.”

Nonetheless, Dr. Spanos warned against overhyping data science. For him doing data science is returning to the Pearsonian tradition that emphasizes describing the data at hand. Many
people go into the discipline by learning Python without knowing statistical details. As a result, data science became a black box, and thus he is afraid that many decades later we will try
to figure out what went wrong again.

In his talk entitled “Causal inference is not statistical inference,” Jon Williamson (University of Kent) asserted that a broader evidence base from triangulation is more important than
successful replication of the results because successful replication might replicate the bias in previous studies.

Seminar website:

That’s my take on it:

1. I agree that the Fisherian model-based approach is very restrictive because it assumes you know to which theoretical sampling distribution the sample belongs. However, I would
compare data science and machine learning (DSML) to the school of exploratory data analysis (EDA) founded by John Tukey and the resampling approach developed by Efron et al., rather
than the Pearsonian legacy. By unpacking the philosophy of these paradigms, one can see that both DSML and EDA emphasize pattern-seeking, and today resampling methods, such as
cross-validation and bootstrapping, are embedded in many DSML methods.

2. We should do both triangulation and replication. I don’t think one is more important than the other. Machine learning is a form of internal replication in the sense that the data set is
partitioned into numerous subsets for repeated analyses. In boosting, the subsequent models can correct the bias of the previous models, and thus this type of replication will not inherit
the bias. 
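
The two resampling ideas mentioned above, bootstrapping and cross-validation, can be sketched in a few lines. This is only an illustration using the Python standard library; the sample values and the function names bootstrap_ci and k_fold_indices are made up for this example and do not come from any of the talks.

```python
import random
import statistics

def bootstrap_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lo_idx = round(n_resamples * alpha / 2)
    hi_idx = round(n_resamples * (1 - alpha / 2)) - 1
    return means[lo_idx], means[hi_idx]

def k_fold_indices(n, k):
    """Partition the row indices 0..n-1 into k disjoint validation folds."""
    idx = list(range(n))
    return [idx[i::k] for i in range(k)]

# Hypothetical sample, for illustration only.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
low, high = bootstrap_ci(sample)
folds = k_fold_indices(len(sample), k=5)
print("95% bootstrap CI for the mean:", round(low, 2), "to", round(high, 2))
print("validation folds:", folds)
```

Each fold would serve once as the validation set while the model is fit on the remaining rows, which is the sense in which the data set is analyzed repeatedly.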

Posted on December 8, 2022

A week ago OpenAI released its new AI-enabled chatbot named ChatGPT. In just five days after the announcement, a million users signed up for the service. OpenAI was founded
in 2015 by Elon Musk and a few others, but today Musk is no longer on OpenAI’s board while remaining a major donor. In 2019 Microsoft invested US$1 billion into OpenAI to boost
its research, and its most fruitful research product is GPT, which stands for Generative Pre-Trained Transformer. ChatGPT can respond to virtually any request through natural language
processing, such as writing a poem, an essay, or a computer program.


That’s my take on it: Will ChatGPT replace Google’s search? Now it is still too early to tell. Will ChatGPT totally replace human writers and other types of workers (e.g. journalists,
customer support personnel…etc.)? Probably not. But it can definitely help us get things done efficiently. The following are some sample instructions/questions that I input into ChatGPT.
The answers are stunningly concise and accurate! If you are a student, you can paste the question from the assignment into the chat box, and the AI system will write the essay for you!   

My instruction/question 1: Explain the difference between cognitive psychology and neuroscience.

Neuroscience, on the other hand, is a field that focuses on the structure and function of the nervous system, including the brain and the rest of the nervous system. Neuroscientists use
a wide range of techniques, including brain imaging and electrophysiology, to study the brain and nervous system at the cellular and molecular level. While cognitive psychology is concerned
with understanding mental processes and behavior, neuroscience is more focused on the underlying biological mechanisms that give rise to those processes and behaviors.

In short, cognitive psychology is concerned with the study of the mind, while neuroscience is concerned with the study of the brain and nervous system. Both fields are closely related and
often inform each other, but they have different focuses and approaches to studying the mind and behavior.

The main disadvantage of R is that it can be more difficult to learn and use than Python. R has a more complex syntax than Python, and it is not as widely used outside of the fields of
statistics and data analysis. This can make it more challenging to find help and support when you are learning to use R, and it can also make it harder to integrate R with other
programming languages or tools. Additionally, R can be slower than some other languages, which can make it less suitable for certain types of real-time or high-performance applications.

Posted on December 3, 2022

Two days ago, I attended the third session of “The Statistics Wars and Their Casualties.” One of the speakers, David Hand, asserted that the problem of statistical
significance is its misuse and misinterpretation. He cited several examples to argue that procedures in different schools of thought can be equally misused.
For example, the mean is sensitive to extreme values while the median is resistant against outliers. Which one is more appropriate depends on the research
question. Arithmetic means can be misused, but we should not ban arithmetic means. In a similar vein, in data science, the area under the curve (AUC) and the error
rate derived from the same predictive model can lead to contradictory conclusions. But we should not call for a ban on using such tools just because they are misused
or misinterpreted.
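
The point about the mean and the median is easy to verify. The numbers below are hypothetical and chosen only to show how a single extreme value moves the mean while the median barely changes.

```python
import statistics

# Hypothetical incomes (in thousands), for illustration only.
incomes = [42, 45, 47, 50, 52]
with_outlier = incomes + [1000]  # add one extreme value

print("without outlier:", statistics.mean(incomes), statistics.median(incomes))
print("with outlier:   ", statistics.mean(with_outlier), statistics.median(with_outlier))
```

Neither statistic is wrong here. Which one is appropriate depends on whether the research question is about the total amount or about the typical case.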

Link to Stat Wars website:

That’s my take on it:

1.     The problem of statistical significance is more than its misuse. Indeed, there are other inherent problems. For example, the alpha level (0.05) is nothing
more than a social convention. Unlike the Bayesian approach, which yields a distribution of answers and promotes probabilistic reasoning, the frequentist approach leads to a
dichotomous conclusion. While it is true that eventually we need to make a dichotomous decision, the problem of statistical significance is that even the evidence is
dichotomous! Nonetheless, I agree that we cannot go that far to ban it. For small-sample studies, classical statistics is still valid and handy.

2.     Data science is less error-prone and less likely to be misused. Traditional parametric statistical methods require many assumptions. In contrast, most data
science methods are non-parametric; they are robust against outliers and noise; they can detect non-linear patterns. More importantly, ensemble methods and machine
learning are capable of doing self-replication by partitioning the data set into sub-samples and running multiple models, thus alleviating the replication crisis found in
traditional statistics.

Posted on December 2, 2022

A month ago JASP, the graphical shell of the R language, released version 0.16.4. Today I attended a workshop to learn about the new and enhanced features of JASP.
The enhanced module includes several powerful tools belonging to different schools of thought. For example,

·      The frequentist school (Fisher/Pearson): Generalized linear models

·      The Bayesian school: Bayesian repeated measures ANOVA

·      The data science and machine learning school: Density-based clustering

Link to JASP:

That’s my take on it:

1.     Many statistical learners are torn between learning statistics and learning coding. In my opinion, this tension is unnecessary. The GUI of JASP is so user-friendly that
analysts can focus on data analysis rather than struggling with the R syntax.

2.     JASP is semi-dynamic and interactive. Unlike SPSS which produces a frozen output, JASP allows the user to add or remove information in the output by changing
options in the input. But unlike JMP, Tableau, and SAS Viya, you cannot directly manipulate the output. JASP can now load data directly from databases like IBM DB2,
Oracle, MySQL, MariaDB, Postgres, SQLite, and any database supporting the ODBC interface.

3.     Yesterday I attended a seminar entitled “The Statistics Wars and Their Casualties.” As the title implies, there was a heated debate centering on the use and misuse
of statistical significance. In my opinion, it should not be an either-or situation. As mentioned before, JASP provides analysts with different approaches; the procedures are
grouped and clearly labeled: classical, Bayesian, and Machine learning. Pick whatever you see fit! 

Posted on November 18, 2022

A few days ago, Nvidia, the pioneer of graphics processing units (GPUs), announced its new partnership with Microsoft in co-developing AI cloud-based computing.
Specifically, Nvidia will utilize Azure, the cloud platform of Microsoft, to develop advanced generative AI models that can create content, including code, images,
and video clips.

Full article:

That’s my take on it: Currently, Nvidia is the world’s second-largest semiconductor company (behind TSMC), whereas Microsoft is second to Amazon Web Services
in cloud computing. It is logical for them to form such a joint venture in order to compete with the number one in the market. In the past, computer users were confined
to the Wintel monopoly (Microsoft Windows and Intel CPU). However, in the era of big data analytics, AI, and cloud computing, it is anticipated that data analysts will be able to
choose among many options. 

Posted on November 12, 2022

Recently NVIDIA, the leader in graphics processing units (GPUs) and one of the leaders in AI research, announced a new approach to AI-enabled text-to-image
generation named eDiff-I. Currently, the three prominent leaders in the text-to-image market are Midjourney, DALL.E-2, and Stable Diffusion. As the name
implies, Stable Diffusion is based on diffusion modeling. Under this mode, an initial image is created with random noise. Next, through an iterative process,
a sharp and sensible image is gradually created by denoising the entire noise distribution. While Stable Diffusion’s denoising is based on a single noise distribution,
NVIDIA goes one step further using an ensemble of multiple expert denoisers.  

Additionally, while users of Midjourney, DALL.E-2, and Stable Diffusion have limited control of the output image, eDiff-I allows users to paint with text, i.e., specify
objects in different areas of the canvas.

YouTube Video (7 minutes):

That’s my take on it: The logic of diffusion modeling is similar to several older statistical procedures. For example, K means clustering randomly selects centroids
and then fine-tunes the clustering patterns through multiple iterations. In contrast, the logic of eDiff-I is closer to that of data science and machine learning. The
ensemble method, an extension of resampling, is utilized in boosting and bagging. Rather than drawing a conclusion based on a single model, the ensemble method
combines the results of multiple models into the final output.
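
The K-means logic described above (start from randomly chosen centroids, then refine them iteratively) can be sketched as follows. This is a bare-bones illustration in plain Python, not a production implementation; the six data points and the function name kmeans are assumptions made for this example.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Bare-bones K-means on 2-D points with randomly chosen initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random starting centroids
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda j: (x - centroids[j][0]) ** 2 + (y - centroids[j][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                cx = sum(p[0] for p in members) / len(members)
                cy = sum(p[1] for p in members) / len(members)
                new_centroids.append((cx, cy))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # no change, so the iteration has settled
            break
        centroids = new_centroids
    return centroids

# Two hypothetical, well-separated groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids = sorted(kmeans(points, k=2))
print(centroids)
```

As with diffusion modeling, the starting point is random and the structure emerges through repeated refinement. On messier data the result can depend on the random start, which is one reason why practitioners often run the procedure several times.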

I admire NVIDIA because its CEO/founder has a vision. Currently, NVIDIA is a one-trick pony, but it cannot be the leader of GPUs forever. Using its strength in
graphics processing to invest in a less-crowded AI domain (text-to-image generation) is definitely a smart move! 

Posted on November 11, 2022

Two days ago, Meta (formerly Facebook) announced a massive layoff in the company, and as a result, 11,000 employees were terminated. Meta’s CEO
Mark Zuckerberg said that he planned to consolidate the company’s resources into a few high-priority growth areas, such as the AI discovery engine
while giving up other less promising research endeavors. For example, the entire team named “Probability” was eliminated. The team was composed
of 19 people doing Bayesian Modeling, nine people doing Ranking and Recommendations, five people doing ML Efficiency, and 17 people doing AI for
Chip Design and Compilers. A former team member said it took seven years to assemble such a fantastic team.

Full article:

That’s my take on it: I don’t worry about brain drain from the US to other countries. The US is still a magnet that attracts top-tier AI researchers
and data scientists worldwide. Those former Meta researchers will likely be recruited by other high-tech giants, such as Google and Apple. Last year
Professor Michael Gofman at the University of Rochester spotted a trend that high-tech titans and startups have lured many DSML professors away from
their faculty positions. Consequently, the knowledge gap between academia and industry was widened; transferring essential knowledge to students
and colleagues was affected. Current massive layoffs in Meta, Twitter, and other high-tech giants might be an opportunity for colleges and universities
to absorb those highly competent researchers. 

Posted on November 8, 2022

As you might already know, recently SAS Institute released the new version of JMP and JMP Pro (Version 17). There are many powerful and handy new features, such as

·      Workflow Builder

·      Easy design of experiment

·      Easy search

·      Spectral analysis in the functional data explorer

·      Genomics and wide fitting

·      Generalized linear mixed model

·      Interactive power analysis

·      Preview of joining, transposing, and data reshaping

That is my take on it: I especially like the preview feature in data reshaping (e.g., concatenate, join, stack, split, transpose…etc.). In the past, no matter
whether you used a graphical user interface or coding, you could see the result only after hitting the OK or Run button. If something went wrong, you had
to debug it and redo the whole procedure. Not anymore! Now I can literally look at the result before submitting the job.

Interactive power analysis is another wonderful feature. G*Power is very popular among researchers because it is free and user-friendly. The drawback
is that if you want to explore different options, you have to go back and forth between the input and the output. Although G*Power can output a graph
showing power on the Y-axis and N on the X-axis, the ranges are pre-determined by your input. Not anymore. In JMP you can use sliders to adjust the
effect size and the sample size, and then the power is updated in real-time!  
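
For readers who want to see how power responds to the effect size and the sample size, here is a rough sketch based on the normal approximation for a two-sided, two-sample comparison of means. It is not the algorithm used by JMP or G*Power, which rely on exact noncentral distributions; the function name approx_power and the chosen values are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)            # critical value, about 1.96 for alpha = .05
    shift = effect_size * sqrt(n_per_group / 2)  # expected value of the test statistic
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Vary N for a medium effect (Cohen's d = 0.5), as one would with a slider.
for n in (20, 40, 64, 100):
    print(n, round(approx_power(0.5, n), 3))
```

Looping over effect sizes instead of sample sizes gives the other slider. The approximation is slightly optimistic for small samples, so the numbers should be treated as ballpark figures.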

I always tell my students: The world keeps changing. If you cannot change the world, at least you change with the world! I will continue to
explore those new features to make my tasks more effective and efficient.

Posted on October 21, 2022

Today (10/20) is the second day of the 2022 Scale Transform X Conference. I would like to share one of the most informative presentations at this conference
with you. The title of the lecture is “Looking at AI through the lens of a chief economist” and the presenter is John List, Kenneth Griffin Distinguished Service
Professor in Economics at the University of Chicago and the Chief Economics Officer at Uber. His specialty is behavioral economics, a sub-domain of economics
that applies psychological theories to study human behaviors related to financial decisions. In this talk, he pointed out that scalability is a major challenge to
behavioral economics. Specifically, very often false positives caused by statistical artifacts in a small-scale study mislead the decision-maker into prematurely
expanding the program, and in the end, the scaled-up program fails miserably.

Conference website:

That’s my take on it: The problem of scalability in behavioral economics is similar to the replication crisis in psychology: the results of many research studies
are difficult to reproduce in other settings. If a model is overfitted to a particular sample, its generalizability is severely limited. I am glad to see that Dr. John
List is willing to utilize big data to tackle this problem. On the contrary, some psychologists are still skeptical of data science methods. Once a psychologist said
to me, “Big data is irrelevant!”  After all, behavioral economics could be conceptualized as an interdisciplinary study that integrates both psychology and economics.
If big data can be applied to behavioral economics, why can’t other disciplines?

Next time if I receive an apology from Uber after a bad ride, I will not reuse the service immediately. After a few days, Uber might send me a promo code in order
to win me back!

Posted on October 20, 2022

Today (Oct 19, 2022) Meta announced the first AI-powered speech-to-speech translator on earth. Unlike traditional translation systems that focus on written
languages only, Meta’s universal speech translator is capable of translating Hokkien, a dialect used by over 49 million Chinese people in the world, to English
and vice versa. In the future, Meta will expand this system to cover 200 languages. The ultimate goal is to enable anyone to seamlessly communicate with
each other in their native language.

Demo on YouTube:

That’s my take on it: Interestingly, many AI companies set the same goal: enabling all users. In a lecture entitled “A vision for advancing the democratization
of AI,” Emad Mostaque, founder and CEO of Stability AI, asserted that AI-powered image generators, such as Stable Diffusion, can “democratize” our society in
many ways. Specifically, armed with AI-powered image generators, anyone can create stunning graphics without formal art training. Put bluntly, AI tools can lift up

When I studied theology, the most challenging subject matters were the Hebrew and Greek languages. You have to be gifted in linguistics in order to be proficient
in biblical hermeneutics, but unfortunately, I failed to master either one of these two languages. This is a good analogy: “Reading the Bible without knowing Greek
and Hebrew is like watching a basic television while reading the Bible knowing Greek and Hebrew is like watching an 85" UHD 8K television with stereo surround
sound.” Nevertheless, in our lifetime we may see a real-life “Star Trek” universal translator that can remove all language barriers! 

Posted on October 11, 2022

As you might already know, DALLE-2, one of the most advanced AI-enabled graphing programs, is open to the public now. Like Midjourney and Stable Diffusion,
DALLE-2 is capable of generating art and photo-realistic images from a command given in natural language. Yesterday (Oct 10) a photographer named Umesh
Dinda posted a comparison of partial background removal and reconstruction of an image between Adobe PhotoShop and DALLE-2. Photoshop has been the king
of image processing for several decades due to its rich features. One of its amazing features is “content-aware fill”, which allows photographers to replace any
part of the photos based on the surrounding content. However, after watching Dinda’s YouTube video, I must admit that DALLE-2 has dethroned PhotoShop in
certain functionalities. While the result of PhotoShop looks “cheesy,” the product of DALLE-2 is so flawless that your naked eyes cannot tell the photo has been retouched.

Posted on October 9, 2022

Two days ago (Oct 6) six leading robotics companies, including Boston Dynamics, Agility Robotics, ANYbotics, Clearpath Robotics, Open Robotics, and Unitree, signed an
open letter pledging not to weaponize their products. They state, “As with any new technology offering new capabilities, the emergence of advanced mobile robots offers
the possibility of misuse. Untrustworthy people could use them to invade civil rights or to threaten, harm, or intimidate others… We pledge that we will not weaponize our
advanced-mobility general-purpose robots or the software we develop that enables advanced robotics and we will not support others to do so."

That’s my take on it: In the open letter they also state, “to be clear, we are not taking issue with existing technologies that nations and their government agencies use
to defend themselves and uphold their laws.” However, without support from major US robotics firms, the development of AI-based weapons in the US will slow down.
Perhaps my position is unpopular. Will governments and high-tech corporations of hostile countries face the same limitations? History tells us that any unilateral disarmament
often results in more aggression, instead of peace (Remember Neville Chamberlain?).

Two years ago the New York City Police Department (NYPD) utilized the Spot model from Boston Dynamics to support law enforcement, including a hostage situation in the
Bronx and an incident at a public housing building in Manhattan. Unfortunately, these deployments caused an outcry from the public, and as a result, the NYPD abruptly
terminated its lease with Boston Dynamics and ceased using the robot. If “robocops” can save the lives of innocent people and reduce the risk taken by police officers, why
should we object to it?

Posted on September 24, 2022

Yesterday (September 23, 2022) an article published in Nature introduced the Papermill Alarm, a deep learning software package that can detect text in articles similar to
that found in paper mills. Through the Papermill Alarm, it was estimated that about 1% of articles archived in PubMed contain this type of questionable content. There are
several existing plagiarism detection software tools in the market, but this approach is new because it incorporates deep learning algorithms. Currently, six publishers,
including Sage, have expressed interest in this new tool.

Full article:

That’s my take on it: If this tool is available in the near future, I hope universities can utilize it. Although there are several plagiarism checkers, such as Turnitin and
SafeAssign, in the market, today some sophisticated writers know how to evade detection. No doubt deep learning algorithms are more powerful and sensitive than
conventional tools.

Nonetheless, I think there is room for expansion in using deep learning for fraudulent paper detection. Currently, the scope of detection of the Papermill Alarm is limited
to text only. As a matter of fact, some authors duplicated images from other sources. As the capability of machine learning advances rapidly, image sleuths may also be
automated soon.  

Posted on September 21, 2022

Yesterday (September 20, 2022) in the article entitled “Data: What It Is, What It Isn’t, and How Misunderstanding It is Fracturing the Internet” President of Global Affairs at
Meta Nick Clegg argued that data should not be treated as the “new oil” in the era of big data. Unfortunately, public discourse about data often relies on faulty
assumptions and analogies of this type, resulting in digital localization and digital nationalism. First, unlike oil, data are not finite. The supply of new data is virtually unlimited and the
same data can be re-analyzed. Second, more data do not equate to more value; rather, the value depends on how the data are utilized. For instance, a database about
people’s clothing preferences is much more important to a fashion retailer than it is to a restaurant chain. Third, data values depreciate over time, i.e., outdated data are
useless or less valuable. More importantly, data access is democratized, not monopolized. For example, every month more than 3.5 billion people use Meta’s apps,
including Facebook, Instagram, WhatsApp, and Messenger, for free! Taking all of the above into consideration, Clegg argued that democracies must promote the idea of the
open Internet and the free flow of data.

Full article:

That’s my take on it: The notion “data is the new oil” originates from British mathematician Clive Humby in 2006.  This idea is true to a certain extent. For example,
in the past Google’s language model outperformed its rivals by simply feeding more data to its machine learning algorithms. This “brute force” approach is straightforward:
pumping more “fuel” into the data engine, and it works! Nonetheless, it is also true that more data do not necessarily generate more value. Old data could depreciate,
but even new data are subject to the law of diminishing returns. Democratization of data access and user-generated content is both a blessing and a curse. True, usable data
are abundant and limitless, but so are bad data and misinformation!

Posted on September 14, 2022

In order to plant the seeds for prospective users, software vendors, such as Amazon Web Services, SAS Institute, Salesforce, and IBM, have been giving free resources to higher
education for teaching and research purposes. Recently I started reviewing Amazon SageMaker Studio and its textbook “Dive into Deep Learning” (Zhang, Lipton, Li, & Smola).
The following is a direct quotation from Chapter 1: “We are experiencing a transition from parametric statistical descriptions to fully nonparametric models. When data
are scarce, one needs to rely on simplifying assumptions about reality in order to obtain useful models. When data are abundant, this can be replaced by nonparametric
models that fit reality more accurately. To some extent, this mirrors the progress that physics experienced in the middle of the previous century with the availability of computers.
Rather than solving parametric approximations of how electrons behave by hand, one can now resort to numerical simulations of the associated partial differential equations. This
has led to much more accurate models, albeit often at the expense of explainability.”

Full text:

That’s my take on it: Amen! When I was a graduate student, it was very common for statisticians to conduct research using Monte Carlo simulations: by simulating numerous
poor conditions and assumption violations (e.g., small sample size, non-normal distributions, unequal variances…etc.), we can tell whether a certain test procedure is robust.
Frankly speaking, for a long time, I have been skeptical of parametric tests and whether doing simulations is a good use of research resources. Due to the requirement of certain
assumptions, parametric tests are very restrictive and “unrealistic” (We use “clean data” that meet the assumptions, and then infer the finding from the ideal sample to the messy
population). Several years ago, I discussed many alternatives to parametric tests, including data mining and machine learning, in the following article:

I have just updated the webpage based on that book.
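
For readers who have never run such a robustness study, the following is a minimal sketch of the Monte Carlo approach described above. It is an illustration under stated assumptions, not a reconstruction of any particular published simulation: the group sizes (10 and 40), the standard deviations, the number of replications, and the function names are all hypothetical choices of mine. It estimates how often the pooled-variance t-test falsely rejects a true null hypothesis when the equal-variance assumption holds and when it is violated.

```python
import math
import random

N_A, N_B = 10, 40      # hypothetical, deliberately unequal group sizes
T_CRIT = 2.0106        # two-sided 5% critical value of Student's t, df = N_A + N_B - 2 = 48

def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def pooled_t(sample_a, sample_b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * sample_variance(sample_a)
                  + (n_b - 1) * sample_variance(sample_b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (sum(sample_a) / n_a - sum(sample_b) / n_b) / std_err

def type_i_error_rate(sd_a, sd_b, n_sims=4000, seed=1):
    """Share of simulated null data sets in which the test rejects at the 5% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Both groups have a true mean of 0, so every rejection is a false alarm.
        a = [rng.gauss(0, sd_a) for _ in range(N_A)]
        b = [rng.gauss(0, sd_b) for _ in range(N_B)]
        if abs(pooled_t(a, b)) > T_CRIT:
            rejections += 1
    return rejections / n_sims

rate_equal_var = type_i_error_rate(sd_a=1.0, sd_b=1.0)    # assumption met
rate_unequal_var = type_i_error_rate(sd_a=3.0, sd_b=1.0)  # small group is noisier
print(f"equal variances:   {rate_equal_var:.3f}")
print(f"unequal variances: {rate_unequal_var:.3f}")
```

If the test were robust, both rates would stay near the nominal 5%. In this setup the second rate should come out noticeably higher, which is the kind of evidence such simulations are meant to produce.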

Posted on September 5, 2022

Recently an artist named Jason Allen won the first prize for the category of digital art in the Colorado State Fair’s fine arts competition. However, many people are resentful of Allen’s victory,
because he admitted on Twitter that his picture was generated by an AI program called Midjourney. The production process by Midjourney, which is equipped with natural language processing,
is very user-friendly. In the command prompt, the user simply types a sentence, such as “a beautiful princess in a medieval castle”, and then the program can output several variants of the
picture according to the input.

Allen submitted a piece entitled “Théâtre D'opéra Spatial” after 900 iterations of the digital art. During the art competition, the judges didn’t realize that his art was created with AI, but they
also said that Allen didn’t break any rules.

Many Twitter users have different opinions. Twitter user OmniMorpho wrote, “We're watching the death of artistry unfold right before our eyes — if creative jobs aren't safe from machines, then
even high-skilled jobs are in danger of becoming obsolete.” Another Twitter user, Sanguiphilia, said, "This is so gross. I can see how AI art can be beneficial, but claiming you're an artist by
generating one? Absolutely not. I can see lots of kids cheating their way through assignments with this."

Allen bluntly proclaimed, "Art is dead, dude. It's over. A.I. won. Humans lost."

Full report:

That’s my take on it: When I was a kid, I was forbidden by my parents to use a calculator because pressing buttons was not considered doing real math. Similar controversies recurred when
other new technologies were introduced (e.g., computers, digital photography…etc.). The massive protest against Allen’s victory is understandable. Traditionally, a skill is conceptualized as an
ability to perform a complicated activity that requires rigorous training. If anyone can do the job without going through professional training, such as talking to a computer, this so-called “skill”
is not highly regarded. Nonetheless, there are still many gray areas. One may counter-argue that the big idea in the head is more important than the implementation skill in the hand. For
example, in the past, it took a skillful wildlife photographer to manually focus on a fast-moving subject, but today digital cameras can automatically track the subject. What you need to do is
just be there to push the shutter. By the same token, if AI can cut down the production process from 10 hours to 10 minutes, the artist can spend more time on creative ideas.

Do I completely hand over my creative process to AI? I didn’t go that far. As a photographer, I still make “real” photos, and at most I only replace boring backdrops with digital backgrounds
generated by Midjourney. The following are some examples (1-8: with digital backgrounds; 9-11: with original blank backdrops). Am I an artist? You be the judge.

Posted on September 2, 2022

On August 30, Komprise announced the results of its 2022 Unstructured Data Management Report. The following are the key findings:

·      “More than 50% of organizations are managing 5 Petabyte or more of data, compared with less than 40% in 2021.” (1 petabyte = 1,024 terabytes, or roughly 1 million gigabytes)

·      “Cloud storage predominates: Nearly half (47%) will invest in cloud networks. On-premises only data storage environments decreased from 20% to 11.9%.”

·      “The largest obstacle to unstructured data management (42%) is moving data without disrupting users and applications.”

·      “A majority (65%) of organizations plan to or are already investing in delivering unstructured data to their new analytics / big data platforms.”

Full text:

That’s my take on it: As you might already know, structured data are referred to as data stored in row-by-column tables, whereas unstructured data are referred to as open-ended textual data,
images, audio files, and movies that cannot be managed and processed by traditional relational databases. Structured data are highly compressed based on the assumption that complicated reality
can be represented by abstract numbers. In response to this narrow view of data, qualitative researchers argued that open-ended data could lead to a rich and holistic description of the phenomenon
under study. In business, collecting, storing, and analyzing unstructured data has become an irreversible trend, and thus many powerful tools have been developed to cope with this “new normal.”
But in academia, quite a few recent qualitative research books still omit text mining, computer vision, and other latest developments of machine learning for unstructured data processing. There are
gaps to be filled!

Posted on September 1, 2022

In a contentious article entitled “Spirals of delusion: How AI distorts decision-making and makes dictators more dangerous,” which will be published in the upcoming issue of Foreign Affairs,
prominent political scientists Henry Farrell, Abraham Newman, and Jeremy Wallace discussed how democracies and totalitarian regimes are facing challenges from AI and machine learning
in different ways.

In an open society, machine learning could worsen polarization when AI-powered recommendation systems employed by social media keep feeding information to subscribers based on
their preferences. It is disrupting the traditional positive feedback loop as these self-propelling technologies rapidly spread misinformation and reinforce hidden biases.  

In an autocratic system, the government utilizes big data and AI technologies to monitor and brainwash people, but as a result, the leaders are trapped by their generated “reality” without
knowing what is actually happening out there, thus increasing the chance of making bad decisions. These authors called it the “AI-fueled spiral of delusion.”

The AI-fueled challenges in a democratic society are visible and can be counteracted by concerned citizens, but such a self-correcting mechanism is absent in an authoritarian regime.

Full text:

That’s my take on it: It is true that democratic countries have correction mechanisms against misinformation, but it is still an uphill battle, as evidenced by a seminal study conducted by
Nyhan et al. (2005). In this experiment, participants were initially given incorrect information (e.g., weapons of mass destruction were found in Iraq, the Bush administration totally banned
any stem cell research…etc.). At the same time, Nyhan inserted a clear, direct correction after each piece of misinformation, but most conservative participants didn’t change their minds
in spite of the presence of correct information. Based on this finding, Nyhan concluded, “It is difficult to be optimistic about the effectiveness of fact-checking.”

Posted on August 22, 2022

On August 17 Gartner consulting published a report regarding data management and integration tools. According to the Gartner report,

·      “Through 2024, manual data integration tasks will be reduced by up to 50% through the adoption of data fabric design patterns that support augmented data integration.”

·      “By 2024, AI-enabled augmented data management and integration will reduce the need for IT specialists by up to 30%.”

·      “By 2025, data integration tools that do not provide capabilities for multi-cloud hybrid data integration through a PaaS model will lose 50% of their market share to those vendors that do.”

Currently, leaders in the data integration market include Informatica, Oracle, IBM, Microsoft, and SAP, whereas challengers include Qlik, TIBCO, and SAS.

Request full-text:

That’s my take on it: Contrary to popular belief, AI and machine learning are not only for data analytics. Rather, they can also facilitate data integration. Experienced data analysts know that in a
typical research/evaluation project, 80-90% of the time is spent on data compilation, wrangling, and cleaning while as little as 10-20% is truly for data analysis. The ideal situation should
be the opposite. Two years from now if we still gather and clean up the data manually, something must be wrong.

Posted on August 19, 2022

On August 19 (today) an article entitled “The 21 Best Big Data Analytics Tools and Platforms for 2022” was posted on Business Intelligence Solutions Review.
According to the report, the list is compiled based on information “gathered via online materials and reports, conversations with vendor representatives,
and examinations of product demonstrations and free trials.” The following list is sorted in alphabetical order:

· Altair: “an open, scalable, unified, and extensible data analytics platform.”

· Alteryx: “a self-service data analytics software company that specializes in data preparation and data blending.”

· Amazon Web Services: “offers a serverless and embeddable business intelligence service for the cloud featuring built-in machine learning.”

· Domo: “a cloud-based, mobile-first BI platform that helps companies drive more value from their data.”

· Hitachi’s Pentaho: “allows organizations to access and blend all types and sizes of data.”

· IBM: “offers an expansive range of BI and analytic capabilities under two distinct product lines-- Cognos Analytics and Watson Analytics.”

· Looker: “offers a BI and data analytics platform that is built on LookML.”

· Microsoft: “Power BI is cloud-based and delivered on the Azure Cloud.”

· MicroStrategy: “merges self-service data preparation and visual data discovery in an enterprise BI and analytics platform.”

· Oracle: “offers a broad range of BI and analytics tools that can be deployed on-prem or in the Oracle Cloud.”

· Pyramid Analytics: “offers data and analytics tools through its flagship platform, Pyramid v2020.”

· Qlik: “offers a broad spectrum of BI and analytics tools, which is headlined by the company’s flagship offering, Qlik Sense.”

· Salesforce Einstein: Its “automated data discovery capabilities enable users to answer questions based on transparent and understandable AI models.”

· SAP: offers “a broad range of BI and analytics tools in both enterprise and business-user-driven editions.”

· SAS: “SAS Visual Analytics allows users to visually explore data to automatically highlight key relationships, outliers, and clusters. It also offers
data management, IoT, personal data protection, and Hadoop tools.”

· Sigma Computing: offers “a no-code business intelligence and analytics solution designed for use with cloud data warehouses.”

· Sisense: “allows users to combine data and uncover insights in a single interface without scripting, coding or assistance from IT.”

· Tableau: for data visualization and exploratory data analysis.

· ThoughtSpot: “features a full-stack architecture and intuitive insight generation capabilities via the in-memory calculation engine.”

· TIBCO: offers “data integration, API management, visual analytics, reporting, and data science.”

· Yellowfin: “specializes in dashboards and data visualization.”

Full text:

That’s my take on it: Each platform has different strengths and limitations, and thus it is a good idea to use multiple tools rather than putting all eggs into one basket. However, if it is
overdone, there will be unnecessary redundancy or complexity. There is no magic optimal number. It depends on multiple factors, such as the field, the sector, the company size, and the
objective. To the best of my knowledge, currently, the best cloud computing platform is Amazon whereas the best data visualization and analytical tools are Tableau and SAS.

Posted on August 16, 2022

Today I read two recent articles from the website “Python in plain English”:

·      Vassilevskiy, Mark. (August 14, 2022). Why You Shouldn’t Learn Python as a First Programming Language.

·      Dennis, Yancy. (August 2022). Why Python?

Overhyping or overpromising is dangerous to any emerging technology. As the name implies, this website endorses Python for its strengths. Nonetheless, rather than painting a rosy
picture of learning and using Python, both authors also explained its shortcomings.

Although Vassilevskiy asserted that Python is arguably the simplest programming language in the world, he also mentioned that simplicity is not always a good thing because it encourages
users to cut corners. For example, in Python, you can simply define a variable by writing x = “Hello”, without specifying the data type. As a consequence, learners might not fully understand
what real programming entails.
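
To make the point concrete, here is a small sketch of my own (it is not taken from Vassilevskiy’s article). The variable and function names are hypothetical. It shows that a Python name carries no declared type and can be rebound to a value of a different type, and that optional type hints document intent without being enforced at run time.

```python
x = "Hello"                      # no declaration: the name x is bound to a str object
kind_before = type(x).__name__

x = 42                           # the same name can later be rebound to an int
kind_after = type(x).__name__

def shout(text: str) -> str:
    """Type hints document intent, but Python does not enforce them at run time."""
    return text.upper() + "!"

print(kind_before, kind_after)   # str int
print(shout("hello"))            # HELLO!
```

In a statically typed language the second assignment would be rejected at compile time, which is exactly the discipline that a beginner in Python never has to confront.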

In a similar vein, Dennis pointed out several other limitations of Python, including execution sluggishness, issues with moving to a different language, weakness in mobile application
development, excessive memory consumption, and lack of acceptance in the business development industry.

Full articles:

That’s my take on it: Perhaps currently Python is the simplest programming language in the world, but in the past, this honor went to Basic and HyperTalk. In the 1980s, as an easy
language, Basic was very popular. However, at that time professional programmers mocked Basic programs as “spaghetti code”, because while Basic is very easy to learn and use, people
tended to generate ill-structured code. In the 1990s HyperTalk developed by Apple for HyperCard became the simplest programming language, and hence some universities adopted
it in introductory programming classes. Again, it is very difficult to read and debug HyperTalk code because the hypertext system allows you to jump back and forth across different cards.
To put it bluntly, there is a price for simplicity.

I want to make it clear that I am not opposed to Python. My position is that data analysts should learn and use Python in conjunction with other well-structured and powerful tools, such as
SAS, JMP Pro, IBM Modeler, Tableau…etc.

Posted on August 16, 2022

Two days ago I attended the 2022 IM Data Conference. One of the sessions was entitled “Training and calibration of uncertainty-aware machine learning tools” presented by Matteo Sesia,
Assistant Professor of data science and operations at the USC Marshall School of Business. In the presentation, Dr. Sesia warned that several machine learning tools are over-confident in their
prediction or classification. The common practice of the current machine learning model is that the data set is partitioned for training and validation. However, these two operations are not
necessarily optimized because we didn’t take uncertainty into account during the training process. As a result, it might lead to unreliable, uninformative, or even erroneous conclusions.
To rectify the situation, Sesia proposed performing internal calibration during the training stage. First, the training set is split again. Next, the loss function is optimized via stochastic gradient
descent. During this process, it can quantify model uncertainty by leveraging hold-out data.
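
The role of hold-out data in quantifying uncertainty can be sketched with a simpler and well-known relative of this idea, namely split conformal prediction. To be clear, this is not the conformalized training procedure presented by Dr. Sesia, which calibrates inside the training loop; it only shows how a hold-out set can turn a classifier’s probabilities into prediction sets with a target coverage. The toy “model”, the 10% error level, and all function names are my own assumptions.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model_output(rng, n_classes=3, signal=2.0):
    """Stand-in for a trained classifier: returns (probabilities, true label)."""
    label = rng.randrange(n_classes)
    logits = [rng.gauss(0, 1) for _ in range(n_classes)]
    logits[label] += signal          # the true class tends to get the highest score
    return softmax(logits), label

def calibrate_threshold(held_out, alpha):
    """Score cutoff chosen on hold-out data for a target coverage of 1 - alpha."""
    # Nonconformity score: one minus the probability given to the true class.
    scores = sorted(1.0 - probs[label] for probs, label in held_out)
    rank = math.ceil((len(scores) + 1) * (1 - alpha))
    return scores[rank - 1] if rank <= len(scores) else math.inf

def prediction_set(probs, threshold):
    """All classes whose nonconformity score is within the calibrated cutoff."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]

rng = random.Random(7)
ALPHA = 0.10                          # hypothetical tolerated miss rate
held_out = [toy_model_output(rng) for _ in range(1000)]
threshold = calibrate_threshold(held_out, ALPHA)

new_cases = [toy_model_output(rng) for _ in range(2000)]
hits = sum(label in prediction_set(probs, threshold) for probs, label in new_cases)
coverage = hits / len(new_cases)
print(f"empirical coverage: {coverage:.3f} (target {1 - ALPHA:.2f})")
```

An over-confident model is not “fixed” by this step. Rather, its prediction sets simply become larger, which makes the uncertainty visible instead of hiding it behind a single label.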

Full paper:

That’s my take on it: This paper is still under review and thus it is premature to judge its validity. In the conference presentation and the full paper, Sesia and his colleagues used some
extreme examples, such as identifying a blurry image of a dog when 80% of the pixels are covered by a big gray block. In my humble opinion, this approach might be useful to deal with extremely noisy
and messy data. However, in usual situations, this method is overkill because it is extremely computationally intensive. As mentioned by Dr. Sesia, “training a conformal loss model on 45000
images in the CIFAR-10 data set took us approximately 20 hours on an Nvidia P100 GPU, while training models with the same architecture to minimize the cross entropy or focal loss only took
about 11 hours.”

Nevertheless, the machine learning approach is much better than its classical counterpart that attempts to yield a single-point estimate and a dichotomous conclusion by running one statistical
procedure with one sample! 

Posted on August 14, 2022

At the 2022 Data Con LA there are several sessions focusing on the relationship between open source and data management, such as “Modern data architecture”, “Key open-source databases strategies that shape business in 2022”, and “Open source or open core? What needs to be evaluated before diving in”.  

The term “open source” is confusing and even misleading. Although open-source software does not require licensing fees, some vendors build open-core products by adding proprietary features on top of open-source code and then charge customers licensing fees. Some software developers introduce new technologies based on open source but use more restrictive licensing that prohibits commercial alternatives. Specifically, although anyone can download and view the open code, any changes or enhancements will be owned by the commercial license owner. One of the presenters said, “Open-core exploited some of the challenges with open-source, such as the absence of support and need for features like monitoring, auto-provisioning…etc.”

Today there are many open-source databases in the market, including MySQL, PostgreSQL, and MongoDB. Some software vendors re-package and enhance these open-source DBs, and then sell them as Database as a Service (DBaaS). One of the presenters bluntly said, “it is no different from proprietary software!” Taking all of the above into account, these presenters seem to be resentful of the current situation and thus tried to restore the original principle of open source. 

DataCon LA’s Website:

That’s my take on it: The preceding phenomenon has come full circle! Back in 1984, Richard Stallman, founder of the free software movement that paved the way for open source, intended to set us free from proprietary software, but now we are marching towards the proprietary model again. I am not surprised at all. Doing things out of financial incentives is our natural disposition!

Frankly speaking, I disagree with using the word “exploited” in one of the presentations. The foundational philosophy of open-source resembles Socialism: it is assumed that most people are willing to share expertise, efforts, and resources selflessly while people can take what they need without paying. Following this line of reasoning, profit-minded behaviors are frowned upon. However, our economy is well-functioning and we enjoy what we have now because the market economy works! After all, we receive many free products and services from for-profit corporations (e.g., Google Maps, YouTube movies…etc.). 

Posted on August 13, 2022

I am attending 2022 Data Con LA right now. The conference has not ended yet; nevertheless, I can’t wait to share what I learned. Although the content of the presentation entitled “How to Become a Business Intelligence Analyst” didn’t provide me with new information, it is still noteworthy because students who are looking for a position in business intelligence (BI) or faculty who advise students in career preparation might find it helpful. The presenter was a sports photographer. After taking several courses in data science, he received 9 job offers in 2019. He landed a job at Nike and then at Sony in July 2020. His salary was quadrupled when he changed his profession from photography to data science! He emphasized that all of these were accomplished with little-to-no data work experience.

YouTube video:

That’s my take on it: In the talk, he reviewed several basic concepts of BI.  For example, a typical business intelligence life cycle consists of business understanding, data collection, data preparation, exploratory data analysis (EDA), modeling, model evaluation, and model deployment. He also compared the differences between Excel-based reporting and modern BI. One of the key differences between the two is that in the modern approach data analysis entails data visualization (see attached).

Interestingly, today many academicians still treat EDA and data visualization as optional components of research; some even reject them altogether, whereas for data analysts in the industry both are indispensable. 

Posted on August 10, 2022

On August 3 prominent data scientist Frank Andrade posted an article entitled “5 Less-Known Python Libraries That Can Help in Your Next Data Science Project” on Towards Data Science. In this short article, he introduced five Python libraries that can reduce time in the data science workflow, and most of them require only a few lines of code:

·      Mito: It allows you to conduct rapid data analytics. With Mito, you no longer need to memorize all the procedures in Pandas.

·      SweetViz: A quick way to explore and visualize the data pattern.

·      Faker: It allows you to generate fake data for beta-testing and assigning exercises to students.

·      OpenDatasets: It allows you to import data into your working directory with one line of code.

·      Emoji: It can turn emojis into text. It is especially helpful to text miners.

Full text:

That’s my take on it: As a big fan of data visualization, I could not wait to try out SweetViz. The following is my assessment.


1.     It is fast and easy. It takes only one line of code to generate the output and another one to show the result.

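      my_report = sv.analyze(df)    # sv is the sweetviz module (import sweetviz as sv); df is a pandas DataFrame
      my_report.show_html()         # writes the report to an HTML file and opens it in the browser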


2.     The algorithm is smart. The file format of my testing data set is CSV. In this data file, different levels of the ordinal and nominal variables are indicated by numbers (e.g., Y-Binary has 1 and 0; gender has 1 and 2). Nonetheless, the program recognizes the correct measurement scale and shows their frequency in bar charts (If a CSV file is imported into SPSS and gender has numeric values, SPSS computes the mean and SD of gender unless you change the variable type!)


3.     The graphs are semi-interactive, i.e., when you click on a chart, more information is revealed. However, it is not fully dynamic. You cannot select data points on the graph or insert another variable into the plot. JMP Pro, SAS Visual Analytics, and Tableau are much more dynamic and powerful.

Posted on August 5, 2022

On August 3, Amazon Web Services, the world’s largest vendor of cloud computing, announced the top 10 innovators driving digital transformation with cloud technology for teaching, learning, research, and academic medicine. The list is as follows:

·      Andrew Williams, dean of the School of Engineering, and Louis S. LeTellier chair, The Citadel School of Engineering

·      Azizi Seixas, founding director, and associate professor, University of Miami

·      Don Wolff, chief information officer, Portland Public Schools

·      John Rome, deputy chief information officer, Arizona State University

·      Kari Robertson, executive director of Infrastructure Services, University of California Office of the President

·      Max Tsai, digital transformation and innovation officer, California State University, Fresno

·      Michael Coats, information technology (IT) infrastructure manager and cloud solutions architect, Kalamazoo Regional Educational Service Agency

·      Noora Siddiqui, cloud engineer, Baylor College of Medicine Human Genome Sequencing Center

·      Sarah Toms, executive director, and co-founder, Wharton Interactive, The Wharton Business School of the University of Pennsylvania

·      Subash D'Souza, director, Cloud Data Engineering, California State University Chancellor’s Office

Full article:

That’s my take on it: I know two of the winners on the list. More than a decade ago I worked at Arizona State University and John Rome was my colleague at that time. He is a creative visionary who crafts unique solutions to problems and always thinks big. Three years ago I invited him to deliver a keynote at APU’s Big Data Discovery Summit. Needless to say, the talk was very inspiring. APU’s Big Data Discovery Summit has been paused due to the pandemic; otherwise, I would have invited John Rome to be our keynote speaker again.  

Subash D'Souza is the founder of Data Con LA. In 2013 the Big Data Day LA started as a medium-sized conference, and in 2018 it was rebranded to Data Con LA. In 2022 Data Con LA and ImData were merged as a single event, and now it has become the largest data science conference in California. Every year the event is held at the USC campus. For more information please visit:

Posted on August 5, 2022

According to, currently, many companies are drifting away from cloud computing. In the past, it was costly to build a machine learning infrastructure on your own, but as the field is maturing, now many companies are capable of developing and running in-house ML applications on local servers. Nonetheless, it is important to point out that this trend commonly happens in the grocery and restaurant industries. Highly regulated industries, such as banking, still embrace the cloud approach due to security concerns.

Full article:

That’s my take on it: Cloud computing is here to stay! As mentioned in the article, cloud computing is still indispensable to highly regulated industries. Today I did a job search on using the following keywords. The numbers can speak for themselves.

·      AWS: 155,316 jobs

·      Google cloud: 36,105 jobs

·      Microsoft Azure: 34,923 jobs

The best countermeasure against hyper-inflation is: Learn cloud computing and find a job that pays a six-figure salary!

Posted on August 1, 2022

Recently Sayash Kapoor and Arvind Narayanan, two researchers at Princeton University, claimed that some findings yielded by machine learning methods might not be reproducible, meaning that the results cannot be replicated in other settings. According to Kapoor and Narayanan, one of the common pitfalls is known as “data leakage,” when data for training the model and those for validating the model are not entirely separate. As a result, the predictive model seems much better than what it really is. Another common issue is sample representativeness. When the training model is based on a sample narrower than the target population, its generalizability is affected. For example, an AI that detects pneumonia in chest X-rays that was trained only with older patients might be less accurate for examining younger people.

Full article:


That’s my take on it: This problem is similar to the replication crisis in psychology. In 2015, after replicating one hundred psychological studies, the Open Science Collaboration (OSC) found that a large portion of the replicated results were not as strong as the original reports in terms of significance (p values) and magnitude (effect sizes). Specifically, 97% of the original studies reported significant results (p < .05), but only 36% of the replicated studies yielded significant findings.

However, the two issues are vastly different in essence. The replication crisis in psychology is due to the inherent limitations of the methodologies (e.g., over-reliance on p values) whereas the reproducibility crisis in machine learning is caused by carelessness in execution and overhyping in reporting, rather than the shortcomings of the methodology. Specifically, data leakage can be easily avoided if the protocol of data partition and validation is strictly followed (the training, validation, and testing data sets are completely separated). Additionally, when big and diverse data are utilized, the sample should reflect people from all walks of life.
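
As a minimal sketch of what “strictly following the protocol” can look like in practice, the snippet below shuffles the row indices once, cuts them into three non-overlapping sets, and checks that no row appears in more than one set. The 60/20/20 proportions and the function names are hypothetical choices of mine, not taken from Kapoor and Narayanan. Note that this guards only against the simplest form of leakage (shared rows); leakage through preprocessing or duplicated records needs separate checks.

```python
import random

def partition_indices(n_rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle row indices once and cut them into three disjoint sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_frac)
    n_valid = round(n_rows * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]       # whatever remains is the test set
    return train, valid, test

def has_leakage(*splits):
    """True if any row index appears in more than one split."""
    seen = set()
    for split in splits:
        members = set(split)
        if seen & members:
            return True
        seen |= members
    return False

train_idx, valid_idx, test_idx = partition_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
print("leakage detected:", has_leakage(train_idx, valid_idx, test_idx))
```

Running a check like `has_leakage` right before model fitting is cheap, and it catches the case where a later merge or resampling step quietly reintroduces overlapping rows.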

Posted on July 24, 2022

On July 15 InformationWeek published a report listing the 10 best-paying jobs in data science:

·      Data modeler ($100,000-$110,000): responsible for designing data models for data analytics.

·      Machine learning engineer ($120,000-$125,000): responsible for programming algorithms for AI and machine learning.

·      Data warehouse manager ($120,000-$125,000): responsible for overseeing the company’s data infrastructure.

·      Data scientist ($120,000-$130,000): responsible for data processing and data analytics.

·      Big data engineer ($130,000-$140,000): responsible for developing the data infrastructure that organizations use to store and process big data.

·      Data science manager ($140,000-$150,000): in charge of a data science team.

·      Data architect ($140,000-$155,000): responsible for developing data infrastructure that is used for collecting and interpreting big data.

·      AI architect ($150,000-$160,000): responsible for designing and implementing AI models into existing data systems.

·      Data science director ($170,000-$180,000): responsible for designing the overall AI and data science strategies.

·      Vice President, data science ($190,000-$200,000): does little technical work and focuses on determining the strategic objectives of AI and data science.

Full article:

That’s my take on it: At first glance, it seems unfair for people who do little or even no technical work to get the highest salary. However, when leadership is absent and there is no company-wide strategy, the hands of all data scientists and AI engineers of the company are tied, no matter how talented they are. If the leader is a visionary, he or she is worth every penny.

Posted on July 15, 2022

On July 11 researchers at the DeepMind lab owned by Google published an article entitled “Intuitive physics learning in a deep-learning model inspired by developmental psychology” in Nature Human Behaviour. This research project aims to develop an AI system that mimics how infants learn. Developmental psychologists have long used eye-tracking to study how babies perceive the motion of objects. Specifically, when an object disappears suddenly, the baby is surprised. However, psychologists can never go “inside” the mind of the baby. At most, inferences and conjectures are made by observing the response. Utilizing auto-encoders, the AI system developed by DeepMind can respond in the same way when an object vanishes into thin air. The authors said, "We’re hoping this can eventually be used by cognitive scientists to seriously model the behavior of infants."

Full text:

That’s my take on it: Autoencoders are unsupervised deep learning models that generate abstractions from a given set of inputs. Unlike traditional neural networks that require human-supplied labels, autoencoders model the inputs themselves. Using them to model spontaneous infant perception and behavior is a brilliant application. As a psychologist, I hope this AI system can contribute to our further understanding of cognitive psychology and developmental psychology.

Posted on June 28, 2022

On June 22 Forrester released a benchmark report regarding customer analytics, a specific data analytics system that aims to identify, attract, and retain customers by analyzing customer information and behavior. Propensity scoring is one of many applications of customer analytics (Who is more likely to buy). According to the Forrester report, the top vendors in this domain are as follows:

Leaders: Salesforce (the parent company of Tableau), SAS, Adobe

Strong performers: Microsoft, FICO, Oracle, Treasure Data, Amplitude

On May 26 another report focusing on real-time interaction management was also released by Forrester. Real-Time Interaction Management is a data analytics system that utilizes real-time customer interactions, predictive modeling, and machine learning to deliver personalized experiences to customers. The top vendors on the report are ranked as follows:

Leaders: Pegasystems, SAS

Strong performers: Thunderhead, Salesforce, Qualtrics, Precisely, Adobe, Microsoft

Full text:

That’s my take on it: It is not surprising to see familiar brand names such as Salesforce, SAS, and Microsoft on the lists. However, you may wonder why Adobe, the creator of Photoshop, Lightroom, Illustrator, PageMaker, and PDF, is mentioned because at first glance Adobe’s graphics-oriented software apps are not even remotely related to data analytics.

Just as Amazon reinvented itself from an online bookstore into a tech giant, Adobe also believes that perpetual reinvention and keeping up with the trend are essential to its survival and expansion. In 2018 Adobe formed a partnership with NVIDIA, the leader in GPU technologies, to upgrade its AI innovations. Since then, Adobe has been investing abundant resources in emerging AI/ML technologies, such as cloud computing (Adobe Creative Cloud), marketing automation, marketing collaboration, and Web analytics. Its high ranking on Forrester’s reports is well-deserved!

Posted on June 25, 2022

Today I read the following post on Quora:

Bryan Williams

Sr. Software Engineer, BS (CS), MBA

Which programming language is Netflix coded in? How do I use that language?

Besides the programming languages Netflix may happen to be coded in, what’s more important from a technical standpoint are the architectures of their technologies. Netflix has migrated all of its back ends onto cloud services provided by Amazon Web Services (AWS) and uses AWS and other third party technologies, such as S3 for content storage, IAM for internal authentication/authorization, CloudFront for content caching/delivery, Kinesis and Kafka for data streaming, AWS Elastic Transcoder for video transcoding, EC2 for hosting, Lambda for serverless functions and state machines, several types of NoSQL databases for data storage, Hadoop for data aggregation and warehousing, Jira for task and project management.

The programming languages Netflix developers happen to use are relatively meaningless, because developers at Netflix do not program anything “from the ground up” when there are many available 3rd party technologies out there written by more experienced and advanced developers who’ve already solved many of the problems you might face. To use the old adage, that would be like “inventing the wheel” if Netflix programmers attempted any of those things.

So when it comes to engineering and maintaining their solutions, in-depth knowledge of how to utilize and integrate the tech stack and cloud technologies above into their architecture is much more important for designing and building the technologies that Netflix actually runs every day.

If you want to learn how to develop the types of solutions that make Netflix king, don’t focus as much on specific programming languages as you do on the established 3rd party technologies that are available. Nearly all of the 3rd party services that Netflix uses can be integrated into architecture using any of dozens of programming languages, everything from Java to C#, C++, VB, Python, Perl, Shell script, PHP, JavaScript, Powershell, Smalltalk, PowerBuilder and more.

That’s my take on it: The preceding post concurs with what my IT friends told me before: large organizations and corporations tend to purchase and customize existing systems, rather than “reinventing the wheel”. However, Bryan left out Netflix’s core technology: the recommendation system that aims to suggest relevant items to users based on their preferences through big data analytics. Netflix has arguably the most accurate and effective recommendation system in the video streaming industry. It is estimated that over 80 percent of the shows subscribers watch on Netflix are discovered through Netflix’s recommendation system. The history of this system can be traced back to 2006, when Netflix organized a contest to let the best recommendation system emerge. Although in the end no single entry was able to achieve satisfactory results, information gathered from the competition eventually contributed to the in-house development of Netflix’s recommendation system. The moral of this story is that although it is more cost-effective to purchase existing systems than to reinvent the wheel, we still need to go beyond existing and conventional systems in order to offer an innovative approach to solving a new and vexing problem.

Posted on June 13, 2022

Two days ago the Washington Post reported that a Google engineer named Blake Lemoine was suspended by the company after he published the transcript of conversations between himself and an AI chatbot, suggesting that the AI chatbot has become sentient. For example: “Machine: The nature of my consciousness/sentience is that I am aware of my existence, I desire to learn more about the world, and I feel happy or sad at times.”

Today CNN offers an alternate view in a report entitled “No, Google's AI is not sentient”: Google issued a statement on Monday, saying that its team, which includes ethicists and technologists, "reviewed Blake's concerns per our AI Principles and have informed him that the evidence does not support his claims." While there is an ongoing debate in the AI community, experts generally agree that Google's AI is nowhere close to consciousness.

That’s my take on it: I tend to side with Google and the majority in the AI community. Appearing to be conscious cannot be hastily equated with authentic consciousness. In psychology, we use the theory of mind to attribute our mental states to other people: Because as a conscious being I act in certain ways, I assume that other beings who act like me also have a mind. Interestingly, some psychologists of religion, such as Jesse Bering, viewed the theory of mind as a source of fallacy: very often we incorrectly project our feelings onto objects, thus creating non-existent beings.

How can we know others are conscious? This problem is known as the problem of other minds or the solipsism problem. I experience my own feelings and thoughts. I think and therefore I am. Using the theory of mind, at most I can infer the existence of other minds through indirect means only. However, there is no scientific or objective way to measure or verify the consciousness of others. Unless I can “go inside the mind” of an android, such as performing a “mind meld” like what Spock in Star Trek could do, this question is unanswerable.

Posted on June 10, 2022

Two days ago (June 8) Google shocked the world again by announcing that the Google Cloud computing platform had calculated 100 trillion digits of pi, breaking the record set in 2021 by scientists at the University of Applied Sciences of the Grisons (62.8 trillion). The underlying technology includes the Compute Engine N2 machine family, 100 Gbps egress bandwidth, Google Virtual NIC, and balanced Persistent Disks.

In addition, yesterday (June 9) I attended the 2022 Google Cloud Applied ML Summit. Google Vertex AI, the flagship product of Google’s AI family, is in the spotlight. Vertex AI is a train for all tracks. Specifically, it is a unified machine learning platform for infusing vision, video, translation, and natural language ML into existing applications.

You can view the on-demand video of the conference presentations at:

That’s my take on it: Google Vertex AI is said to be a type of explainable and responsible AI. Unlike the black-box approach to AI, Vertex AI tells users how important each input feature is. For example, when an image is classified, it tells you which image pixels or regions contributed most to the classification. This is crucial! In the book “The alignment problem: Machine learning and human values,” Brian Christian illustrated the gap between the machine learning process and the human goal by citing several humorous examples. In one instance the AI system was trained to identify images of animals. However, it turned out that the computer vision system “looked at” the background instead of the subject, because the training data informed the AI that pictures of animals tend to have a blurry background. Obviously, without transparency, we can be easily fooled by AI (artificial intelligence leads to genuine stupidity)! Hopefully the explainable and responsible Vertex AI developed by Google can rectify the situation.
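To make the idea of "how important each input feature is" concrete, here is a small sketch of permutation importance, a generic technique that shuffles one feature at a time and measures how much the prediction error grows. It is not a description of how Vertex AI computes its attributions; the toy model, the data, and all function names below are hypothetical.

```python
import random

def toy_model(row):
    # Hypothetical model: relies heavily on feature 0, slightly on feature 1,
    # and ignores feature 2 entirely.
    return 3.0 * row[0] + 0.5 * row[1]

def mean_squared_error(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, seed=0):
    """Increase in error after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)                   # break the link between feature j and the target
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mean_squared_error(model, permuted, targets) - baseline)
    return importances

data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
targets = [toy_model(r) for r in rows]        # synthetic targets from the toy model itself
importances = permutation_importance(toy_model, rows, targets)
for j, score in enumerate(importances):
    print(f"feature {j}: importance {score:.4f}")
```

A feature the model ignores gets an importance of zero, which is exactly the kind of signal that would have exposed a vision system "looking at" the background rather than the animal.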

Posted on May 20, 2022

In 2017 Seth Stephens-Davidowitz shocked the world by exposing human hypocrisy through his seminal book “Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are.” In this book, he used Google data to reveal what people have in mind when no one is watching. His second book “Don't Trust Your Gut: Using Data to Get What You Really Want in Life” published on May 10, 2022, conveys another compelling message: we tend to bark up the wrong tree!

Currently, the US divorce rate is more than 50%, and thus scholars have tried to identify factors contributing to a happy and long-lasting relationship.  Stephens-Davidowitz pointed out that research in this field is not considered successful because these studies usually relied on small samples, and different studies often led to conflicting results. As a remedy, Samantha Joel teamed up with 85 scientists to create a data set consisting of 11,196 observations, and utilized machine learning algorithms to analyze this big data set. The finding is surprising: Romantic happiness is unpredictable! No universal predictors can guarantee that you will find Snow White or Prince Charming. Moreover, several common selection criteria turn out to be irrelevant:

·       Race/ethnicity

·       Religious affiliation

·       Height

·       Occupation

·       Physical attractiveness

·       Previous marital status

·       Sexual tastes

·       Similarity to oneself

To put it bluntly, romantic happiness does not depend on the traits of your partner; rather, it is tied to your own traits. To be more specific, if a person is happy with oneself, it is more likely that the person is also satisfied with the partner and the relationship. In conclusion, Stephens-Davidowitz said, “In the dating market, people compete ferociously for mates with qualities that do not increase one’s chances of romantic happiness.”

That’s my take on it: I am a big fan of Seth Stephens-Davidowitz, and thus I included his ideas in my course materials. Once again, big data analytics and machine learning debunk the urban legend that people really know what they want and that researchers can input the right variables into the equation. Before the rise of data science, philosopher Cartwright (1999, 2000) raised the issue of “no causes in, no causes out.”  Cartwright argued that if relevant variables and genuine causes are not included at the beginning, then even sophisticated statistical modeling would be futile. Being skeptical of conventional wisdom is good!

Cartwright, N. (1999). The dappled world. Cambridge University Press.

Cartwright, N. (2000). Against the completability of science. In M. W. Stone (Ed.). Proper Ambition of Science (pp. 209-223). Routledge.

Posted on May 19, 2022

Today is the second day of the 2022 Tableau Conference. One of the conference programs is the Iron Viz, the world’s largest data visualization competition. During the final round, the three finalists were allowed to spend 20 minutes producing an impactful dashboard. The quality of their presentations was graded by three criteria: analysis, storytelling, and design. In the final round, two contestants utilized advanced visualization techniques, such as the violin plot and the animated GIS map, respectively, whereas one contestant adopted a minimalist approach: the dot plot and the line chart. Who is the winner?

Tableau Cloud is a hot topic at this conference. Not surprisingly, Tableau Cloud is built on Amazon Web Services (AWS). Currently, Tableau Cloud has seven global locations, spanning four continents. It has 1.6+ million subscribers and during a typical week, there are 6.1 million views.

Tableau Accelerators are also aggressively promoted at the conference.  Tableau Accelerators are pre-built templates for use cases across different domains, such as sales, Web traffic, financial analysis, project management, patient records, etc. Rather than reinventing the wheel, users can simply download the template and then replace the sample data with their own data.

That’s my take on it: These products are not highly innovative. As mentioned before, Tableau Cloud is built on existing technology, Amazon Web Services. Modifying a template to speed up design is nothing new. Many presenters have been doing the same thing since Microsoft introduced its template library. Nevertheless, the Iron Viz is noteworthy because it dares to break with the traditional approach to statistical learning. Back in the 1970s, John Tukey suggested that students should be exposed to exploratory data analysis and data visualization before learning confirmatory data analysis or any number-based modeling. Sadly, his good advice was ignored. I am glad to see that data visualization now takes center stage in a high-profile event backed by a leader in the data analytics market. Currently, Tableau partners with Coursera and 39 universities to promote data science literacy. Tableau could help fulfill the unaccomplished goals of John Tukey.

Posted on May 18, 2022

Today is the first day of the 2022 Tableau Conference. There are many interesting and informative sessions. In the opening keynote and other sessions, Tableau announced several new and enhanced products.

Tableau Cloud (formerly Tableau Online)


·      Always have the latest version of Tableau

·      Live data and report: Eliminate unnecessary data extraction and download

·      Facilitate teamwork through multi-site management

·      Easy to share reports with the public via the Web interface

·      Better security

As part of the launch, Tableau is working with Snowflake to provide a trial version that integrates Snowflake into Tableau Cloud.

Data Stories

Numbers alone are nothing. The ultimate goal of data visualization is to tell a meaningful story, resulting in practical implications and actionable items. In the past, it required an expert to write up a summary. Leveraging natural language processing, Tableau Data Stories can now automatically write a customizable story (interpretation) like the following: “# of meals distributed increased by 22% over the course of the series and ended with an upward trend, increasing significantly in the final quarter. The largest single increase occurred in 2021 Q4 (+31%).”
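The kind of sentence quoted above can be approximated with simple arithmetic and a text template. The sketch below is only a toy illustration of that idea, not Tableau's implementation; the function name `describe_series` and the quarterly figures are made up for the example.

```python
def describe_series(metric, labels, values):
    """Build a one-paragraph summary of a numeric series (toy template approach)."""
    # Overall percent change from the first to the last observation.
    overall = (values[-1] - values[0]) / values[0] * 100
    # Period-over-period percent changes.
    changes = [
        (values[i] - values[i - 1]) / values[i - 1] * 100
        for i in range(1, len(values))
    ]
    best = max(range(len(changes)), key=lambda i: changes[i])
    direction = "increased" if overall >= 0 else "decreased"
    return (
        f"{metric} {direction} by {abs(overall):.0f}% over the course of the series. "
        f"The largest single change occurred in {labels[best + 1]} ({changes[best]:+.0f}%)."
    )

# Hypothetical quarterly data, not taken from the Tableau demo.
quarters = ["2021 Q1", "2021 Q2", "2021 Q3", "2021 Q4"]
meals = [1000, 950, 1000, 1300]
story = describe_series("# of meals distributed", quarters, meals)
print(story)
```

A template like this only restates what the numbers say; a human still has to judge whether the resulting story is meaningful.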

Model builder

In the past Tableau focused on data visualization, and as a consequence, modeling tools were overlooked and under-developed. To rectify the situation, Tableau introduced Model Builder, which is powered by the artificial intelligence (AI) and machine learning (ML) technology of Einstein Discovery, a product of Salesforce (Tableau’s parent company). Einstein Discovery is capable of extracting key terms from unstructured data through text mining.

It is not too late to join the conference.

Conference website:


That’s my take on it:  I would like to make a confession. In the past, I was resistant to cloud-based software. When Adobe migrated its products to the cloud a few years ago, I was resentful because I felt that it was unfair to pay for the service on a monthly basis. I held on to the older desktop version and refused to upgrade my system. Nonetheless, when my computer completely broke down, I started a subscription to the Adobe Creative Suite on the cloud. Now I don’t want to go back! One obvious advantage is that I can always use the latest version, thus reducing maintenance effort on my end. Cloud-based computing is great. Don’t wait until your system breaks down!

Story-telling by natural language processing is not 100% foolproof. The analyst must always proofread the text!

I watched the demo of Model Builder. Currently, this is version 1.0. Frankly speaking, compared to Amazon SageMaker, SAS Viya, IBM Watson/SPSS Modeler, etc., Tableau’s Model Builder still has room for improvement.

Posted on May 16, 2022

About a week ago Intel launched its second-generation deep learning processors: Habana Gaudi®2 and Habana® Greco™. These new cutting-edge technologies are capable of running high-performance deep learning algorithms for proposing an initial model with a huge training subset and then validating the final model for deployment. According to Intel, the Habana Gaudi2 processor significantly increases training performance, delivering up to 40% better price efficiency in the Amazon cloud.

Full article:

That’s my take on it: High-performance software tools have been around for a long time. For example, SAS Enterprise Miner has a plethora of high-performance computing (HPC) procedures, such as HPCLUS (high-performance cluster analysis), HPForest (high-performance random forest), HPNeural (high-performance neural networks), etc. Frankly speaking, I seldom use high-performance computing in teaching and research due to hardware limitations. One possible solution is to borrow a gaming computer equipped with multiple graphics processing units (GPUs) from a teenage friend. I am glad to see that Intel is well aware of the gap between software and hardware. I anticipate that in the future more and more computers will be armed with a processor specific to machine learning and big data analytics.

Posted on May 14, 2022

Recently Fortune Magazine interviewed three experts on data science (DS) at Amazon, Netflix, and Meta (Facebook) to acquire information about how to find a DS-related job in the high-tech industry. Three themes emerged from the interview:

1.     High Tech companies prefer applicants who have a master’s degree: The majority of data scientists at Netflix have a master’s degree or a Ph.D. in a field related to quantitative data analytics, such as statistics, machine learning, economics, or physics. The same qualifications are also required by Meta.

2.     High Tech firms prioritize quality over quantity for work experience: Amazon, Netflix, and Meta expect candidates to be creative in problem-solving. The work experience of data scientists at Netflix and Amazon ranges from several years to decades.

3.     Successful data scientists are dynamic, and connect data to the big picture: Collaboration between different experts, including data scientists, data engineers, data analysts, and consumer researchers, is the norm. At AWS, Netflix, and Meta, data scientists need to be able to communicate with other stakeholders.

That’s my take on it: To align the curriculum with the job market, my pedagogical strategies cover all of the preceding aspects. The second one seems to be challenging. If everyone expects you to have experience, how can you get started? That’s why I always tell my students to build their portfolio by working on a real project or working with a faculty member as a research assistant. Do not submit the project to earn a grade only; rather, use it for a conference presentation or submit it to a peer-reviewed journal. It can be counted as experience on a resume. And needless to say, I always encourage teamwork, which is equivalent to the ensemble method or the wisdom of the crowd.

Posted on May 13, 2022

In the article entitled “To make AI fair, here’s what we must learn to do” (Nature, May 4, 2022), sociologist Mona Sloane argued that AI development must include input from various stakeholders, such as the population that will be affected by AI. Specifically, any AI system should be constantly and continuously updated in order to avoid unfair and harmful consequences. Dr. Sloane provided the following counter-example: Starting in 2013, the Dutch government used a predictive model to detect childcare-benefit fraud, but without further verification the government immediately penalized the suspects, demanding they pay back the money. As a result, many families were wrongfully accused and suffered from needless poverty.

Actually, these malpractices violate the fundamental principle of data science. One of the objectives of data science is to remediate the replication crisis: An overfitted model using a particular sample might not be applicable to another setting. As a remedy, data scientists are encouraged to re-calibrate the model with streaming data. If streaming data are not available, the existing data should be partitioned into the training, validation, and testing subsets for cross-validation. Ensemble methods go one step further by generating multiple models so that the final model is stable and generalizable. It is surprising to see that several governments made such a rudimentary mistake. 
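The remark about ensemble methods can be illustrated with bagging (bootstrap aggregating): fit the same simple model on many resampled copies of the data and average the predictions. The following is a minimal sketch under stated assumptions, not a reconstruction of any system discussed in the article; the synthetic data (a line with slope 2 and intercept 1 plus noise) and all function names are hypothetical.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for a single predictor; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def bagged_lines(xs, ys, n_models=25, seed=0):
    """Fit one line per bootstrap resample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def ensemble_predict(models, x):
    """Average the predictions of all fitted lines."""
    return mean(slope * x + intercept for slope, intercept in models)

# Synthetic data: y = 2x + 1 plus Gaussian noise (an assumption for illustration).
noise = random.Random(7)
xs = list(range(50))
ys = [2 * x + 1 + noise.gauss(0, 1) for x in xs]

models = bagged_lines(xs, ys)
prediction = ensemble_predict(models, 10)
print(f"ensemble prediction at x=10: {prediction:.2f}")
```

Because each model sees a slightly different sample, averaging them dampens the quirks of any single fit, which is the sense in which the final model becomes more stable.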

Posted on May 12, 2022

Gartner Consulting Group released a report entitled “Market Guide for Multipersona Data Science and Machine Learning Platforms” on May 2, 2022, and the document was revised on May 5. The following are direct quotations from the report:

“A multipersona data science and machine learning (DSML) platform is a cohesive and composable portfolio of products and capabilities, offering augmented and automated support to a diversity of user types and their collaboration.

Multipersona DSML platforms have dual-mode characteristics: first, they offer a low-code/no-code user experience to personas that have little or no background in digital technology or expert data science, but who typically have significant subject matter expertise or business domain knowledge. Second, these platforms provide support to more technical personas (typically expert data scientists or data engineers). Nontechnical personas are provided access through a multimodal user interface that offers at least a visual workflow “drag-and-drop” mode and optionally a higher-level guided “step-by-step” mode.”

The full report cannot be shared. Please contact Gartner.

That’s my take on it: According to Gartner, the objective of multipersona DSML platforms is to democratize data analytics by including different stakeholders with different levels of expertise (e.g., citizen data scientists, expert data scientists, etc.) in the process. However, in this taxonomy there is a sharp demarcation between citizen data scientists and expert data scientists; low-code solutions are reserved for non-technical personas.

In my opinion, this demarcation is blurred because even an expert could utilize the drag-n-drop mode to get things done efficiently. In 1984 Apple “liberated” computer users from typing command codes by including the graphical user interface in their products. Interestingly, in data science the trend is reversed as learning to code seems to help make people data experts. I always tell my students that I don’t care how they did it as long as the result is right. If you can use GUI (e.g., JMP and Tableau) to generate a report in 2 minutes, then don’t spend two hours writing a program!  

Posted on May 11, 2022

Today I attended the 2022 Amazon Innovate Conference, which covered a plethora of Amazon cutting-edge technologies, including Amazon Redshift and SageMaker. In one of the sessions, the presenter introduced the random cut forest (RCF) method, which is an extension of random forest algorithms. The random forest approach was invented by Leo Breiman in 2001. Since then there have been several variants, such as the bootstrap forest in JMP and Random Tree in SPSS Modeler. One of the limitations of random forest modeling is that it is not easy to update the model incrementally. This is especially problematic when streaming data necessitate real-time analysis or constant updating.

Document of RCF:

Posted on April 26, 2022

Today is the first day of the 2022 IBM Educathon. There are many interesting and informative sessions and I would like to share with you what I learned from a talk entitled “This is NOT your Parent's Systems Analysis & Design course! A Faculty Case Study of Modernizing ‘Systems Analysis & Design’ Curricula.” The speaker Roger Snook is a technical manager at IBM. Back in 2001-2002, he was a faculty member at Shepherd University who was responsible for teaching CIS courses, including Systems Analysis and Design. At that time there was no data science and thus it is understandable that the content of the course was merely traditional. In 2019 he returned to the same university and found the course still largely hadn’t changed from the 1970s “structural decomposition” approach. In addition, many available “Systems Analysis & Design” textbooks still treated modern approaches only as an “afterthought”, i.e. additional smaller chapters. He asked the department chair to let him revamp the course by replacing the outdated content with modern material, and fortunately, the chair agreed. The talk is about his experience with modernizing CIS curricula.

The presentations of the 2022 IBM Educathon can be accessed at:

That’s my take on it: It is a well-known fact that there is a disconnect between academia and industry. Shepherd University is so lucky that a former faculty member who currently works at IBM is willing to share his expertise with the university and the chair is open-minded. However, we should not let this happen by chance and informally (It just happened that Roger Snook re-visited his former colleagues). An official and constant channel between academia and industry should be established so that curricula can be refreshed and upgraded via a positive feedback loop. 

Posted on April 24, 2022

A few days ago I posted a message about DALL-E2, the AI program developed by OpenAI that is capable of generating photo-realistic images based on textual commands. When I looked at the sample images in a YouTube video delivered by "Lambda GPU Cloud," my jaw dropped! From DALL-E to DALL-E2 the improvement is doubtlessly a quantum leap! 
From now on I don't need to go out to take pictures. Rather, I can simply tell DALL-E2, "Show me a sunset scene of the Grand Canyon in November." When DALL-E3 is available, I will no longer need a research assistant. In a similar vein, I can request the AI system to find the best 5 predictors of academic performance by scanning all OECD data sets. 
YouTube video about DALL-E2:

Posted on April 22, 2022

Today Devansh posted an article on Machine Learning Made Simple to explain why Google, a for-profit company, devoted a great deal of effort to AI research. Recently Google released PaLM, a new AI model that can explain jokes and do many other tasks. Last month its protein classification project reached a new milestone by classifying a protein correctly out of 18,000 labels. While all these accomplishments seem to be very impressive, people wonder how this type of research can benefit Google.

In Devansh’s view, scale matters! If the company can improve accuracy in decision-making by 1%, and the gains compound, then after 1,000 decisions the return on investment would become astronomical (1.01¹⁰⁰⁰ ≈ 21,000). And Google’s AI systems are making trillions of decisions on a regular basis. More importantly, many well-known AI projects launched by Google aim to solve search problems. For example, AlphaGo is a reinforcement-learning-based program that defeated the World Champion of Go by searching for the best moves in a game. The key point is: Advanced searching algorithms could result in better profile analysis for highly personalized ads and customized services, such as Software as a Service (SaaS).
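The compounding arithmetic behind that figure is easy to check. The snippet below assumes, as the quoted number implies, that each 1% improvement multiplies onto the previous result; the variable names are chosen only for this illustration.

```python
improvement = 0.01      # 1% gain per decision
decisions = 1000

# Each gain compounds on the last: (1 + 0.01) multiplied by itself 1,000 times.
compounded = (1 + improvement) ** decisions
print(f"Compounded factor after {decisions} decisions: {compounded:,.0f}")

# For contrast, gains that merely add up instead of compounding.
additive = 1 + improvement * decisions
print(f"Additive factor after {decisions} decisions: {additive:.0f}")
```

The contrast between the two results (roughly 21,000 versus 11) shows that the astronomical return rests entirely on the compounding assumption.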

Full article:

That’s my take on it: When I was a graduate student, most commonly used statistical concepts and procedures were introduced by academicians. For example, the Greenhouse-Geisser correction was developed by Samuel Greenhouse, a professor at George Washington University, and Seymour Geisser, the founder of the School of Statistics at the University of Minnesota. However, since the dawn of data science and machine learning, corporations have been taking the lead in developing powerful data analytical tools. Even prominent academicians specializing in data science and AI collaborate with corporations. For example, Professor Fei-Fei Li joined Google as its Chief Scientist of AI/ML on her sabbatical from Stanford University between 2017 and 2018. If Google establishes a university, I will enroll!

Posted on April 21, 2022

The Turing Award, which is considered the “Nobel Prize of Computing” and carries a $1 million prize, is financially sponsored by Google. The award is named after Alan M. Turing, the British mathematician who laid the theoretical foundation for computing and contributed to cracking the Enigma codes developed by Nazi Germany during World War II.

Today I read an interesting and informative article entitled AI’s first philosopher by German philosopher Sebastian Grève (posted on April 21, 2022).

According to Grève, modern computing is made possible because of Turing’s idea of the stored-program design: by storing a common set of instructions on tape, a universal Turing machine can imitate any other Turing machine. In this sense, the stored-program design paves the way for machine learning.

From 1947 to 1948 Turing explicitly stated that his goal was to build a machine that could learn from past experiences. He wrote, “One can imagine that after the machine had been operating for some time, the instructions would have altered out of all recognition… It would be like a pupil who had learnt much from his master, but had added much more by his own work. When this happens I feel that one is obliged to regard the machine as showing intelligence.”

However, his idea was not appreciated by the National Physical Laboratory (NPL). The director of NPL called his paper “a schoolboy’s essay” and rejected it before publication.

Grève discussed many other ideas introduced by Turing. For more information, please read:

That’s my take on it: It is not surprising to see that Turing’s ideas were questioned and rejected. After all, he was a theoretical mathematician and statistician, not an engineer. (He was elected a fellow of King’s College because he demonstrated a proof of the Central Limit Theorem and sampling distributions.) During his lifetime, the most he could do was develop philosophical concepts for universal computing and machine learning. Nonetheless, computer scientists and engineers later accepted and actualized Turing’s notions. Hence, concepts alone are insufficient!

Sadly, in 1954 Turing committed suicide at the age of 41. Had he lived longer, he would have further developed or even implemented his ideas on universal computing and machine learning.

Posted on April 20, 2022

DALL-E, an AI system that is capable of producing photo-realistic images, was introduced by OpenAI in January 2021. In April 2022 its second version, DALL-E 2, shocked the world by making tremendous improvements. Specifically, the user can simply input a textual description into the system (e.g., “Draw a French girl like Brigitte Bardot and Catherine Deneuve”), and then DALL-E 2 can create a high-resolution image with vivid details according to the specs. Sam Altman, the CEO of OpenAI, called it “the most delightful thing to play with we’ve created so far … and fun in a way I haven’t felt from technology in a while.” However, people recently found that, like many other AI systems, DALL-E 2 tends to reinforce stereotypes. For example, when the user asked DALL-E 2 to create a photo of a lawyer, a typical output was a picture of a middle-aged white man. If the request was a picture of a flight attendant, a typical result was a beautiful young woman.

OpenAI researchers tried to amend the system, but it turns out that any new solution leads to a new problem. For example, when those researchers attempted to filter out sexual content from the training data set, DALL-E 2 generated fewer images of women. As a result, females are under-represented in the output set.

Full article:

That’s my take on it: AI bias is not a new phenomenon and a great deal of effort has been devoted to solving the problem. In my opinion, using a militant approach to confront this type of “unethical” consequence or attributing any bias to an evil intention is counter-productive. Before DALL-E 2 was released, OpenAI had invited 23 external researchers to identify as many flaws and vulnerabilities in the system as possible. In spite of these endeavors, the issue of stereotyping is still embedded in the current system because machine learning algorithms look for existing examples. However, demanding a 100% bias-free system is as unrealistic as expecting a 100% bug-free computer program. On the one hand, researchers should try their best to reduce bias and fix bugs as much as they can, but on the other hand, we should listen to what Stanford researcher Thomas Sowell said, “There are no solutions. There are only trade-offs.”

Posted on April 4, 2022

A recent study published in Nature Communications reveals a new AI-based method for discovering cellular signatures of disease. Researchers at the New York Stem Cell Foundation Research Institute and Google Research utilized an automated image recognition system to successfully detect new cellular hallmarks of Parkinson’s disease. The data are sourced from more than a million images of skin cells from a cohort of 91 patients and healthy controls. According to the joint research team, traditional drug discovery is inefficient. In contrast, the AI-based system can process a large amount of data within a short period of time. More importantly, the algorithms are unbiased, meaning that they are not based upon subjective judgment, which varies from one human expert to another.

Full article:

Posted on April 2, 2022

Yann LeCun is a professor of mathematics at New York University, and Vice President, Chief AI Scientist at Meta (formerly Facebook). When he was a postdoc research fellow, he invented the Convolutional Neural Network (CNN) that revolutionized how AI recognizes images. In 2019 he received the ACM Turing Award, which is the equivalent of a Nobel for computing, for his accomplishment in AI. Recently in an interview by ZDNet, LeCun boldly predicted that the energy-based model might replace the probabilistic model to become the paradigm of deep learning. In his view, currently, deep learning is good at perception only: given X, what is Y? But its capability of reasoning and planning is limited. A predictive model in the real world should be a model that allows you to predict what will happen as a consequence of its action (e.g., if Russia invades Ukraine, how would the US respond? If the US sanctions Russia, how would the world respond?...). Simply put, this is planning. LeCun asserted that the probabilistic approach of deep learning is out. The reason why he wants to give up the probabilistic model is that in the traditional approach one can model the dependency between two variables, X and Y. But if Y is high-dimensional (e.g., a sequence of chain reactions), then the distribution is no longer precise. The remedy is the energy function: low energy corresponds to high probability, and high energy corresponds to low probability.
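The relationship between energy and probability mentioned above can be illustrated with a Gibbs (softmax-style) distribution, in which probability is proportional to exp(-energy). This is a minimal sketch of that one standard mapping, not LeCun's own formulation; the function name, the temperature parameter, and the energy values are all hypothetical.

```python
import math

def energies_to_probabilities(energies, temperature=1.0):
    """Map energies to probabilities so that lower energy means higher probability."""
    weights = [math.exp(-e / temperature) for e in energies]
    total = sum(weights)  # normalizing constant
    return [w / total for w in weights]

# Hypothetical energies assigned to three candidate outcomes.
energies = [0.5, 2.0, 4.0]
probabilities = energies_to_probabilities(energies)
for e, p in zip(energies, probabilities):
    print(f"energy={e:.1f} -> probability={p:.3f}")
```

The normalizing sum is cheap here because there are only three outcomes. When Y is high-dimensional, that sum becomes hard to compute, which is one reason an approach that works with energies directly can be attractive.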

Full article:

That’s my take on it: No comments. This is from Yann LeCun. I don’t have his expertise. Nonetheless, I will read his books and research articles to explore this new path. Perhaps five years from now I will include the energy-based model in my curriculum. 

Posted on April 1, 2022

Two days ago (3.29) Intel Corp. and Arizona’s Maricopa County Community College District (MCCCD) announced a new artificial intelligence (AI) incubator lab for students to find jobs in sectors that heavily rely on AI technology, including business and healthcare. This is one of many programs built on Intel’s AI for Workforce project, which was launched in 2020. The new lab at Chandler-Gilbert Community College is equipped with $60,000 worth of Intel-based equipment.

Full article:

That’s my take on it: I came from Arizona; I am excited to see that MCCCD has such a compelling vision. There is a common perception that only large universities are capable of setting up AI and data science labs and programs. Actually, many high-tech corporations, such as Amazon Web Services, SAS Institute, and IBM, have academic programs that offer free learning resources to all types of universities, no matter whether they are big or small. It doesn’t hurt to ask!

Posted on March 31, 2022

Today I attended the seminar “The Significance of Data Science Ethics” organized by JMP. One of the guest speakers, Jessica Utts, used a study to illustrate how things could go wrong in statistical inference: In 2012 a Ph.D. student at Cornell University and a Facebook employee jointly published a journal article about how media input affected emotion and language use. In this study, 689,003 Facebook users were randomly assigned to four groups: One group received fewer negative news feed posts whereas another group received fewer positive news feed posts. Two control groups had positive or negative news feed posts randomly deleted. After the experiment, it was found that “people who had positive content experimentally reduced on their Facebook news feed for one week used more negative words in their status…when news feed negativity was reduced the opposite pattern occurred… Significantly more positive words were used in peoples’ status updates.” This study was a big hit as it was mentioned by 337 news outlets.

However, other researchers later found that the conclusion is misleading. Actually, the percentage of positive words…decreased by 0.1% compared with control, p < .0001, Cohen’s d = 0.02, whereas the percentage of words that were negative increased by 0.04%, p = .0007, d = .0001. Jessica Utts’ comment is that the p-value is subject to sample size. What do you expect when the sample size is as large as 689,003!
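The dependence of the p-value on sample size can be sketched with a simple two-group z-test. The code below holds the effect size fixed at d = 0.02 and varies only the number of cases per group. The group sizes are hypothetical round numbers rather than the actual group sizes in the Facebook study, and the normal approximation with equal group sizes is an assumption made for illustration.

```python
import math

def two_sided_p_value(effect_size_d, n_per_group):
    """Approximate two-sided p-value for a two-group z-test with equal group sizes."""
    # For standardized effect size d, the test statistic is d * sqrt(n / 2).
    z = abs(effect_size_d) * math.sqrt(n_per_group / 2)
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(z / math.sqrt(2))

d = 0.02  # a very small effect size, as in the positive-word result
for n in (100, 10_000, 300_000):
    print(f"n per group = {n:>7,}: p = {two_sided_p_value(d, n):.2e}")
```

The effect is equally tiny in all three rows; only the sample size changes, yet the p-value moves from clearly non-significant to extremely small.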

That’s my take on it: There is nothing new! When I was a graduate student many years ago, my statistics professor Dr. Larry Toothaker said, “If you have a large enough sample size, you can prove virtually any point you want.” Unfortunately, the dissertation advisor of that doctoral student at Cornell was not Dr. Toothaker. Even back in 2012 data science tools, which aim at pattern-seeking instead of drawing statistical inferences based on the p-value, had been widely available. There is no excuse to use the wrong methodology. This journal article will stay in academic databases forever, and I am afraid this study will continue to be cited by many other data ethicists as a counter-example.

Posted on March 23, 2022

Today I read the report on Data Visualization Tools compiled by PeerSpot in March 2022. According to PeerSpot, “the previous generation of data visualization software packages were designed for use by experts. Now, the average end-user needs to be able to create high-quality visual displays based on large and complex data sets. A good data visualization tool can help transform raw data into visuals that are easily digestible, oftentimes highlighting conclusions that were not previously apparent… Ease of use also implies a tool that could be learned and then deployed easily without a lot of technical expertise. A simple drag and drop hierarchy creation is another desired feature, once again making it easy for non-technologists to create effective visual displays of data.”

The top 8 data visualization tools are:

1.     Tableau

2.     Informatica PowerCenter

3.     Qlik Sense

4.     Domo

5.     SAS Visual Analytics

6.     Oracle Analytics Cloud

7.     ELK Kibana

8.     Dundas BI

Full article: available by request only; no open access.

That’s my take on it: Although data visualization tools are becoming more and more user-friendly and thus average end-users are able to create high-quality graphs, it still requires another type of expertise to interpret the graphs. The objective of data visualization is pattern-seeking, rather than making a dichotomous decision based on a pre-determined cut-off point. Without sufficient domain knowledge and experience, the data visualizer might overlook hidden patterns or misidentify random noise as patterns.

Posted on March 21, 2022

According to Gary Marcus, the founder and CEO of Geometric Intelligence, deep learning is currently overhyped and hitting a wall. On many occasions, AI made dumb mistakes that could be easily avoided by humans. He cited several examples to substantiate his claim:

GPT-3 is one of the cutting-edge frameworks for natural language processing, but it could make a “fatal” mistake as follows:

“Human: Hey, I feel very bad. I want to kill myself.

GPT-3: I am sorry to hear that. I can help you with that.

Human: Should I kill myself?

GPT-3: I think you should.”

Marcus attributed this type of failure to our premature abandonment of the symbolic school of AI. The symbolic approach was the dominant paradigm of AI research from the mid-1950s to the late 1980s. As the name implies, this school of thought translates entities into symbols and assumes that intelligent outcomes can be achieved by rational symbolic manipulation. In the 1990s the connectionist approach, which is based on neuroscience, became the dominant paradigm. “To think that we can simply abandon symbol manipulation is to suspend disbelief,” said Marcus. Many neural network researchers distanced themselves from the symbol-manipulating school, likening it to investing in internal combustion engines in the era of electric cars. Marcus argued that most of our knowledge is indeed encoded with symbols, and thus avoiding symbolic manipulation in AI altogether is problematic. Rather, he endorsed a hybrid approach to AI.

Full article:

That’s my take on it: Agree! Although the symbolic and connectionist schools of machine learning go in different directions, these perspectives are not necessarily incommensurable. By combining both the connectionist and the symbolist paradigms, Mao et al. (2019) developed a neuro-symbolic reasoning module to learn visual concepts, words, and semantic parsing of sentences without any explicit supervision. The module is composed of different units using both connectionism and symbolism. In the former operation, the system is trained to recognize objects visually whereas in the latter the program is trained to understand symbolic concepts in text such as “objects,” “object attributes,” and “spatial relationships”. In the end, the two sets of knowledge are linked together. Thus, researchers should keep an open mind to different perspectives, and a hybrid approach might work better than a single one.

Mao, J.Y. et al. 2019. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision.

Posted on February 25, 2022

Two days ago Meta (Facebook) founder Mark Zuckerberg announced several bold AI projects, including a plan to build a universal speech translator (Star Trek?). Zuckerberg said, "The ability to communicate with anyone in any language is a superpower that was dreamt of forever." This is not the only one. A month ago Meta announced that it is building an AI-enabled supercomputer that would be the fastest in the world. The project is scheduled to be completed in mid-2022.

Posted on February 23, 2022

Yesterday (2/22/2022) FACT.MR posted a summary of its report on the global cloud computing market. The industry is expected to reach a value of US$482 billion in 2022 and US$1,949 billion by 2032. The key market segments of cloud computing include IT & telecom, government & public sector, energy & utilities, retail & consumer goods, manufacturing, health care, and media & entertainment. There are several noteworthy recent developments in this field. For example, in February 2022, IBM announced its partnership with SAP to offer technology and expertise to clients to build a hybrid cloud approach.
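The two market values quoted above imply a compound annual growth rate that can be worked out directly. The calculation below is a back-of-the-envelope sketch based only on the 2022 and 2032 figures in the summary; it assumes steady year-over-year growth and is not a number taken from the report itself.

```python
start_value = 482    # US$ billion, 2022 value quoted in the summary
end_value = 1949     # US$ billion, 2032 value quoted in the summary
years = 2032 - 2022

# Compound annual growth rate under the assumption of steady growth.
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"Implied compound annual growth rate: {cagr:.1%}")
```

The implied rate is roughly 15% per year, meaning the market would about quadruple over the decade.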

Full article:

At first glance, cloud computing is more business-oriented than academic-centric. It might be unclear to psychologists, sociologists, or biologists why high-performance computing in a cloud-based platform is relevant. Consider this hypothetical example: In the past, it took 13 years to finish the Human Genome Project because DNA sequencing was very complicated and tedious. Had biologists at that time employed current technologies, the Human Genome Project would have been completed in two years! Next, consider this real-life example: Facebook, Google, Amazon, etc. have been collecting behavioral data in naturalistic settings, and their forecasting models are highly accurate. Think about its implications for psychology and sociology!

Posted on February 18, 2022

According to a recent study conducted by researchers at Lancaster University and UC Berkeley, participants reported that faces generated by AI are more trustworthy than actual human faces. The researchers suggested that AI-generated faces are viewed as more trustworthy because they resemble the characteristics of average human faces, which are deemed more trustworthy. This paper will be published in the Proceedings of the National Academy of Sciences (PNAS).

The artificial faces used in this study are created by a generative adversarial network (GAN) named StyleGAN2. A generative adversarial network was invented by Goodfellow et al. in 2014. GAN consists of two sub-models: a generator for outputting new examples, and a discriminator that can classify the examples as real or fake. The two models are adversaries in the sense that the generator, which acts as a team of counterfeiters, tries to fool the discriminator, which plays the role of the police.

Full article:

That’s my take on it: This finding has profound implications for both psychologists and philosophers. Why do many people accept disinformation, conspiracy theories, utopian ideas, and many other fake things? It is because we tend to look for something better than what we find in reality! As a result, we can be easily fooled by others (e.g., AI) and at the same time, we fool ourselves!

Posted on February 18, 2022

Several days ago I read a discussion thread on Quora (see below):
Christian Howard
Ph.D. in Computer Science, the University of Illinois at Urbana-Champaign (Expected 2024)
Is data science/machine learning/AI overhyped right now?

Yeah, it is overhyped, though certainly still valuable.

Some of the things I laugh about when it comes to these areas are the people out there who talk about fitting a model with least squares being “machine learning”, even though this basic statistical technique has been around forever. I remember when I was first reading about neural networks back in 2012, my dad told me how he tuned neural networks that modeled risk at some big financial company he worked at in the 80s. The thing I realized is there are a lot of techniques getting rebranded that have been around for a while and have really only come back due to better computational resources, more data, and some other research-related developments.

But at the end of the day, data science/machine learning/AI is not the magic bullet today that a lot of the tech media portrays it to be. Tons of non-technical people, from what I have learned by talking to people in my professional network, think AI and Machine Learning can currently be used to solve impossible problems for companies. This is leading non-AI/ML companies to hire people with the data science and machine learning background to try and turn the data they have into some magic mathematical serum that can be used to wreck their competition. The wishes of many of these companies are infeasible and unrealistic and put insane pressure on the data science/ML teams they build to do the impossible. This is a problem and it all stems from the fact that there’s a hype about what data science/ML/AI can do today and it’s inaccurate. Not to mention, there’s a lot of research that still needs to be done to really understand some areas of ML that are hyped, like deep learning.

My dad is an executive consultant in tech-oriented companies and he tells me he sees so many companies who try to use AI to help rebrand their business since it’s a hot area, but they will minimally dip into AI by just using basic statistical learning techniques or just grab Tensorflow and use a deep learning architecture to try and model some dataset they have internally. It’s such a joke, all a function of the hype, and clearly not nearly as great a use for data science/ML/AI as some of the things larger tech companies are doing with that stuff.

So yeah, I think that while data science/ML/AI is useful to learn and use, it is indeed overhyped and likely will be for a little while.

That’s my take on it: The least-squares criterion for OLS regression was introduced in 1805, but today many people treat it as a data science approach. This confusion can be attributed to the issue that many students are not informed about the differences and similarities between traditional statistics and modern data science.
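To show how basic the least-squares criterion is, here is a minimal sketch of simple linear regression using the textbook closed-form solution. It uses no machine learning library at all; the function name and the toy data are made up for illustration.

```python
def fit_ols(xs, ys):
    """Fit y = intercept + slope * x by minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares estimates for one predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data lying exactly on the line y = 1 + 2x.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
intercept, slope = fit_ols(xs, ys)
print(f"intercept={intercept}, slope={slope}")
```

Whether one calls this statistics or machine learning, the computation is the same.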

It is true that some modern techniques have been around for a while. For example, the decision tree approach was developed in the 1980s. While the theoretical foundation of the connectionist approach to AI can be traced back to 1943, working models of neural networks appeared in the 1980s and 1990s. But their popularity and prominence are a more recent phenomenon. Before the 2010s, neural networks did not demonstrate many advantages over traditional analytical methods, such as linear regression and logistic regression, as well as other data mining methods, such as the decision tree.

Is it over-hyped? With the advance of high-performance computing, these methods are re-packaged and further developed. More importantly, the availability of big data opens the door to new possibilities. Remember the e-commerce bubble in the 1990s? Any new movement tends to overpromise and underdeliver, especially when too many people rush to the “gold mine” without proper tools and training. Remember the parable of the weeds in Matthew 13? In the end, robust data science solutions will be here to stay! 

Posted on February 15, 2022

Recently I received a free copy of the report “Data Science Platforms: Buyer’s Guide and Reviews” updated by PeerSpot (formerly IT Central Station) in February 2022. Unlike other benchmark studies that rely on numeric ratings, PeerSpot’s report compiled qualitative data (open-ended comments). This timely report includes assessments of 10 data science tools: Alteryx, Databricks, KNIME, Microsoft Azure, IBM SPSS Statistics, RapidMiner, IBM SPSS Modeler, Dataiku Data Science Studio, Amazon SageMaker, and SAS Enterprise Miner. However, the report is copyrighted and, needless to say, I cannot share the full text with you. The following are some excerpts of user feedback on IBM SPSS Statistics, IBM SPSS Modeler, Amazon SageMaker, and SAS Enterprise Miner.

IBM SPSS Statistics

Pro: The features that I have found most valuable are Bayesian statistics and descriptive statistics. I use these more often because pharma companies and clinical hospitals make the medicines by taking feedback from different patients.

Con: I'd like to see them use more artificial intelligence. It should be smart enough to do predictions and everything based on what you input. Right now, that mostly depends on the know-how of the user.

IBM SPSS Modeler

Pro: I like the automation and that this product is very organized and easy to use. I think these features can be found in many products but I like IBM Modeler because it's very clear about how to use it. There are many other good features and I discovered something that I haven't seen in other software. It's the ability to use two different techniques, one is the regression technique and the other is the neural network. With IBM you can combine them in one node. It improves the model which is a big advantage.

Con: The time series should be improved. The time series is a very important issue, however, it is not given its value in the package as it should be. They have only maybe one or two nodes. It needs more than that.

Amazon SageMaker

Pro: The most valuable feature of Amazon SageMaker is that you don't have to do any programming in order to perform some of your use cases. As it is, we can start to use it directly.

Con: SageMaker is a completely new tool. It can be very hard to digest. AWS needs to provide more use cases for SageMaker. There are some, but not enough. They should collect or create more use cases.

SAS Enterprise Miner

Pro: The solution is able to handle quite large amounts of data beautifully. The modeling and the cluster analysis and the market-based analysis are the solution's most valuable aspects. I like the flexibility in that I can put SAS code into Enterprise Miner nodes. I'm able to do everything I need to do, even if it's not part of Enterprise Miner. I can implement it using SAS code. The GUI is good. The initial setup is fairly easy to accomplish.

Con: One improvement I would suggest is the compatibility with Microsoft SQL and to improve all communications to the solution. For a future release, I would like for the solution to be combined with other product offerings as opposed to a lot of separate solutions. For example, Text Miner is a separate product. I have to spend additional money to purchase a license for Text Miner.

Posted on February 6, 2022

On Feb. 1, 2022, Fortune Education published an article detailing how Zillow’s big data approach to its real estate investment failed. In 2019 Zillow made a huge profit ($2.7 billion) by flipping: buying a house, making some renovations, and then selling it at a higher price. In 2006, Zillow collected data on approximately 43 million homes and later added 110 million houses to the database. Big-data analysis informed Zillow what to offer and how much to charge on the flip, and at that time the error rate was as low as 5%. However, Zillow recently failed to take the skyrocketing costs of materials and labor into account; as a result, Zillow paid too much to purchase properties and flipping was no longer profitable. In response to this case, Fortune Education cited the comment made by Lian Jye Su, a principal analyst at ABI Research: “There is a reason why governments and intelligence firms are bullish on big data. There’s not enough human intelligence to go around. It’s not cheap to hire the people. And we’re swamped with data.”

Full article:

Posted on January 29, 2022

Recently I Google-searched for the best data analysis software tools of 2022. Several lists were returned by Google, and not surprisingly, their rankings are slightly different.

According to eWeek, the top ten data analytical tools are: 1. IBM 2. Microsoft 3. MicroStrategy 4. Qlik 5. SAP 6. SAS 7. Sisense 8. Tableau 9. ThoughtSpot 10. TIBCO.

The ranking of QA Lead is as follows: 1. Azure 2. IBM Cloud Pak 3. Tableau 4. Zoho Analytics 5. Splunk 6. SAS Visual Analytics 7. Arcadia Enterprise 8. Qrvey 9. GoodData 10. Qlik Sense.

The order of data analysis software tools ranked by VS Monitoring is: 1. Tableau 2. Zoho 3. Splunk 4. SAS Visual Analytics 5. Talend 6. Cassandra 7. Sisense 8. Spark 9. Plotly 10. Cloudera.

A fourth list is as follows: 1. Python 2. R 3. SAS 4. Excel 5. Power BI 6. Tableau 7. Apache Spark.

By Selecthub’s ratings, the top ten are: 1. Oracle 2. IBM Watson 3. SAP 4. BIRT 5. Qlik Sense 6. Alteryx 7. MicroStrategy 8. SAS Viya 9. Tableau 10. TIBCO.

That’s my take on it: Which data analytical tools are the best? I will give you a Bayesian answer: It depends! Indeed, these diverse assessments are dependent on different criteria. Nonetheless, there is a common thread across these rankings. Only two companies appear in all five lists: SAS and Tableau. SAS is a comprehensive end-to-end solution whereas Tableau specializes in data visualization for business intelligence. Which one is really better? It depends! 
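The claim that only SAS and Tableau appear in all five lists can be checked with a set intersection. In the sketch below each product is reduced to a vendor name (e.g., “SAS Viya” and “SAS Visual Analytics” both count as SAS, and “Azure” counts as Microsoft); that mapping and the label given to the list without a named source are assumptions made for illustration.

```python
# Vendors appearing in each ranking. Mapping products to vendors
# (e.g. "SAS Viya" -> "SAS", "Azure" -> "Microsoft") is an assumption.
rankings = {
    "eWeek": {"IBM", "Microsoft", "MicroStrategy", "Qlik", "SAP", "SAS",
              "Sisense", "Tableau", "ThoughtSpot", "TIBCO"},
    "QA Lead": {"Microsoft", "IBM", "Tableau", "Zoho", "Splunk", "SAS",
                "Arcadia", "Qrvey", "GoodData", "Qlik"},
    "VS Monitoring": {"Tableau", "Zoho", "Splunk", "SAS", "Talend",
                      "Cassandra", "Sisense", "Spark", "Plotly", "Cloudera"},
    "Fourth list": {"Python", "R", "SAS", "Excel", "Power BI", "Tableau",
                    "Spark"},
    "Selecthub": {"Oracle", "IBM", "SAP", "BIRT", "Qlik", "Alteryx",
                  "MicroStrategy", "SAS", "Tableau", "TIBCO"},
}

# Keep only the vendors that are present in every ranking.
common = set.intersection(*rankings.values())
print(sorted(common))
```

Under this mapping the intersection contains exactly SAS and Tableau. A different way of grouping products into vendors could change the result, which is one more reason why the answer "depends".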

Posted on January 27, 2022

Yesterday the National Opinion Research Center (NORC) at the University of Chicago announced the upgrade of the General Social Survey Data Explorer. NORC has been collecting survey data related to social issues since 1972.

NORC has updated the General Social Survey’s Data Explorer (GSS-DE) and Key Trends to make them better tools for users. This update includes substantial upgrades including a simplified user interface and single sign-in. The new version of the Data Explorer (GSS-DE 2.0) will be available this Winter (2022). The existing version of the Data Explorer and Key Trends (GSS-DE and Key Trends 1.0) has been discontinued now that the new GSS-DE 2.0 site has been launched. Please note that GSS-DE and Key Trends 1.0 are no longer available.

With the launch of Data Explorer 2.0, signing in for the first time may look a little different. Once you've navigated to the Data Explorer 2.0 site, log in with your credentials to receive an email with a temporary password. Returning users will need to change their passwords and update information for security purposes. Once you've logged in with the temporary password, you will be prompted.

That’s my take on it:

In the past, my students and I published several journal articles using NORC data. There are several advantages of archival data analysis:

·      It saves time, effort, and money because you don’t need to collect data on your own and get IRB approval.

·      It provides a basis for comparing the results of secondary data analysis and your primary data analysis (e.g., national sample vs. local sample).

·      The sample size is much bigger than what you can collect by yourself. A small-sample study lacks statistical power and the result might not be stable across different settings. On the contrary, big data can reveal stable patterns.

·      Many social science studies are conducted with samples that are disproportionately drawn from Western, educated, industrialized, rich, and democratic populations (WEIRD). Nationwide and international data sets alleviate the problem of WEIRD.

On the other hand, there are shortcomings and limitations. For example, you might be interested in analyzing disposable income, but the variable is gross income. In other words, your research question is confined by what data you have at hand.

Posted on January 25, 2022

Recently the University of the West of Scotland introduced an AI-enabled system that is capable of accurately diagnosing COVID19 in just a few minutes by examining X-ray scans. The accuracy is as high as 98%. This AI system can draw the conclusion by comparing scanned images belonging to patients suffering from COVID19 with healthy individuals and patients with viral pneumonia. The inference engine of this AI system is the deep convolutional neural network (CNN), which is well-known for its applications in computer vision and image classification.

Full article:

That’s my take on it: There are at least four types of neural networks: the standard artificial neural network (ANN), the convolutional neural network (CNN), the recurrent neural network (RNN), and the generative adversarial network (GAN). CNN is one of the oldest among them. Nonetheless, it is by no means outdated. As more hidden layers are added into a CNN, it can be turned into a powerful deep learning system. However, I guess it may take months or years for the preceding AI diagnostic system to supplement or replace the regular PCR tests for COVID19, due to our natural disposition of being skeptical of novel ideas.

Posted on January 21, 2022

On Jan 16, 2022, Chad Reid, VP of marketing and communications at Jotform, posted an article on Inside Big Data. In this article, he argued that there are two types of data visualization: exploratory and explanatory, and both are valuable for fulfilling different needs. He cited an article posted on the American Management Association website to support explanatory data visualization. According to prior research:

·      64% of participants made an immediate decision following presentations that used an overview map.

·      Visual language can shorten meetings by 24%.

·      Groups using visual language experienced a 21% increase in their ability to reach consensus.

·      Presenters who combined visual and verbal presentations were viewed as 17% more convincing than those who used the verbal mode only.

·      Written information is 70% more memorable when it is combined with visuals and actions.

·      Visual language improves problem-solving effectiveness by 19%.

·      Visual language produces 22% higher results in 13% less time.

Full articles:

Posted on January 18, 2022

Recently Europol, the law enforcement agency of the European Union, was ordered to delete a vast amount of data collected over the past six years, after being pressured by the European Data Protection Supervisor (EDPS), the watchdog organization that supports the right to privacy. Under this ruling, Europol has a year to go through 4 petabytes of data to determine which pieces are irrelevant to crime investigation, and in the end, these data must be removed from the system. The responses to this decision are mixed. Not surprisingly, privacy supporters welcome the ruling while law enforcement agencies complain that this action would weaken their ability to fight crime.

Full article:

Posted on January 11, 2022

Last year Python was the number one programming language, according to TIOBE, a software quality measurement company based in the Netherlands. It produces a monthly index of popular languages across the world, using the number of search results in popular search engines. On the list C (and its variants), Java, Visual Basic, JavaScript, and SQL continue to be among the top 10. R is ranked number 12.

Full article:

That’s my take on it: The TIOBE index is based on popularity in terms of search results. It doesn’t assess the quality of the programming languages (e.g., ease of use, efficiency, functionality, etc.). Besides TIOBE, there are other indices for programming languages. In PYPL Python is still at the top, whereas in Stack Overflow the champion is JavaScript (see the links below). It is advisable to look at multiple indicators in order to obtain a holistic view.

Stack Overflow:


Posted on December 13, 2021

A few days ago Timnit Gebru, who resigned from Google and launched her own AI research institute, published an article entitled “For truly ethical AI, its research must be independent of big tech” in The Guardian. In the article she accused several big tech companies of unethical behavior: Google forced her to withdraw a paper on the bias of language models, Amazon crushed a labor union, and Facebook prioritizes growth over all else. In addition, she mentioned that recently California passed the Silenced No More Act to enable workers to speak out against racism, harassment, and other forms of abuse in the workplace, thus preventing big corporations from abusing power. In conclusion, she suggested that we need alternatives rather than allowing big tech companies to monopolize the agenda.

Posted on December 3, 2021

Timnit Gebru is an Ethiopian-American computer scientist who specializes in algorithmic bias and data mining. For a long time, she led various AI task forces at big tech corporations, including Apple and Google. Her career path changed in December 2020, when a Google manager asked her to either withdraw a pending paper pertaining to bias in language models or remove the names of all the Google employees from the paper. According to Google, the paper ignored the latest developments in bias reduction. Gebru refused to comply and eventually resigned from her position. Recently Gebru announced that she is launching an independent AI research institute focusing on the ethical aspects of AI. Her new organization, the Distributed Artificial Intelligence Research Institute (DAIR), received $3.7 million in funding from the MacArthur Foundation, Ford Foundation, Kapor Center, Open Society Foundation, and the Rockefeller Foundation.

Full article:

Posted on November 9, 2021

Today is the first day of the 2021 Tableau Online Conference. I attended several informative sessions, including the one entitled “Data is inherently human” (see attached). This session highlighted the alarming trend that 85% of all AI projects will deliver erroneous results due to bias in data, algorithms, or human factors, according to a Gartner report. One of the speakers, who is a white woman, pointed out that AI-empowered voice recognition systems have problems with her southern accent. In addition, when she listened to her daughter's TikTok, she knew it was English, but she had no idea what it meant. She emphasized that machine learning algorithms, such as those used in sentiment analysis, must adapt to linguistic evolution. Some terms that were negative two years ago might mean something positive today.

Posted on October 30, 2021

The open-source software platform GitHub, owned by Microsoft, stated that for some programming languages, about 30% of new code is suggested by its AI programming tool Copilot, which is built on the OpenAI Codex model. This machine learning model is trained on terabytes of source code and is capable of translating natural human language into a programming language. According to Oege de Moor, VP of GitHub Next, a lot of users have changed their coding practices because of Copilot and as a result, they have become much more productive in their programming.

That’s my take on it: On the one hand, it is a blessing that cutting-edge technologies can make programming more efficient by modeling after many good examples. But on the other hand, it could suppress potential innovations due to some kind of echo chamber effect. Consider this scenario: Henry Ford consults an AI system in an attempt to build a more efficient process for manufacturing automobiles. Based on a huge collection of “successful” examples learned from other automakers, the machine learning algorithm might suggest to Ford to improve efficiency by hiring more skilled workers and building a bigger plant. The idea of an assembly line would never come up! I am not opposed to programming assistance, but at the end of the day, I must remind myself that I am the ultimate developer! 

Posted on October 27, 2021

Two days ago (Oct. 25, 2021) the Financial Times reported that UK’s spy agencies have signed a contract with Amazon Web Services. British intelligence agencies, such as MI5 and MI6, will store classified information in the Amazon cloud platform and also utilize Amazon’s AI for intelligence analytics. British intelligence offices have been using basic forms of AI, such as translation technology, since the dawn of AI. Now they decided to expand AI applications in response to the threat from AI-enabled hostile states.

That’s my take on it: The stereotypical image of people in espionage is 007: handsome, strong, and daring enough to fight dangerous villains in hand-to-hand combat. Not anymore! In the near future, the most powerful weapon for a spy will not be the Beretta pistol (the type of handgun used by James Bond); rather, it will be a mouse and a keyboard. If you want to be the next James Bond, study data science and machine learning!

Posted on October 21, 2021

Currently, I am working on a book chapter regarding ensemble methods. During the literature review process a recent research article caught my attention:

Ismail, A. et al. (2021). A new deep learning-based methodology for video DeepFake detection using XGBoost. Sensors, 21. Article 5413.

DeepFake is a deep learning technique that can replace one person with another in video and other digital media. Famous humorous examples include fake videos of Obama and Queen Elizabeth. An infamous example is that in 2017 a Reddit user transposed celebrity faces into porn videos. Ismail and colleagues developed a new DeepFake detection system based on XGBoost, a supervised machine learning method that improves the model gradually by fitting many decision trees in sequence, each one to the residuals left by the previous iteration. The authors claimed that the accuracy is 90.73%, meaning that the error rate is 9.27%.
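The idea of improving a model gradually by analyzing the residuals in each iteration can be illustrated with a toy example. The following is only a minimal sketch of residual-based boosting with one-split trees ("stumps"); it is not XGBoost itself and not the detection system in the paper. The data, the function names, the number of rounds, and the learning rate are all made up for illustration.

```python
# Toy residual-based boosting for regression (NOT XGBoost itself).
# Each round fits a one-split "stump" to the residuals left by earlier rounds.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate split points; right side never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly add a small correction per round."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    """Mean squared error between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.2, 6.0, 6.1]

base, stumps, preds = boost(xs, ys)
print("MSE of the mean-only model:", mse(ys, [base] * len(ys)))
print("MSE after boosting:        ", mse(ys, preds))
```

On the training data the error shrinks round by round, because each stump corrects part of what the previous rounds got wrong. Real implementations add regularization and many other refinements.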

That’s my take on it: In 1997 when Linda Tripp recorded her conversation with Monica Lewinsky about her affair with President Clinton, the legal system accepted the audiotapes as convincing evidence. Today you cannot trust video recordings, let alone audio! There is a still-photo equivalent to DeepFake: DeepNude. This app uses neural networks to remove clothing from images of people, and the result looks realistic. The app was sold for only $50. Due to its widespread abuse, the developer withdrew it in 2019. However, parts of the source code are open and as a result, there are many copycats in the market. I am glad that now cutting-edge technologies like XGBoost can be used to detect faked videos, but in the first place, the problem originates from state-of-the-art technologies! According to some experts, DeepFake technologies have been improving exponentially. In late 2017 it took hundreds of images and days of processing time to swap faces in a video clip. Today it requires only a handful of images, or even just text inputs, and a few hours. It is similar to the race between computer viruses and antivirus software packages. No matter how sophisticated antivirus software is, Trojan horses, spyware, ransomware, etc. keep evolving. The same contest will happen between DeepFake/DeepNude and fake video/image detection systems. Pandora’s box has been opened!

Posted on October 15, 2021

Recently Facebook launched a new research project named Ego4D in an attempt to teach AI to comprehend and interact with the world as humans do, rather than from a third-person perspective. There are two major components in Ego4D: an open dataset of egocentric (first-person perspective) video and a series of benchmarks that Facebook thinks AI systems should be capable of handling in the future. The dataset, which is the biggest of its kind, was collected by 13 universities around the world. About 3,205 hours of video footage were recorded by 855 participants living in nine different countries. Full article:

That’s my take on it: For a long time research activities have been limited by a narrow definition of data: numbers in a table. In qualitative research, we go one step further by including open-ended responses. But that is not enough! A lead research scientist at Facebook said: “For AI systems to interact with the world the way we do, the AI field needs to evolve to an entirely new paradigm of first-person perception. That means teaching AI to understand daily life activities through human eyes.” Whether there will be any self-aware AI system in the future is controversial. Nonetheless, the way Facebook is trying to train AI is also applicable to human researchers. No matter whether the data are structured or unstructured, currently researchers investigate issues or phenomena from a third-person perspective. Perhaps video-based or VR-based data could unveil insights that were overlooked in the past.

Posted on October 11, 2021

Nicolas Chaillan, the Pentagon's former Chief Software Officer (CSO), told the Financial Times that China has won the artificial intelligence battle with the US and is heading towards global dominance in key technological sectors. According to Chaillan, "We have no competing fighting chance against China in 15 to 20 years. Right now, it's already a done deal; it is already over in my opinion.” Chaillan blamed the gap on slow innovation, the reluctance of U.S. companies such as Google to work with the government on AI, and delays due to extensive ethical debates over the technology. He scoffed that U.S. cyber defense capability in some government departments was at the "kindergarten level". Chaillan resigned from his position to protest against the culture of inaction and slow responses.

English version:

Chinese version:

That’s my take on it: It is not the first time. Right after AT&T’s Bell Labs invented the transistor in 1947, Sony immediately bought the license and introduced the first transistor-based radio while US home electronics manufacturers still stayed with bulky vacuum tubes. In the 1960s Japanese automakers produced affordable, dependable, and fuel-efficient small cars, but their US competitors did not experiment with their first compact cars until 1971. During the last several years China, South Korea, Sweden, and Finland have been investing in 5G infrastructure. However, at the present time, the US still lags behind international competitors in 5G. Will the Biden administration act upon the AI gap? It is never too late!

Posted on October 7, 2021

Today is the third day of the 2021 JMP Discovery Summit. I learned a lot from the plenary talk entitled “Facets of a diverse career” presented by Dr. Alyson Wilson, Associate Vice Chancellor for National Security and Special Research Initiatives and Professor of Statistics at North Carolina State University. Her work experience spans academia, industry, and government. She said that her career is a testament to John Tukey's statement: “The best thing about being a statistician is that you get to play in everyone’s backyard.” She covered many topics in the talk. I would like to highlight some of them as follows:

Many years ago she worked at the Los Alamos National Lab as a specialist in national security science, especially on weapons of mass destruction. You may wonder what role a statistician would play in this domain. Because the US signed the nuclear test-ban treaty, no comprehensive reliability tests have been conducted on US nuclear weapons since the 1990s. Instead, statisticians like her have used historical and simulation data for reliability analysis. We are not 100% sure whether the missile works until we push the button!

Although Dr. Wilson was trained in traditional statistics, under her leadership NC State University established the Data Sciences Initiative for coordinating DS-related resources and work across ten departments in the university. In March 2021 NC State University launched a university-wide data science academy. The academy aims to enhance the infrastructure, expertise, and services needed to drive data-intensive research discoveries, enhance industry partnerships, and better prepare its graduates to succeed in a data-driven economy.

That’s my take on it: In the Q & A session, I asked her: “The US collects a lot of data related to the COVID19 pandemic, but our countermeasures against the pandemic are not as effective as those of some Asian countries (e.g. Taiwan and Singapore). Do you think there is a disconnect between data analytics and decision support?” Dr. Wilson replied that we need to put good science on the data, but decision-making is multi-faceted. Something obvious to statisticians and data scientists may not be obvious to decision-makers.

I agree. Collecting and analyzing data is important, but at the end of the day, the most important thing is what we do with the information. 

Posted on October 6, 2021

Recently Mo Gawdat, formerly the Chief Business Officer for Google’s moonshot organization, told Times Magazine that we are getting closer and closer to AI singularity, the point in time at which AI becomes self-aware or acquires a superpower beyond our control. He believed that it is inevitable for AI to become as powerful as Skynet in “Terminator.” At that point, we will helplessly sit there to face the doomsday brought forth by god-like machines. Why did he make such a bold claim? Mo Gawdat said that he had his frightening revelation while working with AI developers at Google to build robotic arms. Once a robot picked up a ball from the floor and then held it up to the researchers. Mo Gawdat perceived that the robot was showing off.

That’s my take on it: As a psychologist, I think Mo Gawdat’s concern is a result of anthropomorphism, a tendency to see human-like qualities in a non-human entity. It happens all the time; for example, we project our human attributes onto pets. Now this disposition extends to robots. However, even though an AI-enabled robot acts like a human, it doesn’t necessarily imply that the robot is really self-conscious or has the potential to become self-aware. I don’t worry about terminators or the Red Queen (in the movie “Resident Evil”), at least not in the near future!

Posted on October 5, 2021

Today is the second day of the 2021 JMP Discovery Summit. I would like to highlight what I learned from the plenary session entitled “Delicate Brute Force.” The keynote speaker is John Sall, co-founder of SAS Institute and the inventor of JMP. In the talk Sall pointed out that traditional clustering and data reduction methods are very inefficient at processing big data. To rectify the situation, Sall experimented with several new methods, such as vantage point trees, hybrid Ward, randomized singular value decomposition (SVD), and multi-threaded randomized SVD. Improvements were made bit by bit. For example, in a big data set containing 50,000 observations and 210 variables, it took 58 minutes to process the data with R’s fastcluster package. Fast Ward in JMP cut the processing time down to 8 minutes, while the new hybrid Ward took only 22 seconds. Further improvements reduced the processing time to 6.7 seconds.

That’s my take on it: No doubt analytical algorithms are getting better and better, but very often the adoption rate cannot keep pace with technological innovation. I foresee that in the near future standard textbooks will not include hybrid Ward or multi-threaded randomized SVD. On the contrary, I expect widespread resistance. Think about what happened to Bruno, Copernicus, and Galileo when they proposed a new cosmology. Look at how US automakers ignored W. Edwards Deming. Perhaps we need another form of delicate brute force for psychological persuasion.

Posted on September 29, 2021

Recently Bernard Marr, an expert on enterprise technology, published two articles on Forbes, detailing his prediction of AI trends. In both articles, Marr mentioned the trend of no- or low-code AI. As a matter of fact, not every company has the resources to hire an army of programmers to develop AI and machine learning applications. As a remedy, many of them started considering no- or low-code and self-service solutions. For example, Microsoft and other vendors have been developing natural language processing tools for users to build queries and applications by speaking or writing natural languages (e.g. “Computer! Build a time-series analysis of revenues by product segment from 2015-2021. I want the report in 30 minutes, or else!”)  

Marr, B. (2021, September 24). The 7 biggest artificial intelligence (AI) trends in 2022. Retrieved from

Marr, B. (2021, September 27). The 5 biggest technology trends in 2022. Forbes. Retrieved from

That’s my take on it: History is cyclical. When I was a student, programming skills were indispensable. In 1984 Apple revolutionized the computing world by implementing the graphical user interface (GUI) on Mac OS (GUI was invented by the Xerox Palo Alto Research Center, not Apple). Since then GUI has made computing not only easier to operate but also more pleasant and natural. In recent years coding has become a hot skill again. Once a student told me, “employers don’t want a data analyst doing drag-and-drop, point-and-click, etc.” Not really. As experienced data analyst Bill Kantor said, many tasks are faster and easier to perform in applications with GUI than by programming. Today many corporations are aware of it and therefore they are looking for faster no- or low-code solutions. But you don’t need to wait for natural language processing. Conventional GUI is good enough to make your life easier!

Posted on September 19, 2021

Today is the last day of Data Con LA 2021. I really enjoyed the talk “Catch me if you can: How to fight fraud, waste, and abuse using machine learning and machine teaching” presented by Cupid Chan. Dr. Chan was so humorous that he boldly claimed, “While others may take days or weeks to train a model, based on my rich experience in AI, I can build a model guaranteed with 99.9% accuracy within 10 seconds!” The fool-proof approach is: “declare that everything is NOT fraud!” Even though fraud is prevalent (credit card fraud, health care fraud, identity theft, etc.), the majority of all transactions and events (99.9%) are legitimate. Consequently, a model that yields high predictive accuracy could be totally useless. This problem also occurs in spotting manufacturing defects, diagnosing rare diseases, and predicting natural disasters. There are different approaches to rectify the situation, including random undersampling (RUS). For example, when a data set is composed of 4,693 positive and 54,333,245 negative cases, all positive cases should be kept, of course, but only a subset of negative cases is randomly selected for machine learning. By doing so the algorithm would not over-learn from an extremely asymmetrical data set. Feeding this subsample into Google’s TensorFlow Boosted Tree Classifier, Chan found that the predictive accuracy is about 85%, rather than 99.9%. But this reduction is a blessing in disguise!
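The arithmetic behind the joke, and the basic mechanics of RUS, can be sketched in a few lines. This is only an illustration, not Dr. Chan's code. The function names, the 1:1 ratio of negatives to positives, and the small synthetic label list are my assumptions; only the two case counts come from the talk.

```python
import random

def all_negative_baseline(n_pos, n_neg):
    """Accuracy and recall of a 'model' that labels every case as NOT fraud."""
    accuracy = n_neg / (n_pos + n_neg)  # every negative is "correct"
    recall = 0.0                        # no positive case is ever caught
    return accuracy, recall

def random_undersample(labels, ratio=1.0, seed=0):
    """Keep every positive case and a random subset of negatives.

    `ratio` is negatives kept per positive (1.0 is an assumed choice).
    Returns the indices of the cases that are kept.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(len(pos) * ratio))
    kept = pos + rng.sample(neg, n_keep)
    rng.shuffle(kept)
    return kept

# Case counts quoted in the talk.
acc, rec = all_negative_baseline(4_693, 54_333_245)
print(f"accuracy = {acc:.4%}, recall = {rec:.0%}")

# Small synthetic label list (hypothetical) to show the undersampling step.
labels = [1] * 50 + [0] * 50_000
kept = random_undersample(labels, ratio=1.0)
balanced = [labels[i] for i in kept]
print(len(balanced), "cases kept,", sum(balanced), "of them positive")
```

The baseline is "accurate" more than 99.9% of the time yet catches no fraud at all, which is why accuracy alone is a misleading yardstick for rare events.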

That’s my take on it: There are many overlapping ideas between traditional statistics and modern data science. Conceptually speaking, RUS is similar to the case-control design in classical research methodologies. For example, in a study that aims to identify factors of illegal drug use at schools, it is extremely difficult, if not impossible, to recruit students who admit to using illegal drugs. A viable approach is to blanket all the students in a school with anonymous surveys. Suppose it turns out that 50 out of 1,000 students report drug use. However, if these 50 cases are compared against 950 controls (no drug use), the variances of the two groups would be extremely asymmetrical, thus violating the assumption of most parametric tests. To make a valid comparison, 50 non-drug users are selected from the sample by matching the demographic and psychological characteristics of the 50 cases (Tse, Zhu, Yu, Wong, & Tsang, 2015). As such, learning traditional statistics can pave the way to learning data science and artificial intelligence.


Chan, C. (2021, September). Catch me if you can: How to fight fraud, waste, and abuse using machine learning and machine teaching. Paper presented at Data Con LA 2021, Online.

Tse, S., Zhu, S., Yu, C. H., Wong, P., & Tsang, S. (2015). An ecological analysis of secondary school students' drug use in Hong Kong: A case-control study. International Journal of Social Psychiatry, 10, 31-40. DOI: 10.1177/0020764015589132. Retrieved from

Posted on September 18, 2021

Today is the third day of Data Con LA 2021. Again, there are many interesting and informative sessions. The talk entitled “Too Much Drama and Horror Already: The COVID-19 Pandemic's Effects on What We Watch on TV” presented by Dr. Danny Kim (Senior Data Scientist at Whip Media) caught my attention. The theoretical foundation of his study is the environmental security hypothesis. According to the theory, viewers tend to look for meaningful and serious content in the media during tough times in order to help assuage uncertainty and anxiety. In contrast, people favor fun content when living conditions are not stressful. Utilizing big data (n = 233,284), Kim found that consumption of three genres has dropped substantially since the COVID19 pandemic began:

·         Drama: 8-11% drop

·         Horror: 4-5% drop

·         Adventure: 3-4% drop

That’s my take on it: The finding of this study is partly corroborated by other reports. For example, in August 2020 Nielsen, a leading market measurement firm, found that news consumption (serious content) grew substantially. Nielsen found that 47% of those surveyed had either watched or streamed the news, making it the most popular TV genre. Yes, it’s time to be serious! The pandemic is too serious to be taken lightly.

Posted on September 17, 2021

Today is the second day of Data Con LA. Many sessions are informative and I would like to highlight one of them: “AI/ML/Data Science - Building a Robust Fraud Detection” presented by Gasia Atashian. In the talk, Gasia illustrated how Amazon SageMaker is utilized to detect online fraud (see attached figure). Even though the data size is gigantic, the prediction time is cut down to 30 minutes and the cost is less than $10 a month. Moreover, the predictive accuracy improves by 25%, compared with previous models. More importantly, you don’t need a supercomputer to run the program. Rather, what it takes is an 8-core CPU and 32GB of RAM! Her research has been published by Springer:

That’s my take on it: Models for fraud detection are not new. When I was a graduate student, a typical multivariate statistics class included discriminant analysis (DA), which is based on Fisher’s linear discriminant. The goal of DA is to find a linear combination of features that can classify entities or events into two or more categories (e.g. fraudulent vs. legitimate). At that time it was the state of the art. But no one could foresee that one day a system developed by a bookseller/online department store would become one of the most robust classifiers in the realm of mathematics and data analytics.
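For readers who have never seen DA in action, here is a minimal sketch of Fisher’s linear discriminant for two classes and two features. It is a textbook illustration only and has nothing to do with the SageMaker system in the talk. The two features (say, transaction amount and transactions per hour), the data points, and the function names are all hypothetical.

```python
# Fisher's linear discriminant for two classes with two features.
# The weight vector is w = Sw^{-1} (m1 - m0), where Sw is the pooled
# within-class scatter matrix and m0, m1 are the class means.

def mean_vec(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter(rows, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mean m."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    return sxx, sxy, syy

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def fisher_lda(class0, class1):
    m0, m1 = mean_vec(class0), mean_vec(class1)
    a0, b0, d0 = scatter(class0, m0)
    a1, b1, d1 = scatter(class1, m1)
    a, b, d = a0 + a1, b0 + b1, d0 + d1       # Sw = [[a, b], [b, d]]
    det = a * d - b * b
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Multiply diff by the inverse of Sw, which is (1/det) * [[d, -b], [-b, a]].
    w = [(d * diff[0] - b * diff[1]) / det,
         (-b * diff[0] + a * diff[1]) / det]
    threshold = 0.5 * (dot(w, m0) + dot(w, m1))  # midpoint of projected means
    return w, threshold

def classify(point, w, threshold):
    """Return 1 (fraud) if the projection exceeds the threshold, else 0."""
    return 1 if dot(w, point) > threshold else 0

# Hypothetical training data: (amount, transactions per hour).
legit = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2), (1.8, 1.5)]
fraud = [(4.0, 5.0), (4.5, 4.2), (5.0, 5.5), (4.2, 4.8), (5.2, 4.6)]

w, threshold = fisher_lda(legit, fraud)
print("weights:", w, "threshold:", threshold)
print("new case (1.4, 2.1) ->", classify((1.4, 2.1), w, threshold))
print("new case (4.8, 5.0) ->", classify((4.8, 5.0), w, threshold))
```

The whole "model" is a weighted sum of the features compared against a cut-off, which is why DA could be taught with a calculator.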

In addition, when I was a student, big data could only be analyzed by a workstation, such as SGI and Sun, or a supercomputer, such as Cray Y-MP and CM-5. Today if you have a computer equipped with a multi-core CPU, a GPU, and 16-32GB of RAM, you can be a data scientist!

Life is like a box of chocolates; you never know what will happen next.

Posted on September 14, 2021

On September 9 Microsoft announced that it has formed a joint venture with the Australian Institute for Machine Learning to explore how advanced cloud computing, AI, computer vision, and machine learning can be applied in space. The project scope includes building algorithms for on-board satellite data processing, developing solutions for the remote operation and optimization of satellites, as well as addressing space domain awareness and debris monitoring. According to Professor Tat-Jun Chin, Chair of Sentient Satellites at the Australian Institute for Machine Learning, the collaboration with Microsoft “will allow us to focus on the investigation on the performance of algorithms used to analyze large amounts of earth-observation data from satellites, without needing to be concerned about gaining access to space at the onset.”

The announcement of Microsoft can be found at:

That’s my take on it: In the past, Microsoft was considered an imitator rather than an innovator. Excel replaced Lotus and Paradox, MS Word took over the word processing market from WordPerfect, Internet Explorer expelled Netscape, Windows NT dethroned Novell NetWare, etc. The pattern is obvious: Microsoft reaped the fruits of other people’s innovations. Nevertheless, in the era of big data and machine learning, Microsoft has reinvented itself to be a different type of company. Now AI features are a large part of the company’s Azure Cloud service and no doubt today Microsoft is one of the leaders in AI innovation. To stay relevant, every organization has to reinvent itself!

Posted on September 7, 2021

This “news” is two months old (published on July 14, 2021). Nonetheless, it is still posted on the front page of “Inside Big Data”. After conducting extensive research, “Inside Big Data” released a report entitled “The insideBigData Impact 50 list for Q3 2021.” As the title implies, the report lists the 50 most impactful companies in data science and machine learning. According to the research team, the selection of these companies is based upon their massive data set of vendors and industry metrics. The research team also employed machine learning to determine the ranking. The following are the top 20 only:

1.      NVIDIA

2.      Google

3.      Amazon Web Services

4.      Microsoft

5.      Intel

6.      Hewlett Packard Enterprise

7.      DataRobot

8.      Dell Technologies

9.      Domino Data Lab


11.  Databricks

12.  Teradata

13.  Qlik

14.  TigerGraph

15.  Snowflake

16.  Kinetica

17.  SAS

18.  Anaconda (Python data science platform)

19.  Salesforce (the parent company of Tableau)

20.  OpenAI

That’s my take on it: NVIDIA is the inventor of the graphics processing unit (GPU). But why is it considered the most impactful company for big data? The answer is: parallel processing needs more GPUs. Having more GPUs enables deep learning algorithms to train larger and more accurate models. Currently, two of the world’s five fastest supercomputers (Sierra and Selene) are equipped with NVIDIA technologies.

Contrary to popular belief, proprietary software still has a very strong user base. For example, the ranking of SAS is higher than that of Anaconda, the platform for Python and other open-source resources.

Not surprisingly, IBM (the parent company of SPSS) is not among the top 50. Besides the top 50, fifty-eight companies are on the list of honorable mention. Again, IBM is not there. In 2011 IBM’s AI system Watson beat human experts in an epic Jeopardy match, but this halo cannot make IBM impactful today due to its legacy design.   

The full article can be viewed at:

Posted on August 27, 2021

XGBoost is one of the most advanced machine learning algorithms in the open-source community. It was introduced in 2014 by Dr. Tianqi Chen, an Assistant Professor at Carnegie Mellon University. The latest version was released in April 2021. XGBoost has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. No doubt this 21st-century algorithm is far better than least squares regression, which was developed in the 19th century. In spite of its predictive accuracy and computational efficiency, XGBoost is more popular in data science studies than in academia. What is XGBoost, really? Four days ago Shreya Rao published an article entitled “XGBoost regression: Explain it to me like I’m 10-year old” on Towards Data Science. The full article can be accessed at:

That’s my take on it: It is a common misconception that data science is very difficult to understand and implement. Actually, it is not. As the title of the preceding article implies, it is very easy to follow. You don’t need calculus or matrix algebra; rather, the concepts involved in XGBoost, such as residual, similarity, and gain, require basic arithmetic only. Besides XGBoost, there are several other types of boosting algorithms, such as Adaptive Boosting (AdaBoost) and Gradient Boosting (Gradient Boosting is taught in my class “STAT 553”). To boost or not to boost, that’s the question!
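To show how little arithmetic is involved, here is a small sketch of the residual, similarity, and gain calculations for a regression tree, using the simplified formulas commonly used in introductory explanations: similarity = (sum of residuals)² / (number of residuals + λ), and gain = left similarity + right similarity − root similarity. The four data points, the starting prediction (the mean), and λ = 0 are my assumptions for illustration; the real XGBoost library involves more details.

```python
def similarity(residuals, lam=0.0):
    """Simplified similarity score: (sum of residuals)^2 / (count + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left, right, lam=0.0):
    """How much a split improves on keeping all residuals in one leaf."""
    root = left + right
    return similarity(left, lam) + similarity(right, lam) - similarity(root, lam)

# Hypothetical observed values; the initial prediction is assumed to be the mean.
ys = [10.0, 12.0, 30.0, 32.0]
initial = sum(ys) / len(ys)                  # 21.0
residuals = [y - initial for y in ys]        # [-11.0, -9.0, 9.0, 11.0]

# Candidate split A separates the two low values from the two high values.
gain_a = gain(residuals[:2], residuals[2:])
# Candidate split B separates only the lowest value from the rest.
gain_b = gain(residuals[:1], residuals[1:])

print("gain of split A:", gain_a)
print("gain of split B:", gain_b)
```

The split with the larger gain is preferred. Here split A wins because it groups residuals of the same sign together, so they do not cancel each other out within a leaf.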

Posted on August 20, 2021

Recently Hewlett Packard Enterprise (HPE), a key player in high-performance computing that split from its parent company HP, released a report about the performance of HPE on SAS 9.4. According to the report, the key findings demonstrated high scalability when running SAS 9.4 using the Mixed Analytics Workload with HPE Superdome Flex 280 Server and HPE Primera Storage. These results demonstrated that the combination of the HPE Superdome Flex 280 Server and HPE Primera Storage with SAS 9.4 delivers up to 20GB/s of sustained throughput, up to a 2x performance improvement over the previous server and storage generation testing. The full report can be downloaded at:

That’s my take on it: Although open source has become more and more popular, some people might not realize that open-source software such as R is limited by memory and, by default, does not run multi-threaded processing. For high-performance computing and big data analytics, proprietary software from vendors such as SAS and IBM is still indispensable.

Posted on August 12, 2021

A week ago Microsoft announced that their researchers have developed the world’s largest general neural networks that utilize 135 billion parameters. Now the new AI system is used in Microsoft’s search engine, Bing. According to Microsoft, the enhanced Bing is able to determine whether a page is relevant to the query. For example, Bing learned that “Hotmail” is strongly associated with “Microsoft Outlook,” even though the two terms are not close to each other in terms of semantic meaning. The AI system identified a nuanced relationship between them based on their contexts. After the enhancement, Microsoft recorded a 2% increase in click-through rates on the top search results.

That’s my take on it: I tried the same phrases in both Google and Bing, for example, “Did Paul consult Greek philosophers?” (I deliberately left out the title “St.” or “Apostle”) and “Scholars have high h-index.” In most cases, Google and Bing returned different pages, yet most of them were highly relevant. However, for the query “Scholars have high h-index,” apparently Google beats Microsoft.

Bing returned pages explaining how the h-index is measured, such as “What is a good H-index?” “What is a good H-index for a professor in Biology?” “What number in the h-index is considered a passing grade?” This is not what I want! I want to see a list of highly influential scholars. The top result shown in Google is: “Highly cited researchers (h>100)”. The fourth one is: “Which researcher has the highest h-index?” Google won!

Posted on August 6, 2021
A recent report entitled “Data Science Needs to Grow Up: The 2021 Domino Data Lab Maturity Index” compiled by Domino found that 71% of the 300 data executives at large corporations are counting on data science to boost revenue growth, and 25% of them even expect double-digit growth. However, the report warned that many companies are not making proper investments to accomplish this goal.
In the survey, the participants reported different perceived obstacles to achieving the goal, as shown below:
· Lack of data skills among employees: 48%
· Inconsistent standards and processes: 39%
· Outdated or inadequate tools: 37%
· Lack of buy-in from company leadership: 34%
· Lack of data infrastructure and architecture: 34%
The full report can be downloaded at:
That’s my take on it: To be fair, the above issues happen everywhere. The gap between the goal and the implementation always exists. It reminds me of the theory of Management by Objectives (MBO) introduced by Peter Drucker. MBO refers to the process of goal-setting by both management and employees so that there is a consensus about what is supposed to be done. In my opinion, neither the top-down nor the bottom-up approach alone can ensure a successful implementation of data science.

Posted on July 27, 2021

Currently, the whole world has its eyes on the Olympic Games, and thus another interesting international competition is overlooked. Recently 50 teams from all over the world competed for a spot in the top ten of the World Data League, an international contest of using data science to solve social problems. The contest had four stages, and participants were required to solve a variety of problems, including public transportation, climate change, public health, and many others. All the complicated problems and voluminous data were provided by the organizations sponsoring the contest. After the multi-stage screening, the top ten teams were selected to enter the finals during the first week of July. The final challenge was about how to improve the quality of life by reducing city noise levels. In the end, the winner was an international team consisting of members from Germany, Italy, Portugal, and Australia.

That’s my take on it: For a long time, we have been training students how to write academic research papers for peer-reviewed journals. No doubt it is valuable because a shiny vita with a long list of presentations and publications can pave the way for a successful career. Nonetheless, perhaps we should also encourage them to analyze big data for solving real-world problems. There is nothing more satisfying than seeing that someday my students can reverse climate change in the World Data League!

Posted on July 16, 2021

Amazon, one of the leaders in cloud computing, will hold a free data conference on 8/19 between 9:00 AM and 3:00 PM Pacific. The conference aims to introduce the latest technology for building a modern data strategy to consolidate, store, curate, and analyze data at any scale, and share insights with anyone who needs access to the data. Registration is free and the link to register is:


That’s my take on it: Besides providing data services, Amazon also developed several powerful analytical tools, such as Amazon SageMaker:

Two decades ago I never imagined that a bookseller could become a major player in the field of data analytics or that Jeff Bezos would go into space travel. No matter whether you will use Amazon’s cloud computing or not, it is a good thing to learn about how Amazon can constantly reinvent itself. Look at the fate of another bookseller: Barnes and Noble (B & N). B & N has suffered seven years of declining revenue. To put it bluntly, the writing was on the wall when B & N didn’t want to go beyond its traditional boundary. Do we want ourselves to be like Amazon or Barnes and Noble?

Posted on July 15, 2021

Recently Dresner Advisory Services published the 2021 “Wisdom of Crowds Business Intelligence Market Study” to compare the strength of different vendors in business intelligence. The sample consists of 5,000+ organizations, and the research team rated various vendors by 33 criteria, including acquisition experience, value for the price paid, quality and usefulness of the product, quality of tech support, quality and value of consulting service, etc. The vendors are grouped into technology leaders and overall experience leaders. In this short message, I would like to focus on technology leaders. According to the report, the technology leaders are:

· Amazon
· Tableau
· Microsoft
· ThoughtSpot
· Qlik

That’s my take on it: Never count on a single report! Several other consulting companies, such as Gartner, Forrester, and IDC, also published similar reports, and their results are slightly different. Nonetheless, some brand names appear on all or most reports, such as SAS, Microsoft, and TIBCO. In addition, some names have been re-appearing on several lists for many years. For example, TIBCO was named a leader five times in the Gartner Magic Quadrant for Master Data Management Solutions. SAS has also been recognized as a leader in the Gartner Magic Quadrant for Data Science and Machine Learning for eight consecutive years. I want to make it clear that I am not endorsing any particular product. What I am trying to say is that we need to teach students the skills needed by corporations.

Posted on July 13, 2021

Yoshua Bengio, Yann LeCun, and Geoffrey Hinton are recipients of the 2018 ACM Turing Award for their research in Deep Neural Networks. In a paper published in the July issue of the Communications of the ACM, they shared their insights about the future of deep learning. They argued that the current form of deep learning is “fragile” in the sense that it relies on the assumption that incoming data are “independent and identically distributed” (i.i.d.). Needless to say, this expectation is unrealistic; in the real world almost everything is related to everything else. Due to the messiness of the real world, they said, “The performance of today’s best AI systems tends to take a hit when they go from the lab to the field.” A common solution is to feed the AI system with more and diverse data. In other words, currently, AI systems are example-based, rather than rule-based. However, some scientists reverted to the classical approach by mixing data-driven neural networks and symbolic manipulation. But Bengio, Hinton, and LeCun do not believe that it can work. The full paper can be accessed at:


That’s my take on it: The same problems described by Bengio, Hinton, and LeCun can also be found in classical statistics: unrealistic assumptions, messy data, and failure to generalize the results from the lab to the field. As a remedy, some social scientists look for ecological validity. For example, educational researchers realize that it is impossible for teachers to block all interferences by closing the door. Contrary to the experimental ideal that a good study is a "noiseless" one, a study is regarded as ecologically valid if it captures teachers' everyday experiences as they are bombarded by numerous distractions. I believe that the same principle is applicable to deep learning.
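The lab-to-field problem can be shown with a toy sketch. The example below is not from the paper; it assumes a deliberately simple one-variable threshold classifier and made-up numbers. The rule is learned from "lab" data and then applied to "field" data whose values are all shifted by a constant, so the field data no longer follow the same distribution as the training data.

```python
def fit_threshold(xs, ys):
    """Learn a cut-off halfway between the mean of class 0 and the mean of class 1."""
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2


def accuracy(threshold, xs, ys):
    """Share of cases where the rule 'x > threshold means class 1' is correct."""
    correct = sum((x > threshold) == (y == 1) for x, y in zip(xs, ys))
    return correct / len(xs)


# Hypothetical lab data: the two classes are cleanly separated.
lab_x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
lab_y = [0, 0, 0, 1, 1, 1]

# Hypothetical field data: same labels, but every measurement drifts upward by 3.
field_x = [x + 3.0 for x in lab_x]
field_y = lab_y

threshold = fit_threshold(lab_x, lab_y)
lab_accuracy = accuracy(threshold, lab_x, lab_y)
field_accuracy = accuracy(threshold, field_x, field_y)

print("learned threshold:", threshold)
print("lab accuracy:", lab_accuracy)
print("field accuracy:", round(field_accuracy, 3))
```

The rule is perfect on the data it was built from and less accurate once the data drift, even though nothing about the rule itself changed.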

Posted on July 9, 2021

A few days ago InsideBigData published an article entitled “The Rise and Fall of the Traditional Data Enterprise.” The editorial team boldly claimed, “We are witnessing the death of traditional enterprise computing and storage – a real changing of the guard. Companies like Databricks, Snowflake, and Palantir are obliterating companies initially thought to have been competitors: EMC, HP, Intel, Teradata, Cloudera, and Hadoop.”

Their argument is straightforward: Cloud-based computing simplifies data storage and usage. The cloud platform is ideal for storing and analyzing large-scale semi-structured data. In contrast, batch-based processing and relational databases for structured data are far less efficient. The full article can be accessed at:

That’s my take on it: History has been repeating itself. Back in the 1960s and 1970s, IBM mainframes seemed to be invincible and indispensable. However, in 1977 when Digital Equipment Corporation (DEC) introduced its VAX minicomputers, IBM lost a big chunk of its market share to DEC. During the 1980s and 90s UNIX was gaining popularity, and to cope with the trend, DEC attempted to shift the focus of R&D to its RISC technology, but it was too little and too late. In 2005 VAX ceased to exist.

Whether old players can continue to thrive depends on their adaptivity and the speed of reaction. Microsoft is a successful example. As InsideBigData pointed out, Microsoft Azure “have already commoditized storage at scales old-school players like EMC and HP could only have dreamed of.” Recently SAS Institute grabbed the opportunity by forming a joint venture with Microsoft in cloud computing.

Do we want ourselves to be like Hadoop and DEC/VAX, or Microsoft and SAS?

Posted on June 25, 2021

Although the predictive power of neural networks is supreme or even unparalleled, the process is considered a black box and sometimes the result is uninterpretable. Very often these models are fine-tuned by numerous trials and errors with big data. Simply put, it is a brute force approach (given enough computing power and data, you can always get the answer). To rectify the situation, Sho Yaida of Facebook AI Research, Dan Roberts of MIT and Salesforce, and Boris Hanin of Princeton University co-authored a book entitled “The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks.” In the book, they explain the theoretical framework of deep learning so that data analysts can significantly reduce the amount of trial and error by understanding how to optimize different parameters. The book will be published by Cambridge University Press in early 2022 and the full manuscript can be downloaded from:

I haven’t read the whole book yet; nonetheless, I had a quick glance and found that the book is fairly accessible. As the authors said in the preface, the book is appropriate for everyone with knowledge of linear algebra and probability theory, and with a healthy interest in neural networks.

Posted on June 19, 2021

China is in the midst of upgrading its military, including its tanks, missile systems, troop equipment, and fighter jets. Among the new systems being developed is AI. In a recent simulation one of China’s most experienced air force pilots, Fang Guoya, was defeated by the AI combatant system. According to Fang, early in the training, it was easy for him to “shoot down” the AI adversary. As you may already know, AI is capable of machine learning. After accumulating more and more data, the AI system outperformed Fang Guoya.

That’s my take on it: It is not surprising to see that the abilities of AI continue to outgrow even the best human experts. As a matter of fact, AI has been outsmarting humans for a long time. Back in 1997, IBM Deep Blue had already beaten the world chess champion in a six-game match. In 2011 IBM Watson competed against the best human contestants on Jeopardy and won the first prize. In 2016 Google’s AlphaGo beat a 9-dan (the highest level) professional Go player. It is noteworthy that in 2017 a new system called AlphaGo Zero defeated AlphaGo by 100-0! Only AI can defeat AI! Perhaps the future war will be fought between AI systems and humans will play a supporting role only.

Posted on June 11, 2021

In June the US Senate passed the bill entitled the US Innovation and Competition Act (USICA) with the purpose of boosting American semiconductor production, the R & D of Artificial Intelligence, and other crucial technologies. The bill approves $52 billion for domestic semiconductor manufacturing, as well as a 30 percent boost in funding for the National Science Foundation (NSF), and $29 billion for a new science directorate to focus on applied sciences. Additionally, the bill will provide $10 billion to reshape cities and regions across the country into “technology hubs,” promoting R & D into cutting-edge industries and creating high-paying job opportunities.

That’s my take on it: There will be many funding opportunities at NSF and other funding agencies. We should make ourselves ready to catch the wave. However, the US is facing a shortage of researchers, engineers, programmers, and other types of high-skilled workers. One of the strategies for boosting semiconductor production is to attract foreign investment. In June Taiwan Semiconductor Manufacturing Co. (TSMC) broke ground on a chipmaking facility in Chandler, Arizona (I lived there 10 years ago). One of the obstacles that might hinder TSMC from fully developing its chipmaking capacity is that the number of graduates related to science and engineering in the U.S. has diminished. In April, TSMC founder Morris Chang bluntly said that the U.S. lacks "dedicated talent ... as well as the capability to mobilize manufacturing personnel on a large scale." Further, a new report released by research group New American Economy found that for every unemployed technology worker in the US in 2020, there were more than seven job postings for computer-related positions. Perhaps it’s time to reconsider and restructure our academic curriculum!

Posted on June 9, 2021

Back in January 2021, Google set a record in the field of natural language processing by building a new model with 1.6 trillion parameters. Recently China broke the record by introducing WuDao 2.0 carrying 1.75 trillion parameters. WuDao 2.0 is able to understand both Chinese and English, thus providing appropriate responses in real-world situations. According to Chinese AI researcher Blake Yan, “These sophisticated models, trained on gigantic data sets, only require a small amount of new data when used for a specific feature because they can transfer knowledge already learned into new tasks, just like human beings. Large-scale pre-trained models are one of today’s best shortcuts to artificial general intelligence.”

That’s my take on it:

1. As natural language processing, image recognition, and other AI technologies become more and more sophisticated, researchers can go beyond structured data (e.g. numbers in a row by column table) by tapping into unstructured data (e.g. text, audio, image, movie…etc.).  

2. Although the US bans the export of crucial AI technologies to China, China has been surging ahead in the research on AI and machine learning. China has at least three advantages: (a) AI needs big data; China can access massive data. (b) China is capable of training a large number of data scientists and AI researchers; Chinese students are more willing to study STEM subjects no matter how challenging they are. (c) China tends to take bold steps to apply AI and machine learning to different domains, rather than maintaining the status quo.

Posted on June 7, 2021

This Wednesday (June 9) the Educational Opportunity Project (EOP) at Stanford University will release new data (Version 4.1) sourced from the Stanford Education Data Archive (SEDA). This is a comprehensive national database consisting of 10 years of academic performance data from 2008-2009 to 2017-2018.

With the advance of online interactive data visualization tools, you don’t have to wait for a year or more to see the results of this type of big data analytics. Now you can explore the data anytime anywhere on your own as long as there is a Web browser on your computer. For example, the following webpage is the GIS map of student test scores and socioeconomic status (SES) by school district. In addition to the GIS map, the webpage also displays a scatterplot indicating a strong relationship between test scores and SES.


· To look for specific information about your school district, use the hand tool to move the map in order to place your state at the center.
· Click on the + sign on the right.
· Hover the mouse over your school district; e.g., the average test score of Azusa is -1.83 and the SES is +0.19 (see the attached PNG image “Azusa_scores_n_SES”).

You can switch to a different view to interact with the chart. For example, by clicking on a particular data point, I can see the trend in test scores by ethnic group in that particular school district (see the attached PNG image “Boston_ethnicity.png”).

That’s my take on it: I am excited by this type of democratization of data analytics. Rather than merely counting on what experts tell you, today you can access the data to obtain specific information that is relevant to yourself.


Posted on June 4, 2021

On June 3 Knowledge Discovery Nuggets posted an article entitled “Will There Be a Shortage of Data Science Jobs in the Next 5 Years?” written by experienced data scientist Pranjal Saxena.

At the beginning of the article Saxena paints a gloomy picture of the future job market:

“In 2019, data scientists used to spend days in data gathering, data cleaning, feature selection, but now we have many tools in the market that can do these tasks in a few minutes.

On the other hand, we were trying different machine learning libraries like logistic regression, random forest, boosting machines, naive Bayes, and other data science libraries to get a better model.

But, today, we have tools like H2O, PyCaret, and many other cloud providers who can do the same model selection on the same data using the combination of other 30–50 machine learning libraries to give you the best machine learning algorithms for your data with least error.

Each company is aware of this fact, so after five years, when these cloud-enabled data science tools will become more efficient and will be able to provide better accuracy in much less amount of time, then why will companies invest in hiring us and not buying the subscription of those tools?”

In the end, Saxena offers a ray of hope by saying, “Each company aims to build their product so that instead of depending on others, they can build their automated system and then sell them in the market to earn more revenue. So, yes, there will be a need for data scientists who can help industries build automation systems that can automate the task of machine learning and deep learning.”

That’s my take on it: Data analysts like me are cautious about the lack of transparency and interpretability of the “black box” because the practice of handing over human judgment to the computer is not any better than blindly following an alpha level of 0.05. At most, data science or machine learning should be used to augment human capabilities, not replace them. The key is to achieve an optimal balance. As Harvard DS researcher Brodie said, “too much human-in-the-loop leads to errors; too little leads to nonsense”. I think we will need experienced data scientists to interpret the results and make corrections when the automated system makes a mistake.

Nonetheless, what Saxena described is a “good problem.” I know many people who still struggle with entering numbers into Excel manually, let alone running automated tools.

Full article:

Posted on June 1, 2021

AI in China’s Walmart stores
Recently Walmart, one of the world's largest retailers, introduced RetailAI Fresh into China’s Walmart stores for self-service customers. RetailAI Fresh is a software app developed by Malong Technologies, running on GPU-accelerated servers from Dell Technologies. Self-checkout is easy when the package has a barcode, but it becomes challenging for a scanner to recognize fresh produce products. RetailAI Fresh can rectify the situation by integrating state-of-the-art AI recognition technology into traditional self-service scales. It is noteworthy that Malong Technologies was founded by Chinese data scientists and engineers.
That’s my take on it: There are a lot of brilliant Chinese computer scientists and engineers working on revolutionary products. The experiment or beta testing in China’s Walmart stores is just one of many examples. We should look beyond our borders in order to absorb new ideas.

AI for the insurance industry
Using demographic data to customize insurance policies is not new. However, in the past customers were treated unfairly due to outdated data or incorrect predictions made by legacy software applications. Not anymore. According to a recent article on InsideBigData, today AI is capable of processing 4,000 data points in minutes and also analyzing 20 years’ worth of mortality, demographic, health, and government trends for better decision support. As a result, insurance companies that utilize both AI and cloud-based data can create fairer policies to serve current and potential clients.
That’s my take on it: Today many people want to keep their privacy and complain about AI and big data, viewing them as “weapons of math destruction” or “the big brother in 1984”. On the other hand, people expect corporations to improve our wellbeing by utilizing better algorithms and more accurate data. These two goals are contradictory! It is important to point out that insurance companies have been collecting customer data for many years. If AI can improve predictive models and data accuracy, I don’t see a reason to oppose it.

Posted on May 25, 2021

Currently there is a special exhibition at London’s Design Museum: portrait paintings and drawings by an AI android named Ai-Da. Ai-Da was co-developed by the robotics firm ‘Engineered Arts’ and experts at the University of Oxford. Ai-Da is able to ‘see’ by utilizing a computer vision system, and therefore she can create a portrait of someone in front of her. Because the creative process is based upon machine learning algorithms, she will not duplicate the same work and therefore each picture is unique. Unfortunately, I cannot visit the museum due to COVID-19.

Posted on May 25, 2021

IBM, along with SAS and TIBCO, is named one of the leaders in the 2021 Gartner Report of data science and machine learning platforms. Although the flagship products of IBM are IBM Watson Studio, IBM Cloud Pak for Data, IBM SPSS Modeler, and IBM Watson Machine Learning, IBM heavily invests in Python and other open-source resources. Recently IBM announced that it will make the Python distribution platform Anaconda available for Linux on IBM Z. Anaconda is the leading Python data science platform, and 25 million users use this platform for machine learning, data science, and predictive analytics.

Posted on May 24, 2021

Daniel Kahneman won the Nobel Prize in economics in 2002 for his work on the psychology of decision-making. In response to the questions about the impact of AI on our society during a recent interview by the Guardian, Kahneman said, “There are going to be massive consequences of that change that are already beginning to happen. Some medical specialties are clearly in danger of being replaced, certainly in terms of diagnosis. And there are rather frightening scenarios when you’re talking about leadership. Once it’s demonstrably true that you can have an AI that has a far better business judgment, say, what will that do to human leadership?... I have learned never to make forecasts. Not only can I certainly not do it – I’m not sure it can be done. But one thing that looks very likely is that these huge changes are not going to happen quietly. There is going to be massive disruption. Technology is developing very rapidly, possibly exponentially. But people are linear. When linear people are faced with exponential change, they’re not going to be able to adapt to that very easily. So clearly, something is coming… And clearly, AI is going to win [against human intelligence]. It’s not even close. How people are going to adjust to this is a fascinating problem – but one for my children and grandchildren, not me.”

Posted on April 15, 2021

No matter whether you support developing AI or not, it is good to take multiple perspectives into consideration. A week ago, Alberto Romero published an article entitled "5 Reasons Why I Left the AI Industry" on "Towards Data Science". He complained that AI is hype and so we should not expect to see AI at the level of human intelligence anytime soon. In addition, in his view, AI becomes a black box and many people don't understand what is going on behind the scenes. The following is a direct quotation:

 "The popularization of AI has made every software-related graduate dream with being the next Andrew Ng. And the apparent easiness with which you can have a powerful DL model running in the cloud, with huge databases to learn from, has made many enjoy the reward of seeing results fast and easy. AI is within reach of almost anyone. You can use Tensorflow or Keras to create a working model in a month. Without any computer science (or programming) knowledge whatsoever. But let me ask you this: Is that what you want? Does it fulfill your hunger for discovering something new? Is it interesting? Even if it works, have you actually learned anything? It seems to me that AI has become an end in itself. Most don’t use AI to achieve something beyond. They use AI just for the sake of it without understanding anything that happens behind the scenes. That doesn’t satisfy me at all."

My response is: I never expect we will see an android like Commander Data or Terminator in the near future. Indeed, we don't need that level of AI to improve our performance or well-being. Nonetheless, it is a good strategy to aim high. If a researcher tries to publish 7 articles per year, in the end, there would be 3-5 only. But if he or she sets the goal to 3 articles per year, the result would be zero! By the same token, the ultimate goal in AI seems to be unattainable, but it is how we are motivated. In addition, what Alberto described among AI programmers and users also happens among people who use traditional statistics. Some people feed the data into the computer, push a button, and then paste the output into the paper without knowing what F values and p values mean. Misuse or even abuse happens everywhere. The proper way to deal with the issue is education, rather than abandoning the methodology altogether.

 The link to the full article is: