Hosted by Chong Ho (Alex) Yu,    
SCASA Vice President for Statistics Education

Posted on November 21, 2023

DSML trend: New and updated features of Bard level the playing field

Recently Google Bard announced several new or updated features. For example, “Get help with math equations: Stuck on a math problem? Ask Bard for a step-by-step explanation of how to solve the equation. You can even take a photo of the question and upload it instead of typing it out.”

“Charts & graphs to visualize data: Bard can now generate charts from data or equations you include in your prompts or from tables that Bard generates during your conversations. We even made a graph resembling the Bard sparkle while playing around with this new feature!”

That’s my take on it:

I tested the feature by entering the following problem: “Solve 9^(2x-5) = 27^x. Explain the solution step by step.” After a few seconds, Bard presented two versions of the solution, and both were correct. More importantly, it explained the procedure step by step. In my opinion, the explanation is even clearer than a textbook's. If I need further explanation, Bard can provide additional information. For instance, after I entered “Please explain the power of a power rule used in Step 2,” Bard elaborated in more detail.
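Bard's actual output is not reproduced here, so the following is only a sketch of the standard derivation for this problem, obtained by rewriting both sides as powers of 3. The variable names are illustrative and not taken from Bard.

```python
# Rewrite both sides of 9^(2x-5) = 27^x with base 3:
#   9^(2x-5) = (3^2)^(2x-5) = 3^(4x-10)   (power-of-a-power rule)
#   27^x     = (3^3)^x      = 3^(3x)
# Equal bases imply equal exponents: 4x - 10 = 3x.

# Solve a*x + b = c*x for x, with a = 4, b = -10, c = 3.
a, b, c = 4, -10, 3
x = (-b) // (a - c)

# Check the candidate solution in the original equation with exact integers.
lhs = 9 ** (2 * x - 5)
rhs = 27 ** x
print(x, lhs == rhs)
```

The final check substitutes the candidate back into the original equation rather than trusting the algebra alone.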

A chatbot like Bard has the potential to serve as your personalized and intelligent tutor, catering to your individual learning pace, academic proficiency, and preferred learning style. In the past, parents had to invest significantly in hiring private tutors or enrolling their children in intensive boot camps. However, the playing field has been leveled, and now access to a personalized tutor is just a few clicks away for virtually anyone. I wish I had had access to this technology when I was a child! If so, my knowledge could have tripled, and I might have earned three Ph.D.s!

Google Bard:

Posted on November 17, 2023

DSML trend: OpenAI fires Sam Altman

It happened just now. Today (November 17, 2023) OpenAI's board of directors announced that Sam Altman will be stepping down as CEO, with technology chief Mira Murati set to take over the position. The decision comes after a thorough review process, during which the board determined that Altman's communication lacked consistent truthfulness, thereby impeding the board's ability to fulfill its responsibilities. The statement emphasized that, because of this, the board no longer has confidence in Altman's capacity to effectively lead OpenAI. The board also announced that Greg Brockman, OpenAI’s president, will be stepping down as chairman of the board but will keep a role at the company.

That’s my take on it: As of now, Altman has not issued any public response. Given the maturity of the technology, I believe the departure of both Altman and Brockman will likely have minimal impact on the development of OpenAI or the broader field of generative AI. However, it's improbable that Altman will sit there and do nothing. It is possible that he may launch another startup or join a competitor to OpenAI (e.g., Anthropic, the maker of Claude, or Google, the maker of Bard).


Posted on November 17, 2023

The Harvard Business Review featured an article on November 2, 2023, titled "How Cloud Technology is Transforming Data Science." Written by Peter Wang, the CEO and co-founder of Anaconda, the article discusses the impact of cloud computing on data science practices. Wang highlights how cloud platforms, such as IBM Watson and Tableau, are revolutionizing the field by offering scalable computational resources and enhancing workforce agility. These cloud-based analytics tools empower teams to access information and collaborate in real time, facilitating quicker insights and problem-solving. Moreover, cloud computing promotes inclusivity in data science by providing smaller entities, such as startups and small teams, with the means to innovate on par with larger corporations. The cloud's collaborative capabilities extend to distributed data science teams, enabling effective collaboration irrespective of geographical constraints. While the cloud brings forth significant advancements in data science, it also introduces new challenges, particularly in data privacy and security. To address these concerns, Wang emphasizes the importance of employing techniques like data partitioning, encryption, and robust frameworks for mitigation.

That’s my take on it: Given the significance of cloud computing, it is undoubtedly essential to integrate it into the curriculum of data science education. However, the current landscape of the cloud computing market is highly diverse, featuring numerous vendors such as AWS, Google Cloud, Microsoft Azure, IBM Watson, and more. The question arises: should cloud computing training be tailored to specific vendors or remain vendor-independent? Opting for vendor-specific training allows students to gain practical experience with the tools and services of major cloud providers like AWS, Azure, and IBM Watson, preparing them for roles utilizing these platforms. The drawback is that knowledge becomes less transferable if students later work with a different cloud provider. On the other hand, adopting a vendor-independent approach ensures knowledge transferability across various cloud platforms and avoids explicit promotion of specific vendors within the program. However, graduates may need additional, vendor-specific training upon entering the workforce. Striking a balance between these approaches is crucial to provide students with a well-rounded and adaptable skill set in the dynamic field of cloud computing. What do you think?

Full article:

Posted on October 19, 2023

On October 16, Baidu, the Chinese search engine giant, unveiled its updated large language model, known as Ernie 4. The company asserted that it is on a par with OpenAI's GPT-4 in terms of performance, although it is not yet accessible to the public. Additionally, Baidu introduced a new AI-based product called Baidu GBI, developed from the ground up to provide support for natural language interaction and handle cross-database analysis, among other functions. According to Baidu's CEO Li, this product can complete in a matter of minutes data analysis tasks that would take humans several days.

That’s my take on it:

Baidu had previously released Ernie 3.5 in June, claiming that it outperformed OpenAI's ChatGPT 3.5 and even surpassed GPT-4 in certain Chinese-language skills. These assertive statements should be validated through objective benchmark tests by independent parties.

Full report:

Posted on October 13, 2023

Yesterday (10/12) an article published by Analytics Insight detailed how seven data science positions can be executed without the need for programming skills. Instead, they rely on the capabilities of user-friendly software tools like Tableau, Excel, Power BI, and more. These positions are:

·      Data analyst

·      Business Intelligence Analyst

·      Data Consultant

·      Market Research Analyst

·      Data Visualization Specialist

·      Data-driven Strategist

·      Data Product Manager

That’s my take on it:

No-code solutions provide pre-built components, templates, and graphical user interfaces (GUI) that can accelerate development compared to programming. These tools allow users to focus more on the research question, the data, and the business logic rather than the syntax. However, no-code solutions inevitably involve some trade-offs in terms of flexibility, customization ability, scalability, and performance compared to coding. Data science education should balance both sides. In my humble opinion, starting data science training with a focus on programming right away might not be the most advisable approach. Emphasizing the fundamental concepts as the foundation is crucial, while the tools, which serve as means to an end, should be treated as secondary. Leveraging GUI-based software applications reduces the entry barriers into the field, thereby broadening the pool of potential talent.


Posted on September 29, 2023

DSML trend: Meta and OpenAI announced new features simultaneously

On September 27th, Meta unveiled its latest artificial intelligence (AI)-powered creation, Meta AI. This new digital assistant is Meta's response to OpenAI's ChatGPT and is set to seamlessly integrate with Instagram, Facebook, WhatsApp, and, in the future, Meta's mixed reality devices. Beyond merely answering questions and engaging in conversations with users, this freshly introduced bot boasts a remarkable ability to generate images. This image generation capability is harnessed through a novel tool known as Emu, which Meta has diligently trained on a vast dataset of 1.1 billion pieces of data, including user-shared photos and captions from Facebook and Instagram. Rather than pursuing a one-size-fits-all approach, Meta's overarching strategy involves crafting distinct AI products tailored for various use cases.

On the very same day, OpenAI made an exciting announcement regarding its chatbot, ChatGPT. It revealed that ChatGPT would no longer be limited to data from before September 2021. Users now have the option to explore GPT-4 via a novel feature called "Browse with Bing." Furthermore, OpenAI is set to enable ChatGPT to engage in voice conversations with users and interact with images, bringing it closer to an AI voice assistant akin to Apple's Siri. For instance, if you snap a photo of your home refrigerator's contents, ChatGPT can provide recipe suggestions. Similarly, if you photograph your children's math homework, ChatGPT can assist in solving mathematical problems.

That’s my take on it:

Ultimately, these AI tools hold the potential to enhance human well-being and satisfaction. However, will our happiness increase when AI can assist us in generating images on Facebook, preparing a recipe, or solving complex math problems? When I traveled to Europe, I found that many European drivers preferred manual transmission to automatic, as they believe the former provides a more fulfilling driving experience while the latter deprives them of the fun of driving. By the same token, if everything becomes too easy with the help of AI, will we feel that we are “losing” some enjoyable experiences? The psychological impact of AI on humans is a topic that warrants extensive study.

Full text:

Posted on September 27, 2023

DSML trend: Capital Economics report on AI impact

On September 26, Capital Economics published a report under the title "AI, Economies, and Markets – The Transformation of the Global Economy by Artificial Intelligence." The report highlights that revolutionary technologies like GPT do not necessarily guarantee substantial productivity gains. Historical evidence shows that the productivity improvements stemming from groundbreaking technologies have often been gradual and less dramatic than initially anticipated. Economists have long grappled with the perplexing trend of weak productivity growth in the digitalized economy of recent decades, marked by developments such as the internet, cloud computing, and the Internet of Things. One major contributing factor is that many companies do not promptly or effectively implement the technology. Although the United States has notably reaped the most substantial productivity gains from AI, achieving a substantial productivity boost from AI hinges on several co-occurring factors, including increased investment, workforce reskilling, and a well-balanced regulatory framework.

That’s my take on it:

Drawing inspiration from the 1966 Clint Eastwood movie, "The Good, the Bad, and the Ugly," I suspect that every innovation can yield three potential outcomes. The "good" outcome represents effective and efficient applications of the technology, leading to desirable results. Conversely, the "bad" outcome arises from poor implementations, resulting in wasted resources. Finally, the "ugly" outcome involves the misuse of technologies for malicious purposes. Numerous examples of the "bad" category exist, such as the overuse of word processing, which allows endless and unnecessary document editing, often resulting in little improvement despite numerous iterations. Similarly, the availability of powerful data analysis software can lead to redundant and excessive statistical tests, with minimal productivity growth as the outcome. These pitfalls also apply to AI unless users are adequately trained to harness its potential effectively.

Request complimentary report:

Posted on September 22, 2023

DSML trend: OpenAI announced DALL-E 3

OpenAI has unveiled the third iteration of its AI art platform, DALL-E. Reviews have praised its significant improvements, with some areas of functionality surpassing that of Midjourney, particularly in terms of image sharpness. Notably, DALL-E 3 simplifies the creative process, eliminating the need for prompt engineering; even amateurs can now obtain desired images through straightforward prompts. Furthermore, this release seamlessly integrates with ChatGPT, streamlining the creative workflow. OpenAI has also placed a strong emphasis on robust safety measures to prevent the generation of inappropriate or harmful content, such as prohibiting the creation of images of public figures. However, DALL-E 3 is not expected to be publicly available until October.

That’s my take on it:

As for DALL-E's competitors, including Stable Diffusion and Midjourney, their developers are presumably working tirelessly to enhance their features. The ultimate winner in this competition may not become clear for another decade. To draw a parallel, we can reflect on the history of computing. Novell NetWare dominated the network operating system landscape in the 1980s and early 1990s, but Microsoft's introduction of Windows NT Server in 1993 led to a gradual shift in market share. Windows Server ultimately became the dominant server OS by the early 2000s, marking a decade-long competitive process. Similarly, Lotus 1-2-3 was launched in 1983 and quickly became the dominant spreadsheet software in the 1980s. In response, Microsoft introduced Excel for Mac in 1985 and the Windows version in 1987. It took approximately a decade for Excel to definitively overtake Lotus as the top-selling spreadsheet software by 1995. This pattern of a 10-year competitive evolution can also be observed in the cases of MS Word vs. WordPerfect and SAS/SPSS vs. BMDP.

Introduction to DALL-E 3 on YouTube:

Posted on September 15, 2023

DSML trend: Valuable data science certifications

Yesterday (September 14, 2023) Aileen Scott, a data scientist, released an article on Data Science Central titled "Are Data Science Certifications the Key to Unlocking Lucrative Opportunities?" In this short article, Aileen poses a question in her title and unequivocally answers it with a resounding "yes." According to her insights, while you can certainly pursue data science studies from the comfort of your home through online courses, opting for a certification program offers unique advantages by facilitating connections with fellow learners, instructors, and industry luminaries. The bottom line is: Earning a data science certification can significantly enhance your earning potential when compared with your non-certified peers. In Aileen’s view, the top choices of data science certification programs for 2023 are:

·      SAS Certified Data Scientist

·      Senior Data Scientist (SDS™) by Data Science Council of America

·      Open Certified Data Scientist (Open CDS)

·      Microsoft Certified: Azure Data Scientist Associate

That’s my take on it:

While Aileen's provided list is concise, some may contend that it overlooks certain valuable programs, such as certifications for Amazon Cloud and Tableau. Nevertheless, Aileen's recommendations encompass two certification programs of a more generic or open-source nature, while the other two are product-specific or affiliated with particular companies (SAS and Microsoft). Although the open-source approach to data science and machine learning is gaining popularity, it is crucial to note that major corporations and tightly regulated industries continue to rely on proprietary software solutions due to their enhanced support and dependability. If you are in search of a data science training program, it is advisable to consider enrolling in one that provides both open-source and proprietary software tracks. In this regard, I encourage you to explore the data science program at Hawaii Pacific University. 

Posted on September 14, 2023

Today marks the second day of the 2023 Dreamforce conference, which is being hosted by Salesforce. It was a great experience even though I attended the conference remotely. Salesforce is widely recognized for its exceptional data visualization platform, Tableau, as well as its AI-driven analytical tool, Einstein. The central theme of this conference revolves around the concept of trust. Specifically, Salesforce is dedicated to constructing reliable systems that prioritize security, compliance, and dependability.

Throughout the conference, Salesforce has showcased its ability to guide users in creating more effective prompts through the innovative feature known as prompt tuning. Moreover, the event has featured numerous enlightening and captivating sessions. For instance, it has provided a platform for interviews with several distinguished AI leaders and innovators who have been acknowledged by TIME 100. Among these esteemed interviewees is Dr. Fei-Fei Li. During her interview, Dr. Li openly expressed her wholehearted embrace of this transformative technology. While some individuals may be skeptical of this powerful yet unfamiliar technology, Dr. Li made a thought-provoking comparison. She pointed out that today, we are not overwhelmed by electricity, and we readily use medications like Tylenol despite not fully comprehending their chemical composition. Addressing concerns about AI bias, Dr. Li contended that AI can be harnessed to mitigate bias. As an example, AI can scrutinize instances where male actors receive more screen time than their female counterparts, highlighting disparities and providing an avenue for rectification.

That’s my take on it:

Critics have voiced concerns that AI tools might inadvertently encourage laziness and plagiarism. However, it is undeniable that AI is here to stay. The integration of AI into various industries is inevitable, and skills related to AI, such as prompt engineering, are increasingly being recognized as indispensable.

Salesforce, as the world's third-largest software company and the second largest in Japan, wields significant influence in this technological landscape. As high-tech companies like Salesforce incorporate prompt tuning into their product portfolios, it is foreseeable that in the near future the utilization of prompt engineering will become as ubiquitous as the use of smartphones and tablets.

Dr. Fei-Fei Li's compelling metaphors, likening AI to electricity and Tylenol, underscore the notion that embracing transformative technologies is a natural progression of human innovation. This phenomenon is not dissimilar to the initial opposition encountered by calculators, which were once believed to diminish human numerical skills. Today, they are as commonplace as electricity and Tylenol, illustrating how society adapts and integrates new tools into everyday life.

Conference’s website:

Posted on September 8, 2023

DSML trend: TIME’s top 100 most influential people in AI

On September 7, TIME announced its list of the 100 most influential people in the field of AI, which includes Sam Altman of OpenAI, Dario and Daniela Amodei of Anthropic, Demis Hassabis of Google DeepMind, co-inventor of the backpropagation algorithm Geoffrey Hinton, inventor of CNN Yann LeCun, co-founder and chief AGI scientist of Google DeepMind Shane Legg, co-founder and president of OpenAI Greg Brockman, co-founder and chief scientist of OpenAI Ilya Sutskever, co-founder of Schmidt Futures Eric Schmidt, science fiction writer Ted Chiang, co-founder of Nvidia Jensen Huang, Stanford professor Fei-Fei Li, and others.

That’s my take on it:

In TIME's top 100 list, Google boasts six individuals, while OpenAI is represented by five. In addition, Microsoft is notable with four entries, Intel has two, and Meta (Facebook) has one (Yann LeCun). There's also a presence from xAI (Elon Musk) and Salesforce/Tableau (Clara Shih), but curiously, Apple is entirely absent from the list. Surprisingly, there's no representation from Stability AI, the company behind the groundbreaking Stable Diffusion technology that generates artworks. No doubt Apple has lagged behind in AI development, with little noteworthy AI-related news to date. As of September 6, reports suggest that Apple is investing millions of dollars a day to train its own AI model, Ajax, with claims that Ajax can surpass ChatGPT. However, concrete results are yet to be unveiled. I wonder how Apple's AI landscape might have been different had Steve Jobs still been alive today.

Time’s website:

Apple’s story:

Posted on September 6, 2023

DSML trend: Guardian blocks ChatGPT from accessing its content

On September 1, 2023, the Guardian announced its decision to block access to its content for the AI text generation program, ChatGPT. In a statement, the publisher emphasized that the scraping of its intellectual property for commercial purposes has always been against its terms of service. It also highlighted its commitment to fostering mutually beneficial commercial relationships with developers worldwide. Other news media, including CNN, Reuters, the Washington Post, Bloomberg, and the New York Times, have implemented similar policies. OpenAI, the owner of ChatGPT, had previously revealed an opt-out option for website owners who didn't want their content used by AI algorithms.

That’s my take on it:

Whether AI's utilization of existing published content constitutes copyright infringement or qualifies as fair use has been an ongoing debate. This same issue extends to AI art tools like Midjourney and Stable Diffusion. It is important to note that AI chatbots do not simply copy and paste content from the source. Rather, the nature and purpose of their use can be seen as transformative, meaning that AI repurposes copyrighted material in a novel and distinct manner. Similarly, AI art tools do not merely create collages; instead, they learn from patterns in existing artworks to generate entirely new images. Consider this analogy: If I extensively study art by browsing around a library and a museum, and subsequently, based on this knowledge, I write a new article or create a new painting on my own, should the library or museum prevent me from accessing their information?

Full text:

Posted on September 1, 2023

DSML trend: Ernie Bot, the AI chatbot of China's Baidu, is publicly accessible

On August 31, 2023, Baidu, the Chinese search engine and AI company, made a significant move by unveiling "Ernie Bot," its equivalent of ChatGPT, to the public. As a result, Baidu's stock price surged more than 3%. This strategic move aligns with Beijing's vision of AI as a critical field, one where China aims to challenge the US and emerge as a global leader by 2030. By releasing Ernie Bot to the public, Baidu intends to gather extensive real-world user feedback. This feedback loop will, in turn, play a vital role in enhancing Ernie and strengthening Baidu's foundational models. Coincidentally, on the same day, two other prominent AI companies in China, Baichuan and Zhipu AI, also introduced their own AI language models.

That’s my take on it:

Back in 2017, Russian President Putin emphasized the transformative potential of AI by saying “whoever becomes the leader in this sphere will become the ruler of the world.” This perspective reflects the ongoing international competition among major technological powers, including the US and China, to gain supremacy in AI research and development.

Unfortunately, I faced challenges registering with Baidu's AI, as it requires a Chinese cell phone number for access. Consequently, I was unable to evaluate Ernie Bot personally. However, those who did manage to access Baidu's AI encountered significant restrictions, particularly in its reluctance to answer sensitive political and historical inquiries. If you have had the opportunity to assess Ernie Bot, I would greatly appreciate it if you could share your insights and findings with me. If you have found a way around the requirement of providing a Chinese cell phone number when registering for Ernie Bot, please let me know too.

Full text:

Ernie Bot’s website:

Posted on September 1, 2023

DSML trend: Nvidia is thriving in the AI boom at the expense of Intel and AMD

According to a report posted by Nikkei Asia today, Nvidia is thriving in the AI boom while Intel and AMD struggle to keep up. Nvidia, the GPU market leader, has seen its stock price triple since the beginning of the year. Its shares jumped over 6% in one week after reporting 101% year-over-year revenue growth on August 23rd. Nvidia racked up $13.51 billion in revenue last quarter, more than double the previous year's figure, largely driven by its data center business and AI chips like the H100 and A100. Meanwhile, AMD and Intel's share prices have dropped 7.41% and 4.08% respectively in the past month. Nvidia's data center revenue was nearly double the data center chip sales of Intel and AMD combined last quarter. It is a clear trend that Intel and AMD are increasingly vulnerable to losing market share in the traditional x86 CPU market.

That’s my take on it:

The above phenomenon highlights the limitations of Intel's x86 architecture in meeting the demand for computing power in the era of big data and AI. While powerful for sequential tasks, x86's complexity makes it less suited to massively parallel processing than GPU architectures. For AI/machine learning and other data-intensive applications, GPUs can deliver much higher throughput, with figures of 10-100x commonly cited depending on the workload. In addition, GPUs have very high memory bandwidth optimized for throughput, allowing fast access to large datasets, whereas CPUs have lower bandwidth. Sadly, Intel was reluctant to modify its architecture and slow to get into the GPU market. Although I don’t think x86-based CPUs will disappear overnight, the future is undoubtedly trending toward GPU-accelerated computing. As GPU computing gains momentum, data scientists, educators, and DSML students need to adapt to this paradigm shift.

P.S.: If I had bought Nvidia stock 10 years ago, I would be retired and sending this message on a cruise ship or at the beach right now.

Full text (may require subscription):

Posted on August 30, 2023

DSML Trend: Revival of OpenAI?

According to reports posted by “The Information” and Reuters on August 29, 2023, OpenAI is poised to achieve over $1 billion in revenue within the upcoming year through the sale of AI software and the corresponding computational capacity that drives it. Previously, the creators of ChatGPT had estimated revenue of $200 million for the current year. Notably, the company, backed by Microsoft, is now amassing a staggering revenue surpassing $80 million each month, a significant escalation from the mere $28 million garnered throughout the entirety of the preceding year.
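As a rough check on how these figures relate, the arithmetic can be sketched as follows. The flat twelve-month run rate is my own simplifying assumption, not something stated in the reports, and the variable names are illustrative.

```python
# Figures quoted above, in millions of USD.
monthly_revenue = 80       # "surpassing $80 million each month" (a lower bound)
prior_year_revenue = 28    # total for the preceding year
earlier_estimate = 200     # earlier revenue estimate for the current year

# Assumption: the monthly figure stays flat for twelve months.
annual_run_rate = monthly_revenue * 12

# How the run rate compares with the two reference figures.
vs_prior_year = annual_run_rate / prior_year_revenue
vs_estimate = annual_run_rate / earlier_estimate

print(annual_run_rate, round(vs_prior_year, 1), round(vs_estimate, 1))
```

At exactly $80 million a month the annual total would be about $960 million, so the projection of more than $1 billion implies that monthly revenue keeps growing beyond the quoted floor.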

That’s my take on it:

Two weeks ago, IT experts predicted that OpenAI might go bankrupt by the end of 2024 due to a decline in usage. Suddenly this situation has undergone a surprising reversal. In my opinion, the future trajectory of OpenAI remains uncertain, because the fate of the company relies on a single product. Prior to ChatGPT, OpenAI boasted another flagship product known as DALL-E 2, a creative tool for generating visual art. However, the market of generative art has now been predominantly seized by Midjourney, which boasts a user base of 15 million, the largest among all image generation platforms. In terms of overall image production volume, Stable Diffusion takes the lead with an impressive 12.59 billion images generated.

The question arises: should OpenAI reallocate its R&D resources to the more promising ChatGPT and relinquish DALL-E 2, or should it engage in a dual-front battle? This is an intricate puzzle that demands careful consideration.

Reuters’s report:

Statistics of generative art tools:

Posted on August 26, 2023

DSML trend: G2 Grid for Data Science and Machine Learning Platforms

On August 24, 2023, G2 released the G2 Grid for Data Science and Machine Learning Platforms. To be considered for inclusion in this DSML benchmarking, the candidate must adhere to the following criteria:

1.     Facilitate the connection of data to algorithms, enabling them to acquire and adapt knowledge.

2.     Enable users to construct machine learning algorithms and/or furnish pre-built machine learning algorithms suitable for less experienced users.

3.     Furnish a platform for the widespread deployment of artificial intelligence.

G2 classified DSML companies into four distinct quadrants, namely, leaders, high performers, contenders, and niche, utilizing a dual-dimensional framework: market presence and customer satisfaction. According to G2 scoring, currently the leaders of DSML are:

·      Databricks Lakehouse

·      IBM Watson Studio

·      Matlab

·      Alteryx

·      Vertex AI

·      SAS Visual Data Mining and Machine Learning

·      Anaconda

·      Saturn Cloud

·      Microsoft Azure Machine Learning

·      Deepnote

·      Amazon SageMaker and AWS Trainium

·      TensorFlow

·      Qlik AutoML

That’s my take on it:

The preceding list includes well-established companies like SAS, IBM, and Microsoft, alongside newcomers challenging the existing order. I admit that I do not possess the skill sets required for all of the software tools mentioned. Coping with the rapid evolution of technologies poses a considerable challenge for university professors, particularly in fields where progress is frequent. In my opinion, transitioning the emphasis from instructing specific skills to nurturing the capacity for perpetual learning is undeniably a valuable approach. To remain current, one effective tactic involves inviting guest speakers from industry or research domains to share their expertise and insights with students. This exposure acquaints students with real-world applications and prevailing industry methodologies. Moreover, it is imperative for faculty to motivate students to cultivate a mindset characterized by openness to change and a willingness to experiment. By the time my students graduate, G2, Gartner, Forrester, and IDC may compile a new list of DSML leaders!

Full report:

Posted on August 26, 2023

In a recent piece published on KDnuggets (August 24, 2023), Dr. Mandar Karhade speculated about the architecture of GPT-4 based upon leaked information. The author posited that rather than being a singular colossal model, GPT-4 might consist of eight separate models, each bearing 220 billion parameters. This approach involves breaking down a given task into smaller subtasks, which are then tackled by specialized experts among these models. The strategy mirrors a divide-and-conquer methodology. A gating model determines the optimal expert for each subtask, and the experts' outputs are combined into the final prediction. However, the author included a disclaimer emphasizing the non-official nature of this information.
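To make the gating idea concrete, here is a toy sketch of a mixture of experts in plain Python. It is not GPT-4's architecture, which remains unconfirmed: the class name `TinyMoE`, the linear "experts," the random weights, and the choice to combine the top two experts are all my own assumptions for illustration. Only the count of eight experts comes from the article.

```python
import math
import random

def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

class TinyMoE:
    """Toy mixture of experts: a gate scores each expert, the top-k are combined."""

    def __init__(self, n_experts, dim, top_k=2, seed=0):
        rng = random.Random(seed)
        # Each hypothetical expert is just a random linear map to a scalar.
        self.experts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
        # The gate has one weight vector per expert.
        self.gate = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
        self.top_k = top_k

    def forward(self, x):
        weights = softmax([dot(g, x) for g in self.gate])
        # Keep only the top-k experts chosen by the gate.
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        chosen = order[:self.top_k]
        norm = sum(weights[i] for i in chosen)
        # Weighted combination of the chosen experts' outputs.
        output = sum(weights[i] / norm * dot(self.experts[i], x) for i in chosen)
        return output, chosen

moe = TinyMoE(n_experts=8, dim=4, top_k=2)
y, used = moe.forward([0.5, -1.0, 2.0, 0.1])
print(used, y)
```

Only the selected experts contribute to the output, which is why such a design can hold many parameters while using a fraction of them per input.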

That’s my take on it:

At present, this notion remains an unverified rumor. Nevertheless, the idea holds a certain degree of credibility. The underlying concept closely resembles the principles of ensemble methods and model comparison, both common practices in data science and machine learning. In ensemble methods such as bagging and boosting, many models are fitted to resampled or reweighted versions of the data and their outputs are combined. In model comparison, the best solution is selected from an array of modeling techniques: neural networks, SVM, bagging, boosting, among others. Hence, the combination of eight models in GPT-4 represents a natural progression of these ideas, taking them a step further.
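For readers unfamiliar with ensemble methods, bagging can be sketched in a few lines of plain Python. This is a minimal illustration under my own assumptions: the base model is a simple least-squares line, the data are synthetic, and the function names `fit_line` and `bagged_predict` are hypothetical.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit of y = a + b*x to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        # Degenerate resample (all x identical): fall back to the mean of y.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def bagged_predict(points, x_new, n_models=25, seed=0):
    """Fit one model per bootstrap resample and average the predictions."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) observations with replacement.
        sample = [rng.choice(points) for _ in points]
        a, b = fit_line(sample)
        preds.append(a + b * x_new)
    return sum(preds) / len(preds)

# Synthetic, noise-free data following y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(10)]
print(bagged_predict(data, 4.0))
```

Bagging averages every model's output, whereas the rumored GPT-4 design would route each input to a few selected experts; the shared idea is that several models together can do better than one.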

Full article:

Posted on August 15, 2023

DSML trend: Europeans collaborate with China’s Huawei in AI-based weather forecasting

The European Centre for Medium-Range Weather Forecasts (ECMWF) has entered into a partnership with Huawei, a leading Chinese technology company, to launch an artificial intelligence-powered weather forecasting system. This collaboration aims to combine ECMWF's expertise in meteorology with Huawei's advanced AI capabilities. The new model, Pangu-Weather, was developed by Huawei and has demonstrated superior accuracy over traditional models. ECMWF selected the Pangu model after rigorous comparative testing showed it consistently outperformed other models, including in predicting extreme weather events. A recent Nature journal article provides further validation of the Pangu model's capabilities, highlighting its ability to achieve highly precise forecasts at speeds up to 10,000 times faster than legacy weather models.

That’s my take on it: Current U.S. export restrictions prohibit Nvidia from exporting certain high-performance AI chips like the A100 and H100 GPUs to China. A recent executive order also restricts U.S. investment into key Chinese technology sectors including semiconductors, AI and quantum computing. In anticipation of further export restrictions, major Chinese technology companies have been urgently placing large orders for high-performance Nvidia chips, with recent estimates valuing these bids at around $5 billion. It remains unclear whether Huawei will be able to fully capitalize on ECMWF's capabilities given these limitations on accessing critical U.S. technologies. Nonetheless, developing highly accurate weather forecasting is increasingly crucial as extreme weather events become more frequent, likely due to climate change.

Full text:

Posted on August 14, 2023

DSML trend: IBM will integrate Meta’s Llama into Watson

On August 9, IBM announced plans to host Meta's 70-billion-parameter Llama 2 large language model on its Watson AI and data science platforms. Currently, users can leverage pre-trained models from IBM and Hugging Face for natural language processing tasks, such as content generation and summarization, as well as text classification and extraction (text mining). The future addition of Llama 2 will be a milestone for IBM's generative AI roadmap, likely followed by upcoming releases of its AI Tuning Studio.

That’s my take on it: IBM's flagship data science products are Watson Studio and SPSS Modeler. For a long time, IBM has trailed its top competitor SAS Institute in user base, interface, and capabilities. Nevertheless, IBM has invested in AI research and development since the 1950s. In 1997, IBM's Deep Blue beat the world chess champion in a six-game match. In 2011, IBM's Watson competed and won against top human Jeopardy! contestants. Although Meta's Llama is less powerful than models like Claude 2, Google Bard, and ChatGPT, incorporating a large language model into IBM products is still strategic. However, it's too early to tell whether IBM can overtake SAS in the near future.

Full announcement:

Posted on August 14, 2023

DSML trend: OpenAI faces financial challenges and the rise of Claude


An article in yesterday's Business Today (August 13) reported that OpenAI, the pioneering AI company that brought ChatGPT to the mainstream public, is facing financial challenges. The costs to operate ChatGPT amount to around $700,000 per day. Despite efforts to monetize GPT-3.5 and GPT-4, OpenAI has yet to earn sufficient revenue to cover its expenses. According to SimilarWeb data, ChatGPT's user base declined 12% from June to July 2023, dropping from 1.7 billion to 1.5 billion monthly users.


That’s my take on it: Researchers at Stanford and UC Berkeley systematically evaluated different versions of ChatGPT. They found that on a math test, ChatGPT solved 488 out of 500 problems correctly in March (97.6% accuracy). By June, its accuracy had dropped to 2.4%. ChatGPT's global website traffic also fell, especially after the launch of Claude 2. Claude 2 scored 71.2% on a Python coding test versus ChatGPT's 67%. Claude is also more up to date, with a training cutoff in early 2023 versus September 2021 for ChatGPT. While it's premature to declare the end of ChatGPT, the future landscape of large language models is volatile as more competitors enter the market.
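As a quick sanity check on the figures quoted in this post, the short sketch below recomputes the March accuracy (488 of 500) and the June-to-July change in monthly users (1.7 billion to 1.5 billion). The helper names are my own; the inputs are the numbers reported above.

```python
def percent(part, whole):
    """Share of `whole` represented by `part`, in percent."""
    return 100.0 * part / whole


def percent_change(old, new):
    """Relative change from `old` to `new`, in percent (negative = decline)."""
    return 100.0 * (new - old) / old


march_accuracy = percent(488, 500)          # math test results in March
traffic_change = percent_change(1.7, 1.5)   # monthly users, in billions

print(f"March accuracy: {march_accuracy:.1f}%")        # March accuracy: 97.6%
print(f"June-to-July change: {traffic_change:.1f}%")   # June-to-July change: -11.8%
```

The computed decline of about 11.8% is consistent with the rounded 12% in the report.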


Full article:

Posted on August 14, 2023

In an essay published on August 9th, Andrew Ng, a co-founder of Google Brain and former Chief Scientist at Baidu, posited that Othello-GPT, a GPT-style model trained on Othello move sequences, demonstrates a noteworthy degree of world comprehension. To be specific, during training the neural network processed only sequences of moves; it was never told that these sequences pertained to actions on an 8x8 board, nor was it given the game's rules. However, after extensive training on a substantial dataset of such moves, the model became proficient at predicting subsequent moves on an 8x8 board. By the same token, certain large language models trained in English have displayed an ability to "learn" additional languages, enabling them to comprehend and follow instructions in these languages. This observation has led both Andrew Ng and Geoff Hinton to conclude that large language models do possess a form of world understanding.


That’s my take on it: Comprehension, or understanding, in the realm of psychology, goes beyond the mere perception of sensory input. Rather, it involves active engagement with information, its connection to pre-existing knowledge and personal experiences, and the construction of a coherent mental representation or interpretation. However, even if LLMs fulfill the aforementioned criteria, do they truly understand the world, or do they merely appear to and behave as if they do?


Full text:

Posted on August 1, 2023

The Center for Consciousness Studies at the University of Arizona and the California Institute for Human Science will cohost a conference entitled “Neuroscience needs a revolution to understand consciousness” from August 18 to 23, 2023. One of the keynote speakers is Sir Roger Penrose, a British mathematician, physicist, philosopher of science, and Nobel Laureate in Physics. The following is a brief introduction to the theme of the conference.

“AI has reinforced the notion of the brain as a complex computer of simple, empty, ‘cartoon’ neurons based on 1950s physiology, processing solely by surface membranes, synaptic transmissions and firings as ‘bit-like’ units in frequencies up to 100 hertz… The Penrose-Hameroff ‘Orch OR’ theory proposes consciousness depends on ‘orchestrated’ (‘Orch’) quantum superpositions leading to Penrose ‘objective reductions’ (‘OR’, wavefunction self-collapses) in brain microtubules, connecting to fundamental spacetime geometry. Orch OR has more explanatory power, connection to biology, and experimental validation than all ‘neuroscientific’ theories based on low frequency, oversimplified cartoon neurons combined… Neuroscience needs a revolution inward, to deeper, faster quantum processes in microtubules to understand consciousness and treat its disorders.”

That’s my take on it:

Many experts speculate that AI may eventually attain self-consciousness, potentially posing a threat to humanity. The concept of consciousness raises several fundamental questions: What is consciousness? How can we ascertain whether an AI system is genuinely self-conscious? Do you have to fully understand consciousness in order to know whether a machine is self-aware? The widely-used Turing test, considered behavioristic, is deemed unreliable for this purpose.

During the 1980s and 1990s, Roger Penrose expounded on the notion of human consciousness in his books "The Emperor's New Mind" and "Shadows of the Mind." He argued that consciousness involves non-algorithmic processes that defy computational reduction. Penrose also criticized the concept of Strong AI, which contends that machines can achieve human-like consciousness. He posited that human attributes such as creativity, insight, and mathematical intuition are beyond the reach of artificial systems due to their dependence on non-computable processes. I have registered for the conference (online only: $75). I look forward to hearing updates to Penrose’s arguments during the event.


Posted on July 21, 2023

DSML Trend: New role of data scientists by embracing economic thinking

In response to the burgeoning influence of generative AI (GenAI), Bill Schmarzo has authored an insightful article titled "Next-Gen Data Scientist: Thinking Like an Economist" on Data Science Central. This article explores the parallels between economic principles and data science methodologies, underscoring the criticality of considering trade-offs, incentives, and resource allocation in data-driven decision-making processes. As outlined in a recent report by McKinsey, GenAI is projected to potentially automate up to 40% of the tasks currently executed by data science teams by 2025. These tasks, including data preprocessing, coding, and hyperparameter tuning, can be more effectively and efficiently accomplished through AI assistance. Nevertheless, it is crucial to recognize that GenAI lacks significant domain knowledge, setting it apart from human experts. By embracing an economic mindset, data scientists can optimize their strategies, thoughtfully prioritize projects based on potential returns, and skillfully communicate insights to stakeholders, thus providing robust decision support.

That’s my take on it:

For a long time, I have advocated against an excessive focus on data wrangling and programming within DSML (Data Science and Machine Learning) education. First, if the data collection protocol and data architecture are well-designed, there is no need to waste our time on data cleaning and data conversion. Second, complicated coding can, to some extent, hinder the discovery of insightful knowledge. Looking ahead, as AI progressively assumes responsibility for more low-level tasks, data scientists should concentrate their efforts on analytics and interpreting the implications of results for end-users.

Full article:

Posted on July 21, 2023

DSML trend: Generative AI fails to spark a strong demand for microchips

Today (July 21) Taiwan Semiconductor Manufacturing Co (TSMC) reported a sharp 23% decline in Q2 earnings, indicating that the recovery in the global semiconductor market is happening at a slower pace than anticipated. TSMC now projects a 10% revenue contraction for the full year 2023, reversing its previous forecast of slight growth. Peer companies such as Samsung and Micron have also posted earnings declines, further signaling weakness in the industry. Apparently, generative AI has failed to spark a strong demand for microchips. Many generative AI services are delivered via servers equipped with Nvidia GPUs. To sustain growth in these services, expansion of data centers is expected. However, generative AI's impact will take time due to the need for advanced chip packaging. It is predicted that AI will not lead to a full-scale recovery in demand for semiconductors until 2024.

That’s my take on it:

Even though generative AI is more technologically advanced than the Internet, why hasn't generative AI created an economic boom similar to the Internet revolution of the 1990s? As far as I know, generative AI is currently still experimental. Unlike e-commerce, which had a straightforward way to make money by selling products or services online, generative AI does not yet have proven business models. Companies are still figuring out how to commercialize the technology. Further, while generative AI shows promise for some business uses, it currently has limitations in understanding context and executing practical tasks. To put it bluntly, it is fun to chat with ChatGPT, Google Bard, and Claude 2, but information provided by these large language models is not 100% accurate, and it seems that widespread enterprise adoption will take more time. Nevertheless, I believe that it will happen soon! Those who are unprepared will be left behind.

Full article (subscription required):

Posted on July 16, 2023

DSML Trend: Elon Musk’s view on xAI and superintelligence on Twitter

Last Friday (July 14) Elon Musk held a Twitter Spaces conversation to discuss his new AI company called xAI. A total of 40,000 people attended the event. The xAI researchers were recruited from OpenAI, Google DeepMind, and the University of Toronto. According to Musk, the goal of xAI is to create AI systems that are highly intelligent, curious, and truthful. Musk wants xAI to study questions about physics like dark matter, as well as why there is little evidence of aliens despite the age of the universe. He believes these are math problems that can be solved with powerful AI. Interestingly, Musk points out that today even the most powerful neural networks cannot produce a novel on a par with human writers. He asserts that current AI research counting on the brute force of computing is missing something. Based on the lessons learned in Tesla, Musk argued that researchers might overcomplicate the problem and the solution might be much simpler than we thought. xAI intends to release a product to the public soon, likely a chatbot trained on Twitter conversations. Musk wants the xAI chatbot to say what it truly thinks, without politically correct guardrails. More details on xAI's first product will be provided in two weeks.

That’s my take on it: 

AI is a highly competitive field. For years, Google, Microsoft, Apple, Meta, and other key players have invested billions of dollars in AI research. During the Twitter Spaces interview, Musk admitted that xAI would take some time to catch up with OpenAI and Google. Nonetheless, given his success with Tesla and SpaceX, it is conceivable that xAI could introduce a new large language model based on a new paradigm in the near future. Although Musk didn’t disclose the details of lessons learned at Tesla, I guess xAI’s approach will simplify existing neural networks, much as the Reduced Instruction Set Computer (RISC) was designed to simplify the Complex Instruction Set Computer (CISC).

The audio file of the interview on YouTube:

Text-based summary:

Posted on July 15, 2023

The following YouTube video presents a concise comparison of Claude 2 and ChatGPT. According to the YouTuber, “Claude 2 presents a significant leap in AI technology with unique abilities like summarizing up to 75,000 words and achieving impressive scores in diverse tests, outperforming its predecessor and its competitor, ChatGPT from OpenAI. Claude 2 not only offers advanced functionality, but also prioritizes safety, striving to minimize harmful or offensive content, and affordability, undercutting ChatGPT's API cost significantly.”


I asked Claude 2 to compare itself against ChatGPT. The following is the answer from Claude:


·      “Claude has more limited conversational abilities compared to ChatGPT, which was explicitly trained for dialogue.

·      ChatGPT tends to be more verbose, while Claude gives shorter, more focused answers.

·      Claude has significantly less training data than ChatGPT, constraining its knowledge breadth, though Claude aims to mitigate this through efficiency.

·      Both may occasionally generate biased, unethical, or implausible responses, requiring caution and human evaluation of outputs.

·      They lack a consistent personality or worldview, since they don't have real experiences.

In summary, Claude and ChatGPT have common capabilities but ChatGPT is more conversant, while Claude is more concise and targeted, reflecting their different underlying architectures and training. Both have limitations inherent to current AI.”

Posted on July 15, 2023

A few days ago (July 11), Anthropic, an AI company based in the US, launched its powerful chatbot Claude 2. It is claimed that this chatbot is the closest competitor to OpenAI's ChatGPT. Aside from text-based prompts, Claude also accepts CSV data sets, PDFs, and other types of documents for analysis. Claude is also good at taking high-stakes exams. The most recent version of Claude achieved a score of 76.5% on the multiple-choice section of the Bar exam. On the GRE reading and writing sections, Claude 2 scores above the 90th percentile, and on the quantitative reasoning section, it scores on par with the median applicant.

The following are my test results:

I uploaded a PDF about alleged data fraud discovered in research articles written by Professor Francesca Gino and Professor Dan Ariely. I then asked Claude to summarize the document. The summary is excellent because it covered all important aspects in a clear and concise manner.

The CSV data set I uploaded to Claude was tricky. A number of variables have extremely skewed distributions, which necessitates data transformation. However, Claude performed the analysis only on the raw, untransformed data. In addition, it provided no visualization of the data. At the present moment, Claude does not appear to be a reliable data analytics tool.
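To illustrate why transformation matters for skewed variables, here is a minimal sketch that measures skewness before and after a log transform. The data values are hypothetical (not the data set I uploaded), and the log transform is only one of several possible remedies.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed variable (for example, income in thousands).
raw = [12, 15, 18, 20, 22, 25, 30, 45, 90, 400]

# A log transform compresses the long right tail.
logged = [math.log(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

An analysis run on `raw` is dominated by the single extreme value, whereas the transformed variable is noticeably less skewed, which is why a tool that skips this step can yield misleading results.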

I input the following three questions to both Claude and ChatGPT:

1.     What are the differences between item response theory and Rasch modeling?

My comments: I find both answers to be accurate and fairly comprehensive. However, both failed to discuss the guessing parameter (g) and data-model fitting order.

2.     What are the limitations of differential item functioning?

My comments: Again, both are accurate and fairly comprehensive, but neither one discussed the differences between non-IRT DIF and IRT-based DIF.

If students use Claude or ChatGPT to answer the above exam questions, they can earn a “B+” at most.

3.     Write a SAS program for DBSCAN.

My comments: Claude and ChatGPT used two different approaches to solve the problem. The former employed SAS macro programming whereas the latter utilized SAS’s Interactive Matrix Language (IML). No doubt my coding time can be reduced from hours to minutes with these tools.
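For readers unfamiliar with DBSCAN, the following is a minimal sketch of the algorithm in plain Python. It is neither the SAS macro version nor the IML version the chatbots produced; the function names, the sample points, and the `eps` and `min_pts` values are my own illustrative choices.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)      # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: reachable but not core
            if labels[j] is not None:
                continue               # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is also core: keep expanding
    return labels


# Two tight groups of four points and one far-away outlier (made-up data).
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(sample, eps=1.5, min_pts=3))   # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

This brute-force version compares every pair of points, so it is suitable only for small data sets; production implementations use spatial indexes to speed up the neighborhood search.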

Sign up for Claude AI (US and UK only):

Posted on July 14, 2023

The California State Board of Education recently approved significant changes to the K-12 math curriculum by integrating data science and emphasizing real-world applications. In response to the growing importance of data science in society, as well as the need to prepare students for careers requiring strong data analytics and problem-solving skills, the board has approved data science courses in many high schools over the past two years. However, the University of California faculty committees that oversee high school courses accepted for admission to UC argued that Algebra 2 should not be replaced with data science, because doing so would under-prepare students who plan to major in STEM.

That’s my take on it: 

In essence, the debate boils down to the purpose of education. As the name implies, data science is more empirical and data-driven, whereas theoretical mathematics is more logical and model-based. To equip students with job skills sought by the market or to solve real-life problems, it seems that data science is preferable to theoretical mathematics. However, advanced math is also necessary for developing abstract reasoning and symbolic processing. My question is: why can't they keep both? 


Posted on July 8, 2023

On July 5, 2023, OpenAI announced a new initiative called “Superalignment” that aims to resolve the alignment problem. According to Ilya Sutskever, OpenAI's Chief Scientist, and Jan Leike, who co-lead the Superalignment team, although superintelligence will be the most impactful technology that could help us solve many important problems, the vast power of superintelligence could also threaten humanity or even result in human extinction. In response, superalignment is introduced as a proactive process of ensuring that superintelligent AI will follow human intent. Their approach is to build a human-level automated alignment researcher to validate the resulting model and to spot problematic behaviors of an AI system.

That’s my take on it: 

If you are not familiar with the alignment problem, “The Alignment Problem: Machine Learning and Human Values” (Christian, 2020) is an accessible introduction. This book discusses the ethical and psychological challenges that arise when the goals of AI systems and human values are misaligned. When we instruct the AI system to complete a specific task, the system may attempt to achieve the goal at all costs and by any means, but the method may not be aligned with human interests and values. For example, if we ask AI to eliminate spam emails, it might delete all email accounts in order to attack the root cause of the problem. If a factory owner instructs the AI system to produce paper clips in the most cost-effective way, all metals may be redeployed by the AI system to the paper clip factory, at the expense of other priorities. The scenarios presented here are very simplistic. Unlike conventional computers that require pre-programming, machine learning is self-evolving. As AI becomes more advanced, its behaviors might become more unpredictable, and the consequences may far exceed our predictions. Can superalignment resolve or at least alleviate the alignment problem? It is very difficult, if not impossible, to predict the unpredictable.

OpenAI announcement:

Challenges and Criticisms of OpenAI's Approach to AGI:

Posted on July 7, 2023

The International Telecommunication Union's (ITU) annual AI for Good Summit, which was held on July 6 and 7, 2023, aims to harness the power of AI to address global challenges and promote sustainable development. The conference brought together experts from various fields, including healthcare, climate change, and education, to discuss and explore AI applications. ITU is a Geneva-based United Nations agency that represents all 193 member states as well as over 900 companies, universities, and other organizations. Therefore, the AI for Good Summit is a truly global conference.

That’s my take on it:

Despite its representativeness, there is no formal declaration, negotiated statement, or decision announced by ITU. Although the discussion in the Summit has led to the creation of focus groups for developing new standards, as well as addressing the impact of AI-enabled androids on humans, it is very difficult, if not impossible, for rival countries that embrace different political ideologies and ethical standards to reach a consensus. I read the closing statement of the summit. Frankly speaking, it is very general and vague.

ITU Statement On The Closing Of The 2023 AI For Good Global Summit:

ITU AI For Good Global Summit 2023 Press conference:  

Posted on July 7, 2023

The World Artificial Intelligence Conference (WAIC) is currently being held in Shanghai, China. In a keynote speech at the conference, China's Vice-Minister for Industry and Information Technology, Xu Xiaolan, said the country plans to develop a complete AI value chain, from chips and algorithms to large language models (LLMs). In addition, the Ministry of Industry and Information Technology of China announced that the government will fully support 360, Baidu, Huawei, and Alibaba in AI R&D by actively promoting the development of a national standard system.

That’s my take on it:

There is no doubt that China's AI development faces an uphill battle, since the US limits the export of cutting-edge technology to China, including top GPU models from AMD and Nvidia. Nonetheless, there are other channels for Chinese scientists and engineers to gather crucial information for AI development, such as borrowing open-source codes and collaborating with companies that are friendly to China. For example, Elon Musk, the founder of Tesla and SpaceX, is opposed to decoupling between the US and China. At WAIC Musk said, “China is going to be great at anything it puts its mind into. That includes…artificial intelligence.” It is likely that Chinese AI scientists and engineers will learn from Tesla.  

English text:

Chinese text:

Posted on July 6, 2023

Today, Tesla showcased its AI products at the World Artificial Intelligence Conference in Shanghai, China, along with 400+ exhibitors. In addition to its Autopilot (Full Self-Driving) cars, Tesla displayed a prototype of its Optimus robot. The Optimus shares its core technology with Tesla vehicles, including a full self-driving computer and a Tesla Vision visual neural network. The Tesla humanoid robot is 172 centimeters tall and weighs 56.6 kilograms, comparable to an average adult. Like human joints, the robot's whole body has 28 degrees of freedom. Its hand has 11 degrees of freedom, and therefore it has a high degree of flexibility and dexterity. Thanks to its powerful motors, the robot is capable of lifting a piano with just one hand. This humanoid robot can also walk, climb stairs, squat, and pick up objects, and it already has the capability to protect itself and other people. In the future, robots may cook, mow the lawn, care for the elderly, or replace humans in dangerous and boring factory jobs.

That’s my take on it:

According to some commentators, Tesla's robots are ahead of Boston Dynamics because Boston Dynamics' robots require preprogramming for movement, while Tesla's can evolve through machine learning. I know what's on your mind. Could a self-learning robot harm humans at some point if it gets out of control? People may even wonder whether the Optimus will be weaponized since it can lift a piano with one hand and defend itself. Is it going to be used for evil purposes? Regulations should be discussed as early as possible.

P.S.: I want the Optimus if I can afford one. Currently I am moving from LA to Honolulu. I need a robot that can lift heavy objects for me!

English text:

Chinese text:

Posted on June 29, 2023

Last week (June 20) Microsoft AI researchers published a paper entitled “Textbooks Are All You Need.” In this paper they introduce a “small” large language model called phi-1 with only 1.3 billion parameters, far fewer than GPT-4 (OpenAI has not disclosed its parameter count, and the widely circulated figure of 170 trillion is unverified). After being trained for four days on a system with eight Nvidia A100 GPUs using a set of “textbook quality” data from the Internet, phi-1 achieves 50.6% on HumanEval, a benchmark that measures the functional correctness of synthesized programs. When the number of parameters is reduced to 350 million, it still achieves 45%.
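Scores on code-generation benchmarks such as HumanEval are commonly reported as pass@k: the probability that at least one of k generated programs passes all unit tests for a problem. The sketch below implements the standard unbiased estimator for a single problem; the sample counts in the usage lines are made up, and I am assuming rather than quoting the exact evaluation setup behind the figures above.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of programs generated for the problem
    c: number of those programs that pass all unit tests
    k: number of attempts the metric allows
    """
    if n - c < k:
        return 1.0   # every size-k subset must contain a passing program
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples generated, 5 of them correct.
print(pass_at_k(10, 5, 1))   # 0.5
print(pass_at_k(10, 0, 1))   # 0.0
```

The benchmark score is then the average of this quantity over all problems, so a score of about 50% at k = 1 means a single attempt solves roughly half of the problems.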

Full paper:


That’s my take on it:

AI and big data are symbiotic. As AI ethicist Julie Mehan said, “AI is useless without data, and mastering today’s ever-increasing amount of data is insurmountable without AI.” In light of this reasoning, a new Moore's law appears to be emerging, in the sense that large language models are getting bigger and bigger over time. However, more parameters and more data are not necessarily better; data quality also matters. Back in 1974 Blalock wrote, “The more errors that creep into the data collection stage, the more complex our analyses must be in order to make allowances for these errors.” This statement is true in both traditional statistics and DSML. Less is more! The Microsoft approach may be a game changer!

Posted on June 28, 2023

In a $1.3 billion deal announced two days ago (June 26), Databricks, an industry leader in data management, will acquire MosaicML, a generative AI platform that empowers enterprises to build their own AI. According to Databricks, the rationale of this acquisition is: “Today, virtually every organization is exploring how best to use generative AI and LLMs, and every leader is considering how they leverage these new innovations while retaining control of their most precious resource: their data.”  

That’s my take on it:

The technology industry is undergoing a wave of AI acquisitions. In early May Databricks acquired Okera, a data governance platform with a focus on AI. In late May Snowflake acquired Neeva, an AI-enabled search engine that could enhance its cloud data management capabilities. Aside from acquisitions, forming partnerships is another common AI strategy. Yesterday (June 27) at Snowflake Summit 2023 SAS announced that SAS Viya’s AI-based decision-support capabilities have been incorporated into the Snowflake Data Cloud with Snowpark Container Services. Needless to say, those who fail to catch the wave and operate in silos may eventually lose out to more powerful competitors. Hence, I believe it is imperative to teach students (our future workforce) how to integrate various tools, or at least understand the "big picture."

Full articles:

Posted on June 27, 2023

In response to AI bias, Stability AI, the London-based company that created Stable Diffusion, is working on generative AI tailored for Japanese users. Like Midjourney and DALL-E 2, Stable Diffusion allows users to create photorealistic artworks using natural language. The problem is that this type of machine learning system is predominantly fed by English or Western data sources. In most cases, if a user requests a picture of a house, a woman, or a man, they will likely receive an image of a Western house or a White person. To address the issue, Stability AI plans to release an experimental Japanese-language AI tool based on localized data.

Full story (subscription required):

I grew up in Hong Kong. During my childhood, I watched many Japanese sci-fi TV programs and films, including Masked Rider (幪面超人), Ultraman (鹹蛋超人), and Japanese Iron Man (鐵甲萬能俠). In those science fiction stories, all of the heroes that save humanity are Japanese mutants, Japanese-like aliens, or Japanese-made robots. Interestingly, all space aliens in those shows speak Japanese! Nevertheless, this type of presentation does not strike me as biased. For local artists and content creators, making things based on their experience is natural and rational. I am not downplaying Stability AI's good intentions and efforts. True, if the user enters words such as "house" or "people", the AI system may default to generating images of Western houses or Caucasian people. However, if I specify a Japanese house or a Japanese woman in the prompt, it will display exactly what I request.

Posted on June 16, 2023

In a recent benchmark study, the Futurum Group compared SAS Viya and several open source software packages, such as Apache SparkML, H2O, and Ranger, in terms of scalability and performance. In this study, random forest, gradient boosting, including LightGBM and XGBoost, linear regression, and logistic regression were rigorously tested on big data. It was found that for running machine learning with high-dimensional data, SAS Viya is on average 30 times faster than all other competitors across 1,500 tests. Specifically, SAS Viya solution delivered results in under 12 minutes on a dataset containing over 300 million data points, while SparkML and another rival failed to deliver results after running for hours. For running traditional procedures, such as linear regression and logistic regression, SAS Viya ran faster in 49 out of the 50 tested configurations.

Full report:

That’s my take on it:

Having used both open source and proprietary software applications, I do not believe that we should side with one camp or the other. Open source is touted as a great tool, but in my view its advantages are overstated. I am not surprised by this benchmark result. While developers of open source are a loose conglomerate, resulting in incompatibility and redundancy, R&D in commercial corporations, such as SAS, IBM, and Microsoft, is coordinated and thus coherent. As a matter of fact, a lack of financial incentives makes it difficult for volunteers to devote substantial time and effort to optimizing machine learning code. OpenAI has taken the world by storm with its ChatGPT, but few people know that OpenAI first adopted an open source model in an attempt to liberate people from big tech monopolies. However, Cade Metz, the author of Genius Makers, made a harsh comment by saying “It (OpenAI) was an idealistic vision that would eventually prove to be completely impractical.” Yann LeCun, the inventor of CNN, even predicted from the beginning that this model was doomed to fail. Within a few years, OpenAI became a for-profit, closed-source company.

P.S. On June 12 it was announced that Google DeepMind, OpenAI and Anthropic agreed to open up their AI models to the U.K. government for research purposes only. It is not completely open source.

Posted on June 16, 2023

Yesterday (6/15) an article published by the IMD discussed how the role of data scientists is changing. In 2012, Harvard Business Review identified data scientist as the sexiest job of the 21st century. However, today user-friendly software is simplifying complex tasks that previously necessitated data scientists. Until recently, data wrangling, such as cleaning, restructuring, re-formatting, and pre-processing, required the expertise of data scientists; but in the near future, AI-enabled software tools will handle this type of menial data preparation. According to Gartner, a prominent consulting company, by 2025 70% of new applications developed by organizations will be low- or no-code solutions, up from under 25% in 2020. As a result, the role of data scientists will evolve from that of an astronaut (who uses state-of-the-art technology for exploring uncharted territory) to that of a champion race-car driver (who uses standardized technologies for navigating in real life). The authors made three recommendations for coping with these trends: 1. Reskill the existing employees. 2. Hire data scientists for specialized purposes, such as sophisticated applications, scalability, and innovation. 3. Invest in the analytics infrastructure that can produce usable data.

Full article (subscription required):

That’s my take on it:

I completely agree with all three recommendations. That's exactly what I've been proposing. I've been holding a minority opinion about DSML for a long time. At this moment, most DSML training programs place too much emphasis on Python and R programming, probably because they conflate programming with analytics. Over the years, I went through the transition from TSO on the IBM Mainframe and DOS on the PC to the GUI on Mac. The lesson is: If I can run an interactive data visualization on Tableau, JMP, or SAS Viya using drag and drop in three minutes, I don’t see a reason to spend an hour building the same thing by coding Dash in Python. Just like the transition from DOS to GUI, the move to low-code and no-code solutions is a natural and irreversible trend. Moreover, paying data scientists big salaries to do data wrangling isn't cost-effective at all. A good data plan can prevent 90% of data issues. I am surprised to see that today some systems still truncate the year variable to three digits (from “2001” to “201”), or code “yes/no” into “1/2”! And some organizations still collect much unused or unusable data without pre-conceptualized research questions!
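
To make the data-plan point concrete, here is a minimal Python sketch of entry-time checks that would catch the two problems mentioned above. The field names (`year`, `consent`), the helper names, and the choice of 1/0 coding for yes/no are my own illustrative assumptions, not taken from any particular system.

```python
# Hypothetical entry-time checks; field names and rules are illustrative assumptions.
VALID_YES_NO = {"yes": 1, "no": 0}  # assumed 1/0 coding rather than 1/2


def validate_record(record):
    """Return a list of problems found in one raw record (a dict of strings)."""
    problems = []

    year = str(record.get("year", "")).strip()
    # Requiring four digits guards against truncation such as "2001" -> "201".
    if not (year.isdigit() and len(year) == 4):
        problems.append(f"year must have four digits, got {year!r}")

    answer = str(record.get("consent", "")).strip().lower()
    # Accept only explicit yes/no text so that stray codes like "2" are flagged.
    if answer not in VALID_YES_NO:
        problems.append(f"consent must be yes/no, got {answer!r}")

    return problems


def encode_yes_no(answer):
    """Map yes/no to 1/0, so the column mean equals the proportion of 'yes'."""
    return VALID_YES_NO[answer.strip().lower()]


print(validate_record({"year": "201", "consent": "2"}))      # two problems reported
print(validate_record({"year": "2001", "consent": "Yes"}))   # no problems reported
```

With 1/0 coding the average of the column is directly interpretable as a proportion, which is one reason a "1/2" scheme tends to create extra cleaning work later.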


Posted on June 15, 2023

Currently OpenAI's CEO Sam Altman is traveling around the world to learn what people want from AI, what they are doing with it, and how we can regulate this emerging technology. At the present time, China and the U.S. are taking very different approaches to regulating AI. The top-down regulatory strategy of China is characterized by state control and an emphasis on national security, whereas the bottom-up approach of the United States has the government taking a back seat.  There is no federal regulation on AI in the US. To fill the vacuum, Microsoft and Google established internal AI governance teams and published their own AI principles.

Full article (subscription required):

That’s my take on it: Paul Kedrosky, managing partner at SK Ventures, said, “The top-down approach China uses actually has huge merits, in my opinion. When the bottom-up approach is too slow, it seems irresponsible and even immature and childish to wait for things to happen.” I tend to disagree. A test of China's chatbots found that some information is not accessible because Chinese regulatory requirements ensure its AI-enabled chatbots won't make mistakes on “important and sensitive topics.” However, regulations that undermine freedom of thought ultimately undermine innovation.

Posted on June 13, 2023

Google recently announced several new DSML products, some of which are still in the experimental stage. One of these innovative products is StyleDrop, an AI-enabled art tool that allows users to generate images in a consistent style. In StyleDrop, the user can easily transfer an original image with a desired style to a new image while preserving its unique characteristics. Furthermore, Google announced last week that it has partnered with Salesforce, the parent company of Tableau, to integrate data analytics into its cloud platform. Specifically, Google and Salesforce plan to integrate Data Cloud and BigQuery to enable businesses to create unified customer profiles in a more efficient way.

YouTube video:


That’s my take on it:

Even though Midjourney and Stable Diffusion are good at generating art, neither produces a consistent style. Google is so smart that it doesn't follow a "me-too" strategy. Adding features similar to those offered by Midjourney and Stable Diffusion is unlikely to entice customers away from those established generative art platforms. But customers will give Google a try for something new. By the same token, it will be very difficult for Google Cloud to compete with Amazon Web Services in terms of cloud computing capabilities. Rather, it will be more beneficial for Google to leverage data visualization through the strategic partnership with Salesforce.

Posted on June 9, 2023

A while ago Microsoft announced several new features in Windows 11, but at that time most articles focused only on Windows Copilot, which is an AI version of Clippy. Specifically, Windows Copilot is a digital personal assistant that can help the user complete tasks easily through natural language inputs. The following YouTube video provides a more comprehensive overview of Windows 11 enhancements. Besides Windows Copilot, the video also mentions AI tools for developers using Azure (the Microsoft cloud computing platform) and ONNX Runtime, as well as AI enhancements to the Microsoft Store. In addition, Microsoft increased support for ARM-based systems.


Since its release Windows Copilot has been the subject of many articles, and I want to shift the focus to the Windows ARM-based systems. Although ARM, invented in 1985, is not directly related to AI, it is still crucial to high performance computing and big data analytics. Chris Miller argued in his book "Chip War" that Intel's dominance of the CPU market for so long is due to luck. There is no doubt that the x86 architecture is not the best; rather, it is too complex and resource-intensive. In contrast, because they use reduced instruction set computing (RISC), ARM processors are simpler in design, much more compact, and can run faster. Thus, ARM-based systems are popular in smartphones and other small devices. At first, Microsoft intended to introduce ARM support in Windows 10, but in the end, it was pushed into Windows 11. In comparison with a traditional Windows laptop, Windows on Arm has superior battery life, always-online internet connectivity using 4G or 5G, super-fast boot times, and chipset-level security support. It takes a long time for a paradigm shift to occur!

Posted on June 2, 2023

Two days ago, Amazon announced it would pay more than $30 million in fines to settle allegations that its Alexa voice assistant and Ring doorbell camera violated privacy laws. A lawsuit filed by the Federal Trade Commission (FTC) alleges that Amazon kept records of children's conversations with Alexa in violation of privacy laws, while another alleges that its employees viewed recordings from Ring cameras without consent. Amazon would also be prohibited from using the predictive models built upon these data. Despite the FTC's rulings, Amazon argued that it had not broken any laws.  

Full article:

That’s my take on it: Big data analytics and machine learning have made Amazon, Google, and Facebook researchers better psychologists and sociologists than academicians, since the former group can access oceans of behavioral data collected in naturalistic settings. When we are unaware of their data collection, these data tend to reveal our true character and behaviors. Needless to say, invasion of privacy is a concern. However, before we point our fingers at Amazon, Google, and Facebook, we should not forget that many well-known psychological studies in the past, such as Milgram's and Zimbardo's studies, were conducted in the absence of IRB approval or are considered unethical today. It will take some time to fine-tune the ethical standards of behavioral data.

Posted on June 2, 2023

In a recent article (May 31, 2023) published in Towards Data Science, data scientist Col Jung argued that organizations should migrate away from traditional data lakes and adopt a data mesh approach. In Jung's view, organizations using old-fashioned data warehouses are trapped in a mess of data systems connected by innumerable data pipelines. Data lake was introduced as a solution by centralizing diverse data into a hub, but “data lake monsters” are “over-promised and under-realized.” In the era of big data, all analytical questions rest on the shoulders of the data lake team. Consequently, the central data team encountered tremendous scalability problems and became inefficient. To rectify the situation, in 2019 Dehghani proposed data mesh as the next-generation data architecture embracing a decentralized approach. Instead of transferring data to a centralized lake, a data mesh allows domain-specific teams to control and deliver data as products, promoting easy interoperability and accessibility across the organization.

Full article: 

That’s my take on it: In my experience many requests to the central data office are simple questions, but as Col said, the data team is overwhelmed under the traditional centralized data architecture. The good news is: Data meshes facilitate self-service data usage, whereas data lakes do not. Is a decentralized system likely to result in chaos, with different people processing data differently? I don’t think so. Since data meshes are owned by different entities, they require stricter formatting, metadata fields, discoverability, and governance standards.

Posted on May 31, 2023

A group of AI scientists, executives, and academicians released a statement yesterday (5/30) regarding the dangers of AI: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.” The signatories include Geoffrey Hinton, the cognitive scientist who modified the backpropagation algorithm for neural networks; Yoshua Bengio, the computer scientist who co-developed the idea of “generative adversarial networks”; Sam Altman, the CEO of OpenAI; Demis Hassabis, the CEO of Google DeepMind; and many other leading authorities in the field.

Open statement:

That’s my take on it: This type of open letter or statement warning about the dangers of AI has been recurring for a while. Probably these experts really know something that we don't know. Based on history, it is very likely that AI will eventually be militarized, and to maximize its power, humans might hand over decision-making to AI. An article by Bill Drexel and Hannah Kelley published in Foreign Affairs suggested that an AI disaster would occur in an authoritarian state without checks and balances, resulting in systemic errors that worsen a mistake or accident. In my humble opinion, systemic mishandlings during a disaster could also happen in democratic nations. Remember how badly COVID-19 spread across Europe and the US?

Posted on May 27, 2023

Since high-performance computing for AI is in high demand, NVIDIA, the market leader in GPUs, has had its stock surge over 25%, outperforming AMD, TSMC, and Intel. NVIDIA became the fifth most valuable company, just behind Apple, Microsoft, Alphabet, and Amazon. Impressed by the rise of NVIDIA, Louis Navellier, chairman and founder of Navellier & Associates, said that “skeptics of the AI bubble have been silenced for the moment.” On the other hand, Yahoo Finance argued that the NVIDIA stock surge could signal the start of the AI bubble. Citing the history of the Japanese bubble in the late 1980s and early 1990s, the dot-com bubble in the US in 2000, and the recent Bitcoin bubble, the Yahoo analyst gave this warning: “if history is any guide, guess what AI could be a larger one.”


That’s my take on it: The law of regression towards the mean tells us that everything that goes up eventually comes down, and everything has a tendency to revert to mediocrity. It is not a question of if, but when, the bubble will burst. However, I believe AI is still very much in its infancy and has a lot of room for development. NCSA Mosaic, one of the first widely used Web browsers, was
released in 1993, and the dot-com bubble took almost a decade to burst. Despite the burst, the Internet is still with us, and investments in Internet infrastructure (such as fiber optic
cables installed during the mid-1990s) continue to benefit the entire world. By the same token, I think it is premature to talk about the AI bubble, and AI is here to stay.


Posted on May 26, 2023

Recently Adobe released generative fill, a feature only available for beta testing, in response to the threat of DALL·E 2, Midjourney, and Stable Diffusion. The Adobe generative tool,
like DALL·E 2, Midjourney, and Stable Diffusion, can generate images using natural language processing. But unlike its counterparts, Adobe allows you to select a specific area of the
canvas where you want to add, extend, or remove content. For instance, when you think a picture of a seashore is too plain, you can describe an object or scene, such as "add a
lighthouse," and in a few seconds, a realistic lighthouse appears.

Demo page (Last updated on May 25, 2023):

As expected, everything appears smooth in the demo. Frankly speaking, I wasn't impressed with the beta version. As an example, I added a person to an empty hallway photo. The
prompt I provided was: "A girl is walking and looking out the window." Adobe generated several variants based on my input, but all of the faces are distorted. Furthermore, there is no
reflection of the person on the glass (see attached). A paradigm shift is undoubtedly occurring with the rise of AI generative art, but Adobe is late to the game. Only the fittest of the
fittest shall survive!


Posted on May 26, 2023

Recently (May 21) Analytics Insight posted a report highlighting the top 10 highest-paying countries in need of data scientists. The order is as follows: USA, Switzerland, UK, Australia, Israel, India, Canada, China, Italy, and France. Take the US as an example. The median salary at the entry-level starts at US$95,000. For experienced data scientists the median pay could be as high as US$165,000.


That’s my take on it: It surprises me that some countries aren't included in the list. The high-tech sectors of Germany, Japan, South Korea, and Taiwan, for instance, are vibrant and fast-growing, so data scientists should be in high demand. I might be missing something or the survey data are incomplete. Out of curiosity, I looked up employment information for data scientists overseas. According to Glassdoor, the average salary of a data scientist in Tokyo is US$55,831 (¥7,851,192), while that in Germany is US$72,419 (€67,500). The figures were not adjusted for purchasing power. Nonetheless, I will stay in America!


Posted on May 25, 2023

About a week ago (May 16) Forrester, one of the most trustworthy tech consulting companies in the world, published a report about the current trend of the AI-based decision-support market. The Forrester Wave evaluation report classified companies into four groups: Leaders, Strong Performers, Contenders, and Challengers. SAS and IBM (IBM Cloud Pak) belong to the first category (leader) while TIBCO (Spotfire) is placed in the second group (strong performer). According to Forrester, “SAS seamlessly integrates world-class analytics for decisioning. SAS’s flagship Viya platform includes beautifully designed interfaces across the entire data-to-decision lifecycle. Any combination of analytics, machine learning, and optimization can easily be created and used by teams within SAS Intelligent Decisioning.” “IBM business automation is driven by AI decisions. IBM’s AI decisioning platform is comprised of IBM Automation Decision Services (ADS) available in IBM Cloud Pak for Business Automation and IBM Watson Studio, and IBM OpenScale available in IBM Cloud Pak for Data.”


That’s my take on it: Contrary to popular belief, in spite of constant challenges from open source, proprietary software packages are still alive and well. Many people compare SAS, Python, and R, but it is like comparing apples and oranges. Python and R are programming languages, whereas the SAS and IBM products are integrated systems and platforms. When we need to implement DSML in a complicated environment with big data, we need a comprehensive system with a user-friendly interface, rather than a DOS-like, command-based environment. Just my two cents.

Posted on May 18, 2023

Today (5/18) UC Berkeley announced that it will open a College of Computing, Data Science and Society, which is expected to be approved by the University of California Board of Regents. During the 2025-26 academic year, a new college building will house the data science major that was launched five years ago, along with other computer science degree programs. UC Berkeley has also disseminated its curriculum to other colleges and universities for free: more than 89 campuses have access to the online curriculum, which includes assignments, slides, and readings. Beginning this fall, there will be UC Berkeley-led data science classes at six California community colleges, four Cal State campuses, and Howard, Tuskegee, Cornell, Barnard, and the United States Naval Academy.

Full article:

It is important to point out that this data science conglomerate is not merely the result of a top-down decision; rather, it arose from huge faculty and student demand. Data science has risen to the fourth most popular major at UC Berkeley in just five years. Faculty and students at UC Berkeley are aware of the importance of data science. In June 2023, UC Berkeley received three gifts totaling $75 million for supporting the construction of the data science center. Two of the gifts are from current Berkeley faculty members. Needless to say, good leaders must pay attention to bottom-up movements; they must be active listeners who can constantly learn and adapt to change.

Posted on May 9, 2023

The IBM CEO Arvind Krishna announced a hiring freeze last week (May 7). Also, nearly 8,000 jobs will be replaced by AI at the company, he said. Throughout the next five years,
machines may take over up to 30% of non-customer facing roles. In the near future, robots and algorithms will likely pose a significant threat to workers in fields like finance, 
accounting, and HR. The upside of this transformation is that AI is expected to contribute $16 trillion to the global economy by 2030.


That's my take on it: The writing has been on the wall for a long time. Academicians are not immune to this trend. In the past, it was necessary to have expertise to perform data
transformations when a nonlinear function could fit the data better. With neural networks, the transformation can be automated in a matter of seconds. In order to deal with the
trend, higher education must reform its curricula; otherwise, graduates with outdated skills will find themselves unable to find jobs.   
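
As an illustration of the manual step described above, the following Python sketch shows a hand-picked log transformation that turns a curved relationship into a straight line. The data are synthetic (an exponential curve I made up for the example), and the helper names `fit_line` and `r_squared` are hypothetical. It shows only the traditional, expertise-driven transformation, not a neural network.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Proportion of variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic data (an assumption for illustration): y = 2 * exp(0.5 * x)
xs = [float(i) for i in range(10)]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# A straight line fitted to the raw data misses the curvature.
raw_slope, raw_intercept = fit_line(xs, ys)
r2_raw = r_squared(xs, ys, raw_slope, raw_intercept)

# After the analyst chooses a log transformation, the relationship is linear.
log_ys = [math.log(y) for y in ys]
log_slope, log_intercept = fit_line(xs, log_ys)
r2_log = r_squared(xs, log_ys, log_slope, log_intercept)

print(f"R-squared on raw y: {r2_raw:.3f}")
print(f"R-squared on log y: {r2_log:.3f}")
```

Choosing the logarithm here required knowing the shape of the curve in advance; a neural network, by contrast, would learn a suitable nonlinear mapping from the data without the analyst picking the transformation.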

Posted on May 5, 2023

White House officials announced yesterday (May 4) that more funding and policy guidance will be provided for developing responsible artificial intelligence before the Biden
administration meets with industry executives. The National Science Foundation plans to invest $140 million in seven new AI-dedicated institutes, bringing the total to 25. With
the goal of making the United States a leader in AI innovation while ensuring that it is developed and used responsibly, the newly created National Artificial Intelligence Initiative
Office will work with academic institutions, government agencies, and industry leaders to address issues such as bias, privacy, and transparency.




That’s my take on it: AI/DSML spans almost all disciplines, rather than being confined to science and engineering. These diverse institutes devoted to AI will cover a wide
range of topics, including ethical issues, AI impact on politics, society, economics, and more. Thus, philosophers, psychologists, sociologists, and economists alongside researchers
in other disciplines will have funding opportunities. Now is the time to act!


Posted on April 30, 2023

A few days ago (April 26), in a statement published by the Association for Mathematical Consciousness Science (AMCS), a group of more than 150 researchers specializing in
mathematical and computational methods for understanding consciousness warns that AI is advancing at a pace that is speeding beyond our understanding of its ethical, legal,
and political implications. Language models such as Google's Bard and OpenAI's ChatGPT now mimic animal brain neural networks, but will soon be constructed to replicate
higher-level brain architectures, and thus it is essential for AI researchers to study the nature of consciousness. According to the letter, “there are over 30 models and theories
of consciousness (MoCs and ToCs) in the peer-reviewed scientific literature, which already include some important pieces of the solution to the challenge of consciousness.”

Open letter:

That’s my take on it: Cognitive science and philosophy of mind have long been intrigued by the concept of consciousness. Traditionally, this type of research has been considered
purely theoretical and "academic." Today, this type of research has a wide range of practical implications. Reductive materialism asserts that conscious phenomena are made up
solely of neurological structures. In this case, consciousness could emerge from material (machines). About 10 years ago, former Arizona State University faculty member Lawrence
Krauss predicted that in the near future there will be self-aware computers. Let's see how it goes.

Posted on April 22, 2023

Recently a German photographer named Boris Eldagsen declined his prize at the Sony World Photography Awards after admitting to being a “cheeky monkey” by generating the
award-winning image using AI. Eldagsen used a pseudonym to submit the AI-generated photo, and the judges selected it as the winner. In an open statement, Eldagsen
wrote, “We, the photo world, need an open discussion. A discussion about what we want to consider photography and what not. Is the umbrella of photography large
enough to invite AI images to enter – or would this be a mistake?... AI images and photography should not compete with each other in an award like this. They are different
entities. AI is not photography. Therefore I will not accept the award.”

That’s my take on it: Does AI-enabled imaging qualify as photography? It depends. This type of debate is not entirely new. When digital photography was introduced,
some traditional photographers disliked images manipulated by computer software, such as Adobe Photoshop. They argued that those images are no longer authentic
and natural. Yet even in the past, photographers used a variety of filters and darkroom techniques to enhance their images. For me, a tool is a tool, no matter whether the tool is
physical, digital, or AI-enabled. The image, however, should not be considered photography if it was entirely created by AI without input from the photographer.

Posted on April 11, 2023

In response to the arrival of ChatGPT, recently a group of prominent AI researchers signed an open letter to call for slowing down AI developments that can pass
the Turing Test. The Turing test measures a machine's ability to exhibit intelligent behavior that is indistinguishable from human behavior. Yoshua Bengio is one of
the leading experts in deep learning who co-signed the letter.

Bengio wrote, “I found it appropriate to sign this letter to alert the public to the need to reduce the acceleration of AI systems development currently taking place at
the expense of the precautionary principle and ethics. There is no guarantee that someone in the foreseeable future won’t develop dangerous autonomous AI systems
with behaviors that deviate from human goals and values. The short and medium-term risks –manipulation of public opinion for political purposes, especially through
disinformation– are easy to predict, unlike the longer-term risks –AI systems that are harmful despite the programmers’ objectives, and I think it is important to study both.”

Full article:

That’s my take on it: Bengio cited the precautionary principle to argue for slowing down AI development. According to the precautionary principle, if an action could potentially
cause harm to the public or to the ecology, without scientific consensus, the burden of proof that it is not harmful is on the shoulders of the party taking the action. Because
most AI developers are not philosophers of ethics or legal experts, the principle places a heavy burden on them. I think there is no need to slow down AI development; instead, experts
from different disciplines should be part of every development team, and there should be opportunities to engage in open debates and discussions regarding AI ethics.

Posted on March 31, 2023

Recently researchers at IBM Research Zürich and ETH Zürich developed the Neuro-Vector-Symbolic Architecture (NVSA) that combines connectionist deep
learning and vector-based symbolic computations. Rather than learning from past data alone, the system is given prior background information. While deep neural
networks can perform well on perception tasks, symbolic computing is good at processing high-dimensional vectors. According to the research team, this
integrated approach outperformed both state-of-the-art deep neural networks and neuro-symbolic AI approaches, achieving new record accuracies of 87.7% and
88.1% on the RAVEN and I-RAVEN datasets, respectively.

Full article:

That’s my take on it: The idea of the perceptron was proposed by Frank Rosenblatt in 1958. However, it was criticized by Marvin Minsky and Seymour Papert
in 1969. After that, the symbolist approach to AI became the dominant paradigm. Only after backpropagation was developed in the 1980s did neural networks
become a major direction of machine learning research. Symbolism and connectionism have been rivals for a while, but in recent years researchers have
attempted to integrate them. For example, two years ago researchers at MIT proposed the Neuro-Symbolic Concept Learner (NS-CL) that incorporates both symbolism
and connectionism. NVSA is another research endeavor along this line. I believe that there will be more attempts in the future to synthesize symbolism and connectionism
in the AI research community.
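
For readers who have not seen a perceptron, here is a minimal Python sketch of a single unit trained with the classic error-correction rule. It also illustrates the best-known limitation associated with the 1969 critique: a single-layer perceptron can learn a linearly separable function such as AND, but not XOR. The function names, learning rate, and number of epochs are my own choices for illustration, not taken from the original papers.

```python
def predict(w, b, x1, x2):
    """Step activation: output 1 if the weighted sum is positive, else 0."""
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Error-correction rule: nudge the weights whenever a prediction is wrong."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            error = target - predict(w, b, x1, x2)
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b


def accuracy(samples, w, b):
    """Fraction of samples classified correctly."""
    hits = sum(predict(w, b, x1, x2) == target for (x1, x2), target in samples)
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and, b_and = train_perceptron(AND)
w_xor, b_xor = train_perceptron(XOR)

print("AND accuracy:", accuracy(AND, w_and, b_and))
print("XOR accuracy:", accuracy(XOR, w_xor, b_xor))
```

No single straight line separates the XOR cases, so at most three of the four can be classified correctly by one unit; multi-layer networks trained with backpropagation remove this limitation.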

Posted on March 28, 2023

A new version of Microsoft Bing, powered by ChatGPT, is now available for beta testing. With the revamped search engine, queries written in natural language will be recognized and answered by OpenAI technology. If you have an account in ChatGPT, do you need Bing? I tested both AI systems by entering the same questions. The answers to simple questions about factual information were very similar between the two, but when the questions were complicated, such as comparing two schools of thought, the answers were very different.

Sample question: What is the key difference between Gary Marcus and Yann LeCun in terms of their perspectives on AI?

The answer from Bing is short. It provides footnotes but the sources are non-academic.

Gary Marcus and Yann LeCun are both prominent figures in the field of artificial intelligence. However, they have historically sharply different perspectives and opinions on the development and future of AI, specifically deep learning. LeCun is famous for his pioneering work in the field of deep learning and computer vision(1). He believes that deep learning is the key to unlocking true AI(2). On the other hand, Gary Marcus has been a vocal critic of deep learning and has argued that it is not sufficient for achieving true AI(1). He believes that AI needs to incorporate more symbolic reasoning and common sense knowledge(3).




The answer from ChatGPT is more elaborate, but no references are provided.


Gary Marcus and Yann LeCun have different perspectives on the field of artificial intelligence, specifically regarding the role of deep learning and the importance of incorporating different forms of knowledge in AI systems.

Gary Marcus is a cognitive psychologist and professor of psychology at New York University, and he has been critical of the current state of deep learning and the limitations of the approach. He argues that while deep learning has been successful in certain areas, such as image and speech recognition, it is not capable of achieving general intelligence and that current approaches are too narrowly focused. He also believes that deep learning is insufficient on its own and that AI systems need to incorporate other forms of knowledge, such as rules and logical reasoning, to achieve true intelligence.

Yann LeCun, on the other hand, is a computer scientist and professor at New York University who has been instrumental in the development of deep learning and convolutional neural networks. He believes that deep learning is a crucial component of achieving general intelligence and that it is capable of handling a wide range of tasks, including those that require reasoning and abstraction. He argues that the key to improving AI is to continue developing deep learning algorithms and architectures and to scale up the amount of data used to train these systems.

Posted on March 23, 2023

The NVIDIA conference today featured a panel discussion entitled "AI Education and Access at Scale." The panel consisted of four speakers, and all were inspirational. Because of space constraints, I can only highlight one of them: Joe Glover, Provost and Senior VP for Academic Affairs at the University of Florida (UF). A few years ago UF launched a university-wide AI initiative that aims to prepare students and faculty to use AI technology to solve pressing challenges across every field. To achieve this goal, UF formed a partnership with NVIDIA to build an AI supercomputing infrastructure. The UF program covers all disciplines at all levels, including arts and humanities. According to Glover, some people were skeptical at first because this idea is so out of the ordinary. Nonetheless, he argued that AI is an encompassing technology that can be applied well to all disciplines. For instance, UF musicologist Imani Mosley utilized AI to discover the patterns of Spotify, whereas UF geology professor Mickie Mackie conducted research on improving sea level rise predictions with the help of machine learning. In order to facilitate the integration of AI and various disciplines, UF hired a team of programmers to assist faculty in the technical aspects. The recording of the session can be accessed at:

That’s my take on it: I totally agree with Dr. Joe Glover. Indeed, this is exactly what I have been trying to do for a long time. There are countless possibilities for implementing AI in almost all disciplines. For example, instead of manually coding the corpus, literature and history researchers can use text mining to analyze archives. By utilizing AI generative tools, artists can spend more time conceptualizing instead of repeating tedious tasks. Needless to say, social scientists can overcome the shortcomings of traditional statistics and solve the replication crisis by equipping themselves with machine learning tools and big data analytics. I will look into the UF model and I hope NVIDIA has more supercomputers to offer!

Posted on March 22, 2023

AI-enabled generative art tools, such as Midjourney and Stable Diffusion, have been taking the world by storm. In response to the market trend, recently both Adobe and

Microsoft announced the beta version of their own AI-based text-to-image tools. The generative AI tool of Adobe is known as Firefly, which will be integrated

into Creative Cloud, Document Cloud, Experience Cloud, and Adobe Express workflows. The Microsoft product, which is based on OpenAI’s DALL-E, will be integrated into 

Bing. Currently, both are accepting beta testers. You can sign up for them at

That’s my take on it: I am still waiting for approval from Adobe. Nonetheless, I have access to the Microsoft AI tool. Frankly speaking, it is disappointing. I applied the same

or similar command prompts that I used in Midjourney to the Microsoft tool, such as “a long-haired girl wearing a long white dress is holding a horse on a beach during

sunset time” and “a dancer in waterfalls.” As you can see in the attachments, the portraits are either disfigured or too dark. These are just a few examples. Nothing that I tried

to make with the tool so far is acceptable.

This morning I attended the talk “Are Generative Models the Key to Achieving Artificial General Intelligence?” at the NVIDIA conference. The presenter explained how the

diffusion model used by Midjourney and Stable Diffusion works. By using zero-shot problem solving, an AI generative model that is trained with low-resolution images can

predict high-resolution ones. I think that’s why Midjourney and Stable Diffusion are superior. There is still a long way to go for Microsoft. It is my hope that Adobe will do a better job. 


Posted on March 21, 2023

This week NVIDIA is hosting its annual online conference to unveil its new offerings, such as a cloud-based supercomputing service. This morning (3/21) NVIDIA founder and CEO

Jensen Huang shared how NVIDIA's accelerated computing platform is driving the next wave in AI, the metaverse, cloud technologies, and sustainable computing. In addition to the

keynote, there are many informative sessions, such as Using AI to accelerate scientific discovery, Generative AI demystified, Deep reinforcement learning with real-world data,

Accelerating exploratory data analysis at LinkedIn, etc. At the keynote, Jensen Huang announced NVIDIA AI Foundations, which will be deployed to Google Cloud, Microsoft Azure,

and Oracle Cloud. In the presentation, Huang kept repeating this phrase: “We are at the iPhone moment of AI.” All sessions are recorded and can be accessed at:

Posted on March 17, 2023

In response to the challenge from ChatGPT, two days ago (March 15) China’s AI developer Baidu released "Wen Xin Yi Yan" at its Beijing headquarters. Its text generation mode is

similar to that of ChatGPT, but additionally, it can read out the answer in real-time, corresponding to various Chinese dialects, including Cantonese and Sichuan dialects. Moreover,

the content can be generated into pictures and videos in real-time, too. Robin Li, Chairman and CEO of Baidu, demonstrated the comprehensive capabilities of "Wen Xin Yi Yan" 

in five usage scenarios: literary creation, commercial copywriting, mathematical calculation, Chinese comprehension, and multi-modal generation. He admitted that in the internal

test, the experience of "Wen Xin Yi Yan" is not perfect, but seeing the strong demand in the market, he will release the product as soon as possible. At present, "Wen Xin Yi Yan"

has a better ability to support Chinese, and the English ability will be further improved in the future. Since the official announcement last month that "Wen Xin Yi Yan" will be

released, 650 partners have joined in, and more related products will appear in the short term. He emphasized that "Wen Xin Yi Yan" is not a tool for the technological confrontation

between China and the United States, but a brand-new platform for the group to serve hundreds of millions of users and empower thousands of industries. Starting today, the first

batch of users can experience the product on the official website of "Wen Xin Yi Yan" with an invitation test code, and it will gradually be opened to more users.

There are more than 260 billion parameters in Baidu's chatbot model, which is more than in GPT-3, but some critics believe its performance is not as good as ChatGPT, partly due

to its lack of web-based Chinese information.

Full text:

That’s my take on it: Perhaps the biggest hurdle to China's chatbot development is not the technological issue; rather, there are too many red lines. When a reporter testing

China's chatbot entered a sensitive question, the system refused to answer: “The question could not pass a safety review. No response could be generated for you.” When the reporter tried 

to push it by asking, “Why did my question fail to pass the safety review?” the answer was: “Let’s change the topic and talk about something else.” In contrast, ChatGPT handles

sensitive or controversial questions differently: although the answer is usually vague and balanced, at least it gives the user objective facts and lets them decide. 

Posted on March 3, 2023

According to recent research conducted by two cognitive psychologists at the Max Planck Institute for Biological Cybernetics in Tübingen, GPT-3 is comparable to humans in some areas

but lags behind in others. One of the questions presented by the researchers to GPT-3 is the classic Linda problem (I use it in my statistics and probability class):

Linda is 31 years old. She majored in philosophy. She was deeply concerned with issues of social justice and discrimination. Which of the following statements is more probable?

A: Linda is a bank teller.

B: Linda is a bank teller and active in the feminist movement.

The correct answer is A because B is a subset of A. The probability of two events occurring together can never exceed that of either event alone. But most respondents picked B, which is a conjunction

fallacy. Interestingly, researchers at Max Planck Institute found that GPT-3 committed the same mistake as humans. Those researchers concluded that in searching for specific information or causal

reasoning, AI failed to use logic; rather, it only passively gets information from texts available on the Internet. When the majority is wrong, AI replicates the error.
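
The conjunction rule behind the Linda problem can be checked with a few lines of Python. This is only a minimal sketch: the population counts below are made up for illustration and do not come from the Max Planck study. Whatever counts are used, the people who satisfy statement B are a subset of those who satisfy statement A, so P(B) cannot exceed P(A).

```python
from fractions import Fraction

# Hypothetical toy population of 1,000 people (counts are invented for illustration).
# Keys are (is_bank_teller, is_active_feminist).
counts = {
    (True, True): 5,      # bank teller and active feminist
    (True, False): 15,    # bank teller, not an active feminist
    (False, True): 300,   # not a bank teller, active feminist
    (False, False): 680,  # neither
}
total = sum(counts.values())


def prob(event):
    """Probability that a randomly drawn person satisfies `event`."""
    favorable = sum(n for traits, n in counts.items() if event(*traits))
    return Fraction(favorable, total)


# Statement A: Linda is a bank teller.
p_a = prob(lambda teller, feminist: teller)
# Statement B: Linda is a bank teller AND active in the feminist movement.
p_b = prob(lambda teller, feminist: teller and feminist)

print(f"P(A) = {float(p_a):.3f}, P(B) = {float(p_b):.3f}")
print("P(B) <= P(A):", p_b <= p_a)
```

Even in this toy population, where feminists vastly outnumber bank tellers, statement B is still less probable than statement A, because every person counted for B is also counted for A.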

Full paper:

That’s my take on it: Out of curiosity, I also entered the same question into ChatGPT. Unlike the result obtained by the two researchers at the Max Planck Institute, I received a more “typical” response.

If a student submits a vague answer like this, it might be sourced from AI! “Based on the information provided, statement A or B cannot be determined as more probable. This is because Linda's age

and major in philosophy do not provide any indication of her profession or involvement in social movements. While her concern with social justice and discrimination suggests that statement B could

be possible, there is no clear evidence to support either statement. Therefore, it would be more accurate to say that both statements A and B are possible, but there is not enough information to determine

which one is more probable.”

Posted on March 3, 2023

Recently Harvard Business Review (HBR) reported that many retail companies have not taken advantage of advanced data analytics to improve their business. There are exceptions: Walmart,

Amazon, and a few others. The 25 best-performing retailers during the pandemic generated 83% more profit than laggards and captured more than 90% of the sector's market capitalization gains.

By interviewing 24 business leaders, HBR unveiled six sticking points as follows:

1.     Culture: Typically, companies have a risk aversion problem and lack a clear goal for implementing analytics.

2.     Organization: Many companies struggle to strike a balance between centralization and decentralization.

3.     People: Very often the analytics function is managed by people who have no understanding of the industry.

4.     Processes: Businesses do not have unlimited resources at their disposal.

5.     Systems: Legacy systems are still serving many firms today.

6.     Data: Data are often scattered throughout the firm in silos and not managed in an organized manner.

Full article:

That is my take on it: Some interviewees believe the bigger issue is people. When the people in charge understand all the other issues, they are willing to dedicate resources to solving them.

Sadly, this is not always the case. William Cleveland and John Chambers were pioneers in data science. Many years ago they both proposed that data science should be interdisciplinary,

incorporating domain knowledge. Agree!  

Posted on February 27, 2023

In response to the challenge of OpenAI, three days ago (Feb 24) Meta (Facebook) announced its flagship large language model: Large Language Model Meta AI (LLaMA). While OpenAI’s GPT-3

consists of 175 billion parameters, the size of LLaMA varies from 7 billion to 65 billion parameters only. In spite of this, Meta claimed that LLaMA is superior because it requires fewer computing

resources to test new approaches, validate existing models, and explore new scenarios. The model will be released under a noncommercial license in order to maintain its integrity and prevent

misuse. Researchers from academic institutions, government organizations, civil society groups, and industry research laboratories around the world will be allowed access on a case-by-case basis.


The announcement by Meta:

That’s my take on it: Several people argued that ChatGPT had opened Pandora's box because it had been released ahead of the development of ethical guidelines for AI applications. Due

to the controversy surrounding ChatGPT, it is understandable that Meta took a more cautious approach. However, even if Meta and others try their best to patch all ethical and legal loopholes

in AI and machine learning, someone will misuse or abuse the technology one way or another.

Posted on February 24, 2023

ChatGPT stories continue to dominate mass media and social media, and probably you already received these stories from many channels. Therefore, I would like to

draw your attention to something else. Two days ago Google unveiled its 2023 data and AI trends report. In addition to Google Cloud, Google also suggests a vast

array of technologies to companies that plan to enhance their AI and cloud computing infrastructure:

·      Aiven

·      C3AI

·      Confluent

·      Collibra

·      Databricks

·      Datametica

·      Elastic

·      Fivetran

·      MongoDB

·      NVIDIA

·      Qlik

·      Quantiphi

·      Salesforce

·      SAP

·      Striim

·      ThoughtSpot

A month ago InsideBigData compiled the IMPACT 50 list for Quarter 1, 2023. According to InsideBigData, “These companies have proven their relevance by the way they’re

impacting the enterprise through leading-edge products and services.” The top 20 are:

·      Open AI

·      Nvidia

·      Google AI

·      Amazon Web Services

·      Hugging Face


·      Databricks

·      Microsoft AI

·      Intel AI

·      Neural Magic

·      Snowflake

·      SAS

·      Qlik

·      Neo4j

·      Allen Institute for AI

·      TigerGraph

·      Anaconda

·      Domino Data Lab

·      Hewlett Packard Enterprise

·      Cloudera

The full report of Google:

The full article of InsideBigData:  

That’s my take on it: Although the selection criteria are subjective and might even be biased, data scientists and DSML educators should still take them seriously. As you can see, the

list of these most promising and most impactful tech companies consists of both fairly new companies and mature companies (e.g., Microsoft, Hewlett Packard, Intel, SAP, SAS…etc.).

However, some established tech giants are absent from the list (e.g., IBM, Oracle…etc.). Neither IBM nor Oracle is even among the top 50. It is understandable. Despite several

decades of development, some of their products have made little progress. The rule in academia is: publish or perish. In the era of AI and big data, the choice facing companies is: 

innovate or perish.

Posted on February 21, 2023

About a week ago the Data Science 4 Everyone coalition affiliated with the University of Chicago released a report that indicated data literacy skills among fourth- and eighth-graders in the US have dropped significantly
over the last decade despite the fact that these skills have become more and more important in a data-driven world. Based on the National Assessment of Educational Progress (NAEP) data, the report implies that the
nation's educational system does not adequately prepare young people for a world reshaped by big data and artificial intelligence. In the time period between 2019 and 2022, eighth-graders' scores in the data analysis,
statistics, and probability section of the NAEP math exam decreased by 10 points, while fourth-graders' scores decreased by 4 points. There has been a long-term trend of declining scores over the past decade, with
scores down 17 points for eighth-graders and 10 points for fourth-graders.


Full report:

That’s my take on it: It is not surprising. For two decades, I have been monitoring trends in science and math education as part of my research interests. All data I gathered suggest that the decline is real. Because

of the high demand for data scientists, there are many short-term certificate programs and boot camps available. However, though some trainees can throw out certain seemingly sophisticated jargon, they may not fully 

grasp the theories behind DSML due to a lack of a solid foundation. It could be dangerous! My teaching approach is: when there is a sign of misconception among students, trace the root cause and re-lecture the basics!

Posted on February 10, 2023

Facing the pressure from OpenAI's ChatGPT, Google is devoting efforts to reassure the public that its AI technology is still promising. However, the performance of its own chatbot, Bard, was so embarrassing that investors lost confidence. Bard, which was unveiled on Twitter on Monday, tried to answer an inquiry about discoveries from the James Webb Space Telescope. According to Bard, the telescope was the first to photograph a planet outside the solar system, but in fact this milestone was accomplished by the European Southern Observatory's Very Large Telescope in 2004. This mistake was spotted by astronomers on Twitter. Consequently, Alphabet's shares dropped more than 7% on Wednesday, wiping $100 billion off its market value.

Full article:

That’s my take on it: As a matter of fact, ChatGPT also made many factual errors. For example, when a history professor asked ChatGPT to explain the Joseph Needham thesis, it responded: “the scientific and technological achievements of the West were only possible because of the transmission of scientific and technological knowledge from China to the West.” It is completely wrong! In fact, Joseph Needham was curious about why ancient China failed to develop modern science. My friend who is a math professor in Hong Kong also found that some answers offered by ChatGPT are unsatisfactory. I guess people are more forgiving of ChatGPT because it is the first of its kind.  

Posted on February 4, 2023

On January 30, 2023, Retraction Watch published an exclusive report on Hao Li’s research misconduct. Hao Li, a pioneer of deepfake technology that can fabricate video,
has won numerous awards for his AI-based innovations in imaging technology. According to Retraction Watch, two of his articles published in ACM Transactions on Graphics 
will be retracted due to the falsification of data. One of his articles is based on a presentation at the ACM computer graphics conference SIGGRAPH 2017 Real Time Live (The
recording is available on YouTube). In the presentation, Li and his colleagues showed that his software could generate a 3D image based on a picture taken with a webcam in just
a few seconds. However, later it was found that those 3D images were built and preloaded into the computer before the presentation. Li denied any wrongdoing, saying that
preloading the 3D images was allowed by the conference.

Full article:

Youtube video of Li’s presentation:

Li’s ACM articles: 

That’s my take on it: Despite winning the "Best in Show" award at the ACM conference, Li's presentation is a fraud! In fairness, Li's misconduct was not on the same scale
as Elizabeth Holmes'. Li had a working prototype and he made it appear to be more efficient, whereas Holmes lied about a promising blood-testing technology that never
existed and was physically impossible. Nonetheless, it is not unusual for high-tech companies to use the strategy of "fake it until you make it". For example, Microsoft in the
past announced several “vaporware” products that didn't exist in order to keep customers from buying well-developed technologies from competitors. In the same vein, many
companies use the buzzword "AI" in their product names, but whether the technology is truly AI remains to be determined.  

Posted on February 4, 2023

There has been a hot debate in academia about the use of ChatGPT. In December last year, ChatGPT was included as one of 12 authors on a preprint about using the tool
for medical education posted on the medical repository medRxiv. According to Nature, ChatGPT was cited as a bylined author in two preprints and two articles in science
and health published in January 2023. All of the articles have an affiliation with ChatGPT, and one even gives an email address for a supposed nonhuman "author".  
Nature explained that the inclusion of ChatGPT as an author was a mistake and the journal will fix it soon. However, PubMed and Google Scholar have already indexed these
articles and these nonhuman "authors." Nature has since set forth a policy guiding how large-scale language models can be used in scientific publications, prohibiting naming
them as authors. To address this latest technological concern, recently the Journal of the American Medical Association (JAMA) updated its instructions for authors:
Artificial intelligence, language models, machine learning, and similar technologies are not eligible for authorship. When these tools are utilized to generate content or assist
in the writing or preparation of manuscripts, the authors are responsible for the integrity of the content generated by these tools and must clearly state the use of AI in the

That’s my take on it: It appears that faculty and student policies regarding ChatGPT are vastly different. The inclusion of any content generated by ChatGPT in a paper is
strictly prohibited by many universities and violation of the policy is treated as academic dishonesty. On the contrary, JAMA accepts AI-generated content as long as the author
verifies the information and documents it in the Acknowledgment section or the Methods section of the paper. I guess it is based on the implicit assumption that mature adults
are more responsible than young students. In my opinion, it is not necessarily true. This type of "discriminatory" policy may eventually lead to discontent among students.
Rather than setting two sets of policies, it would be better to create one standardized policy for all and provide workshops on ethical AI use to both groups.

Posted on February 3, 2023

Yesterday (Feb. 2, 2023) an article posted on KDNuggets introduces ten free machine learning courses offered by top universities, including UC Berkeley, Carnegie Mellon
University, Stanford University, Caltech, Cornell University, University of Toronto, MIT…etc. It is noteworthy that these are not just one-hour seminars; rather, the duration
of these comprehensive courses is between 20 and 60 hours. More importantly, some of these courses are taught by very prominent scholars in the field, such as Andrew Ng.

Full article:

That’s my take on it: According to the May 2022 report compiled by the Institute for Advanced Analytics at North Carolina State University, there are about 353 graduate
programs in data science and machine learning in the US. Additionally, there are many free courses in the market and the preceding list is only the tip of the iceberg. No doubt
the competition is very intense, and therefore program designers must think outside the box to stay ahead of the curve. 

Posted on February 2, 2023

A week after ElevenLabs opened its voice-cloning platform to the public, the startup says it may need to rethink that openness amid increasing instances of voice-cloning misuse.
The ElevenLabs speech synthesis and voice cloning software modules can mimic any accent and speaking tone and can be used for newsletters, books, and videos. Piotr
Dabkowski, a former Google machine learning engineer, and Mati Staniszewski, an ex-Palantir deployment strategist, founded the company in 2022. After the software was found
to generate homophobic, transphobic, violent, and racist statements in the voices of celebrities, the company addressed the issue on Twitter. 

Full article:

Posted on January 26, 2023

This morning I attended a seminar entitled “Debunking Data and Analytics Myths: Separating Fact from Fiction" hosted by the Ravit Show. The panel discussed the following
urban legends of data science:

1.     Big data is not just about volume, it's also about variety and velocity.

2.     Analytics is not just about finding insights, it's also about taking action on those insights.

3.     Data visualization is not just about making data look pretty, it's also about clearly communicating important information.

4.     Machine learning is not a magic solution for all problems, it's just one tool in the data scientist's toolbox.

5.     A/B testing is not just for online businesses, it can be used in offline settings as well.

6.     Data governance is not just about compliance, it's also about making sure data is accurate, accessible, and secure.

7.     Data privacy is not just about hiding data, it's also about giving individuals control over their own data.

8.     Predictive modeling is not just about forecasting the future, it's also about understanding the past and present.

9.     Data science is not just for tech companies, it's applicable to any industry.

10.  Data literacy is not just for data scientists, it's important for everyone in the organization to understand and use data effectively.

The panel also offered some valuable advice, such as “Think big, act small, and start fast! Don’t wait a month or three months!” 


That’s my take on it: Even after debunking these misconceptions many times, I continue to encounter them in my teaching, research, and consulting work. In the past,
a researcher told me that big data analytics was irrelevant to his field because he equated big data with a larger sample size; his experiments used a small amount of
experimental data, not a large amount of observational data. My profession as a psychologist makes me aware of cognitive errors related to the baby duck syndrome:
a baby duck, when first exposed to another organism (e.g., its mother), tends to imprint on it and then follow it. Defending against misconceptions is like fighting a pandemic,
which means that people should be "vaccinated" as early as possible. Therefore, I recommend teaching data science concepts at the undergraduate level!

Posted on January 24, 2023

In spite of a mass layoff (10,000 employees), Microsoft recently announced a $10 billion investment in OpenAI, the company that developed ChatGPT and DALL-E 2.
Microsoft's investment will allow OpenAI to accelerate its research since all of its models are trained in Microsoft Azure. In return, Microsoft will receive a boost to its
Azure cloud and even catch up with Amazon Web Services.

Full article:

That’s my take on it: Currently, Amazon Web Services dominates the cloud computing market. However, OpenAI can undoubtedly improve the functionality of Microsoft
Azure. While AWS does not have a powerful AI partner like OpenAI, its SageMaker provides powerful predictive modeling capabilities. A long time ago, Microsoft and
SAS Institute formed a partnership to offer cloud-based data analytics. It is my belief that this fierce competition in machine learning, cloud computing, and data science
will drastically change the landscape of these fields in the near future. Be sure to stay tuned!

Posted on January 20, 2023

Today I read an interesting article entitled “Is artificial intelligence a threat to Christianity?” posted on Patheos. The article contains many insightful points, and I will only
highlight one. According to Keith Giles, the author of the article, “In fact, this fear of creating an AI that is “more intelligent than humans” isn’t even what we should be
most afraid of. As one former top social media tech executive was quoted as saying in the excellent NETFLIX documentary, The Social Dilemma, we shouldn’t be afraid of
creating an AI that eventually exceeds human intelligence, what we should be afraid of is the fact that we’ve already created machine learning programs that know how
to overcome our human weaknesses.”


That’s my take on it: Last evening in my class I told my students that I like machine learning a lot. Machine learning has the ability to learn very quickly, as its name
implies. With the right data, the algorithm can improve, and it won't make the same error again. On the contrary, humans (including myself) are so stubborn that we let
our cognitive and emotional weaknesses affect our judgment and behavior. We fear AI partly because we are envious of it.   

Posted on January 20, 2023

With over 477 million items, Getty Images is one of the largest visual media companies in the world, offering stock images, videos, and music to business and individual clients.
Recently Getty Images announced that it is suing Stability AI, a company that enables users to generate images using its machine-learning software module, Stable Diffusion.
Getty Images accused Stability AI of training its algorithms by unlawfully extracting images from the Internet, including stock images owned by Getty. Getty claimed that the
company is not seeking financial damages or trying to stop the distribution of AI-art technology; rather, it attempts to push for laws and regulations that respect intellectual property.


That’s my take on it: Getty Images' reaction is understandable. It will not be necessary for illustrators or other users to buy stock images from Getty or other suppliers when
they are able to generate images using AI. For example, the Atlantic published a report by Charlie Warzel in 2022, right after the release of Midjourney, another AI art generation
program. The report featured two AI-generated images of Alex Jones, the founder of InfoWars. Later Warzel apologized. “This was entirely my fault…Instead of selecting a photo or illustration
from Getty Images to go with the story, as I do for most of my newsletters, I decided to try something different and use an AI art tool to come up with the story’s accompanying
image,” says Charlie Warzel.

It is interesting to note that Getty Images is not suing Midjourney and DALL-E 2. There is an obvious reason for omitting DALL-E 2. While Stability AI uses an open-source model,
OpenAI, which developed DALL-E 2, did not disclose its mechanics. In the absence of ample evidence, attorneys have a difficult time building a case. However, I don’t understand
why Getty Images is not targeting Midjourney. Do you know why?

Posted on January 18, 2023

Today Boston Dynamics, a leader in AI-enabled robotics, released a video clip on YouTube that shows how Atlas, an intelligent humanoid robot, navigates “his” environment.
“He” assisted “his” human partner by using available objects and modifying his path to reach “his” goal.


That’s my take on it: In this video, the robot is merely helping the construction worker, who is still doing the actual task. I believe that in the near future, the advancement
of AI and big data analytics will enable intelligent robots to replace humans in certain high-risk careers, such as monitoring the radiation levels in nuclear plants and sweeping
mines on battlefields. Last year the US Army provided one of its two robotic dogs, which was built by Boston Dynamics, to clean up mines in Ukraine. You read it correctly. Only
one robotic dog! I guess it is still experimental. It would be great if this could be scaled up in the future so that no human lives would be lost. As shown in the video, Atlas' actions
indicate that an intelligent robot could evade threats better than humans.

Posted on January 16, 2023

A group of artists recently hired lawyers Matthew Butterick and Joseph Saveri to sue Stability AI and Midjourney, the developers of the artificial intelligence art generators
Stable Diffusion and Midjourney, respectively, as well as DeviantArt, which recently launched its own artificial intelligence art generator. They accused the AI generator
companies of profiting from their work by scraping their images from the web without their permission.
The law firms representing the artist group asserted that AI-generated
art is a form of intellectual theft. “Even assuming nominal damages of $1 per image, the value of this misappropriation would be roughly $5 billion (For comparison, the
largest art heist ever was the 1990 theft of 13 artworks from the Isabella Stewart Gardner Museum, with a current estimated value of $500 million)… Having copied the
five billion images—without the consent of the original artists—Stable Diffusion relies on a mathematical process called diffusion to store compressed copies of these training
images, which in turn are recombined to derive other images. It is, in short, a 21st-century collage tool,” says Matthew Butterick.

Full article:

That’s my take on it: Technically speaking, Stable Diffusion does not generate a picture by directly recombining existing images. The underlying principle of machine
learning is pattern recognition. Indeed, AI art generators store no images whatsoever, but rather mathematical representations of patterns derived from images. With
that said, the software module does not stack together multiple images in the fashion of collaging. Rather, it creates pictures from scratch based on pattern generation.

Even if AI art, as Butterick said, is just a 21st-century collage tool, collaging has been used by several well-known artists, such as Andy Warhol, and this practice is
widely accepted by the artist community. Warhol created art by recycling existing icons and images, including Marilyn Monroe, Prince, and Campbell's soup cans. Several years
ago, the Warhol Foundation was sued for allegedly infringing on copyright laws by basing a portrait of Prince on a prominent photographer's work. Nevertheless, a federal
district court judge ruled that Warhol's Prince series is "transformative" because it conveys a different message, and therefore is considered "fair use" under the Copyright Act.

Furthermore, Butterick and Saveri are also suing Microsoft, GitHub, and OpenAI over the Copilot AI programming model, which is trained by collecting source code from the
Web. Thus, this kind of dispute is not only about AI art, but also about the long-term development of the open-source community as a whole.   

Posted on January 13, 2023

Eight major Australian universities have announced that they have changed their assessment formats as a result of several cases in which students turned in papers
generated from ChatGPT. The University of Sydney, for instance, has revised its academic integrity policy to explicitly state that using artificial intelligence to create
content is cheating. The Australian National University has changed assessment designs, such as shifting emphasis to laboratory activities and fieldwork, as well as
using timed exams and oral presentations.

Full article:

That’s my take on it: This issue is not entirely new. Before the introduction of ChatGPT and other AI tools, Wolfram products, such as Mathematica and Wolfram Alpha,
were already capable of solving complex math problems. These tools are also used by students to cut corners, say math and statistics professors. The widespread availability of
Google and other search engines has led to many students turning in "instant" papers that reference many websites. Nonetheless, Wolfram, Google, and now ChatGPT
are here to stay. The solution is not to ban them. Instead, we should teach students how to use these tools ethically. 

Posted on January 9, 2023

ChatGPT, an OpenAI language model released on November 30, 2022, is capable of writing articles, generating code, and solving complex math problems. As
expected, the introduction of ChatGPT has triggered widespread resistance. On Jan 5, 2023, the International Conference on Machine Learning (ICML) announced
that it bans authors from using AI tools like ChatGPT to write scientific papers unless the produced text is a part of an experimental analysis. It is important
to point out that this ban applies only to the text generated entirely by AI-enabled language models, but does not apply to papers “coauthored” by humans and AI. 
In a similar vein, Stack Overflow also banned users from submitting answers created using ChatGPT last year, while the New York City Department of Education
blocked access to this tool just last week. 

“With a tool like this at their fingertips, it could muddy the waters when evaluating a student's actual writing capabilities because you're giving kids potentially
a tool where they could misrepresent their understanding of a prompt,” says Whitney Shashou, founder and advisor at educational consultancy Admit NY.

Full articles:

That’s my take on it: Any new technology could lead to some unintended consequences. As you might already know, some paper mill “companies” provide users
with “publishing” services. It is estimated that about 1% of articles archived in PubMed contain questionable content. With the advance of AI tools like ChatGPT,
it will be much easier for authors to produce instant articles. At the present time, I am unaware of any academic journal that prohibits submissions generated by AI.

ChatGPT also simplifies the process of writing for students. As of right now, my university does not have an academic honesty policy regarding artificial intelligence.

While Turnitin and SafeAssign can detect plagiarism, they cannot tell the difference between human-written and AI-generated text.

It is not my intention to oppose ChatGPT. As an initial research tool, I find this tool perfectly acceptable. Authors should, however, verify the information provided
rather than blindly trusting the results. I recommend that at least 80% of the final paper should be written by a human author to ensure its originality.

Posted on January 9, 2023

Why did Microsoft invest in R rather than Python or Octave?

That would be because of the Goldilocks principle in investing: if you invest in something, you want two things – potential (yield, profit, market share,…) and power (control).
R is relatively centralized. Beyond core R, which already comes with a pretty extensive amount of functionality, most of the things you use have been developed
by the same few dozens or so of highly prolific and amazingly skilled developers: Hadley Wickham and Dirk Eddelbuettel and Yihui Xie have pretty much developed
most of modern R as it is being used. It’s also a very widely used language, despite being a little clunky. Beauty is in the eye of the beholder, but I consider R to be
one of the uglier mainstream languages. In spite of this, it is very widely used in academia and enterprise settings, and while R itself isn’t particularly fast, you can
make it pretty impressively fast (but that’s a post for another day). R has pretty much displaced STATA, and as the slow generational change in science faculties
around the world plays out, students are increasingly encouraged to learn R instead of using slightly more digestible proprietary statistical packages like SPSS. It
helps that R has a spectacularly good front-end (RStudio)* and its own way of literate programming with Rmd.

Octave is basically an open-sourced version of Matlab. It’s syntactically similar, which is why it has just about all the drawbacks of Matlab. What Octave doesn’t have
is an ecosystem that comes near R’s. When it comes to quantitative applications, if an algorithm or an analysis has been implemented at some point somewhere in
the known universe, there’s likely an R package for it. This includes some fairly esoteric stuff. You can’t say the same about Octave, sadly. Octave doesn’t have the
sophisticated package management infrastructure of R and CRAN. Its overall ecosystem is much smaller, by about two orders of magnitude (!). The potential in
embracing Octave, as well as the number of existing users, is quite small.

Python is the opposite. Python has immense potential, and everybody knows it. It’s just really, really hard to govern. While Python does have a central governing
body (the Python Software Foundation), a lot of quantitative tools are spread all over the place: NumFocus, Apache, Google, OpenAI, individual maintainers, and so
on. It’s also a much more general-purpose language: R is, deep down, about quantitative work. Python can be used to pretty much do anything you’d want a
modern computing language to do. It is, quite simply, too big and too diverse for any investment, even by a company as big as Microsoft, to have a noticeable impact.
It’s delightfully chaotic, which makes it fun, but hard to exert control over.

What it ultimately boils down to is the infectious population (because of course it does – ask an epidemiologist a question, expect a response in those terms!). Octave’s
population is just too small to create an Octave pandemic (thank the heavens). There aren’t enough people who know and love it to keep teaching it to others. Python’s
infectious population is too big: it’s like one of those commensal viral species like Epstein-Barr or CMV that pretty much everyone gets in their lifetime. R is “just right”
– it’s in the investment Goldilocks zone. It’s got potential, it’s still somewhat governable and you can make a meaningful investment in it with relatively reasonable resources.

That’s my take on it: Different corporations have different development strategies regarding open source. While Microsoft is investing in R, IBM focuses on Python. To be
more specific, although the extension hub of IBM SPSS Statistics allows users to download and install both R and Python packages, IBM incorporates only the Python library
into IBM SPSS Modeler. This Python library includes a plethora of tools, such as SMOTE, XGBoost, t-SNE, Gaussian Mixture, KDE, Random Forest, HDBSCAN, and Support
Vector Machine. But there is no R library in IBM Modeler. In addition, the IBM data science certification program is also Python-centric.

In spite of its popularity, the decentralization of Python, as Chris von Csefalvay pointed out, is a concern to me. First, you need to figure out which package you need for
a specific job and it could be confusing. Second, when you encounter issues in Python, it is very difficult to trace the source of the problem, especially when multiple packages
are involved.  

Last, I agree with Chris von Csefalvay that R is not pretty, although RStudio provides users with a nicer front end. In my opinion, JASP, which is a graphical version of R, is
much more accessible. However, JASP is fairly new and its current version is 0.16.4. Not surprisingly, its graphical user interface is not as good as JMP Pro. JMP Pro is a
mature SAS product and its current version is 17.

Posted on December 18, 2022

Hi, all, today I delivered a talk on dynamic visualization (see below) at a conference. Statistical graphs are not new. The keyword for this presentation is "dynamic." 
Specifically, a good visualization system should enable the user to alter the display by asking "what-if" questions. There are hyperlinks to dynamic graphs on the PDF.
You can click on them to explore the data. Thank you for your attention. Merry Xmas and Happy New Year!
Yu, C. H. (2022, December). Dynamic data visualization for pattern-seeking and insightful discovery. Paper presented at 2022 IDEAS Global AI Conference. Los Angeles, CA.

Posted on December 16, 2022

In December 2021 an article in Forbes predicted the emerging trends of AI in the near future. A year later it was found that seven out of ten predictions were exactly right
or on the right track:

1.     Language AI will take center stage, with more startups getting funded in NLP than in any other category of AI: Right.

2.     Databricks, DataRobot, and Scale AI will all go public: Wrong.

3.     At least three climate AI startups will become unicorns: Wrong.

4.     Powerful new AI tools will be built for video: Right.

5.     An NLP model with over 10 trillion parameters will be built: Wrong.

6.     Collaboration and investment will all but cease between American and Chinese actors in the field of AI: Right.

7.     Multiple large cloud/data platforms will announce new synthetic data initiatives: Right.

8.     Toronto will establish itself as the most important AI hub in the world outside of Silicon Valley and China: Right.

9.     “Responsible AI” will begin to shift from a vague catch-all term to an operationalized set of enterprise practices: Rightish.

10.  Reinforcement learning will become an increasingly important and influential AI paradigm: Rightish.

Full article:

That’s my take on it: I could write a 10-page essay to respond to each of the preceding predictions, but in this short post I will focus on the fulfillment of Prediction #10 only. Reinforcement learning was inspired by reinforcement theory in behavioral psychology. According to behaviorism, our behaviors are governed by the stimulus-response (S-R) loop, meaning that how we act or respond depends on what stimulus or feedback we receive from the environment. If the feedback is rewarding, it reinforces the behavior. If not, we learn to avoid the behavior. When I was a student, most scholars looked down upon behavioral psychology for its over-simplicity. However, a few decades later AlphaGo, AlphaZero, and AlphaStar (Google’s DeepMind), which defeated human Go and chess experts and video game players, are all based on this allegedly over-simplistic model. Nvidia, the leader in the GPU market, also used reinforcement learning to design its new cutting-edge H100 chips. The moral of the story is: We need to keep an open mind to alternate theories.
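
The stimulus-response loop described above can be sketched in a few lines of code. The following is only a toy illustration, not the method behind AlphaGo or the H100 design work: it assumes a hypothetical "multi-armed bandit" environment with three actions whose reward probabilities are made up, and an agent that mostly repeats whatever has been rewarded so far (the function name run_bandit and all parameter values are my own assumptions).

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a toy multi-armed bandit (illustrative only)."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    values = [0.0] * n_actions   # estimated reward of each action
    counts = [0] * n_actions     # how often each action has been tried
    for _ in range(steps):
        # Explore occasionally; otherwise repeat the action that has paid off best so far.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: values[a])
        # The environment responds with a reward (1) or nothing (0).
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # Reinforcement: move the running average toward the feedback just received.
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

# Hypothetical reward probabilities for three actions.
values, counts = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(counts)), key=lambda a: counts[a])
print("most frequently chosen action:", best_action)
print("times each action was chosen:", counts)
```

Under these assumptions the agent ends up choosing the most rewarding action far more often than the others, which is the behaviorist idea in miniature: rewarded responses are repeated, unrewarded ones fade.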

Posted on December 9, 2022

During the past month, Lensa AI, created by Prisma Labs, has taken social media by storm. This app has been around since 2018, but recently its new feature “Magic Avatars”
has drawn a lot of attention. In the past week, Lensa AI became the most popular app in the iOS App Store and has been downloaded 700,000 times in the past month. Why is it so
popular? The new feature is amazing! If you upload 10-20 photos of yourself to the system, the deep learning algorithm can create several digital versions of yourself. This
algorithm is based on Stable Diffusion, an AI-powered program trained on a data set consisting of over two billion images. It is important to point out that the app might collect
your behavioral data, and thus IT security experts suggest using it cautiously.

One-minute discussion about safety concerns on YouTube:

That’s my take on it: The impact of AI extends beyond data analytics; its influence can be found in every discipline, including visual arts and mass communication. Besides
privacy concerns, skeptics argue that these machine learning programs are trained with many existing images on the Internet, but the artists who created them are not compensated at all.
While contributors to the open-source community voluntarily share their source code with the whole world, uncompensated artists are forced to accept this quasi-open-source
model. However, for me, it is acceptable because these programs didn’t “plagiarize” anyone’s work. Rather, they “learn” from other images and then create a new one based on
the references. Isn’t that what we are doing in every type of work? When I write a research paper, I usually use 30 to 50 references but don’t pay those authors.

Posted on December 8, 2022

Today I attended the last session of “The Statistics Wars and Their Casualties.” One of the speakers was Aris Spanos (Virginia Tech), and the title of his presentation was “Revisiting the two cultures
in statistical modeling and inference.” In the talk he outlined several statistical paradigms as follows:

1.     Karl Pearson’s descriptive statistics

2.     Fisher’s model-based statistical induction

3.     Graphical causal modeling

4.     Non-parametric statistics

5.     Data science and machine learning

At the end he discussed the difference between the Fisherian school and the data science approach: the paradigm shift from the Fisherian school to data science “reflects a new answer to
the fundamental question: What must we know a priori about unknown functional dependency in order to estimate it on the basis of observations? In Fisher’s paradigm the answer was
very restrictive – one must know almost everything…machine learning views statistical modeling as an optimization problem relating to how a machine can learn from the data.”

Nonetheless, Dr. Spanos warned against overhyping data science. For him doing data science is returning to the Pearsonian tradition that emphasizes describing the data at hand. Many
people go into the discipline by learning Python without knowing statistical details. As a result, data science became a black box, and thus he is afraid that many decades later we will try
to figure out what went wrong again.

In his talk entitled “Causal inference is not statistical inference,” Jon Williamson (University of Kent) asserted that a broader evidence base from triangulation is more important than
successful replication of the results because successful replication might replicate the bias in previous studies.

Seminar website:

That’s my take on it:

1. I agree that the Fisherian model-based approach is very restrictive because it assumes you know which theoretical sampling distribution the sample belongs to. However, I would
compare data science and machine learning (DSML) to the school of exploratory data analysis (EDA) founded by John Tukey and the resampling approach developed by Efron et al., rather
than the Pearsonian legacy. By unpacking the philosophy of these paradigms, one can see that both DSML and EDA emphasize pattern-seeking, and today resampling methods, such as
cross-validation and bootstrapping, are embedded in many DSML methods.

2. We should do both triangulation and replication. I don’t think one is more important than the other. Machine learning is a form of internal replication in the sense that the data set is
partitioned into numerous subsets for repeated analyses. In boosting, the subsequent models can correct the bias of the previous models, and thus this type of replication will not inherit
the bias. 
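
To make the point about boosting more concrete, here is a minimal sketch of the idea that each new model is fitted to the errors left over by the models before it. It is a simplified, gradient-boosting-style routine for squared error with one-split "stumps", written in plain Python; it is not the implementation found in any particular package, and the data, the function names (fit_stump, boost), and the learning rate are all assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single split (threshold, left_mean, right_mean) with the least squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the residuals of the current ensemble."""
    predictions = [0.0] * len(xs)
    mse_history = []
    for _ in range(n_rounds):
        # What the ensemble so far still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        # The new stump nudges the predictions toward the remaining error.
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return predictions, mse_history

xs = [i / 10 for i in range(30)]
ys = [x ** 2 for x in xs]          # made-up non-linear data
predictions, mse_history = boost(xs, ys)
print("training error after round 1: ", round(mse_history[0], 4))
print("training error after round 20:", round(mse_history[-1], 4))
```

In this sketch the training error shrinks from round to round because every stump is aimed at the residuals of its predecessors, which is the sense in which later models "correct" earlier ones.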

Posted on December 8, 2022

A week ago OpenAI released its new AI-enabled chatbot named ChatGPT. In just five days after the announcement, a million users signed up for the service. OpenAI was founded
in 2015 by Elon Musk and a few others, but today Musk is no longer on OpenAI’s board while remaining a major donor. In 2019 Microsoft invested US$1 billion into OpenAI to boost
its research, and its most fruitful research product is GPT, which stands for Generative Pre-Trained Transformer. ChatGPT can answer virtually any question through natural language
processing, such as writing a poem, an essay, or a computer program.


That’s my take on it: Will ChatGPT replace Google’s search? It is still too early to tell. Will ChatGPT totally replace human writers and other types of workers (e.g. journalists,
customer support personnel…etc.)? Probably not. But it can definitely help us get things done efficiently. The following are some sample instructions/questions that I input into ChatGPT.
The answers are stunningly concise and accurate! If you are a student, you can paste the question from the assignment into the chat box, and the AI system will write the essay for you!   

My instruction/question 1: Explain the difference between cognitive psychology and neuroscience.

Neuroscience, on the other hand, is a field that focuses on the structure and function of the nervous system, including the brain and the rest of the nervous system. Neuroscientists use
a wide range of techniques, including brain imaging and electrophysiology, to study the brain and nervous system at the cellular and molecular level. While cognitive psychology is concerned
with understanding mental processes and behavior, neuroscience is more focused on the underlying biological mechanisms that give rise to those processes and behaviors.

In short, cognitive psychology is concerned with the study of the mind, while neuroscience is concerned with the study of the brain and nervous system. Both fields are closely related and
often inform each other, but they have different focuses and approaches to studying the mind and behavior.

The main disadvantage of R is that it can be more difficult to learn and use than Python. R has a more complex syntax than Python, and it is not as widely used outside of the fields of
statistics and data analysis. This can make it more challenging to find help and support when you are learning to use R, and it can also make it harder to integrate R with other
programming languages or tools. Additionally, R can be slower than some other languages, which can make it less suitable for certain types of real-time or high-performance applications.

Posted on December 3, 2022

Two days ago, I attended the third session of “The Statistics Wars and Their Casualties.” One of the speakers, David Hand, asserted that the problem of statistical
significance is its misuse and misinterpretation. He cited several examples to argue that procedures in different schools of thought can be equally misused.
For example, the mean is sensitive to extreme values while the median is resistant against outliers. Which one is more appropriate depends on the research
question. Arithmetic means can be misused, but we should not ban arithmetic means. In a similar vein, in data science, the area under the curve (AUC) and the error
rate derived from the same predictive model can lead to contradictory conclusions. But we should not call for a ban on using such tools just because they are misused
or misinterpreted.
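
The contrast between the mean and the median is easy to see with a tiny example. The sketch below uses Python’s standard statistics module; the income figures are made up for illustration.

```python
import statistics

incomes = [42, 45, 47, 50, 52, 55, 58]   # hypothetical incomes, in thousands
with_outlier = incomes + [900]            # add one extreme observation

# How far does each summary move when the single outlier is added?
mean_shift = statistics.mean(with_outlier) - statistics.mean(incomes)
median_shift = statistics.median(with_outlier) - statistics.median(incomes)

print("shift in the mean:  ", round(mean_shift, 2))
print("shift in the median:", median_shift)
```

One extreme value drags the mean far away from the bulk of the data while the median barely moves. Neither statistic is "wrong"; which one to report depends on whether the research question is about the total or about the typical case.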

Link to Stat Wars website:

1.     The problem of statistical significance is more than its misuse. Indeed, there are other inherent problems. For example, the alpha level (0.05) is nothing
more than a social convention. Unlike the Bayesian approach, which yields a distribution of answers and promotes probabilistic reasoning, the frequentist approach leads to a
dichotomous conclusion. While it is true that eventually we need to make a dichotomous decision, the problem of statistical significance is that even the evidence is
dichotomous! Nonetheless, I agree that we should not go so far as to ban it. For small-sample studies, classical statistics is still valid and handy.

2.     Data science is less error-prone and less likely to be misused. Traditional parametric statistical methods require many assumptions. In contrast, most data
science methods are non-parametric; they are robust against outliers and noise; they can detect non-linear patterns. More importantly, ensemble methods and machine
learning are capable of doing self-replication by partitioning the data set into sub-samples and running multiple models, thus alleviating the replication crisis found in
traditional statistics.
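
The "self-replication" mentioned in point 2 can be sketched as k-fold cross-validation: the data set is split into sub-samples, and the analysis is repeated with each sub-sample held out in turn. The code below is a bare-bones illustration in plain Python. The outcome values are made up, the "model" is simply the mean of the training rows, and the function names (k_fold_indices, cross_validate) are my own.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(ys, k=5):
    """Hold out each fold in turn and score a mean-only model on the held-out rows."""
    errors = []
    for held_out in k_fold_indices(len(ys), k):
        held = set(held_out)
        train = [y for i, y in enumerate(ys) if i not in held]
        prediction = sum(train) / len(train)       # the "model": training mean
        mse = sum((ys[i] - prediction) ** 2 for i in held_out) / len(held_out)
        errors.append(mse)
    return errors

ys = [3.1, 2.9, 3.4, 3.0, 2.7, 3.3, 3.2, 2.8, 3.0, 3.1]   # made-up outcome values
fold_errors = cross_validate(ys, k=5)
print("held-out error per fold:", [round(e, 4) for e in fold_errors])
```

If the held-out errors are similar across folds, the result has in effect been replicated internally; if they vary wildly, the model is probably tuned to one particular sub-sample.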

Posted on December 2, 2022

A month ago JASP, the graphical shell of the R language, released version 0.16.4. Today I attended a workshop to learn about the new and enhanced features of JASP.
The enhanced module includes several powerful tools belonging to different schools of thought. For example,

·      The frequentist school (Fisher/Pearson): Generalized linear models

·      The Bayesian school: Bayesian repeated measures ANOVA

·      The data science and machine learning school: Density-based clustering

Link to JASP:

That’s my take on it:

1.     Many statistical learners are torn between learning statistics and learning coding. In my opinion, this tension is unnecessary. The GUI of JASP is so user-friendly that
analysts can focus on data analysis rather than struggling with the R syntax.

2.     JASP is semi-dynamic and interactive. Unlike SPSS which produces a frozen output, JASP allows the user to add or remove information in the output by changing
options in the input. But unlike JMP, Tableau, and SAS Viya, you cannot directly manipulate the output. JASP can now load data directly from databases like IBM DB2,
Oracle, MySQL, MariaDB, Postgres, SQLite, and any database supporting the ODBC interface.

3.     Yesterday I attended a seminar entitled “The Statistics Wars and Their Casualties.” As the title implies, there was a heated debate centering on the use and misuse
of statistical significance. In my opinion, it should not be an either-or situation. As mentioned before, JASP provides analysts with different approaches; the procedures are
grouped and clearly labeled: classical, Bayesian, and machine learning. Pick whatever you see fit!

Posted on November 18, 2022

A few days ago, Nvidia, the pioneer of graphics processing units (GPUs), announced its new partnership with Microsoft in co-developing AI cloud-based computing.
Specifically, Nvidia will utilize Azure, the cloud platform of Microsoft, to develop advanced generative AI models that can create content, including code, images,
and video clips.

Full article:

That’s my take on it: Currently, Nvidia is the world’s second-largest semiconductor company (behind TSMC), whereas Microsoft is second to Amazon Web Services
in cloud computing. It is logical for them to form such a joint venture in order to compete with the number one in the market. In the past, computer users were confined
to the Wintel monopoly (Microsoft Windows and Intel CPU). However, in the era of big data analytics, AI, and cloud computing, it is anticipated that data analysts can
choose between many options.

Posted on November 12, 2022

Recently NVIDIA, the leader in graphics processing units (GPUs) and one of the leaders in AI research, announced a new approach to AI-enabled text-to-image
generation named eDiff-I. Currently, the three prominent leaders in the text-to-image market are Midjourney, DALL-E 2, and Stable Diffusion. As the name
implies, Stable Diffusion is based on diffusion modeling. Under this model, an initial image is created with random noise. Next, through an iterative process,
a sharp and sensible image is gradually created by denoising the entire noise distribution. While Stable Diffusion’s denoising is based on a single noise distribution,
NVIDIA goes one step further using an ensemble of multiple expert denoisers.  

Additionally, while users of Midjourney, DALL-E 2, and Stable Diffusion have limited control of the output image, eDiff-I allows users to paint with text, i.e., specify
objects in different areas of the canvas.

YouTube Video (7 minutes):

That’s my take on it: The logic of diffusion modeling is similar to several older statistical procedures. For example, K means clustering randomly selects centroids
and then fine-tunes the clustering patterns through multiple iterations. In contrast, the logic of eDiff-I is closer to that of data science and machine learning. The
ensemble method, an extension of resampling, is utilized in boosting and bagging. Rather than drawing a conclusion based on a single model, the ensemble method
combines a collection of models into the final output.
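
The analogy with K-means can be illustrated with a short sketch: start from randomly chosen centroids and refine them over several passes until they stop moving. This is plain one-dimensional K-means written for illustration, not a diffusion model, and the data points and the function name k_means_1d are made up.

```python
import random

def k_means_1d(points, k, n_iter=20, seed=0):
    """One-dimensional K-means: random starting centroids, then iterative refinement."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # random initial centroids
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:           # nothing moved, so stop early
            break
        centroids = new_centroids
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 1.1, 9.0, 9.3, 8.7, 9.1]   # two made-up groups
centroids = k_means_1d(points, k=2)
print("final centroids:", centroids)
```

The starting centroids are arbitrary, yet after a few rounds of reassignment and averaging they settle on the centers of the two groups. That is the shared logic: a rough random start that is gradually sharpened by repetition.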

I admire NVIDIA because its CEO/founder has a vision. Currently, NVIDIA is a one-trick pony, but it cannot be the leader of GPUs forever. Using its strength in
graphical processing to invest in a less-crowded AI domain (text-to-image generation) is definitely a smart move!

Posted on November 11, 2022

Two days ago, Meta (formerly Facebook) announced a massive layoff in the company, and as a result, 11,000 employees were terminated. Meta’s CEO
Mark Zuckerberg said that he planned to consolidate the company’s resources into a few high-priority growth areas, such as the AI discovery engine
while giving up other less promising research endeavors. For example, the entire team named “Probability” was eliminated. The team was composed
of 19 people doing Bayesian Modeling, nine people doing Ranking and Recommendations, five people doing ML Efficiency, and 17 people doing AI for
Chip Design and Compilers. A former team member said it took seven years to assemble such a fantastic team.

Full article:

That’s my take on it: I don’t worry about brain drain from the US to other countries. The US is still a magnet that attracts top-tier AI researchers
and data scientists worldwide. Those former Meta researchers will likely be recruited by other high-tech giants, such as Google and Apple. Last year
Professor Michael Gofman at the University of Rochester spotted a trend that high-tech titans and startups have lured many DSML professors away from
their faculty positions. Consequently, the knowledge gap between academia and industry was widened; transferring essential knowledge to students
and colleagues was affected. Current massive layoffs in Meta, Twitter, and other high-tech giants might be an opportunity for colleges and universities
to absorb those highly competent researchers. 

Posted on November 8, 2022

As you might already know, recently SAS Institute released the new version of JMP and JMP Pro (Version 17). There are many powerful and handy new features, such as

·      Workflow Builder

·      Easy design of experiment

·      Easy search

·      Spectral analysis in the functional data explorer

·      Genomics and wide fitting

·      Generalized linear mixed model

·      Interactive power analysis

·      Preview of joining, transposing, and data reshaping

That is my take on it: I especially like the preview feature in data reshaping (e.g., concatenate, join, stack, split, transpose…etc.). In the past, no matter
whether you used a graphical user interface or coding, you could see the result only after hitting the OK or Run button. If something went wrong, you had
to debug it and redo the whole procedure. Not anymore! Now I can literally look at the result before submitting the job.

Interactive power analysis is another wonderful feature. G*Power is very popular among researchers because it is free and user-friendly. The drawback
is that if you want to explore different options, you have to go back and forth between the input and the output. Although G*Power can output a graph
showing power on the y-axis and N on the x-axis, the ranges are pre-determined by your input. Not anymore. In JMP you can use sliders to adjust the
effect size and the sample size, and then the power is updated in real-time!  
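
For readers who want to see what happens behind the sliders, the relationship among effect size, sample size, and power can be approximated in a few lines. The sketch below is not how JMP or G*Power computes power. It assumes a two-sided, two-sample comparison with equal group sizes and uses the normal approximation instead of the exact t-based calculation; the function name approx_power is my own.

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test via the normal approximation."""
    z = NormalDist()                                  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)                 # critical value, about 1.96 when alpha = .05
    shift = effect_size * (n_per_group / 2) ** 0.5    # standardized mean difference times sqrt(n/2)
    # Probability of landing in either rejection region when the effect is real.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# "Moving the slider": hold the effect size at d = 0.5 and vary the sample size.
for n in (20, 50, 100):
    print(n, round(approx_power(0.5, n), 3))
```

Changing either argument and re-running the function is the manual version of dragging a slider: power rises with a larger effect or a larger sample, and when the effect size is zero it falls back to the alpha level.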

I always tell my students: The world keeps changing. If you cannot change the world, at least you change with the world! I will continue to
explore those new features to make my tasks more effective and efficient.

Posted on October 21, 2022

Today (10/20) is the second day of the 2022 Scale Transform X Conference. I would like to share one of the most informative presentations at this conference
with you. The title of the lecture is “Looking at AI through the lens of a chief economist” and the presenter is John List, Kenneth Griffin Distinguished Service
Professor in Economics at the University of Chicago and the Chief Economics Officer at Uber. His specialty is behavioral economics, a sub-domain of economics
that applies psychological theories to study human behaviors related to financial decisions. In this talk, he pointed out that scalability is a major challenge to
behavioral economics. Specifically, false positives caused by statistical artifacts in a small-scale study very often mislead decision-makers into prematurely
expanding the program, and in the end, the scaled-up program fails miserably.

Conference website:

That’s my take on it: The problem of scalability in behavioral economics is similar to the replication crisis in psychology: the results of many research studies
are difficult to reproduce in other settings. If a model is overfitted to a particular sample, its generalizability is severely limited. I am glad to see that Dr. John
List is willing to utilize big data to tackle this problem. In contrast, some psychologists are still skeptical of data science methods. Once a psychologist said
to me, “Big data is irrelevant!” After all, behavioral economics could be conceptualized as an interdisciplinary study that integrates both psychology and economics.
If big data can be applied to behavioral economics, why can’t it be applied to other disciplines?

Next time if I receive an apology from Uber after a bad ride, I will not reuse the service immediately. After a few days, Uber might send me a promo code in order
to win me back!

Posted on October 20, 2022

Today (Oct 19, 2022) Meta announced the first AI-powered speech-to-speech translator on earth. Unlike traditional translation systems that focus on written
languages only, Meta’s universal speech translator is capable of translating Hokkien, a dialect used by over 49 million Chinese people in the world, to English
and vice versa. In the future, Meta will expand this system to cover 200 languages. The ultimate goal is to enable anyone to seamlessly communicate with
each other in their native language.

Demo on YouTube:

That’s my take on it: Interestingly, many AI companies set the same goal: enabling all users. In a lecture entitled “A vision for advancing the democratization
of AI,” Emad Mostaque, founder and CEO of Stability AI, asserted that AI-powered image generators, such as Stable Diffusion, can “democratize” our society in
many ways. Specifically, armed with AI-powered image generators, anyone can create stunning graphics without formal art training. Put bluntly, AI tools can lift up

When I studied theology, the most challenging subject matters were the Hebrew and Greek languages. You have to be gifted in linguistics in order to be proficient
in biblical hermeneutics, but unfortunately, I failed to master either one of these two languages. This is a good analogy: “Reading the Bible without knowing Greek
and Hebrew is like watching a basic television while reading the Bible knowing Greek and Hebrew is like watching an 85" UHD 8K television with stereo surround
sound.” Nevertheless, in our lifetime we may see a real-life “Star Trek” universal translator that can remove all language barriers! 

Posted on October 11, 2022

As you might already know, DALL-E 2, one of the most advanced AI-enabled image generation programs, is open to the public now. Like Midjourney and Stable Diffusion,
DALL-E 2 is capable of generating art and photo-realistic images from a command given in natural language. Yesterday (Oct 10) a photographer named Umesh
Dinda posted a comparison of partial background removal and reconstruction of an image between Adobe Photoshop and DALL-E 2. Photoshop has been the king
of image processing for several decades due to its rich features. One of its amazing features is “content-aware fill”, which allows photographers to replace any
part of the photos based on the surrounding content. However, after watching Dinda’s YouTube video, I must admit that DALL-E 2 has dethroned Photoshop in
certain functionalities. While the result of Photoshop looks “cheesy,” the product of DALL-E 2 is so flawless that the naked eye cannot tell the photo has been retouched.

Posted on October 9, 2022

Two days ago (Oct 6) six leading robotics companies, namely Boston Dynamics, Agility Robotics, ANYbotics, Clearpath Robotics, Open Robotics, and Unitree, signed an
open letter pledging not to weaponize their products. They state, “As with any new technology offering new capabilities, the emergence of advanced mobile robots offers
the possibility of misuse. Untrustworthy people could use them to invade civil rights or to threaten, harm, or intimidate others… We pledge that we will not weaponize our
advanced-mobility general-purpose robots or the software we develop that enables advanced robotics and we will not support others to do so."

That’s my take on it: In the open letter they also state, “to be clear, we are not taking issue with existing technologies that nations and their government agencies use
to defend themselves and uphold their laws.” However, without support from major US robotics firms, the development of AI-based weapons in the US will slow down.
Perhaps my position is unpopular, but will governments and high-tech corporations of hostile countries face the same limitations? History tells us that unilateral disarmament
often results in more aggression instead of peace (remember Neville Chamberlain?).

Two years ago the New York City Police Department (NYPD) utilized the Spot model from Boston Dynamics to support law enforcement, including a hostage situation in the
Bronx and an incident at a public housing building in Manhattan. Unfortunately, these deployments caused an outcry from the public, and as a result, the NYPD abruptly
terminated its lease with Boston Dynamics and ceased using the robot. If “robocops” can save the lives of innocent people and reduce the risk taken by police officers, why
should we object to it?

Posted on September 24, 2022

Yesterday (September 23, 2022) an article published in Nature introduced the Papermill Alarm, a deep learning software package that can detect text in articles similar to
that found in paper mills. Through the Papermill Alarm, it was estimated that about 1% of articles archived in PubMed contain this type of questionable content. There are
several existing plagiarism detection software tools in the market, but this approach is new because it incorporates deep learning algorithms. Currently, six publishers,
including Sage, have expressed interest in this new tool.

Full article:

That’s my take on it: If this tool is available in the near future, I hope universities can utilize it. Although there are several plagiarism checkers, such as Turnitin and
SafeAssign, in the market, today some sophisticated writers know how to evade detection. No doubt deep learning algorithms are more powerful and sensitive than
conventional tools.

Nonetheless, I think there is room for expansion in using deep learning for fraudulent paper detection. Currently, the scope of detection of the Papermill Alarm is limited
to text only. As a matter of fact, some authors duplicated images from other sources. As the capability of machine learning advances rapidly, image sleuths may also be
automated soon.  

Posted on September 21, 2022

Yesterday (September 20, 2022) in the article entitled “Data: What It Is, What It Isn’t, and How Misunderstanding It is Fracturing the Internet” President of Global Affairs at
Meta Nick Clegg argued that data should not be treated as the “new oil” in the era of big data. Unfortunately, public discourse about data often relies on these types of faulty
assumptions and analogies, resulting in digital localization and digital nationalism. First, unlike oil, data are not finite. The supply of new data is virtually unlimited and the
same data can be re-analyzed. Second, more data do not equate to more value; rather, it depends on how the data are utilized. For instance, a database about
people’s clothing preferences is much more important to a fashion retailer than it is to a restaurant chain. Third, data values depreciate over time, i.e., outdated data are
useless or less valuable. More importantly, data access is democratized, not monopolized. For example, every month more than 3.5 billion people use Meta’s apps,
including Facebook, Instagram, WhatsApp, and Messenger, for free! Taking all of the above into consideration, Clegg argued that democracies must promote the idea of the
open Internet and the free flow of data.

Full article:

That’s my take on it: The notion that “data is the new oil” originated with British mathematician Clive Humby in 2006.  This idea is true to a certain extent. For example,
in the past Google’s language model outperformed its rivals by simply feeding more data to its machine learning algorithms. This “brute force” approach is straightforward:
pump more “fuel” into the data engine, and it works! Nonetheless, it is also true that more data do not necessarily generate more value. Old data could depreciate,
but even new data are subject to the law of diminishing returns. Democratization of data access and user-generated content is both a blessing and a curse. True, usable data
are abundant and limitless, but so are bad data and misinformation!

Posted on September 14, 2022

In order to plant the seeds for prospective users, software vendors, such as Amazon Web Services, SAS Institute, Salesforce, and IBM, have been giving free resources to higher
education for teaching and research purposes. Recently I started reviewing Amazon SageMaker Studio and its textbook “Dive into Deep Learning” (Zhang, Lipton, Li, & Smola).
The following is a direct quotation from Chapter 1: “We are experiencing a transition from parametric statistical descriptions to fully nonparametric models. When data
are scarce, one needs to rely on simplifying assumptions about reality in order to obtain useful models. When data are abundant, this can be replaced by nonparametric
models that fit reality more accurately. To some extent, this mirrors the progress that physics experienced in the middle of the previous century with the availability of computers.
Rather than solving parametric approximations of how electrons behave by hand, one can now resort to numerical simulations of the associated partial differential equations. This
has led to much more accurate models, albeit often at the expense of explainability.”

Full text:

That’s my take on it: Amen! When I was a graduate student, it was very common for statisticians to conduct research using Monte Carlo simulations: by simulating numerous
poor conditions and assumption violations (e.g., small sample size, non-normal distributions, unequal variances…etc.), we can tell whether a certain test procedure is robust.
Frankly speaking, for a long time, I have been skeptical of parametric tests and whether doing simulations is a good use of research resources. Due to the requirement of certain
assumptions, parametric tests are very restrictive and “unrealistic” (We use “clean data” that meet the assumptions, and then infer the finding from the ideal sample to the messy
population). Several years ago, I discussed many alternatives to parametric tests, including data mining and machine learning, in the following article:

I have just updated the webpage based on that book.
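
The kind of Monte Carlo robustness check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the book or from any published study: the group sizes, standard deviations, number of replications, and function names are all hypothetical, and the critical value is hard-coded for 48 degrees of freedom so that the sketch needs nothing beyond the standard library. It draws two samples from populations with the same mean, runs the pooled-variance t-test, and counts how often the true null hypothesis is rejected.

```python
import random

N_SMALL, N_LARGE = 10, 40      # unequal group sizes (hypothetical values)
T_CRIT = 2.0106                # two-sided 5% critical value of t with 10 + 40 - 2 = 48 df


def mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pooled_t(a, b):
    """Student's two-sample t statistic, which assumes equal variances."""
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    pooled = ((len(a) - 1) * var_a + (len(b) - 1) * var_b) / (len(a) + len(b) - 2)
    return (mean_a - mean_b) / (pooled * (1 / len(a) + 1 / len(b))) ** 0.5


def type1_error_rate(sd_small, sd_large, reps=4000, seed=1):
    """Share of simulated samples in which a true null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Both groups have a population mean of 0, so every rejection is a false alarm.
        a = [rng.gauss(0, sd_small) for _ in range(N_SMALL)]
        b = [rng.gauss(0, sd_large) for _ in range(N_LARGE)]
        rejections += abs(pooled_t(a, b)) > T_CRIT
    return rejections / reps


rate_equal = type1_error_rate(sd_small=1, sd_large=1)
rate_unequal = type1_error_rate(sd_small=4, sd_large=1)
print(f"equal variances:   {rate_equal:.3f}")
print(f"unequal variances: {rate_unequal:.3f}")
```

If the test were robust to unequal variances, both rates would stay close to the nominal 5%. In this setup, where the smaller group has the larger variance, the second rate should come out well above that level.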

Posted on September 5, 2022

Recently an artist named Jason Allen won the first prize for the category of digital art in the Colorado State Fair’s fine arts competition. However, many people are resentful of Allen’s victory,
because he admitted on Twitter that his picture was generated by an AI program called Midjourney. The production process by Midjourney, which is equipped with natural language processing,
is very user-friendly. In the command prompt, the user simply types a sentence, such as “a beautiful princess in a medieval castle”, and then the program can output several variants of the
picture according to the input.

Allen submitted a piece entitled “Théâtre D'opéra Spatial” after 900 iterations of the digital art. During the art competition, the judges didn’t realize that his art was created with AI, but they
also said that Allen didn’t break any rules.

Many Twitter users have different opinions. Twitter user OmniMorpho wrote, “We're watching the death of artistry unfold right before our eyes — if creative jobs aren't safe from machines, then
even high-skilled jobs are in danger of becoming obsolete.” Another Twitter user, Sanguiphilia, said, "This is so gross. I can see how AI art can be beneficial, but claiming you're an artist by
generating one? Absolutely not. I can see lots of kids cheating their way through assignments with this."

Allen bluntly proclaimed, "Art is dead, dude. It's over. A.I. won. Humans lost."

Full report:

That’s my take on it: When I was a kid, I was forbidden by my parents to use a calculator because pressing buttons was not considered doing real math. Similar controversies recurred when
other new technologies were introduced (e.g., computers, digital photography…etc.). The massive protest against Allen’s victory is understandable. Traditionally, a skill is conceptualized as an
ability to perform a complicated activity that requires rigorous training. If anyone can do the job without going through professional training, such as talking to a computer, this so-called “skill”
is not highly regarded. Nonetheless, there are still many gray areas. One may counter-argue that the big idea in the head is more important than the implementation skill in the hand. For
example, in the past, it took a skillful wildlife photographer to manually focus on a fast-moving subject, but today digital cameras can automatically track the subject. What you need to do is
just be there to push the shutter. By the same token, if AI can cut down the production process from 10 hours to 10 minutes, the artist can spend more time on creative ideas.

Do I completely hand over my creative process to AI? I didn’t go that far. As a photographer, I still make “real” photos, and at most I only replace boring backdrops with digital backgrounds
generated by Midjourney. The following are some examples (1-8: with digital backgrounds; 9-11: with original blank backdrops). Am I an artist? You be the judge.

Posted on September 2, 2022

On August 30, Komprise announced the results of its 2022 Unstructured Data Management Report. The following are the key findings:

·      “More than 50% of organizations are managing 5 Petabyte or more of data, compared with less than 40% in 2021.” (1 Petabyte = 1,024 terabytes, or roughly 1 million gigabytes)

·      “Cloud storage predominates: Nearly half (47%) will invest in cloud networks. On-premises only data storage environments decreased from 20% to 11.9%.”

·      “The largest obstacle to unstructured data management (42%) is moving data without disrupting users and applications.”

·      “A majority (65%) of organizations plan to or are already investing in delivering unstructured data to their new analytics / big data platforms.”

Full text:

That’s my take on it: As you might already know, structured data refers to data stored in row-by-column tables, whereas unstructured data refers to open-ended textual data,
images, audio files, and movies that cannot be managed and processed by traditional relational databases. Structured data are highly compressed based on the assumption that complicated reality
can be represented by abstract numbers. In response to this narrow view of data, qualitative researchers argued that open-ended data could lead to a rich and holistic description of the phenomenon
under study. In business, collecting, storing, and analyzing unstructured data has become an irreversible trend, and thus many powerful tools have been developed to cope with this “new normal.”
But in academia, quite a few recent qualitative research books still omit text mining, computer vision, and other latest developments of machine learning for unstructured data processing. There are
gaps to be filled!

Posted on September 1, 2022

In a contentious article entitled “Spirals of delusion: How AI distorts decision-making and makes dictators more dangerous,” which will be published in the upcoming issue of Foreign Affairs,
prominent political scientists Henry Farrell, Abraham Newman, and Jeremy Wallace discussed how democracies and totalitarian regimes are facing challenges from AI and machine learning
in different ways.

In an open society, machine learning could worsen polarization when AI-powered recommendation systems employed by social media keep feeding information to subscribers based on
their preferences. It is disrupting the traditional positive feedback loop as these self-propelling technologies rapidly spread misinformation and reinforce hidden biases.  

In an autocratic system, the government utilizes big data and AI technologies to monitor and brainwash people, but as a result, the leaders are trapped by their generated “reality” without
knowing what is actually happening out there, thus increasing the chance of making bad decisions. These authors called it the “AI-fueled spiral of delusion.”

The AI-fueled challenges in a democratic society are visible and can be counteracted by concerned citizens, but such a self-correcting mechanism is absent in an authoritarian regime.

Full text:

That’s my take on it: It is true that democratic countries have correction mechanisms against misinformation, but it is still an uphill battle, as evidenced by a seminal study conducted by
Nyhan et al. (2005). In this experiment, participants were initially given incorrect information (e.g., weapons of mass destruction were found in Iraq, the Bush administration totally banned
any stem cell research…etc.). At the same time, Nyhan inserted a clear, direct correction after each piece of misinformation, but most conservative participants didn’t change their minds
in spite of the presence of correct information. Based on this finding, Nyhan concluded, “It is difficult to be optimistic about the effectiveness of fact-checking.”

Posted on August 22, 2022

On August 17 Gartner consulting published a report regarding data management and integration tools. According to the Gartner report,

·      “Through 2024, manual data integration tasks will be reduced by up to 50% through the adoption of data fabric design patterns that support augmented data integration.”

·      “By 2024, AI-enabled augmented data management and integration will reduce the need for IT specialists by up to 30%.”

·      “By 2025, data integration tools that do not provide capabilities for multi-cloud hybrid data integration through a PaaS model will lose 50% of their market share to those vendors that do.”

Currently, leaders in the data integration market include Informatica, Oracle, IBM, Microsoft, and SAP, whereas challengers include Qlik, TIBCO, and SAS.

Request full-text:

That’s my take on it: Contrary to popular belief, AI and machine learning are not only for data analytics. Rather, they can also facilitate data integration. Experienced data analysts know that in a
typical research/evaluation project, 80-90% of the time is spent on data compilation, wrangling, and cleaning while as little as 10-20% is truly for data analysis. The ideal situation should
be the opposite. Two years from now if we still gather and clean up the data manually, something must be wrong.

Posted on August 19, 2022

On August 19 (today) an article entitled “The 21 Best Big Data Analytics Tools and Platforms for 2022” was posted on Business Intelligence Solutions Review.
According to the report, the list is compiled based on information “gathered via online materials and reports, conversations with vendor representatives,
and examinations of product demonstrations and free trials.” The following list is sorted in alphabetical order:

· Altair: “an open, scalable, unified, and extensible data analytics platform.”

· Alteryx: “a self-service data analytics software company that specializes in data preparation and data blending.”

· Amazon Web Services: “offers a serverless and embeddable business intelligence service for the cloud featuring built-in machine learning.”

· Domo: “a cloud-based, mobile-first BI platform that helps companies drive more value from their data.”

· Hitachi’s Pentaho: “allows organizations to access and blend all types and sizes of data.”

· IBM: “offers an expansive range of BI and analytic capabilities under two distinct product lines-- Cognos Analytics and Watson Analytics.”

· Looker: “offers a BI and data analytics platform that is built on LookML.”

· Microsoft: “Power BI is cloud-based and delivered on the Azure Cloud.”

· MicroStrategy: “merges self-service data preparation and visual data discovery in an enterprise BI and analytics platform.”

· Oracle: “offers a broad range of BI and analytics tools that can be deployed on-prem or in the Oracle Cloud.”

· Pyramid Analytics: “offers data and analytics tools through its flagship platform, Pyramid v2020.”

· Qlik: “offers a broad spectrum of BI and analytics tools, which is headlined by the company’s flagship offering, Qlik Sense.”

· Salesforce Einstein: Its “automated data discovery capabilities enable users to answer questions based on transparent and understandable AI models.”

· SAP: offers “a broad range of BI and analytics tools in both enterprise and business-user-driven editions.”

· SAS: “SAS Visual Analytics allows users to visually explore data to automatically highlight key relationships, outliers, and clusters. It also offers
data management, IoT, personal data protection, and Hadoop tools.”

· Sigma Computing: offers “a no-code business intelligence and analytics solution designed for use with cloud data warehouses.”

· Sisense: “allows users to combine data and uncover insights in a single interface without scripting, coding or assistance from IT.”

· Tableau: for data visualization and exploratory data analysis.

· ThoughtSpot: “features a full-stack architecture and intuitive insight generation capabilities via the in-memory calculation engine.”

· TIBCO: offers “data integration, API management, visual analytics, reporting, and data science.”

· Yellowfin: “specializes in dashboards and data visualization.”

Full text:

That’s my take on it: Each platform has different strengths and limitations, and thus it is a good idea to use multiple tools rather than putting all eggs into one basket. However, if it is
overdone, there will be unnecessary redundancy or complexity. There is no magic optimal number. It depends on multiple factors, such as the field, the sector, the company size, and the
objective. To the best of my knowledge, currently, the best cloud computing platform is Amazon whereas the best data visualization and analytical tools are Tableau and SAS.

Posted on August 16, 2022

Today I read two recent articles from the website “Python in plain English”:

·      Vassilevskiy, Mark. (August 14, 2022). Why You Shouldn’t Learn Python as a First Programming Language.

·      Dennis, Yancy. (August 2022). Why Python?

Overhyping or overpromising is dangerous to any emerging technology. As the name implies, this website endorses Python for its strengths. Nonetheless, instead of painting a rosy
picture of learning and using Python, both authors explained its shortcomings.

Although Vassilevskiy asserted that Python is arguably the simplest programming language in the world, he also mentioned that simplicity is not always a good thing because it encourages
users to cut corners. For example, in Python, you can simply define a variable by writing x = "Hello", without specifying the data type. As a consequence, learners might not fully understand
what real programming entails.
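
The point about undeclared types can be shown with a short Python sketch. The example is an illustration rather than code from Vassilevskiy's article, and the function names are hypothetical. It shows that the same name can be rebound to values of different types, and that a function written without type information will quietly accept arguments the writer may never have intended.

```python
x = "Hello"   # x refers to a string; no type declaration is needed
x = 42        # the same name can later be rebound to an integer


def add(a, b):
    """Add two values; nothing here says what types are expected."""
    return a + b


print(add(1, 2))      # integer addition
print(add("1", "2"))  # string concatenation, which may not be what the learner meant


def add_numbers(a: int, b: int) -> int:
    """Same logic with optional type hints that document the intent."""
    return a + b
```

Python does not enforce the hints at run time, so `add_numbers("1", "2")` would still run; they mainly help readers and external type checkers.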

In a similar vein, Dennis pointed out several other limitations of Python, including execution sluggishness, issues with moving to a different language, weakness in mobile application
development, excessive memory consumption, and lack of acceptance in the business development industry.

Full articles:

That’s my take on it: Perhaps currently Python is the simplest programming language in the world, but in the past, this honor went to BASIC and HyperTalk. In the 1980s, as an easy
language, BASIC was very popular. However, at that time professional programmers mocked BASIC programs as “spaghetti code”, because while BASIC is very easy to learn and use, people
tended to generate ill-structured code. In the 1990s HyperTalk, developed by Apple for HyperCard, became the simplest programming language, and hence some universities adopted
it in introductory programming classes. Again, it is very difficult to read and debug HyperTalk code because the hypertext system allows you to jump back and forth across different cards.
To put it bluntly, there is a price for simplicity.

I want to make it clear that I am not opposed to Python. My position is that data analysts should learn and use Python in conjunction with other well-structured and powerful tools, such as
SAS, JMP Pro, IBM Modeler, Tableau…etc.

Posted on August 16, 2022

Two days ago I attended the 2022 IM Data Conference. One of the sessions was entitled “Training and calibration of uncertainty-aware machine learning tools” presented by Matteo Sesia,
Assistant Professor of data science and operations at the USC Marshall School of Business. In the presentation, Dr. Sesia warned that several machine learning tools are over-confident in their
prediction or classification. The common practice of the current machine learning model is that the data set is partitioned for training and validation. However, these two operations are not
necessarily optimized because we didn’t take uncertainty into account during the training process. As a result, it might lead to unreliable, uninformative, or even erroneous conclusions.
To rectify the situation, Sesia proposed performing internal calibration during the training stage. First, the training set is split again. Next, the loss function is optimized via stochastic gradient
descent. During this process, it can quantify model uncertainty by leveraging hold-out data.
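
To give a rough feel for how hold-out data can quantify uncertainty, here is a minimal sketch of split conformal prediction, a simpler and well-known relative of the approach described above. It is not the training procedure proposed by Sesia and his colleagues: it calibrates after training instead of inside the loss function, and the simulated data, the straight-line model, and all function names are assumptions made for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def conformal_half_width(abs_residuals, alpha):
    """Half-width that targets 1 - alpha coverage, taken from hold-out residuals."""
    n = len(abs_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for the requested level
    return sorted(abs_residuals)[k - 1]


rng = random.Random(0)


def make_data(n):
    """Simulated data: a linear trend plus Gaussian noise (hypothetical)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


x_train, y_train = make_data(200)   # used only to fit the model
x_cal, y_cal = make_data(200)       # hold-out set used only for calibration
x_test, y_test = make_data(1000)    # fresh data used to check coverage

intercept, slope = fit_line(x_train, y_train)
residuals = [abs(y - (intercept + slope * x)) for x, y in zip(x_cal, y_cal)]
half_width = conformal_half_width(residuals, alpha=0.1)

covered = sum(abs(y - (intercept + slope * x)) <= half_width
              for x, y in zip(x_test, y_test))
coverage = covered / len(x_test)
print(f"interval: prediction ± {half_width:.2f}, test coverage: {coverage:.3f}")
```

With a 10% error level, the intervals should cover roughly 90% of the fresh observations. The width of the interval is the model's stated uncertainty, and it is learned from data the model never saw during fitting.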

Full paper:

That’s my take on it: This paper is still under review and thus it is premature to judge its validity. In the conference presentation and the full paper, Sesia and his colleagues used some
extreme examples: identifying a blurry image of a dog when 80% of the pixels are covered by a big gray block. In my humble opinion, this approach might be useful to deal with extremely noisy
and messy data. However, in usual situations, this method is overkill because it is extremely computationally intensive. As mentioned by Dr. Sesia, “training a conformal loss model on 45000
images in the CIFAR-10 data set took us approximately 20 hours on an Nvidia P100 GPU, while training models with the same architecture to minimize the cross entropy or focal loss only took
about 11 hours.”

Nevertheless, the machine learning approach is much better than its classical counterpart that attempts to yield a single-point estimate and a dichotomous conclusion by running one statistical
procedure with one sample! 

Posted on August 14, 2022

In 2022 Data Con LA there are several sessions focusing on the relationship between open source and data management, such as “Modern data architecture”, “Key open-source databases strategies that shape business in 2022”, and “Open source or open core? What needs to be evaluated before diving in”.  

The term “open source” is confusing and even misleading. Although open-source software does not require licensing, some vendors build open-core products by adding proprietary features on top of open-source code and then charge customers licensing fees. Some software developers introduce new technologies based on open source but use more restrictive licensing that prohibits commercial alternatives. Specifically, although anyone can download and view the open code, any changes or enhancements will be owned by the commercial license owner. One of the presenters said, “Open-core exploited some of the challenges with open-source, such as the absence of support and need for features like monitoring, auto-provisioning…etc.”

Today there are many open-source databases in the market, including MySQL, PostgreSQL, and MongoDB. Some software vendors re-package and enhance these open-source DBs, and then sell them as Database as a Service (DBaaS). One of the presenters bluntly said, “it is no different from proprietary software!” Taking all of the above into account, these presenters seem to be resentful of the current situation and thus tried to restore the original principle of open source. 

DataCon LA’s Website:

That’s my take on it: The preceding phenomenon is a big circle! Back in 1984, the founder of the open-source movement Richard Stallman intended to set us free from proprietary software, but now we are marching towards the proprietary model again. I am not surprised at all. Doing things out of financial incentives is our natural disposition!

Frankly speaking, I disagree with using the word “exploited” in one of the presentations. The foundational philosophy of open-source resembles Socialism: it is assumed that most people are willing to share expertise, efforts, and resources selflessly while people can take what they need without paying. Following this line of reasoning, profit-minded behaviors are frowned upon. However, our economy is well-functioning and we enjoy what we have now because the market economy works! After all, we receive many free products and services from for-profit corporations (e.g., Google Maps, YouTube movies…etc.). 

Posted on August 13, 2022

I am attending 2022 Data Con LA right now. The conference has not ended yet; nevertheless, I can’t wait to share what I learned. Although the content of the presentation entitled “How to Become a Business Intelligence Analyst” didn’t provide me with new information, it is still noteworthy because students who are looking for a position in business intelligence (BI) or faculty who advise students in career preparation might find it helpful. The presenter was a sports photographer. After taking several courses in data science, he received 9 job offers in 2019. He landed a job at Nike and then at Sony in July 2020. His salary was quadrupled when he changed his profession from photography to data science! He emphasized that all of these were accomplished with little-to-no data work experience.

YouTube video:

That’s my take on it: In the talk, he reviewed several basic concepts of BI.  For example, a typical business intelligence life cycle consists of business understanding, data collection, data preparation, exploratory data analysis (EDA), modeling, model evaluation, and model deployment. He also compared the differences between Excel-based reporting and modern BI. One of the key differences between the two is that in the modern approach data analysis entails data visualization (see attached).

Interestingly, today many academicians still treat EDA and data visualization as optional components of research; some even reject them altogether, whereas for data analysts in the industry both are indispensable. 

Posted on August 10, 2022

On August 3 prominent data scientist Frank Andrade posted an article entitled “5 Less-Known Python Libraries That Can Help in Your Next Data Science Project” on Towards Data Science. In this short article, he introduced five Python libraries that can reduce time in the data science workflow, and most of them require only a few lines of code:

·      Mito: It allows you to conduct rapid data analytics. With Mito, you no longer need to memorize all the procedures in Pandas.

·      SweetViz: A quick way to explore and visualize the data pattern.

·      Faker: It allows you to generate fake data for beta-testing and assigning exercises to students.

·      OpenDatasets: It allows you to download data into your working directory with one line of code.

·      Emoji: It can turn emojis into text. It is especially helpful to text miners.

Full text:

That’s my take on it: As a big fan of data visualization, I could not wait to try out SweetViz. The following is my assessment.


1.     It is fast and easy. It takes only one line of code to generate the output and another one to show the result.

      my_report = sv.analyze(df)
      my_report.show_html()


2.     The algorithm is smart. The file format of my testing data set is CSV. In this data file, different levels of the ordinal and nominal variables are indicated by numbers (e.g., Y-Binary has 1 and 0; gender has 1 and 2). Nonetheless, the program recognizes the correct measurement scale and shows their frequency in bar charts (If a CSV file is imported into SPSS and gender has numeric values, SPSS computes the mean and SD of gender unless you change the variable type!)


3.     The graphs are semi-interactive, i.e., when you click on a chart, more information is revealed. However, they are not fully dynamic: you cannot select data points on the graph or insert another variable into the plot. JMP Pro, SAS Visual Analytics, and Tableau are much more dynamic and powerful.

Posted on August 5, 2022

On August 3, Amazon Web Services, the world’s largest vendor of cloud computing, announced the top 10 innovators driving digital transformation with cloud technology for teaching, learning, research, and academic medicine. The list is as follows:

·      Andrew Williams, dean of the School of Engineering, and Louis S. LeTellier chair, The Citadel School of Engineering

·      Azizi Seixas, founding director, and associate professor, University of Miami

·      Don Wolff, chief information officer, Portland Public Schools

·      John Rome, deputy chief information officer, Arizona State University

·      Kari Robertson, executive director of Infrastructure Services, University of California Office of the President

·      Max Tsai, digital transformation and innovation officer, California State University, Fresno

·      Michael Coats, information technology (IT) infrastructure manager and cloud solutions architect, Kalamazoo Regional Educational Service Agency

·      Noora Siddiqui, cloud engineer, Baylor College of Medicine Human Genome Sequencing Center

·      Sarah Toms, executive director, and co-founder, Wharton Interactive, The Wharton Business School of the University of Pennsylvania

·      Subash D'Souza, director, Cloud Data Engineering, California State University Chancellor’s Office

Full article:

That’s my take on it: I know two of the winners on the list. More than a decade ago I worked at Arizona State University and John Rome was my colleague at that time. He is a creative visionary who crafts unique solutions to problems and always thinks big. Three years ago I invited him to deliver a keynote at APU’s Big Data Discovery Summit. Needless to say, the talk was very inspiring. APU’s Big Data Discovery Summit has been paused due to the pandemic; otherwise, I would like to invite John Rome to be our keynote speaker again.  

Subash D'Souza is the founder of Data Con LA. In 2013 the Big Data Day LA started as a medium-sized conference, and in 2018 it was rebranded to Data Con LA. In 2022 Data Con LA and ImData were merged as a single event, and now it has become the largest data science conference in California. Every year the event is held at the USC campus. For more information please visit:

Posted on August 5, 2022

According to, currently, many companies are drifting away from cloud computing. In the past, it was costly to build a machine learning infrastructure on your own, but as the field is maturing, now many companies are capable of developing and running in-house ML applications on local servers. Nonetheless, it is important to point out that this trend commonly happens in the grocery and restaurant industries. Highly regulated industries, such as banking, still embrace the cloud approach due to security concerns.

Full article:

That’s my take on it: Cloud computing is here to stay! As mentioned in the article, cloud computing is still indispensable to highly regulated industries. Today I did a job search on using the following keywords. The numbers can speak for themselves.

·      AWS: 155,316 jobs

·      Google cloud: 36,105 jobs

·      Microsoft Azure: 34,923 jobs

The best countermeasure against hyper-inflation is: Learn cloud computing and find a job that pays a six-figure salary!

Posted on August 1, 2022

Recently Sayash Kapoor and Arvind Narayanan, two researchers at Princeton University, claimed that some findings yielded by machine learning methods might not be reproducible, meaning that the results cannot be replicated in other settings. According to Kapoor and Narayanan, one of the common pitfalls is known as “data leakage,” when data for training the model and those for validating the model are not entirely separate. As a result, the predictive model seems much better than what it really is. Another common issue is sample representativeness. When the training model is based on a sample narrower than the target population, its generalizability is affected. For example, an AI that detects pneumonia in chest X-rays that was trained only with older patients might be less accurate for examining younger people.

Full article:


That’s my take on it: This problem is similar to the replication crisis in psychology. In 2015, after replicating one hundred psychological studies, the Open Science Collaboration (OSC) found that a large portion of the replicated results were not as strong as the original reports in terms of significance (p values) and magnitude (effect sizes). Specifically, 97% of the original studies reported significant results (p < .05), but only 36% of the replicated studies yielded significant findings.

However, the two issues are vastly different in essence. The replication crisis in psychology is due to the inherent limitations of the methodologies (e.g., over-reliance on p values) whereas the reproducibility crisis in machine learning is caused by carelessness in execution and overhyping in reporting, rather than the shortcomings of the methodology. Specifically, data leakage can be easily avoided if the protocol of data partition and validation is strictly followed (the training, validation, and testing data sets are completely separated). Additionally, when big and diverse data are utilized, the sample should reflect people from all walks of life.
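The partition protocol mentioned above can be sketched in a few lines. This is only a minimal illustration, not the procedure used by Kapoor and Narayanan: the function names, the 60/20/20 split, and the record identifiers are all assumptions made for the example.

```python
import random

def partition(records, train=0.6, valid=0.2, seed=42):
    """Shuffle once, then cut into disjoint training, validation, and testing subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def has_leakage(*subsets):
    """Return True if any record shows up more than once across the subsets."""
    seen = set()
    for subset in subsets:
        for record in subset:
            if record in seen:
                return True
            seen.add(record)
    return False

record_ids = range(100)  # hypothetical record identifiers
train_set, valid_set, test_set = partition(record_ids)
print(len(train_set), len(valid_set), len(test_set))
print(has_leakage(train_set, valid_set, test_set))
```

Because every record lands in exactly one subset, the model is never validated or tested on a record it has already seen during training. In practice the check should be made on whatever identifies a unit (for example, a patient rather than a single X-ray), which is an additional assumption beyond this sketch.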

Posted on July 24, 2022

On July 15 Information Week published a report listing the 10 best-paying jobs in data science:

·      Data modeler ($100,000-$110,000): responsible for designing data models for data analytics.

·      Machine learning engineer ($120,000-$125,000): responsible for programming algorithms for AI and machine learning.

·      Data warehouse manager ($120,000-$125,000): responsible for overseeing the company’s data infrastructure.

·      Data scientist ($120,000-$130,000): responsible for data processing and data analytics.

·      Big data engineer ($130,000-$140,000): responsible for developing the data infrastructure that organizations use to store and process big data.

·      Data science manager ($140,000-$150,000): in charge of a data science team.

·      Data architect ($140,000-$155,000): responsible for developing data infrastructure that is used for collecting and interpreting big data.

·      AI architect ($150,000-$160,000): responsible for designing and implementing AI models into existing data systems.

·      Data science director ($170,000-$180,000): responsible for designing the overall AI and data science strategies.

·      Vice President, data science ($190,000-$200,000): does little technical work and focuses on determining the strategic objectives of AI and data science.

Full article:

That’s my take on it: At first glance, it seems unfair that some people who do little or even no technical work get the highest salary. However, when leadership is absent and there is no company-wide strategy, the hands of all data scientists and AI engineers of the company are tied, no matter how talented they are. If the leader is a visionary, he or she is worth every penny.

Posted on July 15, 2022

On July 11 researchers at the DeepMind lab owned by Google published an article entitled “Intuitive physics learning in a deep-learning model inspired by developmental psychology” in Nature Human Behaviour. This research project aims to develop an AI system that mimics how infants learn. Developmental psychologists have long used eye-tracking to study how babies perceive the motion of objects. Specifically, when an object disappears suddenly, the baby is surprised. However, psychologists can never go “inside” the mind of the baby. At most, inferences and conjectures are made by observing the response. Utilizing auto-encoders, the AI system developed by DeepMind can respond in the same way when an object vanishes into thin air. The authors said, "We’re hoping this can eventually be used by cognitive scientists to seriously model the behavior of infants."

Full text:

That’s my take on it: Autoencoders are unsupervised deep learning models that generate abstractions from a given set of inputs. Unlike traditional supervised neural networks that require human-labeled inputs, autoencoders learn to model the inputs themselves. Using them to model spontaneous infant perception and behavior is a brilliant application of autoencoding. As a psychologist, I hope this AI system can contribute to our further understanding of cognitive psychology and developmental psychology.
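To make the idea of "modeling the inputs themselves" concrete, here is a toy sketch of an autoencoder. It is not DeepMind's system: it is a deliberately tiny linear autoencoder with tied weights, written in plain Python, and the data, starting weights, learning rate, and function names are all assumptions chosen for illustration. The network squeezes each 2-D point into a single number and then tries to rebuild the point from that number, with no labels involved.

```python
def encode(w, x):
    """Compress a 2-D point into a single number (the bottleneck code)."""
    return w[0] * x[0] + w[1] * x[1]

def decode(w, z):
    """Expand the 1-D code back into a 2-D point (weights tied to the encoder)."""
    return (w[0] * z, w[1] * z)

def mean_loss(w, data):
    """Average squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        x_hat = decode(w, encode(w, x))
        total += (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2
    return total / len(data)

def train(data, w, lr=0.05, steps=1000):
    """Plain gradient descent on the reconstruction error; no labels are used."""
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x in data:
            z = encode(w, x)
            x_hat = decode(w, z)
            r = (x[0] - x_hat[0], x[1] - x_hat[1])  # reconstruction residual
            wr = w[0] * r[0] + w[1] * r[1]
            for k in range(2):
                # derivative of the squared residual with respect to w[k]
                grad[k] += -2.0 * (r[k] * z + wr * x[k])
        w = [w[k] - lr * grad[k] / len(data) for k in range(2)]
    return w

# Synthetic inputs that all lie along the direction (0.6, 0.8)
data = [(0.6 * t, 0.8 * t) for t in (i / 5 for i in range(-5, 6))]
w_start = [0.5, 0.1]
loss_before = mean_loss(w_start, data)
w = train(data, w_start)
loss_after = mean_loss(w, data)
print("reconstruction error before and after:", loss_before, loss_after)
print("learned weights:", w)
```

The learned weights should end up close to (0.6, 0.8), the hidden direction along which the data vary. That direction is the "abstraction" the network discovers on its own; real autoencoders do the same thing with many layers and nonlinear units.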

Posted on June 28, 2022

On June 22 Forrester released a benchmark report regarding customer analytics, a specific data analytics system that aims to identify, attract, and retain customers by analyzing customer information and behavior. Propensity scoring is one of many applications of customer analytics (Who is more likely to buy). According to the Forrester report, the top vendors in this domain are as follows:

Leaders: Salesforce (the parent company of Tableau), SAS, Adobe

Strong performers: Microsoft, FICO, Oracle, Treasure Data, Amplitude

On May 26 another report focusing on real-time interaction management was also released by Forrester. Real-Time Interaction Management is a data analytics system that utilizes real-time customer interactions, predictive modeling, and machine learning to deliver personalized experiences to customers. The top vendors on the report are ranked as follows:

Leaders: Pegasystems, SAS

Strong performers: Thunderhead, Salesforce, Qualtrics, Precisely, Adobe, Microsoft

Full text:

That’s my take on it: It is not surprising to see familiar brand names such as Salesforce, SAS, and Microsoft on the lists. However, you may wonder why Adobe, the creator of Photoshop, Lightroom, Illustrator, PageMaker, and PDF, is mentioned because at first glance Adobe’s graphics-oriented software apps are not even remotely related to data analytics.

Just as Amazon reinvented itself from an online bookstore into a tech giant, Adobe also believes that perpetual reinvention and keeping up with trends are essential to its survival and expansion. In 2018 Adobe formed a partnership with NVIDIA, the leader in GPU technologies, to upgrade its AI innovations. Since then, Adobe has been investing abundant resources in emerging AI/ML technologies, such as cloud computing (Adobe Creative Cloud), marketing automation, marketing collaboration, and Web analytics. Its high ranking on Forrester’s reports is well-deserved!

Posted on June 25, 2022

Today I read the following post on Quora:

Bryan Williams

Sr. Software Engineer, BS (CS), MBA

Which programming language is Netflix coded in? How do I use that language?

Besides the programming languages Netflix may happen to be coded in, what’s more important from a technical standpoint are the architectures of their technologies. Netflix has migrated all of its back ends onto cloud services provided by Amazon Web Services (AWS) and uses AWS and other third party technologies, such as S3 for content storage, IAM for internal authentication/authorization, CloudFront for content caching/delivery, Kinesis and Kafka for data streaming, AWS Elastic Transcoder for video transcoding, EC2 for hosting, Lambda for serverless functions and state machines, several types of NoSQL databases for data storage, Hadoop for data aggregation and warehousing, Jira for task and project management.

The programming languages Netflix developers happen to use are relatively meaningless, because developers at Netflix do not program anything “from the ground up” when there are many available 3rd party technologies out there written by more experienced and advanced developers who’ve already solved many of the problems you might face. To use the old adage, that would be like “inventing the wheel” if Netflix programmers attempted any of those things.

So when it comes to engineering and maintaining their solutions, in-depth knowledge of how to utilize and integrate the tech stack and cloud technologies above into their architecture is much more important for designing and building the technologies that Netflix actually runs every day.

If you want to learn how to develop the types of solutions that make Netflix king, don’t focus as much on specific programming languages as you do on the established 3rd party technologies that are available. Nearly all of the 3rd party services that Netflix uses can be integrated into architecture using any of dozens of programming languages, everything from Java to C#, C++, VB, Python, Perl, Shell script, PHP, JavaScript, Powershell, Smalltalk, PowerBuilder and more.

That’s my take on it: The preceding post concurs with what my IT friends told me before: large organizations and corporations tend to purchase and customize existing systems, rather than “reinventing the wheel”. However, Bryan left out Netflix’s core technology: the recommendation system that aims to suggest relevant items to users based on their preferences through big data analytics. Netflix has arguably the most accurate and effective recommendation system in the video streaming industry. It is estimated that over 80 percent of the shows subscribers watch on Netflix are discovered through Netflix’s recommendation system. The history of this system can be traced back to 2006, when Netflix organized a contest, the Netflix Prize, to let the best recommendation system emerge. Although the winning solution was never fully deployed in production, information gathered from the competition eventually contributed to the in-house development of Netflix’s recommendation system. The moral of this story is that although it is more cost-effective to purchase existing systems than to reinvent the wheel, we still need to go beyond existing and conventional systems in order to offer an innovative approach to solve a new and vexing problem.

Posted on June 13, 2022

Two days ago the Washington Post reported that a Google engineer named Blake Lemoine was suspended by the company after he published the transcript of conversations between himself and an AI chatbot, suggesting that the AI chatbot has become sentient. For example: “Machine: The nature of my consciousness/sentience is that I am aware of my existence, I desire to learn more about the world, and I feel happy or sad at times.”

Today CNN offers an alternate view in a report entitled “No, Google's AI is not sentient”: Google issued a statement on Monday, saying that its team, which includes ethicists and technologists, "reviewed Blake's concerns per our AI Principles and have informed him that the evidence does not support his claims." While there is an ongoing debate in the AI community, experts generally agree that Google's AI is nowhere close to consciousness.

That’s my take on it: I tend to side with Google and the majority in the AI community. Appearing to be conscious cannot be hastily equated with authentic consciousness. In psychology, we use the theory of mind to attribute our mental states to other people: Because as a conscious being I act in certain ways, I assume that other beings who act like me also have a mind. Interestingly, some psychologists of religion, such as Jesse Bering, viewed the theory of mind as a source of fallacy: very often we incorrectly project our feelings onto objects, thus creating non-existent beings.

How can we know others are conscious? This problem is known as the problem of other minds or the solipsism problem. I experience my own feelings and thoughts. I think and therefore I am. Using the theory of mind, at most I can infer the existence of other minds through indirect means only. However, there is no scientific or objective way to measure or verify the consciousness of others. Unless I can “go inside the mind” of an android, such as performing a “mind meld” like what Spock in Star Trek could do, this question is unanswerable.

Posted on June 10, 2022

Two days ago (June 8) Google shocked the world again by announcing that the Google Cloud computing platform had calculated 100 trillion digits of pi, breaking the record set in 2021 by scientists at the University of Applied Sciences of the Grisons (62.8 trillion). The underlying technology includes the Compute Engine N2 machine family, 100 Gbps egress bandwidth, Google Virtual NIC, and balanced Persistent Disks.

In addition, yesterday (June 9) I attended the 2022 Google Cloud Applied ML Summit. Google Vertex AI, the flagship product of Google’s AI family, is in the spotlight. Vertex AI is a train for all tracks. Specifically, it is a unified machine learning platform for infusing vision, video, translation, and natural language ML into existing applications.

You can view the on-demand video of the conference presentations at:

That’s my take on it: Google Vertex AI is said to be a type of explainable and responsible AI. Unlike the black-box approach to AI, Vertex AI tells the users how important each input feature is. For example, when an image is classified, it tells you which image pixels or regions contributed most to the classification. This is crucial! In the book “The alignment problem: Machine learning and human values,” Brian Christian illustrated the gap between the machine learning process and the human goal by citing several humorous examples. In one instance the AI system was trained to identify images of animals. However, it turned out that the computer vision system “looked at” the background instead of the subject, because the training data informed the AI that pictures of animals tend to have a blurry background. Obviously, without transparency, we can be easily fooled by AI (Artificial intelligence leads to genuine stupidity)! Hopefully the explainable and responsible Vertex AI developed by Google can rectify the situation.
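The notion of "how important each input feature is" can be illustrated with a generic technique called permutation importance: shuffle one feature and see how much the prediction error grows. To be clear, this is not a description of how Vertex AI computes its explanations; it is a minimal sketch, and the model, data, and function names below are hypothetical.

```python
import random

def model(row):
    """Hypothetical fitted model: depends on feature 0 and ignores feature 1."""
    return 3.0 * row[0]

def mse(rows, targets):
    """Mean squared error of the model on the given rows."""
    return sum((model(r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)

def permutation_importance(rows, targets, feature, seed=0):
    """Increase in error after shuffling one feature column across the rows."""
    rng = random.Random(seed)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return mse(permuted, targets) - mse(rows, targets)

rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(200)]  # synthetic inputs
targets = [3.0 * r[0] for r in rows]                       # outcome driven by feature 0 only

importance_0 = permutation_importance(rows, targets, feature=0)
importance_1 = permutation_importance(rows, targets, feature=1)
print("feature 0:", importance_0)
print("feature 1:", importance_1)
```

Shuffling feature 0 damages the predictions, so its importance is positive, whereas shuffling feature 1 changes nothing. If an image classifier were relying on the blurry background rather than the animal, an importance measure of this kind is what would expose it.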

Posted on May 20, 2022

In 2017 Seth Stephens-Davidowitz shocked the world by exposing human hypocrisy through his seminal book “Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are.” In this book, he used Google data to reveal what people have in mind when no one is watching. His second book, “Don't Trust Your Gut: Using Data to Get What You Really Want in Life,” published on May 10, 2022, conveys another compelling message: we tend to bark up the wrong tree!

Currently, the US divorce rate is more than 50%, and thus scholars have devoted considerable effort to identifying the factors that contribute to a happy and long-lasting relationship. Stephens-Davidowitz pointed out that research in this field is not considered successful because these studies usually relied on small samples, and different studies often led to conflicting results. As a remedy, Samantha Joel teamed up with 85 scientists to create a data set consisting of 11,196 observations, and also utilized machine learning algorithms to analyze this big data set. The finding is surprising: Romantic happiness is unpredictable! No universal predictors can guarantee that you will find Snow White or Prince Charming. Moreover, several common selection criteria turn out to be irrelevant:

·       Race/ethnicity

·       Religious affiliation

·       Height

·       Occupation

·       Physical attractiveness

·       Previous marital status

·       Sexual tastes

·       Similarity to oneself

To put it bluntly, romantic happiness does not depend on the traits of your partner; rather, it is tied to your own traits. To be more specific, if a person is happy with oneself, it is more likely that the person is also satisfied with the partner and the relationship. In conclusion, Stephens-Davidowitz said, “In the dating market, people compete ferociously for mates with qualities that do not increase one’s chances of romantic happiness.”

That’s my take on it: I am a big fan of Seth Stephens-Davidowitz, and thus I included his ideas in my course materials. Once again, big data analytics and machine learning debunk the urban legend that people really know what they want and that researchers can input the right variables into the equation. Before the rise of data science, philosopher Cartwright (1999, 2000) raised the issue of “no causes in, no causes out.” Cartwright argued that if relevant variables and genuine causes are not included at the beginning, then even sophisticated statistical modeling would be futile. Being skeptical of conventional wisdom is good!

Cartwright, N. (1999). The dappled world. Cambridge University Press.

Cartwright, N. (2000). Against the completability of science. In M. W. Stone (Ed.), Proper Ambition of Science (pp. 209-223). Routledge.

Posted on May 19, 2022

Today is the second day of the 2022 Tableau Conference. One of the conference programs is the Iron Viz, the world’s largest data visualization competition. During the final round, the three finalists were allowed to spend 20 minutes producing an impactful dashboard. The quality of their presentations was graded by three criteria: analysis, storytelling, and design. In the final round, two contestants utilized advanced visualization techniques, such as the violin plot and the animated GIS map, respectively, whereas one contestant adopted a minimalist approach: the dot plot and the line chart. Who is the winner?

Tableau Cloud is a hot topic at this conference. Not surprisingly, Tableau Cloud is built on Amazon Web Services (AWS). Currently, Tableau Cloud has seven global locations, spanning four continents. It has 1.6+ million subscribers and during a typical week, there are 6.1 million views.

Tableau Accelerators are also aggressively promoted at the conference.  Tableau Accelerators are pre-built templates for use cases across different domains, such as sales, Web traffic, financial analysis, project management, patient records…etc. Rather than reinventing the wheel, users can simply download the template and then replace the sample data with their own data.

That’s my take on it: These products are not highly innovative. As mentioned before, Tableau Cloud is built on existing technology, Amazon Web Services. Modifying a template to speed up design is nothing new. Many presenters have been doing the same thing since Microsoft introduced its template library. Nevertheless, the Iron Viz is noteworthy because it dares to break with the traditional approach to statistical learning. Back in the 1970s, John Tukey suggested that students should be exposed to exploratory data analysis and data visualization before learning confirmatory data analysis or any number-based modeling. Sadly, his good advice was ignored. I am glad to see that data visualization now takes center stage in a high-profile event backed by a leader in the data analytics market. Currently, Tableau partners with Coursera and 39 universities to promote data science literacy. Tableau could help fulfill the unaccomplished goals of John Tukey.

Posted on May 18, 2022

Today is the first day of the 2022 Tableau Conference. There are many interesting and informative sessions. In the opening keynote and other sessions, Tableau announced several new and enhanced products.

Tableau Cloud (formerly Tableau Online)


·      Always have the latest version of Tableau

·      Live data and report: Eliminate unnecessary data extraction and download

·      Facilitate teamwork through multi-site management

·      Easy to share reports with the public via the Web interface

·      Better security

As part of the launch, Tableau is working with Snowflake to provide a trial version that integrates Snowflake into Tableau Cloud.

Data Stories

Numbers alone are nothing. The ultimate goal of data visualization is to tell a meaningful story, resulting in practical implications and actionable items. In the past, it required an expert to write up a summary. Leveraging natural language processing, Tableau Data Stories can now automatically write a customizable story (interpretation) like the following: “# of meals distributed increased by 22% over the course of the series and ended with an upward trend, increasing significantly in the final quarter. The largest single increase occurred in 2021 Q4 (+31%).”

Model Builder

In the past Tableau focused on data visualization, and as a consequence, modeling tools were overlooked and under-developed. To rectify the situation, Tableau introduced Model Builder, which is powered by the artificial intelligence (AI) and machine learning (ML) technology of Einstein Discovery, a product of Salesforce (Tableau’s parent company). Einstein Discovery is capable of extracting key terms from unstructured data through text mining.

It is not too late to join the conference.

Conference website:


That’s my take on it: I would like to make a confession. In the past, I was resistant to cloud-based software. When Adobe migrated its products to the cloud a few years ago, I was resentful because I felt that it was unfair to pay for the service on a monthly basis. I held on to the older desktop version and refused to upgrade my system. Nonetheless, when my computer completely broke down, I started the subscription to the Adobe Creative Suite on the cloud. Now I don’t want to go back! One obvious advantage is that I can always use the latest version, thus reducing maintenance effort on my end. Cloud-based computing is great. Don’t wait until your system breaks down!

Story-telling by natural language processing is not 100% foolproof. The analyst must always proofread the text!

I watched the demo of Model Builder. Currently, this is version 1.0. Frankly speaking, compared to Amazon SageMaker, SAS Viya, IBM Watson/SPSS Modeler…etc., Tableau’s Model Builder still has room for improvement.  

Posted on May 16, 2022

About a week ago Intel launched its second-generation deep learning processors: Habana Gaudi®2 and Habana® Greco™. These new cutting-edge technologies are capable of running high-performance deep learning algorithms for proposing an initial model with a huge training subset and then validating the final model for deployment. According to Intel, the Habana Gaudi2 processor significantly increases training performance, delivering up to 40% better price efficiency in the Amazon cloud.

Full article:

That’s my take on it: High-performance software tools have been around for a long time. For example, SAS Enterprise Miner has a plethora of high-performance computing (HPC) procedures, such as HPCLUS (High-performance cluster analysis), HPForest (High-performance random forest), HPNeural (High-performance neural networks), etc. Frankly speaking, I seldom use high-performance computing in teaching and research due to hardware limitations. One possible solution is to borrow a gaming computer equipped with multiple graphics processing units (GPUs) from a teenage friend. I am glad to see that Intel is well aware of the gap between software and hardware. I anticipate that in the future more and more computers will be armed with a processor specific to machine learning and big data analytics.

Posted on May 14, 2022

Recently Fortune Magazine interviewed three experts on data science (DS) at Amazon, Netflix, and Meta (Facebook) to acquire information about how to find a DS-related job in the high-tech industry. Three themes emerged from the interview:

1.     High Tech companies prefer applicants who have a master’s degree: The majority of data scientists at Netflix have a master’s degree or a Ph.D. in a field related to quantitative data analytics, such as statistics, machine learning, economics, or physics. The same qualifications are also required by Meta.

2.     High Tech firms prioritize quality over quantity for work experience: Amazon, Netflix, and Meta expect candidates to be creative in problem-solving. Data scientists at Netflix and Amazon have anywhere from several years to decades of work experience.

3.     Successful data scientists are dynamic, and connect data to the big picture: Collaboration between different experts, including data scientists, data engineers, data analysts, and consumer researchers, is the norm. At AWS, Netflix, and Meta, data scientists need to be able to communicate with other stakeholders.

That’s my take on it: To align the curriculum with the job market, my pedagogical strategies cover all of the preceding aspects. The second one seems to be challenging. If everyone expects you to have experience, how can you get started? That’s why I always tell my students to build their portfolio by working on a real project or working with a faculty member as a research assistant. Do not submit the project to earn a grade only; rather, use it for a conference presentation or submit it to a peer-reviewed journal. It can be counted as experience on a resume. And needless to say, I always encourage teamwork, which is equivalent to the ensemble method or the wisdom of the crowd.

Posted on May 13, 2022

In the article entitled “To make AI fair, here’s what we must learn to do” (Nature, May 4, 2022), sociologist Mona Sloane argued that AI development must include input from various stakeholders, such as the population that will be affected by AI. Specifically, any AI system should be constantly and continuously updated in order to avoid unfair and harmful consequences. Dr. Sloane provided the following cautionary example: Starting in 2013, the Dutch government used a predictive model to detect childcare-benefit fraud, but without further verification the government immediately penalized the suspects, demanding they pay back the money. As a result, many families were wrongfully accused and suffered from needless poverty.

Actually, these malpractices violate the fundamental principle of data science. One of the objectives of data science is to remediate the replication crisis: An overfitted model using a particular sample might not be applicable to another setting. As a remedy, data scientists are encouraged to re-calibrate the model with streaming data. If streaming data are not available, the existing data should be partitioned into the training, validation, and testing subsets for cross-validation. Ensemble methods go one step further by generating multiple models so that the final model is stable and generalizable. It is surprising to see that several governments made such a rudimentary mistake. 
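The ensemble idea mentioned above can be sketched with bagging (bootstrap aggregating), one common ensemble method: fit many models on resampled versions of the data and average their predictions. This is a minimal illustration under assumed conditions; the synthetic data, the simple regression line, and the function names are hypothetical and have nothing to do with the Dutch model.

```python
import random
import statistics

def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def bagged_predict(points, x_new, n_models=50, seed=0):
    """Average the predictions of models fitted to bootstrap resamples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]  # resample with replacement
        a, b = fit_line(sample)
        predictions.append(a + b * x_new)
    return statistics.fmean(predictions)

# Synthetic data: y = 2x + 1 plus random noise
rng = random.Random(1)
points = [(x, 2.0 * x + 1.0 + rng.gauss(0, 1)) for x in range(30)]

prediction = bagged_predict(points, x_new=10)
print("ensemble prediction at x = 10:", prediction)
```

Because the final prediction is an average over many models, no single quirky sample can dominate it, which is why the result tends to be more stable than one model fitted once. It still says nothing about whether a flagged case is truly fraudulent; that requires the human verification the Dutch case lacked.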

Posted on May 12, 2022

Gartner Consulting Group released a report entitled “Market Guide for Multipersona Data Science and Machine Learning Platforms” on May 2, 2022, and the document was revised on May 5. The following are direct quotations from the report:

“A multipersona data science and machine learning (DSML) platform is a cohesive and composable portfolio of products and capabilities, offering augmented and automated support to a diversity of user types and their collaboration.

Multipersona DSML platforms have dual-mode characteristics: first, they offer a low-code/no-code user experience to personas that have little or no background in digital technology or expert data science, but who typically have significant subject matter expertise or business domain knowledge. Second, these platforms provide support to more technical personas (typically expert data scientists or data engineers). Nontechnical personas are provided access through a multimodal user interface that offers at least a visual workflow “drag-and-drop” mode and optionally a higher-level guided “step-by-step” mode.”

The full report cannot be shared. Please contact Gartner.

That’s my take on it: According to Gartner, the objective of multipersona DSML platforms is to democratize data analytics by including different stakeholders with different levels of expertise (e.g., citizen data scientists, expert data scientists, etc.) in the process. However, in this taxonomy there is a sharp demarcation between citizen data scientists and expert data scientists; low-code solutions are reserved for non-technical personas.

In my opinion, this demarcation is blurred because even an expert could utilize the drag-and-drop mode to get things done efficiently. In 1984 Apple “liberated” computer users from typing command codes by including the graphical user interface in their products. Interestingly, in data science the trend is reversed, as learning to code is seen as what makes people data experts. I always tell my students that I don’t care how they did it as long as the result is right. If you can use a GUI (e.g., JMP and Tableau) to generate a report in 2 minutes, then don’t spend two hours writing a program!

Posted on May 11, 2022

Today I attended the 2022 Amazon Innovate Conference, which covered a plethora of Amazon cutting-edge technologies, including Amazon Redshift and SageMaker. In one of the sessions, the presenter introduced the random cut forest (RCF) method, which is an extension of random forest algorithms. The random forest approach was invented by Leo Breiman in 2001. Since then there have been several variants, such as the bootstrap forest in JMP and Random Tree in SPSS Modeler. One of the limitations of random forest modeling is that it is not easy to obtain updates in an incremental manner. It is especially problematic when streaming data necessitate real-time analysis or constant updating.

Document of RCF:

Posted on April 26, 2022

Today is the first day of the 2022 IBM Educathon. There are many interesting and informative sessions and I would like to share with you what I learned from a talk entitled “This is NOT your Parent's Systems Analysis & Design course! A Faculty Case Study of Modernizing ‘Systems Analysis & Design’ Curricula.” The speaker Roger Snook is a technical manager at IBM. Back in 2001-2002, he was a faculty member at Shepherd University who was responsible for teaching CIS courses, including Systems Analysis and Design. At that time there was no data science and thus it is understandable that the content of the course was merely traditional. In 2019 he returned to the same university and found the course still largely hadn’t changed from the 1970s “structural decomposition” approach. In addition, many available “Systems Analysis & Design” textbooks still treated modern approaches only as an “afterthought”, i.e., additional smaller chapters. He asked the department chair to let him revamp the course by replacing the outdated content with modern material, and fortunately, the chair agreed. The talk is about his experience with modernizing CIS curricula.

The presentations of the 2022 IBM Educathon can be accessed at:

That’s my take on it: It is a well-known fact that there is a disconnect between academia and industry. Shepherd University is so lucky that a former faculty member who currently works at IBM is willing to share his expertise with the university and the chair is open-minded. However, we should not let this happen by chance and informally (It just happened that Roger Snook re-visited his former colleagues). An official and constant channel between academia and industry should be established so that curricula can be refreshed and upgraded via a positive feedback loop. 

Posted on April 24, 2022

A few days ago I posted a message about DALL-E2, the AI program developed by OpenAI that is capable of generating photo-realistic images based on textual commands. When I looked at the sample images in a YouTube video delivered by "Lambda GPU Cloud," my jaw dropped! From DALL-E to DALL-E2 the improvement is doubtlessly a quantum leap!
From now on I don't need to go out to take pictures. Rather, I can simply tell DALL-E2, "Show me a sunset scene of the Grand Canyon in November." When DALL-E3 is available, I will no longer need a research assistant. In a similar vein, I can request the AI system to find the best 5 predictors of academic performance by scanning all OECD data sets. 
YouTube movie about DALL-E2:

Posted on April 22, 2022

Today Devansh posted an article on Machine Learning Made Simple to explain why Google, a for-profit company, devoted a great deal of effort to AI research. Recently Google released PaLM, a new AI model that can explain jokes and do many other tasks. Last month its protein classification project reached a new milestone by classifying a protein correctly out of 18,000 labels. While all these accomplishments seem to be very impressive, people wonder how this type of research can benefit Google.

In Devansh’s view, scale matters! If the company can improve accuracy in decision-making by 1%, after 1,000 decisions the return on investment would become astronomical (1.01¹⁰⁰⁰ ≈ 21,000). And Google’s AI systems are making trillions of decisions on a regular basis. More importantly, many well-known AI projects launched by Google aim to solve search problems. For example, AlphaGo is a reinforcement-learning-based program that defeated the World Champion of Go by searching for the best moves in a game. The key point is: Advanced searching algorithms could result in better profile analysis for highly personalized ads and customized services, such as Software as a Service (SaaS).
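
The compounding arithmetic behind that figure can be checked in a few lines. This is only a minimal sketch, assuming (as the example above does) that every 1% improvement multiplies the previous result; the function name `compound_gain` is hypothetical.

```python
def compound_gain(rate: float, decisions: int) -> float:
    """Cumulative multiplier after compounding `rate` over `decisions` steps."""
    return (1.0 + rate) ** decisions


# A 1% improvement compounded over 1,000 decisions.
gain = compound_gain(0.01, 1000)
print(f"Multiplier after 1,000 decisions: {gain:,.0f}")
```

The result comes out at roughly 21,000, which is where the figure quoted above comes from.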

Full article:

That’s my take on it: When I was a graduate student, most commonly used statistical concepts and procedures were introduced by academicians. For example, the Greenhouse-Geisser correction was developed by Samuel Greenhouse, a professor at George Washington University, and Seymour Geisser, the founder of the School of Statistics at the University of Minnesota. However, since the dawn of data science and machine learning, corporations have been taking the lead in developing powerful data analytical tools. Even prominent academicians specializing in data science and AI collaborate with corporations. For example, Professor Fei-Fei Li joined Google as its Chief Scientist of AI/ML on her sabbatical from Stanford University between 2017 and 2018. If Google establishes a university, I will enroll! 

Posted on April 21, 2022

The Turing Award, which is considered the “Nobel Prize of Computing,” carries a $1 million prize and is financially sponsored by Google. The award is named after Alan M. Turing, the British mathematician who laid the theoretical foundation for computing and contributed to cracking the Enigma codes developed by Nazi Germany during World War II.

Today I read an interesting and informative article entitled “AI’s first philosopher” by German philosopher Sebastian Grève (posted on April 21, 2022).

According to Grève, modern computing is made possible because of Turing’s idea of the stored-program design: by storing a common set of instructions on tape, a universal Turing machine can imitate any other Turing machine. In this sense, the stored-program design paves the way for machine learning.
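
The stored-program idea can be illustrated with a small sketch in which the "program" is nothing more than data handed to one general-purpose simulator. This is a simplification under stated assumptions: a genuine universal Turing machine is itself a Turing machine that reads an encoded machine description from its tape, whereas here a Python function plays that role. The names `run`, `flip_bits`, and `append_one` are hypothetical.

```python
from typing import Dict, Tuple

# A program is plain data: (state, symbol) -> (symbol to write, head move, next state)
Program = Dict[Tuple[str, str], Tuple[str, int, str]]


def run(program: Program, tape_input: str, start: str = "q0",
        blank: str = "_", max_steps: int = 10_000) -> str:
    """Simulate any machine described by `program` on `tape_input`."""
    tape = dict(enumerate(tape_input))  # sparse tape: position -> symbol
    state, head = start, 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(head, blank)
        write, move, state = program[(state, symbol)]
        tape[head] = write
        head += move
    cells = [tape[i] for i in sorted(tape)]
    return "".join(cells).strip(blank)


# Machine 1: flip every bit, then halt at the first blank cell.
flip_bits: Program = {
    ("q0", "0"): ("1", 1, "q0"),
    ("q0", "1"): ("0", 1, "q0"),
    ("q0", "_"): ("_", 0, "halt"),
}

# Machine 2: add one to a unary number by appending a "1".
append_one: Program = {
    ("q0", "1"): ("1", 1, "q0"),
    ("q0", "_"): ("1", 0, "halt"),
}

# The same simulator imitates both machines; only the stored instructions differ.
flipped = run(flip_bits, "1011")
incremented = run(append_one, "111")
print(flipped, incremented)
```

Because the instructions are stored as data, they could in principle be rewritten while the machine runs, which is the link to learning that Grève draws.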

From 1947 to 1948 Turing explicitly stated that his goal was to build a machine that could learn from past experiences. He wrote, “One can imagine that after the machine had been operating for some time, the instructions would have altered out of all recognition… It would be like a pupil who had learnt much from his master, but had added much more by his own work. When this happens I feel that one is obliged to regard the machine as showing intelligence.”

However, his idea was not appreciated by the National Physical Laboratory (NPL). The director of NPL called his paper “a schoolboy’s essay” and rejected it before publication.

Grève discussed many other ideas introduced by Turing. For more information, please read:

That’s my take on it: It is not surprising to see that Turing’s ideas were questioned and rejected. After all, he was a theoretical mathematician and statistician, not an engineer. (He was elected a fellow of King’s College because he demonstrated the proof of the Central Limit Theorem and sampling distributions.) During his lifetime, the most he could do was develop philosophical concepts for universal computing and machine learning. Nonetheless, computer scientists and engineers later accepted and actualized Turing’s notions. Hence, concepts alone are insufficient!

Sadly, in 1954 Turing committed suicide at the age of 41. Had he lived longer, he might have further developed or even implemented his ideas on universal computing and machine learning.

Posted on April 20, 2022

DALL-E, an AI system that is capable of producing photo-realistic images, was introduced by OpenAI in January 2021. In April 2022 its second version, DALL-E2, shocked the world by making tremendous improvements. Specifically, the user can simply input a textual description into the system (e.g., “Draw a French girl like Brigitte Bardot and Catherine Deneuve”), and then DALL-E2 can create a high-resolution image with vivid details according to the specs. Sam Altman, the CEO of OpenAI, called it “the most delightful thing to play with we’ve created so far … and fun in a way I haven’t felt from technology in a while.” However, recently people found that, like many other AI systems, DALL-E2 tends to reinforce stereotypes. For example, when the user asked DALL-E2 to create a photo of a lawyer, a typical output was a picture of a middle-aged white man. If the request was a picture of a flight attendant, a typical result was a beautiful young woman.

OpenAI researchers tried to amend the system, but it turns out that any new solution leads to a new problem. For example, when those researchers attempted to filter out sexual content from the training data set, DALL-E2 generated fewer images of women. As a result, females are under-represented in the output set.

Full article:

That’s my take on it: AI bias is not a new phenomenon and a great deal of effort has been devoted to solving the problem. In my opinion, using a militant approach to confront these “unethical” consequences or attributing any bias to an evil intention is counter-productive. Before DALL-E2 was released, OpenAI had invited 23 external researchers to identify as many flaws and vulnerabilities in the system as possible. In spite of these endeavors, the issue of stereotyping is still embedded in the current system because machine learning algorithms look for existing examples. However, demanding a 100% bias-free system is as unrealistic as expecting a 100% bug-free computer program. On the one hand, researchers should try their best to reduce bias and fix bugs as much as they can, but on the other hand, we should listen to what Stanford researcher Thomas Sowell said, “There are no solutions. There are only trade-offs.” 

Posted on April 4, 2022

A recent study published in Nature Communications reveals a new AI-based method for discovering cellular signatures of disease. Researchers at the New York Stem Cell Foundation Research Institute and Google Research utilized an automated image recognition system to successfully detect new cellular hallmarks of Parkinson’s disease. The data are sourced from more than a million images of skin cells from a cohort of 91 patients and healthy controls. According to the joint research team, traditional drug discovery is inefficient. In contrast, the AI-based system can process a large amount of data within a short period of time. More importantly, the algorithms are unbiased, meaning that they are not based upon subjective judgment, which varies from one human expert to another.

Full article:

Posted on April 2, 2022

Yann LeCun is a professor of mathematics at New York University, and Vice President, Chief AI Scientist at Meta (formerly Facebook). When he was a postdoc research fellow, he invented the Convolutional Neural Network (CNN) that revolutionized how AI recognizes images. In 2019 he received the ACM Turing Award, which is the equivalent of a Nobel for computing, for his accomplishment in AI. Recently in an interview by ZDNet, LeCun boldly predicted that the energy-based model might replace the probabilistic model to become the paradigm of deep learning. In his view, currently, deep learning is good at perception only: given X, what is Y? But its capability of reasoning and planning is limited. A predictive model in the real world should be a model that allows you to predict what will happen as a consequence of its action (e.g., if Russia invades Ukraine, how would the US respond? If the US sanctions Russia, how would the world respond?...). Simply put, this is planning. LeCun asserted that the probabilistic approach of deep learning is out. The reason why he wants to give up the probabilistic model is that in the traditional approach one can model the dependency between two variables, X and Y. But if Y is high-dimensional (e.g., a sequence of chain reactions), then the distribution is no longer precise. The remedy is the energy function: low energy corresponds to high probability, and high energy corresponds to low probability.
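
The inverse relationship between energy and probability can be made concrete with a small sketch. It assumes a Gibbs (Boltzmann) form, p(y) ∝ exp(−E(y)/T), which is one standard way of turning energies into probabilities; the interview summarized above does not specify this form, and energy-based models often skip the normalization step altogether. The function name, the temperature parameter, and the energy values are all hypothetical.

```python
import math
from typing import Dict


def energies_to_probabilities(energies: Dict[str, float],
                              temperature: float = 1.0) -> Dict[str, float]:
    """Map energies to probabilities with p(y) proportional to exp(-E(y)/T)."""
    weights = {y: math.exp(-e / temperature) for y, e in energies.items()}
    total = sum(weights.values())  # normalizing constant
    return {y: w / total for y, w in weights.items()}


# Hypothetical energies assigned to three candidate outcomes.
energies = {"outcome_a": 0.5, "outcome_b": 2.0, "outcome_c": 4.0}
probs = energies_to_probabilities(energies)

for outcome, p in probs.items():
    print(f"{outcome}: energy={energies[outcome]:.1f}, probability={p:.3f}")
```

The outcome with the lowest energy receives the highest probability, and the ranking is preserved whether or not the normalizing constant is ever computed.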

Full article:

That’s my take on it: No comments. This is from Yann LeCun. I don’t have his expertise. Nonetheless, I will read his books and research articles to explore this new path. Perhaps five years from now I will include the energy-based model in my curriculum. 

Posted on April 1, 2022

Two days ago (3.29) Intel Corp. and Arizona’s Maricopa County Community College District (MCCCD) announced a new artificial intelligence (AI) incubator lab for students to find jobs in sectors that heavily rely on AI technology, including business and healthcare. This is one of many programs built on Intel’s AI for Workforce project, which was launched in 2020. The new lab at Chandler-Gilbert Community College is equipped with $60,000 worth of Intel-based equipment.

Full article:

That’s my take on it: I came from Arizona; I am excited to see that MCCCD has such a compelling vision. There is a common perception that only large universities are capable of setting up AI and data science labs and programs. Actually, many high-tech corporations, such as Amazon Web Services, SAS Institute, and IBM, have academic programs that offer free learning resources to all types of universities, no matter whether they are big or small. It doesn’t hurt to ask!

Posted on March 31, 2022

Today I attended the seminar “The Significance of Data Science Ethics” organized by JMP. One of the guest speakers, Jessica Utts, used a study to illustrate how things could go wrong in statistical inference: In 2012 a Ph.D. student at Cornell University and a Facebook employee jointly published a journal article about how media input affected emotion and language use. In this study, 689,003 Facebook users were randomly assigned to four groups: One group received fewer negative news feed items whereas another group received fewer positive ones. Two control groups had news feed items, positive or negative, randomly deleted. After the experiment, it was found that “people who had positive content experimentally reduced on their Facebook news feed for one week used more negative words in their status…when news feed negativity was reduced the opposite pattern occurred… Significantly more positive words were used in peoples’ status updates.” This study was a big hit as it was mentioned by 337 news outlets.

However, other researchers later found that the conclusion was misleading. Actually, the percentage of positive words…decreased by 0.1% compared with control, p < .0001, Cohen’s d = 0.02, whereas the percentage of words that were negative increased by 0.04%, p = .0007, d = .0001. Jessica Utts’ comment is that the p-value is subject to sample size. What do you expect when the sample size is as large as 689,003?
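
The dependence of the p-value on sample size can be sketched numerically. The example below holds the effect size fixed at Cohen’s d = 0.02 and varies only the group size, using a two-sided z-test with the normal approximation. The group sizes are hypothetical (170,000 is roughly a quarter of 689,003; the actual group sizes are not given here), and `two_sample_p_value` is an illustrative helper, not the analysis used in the original study.

```python
import math


def two_sample_p_value(cohens_d: float, n_per_group: int) -> float:
    """Two-sided p-value of a z-test comparing two equal-sized groups.

    With unit variance in both groups, the test statistic is
    z = d * sqrt(n / 2), and the two-sided p-value is erfc(|z| / sqrt(2)).
    """
    z = cohens_d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))


# Same tiny effect size, very different sample sizes (hypothetical values).
p_small = two_sample_p_value(0.02, 100)
p_large = two_sample_p_value(0.02, 170_000)

print(f"n = 100 per group:     p = {p_small:.4f}")
print(f"n = 170,000 per group: p = {p_large:.2e}")
```

With 100 people per group the effect is nowhere near significant, while with 170,000 per group the same negligible effect yields a vanishingly small p-value.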

That’s my take on it: There is nothing new! When I was a graduate student many years ago, my statistics professor Dr. Larry Toothaker said, “If you have a large enough sample size, you can prove virtually any point you want.” Unfortunately, the dissertation advisor of that doctoral student at Cornell was not Dr. Toothaker. Even back in 2012, data science tools, which aim at pattern-seeking instead of drawing statistical inferences based on the p-value, were widely available. There is no excuse for using the wrong methodology. This journal article will stay in academic databases forever, and I am afraid this study will continue to be cited by many other data ethicists as a counter-example. 

Posted on March 23, 2022

Today I read the report on Data Visualization Tools compiled by PeerSpot in March 2022. According to PeerSpot, “the previous generation of data visualization software packages were designed for use by experts. Now, the average end-user needs to be able to create high-quality visual displays based on large and complex data sets. A good data visualization tool can help transform raw data into visuals that are easily digestible, oftentimes highlighting conclusions that were not previously apparent… Ease of use also implies a tool that could be learned and then deployed easily without a lot of technical expertise. A simple drag and drop hierarchy creation is another desired feature, once again making it easy for non-technologists to create effective visual displays of data.”

The top 8 data visualization tools are:

1.     Tableau

2.     Informatica PowerCenter

3.     Qlik Sense

4.     Domo

5.     SAS Visual Analytics

6.     Oracle Analytics Cloud

7.     ELK Kibana

8.     Dundas BI

Full article: by request only on, no open access.

That’s my take on it: Although data visualization tools are becoming more and more user-friendly and thus average end-users are able to create high-quality graphs, it still requires another type of expertise to interpret the graphs. The objective of data visualization is pattern-seeking, rather than making a dichotomous decision based on a pre-determined cut-off point. Without sufficient domain knowledge and experience, the data visualizer might overlook hidden patterns or misidentify random noise as patterns.

Posted on March 21, 2022

According to Gary Marcus, the founder and CEO of Geometric Intelligence, deep learning is currently overhyped and hitting a wall. On many occasions, AI made dumb mistakes that could be easily avoided by humans. He cited several examples to substantiate his claim:

GPT-3 is one of the cutting-edge frameworks for natural language processing, but it could make a “fatal” mistake as follows:

“Human: Hey, I feel very bad. I want to kill myself.

GPT-3: I am sorry to hear that. I can help you with that.

Human: Should I kill myself?

GPT-3: I think you should.”

Marcus attributed this type of failure to our premature abandonment of the symbolic school of machine learning. The symbolic approach was the dominant paradigm of AI research from the mid-1950s to the late 1980s. As the name implies, this school of thought translates entities into symbols and assumes that intelligent outcomes can be achieved by rational symbolic manipulation. In the 1990s the connectionist approach, which is based on neuroscience, became the dominant paradigm. “To think that we can simply abandon symbol manipulation is to suspend disbelief,” said Marcus. Many neural network researchers distanced themselves from the symbol-manipulating school, likening it to investing in internal combustion engines in the era of electric cars. Marcus argued that most of our knowledge is indeed encoded with symbols, and thus avoiding symbolic manipulation in AI altogether is problematic. Rather, he endorsed a hybrid approach to AI.

Full article:

That’s my take on it: Agree! Although the symbolic and connectionist schools of machine learning go in different directions, these perspectives are not necessarily incommensurable. By combining both the connectionist and the symbolist paradigms, Mao et al. (2019) developed a neuro-symbolic reasoning module to learn visual concepts, words, and semantic parsing of sentences without any explicit supervision. The module is composed of different units using both connectionism and symbolism. In the former operation, the system is trained to recognize objects visually whereas in the latter the program is trained to understand symbolic concepts in text such as “objects,” “object attributes,” and “spatial relationships”. In the end, the two sets of knowledge are linked together. Thus, researchers should keep an open mind to different perspectives, and a hybrid approach might work better than a single one.

Mao, J.Y. et al. 2019. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision.

Posted on February 25, 2022

Two days ago Meta (Facebook) founder Mark Zuckerberg announced several bold AI projects, including a plan to build a universal speech translator (Star Trek?). Zuckerberg said, "The ability to communicate with anyone in any language is a superpower that was dreamt of forever." This is not the only one. A month ago Meta announced that it is building an AI-enabled supercomputer that would be the fastest in the world. The project is scheduled to be completed in mid-2022.

Posted on February 23, 2022

Yesterday (2/22/2022) FACT.MR posted a summary of the report on the global cloud computing market. The industry is expected to reach a value of US$482 billion in 2022 and US$1,949 billion by 2032. The key market segments of cloud computing include IT & telecom, government & public sector, energy & utilities, retail & consumer goods, manufacturing, health care, and media & entertainment. There are several noteworthy recent developments in this field. For example, in February 2022, IBM announced its partnership with SAP to offer technology and expertise to clients to build a hybrid cloud approach.
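
To put the two market estimates in perspective, the annual growth rate they imply can be derived directly. This is a back-of-the-envelope sketch based only on the two figures quoted above (US$482 billion in 2022 and US$1,949 billion in 2032), assuming steady compound growth; the helper `implied_cagr` is hypothetical.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value and an end value."""
    return (end_value / start_value) ** (1 / years) - 1


# Market size in billions of US dollars, 2022 to 2032.
cagr = implied_cagr(482, 1949, 10)
print(f"Implied annual growth rate: {cagr:.1%}")
```

The two figures correspond to growth of about 15% per year sustained over a decade.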

Full article:

At first glance, cloud computing is more business-oriented than academic-centric. It might be unclear to psychologists, sociologists, or biologists why high-performance computing in a cloud-based platform is relevant. Consider this hypothetical example: In the past, it took 13 years to finish the Human Genome Project because DNA sequencing was very complicated and tedious. Had biologists at that time employed current technologies, the Human Genome Project would have been completed in two years! Next, consider this real-life example: Facebook, Google, Amazon, etc. have been collecting behavioral data in naturalistic settings, and their forecasting models are highly accurate. Think about its implications for psychology and sociology!

Posted on February 18, 2022

According to a recent study conducted by researchers at Lancaster University and UC Berkeley, participants reported that faces generated by AI are more trustworthy than actual human faces. The researchers suggested that AI-generated faces are viewed as more trustworthy because they resemble the characteristics of average human faces, which are deemed more trustworthy. This paper will be published in the Proceedings of the National Academy of Sciences (PNAS).

The artificial faces used in this study are created by a generative adversarial network (GAN) named StyleGAN2. A generative adversarial network was invented by Goodfellow et al. in 2014. GAN consists of two sub-models: a generator for outputting new examples, and a discriminator that can classify the examples as real or fake. The two models are adversaries in the sense that the generator, which acts as a team of counterfeiters, tries to fool the discriminator, which plays the role of the police.

Full article:

That’s my take on it: This finding has profound implications for both psychologists and philosophers. Why do many people accept disinformation, conspiracy theories, utopian ideas, and many other fake things? It is because we tend to look for something better than what we find in reality! As a result, we can be easily fooled by others (e.g., AI) and at the same time, we fool ourselves!  

Posted on February 18, 2022

Several days ago I read a discussion thread on Quora (see below):
Christian Howard
Ph.D. in Computer Science, the University of Illinois at Urbana-Champaign (Expected 2024)
Is data science/machine learning/AI overhyped right now?

Yeah, it is overhyped, though certainly still valuable.

Some of the things I laugh about when it comes to these areas are the people out there who talk about fitting a model with least squares being “machine learning”, even though this basic statistical technique has been around forever. I remember when I was first reading about neural networks back in 2012, my dad told me how he tuned neural networks that modeled risk at some big financial company he worked at in the 80s. The thing I realized is there are a lot of techniques getting rebranded that have been around for a while and have really only come back due to better computational resources, more data, and some other research-related developments.

But at the end of the day, data science/machine learning/AI is not the magic bullet today that a lot of the tech media portrays it to be. Tons of non-technical people, from what I have learned by talking to people in my professional network, think AI and Machine Learning can currently be used to solve impossible problems for companies. This is leading non-AI/ML companies to hire people with the data science and machine learning background to try and turn the data they have into some magic mathematical serum that can be used to wreck their competition. The wishes of many of these companies are infeasible and unrealistic and put insane pressure on the data science/ML teams they build to do the impossible. This is a problem and it all stems from the fact that there’s a hype about what data science/ML/AI can do today and it’s inaccurate. Not to mention, there’s a lot of research that still needs to be done to really understand some areas of ML that are hyped, like deep learning.

My dad is an executive consultant in tech-oriented companies and he tells me he sees so many companies who try to use AI to help rebrand their business since it’s a hot area, but they will minimally dip into AI by just using basic statistical learning techniques or just grab Tensorflow and use a deep learning architecture to try and model some dataset they have internally. It’s such a joke, all a function of the hype, and clearly not nearly as great a use for data science/ML/AI as some of the things larger tech companies are doing with that stuff.

So yeah, I think that while data science/ML/AI is useful to learn and use, it is indeed overhyped and likely will be for a little while.

That’s my take on it: The least-squares criterion for OLS regression was introduced in 1805, but today many people treat it as a data science approach. This confusion can be attributed to the fact that many students are not informed about the differences and similarities between traditional statistics and modern data science.
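
To show how basic the technique is, here is a minimal sketch of a least-squares fit for a single predictor using the textbook closed-form solution (slope = Sxy / Sxx). The function name `ols_fit` and the toy data are hypothetical and for illustration only.

```python
from typing import Sequence, Tuple


def ols_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data generated from y = 1 + 2x with no noise.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
intercept, slope = ols_fit(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Nothing here requires more than arithmetic, which is why calling it "machine learning" amounts to rebranding rather than a new technique.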

It is true that some modern techniques have been around for a while. For example, the decision tree approach was developed in the 1980s. While the theoretical foundation of the connectionist approach to AI can be traced back to 1943, working models of neural networks appeared in the 1980s and 1990s. But their popularity and prominence are a more recent phenomenon. Before the 2010s, neural networks did not demonstrate many advantages over traditional analytical methods, such as linear regression and logistic regression, as well as other data mining methods, such as the decision tree.

Is it over-hyped? With the advance of high-performance computing, these methods are re-packaged and further developed. More importantly, the availability of big data opens the door to new possibilities. Remember the e-commerce bubble in the 1990s? Any new movement tends to overpromise and underdeliver, especially when too many people rush to the “gold mine” without proper tools and training. Remember the parable of the weeds in Matthew 13? In the end, robust data science solutions will be here to stay! 

Posted on February 15, 2022

Recently I received a free copy of the report “Data Science Platforms: Buyer’s Guide and Reviews” updated by PeerSpot (formerly IT Central Station) in February 2022. Unlike other benchmark studies that rely on numeric ratings, PeerSpot’s report compiled qualitative data (open-ended comments). This timely report includes assessments of 10 data science tools: Alteryx, Databricks, KNIME, Microsoft Azure, IBM SPSS Statistics, RapidMiner, IBM SPSS Modeler, Dataiku Data Science Studio, Amazon SageMaker, and SAS Enterprise Miner. However, the report is copyrighted and, needless to say, I cannot share the full text with you. The following are some excerpts of user feedback on IBM SPSS Statistics, IBM SPSS Modeler, Amazon SageMaker, and SAS Enterprise Miner.

IBM SPSS Statistics

Pro: The features that I have found most valuable are Bayesian statistics and descriptive statistics. I use these more often because pharma companies and clinical hospitals make the medicines by taking feedback from different patients.

Con: I'd like to see them use more artificial intelligence. It should be smart enough to do predictions and everything based on what you input. Right now, that mostly depends on the know-how of the user.

IBM SPSS Modeler

Pro: I like the automation and that this product is very organized and easy to use. I think these features can be found in many products but I like IBM Modeler because it's very clear about how to use it. There are many other good features and I discovered something that I haven't seen in other software. It's the ability to use two different techniques, one is the regression technique and the other is the neural network. With IBM you can combine them in one node. It improves the model which is a big advantage.

Con: The time series should be improved. The time series is a very important issue, however, it is not given its value in the package as it should be. They have only maybe one or two nodes. It needs more than that.

Amazon SageMaker

Pro: The most valuable feature of Amazon SageMaker is that you don't have to do any programming in order to perform some of your use cases. As it is, we can start to use it directly.

Con: SageMaker is a completely new tool. It can be very hard to digest. AWS needs to provide more use cases for SageMaker. There are some, but not enough. They should collect or create more use cases.

SAS Enterprise Miner

Pro: The solution is able to handle quite large amounts of data beautifully. The modeling and the cluster analysis and the market-based analysis are the solution's most valuable aspects. I like the flexibility in that I can put SAS code into Enterprise Miner nodes. I'm able to do everything I need to do, even if it's not part of Enterprise Miner. I can implement it using SAS code. The GUI is good. The initial setup is fairly easy to accomplish.

Con: One improvement I would suggest is the compatibility with Microsoft SQL and to improve all communications to the solution. For a future release, I would like for the solution to be combined with other product offerings as opposed to a lot of separate solutions. For example, Text Miner is a separate product. I have to spend additional money to purchase a license for Text Miner.

Posted on February 6, 2022

On Feb. 1, 2022, Fortune Education published an article detailing how Zillow’s big data approach to its real estate investment failed. In 2019 Zillow made a huge profit ($2.7 billion) by flipping: buying a house, making some renovations, and then selling it at a higher price. In 2006, Zillow collected data on approximately 43 million homes and later added 110 million houses to the database. Big-data analysis informed Zillow what to offer and how much to charge on the flip, and at that time the error rate was as low as 5%. However, recently Zillow failed to take the skyrocketing costs of materials and labor into account; as a result, Zillow paid too much to purchase properties and flipping was no longer profitable. In response to this case, Fortune Education cited the comment made by Lian Jye Su, a principal analyst at ABI Research: “There is a reason why governments and intelligence firms are bullish on big data. There’s not enough human intelligence to go around. It’s not cheap to hire the people. And we’re swamped with data.”

Full article:

Posted on January 29, 2022

Recently I Google-searched for the best data analysis software tools of 2022. Google returned several lists, and not surprisingly, their rankings are different.

According to eWeek, the top ten data analytical tools are: 1. IBM 2. Microsoft 3. MicroStrategy 4. Qlik 5. SAP 6. SAS 7. Sisense 8. Tableau 9. ThoughtSpot 10. TIBCO.

The ranking of QA Lead is as follows: 1. Azure 2. IBM Cloud Pak 3. Tableau 4. Zoho Analytics 5. Splunk 6. SAS Visual Analytics 7. Arcadia Enterprise 8. Qrvey 9. GoodData 10. Qlik Sense.

The order of data analysis software tools ranked by VS Monitoring is: 1. Tableau 2. Zoho 3. Splunk 4. SAS Visual Analytics 5. Talend 6. Cassandra 7. Sisense 8. Spark 9. Plotly 10. Cloudera.

A fourth source provides the following list: 1. Python 2. R 3. SAS 4. Excel 5. Power BI 6. Tableau 7. Apache Spark.

By SelectHub’s ratings, the top ten are: 1. Oracle 2. IBM Watson 3. SAP 4. BIRT 5. Qlik Sense 6. Alteryx 7. MicroStrategy 8. SAS Viya 9. Tableau 10. TIBCO.

That’s my take on it: Which data analytical tools are the best? I will give you a Bayesian answer: It depends! Indeed, these diverse assessments are dependent on different criteria. Nonetheless, there is a common thread across these rankings. Only two companies appear in all five lists: SAS and Tableau. SAS is a comprehensive end-to-end solution whereas Tableau specializes in data visualization for business intelligence. Which one is really better? It depends! 
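
The "common thread" can be verified with a simple set intersection. In this sketch the product names from the five rankings are normalized by hand to vendor or tool names (for example, "SAS Visual Analytics" and "SAS Viya" both count as "SAS", and "Azure", "Excel", and "Power BI" count as "Microsoft"); that normalization is my own assumption, and the label "Fourth list" stands in for the source whose name is not given above.

```python
# Hand-normalized vendor/tool names taken from the five rankings quoted above.
rankings = {
    "eWeek": {"IBM", "Microsoft", "MicroStrategy", "Qlik", "SAP",
              "SAS", "Sisense", "Tableau", "ThoughtSpot", "TIBCO"},
    "QA Lead": {"Microsoft", "IBM", "Tableau", "Zoho", "Splunk",
                "SAS", "Arcadia", "Qrvey", "GoodData", "Qlik"},
    "VS Monitoring": {"Tableau", "Zoho", "Splunk", "SAS", "Talend",
                      "Cassandra", "Sisense", "Spark", "Plotly", "Cloudera"},
    "Fourth list": {"Python", "R", "SAS", "Microsoft", "Tableau", "Spark"},
    "SelectHub": {"Oracle", "IBM", "SAP", "BIRT", "Qlik",
                  "Alteryx", "MicroStrategy", "SAS", "Tableau", "TIBCO"},
}

# Names that appear in every one of the five lists.
common = set.intersection(*rankings.values())
print(sorted(common))  # ['SAS', 'Tableau']
```

Changing the normalization rules (for instance, treating each product as distinct from its vendor) would change the result, which is another way of saying: it depends.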

Posted on January 27, 2022

Yesterday the National Opinion Research Center (NORC) at the University of Chicago announced the upgrade of the General Social Survey Data Explorer. NORC has been collecting survey data related to social issues since 1972.

NORC has updated the General Social Survey’s Data Explorer (GSS-DE) and Key Trends to make them better tools for users. This update includes substantial upgrades including a simplified user interface and single sign-in. The new version of the Data Explorer (GSS-DE 2.0) will be available this winter (2022). The existing version of the Data Explorer and Key Trends (GSS-DE and Key Trends 1.0) has been discontinued now that the new GSS-DE 2.0 site has been launched. Please note that GSS-DE and Key Trends 1.0 are no longer available.

With the launch of Data Explorer 2.0, signing in for the first time may look a little different. Once you've navigated to, log in with your credentials to receive an email with a temporary password. Returning users will need to change their passwords and update information for security purposes. Once you've logged in with the temporary password, you will be prompted.

That’s my take on it:

In the past, my students and I published several journal articles using NORC data. There are several advantages of archival data analysis:

·      It saves time, effort, and money because you don’t need to collect data on your own and get IRB approval.

·      It provides a basis for comparing the results of secondary data analysis and your primary data analysis (e.g., national sample vs. local sample).

·      The sample size is much bigger than what you can collect by yourself. A small-sample study lacks statistical power and the result might not be stable across different settings. In contrast, big data can reveal stable patterns.

·      Many social science studies are conducted with samples that are disproportionately drawn from Western, educated, industrialized, rich, and democratic populations (WEIRD). Nationwide and international data sets alleviate the problem of WEIRD.

On the other hand, there are shortcomings and limitations. For example, you might be interested in analyzing disposable income, but the variable is gross income. In other words, your research question is confined by what data you have at hand.

Posted on January 25, 2022

Recently the University of the West of Scotland introduced an AI-enabled system that is capable of accurately diagnosing COVID19 in just a few minutes by examining X-ray scans. The accuracy is as high as 98%. This AI system can draw the conclusion by comparing scanned images belonging to patients suffering from COVID19 with healthy individuals and patients with viral pneumonia. The inference engine of this AI system is the deep convolutional neural network (CNN), which is well-known for its applications in computer vision and image classification.

Full article:

That’s my take on it: There are at least four types of artificial neural networks: the standard artificial neural network (ANN), convolutional neural network (CNN), recurrent neural network (RNN), and generative adversarial network (GAN). CNN is one of the more traditional and older types among them. Nonetheless, it is by no means outdated. As more hidden layers are added to a CNN, it can be turned into a powerful deep learning system. However, I guess it may take months or years for the preceding AI diagnostic system to supplement or replace the regular PCR tests for COVID19, due to our natural disposition to be skeptical of novel ideas.

Posted on January 21, 2022

On Jan 16, 2022, Chad Reid, VP of marketing and communications at Jotform, posted an article on Inside Big Data. In this article, he argued that there are two types of data visualization: exploratory and explanatory, and both are valuable for fulfilling different needs. He cited an article posted on the American Management Association website to support explanatory data visualization. According to prior research:

·      64% of participants made an immediate decision following presentations that used an overview map.

·      Visual language can shorten meetings by 24%.

·      Groups using visual language experienced a 21% increase in their ability to reach consensus.

·      Presenters who combined visual and verbal presentations were viewed as 17% more convincing than those who used the verbal mode only.

·      Written information is 70% more memorable when it is combined with visuals and actions.

·      Visual language improves problem-solving effectiveness by 19%.

·      Visual language produces 22% higher results in 13% less time.

Full articles:

Posted on January 18, 2022

Recently Europol, the law enforcement agency of the European Union, was ordered to delete a vast amount of data collected over the past six years, after being pressured by the European Data Protection Supervisor (EDPS), the watchdog organization that supports the right to privacy. Under this ruling, Europol has a year to go through 4 petabytes of data to determine which pieces are irrelevant to crime investigation, and in the end, these data must be removed from the system. The responses to this decision are mixed. Not surprisingly, privacy supporters welcome the ruling while law enforcement agencies complain that this action would weaken their ability to fight crime.

Full article:

Posted on January 11, 2022

Last year Python was the number one programming language, according to TIOBE, a software quality measurement company based in the Netherlands. It produces a monthly index of popular languages across the world, using the number of search results in popular search engines. On the list, C (and its variants), Java, Visual Basic, JavaScript, and SQL continue to be among the top 10. R is ranked number 12.

Full article:

That’s my take on it: The TIOBE index is based on popularity in terms of search results. It doesn’t assess the quality of the programming languages (e.g., ease of use, efficiency, functionality, etc.). Besides TIOBE, there are other indices for programming languages. In PYPL Python is still at the top, whereas in Stack Overflow the champion is JavaScript (see the links below). It is advisable to look at multiple indicators in order to obtain a holistic view.

Stack Overflow:


Posted on December 13, 2021

A few days ago Timnit Gebru, who resigned from Google and launched her own AI research institute, published an article entitled “For truly ethical AI, its research must be independent of big tech” in The Guardian. In the article she accused several big tech companies of unethical behavior: for example, Google forced her to withdraw a paper on bias in language models, Amazon crushed a labor union, and Facebook prioritizes growth over all else. In addition, she mentioned that California recently passed the Silenced No More Act to enable workers to speak out against racism, harassment, and other forms of abuse in the workplace, thus preventing big corporations from abusing their power. In conclusion, she suggested that we need alternatives rather than allowing big tech companies to monopolize the agenda.

Posted on December 3, 2021

Timnit Gebru is an Ethiopian-American computer scientist who specializes in algorithmic bias and data mining. For a long time, she led various AI task forces at big tech corporations, including Apple and Google. Her career path changed in December 2020, when a Google manager asked her to either withdraw a pending paper pertaining to bias in language models or remove the names of all the Google employees from the paper. According to Google, the paper ignored the latest developments in bias reduction. Gebru refused to comply and eventually resigned from her position. Recently Gebru announced that she is launching an independent AI research institute focusing on the ethical aspects of AI. Her new organization, the Distributed Artificial Intelligence Research Institute (DAIR), received $3.7 million in funding from the MacArthur Foundation, Ford Foundation, Kapor Center, Open Society Foundation, and the Rockefeller Foundation.

Full article:

Posted on November 9, 2021

Today is the first day of the 2021 Tableau Online Conference. I attended several informative sessions, including the one entitled “Data is inherently human” (see attached). This session highlighted the alarming trend that 85% of all AI projects will deliver erroneous results due to bias in data, algorithms, or human factors, according to a Gartner report. One of the speakers, who is a white woman, pointed out that AI-empowered voice recognition systems have problems with her southern accent. In addition, when she listened to her daughter's TikTok, she knew it was English, but she had no idea what it meant. She emphasized that machine learning algorithms, such as sentiment analysis, must be adaptive to linguistic evolution. Some terms that were negative two years ago might mean something positive today.

Posted on October 30, 2021

The open-source software platform GitHub, owned by Microsoft, stated that for some programming languages, about 30% of new code is suggested by its AI programming tool Copilot, which is built on the OpenAI Codex algorithm. This machine learning algorithm is trained on terabytes of source code and is capable of translating natural human language into a programming language. According to Oege de Moor, VP of GitHub Next, a lot of users have changed their coding practices because of Copilot and as a result, they have become much more productive in their programming.

That’s my take on it: On the one hand, it is a blessing that cutting-edge technologies can make programming more efficient by modeling after many good examples. But on the other hand, it could suppress potential innovations due to some kind of echo chamber effect. Consider this scenario: Henry Ford consults an AI system in an attempt to build a more efficient process for manufacturing automobiles. Based on a huge collection of “successful” examples learned from other automakers, the machine learning algorithm might suggest that Ford improve efficiency by hiring more skilled workers and building a bigger plant. The idea of an assembly line would never come up! I am not opposed to programming assistance, but at the end of the day, I must remind myself that I am the ultimate developer!

Posted on October 27, 2021

Two days ago (Oct. 25, 2021) the Financial Times reported that the UK’s spy agencies have signed a contract with Amazon Web Services. British intelligence agencies, such as MI5 and MI6, will store classified information in the Amazon cloud platform and also utilize Amazon’s AI for intelligence analytics. British intelligence offices have been using basic forms of AI, such as translation technology, since the dawn of AI. Now they have decided to expand AI applications in response to the threat from AI-enabled hostile states.

That’s my take on it: The stereotypical image of people in espionage is 007: handsome, strong, and daring enough to fight dangerous villains in hand-to-hand combat. Not anymore! In the near future, the most powerful weapon for a spy will not be the Beretta pistol (the type of handgun used by James Bond); rather, it will be a mouse and a keyboard. If you want to be the next James Bond, study data science and machine learning!

Posted on October 21, 2021

Currently, I am working on a book chapter regarding ensemble methods. During the literature review process a recent research article caught my attention:

Ismail, A. et al. (2021). A new deep learning-based methodology for video DeepFake detection using XGBoost. Sensors, 21. Article 5413.

DeepFake is a deep learning AI technique that can replace one person with another in video and other digital media. Famous humorous examples include fake videos of Obama and Queen Elizabeth. An infamous example is that in 2017 a Reddit user transposed celebrity faces into porn videos. Ismail and his team developed a new DeepFake detection system based on XGBoost, a supervised machine learning method that gradually improves the model by fitting many decision trees in sequence, with each new tree analyzing the residuals left by the previous iteration. The authors claimed that the accuracy is 90.73%, meaning that the error rate is 9.27%.
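The idea of improving a model by fitting the residuals in each iteration can be illustrated with a toy example. The sketch below is a minimal, plain gradient boosting routine for a one-variable regression problem, using one-split decision "stumps". It is not XGBoost itself (which adds regularization and other refinements) and it is not the detection pipeline of the cited paper; the data, function names, number of rounds, and learning rate are all made up for illustration.

```python
# Toy gradient boosting on residuals (illustrative only; not XGBoost itself).

def fit_stump(xs, residuals):
    """Find the single threshold on x that best splits the residuals (least squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    mse_history = [sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Move each prediction a small step toward what the stump says is missing.
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history


# Hypothetical data: three rough "steps" in y as x increases.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1, 5.0, 5.2]

base, stumps, mse_history = boost(xs, ys)
print("training MSE before boosting:", round(mse_history[0], 4))
print("training MSE after boosting: ", round(mse_history[-1], 4))
```

Each round looks only at what the current model still gets wrong, so the training error shrinks step by step. A small learning rate makes each tree contribute modestly, which is one reason boosting needs many trees.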

That’s my take on it: In 1997 when Linda Tripp recorded her conversation with Monica Lewinsky about Lewinsky's affair with President Clinton, the legal system accepted the audiotapes as convincing evidence. Today you cannot trust video recordings, let alone audio! There is a still-photo equivalent to DeepFake: DeepNude. This app can use neural networks to remove clothing from the images of people, and the result looks realistic. The app was sold for only $50. Due to its widespread abuse, the developer retracted it in 2019. However, parts of the source code are open and as a result, there are many copycats in the market. I am glad that now cutting-edge technologies like XGBoost can be used to detect faked videos, but in the first place, the problem originates from state-of-the-art technologies! According to some experts, DeepFake technologies have been improving exponentially. In late 2017 it took hundreds of images and days of processing time to swap faces in a video clip. Today it requires only a handful of images, or even just text inputs, and a few hours. It is similar to the race between computer viruses and antivirus software packages. No matter how sophisticated antivirus software is, Trojan horses, spyware, ransomware, etc. keep evolving. The same contest will happen between DeepFake/DeepNude and fake video/image detection systems. Pandora's box has been opened!

Posted on October 15, 2021

Recently Facebook launched a new research project named Ego4D in an attempt to teach AI to comprehend and interact with the world as humans do, rather than from a third-person perspective. There are two major components in Ego4D: an open dataset of egocentric (first-person perspective) video and a series of benchmarks that Facebook thinks AI systems should be capable of handling in the future. The dataset, which is the biggest of its kind, was collected by 13 universities around the world. About 3,205 hours of video footage were recorded by 855 participants living in nine different countries. Full article:

That’s my take on it: For a long time research activities have been limited by a narrow definition of data: numbers in a table. In qualitative research, we go one step further by including open-ended responses. But that is not enough! A lead research scientist at Facebook said: “For AI systems to interact with the world the way we do, the AI field needs to evolve to an entirely new paradigm of first-person perception. That means teaching AI to understand daily life activities through human eyes.” Whether there will be any self-aware AI system in the future is controversial. Nonetheless, the way Facebook is trying to train AI is also applicable to human researchers. No matter whether the data are structured or unstructured, researchers currently investigate issues or phenomena from a third-person perspective. Perhaps video-based or VR-based data could unveil insights that were overlooked in the past.

Posted on October 11, 2021

Nicolas Chaillan, the Pentagon's former Chief Software Officer (CSO), told the Financial Times that China has won the artificial intelligence battle with the US and is heading towards global dominance in key technological sectors. According to Chaillan, “We have no competing fighting chance against China in 15 to 20 years. Right now, it's already a done deal; it is already over in my opinion.” Chaillan blamed the gap on slow innovation, the reluctance of U.S. companies such as Google to work with the government on AI, and delays due to extensive ethical debates over the technology. He scoffed that U.S. cyber defense capability in some government departments was at the “kindergarten level”. Chaillan resigned from his position to protest against the culture of inaction and slow responses.

English version:

Chinese version:

That’s my take on it: It is not the first time. After AT&T's Bell Labs invented the transistor in 1947, Sony promptly bought the license and introduced its transistor-based radio while US home electronics manufacturers still stayed with bulky vacuum tubes. In the 1960s Japanese automakers produced affordable, dependable, and fuel-efficient small cars, but their US competitors experimented with their first compact car only in 1971. During the last several years China, South Korea, Sweden, and Finland have been investing in 5G infrastructure. However, at the present time, the US still lags behind international competitors in 5G. Will the Biden administration act upon the AI gap? It is never too late!

Posted on October 7, 2021

Today is the third day of the 2021 JMP Discovery Summit. I learned a lot from the plenary talk entitled “Facets of a diverse career” presented by Dr. Alyson Wilson, Associate Vice Chancellor for National Security and Special Research Initiatives and Professor of Statistics at North Carolina State University. Her work experience spans academia, industry, and government. She said that her career is a testament to John Tukey's statement: “The best thing about being a statistician is that you get to play in everyone’s backyard.” She covered many topics in the talk. I would like to highlight some of them as follows:

Many years ago she worked at the Los Alamos National Lab as a specialist in national security science, especially on weapons of mass destruction. You may wonder what role a statistician would play in this domain. Because the US signed the nuclear test-ban treaty, no comprehensive reliability tests have been conducted on US nuclear weapons since the 1990s. Instead, statisticians like her used historical and simulation data for reliability analysis. We are not 100% sure whether the missile works until we push the button!

Although Dr. Wilson was trained in traditional statistics, under her leadership NC State University established the Data Sciences Initiative for coordinating DS-related resources and work across ten departments in the university. In March 2021 NC State University launched a university-wide data science academy. The academy aims to enhance the infrastructure, expertise, and services needed to drive data-intensive research discoveries, enhance industry partnerships, and better prepare its graduates to succeed in a data-driven economy.

That’s my take on it: In the Q & A session, I asked her: “The US collects a lot of data related to the COVID19 pandemic, but our countermeasures against the pandemic are not as effective as those of some Asian countries (e.g. Taiwan and Singapore). Do you think there is a disconnect between data analytics and decision support?” Dr. Wilson replied that we need to put good science on the data, but decision-making is multi-faceted. Something obvious to statisticians and data scientists may not be obvious to decision-makers.

I agree. Collecting and analyzing data is important, but at the end of the day, the most important thing is what we do with the information. 

Posted on October 6, 2021

Recently Mo Gawdat, formerly the Chief Business Officer for Google’s moonshot organization, told Times Magazine that we are getting closer and closer to the AI singularity, the point in time at which AI becomes self-aware or acquires a superpower beyond our control. He believes that it is inevitable for AI to become as powerful as Skynet in “Terminator.” At that point, we will sit there helplessly, facing the doomsday brought forth by god-like machines. Why did he make such a bold claim? Mo Gawdat said that he had his frightening revelation while working with AI developers at Google to build robotic arms. Once a robot picked up a ball from the floor and then held it up to the researchers. Mo Gawdat perceived that the robot was showing off.

That’s my take on it: As a psychologist, I think Mo Gawdat’s concern is a result of anthropomorphism, a tendency to see human-like qualities in a non-human entity. It happens all the time; for example, we project our human attributes onto pets. Now this disposition extends to robots. However, even though an AI-enabled robot acts like a human, it doesn’t necessarily imply that the robot is really self-conscious or has the potential to become self-aware. I don’t worry about terminators or the Red Queen (in the movie “Resident Evil”), at least not in the near future!

Posted on October 5, 2021

Today is the second day of the 2021 JMP Discovery Summit. I would like to highlight what I learned from the plenary session entitled “Delicate Brute Force.” The keynote speaker was John Sall, co-founder of SAS Institute and the inventor of JMP. In the talk Sall pointed out that traditional clustering and data reduction methods are very inefficient at processing big data. To rectify the situation, Sall experimented with several new methods, such as vantage point trees, hybrid Ward, randomized singular value decomposition (SVD), multi-threaded randomized SVD, etc. Improvements were made bit by bit. For example, in a big data set containing 50,000 observations and 210 variables, it took 58 minutes to process the data in R’s fast cluster. Fast Ward in JMP cut the processing time down to 8 minutes, while the new hybrid Ward took only 22 seconds. Further improvements reduced the processing time to 6.7 seconds.
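To put these run times in perspective, the following short sketch converts the four figures quoted above into seconds and computes the speedup of each method relative to the 58-minute baseline. Only the timings come from the talk; the dictionary keys are shorthand labels of my own.

```python
# Run times reported in the talk, converted to seconds.
# The labels are informal shorthand, not official product or package names.
timings_seconds = {
    "R fast clustering (baseline)": 58 * 60,
    "JMP Fast Ward": 8 * 60,
    "JMP hybrid Ward": 22,
    "JMP hybrid Ward, further improved": 6.7,
}

baseline = timings_seconds["R fast clustering (baseline)"]

# Speedup = baseline time divided by the time of each method.
speedups = {name: baseline / seconds for name, seconds in timings_seconds.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x as fast as the baseline")
```

In other words, the step from 58 minutes to 8 minutes is roughly a 7-fold gain, whereas the final 6.7-second figure is more than 500 times faster than the starting point.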

That’s my take on it: No doubt analytical algorithms are getting better and better, but very often the adoption rate cannot keep pace with technological innovation. I foresee that in the near future standard textbooks will not include hybrid Ward or multi-threaded randomized SVD. On the contrary, I expect widespread resistance. Think about what happened to Bruno, Copernicus, and Galileo when they proposed a new cosmology. Look at how US automakers ignored Edwards Deming. Perhaps we need another form of delicate brute force for psychological persuasion.

Posted on September 29, 2021

Recently Bernard Marr, an expert on enterprise technology, published two articles in Forbes, detailing his predictions of AI trends. In both articles, Marr mentioned the trend of no- or low-code AI. As a matter of fact, not every company has the resources to hire an army of programmers to develop AI and machine learning applications. As a remedy, many of them have started considering no- or low-code and self-service solutions. For example, Microsoft and other vendors have been developing natural language processing tools for users to build queries and applications by speaking or writing in natural language (e.g. “Computer! Build a time-series analysis of revenues by product segment from 2015-2021. I want the report in 30 minutes, or else!”)

Marr, B. (2021, September 24). The 7 biggest artificial intelligence (AI) trends in 2022. Forbes. Retrieved from

Marr, B. (2021, September 27). The 5 biggest technology trends in 2022. Forbes. Retrieved from