Artificial intelligence (AI) is finding applications across many fields, and in the future robots will be able to handle many tasks for humans. Dr. Hwang Jenq-neng, professor of electrical and computer engineering (ECE) and director of the Information Processing Lab (IPL) at the University of Washington (UW) in Seattle, USA, has taken part in many AI projects: electronic monitoring of fisheries (NOAA, USA), autonomous driving (Cisco, USA), smart transportation and smart cities (ETRI, Korea), automated golf swing analysis (Sportsbox AI, USA), multi-camera multi-person tracking for long-term care center surveillance (Quanta Computer, Taiwan), and wafer defect classification (Vanguard International Semiconductor, Taiwan), among others.
For a robot to perceive the world as a human does, developers must spend a great deal of time collecting data and iterating through the AI training process. Hwang said one project he was nearly involved with aimed to enhance the capabilities of the US Customs and Border Protection (CBP).
More specifically, AI could be applied to detect whether travelers are lying during face-to-face interviews, like a polygraph but without attaching any devices to travelers, and could potentially help CBP officials work more efficiently through camera and microphone scanning. However, the ground-truth data needed for training would have been hard to categorize and verify. The question was: which data points were truthful, and which were lies?
It is difficult for deep learning-based AI to develop human common sense, Hwang said. Humans learn from life experience; some can sense the slightest changes in the environment or in others' body language and attitudes, and sometimes we rely on intuition to make decisions. Robots, even those running a well-trained convolutional neural network on large inputs, do not "perceive" as humans do; they simply reproduce the logical patterns learned during training. For instance, Hwang added, "it is much easier for AI to win a Go game than to perceive the emotions or thoughts of the person in front of it."
Open-set long-tailed recognition (OLTR)
Real-world visual object recognition, one of the most fundamental and substantial problems in computer vision, spans fields such as species identification, medical image analysis, human face recognition, and scene classification in autonomous driving. In real-world applications, however, off-the-shelf deep learning-based recognition methods are mostly biased toward the sample-rich majority classes seen in the training set, with limited ability to classify the sample-poor minority classes, let alone novel classes of objects never seen in the training data. In other words, real object samples are unevenly distributed and the set of object classes is always open-ended, a setting known as open-set long-tailed recognition (OLTR).
Hwang and his team proposed a one-stage LTR scheme, "ally complementary experts (ACE)," in which convolutional neural network (CNN)-based experts are trained on diverse but overlapping imbalanced subsets, so that each benefits from specializing in the classes it dominates; ACE achieved state-of-the-art LTR performance on all benchmark datasets. His team further proposed a metric-learning framework, "localizing unfamiliarity near acquaintance (LUNA)," to quantitatively measure the level of novelty. LUNA is based on the local density of deep CNN features, yielding a very effective open-set recognition (OSR) solution.
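LUNA's exact formulation is not reproduced here, but the general idea behind density-based novelty scoring can be sketched with a simplified k-nearest-neighbor heuristic on synthetic, hypothetical feature vectors (this is an illustration of the concept, not LUNA's actual metric):

```python
import numpy as np

def novelty_score(query, train_feats, k=5):
    """Average distance from a query embedding to its k nearest
    training features; a higher score means the query sits in a
    low-density region far from familiar examples.
    A generic local-density heuristic, not LUNA's exact formula."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    return np.sort(dists)[:k].mean()

rng = np.random.default_rng(0)
known = rng.normal(0.0, 0.1, size=(100, 8))        # features of seen classes
familiar = novelty_score(rng.normal(0.0, 0.1, size=8), known)
novel = novelty_score(np.full(8, 3.0), known)      # far from the cluster
```

A query drawn from the same cluster as the training features scores low, while one far outside it scores high, which is the signal an open-set recognizer uses to flag never-seen classes.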
Domain and label shifts
One project he has worked on for more than 10 years with the National Oceanic and Atmospheric Administration (NOAA) of the US government helps fishermen automatically count, measure, and identify more than 100 species of fish caught at sea, using a monocular camera deployed on the fishing boat. Hwang said domain and label shifts, i.e., changes in the appearance and distribution of captured fish, occur once the boat moves to different waters. Moreover, there may be fish species never seen in other waters, so an AI model trained on the old datasets would not apply or function as well – a real-world OLTR task.
Hwang's team is now working diligently to systematically adapt a trained AI model to recognize new fish species under such domain and label shifts.
For an AI project to succeed, AI companies or labs must invest extensive labor in collecting training data and generating annotations for supervised learning, unless less effective semi-supervised or self-supervised learning techniques are adopted. The process is time-consuming and expensive up front, Hwang said, but "in the end, the efficiency of the trained AI model will increase significantly once the labeling is completed."
Hwang said transformer neural networks – originally introduced by Google Brain in 2017 for natural language processing and built on a highly parallel self-attention mechanism – can generate very effective and informative embedding features for a sequence of text words (a sentence). This parallelization allows transformers to be trained on larger datasets, leading to pretrained transformer systems that can be further fine-tuned with much smaller, more focused training data for a specific task, as evidenced by the recent success of GPT-3 in producing human-like text from an initial prompt, and of ChatGPT, claimed to be the best AI chatbot ever released for human-like interactions.
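The self-attention mechanism at the heart of transformers can be sketched in a few lines: every token attends to every other token through a softmax-weighted mix of value vectors, and because each row is computed independently, the whole step parallelizes well. This is a generic single-head illustration with random weights, not any production implementation:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model) token embeddings; returns (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over each row
    return weights @ V                                # weighted mix of values

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))                           # 4 tokens, 6-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(6, 6)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

Each output row is a context-aware blend of the whole sequence, which is what makes the resulting embeddings so informative for language tasks.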
With transformers extended to visual data such as images and videos, it is now possible to train two separate transformers simultaneously, one for images and one for the corresponding descriptive texts, on hundreds of millions of image-text pairs, aligning the generated image embedding features with the corresponding text embedding features through a so-called contrastive learning strategy. The resulting system can efficiently learn visual concepts from natural language supervision and can be applied to any visual classification benchmark without being further fine-tuned on the new benchmark's training data, i.e., zero-shot learning.
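As a rough illustration of the zero-shot inference step this alignment enables, classification reduces to picking the class whose text embedding is most similar (by cosine similarity) to the image embedding. The toy vectors below are hypothetical stand-ins for real encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Return the index of the class whose text embedding has the
    highest cosine similarity to the image embedding - the inference
    step of a CLIP-style zero-shot classifier (toy sketch)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

# Hypothetical aligned embeddings for three class prompts, e.g.
# "a photo of a cat", "a photo of a dog", "a photo of a fish".
texts = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
image = np.array([0.1, 0.9, 0.2])   # image embedding nearest the second prompt
pred = zero_shot_classify(image, texts)
```

Because the class set is defined only by the text prompts, new categories can be added at inference time without retraining, which is what "zero-shot" means here.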
The challenges of AI development in Taiwan
With more sensors being mass-deployed on all kinds of devices - from smartphones, home appliances, factory and laboratory equipment, and electric vehicles to all manner of IoT electronics and public infrastructure - a massive amount of data is being collected from almost every corner of the world.
Hwang said data acquisition, network connectivity, and computing power are the three pillars of a successful AI project, noting that AI is making rapid progress in China and the US, both of which view the technology as critical to generating national economic growth and increasing competitiveness.
As the world progresses from 4G to 5G, and soon to 6G, connectivity in developed countries is not a problem, but networking infrastructure is still lacking in low-income countries. Meanwhile, AI developers often find themselves without access to sufficient high-performance computing power. For example, although Taiwan's government has provided 1,000 V100 GPUs for research purposes and applications, demand for high-performance GPUs remains significantly unmet.
Moreover, for complex projects like high-level autonomous driving with multiple IoT sensor-based solutions, data collection and labeling require intensive labor, time, and investment to keep model training effective and ongoing.
Taiwan is a small market with a limited user base, and for decades the country has put much of its emphasis on manufacturing hardware. Semiconductor foundries, electronics manufacturing, and other hardware and IoT-sensor assembly have been quite successful – and this strength should be a benefit rather than an impediment to the growth of the software and AI industry.
With its enhanced manufacturing capabilities, Hwang believes, system integration of hardware and AI software is worth investing time and money in - from IoT sensor data acquisition and 5G/6G mobile edge networking to massive infrastructure-based CPU/GPU support.
Dr. Hwang Jenq-neng, ECE Professor at UW; credit: DIGITIMES Asia