BLOG POST
by
Poomjai Nacaskul, Ph.D.
Chulalongkorn School of Integrated Innovation (CSII),
Poomjai.N@chula.ac.th
-- 2023.12.06 --
Intelligence Is Contextually Defined Pattern Recognition
I should like to believe there is some kind of provident or serendipitous reason why I am a faculty member in “Applied Digital Intelligence”. You see, I have been pondering the question of what defines/constitutes “Intelligence”, on and off, for the better part of twenty or thirty years!
To wit, I recall (I think it was back in 1994) one of my earlier conversations with my Ph.D. supervisor[1]. Quite offhandedly I quipped that one day there should be an academic department of “Intelligence Engineering”, dedicated to the science and engineering of “Intelligence Engines”. In the back of my mind, I was thinking Chulalongkorn University, envisioning undergrad-level courses in Artificial Neural Networks (ANN) and cognitive models in Buddhism, i.e. courses from which the Five Aggregates (Skandhas), namely Form (Rūpa), Feeling/Sensation (Vedanā), Cognition/Perception (Saññā), Mental Formation/Volition (Saṅkhāra), and Consciousness (Viññāṇa), would translate to five areas of critical development w.r.t. AI-driven Robotics.
Also, I seem to remember learning (probably many years prior to that) that one sure sign of intelligence in a person is the mental ability to see analogies between forms/functions across vastly different contexts and applications, e.g. that a water canteen is analogous to a battery, that climbing a mountain is analogous to maximising a mathematical objective function, and so on. This would be filed under my “Intelligence is Pattern Recognition”[2] thinking.
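To make that last analogy concrete, here is a minimal Python sketch (with a made-up, single-peaked objective function) of “climbing the mountain”, i.e. hill climbing towards the maximum of a mathematical objective:

import random

def objective(x):
    # A made-up "mountain": a smooth bump whose peak sits at x = 2.
    return -(x - 2.0) ** 2 + 5.0

def hill_climb(x, step=0.1, iterations=1000):
    # Greedy hill climbing: take a small random step and keep it only if we gained altitude.
    for _ in range(iterations):
        candidate = x + random.uniform(-step, step)
        if objective(candidate) > objective(x):
            x = candidate
    return x

print(hill_climb(x=0.0))  # converges towards the peak near x = 2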
Likewise, I also recall a piece of research that attempted to test intelligence across cultures. The details have since been lost on me, but here is the gist. Subjects were tasked with grouping items together, items like hoes, plows, seeds of this plant, seeds of that plant, etc. The supposedly intelligent answer was to group hoes with plows and this plant’s seeds with that plant’s seeds, as hoes and plows were both farm implements, while the seeds were, well, seeds. Somehow, non-industrialised communities in Africa couldn’t get the “right” answer. They would group hoes with one plant’s seeds, plows with the other plant’s seeds, and so on. This seemed to suggest an inability to group items based on abstract notions of form/function, hence inferior mental aptitude. Upon further investigation, however, it was revealed that hoes were needed to plant one type of seed in spring, plows were needed to plant the other type of seed in fall, and keeping the right type of farm implement with the right type of seed for the season was the more intelligent solution in this context. This would be filed under my “Intelligence (testing) is Contextually Defined (hence biased)”[3] thinking.
All of this is to set the stage. You see, I firmly believe/subscribe to the notion that measurable “Intelligence”, and by extension “Artificial Intelligence”, operates on both pattern recognition and context orientation challenges: the former frontier is more concerned with an intelligence engine’s ability to generalise, the latter more with its ability to specialise. I could further clarify/elaborate on this, but you get the gist. So let’s plow ahead.
My take on the LVM vs. LLM Conundrum
As our school’s executive director, Prof. Worsak Kanok-Nukulchai, shared with us this morning, Prof. Andrew Ng, in yesterday’s LinkedIn post [www.linkedin.com/posts/andrewyng_the-lvm-large-vision-model-revolution-is-activity-7137483177714995200-nxlM], opined:
“The LVM (large vision model) revolution is coming a little after the LLM (large language model) one, and will transform how we process images. But there’s an important difference between LLMs and LVMs:
- Internet text is similar enough to proprietary text documents that an LLM trained on internet text can understand your documents.
- But internet images – such as Instagram pictures – contain a lot of pictures of people, pets, landmarks, and everyday objects. Many practical vision applications (manufacturing, aerial imagery, life sciences, etc.) use images that look nothing like most internet images. So a generic LVM trained on internet images fares poorly at picking out the most salient features of images in many specialized domains.
That’s why domain specific LVMs – ones adapted to images of a particular domain (such as semiconductor manufacturing, or pathology) – do much better. …”
LVM (Large Vision Model) is comparable to LLM (Large Language Model) in that both are emergent technologies only made possible by the confluence of ginormous computational power and a stupendously Large data universe. (By the way, this is the first time I have seen the LVM acronym used, but I see the analogy.) As for LVM vs. LLM, I agree with Prof. Ng almost entirely, of course. But let me offer some subtle takes on the issue. Hopefully that will give readers added insight into how LLM and LVM respectively do their magic.
If you were to peek “under the hood”, both LVM and LLM can be seen to comprise two parts:
- the “primitive” part, by which I mean the portion of the ANN architecture performing generalised abstract recognition on the “primitive” vocabulary of its respective application: convolutional filtering of graphic primitives (lines, curves, shapes, etc.) in the case of LVM and word embedding (representing words in human languages as numerical vectors) in the case of LLM (a toy sketch of both kinds of primitive follows this list), and
- the “constructive” part, by which I mean the portion of the ANN architecture that performs specialised synthesis of the resulting abstract primitives into labelled object concepts: coherent pictorial composition in the case of LVM and context-sensitive word association in the case of LLM.
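To make the distinction tangible, here is a toy numpy sketch of the two kinds of “primitive”; the image, the edge filter, and the three-dimensional word embeddings are all invented purely for illustration:

import numpy as np

# LVM-style primitive: a convolutional filter that responds to a vertical edge.
image = np.zeros((5, 5))
image[:, 2:] = 1.0                        # toy image: dark left half, bright right half
edge_filter = np.array([-1.0, 0.0, 1.0])  # classic 1x3 edge-detection kernel

row = image[2]                            # slide the filter across one row of the image
responses = [float(np.dot(row[i:i+3], edge_filter)) for i in range(len(row) - 2)]
print("edge responses:", responses)       # non-zero only where the brightness changes

# LLM-style primitive: words represented as numerical vectors (word embeddings).
embeddings = {                            # toy 3-dimensional embeddings
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print("king~queen:", cosine(embeddings["king"], embeddings["queen"]))  # high similarity
print("king~apple:", cosine(embeddings["king"], embeddings["apple"]))  # low similarity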
And therein lies the rub, so to speak.
With LLM, success relies very much on its ability to perform context orientation. As such, so long as the “primitive” portion of the LLM is successfully trained on generalised human speech and prose gathered from the Internet, the “constructive” portion of the LLM can easily cover a lot of ground without having to retrain ANN weights/parameters on specialised data.
With LVM, however, a lot more emphasis is placed on its ability to perform pattern recognition. But there is only so much generalised information the “primitive” portion of an LVM can learn from example lines, curves, shapes, etc. harvested from the Internet, hence the “constructive” portion of the LVM needs a lot of additional specialised data to meaningfully synthesise concepts. In short, the Large-ness of the Internet data universe does not endlessly improve the performance of LVM (in the way it does endlessly improve the performance of LLM) because, well, lines are lines, curves are curves, and shapes are shapes!
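As a purely illustrative sketch of what such domain-specific adaptation might look like in practice: keep the generically pre-trained “primitive” layers frozen and retrain only the “constructive” head on the specialised images. The code below assumes a recent PyTorch/torchvision; the four defect classes and the domain_loader are hypothetical.

import torch
import torch.nn as nn
from torchvision import models

# Start from a backbone pre-trained on generic internet-style images (ImageNet).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the "primitive" part: the convolutional layers that respond to lines, curves, shapes, etc.
for param in model.parameters():
    param.requires_grad = False

# Replace the "constructive" part: a fresh classification head for, say, four defect classes
# in a hypothetical semiconductor-inspection dataset.
model.fc = nn.Linear(model.fc.in_features, 4)

# Only the new head gets trained on the specialised domain images.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training-loop sketch (domain_loader is a hypothetical DataLoader over the specialised images):
# for images, labels in domain_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()

The point of the sketch is simply that the “primitive” layers are reused as-is, while the “constructive” part must be relearned from specialised data.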
The Way Forward?
Here is where I play a “futurist”.
Thus I surmise that the way out of LVM’s performance plateau requires no less than a game-changer: an algorithmic innovation in Machine Learning that, to put it simply, learns “context-sensitive graphic primitives” (lines, curves, shapes, etc.).
To those who know the mathematics behind Convolutional Neural Networks (CNN), such a methodology would appear impossible, even downright contradictory. For you see, one of the reasons CNN has been tremendously successful at visual object recognition tasks is that it operates with a high degree of invariance, i.e. an ellipse is an ellipse no matter where it appears on the screen (location invariance), no matter the size, the elongation, or the rotation applied to the ellipse (extension invariance), and a cat is a cat no matter the lighting (illumination invariance), no matter which direction it is facing (pose invariance), and so on.
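A minimal numpy/scipy illustration of the location-invariance point: sliding the same filter over an image (as a CNN’s convolutional layer does) and taking the peak response yields the same answer wherever the pattern happens to sit.

import numpy as np
from scipy.signal import correlate2d

# A 3x3 filter that responds to a bright vertical bar.
bar_filter = np.array([[-1.0, 2.0, -1.0],
                       [-1.0, 2.0, -1.0],
                       [-1.0, 2.0, -1.0]])

def peak_response(image):
    # "valid" correlation = sliding the filter across the image, CNN-style, then pooling the maximum.
    return correlate2d(image, bar_filter, mode="valid").max()

# The same vertical bar, placed at two different locations in a 7x7 image.
left = np.zeros((7, 7))
left[:, 1] = 1.0
right = np.zeros((7, 7))
right[:, 5] = 1.0

# Identical peak responses: the bar is "seen" regardless of where it appears.
print(peak_response(left), peak_response(right))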
But of course, presentation invariance is as diametrically opposite to context sensitivity as it gets!
My hunch (and until some clever AI lab can implement the methodology, this is just a hunch) is that the solution will involve some combination of Graph Neural Network (GNN) and Vector Embedding tricks: the former for inducing the contextual relationships amongst sample graphic primitives, the latter for enabling coherent composition, not at the pictorial level, but at the level of the graphic primitives themselves. In any event, let’s wait a few years for the likes of OpenAI and DeepMind to invent this “meta-convolutional GNN-based graphic-primitive embedding” algorithm. Watch this space, all.
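Since no such algorithm exists yet, the following is nothing more than a toy sketch of the two ingredients named above: a single GNN-style message-passing step over a handful of “graphic primitive” embeddings, so that each primitive’s representation comes to depend on its contextual neighbours. Every number, dimension, and adjacency here is invented purely for illustration.

import numpy as np

# Four hypothetical "graphic primitives" (say, two lines, a curve, an ellipse),
# each represented by a 3-dimensional embedding vector.
primitive_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Adjacency matrix encoding which primitives appear "in context" with which.
adjacency = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

def message_passing_step(embeddings, adjacency, weights):
    # One GNN-style update: each primitive's new embedding is a transformed average of its own
    # embedding and those of its neighbours, so the representation becomes context-sensitive
    # rather than fixed the way a conventional convolutional filter response is.
    degree = adjacency.sum(axis=1, keepdims=True) + 1.0           # +1 for the implicit self-loop
    aggregated = (embeddings + adjacency @ embeddings) / degree   # mean over self + neighbours
    return np.tanh(aggregated @ weights)                          # simple stand-in for a learned transform

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3)) * 0.5   # stand-in for learned parameters
print(message_passing_step(primitive_embeddings, adjacency, W))

The same primitive thus ends up with a different embedding in a different neighbourhood, which is precisely the property a “context-sensitive graphic primitive” would need.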
[1] The late Prof. Antonia J. Jones, whom I courted to be my Ph.D. supervisor at Imperial College on account of her preeminence in the then-emergent field of Genetic Algorithms (GA), and for whom I left the Centre for Quantitative Finance for the Department of Computing, where she was a reader at the time.
[2] A view which no doubt largely formed as a direct result of having attended Artificial Neural Networks (ANN) lectures by the late Prof. Yoh-Han Pao, author of the ground-breaking book Adaptive Pattern Recognition and Neural Networks (pub. 1989).
[3] Now, I’m sure I need not remind educated and well-informed readers of the controversy surrounding the issue of endemic cultural bias in Aptitude Tests (SAT/ACT/GRE/GMAT/etc.) and, perhaps to a lesser extent, IQ Tests, both of which set out to test individuals’ “innate mental ability” regardless of the test takers’ cultural/linguistic history and educational background/context. So yeah, context matters!
SEE ALSO: Previous Blogpost: Does Prototyping (as Part of a Hackathon Pitch) Kill Creativity?