Well, diffusers are trained unsupervised on raw pictures. I don't know how multi-modal LLMs are trained on images, but yes, obviously they consume media other than text. I don't think, though I'd be happy to be corrected, that models glean much of their "knowledge" from non-textual training data.
Please tell me more. When I ask an LLM a question, and get a text response, can that response incorporate non-textual information from visual training data?
> Fridman, the podcast’s host, defines AGI as an AI system that’s able to “essentially do your job,” as in start, grow, and run a successful tech company worth more than $1 billion. He then asks Huang when he believes AGI will be real — asking if it’s, say, five, 10, 15, or 20 years away — and Huang responds, “I think it’s now. I think we’ve achieved AGI.”
> But Huang then seemed to slightly walk back his earlier claims, saying, “A lot of people use it for a couple of months and it kind of dies away. Now, the odds of 100,000 of those agents building Nvidia is zero percent.”
That may not be the case here, and certainly isn't the assumption we can make more generally.
We regularly see regressions in platform security.