Which language is the hardest to train an ai on? so suppose we make a website out of that language, would the ai not understand anything?

الله@lemmy.world · 1 month ago

Which language is the hardest to train an ai on? so suppose we make a website out of that language, would the ai not understand anything?

howrar@lemmy.ca · 1 month ago

The difficulty in training an AI is dependent on data availability. So this is just a question of choosing a language that has the least amount of writing. You can trivially choose any language that doesn’t have a writing system at all and invent a writing system for it. But then you’d also run into the problem of learning the language yourself.

الله@lemmy.world · 1 month ago

then that’s a plus isn’t it? prompters don’t like studying alot, so they will have to study to get somewhere

Axolotl@feddit.it · 1 month ago

Prompters don’t need to know that language to have AI scrap things

Axolotl@feddit.it · 1 month ago

deleted by creator

disregardable@lemmy.zip · 1 month ago

The AI doesn’t understand anything in any language. It’s just putting words in order by their association.

Zak@lemmy.world · 1 month ago

LLMs don’t understand things. They repeatedly predict the next token given previous tokens.

I don’t think something without predictable patterns is likely to work as a language. A very complex grammar probably means the LLM will make grammatical errors more often, but that’s probably the most that can be done to make a language hard for LLMs. Other comments mention languages without much training data, but I don’t think that’s what you’re asking.

edit-2 1 month ago

AI probably can’t underatand 台山話 Taishanese xD

Literally no study materials, no tv or movies, no way for AI to scrap data.

Problem is, I’m the only speaker in the entire fediverse, so not exactly useful. (And I barely speak it, mostly only hear, not speak)

Like, the only way to learn it is if you were born into it.

mysticpickle@lemmy.ca · edit-2 1 month ago

Probably more people that speak fluent Klingon here than that one :o

Axolotl@feddit.it · 1 month ago

Probably a dialect or a dead language could be, since they have little to none written texts to train AI on

tal@lemmy.today · 1 month ago

https://en.wikipedia.org/wiki/Linear_A

Linear A is a writing system that was used by the Minoans of Crete from 1800 BC to 1450 BC. Linear A was the primary script used in palace and religious writings of the Minoan civilization. It evolved into Linear B, which was used by the Mycenaeans to write an early form of Greek. It was discovered by the archaeologist Sir Arthur Evans in 1900. No texts in Linear A have yet been deciphered.

SillyGooseQuacked@lemmy.world · edit-2 1 month ago

Not to be a downer if you’re anti-AI, but you should know a functional, small, 1B parameter model only needs ~85GB of data if the training data set is high quality (the four-year old chinchilla paper set out the 20 to 1 optimization rule for ai training, so it may require even less today).

That’s basically nothing. If a language has over ~130,000 books or an equivalent amount of writing (1,500 books is about a gig in plain ascii), a functional text-based ai model could be built that uses it.

My understanding is there are next to zero languages in existence today that do not have this amount of quality text. Certainly, spoken languages that have no written word are not accessible like this, but most endangered languages with few speakers that have a historical written word could in theory have ai models built that effectively communicate in those languages.

To give you an idea of what this means for less-written languages and a website revolving around them, look at worldcat (which does NOT have anywhere near most of the written text available entirely online for each language listed, it’s JUST a resource for libraries): https://www.oclc.org/en/worldcat/inside-worldcat.html

But this gets even harder for a theoretical website used to avoid an LLM that can read it, because this is all assuming creating an ai model for language from scratch. That is not necessary today because of transfer learning.

Major LLM models with over 100 diverse major languages can be fined-tuned on an insignificant amount of data (even 1GB could work in theory) and produce results like those of a 1B parameter model trained solely on one language. This is because the multi-lingual models developed cross-cultural vector-based understandings of Grammer.

In truth, the only remaining major barriers for any language not understood by fine-tuning an ai model today are both (1) digitization and (2) character recognition. Digitization will vanish as an issue for basically every written language that has a unique script within the next ten years. Character recognition (and more specifically, the economic viability of building the character recognition) will be the only remaining issue.

Ironically, in creating such a website, you will be creating more data for a future potential ai model to use in training. Especially if whatever you write makes the language of greater economic importance.

droning_in_my_ears@lemmy.world · 1 month ago

There’s a ton of spoken only languages that have no writing system and under documented rules

theneverfox@pawb.social · 1 month ago

You can literally train them on animal languages. They’re language models, their primary attribute is modeling a language

And they are very good at it

Zwuzelmaus@feddit.org · edit-2 1 month ago

All languages that are spoken, but not written.

Current AI models are trained on written stuff only.

CombatWombatEsq@lemmy.world · edit-2 8 days ago

deleted by creator

m_‮f@discuss.online · 1 month ago

I posted about this a bit ago, but that likely wouldn’t work:

In a First, AI Models Analyze Language As Well As a Human Expert

https://discuss.online/post/30279537

Any language you picked or invented would just add to the training data and help the AI out.

Demonmariner@lemmy.world · 1 month ago

Maybe Toki Pona?