Use big LLM to better align source and enhance source corpora #622

Open · johnml1135 opened this issue Jan 8, 2025 · 14 comments

johnml1135 commented Jan 8, 2025

So, this is a crazy idea. LLMs are very good at producing English text, rewording things, and understanding context. What if we gave an LLM a source (such as the ESV) and a backtranslation and said, "make more of the backtranslation using the ESV as a source"? It could add explications, adjust for different contexts, and imitate phrase reordering. Moreover, we could also add Bible reference material to the context window, and it should be able to give the source better target-language context, mirroring what the existing backtranslations do, both scripturally and culturally.

We could take this newly generated "target aligned source" and then (optionally) give it to the translators and let them correct it to be more accurate to what it should say. After that optional step, we can feed it to an NLLB model that is trained only on backtranslation and target data, and it should then spit out pretty close target data.
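A minimal sketch of how this two-stage flow could look, assuming a generic `llm` callable (prompt in, completion out) and an `nllb_translate` function wrapping an NLLB model fine-tuned on back translation -> target pairs; both names are hypothetical stand-ins, not existing Serval APIs:

```python
# Illustrative sketch only; `llm` and `nllb_translate` are hypothetical stand-ins.

def enhance_source(llm, esv_verse: str, examples: list[tuple[str, str]]) -> str:
    """Ask the LLM to rewrite an ESV verse in the style of the back translation,
    using existing (ESV, back translation) verse pairs as few-shot examples."""
    shots = "\n\n".join(f"ESV: {esv}\nBack translation: {bt}" for esv, bt in examples)
    prompt = (
        "Rewrite the last ESV verse so that it reads like the back translations "
        "above, keeping the meaning of the ESV.\n\n"
        f"{shots}\n\nESV: {esv_verse}\nBack translation:"
    )
    return llm(prompt)

def pretranslate(llm, nllb_translate, esv_verse: str, examples) -> str:
    enhanced = enhance_source(llm, esv_verse, examples)   # the "target aligned source"
    # (Optional) a translator could review and correct `enhanced` here.
    return nllb_translate(enhanced)                       # NLLB trained on back translation -> target
```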

@ddaspit - what do you think?

johnml1135 changed the title from "Use big LLM to convert source text to backtranslation" to "Use big LLM to better align source to target context" on Jan 9, 2025
johnml1135 commented Jan 9, 2025

More ideas:

  • Use the LLM to revise the backtranslation into better, more natural English.
  • Train an NLLB200 model to do this "source to backtranslation" work instead of using an LLM.
  • Keep all of this behind the scenes for the translator - we are just using an "enhanced source", so the user will not have to change their process.
  • Concern: the LLM may hallucinate or soften Biblical passages (and add whatever bias it carries), and those changes could be subtly introduced into the Bible.
  • Add Biblical resources in the context window as a "source of truth" to ensure that the "enhanced source" has the desired added context.
  • Create multiple "enhanced sources" that are (1) middle of the road, (2) more literal, (3) more dynamic, or (4) richer in context.

How would we prompt this for the LLM? What would we tell the LLM? What would we put in the context window?

Should we fine-tune a 7 GB model each time? Fine-tune a 70 GB model (using Apollo) once on the two H100s? Or run inference off of a 400 GB model? Which would give the best results?

Reinforcement learning over a large set of translations and their back-translations?

A way to do a "spike" without LLMs
Compare the BLEU scores for four different generation types. Assume we have a full target Bible and a full backtranslation.

  1. Fine-tune the NLLB200 on ESV -> target pairs. Generate pretranslations from the ESV.
  2. Fine-tune the NLLB200 on a mixed source (ESV and back translation) -> target. Generate pretranslations from the ESV.
  3. Fine-tune the NLLB200 on back translation -> target pairs. Generate pretranslations from the back translation.
  4. Fine-tune the NLLB200 on back translation -> target pairs. Generate pretranslations from the "LLM enhanced source".

For each type, use the following amounts of training data:
A. Mark
B. 1/4 of NT books
C. 1/2 of NT books
D. All NT books
E. All books of the Bible but 1

This test would show us the upper limit (type 3) for this concept - both whether it helps with partial NTs (crossbow) and across the whole Bible.
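A rough scoring harness for that 4 x 5 grid, assuming the pretranslations and target references for each (type, data-amount) cell have already been produced and saved as line-aligned text files; the directory layout and file names here are made up:

```python
# Sketch of the scoring loop for the spike; file layout is hypothetical.
from pathlib import Path
from sacrebleu.metrics import BLEU, CHRF

TYPES = ["esv", "mixed", "backtranslation", "llm_enhanced"]           # types 1-4
PORTIONS = ["mark", "quarter_nt", "half_nt", "full_nt", "all_but_1"]  # amounts A-E

bleu, chrf = BLEU(), CHRF()

for gen_type in TYPES:
    for portion in PORTIONS:
        hyp_path = Path(f"results/{gen_type}/{portion}/pretranslations.txt")
        ref_path = Path(f"results/{gen_type}/{portion}/target_reference.txt")
        if not hyp_path.exists():
            continue
        hyps = hyp_path.read_text(encoding="utf-8").splitlines()
        refs = ref_path.read_text(encoding="utf-8").splitlines()
        b = bleu.corpus_score(hyps, [refs]).score
        c = chrf.corpus_score(hyps, [refs]).score
        print(f"{gen_type:16s} {portion:12s} BLEU={b:5.1f} chrF={c:5.1f}")
```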

We need to find a set of Bibles with back-translations to be used as references for these experiments.

johnml1135 commented

Recommendation - get at least 5 Bibles and do types 1-3 (no LLM) and use that data to direct and prioritize future work.

johnml1135 changed the title from "Use big LLM to better align source to target context" to "Use big LLM to better align source and enhance source corpora" on Jan 9, 2025
johnml1135 commented

@woodwardmw - this may be interesting to you as well. I don't know if you want to test it out.

woodwardmw commented

Yeah, very interesting. I like the idea of training on the back translation and then creating extra "back translation" to use as a source for inference. As long as it can be generated without going too far from the actual Bible text.

My feeling is that the way forward in general is to keep the current NLLB (or MADLAD) model as the main translation engine, and to focus on LLM pre- and post-processing to improve results.

LLMresearcher commented Jan 10, 2025

For "extra" back translation, ie. LLM-BT (T1->S1(LLM)) or NLLB-BT(T1->S1(NLLB), evaluation for "naturalness", consider UniEval, a human judgement evaluator (coherence, consistency, fluency, relevance)trained on T5 using BM5. Fluency is the closest to what you may call “naturalness.” We can even explore a good recipe/formula consists of a combination of these scores to represent "naturalness". Another method to evaluate "naturalness" is use a LLM prompt for custom evaluation metric.

woodwardmw commented

I have an interesting test case from the past couple of days that we could use to try this. We've been given a full Bible in a Quechua language, and we've tried to fine-tune both NLLB (via Serval) and MADLAD. We made the NLLB model available to the person who requested it, and they said it wasn't good. That was actually reassuring to me, because my MADLAD training was really struggling too. I was getting a lot of model collapse, and the best I could get was 50 CHRF on the training data. (I calculate CHRF scores on the training data for exactly this kind of scenario, so we can catch a model that isn't training well and not put it into production.)
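For reference, a minimal sketch of that training-data sanity check using sacrebleu's chrF implementation (variable names are illustrative):

```python
# Score the model's output on its own training data to catch a model that
# has collapsed or is not training well, before it goes into production.
from sacrebleu.metrics import CHRF

def training_data_chrf(train_hypotheses: list[str], train_references: list[str]) -> float:
    """train_hypotheses: model output for the training-set source verses;
    train_references: the corresponding training-set target verses."""
    return CHRF().corpus_score(train_hypotheses, [train_references]).score
```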

I suspect it's quite a dynamic translation, and obviously a very different language family, and pairing it with either an English or Spanish published source text (we've tried both) just doesn't give close enough translation pairs to be able to train a good model. So this seems to be the kind of use case where we could take a published English or Spanish Bible, and the Quechua text, and try to get an LLM to amend the English/Spanish text to be closer to the Quechua.

johnml1135 commented

@woodwardmw - is there any backtranslation for that Quechua translation? Also, I don't think the LLM could "guess" the back translation of an existing translation, even with examples - this is more for creating an "enhanced source" that wouldn't be used for training, but only for pretranslating. If there was a full backtranslation, then we might be able to get something to work.

woodwardmw commented

@johnml1135 There's no back translation that I'm aware of in this case. I agree that an LLM guessing the back translation is a difficult task, but to me it's not as difficult as training an LLM to do translation from scratch in a low-resource language. You could input the low-resource language Bible text, the high-resource language Bible text in several versions, and potentially some word alignment data. Then you either rely on in-context learning, or try to fine-tune an LLM on this task, using languages for which we have good quality back translations available as supervised target data.

This was my original idea for situations where we don't have a back translation. And I agree it's a difficult task, but I'm wondering if it's worth trying...
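A rough sketch of how such an in-context prompt might be assembled; the field names, wording, and the shape of the alignment hints are all assumptions for illustration, not an existing tool:

```python
# Illustrative prompt assembly for guessing an English back translation of a
# low-resource-language (LRL) verse from several high-resource versions.
from typing import Dict, List, Optional, Tuple

def build_bt_prompt(
    lrl_verse: str,
    english_versions: Dict[str, str],                  # e.g. {"ESV": "...", "NIV": "..."}
    alignment_hints: Optional[List[str]] = None,       # e.g. ["Judá ~ Judah"]
    few_shot: Optional[List[Tuple[str, str]]] = None,  # (LRL verse, known back translation)
) -> str:
    parts = [
        "Produce a literal English back translation of the verse in the "
        "low-resource language, following its word order and idioms. "
        "Use the English versions only as a guide to the meaning."
    ]
    for name, text in english_versions.items():
        parts.append(f"{name}: {text}")
    if alignment_hints:
        parts.append("Word alignments: " + "; ".join(alignment_hints))
    for src, bt in (few_shot or []):
        parts.append(f"Example verse: {src}\nExample back translation: {bt}")
    parts.append(f"Verse: {lrl_verse}\nBack translation:")
    return "\n\n".join(parts)
```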

ddaspit commented Jan 13, 2025

We should probably test the simpler case where we have a back translation available to see what kind of benefit we get. Once we have that information, we can determine if it is worth the effort to tackle the more difficult case of no back translation data.

woodwardmw commented

Yep, sounds good. Here's an extreme example from eng-qub:

NUM 1:26-27
ENG: From the descendants of Judah: All the men twenty years old or more who were able to serve in the army were listed by name, according to the records of their clans and families. The number from the tribe of Judah was 74,600.
QUB: Judá trïbupita 74,600.

ddaspit commented Jan 14, 2025

Wow, quite a difference.

johnml1135 commented

Here is a bit of what LLAMA 3.1 can do. Without the extra prompting for "preserving word order and idioms", it pretty much just spits out a standard translation. With enough direction, though, it appears to mirror the source more closely, though it still adds words like "and" in "[And] light came into being".

Input to LLAMA 3.1

Transliterate from Swahili to english, preserving word order and the Swahili idoms:

\v 1 Mwanzo, Mungu aliumba mbingu na dunia.
\v 2 Wakati huo dunia haikuwa na kitu chochote, haikuwa na umbo lolote. Giza ilifunika juu ya maji yenye kina kirefu. Roho wa Mungu alikuwa juu ya maji.
\p
\v 3 Basi Mungu akasema, “Iweko nuru.” Nuru ikawako.
\v 4 Mungu akaiona nuru hiyo kuwa njema, akagawanya nuru na giza.
\v 5 Basi nuru akaipa jina la Mchana, giza akaipa jina la Usiku. Kukawa jioni, kukawa asubuhi, ikawa ndio siku ya kwanza.

Output from LLAMA 3.1

Here is the transliteration from Swahili to English, preserving word order and Swahili idioms:
\v 1 In the beginning, God created heaven and earth.
\v 2 At that time, the earth had nothing, it had no form. Darkness covered the surface of the deep waters. The Spirit of God was above the waters.
\v 3 Then God said, "Let there be light." And light came into being.
\v 4 God saw that the light was good, and separated the light from the darkness.
\v 5 Then God named the light Day, and the darkness He named Night. It became evening, it became morning, and it became the first day.
Note: I've preserved the original word order and idioms from the Swahili text, while still conveying the meaning in English.

Further LLAMA 3.1 prompt

What does the Swahili word ikawako mean

Further LLAMA 3.1 response

In Swahili, "ikawako" is a verb phrase that can be broken down into several parts:
"ika" is a prefix indicating the verb is in the past tense
"wa" is a subject agreement marker indicating the subject is "light" (nuru)
"ko" is a verb root meaning "to come" or "to come into being"
So, "ikawako" can be translated to English as "it came into being" or "it came to be". In the context of Genesis 1:3, "ikawako" means that light came into existence or began to shine.
In more natural English, "ikawako" could be translated as "and there was light" or "and light appeared", which conveys the same meaning as the original Swahili text.

johnml1135 commented

Here is a document where I am trying the "take this backtranslation and make more" idea: https://docs.google.com/document/d/18fr9JJI71zh_ClbF5gAQ4gI-erJW31mj0ZerWyY8vAg/edit?usp=sharing. Much more work is needed.

woodwardmw commented

Note that LLMs often know Swahili very well, so doing this on a low-resource language will be much harder. (Although Llama's analysis of ikawako is pretty much completely wrong!)

I'm planning to experiment today with training a LRL->eng model first, then asking an LLM to combine that with a published English translation to guess an English back translation.
