Markdown homogenization of training data #3458
-
In my opinion, if we can convert all possible inputs to a common format with a robust algorithm, and the model's knowledge of different formats offers no benefit, then we can update the entire dataset to conform to a single standard.
-
I checked a couple of datasets; here are the results:
-
It might be worth checking the oa_leet dataset. I don't think the model responses in that dataset have escaped markdown, but I think the prompts might.
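A quick way to probe a dataset for escaped-markdown characters could look like the sketch below; it assumes records are plain dicts with `prompt` and `response` string fields (the actual oa_leet schema may differ):

```python
import re
from collections import Counter

# Matches a backslash-escaped markdown special character such as \_ or \*.
ESCAPED_MD = re.compile(r"\\[_*`\[\]#]")

def count_escaped_fields(records):
    """Count how many records contain escaped markdown, per field."""
    counts = Counter()
    for rec in records:
        for field in ("prompt", "response"):
            if ESCAPED_MD.search(rec.get(field, "")):
                counts[field] += 1
    return counts

# Tiny self-contained demo; real records would come from the dataset loader.
records = [
    {"prompt": "Solve 3 \\* 5", "response": "3 * 5 = 15"},
    {"prompt": "Rename a\\_b to c", "response": "Renamed a_b."},
]
print(count_escaped_fields(records))  # Counter({'prompt': 2})
```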
-
Some of our datasets are markdown formatted and others are plain text. Datasets using strict markdown (e.g. after conversion from HTML with a tool) escape special markdown characters like `_` -> `\_` and `*` -> `\*`. Currently the model has to learn that `3 * 5 =` means the same as `3 \* 5 =`, and some of the messages refer to a variable name as `a_b` while others would represent the same name as `a\_b`. If we want our model to always generate outputs in strictly escaped markdown, we need to escape all input text for training. This could for example be done by reading with a lax markdown parser and converting into strict markdown. A different approach would be to signal to the model in some way whether the input is (strict) markdown or not.
Do you think it is a problem at all? LLMs can deal with multiple languages and spelling errors quite well.
How would you handle it?
Do you know a Python markdown parser we could use for the conversion/preprocessing?
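For the lax-parse-and-re-render idea above, a minimal sketch could use the third-party mdformat package, which parses input with markdown-it-py (a CommonMark parser) and re-renders it in one canonical style; whether its escaping policy matches what we want for training data would need to be verified on real samples:

```python
# Rough sketch of the "lax parse -> strict re-render" preprocessing idea,
# assuming the third-party mdformat package (pip install mdformat). Its
# exact escaping behavior should be checked against real dataset samples.
import mdformat

def normalize_markdown(text: str) -> str:
    """Re-render arbitrary markdown-ish text into one consistent style."""
    return mdformat.text(text)

# Ideally both spellings of the same content converge on one representation:
print(normalize_markdown("set a\\_b to 3 \\* 5"))
print(normalize_markdown("set a_b to 3 * 5"))
```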