Markdown homogenization of training data #3458
-
In my opinion, if we can convert all possible inputs to a common format with a robust algorithm, and the model's knowledge of different formats offers no benefit, then we can update the entire dataset to conform to a single standard.
-
I checked a couple of datasets; here are the results:
-
It might be worth checking the oa_leet dataset. I don't think the model responses in that dataset have escaped markdown, but I think the prompts might.
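A quick way to probe a dataset for escaped-markdown characters could look like the sketch below; it assumes records are plain dicts with `prompt` and `response` string fields (the actual oa_leet schema may differ):

```python
import re
from collections import Counter

# Matches a backslash-escaped markdown special character such as \_ or \*.
ESCAPED_MD = re.compile(r"\\[_*`\[\]#]")

def count_escaped_fields(records):
    """Count how many records contain escaped markdown, per field."""
    counts = Counter()
    for rec in records:
        for field in ("prompt", "response"):
            if ESCAPED_MD.search(rec.get(field, "")):
                counts[field] += 1
    return counts

# Tiny self-contained demo; real records would come from the dataset loader.
records = [
    {"prompt": "Solve 3 \\* 5", "response": "3 * 5 = 15"},
    {"prompt": "Rename a\\_b to c", "response": "Renamed a_b."},
]
print(count_escaped_fields(records))  # Counter({'prompt': 2})
```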
-
Some of our datasets are markdown formatted and others are plain text. Datasets using strict markdown (e.g. after conversion from HTML with a tool) escape special markdown characters like `_` -> `\_` and `*` -> `\*`. Currently the model has to learn that `3 * 5 =` means the same as `3 \* 5 =`, and some of the messages refer to a variable name as `a_b` while others would represent the same name as `a\_b`. If we want our model to always generate outputs in strictly escaped markdown, we need to escape all input text for training. This could for example be done by reading with a lax markdown parser and converting into strict markdown. A different approach would be to signal to the model in some way whether the input is (strict) markdown or not.
Do you think it is a problem at all? LLMs can deal with multiple languages and spelling errors quite well.
How would you handle it?
Do you know a Python markdown parser we could use for the conversion/preprocessing?
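For the lax-parse-and-re-render idea above, a minimal sketch could use the third-party mdformat package, which parses input with markdown-it-py (a CommonMark parser) and re-renders it in one canonical style; whether its escaping policy matches what we want for training data would need to be verified on real samples:

```python
# Rough sketch of the "lax parse -> strict re-render" preprocessing idea,
# assuming the third-party mdformat package (pip install mdformat). Its
# exact escaping behavior should be checked against real dataset samples.
import mdformat

def normalize_markdown(text: str) -> str:
    """Re-render arbitrary markdown-ish text into one consistent style."""
    return mdformat.text(text)

# Ideally both spellings of the same content converge on one representation:
print(normalize_markdown("set a\\_b to 3 \\* 5"))
print(normalize_markdown("set a_b to 3 * 5"))
```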