@janellecshane
I am currently training this net from scratch, i.e., without the GPT-2 base model, on Harry Potter stories only.
See @harry_botter for a neural net trained on the English books and @harry_botter_de for one trained on German Harry Potter fanfiction.

@janellecshane

The English version, trained only on the books, produces okay-ish unconditional quotes and works badly in the conditional interactive mode.

In addition, I am not sure whether the network overfitted the training text a bit. It did not memorize the text outright, but it seems to me that some of the earlier training samples were better than the later ones.
As the input is only 6 MB, it may simply be too little data.
Or I am just expecting too much.

@janellecshane
You could have a look at the examples on the bot account and tell me if you have any ideas for improving the training process.

The German version uses 395 MB of HP fanfiction as input, which may make it much better. I currently have a model with a 50,000-word dictionary that I consider converged.
I am now training a model with a 2,000-word dictionary and more layers, to see the difference between having almost all words in the dictionary and having only the most common ones.
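
The sentencepiece training step looks roughly like this in Python API form (I actually use the command-line tool; the file and model names here are only placeholders, and the vocabulary size is the knob I am varying):

    import sentencepiece as spm

    # Train a subword vocabulary on the fanfiction corpus.
    # vocab_size=50000 keeps most words whole; vocab_size=2000 forces
    # the net to assemble words from frequent subword pieces.
    spm.SentencePieceTrainer.train(
        input="hp_fanfiction_de.txt",   # placeholder corpus file
        model_prefix="hp_de_2000",      # placeholder output name
        vocab_size=2000,
        model_type="unigram",
    )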

@janellecshane
When I am satisfied with the German models (I may try another one with even fewer words), I want to train a model on 10 GB of English fanfiction.

I am not sure what resources I will need to train it, though.


@janellecshane
Here is an unconditional sample at temperature 0.7:
pastebin.com/WRxg0Hp3
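
(Temperature just rescales the net's output probabilities before sampling; a minimal sketch of the idea, not the actual sampling code:)

    import numpy as np

    def sample_token(logits, temperature=0.7):
        # Dividing the logits by a temperature < 1 sharpens the
        # distribution, so high-probability tokens are picked more often;
        # temperature 1.0 leaves the distribution unchanged.
        scaled = np.asarray(logits) / temperature
        probs = np.exp(scaled - np.max(scaled))
        probs /= probs.sum()
        return np.random.choice(len(probs), p=probs)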

The ⁇ are unknown tokens, and the vocabulary of the net does not contain uppercase J or X, so it can only produce words starting with these letters when the whole word is in the dictionary.

For the newer nets I changed the sentencepiece command line to create an alphabet that covers 100% of the text.
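
The effect shows up directly when encoding: anything outside the learned alphabet maps to the unknown piece, which sentencepiece renders as ⁇ when decoding. A small illustration (the model file name is a placeholder):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="hp_en.model")  # placeholder

    # Characters missing from the alphabet become the unknown piece
    # and come back as " ⁇ " after decoding.
    ids = sp.encode("Just ask Xenophilius.", out_type=int)
    print(sp.decode(ids))

    # Retraining with full character coverage avoids this:
    # spm.SentencePieceTrainer.train(..., character_coverage=1.0)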

@janellecshane
For some reason it always seems to start quotes with an unknown word like "⁇ ust".

The book model is configured like the 345M model (16 heads, 24 layers), which is probably much too large.
The current German model is configured like the 117M model (12 heads, 12 layers). The new German model has 16 heads and 16 layers, as I wanted to compensate for the need to build words from the syllables in the dictionary.
In addition, these parameters result in a model of similar size to the previous one.
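
Side by side, the configurations look roughly like this (the 117M and 345M values are from the released hparams.json files; the embedding size of the new German model is an assumption here, kept at the 117M value):

    # Rough comparison of the configurations mentioned above.
    gpt2_117M  = {"n_layer": 12, "n_head": 12, "n_embd": 768,  "n_ctx": 1024}
    gpt2_345M  = {"n_layer": 24, "n_head": 16, "n_embd": 1024, "n_ctx": 1024}
    new_german = {"n_layer": 16, "n_head": 16, "n_embd": 768,  "n_ctx": 1024}  # n_embd assumed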

@janellecshane
Here are some of the earlier training samples. I think they make more sense than the sample in the pastebin.

chaos.social/@allo/10228187564

@allo Interesting! I don't have much experience with training neural nets as big as GPT-2 from scratch. The coherence does seem to be less than I'd expect from transfer learning from the pretrained GPT-2. If you use gpt-2-simple for transfer learning, you'd be able to do your training for free, so that's always an option. It's interesting to see the results of training from scratch though.

@janellecshane Maybe I should fine-tune the GPT-2 models for the English texts.

I thought I wanted to start from scratch so that Harry does not fly spaceships. On the other hand, the power of GPT-2 probably comes from its much larger text corpus.

For the German model it does not make sense to fine-tune GPT-2, because it seems to contain very little German text.

I am using the original GPT-2 training code from the rkfg fork, following instructions from various posts in GitHub issues.
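
If I try finetuning instead, gpt-2-simple should make that only a few lines; a rough sketch (file name, model name, and step count are placeholders, not my actual setup):

    import gpt_2_simple as gpt2

    # Minimal finetuning sketch on top of the pretrained 117M model.
    gpt2.download_gpt2(model_name="117M")
    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess, "harry_potter_en.txt", model_name="117M", steps=1000)
    gpt2.generate(sess, temperature=0.7)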

@janellecshane
Playing with such a network is a fun experience, and some texts are quite interesting and funny.

For example, I really like it when it starts to mix metaphors, as in "Harrys heart flew into a table".
Such results show which associations the net is able to pick up and where its understanding of the context breaks down.

@janellecshane
The downside of the large models is that I need to do CPU training, because they do not fit on my graphics card. Otherwise I might have trained the models I consider converged a bit longer, and maybe they would reach a lower loss and still improve quite a bit.
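
A back-of-the-envelope parameter count shows why (a rough approximation that ignores biases and layer norms):

    def approx_gpt2_params(n_layer, n_embd, n_vocab=50257, n_ctx=1024):
        # Each transformer block has roughly 12 * n_embd^2 weights
        # (attention projections plus the MLP); token and position
        # embeddings add (n_vocab + n_ctx) * n_embd on top.
        return 12 * n_layer * n_embd * n_embd + (n_vocab + n_ctx) * n_embd

    print(approx_gpt2_params(12, 768) / 1e6)    # ~124M (117M-style config)
    print(approx_gpt2_params(24, 1024) / 1e6)   # ~355M (345M-style config)
    # With Adam keeping two extra values per weight, the larger config
    # quickly outgrows a consumer GPU's memory during training.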

@allo it might be worth trying the finetuning even for German. I found finetuning worked pretty well for crochet. Especially since GPU finetuning is quick and free via gpt-2-simple on colab.
