Open
Hello,
Your paper seems to cover linear layers, convs, and transformers, but not RNNs. Was this just to reduce the number of experiments, or is there a more fundamental reason behind the choice? If it was just to reduce n_experiments, how should h0 be handled? Would you recommend zeroing out h0, or does it need to be initialized using mup.init.normal?
Thank you.