Large language models have been fine-tuned to generate poetry via supervised learning on datasets of relevant examples. However, these models often produce low-quality output that does not respect the structure expected of a specific poem type. For instance, generated haikus may contain toxic language, be off-topic or incoherent, and fail to follow the typical 5-7-5 syllable structure. In this work, we investigate whether an objective function that quantifies haiku quality can be learned from human feedback, and whether this reward function can then be used to improve haiku generation through reinforcement learning.
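As an illustration of the kind of reward signal involved (not the implementation used in this work), the sketch below combines a hypothetical learned preference score with a crude 5-7-5 structure check; the function names, the `preference_score` input, and the syllable heuristic are all assumptions made for the example.

```python
# Illustrative sketch only: a hypothetical scalar reward for a generated haiku,
# combining a learned human-preference score with a 5-7-5 structure bonus.
import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+", re.IGNORECASE)

def count_syllables(word: str) -> int:
    """Rough heuristic: count contiguous vowel groups, adjusting for a silent final 'e'."""
    n = len(VOWEL_GROUPS.findall(word))
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(1, n)

def follows_575(haiku: str) -> bool:
    """Check whether a three-line haiku matches the 5-7-5 syllable pattern."""
    lines = [line for line in haiku.strip().splitlines() if line.strip()]
    if len(lines) != 3:
        return False
    counts = [sum(count_syllables(w) for w in line.split()) for line in lines]
    return counts == [5, 7, 5]

def haiku_reward(haiku: str, preference_score: float) -> float:
    """Hypothetical reward: learned preference score plus a structure bonus/penalty."""
    structure_bonus = 1.0 if follows_575(haiku) else -1.0
    return preference_score + structure_bonus

# Example usage; the preference score stands in for the output of a reward
# model trained on human feedback.
example = "An old silent pond\nA frog jumps into the pond\nSplash! Silence again"
print(haiku_reward(example, preference_score=0.7))
```

In a reinforcement learning setup, a scalar reward of this form would be assigned to each sampled haiku and used to update the generator's policy.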