Vision-language models demonstrate impressive capabilities, yet constructing evaluation datasets that capture their failure cases still relies heavily on tedious manual annotation. We present a gamified, collaborative platform in which pairs of participants naturally uncover AI weaknesses through engaging gameplay. Players describe and identify images drawn from visually similar sets, while a vision-language model performs the same guessing tasks in parallel, allowing direct comparison between human and AI performance. In a preliminary study with 68 participants across 137 rounds, the game format sustained high enjoyment while producing effective data. Overall, our work shows how gamified interaction can shift AI evaluation from static, labor-intensive dataset construction to dynamic human–AI comparison, and offers design implications for crowdsourcing and evaluation platforms.