Training LLMs to synthesize interdisciplinary research on climate change
ClimateGPT is a family of domain-specific large language models tailored for climate change research. The models were trained on a dataset of scientific papers containing more than 300 billion tokens and fine-tuned on a high-quality, human-generated domain-specific dataset created in close cooperation with climate scientists. The resulting 7-billion-parameter models perform on par with the 70-billion-parameter Llama-2-70B chat model on climate domain benchmarks and are optimized for hierarchical retrieval-augmented generation (RAG) to reduce hallucinations. All models were trained using renewable energy and are publicly available. Future enhancements include the addition of machine translation to increase access for non-English-speaking users.
As the volume of climate research grows, large language models can help summarize and synthesize findings and generate insights across disciplines. Such models can support urban planning, climate risk assessment, and resilience-building efforts by serving as tools for climate data analysis and decision support.