Distributed Large Language Models

ATTC.AI has developed a groundbreaking system, ATTC, that addresses the challenges of using large language models (LLMs) with over 100 billion parameters, such as BLOOM-176B and OPT-175B. While these models are available for download, running them demands high-end hardware, limiting accessibility. ATTC takes a collaborative approach, pooling the resources of multiple parties for inference and fine-tuning. This strategy outperforms traditional offloading methods and makes it practical to run BLOOM-176B on consumer GPUs for interactive LLM applications. Unlike standard inference APIs, ATTC exposes the model's hidden states, enabling custom model extensions through efficient fine-tuning methods.
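To make the "exposed hidden states" idea concrete, the sketch below trains a small task-specific head on top of hidden states returned by the distributed backbone. The backbone call is faked with random tensors, and the hidden size, head architecture, and training loop are illustrative assumptions, not ATTC's actual interface.

```python
# Minimal sketch of a client-side "model extension": the distributed backbone
# returns hidden states, and only a small head is trained locally.
# remote_backbone() is a stand-in for the servers; shapes are illustrative.
import torch
import torch.nn as nn

HIDDEN_SIZE = 14336   # hypothetical hidden dimension of a BLOOM-scale model
NUM_LABELS = 2

class TaskHead(nn.Module):
    """Trainable classifier applied to hidden states produced by remote servers."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Use the last token's hidden state as a summary of the sequence.
        return self.classifier(hidden_states[:, -1, :])

def remote_backbone(input_ids: torch.Tensor) -> torch.Tensor:
    """Stand-in for the distributed Transformer blocks; returns hidden states."""
    batch, seq_len = input_ids.shape
    return torch.randn(batch, seq_len, HIDDEN_SIZE)

head = TaskHead(HIDDEN_SIZE, NUM_LABELS)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

input_ids = torch.randint(0, 1000, (4, 16))   # dummy batch of token ids
labels = torch.randint(0, NUM_LABELS, (4,))

hidden = remote_backbone(input_ids)            # in ATTC, served by remote peers
loss = nn.functional.cross_entropy(head(hidden), labels)
loss.backward()                                # only the local head receives gradients
optimizer.step()
```

The point of the pattern is that the expensive backbone stays frozen and remote, while the client trains only a few parameters it fully controls.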

In recent years, the NLP community has recognized the potential of pretrained language models for practical tasks, leading to LLMs with over 100 billion parameters, such as BLOOM-176B. However, their widespread use is hindered by memory and computational constraints. ATTC introduces a collaborative platform where multiple users jointly perform inference and fine-tuning of large language models over the Internet. This approach offers flexibility, efficiency, and accessibility, outperforming offloading and overcoming the limitations of existing inference APIs.

ATTC caters to two main scenarios: inference and parameter-efficient adaptation to downstream tasks. For inference of billion-scale models, clients store token embeddings locally and rely on servers for Transformer block computations. ATTC demonstrates efficient inference for BLOOM-176B with optimizations such as dynamic quantization and load balancing. For training on downstream tasks, ATTC supports parameter-efficient fine-tuning methods, allowing a pretrained LLM to be rapidly switched between different uses. The system also facilitates sharing and reusing trained modules through the Hugging Face Hub.
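The client/server split for inference can be illustrated with a toy generation loop: the client keeps the embedding table and language-model head, while a hypothetical `call_remote_blocks` function stands in for streaming hidden states through the servers that hold the Transformer blocks. Names, sizes, and the greedy decoding are assumptions for illustration only.

```python
# Minimal sketch of the inference split: embeddings and the LM head live on the
# client; remote servers (faked here) run the Transformer blocks on each step.
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN_SIZE, NUM_STEPS = 1000, 512, 8

embeddings = nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE)    # stored on the client
lm_head = nn.Linear(HIDDEN_SIZE, VOCAB_SIZE)          # stored on the client

def call_remote_blocks(hidden: torch.Tensor) -> torch.Tensor:
    """Pretend to send hidden states through a chain of servers holding the blocks."""
    return hidden  # a real client would stream this over the network

@torch.no_grad()
def generate(prefix: list[int]) -> list[int]:
    tokens = list(prefix)
    for _ in range(NUM_STEPS):
        hidden = embeddings(torch.tensor([tokens[-1]]))[None, :, :]  # (1, 1, H), local
        hidden = call_remote_blocks(hidden)        # servers run the Transformer blocks
        logits = lm_head(hidden[0, -1])            # client computes next-token logits
        tokens.append(int(torch.argmax(logits)))   # greedy decoding for simplicity
    return tokens

print(generate([1, 2, 3]))
```

Only small hidden-state tensors cross the network on each step, which is what makes interactive use feasible despite the model's size.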

Performance considerations for distributed inference include computation speed, communication latency, and bandwidth. ATTC employs quantization and load balancing to enhance efficiency: dynamic blockwise quantization and 8-bit mixed-precision matrix decomposition reduce the memory and communication footprint while preserving throughput. The system also addresses the challenges of collaborating over the Internet, including server load balancing and client-side routing.
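As a rough illustration of dynamic blockwise quantization, the sketch below splits a tensor into fixed-size blocks, computes a per-block scale at run time, and stores values as int8. The block size, symmetric rounding, and padding are illustrative choices, not ATTC's exact scheme.

```python
# Minimal sketch of dynamic blockwise int8 quantization with per-block scales.
import torch

BLOCK_SIZE = 64  # illustrative block size

def quantize_blockwise(x: torch.Tensor, block_size: int = BLOCK_SIZE):
    flat = x.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.nn.functional.pad(flat, (0, pad))             # pad to a full block
    blocks = flat.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((blocks / scales).round(), -127, 127).to(torch.int8)
    return q, scales, x.shape, pad

def dequantize_blockwise(q, scales, shape, pad):
    flat = (q.float() * scales).flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)

x = torch.randn(4, 100)
q, s, shape, pad = quantize_blockwise(x)
x_hat = dequantize_blockwise(q, s, shape, pad)
print((x - x_hat).abs().max())   # small error; roughly half the bytes on the wire vs fp16
```

Computing scales per block at run time keeps outliers in one block from degrading the precision of the rest of the tensor.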

ATTC is evaluated using BLOOM-176B as well as Mixtral 8x7B in various setups, demonstrating superior performance compared to parameter offloading. Real-world distributed settings with personal servers across Europe and North America validate ATTC's efficiency in inference and fine-tuning. The system also outperforms offloading for training, offering higher throughput under different network configurations.

ATTC addresses potential imbalances in supply and demand through incentives for peers contributing server resources. Privacy concerns are acknowledged, with secure multi-party computation or privacy-preserving hardware suggested as mitigations. The introduction of incentives and security measures adds depth to ATTC's collaborative framework. Future work may explore versioning of model parameters, enabling collaborative improvement and updating of the main model.

ATTC presents a pioneering system for collaborative inference and fine-tuning of large language models, offering a user-friendly interface and a flexible API. With optimizations in compression and load balancing, ATTC aims to democratize access to LLMs, opening avenues for applications and research previously deemed impractical or cost-prohibitive. The ethical discussion emphasizes the positive impact of the research: broader access to LLMs trained on openly accessible data.