Guide to Using Pre-trained Large Language Models of Source Code
This guide provides a comprehensive overview of using pre-trained large language models (LLMs) for source code generation and analysis, focusing on the models and resources available through the Code-LMs project on GitHub.
Setup and Installation
The Code-LMs project builds on the GPT-NeoX toolkit. To get started, download a pre-trained checkpoint from the Zenodo record linked in the repository. Checkpoints range in size up to roughly 6GB and need a GPU with a comparable amount of memory; CPU execution is not recommended.
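As a minimal sketch, downloading and unpacking a checkpoint could look like the following; the Zenodo record ID and archive name below are placeholders, so copy the actual link for the model size you want from the repository's README:

```python
# Download and unpack a PolyCoder checkpoint (placeholder URL -- substitute
# the real Zenodo link from the repository's README).
import tarfile
import urllib.request

CHECKPOINT_URL = "https://zenodo.org/record/<record-id>/files/2-7B.tar"  # placeholder
ARCHIVE = "2-7B.tar"

urllib.request.urlretrieve(CHECKPOINT_URL, ARCHIVE)  # ~6GB for the largest model

with tarfile.open(ARCHIVE) as tar:
    tar.extractall("checkpoints/")  # the NeoX scripts expect a checkpoint directory
```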
You can either build the project from source, using the project's fork of the GPT-NeoX repository, or use a pre-built Docker image for a simpler setup.
From Source
The project's GitHub repository contains a fork of the GPT-NeoX repository with modifications to handle tabs and newlines in tokenization. Building from source allows for greater customization but requires familiarity with the GPT-NeoX toolkit.
Via Docker
A Docker image is available on DockerHub, simplifying the setup process. This image can be used with the downloaded checkpoint files.
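Below is a hedged sketch of pulling and starting the image, driven from Python for consistency with the other examples in this guide (a plain shell session works just as well). The image name and the container-side mount path are assumptions; verify both against the repository's README:

```python
# Pull the pre-built image and start an interactive container with GPU access,
# mounting the downloaded checkpoints into it. The image name and mount point
# are assumptions -- check them against the repository's README.
import os
import subprocess

IMAGE = "vhellendoorn/code-lms-neox:base"  # assumed DockerHub name

subprocess.run(["docker", "pull", IMAGE], check=True)
subprocess.run(
    ["docker", "run", "--gpus", "all", "--rm", "-it",
     "-v", f"{os.getcwd()}/checkpoints:/gpt-neox/checkpoints",  # assumed mount point
     IMAGE],
    check=True,
)
```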
Code Generation
Once the environment is set up, code generation is performed with the generate.py script, which reads prompts and produces completions from the chosen checkpoint. Sampling parameters such as temperature can be adjusted to control how random the generated code is.
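The script is typically launched through the toolkit's deepy.py wrapper with a stack of YAML configs. The sketch below drives that invocation from Python; the exact config file names are assumptions based on the repository's layout and should be checked against your checkout:

```python
# Launch NeoX text generation with a stack of YAML configs. The file names
# here are assumptions -- match them to the configs shipped with your
# checkpoint (sampling settings such as temperature live in these YAML files).
import subprocess

subprocess.run(
    [
        "./deepy.py", "generate.py",
        "checkpoints/configs/text_generation.yml",  # prompt/sampling settings
        "checkpoints/configs/local_setup.yml",      # paths and hardware setup
        "checkpoints/configs/2-7B.yml",             # model-size hyperparameters
    ],
    check=True,
)
```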
Models
The main released model is PolyCoder, which comes in several sizes; checkpoints are hosted on both HuggingFace and Zenodo.
PolyCoder
PolyCoder is a multilingual model trained on a large dataset of code drawn from many programming languages. It comes in three sizes (160M, 405M, and 2.7B parameters), offering a trade-off between generation quality and resource requirements.
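For quick experiments outside the NeoX toolkit, the HuggingFace checkpoints can be loaded with the transformers library (the model uploads reportedly require a relatively recent release, around 4.23+). The Hub identifiers below match the published uploads at the time of writing, but are worth verifying on the Hub:

```python
# Load a PolyCoder checkpoint from the HuggingFace Hub and sample a completion.
# Hub identifiers are assumptions -- verify them on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "NinedayWang/PolyCoder-160M"  # also published: PolyCoder-0.4B, PolyCoder-2.7B
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "def binarySearch(arr, left, right, x):\n\tmid = (left +"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0]))
```

A low temperature like 0.2 keeps completions close to the most likely continuation; raising it increases diversity at the cost of coherence.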
Datasets
The models were trained on a 249GB multilingual corpus of code, collected from popular GitHub repositories. The dataset includes code from 12 programming languages and has been cleaned and deduplicated to improve training quality.
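As a toy illustration of the deduplication step (not necessarily the project's exact pipeline), file-level deduplication can be done by hashing file contents and keeping one copy per digest:

```python
# Keep one copy of each distinct file by hashing contents -- simple file-level
# deduplication in the spirit of the corpus cleaning described above
# (illustrative only, not the project's actual pipeline).
import hashlib
from pathlib import Path

def dedup_files(root: str, pattern: str = "*.java") -> list[Path]:
    seen: set[str] = set()
    kept: list[Path] = []
    for path in sorted(Path(root).rglob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept
```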
Evaluation
The project includes scripts for evaluating the models, both by perplexity on held-out code and on the HumanEval benchmark. Detailed instructions for replicating these evaluations are provided in the repository.
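Here is a hedged sketch of the perplexity side of the evaluation, using the HuggingFace checkpoints for brevity; the repository's own scripts handle tokenization and data splits differently and are authoritative:

```python
# Compute token-level perplexity of a source file under a causal LM.
# Illustrative only; defer to the repository's evaluation scripts.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "NinedayWang/PolyCoder-160M"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

code = open("example.py").read()  # any held-out source file
ids = tokenizer(code, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(input_ids=ids, labels=ids).loss  # mean cross-entropy per token
print(f"perplexity: {math.exp(loss.item()):.2f}")
```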
Caveats
The models come with some caveats. They were not trained on natural-language problem descriptions, so they respond less reliably to docstring-style prompts than models trained for that purpose. Whitespace in the prompt is significant, since tabs and newlines are tokenized explicitly. Finally, because training files were concatenated, the model may wander into generating an unrelated new file once it reaches the end of the current one.
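The whitespace caveat is easy to see at the tokenizer level: tabs and spaces produce different token sequences, so a prompt's indentation style should match the code being continued. A small demonstration, assuming the HuggingFace tokenizer from above:

```python
# Tabs and spaces tokenize differently, so indentation style changes what the
# model actually sees (Hub identifier assumed, as above).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NinedayWang/PolyCoder-160M")
print(tokenizer.tokenize("def add(a, b):\n\treturn a + b"))    # tab-indented
print(tokenizer.tokenize("def add(a, b):\n    return a + b"))  # space-indented
```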
Conclusion
The Code-LMs project offers a valuable resource for researchers and developers interested in using LLMs for code generation and analysis. The detailed instructions and available models make it a great starting point for exploring the capabilities of these powerful tools.