mcp-git-ingest

MCP Git Ingest is an MCP server that lets AI models read and analyze GitHub repositories. It provides two tools: github_directory_structure, which returns a tree-like representation of a repository's directory structure, and github_read_important_files, which returns the content of specified files.

Through the Model Context Protocol (MCP), AI models can access and analyze codebases for use cases such as automated code review, documentation generation, and understanding complex software architectures. The server is built on fastmcp and uses gitpython to clone repositories and extract the requested information. For each request it clones the target repository and exposes its file system structure and file content over MCP, giving the model a structured view of the code.

Repository Directory Structure Retrieval

The github_directory_structure function lets AI models see how a GitHub repository is organized. It clones the specified repository to a temporary directory with gitpython, recursively builds a tree-like representation of the directory structure, and then removes the temporary directory. The output is a single string that shows the repository's file hierarchy.

For example, an AI model asked to find a project's main entry point can use this function to locate files like main.py, index.js, or app.ts from their position in the directory structure, without reading unrelated files first.
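
The tree-building step can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function name build_tree, the box-drawing connectors, and the choice to skip the .git directory are all assumptions, and the clone step (done with gitpython in the real server) is left out so the sketch works on any local directory.

```python
import os


def build_tree(path: str, prefix: str = "") -> str:
    """Return a tree-like string for everything under `path`.

    Hypothetical helper: the name, the connectors, and skipping `.git`
    are illustrative assumptions, not taken from mcp-git-ingest itself.
    """
    # Sort for stable output and leave out Git's own metadata directory.
    entries = sorted(name for name in os.listdir(path) if name != ".git")
    lines = []
    for index, name in enumerate(entries):
        is_last = index == len(entries) - 1
        connector = "└── " if is_last else "├── "
        lines.append(prefix + connector + name)
        full_path = os.path.join(path, name)
        if os.path.isdir(full_path):
            # Children of the last entry are indented with spaces,
            # children of other entries keep the vertical guide line.
            child_prefix = prefix + ("    " if is_last else "│   ")
            subtree = build_tree(full_path, child_prefix)
            if subtree:  # empty directories contribute no extra lines
                lines.append(subtree)
    return "\n".join(lines)
```

Calling build_tree on the root of a cloned repository returns one string with one line per file or directory, which is the shape of output described above.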

Targeted File Content Extraction

The github_read_important_files function lets AI models read specific files from a GitHub repository. It takes a repository URL and a list of file paths, clones the repository, reads the listed files, and returns a dictionary mapping each file path to its content. Error handling covers files that are missing or inaccessible. Reading only the requested files keeps the amount of data transferred and processed small and reduces noise.

Consider an AI model that looks for potential security vulnerabilities in a codebase. By using github_read_important_files, the model can focus on files that commonly contain vulnerabilities, such as configuration files, authentication modules, and API endpoints, which narrows the search space. The function uses gitpython to clone the repository and standard file I/O to read file content.
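
The read-and-collect step might look roughly like the sketch below. It assumes the repository is already cloned to a local directory, and the function name read_files and the wording of the error strings are hypothetical; the source only states that the result is a dictionary keyed by file path and that missing or inaccessible files are handled.

```python
import os


def read_files(repo_root: str, file_paths: list[str]) -> dict[str, str]:
    """Map each requested path to its content, or to an error message.

    Hypothetical helper: the name and the error strings are assumptions
    made for illustration, not the project's actual behavior.
    """
    results: dict[str, str] = {}
    for rel_path in file_paths:
        # Treat a leading slash as relative to the repository root.
        full_path = os.path.join(repo_root, rel_path.lstrip("/"))
        if not os.path.isfile(full_path):
            results[rel_path] = "Error: file not found"
            continue
        try:
            with open(full_path, "r", encoding="utf-8") as handle:
                results[rel_path] = handle.read()
        except (OSError, UnicodeDecodeError) as exc:
            # One unreadable file should not fail the whole request.
            results[rel_path] = f"Error reading file: {exc}"
    return results
```

Recording an error string under the file's own key, instead of raising, means one bad path does not discard the files that were read successfully.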

Secure Temporary Repository Management

mcp-git-ingest clones each repository it accesses into a temporary directory. The directory name is derived from a hash of the repository URL, so a request for the same URL maps to the same directory and can potentially reuse an existing clone. After the requested information (directory structure or file content) is extracted, the temporary directory is removed automatically, so repository data does not linger on the server.

This matters when an AI model analyzes multiple repositories: each repository URL gets its own directory, and cleanup after each request keeps data from one analysis task from leaking into another. The implementation uses Python's tempfile module to create temporary directories, shutil to remove them, and hashlib for the hash-based naming scheme.
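
The naming and cleanup scheme can be illustrated with the standard library alone. The sketch below is an assumption-laden approximation: the function names, the git_ingest_ prefix, the use of SHA-256, and the 12-character digest are all hypothetical, and the clone itself (gitpython in the real server) is replaced by a comment.

```python
import hashlib
import os
import shutil
import tempfile


def repo_dir_for(repo_url: str) -> str:
    """Derive a deterministic temp directory path from the repository URL.

    Hypothetical: the hash algorithm, digest length, and prefix are
    illustrative choices, not confirmed details of mcp-git-ingest.
    """
    digest = hashlib.sha256(repo_url.encode("utf-8")).hexdigest()[:12]
    return os.path.join(tempfile.gettempdir(), f"git_ingest_{digest}")


def with_temp_repo(repo_url: str, action):
    """Run `action(path)` on the repo directory, then always clean up."""
    path = repo_dir_for(repo_url)
    os.makedirs(path, exist_ok=True)
    try:
        # A real server would clone repo_url into `path` at this point.
        return action(path)
    finally:
        # Remove the directory even if `action` raised an exception.
        shutil.rmtree(path, ignore_errors=True)
```

Putting the cleanup in a finally block is one way to make sure the directory is removed on both success and failure, which is the behavior the text above describes.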