An End-to-End Guide to Beautifying Your Open-Source Repo with Agentic AI



My name is Nikolay Nikitin, PhD. I am the Research Lead at the AI Institute of ITMO University and an open-source enthusiast. I often see my colleagues failing to find the time and energy to create open repositories for their research papers and to ensure those repositories are of proper quality. In this article, I will discuss how we can help solve this problem using OSA, an AI tool developed by our team that helps a repository become a better version of itself. If you’re maintaining or contributing to open source, this post will save you time and effort: you’ll learn how OSA can automatically improve your repo by adding a proper README, generating documentation, setting up CI/CD scripts, and even summarizing the key strengths and weaknesses of the project.

There are many documentation improvement tools. However, they typically focus on individual components of repository documentation. For example, the Readme-AI tool generates the README file, but it doesn’t account for additional context, which is important, for example, for repositories accompanying scientific articles. Another tool, RepoAgent, generates complete documentation for the repository code, but not the README or CI/CD scripts. In contrast, OSA considers the repository holistically, aiming to make it easier to understand and ready to run. The tool was originally made for our colleagues in research, including biologists and chemists, who often lack experience in software engineering and modern development practices. The main aim was to help them make a repository more readable and reproducible in a few clicks. But OSA can be used on any repository, not only scientific ones.

Why is it needed?

Scientific open source faces challenges with the reuse of research results. Even when code is shared alongside scientific papers, it is rarely complete or ready to use. This code is usually difficult to read; there is no documentation for it, and sometimes even a basic README is missing, because the developer intended to write it at the last moment but didn’t find the time. Libraries and frameworks often lack basic CI/CD settings such as linters, automated tests, and other quality checks. As a result, it is often impossible to reproduce the algorithm described in the article. And this is a big problem, because if someone publishes their research, they do it with a desire to share it with the community.

But this problem isn’t limited to science. Professional developers also often put off writing READMEs and documentation for long periods. And if a project has dozens of repositories, maintaining and using them can become complicated.

Ideally, each repository should be easy to run and user-friendly. Yet published code often lacks essential elements such as a clear README file or proper docstrings, which could be compiled into full documentation using standard tools like MkDocs.

Based on our experience and analysis of the problem, we proposed a solution and implemented it as the Open Source Advisor (OSA) tool.

What is the OSA tool?

OSA is an open-source Python library that leverages LLM agents to improve open-source repositories and make them easier to reuse.
The tool is a package that runs via a command-line interface (CLI). It can also be deployed locally using Docker. By specifying an API key for your preferred LLM, you can interact with the tool via the console. You can also try OSA via the public web GUI. Here is a short introduction to the main ideas of repository improvement with OSA:

Intro to scientific repository improvement with OSA (video by author).

How does OSA work?

The Open Source Advisor (OSA) is a multi-agent tool that helps improve the structure and usability of scientific repositories in an automated way. It addresses common issues in research projects by handling tasks such as generating documentation (README files, code docstrings), creating essential files (licenses and requirements), and suggesting practical improvements to the repository. Users simply provide a repository link and can either receive an automatically generated Pull Request (PR) with all recommended changes or review the suggestions locally before applying them.

OSA can be used in two ways: by cloning the repository and running it through a command-line interface (CLI), or via a web interface. It also offers three working modes: basic, automatic, and advanced, which are chosen at runtime to fit different needs. In basic mode, OSA applies a small set of standard improvements with no extra input: it generates a report, README, community documentation, and an About section, and adds common folders like “tests” and “examples” if they’re missing. Advanced mode gives users full manual control over every step. In automatic mode, OSA uses an LLM to analyze the repository structure and the existing README, then proposes a list of improvements for users to approve or reject. An experimental multi-agent conversational mode is also being developed, allowing users to specify desired improvements in free-form natural language via the CLI. OSA interprets this request and applies the corresponding changes. This mode is currently under active development.

Another key strength of OSA is its flexibility with language models. It works with popular providers like OpenRouter and OpenAI, as well as local models such as Ollama and self-hosted LLMs running via FastAPI.
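
For instance, pointing OSA at a locally running Ollama server might look roughly like the following (an illustrative command only: 11434 is Ollama’s default port, and the exact value to pass for --api is best checked in the OSA documentation):

osa_tool -r {repository} --api {api} --base-url http://localhost:11434 --model {local_model_name}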

OSA also supports multiple repository platforms, including GitHub and GitLab (both GitLab.com and self-hosted instances). It can adjust CI/CD configuration files, set up documentation deployment workflows, and correctly configure paths for community documentation.

Under the hood, OSA relies on an experimental multi-agent system (MAS), currently under active development, that serves as the basis for its automatic and conversational modes. The system decomposes repository improvement into a sequence of reasoning and execution stages, each handled by a specialized agent. Agents communicate via a shared state and are coordinated through a directed state graph, enabling conditional transitions and iterative workflows.

Agent workflow graph in OSA (image by author)
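
To make this pattern concrete, here is a minimal sketch of the general idea: agents as functions over a shared state, coordinated by a directed graph with conditional transitions. This is an illustration only, not OSA’s actual implementation; all names are invented.

from typing import Callable, Dict, Optional

State = dict  # shared state passed between agents

def analyze(state: State) -> State:
    # a "reasoning" agent inspects the repository and records a plan
    state["plan"] = ["readme", "docstrings"]
    return state

def execute(state: State) -> State:
    # an "execution" agent applies the next planned improvement
    state.setdefault("done", []).append(state["plan"].pop(0))
    return state

NODES: Dict[str, Callable[[State], State]] = {"analyze": analyze, "execute": execute}

def next_node(current: str, state: State) -> Optional[str]:
    # directed transitions with a conditional, iterative loop on "execute"
    if current == "analyze":
        return "execute"
    if current == "execute" and state["plan"]:
        return "execute"
    return None  # terminal state

def run(state: State, start: str = "analyze") -> State:
    node = start
    while node is not None:
        state = NODES[node](state)
        node = next_node(node, state)
    return state

print(run({}))  # -> {'plan': [], 'done': ['readme', 'docstrings']}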

README generation

OSA includes a README generation tool that automatically creates clear and useful README files in two formats: a standard README and an article-style README. The tool decides which format to use on its own: for example, if the user provides a path or URL to a scientific paper through the CLI, OSA switches to the article format. To start, it scans the repository to find the most important files, focusing on core logic and project descriptions, and takes into account the folder structure and any existing README.

For the standard README, OSA analyzes the key project files, repository structure, metadata, and the main sections of an existing README if one is present. It then generates a “Core Features” section that serves as the foundation for the rest of the document. Using this information, OSA writes a clear project overview and adds a “Getting Started” section when example scripts or demo files are available, helping users quickly understand how to use the project.

In article mode, the tool creates a summary of the associated scientific paper and extracts relevant information from the main code files. These pieces are combined into an Overview that explains the project goals, a Content section that describes the main components and how they work together, and an Algorithms section that explains how the implemented methods fit into the research. This approach keeps the documentation scientifically accurate while making it easier to read and understand.
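
The paper itself is passed through the CLI; a hypothetical invocation might look like the one below. Note that the --article option name is an assumption here, so check the OSA help output for the actual flag.

osa_tool -r {repository} --article {path_or_url_to_paper}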

Documentation generation

The documentation generation tool produces concise, context-aware documentation for functions, methods, classes, and code modules. The documentation generation process is as follows:

(1) Reference parsing: Initially, a TreeSitter-driven parser collects the imported modules and resolves paths to them for each source code file, forming an import map that is later used to determine which methods and functions are called from foreign modules. With this approach, it is relatively easy to reconstruct the interconnections between different parts of the processed project and to resolve internal aliases. Along with the import maps, the parser also preserves general information: the processed file, a list of the classes it contains, and its standalone functions. Each class record contains the class name, attribute list, decorators, docstring, and a list of its methods; each method carries the same details as a standalone function, that is: name, docstring, return type, source code, and alias-resolved foreign method calls with the name of the imported module, class, and method, and the path to it.
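
For illustration, the per-file structure described above can be pictured roughly like this (a simplified sketch with invented field names, not OSA’s actual data model):

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ForeignCall:
    module: str          # imported module the call resolves to
    cls: Optional[str]   # class name, if the call is a method of an imported class
    method: str          # called function or method name
    path: str            # resolved path to the module's source file

@dataclass
class FunctionInfo:
    name: str
    docstring: Optional[str]
    return_type: Optional[str]
    source_code: str
    foreign_calls: List[ForeignCall] = field(default_factory=list)

@dataclass
class ClassInfo:
    name: str
    attributes: List[str]
    decorators: List[str]
    docstring: Optional[str]
    methods: List[FunctionInfo] = field(default_factory=list)

@dataclass
class ParsedFile:
    path: str
    import_map: Dict[str, str]  # alias -> resolved module path
    classes: List[ClassInfo] = field(default_factory=list)
    functions: List[FunctionInfo] = field(default_factory=list)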

(2) Initial docstring generation for functions, methods, and classes: Once the parsed structure is available, the initial docstring generation stage runs. Only classes, methods, and functions that lack docstrings are processed at this stage. The goal here is to produce a general description of ‘what’ the method does. The context is mostly the method’s source code, since at this point forming a general description of the functionality is crucial. The prompt also includes information about the method’s arguments and decorators, and ends with the source code of the called foreign methods to provide additional context about the method’s purpose. A notable detail is that class docstrings are generated only after docstrings for all of their previously undocumented methods have been generated; the class attributes, method names, and method docstrings are then provided to the model.

(3) Generation of “the main idea” of the project using descriptions of components derived from the previous stage.

(4) Docstring update using the generated “main idea”: Since all docstrings for the project are now presumably present, the main idea of the project can be generated. Essentially, the prompt for the idea consists of the docstrings of all classes and functions, along with an importance score for each component based on how often it occurs in the import maps mentioned before, and its place in the project hierarchy determined by its source path. The model response is returned in Markdown format, summarizing the project’s components. Once the main idea is acquired, the second stage of docstring generation begins, during which all of the project’s source code components are processed. At this point, the key focus is on providing the model with the original docstring (or the one generated at the initial stage) together with the main idea, so that it can elaborate on ‘why’ this component is needed in the project. The source code of the methods is also provided, since the expanded project narrative may prompt the model to correct some points in the original docstring.

(5) Hierarchical module description generation, proceeding from the bottom of the project hierarchy to the top.

(6) Using MkDocs and GitHub Pages for automated documentation publishing: The final stage of the docstring pipeline performs a recursive traversal across the project’s modules and submodules. The hierarchy is based on the source path; at each leaf-processing level, the previously parsed structure is used to create a description of the corresponding submodule, in accordance with the main idea. As processing moves to higher levels of the hierarchy, the generated submodule summaries are also used to provide additional context. The model returns summaries in Markdown to ensure seamless integration with the MkDocs documentation generation pipeline. The complete schema of the approach is described in the image below.

Documentation generation workflow (image by author)
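
To illustrate the two-pass idea, here is a hypothetical example of how a generated docstring might evolve between stages (2) and (4); the wording is invented and real OSA output will differ:

# Pass 1 ("what"): generated from the function's source code alone.
def gradient_penalty(critic, real, fake):
    """Compute the gradient penalty term for a batch of real and generated samples."""
    ...

# Pass 2 ("why"): refined once the project's "main idea" is known.
def gradient_penalty(critic, real, fake):
    """Compute the WGAN-GP gradient penalty used to enforce the Lipschitz
    constraint on the critic, stabilizing adversarial training of the
    synthetic data generator implemented in this project.
    """
    ...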

CI/CD and structure organization

OSA offers an automated CI/CD setup that works across different repository hosting platforms. It generates configurable workflows that make it easier to run tests, check code quality, and deploy projects. The tool supports common utilities such as Black for code formatting, unit_test for running tests, PEP8 and autopep8 for style checks, fix_pep8 for automatic style fixes, pypi_publish for publishing packages, and slash_command_dispatch for handling commands. Depending on the platform, these workflows are placed in the appropriate locations, for example, .github/workflows/ for GitHub or a .gitlab-ci.yml file in the repository root for GitLab.

Users can customize the generated workflows using options like --use-poetry to enable Poetry for dependency management, --branches to define which branches trigger the workflows (by default, main and master), and code coverage settings via --codecov-token and --include-codecov.
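
For example, a run that enables Poetry and coverage reporting might combine the options above like this (an illustrative command; the values in braces are placeholders):

osa_tool -r {repository} --use-poetry --branches {branches} --include-codecov --codecov-token {codecov_token}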

To ensure reliable testing, OSA also reorganizes the repository structure. It identifies test and example files and moves them into standardized tests and examples directories, allowing CI workflows to run tests consistently without additional configuration.

Workflow files are created from templates that combine project-specific information with user-defined settings. This approach keeps workflows consistent across projects while still allowing flexibility when needed.

OSA also automates documentation deployment using MkDocs. For GitHub repositories, it generates a YAML workflow in the .github/workflows directory and requires enabling read/write permissions and selecting the gh-pages branch for deployment in the repository settings. For GitLab, OSA creates or updates the .gitlab-ci.yml file to include build and deployment jobs using Docker images, scripts, and artifact retention rules. Documentation is then automatically published when changes are merged into the main branch.

How to use OSA

To begin using OSA, choose a repository with draft code that is incomplete or underdocumented. Optionally, include a related scientific paper or another document describing the library or algorithm implemented in the chosen repo. The paper is uploaded as a separate file and used to generate the README. You can also specify the LLM provider (e.g., OpenAI) and the model name (such as GPT-4o).

OSA generates recommendations for improving the repository, including:

  • A README file generated from code analysis, using standard templates and examples
  • Docstrings for classes and methods that are currently missing, to enable automatic documentation generation with MkDocs
  • Basic CI/CD scripts, including linters and automated tests
  • A report with actionable recommendations for improving the repository
  • Contribution guidelines and files (Code of Conduct, pull request and issue templates, etc.)

You can easily install OSA by running:

pip install osa_tool

After setting up the environment, you should choose an LLM provider (such as OpenAI or a local model). Next, add GIT_TOKEN (a GitHub token with standard repo permissions) and OPENAI_API_KEY (if you use an OpenAI-compatible API) as environment variables, or store them in a .env file. Finally, you can launch OSA directly from the command line. OSA is designed to work with an existing open-source repository by providing its URL. The basic launch command includes the repository address and optional parameters such as the operation mode, API endpoint, and model name:

osa_tool -r {repository} [--mode {mode}] [--api {api}] [--base-url {base_url}] [--model {model_name}]
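
For reference, the credentials mentioned above can be exported before that first launch (shown for Linux/macOS; the values are placeholders, and a .env file with the same variables works as well):

export GIT_TOKEN={your_github_token}
export OPENAI_API_KEY={your_openai_api_key}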

OSA supports three operating modes:

  • auto (default) – analyzes the repository and creates a customized improvement plan using a specialized LLM agent.
  • basic – applies a predefined set of improvements: generates a project report, README, community guidelines, an “About” section, and creates standard directories for tests and examples (if they are missing).
  • advanced – allows manual selection and configuration of actions before execution.

Additional CLI options are described in the OSA documentation. You can customize OSA by passing these options as arguments to the CLI, or by selecting desired features in the interactive command-line mode.

OSA interactive command interface. Image by authors.

Once launched, OSA performs an initial analysis of the repository and displays key information: general project details, the current environment configuration, and tables with planned and inactive actions. The user is then prompted to either accept the suggested plan, cancel the operation, or enter an interactive editing mode.

In interactive mode, the plan can be modified: actions toggled on or off, parameters (strings and lists) adjusted, and additional options configured. The system guides the user through each action’s description, possible values, and current settings. This process continues until the user confirms the final plan.

This CLI-based workflow ensures flexibility, from fully automated processing to precise manual control, making it suitable for both rapid initial assessments and detailed project refinements.

OSA also includes an experimental conversational interaction mode that allows users to specify desired repository improvements using free-form natural language via the CLI. If the request is ambiguous or insufficiently related to repository processing, the system iteratively requests clarifications and allows the attached supplementary file to be updated. Once a valid instruction is obtained, OSA analyzes the repository, selects the appropriate internal modules, and executes the corresponding actions. This mode is currently under active development.

When OSA finishes, it creates a pull request (PR) in the repository. The PR includes all proposed changes, such as the README, docstrings, documentation page, CI/CD scripts, contribution guidelines, report, and more. The user can easily review the PR, make changes if needed, and merge it into the project’s main branch.

Let’s look at an example. GAN-MFS is a repository that provides a PyTorch implementation of Wasserstein GAN with Gradient Penalty (WGAN-GP). Here is an example of a command to launch OSA on this repo:

osa_tool -r github.com/Roman223/GAN_MFS --mode auto --api openai --base-url https://api.openai.com/v1 --model gpt-4.1-mini

OSA made several contributions to the repository, including a README file generated from the paper’s content.

README file before OSA’s run (image by author)
Excerpt from the README generated by OSA (image by the author)

OSA also added a License file to the pull request, as well as some basic CI/CD scripts.

Contribution guidelines and CI/CD scripts generated by OSA (image by author)

OSA added docstrings to all classes and methods where documentation was missing. It also generated a structured, web-based documentation site using those docstrings.

A snippet from the project documentation page created by OSA (image by author)

The generated report includes an audit of the repository’s key components: README, license, documentation, usage examples, tests, and a project summary. It also analyzes key sections of the repository, such as its structure, README, and documentation. Based on this analysis, the system identifies key areas for improvement and provides targeted recommendations.

A repository analysis report (image by author)

Finally, OSA interacts with the target repository via GitHub. The OSA bot creates a fork of the repository and opens a pull request that includes all proposed changes. The developer only needs to review the suggestions and adjust anything that seems incorrect. In my opinion, this is much easier than writing the same README from scratch. After review, the repository maintainer successfully merged the pull request, so all changes proposed by OSA can be seen in that PR.

Pull request made by OSA (image by author)

Although the number of changes introduced by OSA is significant, it’s difficult to assess the overall improvement in repository quality. To do this, we decided to examine the repository from a security perspective. The Scorecard tool (OpenSSF Scorecard) allows us to evaluate a repository using an aggregated metric. Scorecard was created to help open-source maintainers improve their security best practices and to help open-source consumers judge whether their dependencies are safe. The aggregate score takes into account many repository parameters, including the presence of binary artifacts, CI/CD tests, the number of contributors, and a license. The aggregated score of the original repository was 2.2/10. After processing by OSA, it rose to 3.7/10, mainly due to the addition of a license and CI/CD scripts. This score may still seem low, but the processed repository isn’t intended for integration into large projects: it’s a small tool for generating synthetic data based on a scientific article, so its security requirements are lower.
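
If you want to reproduce the check, the Scorecard CLI can be run against any public repository like this (assuming the scorecard binary is installed and a GitHub token is set in the environment):

scorecard --repo=github.com/Roman223/GAN_MFS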

What’s Next for OSA?

We plan to integrate a RAG system into OSA, based on best practices in open-source development. OSA will compare the target repository with reference examples to identify missing components. For example, if the repository already has a high-quality README, it won’t be regenerated. Initially, we used OSA for Python repositories, but we plan to support additional programming languages in the future.

If you have an open repository that requires improvement, give OSA a try! We would also appreciate ideas for new features, which you can leave as issues and PRs.

If you wish to use OSA in your work, it can be cited as:

Nikitin N. et al. An LLM-Powered Tool for Enhancing Scientific Open-Source Repositories // Championing Open-source DEvelopment in ML Workshop @ ICML 2025.
