add codemap script #2658

Draft
wants to merge 1 commit into
base: master

@williballenthin (Contributor) commented Apr 25, 2025

This PR adds a script that displays the layout of a program:

  • metadata
  • sections
  • libraries
  • list of functions, each with
    • xrefs to the function
    • API calls
    • strings
    • calls from the function
    • (optionally) the capa matches, when provided a results.json file

for example:

[two images showing the script's output]

This was originally developed to help with research segmenting a program into its object files, but it turns out to be an interesting overview of programs generally.

Under the hood, this program uses lancelot to process the program into a BinExport2 representation, and then works with that as the IR.
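
As a rough illustration of the per-function layout listed above, here is a minimal sketch in plain Python. It does not use lancelot or BinExport2, and `FunctionEntry` and `render_function` are hypothetical names invented for this example, not the script's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionEntry:
    """Hypothetical per-function record (an assumption, not codemap.py's real type)."""

    address: int
    name: str
    xrefs_to: list[str] = field(default_factory=list)
    api_calls: list[str] = field(default_factory=list)
    strings: list[str] = field(default_factory=list)
    calls_from: list[str] = field(default_factory=list)
    capa_matches: list[str] = field(default_factory=list)  # empty when no results.json is given


def render_function(entry: FunctionEntry) -> list[str]:
    """Return one header line followed by one indented line per feature."""
    lines = [f"function {entry.name} @ {entry.address:#x}"]
    groups = [
        ("xref", entry.xrefs_to),
        ("api", entry.api_calls),
        ("string", entry.strings),
        ("calls", entry.calls_from),
        ("capa", entry.capa_matches),
    ]
    for label, values in groups:
        for value in values:
            lines.append(f"  {label}: {value}")
    return lines
```

Empty groups produce no lines, so a function with no capa matches simply omits that part of the listing.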

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

@williballenthin williballenthin requested a review from mr-tz April 25, 2025 18:50

@gemini-code-assist gemini-code-assist bot left a comment

Hello @williballenthin, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces a new script, codemap.py, designed to display the layout of a program. It visualizes metadata, sections, libraries, and a list of functions, including xrefs, API calls, strings, calls from the function, and optionally capa matches. The script leverages lancelot and rich libraries for analysis and presentation. It accepts a BinExport2 file as input, and optionally capa JSON results and Assemblage JSONL files to enhance the analysis.

Highlights

  • Script Addition: Adds a new script scripts/codemap.py for visualizing program layout.
  • Dependency Management: Specifies dependencies like protobuf, python-lancelot, and rich within the script's header.
  • Data Extraction: Extracts and presents key program information such as metadata, sections, libraries, and functions.
  • Capa Integration: Optionally integrates with capa results to display rule matches within functions.
  • Assemblage Integration: Optionally integrates with Assemblage data to update function names.
  • Thunk Resolution: Resolves thunks to provide more accurate call graph information.
  • Output Formatting: Uses the rich library to format the output with colors and indentation for better readability.

Changelog

  • scripts/codemap.py
    • Adds a new script to display the layout of a program.
    • Includes metadata, sections, libraries, and function details.
    • Supports optional integration with capa and Assemblage data.
    • Uses lancelot for BinExport2 analysis.
    • Uses rich for formatted output.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


A map of code,
Functions, calls, a winding road,
Insights we find.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions github-actions bot left a comment

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed, add the following to the PR description: [x] No CHANGELOG update needed

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces a script to display the layout of a program, including metadata, sections, libraries, functions, xrefs, API calls, strings, and capa matches. The script leverages lancelot and rich for analysis and presentation. Overall, the script provides a valuable tool for understanding program structure. However, there are a few areas that could be improved for clarity, efficiency, and robustness.

Summary of Findings

  • Error Handling: The script uses a broad except clause to catch google.protobuf.message.DecodeError. It would be better to catch a more specific exception or add additional checks to ensure that the file is a valid BinExport2 file before attempting to parse it. This will prevent unexpected behavior if the script is run on an invalid file.
  • Thunk Resolution: The script contains logic to resolve thunks, but there are several places where thunks are handled differently or not at all. It would be beneficial to consolidate the thunk resolution logic into a single function or class to ensure consistency and reduce code duplication.
  • Assemblage Location Handling: The script assumes that the base address is the lowest mapped page. This assumption may not always be correct, especially for more complex binaries. It would be better to either explicitly determine the base address or provide a way for the user to specify it.
  • Missing Documentation: The script lacks documentation for some of the key functions and classes, such as Renderer and AssemblageLocation. Adding docstrings to these functions and classes would improve the script's readability and maintainability.
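
One way the base-address finding could be addressed is sketched below, assuming the script has a list of section start addresses available. The function name `choose_base_address`, the `override` parameter, and the 4 KiB page size are assumptions made for illustration; none of them come from the PR.

```python
from typing import Iterable, Optional

PAGE_SIZE = 0x1000  # assumed page size, used only to align the heuristic result


def choose_base_address(section_addresses: Iterable[int], override: Optional[int] = None) -> int:
    """Prefer a user-supplied base address; otherwise fall back to the lowest mapped page."""
    if override is not None:
        return override

    addresses = list(section_addresses)
    if not addresses:
        raise ValueError("no mapped sections; pass an explicit base address")

    lowest = min(addresses)
    # Round down to the start of the page that contains the lowest address.
    return lowest - (lowest % PAGE_SIZE)
```

A command-line flag could feed `override`, which keeps the current heuristic as the default while giving users an escape hatch for unusual binaries.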

Merge Readiness

The script provides a useful tool for analyzing program layouts. However, the identified issues related to error handling, thunk resolution, base address guessing, and documentation should be addressed before merging. I am unable to approve this pull request, and recommend that it not be merged until the critical and high severity issues are addressed, and that others review and approve this code before merging.

Comment on lines +81 to +82
else:
    raise ValueError("unexpected section name")

medium

Consider raising a more specific exception type here, such as TypeError, to provide more context about the error.

Suggested change
  else:
-     raise ValueError("unexpected section name")
+     raise TypeError("unexpected section name: expected str or Text, got %s" % type(name))

Comment on lines +201 to +203
# we don't know which function this is.
# hopefully its a function recognized in our BinExport analysis.
# *shrug*

medium

This comment indicates uncertainty about the function recognition. It might be helpful to add more logging or debugging information to understand why the function is not being recognized.
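
A minimal sketch of what such logging might look like, using only the standard library. The helper `lookup_function_name`, the `names_by_address` mapping, and the logger name are hypothetical and stand in for whatever lookup the script actually performs.

```python
import logging
from typing import Optional

logger = logging.getLogger("codemap")  # logger name is an assumption


def lookup_function_name(names_by_address: dict[int, str], address: int) -> Optional[str]:
    """Return the known function name at `address`, logging a debug message on a miss."""
    name = names_by_address.get(address)
    if name is None:
        # Record the miss so unrecognized functions can be investigated later.
        logger.debug("no known function at %#x (%d functions indexed)", address, len(names_by_address))
    return name
```

Debug-level messages stay silent by default, so normal output is unaffected unless verbose logging is enabled.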

Comment on lines +410 to +412
for call_target_address in instruction.call_target:
    if call_target_address in idx.thunks:
        call_target_address = idx.thunks[call_target_address]

medium

This logic duplicates the thunk resolution logic from earlier in the script. Consider consolidating this logic into a single function or class to ensure consistency and reduce code duplication.
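
The consolidation suggested here might look roughly like the following sketch, assuming thunks are stored as a mapping from thunk address to target address (as `idx.thunks` appears to be in the quoted lines). The helper name `resolve_thunk`, the chain-following behavior, and the `max_depth` limit are assumptions; the quoted code performs only a single lookup.

```python
def resolve_thunk(thunks: dict[int, int], address: int, max_depth: int = 8) -> int:
    """Follow thunk -> target links until reaching a non-thunk address.

    Stops early on a cycle or after `max_depth` hops, returning the last
    address reached. Addresses that are not thunks are returned unchanged.
    """
    seen: set[int] = set()
    for _ in range(max_depth):
        if address not in thunks or address in seen:
            break
        seen.add(address)
        address = thunks[address]
    return address
```

Each call site could then use `resolve_thunk(idx.thunks, call_target_address)` instead of repeating the membership check inline.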

Comment on lines +444 to +446
for call_target_address in instruction.call_target:
    call_target_index = idx.vertex_index_by_address[call_target_address]
    call_target_vertex = be2.call_graph.vertex[call_target_index]

medium

This logic duplicates the thunk resolution logic from earlier in the script. Consider consolidating this logic into a single function or class to ensure consistency and reduce code duplication.

@williballenthin williballenthin marked this pull request as draft April 25, 2025 19:15