myobie/htmldiff

HTMLDiff

HTMLDiff is a Ruby gem that generates HTML-formatted diffs between two text strings. It can be used in your app to highlight additions, deletions, and modifications of text using HTML and CSS.

Features

  • Simple and opinionated API—it just works™.
  • Generates diffs of text using the LCS (Longest Common Subsequence) algorithm.
  • The diff preserves whitespace, HTML tags, HTML entities, URLs, and email addresses.
  • Multi-language support (Cyrillic, Greek, Arabic, Hebrew, Chinese, Japanese, Korean, etc.)
  • Customizable output formatting (see examples below).

Alternatives

  • diffy - Far more complex and feature-rich, but less opinionated.
  • diff-lcs - The underlying gem used by HTMLDiff.

Getting Started

Installation

Add this line to your application's Gemfile:

gem 'htmldiff'

Basic Usage

require 'htmldiff'

old_text = "The quick red fox jumped over the dog."
new_text = "The red fox hopped over the lazy dog."

diff = HTMLDiff.diff(old_text, new_text)

Output:

The <del class="diffdel">quick </del>red fox <del class="diffmod">jumped</del><ins class="diffmod">hopped</ins> over the <ins class="diffins">lazy</ins> dog.

Formatting the HTML Output

HTMLDiff includes a customizable HtmlFormatter. Through the html_format option, you can specify the HTML tags and CSS classes used for each kind of diff element (deletions, insertions, replacements, and unchanged text).

old_text = "The quick red fox jumped over the dog."
new_text = "The red fox hopped over the lazy dog."

diff = HTMLDiff.diff(old_text, new_text, html_format: {
  tag: 'span',
  class_delete: 'highlight removed',
  class_insert: 'highlight added'
})

Output:

The <span class="highlight removed">quick </span>red fox <span class="highlight removed">jumped</span><span class="highlight added">hopped</span> over the <span class="highlight added">lazy</span> dog.

Customization Options

HTMLDiff.diff(html_format:) supports the following options:

Option                  Description
:tag                    Base HTML tag to use for all change nodes (default: none)
:tag_delete             HTML tag for deleted content (overrides :tag, default: "del")
:tag_insert             HTML tag for inserted content (overrides :tag, default: "ins")
:tag_replace            HTML tag for replaced content (overrides :tag_delete, :tag)
:tag_replace_delete     HTML tag for deleted content in replacements (overrides :tag_replace, :tag_delete, :tag)
:tag_replace_insert     HTML tag for inserted content in replacements (overrides :tag_replace, :tag_insert, :tag)
:tag_unchanged          HTML tag for unchanged content (optional)
:class                  Base CSS class(es) for all change nodes
:class_delete           CSS class(es) for deleted content (overrides :class)
:class_insert           CSS class(es) for inserted content (overrides :class)
:class_replace          CSS class(es) for replaced content (overrides :class_delete, :class_insert, :class)
:class_replace_delete   CSS class(es) for deleted content in replacements (overrides :class_replace, :class_delete, :class)
:class_replace_insert   CSS class(es) for inserted content in replacements (overrides :class_replace, :class_insert, :class)
:class_unchanged        CSS class(es) for unchanged content (optional)

Example: Wrapping unchanged text in tags

diff = HTMLDiff.diff(old_text, new_text, html_format: {
  tag_unchanged: 'span',
  class_unchanged: 'unchanged',
  tag: 'span',
  class_delete: 'deleted',
  class_insert: 'inserted'
})

Output:

<span class="unchanged">The </span><span class="deleted">quick </span><span class="unchanged">red fox </span><span class="deleted">jumped</span><span class="inserted">hopped</span><span class="unchanged"> over the </span><span class="inserted">lazy</span><span class="unchanged"> dog.</span>

Example: Special handling for replacements

diff = HTMLDiff.diff(old_text, new_text, html_format: {
  tag_delete: 'span',
  tag_insert: 'div',
  tag_replace: 'mark',
  class_delete: 'deleted',
  class_insert: 'inserted',
  class_replace_delete: 'replaced deleted',
  class_replace_insert: 'replaced inserted'
})

Output:

The <span class="deleted">quick </span>red fox <mark class="replaced deleted">jumped</mark><mark class="replaced inserted">hopped</mark> over the <div class="inserted">lazy</div> dog.
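
The override order listed under Customization Options can be read as a fallback chain: the most specific option wins, then the next most specific, and finally the default. The sketch below illustrates that reading for tags only. It is not HTMLDiff's code: resolve_tag, TAG_FALLBACKS, and TAG_DEFAULTS are hypothetical names, and the "del"/"ins" defaults for replacements are an assumption taken from the Basic Usage output.

```ruby
# Hypothetical helper, NOT part of HTMLDiff's API. It only illustrates the
# fallback order described in the options table.
TAG_FALLBACKS = {
  delete:         %i[tag_delete tag],
  insert:         %i[tag_insert tag],
  replace_delete: %i[tag_replace_delete tag_replace tag_delete tag],
  replace_insert: %i[tag_replace_insert tag_replace tag_insert tag]
}.freeze

# Assumed defaults when no tag option applies.
TAG_DEFAULTS = {
  delete: 'del', insert: 'ins', replace_delete: 'del', replace_insert: 'ins'
}.freeze

# Returns the first option set along the fallback chain, else the default.
def resolve_tag(kind, options)
  key = TAG_FALLBACKS.fetch(kind).find { |name| options[name] }
  key ? options[key] : TAG_DEFAULTS.fetch(kind)
end

options = { tag_delete: 'span', tag_insert: 'div', tag_replace: 'mark' }
resolve_tag(:delete, options)         #=> "span"
resolve_tag(:insert, options)         #=> "div"
resolve_tag(:replace_delete, options) #=> "mark"
resolve_tag(:replace_insert, options) #=> "mark"
resolve_tag(:delete, {})              #=> "del"
```

With the options from the replacement example, this chain gives span for plain deletions, div for plain insertions, and mark for both halves of a replacement, which matches the output shown there.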

Using a Custom Output Formatter

If the HTML formatting options above aren't sufficient for your use case, or if you'd like to output to an alternative format (e.g. XML, JSON, etc.), you can further customize the output by creating your own formatter.

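Your formatter may be any object that responds to the #format method. The method receives an Array of [action, old_string, new_string] entries, where action is '=' (equal), '-' (remove), '+' (add), or '!' (replace), and either string may be nil. It can return whatever object type you'd like (typically a String).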

module MyCustomFormatter
  def self.format(changes)
    changes.each_with_object(+'') do |(action, old_string, new_string), content|
      case action
      when '=' # equal
        content << new_string if new_string
      when '-' # remove
        content << %(<removed>#{old_string}</removed>) if old_string
      when '+' # add
        content << %(<added>#{new_string}</added>) if new_string
      when '!' # replace
        content << %(<removed>#{old_string}</removed>) if old_string
        content << %(<added>#{new_string}</added>) if new_string
      end
    end
  end
end

# Test your custom formatter
example_changes = [
  ['=', 'The ', 'The '],
  ['+', nil, 'quick '],
  ['=', 'red fox ', 'red fox '],
  ['!', 'jumped', 'hopped'],
  ['=', ' over the ', ' over the '],
  ['-', 'lazy ', nil],
  ['=', 'dog.', 'dog.']
]
MyCustomFormatter.format(example_changes)
#=> "The <added>quick </added>red fox <removed>jumped</removed>" \
#   "<added>hopped</added> over the <removed>lazy </removed>dog."

# Use your custom formatter in the diff method
diff = HTMLDiff.diff(old_text, new_text, formatter: MyCustomFormatter)
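
As a second illustration, a formatter can emit a non-HTML format such as JSON. The sketch below assumes the same [action, old_string, new_string] input shown above. The module name JsonChangeFormatter and the JSON field names are hypothetical choices, not something the gem defines.

```ruby
require 'json'

# Hypothetical formatter: the module name and JSON shape are illustrative only.
module JsonChangeFormatter
  ACTION_NAMES = {
    '=' => 'equal', '-' => 'remove', '+' => 'add', '!' => 'replace'
  }.freeze

  # Converts each change into a Hash, then serializes the list as JSON.
  def self.format(changes)
    records = changes.map do |action, old_string, new_string|
      { 'action' => ACTION_NAMES.fetch(action), 'old' => old_string, 'new' => new_string }
    end
    JSON.generate(records)
  end
end

example_changes = [
  ['=', 'The ', 'The '],
  ['!', 'jumped', 'hopped'],
  ['-', 'lazy ', nil]
]
puts JsonChangeFormatter.format(example_changes)

# It would be passed to the diff method like any other formatter:
# HTMLDiff.diff(old_text, new_text, formatter: JsonChangeFormatter)
```

Because nil strings serialize to JSON null, a consumer can tell a pure insertion or deletion apart from a replacement without inspecting the action field.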

Using a Custom Tokenizer

You can customize how text is split into tokens by creating your own tokenizer. A tokenizer can be any object that responds to the #tokenize method and returns an Array of Strings (i.e. the tokens).

It is useful to think of tokens as the "unsplittable" unit in your diff. For example, if you tokenize each word ["Hello", "beautiful", "world"], the diff output will never split these mid-word. However, if you tokenize each character ["H", "e", "l", "l", "o"], the diff output can split in the middle of a word. For example, HTMLDiff.diff("Hello", "Help", tokenizer: ...) would return "Hel<del>lo</del><ins>p</ins>".

Your custom tokenizer's output array should include whitespace tokens, such that the output can be joined to match the original string.

module MyCustomTokenizer
  def self.tokenize(string)
    string.split(/(\b|\s)/).reject(&:empty?)
  end
end

# Check that your tokenizer output matches the original string when joined
test = MyCustomTokenizer.tokenize("Hello, world!") #=> ["Hello", ",", " ", "world", "!"]
test.join #=> "Hello, world!"

# Use your custom tokenizer in the diff method
diff = HTMLDiff.diff(old_text, new_text, tokenizer: MyCustomTokenizer)
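
For the character-level case described above, a tokenizer can be a one-liner. The sketch below is a hypothetical example (the name CharacterTokenizer is not defined by the gem). It splits on grapheme clusters rather than raw code points, so that a base letter and its combining accent stay in one token.

```ruby
# Hypothetical tokenizer: one token per user-perceived character.
module CharacterTokenizer
  def self.tokenize(string)
    # String#grapheme_clusters (Ruby 2.5+) keeps combining marks attached
    # to their base character, unlike String#chars.
    string.grapheme_clusters
  end
end

tokens = CharacterTokenizer.tokenize('Help')
tokens      #=> ["H", "e", "l", "p"]
tokens.join #=> "Help"

# It would be passed to the diff method like any other tokenizer:
# HTMLDiff.diff("Hello", "Help", tokenizer: CharacterTokenizer)
```

Whitespace characters come out as their own tokens here, so the join check from the example above still holds.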

How HTMLDiff Works

HTMLDiff uses a three-step process:

  1. Tokenization: The input strings are broken into an array of tokens by the HTMLDiff::Tokenizer module.
  2. Diff Generation: The HTMLDiff::Differ module uses the LCS (Longest Common Subsequence) algorithm to find the differences between the token arrays.
  3. Formatting: The differences are formatted into HTML by a formatter.
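
The three steps above can be sketched in plain Ruby without the gem. This is a simplified illustration, not HTMLDiff's internal code: every method name below is hypothetical, the tokenizer is a basic regex, and the LCS is a textbook dynamic-programming table rather than diff-lcs. The real gem's whitespace grouping and markup may differ.

```ruby
# Step 1 (tokenization): split into word, whitespace, and punctuation tokens.
# Every character lands in exactly one token, so tokens.join == string.
def tokenize(string)
  string.scan(/\w+|\s+|[^\w\s]/)
end

# table[i][j] holds the LCS length of a[i..] and b[j..].
def lcs_table(a, b)
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  (a.size - 1).downto(0) do |i|
    (b.size - 1).downto(0) do |j|
      table[i][j] = if a[i] == b[j]
                      table[i + 1][j + 1] + 1
                    else
                      [table[i + 1][j], table[i][j + 1]].max
                    end
    end
  end
  table
end

# Step 2 (diff generation): walk the table, emitting one op per token.
def diff_tokens(a, b)
  table = lcs_table(a, b)
  ops = []
  i = j = 0
  while i < a.size || j < b.size
    if i < a.size && j < b.size && a[i] == b[j]
      ops << ['=', a[i], b[j]]
      i += 1
      j += 1
    elsif j == b.size || (i < a.size && table[i + 1][j] >= table[i][j + 1])
      ops << ['-', a[i], nil]
      i += 1
    else
      ops << ['+', nil, b[j]]
      j += 1
    end
  end
  ops
end

# Merge adjacent ops into runs; a run with both removed and added
# text becomes a replacement ('!').
def merge_ops(ops)
  ops.chunk_while { |x, y| (x[0] == '=') == (y[0] == '=') }.map do |run|
    old_text = run.map { |op| op[1] }.join
    new_text = run.map { |op| op[2] }.join
    if run.first[0] == '=' then ['=', old_text, new_text]
    elsif old_text.empty?  then ['+', nil, new_text]
    elsif new_text.empty?  then ['-', old_text, nil]
    else ['!', old_text, new_text]
    end
  end
end

# Step 3 (formatting): render the merged changes as HTML.
def format_html(changes)
  changes.map do |action, old_text, new_text|
    case action
    when '=' then new_text
    when '-' then "<del>#{old_text}</del>"
    when '+' then "<ins>#{new_text}</ins>"
    when '!' then "<del>#{old_text}</del><ins>#{new_text}</ins>"
    end
  end.join
end

old_text = 'The quick red fox jumped over the dog.'
new_text = 'The red fox hopped over the lazy dog.'
changes = merge_ops(diff_tokens(tokenize(old_text), tokenize(new_text)))
puts format_html(changes)
# Prints:
# The <del>quick </del>red fox <del>jumped</del><ins>hopped</ins> over the <ins>lazy </ins>dog.
```

The merged changes use the same [action, old_string, new_string] shape as the custom formatter example, which is why the formatting step can be swapped out independently of the diff step.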

About HTMLDiff

Maintainers

HTMLDiff is maintained by the team at TableCheck based in Tokyo, Japan. We use HTMLDiff in our products to help our restaurant users visualize the edit history of their customer and reservation data. If you're seeking your next career adventure, we're hiring!

Acknowledgements

Original implementation by Nathan Herald, based on an unknown Wiki article.

HTMLDiff uses the fantastic diff-lcs gem under the hood.

License

This project is licensed under the MIT License.
