Wednesday, September 10, 2014

Parsing HTML with JSoup

I recently had one of those tasks that appears trivial at first and then reveals itself to be frustratingly complex. I'm working on a system that receives content from a third-party source, and we want to truncate that content to fit in a limited space. The problem is that the content may contain HTML, which we want to preserve but which obviously should not be included in the character count.

The answers I found on Stack Overflow from people facing the same problem were a testament to its complexity, weighing in with complex regular expressions or a 156-line method using Collections. I thought there must be a neater way, so I turned to JSoup, an HTML parser for Java.

The solution relies on JSoup's excellent and easy-to-use API for parsing and manipulating the DOM. I constructed a visitor that traverses the DOM tree, collecting text content until the maximum character count is reached. As the text values are returned back up the tree, they are set as the inner HTML of the surrounding element, and then the outer HTML (the content including the tags wrapping it) is returned to the caller.
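To make the idea concrete without any library, here is a minimal sketch of the same traversal on a toy tree. The `Node` class and `TruncateSketch` are hypothetical stand-ins invented for illustration, not JSoup types, and for simplicity this version cuts mid-word rather than at a space. The point it shows is that only text consumes the character budget, while the tags are rebuilt around whatever text survives.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TruncateSketch {
    // Hypothetical stand-in for a DOM node; not a JSoup type.
    static class Node {
        final String tag;          // null for a text node
        final String text;         // null for an element
        final List<Node> children; // empty for a text node

        private Node(String tag, String text, List<Node> children) {
            this.tag = tag;
            this.text = text;
            this.children = children;
        }

        static Node text(String value) {
            return new Node(null, value, Collections.<Node>emptyList());
        }

        static Node element(String tag, Node... children) {
            return new Node(tag, null, Arrays.asList(children));
        }
    }

    // Characters of visible text we are still allowed to emit.
    private int remaining;

    TruncateSketch(int maxLength) {
        this.remaining = maxLength;
    }

    String render(Node node) {
        if (remaining <= 0) {
            return ""; // budget spent: drop everything that follows
        }
        if (node.tag == null) {
            if (node.text.length() <= remaining) {
                remaining -= node.text.length();
                return node.text;
            }
            // Hard cut at the budget; the real code prefers a word boundary.
            String cut = node.text.substring(0, remaining);
            remaining = 0;
            return cut + "...";
        }
        // Element: truncate the children, then wrap the result in the tag again.
        StringBuilder inner = new StringBuilder();
        for (Node child : node.children) {
            inner.append(render(child));
        }
        return "<" + node.tag + ">" + inner + "</" + node.tag + ">";
    }

    public static void main(String[] args) {
        Node root = Node.element("p",
            Node.text("Hello "),
            Node.element("b", Node.text("brave new")),
            Node.text(" world"));
        System.out.println(new TruncateSketch(9).render(root));
        // prints <p>Hello <b>bra...</b></p>
    }
}
```

The markup costs nothing against the limit of 9: "Hello " uses 6 characters, "bra" uses the last 3, and the closing tags are still emitted.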

The end result is that all HTML tags are preserved. For example, this content:

I am a particularly contrived example

when truncated to 20 characters becomes

I am a particularly...

with the italics and the link still in place.

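import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// The method and the visitor class below are members of the same enclosing class.

public static String truncateText(int maxLength, String textToTruncate)
{
  // If the whole string, markup included, already fits, the visible text fits too.
  if (textToTruncate == null || textToTruncate.length() <= maxLength)
  {
    return textToTruncate;
  }

  // Jsoup.parse wraps a fragment in html/head/body, so the content ends up under body.
  Document document = Jsoup.parse(textToTruncate);
  Element body = document.body();

  // Visit each top-level node; the visitor stops collecting once the limit is reached.
  TextBuilderNodeVisitor visitor = new TextBuilderNodeVisitor(0, maxLength);
  for (Node node : body.childNodes())
  {
    visitor.visit(node);
  }

  return visitor.text.toString();
}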

private static class TextBuilderNodeVisitor
{
  private static final String ELLIPSIS = "...";

  // Number of visible text characters collected so far (markup is not counted).
  private int currentLength;

  private final int maxLength;

  // Truncated HTML built up for the nodes visited at this level.
  private final StringBuilder text = new StringBuilder();

  public TextBuilderNodeVisitor(int currentLength, int maxLength)
  {
    this.currentLength = currentLength;
    this.maxLength = maxLength;
  }

  public void visit(Node node)
  {
    // Once the limit is reached, every remaining node is dropped.
    if (currentLength >= maxLength)
    {
      return;
    }

    if (node instanceof TextNode)
    {
      visitTextNode((TextNode) node);
    }
    else if (node instanceof Element)
    {
      visitElement((Element) node);
    }
  }

  private void visitTextNode(TextNode node)
  {
    String nodeText = node.text();
    int remaining = maxLength - currentLength;
    if (nodeText.length() > remaining)
    {
      // Look one character past the limit so a word ending exactly on it is kept whole.
      String window = nodeText.substring(0, remaining + 1);
      int space = window.lastIndexOf(' ');
      // Cut at the last space; with no space, cut hard at the limit rather than one past it.
      int cutoff = (space >= 0) ? space : remaining;
      text.append(nodeText.substring(0, cutoff));
      text.append(ELLIPSIS);
      currentLength = maxLength;
    }
    else
    {
      text.append(nodeText);
      currentLength += nodeText.length();
    }
  }

  private void visitElement(Element element)
  {
    // Truncate the children with a fresh visitor that carries the running count.
    TextBuilderNodeVisitor childVisitor = new TextBuilderNodeVisitor(currentLength, maxLength);
    for (Node node : element.childNodes())
    {
      childVisitor.visit(node);
    }
    // Replace the element's inner HTML with the truncated content, then keep its tags.
    element.html(childVisitor.text.toString());
    currentLength = childVisitor.currentLength;
    text.append(element.outerHtml());
  }
}
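The word-boundary cut inside visitTextNode is the fiddly part, so it can help to try it in isolation. The following is a small sketch using only the JDK; the class name `WordBoundaryCut` and the method `cut` are hypothetical names chosen for this example, and the fallback to a hard cut when no space is found is an assumption about the desired behaviour.

```java
class WordBoundaryCut {
    static final String ELLIPSIS = "...";

    // Returns text unchanged if it fits within budget characters. Otherwise cuts
    // at the last space found in the first budget + 1 characters (or hard at
    // budget when there is no space) and appends an ellipsis.
    static String cut(String text, int budget) {
        if (text.length() <= budget) {
            return text;
        }
        // One extra character lets a word that ends exactly on the limit survive.
        String window = text.substring(0, budget + 1);
        int space = window.lastIndexOf(' ');
        int cutoff = (space >= 0) ? space : budget;
        return text.substring(0, cutoff) + ELLIPSIS;
    }

    public static void main(String[] args) {
        System.out.println(cut("I am a particularly contrived example", 20));
        // prints I am a particularly...
        System.out.println(cut("particularly", 5));
        // prints parti...
    }
}
```

With a limit of 20 the cut lands on the space after "particularly", which is why the truncated example above ends there instead of partway through "contrived".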