Wednesday, September 10, 2014

Parsing HTML with JSoup

I recently had one of those tasks that appears trivial at first, and then reveals itself to be frustratingly complex. I'm working on a system that receives content from a third-party source, but we want to truncate it to fit in a limited space. The problem is that the content may contain HTML, which we want to preserve, but which should obviously not be included in the character count.

The answers I found on Stack Overflow from people facing the same problem were testament to its complexity, weighing in with complex regular expressions or a 156-line method using Collections. I thought there must be a neater way, so I turned to JSoup, an HTML parser for Java.

The solution relies on JSoup's excellent and easy-to-use API for parsing and manipulating the DOM. I wrote a visitor that traverses the DOM tree, collecting text content until the maximum character count is reached. As the text is passed back up the tree, it is set as the inner HTML of the enclosing element, and that element's outer HTML (its content plus the tags wrapping it) is then passed up to the caller.

The end result is that all HTML tags are preserved. For example, this content:

I am a particularly contrived example

when truncated to 20 characters becomes

I am a particularly...

with the italics and the link still in place.

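// Imports needed by the snippets below:
// import org.jsoup.Jsoup;
// import org.jsoup.nodes.Document;
// import org.jsoup.nodes.Element;
// import org.jsoup.nodes.Node;
// import org.jsoup.nodes.TextNode;

// Appended to the text at the point where it is cut off.
private static final String ELLIPSIS = "...";

public static String truncateText(int maxLength, String textToTruncate)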
  {
    if (textToTruncate == null || textToTruncate.length() <= maxLength)
    {
      return textToTruncate;
    }

    Document document = Jsoup.parse(textToTruncate);
    Element body = document.body();

    TextBuilderNodeVisitor visitor = new TextBuilderNodeVisitor(0, maxLength);
    for (Node node : body.childNodes())
    {
      visitor.visit(node);
    }

    return visitor.text.toString();

  }

private static class TextBuilderNodeVisitor
  {

    private int currentLength;

    private final int maxLength;

    private StringBuilder text = new StringBuilder();

    public TextBuilderNodeVisitor(int currentLength, int maxLength)
    {
      this.currentLength = currentLength;
      this.maxLength = maxLength;
    }

    public void visit(Node node)
    {
      if (currentLength >= maxLength)
      {
        return;
      }

      if (node instanceof TextNode)
      {
        visitTextNode((TextNode) node);
      }
      else if (node instanceof Element)
      {
        visitElement((Element) node);
      }
    }

    private void visitTextNode(TextNode node)
    {
      if (currentLength + node.text().length() > maxLength)
      {
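        // Look one character past the limit so a space exactly at the limit is found.
        int remaining = maxLength - currentLength;
        String toAppend = node.text().substring(0, remaining + 1);
        int space = toAppend.lastIndexOf(" ");
        // Cut at the last space if there is one; otherwise cut hard at the limit.
        int cutoff = (space >= 0) ? space : remaining;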
        text.append(toAppend.substring(0, cutoff));
        text.append(ELLIPSIS);
        currentLength = maxLength;
      }
      else
      {
        text.append(node.text());
        currentLength += node.text().length();
      }
    }

    private void visitElement(Element element)
    {
      TextBuilderNodeVisitor childVisitor = new TextBuilderNodeVisitor(currentLength, maxLength);
      for (Node node : element.childNodes())
      {
        childVisitor.visit(node);
      }
      element.html(childVisitor.text.toString());
      currentLength = childVisitor.currentLength;
      text.append(element.outerHtml());
    }
  }
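
The word-boundary rule used for text nodes can be tried on its own, without JSoup. What follows is only a minimal sketch in plain Java: the class name WordBoundaryCut and the method cut are made up for illustration, and the hard cut at the limit when the text contains no space is an assumption of this sketch.

```java
public class WordBoundaryCut {

    static final String ELLIPSIS = "...";

    /**
     * Returns the text unchanged if it fits in `remaining` characters.
     * Otherwise cuts at the last space at or before the limit (or hard at
     * the limit if there is no space) and appends an ellipsis.
     */
    public static String cut(String text, int remaining) {
        if (text.length() <= remaining) {
            return text;
        }
        // Look one character past the limit so a space exactly at the limit is found.
        String window = text.substring(0, remaining + 1);
        int space = window.lastIndexOf(' ');
        int cutoff = (space >= 0) ? space : remaining;
        return window.substring(0, cutoff) + ELLIPSIS;
    }

    public static void main(String[] args) {
        // prints "I am a particularly..."
        System.out.println(cut("I am a particularly contrived example", 20));
        // prints "Super..."
        System.out.println(cut("Supercalifragilistic", 5));
    }
}
```

Note that the ellipsis itself is not counted towards the limit, which matches the example output earlier in this post.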

Monday, March 24, 2014

The fundamental difference between Guice and Spring

If you search Google for the difference between Guice and Spring, chances are you will find many, many pages arguing about XML configuration, performance and the relative weight of the packages. However, in my experience most developers fail to realise the core difference between Guice and Spring.

Guice and Spring represent fundamentally different dependency injection paradigms.

In Spring:
  • There is an application context that holds instantiated objects that may be injected into their collaborators.
  • The default scope is singleton, but the context may hold proxies or factories for some objects that require different scopes.
  • If an object in the context requires an instance of a type to be injected that does not exist in the context, an exception will be thrown.
In Guice:
  • There is an injector that can create instances of any type requested.
  • The default scope is to create a new instance for every injection.
  • If an object requires an instance of a type that has no explicit binding, Guice will by default instantiate it on demand, provided it is a concrete class that Guice can construct.
These differences are pretty significant, but I think they are lost on a lot of developers. It's fairly common to see huge Guice module files where everything is bound as a singleton, and if that is the way your application is best structured, perhaps you should be using Spring instead.
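
To make the contrast concrete, here is a deliberately tiny model of the two paradigms in plain Java. It is not how either framework is implemented and uses neither library: RegistryContext, JustInTimeInjector and Greeter are hypothetical names, and the injector in this sketch only handles classes with a no-argument constructor.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

public class InjectionParadigms {

    /** Context style: only objects registered up front can be looked up. */
    public static class RegistryContext {
        private final Map<Class<?>, Object> beans = new HashMap<>();

        public <T> void register(Class<T> type, T instance) {
            beans.put(type, instance);
        }

        public <T> T get(Class<T> type) {
            Object bean = beans.get(type);
            if (bean == null) {
                // Nothing registered for this type: fail rather than create one.
                throw new IllegalStateException("No bean of type " + type.getName());
            }
            return type.cast(bean); // the same instance on every lookup
        }
    }

    /** Injector style: any constructible type is created on demand. */
    public static class JustInTimeInjector {
        public <T> T get(Class<T> type) {
            try {
                Constructor<T> constructor = type.getDeclaredConstructor();
                return constructor.newInstance(); // a new instance on every request
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot create " + type.getName(), e);
            }
        }
    }

    /** Hypothetical collaborator used only for the demonstration. */
    public static class Greeter {
        public String greet() {
            return "hello";
        }
    }

    public static void main(String[] args) {
        RegistryContext context = new RegistryContext();
        context.register(Greeter.class, new Greeter());
        // Same object both times, because the context holds the instance.
        System.out.println(context.get(Greeter.class) == context.get(Greeter.class));

        JustInTimeInjector injector = new JustInTimeInjector();
        // Different objects, because each request creates a new instance.
        System.out.println(injector.get(Greeter.class) == injector.get(Greeter.class));
    }
}
```

In this model, asking the context for a type that was never registered throws, while the injector will happily build one. That is the behavioural gap described in the lists above.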

Monday, January 20, 2014

Parameterised tests with TestNG

My previous posts on this blog have both been about how wonderful it is to be able to run exactly the same tests against client and server side code. However, actually doing that is a little bit clunky in JUnit, requiring you to create a separate test method for each implementation of your interface. At best you can do this:
@Test
public void testGetCustomerById_Service() {
    doTestGetCustomerById(service);
}

@Test
public void testGetCustomerById_Client() {
    doTestGetCustomerById(client);
}

public void doTestGetCustomerById(CustomerService service) {
    int id = random.nextInt();
        
    Customer expected = new Customer(id);
    when(customerDao.getById(id)).thenReturn(expected);
        
    Customer actual = service.getCustomerById(id);
    assertEquals(expected, actual);
}
Your choice is between writing 3 methods for each test, or 2 methods containing identical code. Here's something better you can do if you're using TestNG.
@Test(dataProvider = "myProvider")
public void testGetCustomerById(CustomerService service) {
    int id = random.nextInt();
        
    Customer expected = new Customer(id);
    when(customerDao.getById(id)).thenReturn(expected);
        
    Customer actual = service.getCustomerById(id);
    assertEquals(expected, actual);
}

@DataProvider(name = "myProvider")
public Object[][] provider() {
    return new Object[][] {{service}, {client}};
}
That Object[][] looks a bit ugly, but the results are great. Now you have 1 method per test, plus 1 data provider method for the whole test class. Each element of the outer array is one invocation of the test, and the inner array holds the arguments for that invocation, so you can pass multiple parameters. Here's a (slightly contrived) example testing a calculator implementation.
@Test(dataProvider = "myProvider")
public void testAddition(int a, int b, int expectedResult) {
    int result = calculator.calc(a + "+" + b);
    assertEquals(expectedResult, result);
}

@DataProvider(name = "myProvider")
public Object[][] provider() {
    return new Object[][] {{ 1, 2, 3 }, { 4, 5, 9 }, { 10, 11, 21 }};
}
And you can use the same provider to test subtraction by declaring the parameters in reverse order: a row such as { 1, 2, 3 } is then read as 3 - 2 = 1.
@Test(dataProvider = "myProvider")
public void testSubtraction(int expectedResult, int b, int a) {
    int result = calculator.calc(a + "-" + b);
    assertEquals(expectedResult, result);
}
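
If the mapping from Object[][] to test invocations still feels opaque, the following sketch mimics it in plain Java, without TestNG. It is only an illustration under assumptions: DataProviderSketch, ThreeArgTest, runAll and check are hypothetical names, and the real framework does a good deal more (reflection, reporting, type conversion).

```java
import java.util.ArrayList;
import java.util.List;

public class DataProviderSketch {

    /** Hypothetical stand-in for a test method taking three int parameters. */
    public interface ThreeArgTest {
        void run(int first, int second, int third);
    }

    /** Calls the test once per row and returns the indexes of the rows that failed. */
    public static List<Integer> runAll(Object[][] rows, ThreeArgTest test) {
        List<Integer> failedRows = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            Object[] row = rows[i];
            try {
                // Each inner array supplies the arguments for one invocation.
                test.run((Integer) row[0], (Integer) row[1], (Integer) row[2]);
            } catch (AssertionError e) {
                failedRows.add(i);
            }
        }
        return failedRows;
    }

    /** Minimal assertion helper so the sketch needs no test library. */
    public static void check(int expected, int actual) {
        if (expected != actual) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }

    public static void main(String[] args) {
        Object[][] rows = {{1, 2, 3}, {4, 5, 9}, {10, 11, 21}};

        // Addition reads each row as (a, b, expected).
        System.out.println(runAll(rows, (a, b, expected) -> check(expected, a + b)));

        // Subtraction reads the same row as (expected, b, a).
        System.out.println(runAll(rows, (expected, b, a) -> check(expected, a - b)));
    }
}
```

Both calls report no failed rows, because every row satisfies a + b = c and therefore also c - b = a.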