Wednesday, September 10, 2014

Parsing HTML with JSoup

I recently had one of those tasks that appears trivial at first, and then reveals itself to be frustratingly complex. I'm working on a system that receives content from a third-party source, but we want to truncate it to fit in a limited space. The problem is that the content may contain HTML, which we want to preserve, but which should obviously not be included in the character count.

The answers I found on Stack Overflow from people facing the same problem were testament to its complexity, weighing in with complex regular expressions or a 156-line method using Collections. I thought there must be a neater way, so I turned to JSoup, an HTML parser for Java.

The solution relies on JSoup's excellent and easy-to-use API for parsing and manipulating the DOM. I wrote a visitor that traverses the DOM tree, collecting text content until the maximum character count is reached. As the text values are passed back up the tree, they are set as the inner HTML of the surrounding element, and then that element's outer HTML (the content together with the tags wrapping it) is passed up to the caller.

The end result is that all HTML tags are preserved. For example, this content:

I am a particularly contrived example

when truncated to 20 characters becomes

I am a particularly...

with the italics and the link still in place.

  public static String truncateText(int maxLength, String textToTruncate)
  {
    // Nothing to do if the raw input, markup included, already fits.
    if (textToTruncate == null || textToTruncate.length() <= maxLength)
    {
      return textToTruncate;
    }

    // Jsoup wraps a fragment in <html><body>, so the content ends up under body.
    Document document = Jsoup.parse(textToTruncate);
    Element body = document.body();

    TextBuilderNodeVisitor visitor = new TextBuilderNodeVisitor(0, maxLength);
    for (Node node : body.childNodes())
    {
      visitor.visit(node);
    }

    return visitor.text.toString();
  }

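  private static class TextBuilderNodeVisitor
  {
    // Appended where the text is cut; assumed to be three dots, as in the example output above.
    private static final String ELLIPSIS = "...";

    // Number of text characters collected so far, markup excluded.
    private int currentLength;

    private final int maxLength;

    private StringBuilder text = new StringBuilder();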

    public TextBuilderNodeVisitor(int currentLength, int maxLength)
    {
      this.currentLength = currentLength;
      this.maxLength = maxLength;
    }

    public void visit(Node node)
    {
      if (currentLength >= maxLength)
      {
        return;
      }

      if (node instanceof TextNode)
      {
        visitTextNode((TextNode) node);
      }
      if (node instanceof Element)
      {
        visitElement((Element) node);
      }
    }

    private void visitTextNode(TextNode node)
    {
      String nodeText = node.text();
      if (currentLength + nodeText.length() > maxLength)
      {
        // Look one character past the limit so a word ending exactly on the limit is kept whole.
        int remaining = maxLength - currentLength;
        String candidate = nodeText.substring(0, remaining + 1);
        int space = candidate.lastIndexOf(' ');
        // Cut at the last space if there is one; otherwise cut hard at the limit.
        int cutoff = space >= 0 ? space : remaining;
        text.append(candidate, 0, cutoff);
        text.append(ELLIPSIS);
        currentLength = maxLength;
      }
      else
      {
        text.append(nodeText);
        currentLength += nodeText.length();
      }
    }

    private void visitElement(Element element)
    {
      TextBuilderNodeVisitor childVisitor = new TextBuilderNodeVisitor(currentLength, maxLength);
      for (Node node : element.childNodes())
      {
        childVisitor.visit(node);
      }
      element.html(childVisitor.text.toString());
      currentLength = childVisitor.currentLength;
      text.append(element.outerHtml());
    }
  }
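
The word-boundary cut inside visitTextNode is easy to get wrong by one character, so it can help to try it on a plain string first. Here's a minimal sketch using only the JDK. The class name WordCut and the method cutAtWord are made up for illustration and are not part of JSoup.

```java
class WordCut {

    /**
     * Cuts text to at most `remaining` characters, backing up to the last
     * space so that a word is not split, and appends an ellipsis.
     */
    static String cutAtWord(String text, int remaining) {
        if (text.length() <= remaining) {
            return text;
        }
        // Look one character past the limit: if that character is a space,
        // the word before it ends exactly on the limit and can be kept.
        String candidate = text.substring(0, remaining + 1);
        int space = candidate.lastIndexOf(' ');
        // No space at all means a single long word, so cut hard at the limit.
        int cutoff = space >= 0 ? space : remaining;
        return candidate.substring(0, cutoff) + "...";
    }

    public static void main(String[] args) {
        System.out.println(cutAtWord("I am a particularly contrived example", 20));
        System.out.println(cutAtWord("Supercalifragilistic", 5));
    }
}
```

The first call prints "I am a particularly..." and the second prints "Super...", which covers both the cut-at-a-space case and the case where there is no space to back up to.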

Monday, March 24, 2014

The fundamental difference between Guice and Spring

If you do a Google search for the difference between Guice and Spring, chances are you will find many, many pages arguing about XML configuration, performance and the relative weight of the packages. However, in my experience most developers fail to realise the core difference between Guice and Spring.

Guice and Spring represent fundamentally different dependency injection paradigms.

In Spring:
  • There is an application context that holds instantiated objects that may be injected into their collaborators.
  • The default scope is singleton, but the context may hold proxies or factories for some objects that require different scopes.
  • If an object in the context requires an instance of a type to be injected that does not exist in the context, an exception will be thrown.
In Guice:
  • There is an injector that can create instances of any type requested.
  • The default scope is to create a new instance for every injection.
  • If an object requires an instance of a type that has no explicit binding, Guice will by default try to instantiate that type itself.

These differences are pretty significant, but I think they are lost on a lot of developers. It's fairly common to see huge Guice module files where everything is bound as a singleton, and if that is the way your application is best structured, perhaps you should be using Spring instead.
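
To make the contrast concrete, here's a toy sketch in plain Java. It uses neither Spring nor Guice: ToyContext, ToyInjector and Mailer are hypothetical names, and the real containers do far more (scopes, constructor arguments, proxies). It only mimics the three points in each list.

```java
import java.util.HashMap;
import java.util.Map;

class InjectionParadigms {

    /** Spring-like: a registry of pre-built instances; unknown types are an error. */
    static class ToyContext {
        private final Map<Class<?>, Object> beans = new HashMap<>();

        <T> void register(Class<T> type, T instance) {
            beans.put(type, instance);
        }

        <T> T get(Class<T> type) {
            Object bean = beans.get(type);
            if (bean == null) {
                throw new IllegalStateException("No bean of type " + type.getName());
            }
            return type.cast(bean); // the same instance every time
        }
    }

    /** Guice-like: any type with a no-arg constructor is created on demand. */
    static class ToyInjector {
        <T> T get(Class<T> type) {
            try {
                return type.getDeclaredConstructor().newInstance(); // a new instance every time
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot create " + type.getName(), e);
            }
        }
    }

    /** Hypothetical collaborator used only for the demonstration. */
    public static class Mailer {
        public Mailer() {}
    }

    public static void main(String[] args) {
        ToyContext context = new ToyContext();
        context.register(Mailer.class, new Mailer());
        System.out.println(context.get(Mailer.class) == context.get(Mailer.class));   // true

        ToyInjector injector = new ToyInjector();
        System.out.println(injector.get(Mailer.class) == injector.get(Mailer.class)); // false

        try {
            context.get(StringBuilder.class); // never registered
        } catch (IllegalStateException e) {
            System.out.println("context: " + e.getMessage());
        }
        // The injector just builds one, although nobody configured it.
        System.out.println(injector.get(StringBuilder.class).getClass().getSimpleName());
    }
}
```

Binding everything as a singleton in a module amounts to rebuilding ToyContext by hand on top of ToyInjector.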

Monday, January 20, 2014

Parameterised tests with TestNG

My previous posts on this blog have both been about how wonderful it is to be able to run exactly the same tests against client and server side code. However, actually doing that is a little bit clunky in JUnit, requiring you to create a separate test method for each implementation of your interface. At best you can do this:
@Test
public void testGetCustomerById_Service() {
    doTestGetCustomerById(service);
}

@Test
public void testGetCustomerById_Client() {
    doTestGetCustomerById(client);
}

public void doTestGetCustomerById(CustomerService service) {
    int id = random.nextInt();
        
    Customer expected = new Customer(id);
    when(customerDao.getById(id)).thenReturn(expected);
        
    Customer actual = service.getCustomerById(id);
    assertEquals(expected, actual);
}
Your choice is between writing 3 methods for each test, or 2 methods containing identical code. Here's something better you can do if you're using TestNG.
@Test(dataProvider = "myProvider")
public void testGetCustomerById(CustomerService service) {
    int id = random.nextInt();
        
    Customer expected = new Customer(id);
    when(customerDao.getById(id)).thenReturn(expected);
        
    Customer actual = service.getCustomerById(id);
    assertEquals(expected, actual);
}

@DataProvider(name = "myProvider")
public Object[][] provider() {
    return new Object[][] {{service}, {client}};
}
That Object[][] looks a bit ugly, but the results are great. Now you have 1 method per test, plus 1 data provider method for the whole test class. The first dimension of the returned array represents each run of the test, while the second allows you to pass multiple parameters. Here's a (slightly contrived) example testing a calculator implementation.
@Test(dataProvider = "myProvider")
public void testAddition(int a, int b, int expectedResult) {
    int result = calculator.calc(a + "+" + b);
    assertEquals(expectedResult, result);
}

@DataProvider(name = "myProvider")
public Object[][] provider() {
    return new Object[][] {{ 1, 2, 3 }, { 4, 5, 9 }, { 10, 11, 21 }};
}
And you can use the same provider to test subtraction.
@Test(dataProvider = "myProvider")
public void testSubtraction(int expectedResult, int b, int a) {
    int result = calculator.calc(a + "-" + b);
    assertEquals(expectedResult, result);
}
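
The subtraction test gets away with the same rows because it declares its parameters in the reverse order: every row {x, y, z} satisfies x + y = z, so reading it as (expectedResult, b, a) gives a - b = z - y = x. Here's a rough sketch of what a runner has to do with the Object[][], written in plain Java with reflection. It is not TestNG's actual implementation, and MiniDataProviderRunner is a made-up name.

```java
import java.lang.reflect.Method;

class MiniDataProviderRunner {

    // Same rows as the provider above: each inner array is one invocation.
    static Object[][] provider() {
        return new Object[][] {{1, 2, 3}, {4, 5, 9}, {10, 11, 21}};
    }

    // Reads a row as (a, b, expectedResult).
    static void testAddition(int a, int b, int expectedResult) {
        if (a + b != expectedResult) {
            throw new AssertionError(a + " + " + b + " != " + expectedResult);
        }
    }

    // Reads the same row as (expectedResult, b, a).
    static void testSubtraction(int expectedResult, int b, int a) {
        if (a - b != expectedResult) {
            throw new AssertionError(a + " - " + b + " != " + expectedResult);
        }
    }

    /** Invokes the named test once per row and returns the number of runs. */
    static int run(String testName) {
        try {
            Method test = MiniDataProviderRunner.class
                    .getDeclaredMethod(testName, int.class, int.class, int.class);
            int runs = 0;
            for (Object[] row : provider()) {
                test.invoke(null, row); // row elements are bound to parameters by position
                runs++;
            }
            return runs;
        } catch (ReflectiveOperationException e) {
            // Also covers a failing test, which surfaces as InvocationTargetException.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("addition runs: " + run("testAddition"));
        System.out.println("subtraction runs: " + run("testSubtraction"));
    }
}
```

Both tests run three times, once per row, and all six invocations pass.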

Monday, January 13, 2014

JAX-RS exception handling using CXF

One thing that I think is very important when implementing a RESTful service is to use the appropriate HTTP response codes for error statuses. JAX-RS provides a few ready-made exceptions that do this, so my getCustomerById method from the previous post can throw a NotFoundException if the customer does not exist.

But I'm using an underlying API that throws an exception if it receives too many requests in a short space of time. JAX-RS doesn't provide a TooManyRequestsException so I'll have to make one of my own.
public class TooManyRequestsException extends WebApplicationException {

    public TooManyRequestsException() {
        super(429);
    }

}
And here's a test for it:
@Test(expected=TooManyRequestsException.class)
public void testGetCustomerById_tooManyRequests() {
    int id = random.nextInt();
        
    when(customerDao.getById(id)).thenThrow(new TooManyRequestsException());
        
    customerService.getCustomerById(id);
}
The test above works as a unit test, but as an integration test using the client the exception caught is ClientErrorException. The CXF client doesn't know how to deal with the 429 status code and, frustratingly, CXF doesn't give us any mechanism to register a custom exception on the client side.

Fortunately, a little digging around in the source code reveals a workaround. The class JAXRSUtils contains a static map of exception classes mapped to HTTP status codes. When it receives an HTTP response with an error status it attempts to instantiate one of these exceptions using a constructor that takes a JAX-RS Response object. So I'll add that constructor to my TooManyRequestsException.
public TooManyRequestsException(Response response) {
    super(response);
}
The map is private, so we have to do some reflection hackery to get it in there. You can put this code anywhere it will be run by the client, such as a static block, but I decided the best thing for me was to create my own custom JAXRSClientFactory.
public class CustomJAXRSClientFactory {

    private final String url;

    private final List<Object> providers = new ArrayList<>();

    public CustomJAXRSClientFactory(String url) {
        registerExceptions();

        this.url = url;
        providers.add(new JacksonJsonProvider());
    }

    /** This is where we register the exceptions with the CXF client */
    private void registerExceptions() {
        try {
            Field f = JAXRSUtils.class.getDeclaredField("EXCEPTIONS_MAP");
            f.setAccessible(true);
            Map<Integer, Class<?>> m = (Map<Integer, Class<?>>) f.get(null);

            m.put(429, TooManyRequestsException.class);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public <T> T create(Class<T> serviceInterface) {
        JAXRSClientFactoryBean bean = new JAXRSClientFactoryBean();
        bean.setProviders(providers);
        bean.setAddress(url);
        bean.setServiceClass(serviceInterface);
        return bean.create(serviceInterface);
    }

}
Now I replace the client code in my test with this:
CustomJAXRSClientFactory clientFactory = new CustomJAXRSClientFactory("http://localhost:9090");
customerService = clientFactory.create(CustomerService.class);
And my integration test passes :)
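
If the reflection trick looks opaque, here's the same idea reduced to plain Java with no CXF on the classpath. Registry is a hypothetical stand-in for a library class that keeps a private static map of status codes to exception classes, and the JDK exception types stand in for the JAX-RS ones. This is a sketch of the technique, not of CXF's code.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

class PrivateMapDemo {

    /** Stand-in for a library class that hides its registry. */
    static class Registry {
        private static final Map<Integer, Class<?>> EXCEPTIONS_MAP = new HashMap<>();

        static {
            EXCEPTIONS_MAP.put(404, IllegalArgumentException.class);
        }

        /** Builds the exception registered for a status, or a generic one. */
        static RuntimeException toException(int status) {
            Class<?> type = EXCEPTIONS_MAP.get(status);
            if (type == null) {
                return new RuntimeException("HTTP " + status);
            }
            try {
                // The registered class must offer the constructor the library expects.
                return (RuntimeException) type.getConstructor(String.class)
                        .newInstance("HTTP " + status);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    /** Reaches into the private static map and adds an entry. */
    @SuppressWarnings("unchecked")
    static void register(int status, Class<? extends RuntimeException> type) {
        try {
            Field f = Registry.class.getDeclaredField("EXCEPTIONS_MAP");
            f.setAccessible(true);
            Map<Integer, Class<?>> map = (Map<Integer, Class<?>>) f.get(null);
            map.put(status, type);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Registry.toException(429).getClass().getSimpleName());
        register(429, IllegalStateException.class);
        System.out.println(Registry.toException(429).getClass().getSimpleName());
    }
}
```

Before registration the 429 comes back as a plain RuntimeException, afterwards as the registered type. The obvious weakness is that it depends on a private field name, so a library upgrade that renames or moves the map will break it at runtime rather than at compile time.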

Tuesday, January 07, 2014

Using CXF as a JAX-RS server and client

I recently implemented a JAX-RS service and client using CXF and I think it's brilliant. Why? Because here's a unit test for my service code:
@Test
public void testGetCustomerById() {
    int id = random.nextInt();
        
    Customer expected = new Customer(id);
    when(customerDao.getById(id)).thenReturn(expected);
        
    Customer actual = customerService.getCustomerById(id);
    assertEquals(expected, actual);
}
And here's an integration test using the client:
@Test
public void testGetCustomerById() {
    int id = random.nextInt();
        
    Customer expected = new Customer(id);
    when(customerDao.getById(id)).thenReturn(expected);
        
    Customer actual = customerService.getCustomerById(id);
    assertEquals(expected, actual);
}
Yup, they're the same test! This makes me happy.

So, how is it done? We'll start with the service interface.
@Path("/customer")
public interface CustomerService {
    
    @GET
    @Path("/{id}")
    @Produces(MediaType.APPLICATION_JSON)
    Customer getCustomerById(@PathParam("id") int id);
}
And an implementation that satisfies the unit test above.
public class CustomerServiceImpl implements CustomerService {
    
    @Inject CustomerDao customerDao;

    @Override
    public Customer getCustomerById(int id) {
        return customerDao.getById(id);
    }
}
Wiring up the unit test is simple, so here's what the integration test looks like in full:
public class CustomerServiceClientTest {
    
    // This is the client
    private CustomerService customerService;

    // Mock DAO behind the server-side implementation
    private CustomerDao customerDao;

    // CXF JAX-RS server
    private Server server;

    private Random random = new Random();

    @Before
    public void before() {
        customerDao = mock(CustomerDao.class);
        CustomerServiceImpl serviceImpl = new CustomerServiceImpl();
        serviceImpl.customerDao = customerDao;

        JAXRSServerFactoryBean serverFactory = new JAXRSServerFactoryBean();
        serverFactory.setAddress("http://localhost:9090");
        serverFactory.setProvider(new JacksonJsonProvider());
        serverFactory.setServiceBean(serviceImpl);
        server = serverFactory.create();

        JAXRSClientFactoryBean clientFactory = new JAXRSClientFactoryBean();
        clientFactory.setAddress("http://localhost:9090");
        clientFactory.setProvider(new JacksonJsonProvider());
        clientFactory.setServiceClass(CustomerService.class);
        // create() with no arguments returns a generic Client, so ask for the typed proxy
        customerService = clientFactory.create(CustomerService.class);
    }

    @After
    public void after() {
        server.destroy();
    }

    @Test
    public void testGetCustomerById() {
        int id = random.nextInt();
        
        Customer expected = new Customer(id);
        when(customerDao.getById(id)).thenReturn(expected);
        
        Customer actual = customerService.getCustomerById(id);
        assertEquals(expected, actual);
    }
}
The key thing here is the JAXRSClientFactoryBean, which creates an implementation of CustomerService that makes calls to the RESTful service running at localhost:9090.

What this means is that if you publish an artifact containing just the interface, your client-side project can make use of your RESTful service by directly calling the interface methods in Java, hiding the JAX-RS mechanics completely.
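
One way to picture what the client factory hands back is a JDK dynamic proxy: an object that implements the interface at runtime and turns each method call into a request. The sketch below is my own illustration in plain Java and makes no claim about CXF's internals. GreetingService, createClient and the URL layout are all hypothetical, and instead of doing real HTTP it just records the request it would have sent.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

class ProxyClientSketch {

    /** Hypothetical service interface; a real one would carry JAX-RS annotations. */
    public interface GreetingService {
        String greet(int id);
    }

    /** Records the "requests" a proxy would send instead of doing real HTTP. */
    static final List<String> sentRequests = new ArrayList<>();

    /** Builds an implementation of the interface whose methods describe a GET request. */
    static <T> T createClient(Class<T> serviceInterface, String baseUrl) {
        InvocationHandler handler = (self, method, methodArgs) -> {
            // Assumed URL layout for the sketch: base URL, method name, first argument.
            String arg = methodArgs == null ? "" : String.valueOf(methodArgs[0]);
            String request = "GET " + baseUrl + "/" + method.getName() + "/" + arg;
            sentRequests.add(request);
            return "response to " + request;
        };
        Object client = Proxy.newProxyInstance(
                serviceInterface.getClassLoader(),
                new Class<?>[] {serviceInterface},
                handler);
        return serviceInterface.cast(client);
    }

    public static void main(String[] args) {
        GreetingService client = createClient(GreetingService.class, "http://localhost:9090");
        System.out.println(client.greet(42));
        System.out.println(sentRequests);
    }
}
```

The caller only ever sees GreetingService, which is the same property that lets the unit test and the integration test share their code.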