Skip to main content

Alain Antone's Library tagged boilerpipe   View Popular, Search in Google

Mar
23
2011

In the world of web scraping, text mining and article reading utilities (readability bookmarklet) there is an ever growing demand for utilities that are capable of distinguishing parts of a HTML document which represent an article apart from other common website building blocks like menus, headers, footers, ads etc.

boilerpipe text extraction html article code

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

Extracting content is very fast (milliseconds), just needs the input document (no global or site-level information required) and is usually quite accurate.

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.

boilerpipe text html extraction java code web api

1 - 2 of 2
Showing 20 items per page

Diigo is about better ways to research, share and collaborate on information. Learn more »

Join Diigo
Move to top