In web scraping, text mining and article-reading utilities (such as the Readability bookmarklet), there is an ever-growing demand for tools that can distinguish the parts of an HTML document that make up an article from other common website building blocks such as menus, headers, footers and ads.
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
Extraction is very fast (on the order of milliseconds), needs only the input document (no global or site-level information is required) and is usually quite accurate.
Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.
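To make the idea of separating article text from surrounding clutter concrete, here is a minimal sketch in Python. It is not boilerpipe's code or API. It is a simplified heuristic that, like the description above, works on a single input document only: it splits the page into text blocks and keeps blocks that are long enough and contain few linked words. The names `BlockSplitter` and `extract_main_text`, the tag sets, and the thresholds `min_words` and `max_link_density` are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an illustrative, incomplete choice).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "ul", "ol",
              "nav", "header", "footer", "article", "section"}
# Tags whose text content is never shown to the reader.
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Hypothetical helper: splits a page into (text, linked_word_count) blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words = []
        self._linked = 0    # words seen inside <a> in the current block
        self._in_link = 0   # nesting depth of <a>
        self._skip = 0      # nesting depth of script/style

    def flush(self):
        # Close the current block, if it has any text, and start a new one.
        if self._words:
            self.blocks.append((" ".join(self._words), self._linked))
        self._words, self._linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._linked += len(words)


def extract_main_text(html, min_words=10, max_link_density=0.33):
    """Keep blocks with at least min_words words and a low share of linked words.

    Both thresholds are assumed values for this sketch, not tuned constants.
    """
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, linked in parser.blocks:
        n = len(text.split())
        if n >= min_words and linked / n <= max_link_density:
            kept.append(text)
    return "\n".join(kept)
```

Calling `extract_main_text(page_html)` on a page with a short link-only menu, a long paragraph and a short footer would return only the paragraph. A real extractor would need more robust features and tuning than these two thresholds.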
Harvest the web
If you desperately need...
- a list of all the design studios in London,
- your dream job ads in an Excel file every day,
- hundreds of photos of your favorite movie star,
- all available PDF files about Semantic P2P...
And if you are tired of scrolling down Web pages, scanning text and compulsively clicking, cutting and pasting for hours:
Here is the first beta release of OutWit Hub, your Web Collection Engine.

