
Mar 23, 2011

In web scraping, text mining, and article-reading tools (such as the Readability bookmarklet), there is an ever-growing demand for utilities that can distinguish the parts of an HTML document that make up an article from other common page building blocks such as menus, headers, footers, and ads.

boilerpipe text extraction html article code

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

Extraction is very fast (milliseconds), needs only the input document (no global or site-level information), and is usually quite accurate.

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.
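
To make the idea of separating article text from clutter concrete, here is a minimal sketch of one common heuristic: keep a block only if it has enough words and few of them sit inside links. This is not boilerpipe's API or its actual algorithm. The class name `ToyContentExtractor`, the regex-based block splitting, and both thresholds are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy extractor; NOT part of boilerpipe.
class ToyContentExtractor {
    // Assumed thresholds, picked for illustration only.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    // Tags treated as block boundaries (a rough assumption, not a real HTML parser).
    private static final Pattern BLOCK_SPLIT =
        Pattern.compile("(?is)</?(?:p|div|li|ul|td|tr|table|h[1-6]|br)\\b[^>]*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Replace every tag with a space and collapse whitespace.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Return the text of blocks that look like main content.
    static List<String> extract(String html) {
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_SPLIT.split(html)) {
            // Count the words that appear inside <a>...</a> in this block.
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkWords += countWords(stripTags(m.group(1)));
            }
            String text = stripTags(block);
            int words = countWords(text);
            if (words == 0) {
                continue; // empty piece between adjacent tags
            }
            double linkDensity = (double) linkWords / words;
            if (words >= MIN_WORDS && linkDensity <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return kept;
    }
}
```

Calling `ToyContentExtractor.extract(html)` on a page with a short navigation block, a long paragraph, and a short footer would keep only the paragraph. A real extractor needs a proper HTML parser and richer features than these two.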

boilerpipe text html extraction java code web api

May 26, 2010

A site that serves no purpose but makes a nice demo.

flip text fun tools upside

Jan 22, 2010

A collection of text manipulation tools: reversal, counting, etc.

tools text online

Nov 26, 2008

  • Doesn't Google Docs already do this?

    No.

    Google Docs is a suite of products that do many things, from word processing to spreadsheets to document management. One thing that Google Docs does not do is real-time collaborative text editing. We think this is an important use case, so we built EtherPad with real-time collaboration as the focus.

    For example, with Google Docs it takes about 5 to 15 seconds for a change to make its way from your keyboard to other people's screens. Imagine if whiteboards or telephones had this kind of delay! In contrast, the EtherPad infrastructure is built to carry your every keystroke at the speed of light, limited only by the time it takes electrons to travel over a wire (such as an "ethernet" cable).

Jun 23, 2008

  • Getting text and other content out of your PDF documents is often a hassle. Adobe Acrobat™ (or your other favorite PDF viewer) can do copy-and-paste, but that's time-consuming and tedious for anything but the smallest jobs. Acrobat™ also has a 'save as text' option, but unless you spring for Acrobat™ Professional, it often generates inaccurate text and simply cannot cope with some languages (especially Chinese, Japanese, and Korean).
