Alain Antone's Library
In web scraping, text mining, and article-reading utilities (such as the Readability bookmarklet), there is an ever-growing demand for tools that can distinguish the parts of an HTML document that make up an article from other common website building blocks such as menus, headers, footers, and ads.
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
Extraction is very fast (milliseconds), needs only the input document (no global or site-level information), and is usually quite accurate.
Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.
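To make the idea of separating article text from "clutter" concrete, here is a minimal Python sketch. It is not boilerpipe's implementation or API. It is a simplified heuristic that assumes a block of text is main content when it has enough words and few of them sit inside links. The names `BlockCollector` and `extract_main_text`, the tag lists, and both thresholds are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "td", "tr", "table", "br",
              "h1", "h2", "h3", "h4", "h5", "h6",
              "article", "section", "nav", "header", "footer"}
# Tags whose text is never shown to the reader.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks: (text, word_count, link_word_count)
        self._words = []      # words of the block being built
        self._link_words = 0  # how many of those words are inside <a>
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it contains any words.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = [text for text, n_words, n_link in parser.blocks
            if n_words >= min_words and n_link / n_words <= max_link_density]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = ('<div><a href="/">Home</a> <a href="/news">News</a></div>'
              '<p>This paragraph stands in for the article body and is long '
              'enough to pass the word-count threshold used above.</p>')
    print(extract_main_text(sample))
```

Short, link-heavy blocks such as menus and footers fall below the thresholds and are dropped, while longer paragraphs pass through. Like the library described above, the sketch needs only the input document, but a real extractor would use richer features and tuned thresholds.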
A collection of text manipulation tools: reversal, counting, etc.
-
Doesn't Google Docs already do this?
No.
Google Docs is a suite of products that do many things, from word processing to spreadsheets to document management. One thing that Google Docs does not do is real-time collaborative text editing. We think this is an important use case, so we built EtherPad with real-time collaboration as the focus.
For example, with Google Docs it takes about 5 to 15 seconds for a change to make its way from your keyboard to other people's screens. Imagine if whiteboards or telephones had this kind of delay! In contrast, the EtherPad infrastructure is built to carry your every keystroke at the speed of light, limited only by the time it takes electrons to travel over a wire (such as an "ethernet" cable).
-
Getting text and other content out of your PDF documents is often a hassle. Adobe Acrobat™ (or your other favorite PDF viewer) can do copy-and-paste, but that's time-consuming and tedious for anything but the smallest jobs. Acrobat™ also has a 'save as text' option, but unless you spring for Acrobat™ Professional, it often generates inaccurate text and simply cannot cope with some languages (especially Chinese, Japanese, and Korean).
