marcell mars's Library
Converting 11 million articles from TIFF to PDFs on Amazon EC2 & S3: Self-service, Prorated Super Computing Fun!
"I was ready to deploy Hadoop and my code on a cluster of EC2 machines. For deployment, I created a custom AMI (Amazon Machine Image) for EC2 that was based on a Xen image from my desktop machine. Using some simple Python scripts and the boto library, I booted four EC2 instances of my custom AMI. [..] thanks to the swell people at Amazon, I got access to a few more machines and churned through all 11 million articles in just under 24 hours using 100 EC2 instances, and generated another 1.5TB of data to store in S3."
-
I was ready to deploy Hadoop and my code on a cluster of EC2 machines. For deployment, I created a custom AMI (Amazon Machine Image) for EC2 that was based on a Xen image from my desktop machine. Using some simple Python scripts and the boto library, I booted four EC2 instances of my custom AMI. I logged in, started Hadoop and submitted a test job to generate a couple thousand articles, and to my surprise it just worked.
I then began some rough calculations and determined that if I used only four machines, it could take some time to generate all 11 million article PDFs. But thanks to the swell people at Amazon, I got access to a few more machines and churned through all 11 million articles in just under 24 hours using 100 EC2 instances, and generated another 1.5TB of data to store in S3.
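The "rough calculations" are not spelled out in the excerpt. A back-of-the-envelope version can be reconstructed from the figures quoted above, assuming (this is not stated in the source) that throughput scales linearly with the number of instances, and treating 24 hours as an upper bound on the run time. All function and variable names are illustrative.

```python
# Figures quoted in the excerpt above.
ARTICLES = 11_000_000
INSTANCES_USED = 100
HOURS_TAKEN = 24.0  # "just under 24 hours": treated here as an upper bound


def articles_per_instance_hour(articles, instances, hours):
    """Average throughput of one instance, assuming the work splits evenly."""
    return articles / (instances * hours)


def estimated_hours(articles, instances, rate):
    """Wall-clock hours for `instances` machines at `rate` articles per instance-hour."""
    return articles / (instances * rate)


rate = articles_per_instance_hour(ARTICLES, INSTANCES_USED, HOURS_TAKEN)
hours_on_four = estimated_hours(ARTICLES, 4, rate)

print(f"{rate:.0f} articles per instance-hour")
print(f"{hours_on_four:.0f} hours (~{hours_on_four / 24:.0f} days) on four instances")
```

Under those assumptions, four instances would need roughly 600 hours, or about 25 days, which is consistent with the remark that four machines "could take some time".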
A short history of btrfs [LWN.net]
"we'll take a behind-the-scenes look at the design and development of btrfs on many levels - technical, political, personal - and trace it from its origins at a workshop to its current position as Linus's root file system. Knowing the background and motivation for each step will help you understand why btrfs was started, how it works, and where it's going in the future. By the end, you should be able to hand-wave your way through a description of btrfs's on-disk format."
RANDOM.ORG - Introduction to Randomness and Random Numbers
"RANDOM.ORG is a true random number service that generates randomness via atmospheric noise. "
A Neighborhood of Infinity: The Three Projections of Doctor Futamura
"The Three Projections of Futamura are a sequence of applications of a programming technique called 'partial evaluation' or 'specialisation', each one more mind-bending than the previous one. But it shouldn't be programmers who have all the fun. So I'm going to try to explain the three projections in a way that non-programmers can maybe understand too."
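For readers who do program, the underlying idea of specialisation can be shown in a few lines. The sketch below is a toy, hand-written specialiser for a single function, not a general partial evaluator and not the projections themselves; the names `power` and `specialize_power` are made up for illustration. It takes a program with two inputs, fixes one of them ahead of time, and emits a smaller residual program that needs only the remaining input.

```python
def power(x, n):
    """General program: both inputs are known only at run time."""
    result = 1
    for _ in range(n):
        result *= x
    return result


def specialize_power(n):
    """Specialise `power` for a fixed n.

    Returns (source, function): the residual program as text, and the
    compiled function. The loop over n disappears because n is already known.
    """
    body = " * ".join(["x"] * n) if n > 0 else "1"
    source = f"def power_{n}(x):\n    return {body}\n"
    namespace = {}
    exec(source, namespace)  # compile the generated residual program
    return source, namespace[f"power_{n}"]


cube_source, cube = specialize_power(3)
print(cube_source)          # the residual program, with the loop unrolled
print(cube(2), power(2, 3))  # both compute the same value
```

Roughly speaking, the first projection applies the same trick to an interpreter: fixing the interpreter's "program" input leaves a residual program that behaves like a compiled version of it.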
erikfrey's bashreduce at master - GitHub
"bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There’s no installation, administration, or distributed filesystem."
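The "mapreduce fashion" can be sketched in Python as partition, run the same tool on each shard, then merge. This is only an illustration of the pattern, not bashreduce's implementation or interface: the function names are hypothetical, and a thread pool merely stands in for the separate machines or cores.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(lines, n_workers):
    """Split the input into n_workers shards, round-robin."""
    return [lines[i::n_workers] for i in range(n_workers)]


def count_words(shard):
    """Map step: word counts for one shard, as a sort-and-count pipeline would give."""
    counts = Counter()
    for line in shard:
        counts.update(line.split())
    return counts


def merge(partials):
    """Reduce step: combine the per-shard counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, n_workers=4):
    """Run the map step on every shard concurrently, then merge."""
    shards = partition(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, shards))
    return merge(partials)


print(word_count(["a b a", "b c", "a"], n_workers=2))
```

The result does not depend on how the lines are sharded, which is what makes it safe to spread the map step over several workers.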
Unladen-swallow - Plans for optimizing Python
"Produce a version of Python at least 5x faster than CPython. "
http://www.ics.uci.edu/~eppstein/PADS/README.txt
"This is PADS, a library of Python Algorithms and Data Structures implemented by David Eppstein of the University of California, Irvine."
Home - MongoDB - 10gen Confluence
"MongoDB is a high-performance, open source, schema-free document database designed for cloud computing. The project's goal is a cloud-scale data store that's easy to deploy, manage and use."
AsmXml - A Fast XML Parser
AsmXml is a very fast XML parser and decoder for x86 platforms. It achieves high speed through three features: it is written in pure assembler, it optimizes memory access, and it parses and decodes at the same time. To give an idea of the relative speed of AsmXml, the fastest open source XML parsers process between 10 and 30 MB of XML per second, while AsmXml processes around 200 MB per second (on an Athlon XP 1800+).
The Wireworld computer
The first ever computer implemented as a cellular automaton that you might reasonably want to write a program for. The design was done by David Moore and Mark Owen, with the help of many others, between 1990 and 1992.
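The blurb does not describe the automaton itself. For orientation, here is a minimal sketch of one Wireworld generation using the standard rules: an electron head becomes a tail, a tail becomes conductor, and conductor becomes a head when exactly one or two of its eight neighbours are heads. The character codes and the list-of-strings grid are choices made for this example, not taken from the linked page.

```python
EMPTY, HEAD, TAIL, WIRE = " ", "H", "t", "."


def step(grid):
    """Return the next generation of a Wireworld grid (list of equal-length strings)."""
    rows, cols = len(grid), len(grid[0])

    def head_neighbours(r, c):
        # Count electron heads among the (up to) eight surrounding cells.
        return sum(
            grid[rr][cc] == HEAD
            for rr in range(max(r - 1, 0), min(r + 2, rows))
            for cc in range(max(c - 1, 0), min(c + 2, cols))
            if (rr, cc) != (r, c)
        )

    def next_state(r, c):
        cell = grid[r][c]
        if cell == HEAD:
            return TAIL
        if cell == TAIL:
            return WIRE
        if cell == WIRE and head_neighbours(r, c) in (1, 2):
            return HEAD
        return cell  # empty cells, and conductor with no signal, are unchanged

    return ["".join(next_state(r, c) for c in range(cols)) for r in range(rows)]


wire = ["tH..."]  # one electron travelling right along a straight wire
for _ in range(3):
    print(wire[0])
    wire = step(wire)
```

Each printed line shows the electron one cell further along the wire; the gates and registers of a computer like the one described are built by arranging such wires so that signals meet and block or pass one another.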
montylingua :: a free, commonsense-enriched natural language understander
MontyLingua is a free*, commonsense-enriched, end-to-end natural language understander for English. Feed raw English text into MontyLingua, and the output will be a semantic interpretation of that text. Perfect for information retrieval and extraction, request processing, and question answering. From English sentences, it extracts subject/verb/object tuples, extracts adjectives, noun phrases and verb phrases, and extracts people's names, places, events, dates and times, and other semantic information.
Goodbye MapReduce, Hello Cascading
Cascading abstracts away MapReduce into a more natural logical model and provides a workflow management layer to handle things like intermediate data and data staleness. Cascading’s logical model abstracts away MapReduce into a convenient tuples, pipes, and taps model.
nodal - generative software application for composing music
Nodal is a generative software application for composing music. It uses a novel method for the notation and playing of MIDI based music. This method is based around the concept of a user-defined graph. The graph consists of nodes (musical events) and edges (connections between events). You interactively define the graph, which is then traversed by any number of players who play the musical events as they encounter them on the graph. The time taken to travel from one node to another is based on the length of the edges that connect the nodes.
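The graph model described above can be sketched as a small traversal. This is an illustration only, not Nodal's actual engine: the data layout, the `play` function, and the rule that a player always follows the first outgoing edge are assumptions made for the example.

```python
def play(graph, notes, start, max_events):
    """Walk the graph from `start` and return a list of (time, note) pairs.

    graph maps each node to a list of (next_node, edge_length) pairs;
    notes maps each node to a MIDI note number.
    Assumption: the player always takes the first outgoing edge.
    """
    events = []
    node, time = start, 0.0
    for _ in range(max_events):
        events.append((time, notes[node]))  # play the event at this node
        edges = graph.get(node, [])
        if not edges:
            break  # dead end: the player stops
        node, length = edges[0]
        time += length  # travel time grows with the length of the edge
    return events


# A three-node cycle: C, E, G with edges of different lengths.
graph = {"a": [("b", 1.0)], "b": [("c", 0.5)], "c": [("a", 2.0)]}
notes = {"a": 60, "b": 64, "c": 67}

for time, note in play(graph, notes, "a", 5):
    print(f"t={time:<4} note={note}")
```

Because timing comes from edge lengths rather than from a fixed grid, redrawing an edge changes the rhythm, and several players started at different nodes would produce interlocking patterns from the same graph.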
Slashdot | Which Open Source Video Apps Use SMP Effectively?
Which open source video conversion apps take full native advantage of SMP? (And before you ask, no, I don't want to pick up the code and add SMP support myself, thanks.)
Twitter Can Be Liberated - Here’s How
Distributing Twitter can’t be done efficiently just via RSS because rapid and excessive polling would bring servers to a halt. Instead, Saad thinks the answer is wrapping RSS in XMPP, an open-standards-based instant messaging protocol that was originally created for Jabber and is now used in various applications including Google Talk. XMPP allows messages to be pushed to subscribers, which removes the need for constant polling. For more of Saad’s thinking, see his site on their product SyncStream; they’ve already written code that will do this, based on their proposed standard called “GetPingd.” Twitter already uses XMPP in its API, and third-party applications like Google Talk already integrate with Twitter via XMPP.
-
Twitter can be decentralized effectively.
UNIX® Load Average Part 1: How It Works
-
Have you ever wondered how those three little numbers that appear in the UNIX® load average (LA) report are calculated?
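The excerpt stops at the question. As a rough answer, the sketch below follows the approach used by the Linux kernel, where each of the three numbers is an exponentially damped moving average of the number of active tasks, sampled every 5 seconds, with time constants of 1, 5 and 15 minutes. The kernel uses fixed-point arithmetic; floating point and the names used here are simplifications for illustration.

```python
import math

SAMPLE_INTERVAL = 5.0           # seconds between samples, as in the Linux kernel
WINDOWS = (60.0, 300.0, 900.0)  # 1-, 5- and 15-minute time constants, in seconds


def update(loads, active):
    """Advance the three averages by one sample, given `active` tasks right now."""
    new_loads = []
    for load, window in zip(loads, WINDOWS):
        decay = math.exp(-SAMPLE_INTERVAL / window)
        # Old value fades by `decay`; the current sample fills in the rest.
        new_loads.append(load * decay + active * (1.0 - decay))
    return tuple(new_loads)


loads = (0.0, 0.0, 0.0)
for _ in range(12):  # one minute with exactly one task always active
    loads = update(loads, active=1)

print("load average: " + ", ".join(f"{x:.2f}" for x in loads))
```

Starting from an idle machine, one minute of a single busy task brings the 1-minute figure only to about 0.63 (that is, 1 - 1/e), which is why the three numbers lag behind sudden changes and why the 15-minute figure moves slowest.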
