
My Favorite jQuery Plugin Template - Moshe's Blog


URL:http://kolodny.github.io/blog/blog/2013/12/27/my-favorite-jquery-plugin-template/


I’ve dabbled quite a bit in jQuery and writing plugins for it. I’ve played around with quite a few different ways to start a plugin, and now I’ve got a new favorite:

;(function($) {
  // multiple plugins can go here
  (function(pluginName) {
    var defaults = {
      color: 'black',
      testFor: function(div) {
        return true;
      }
    };
    $.fn[pluginName] = function(options) {
      options = $.extend(true, {}, defaults, options);
      return this.each(function() {
        var elem = this,
            $elem = $(elem);
        // here's the guts of the plugin
        if (options.testFor(elem)) {
          $elem.css({
            borderWidth: 1,
            borderStyle: 'solid',
            borderColor: options.color
          });
        }
      });
    };
    $.fn[pluginName].defaults = defaults;
  })('borderize');
})(jQuery);

Now let’s see how we would use it.

$('div').borderize();
$('div').borderize({color: 'red'});

Here are some of the reasons I like this technique:

  • You can still use a default option inside of an override (similar to calling a parent property in class-based programming).
  • It’s easy to change the name of the plugin, as long as we use pluginName everywhere (there’s also an insignificant minification advantage to that).
  • It’s cleaner (at least in my opinion).

Point #1 is huge, so let’s see an example of that. Let’s say we want to override the testFor function but still want the option of falling back to the original behaviour.

$('.borderize').borderize({
  testFor: function(elem) {
    var $elem = $(elem);
    if ($elem.is('.inactive')) {
      return false;
    } else {
      // calling "parent" function
      return $.fn.borderize.defaults.testFor.apply(this, arguments);
    }
  }
});

We can even do this with regular properties, like this:

var someVarThatMayBeSet = false;
/* code ... */
$('.borderize').borderize({
  color: someVarThatMayBeSet ? 'red' : $.fn.borderize.defaults.color
});

Do you have a different style that you like? Leave a comment below.

Edit: I’ve changed the $.each call to $.extend(true, {}, defaults, options) based on phlyingpenguin’s comment.



Reflected hidden faces in photographs revealed in pupil | KurzweilAI


URL:http://www.kurzweilai.net/reflected-hidden-faces-in-photographs-revealed-in-pupil


Zooming in on the pupil of a subject’s eye reveals hidden bystanders (credit: Rob Jenkins)

The pupil* of the eye in a photograph of a face can be mined for hidden information, such as reflected faces of the photographer and bystanders, according to research led by Dr. Rob Jenkins, of the Department of Psychology at the University of York and published in PLOS ONE (open access).

The researchers say that in crimes in which the victims are photographed, such as hostage taking or child sex abuse, reflections in the eyes of the photographic subject could help to identify perpetrators. Images of people retrieved from cameras seized as evidence during criminal investigations could be used to piece together networks of associates or to link individuals to particular locations.

By zooming in on high-resolution passport-style photographs, Jenkins and co-researcher Christie Kerr of the School of Psychology, University of Glasgow were able to recover bystander images that could be identified accurately by observers, despite their low resolution.

Lineup-style array of reflected images from photographs for spontaneous recognition task in experiment. All participants were familiar with the face of the psychologist and unfamiliar with the faces of the bystanders. Correct naming of the familiar face was frequent (hits 90%), and mistaken identification of the unfamiliar faces was infrequent (false positives 10%). (Credit: Rob Jenkins)

To establish whether these bystanders could be identified from the reflection images, the researchers presented them as stimuli in a face-matching task. Observers who were unfamiliar with the bystanders’ faces performed at 71 per cent accuracy, while participants who were familiar with the faces performed at 84 per cent accuracy. In a test of spontaneous recognition, observers could reliably name a familiar face from an eye reflection image.

“The pupil of the eye is like a black mirror,” said Jenkins. “To enhance the image, you have to zoom in and adjust the contrast. A face image that is recovered from a reflection in the subject’s eye is about 30,000 times smaller than the subject’s face.” In the research, the whole-face area for the reflected bystanders was 322 pixels on average.

Forensics implications

You probably recognize this well-known person, even though his face in this image measures only 16 pixels wide × 20 pixels high. (Photo credit: Steve Jurvetson)

High-resolution face photographs may also contain unexpected information about the environment of the photographic subject, including the appearance of the immediate surroundings, Jenkins explained to KurzweilAI.

“In the context of criminal investigations, this could be used to piece together networks of associates, or to link individuals to particular locations. This may be especially important for categories of crime in which perpetrators photograph their victims. Reflections in the victims’ eyes could reveal the identity of the photographer.

“Also, around 40 million photographs per day are uploaded to Instagram alone,” he pointed out. “Faces are among the most frequently photographed objects. Our study serves as a reminder to be careful what you upload. Eyes in the photographs could reveal where you were and who you were with.”

Although Jenkins did the study with a high-resolution (39 megapixels) Hasselblad camera, face images retrieved from eye reflections need not be of high quality in order to be identifiable, he said. “Obtaining optimal viewers — those who are familiar with the faces concerned — may be more important than obtaining optimal images.”

In addition, “in accordance with Hendy’s Law (a derivative of Moore’s Law), pixel count per dollar for digital cameras has been doubling approximately every twelve months. This trajectory implies that mobile phones could soon carry >39 megapixel cameras routinely.”
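As a rough, purely illustrative calculation of that doubling claim (the 8-megapixel starting point is an assumption for illustration, not a figure from the article, and Hendy’s Law is strictly about pixels per dollar):

megapixels, years = 8, 0   # assumed starting resolution for a phone camera
while megapixels <= 39:    # the threshold quoted above
    megapixels *= 2        # one doubling per year
    years += 1
print(years, megapixels)   # 3 years -> 64 megapixels under these assumptions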

It would be interesting to see what hidden information is buried in law-enforcement (and other) photo archives — some of which could even help exculpate innocent persons.

*Technically, the image is reflected by the cornea.

Abstract of PLOS ONE paper

Criminal investigations often use photographic evidence to identify suspects. Here we combined robust face perception and high-resolution photography to mine face photographs for hidden information. By zooming in on high-resolution face photographs, we were able to recover images of unseen bystanders from reflections in the subjects’ eyes. To establish whether these bystanders could be identified from the reflection images, we presented them as stimuli in a face matching task (Experiment 1). Accuracy in the face matching task was well above chance (50%), despite the unpromising source of the stimuli. Participants who were unfamiliar with the bystanders’ faces (n = 16) performed at 71% accuracy [t(15) = 7.64, p<.0001, d = 1.91], and participants who were familiar with the faces (n = 16) performed at 84% accuracy [t(15) = 11.15, p<.0001, d = 2.79]. In a test of spontaneous recognition (Experiment 2), observers could reliably name a familiar face from an eye reflection image. For crimes in which the victims are photographed (e.g., hostage taking, child sex abuse), reflections in the eyes of the photographic subject could help to identify perpetrators.

SheetJS Rants and Raves — Running your JS code in Python


URL:http://blog.sheetjs.com/post/71326534924/running-your-js-code-in-python


tl;dr: PyPI package xls (repo https://github.com/SheetJS/py-xls)

I wanted to see if it was possible to run the JS XLS module from Python, partially because I needed an XLS parser in a python tool but mostly to prove to myself that it can be done.

Running small fragments of JS is easy with PyV8:

$ sudo pip install PyV8
$ python
>>> import PyV8
>>> ctx = PyV8.JSContext()
>>> ctx.enter()
>>> ctx.eval('1+1')
2

Variables are available:

>>> ctx.eval('var x = 1+1')
>>> ctx.eval('Math.pow(x,x)')
4

However, the context is bare-bones. Even the mighty console is missing:

>>> ctx.eval('console.log(x)')
ReferenceError: ReferenceError: console is not defined ( @ 1 : 0 ) -> console.log(x)

Fortunately, you can configure the context with your own console:

class MyConsole(PyV8.JSClass):
    def log(self, *args):
        print(" ".join([str(x) for x in args]))

class GlobalContext(PyV8.JSClass):
    console = MyConsole()

ctx = PyV8.JSContext(GlobalContext())

… which works as expected:

>>> ctx.enter(); ctx.eval('console.log("foo bar baz qux")')
foo bar baz qux

For larger scripts, I found it easy to just put the entire code block within a triple-quoted string. The only caveat is that backslashes have to be escaped. To automate the process, the code is separated into a Python header and footer, with a Makefile to bring everything together. (Note: the header ends with a triple quote and the footer starts with a triple quote.)
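To make that concrete, here is a hedged sketch of the pattern; the file names (xls.js, generated_module.py) are hypothetical and the actual py-xls repository drives this from its Makefile:

# build.py: a sketch of the header/JS/footer concatenation (illustrative only)
header = 'import PyV8\n\nJS_SOURCE = """\n'   # the header ends with a triple quote
footer = '"""\n\nctx = PyV8.JSContext()\nctx.enter()\nctx.eval(JS_SOURCE)\n'

with open('xls.js') as f:        # the unmodified JS library
    js = f.read()

# the caveat from above: backslashes inside the JS must be escaped
js = js.replace('\\', '\\\\')

with open('generated_module.py', 'w') as out:  # the footer starts with a triple quote
    out.write(header + js + footer)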

In Python code, you can access the variables in the JS scope with ctx.locals['…'], and numbers are readily available:

>>> ctx.eval('x = 1+1')
2
>>> ctx.locals['x']
2

However, arrays are passed as JSArray objects. The easiest way to capture the Python values is to slice the array:

>>> ctx.eval('y = [1,2]')
<_PyV8.JSArray object at 0x106ebc5f0>
>>> ctx.locals['y']
<_PyV8.JSArray object at 0x106b82050>
>>> ctx.locals['y'][:]
[1, 2]

PyV8 automatically converts numeric and string literals from Python to JS form, making it a breeze to call functions:

>>> ctx.eval('z=function(foo,bar) { return foo+bar; }')
>>> ctx.locals['z'](1,2)
3

I suspect it would be possible to build up a wrapper that emulates the entire NodeJS javascript API in PyV8, making it easy for people to bring pure-JS modules into Python.  But that’s a task for another day.
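As a very rough sketch of what the beginnings of such a wrapper could look like (illustrative only, and not how py-xls works; a real shim would also need process, Buffer, fs, timers and much more), one could expose a minimal CommonJS-style require from the Python side:

import PyV8

ctx = PyV8.JSContext()
ctx.enter()

def require(name):
    """Hypothetical minimal CommonJS-style loader: wrap the module source so
    that 'exports' and 'module.exports' exist, evaluate it, and return the
    exports object to Python."""
    with open(name + '.js') as f:
        src = f.read()
    wrapped = ('(function(){ var module = {exports: {}};'
               ' var exports = module.exports;\n'
               + src +
               '\n; return module.exports; })()')
    return ctx.eval(wrapped)

# usage sketch: mod = require('some_module'); mod.someExportedFunction(...)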

Learn to think big


URL:http://www.gabrielweinberg.com/blog/2013/12/learn-to-think-big.html


I often get asked to give advice to people starting on the startup career path. I didn't have an obvious starting point until now.

You should learn to think big. It's the precursor to choosing an ambitious startup idea, which I also strongly recommend.

Unfortunately, thinking big is easier advised than explained. It's hard to explain because thinking big is one of those things that you don't know you're doing wrong until you're doing it right. It's kind of like a high-school relationship vs a good marriage, which I realize is an analogy that also doesn't help a lot of people.

The good news is there is an easy path to getting there quickly -- get a good mentor who already thinks big and then they can keep telling you that you're not thinking big enough or not thinking the right kind of big until you are. In retrospect, this is probably the most concrete thing I know I could have done better years earlier if I had utilized mentors more effectively.

I didn't start to truly grasp thinking big until 2010. That's a decade after doing startups full-time and when I was already thirty. I can't point to one particular moment, but looking back I think I finally got there through a combination of starting to angel invest (and seeing my portfolio struggling with this), committing to blogging for that year (and the benefits that come with it), and starting to see how DuckDuckGo could eventually make a real dent in the search market.

It wasn't for lack of confidence. I've been pretty confident since high school. And it's not like I wasn't doing relatively big or unique things through my twenties. For example, I started and sold a company for millions of dollars.

That may seem like it is only possible through thinking big. But that's not true. I was thinking very small. I was thinking how I personally could make a successful business.

And in that case I took this small thinking to the extreme, which makes it a particularly compelling example. I wrote every line of code and we never hired any employees from beginning to exit. That kind of thinking doesn't scale and it won't maximize your chance of making real change in the world, which I believe is real startup success.

However, it may make you just as much money. And that's an important distinction. Thinking big and making money are not the same thing. If your goal is to make money then an indie company (like my last one) may be your best bet, or some other industry altogether like finance.

But if your primary goal is to change the world, then the first thing you should do is learn to think big. I suggest working on a real plan that sketches out how your vision could impact the world. What does your vision look like at scale? How many people or businesses could you possibly impact? How would you change what people do or how they think on a day-to-day basis?

I've repeatedly seen two common ways this is messed up. First, like I did, you only focus on the first million. This leads to niche ideas in smaller markets and a lack of long-term planning that will bite you when you get to that million.

Second, you put too much detail into what things look like five years from now. You should have a detailed 12-18 month plan, a good sketch for 24-36 months and rough ideas thereafter. The wrong approach here is saying in version 5 we're going to build this and that feature and have these revenue streams because you have no idea right now. In other words you shouldn't have a consistent level of detail but a rapidly declining one. But you can and should have a rough idea of what changes your product may bring if it is a market leader in a large market and how it might evolve to that reality.

The trickiest part about thinking big is you still need to think small at the same time while executing. You need that grand vision, but you also need to bring that back to today in the form of strategy, goals, priorities and of course your very next steps. Being able to articulate both the big and small in a succinct way that makes plausible sense is a signal of a great entrepreneur.

In fact I arguably should have titled this Learn to think big and small since that is ultimately what you need to do, but really I don't see too many people just thinking big except for wannabe entrepreneurs. As soon as you make your first move, you're thinking small because you have to start somewhere and do something.

From the investor point of view, I see a lot of thinking small but not a lot of cohesive big visions, and that's why I have a hunch that the lack of thinking big (often manifested in the two ways above) is the biggest disconnect between startup founders and investors. Looking back at my own wall of shame, I think it is the strongest signal that I missed in the ultimately successful companies I passed on, and that wasn't present in the ultimately less successful startups.

I don't believe there are any necessary or sufficient conditions for startup success, but along with questioning your operating assumptions enough, learning to think big is another one of the key ways to maximize its likelihood.

anvaka/atree · GitHub


URL:https://github.com/anvaka/atree


Christmas Tree

I found an animated Christmas tree on reddit.

I was curious to make it in JS. This is a work in progress. Current results are here: Merry Christmas!

How is it built?

The tree is built of two spirals. These 11 lines of code render one line of the spiral, including the 3D projection and the background shadow. It's almost the same as this wiki image.
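For reference, the underlying idea is roughly this (a language-agnostic sketch in Python rather than the repo's actual JavaScript; all parameters here are made up): walk along a cone-shaped spiral and project each 3D point onto the screen with a naive perspective divide.

import math

def spiral_points(turns=5, steps=400, radius=1.0, height=2.0, camera_z=4.0):
    # walk down a cone-shaped spiral and project each 3D point to 2D
    for i in range(steps):
        t = i / float(steps)
        angle = 2 * math.pi * turns * t
        x = radius * (1 - t) * math.cos(angle)   # the cone narrows towards the top
        y = height * t
        z = radius * (1 - t) * math.sin(angle)
        scale = camera_z / (camera_z - z)        # naive perspective projection
        yield x * scale, y * scale               # screen-space coordinates

for sx, sy in spiral_points(steps=10):
    print(round(sx, 2), round(sy, 2))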

I tried to replicate the original animated GIF of a tree, but...

It's not perfect yet

Want to help make it better? There are small things which are not perfect:

  • I do not properly mimic spiral's curvature at the top of the tree
  • Colors need to change more gradually, depending on Z coordinate
  • Shadows are not accurate
  • I think the code is more complex than it should be.

Happy Holidays!

license

MIT


N.S.A. Phone Surveillance Is Lawful, Federal Judge Rules - NYTimes.com


URL:http://www.nytimes.com/2013/12/28/us/nsa-phone-surveillance-is-lawful-federal-judge-rules.html


WASHINGTON — A federal judge in New York on Friday ruled that the National Security Agency’s program that is systematically keeping phone records of all Americans is lawful, creating a conflict among lower courts and increasing the likelihood that the issue will be resolved by the Supreme Court.

In the ruling, Judge William H. Pauley III, of the United States District Court for the Southern District of New York, granted a motion filed by the federal government to dismiss a challenge to the program brought by the American Civil Liberties Union, which had tried to halt the program.

Judge Pauley said that protections under the Fourth Amendment do not apply to records held by third parties, like phone companies.

“This blunt tool only works because it collects everything,” Judge Pauley said in the ruling.

“While robust discussions are underway across the nation, in Congress and at the White House, the question for this court is whether the government’s bulk telephony metadata program is lawful. This court finds it is,” he added.

A spokesman for the Justice Department said, “We are pleased the court found the N.S.A.'s bulk telephony metadata collection program to be lawful.” He declined to comment further.

Jameel Jaffer, the A.C.L.U. deputy legal director, said the group intended to appeal. “We are extremely disappointed with this decision, which misinterprets the relevant statutes, understates the privacy implications of the government’s surveillance and misapplies a narrow and outdated precedent to read away core constitutional protections,” he said.

The ruling comes nearly two weeks after Judge Richard J. Leon of Federal District Court for the District of Columbia said the program most likely violated the Fourth Amendment. As part of that ruling, Judge Leon ordered the government to stop collecting data on two plaintiffs who brought the case against the government.

In his ruling, Judge Leon said that the program “infringes on ‘that degree of privacy’ that the founders enshrined in the Fourth Amendment,” which prohibits unreasonable searches and seizures.

While Judge Leon ordered the government to stop collecting data on the two plaintiffs, he stayed the ruling, giving the government time to appeal the decision.

Judge Pauley, whose courtroom is just blocks from where the World Trade Center towers stood, endorsed arguments made in recent months by senior government officials — including the former F.B.I. director Robert S. Mueller III — that the program may have caught the Sept. 11, 2001, hijackers had it been in place before the attacks.

In the months before Sept. 11, the N.S.A. had intercepted several calls made to an Al Qaeda safe house in Yemen. But because the N.S.A. was not tracking all phone calls made from the United States, it did not detect that the calls were coming from one of the hijackers who was living in San Diego.

“Telephony metadata would have furnished the missing information and might have permitted the N.S.A. to notify the Federal Bureau of Investigation of the fact that al-Mihdhar was calling the Yemeni safe house from inside the United States,” Judge Pauley said, referring to the hijacker, Khalid al-Mihdhar.

Judge Pauley said that the “government learned from its mistake and adapted to confront a new enemy: a terror network capable of orchestrating attacks across the world.”

The government, he added, “launched a number of countermeasures, including a bulk telephony metadata collection program — a wide net that could find and isolate gossamer contacts among suspected terrorists in an ocean of seemingly disconnected data.”

The main dispute between Judge Pauley and Judge Leon was over how to interpret a 1979 Supreme Court decision, Smith v. Maryland, in which the court said a robbery suspect had no reasonable expectation that his right to privacy extended to the numbers dialed from his phone.

“Smith’s bedrock holding is that an individual has no legitimate expectation of privacy in information provided to third parties,” Judge Pauley wrote.

But Judge Leon said in his ruling that advances in technology and suggestions in concurring opinions in later Supreme Court decisions had undermined Smith. The government’s ability to construct a mosaic of information from countless records, he said, called for a new analysis of how to apply the Fourth Amendment’s prohibition of unreasonable government searches.

Judge Pauley disagreed. “The collection of breathtaking amounts of information unprotected by the Fourth Amendment does not transform that sweep into a Fourth Amendment search,” he wrote.

He acknowledged that “five justices appeared to be grappling with how the Fourth Amendment applies to technological advances” in a pair of 2012 concurrences in United States v. Jones. In that decision, the court unanimously rejected the use of a GPS device to track the movements of a drug suspect over a month. The majority in the 2012 case said that attaching the device violated the defendant’s property rights.

In one of the concurrences, Justice Sonia Sotomayor wrote that “it may be necessary to reconsider the premise that an individual has no reasonable expectation of privacy in information voluntarily disclosed to third parties.”

But Judge Pauley wrote that the 2012 decision did not overrule the one from 1979. “The Supreme Court,” he said, “has instructed lower courts not to predict whether it would overrule a precedent even if its reasoning has been supplanted by later cases.”

As for changes in technology, he wrote, customers’ “relationship with their telecommunications providers has not changed and is just as frustrating.”


fogus: The best things and stuff of 2013


URL:http://blog.fogus.me/2013/12/27/the-best-things-and-stuff-of-2013/


Dec 27, 2013

Great things and people that I discovered, learned, read, met, etc. in 2013. No particular ordering is implied. Not everything is new.

also: see the lists from 2012, 2011 and 2010

Join in on the Hacker News discussion.

Great blog posts read

Most viewed blog posts by me (20K+ viewers)

  • 10 Technical Papers Every Programmer Should Read (At Least Twice) — My most popular post of 2011 was also my most popular of 2012 and also of 2013 — go figure.
  • FP vs. OO, from the trenches — Really just an anecdote about where I’ve found functional programming useful over object-orientation and vice versa. For some reason it was popular for a few days — or at least controversial.
  • Fun.js — My announcement of my book “Functional JavaScript” made the Internet rounds. Plus the whole Fun.js series garnered a crap-ton of views and some discussion.
  • C.S. on the Cheap — My idea for a Dover-like publication run of computer science books.
  • Scala: Sharp and Gets Things Cut — Kind of a rant about the way that Scala is marketed that came off more critical than I wanted.
  • Enfield: a programming language designed for pedagogy — A description of the perfect programming language for exploration.
  • Computerists — A bit of cynicism on my part about computer “science.”

Favorite technical books discovered (and read)

Favorite non-technical books read

  • Earth Abides by George Stewart — My favorite entry in the family of global-pandemic-centric science fiction novels.

  • The Invention of Morel — A beautifully written book about an island of … I can’t say without giving away too much.

  • The Autobiography of Malcolm X by Alex Haley — What can I add about this book that hasn’t been said 1000 times already. A fascinating life.

  • Grooks by Piet Hein — Wonderful, poems… no… aphorisms… no… sketches… no… Grooks.

  • A Gamut of Games by Sid Sackson — The book that really kicked my recent obsession with games into high-gear.

Number of books read

a bunch

Number of books published

1 (with another due early 2014), plus a newsletter about Lispiness.

Number of books written

2

Number of papers read

≈ 20 (a very slow year for me in the paper department, sadly)

Number of papers read deeply

3

Language zoo additions

Zeder

Favorite musicians discovered

Om, Alien Sex Fiend, The Fiery Furnaces

Favorite games discovered

I’ve discovered gaming at a late age. That’s not to say that I never played games. In fact, I’ve played my share of Chess, Checkers, Go, Risk, Gin Rummy, Hearts and Uno, but for one reason or another I never expanded much further than those staple games. However, now that my kids are getting older their drive to experience games and gaming is growing… and so goes mine. Therefore, below I’ll list my favorite games found in this year of discovery.

Board games

  • Cannon (PDF) — A wonderful board game with simple rules and deep complexities. A 150-year game.

  • Mate — 20 cards. Perfect information. Mind games. Fun. I’m looking for people to play with via email; interested?

  • Hive — A game of bug-tile placement. A 150-year game.

  • Volcano — Not only great fun, but a beautiful game to boot. My current favorite Looney Pyramids game.

  • Homeworlds — An abstract game of intergalactic conquest. A 150-year game. I’m looking for people to play with via email; interested?

Card games

  • Super Nova — An abandoned CCG that tried to gain players at the wrong time. Reading over the rules hints at a very fun game that might one day work as an LCG.

  • Magic: The Gathering — As a child of RPGs and LARPs, I’m ashamed to admit that I was a snob about Magic for many years, but I’ve come to discover a very deep and fun game 20 years later. I’m a very casual player without a drive to spend, spend, spend, and thankfully there are movements in the MtG community to support such players.

  • Haggis — The best two-player card game since Gin Rummy.

Favorite TV series about zombies

The Walking Dead

Favorite programming languages (or related)

Clojure, ClojureScript, Haskell, Datalog, Frink, Pure, Racket, T

Programming languages used for projects both professional and not

Clojure, ClojureScript, Haskell, Java, JavaScript, SQL, Bash, make, Datalog, Zeder

Favorite papers discovered (and read)

Still haven’t read…

Snow Crash, Spook Country, A Fire upon the Deep, Programmer avec Scheme, Norwegian Wood, The Contortionist’s Handbook and a boat-load of scifi

Favorite conference attended

Strange Loop

Favorite code read

Life changing technology discovered

State of plans from 2012

  • Pescetarianism (redux) — huge fail (again)

  • Ariadne (the super-secret project) — eventually became my production rules system Zeder. Huge personal success!

  • More concatenative — year number two of huge failure.

  • Participate in the PLT Games — I was only able to participate once before Zeder took over my life.

  • No talks unless I have code to show — I consider this a rousing success as the few talks that I gave were code-heavy and about real projects that I was working on. I will continue this principle moving forward.

Plans for 2014

Goodbye

To my friend and colleague, whom I worked with for many years and learned so much of what I know about the art of programming — you will be missed. Rest in peace.

:F


Xv6, a simple Unix-like teaching operating system


URL:http://pdos.csail.mit.edu/6.828/2012/xv6.html


Introduction

Xv6 is a teaching operating system developed in the summer of 2006 for MIT's operating systems course, 6.828: Operating System Engineering. We hope that xv6 will be useful in other courses too. This page collects resources to aid the use of xv6 in other courses, including a commentary on the source code itself.

History and Background

For many years, MIT had no operating systems course. In the fall of 2002, one was created to teach operating systems engineering. In the course lectures, the class worked through Sixth Edition Unix (aka V6) using John Lions's famous commentary. In the lab assignments, students wrote most of an exokernel operating system, eventually named Jos, for the Intel x86. Exposing students to multiple systems–V6 and Jos–helped develop a sense of the spectrum of operating system designs.

V6 presented pedagogic challenges from the start. Students doubted the relevance of an obsolete 30-year-old operating system written in an obsolete programming language (pre-K&R C) running on obsolete hardware (the PDP-11). Students also struggled to learn the low-level details of two different architectures (the PDP-11 and the Intel x86) at the same time. By the summer of 2006, we had decided to replace V6 with a new operating system, xv6, modeled on V6 but written in ANSI C and running on multiprocessor Intel x86 machines. Xv6's use of the x86 makes it more relevant to students' experience than V6 was and unifies the course around a single architecture. Adding multiprocessor support requires handling concurrency head on with locks and threads (instead of using special-case solutions for uniprocessors such as enabling/disabling interrupts) and helps relevance. Finally, writing a new system allowed us to write cleaner versions of the rougher parts of V6, like the scheduler and file system. 6.828 substituted xv6 for V6 in the fall of 2006.

Xv6 sources and text

The latest xv6 source is available via
git clone git://pdos.csail.mit.edu/xv6/xv6.git
We also distribute the sources as a printed booklet with line numbers that keep everyone together during lectures. The booklet is available as xv6-rev7.pdf. To get the version corresponding to this booklet, run
git checkout -b xv6-rev7 xv6-rev7

The xv6 source code is licensed under the traditional MIT license; see the LICENSE file in the source distribution. To help students read through xv6 and learn about the main ideas in operating systems we also distribute a textbook/commentary for the latest xv6. The line numbers in this book refer to the above source booklet.

xv6 compiles using the GNU C compiler, targeted at the x86 using ELF binaries. On BSD and Linux systems, you can use the native compilers; on OS X, which doesn't use ELF binaries, you must use a cross-compiler. Xv6 does boot on real hardware, but typically we run it using the QEMU emulator. Both the GCC cross compiler and QEMU can be found on the 6.828 tools page.

Xv6 lecture material

In 6.828, the lectures in the first half of the course cover the xv6 sources and text. The lectures in the second half consider advanced topics using research papers; for some, xv6 serves as a useful base for making discussions concrete. The lecture notes are available from the 6.828 schedule page.

Unix Version 6

6.828's xv6 is inspired by Unix V6 and by:

The following are useful to read the original code:
  • The PDP11/40 Processor Handbook, Digital Equipment Corporation, 1972.
    • A PDF (made from scanned images, and not text-searchable)
    • A web-based version that is indexed by instruction name.

Feedback

If you are interested in using xv6 or have used xv6 in a course, we would love to hear from you. If there's anything that we can do to make xv6 easier to adopt, we'd like to hear about it. We'd also be interested to hear what worked well and what didn't.

Russ Cox (rsc@swtch.com)
Frans Kaashoek (kaashoek@mit.edu)
Robert Morris (rtm@mit.edu)

You can reach all of us at 6.828-staff@pdos.csail.mit.edu.

The Performance of Open Source Software | Zotonic


URL:http://aosabook.org/en/posa/zotonic.html


Introduction to Zotonic

Zotonic is an open source framework for doing full-stack web development, all the way from frontend to backend. Consisting of a small set of core functionalities, it implements a lightweight but extensible Content Management System on top. Zotonic’s main goal is to make it easy to create well-performing websites “out of the box”, so that a website scales well from the start.

While it shares many features and functionalities with web development frameworks like Django, Drupal, Ruby on Rails and Wordpress, its main competitive advantage is the language that Zotonic is powered by: Erlang. This language, originally developed for building phone switches, allows Zotonic to be fault tolerant and have great performance characteristics.

Like the title says, this chapter focusses on the performance of Zotonic. We’ll look at the reasons why Erlang was chosen as the programming platform, then inspect the HTTP request stack, then dive into the caching strategies that Zotonic employs. Finally, we’ll describe the optimisations we applied to Zotonic’s submodules and the database.

Why Zotonic? Why Erlang?

The first work on Zotonic was started in 2008, and, like many projects, came from “scratching an itch”. Marc Worrell, the main Zotonic architect, had been working for seven years at Mediamatic Lab, in Amsterdam, on a Drupal-like CMS written in PHP/MySQL called Anymeta. Anymeta’s main paradigm was that it implemented a “pragmatic approach to the Semantic Web” by modeling everything in the system as generic “things”. Though successful, its implementations suffered from scalability problems.

After Marc left Mediamatic, he spent a few months designing a proper, Anymeta-like CMS from scratch. The main design goals for Zotonic were that it had to be easy to use for frontend developers; it had to support easy development of real-time web interfaces, simultaneously allowing long-lived connections and many short requests; and it had to have well-defined performance characteristics. More importantly, it had to solve the most common problems that limited performance in earlier Web development approaches–for example, it had to withstand the “Slashdot Effect” (a sudden rush of visitors).

Problems with the Classic PHP+Apache Approach

A classic PHP setup runs as a module inside a container web server like Apache. On each request, Apache decides how to handle the request. When it’s a PHP request, it spins up mod_php5, and then the PHP interpreter starts interpreting the script. This comes with startup latency: typically, such a spin-up already takes 5 ms, and then the PHP code still needs to run. This problem can partially be mitigated by using PHP accelerators which precompile the PHP script, bypassing the interpreter. The PHP startup overhead can also be mitigated by using a process manager like PHP-FPM.

Nevertheless, systems like that still suffer from a shared-nothing architecture. When a script needs a database connection, it needs to create one itself. The same goes for any other I/O resource that could otherwise be shared between requests. Various modules feature persistent connections to overcome this, but there is no general solution to this problem in PHP.

Handling long-lived client connections is also hard because such connections need a separate web server thread or process for every request. In the case of Apache and PHP-FPM, this does not scale with many concurrent long-lived connections.

Requirements for a Modern Web Framework

Modern web frameworks typically deal with three classes of HTTP request. First, there are dynamically generated pages: dynamically served, usually generated by a template processor. Second, there is static content: small and large files which do not change (e.g., JavaScript, CSS, and media assets). Third, there are long-lived connections: WebSockets and long-polling requests for adding interactivity and two-way communication to pages.

Before creating Zotonic, we were looking for a software framework and programming language that would allow us to meet our design goals (high performance, developer friendliness) and sidestep the bottlenecks associated with traditional web server systems. Ideally the software would meet the following requirements.

  • Concurrent: it needs to support many concurrent connections that are not limited by the number of unix processes or OS threads.
  • Shared resources: it needs to have a mechanism to share resources cheaply (e.g., caching, db connections) between requests.
  • Hot code upgrades: for ease of development and the enabling of hot-upgrading production systems (keeping downtime to a minimum), it would be nice if code changes could be deployed in a running system, without needing to restart it.
  • Multi-core CPU support: a modern system needs to scale over multiple cores, as current CPUs tend to scale in the number of cores rather than in clock speed.
  • Fault tolerant: the system needs to be able to handle exceptional situations, “badly behaving” code, anomalies or resource starvation. Ideally, the system would achieve this by having some kind of supervision mechanism to restart the failing parts.
  • Distributed: ideally, a system has built-in and easy to set up support for distribution over multiple nodes, to allow for better performance and protection against hardware failure.

Erlang to the Rescue

To our knowledge, Erlang was the only language that met these requirements “out of the box”. The Erlang VM, combined with its Open Telecom Platform (OTP), provided the system that gave and continues to give us all the necessary features.

Erlang is a (mostly) functional programming language and runtime system. Erlang/OTP applications were originally developed for telephone switches, and are known for their fault-tolerance and their concurrent nature. Erlang employs an actor-based concurrency model: each actor is a lightweight “process” (green thread) and the only way to share state between processes is to pass messages. The Open Telecom Platform is the set of standard Erlang libraries which enable fault tolerance and process supervision, amongst others.

Fault tolerance is at the core of its programming paradigm: let it crash is the main philosophy of the system. As processes don’t share any state (to share state, they must send messages to each other), their state is isolated from other processes. As such, a single crashing process will never take down the system. When a process crashes, its supervisor process can decide to restart it.

Let it crash also allows you to program for the happy case. Using pattern matching and function guards to assure a sane state means less error handling code is needed, which usually results in clean, concise, and readable code.

Zotonic’s Architecture

Before we discuss Zotonic’s performance optimizations, let’s have a look at its architecture. Figure 9.1 describes Zotonic’s most important components.

Figure 9.1 - The architecture of Zotonic

The diagram shows the layers of Zotonic that an HTTP request goes through. For discussing performance issues we’ll need to know what these layers are, and how they affect performance.

First, Zotonic comes with a built-in web server, Mochiweb (another Erlang project). It does not require an external web server. This keeps the deployment dependencies to a minimum.

Like many web frameworks, a URL routing system is used to match requests to controllers. Controllers handle each request in a RESTful way, thanks to the Webmachine library.

Controllers are “dumb” on purpose, without much application-specific logic. Zotonic provides a number of standard controllers which, for the development of basic web applications, are often good enough. For instance, there is a controller_template, whose sole purpose is to reply to HTTP GET requests by rendering a given template.

The template language is an Erlang-implementation of the well-known Django Template Language, called ErlyDTL. The general principle in Zotonic is that the templates drive the data requests. The templates decide which data they need, and retrieve it from the models.

Models expose functions to retrieve data from various data sources, like a database. Models expose an API to the templates, dictating how they can be used. The models are also responsible for caching their results in memory; they decide when and what is cached and for how long. When templates need data, they call a model as if it were a globally available variable.

A model is an Erlang wrapper module which is responsible for certain data. It contains the necessary functions to retrieve and store data in the way that the application needs. For instance, the central model of Zotonic is called m.rsc, which provides access to the generic resource (“page”) data model. Since resources use the database, m_rsc.erl uses a database connection to retrieve its data and pass it through to the template, caching it whenever it can.

This “templates drive the data” approach is different from other web frameworks like Rails and Django, which usually follow a more classical MVC approach where a controller assigns data to a template. Zotonic follows a less “controller-centric” approach, so that typical websites can be built by just writing templates.

Zotonic uses PostgreSQL for data persistence. Data Model: a Document Database in SQL explains the rationale for this choice.

Additional Zotonic Concepts

While the main focus of this chapter are the performance characteristics of the web request stack, it is useful to know some of the other concepts that are at the heart of Zotonic.

Virtual hosting: A single Zotonic instance typically serves more than one site. It is designed for virtual hosting, including domain aliases and SSL support. And due to Erlang’s process isolation, a crashing site does not affect any of the other sites running in the same VM.

Modules: Modules are Zotonic’s way of grouping functionality together. Each module is in its own directory containing Erlang files, templates, assets, etc. They can be enabled on a per-site basis. Modules can hook into the admin system: for instance, the mod_backup module adds version control to the page editor and also runs a daily full database backup. Another module, mod_github, exposes a webhook which pulls, rebuilds and reloads a Zotonic site from github, allowing for continuous deployment.

Notifications: To enable the loose coupling and extensibility of code, communication between modules and core components is done by a notification mechanism which functions either as a map or fold over the observers of a certain named notification. By listening to notifications it becomes easy for a module to override or augment certain behaviour. The calling function decides whether a map or fold is used. For instance, the admin_menu notification is a fold over the modules which allows modules to add or remove menu items in the admin menu.

Data model: The main data model that Zotonic uses can be compared to Drupal’s Node module; “every thing is a thing”. The data model consists of hierarchically categorized resources which connect to other resources using labelled edges. Like its source of inspiration, the Anymeta CMS, this data model is loosely based on the principles of the Semantic Web.

Zotonic is an extensible system, and all parts of the system add up when you consider performance. For instance, you might add a module that intercepts web requests, and does something on each request. Such a module might impact the performance of the entire system. In this chapter we’ll leave this out of consideration, and instead focus on the core performance issues.

Problem Solving: Fighting the Slashdot Effect

Most web sites live an unexceptional life in a small place somewhere on the web. That is, until one of their pages hit the front page of a popular website like CNN, BBC or Yahoo. In that case, the traffic to the website will likely increase to tens, hundreds, or even thousands of page requests per second in no time.

Such a sudden surge overloads a traditional web server and makes it unreachable. The term “Slashdot Effect” was named after the web site that first caused this kind of overwhelming referral traffic. Even worse, an overloaded server is sometimes very hard to restart, as the newly started server has empty caches, no database connections, and often uncompiled templates.

Many anonymous visitors requesting exactly the same page around the same time shouldn’t be able to overload a server. This problem is easily solved using a caching proxy like Varnish, which caches a static copy of the page and only checks for updates to the page once in a while.

A surge of traffic becomes more challenging when serving dynamic pages for every single visitor; these can’t be cached. With Zotonic, we set out to solve this problem.

We realized that most web sites have

  • only a limited number of very popular pages,
  • a long tail of far less popular pages, and
  • many shared parts on all pages (menu, most read items, news, etc.).

and decided to

  • cache hot data in memory so no communication needed to access it,
  • share renderings of templates and sub-templates between requests and on pages on the web site, and
  • explicitly design the system to prevent overload on server start and restart.

Cache Hot Data

Why fetch data from an external source (database, memcached) when another request fetched it already a couple of milliseconds ago? We always cache simple data requests. In the next section the caching mechanism is discussed in detail.

Share Rendered Templates and Sub-templates Between Pages

When rendering a page or included template, a developer can add optional caching directives. This caches the rendered result for a period of time.

Caching starts what we call the memo functionality: while a template is being rendered, if one or more processes request the same rendering, those later processes are suspended. When the rendering is done, all waiting processes are sent the rendering result.

The memoization alone–without any further caching–gives a large performance boost by drastically reducing the amount of parallel template processing.
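In Python terms, the memo idea looks roughly like the following (a hedged sketch using threads; the real implementation uses Erlang processes and message passing, and error handling is omitted):

import threading

_inflight = {}   # cache key -> Event signalled when the rendering is done
_results = {}    # cache key -> rendered result
_lock = threading.Lock()

def render_memo(key, render_fn):
    """First caller renders; concurrent callers for the same key wait and
    receive the same result instead of rendering in parallel."""
    with _lock:
        event = _inflight.get(key)
        if event is None:
            event = threading.Event()
            _inflight[key] = event
            owner = True
        else:
            owner = False
    if owner:
        try:
            _results[key] = render_fn()
        finally:
            event.set()
            with _lock:
                _inflight.pop(key, None)
        return _results[key]
    event.wait()
    return _results[key]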

Prevent Overload on Server Start or Restart

Zotonic introduces several bottlenecks on purpose. These bottlenecks limit the access to processes that use limited resources or are expensive (in terms of CPU or memory) to perform. Bottlenecks are currently set up for the template compiler, the image resizing process, and the database connection pool.

The bottlenecks are implemented by having a limited worker pool for performing the requested action. For CPU or disk intensive work, like image resizing, there is only a single process handling the requests. Requesting processes post their request in the Erlang request queue for the process and wait until their request is handled. If a request times out it will just crash. Such a crashing request will return HTTP status 503 Service not available.

Waiting processes don’t use many resources and the bottlenecks protect against overload if a template is changed or an image on a hot page is replaced and needs cropping or resizing.

In short: a busy server can still dynamically update its templates, content and images without getting overloaded. At the same time it allows single requests to crash while the system itself continues to operate.

The Database Connection Pool

One more word on database connections. In Zotonic a process fetches a database connection from a pool of connections for every single query or transaction. This enables many concurrent processes to share a very limited number of database connections. Compare this with most (PHP) systems where every request holds a connection to the database for the duration of the complete request.

Zotonic closes unused database connections after a time of inactivity. One connection is always left open so that the system can always handle an incoming request or background activity quickly. The dynamic connection pool drastically reduces the number of open database connections on most Zotonic web sites to one or two.

Caching Layers

The hardest part of caching is cache invalidation: keeping the cached data fresh and purging stale data. Zotonic uses a central caching mechanism with dependency checks to solve this problem.

This section describes Zotonic’s caching mechanism in a top-down fashion: from the browser down through the stack to the database.

Client-Side Caching

The client-side caching is done by the browser. The browser caches images, CSS and JavaScript files. Zotonic does not allow client-side caching of HTML pages; it always generates all pages dynamically. It can afford to do so because page generation is very efficient (as described in the previous section), and not caching HTML pages prevents showing stale pages after users log in, log out, or place comments.

Zotonic improves client-side performance in two ways:

  • It allows caching of static files (CSS, JavaScript, images, etc.).
  • It includes multiple CSS or JavaScript files in a single response.

The first is done by adding the appropriate HTTP headers to the response:

Last-Modified: Tue, 18 Dec 2012 20:32:56 GMT
Expires: Sun, 01 Jan 2023 14:55:37 GMT
Date: Thu, 03 Jan 2013 14:55:37 GMT
Cache-Control: public, max-age=315360000

Multiple CSS or JavaScript files are concatenated into a single file, separating individual files by a tilde and only mentioning paths if they change between files:

http://example.org/lib/bootstrap/css/bootstrap
 ~bootstrap-responsive~bootstrap-base-site~
 /css/jquery.loadmask~z.growl~z.modal~site~63523081976.css

The number at the end is a timestamp of the newest file in the list. The necessary CSS link or JavaScript script tag is generated using the {% lib %} template tag.

Server-Side Caching

Zotonic is a large system, and many parts in it do caching in some way. The sections below explain some of the more interesting parts.

Static CSS, JS and Image Files

The controller handling the static files has some optimizations for handling these files. It can decompose combined file requests into a list of individual files.
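A rough Python sketch of that decomposition, based on the URL format shown earlier (this is guesswork at the rules, not Zotonic's actual Erlang code):

import posixpath

def split_combined(path):
    # '~' separates files; a directory is only mentioned when it changes;
    # the trailing number is a timestamp; the extension applies to every file
    root, ext = posixpath.splitext(path)
    parts = root.split('~')
    if parts and parts[-1].isdigit():
        parts = parts[:-1]                       # drop the timestamp
    files, current_dir = [], ''
    for part in parts:
        if not part:
            continue
        if '/' in part:
            current_dir = posixpath.dirname(part)
            part = posixpath.basename(part)
        files.append(posixpath.join(current_dir, part) + ext)
    return files

print(split_combined(
    'lib/bootstrap/css/bootstrap~bootstrap-responsive~bootstrap-base-site~'
    '/css/jquery.loadmask~z.growl~z.modal~site~63523081976.css'))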

The controller has checks for the If-Modified-Since header, serving the HTTP status 304 Not Modified when appropriate.

On the first request it will concatenate the contents of all the static files into one byte array (an Erlang binary). This byte array is then cached in the central depcache (see Depcache) in two forms: compressed (with gzip) and uncompressed. Depending on the Accept-Encoding headers sent by the browser, Zotonic will serve either the compressed or uncompressed version.

This caching mechanism is efficient enough that its performance is similar to many caching proxies, while still fully controlled by the web server. With an earlier version of Zotonic and on simple hardware (quad core 2.4 GHz Xeon from 2008) we saw throughputs of around 6000 requests/second and were able to saturate a gigabit ethernet connection requesting a small (~20 KB) image file.

Rendered Templates

Templates are compiled into Erlang modules, after which the byte code is kept in memory. Compiled templates are called as regular Erlang functions.

The template system detects any changes to templates and will recompile the template during runtime. When compilation is finished Erlang’s hot code upgrade mechanism is used to load the newly compiled Erlang module.

The main page and template controllers have options to cache the template rendering result. Caching can also be enabled only for anonymous (not logged in) visitors. As with most websites, anonymous visitors generate the bulk of all requests, and those pages will not be personalized and will be (almost) identical. Note that the template rendering result is an intermediate result and not the final HTML. This intermediate result contains (among other things) untranslated strings and JavaScript fragments. The final HTML is generated by parsing this intermediate structure, picking the correct translations and collecting all JavaScript.

The concatenated JavaScript, along with a unique page ID, is placed at the position of the {% script %} template tag. This should be just above the closing </body> tag. The unique page ID is used to match this rendered page with the handling Erlang processes and for WebSocket/Comet interaction on the page.

As with any template language, templates can include other templates. In Zotonic, included templates are usually compiled inline to eliminate any performance loss from using included files.

Special options can force runtime inclusion. One of those options is caching. Caching can be enabled for anonymous visitors only, a caching period can be set, and cache dependencies can be added. These cache dependencies are used to invalidate the cached rendering if any of the shown resources is changed.

Another method to cache parts of templates is to use the {% cache %} ... {% endcache %} block tag, which caches a part of a template for a given amount of time. This tag has the same caching options as the include tag, but has the advantage that it can easily be added in existing templates.

In-Memory Caching

All caching is done in memory, in the Erlang VM itself. No communication between computers or operating system processes is needed to access the cached data. This greatly simplifies and optimizes the use of the cached data.

As a comparison, accessing a memcache server typically takes 0.5 milliseconds. In contrast, accessing main memory within the same process takes 1 nanosecond on a CPU cache hit and 100 nanoseconds on a CPU cache miss–not to mention the huge speed difference between memory and network.

Zotonic has two in-memory caching mechanisms:

  • Depcache, the central per-site cache
  • Process Dictionary Memo Cache

Depcache

 

The central caching mechanism in every Zotonic site is the depcache, which is short for dependency cache. The depcache is an in-memory key-value store with a list of dependencies for every stored key.

For every key in the depcache we store:

  • the key’s value;
  • a serial number, a global integer incremented with every update request;
  • the key’s expiration time (counted in seconds);
  • a list of other keys that this key depends on (e.g., a resource ID displayed in a cached template); and
  • if the key is still being calculated, a list of processes waiting for the key’s value.

If a key is requested, the cache checks whether the key is present, not expired, and whether the serial numbers of all the dependency keys are lower than the serial number of the cached key. If the key is still valid, its value is returned; otherwise the key and its value are removed from the cache and undefined is returned.

Alternatively, if the key is still being calculated, the requesting process is added to the key’s waiting list.
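The lookup rule can be summarised in a small Python sketch (names are illustrative and the waiting-process bookkeeping is left out; the real implementation, described below, lives in Erlang and ETS):

import time

class Depcache(object):
    """Hedged sketch of the depcache lookup rule."""
    def __init__(self):
        self.serial = 0        # global serial number, bumped on every store
        self.serials = {}      # key -> serial of that key's last update
        self.data = {}         # key -> (value, expire_at, serial, deps)

    def put(self, key, value, max_age, deps=()):
        self.serial += 1
        self.serials[key] = self.serial
        self.data[key] = (value, time.time() + max_age, self.serial, list(deps))

    def flush(self, key):
        # invalidate a dependency, e.g. a changed resource id
        self.serial += 1
        self.serials[key] = self.serial
        self.data.pop(key, None)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None                            # 'undefined'
        value, expire_at, serial, deps = entry
        valid = time.time() < expire_at and all(
            self.serials.get(dep, 0) < serial for dep in deps)
        if not valid:
            del self.data[key]                     # evict the stale entry
            return None
        return value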

The implementation makes use of ETS, the Erlang Term Storage, a standard hash table implementation which is part of the Erlang OTP distribution. The following ETS tables are created by Zotonic for the depcache:

  • Meta table: the ETS table holding all stored keys, the expiration and the depending keys. A record in this table is written as #meta{key, expire, serial, deps}.
  • Deps table: the ETS table stores the serial for each key.
  • Data table: the ETS table that stores each key’s data.
  • Waiting PIDs dictionary: the ETS table that stores the IDs of all processes waiting for the arrival of a key’s value.

The ETS tables are optimized for parallel reads and usually directly accessed by the calling process. This prevents any communication between the calling process and the depcache process.

The depcache process is called for:

  • memoization where processes wait for another process’s value to be calculated;
  • put (store) requests, serializing the serial number increments; and
  • delete requests, also serializing the depcache access.

The depcache can get quite large. To prevent it from growing too large there is a garbage collector process. The garbage collector slowly iterates over the complete depcache, evicting expired or invalidated keys. If the depcache size is above a certain threshold (100 MiB by default) then the garbage collector speeds up and evicts 10% of all encountered items. It keeps evicting until the cache is below its threshold size.

100 MiB might sound small in this era of multi-TB databases. However, as the cache mostly contains textual data, it will be big enough to hold the hot data for most web sites. If it isn’t, the size of the cache can be changed in the configuration.
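Continuing the Depcache sketch from the previous section, a garbage collection pass amounts to something like this (again only an illustration; the size check here is a crude stand-in for the real memory accounting):

import random
import sys
import time

def gc_pass(cache, threshold_bytes=100 * 1024 * 1024, evict_fraction=0.10):
    # evict expired or invalidated entries; under memory pressure also evict
    # a fraction of everything encountered until the cache shrinks enough
    now = time.time()
    for key in list(cache.data):
        entry = cache.data.get(key)
        if entry is None:
            continue
        value, expire_at, serial, deps = entry
        stale = now >= expire_at or any(
            cache.serials.get(dep, 0) >= serial for dep in deps)
        over_budget = sys.getsizeof(cache.data) > threshold_bytes  # crude proxy
        if stale or (over_budget and random.random() < evict_fraction):
            del cache.data[key]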

Process Dictionary Memo Cache

The other memory-caching paradigm in Zotonic is the process dictionary memo cache. As described earlier, the data access patterns are dictated by the templates. The caching system uses simple heuristics to optimize access to data.

Important in this optimization is data caching in the Erlang process dictionary of the process handling the request. The process dictionary is a simple key-value store in the same heap as the process. Basically, it adds state to the functional Erlang language. Use of the process dictionary is usually frowned upon for this reason, but for in-process caching it is useful.

When a resource is accessed (remember, a resource is the central data unit of Zotonic), it is copied into the process dictionary. The same is done for computational results–like access control checks–and other data like configuration values.

Every property of a resource–like its title, summary or body text–must, when shown on a page, perform an access control check and then fetch the requested property from the resource. Caching all the resource’s properties and its access checks greatly speeds up resource data usage and removes many drawbacks of the hard-to-predict data access patterns by templates.
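A minimal sketch of this kind of per-process memoization, using nothing but the process dictionary, is shown below. The module name and key layout are assumptions for illustration only; Zotonic’s real memo cache additionally handles enabling/disabling, flushing and the pressure valves listed below.

%% memo_sketch: memoize a computation in the calling process's dictionary
%% (illustrative only, not Zotonic code).
-module(memo_sketch).
-export([memo/2]).

memo(Key, Fun) ->
    case erlang:get({memo, Key}) of
        undefined ->
            Value = Fun(),                  %% compute once...
            erlang:put({memo, Key}, Value), %% ...and remember it for this process
            Value;
        Value ->
            Value                           %% later lookups are a cheap local read
    end.

An access control check could then be wrapped as memo({acl, UserId, RscId}, fun() -> expensive_acl_check(UserId, RscId) end), so repeated checks for the same resource within one request cost almost nothing (expensive_acl_check is a hypothetical function).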

As a page or process can use a lot of data this memo cache has a couple of pressure valves:

  • When holding more than 10,000 keys the whole process dictionary is flushed. This prevents process dictionaries holding many unused items, like what happens when looping through long lists of resources. Special Erlang variables like $ancestors are kept.
  • The memo cache must be programmatically enabled. This is automatically done for every incoming HTTP or WebSocket request and template rendering. Between HTTP/WebSocket requests the process dictionary is flushed, as multiple sequential HTTP/WebSocket requests share the same process.
  • The memo cache doesn’t track dependencies. Any depcache deletion will also flush the complete process dictionary of the process performing the deletion.

When the memo cache is disabled then every lookup is handled by the depcache. This results in a call to the depcache process and data copying between the depcache and the requesting process.

The Erlang Virtual Machine

The Erlang Virtual Machine has a few properties that are important when looking at performance.

Processes are Cheap

The Erlang VM is specifically designed to do many things in parallel, and as such has its own implementation of multiprocessing within the VM. Erlang processes are scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. A process is allowed to run until it pauses to wait for input (a message from some other process) or until it has executed a fixed number of reductions. For each CPU core, a scheduler is started with its own run queue. It is not uncommon for Erlang applications to have thousands to millions of processes alive in the VM at any given point in time.

Processes are not only cheap to start but also cheap in memory at 327 words per process, which amounts to ~2.5 KiB on a 64 bit machine. This compares to ~500 KiB for Java and a default of 2 MiB for pthreads.

Since processes are so cheap to use, any processing that is not needed for a request’s result is spawned off into a separate process. Sending an email or logging are both examples of tasks that could be handled by separate processes.
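As a rough illustration (with hypothetical function names, not actual Zotonic code), work that does not influence the reply can simply be handed to throw-away processes:

-module(offload_sketch).
-export([handle_request/1]).

%% Build the reply first, then hand side work (logging, email) to separate
%% processes; the caller does not wait for them.
handle_request(Req) ->
    Reply = render_page(Req),
    spawn(fun() -> log_request(Req) end),
    spawn(fun() -> send_notification_email(Req) end),
    Reply.

%% Stand-ins for the real work.
render_page(Req) -> {ok, Req}.
log_request(_Req) -> ok.
send_notification_email(_Req) -> ok.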

Data Copying is Expensive

In the Erlang VM messages between processes are relatively expensive, as the message is copied into the heap of the receiving process. This copying is needed because of Erlang’s per-process garbage collector. Preventing unnecessary data copying is important, which is why Zotonic’s depcache uses ETS tables, which can be accessed from any process.

Separate Heap for Bigger Byte Arrays

There is a big exception for copying data between processes. Byte arrays larger than 64 bytes are not copied between processes. They have their own heap and are separately garbage collected.

This makes it cheap to send a big byte array between processes, as only a reference to the byte array is copied. However, it does make garbage collection harder, as all references must be garbage collected before the byte array can be freed.

Sometimes, references to parts of a big byte array are passed: the bigger byte array can’t be garbage collected until the reference to the smaller part is garbage collected. A consequence is that copying a byte array is an optimization if that frees up the bigger byte array.
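The following shell snippet sketches that trade-off: keeping a small sub-binary alive pins the whole large binary, while binary:copy/1 detaches it so the large one can be reclaimed.

1> Big = binary:copy(<<0>>, 10000000).  %% a ~10 MB binary on its own heap
2> <<Head:16/binary, _/binary>> = Big.  %% Head is a sub-binary referencing Big
3> Part = binary:copy(Head).            %% a fresh 16-byte binary with no reference to Big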

String Processing is Expensive

String processing in any functional language can be expensive because strings are often represented as linked lists of integers, and, due to the functional nature of Erlang, data cannot be destructively updated.

If a string is represented as a list, then it is processed using tail recursive functions and pattern matching. This makes it a natural fit for functional languages. The problem is that the data representation of a linked list has a big overhead and that messaging a list to another process always involves copying the full data structure. This makes a list a non-optimal choice for strings.

Erlang has its own middle-of-the-road answer to strings: io-lists. Io-lists are nested lists containing lists, integers (single byte value), byte arrays and references to parts of other byte arrays. Io-lists are extremely easy to use and appending, prefixing or inserting data is inexpensive, as they only need changes to relatively short lists, without any data copying.

An io-list can be sent as-is to a “port” (a file descriptor), which flattens the data structure to a byte stream and sends it to a socket.

Example of an io-list:

 [ <<"Hello">>, 32, [ <<"Wo">>, [114, 108], <<"d">>].

Which flattens to the byte array:

<<"Hello World">>.

Interestingly, most string processing in a web application consists of:

  • Concatenating data (dynamic and static) into the resulting page.
  • HTML escaping and sanitizing content values.

Erlang’s io-list is the perfect data structure for the first use case. And the second use case is resolved by an aggressive sanitization of all content before it is stored in the database.

These two combined means that for Zotonic a rendered page is just a big concatenation of byte arrays and pre-sanitized values in a single io-list.
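A trivial shell example of this pattern (the fragments and values are made up): the page is assembled as a nested io-list of static template fragments and pre-sanitized values, and only flattened when written to the socket.

1> Title = <<"Hello">>, Body = <<"Pre-sanitized &lt;content&gt;">>.
2> Page = [<<"<h1>">>, Title, <<"</h1><p>">>, Body, <<"</p>">>].
3> iolist_to_binary(Page).
<<"<h1>Hello</h1><p>Pre-sanitized &lt;content&gt;</p>">>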

Implications for Zotonic

Zotonic makes heavy use of a relatively big data structure, the Context. This is a record containing all data needed for a request evaluation. It contains:

  • The request data: headers, request arguments, body data etc.
  • Webmachine status
  • User information (e.g., user ID, access control information)
  • Language preference
  • User-Agent class (e.g., text, phone, tablet, desktop)
  • References to special site processes (e.g., notifier, depcache, etc.)
  • Unique ID for the request being processed (this will become the page ID)
  • Session and page process IDs
  • Database connection process during a transaction
  • Accumulators for reply data (e.g., data, actions to be rendered, JavaScript files)

All this data can make a large data structure. Sending this large Context to different processes working on the request would result in a substantial data copying overhead.

That is why we try to do most of the request processing in a single process: the Mochiweb process that accepted the request. Additional modules and extensions are called using function calls instead of using inter-process messages.

Sometimes an extension is implemented using a separate process. In that case the extension provides a function accepting the Context and the process ID of the extension process. This interface function is then responsible for efficiently messaging the extension process.

Zotonic also needs to send a message when rendering cacheable sub-templates. In this case the Context is pruned of all intermediate template results and some other unneeded data (like logging information) before the Context is messaged to the process rendering the sub-template.

We don’t care too much about messaging byte arrays as they are, in most cases, larger than 64 bytes and as such will not be copied between processes.

For serving large static files, there is the option of using the Linux sendfile() system call to delegate sending the file to the operating system.

Changes to the Webmachine Library

Webmachine is a library implementing an abstraction of the HTTP protocol. It is implemented on top of the Mochiweb library which implements the lower level HTTP handling, like acceptor processes, header parsing, etc.

Controllers are made by creating Erlang modules implementing callback functions. Examples of callback functions are resource_exists, previously_existed, authorized, allowed_methods, process_post, etc. Webmachine also matches request paths against a list of dispatch rules; assigning request arguments and selecting the correct controller for handling the HTTP request.
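For readers unfamiliar with the model, a minimal controller looks roughly like the sketch below. This is a hand-written illustration, not code from Zotonic or Webmachine itself; consult the documentation of the version you use for the exact callback set and return values.

%% A minimal Webmachine-style controller sketch.
-module(controller_hello).
-export([init/1, resource_exists/2, content_types_provided/2, to_html/2]).

init(_DispatchArgs) ->
    {ok, undefined}.                               %% initial controller state

resource_exists(ReqData, State) ->
    {true, ReqData, State}.                        %% answering false yields a 404

content_types_provided(ReqData, State) ->
    {[{"text/html", to_html}], ReqData, State}.    %% which callback renders which type

to_html(ReqData, State) ->
    {<<"<h1>Hello, world</h1>">>, ReqData, State}.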

With Webmachine, handling the HTTP protocol becomes easy. We decided early on to build Zotonic on top of Webmachine for this reason.

While building Zotonic a couple of problems with Webmachine were encountered.

  1. When we started, it supported only a single list of dispatch rules, not a list of rules per host (i.e., site).
  2. Dispatch rules are set in the application environment, and copied to the request process when dispatching.
  3. Some callback functions (like last_modified) are called multiple times during request evaluation.
  4. When Webmachine crashes during request evaluation, no log entry is made by the request logger.
  5. There is no support for HTTP Upgrade, making WebSockets support harder.

The first problem (no partitioning of dispatch rules) is only a nuisance. It makes the list of dispatch rules less intuitive and more difficult to interpret.

The second problem (copying the dispatch list for every request) turned out to be a show stopper for Zotonic. The lists could become so large that copying them could take the majority of the time needed to handle a request.

The third problem (multiple calls to the same functions) forced controller writers to implement their own caching mechanisms, which is error prone.

The fourth problem (no log on crash) makes it harder to see problems when in production.

The fifth problem (no HTTP Upgrade) prevents us from using the nice abstractions available in Webmachine for WebSocket connections.

The above problems were so serious that we had to modify Webmachine for our own purposes.

First a new option was added: dispatcher. A dispatcher is a module implementing the dispatch/3 function which matches a request to a dispatch list. The dispatcher also selects the correct site (virtual host) using the HTTP Host header. When testing a simple “hello world” controller, these changes gave a threefold increase in throughput. We also observed that the gain was much higher on systems with many virtual hosts and dispatch rules.

Webmachine maintains two data structures, one for the request data and one for the internal request processing state. These data structures referred to each other and were almost always used in tandem, so we combined them into a single data structure. This made it easier to remove the use of the process dictionary and to pass the new combined data structure as an argument to all functions inside Webmachine. This resulted in 20% less processing time per request.

We optimized Webmachine in many other ways that we will not describe in detail here, but the most important points are:

  • Return values of some controller callbacks are cached (charsets_provided, content_types_provided, encodings_provided, last_modified, and generate_etag).
  • More process dictionary use was removed (less global state, clearer code, easier testing).
  • Separate logger process per request; even when a request crashes we have a log up to the point of the crash.
  • An HTTP Upgrade callback was added as a step after the forbidden access check to support WebSockets.
  • Originally, a controller was called a “resource”. We changed it to “controller” to make a clear distinction between the (data-)resources being served and the code serving those resources.
  • Some instrumentation was added to measure request speed and size.

Data Model: a Document Database in SQL

 

From a data perspective it is worth mentioning that all properties of a “resource” (Zotonic’s main data unit) are serialized into a binary blob; “real” database columns are only used for keys, querying and foreign key constraints.

Separate “pivot” fields and tables are added for properties, or combinations of properties that need indexing, like full text columns, date properties, etc.

When a resource is updated, a database trigger adds the resource’s ID to the pivot queue. This pivot queue is consumed by a separate Erlang background process which indexes batches of resources at a time in a single transaction.

Choosing SQL made it possible for us to hit the ground running: PostgreSQL has a well known query language, great stability, known performance, excellent tools, and both commercial and non-commercial support.

Beyond that, the database is not the limiting performance factor in Zotonic. If a query becomes the bottleneck, then it is the task of the developer to optimize that particular query using the database’s query analyzer.

Finally, the golden performance rule for working with any database is: Don’t hit the database; don’t hit the disk; don’t hit the network; hit your cache.

Benchmarks, Statistics and Optimizations

We don’t believe too much in benchmarks, as they often test only minimal parts of a system and don’t represent the performance of the whole system. This is especially true because a system has many moving parts, and in Zotonic the caching system and the handling of common access patterns are an integral part of the design.

A Simplified Benchmark

What a benchmark might do is show where you could optimize the system first.

With this in mind we benchmarked Zotonic using the TechEmpower JSON benchmark, which is basically testing the request dispatcher, JSON encoder, HTTP request handling and the TCP/IP stack.

The benchmark was performed on an Intel i7 quad core M620 @ 2.67 GHz. The command was wrk -c 3000 -t 3000 http://localhost:8080/json. The results are shown in Table 9.1.

Table 9.1 - Benchmark results

  Node.js                                        27
  Cowboy (Erlang)                                31
  Elli (Erlang)                                  38
  Zotonic                                        5.5
  Zotonic w/o access log                         7.5
  Zotonic w/o access log, with dispatcher pool   8.5

Zotonic’s dynamic dispatcher and HTTP protocol abstraction give it lower scores in such a micro benchmark. Those are relatively easy to solve, and the solutions were already planned:

  • Replace the standard webmachine logger with a more efficient one
  • Compile the dispatch rules in an Erlang module (instead of a single process interpreting the dispatch rule list)
  • Replace the MochiWeb HTTP handler with the Elli HTTP handler
  • Use byte arrays in Webmachine instead of the current character lists

Real-Life Performance

For the 2013 abdication of the Dutch queen and subsequent inauguration of the new Dutch king a national voting site was built using Zotonic. The client requested 100% availability and high performance, being able to handle 100,000 votes per hour.

The solution was a system of four virtual servers, each with 2 GB RAM and running its own independent Zotonic system. Three nodes handled voting; one node was for administration. All nodes were independent, but each voting node shared every vote with at least two other nodes, so no vote would be lost if a node crashed.

A single vote generated ~30 HTTP requests for dynamic HTML (in multiple languages), Ajax, and static assets like CSS and JavaScript. Multiple requests were needed for selecting the three projects to vote on and for filling in the details of the voter.

When tested, we easily met the customer’s requirements without pushing the system to the max. The voting simulation was stopped at 500,000 complete voting procedures per hour, using around 400 Mbps of bandwidth, with 99% of request handling times below 200 milliseconds.

From the above it is clear that Zotonic can handle popular dynamic web sites. On real hardware we have observed much higher performance, especially for the underlying I/O and database performance.

Conclusion

When building a content management system or framework it is important to take the full stack of your application into consideration, from the web server, the request handling system, the caching systems, down to the database system. All parts must work well together for good performance.

Much performance can be gained by preprocessing data. An example of preprocessing is pre-escaping and sanitizing data before storing it into the database.

Caching hot data is a good strategy for web sites with a clear set of popular pages followed by a long tail of less popular pages. Placing this cache in the same memory space as the request handling code gives a clear edge over using separate caching servers, both in speed and simplicity.

Another optimization for handling sudden bursts in popularity is to dynamically match similar requests and process them once for the same result. When this is well implemented, a proxy can be avoided and all HTML pages generated dynamically.

Erlang is a great match for building dynamic web based systems due to its lightweight multiprocessing, failure handling, and memory management.

Using Erlang, Zotonic makes it possible to build a very competent and well-performing content management system and framework without needing separate web servers, caching proxies, memcache servers, or e-mail handlers. This greatly simplifies system management tasks.

On current hardware a single Zotonic server can handle thousands of dynamic page requests per second, thus easily serving the vast majority of web sites on the World Wide Web.

Using Erlang, Zotonic is prepared for the future of multi-core systems with dozens of cores and many gigabytes of memory.

Acknowledgements

The authors would like to thank Michiel Klønhammer (Maximonster Interactive Things), Andreas Stenius, Maas-Maarten Zeeman and Atilla Erdődi.

PostgreSQL indexing in Rails


Comments:"PostgreSQL indexing in Rails"

URL:http://rny.io/rails/postgresql/2013/08/20/postgresql-indexing-in-rails.html


The things you need to know about PostgreSQL indexes to keep your Rails applications snappy.

The purpose of indexes is to make access to data faster. Most of the time an index will make your queries faster, but the trade-off is that each index makes data insertion slower: every insert must write data to two different places, the table and the index.

PostgreSQL offers many index types. The one we’re going to focus on in this article is the B-tree index, which is the most commonly used type for most use cases.

Primary key indexes

Ok, let’s start with the basics. In general it’s a good practice to add an index for the primary key in your tables. If your table has a large number of rows, a lookup can use the index instead of sequentially scanning the table for the matching rows. Luckily, PostgreSQL automatically creates an index for primary keys to enforce uniqueness. Thus, it is not necessary to create an index explicitly for primary key columns:

class CreateProducts < ActiveRecord::Migration
  def change
    create_table :products do |t|
      t.string :name
    end
  end
end

…your table description will look like this in psql:

indexes_development=# \d products
                                 Table "public.products"
 Column |          Type          |                       Modifiers
--------+------------------------+-------------------------------------------------------
 id     | integer                | not null default nextval('products_id_seq'::regclass)
 name   | character varying(255) |
Indexes:
    "products_pkey" PRIMARY KEY, btree (id)

As you can see, you now have a primary key index of the btree type on the id column.

Foreign keys and other commonly used columns

Unlike primary keys, foreign keys and other columns in your table will not be indexed automatically in Rails. So it’s always a good idea to add indexes for foreign keys, columns that need to be sorted, lookup fields and columns that are used with the group method (GROUP BY) in the Active Record Query Interface.

One of the most common performance problems with Rails applications is the lack of indexes on foreign keys. Luckily it’s very easy to avoid this pitfall:

class CreateProducts < ActiveRecord::Migration
  def change
    create_table :products do |t|
      t.string :name
      t.belongs_to :category
    end

    create_table :categories do |t|
      t.string :name
    end

    add_index :products, :category_id
  end
end

And after adding the migration above you should see this in psql:

indexes_development=# \d products
                                   Table "public.products"
   Column    |          Type          |                       Modifiers
-------------+------------------------+-------------------------------------------------------
 id          | integer                | not null default nextval('products_id_seq'::regclass)
 name        | character varying(255) |
 category_id | integer                |
Indexes:
    "products_pkey" PRIMARY KEY, btree (id)
    "index_products_on_category_id" btree (category_id)

Notice that we now have an index called index_products_on_category_id for the category_id. So that extra add_index line in the migration will make your application perform a lot better.

Unique Indexes

If you create a unique index for a column, you’re guaranteed the table won’t have more than one row with the same value for that column. Using only the validates_uniqueness_of validation in your model isn’t enough to enforce uniqueness, because there can be concurrent users trying to create the same data.

Imagine that two users try to register an account with the same username, where you have added validates_uniqueness_of :username in your user model. If they hit the “Sign up” button at the same time, Rails will look in the users table for that username and respond back that everything is fine and that it’s ok to save the record to the table. Rails will then save two records to the users table with the same username, and now you have a really shitty problem to deal with.

To avoid this you need to create a unique constraint at the database level as well:

class CreateUsers < ActiveRecord::Migration
  def change
    create_table :users do |t|
      t.string :username
      ...
    end

    add_index :users, :username, unique: true
  end
end

In psql:

indexes_development=# \d users
                                 Table "public.users"
  Column  |          Type          |                     Modifiers
----------+------------------------+----------------------------------------------------
 id       | integer                | not null default nextval('users_id_seq'::regclass)
 username | character varying(255) |
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "index_users_on_username" UNIQUE, btree (username)

So by creating the index_users_on_username unique index you get two very nice benefits: data integrity, as described above, and good performance, because unique indexes tend to be very fast.

Sorted Indexes

By default, the entries in a B-tree index are sorted in ascending order. However, in some particular cases it can be a good idea to use a descending order for the index instead.

One of the most obvious cases is when you have something that is paginated and all the items are sorted with the most recently released first, for example a blog post model that has a released_at column. For unreleased blog posts, the released_at value is NULL.

This is how you create this kind of index:

class CreatePosts < ActiveRecord::Migration
  def change
    create_table :posts do |t|
      t.string :title
      t.datetime :released_at
      t.timestamps
    end

    add_index :posts, :released_at, order: { released_at: "DESC NULLS LAST" }
  end
end

As we’re going to query the table in sorted order by released_at and limiting the result, we may have some benefit by creating an index in that order. PostgreSQL will find the rows it needs from the index in the correct order, and then go to the data blocks to retrieve the data. If the index wasn’t sorted, there’s a good chance that PostgreSQL would read the data blocks sequentially and then sort the results.

This technique is mostly relevant with single column indexes when you require nulls to be last. Otherwise the order is already there because an index can be scanned in any direction.

Partial Indexes

If you frequently filter your queries by a particular column value, and that column value is present in a minority of your rows, partial indexes may increase your performance significantly. A partial index is basically an index with a WHERE clause. It increases efficiency by reducing the index’s size, which means it takes less storage, is easier to maintain, and is faster to scan.

Let’s say that you have a table for orders. That table can contain both billed and unbilled orders, where the unbilled orders make up a minority of the total rows in the table, yet are likely the most accessed rows in your application. In that case it is very likely that your application’s performance will improve if you use a partial index.

Example:

class CreateOrders < ActiveRecord::Migration
  def change
    create_table :orders do |t|
      t.float :amount
      t.boolean :billed, default: false
      t.timestamps
    end

    add_index :orders, :billed, where: "billed = false"
  end
end

This is how it looks in psql:

indexes_development=# \d orders
                                      Table "public.orders"
   Column   |            Type             |                      Modifiers
------------+-----------------------------+-----------------------------------------------------
 id         | integer                     | not null default nextval('orders_id_seq'::regclass)
 amount     | double precision            |
 billed     | boolean                     | default false
 created_at | timestamp without time zone |
 updated_at | timestamp without time zone |
Indexes:
    "orders_pkey" PRIMARY KEY, btree (id)
    "index_orders_on_billed" btree (billed) WHERE billed = false

ReText | Free Development software downloads at SourceForge.net

First time in the country, ED raids a Bitcoin seller in Ahmedabad - India - DNA


Comments:"First time in the country, ED raids a Bitcoin seller in Ahmedabad - India - DNA"

URL:http://www.dnaindia.com/india/report-first-time-in-the-country-ed-raids-a-bitcoin-seller-in-ahmedabad-1941187


A couple of days after the Reserve Bank of India issued an advisory to the public not to indulge in the buying and selling of Bitcoins, the first raid in India was undertaken in Ahmedabad by the Enforcement Directorate (ED) on an entity that provides a platform to trade in this illegal virtual currency.

On Thursday, the ED raided the premises of Mahim Gupta in the Bopal area of the city, who provides a trading platform through his website, buysellbit.co.in. During the preliminary investigation, the ED found that it is in clear violation of the country’s Foreign Exchange Management Act (FEMA) rules, as the central bank does not give permission to indulge in such transactions.

“We have found that through the website 400 persons have recorded 1,000 transactions that amount to a few crores of rupees. We are gathering the data of the transactions, name of the people who have transacted in the virtual currency from Gupta’s server that is hired in the US. At present, we believe that this is a violation of foreign exchange regulations of the country. If we are able to establish money laundering aspect then he can be arrested,” said a top ED official.

As per sources, a separate raid was also conducted in Satellite area of the city, however, the person the investigation agency was looking for could not be found. “When we reached his office, he was not there. We have sealed the premises,” the official added.

The value of transactions is not known as each transaction will be verified by the investigating agency. However, it is likely to be around Rs20-30 crore. “Value of transaction is one aspect. Being a virtual currency its transfer and settlement is done online. No country has legalised Bitcoin as of now because of its opaque nature. The biggest threat is that without recording your transaction in official foreign currency platform money can be transferred like hawala with the use of this transaction. We are examining such instances, if any, here,” the official said.

Sources also added that there are a handful of entities that provide trading in the virtual currency in India. “I think there are only five entities. Of these, we believe two are operating from Ahmedabad. We believe that they have channel of agents or people who promote the use of such currency but entities that provide online platform are few,” official said.

Because of the complex nature of the transactions and the high level of information technology security involved, the ED is taking the help of a company called ECS Corporation that specialises in forensic audit and IT technology.

Virtual currency trade is three years old
Bitcoin came into existence just three years ago. It is a virtual currency that can be generated through complex computer software systems with solutions shared on a network. Despite being so new, Bitcoin has already become the world’s most expensive currency and its per-unit value recently soared past the USD 1,000 level, or about Rs 63,000, although the price has now slipped to Rs 46,600.

Google offering free Chromebooks to Indian schools - The Times of India


Comments:"Google offering free Chromebooks to Indian schools - The Times of India"

URL:http://timesofindia.indiatimes.com/tech/tech-news/hardware/Google-offering-free-Chromebooks-to-Indian-schools/articleshow/28000046.cms


HYDERABAD: Google is expanding to India an initiative to popularize the use of its Chromebook laptops in schools, starting with a pilot in four schools in Andhra Pradesh.

The internet search company that makes the world's most popular software for smartphones and tablets will initially make available 25 Chromebooks to each school and train the teachers and instructors in the use of the required software applications.

"The school instructors will teach core subjects using applications and software. We believe with interactive learning, the student will understand better and will take interest in the subjects," Ponnala Lakshmaiah, the state's minister for information technology, said.

Chromebooks require an internet connection to use, and most of the data, such as files that users work on, are stored on Google's storage network connected to the internet. Earlier this month, Samsung released a Chromebook model specifically for the Indian market. Schools are among the most popular market segments for the Chromebooks.

Google is running this programme in some 3,000 schools in the US, Singapore and Malaysia, a Google executive with direct knowledge of the plan to expand it to India told ET. The executive requested not to be named as Google was yet to announce the plan.

"Google aims to increase access to information and knowledge for all students, and encourages tools that support effective teaching and learning in the classroom, but we have nothing to announce at this time," a company spokeswoman said in an emailed statement.

The pilot project, in collaboration with Andhra Pradesh department of information technology, will start next month, a senior government official said. It will be launched in three government schools and one private school in Jangaon in Warangal district.

After the pilot, which benefits students of grades nine and 10, discussions are on to expand the programme statewide, the government official said.

The state's IT department will provide the schools with Wi-Fi internet connectivity with 1Mbps speed and power backups for unhindered use. Each school will be assigned dedicated mentors who would train the teachers and instructors, the official said.

Training will include using the Google Apps Training Center, an online learning environment that offers six modules including Google Apps Education Edition, Apps Mail, Calendar, Docs, Sites, and other tools.

What Surveillance Valley knows about you | PandoDaily


Comments:"What Surveillance Valley knows about you | PandoDaily"

URL:http://pando.com/2013/12/22/a-peek-into-surveillance-valley/


By Yasha Levine
On December 22, 2013

“In 2012, the data broker industry generated $150 billion in revenue; that’s twice the size of the entire intelligence budget of the United States government—all generated by the effort to detail and sell information about our private lives.” — Senator Jay Rockefeller IV “Quite simply, in the digital age, data-driven marketing has become the fuel on which America’s free market engine runs.” — Direct Marketing Association

* *

Google is very secretive about the exact nature of its for-profit intel operation and how it uses the petabytes of data it collects on us every single day for financial gain. Fortunately, though, we can get a sense of the kind of info that Google and other Surveillance Valley megacorps compile on us, and the ways in which that intel might be used and abused, by looking at the business practices of the “data broker” industry.

Thanks to a series of Senate hearings, the business of data brokerage is finally being understood by consumers, but the industry got its start back in the 1970s as a direct outgrowth of the failure of telemarketing. In its early days, telemarketing had an abysmal success rate: only 2 percent of people contacted would become customers. In his book, “The Digital Person,” Daniel J. Solove explains what happened next:

To increase the low response rate, marketers sought to sharpen their targeting techniques, which required more consumer research and an effective way to collect, store, and analyze information about consumers. The advent of the computer database gave marketers this long sought-after ability — and it launched a revolution in targeting technology.

Data brokers rushed in to fill the void. These operations pulled in information from any source they could get their hands on — voter registration, credit card transactions, product warranty information, donations to political campaigns and non-profits, court records — storing it in master databases and then analyzing it in all sorts of ways that could be useful to direct-mailing and telemarketing outfits. It wasn’t long before data brokers realized that this information could be used beyond telemarketing, and quickly evolved into a global for-profit intelligence business that serves every conceivable data and intelligence need.

Today, the industry churns somewhere around $200 billion in revenue annually. There are up to 4,000 data broker companies — some of the biggest are publicly traded — and together, they have detailed information on just about every adult in the western world.

No source of information is sacred: transaction records are bought in bulk from stores, retailers and merchants; magazine subscriptions are recorded; food and restaurant preferences are noted; public records and social networks are scoured and scraped. What kind of prescription drugs did you buy? What kind of books are you interested in? Are you a registered voter? To what non-profits do you donate? What movies do you watch? Political documentaries? Hunting reality TV shows?

That info is combined and kept up to date with address, payroll information, phone numbers, email accounts, social security numbers, vehicle registration and financial history. And all that is sliced, isolated, analyzed and mined for data about you and your habits in a million different ways.

The dossiers are not restricted to generic market segmenting categories like “Young Literati” or “Shotguns and Pickups” or “Kids & Cul-de-Sacs,” but often contain the most private and intimate details about a person’s life, all of it packaged and sold over and over again to anyone willing to pay.

Take MEDbase200, a boutique for-profit intel outfit that specializes in selling health-related consumer data. Well, until last week, the company offered its clients a list of rape victims (or “rape sufferers,” as the company calls them) at the low price of $79.00 per thousand. The company claims to have segmented this data set into hundreds of different categories, including stuff like the ailments they suffer, prescription drugs they take and their ethnicity:

These rape sufferers are family members who have reported, or have been identified as individuals affected by specific illnesses, conditions or ailments relating to rape. Medbase200 is the owner of this list. Select from families affected by over 500 different ailments, and/or who are consumers of over 200 different Rx medications. Lists can be further selected on the basis of lifestyle, ethnicity, geo, gender, and much more. Inquire today for more information.

MEDbase promptly took its “rape sufferers” list off line last week after its existence was revealed in a Senate investigation into the activities of the data-broker industry. The company pretended like the list was a huge mistake. A MEDbase rep tried convincing a Wall Street Journal reporter that its rape dossiers were just a “hypothetical list of health conditions/ailments.” The rep promised it was never sold to anyone. Yep, it was a big mistake. We can all rest easy now. Thankfully, MEDbase has hundreds of other similar dossier collections, hawking the most private and sensitive medical information.

For instance, if lists of rape victims aren’t your thing, MEDbase can sell dossiers on people suffering from anorexia, substance abuse, AIDS and HIV, Alzheimer’s Disease, Asperger Disorder, Attention Deficit Hyperactivity Disorder, Bedwetting (Enuresis), Binge Eating Disorder, Depression, Fetal Alcohol Syndrome, Genital Herpes, Genital Warts, Gonorrhea, Homelessness, Infertility, Syphilis… the list goes on and on and on and on.

Normally, such detailed health information would fall under federal law and could not be disclosed or sold without consent. But because these data harvesters rely on indirect sources of information instead of medical records, they’re able to sidestep regulations put in place to protect the privacy of people’s health data.

MEDbase isn’t the only company exploiting these loopholes. By the industry’s own estimates, there are something like 4,000 for-profit intel companies operating in the United States. Many of them sell information that would normally be restricted under federal law. They offer all sorts of targeted dossier collections on every population segment of our society, from the affluent to the extremely vulnerable:

  • people with drug addictions
  • detailed personal info on police officers and other government employees
  • people with bad credit/bankruptcies
  • minorities who’ve used payday loan services
  • domestic violence shelter locations (normally these addresses would be shielded by law)
  • elderly gamblers

If you want to see how this kind of profile data can be used to scam unsuspecting individuals, look no further than Richard Guthrie, an Iowa retiree who had his life savings siphoned out of his bank account. The scammers’ weapon of choice: databases bought from large for-profit data brokers listing retirees who entered sweepstakes and bought lottery tickets.

Here’s a 2007 New York Times story describing the racket:

Mr. Guthrie, who lives in Iowa, had entered a few sweepstakes that caused his name to appear in a database advertised by infoUSA, one of the largest compilers of consumer information. InfoUSA sold his name, and data on scores of other elderly Americans, to known lawbreakers, regulators say. InfoUSA advertised lists of “Elderly Opportunity Seekers,” 3.3 million older people “looking for ways to make money,” and “Suffering Seniors,” 4.7 million people with cancer or Alzheimer’s disease. “Oldies but Goodies” contained 500,000 gamblers over 55 years old, for 8.5 cents apiece. One list said: “These people are gullible. They want to believe that their luck can change.”

Data brokers argue that cases like Guthrie’s are an anomaly — a once-in-a-blue-moon tragedy in an industry that takes privacy and legal conduct seriously. But cases of identity thieves and sophisticated con rings obtaining data from for-profit intel businesses abound. Scammers are a lucrative source of revenue. Their money is just as good as anyone else’s. And some of the profile “products” offered by the industry seem tailored specifically for fraud.

As Royal Canadian Mounted Police Sergeant Yves Leblanc told the New York Times: “Only one kind of customer wants to buy lists of seniors interested in lotteries and sweepstakes: criminals. If someone advertises a list by saying it contains gullible or elderly people, it’s like putting out a sign saying ‘Thieves welcome here.’”

So what is InfoUSA, exactly? What kind of company would create and sell lists customized for use by scammers and cons?

As it turns out, InfoUSA is not some fringe or shady outfit, but a hugely profitable politically connected company. InfoUSA was started by Vin Gupta in the 1970s as a basement operation hawking detailed lists of RV and mobile home dealers. The company quickly expanded into other areas and began providing business intel  services to thousands of businesses. By 2000, the company raised more than $30 million in venture capital funding from major Silicon Valley venture capital firms.

By then, InfoUSA boasted of having information on 230 million consumers. A few years later, InfoUSA counted the biggest Valley companies as its clients, including Google, Yahoo, Microsoft and AOL. It got involved not only in raw data and dossiers, but moved into payroll and financial services, conducted polling and opinion research, partnered with CNN, vetted employees and provided customized services for law enforcement and all sorts of federal and government agencies: processing government payments, helping states locate tax cheats and even administering President Bill Clinton’s “Welfare to Work” program. This is not surprising, as Vin Gupta is a major and close political supporter of Bill and Hillary Clinton.

In 2008, Gupta was sued by InfoUSA shareholders for inappropriately using corporate funds. Shareholders accused Gupta of illegally funneling corporate money to fund an extravagant lifestyle and curry political favor. According to the Associated Press, the lawsuit questioned why Gupta used private corporate jets to fly the Clintons on personal and campaign trips, and why Gupta awarded Bill Clinton a $3.3 million consulting gig.

As a result of the scandal, InfoUSA was threatened with delisting from Nasdaq, Gupta was forced out and the company was snapped up for half a billion dollars by CCMP Capital Advisors, a major private equity firm spun off from JP Morgan in 2006. Today, InfoUSA continues to do business under the name Infogroup, and has nearly 4,000 employees working in nine countries.

As big as Infogroup is, there are dozens of other for-profit intelligence businesses that are even bigger: massive multi-national intel conglomerates with revenues in the billions of dollars. Some of them, like Lexis-Nexis and Experian, are well known, but mostly these are outfits that few Americans have heard of, with names like Epsilon, Altegrity and Acxiom.

These for-profit intel behemoths are involved in everything from debt collection to credit reports to consumer tracking to healthcare analysis, and provide all manner of tailored services to government and law enforcement around the world. For instance, Acxiom has done business with most major corporations, and boasts of  intel on “500 million active consumers worldwide, with about 1,500 data points per person. That includes a majority of adults in the United States,” according to the New York Times.

This data is analyzed and sliced in increasingly sophisticated and intrusive ways to profile and predict behavior. Merchants are using it to customize the shopping experience — Target launched a program to figure out if a woman shopper was pregnant and when the baby would be born, “even if she didn’t want us to know.” Life insurance companies are experimenting with predictive consumer intel to estimate life expectancy and determine eligibility for life insurance policies. Meanwhile, health insurance companies are raking through this data in order to deny and challenge the medical claims of their policyholders.

Even more alarming, large employers are turning to for-profit intelligence to mine and monitor the lifestyles and habits of their workers outside the workplace. Earlier this year, the Wall Street Journal described how employers have partnered with health insurance companies to monitor workers for “health-adverse” behavior that could lead to higher medical expenses down the line:

Your company already knows whether you have been taking your meds, getting your teeth cleaned and going for regular medical checkups. Now some employers or their insurance companies are tracking what staffers eat, where they shop and how much weight they are putting on — and taking action to keep them in line. But companies also have started scrutinizing employees’ other behavior more discreetly. Blue Cross and Blue Shield of North Carolina recently began buying spending data on more than 3 million people in its employer group plans. If someone, say, purchases plus-size clothing, the health plan could flag him for potential obesity — and then call or send mailings offering weight-loss solutions. …”Everybody is using these databases to sell you stuff,” says Daryl Wansink, director of health economics for the Blue Cross unit. “We happen to be trying to sell you something that can get you healthier.”

“As an employer, I want you on that medication that you need to be on,” says Julie Stone, a HR expert at Towers Watson told the Wall Street Journal.

Companies might try to frame it as a health issue. I mean, what kind of asshole could be against employers caring about the wellbeing of their workers? But their ultimate concern has nothing to do with the employee health. It’s all about the brutal bottom line: keeping costs down.

An employer monitoring and controlling your activity outside of work? You don’t have to be a union agitator to see the problems with this kind of mindset and where it could lead. There are lots of things that some employers might want to know about your personal life, and not only to “keep costs down.” It could be anything: weeding out people based on undesirable habits, or discriminating against workers based on sexual orientation, religion or political beliefs.

It’s not difficult to imagine that a large corporation facing labor unrest or a unionization drive would be interested in proactively flagging potential troublemakers by pinpointing employees who might be sympathetic to the cause. The technology and data are already here for wide and easy application: did a worker watch certain political documentaries, donate to environmental non-profits, join an animal rights Facebook group, tweet out support for Occupy Wall Street, subscribe to the Nation or Jacobin, buy Naomi Klein’s “Shock Doctrine”? Or maybe the worker simply rented one of Michael Moore’s films? Run your payroll through one of the massive consumer intel databases and see if there is any match. Bound to be plenty of unpleasant surprises for HR!

This has happened in the past, although in a cruder and more limited way. In the 1950s, for instance, some lefty intellectuals had their lefty newspapers and mags delivered to P.O. boxes instead of their home address, worrying that otherwise they’d get tagged as Commie symps. That might have worked in the past. But with the power of private intel companies, today there’s nowhere to hide.

FTC Commissioner Julie Brill has repeatedly voiced concern that unregulated data being amassed by for-profit intel companies would be used to discriminate and deny employment, and to determine consumer access to everything from credit to insurance to housing. “As Big Data algorithms become more accurate and powerful, consumers need to know a lot more about the ways in which their data is used,” she told the Wall Street Journal.

Pam Dixon, executive director of the World Privacy Forum, agrees. Dixon frequently testifies on Capitol Hill to warn about the growing danger to privacy and civil liberties posed by big data and for-profit intelligence. In Congressional testimony back in 2009, Dixon called this growing mountain of data the “modern permanent record” and explained that users of these new intel capabilities will inevitably expand to include not just marketers and law enforcement, but insurance companies, employers, landlords, schools, parents, scammers and stalkers. “The information – like credit reports – will be used to make basic decisions about the ability of individual to travel, participate in the economy, find opportunities, find places to live, purchase goods and services, and make judgments about the importance, worthiness, and interests of individuals.”

* *

For the past year, Chairman John D. (Jay) Rockefeller IV has been conducting a Senate Commerce Committee investigation of the data broker industry and how it affects consumers. The committee finished its investigation last week without reaching any real conclusions, but issued a report warning about the dangers posed by the for-profit intel industry and the need for further action by lawmakers. The report noted with concern that many of these firms failed to cooperate with the investigation into their business practices:

Data brokers operate behind a veil of secrecy. Three of the largest companies – Acxiom, Experian, and Epsilon – to date have been similarly secretive with the Committee with respect to their practices, refusing to identify the specific sources of their data or the customers who purchase it. … The refusal by several major data broker companies to provide the Committee complete responses regarding data sources and customers only reinforces the aura of secrecy surrounding the industry.

Rockefeller’s investigation was an important first step breaking open this secretive industry, but it was missing one notable element. Despite its focus on companies that feed on people’s personal data, the investigation did not include Google or the other big Surveillance Valley data munchers. And that’s too bad. Because if anything, the investigation into data brokers only highlighted the danger posed by the consumer-facing data companies like Google, Facebook, Yahoo and Apple.

As intrusive as data brokers are, the level of detail in the information they compile on Americans pales in comparison to what can be vacuumed up by a company like Google. To compile their dossiers, traditional data brokers rely on mostly indirect intel: what people buy, where they vacation, what websites they visit. Google, on the other hand, has access to the raw uncensored contents of your inner life: personal emails, chats, the diary entries and medical records that we store in the cloud, our personal communication with doctors, lawyers, psychologists, friends. Data brokers know us through our spending habits. Google accesses the unfiltered details of our personal lives.

A recent study showed that Americans are overwhelmingly opposed to having their online activity tracked and analyzed. Seventy-three percent of people polled for the Pew Internet & American Life Project  viewed the tracking of their search history as an invasion of privacy, while 68 percent were against targeted advertising, replying: “I don’t like having my online behavior tracked and analyzed.”

This isn’t news to companies like Google, which last year warned shareholders: “Privacy concerns relating to our technology could damage our reputation and deter current and potential users from using our products and services.”

Little wonder then that Google, and the rest of Surveillance Valley, is terrified that  the conversation about surveillance could soon broaden to include not only government espionage, but for-profit spying as well.

[Image courtesy redjar]

Linode Blog » Docker on Linode


Comments:"Linode Blog » Docker on Linode"

URL:https://blog.linode.com/2014/01/03/docker-on-linode/


January 3, 2014 1:48 pm

We’re pleased to announce that Linode now supports Docker right out of the box.

Docker allows you to create lightweight containers for your applications as well as use images created by other users.

Docker’s latest release, 0.7, focused on supporting a wider range of standard kernel configuration options, and we’ve released a new kernel (3.12.6) with adjustments to match this. Starting with this release, you’ll be able to use Docker with the default Linode kernel, rather than needing to use a custom kernel via pv-grub.

Installing Docker on Your Linode

  1. Make sure you are running our latest kernel. You may need to reboot to get it.
  2. Install Docker by following their excellent documentation: Start Using Docker

Try it out by running through the Hello World example or really dive in and set up a Redis service! Feel free to check out all of the Docker Examples or search through the Docker Image Index to learn more.

Enjoy!

Filed under: announcements, features by stan_theman
