Channel: Hacker News 50

Yahoo mail hacked: What to do if you’ve been affected - The Washington Post


Comments:"Yahoo mail hacked: What to do if you’ve been affected - The Washington Post"

URL:http://m.washingtonpost.com/business/technology/yahoo-mail-hacked-what-to-do-if-youve-been-affected/2014/01/31/2857ef8a-8a7d-11e3-833c-33098f9e5267_story.html


Yahoo Mail users, we have some bad news: It’s time to change your e-mail password.

In a company blog post Thursday night, Yahoo revealed that a number of users’ passwords and usernames were exposed to cyber-attackers who used malicious computer software to gain access to lists of Yahoo Mail credentials.

The information was likely collected from a third-party database, Jay Rossiter, Yahoo’s senior vice president of platforms and personalization products, wrote in the posting.

The company is resetting passwords on accounts that have been affected and is taking steps to allow users to re-secure their accounts. It is sending notification e-mails instructing those users to change their passwords; users may also receive a text message, if they’ve shared their phone number with the company.

It’s a song-and-dance that users may be tiring of, but it is important for Yahoo account holders who were swept up in the attack to change their passwords immediately. They should also change their log-in credentials for any account that may share their Yahoo password, particularly if they use their Yahoo e-mail as their username. The same is true if you use a similar e-mail address as the username — it’s not a big leap for hackers to guess that you may be both jdoe@yahoo.com and jdoe@gmail.com.
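
For readers who want to act on that advice, here is a minimal Python sketch (ours, not the Post's; the service names are made up) that generates a distinct random password for each account using the standard-library secrets module:

import secrets
import string

# Draw characters from letters, digits, and a few common symbols.
ALPHABET = string.ascii_letters + string.digits + "!@#$%^&*"

def new_password(length: int = 16) -> str:
    # Cryptographically random password of the requested length.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

# Hypothetical accounts; the point is one unique password per service.
for service in ("yahoo-mail", "gmail", "bank"):
    print(service, new_password())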

Finally, everyone should also be on the lookout for spam, as the attack also appears to have picked up names and e-mail addresses for the most recent contacts from affected accounts, according to the company’s post.

If you get an odd e-mail from the Yahoo account of someone you know, ignore the message, and do not click on any links in it. (It would also be nice to let the person whose account has been hacked know about the fraudulent messages, so they can warn others to avoid the e-mails.)

Yahoo has apologized for the inconvenience and has said that it has taken “additional measures” to block attacks on its system. The company did not immediately respond to a request asking how many of its users were affected.

Yahoo is the world's second-largest e-mail provider, with an estimated 273 million users, according to a report from the Associated Press.

Related stories:

Yahoo’s 4th-quarter results dragged down by revenue drop

The Switch: Thousands of visitors to yahoo.com hit with malware attack, researchers say

Follow The Post’s new tech blog, The Switch, where technology and policy connect.


Our first matching loan fund launches in beta – Zidisha: P2P Microfinance


Comments:"Our first matching loan fund launches in beta – Zidisha: P2P Microfinance"

URL:http://p2p-microlending-blog.zidisha.org/2014/02/02/our-first-matching-loan-fund-launches-in-beta/


Today marks the beta launch of our first matching loan fund.  For a limited time, this fund will match, dollar for dollar, the amounts lent manually by individual Zidisha members.

For lenders, this means double the impact for each dollar lent.  For borrowers, this means that loan applications that meet the quality standards necessary to attract bids from individual lenders are more likely to raise the full amount needed.
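
As a back-of-the-envelope illustration of the mechanism (a sketch of ours, not Zidisha's code, with made-up figures), dollar-for-dollar matching simply doubles whatever individual members lend toward a loan:

def total_raised(individual_lending: float, match_rate: float = 1.0) -> float:
    # Each manually lent dollar is matched at match_rate (1.0 = dollar for dollar).
    return individual_lending * (1 + match_rate)

# Made-up example: $250 lent by individual members becomes $500 toward the loan.
print(total_raised(250.0))  # 500.0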

Our first matching loan fund is made possible by a generous contribution from Yun-Fang Juan, the creator of Facebook Ads and active supporter of game-changing financial services startups.

View our latest entrepreneurs here!


Posted on February 2, 2014 by Julia Kurnia.

Sheriff Fires Cop Who Threatened to Arrest Me for Taking Photos of Cops | Slog


Comments:"Sheriff Fires Cop Who Threatened to Arrest Me for Taking Photos of Cops | Slog"

URL:http://slog.thestranger.com/slog/archives/2014/02/03/sheriff-fires-cop-who-threatened-to-arrest-me-for-taking-photos-of-cops


King County deputy Patrick "K.C." Saulet has been fired for threatening to arrest me last summer, when I was photographing several officers on a downtown street corner, and then lying to investigators about the incident, says King County sheriff John Urquhart.

The termination is effective today.

"You have a constitutional right to photograph the police," Sheriff Urquhart asserted in a phone interview with me today. Threatening to arrest a citizen for legally taking photos of cops while on public property, he added, "is a constitutional violation, as far as I am concerned."

That incident occurred last July 30 in the International District. The short version: Several officers were surrounding a man sitting on a planter box. I'd started taking photos from a distance when Saulet rushed over to say I'd be arrested if I didn't leave. He claimed, wrongly, that I was standing on private property—that the International District Station plaza belongs to King County Metro and I could not stand there. Even though that didn't sound right to me, I backed up anyway until I was unambiguously on the City of Seattle's sidewalk. But Sergeant Saulet insisted that was illegal, too, and I would be arrested unless I left the entire block. I filed a complaint with King County against Saulet. (I also filed another complaint with the Seattle Police Department, against a Seattle cop who was nearby, saw the interaction I had with Saulet, and threatened to come into The Stranger's offices and harass me at work. SPD punished him with a day off.)

After a six month investigation—which included an internal recommendation to terminate Saulet and the King County police union attempting to overturn that recommendation—Sheriff Urquhart issued a disciplinary letter on January 30.

One paragraph of the sheriff's letter to Saulet summarized the results:

Suffice it to say, in my judgement, the evidence shows that (i) you abused your authority in your dealings with Mr. Holden on July 30, and (ii) thereafter, rather than be accountable, you attempted to recast events in a light more favorable to you. Stated broadly, for example, you claim you interacted with Mr. Holden in a civil, professional manner that was nothing more than 'social contact'; you did little more than tell him for his benefit that he couldn't ride on Metro property because doing so is a $66 infraction; [you claim that two other deputies] Shook and Mikulcik told him the same thing; and you once calmly pointed him in a direction you were suggesting he leave. But the evidence is that you approached Mr. Holden because you took exception with him lawfully exercising his right to take photographs of you and your colleagues while lawfully standing on public property; you were agitated and confrontational; you essentially 'squared off' with him; you expressly and/or implicitly threatened to arrest him if he did not leave immediately in the specific direction you pointed, not once but five times (misidentifying public property as private property in the process); and Shook and Mikulcik deny the statement you attribute to them.

"Your ill-advised actions also play to some of the most basic fears among some citizens, which is that a police officer may indiscriminately exercise his or her power in violation of their rights," Urquharts discipline letter continues. He explains people fear that "in the event of a complaint, the officer will just deny the allegations and 'circle the wagons' with his or her fellow officers with the expectation they will take care of their own. In a matter of minutes, your actions violated the trust that we, as a department, spend years trying to build and maintain."

Saulet and his union fought the decision. They argued the investigation was a "witch hunt," according to the sheriff's termination letter, and that Saulet did nothing wrong. After Saulet's commander and the deputy sheriff recommended termination, Saulet appealed their recommendation in what's called a Loudermill hearing before the sheriff. Saulet and his union, called the King County Police Officers' Guild, further argued that investigators asked leading questions and that "none of the witness statements are consistent."

"This is an overstatement," says Urquhart in his discipline letter to Saulet. "There are some inconsistencies to be sure, but no more or less than is typical of most police investigations: The most comprehensive and fundamental conflict was between Mr. Holden's statement and yours, and the other statements provided substantially more support for him than you on key points." Responding to the "witch hunt" claim, which the union made based on the fact that there was a particularly large investigation file, Urquhart says that "the density of the file, however, favorably reflects the thoroughness of the investigation. If the department in general, or I or the investigator in particular, were 'hunting' for a reason to take action against you, we would not have made such a substantial effort to collect and carefully review all relevant circumstances, including any and all that might have wholly or partly exculpated you or otherwise mitigate the circumstances."

Saulet and his union may try to appeal, but, as of today, Urquhart says they have not.

Saulet has a long history of misconduct, with approximately 120 allegations against him and 21 cases of sustained misconduct (more than any other officer in the department). The sheriff's letter says that Saulet was repeatedly told to improve his interactions with the public and was provided with remarkable investments of coaching and counseling: three performance-improvement plans, two training sessions, two multi-visit sessions with a social psychologist, coaching sessions with supervisors, and 80 hours of time off without pay. Saulet was demoted from sergeant to deputy for another incident in August.

For the record: I'm not gleeful that Saulet got fired, although it's welcome evidence that Urquhart takes complaints seriously. This incident—and my complaint—is not about me. After growing up in this town, I believe that certain cops regularly subject civilians (particularly racial minorities) to abusive treatment—much more abusive than what I faced here. Often, folks don't complain, and when they do, the record shows, bad cops are often wholly or partially exonerated, even when they're guilty. We know from a US Department of Justice investigation that Seattle cops have a practice of using excessive force, and we know from internal audits that the King County Sheriff's Office has had problems disciplining bad cops. So now, more than ever, I think citizens should complain if they encounter hostile, unconstitutional, or violent policing. Sheriff Urquhart has only been in office about one year. Again, it's good to see him taking complaints against problem cops seriously. Most cops are not problem cops. Most work hard and keep us safe. It's miserable that abusive cops ruin those good cops' reputations, and if we're going to get from here to a place where the public trusts the police more, it will require police brass continuing to punish the bad apples, as Urquhart has done.

Urquhart added that I was "treated no differently than other people" who file complaints. His decision to terminate Saulet was not because I'm a reporter and editor at a newspaper, he said. "We would do exactly the same with anyone making these allegations." Urquhart has fired other deputies accused of misconduct.


Mark Zuckerberg - Today is Facebook's 10th anniversary.... | Facebook


Comments:"Mark Zuckerberg - Today is Facebook's 10th anniversary.... | Facebook"

URL:https://www.facebook.com/zuck/posts/10101250930776491


Today is Facebook's 10th anniversary.

It's been an amazing journey so far, and I'm so grateful to be a part of it. It's rare to be able to touch so many people's lives, and I try to remind myself to make the most of every day and have the biggest impact I can.

People often ask if I always knew that Facebook would become what it is today. No way.

I remember getting pizza with my friends one night in college shortly after opening Facebook. I told them I was excited to help connect our school community, but one day someone needed to connect the whole world.

I always thought this was important -- giving people the power to share and stay connected, empowering people to build their own communities themselves.

When I reflect on the last 10 years, one question I ask myself is: why were we the ones to build this? We were just students. We had way fewer resources than big companies. If they had focused on this problem, they could have done it.

The only answer I can think of is: we just cared more.

While some doubted that connecting the world was actually important, we were building. While others doubted that this would be sustainable, you were forming lasting connections.

We just cared more about connecting the world than anyone else. And we still do today.

That's why I'm even more excited about the next ten years than the last. The first ten years were about bootstrapping this network. Now we have the resources to help people across the world solve even bigger and more important problems.

Today, only one-third of the world's population has access to the internet. In the next decade, we have the opportunity and the responsibility to connect the other two-thirds.

Today, social networks are mostly about sharing moments. In the next decade, they'll also help you answer questions and solve complex problems.

Today, we have only a few ways to share our experiences. In the next decade, technology will enable us to create many more ways to capture and communicate new kinds of experiences.

It's been amazing to see how all of you have used our tools to build a real community. You've shared the happy moments and the painful ones. You've started new families, and kept spread out families connected. You've created new services and built small businesses. You've helped each other in so many ways.

I'm so grateful to be able to help build these tools for you. I feel a deep responsibility to make the most of my time here and serve you the best I can.

Thank you for letting me be a part of this journey.

Silicon Valley billionaires believe in the free market, as long as they benefit | Dean Baker | Comment is free | theguardian.com


Comments:" Silicon Valley billionaires believe in the free market, as long as they benefit | Dean Baker | Comment is free | theguardian.com "

URL:http://www.theguardian.com/commentisfree/2014/feb/03/google-apple-silicon-valley-free-market-joke


Agreements between Google, Apple and a host of other Silicon Valley companies not to compete for each other's workers are illegal. Photograph: Adam Berry/Getty Images

Last week, Mark Ames published an article that should forever destroy any connection between the Silicon Valley tech billionaires and their supposed libertarian worldviews. The article reports on a court case that alleges that Apple, Google, and other Silicon Valley powerhouses actively conspired to keep their workers' wages down. According to documents filed in the case, these companies agreed not to compete for each other's workers dating at least as far back as 2005. Workers in the industry have filed a class action suit that could lead to the payment of billions of dollars in lost wages.

This case is striking on many levels, the most obvious being the effective theft of large amounts of money by some of the richest people on the planet from their employees. This is pernicious, but not altogether surprising. After all, the boss stealing from the workers is as dog bites man as it gets. Few would be surprised that rich people were willing to break the law to get even richer.

The real news here is how the Silicon Valley barons allegedly broke the law. The charge is that they actively colluded to stifle market forces. They collectively acted to prevent their workers from receiving the market-clearing wage. This means not only that they broke the law, and that they acted to undermine the market, but that they really don't think about the market the way libertarians claim to think about the market.

The classic libertarian view of the market is that we have a huge number of people in the market actively competing to buy and sell goods and services. They acknowledge the obvious – some actors are much bigger than others – but there is so much competition that no individual or company can really hope to have much impact on market outcomes.

This point is central to their argument that the government should not interfere with corporate practices. For example, if we think our local cable company is charging too much for cable access, our libertarian friends will insist that the phone company, satellite television or other competitors will step in to keep prices in line. They would tell the same story if the issue were regulating the airlines, banks, health insurance, or any other sector where there is reason to believe that competition might be limited.

They would tell the same story on the labor side. If we are concerned that workers are getting low wages, then the answer is to improve their skills through education and training, rather than raise the minimum wage. If workers were worth more than the minimum wage, then the market would already be paying them more than the minimum wage.

They tell the same story when it comes to requiring family leave, sick days, or other benefits. Libertarians would say that if workers value these benefits, they would negotiate for them and be willing to trade off wages. There is no reason for the government to get involved.

This story about the wonders of the free market is simple in its appeal and it has the great implication that nothing should be done to keep the rich from getting ever richer. However, the Silicon Valley non-compete agreements show that this is not how the tech billionaires believe the market really works. This is just a story they peddle to children and gullible reporters.

If they really believed the market had a deep sea of competitors in which no individual actor could count for much, then their non-compete agreements would serve no purpose. If Google, Apple, Intel and the other biggies agreed not to hire each other's workers, it really wouldn't affect their pay since there would always be new upstarts ready to jump in and hire away underpaid engineers.

The Silicon Valley honchos took the time to negotiate and presumably enforce these non-compete agreements precisely because they did not think there were enough competitors to hire away their workers. They believed that they had enough weight on the buy side of the market for software engineers that, if they agreed not to compete for workers, they could keep wages down.

It shouldn't be surprising that the Silicon Valley billionaires really aren't libertarians. After all, much of their fortunes rest on patents and copyrights, both of which are government-granted monopolies: the opposite of a free market.

But for some reason, seeing the tech whiz-kids forming a cartel to keep down their workers' wages seems an even more direct violation of any belief in libertarian principles. This is the same sort of cartel behavior that we associate with the cigar-chomping robber barons of the late 19th century. It turns out that the biggest difference between the tech billionaires of the Internet Age and the high rollers of the railroad age is the cigars.


Indian School Girl Creates Washing Machine That Runs Without Electricity! | Trak.in


Comments:"Indian School Girl Creates Washing Machine That Runs Without Electricity! | Trak.in"

URL:http://trak.in/innovation/indian-school-girl-invents-washing-machine-without-electricity-302013/


We love these kinds of inventions – the ones that solve real-world, grassroots problems. Remya Jose, a 14-year-old schoolgirl from Kerala, has created a simple pedal-powered washing machine, born of her dislike of washing clothes by hand. Her invention is not only a washing machine but can double as an exercise machine as well.

This washing machine is actually quite simple: a modified bicycle frame attached to a square metal box that contains an iron-mesh drum. The pedal assembly works much like that of an exercise cycle, with the metal box attached as an extension.

The iron-mesh drum holds the clothes, while the metal box holds the water mixed with detergent. Pedaling the machine for a few minutes spins the drum, vigorously washing the clothes inside it. This cycle is repeated once or twice, cleaning the clothes thoroughly.

After the water is drained, pedaling for a few more minutes can dry the clothes to about 80 percent.

The machine does not need any electricity, doubles as an exercise machine, and is quite affordable!

Check out this video where Remya demonstrates how this washing machine functions:

It looks like the invention was first shown on the Discovery Channel (the video carries the Discovery logo) and was later uploaded to YouTube by a user.

This invention is similar to another, the “Load carrier for labourer,” which was designed and created by a student at the National Institute of Design and won a best product design award in 2011.

What do you think of this pedal powered washing machine?


Rachel Aviv: The Scientist Who Took on a Leading Herbicide Manufacturer : The New Yorker


Comments:"Rachel Aviv: The Scientist Who Took on a Leading Herbicide Manufacturer : The New Yorker"

URL:http://www.newyorker.com/reporting/2014/02/10/140210fa_fact_aviv?currentPage=all


In 2001, seven years after joining the biology faculty of the University of California, Berkeley, Tyrone Hayes stopped talking about his research with people he didn’t trust. He instructed the students in his lab, where he was raising three thousand frogs, to hang up the phone if they heard a click, a signal that a third party might be on the line. Other scientists seemed to remember events differently, he noticed, so he started carrying an audio recorder to meetings. “The secret to a happy, successful life of paranoia,” he liked to say, “is to keep careful track of your persecutors.”

Three years earlier, Syngenta, one of the largest agribusinesses in the world, had asked Hayes to conduct experiments on the herbicide atrazine, which is applied to more than half the corn in the United States. Hayes was thirty-one, and he had already published twenty papers on the endocrinology of amphibians. David Wake, a professor in Hayes’s department, said that Hayes “may have had the greatest potential of anyone in the field.” But, when Hayes discovered that atrazine might impede the sexual development of frogs, his dealings with Syngenta became strained, and, in November, 2000, he ended his relationship with the company.

Hayes continued studying atrazine on his own, and soon he became convinced that Syngenta representatives were following him to conferences around the world. He worried that the company was orchestrating a campaign to destroy his reputation. He complained that whenever he gave public talks there was a stranger in the back of the room, taking notes. On a trip to Washington, D.C., in 2003, he stayed at a different hotel each night. He was still in touch with a few Syngenta scientists and, after noticing that they knew many details about his work and his schedule, he suspected that they were reading his e-mails. To confuse them, he asked a student to write misleading e-mails from his office computer while he was travelling. He sent backup copies of his data and notes to his parents in sealed boxes. In an e-mail to one Syngenta scientist, he wrote that he had “risked my reputation, my name . . . some say even my life, for what I thought (and now know) is right.” A few scientists had previously done experiments that anticipated Hayes’s work, but no one had observed such extreme effects. In another e-mail to Syngenta, he acknowledged that it might appear that he was suffering from a “Napoleon complex” or “delusions of grandeur.”

For years, despite his achievements, Hayes had felt like an interloper. In academic settings, it seemed to him that his colleagues were operating according to a frivolous code of manners: they spoke so formally, fashioning themselves as detached authorities, and rarely admitted what they didn’t know. He had grown up in Columbia, South Carolina, in a neighborhood where fewer than forty per cent of residents finish high school. Until sixth grade, when he was accepted into a program for the gifted, in a different neighborhood, he had never had a conversation with a white person his age. He and his friends used to tell one another how “white people do this, and white people do that,” pretending that they knew. After he switched schools and took advanced courses, the black kids made fun of him, saying, “Oh, he thinks he’s white.”

He was fascinated by the idea of metamorphosis, and spent much of his adolescence collecting tadpoles and frogs and crossbreeding different species of grasshoppers. He raised frog larvae on his parents’ front porch, and examined how lizards respond to changes in temperature (by using a blow-dryer) and light (by placing them in a doghouse). His father, a carpet layer, used to look at his experiments, shake his head, and say, “There’s a fine line between a genius and a fool.”

Hayes received a scholarship to Harvard, and, in 1985, began what he calls the worst four years of his life. Many of the other black students had gone to private schools and came from affluent families. He felt disconnected and ill-equipped—he was placed on academic probation—until he became close to a biology professor, who encouraged him to work in his lab. Five feet three and thin, Hayes distinguished himself by dressing flamboyantly, like Prince. The Harvard Crimson, in an article about a campus party, wrote that he looked as if he belonged in the “rock-’n’-ready atmosphere of New York’s Danceteria.” He thought about dropping out, but then he started dating a classmate, Katherine Kim, a Korean-American biology major from Kansas. He married her two days after he graduated.

They moved to Berkeley, where Hayes enrolled in the university’s program in integrative biology. He completed his Ph.D. in three and a half years, and was immediately hired by his department. “He was a force of nature—incredibly gifted and hardworking,” Paul Barber, a colleague who is now a professor at U.C.L.A., says. Hayes became one of only a few black tenured biology professors in the country. He won Berkeley’s highest award for teaching, and ran the most racially diverse lab in his department, attracting students who were the first in their families to go to college. Nigel Noriega, a former graduate student, said that the lab was a “comfort zone” for students who were “just suffocating at Berkeley,” because they felt alienated from academic culture.

Hayes had become accustomed to steady praise from his colleagues, but, when Syngenta cast doubt on his work, he became preoccupied by old anxieties. He believed that the company was trying to isolate him from other scientists and “play on my insecurities—the fear that I’m not good enough, that everyone thinks I’m a fraud,” he said. He told colleagues that he suspected that Syngenta held “focus groups” on how to mine his vulnerabilities. Roger Liu, who worked in Hayes’s lab for a decade, both as an undergraduate and as a graduate student, said, “In the beginning, I was really worried for his safety. But then I couldn’t tell where the reality ended and the exaggeration crept in.”

Liu and several other former students said that they had remained skeptical of Hayes’s accusations until last summer, when an article appeared in Environmental Health News that drew on Syngenta’s internal records. Hundreds of Syngenta’s memos, notes, and e-mails have been unsealed following the settlement, in 2012, of two class-action suits brought by twenty-three Midwestern cities and towns that accused Syngenta of “concealing atrazine’s true dangerous nature” and contaminating their drinking water. Stephen Tillery, the lawyer who argued the cases, said, “Tyrone’s work gave us the scientific basis for the lawsuit.”

Hayes has devoted the past fifteen years to studying atrazine, and during that time scientists around the world have expanded on his findings, suggesting that the herbicide is associated with birth defects in humans as well as in animals. The company documents show that, while Hayes was studying atrazine, Syngenta was studying him, as he had long suspected. Syngenta’s public-relations team had drafted a list of four goals. The first was “discredit Hayes.” In a spiral-bound notebook, Syngenta’s communications manager, Sherry Ford, who referred to Hayes by his initials, wrote that the company could “prevent citing of TH data by revealing him as noncredible.” He was a frequent topic of conversation at company meetings. Syngenta looked for ways to “exploit Hayes’ faults/problems.” “If TH involved in scandal, enviros will drop him,” Ford wrote. She observed that Hayes “grew up in world (S.C.) that wouldn’t accept him,” “needs adulation,” “doesn’t sleep,” was “scarred for life.” She wrote, “What’s motivating Hayes?—basic question.”

Syngenta, which is based in Basel, sells more than fourteen billion dollars’ worth of seeds and pesticides a year and funds research at some four hundred academic institutions around the world. When Hayes agreed to do experiments for the company (which at that time was part of a larger corporation, Novartis), the students in his lab expressed concern that biotech companies were “buying up universities” and that industry funding would compromise the objectivity of their research. Hayes assured them that his fee, a hundred and twenty-five thousand dollars, would make their lab more rigorous. He could employ more students, buy new equipment, and raise more frogs. Though his lab was well funded, federal support for research was growing increasingly unstable, and, like many academics and administrators, he felt that he should find new sources of revenue. “I went into it as if I were a painter, performing a service,” Hayes told me. “You commissioned it, and I come up with the results, and you do what you want with them. It’s your responsibility, not mine.”

Atrazine is the second most widely used herbicide in the U.S., where sales are estimated at about three hundred million dollars a year. Introduced in 1958, it is cheap to produce and controls a broad range of weeds. (Glyphosate, which is produced by Monsanto, is the most popular herbicide.) A study by the Environmental Protection Agency found that without atrazine the national corn yield would fall by six per cent, creating an annual loss of nearly two billion dollars. But the herbicide degrades slowly in soil and often washes into streams and lakes, where it doesn’t readily dissolve. Atrazine is one of the most common contaminants of drinking water; an estimated thirty million Americans are exposed to trace amounts of the chemical.
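
To put those numbers side by side, here is a quick back-of-the-envelope calculation (ours, not the E.P.A.'s or the magazine's) that makes the implied size of the corn economy explicit:

# Figures quoted in the paragraph above.
annual_loss = 2e9     # ~$2 billion lost per year without atrazine (E.P.A. estimate)
yield_drop = 0.06     # six-per-cent fall in national corn yield

# Implied value of the national corn crop if a 6% yield drop costs ~$2 billion.
implied_corn_value = annual_loss / yield_drop
print(f"Implied corn-crop value: ${implied_corn_value / 1e9:.0f} billion")  # roughly $33 billion

# Compare with atrazine's own U.S. sales, also quoted above.
atrazine_sales = 3e8  # ~$300 million a year
print(f"Loss-to-sales ratio: {annual_loss / atrazine_sales:.1f}x")  # roughly 6.7x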

In 1994, the E.P.A., expressing concerns about atrazine’s health effects, announced that it would start a scientific review. Syngenta assembled a panel of scientists and professors, through a consulting firm called EcoRisk, to study the herbicide. Hayes eventually joined the group. His first experiment showed that male tadpoles exposed to atrazine developed less muscle surrounding their vocal cords, and he hypothesized that the chemical had the potential to reduce testosterone levels. “I have been losing lots of sleep over this,” he wrote one EcoRisk panel member, in the summer of 2000. “I realize the implications and of course want to make sure that everything possible has been done and controlled for.” After a conference call, he was surprised by the way the company kept critiquing what seemed to be trivial aspects of the work. Hayes wanted to repeat and validate his experiments, and complained that the company was slowing him down and that independent scientists would publish similar results before he could. He decided to resign from the panel, writing in a letter that he didn’t want to be “scooped.” “I fear that my reputation will be damaged if I continue my relationship and associated low productivity with Novartis,” he wrote. “It will appear to my colleagues that I have been part of a plan to bury important data.”

Hayes repeated the experiments using funds from Berkeley and the National Science Foundation. Afterward, he wrote to the panel, “Although I do not want to make a big deal out of it until I have all of the data analyzed and decoded—I feel I should warn you that I think something very strange is coming up in these animals.” After dissecting the frogs, he noticed that some could not be clearly identified as male or female: they had both testes and ovaries. Others had multiple testes that were deformed.

In January, 2001, Syngenta employees and members of the EcoRisk panel travelled to Berkeley to discuss Hayes’s new findings. Syngenta asked to meet with him privately, but Hayes insisted on the presence of his students, a few colleagues, and his wife. He had previously had an amiable relationship with the panel—he had enjoyed taking long runs with the scientist who supervised it—and he began the meeting, in a large room at Berkeley’s Museum of Vertebrate Zoology, as if he were hosting an academic conference. He wore a new suit and brought in catered meals.

After lunch, Syngenta introduced a guest speaker, a statistical consultant, who listed numerous errors in Hayes’s report and concluded that the results were not statistically significant. Hayes’s wife, Katherine Kim, said that the consultant seemed to be trying to “make Tyrone look as foolish as possible.” Wake, the biology professor, said that the men on the EcoRisk panel looked increasingly uncomfortable. “They were experienced enough to know that the issues the statistical consultant was raising were routine and ridiculous,” he said. “A couple of glitches were presented as if they were the end of the world. I’ve been a scientist in academic settings for forty years, and I’ve never experienced anything like that. They were after Tyrone.”

Hayes later e-mailed three of the scientists, telling them, “I was insulted, felt railroaded and, in fact, felt that some dishonest and unethical activity was going on.” When he explained what had happened to Theo Colborn, the scientist who had popularized the theory that industrial chemicals could alter hormones, she advised him, “Don’t go home the same way twice.” Colborn was convinced that her office had been bugged, and that industry representatives followed her. She told Hayes to “keep looking over your shoulder” and to be careful whom he let in his lab. She warned him, “You have got to protect yourself.”

Hayes published his atrazine work in the Proceedings of the National Academy of Sciences a year and a half after quitting the panel. He wrote that what he called “hermaphroditism” was induced in frogs by exposure to atrazine at levels thirty times below what the E.P.A. permits in water. He hypothesized that the chemical could be a factor in the decline in amphibian populations, a phenomenon observed all over the world. In an e-mail sent the day before the publication, he congratulated the students in his lab for taking the “ethical stance” by continuing the work on their own. “We (and our principles) have been tested, and I believe we have not only passed but exceeded expectations,” he wrote. “Science is a principle and a process of seeking truth. Truth cannot be purchased and, thus, truth cannot be altered by money. Professorship is not a career, but rather a life’s pursuit. The people with whom I work daily exemplify and remind me of this promise.”

He and his students continued the work, travelling to farming regions throughout the Midwest, collecting frogs in ponds and lakes, and sending three hundred pails of frozen water back to Berkeley. In papers in Nature and in Environmental Health Perspectives, Hayes reported that he had found frogs with sexual abnormalities in atrazine-contaminated sites in Illinois, Iowa, Nebraska, and Wyoming. “Now that I have realized what we are into, I cannot stop it,” he wrote to a colleague. “It is an entity of its own.” Hayes began arriving at his lab at 3:30 A.M. and staying fourteen hours. He had two young children, who sometimes assisted by color-coding containers.

According to company e-mails, Syngenta was distressed by Hayes’s work. Its public-relations team compiled a database of more than a hundred “supportive third party stakeholders,” including twenty-five professors, who could defend atrazine or act as “spokespeople on Hayes.” The P.R. team suggested that the company “purchase ‘Tyrone Hayes’ as a search word on the internet, so that any time someone searches for Tyrone’s material, the first thing they see is our material.” The proposal was later expanded to include the phrases “amphibian hayes,” “atrazine frogs,” and “frog feminization.” (Searching online for “Tyrone Hayes” now brings up an advertisement that says, “Tyrone Hayes Not Credible.”)

In June, 2002, two months after Hayes’s first atrazine publication, Syngenta announced in a press release that three studies had failed to replicate Hayes’s work. In a letter to the editor of the Proceedings of the National Academy of Sciences, eight scientists on the EcoRisk panel wrote that Hayes’s study had “little regard for assessment of causality,” lacked statistical details, misused the term “dose,” made vague and naïve references, and misspelled a word. They said that Hayes’s claim that his paper had “significant implications for environmental and public health” had not been “scientifically demonstrated.” Steven Milloy, a freelance science columnist who runs a nonprofit organization to which Syngenta has given tens of thousands of dollars, wrote an article for Fox News titled “Freaky-Frog Fraud,” which picked apart Hayes’s paper in Nature, saying that there wasn’t a clear relationship between the concentration of atrazine and the effect on the frog. Milloy characterized Hayes as a “junk scientist” and dismissed his “lame” conclusions as “just another of Hayes’ tricks.”

Fussy critiques of scientific experiments have become integral to what is known as the “sound science” campaign, an effort by interest groups and industries to slow the pace of regulation. David Michaels, the Assistant Secretary of Labor for Occupational Safety and Health, wrote, in his book “Doubt Is Their Product” (2008), that corporations have developed sophisticated strategies for “manufacturing and magnifying uncertainty.” In the eighties and nineties, the tobacco industry fended off regulations by drawing attention to questions about the science of secondhand smoke. Many companies have adopted this tactic. “Industry has learned that debating the science is much easier and more effective than debating the policy,” Michaels wrote. “In field after field, year after year, conclusions that might support regulation are always disputed. Animal data are deemed not relevant, human data not representative, and exposure data not reliable.”

In the summer of 2002, two scientists from the E.P.A. visited Hayes’s lab and reviewed his atrazine data. Thomas Steeger, one of the scientists, told Hayes, “Your research can potentially affect the balance of risk versus benefit for one of the most controversial pesticides in the U.S.” But an organization called the Center for Regulatory Effectiveness petitioned the E.P.A. to ignore Hayes’s findings. “Hayes has killed and continues to kill thousands of frogs in unvalidated tests that have no proven value,” the petition said. The center argued that Hayes’s studies violated the Data Quality Act, passed in 2000, which requires that regulatory decisions rely on studies that meet high standards for “quality, objectivity, utility, and integrity.” The center is run by an industry lobbyist and consultant for Syngenta, Jim Tozzi, who proposed the language of the Data Quality Act to the congresswoman who sponsored it.

The E.P.A. complied with the Data Quality Act and revised its Environmental Risk Assessment, making it clear that hormone disruption wouldn’t be a legitimate reason for restricting use of the chemical until “appropriate testing protocols have been established.” Steeger told Hayes that he was troubled by the circularity of the center’s critique. In an e-mail, he wrote, “Their position reminds me of the argument put forward by the philosopher Berkeley, who argued against empiricism by noting that reliance on scientific observation is flawed since the link between observations and conclusions is intangible and is thus immeasurable.”

Nonetheless, Steeger seemed resigned to the frustrations of regulatory science and gently punctured Hayes’s idealism. When Hayes complained that Syngenta had not reported his findings on frog hermaphroditism quickly enough, he responded that it was “unfortunate but not uncommon for registrants to ‘sit’ on data that may be considered adverse to the public’s perception of their products.” He wrote that “science can be manipulated to serve certain agendas. All you can do is practice ‘suspended disbelief.’ ” (The E.P.A. says that there is “no indication that information was improperly withheld in this case.”)

After consulting with colleagues at Berkeley, Hayes decided that, rather than watch Syngenta discredit his work, he would make a “preëmptive move.” He appeared in features in Discover and the San Francisco Chronicle, suggesting that Syngenta’s science was not objective. Both articles focussed on his personal biography, leading with his skin color, and moving on to his hair style: at the time, he wore his hair in braids. Hayes made little attempt to appear disinterested. Scientific objectivity requires what the philosopher Thomas Nagel has called a “view from nowhere,” but Hayes kept drawing attention to himself, making blustery comments like “Tyrone can only be Tyrone.” He presented Syngenta as a villain, but he didn’t quite fulfill the role of the hero. He was hyper and a little frantic—he always seemed to be in a rush or on the verge of forgetting to do something—and he approached the idea of taking down the big guys with a kind of juvenile zeal.

Environmental activists praised Hayes’s work and helped him get media attention. But they were concerned by the bluntness of his approach. A co-founder of the Environmental Working Group, a nonprofit research organization, told Hayes to “stop what you are doing and take time to actually construct a plan” or “you will get your ass handed to you on a platter.” Steeger warned him that vigilantism would distract him from his research. “Can you afford the time and money to fight battles where you are clearly outnumbered and, to be candid, outclassed?” he asked. “Most people would prefer to limit their time in purgatory; I don’t know anyone who knowingly enters hell.”

Hayes had worked all his life to build his scientific reputation, and now it seemed on the verge of collapse. “I cannot in reasonable terms explain to you what this means to me,” he told Steeger. He took pains to prove that Syngenta’s experiments had not replicated his studies: they used a different population of animals, which were raised in different types of tanks, in closer quarters, at cooler temperatures, and with a different feeding schedule. On at least three occasions, he proposed to the Syngenta scientists that they trade data. “If we really want to test repeatability, let’s share animals and solutions,” he wrote.

In early 2003, Hayes was considered for a job at the Nicholas School of the Environment, at Duke. He visited the campus three times, and the university arranged for a real-estate agent to show him and his wife potential homes. When Syngenta learned that Hayes might be moving to North Carolina, where its crop-protection headquarters are situated, Gary Dickson—the company’s vice-president of global risk assessment, who a year earlier had established a fifty-thousand-dollar endowment, funded by Syngenta, at the Nicholas School—contacted a dean at Duke. According to documents unsealed in the class-action lawsuits, Dickson informed the dean of the “state of the relationship between Dr. Hayes and Syngenta.” The company “wanted to protect our reputation in our community and among our employees.”

There were several candidates for the job at Duke, and, when Hayes did not get it, he concluded that it was due to Syngenta’s influence. Richard Di Giulio, a Duke professor who had hosted Hayes’s first visit, said that he was irritated by Hayes’s suggestion: “A little gift of fifty thousand dollars would not influence a tenure hire. That’s not going to happen.” He added, “I’m not surprised that Syngenta would not have liked Hayes to be at Duke, since we’re an hour down the road from them.” He said that Hayes’s conflict with Syngenta was an extreme example of the kind of dispute that is not uncommon in environmental science. The difference, he said, was that the “scientific debate spilled into Hayes’s emotional life.”

In June, 2003, Hayes paid his own way to Washington so that he could present his work at an E.P.A. hearing on atrazine. The agency had evaluated seventeen studies. Twelve experiments had been funded by Syngenta, and all but two showed that atrazine had no effect on the sexual development of frogs. The rest of the experiments, by Hayes and researchers at two other universities, indicated the opposite. In a PowerPoint presentation at the hearing, Hayes disclosed a private e-mail sent to him by one of the scientists on the EcoRisk panel, a professor at Texas Tech, who wrote, “I agree with you that the important issue is for everyone involved to come to grips with (and stop minimizing) the fact that independent laboratories have demonstrated an effect of atrazine on gonadal differentiation in frogs. There is no denying this.”

The E.P.A. found that all seventeen atrazine studies, including Hayes’s, suffered from methodological flaws—contamination of controls, variability in measurement end points, poor animal husbandry—and asked Syngenta to fund a comprehensive experiment that would produce more definitive results. Darcy Kelley, a member of the E.P.A.’s scientific advisory panel and a biology professor at Columbia, said that, at the time, “I did not think the E.P.A. made the right decision.” The studies by Syngenta scientists had flaws that “really cast into doubt their ability to carry out their experiments. They couldn’t replicate effects that are as easy as falling off a log.” She thought that Hayes’s experiments were more respectable, but she wasn’t persuaded by Hayes’s explanation of the biological mechanism causing the deformities.

The E.P.A. approved the continued use of atrazine in October, the same month that the European Commission chose to remove it from the market. The European Union generally takes a precautionary approach to environmental risks, choosing restraint in the face of uncertainty. In the U.S., lingering scientific questions justify delays in regulatory decisions. Since the mid-seventies, the E.P.A. has issued regulations restricting the use of only five industrial chemicals out of more than eighty thousand in the environment. Industries have a greater role in the American regulatory process—they may sue regulators if there are errors in the scientific record—and cost-benefit analyses are integral to decisions: a monetary value is assigned to disease, impairments, and shortened lives and weighed against the benefits of keeping a chemical in use. Lisa Heinzerling, the senior climate-policy counsel at the E.P.A. in 2009 and the associate administrator of the office of policy in 2009 and 2010, said that cost-benefit models appear “objective and neutral, a way to free ourselves from the chaos of politics.” But the complex algorithms “quietly condone a tremendous amount of risk.” She added that the influence of the Office of Management and Budget, which oversees major regulatory decisions, has deepened in recent years. “A rule will go through years of scientific reviews and cost-benefit analyses, and then at the final stage it doesn’t pass,” she said. “It has a terrible, demoralizing effect on the culture at the E.P.A.”

In 2003, a Syngenta development committee in Basel approved a strategy to keep atrazine on the market “until at least 2010.” A PowerPoint presentation assembled by Syngenta’s global product manager explained that “we need atrazine to secure our position in the corn marketplace. Without atrazine we cannot defend and grow our business in the USA.” Sherry Ford, the communications manager, wrote in her notebook that the company “should not phase out atz until we know about” the Syngenta herbicide paraquat, which has also been controversial, because of studies showing that it might be associated with Parkinson’s disease. She noted that atrazine “focuses attention away from other products.”

Syngenta began holding weekly “atrazine meetings” after the first class-action suit was filed, in 2004. The meetings were attended by toxicologists, the company’s counsel, communications staff, and the head of regulatory affairs. To dampen negative publicity from the lawsuit, the group discussed how it could invalidate Hayes’s research. Ford documented peculiar things he had done (“kept coat on”) or phrases he had used (“Is this line clean?”). “If TH wanted to win the day, and he had the goods,” she wrote, “he would have produced them when asked.” She noted that Hayes was “getting in too deep w/ enviros,” and searched for ways to get him to “show his true colors.”

In 2005, Ford made a long list of methods for discrediting him: “have his work audited by 3rd party,” “ask journals to retract,” “set trap to entice him to sue,” “investigate funding,” “investigate wife.” The initials of different employees were written in the margins beside entries, presumably because they had been assigned to look into the task. Another set of ideas, discussed at several meetings, was to conduct “systematic rebuttals of all TH appearances.” One of the company’s communications consultants said in an e-mail that she wanted to obtain Hayes’s calendar of speaking engagements, so that Syngenta could “start reaching out to the potential audiences with the Error vs. Truth Sheet,” which would provide “irrefutable evidence of his polluted messages.” (Syngenta says that many of the documents unsealed in the lawsuits refer to ideas that were never implemented.)

To redirect attention to the financial benefits of atrazine, the company paid Don Coursey, a tenured economist at the Harris School of Public Policy, at the University of Chicago, five hundred dollars an hour to study how a ban on the herbicide would affect the economy. In 2006, Syngenta supplied Coursey with data and a “bundle of studies,” and edited his paper, which was labelled as a Harris School Working Paper. (He disclosed that Syngenta had funded it.) After submitting a draft, Coursey had been warned in an e-mail that he needed to work harder to articulate a “clear statement of your conclusions flowing from this analysis.” Coursey later announced his findings at a National Press Club event in Washington and told the audience that there was one “basic takeaway point: a ban on atrazine at the national level will have a devastating, devastating effect upon the U.S. corn economy.”

Hayes had been promoted from associate to full professor in 2003, an achievement that had sent him into a mild depression. He had spent the previous decade understanding his self-worth in reference to a series of academic milestones, and he had reached each one. Now he felt aimless. His wife said she could have seen him settling into the life of a “normal, run-of-the-mill, successful scientist.” But he wasn’t motivated by the idea of “writing papers and books that we all just trade with each other.”

He began giving more than fifty lectures a year, not just to scientific audiences but to policy institutes, history departments, women’s health clinics, food preparers, farmers, and high schools. He almost never declined an invitation, despite the distance. He told his audiences that he was defying the instructions of his Ph.D. adviser, who had told him, “Let the science speak for itself.” He had a flair for sensational stories—he chose phrases like “crime scene” and “chemically castrated”—and he seemed to revel in details about Syngenta’s conflicts of interest, presenting theories as if he were relating gossip to friends. (Syngenta wrote a letter to Hayes and his dean, pointing out inaccuracies: “As we discover additional errors in your presentations, you can expect us to be in touch with you again.”)

At his talks, Hayes noticed that one or two men in the audience were dressed more sharply than the other scientists. They asked questions that seemed to have been designed to embarrass him: Why can’t anyone replicate your research? Why won’t you share your data? One former student, Ali Stuart, said that “everywhere Tyrone went there was this guy asking questions that made a mockery of him. We called him the Axe Man.” 

Hayes had once considered a few of the scientists working with Syngenta friends, and he approached them in a nerdy style of defiance. He wrote them mass e-mails, informing them of presentations he was giving and offering tips on how to discredit him. “You can’t approach your prey thinking like a predator,” he wrote. “You have to become your quarry.” He described a recent trip to South Carolina and his sense of displacement when “my old childhood friend came by to update me on who got killed, who’s on crack, who went to jail.” He wrote, “I have learned to talk like you (better than you . . . by your own admission), write like you (again better) . . . you however don’t know anyone like me . . . you have yet to spend a day in my world.” After seeing an e-mail in which a lobbyist characterized him as “black and quite articulate,” he began signing his e-mails, “Tyrone B. Hayes, Ph.D., A.B.M.,” for “articulate black man.”

Syngenta was concerned by Hayes’s e-mails and commissioned an outside contractor to do a “psychological profile” of Hayes. In her notes, Sherry Ford described him as “bipolar/manic-depressive” and “paranoid schizo & narcissistic.” Roger Liu, Hayes’s student, said that he thought Hayes wrote the e-mails to relieve his anxiety. Hayes often showed the e-mails to his students, who appreciated his rebellious sense of humor. Liu said, “Tyrone had all these groupies in the lab cheering him on. I was the one in the background saying, you know, ‘Man, don’t egg them on. Don’t poke that beast.’ ”

Syngenta intensified its public-relations campaign in 2009, as it became concerned that activists, touting “new science,” had developed a “new line of attack.” That year, a paper in Acta Paediatrica, reviewing national records for thirty million births, found that children conceived between April and July, when the concentration of atrazine (mixed with other pesticides) in water is highest, were more likely to have genital birth defects. The author of the paper, Paul Winchester, a professor of pediatrics at the Indiana University School of Medicine, received a subpoena from Syngenta, which requested that he turn over every e-mail he had written about atrazine in the past decade. The company’s media talking points described his study as “so-called science” that didn’t meet the “guffaw test.” Winchester said, “We don’t have to argue that I haven’t proved the point. Of course I haven’t proved the point! Epidemiologists don’t try to prove points—they look for problems.”

A few months after Winchester’s paper appeared, the Times published an investigation suggesting that atrazine levels frequently surpass the maximum threshold allowed in drinking water. The article referred to recent studies in Environmental Health Perspectives and the Journal of Pediatric Surgery that found that mothers living close to water sources containing atrazine were more likely to have babies who were underweight or had a defect in which the intestines and other organs protrude from the body.

The day the article appeared, Syngenta planned to “go through the article line by line and find all 1) inaccuracies and 2) misrepresentations. Turn that into a simple chart.” The company would have “a credible third party do the same.” Elizabeth Whelan, the president of the American Council on Science and Health, which asked Syngenta for a hundred thousand dollars that year, appeared on MSNBC and declared that the Times article was not based on science. “I’m a public-health professional,” she said. “It really bothers me very much to see the New York Times front-page Sunday edition featuring an article about a bogus risk.”

Syngenta’s public-relations team wrote editorials about the benefits of atrazine and about the flimsy science of its critics, and then sent them to “third-party allies,” who agreed to “byline” the articles, which appeared in the Washington Times, the Rochester Post-Bulletin, the Des Moines Register, and the St. Cloud Times. When a few articles in the “op-ed pipeline” sounded too aggressive, a Syngenta consultant warned that “some of the language of these pieces is suggestive of their source, which suggestion should be avoided at all costs.”

After the Times article, Syngenta hired a communications consultancy, the White House Writers Group, which has represented more than sixty Fortune 500 companies. In an e-mail to Syngenta, Josh Gilder, a director of the firm and a former speechwriter for Ronald Reagan, wrote, “We need to start fighting our own war.” By warning that a ban on atrazine would “devastate the economies” of rural regions, the firm tried to create a “state of affairs in which the new political leadership at E.P.A. finds itself increasingly isolated.” The firm held “elite dinners with Washington influentials” and tried to “prompt members of Congress” to challenge the scientific rationale for an upcoming E.P.A. review of atrazine. In a memo describing its strategy, the White House Writers Group wrote that, “regarding science, it is important to keep in mind that the major players in Washington do not understand science.”

In 2010, Hayes told the EcoRisk panel in an e-mail, “I have just initiated what will be the most extraordinary academic event in this battle!” He had another paper coming out in the Proceedings of the National Academy of Sciences, which described how male tadpoles exposed to atrazine grew up to be functional females with impaired fertility. He advised the company that it would want to get its P.R. campaign up to speed. “It’s nice to know that in this economy I can keep so many people employed,” he wrote. He quoted both Tupac Shakur and the South African king Shaka Zulu: “Never leave an enemy behind or it will rise again to fly at your throat.”

Syngenta’s head of global product safety wrote a letter to the editor of the Proceedings of the National Academy of Sciences and to the president of the National Academy of Sciences, expressing concern that a “publication with so many obvious weaknesses could achieve publication in such a reputable scientific journal.” A month later, Syngenta filed an ethics complaint with the chancellor of Berkeley, claiming that Hayes’s e-mails violated the university’s Standards of Ethical Conduct, particularly Respect for Others. Syngenta posted more than eighty of Hayes’s e-mails on its Web site and enclosed a few in its letter to the chancellor. In one, with the subject line “Are y’all ready for it,” Hayes wrote, “Ya fulla my j*z right now!” In another, he told the Syngenta scientists that he’d had a drink after a conference with their “republican buddies,” who wanted to know about a figure he had used in his paper. “As long as you followin me around, I know I’m da sh*t,” he wrote. “By the way, yo boy left his pre-written questions at the table!”

Berkeley declined to take disciplinary action against Hayes. The university’s lawyer reminded Syngenta in a letter that “all parties have an equal responsibility to act professionally.” David Wake said that he read many of the e-mails and found them “quite hilarious.” “He’s treating them like street punks, and they view themselves as captains of industry,” he said. “When he gets tapped, he goes right back at them.”

Michelle Boone, a professor of aquatic ecology at Miami University, who served on the E.P.A.’s scientific advisory panel, said, “We all follow the Tyrone Hayes drama, and some people will say, ‘He should just do the science.’ But the science doesn’t speak for itself. Industry has unlimited resources and bully power. Tyrone is the only one calling them out on what they’re doing.” However, she added, “I do think some people feel he has lost his objectivity.”

Keith Solomon, a professor emeritus at the University of Guelph, Ontario, who has received funding from Syngenta and served on the EcoRisk panel, noted that academics who refuse industry money are not immune from biases; they’re under pressure to produce papers, in order to get tenure and promotions. “If I do an experiment, look at the data every which way, and find nothing, it will not be easy to publish,” he said. “Journals want excitement. They want bad things to happen.”

Hayes, who had gained more than fifty pounds since becoming tenured, wore bright scarves draped over his suit and silver earrings from Tibet. At the end of his lectures, he broke into rhyme: “I see a ruse / intentionally constructed to confuse the news / well, I’ve taken it upon myself to defuse the clues / so that you can choose / and to demonstrate the objectivity of the methods I use.” At some of his lectures, Hayes warned that the consequences of atrazine use were disproportionately felt by people of color. “If you’re black or Hispanic, you’re more likely to live or work in areas where you’re exposed to crap,” he said. He explained that “on the one side I’m trying to play by the ivory-tower rules, and on the other side people are playing by a different set of rules.” Syngenta was speaking directly to the public, whereas scientists were publishing their research in “magazines that you can’t buy in Barnes and Noble.”

Hayes was confident that at the next E.P.A. hearing there would be enough evidence to ban atrazine, but in 2010 the agency found that the studies indicating risk to humans were too limited. Two years later, during another review, the E.P.A. determined that atrazine does not affect the sexual development of frogs. By that point, there were seventy-five published studies on the subject, but the E.P.A. excluded the majority of them from consideration, because they did not meet the requirements for quality that the agency had set in 2003. The conclusion was based largely on a set of studies funded by Syngenta and led by Werner Kloas, a professor of endocrinology at Humboldt University, in Berlin. One of the co-authors was Alan Hosmer, a Syngenta scientist whose job, according to a 2004 performance evaluation, included “atrazine defence” and “influencing EPA.”

After the hearing, two of the independent experts who had served on the E.P.A.’s scientific advisory panel, along with fifteen other scientists, wrote a paper (not yet published) complaining that the agency had repeatedly ignored the panel’s recommendations and that it placed “human health and the environment at the mercy of industry.” “The EPA works with industry to set up the methodology for such studies with the outcome often that industry is the only institution that can afford to conduct the research,” they wrote. The Kloas study was the most comprehensive of its kind: its researchers had been scrutinized by an outside auditor, and their raw data turned over to the E.P.A. But the scientists wrote that one set of studies on a single species was “not a sufficient edifice on which to build a regulatory assessment.” Citing a paper by Hayes, who had done an analysis of sixteen atrazine studies, they wrote that “the single best predictor of whether or not the herbicide atrazine had a significant effect in a study was the funding source.”

In another paper, in Policy Perspective, Jason Rohr, an ecologist at the University of South Florida, who served on an E.P.A. panel, criticized the “lucrative ‘science for hire’ industry, where scientists are employed to dispute data.” He wrote that a Syngenta-funded review of the atrazine literature had arguably misrepresented more than fifty studies and made a hundred and forty-four inaccurate or misleading statements, of which “96.5% appeared to be beneficial for Syngenta.” Rohr, who has conducted several experiments involving atrazine, said that, at conferences, “I regularly get peppered with questions from Syngenta cronies trying to discount my research. They try to poke holes in the research rather than appreciate the adverse effects of the chemicals.” He said, “I have colleagues whom I’ve tried to recruit, and they’ve told me that they’re not willing to delve into this sort of research, because they don’t want the headache of having to defend their credibility.”

Deborah Cory-Slechta, a former member of the E.P.A.’s science advisory board, said that she, too, felt that Syngenta was trying to undermine her work. A professor at the University of Rochester Medical Center, Cory-Slechta studies how the herbicide paraquat may contribute to diseases of the nervous system. “The folks from Syngenta used to follow me to my talks and tell me I wasn’t using ‘human-relevant doses,’ ” she said. “They would go up to my students and try to intimidate them. There was this sustained campaign to make it look like my science wasn’t legitimate.”

Syngenta denied repeated requests for interviews, but Ann Bryan, its senior manager for external communications, told me in an e-mail that some of the studies I was citing were unreliable or unsound. When I mentioned a recent paper in the American Journal of Medical Genetics, which showed associations between a mother’s exposure to atrazine and the likelihood that her son will have an abnormally small penis, undescended testes, or a deformity of the urethra—defects that have increased in the past several decades—she said that the study had been “reviewed by independent scientists, who found numerous flaws.” She recommended that I speak with the author of the review, David Schwartz, a neuroscientist, who works for Innovative Science Solutions, a consulting firm that specializes in “product defense” and strategies that “give you the power to put your best data forward.” Schwartz told me that epidemiological studies can’t eliminate confounding variables or make claims about causation. “We’ve been incredibly misled by this type of study,” he said.

In 2012, in its settlement of the class-action suits, Syngenta agreed to pay a hundred and five million dollars to reimburse more than a thousand water systems for the cost of filtering atrazine from drinking water, but the company denies all wrongdoing. Bryan told me that “atrazine does not and, in fact, cannot cause adverse health effects at any level that people would ever be exposed to in the real-world environment.” She wrote that she was “troubled by a suggestion that we have ever tried to discredit anyone. Our focus has always been on communicating the science and setting the record straight.” She noted that “virtually every well-known brand, or even well-known issue, has a communications program behind it. Atrazine’s no different.”

Last August, Hayes put his experiments on hold. He said that his fees for animal care had risen eightfold in a decade, and that he couldn’t afford to maintain his research program. He accused the university of charging him more than other researchers in his department; in response, the director of the office of laboratory-animal care sent detailed charts illustrating that he is charged according to standard campus-wide rates, which have increased for most researchers in recent years. In an online Forbes op-ed, Jon Entine, a journalist who is listed in Syngenta’s records as a supportive “third party,” accused Hayes of being attached to conspiracy theories, and of leading the “international regulatory community on a wild goose chase,” which “borders on criminal.”

By late November, Hayes’s lab had resumed work. He was using private grants to support his students rather than to pay outstanding fees, and the lab was accumulating debt. Two days before Thanksgiving, Hayes and his students discussed their holiday plans. He was wearing an oversized orange sweatshirt, gym shorts, and running shoes, and a former student, Diana Salazar Guerrero, was eating fries that another student had left on the table. Hayes encouraged her to come to his Thanksgiving dinner and to move into the bedroom of his son, who is now a student at Oberlin. Guerrero had just put down half the deposit on a new apartment, but Hayes was disturbed by her description of her new roommate. “Are you sure you can trust him?” he asked.

Hayes had just returned from Mar del Plata, Argentina. He had flown fifteen hours and driven two hundred and fifty miles to give a thirty-minute lecture on atrazine. Guerrero said, “Sometimes I’m just, like, ‘Why don’t you let it go, Tyrone? It’s been fifteen years! How do you have the energy for this?’ ” With more scientists documenting the risks of atrazine, she assumed he’d be inclined to move on. “Originally, it was just this crazy guy at Berkeley, and you can throw the Berserkley thing at anyone,” she said. “But now the tide is turning.”

In a recent paper in the Journal of Steroid Biochemistry and Molecular Biology, Hayes and twenty-one other scientists applied the criteria of Sir Austin Bradford Hill, who, in 1965, outlined the conditions necessary for a causal relationship, to atrazine studies across different vertebrate classes. They argued that independent lines of evidence consistently showed that atrazine disrupts male reproductive development. Hayes’s lab was working on two more studies that explore how atrazine affects the sexual behavior of frogs. When I asked him what he would do if the E.P.A., which is conducting another review of the safety of atrazine this year, were to ban the herbicide, he joked, “I’d probably get depressed again.”

Not long ago, Hayes saw a description of himself on Wikipedia that he found disrespectful, and he wasn’t sure whether it was an attack by Syngenta or whether there were simply members of the public who thought poorly of him. He felt deflated when he remembered the arguments he’d had with Syngenta-funded pundits. “It’s one thing if you go after me because you have a philosophical disagreement with my science or if you think I’m raising alarm where there shouldn’t be any,” he said. “But they didn’t even have their own opinions. Someone was paying them to take a position.” He wondered if there was something inherently insane about the act of whistle-blowing; maybe only crazy people persisted. He was ready for a fight, but he seemed to be searching for his opponent.

One of his first graduate students, Nigel Noriega, who runs an organization devoted to conserving tropical forests, told me that he was still recovering from the experience of his atrazine research, a decade before. He had come to see science as a rigid culture, “its own club, an élite society,” Noriega said. “And Tyrone didn’t conform to the social aspects of being a scientist.” Noriega worried that the public had little understanding of the context that gives rise to scientific findings. “It is not helpful to anyone to assume that scientists are authoritative,” he said. “A good scientist spends his whole career questioning his own facts. One of the most dangerous things you can do is believe.” 

Beyond Price Predictor Beyond Stays

Comments:"Beyond Price Predictor Beyond Stays"

URL:http://beyondstays.com/pricing


Always know how to price your home.

Sign up for future price alerts.

We specialize in filling your Airbnb, not your inbox.
We send emails when rental prices rise dramatically.

Tracking 5,000+ Vacation Rentals, 1,000+ Hotel Room Rates, and 100+ Local Conferences

We track data to help you price more accurately.

Factors tracked: conferences +64%, location +16%, seasonality +12%, day of week +8%.

Adobe to Require New Epub DRM in July, Expects to Abandon Existing Users - The Digital Reader

Comments:"Adobe to Require New Epub DRM in July, Expects to Abandon Existing Users - The Digital Reader"

URL:http://www.the-digital-reader.com/2014/02/03/adobe-require-new-epub-drm-july-expects-abandon-existing-users/


When Adobe announced their new DRM a couple weeks ago some said that we would soon see compatibility issues with older devices and apps as Adobe forced everyone to upgrade.

At that time I didn’t think Adobe would make the mistake of cutting off so many existing readers, but now it seems that I could not have been more wrong on the issue.

The following video (found via The SF Reader) confirms that Adobe is planning to require everyone (ebookstores, app and device makers) to upgrade to the new DRM by July 2014.

The video is a recording of a webinar hosted by Datalogics and Adobe, and it covers in detail aspects of how and when the new DRM will be implemented (as well as a lot of other data). If the embed link doesn’t work for you, here’s a link to the video on Youtube.

The tl;dr version is that Adobe is going to start pushing for ebook vendors to provide support for the new DRM in March, and when July rolls around Adobe is going to force the ebook vendors to stop supporting the older DRM. (Hadrien Gardeur, Paul Durrant, and Martyn Daniels concur on this interpretation.)

This means that any app or device which still uses the older Adobe DRM will be cut off. Luckily for many users, that penalty probably will not affect readers who use Kobo or Google reading apps or devices; to the best of my knowledge neither uses the Adobe DRM internally. And of course Kindle and Apple customers won’t even notice, thanks to those companies’ wise decision to use their own DRM.

But everyone else just got screwed.

If you’re using Adobe DE 2.1, come July you won’t be able to read any newly downloaded DRMed ebooks until after you upgrade to Adobe DE 3.0. If you’re using a preferred 3rd-party reading app, you won’t be able to download any new DRMed ebooks until after the app developer releases an update.

And if you’re using an existing ebook reader, you’d better plan on only reading DRM-free ebooks until further notice.

One thing Adobe seems to have missed is that there are tens of millions of ebook readers on the market that support the older DRM but will probably never be upgraded to the new DRM. Sony and Pocketbook, for example, have released a number of models over the past 5 or so years, most of which have since been discontinued.

Do you really think they’re going to invest in updating a discontinued (but otherwise perfectly functional) device?

I don’t, and that’s just the tip of the iceberg. Not only will millions of existing readers be cut off, there are also hundreds of thousands of ebook readers sitting on store shelves which, as of July, will no longer support Adobe DRM.

And do you know what’s even better? All signs point to the ebook reader market having peaked in 2011 or 2012 (I have industry sources which have said this) so the existing and soon to be incompatible ereaders will probably outnumber the compatible models for the indefinite future (years if not decades).

If you look hard enough you can still buy many of the ebook readers released in 2010, 2011, and 2012 as new, and you can also find them as refurbs or used. They work just fine today (albeit a little slowly by today’s standards) but when July rolls around they will be little more than junk.

And that includes ebook readers owned by libraries and other cost conscious institutions.

If you’re beginning to grasp just how bad this move could be, wait a second because I’m not done.

Not only will readers be affected, but so will indie ebookstores. They’re going to have to pay to upgrade their servers and their reading apps. That cost is going to hit them in the pocketbook (potentially driving some out of business), and that’s not all.

Many if not most of the indie ebookstores are dependent on the various Adobe DRM-compatible ebook readers on the market. They cannot afford to develop their own hardware, so they rely on readers buying and using devices made by other companies, including Pocketbook, Sony, Gajah (a major OEM), and others.

Once those existing ebook readers are abandoned by Adobe the indie ebookstores will probably lose customers to one or another of the major ebook vendors.

In other words Adobe just gave Amazon a belated Christmas present. After all, everyone might hate Amazon but we also know we can trust them to not break their DRM.

Folks, the above scenario spells out all the reasons why I didn’t expect Adobe to completely abandon support for the older DRM. It is so obviously a bad idea that I thought they would avoid it.

With that in mind, I would also like to add an addendum and apply Tyrion’s Razor. Perhaps Adobe has internal data which says that this won’t be a serious issue.  I seriously doubt it, but it’s possible.

P.S. But if this turns out to be the utter disaster I am expecting, I would like to take this opportunity to thank Adobe for on yet another occasion giving DRM a bad name.

Modern Microprocessors - A 90 Minute Guide!

Comments:"Modern Microprocessors - A 90 Minute Guide!"

URL:http://www.lighterra.com/papers/modernmicroprocessors/


WARNING: This article is meant to be informal and fun!

Okay, so you're a CS graduate and you did a hardware/assembly course as part of your degree, but perhaps that was a few years ago now and you haven't really kept up with the details of processor designs since then.

In particular, you might not be aware of some key topics that developed rapidly in recent times...

  • pipelining (superscalar, OoO, VLIW, branch prediction, predication)
  • multi-core & simultaneous multithreading (SMT, hyper-threading)
  • SIMD vector instructions (MMX/SSE/AVX, AltiVec)
  • caches and the memory hierarchy

Fear not! This article will get you up to speed fast. In no time you'll be discussing the finer points of in-order vs out-of-order, hyper-threading, multi-core and cache organization like a pro.

But be prepared – this article is brief and to-the-point. It pulls no punches and the pace is pretty fierce (really). Let's get into it...

More Than Just Megahertz

The first issue that must be cleared up is the difference between clock speed and a processor's performance. They are not the same thing. Look at the results for processors of a few years ago (the late 1990s)...

Processor                SPECint95   SPECfp95
195 MHz MIPS R10000         11.0        17.0
400 MHz Alpha 21164         12.3        17.2
300 MHz UltraSPARC          12.1        15.5
300 MHz Pentium-II          11.6         8.8
300 MHz PowerPC G3          14.8        11.4
135 MHz POWER2               6.2        17.6

A 200 MHz MIPS R10000, a 300 MHz UltraSPARC and a 400 MHz Alpha 21164 were all about the same speed at running most programs, yet they differed by a factor of two in clock speed. A 300 MHz Pentium-II was also about the same speed for many things, yet it was about half that speed for floating-point code such as scientific number crunching. A PowerPC G3 at that same 300 MHz was somewhat faster than the others for normal integer code, but still far slower than the top 3 for floating-point. At the other extreme, an IBM POWER2 processor at just 135 MHz matched the 400 MHz Alpha 21164 in floating-point speed, yet was only half as fast for normal integer programs.

How can this be? Obviously, there's more to it than just clock speed – it's all about how much work gets done in each clock cycle. Which leads to...

Pipelining & Instruction-Level Parallelism

Instructions are executed one after the other inside the processor, right? Well, that makes it easy to understand, but that's not really what happens. In fact, that hasn't happened since the middle of the 1980s. Instead, several instructions are all partially executing at the same time.

Consider how an instruction is executed – first it is fetched, then decoded, then executed by the appropriate functional unit, and finally the result is written into place. With this scheme, a simple processor might take 4 cycles per instruction (CPI = 4)...

Figure 1 – The instruction flow of a sequential processor.

Modern processors overlap these stages in a pipeline, like an assembly line. While one instruction is executing, the next instruction is being decoded, and the one after that is being fetched...

Figure 2 – The instruction flow of a pipelined processor.

Now the processor is completing 1 instruction every cycle (CPI = 1). This is a four-fold speedup without changing the clock speed at all. Not bad, huh?
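
To make the overlap concrete, here is a minimal sketch in C (an idealized model with no stalls; the stage names and instruction count are arbitrary choices for illustration) that prints which instruction occupies each stage in each cycle, reproducing the staircase pattern of Figure 2...

#include <stdio.h>

#define N_INSTR  6
#define N_STAGES 4

int main(void) {
    const char *stage_name[N_STAGES] = { "fetch", "decode", "execute", "writeback" };
    /* With full overlap and no stalls, instruction i is in stage s during cycle i + s. */
    for (int cycle = 0; cycle < N_INSTR + N_STAGES - 1; cycle++) {
        printf("cycle %2d:", cycle + 1);
        for (int i = 0; i < N_INSTR; i++) {
            int s = cycle - i;
            if (s >= 0 && s < N_STAGES)
                printf("  i%d:%s", i + 1, stage_name[s]);
        }
        printf("\n");
    }
    return 0;
}

After the first few cycles the pipeline is full, and one instruction leaves the writeback stage every cycle.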

From the hardware point of view, each pipeline stage consists of some combinatorial logic and possibly access to a register set and/or some form of high speed cache memory. The pipeline stages are separated by latches. A common clock signal synchronizes the latches between each stage, so that all the latches capture the results produced by the pipeline stages at the same time. In effect, the clock "pumps" instructions down the pipeline.

At the beginning of each clock cycle, the data and control information for a partially processed instruction is held in a pipeline latch, and this information forms the inputs to the logic circuits of the next pipeline stage. During the clock cycle, the signals propagate through the combinatorial logic of the stage, producing an output just in time to be captured by the next pipeline latch at the end of the clock cycle...

Figure 3 – A pipelined microarchitecture.

Since the result from each instruction is available after the execute stage has completed, the next instruction ought to be able to use that value immediately, rather than waiting for that result to be committed to its destination register in the writeback stage. To allow this, forwarding lines called bypasses are added, going backwards along the pipeline...

Figure 4 – A pipelined microarchitecture with bypasses.

Although the pipeline stages look simple, it is important to remember that the execute stage in particular is really made up of several different groups of logic (several sets of gates), making up different functional units for each type of operation that the processor must be able to perform...

Figure 5 – A pipelined microarchitecture in more detail.

The early RISC processors, such as IBM's 801 research prototype, the MIPS R2000 (based on the Stanford MIPS machine) and the original SPARC (derived from the Berkeley RISC project), all implemented a simple 5 stage pipeline not unlike the one shown above (the extra stage is for memory access, placed after execute). At the same time, the mainstream 80386, 68030 and VAX processors worked sequentially using microcode (it's easier to pipeline a RISC because the instructions are all simple register-to-register operations, unlike x86, 68k or VAX). As a result, a SPARC running at 20 MHz was way faster than a 386 running at 33 MHz. Every processor since then has been pipelined, at least to some extent. A good summary of the original RISC research projects can be found in the 1985 CACM article by David Patterson.

Deeper Pipelines – Superpipelining

Since the clock speed is limited by (among other things) the length of the longest stage in the pipeline, the logic gates that make up each stage can be subdivided, especially the longer ones, converting the pipeline into a deeper super-pipeline with a larger number of shorter stages. Then the whole processor can be run at a higher clock speed! Of course, each instruction will now take more cycles to complete (latency), but the processor will still be completing 1 instruction per cycle (throughput), and there will be more cycles per second, so the processor will complete more instructions per second (actual performance)...

Figure 6 – The instruction flow of a superpipelined processor.
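
To put rough numbers on the latency-versus-throughput trade-off, here is a small sketch (the clock speeds and stage counts are made up for illustration, not figures for any real processor)...

#include <stdio.h>

int main(void) {
    /* Hypothetical 5-stage design vs a superpipelined 10-stage version of the same core. */
    double clock_short = 400e6, clock_deep = 700e6;   /* assumed clock speeds (Hz) */
    int    stages_short = 5,    stages_deep = 10;

    printf(" 5 stages: latency %4.1f ns, peak throughput %3.0f M instructions/s\n",
           stages_short / clock_short * 1e9, clock_short / 1e6);
    printf("10 stages: latency %4.1f ns, peak throughput %3.0f M instructions/s\n",
           stages_deep  / clock_deep  * 1e9, clock_deep  / 1e6);
    return 0;
}

Each instruction now takes longer from fetch to writeback, but as long as the pipeline stays full, more of them complete every second.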

The Alpha architects in particular liked this idea, which is why the early Alphas had very deep pipelines and ran at such very high clock speeds for their era. Today, modern processors strive to keep the number of gate delays down to just a handful for each pipeline stage (about 12-25 gates deep (not total!) plus another 3-5 for the latch itself), and most have quite deep pipelines (7-12 in PowerPC G4e, 8+ in ARM11 & Cortex-A9, 10-15 in Athlon, 12+ in Pentium-Pro/II/III/M, 12-17 in Athlon 64/Phenom, 13+ in Cortex-A8, 14 in UltraSPARC-III/IV, 14+ in Core 2, 14-18+ in Core i*2, 15 in Bobcat, 15-25 in Cortex-A15, 16 in Atom, 16+ in Core i, 16-25 in PowerPC G5, 18+ in Bulldozer, 20+ in Pentium-4, 31+ in Pentium-4E). The x86 processors generally have deeper pipelines than the RISCs because they need to do extra work to decode the complex x86 instructions (more on this later). UltraSPARC-T1/T2/T3 are an exception to the deep pipeline trend (just 6 for UltraSPARC-T1 and 8-12 for T2/T3).

Multiple Issue – Superscalar

Since the execute stage of the pipeline is really a bunch of different functional units, each doing its own task, it seems tempting to try to execute multiple instructions in parallel, each in its own functional unit. To do this, the fetch and decode/dispatch stages must be enhanced so that they can decode multiple instructions in parallel and send them out to the "execution resources"...

Figure 7 – A superscalar microarchitecture.

Of course, now that there are independent pipelines for each functional unit, they can even be different lengths. This allows the simpler instructions to complete more quickly, reducing latency (which we'll get to soon). There are also a bunch of bypasses within and between the various pipelines, but these have been left out of the diagram for simplicity.

In the above example, the processor could potentially execute 3 different instructions per cycle – for example one integer, one floating-point and one memory operation. Even more functional units could be added, so that the processor might be able to execute two integer instructions per cycle, or two floating-point instructions, or whatever the target applications could best use.
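
As a source-level illustration (the variables are hypothetical, and of course the real grouping happens on machine instructions, not C statements), three independent operations of different types could all be issued in the same cycle on such a design...

i = j + k;          /* integer unit */
f = g * 2.5;        /* floating-point unit */
x = table[n];       /* memory (load/store) unit */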

On a superscalar processor, the instruction flow looks something like...

Figure 8 – The instruction flow of a superscalar processor.

This is great! There are now 3 instructions completing every cycle (CPI = 0.33, or IPC = 3). The number of instructions able to be issued or completed per cycle is called a processor's width.

Note that the issue-width is less than the number of functional units – this is typical. There must be more functional units because different code sequences have different mixes of instructions. The idea is to execute 3 instructions per cycle, but those instructions are not always going to be 1 integer, 1 floating-point and 1 memory operation, so more than 3 functional units are required.

The IBM POWER1 processor, the predecessor of PowerPC, was the first mainstream superscalar processor. Most of the RISCs went superscalar soon after (SuperSPARC, Alpha 21064). Intel even managed to build a superscalar x86 – the original Pentium – however the complex x86 instruction set was a real problem for them (more on this later).

Of course, there's nothing stopping a processor from having both a deep pipeline and multiple instruction issue, so it can be both superpipelined and superscalar at the same time...

Figure 9 – The instruction flow of a superpipelined-superscalar processor.

Today, virtually every processor is a superpipelined-superscalar, so they're just called superscalar for short. Strictly speaking, superpipelining is just pipelining with a deeper pipe anyway.

The widths of current processors range from single-issue (ARM11, UltraSPARC-T1) through 2-issue (UltraSPARC-T2/T3, Cortex-A8 & A9, Atom, Bobcat) to 3-issue (Pentium-Pro/II/III/M, Athlon, Pentium-4, Athlon 64/Phenom, Cortex-A15) or 4-issue (UltraSPARC-III/IV, PowerPC G4e, Core 2, Core i, Core i*2, Bulldozer) or 5-issue (PowerPC G5), or even 6-issue (Itanium, but it's a VLIW – see below). The exact number and type of functional units in each processor depends on its target market. Some processors have more floating-point execution resources (IBM's POWER line), others are more integer-biased (Pentium-Pro/II/III/M), some devote much of their resources towards SIMD vector instructions (PowerPC G4e), while most try to take the "balanced" middle ground.

Explicit Parallelism – VLIW

In cases where backward compatibility is not an issue, it is possible for the instruction set itself to be designed to explicitly group instructions to be executed in parallel. This approach eliminates the need for complex dependency checking logic in the dispatch stage, which should make the processor easier to design (and easier to ramp up the clock speed over time, at least in theory).

In this style of processor, the "instructions" are really groups of little sub-instructions, and thus the instructions themselves are very long (often 128 bits or more), hence the name VLIW – very long instruction word. Each instruction contains information for multiple parallel operations.

A VLIW processor's instruction flow is much like a superscalar, except that the decode/dispatch stage is much simpler and only occurs for each group of sub-instructions...

Figure 10 – The instruction flow of a VLIW processor.

Other than the simplification of the dispatch logic, VLIW processors are much like superscalar processors. This is especially so from a compiler's point of view (more on this later).

It is worth noting, however, that most VLIW designs are not interlocked. This means they do not check for dependencies between instructions, and often have no way of stalling instructions other than to stall the whole processor on a cache miss. As a result, the compiler needs to insert the appropriate number of cycles between dependent instructions, even if there are no instructions to fill the gap, by using nops (no-operations, pronounced "no-ops") if necessary. This complicates the compiler somewhat, because it is doing something that a superscalar processor normally does at runtime, however the extra code in the compiler is minimal and it saves precious resources on the processor chip.
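
As a rough sketch of what a bundle looks like, here is a hypothetical 3-slot VLIW encoding modelled as a C structure (the fields and opcodes are invented for illustration; real VLIW formats pack these into fixed bit positions within the long instruction word)...

#include <stdio.h>

typedef enum { OP_NOP, OP_ADD, OP_MUL, OP_LOAD, OP_STORE } opcode_t;

typedef struct {
    opcode_t op;
    int dest, src1, src2;
} sub_instr_t;

typedef struct {
    sub_instr_t slot[3];   /* e.g. slot 0 = integer, slot 1 = FP, slot 2 = memory */
} vliw_bundle_t;

int main(void) {
    /* The compiler fills unused slots with explicit nops. */
    vliw_bundle_t b = {{ { OP_ADD, 1, 2, 3 }, { OP_NOP, 0, 0, 0 }, { OP_LOAD, 4, 5, 0 } }};
    printf("slots used: %s %s %s\n",
           b.slot[0].op != OP_NOP ? "int" : "nop",
           b.slot[1].op != OP_NOP ? "fp"  : "nop",
           b.slot[2].op != OP_NOP ? "mem" : "nop");
    return 0;
}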

No VLIW designs have yet been commercially successful as mainstream CPUs, however Intel's IA64 architecture, which is still in production in the form of the Itanium processors, was once intended to be the replacement for x86. Intel chose to call IA64 an "EPIC" design, for "explicitly parallel instruction computing", but it was essentially a VLIW with clever grouping (to allow long-term compatibility) and predication (see below). The programmable shaders in many graphics processors (GPUs) are usually VLIW designs, although obviously they provide graphics-oriented instruction sets, and there's also Transmeta (see the x86 section, coming up soon).

Instruction Dependencies & Latencies

How far can pipelining and multiple issue be taken? If a 5 stage pipeline is 5 times faster, why not build a 20 stage superpipeline? If 4-issue superscalar is good, why not go for 8-issue? For that matter, why not build a processor with a 50 stage pipeline which issues 20 instructions per cycle?

Well, consider the following two instructions...

a = b * c;
d = a + 1;

The second instruction depends on the first – the processor can't execute the second instruction until after the first has completed calculating its result. This is a serious problem, because instructions that depend on each other cannot be executed in parallel. Thus, multiple issue is impossible in this case.

If the first instruction was a simple integer addition then this might still be okay in a pipelined single issue processor, because integer addition is quick and the result of the first instruction would be available just in time to feed it back into the next instruction (using bypasses). However in the case of a multiply, which will take several cycles to complete, there is no way the result of the first instruction will be available when the second instruction reaches the execute stage just one cycle later. So, the processor will need to stall the execution of the second instruction until its data is available, inserting a bubble into the pipeline where no work gets done.

The number of cycles between when an instruction reaches the execute stage and when its result is available for use by other instructions is called the instruction's latency. The deeper the pipeline, the more stages and thus the longer the latency. So a very deep pipeline is not much more effective than a short one, because a deep one just gets filled up with bubbles thanks to all those nasty instructions depending on each other.

From a compiler's point of view, typical latencies in modern processors range from a single cycle for integer operations, to around 3-6 cycles for floating-point addition and the same or perhaps slightly longer for multiplication, through to over a dozen cycles for integer division.

Latencies for memory loads are particularly troublesome, in part because they tend to occur early within code sequences, which makes it difficult to fill their delays with useful instructions, and equally importantly because they are somewhat unpredictable – the load latency varies a lot depending on whether the access is a cache hit or not (we'll get to caches later).

Branches & Branch Prediction

Another key problem for pipelining is branches. Consider the following code sequence...

if (a > 5) {
 b = c;
} else {
 b = d;
}

which compiles into something like...

 cmp a, 5 ; a > 5 ?
 ble L1
 mov c, b ; b = c
 br L2
L1: mov d, b ; b = d
L2: ...

Now consider a pipelined processor executing this code sequence. By the time the conditional branch at line 2 reaches the execute stage in the pipeline, the processor must have already fetched and decoded the next couple of instructions. But which instructions? Should it fetch and decode the if branch (lines 3 & 4) or the else branch (line 5)? It won't really know until the conditional branch gets to the execute stage, but in a deeply pipelined processor that might be several cycles away. And it can't afford to just wait – the processor encounters a branch every six instructions on average, and if it was to wait several cycles at every branch then most of the performance gained by using pipelining in the first place would be lost.

So the processor must make a guess. The processor will then fetch down the path it guessed and speculatively begin executing those instructions. Of course, it won't be able to actually commit (writeback) those instructions until the outcome of the branch is known. Worse, if the guess is wrong the instructions will have to be cancelled, and those cycles will have been wasted. But if the guess is correct the processor will be able to continue on at full speed.

The key question is how the processor should make the guess. Two alternatives spring to mind. First, the compiler might be able to mark the branch to tell the processor which way to go. This is called static branch prediction. It would be ideal if there was a bit in the instruction format in which to encode the prediction, but for older architectures this is not an option, so a convention can be used instead (such as backward branches are predicted to be taken while forward branches are predicted not-taken). More importantly, however, this approach requires the compiler to be quite smart in order for it to make the correct guess, which is easy for loops but might be difficult for other branches.

The other alternative is to have the processor make the guess at runtime. Normally, this is done by using an on-chip branch prediction table containing the addresses of recent branches and a bit indicating whether each branch was taken or not last time. In reality, most processors actually use two bits, so that a single not-taken occurrence doesn't reverse a generally taken prediction (important for loop back edges). Of course, this dynamic branch prediction table takes up valuable space on the processor chip, but branch prediction is so important that it's well worth it.
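
Here is a minimal sketch of such a two-bit scheme in C (the table size and indexing are arbitrary choices for illustration; real predictors are considerably more sophisticated)...

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 1024

/* Each entry is a 2-bit saturating counter: 0,1 = predict not-taken, 2,3 = predict taken. */
static uint8_t counter[TABLE_SIZE];

static bool predict(uint32_t branch_pc) {
    return counter[(branch_pc >> 2) % TABLE_SIZE] >= 2;
}

static void update(uint32_t branch_pc, bool taken) {
    uint8_t *c = &counter[(branch_pc >> 2) % TABLE_SIZE];
    if (taken)  { if (*c < 3) (*c)++; }   /* strengthen towards "taken"     */
    else        { if (*c > 0) (*c)--; }   /* strengthen towards "not taken" */
}

int main(void) {
    /* Simulate a loop branch at a made-up address that is taken 9 times, then falls through. */
    for (int i = 0; i < 10; i++) {
        bool taken = (i < 9);
        printf("iteration %d: predicted %s, actually %s\n", i,
               predict(0x400) ? "taken" : "not-taken", taken ? "taken" : "not-taken");
        update(0x400, taken);
    }
    return 0;
}

Because a single anomalous outcome only moves the counter one step, a loop branch that falls through once still predicts "taken" the next time the loop is entered.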

Unfortunately, even the best branch prediction techniques are sometimes wrong, and with a deep pipeline many instructions might need to be cancelled. This is called the mispredict penalty. The Pentium-Pro/II/III was a good example – it had a 12+ stage pipeline and thus a mispredict penalty of 10-15 cycles. Even with a clever dynamic branch predictor that correctly predicted an impressive 90% of the time, this high mispredict penalty meant about 30% of the Pentium-Pro/II/III's performance was lost due to mispredictions. Put another way, one third of the time the Pentium-Pro/II/III was not doing useful work but instead was saying "oops, wrong way". Modern processors devote ever more hardware to branch prediction in an attempt to raise the prediction accuracy even further, and reduce this cost, but even the best processors still lose quite a lot of performance due to branch mispredictions.

Eliminating Branches with Predication

Conditional branches are so problematic that it would be nice to eliminate them altogether. Clearly, if statements cannot be eliminated from programming languages, so how can the resulting branches possibly be eliminated? The answer lies in the way some branches are used.

Consider the above example once again. Of the five instructions, two are branches, and one of those is an unconditional branch. If it was possible to somehow tag the mov instructions to tell them to execute only under some conditions, the code could be simplified...

cmp a, 5 ; a > 5 ?
mov c, b ; b = c
cmovle d, b ; if le, then b = d

Here, a new instruction has been introduced called cmovle, for "conditional move if less than or equal". This instruction works by executing as normal, but only commits itself if its condition is true. This is called a predicated instruction because its execution is controlled by a predicate (a true/false test).

Given this new predicated move instruction, two instructions have been eliminated from the code, and both were costly branches. In addition, by being clever and always doing the first mov then overwriting it if necessary, the parallelism of the code has also been increased – lines 1 and 2 can now be executed in parallel, resulting in a 50% speedup (2 cycles rather than 3). Most importantly, though, the possibility of getting the branch prediction wrong and suffering a large mispredict penalty has been eliminated.
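
At the source level, this is the same transformation a compiler performs when it turns a small if/else into a conditional move. A minimal sketch (whether a given compiler actually emits cmov here depends on the target architecture and optimization level)...

/* Likely compiled to a compare plus a conditional move, with no branch at all: */
int select(int a, int c, int d) {
    return (a > 5) ? c : d;
}

/* A hand-written branchless equivalent using a mask (assumes two's-complement integers): */
int select_mask(int a, int c, int d) {
    int mask = -(a > 5);              /* all ones if the condition is true, zero otherwise */
    return (c & mask) | (d & ~mask);
}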

Of course, if the blocks of code in the if and else cases were longer, then using predication would mean executing more instructions than using a branch, because the processor is effectively executing both paths through the code. Whether it's worth executing a few more instructions to avoid a branch is a tricky decision – for very small or very large blocks the decision is simple, but for medium-sized blocks there are complex tradeoffs which the optimizer must consider.

The Alpha architecture had a conditional move instruction from the very beginning. MIPS, SPARC and x86 added it later. With IA64, Intel went all-out and made almost every instruction predicated in the hope of dramatically reducing branching problems in inner loops, especially ones where the branches are unpredictable (such as compilers and OS kernels). Interestingly, the ARM architecture used in many phones and tablets was the first architecture with a fully predicated instruction set. This is even more intriguing given that the early ARM processors only had short pipelines and thus relatively small mispredict penalties.

Instruction Scheduling, Register Renaming & OoO

If branches and long latency instructions are going to cause bubbles in the pipeline(s), then perhaps those empty cycles can be used to do other work. To achieve this, the instructions in the program must be reordered so that while one instruction is waiting, other instructions can execute. For example, it might be possible to find a couple of other instructions from further down in the program and put them between the two instructions in the earlier multiply example.
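
Continuing that example, the reordered sequence might look like the following (the added statements are hypothetical, independent work pulled up from later in the program)...

a = b * c;     /* multi-cycle multiply starts */
x = y + 1;     /* independent – executes during the multiply's latency */
z = w - 2;     /* also independent */
d = a + 1;     /* by now the multiply result is ready (or nearly so) */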

There are two ways to do this. One approach is to do the reordering in hardware at runtime. Doing dynamic instruction scheduling (reordering) in the processor means the dispatch logic must be enhanced to look at groups of instructions and dispatch them out of order as best it can to use the processor's functional units. Not surprisingly, this is called out-of-order execution, or just OoO for short (sometimes written OOO or OOE).

If the processor is going to execute instructions out of order, it will need to keep in mind the dependencies between those instructions. This can be made easier by not dealing with the raw architecturally-defined registers, but instead using a set of renamed registers. For example, a store of a register into memory, followed by a load of some other piece of memory into the same register, represent different values and need not go into the same physical register. Furthermore, if these different instructions are mapped to different physical registers they can be executed in parallel, which is the whole point of OoO execution. So, the processor must keep a mapping of the instructions in flight at any moment and the physical registers they use. This process is called register renaming. As an added bonus, it becomes possible to work with a potentially larger set of real registers in an attempt to extract even more parallelism out of the code.
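
Here is a minimal sketch of the bookkeeping involved (the table sizes and the naive free-register allocation are invented for illustration; real hardware recycles physical registers through a free list as instructions retire)...

#include <stdio.h>

#define N_ARCH 8     /* architectural registers visible to software         */
#define N_PHYS 32    /* physical registers actually present in the hardware */

static int rename_map[N_ARCH];   /* which physical register currently holds each architectural one */
static int next_free = N_ARCH;   /* naive allocator: hand out physical registers in order           */

/* Writing an architectural register allocates a fresh physical register, so earlier
   in-flight instructions that read the old value are unaffected and can run in parallel. */
int rename_dest(int arch_reg) {
    rename_map[arch_reg] = next_free++ % N_PHYS;
    return rename_map[arch_reg];
}

/* Reads simply look up the current mapping. */
int rename_src(int arch_reg) {
    return rename_map[arch_reg];
}

int main(void) {
    /* Two writes to architectural register 3 get two different physical registers. */
    printf("first  write of r3 -> p%d\n", rename_dest(3));
    printf("second write of r3 -> p%d\n", rename_dest(3));
    return 0;
}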

All of this dependency analysis, register renaming and OoO execution adds a lot of complex logic to the processor, making it harder to design, larger in terms of chip area, and more power hungry. The extra logic is particularly power hungry because those transistors are always working, unlike the functional units which spend at least some of their time idle (possibly even powered down). On the other hand, out-of-order execution offers the advantage that software need not be recompiled to get at least some of the benefits of the new processor's design (though typically not all).

Another approach to the whole problem is to have the compiler optimize the code by rearranging the instructions (called static, or compile-time, instruction scheduling). The rearranged instruction stream can then be fed to a processor with simpler in-order multiple-issue logic, relying on the compiler to "spoon feed" the processor with the best instruction stream. Avoiding the need for complex OoO logic should make the processor quite a lot easier to design, less power hungry and smaller, which means more cores (or extra cache) could be placed onto the same amount of chip area (more on this later).

The compiler approach also has some other advantages over OoO hardware – it can see further down the program than the hardware, and it can speculate down multiple paths rather than just one (a big issue if branches are unpredictable). On the other hand, a compiler can't be expected to be psychic, so it can't necessarily get everything perfect all the time. Without OoO hardware, the pipeline will stall when the compiler fails to predict something like a cache miss.

Most of the early superscalars were in-order designs (SuperSPARC, hyperSPARC, UltraSPARC-I/II, Alpha 21064 & 21164, the original Pentium). Examples of early OoO designs included the MIPS R10000, Alpha 21264 and to some extent the entire POWER/PowerPC line (with their reservation stations). Today, almost all high performance processors are out-of-order designs, with the notable exceptions of UltraSPARC-III/IV and POWER6. Most low-power processors, such as ARM11, Cortex-A8 and Atom, are in-order designs because OoO logic consumes a lot of power for a relatively small performance gain.

The Brainiac Debate

A question that must be asked is whether the costly out-of-order logic is really warranted, or whether compilers can do the task of instruction scheduling well enough without it. This is historically called the brainiac vs speed-demon debate. This simple (and fun) classification of design styles first appeared in a 1993 Microprocessor Report editorial by Linley Gwennap, and was made widely known by Dileep Bhandarkar's Alpha Implementations & Architecture book.

Brainiac designs are at the smart-machine end of the spectrum, with lots of OoO hardware trying to squeeze every last drop of performance out of the code, even if it costs millions of logic transistors and tons of power to do it. In contrast, speed-demon designs are simpler and smaller, relying on a smart compiler and willing to sacrifice a little bit of instruction-level parallelism for the other benefits that simplicity brings. Historically, the speed-demon designs tended to run at higher clock speeds, precisely because they were simpler, hence the "speed-demon" name, but today that's no longer the case because clock speed is limited mainly by power and thermal issues.

Clearly, OoO hardware should make it possible for more instruction-level parallelism to be extracted, because things will be known at runtime that cannot be predicted in advance (cache misses, for example). On the other hand, a simpler in-order design will be smaller and use less power, which means you can place more small in-order cores onto the same chip as fewer, larger out-of-order cores. Which would you rather have: 4 powerful brainiac cores, or 8 simpler in-order cores?

Exactly which is the more important factor is currently open to hot debate. In general, it seems that both the benefits and the costs of OoO execution have been somewhat overstated in the past. In terms of cost, appropriate pipelining of the dispatch and register renaming logic allowed OoO processors to achieve clock speeds competitive with simpler designs by the late 1990s, and clever engineering has reduced the power overhead of OoO execution considerably in recent years, leaving only the chip area cost. This is a testament to some outstanding engineering by processor architects. Unfortunately, however, the effectiveness of OoO execution in dynamically extracting additional instruction-level parallelism has been disappointing, with only a relatively small improvement being seen, perhaps 20-30% or so. OoO execution has also been unable to deliver the degree of schedule-insensitivity originally hoped for, with recompilation still producing large speedups even on aggressive OoO processors.

When it comes to the brainiac debate, many vendors have gone down one path then changed their mind and switched to the other side...

Figure 11 – Brainiacs vs speed-demons.

DEC, for example, went primarily speed-demon with the first two generations of Alpha, then changed to brainiac for the third generation. MIPS did similarly. Sun, on the other hand, went brainiac with their first superscalar then switched to speed-demon for more recent designs. The POWER/PowerPC camp also gradually moved away from brainiac designs over the years, although the reservation stations in all PowerPC designs do offer a degree of OoO execution between different functional units even if the instructions within each functional unit's queue are executed strictly in order.

Intel has been the most interesting of all to watch. Modern x86 processors have no choice but to be at least somewhat brainiac due to limitations of the x86 architecture (more on this soon), and the Pentium-Pro/II/III embraced that sentiment wholeheartedly. But then with the Pentium-4 Intel went about as speed-demon as possible for a decoupled x86 microarchitecture, and with IA64 Intel again bet solidly on the smart-compiler approach, with a simple but very wide design relying totally on static scheduling. Faced with the enormous power and heat issues of the Pentium-4, Intel then reversed its position once again and revived the older Pentium-Pro/II/III brainiac design to produce the Pentium-M and its Core successors.

No matter which route is taken, the key problem is still the same – normal programs just don't have a lot of fine-grained parallelism in them. A 4-issue superscalar processor requires four independent instructions to be available, with all their dependencies and latencies met, at every cycle. In reality this is virtually never possible, especially with load latencies of three or four cycles. Currently, real-world instruction-level parallelism for mainstream applications is limited to about 2 instructions per cycle at best. Certain types of applications do exhibit more parallelism, such as scientific code, but these are generally not representative of mainstream applications. There are also some types of code, such as pointer chasing, where even sustaining 1 instruction per cycle is extremely difficult. For those programs, the key problem is the memory system (which we'll get to later).

What About x86?

So where does x86 fit into all this, and how have Intel and AMD been able to remain competitive through all of these developments in spite of an architecture that's now more than 30 years old?

While the original Pentium, a superscalar x86, was an amazing piece of engineering, it was clear that the big problem was the complex and messy x86 instruction set. Complex addressing modes and a minimal number of registers meant that few instructions could be executed in parallel due to potential dependencies. For the x86 camp to compete with the RISC architectures, they needed to find a way to "get around" the x86 instruction set.

The solution, invented independently (at about the same time) by engineers at both NexGen and Intel, was to dynamically decode the x86 instructions into simple, RISC-like micro-instructions, which can then be executed by a fast, RISC-style register-renaming OoO superscalar core. The micro-instructions are often called uops (short for micro-ops). Most x86 instructions decode into 1, 2 or 3 uops, while the more complex instructions require a larger number.

For these "decoupled" superscalar x86 processors, register renaming is absolutely critical due to the meager 8 registers of the x86 architecture in 32-bit mode (64-bit mode added another 8 registers). This differs strongly from the RISC architectures, where providing more registers via renaming only has a minor effect. Nonetheless, with clever register renaming, the full bag of RISC tricks become available to the x86 world, with the two exceptions of advanced static instruction scheduling (because the micro-instructions are hidden behind the x86 layer and thus are less visible to compilers) and the use of a large register set to avoid memory accesses.

The basic scheme works something like this...

Figure 12 – A "RISCy x86" decoupled microarchitecture.

All recent x86 processors use this technique. Of course, they all differ in the exact design of their core pipelines, functional units and so on, just like the various RISC processors, but the fundamental idea of translating from x86 to internal micro-instructions is common to all of them.

One of the most interesting members of this RISC-style x86 group was the Transmeta Crusoe processor, which translated x86 instructions into an internal VLIW form, rather than internal superscalar, and used software to do the translation at runtime, much like a Java virtual machine. This approach allowed the processor itself to be a simple VLIW, without the complex x86 decoding and register renaming hardware of decoupled x86 designs, and without any superscalar dispatch or OoO logic either. The software-based x86 translation did reduce the system's performance compared to hardware translation (which occurs as additional pipeline stages and thus is almost free in performance terms), but the result was a very lean chip which ran fast and cool and used very little power. A 600 MHz Crusoe processor could match a then-current 500 MHz Pentium-III running in its low-power mode (300 MHz clock speed) while using only a fraction of the power and generating only a fraction of the heat. This made it ideal for laptops and handheld computers, where battery life is crucial. Today, of course, x86 processor variants designed specifically for low power use, such as the Pentium-M and its Core descendants, have made the Transmeta-style software-based approach unnecessary.

Threads – SMT, Hyper-Threading & Multi-Core

As already mentioned, the approach of exploiting instruction-level parallelism through superscalar execution is seriously weakened by the fact that most normal programs just don't have a lot of fine-grained parallelism in them. Because of this, even the most aggressively brainiac OoO superscalar processor, coupled with a smart and aggressive compiler to spoon feed it, will still almost never exceed an average of about 2 instructions per cycle when running most real-world software, due to a combination of load latencies, cache misses, branching and dependencies between instructions. Issuing many instructions in the same cycle only ever happens for short bursts of a few cycles at most, separated by many cycles of executing low-ILP code, so peak performance is not even close to being achieved.

If additional independent instructions aren't available within the program being executed, there is another potential source of independent instructions – other running programs (or other threads within the same program). Simultaneous multithreading (SMT) is a processor design technique which exploits exactly this type of thread-level parallelism.

Once again, the idea is to fill those empty bubbles in the pipelines with useful instructions, but this time rather than using instructions from further down in the same program (which are hard to come by), the instructions come from multiple threads running at the same time, all on the one processor core. So, an SMT processor appears to the rest of the system as if it were multiple independent processors, just like a true multi-processor system.

Of course, a true multi-processor system also executes multiple threads simultaneously – but only one in each processor. This is also true for multi-core processors, which place two or more processor cores onto a single chip, but are otherwise no different from traditional multi-processor systems. In contrast, an SMT processor uses just one physical processor core to present two or more logical processors to the system. This makes SMT much more efficient than a multi-core processor in terms of chip space, fabrication cost, power usage and heat dissipation. And of course there's nothing preventing a multi-core implementation where each core is an SMT design.

From a hardware point of view, implementing SMT requires duplicating all of the parts of the processor which store the "execution state" of each thread – things like the program counter, the architecturally-visible registers (but not the rename registers), the memory mappings held in the TLB, and so on. Luckily, these parts only constitute a tiny fraction of the overall processor's hardware. The really large and complex parts, such as the decoders and dispatch logic, the functional units, and the caches, are all shared between the threads.

Of course, the processor must also keep track of which instructions and which rename registers belong to which threads at any given point in time, but it turns out that this only adds a small amount to the complexity of the core logic. So, for the relatively cheap design cost of around 10% more logic in the core (and an almost negligible increase in total transistor count and final production cost), the processor can execute several threads simultaneously, hopefully resulting in a substantial increase in functional unit utilization and instructions-per-clock (and thus overall performance).

The instruction flow of an SMT processor looks something like...

Figure 13 – The instruction flow of an SMT processor.

This is really great! Now that we can fill those bubbles by running multiple threads, we can justify adding more functional units than would normally be viable in a single-threaded processor, and really go to town with multiple instruction issue. In some cases, this may even have the side effect of improving single-thread performance (for particularly ILP-friendly code, for example).

So 20-issue here we come, right? Unfortunately, the answer is no.

SMT performance is a tricky business. First, the whole idea of SMT is built around the assumption that either lots of programs are simultaneously executing (not just sitting idle), or if just one program is running, it has lots of threads all executing at the same time. Experience with existing multi-processor systems shows that this isn't always true. In practice, at least for desktops, laptops and small servers, it is rarely the case that several different programs are actively executing at the same time, so it usually comes down to just the one task that the machine is currently being used for.

Some applications, such as database systems, image & video processing, audio processing, 3D graphics rendering and scientific code, do have obvious high-level (coarse-grained) parallelism available and easy to exploit, but unfortunately even many of these applications have not been written to make use of multiple threads in order to exploit multiple processors. In addition, many of the applications which are inherently parallel in nature are primarily limited by memory bandwidth, not by the processor (eg: image & video processing, audio processing, most scientific code), so adding a second thread or processor won't help them much – unless memory bandwidth is also dramatically increased (we'll get to the memory system soon). Worse yet, many other applications such as web browsers, multimedia design tools, language interpreters, hardware simulations and so on, are simply not inherently parallel enough to make effective use of multiple processors.

On top of this, the fact that the threads in an SMT design are all sharing just one processor core, and just one set of caches, has major performance downsides compared to a true multi-processor (or multi-core). Within the pipelines of an SMT processor, if one thread saturates just one functional unit which the other threads need, it effectively stalls all of the other threads, even if they only need relatively little use of that unit. Thus, balancing the progress of the threads becomes critical, and the most effective use of SMT is for applications with highly variable code mixtures (so that the threads don't constantly compete for the same hardware resources). Also, competition between the threads for cache space may produce worse results than letting just one thread have all the cache space available – particularly for applications where the critical working set is highly cache-size sensitive, such as hardware simulators/emulators, virtual machines and high quality video encoding (with a large motion prediction window).

The bottom line is that without care, and even with care for some applications, SMT performance can actually be worse than single-thread performance and traditional context switching between threads. On the other hand, applications which are limited primarily by memory latency (but not memory bandwidth), such as database systems and 3D graphics rendering, benefit dramatically from SMT, since it offers an effective way of using the otherwise idle time during cache misses (we'll cover caches later). Thus, SMT presents a very complex and application-specific performance picture. This also makes it a difficult challenge for marketing – sometimes almost as fast as two "real" processors, sometimes more like two really lame processors, sometimes even worse than one processor, huh?

The Pentium-4 was the first processor to use SMT, which Intel calls "hyper-threading". Its design allowed for 2 simultaneous threads (although earlier revisions of the Pentium-4 had the SMT feature disabled due to bugs). Speedups from SMT on the Pentium-4 ranged from around -10% to +30% depending on the application(s). Subsequent Intel designs then eschewed SMT during the transition back to the brainiac designs of the Pentium-M and Core 2, along with the transition to multi-core. Many other SMT designs were also cancelled around the same time (Alpha 21464, UltraSPARC-V), and for a while it almost seemed as if SMT was out of favor, before it finally made a comeback with POWER5, a 2-thread SMT design as well as being multi-core (2 threads per core times 2 cores per chip = 4 threads per chip). Intel's Core i and Core i*2 are also 2-thread SMT, as is the low-power Atom x86 processor. A typical quad-core Core i processor is thus an 8 thread chip. Sun was the most aggressive of all on the thread-level parallelism front, with UltraSPARC-T1 (aka: "Niagara") providing 8 simple in-order cores each with 4-thread SMT, for a total of 32 threads on a single chip. This was subsequently increased to 8 threads per core in UltraSPARC-T2, and then 16 cores in UltraSPARC-T3, for a whopping 128 threads!

More Cores or Wider Cores?

Given SMT's ability to convert thread-level parallelism into instruction-level parallelism, coupled with the advantage of better single-thread performance for particularly ILP-friendly code, you might now be asking why anyone would ever build a multi-core processor when an equally wide (in total) SMT design would be superior.

Well unfortunately it's not quite as simple as that. As it turns out, very wide superscalar designs scale very badly in terms of both chip area and clock speed. One key problem is that the complex multiple-issue dispatch logic scales somewhere between quadratically and exponentially with the issue-width. That is, the dispatch logic of a 5-issue processor is almost twice as big as a 4-issue design, with 6-issue being 4 times as big, 7-issue 8 times and so on. In addition, a very wide superscalar design requires highly multi-ported register files and caches. Both of these factors conspire to not only increase size, but also to massively increase the amount of wiring at the circuit-design level, placing serious limits on the clock speed. So a 10-issue core would actually be both larger and slower than two 5-issue cores, and our dream of a 20-issue SMT design isn't really viable due to circuit design limitations.

Nevertheless, since the benefits of both SMT and multi-core depend so much on the nature of the target application(s), a broad spectrum of designs might still make sense with varying degrees of SMT and multi-core. Let's explore some possibilities...

Today, a "typical" SMT design implies both a wide execution core and OoO execution logic, including multiple decoders, the large and complex superscalar dispatch logic and so on. Thus, the size of a typical SMT core is quite large in terms of chip area. With the same amount of chip space it would be possible to fit several simpler, single-issue, in-order cores (either with or without basic SMT). In fact, it may be the case that as many as half a dozen small, simple cores could fit within the chip area taken by just one modern OoO superscalar SMT design!

Now, given that both instruction-level parallelism and thread-level parallelism suffer from diminishing returns (in different ways), and remembering that SMT is essentially a way to convert TLP into ILP, but also remembering that wide superscalar OoO designs scale very non-linearly in terms of chip area (and design complexity), the obvious question is where is the sweet spot? How wide should the cores be made to reach a good balance between ILP and TLP? Right now, many different approaches are being explored...

At one extreme we have processors like Intel's Core i*2 "Sandy Bridge" (above left), consisting of four large, wide, 4-issue, out-of-order, aggressively brainiac cores (along the top, with shared L3 cache below) each running 2 threads for a total of 8 "fast" threads. At the other end of the spectrum, Sun's UltraSPARC-T3 "Niagara 3" (above right) contains 16 much smaller, simpler, 2-issue in-order cores (top and bottom with shared L2 cache towards the center) each running 8 threads, for a massive 128 threads in total, though these threads are considerably slower than those of the Core i*2. Both chips contain around 1 billion transistors and are drawn approximately to scale above. Note just how much smaller the simple, in-order cores really are.

Which is the better approach? Alas, there's no simple answer here – once again it's going to depend very much on the application(s). For applications with lots of active but memory-latency-limited threads (eg: database systems, 3D graphics rendering), more simple cores would be better because the big/wide cores spend most of their time waiting for memory anyway. For most applications, however, there simply are not enough threads active to make this viable, and the performance of just a single thread is much more important, so a design with fewer but bigger, wider, more brainiac cores is more appropriate.

Of course, there are also a whole range of options between these two extremes that have yet to be fully explored. IBM's POWER7, for example, takes the middle ground with an 8 core, 4-thread SMT design with moderately but not overly aggressive OoO execution hardware. AMD's Bulldozer design uses a more innovative approach, with a shared, SMT-style front-end for each pair of cores feeding a back-end with unshared, multi-core-style integer execution units but shared, SMT-style floating-point units, blurring the lines between SMT and multi-core. Who knows, perhaps in the future we might even see asymmetric designs, with one or two big, wide, brainiac cores plus a large number of smaller, narrower, simpler cores. Just imagine trying to optimize code for that! IBM's Cell processor was arguably the first such design, although the small, simple cores in Cell were not instruction-set compatible with the large main core, and acted more like special-purpose coprocessors.

Data Parallelism – SIMD Vector Instructions

In addition to instruction parallelism, there is another source of parallelism in many programs – data parallelism. Rather than looking for ways to execute groups of instructions in parallel, the idea is to look for ways to make one instruction apply to a group of values in parallel.

This is sometimes called SIMD parallelism (single instruction, multiple data). More often, it's called vector processing. Supercomputers used to use vector processing a lot, with very long vectors, because the types of scientific programs which are run on supercomputers are quite amenable to vector processing.

Today, however, vector supercomputers have long since given way to multi-processor designs where each processing unit is a commodity CPU. So why revive vector processing?

In many situations, especially in imaging, video and multimedia applications, a program needs to execute the same instruction for a small group of related values, usually a short vector (a simple structure or small array). For example, an image processing application might want to add groups of 8-bit numbers, where each 8-bit number represents one of the red, green, blue or alpha (transparency) values of a pixel...

Figure 15 – A SIMD vector addition operation.

What's happening here is exactly the same operation as a 32-bit addition, except that every 8th carry is not being propagated. Also, it might be desirable for the values not to wrap to zero once all 8 bits are full, and instead to hold at 255 as a maximum value in those cases (called saturation arithmetic). In other words, every 8th carry is not carried across but instead triggers an all-ones result. So, the vector addition operation shown above is really just a modified 32-bit add.

From the hardware point of view, adding these types of vector instructions is not terribly difficult – existing registers can be used and in many cases the functional units can be shared with existing integer or floating-point units. Other useful packing and unpacking instructions can also be added, for byte shuffling and so on, and a few predicate-like instructions for bit-masking etc. With some thought, a small set of vector instructions can enable some impressive speedups.
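To make this concrete, here is a small sketch, not taken from the article, of a saturating byte-wise add written in C with the x86 SSE2 intrinsics; the function name and loop structure are my own, but _mm_adds_epu8 is the actual 16-byte saturating-add intrinsic:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Add two arrays of 8-bit pixel components, 16 at a time, using
   saturating arithmetic so results clamp at 255 instead of wrapping. */
void add_pixels_saturated(const uint8_t *a, const uint8_t *b,
                          uint8_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vr = _mm_adds_epu8(va, vb);    /* 16 saturating byte adds */
        _mm_storeu_si128((__m128i *)(out + i), vr);
    }
    for (; i < n; i++) {                       /* scalar tail for leftovers */
        unsigned s = (unsigned)a[i] + b[i];
        out[i] = s > 255 ? 255 : (uint8_t)s;
    }
}

Each call to _mm_adds_epu8 performs 16 of the clamped 8-bit additions of Figure 15 in a single instruction.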

Of course, there's no reason to stop at 32 bits. If there happen to be some 64-bit registers, which architectures usually have for floating-point (at least), they could be used to provide 64-bit vectors, thereby doubling the parallelism (SPARC VIS and x86 MMX did this). If it is possible to define entirely new registers, then they might as well be even wider (SSE added 8 new 128-bit registers, later increased to 16 registers in 64-bit mode, then widened to 256 bits with AVX, while PowerPC AltiVec provided a full set of 32 new 128-bit registers from the start, in keeping with PowerPC's more separated design style where even the branch instructions have their own registers). An alternative to widening the registers is to use pairing, where each pair of registers is treated as a single operand by the SIMD vector instructions (ARM NEON does this, with its registers usable both as 32 64-bit registers or as 16 128-bit registers). Naturally, the data in the registers can also be divided up in other ways, not just as 8-bit bytes – for example as 16-bit integers for high-quality image processing, or as floating-point values for scientific number crunching. With AltiVec, for example, it is possible to execute a 4-way parallel floating-point multiply-add as a single, fully pipelined instruction.

For applications where this type of data parallelism is available and easy to extract, SIMD vector instructions can produce amazing speedups. The original target applications were primarily in the area of image and video processing, however suitable applications also include audio processing, speech recognition, some parts of 3D graphics rendering and many types of scientific programs. For other types of applications, such as compilers and database systems, the speedup is generally much smaller, perhaps even nothing at all.

Unfortunately, it's quite difficult for a compiler to automatically make use of vector instructions when working from normal source code, except in trivial cases. The key problem is that the way programmers write programs tends to serialize everything, which makes it difficult for a compiler to prove that two given operations are independent and can be done in parallel. Progress is slowly being made in this area, but at the moment programs must basically be rewritten by hand to take advantage of vector instructions (except for simple array-based loops in scientific code).
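As a rough illustration of why proving independence is hard, consider pointer aliasing in C; this example is mine, not the article's, and the C99 restrict qualifier is one way the programmer can hand the compiler the missing guarantee:

/* Without 'restrict', the compiler must assume dst and src may overlap,
   which can prevent it from proving the loop iterations are independent. */
void scale_may_alias(float *dst, const float *src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* With 'restrict', the programmer promises the arrays do not overlap,
   so the iterations are trivially independent and far easier for the
   compiler to vectorize automatically. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}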

Luckily, however, rewriting just a small amount of code in key places within the graphics and video/audio libraries of your favorite operating system has a widespread effect across many applications. Today, most OSs have enhanced their key library functions in this way, so that virtually all multimedia and 3D graphics applications do make use of these highly effective vector instructions. Chalk up yet another win for abstraction!

Almost every architecture has now added SIMD vector extensions, including SPARC (VIS), x86 (MMX/SSE/AVX), PowerPC (AltiVec) and ARM (NEON). Only relatively recent processors from each architecture can execute some of these new instructions, however, which raises backward compatibility issues, especially on x86 where the SIMD instructions evolved somewhat haphazardly (3DNow!, MMX, SSE, SSE2, SSE3, SSE4, AVX).

Caches & The Memory Hierarchy

As mentioned earlier, latency is a big problem for pipelined processors, and latency is especially bad for loads from memory, which make up about a quarter of all instructions.

Loads tend to occur near the beginning of code sequences (basic blocks), with most of the other instructions depending on the data being loaded. This causes all the other instructions to stall, and makes it difficult to obtain large amounts of instruction-level parallelism. Things are even worse than they might first seem, because in practice most superscalar processors can still only issue one, or at most two, memory instructions per cycle.

The core problem with memory access is that building a fast memory system is very difficult because of fixed limits, like the speed of light. These impose delays while a signal is transferred out to RAM and back. Nothing can change this fact of nature – we must learn to work around it.

For example, access latency for main memory, even using a modern SDRAM with a CAS latency of 5, will typically be around 15 cycles of the memory system clock – 1 to send the address to the chipset (north bridge), 1 more to get it to the DIMM, RAS-to-CAS delay of 5 (assuming a page miss), CAS latency of 5, another 1 to get the data to the output buffer of the DIMM, 1 to send the data back to the chipset, and a final 1 to send the data up to the processor (or E-cache). On a multi-processor system, even more bus cycles may be required to support cache coherency.

Assuming a typical 400 MHz SDRAM memory system (DDR2-800), and assuming a 2.0 GHz processor, this makes 15*5 = 75 cycles of the CPU clock to access main memory! Yikes, you say! And it gets worse – a 2.4 GHz processor would take it to 90 cycles, a 2.8 GHz processor to 105 cycles, and even if the memory system was increased to 666 MHz (DDR3-1333, with CAS latency slipping to 9 in the process), a 3.3 GHz processor would still wait 115 cycles, and a 4.0 GHz processor a staggering 138 cycles to access main memory!

Furthermore, although a DDR SDRAM memory system transfers data on both the rising and falling edges of the clock signal (ie: at "double data rate"), the true clock speed of the memory system is still only half that, and it is the true clock speed which applies for control signals. So the latency of a DDR memory system is the same as a non-DDR system, even though the bandwidth is doubled (more on the difference between bandwidth and latency later).

Also note that a small portion of memory latency (2 of the 15 bus cycles) involves the transfer of data between the processor and the chipset on the motherboard. One way to reduce this is to dramatically increase the speed of the frontside bus (FSB) between the processor and the chipset (eg: 800 MHz QDR in Pentium-4, 1.25 GHz DDR in PowerPC G5). An even better approach is to integrate the memory controller directly onto the processor chip, which allows the 2 bus cycles to be converted into much faster processor cycles instead. The UltraSPARC-IIi and Athlon 64 were the first mainstream processors to do this, and now all modern designs feature on-chip memory controllers, although Intel were late to do so and only integrated the memory controller into their CPUs starting with Core i & i*2.

Unfortunately, both DDR memory and on-chip memory controllers are only able to do so much – and memory latency continues to be a major problem. This problem of the large and widening gap between the processor and memory is sometimes called the memory wall. It was at one time the single most important problem facing hardware engineers, though today the problem has eased considerably because processor clock speeds are no longer climbing at the rate they previously did due to power and heat constraints.

Nonetheless, memory latency is still a huge problem.

Modern processors try to solve this problem with caches. A cache is a small but fast type of memory located on or near the processor chip. Its role is to keep copies of small pieces of main memory. When the processor asks for a particular piece of main memory, the cache can supply it much more quickly than main memory would be able to – if the data is in the cache.

Typically, there are small but fast "primary" level-1 (L1) caches on the processor chip itself, inside each core, usually around 8k-64k in size, with a larger level-2 (L2) cache further away but still on-chip (a few hundred KB to a few MB), and possibly an even larger and slower L3 cache etc. The combination of the on-chip caches, any off-chip external cache (E-cache) and main memory (DRAM) together form a memory hierarchy, with each successive level being larger but slower than the one before it. At the bottom of the memory hierarchy, of course, is the virtual memory system (paging/swapping), which provides the illusion of an almost infinite amount of main memory by moving pages of RAM to and from hard drive storage (which is even slower again, by a large margin).

It's a bit like working at a desk in a library... You might have two or three books open on the desk itself. Accessing them is fast (you can just look), but you can't fit more than a couple on the desk at the same time – and even if you could, accessing 100 books laid out on a huge desk would take longer because you'd have to walk between them. Instead, in the corner of the desk you might have a pile of a dozen more books. Accessing them is slower, because you have to reach over, grab one and open it up. Each time you open a new one, you also have to put one of the books already on the desk back into the pile to make room. Finally, when you want a book that's not on the desk, and not in the pile, it's very slow to access because you have to get up and walk around the library looking for it. However the size of the library means you have access to thousands of books, far more than could ever fit on your desk.
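The jumps between these levels are easy to observe by timing a chain of dependent loads while the working set grows; the cost per load rises sharply each time the data no longer fits in another level of cache. The following is a rough, self-contained sketch of that classic experiment, with the sizes, iteration counts and use of Sattolo's shuffle all being assumptions of mine rather than anything from the article:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase_ns_per_load(size_t n, size_t steps)
{
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++)
        next[i] = i;

    /* Sattolo's shuffle: turns the identity into one big cycle, which
       defeats both spatial locality and the hardware prefetchers. */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t s = 0; s < steps; s++)
        p = next[p];                  /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (p == (size_t)-1)              /* keep the chase from being optimised away */
        puts("impossible");
    free(next);

    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)steps;
}

int main(void)
{
    /* Working sets from 4 KB (fits in L1) up to 256 MB (far beyond any cache). */
    for (size_t kb = 4; kb <= 256 * 1024; kb *= 4)
        printf("%8zu KB: %6.1f ns per dependent load\n",
               kb, chase_ns_per_load(kb * 1024 / sizeof(size_t), 20 * 1000 * 1000));
    return 0;
}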

The amazing thing about caches is that they work really well – they effectively make the memory system seem almost as fast as the L1 cache, yet as large as main memory. A modern primary (L1) cache has a latency of just 2 to 4 processor cycles, which is dozens of times faster than accessing main memory, and modern primary caches achieve hit rates of around 90% for most applications. So 90% of the time, accessing memory only takes a couple of cycles!

Caches can achieve these seemingly amazing hit rates because of the way programs work. Most programs exhibit locality in both time and space – when a program accesses a piece of memory, there's a good chance it will need to re-access the same piece of memory in the near future (temporal locality), and there's also a good chance that it will need to access other nearby memory in the future as well (spatial locality). Temporal locality is exploited by merely keeping recently-accessed data in the cache. To take advantage of spatial locality, data is transferred from main memory up into the cache in blocks of a few dozen bytes at a time, called a cache block.
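A simple way to see spatial locality at work is the order in which a two-dimensional array is traversed; this small sketch, mine rather than the article's, sums exactly the same data both ways, yet the row-by-row walk uses every byte of each cache block it fetches while the column-by-column walk touches a different block on almost every access:

#include <stddef.h>

#define N 4096

/* Row-major traversal: consecutive iterations touch consecutive addresses,
   so each cache block fetched from memory is fully used before moving on. */
long sum_rows(const int m[N][N])
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

/* Column-major traversal: successive accesses are N * sizeof(int) bytes
   apart, so almost every access lands in a different cache block and the
   same blocks keep being evicted and refetched. */
long sum_cols(const int m[N][N])
{
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}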

From the hardware point of view, a cache works like a two column table – one column is the memory address and the other is the block of data values (remember that each cache line is a whole block of data, not just a single value). Of course, in reality the cache need only store the necessary higher-end part of the address, since lookups work by using the lower part of the address to index the cache. When the higher part, called the tag, matches the tag stored in the table, this is a hit and the appropriate piece of data can be sent to the CPU...

Figure 16 – A cache lookup.

It is possible to use either the physical address or the virtual address to do the cache lookup. Each has pros and cons (like everything else in computing). Using the virtual address might cause problems because different programs use the same virtual addresses to map to different physical addresses – the cache might need to be flushed on every context switch. On the other hand, using the physical address means the virtual-to-physical mapping must be performed as part of the cache lookup, making every lookup slower. A common trick is to use virtual addresses for the cache indexing but physical addresses for the tags. The virtual-to-physical mapping (TLB lookup) can then be performed in parallel with the cache indexing so that it will be ready in time for the tag comparison. Such a scheme is called a virtually-indexed physically-tagged cache.

The sizes and speeds of the various levels of cache in modern processors are absolutely crucial to performance. The most important by far is the primary L1 data cache. Some processors go for small data caches (Pentium-Pro/II/III, Pentium-4E and Bulldozer have 16k D-caches, earlier Pentium-4s and UltraSPARC-T1/T2/T3 are even smaller at just 8k), most have settled on 32k as the sweet spot, and a few are larger at 64k (Athlon, UltraSPARC-III/IV, Athlon 64/Phenom). For such caches, load latency is usually 3 cycles but occasionally shorter (2 cycles in UltraSPARC-III/IV, Pentium-4 & UltraSPARC-T1/T2/T3) or longer (4 cycles in Pentium-4E, Core i & i*2, Cortex-A9 & A15, Bulldozer). Increasing the load latency by a cycle can seem like a minor change but is actually a serious hit to performance, and is something rarely noticed or understood by end users. For normal, everyday pointer-chasing code, a processor's load latency is a major factor in real-world performance.

Most modern processors also have a large second or third level of on-chip cache, usually shared between all cores. This cache is also very important, but its size sweet spot depends heavily on the type of application being run and the size of that application's active working set. The difference between 2 MB of L3 cache and 8 MB will be barely measurable for some applications, while for others it will be enormous. Given that the relatively small L1 caches already take up to half of the chip area for many modern processor cores, you can imagine how much area a large L2 or L3 cache would take, yet this is still probably the best use for the high transistor budgets allowed by modern chip fabrication technology. Usually, the large L2/L3 cache is so large that it's clearly visible in chip photographs, standing out as a relatively clean, repetitive structure against the more "messy" logic transistors of the cores and memory controller.

Cache Conflicts & Associativity

Ideally, a cache should keep the data that is most likely to be needed in the future. Since caches aren't psychic, a good approximation of this is to keep the most recently used data.

Unfortunately, keeping exactly the most recently used data would mean that data from any memory location could be placed into any cache line. The cache would thus contain exactly the most recently used n KB of data, which would be great for exploiting locality but unfortunately is not suitable for allowing fast access – accessing the cache would require checking every cache line for a possible match, which would be very slow for a modern cache with thousands of lines.

Instead, a cache usually only allows data from any particular address in memory to occupy one, or at most a handful, of locations within the cache. Thus, only one or a handful of checks are required during access, so access can be kept fast (which is the whole point of having a cache in the first place). This approach does have a downside, however – it means the cache doesn't store the absolutely best set of recently accessed data, because several different locations in memory will all map to the same one location in the cache. When two such memory locations are wanted at the same time, such a scenario is called a cache conflict.

Cache conflicts can cause "pathological" worst-case performance problems, because when a program repeatedly accesses two memory locations which happen to map to the same cache line, the cache must keep storing and loading from main memory and thus suffering the long main memory latency on each access (up to 100 cycles or more, remember!). This type of situation is called thrashing, since the cache is not achieving anything and is simply getting in the way – despite obvious temporal locality and reuse of data, the cache is unable to exploit the locality offered by this particular access pattern due to limitations of its simplistic mapping between memory locations and cache lines.

To address this problem, more sophisticated caches are able to place data in a small number of different places within the cache, rather than just a single place. The number of places a piece of data can be stored in a cache is called its associativity. The word associativity comes from the fact that cache lookups work by association – that is, a particular address in memory is associated with a particular location in the cache (or set of locations for a set-associative cache).

As described above, the simplest and fastest caches allow for only one place in the cache for each address in memory – each piece of data is simply mapped to address % size within the cache by simply looking at the lower bits of the address (as in the above diagram). This is called a direct mapped cache. Any two locations in memory whose addresses are the same for the lower address bits will map to the same cache line in a direct mapped cache, causing a cache conflict.
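For illustration only, here is roughly how the index/tag split of Figure 16 can be modelled in C for a direct-mapped cache with 64-byte blocks; the sizes and structure names are assumptions, not a description of any particular processor:

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE  64                  /* bytes per cache block            */
#define NUM_LINES   512                 /* 512 lines * 64 B = 32 KB cache   */

typedef struct {
    bool     valid;
    uint64_t tag;                       /* upper address bits               */
    uint8_t  data[BLOCK_SIZE];          /* the cached block itself          */
} CacheLine;

static CacheLine cache[NUM_LINES];

/* A direct-mapped lookup: the low bits select the byte within the block,
   the middle bits select exactly one line, and the remaining high bits
   must match the stored tag for the access to be a hit. */
bool cache_lookup(uint64_t address, uint8_t *out_byte)
{
    uint64_t offset = address % BLOCK_SIZE;
    uint64_t index  = (address / BLOCK_SIZE) % NUM_LINES;
    uint64_t tag    = address / BLOCK_SIZE / NUM_LINES;

    CacheLine *line = &cache[index];
    if (line->valid && line->tag == tag) {
        *out_byte = line->data[offset]; /* hit: data comes from the cache   */
        return true;
    }
    return false;                       /* miss: must go to the next level  */
}

Two addresses that differ by an exact multiple of BLOCK_SIZE * NUM_LINES (32 KB here) compute the same index but different tags, which is precisely the conflict scenario described above; a set-associative cache simply keeps several such lines per index, as described next.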

A cache which allows data to occupy one of two locations based on its address is called 2-way set-associative. Similarly, a 4-way set-associative cache allows for 4 possible locations for any given piece of data. Set-associative caches work much like direct mapped ones, except there are several tables, all indexed in parallel, and the tags from each table are compared to see whether there is a match for any one of them...

Figure 17 – A 4-way set-associative cache.

Each table, or way, may also have marker bits so that only the line of the least recently used way is evicted when a new line is brought in (or perhaps some faster approximation of that ideal).

Usually, set-associative caches are able to avoid the problems that occasionally occur with direct mapped caches due to unfortunate cache conflicts. Adding even more ways allows even more conflicts to be avoided. Unfortunately, the more highly associative a cache is, the slower it is to access, because there are more comparisons to perform during each access. Even though the comparisons themselves are performed in parallel, additional logic is required to select the appropriate hit, if any, and the cache may also need to update the marker bits appropriately within each way. More chip area is also required, because relatively more of the cache's data is consumed by tag information rather than data blocks, and extra datapaths are needed to access each individual way of the cache in parallel. Any and all of these factors may negatively affect access time. Thus, a 2-way set-associative cache is slower but smarter than a direct mapped cache, with 4-way and 8-way being slower and smarter again.

In most modern processors the instruction cache is usually highly set-associative, since its latency can be hidden by fetching and buffering. The data cache, on the other hand, is usually set-associative to some degree but often not overly so to keep down latency (2-way in Athlon, PowerPC G5, Athlon 64/Phenom, Cortex-A15; 4-way in Pentium-Pro/II/III, UltraSPARC-III/IV, Pentium-4, UltraSPARC-T1 & T2, Cortex-A8 & A9, Bulldozer; 8-way in PowerPC G4e, Pentium-M, Core 2, Core i, Core i*2). As the last resort before heading off to far away main memory, the large on-chip L2/L3 cache is also usually highly set-associative, although external E-cache is sometimes direct mapped for flexibility of implementation.

The concept of caches also extends up into software systems. For example, main memory is used to cache the contents of the filesystem to speed up file I/O, and web caches (also known as proxy caches) cache the contents of remote web servers on a more local server. Main memory itself, with respect to the virtual memory system (paging/swapping), can be thought of as a smart, fully associative cache, like the ideal cache mentioned initially (above). After all, the virtual memory system is managed by the (hopefully) intelligent software of the operating system kernel.

Memory Bandwidth vs Latency

Since memory is transferred in blocks, and since cache misses are an urgent "show stopper" type of event with the potential to halt the processor in its tracks (or at least severely hamper its progress), the speed of those block transfers from memory is critical. The transfer rate of a memory system is called its bandwidth. But how is that different from latency?

A good analogy is a highway... Suppose you want to drive in to the city from 100 miles away. By doubling the number of lanes, the total number of cars that can travel per hour (the bandwidth) is doubled, but your own travel time (the latency) is not reduced. If all you want to do is increase cars-per-second, then adding more lanes (wider bus) is the answer, but if you want to reduce the time for a specific car to get from A to B then you need to do something else – usually either raise the speed limit (bus & DRAM speed), or reduce the distance, or perhaps build a regional mall so that people don't need to go to the city as often (a cache).

When it comes to memory systems, there are often subtle tradeoffs between latency and bandwidth. Lower latency designs will be better for pointer-chasing code, such as compilers and database systems, whereas bandwidth-oriented systems have the advantage for programs with simple linear access patterns, such as image processing and scientific code.

The two major memory technologies of recent times, standard SDRAM and Rambus RDRAM, differ slightly in this respect – for any given level of chip technology, SDRAM should have lower latency but RDRAM should have higher bandwidth. This is due to the "snake-like" physical structure of RDRAM memory systems, which reduce signal reflections by avoiding splitting the wires that normally go to each memory module in parallel, and instead go "through" each module in sequence – allowing RDRAM to run at higher clock speeds but with a longer average physical length to the memory modules.

Of course, it's reasonably easy to increase bandwidth – simply adding more memory banks and making the busses wider can easily double or quadruple bandwidth. In fact, many high-end systems do this to increase their performance, but it comes with downsides as well. In particular, wider busses mean a more expensive motherboard, restrictions on the way RAM can be added to a system (install in pairs or groups of 4) and a higher minimum RAM configuration.

Unfortunately, latency is much harder to improve than bandwidth – as the saying goes: "you can't bribe god". Even so, there have been some good improvements in effective memory latency in past years, chiefly in the form of synchronously-clocked DRAM (SDRAM) which uses the same clock as the memory bus. The main benefit of SDRAM is that it allows pipelining of the memory system, because the internal timing aspects and interleaved structure of SDRAM chip operation are exposed to the system and can thus be taken advantage of. This reduces effective latency because it allows a new memory access to be started before the current one has completed, thereby eliminating the small amounts of waiting time found in older asynchronous DRAM systems, which had to wait for the current access to complete before starting the next (on average, an asynchronous memory system had to wait for the transfer of half a cache block from the previous access before starting a new request, which is often several bus cycles).

In addition to the reduction in effective latency, there is also a substantial increase in bandwidth, because in an SDRAM memory system multiple memory requests can be outstanding at any one time, all being processed in a highly efficient, fully pipelined fashion. Pipelining of the memory system has dramatic effects for memory bandwidth – an SDRAM memory system generally provides double or triple the sustained memory bandwidth of an asynchronous memory system, even though the latency of the SDRAM system is only slightly lower.

Will further improvements in DRAM technology be able to continue to hold off the memory wall, while at the same time scaling up to the ever higher bandwidth demanded by more and more processor cores? Or will we soon end up constantly bottlenecked by memory, both bandwidth and latency, with neither the processor microarchitecture nor the number of cores making much difference, and the memory system being all that matters? It will be interesting to watch...

Acknowledgements

The overall style of this article, particularly with respect to the style of the processor "instruction flow" and microarchitecture diagrams, is derived from the combination of a well-known 1989 ASPLOS research paper by Norman Jouppi and David Wall, the book POWER & PowerPC by Shlomo Weiss and James Smith, and the two very famous Hennessy/Patterson textbooks Computer Architecture: A Quantitative Approach and Computer Organization and Design.

There have, of course, been many other presentations of this same material, and naturally they are all somewhat similar, however the above four are exceptionally good (in my opinion). To learn more about these topics, those books are an excellent place to start.

More Information?

If you want more detail on the specifics of recent processor designs – and something more insightful than the raw technical manuals – here are a few good articles...

And here are some articles not specifically related to any particular processor, but still very interesting...

And if you want to keep up with the latest news in the world of microprocessors...

That should keep you busy!

Blog: Playing with the CPU pipeline – Lol Engine


Comments:" Blog: Playing with the CPU pipeline – Lol Engine "

URL:http://lolengine.net/blog/2011/9/17/playing-with-the-cpu-pipeline


This article will show how basic knowledge of a modern CPU’s instruction pipeline can help micro-optimise code at very little cost, using a real world example: the approximation of a trigonometric function. All this without necessarily having to look at lines of assembly code.

The code used for this article is included in the attached file.

Evaluating polynomials

Who needs polynomials anyway? We’re writing games, not a computer algebra system, after all. But wait! Taylor series are an excellent mathematical tool for approximating certain classes of functions. For instance, this is the Taylor series of sin(x) near x = 0:

sin(x) = x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - x^11/11! + x^13/13! - x^15/15! + …

Truncating the series at the 15th power will compute sin(x) with an absolute error no greater than 1e-11 in the range [-π/2; π/2], and 2e-16 in the range [-π/4; π/4].

However, a better approximation known as the minimax polynomial (probably featured in an upcoming article) will give a maximum absolute error of about 2e-16 on the whole [-π/2; π/2] range:

static double a0 = +1.0;
static double a1 = -1.666666666666580809419428987894207e-1;
static double a2 = +8.333333333262716094425037738346873e-3;
static double a3 = -1.984126982005911439283646346964929e-4;
static double a4 = +2.755731607338689220657382272783309e-6;
static double a5 = -2.505185130214293595900283001271652e-8;
static double a6 = +1.604729591825977403374012010065495e-10;
static double a7 = -7.364589573262279913270651228486670e-13;

double sin1(double x)
{
    return a0 * x
         + a1 * x * x * x
         + a2 * x * x * x * x * x
         + a3 * x * x * x * x * x * x * x
         + a4 * x * x * x * x * x * x * x * x * x
         + a5 * x * x * x * x * x * x * x * x * x * x * x
         + a6 * x * x * x * x * x * x * x * x * x * x * x * x * x
         + a7 * x * x * x * x * x * x * x * x * x * x * x * x * x * x * x;
}

That is 64 multiplications and 7 additions, though compiler options such as GCC’s -ffast-math will help factor the expression in order to perform fewer operations.

It is possible to help the CPU by noticing that a term such as x^9 can be computed in only one operation if x^2 and x^7 are already known, leading to the following code:

double sin2(double x)
{
    double ret, y = x, x2 = x * x;
    ret = a0 * y; y *= x2;
    ret += a1 * y; y *= x2;
    ret += a2 * y; y *= x2;
    ret += a3 * y; y *= x2;
    ret += a4 * y; y *= x2;
    ret += a5 * y; y *= x2;
    ret += a6 * y; y *= x2;
    ret += a7 * y;
    return ret;
}

That is now only 16 multiplications and 7 additions. But it is possible to do even better using the Horner form of a polynomial evaluation:

sin(x) ≈ x * (a0 + x^2 * (a1 + x^2 * (a2 + x^2 * (a3 + x^2 * (a4 + x^2 * (a5 + x^2 * (a6 + x^2 * a7)))))))

Leading to the following code:

double sin3(double x)
{
    double x2 = x * x;
    return x * (a0 + x2 * (a1 + x2 * (a2 + x2 * (a3 + x2 * (a4 + x2 * (a5 + x2 * (a6 + x2 * a7)))))));
}

We are down to 9 multiplications and 7 additions. There is probably no way to be faster, is there? Let’s see…

Timings

Here are the timings in nanoseconds for the above code, compared with the glibc’s sin() function. The test CPU is an Intel® Core™ i7-2620M CPU at 2.70GHz. The functions were compiled using -O3 -ffast-math:

function              sin      sin1     sin2     sin3
nanoseconds per call  22.518   16.406   16.658   25.276

Wait, what? Our superbly elegant function, performing only 9 multiplications, is actually slower than the 64-multiplication version? Which itself is as fast as the 16-multiplication one? Surely we overlooked something.

That’s right. We ignored the CPU pipeline.

The instruction pipeline

In order to execute an instruction, such as “add A and B into C”, a CPU needs to do at least the following:

  • fetch the instruction (ie. read it from the program’s memory)
  • decode the instruction
  • read the instruction’s operands (ie. A and B)
  • execute the instruction
  • write the result in memory or in registers (in our case, C)

On a modern Intel® CPU, the execution step only accounts for 1/10 or even 1/16 of the total execution time. The idea behind pipelining is simple: while executing an instruction, the CPU can often already read the operands for the next one.

But there is a problem with this strategy: if the next instruction depends on the result of the current one, the CPU cannot read the next operands yet. This is called a read-after-write hazard, and usually causes a pipeline stall: the CPU just does nothing until it can carry on.

For the sake of simplicity, imagine the CPU’s pipeline depth is 3. At a given time, it can fetch, execute and finish one instruction:

●  instruction is being fetched, executed or finished
○  instruction could start, but needs to wait for the result of a previous instruction

This is how the CPU would execute A = (a + b) * (c + d):

time → total: 7
1 B = a + b
2 C = c + d
3 A = B * C

The c + d operation can be started very early because it does not depend on the result of a + b. This is called instruction-level parallelism. However, the final B * C operation needs to wait for all previous instructions to finish.

Since every operation in sin3() depends on the previous one, this is how it would execute that function:

time → total: 48
1 x2 = x * x
2 A = a7 * x2
3 A += a6
4 A *= x2
5 A += a5
6 A *= x2
7 A += a4
8 A *= x2
9 A += a3
10 A *= x2
11 A += a2
12 A *= x2
13 A += a1
14 A *= x2
15 A += a0
16 A *= x

These 9 multiplications and 7 additions are done in 48 units of time. No instruction-level parallelism is possible because each instruction needs to wait for the previous one to finish.

The secret behind sin2()’s performance is that the large number of independent operations allows the compiler to reorganise the computation so that the instructions can be scheduled in a much more efficient way. This is roughly how GCC compiled it:

time → total: 30
1 x2 = x * x
2 A = a7 * x2
3 x3 = x2 * x
4 A += a6
5 B = a1 * x3
6 x5 = x3 * x2
7 A *= x2
8 C = a2 * x5
9 B += x
10 x7 = x5 * x2
11 A += a5
12 D = a3 * x7
13 B += C
14 x9 = x7 * x2
15 B += D
16 E = a4 * x9
17 x11 = x9 * x2
18 B += E
19 A *= x11
20 A += B

These 13 multiplications and 7 additions are executed in 30 units of time instead of 48 for the previous version. The compiler has been rather clever here: the number of ○’s is kept small.

Note that 30 / 48 = 0.625, and the ratio between sin2 and sin3’s timings is 16.658 / 25.276 = 0.659. Reality matches theory pretty well!

Going further

We have seen that increasing the number of operations in order to break dependencies between CPU instructions allowed us to help the compiler perform better optimisations that take advantage of the CPU pipeline. But that was at the cost of 40% more multiplications. Maybe there is a way to improve the scheduling without adding so many instructions?

Luckily there are other ways to evaluate a polynomial.

Even-Odd form and similar schemes

Consider our 8th order polynomial:

sin(x) ≈ x * (a0 + a1*x^2 + a2*x^4 + a3*x^6 + a4*x^8 + a5*x^10 + a6*x^12 + a7*x^14)

Separating the odd and even coefficients, it can be rewritten as:

sin(x) ≈ x * [ (a0 + a2*x^4 + a4*x^8 + a6*x^12) + x^2 * (a1 + a3*x^4 + a5*x^8 + a7*x^12) ]

Which, using Horner’s form, yields:

sin(x) ≈ x * [ (a0 + x^4 * (a2 + x^4 * (a4 + x^4 * a6))) + x^2 * (a1 + x^4 * (a3 + x^4 * (a5 + x^4 * a7))) ]

This polynomial evaluation scheme is called the Even-Odd scheme. It only has 10 multiplications and 7 additions (only one multiplication more than the optimal case). It results in the following C code:

double sin4(double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    double A = a0 + x4 * (a2 + x4 * (a4 + x4 * a6));
    double B = a1 + x4 * (a3 + x4 * (a5 + x4 * a7));
    return x * (A + x2 * B);
}

And this is the expected scheduling:

time → total: 33
1 x2 = x * x
2 x4 = x2 * x2
3 B = a7 * x4
4 A = a6 * x4
5 B += a5
6 A += a4
7 B *= x4
8 A *= x4
9 B += a3
10 A += a2
11 B *= x4
12 A *= x4
13 B += a1
14 A += a0
15 B *= x2
16 A += B
17 A *= x

Still not good enough, but we’re certainly onto something here. Let’s try another decomposition for the polynomial:

sin(x) ≈ x * [ (a0 + a3*x^6 + a6*x^12) + x^2 * (a1 + a4*x^6 + a7*x^12) + x^4 * (a2 + a5*x^6) ]

And using Horner’s form again:

sin(x) ≈ x * [ (a0 + x^6 * (a3 + x^6 * a6)) + x^2 * (a1 + x^6 * (a4 + x^6 * a7)) + x^4 * (a2 + x^6 * a5) ]

Resulting in the following code:

double sin5(double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    double x6 = x4 * x2;
    double A = a0 + x6 * (a3 + x6 * a6);
    double B = a1 + x6 * (a4 + x6 * a7);
    double C = a2 + x6 * a5;
    return x * (A + x2 * B + x4 * C);
}

And the following scheduling:

time → total: 31
1 x2 = x * x
2 x4 = x2 * x2
3 x6 = x4 * x2
4 B = x6 * a7
5 A = x6 * a6
6 C = x6 * a5
7 B += a4
8 A += a3
9 C += a2
10 B *= x6
11 A *= x6
12 C *= x4
13 B += a1
14 A += a0
15 B *= x2
16 A += C
17 A += B
18 A *= x

One more instruction and two units of time better. That’s slightly better, but still not as good as we would like. One problem is that a lot of time is lost waiting for the value x6 to be ready. We need to find computations to do in the meantime to avoid pipeline stalls.

High-Low form

Instead of splitting the polynomial into its even and odd coefficients, we split it into its high and low coefficients:

sin(x) ≈ x * [ (a0 + a1*x^2 + a2*x^4 + a3*x^6) + x^8 * (a4 + a5*x^2 + a6*x^4 + a7*x^6) ]

And again using Horner’s form:

sin(x) ≈ x * [ (a0 + x^2 * (a1 + x^2 * (a2 + x^2 * a3))) + x^8 * (a4 + x^2 * (a5 + x^2 * (a6 + x^2 * a7))) ]

The corresponding code is now:

double sin6(double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    double x8 = x4 * x4;
    double A = a0 + x2 * (a1 + x2 * (a2 + x2 * a3));
    double B = a4 + x2 * (a5 + x2 * (a6 + x2 * a7));
    return x * (A + x8 * B);
}

And the expected scheduling:

time → total: 30
1 x2 = x * x
2 B = x2 * a7
3 A = x2 * a3
4 x4 = x2 * x2
5 B += a6
6 A += a2
7 x8 = x4 * x4
8 B *= x2
9 A *= x2
10 B += a5
11 A += a1
12 B *= x2
13 A *= x2
14 B += a4
15 A += a0
16 B *= x8
17 A += B
18 A *= x

Finally! We now schedule as well as GCC, and with 11 multiplications instead of 13. Still no real performance gain, though.

Pushing the limits

Can we do better? Probably. Remember that each ○ in the above table is a pipeline stall, and any instruction we would insert there would be basically free.

Note the last instruction, A *= x. It causes a stall because it needs to wait for the final value of A, but it would not be necessary if A and B had been multiplied by x beforehand.

Here is a way to do it (instructions marked with * are new or modified):

time → total: 27
 1  x2 = x * x
 2  B = x2 * a7
 3  A = x2 * a3
 4  x4 = x2 * x2
 5  B += a6
 6  A += a2
 7  x8 = x4 * x4
 8  B *= x2
 9  A *= x2
10  x3 = x2 * x    *
11  B += a5
12  A += a1
13  C = a0 * x     *
14  B *= x2
15  A *= x3        *
16  x9 = x8 * x    *
17  B += a4
18  A += C         *
19  B *= x9        *
20  A += B

Excellent! Just as many instructions as GCC, but now with fewer pipeline stalls. I don’t know whether this scheduling is optimal for the (incorrect) assumption of a 3-stage pipeline, but it does look pretty good. Also, loading a0, a1 etc. from memory hasn't been covered for the sake of simplicity.

Anyway, we just need to write the code corresponding to this behaviour, and hope the compiler understands what we need:

double sin7(double x)
{
    double x2 = x * x;
    double x3 = x2 * x;
    double x4 = x2 * x2;
    double x8 = x4 * x4;
    double x9 = x8 * x;
    double A = x3 * (a1 + x2 * (a2 + x2 * a3));
    double B = a4 + x2 * (a5 + x2 * (a6 + x2 * a7));
    double C = a0 * x;
    return A + C + x9 * B;
}

Conclusion

It’s time to check the results! Here they are, for all the functions covered in this article:

function              sin      sin1     sin2     sin3     sin4     sin5     sin6     sin7
nanoseconds per call  22.518   16.406   16.658   25.276   18.666   18.582   16.366   17.470

Damn. All these efforts to understand and refactor a function, and our best effort actually performs amongst the worst!

What did we miss? Actually, this time, nothing. The problem is that GCC didn't understand what we were trying to say in sin7() and proceeded with its own optimisation ideas. Compiling with -O3 instead of -O3 -ffast-math gives a totally different set of timings:

function              sin      sin1     sin2     sin3     sin4     sin5     sin6     sin7
nanoseconds per call  22.497   30.250   19.865   25.279   18.587   18.958   16.362   15.891

There. We win eventually!

There is a way to still use -ffast-math yet prevent GCC from trying to be too clever. This might be preferable because we do not want to lose the benefits of -ffast-math in other places. By using an architecture-specific assembly construct, we can mark temporary variables as used, effectively telling GCC that the variable needs to be really computed and not optimised away:

double sin7(double x)
{
    double x2 = x * x;
    double x3 = x2 * x;
    double x4 = x2 * x2;
    double x8 = x4 * x4;
    double x9 = x8 * x;
#if defined __x86_64__
    __asm__("" : "+x"(x3), "+x"(x9));
#elif defined __powerpc__ || defined __powerpc64__
    __asm__("" : "+f"(x3), "+f"(x9));
#else
    __asm__("" : "+m"(x3), "+m"(x9)); /* Out of luck :-( */
#endif
    double A = x3 * (a1 + x2 * (a2 + x2 * a3));
    double B = a4 + x2 * (a5 + x2 * (a6 + x2 * a7));
    double C = a0 * x;
    return A + C + x9 * B;
}

This works on the x86_64 architecture, where "+x" indicates the SSE registers commonly used for floating point calculations, and on the PowerPC, where "+f" can be used. This approach is not portable and it is not clear what should be used on other platforms. Using "+m" is generic but often means a useless store into memory; however, on x86 it is still a noticeable gain.

And our final results, this time with the full -O3 -ffast-math optimisation flags:

function              sin      sin1     sin2     sin3     sin4     sin5     sin6     sin7
nanoseconds per call  22.522   16.411   16.663   25.277   18.628   18.588   16.365   15.617

The code used for this article is included in the attached file.
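For readers without the attachment at hand, per-call figures of this kind can be approximated with a very simple harness along the following lines; this is an assumed reconstruction for illustration, not the author's actual benchmark code:

#include <stdio.h>
#include <time.h>

double sin3(double x);   /* or any of the sin1..sin7 variants above */

int main(void)
{
    const long iters = 100 * 1000 * 1000;
    volatile double sink = 0.0;      /* prevents the calls from being optimised away */
    double x = 0.1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        sink += sin3(x);
        x += 1e-9;                   /* vary the argument slightly on every call */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.3f ns per call (sink = %g)\n", ns / iters, (double)sink);
    return 0;
}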


open mic: Independently Poor: A Twist on FU Money. Or: "FU, Money"

Google Doodle celebrates Canada’s coldest day, -63 C


Comments:"Google Doodle celebrates Canada’s coldest day, -63 C"

URL:http://www.ottawacitizen.com/story_print.html?id=9462148


 

Google’s “doodle” Monday celebrates a part of Canadian heritage we might rather not think about: Our coldest day, set this date in 1947 when the temperature in Snag, Yukon, hit -63 C.

It sounds even worse on the Fahrenheit scale in use in 1947: -81.4 F.

Environment Canada’s senior climatologist, David Phillips, tells the Snag tale often, and he has always been fascinated by the changes that cold air makes, beyond the obvious human suffering.

There was a different sound, for instance. People at the airport could clearly hear dogs barking in town and townspeople talking as if they were close by instead of five kilometres away. The sound travelled better in cold, dense air, and a temperature inversion caused sound waves to bend back toward the ground rather than escaping upwards.

The air looked different, too. Phillips notes in his book Blame It On the Weather that “ground visibility was greatly reduced. At about arm’s length, an eerie, dull grey shroud of patchy ice fog hung above the dogs and heated buildings.”

Weather officer Gordon Toole recorded that his breath froze, with a hissing sound, and fell to the ground as white dust.

“Ice in the White River about a mile east of the airport, cracked and boomed loudly, like gunfire,” he wrote.

“Snag snug as mercury sags to a record -82.6,” said a newspaper headline. (That figure was later revised slightly upward.) In fact thermometers don’t use mercury. They use alcohol, and in Snag the alcohol had dropped all the way past the vertical stem of the thermometer and was concentrated in the bulb at the bottom.

Ottawa is expected to reach a relatively toasty -5 C Monday.

tspears@ottawacitizen.com

twitter.com/TomSpears1

© Copyright (c) The Ottawa Citizen

 

What Happens When You Drop A Magnet Inside A Copper Tube - Digg

So You Want To Write Your Own Language? | Dr Dobb's


Comments:"So You Want To Write Your Own Language? | Dr Dobb's"

URL:http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488


My career has been all about designing programming languages and writing compilers for them. This has been a great joy and source of satisfaction to me, and perhaps I can offer some observations about what you're in for if you decide to design and implement a professional programming language. This is actually a book-length topic, so I'll just hit on a few highlights here and avoid topics well covered elsewhere.

Work

First off, you're in for a lot of work…years of work…most of which will be wandering in the desert. The odds of success are heavily stacked against you. If you are not strongly self-motivated to do this, it isn't going to happen. If you need validation and encouragement from others, it isn't going to happen.

Fortunately, embarking on such a project is not major dollar investment; it won't break you if you fail. Even if you do fail, depending on how far the project got, it can look pretty good on your résumé and be good for your career.

Design

One thing abundantly clear is that syntax matters. It matters an awful lot. It's like the styling on a car — if the styling is not appealing, it simply doesn't matter how hot the performance is. The syntax needs to be something your target audience will like.

Trying to go with something they've not seen before will make language adoption a much tougher sell.

I like to go with a mix of familiar syntax and aesthetic beauty. It's got to look good on the screen. After all, you're going to spend plenty of time looking at it. If it looks awkward, clumsy, or ugly, it will taint the language.

There are a few things I (perhaps surprisingly) suggest should not be considerations. These are false gods:

  • Minimizing keystrokes. Maybe this mattered when programmers used paper tape, and it matters for small languages like bash or awk. For larger applications, much more programming time is spent reading than writing, so reducing keystrokes shouldn't be a goal in itself. Of course, I'm not suggesting that large amounts of boilerplate is a good idea.
  • Easy parsing. It isn't hard to write parsers with arbitrary lookahead. The looks of the language shouldn't be compromised to save a few lines of code in the parser. Remember, you'll spend a lot of time staring at the code. That comes first. As mentioned below, it still should be a context-free grammar.
  • Minimizing the number of keywords. This metric is just silly, but I see it cropping up repeatedly. There are a million words in the English language, I don't think there is any looming shortage. Just use your good judgment.

Things that are true gods:

Context-free grammars. What this really means is that the code should be parsable without having to look things up in a symbol table. C++ is famously not a context-free grammar. A context-free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting without integrating most of a compiler front end. As a result, third-party tools become much more likely to exist.

Redundancy. Yes, the grammar should be redundant. You've all heard people say that statement-terminating semicolons are not necessary because the compiler can figure it out. That's true, but such non-redundancy makes for incomprehensible error messages. Consider a syntax with no redundancy: any random sequence of characters would then be a valid program, and no error messages are even possible. A good syntax needs redundancy in order to diagnose errors and give good error messages. (Both this point and the context-free one are illustrated in the sketch after this list.)

Tried and true. Absent a very strong reason, it's best to stick with tried and true grammatical forms for familiar constructs. It really cuts the learning curve for the language and will increase adoption rates. Think of how people will hate the language if it swaps the operator precedence of + and *. Save the divergence for features not generally seen before, which also signals the user that this is new.
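To make the context-free-grammar and redundancy points concrete, here is a minimal sketch of a recursive-descent parser for a toy expression-statement grammar (illustrative only, written in Python rather than taken from the article; the grammar, token names, and error wording are all assumptions). It never consults a symbol table, one token of lookahead is enough, and the required terminating semicolon is exactly what lets it say something useful when the programmer leaves it out.

    # Toy grammar, context-free and parsable with one token of lookahead:
    #   stmt -> expr ';'
    #   expr -> term (('+' | '-') term)*
    #   term -> NUMBER | '(' expr ')'

    def tokenize(src):
        # Hand-rolled scan: digit runs become NUMBER tokens, whitespace is
        # skipped, and anything else is a single-character SYM token.
        tokens, i = [], 0
        while i < len(src):
            ch = src[i]
            if ch.isspace():
                i += 1
            elif ch.isdigit():
                j = i
                while j < len(src) and src[j].isdigit():
                    j += 1
                tokens.append(("NUMBER", src[i:j]))
                i = j
            else:
                tokens.append(("SYM", ch))
                i += 1
        tokens.append(("EOF", ""))
        return tokens

    class Parser:
        def __init__(self, tokens):
            self.tokens = tokens
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos]

        def expect(self, kind, text=None):
            k, t = self.peek()
            if k != kind or (text is not None and t != text):
                # Redundancy at work: because the grammar insists on ';',
                # '(' and ')', we can say exactly what is missing and where.
                raise SyntaxError(
                    f"expected {text or kind!r}, found {t or 'end of input'!r} at token {self.pos}")
            self.pos += 1
            return t

        def stmt(self):
            value = self.expr()
            self.expect("SYM", ";")
            return value

        def expr(self):
            value = self.term()
            while self.peek() in (("SYM", "+"), ("SYM", "-")):
                op = self.expect("SYM")
                rhs = self.term()
                value = value + rhs if op == "+" else value - rhs
            return value

        def term(self):
            kind, text = self.peek()
            if kind == "NUMBER":
                self.pos += 1
                return int(text)
            self.expect("SYM", "(")
            value = self.expr()
            self.expect("SYM", ")")
            return value

    print(Parser(tokenize("1 + (2 + 3);")).stmt())   # prints 6

Running it prints 6; deleting the semicolon from the input instead produces a pointed "expected ';'" error, which is the redundancy argument in miniature.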

As always, these principles should not be taken as dicta. Use good judgment. Any language design principle blindly followed leads to disaster. The principles are rarely orthogonal and frequently conflict. It's a lot like designing a house — making the master closet bigger means the master bedroom gets smaller. It's all about finding the right balance.

Getting past the syntax, the meat of the language will be the semantic processing, which is where meaning is assigned to the syntactical constructs. This is where you'll be spending the vast bulk of design and implementation. It's much like the organs in your body — they are unseen and we don't think about them unless they are going wrong. There won't be a lot of glory in the semantic work, but in it will be the whole point of the language.

Once through the semantic phase, the compiler does optimizations and then code generation — collectively called the "back end." These two passes are very challenging and complicated. Personally, I love working with this stuff, and grumble that I've got to spend time on other issues. But unless you really like it, and it takes a fairly unhinged programmer to delight in the arcana of such things, I recommend taking the common sense approach and using an existing back end, such as the JVM, CLR, gcc, or LLVM. (Of course, I can always set you up with the glorious Digital Mars back end!)
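As a rough illustration of what "use an existing back end" can look like in practice, the sketch below (mine, not the article's; the AST shape and helper names are assumptions) lowers a tiny expression tree to textual LLVM IR and leaves optimization and machine-code generation to the LLVM tools. A real front end would of course also handle types, control flow, and symbol resolution.

    # Lower a tiny expression AST to textual LLVM IR; an existing back end
    # (e.g. clang/llc) then does the optimization and code generation.
    # AST nodes: ("num", value) or (op, left, right) with op in {"add", "mul"}.

    def lower(node, lines, counter):
        # Emit IR for `node` and return the name of the value produced.
        if node[0] == "num":
            return str(node[1])          # constants are used in place
        lhs = lower(node[1], lines, counter)
        rhs = lower(node[2], lines, counter)
        name = f"%t{counter[0]}"
        counter[0] += 1
        lines.append(f"  {name} = {node[0]} i32 {lhs}, {rhs}")
        return name

    def compile_expr(ast):
        lines, counter = [], [0]
        result = lower(ast, lines, counter)
        body = "\n".join(lines)
        return f"define i32 @main() {{\nentry:\n{body}\n  ret i32 {result}\n}}\n"

    # (2 + 3) * 4
    print(compile_expr(("mul", ("add", ("num", 2), ("num", 3)), ("num", 4))))

The output is a small but valid LLVM module; saving it as a .ll file and feeding it to clang (something like clang expr.ll -o expr, assuming a standard LLVM toolchain) should produce an executable that exits with status 20.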

Implementation

How best to implement it? I hope I can at least set you off in the right direction. The first tool that beginning compiler writers often reach for is regex, but regex is just the wrong tool for lexing and parsing; Rob Pike explains why reasonably well. I'll close with the famous quote from Jamie Zawinski:

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

Somewhat more controversially, I wouldn't bother wasting time with lexer or parser generators and other so-called "compiler compilers." They're a waste of time. Writing a lexer and parser is a tiny percentage of the job of writing a compiler. Using a generator will take up about as much time as writing one by hand, and it will marry you to the generator (which matters when porting the compiler to a new platform). Generators also have the unfortunate reputation of emitting lousy error messages.
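For a sense of how small the hand-written alternative actually is, here is an illustrative lexer sketch (again Python, again mine; the token categories and keyword set are assumptions). It makes one pass over the source with no regex and no generator, and because it tracks its own position it can report precisely where things went wrong, which is most of the battle for good error messages.

    # A minimal hand-written lexer: no regex, no generator, one pass over
    # the source, with position information kept for error reporting.

    KEYWORDS = {"if", "else", "while", "return"}
    SYMBOLS = set("+-*/(){};=<>")

    def lex(src):
        tokens, i, line = [], 0, 1
        while i < len(src):
            ch = src[i]
            if ch == "\n":
                line += 1
                i += 1
            elif ch.isspace():
                i += 1
            elif ch.isalpha() or ch == "_":
                j = i
                while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                    j += 1
                word = src[i:j]
                tokens.append(("KEYWORD" if word in KEYWORDS else "IDENT", word, line))
                i = j
            elif ch.isdigit():
                j = i
                while j < len(src) and src[j].isdigit():
                    j += 1
                tokens.append(("INT", src[i:j], line))
                i = j
            elif ch in SYMBOLS:
                tokens.append(("SYM", ch, line))
                i += 1
            else:
                raise SyntaxError(f"line {line}: unexpected character {ch!r}")
        tokens.append(("EOF", "", line))
        return tokens

    for tok in lex("while (n < 10) { n = n + 1; }"):
        print(tok)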

Facebook, Google, Microsoft, Others Reveal New Data on NSA Data Requests


Comments:"Facebook, Google, Microsoft, Others Reveal New Data on NSA Data Requests"

URL:http://thenextweb.com/insider/2014/02/03/facebook-linkedin-google-microsoft-reveal-data-showing-range-accounts-requested-nsa/?fromcat=all


Facebook, LinkedIn, Yahoo, Google, and Microsoft have all released new data about the national security requests they’ve received. Under new rules from the US government, each company is now allowed to disclose how many requests for member data it has received, the number of accounts affected, and the percentage of requests it responds to.

Facebook says that in the last six months of 2012, only a “small fraction” of one percent of its users were the target of any government data request, national security-related or otherwise. In the first half of 2013, the company again said the figure was a small fraction of one percent.

In LinkedIn’s case, it has updated its transparency report to indicate that for the first six months of 2013, the professional social network company received “between 0 and 249” national security-related requests.

Microsoft says that during the same time period, it received “fewer than 1,000” FISA orders that sought the disclosure of customer content, relating to between 15,000 and 15,999 accounts. It stresses, though, that this doesn’t necessarily mean more than 15,000 accounts were covered by the government requests. Additionally, the company received fewer than 1,000 FISA orders for non-content data only, requesting information relating to fewer than 1,000 accounts. Lastly, Microsoft states it has received fewer than 1,000 National Security Letters covering fewer than 1,000 accounts.

Yahoo has also updated the global transparency report it launched back in 2013, showing that the number of accounts requested by governments amounted to less than “one one-hundredth of one percent” of its worldwide user base for the reporting period.

Not to be outdone, Google has also released its own data, showing that it received fewer than 1,000 national security requests for user content from governments between January 2009 and June 2013. It has published a complete table showing a breakdown of requests and the number of users and accounts affected.

Brad Smith, Microsoft’s general counsel and executive vice president for legal and corporate affairs, says that the US government has agreed to allow companies to share this information, but only in “bands of a thousand”. What’s more, while the aggregate FISA data covers a six-month period, it can only be published six months after the end of that period.

Last year, after former NSA contractor Edward Snowden leaked details about the agency’s “Prism” surveillance program, tech companies immediately went on the defensive, denying accusations that they had given the government access to their servers. Some went to court seeking permission to release data about the requests in the name of transparency, but were denied.

However, last week, President Obama’s administration decided to relax some of those rules as it seeks to reform the way it conducts surveillance around the world. Because of this, lawsuits from Google, Microsoft, Yahoo, and Facebook have been dropped, but the agreement comes with a stipulation: the companies are prohibited from revealing certain information about government requests for two years.

US Attorney General Eric Holder and Director of National Intelligence James Clapper said at the time: “Permitting disclosure of this aggregate data addresses an important area of concern to communications providers and the public.” However, not everyone shares the sentiment; the New York Times reports that privacy advocates fear the new rule will prevent the public from knowing whether the government is spying on a particular email platform or chat service.

Since revelations about Prism were made public, tech companies like Google and Microsoft have added new features and protocols to better protect user data from the NSA.

All of the companies have said they will update their transparency reports every six months so the public is aware of government activity involving their servers, but that they will also comply with the government’s rules that restrict when specific data can be revealed.

Photo credit: NSA via Getty Images
