Comments:"Predicting Google closures"
URL:http://www.gwern.net/Google%20shutdowns
A first step in predicting when a product will be shut down is predicting whether it will be shut down at all. Since we're predicting a binary outcome (a product living or dying), we can use an ordinary logistic regression. Our first look uses the main variables plus the total hits:
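A minimal sketch of the kind of `glm` call that could produce the summary below; the data-frame and column names (`google`, `Dead`, etc.) are my guesses from the output, not the post's actual code:

```{.R}
# hypothetical data frame 'google', one row per product;
# 'Dead' is TRUE if the product was shut down
model <- glm(Dead ~ Type + Profit + FLOSS + Acquisition + Social + log(Hits),
             data=google, family=binomial)
summary(model)
```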
```{.R}
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       2.3968     1.0680    2.24    0.025
Typeprogram       0.9248     0.8181    1.13    0.258
Typeservice       1.2261     0.7894    1.55    0.120
Typething         0.8805     1.1617    0.76    0.448
ProfitTRUE       -0.3857     0.2952   -1.31    0.191
FLOSSTRUE        -0.1777     0.3791   -0.47    0.639
AcquisitionTRUE   0.4955     0.3434    1.44    0.149
SocialTRUE        0.7866     0.3888    2.02    0.043
log(Hits)        -0.3089     0.0567   -5.45  5.1e-08
```
In log odds, >0 increases the chance of an event (shutdown) and <0 decreases it. So looking at the coefficients, we can venture some interpretations:
1. Google has a past history of screwing up social products and then killing them

   This is interesting for confirming the general belief that Google has handled its social properties badly in the past, but I'm not sure how useful this is for predicting the future: since Larry Page became obsessed with social in 2009, we might expect anything to do with social to now either be merged into Google+ or otherwise be kept on life support far longer than it would have been before.

2. Google is deprecating software products in favor of web services

   This should have been obvious to anyone watching: a lot of Google's effort on Firefox and then Chromium went toward improving web browsers as a platform for delivering applications. As efforts like HTML5 mature, there is less incentive for Google to release and support standalone software.

3. But apparently not its FLOSS software

   This seems due to a number of its software releases being picked up by third parties (Wave, Etherpad, Refine), designed to be integrated into existing communities (Summer of Code projects), or apparently serving a strategic role (Android, Chromium, Dart, Go, Closure Tools, the VP codecs) which we could summarize as building up a browser replacement for operating systems. (Why? Commoditize your complements.)

4. Things which charge or show advertising are more likely to survive

   Also obvious, but it's good to have confirmation (if nothing else, it partially validates the data).

5. Popularity as measured by Google hits seems to matter

   Likewise obvious… or is it?
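To put the log-odds estimates on a more intuitive scale, we can exponentiate them into odds ratios (a sketch, reusing the hypothetical `model` object from the glm sketch above):

```{.R}
exp(coef(model))               # odds ratios: >1 raises shutdown odds, <1 lowers them
exp(coef(model)["log(Hits)"])  # exp(-0.3089) ~= 0.73: each e-fold increase in hits
                               # cuts the odds of shutdown by roughly 27%
```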
The logistic regression helped winnow down the variables, but it is limited to the binary outcome of shutdown or not. For looking at survival over time, survival analysis might be a useful elaboration of logistic-style approaches; here I draw on Fox & Weisberg's appendix and Hosmer & Lemeshow's *Applied Survival Analysis*. The initial characterization gives us an optimistic median of 2824 days (note that this is much higher than Arthur's mean of 1459 days, both because it includes products which were never canceled and because I made a much stronger effort to collect older pre-2009 products), but the lower bound is not tight, and too little of the sample has died to get an upper bound:
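A sketch of the survival-package calls that would yield the summary below, assuming hypothetical `Days` (lifetime so far) and `Dead` (shutdown status) columns:

```{.R}
library(survival)
# right-censored survival object: products still alive count as censored observations
surv <- survfit(Surv(Days, Dead) ~ 1, data=google)
surv  # prints record/event counts, the median lifetime, and its confidence limits
```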
```{.R}
records   n.max n.start  events  median 0.95LCL 0.95UCL
    350     350     350     123    2824    2095      NA
```
Our overall Kaplan-Meier survivorship curve looks a bit interesting:

[Figure: Shutdown cumulative probability as a function of time]

If there were constant mortality of products at each day after their launch, we would expect a type II curve, which looks like a straight line, but in fact there seems to be a leveling-off of deaths, suggesting a type III curve; per Wikipedia:
> …the greatest mortality is experienced early on in life, with relatively low rates of death for those surviving this bottleneck. This type of curve is characteristic of species that produce a large number of offspring (see r/K selection theory).

Very nifty: the survivorship curve is consistent with tech-industry or startup philosophies of doing lots of things, iterating fast, and throwing things at the wall to see what sticks.
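For reference, the curve above can be redrawn from the fitted object; a log-scaled y-axis is a quick check for a type II curve, since constant mortality plots as a straight line there (a sketch reusing the hypothetical `surv` object):

```{.R}
plot(surv, xlab="Days since launch", ylab="Fraction of products surviving")
plot(surv, log=TRUE, xlab="Days since launch",
     ylab="Fraction surviving (log scale)")  # type II would be a straight line here
```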
However, it looks like the mortality only starts decreasing around 2000 days, so any product that far out must have been founded around or before 2005, which is when we previously noted that Google started pumping out a lot of products and may also have changed its shutdown-related behaviors; this could violate a basic assumption of Kaplan-Meier, that the underlying survival function isn’t itself changing over time.
Our next step is to fit a Cox proportional hazards model to our covariates:
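The elided call is presumably something along these lines (my reconstruction under the same column-name assumptions, not the post's actual code; note the switch from raw hits to the inflation-adjusted DeflatedHits):

```{.R}
cmodel <- coxph(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social +
                                   Type + log(DeflatedHits), data=google)
summary(cmodel)
```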
```{.R}
...
n=350, number of events=123

                     coef exp(coef) se(coef)     z Pr(>|z|)
AcquisitionTRUE     0.130     1.139    0.257  0.51    0.613
FLOSSTRUE           0.141     1.151    0.293  0.48    0.630
ProfitTRUE         -0.180     0.836    0.231 -0.78    0.438
SocialTRUE          0.664     1.943    0.262  2.53    0.011
Typeprogram         0.957     2.603    0.747  1.28    0.200
Typeservice         1.291     3.638    0.725  1.78    0.075
Typething           1.682     5.378    1.023  1.64    0.100
log(DeflatedHits)  -0.288     0.749    0.036 -8.01  1.2e-15

                  exp(coef) exp(-coef) lower .95 upper .95
AcquisitionTRUE       1.139      0.878     0.688     1.884
FLOSSTRUE             1.151      0.868     0.648     2.045
ProfitTRUE            0.836      1.197     0.531     1.315
SocialTRUE            1.943      0.515     1.163     3.247
Typeprogram           2.603      0.384     0.602    11.247
Typeservice           3.637      0.275     0.878    15.064
Typething             5.377      0.186     0.724    39.955
log(DeflatedHits)     0.749      1.334     0.698     0.804

Concordance= 0.726  (se = 0.028)
Rsquare= 0.227  (max possible= 0.974)
Likelihood ratio test= 90.1  on 8 df,   p=4.44e-16
Wald test            = 79.5  on 8 df,   p=6.22e-14
Score (logrank) test = 83.5  on 8 df,   p=9.77e-15
```
And then we can also test whether any of the covariates violate the proportional-hazards assumption; in general they seem to be fine:
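(This is the standard check on the scaled Schoenfeld residuals; a one-line sketch assuming the hypothetical `cmodel` object above:)

```{.R}
cox.zph(cmodel)  # per-covariate and global tests; small p-values would flag violations
```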
```{.R}
                      rho  chisq     p
AcquisitionTRUE   -0.0252 0.0805 0.777
FLOSSTRUE          0.0168 0.0370 0.848
ProfitTRUE        -0.0694 0.6290 0.428
SocialTRUE         0.0279 0.0882 0.767
Typeprogram        0.0857 0.9429 0.332
Typeservice        0.0936 1.1433 0.285
Typething          0.0613 0.4697 0.493
log(DeflatedHits) -0.0450 0.2610 0.609
GLOBAL                 NA 2.5358 0.960
```
My suspicion lingers, though, so I threw in another covariate (EarlyGoogle): whether a product was released before or after 2005. Does this add predictive value over and above simply knowing that a product is really old, and does the regression still pass the proportional-hazards assumption check? Apparently yes to both:
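The new covariate might be constructed and the model refitted like so (a sketch; `Started`, a launch-date column, is my assumption):

```{.R}
google$EarlyGoogle <- google$Started < as.Date("2005-01-01")  # launched before 2005?
cmodel2 <- coxph(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type +
                                    log(DeflatedHits) + EarlyGoogle, data=google)
summary(cmodel2)
cox.zph(cmodel2)  # re-run the proportionality check with the new covariate
```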
```{.R}
                     coef exp(coef) se(coef)     z Pr(>|z|)
AcquisitionTRUE    0.1674    1.1823   0.2553  0.66    0.512
FLOSSTRUE          0.1034    1.1090   0.2922  0.35    0.723
ProfitTRUE        -0.1949    0.8230   0.2318 -0.84    0.401
SocialTRUE         0.6541    1.9233   0.2601  2.51    0.012
Typeprogram        0.8195    2.2694   0.7472  1.10    0.273
Typeservice        1.1619    3.1960   0.7262  1.60    0.110
Typething          1.6200    5.0529   1.0234  1.58    0.113
log(DeflatedHits) -0.2645    0.7676   0.0375 -7.06  1.7e-12
EarlyGoogleTRUE   -1.0061    0.3656   0.5279 -1.91    0.057
...
Concordance= 0.728  (se = 0.028)
Rsquare= 0.237  (max possible= 0.974)
Likelihood ratio test= 94.7  on 9 df,   p=2.22e-16
Wald test            = 76.7  on 9 df,   p=7.2e-13
Score (logrank) test = 83.8  on 9 df,   p=2.85e-14
```

```{.R}
                     rho   chisq     p
...
EarlyGoogleTRUE -0.05167 0.51424 0.473
GLOBAL                NA 2.52587 0.980
```
As predicted, the pre-2005 variable predicts a lower chance of being shut down, and is a large predictor as well, but it doesn't trigger the assumption tester, so we'll keep using the Cox model.
Now let’s interpret the model. The covariates tell us that to reduce the risk of shutdown, you want to:
- Not be an acquisition
- Not be FLOSS
- Be directly making money
- Not be related to social networking
- Have lots of Google hits relative to lifetime
- Have been launched early in Google's lifetime

This all makes sense to me. I find the profit and social effects particularly interesting, but the ratios are a little hard to understand intuitively: if being social multiplies the hazard of shutdown by 1.9233 and not being directly profitable multiplies it by 1.215, what do those look like in practice? We can graph pairs of survivorship curves, splitting the full dataset (omitting the confidence intervals for legibility, although they do overlap), to get a grasp of what these numbers mean:
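In R, the paired curves come from stratifying the Kaplan-Meier fit on each variable (a sketch under the same hypothetical column names):

```{.R}
par(mfrow=c(1,2))
plot(survfit(Surv(Days, Dead) ~ Profit, data=google), lty=1:2,
     xlab="Days", ylab="Fraction surviving", main="Split by Profit")
plot(survfit(Surv(Days, Dead) ~ Social, data=google), lty=1:2,
     xlab="Days", ylab="Fraction surviving", main="Split by Social")
```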
[Figure: All products over time, split by the Profit variable]

[Figure: All products over time, split by the Social variable]

Because I can, I was curious how random forests might stack up against the logistic regression and against a base-rate predictor (that nothing was shut down, since ~65% of the products are still alive).
I trained a random forest as a classifier, yielding reasonable looking error rates:
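A sketch of the fit (randomForest treats a factor response as classification; with 6 predictors the default mtry of 2 matches the output below; `Days` is deliberately left out, as discussed at the end):

```{.R}
library(randomForest)
rf <- randomForest(as.factor(Dead) ~ Type + Profit + FLOSS + Acquisition +
                                     Social + log(Hits), data=google)
rf  # prints the OOB error estimate and confusion matrix
```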
```{.R}
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 31.71%
Confusion matrix:
      FALSE TRUE class.error
FALSE   216   11     0.04846
TRUE    100   23     0.81301
```
To compare the random-forest accuracy with the logistic and survival-model accuracy, I interpreted a logistic estimate of shutdown odds >1 as predicting shutdown and <1 as predicting no shutdown; I then compared the full sets of predictions with the actual shutdown status. The base-rate predictor got 65% right by definition, the logistic managed 68% correct (bootstrap 95% CI: 66-72%), and the random forest similarly got 68% (67-78%). These rates are not quite as bad as they may seem: I excluded the lifetime length (Days) from the logistic and random forests because, unless one is handling it specially with survival analysis, it leaks information, so there's predictive power being left on the table. (There are apparently survival-analysis-specific ways of applying random forests; I may try them out in the future.) Regardless, there's no real reason to switch to the more complex random forests.
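The comparison itself reduces to thresholding each model's predictions and computing raw accuracy (a sketch; the bootstrap CIs would come from resampling a statistic like this):

```{.R}
logisticAcc <- mean((predict(model, type="response") > 0.5) == google$Dead)
rfAcc   <- mean(as.character(predict(rf)) == as.character(google$Dead))  # OOB predictions
baseAcc <- mean(!google$Dead)  # base rate: always predict 'not shut down'
c(base=baseAcc, logistic=logisticAcc, randomForest=rfAcc)
```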