
Predicting Google closures


URL: http://www.gwern.net/Google%20shutdowns


A first step in predicting when a product will be shut down is predicting whether it will be shut down at all. Since we’re predicting a binary outcome (a product living or dying), we can use an ordinary logistic regression. Our first look uses the main variables plus the total hits:

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)         2.3968     1.0680    2.24    0.025
Typeprogram         0.9248     0.8181    1.13    0.258
Typeservice         1.2261     0.7894    1.55    0.120
Typething           0.8805     1.1617    0.76    0.448
ProfitTRUE         -0.3857     0.2952   -1.31    0.191
FLOSSTRUE          -0.1777     0.3791   -0.47    0.639
AcquisitionTRUE     0.4955     0.3434    1.44    0.149
SocialTRUE          0.7866     0.3888    2.02    0.043
log(Hits)          -0.3089     0.0567   -5.45  5.1e-08
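
For reference, a minimal sketch of the kind of R call that could produce a coefficient table like this; the data-frame and column names (google, Dead, etc.) are my assumptions, not necessarily those of the original analysis:

# assumed data frame `google` with one row per product:
#   Dead (logical), Type (factor), Profit/FLOSS/Acquisition/Social (logical), Hits (numeric)
model <- glm(Dead ~ Type + Profit + FLOSS + Acquisition + Social + log(Hits),
             data=google, family=binomial)
summary(model)   # prints Estimate / Std. Error / z value / Pr(>|z|) as above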

In log odds, >0 increases the chance of an event (shutdown) and <0 decreases it. So looking at the coefficients, we can venture some interpretations:

  • Google has a past history of screwing up its social products and then killing them

    This is interesting for confirming the general belief that Google has handled its social properties badly in the past, but I’m not sure how useful it is for predicting the future: since Larry Page became obsessed with social in 2009, we might expect anything to do with social to now either be merged into Google+ or be kept on life support far longer than it would have been before.
  • Google is deprecating software products in favor of web services

    This should have been obvious to anyone watching - much of Google’s effort on Firefox and then Chromium went into improving web browsers as a platform for delivering applications. As efforts like HTML5 mature, there is less incentive for Google to release and support standalone software.
  • But apparently not its FLOSS software

    This seems due to a number of its software releases being picked up by third parties (Wave, Etherpad, Refine), being designed to be integrated into existing communities (Summer of Code projects), or apparently serving a strategic role (Android, Chromium, Dart, Go, Closure Tools, VP Codecs), which we could summarize as building up the browser as a replacement for operating systems. (Why? Commoditize your complements.)
  • Things which charge or show advertising are more likely to survive

    Also obvious, but it’s good to have confirmation (if nothing else, it partially validates the data).
  • Popularity as measured by Google hits seems to matter

    Likewise obvious… or is it?

Is our popularity metric - or any of the 4 - trustworthy? All this data has been collected after the fact, sometimes many years later; what if the data have been contaminated by the fact that something shut down - for example, by a burst of publicity about an obscure service shutting down? (Ironically, this page is contributing to the inflation of hits for any dead service mentioned.) Are we just seeing information leakage? Leakage can be subtle, as I learned for myself doing this analysis.

Investigating further, hits by themselves do matter:

            Estimate Std. Error z value Pr(>|z|)
(Intercept)   3.4052     0.7302    4.66  3.1e-06
log(Hits)    -0.3000     0.0549   -5.46  4.7e-08

Average hits (hits over the product’s lifetime) turns out to be even more important:

             Estimate Std. Error z value Pr(>|z|)
(Intercept)    -2.297      1.586   -1.45    0.147
log(Hits)       0.511      0.209    2.44    0.015
log(AvgHits)   -0.852      0.217   -3.93  8.3e-05

This is more than a little strange; that the higher the average hits, the less likely a product is to be killed makes perfect sense - but then surely the higher the raw hits, the less likely as well? But no. The mystery deepens as we bring in the third hit metric we developed:

                  Estimate Std. Error z value Pr(>|z|)
(Intercept)        -21.589     11.955   -1.81   0.0709
log(Hits)            2.054      0.980    2.10   0.0362
log(AvgHits)        -1.921      0.708   -2.71   0.0067
log(DeflatedHits)   -0.456      0.277   -1.64   0.1001

And sure enough, if we run all 4 hit variables, 3 of them turn out to be statistically-significant and large:

                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       -24.6898    12.4696   -1.98   0.0477
log(Hits)           2.2908     1.0203    2.25   0.0248
log(AvgHits)       -2.0943     0.7405   -2.83   0.0047
log(DeflatedHits)  -0.5383     0.2914   -1.85   0.0647
AvgDeflatedHits    -0.0651     0.0605   -1.08   0.2819

It’s not that the hit variables are somehow summarizing or proxying for the others: if we toss in all the non-hits predictors and penalize parameters for adding complexity without increasing fit, we still wind up with the 3 hit variables:

                  Estimate Std. Error z value Pr(>|z|)
(Intercept)        -23.341     12.034   -1.94   0.0524
AcquisitionTRUE      0.631      0.350    1.80   0.0712
SocialTRUE           0.907      0.394    2.30   0.0213
log(Hits)            2.204      0.985    2.24   0.0252
log(AvgHits)        -2.068      0.713   -2.90   0.0037
log(DeflatedHits)   -0.492      0.280   -1.75   0.0793
...
AIC: 396.9

Most of the predictors were removed as not helping much; 3 of the 4 hit variables survived (but not the combined averaged & deflated hits, suggesting it wasn’t adding much in combination), and two of the better predictors from earlier survived as well: whether something was an acquisition and whether it was social. The original hits variable has the wrong sign, as expected of data leakage; the average and deflated hits now have the predicted sign (the higher the hit count, the lower the risk of death), but this doesn’t put to rest my concerns: the average hits has the right sign, yes, but its effect size seems far too large - we reject raw hits, with a log-odds of +2.1, as obviously contaminated and a correlation almost 4 times larger than one of the known-good correlations (being an acquisition), yet average hits is nearly as big a log-odds at -2! The only variable which seems trustworthy is the deflated hits: it has the right sign and is a more plausible 5x smaller. I’ll use just the deflated hits variable (although I will keep in mind that I’m still not sure it is free from data leakage).
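
The “penalize parameters for adding complexity without increasing fit” step above is ordinary stepwise selection by AIC; a sketch, under the same assumed names:

# full logistic model with every candidate predictor plus all 4 hit variables
full <- glm(Dead ~ Type + Profit + FLOSS + Acquisition + Social + log(Hits) +
                   log(AvgHits) + log(DeflatedHits) + AvgDeflatedHits,
            data=google, family=binomial)
reduced <- step(full)   # drops terms whose removal lowers AIC
summary(reduced)        # the run reported above ended at AIC: 396.9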

The logistic regression helped winnow down the variables but is limited to the binary outcome of shutdown or not. For looking at survival over time, survival analysis might be a useful elaboration of logistic-style approaches; here I draw on Fox & Weisberg’s appendix and Hosmer & Lemeshow’s Applied Survival Analysis. The initial characterization gives us an optimistic median of 2824 days (note that this is much higher than Arthur’s mean of 1459 days because it includes products which were never canceled and I made a much stronger effort to collect older pre-2009 products), but the lower bound is not tight and too little of the sample has died to get an upper bound:

records  n.max  n.start  events  median  0.95LCL  0.95UCL
    350    350      350     123    2824     2095       NA
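
A sketch of the Kaplan-Meier fit that would give such a summary, assuming a Days column for each product’s observed lifetime and Dead as the event indicator:

library(survival)
# products still alive at data-collection time are treated as right-censored
km <- survfit(Surv(Days, Dead) ~ 1, data=google)
print(km)                 # records, events, median survival, 95% CI as above
plot(km, xlab="Days since launch", ylab="Fraction of products surviving")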

Our overall Kaplan-Meier survivorship curve looks a bit interesting:

[Figure: shutdown cumulative probability as a function of time]

If there were constant mortality of products at each day after their launch, we would expect a type II curve where it looks like a straight line, but in fact it looks like there’s a sort of leveling off of deaths, suggesting a type III curve; per Wikipedia:

…the greatest mortality is experienced early on in life, with relatively low rates of death for those surviving this bottleneck. This type of curve is characteristic of species that produce a large number of offspring (see r/K selection theory).

Very nifty: the survivorship curve is consistent with tech industry or startup philosophies of doing lots of things, iterating fast, and throwing things at the wall to see what sticks.

However, it looks like the mortality only starts decreasing around 2000 days, so any product that far out must have been launched around or before 2005, which is when we previously noted that Google started pumping out a lot of products and may also have changed its shutdown-related behaviors; this could violate a basic assumption of Kaplan-Meier, that the underlying survival function isn’t itself changing over time.

Our next step is to fit a Cox proportional hazards model to our covariates:

...
n=350, number of events=123

                     coef exp(coef) se(coef)     z Pr(>|z|)
AcquisitionTRUE     0.130     1.139    0.257  0.51    0.613
FLOSSTRUE           0.141     1.151    0.293  0.48    0.630
ProfitTRUE         -0.180     0.836    0.231 -0.78    0.438
SocialTRUE          0.664     1.943    0.262  2.53    0.011
Typeprogram         0.957     2.603    0.747  1.28    0.200
Typeservice         1.291     3.638    0.725  1.78    0.075
Typething           1.682     5.378    1.023  1.64    0.100
log(DeflatedHits)  -0.288     0.749    0.036 -8.01  1.2e-15

                  exp(coef) exp(-coef) lower .95 upper .95
AcquisitionTRUE       1.139      0.878     0.688     1.884
FLOSSTRUE             1.151      0.868     0.648     2.045
ProfitTRUE            0.836      1.197     0.531     1.315
SocialTRUE            1.943      0.515     1.163     3.247
Typeprogram           2.603      0.384     0.602    11.247
Typeservice           3.637      0.275     0.878    15.064
Typething             5.377      0.186     0.724    39.955
log(DeflatedHits)     0.749      1.334     0.698     0.804

Concordance= 0.726 (se = 0.028)
Rsquare= 0.227 (max possible= 0.974)
Likelihood ratio test= 90.1 on 8 df, p=4.44e-16
Wald test = 79.5 on 8 df, p=6.22e-14
Score (logrank) test = 83.5 on 8 df, p=9.77e-15
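
A sketch of the corresponding coxph() call, again with my assumed column names:

# Cox proportional-hazards regression of time-to-shutdown on the covariates
cox <- coxph(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type +
                                log(DeflatedHits),
             data=google)
summary(cox)   # coefficients, exp(coef) hazard ratios, and global tests as above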

And then we can also test whether any of the covariates are suspicious; in general they seem to be fine:

                      rho  chisq     p
AcquisitionTRUE   -0.0252 0.0805 0.777
FLOSSTRUE          0.0168 0.0370 0.848
ProfitTRUE        -0.0694 0.6290 0.428
SocialTRUE         0.0279 0.0882 0.767
Typeprogram        0.0857 0.9429 0.332
Typeservice        0.0936 1.1433 0.285
Typething          0.0613 0.4697 0.493
log(DeflatedHits) -0.0450 0.2610 0.609
GLOBAL                 NA 2.5358 0.960
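
That check is the standard test of the proportional-hazards assumption on the Schoenfeld residuals; a sketch:

# small p-values would flag covariates whose effect drifts over time,
# violating the proportional-hazards assumption; here everything passes
cox.zph(cox)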

My suspicion lingers, though, so I threw in another covariate (EarlyGoogle): whether a product was released before or after 2005. Does this add predictive value over and above simply knowing that a product is really old, and does the regression still pass the proportional-hazards assumption check? Apparently yes to both:

                     coef exp(coef) se(coef)     z Pr(>|z|)
AcquisitionTRUE    0.1674    1.1823   0.2553  0.66    0.512
FLOSSTRUE          0.1034    1.1090   0.2922  0.35    0.723
ProfitTRUE        -0.1949    0.8230   0.2318 -0.84    0.401
SocialTRUE         0.6541    1.9233   0.2601  2.51    0.012
Typeprogram        0.8195    2.2694   0.7472  1.10    0.273
Typeservice        1.1619    3.1960   0.7262  1.60    0.110
Typething          1.6200    5.0529   1.0234  1.58    0.113
log(DeflatedHits) -0.2645    0.7676   0.0375 -7.06  1.7e-12
EarlyGoogleTRUE   -1.0061    0.3656   0.5279 -1.91    0.057
...
Concordance= 0.728 (se = 0.028)
Rsquare= 0.237 (max possible= 0.974)
Likelihood ratio test= 94.7 on 9 df, p=2.22e-16
Wald test = 76.7 on 9 df, p=7.2e-13
Score (logrank) test = 83.8 on 9 df, p=2.85e-14

                     rho   chisq     p
...
EarlyGoogleTRUE -0.05167 0.51424 0.473
GLOBAL                NA 2.52587 0.980
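
A sketch of how the extra covariate might be constructed and the model refit; the launch-date column (Started) is an assumed name:

# flag products launched before 2005, refit the Cox model, re-check the assumption
google$EarlyGoogle <- google$Started < as.Date("2005-01-01")
cox2 <- coxph(Surv(Days, Dead) ~ Acquisition + FLOSS + Profit + Social + Type +
                                 log(DeflatedHits) + EarlyGoogle,
              data=google)
summary(cox2)    # EarlyGoogleTRUE: hazard ratio ~0.37, p=0.057 as above
cox.zph(cox2)    # the new covariate does not violate proportional hazards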

As predicted, the pre-2005 variable predicts less chance of being shut down and is a large predictor as well, but doesn’t trigger the assumption tester, so we’ll keep using the Cox model.

Now let’s interpret the model. The covariates tell us that to reduce the risk of shutdown, you want to:

  • Not be an acquisition
  • Not be FLOSS
  • Be directly making money
  • Not be related to social networking
  • Have lots of Google hits relative to lifetime
  • Have been launched early in Google’s lifetime

This all makes sense to me. I find the profit and social effects particularly interesting, but the hazard ratios are a little hard to understand intuitively; if being social multiplies the risk of shutdown by 1.9233 and not being directly profitable multiplies it by 1.215, what do those look like? We can graph pairs of survivorship curves, splitting the full dataset (omitting the confidence intervals for legibility, although they do overlap), to get a grasp of what these numbers mean:

[Figure: all products over time, split by the Profit variable]
[Figure: all products over time, split by the Social variable]
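
Those paired curves come from stratifying the Kaplan-Meier fit on each variable; a sketch:

# survivorship split by whether a product directly makes money
plot(survfit(Surv(Days, Dead) ~ Profit, data=google),
     lty=1:2, xlab="Days since launch", ylab="Fraction surviving")
legend("bottomleft", c("no revenue", "revenue"), lty=1:2)

# survivorship split by whether a product is social
plot(survfit(Surv(Days, Dead) ~ Social, data=google),
     lty=1:2, xlab="Days since launch", ylab="Fraction surviving")
legend("bottomleft", c("not social", "social"), lty=1:2)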

Because I can, I was curious how random forests might stack up against the logistic regression and against a base-rate predictor (that nothing was shut down, since ~65% of the products are still alive).

I trained a random forest as a classifier, yielding reasonable looking error rates:

               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 31.71%
Confusion matrix:
      FALSE TRUE class.error
FALSE   216   11     0.04846
TRUE    100   23     0.81301
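
A sketch of the random forest fit, assuming the randomForest package and the same assumed predictors (with Days deliberately left out, as discussed below):

library(randomForest)
# classification forest over shutdown status; as.factor() forces classification
rf <- randomForest(as.factor(Dead) ~ Type + Profit + FLOSS + Acquisition + Social +
                                     log(Hits) + log(AvgHits) + log(DeflatedHits),
                   data=google, ntree=500)
print(rf)   # OOB error rate and confusion matrix as above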

To compare the random forest accuracy with the logistic and survival model accuracy, I interpreted the logistic estimate of shutdown odds >1 as predicting shutdown and <1 as predicting not shutdown; I then compared the full sets of predictions with the actual shutdown status. The base-rate predictor got 65% right by definition, the logistic managed to score 68% correct (bootstrap 95% CI: 66-72%), and the random forest similarly got 68% (67-78%). These rates are not quite as bad as they may seem: I excluded the lifetime length (Days) from the logistic and random forests because unless one is handling it specially with survival analysis, it leaks information; so there’s predictive power being left on the table. (There are survival analysis-specific ways of applying random forests, apparently; I may try them out in the future.) Regardless, there’s no real reason to switch to the more complex random forests.
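
A sketch of that comparison, thresholding the logistic model’s predicted probabilities at 0.5 (equivalently, odds >1) and scoring each predictor against the observed outcomes; model and rf are the hypothetical fitted objects from the earlier sketches:

# base-rate predictor: always guess "not shut down"
mean(!google$Dead)                             # compare with the ~65% base rate

# logistic regression: predicted probability > 0.5 counts as a predicted shutdown
logisticPred <- predict(model, type="response") > 0.5
mean(logisticPred == google$Dead)              # compare with the ~68% reported above

# random forest: out-of-bag class predictions
mean(predict(rf) == as.factor(google$Dead))    # compare with the ~68% reported above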

