preface xiii
1 introduction 1
data analysis 1
what’s in this book 2
what’s with the workshops? 3
what’s with the math? 4
what you’ll need 5
what’s missing 6
part i graphics: looking at data
2 a single variable: shape and distribution 11
dot and jitter plots 12
histograms and kernel density estimates 14
the cumulative distribution function 23
rank-order plots and lift charts 30
only when appropriate: summary statistics and box plots 33
workshop: numpy 38
further reading 45
3 two variables: establishing relationships 47
scatter plots 47
conquering noise: smoothing 48
logarithmic plots 57
banking 61
linear regression and all that 62
showing what’s important 66
graphical analysis and presentation graphics 68
workshop: matplotlib 69
further reading 78
4 time as a variable: time-series analysis 79
examples 79
the task 83
smoothing 84
don’t overlook the obvious! 90
the correlation function 91
optional: filters and convolutions 95
workshop: scipy.signal 96
further reading 98
5 more than two variables: graphical multivariate analysis 99
false-color plots 100
a lot at a glance: multiplots 105
composition problems 110
novel plot types 116
interactive explorations 120
workshop: tools for multivariate graphics 123
further reading 125
6 intermezzo: a data analysis session 127
a data analysis session 127
workshop: gnuplot 136
further reading 138
part ii analytics: modeling data
7 guesstimation and the back of the envelope 141
principles of guesstimation 142
how good are those numbers? 151
optional: a closer look at perturbation theory and error propagation 155
workshop: the gnu scientific library (gsl) 158
further reading 161
8 models from scaling arguments 163
models 163
arguments from scale 165
mean-field approximations 175
common time-evolution scenarios 178
case study: how many servers are best? 182
why modeling? 184
workshop: sage 184
further reading 188
9 arguments from probability models 191
the binomial distribution and bernoulli trials 191
the gaussian distribution and the central limit theorem 195
power-law distributions and non-normal statistics 201
other distributions 206
optional: case study—unique visitors over time 211
workshop: power-law distributions 215
further reading 218
10 what you really need to know about classical statistics 221
genesis 221
statistics defined 223
statistics explained 226
controlled experiments versus observational studies 230
optional: bayesian statistics—the other point of view 235
workshop: r 243
further reading 249
11 intermezzo: mythbusting—bigfoot, least squares, and all that 253
how to average averages 253
the standard deviation 256
least squares 260
further reading 264
part iii computation: mining data
12 simulations 267
a warm-up question 267
monte carlo simulations 270
resampling methods 276
workshop: discrete event simulations with simpy 280
further reading 291
13 finding clusters 293
what constitutes a cluster? 293
distance and similarity measures 298
clustering methods 304
pre- and postprocessing 311
other thoughts 314
a special case: market basket analysis 316
a word of warning 319
workshop: pycluster and the c clustering library 320
further reading 324
14 seeing the forest for the trees: finding important attributes 327
principal component analysis 328
visual techniques 337
kohonen maps 339
workshop: pca with r 342
further reading 348
15 intermezzo: when more is different 351
a horror story 353
some suggestions 354
what about map/reduce? 356
workshop: generating permutations 357
further reading 358
part iv applications: using data
16 reporting, business intelligence, and dashboards 361
business intelligence 362
corporate metrics and dashboards 369
data quality issues 373
workshop: berkeley db and sqlite 376
further reading 381
17 financial calculations and modeling 383
the time value of money 384
uncertainty in planning and opportunity costs 391
cost concepts and depreciation 394
should you care? 398
is this all that matters? 399
workshop: the newsvendor problem 400
further reading 403
18 predictive analytics 405
introduction 405
some classification terminology 407
algorithms for classification 408
the process 419
the secret sauce 423
the nature of statistical learning 424
workshop: two do-it-yourself classifiers 426
further reading 431
19 epilogue: facts are not reality 433
a programming environments for scientific computation and data analysis 435
software tools 435
a catalog of scientific software 437
writing your own 443
further reading 444
b results from calculus 447
common functions 448
calculus 460
useful tricks 468
notation and basic math 472
where to go from here 479
further reading 481
c working with data 485
sources for data 485
cleaning and conditioning 487
sampling 489
data file formats 490
the care and feeding of your data zoo 492
skills 493
terminology 495
further reading 497
index 499