Author picture
1 Work 88 Members 1 Review

Works by the author

Tagged

Common Knowledge

Gender
female

Members

Reviews

Contains "Foreword I", "Foreword II", "Preface", " Conventions Used in This Book", " Using Code Examples", " O'Reilly Safari", " How to Contact Us", " Acknowledgments", "1. How SRE Relates to DevOps", " Background on DevOps", " No More Silos", " Accidents Are Normal", " Change Should Be Gradual", " Tooling and Culture Are Interrelated", " Measurement Is Crucial", " Background on SRE", " Operations Is a Software Problem", " Manage by Service Level Objectives (SLOs)", " Work to Minimize Toil", " Automate This Year's Job Away", " Move Fast by Reducing the Cost of Failure", " Share Ownership with Developers", " Use the Same Tooling, Regardless of Function or Job Title", " Compare and Contrast", " Organizational Context and Fostering Successful Adoption", " Narrow, Rigid Incentives Narrow Your Success", " It's Better to Fix It Yourself; Don't Blame Someone Else", " Consider Reliability Work as a Specialized Role", " When Can Substitute for Whether", " Strive for Parity of Esteem: Career and Financial", " Conclusion", "Part I. Foundations", "2. Implementing SLOs", " Why SREs Need SLOs", " Getting Started", " Reliability Targets and Error Budgets", " What to Measure: Using SLIs", " A Worked Example", " Moving from SLI Specification to SLI Implementation", " Measuring the SLIs", " Using the SLIs to Calculate Starter SLOs", " Choosing an Appropriate Time Window", " Getting Stakeholder Agreement", " Establishing an Error Budget Policy", " Documenting the SLO and Error Budget Policy", " Dashboards and Reports", " Continuous Improvement of SLO Targets", " Improving the Quality of Your SLO", " Decision Making Using SLOs and Error Budgets", " Advanced Topics", " Modeling User Journeys", " Grading Interaction Importance", " Modeling Dependencies", " Experimenting with Relaxing Your SLOs", " Conclusion", "3. SLO Engineering Case Studies", " Evernote's SLO Story", " Why Did Evernote Adopt the SRE Model?", " Introduction of SLOs: A Journey in Progress", " Breaking Down the SLO Wall Between Customer and Cloud Provider", " Current State", " The Home Depot's SLO Story", " The SLO Culture Project", " Our First Set of SLOs", " Evangelizing SLOs", " Automating VALET Data Collection", " The Proliferation of SLOs", " Applying VALET to Batch Applications", " Using VALET in Testing", " Future Aspirations", " Summary", " Conclusion", "4. Monitoring", " Desirable Features of a Monitoring Strategy", " Speed", " Calculations", " Interfaces", " Alerts", " Sources of Monitoring Data", " Examples", " Managing Your Monitoring System", " Treat Your Configuration as Code", " Encourage Consistency", " Prefer Loose Coupling", " Metrics with Purpose", " Intended Changes", " Dependencies", " Saturation", " Status of Served Traffic", " Implementing Purposeful Metrics", " Testing Alerting Logic", " Conclusion", "5. Alerting on SLOs", " Alerting Considerations", " Ways to Alert on Significant Events", " 1: Target Error Rate ≥ SLO Threshold", " 2: Increased Alert Window", " 3: Incrementing Alert Duration", " 4: Alert on Burn Rate", " 5: Multiple Burn Rate Alerts", " 6: Multiwindow, Multi-Burn-Rate Alerts", " Low-Traffic Services and Error Budget Alerting", " Generating Artificial Traffic", " Combining Services", " Making Service and Infrastructure Changes", " Lowering the SLO or Increasing the Window", " Extreme Availability Goals", " Alerting at Scale", " Conclusion", "6. Eliminating Toil", " What Is Toil?", " Measuring Toil", " Toil Taxonomy", " Business Processes", " Production Interrupts", " Release Shepherding", " Migrations", " Cost Engineering and Capacity Planning", " Troubleshooting for Opaque Architectures", " Toil Management Strategies", " Identify and Measure Toil", " Engineer Toil Out of the System", " Reject the Toil", " Use SLOs to Reduce Toil", " Start with Human-Backed Interfaces", " Provide Self-Service Methods", " Get Support from Management and Colleagues", " Promote Toil Reduction as a Feature", " Start Small and Then Improve", " Increase Uniformity", " Assess Risk Within Automation", " Automate Toil Response", " Use Open Source and Third-Party Tools", " Use Feedback to Improve", " Case Studies", " Case Study 1: Reducing Toil in the Datacenter with Automation", " Background", " Problem Statement", " What We Decided to Do", " Design First Effort: Saturn Line-Card Repair", " Implementation", " Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair", " Implementation", " Lessons Learned", " Case Study 2: Decommissioning Filer-Backed Home Directories", " Background", " Problem Statement", " What We Decided to Do", " Design and Implementation", " Key Components", " Lessons Learned", " Conclusion", "7. Simplicity", " Measuring Complexity", " Simplicity Is End-to-End, and SREs Are Good for That", " Case Study 1: End-to-End API Simplicity", " Case Study 2: Project Lifecycle Complexity", " Regaining Simplicity", " Case Study 3: Simplification of the Display Ads Spiderweb", " Case Study 4: Running Hundreds of Microservices on a Shared Platform", " Case Study 5: pDNS No Longer Depends on Itself", " Conclusion", "Part II. Practices", "8. On-Call", " Recap of 'Being On-Call' Chapter of First SRE Book", " Example On-Call Setups Within Google and Outside Google", " Google: Forming a New Team", " Evernote: Finding Our Feet in the Cloud", " Practical Implementation Details", " Anatomy of Pager Load", " On-Call Flexibility", " On-Call Team Dynamics", " Conclusion", "9. Incident Response", " Incident Management at Google", " Incident Command System", " Main Roles in Incident Response", " Case Studies", " Case Study 1: Software Bug - The Lights Are On but No One's (Google) Home", " Case Study 2: Service Fault - Cache Me If You Can", " Case Study 3: Power Outage - Lightning Never Strikes Twice... Until It Does", " Case Study 4: Incident Response at PagerDuty", " Putting Best Practices into Practice", " Incident Response Training", " Prepare Beforehand", " Drills", " Conclusion", "10. Postmortem Culture: Learning from Failure", " Case Study", " Bad Postmortem", " Why Is This Postmortem Bad?", " Good Postmortem", " Why Is This Postmortem Better?", " Organizational Incentives", " Model and Enforce Blameless Behavior", " Reward Postmortem Outcomes", " Share Postmortems Openly", " Respond to Postmortem Culture Failures", " Tools and Templates", " Postmortem Templates", " Postmortem Tooling", " Conclusion", "11. Managing Load", " Google Cloud Load Balancing", " Anycast", " Maglev", " Global Software Load Balancer", " Google Front End", " GCLB: Low Latency", " GCLB: High Availability", " Case Study 1: Pokémon GO on GCLB", " Autoscaling", " Handling Unhealthy Machines", " Working with Stateful Systems", " Configuring Conservatively", " Setting Constraints", " Including Kill Switches and Manual Overrides", " Avoiding Overloading Backends", " Avoiding Traffic Imbalance", " Combining Strategies to Manage Load", " Case Study 2: When Load Shedding Attacks", " Conclusion", "12. Introducing Non-Abstract Large System Design", " What Is NALSD?", " Why 'Non-Abstract'?", " AdWords Example", " Design Process", " Initial Requirements", " One Machine", " Distributed System", " Conclusion", "13. Data Processing Pipelines", " Pipeline Applications", " Event Processing/Data Transformation to Order or Structure Data", " Data Analytics", " Machine Learning", " Pipeline Best Practices", " Define and Measure Service Level Objectives", " Plan for Dependency Failure", " Create and Maintain Pipeline Documentation", " Map Your Development Lifecycle", " Reduce Hotspotting and Workload Patterns", " Implement Autoscaling and Resource Planning", " Adhere to Access Control and Security Policies", " Plan Escalation Paths", " Pipeline Requirements and Design", " What Features Do You Need?", " Idempotent and Two-Phase Mutations", " Checkpointing", " Code Patterns", " Pipeline Production Readiness", " Pipeline Failures: Prevention and Response", " Potential Failure Modes", " Potential Causes", " Case Study: Spotify", " Event Delivery", " Event Delivery System Design and Architecture", " Event Delivery System Operation", " Customer Integration and Support", " Summary", " Conclusion", "14. Configuration Design and Best Practices", " What Is Configuration?", " Configuration and Reliability", " Separating Philosophy and Mechanics", " Configuration Philosophy", " Configuration Asks Users Questions", " Questions Should Be Close to User Goals", " Mandatory and Optional Questions", " Escaping Simplicity", " Mechanics of Configuration", " Separate Configuration and Resulting Data", " Importance of Tooling", " Ownership and Change Tracking", " Safe Configuration Change Application", " Conclusion", "15. Configuration Specifics", " Configuration-Induced Toil", " Reducing Configuration-Induced Toil", " Critical Properties and Pitfalls of Configuration Systems", " Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem", " Pitfall 2: Designing Accidental or Ad Hoc Language Features", " Pitfall 3: Building Too Much Domain-Specific Optimization", " Pitfall 4: Interleaving 'Configuration Evaluation' with 'Side Effects'", " Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua", " Integrating a Configuration Language", " Generating Config in Specific Formats", " Driving Multiple Applications", " Integrating an Existing Application: Kubernetes", " What Kubernetes Provides", " Example Kubernetes Config", " Integrating the Configuration Language", " Integrating Custom Applications (In-House Software)", " Effectively Operating a Configuration System", " Versioning", " Source Control", " Tooling", " Testing", " When to Evaluate Configuration", " Very Early: Checking in the JSON", " Middle of the Road: Evaluate at Build Time", " Late: Evaluate at Runtime", " Guarding Against Abusive Configuration", " Conclusion", "16. Canarying Releases", " Release Engineering Principles", " Balancing Release Velocity and Reliability", " What Is Canarying?", " Release Engineering and Canarying", " Requirements of a Canary Process", " Our Example Setup", " A Roll Forward Deployment Versus a Simple Canary Deployment", " Canary Implementation", " Minimizing Risk to SLOs and the Error Budget", " Choosing a Canary Population and Duration", " Selecting and Evaluating Metrics", " Metrics Should Indicate Problems", " Metrics Should Be Representative and Attributable", " Before/After Evaluation Is Risky", " Use a Gradual Canary for Better Metric Selection", " Dependencies and Isolation", " Canarying in Noninteractive Systems", " Requirements on Monitoring Data", " Related Concepts", " Blue/Green Deployment", " Artificial Load Generation", " Traffic Teeing", " Conclusion", "Part III. Processes", "17. Identifying and Recovering from Overload", " From Load to Overload", " Case Study 1: Work Overload When Half a Team Leaves", " Background", " Problem Statement", " What We Decided to Do", " Implementation", " Lessons Learned", " Case Study 2: Perceived Overload After Organizational and Workload Changes", " Background", " Problem Statement", " What We Decided to Do", " Implementation", " Effects", " Lessons Learned", " Strategies for Mitigating Overload", " Recognizing the Symptoms of Overload", " Reducing Overload and Restoring Team Health", " Conclusion", "18. SRE Engagement Model", " The Service Lifecycle", " Phase 1: Architecture and Design", " Phase 2: Active Development", " Phase 3: Limited Availability", " Phase 4: General Availability", " Phase 5: Deprecation", " Phase 6: Abandoned", " Phase 7: Unsupported", " Setting Up the Relationship", " Communicating Business and Production Priorities", " Identifying Risks", " Aligning Goals", " Setting Ground Rules", " Planning and Executing", " Sustaining an Effective Ongoing Relationship", " Investing Time in Working Better Together", " Maintaining an Open Line of Communication", " Performing Regular Service Reviews", " Reassessing When Ground Rules Start to Slip", " Adjusting Priorities According to Your SLOs and Error Budget", " Handling Mistakes Appropriately", " Scaling SRE to Larger Environments", " Supporting Multiple Services with a Single SRE Team", " Structuring a Multiple SRE Team Environment", " Adapting SRE Team Structures to Changing Circumstances", " Running Cohesive Distributed SRE Teams", " Ending the Relationship", " Case Study 1: Ares", " Case Study 2: Data Analysis Pipeline", " Conclusion", "19. SRE: Reaching Beyond Your Walls", " Truths We Hold to Be Self-Evident", " Reliability Is the Most Important Feature", " Your Users, Not Your Monitoring, Decide Your Reliability", " If You Run a Platform, Then Reliability Is a Partnership", " Everything Important Eventually Becomes a Platform", " When Your Customers Have a Hard Time, You Have to Slow Down", " You Will Need to Practice SRE with Your Customers", " How to: SRE with Your Customers", " Step 1: SLOs and SLIs Are How You Speak", " Step 2: Audit the Monitoring and Build Shared Dashboards", " Step 3: Measure and Renegotiate", " Step 4: Design Reviews and Risk Analysis", " Step 5: Practice, Practice, Practice", " Be Thoughtful and Disciplined", " Conclusion", "20. SRE Team Lifecycles", " SRE Practices Without SREs", " Starting an SRE Role", " Finding Your First SRE", " Placing Your First SRE", " Bootstrapping Your First SRE", " Distributed SREs", " Your First SRE Team", " Forming", " Storming", " Norming", " Performing", " Making More SRE Teams", " Service Complexity", " SRE Rollout", " Geographical Splits", " Suggested Practices for Running Many Teams", " Mission Control", " SRE Exchange", " Training", " Horizontal Projects", " SRE Mobility", " Travel", " Launch Coordination Engineering Teams", " Production Excellence", " SRE Funding and Hiring", " Conclusion", "21. Organizational Change Management in SRE", " SRE Embraces Change", " Introduction to Change Management", " Lewin's Three-Stage Model", " McKinsey's 7-S Model", " Kotter's Eight-Step Process for Leading Change", " The Prosci ADKAR Model", " Emotion-Based Models", " The Deming Cycle", " How These Theories Apply to SRE", " Case Study 1: Scaling Waze - From Ad Hoc to Planned Change", " Background", " The Messaging Queue: Replacing a System While Maintaining Reliability", " The Next Cycle of Change: Improving the Deployment Process", " Lessons Learned", " Case Study 2: Common Tooling Adoption in SRE", " Background", " Problem Statement", " What We Decided to Do", " Design", " Implementation: Monitoring", " Lessons Learned", " Conclusion", "Conclusion", " Onward...", " The Future Belongs to the Past", " SRE + 'Insert Other Discipline'", " Trickles, Streams, and Floods", " SRE Belongs to All of Us", " On Gratitude", "A. Example SLO Document", " Service Overview", " SLIs and SLOs", " Rationale", " Error Budget", " Clarifications and Caveats", "B. Example Error Budget Policy", " Service Overview", " Goals", " Non-Goals", " SLO Miss Policy", " Outage Policy", " Escalation Policy", " Background", "C. Results of Postmortem Analysis", "Index".

Really great to have a description of what you can do if you have your hand on the tiller at, say, Google. Elsewhere (and surely at Google too), the screams from the engine room cannot be heard in the executive corridor.
bnielsen | Aug 27, 2021 |

You May Also Like

Associated Authors

Statistics

Works
1
Members
88
Popularity
#209,356
Rating
4.0
Reviews
1
ISBNs
4

Charts & Graphs