python|部署证明书提出了挑战和架构正统观念

I lead a team of immensely talented engineers maintaining a critical application that is at the nucleus of my organization’s IT map. The advantage of being in such a team is the fact that you get to gauge the impact of your changes by looking at the effect it has on other teams and consumers. A major disadvantage, if you have not already guessed, is this same dependency and the pressure it brings along and a very thin margin for error. The application that my team manages used to be a large monolith and had a single source of non-replicable data, with the downstream systems being tightly coupled to this. Breaking this down into a host of microservices was a colossal undertaking. But that would be a story for yet another fewer-meetings day.
我带领一支由极富才华的工程师组成的团队,维护一个关键应用程序,该应用程序是我组织IT架构的核心。 成为这样一个团队的好处是,您可以通过观察变更对其他团队和消费者的影响来评估变更的影响。 一个主要的缺点(如果您还没有猜到的话)是相同的依赖关系及其带来的压力以及非常小的错误余量。 我的团队管理的应用程序以前是一个大型整体,只有一个不可复制的数据源,下游系统与此紧密耦合。 将其分解为一系列微服务是一项艰巨的任务。 但这将是又一个少开会的日子的一个故事。
Fast forward to a time when we are managing a suite of 60 odd, loosely coupled, context bound microservices. But effortless deployments were still a challenge that we had not fully won over. We were still relying on after-business hours releases, redundant release definitions and complex release cycles. With a super agile team and a massively dynamic workitem backlog, the need to come up with a way to upgrade our deployments was well overdue. Along with the team, I came up with a couple of options and chose to organically revamp the whole system. Putting together the details of this exercise in one place would be gross over-simplification. Here is my first attempt at doing that.
快进到我们正在管理60个奇数,松散耦合,上下文绑定微服务的时代。 但是轻松部署仍然是我们尚未完全克服的挑战。 我们仍然依赖于下班时间发布,冗余的发布定义和复杂的发布周期。 有了一个超级敏捷的团队和一个巨大的动态工作单积压,就已经想出了一种升级我们的部署的方法。 与团队一起,我提出了两个选择,并选择有机地改造整个系统。 将这项工作的细节放在一个地方将是过于简化。 这是我第一次尝试这样做。
Blue/GreenWhy?We started with simple resource swap deployments. Where new production resources are tested on a set of servers prior to swapping the resources out in production. This process is popularly known as Blue/Green deployments. For those who are unfamiliar with this, essentially there are two exact copies of the application’s source code, designated “Blue” and “Green” respectively (not exactly sure why those two colors). And we started seeing really good results with this approach, pretty early on.
蓝色/绿色 为什么? 我们从简单的资源交换部署开始。 在生产中换出资源之前,要在一组服务器上测试新的生产资源。 此过程通常称为“ 蓝色/绿色”部署。 对于不熟悉此方法的人,基本上有两个应用程序源代码的精确副本,分别指定为“ Blue ”和“ Green ”(不确定为什么使用这两种颜色)。 我们很早就开始通过这种方法看到了很好的结果。
Advantages One of these Blue/Green stacks is the external facing set of servers, while the other stays unavailable to the users. Essentially this second set of servers becomes a smoke test environment on top of the production stack. By using this inactive set of servers as our deployment target, we could test our application’s behavior with the new codebase prior to deploying the functionality to live users. This allowed us to reduce the downtime in deployments while also improving the overall resiliency of the application. Improvement in resiliency was achieved by the way of providing an easy fallback in the event of a sudden increase in scale or the occasional botched deployment.
优点这些蓝色/绿色堆栈中的一个是面向外部的服务器集,而另一个则对用户不可用。 从本质上讲,第二组服务器成为生产堆栈顶部的烟雾测试环境。 通过使用这组不活动的服务器作为我们的部署目标,我们可以在将功能部署到实时用户之前使用新的代码库测试应用程序的行为。 这使我们能够减少部署中的停机时间,同时也提高了应用程序的整体弹性。 通过在规模突然增加或偶尔部署不足的情况下提供轻松回退的方式,可以提高弹性。
ChallengesThe blue/green approach can be very cost-intensive, since the core premise of this is duplication of application’s production resources. While the actual work of setting up this kind of environment is greatly eased due to advancements in cloud-resource availability, the costs of such an install are nearly double as those of a less secure option.
挑战 蓝/绿方法可能会非常耗费成本,因为此方法的核心前提是重复使用应用程序的生产资源。 尽管由于云资源可用性的提高而大大简化了设置此类环境的实际工作,但这种安装的成本几乎是安全性较差的选项的两倍。
Canary
金丝雀
Canary deployments aim to address these issues by reducing the need for duplicate application infrastructures. With some tweaks to the application’s architecture and to the development practices, we could reduce the need for maintaining redundant stacks and deliver the same level of stability. Canary releases have always been used by development teams whenever there is a new or big change that has to be introduced. But giving our stakeholders and teammates, a surprise, at the same time might not work in everyone’s best interest.
Canary部署旨在通过减少对重复的应用程序基础结构的需求来解决这些问题。 通过对应用程序体系结构和开发实践进行一些调整,我们可以减少维护冗余堆栈的需求并提供相同级别的稳定性。 每当必须进行新的或较大的更改时,开发团队就一直使用Canary版本。 但是同时给我们的利益相关者和团队成员一个惊喜可能并不符合每个人的最大利益。
State Department static images国务院静态图片 I have worked on teams that have extensively used feature flags. Which are an ingenious way to isolate certain features. But as developers world over are growing more and more curious about the upcoming releases, it becomes more and more strenuous to manage internal releases. Especially if the changes correspond to a company that is due on releasing a new product. And thus even an innocuous miss around the code-names or dead code, might cost the company a lot in PR cover-up. Phased or incremental rollouts could be another way of gradually folding the changes in.Canary releases shift the focus from releasing entire applications to releasing individual features within an application. So, instead of releasing all of the new features at once, as part of a mammoth monthly release, we started releasing the code for new features while slowly scaling up the number of users that have access to that feature. And by employing multiple canaries each tagged by a different feature toggle across various geographical regions we were able to get early success indices. All this while, gradually updating the traffic allowed on the canary from under 5% to a full 100%. Another upside is that it enabled us to continuously release new features for the application without needing a specific deployment or a release window.
我曾在广泛使用功能标志的团队中工作。 这是隔离某些功能的巧妙方法。 但是,随着世界各地的开发人员对即将发布的版本越来越好奇,管理内部版本变得越来越费劲。 特别是如果更改对应于要发布新产品的公司。 因此,即使是无故遗漏了代号名称或无效代码,也可能使公司蒙受大量PR掩盖。 分阶段或增量推出可能是逐渐折叠更改的另一种方法。Canary发布将重点从发布整个应用程序转移到发布应用程序中的单个功能。 因此,我们不像一次庞大的月度发布那样一次发布所有新功能,而是开始发布新功能的代码,同时慢慢扩大可使用该功能的用户数量。 通过使用多个金丝雀,每个金丝雀都具有不同的特征,可以跨不同的地理区域切换 ,我们就能获得早期的成功指数。 在所有这些期间,逐渐将金丝雀允许的流量从5%以下更新为完整的100%。 另一个好处是,它使我们能够连续发布该应用程序的新功能,而无需特定的部署或发布窗口。
Observability and Defining Success
可观察性和成功的定义
【python|部署证明书提出了挑战和架构正统观念】In the process of switching traffic to users, we had to identify the user sets. What should be the criteria for defining the target user base? We started with identifying the users that are relatively more active.
在将流量切换到用户的过程中,我们必须识别用户集。 定义目标用户群的标准是什么? 我们首先确定相对活跃的用户。
Illustration by karthik on his 90’s machine karthik在他90年代的机器上的插图 Once done, we had to engage the marketing and public relations team to seek out consent and request feedback. This would help gauge the issues early on. For internal product teams dogfooding is an easier approach.Choosing success metrics for such a series of releases becomes more and more critical since the numbers from each stage help in identifying any early stage issues. Increased response times for an extrapolated throughput, increased error rates or other measurable entities were some of the key metrics.
完成后,我们必须聘请营销和公共关系团队寻求同意并要求反馈。 这将有助于及早评估问题。 对于内部产品团队来说,采用狗食是一种更简单的方法。选择这样一系列发布的成功指标变得越来越关键,因为每个阶段的数字都有助于识别任何早期阶段的问题。 其中一些关键指标是:增加吞吐量的响应时间,增加错误率或其他可衡量的实体。
Challenges
挑战性
One of the major challenges, we know, would be cleaning up after or managing multiple release versions and constantly reducing the number of parallel versions currently live in production. Database changes were a real challenge to work around. Parallel Change or expand, merge and contract pattern can help mitigate this to a large extent. This aspect is yet to be thoroughly explored.
我们知道,主要挑战之一是清理或管理多个发行版本之后,不断减少当前生产环境中并行版本的数量。 解决数据库更改是一个真正的挑战。 并行更改扩展,合并和收缩模式可以在很大程度上缓解这种情况。 这方面还有待深入探讨。
References: https://martinfowler.com/https://docs.microsoft.com/
参考: https : //martinfowler.com/ https://docs.microsoft.com /

翻译自: https://medium.com/swlh/deployment-testimonials-issues-challenges-and-architectural-orthodoxy-95d38ccdf00b

    推荐阅读