90 Day Plan for a CTO in a New Job

This is a checklist for a new CTO, head of Product, or leader in a similar role starting in a new job. It is meant to kickstart continuous improvement in your product engineering organization. I encourage you to take a scientific, test-and-learn approach to everything you do. You should customize this template based on your own experiences over time. If you find it helpful, please feel welcome to send me additions and improvements to this list.

Repeat the following seven steps iteratively to make incremental and continuous improvements.

1. Understand your job. Learn the organization and industry you are in.

  1. Make a list of the areas you are responsible for. These are likely to include:
    1. Technology: Software Engineering, Infrastructure Engineering, DevOps, Cyber Security, Systems Operations, Application Support
    2. Product: Product Management, Project Management, User Experience, User Interface Design
    3. Data: Data Science, Data Engineering, Data Visualization
  2. Review what it takes to be an effective Chief Technology & Product Officer.
  3. Create a mind map of the culture, technology, and operations parts of your CTO job.
  4. Meet customers, executives, stakeholders, colleagues, and team members.
  5. Connect with a network of your peers outside your organization.
  6. Get feedback.
  7. Collect, compile, and synthesize information into knowledge.
  8. Check: How are we doing in relation to our existing metrics for success?
  9. Identify common themes, patterns, and problems.
  10. Consider retaining the services of an executive coach.

2. Define and revise measurements for success.

  1. List metrics for the success of the company as viewed by shareholders.
  2. Prioritize metrics for the success of the teams you manage and how they relate to the metrics for the success of the whole organization.
  3. Determine: What metrics are no longer a priority?
  4. Determine: What new metrics do we need to add?

3. Articulate your vision and strategy.

  1. Communicate it clearly and regularly to customers, executives, stakeholders, colleagues, and team members.
  2. Meet regularly with your team members, peers, executives, stakeholders, customers, partners, and vendors. Human relationships and face-to-face communications (when feasible) are essential.
  3. Host regular 1:1 meetings with your direct reports, at least once a week.
  4. Host regular all-hands meetings and communications: monthly for a staff of fewer than ~100 people and quarterly for a staff of more than ~100 people, depending on space. Encourage your departments to hold regular all-hands meetings of their own.
  5. Host regular social, relationship building events and activities. For example, a monthly celebration event to mention professional and personal milestones that people want to share.
  6. Implement processes to have productive business meetings.

4. Organize people for success.

  1. Reorganize teams and redeploy people.
    1. Ensure that your organizational structure factors in products, stakeholders, and career growth needs of your team members.
    2. Here is an example of a technology team organization for media companies.
  2. Reinvigorate people.
    1. Implement managerial and technical career tracks.
    2. Standardize titles while still retaining flexibility and fun.
    3. Consider that career pathways are not linear.
  3. Recruit talent.
    1. When feasible, interview people by putting them to work.

5. Build culture.

  1. Align team members toward the common good and shared goals.
  2. Ask team members how they are doing. Are they happy in their jobs? Are their jobs exciting, challenging, and rewarding?
  3. Solicit advice, including leadership advice from your colleagues, regardless of their level or experience. You can learn important leadership lessons from people who report to you. This also encourages your colleagues to become leaders.
  4. Remember to thank people when they deserve it.
  5. Implement a performance evaluation and career development system.
  6. Build and maintain a cohesive leadership team. Make it well known that internal rivalries are strongly discouraged and not tolerated.
  7. Encourage good life/work balance, including a sensible vacation policy.
  8. Experiment with ideas to keep the workplace interesting.

6. Revise processes for success and delivery so that they suit the environment and the times.

  1. Create checklists to help you do your job better (like this one itself). These checklists will also help your colleagues. Encourage others to collaborate on checklists and share them.
    1. Here is a sample one I made about reviewing managed services contracts
    2. and another one for dealing with outages.
  2. Encourage a culture of sharing best practices, like simple personal productivity tips.
  3. Design evaluation scorecards and criteria to justify, prioritize, and classify projects.
  4. Ensure that your project portfolio management system and your people role definitions factor in the need to regularly evaluate and decommission projects and products that don’t make sense to continue.

7. Upgrade technologies.

  1. Pay off technical debt and continue performance enhancements.
    1. App, site, and service reliability
    2. Automation (QA, deployments, support, etc.)
    3. Performance
    4. Security (e.g. start down the path to HTTPS)
  2. Make each team increasingly autonomous and self-sufficient while enabling collaboration and economies of scale.
    1. For example, by moving to a microservices model, using tools such as Docker, hosted on a cloud service provider (AWS).

Thank you for reading this and for sending me suggestions to make this list even more helpful to others.

This article is mirrored on LinkedIn. It is a part of the ctobook series of articles related to #culture, #technology, and #operations: three critical parts of a Chief Technology & Product Officer’s job.

CTO Mind Map: Culture, Technology, Operations

In the role of chief technology officer, you have to be concerned with many topics. Some relate to functions for which you have direct supervisory responsibility, and some relate to areas that are managed by others but for which you still share responsibility.

To keep all of a CTO’s concerns organized, I created this mind map using XMind. The items are classified under three major categories: culture, technology, and operations.

CTO Mind Map: Culture, Technology, Operations: High Level Summary View

 

The purposes of this mind map are manifold. It serves as a visual job description. It is a map for CTOs to use to prioritize and focus their own work and that of their team members, based on the organization’s needs and the skill sets of the CTO and others. It is also used to identify gaps, both in terms of areas and coverage.

You can view it as an image in the SVG format (scalable vector graphics) in your Web browser or download the editable document in XMind format.

This mind map is a general version for CTOs across industries. You may find it useful to create a version of this specific to your role. I plan to expand this to include more information over time and to keep it current with the technology landscape. If you create versions of this that you are willing to share, please let me know via comments here or via Twitter @rajivpant.

CTO Mind Map version 1.0 by Rajiv Pant

 

3 Roles of a CTO: Culture. Technology. Operations.

This is a guide for CTOs, VPs of Software Engineering and other technology managers responsible for a software engineering organization. The purpose of this checklist is to help the CTO cover the areas of culture, technology and operations in their teams. It is presented in the form of a memo to direct reports.



Dear Tech Management Team Colleagues,

For those of you who have weekly 1:1 meetings with me, this template is a guide for our regular discussions. I value your experience, so please feel free to suggest making this format even better. I’d like us to cover three major areas on a regular basis.

  1. Culture
  2. Technology
  3. Operations

Each of these three major areas is further divided into three sub-categories containing a list of items to consider reviewing.

The first time you see this list, it may seem too long to review in a 30-minute meeting. This is a guideline to structure our conversations. You are not expected to discuss each one of these at every meeting. This checklist will help us review things that are relevant at the time. Managers have successfully used this checklist to review pertinent items in less than 30 minutes.

Tip: Here is one way to effectively use this. Let us both spend 5 to 10 minutes to read this checklist in advance of each of our regular 1:1 meetings. We can even use the first 5 minutes of our meeting to read it. Then we will both have a good idea of which items are relevant for the next discussion from our perspectives.

Culture “people, behaviors & teamwork”

Relationships

  1. How well are your team members collaborating with each other?
  2. …with their colleagues in other tech teams?
  3. …with their stakeholders, customers and executives?
  4. Are there any tensions that I need to be aware of?
  5. Advocacy: What should we do to be better understood, respected, and appreciated by our stakeholders, customers and executives and vice versa?
  6. Anything in this area that you are waiting on or need from me?
  7. Is there anything non-work-related that you’d like to share?

Retention

  1. Are the people in your teams happy with their work? How is team morale?
  2. Is the work intellectually challenging?
  3. Are they learning new things and getting better at existing skills?
  4. Do they feel they are making a positive impact?
  5. Are we taking good care of them?
  6. Are we proactively providing feedback, coaching, and training?
  7. Is anyone considering leaving that you know of?
  8. Has anyone given notice?
  9. Is there anything related to retention that you are waiting on or need from me?

Recruiting

  1. Are you feeling a staffing shortage this week?
  2. Are we thinking ahead and planning for capacity, skills and having some slack for flexibility?
  3. How many open positions do you have in your team this week? How long have they been open?
  4. What are you doing for recruiting?
  5. Is there anything related to recruiting that you are waiting on or need from me?

Technology “engineering, infrastructure & innovation”

Architecture

  1. What new technologies, platforms, products and APIs are we evaluating?
  2. …implementing?
  3. …decommissioning?
  4. …consolidating?
  5. …releasing as open source or making public?

Integration

  1. What are we doing to support integrations across teams?
  2. Are you facing any challenges with integrations across teams?

DevOps

  1. In what areas are we implementing temporary hacks?
  2. How are your apps and platforms doing with respect to their goals in Performance & Scalability?
  3. …Reliability?
  4. …Security?
  5. …Test Coverage?
  6. …Technical Debt?
  7. …On calls and P1s?
  8. How are the integrations, processes and relationships between the development, infrastructure and security folks?

Operations “projects, sustainability & recycling”

Work

  1. Are there any projects at risk?
  2. Are there any changes to a) due dates, and/or b) delivery dates?
  3. What did we a) accomplish, and b) work on over the past week?
  4. What do we plan to do over the next week?
  5. What projects/work can we decommission?
  6. What was your budget forecast? How are we doing with respect to it? Any budget issues?
  7. Did we recently say “no” to a stakeholder or executive’s project request (or say something would be very hard to do) that I should know about?
  8. … any that we said “yes” to that I should know about? :-)
  9. Did we give any estimates that I should know about?
  10. What can I do to help? What do you need from me?

Learn

  1. What did we learn from the past week?
  2. Are we sharing these learnings with others who’d benefit from them?
  3. Did we do any retrospectives? What changes are we making based on retrospectives?
  4. Are there any process changes that you recommend? … for both outside and inside of your teams.
  5. How can I help?

Challenges

  1. What issues are we facing now or are likely to face in the future?
  2. What prioritization problems are we facing?
  3. What do you suggest are our countermeasures to address those issues?
  4. How can I and/or others help and support you or remove obstacles from your path?

Mixing it up

To prevent our weekly discussions from feeling too structured and getting stale, I suggest mixing it up a bit. Let us try this format for 3 out of every 4 of our regular 1:1s and keep 1 meeting free-form.

We can also break monotony by switching the locations of these meetings and having some of these discussions walking about.

Why discuss this in a meeting and not ask for this information in a weekly status report?

… because no one likes to write a status report, but everyone likes to talk :-)

Let us take a test-and-learn approach with this and adjust as we go along.

Thank you in advance for your help with this.


This article is mirrored at LinkedIn and Medium.

3-5-7 Meeting Format for Weekly Staff Meetings

If you are the manager of a team of people at your job, here is a format we suggest for running your staff meetings. We call it the 3-5-7 format because of its convention of giving 3 to 5 minutes per person to answer 7 questions. This system assumes that you have fewer than ten direct reports so that you can complete such a staff meeting in under one hour.

The purpose of a staff meeting need not be to get status reports. If you have excellent collaboration tools at work where statuses, issues and risks are already documented, that’s preferable. Some companies like Automattic (WordPress) make great use of internal blogs for communication. However, face-to-face meetings continue to be useful because our brains have evolved to be most effective in face-to-face conversations for many kinds of interaction.

An in-person (or via video conference) discussion structured around these questions is likely to be effective in finding solutions, building a more collaborative team and keeping everyone on the same page.

Here are the seven questions we suggest you request each attendee to come prepared to answer.

  1. What did we (you and the team reporting in to you) do over the past week?
  2. What did you learn over the past week?
  3. What do we (you and the team reporting in to you) plan to do over the next week?
  4. What issues are we (you and the team reporting in to you) facing now or are likely to face in the future?
  5. What do you suggest are our countermeasures to address those issues?
  6. What do you need help with from the rest of us in this meeting?
  7. Is there anything non-work-related that you’d like to share?

Each person may answer the seven questions in the order of their choice and may also combine the answers to multiple questions. The only requirement is that all seven areas be answered in a focused, efficient, and effective narrative lasting three to five minutes.
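As a rough check of the timing behind the format, here is a minimal Python sketch. The function name and the sample team sizes are my own illustrative assumptions; only the 3 to 5 minutes per person and the under-one-hour target come from the text above.

```python
def meeting_minutes(attendees, minutes_per_person=(3, 5)):
    """Return the (shortest, longest) total speaking time in minutes.

    minutes_per_person is the 3-to-5-minute range from the 3-5-7 format.
    """
    shortest, longest = minutes_per_person
    return attendees * shortest, attendees * longest


# Hypothetical team sizes, both under the ten-direct-reports assumption.
for attendees in (5, 9):
    shortest, longest = meeting_minutes(attendees)
    print(f"{attendees} attendees: {shortest} to {longest} minutes")
```

With nine attendees the longest case comes to 45 minutes, which leaves some room for transitions within the hour.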

Some of this advice is based on management experiences shared by Don Kiefer in an operations management class he teaches at the MIT Sloan School of Management.

5 Productivity Tips for Executives in Leadership & Management Roles

Here are 5 productivity tips for executives in leadership & management roles. Each tip involves the number 5.

  1. Every morning (or the night before), make a prioritized list of the top 5 things you plan to accomplish that day. These are your must-do tasks for the day. At the day’s end (or when making the next day’s list), review how many of the 5 items you completed successfully. Learn from past data when planning your current top 5 things.
  2. Whenever practical, write emails and replies in 5 sentences or less. Link to five.sentenc.es in your email signature to explain this policy to your recipients.
  3. Time box your presentations, proposal pitches and plans/project descriptions at 5 minutes. Learn via www.google.com/search?q=5+minute+presentations how to make effective presentations in 5 minutes. Limit certain conversations, phone calls and quick impromptu meetings to 5 minutes or less.
  4. Wake up at 5 am or soon after and leave the office to go home soon after 5 pm.
  5. Do not check your email, social media and other messages every 5 minutes.


CAREER-CLEAR: An Employee Evaluation and Career Development System

CAREER-CLEAR is a system for doing fair, consistent and constructive employee performance evaluations and determining employee rank, title and compensation. It is meant to be used by supervisors to identify areas for improvement for their employees and to guide their career growth.

Employees are scored in a total of 5 categories. Up to 10 points can be earned in each category for a total of up to 50 points. The final score is then multiplied by a factor of 2 to give a standard scale of 0 to 100. Using a normalized 100-point scale allows it to remain consistent (by adjusting the factor) even as companies add/remove categories and items.

The scoring for each item follows a simple but strict 3-level scale of 0 (below baseline), 1 (at baseline) or 2 (better than baseline). There are no fractional “in between” scores. For example, you must not score someone 1.5. You must pick either 1 or 2. This 3-options-only scale is meant to minimize vagueness. For the same reason, a wider scoring range like 1 to 5 (commonly seen in star rating systems) is not used. A score of 0 in an item is not necessarily bad. If you are not seeing at least a few 0 scores for most employees, you have set your baselines for each item too low.

The baseline for each item is the same for everyone from the programmer-apprentice to the VP of Engineering. The baseline level — i.e. what quality of performance in that item rates a score of 1 — must be defined in advance for each item as unambiguously as possible. This can be done by senior management or by management consultants hired for this purpose. Doing this in consultation with the employees (who are to be rated) and clients/stakeholders is recommended.

The resulting total score is meant to be mapped to the employee’s level of seniority/rank for title and compensation. That means within a job functional area, employees at senior levels should score higher than employees at junior levels.

For example, a score of 81-100 could map to director/VP levels; 61-80 manager; 41-60 engineer/contributor; 21-40 junior level/apprentice. Since different functional areas — for example, software engineering and quality assurance testing — may have different pay scales, this score maps directly to rank/title, and those are mapped to salaries corresponding to the functional areas’ market rates.
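To make the arithmetic concrete, here is a minimal Python sketch of the scoring and mapping described above. It is an illustration under stated assumptions, not an official implementation: the function names, the data layout and the sample scores are hypothetical. The rank bands are the example bands from the previous paragraph, which does not give a band for scores below 21.

```python
# Sketch of the CAREER-CLEAR arithmetic. All names here are illustrative assumptions.
CATEGORIES = ("Caliber", "Leadership", "Expertise", "Role", "Discretionary")
ITEMS_PER_CATEGORY = 5
ALLOWED_ITEM_SCORES = (0, 1, 2)  # strict 3-level scale, no fractional scores

# Example bands quoted in the text; scores below 21 have no band there.
RANK_BANDS = (
    (81, 100, "director/VP"),
    (61, 80, "manager"),
    (41, 60, "engineer/contributor"),
    (21, 40, "junior level/apprentice"),
)


def total_score(item_scores):
    """Return the normalized 0-100 score for a dict of category -> five item scores."""
    raw = 0
    for category in CATEGORIES:
        scores = item_scores[category]
        if len(scores) != ITEMS_PER_CATEGORY:
            raise ValueError(f"{category}: expected {ITEMS_PER_CATEGORY} item scores")
        for score in scores:
            if score not in ALLOWED_ITEM_SCORES:
                raise ValueError(f"{category}: item score must be 0, 1 or 2, got {score}")
        raw += sum(scores)
    # Highest possible raw total: 5 categories x 5 items x 2 points = 50.
    max_raw = len(CATEGORIES) * ITEMS_PER_CATEGORY * max(ALLOWED_ITEM_SCORES)
    factor = 100 / max_raw  # 2 with five categories of five items
    return round(raw * factor)


def rank_for(score):
    """Map a 0-100 score to the example rank bands, or None if no band applies."""
    for low, high, rank in RANK_BANDS:
        if low <= score <= high:
            return rank
    return None


# Hypothetical employee: raw total 23 out of 50.
example = {
    "Caliber": [1, 1, 2, 1, 1],
    "Leadership": [1, 0, 1, 1, 0],
    "Expertise": [2, 1, 1, 1, 1],
    "Role": [0, 0, 1, 2, 0],
    "Discretionary": [1, 1, 1, 1, 1],
}
score = total_score(example)
print(score, rank_for(score))
```

Deriving the factor from the maximum raw score, rather than hard-coding 2, is one way to keep the 100-point scale stable if categories or items are added or removed.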

You will notice that a lot of emphasis is given to leadership and management qualities. This is designed for the system to work across the wide range of skills from intern to VP. At first, this may seem like the system is unfairly skewed in favor of seniority and higher level employees. The system, however, is designed to favor skills and better level of performance in multiple areas.

The first four categories are described below. The fifth category is defined as discretionary/user-defined. CAREER-CLEAR is designed to be used in the real world, in a diversity of organizations and on a regular basis. The system won’t succeed if it is too rigid. On the other hand, the system must meet its goals of being fair, consistent and constructive for all employees. To accommodate and balance these goals, 20% of the criteria are meant to be user-defined at the discretion of the manager within the fair, consistent and constructive guidelines.

It is inspired by systems described to be in use at Microsoft, Construx, FogCreek (Joel on Software) and Conde Nast Digital Technology. The latter was developed by Bobby Chowdhury, Brian Murphy, Janet Kasdan and Rajiv Pant.

The 5 categories are: Caliber, Leadership, Expertise, Role and Discretionary.

Caliber

This section measures the talent of the employee in general (non-technical) areas.

Scoring: Above Average=2, Average=1, Below Average=0. Add the score for each of the heuristics. Max Score=10 points.

  1. Ownership – Has identifiable long-term ownership of projects. This is a measure of the criticality, complexity and / or number of projects the employee has ownership in.
  2. Responsibility – Is consistently reliable in terms of deliverables and time.
  3. Communication – Communicates effectively with peers and other colleagues. Listens to and understands others’ viewpoints, challenges, needs and desires.
  4. Consistency – Is approachable, predictable, receptive and consistently applies good judgment in all interpersonal interactions in the work place.
  5. Innovative – Innovates and stays abreast of emerging technologies and finds ways to incorporate those technologies into systems.

Leadership

This section evaluates the positive influence the employee has on others.

Scoring: Above Average=2, Average=1, Below Average=0. Add the score for each of the heuristics. Max Score=10 points.

  1. Teacher, Coach & Motivator – Mentors others, makes great use of all information sharing tools available and is an active presenter. Rallies the troops and improves morale.
  2. Enabler – Empowers and enables others to succeed.
  3. Exemplary – Leads by example and goes above and beyond the ‘requirements’.
  4. Maturity & Humility – Embraces others’ solutions, even when incompatible with one’s own. Incorporates feedback from others to find the best solutions.
  5. Connector – Has familiarity with the ecosystem beyond one’s own projects. Functions as a hub which others are drawn to for a quick answer or a quick redirect towards an answer.

Expertise

This section quantifies the skills and experience of the employee related to the job function.

Scoring: Above Average=2, Average=1, Below Average=0. Add the score for each of the heuristics. Max Score=10 points.

  1. Fundamentals – Understands the core technical concepts aligned with the given job function. This may include data structures & algorithms, testing, networking, etc.
  2. Breadth of Expertise – Is a subject matter expert and go-to person for many areas of technology.
  3. Pragmatic – Has a demonstrated ability to identify the best solution to balance what’s most theoretically ideal against what might be the most practical due to concerns about security, scalability, time to market pressures and cost.
  4. Automator – Consistently works to drive improvement in processes and systems.
  5. “Boy/Girl Scout Rule” – Leaves code and systems better off than they found them.

Role

This section enumerates the employee’s role and areas of contribution within the organization and beyond.

Scoring: Above Average=2, Average=1, Below Average=0. Add the score for each of the heuristics. Max Score=10 points.

  1. Strategic – Provides sound vision for broad, long-term goals.
  2. Tactical – Oversees many projects or activities that move the organization towards strategic goals.
  3. Operational – Steers day-to-day processes that achieve the tactical goals.
  4. Executional – Implements repetitive tasks that make up the operational processes. A measure of quantity and more importantly, quality of work produced.
  5. Industry Recognition – Is recognized externally as a leading technologist through contributions to open source projects, blogging, writing books, participating in technical committees, speaking at conferences, etc.

The following are some examples to illustrate strategic, tactical, operational and executional.

  • Strategic: “Our new Web application will become one of the top three, preferably #1, in its space in the US market.”
  • Tactical: “We will hire a small team to develop and launch it. An office location would be required to meet partners and clients. We will also need additional funding.”
  • Operational: “We will hire a great software architect, 2 expert engineers, set up office in Manhattan, and have goal of reaching $500,000 in additional funding by the end of the year.”
  • Executional: “The architect designs the Web application in collaboration with the engineers. The engineers and the architect implement it. The team then makes it live and markets it via social networks and other channels.”

Discretionary

Please be sure to adhere to the goals of being fair, consistent and constructive for all employees in using this discretionary section. This category is not meant to be used to justify favoritism nor meant to be arbitrary. Good discretion comes from rational, reasonable and relevant criteria. Place items here that are not already covered in other categories and are important to your organization. A good rule of thumb is that you must be able to justify any criteria you apply here.

Scoring: Above Average=2, Average=1, Below Average=0. Add the score for each of the heuristics. Max Score=10 points.

  1. discretionary / user-defined item 1
  2. discretionary / user-defined item 2
  3. discretionary / user-defined item 3
  4. discretionary / user-defined item 4
  5. discretionary / user-defined item 5

CAREER-CLEAR version 2.1 2010-Oct-13

Hosting Large-Scale Web Sites: Contract Review Guide for the CTO

If you host and operate large-scale Web sites, or negotiate contract agreements with vendors that provide such services, you need to understand what should be included in a Web hosting infrastructure. This knowledge will help you in three areas:

  1. Providing reliability, scalability & good performance
  2. Minimizing risks via security, privacy, regulatory compliance and reduction of vulnerability to potential lawsuits
  3. Reducing and controlling costs

This guide is meant to help you review upcoming contracts as well as existing services.

Likely audience for this article: Managers, directors and vice presidents of technology, operations or finance at organizations operating large-scale Web sites; Executives supervising technology: CTO, CIO, CFO, COO.

Seven Aspects of Large-Scale Web Hosting

Large-scale Web hosting infrastructure and services can be organized into the following seven areas:

  1. Servers & Environments
  2. Network & Other Appliances
  3. Managed Hosting Services
  4. Third-party Provided Services
  5. Program Management Office, PMO
  6. Account Management
  7. Infrastructure & Facilities

Checklist for Review

You can use the following checklist to review your hosting services or a vendor’s proposal.

What to look for

When you review each item below, consider:

  • Is this item included in the vendor’s proposal or in the services we are currently receiving? If it is not included, what are the good reasons it isn’t included?
  • Is this needed for my organization’s current business requirements? Can we do without it? Is it a must have or nice to have for present and reasonable future needs?
  • What are the alternatives?
  • What is the unit price of this item? How does the price scale up as needs grow? How does the price scale down when need for this item decreases?
  • What level of fault-tolerance does this item need? i.e. redundancy, standby backups, time to recover

Some of the above review questions may apply only to things and not apply to services and processes.

Servers

Servers may be physical hardware servers and/or virtual servers managed using software such as VMware, Parallels Virtuozzo or Xen. The services listed below can each run on separate servers, or multiple services can run on one server. It is generally better to have each server run only one (or a minimal number) of the major services listed below. That reduces complexity and saves expensive staff time spent maintaining, troubleshooting and recovering. Virtualization makes it economical to run multiple virtual servers on shared physical hardware.

The following is a list of commonly found services at large-scale Web sites that require servers.

  • Web
  • Application
    • Content Management software. This is the software that the Editorial and Production teams use to submit, edit, package and manage articles, photos and other Web site content
    • Dynamic Content Assembly. Typically done using Portal Server software, either third-party supplied or in-house developed
    • Data Processing. E.g. workflow engines, jobs/tasks processing servers
    • Middleware
    • Other applications. These are applications that happen to be separate from the main content management system. They could be separate for any number of reasons. E.g. blogs, forums
  • Database

Server Environments

An environment is a self-sufficient set of servers assigned to serve a purpose as described below. Large-scale Web sites typically utilize multiple environments.

  • Production
    • This serves the Web sites to the customers and public.
    • Typically has 99.9% or higher uptime guarantee in the Service Level Agreement
      Please refer to the accompanying table titled Understanding SLA Uptime Guarantee Percentages to compare different time windows when the SLA Uptime measurement gets reset. I recommend that you ensure that the reset window you get is the same duration as your billing cycle (usually monthly) or shorter. This will help avoid having long downtimes without penalty.
  • Staging
    • This is the environment where content packages are developed, integrated and previewed by Editorial, Design and Production teams before they are published to the end-users, for example, when working on a major site redesign or relaunch for several months. Since the tech teams are often making changes to the Development Integration and QA environments, those environments are not suitable for content integration work by the Editorial and Design teams. Staging is used in large-scale Web sites where multiple Editors, Designers and Production staff are collaboratively creating content packages and new sections. In smaller Web sites or in cases where just one or two Editors are working on a piece of content like an individual article, previewing is done in the Production environment itself with access controls.
  • Quality Assurance (QA)
    • The QA engineers perform Functional Testing and Load Testing here. Doing functional testing while a load test is running is sometimes a good idea as it simulates usage closer to live production.
  • Development Integration
    • Software product code developed by different engineers is integrated here. There could be continuous integration or nightly builds.
    • This is where developers ensure that their code works with other developers’ code (it does not break the build and does not conflict in ways that result in undesired functionality).
    • Programmers should ensure that the product works here before handing it off to the QA engineers for testing
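To illustrate why the reset window of the Production uptime guarantee matters, here is a minimal Python sketch that computes how much downtime a 99.9% guarantee permits within different reset windows. The window lengths (30, 90 and 365 days) and the function name are my own assumptions for illustration; this does not reproduce the accompanying table.

```python
# Assumed reset-window lengths in hours; real contracts may define these differently.
WINDOW_HOURS = {
    "month (30 days)": 30 * 24,
    "quarter (90 days)": 90 * 24,
    "year (365 days)": 365 * 24,
}


def allowed_downtime_hours(uptime_percent, window_hours):
    """Return the downtime, in hours, permitted within one reset window."""
    return window_hours * (100.0 - uptime_percent) / 100.0


for label, hours in WINDOW_HOURS.items():
    downtime = allowed_downtime_hours(99.9, hours)
    print(f"{label}: {downtime:.2f} hours of downtime allowed")
```

Under these assumptions, a 99.9% guarantee allows about 43 minutes of downtime in a 30-day window but almost 9 hours in a yearly window. With a yearly reset, a single outage of several hours could stay within the guarantee, which is why a window no longer than the billing cycle is preferable.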

In a virtualized system the environments may not be physically separate and may regularly grow and shrink at different times. For example when hosted at a cloud computing provider, the QA environment may scale up during load testing and shut down completely during the hours the QA team is not working.

Network & Other Appliances

These are devices to which various servers are directly or indirectly connected.

Managed Hosting Services

  • Systems Administration
    • This typically includes all the management of the physical hardware up to and including the operating system and popular applications that complement the operating system.
  • Database Administration Services
  • Applications Management Services
    • This typically includes all the administration of the applications that run on top of the operating system.
  • Systems Monitoring, Alerting & Reporting
  • Web Support Help Desk, 24×7

Third-party Services

Program Management Office, PMO

  • Project Management
    • PM people, organization, processes
    • Collaborative project management tools, e.g. JIRA, RallyDev, Mingle
    • Shared documentation management tools, e.g. Wiki
  • Change Management Processes & Tools
    • Documentation system
    • Tools for source control, build & deployment
  • RASIC Matrix Describing Roles & Responsibilities
  • Escalation Flowcharts
  • Crisis Management & Emergency Procedures

Account Management

  • Customer service
  • Relationship management
  • Master Services Agreement, MSA
  • Statements of Work, SOW
  • Service Level Agreement, SLA
    • What to look for in the SLA is the subject of a separate article in this series.
  • Billing & Service Level Agreements
    • Monthly bills provided by telecommunications (telco) and hosting companies tend to be extremely complex and lengthy. As a result, they are difficult and time-consuming to review.
    • Always factor in one-time setup fees and any implementation fees paid to the vendor and/or their partners in the total cost of the contract. Don’t look only at the recurring charges. A simple way to calculate this:
      contract cost = implementation fees + (estimated recurring fees x number of recurrences committed to)
      e.g. contract cost for 1 year = setup fees + (estimated monthly charges x 12)
      For most hosting / telco contracts I recommend this simple calculation over more sophisticated methods that factor in time value of money because the recurring fees are estimates anyway.
    • Make sure that a 1-year contract is really a 1-year contract and not effectively a 13-month, 15-month or even longer contract by ensuring the following:
    • The contract’s start date is the first date for which the recurring billing begins. This is useful in determining the default end date of the contract. For example:
      If you agree to a 1-year contract with monthly billing, and the first monthly bill is for services provided April 1, 2010 through April 30, 2010, then the default termination date for the contract is March 31, 2011. If the service provider estimates 3 months for implementation that ends on June 30, 2010 and they charge you the monthly fees for April, May and June, don’t let the vendor tell you the contract start date is July 1. If you paid the monthly fees for services provided from April 1, then the start date is April 1.
    • If the vendor charges you fractional monthly fees for the implementation period and/or charges you one-time setup fees, then you should negotiate and agree on a contract end date that is fair to both parties. Use this guideline: The contract commitment should aim towards a certain money target (revenue for the vendor). If the implementation fees are equivalent to, say, 3 months of recurring billing, you might agree that the end date is 9 months after the start of the first recurring billing cycle.
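
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a pricing model: the function names and the dollar figures are hypothetical, and the second function assumes the guideline that implementation fees count towards the vendor's revenue target.

```python
def contract_cost(implementation_fees, recurring_fee, recurrences):
    """One-time fees plus estimated recurring fees times the committed periods."""
    return implementation_fees + recurring_fee * recurrences


def committed_months(target_months, recurring_fee, implementation_fees):
    """Months of recurring billing left after crediting implementation fees
    towards the revenue target (assumption: fees convert to whole months)."""
    equivalent_months = implementation_fees / recurring_fee
    return max(0, round(target_months - equivalent_months))


# Hypothetical figures for illustration only.
setup_fees = 30_000
monthly_charge = 10_000

print(contract_cost(setup_fees, monthly_charge, 12))      # 150000
print(committed_months(12, monthly_charge, setup_fees))   # 9
```

With setup fees equal to three months of billing, the sketch arrives at the same 9-month commitment as the guideline above.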

Tips for Reviewing Technology Vendor Contracts and Service Level Agreements (SLA)

  • Don’t let the vendor use a lower monthly rate for calculating SLA credits. An example: The vendor’s contract section X.YZ1 states that the customer’s service credits will be calculated against a monthly rate of $6,000.00 per month. However, the vendor’s estimated total charges seem to be at least $10,000 per month. Don’t let the vendor calculate service credits based on a lower monthly bill than the actual monthly bill.
  • Don’t get locked into a deal where you could be stuck with overages every month. An example: The vendor’s contract section X.YZ2 locks the customer into the vendor’s service for two years for a total of between $80K/month and $100K/month if the customer remains under 100 million page views per month. If the customer’s page views go over 100 million in any month, then there will be additional overage charges. There is neither an out clause nor a pre-determined next rate tier in the customer’s favor in the contract. If the customer’s traffic rises to regularly being over 100 million page views per month, the customer will be trapped in a contract with recurring overage charges. Make sure that if you have overages in the future, you can move into the next tier, preferably at a better rate.
  • Beware of vaguely defined scheduled maintenance and make sure scheduled maintenance needs the customer’s prior approval. An example: The SLA section X.YZ3 states that the vendor can schedule maintenance downtime with 48 hours notice. They can give the customer notice by one of many means. There is no requirement for the customer to review or even acknowledge receipt. This is slanted too much in the vendor’s favor. The customer should have some ability to reschedule scheduled maintenance or ask for it to be shorter in duration if it interferes with the customer’s business.
  • Make sure that service credits can also be redeemed as cash. An example: The SLA section X.YZ4 states that service credits are not cash. Such credits will only be applied to future service billings. This is usually fine, except if it happens in the last month of the contract or if there is not enough future usage to use up the credits. In such instances, service credits should be payable as cash.
  • If the vendor will charge you for overages, the vendor needs to be responsible for service at the overage usage levels too. An example: The SLA section X.YZ5 states that response time service credits will not apply if monthly page views exceed 120 million. This is not fair to the customer. The vendor is fine with charging the customer overage fees, but not with being responsible for the level of service at those usage levels. If the vendor charges overage fees, it should bind them to providing full service at the exceeded usage as well.
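
To see how much the first tip can matter, here is a small Python sketch that compares a service credit calculated against the contract's stated rate with one calculated against the actual bill. The $6,000 and $10,000 figures come from the example above; the 10% credit percentage and the function name are assumptions made purely for illustration.

```python
def service_credit(billing_basis, credit_percent):
    """Credit owed for an SLA miss, as a percentage of the billing basis."""
    return billing_basis * credit_percent / 100


stated_rate = 6_000    # monthly rate named in the contract section
actual_bill = 10_000   # what the customer really pays per month
credit_percent = 10    # hypothetical credit for one SLA miss

credit_on_stated_rate = service_credit(stated_rate, credit_percent)
credit_on_actual_bill = service_credit(actual_bill, credit_percent)

print(credit_on_stated_rate)                          # 600.0
print(credit_on_actual_bill)                          # 1000.0
print(credit_on_actual_bill - credit_on_stated_rate)  # 400.0
```

Under these assumed numbers the customer would forgo $400 of credit for every SLA miss if the lower rate were used as the basis.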

Infrastructure & Facilities

This item, infrastructure & facilities, is beyond the scope of this article. It includes the buildings, electric power, generators, climate control, physical security and related staffing.

This article is part of a series titled “Guide for the CTO: A compilation of articles on how to lead and manage technologies, projects and people”.

Understanding SLA Uptime Percentages

The table below helps illustrate why you should ensure that the “SLA Uptime Measurement Meter Reset Window” is the same duration as your billing cycle or shorter.

Availability % | Downtime per year | Downtime per month (30 days) | Downtime per week | Downtime per day
90 “one nine” | 36.5 days | 72 hours | 16.8 hours | 2.4 hours
95 | 18.25 days | 36 hours | 8.4 hours | 1.2 hours
97 | 10.95 days | 21.6 hours | 5.04 hours | 43.2 minutes
98 | 7.3 days | 14.4 hours | 3.36 hours | 28.8 minutes
99 “two nines” | 3.65 days | 7.2 hours | 1.68 hours | 14.4 minutes
99.5 | 1.83 days | 3.6 hours | 50.4 minutes | 7.2 minutes
99.8 | 17.52 hours | 86.4 minutes | 20.16 minutes | 2.88 minutes
99.9 “three nines” | 8.76 hours | 43.2 minutes | 10.08 minutes | 1.44 minutes
99.95 | 4.38 hours | 21.6 minutes | 5.04 minutes | 43.2 seconds
99.99 “four nines” | 52.56 minutes | 4.32 minutes | 1.01 minutes | 8.64 seconds
99.999 “five nines” | 5.26 minutes | 25.9 seconds | 6.05 seconds | 0.864 seconds
99.9999 “six nines” | 31.5 seconds | 2.59 seconds | 0.605 seconds | 0.0864 seconds
99.99999 “seven nines” | 3.15 seconds | 0.259 seconds | 0.0605 seconds | 0.0086 seconds

Sources: Wikipedia, my calculations
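
The figures in the table can be reproduced with a short calculation. The Python sketch below is one way to do it, assuming a 365-day year and a 30-day month; the function and constant names are my own. It also shows why the reset window matters: at 99.9%, a yearly window allows the whole annual downtime budget to be spent in a single month.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent, window_days):
    """Downtime budget, in seconds, for one measurement window."""
    window_seconds = window_days * SECONDS_PER_DAY
    return window_seconds * (100 - availability_percent) / 100


# Budget at 99.9% availability under two different reset windows.
monthly_budget = allowed_downtime_seconds(99.9, 30)
yearly_budget = allowed_downtime_seconds(99.9, 365)

print(round(monthly_budget / 60, 2), "minutes per 30-day window")
print(round(yearly_budget / 3600, 2), "hours per 365-day window")
```

With a monthly reset, an outage of a few hours breaches the SLA for that month. With a yearly reset, the same outage may still be within the vendor's budget, and no credit would be due.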

Checklist for Migration of Web Application from Traditional Hosting to Cloud

In 2010, Cloud Computing is likely to see increasing adoption. Migrating Web applications from one data center to another is a complex project. To assist you in migrating Web applications from your hosting facilities to cloud hosting solutions like Amazon EC2, Microsoft Azure or RackSpace’s Cloud offerings, I’ve published a set of checklists for migrating Web applications to the Cloud.

These are not meant to be comprehensive step-by-step, ordered project plans with task dependencies. These are checklists in the style of those used in other industries like Aviation and Surgery where complex projects need to be performed. Their goal is to get the known tasks covered so that you can spend your energies on any unexpected ones. To learn more about the practice of using checklists in complex projects, I recommend the book The Checklist Manifesto by Atul Gawande.

Your project manager should adapt them for your project. If you are not familiar with some of the technical terms below, don’t worry: Your engineers will understand them.

Pre-Cutover Migration Checklist

The pre-cutover checklist should not contain any tasks that “set the ship on sail”, i.e. you should be able to complete the pre-cutover tasks, pausing and adjusting where needed without worry that there is no turning back.

  • Set up communications and collaboration
    • Introduce migration team members to each other by name and role
    • Set up email lists and/or blog for communications
    • Ensure that appropriate business stakeholders, customers and technical partners and vendors are in the communications. (E.g. CDN, third-party ASP)
  • Communicate via email and/or blog
    • Migration plan and schedule
    • Any special instructions, FYI, especially any disruptions like publishing freezes
    • Who to contact if they find issues
    • Why this migration is being done
  • Design maintenance message pages, if required
  • Set up transition DNS entries
  • Set up any redirects, if needed
  • Make CDN configuration changes, if needed
  • Check that monitoring is in place and update if needed
    • Internal systems monitoring
    • External (e.g. Keynote, Gomez)
  • Create data/content migration plan/checklist
    • Databases
    • Content in file systems
    • Multimedia (photos, videos)
    • Data that may not transfer over and needs to be rebuilt at new environment (e.g. Search-engine indexes, database indexes, database statistics)
  • Export and import initial content into new environment
  • Install base software and platforms at new environment
  • Install your Web applications at new environment
  • Compare configurations at old environments with configurations at new environments
  • Do QA testing of Web applications at new environment using transition DNS names
  • Review rollback plan to check that it will actually work if needed.
    • Test parts of it, where practical
  • Lower production DNS TTL for switchover
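
The configuration comparison step in this checklist lends itself to a small script. The following Python sketch assumes each environment's configuration has already been loaded into a dictionary; the setting names and values are hypothetical, and real configurations will usually need format-specific parsing first.

```python
MISSING = "<missing>"


def diff_config(old, new):
    """Return settings that differ, or that exist in only one environment.

    The result maps each setting name to an (old value, new value) pair.
    """
    differences = {}
    for key in sorted(old.keys() | new.keys()):
        old_value = old.get(key, MISSING)
        new_value = new.get(key, MISSING)
        if old_value != new_value:
            differences[key] = (old_value, new_value)
    return differences


# Hypothetical settings for illustration only.
old_environment = {"max_connections": 200, "cache_ttl": 300, "timezone": "UTC"}
new_environment = {"max_connections": 100, "cache_ttl": 300, "log_level": "info"}

for name, (old_value, new_value) in diff_config(old_environment, new_environment).items():
    print(f"{name}: old={old_value} new={new_value}")
```

Each difference is either intentional and worth documenting, or a migration defect to fix before cutover.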

During-Cutover Migration Checklist

  • Communicate that migration cutover is starting
  • Data/content migration
    • Import/refresh delta content
    • Rebuild any data required at new environment (e.g. Search-engine indexes, database indexes, database statistics)
  • Activate Web applications at new environment
  • Do QA testing of Web applications at new environment
  • Communicate
    • Communicate any publishing freezes and other disruptions
    • Activate maintenance message pages if applicable
  • Switch DNS to point Web application to new hosting environment
  • Communicate
    • Disable maintenance message pages if applicable
    • When publishing freezes and any disruptions are over
    • Communicate that the Web application is ready for QA testing in production.
  • Flush CDN content cache, if needed
  • Do QA testing of the Web application in production
    • From the private network
    • From the public Internet
  • Communicate
    • The QA testing at the new hosting location’s production environment has passed
    • Any changes for accessing tools at the new hosting location
  • Confirm that DNS changes have propagated to the Internet
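
The final DNS check can be partly automated. The Python sketch below tests whether a hostname resolves only to the addresses of the new environment. It is a rough aid, not a full propagation test: it consults only the resolver of the machine it runs on, so you would want to run it from several networks. The hostname and IP addresses are placeholders, and the resolver is passed in as a parameter so that the check can be exercised without network access.

```python
import socket


def resolves_to(hostname, expected_ips, resolver=socket.gethostbyname_ex):
    """True if every address returned for hostname belongs to expected_ips."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        # Covers socket.gaierror and socket.herror (lookup failures).
        return False
    return bool(addresses) and set(addresses) <= set(expected_ips)


def fake_resolver(hostname):
    """Stand-in resolver for the example; returns a documentation-range IP."""
    return (hostname, [], ["203.0.113.10"])


new_environment_ips = ["203.0.113.10", "203.0.113.11"]  # placeholder addresses
print(resolves_to("www.example.com", new_environment_ips, resolver=fake_resolver))
```

In real use, omit the `resolver` argument so the system resolver is queried, and repeat the check until the lowered TTL has expired everywhere you can test from.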

Post-Cutover Migration Checklist

  • Cleanup
    • Remove any temporary redirects that are no longer needed
    • Remove temporary DNS entries that are no longer needed
    • Revert any CDN configuration changes that are no longer needed
    • Flush CDN content cache, if needed
  • Check that incoming traffic to old hosting environment has faded away down to zero
  • Check that traffic numbers at new hosting location don’t show any significant change from old hosting location
    • Soon after launch
    • A few days after launch
  • Check monitoring
    • Internal systems monitoring
    • External (e.g. Keynote, Gomez)
  • Increase DNS TTL settings back to normal
  • Archive all required data from old environment into economical long-term storage (e.g. tape)
  • Decommission old hosting environment
  • Communicate
    • Project completion status
    • Any remaining items and next steps
    • Any changes to support at new hosting environment
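
The traffic comparison in this checklist can be reduced to a simple percentage check. The Python sketch below is one way to flag a significant change; the 10% tolerance and the page view counts are assumptions for illustration, and you should choose a threshold that reflects your site's normal day-to-day variation.

```python
def traffic_change_percent(old_views, new_views):
    """Percentage change in traffic from the old environment to the new one."""
    if old_views <= 0:
        raise ValueError("old_views must be positive")
    return (new_views - old_views) * 100 / old_views


def is_significant_change(old_views, new_views, tolerance_percent=10):
    """True if traffic moved by more than the tolerance in either direction."""
    return abs(traffic_change_percent(old_views, new_views)) > tolerance_percent


# Hypothetical daily page views before and after the cutover.
print(traffic_change_percent(1_000_000, 950_000))   # -5.0
print(is_significant_change(1_000_000, 950_000))    # False
print(is_significant_change(1_000_000, 700_000))    # True
```

Run the comparison soon after launch and again a few days later, comparing like with like (for example, the same day of the week).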

The checklists are also published on the RevolutionCloud book Web site at www.revolutioncloud.com/2010/01/checklists-migration/ and on the Checklists Wiki Web site at www.checklistnow.org/wiki/IT_Web_Application_Migration

Opinion on the Amazon S3 Outage; Checklist for Dealing with Outages

My journalist colleagues at Wired.com published some of my comments related to Amazon S3. [1] Wired also posted another article titled Customers Shrug Off S3 Service Failure. I agree with the views of many of the customers expressed in the article. Don MacAskill, CEO of the popular photo hosting site Smugmug, wrote an understanding post about it.

Throughout my career working for media companies, I’ve held firm to the belief that the uptime, reliability, performance, scalability and security of commercial Web sites are of paramount importance. When sites that I’ve been responsible for have had issues, my colleagues and I have given our personal time and energy to resolution. With my teams, I spend considerable time on proactive measures. I’ve had the honor of working closely with and learning from some who do an excellent job running technology operations.

Experience has taught me that things can and sometimes do go wrong. Sometimes calculated risks don’t pan out. Sometimes mistakes cause problems. We are human. We should strive for perfection; we can get close to it, but not fully attain it. We should be prepared for such scenarios. When they happen, we should work diligently and expeditiously on resolution and have frequent and honest communications with stakeholders and customers. Such communications during the incident should include:

Update 2010-Jan-24: This checklist is now maintained on the Checklists Wiki Web site at:

www.checklistnow.org/wiki/IT_Incident_Reporting

During-Incident Communication Checklist

  • Current status
  • What is the full impact?
  • Estimated time to resolution
  • Any recommended workarounds until resolution, if practical
  • Assurance that it is being worked on
    • It often helps to mention who is working on it and what they are doing
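
If your team sends incident updates often, a small template can help make sure no checklist item is skipped. The Python sketch below is one possible shape for it; the class name, field names and sample text are hypothetical, and the output format is only a suggestion.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentUpdate:
    """One during-incident update covering the checklist items above."""
    current_status: str
    impact: str
    estimated_resolution: str
    who_is_working_on_it: str
    workarounds: list = field(default_factory=list)

    def render(self):
        """Format the update as plain text, one checklist item per line."""
        lines = [
            f"Current status: {self.current_status}",
            f"Impact: {self.impact}",
            f"Estimated time to resolution: {self.estimated_resolution}",
        ]
        for workaround in self.workarounds:
            lines.append(f"Workaround: {workaround}")
        lines.append(f"Being worked on by: {self.who_is_working_on_it}")
        return "\n".join(lines)


# Sample content for illustration only.
update = IncidentUpdate(
    current_status="Image uploads are failing",
    impact="Editors cannot attach new photos to articles",
    estimated_resolution="About 2 hours",
    who_is_working_on_it="Two systems engineers and the storage vendor",
    workarounds=["Publish text now and add photos after resolution"],
)
print(update.render())
```

Because the required fields have no defaults, leaving one out raises an error when the update is created, which is the point of using a checklist.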

The post-incident communications to stakeholders and customers should include:

Post-Incident Communication Checklist

  • Summary
  • What happened, how and why it happened?
    • Including full description of all impact
    • Do not blame [2] third parties or say things like “beyond our control”. A technology leader takes responsibility equally for both insourced and outsourced products and services. [3]
  • How it was resolved
    • Whether the resolution is temporary or long-term
  • Next steps
  • Plan for eliminating or minimizing this and similar incidents from happening again
  • Thank all those who helped resolve and the customers for their understanding
  • Mention the monetary credits you plan to give as per the Service Level Agreement (SLA)
    • Specify any additional ‘make goods’ or returns you plan to make to the customers above and beyond the credits as per SLA, if appropriate.

Stakeholders and customers here refer to internal customers of the technology operations team (e.g. the concerned folks in editorial, marketing, sales, finance, legal and other departments). External communications to the public Internet should be handled in consultation with legal and public relations.

S3’s outage (or any outage) isn’t to be taken lightly, but I have faith Amazon and their customers will learn from it.

Disclaimers:

  • As explained in the terms of use of this site, any opinions expressed on my personal Web site do not reflect those of any employer, past or present. My Web site and I in my personal life neither represent nor speak for any corporation.
  • I have no affiliation, financial or otherwise with Amazon.com. I happen to be a user of their products and services, some of which I like and some that I don’t.
  • Personal Web sites like this are exempt from the performance requirements of corporate Web sites :-) My personal Web site is for expressing, learning and R&D. It also happens to be hosted on Amazon EC2 and S3.
  1. Silicon Alley Insider and ValleyWag have amusing spins on it. :-)
  2. There may be extreme instances, especially when criminal activity or malicious wrongdoing was the cause, where it would be appropriate to blame someone.
  3. It is OK to mention service providers or describe external events when explaining what happened, but don’t do it in an “it was their fault, not ours” tone. The technology leader should factually describe what happened and take responsibility.