Software Engineering at Google [pdf]
Henderson, Fergus. "Software engineering at Google." arXiv preprint arXiv:1702.01715 (2017).
This paper gives a high-level overview of software engineering principles at Google. It covers technical topics as well as project and people management concepts. Since most of this is discussed only at a very high level, the paper is easy to read and gives an interesting glimpse into how things work at Google.
- Most of Google's code is stored in a single repository
- The notable exceptions are Chrome, Android and some security-critical pieces of code
- Write access is more controlled than read access: Directories list owners and one of them needs to approve any changes to the respective directories. The owner status also transfers to any subdirectories
- Most development happens at the head of the repository, not on branches. This minimizes the amount of merging that needs to be done
- Automated systems run tests when new code is added
- Code that does not pass tests is generally quickly rolled back
- Each subtree needs to have at least two owners
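The ownership rule described above (owners are listed per directory and also cover all subdirectories) can be illustrated with a small sketch. The directory names, user names, and the `OWNERS` mapping are hypothetical; the paper does not describe how the real system is implemented.

```python
from pathlib import PurePosixPath

# Hypothetical owner listings, keyed by directory (illustrative names only).
OWNERS = {
    "search": {"alice", "bob"},
    "search/ranking": {"carol", "dave"},
}


def effective_owners(path: str) -> set[str]:
    """Collect owners of the file's directory and of every parent directory."""
    owners: set[str] = set()
    directory = PurePosixPath(path).parent
    # Walk upwards: ownership of a directory also covers its subdirectories.
    for candidate in [directory, *directory.parents]:
        owners |= OWNERS.get(str(candidate), set())
    return owners


def can_submit(changed_files: list[str], approvers: set[str]) -> bool:
    """A change passes only if every touched file has an approving owner."""
    return all(effective_owners(f) & approvers for f in changed_files)
```

With this sketch, `can_submit(["search/ranking/model.py"], {"alice"})` succeeds because ownership of `search` is inherited by `search/ranking`, while approval by `carol` alone is not enough for a file directly under `search`.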
- By using the standardized build system Blaze, any Google engineer can easily build code in any part of the repository
- Declarative BUILD files describe the rules that Blaze uses to build the system
- BUILD files for Go programs can be generated automatically
- Each build is typically distributed across hundreds or thousands of machines. This makes it possible to build extremely large programs quickly
- Build steps need to be pure. They should not have side effects
- Builds are deterministic. They always produce the same results. The build team made sure to remove any kind of timestamps from builds
- Build results are cached in the cloud
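Pure, deterministic build steps are what make caching safe: if a step's output depends only on its inputs, a hash of those inputs can serve as the cache key. The following is a minimal sketch of that idea, not a description of how Blaze works internally. The in-memory `CACHE`, the function names, and the toy `concat_step` are all assumptions made for illustration.

```python
import hashlib

# In-memory stand-in for a shared build cache (hypothetical).
CACHE: dict[str, bytes] = {}


def action_key(rule: str, inputs: dict[str, bytes]) -> str:
    """Hash the rule name and all input contents, in a stable order."""
    digest = hashlib.sha256(rule.encode())
    for name in sorted(inputs):
        digest.update(name.encode())
        digest.update(hashlib.sha256(inputs[name]).digest())
    return digest.hexdigest()


def concat_step(inputs: dict[str, bytes]) -> bytes:
    """A toy pure build step: the output depends only on the inputs."""
    return b"".join(inputs[name] for name in sorted(inputs))


def build(rule: str, inputs: dict[str, bytes]) -> bytes:
    """Run the step unless an identical action has already been cached."""
    key = action_key(rule, inputs)
    if key not in CACHE:
        CACHE[key] = concat_step(inputs)
    return CACHE[key]
```

Building the same inputs twice produces one cache entry and identical output. Changing the content of any input produces a new key and therefore a fresh run of the step. A timestamp embedded in the output would break this, since two runs with equal inputs would no longer be interchangeable.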
- Google has great web-based review tools
- All changes must be reviewed by at least one person
- Code review discussions are automatically copied to a mailing list
- The repository has an experimental section for quick prototyping. Code reviews are not enforced here
- Developers are encouraged to keep individual changes in individual code reviews small
- All code in production must have unit tests
- The review tool highlights files without corresponding tests
- Integration, regression and load testing are also widely done
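A check like the one the review tool performs (flagging files without corresponding tests) could look roughly like the sketch below. The `_test.py` naming convention and the function name are assumptions for illustration only; the paper does not say how the tool matches sources to tests.

```python
def files_missing_tests(changed_files: list[str]) -> list[str]:
    """Return changed source files with no matching test file in the change."""
    changed = set(changed_files)
    missing = []
    for path in changed_files:
        if not path.endswith(".py") or path.endswith("_test.py"):
            continue  # only non-test Python sources need a companion test
        # Assumed convention: foo.py is tested by foo_test.py next to it.
        expected_test = path[: -len(".py")] + "_test.py"
        if expected_test not in changed:
            missing.append(path)
    return missing
```

For a change touching `a/foo.py`, `a/foo_test.py` and `a/bar.py`, only `a/bar.py` would be highlighted.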
- Code reviews (CRs) can be associated with a bug number
- Some teams have individuals that are responsible for prioritizing and assigning bugs. In other teams, this is done during team meetings
- There are five official languages at Google: C++, Java, Python, Go and JavaScript
- All of these languages have style guides at Google
- Developers who have demonstrated that they can write great code in a language get readability rights. During code reviews, at least one reviewer with readability status in the respective language needs to approve the change. To get readability status, one's code is thoroughly reviewed for some time
- For communicating between different programs and languages, Protocol Buffers are used
- Protocol Buffers are also integrated into Google's RPC libraries
- Many tools at Google are language-independent. This makes it very easy to switch to other projects or languages
- Google servers have debugging libraries built in
- For example, if something crashes, one can change command-line flags in a web interface and run everything again
- This greatly decreases how much conventional tools like gdb are used
- Some teams have dedicated release engineers, but mostly this is done by regular software engineers
- Releases are often done weekly or fortnightly, sometimes even daily
- Regular releases keep engineers motivated and make it easier to change ideas based on user feedback
- To deploy a release, dedicated release branches are used
- First, the change is typically pushed to a staging server. The developers of the team test the change here. Sometimes a part of the real traffic is used to check whether the staging server behaves correctly. However, the results are not sent back to actual users
- Next, the change is sent to canary servers. A subset of the user traffic is processed by these servers and users do see the responses
- If everything has worked up to this point, a gradual roll-out to production is performed. This could take a few days
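The staging, canary and gradual roll-out steps above can be sketched as a schedule of traffic percentages combined with stable user bucketing. The stage names, the percentages, and the hash-based bucketing are assumptions for illustration; the paper does not specify how traffic is split.

```python
import hashlib

# Hypothetical rollout schedule: share of user traffic served by the new release.
ROLLOUT_STAGES = [
    ("staging", 0),  # real traffic may be mirrored, but responses are discarded
    ("canary", 1),
    ("production-10", 10),
    ("production-50", 50),
    ("production-100", 100),
]


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def serves_new_release(user_id: str, percent: int) -> bool:
    """A user sees the new release once the rollout covers their bucket."""
    return bucket(user_id) < percent
```

Because the bucket is derived from a hash of the user ID, a user who is served by the new release at one stage stays on it at every later stage, which keeps the experience consistent while the percentage grows.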
- Changes visible to users need approval from people from the respective core engineering team
- There might be more approvals necessary, e.g. because of legal, privacy or security requirements
- Google has an internal launch tool for tracking what reviews are still required
- When there is a major problem in production systems, the people involved write a post-mortem document
- This document describes what happened and how it could be avoided in the future
- The focus is on the problem and how to fix it in the future, not on blaming people for it
- A lot of software at Google is rewritten from time to time
- This might seem like a waste of resources but it has some advantages
- Knowledge about the system that is rewritten is transferred to new people
- It ensures that code is written using modern technologies
- Engineers can spend 20% of their time working on a side project
- Makes it easy to experiment with different ideas
- This also keeps engineers happy, which is worth much more in productivity than the 20% of time it costs
- Individuals and teams explicitly document their goals and ways to measure their progress as Objectives and Key Results (OKRs)
- This is done at all levels of the company
- Each objective is given a score between 0 (no progress) and 1 (completely done) at the end of each quarter
- The desired average score is 0.65: Goals should be ambitious enough that they cannot all be completed
- OKRs have no impact on performance evaluations or compensation
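The scoring scheme above amounts to simple arithmetic. The sketch below uses made-up objectives and scores, and the function name is hypothetical.

```python
from statistics import mean

TARGET_AVERAGE = 0.65  # desired average score mentioned in the notes above

# Hypothetical objectives for one quarter, scored from 0 (no progress) to 1 (done).
quarter_scores = {
    "reduce build latency": 1.0,
    "migrate logging pipeline": 0.7,
    "launch new dashboard": 0.5,
    "rewrite legacy importer": 0.4,
}


def average_score(scores: dict[str, float]) -> float:
    """Average the per-objective scores after checking that each is in [0, 1]."""
    for objective, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {objective!r} must be between 0 and 1")
    return mean(scores.values())
```

An average close to `TARGET_AVERAGE` suggests the goals were ambitious but reachable. An average near 1 would indicate that the goals were set too conservatively.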
- Project approval works differently in various parts of the company
- Sometimes top-down (management makes a plan), sometimes bottom-up (engineers decide)
- There are separate ladders for engineering and management
- Research teams are embedded into product teams
- The main roles are:
- Engineering Manager: Generally former software engineers that manage people. One does not have to be a manager to lead other people though
- Software Engineer (SWE): Most of the people in software engineering are in this role. At higher levels, showing leadership does not have to mean managing / leading people
- Research Scientist: Very strict hiring criteria. Most PhDs are not hired as research scientists but as SWEs. Research scientists are evaluated on their research contributions, e.g. publications. They generally work alongside SWEs
- Site Reliability Engineer (SRE): A mixture of software engineering and system administration. The software engineering requirements are a bit lower and can be compensated by Unix or networking knowledge
- Product Manager: They coordinate software engineers and work out new feature ideas. They do not write code themselves
- Program Manager / Technical Program Manager: Instead of managing a product, they manage projects or processes. This role is rather rare
- New Googlers go through introductory classes called codelabs
- Google supports studying at external institutions as well as taking online courses
- Transferring to a different team inside the company is encouraged
- Sometimes, SWEs also do 6-month rotations as SREs
- Engineers can give each other peer bonuses, a small cash bonus for doing something beyond their regular work
- For promotion, people have to nominate themselves or have to be nominated by their manager
- The performance of managers is partly assessed by asking people in their team for feedback