Review: Working Effectively With Legacy Code

In this post we will review the book Working Effectively with Legacy Code by Michael C. Feathers.

The basic premise of the book is that we want to make a change in an existing (legacy) code base and we want to make it safely. The flowchart below describes in broad strokes the content of the book:

 

The core of the book focuses on breaking dependencies in legacy code so that we can add tests more easily. The vicious cycle is: we need to make changes to add tests, but we need tests to make changes safely. The idea is to break dependencies in safe ways, even if that temporarily leads to a worse design, until we can add tests and refactor the code.

Organization of the book

The book is divided into three parts totaling 25 chapters. In the first part the author describes concepts and introduces tools that are used in the rest of the book. In Part II, he describes several different situations and problems and provides solutions to them. Part III contains a single chapter and describes several dependency-breaking techniques. As the author suggests, the chapters are not required to be read in order.

Selected Notes

Here I’m going to list some of the techniques from the book that I found interesting and novel.

How to break dependencies

  • Seam model – the idea is to make behavior changes while minimizing code changes. One example is to add a new subclass in order to remove/mock dependencies for tests.
  • Sprouting – move blocks of code to new methods and classes which can be easily tested (see the sketch after this list).
  • Wrapper class – wrap code inside a new class which we have more control of, and can be mocked. This is a variant of the seam model, but instead of relying on inheritance it uses composition. As the author notes, this is the decorator design pattern.
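
To make the sprouting idea concrete, here is a minimal Python sketch of my own (the invoice example and names are not from the book): the new logic is “sprouted” into its own small method instead of being buried inside the long legacy routine.

class InvoiceProcessor:
    def process(self, invoices):
        total = 0
        for invoice in invoices:
            # ...long legacy code we don't dare touch yet...
            total += self.discounted_amount(invoice)  # call to the sprouted method
        return total

    def discounted_amount(self, invoice):
        # Sprouted method: small, side-effect free and easy to unit test
        # without running the rest of process().
        if invoice.amount > 1000:
            return invoice.amount * 0.9
        return invoice.amount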

A common pattern among them is keeping code changes to a minimum: the bigger the changes, the more likely it is that subtle behavior changes are introduced.

Where to test

  • Effect analysis – draw an informal DAG where nodes represent functions or classes and a directed edge means that changing one node affects the other. This helps identify which code may be affected by a change and must be tested.
  • Pinch points – The effect analysis might yield too many affected nodes. Pinch points are nodes that cover a large portion of the affected code but have relatively few dependencies, so they are easier to test.

Interpreted Languages

The book strongly assumes we are working with a compiled object-oriented language, namely C++, Java or C#. The author dedicates only a page or two to dynamic languages, in this case Ruby.

Quotes

The author shares several design principles throughout the chapters. Here are some quotes describing some of them.

Temporal coupling:

When you group things together just because they have to happen at the same time, the relationship between them isn’t very strong.

Refactoring:

Rename class is a powerful refactoring. It changes the way people see code and lets them notice possibilities that they might not have considered before.

Command and Query Separation:

A method should be a command or a query but not both. A command is a method that can modify the state of the object but that doesn’t return a value. A query is a method that returns a value but that does not modify the object.

Psychology of working with legacy code:

If morale is low on your team and it’s low because of code quality, here’s something that you can try: pick the ugliest, most obnoxious set of classes in the project and get them under test. When you’ve tackled the worst problem as a team, you will feel in control of your situation.

Conclusion

Overall, the book offers many interesting techniques for breaking dependencies and making code more testable. I’ve been using some of them in a less methodical way. I like the pragmatism of the techniques (done is better than perfect), which shows when good practices and designs are temporarily violated in order to write tests.

I’ve also enjoyed the code design tips interspersed throughout the chapters, but I disagree with some of the suggestions, for example converting static utils to methods on class instances. In my opinion static classes are a good forcing function to keep implementations modular and stateless. This is especially the case if there’s a good mocking framework available that can mock out static functions.

I found the tips less useful for dynamic languages or just-in-time-compiled runtimes that allow modifying objects at runtime, because these enable much more powerful testing frameworks that can, for example, easily mock out dependencies for us. Nowadays I work mostly with Hack and JavaScript, which both provide a lot of runtime flexibility and support for mocking dependencies. The challenge I often face is having to mock too many dependencies, which is tedious.

The major negative of the book is that it contains a lot of obvious typos and typographic errors, such as duplicated listings.

Related Posts

  • Review: Code Complete 2 – This is another review of a highly practical and pragmatic book that I believe made me a better programmer. The overlap in content is minimal, so they’re a nice complement.

Review: Code Complete 2


In this post I’ll share my notes about the book: Code Complete 2, by Steve McConnell. The book has great information about several aspects of Software Development, but it’s quite long: 862 pages.

This is not a summary of the book by any means, but rather the points I found interesting, novel or useful. I hope you find it useful, or at least that it inspires you to read the book for more details on a specific topic.

I’ve written down bullet points about each chapter, to some of which I added my own thoughts/comments.

Chapter 1 – Introduction

* Motivation: What is Software construction and why it’s important
* Explains the overall format of the rest of the book

Chapter 2 – Metaphors

* Metaphors help us understand problems better: we can take solutions developed for analogous problems and apply them to the original problem.
* A common analogy for software development is civil construction.
* Bad metaphors come from forced analogies and can be misleading.

Chapter 3 – Prerequisites (gathering requirements)

* Errors caught in the specification phase are the cheapest to fix (roughly 10x cheaper than if the error is only caught during construction).
* Different types of software require different degrees of investment in prerequisites. Three categories of software:
— business (e.g. a website),
— mission-critical (e.g. packaged software) and
— embedded/life-critical (e.g. avionics or medical devices).
* Requirements change over time, so your design has to be flexible enough to accommodate changes.

Chapter 4 – Construction Planning

* Important decisions that have to be taken before the construction phase: programming language, code conventions, development practices (e.g. pair-programming), etc.
* Technology wave and the tradeoffs of choosing technology in different stages of this wave. For example, late-wave technology is more mature, has better documentation and user-friendly error messages. On the other hand early-wave environments are usually created to address problems with existing solutions.

Comments: Adopting early technology also helps with recruiting. Programmers like new technology.

Chapter 5 – Design

* The main goal of the design should be to keep software complexity low.
* Different levels of design: software-level, packages, classes, routines and internal routines — different types of software require different amounts of design details.
* Two main flavors to carry over the design: bottom-up or top-down approaches.
* Information hiding is key to a good design: it lowers complexity by not requiring the person reading the code to know implementation details, and it reduces coupling.

Comments: Hiding information properly is an art. It doesn’t help to stick as much code as possible into private methods if the public methods are not intuitive and require diving into implementation details.

Chapter 6 – Classes

* Consistent abstractions – different methods in the class should model the problem at the same level. Example: a class representing an employee record which inherits from a List and exposes two methods:

addEmployee()
firstItem()

The second one is closer to an implementation detail. In general, the closer the abstraction is to the business level, the better (see the sketch below).

* Inheritance vs. composition: inheritance is often abused, and long inheritance chains are hard to read. Arthur Riel suggests no more than 6 levels; the author says it should be limited to 3.
* Be careful not to pile too many responsibilities onto a single class. A heuristic is that a class should have at most 7 members.
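
To make the consistent-abstraction point above concrete, here is a small sketch of my own (not from the book) that wraps the list (composition) instead of inheriting from it, so every public method stays at the employee level:

class EmployeeRoster:
    def __init__(self):
        self._employees = []   # the list is an implementation detail

    def addEmployee(self, employee):
        self._employees.append(employee)

    def firstEmployee(self):
        # Business-level name; the index-0 detail stays hidden.
        return self._employees[0]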

Chapter 7 – Routines

* Naming: should be verb + noun and should describe the value the routine returns (if any).
* The purpose of a routine is to reduce complexity.
* The heuristic for the maximum number of parameters is 7.
* Routines should follow the Unix philosophy: do one thing and do it well.

Comments: in the past I used to think of routines as ways to share code. This sometimes conflicts with readability and the Unix principle. This is especially true when you group several routine calls into one just because they’re used in two places, but they’re not cohesive enough for the routine to be named clearly, so we end up using vague terms such as init, prepare, preprocess, cleanup, etc. Nowadays I prefer being verbose (i.e. repeating the lines in both places) in favor of readable code.

Chapter 8 – Defensive Programming

* When to use assertions: use error handling for things you expect to occur and assertions for things you don’t expect.
* When to use exceptions: a convention should be defined. The name of an exception should match the level of abstraction of the current code (e.g. no RuntimeException where business logic is defined); this also means catching and re-throwing if the exception crosses the boundary between two abstraction layers.
* Barricades: a specific layer that deals with bad input so that the core code doesn’t have to deal with it.

Comments: the barricade is a pretty nice idea. It helps reduce complexity in the main code by not having to handle too many corner cases, and it also centralizes where bad data is handled, so you don’t risk double- or triple-checking the same conditions in several places.
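
As an illustration of the barricade idea, here is a minimal sketch (the field names and rules are my own assumptions): untrusted input is validated and normalized in one place, and everything behind that layer can assume clean data.

class BadInput(Exception):
    pass

def barricade_order(raw):
    # Single place where external data is checked and normalized.
    if 'sku' not in raw or 'quantity' not in raw:
        raise BadInput('missing required fields')
    quantity = int(raw['quantity'])
    if quantity <= 0:
        raise BadInput('quantity must be positive')
    return {'sku': str(raw['sku']).strip(), 'quantity': quantity}

def place_order(raw):
    order = barricade_order(raw)
    # Core logic below is free of defensive checks: the data is trusted.
    return 'ordered %d of %s' % (order['quantity'], order['sku'])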


Chapter 9 – Pseudocode Programming Process (PPP)

* Describe the routine first in pseudocode and then add the actual implementation, leaving the pseudocode as comments.
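
A tiny sketch of what the result of the process might look like (my own example, not from the book): the pseudocode survives as comments above each step.

def average_order_value(orders):
    # If there are no orders, report zero instead of dividing by zero.
    if not orders:
        return 0.0
    # Sum the value of every order.
    total = sum(order['value'] for order in orders)
    # Divide the total by the number of orders to get the average.
    return float(total) / len(orders)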

Chapter 10 – Variables

* The span of a variable is the number of lines between where the variable is declared and where it’s used. The average span of the variables in a routine is an indicator of complexity. A high span means the variable’s uses are spread out across the code. To reduce it, reorder statements so that variables are declared close to where they’re used and all their uses are grouped together.
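
A small illustration of keeping spans short (my own example): each variable is declared right before its uses and all of its uses stay close together.

def report(items):
    header = 'Report\n'
    body = '\n'.join(str(item) for item in items)
    # 'count' is declared next to its only use, so its span is minimal.
    count = len(items)
    footer = '\n%d items' % count
    return header + body + footer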

Chapter 11 – Variable naming

* Optimal length is 10 to 16 characters.
* Computed qualifiers such as total and max should be suffixes, not prefixes.
* Use conventions to indicate special types of variables such as loop indexes, temporary, boolean, enums, etc.
* Define a document for consistent variable naming conventions, including abbreviations.

Chapter 12 – Fundamental Data Types

* Consider using special-purpose containers instead of plain arrays. In most cases we need sequential access, so a queue, stack or set can be more appropriate.
* Use intermediate boolean variables for the sole purpose of making complex predicates (in if clauses) simpler.
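
A minimal sketch of the intermediate-boolean idea (the rule and thresholds are made up for illustration):

def can_rent_car(driver):
    is_adult = driver['age'] >= 21
    has_valid_license = driver['license'] is not None
    has_clean_record = driver['accidents'] == 0
    # The final predicate now reads like a sentence.
    return is_adult and has_valid_license and has_clean_record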

Chapter 13 – Unusual Data Types

* Organize related sets of variables into a structure so they become more readable and easier to copy around.
* Global variables are evil. If you need them, at least put them behind access routines (e.g. static member variables in a class).

Chapter 14 – Organizing code within a routine

* Make dependencies between two pieces of code obvious: via parameters, comments or flags + invariants.
* To break dependencies between chunks of code, initializing variables at the beginning of the routine might help.

Comments: I also like memoized functions for breaking dependencies. If B depends on A having run, I make A a memoized function that B can call regardless of whether it has already been called.
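
Here is a minimal sketch of what I mean (names are made up): A is wrapped in a tiny memoizer, so B can call it without caring whether it has already run.

def memoize_no_args(fn):
    # Minimal memoizer for a zero-argument function (just a sketch).
    cache = {}
    def wrapper():
        if 'result' not in cache:
            cache['result'] = fn()
        return cache['result']
    return wrapper

@memoize_no_args
def load_config():            # "A": expensive setup, runs at most once
    return {'retries': 3}

def connect():                # "B": calls A freely, order no longer matters
    config = load_config()
    return 'connecting with %d retries' % config['retries']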

Chapter 15 – Conditionals

* When doing if/else conditionals, test the “normal” case first and the exception in the “else”.

Comments: I tend to do the opposite, in favor of early returns: this helps reduce nesting and clears up corner cases first – it works well when handling the exceptional case is simple.
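
A small sketch of the early-return style I mean (my own example): the exceptional cases bail out first, so the normal path stays unindented.

def shipping_status(order):
    if order is None:
        return 'nothing to ship'
    if not order.get('paid'):
        return 'awaiting payment'
    # Normal case, handled without any nesting.
    return 'shipped %d items' % len(order['items'])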

Chapter 16 – Loops

* Keep as much code outside of the loop as possible, and treat the loop as a black box, if possible.
* Perform only one function per loop; prefer multiple loops with simple bodies over one loop doing many things – unless you can prove that the multiple loops are the bottleneck of the code.
* The loop body should be short (<20 lines) and should fit entirely on the screen.
* Loop nesting should be limited to 3 levels.

Chapter 17 – Unusual Control Structures

* Multiple returns within the same routine: use only if this improves readability.
* Avoid recursion if possible. Restrict it to a single routine (no chains like A -> B -> A -> B...).

Chapter 18 – Table-driven methods

* Simplify complicated conditional logic by pre-computing the values and hard-coding them in a lookup table (works if inputs are discrete).
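
A minimal sketch of a table-driven method (rates and categories are made up for illustration): a dictionary lookup replaces a nest of if/elif branches.

SHIPPING_RATE = {
    ('domestic', 'standard'): 5,
    ('domestic', 'express'): 15,
    ('international', 'standard'): 20,
    ('international', 'express'): 50,
}

def shipping_cost(region, speed):
    # The conditional logic lives in the table, not in code.
    return SHIPPING_RATE[(region, speed)]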

Chapter 19 – General control issues

* When comparing numbers, use number-line order; in other words, always use the "<" (or "<=") operator so that values read from smallest to largest. For example, instead of

a < 0 || a > 10, write a < 0 || 10 < a

* Do not put side effects on conditionals.
* Switch/case statements indicate poorly factored code in OO programming.
* Measuring control complexity in a routine: count the number of if, while, for, and, or and case. It should be less than 10.

Chapter 20 – Quality assurance

* Testing and code reviews are not enough by themselves. A combination of different techniques yields the lowest number of bugs in general.
* In studies, code reviews by 2 people found twice as many errors as code reviews by 1 person – this is surprising because one would expect a lot of the errors found by each reviewer to overlap. The book doesn’t mention the optimal number of reviewers.
* There are many qualities in a software including: correctness, usability, efficiency, reliability, flexibility and robustness, and some of them are conflicting (e.g. improving robustness decreases efficiency). Decide upfront which characteristic to optimize and keep the tradeoffs in mind.

Chapter 21 – Collaborative Development

* Formal inspections of design: the author of the design creates a document and other people on the team review it independently. This not only forces the designer to think it through thoroughly, but also makes sure other people on the team will be familiar with the architecture.

Chapter 22 – Testing

* Exercise control flows. Instead of testing all combinations of conditionals (which would be exponential, and prohibitive), add 2 tests for each conditional (one for the true case and another for the false case).
* Test data flow. The suggestion is to test all pairs of (definition, usage) of all variables. For example, if we have

  int a = 10; // 1
  ...
  if (x < a) { // 2
     ...
  } 
  int b = a + 1; // 3

In this case we would add two tests: one that exercises lines (1) and (2) and another that exercises (1) and (3).

* Bug distribution is not uniform across the code. It’s more likely that 20% of the code contains 80% of the bugs. It’s important to identify these areas and avoid over-testing, especially if using TDD.
* Keep records of bugs and fixes: where the bugs were found, the severity of the bug, etc. This will help identify the critical paths.

Chapter 23 – Debugging

* A bug in your code means you don’t fully understand your code.
* Reproduce the error in different ways, to make sure you understand what is really causing the problem.
* Rewriting code might be a better alternative if debugging is taking too long. The idea is to set a maximum time dedicated to debugging, after which more drastic solutions such as rewrites are allowed.

Chapter 24 – Refactoring

* Make refactorings safe. In order to accomplish that, they should be small, self-contained, easy to review, documented, and tested.
* Different refactorings have different degrees of risk and should be planned accordingly.
* Book recommendation: Refactoring: Improving the Design of Existing Code by Martin Fowler.

Chapter 25 – Code Tuning

* Code tuning is overrated.
* Performance is often not the best feature to optimize: throughput and correctness are more important.
* 4% of the code accounts for 50% of the running time – and programmers are bad at guessing where the bottlenecks are. Finish the product first, then profile to find the bottlenecks.
* Common sources of performance bottlenecks are system calls, I/O and paging.

Chapter 26 – Code Tuning Techniques

* Specific techniques to improve runtime of code.
* Experimental results for each technique show that in some environments the optimizations provide great performance gains, but in other cases no significant improvements are obtained (sometimes performance even degrades).
* The key takeaway is: profile! Compilers are very smart nowadays, so it’s hard to predict how an optimization will translate into the final machine code.
* Downside of optimizations is loss of readability:

“The impact of unmeasured code tuning on performance is speculative at best, whereas the impact on readability is as certain as it is detrimental.”

Chapter 27 – How Program Size Affects Construction

* As the code size grows,
— More people are necessary, increasing communication overhead
— Productivity is lower
— Error density increases
— More time has to be proportionally spent on non-construction phases (planning and design)

Chapter 28 – Managing Construction

This chapter seems to be more targeted to managers, but also useful to developers to understand the “other side”.

* On encouraging good coding:

“If someone on a project is going to define standards, have a respected architect define the standards rather than the manager. Software projects operate as much on an expertise hierarchy as on an authority hierarchy.”

* Configuration management: practices to deal with changes, either in requirements, design or source code.
— Discuss planned changes as a group, no matter how easy they are to implement, so people keep track of them.
* Estimating construction time:
— Use estimating software,
— Treat estimation seriously
— Adjust estimates periodically and
— If initial estimations were off, learn why.

Chapter 29 – Integration

* Different strategies for doing code integration, mainly top-down (start with a skeleton and fill in the blanks) and bottom-up (start with individual pieces and glue them together).
* Strategies to make the integration smoother, such as automated builds and smoke tests.

Chapter 30 – Programming Tools

* Covers: IDE’s, tools for compiling, refactoring, debugging and testing
* Large projects require special purpose tools
* Programmers overlook some powerful tools for years before discovering them


Chapter 31 – Layout and Style

* Covers indentation, line breaks
* Do not align assignment statements on the ‘=’ sign. The idea is noble and it improves readability, but it’s a hard standard to maintain. Not everyone will take the same care, and automated refactors will likely break the alignment.
* Add at least one blank line before each comment

Chapter 32 – Self-documenting code

* Don’t comment too much or too little :)
* The author admits there’s not a lot of hard-data regarding usefulness of comments and what the “right” amount is
* Comment while or, better yet, before coding
* Especially in bug fixes, the comment should explain why the code works now, not why it didn’t work in the past.
* Comments styles have to be easy to maintain (avoid end-of-line comments, because if the variable gets renamed, it will misalign the comment)

Chapter 33 – Personal Character

Traits that the author considers important in a good programmer:

* Humility – admits their limitation, open to learn new things, change their minds
* Curiosity

Analyze and plan before you act: there is a dichotomy between analysis and action, and programmers tend to err on the side of acting.

* Intellectual Honesty – admit mistakes, be honest/realistic about estimates and delays,
* Communication
* Creativity
* Discipline
* Laziness

The most important work in effective programming is thinking, and people tend not to look busy when they’re thinking. If I worked with a programmer who looked busy all the time, I’d assume he was not a good programmer because he wasn’t using his most valuable tool, his brain.

Traits that the author thinks are overrated:

* Persistence
* Experience
* Gonzo programming – programming like crazy, non-stop

Chapter 34 – Themes in Software Craftsmanship

Conclusion and review of the book

* The primary goal of software design and construction is conquering complexity
* Process matters.

My message to the serious programmer is: spend a part of your working day examining and refining your own methods. Even though programmers are always struggling to meet some future or past deadline, methodological abstraction is a wise long-term investment – Robert W. Floyd.

* Write programs for people first, computers second
* Watch for warning signs. Examples: a high number of bugs in a particular class may indicate the class is poorly designed. A lot of bugs in the project overall might indicate a flawed development process. Difficulty reusing code elsewhere indicates it’s too coupled, etc.

Chapter 35 – Where to find more information

Books recommendation:

* Pragmatic Programmer – Hunt and Thomas.
* Programming Pearls – Bentley, J.
* Extreme Programming Explained: Embrace Change – Beck, K.
* The Psychology of Computer Programming – Weinberg.
* The Mythical Man-Month – Brooks
* Software Creativity – Glass, R.
* Software Engineering: A Practitioner’s Approach – Pressman, R.
* Facts and Fallacies of Software Engineering – Glass, R.
* UML Distilled – Fowler, M.
* Refactoring: Improving the Design of Existing Code – Fowler, M.
* Design Patterns – Gamma et al.
* Writing Solid Code – Maguire, S.

Conclusion

This book contains a lot of valuable information and I’ve incorporated several of its ideas into my day-to-day work, especially regarding making code easier to read.

The suggestions in the book are often backed by hard data, making them more credible. Sometimes the advice is subjective, even contradictory, but the author often provides several points of view or alternatives, so that the reader can use their best judgement about when to apply them.

Revisiting Python: Object Oriented Programming

This is the third post in the series on revisiting Python. In the previous post we talked about Python functions.


We’ll talk about classes in Python 2.7. Following the premise of previous posts, this is not intended to be an introduction to object oriented programming, but rather a look at how we can use this paradigm in Python.

We’ll first cover basic concepts like classes and inheritance. Later we’ll discuss some more advanced features of the Python language that allow us to extend the set of built-in patterns provided by the language.

Classes

Classes are the building blocks of object oriented programming. Python classes look similar to those in languages like C++ and Java, but there are some important differences, which we’ll comment on in this section.

Classes are objects. When a class definition is executed, the code inside it runs and a class object is created and assigned to a name corresponding to the class name. For example, in the code below, an object representing the class is assigned to the variable MyClass in the current scope.

class MyClass:
    print 'hi'

print MyClass # __main__.MyClass
x = MyClass
print x       # __main__.MyClass

If we run the code above it will print ‘hi’. We can manipulate MyClass as a regular variable, assigning it to other variables, passing it as parameter to functions, etc.

Instantiating. We can create an instance of a given class by using a function call notation, that is, MyClass().

This will also call the method __init__(), which can be used to initialize the instance, like a constructor. Functions defined within a class become methods of its instances. Methods can be called using the syntax instance.method(). For example:

class MyClass:
    def myMethod(self):
        print(self)

instance = MyClass()

instance.myMethod() # Prints <__main__.MyClass instance at 0x10891f290>

When invoking a function using instance.method(), instance is bound to the first parameter of method(). We usually name this first parameter self, but that’s just a convention.

Class vs. instance members. Note that variables defined at the class level belong to the class object, not to a specific instance object. Making an analogy to C++, this is like a static member variable.

class MyClass:
    x = 10

inst = MyClass()
MyClass.x = 20   # changing the class attribute...
inst2 = MyClass()
print inst2.x # 20 -- ...affects every instance

To make a variable instance specific, we need to make use of the instance passed as the first parameter to the methods.

class MyClass:
    def __init__(self):
        self.x = 10

inst = MyClass()
inst.x = 20
inst2 = MyClass()
print inst2.x # 10 -- setting inst.x only affected inst

All methods and variables defined at the class level are stored internally in the class’s __dict__ attribute.

class MyClass:
    x = 10
    def f():
        pass

print MyClass.__dict__
# {'x': 10, '__module__': '__main__', '__doc__': None, 'f': <function f at 0x109f32d70>}

Methods. In Python, methods are essentially “static” functions, but by default they require an instance of the class to be bound to the first parameter. As we saw above, this happens automatically when we use the instance.method() syntax. Alternatively, we could explicitly pass the instance, using the SomeClass.method(instance) syntax. To illustrate that, imagine we have a method printX() in our class. We can invoke it either from the instance object or from the class object, passing the instance:

class MyClass:
    def __init__(self):
        self.x = 10

    def printX(self):
        print self.x

inst = MyClass()

# Both are equivalent:
inst.printX() 
MyClass.printX(inst)

Methods are essentially functions assigned to class member variables. To make the point clear, we can rewrite the previous snippet replacing the inline definition of printX() by an external function:

def externalPrintX(self):
    print self.x

class MyClass:
    def __init__(self):
        self.x = 10

    printX = externalPrintX

Note that the external function still needs to take an instance of MyClass as its first argument.

Given that all methods are public and instance attributes take precedence over class attributes, it can cause some confusion if we assign functions to instance members:

def externalPrintX():
    print 'external hello world'

class MyClass:
    def __init__(self):
        self.x = 10
        self.printX = externalPrintX

    def printX(self):
        print 'hello world'

inst = MyClass()
inst.printX()

Here, the printX() method got replaced by another function, externalPrintX(). Note that, unlike with regular methods, the instance is not bound to the first argument when invoking inst.printX().

Static Methods. Since we’re required to provide an instance as the first argument, regular methods are not truly static in the C++ sense. It’s possible to remove this requirement by using the staticmethod() function. To illustrate, here’s an example:

class ClassWithStaticMethod():
    def regularMethod(x):
        print 'This is not bound to any instance:', x

    staticMethod = staticmethod(regularMethod)

x = ClassWithStaticMethod()
x.staticMethod(10) # This is not bound to any instance: 10

Class Methods. These are different from regular methods because they always receive the class object instead of the instance object as the first argument. More specifically, when called from a class object, the class is bound automatically as the first parameter, and when called from an instance object, its corresponding class object is bound instead. For example:

class ClassWithClassMethod():
    def regularMethod(type):
        print type
    classMethod = classmethod(regularMethod)

ClassWithClassMethod.classMethod() # __main__.ClassWithClassMethod

x = ClassWithClassMethod()
x.classMethod()                    # __main__.ClassWithClassMethod

Inheritance

The syntax for inheritance is class Derived(Base). Methods defined in base classes can be overridden by derived classes. If a name (variable or method) is referenced from an instance, it’s searched for starting at the instance’s class and going up the inheritance chain until a definition is found.

# animal.py
class Animal:
    def breath(self):
        return 'breathing air'
    def eat(self):
        return 'eating'

class Dog(Animal):
    def eat(self):
        return Animal.eat(self) + ' meat'

dog = Dog()
print dog.breath()  # breathing air
print dog.eat()     # eating meat

In this example, eat() is first found at Dog, while breath() is only found at Animal. Note that we can refer to a specific implementation by providing the class explicitly, like Animal.eat(self).

Old-style vs. new-style classes. In a nutshell, the difference between old-style and new-style classes is that the latter descend from object. User-defined classes are old-style by default (old-style classes were removed in Python 3).

According to the docs, the motivation for creating the new-style classes is to unify Python types and user defined classes. In this model, classes are able to derive from built-in types [2].

Multiple-inheritance. is supported in Python. The syntax is a comma separated list of base classes:

class DerivedClassName(Base1, Base2, Base3):
   ...

Method resolution order (MRO). We just saw that for simple inheritance chains the natural lookup order for methods is straightforward. For multiple inheritance it can get complicated, mainly because of the diamond pattern.

For old-style classes, the method lookup order (MRO) is defined by a depth-first search: it first looks in the entire inheritance chain of Base1, then Base2, and so on. The MRO is always well defined as long as the inheritance graph is a DAG (directed acyclic graph).

The problem is that it’s not intuitive: the method a call resolves to depends on the MRO of the instance’s class, which can differ from what the class making the call would suggest. For example, consider the following dependency:

class A:
    def name(self):
        return 'a'

class B:
    def name(self):
        return 'b'

class C(B, A):
    def composed_name(self):
        return 'c ' + self.name()

class D(A, C): pass

d = D()
print d.composed_name()

When we call composed_name() on the instance of class D, the name lookup only finds it in class C. That method in turn needs to resolve the name() method. Should we get it from class B, since it’s listed first in C‘s bases?

That’s not what happens: name() is resolved using class D‘s MRO, where A appears first. This behavior is confusing and error-prone.

In new-style classes, the diamond pattern exists by default when multiple inheritance is used, since everything ends up deriving from object. But in the new style, the kinds of inheritance hierarchies that can be created are more restricted. Python uses an interesting algorithm, C3, described in [3], to create a consistent ordering or throw an exception in case such an ordering doesn’t exist.

super. In the animal.py example, we had to hard-code Animal.eat() to be able to call Dog‘s parent class. The super() function takes a class object and an instance, and returns a proxy object on which we can call methods; they will be resolved based on the MRO of the instance’s class. More specifically, consider the following example:

class A(object):
    def method(self):
        print 'calling method from a'

class B(object):
    def method(self):
        print 'calling method from b'
    def anotherMethod(self):
        print 'calling another method from b'

class C(A, B):
    def method(self):
        print 'calling method from c'
    def anotherMethod(self):
        print 'calling another method from c'

    def run(self):
        proxy = super(C, self)
        proxy.method()          # calling method from a
        proxy.anotherMethod()   # calling another method from b

C().run()

In this example, super() returned a proxy object which resolves method() and anotherMethod() properly. Since C derives from A first, method() resolves to A.method(). On the other hand, since anotherMethod() is only defined in B, it resolves to B.anotherMethod().

Descriptors

So far, we discussed the basic components of object oriented programming. Now we’ll describe some advanced concepts that will help us better understand how some functions like staticmethod() work.

A class is said to be a descriptor if it implements any of __get__(), __set__() or __delete__(). In particular, if it doesn’t implement __set__(), it’s called a non-data descriptor. If it implements both __get__() and __set__(), it’s a data descriptor [4, 5].

If another class contains class members that are descriptors, __get__() and __set__() will be executed when we do reads and assignments, respectively. For a concrete example, consider the following descriptor:

class Descriptor(object):
    def __get__(self, obj, objtype):
        print 'getting x'
        return self.x

    def __set__(self, obj, val):
        print 'setting x to ' + str(val)
        self.x = val

It stores a value internally at the variable x, but includes some logging when setting or getting this value. Now, suppose we have a class that has a member that is a descriptor:

class ClassWithDescriptors(object):
    member = Descriptor()

x = ClassWithDescriptors()
x.member = 20
print x.member

When we try to assign a value to x.member, the __set__() method from Descriptor is called; the same applies when we read from x.member. It’s important that member is a class member, not an instance member. If it were an instance member, as in the example below, the logging messages would not be printed:

class ClassWithDescriptors(object):
    def __init__(self):
        self.instance_member = Descriptor()

x = ClassWithDescriptors()

x.instance_member = 20   # This is just overriding instance_member with an integer
print x.instance_member  

staticmethod() and classmethod(). One interesting use case for descriptors is implementing the staticmethod() and classmethod() “functions”. We could write the following descriptors for them, respectively:

class StaticMethod(object):
    def __init__(self, f):
        self.f = f

    def __get__(self, obj, objtype=None):
        return self.f

class ClassMethod(object):
    def __init__(self, f):
        self.f = f

    def __get__(self, obj, objtype=None):
        if objtype is None:
            objtype = type(obj)
        def newfunc(*args):
            return self.f(objtype, *args)
        return newfunc
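
As a quick sanity check, here is how these hand-rolled descriptors could be used (a sketch only; the real built-ins handle more cases):

class Example(object):
    def _greet(x):
        return 'hello ' + str(x)
    greet = StaticMethod(_greet)       # no instance is bound

    def _describe(cls, suffix):
        return cls.__name__ + suffix
    describe = ClassMethod(_describe)  # the class is bound instead

print Example().greet('world')  # hello world
print Example.describe('!')     # Example!
print Example().describe('!')   # Example!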

Functors

Python has the concept of function objects, or functors: an object that can be invoked as a function as long as its class defines the __call__() method. A simple example is:

class Multiplier:
    def __init__(self, factor):
        self._factor = factor
    def __call__(self, a):
        return self._factor * a

double = Multiplier(2)
triple = Multiplier(3)
print double(5) # Prints 10
print triple(7) # Prints 21

One advantage of functors over regular functions is the capability of carrying context.

Decorators

Decorator is a design pattern in which we modify the behavior of an object without modifying the behavior of other objects of that class. It’s also known as a wrapper, because we wrap the object inside another object that adds the desired behavior.

We’ll now describe a motivation for using decorators and how Python offers syntactic sugar for this pattern through the @ syntax. For our example, imagine we have a function that takes three variables and returns their sum:

def add(a, b, c):
    return a + b + c

Now imagine we want to validate the parameters passed to add(). We could add that to the beginning of the function, but maybe we already have a function to do the validation for us. In this case, we could wrap the function in another one which first performs the parameter validation and then calls the original function. The code below does that:

def validate(f):
    def closure(*args):
        for arg in args:
            assert isinstance(arg, int)
        return f(*args)
    return closure

validated_add = validate(add)

print validated_add(5, 2, 3)
# Throws an exception
print validated_add(5, 2, 3.0) 

validate() takes a function f() and wraps it inside closure(), which verifies that all arguments are integers before invoking f(). So by passing add() to validate(), we get a decorated function validated_add() which performs validation on the arguments.

Python offers syntactic sugar to do exactly that. By decorating a function with @something, the function is passed to a function named something(), which should act as a function transformation, and the transformed function is used instead.

@validate
def add(a, b, c):
    return a + b + c

Imagine we want to customize the validate function to enable us to define the type of each element passed to the decorated function.

Decorators with parameters. We can do that by passing parameters to the decorator, using the syntax @something(arg1, arg2, ...). In this form, we actually construct a decorator builder, so we need to wrap our validate() function inside another one:

def validate(*arg_types):
    def validate_impl(f):
        def closure(*args):
            assert len(arg_types) == len(args), "Arguments and types don't match"
            for i in range(len(args)):
                assert isinstance(args[i], arg_types[i])
            return f(*args)
        return closure
    return validate_impl

With this new validate() function, we can update the decorator on our simple add() function:

@validate(int, int, int)
def add(a, b, c):
    return a + b + c

We can now use this as a very simple type validation. Imagine we have a function that takes a string and repeats it n times. We can enforce the types using the same decorator:

@validate(str, int)
def rep(s, n):
    return "".join([s] * n)

Decorators can also be function objects, so we can define a class that does the same work the validate() function does:

class Validator:
    def __init__(self, *arg_types):
        self._arg_types = arg_types

    def __call__(self, f):
        arg_types = self._arg_types
        def closure(*args):
            assert len(arg_types) == len(args), "Arguments and types don't match"
            for i in range(len(args)):
                assert isinstance(args[i], arg_types[i])
            return f(*args)
        return closure

In this case, the parameters defined in the decorator are passed to the constructor of the class.
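
Usage looks the same as with the function-based decorator (a short sketch reusing the Validator class above):

@Validator(str, int)
def rep(s, n):
    return "".join([s] * n)

print rep('ab', 3)   # ababab
# rep('ab', '3') would raise an AssertionError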

Finally, two very common decorators are staticmethod and classmethod. There is no magic here: they simply cause staticmethod() and classmethod() objects to be created, and we saw how these work in the previous sections.

Conclusion

In this post we covered object oriented programming in Python. We learned details about classes, especially how they compare to classes in other languages. We then discussed inheritance, including multiple inheritance and old and new-style classes. We then delved into more advanced concepts related to object oriented programming. This included descriptors, functors and decorators.

Python requires relatively few special constructs to provide features found in other object-oriented languages. Examples include super and static methods, which are implemented using general-purpose features of Python like descriptors and decorators.

References

[1] The Python Tutorial – Classes
[2] Python.org Docs – New-style Classes
[3] The Python 2.3 Method Resolution Order
[4] Python Descriptors Demystified
[5] Python HOWTOs – Descriptor HowTo Guide

The Visitor pattern and Vtables in C++

The other day I was working on some Java code and needed to use instanceof, but since I had read that this is a sign of bad design, I asked for help and was recommended the Visitor pattern.

In this post we’ll talk about this design pattern in Java, which is where my need arose, but also in C++, because the implementation is related to vtables, which I’ve been wanting to read about.

Visitors in Java

Let us consider the following example, where we have the classes Dog and Cat, which are specializations of Animal and have methods to emit sound, respectively bark() and meow(). We want to implement a method emitSound() which receives an instance of Animal and emits the sound according to the actual class of the instance. A first alternative using instanceof is the following:

// Listing 1
interface Animal {    
}
public class Dog implements Animal {
	public void bark(){
		System.out.println("Woof");
	}
}
public class Cat implements Animal {
	public void meow(){
		System.out.println("Meow");
	}
}
...
public static void emitSound(Animal animal){
    if(animal instanceof Dog){
        Dog dog = (Dog) animal;
        dog.bark();
    }
    else if(animal instanceof Cat){
        Cat cat = (Cat) animal;
        cat.meow();
    }
}

Here we can quote Scott Meyers, from Effective C++ book:

Anytime you find yourself writing code of the form “if the object is of type T1, then do something, but if it’s of type T2, then do something else,” slap yourself.

To get rid of instanceof, we can add the method emitSound() to the Animal interface and replace bark()/meow() with it. In our emitSound() function, we let polymorphism choose the correct implementation.

// Listing 2
interface Animal {
    void emitSound();
}
public class Dog implements Animal {
    @Override
    public void emitSound(){
        System.out.println("Woof");
    }
}
public class Cat implements Animal {
    @Override
    public void emitSound(){
        System.out.println("Meow");
    }
}
...
public static void emitSound(Animal animal){
    animal.emitSound();
}

Another possibility is to delegate the implementation of emitSound() to another class through the Visitor pattern. This pattern involves two kinds of classes: an element and a visitor. The element implements a method usually named accept() while the visitor implements the method visit(). The accept() method receives a visitor as a parameter and calls that visitor’s visit() method with itself (the element) as the argument. This way, the visitor can implement the visit() method for each possible element.

We can implement it like this:

// Listing 3
interface Animal {
    void accept(AnimalVisitor visitor);
}
public class Dog implements Animal {
    @Override
    public void accept(AnimalVisitor visitor){
        visitor.visit(this);
    }
}
public class Cat implements Animal {
    @Override
    public void accept(AnimalVisitor visitor){
        visitor.visit(this);
    }
}
interface AnimalVisitor {
    void visit(Dog dog);
    void visit(Cat cat);
}
public class SoundEmissor implements AnimalVisitor {
    @Override
    public void visit(Cat cat){
        System.out.println("Meow");
    }
    @Override
    public void visit(Dog dog){
        System.out.println("Woof");
    }	
}
...
public static void emitSound(Animal animal){
    SoundEmissor soundEmissor = new SoundEmissor();
    animal.accept(soundEmissor);
}

The advantage of this approach is that SoundEmissor may contain members and methods related to the way animals emit sounds. Otherwise, these methods would end up being implemented in Animal.

Another advantage is that the visitor implementation can be chosen at runtime and, since this is a form of dependency injection, it simplifies testing.

Visitors in C++

In C++ we don’t have instanceof, but we can use the typeid() operator instead. According to [1], if the argument to this operator is a reference or a dereferenced pointer to a polymorphic class, typeid() will return a type_info corresponding to the runtime type of the object.

// Listing 4
struct Animal {
    virtual void foo(){};
};
struct Dog : public Animal {
    void bark(){
        cout << "Woof\n";
    } 
};
struct Cat : public Animal {
    void meow(){
        cout << "Meow\n";
    }
};
void emitSound(Animal *animal){
    if(typeid(*animal) == typeid(Dog)){
        Dog *dog = dynamic_cast<Dog *>(animal);
        dog->bark();
    }
    else if(typeid(*animal) == typeid(Cat)){
        Cat *cat = dynamic_cast<Cat *>(animal);
        cat->meow();
    }
}

Again, we can declare emitSound() in an interface and make use of polymorphism. In C++ we can implement an interface through a class containing only pure virtual functions and no visible constructor, like Animal in the code below:

// Listing  5
struct Animal {
    virtual void emitSound() = 0;
};
struct Dog : public Animal {
    void emitSound(){
        cout << "Woof\n";
    }
};
struct Cat : public Animal {
    void emitSound(){
        cout << "Meow\n";
    }
};
void emitSound(Animal *animal){
    animal->emitSound();
}

The Visitor pattern can be similarly implemented in the following way:

// Listing 6
struct Cat;
struct Dog;

struct AnimalVisitor {
    virtual void visit(Cat *cat) = 0;
    virtual void visit(Dog *dog) = 0;
};
struct Animal {
    virtual void accept(AnimalVisitor *visitor) = 0;
};
struct SoundEmissor : public AnimalVisitor {
    void visit(Cat *cat){
        cout << "Meow\n";
    }
    void visit(Dog *dog){
        cout << "Woof\n";
    }    
};
struct Dog : public Animal {
    void accept(AnimalVisitor *visitor){
        visitor->visit(this);
    }
};
struct Cat : public Animal {
    void accept(AnimalVisitor *visitor){
        visitor->visit(this);
    }
};
void emitSound(Animal *animal){
    AnimalVisitor *visitor = new SoundEmissor();
    animal->accept(visitor);
}

Single and multiple dispatch

We just saw that the Visitor pattern relies on the use of virtual functions. So now, let’s analyse how C++ implements this kind of function. Before that, we need to understand the concept of dynamic dispatch.

Dynamic dispatch is a technique used in cases where different classes (with a common ancestor) implement the same methods, but we can’t tell at compile time what the type of a given instance is, as in the cases above.

In C++ and Java, this type of dispatch is also known as single dispatch since it only considers the type of the object on which the method is called. In Lisp and a few other languages, there is multiple dispatch, which also takes into account the types of the parameters.

As an example, take a look at the following code:

// Listing 7
void visitor(Dog *dog){
    cout << "Woof\n";
}
void visitor(Cat *cat){
    cout << "Meow!\n";
}
void emitSound(Animal *animal){
    visitor(animal);
}

We’ll get a compile error since function matching for parameters is done at compile time. In this case we would need an overload for the Animal type.

Now, we’ll see how C++ implements dynamic dispatch.

C++ Vtables

Although the C++ standard does not specify how dynamic dispatch should be implemented, compilers in general use a structure called the virtual method table, also known by other names such as vtable, which is the one we’ll use henceforth. This table is an array of pointers to virtual methods.

Each class that defines or inherits at least one virtual method has its own vtable. This table points to the virtual method implementations defined by the nearest ancestor (including the class itself) that are visible to that class. Besides, each instance of these classes has an additional member: a pointer to the table corresponding to the class used to instantiate that object.

As an example, if we were to instantiate a Dog from Listing 5:

Animal *animal = new Dog();

Here, animal's vptr points to Dog‘s vtable and not Animal‘s. Thus, when we do

animal->emitSound();

the corresponding table is looked up to find exactly which emitSound() to call. Note that virtual methods require an extra indirection, which may affect performance. Because of this, dynamic dispatch is not performed unless a class defines at least one virtual method. In Java, on the other hand, it is done by default.

Let’s see a final example,

struct A {
    virtual void foo(){}
    virtual void bar(){}
    void baz(){}
};
struct B : public A {
    void bar(){}
    virtual void baz(){}
};

We can analyse the vtables for classes A and B by compiling the code above with gcc using the -fdump-class-hierarchy flag. A text file with extension .class will be generated, like this:

Vtable for A
A::_ZTV1A: 4u entries
0 (int (*)(...))0
4 (int (*)(...))(& _ZTI1A)
8 (int (*)(...))A::foo
12 (int (*)(...))A::bar

Class A
size=4 align=4
base size=4 base align=4
A (0xb71f6038) 0 nearly-empty
vptr=((& A::_ZTV1A) + 8u)

Vtable for B
B::_ZTV1B: 5u entries
0 (int (*)(...))0
4 (int (*)(...))(& _ZTI1B)
8 (int (*)(...))A::foo
12 (int (*)(...))B::bar
16 (int (*)(...))B::baz

Class B
size=4 align=4
base size=4 base align=4
B (0xb7141618) 0 nearly-empty
vptr=((& B::_ZTV1B) + 8u)
A (0xb71f61f8) 0 nearly-empty
primary-for B (0xb7141618)

Here we can see that the functions A::foo and A::bar appear in A‘s vtable, but A::baz doesn’t, since it was not declared virtual. In B‘s vtable we have A::foo, because it was not overridden in B. We also have B::bar: although it was not declared virtual in B, this property was inherited from A::bar. Finally, B::baz appears in the vtable because it was declared virtual in B.

We can also see the value that instances of these classes will have: vptr=((& A::_ZTV1A) + 8u) and vptr=((& B::_ZTV1B) + 8u), which is the offset to the function pointers in the respective tables. A nice reference for understanding vtables is [2].

References

[1] The typeid operator (C++ only)
[2] LearnCpp.com – The Virtual Table