So Long WordPress, and Thanks for All the Fish

I finally decided to move from WordPress to Github pages at https://www.kuniga.me/blog/.

I’ve been using WordPress since 2009, when I realized it suited my needs much better than Blogger. Over the years I’ve grown unsatisfied with some of its limitations:

  • More flexible control of the layout – unless you’re using the self-hosted version of WordPress, you’re limited in what you can do. Examples include changing the font size of headers, the font family, or the quotation style.
  • Code snippets – I’m assuming most people don’t include code snippets in WordPress, which makes support subpar. I’ve been using GitHub gists to work around some of the limitations, but they bring their own set of drawbacks, including a distracting frame and the fact that they sometimes don’t load on mobile web.
  • Code formatting / monospace – WordPress has a very nice UI for formatting text. One of the major missing formatting options is for code. It’s possible to achieve it by switching to raw mode and including <code> tags, but it is inconvenient. It looks like the new UI fixes this.
  • WordPress shows ads, which is understandable from their business point of view, but it’s an extra inconvenience anyway.

  • Being able to inject custom HTML with JavaScript is something I’d like to do.
  • As WordPress improves and tailors the experience to a less technical audience, power-user features such as editing raw text become harder to use. I must admit the new image insertion is really good though!

At some point I tried hosting my own version of WordPress on Dreamhost, but I ran into several problems with spam, and using WordPress’s anti-spam service would require paying for it. Moreover, it was running on a single host, so it was vulnerable to traffic spikes or DDoS.

Future

I decided to go with GitHub Pages using Jekyll. It’s free but limits the content to static pages, which is fine for my case.

It uses a templating language called Liquid and has good support for Markdown. While I’ll still be limited by what the framework supports, because it’s all open source, in theory I could write plugins or fork my own version.

I’m going to miss the text editor and the ease of including images but so far the pros are worth it.

I’m going to write a post on the other blog about how I replaced features like comments with alternatives.

This should be the last post on this blog. I will work on migrating all the existing posts (but not the comments :/) and see if there is a way to redirect them.

I’ll always be very grateful to WordPress for the option to host and publicize my thoughts and notes to the world for free, but now it’s time to move on.

Review: Working Effectively With Legacy Code

In this post we will review the book Working Effectively with Legacy Code by Michael C. Feathers.

The basic premise of the book is that we want to make a change in an existing (legacy) code base and we want to make it safely. In broad strokes, the book describes how to identify what needs to change, find where to test, break dependencies, write tests, and only then make the change and refactor.


The core of the book focuses on breaking dependencies in legacy code so that we can add tests more easily. There’s a vicious cycle here: we need to make changes to add tests, but we need tests to make changes safely. The idea is to break dependencies in safe ways, even if that temporarily leads to a worse design, until we can add tests and refactor the code.

Organization of the book

The book is divided into 3 parts totaling 25 chapters. In part I the author describes concepts and introduces tools that are used in the rest of the book. In part II, he describes several different situations and problems and goes on to provide solutions to them. Part III contains a single chapter and describes several dependency-breaking techniques. As the author suggests, the chapters are not required to be read in order.

Selected Notes

Here I’m going to list some of the techniques from the book that I found interesting and novel.

How to break dependencies

  • Seam model – the idea is to make behavior changes while minimizing code changes. One example is to add a new subclass in order to remove/mock dependencies for tests.
  • Sprouting – move blocks of code to new methods and classes which can be easily tested.
  • Wrapper class – wrap code inside a new class over which we have more control and which can be mocked. This is a variant of the seam model, but instead of relying on inheritance it uses composition. As the author notes, this is the decorator design pattern.

A common pattern between them is keeping code changes to a minimum. The bigger the change, the more likely subtle behavior changes are introduced.
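
To make the seam idea concrete, here is a minimal sketch in Python (the book’s examples are in C++/Java; the class and method names below are hypothetical, not from the book). A test-only subclass overrides the slow, hard-to-test dependency so the surrounding logic can be tested:

class InvoiceProcessor:
    def process(self, invoice):
        total = sum(item.price for item in invoice.items)
        self.send_to_billing_system(total)  # hard-to-test external call
        return total

    def send_to_billing_system(self, total):
        # in production this would talk to an external billing service
        ...

# Seam: a subclass used only in tests replaces the dependency, so process()
# can be exercised without touching the external service.
class FakeInvoiceProcessor(InvoiceProcessor):
    def __init__(self):
        self.sent = []

    def send_to_billing_system(self, total):
        self.sent.append(total)

In a test we’d instantiate FakeInvoiceProcessor, call process() and assert on the sent list. Sprouting is similar in spirit: the logic we want to test is moved into a new, easily instantiated method or class.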

Where to test

  • Effect analysis – draw an informal DAG, where nodes represent functions or classes and a directed edge implies that changing one node will affect the other. This helps us understand which code may be affected by a change and must be tested.
  • Pinch points – the effect analysis might yield too many affected nodes. Pinch points are nodes that cover a large portion of the affected code but have relatively few dependencies, so they are easier to test.

Interpreted Languages

The book strongly assumes we are working with a compiled object-oriented language, namely C++, Java or C#. The author dedicates only a page or two to dynamic languages, in this case Ruby.

Quotes

The author shares several design principles throughout the chapters. Here are some quotes describing some of them.

Temporal coupling:

When you group things together just because they have to happen at the same time, the relationship between them isn’t very strong.

Refactoring:

Rename class is a powerful refactoring. It changes the way people see code and lets them notice possibilities that they might not have considered before.

Command and Query Separation:

A method should be a command or a query but not both. A command is a method that can modify the state of the object but that doesn’t return a value. A query is a method that returns a value but that does not modify the object.

Psychology of working with legacy code:

If morale is low on your team and it’s slow because of code quality, here’s something that you can try: pick the ugliest most obnoxious set of classes in the project and get them under test. When you’ve tackled the worst problem as a team you will feel in control of your situation.

Conclusion

Overall there are many interesting techniques for breaking dependencies and making code more testable. I’ve been using some of them in a less methodical way. I like the pragmatism of the techniques (done is better than perfect), which can be seen when good practices and designs are temporarily violated in order to write tests.

I’ve also enjoyed the code design tips interspersed throughout the chapters, but I disagree with some of the suggestions, for example converting static utils to methods in class instances. In my opinion static classes are a good forcing function to keep implementations modular and stateless. This is especially the case if there’s a good mocking framework available that can mock out static functions.

I found the set of tips less useful for dynamic languages or runtimes with just-in-time compilation that allow modifying objects at runtime, because these enable much more powerful testing frameworks which, for example, can easily mock dependencies for us. Nowadays I work mostly with Hack and JavaScript, which both provide a lot of runtime flexibility and support for mocking dependencies. The challenge I often face is having to mock too many dependencies, which is tedious.

The major negative of the book is that it contains a lot of obvious typos and typographic errors, such as duplicated listings.

Related Posts

  • Review: Code Complete 2 – This is another review of a highly practical and pragmatic book that I believe made me a better programmer. The overlap in content is minimal, so they’re a nice complement.

CPU Cache

In this post we’ll study CPU caches. We’ll first understand why it exists and some of its typical properties, and then explore vulnerabilities associated with it.

Motivation

According to Wikipedia [1]:

Microprocessors have advanced much faster than memory, especially in terms of their operating frequency, so memory became a performance bottleneck.

How much faster? Doing some estimates based on a benchmark for Intel’s Haswell architecture [6], it seems that the worst case of a RAM access is 100x slower than the fastest CPU cache. See Appendix A for our calculations.

CPU cache is cache

I wanted to begin with this tautology to call out that many of the properties and design choices of the higher-level caching systems we might be more familiar with also apply to a CPU cache. It was useful for me to keep this in mind while learning this new topic.

For example, as we’ll see later in more detail, a CPU cache can be seen as a key-value table, which requires hash-like strategies for assigning keys to rows and for cache eviction.

Types

There are multiple CPU caches and they perform different functions. It varies depending on the CPU model but there are some general similarities.

Hierarchy

The CPU cache is divided into multiple levels. The first level, referred to as L1, is the fastest and smallest. The n-th level, Ln, is the largest and slowest. Typical values of n seem to be 3 or 4. In a multi-core CPU, the first levels of cache are exclusive to each core, while the later levels are shared.

For a real-world example, Wikipedia [4] describes Intel’s Kaby Lake architecture, where L1 is 64KB, L2 is 256KB and L3 is between 2MB and 8MB, shared by the cores.

Why do we have multiple-level of caches vs. a single level? This Q&A [5] suggests that it allows working with different tradeoffs of physical constraints, namely location and size of the component.

Data, Instruction, Address

L1 is often divided between data and instructions. As the names suggest, L1-data caches data from memory needed during a program’s execution, while L1-instruction caches the instructions needed to execute the program itself.

There’s another specialized cache called translation lookaside buffer, or TLB, which is used to lookup memory addresses (we’ll see more details later).

Why do we need separate caches for these 3 different uses? Wikipedia [1] provides an insight:

Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-physical address translation, and data fetch. The natural design is to use different physical caches for each of these points, so that no one physical resource has to be scheduled to service two points in the pipeline.

It’s worth recalling that a CPU operates in cycles and executes a set of operations in a pipelined fashion (think of a conveyor belt in a factory), with stages such as instruction fetch, address translation and data fetch.

Structure

We’ll now analyze the structure of a CPU cache, in particular a data cache. Like any general purpose cache, it conceptually represents a lookup table or a key-value structure, so we’ll use this as our analogy to make absorbing new terminology easier.

Cache entry

The value of our lookup table is called a cache line and it has a fixed size. The key is called a tag and, in a data cache, corresponds to a memory address. The combined key-value entry of our table is called a cache entry. The cache entry also reserves some bits to store flags, which can be seen as metadata.

A cache entry has an associated index, which corresponds to the row number in a table. The index information can be utilized to shave off some bits from the tag. To see why, imagine a cache with 64 (2^6) entries and that our assignment policy (think hash function) is that an address A is assigned to index A mod 64. For example, an address (in binary) 10110010110101 would be assigned to index 110101 (53 in decimal). Because the number of entries, 64, is a power of 2, the 6 least significant bits determine the bucket. This means our tag for 10110010110101 can be just 10110010, since the remaining bits 110101 are implicit from the index.
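
A quick Python sketch of this bit arithmetic (the 64-entry cache is the example above, not a real hardware parameter):

ADDRESS = 0b10110010110101
NUM_ENTRIES = 64                      # 2^6 entries => 6 index bits
INDEX_BITS = NUM_ENTRIES.bit_length() - 1

index = ADDRESS % NUM_ENTRIES         # same as ADDRESS & 0b111111
tag = ADDRESS >> INDEX_BITS           # remaining high-order bits

print(bin(index))  # 0b110101 (53)
print(bin(tag))    # 0b10110010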

Associativity

We hinted above, when talking about the mod function, that we can implement our lookup table using a simple hash function. And when talking about hash functions, we necessarily need to think about collisions, i.e., two different entries mapping to the same index.

On one hand, a CPU cache is “bare metal” hardware, so complex hash collision resolution strategies are infeasible to implement. On the other hand, it is a cache, not a persistent key-value store, meaning that when a collision happens we can just evict the previous entry.

Still, a high eviction rate can also mean a high cache miss rate, which is a key quality factor of a cache. Given the limited size of a cache, there are two dimensions we can trade off: the number of indexes and the number of cache entries per index. In our analogy, this roughly maps to the number of rows vs. the number of columns.

Some common strategies are either one column (direct mapping) or two columns (two-way set associativity), but in theory we could have any number of columns (Intel’s Haswell architecture uses 8) [6]. Any such strategy is what we call associativity. Let’s take a closer look at these strategies.

Direct-mapping

This works like a hash table with a simple hash function. Each address maps to exactly one index/row. If a new entry maps to a row already “occupied”, the previous entry is evicted.

The advantage of this is simplicity: there’s no criterion needed to choose which “cell” of our table to assign an entry to once we’ve determined its row, since there’s only one. Eviction is easy too.

The downside is that we’re subject to degenerate cases where two hot addresses map to the same index. This results in the entries constantly swapping each other out and many cache misses.

Two-way set associativity

As we said, we have two columns for each row, so a given memory address can go to either of two cells. Deciding which column to assign it to boils down to deciding which of the existing values in that row to evict, and a natural strategy is LRU (least recently used).

To do so, we need an extra bit to keep track of this information: every time an item is read, we set a bit in that cell to 1 and to 0 in the other. We can then always evict the slot with bit 0.

This is a more complex approach than direct mapping and we need an extra bit to account for the LRU. On the other hand it avoids the degenerate case of direct mapping, though it suffers the same problem if there are 3 or more hot addresses mapping to the same index.

N-way associative cache

A more general structure is N columns per row. This further shifts the balance between complexity (the higher the N, the more expensive it is) and avoiding degenerate cases (resilient to up to N hot addresses).
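
As a toy model of the ideas above, here is a minimal Python sketch of an N-way set-associative cache with LRU eviction. It is a simulation only: the sizes are arbitrary and the counter-based LRU is a simplification of the single bit per way described earlier.

class SetAssociativeCache:
    def __init__(self, num_sets=64, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # each set (row) holds up to `ways` entries: tag -> last-use counter
        self.sets = [dict() for _ in range(num_sets)]
        self.clock = 0

    def access(self, address):
        self.clock += 1
        index = address % self.num_sets      # which row
        tag = address // self.num_sets       # remaining bits
        entries = self.sets[index]
        hit = tag in entries
        if not hit and len(entries) >= self.ways:
            # evict the least recently used tag in this set
            lru_tag = min(entries, key=entries.get)
            del entries[lru_tag]
        entries[tag] = self.clock            # insert or refresh
        return hit

cache = SetAssociativeCache()
print(cache.access(1000))   # False (miss, now cached)
print(cache.access(1000))   # True (hit)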

Translation Lookaside Buffer

Before we describe the translation lookaside buffer, we need to revisit virtual memory.

Virtual Memory: a prelude

Virtual memory is an abstraction layer that hides the details of physical memory. Some use cases:

  • Memory can be fragmented in the physical space but contiguous in the virtual space.
  • Storage is made transparent to programs – the OS can decide whether to store data in memory or on disk.
  • Programs can have their own (virtual) address space. For example, both programs A and B could access their own address X and not conflict, since these are mapped to different physical addresses.

Virtual memory is chunked into blocks of fixed size called (virtual) pages. The page size is defined by the processor (mine is 4KiB). The corresponding physical block is called a page frame [3].

The mapping between virtual memory and physical memory is done by the memory management unit, or MMU. It keeps a table (the page table) in memory (RAM) that has one entry for each page, mapping it to the corresponding physical address.

Why is memory segmented into pages? The page table is a likely reason. If we mapped byte to byte, the page table wouldn’t fit in memory, because we’d need an entry for every byte! We need a good balance: pages that are too big (coarse granularity) lead to wasted memory when most of a page goes unused, while pages that are too small (fine granularity) lead to a gigantic page table.

TLB: The page table cache

Many CPU instructions require access to data in (physical) memory, so the CPU needs to translate a given virtual address to its physical counterpart. As we’ve seen, memory access is slow, so not surprisingly there’s also a cache for the page table, called the translation lookaside buffer (TLB).

Page or Cache Coloring

Imagine our application needs access to a large block of memory – this will consist of a list of virtual pages that are adjacent in the virtual address space. However, these are not guaranteed to be adjacent in the physical address space.

The CPU cache is designed to map adjacent physical addresses to different cache lines, so that when accessing a contiguous piece of physical memory, most of the data can be in cache at the same time. If adjacent addresses mapped to the same cache line, they would keep evicting each other.

What can happen is that a list of adjacent virtual pages are not physically adjacent, which means there’s no guarantee they will map to different cache lines.

To address this problem, page frames are “colored” based on which cache lines they would be assigned to [8]. The corresponding virtual page “inherits” that color. When assigning sequential pages to physical locations, the OS tries to make sure that contiguous pages have a diverse set of colors.

Security

Whenever we add caching to improve the performance of a system, an important consideration is security. One example is caching some expensive computation which also performs access checks. The cached data could be vulnerable if an attacker has access to the key.

We’ll now concentrate on two famous exploits known as Spectre and Meltdown. Before that, we need to understand out-of-order execution.

Out of order execution

In conjunction with caching, an optimization technique the CPU employs to speed up execution is to execute instructions ahead of time (out of order) if it can guarantee the end result is correct.

For example, say an instruction i requires reading a value from memory, which can be expensive on a cache miss. Say that the next instruction i+1 doesn’t depend on this value. The CPU can then execute instruction i+1 before i is finished.

A more concrete example is branch prediction. Say your code has a conditional like:

if (condition) {
  expression1;
} else {
  expression2;
}

and that evaluating condition is slow. Then the CPU might predict condition=true and execute expression1 before it knows whether condition is in fact true. If it happens to be false, the CPU discards the speculative work and resumes execution at the other branch.

Meltdown and Spectre

A few years ago, two vulnerabilities related to the CPU cache were found, named Spectre and Meltdown. Let’s analyze Spectre [11] first.

Spectre

As we mentioned above, the CPU might perform branch prediction. The problem is that memory accessed while speculatively executing expression1 stays in the cache even if the speculative work is later abandoned.

Consider the following contrived example proposed by ricbit [7]:

Suppose we want to read the contents of the memory address represented by

a[bigvalue]

Because we didn’t explicitly allocate memory for that position, we generally won’t have access to it. But say we can be sure the CPU will perform branch prediction on

bigvalue < a.length


And predict it will be true. Then it will execute the next few lines (in the original example, code that sets a variable x based on the value of a[bigvalue]) before it determines the condition is actually false. By that time it has accessed the value in a[bigvalue] without permission checks. If the last bit of the value is 1, x will have the value of a[100], but more importantly the CPU will have cached the memory address of a[100]. Otherwise, it will do the same with a[200].

After the branch prediction fails and is cleaned up, we execute our timing attack. It simply measures the time to access both a[100] and a[200]. Whichever address is cached will return much faster, which in turn is a proxy for the last bit of the value in a[bigvalue]. Repeat this for all the bits and we get the exact value.

Meltdown

Meltdown uses a similar approach [12], but instead of relying on branch prediction, it relies on the CPU executing instructions after an exception is raised. If the application manages to handle the exception, it can then use a timing attack to detect which value was cached.

The paper provides the code in assembly. I’ve attempted to write a pseudo version in C (meaning it doesn’t work but captures the idea of the exploit) for educational purposes.

The steps are:


1-) Try accessing some protected memory address; an exception is thrown due to the invalid access, but we handle the exception so the application doesn’t crash.

2-) Right after the exception, use the content of the protected address as an index into a probe array (non-protected addresses). This exploits the CPU behavior of executing instructions out of order and caching their memory accesses: it will execute this instruction speculatively and cache that portion of the array.

3-) Timing: try reading parts of the probe array and measure which one returns faster, indicating that address is cached, and hence revealing the content of the protected address.

Spectre vs. Meltdown?

One of the most difficult things for me was to understand the precise difference between Spectre and Meltdown. I’ve seen a lot of comparisons between the two, but they’re too high-level to satisfy my curiosity. On the other hand, the distinctions provided by the papers are hard for me to grasp.

The discussion of Spectre [11] on Meltdown is a bit clearer:

Meltdown is a related attack which exploits out-of-order execution to leak kernel memory. Meltdown is distinct from Spectre attacks in two main ways. First, unlike Spectre, Meltdown does not use branch prediction. Instead, it relies on the observation that when an instruction causes a trap, following instructions are executed out-of-order before being terminated. Second, Meltdown exploits a vulnerability specific to many Intel and some ARM processors which allows certain speculatively executed instructions to bypass memory protection. Combining these issues, Meltdown accesses kernel memory from user space. This access causes a trap, but before the trap is issued, the instructions that follow the access leak the contents of the accessed memory through a cache covert channel.

The remarks of Meltdown [12] on Spectre are more succinct and obscure:

Meltdown is distinct from the Spectre Attacks in several ways, notably that Spectre requires tailoring to the victim process’s software environment, but applies more broadly to CPUs.

My feeling is that the differences are too specific and technical to be explained in a simple way, yet I seem to lack the background to understand them fully.

Conclusion

I was reading a paper that made references to CPU cache lines, and that prompted me to research the topic. I was surprised by how much information there is about it, so I decided to dedicate an entire post to it.

It was also great to dive deeper into Spectre and Meltdown, even though I didn’t fully internalize their specifics. I also learned a lot about virtual memory.

Appendix A: RAM vs Cache latency estimates

[6] provides a benchmark for Intel’s Haswell architecture, a high-end i7-4770. It provides data in terms of CPU cycles:

  • L1: 4-5 cycles
  • L2: 12 cycles
  • L3: 36-58 cycles
  • RAM: 36 cycles + 57ns / 62 cycles + 100 ns

We can see that L3 can be roughly 10x slower than L1. To compare apples to apples, we need to convert cycles to time so we can add the overhead from the RAM. The CPU in question operates at 3.4GHz, which means a cycle takes about 0.294ns. So 4 cycles of L1 take about 1.2ns, and the worst case for RAM of 62 cycles + 100 ns is about 118ns, which gives us the 100x slower latency estimate.
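
The arithmetic, as a small Python check (numbers taken from the benchmark figures above):

FREQ_GHZ = 3.4
cycle_ns = 1 / FREQ_GHZ            # ~0.294 ns per cycle

l1_ns = 4 * cycle_ns               # fastest L1 access: ~1.2 ns
ram_ns = 62 * cycle_ns + 100       # worst-case RAM access: ~118 ns

print(l1_ns, ram_ns, ram_ns / l1_ns)  # the ratio is ~100x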

Related Posts

  • Von Neumann Architecture – We presented an overview of the architecture many computers use. We briefly touched on memory bottlenecks and CPU cache.
  • Content Delivery Network – As discussed, a CDN is another kind of cache, one that lives at a much higher level of the stack.
  • Python Coroutines – Not directly related to caching, but out-of-order execution reminds me a lot of coroutine execution. One task waiting on I/O might get queued for later, the same way an instruction waiting for an (expensive) memory read is.
  • Browser Performance – This is relevant to timing attacks – memory access is relatively slow, but still very fast in absolute terms. Appendix A above puts the latency on the order of 100ns. To be able to determine whether data was cached, the measurement has to have high precision. To help reduce this vulnerability, browsers purposefully reduce the precision of APIs such as performance.now().

 

References

[1] Wikipedia – CPU cache
[2] Wikipedia – Virtual memory
[3] What is the difference between a ‘page’ of memory and a ‘frame’ of memory?
[4] Wikipedia – Cache hierarchy
[5] Why do we need multiple levels of cache memory?
[6] 7-Zip LZMA Benchmark – Haswell
[7] Ricbit Permanente – 2018/01/04 (in Portuguese)
[8] Wikipedia – Cache Coloring
[9] kunigami.blog – Content Delivery Network
[10] Meltdown and Spectre
[11] Spectre Attacks: Exploiting Speculative Execution – Kocher et al.
[12] Meltdown: Reading Kernel Memory from User Space – Lipp et al.

Browser Performance

In this post we’ll go over the topic of browser performance. We’ll cover tools and techniques to identify, measure and improve performance on the client side of web applications.

Assumptions. Our tests use the Chrome browser, which means the tools likely only make sense for this specific browser (the general techniques are agnostic). All the screenshots were taken either on a Linux machine running Gnome or on the latest MacOS, though things should be relatively similar for Chrome on other platforms. The Chrome version used is v80.

Setup. I’ll be using simple standalone HTML/JavaScript code. I recommend downloading the folder browser-performance.

Analyzing Performance

FPS Meter

This widget tracks the number of frames per second. To enable it:

In the … icon menu, click on Show console drawer.


Then, in the Rendering tab, check the FPS meter box.


The FPS meter then shows up overlaid on the page.


The yellow bar represents 60 FPS, the threshold under which the page might look sluggish. The bar chart on the right is a histogram view of the visible line chart: bars at the bottom represent low FPS and bars at the top high FPS. The numbers at the top are the range (max/min FPS).

Initially the numbers will be low because there is no previous data, since the meter only measures FPS when the page re-renders. To prime the data you need to manually trigger many visual updates, say by resizing the window or performing actions in your application.


Flame Chart

The flame chart is a very useful visualization for understanding performance. It allows identifying expensive functions at a glance (long bars) and diving into sub-functions quickly.

To use this, download and open (locally) the on_click.html example. Click on the “Performance” tab, then click the record button. Click the “Click Me” button on the page. After some number is displayed, stop the recording. We should get a chart similar to the one below:


Flame chart view

We can see the area chart at the top, which is an overview of the processing happening below. The y-axis in that chart is the amount of CPU usage.

The second chart is the flame chart. It has bars corresponding to function calls (e.g. Event:click) and, below each, the functions that were called by it. We see a generic Function Call entry because we used an anonymous arrow function as the click handler.

Another thing to notice is that if we zoom in enough, we see what looks like multiple calls to the expensive() function, but in our code it is a single continuous function. I’m not sure why Chrome does that, but I’m assuming it has to do with garbage collection kicking in.


There are other interesting alternative views at the bottom. The doughnut chart gives a breakdown of the types of work being done. In our case it’s just a mix of scripting and idle.


The Call Tree view is great for details. It allows looking at specific call chains and how much each one is contributing. In our example, we can see that Event: click is an expensive operation, but it’s the expensive() function that is the culprit (based on its Self time).


To find specific bottlenecks, the Bottom-Up view might be more interesting. While the Call Tree shows a nested tree from the root, the Bottom-Up view shows the internal nodes, sorted by Self Time.


Performance Metrics

When we record a trace that includes a page load, it also includes some standard metrics like First Contentful Paint (FCP). The definitions of some of them are in [4].


It’s possible to write our own events that show up in the Timings row. We just need to add markers and measure durations using the performance API [6], namely performance.mark() and performance.measure().

Then it shows up in the Timeline row:


The result of performance.measure() also contains useful information, such as the start time and duration.


Network

When we first open the Network tab, it immediately starts recording network activity, but since we want a full scan of the page’s requests, let’s stop it, clear the data, start recording again, reload the page and then stop the recording to get a report.

The sample below is from imdb.com:


Network tracing for imdb.com

Similar to the performance tab, this view provides a timeline which we can zoom in. The table below lists all network requests performed by the browser, including static resources like images, JavaScript files and async requests to fetch more data.

We can filter by request type at the top:


Here we could look for oversized images (it’s unlikely you need to show a gigantic high-resolution image by default, so consider sending a thumbnail with a link to the larger image) or slow async requests.

Async requests are of type XHR (XmlHttpRequest), so we can filter by that:


The bars in the last column give an indication not only of the time a request took, but also of its dependencies. Sometimes an async request doesn’t start until a specific one finishes. If we hover over the bar of the render request, we see a detailed view of its timing.


It has a nice breakdown of the timings, including DNS lookup and SSL latency (authentication). It also includes the queuing time, the time spent waiting for the first byte (TTFB), and then the content download.

The queuing indicates that the request was ready to go, but needed to wait on some other dependency or due to browser limits. Usually there’s a limit of active XHRs per domain the browser can have (<10) [7].

TTFB means “time to first byte” – the time it takes to start receiving data. This is indicative of how long the server took processing the request.

Content download is the time spent getting the resource/data. It correlates with the size of the data and the network bandwidth/speed.

Measuring Performance

Duration from start

The performance API keeps track of specific events in the page loading lifetime, so we can measure durations compared to a specific point in time.

performance.now() [9] provides the timestamp since the page started to load (in rough terms – see [9] for a more precise definition).

Side note: browsers purposefully reduce the precision of these APIs to protect against attacks that rely on high-precision timing information [10].

Custom Durations

As we mentioned earlier (in measurement.js), we can measure custom durations by using performance.mark() and performance.measure() [5].

Benchmarking

We can use the APIs above to create simple benchmarks.

Before we start: always beware of over-optimizing code! You might end up spending a lot of time to improve performance against a specific benchmark, browser or even JavaScript engine version. That’s part of the reason V8 retired the benchmark tool Octane.

That said, if you identify some code in the critical path that executes the same operation on the order of millions of times, it might be worth optimizing it. One example is changing from array methods like forEach() and reduce() to pure for-loops. We can perform several runs of the same function and get the average. I opted for using a random number generator to avoid any optimization related to caching.

The benchmark function generates an array with 10 million random floats and calculates their sum. It runs the code 10 times. The results on Chrome follow:

  • for loop 12ms
  • reduce 156ms
  • forEach: 208ms

On Firefox it yields

  • for loop 40ms
  • reduce 48ms
  • forEach 64ms

We can see a for loop is ~92% and ~16% faster than reduce on Chrome and Firefox respectively. Some takeaways:

  • It’s highly browser-dependent! Don’t over-optimize code for one browser.
  • Although the relative gains are substantial, the absolute improvement is on the order of 100ms for an array of 10 million items – how practical is this? The creation and allocation of the array probably takes more time than that.
  • This is pure computation with no DOM access needed, so also consider using Web Workers.

Techniques

Here we discuss some techniques to improve performance. The first two are general purpose optimizations but we include here for completeness.

Reduce Algorithmic Complexity

We’ll start with a quote from Donald Knuth [15]:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Before optimizing code, we should consider optimizing the algorithm, that is, reducing its big-O complexity. In a recent project, we found several wins by identifying hidden O(n^2) code and converting it to O(n).

Trade Memory for Computation

We often need to make this tradeoff: to improve the speed of code (computation) we have to use more memory. This is actually how we improved the O(n^2) to O(n): by keeping a lookup hash table for fast access.

In JavaScript a quick look-up table can be implemented via Object, or ES6 Map and Set. Other techniques involve caching (memoization).

Encode JSON Payloads

In [8] Addy Osmani and Mathias Bynens suggest that if you’re sending large JSON payloads via the network, it’s better to send them JSON-encoded instead of as JavaScript object literals:

Because the JSON grammar is much simpler than JavaScript’s grammar, JSON can be parsed more efficiently than JavaScript

Reduce CSS selector complexity

In [17], Lewis suggests reducing the complexity of CSS selectors, for example replacing a selector with complex pseudo-classes and descendant combinators by a single class applied directly to the target element. The reason being:

Roughly 50% of the time used to calculate the computed style for an element is used to match selectors (…)

Use Flexbox

In [18] Lewis suggests that flexbox is more efficient than other layout alternatives such as float positioning. One caveat made in the article is that it is not widely supported. As of 2020, however, that no longer seems to be the case.

Avoid reading DOM properties right after mutating them

In [18] Lewis explains that the Layout phase happens after the JavaScript runs.


This means that changing a DOM property and reading it right afterwards requires the browser to block and perform the layout before returning the new value. If possible, do the reads before the writes.

Avoid some advanced CSS effects

This is more of a tradeoff between performance and design. In [19] Lewis mentions that properties such as box-shadow can be expensive.

It’s unclear, however, exactly which CSS properties are expensive.

Use Layers (but sparingly)

In [19] Lewis suggests creating layers for elements that are updated very often, to prevent their updates from causing repaints in the rest of the page. One way to force the creation of a layer is the will-change CSS property (e.g. will-change: transform).

Layers should only be created if there are clear improvements in performance since, like indexes in databases, they come with memory and management costs.

Use requestAnimationFrame()

Since JavaScript’s main thread is also used for rendering, running expensive functions might cause the UI to freeze. In order to prioritize rendering, we can schedule expensive functions via requestAnimationFrame(). According to [3]:

When visual changes are happening on screen you want to do your work at the right time for the browser, which is right at the start of the frame. The only way to guarantee that your JavaScript will run at the start of a frame is to use requestAnimationFrame

In [20], Lewis provides a more specific case for the use of requestAnimationFrame(): on scroll input handlers since they block the UI.

There’s one caveat however: input handlers execute before requestAnimationFrame(), so if an input handler makes style changes and we then read styles inside the requestAnimationFrame() callback, we run into the problem described in “Avoid reading DOM properties right after mutating them”.

Web Workers

As we discussed previously, web workers run in a separate thread, which prevents your application from freezing due to the main thread being blocked. To understand why this happens, it’s important to know the basics of how the JavaScript execution model works.

The major caveat is that web workers don’t have access to the DOM but they do have I/O access (like XHR).

Web Assembly

It’s possible to use lower level languages like C++ and Rust and run them on the browser by compiling them to an intermediate representation called Web Assembly (WASM).

This guide explains how to compile Rust code to Web Assembly, package it as an npm package and execute it in the browser via Webpack. I’ve tried it myself and the process is relatively straightforward.

Portability. Web Assembly’s official website [14] claims it is portable across different platforms.

Existing libraries. One of the major selling points of Web Assembly is being able to run existing libraries written in languages like C/C++. In [11] PSPDFKit’s team showcases their port of a PDF renderer: rendering that previously had to happen on the server can now be done in the browser by shipping the binary as wasm. They performed some benchmarks, getting mixed results in terms of performance, but had several follow-ups with browser vendors and were able to improve performance (though no subsequent analysis was available in the post).

Web Assembly’s website [13] lists other use cases.

Performance. The main benefit I can see for starting a project from scratch in, say, C++ or Rust would be performance over JavaScript. The JS vs WASM site [12] provides a set of computations implemented in JavaScript and in wasm compiled from C++. The results are not promising: in most cases JavaScript is faster.

GPU via WebGL

WebGL (Web Graphics Library) is a JavaScript API available in most browsers, with the main purpose of rendering 2D and 3D graphics. But because it has access to the GPU (which JavaScript doesn’t have direct access to), it can also be used for heavy but highly parallelizable computations such as machine learning [16].

If your application requires this sort of processing, WebGL can be an interesting idea.

Conclusion

This was another of the notes on Web Development. The other notes are listed in the Related Posts section below.

In this post we covered a wide variety of tools and techniques regarding performance of client-side applications. We didn’t delve into much detail, but hopefully some of the links will serve as starting pointers for further research.

For me it was particularly interesting to learn more about Web Assembly and I’m interested in studying it further, especially with Rust.

References

[1] Get Started With Analyzing Runtime Performance
[2] Rendering Performance
[3] Optimize JavaScript Execution
[4] User-centric performance metrics
[5] User Timing API
[6] StackOverflow: Using performance.mark() with Chrome dev tools performance tab
[7] StackOverflow: How many concurrent AJAX (XmlHttpRequest) requests are allowed in popular browsers?
[8] The cost of parsing JSON
[9] MDN: performance.now()
[10] Github w3c/hr-time: Reducing the precision of the DOMHighResTimeStamp resolution
[11] A Real-World WebAssembly Benchmark
[12] JS vs WASM
[13] WebAssembly: Use Cases
[14] WebAssembly: Portability
[15] Donald Knuth: Structured Programming with go to Statements
[16] TensorFlow.js: Machine Learning for the Web and Beyond
[17] Reduce the Scope and Complexity of Style Calculations
[18] Avoid Large, Complex Layouts and Layout Thrashing
[19] Simplify Paint Complexity and Reduce Paint Areas
[20] Debounce Your Input Handlers

Related Posts

  • Notes on JavaScript Interpreters – This post focuses on the detail of Chrome’s V8 interpreter. The relationship with performance is that understanding how JavaScript code is executed (e.g. event loop model, JIT) can give us insight on how to optimize it.
  • Web workers – is a browser feature that allows running code outside of the main thread. This can speed up computation as long as no DOM access is needed.
  • Notes on how browsers work – provides a basic overview on the stages of executing a web page, from downloading HTML and Javascript, executing the scripts, constructing the DOM tree and laying out the page. Understanding these steps might give insights on how to better optimize the app.
  • Notes on Javascript Memory Profiling in Google Chrome – memory is related to performance because memory and CPU are the major bounded resources your app has to deal with. Not taking care of memory (e.g. memory leaks) could also exhaust the memory available for the execution engine to make the CPU-Memory tradeoffs we mentioned above.

Sockets

In this post we’ll discuss sockets, more precisely Network Sockets. It’s based on two excellent articles from Brian “Beej Jorgensen” Hall. We’ll first cover some basic OS and Network concepts and then go over some important functions needed to create a simple socket server that can handle requests from multiple clients.

This post assumes basic familiarity with the Linux operating system and the C programming language.

OS Basics

File Descriptors

A file descriptor is basically a numerical identifier (id) that indexes into a lookup table maintained by a given process. It is used to model not only files, but also things like stdin (special id = 0), stdout (id = 1) and stderr (id = 2). Sockets are also represented as file descriptors, as we’ll see later.

Fork

fork() is a system call the current process can use to create copies of itself (known as children) that run the same program. One observation is that the child process gets a copy of the parent’s data.

This will be important for our example where the main process uses fork() to generate children to handle connections. The child needs to inherit some of the file descriptors from the parent.

fork() returns 0 in the child process, and a non-zero value corresponding to the process id (pid) of the child in the parent. A common way to use fork() is the following:
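
The original C listing isn’t reproduced here; below is a minimal sketch of the same pattern in Python, whose os.fork() wraps the same system call:

import os

pid = os.fork()
if pid == 0:
    # child: fork() returned 0
    print("I'm the child, my pid is", os.getpid())
else:
    # parent: fork() returned the child's pid
    print("I'm the parent of child", pid)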

The return value can be used to distinguish between a parent and child and hence we can have them execute different code.

Signal

Signals are one of the simplest ways to communicate with a process: we just send a code that the process knows how to handle. There are OS-level signals like SIGKILL and SIGTERM and user-defined ones such as SIGUSR1. It’s possible to override how a process handles specific signals via sigaction(), which takes a structure that points to a handler.
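
The C structure isn’t shown here; as a rough illustration, Python’s signal module exposes the same idea with a simpler interface, registering a handler function for a given signal:

import os
import signal

def handler(signum, frame):
    print("received signal", signum)

# equivalent in spirit to sigaction(): route SIGUSR1 to our handler
signal.signal(signal.SIGUSR1, handler)

os.kill(os.getpid(), signal.SIGUSR1)  # send ourselves the signal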

Network Basics

Beej describes the Layered Network Model but follows with this interesting observation:

Now, this model is so general you could probably use it as an automobile repair guide if you really wanted to. A layered model more consistent with Unix might be:

  • Application Layer (telnet, ftp, etc.)
  • Host-to-Host Transport Layer (TCP, UDP)
  • Internet Layer (IP and routing)
  • Network Access Layer (Ethernet, wi-fi, or whatever)

In this model, sockets live in the application layer, since they rely on TCP or UDP on top of IP. We’ll go over these three pieces next.

IP the Internet Protocol

The guide discusses some details of IP, including the IPv4 vs. IPv6 distinction and different address types. The flags AF_INET and AF_INET6 are part of the socket API and are associated with these two versions.

One interesting detail is byte order (Little-Endian vs. Big-Endian): while different computer systems can represent data in different ways, the order is standardized for the internet as Big-Endian, also known as Network Byte Order.

UDP and TCP

The User Datagram Protocol is connectionless (stateless) and provides no guarantees on the order of the datagrams, their delivery or that duplicates are avoided. The messages sent via UDP are known as datagrams.

The Transmission Control Protocol relies on a connection (established via a 3-way handshake), guarantees order and performs retries. The messages sent via TCP form a data stream.

The socket types SOCK_STREAM and SOCK_DGRAM are associated with the TCP and UDP protocols respectively.

Socket API

Before we proceed with our example, we’ll cover the C functions in sys/socket.h that correspond to the Linux socket APIs:

getaddrinfo() is a relatively high-level function that is capable of resolving an address (e.g. google.com) to an actual IP and port. An example of use is in Beej’s full code.

The variable servinfo is of type addrinfo, which is a node of a linked list of results.

getaddrinfo() returns a list of such values, any that match the criteria from the input parameters. In Beej’s client code, we’ll see that it iterates over that list until it finds a set of parameters it can connect to.
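
For a feel of what this resolution returns, Python wraps the same call as socket.getaddrinfo(); instead of a linked list it returns a plain list of tuples (a sketch, not Beej’s C code):

import socket

# resolve a host and service to candidate (family, type, proto, canonname, sockaddr) tuples
results = socket.getaddrinfo("google.com", "http", proto=socket.IPPROTO_TCP)

for family, socktype, proto, canonname, sockaddr in results:
    # e.g. AddressFamily.AF_INET SocketKind.SOCK_STREAM ('142.250.0.100', 80)
    print(family, socktype, sockaddr)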

socket() returns a file descriptor. It basically creates an entry with a given id in a table and returns that identifier, which we’ll use to establish a connection later. The interface is as follows:

int socket(int domain, int type, int protocol);

  • domain could be one of AF_INET and AF_INET6 (IPv4 / IPv6)
  • type is one of SOCK_STREAM or SOCK_DGRAM (TCP / UDP)
  • protocol could be one of PF_INET and PF_INET6

In practice AF_INET is the same as PF_INET. Beej says:

This PF_INET thing is a close relative of the AF_INET (…) they’re so closely related that they actually have the same value (…), it was thought that maybe an address family (what the “AF” in “AF_INET” stands for) might support several protocols that were referred to by their protocol family (what the “PF” in “PF_INET” stands for). That didn’t happen.

More conveniently, we can also use the results of getaddrinfo() to fill these in for us.

bind() binds a socket to a specific hostname and port. It is not always needed (for example in the client case). The client doesn’t usually care which port is used on its own side, so it can let the OS choose. For the server case it’s important because it defines the IP address and port the socket will listen to.

This information can be more easily provided via the struct returned from getaddrinfo(). By passing NULL as the host to getaddrinfo() and AI_PASSIVE in ai_flags, we have the function fill in the IP in res for us.

connect() is the function a client uses to indicate the desire to establish a connection with a given server. Similarly to bind(), we can use the results from getaddrinfo().

Note how we didn’t need to call bind(): connect() fills in the local host and a random port for us.

listen() is the function a server calls to define how many pending connections can be queued on a socket (the backlog). Beej’s article mentions a system limit of around 20, but using 5 to 10 seems to work in practice.

accept() is the function that actually connects with a specific client. So that the original socket can keep listening for other incoming connections, accept() returns a new socket descriptor which will be used to send and receive messages.

send() / recv() are used to send and receive messages via the established connection. One important aspect is that while you specify the size of the data being sent, the API does not guarantee it will send (or receive) the whole data in one call, so you need to write a loop to make sure all data is sent/received.
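
A minimal sketch of such a loop in Python (the C version in Beej’s guide is analogous; Python also offers socket.sendall(), which does this for you):

def send_all(sock, data):
    """Keep calling send() until every byte of data has gone out."""
    total_sent = 0
    while total_sent < len(data):
        sent = sock.send(data[total_sent:])
        if sent == 0:
            raise ConnectionError("socket connection broken")
        total_sent += sent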

Example

The high-level sequence of calls for the server is:

  • getaddrinfo()
  • socket()
  • bind()
  • listen()
  • accept()
  • recv()/send()

For the client we have:

  • getaddrinfo()
  • socket()
  • connect()
  • send()/recv()

Beej provides examples of server and client code in C: server.c and client.c.
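
To tie the sequence together, here is a hedged Python sketch of a tiny echo server and client following the same call order (Python’s socket module mirrors the C API; the port number is arbitrary):

import socket

PORT = 3490  # arbitrary example port

def server():
    infos = socket.getaddrinfo(None, PORT, type=socket.SOCK_STREAM, flags=socket.AI_PASSIVE)
    family, socktype, proto, _, sockaddr = infos[0]
    sock = socket.socket(family, socktype, proto)   # socket()
    sock.bind(sockaddr)                             # bind()
    sock.listen(5)                                  # listen()
    conn, addr = sock.accept()                      # accept(): new socket for this client
    data = conn.recv(1024)                          # recv()
    conn.send(data)                                 # send(): echo it back
    conn.close()
    sock.close()

def client(message=b"hello"):
    infos = socket.getaddrinfo("localhost", PORT, type=socket.SOCK_STREAM)
    family, socktype, proto, _, sockaddr = infos[0]
    sock = socket.socket(family, socktype, proto)   # socket()
    sock.connect(sockaddr)                          # connect()
    sock.send(message)                              # send()
    print(sock.recv(1024))                          # recv()
    sock.close()

Running server() in one process and client() in another should print b'hello' on the client side.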

Conclusion

Beej has an entire guide dedicated to inter-process communication. The guide covers basic concepts such as creating new processes via fork() and handling signals; synchronization mechanisms such as locks and semaphores; and communication mechanisms such as pipes, message queues and sockets. I like the conversational style of his writing.

The last time I wrote code using sockets was probably a long time ago. I didn’t have time to dig deep into this subject, so I didn’t feel like I learned a ton, but it was a good refresher nevertheless.

References

[1] Beej’s Guide to Unix Interprocess Communication
[2] Beej’s Guide to Network Programming


Python Coroutines

In this post we’ll cover Python coroutines. First we’ll study David Beazley’s great tutorial [1] to learn about generators and coroutines, and how to build a multitask system from the ground-up.


Goddess Durga – source: Flickr

We’ll then take a look at a Python module called asyncio which makes use of coroutines to enable asynchronous I/O.

Generators

Python generators are functions that can return execution back to the caller and, when called again, resume from the same place. It’s the same concept as JavaScript generators, which we talked about before.

The syntax for yielding the execution back is via the yield keyword. Here’s an example of a custom implementation of range():
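
The original snippet isn’t reproduced here; a minimal version might look like this:

def my_range(n):
    i = 0
    while i < n:
        yield i  # hands execution back to the caller
        i += 1

for x in my_range(3):
    print(x)  # 0, 1, 2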

Because the execution can be yielded at any time in the code, we can use infinite loops:
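
For example, a counter that never terminates on its own (a sketch):

def count(start=0):
    n = start
    while True:
        yield n
        n += 1

counter = count()
print(next(counter))  # 0
print(next(counter))  # 1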

Another way to use generators is to “pipe” (in the Linux way), one generator into another, like in the example where we read lines from a dictionary file and find those ending in “ion”:
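
A sketch of that pipeline, assuming a words file at /usr/share/dict/words (the path is an assumption):

def get_words(filename):
    with open(filename) as f:
        for line in f:
            yield line.strip()

def filter_suffix(words, suffix):
    for word in words:
        if word.endswith(suffix):
            yield word

# "pipe" one generator into another, one line at a time
for word in filter_suffix(get_words('/usr/share/dict/words'), 'ion'):
    print(word)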

This also highlights one advantage of the generator abstraction: a lower memory footprint, since it processes one line at a time. Sure, we could achieve the same with regular functions, but they would need to be combined in a single block:
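
Something like the following (same behavior, but the reading and the filtering are now fused together):

def print_matching_words(filename, suffix):
    with open(filename) as f:
        for line in f:
            word = line.strip()
            if word.endswith(suffix):
                print(word)

print_matching_words('/usr/share/dict/words', 'ion')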

Which is efficient but doesn’t allow modularity.

Combined with the fact that generators can be infinite loops, one can model functionality like tail and grep as generators (e.g. tail -f | grep foo) which is the exact example Beazley provides in his presentation [1].

Coroutines

PEP 342 explains the motivation behind coroutines:

Coroutines are a natural way of expressing many algorithms, such as simulations, games, asynchronous I/O, and other forms of event-driven programming or co-operative multitasking. Python’s generator functions are almost coroutines — but not quite — in that they allow pausing execution to produce a value, but do not provide for values or exceptions to be passed in when execution resumes. They also do not allow execution to be paused within the try portion of try/finally blocks, and therefore make it difficult for an aborted coroutine to clean up after itself.

There are four main changes:

  • yield is an expression instead of statement, which means you can assign the return type of yield to a variable.
  • send() method which can be used to send values to the yield expression. send() is a generalization of next(): calling next() is equivalent to calling send(None).
  • throw() method which raises an exception at the place where the last yield was called.
  • close() method which raises a GeneratorExit at the place where the last yield was called. Effectively used to force terminating the generator.

Here’s a simple coroutine that just prints what it receives:
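
A sketch of such a coroutine:

def printer():
    while True:
        value = yield          # receives whatever is passed via send()
        print('received:', value)

p = printer()
p.send(None)                   # "prime" the coroutine up to the first yield
p.send('hello')                # prints: received: hello
p.send(42)                     # prints: received: 42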

Notice that we have to call an empty send() to “initialize” the coroutine. According to PEP 342:

Because generator-iterators begin execution at the top of the generator’s function body, there is no yield expression to receive a value when the generator has just been created. Therefore, calling send() with a non-None argument is prohibited when the generator iterator has just started (…). Thus, before you can communicate with a coroutine you must first call next() or send(None) to advance its execution to the first yield expression.

In the example below we highlight how this flow can be confusing. The first send() receives the value from yield, but it is only able to send a value through a subsequent send().
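
A sketch of that back-and-forth:

def echo():
    received = None
    while True:
        # the value on the right of yield goes OUT to the caller's send()/next();
        # the value passed to the NEXT send() comes back IN here
        received = yield 'I got: {}'.format(received)

e = echo()
print(e.send(None))     # I got: None   (just primes the coroutine)
print(e.send('first'))  # I got: first
print(e.send('second')) # I got: second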

Since most of the time the first send() in a coroutine will be just to initialize it, Beazley provides a decorator to abstract that:
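
A sketch of the decorator (essentially it calls next() on the freshly created generator before returning it):

from functools import wraps

def coroutine(func):
    @wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)          # advance to the first yield
        return cr
    return start

@coroutine
def printer():
    while True:
        value = yield
        print('received:', value)

p = printer()
p.send('hello')  # no priming call needed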

Piping

We can chain multiple coroutines the same way we did with generators, but now the order is somewhat reversed. To see why, consider our generator pipe example rewritten with coroutines. Here we pass print_step() to filter_step() and filter_step() to get_words(). Contrast this with the generator version, where we start with get_words(), pass its results to filter_step() and then to print_step().
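
A sketch of the coroutine version, reusing the @coroutine decorator from above (data is now pushed downstream with send()):

@coroutine
def print_step():
    while True:
        word = yield
        print(word)

@coroutine
def filter_step(suffix, target):
    while True:
        word = yield
        if word.endswith(suffix):
            target.send(word)

def get_words(filename, target):
    with open(filename) as f:
        for line in f:
            target.send(line.strip())

get_words('/usr/share/dict/words', filter_step('ion', print_step()))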

Event Handling

Beazley provides an example using an XML parser to manipulate XML and convert it to JSON. The API of xml.sax relies on the visitor pattern, where at important steps of the AST traversal (e.g. startElement, characters, endElement) it calls specific methods of a content handler.

The example uses an adapter layer (cosax.py) to allow using coroutines instead of the content handler. It also combines multiple smaller coroutines to produce more complex ones, demonstrating the composability of coroutines.

Coroutines as tasks

After a preliminary overview of operating systems, Beazley builds a prototype of a multitasking system by modeling coroutines as tasks (no threads or processes).

  • Task – wrapper around coroutine
  • Scheduler – can create new tasks and schedule them. It keeps a loop running as long as there are tasks, and existing tasks can cause new tasks to be scheduled, similar to the JavaScript event loop.

Yielding inside a task is like an OS interruption (trap).

System Calls

One of the interesting parts of the multitasking system implementation is system calls. The task/coroutine issues a system call by providing it as a value to the yield (e.g. yield GetTid()). This is received by the scheduler, which can provide the necessary context to the specific system call implementation.

The tutorial then covers one of the most important parts of the multitasking system which is the ability to do asynchronous I/O operations (e.g. reading and writing to a file).

The idea is to introduce a “system call” corresponding to performing I/O. When a task invokes it, the task doesn’t get rescheduled but is put in a staging area (self.read_waiting in the sketch below). The scheduler then continuously polls the OS to check whether any of the file descriptors corresponding to the tasks in the staging area are ready, in which case the task is rescheduled.
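
A much simplified sketch of this mechanism (coroutines used directly as tasks, no Task wrapper; Beazley’s real scheduler is more complete):

import select

class ReadWait:
    """'System call': ask the scheduler to wait until fileobj is readable."""
    def __init__(self, fileobj):
        self.fileobj = fileobj

class Scheduler:
    def __init__(self):
        self.ready = []          # coroutines ready to run
        self.read_waiting = {}   # fd -> coroutine waiting for a read

    def new(self, coro):
        self.ready.append(coro)

    def run(self):
        while self.ready or self.read_waiting:
            if self.read_waiting:
                # block only if there's nothing else to run
                timeout = 0 if self.ready else None
                readable, _, _ = select.select(list(self.read_waiting), [], [], timeout)
                for fd in readable:
                    self.ready.append(self.read_waiting.pop(fd))
            task = self.ready.pop(0)
            try:
                result = task.send(None)
            except StopIteration:
                continue
            if isinstance(result, ReadWait):
                self.read_waiting[result.fileobj.fileno()] = task
            else:
                self.ready.append(task)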

Beazley uses this system to implement a simple server that receives data via a socket.

Trampolining

One of the limitations of the scheduler developed up to this point is that the yield has to happen at the top level of the coroutine, that is, a coroutine cannot invoke other functions that yield, which limits refactoring and modularization.

To overcome this limitation, the author proposes a technique called trampolining. Here’s an example where the execution of add() is done by the top level code, not by main():
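
A sketch of the trampoline idea (the add()/main() names follow the tutorial’s example):

def add(x, y):
    yield x + y

def main():
    result = yield add(2, 2)
    print(result)

# the "trampoline": the top-level code runs the sub-generator on main()'s behalf
m = main()
sub = m.send(None)       # main() yields the add(2, 2) generator
result = sub.send(None)  # we run add() ourselves and capture 4
try:
    m.send(result)       # resume main() with the result; it prints 4
except StopIteration:
    pass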

We can do a similar thing in the multitasking system. Because coroutines can recursively call other coroutines, we need to keep a call stack. We can do that inside Task by keeping an explicit stack.

In the example above, when a coroutine that is wrapped in Task makes another coroutine call, for example:

client, addr = yield Accept(sock)

Within Task, result will be a generator, so the current coroutine will be put on the stack and the execution will pass to that generator instead:

self.target = result

When that sub-coroutine yields the execution, the original caller is resumed

self.target = self.stack.pop()

Note that this all happens transparently to the scheduler itself.

This concludes the study of the multitask system from the ground up.

Modern Asynchronous I/O

The tutorial we just studied is from 2009 and, while it is still relevant, Python has gone through many changes since then.

We’ll now cover some of the developments since generator-based coroutines were introduced, culminating in a standard module for handling asynchronous I/O.

Delegating to subgenerators: yield from

PEP 380 introduced a new syntax for delegating the execution to another generator until that generator is finished. Simeon [5] provides an example where using yield from simplifies the code. In the code below generator() calls generator2() and generator3() in a loop to “exhaust” their execution.
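The example looks roughly like this (adapted from [5]):

def generator2():
    for i in range(10):
        yield i

def generator3():
    for j in range(10, 20):
        yield j

def generator():
    for i in generator2():
        yield i
    for j in generator3():
        yield j

print(list(generator()))  # 0 .. 19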

The same can be accomplished with yield from, which does that implicitly:
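Reusing generator2() and generator3() from above:

def generator():
    yield from generator2()
    yield from generator3()

print(list(generator()))  # same result: 0 .. 19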

Async / Await

PEP 492 introduces the async and await syntax. The motivation includes these two reasons, among others:

  • It is easy to confuse coroutines with regular generators, since they share the same syntax; this is especially true for new developers.

  • Whether or not a function is a coroutine is determined by a presence of yield or yield from statements in its body, which can lead to unobvious errors when such statements appear in or disappear from function body during refactoring.

To address these issues, a function defined with async def is considered a native coroutine (as opposed to a generator-based coroutine):
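A minimal illustration (the function name is just for the example):

async def compute():
    return 42

coro = compute()
print(coro)               # <coroutine object compute at 0x...>
coro.close()              # silence the "never awaited" warning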

A native coroutine can await generator-based coroutines, in which case it behaves the same as yield from. Borrowing from the example above:
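A sketch using types.coroutine to mark generator2() as a generator-based coroutine, so that it can be awaited:

import types

@types.coroutine
def generator2():
    for i in range(3):
        yield i

async def generator():
    # behaves like yield from generator2() would in a generator-based coroutine
    await generator2()

# Driving it by hand, the yielded values flow through the native coroutine:
g = generator()
print(g.send(None))   # 0
print(g.send(None))   # 1
print(g.send(None))   # 2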

The asyncio module

PEP 3156 describes a system to support asynchronous I/O, which is now known as the asyncio module. It proposes a coroutine-based multitasking system similar to the one Beazley describes. It has support for common I/O operations like those involving files, sockets, subprocesses, etc.

We won’t dive into much detail in this post, but we can see parallels to the scheduler we studied previously. Here’s an example where asyncio.run() schedules the coroutine main(), which in turn executes the sub-coroutines sleep() one after another:
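Something along these lines (sleep() here is our own wrapper around asyncio.sleep(), written to match the output below):

import asyncio

async def sleep(n):
    print(f'will sleep: {n}')
    await asyncio.sleep(n)
    print(f'slept: {n}')

async def main():
    await sleep(1)
    await sleep(2)

asyncio.run(main())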

This is not taking advantage of the non-blocking capabilities of asyncio. Why? Recall that await is equivalent to yield from, which causes the current coroutine to wait for the awaited call to finish before it continues. If we run this code we get:

will sleep: 1
slept: 1
will sleep: 2
slept: 2

What we want is to schedule both sub-coroutines at the same time, and asyncio allows that via asyncio.gather():
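Reusing sleep() from above:

async def main():
    # schedule both coroutines concurrently and wait for both to finish
    await asyncio.gather(sleep(1), sleep(2))

asyncio.run(main())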

If we run this code we get:

will sleep: 1
will sleep: 2
slept: 1
slept: 2

This means the first sleep() started but yielded execution back to the event loop when it hit asyncio.sleep(), allowing the second sleep() to run and print “will sleep: 2” before the first one finished.

Conclusion

I’ve used async/await and event loops in other languages such as Hack and JavaScript, but only recently ran into them in Python. I kept seeing mentions of coroutines and that led me to study them in more detail. Overall I felt like I learned a lot.

David Beazley’s tutorial is really good. It’s thorough and provides lots of analogies with operating systems concepts.

I also liked the presentation medium: the slides are self-contained (all the information is present as text) and he shows the whole code first and then repeats it in subsequent slides, highlighting the important pieces in each. This is difficult to achieve in flat text like a blog post.

References

[1] David Beazley – A Curious Course on Coroutines and Concurrency
[2] PEP 342 – Coroutines via Enhanced Generators
[3] PEP 380 – Syntax for Delegating to a Subgenerator
[4] PEP 492 – Coroutines with async and await syntax
[5] Simeon Visser – Python 3: Using “yield from” in Generators – Part 1
[6] PEP 3156 – Asynchronous IO Support Rebooted: the “asyncio” Module

2019 in Review

This is a meta-post to review what happened in 2019.

Posts Summary

This year I continued my efforts to learn about Bioinformatics. I wrote about the DNA Assembly problem, Protein Fold prediction and Protein Design.

I covered some Computer Science topics such as Consistent Hashing, the Aho-Corasick algorithm and Constructing Trees from a Distance Matrix (as part of my studies in Bioinformatics).

For Web Development, I learned about JavaScript Async Functions and CDNs in more detail. I also explored Observable, a JavaScript notebook.

As part of my resolution for 2019, I learned a bit more about Rust, in particular its memory management model, which is the part I struggled with the most (and still do). I also started revisiting Python because of work, starting with Python Type Hints.

I assembled my own desktop, which inspired me to learn more about the Von Neumann Architecture. I also went back to using Linux (after many years exclusively on macOS).

The Blog in 2019

The most popular blog post of the year was the Python Type Hints one (10k views, 9k visits in just 2 days!). It was somehow featured on the front page of Hacker News, reaching position #8 at some point! The discussion was mostly about the merits of typed vs. untyped languages, but it was a nice surprise nevertheless.

With some delays here and there, I kept the resolution to post once a month on average. The blog completed 7 years with 92 posts.

Resolutions for 2020

My plan is to focus on learning Python, since I now have an opportunity to use it at work. I’ll keep Rust on the side and learn more about it if time permits.

My studies in Bioinformatics made me feel that the computational models are too simplistic to be useful in practice. I’d love to get exposed to real-world examples, even if they’re ugly heuristics, but for now I’m done with the theory. As I dived deeper into the subject, I started seeing more and more mentions of quantum physics and quantum computing.

I studied QM back in college, but I cannot claim to understand even the basics of it. As for Quantum Computing, I know as much as Justin Trudeau ;) My focus will be on learning Quantum Computing.

I’m still interested in reading more papers, especially on distributed systems and AI, and in mobile development (I mention these mostly for the sake of tradition at this point, since I never end up following through on them).

Personal

At the end of the year I like to look back to reflect on and remember all the things I’ve done besides work and the technical blog.

Trips

In summary, in 2019 I visited Israel, Egypt, and India. Within the US, I had a chance to visit many national parks.

Early in the year, after a work trip, I went to Egypt and visited some sites in Cairo and Luxor.

egypt.jpg

1. Giza Pyramids, 2. Temple in Luxor, 3. Lanterns in the Khan el-Khalili

In November we attended a wedding in Hyderabad. We decided to tour around New Delhi, Jaipur and Agra before heading south to Hyderabad and Tirupati. From my selection below, one can see how much I admire Mughal architecture.

india.jpg

Top: 1. Humayun’s Tomb (New Delhi), 2. Hawa Mahal (Jaipur), 3. Jal Mahal. Bottom: 1. Karauli Temple, 2. Taj Mahal (Agra), 3. Qutb Shahi Tombs (Hyderabad)

I had a chance to see many new National Parks and their natural wonders, as part of a quick trip to Seattle, Washington and a road trip in Southern Utah.

nps.jpg

Top: 1. Diablo Lake (North Cascades), 2. Angel’s Landing Overlook (Zion), 3. Canyons (Bryce). Bottom: 4. Rock formations along the road, 5. Famous arch (Arches), 6. Islands in the sky (Canyonlands).

Books

Books I enjoyed this year:

  • Range: Why Generalists Triumph in a Specialized World – David Epstein makes a case that being a generalist can lead to a more successful path. He provides a range of examples, including Roger Federer and Galileo. Engaging.
  • A History of Western Philosophy – Bertrand Russell covers many philosophers and schools of thought. Good perspective on Western Europe’s history. His accounts often expose his own beliefs and opinions.
  • Ten Drugs: How Plants, Powders, and Pills Have Shaped the History of Medicine – Thomas Hager describes the history of 10 drugs, from opioids to vaccines to drugs for high cholesterol. Engaging.
  • The Book of Why – Judea Pearl describes some of his work, such as Causal Models and Do Calculus, in accessible language. The ideas are very clever and interesting, but the language of the book is a bit hyperbolic (e.g. terms like miracle, revolution and “scientists couldn’t have done this before”), which gets in the way.
  • Algorithms to Live By – Brian Christian and Tom Griffiths draw ideas from Computer Science (mostly combinatorial optimization) to solve problems from real life. Great to see many problems and algorithms from grad school in here.
  • Outliers – Malcolm Gladwell analyzes several success stories to make a compelling case that success is not just about genius and hard work. The environment where you were born and grew up matters a lot more than we think.
  • Educated: A Memoir – Intense and powerful story by Tara Westover. She grew up in an environment very hostile to her education but succeeded in achieving high goals against the odds. Really well written.
  • Shoe Dog: A Memoir by the Creator of NIKE – The history of Nike from the perspective of its founder, Phil Knight. He is a good writer, and Nike had a very interesting initial business model.
  • Mythos: The Greek Myths Retold – Stephen Fry provides a nice overview of Greek mythology, focusing on the gods and demigods (not the heroes). Casual writing style and interesting connections to the current world (via etymologies).

I’m looking forward to many more adventures, growth and learning in 2020!