Largest sets of subsequences in OCaml

I’ve noticed that there is this set of words in English that look very similar: tough, though, through, thought, thorough and trough. Except for thought, they have one property in common: they’re all subsequences of thorough. It made me wonder if there are interesting sets of words that are subsequences of other words.

Word cloud made with Jason Davies’ tool (https://www.jasondavies.com/wordcloud/)

This post is an attempt to answer a more general question: given a list of words, what is the largest set of these words such that they’re subsequences of a given word?

A word A is a subsequence of a word B if A can be obtained by removing zero or more characters from B. For example, “ac” is a subsequence of “abc”, and so are “bc” and even “abc” itself, but not “ba” or “aa”.

A simple algorithm to determine whether a word A is a subsequence of another word B is to start with two pointers, pA and pB, at the beginning of each word. We move pB forward until pA and pB point to the same character, in which case we move pA forward as well. A is a subsequence of B if and only if pA reaches the end of A before pB runs past the end of B. We could then iterate over each word W and find all the words that are subsequences of W. If the size of the dictionary is n and the length of the largest word is w, this algorithm runs in O(n^2 w).
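For concreteness, here is a minimal OCaml sketch of this two-pointer check (the name is_subsequence and the exact shape are mine; the post’s full code is linked further down):

(* Is [a] a subsequence of [b]? Advance pb until it matches the character
   at pa, then advance both. *)
let is_subsequence a b =
  let la = String.length a and lb = String.length b in
  let rec go pa pb =
    if pa = la then true            (* consumed all of a *)
    else if pb = lb then false      (* ran out of b first *)
    else if a.[pa] = b.[pb] then go (pa + 1) (pb + 1)
    else go pa (pb + 1)
  in
  go 0 0

let () =
  assert (is_subsequence "trough" "thorough");
  assert (not (is_subsequence "thought" "thorough"))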

For English words, we can use entries from /usr/share/dict/words. In this case, n is around 235k (wc -l /usr/share/dict/words), so a quadratic algorithm will take a while to run (around 5e10 operations).

Another approach is to generate all subsequences of a given word W and search the dictionary for each generated string. There are O(2^w) subsequences of a word of length w. If we store the dictionary in a hash table, we can then do it in O(n w 2^w). In /usr/share/dict/words, the length of the largest word, w, is 24.
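For illustration, a deliberately naive OCaml sketch of this enumeration (each generated string would then be looked up in a hash table of the dictionary; this is a sketch, not code from the post):

(* Enumerate the distinct subsequences of [word] by deciding, character by
   character, whether to keep or skip it: O(2^w) strings in the worst case. *)
let subsequences word =
  let n = String.length word in
  let rec go pos current acc =
    if pos = n then current :: acc
    else
      let acc = go (pos + 1) current acc in                  (* skip word.[pos] *)
      go (pos + 1) (current ^ String.make 1 word.[pos]) acc  (* keep word.[pos] *)
  in
  go 0 "" [] |> List.sort_uniq compare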

Running a calculation with the numbers (R script), the number of high-level operations is 4e10, about the same order of magnitude as the quadratic algorithm.

Distribution using ggplot2

A third approach is to use a trie. This data structure allows us to store the words from the dictionary in a space-efficient way, and we can search for all subsequences using it. The trie will have at most 2e6 nodes (the total number of characters over all words), and fewer in practice because of shared prefixes. Since any valid subsequence has to correspond to a node in the trie, the cost of searching for a given word cannot be more than the size of the trie t, so the complexity per word is O(\min(2^w, t)). A back-of-the-envelope calculation gives us 2e9. But we’re hoping that the size of the trie will be much less than 2e6.

Before implementing the algorithm, let’s define our trie data structure.

The Trie data structure

A trie is a tree where each node has up to |E| children, where |E| is the size of the alphabet under consideration. For this problem we’ll use lowercase ASCII only, so |E| = 26. Each node also has a flag telling whether a word ends at that node.
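A possible OCaml definition along these lines (a minimal sketch; the definition actually used in the post lives on github and may differ in the details):

module CharMap = Map.Make(Char)

type trie = {
  children : trie CharMap.t;  (* child nodes, indexed by the character on the edge *)
  hasEntry : bool;            (* does a dictionary word end at this node? *)
}

let empty = { children = CharMap.empty; hasEntry = false }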

Notice that in this implementation of the trie, the characters live on the edges, not in the nodes. The Map structure from the stdlib uses a balanced tree underneath, so get and set operations are O(log |E|).

The insertion is the core method of the structure. At a given node we have a string we want to insert. We look at the first character of the word. If a corresponding edge exists, we keep following down that path. If not, we first create a new node.
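A minimal sketch of insertion, assuming the trie type above; insert word 0 root returns a new trie with word added (names may differ from the code on github):

(* Walk down the trie following the characters of [word], creating nodes as
   needed, and mark the final node with hasEntry = true. *)
let rec insert word pos node =
  if pos = String.length word then { node with hasEntry = true }
  else
    let c = word.[pos] in
    let child =
      match CharMap.find_opt c node.children with
      | Some child -> child
      | None -> empty
    in
    { node with children = CharMap.add c (insert word (pos + 1) child) node.children }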

To decide whether a trie contains a given string, we traverse the trie following the string’s characters; the string is present only if we never run out of edges to follow and the node we reach at the end has its hasEntry flag set to true.
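For instance (again a minimal sketch on top of the type above):

(* Follow the edges spelled by [word]; the word is in the trie only if every
   edge exists and the final node has hasEntry set. *)
let contains word node =
  let rec go pos node =
    if pos = String.length word then node.hasEntry
    else
      match CharMap.find_opt word.[pos] node.children with
      | None -> false
      | Some child -> go (pos + 1) child
  in
  go 0 node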

This and other trie methods are available on github.

The search algorithm

Given a word W, we can search for all its subsequences in a trie with the following recursive algorithm. Given a trie node and the remaining string, we perform two searches: 1) for the subsequences that use the first character of the current string, in which case we “consume” that character and follow the corresponding edge; and 2) for the subsequences that skip the first character, in which case we “consume” the character but stay at the current node. In pseudo-code:

Search(t: TrieNode, w: string):
    // If a word ends at t, it's a subsequence of the
    // original word. Save it.
    If t.hasEntry: save the word ending at t

    If w is empty: return

    Let c be the first character of w
    Let wrest be w with the first character removed

    // Pick character c (only possible if t has an edge for c)
    If t->child[c] exists:
        Search(t->child[c], wrest)

    // Do not pick character c
    Search(t, wrest)

The implementation in OCaml is on github; a simplified sketch of the search is given below.
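This sketch reuses the trie type and CharMap module from above; the helper word_of_prefix rebuilds a word from the (reversed) list of characters on the path from the root, and the names here may differ from the repository’s:

(* [search word trie] returns the dictionary words stored in [trie] that are
   subsequences of [word]. *)
let search word trie =
  let word_of_prefix prefix =
    prefix |> List.rev |> List.to_seq |> String.of_seq
  in
  let rec go pos prefix node acc =
    (* Any word ending at this node is a subsequence of the original word. *)
    let acc = if node.hasEntry then word_of_prefix prefix :: acc else acc in
    if pos = String.length word then acc
    else
      let c = word.[pos] in
      (* Pick character c, if the trie has a matching edge. *)
      let acc =
        match CharMap.find_opt c node.children with
        | Some child -> go (pos + 1) (c :: prefix) child acc
        | None -> acc
      in
      (* Do not pick character c. *)
      go (pos + 1) prefix node acc
  in
  go 0 [] trie [] |> List.sort_uniq compare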

Experiments

Our experiment consists of loading the words from /usr/share/dict/words into a trie and then, for each word in the dictionary, looking for its subsequences. The full code is available on github.

The code takes 90 seconds to run on my laptop. Not too bad, but I’m still exploring ways to improve the performance. One optimization I tried was, instead of returning an explicit list of strings from the search as described above, to return them encoded in a trie, since we can save some operations thanks to shared prefixes. I have that version on github, but unfortunately it takes 240 seconds to run and requires more memory.

Another way is to parallelize the code. The search for subsequences is independent for each word, so it’s an embarrassingly parallel case. I haven’t tried this path yet.

The constructed trie has 8e5 nodes, or ~40% of the total number of characters.

Subsequences of “thorough”

The question that inspired this analysis was finding all the subsequences of thorough. It turns out it has 44 subsequences in the dictionary, but most of them are boring, that is, single letters or short words that look completely unrelated to the original word. The most interesting ones are those that start with t and have at least three letters. I selected some of them here:

  • tho
  • thoo
  • thoro
  • thorough
  • thou
  • though
  • thro
  • throu
  • through
  • thug
  • tog
  • tou
  • toug
  • tough
  • trough
  • trug
  • tug

The word with the most subsequences is pseudolamellibranchiate, with 1088! The word cloud at the beginning of the post contains the 100 words with the largest number of subsequences. I tried to find interesting words among these, but they’re basically the longest words: long words have exponentially more combinations of subsequences, so the chance of those existing in the dictionary is greater. I tried to come up with penalizations for the score:

1) Divide the number of subsequences by the word’s length. This is not enough; the longest words still show up on top.
2) Apply log2 to the number of subsequences and divide by the word’s length. In theory this should account for the exponential number of subsequences of a word. This turns out to be too much of a penalization, and the shortest words fare too well in this scenario.

I plotted the distribution of the number of subsequences by word length. We can see a polynomial curve, but with increasing variance:

Generated with this ggplot2

In the chart above, all points with the same x-value would normally fall on a single vertical line. One neat visualization trick is to add noise (jitter) so we also get a sense of density.

If we use a box plot instead, we can see a quadratic pattern more clearly by looking at the median for each length.

Generated with this ggplot2

Given this result, I tried a final scoring penalization, dividing the number of subsequences by the square of the word’s length, but it’s still not enough to surface many interesting words. Among the top 25, streamlined is the most common word, with 208 subsequences.

One interesting fact is that the lowest scoring words are those with repeated patterns, for example: kivikivi, Mississippi, kinnikinnick, curucucu and deedeed. This is basically because we only count unique subsequences.

Conclusion

This was a fun problem to think about and, even though it didn’t yield very interesting findings, I learned more about OCaml and R. After dealing with bugs and with compilation and execution errors, I like OCaml more than before and I like R less than before.

R has too many ways of doing the same thing and the API is too lenient. That works well for the 80% of the cases which it supports, but finding what went wrong in the other 20% is a pain. OCaml on the other hand is very strict. It doesn’t even let you add an int and a float without an explicit conversion.

I learned an interesting bit of syntax that allows re-using the qualifier/namespace between several operations when chaining them, for example:
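For instance, OCaml’s local open syntax, Module.( ... ), lets one qualifier cover a whole chain of operations; an illustrative sketch with the CharMap module from earlier (not the original snippet):

(* Repeating the qualifier on every call: *)
let m1 = CharMap.add 'b' 2 (CharMap.add 'a' 1 CharMap.empty)

(* Re-using it for the whole expression with a local open: *)
let m2 = CharMap.(empty |> add 'a' 1 |> add 'b' 2)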

I also used the Batteries library for the first time. It has a nice extension for the rather sparse String module. It lets us simply do open Batteries, but that overrides a lot of the standard modules, which can be very confusing. I was scratching my head for a long time trying to figure out why the compiler couldn’t find the union() function in the Map module, even though I seemed to have the right version, until I realized the module was being overridden by Batteries. From now on, I’ll only use the specific modules, such as BatString, so it’s easy to tell which function comes from which module.

References

OCaml

  • [1] OCaml Tutorials > Map
  • [2] Strings – Batteries included
  • [3] Using batteries when compiling

R

  • [1] R Tutorial – Histogram
  • [2] Creating plots in R using ggplot2 – part 10: boxplots
  • [3] My R Cheat sheet

Bloom Filters

Burton Howard Bloom is an MIT alumnus who, while employed by the Computer Usage Company, developed and published a data structure that later became famous as the “Bloom Filter” [1].

In this post we’ll talk about Bloom Filters. We’ll give a description of the data structure and its operations. We’ll then study the theory that explains the performance of this data structure. Next, we’ll describe an implementation in Python, run some experiments and finally discuss applications.

Introduction

A Bloom filter is a probabilistic data structure that supports inserting elements into a set and testing whether an element belongs to that set, sacrificing accuracy for lower memory usage.

More precisely, it can say an element is in the set when in fact it’s not (a false positive), but if it says an element is not in the set, then it really isn’t. Also, the original variant doesn’t allow removal of elements [2].

Bloom filter diagram (Bloom_filter.svg)


Algorithm

The Bloom filter structure consists of a bit array of size m. We then define k independent hash functions, each of which maps a given value to one of the m positions uniformly.

Inserting an element consists of applying the k hash functions to the element and setting the bits at the k positions returned by these functions to 1.

Querying for an element consists of applying the same k hash functions and testing whether all the bits at the k positions returned by these functions are set to 1.
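As a minimal sketch of these two operations (written here in OCaml with the hash functions left abstract and one byte per bit for simplicity; the post’s actual implementation, described next, is in Python):

(* The filter is a bit array of size m plus k hash functions, each mapping a
   key to a position in [0, m). *)
type 'a bloom = {
  bits : Bytes.t;
  hashes : ('a -> int) list;
}

let create m hashes = { bits = Bytes.make m '\000'; hashes }

(* Insert: set the bit at each of the k hashed positions. *)
let insert filter x =
  List.iter (fun h -> Bytes.set filter.bits (h x) '\001') filter.hashes

(* Query: the element may be present only if all k positions are set. *)
let mem filter x =
  List.for_all (fun h -> Bytes.get filter.bits (h x) = '\001') filter.hashes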

If the test returns positive, there’s a chance the element is there, but it could be a false positive, since the bits could have been set by the insertion of other elements. On the other hand, if it returns negative, we’re sure the element is not there, because we never unset a bit. Also, as the number of elements in the set grows, the probability of false positives increases.

Time complexity. Both insertion and querying for an element are constant-time operations, since they are proportional to k, assuming random access memory and O(1) hash functions.

Probability of false positives. The probability p that the algorithm will return true when the element is not there (false positive), can be described in terms of m, k, n, through the following equation:

p = (1 - e^{-kn/m})^k

In Appendix A, we provide an approximate derivation of this result.

Optimal number of hash functions. For given values of n and m, we have a choice of how many hash functions k to use. Intuitively, if we use too few hash functions, we may increase the chances of collision when querying for an element, whereas if we use too many, the array fills up earlier and collisions also increase.

It’s possible to prove (a proof is sketched in Appendix B) that, for fixed m and n, the value of k that minimizes p is given by:

k = ln(2)m/n
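For example, with the values used in one of the experiments below, n = 250 and m = 1000, this gives k = \ln(2) \cdot 1000/250 \approx 2.77, so k = 3 is the closest integer choice.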

Python Implementation

In this experiment, we use the bitarray python module. It’s a C implementation that represents a bit array efficiently. It has conveniently overloaded operators so most of the time it’s like working with arrays.

We can define a sufficiently large value of our bit array length, m, for example:

from bitarray import bitarray
m = 1 << 20
bit = bitarray(m)

# bitarray doesn't guarantee the bits are all set to 0
# upon initialization
bit.setall(False)

The murmur algorithm

We need to pick a hash function to use with our Bloom filter. References such as [3] suggest using an algorithm called Murmur. This stack exchange thread has a nice survey about different hash functions.

Murmur is a hash algorithm developed by Austin Appleby. Even though it’s not suitable for cryptographic applications, in practice it is very fast and tends to distribute real-world instances well.

According to the author, the name comes from an early implementation detail, which used multiply-rotate-multiply-rotate operations to mix the internal state of the hash (hence MuR-MuR).

From the source code, this algorithm seems to shuffle the bits from the input key by using multiplication, shifting and xor operations.

The author mentions using simulated annealing to find some of the constants of the final mix of the algorithm. This final part is used to cause the avalanche effect, in which very similar inputs are hashed to very different values.

There is a Python library called pyhash that has an interface for several hash functions (implemented in C++).

To install it, we can use the easy_install command. Easy Install is a Python module used to download and install packages. pyhash in particular is available in the default repository, PyPI (the Python Package Index), so all we need to do is:

> sudo easy_install pyhash

To use it within a Python script:

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
hash_value = hasher(input)

Families of hash functions

In Jonathan Ellis’s post, he mentions a strategy for generating a family of hash functions. He cites a paper by Kirsch and Mitzenmacher [4] which shows we can generate a set of k hash functions from 2 independent hash functions in the following way:

h_i(x) = h_A(x) + i \cdot h_B(x), \, i = 0, \cdots, k-1

In practice, since murmur3_x64_128() generates 128-bit hashes, we can use the first 64-bits as the first hash function, and the last 64-bits as the second.

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
h128 = hasher(input)
h64l = h128 & ((1L << 64) - 1)
h64u = h128 >> 64

# i-th hash: (h_A + i * h_B) mod m, where m is the size of the bit array
hashes = map(
  lambda i: (h64l + i*h64u) % m,
  range(k)
)

All the code is available on github.

Experiments

Before running experiments with our Bloom filter implementation, let’s try to visualize the distribution of the Murmur hash function. For this first experiment, we compute the hash of keys ranging from 1 to 10k and take it modulo 10k, so all values fall in the same range. Then we plot the keys vs. their hashes in a scatter plot, by exporting this data as a CSV file and rendering it in R (using ggplot2):

ggplot(data, aes(x=i, y=h_i)) + geom_point(size=1.5) 

Scatter plot of keys vs. their Murmur3 hash values

From the chart, we can see it apparently distributes the keys uniformly, without any obvious pattern. This is not sufficient to prove it’s a good hash function, but it’s a necessary property.

To test the family of hashes, we can add a new column, which has i for the i-th hash function. Then, in ggplot, we render each class with a different color with the following command:

ggplot(data, aes(x=i, y=h_i, color=class)) +
  geom_point(size=1.5) + facet_grid(class ~ .)

Hash values of the family, one facet per hash function

We can now compare the distributions between each of the hashes by plotting scatter plots for each pair i, j. Now we use another ggplot2 command to generate a 5×5 grid of charts:

ggplot(data, aes(x=i, y=h_i, color=class)) +
  geom_point(size=1.5) + facet_wrap(~class, ncol=5)

5×5 grid of charts comparing the hash functions

We can see each chart has a uniform pattern, which is a good sign if we’re looking for independent hash functions. For comparison, let’s plot the same chart for a set of hash functions derived only from a single hash function, for example

h_i(x) = i \cdot h_A(x), i = 1, \cdots, k

The same grid for hash functions derived from a single hash function

Now, let’s analyze the false positive rate. We can do that by counting the number of false positives FP and the true negatives TN and evaluating:

\dfrac{FP}{FP + TN}

We fix m and k and shuffle an array of the numbers 0 to n-1 to simulate sampling without replacement. After each element of this array is inserted, we go over all non-inserted elements (FP + TN) and count how many of them our Bloom filter thinks are in the set (FP). We then plot a line chart of the number of elements inserted vs. the false positive rate, using the qplot() R function:

qplot(
  data$n, 
  data$prob, 
  ylab='Pr(false positive)', 
  xlab='N elements'
)

False positive rate vs. number of inserted elements

Let’s plot the effect of increasing the size of the bit array, using the optimal k and increasing the number of elements. In this particular experiment, we use n=250 and try different values of m: 100, 250, 500, 1000

qplot(
  data$n, 
  data$prob, 
  ylab='Pr(false positive)', 
  xlab='N elements', 
  size=I(1), 
  color=data$m
) + scale_colour_manual(
  values=c("red", "blue", "green", "black"), 
  name="m values"
)

False positive rate for m = 100, 250, 500, 1000

One strange behaviour seems to happen towards the right end of the chart for some curves: the false positive rate seems to drop. I’ve double-checked the code and it looks sane. One possible explanation is that we are calculating the probability over the non-inserted elements, and as we approach the end, the sample size becomes smaller, so the noise is more significant. Other than that, the probability of false positives tends to increase as we add more and more elements.

Now, let’s analyze the effect of the number of hash functions used. We fix n=250 and m=1000 and try different values of k: 1, 5, 10, 50.

False positive rate for k = 1, 5, 10, 50

We can see that using many hash functions can be bad because it fills up the array too fast (k=50). In general, k=5 and k=10 perform best for most of the spectrum.

Applications

Bloom filters are suitable where false positives are tolerable. For example, they can be used as a first step to avoid a lookup in a cache. Periodically, a cache server could send a Bloom filter to a local machine, corresponding to the items it has cached. If the Bloom filter says an element is in the cache, that might not be true, but if it says the item is not there, the local client can avoid a request to the cache and talk directly to the underlying storage.

Bloom filters use little memory, so they are suitable for network transmission.

Broder and Mitzenmacher discuss a lot more examples [5], especially network applications.

Conclusion

In this post we learned about Bloom filters. The idea itself is quite simple, by studying the theory we are able to review some probability and calculus concepts.

The implementation made us aware of details which are usually not discussed in theoretical texts, for example, which hash functions to use and how to generate a family of hash functions.

We used two Python libraries, pyhash and bitarray and learned a little bit about the Python packaging system, PyPI. We got some experience with the ggplot2 R library, which I plan to post about later.

What we didn’t cover were variants of Bloom filters, like the counting version, which allows deletions. Chazelle et al. introduced the Bloomier filter, which is a generalization of Bloom filters [6].

References

[1] Quora – Where can one find a photo and biographical details for Burton Howard Bloom, inventor of the Bloom filter?
[2] Wikipedia – Bloom Filter
[3] Spyved – All you ever wanted to know about writing bloom filters
[4] Less hashing, same performance: Building a better Bloom filter – A Kirsch, M. Mitzenmacher.
[5] Network Applications of Bloom Filters – A. Broder, M. Mitzenmacher
[6] Bloomier Filters – B. Chazelle et al.
[7] Probability and Computing: Randomized Algorithms and Probabilistic Analysis – M. Mitzenmacher, E. Upfal

Appendix A: Probability of false positives

Let’s consider an array of m bits and the insertion operation. The probability that a given bit is set by one of the hash functions is 1/m. Conversely, the probability of a bit not being set is 1 - 1/m. By the hypothesis of independent hash functions, the probability of a bit not being set by any of the k hash functions is

\left(1 - \frac{1}{m}\right)^k

Inserting n elements is equivalent to running the k functions n times, so after inserting n elements, the probability that a given bit is set to 0 is

\left(1 - \frac{1}{m}\right)^{kn}

or the probability it’s set to one as

1 - \left(1 - \frac{1}{m}\right)^{kn}

Now, the probability of a false positive p is the probability of all k positions being set for an element that was never inserted.

p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k

According to [2] though, this derivation is not accurate, since it assumes that the probability of each bit being set is independent of the others. The Wikipedia article sketches a more precise proof from Mitzenmacher and Upfal [7] which arrives at the same result.

To simplify the equation, we’ll use the following limit:

\lim_{x\to\infty} \left( 1 - \frac{1}{x} \right)^{x} = \frac{1}{e}

So, for large values of m, we can assume that

\left(1 - \frac{1}{m}\right)^{kn} = \left(\left(1 - \frac{1}{m}\right)^{m}\right)^{kn/m} \approx \left(\frac{1}{e}\right)^{kn/m}

Finally we have:

p = (1 - e^{-kn/m})^k

Appendix B: Optimal number of hash functions

If we make p a function of k, it’s possible to prove it has a global minimum for positive k‘s so we can use the derivative to find the value of k that minimizes the false positive probability.

To find the derivative, note that since \ln is monotonic, minimizing p is equivalent to minimizing \ln(p) = k \ln(1 - e^{-ck}). Applying the product and chain rules, its derivative with respect to k is

\ln(1 - e^{-ck}) + \frac{k}{(1 - e^{-ck})} c e^{-ck}

where c=n/m is a constant. If we make y = e^{-ck} (and use k = \ln(y)/-c)

\ln(1 - y) - \frac{\ln(y)}{(1 - y)} y

Setting it equal to 0, we get the equation:

\ln(1 - y) (1-y) = \ln(y) y

To solve this for y, let x = 1 - y,

\ln(x) x = \ln(y) y

One can check that x = y is the only solution of this equation with x + y = 1 and x, y \in (0, 1): for y < 1/2 we have \ln(x)\, x > \ln(y)\, y, and for y > 1/2 the inequality is reversed.

Substituting back, we find y = 1/2, i.e., e^{-ck} = 1/2, and finally k = \ln(2) \cdot m/n.

Introduction to the R language for programmers


In this post we will cover the basics of the R programming language. We’ll work with the assumption that the reader has experience with imperative programming languages.

In particular, we’re often going to make references to similar features in C++, Python, Matlab/Octave and Javascript. Knowing them will be useful for making associations with R, but it’s not necessary.

History

R is a programming language and software environment for statistical computing. It is an alternative implementation of the S language, combined with lexical scoping (we’ll talk about it later). It was developed by Ross Ihaka and Robert Gentleman.

The name of the language comes from the first letter of the authors’ names and also because it’s right before S.

Features

R is an interpreted and dynamically typed language. As we shall see, it is centred around vectors, so that even scalars are considered unit vectors in R. It also has built-in matrix operations like Matlab.

R also offers a package manager, which makes it easy to extend the base installation. Common libraries include ggplot2 and dplyr, but we won’t talk about them here.

Although R has a CLI version, there’s a really nice free IDE called RStudio.

Syntax

First let’s discuss some general R syntax.

Statements. Semicolons in R are optional. R also parses a newline as the end of a statement, much like Javascript.

Comments. The # character starts a comment for the current line, like in Python.

Variable Names. R has a rather unusual variable-name syntax. In C++ and other languages, a variable name can contain basically all letters of the alphabet, numbers and underscores, except that it can’t start with a number. In R, the . character is also a valid character in variable names, but variables starting with . are treated differently, since they are hidden from the default variable listing (ls()) [3].

There are some gotchas though: variables can’t start with underscore and a number cannot be used in the second position if it started with a dot. Examples of invalid variable names: _hello, .1ello. Valid variable names: ._, ..1hello.

R also has a set of one-letter variables that should be reserved but can be overridden in practice. For example, c is a function used to create vectors, but it can be easily overridden and cause head scratching.

> c(1, 3)
[1] 1 3
> c <- function (x) {x}
> c(1, 3) 
Error in c(1, 3) : unused argument (3)

Similarly, T and F are aliases for TRUE and FALSE, but they’re just variables that can also be overridden. Thus, we can mess up and do

> T <- FALSE
> F <- TRUE

The command rm(list=ls()) is very useful to clean up a messed up environment.

Assignment. In R assignment is often denoted by the <- symbol. It also allows left to right assignment by using the symbol ->, for example:

2 -> x;

R also accepts the usual = sign for assignment. The reason for the <- is merely historical, and more context can be found here.

Blocks of code. Like C++, R has the concept of blocks of code delimited by curly braces: {}. One important observation is that it returns the result of the last expression. Example:

> c <- {
+   a = 3 + 7;
+   b = a + 1;
+ }
> c
[1] 11

Data Types

We’ll now discuss the main built-in data types in R.

Vectors

Vectors in R represent an ordered list of elements of the same (atomic) type. The main types are integer, numeric (integers and floats), character (strings), logical (booleans) and complex. Vectors cannot contain vectors or functions.

As we mentioned before, R is a vector-centred language, so, in the command line, if we type:

> 2
[1] 2  # Unit vector

To create a vector with more elements, we use the c() function:

> x <- c(1, 2, 3)
> x
[1] 1 2 3

R also offers helper functions for generating vectors, like seq() and rep().

> seq(1, 10, 3)
[1]  1  4  7 10

> rep(c(1, 2), 5)
[1] 1 2 1 2 1 2 1 2 1 2

We can also create empty vectors. Since vectors must have a type, we need to say which type we intend (R can’t figure it out, since there are no elements):

> numeric()
numeric(0)
> character()
character(0)
> logical()
logical(0)

Vector operations. R has built-in operators and functions that work with vectors out of the box. This includes things like vector addition, scalar multiplication and element-by-element multiplication:

> c(3, 4) + c(7, 6)
[1] 10 10

> c(1, 2, 3) * 2
[1] 2 4 6

> c(1, 2, 3) * c(1, 2, 3)
[1] 1 4 9

R performs operations on vectors of different lengths in a quite peculiar way: by repeating the smaller vector until it reaches the size of the larger one. To illustrate, consider the following example:

> c(2, 2, 2, 2, 2, 2) * c(2, 3)
[1] 4 6 4 6 4 6

It repeated c(2, 3) 3 times, so we basically did:

> c(2, 2, 2, 2, 2, 2) * c(2, 3, 2, 3, 2, 3)
[1] 4 6 4 6 4 6

The same behaviour happens with a scalar, since it’s a unit vector repeated many times.

If the length of the larger vector is not a multiple of the length of the smaller one, R uses only part of the smaller vector in the last repetition and displays a warning.

> c(1, 2, 3) + c(10, 20, 30, 40)
Warning: longer object length is not a multiple of shorter object length
[1] 11 22 33 41

Notable arithmetic operators. Most operators work as we’d expect, but the ones worth noting are: ^ for exponentiation, / for float division, %% for modulo and %/% for integer division.

Vector access. Indexes in R start from 1, like in Matlab, rather than from 0 as most programmers are used to.

We can use numeric vectors to index another vector:

> c(1, 2, 3, 5, 8, 13)[c(1, 4)] 
[1] 1 5

Negative numbers are used for exclusion (they can’t be mixed with positive numbers):

> c(1,2,3,4,5)[c(-1, -3)]
[1] 2 4 5

Logical vectors are used for filtering:

> x <- 1:10
> x %% 2 == 0
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
> x[x %% 2 == 0] 
[1]  2  4  6  8 10

Character vectors can be used to index vectors by name. We can set names to vectors so they work like an associative array:

> fruits <- c(1, 2, 3)
> names(fruits) <- c("apple", "banana", "coconut")
> fruits[c("apple", "coconut")]
[1] 1 3

We can also use indexing for assignment:

> v <- c(10, 11, 12)
> v[c(1, 3)] = 20
> v
[1] 20 11 20

It’s worth noting that numeric indexing can cause resizing if the index is greater than the current vector length:

v <- numeric()
v[10] <- 1
v
[1] NA NA NA NA NA NA NA NA NA  1

Because of vector indexing, we can easily perform permutations of a vector just by indexing a permutation of its indexes:

> c(10, 20, 30, 40, 50)[c(4, 2, 3, 1, 5)]
[1] 40 20 30 10 50

The order() function makes use of this idea. Instead of returning a copy of the sorted vector, it returns a permutation of the original index representing a natural ordering:

> v <- c("a", "c", "d", "b")
> order(v) 
[1] 1 4 2 3
> v[order(v)]
[1] "a" "b" "c" "d"

Lists

As we saw, vectors require all elements to be of the same atomic type. Lists are ordered collections of elements of any type, including lists, vectors and any other types.

>list(c(1,2), "hello", list(TRUE, FALSE))
[[1]]
[1] 1 2

[[2]]
[1] "hello"

[[3]]
[[3]][[1]]
[1] TRUE

[[3]][[2]]
[1] FALSE

As suggested in the output, list access can be done by double brackets:

> x <- list(c(1,2), "hello", list(TRUE, FALSE))
> x[[1]]
[1] 1 2

One key difference between single brackets and double brackets is that single brackets always return the same type as the object being indexed (here, a list), which makes them especially suitable for subsetting. For example:

> x[1]
[[1]]
[1] 1 2

Returns a list with one element, whereas the double brackets return the element itself. Lists can have names like vectors do:

y <- list(a=c(1,2), b="hello", c=list(TRUE, FALSE))

And we can use the double bracket notation for access as well:

> y[["b"]] 
[1] "hello"

There’s a common alternative to this syntax which consists of using $ and the name without quotes:

> y$b
[1] "hello"

It’s similar to Javascript, where you can access an object obj with key "foo" either as obj["foo"] or obj.foo. This pattern is useful when the value for that key is a function, so instead of calling obj["myFunction"](baz) we can do obj.myFunction(baz). The same idea applies to R.

Matrices

Matrices are two-dimensional vectors. In most languages, matrices are just vectors of vectors, but since in R vectors cannot contain vectors, they cannot represent matrices.

The advantage of having a new type for matrices is that we can guarantee they always have consistent sizes (that is, all rows have the same number of columns) and also that we can have built-in operations for them, like matrix multiplication.

Since matrices are just 2 dimensional vectors, we also have the restriction over its elements. We can construct a matrix out of a vector:

> matrix(c(1, 2, 3, 4, 5, 6), nrow=2, ncol=3)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Note how it generates the matrix column-first by default. We can change it to be row-first by setting the byrow flag:

> matrix(c(1, 2, 3, 4, 5, 6), nrow=2, ncol=3, byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Indexing is straightforward, we just provide a pair of values representing the row index and the column index:

> x <- matrix(c(1, 2, 3, 4, 5, 6), nrow=2, ncol=3, byrow=FALSE)
> x[1, 2]
[1] 3

We can easily index entire rows or columns and also subsets:

> x[1,] # first row
[1] 1 3 5

> x[,2] # second column
[1] 3 4

> x[, c(1, 3)] # first and third column
     [,1] [,2]
[1,]    1    5
[2,]    2    6

Data Frames

Simply put, data frames are to matrices what lists are to vectors. They usually represent tables, where each column may have a different type. We can construct a data frame from a set of vectors. They should all have the same number of elements; if not, R applies the repetition (recycling) rule we saw earlier:

> x <- data.frame(c(1, 2, 3, 4), c(1, 2), 3);
> x
  c.1..2..3..4. c.1..2. X3
1             1       1  3
2             2       2  3
3             3       1  3
4             4       2  3

We can give names to the columns, similar how we named lists and vectors:

colnames(x) <- c("a", "b", "c");

Data frames can be indexed pretty much like lists, but in this case every column we get back is guaranteed to be a vector of the same length (the number of rows).

Useful functions to explore data frames are head() and summary():

> head(x, 3) # Displays the top 3 rows
  a b c
1 1 1 3
2 2 2 3
3 3 1 3

Control Structures

Now that we’ve covered the basics of data types, let’s discuss some control structures, mainly conditionals and loops.

Conditionals

The boolean values are TRUE and FALSE. We also can use the T and F variables as alias.

Logical composition is done with &, &&, | and ||. The difference is that the shorter versions (&, |) are vectorized (i.e., element-by-element), while the longer ones expect scalars and will only look at the first element of the vectors.

Also, the double versions are “lazy”, in the sense that they don’t evaluate the second argument if the first one is enough.

Some examples:

> TRUE | c(FALSE, TRUE)
# same as c(TRUE, TRUE) | c(FALSE, TRUE)
[1] TRUE TRUE
> TRUE || c(FALSE, TRUE)
# same as TRUE || FALSE
[1] TRUE
> TRUE || someFunction()
# someFunction is never called
[1] TRUE

If clause. The if clause is similar to many languages, but since R blocks return the value of the last expression and one-line blocks don’t require braces, this lets us write a ternary operator without any special syntax:

> y <- if (4 > 3) 10 else 0
> y
[1] 10

For loop. It always iterates over elements of a vector:

for (x in c("a", "b")) {
  ...
}

If we’re interested in the index of the elements, one idea is to use seq_along(v), which returns a vector with the indices from a vector v:

xs <- c("a", "b")
for (i in seq_along(xs)) {
  x <- xs[i];
  ...
}

The while construct is nothing special, but one interesting related construct is repeat. It is equivalent to while(TRUE), but since that’s a common programming pattern, it was aliased as a keyword.

Also, instead of the usual continue to skip to the next iteration of the loop, R uses next.

R has a switch construct, but it works very differently from the C++ style. For example, the following C++ pseudo-code:

switch(variable) {
  "value_1": expression_1;
  break;
  "value_2": expression_2;
  break;
}

Would be translated to:

switch(variable, value_1=expression_1, value_2=expression_2)

It’s shorter, but it doesn’t allow fall-throughs, for example.


Functions

Functions are defined anonymously and can be assigned to variables, like in Javascript. The key difference is that R returns the last statement (the return keyword is optional).

> f <- function (x) {
  x + 1;
}
> f(1)
[1] 2

Functions in R can take default values and named calls like in Python. For example:

> f <- function (a, b=2) {
  a + b;
}
> f(3)
[1] 5
> f(b=3, a=3)
[1] 6

Differently from Python, R doesn’t require arguments with default values to come after the arguments without them. For example, the following construct is valid in R but not in Python:

> f <- function (a=1, b) {
  a + b;
}
> f(3)
Error in f(3) : argument "b" is missing, with no default
> f(b=3)
[1] 4

Varargs. R lets us work with a variable number of arguments by using the ellipsis (...). It works like rest parameters in Javascript ES6. For example, the following JS code:

x = function(y, ...other) {
  return other.reduce((a, b) => a + b, 0);
}
x(2, 3, 4, 5); //  12

Can be translated to this R snippet:

> x <- function(y, ...) {
+   sum(...)
+ }
> x(2, 3, 4, 5);
[1] 12

Scoping. When reading R material, we often see that it has lexical scoping as opposed to dynamic scoping. This wouldn’t be worth mentioning in a basic introduction, since that’s how most languages work, but it’s one of the main differences from the original S language.

The difference between lexical scoping and dynamic scoping lies in which variable a free variable refers to inside a function (a free variable is one that hasn’t been created in the function or passed as an argument). In lexical scoping it’s determined at compile time (statically), while in dynamic scoping it’s determined at runtime (dynamically). A good explanation with examples is given in this post.

An example in R would be the following:

> f <- function() {
  x <- 1; 
  a <- function() {
    x;
  }
  b <- function() {
    x <- 2;
    a();
  }
  b();
}
> f()
[1] 1

The x in the function a is a free variable. In lexical scoping, it’s bound to the variable in the scope where the function is defined, in this case f(). In dynamic scoping, it’s bound to the variable in the scope where the function is called, thus b(), so in that case the code above would return 2.

Lazy evaluation. Arguments in a function are only evaluated if they are used. A good example is given here.

Conclusion

In this post we present a simplified introduction to the R language and compared it with languages like C++, Javascript and Python. We covered data types, control structures and functions.

R has a very unique syntax and much of it is due to legacy reasons. It has been increasing in popularity in recent years. R also seems more error-tolerant than other languages and this can cause unexpected behaviours. [1] offers some criticism about the language.

I was planning to learn the Julia programming language, but I had opportunities to use R this year, and there was also a quick course on Coursera. The R course is easy for programmers and only lasts 4 weeks.

References

[1] CRAN: An Introduction to R
[2] R Language for Programmers – John D. Cook
[3] Stack Overflow: What does the dot mean in R – personal preference, naming convention or more?
[4] Revolutions: Use = or <- for assignment?
[5] ECMAScript 6 and Rest Parameter – Ariya Hidayat
[6] Advanced R: Functions – by Hadley Wickham
[7] The R Inferno – Patrick Burns