layout	title
default	Names and Name Table

Translated by Clifford Escobar CAOILE

Chapter 3: Names and Name Table

`st_table`

`st_table` has already appeared several times as a method table and an instance
table. In this chapter let’s look at the structure of the `st_table` in detail.

Summary

I previously mentioned that the `st_table` is a hash table. What is a hash
table? It is a data structure that records one-to-one relations, for example, a
variable name and its value, or a function name and its body, etc.

However, data structures other than hash tables can, of course, record
one-to-one relations. For example, a list of the following structs will suffice
for this purpose.

struct entry {
    ID key;
    VALUE val;
    struct entry *next;  /* point to the next entry */
};

However, this method is slow. If the list contains a thousand items, in the
worst case, it is necessary to traverse a thousand links. In other words, the
search time increases in proportion to the number of elements. This is bad.
Since ancient times, various speed improvement methods have been conceived. The
hash table is one of those improved methods. In other words, the point is not
that the hash table is necessary but that it can be made faster.

Now then, let us examine the `st_table`. As it turns out, this library is not
created by Matsumoto, rather:

▼ `st.c` credits

   1  /* This is a public domain general purpose hash table package
         written by Peter Moore @ UCB. */

(st.c)

as shown above.

By the way, when I searched Google and found another version, it mentioned
that `st_table` is a contraction of “STring TABLE”. However, I find it
contradictory that it has both “general purpose” and “string” aspects.

What is a hash table?

A hash table can be thought as the following: Let us think of an array with
`n` items. For example, let us make `n`=64 (figure 1).

Then let us specify a function `f` that takes a key and produces an integer `i`
from 0 to `n`-1 (0-63). We call this `f` a hash function. `f` when given the
same key always produces the same `i`. For example, if we can assume that the
key is limited to positive integers, when the key is divided by 64, the
remainder should always fall between 0 and 63. Therefore, this calculating
expression has a possibility of being the function `f`.

When recording relationships, given a key, function `f` generates `i`, and
places the value into index `i` of the array we have prepared. Index access
into an array is very fast. The key concern is changing a key into an integer.

However, in the real world it isn’t that easy. There is a critical problem with
this idea. Because `n` is only 64, if there are more than 64 relationships to
be recorded, it is certain that there will be the same index for two different
keys. It is also possible that with fewer than 64, the same thing can occur.
For example, given the previous hash function “key % 64”, keys 65 and 129 will
both have a hash value of 1. This is called a hash value collision. There are
many ways to resolve such a collision.

One solution is to insert into the next element when a collision occurs.
This is called open addressing. (Figure 3).

Other than using the array like this, there are other possible approaches, like using
a pointer to a respective linked list in each element of the array. Then when a
collision occurs, grow the linked list. This is called chaining. (Figure
4) `st_table` uses this chaining method.

However, if it can be determined a priori what set of keys will be used,
it is possible to imagine a hash function that will never create
collisions. This type of function is called a “perfect hash function”.
Actually, there are tools which create a perfect hash function given a set
of arbitrary strings. GNU gperf is one of those. `ruby`‘s parser
implementation uses GNU gperf but… this is not the time to discuss it.
We’ll discuss this in the second part of the book.

Data Structure

Let us start looking at the source code. As written in the introductory
chapter, if there is data and code, it is better to read the data first.
The following is the data type of `st_table`.

▼ `st_table`

   9  typedef struct st_table st_table;

  16  struct st_table {
  17      struct st_hash_type *type;
  18      int num_bins;                   /* slot count */
  19      int num_entries;                /* total number of entries */
  20      struct st_table_entry **bins;   /* slot */
  21  };

(st.h)

▼ `struct st_table_entry`

  16  struct st_table_entry {
  17      unsigned int hash;
  18      char *key;
  19      char *record;
  20      st_table_entry *next;
  21  };

(st.c)

`st_table` is the main table structure. `st_table_entry` is a holder that
stores one value. `st_table_entry` contains a member called `next` which of
course is used to make `st_table_entry` into a linked list. This is the chain
part of the chaining method. The `st_hash_type` data type is used, but I will
explain this later. First let me explain the other parts so you can compare
and understand the roles.

So, let us comment on `st_hash_type`.

▼ `struct st_hash_type`

  11  struct st_hash_type {
  12      int (*compare)();   /* comparison function */
  13      int (*hash)();      /* hash function */
  14  };

(st.h)

This is still Chapter 3 so let us examine it attentively.

int (*compare)()

This part shows, of course, the member `compare` which has a data type of
“a pointer to a function that returns an `int`”. `hash` is also of the same type.
This variable is substituted in the following way:

int
great_function(int n)
{
    /* ToDo: Do something great! */
    return n;
}

{
    int (*f)();
    f = great_function;

And it is called like this:

    (*f)(7);
}

Here let us return to the `st_hash_type` commentary. Of the two members
`hash` and `compare`, `hash` is the hash function `f` explained previously.

On the other hand, `compare` is a function that evaluates if the key is actually the
same or not. With the chaining method, in the spot with the same hash value
`n`, multiple elements can be inserted. To know exactly which element is
being searched for, this time it is necessary to use a comparison function
that we can absolutely trust. `compare` will be that function.

This `st_hash_type` is a good generalized technique. The hash table itself
cannot determine what the stored keys’ data type will be. For example, in
`ruby`, `st_table`’s keys are `ID` or `char*` or `VALUE`, but to write the
same kind of hash for each (data type) is foolish. Usually, the things
that change with the different key data types are things like the hash
function. For things like memory allocation and collision detection,
typically most of the code is the same. Only the parts where the
implementation changes with a differing data type will be bundled up into a
function, and a pointer to that function will be used. In this fashion, the
majority of the code that makes up the hash table implementation can
use it.

In object-oriented languages, in the first place, you can attach a procedure
to an object and pass it (around), so this mechanism is not necessary.
Perhaps it more correct to say that this mechanism is built-in as a
language’s feature.

`st_hash_type` example

The usage of a data structure like `st_hash_type` is good as an
abstraction. On the other hand, what kind of code it actually passes
through may be difficult to understand. If we do not examine what sort of
function is used for `hash` or `compare`, we will not grasp the reality.
To understand this, it is probably sufficient to look at `st_init_numtable()`
introduced in the previous chapter. This function creates a table for
integer data type keys.

▼ `st_init_numtable()`

 182  st_table*
 183  st_init_numtable()
 184  {
 185      return st_init_table(&type_numhash);
 186  }

(st.c)

`st_init_table()` is the function that allocates the table memory and so
on. `type_numhash` is an `st_hash_type` (it is the member named “type” of `st_table`).
Regarding this `type_numhash`:

▼ `type_numhash`

  37  static struct st_hash_type type_numhash = {
  38      numcmp,
  39      numhash,
  40  };

 552  static int
 553  numcmp(x, y)
 554      long x, y;
 555  {
 556      return x != y;
 557  }

 559  static int
 560  numhash(n)
 561      long n;
 562  {
 563      return n;
 564  }

(st.c)

Very simple. The table that the `ruby` interpreter uses is by and large
this `type_numhash`.

`st_lookup()`

Now then, let us look at the function that uses this data structure. First,
it’s a good idea to look at the function that does the searching. Shown below is the
function that searches the hash table, `st_lookup()`.

▼ `st_lookup()`

 247  int
 248  st_lookup(table, key, value)
 249      st_table *table;
 250      register char *key;
 251      char **value;
 252  {
 253      unsigned int hash_val, bin_pos;
 254      register st_table_entry *ptr;
 255
 256      hash_val = do_hash(key, table);
 257      FIND_ENTRY(table, ptr, hash_val, bin_pos);
 258
 259      if (ptr == 0) {
 260          return 0;
 261      }
 262      else {
 263          if (value != 0)  *value = ptr->record;
 264          return 1;
 265      }
 266  }

(st.c)

The important parts are pretty much in `do_hash()` and `FIND_ENTRY()`. Let us
look at them in order.

▼ `do_hash()`

  68  #define do_hash(key,table) (unsigned int)(*(table)->type->hash)((key))

(st.c)

Just in case, let us write down the macro body that is difficult to understand:

(table)->type->hash

is a function pointer where the `key` is passed as a parameter. This is the
syntax for calling the function. `*` is not applied to `table`. In other words,
this macro is a hash value generator for a `key`, using the prepared hash
function `type→hash` for each data type.

Next, let us examine `FIND_ENTRY()`.

▼ `FIND_ENTRY()`

 235  #define FIND_ENTRY(table, ptr, hash_val, bin_pos) do {\
 236      bin_pos = hash_val%(table)->num_bins;\
 237      ptr = (table)->bins[bin_pos];\
 238      if (PTR_NOT_EQUAL(table, ptr, hash_val, key)) {\
 239          COLLISION;\
 240          while (PTR_NOT_EQUAL(table, ptr->next, hash_val, key)) {\
 241              ptr = ptr->next;\
 242          }\
 243          ptr = ptr->next;\
 244      }\
 245  } while (0)

 227  #define PTR_NOT_EQUAL(table, ptr, hash_val, key) ((ptr) != 0 && \
          (ptr->hash != (hash_val) || !EQUAL((table), (key), (ptr)->key)))

  66  #define EQUAL(table,x,y) \
          ((x)==(y) || (*table->type->compare)((x),(y)) == 0)

(st.c)

`COLLISION` is a debug macro so we will (should) ignore it.

The parameters of `FIND_ENTRY()`, starting from the left are:

`st_table`
the found entry will be pointed to by this parameter
hash value
temporary variable

And, the second parameter will point to the found `st_table_entry*`.

At the outermost level, a `do` .. `while(0)` is used to safely wrap up a
multiple expression macro. This is `ruby`‘s, or rather, C language’s preprocessor
idiom. In the case of `if(1)`, there may be a danger of adding an `else` part.
In the case of `while(1)`, it becomes necessary to add a `break` at the very
end.

Also, there is no semicolon added after the `while(0)`.

FIND_ENTRY();

This is so that the semicolon that is normally written at the end of an
expression will not go to waste.

`st_add_direct()`

Continuing on, let us examine `st_add_direct()` which is a function that adds a
new relationship to the hash table. This function does not check if the key is
already registered. It always adds a new entry. This is the meaning of `direct`
in the function name.

▼ `st_add_direct()`

 308  void
 309  st_add_direct(table, key, value)
 310      st_table *table;
 311      char *key;
 312      char *value;
 313  {
 314      unsigned int hash_val, bin_pos;
 315
 316      hash_val = do_hash(key, table);
 317      bin_pos = hash_val % table->num_bins;
 318      ADD_DIRECT(table, key, value, hash_val, bin_pos);
 319  }

(st.c)

Just as before, the `do_hash()` macro that obtains a value is called here.
After that, the next calculation is the same as at the start of
`FIND_ENTRY()`, which is to exchange the hash value for a real index.

Then the insertion operation seems to be implemented by `ADD_DIRECT()`.
Since the name is all uppercase, we can anticipate that is a macro.

▼ `ADD_DIRECT()`

 268  #define ADD_DIRECT(table, key, value, hash_val, bin_pos) \
 269  do {                                                     \
 270      st_table_entry *entry;                               \
 271      if (table->num_entries / (table->num_bins)           \
                              > ST_DEFAULT_MAX_DENSITY) {      \
 272          rehash(table);                                   \
 273          bin_pos = hash_val % table->num_bins;            \
 274      }                                                    \
 275                                                           \
          /* (A) */                                            \
 276      entry = alloc(st_table_entry);                       \
 277                                                           \
 278      entry->hash = hash_val;                              \
 279      entry->key = key;                                    \
 280      entry->record = value;                               \
          /* (B) */                                            \
 281      entry->next = table->bins[bin_pos];                  \
 282      table->bins[bin_pos] = entry;                        \
 283      table->num_entries++;                                \
 284  } while (0)

(st.c)

The first `if` is an exception case so I will explain it afterwards.

(A) Allocate and initialize a `st_table_entry`.

(B) Insert the `entry` into the start of the list.
This is the idiom for handling the list. In other words,

entry->next = list_beg;
list_beg = entry;

makes it possible to insert an entry to the front of the list. This is similar
to “cons-ing” in the Lisp language. Check for yourself that even if `list_beg`
is NULL, this code holds true.

Now, let me explain the code I left aside.

▼ `ADD_DIRECT()`-`rehash`

 271      if (table->num_entries / (table->num_bins)           \
                              > ST_DEFAULT_MAX_DENSITY) {      \
 272          rehash(table);                                   \
 273          bin_pos = hash_val % table->num_bins;            \
 274      }                                                    \

(st.c)

`DENSITY` is “concentration”. In other words, this conditional checks if the
hash table is “crowded” or not. In the `st_table`, as the number of values that
use the same `bin_pos` increases, the longer the link list becomes. In other
words, search becomes slower. That is why for a given `bin` count, when the average elements
per bin become too many, `bin` is increased and the crowding is reduced.

The current `ST_DEFAULT_MAX_DENSITY` is

▼ `ST_DEFAULT_MAX_DENSITY`

  23  #define ST_DEFAULT_MAX_DENSITY 5

(st.c)

Because of this setting, if in all `bin_pos` there are 5 `st_table_entries`,
then the size will be increased.

`st_insert()`

`st_insert()` is nothing more than a combination of `st_add_direct()` and
`st_lookup()`, so if you understand those two, this will be easy.

▼ `st_insert()`

 286  int
 287  st_insert(table, key, value)
 288      register st_table *table;
 289      register char *key;
 290      char *value;
 291  {
 292      unsigned int hash_val, bin_pos;
 293      register st_table_entry *ptr;
 294
 295      hash_val = do_hash(key, table);
 296      FIND_ENTRY(table, ptr, hash_val, bin_pos);
 297
 298      if (ptr == 0) {
 299          ADD_DIRECT(table, key, value, hash_val, bin_pos);
 300          return 0;
 301      }
 302      else {
 303          ptr->record = value;
 304          return 1;
 305      }
 306  }

(st.c)

It checks if the element is already registered in the table. Only when it is
not registered will it be added. If there is a insertion, return 0. If there is
no insertion, return a 1.

`ID` and Symbols

I’ve already discussed what an `ID` is. It is a correspondence between an
arbitrary string of characters and a value. It is used to declare various
names. The actual data type is `unsigned int`.

From `char*` to `ID`

The conversion from string to `ID` is executed by `rb_intern()`. This function
is rather long, so let’s omit the middle.

▼ `rb_intern()` (simplified)

5451  static st_table *sym_tbl;       /*  char* to ID   */
5452  static st_table *sym_rev_tbl;   /*  ID to char*   */

5469  ID
5470  rb_intern(name)
5471      const char *name;
5472  {
5473      const char *m = name;
5474      ID id;
5475      int last;
5476
          /* If for a name, there is a corresponding ID that is already
          registered, then return that ID */
5477      if (st_lookup(sym_tbl, name, &id))
5478          return id;

          /* omitted ... create a new ID */

          /* register the name and ID relation */
5538    id_regist:
5539      name = strdup(name);
5540      st_add_direct(sym_tbl, name, id);
5541      st_add_direct(sym_rev_tbl, id, name);
5542      return id;
5543  }

(parse.y)

The string and `ID` correspondence relationship can be accomplished by using the
`st_table`. There probably isn’t any especially difficult part here.

What is the omitted section doing? It is treating global variable names and
instance variables names as special and flagging them. This is because in the
parser, it is necessary to know the variable’s classification from the `ID`.
However, the fundamental part of `ID` is unrelated to this, so I won’t explain
it here.

From `ID` to `char*`

The reverse of `rb_intern()` is `rb_id2name()`, which takes an `ID` and
generates a `char*`. You probably know this, but the 2 in `id2name` is “to”.
“To” and “two” have the same pronounciation, so “2” is used for “to”. This
syntax is often seen.

This function also sets the `ID` classification flags so it is long. Let me
simplify it.

▼ `rb_id2name()` (simplified)

char *
rb_id2name(id)
    ID id;
{
    char *name;

    if (st_lookup(sym_rev_tbl, id, &name))
        return name;
    return 0;
}

Maybe it seems that it is a little over-simplified, but in reality if we remove
the details it really becomes this simple.

The point I want to emphasize is that the found `name` is not copied. The
`ruby` API does not require (or rather, it forbids) the `free()`-ing of the
return value. Also, when parameters are passed, it always
copies them. In other words, the creation and release is
completed by one side, either by the user or by `ruby`.

So then, when creation and release cannot be accomplished (when passed it is
not returned) on a value, then a Ruby object is used. I have not yet discussed
it, but a Ruby object is automatically released when it is no longer needed,
even if we are not taking care of the object.

Converting `VALUE` and `ID`

`ID` is shown as an instance of the `Symbol` class at the Ruby level.
And it can be obtained like so: `“string”.intern`. The implementation of
`String#intern` is `rb_str_intern()`.

▼ `rb_str_intern()`

2996  static VALUE
2997  rb_str_intern(str)
2998      VALUE str;
2999  {
3000      ID id;
3001
3002      if (!RSTRING(str)->ptr || RSTRING(str)->len == 0) {
3003          rb_raise(rb_eArgError, "interning empty string");
3004      }
3005      if (strlen(RSTRING(str)->ptr) != RSTRING(str)->len)
3006          rb_raise(rb_eArgError, "string contains `\\0'");
3007      id = rb_intern(RSTRING(str)->ptr);
3008      return ID2SYM(id);
3009  }

(string.c)

This function is quite reasonable as a `ruby` class library code example.
Please pay attention to the part where `RSTRING()` is used and casted, and
where the data structure’s member is accessed.

Let’s read the code. First, `rb_raise()` is merely error handling so we ignore
it for now. The `rb_intern()` we previously examined is here, and also ID2SYM
is here. `ID2SYM()` is a macro that converts `ID` to `Symbol`.

And the reverse operation is accomplished using `Symbol#to_s` and such.
The implementation is in `sym_to_s`.

▼ `sym_to_s()`

 522  static VALUE
 523  sym_to_s(sym)
 524      VALUE sym;
 525  {
 526      return rb_str_new2(rb_id2name(SYM2ID(sym)));
 527  }

(object.c)

`SYM2ID to an `ID`.

It looks like the function is not doing anything unreasonable. However, it
is probably necessary to pay attention to the area around the memory handling.
`rb_id2name()` returns a `char*` that must not be `free()`. `rb_str_new2()`
copies the parameter’s `char*` and uses the copy (and does not change the
parameter). In this way the policy is consistent, which allows the line to be
written just by chaining the functions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

name.textile

name.textile

Chapter 3: Names and Name Table

`st_table`

Summary

What is a hash table?

Data Structure

`st_hash_type` example

`st_lookup()`

`st_add_direct()`

`st_insert()`

`ID` and Symbols

From `char*` to `ID`

From `ID` to `char*`

Converting `VALUE` and `ID`

Files

name.textile

Latest commit

History

name.textile

File metadata and controls

Chapter 3: Names and Name Table

`st_table`

Summary

What is a hash table?

Data Structure

`st_hash_type` example

`st_lookup()`

`st_add_direct()`

`st_insert()`

`ID` and Symbols

From `char*` to `ID`

From `ID` to `char*`

Converting `VALUE` and `ID`