Consolidated lecture notes for CS223 as taught in the Spring 2005 semester at Yale by Jim Aspnes.
Contents
- C/Variables
- Machine memory
- Variables
- Using variables
- C/IntegerTypes
- Integer types
- C/Statements
- Simple statements
- Compound statements
- C/AssemblyLanguage
- C/Functions
- Function definitions
- Calling a function
- The return statement
- Function declarations and modules
- Static functions
- Local variables
- Mechanics of function calls
- C/InputOutput
- Character streams
- Reading and writing single characters
- Formatted I/O
- Rolling your own I/O routines
- File I/O
- C/Pointers
- Memory and addresses
- Pointer variables
- The null pointer
- Pointers and functions
- Pointer arithmetic and arrays
- Void pointers
- Run-time storage allocation
- C/DynamicStorageAllocation
- Dynamic storage allocation in C
- Overview
- Allocating space with malloc
- Freeing space with free
- Dynamic arrays using realloc
- Example of building a data structure with malloc
- C/Strings
- String processing in general
- C strings
- String constants
- String buffers
- Operations on strings
- Finding the length of a string
- Comparing strings
- Formatted output to strings
- Dynamic allocation of strings
- argc and argv
- C/Structs
- Structs
- Unions
- Bit fields
- AbstractDataTypes
- Abstraction
- Example of an abstract data type
- Designing abstract data types
- C/Definitions
- Naming types
- Naming constants
- Naming values in sequences
- Other uses of #define
- AsymptoticNotation
- Definitions
- Motivating the definitions
- Proving asymptotic bounds
- Asymptotic notation hints
- Variations in notation
- More information
- LinkedLists
- Stacks and linked lists
- Looping over a linked list
- Looping over a linked list backwards
- Queues
- Deques and doubly-linked lists
- Circular linked lists
- What linked lists are and are not good for
- Further reading
- C/Debugging
- Assertions
- gdb
- valgrind
- Not recommended: debugging output
- TestingDuringDevelopment
- Setting up the test harness
- Stub implementation
- Bounded-space implementation
- First fix
- Final version
- Moral
- Appendix: Test macros
- LinkedLists
- Stacks and linked lists
- Looping over a linked list
- Looping over a linked list backwards
- Queues
- Deques and doubly-linked lists
- Circular linked lists
- What linked lists are and are not good for
- Further reading
- HashTables
- Basics of hashing
- Resolving collisions
- Choosing a hash function
- Maintaining a constant load factor
- Examples
- C/FunctionPointers
- Basics
- Function pointer declarations
- Applications
- Closures
- Objects
- C/PerformanceTuning
- Timing under Linux
- Profiling with gprof
- AlgorithmDesignTechniques
- Basic principles of algorithm design
- Classifying algorithms by structure
- Example: Finding the maximum
- Example: Sorting
- DynamicProgramming
- BinaryTrees
- Tree basics
- Binary tree implementations
- The canonical binary tree algorithm
- Nodes vs leaves
- Special classes of binary trees
- Heaps
- Priority queues
- Expensive implementations of priority queues
- Heaps
- Packed heaps
- Bottom-up heapification
- More information
- BinarySearchTrees
- Searching for a node
- Inserting a new node
- Costs
- BalancedTrees
- The basics: tree rotations
- AVL trees
- 2-3 trees
- Red-black trees
- B-trees
- Splay trees
- Skip lists
- Implementations
- RadixSearch
- Tries
- Patricia trees
- Ternary search trees
- More information
- C/BitExtraction
- C/Graphs
- Graphs
- Why graphs are useful
- Operations on graphs
- Representations of graphs
- An implementation
- Searching for paths in a graph
- DepthFirstSearch
- Depth-first search in a tree
- Depth-first search in a directed graph
- DFS forests
- Running time
- A more complicated application: strongly-connected components
- BreadthFirstSearch
- ShortestPath
- Single-source shortest paths
- All-pairs shortest paths
- Implementations
- C/FloatingPoint
- Floating point basics
- The IEEE-754 floating-point standard
- Error
- Reading and writing floating-point numbers
- Non-finite numbers in C
- C/Persistence
- Persistent data
- A simple solution using text files
- Using a binary file
- A version that updates the file in place
- An even better version using mmap
- Concurrency and fault-tolerance issues: ACIDitiy
- C/WhatNext
- What's wrong with C
- What C++ fixes
- Other C-like languages to consider
- Scripting languages
1. C/Variables
2. Machine memory
Basic model: machine memory consists of many bytes of storage, each of which has an address which is itself a sequence of bits. Though the actual memory architecture of a modern computer is complex, from the point of view of a C program we can think of as simply a large address space that the CPU can store things in (and load things from), provided it can supply an address to the memory. Because we don't want to have to type long strings of bits all the time, the C compiler lets us give names to particular regions of the address space, and will even find free space for us to use.
3. Variables
A variable is a name given in a program for some region of memory. Each variable has a type, which tells the compiler how big the region of memory corresponding to it is and how to treat the bits stored in that region when performing various kinds of operations (e.g. integer variables are added together by very different circuitry than floating-point variables, even though both represent numbers as bits). In modern programming languages, a variable also has a scope (a limit on where the name is meaningful, which allows the same name to be used for different variables in different parts of the program) and an extent (the duration of the variable's existence, controlling when the program allocates and deallocates space for it).
3.1. Variable declarations
Before you can use a variable in C, you must declare it. Variable declarations show up in three places:
Outside a function. These declarations declare global variables that are visible throughout the program (i.e. they have global scope). Use of global variables is almost always a mistake.
In the argument list in the header of a function. These variables are parameters to the function. They are only visible inside the function body (local scope), exist only from when the function is called to when the function returns (bounded extent---note that this is different from what happens in some garbage-collected languages like Scheme), and get their initial values from the arguments to the function when it is called.
At the start of any block delimited by curly braces. Such variables are visible only within the block (local scope again) and exist only when the containing function is active (bounded extent). The convention in C is generally to declare all such local variables at the top of a function; this is different from the convention in C++ or Java, which encourage variables to be declared when they are first used.
Variable declarations consist of a type name followed by one or more variable names separated by commas and terminated by a semicolon (except in argument lists, where each declaration is terminated by a comma). I personally find it easiest to declare variables one per line, to simplify documenting them. It is also possible for global and local variables (but not function arguments) to assign an initial value to a variable by putting in something like = 0 after the variable name. It is good practice to put a comment after each variable declaration that explains what the variable does (with a possible exception for conventionally-named loop variables like i or j in short functions). Below is an example of a program with some variable declarations in it:
1 #include <stdio.h>
2 #include <ctype.h>
3
4 /* This program counts the number of digits in its input. */
5
6 /*
7 *This global variable is not used; it is here only to demonstrate
8 * what a global variable declaration looks like.
9 */
10 unsigned long SpuriousGlobalVariable = 127;
11
12 int
13 main(int argc, char **argv)
14 {
15 int c; /* character read */
16 int count = 0; /* number of digits found */
17
18 while((c = getchar()) != EOF) {
19 if(isdigit(c)) {
20 count++;
21 }
22 }
23
24 printf("%d\n", count);
25
26 return 0;
27 }
3.2. Variable names
The evolution of variable names:
- 11101001001001
- Physical addresses represented as bits.
- #FC27
- Typical assembly language address represented in hexadecimal to save typing (and because it's easier for humans to distinguish #A7 from #B6 than to distinguish 10100111 from 10110110.)
- A1$
- A string variable in BASIC, back in the old days where BASIC variables were one uppercase letter, optionally followed by a number, optionally followed by $ for a string variable and % for an integer variable.
- IFNXG7
- A typical FORTRAN variable name, back in the days of 6-character all-caps variable names. The I at the start means it's an integer variable. The rest of the letters probably abbreviate some much longer description of what the variable means.
- i, j, c, count, top_of_stack, accumulatedTimeInFlight
- Typical names from
modern C programs. Note that there are two different conventions for representing multi-word names: the first is to replace spaces with underscores, and the second is to capitalize the first letter of each word (possibly excluding the first letter), a style called "camel case" (C2find:CamelCase). You should pick one of these two conventions and stick to it.
- prgcGradeDatabase
An example of Hungarian notation, a style of variable naming in which the type of the variable is encoded in the first few character. See http://web.umr.edu/~cpp/common/hungarian.html or C2find:HungarianNotation. Not clearly an improvement on standard naming conventions, but it is popular in some programming shops.
In C, variable names are called identifiers.1 An identifier in C must start with a lower or uppercase letter or the underscore character _. Typically variables starting with underscores are used internally by system libraries, so it's dangerous to name your own variables this way. Subsequent characters in an identifier can be letters, digits, or underscores. So for example a, ____a___a_a_11727_a, AlbertEinstein, aAaAaAaAaAAAAAa, and ______ are all legal identifiers in C, but $foo and 01 are not.
The basic principle of variable naming is that a variable name is a substitute for the programmer's memory. It is generally best to give identifiers names that are easy to read and describe what the variable is used for, i.e., that are self-documenting. None of the variable names in the preceding list are any good by this standard. Better names would be total_input_characters, dialedWrongNumber, or stepsRemaining. Non-descriptive single-character names are acceptable for certain conventional uses, such as the use of i and j for loop iteration variables, or c for an input character. Such names should only be used when the scope of the variable is small, so that it's easy to see all the places where it is used at once.
C identifiers are case-sensitive, so aardvark, AArDvARK, and AARDVARK are all different variables. Because it is hard to remember how you capitalized something before, it is important to pick a standard convention and stick to it. The traditional convention in C goes like this:
Ordinary variables and functions are lowercased or camel-cased, e.g. count, countOfInputBits.
User-defined types (and in some conventions global variables) are capitalized, e.g. Stack, TotalBytesAllocated.
Constants created with #define or enum are put in all-caps: MAXIMUM_STACK_SIZE, BUFFER_LIMIT.
4. Using variables
Ignoring pointers (C/Pointers) for the moment, there are essentially two things you can do to a variable: you can assign a value to it using the = operator, as in:
1 x = 2; /* assign 2 to x */
2 y = 3; /* assign 3 to y */
or you can use its value in an expression:
1 x = y+1; /* assign y+1 to x */
The assignment operator is an ordinary operator, and assignment expressions can be used in larger expressions:
x = (y=2)*3; /* sets y to 2 and x to 6 */
This feature is usually only used in certain standard idioms, since it's confusing otherwise.
There are also shorthand operators for expressions of the form variable = variable operator expression. For example, writing x += y is equivalent to writing x = x + y, x /= y is the same as x = x / y, etc.
For the special case of adding or subtracting 1, you can abbreviate still further with the ++ and -- operators. These come in two versions, depending on whether you want the result of the expression (if used in a larger expression) to be the value of the variable before or after the variable is incremented:
1 x = 0;
2 y = x++; /* sets x to 1 and y to 0 (the old value) */
3 y = ++x; /* sets x to 2 and y to 2 (the new value) */
The intuition is that if the ++ comes before the variable, the increment happens before the value of the variable is read (a preincrement; if it comes after, it happens after the value is read (a postincrement). This is confusing enough that it is best not to use the value of preincrement or postincrement operations except in certain standard idioms. But using x++ by itself as a substitute for x = x+1 is perfectly acceptable style.
5. C/IntegerTypes
6. Integer types
In order to declare a variable, you have to specify a type, which controls both how much space the variable takes up and how the bits stored within it are interpreted in arithmetic operations.
The standard C integer types are:
Name |
Typical size |
Signed? |
char |
8 bits |
Unspecified |
short |
16 bits |
Yes |
int |
32 bits |
Yes |
long |
32 bits |
Yes |
The typical size is for 32-bit architectures like the Intel i386. Some 64-bit machines will have 64-bit ints and longs, and some prehistoric computers had 16-bit ints. Particularly bizarre architectures might have even wilder bit sizes, but you are not likely to see this unless you program vintage 1970s supercomputers. Some compilers also support a long long type that is usually twice the length of a long (e.g. 64 bits on i386 machines); this may or may not be available if you insist on following the ANSI specification strictly. The general convention is that int is the most convenient size for whatever computer you are using and should be used by default.
Whether a variable is signed or not controls how its values are interpreted. In signed integers, the first bit is the sign bit and the rest are the value in 2's complement notation; so for example a signed char with bit pattern 11111111 would be interpreted as the numerical value -1 while an unsigned char with the same bit pattern would be 255. Most integer types are signed unless otherwise specified; an n-bit integer type has a range from -2n-1 to 2n-1-1 (e.g. -32768 to 32767 for a short.) Unsigned variables, which can be declared by putting the keyword unsigned before the type, have a range from 0 to 2n-1 (e.g. 0 to 65535 for an unsigned short).
For chars, whether the character is signed (-128..127) or unsigned (0..255) is at the whim of the compiler. If it matters, declare your variables as signed char or unsigned char. For storing actual characters that you aren't doing arithmetic on, it shouldn't matter.
6.1. Integer constants
Constant integer values in C can be written in any of four different ways:
In the usual decimal notation, e.g. 0, 1, -127, 9919291, 97.
In octal or base 8, when the leading digit is 0, e.g. 01 for 1, 010 for 8, 0777 for 511, 061 for 97.
In hexadecimal or base 16, when prefixed with 0x. The letters a through f are used for the digits 10 through 15. For example, 0x61 is another way to write 97.
Using a character constant, which is a single ASCII character or an escape sequence inside single quotes. The value is the ASCII value of the character: 'a' is 97.2 Unlike languages with separate character types, C characters are identical to integers; you can (but shouldn't) calculate 972 by writing 'a'*'a'. You can also store a character anywhere.
Except for character constants, you can insist that an integer constant is unsigned or long by putting a u or l after it. So 1ul is an `unsigned long version of 1. By default integer constants are (signed) int`s.
6.2. Integer operations
The usual + (addition), - (negation or subtraction), and * (multiplication) operators work on integers pretty much the way you'd expect. The only caveat is that if the result lies outside of the range of whatever variable you are storing it in, it will be truncated instead of causing an error:
1 unsigned char c;
2
3 c = -1; /* sets c = 255 */
4 c = 255 + 255; /* sets c = 254 */
5 c = 256 * 1772717; /* sets c = 0 */
This can be a source of subtle bugs if you aren't careful. The usual giveaway is that values you thought should be large positive integers come back as random-looking negative integers.
Division (/) of two integers also truncates: 2/3 is 0, 5/3 is 1, etc. For positive integers it will always round down. If either the numerator or denominator is negative, the behavior is unpredictable and depends on what your processor does---in practice this means you should never use / if one or both arguments might be negative. There is also a remainder operator % with e.g. 2%3 = 2, 5%3 = 2, 27 % 2 = 1, etc. It is also unpredictable when fed negative values, although y == x*(y/x) + y%x should always be true.
In addition to the arithmetic operators, integer types support bitwise logical operators that apply some Boolean operation to all the bits of their arguments in parallel. What this means is that the i-th bit of the output is equal to some operation applied to the i-th bit(s) of the input(s). The bitwise logical operators are ~ (bitwise negation: used with one argument as in ~0 for the all-1's binary value), & (bitwise AND), '|' (bitwise OR), and '^' (bitwise XOR, i.e. sum mod 2). These are mostly used for manipulating individual bits or small groups of bits inside larger words, as in the expression x & 0x0f, which strips off the bottom four bits stored in x. Two additional logical operators are the shift operators << and >>; see KernighanRitchie for more about these.
To add to the confusion, there are also three logical operators that work on the truth-values of integers, where 0 is defined to be false and anything else is defined by be true. These are && (logical AND), ||, (logical OR), and ! (logical NOT). The result of any of these operations is always 0 or 1 (so !!x, for example, is 0 if x is 0 and 1 if x is anything else). The && and || operators evaluate their arguments left-to-right and ignore the second argument if the first determines the answer (this is the only place in C where argument evaluation order is specified); so
1 0 && execute_programmer();
2 1 || execute_programmer();
is in a very weak sense perfectly safe code to run.
Yet another logical operator is the ternary operator ?:, where x ? y : z equals the value of y if x is nonzero and z if x is zero. Like && and ||, it only evaluates the arguments it needs to:
1 fileExists(badFile) ? deleteFile(badFile) : createFile(badFile);
Most uses of ?: are better done using an if-then-else statement (C/ControlStructures).
Logical operators usually operate on the results of relational operators or comparisons: these are '==' (equality), '!=' (inequality), '<=' (less than or equal to) and '>=' (greater than or equal to). So, for example,
if(size >= MIN_SIZE && size <= MAX_SIZE) {
puts("just right");
}tests if size is in the (inclusive) range [MIN_SIZE..MAX_SIZE].
Beware of confusing == with =. The code
1 /* DANGER! DANGER! DANGER! */
2 if(x = 5) {
3 ...
is perfectly legal C, and will set x to 5 rather than testing if it's equal to 5. Because 5 happens to be nonzero, the body of the if statement will always be executed.
6.3. Alignment
Modern CPU architectures typically enforce alignment restrictions on multi-byte values, which mean that the address of an int or long typically has to be a multiple of 4. This is an effect of the memory being organized as groups of 32 bits that are written in parallel instead of 8 bits. Such restrictions are not obvious when working with integer-valued variables directly, but will come up when we talk about pointers in C/Pointers.
7. C/Statements
The bodies of C functions (including the main function) are made up of statements. These can either be simple statements that do not contain other statements, or compound statements that have other statements inside them. Control structures are compound statements like if/then/else, while, for, and do..while that control how or whether their component statements are executed.
8. Simple statements
The simplest kind of statement in C is an expression (followed by a semicolon, the terminator for all simple statements). Its value is computed and discarded. Examples:
1 x = 2; /* an assignment statement */
2 x = 2+3; /* another assignment statement */
3 2+3; /* has no effect---will be discarded by smart compilers */
4 puts("hi"); /* a statement containing a function call */
5 root2 = sqrt(2); /* an assignment statement with a function call */
Most statements in a typical C program are simple statements of this form.
Other examples of simple statements are the jump statements return, break, continue, and goto. A return statement specifies the return value for a function (if there is one), and when executed it causes the function to exit immediately. The break and continue statements jump immediately to the end of a loop (or switch; see below) or the next iteration of a loop; we'll talk about these more when we talk about loops. The goto statement jumps to another location in the same function, and exists for the rare occasions when it is needed. Using it in most circumstances is a sin.
9. Compound statements
Compound statements come in two varieties: conditionals and loops.
9.1. Conditionals
These are compound statements that test some condition and execute one or another block depending on the outcome of the condition. The simplest is the if statement:
1 if(houseIsOnFire) {
2 /* ouch! */
3 scream();
4 runAway();
5 }
The body of the if statement is executed only if the expression in parentheses at the top evaluates to true (which in C means any value that is not 0).
The braces are not strictly required, and are used only to group one or more statements into a single statement. If there is only one statement in the body, the braces can be omitted:
1 if(programmerIsLazy) omitBraces();
This style is recommended only for very simple bodies. Omitting the braces makes it harder to add more statements later without errors.
1 if(underAttack)
2 launchCounterAttack(); /* executed only when attacked */
3 hideInBunker(); /* executed always */
In the example above, the lack of braces means that the hideInBunker() statement is not part of the if statement, despite the misleading indentation. This sort of thing is why I generally always put in braces in an if.
An if statement may have an else clause, whose body is executed if the test is false (i.e. equal to 0).
1 if(happy) {
2 smile();
3 } else {
4 frown();
5 }
A common idiom is to have a chain of if and else if branches that test several conditions:
1 if(temperature < 0) {
2 puts("brrr");
3 } else if(temperature < 100) {
4 puts("hooray");
5 } else {
6 puts("ouch!");
7 }
This can be inefficient if there are a lot of cases, since the tests are applied sequentially. For tests of the form <expression> == <small constant>, the switch statement may provide a faster alternative. Here's a typical switch statement:
1 /* print plural of cow, maybe using the obsolete dual number construction */
2 switch(numberOfCows) {
3 case 1:
4 puts("cow");
5 break;
6 case 2:
7 puts("cowen");
8 break;
9 default:
10 puts("cows");
11 break;
12 }
This prints the string "cow" if there is one cow, "cowen" if there are two cowen, and "cows" if there are any other number of cows. The switch statement evaluates its argument and jumps to the matching case label, or to the default label if none of the cases match. Cases must be constant integer values.
The break statements inside the block jump to the end of the block. Without them, executing the switch with numberOfCows equal to 1 would print all three lines. This can be useful in some circumstances where the same code should be used for more than one case:
1 switch(c) {
2 case 'a':
3 case 'e':
4 case 'i':
5 case 'o':
6 case 'u':
7 type = VOWEL;
8 break;
9 default:
10 type = CONSONANT;
11 break;
12 }
or when a case "falls through" to the next:
1 switch(countdownStart) {
2 case 3:
3 puts("3");
4 case 2:
5 puts("2");
6 case 1:
7 puts("1")
8 case 0:
9 puts("KABLOOIE!");
10 break;
11 default:
12 puts("I can't count that high!");
13 break;
14 }
Note that it is customary to include a break on the last case even though it has no effect; this avoids problems later if a new case is added. It is also customary to include a default case even if the other cases supposedly exhaust all the possible values, as a check against bad or unanticipated inputs.
1 switch(oliveSize) {
2 case JUMBO():
3 eatOlives(SLOWLY);
4 break;
5 case COLLOSSAL:
6 eatOlives(QUICKLY);
7 break;
8 case SUPER_COLLOSSAL:
9 eatOlives(ABSURDLY);
10 break;
11 default:
12 /* unknown size! */
13 abort();
14 break;
15 }
Though switch statements are better than deeply nested if/else-if constructions, it is often even better to organize the different cases as data rather than code. We'll see examples of this when we talk about function pointers C/FunctionPointers.
Nothing in the C standards prevents the case labels from being buried inside other compound statements. One rather hideous application of this fact is Duff's_device.
9.2. Loops
There are three kinds of loops in C.
9.2.1. The while loop
A while loop tests if a condition is true, and if so, executes its body. It then tests the condition is true again, and keeps executing the body as long as it is. Here's a program that deletes every occurence of the letter e from its input.
1 #include <stdio.h>
2
3 int
4 main(int argc, char **argv)
5 {
6 int c;
7
8 while((c = getchar()) != EOF) {
9 switch(c) {
10 case 'e':
11 case 'E':
12 break;
13 default:
14 putchar(c);
15 break;
16 }
17 }
18
19 return 0;
20 }
Note that the expression inside the while argument both assigns the return value of getchar to c and tests to see if it is equal to EOF (which is returned when no more input characters are available). This is a very common idiom in C programs. Note also that even though c holds a single character, it is declared as an int. The reason is that EOF (a constant defined in stdio.h) is outside the normal character range, and if you assign it to a variable of type char it will be quietly truncated into something else. Because C doesn't provide any sort of exception mechanism for signalling unusual outcomes of function calls, designers of library functions often have to resort to extending the output of a function to include an extra value or two to signal failure; we'll see this a lot when the null pointer shows up in C/Pointers.
9.2.2. The do..while loop
The do..while statement is like the while statement except the test is done at the end of the loop instead of the beginning. This means that the body of the loop is always executed at least once.
Here's a loop that plays a slot machine until it recovers its inital losses (if ever). If we changed the do..while loop to a while loop, it would never pull the lever the first time, because winnings would not be less than 0 to start with.
1 winnings = 0;
2 do {
3 winnings += pullSlotMachineLever();
4 } while(winnings < 0);
The do..while loop is used much less often in practice than the while loop. Note that it is always possible to convert a do..while loop to a while loop by making an extra copy of the body in front of the loop.
9.2.3. The for loop
The for loop is a form of syntactic sugar that is used when a loop iterates over a sequence of values stored in some variable (or variables). Its argument consists of three expressions: the first initializes the variable and is called once when the statement is first reached. The second is the test to see if the body of the loop should be executed; it has the same function as the test in a while loop. The third sets the variable to its next value. Some examples:
PineWiki