| Data | Lifetime (abstract machine) | Segment | Example address range (runtime location in x86-64 Linux, non-PIE) |
|---|---|---|---|
| Constant global | Static | Code (or Text) | 0x40'0000 (≈1 × 2^22) |
| Global | Static | Data | 0x60'0000 (≈1.5 × 2^22) |
| Local | Automatic | Stack | 0x7fff'448d'0000 (≈2^47 = 2^15 × 2^32) |
| Anonymous, returned by new | Dynamic | Heap | 0x1a0'0000 (≈0.8 × 2^25) |
Constant global data and global data have the same lifetime, but are stored in different segments. The operating system uses different segments so it can prevent the program from modifying constants. It marks the code segment, which contains functions (instructions) and constant global data, as read-only, and any attempt to modify code-segment memory causes a crash (a “Segmentation violation”).
An executable is normally at least as big as the static-lifetime data (the code and data segments together). Since all that data must be in memory for the entire lifetime of the program, it’s written to disk and then loaded by the OS before the program starts running. There is an exception, however: the “bss” segment is used to hold modifiable static-lifetime data with initial value zero. Such data is common, since all static-lifetime data is initialized to zero unless otherwise specified in the program text. Rather than storing a bunch of zeros in the object files and executable, the compiler and linker simply track the location and size of all zero-initialized global data. The operating system sets this memory to zero during the program load process. Clearing memory is faster than loading data from disk, so this optimization saves both time (the program loads faster) and space (the executable is smaller).
Programming involves turning an idea into hardware instructions. This transformation happens in multiple steps, some you control and some controlled by other programs.
First you have an idea , like “I want to make a flappy bird iPhone game.” The computer can’t (yet) understand that idea. So you transform the idea into a program , written in some programming language . This process is called programming.
A C++ program actually runs on an abstract machine . The behavior of this machine is defined by the C++ standard , a technical document. This document is supposed to be so precisely written as to have an exact mathematical meaning, defining exactly how every C++ program behaves. But the document can’t run programs!
C++ programs run on hardware (mostly), and the hardware determines what behavior we see. Mapping abstract machine behavior to instructions on real hardware is the task of the C++ compiler (and the standard library and operating system). A C++ compiler is correct if and only if it translates each correct program to instructions that simulate the expected behavior of the abstract machine.
This same rough series of transformations happens for any programming language, although some languages use interpreters rather than compilers.
A bit is the fundamental unit of digital information: it’s either 0 or 1.
C++ manages memory in units of bytes —8 contiguous bits that together can represent numbers between 0 and 255. C’s unit for a byte is char : the abstract machine says a byte is stored in char . That means an unsigned char holds values in the inclusive range [0, 255].
The C++ standard actually doesn’t require that a byte hold 8 bits, and on some crazy machines from decades ago , bytes could hold nine bits! (!?)
But larger numbers, such as 258, don’t fit in a single byte. To represent such numbers, we must use multiple bytes. The abstract machine doesn’t specify exactly how this is done—it’s the compiler and hardware’s job to implement a choice. But modern computers always use place–value notation , just like in decimal numbers. In decimal, the number 258 is written with three digits, the meanings of which are determined both by the digit and by their place in the overall number:
\[ 258 = 2\times10^2 + 5\times10^1 + 8\times10^0 \]
The computer uses base 256 instead of base 10. For example:

\[ 258 = 1\times256^1 + 2\times256^0 \]

Two adjacent bytes can represent numbers between 0 and \(255\times256+255 = 65535 = 2^{16}-1\) , inclusive. A number larger than this would take three or more bytes.
On x86-64, the ones place, the least significant byte, is on the left, at the lowest address in the contiguous two-byte range used to represent the integer. This is the opposite of how decimal numbers are written: decimal numbers put the most significant digit on the left. The representation choice of putting the least-significant byte in the lowest address is called little-endian representation. x86-64 uses little-endian representation.
Some computers actually store multi-byte integers the other way, with the most significant byte stored in the lowest address; that’s called big-endian representation. The Internet’s fundamental protocols, such as IP and TCP, also use big-endian order for multi-byte integers, so big-endian is also called “network” byte order.
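A quick check (our own example, not from the original notes): print the bytes of a two-byte integer. On a little-endian machine such as x86-64 the least-significant byte comes first.

```cpp
#include <cstdio>

int main() {
    unsigned short x = 258;                             // 0x0102
    const unsigned char* p = (const unsigned char*) &x;
    printf("%02x %02x\n", p[0], p[1]);                  // little-endian: "02 01"
}
```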
The C++ standard defines five fundamental unsigned integer types, along with relationships among their sizes. Here they are, along with their actual sizes and ranges on x86-64:
| Type | Size (abstract machine) | Size (x86-64) | Range |
|---|---|---|---|
| unsigned char | 1 | 1 | [0, 255] = [0, 2^8−1] |
| unsigned short | ≥1 | 2 | [0, 65,535] = [0, 2^16−1] |
| unsigned int | ≥ size of short | 4 | [0, 4,294,967,295] = [0, 2^32−1] |
| unsigned long | ≥ size of int | 8 | [0, 18,446,744,073,709,551,615] = [0, 2^64−1] |
| unsigned long long | ≥ size of long | 8 | [0, 18,446,744,073,709,551,615] = [0, 2^64−1] |
Other architectures and operating systems implement different ranges for these types. For instance, on IA32 machines like Intel’s Pentium (the 32-bit processors that predated x86-64), sizeof(long) was 4, not 8.
Note that all values of a smaller unsigned integer type can fit in any larger unsigned integer type. When a value of a larger unsigned integer type is placed in a smaller unsigned integer object, however, not every value fits; for instance, the unsigned short value 258 doesn’t fit in an unsigned char x . When this occurs, the C++ abstract machine requires that the smaller object’s value equals the least -significant bits of the larger value (so x will equal 2).
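A quick illustration (our own example):

```cpp
#include <cassert>

int main() {
    unsigned short s = 258;   // 0x0102
    unsigned char x = s;      // keeps only the least-significant 8 bits
    assert(x == 2);           // 258 mod 256 == 2
}
```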
In addition to these types, whose sizes can vary, C++ has integer types whose sizes are fixed. uint8_t , uint16_t , uint32_t , and uint64_t define 8-bit, 16-bit, 32-bit, and 64-bit unsigned integers, respectively; on x86-64, these correspond to unsigned char , unsigned short , unsigned int , and unsigned long .
This general procedure (place-value notation in base 256) is used to represent any multi-byte integer in memory. In little-endian representation, the bytes are stored in memory from least to most significant. If our example number 258 were stored as a two-byte integer at address 0x30, then address 0x30 would hold the byte 0x02 (the ones place) and address 0x31 would hold the byte 0x01 (the 256s place). In big-endian representation, the bytes are stored in the reverse order: 0x01 at address 0x30 and 0x02 at address 0x31.
Computers are often fastest at dealing with fixed-length numbers, rather than variable-length numbers, and processor internals are organized around a fixed word size . A word is the natural unit of data used by a processor design . In most modern processors, this natural unit is 8 bytes or 64 bits , because this is the power-of-two number of bytes big enough to hold those processors’ memory addresses. Many older processors could access less memory and had correspondingly smaller word sizes, such as 4 bytes (32 bits).
The best representation for signed integers—and the choice made by x86-64, and by the C++20 abstract machine—is two’s complement . Two’s complement representation is based on this principle: Addition and subtraction of signed integers shall use the same instructions as addition and subtraction of unsigned integers.
To see what this means, let's think about what -x should mean when x is an unsigned integer. Wait, negative unsigned?! This isn't an oxymoron because C++ uses modular arithmetic for unsigned integers: the result of an arithmetic operation on unsigned values is always taken modulo 2^B, where B is the number of bits in the unsigned value type. Thus, on x86-64, -x is simply the number that, when added to x, yields 0 (mod 2^B). For example, when unsigned x = 0xFFFFFFFFU, then -x == 1U, since x + -x equals zero (mod 2^32).
To obtain -x, we flip all the bits in x (an operation written ~x) and then add 1. To see why, consider the bit representations. What is x + (~x + 1)? Well, (~x)_i (the i-th bit of ~x) is 1 whenever x_i is 0, and vice versa. That means that every bit of x + ~x is 1 (there are no carries), so x + ~x is the largest unsigned integer, with value 2^B−1. If we add 1 to this, we get 2^B, which is 0 (mod 2^B)! The highest carry bit is dropped, leaving zero.
Two’s complement arithmetic uses half of the unsigned integer representations for negative numbers. A two’s-complement signed integer with B bits has the following values:
- If the most-significant bit is 0, the represented value is the same as the corresponding unsigned value, which lies in [0, 2^(B−1)−1].
- If the most-significant bit is 1, the represented value is the corresponding unsigned value minus 2^B, which lies in [−2^(B−1), −1].
The most significant bit is also called the sign bit , because if it is 1, then the represented value depends on the signedness of the type (and that value is negative for signed types).
Another way to think about two’s-complement is that, for B-bit integers, the most-significant bit has place value 2^(B−1) in unsigned arithmetic and negative 2^(B−1) in signed arithmetic. All other bits have the same place values in both kinds of arithmetic.
The two’s-complement bit pattern for x + y is the same whether x and y are considered as signed or unsigned values. For example, in 4-bit arithmetic, 5 has representation 0b0101 , while the representation 0b1100 represents 12 if unsigned and –4 if signed ( ~0b1100 + 1 == 0b0011 + 1 == 0b0100 , i.e., 4). Let’s add those bit patterns and see what we get: 0b0101 + 0b1100 == 0b1'0001 , and dropping the carry out of the top bit (as 4-bit arithmetic does) leaves 0b0001 .
Note that this is the right answer for both signed and unsigned arithmetic : 5 + 12 = 17 = 1 (mod 16), and 5 + -4 = 1.
Subtraction and multiplication also produce the same results for unsigned arithmetic and signed two’s-complement arithmetic. (For instance, 5 * 12 = 60 = 12 (mod 16), and 5 * -4 = -20 = -4 (mod 16).) This is not true of division. (Consider dividing the 4-bit representation 0b1110 by 2. In signed arithmetic, 0b1110 represents -2, so 0b1110/2 == 0b1111 (-1); but in unsigned arithmetic, 0b1110 is 14, so 0b1110/2 == 0b0111 (7).) And, of course, it is not true of comparison. In signed 4-bit arithmetic, 0b1110 < 0 , but in unsigned 4-bit arithmetic, 0b1110 > 0 . This means that a C compiler for a two’s-complement machine can use a single add instruction for either signed or unsigned numbers, but it must generate different instruction patterns for signed and unsigned division (or less-than, or greater-than).
There are a couple of quirks with C signed arithmetic. First, in two’s complement, there are more negative numbers than positive numbers. The representation whose sign bit is 1 but whose other bits are all 0 has no positive counterpart at the same bit width: for this number, -x == x . (In 4-bit arithmetic, -0b1000 == ~0b1000 + 1 == 0b0111 + 1 == 0b1000 .) Second, and far worse, arithmetic overflow on signed integers is undefined behavior .
| Type | Size (abstract machine) | Size (x86-64) | Range |
|---|---|---|---|
| signed char | 1 | 1 | [−128, 127] = [−2^7, 2^7−1] |
| short | same as unsigned short | 2 | [−32,768, 32,767] = [−2^15, 2^15−1] |
| int | same as unsigned int | 4 | [−2,147,483,648, 2,147,483,647] = [−2^31, 2^31−1] |
| long | same as unsigned long | 8 | [−9,223,372,036,854,775,808, 9,223,372,036,854,775,807] = [−2^63, 2^63−1] |
| long long | same as unsigned long long | 8 | [−9,223,372,036,854,775,808, 9,223,372,036,854,775,807] = [−2^63, 2^63−1] |
The C++ abstract machine requires that signed integers have the same sizes as their unsigned counterparts.
We distinguish pointers , which are concepts in the C abstract machine, from addresses , which are hardware concepts. A pointer combines an address and a type.
The memory representation of a pointer is the same as the representation of its address value. The size of that integer is the machine's word size; for example, on x86-64, a pointer occupies 8 bytes, and a pointer to an object located at address 0x400abc would be stored as the byte sequence bc 0a 40 00 00 00 00 00 (least-significant byte first).
The C++ abstract machine defines an unsigned integer type uintptr_t that can hold any address. (You have to #include <inttypes.h> or <cinttypes> to get the definition.) On most machines, including x86-64, uintptr_t is the same as unsigned long . Cast a pointer to an integer address value with syntax like (uintptr_t) ptr ; cast back to a pointer with syntax like (T*) addr . Casts between pointer types and uintptr_t are information preserving, so this assertion will never fail:
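The assertion itself is elided above; a minimal reconstruction of the round trip might read:

```cpp
#include <cassert>
#include <cinttypes>

int main() {
    int obj = 61;
    int* ptr = &obj;
    uintptr_t addr = (uintptr_t) ptr;   // pointer → integer address value
    int* ptr2 = (int*) addr;            // integer address value → pointer
    assert(ptr == ptr2);                // never fails: the casts preserve information
}
```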
Since it is a 64-bit architecture, the size of an x86-64 address is 64 bits (8 bytes). That’s also the size of x86-64 pointers.
To represent an array of integers, C++ and C allocate the integers next to each other in memory, in sequential addresses, with no gaps or overlaps. Here, we put the integers 0, 1, and 258 next to each other, starting at address 1008: since each int occupies 4 bytes, 0 occupies addresses 1008–1011, 1 occupies 1012–1015, and 258 occupies 1016–1019.
Say that you have an array of N integers, and you access each of those integers in order, accessing each integer exactly once. Does the order matter?
Computer memory is random-access memory (RAM), which means that a program can access any bytes of memory in any order—it’s not, for example, required to read memory in ascending order by address. But if we run experiments, we can see that even in RAM, different access orders have very different performance characteristics.
Our arraysum program sums up all the integers in an array of N integers, using an access order based on its arguments, and prints the resulting delay. Here’s the result of a couple experiments on accessing 10,000,000 items in three orders, “up” order (sequential: elements 0, 1, 2, 3, …), “down” order (reverse sequential: N , N −1, N −2, …), and “random” order (as it sounds).
| order | trial 1 | trial 2 | trial 3 |
|---|---|---|---|
| up | 8.9ms | 7.9ms | 7.4ms |
| down | 9.2ms | 8.9ms | 10.6ms |
| random | 316.8ms | 352.0ms | 360.8ms |
Wow! Down order is just a bit slower than up, but random order seems about 40 times slower. Why?
Random order is defeating many of the internal architectural optimizations that make memory access fast on modern machines. Sequential order, since it’s more predictable, is much easier to optimize.
Foreshadowing. This part of the lecture is a teaser for the Storage unit, where we cover access patterns and caching, including the processor caches that explain this phenomenon, in much more depth.
The C++ programming language offers several collection mechanisms for grouping subobjects together into new kinds of object. The collections are arrays, structs, and unions. (Classes are a kind of struct. All library types, such as vectors, lists, and hash tables, use combinations of these collection types.) The abstract machine defines how subobjects are laid out inside a collection. This is important, because it lets C/C++ programs exchange messages with hardware and even with programs written in other languages: messages can be exchanged only when both parties agree on layout.
Array layout in C++ is particularly simple: The objects in an array are laid out sequentially in memory, with no gaps or overlaps. Assume a declaration like T x[N] , where x is an array of N objects of type T , and say that the address of x is a . Then the address of element x[i] equals a + i * sizeof(T) , and sizeof(x) == N * sizeof(T) .
The C++ library type std::vector defines an array that can grow and shrink. For instance, this function creates a vector containing the numbers 0 up to N in sequence:
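The listing is elided here; a version consistent with the description (function name invented) is:

```cpp
#include <vector>

std::vector<int> sequence(int n) {
    std::vector<int> v;          // v lives on the stack; its contents do not
    for (int i = 0; i != n; ++i) {
        v.push_back(i);
    }
    return v;
}
```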
Here, v is an object with automatic lifetime. This means its size (in the sizeof sense) is fixed at compile time. Remember that the sizes of static- and automatic-lifetime objects must be known at compile time; only dynamic-lifetime objects can have varying size based on runtime parameters. So where and how are v ’s contents stored?
The C++ abstract machine requires that v ’s elements are stored in an array in memory. (The v.data() method returns a pointer to the first element of the array.) But it does not define std::vector ’s layout otherwise, and C++ library designers can choose different layouts based on their needs. We found these to hold for the std::vector in our library:
sizeof(v) == 24 for any vector of any type, and the address of v is a stack address (i.e., v is located in the stack segment).
The first 8 bytes of the vector hold the address of the first element of the contents array—call it the begin address . This address is a heap address, which is as expected, since the contents must have dynamic lifetime. The value of the begin address is the same as that of v.data() .
Bytes 8–15 hold the address just past the contents array—call it the end address . Its value is the same as &v.data()[v.size()] . If the vector is empty, then the begin address and the end address are the same.
Bytes 16–23 hold an address greater than or equal to the end address. This is the capacity address . As a vector grows, it will sometimes outgrow its current location and move its contents to new memory addresses. To reduce the number of copies, vectors usually request more memory from the operating system than they immediately need; this additional space, which is called “capacity,” supports cheap growth. Often the capacity doubles on each growth spurt, since this allows operations like v.push_back() to execute in O (1) time on average.
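Here's a sketch that checks these observations on a libstdc++-style vector. It peeks at the representation through a cast, which is fine for exploration but is not behavior the abstract machine guarantees:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

int main() {
    std::vector<int> v = {1, 2, 3};
    assert(sizeof(v) == 24);                            // three 8-byte words
    uintptr_t* rep = (uintptr_t*) &v;                   // library-specific peek!
    assert(rep[0] == (uintptr_t) v.data());             // begin address
    assert(rep[1] == (uintptr_t) &v.data()[v.size()]);  // end address
    assert(rep[2] >= rep[1]);                           // capacity address
}
```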
Compilers must also decide where different objects are stored when those objects are not part of a collection. For instance, consider this program:
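The program is elided here; based on the discussion below, it declared a few local variables and examined their addresses, roughly like this (printing addresses rather than the notes' hexdump calls):

```cpp
#include <cstdio>

int main() {
    int i1 = 1;
    int i2 = 2;
    char c1 = 'a';
    char c2 = 'b';
    char c3 = 'c';
    printf("i1 at %p, i2 at %p\n", (void*) &i1, (void*) &i2);
    printf("c1 at %p, c2 at %p, c3 at %p\n", (void*) &c1, (void*) &c2, (void*) &c3);
}
```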
The abstract machine says these objects cannot overlap, but does not otherwise constrain their positions in memory.
On Linux, GCC will put all these variables into the stack segment, which we can see using hexdump . But it can put them in the stack segment in any order , as we can see by reordering the declarations (try declaration order i1 , c1 , i2 , c2 , c3 ), by changing optimization levels, or by adding different scopes (braces). The abstract machine gives the programmer no guarantees about how object addresses relate. In fact, the compiler may move objects around during execution, as long as it ensures that the program behaves according to the abstract machine. Modern optimizing compilers often do this, particularly for automatic objects.
But what order does the compiler choose? With optimization disabled, the compiler appears to lay out objects in decreasing order by declaration, so the first declared variable in the function has the highest address. With optimization enabled, the compiler follows roughly the same guideline, but it also rearranges objects by type—for instance, it tends to group char s together—and it can reuse space if different variables in the same function have disjoint lifetimes. The optimizing compiler tends to use less space for the same set of variables. This is because it’s arranging objects by alignment.
The C++ compiler and library restrict the addresses at which some kinds of data appear. In particular, the address of every int value is always a multiple of 4, whether it’s located on the stack (automatic lifetime), the data segment (static lifetime), or the heap (dynamic lifetime).
A bunch of observations will show you these rules:
| Type | Size | Address restrictions | alignof(T) |
|---|---|---|---|
| char (signed char, unsigned char) | 1 | No restriction | 1 |
| short (unsigned short) | 2 | Multiple of 2 | 2 |
| int (unsigned int) | 4 | Multiple of 4 | 4 |
| long (unsigned long) | 8 | Multiple of 8 | 8 |
| float | 4 | Multiple of 4 | 4 |
| double | 8 | Multiple of 8 | 8 |
| long double | 16 | Multiple of 16 | 16 |
| T* (any pointer type) | 8 | Multiple of 8 | 8 |
These are the alignment restrictions for an x86-64 Linux machine.
These restrictions hold for most x86-64 operating systems, except that on Windows, the long type has size and alignment 4. (The long long type has size and alignment 8 on all x86-64 operating systems.)
Just like every type has a size, every type has an alignment. The alignment of a type T is a number a ≥1 such that the address of every object of type T must be a multiple of a . Every object with type T has size sizeof(T) —it occupies sizeof(T) contiguous bytes of memory; and has alignment alignof(T) —the address of its first byte is a multiple of alignof(T) . You can also say sizeof(x) and alignof(x) where x is the name of an object or another expression.
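For instance (our own example; the printed values are for x86-64 Linux):

```cpp
#include <cstdio>

int main() {
    printf("%zu %zu\n", sizeof(int), alignof(int));                  // 4 4
    printf("%zu %zu\n", sizeof(double), alignof(double));            // 8 8
    printf("%zu %zu\n", sizeof(long double), alignof(long double));  // 16 16
}
```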
Alignment restrictions can make hardware simpler, and therefore faster. For instance, consider cache blocks. CPUs access memory through a transparent hardware cache. Data moves from primary memory, or RAM (which is large—a couple gigabytes on most laptops—and uses cheaper, slower technology) to the cache in units of 64 or 128 bytes. Those units are always aligned: on a machine with 128-byte cache blocks, the bytes with memory addresses [127, 128, 129, 130] live in two different cache blocks (with addresses [0, 127] and [128, 255]). But the 4 bytes with addresses [4n, 4n+1, 4n+2, 4n+3] always live in the same cache block. (This is true for any small power of two: the 8 bytes with addresses [8n,…,8n+7] always live in the same cache block.) In general, it’s often possible to make a system faster by leveraging restrictions—and here, the CPU hardware can load data faster when it can assume that the data lives in exactly one cache line.
The compiler, library, and operating system all work together to enforce alignment restrictions.
On x86-64 Linux, alignof(T) == sizeof(T) for all fundamental types (the types built in to C: integer types, floating point types, and pointers). But this isn’t always true; on x86-32 Linux, double has size 8 but alignment 4.
It’s possible to construct user-defined types of arbitrary size, but the largest alignment required by a machine is fixed for that machine. C++ lets you find the maximum alignment for a machine with alignof(std::max_align_t) ; on x86-64, this is 16, the alignment of the type long double (and the alignment of some less-commonly-used SIMD “vector” types ).
We now turn to the abstract machine rules for laying out all collections. The sizes and alignments for user-defined types—arrays, structs, and unions—are derived from a couple simple rules or principles. Here they are. The first rule applies to all types.
1. First-member rule. The address of the first member of a collection equals the address of the collection.
Thus, the address of an array is the same as the address of its first element. The address of a struct is the same as the address of the first member of the struct.
The next three rules depend on the class of collection. Every C abstract machine enforces these rules.
2. Array rule. Arrays are laid out sequentially as described above.
3. Struct rule. The second and subsequent members of a struct are laid out in order, with no overlap, subject to alignment constraints.
4. Union rule. All members of a union share the address of the union.
In C, every struct follows the struct rule, but in C++, only simple structs follow the rule. Complicated structs, such as structs with some public and some private members, or structs with virtual functions, can be laid out however the compiler chooses. The typical situation is that C++ compilers for a machine architecture (e.g., “Linux x86-64”) will all agree on a layout procedure for complicated structs. This allows code compiled by different compilers to interoperate.
The next rule defines the operation of the malloc library function.
5. Malloc rule. Any non-null pointer returned by malloc has alignment appropriate for any type. In other words, assuming the allocated size is adequate, the pointer returned from malloc can safely be cast to T* for any T .
Oddly, this holds even for small allocations. The C++ standard (the abstract machine) requires that malloc(1) return a pointer whose alignment is appropriate for any type, including types that don’t fit.
And the final rule is not required by the abstract machine, but it’s how sizes and alignments on our machines work.
6. Minimum rule. The sizes and alignments of user-defined types, and the offsets of struct members, are minimized within the constraints of the other rules.
The minimum rule, and the sizes and alignments of basic types, are defined by the x86-64 Linux “ABI” —its Application Binary Interface. This specification standardizes how x86-64 Linux C compilers should behave, and lets users mix and match compilers without problems.
From these rules we can derive some interesting consequences.
First, the size of every type is a multiple of its alignment .
To see why, consider an array with two elements. By the array rule, these elements have addresses a and a+sizeof(T) , where a is the address of the array. Both of these addresses contain a T , so they are both a multiple of alignof(T) . That means sizeof(T) is also a multiple of alignof(T) .
We can also characterize the sizes and alignments of different collections .
Thus, the alignment of every collection equals the maximum of the alignments of its components.
It’s also true that the alignment equals the least common multiple of the alignments of its components. You might have thought lcm was a better answer, but the max is the same as the lcm for every architecture that matters, because all fundamental alignments are powers of two.
The size of a struct might be larger than the sum of the sizes of its components, because of alignment constraints. Since the compiler must lay out struct components in order, and it must obey the components’ alignment constraints, and it must ensure different components occupy disjoint addresses, it must sometimes introduce extra space in structs. Here’s an example: the struct will have 3 bytes of padding after char c , to ensure that int i2 has the correct alignment.
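The example struct is elided above; reconstructed from the description:

```cpp
#include <cstdio>

struct example {
    int i1;    // offset 0
    char c;    // offset 4
               // offsets 5-7: 3 bytes of padding so i2 is 4-aligned
    int i2;    // offset 8
};

int main() {
    printf("%zu\n", sizeof(example));   // 12, not 9
}
```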
Thanks to padding, reordering struct components can sometimes reduce the total size of a struct. Padding can happen at the end of a struct as well as the middle. Padding can never happen at the start of a struct, however (because of Rule 1).
The rules also imply that the offset of any struct member —which is the difference between the address of the member and the address of the containing struct— is a multiple of the member’s alignment .
To see why, consider a struct s with member m at offset o . The malloc rule says that any pointer returned from malloc is correctly aligned for s . Every pointer returned from malloc is maximally aligned, equalling 16*x for some integer x . The struct rule says that the address of m , which is 16*x + o , is correctly aligned. That means that 16*x + o = alignof(m)*y for some integer y . Divide both sides by a = alignof(m) and you see that 16*x/a + o/a = y . But 16/a is an integer—the maximum alignment is a multiple of every alignment—so 16*x/a is an integer. We can conclude that o/a must also be an integer!
Finally, we can also derive the necessity for padding at the end of structs. (How?)
What happens when an object is uninitialized? The answer depends on its lifetime.
In C++, most dynamic memory allocation uses special language operators, new and delete , rather than library functions.
Though this seems more complex than the library-function style, it has advantages. A C compiler cannot tell what malloc and free do (especially when they are redefined to debugging versions, as in the problem set), so a C compiler cannot necessarily optimize calls to malloc and free away. But the C++ compiler may assume that all uses of new and delete follow the rules laid down by the abstract machine. That means that if the compiler can prove that an allocation is unnecessary or unused, it is free to remove that allocation!
For example, we compiled this program in the problem set environment (based on test003.cc ):
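The listing is elided here; a rough reconstruction in the spirit of test003.cc (details assumed, and the statistics function name taken from the text below):

```cpp
void m61_printstatistics();   // provided by the problem set environment

int main() {
    int* ptrs[10];
    for (int i = 0; i != 10; ++i) {
        ptrs[i] = new int[i + 1];   // allocations that are never otherwise used
    }
    for (int i = 0; i != 10; ++i) {
        delete[] ptrs[i];
    }
    m61_printstatistics();
}
```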
The optimizing C++ compiler removes all calls to new and delete , leaving only the call to m61_printstatistics() ! (For instance, try objdump -d testXXX to look at the compiled x86-64 instructions.) This is valid because the compiler is explicitly allowed to eliminate unused allocations, and here, since the ptrs variable is local and doesn’t escape main , all allocations are unused. The C compiler cannot perform this useful transformation. (But the C compiler can do other cool things, such as unroll the loops .)
One of C’s more interesting choices is that it explicitly relates pointers and arrays. Although arrays are laid out in memory in a specific way, they generally behave like pointers when they are used. This property probably arose from C’s desire to explicitly model memory as an array of bytes, and it has beautiful and confounding effects.
We’ve already seen one of these effects. The hexdump function has this signature (arguments and return type):
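The signature is elided here; CS61's hexdump takes a pointer and a size, roughly:

```cpp
#include <cstddef>

void hexdump(const void* ptr, size_t size);
```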
But we can just pass an array as argument to hexdump :
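```cpp
int array[4] = {61, 62, 63, 64};     // contents invented for illustration
hexdump(array, sizeof(array));       // the array is converted to a pointer
```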
When used in an expression like this—here, as an argument—the array magically changes into a pointer to its first element. The above call has the same meaning as this:
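```cpp
hexdump(&array[0], sizeof(array));   // explicitly take the first element's address
```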
C programmers transition between arrays and pointers very naturally.
A confounding effect is that, unlike all other types, in C arrays are passed to and returned from functions by reference rather than by value. C is a call-by-value language except for arrays: all function arguments and return values are copied, so that parameter modifications inside a function do not affect the objects passed by the caller—except for arrays. For instance:

```cpp
#include <cstdio>

void f(int a[2]) {
    a[0] = 1;
}

int main() {
    int x[2] = {100, 101};
    f(x);
    printf("%d\n", x[0]);   // prints 1!
}
```

If you don’t like this behavior, you can get around it by using a struct or a C++ std::array .

```cpp
#include <array>
#include <cstdio>

struct array1 {
    int a[2];
};

void f1(array1 arg) {
    arg.a[0] = 1;
}

void f2(std::array<int, 2> a) {
    a[0] = 1;
}

int main() {
    array1 x = {{100, 101}};
    f1(x);
    printf("%d\n", x.a[0]);     // prints 100
    std::array<int, 2> x2 = {100, 101};
    f2(x2);
    printf("%d\n", x2[0]);      // prints 100
}
```
C++ extends the logic of this array–pointer correspondence to support arithmetic on pointers as well.
Pointer arithmetic rule. In the C abstract machine, arithmetic on pointers produces the same result as arithmetic on the corresponding array indexes.
Specifically, consider an array T a[n] and pointers T* p1 = &a[i] and T* p2 = &a[j] . Then:
Equality : p1 == p2 if and only if (iff) p1 and p2 point to the same address, which happens iff i == j .
Inequality : Similarly, p1 != p2 iff i != j .
Less-than : p1 < p2 iff i < j .
Also, p1 <= p2 iff i <= j ; and p1 > p2 iff i > j ; and p1 >= p2 iff i >= j .
Pointer difference : What should p1 - p2 mean? Using array indexes as the basis, p1 - p2 == i - j . (But the type of the difference is always ptrdiff_t , which on x86-64 is long , the signed version of size_t .)
Addition : p1 + k (where k is an integer type) equals the pointer &a[i + k] . ( k + p1 returns the same thing.)
Subtraction : p1 - k equals &a[i - k] .
Increment and decrement : ++p1 means p1 = p1 + 1 , which means p1 = &a[i + 1] . Similarly, --p1 means p1 = &a[i - 1] . (There are also postfix versions, p1++ and p1-- , but C++ style prefers the prefix versions.)
No other arithmetic operations on pointers are allowed. You can’t multiply pointers, for example. (You can multiply addresses by casting the pointers to the address type, uintptr_t —so (uintptr_t) p1 * (uintptr_t) p2 —but why would you?)
Let’s write a function that can sum all the integers in an array.
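The function is elided here; a straightforward version consistent with the discussion:

```cpp
int sum(int a[], int size) {
    int s = 0;
    for (int i = 0; i != size; ++i) {
        s += a[i];
    }
    return s;
}
```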
This function can compute the sum of the elements of any int array. But because of the pointer–array relationship, its a argument is really a pointer . That allows us to call it with subarrays as well as with whole arrays. For instance:
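```cpp
int a[6] = {1, 2, 4, 8, 16, 32};   // example values invented
int s1 = sum(a, 6);                // whole array: 63
int s2 = sum(a, 3);                // first three elements: 7
int s3 = sum(&a[2], 3);            // subarray a[2], a[3], a[4]: 28
```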
This way of thinking about arrays naturally leads to a style that avoids sizes entirely, using instead a sentinel or boundary argument that defines the end of the interesting part of the array.
These expressions compute the same sums as the above:
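The listing is elided; a boundary-style version, with calls equivalent to the ones above, might read:

```cpp
int sum(int* first, int* last) {
    int s = 0;
    while (first != last) {
        s += *first;    // add the element first points to
        ++first;
    }
    return s;
}

int s1 = sum(a, a + 6);       // whole array
int s2 = sum(a, a + 3);       // first three elements
int s3 = sum(a + 2, a + 5);   // subarray a[2..4]
```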
Note that the data from first to last forms a half-open range . In mathematical notation, we care about elements in the range [first, last) : the element pointed to by first is included (if it exists), but the element pointed to by last is not. Half-open ranges give us a simple and clear way to describe empty ranges, such as zero-element arrays: if first == last , then the range is empty.
Note that given a ten-element array a , the pointer a + 10 can be formed and compared, but must not be dereferenced—the element a[10] does not exist. The C/C++ abstract machines allow users to form pointers to the “one-past-the-end” boundary elements of arrays, but users must not dereference such pointers.
So in C, two pointers naturally express a range of an array. The C++ standard template library, or STL, brilliantly abstracts this pointer notion to allow two iterators , which are pointer-like objects, to express a range of any standard data structure—an array, a vector, a hash table, a balanced tree, whatever. This version of sum works for any container of int s; notice how little it changed:
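A reconstruction of the iterator version (note that only the parameter types changed):

```cpp
template <typename It>
int sum(It first, It last) {
    int s = 0;
    while (first != last) {
        s += *first;
        ++first;
    }
    return s;
}
```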
Some example uses:
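```cpp
// (Reconstructed; container contents invented.)
#include <list>
#include <vector>

std::vector<int> v = {1, 2, 3, 4};
std::list<int> lst = {5, 6, 7};
int sv = sum(v.begin(), v.end());       // 10
int sl = sum(lst.begin(), lst.end());   // 18
```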
What’s the difference between these expressions? (Again, a is an array of type T , and p1 == &a[i] and p2 == &a[j] .)
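The two expressions are elided here; from the discussion below, they were presumably along these lines, with d1 and d2 as named:

```cpp
ptrdiff_t d1 = p1 - p2;                          // pointer (index) arithmetic
ptrdiff_t d2 = (uintptr_t) p1 - (uintptr_t) p2;  // address arithmetic
```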
The first expression is defined analogously to index arithmetic, so d1 == i - j . But the second expression performs the arithmetic on the addresses corresponding to those pointers, so we expect d2 to equal sizeof(T) * d1 . Always be aware of which kind of arithmetic you’re using. Generally, arithmetic on pointers should not involve sizeof , since sizeof is included automatically according to the abstract machine; but arithmetic on addresses almost always should involve sizeof .
Although C++ is a low-level language, the abstract machine is surprisingly strict about which pointers may be formed and how they can be used. Violate the rules and you’re in hell because you have invoked the dreaded undefined behavior .
Given an array a[N] of N elements of type T :
Forming a pointer &a[i] (or a + i ) with 0 ≤ i ≤ N is safe.
Forming a pointer &a[i] with i < 0 or i > N causes undefined behavior.
Dereferencing a pointer &a[i] with 0 ≤ i < N is safe.
Dereferencing a pointer &a[i] with i < 0 or i ≥ N causes undefined behavior.
(For the purposes of these rules, objects that are not arrays count as single-element arrays. So given T x , we can safely form &x and &x + 1 and dereference &x .)
What “undefined behavior” means is horrible. A program that executes undefined behavior is erroneous. But the compiler need not catch the error. In fact, the abstract machine says anything goes : undefined behavior is “behavior … for which this International Standard imposes no requirements.” “Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).” Other possible behaviors include allowing hackers from the moon to steal all of a program’s data, take it over, and force it to delete the hard drive on which it is running. Once undefined behavior executes, a program may do anything, including making demons fly out of the programmer’s nose.
Pointer arithmetic, and even pointer comparisons, are also affected by undefined behavior. It’s undefined to go beyond an array’s bounds using pointer arithmetic. And pointers may be compared for equality or inequality even if they point to different arrays or objects, but if you try to compare pointers into different arrays via less-than, like this:
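```cpp
// (A reconstruction of the kind of comparison meant.)
int a[10];
int b[10];
bool lt = &a[0] < &b[0];   // relational comparison across different arrays
```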
that causes undefined behavior.
If you really want to compare pointers that might be to different arrays—for instance, you’re writing a hash function for arbitrary pointers—cast them to uintptr_t first.
A program that causes undefined behavior is not a C++ program . The abstract machine says that a C++ program, by definition, is a program whose behavior is always defined. The C++ compiler is allowed to assume that its input is a C++ program. (Obviously!) So the compiler can assume that its input program will never cause undefined behavior. Thus, since undefined behavior is “impossible,” if the compiler can prove that a condition would cause undefined behavior later, it can assume that condition will never occur.
Consider this program:
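The program is elided here; a reconstruction consistent with the discussion (argument parsing assumed):

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
    char* x = (char*) strtoul(argv[1], nullptr, 0);  // pointer value from the command line
    assert(x + 1 > x);
    printf("x = %p, x + 1 = %p\n", (void*) x, (void*) (x + 1));
}
```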
If we supply a value equal to (char*) -1 , we’re likely to see the two pointer values printed with no assertion failure! But that’s an apparently impossible result. The printout can only happen if x + 1 > x (otherwise, the assertion will fail and stop the printout). But x + 1 , which wraps around to equal 0 , is less than x , which is the largest 8-byte value!
The impossible happens because of undefined behavior reasoning. When the compiler sees an expression like x + 1 > x (with x a pointer), it can reason this way:
“Ah, x + 1 . This must be a pointer into the same array as x (or it might be a boundary pointer just past that array, or just past the non-array object x ). This must be so because forming any other pointer would cause undefined behavior.
“The pointer comparison is the same as an index comparison. x + 1 > x means the same thing as &x[1] > &x[0] . But that holds iff 1 > 0 .
“In my infinite wisdom, I know that 1 > 0 . Thus x + 1 > x always holds, and the assertion will never fail.
“My job is to make this code run fast. The fastest code is code that’s not there. This assertion will never fail—might as well remove it!”
Arithmetic on signed integers also has important undefined behaviors. Signed integer arithmetic must never overflow. That is, the compiler may assume that the mathematical result of any signed arithmetic operation, such as x + y (with x and y both int ), can be represented inside the relevant type. It causes undefined behavior, therefore, to add 1 to the maximum positive integer. (The ubexplore.cc program demonstrates how this can produce impossible results, as with pointers.)
Arithmetic on unsigned integers is much safer with respect to undefined behavior. Unsigned integers are defined to perform arithmetic modulo 2^B, where B is the number of bits in the type. This means that if you add 1 to the maximum unsigned integer, the result is always zero.
Dividing an integer by zero causes undefined behavior whether or not the integer is signed.
Sanitizers, which in our makefiles are turned on by supplying SAN=1 , can catch many undefined behaviors as soon as they happen. Sanitizers are built in to the compiler itself; a sanitizer involves cooperation between the compiler and the language runtime. This has the major performance advantage that the compiler introduces exactly the required checks, and the optimizer can then use its normal analyses to remove redundant checks.
That said, undefined behavior checking can still be slow. Undefined behavior allows compilers to make assumptions about input values, and those assumptions can directly translate to faster code. Turning on undefined behavior checking can make some benchmark programs run 30% slower [link] .
File cs61-lectures/datarep5/ubexplore2.cc contains the following program.
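The program is elided here; reconstructed from the discussion below:

```cpp
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
    int n1 = strtol(argv[1], nullptr, 0);
    int n2 = strtol(argv[2], nullptr, 0);
    for (int i = n1; i <= n2; ++i) {
        printf("%d\n", i);
    }
}
```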
What will be printed if we run the program with ./ubexplore2 0x7ffffffe 0x7fffffff ?
0x7fffffff is the largest positive value that can be represented by type int . Adding one to this value yields 0x80000000 , which in two's complement representation is the smallest negative number representable by type int .
Assuming that the program behaves this way, the loop exit condition i > n2 can never be met, and the program should run (and print out numbers) forever.
However, if we run the optimized version of the program, it prints only two numbers (2147483646 and 2147483647) and exits.
The unoptimized program does print forever and never exits.
What’s going on here? We need to look at the compiled assembly of the program with and without optimization (via objdump -S ).
The unoptimized version basically looks like this:
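(The assembly listing is elided; here is a rough step-by-step sketch, numbered to match the references below.)

```
1. compare i with n2
2. if i is greater than n2, jump to step 7
3. prepare the arguments for printf
4. call printf to print i
5. increment i
6. jump back to step 1
7. return
```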
This is a pretty direct translation of the loop.
The optimized version, though, does it differently. As always, the optimizer has its own ideas. (Your compiler may produce different results!)
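(Again a sketch, not the actual instructions, numbered to match the discussion below.)

```
1. compute n2 + 1
2. compare i with n2 + 1
3. if they are equal, jump to step 8 (the loop body never runs)
4. call printf to print i
5. increment i
6. compare i with n2 + 1
7. if they are not equal, jump back to step 4
8. return
```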
The compiler changed the source’s less-than-or-equal comparison, i <= n2 , into a not-equal comparison in the executable, i != n2 + 1 (in both cases using wrapping computer arithmetic, i.e., modulo 2^32)! The comparison i <= n2 will always return true when n2 == 0x7FFFFFFF , the maximum signed integer, so the loop goes on forever. But the i != n2 + 1 comparison does not always return true when n2 == 0x7FFFFFFF : when i wraps around to 0x80000000 (the smallest negative integer), then i equals n2 + 1 (which also wrapped), and the loop stops.
Why did the compiler make this transformation? In the original loop, the step-6 jump is immediately followed by another comparison and jump in steps 1 and 2. The processor jumps all over the place, which can confuse its prediction circuitry and slow down performance. In the transformed loop, the step-7 jump is never followed by a comparison and jump; instead, step 7 goes back to step 4, which always prints the current number. This more streamlined control flow is easier for the processor to make fast.
But the streamlined control flow is only a valid substitution under the assumption that the addition n2 + 1 never overflows . Luckily (sort of), signed arithmetic overflow causes undefined behavior, so the compiler is totally justified in making that assumption!
Programs based on ubexplore2 have demonstrated undefined behavior differences for years, even as the precise reasons why have changed. In some earlier compilers, we found that the optimizer just upgraded the int s to long s—arithmetic on long s is just as fast on x86-64 as arithmetic on int s, since x86-64 is a 64-bit architecture, and sometimes using long s for everything lets the compiler avoid conversions back and forth. The ubexplore2l program demonstrates this form of transformation: since the loop variable is added to a long counter, the compiler opportunistically upgrades i to long as well. This transformation is also only valid under the assumption that i + 1 will not overflow—which it can’t, because of undefined behavior.
Using an unsigned type prevents all this undefined behavior, because arithmetic overflow on unsigned integers is well defined in C/C++. The ubexplore2u.cc file uses an unsigned loop index and comparison, and ./ubexplore2u and ./ubexplore2u.noopt behave exactly the same (though you have to give arguments like ./ubexplore2u 0xfffffffe 0xffffffff to see the overflow).
Basic bitwise operators.
Computers offer not only the usual arithmetic operators like + and - , but also a set of bitwise operators. The basic ones are & (and), | (or), ^ (xor/exclusive or), and the unary operator ~ (complement). In truth table form:
| & (and) | 0 | 1 |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 1 |

| \| (or) | 0 | 1 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 1 |

| ^ (xor) | 0 | 1 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |

| ~ (complement) | |
|---|---|
| 0 | 1 |
| 1 | 0 |
In C or C++, these operators work on integers. But they work bitwise: the result of an operation is determined by applying the operation independently at each bit position. Here’s how to compute 12 & 4 in 4-bit unsigned arithmetic:
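```
  0b1100   (12)
& 0b0100   ( 4)
  ------
  0b0100   ( 4)
```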
These basic bitwise operators simplify certain important computations. For example, (x & (x - 1)) == 0 tests whether x is zero or a power of 2.
Negation of signed integers can also be expressed using a bitwise operator: -x == ~x + 1 . This is in fact how two's complement representation is defined. We can verify that x and -x add up to zero under this representation: x + (~x + 1) == (x + ~x) + 1 == 0b111...1 + 1 == 0 (mod 2^B).
Bitwise "and" ( & ) can help with modular arithmetic. For example, x % 32 == (x & 31) . We essentially "mask off", or clear, higher order bits to do modulo-powers-of-2 arithmetics. This works in any base. For example, in decimal, the fastest way to compute x % 100 is to take just the two least significant digits of x .
x << i shifts x left by i positions, appending i zero bits at the least-significant end of x . High-order bits that don't fit in the integer are thrown out. For example, assuming 4-bit unsigned integers, 0b0101 << 2 == 0b0100 : the bits shift left and the high 01 that no longer fits is dropped.
Similarly, x >> i shifts x right by i positions, appending i zero bits at the most-significant end of x . The low-order bits are thrown out.
Bitwise shift helps with division and multiplication. For example:
A modern compiler can optimize y = x * 66 into y = (x << 6) + (x << 1) .
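A quick check (our own illustration) that the two forms agree:

```cpp
unsigned x = 61;
unsigned y1 = x * 66;
unsigned y2 = (x << 6) + (x << 1);   // 66 == 64 + 2, so y1 == y2
```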
Bitwise operations also allow us to treat the bits within an integer separately. This can be useful for "options".
For example, when we call a function to open a file, we have a lot of true/false options: open for reading or for writing, create the file if it doesn't exist, truncate it, append to it, and so forth.
One bad way to implement this is to have this function take a bunch of arguments -- one argument for each option. This makes the function call look like this:
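A hypothetical illustration (all names invented): one bool parameter per option.

```cpp
#include <cstdio>

FILE* open_file(const char* name,
                bool read, bool write, bool create, bool append, bool truncate);

FILE* f = open_file("data.txt", true, false, true, false, false);
// Quick: which argument was "create"?
```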
The long list of arguments slows down the function call, and one can also easily lose track of the meaning of the individual true/false values passed in.
A cheaper way to achieve this is to use a single integer to represent all the options. Have each option defined as a power of 2, and simply | (or) them together and pass them as a single integer.
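A hypothetical sketch of the flags style (constants invented; each option is a distinct power of 2, so each occupies one bit):

```cpp
#include <cstdio>

constexpr unsigned OPEN_READ     = 1u << 0;
constexpr unsigned OPEN_WRITE    = 1u << 1;
constexpr unsigned OPEN_CREATE   = 1u << 2;
constexpr unsigned OPEN_APPEND   = 1u << 3;
constexpr unsigned OPEN_TRUNCATE = 1u << 4;

FILE* open_file(const char* name, unsigned flags);

FILE* f = open_file("data.txt", OPEN_READ | OPEN_CREATE);
// The callee tests an option with a mask: if (flags & OPEN_CREATE) { ... }
```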
Flags are usually defined as powers of 2 so we set one bit at a time for each flag. It is less common but still possible to define a combination flag that is not a power of 2, so that it sets multiple bits in one go.
File cs61-lectures/datarep5/mb-driver.cc contains a memory allocation benchmark. The core of the benchmark looks like this:
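The listing is elided; a rough reconstruction of the loop described below (details assumed):

```cpp
memnode* m[4096];
for (int i = 0; i != 4096; ++i) {
    m[i] = arena.allocate();          // initial 4096 allocations
}
for (unsigned i = 0; i != noperations; ++i) {
    unsigned pos = rand() % 4096;     // pick a slot (selection method assumed)
    arena.deallocate(m[pos]);         // free one memnode...
    m[pos] = arena.allocate();        // ...then allocate another
}
for (int i = 0; i != 4096; ++i) {
    arena.deallocate(m[i]);           // free everything
}
```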
The benchmark tests the performance of the memnode_arena::allocate() and memnode_arena::deallocate() functions. In the handout code, these functions do the same thing as new memnode and delete memnode —they are wrappers for malloc and free . The benchmark allocates 4096 memnode objects, then frees and re-allocates them noperations times, and finally frees all of them.
We only allocate memnode s, and all memnode s are of the same size, so we don't need metadata that keeps track of the size of each allocation. Furthermore, since all dynamically allocated data is freed at the end of the function, for each individual memnode_free() call we don't really need to return memory to the system allocator. We can simply reuse this memory during the function and return all memory to the system at once when the function exits.
If we run the benchmark with 100000000 allocations and use the system malloc() and free() functions to implement the memnode allocator, the benchmark finishes in 0.908 seconds.
Our alternative implementation of the allocator finishes in 0.355 seconds, beating the heavily optimized system allocator by a factor of about 2.5. We will reveal how we achieved this in the next lecture.
We continue our exploration with the memnode allocation benchmark introduced in the last lecture.
File cs61-lectures/datarep6/mb-malloc.cc contains a version of the benchmark using the system new and delete operators.
In this function we allocate an array of 4096 pointers to memnode s, which occupies 2^3 × 2^12 = 2^15 bytes on the stack. We then allocate 4096 memnode s. Our memnode is defined like this:
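```cpp
// (Reconstructed from the description in the next paragraph.)
#include <string>

struct memnode {
    std::string file;
    unsigned line;
};
```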
Each memnode contains a std::string object and an unsigned integer. Each std::string object internally contains a pointer to a character array in the heap. Therefore, every time we create a new memnode , we need 2 allocations: one to allocate the memnode itself, and another performed internally by the std::string object when we initialize/assign a string value to it.
Every time we deallocate a memnode by calling delete , we also destroy the std::string object, and the string object knows that it should deallocate the heap character array it internally maintains. So there are also 2 deallocations occurring each time we free a memnode .
We make the benchmark return a seemingly meaningless result to prevent an aggressive compiler from optimizing everything away. We also use this result to check that our subsequent optimizations to the allocator are correct: each optimized version must produce the same result.
This version of the benchmark, using the system allocator, finishes in 0.335 seconds. Not bad at all.
Spoiler alert: We can do 15x better than this.
1st optimization: std::string
We only deal with one file name, namely "datarep/mb-filename.cc", which is constant throughout the program for all memnode s. It's also a string literal, so as a constant string it has static lifetime. Why can't we simply use a const char* in place of the std::string and let the pointer point to the static constant string? This saves the internal allocation/deallocation performed by std::string every time we initialize or destroy a string.
The fix is easy: we simply change the memnode definition:
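```cpp
struct memnode {
    const char* file;   // points at the static string literal; no allocation
    unsigned line;
};
```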
This version of the benchmark now finishes in 0.143 seconds, a 2x improvement over the original benchmark. This improvement is consistent with the 2x reduction in the number of allocations/deallocations mentioned earlier.
You may ask why people still use std::string if it involves an additional allocation and is slower than const char* , as shown in this benchmark. std::string is much more flexible: it can also handle data that doesn't have static lifetime, such as input from a user or data the program receives over the network. In short, when a program deals with strings that are not constant, heap data is likely to be very useful, and std::string provides facilities to conveniently handle on-heap data.
2nd optimization: the system allocator
We still use the system allocator to allocate/deallocate memnode s. The system allocator is a general-purpose allocator, which means it must handle allocation requests of all sizes. Such general-purpose designs usually come with a performance compromise. Since we are only allocating memnode s, which are fairly small objects that all have the same size, we can build a special-purpose allocator just for them.
In cs61-lectures/datarep5/mb2.cc , we actually implement a special-purpose allocator for memnode s:
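A sketch of such an arena, reconstructed from the description below (the real code is in mb2.cc):

```cpp
#include <vector>

struct memnode_arena {
    std::vector<memnode*> free_list;

    memnode* allocate() {
        if (free_list.empty()) {
            return new memnode;        // fall back to the system allocator
        }
        memnode* n = free_list.back(); // reuse a previously freed memnode
        free_list.pop_back();
        return n;
    }

    void deallocate(memnode* n) {
        free_list.push_back(n);        // keep it around for reuse
    }
};
```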
This allocator maintains a free list (a C++ vector ) of freed memnode s. allocate() simply pops a memnode off the free list if there is one, and deallocate() simply puts the memnode on the free list. This free list serves as a buffer between the system allocator and the benchmark function, so that the system allocator is invoked less frequently. In fact, in the benchmark, the system allocator is invoked only 4096 times, when the pointer array is first populated. That's a huge reduction because the 10 million "recycle" operations in the middle no longer involve the system allocator.
With this special-purpose allocator we can finish the benchmark in 0.057 seconds, another 2.5x improvement.
However, this allocator now leaks memory: it never actually calls delete ! Let's fix this by letting it also keep track of all allocated memnode s. The modified definition of memnode_arena now looks like this:
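Again a sketch consistent with the description:

```cpp
#include <vector>

struct memnode_arena {
    std::vector<memnode*> allocated;   // every memnode we ever allocated
    std::vector<memnode*> free_list;

    ~memnode_arena() {
        destroy_all();                 // runs automatically at scope exit
    }

    void destroy_all() {
        for (memnode* n : allocated) {
            delete n;                  // actually return memory to the system
        }
        allocated.clear();
        free_list.clear();
    }

    memnode* allocate() {
        if (free_list.empty()) {
            memnode* n = new memnode;
            allocated.push_back(n);    // remember it so destroy_all() can delete it
            return n;
        }
        memnode* n = free_list.back();
        free_list.pop_back();
        return n;
    }

    void deallocate(memnode* n) {
        free_list.push_back(n);
    }
};
```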
With the updated allocator we simply need to invoke arena.destroy_all() at the end of the function to fix the memory leak. And we don't even need to invoke this method manually! We can use the C++ destructor for the memnode_arena struct, defined as ~memnode_arena() in the code above, which is automatically called when our arena object goes out of scope. We simply make the destructor invoke the destroy_all() method, and we are all set.
Fixing the leak doesn't appear to affect performance at all. This is because the overhead added by tracking the allocated list and calling delete affects only the initial 4096 allocations plus the cleanup at the very end. These 8192 additional operations are a relatively small number compared to the 10 million recycle operations, so the added overhead is hardly noticeable.
Spoiler alert: We can improve this by another factor of 2.
3rd optimization: std::vector
In our special-purpose allocator memnode_arena , we maintain an allocated list and a free list, both using C++ std::vector s. std::vector s are dynamic arrays; like std::string they involve an additional level of indirection and store the actual array in the heap. We don't access the allocated list during the "recycling" part of the benchmark (which takes the bulk of the benchmark time, as we showed earlier), so the allocated list is probably not our bottleneck. However, we add and remove elements from the free list on every recycle operation, and the indirection introduced by the std::vector there may actually be our bottleneck. Let's find out.
Instead of using a std::vector , we could use a linked list of all free memnode s as the actual free list. We will need some extra metadata in the memnode to store pointers for this linked list. However, unlike in the debugging allocator pset, in a free list we don't need to store this metadata in addition to the actual memnode data: the memnode is free and not in use, so we can reuse its memory, using a union:
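```cpp
// (A sketch: a freed memnode's own storage holds the list pointer.)
union freeable_memnode {
    memnode n;                     // when in use: the memnode itself
    freeable_memnode* next_free;   // when free: link to the next free node
};
```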
We then maintain the free list like this:
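A sketch of the linked-list free list (the allocated-list tracking from the previous step is omitted for brevity):

```cpp
struct memnode_arena {
    freeable_memnode* free_list = nullptr;

    memnode* allocate() {
        freeable_memnode* fn = free_list;
        if (fn) {
            free_list = fn->next_free;   // pop the head of the list
        } else {
            fn = new freeable_memnode;   // fall back to the system allocator
        }
        return &fn->n;
    }

    void deallocate(memnode* n) {
        // Union rule: the memnode shares its address with the union.
        freeable_memnode* fn = (freeable_memnode*) n;
        fn->next_free = free_list;       // push onto the list
        free_list = fn;
    }
};
```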
Compared to the std::vector free list, this free list always points directly to an available memnode when it is not empty ( free_list != nullptr ), without going through any indirection. With the std::vector free list, one would first have to go into the heap to access the actual array containing pointers to free memnode s, and only then access the memnode itself.
With this change we can now finish the benchmark in under 0.03 seconds, another 2x improvement over the previous one!
Compared to the benchmark with the system allocator (which finished in 0.335 seconds), we managed to achieve a speedup of nearly 15x with arena allocation.
Chapter 5: Data, Representation, and Information
The preceding two chapters address the creation of models that capture phenomena of interest and the abstractions both for data and for computation that reduce these models to forms that can be executed by computer. We turn now to the ways computer scientists deal with information, especially in its static form as data that can be manipulated by programs.
Gray begins by narrating a long line of research on databases—storehouses of related, structured, and durable data. We see here that the objects of research are not data per se but rather designs of “schemas” that allow deliberate inquiry and manipulation. Gray couples this review with introspection about the ways in which database researchers approach these problems.
Databases support storage and retrieval of information by defining—in advance—a complex structure for the data that supports the intended operations. In contrast, Lesk reviews research on retrieving information from documents that are formatted to meet the needs of applications rather than predefined schematized formats.
Interpretation of information is at the heart of what historians do, and Ayers explains how information technology is transforming their paradigms. He proposes that history is essentially model building—constructing explanations based on available information—and suggests that the methods of computer science are influencing this core aspect of historical analysis.
Jim Gray, Microsoft Research
A small research investment helped produce U.S. market dominance in the $14 billion database industry. Government and industry funding of a few research projects created the ideas for several generations of products and trained the people who built those products. Continuing research is now creating the ideas and training the people for the next generation of products.
The database industry generated about $14 billion in revenue in 2002 and is growing at 20 percent per year, even though the overall technology sector is almost static. Among software sectors, the database industry is second only to operating system software. Database industry leaders are all U.S.-based corporations: IBM, Microsoft, and Oracle are the three largest. There are several specialty vendors: Tandem sells over $1 billion/ year of fault-tolerant transaction processing systems, Teradata sells about $1 billion/year of data-mining systems, and companies like Information Resources Associates, Verity, Fulcrum, and others sell specialized data and text-mining software.
In addition to these well-established companies, there is a vibrant group of small companies specializing in application-specific databases—for text retrieval, spatial and geographical data, scientific data, image data, and so on. An emerging group of companies offer XML-oriented databases. Desktop databases are another important market focused on extreme ease of use, small size, and disconnected (offline) operation.
Companies began automating their back-office bookkeeping in the 1960s. The COBOL programming language and its record-oriented file model were the workhorses of this effort. Typically, a batch of transactions was applied to the old-tape-master, producing a new-tape-master and printout for the next business day. During that era, there was considerable experimentation with systems to manage an online database that could capture transactions as they happened. At first these systems were ad hoc, but late in that decade network and hierarchical database products emerged. A COBOL subcommittee defined a network data model standard (DBTG) that formed the basis for most systems during the 1970s. Indeed, in 1980 DBTG-based Cullinet was the leading software company.
However, there were some problems with DBTG. DBTG uses a low-level, record-at-a-time procedural language to access information. The programmer has to navigate through the database, following pointers from record to record. If the database is redesigned, as often happens over a decade, then all the old programs have to be rewritten.
The relational data model, enunciated by IBM researcher Ted Codd in a 1970 Communications of the Association for Computing Machinery article, 1 was a major advance over DBTG. The relational model unified data and metadata so that there was only one form of data representation. It defined a non-procedural data access language based on algebra or logic. It was easier for end users to visualize and understand than the pointers-and-records-based DBTG model.
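Codd's point is easiest to see next to the navigational style it replaced. The sketch below is not from Gray's essay; it uses Python's built-in sqlite3 module and a made-up employees table to show a non-procedural query, which names the result wanted and leaves the access path to the system.

```python
# A minimal sketch (not from the essay): a relational question posed in
# non-procedural SQL. The schema and data are hypothetical; Python's
# built-in sqlite3 module stands in for a full DBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "Research", 120.0), ("Grace", "Systems", 130.0),
     ("Edsger", "Research", 110.0)],
)

# The query states *what* is wanted, not *how* to navigate from record to
# record; the optimizer is free to choose any access path it likes.
for row in conn.execute("SELECT dept, AVG(salary) FROM employees GROUP BY dept"):
    print(row)
```

Contrast this with the DBTG style, where the program itself would chase pointers from record to record and would break if the physical design changed.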
The research community (both industry and university) embraced the relational data model and extended it during the 1970s. Most significantly, researchers showed that a non-procedural language could be compiled to give performance comparable to the best record-oriented database systems. This research produced a generation of systems and people that formed the basis for products from IBM, Ingres, Oracle, Informix, Sybase, and others. The SQL relational database language was standardized by ANSI/ISO between 1982 and 1986. By 1990, virtually all database systems provided an SQL interface (including network, hierarchical, and object-oriented systems).
Meanwhile the database research agenda moved on to geographically distributed databases and to parallel data access. Theoretical work on distributed databases led to prototypes that in turn led to products. Today, all the major database systems offer the ability to distribute and replicate data among nodes of a computer network. Intense research on data replication during the late 1980s and early 1990s gave rise to a second generation of replication products that are now the mainstays of mobile computing.
Research of the 1980s showed how to execute each of the relational data operators in parallel—giving hundred-fold and thousand-fold speedups. The results of this research began to appear in the products of several major database companies. With the proliferation of data mining in the 1990s, huge databases emerged. Interactive access to these databases requires that the system use multiple processors and multiple disks to read all the data in parallel. In addition, these problems require near-linear time search algorithms. University and industrial research of the previous decade had solved these problems and forms the basis of the current VLDB (very large database) data-mining systems.

1. E.F. Codd, 1970, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM 13(6):377-387.
Rollup and drilldown data reporting systems had been a mainstay of decision-support systems ever since the 1960s. In the mid-1990s, the research community turned its focus to data-mining algorithms. Researchers invented very efficient data cube and materialized view algorithms that form the basis for the current generation of business intelligence products.
The most recent round of government-sponsored research creating a new industry comes from the National Science Foundation’s Digital Libraries program, which spawned Google. It was founded by a group of “database” graduate students who took a fresh look at how information should be organized and presented in the Internet era.
There continues to be active and valuable research on representing and indexing data, adding inference to data search, compiling queries more efficiently, executing queries in parallel, integrating data from heterogeneous data sources, analyzing performance, and extending the transaction model to handle long transactions and workflow (transactions that involve human as well as computer steps). The availability of huge volumes of data on the Internet has prompted the study of data integration, mediation, and federation in which a portal system presents a unification of several data sources by pulling data on demand from different parts of the Internet.
In addition, there is great interest in unifying object-oriented concepts with the relational model. New data types (image, document, and drawing) are best viewed as the methods that implement them rather than by the bytes that represent them. By adding procedures to the database system, one gets active databases, data inference, and data encapsulation. This object-oriented approach is an area of active research and ferment both in academe and industry. It seems that in 2003, the research prototypes are mostly done and this is an area that is rapidly moving into products.
The Internet is full of semi-structured data—data that has a bit of schema and metadata, but is mostly a loose collection of facts. XML has emerged as the standard representation of semi-structured data, but there is no consensus on how such data should be stored, indexed, or searched. There have been intense research efforts to answer these questions. Prototypes have been built at universities and industrial research labs, and now products are in development.
The database research community now has a major focus on stream data processing. Traditionally, databases have been stored locally and are
updated by transactions. Sensor networks, financial markets, telephone calls, credit card transactions, and other data sources present streams of data rather than a static database. The stream data processing researchers are exploring languages and algorithms for querying such streams and providing approximate answers.
Now that nearly all information is online, data security and data privacy are extremely serious and important problems. A small, but growing, part of the database community is looking at ways to protect people’s privacy by limiting the ways data is used. This work also has implications for protecting intellectual property (e.g., digital rights management, watermarking) and protecting data integrity by digitally signing documents and then replicating them so that the documents cannot be altered or destroyed.
The U.S. government funded many database research projects from 1972 to the present. Projects at the University of California at Los Angeles gave rise to Teradata and produced many excellent students. Projects at Computer Corp. of America (SDD-1, Daplex, Multibase, and HiPAC) pioneered distributed database technology and object-oriented database technology. Projects at Stanford University fostered deductive database technology, data integration technology, query optimization technology, and the popular Yahoo! and Google Internet sites. Work at Carnegie Mellon University gave rise to general transaction models and ultimately to the Transarc Corporation. There have been many other successes from AT&T, the University of Texas at Austin, Brown and Harvard Universities, the University of Maryland, the University of Michigan, Massachusetts Institute of Technology, Princeton University, and the University of Toronto among others. It is not possible to enumerate all the contributions here, but we highlight three representative research projects that had a major impact on the industry.
Project Ingres started at the University of California at Berkeley in 1972. Inspired by Codd’s paper on relational databases, several faculty members (Stonebraker, Rowe, Wong, and others) started a project to design and build a relational system. Incidental to this work, they invented a query language (QUEL), relational optimization techniques, a language binding technique, and interesting storage strategies. They also pioneered work on distributed databases.
The Ingres academic system formed the basis for the Ingres product now owned by Computer Associates. Students trained on Ingres went on
to start or staff all the major database companies (AT&T, Britton Lee, HP, Informix, IBM, Oracle, Tandem, Sybase). The Ingres project went on to investigate distributed databases, database inference, active databases, and extensible databases. It was rechristened Postgres, which is now the basis of the digital library and scientific database efforts within the University of California system. Recently, Postgres spun off to become the basis for a new object-relational system from the start-up Illustra Information Technologies.
Codd’s ideas were inspired by seeing the problems IBM and its customers were having with IBM’s IMS product and the DBTG network data model. His relational model was at first very controversial; people thought that the model was too simplistic and that it could never give good performance. IBM Research management took a gamble and chartered a small (10-person) systems effort to prototype a relational system based on Codd’s ideas. That system produced a prototype that eventually grew into the DB2 product series. Along the way, the IBM team pioneered ideas in query optimization, data independence (views), transactions (logging and locking), and security (the grant-revoke model). In addition, the SQL query language from System R was the basis for the ANSI/ISO standard.
The System R group went on to investigate distributed databases (project R*) and object-oriented extensible databases (project Starburst). These research projects have pioneered new ideas and algorithms. The results appear in IBM’s database products and those of other vendors.
Not all research ideas work out. During the 1970s there was great enthusiasm for database machines—special-purpose computers that would be much faster than general-purpose computers running conventional database software. These research projects were often based on exotic hardware like bubble memories, head-per-track disks, or associative RAM. The problem was that general-purpose systems were improving at 50 percent per year, so it was difficult for exotic systems to compete with them. By 1980, most researchers realized the futility of special-purpose approaches and the database-machine community switched to research on using arrays of general-purpose processors and disks to process data in parallel.
The University of Wisconsin hosted the major proponents of this idea in the United States. Funded by the government and industry, those researchers prototyped and built a parallel database machine called
Gamma. That system produced ideas and a generation of students who went on to staff all the database vendors. Today the parallel systems from IBM, Tandem, Oracle, Informix, Sybase, and Microsoft all have a direct lineage from the Wisconsin research on parallel database systems. The use of parallel database systems for data mining is the fastest-growing component of the database server industry.
The Gamma project evolved into the Exodus project at Wisconsin (focusing on an extensible object-oriented database). Exodus has now evolved to the Paradise system, which combines object-oriented and parallel database techniques to represent, store, and quickly process huge Earth-observing satellite databases.
In addition to creating a huge industry, database theory, science, and engineering constitute a key part of computer science today. Representing knowledge within a computer is one of the central challenges of computer science (Box 5.1). Database research has focused primarily on this fundamental issue. Many universities have faculty investigating these problems and offer classes that teach the concepts developed by this research program.
How can knowledge be represented so that algorithms can make new inferences from the knowledge base? This problem has challenged philosophers for millennia. There has been progress. Euclid axiomatized geometry and proved its basic theorems, and in doing so implicitly demonstrated mechanical reasoning from first principles. George Boole’s Laws of Thought created a predicate calculus, and Laplace’s work on probability was a first start on statistical inference. Each of these threads—proofs, predicate calculus, and statistical inference—was a major advance, but each requires substantial human creativity to fit new problems to the solution. Wouldn’t it be nice if we could just put all the books and journals in a library that would automatically organize them and start producing new answers? There are huge gaps between our current tools and the goal of a self-organizing library, but computer scientists are trying to fill the gaps with better algorithms and better ways of representing knowledge. Databases are one branch of this effort to represent information and reason about it. The database community has taken a bottom-up approach, working with simple data representations and developing a calculus for asking and answering questions about the database.
The fundamental approach of database researchers is to insist that the information must be schematized—the information must be represented in a predefined schema that assigns a meaning to each value. The author-title-subject-abstract schema of a library system is a typical example of this approach. The schema is used both to organize the data and to make it easy to express questions about the database. Database researchers have labored to make it easy to define the schema, easy to add data to the database, and easy to pose questions to the database. Early database systems were dreadfully difficult to use—largely because we lacked the algorithms to automatically index huge databases and lacked powerful query tools. Today there are good tools to define schemas, and graphical tools that make it easy to explore and analyze the contents of a database. This has required invention at all levels of the problem.

At the lowest levels we had to discover efficient algorithms to sort, index, and organize numeric, text, temporal, and spatial information so that higher-level software could just pick from a wide variety of organizations and algorithms. These low-level algorithms mask data placement so that it can be spread among hundreds or thousands of disks; they mask concurrency so that the higher-level software can view a consistent data snapshot, even though the data is in flux. The low-level software includes enough redundancy so that once data is placed in the database, it is safe to assume that the data will never be lost. One major advance was the theory and algorithms to automatically guarantee these concurrency-reliability properties.

Text, spatial, and temporal databases have always posed special challenges. Certainly there have been huge advances in indexing these databases, but researchers still have many more problems to solve. The advent of image, video, and sound databases raises new issues. In particular, we are now able to extract a huge number of features from images and sounds, but we have no really good ways to index these features. This is just another aspect of the “curse of dimensionality” faced by database systems in the data-mining and data analysis area. When each object has more than a dozen attributes, traditional indexing techniques give little help in reducing the approximate search space. So, there are still many unsolved research challenges for the low-level database “plumbers.”

The higher-level software that uses this plumbing has been a huge success. Early on, the research community embraced the relational data model championed by Ted Codd. Codd advocated the use of non-procedural set-oriented programming to define schemas and to pose queries. After a decade of experimentation, these research ideas evolved into the SQL database language. Having this high-level non-procedural language was a boon both to application programmers and to database implementers. Application programmers could write much simpler programs. The database implementers faced the challenge of optimizing and executing SQL. Because it is so high level (SQL is a non-procedural functional dataflow language), SQL allows data to be distributed across many computers and disks. Because the programs do not mention any physical structures, the implementer is free to use whatever “plumbing” is available. And because the language is functional, it can be executed in parallel.
Techniques for implementing the relational data model and algorithms for efficiently executing database queries remain a core part of the database research agenda. Over the last decade, the traditional database systems have grown to include analytics (data cubes), and also data-mining algorithms borrowed from the machine-learning and statistics communities. There is increasing interest in solving information retrieval and multimedia database issues. Today, there are very good tools for defining and querying traditional database systems; but there are still major research challenges in the traditional database field. The major focus is automating as much of the data administration tasks as possible—making the database system self-healing and self-managing. We are still far from the goal of building systems that automatically ingest information, reason about it, and produce answers on demand. But the goal is closer, and it seems attainable within this century.
Michael Lesk, Rutgers University
In other countries computer science is often called “informatics” or some similar name. Much computer science research derives from the need to access, process, store, or otherwise exploit some resource of useful information. Just as chemistry is driven to a large extent by the need to understand substances, computing is driven by a need to handle data and information. As an example of the way chemistry has developed, see Oliver Sacks’s book Uncle Tungsten: Memories of a Chemical Boyhood (Vintage Books, 2002). He describes his explorations through the different metals, learning the properties of each, and understanding their applications. Similarly, in the history of computer science, our information needs and our information capabilities have driven parts of the research agenda. Information retrieval systems take some kind of information, such as text documents or pictures, and try to retrieve topics or concepts based on words or shapes. Deducing the concept from the bytes can be difficult, and the way we approach the problem depends on what kind of bytes we have and how many of them we have.
Our experimental method is to see if we can build a system that will provide some useful access to information or service. If it works, those algorithms and that kind of data become a new field: look at areas like geographic information systems. If not, people may abandon the area until we see a new motivation to exploit that kind of data. For example, face-recognition algorithms have received a new impetus from security needs, speeding up progress in the last few years. An effective strategy to move computer science forward is to provide some new kind of information and see if we can make it useful.
Chemistry, of course, involves a dichotomy between substances and reactions. Just as we can (and frequently do) think of computer science in terms of algorithms, we can talk about chemistry in terms of reactions. However, chemistry has historically focused on substances: the encyclopedias and indexes in chemistry tend to be organized and focused on compounds, with reaction names and schemes getting less space on the shelf. Chemistry is becoming more balanced as we understand reactions better; computer science has always been more heavily oriented toward algorithms, but we cannot ignore the driving force of new kinds of data.
The history of information retrieval, for example, has been driven by the kinds of information we could store and use. In the 1960s, for example, storage was extremely expensive. Research projects were limited to text
materials. Even then, storage costs meant that a research project could just barely manage to have a single ASCII document available for processing. For example, Gerard Salton’s SMART system, one of the leading text retrieval systems for many years (see Salton’s book, The SMART Automatic Retrieval System, Prentice-Hall, 1971), did most of its processing on collections of a few hundred abstracts. The only collections of “full documents” were a collection of 80 extended abstracts, each a page or two long, and a collection of under a thousand stories from Time Magazine, each less than a page in length. The biggest collection was 1400 abstracts in aeronautical engineering. With this data, Salton was able to experiment on the effectiveness of retrieval methods using suffixing, thesauri, and simple phrase finding.

Salton also laid down the standard methodology for evaluating retrieval systems, based on Cyril Cleverdon’s measures of “recall” (the percentage of the relevant material that is retrieved in response to a query) and “precision” (the percentage of the material retrieved that is relevant). A system with perfect recall finds all relevant material, making no errors of omission and leaving out nothing the user wanted. In contrast, a system with perfect precision finds only relevant material, making no errors of commission and not bothering the user with stuff of no interest. The SMART system produced these measures for many retrieval experiments and its methodology was widely used, making text retrieval one of the earliest areas of computer science with agreed-on evaluation methods. Salton was not able to do anything with image retrieval at the time; there were no such data available for him.
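Cleverdon's two measures reduce to simple set arithmetic. A toy illustration, not Salton's code, with invented document IDs:

```python
# Cleverdon's measures as described above, computed over sets of document IDs.
def recall(retrieved: set, relevant: set) -> float:
    """Fraction of the relevant material that was retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved: set, relevant: set) -> float:
    """Fraction of the retrieved material that is relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5}
print(recall(retrieved, relevant))     # 2/3: one relevant document was missed
print(precision(retrieved, relevant))  # 2/4: two retrieved documents were noise
```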
Another idea shaped by the amount of information available was “relevance feedback,” the idea of identifying useful documents from a first retrieval pass in order to improve the results of a later retrieval. With so few documents, high precision seemed like an unnecessary goal. It was simply not possible to retrieve more material than somebody could look at. Thus, the research focused on high recall (also stimulated by the insistence by some users that they had to have every single relevant document). Relevance feedback helped recall. By contrast, the use of phrase searching to improve precision was tried but never got much attention simply because it did not have the scope to produce much improvement in the running systems.
The basic problem is that we wish to search for concepts, and what we have in natural language are words and phrases. When our documents are few and short, the main problem is not to miss any, and the research at the time stressed algorithms that found related words via associations or improved recall with techniques like relevance feedback.
Then, of course, several other advances—computer typesetting and word processing to generate material and cheap disks to hold it—led to much larger text collections. Figure 5.1 shows the decline in the price of disk space since the first disks in the mid-1950s, generally following the cost-performance trends of Moore’s law.

FIGURE 5.1 Decline in the price of disk space, 1950 to 2004.
Cheaper storage led to larger and larger text collections online. Now there are many terabytes of data on the Web. These vastly larger volumes mean that precision has now become more important, since the common problem is now wading through far too many documents. Not surprisingly, in the mid-1980s efforts started on separating the multiple meanings of words like “bank” or “pine,” work that became the research area of “sense disambiguation.” 2 With sense disambiguation, it is possible to imagine searching for only one meaning of an ambiguous word, thus avoiding many erroneous retrievals.
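Fittingly, the dictionary-overlap method from the paper cited in note 2 survives in the NLTK library as the "Lesk algorithm." A minimal sketch, assuming NLTK and its WordNet corpus are installed (nltk.download('wordnet')):

```python
# Dictionary-based sense disambiguation: nltk's lesk() picks, for an
# ambiguous word, the WordNet sense whose definition overlaps most with
# the surrounding context. Assumes the 'wordnet' data has been downloaded.
from nltk.wsd import lesk

context = "I sat on the bank of the river and watched the water".split()
sense = lesk(context, "bank")
if sense is not None:
    # Ideally a riverside sense rather than the financial one; the simple
    # overlap heuristic is not guaranteed to choose correctly.
    print(sense.name(), "-", sense.definition())
```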
Large-scale research on text processing took off with the availability of the TREC (Text Retrieval Evaluation Conference) data. Thanks to the National Institute of Standards and Technology, several hundred megabytes of text were provided (in each of several years) for research use. This stimulated more work on query analysis, text handling, searching algorithms, and related areas; see the series titled TREC Conference Proceedings, edited by Donna Harman of NIST.

2. See Michael Lesk, 1986, “How to Tell a Pine Cone from an Ice Cream Cone,” pp. 26-28.
Document clustering appeared as an important way to shorten long search results. Clustering enables a system to report not, say, 5000 documents but rather 10 groups of 500 documents each, and the user can then explore the group or groups that seem relevant. Salton anticipated the future possibility of such algorithms, as did others. 3 Until we got large collections, though, clustering did not find application in the document retrieval world. Now one routinely sees search engines using these techniques, and faster clustering algorithms have been developed.
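Result clustering of the kind described above can be sketched in a few lines, assuming the scikit-learn library is available: documents become TF-IDF vectors and k-means groups them. An illustration with invented snippets, not a production engine:

```python
# Group search results by embedding them as TF-IDF vectors and clustering
# with k-means; a real engine would also label each cluster for the user.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "banks raise interest rates",
    "central bank monetary policy",
    "river bank erosion after the flood",
    "flood waters along the river bank",
]
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for doc, label in zip(docs, labels):
    print(label, doc)
```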
Thus the algorithms explored switched from recall aids to precision aids as the quantity of available data increased. Manual thesauri, for example, have dropped out of favor for retrieval, partly because of their cost but also because their goal is to increase recall, which is not today’s problem. In terms of finding the concepts hinted at by words and phrases, our goals now are to sharpen rather than broaden these concepts: thus disambiguation and phrase matching, and not as much work on thesauri and term associations.
Again, multilingual searching started to matter, because multilingual collections became available. Multilingual research shows a more precise example of particular information resources driving research. The Canadian government made its Parliamentary proceedings (called Hansard) available in both French and English, with paragraph-by-paragraph translation. This data stimulated a number of projects looking at how to handle bilingual material, including work on automatic alignment of the parallel texts, automatic linking of similar words in the two languages, and so on. 4
A similar effect was seen with the Brown corpus of tagged English text, where the part of speech of each word (e.g., whether a word is a noun or a verb) was identified. This produced a few years of work on algorithms that learned how to assign parts of speech to words in running text based on statistical techniques, such as the work by Garside. 5
3. See, for example, N. Jardine and C.J. van Rijsbergen, 1971, “The Use of Hierarchical Clustering in Information Retrieval,” Information Storage and Retrieval 7:217-240.
4. See, for example, T.K. Landauer and M.L. Littman, 1990, “Fully Automatic Cross-Language Document Retrieval Using Latent Semantic Indexing,” pp. 31-38, University of Waterloo Centre for the New OED and Text Research, Waterloo, Ontario, October; or I. Dagan and Ken Church, 1997, “Termight: Coordinating Humans and Machines in Bilingual Terminology Acquisition,” Machine Translation 12(1/2):89-107.
5. Roger Garside, 1987, “The CLAWS Word-tagging System,” in R. Garside, G. Leech, and G. Sampson (eds.), The Computational Analysis of English: A Corpus-Based Approach, Longman, London.
One might see an analogy to various new fields of chemistry. The recognition that pesticides like DDT were environmental pollutants led to a new interest in biodegradability, and the Freon propellants used in aerosol cans stimulated research in reactions in the upper atmosphere. New substances stimulated a need to study reactions that previously had not been a top priority for chemistry and chemical engineering.
As storage became cheaper, image storage became as practical as text storage had been a decade earlier. Starting in the 1980s we saw the IBM QBIC project demonstrating that something could be done to retrieve images directly, without having to index them by text words first. 6 Projects like this were stimulated by the availability of “clip art” such as the COREL image disks. Several different projects were driven by the easy access to images in this way, with technology moving on from color and texture to more accurate shape processing. At Berkeley, for example, the “Blobworld” project made major improvements in shape detection and recognition, as described in Carson et al. 7 These projects demonstrated that retrieval could be done with images as well as with words, and that properties of images could be found that were usable as concepts for searching.
Another new kind of data that became feasible to process was sound, in particular human speech. Here it was the Defense Advanced Research Projects Agency (DARPA) that took the lead, providing the SWITCHBOARD corpus of spoken English. Again, the availability of a substantial file of tagged information helped stimulate many research projects that used this corpus and developed much of the technology that eventually went into the commercial speech recognition products we now have. As with the TREC contests, the competitions run by DARPA based on its spoken language data pushed the industry and the researchers to new advances. National needs created a new technology; one is reminded of the development of synthetic rubber during World War II or the advances in catalysis needed to make explosives during World War I.
Yet another kind of new data was geo-coded data, introducing a new set of conceptual ideas related to place. Geographical data started showing up in machine-readable form during the 1980s, especially with the release of the Dual Independent Map Encoding (DIME) files after the 1980 census and the Topologically Integrated Geographic Encoding and Referencing (TIGER) files from the 1990 census. The availability, free of charge, of a complete U.S. street map stimulated much research on systems to display maps, to give driving directions, and the like. 8 When aerial photographs also became available, there was the triumph of Microsoft’s “Terraserver,” which made it possible to look at a wide swath of the world from the sky along with correlated street and topographic maps. 9

6. See, for example, Wayne Niblack, Ron Barber, William Equitz, Myron Flickner, Eduardo H. Glasman, Dragutin Petkovic, Peter Yanker, Christos Faloutsos, and Gabriel Taubin, 1993, “The QBIC Project: Querying Images by Content, Using Color, Texture, and Shape,” pp. 173-187.

7. C. Carson, M. Thomas, S. Belongie, J.M. Hellerstein, and J. Malik, 1999, “Blobworld: A System for Region-based Image Indexing and Retrieval,” Springer-Verlag, Amsterdam, pp. 509-516.
More recently, in the 1990s, we have started to look at video search and retrieval. After all, if a CD-ROM contains about 300,000 times as many bytes per pound as a deck of punched cards, and a digitized video has about 500,000 times as many bytes per second as the ASCII script it comes from, then video retrieval today should be roughly where text retrieval was in the 1960s. And indeed there are a few projects, most notably the Informedia project at Carnegie Mellon University, that experiment with video signals; they do not yet have ways of searching enormous collections, but they are developing algorithms that exploit whatever they can find in the video: scene breaks, closed-captioning, and so on.
Again, there is the problem of deducing concepts from a new kind of information. We started with the problem of words in one language needing to be combined when synonymous, picked apart when ambiguous, and moved on to detecting synonyms across multiple languages and then to concepts depicted in pictures and sounds. Now we see research such as that by Jezekiel Ben-Arie associating words like “run” or “hop” with video images of people doing those actions. In the same way we get again new chemistry when molecules like “buckyballs” are created and stimulate new theoretical and reaction studies.
Defining concepts for search can be extremely difficult. For example, despite our abilities to parse and define every item in a computer language, we have made no progress on retrieval of software; people looking for search or sort routines depend on metadata or comments. Some areas seem more flexible than others: text and naturalistic photograph processing software tends to be very general, while software to handle CAD diagrams and maps tends to be more specific. Algorithms are sometimes portable; both speech processing and image processing need Fourier transforms, but the literature is less connected than one might like (partly because of the difference between one-dimensional and two-dimensional transforms).

8. An early publication was R. Elliott and M. Lesk, 1982, “Route Finding in Street Maps by Computers and People,” Proceedings of the National Conference on Artificial Intelligence (AAAI-82), Pittsburgh, Pa., August, pp. 258-261.

9. T. Barclay, J. Gray, and D. Slutz, 2000, “Microsoft Terraserver: A Spatial Data Warehouse,” Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Association for Computing Machinery, New York, pp. 307-318.
There are many other examples of interesting computer science research stimulated by the availability of particular kinds of information. Work on string matching today is often driven by the need to align sequences in either protein or DNA data banks. Work on image analysis is heavily influenced by the need to deal with medical radiographs. And there are many other interesting projects specifically linked to an individual data source. Among examples:
The British Library scanning of the original manuscript of Beowulf in collaboration with the University of Kentucky, working on image enhancement until the result of the scanning is better than reading the original;
The Perseus project, demonstrating the educational applications possible because of the earlier Thesaurus Linguae Graecae project, which digitized all the classical Greek authors;
The work in astronomical analysis stimulated by the Sloan Digital Sky Survey;
The creation of the field of “forensic paleontology” at the University of Texas as a result of doing MRI scans of fossil bones;
And, of course, the enormous amount of work on search engines stimulated by the Web.
When one of these fields takes off, and we find wide usage of some online resource, it benefits society. Every university library gained readers as their catalogs went online and became accessible to students in their dorm rooms. Third World researchers can now access large amounts of technical content their libraries could rarely acquire in the past.
In computer science, and in chemistry, there is a tension between the algorithm/reaction and the data/substance. For example, should one look up an answer or compute it? Once upon a time logarithms were looked up in tables; today we also compute them on demand. Melting points and other physical properties of chemical substances are looked up in tables; perhaps with enough quantum mechanical calculation we could predict them, but it’s impractical for most materials. Predicting tomorrow’s weather might seem a difficult choice. One approach is to measure the current conditions, take some equations that model the atmosphere, and calculate forward a day. Another is to measure the current conditions, look in a big database for the previous day most similar to today, and then take the day after that one as the best prediction for tomorrow. However, so far the meteorologists feel that calculation is better. Another complicated example is chess: given the time pressure of chess tournaments
weighed against the speed and storage available in computers, chess programs play the opening and the endgame by looking in tables of old data and compute their way through the middle game.
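The look-up-versus-compute tension is easy to demonstrate in miniature. A purely illustrative sketch, not from the essay: logarithms answered from a precomputed table, as in the old printed volumes, next to on-demand computation.

```python
# Look it up or compute it: two ways to answer the same question.
import math

# "Table" approach: pay the cost once, then answer by lookup, but only
# for the arguments that were tabulated.
log_table = {x: math.log10(x) for x in range(1, 10001)}

def log_lookup(x: int) -> float:
    return log_table[x]

# "Compute" approach: slower per call, but works for any argument.
def log_compute(x: float) -> float:
    return math.log10(x)

print(log_lookup(5000), log_compute(5000.0))
```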
To conclude, a recipe for stimulating advances in computer science is to make some data available and let people experiment with it. With the incredibly cheap disks and scanners available today, this should be easier than ever. Unfortunately, what we gain with technology we are losing to law and economics. Many large databases are protected by copyright; few motion pictures, for example, are old enough to have gone out of copyright. Content owners generally refuse to grant permission for wide use of their material, whether out of greed or fear: they may have figured out how to get rich off their files of information or they may be afraid that somebody else might have. Similarly it is hard to get permission to digitize in-copyright books, no matter how long they have been out of print. Jim Gray once said to me, “May all your problems be technical.” In the 1960s I was paying people to key in aeronautical abstracts. It never occurred to us that we should be asking permission of the journals involved (I think what we did would qualify as fair use, but we didn’t even think about it). Today I could scan such things much more easily, but I would not be able to get permission. Am I better off or worse off?
There are now some 22 million chemical substances in the Chemical Abstracts Service Registry and 7 million reactions. New substances continue to intrigue chemists and cause research on new reactions, with of course enormous interest in biochemistry both for medicine and agriculture. Similarly, we keep adding data to the Web, and new kinds of information (photographs of dolphins, biological flora, and countless other things) can push computer scientists to new algorithms. In both cases, synthesis of specific instances into concepts is a crucial problem. As we see more and more kinds of data, we learn more about how to extract meaning from it, and how to present it, and we develop a need for new algorithms to implement this knowledge. As the data gets bigger, we learn more about optimization. As it gets more complex, we learn more about representation. And as it gets more useful, we learn more about visualization and interfaces, and we provide better service to society.
Edward L. Ayers, University of Virginia
We might begin with a thought experiment: What is history? Many people, I’ve discovered, think of it as books and the things in books. That’s certainly the explicit form in which we usually confront history. Others, thinking less literally, might think of history as stories about the past; that would open us to oral history, family lore, movies, novels, and the other forms in which we get most of our history.
All these images are wrong, of course, in the same way that images of atoms as little solar systems are wrong, or pictures of evolution as profiles of ever taller and more upright apes and people are wrong. They are all models, radically simplified, that allow us to think about such things in the exceedingly small amounts of time that we allot to these topics.
The same is true for history, which is easiest to envision as technological progress, say, or westward expansion, or the emergence of freedom—or as increasing alienation, exploitation of the environment, or the growth of intrusive government.
Those of us who think about specific aspects of society or nature for a living, of course, are never satisfied with the stories that suit the purposes of everyone else so well.
We are troubled by all the things that don’t fit, all the anomalies, variance, and loose ends. We demand more complex measurement and description, and fewer smoothing metaphors and lowest common denominators.
Thus, to scientists, atoms appear as clouds of probability; evolution appears as a branching, labyrinthine bush in which some branches die out and others diversify. It can certainly be argued that past human experience is as complex as anything in nature and likely much more so, if by complexity we mean numbers of components, variability of possibilities, and unpredictability of outcomes.
Yet our means of conveying that complexity remain distinctly analog: the story, the metaphor, the generalization. Stories can be wonderfully complex, of course, but they are complex in specific ways: of implication, suggestion, evocation. That’s what people love and what they remember.
But maybe there is a different way of thinking about the past: as information. In fact, information is all we have. Studying the past is like studying scientific processes for which you have the data but cannot run the experiment again, in which there is no control, and in which you can never see the actual process you are describing and analyzing. All we have is information in various forms: words in great abundance, billions of numbers, millions of images, some sounds and buildings, artifacts.
The historian’s goal, it seems to me, should be to account for as much of the complexity embedded in that information as we can. That, it appears, is what scientists do, and it has served them well.
And how have the sciences accounted for ever-increasing amounts of complexity in the information they use? Through ever more sophisticated instruments. The connection between computer science and history could be analogous to that between telescopes and stars, microscopes and cells. We could be on the cusp of a new understanding of the patterns of complexity in human behavior of the past.
The problem may be that there is too much complexity in that past, or too much static, or too much silence. In the sciences, we’ve learned how to filter, infer, use indirect evidence, and fill in the gaps, but we have a much more literal approach to the human past.
We have turned to computer science for tasks of more elaborate description, classification, representation. The digital archive my colleagues and I have built, the Valley of the Shadow Project, permits the manipulation of millions of discrete pieces of evidence about two communities in the era of the American Civil War. It uses sorting mechanisms, hypertextual display, animation, and the like to allow people to handle the evidence of this part of the past for themselves. This isn’t cutting-edge computer science, of course, but it’s darned hard and deeply disconcerting to some, for it seems to abdicate responsibility, to undermine authority, to subvert narrative, to challenge story.
Now, we’re trying to take this work to the next stage, to analysis. We have composed a journal article that employs an array of technologies, especially geographic information systems and statistical analysis in the creation of the evidence. The article presents its argument, evidence, and historiographical context as a complex textual, tabular, and graphical representation. XML offers a powerful means to structure text and XSL an even more powerful means to transform it and manipulate its presentation. The text is divided into sections called “statements,” each supported with “explanation.” Each explanation, in turn, is supported by evidence and connected to relevant historiography.
Linkages, forward and backward, between evidence and narrative are central. The historiography can be automatically sorted by author, date, or title; the evidence can be arranged by date, topic, or type. Both evidence and historiographical entries are linked to the places in the analysis where they are invoked. The article is meant to be used online, but it can be printed in a fixed format with all the limitations and advantages of print.
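A guess at what such a structure might look like, with hypothetical element names and placeholder content, since the actual Valley of the Shadow article schema is not reproduced in the text. The sketch uses Python's standard xml.etree module; an XSL stylesheet (not shown) would handle presentation.

```python
# A hypothetical statement/explanation/evidence structure of the kind
# described above. Element names and content are invented for illustration.
import xml.etree.ElementTree as ET

article = ET.fromstring("""
<article>
  <statement id="s1">
    <text>A claim about the two communities.</text>
    <explanation>
      <text>Supporting reasoning for the claim.</text>
      <evidence ref="ev-census-1860"/>
      <historiography ref="hist-author-1998"/>
    </explanation>
  </statement>
</article>
""")

# The online article can re-sort evidence and historiography on demand;
# here we just collect the evidence references in sorted order.
refs = sorted(e.get("ref") for e in article.iter("evidence"))
print(refs)
```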
So, what are the implications of thinking of the past in the hardheaded sense of admitting that all we really have of the past is information? One implication might be great humility, since all we have for most
of the past are the fossils of former human experience, words frozen in ink and images frozen in line and color. Another implication might be hubris: if we suddenly have powerful new instruments, might we be on the threshold of a revolution in our understanding of the past? We’ve been there before.
A connection between history and social science was tried before, during the first days of accessible computers. Historians taught themselves statistical methods and even programming languages so that they could adopt the techniques, models, and insights of sociology and political science. In the 1950s and 1960s the creators of the new political history called on historians to emulate the precision, explicitness, replicability, and inclusivity of the quantitative social sciences. For two decades that quantitative history flourished, promising to revolutionize the field. And to a considerable extent it did: it changed our ideas of social mobility, political identification, family formation, patterns of crime, economic growth, and the consequences of ethnic identity. It explicitly linked the past to the present and held out a history of obvious and immediate use.
But that quantitative social science history collapsed suddenly, the victim of its own inflated claims, limited method and machinery, and changing academic fashion. By the mid-1980s, history, along with many of the humanities and social sciences, had taken the linguistic turn. Rather than software manuals and codebooks, graduate students carried books of French philosophy and German literary interpretation. The social science of choice shifted from sociology to anthropology; texts replaced tables. A new generation defined itself in opposition to social scientific methods just as energetically as an earlier generation had seen in those methods the best means of writing a truly democratic history. The first computer revolution largely failed.
The first effort at that history fell into decline in part because historians could not abide the distance between their most deeply held beliefs and what the statistical machinery permitted, the abstraction it imposed. History has traditionally been built around contingency and particularity, but the most powerful tools of statistics are built on sampling and extrapolation, on generalization and tendency. Older forms of social history talked about vague and sometimes dubious classifications in part because that was what the older technology of tabulation permitted us to see. It has become increasingly clear across the social sciences that such flat ways of describing social life are inadequate; satisfying explanations must be dynamic, interactive, reflexive, and subtle, refusing to reify structures of social life or culture. The new technology permits a new cross-fertilization.
Ironically, social science history faded just as computers became widely available, just as new kinds of social science history became feasible. No longer is there any need for white-coated attendants at huge mainframes
and expensive proprietary software. Rather than reducing people to rows and columns, searchable databases now permit researchers to maintain the identities of individuals in those databases and to represent entire populations rather than samples. Moreover, the record can now include things social science history could only imagine before the Web: completely indexed newspapers, with the original readable on the screen; completely searchable letters and diaries by the thousands; and interactive maps with all property holders identified and linked to other records. Visualization of patterns in the data, moreover, far outstrips the possibilities of numerical calculation alone. Manipulable histograms, maps, and time lines promise a social history that is simultaneously sophisticated and accessible. We have what earlier generations of social science historians dreamed of: a fast and widely accessible network linked to cheap and powerful computers running common software with well-established standards for the handling of numbers, texts, and images. New possibilities of collaboration and cumulative research beckon. Perhaps the time is right to reclaim a worthy vision of a disciplined and explicit social scientific history that we abandoned too soon.
What does this have to do with computer science? Everything, it seems to me. If you want hard problems, historians have them. And what’s the hardest problem of all right now? The capture of the very information that is history. Can computer science imagine ways to capture historical information more efficiently? Can it offer ways to work with the spotty, broken, dirty, contradictory, nonstandardized information we work with?
The second hard problem is the integration of this disparate evidence in time and space, offering new precision, clarity, and verifiability, as well as opening new questions and new ways of answering them.
If we can think of these ways, then we face virtually limitless possibilities. Is there a more fundamental challenge or opportunity for computer science than helping us to figure out human society over human time?
National Research Council; Division of Behavioral and Social Sciences and Education; Commission on Behavioral and Social Sciences and Education; Committee on Basic Research in the Behavioral and Social Sciences; Gerstein DR, Luce RD, Smelser NJ, et al., editors. The Behavioral and Social Sciences: Achievements and Opportunities. Washington (DC): National Academies Press (US); 1988.
This chapter concerns research on collecting, representing, and analyzing the data that underlie behavioral and social sciences knowledge. Such research, methodological in character, includes ethnographic and historical approaches, scaling, axiomatic measurement, and statistics, with its important relatives, econometrics and psychometrics. The field can be described as including the self-conscious study of how scientists draw inferences and reach conclusions from observations. Since statistics is the largest and most prominent of methodological approaches and is used by researchers in virtually every discipline, statistical work draws the lion’s share of this chapter’s attention.
Problems of interpreting data arise whenever inherent variation or measurement fluctuations create challenges to understand data or to judge whether observed relationships are significant, durable, or general. Some examples: Is a sharp monthly (or yearly) increase in the rate of juvenile delinquency (or unemployment) in a particular area a matter for alarm, an ordinary periodic or random fluctuation, or the result of a change or quirk in reporting method? Do the temporal patterns seen in such repeated observations reflect a direct causal mechanism, a complex of indirect ones, or just imperfections in the data? Is a decrease in auto injuries an effect of a new seat-belt law? Are the disagreements among people describing some aspect of a subculture too great to draw valid inferences about that aspect of the culture?
Such issues of inference are often closely connected to substantive theory and specific data, and to some extent it is difficult and perhaps misleading to treat methods of data collection, representation, and analysis separately. This report does so, as do all sciences to some extent, because the methods developed often are far more general than the specific problems that originally gave rise to them. There is much transfer of new ideas from one substantive field to another—and to and from fields outside the behavioral and social sciences. Some of the classical methods of statistics arose in studies of astronomical observations, biological variability, and human diversity. The major growth of the classical methods occurred in the twentieth century, greatly stimulated by problems in agriculture and genetics. Some methods for uncovering geometric structures in data, such as multidimensional scaling and factor analysis, originated in research on psychological problems, but have been applied in many other sciences. Some time-series methods were developed originally to deal with economic data, but they are equally applicable to many other kinds of data.
Even such an abridged listing makes clear that improvements in methodology are valuable across the spectrum of empirical research in the behavioral and social sciences as well as in application to policy questions. Clearly, methodological research serves many different purposes, and there is a need to develop different approaches to serve those different purposes, including exploratory data analysis, scientific inference about hypotheses and population parameters, individual decision making, forecasting what will happen in the event or absence of intervention, and assessing causality from both randomized experiments and observational data.
This discussion of methodological research is divided into three areas: design, representation, and analysis. The efficient design of investigations must take place before data are collected because it involves how much data, of what kind, and by what means the data are to be collected. What type of study is feasible: experimental, sample survey, field observation, or other? What variables should be measured, controlled, and randomized? How extensive a subject pool or observational period is appropriate? How can study resources be allocated most effectively among various sites, instruments, and subsamples?
The construction of useful representations of the data involves deciding what kind of formal structure best expresses the underlying qualitative and quantitative concepts that are being used in a given study. For example, cost of living is a simple concept to quantify if it applies to a single individual with unchanging tastes in stable markets (that is, markets offering the same array of goods from year to year at varying prices), but as a national aggregate for millions of households and constantly changing consumer product markets, the cost of living is not easy to specify clearly or measure reliably. Statisticians, economists, sociologists, and other experts have long struggled to make the cost of living a precise yet practicable concept that is also efficient to measure, and they must continually modify it to reflect changing circumstances.
Data analysis covers the final step of characterizing and interpreting research findings: Can estimates of the relations between variables be made? Can some conclusion be drawn about correlation, cause and effect, or trends over time? How uncertain are the estimates and conclusions and can that uncertainty be reduced by analyzing the data in a different way? Can computers be used to display complex results graphically for quicker or better understanding or to suggest different ways of proceeding?
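As a toy version of those questions, the sketch below (not from the report) estimates a relation between two variables and attaches uncertainty to the estimate with a bootstrap, assuming numpy is available and using synthetic data:

```python
# Estimate a correlation and quantify its uncertainty via the bootstrap.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=1.0, size=200)   # synthetic data

r = np.corrcoef(x, y)[0, 1]                     # point estimate

# Bootstrap: resample pairs with replacement, recompute the estimate,
# and read a 95% interval off the resampling distribution.
boots = []
for _ in range(2000):
    idx = rng.integers(0, len(x), len(x))
    boots.append(np.corrcoef(x[idx], y[idx])[0, 1])
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"r = {r:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```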
Advances in analysis, data representation, and research design feed into and reinforce one another in the course of actual scientific work. The intersections between methodological improvements and empirical advances are an important aspect of the multidisciplinary thrust of progress in the behavioral and social sciences.
Four broad kinds of research designs are used in the behavioral and social sciences: experimental, survey, comparative, and ethnographic.
Experimental designs, in either the laboratory or field settings, systematically manipulate a few variables while others that may affect the outcome are held constant, randomized, or otherwise controlled. The purpose of randomized experiments is to ensure that only one or a few variables can systematically affect the results, so that causes can be attributed. Survey designs include the collection and analysis of data from censuses, sample surveys, and longitudinal studies and the examination of various relationships among the observed phenomena. Randomization plays a different role here than in experimental designs: it is used to select members of a sample so that the sample is as representative of the whole population as possible. Comparative designs involve the retrieval of evidence that is recorded in the flow of current or past events in different times or places and the interpretation and analysis of this evidence. Ethnographic designs, also known as participant-observation designs, involve a researcher in intensive and direct contact with a group, community, or population being studied, through participation, observation, and extended interviewing.
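The two distinct roles of randomization described above can be shown in a few lines. An illustration with a made-up population, not from the report:

```python
# Randomization in survey design (selection) versus experimental design
# (assignment), in miniature.
import random

random.seed(0)
population = list(range(1000))        # hypothetical sampling frame

# Survey design: randomize *selection*, so the sample is as representative
# of the whole population as possible.
sample = random.sample(population, k=50)

# Experimental design: randomize *assignment*, so treatment status is the
# only systematic difference between the groups being compared.
random.shuffle(sample)
treatment, control = sample[:25], sample[25:]
print(len(treatment), len(control))
```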
Laboratory experiments.
Laboratory experiments underlie most of the work reported in Chapter 1, significant parts of Chapter 2, and some of the newest lines of research in Chapter 3. Laboratory experiments extend and adapt classical methods of design first developed, for the most part, in the physical and life sciences and agricultural research. Their main feature is the systematic and independent manipulation of a few variables and the strict control or randomization of all other variables that might affect the phenomenon under study. For example, some studies of animal motivation involve the systematic manipulation of amounts of food and feeding schedules while other factors that may also affect motivation, such as body weight, deprivation, and so on, are held constant. New designs are currently coming into play largely because of new analytic and computational methods (discussed below, in “Advances in Statistical Inference and Analysis”).
Two examples of empirically important issues that demonstrate the need for broadening classical experimental approaches are open-ended responses and lack of independence of successive experimental trials. The first concerns the design of research protocols that do not require the strict segregation of the events of an experiment into well-defined trials, but permit a subject to respond at will. These methods are needed when what is of interest is how the respondent chooses to allocate behavior in real time and across continuously available alternatives. Such empirical methods have long been used, but they can generate very subtle and difficult problems in experimental design and subsequent analysis. As theories of allocative behavior of all sorts become more sophisticated and precise, the experimental requirements become more demanding, so the need to better understand and solve this range of design issues is an outstanding challenge to methodological ingenuity.
The second issue arises in repeated-trial designs when the behavior on successive trials, even if it does not exhibit a secular trend (such as a learning curve), is markedly influenced by what has happened in the preceding trial or trials. The more naturalistic the experiment and the more sensitive the measurements taken, the more likely it is that such effects will occur. But such sequential dependencies in observations cause a number of important conceptual and technical problems in summarizing the data and in testing analytical models, which are not yet completely understood. In the absence of clear solutions, such effects are sometimes ignored by investigators, simplifying the data analysis but leaving residues of skepticism about the reliability and significance of the experimental results. With continuing development of sensitive measures in repeated-trial designs, there is a growing need for more advanced concepts and methods for dealing with experimental results that may be influenced by sequential dependencies.
The state of the art in randomized field experiments, in which different policies or procedures are tested in controlled trials under real conditions, has advanced dramatically over the past two decades. Problems that were once considered major methodological obstacles—such as implementing randomized field assignment to treatment and control groups and protecting the randomization procedure from corruption—have been largely overcome. While state-of-the-art standards are not achieved in every field experiment, the commitment to reaching them is rising steadily, not only among researchers but also among customer agencies and sponsors.
The health insurance experiment described in Chapter 2 is an example of a major randomized field experiment that has had and will continue to have important policy reverberations in the design of health care financing. Field experiments with the negative income tax (guaranteed minimum income) conducted in the 1970s were significant in policy debates, even before their completion, and provided the most solid evidence available on how tax-based income support programs and marginal tax rates can affect the work incentives and family structures of the poor. Important field experiments have also been carried out on alternative strategies for the prevention of delinquency and other criminal behavior, reform of court procedures, rehabilitative programs in mental health, family planning, and special educational programs, among other areas.
In planning field experiments, much hinges on the definition and design of the experimental cells, the particular combinations needed of treatment and control conditions for each set of demographic or other client sample characteristics, including specification of the minimum number of cases needed in each cell to test for the presence of effects. Considerations of statistical power, client availability, and the theoretical structure of the inquiry enter into such specifications. Current important methodological thresholds are to find better ways of predicting recruitment and attrition patterns in the sample, of designing experiments that will be statistically robust in the face of problematic sample recruitment or excessive attrition, and of ensuring appropriate acquisition and analysis of data on the attrition component of the sample.
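The power considerations mentioned above can be illustrated with a standard sample-size calculation. The sketch below uses the statsmodels library; the effect size and error rates are illustrative assumptions, not values from any study discussed here.

```python
# Minimal power-analysis sketch: how many cases per cell are needed to
# detect an assumed effect in a two-cell (treatment vs. control) design?
from statsmodels.stats.power import tt_ind_solve_power

n_per_cell = tt_ind_solve_power(
    effect_size=0.3,        # assumed standardized effect (Cohen's d)
    alpha=0.05,             # tolerated false-positive rate
    power=0.8,              # desired probability of detecting the effect
    alternative="two-sided",
)
print(round(n_per_cell))    # about 176 cases per cell
```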
Also of major significance are improvements in integrating detailed process and outcome measurements in field experiments. Conducting research on program effects under field conditions requires continual monitoring to determine exactly what is being done—the process—and how it corresponds to what was projected at the outset. Relatively unintrusive, inexpensive, and effective implementation measures are of great interest. There is, in parallel, a growing emphasis on designing experiments to evaluate distinct program components in contrast to summary measures of net program effects.
Finally, there is an important opportunity now for further theoretical work to model organizational processes in social settings and to design and select outcome variables that, in the relatively short time of most field experiments, can predict longer-term effects: For example, in job-training programs, what are the effects on the community (role models, morale, referral networks) or on individual skills, motives, or knowledge levels that are likely to translate into sustained changes in career paths and income levels?
Many people have opinions about how societal mores, economic conditions, and social programs shape lives and encourage or discourage various kinds of behavior. People generalize from their own cases, and from the groups to which they belong, about such matters as how much it costs to raise a child, the extent to which unemployment contributes to divorce, and so on. In fact, however, effects vary so much from one group to another that homespun generalizations are of little use. Fortunately, behavioral and social scientists have been able to bridge the gaps between personal perspectives and collective realities by means of survey research. In particular, governmental information systems include volumes of extremely valuable survey data, and the facility of modern computers to store, disseminate, and analyze such data has significantly improved empirical tests and led to new understandings of social processes.
Within this category of research designs, two major types are distinguished: repeated cross-sectional surveys and longitudinal panel surveys. In addition, and cross-cutting these types, there is a major effort under way to improve and refine the quality of survey data by investigating features of human memory and of question formation that affect survey response.
Repeated cross-sectional designs can either attempt to measure an entire population—as does the oldest U.S. example, the national decennial census—or they can rest on samples drawn from a population. The general principle is to take independent samples at two or more times, measuring the variables of interest, such as income levels, housing plans, or opinions about public affairs, in the same way. The General Social Survey, collected by the National Opinion Research Center with National Science Foundation support, is a repeated cross-sectional data base that was begun in 1972. One methodological question of particular salience in such data is how to adjust for nonresponses and “don’t know” responses. Another is how to deal with self-selection bias. For example, to compare the earnings of women and men in the labor force, it would be mistaken to first assume that the two samples of labor-force participants are randomly selected from the larger populations of men and women; instead, one has to consider and incorporate in the analysis the factors that determine who is in the labor force.
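The self-selection problem can be seen in a small simulation. The sketch below fabricates data in which one unobserved factor drives both labor-force participation and wage offers; comparing earnings only among participants then overstates the population mean. All numbers are hypothetical.

```python
# Simulation sketch of self-selection bias: conditioning on labor-force
# participation distorts the comparison when an unobserved factor
# affects both participation and earnings.
import numpy as np

rng = np.random.default_rng(seed=1)
n = 100_000
ability = rng.normal(size=n)                        # unobserved factor
wage_offer = 10 + 2 * ability + rng.normal(size=n)
participates = (ability + rng.normal(size=n)) > 0   # selection rule

print(wage_offer.mean())                # ~10: population mean offer
print(wage_offer[participates].mean())  # >10: selected sample overstates it
```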
In longitudinal panels, a sample is drawn at one point in time and the relevant variables are measured at this and subsequent times for the same people. In more complex versions, some fraction of each panel may be replaced or added to periodically, such as expanding the sample to include households formed by the children of the original sample. An example of panel data developed in this way is the Panel Study of Income Dynamics (PSID), conducted by the University of Michigan since 1968 (discussed in Chapter 3).
Comparing the fertility or income of different people in different circumstances at the same time to find correlations always leaves a large proportion of the variability unexplained, but common sense suggests that much of the unexplained variability is actually explicable. There are systematic reasons for individual outcomes in each person’s past achievements, in parental models, upbringing, and earlier sequences of experiences. Unfortunately, asking people about the past is not particularly helpful: people remake their views of the past to rationalize the present and so retrospective data are often of uncertain validity. In contrast, generation-long longitudinal data allow readings on the sequence of past circumstances uncolored by later outcomes. Such data are uniquely useful for studying the causes and consequences of naturally occurring decisions and transitions. Thus, as longitudinal studies continue, quantitative analysis is becoming feasible for such questions as: How are the decisions of individuals affected by parental experience? Which aspects of early decisions constrain later opportunities? And how does detailed background experience leave its imprint? Studies like the two-decade-long PSID are bringing within grasp a complete generational cycle of detailed data on fertility, work life, household structure, and income.
Large-scale longitudinal data collection projects are uniquely valuable as vehicles for testing and improving survey research methodology. In ways that lie beyond the scope of a cross-sectional survey, longitudinal studies can sometimes be designed—without significant detriment to their substantive interests—to facilitate the evaluation and upgrading of data quality; the analysis of relative costs and effectiveness of alternative techniques of inquiry; and the standardization or coordination of solutions to problems of method, concept, and measurement across different research domains.
Some areas of methodological improvement include discoveries about the impact of interview mode on response (mail, telephone, face-to-face); the effects of nonresponse on the representativeness of a sample (due to respondents’ refusal or interviewers’ failure to contact); the effects on behavior of continued participation over time in a sample survey; the value of alternative methods of adjusting for nonresponse and incomplete observations (such as imputation of missing data, variable case weighting); the impact on response of specifying different recall periods, varying the intervals between interviews, or changing the length of interviews; and the comparison and calibration of results obtained by longitudinal surveys, randomized field experiments, laboratory studies, onetime surveys, and administrative records.
It should be especially noted that incorporating improvements in methodology and data quality has been and will no doubt continue to be crucial to the growing success of longitudinal studies. Panel designs are intrinsically more vulnerable than other designs to statistical biases due to cumulative item nonresponse, sample attrition, time-in-sample effects, and error margins in repeated measures, all of which may produce exaggerated estimates of change. Over time, a panel that was initially representative may become much less representative of a population, not only because of attrition in the sample, but also because of changes in immigration patterns, age structure, and the like. Longitudinal studies are also subject to changes in scientific and societal contexts that may create uncontrolled drifts over time in the meaning of nominally stable questions or concepts as well as in the underlying behavior. Also, a natural tendency to expand the range of topics, and thus the length of interviews, over time increases the burden on respondents and may lead to deterioration of data quality or relevance. Careful methodological research to understand and overcome these problems has been done, and continued work as a component of new longitudinal studies is certain to advance the overall state of the art.
Longitudinal studies are sometimes pressed for evidence they are not designed to produce: for example, in important public policy questions concerning the impact of government programs in such areas as health promotion, disease prevention, or criminal justice. By using research designs that combine field experiments (with randomized assignment to program and control conditions) and longitudinal surveys, one can capitalize on the strongest merits of each: the experimental component provides stronger evidence for causal statements that are critical for evaluating programs and for illuminating some fundamental theories; the longitudinal component helps in the estimation of long-term program effects and their attenuation. Coupling experiments to ongoing longitudinal studies is not often feasible, given the multiple constraints of not disrupting the survey, developing all the complicated arrangements that go into a large-scale field experiment, and having the populations of interest overlap in useful ways. Yet opportunities to join field experiments to surveys are of great importance. Coupled studies can produce vital knowledge about the empirical conditions under which the results of longitudinal surveys turn out to be similar to—or divergent from—those produced by randomized field experiments. A pattern of divergence and similarity has begun to emerge in coupled studies; additional cases are needed to understand why some naturally occurring social processes and longitudinal design features seem to approximate formal random allocation and others do not. The methodological implications of such new knowledge go well beyond program evaluation and survey research. These findings bear directly on the confidence scientists—and others—can have in conclusions from observational studies of complex behavioral and social processes, particularly ones that cannot be controlled or simulated within the confines of a laboratory environment.
A very important opportunity to improve survey methods lies in the reduction of nonsampling error due to questionnaire context, phrasing of questions, and, generally, the semantic and social-psychological aspects of surveys. Survey data are particularly affected by the fallibility of human memory and the sensitivity of respondents to the framework in which a question is asked. This sensitivity is especially strong for certain types of attitudinal and opinion questions. Efforts are now being made to bring survey specialists into closer contact with researchers working on memory function, knowledge representation, and language in order to uncover and reduce this kind of error.
Memory for events is often inaccurate, biased toward what respondents believe to be true—or should be true—about the world. In many cases in which data are based on recollection, improvements can be achieved by shifting to techniques of structured interviewing and calibrated forms of memory elicitation, such as specifying recent, brief time periods (for example, in the last seven days) within which respondents recall certain types of events with acceptable accuracy.
A well-known example involves a general question about overall happiness and a more specific question about marital happiness. Presenting this pair in both orders on different forms showed that the order affected answers to the general happiness question but did not change the marital happiness question: responses to the specific issue swayed subsequent responses to the general one, but not vice versa. The explanations for and implications of such order effects on the many kinds of questions and sequences that can be used are not simple matters. Further experimentation on the design of survey instruments promises not only to improve the accuracy and reliability of survey research, but also to advance understanding of how people think about and evaluate their behavior from day to day.
Both experiments and surveys involve interventions or questions by the scientist, who then records and analyzes the responses. In contrast, many bodies of social and behavioral data of considerable value are originally derived from records or collections that have accumulated for various nonscientific reasons, quite often administrative in nature, in firms, churches, military organizations, and governments at all levels. Data of this kind can sometimes be subjected to careful scrutiny, summary, and inquiry by historians and social scientists, and statistical methods have increasingly been used to develop and evaluate inferences drawn from such data. Some of the main comparative approaches are cross-national aggregate comparisons, selective comparison of a limited number of cases, and historical case studies.
Among the more striking problems facing the scientist using such data are the vast differences in what has been recorded by different agencies whose behavior is being compared (this is especially true for parallel agencies in different nations), the highly unrepresentative or idiosyncratic sampling that can occur in the collection of such data, and the selective preservation and destruction of records. Means to overcome these problems form a substantial methodological research agenda in comparative research. An example of the method of cross-national aggregate comparisons is found in investigations by political scientists and sociologists of the factors that underlie differences in the vitality of institutions of political democracy in different societies. Some investigators have stressed the existence of a large middle class, others the level of education of a population, and still others the development of systems of mass communication. In cross-national aggregate comparisons, a large number of nations are arrayed according to some measures of political democracy and then attempts are made to ascertain the strength of correlations between these and the other variables. In this line of analysis it is possible to use a variety of statistical cluster and regression techniques to isolate and assess the possible impact of certain variables on the institutions under study. While this kind of research is cross-sectional in character, statements about historical processes are often invoked to explain the correlations.
More limited selective comparisons, applied by many of the classic theorists, involve asking similar kinds of questions but over a smaller range of societies. Why did democracy develop in such different ways in America, France, and England? Why did northeastern Europe develop rational bourgeois capitalism, in contrast to the Mediterranean and Asian nations? Modern scholars have turned their attention to explaining, for example, differences among types of fascism between the two World Wars, and similarities and differences among modern state welfare systems, using these comparisons to unravel the salient causes. The questions asked in these instances are inevitably historical ones.
Historical case studies involve only one nation or region, and so they may not be geographically comparative. However, insofar as they involve tracing the transformation of a society’s major institutions and the role of its main shaping events, they involve a comparison of different periods of a nation’s or a region’s history. The goal of such comparisons is to give a systematic account of the relevant differences. Sometimes, particularly with respect to the ancient societies, the historical record is very sparse, and the methods of history and archaeology mesh in the reconstruction of complex social arrangements and patterns of change on the basis of few fragments.
Like all research designs, comparative ones have distinctive vulnerabilities and advantages: One of the main advantages of using comparative designs is that they greatly expand the range of data, as well as the amount of variation in those data, for study. Consequently, they allow for more encompassing explanations and theories that can relate highly divergent outcomes to one another in the same framework. They also contribute to reducing any cultural biases or tendencies toward parochialism among scientists studying common human phenomena.
One main vulnerability in such designs arises from the problem of achieving comparability. Because comparative study involves studying societies and other units that are dissimilar from one another, the phenomena under study usually occur in very different contexts—so different that in some cases what is called an event in one society cannot really be regarded as the same type of event in another. For example, a vote in a Western democracy is different from a vote in an Eastern bloc country, and a voluntary vote in the United States means something different from a compulsory vote in Australia. These circumstances make for interpretive difficulties in comparing aggregate rates of voter turnout in different countries.
The problem of achieving comparability appears in historical analysis as well. For example, changes in laws and enforcement and recording procedures over time change the definition of what is and what is not a crime, and for that reason it is difficult to compare the crime rates over time. Comparative researchers struggle with this problem continually, working to fashion equivalent measures; some have suggested the use of different measures (voting, letters to the editor, street demonstration) in different societies for common variables (political participation), to try to take contextual factors into account and to achieve truer comparability.
A second vulnerability is controlling variation. Traditional experiments make conscious and elaborate efforts to control the variation of some factors and thereby assess the causal significance of others. In surveys as well as experiments, statistical methods are used to control sources of variation and assess suspected causal significance. In comparative and historical designs, this kind of control is often difficult to attain because the sources of variation are many and the number of cases few. Scientists have made efforts to approximate such control in these cases of “many variables, small N.” One such method is paired comparison. If an investigator isolates 15 American cities in which racial violence has been recurrent in the past 30 years, for example, it is helpful to match them with 15 cities of similar population size, geographical region, and size of minorities—such characteristics serve as controls—and then search for systematic differences between the two sets of cities. Another method is to select, for comparative purposes, a sample of societies that resemble one another in certain critical ways, such as size, common language, and common level of development, thus attempting to hold these factors roughly constant, and then seeking explanations among other factors in which the sampled societies differ from one another.
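A minimal sketch of the paired-comparison idea, with randomly generated stand-ins for the control characteristics (population size, region, minority share), might look as follows.

```python
# For each "violent" city, find the most similar comparison city on the
# standardized control covariates (nearest neighbor, without replacement).
import numpy as np

rng = np.random.default_rng(seed=2)
violent = rng.normal(size=(15, 3))     # 15 cities x 3 control covariates
candidates = rng.normal(size=(200, 3))

matches = []
for city in violent:
    d = np.linalg.norm(candidates - city, axis=1)  # distance on controls
    j = int(np.argmin(d))
    matches.append(j)
    candidates[j] = np.inf             # remove the matched city from the pool

print(matches)   # indices of the 15 matched comparison cities
```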
Traditionally identified with anthropology, ethnographic research designs are playing increasingly significant roles in most of the behavioral and social sciences. The core of this methodology is participant-observation, in which a researcher spends an extended period of time with the group under study, ideally mastering the local language, dialect, or special vocabulary, and participating in as many activities of the group as possible. This kind of participant-observation is normally coupled with extensive open-ended interviewing, in which people are asked to explain in depth the rules, norms, practices, and beliefs through which (from their point of view) they conduct their lives. A principal aim of ethnographic study is to discover the premises on which those rules, norms, practices, and beliefs are built.
The use of ethnographic designs by anthropologists has contributed significantly to the building of knowledge about social and cultural variation. And while these designs continue to center on certain long-standing features—extensive face-to-face experience in the community, linguistic competence, participation, and open-ended interviewing—there are newer trends in ethnographic work. One major trend concerns its scale. Ethnographic methods were originally developed largely for studying small-scale groupings known variously as village, folk, primitive, preliterate, or simple societies. Over the decades, these methods have increasingly been applied to the study of small groups and networks within modern (urban, industrial, complex) society, including the contemporary United States. The typical subjects of ethnographic study in modern society are small groups or relatively small social networks, such as outpatient clinics, medical schools, religious cults and churches, ethnically distinctive urban neighborhoods, corporate offices and factories, and government bureaus and legislatures.
As anthropologists moved into the study of modern societies, researchers in other disciplines—particularly sociology, psychology, and political science—began using ethnographic methods to enrich and focus their own insights and findings. At the same time, studies of large-scale structures and processes have been aided by the use of ethnographic methods, since most large-scale changes work their way into the fabric of community, neighborhood, and family, affecting the daily lives of people. Ethnographers have studied, for example, the impact of new industry and new forms of labor in “backward” regions; the impact of state-level birth control policies on ethnic groups; and the impact on residents in a region of building a dam or establishing a nuclear waste dump. Ethnographic methods have also been used to study a number of social processes that lend themselves to their particular techniques of observation and interview—processes such as the formation of class and racial identities, bureaucratic behavior, legislative coalitions and outcomes, and the formation and shifting of consumer tastes.
Advances in structured interviewing (see above) have proven especially powerful in the study of culture. Techniques for understanding kinship systems, concepts of disease, color terminologies, ethnobotany, and ethnozoology have been radically transformed and strengthened by coupling new interviewing methods with modern measurement and scaling techniques (see below). These techniques have made possible more precise comparisons among cultures and identification of the most competent and expert persons within a culture. The next step is to extend these methods to study the ways in which networks of propositions (such as “boys like sports,” “girls like babies”) are organized to form belief systems. Much evidence suggests that people typically represent the world around them by means of relatively complex cognitive models that involve interlocking propositions. The techniques of scaling have been used to develop models of how people categorize objects, and they have great potential for further development, to analyze data pertaining to cultural propositions.
Perhaps the most fruitful area for the application of ethnographic methods in recent years has been the systematic study of ideologies in modern society. Earlier studies of ideology were in small-scale societies that were rather homogeneous. In these studies researchers could report on a single culture, a uniform system of beliefs and values for the society as a whole. Modern societies are much more diverse, in both the origins and the number of their subcultures, which relate to different regions, communities, occupations, or ethnic groups. Yet these subcultures and ideologies share certain underlying assumptions or at least must find some accommodation with the dominant value and belief systems in the society.
The challenge is to incorporate this greater complexity of structure and process into systematic descriptions and interpretations. One line of work carried out by researchers has tried to track the ways in which ideologies are created, transmitted, and shared among large populations that have traditionally lacked the social mobility and communications technologies of the West. This work has concentrated on large-scale civilizations such as China, India, and Central America. Gradually, the focus has generalized into a concern with the relationship between the great traditions—the central lines of cosmopolitan Confucian, Hindu, or Mayan culture, including aesthetic standards, irrigation technologies, medical systems, cosmologies and calendars, legal codes, poetic genres, and religious doctrines and rites—and the little traditions, those identified with rural, peasant communities. How are the ideological doctrines and cultural values of the urban elites, the great traditions, transmitted to local communities? How are the little traditions, the ideas from the more isolated, less literate, and politically weaker groups in society, transmitted to the elites?
India and southern Asia have been fruitful areas for ethnographic research on these questions. The great Hindu tradition was present in virtually all local contexts through the presence of high-caste individuals in every community. It operated as a pervasive standard of value for all members of society, even in the face of strong little traditions. The situation is surprisingly akin to that of modern, industrialized societies. The central research questions are the degree and the nature of penetration of dominant ideology, even in groups that appear marginal and subordinate and have no strong interest in sharing the dominant value system. In this connection the lowest and poorest occupational caste—the untouchables—serves as an ultimate test of the power of ideology and cultural beliefs to unify complex hierarchical social systems.
Another current trend in ethnographic methods is their convergence with archival methods. One joining point is the application of descriptive and interpretative procedures used by ethnographers to reconstruct the cultures that created historical documents, diaries, and other records, to “interview” history, so to speak. For example, a revealing study showed how the Inquisition in the Italian countryside between the 1570s and 1640s gradually worked subtle changes in an ancient fertility cult in peasant communities; the peasant beliefs and rituals assimilated many elements of witchcraft after learning them from their persecutors. A good deal of social history—particularly that of the family—has drawn on discoveries made in the ethnographic study of primitive societies. As described in Chapter 4, this particular line of inquiry rests on a marriage of ethnographic, archival, and demographic approaches.
Other lines of ethnographic work have focused on the historical dimensions of nonliterate societies. A strikingly successful example of this kind of effort is a study of head-hunting. By combining an interpretation of local oral tradition with the fragmentary observations that were made by outside observers (such as missionaries, traders, colonial officials), historical fluctuations in the rate and significance of head-hunting were shown to be partly in response to such international forces as the Great Depression and World War II. Researchers are also investigating the ways in which various groups in contemporary societies invent versions of traditions that may or may not reflect the actual history of the group. This process has been observed among elites seeking political and cultural legitimation and among hard-pressed minorities (for example, the Basques in Spain, the Welsh in Great Britain) seeking roots and political mobilization in a larger society.
Ethnography is a powerful method to record, describe, and interpret the system of meanings held by groups and to discover how those meanings affect the lives of group members. It is a method well adapted to the study of situations in which people interact with one another and the researcher can interact with them as well, so that information about meanings can be evoked and observed. Ethnography is especially suited to exploration and elucidation of unsuspected connections; ideally, it is used in combination with other methods—experimental, survey, or comparative—to establish with precision the relative strengths and weaknesses of such connections. By the same token, experimental, survey, and comparative methods frequently yield connections, the meaning of which is unknown; ethnographic methods are a valuable way to determine them.
The objective of any science is to uncover the structure and dynamics of the phenomena that are its subject, as they are exhibited in the data. Scientists continuously try to describe possible structures and ask whether the data can, with allowance for errors of measurement, be described adequately in terms of them. Over a long time, various families of structures have recurred throughout many fields of science; these structures have become objects of study in their own right, principally by statisticians, other methodological specialists, applied mathematicians, and philosophers of logic and science. Methods have evolved to evaluate the adequacy of particular structures to account for particular types of data. In the interest of clarity we discuss these structures in this section and the analytical methods used for estimation and evaluation of them in the next section, although in practice they are closely intertwined.
A good deal of mathematical and statistical modeling attempts to describe the relations, both structural and dynamic, that hold among variables that are presumed to be representable by numbers. Such models are applicable in the behavioral and social sciences only to the extent that appropriate numerical measurement can be devised for the relevant variables. In many studies the phenomena in question and the raw data obtained are not intrinsically numerical, but qualitative, such as ethnic group identifications. The identifying numbers used to code such questionnaire categories for computers are no more than labels, which could just as well be letters or colors. One key question is whether there is some natural way to move from the qualitative aspects of such data to a structural representation that involves one of the well-understood numerical or geometric models or whether such an attempt would be inherently inappropriate for the data in question. The decision as to whether or not particular empirical data can be represented in particular numerical or more complex structures is seldom simple, and strong intuitive biases or a priori assumptions about what can and cannot be done may be misleading.
Recent decades have seen rapid and extensive development and application of analytical methods attuned to the nature and complexity of social science data. Examples of nonnumerical modeling are increasing. Moreover, the widespread availability of powerful computers is probably leading to a qualitative revolution: it affects not only the ability to compute numerical solutions to numerical models, but also the ability to work out the consequences of all sorts of structures that do not involve numbers at all. The following discussion gives some indication of the richness of past progress and of future prospects, although it is by necessity far from exhaustive.
In describing some of the areas of new and continuing research, we have organized this section on the basis of whether the representations are fundamentally probabilistic or not. A further useful distinction is between representations of data that are highly discrete or categorical in nature (such as whether a person is male or female) and those that are continuous in nature (such as a person’s height). Of course, there are intermediate cases involving both types of variables, such as color stimuli that are characterized by discrete hues (red, green) and a continuous luminance measure. Probabilistic models lead very naturally to questions of estimation and statistical evaluation of the correspondence between data and model. Those that are not probabilistic involve additional problems of dealing with and representing sources of variability that are not explicitly modeled. At the present time, scientists understand some aspects of structure, such as geometries, and some aspects of randomness, as embodied in probability models, but do not yet adequately understand how to put the two together in a single unified model. Table 5-1 outlines the way we have organized this discussion and shows where the examples in this section lie.
Table 5-1. A Classification of Structural Models.
Some behavioral and social sciences variables appear to be more or less continuous, for example, utility of goods, loudness of sounds, or risk associated with uncertain alternatives. Many other variables, however, are inherently categorical, often with only two or a few values possible: for example, whether a person is in or out of school, is employed or not employed, or identifies with a major political party or political ideology. And some variables, such as moral attitudes, are typically measured in research with survey questions that allow only categorical responses. Much of the early probability theory was formulated only for continuous variables; its use with categorical variables was not really justified, and in some cases it may have been misleading. Recently, very significant advances have been made in how to deal explicitly with categorical variables. This section first describes several contemporary approaches to models involving categorical variables, followed by ones involving continuous representations.
Many recent models for analyzing categorical data of the kind usually displayed as counts (cell frequencies) in multidimensional contingency tables are subsumed under the general heading of log-linear models, that is, linear models in the natural logarithms of the expected counts in each cell in the table. These recently developed forms of statistical analysis allow one to partition variability due to various sources in the distribution of categorical attributes, and to isolate the effects of particular variables or combinations of them.
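As a concrete (and deliberately tiny) illustration, a log-linear model of independence for a two-way contingency table can be fit as a Poisson regression on the cell counts. The sketch below uses the statsmodels library; the counts and factor names are fabricated.

```python
# Log-linear model sketch: main-effects-only (independence) model for a
# 2 x 3 table of counts. Adding an origin:destination interaction term
# would capture association between the two factors.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

cells = pd.DataFrame({
    "origin":      ["low", "low", "low", "high", "high", "high"],
    "destination": ["low", "mid", "high"] * 2,
    "count":       [120, 60, 20, 30, 70, 100],
})
model = smf.glm("count ~ origin + destination", data=cells,
                family=sm.families.Poisson()).fit()
print(model.summary())
```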
Present log-linear models were first developed and used by statisticians and sociologists and then found extensive application in other social and behavioral sciences disciplines. When applied, for instance, to the analysis of social mobility, such models separate factors of occupational supply and demand from other factors that impede or propel movement up and down the social hierarchy. With such models, for example, researchers discovered the surprising fact that occupational mobility patterns are strikingly similar in many nations of the world (even among disparate nations like the United States and most of the Eastern European socialist countries), and from one time period to another, once allowance is made for differences in the distributions of occupations. The log-linear and related kinds of models have also made it possible to identify and analyze systematic differences in mobility among nations and across time. As another example of applications, psychologists and others have used log-linear models to analyze attitudes and their determinants and to link attitudes to behavior. These methods have also diffused to and been used extensively in the medical and biological sciences.
Models that permit one variable to be explained or predicted by means of others, called regression models, are the workhorses of much applied statistics; this is especially true when the dependent (explained) variable is continuous. For a two-valued dependent variable, such as alive or dead, models and approximate theory and computational methods for one explanatory variable were developed in biometry about 50 years ago. Computer programs able to handle many explanatory variables, continuous or categorical, are readily available today. Even now, however, the accuracy of the approximate theory on given data is an open question.
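For the two-valued case, the standard tool today is logistic regression. A minimal sketch with simulated data (not drawn from any study cited here):

```python
# Logistic regression sketch: a binary outcome explained by two
# continuous variables; the fit recovers the coefficients used to
# simulate the data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=3)
n = 1_000
X = sm.add_constant(rng.normal(size=(n, 2)))
p = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0, 0.7]))))
y = rng.binomial(1, p)                       # e.g., alive (1) or dead (0)

print(sm.Logit(y, X).fit(disp=0).params)     # roughly (-0.5, 1.0, 0.7)
```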
Using classical utility theory, economists have developed discrete choice models that turn out to be somewhat related to the log-linear and categorical regression models. Models for limited dependent variables, especially those that cannot take on values above or below a certain level (such as weeks unemployed, number of children, and years of schooling) have been used profitably in economics and in some other areas. For example, censored normal variables (called tobits in economics), in which observed values outside certain limits are simply counted, have been used in studying decisions to go on in school. It will require further research and development to incorporate information about limited ranges of variables fully into the main multivariate methodologies. In addition, with respect to the assumptions about distribution and functional form conventionally made in discrete response models, some new methods are now being developed that show promise of yielding reliable inferences without making unrealistic assumptions; further research in this area promises significant progress.
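The censored-normal (tobit) idea can be written down directly as a likelihood: uncensored observations contribute a normal density, censored ones the probability of falling at or below the limit. A minimal maximum-likelihood sketch with simulated data and scipy:

```python
# Tobit sketch: dependent variable censored below at zero, fit by
# maximizing the censored-normal log-likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(seed=4)
n = 2_000
x = rng.normal(size=n)
y_star = 1.0 + 2.0 * x + rng.normal(size=n)   # latent outcome
y = np.maximum(y_star, 0.0)                   # observed, censored at zero
cens = y_star <= 0

def neg_loglik(theta):
    a, b, log_s = theta
    s = np.exp(log_s)
    mu = a + b * x
    ll_obs = norm.logpdf((y - mu) / s) - np.log(s)  # uncensored points
    ll_cen = norm.logcdf((0.0 - mu) / s)            # censored points
    return -(ll_obs[~cens].sum() + ll_cen[cens].sum())

fit = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print(fit.x[:2], np.exp(fit.x[2]))   # roughly (1.0, 2.0) and sigma near 1
```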
One problem arises from the fact that many of the categorical variables collected by the major data bases are ordered. For example, attitude surveys frequently use a 3-, 5-, or 7-point scale (from high to low) without specifying numerical intervals between levels. Social class and educational levels are often described by ordered categories. Ignoring order information, which many traditional statistical methods do, may be inefficient or inappropriate, but replacing the categories by successive integers or other arbitrary scores may distort the results. (For additional approaches to this question, see sections below on ordered structures.) Regression-like analysis of ordinal categorical variables is quite well developed, but their multivariate analysis needs further research. New log-bilinear models have been proposed, but to date they deal specifically with only two or three categorical variables. Additional research extending the new models, improving computational algorithms, and integrating the models with work on scaling promise to lead to valuable new knowledge.
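One widely used approach that respects order information without assigning arbitrary scores is the ordered-logit model. The sketch below assumes the OrderedModel class available in recent statsmodels releases; the 5-point scale data are simulated.

```python
# Ordered-logit sketch: a 5-category attitude scale driven by one
# explanatory variable; only the category labels are given to the model.
import numpy as np
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(seed=5)
n = 2_000
x = rng.normal(size=n)
latent = 1.2 * x + rng.logistic(size=n)
y = np.digitize(latent, bins=[-2.0, -0.7, 0.7, 2.0])   # categories 0..4

fit = OrderedModel(y, x[:, None], distr="logit").fit(method="bfgs", disp=0)
print(fit.params)   # slope near 1.2, plus estimated threshold parameters
```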
Event-history studies yield the sequence of events that respondents to a survey sample experience over a period of time; for example, the timing of marriage, childbearing, or labor force participation. Event-history data can be used to study educational progress, demographic processes (migration, fertility, and mortality), mergers of firms, labor market behavior, and even riots, strikes, and revolutions. As interest in such data has grown, many researchers have turned to models that pertain to changes in probabilities over time to describe when and how individuals move among a set of qualitative states.
Much of the progress in models for event-history data builds on recent developments in statistics and biostatistics for life-time, failure-time, and hazard models. Such models permit the analysis of qualitative transitions in a population whose members are undergoing partially random organic deterioration, mechanical wear, or other risks over time. With the increased complexity of event-history data that are now being collected, and the extension of event-history data bases over very long periods of time, new problems arise that cannot be effectively handled by older types of analysis. Among the problems are repeated transitions, such as between unemployment and employment or marriage and divorce; more than one time variable (such as biological age, calendar time, duration in a stage, and time exposed to some specified condition); latent variables (variables that are explicitly modeled even though not observed); gaps in the data; sample attrition that is not randomly distributed over the categories; and respondent difficulties in recalling the exact timing of events.
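A minimal hazard-model sketch: a Cox proportional-hazards regression on simulated spell data with right-censoring. This uses the third-party lifelines package, a tool choice assumed here rather than one named in the text.

```python
# Cox proportional-hazards sketch: time until an event (e.g., leaving
# unemployment) as a function of a covariate, with censored spells.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(seed=6)
n = 1_000
x = rng.normal(size=n)                            # covariate, e.g., schooling
event_time = rng.exponential(scale=np.exp(-0.8 * x))
censor_time = rng.exponential(scale=1.0, size=n)

df = pd.DataFrame({
    "duration": np.minimum(event_time, censor_time),
    "event": (event_time <= censor_time).astype(int),   # 0 = censored
    "x": x,
})
fit = CoxPHFitter().fit(df, duration_col="duration", event_col="event")
fit.print_summary()   # coefficient on x near 0.8 (higher x, higher hazard)
```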
For a variety of reasons, researchers typically use multiple measures (or multiple indicators) to represent theoretical concepts. Sociologists, for example, often rely on two or more variables (such as occupation and education) to measure an individual’s socioeconomic position; educational psychologists ordinarily measure a student’s ability with multiple test items. Although the basic observations are categorical, in a number of applications they are interpreted as a partitioning of something continuous. For example, in test theory one thinks of the measures of both item difficulty and respondent ability as continuous variables, possibly multidimensional in character.
Classical test theory and newer item-response theories in psychometrics deal with the extraction of information from multiple measures. Testing, which is a major source of data in education and other areas, results in millions of test items stored in archives each year for purposes ranging from college admissions to job-training programs for industry. One goal of research on such test data is to be able to make comparisons among persons or groups even when different test items are used. Although the information collected from each respondent is intentionally incomplete in order to keep the tests short and simple, item-response techniques permit researchers to reconstitute the fragments into an accurate picture of overall group proficiencies. These new methods provide a better theoretical handle on individual differences, and they are expected to be extremely important in developing and using tests. For example, they have been used in attempts to equate different forms of a test given in successive waves during a year, a procedure made necessary in large-scale testing programs by legislation requiring disclosure of test-scoring keys at the time results are given.
An example of the use of item-response theory in a significant research effort is the National Assessment of Educational Progress (NAEP). The goal of this project is to provide accurate, nationally representative information on the average (rather than individual) proficiency of American children in a wide variety of academic subjects as they progress through elementary and secondary school. This approach is an improvement over the use of trend data on university entrance exams, because NAEP estimates of academic achievements (by broad characteristics such as age, grade, region, ethnic background, and so on) are not distorted by the self-selected character of those students who seek admission to college, graduate, and professional programs.
Item-response theory also forms the basis of many new psychometric instruments, known as computerized adaptive testing, currently being implemented by the U.S. military services and under additional development in many testing organizations. In adaptive tests, a computer program selects items for each examinee based upon the examinee’s success with previous items. Generally, each person gets a slightly different set of items and the equivalence of scale scores is established by using item-response theory. Adaptive testing can greatly reduce the number of items needed to achieve a given level of measurement accuracy.
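The core mechanics can be sketched with the simplest item-response model, the one-parameter (Rasch) model, together with a crude item-selection and ability-update rule; operational systems use maximum-likelihood or Bayesian updating rather than the fixed steps used here.

```python
# Adaptive-testing sketch under a Rasch model: pick the unused item whose
# difficulty is closest to the current ability estimate, then nudge the
# estimate up or down according to the answer.
import numpy as np

def p_correct(theta, b):
    """Rasch model: P(correct) given ability theta and difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

difficulties = np.linspace(-3, 3, 25)   # hypothetical item bank
rng = np.random.default_rng(seed=7)
true_theta, theta_hat = 1.0, 0.0
asked = []

for _ in range(10):
    unused = [i for i in range(len(difficulties)) if i not in asked]
    item = min(unused, key=lambda i: abs(difficulties[i] - theta_hat))
    asked.append(item)
    correct = rng.random() < p_correct(true_theta, difficulties[item])
    theta_hat += 0.5 if correct else -0.5   # crude fixed-step update

print(theta_hat)   # rough estimate of true_theta = 1.0
```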
Virtually all statistical models now in use impose a linearity or additivity assumption of some kind, sometimes after a nonlinear transformation of variables. Imposing these forms on relationships that do not, in fact, possess them may well result in false descriptions and spurious effects. Unwary users, especially of computer software packages, can easily be misled. But more realistic nonlinear and nonadditive multivariate models are becoming available. Extensive use with empirical data is likely to force many changes and enhancements in such models and stimulate quite different approaches to nonlinear multivariate analysis in the next decade.
Geometric and algebraic models attempt to describe underlying structural relations among variables. In some cases they are part of a probabilistic approach, such as the algebraic models underlying regression or the geometric representations of correlations between items in a technique called factor analysis. In other cases, geometric and algebraic models are developed without explicitly modeling the element of randomness or uncertainty that is always present in the data. Although this latter approach to behavioral and social sciences problems has been less researched than the probabilistic one, there are some advantages in developing the structural aspects independent of the statistical ones. We begin the discussion with some inherently geometric representations and then turn to numerical representations for ordered data.
Although geometry is a huge mathematical topic, little of it seems directly applicable to the kinds of data encountered in the behavioral and social sciences. A major reason is that the primitive concepts normally used in geometry—points, lines, coincidence—do not correspond naturally to the kinds of qualitative observations usually obtained in behavioral and social sciences contexts. Nevertheless, since geometric representations are used to reduce bodies of data, there is a real need to develop a deeper understanding of when such representations of social or psychological data make sense. Moreover, there is a practical need to understand why geometric computer algorithms, such as those of multidimensional scaling, work as well as they apparently do. A better understanding of the algorithms will increase the efficiency and appropriateness of their use, which becomes increasingly important with the widespread availability of scaling programs for microcomputers.
Over the past 50 years several kinds of well-understood scaling techniques have been developed and widely used to assist in the search for appropriate geometric representations of empirical data. The whole field of scaling is now entering a critical juncture in terms of unifying and synthesizing what earlier appeared to be disparate contributions. Within the past few years it has become apparent that several major methods of analysis, including some that are based on probabilistic assumptions, can be unified under the rubric of a single generalized mathematical structure. For example, it has recently been demonstrated that such diverse approaches as nonmetric multidimensional scaling, principal-components analysis, factor analysis, correspondence analysis, and log-linear analysis have more in common in terms of underlying mathematical structure than had earlier been realized.
Nonmetric multidimensional scaling is a method that begins with data about the ordering established by subjective similarity (or nearness) between pairs of stimuli. The idea is to embed the stimuli into a metric space (that is, a geometry with a measure of distance between points) in such a way that distances between points corresponding to stimuli exhibit the same ordering as do the data. This method has been successfully applied to phenomena that, on other grounds, are known to be describable in terms of a specific geometric structure; such applications were used to validate the procedures. Such validation was done, for example, with respect to the perception of colors, which are known to be describable in terms of a particular three-dimensional structure known as the Euclidean color coordinates. Similar applications have been made with Morse code symbols and spoken phonemes. The technique is now used in some biological and engineering applications, as well as in some of the social sciences, as a method of data exploration and simplification.
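A minimal sketch with scikit-learn (an assumed tool choice): start from pairwise dissimilarities and ask for a two-dimensional configuration whose distances respect their ordering.

```python
# Nonmetric MDS sketch: embed 12 stimuli in the plane so that distances
# between embedded points reproduce the ordering of the dissimilarities.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(seed=8)
stimuli = rng.normal(size=(12, 2))      # hidden "true" configuration
dissim = squareform(pdist(stimuli))     # stand-in for judged similarity data

mds = MDS(n_components=2, metric=False,
          dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)
print(coords.shape)   # (12, 2): a configuration matching the orderings
```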
One question of interest is how to develop an axiomatic basis for various geometries using as a primitive concept an observable such as the subject’s ordering of the relative similarity of one pair of stimuli to another, which is the typical starting point of such scaling. The general task is to discover properties of the qualitative data sufficient to ensure that a mapping into the geometric structure exists and, ideally, to discover an algorithm for finding it. Some work of this general type has been carried out: for example, there is an elegant set of axioms based on laws of color matching that yields the three-dimensional vectorial representation of color space. But the more general problem of understanding the conditions under which the multidimensional scaling algorithms are suitable remains unsolved. In addition, work is needed on understanding more general, non-Euclidean spatial models.
One type of structure common throughout the sciences arises when an ordered dependent variable is affected by two or more ordered independent variables. This is the situation to which regression and analysis-of-variance models are often applied; it is also the structure underlying the familiar physical identities, in which physical units are expressed as products of the powers of other units (for example, energy has the unit of mass times the square of the unit of distance divided by the square of the unit of time).
There are many examples of these types of structures in the behavioral and social sciences. One example is the ordering of preference of commodity bundles—collections of various amounts of commodities—which may be revealed directly by expressions of preference or indirectly by choices among alternative sets of bundles. A related example is preferences among alternative courses of action that involve various outcomes with differing degrees of uncertainty; this is one of the more thoroughly investigated problems because of its potential importance in decision making. A psychological example is the trade-off between delay and amount of reward, yielding those combinations that are equally reinforcing. In a common, applied kind of problem, a subject is given descriptions of people in terms of several factors, for example, intelligence, creativity, diligence, and honesty, and is asked to rate them according to a criterion such as suitability for a particular job.
In all these cases and a myriad of others like them the question is whether the regularities of the data permit a numerical representation. Initially, three types of representations were studied quite fully: the dependent variable as a sum, a product, or a weighted average of the measures associated with the independent variables. The first two representations underlie some psychological and economic investigations, as well as a considerable portion of physical measurement and modeling in classical statistics. The third representation, averaging, has proved most useful in understanding preferences among uncertain outcomes and the amalgamation of verbally described traits, as well as some physical variables.
For each of these three cases—adding, multiplying, and averaging—researchers know what properties or axioms of order the data must satisfy for such a numerical representation to be appropriate. On the assumption that one or another of these representations exists, and using numerical ratings by subjects instead of ordering, a scaling technique called functional measurement (referring to the function that describes how the dependent variable relates to the independent ones) has been developed and applied in a number of domains. What remains problematic is how to encompass at the ordinal level the fact that some random error intrudes into nearly all observations and then to show how that randomness is represented at the numerical level; this continues to be an unresolved and challenging research issue.
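A back-of-the-envelope version of the additive case can be checked directly: decompose a two-factor table of ratings into row and column effects and inspect the residuals. This is only a generic least-squares sketch in the spirit of functional measurement, with fabricated, exactly additive ratings.

```python
# Additive-representation sketch: fit rating(i, j) = grand + row_i + col_j
# and check the residuals; multiplicative or averaging data would leave
# systematic residuals under this decomposition.
import numpy as np

rows = np.array([1.0, 2.0, 4.0])         # e.g., levels of one trait
cols = np.array([0.0, 1.5, 3.0, 4.5])    # e.g., levels of another trait
ratings = rows[:, None] + cols[None, :]  # fabricated additive data

grand = ratings.mean()
row_eff = ratings.mean(axis=1) - grand
col_eff = ratings.mean(axis=0) - grand
fitted = grand + row_eff[:, None] + col_eff[None, :]
print(np.abs(ratings - fitted).max())    # ~0: additivity holds exactly
```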
During the past few years considerable progress has been made in understanding certain representations inherently different from those just discussed. The work has involved three related thrusts. The first is a scheme of classifying structures according to how uniquely their representation is constrained. The three classical numerical representations are known as ordinal, interval, and ratio scale types. For systems with continuous numerical representations and of scale type at least as rich as the ratio one, it has been shown that only one additional type can exist. A second thrust is to accept structural assumptions, like factorial ones, and to derive for each scale the possible functional relations among the independent variables. And the third thrust is to develop axioms for the properties of an order relation that leads to the possible representations. Much is now known about the possible nonadditive representations of both the multifactor case and the one where stimuli can be combined, such as combining sound intensities.
Closely related to this classification of structures is the question: What statements, formulated in terms of the measures arising in such representations, can be viewed as meaningful in the sense of corresponding to something empirical? Statements here refer to any scientific assertions, including statistical ones, formulated in terms of the measures of the variables and logical and mathematical connectives. These are statements for which asserting truth or falsity makes sense. In particular, statements that remain invariant under certain symmetries of structure have played an important role in classical geometry, dimensional analysis in physics, and in relating measurement and statistical models applied to the same phenomenon. In addition, these ideas have been used to construct models in more formally developed areas of the behavioral and social sciences, such as psychophysics. Current research has emphasized the commonality of these historically independent developments and is attempting both to uncover systematic, philosophically sound arguments as to why invariance under symmetries is as important as it appears to be and to understand what to do when structures lack symmetry, as, for example, when variables have an inherent upper bound.
Many subjects do not seem to be correctly represented in terms of distances in continuous geometric space. Rather, in some cases, such as the relations among meanings of words—which is of great interest in the study of memory representations—a description in terms of tree-like, hierarchical structures appears to be more illuminating. This kind of description appears appropriate both because of the categorical nature of the judgments and the hierarchical, rather than trade-off, nature of the structure. Individual items are represented as the terminal nodes of the tree, and groupings by different degrees of similarity are shown as intermediate nodes, with the more general groupings occurring nearer the root of the tree. Clustering techniques, requiring considerable computational power, have been and are being developed. Some successful applications exist, but much more refinement is anticipated.
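As an illustration (not part of the original text), the basic tree-building step can be sketched in a few lines of Python. The words and the dissimilarity matrix below are invented, and scipy's generic average-linkage routine stands in for the more specialized clustering techniques the passage alludes to.

```python
# A minimal sketch of hierarchical clustering on word-similarity data,
# assuming numpy and scipy; the words and dissimilarities are invented.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

words = ["dog", "cat", "wolf", "car", "truck"]
# Hypothetical dissimilarities (0 = identical meaning, 1 = unrelated).
D = np.array([
    [0.0, 0.3, 0.2, 0.9, 0.9],
    [0.3, 0.0, 0.4, 0.9, 0.9],
    [0.2, 0.4, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.9, 0.1, 0.0],
])

# Condense the symmetric matrix and build the tree by average linkage.
Z = linkage(squareform(D), method="average")
tree = dendrogram(Z, labels=words, no_plot=True)  # set no_plot=False to draw it
print(Z)  # each row: the two nodes merged and the height of the merge
```

Items merged low in the tree ("dog", "wolf") form tight clusters; the vehicles join the animals only near the root, mirroring the intermediate-node structure described above.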
Several other lines of advanced modeling have progressed in recent years, opening new possibilities for empirical specification and testing of a variety of theories. In social network data, relationships among units, rather than the units themselves, are the primary objects of study: friendships among persons, trade ties among nations, cocitation clusters among research scientists, interlocking among corporate boards of directors. Special models for social network data have been developed in the past decade, and they give, among other things, precise new measures of the strengths of relational ties among units. A major challenge in social network data at present is to handle the statistical dependence that arises when the units sampled are related in complex ways.
As was noted earlier, questions of design, representation, and analysis are intimately intertwined. Some issues of inference and analysis have been discussed above as related to specific data collection and modeling approaches. This section discusses some more general issues of statistical inference and advances in several current approaches to them.
Behavioral and social scientists use statistical methods primarily to infer the effects of treatments, interventions, or policy factors. Previous chapters included many instances of causal knowledge gained this way. As noted above, the large experimental study of alternative health care financing discussed in Chapter 2 relied heavily on statistical principles and techniques, including randomization, in the design of the experiment and the analysis of the resulting data. Sophisticated designs were necessary in order to answer a variety of questions in a single large study without confusing the effects of one program difference (such as prepayment or fee for service) with the effects of another (such as different levels of deductible costs), or with effects of unobserved variables (such as genetic differences). Statistical techniques were also used to ascertain which results applied across the whole enrolled population and which were confined to certain subgroups (such as individuals with high blood pressure) and to translate utilization rates across different programs and types of patients into comparable overall dollar costs and health outcomes for alternative financing options.
A classical experiment, with systematic but randomly assigned variation of the variables of interest (or some reasonable approach to this), is usually considered the most rigorous basis from which to draw such inferences. But random samples or randomized experimental manipulations are not always feasible or ethically acceptable. Then, causal inferences must be drawn from observational studies, which, however well designed, are less able to ensure that the observed (or inferred) relationships among variables provide clear evidence on the underlying mechanisms of cause and effect.
Certain recurrent challenges have been identified in studying causal inference. One challenge arises from the selection of background variables to be measured, such as the sex, nativity, or parental religion of individuals in a comparative study of how education affects occupational success. The adequacy of classical methods of matching groups in background variables and adjusting for covariates needs further investigation. Statistical adjustment of biases linked to measured background variables is possible, but it can become complicated. Current work in adjustment for selectivity bias is aimed at weakening implausible assumptions, such as normality, when carrying out these adjustments. Even after adjustment has been made for the measured background variables, other, unmeasured variables are almost always still affecting the results (such as family transfers of wealth or reading habits). Analyses of how the conclusions might change if such unmeasured variables could be taken into account are essential in attempting to make causal inferences from an observational study, and systematic work on useful statistical models for such sensitivity analyses is just beginning.
Another important issue arises from the necessity of distinguishing among competing hypotheses when the explanatory variables are measured with different degrees of precision. Both the estimated size and the significance of an effect are diminished when it is measured with large error, and the coefficients of other correlated variables are affected even when those variables are measured perfectly. Similar results arise from conceptual errors, when one measures only proxies for a theoretical construct (such as years of education to represent amount of learning). In some cases, there are procedures for simultaneously or iteratively estimating both the precision of complex measures and their effect on a particular criterion.
Although complex models are often necessary to infer causes, once their output is available, it should be translated into understandable displays for evaluation. Results that depend on the accuracy of a multivariate model and the associated software need to be subjected to appropriate checks, including the evaluation of graphical displays, group comparisons, and other analyses.
Internal resampling.
One of the great contributions of twentieth-century statistics was to demonstrate how a properly drawn sample of sufficient size, even if it is only a tiny fraction of the population of interest, can yield very good estimates of most population characteristics. When enough is known at the outset about the characteristic in question—for example, that its distribution is roughly normal—inference from the sample data to the population as a whole is straightforward, and one can easily compute measures of the certainty of inference, a common example being the 95 percent confidence interval around an estimate. But population shapes are sometimes unknown or uncertain, and so inference procedures cannot be so simple. Furthermore, more often than not, it is difficult to assess even the degree of uncertainty associated with complex data and with the statistics needed to unravel complex social and behavioral phenomena.
Internal resampling methods attempt to assess this uncertainty by generating a number of simulated data sets similar to the one actually observed. The definition of similar is crucial, and many methods that exploit different types of similarity have been devised. These methods provide researchers the freedom to choose scientifically appropriate procedures and to replace procedures that are valid only under assumed distributional shapes with ones that are not so restricted. Flexible and imaginative computer simulation is the key to these methods. For a simple random sample, the “bootstrap” method repeatedly resamples the obtained data (with replacement) to generate a distribution of possible data sets. The distribution of any estimator can thereby be simulated and measures of the certainty of inference be derived. The “jackknife” method repeatedly omits a fraction of the data and in this way generates a distribution of possible data sets that can also be used to estimate variability. These methods can also be used to remove or reduce bias. For example, the ratio estimator, a statistic that is commonly used in analyzing sample surveys and censuses, is known to be biased, and the jackknife method can usually remedy this defect. The methods have been extended to other situations and types of analysis, such as multiple regression.
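To make the bootstrap idea concrete, here is a minimal sketch in Python; it is not from the original text, and the data, sample size, and estimator are invented for illustration (numpy assumed).

```python
# A minimal sketch of the percentile bootstrap for a simple random sample.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=3.0, size=50)  # the observed sample

def bootstrap_ci(sample, estimator, n_boot=10_000, alpha=0.05):
    """Percentile confidence interval for any estimator via resampling."""
    n = len(sample)
    estimates = np.array([
        estimator(rng.choice(sample, size=n, replace=True))  # resample with replacement
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lo, hi

print(bootstrap_ci(data, np.median))  # e.g. a 95% interval for the median
```

Nothing here relies on a normality assumption about the population: the simulated distribution of the estimator itself supplies the measure of uncertainty, which is exactly the freedom the passage describes.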
There are indications that under relatively general conditions, these methods, and others related to them, allow more accurate estimates of the uncertainty of inferences than do the traditional ones that are based on assumed (usually, normal) distributions when that distributional assumption is unwarranted. For complex samples, such internal resampling or subsampling facilitates estimating the sampling variances of complex statistics.
An older and simpler, but equally important, idea is to use one independent subsample in searching the data to develop a model and at least one separate subsample for estimating and testing a selected model. Otherwise, it is next to impossible to make allowances for the excessively close fitting of the model that occurs as a result of the creative search for the exact characteristics of the sample data—characteristics that are to some degree random and will not predict well to other samples.
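A minimal sketch of this split-sample discipline, assuming numpy and an invented bivariate data set: the model is searched for on one half of the data and then estimated on the untouched other half.

```python
# Exploration/confirmation split: fit on one subsample, check on the other.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

perm = rng.permutation(n)
explore, confirm = perm[: n // 2], perm[n // 2:]

# Search for a model on the exploration half only...
slope_explore = np.polyfit(x[explore], y[explore], deg=1)[0]
# ...then estimate and test the chosen model on data that played no part in the search.
slope_confirm = np.polyfit(x[confirm], y[confirm], deg=1)[0]
print(slope_explore, slope_confirm)
```

Any over-fitting produced by the creative search shows up as a gap between the exploratory and confirmatory estimates, rather than silently inflating the reported fit.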
Many technical assumptions underlie the analysis of data. Some, like the assumption that each item in a sample is drawn independently of other items, can be weakened when the data are sufficiently structured to admit simple alternative models, such as serial correlation. Usually, these models require that a few parameters be estimated. Assumptions about shapes of distributions, normality being the most common, have proved to be particularly important, and considerable progress has been made in dealing with the consequences of different assumptions.
More recently, robust techniques have been designed that permit sharp, valid discriminations among possible values of parameters of central tendency for a wide variety of alternative distributions by reducing the weight given to occasional extreme deviations. It turns out that by giving up, say, 10 percent of the discrimination that could be provided under the rather unrealistic assumption of normality, one can greatly improve performance in more realistic situations, especially when unusually large deviations are relatively common.
These valuable modifications of classical statistical techniques have been extended to multiple regression, in which procedures of iterative reweighting can now offer relatively good performance for a variety of underlying distributional shapes. They should be extended to more general schemes of analysis.
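The flavor of these techniques can be conveyed with a Huber-type M-estimator of location fitted by iterative reweighting. This is an illustrative sketch under conventional assumptions (numpy; tuning constant k = 1.345), not the specific procedures the report refers to.

```python
# A robust location estimate by iterative reweighting (Huber-type M-estimator).
import numpy as np

def huber_mean(x, k=1.345, n_iter=50):
    mu = np.median(x)                              # robust starting point
    scale = np.median(np.abs(x - mu)) / 0.6745     # MAD-based scale estimate
    for _ in range(n_iter):
        r = (x - mu) / scale
        ar = np.maximum(np.abs(r), 1e-12)          # guard against division by zero
        w = np.minimum(1.0, k / ar)                # weight 1 near center, k/|r| in the tails
        mu = np.sum(w * x) / np.sum(w)             # reweighted mean
    return mu

rng = np.random.default_rng(2)
# A contaminated sample: 95 well-behaved points plus 5 wild ones.
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(0, 10, 5)])
print(np.mean(x), huber_mean(x))  # the robust estimate resists the outliers
```

Downweighting, rather than discarding, the extreme deviations is what buys the modest loss of efficiency under exact normality in exchange for much better behavior when large deviations are common.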
In some contexts—notably the most classical uses of analysis of variance—the use of adequate robust techniques should help to bring conventional statistical practice closer to the best standards that experts can now achieve.
In trying to give a more accurate representation of the real world than is possible with simple models, researchers sometimes use models with many parameters, all of which must be estimated from the data. Classical principles of estimation, such as straightforward maximum-likelihood, do not yield reliable estimates unless either the number of observations is much larger than the number of parameters to be estimated or special designs are used in conjunction with strong assumptions. Bayesian methods do not draw a distinction between fixed and random parameters, and so may be especially appropriate for such problems.
A variety of statistical methods have recently been developed that can be interpreted as treating many of the parameters as random quantities (or as similar to them), even if they are regarded as representing fixed quantities to be estimated. Theory and practice demonstrate that such methods can improve on the simpler fixed-parameter methods from which they evolved, especially when the number of observations is not large relative to the number of parameters. Successful applications include college and graduate school admissions, where the quality of the previous school is treated as a random parameter when the data are insufficient to estimate it well separately. Efforts to create appropriate models using this general approach for small-area estimation and undercount adjustment in the census are important potential applications.
In data analysis, serious problems can arise when certain kinds of (quantitative or qualitative) information are partially or wholly missing. Various approaches to dealing with these problems have been or are being developed. One recently developed method for dealing with certain aspects of missing data is called multiple imputation: each missing value in a data set is replaced by several values representing a range of possibilities, with statistical dependence among missing values reflected by linkage among their replacements. It is currently being used to handle a major problem of incompatibility between the 1980 and previous Bureau of the Census public-use tapes with respect to occupation codes. The extension of these techniques to address such problems as nonresponse to income questions in the Current Population Survey has been examined in exploratory applications with great promise.
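A minimal sketch of the idea, assuming numpy and a deliberately crude normal imputation model: each missing value is filled in m times, and the m complete-data estimates are pooled with Rubin's rules (pooled variance = within-imputation variance plus an inflated between-imputation component).

```python
# Multiple imputation with Rubin's pooling rules, on invented data.
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(50, 10, size=100)
y[rng.choice(100, size=20, replace=False)] = np.nan   # make 20 values missing

m = 5                                   # number of imputed data sets
obs = y[~np.isnan(y)]
estimates, variances = [], []
for _ in range(m):
    filled = y.copy()
    # Replace each missing value with a random draw reflecting our uncertainty.
    filled[np.isnan(y)] = rng.normal(obs.mean(), obs.std(ddof=1),
                                     size=np.isnan(y).sum())
    estimates.append(filled.mean())
    variances.append(filled.var(ddof=1) / len(filled))  # within-imputation variance

Q = np.mean(estimates)              # pooled estimate
W = np.mean(variances)              # average within-imputation variance
B = np.var(estimates, ddof=1)       # between-imputation variance
T = W + (1 + 1 / m) * B             # Rubin's total variance
print(Q, np.sqrt(T))
```

Because the imputed values differ across the m data sets, the between-imputation spread B carries the extra uncertainty due to the missing data into the final standard error, rather than pretending the filled-in values were observed.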
The development of high-speed computing and data handling has fundamentally changed statistical analysis. Methodologies for all kinds of situations are rapidly being developed and made available for use in computer packages that may be incorporated into interactive expert systems. This computing capability offers the hope that much data analysis will be done more carefully and more effectively than previously and that better strategies for data analysis will move from the practice of expert statisticians, some of whom may not have tried to articulate their own strategies, to both wide discussion and general use.
But powerful tools can be hazardous, as witnessed by occasional dire misuses of existing statistical packages. Until recently the only strategies available were to train more expert methodologists or to train substantive scientists in more methodology, but without continual updating such training tends to become outmoded. Now there is the opportunity to capture in expert systems the current best methodological advice and practice. If that opportunity is exploited, standard methodological training of social scientists will shift to emphasizing strategies for using good expert systems, including understanding the nature and importance of the guidance such systems provide, rather than how to patch together something on one’s own. With expert systems, almost all behavioral and social scientists should become able to conduct any of the more common styles of data analysis more effectively and with more confidence than all but the most expert do today. However, the difficulties in developing expert systems that work as hoped for should not be underestimated. Human experts cannot readily explicate all of the complex cognitive network that constitutes an important part of their knowledge. As a result, the first attempts at expert systems were not especially successful (as discussed in Chapter 1). Additional work is expected to overcome these limitations, but it is not clear how long it will take.
The formal focus of much statistics research in the middle half of the twentieth century was on procedures to confirm or reject precise, a priori hypotheses developed in advance of collecting data—that is, procedures to determine statistical significance. There was relatively little systematic work on realistically rich strategies for the applied researcher to use when attacking real-world problems with their multiplicity of objectives and sources of evidence. More recently, a species of quantitative detective work, called exploratory data analysis, has received increasing attention. In this approach, the researcher seeks out possible quantitative relations that may be present in the data. The techniques are flexible and include an important component of graphic representations. While current techniques have evolved for single responses in situations of modest complexity, extensions to multiple responses and to single responses in more complex situations are now possible.
Graphic and tabular presentation is a research domain in active renaissance, stemming in part from suggestions for new kinds of graphics made possible by computer capabilities, for example, hanging histograms and easily assimilated representations of numerical vectors. Research on data presentation has been carried out by statisticians, psychologists, cartographers, and other specialists, and attempts are now being made to incorporate findings and concepts from linguistics, industrial and publishing design, aesthetics, and classification studies in library science. Another influence has been the rapidly increasing availability of powerful computational hardware and software, now available even on desktop computers. These ideas and capabilities are leading to an increasing number of behavioral experiments with substantial statistical input. Nonetheless, criteria of good graphic and tabular practice are still too much matters of tradition and dogma, without adequate empirical evidence or theoretical coherence. To broaden the respective research outlooks and vigorously develop such evidence and coherence, extended collaborations between statistical and mathematical specialists and other scientists are needed, a major objective being to understand better the visual and cognitive processes (see Chapter 1 ) relevant to effective use of graphic or tabular approaches.
Combining evidence from separate sources is a recurrent scientific task, and formal statistical methods for doing so go back 30 years or more. These methods include the theory and practice of combining tests of individual hypotheses, sequential design and analysis of experiments, comparisons of laboratories, and Bayesian and likelihood paradigms.
There is now growing interest in more ambitious analytical syntheses, which are often called meta-analyses. One stimulus has been the appearance of syntheses explicitly combining all existing investigations in particular fields, such as prison parole policy, classroom size in primary schools, cooperative studies of therapeutic treatments for coronary heart disease, early childhood education interventions, and weather modification experiments. In such fields, a serious approach to even the simplest question—how to put together separate estimates of effect size from separate investigations—leads quickly to difficult and interesting issues. One issue involves the lack of independence among the available studies, due, for example, to the effect of influential teachers on the research projects of their students. Another issue is selection bias, because only some of the studies carried out, usually those with “significant” findings, are available and because the literature search may not find all relevant studies that are available. In addition, experts agree, although informally, that the quality of studies from different laboratories and facilities differs appreciably and that such information probably should be taken into account. Inevitably, the studies to be included used different designs and concepts and controlled or measured different variables, making it difficult to know how to combine them.
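As a concrete (and deliberately simplified) illustration of the simplest question, combining separate estimates of effect size, here is a fixed-effect inverse-variance pooling sketch in Python; the study results are invented, and the sketch ignores exactly the dependence, selection, and quality issues just described.

```python
# Fixed-effect meta-analysis by inverse-variance weighting (invented studies).
import numpy as np

effects = np.array([0.30, 0.15, 0.45, 0.25])      # estimated effect size per study
variances = np.array([0.010, 0.020, 0.050, 0.015])  # sampling variance per study

w = 1.0 / variances                               # precise studies get more weight
pooled = np.sum(w * effects) / np.sum(w)          # weighted average effect
se = np.sqrt(1.0 / np.sum(w))                     # standard error of the pooled effect
print(pooled, (pooled - 1.96 * se, pooled + 1.96 * se))
```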
Rich, informal syntheses, allowing for individual appraisal, may be better than catch-all formal modeling, but the literature on formal meta-analytic models is growing and may be an important area of discovery in the next decade, relevant both to statistical analysis per se and to improved syntheses in the behavioral and social and other sciences.
This chapter has cited a number of methodological topics associated with behavioral and social sciences research that appear to be particularly active and promising at the present time. As throughout the report, they constitute illustrative examples of what the committee believes to be important areas of research in the coming decade. In this section we describe recommendations for an additional $16 million annually to facilitate both the development of methodologically oriented research and, equally important, its communication throughout the research community.
Methodological studies, including early computer implementations, have for the most part been carried out by individual investigators with small teams of colleagues or students. Occasionally, such research has been associated with quite large substantive projects, and some of the current developments of computer packages, graphics, and expert systems clearly require large, organized efforts, which often lie at the boundary between grant-supported work and commercial development. As such research is often a key to understanding complex bodies of behavioral and social sciences data, it is vital to the health of these sciences that research support continue on methods relevant to problems of modeling, statistical analysis, representation, and related aspects of behavioral and social sciences data. Researchers and funding agencies should also be especially sympathetic to the inclusion of such basic methodological work in large experimental and longitudinal studies. Additional funding for work in this area, both in terms of individual research grants on methodological issues and in terms of augmentation of large projects to include additional methodological aspects, should be provided largely in the form of investigator-initiated project grants.
Ethnographic and comparative studies also typically rely on project grants to individuals and small groups of investigators. While this type of support should continue, provision should also be made to facilitate the execution of studies using these methods by research teams and to provide appropriate methodological training through the mechanisms outlined below.
Overall, we recommend an increase of $4 million in the level of investigator-initiated grant support for methodological work. An additional $1 million should be devoted to a program of centers for methodological research.
Many of the new methods and models described in the chapter, if and when adopted to any large extent, will demand substantially greater amounts of research devoted to appropriate analysis and computer implementation. New user interfaces and numerical algorithms will need to be designed and new computer programs written. And even when generally available methods (such as maximum-likelihood) are applicable, model application still requires skillful development in particular contexts. Many of the familiar general methods that are applied in the statistical analysis of data are known to provide good approximations when sample sizes are sufficiently large, but their accuracy varies with the specific model and data used. To estimate the accuracy requires extensive numerical exploration. Investigating the sensitivity of results to the assumptions of the models is important and requires still more creative, thoughtful research. It takes substantial efforts of these kinds to bring any new model on line, and the need becomes increasingly important and difficult as statistical models move toward greater realism, usefulness, complexity, and availability in computer form. More complexity in turn will increase the demand for computational power. Although most of this demand can be satisfied by increasingly powerful desktop computers, some access to mainframe and even supercomputers will be needed in selected cases. We recommend an additional $4 million annually to cover the growth in computational demands for model development and testing.
Interaction and cooperation between the developers and the users of statistical and mathematical methods need continual stimulation—both ways. Efforts should be made to teach new methods to a wider variety of potential users than is now the case. Several ways appear effective for methodologists to communicate to empirical scientists: running summer training programs for graduate students, faculty, and other researchers; encouraging graduate students, perhaps through degree requirements, to make greater use of the statistical, mathematical, and methodological resources at their own or affiliated universities; associating statistical and mathematical research specialists with large-scale data collection projects; and developing statistical packages that incorporate expert systems in applying the methods.
Methodologists, in turn, need to become more familiar with the problems actually faced by empirical scientists in the laboratory and especially in the field. Several ways appear useful for communication in this direction: encouraging graduate students in methodological specialties, perhaps through degree requirements, to work directly on empirical research; creating postdoctoral fellowships aimed at integrating such specialists into ongoing data collection projects; and providing for large data collection projects to engage relevant methodological specialists. In addition, research on and development of statistical packages and expert systems should be encouraged to involve the multidisciplinary collaboration of experts with experience in statistical, computer, and cognitive sciences.
A final point has to do with the promise held out by bringing different research methods to bear on the same problems. As our discussions of research methods in this and other chapters have emphasized, different methods have different powers and limitations, and each is designed especially to elucidate one or more particular facets of a subject. An important type of interdisciplinary work is the collaboration of specialists in different research methodologies on a substantive issue, examples of which have been noted throughout this report. If more such research were conducted cooperatively, the power of each method pursued separately would be increased. To encourage such multidisciplinary work, we recommend increased support for fellowships, research workshops, and training institutes.
Funding for fellowships, both pre-and postdoctoral, should be aimed at giving methodologists experience with substantive problems and at upgrading the methodological capabilities of substantive scientists. Such targeted fellowship support should be increased by $4 million annually, of which $3 million should be for predoctoral fellowships emphasizing the enrichment of methodological concentrations. The new support needed for research workshops is estimated to be $1 million annually. And new support needed for various kinds of advanced training institutes aimed at rapidly diffusing new methodological findings among substantive scientists is estimated to be $2 million annually.
Graphical Representation of Data
Graphical representation of data is where numbers and facts become lively pictures and colorful diagrams. Instead of staring at boring lists of numbers, we use fun charts, cool graphs, and interesting visuals to understand information better. In this exciting concept of data visualization, we’ll learn about different kinds of graphs, charts, and pictures that help us see patterns and stories hidden in data.
An entire branch of mathematics is dedicated to collecting, analyzing, interpreting, and presenting numerical data in visual form so that it becomes easy to understand and compare; that branch is known as Statistics.
The branch is widely applied and has a plethora of real-life uses, such as business analytics, demography, astrostatistics, and so on. In this article, we cover everything about the graphical representation of data, including its types, rules, and advantages.
Table of Content

- Types of Graphical Representations
- Line Graphs
- Histograms
- Stem and Leaf Plot
- Box and Whisker Plot
- Frequency-Based Graphs
- Principles of Graphical Representations
- Advantages and Disadvantages of Using a Graphical System
- General Rules for Graphical Representation of Data
- Frequency Polygon
- Solved Examples on Graphical Representation of Data
Graphical representation is a way of presenting data in pictorial form. It helps a reader understand a large set of data easily, since it reveals the patterns in the data in visualized form.
There are two broad ways of representing data: tabular form and graphical form.
They say, “A picture is worth a thousand words,” and it’s usually better to represent data in a graphical format. In practical evidence and surveys, scientists have found that retention and understanding of information are better when it is presented visually, since human beings process data better in visual form than in any other form.
Does it increase the ability 2 times or 3 times? A widely quoted (though hard-to-verify) claim is that it increases the power of understanding as much as 60,000 times for a normal human being; the figure is amusing and startling at the same time.
Comparison between different items is best shown with graphs; it becomes easier to grasp the crux of the data about different items. Let’s look at all the different types of graphical representations briefly:
A line graph is used to show how the value of a particular variable changes with time. We plot this graph by connecting the points at different values of the variable. It can be useful for analyzing the trends in the data and predicting further trends.
A bar graph is a type of graphical representation of the data in which bars of uniform width are drawn with equal spacing between them on one axis (x-axis usually), depicting the variable. The values of the variables are represented by the height of the bars.
This is similar to a bar graph, but it is based on the frequency of numerical values rather than their actual values. The data is organized into intervals, and the bars represent the frequency of the values in each range. That is, it counts how many values of the data lie in a particular range.
It is a plot that displays data as points or check marks above a number line, showing the frequency of each value.
This is a type of plot in which each value is split into a “leaf” (in most cases, the last digit) and a “stem” (the remaining digits). For example, the number 42 is split into a leaf (2) and a stem (4).
These plots divide the data into four parts to show its summary. They are more concerned with the spread, average, and median of the data.
It is a type of graph in which the data is represented as a circular chart. The circle is divided such that each portion represents a proportion of the whole.
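The chart types described above can be sketched with matplotlib; a minimal example with invented data (one panel per type) might look like this:

```python
# One small panel per chart type, using invented data for illustration.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].plot([2019, 2020, 2021, 2022], [10, 12, 9, 15])  # line graph: trend over time
axes[0, 0].set_title("Line graph")

axes[0, 1].bar(["A", "B", "C"], [5, 7, 3])                  # bar graph: compare categories
axes[0, 1].set_title("Bar graph")

axes[1, 0].hist([1, 2, 2, 3, 3, 3, 4, 4, 5, 7], bins=4)     # histogram: frequency of ranges
axes[1, 0].set_title("Histogram")

axes[1, 1].pie([40, 35, 25], labels=["X", "Y", "Z"])        # pie chart: parts of a whole
axes[1, 1].set_title("Pie chart")

fig.tight_layout()
plt.show()
```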
Graphs in math are used to study the relationships between two or more variables that are changing. Statistical data can be summarized in a better way using graphs. There are basically two lines of thought in making graphs in maths: time-series graphs and frequency-based graphs.
These graphs allow us to study the change of a variable with respect to another variable within a given interval of time. The variables can be anything. Time-series graphs study the change of a variable with time. They study the trends, periodic behavior, and patterns in the series. We are more concerned here with the values of the variables rather than the frequency of those values.
Example: Line Graph
These kinds of graphs are more concerned with the distribution of the data: how many values lie within a particular range of the variable, and which range has the maximum frequency of values. They are used to judge the spread, average, and sometimes median of a variable under study.
We should keep some things in mind while plotting and designing these graphs; the goal should be a better and clearer picture of the data.
A frequency polygon is a graph constructed by joining the midpoints of the intervals. The height of each interval, or bin, represents the frequency of the values that lie in that interval.
Question 1: What are the different types of frequency-based plots?

Answer: Histogram, frequency polygon, and box plots.
Question 2: A company with an advertising budget of Rs 10,00,00,000 has planned the following expenditure on different advertising channels, such as TV advertisements, radio, Facebook, Instagram, and printed media. The table represents the money spent on different channels.
Draw a bar graph for the following data.
Question 3: Draw a line plot for the following data
Question 4: Make a frequency plot of the following data:
| Class Interval | Mid Point | Frequency |
|---|---|---|
| 0-3 | 1.5 | 3 |
| 3-6 | 4.5 | 4 |
| 6-9 | 7.5 | 2 |
| 9-12 | 10.5 | 6 |
Now join the mid points of the intervals and their corresponding frequencies on the graph.
This graph shows both the histogram and frequency polygon for the given distribution.
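A minimal matplotlib sketch of this construction, using the midpoints and frequencies from the table above:

```python
# Frequency polygon drawn over the corresponding histogram bars.
import matplotlib.pyplot as plt

midpoints = [1.5, 4.5, 7.5, 10.5]
frequencies = [3, 4, 2, 6]

plt.bar(midpoints, frequencies, width=3, alpha=0.3, label="Histogram")
plt.plot(midpoints, frequencies, marker="o", label="Frequency polygon")
plt.xlabel("Class interval midpoint")
plt.ylabel("Frequency")
plt.legend()
plt.show()
```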
Graphical representation is a powerful tool for understanding data, but it’s essential to be aware of its limitations. While graphs and charts can make information easier to grasp, they can also be subjective, complex, and potentially misleading. By using graphical representations wisely and critically, we can extract valuable insights from data, empowering us to make informed decisions with confidence.
What are the advantages of using graphs to represent data?

Graphs offer visualization, clarity, and easy comparison of data, aiding in outlier identification and predictive analysis.

What are the common types of graphs?

Common graph types include bar, line, pie, histogram, and scatter plots, each suited for different data representations and analysis purposes.

How do you choose the right type of graph?

Select a graph type based on data type, analysis objective, and audience familiarity to effectively convey information and insights.

How should a graph be labeled?

Use descriptive titles, clear axis labels with units, and legends to ensure the graph communicates information clearly and concisely.

How do you interpret a graph?

Interpret graphs by examining trends, identifying outliers, comparing data across categories, and considering the broader context to draw meaningful insights and conclusions.
Computers do not understand human language; they understand data in a prescribed form. Data representation is the method used to represent and encode data in a computer system. Generally, a user inputs numbers, text, images, audio, video, and other types of data to be processed, but the computer converts this data to machine language first and then processes it.

Data representation plays a vital role in storing, processing, and communicating data. A correct and effective data representation method improves data processing performance and system compatibility.
Number System

A computer system treats numbers as data, including integers, decimals, and complex numbers. All input numbers are represented in binary format, using 0s and 1s. Number systems are categorized into four types; the table below summarizes them along with their base and digits.
| System | Base | Digits |
|---|---|---|
| Binary | 2 | 0 1 |
| Octal | 8 | 0 1 2 3 4 5 6 7 |
| Decimal | 10 | 0 1 2 3 4 5 6 7 8 9 |
| Hexadecimal | 16 | 0 1 2 3 4 5 6 7 8 9 A B C D E F |
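In Python, conversions among these four systems can be sketched with the built-in int(), bin(), oct(), and hex() functions:

```python
# Converting between binary, octal, decimal, and hexadecimal.
n = int("1010", 2)             # binary 1010 -> decimal 10
print(n)                       # 10
print(bin(n), oct(n), hex(n))  # 0b1010 0o12 0xa
print(int("ff", 16))           # hexadecimal ff -> decimal 255
```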
A bit is the smallest data unit that a computer uses in computation; all the computation tasks done by computer systems are based on bits. A bit represents a binary digit, either 0 or 1. The computer usually uses bits in groups. The bit is the basic unit of information storage and communication in digital computing.
A group of eight bits is called a byte, and half of a byte (a group of four bits) is called a nibble. A byte is the fundamental addressable unit of computer memory and storage. It can represent a single character, such as a letter, number, or symbol, using encoding methods such as ASCII and Unicode.
Bytes are used to determine file sizes, storage capacity, and available memory space. A kilobyte (KB) is equal to 1,024 bytes, a megabyte (MB) is equal to 1,024 KB, and a gigabyte (GB) is equal to 1,024 MB. File size is roughly measured in KBs and availability of memory space in MBs and GBs.
The following table shows the byte units and their values:

| Unit | Value |
|---|---|
| 1 Byte | 8 Bits |
| 1 Kilobyte | 1,024 Bytes |
| 1 Megabyte | 1,024 Kilobytes |
| 1 Gigabyte | 1,024 Megabytes |
| 1 Terabyte | 1,024 Gigabytes |
| 1 Petabyte | 1,024 Terabytes |
| 1 Exabyte | 1,024 Petabytes |
| 1 Zettabyte | 1,024 Exabytes |
| 1 Yottabyte | 1,024 Zettabytes |
| 1 Brontobyte | 1,024 Yottabytes |
| 1 Geopbyte | 1,024 Brontobytes |
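A small Python sketch (the helper function is hypothetical, written for this article) that walks the 1,024-per-step ladder above to express a raw byte count in the most convenient unit:

```python
# Convert a raw byte count into a human-readable unit (1 KB = 1,024 bytes).
def human_readable(n_bytes: int) -> str:
    units = ["Bytes", "KB", "MB", "GB", "TB", "PB"]
    size = float(n_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024  # step up to the next unit

print(human_readable(3_500_000))  # "3.3 MB"
```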
A text code is a scheme that assigns numeric values to characters so that text, including alphabets, punctuation marks, and other symbols, can be stored and processed by a computer. Some of the most commonly used text code systems are:
EBCDIC stands for Extended Binary Coded Decimal Interchange Code. IBM developed EBCDIC in the early 1960s and used it in their mainframe systems like System/360 and its successors. To meet commercial and data processing demands, it supports letters, numbers, punctuation marks, and special symbols. Its character assignments distinguish EBCDIC from other character encoding methods like ASCII, so data encoded in EBCDIC may not be compatible with ASCII-based systems and must be converted for the systems to interoperate. EBCDIC encodes each character as an 8-bit binary code and defines 256 symbols.
ASCII stands for American Standard Code for Information Interchange. It is a 7-bit code that specifies character values from 0 to 127. ASCII is a character encoding standard that assigns numerical values to represent characters, such as letters, numbers, punctuation marks, and control characters, used in computers and communication equipment.
ASCII originally defined 128 characters, encoded with 7 bits, allowing for 2^7 (128) potential characters. The ASCII standard specifies characters for the English alphabet (uppercase and lowercase), numerals from 0 to 9, punctuation marks, and control characters for formatting and control tasks such as line feed, carriage return, and tab.
| ASCII Code | Decimal Value | Character |
|---|---|---|
| 0000 0000 | 0 | Null prompt |
| 0000 0001 | 1 | Start of heading |
| 0000 0010 | 2 | Start of text |
| 0000 0011 | 3 | End of text |
| 0000 0100 | 4 | End of transmit |
| 0000 0101 | 5 | Enquiry |
| 0000 0110 | 6 | Acknowledge |
| 0000 0111 | 7 | Audible bell |
| 0000 1000 | 8 | Backspace |
| 0000 1001 | 9 | Horizontal tab |
| 0000 1010 | 10 | Line feed |
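In Python, ASCII values can be inspected directly with the built-ins ord() and chr():

```python
# ord() gives the numeric code of a character; chr() goes the other way.
print(ord("A"))                 # 65
print(chr(65))                  # 'A'
print("Hi".encode("ascii"))     # b'Hi' -- each byte holds the 7-bit ASCII code
print(format(ord("A"), "08b"))  # 01000001, the bit pattern stored for 'A'
```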
Extended American Standard Code for Information Interchange is an 8-bit code that specifies character values from 128 to 255. Extended ASCII encompasses the normal ASCII character set, consisting of 128 characters encoded in 7 bits, plus additional characters that utilize the full 8 bits of a byte, for a total of 256 potential characters.

Different extended ASCII variants exist, each introducing more characters beyond the conventional ASCII set. These additional characters may encompass symbols, letters, and special characters specific to a particular language or locale.
Unicode is a worldwide character standard that uses 8 to 32 bits per character, depending on the encoding form, to represent letters, numbers, and symbols. It is specifically designed to provide a consistent way to represent text in nearly all of the world’s writing systems. Every character is assigned a unique numeric code point, regardless of platform, program, or language. Unicode offers a wide variety of characters, including alphabets, ideographs, symbols, and emojis.
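A short Python sketch showing code points and how the UTF-8 encoding form spends one to four bytes per character:

```python
# Unicode code points and their UTF-8 byte sequences.
for ch in ["A", "é", "€", "😀"]:
    print(ch, hex(ord(ch)), ch.encode("utf-8"))
# A   0x41     b'A'                 (1 byte)
# é   0xe9     b'\xc3\xa9'          (2 bytes)
# €   0x20ac   b'\xe2\x82\xac'      (3 bytes)
# 😀  0x1f600  b'\xf0\x9f\x98\x80'  (4 bytes)
```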
BMC Medical Education, volume 24, Article number: 885 (2024)
Research evidence suggests gender-based differences in the extent and experiences of academic leaders across the globe, even in developed countries like the USA, the UK, and Canada. The under-representation is particularly common in higher education organizations, including medical and dental schools. The current study aimed to investigate gender-based distribution and explore leaders’ experiences in the medical and dental institutes of a developing country, Pakistan.
A mixed-method approach was used. Gender-based distribution data of academic leaders in 28 colleges including 18 medical and 10 dental colleges of Khyber Pakhtunkhwa, Pakistan were collected. Qualitative data regarding the experiences of academic leaders ( n = 10) was collected through semi-structured interviews followed by transcription and thematic analysis using standard procedures.
Gender-based disparities exist across all institutes, with the greatest differences at the top-rank leadership level (principals/deans), where 84.5% of the positions were occupied by males. The gender gap was relatively narrow at mid-level leadership positions, reaching above 40% female leaders in medical/dental education. The qualitative analysis found gender-based differences in the experiences under four themes: leadership attributes, leadership journey, challenges, and support.
The study showed that women are not only significantly under-represented in leadership positions in medical and dental colleges in Pakistan, they also face gender-based discrimination and struggle to maintain a decent work-life balance. These findings are critical and can have important implications for government, organizations, human resource managers, and policymakers in terms of enacting laws, proposing regulations, and establishing support mechanisms to improve gender balance and help current and aspiring leaders in their leadership journey.
An inclusive and diverse workforce is crucial for organizational success in today’s modern world. However, despite a gradual increase in the representation of women in the workforce globally, gender discrimination and unequal gender ratios still exist in workplaces, both horizontally (i.e., across different sectors and industries) and vertically (women are usually excluded from top positions at the organizational level) [ 1 , 2 , 3 ]. Workplace gender discrimination, in any form, is not only illegal and against basic human rights [ 4 ], it also contributes to the lower socioeconomic status of women globally [ 5 ]. The under-representation of women, especially at leadership positions, is particularly common in higher education organizations, including medical and dental schools, as reported previously [ 6 , 7 ]. Even those who reach a position in academic healthcare institutes publish research less frequently [ 8 ], receive fewer research grants [ 9 ], and have 23% lower chances of promotion [ 10 ] than their male counterparts.
Recently, researchers have tried to explore the extent of gender-based parity in leadership positions in academic medicine and the sciences in many different countries. Surprisingly, gender imbalance exists at all levels and in all nations, including countries ranking high on gender equality indices, such as the Nordic countries [ 11 ]. According to a report by the European Commission, only 25% of the professors in European member states are women [ 12 ]. Similarly, in medical schools in the United States of America (USA), only 18% of department heads are female [ 13 ]. Research studies also reveal that gender-based inequalities are multifaceted, and many extrinsic and intrinsic factors contributing to the issue have been identified [ 10 , 14 ]. However, most research studies exploring women and leadership in academic medicine have been conducted in developed Western countries, including Europe, the USA, and Canada [ 15 ]. There is a dire need for research documenting the extent of the disparity and how male and female professionals develop as leaders during their careers in academic medicine and dentistry, especially from a non-Western perspective. In this context, a country like Pakistan offers an excellent opportunity to study and assess the unique barriers these women must overcome to attain leadership positions in academic medicine and dentistry.
Pakistan is a former British colony and a Muslim-majority country in South Asia with a population of over 241 million, of whom 48.54% are women [ 16 ]. The country’s strictly conservative and patriarchal society strongly influences culture, traditions, and gender dynamics [ 17 ], significantly reducing women’s opportunities for career choice and progression [ 18 ]. The overall job market in Pakistan is male-dominated, and only certain professions, such as teaching and medical practice, are deemed suitable for women. Women constitute nearly 70% of the students in Pakistani medical schools, but only 50% register with the Pakistan Medical and Dental Council (PMDC), and even fewer continue medical practice as a career [ 19 ].
There exists a critical gap in our understanding of gender-based differences in the extent and experience of leaders from a non-Western perspective. In the current study, we investigated the gender-based distribution of academic leaders in medical and dental colleges in Khyber Pakhtunkhwa (KP), Pakistan, and explored the experiences of men and women occupying leadership positions in these colleges. The study results will help design gender-based leadership development strategies and promote gender equity in academic institutes across Pakistan.
The individuals who conceptualized and designed this study (MS & BJ) hold leadership positions themselves. We are cognizant of both authors’ positions, and to avoid having our personal biases reflected in the study findings, we were constantly engaged in reflexive practices [ 20 ]. For example, the author (MS) wrote personal memos during data collection, coding, and interpretation. Each transcript was thoroughly discussed, and codes were shared between the researchers. Moreover, before the data analysis started, the coding team set aside and bracketed their personal assumptions to further improve the trustworthiness of the study findings.
Based on the study objectives, a mixed-method sequential study design [ 21 ] was employed, including (a) a survey of the gender-based distribution of academic leadership positions in medical and dental colleges and (b) qualitative interviews with selected individuals occupying leadership positions, to explore gender-based differences in their experiences. Integration occurred during development of the interview protocol and in the description of the results of both phases, to gain in-depth insight into the extent and experiences of gender-based distribution in academic medicine and dentistry. Ethics approval for the study was granted by the institutional research board of Khyber Medical University (Ref No: 1–11/IHPER/MHPE/KMU/23–62), and the study design followed the GRAMMS (Good Reporting of a Mixed Methods Study) guidelines.
The sampling frame for the quantitative survey was all the medical and dental colleges of KP, both public and private, that have been recognized by the Pakistan Medical & Dental Council. At the time of data collection (June to October 2023), there were 30 colleges in KP offering undergraduate programs in medicine (MBBS) and dentistry (BDS). Information about the gender and rank of those working in academic leadership positions was collected from the official websites of the institutes, followed by confirmation of the organogram with the HR administration of the respective college. Only two private institutes, including a dental college, declined to provide the required details and were excluded from the study. Academic leaders were the individuals occupying the position of dean, director or principal, or head of department or chair. Academic leaders in medical and dental colleges in Pakistan may be divided into three leadership tiers based on roles and responsibilities. The top level includes the deans, directors, or principals responsible for the overall organizational management, including teaching, research, and administration. The mid-level academic leaders take responsibility for a specific educational/training program and include the heads of departments or chairpersons. The line-level leadership is responsible for teaching within the program and includes teachers, course coordinators, etc.
For the qualitative arm of the study, we could only obtain the email IDs of 78 academic leaders connected to the undergraduate medical and dental programs at the top and mid-level. The inclusion criteria were persons of any age and gender who had been serving in a leadership position for at least one year in a medical or dental college in KP. Those serving in a leadership position on additional/ad-hoc/acting charge who did not meet the required qualifications and experience were excluded from the study. Potential participants were sent an email describing the study and an invitation to participate; only 21 responded positively. Written informed consent was obtained from all participants who agreed to take part in the study. Data saturation was observed after 7 interviews [ 22 ], but we continued data collection to ensure equal representation of both genders (5 male, 5 female). A thank-you email was sent to the remaining prospective participants informing them that they would not be interviewed, as data saturation had been reached.
For quantitative data collection, the college name, status (public or private), and rank and gender of those working in leadership positions were taken from the websites and entered into Microsoft Excel version 2013. Qualitative data were collected through semi-structured interview sessions conducted by a single researcher (MS), face-to-face or online using the Microsoft® Teams platform, depending on the participant’s choice. Based on the available published literature, a 10-item interview guide was developed and piloted on two participants who were not part of the original study (Supplementary file 1). Information on age, gender, qualification, experience, and current role was also collected. Face-to-face interviews were audio recorded using the record function of a mobile phone (iPhone 13, Apple Inc., USA). In the case of online interviews, the recording was made using the record option available in the meeting link. To capture the real essence of the experiences, the participants were free to respond in English, Urdu (the national language of Pakistan), or Pashto (the mother tongue or first language of most of the interviewees). Data were securely recorded, translated, and transcribed verbatim.
The survey data were analyzed using Microsoft Excel 2013 and presented as frequencies and percentages. The female-to-male ratio was calculated across different institutes, programs, academic ranks, and departments. Qualitative data were transcribed, and thematic analysis was performed using ATLAS.ti version 8.4.5. The Braun and Clarke framework [ 21 ] was used for thematic analysis (Fig. 1). The analysis process involved a systematic approach to coding and interpretation by two researchers (MS & BM). The coding list was further discussed with the supervisors (BJ & UM) for improved quality.
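For illustration only, the following is a minimal sketch, in Python with pandas rather than the Excel workflow described above, of how such frequencies, percentages, and female-to-male ratios could be computed. The column names and rows are hypothetical, not the study data:

```python
# Hypothetical excerpt of a leadership survey table (not the study data).
import pandas as pd

leaders = pd.DataFrame({
    "college": ["A", "A", "B", "B", "B", "C"],
    "sector":  ["public", "public", "private", "private", "private", "public"],
    "tier":    ["top", "mid", "top", "mid", "mid", "top"],
    "gender":  ["F", "M", "M", "F", "M", "M"],
})

# Frequency and percentage of each gender within each leadership tier.
counts = leaders.groupby(["tier", "gender"]).size().unstack(fill_value=0)
percentages = counts.div(counts.sum(axis=1), axis=0) * 100

# Female-to-male ratio per sector; where() avoids division by zero.
by_sector = leaders.groupby(["sector", "gender"]).size().unstack(fill_value=0)
f_to_m = by_sector["F"] / by_sector["M"].where(by_sector["M"] > 0)

print(percentages.round(1))
print(f_to_m.round(2))
```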
Fig. 1: Braun and Clarke thematic analysis steps
In total, 18 medical and 10 dental colleges, comprising 497 academic leaders (312 in medicine and 185 in dentistry), were surveyed. The overall gender-based distribution of academic leaders is presented in supplementary Tables 1 & 2 and summarized in Fig. 2. Overall, women remain under-represented at all leadership tiers. Gender-based differences were greatest at top leadership positions, including principals/deans/directors, where women occupy only 15.5% ( n = 9) of the positions. However, women are relatively well represented in mid-level leadership positions. The mid-level leadership positions are further categorized into three groups (basic sciences, clinical sciences, and medical/dental education departments). Across these three categories, the leadership gap tends to be narrowest in the medical/dental education departments, where more than 40% ( n = 12) of the heads of departments were women.
Fig. 2: Bar graph showing the gender-based distribution of academic leaders in the medical and dental colleges of KP, Pakistan
We next assessed gender-based differences in academic leadership positions in medical and dental colleges (Fig. 3 A & B). At top-level management in dental colleges (Fig. 3 A), the percentage of women leaders (26.7%; n = 4) is more than twice that in medical colleges (11.6%; n = 5) (Fig. 3 B).
Fig. 3: Bar graph representing gender-based distribution at different leadership tiers in (A) dental colleges and (B) medical colleges
Gender-based disparities were also observed at the department level, as well as between private and public sector institutes (Fig. 4 & supplementary Tables 1 & 2). In general, except for medical education, the female-to-male ratio in leadership positions was higher in public sector medical (average 0.44; range 0.06–1.71) and dental colleges (average 0.67; range 0.1–1.6) than in private sector medical (average 0.36; range 0.14–0.67) and dental (average 0.32; range 0.1–0.5) colleges in KP province (Fig. 4). Similarly, in certain clinical fields such as obstetrics & gynaecology and prosthodontics, leadership positions were entirely occupied by a single gender.
Fig. 4: Bar graph representing the female-to-male ratio in public and private sector medical and dental colleges
Demographic characteristics of the participants are presented in Table 1 . Wide variations exist among the participants in terms of age, experience, academic credentials and number of years in a leadership position. The mean age of the participants was 45.4 ± 5.2 years and average experience was 16.4 ± 5.3 years.
The experiences of the participants can be described under four main themes and eleven subthemes as presented in Table 2 .
The participants identified various attributes that are important for a leader. Both male and female participants expressed that a leader is internally motivated to substantially impact the lives of those around them. A vast majority of the participants agreed that they possessed a clear and strong desire not only to bring about positive transformations in the lives of others through their leadership but also to personally witness the visible results of their efforts.
If you are a leader, you are required to influence people and guide them properly. (ID-1 Male)
Another leadership attribute identified by participants was good teamwork.
…One of the leadership qualities is working with your colleagues and subordinates as a team. And when you take them into confidence, they support and encourage you and facilitate you in every way… (ID-7 Male)
Positive mindset, flexibility, diplomacy, and self-awareness were also cited as essential traits for leaders.
As a leader, you are required to know about yourself… (ID-1 Female)
When you are in a team, anyone can come up with a good idea, an even better idea than yours. You must have the capacity to embrace that. (ID-10 Male)
The participants in the study shared their views on leadership, stating that it involves multiple responsibilities and requires both education and practical experience. The path to leadership was seen as challenging, but male and female participants had different perspectives on how they achieved it. Female leaders were found to actively seek opportunities, while leadership roles were more readily offered to male participants.
The transition into leadership roles as a male has been easy. (ID-9 Male)
Female leaders also struggled to balance their domestic responsibilities with their leadership roles, which was not the case for male leaders.
…I had to do all the house chores myself. The cleaning, the cooking, the washing, the daily activities, and then return home to do more work with the kids. (ID-5 Female)
Both male and female participants acknowledged the existence of gender discrimination in leadership roles.
When we apply for projects to get funds, we are unable to get a percentage like the men. (ID-2 Female)
The gender disparity does come into play, especially when you’re working as a head of an institute where all the men are working with you; even if no one says anything, the default thinking is maybe she’s a woman and that’s why she’s saying this. (ID-5 Female)
Participants also highlighted various advantages of leadership in their career growth, including training, financial stability, and autonomy.
…but then they got some advantages to their financially stable, they are socially responsible, you have a strong social background… (ID-4 Male)
Both male and female participants highlighted the underrepresentation of women in leadership positions.
Yes, because our society is male-dominated. In my own opinion, if there was a female in my place, she would have surrendered and said that she would have forgiven that… (ID-6 Male)
Moreover, female participants faced discrimination while pursuing leadership positions. They were able to overcome these challenges by relinquishing some degree of control and seeking assistance from other individuals.
When I was going for my PhD, in my interview, they said, oh, you’re a woman. OK, you’re married. Ohh. You have kids. What are you going to do with your kids… (ID-8 Female)
…However, within the limited end available resources, I made efforts with the support from the institute and high-ups. (ID-7 Female)
The challenge of work-family balance was identified by both male and female participants. However, the male participants reported that their families were taken care of by their spouses, which allowed them to focus on their work, whereas the female participants had to balance multiple responsibilities.
My wife is a housewife, so that she can give the family time. And my parents are also looked after by my wife. So, if financially I’m supporting them, the emotional support is always there, and it doesn’t make any difference if I am busy. (ID-1 Male)
…Yes, a terrible effect. It has badly affected my health. My family has excluded me from their social activities and their social circle… (ID-2 Female)
Mentoring was found to be vital to leaders’ personal growth and advancement, as mentors provided focused guidance and support to aspiring leaders, helping them reach their fullest potential. It was interesting to note that most female participants acknowledged the mentoring role of their father during their leadership journey.
Mentor for me as a person is my father. He retired as the Chief Secretary of the province. He was a very senior bureaucrat, a very well-read person, and a scholar… (ID-5 Female)
Both male and female participants stressed the importance of developing support systems to build resilience. These included the roles of the institution, family, administration, and a collaborative work environment. Interestingly, most female participants highlighted that support from female colleagues is necessary to sustain leadership roles.
A leader cannot become a leader in a vacuum. Your bosses, your followers, your teammates, and your colleagues all have to support a leader to become the leader, and there has to be a network of support available in terms of learning opportunities, in terms of funding opportunities, in terms of carrying out the vision of the leader. (ID-8 Male)
My husband has supported me a lot… (ID-3 Female)
Of course, we don’t get to wherever we are until we have somebody as support, especially in academics. (ID-2 Female)
Yes, and not only that, they (female colleagues) give me support at my workplace. We give support to each other… (ID-5 Female)
Gender equity and diversity in leadership are crucially important for designing and implementing policies that ensure a safe learning environment for students and produce a critical mass of healthcare professionals trained to cater to the needs of a diverse patient population. The integration of results from the quantitative and qualitative arms of the present study suggests that gender-based discrepancies in both the extent and the experience of academic leadership exist in Pakistan. The quantitative analysis showed that women are significantly under-represented, especially in senior-level leadership positions, where they occupy only 15.5% of the positions. These results follow the same trends observed in neighbouring South Asian countries such as Bangladesh and India, where women occupy 16% and 18% of top healthcare leadership positions, respectively [ 23 , 24 ]. These findings were further confirmed by the qualitative study, which reported gender-based inequalities at both the faculty and administration levels, confirming the alarming gender gaps in leadership positions in higher education that exist not only in Pakistan [ 25 ] but also in developed countries like the USA, Canada, and the UK [ 15 ].
The qualitative study also revealed possible reasons behind the observed gender-based inequalities. For example, the subtheme “gender discrimination” was acknowledged by both male and female participants. The finding that opportunities and sponsorship are more readily available to males than females has also been reflected in the literature previously [ 26 ]. Compared to males, a female at a junior rank is less likely to be recognized for her leadership potential and actively helped (sponsored) in promotion to a higher rank by a senior who has the power and influence to do so [ 27 ].
Another emerging theme was the role of mentors in the leadership journey. The male participants reported their colleagues, teachers, and clinical supervisors as their mentors, a finding consistent with previously published literature [ 28 , 29 ]. However, the female participants frequently reported the role of their father in their academic success and leadership journey. Since Pakistan is a patriarchal society with traditional gender roles in place, this finding is not surprising. In this context, the father often plays a major role in helping, guiding, and supporting daughters in the pursuit of education, professional excellence, or a leadership journey [ 17 ], especially when they are not married.
Most of the study participants started their leadership journey in entry-level positions. These findings are in concordance with the study by Tagoe and colleagues [ 30 ], which indicates that much leadership experience, especially in the case of women leaders, can be gained while working in middle-management positions. The presence of women in mid-level leadership positions and their preference for academic medicine [ 10 ] suggest that gender disparities in high-level positions may gradually be reduced in the future.
Our study reports comparatively more women in mid-level (Fig. 2) than in senior-level leadership positions. During the qualitative interviews, the participants, especially the women leaders, revealed that they struggle to maintain a work-life balance. Long working hours (sometimes exceeding 80 h per week), along with domestic responsibilities such as taking care of the children and family [ 31 ], force women to remain in middle rather than senior leadership positions [ 32 , 33 ]. These issues also force many women doctors in Pakistan to quit their jobs [ 18 ].
The uneven gender-based distribution of academic leaders was also observed between private and public sector institutes, as well as between individual departments within the same institute. In general, except for medical education, the female-to-male ratio was higher in the public sector medical and dental colleges of KP province. These findings are similar to recently published reports from Sweden and the USA indicating a narrower gender gap in the public sector than in the private sector [ 34 , 35 ].
Similarly, in certain clinical fields such as obstetrics & gynaecology and prosthodontics, leadership positions were entirely occupied by a single gender. This gender polarization is not surprising. Due to cultural norms and traditions, female patients mostly prefer women doctors for obstetrics and gynaecology services [ 36 ], making it the field of choice for women doctors in Pakistan [ 37 , 38 ]. On the contrary, the number of women leaders in surgery and allied fields was disproportionately low in the current study. Historically, surgery has been considered a masculine field, and women are scarce in its leadership positions across the globe, even in developed countries like the USA [ 39 ].
This study found that gender disparities exist in both the distribution of leadership positions and the experiences of leaders in the medical and dental colleges of KP, Pakistan. The women leaders frequently reported experiences of gender-based inequality and challenges in maintaining a decent work-life balance due to multiple domestic responsibilities. Although the majority of top-level leadership positions are occupied by men, the relatively high representation of women at the mid-level indicates that gender-based disparities in leadership could become more balanced in the future. These findings are critical and can have important implications for governments, organizations, and policy makers in terms of enacting laws, proposing regulations, and establishing support mechanisms that help current and aspiring leaders in their leadership journey.
Our study provides a multi-faceted perspective on the extent and experiences of gender-based disparities in medical and dental institutes in Pakistan. The inclusion of all medical and dental institutes increased the generalizability of the study findings, while the more than 93% response rate of the survey reduced selection bias, thus increasing validity. Moreover, the qualitative arm of the study provided deep insights into the experiences of academic leaders in a developing and strictly patriarchal society. However, our study has some limitations. The findings cannot be generalized across Pakistan due to widely prevalent differences across ethnicities, cultures, and socioeconomic backgrounds. Another limitation that should be considered is that the survey information was collected based on the department-wise availability of leadership roles. The subdivisions of some academic departments, for example Medicine & Allied, were not included in our study, and therefore variations are expected in the exact female-to-male ratio in the medical and dental colleges.
This published article and its supplementary information files include all data generated or analyzed during this study.
Cassells R, Duncan AS. Gender equity insights 2019: Breaking through the glass ceiling. Bankwest Curtin Economics Centre Report series [Internet]. 2019 Mar [cited 2022 Dec 20]; https://ideas.repec.org//p/ozl/bcecrs/ge04.html
Coder L, Spiller M. Leadership Education and Gender Roles: Think Manager, Think ‘?’ The Academy of Educational Leadership Journal [Internet]. 2013 Jul 1 [cited 2023 Dec 18]; https://www.semanticscholar.org/paper/Leadership-Education-and-Gender-Roles%3A-Think-Think-Coder-Spiller/0dc762a868621fab949e0159187a40ff95a7393e
OECD. Gender - OECD [Internet]. 2023 [cited 2024 Apr 22]. https://www.oecd.org/gender/
Schrijver NJ. Fifty years International Human Rights Covenants: improving the global protection of human rights by bridging the gap between two continents. 2016;41(4):457–64.
Stamarski CS, Son Hing LS. Gender inequalities in the workplace: the effects of organizational structures, processes, practices, and decision makers’ sexism. Front Psychol. 2015;6:1400. https://doi.org/10.3389/fpsyg.2015.01400
AAMC. AAMC. 2019 [cited 2023 Feb 6]. Advancing women in academic medicine. https://www.aamc.org/news-insights/advancing-women-academic-medicine
Jacobs JW, Jagsi R, Stanford FC, Sarno D, Spector ND, Silver JK, et al. Gender Representation among United States Medical Board Leadership. J Womens Health (Larchmt). 2022;31(12):1710–8.
Nielsen MW, Andersen JP, Schiebinger L, Schneider JW. One and a half million medical papers reveal a link between author gender and attention to gender and sex analysis. Nat Hum Behav. 2017;1(11):791–6.
Magua W, Zhu X, Bhattacharya A, Filut A, Potvien A, Leatherberry R, et al. Are Female Applicants Disadvantaged in National Institutes of Health Peer Review? Combining Algorithmic text mining and qualitative methods to detect evaluative differences in R01 reviewers’ critiques. J Womens Health (Larchmt). 2017;26(5):560–70.
Hartmark-Hill J, Hale T, Gallitano A. Women Physicians and Promotion in Academic Medicine. N Engl J Med. 2021;384(7):680.
Silander C, Haake U, Lindberg L, Riis U. Nordic research on gender equality in academic careers: a literature review. Eur J High Educ. 2022;12(1):72–97.
European Commission. She Figures 2018 - Publications Office of the EU [Internet]. 2019 [cited 2023 Apr 18]. https://op.europa.eu/en/publication-detail/-/publication/9540ffa1-4478-11e9-a8ed-01aa75ed71a1
Carr PL, Raj A, Kaplan SE, Terrin N, Breeze JL, Freund KM. Gender differences in Academic Medicine: Retention, Rank, and Leadership comparisons from the National Faculty Survey. Acad Med. 2018;93(11):1694–9.
Győrffy Z, Dweik D, Girasek E. Workload, mental health and burnout indicators among female physicians. Hum Resour Health. 2016;14:12.
Alwazzan L, Al-Angari SS. Women’s leadership in academic medicine: a systematic review of extent, condition and interventions. BMJ Open. 2020;10(1):e032232.
Pakistan Bureau of Statistics. Population Census [Internet]. 2023 [cited 2023 Dec 4]. https://www.pbs.gov.pk/content/population-census
Chaudhary N, Dutt A. Women as agents of change: exploring women leaders’ resistance and shaping of gender ideologies in Pakistan. Front Psychol. 2022;13:800334. https://doi.org/10.3389/fpsyg.2022.800334
Mohsin M, Syed J. The missing doctors — an analysis of educated women and female domesticity in Pakistan. Gend Work Organ. 2020;27(6):1077–102.
Rafia Z. The doctor brides - DAWN.COM [Internet]. 2013 [cited 2023 Feb 6]. https://www.dawn.com/news/1032070
Dowling M. Approaches to reflexivity in qualitative research. Nurse Res. 2006;13(3):7–21.
Creswell JW, Clark VLP. Designing and conducting mixed methods research. SAGE; 2007. p. 297.
Saunders B, Sim J, Kingstone T, Baker S, Waterfield J, Bartlam B, et al. Saturation in qualitative research: exploring its conceptualization and operationalization. Qual Quant. 2018;52(4):1893–907.
Gulati K, Davies J, de la González Á. Addressing leadership competency gaps and gender disparities in India’s medical workforce: a call to action. Lancet Reg Health Southeast Asia. 2023;16:100247.
Shiner C, Chemonics I. 2023 [cited 2023 Dec 15]. Women Rise: Celebrating Women in Bangladesh’s Health Sector Who Champion Empowerment. https://chemonics.com/blog/women-rise-celebrating-women-in-bangladeshs-health-sector-who-champion-empowerment/
Batool SQ. Gender and Higher Education In Pakistan. 2013.
Hobgood C, Draucker C. Gender differences in experiences of Leadership Emergence among Emergency Medicine Department Chairs. JAMA Netw Open. 2022;5(3):e221860.
Levine RB, Ayyala MS, Skarupski KA, Bodurtha JN, Fernández MG, Ishii LE, et al. It’s a little different for men—sponsorship and gender in Academic Medicine: a qualitative study. J GEN INTERN MED. 2021;36(1):1–8.
Aminololama-Shakeri S, Seritan AL. Mentorship in academic medicine. Mentoring in formal and informal contexts. Charlotte. NC, US: IAP Information Age Publishing; 2016. pp. 227–46. (Adult learning in professional, organizational, and community settings).
Mellon A, Murdoch-Eaton D. Supervisor or mentor: is there a difference? Implications for paediatric practice. Arch Dis Child. 2015;100(9):873–8.
Tagoe M, Abakah E. Issues of women’s political participation and decision-making in local governance: perspectives from the Central Region of Ghana. Int J Public Adm. 2015;38(5):371–80.
Boateng DA. Experiences of female academics in Ghana: negotiation and strengths as strategies for successful careers. Afr J Social Work. 2018;8(1):21–30.
Barron M. Senior-Level African American Women, Underrepresentation, and Career Decision-Making. Walden Dissertations and Doctoral Studies [Internet]. 2019; https://scholarworks.waldenu.edu/dissertations/6305
Charles A. Perceptions of Women of Color on Career Advancement in High Technology Management. Walden Dissertations and Doctoral Studies [Internet]. 2017; https://scholarworks.waldenu.edu/dissertations/4276
Gardner D, Lucas T, Mabikacheche, McConnell M. Making government an even better place for women to work | McKinsey [Internet]. 2023 [cited 2023 Dec 16]. https://www.mckinsey.com/industries/public-sector/our-insights/making-government-an-even-better-place-for-women-to-work
Claésson L. Managerial representation: are women better off in the public or the private sector? 2019.
Riaz B, Sherwani NZF, Inam SHA, Rafiq MY, Tanveer S, Arif A, et al. Physician gender preference amongst females attending Obstetrics/Gynecology clinics. Cureus. 2021;13(5):e15028.
Ahmed I, Ashar A. To be or not to be an obstetrician / gynaecologist. Pak J Med Sci. 2020;36(6):1165–70.
Zaheer A. The lost doctors. Pak J Med Sci. 2022;38(8):2053–5.
Battaglia F, Farhan SA, Narmeen M, Karimuddin AA, Jalal S, Tse M, et al. Does gender influence leadership roles in academic surgery in the United States of America? A cross-sectional study. Int J Surg. 2020;83:67–74.
Acknowledgements: Not applicable.
Brekhna Jamil, Bushra Mehboob, Usman Mahboob and Fayig Elmigdadi contributed equally to this work.
Faculty of Dentistry, Zarqa University, Zarqa, Jordan
Muhammad Shahzad & Fayig Elmigdadi
Institute of Basic Medical Sciences, Khyber Medical University, Peshawar, Pakistan
Muhammad Shahzad
Institute of Health Professions Education and Research, Khyber Medical University, Peshawar, Pakistan
Brekhna Jamil & Usman Mahboob
Department of Oral and Maxillofacial Surgery, Peshawar Dental College, Peshawar, Pakistan
Mehboob Bushra
M.S.: study design, data collection, analysis, and writing the manuscript. B.J. and U.M.: supervision, data analysis, and finalizing the manuscript. B.M. and F.E. helped in qualitative data analysis, writing, reviewing, and finalizing the manuscript. All authors read and approved the final manuscript.
Correspondence to Muhammad Shahzad.
Ethics approval and consent to participate.
Ethics approval for the study was granted by the institutional research board of Khyber Medical University (Ref No: 1–11/IHPER/MHPE/KMU/23–62). The study design followed the GRAMMS (Good Reporting of a Mixed Methods Study) guidelines, and informed consent was obtained from all participants.
Competing interests.
The authors declare no competing interests.
Publisher’s note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Below is the link to the electronic supplementary material.
Supplementary material 2. Supplementary material 3.
Rights and permissions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
Cite this article.
Shahzad, M., Jamil, B., Bushra, M. et al. Gender-based differences in the representation and experiences of academic leaders in medicine and dentistry: a mixed method study from Pakistan. BMC Med Educ 24, 885 (2024). https://doi.org/10.1186/s12909-024-05811-6
Received: 30 January 2024
Accepted: 23 July 2024
Published: 16 August 2024
DOI: https://doi.org/10.1186/s12909-024-05811-6
Leah Nguyen • 20 August 2024
Have you ever presented a data report to your boss/coworkers/teachers thinking it was super dope like you’re some cyber hacker living in the Matrix, but all they saw was a pile of static numbers that seemed pointless and didn't make sense to them?
Understanding numbers is hard. Making people from non-analytical backgrounds understand those numbers is even harder.
How can you clear up those confusing numbers and make your presentation as clear as day? Let's check out these best ways to present data. 💎
How many types of charts are available to present data? | 7
How many charts are there in statistics? | 4, including bar, line, histogram and pie.
How many types of charts are available in Excel? | 8
Who invented charts? | William Playfair
When were charts invented? | 18th century
The term ’data presentation’ refers to the way you present data so that even the most clueless person in the room can understand it.
Some say it’s witchcraft (you’re manipulating the numbers in some ways), but we’ll just say it’s the power of turning dry, hard numbers or digits into a visual showcase that is easy for people to digest.
Presenting data correctly can help your audience understand complicated processes, identify trends, and instantly pinpoint whatever is going on without exhausting their brains.
Imagine you have a delicious pepperoni, extra-cheese pizza. You can decide to cut it into the classic 8 triangle slices, the party style 12 square slices, or get creative and abstract on those slices.
There are various ways to cut a pizza, and you get the same variety in how you present your data. In this section, we bring you 10 ways to slice a pizza - we mean, to present your data - that will make your company's most important asset as clear as day. Let's dive in.
Among various types of data presentation, tabular is the most fundamental method, with data presented in rows and columns. Excel or Google Sheets would qualify for the job. Nothing fancy.
In a typical tabular presentation on Google Sheets, each row and column has an attribute (year, region, revenue, etc.), and you can apply custom formatting to see the change in revenue throughout the year.
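If you'd rather script your table than click around a spreadsheet, here's a minimal pandas sketch of the same idea; the revenue figures are invented:

```python
import pandas as pd

# Flat rows, exactly as you would type them into a spreadsheet.
sales = pd.DataFrame({
    "year":    [2021, 2021, 2022, 2022],
    "region":  ["North", "South", "North", "South"],
    "revenue": [120_000, 95_000, 140_000, 111_000],
})

# A pivot turns the flat rows into a year-by-region summary table,
# making the change in revenue across years easy to scan.
summary = sales.pivot_table(index="year", columns="region", values="revenue")
print(summary)
```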
When presenting data as text, all you do is write your findings down in paragraphs and bullet points, and that’s it. A piece of cake to you, a tough nut to crack for whoever has to go through all of the reading to get to the point.
(Source: CustomerThermometer)
All the above quotes present statistical information in textual form. Since not many people like going through a wall of texts, you’ll have to figure out another route when deciding to use this method, such as breaking the data down into short, clear statements, or even as catchy puns if you’ve got the time to think of them.
A pie chart (or a ‘donut chart’ if you stick a hole in the middle of it) is a circle divided into slices that show the relative sizes of data within a whole. If you’re using it to show percentages, make sure all the slices add up to 100%.
The pie chart is a familiar face at every party and is usually recognised by most people. However, one setback of using this method is our eyes sometimes can’t identify the differences in slices of a circle, and it’s nearly impossible to compare similar slices from two different pie charts, making them the villains in the eyes of data analysts.
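For a quick hands-on example, here's a minimal matplotlib pie chart. The categories and shares are made up, and note that the slices sum to exactly 100%:

```python
import matplotlib.pyplot as plt

labels = ["Chrome", "Safari", "Edge", "Other"]
shares = [62, 20, 11, 7]  # percentages; must add up to 100

fig, ax = plt.subplots()
ax.pie(shares, labels=labels, autopct="%1.0f%%", startangle=90)
ax.set_title("Browser share (illustrative data)")
plt.show()
```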
The bar chart is a chart that presents a bunch of items from the same category, usually in the form of rectangular bars that are placed at an equal distance from each other. Their heights or lengths depict the values they represent.
A bar chart can be as simple as a single row of bars. Or it can be more complex and detailed, like a grouped bar chart, which contributes to an effective statistical presentation by letting you compare not only the categories but also the groups within them.
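Here's a minimal matplotlib sketch of a grouped bar chart along those lines, with invented quarterly figures:

```python
import numpy as np
import matplotlib.pyplot as plt

categories = ["Q1", "Q2", "Q3", "Q4"]
revenue_2022 = [10, 14, 12, 16]
revenue_2023 = [12, 15, 15, 18]

x = np.arange(len(categories))  # category positions
width = 0.35                    # bar width

fig, ax = plt.subplots()
ax.bar(x - width / 2, revenue_2022, width, label="2022")
ax.bar(x + width / 2, revenue_2023, width, label="2023")
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_ylabel("Revenue (millions)")
ax.legend()
plt.show()
```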
A histogram is similar in appearance to the bar chart, but the rectangular bars in histograms usually don't have the gaps that their counterparts do.
Instead of measuring categories like weather preferences or favourite films as a bar chart does, a histogram only measures things that can be put into numbers.
Teachers can use a presentation graph like a histogram to see which score group most of the students fall into.
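Here's a minimal matplotlib histogram of that exam-score idea, using simulated scores:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
scores = rng.normal(loc=70, scale=12, size=200).clip(0, 100)

fig, ax = plt.subplots()
# Ten-point bins; adjacent bars touch, unlike a bar chart.
ax.hist(scores, bins=range(0, 101, 10), edgecolor="black")
ax.set_xlabel("Score")
ax.set_ylabel("Number of students")
plt.show()
```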
When it comes to ways of displaying data, we shouldn't overlook the effectiveness of line graphs. A line graph is a group of data points joined together by straight lines, and there can be one or more lines to compare how several related things change over time.
On a line chart’s horizontal axis, you usually have text labels, dates or years, while the vertical axis usually represents the quantity (e.g.: budget, temperature or percentage).
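A minimal line-graph sketch in matplotlib, with invented visitor counts for two sites:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
site_a = [120, 135, 150, 160, 172, 190]
site_b = [100, 110, 105, 120, 140, 150]

fig, ax = plt.subplots()
ax.plot(months, site_a, marker="o", label="Site A")
ax.plot(months, site_b, marker="o", label="Site B")
ax.set_ylabel("Visitors (thousands)")
ax.legend()
plt.show()
```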
A pictogram graph uses pictures or icons relating to the main topic to visualise a small dataset. The fun combination of colours and illustrations makes it a frequent sight at schools.
Pictograms are a breath of fresh air if you want to stay away from the monotonous line chart or bar chart for a while. However, they can present a very limited amount of data and sometimes they are only there for displays and do not represent real statistics.
If presenting five or more variables in the form of a bar chart is too stuffy then you should try using a radar chart, which is one of the most creative ways to present data.
Radar charts show data in terms of how they compare to each other starting from the same point. Some also call them ‘spider charts’ because each aspect combined looks like a spider web.
Radar charts can be a great tool for parents who'd like to compare their child's grades with their peers' (to lower their self-esteem). Each axis represents a subject with a score value ranging from 0 to 100, and each student's score across 5 subjects is highlighted in a different colour.
If you think that this method of data presentation somehow feels familiar, then you’ve probably encountered one while playing Pokémon .
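If you want to try one yourself, here's a minimal radar-chart sketch on a matplotlib polar axis; the subjects and scores are invented:

```python
import numpy as np
import matplotlib.pyplot as plt

subjects = ["Math", "English", "Science", "History", "Art"]
scores = [85, 70, 90, 60, 75]

# One angle per subject; repeat the first point to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(subjects), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(subjects)
ax.set_ylim(0, 100)
plt.show()
```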
A heat map represents data density in colours: the bigger the number, the more intense the colour used to represent it.
Most US citizens would be familiar with this data presentation method in geography. For elections, many news outlets assign a specific colour code to a state, with blue representing one candidate and red representing the other. The shade of either blue or red in each state shows the strength of the overall vote in that state.
Another great thing you can use a heat map for is to map what visitors to your site click on. The more a particular section is clicked the ‘hotter’ the colour will turn, from blue to bright yellow to red.
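A minimal heat-map sketch in matplotlib, using an invented grid of click counts:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
clicks = rng.integers(0, 100, size=(6, 8))  # page regions as a 6x8 grid

fig, ax = plt.subplots()
image = ax.imshow(clicks, cmap="hot")  # bigger numbers get hotter colours
fig.colorbar(image, ax=ax, label="Clicks")
ax.set_title("Clicks per page region (illustrative)")
plt.show()
```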
If you present your data in dots instead of chunky bars, you’ll have a scatter plot.
A scatter plot is a grid with several inputs showing the relationship between two variables. It’s good at collecting seemingly random data and revealing some telling trends.
For example, in this graph, each dot shows the average daily temperature versus the number of beach visitors across several days. You can see that the dots get higher as the temperature increases, so it’s likely that hotter weather leads to more visitors.
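Here's a minimal matplotlib scatter-plot sketch of that temperature-versus-visitors example, with simulated data showing an upward trend:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=2)
temperature = rng.uniform(15, 35, size=40)                # degrees Celsius
visitors = temperature * 20 + rng.normal(0, 40, size=40)  # noisy upward trend

fig, ax = plt.subplots()
ax.scatter(temperature, visitors)
ax.set_xlabel("Average daily temperature (°C)")
ax.set_ylabel("Beach visitors")
plt.show()
```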
#1 - Assume your audience understands what the numbers represent
You may know all the behind-the-scenes of your data since you’ve worked with them for weeks, but your audience doesn’t.
Showing without telling only invites more and more questions from your audience, as they have to constantly make sense of your data, wasting the time of both sides as a result.
When showing your data presentation, tell your audience what the data is about before hitting them with waves of numbers. You can use interactive activities such as polls, word clouds, online quizzes and Q&A sections, combined with icebreaker games, to assess their understanding of the data and address any confusion beforehand.
Charts such as pie charts must total 100%, so if your numbers add up to 193%, you're definitely doing it wrong.
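A tiny sanity check like this, run before you draw the chart, catches that mistake early (the values are invented):

```python
import math

slices = [45, 25, 20, 10]  # pie slices as percentages
assert math.isclose(sum(slices), 100), f"Slices sum to {sum(slices)}, not 100"
```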
Before making a chart, ask yourself: what do I want to accomplish with my data? Do you want to see the relationship between the data sets, show the up and down trends of your data, or see how segments of one thing make up a whole?
Remember, clarity always comes first. Some data visualisations may look cool, but if they don’t fit your data, steer clear of them.
3D is a fascinating graphical presentation example. The third dimension is cool, but full of risks.
Can you see what's behind the front bars of a 3D chart? Neither can your audience. You may think that 3D charts add more depth to the design, but they can create false perceptions, as our eyes see 3D objects as closer and bigger than they appear, not to mention that they cannot be seen from multiple angles.
This is like comparing a fish to a monkey. Your audience won’t be able to identify the differences and make an appropriate correlation between the two data sets.
Next time, stick to one type of data presentation only. Avoid the temptation of trying various data visualisation methods in one go and make your data as accessible as possible.
The goal of data presentation is to make complex topics much easier to understand, and if you’re bringing too much information to the table, you’re missing the point.
The more information you give, the more time it will take for your audience to process it all. If you want to make your data understandable and give your audience a chance to remember it, keep the information within it to an absolute minimum. You should end your session with open-ended questions to see what your participants really think.
Finally, which is the best way to present data?
The answer is…
There is none! Each type of presentation has its own strengths and weaknesses and the one you choose greatly depends on what you’re trying to do.
What is a chart presentation?
A chart presentation is a way of presenting data or information using visual aids such as charts, graphs, and diagrams. The purpose of a chart presentation is to make complex information more accessible and understandable for the audience.
Charts can be used to compare data, show trends over time, highlight patterns, and simplify complex information.
You should use charts when you want your content and visuals to look clean, as charts provide clarity, simplicity, comparison and contrast, and save a great deal of time.
Common graphs for presenting a frequency distribution include the histogram, smoothed frequency graph, pie chart, cumulative (ogive) frequency graph, and frequency polygon.
EGCT: enhanced graph convolutional transformer for 3D point cloud representation learning
High-precision 3D environment perception urgently requires representation learning on point cloud data that captures local and global feature information simultaneously. However, current representation learning methods either focus only on how to efficiently learn local features, or capture long-distance dependencies but lose fine-grained features. We therefore explore transformers on the topological structures of point cloud graphs, proposing an enhanced graph convolutional transformer (EGCT) method. EGCT constructs a graph topology for the disordered and unstructured point cloud. It then uses an enhanced point feature representation method to further aggregate the feature information of all neighborhood points, compactly representing the features of the local neighborhood graph. In the subsequent process, the graph convolutional transformer simultaneously performs self-attention calculations and convolution operations on the point coordinates and features of the neighborhood graph, efficiently utilizing the spatial geometric information of point cloud objects. EGCT thus learns comprehensive geometric information of point cloud objects, which helps to improve segmentation and classification accuracy. On the ShapeNetPart and ModelNet40 datasets, our EGCT method achieves a mIoU of 86.8%, and OA and AA of 93.5% and 91.2%, respectively. On the large-scale indoor scene point cloud dataset (S3DIS), the OA of the EGCT method is 90.1%, and the mIoU is 67.8%. Experimental results demonstrate that our EGCT method can achieve point cloud segmentation and classification performance comparable to state-of-the-art methods while maintaining low model complexity. Our source code is available at https://github.com/shepherds001/EGCT .
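As a rough illustration of the first step described in the abstract (constructing a graph topology over a disordered point cloud), here is a minimal k-nearest-neighbour sketch in Python. It is not the authors' implementation, and a real pipeline would use an accelerated neighbour search rather than brute-force distances:

```python
import numpy as np

def knn_graph(points: np.ndarray, k: int) -> np.ndarray:
    """Return an (N, k) array of neighbour indices for an (N, 3) point cloud."""
    # Pairwise squared Euclidean distances, shape (N, N).
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff * diff, axis=-1)
    np.fill_diagonal(dist2, np.inf)  # exclude self-loops
    # Indices of the k nearest points per row define the graph topology.
    return np.argsort(dist2, axis=1)[:, :k]

cloud = np.random.default_rng(0).random((1024, 3))
neighbors = knn_graph(cloud, k=16)
print(neighbors.shape)  # (1024, 16)
```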
The ShapeNetPart, ModelNet40 and S3DIS datasets used in this study were obtained from public domains and are available online at https://shapenet.org/ , https://modelnet.cs.princeton.edu and http://buildingparser.stanford.edu/dataset.html .
This work was sponsored by Natural Science Foundation of Shanghai under Grant No. 19ZR1435900.
Authors and affiliations.
College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China
Gang Chen, Wenju Wang, Haoran Zhou & Xiaolin Wang
GC contributed to conceptualization, methodology, resources and software; GC, HZ and XW performed data curation; WW carried out formal analysis and funding acquisition; WW, HZ and XW performed supervision and validation; HZ and XW contributed to visualization; GC performed writing-original draft; GC and WW performed writing-review and editing. All authors have read and agreed to the published version of the manuscript.
Correspondence to Wenju Wang.
Conflict of interest.
The authors declare no conflict of interest.
Publisher's note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Chen, G., Wang, W., Zhou, H. et al. EGCT: enhanced graph convolutional transformer for 3D point cloud representation learning. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03600-2
Accepted: 01 August 2024
Published: 23 August 2024
DOI: https://doi.org/10.1007/s00371-024-03600-2