09 Nov

Plenty of Room, Part II

This is the second part of the epic (two three-part) series of articles about tiny intros, the previous part was an essay about 256-byte intros in general.

Ed. note: Since this article seems to take forever to finish, here’s the first half of it. The (hopefully) final part will detail more of the specifics.

rubba As stated earlier, tiny intros written in assembly language fascinate me. I have written a few in x86 assembly language, here’s one of them. I have tried to make the inner workings of the program as accessible — or, at least as thought-provoking — as possible even if assembly wasn’t their weapon of choice.

I’ve included the original DOS intro (you can use DOSBox to run it, it should also work on Windows XP) and a Win32 port of it, written in C while trying to keep the original structure intact. I’ll also try to explain the general idea of the effect in pseudocode where syntax can be an obstacle. The archive rubba_c.zip contains the source code, rubba_b.exe which is the compiled Win32 executable and RUBBA.COM which is the 16-bit MS-DOS executable. To compile the C program, you need the TinyPTC library (like SDL but even more bare-bones).

I won’t go into details about x86 assembly language, I suggest anyone interested first reads an introduction and learns some of the basics. However, I’ll try to make the code look interesting, explain some weird or obfuscated code and probably show some basic size optimization tricks.

The Effect

The intro, called rubba_b, shows one effect: a twisting bar drawn using primitive light-shaded texture mapping. The color palette and the texture are generated run-time. The texturing is done using linear interpolation and no vector math is used even if the bar looks like it is rotated. The lighting is an extremely simple approximation of the light source being exactly where the camera is located. That is, the length of the textured line directly determines the light.

If looked from above, the bar will be a tower of squares. If one of the squares is rotated around the center, the corners will move in a circular pattern. So, the X coordinate will be equal to cos(corner_number*(360 degrees/4)+square_rotation), the Z coordinate (why Z? Because it goes towards the screen) is equal to the sine but it can be discarded because we will not need it for perspective calculation. Remember, we’re short on bytes.

We then modulate the bar rotation for each square in the tower. If the amount of rotation changes when moving up or down the tower, the bar will twist. If the rotation stays the same for each square, the bar will rotate uniformly and look uninteresting.

The textured horizontal line is drawn from each corner to the next corner, from left to right. If the line would be drawn from right to left, we know it isn’t facing towards the camera, another line facing the camera will be drawn over it and we simply skip the line. The color value fetched from the texture is multiplied by the line length which makes short lines darker.

Still with me?

The Code

Initialization

First things first. We need to set the video mode before we continue. In the Win32 version we simply open a TinyPTC window, in the original version we need to tell BIOS to go in a nice 320x200x8 video mode (the de facto video mode back in the day).

C asm
ptc_open("rubba_b",320,200)
mov al,13h
int 10h

In the above code, the Win32 part is self-explanatory. The assembly bit probably needs some clarification: we put the number 13h (hex) in the 8-bit register al and we request for interrupt 10h. This is the BIOS video interrupt and since register ax (which consists of al – “low” – and ah – “high”) equals to 0013h (when the program starts, ax is always zeroed), BIOS will call the routine for changing the video mode (ah=00h) to 13h.

If above sounds complicated, don’t worry. It’s just a matter of memorization – similar to how you would memorize the function name for changing the video mode.

The next thing we need is some space for the texture, the sine table and the double buffer. In the Win32 version this is obvious, we just add a few arrays (although since TinyPTC only supports 32-bit video modes, we will also have an array for the palette). Again, in the assembly version we won’t use any fancy way to reserve memory to save some precious bytes: we simply decide certain areas of the memory will be used for whatever we want. The benefits of no memory protection and single tasking. ;)

C asm
short int sinetab[SINETAB];
unsigned char palette[256*4];
unsigned char texture[256*256];
unsigned char screen[320*200];
mov dh,80h
mov gs,dx           
mov dh,70h
mov es,dx
mov dh,60h
mov fs,dx

The assembly version basically sets the register dx to 60xxh-80xxh (we set only the high byte, i.e. dh to save bytes, thus the low byte of dx is undefined – it won’t matter) and puts the value into various segment registers (es-gs).

This makes it so that if we use the different segment registers, we can access each of the three 64 kilobyte segments as a 64 kilobyte array. E.g. the sine is in gs, thus mov ax,[gs:2] would move the second word in the sine table in ax. In C, the equivalent would be short int value=sinetab[1] (note how the C compiler knows the fact that a word is 2 bytes but in assembly you have to take care of that yourself – all memory accesses are always by the exact location, not the array element!).

All this is because in 16-bit memory accessing you can see only 64 kilobytes at one time. You can’t have a 128 KB array, nor can you have two 64 K arrays in the same segment. It’s something comparable to having a flashlight and a big sheet of paper in a dark room; you can move the light to show different areas but you will never see more than what there is in the spotlight.

The next two parts calculate the sine table (back in the day you simply could not do trigonometric stuff real-time, even in hardware — although in the intro it’s there just for show) and set the palette. This is pretty straight-forward stuff across the two versions. The only difference is that in the Windows version we have to remember the palette has 8-bit color components and the original has 6-bit components (0..255 ~ 0..63). And of course, the Windows version simply puts the values in a palette look-up table (because 32-bit video mode doesn’t use a palette) and the original actually sets the video mode colors.

I won’t reiterate the source code for the sine table and palette change here, I think you should be able to figure it out by comparing the source code. But in theory, here’s how to change a color: first, write the color index in port 3C8h, then write the three color components in port 3C9h (hint: dx first contains 3C8h and it’s incremented to 3C9h to save bytes).

The sine routine increases the angle (st0 the topmost register on the FPU) by 2*PI/32768 (a full circle is 2*PI, the sine table has 32768 elements). You probably should check some FPU tutorial, arithmetic on the 8087 is pretty interesting due to the stack-based architecture. For example, you first push in two numbers and then do an add, which (in most cases) pops out the two values and pushes in the result.

The texture generation bit is interesting. It also was annoying to port to C thanks to the fact you have to emulate how some instructions work – there are no accurate analogies in the C standard. A big thanks goes to baze whose code I originally cannibalized for this (I think). To be honest the conversion doesn’t work 100 % accurately but does still produce a nice texture.

The algorithm uses addition, bitwise operations and other simple things to achieve complexity thanks to how processors do arithmetics. Mainly, the results from an earlier calculation is carried over to the next calculation — when an addition or a subtraction overflows, i.e. the result is too large or too small to fit in a register, the processor lets the result wrap over but sets the carry flag.

This is quite similar to how you would carry numbers when calculating with a pen and a paper. The flag affects the results unpredictably because it’s used across quite different operations; usually you would just use to to add big numbers together as in the pen and paper example.

The Main Loop

Here is the meat of the code. The C version has many variables that are named after registers in order to see the connection with the original code. Sometimes, as with the 8-bit registers, some code doesn’t work exactly as in the original because you can’t do things similarly in C. E.g. you can’t have two variables that also are a part of one big register similarly how al and ah form ax (well, you can with pointers or unions but that is besides the point, kind of).

Self Modifying Code

I use self modifying code (SMC) in a few places because it produces faster and also much simpler code. For example, if you have a variable that is changed in a few places but used by one instruction only (and the instruction performs arithmetic or otherwise accepts a constant parameter), it’s much faster to store the variable where the constant for the instruction would be. That way you don’t have to read the variable in a register and then use the register to do something.

E.g. Let’s multiply cx by the variable foo:

Original SMC
  push ax ; save ax
  mov ax,[foo] ; move variable foo in ax
  imul cx,ax ; multiply cx by ax
  pop ax  ; restore ax
  ...
  mov ax,123   ; set foo ...
  mov [foo],ax ; ... to 123
  ...
foo: dw 0
  imul cx,123
foo equ $-2
  ...
  mov ax,123   ; set foo ...
  mov [foo],ax ; ... to 123

We can exploit the fact imul (signed multiplication) accepts constant multipliers. If you looked at the code with a hex editor, you’d see 123 next to the opcode. You can change the constant run-time and you do that exactly like you would change a variable: if you just define foo as the address where the constant is (the above code defines it as the last two bytes (i.e. word) of the previous instruction: in NASM, $ is the current location and $-2 equals the address of the previous word).

To be concluded…

08 Sep

You Can Stop Programming Now

The above is puls, a 256-byte intro by ?r?ola. It’s basically a raytracer with screen space ambient occlusion (which makes it so much realistic and cooler). While tube — which I think was the best 256-byte intro until now (when design and code are judged together) — also did raytracing of a cylinders, and after that many other intros did similar tracing of more complex surfaces, puls simply crushes all of them with objects that are formed by multiple plane surfaces (e.g. a cube would be a combination of six intersecting planes), a very nice color palette and that delicious ambient occlusion.

Thinking it out in C and sketching it out in asm took about a week, byte crunching took another one… that’s like forty hours of full focus and eighty of playing.

It’s also really, really slow which is the only minus especially because you can’t run 16-bit executables on Windows 7, so you have to use DOSBox to watch it (or, use a boot floppy to run it or something). There’s now a Windows port including a screensaver, see the Pouet.net page for more. A big thank you to nordak5 who was kind enough to upload a video on Youtube.

?r?ola has also included source code with the binary that you can find over here. That said, I’ll be deleting all my own source code since perfection has finally been achieved and there is no need for programmers anymore.

22 Oct

There’s Plenty of Room at the Bottom

This is the first part of the epic (two-part) series of articles about tiny intros, the next one will be about an actual intro.

I love tiny graphical presentations called “intros” in the demoscene and by tiny I mean 256 bytes tiny. Usually, at this point people mention a paragraph of text is well over 256 bytes worth of data. However, you should not think that computer needs the same amount of data to store information such as “multiply variable X by variable Y” as we do. That textual representation needs 34 bytes of storage, less if you write it using shorter words or simply an equation. What a processor needs would be something like 3 instructions, 2-4 bytes each.

However, my point is that 256 bytes is ridiculously little for most people (even though you can hardly impress a mouth breather playing Halo 3 by stuffing amazing things in that space). At those sizes the source code is many times bigger than the resulting program. On the other hand, 256 bytes is plenty of space for those who are crazy enough to write such things. Why executables tend to be much, much bigger than that is because there is a lot of stuff in there what the programmer didn’t put in there himself/herself.

The Code

An obvious source of bloat in the executable is the fact programs are written in high-level languages such as C++ and then compiled into code that the processor can run. Compilers generally do a good job at producing working executables with fast code and they don’t even try to save space. Add tons of modern features – object polymorphism, runtime error checking et cetera – into that and you have your normal 100 kilobytes of code for the smallest program that necessarily doesn’t even do anything but exit back to desktop. On top of that, the executable file also houses information such as which libraries to use, the program icon and so on.

One way around the problem above – and it is a problem only if you for some reason are decided to spend you time doing something very few find nice – is to write the programs in more condensed languages. For example, C produces smaller code than C++ (thanks to the fact there is minimal extra fluff in C) and writing everything in assembly language produces pretty much as small code as it is possible to write because it basically is the human-readable version of machine language. A reminder: those modern languages (and C isn’t even modern as it was conceived in the 1970s) were created because writing complex things in assembly is not everyone’s cup of tea – that makes writing very tiny stuff in assembly not-everyone’s-cup-of-tea squared.

“rubba_b” by kometbomb…

… and the same for Windows – 254 bytes vs. 9278 bytes.

Another layer of arcane computing that makes a 256-byte executable possible is the use of old operating systems such as DOS. As you read above, the executable contains information about libraries and such, and back in the 80s hardly any shared libraries were used at all anyway. So, an executable pretty much was only the compiled program code and that was it. No extra fluff. Using this old executable format and assembly language the programmer essentially is the only factor that has any effect in the size of the executable.

Ironically, while old operating systems with no libraries do not provide any modern conveniences like fast 3D graphics or a mentionable interface to anything at all, this is a blessing. You can access the hardware directly which means a lot less code to do things like drawing a pixel on the screen. You don’t even have to set up those libraries because there isn’t any. As an example, the same intro (albeit converted to C code from assembly) in Windows XP is about 10 kilobytes in size, while the original version for DOS is less than 256 bytes. And, it would be even larger in Windows if it used the hardware to do the things it does in software.

The Data

Yet another reason why things like computer games or even larger intros are huge compared to the tiniest intros is because they come with a lot of precreated data like graphics and music. For 256-byte intros even a precreated flightpath for the camera is pretty much a no-go. Or, a 3D object with, say, more than four faces. That means we have to use procedurally generated data which is in the vogue right now even outside size restrictions, with things like Spore.

Actually, in many new school 256-byte intros the data is not even precalculated but simply calculated while the intro draws it on the screen. This is to save the space you would need if you had both the drawing loop and the precalculation loop (remember, even the loop structure uses precious bytes) – thanks to the fact even slow things are fast with a relatively new computer.

Speaking of slow things, one of them is raytracing. Which is purely mathematical, which means very little precreated material. Raytracing also is one way to produce realistic three-dimensional images, which as a rule of thumb, look nice. So, it’s not a surprise that a modern 256-byte intro uses raytracing, or raycasting, to simulate nature, look amazing and still fit in the specified size limit. If it’s 256 bytes and 3D, it probably uses some blend of raycasting. Older intros usually were versions of familiar 2D effects such as the rotozoomer, the fire effect, the Mandelbrot set and other well-explored routines.

The Art

Even among demo afficianados there is some controversy whether the tiniest intros have any merit as art or if they simply are a proof of concept of how small you can make something. There is no hidden gold vein as in nanotechnology in there. While most tiny intros are for the most part coder porn, nobody can deny their status as art installations. And if you stay in the 128-256 byte range, most of the popular tiny intros are quite pretty too, especially considering all the compromises that had to be made to make it fit.

The ultimate irony in downplaying the merits of the intros is that the demoscene pretty much was born from people pushing the machine to the limits. Well, nowadays to create a great demo (think of it as a large intro with many parts) you have to push your artistic talent more than the limits of the machines. It’s more like creating a music video, though, with your own tools. Seems like the final limit is the man-made limit of the smallest size you can fit something in and still call it art with a clean conscience.