Wednesday, November 3, 2021

Paolo Amoroso: An Intel 8080 Assembly Suite in Python

A blog post I stumbled upon made me start a new project, crank out lots of Python code, slip down a rabbit hole of arcane and fascinating corners of retrocomputing, and overflow with fun.

The project is Suite8080, a suite of Intel 8080 Assembly cross-development tools comprising an assembler and a disassembler. I developed it in Python entirely with Replit. At over 1,500 lines of code, it’s my second and largest Python project after Spacestills, a NASA TV still image viewer of about 340 lines of code.

A hello world Intel 8080 program running in the z80pack CP/M emulator on Crostini Linux. I assembled the program with asm80, the Suite8080 assembler.

Why did I write software for a half-century-old CPU? This is the story of how I got started with Suite8080, how I developed it, the challenges I faced, and what I learned.

Let’s start before the beginning.

Background

I’m a hobby programmer and a Python beginner, not a professional developer.

To practice with the language, I finally set out to work on programming projects such as Spacestills, which draws on my major interests in astronomy and space and on my work experience in outreach and education in these fields.

I’ve always been a computer enthusiast too. I’m old enough to have lived through the personal computer revolution of the 1970s and 1980s as a teenager, a revolution to which the Intel 8080 contributed a lot. It was the CPU of the early popular personal computers running the CP/M operating system, such as the Altair 8800 and IMSAI.

Witnessing the birth of technologies, devices, and companies that promised to change the world was a unique experience.

Back then, there was little distinction between using and programming computers. So, along with the popular high-level languages like BASIC and Pascal, I experimented with Zilog Z80 and Motorola 68000 Assembly programming. These experiences seeded my interest in low-level programming, languages, and development tools.

The stars aligned again in 2021, when I stumbled upon the idea that became Suite8080.


Origin of the project

A post shared on Hacker News from the blog of Dr. Brian Robert Callahan, a computer science academic and developer, kick-started my project. While browsing his blog archive I found an intriguing series of posts on demystifying programs that create programs, where he describes the development of an Intel 8080 disassembler, an assembler, and other tools.

Brian demystifies these technologies by a combination of simplified design, a preference for straightforward techniques over established algorithms, and clear commentary. He wrote:

«Second, we are not necessarily writing the assembler that someone who has decades of experience writing such tools would write. [...] We are writing an assembler that someone who has little to no programming experience or knowledge can come to understand purely through engagement with the code itself and a series of blog posts.»

Brian’s code and explanations are not just clear, they are motivating and fun.

The code in his posts is written in D (also known as Dlang), a system programming language in the C family I didn’t even know existed. But D is so readable, and Brian’s commentary so good, that his published D code is as effective as pseudocode.

As I read the posts, I frequently nodded thinking to myself "I can do that". So the idea of rewriting the Assembly tools in Python was born.

Developing assemblers seemed beyond my abilities, but two insights on Brian’s assembler sold me on the project idea.

First, Brian pointed out an assembler needs to examine just one source line with a simple syntax at a time, so the complexity of processing an entire program reduces to processing one line. Second, no fancy traditional parsing algorithms are required. An assembler can just search for a few symbols that separate the syntactic elements and split the line into the elements.


Goals

What could I learn from the project?

Besides improving my Python skills, with Suite8080 I set the goal of exploring interesting application domains such as Assembly programming and development tools. And having fun, lots of it.

The venerable Intel 8080 is a valuable learning playground at a sweet spot between simplicity, functionality, and depth. The chip comes from an era when CPUs and computer systems were simple and could be understood in full, memory layouts were fixed and simple, and there were no concerns over instruction pipelines, timing, power management issues, or other complications of modern processors.

Another goal was extending Brian’s tools with useful features, deviating from his design as necessary, and following the path to where it led.

Running executable 8080 code is not only a necessity for testing an assembler but also an opportunity for experiencing low-level programming. Therefore, I wanted my code to run on 8080 and CP/M emulators, another retrocomputing rabbit hole.

As for the Python learning opportunities, I knew my next project after Spacestills would be larger and require more structure, a few modules, and packages. Therefore, I wanted to try tools and techniques I skipped with Spacestills such as multi-module systems, automated testing, command-line scripts, and publishing packages on PyPI.

Don’t tell anyone, but I enjoy writing tests.

In a previous life, I was a Lisp enthusiast and always interactively tested expressions, functions, and code snippets in the REPL. This drove home the importance of creating tested and reliable building blocks to combine and expand on.


The Suite8080 tools

Suite8080 includes the first two tools Brian covers in his series, an 8080 disassembler and a cross-assembler. I named mine dis80 and asm80, which are command-line Python scripts.

This Linux shell session demonstrates how to run the tools and what their output looks like:

(venv) paoloamoroso@penguin:~/python/suite8080$ asm80 -v greet.asm 
Wrote 35 bytes
(venv) paoloamoroso@penguin:~/python/suite8080$ dis80 greet.com
0000 0e 09 mvi c, 09h
0002 11 09 01 lxi d, 0109h
0005 cd 05 00 call 0005h
0008 c9 ret
0009 47 mov b, a
000a 72 mov m, d
000b 65 mov h, l
000c 65 mov h, l
000d 74 mov m, h
000e 69 mov l, c
000f 6e mov l, m
0010 67 mov h, a
0011 73 mov m, e
0012 20 nop
0013 66 mov h, m
0014 72 mov m, d
0015 6f mov l, a
0016 6d mov l, l
0017 20 nop
0018 53 mov d, e
0019 75 mov m, l
001a 69 mov l, c
001b 74 mov m, h
001c 65 mov h, l
001d 38 nop
001e 30 nop
001f 38 nop
0020 30 nop
0021 2e 24 mvi l, 24h
(venv) paoloamoroso@penguin:~/python/suite8080$

The session starts by executing asm80 to assemble the greet.asm source file, a hello world program that runs on CP/M, and then disassembling it with dis80. The -v verbose option prints how many bytes the assembler wrote to the output file greet.com.

As with most other disassemblers, mine can’t tell code from data bytes. The above session shows this: everything dis80 prints after the ret instruction at address 0008 is spurious. The data area, which stores the $-terminated string Greetings from Suite8080.$ the program prints, starts at address 0009, and dis80 disassembles it without realizing there’s no code there.
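To see that the "instructions" after ret are really text, the bytes from the hex dump column of the session above can be decoded as ASCII. This is only a quick check in Python, not part of Suite8080, and the variable names are made up for the illustration:

```python
# Bytes copied from the hex dump column of the session, addresses 0009 to 0022.
data_area = bytes.fromhex(
    '47 72 65 65 74 69 6e 67 73 20 66 72 6f 6d 20 '
    '53 75 69 74 65 38 30 38 30 2e 24'
)

# Decoding the same bytes as ASCII reveals the $-terminated string.
message = data_area.decode('ascii')
print(message)

# 9 bytes of code plus the data area account for the whole file.
total_size = 9 + len(data_area)
print(total_size)
```

The 9 bytes of actual code plus these 26 data bytes add up to the 35 bytes asm80 reported writing.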

Brian’s assembler accepts a good fraction of the Assembly language and directives of early 8080 assemblers, such as the ones by Intel, Microsoft, and Digital Research. But for explanatory purposes, he skipped some features to keep the code simple. For example, the db memory allocation directive takes only one argument. The consequence is initializing a memory block with several data bytes requires as many one-argument db clauses as the bytes or strings to initialize.

As the work proceeded and I gained more experience with Intel 8080 programming, I extended the Suite8080 tools with features that add convenience and expressiveness.

For example, I changed the assembler to accept db with multiple arguments that may be character constants, such as 'C' or '*', or labels. Character constants may also be immediate operands of Assembly instructions and the equ directive supports character constants too. Why does this matter? Consider an Assembly instruction that compares the value of the accumulator with an ASCII character: the source can contain the character itself instead of its numeric code.
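As a rough illustration of what a multi-argument db involves, here is a minimal sketch that turns an argument list into bytes. It is not the asm80 code: parse_byte and encode_db are hypothetical names, labels are not handled, and a comma inside a character constant would break the naive split.

```python
def parse_byte(token):
    """Convert one db argument (hypothetical helper) to an integer."""
    token = token.strip()
    # Character constant such as 'C': use the ASCII code of the character.
    if len(token) == 3 and token[0] == token[2] == "'":
        return ord(token[1])
    # Hexadecimal number with the h suffix, such as 0dh.
    if token.lower().endswith('h'):
        return int(token[:-1], 16)
    # Otherwise assume a decimal number.
    return int(token)


def encode_db(arguments):
    """Turn the comma-separated arguments of a db directive into bytes."""
    return bytes(parse_byte(token) for token in arguments.split(','))


print(encode_db("'H', 'i', 0dh, 0ah, '$'"))
```

With the same helper, an instruction like cpi 'A' would assemble to the same immediate byte as cpi 41h, since both operands evaluate to 65.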

Brian’s assembler doesn’t support macros but, by cheating a bit, I turned asm80 into a macro-assembler. All I did was extend asm80 to read the source from standard input, which gave me macros for free via the standard Unix program m4. Here is a Linux session showing how to use m4 macros with asm80:

The file ldabcmac.m4 holds a macro definition included by the ldabc.m4 Assembly program. The session prints the source files, assembles them with asm80, and disassembles the executable program with the dis80 disassembler. The following commands are executed:

$ cat ldabcmac.m4
$ cat ldabc.m4
$ cat ldabc.m4 | more
$ cat ldabc.m4 | m4 | asm80 - -o ldabc.com
$ dis80 ldabc.com

In the above shell session, as well as the previous one, the hexadecimal dump of the opcode and operands next to the instruction address is an extension to Brian’s disassembler I added to dis80.
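The standard input trick behind the m4 pipeline takes only a few lines of Python. The sketch below assumes, as in the asm80 command above, that a file name of - means "read from standard input". The function name read_source is hypothetical and the real asm80 code may be organized differently.

```python
import sys


def read_source(filename):
    """Return the Assembly source text (hypothetical helper).

    A file name of '-' is taken to mean standard input, so the output
    of another program such as m4 can be piped into the assembler.
    """
    if filename == '-':
        return sys.stdin.read()
    with open(filename, encoding='utf-8') as file:
        return file.read()
```

Because the assembler never knows whether its input went through a preprocessor, any tool that writes Assembly source to standard output can sit in front of it.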


Development environment

On the desktop I use Chrome OS only and my daily driver is an ASUS Chromebox 3, an Intel i7 machine with 16 GB RAM and 256 GB storage.

Developing in Python with Replit is the best fit for my cloud lifestyle.

The Suite8080 Replit workspace on my ASUS Chromebox 3.

I love Replit, a multi-language development environment that works fully in the cloud, requires no software installation, and synchronizes across devices out of the box. Replit also comes with a nice client for GitHub, which hosts the Suite8080 project repo.

Although I develop on Replit, I install and test Suite8080 also on Crostini, the Chrome OS Linux container that runs Debian. This helps ensure my code is portable and has no obvious system dependencies, as well as check that installing Suite8080 from PyPI works as intended.


Design and coding

Suite8080 began as a Python port of Brian’s D code.

I originally used equivalent Python features, focusing on making the tools run rather than on writing Pythonic code, which could come later. Implementing minimal or no input validation helped evolve the tools quickly to a mostly complete state, deferring refinements to later.

I closely followed Brian’s design and code structure, introducing a few changes to make identifiers less terse or more descriptive, or reorganizing functions for adapting the code to the features I wanted.

I decided to start this way and evolve the system to where it would lead, such as new features, better algorithms, or more Pythonic code.

I worked quickly and confidently on the disassembler and the assembler because of Brian’s clear design and commentary. I tried to make the code easily understandable — well, to me — with strategic comments and documentation.

Brian took two key design decisions that simplified his assembler, which I adopted in mine.

First, he didn’t rely on recursive descent or traditional parsing algorithms. Instead, Brian’s parser scans a source line for the symbols that separate syntactic elements, and splits the line at the symbols to isolate the elements.

For example, consider the syntax of a source line:

[label:] [mnemonic [operand1[, operand2]]] [; comment]

If the parser finds a semicolon, it splits the line at ; to isolate the comment text from the rest of the line to parse. Next, it looks for a comma separating the operands of an assembly instruction and splits there, thus isolating the second operand from the rest of the line to parse. And so on.
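The splitting idea can be sketched in Python with str.partition(). This is a simplified illustration rather than the asm80 parser: parse_line is a hypothetical name, the order of the splits is just one possible choice, and the sketch assumes a single space after the mnemonic and no ; , or : characters inside character constants.

```python
def parse_line(line):
    """Split a source line into (label, mnemonic, operand1, operand2, comment)."""
    # Everything after the first semicolon is the comment.
    code, _, comment = line.partition(';')

    # A colon, if present, ends the label.
    label, colon, rest = code.partition(':')
    if not colon:
        # No label: the whole code part is the instruction.
        label, rest = '', code

    # The first space separates the mnemonic from the operands.
    mnemonic, _, operands = rest.strip().partition(' ')

    # A comma separates the first operand from the second.
    operand1, _, operand2 = operands.partition(',')

    parts = (label, mnemonic, operand1, operand2, comment)
    return tuple(part.strip() for part in parts)


print(parse_line('start: mvi c, 09h ; print string'))
```

Missing elements come back as empty strings, so a line holding only ret or only a comment needs no special handling.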

This approach has limitations, such as fragility and no support for arbitrary arithmetic expressions.

The other design decision Brian took for explanatory purposes is to store state in global variables, half a dozen in the assembler and a couple in the disassembler. As a result, most functions don’t return values but perform side effects.

This simplifies the implementation and description of the tools but, as with Spacestills where I used global state too, I’m not completely satisfied. Not so much because of the implications for the maintainability and extensibility of the system, but because global state complicates testing.

It’s not that we beginners aren’t aware of the drawbacks of global variables. It’s just that few Python instructional resources explain the alternatives well.
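One common alternative, sketched here under assumptions and not taken from Suite8080, is to pass the state in as an argument and return the result instead of printing it. The tiny INSTRUCTIONS dictionary is a hypothetical stand-in for the full opcode table, and the output format is simplified.

```python
# Hypothetical, heavily abbreviated table: opcode -> (mnemonic, size in bytes).
INSTRUCTIONS = {
    0x00: ('nop', 1),
    0x0e: ('mvi c,', 2),
    0x11: ('lxi d,', 3),
    0xc9: ('ret', 1),
    0xcd: ('call', 3),
}


def disassemble(program):
    """Return the disassembled lines of program (bytes), using no globals."""
    lines = []
    address = 0
    while address < len(program):
        mnemonic, size = INSTRUCTIONS[program[address]]
        if address + size > len(program):
            break
        operand_bytes = program[address + 1:address + size]
        # Operands are little-endian: reverse them to print the msb first.
        operand = operand_bytes[::-1].hex()
        text = f'{mnemonic} {operand}h' if operand else mnemonic
        lines.append(f'{address:04x} {text}')
        address += size
    return lines


for line in disassemble(bytes.fromhex('0e09110901cd0500c9')):
    print(line)
```

A test can then call disassemble() with a few bytes and compare the returned list with the expected lines, with no global variables to set up or reset.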


Testing and debugging

I was flying blind when developing the disassembler, the first tool I coded.

Since I had no matching pairs of 8080 sources and binaries handy, I had to wait for the early work on the assembler to test dis80 by comparing the disassembled binaries assembled by asm80 with the corresponding sources.

I was lucky. Making the disassembler is really straightforward, and the tool worked correctly almost immediately.

Checking out the assembler was trickier.

Testing involved manually evaluating in the Python REPL expressions that called the various assembler functions and components, comparing the output with the expected behavior.

I implemented one new Assembly mnemonic or directive at a time and checked it out by assembling a one-line test program exercising the instruction or directive. Next, I changed the program to add variations, such as different operands or arguments, a label or comment, and so on. I occasionally tested by feeding the assembler longer demo 8080 programs I collected in the asm directory of the source tree.

This workflow gave me confidence in the correctness of processing individual instructions and directives.

Pytest helped a lot with automation. However, the global state and side effects, as well as most functions not returning values, made it hard to write the tests and allowed covering only a fraction of the system. In the tests, I found it difficult to reference the global variables and mock them, an area of Python still confusing for me.
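For functions that read a global variable and only print, one possible way to test them is to set the global, capture standard output, and restore the global afterwards. The sketch below uses only the standard library. The names dump and run_dump are hypothetical, and dump is a stand-in for a real function such as the disassembler.

```python
import contextlib
import io

program = b''  # module-level state, as in the tools described above


def dump():
    """Stand-in for a function that reads a global and only prints."""
    for address, byte in enumerate(program):
        print(f'{address:04x} {byte:02x}')


def run_dump(data):
    """Set the global, capture what dump() prints, then restore the global."""
    global program
    saved, program = program, data
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            dump()
    finally:
        # Restore the previous state even if dump() raises.
        program = saved
    return buffer.getvalue()


print(run_dump(b'\xc9\x00'), end='')
```

Pytest offers the monkeypatch and capsys fixtures for the same two jobs, replacing an attribute for the duration of a test and capturing printed output.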

As for debugging, the problem was not so much the manual tasks, but getting stuck with no clue on how to fix an issue.

Print-debugging saved the day on more than one such occasion. Don’t tell the good folks at Replit, but I didn’t use their debugger. It’s awesome, but I couldn’t figure out how to invoke it in a multi-file system. Instead, I used strategically placed print statements that displayed the process state and made clear where my assumptions were wrong.


The code

I’m mostly pleased with the code I wrote.

Aside from debugging and testing issues, it’s still easy for me to read, understand, and extend the code. To keep the code just as accessible when it’s no longer fresh in my mind, I’m adding more documentation and comments.

To give an idea of what the code looks like, here is the full source of the disassembler, minus the comments and with the giant instruction table abbreviated for clarity.

import argparse

MNEMONIC = 0
SIZE = 1

instructions = [
    ('nop', 1),
    ('lxi b,', 3),
    ('stax b', 1),
    ...
    ('call', 3),
    ('cpi', 2),
    ('rst 7', 1)
]

program = b''


def disassemble():
    address = 0
    mnemonic = ''
    program_length = len(program)

    while address < program_length:
        opcode = program[address]
        instruction = instructions[opcode]
        mnemonic = instruction[MNEMONIC]
        size = instruction[SIZE]

        if address + size > program_length:
            break

        arg1 = arg2 = '  '
        lsb = msb = ''

        if size > 1:
            if size == 3:
                arg2 = f'{program[address + 2]:02x}'
                msb = f'{program[address + 2]:02x}'
            arg1 = f'{program[address + 1]:02x}'
            lsb = f'{program[address + 1]:02x}h'
        output = f'{address:04x} {opcode:02x} {arg1} {arg2}\t\t{mnemonic} {msb}{lsb}'
        print(output)

        address += size


def main():
    global program
    dis80_description = f'Intel 8080 disassembler / Suite8080'

    parser = argparse.ArgumentParser(description=dis80_description)
    parser.add_argument('filename', type=str, help=' A file name')
    args = parser.parse_args()
    filename = args.filename

    with open(filename, 'rb') as file:
        program = file.read()
        disassemble()


if __name__ == '__main__':
    main()

After a couple of constants, there’s a table sorted by opcode holding entries comprising the symbolic instructions and their byte length. Next come the disassembly function and the main function. The disassembler dispatches on the opcode, fetches the argument bytes if any, and prints the resulting source line preceded by the address and a dump of the opcode and argument bytes.

To understand the details of the disassembly function note that:

  • the 8080 is little-endian
  • instructions have a one byte opcode followed by 0 to 2 argument bytes
  • if an argument is a 2-byte sequence, it represents a 16-bit value
  • with 16-bit arguments, the disassembler swaps the two bytes in the instruction source so that the value prints with the most significant byte (msb) first, followed by the least significant byte (lsb), which is easier to read
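The byte order in the last point can be checked with a few lines of Python, using the lxi d, 0109h instruction from the earlier session as the example. The variable names are chosen only for this illustration.

```python
# The three bytes of lxi d, 0109h as they appear in the program file.
instruction = bytes.fromhex('110901')

opcode = instruction[0]
lsb = instruction[1]  # the least significant byte is stored first
msb = instruction[2]  # the most significant byte is stored second

# Printing msb then lsb gives the value in the usual reading order.
operand_text = f'{msb:02x}{lsb:02x}h'
print(operand_text)

# The same 16-bit value, computed by treating the two bytes as little-endian.
value = int.from_bytes(instruction[1:3], byteorder='little')
print(f'{value:04x}h')
```

Both prints show 0109h, matching the operand in the disassembled line.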

That’s it. Brian really demystified these tools.

The assembler code is longer and less polished. The major source of complexity is the parser, the longest function. I made major changes to the parser to extend the functionality of the db directive and make it accept multiple arguments. Brian’s parser instead limits Assembly mnemonics or directives to take from zero to two arguments. This and other new features I implemented caused changes in a few more places.

See the full code of the assembler in the source tree.


Challenges

Working on Suite8080 meant learning pytest, 8080 programming, Assembly tools, and CP/M at the same time. Aside from bootstrapping my understanding of these topics, there were other challenges.

Some challenges came from Python itself. For example, I still don’t fully grasp importing modules and packages and referencing identifiers. My confusion had consequences also on testing with pytest as discussed earlier.

Other challenges came from the porting work.

The parser ended up as a bit of a mess because I misunderstood how the findSplit() function of D works. Brian’s parser calls this function, which scans a string from left to right, several times. However, in Python I replaced it with str.rpartition(), which instead scans from right to left. Also, I didn’t fully comprehend one of the special cases in Brian’s parser. All this made my parser more difficult to understand than I’d like.
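The difference between the two scanning directions is easy to see in the Python REPL. Python’s str.partition() splits at the first occurrence of the separator, like a left to right scan, while str.rpartition() splits at the last. The sample line below is made up for the illustration:

```python
line = "db 'a', 'b', 'c'"

# Left to right: split at the first comma.
first = line.partition(',')
print(first)

# Right to left: split at the last comma.
last = line.rpartition(',')
print(last)

# The two also differ when the separator is missing.
print('ret'.partition(','))   # the whole string ends up in the first element
print('ret'.rpartition(','))  # the whole string ends up in the last element
```

With zero or one separator in the line only the not-found case differs, but with several commas, as in a multi-argument db, the two methods isolate different elements.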


Lessons learned

Part of the value of Suite8080 comes from what I learn, which I hope will help me in future projects.

The main takeaway is that design is the limit to the growth and functionality of a system. If the design is not robust and scalable, its complexity hinders growth without a complete rewrite.

With the current Suite8080 design, I’m still at a stage at which I know where to put my hands to make changes or add features despite the relatively large code base. And I’m aware the system has design issues, even if I don’t have a solution yet — hey, it’s progress.

Suite8080 is making me love Python even more. But sometimes the Python code I write, which is supposedly at a higher level of abstraction than D, seems more verbose, like in the assembler’s parser.

Finally, I’m learning a lot about the tools I build Suite8080 with. For example, I ran across 4 things tutorials don’t tell about PyPI.


Next steps

My initial plan was to implement a couple of 8080 Assembly tools and call it a day. But this rabbit hole turned out so deep, and the fun was so great, I couldn’t stop working on Suite8080.

The next step is likely an IDE to tie together the assembler and the disassembler.

I’d also like to write a basic Intel 8080 simulator.

Why a simulator? I tried several 8080 emulators for testing my Assembly programs, but none are completely satisfactory for inspecting and debugging low-level code and machine state. Some are missing useful code or memory visualization features. Others don’t support directives or specific Assembly features. A few are just broken.

I can fix this only by building my own simulator. I’m not aiming at an emulator, which seems much more complex.

In his post series, Brian also discusses a linker and an object library archiver. I may implement these tools too, but extend them by storing object files in sqlite3 databases to see how far this gets me in terms of versatility.

It’s not as glamorous as making new tools, but I also need to refactor to more Pythonic code and add to the existing tools the features that daily usage will suggest.

My adventure with Suite8080 continues. If you’re interested in keeping up to date, subscribe to the RSS feed of the Python posts of my blog, or the complete RSS feed if you’re curious what other content I publish. You can also follow my @amoroso Twitter profile.



from Planet Python