A place where I can ramble about my projects. To get a permalink to any blog entry, click on the date of publish at the top of the post.

Mirror Git to Mercurial

June 17th 2020, 2:53:35 pm

Tired of Git absolutely destroying your work due to its garbage interface and dozens of built-in foot-guns? Need to use it to publish your code to GitHub? I keep a dual-repo of everything I work on locally so that I can do my real work in Mercurial but mirror manually to Git for publishing. It's a bit of a pain in the ass, but it's less of a pain in the ass than having to use Git in any way other than adding commits and pushing to GitHub. If you find yourself in a situation where you want to dual-repo an existing personal project, here is a handy set of command-line instructions that you can run on Linux to make a commit-accurate (but not time-accurate) mirror of your Git repo in Mercurial. Note that this is for personal projects, so it will assume you're the committer on everything. But maybe it will help you like it helped me?

  • List out all commits from HEAD (make sure to "git checkout trunk" or similar first).
git rev-list HEAD --reverse > /tmp/hgimport

  • Init mercurial
hg init

  • Copy gitignore, format it for mercurial syntax, ignore git repo
echo -e "syntax: glob\n\n.git/\n.gitignore\n.gitattributes" | cat - .gitignore > .hgignore

  • Go through commit list, check out commit, clean the repo (delete removed/untracked files), add/remove those files, commit them with the git commit message.
for I in $(cat /tmp/hgimport) ; do git checkout $I ; git clean  -d  -fx --dry-run . | sed 's/Would remove //' | grep -v "\.hg" | xargs rm ; hg addremove . ; hg commit -m "$(git log --format=%B -n 1 .)" ; done

  • Clean up temp export
rm /tmp/hgimport

Once you run the above steps, you should be able to re-run "git checkout trunk" or similar and see that your Mercurial repository is clean. You will have untracked .hg/ and .hgignore files in your Git repo, but it's easy enough to add them to your .gitignore in a commit. Once this is done, you can continue where you left off working in Mercurial. My preferred workflow is to get things to a point where I'm ready to publish, then one-by-one check out the Mercurial commits and commit them to Git so the history matches. It's not perfect, but it is virtually free of foot-guns and I haven't lost work yet in 3 years of doing this. Can't say the same about just using Git to do regular stuff.

MiniDragon Takes its First Steps!

May 1st 2020, 6:19:36 pm

Last time I posted about the MiniDragon CPU project I had assembled several components, sent out the heart of the instruction decoder to fabrication and was just beginning to assemble the (at the time) 14 additional register boards to complete the PC, IP and A registers. I had done some initial optimization of the existing stdlib and had a pretty good idea of how I wanted to lay out the boards physically. I had not yet tested any of the boards with each other, although I was relatively confident in their operations. A lot has changed since then! First and foremost, I have assembled enough of MiniDragon to successfully execute my first instruction! Last night, after debugging the last known issue with the boards, I was able to step through the microcodes for LOADI and set the A register to the sign-extended immediate value stored in the lower 6 bits of the LOADI opcode. This works both in manual clocking mode where I can single step with a pushbutton and in automatic mode where the clock generation circuit runs the CPU at a predetermined speed.
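As a rough sketch of what "sign-extended immediate value stored in the lower 6 bits" means in practice, the Python below copies bit 5 of the immediate field into bits 6 and 7. The function name and the two opcode prefix bits used in the example are made up for illustration; they are not taken from the actual MiniDragon opcode layout.

```python
def sign_extend_6(opcode: int) -> int:
    """Return the 8-bit value produced by sign-extending the low 6 bits of opcode."""
    imm = opcode & 0b00111111          # keep only the 6-bit immediate field
    if imm & 0b00100000:               # bit 5 is the sign bit of the field
        imm |= 0b11000000              # copy the sign into bits 6 and 7
    return imm


# The top two bits (0b10 here) are a hypothetical opcode prefix.
print(format(sign_extend_6(0b10010111), "08b"))
print(format(sign_extend_6(0b10111111), "08b"))
```

With an immediate field of 010111 the result is 00010111, and a field of 111111 becomes 11111111 (that is, -1 in two's complement).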

Problems with the Instruction Decoder

The instruction decoder is essentially a pull-down bus with 32 parallel control signals. There is a distributor which provides 1K pull-ups and a pair of 32-pin connectors designed to plug into the ROM boards as well as feed the various control signals for the CPU. The ROM boards consist of a series of jumpers allowing me to set which control signals should be pulled low when the ROM is active. There are two variants of the ROM board: a 4-position board which has two bits of addressing and an enable input, and a 1-position mini-board which has only an enable input. When a ROM board is enabled, it will select the correct set of jumpers and pull the correct bits of the bus low to set the control signals for a particular microcode. When it is not enabled, the open-collector transistors providing the pull-down effect will be deactivated and thus high impedance. This is the most elegant solution I could find for tying a large number of ROM boards together.
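The behavior of that shared pull-down bus can be modeled in a few lines of Python. This is only a logical sketch under simple assumptions: each board is reduced to an enable flag plus a 32-bit jumper mask, and the names below are invented for the example rather than taken from the project.

```python
BUS_WIDTH = 32
ALL_HIGH = (1 << BUS_WIDTH) - 1  # pull-ups hold every line at 1 by default


def bus_state(boards):
    """boards: iterable of (enabled, jumper_mask) pairs.

    A set bit in jumper_mask means a jumper is fitted, so that line is
    pulled low while the board is enabled. Disabled boards are high
    impedance and contribute nothing.
    """
    pulled_low = 0
    for enabled, jumper_mask in boards:
        if enabled:
            pulled_low |= jumper_mask
    return ALL_HIGH & ~pulled_low


# One enabled board pulling lines 0 and 2 low, one disabled board.
print(hex(bus_state([(True, 0b0101), (False, 0b11110000)])))
```

Because a disabled board adds nothing to the mask, any number of boards can share the bus as long as only the intended one is enabled at a time.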

Initial bring-up of the first ROM board connected to the control signal distributor board.

Now, when measuring the control signal outputs all seemed to be acceptable. The control signals illuminated their respective indicator LED and output ~4.7V according to my oscilloscope. When off, they appeared close enough to 0V to cut it. So, I assembled enough ROM boards to code for a LOADI instruction, including a mini-board that stored the "load next instruction from memory, placing it into the instruction register" microcode. In isolation, everything appeared to work fine. When I connected the instruction decoder board to the microcode counter board, I was able to successfully step through the various microcodes, seeing the control signals change for every step. So I hastily assembled and connected the rest of the components necessary for the LOADI to work (instruction register, A register, immediate register).

The boards I assembled in one marathon assembly in order to integration test a single instruction.

You know what comes next. Nothing worked. Nothing. Somehow the microcode counter wasn't counting up, the instruction register wasn't loading the value from the bus, and it seemed like I completely fried something. The only thing that appeared to work was the SRAM emulator circuit which was faithfully outputting an 8-bit binary value onto the data bus that I had entered on a bank of DIP switches. I know better than to just throw everything together and hope for the best but I'll admit that I was very excited. So, it was back to the drawing board and I was quite disappointed. I put the project down for a night and gave it some thought.

The next morning, I started by taking some measurements on the control signals. I found it a bit suspicious that when I connected the instruction register to the control signal distributor the indicator LED for that signal got quite dim. So, I poked at it with an oscilloscope and found that the control signal line was sitting at around 3V. Unplugging one of the two 4-bit registers bumped it up to almost 4V and unplugging both brought it back to the expected 4.7V. However, when buffering through a pair of not gates, the voltage for a logic high only dropped by a few tenths of a volt. So I went back to the schematics and realized that I had a pair of design flaws.

The inputs for all of my gates go through 10K current limiting resistors before hitting their respective transistors. For almost all circuits, there is only one resistor/transistor pair that inputs connect to. However, the 4-bit registers are effectively 4 identical copies of a 1-bit register laid out on a single board. I had connected both the EN and RST lines directly to the respective 4 bits. This meant that while the data inputs and clock input could be seen as costing 1 fanout to a driving circuit, the EN and RST lines cost 4 fanout. So, connecting two registers was a cost of 8 fanout. Basically connecting a single register board was equivalent to trying to drive four NOT gates at once. Most of my circuits included an emitter-follower buffer on the output stage, ensuring that there is no current limiting and effectively making the fanout a function of the input impedance (normally 10K per circuit) and the 2N2222A's maximum current. So this flaw would not be fatal in and of itself. However, the control signal distributor did not include any output conditioning. It included only a 1K pull-up per signal, allowing that to drive the logic level to 1 when a ROM wasn't setting it to 0.

In the control signal distributor the 1K resistor acts as a current limiter and makes the output voltage susceptible to the number of gates connected to that circuit. To understand why, Ohm's law can be used. If we look at the voltage at the control signal pin when only one circuit is connected, we have 5V feeding a 1K resistor, the control signal pin and then a 10K resistor feeding the base of a 2N2222A. While there is a voltage drop across the 2N2222A, it doesn't matter when trying to understand the problem. So, we have a simple voltage divider with the voltage at the control signal pin equivalent to 5V * (10K/(10K + 1K)) or around 4.5V. That should work. Okay, so let's connect two circuits. Ohm's law also allows us to calculate effective resistance when multiple resistors are in parallel, which in this case is 5K. So, the control signal voltage is now equal to 5V * (5K/(5K + 1K)) or around 4.2V. It's easy to see a pattern here: the more inputs you place in line, the lower the effective resistance is on the low side of the 1K resistor and therefore the lower the voltage is. At a certain point, it becomes not enough to be seen as a logic 1.
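The same divider arithmetic can be written as a small Python function. It is a sketch under the same simplification used above: every input is treated as a plain 10K resistor to ground and the transistor's base-emitter drop is ignored. The function and parameter names are made up for the example.

```python
def control_line_voltage(n_inputs, vcc=5.0, pull_up=1_000.0, input_r=10_000.0):
    """Voltage on a pulled-up line loaded by n_inputs identical gate inputs.

    Ignores the transistor's base-emitter drop, as the text does.
    """
    if n_inputs == 0:
        return vcc                      # nothing loads the line
    load = input_r / n_inputs           # identical resistors in parallel
    return vcc * load / (load + pull_up)


for n in (1, 2, 4, 8):
    print(f"{n} input(s): {control_line_voltage(n):.2f} V")
```

Under these assumptions four inputs (one 4-bit register's EN line) come out near 3.6V and eight inputs near 2.8V, which shows how quickly the logic high erodes as loads are added.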

So, with the existing design of the control signal distributor there was no way to reliably drive more than 1-2 circuits on a single control line, and an 8-bit register was effectively 8 circuits from the perspective of the instruction decoder. Fortunately, I know how to solve this! Most of my circuits already included emitter-follower buffers to drive their outputs. These buffers pull down to GND instead of up to 5V, meaning they can source as much current as downstream circuits demand up until the transistor burns out. For the 2N2222A, you can provide about 800mA of current before the transistor dies. With 10K current limiting resistors on all the inputs of my circuits, that fanout works out to ~1600 circuits. For my purposes, that's effectively infinite. So, I needed to design a buffer circuit that could be plugged into the control signals distributor that provided an emitter-follower buffer per bit. I looked at my schematics and realized the clock and reset circuit also suffered from the same flaw on its outputs. So, I designed an 8-bit buffer for the control signals and a 2-bit buffer for the clock and reset circuit.
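The ~1600 figure falls out of a one-line calculation, sketched here in Python. It assumes each input draws 5V / 10K = 0.5mA and ignores the base-emitter drop, so it is an upper bound for illustration rather than a design rule; the function name is hypothetical.

```python
def max_fanout(max_current_ma=800, vcc_v=5, input_r_ohm=10_000):
    """Inputs one buffer can drive before hitting the transistor's current limit.

    Each input is treated as a plain resistor to ground; the base-emitter
    drop is ignored, so this is an upper bound rather than a design figure.
    """
    # Current per input in mA is vcc_v / input_r_ohm * 1000; integer math
    # avoids floating-point rounding in the division.
    return (max_current_ma * input_r_ohm) // (vcc_v * 1000)


print(max_fanout())
```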

Emitter-follower buffer for the clock and reset lines.

It happens that the 2-to-4 decoder also suffers from this particular flaw. I cheaped out on the design, dropping the output buffer as well as the indicator LEDs. So, instead of trying to patch existing circuits I redesigned it to include both the standard buffering and indicator LEDs that are available on all other circuits. Since designing it I've found that the LEDs are an amazing debugging aid so making larger (and more expensive) boards which contain them seems to be always worth it. That wasn't as clear to me when I laid out the original version of the board a few months ago. Also, remember in a previous blog post where the original AND and OR gates I designed didn't work in some cases? That turns out to be the same root cause so if I'd taken a bit more time to understand why they failed the way they did instead of just shrugging and throwing the output stage back on them I might not have had to fix so much. Oh well though, what's done is done and I have a much more solid understanding of just why everything is actually working together.

Additional Issues

Of course, in any engineering project bigger than a small program or circuit there will be unforeseen problems. So it's no surprise that when I took the time to bring up the components slowly and in a more organized fashion I found several more issues with my overall design. The first was relating to the data inputs of the various registers. The data bus works very similarly to the control signals distributor. There is a central backbone component providing 1K pull-ups and a bunch of circuits hanging off of that which either read from the bus, write to the bus or both. They work under the principle that when a component isn't reading it doesn't consume any current from the bus and thus doesn't affect its voltage level. That turned out to not be the case for the 4-bit registers. This is due to the inverter connected to the reset input of the internal SR-latch for each bit. So, if I connected enough registers, the voltage on the data bus would sag until other registers only saw 0's on the bus. I didn't want to sacrifice on the elegant design of having one 8-bit bus connection per register so I instead sacrificed on the current consumption for the bus circuits. Remembering the formula for voltage dividers above, we can generalize it for X registers connected like so: 5V * ((10K / X)/((10K / X) + 1K)). As X increases, the dominant factor in the equation becomes the 1K resistor. So I swapped out the 1K pull-ups for 100 ohm pull-ups instead. If we assume 5 registers connected, we go from a 3.33V for a 1 up to a 4.76V for a 1 which is more than adequate.
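Turning the generalized formula around gives the number of registers a given pull-up can support. The Python sketch below does that rearrangement; the 4.0V logic-high threshold is an assumption picked purely for illustration, since no exact threshold is stated, and the function name is made up.

```python
import math


def max_registers(v_min, vcc=5.0, pull_up=1_000.0, input_r=10_000.0):
    """Largest X for which vcc * (input_r/X) / (input_r/X + pull_up) >= v_min."""
    # Rearranging the divider formula gives
    # X <= input_r * (vcc - v_min) / (v_min * pull_up).
    return math.floor(input_r * (vcc - v_min) / (v_min * pull_up))


V_MIN = 4.0  # assumed logic-high threshold, chosen only for illustration
print(max_registers(V_MIN, pull_up=1_000.0))
print(max_registers(V_MIN, pull_up=100.0))
```

With that assumed threshold, a 1K pull-up supports 2 registers while a 100 ohm pull-up supports 25, at the cost of ten times the current whenever a line is pulled low.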

Partially re-connected components as I brought up each piece methodically.

The second problem was somewhat surprising, but easy to account for. There's a large amount of voltage drop across the feeder wires from the bench power supply to the circuits themselves. This is obviously current-dependent since wires have parasitic resistance. So, the more circuits I plugged in, the more the voltage sagged at the inputs for each circuit. Remember that the power-on reset circuitry in the clock generator looks for a voltage of around 4.7V and holds the system in reset if it drops below that amount. This allows the system to self-reset when being powered for the first time but also means that we're fairly sensitive to voltage drops. Once I figured this out it was easy enough to put a probe from the oscilloscope on the power inputs for a random circuit and adjust the voltage of the bench power supply until the oscilloscope read 5V. Once I did this the circuits that were behaving erratically when everything was connected started behaving correctly.

The third had to do with the way the instruction decoder ROM boards pulled relevant control signals to 0. I noticed that when the ROM boards were active, they would only pull the control signals to ~1V. This was good enough before the emitter-follower buffers, but now that those were in place lots of circuits were seeing all control signals as active all the time. This turned out to be the current limiting resistor feeding the open-collector pull-down transistors. In the case of the 4-position ROM boards, I replaced the resistors with bridges as they were fed by the internal demux circuit which was already current limited. For the mini-ROM boards I first tried a bridge but burned out the enable transistor on the load instruction ROM. Replacing the bridge with a 1K resistor instead of the original 10K resistor did the trick.

Finally, I had a phantom problem with the clock. I noticed that when additional circuits were connected, the clock would start behaving more and more erratically. It got to the point where just connecting the clock to the general purpose registers was enough to break synchronization for the whole CPU. I spent a bunch of time thinking on this one before starting to take a bunch of measurements. However, they all ended up being unnecessary as the problem turned out to be the 2-position DIP switch on the clock circuit itself. This switch allows you to select between manual and automatic clocking and must have been shorting in some cases, leading to erratic clock behavior. After flipping the DIP switches a few times, the problem went away!

Current state of MiniDragon, with A register holding the binary value 00010111 which was set via a LOADI instruction.

New Physical Layout

I wasn't quite happy with the layout from last time as it stretched some control signals 4+ feet and took up a lot of room physically. So, I came up with a revised 3x3 layout instead of a 2x5 layout. This new version puts the busses much closer to their intended components and also takes into account power circuitry. In order for this to work, the circuits will have to be stacked higher but that's no big problem. My partner also pointed out that she thinks it would look cooler with clear acrylic instead of black ABS so I have a few 12"x12" sheets coming to try that out. I agree with her since it's a bit disappointing to work hard on LEDs and circuit aesthetics only to have everything obscured by stacks of additional boards. So, we'll see how that looks!

I've done lots of little tweaks to the schematics in order to make sure they stay up-to-date with the actual built MiniDragon. There haven't been many design changes except to the instruction decoder itself which needed updates to reflect a new opcode layout as well as the buffer boards. I also switched the MCODE_RST control signal to being active low instead of active high. I was a bit worried that it was possible for a ROM board to be deselected before another ROM board was activated. If you remember, control signals are default high with pull-down jumpers. That means if there were variable delays in the enable lines for various microcode ROM boards it would be possible for the MCODE_RST line to pulse high briefly. This is tied to the RST inputs of the flip-flops on the microcode counter board which are async. So, in order to head off any potential problems I switched polarity of the signal. That way, it's active low and default high meaning it will never be activated by mistake. All other control signals are acted upon during clock change so this isn't a problem for any other circuits. As a bonus, this also saves jumpers as only one is needed on the final microcode.

The new layout, with bold highlighting completed boards.

Software Side

My progress on MiniDragon has not been limited to hardware. When I needed a break from assembly or I was mulling over problems I tackled lots of TODOs on the software side of things. As I've been coding the stdlib I've run into a few annoying patterns. One of them was that you had to stick temporary values in memory in order to perform math against them. The accumulator was just that: an accumulator. So, more complicated algorithms wasted a lot of their time shuffling values around in memory to get their job done. I took a look at the instruction set and realized that the "no variable width instructions" ship had sailed with the introduction of LNGJUMP and decided to make LOADI perform the same. So, I moved it into the stack operations opcode range and updated the microcode so that LOADI loads an 8-bit immediate value from the next position in memory after the opcode. This meant that a lot of operations in the CPU were much faster now because setting an arbitrary value in A didn't involve lots of shifting and adding. It also meant that I freed up 64 opcodes for use with register math!

I realized that I had several spare register boards lying around that I could populate and bring up so I introduced two temporary registers: U and V. I added additional opcodes to work with these registers. They both hold an 8-bit value, and they can be used as the second math source for all ALU operations. This means that you can manipulate the accumulator using the value in memory that PC points to or the value in U or V. Additionally, instructions to load/store the value of memory at PC into U/V have been added, instructions to move values between A, U and V have been added and instructions to swap the value of A and U, A and V or U and V have been added. I took advantage of these instructions in a few functions, most notably the umult function which is both 20% faster and slightly smaller in the bootROM. Similar optimizations have been made to several other functions. Notably, atoi is almost 50% faster than it was when I originally coded it.

Nothing comes for free, however. If you remember in the last blog post, I had three control signals to spare for future additions. In order to have a working U and V register, I need four control signals (read bus to U, read bus to V, write U to bus, write V to bus). So, I had to take a look and find another group of signals that would never be active at once. I settled on the control signals that feed various bus outputs to the top half of the data bus. It shouldn't be possible to have more than one signal for data bus writing active at once because then two circuits would fight over what value to place on the bus. So, I took three signals (ALU_OUT, A_HIGH_OUT and D_HIGH_OUT) and connected them through a 2-to-4 decoder so they could take up only two bits on the ROM boards. As a bonus, the default value (11 in binary) is mapped to nothing, meaning that for these signals, leaving jumpers out defaults to no signals active. Much like the MCODE_RST negation above, this saves a lot of jumpers in practice.
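The effect of folding three mutually exclusive signals into two ROM bits can be sketched as a lookup in Python. Which 2-bit code selects which signal is an assumption made here for illustration; the only mapping stated above is that 11 selects nothing.

```python
# Which 2-bit code selects which signal is an assumption made for this
# sketch; the text only states that 0b11 (no jumpers fitted) selects nothing.
HIGH_BUS_WRITERS = {
    0b00: "ALU_OUT",
    0b01: "A_HIGH_OUT",
    0b10: "D_HIGH_OUT",
    0b11: None,
}


def decode_high_bus_writer(rom_bits: int):
    """Return the single active writer for the two ROM bits, or None."""
    return HIGH_BUS_WRITERS[rom_bits & 0b11]


print(decode_high_bus_writer(0b11))
print(decode_high_bus_writer(0b00))
```

A side effect of the encoding is that it becomes impossible to program two of these writers as active in the same microcode, since the two bits can only name one of them.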

In order to make U/V registers work with math, a new chunk of the opcode space has been carved out for arithmetic that uses two bits as the operand input. This also left room for modifications to SHL/SHR. As of the last blog post, shifting left or right always shifted in a 0 to the unoccupied spot, and always shifted the lost bit into the carry flag. This meant that in order to shift a 16/32-bit number you would have to do a fair amount of juggling numbers. Given that shifts do not take a second operand, I had room in the opcode layout for variations on shifting. So, I've added rotate left/right and rotate with carry left and right. The former, known as ROL/ROR, shifts as expected, but sets the shifted in bit to the bit that was shifted out, allowing a barrel rotate. The latter, known as RCL/RCR, do something similar, but the bit which is rotated in comes from the carry. Both set carry to the bit that was shifted out. RCL/RCR are especially useful in umult as they allow me to multiply/divide the two operands by two as per the Russian peasant's algorithm far more efficiently.
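The left-hand variants described above can be modeled in Python as follows. This is a behavioral sketch of the semantics as described in the text, not the actual microcode; each function returns the new 8-bit value and the carry flag, and the right-hand variants mirror these. The shl16 helper is a hypothetical illustration of why rotate-through-carry helps with multi-byte shifts.

```python
def shl(value, carry=0):
    """Shift left, shifting in 0. Returns (result, carry_out)."""
    return (value << 1) & 0xFF, (value >> 7) & 1


def rol(value, carry=0):
    """Rotate left: the bit shifted out of bit 7 re-enters at bit 0."""
    out = (value >> 7) & 1
    return ((value << 1) & 0xFF) | out, out


def rcl(value, carry):
    """Rotate left through carry: the old carry enters at bit 0."""
    out = (value >> 7) & 1
    return ((value << 1) & 0xFF) | (carry & 1), out


def shl16(high, low):
    """Double a 16-bit value held as two bytes: SHL the low byte, RCL the high."""
    low, carry = shl(low)
    high, carry = rcl(high, carry)
    return high, low, carry


high, low, carry = shl16(0x12, 0x34)
print(hex(high), hex(low), carry)
```

Doubling 0x1234 this way yields 0x24 and 0x68 with the carry clear, and the bit leaving the low byte arrives in the high byte without any extra masking or branching.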

The udiv function is now more correct than it was before. Previously, it could only divide unsigned integers that were in the range of 0-127. While it was performing unsigned math, it could not function against numbers that appeared signed on account of how it determined that it was finished. Given the speed-ups afforded by the U/V registers, I was able to switch it over to using ucmp and lift this restriction. It is still twice as slow as it used to be, but it no longer has any caveats. There is also a working implementation of umult16 and umult32 meaning that while MiniDragon technically only has an 8-bit adder, it can successfully multiply two 32-bit integers as long as their result fits in 32 bits.
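For reference, the shift-and-add (Russian peasant) approach to multiplication looks like this in Python. It is a generic sketch of the algorithm with the result truncated to 32 bits, not a transcription of the stdlib's umult32, which has to do the same work one byte at a time.

```python
def umult32(a: int, b: int) -> int:
    """Multiply two unsigned values, keeping only the low 32 bits."""
    mask = 0xFFFFFFFF
    a &= mask
    b &= mask
    product = 0
    while b:
        if b & 1:                      # low bit set: this power of two contributes
            product = (product + a) & mask
        a = (a << 1) & mask            # double one operand
        b >>= 1                        # halve the other
    return product


print(umult32(6, 7))
```

Only additions and single-bit shifts are needed, which is why an 8-bit adder plus the carry-aware shifts is enough to build wider multiplies.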

The assembler/disassembler I've written for the project has limitations which prevent having separate opcodes decode to the same mnemonic. I work around that by having plenty of duplicated instruction names which include the operand in the name, such as LOADA to load the A register and LOADU to load the U register. This looks ugly in practice, so I sugar over it with macros which the assembler has supported from the beginning. So, under the hood there is a LOADA, LOADU and LOADV instruction, and then there is a LOAD macro which takes a single parameter and emits the correct mnemonic. This allows me to type things such as LOAD A instead of LOADA, making the assembly much prettier in my opinion. If I was truly ambitious, I could integrate with an existing table assembler but I don't see the benefit at the moment.
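The macro idea can be illustrated with a tiny Python expansion step. This is not the real assembler's macro system, whose syntax isn't shown here; the function name and the error behavior are assumptions, and only the LOADA/LOADU/LOADV mnemonics come from the text.

```python
# Hypothetical table: the text names LOADA, LOADU and LOADV as the
# underlying mnemonics behind a single-parameter LOAD macro.
LOAD_VARIANTS = {"A": "LOADA", "U": "LOADU", "V": "LOADV"}


def expand_load(line: str) -> str:
    """Rewrite 'LOAD <reg>' to its register-specific mnemonic."""
    parts = line.split()
    if len(parts) == 2 and parts[0] == "LOAD":
        register = parts[1].upper()
        if register not in LOAD_VARIANTS:
            raise ValueError(f"LOAD has no variant for register {register!r}")
        return LOAD_VARIANTS[register]
    return line  # anything else passes through untouched


print(expand_load("LOAD A"))
print(expand_load("LOAD V"))
```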

Next Steps

With the successful integration of the currently built boards and the execution of the (outdated) LOADI instruction I now have a lot more confidence in my direction. I still need to design and lay out the ALU and SRAM interface circuits. However, the design of the rest of the components is solidified and tested. All I have to do to complete the general purpose registers and the majority of the instruction decoder is solder about a billion components to way too many circuits, bring them up and then build the various boards with them. I also need to finish up several functions for the standard library and choose an 8-bit serial chip to use for IO routines. A friend of mine has graciously volunteered a serial terminal that I can use to interface with MiniDragon some day when it is complete. So hopefully next time I write an entry I'll have things further along!

MiniDragon Homebrew CPU Early Progress

March 27th 2020, 2:05:46 pm

Progress has continued on my MiniDragon Homebrew CPU at a fairly linear pace. I'm well on my way to a very early bring up. Lots of things have been cemented and I am narrowing in on the final physical layout for lots of parts! Since my last blog post a month ago I've made a ton of progress on both the hardware and software side of things, validated a bunch of assumptions and tested a giant chunk of the existing design individually. I have yet to do a full integration test but I am getting very close to having enough of the CPU built in order to start that process!

Physical Assembly and Layout

The current plan for the physical towers of components. The bold text represents finished sections.

At the end of February several major components were out to fab and I had no design for the CPU itself outside of the simulator. In order to get an accurate part count I started putting together a high level block diagram for the whole thing. Despite a lot of limitations and bugs, I decided to do this in KiCad. The advantage is huge: I have a block diagram where each component can be opened up and inspected, all the way down to an individual transistor. So MiniDragon is about as fully documented as is possible. I have yet to finish all of the diagrams but they are already complete enough for me to have built six components of the CPU as well as create a microcode programming generator program. If you are curious, the block diagram is up on GitHub, along with everything else in this blog entry!

I've continued assembling circuits as they come back from fab. Progress has shifted from initial board bring-up on the first board of a design to bulk assembly of boards. I've been building components as fast as I can in order to get enough boards to build out the various high-level components of the CPU itself. Since the last blog post I've assembled nine Rev. 2 D/T flip-flops, almost a dozen simple logic gates, a few 1-to-2 decoders and 2-to-4 decoders and a handful of the 4-bit register circuits. On the completed components I count 24 logic boards, flip-flops and demultiplexer circuits and another 24 breakout boards used to bring connections out to the edge of the 1'x1' component boards.

Assembling the D/T flip-flops to be used in the microcode counter board.

As for the components themselves, I have assembled the B and D registers (seen below on the right), the instruction register, the flags circuitry, the microcode counter (seen below on the left) and the data bus (below in the center). The pieces of each of these that interface with each other have been connected as well. Each board has been verified in isolation to ensure that it performs as specified. However, without the beginnings of an instruction decoder any test to verify that the components play well with each other will be meaningless so I've held off for now. I'm sure I'll find stuff during integration and bring-up but that's how every project goes. I'm extremely excited to be very close to this, however!

Current layout of completed components.

The block diagrams also include the wiring for control signals coming out of the instruction decoder. My current design revolves around a bunch of 32-bit ROM boards which are programmed using jumpers and collected through a pull-down open-collector bus. That means that I have plenty of room to represent the 29 control signals as reprogrammable microcodes per-instruction. It also means that I now have a physical location in ROM for each of the control signals. With that I've been able to write a utility that takes the instruction classes in the current simulator and spits out a microcode programming guide for each instruction, telling me how many microcode entries each instruction needs as well as where to put the jumpers in order to make the instructions work correctly. This also means that any modifications I make to the instruction set in the simulator can be quickly reflected in hardware. This will surely come in handy as I continue to iterate on the instruction set.
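To give a feel for what such a programming guide generator does, here is a heavily simplified Python sketch. The signal names, their ROM bit positions and the output format are all hypothetical; the real utility derives its data from the instruction classes in the simulator, which are not reproduced here.

```python
# Hypothetical signal names and ROM bit positions, for illustration only.
ROM_POSITION = {"IR_IN": 0, "A_IN": 1, "IMM_OUT": 2, "MCODE_RST": 3}


def programming_guide(name, steps):
    """steps: one collection of jumpered signal names per microcode entry.

    Returns printable lines giving the entry count and, per entry,
    the sorted ROM positions that need a jumper.
    """
    lines = [f"{name}: {len(steps)} microcode entries"]
    for index, signals in enumerate(steps):
        positions = sorted(ROM_POSITION[s] for s in signals)
        lines.append(f"  entry {index}: jumpers at {positions}")
    return lines


for line in programming_guide("LOADI", [{"IR_IN"}, {"A_IN", "IMM_OUT"}]):
    print(line)
```

Keeping the signal-to-position table in one place is what makes instruction set changes cheap to carry over to hardware: regenerate the guide and move the jumpers.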

Software and Opcodes

The instruction set itself has changed little since the last blog post but the changes I have made unlock new processing power. I made some minor adjustments to the ADDPC/SUBPC instructions, renaming them to ADDPCI/SUBPCI (the I standing for immediate, to bring them in line with the other immediate-based instructions). I also added a new 4-bit sign extended immediate register to complement the 6-bit one that currently exists. This allowed me to optimize the ADDPCI/SUBPCI instructions in terms of clock cycles per execution as well as remove the need for 29 ROM boards! This doesn't seem like much, but the overall speedup to the standard library was around 7% and a few of the worst algorithms were sped up by over 20%. Also, the ROM boards themselves are massive and thus expensive, so reducing my need by such a large number of boards is huge!

Measurements taken in the simulator before and after the ADDPCI/SUBPCI change.

Renaming the ADDPC/SUBPC instructions to ADDPCI and SUBPCI opened up the ADDPC namespace for a new instruction that can adjust the PC register by a signed offset stored in the A register. I originally added this instruction to rewind string pointers for strcmp/strcat/strlen/strcpy functions in the standard library. However, after finishing the implementation I realized that it also unlocks indirect memory addressing. This means doing object-oriented operations as well as array lookups and jump tables become much, much faster. It was always theoretically possible to increment/decrement the PC in a loop, but this takes operations that could potentially be thousands of clock ticks down to a single instruction and makes it feasible to use in practice. Much like introducing SKIPIF gave me Turing completeness and introducing PUSHIP/POPIP gave me subroutines, this single instruction gives me yet another degree of power to write complex algorithms!
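A small Python model shows why adding a signed offset to the pointer register amounts to indexed addressing. The 16-bit pointer width, the wrap-around behavior and both function names are assumptions made for this sketch; the post does not spell out those details.

```python
PC_MASK = 0xFFFF  # assumed 16-bit pointer width; the post does not state it


def addpc(pc: int, a: int) -> int:
    """Adjust PC by the signed 8-bit offset held in A."""
    a &= 0xFF
    offset = a - 0x100 if a & 0x80 else a   # reinterpret A as two's complement
    return (pc + offset) & PC_MASK


def table_entry(memory, table_base: int, index: int) -> int:
    """Array lookup: point PC at the table, add the index, read the byte."""
    pc = addpc(table_base, index)
    return memory[pc]


memory = bytearray(0x2000)
memory[0x1003] = 0x42
print(hex(table_entry(memory, 0x1000, 3)))
print(hex(addpc(0x1000, 0xFF)))
```

The second call shows the rewind case: an A value of 0xFF is treated as -1, so the pointer steps back by one instead of forward by 255.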

On the software side of things, I've been hard at work fleshing out the standard library for MiniDragon. I've been coding up a host of useful basics that one might expect in a stdlib, such as atoi/itoa, strlen/strcpy/strstr/strcmp, cmp/add/negate/multiply/divide and processor initialization routines, all of which are up on GitHub. This has been an enormous amount of fun! I love nothing more than writing a standard library from scratch on a new CPU architecture that doesn't even exist physically yet! It has also been extremely valuable. The changes I've made to instructions that I detailed above came directly out of this exercise. In order to make the standard library as useful and comprehensive as possible I've been making sure that all the functions are fully tested and side-effect free from the perspective of the caller. That meant, in the case of the string functions, adding the ADDPC instruction in order to facilitate this! It has also been a huge relief to see that it is indeed possible to do real-world, useful things with this CPU and that it is not just a physical build of a toy instruction set.

Finally, in the process of writing the standard library I've made a lot of progress refining the toolset that comes with MiniDragon. I've made a boatload of improvements to the assembler, fixed several small bugs in the simulator and added additional utilities such as a function visualizer and the aforementioned microcode ROM programming generator. The processor test suite now includes validation for many classes of errors that the assembler should generate and the assembler is much more helpful in attempting to communicate why something isn't valid. This has been especially valuable in tracking down when JRI instructions reference labels outside of the 31 byte jump boundary. The simulator frontend now allows step-over-function debugging as well as run until return debugging in tandem with the existing single-step. And finally, the assembler supports much more powerful constant definitions, including the addition of the ".char" and ".str" directives for data embedding as well as support for using character literals as parameters to instructions. This has allowed me to keep the standard library fairly readable (as readable as a low-level stack-based CPU can be) while also using fewer instructions for referencing constants.

Screenshot visualization of strlen, showing stack tracing for PC and SPC registers.

What's Left?

Of course, I'm nowhere near done! I've had a few expected setbacks and for everything I finish two more things magically show up on my TODO list. The AND/OR gates that I put together to make the microcode counter board ended up not working in-circuit so I had to submit redesigns of them to fabrication before assembling the microcode counter board. Some of my seemingly simpler components were revealed to be more complicated once I block diagrammed them which meant that I had to submit more fabrication requests. And, of course, assembling the 4-bit register boards is still a very long process. I have gotten the time down from about 3 hours to an hour and twenty minutes. I have also refined my soldering technique which has resulted in far fewer parts coming up wrong during bring-up, reducing the need for time-intensive debugging sessions. However, I still have 14 more boards to assemble, so at the current rate of assembly that's about 20 hours of soldering!

Stack of register boards awaiting assembly and bring-up.

If assembling enough 4-bit register boards to complete the PC, IP and A registers wasn't enough, I also have hours upon hours of additional things to tackle before I'm on the home stretch. The microcode programming boards and the control signals termination and distributor board are all with oshpark right now. When they arrive, I'll need to bring them up and ensure they work as expected before I order a ton more ROM boards. I have another 30 bus output boards coming from fab which will be used for everything from the general purpose registers to the immediate register and the ALU. I also need to start the design of both the ALU and the external memory interface circuitry which will talk to the external RAM, bootROM and external hardware. I also need to decide on that external hardware in order to work on IO routines in the standard library. Right now I am leaning towards a simple LCD for debug messages and an 8-bit serial chip to provide standard in/standard out. It might never happen, but I'd love to pair this thing to a VT-100 like some old mainframe. And finally, I need to continue working on the standard library. I have yet to code memcpy, memset or memcmp. These will be similar to the existing string functionality. I also need to create 16-bit and 32-bit versions of the math and conversion libraries, and also handle 8-bit, 16-bit and 32-bit signed variants of the math library. And finally, once I standardize on the IO itself, I'll need to create access routines and start working on a basic shell to place in the bootROM.

As I get closer to the physical realm one thing is becoming clear: I need to finish this. If I don't, and get to 80-90% done before calling success, I am going to miss the other 80-90% of the process. I am not interested in software-only theoretical CPUs, I want a real life stack of hardware that executes software I wrote for it, built from the ground up. It is going to be a lot of hand-soldering and a lot of patience, but it is going to be VERY worth it!

Transistor CPU Part Fabrication

February 26th 2020, 4:25:26 pm

I've been working on the Transistor CPU for a few months now, which I've dubbed the MiniDragon. Since the last blog post I've changed a few minor things and nailed down a lot more of the CPU. I have a much more complete simulator, an assembler and disassembler and the beginning of a software library for an upcoming boot ROM. The CPU now has subroutine and stack instructions as well as an absolute jump, making it possible to write a reasonable library of built-in functions. Through writing a multiply subroutine I learned a lot about my chosen instruction set and have made several changes to the CPU as a result. I'm a lot more confident in the instructions since I've implemented actual software with them. That software, as well as an up-to-date description of the CPU itself, is available on GitHub.

Fleshing out the simulator allowed me to get a much better handle on the CPU itself. However, I only spent a small amount of time working on the software side of things. The majority of my time over the last few months was spent learning KiCad, digitizing schematics that I've tested on breadboards, laying boards out and sending them out for fabrication. I've been using Oshpark for fabrication since it can take pcbnew files directly and is cheap enough for small runs of boards. The boards are also my favorite color: purple! I started out with some simple logic gates that I'd breadboarded over Christmas: NOT, NAND and NOR. These aren't super interesting and they only consist of a few discrete components. However, they let me learn the ropes of KiCad before I undertook some more complicated design and routing projects like a 4-bit register.

Probably the hardest part of the last few months was being patient. As soon as I finished board layouts I wanted to throw them over the fence and get them fabricated. However, since I've never done PCB layout and fabrication before I was sure that I would make a ton of mistakes. So, I forced myself to send out small batches and observe everything that I did wrong before sending additional boards to fabrication. This paid off heavily, as I made a few mistakes that I had time to correct before it cost me unusable boards or ugly rework. It also allowed me to pace myself with board assembly, since I underestimated how much time it would take to assemble through-hole circuits.

A few specific things that I learned during this process stand out. First is to always ensure that you silkscreen the circuit title AND revision to the board. If you change component values or layout significantly, you will want to know what revision the board is in order to assemble it properly. You might be 100% sure that you will always recognize your boards, but manufacturing can take a few weeks and by the time it comes in you won't remember. I messed up my first board and forgot to silk screen it at all, so I ended up affixing labels to the circuits.

Rev 1 NOT gates with missing silk screen.

Second is to make sure to space headers on the board apart by a multiple of the pin spacing. I chose to use 2.54mm (0.1") pitch pin headers for all of my circuits. Especially when you are dealing with 1x1 headers, it can be next to impossible to correctly insert the pin and solder it while making sure that it is plumb and square against the top of the board. However, if you have a row of multiple sets of pin headers, spacing them apart by a multiple of the pin pitch (2.54mm in my case) means that you can use a long pin socket as a temporary holder in order to line up all the pins at once. I happened to do this by accident on my first board and was super glad that I learned this before sending out the next batch, in which I had not lined up the pins at all. I use a 40-pin pin socket as my assembly template and plug the pin headers into it before soldering them all in one go.

Third is to use the rendered top copper and bottom copper layer photos on Oshpark. I had one board where a trace was routed far too close to a through-hole connection and another which had spurious traces. Neither of these was caught by pcbnew's constraint checker and I missed them looking at the PCB in pcbnew itself. Seeing your circuit in a new light is a super good time to check for errors. I definitely get a bit blind to errors in my layouts due to staring at them for so long while I put them together. So, a different look at the same boards has let me catch issues before sending to fabrication.

Finally, your breadboard designs might not work when laid out on an actual PCB. I arrived at values for my clock edge detection circuit when breadboarding which did not actually work when I assembled my first revision T/D flip-flop. I ended up having to do a really gross rework to verify an updated design. Luckily, I held off on sending a 4-bit register to fabrication which was based off this flip-flop. Only once I got the second revision of the flip-flop back from fabrication and verified that it worked did I send the 4-bit registers out. As a result, they worked fine! Given their size, it would have been a costly mistake had I not waited and verified. A secondary advantage to reworking the circuit was that I got to reduce the resistor part count from 4 distinct values to only 3 on clocked circuits. This makes assembly a lot easier and means that I can keep fewer parts on hand.

Nasty rework done to fix a hardware bug in the Rev. 1 T/D flip-flop.

I have a fairly decent library of logic gates and CPU components fabricated and verified on the bench at this point. I have my staples, like NOT, NAND, NOR, XOR, XNOR and a T/D flip-flop (behavior can be changed with a jumper). I am waiting for fabrication to finish on some AND and OR gates. Some of the CPU microcode logic requires them and I don't want to waste physical space and propagation delay chaining NAND/NOR circuits to NOT circuits. I have a few more useful circuits such as a 1-to-2 decoder and 2-to-4 decoder. Naturally, these can be chained to make a 3-to-8 or 4-to-16 decoder which I will be using in the instruction decoder and microcode lookup circuits. I also have a few parts that I'll use for the actual CPU designed and tested. I settled on a pull-up, open collector bus architecture due to its ease of coupling multiple driving circuits. To support that I have a few 8-bit bus backbone circuits that provide current bus value indicator LEDs and pull-up resistors as well as a ton of 8-pin headers. I have an 8-bit bus writing circuit that has an 8-pin input header and an enable control signal and an 8-bit output header designed to plug directly into a bus. Finally, I have a power on reset and clock circuit that provides automatic and manual reset as well as automatic (adjustable speed) and manual clock pulses.

Lots of completed and verified circuits!
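The decoder chaining mentioned above can be sketched in a few lines of Python. This is only a logic-level model, and it assumes each decoder has an enable input, which may or may not match the fabricated boards. The function names are made up for this example.

```python
def decode_1_to_2(a, enable=1):
    """1-to-2 decoder: exactly one output is high when enabled."""
    return [enable & (a ^ 1), enable & a]


def decode_2_to_4(a1, a0, enable=1):
    """2-to-4 decoder with enable; the output index is the binary value a1a0."""
    return [enable & int((a1, a0) == (i >> 1, i & 1)) for i in range(4)]


def decode_3_to_8(a2, a1, a0):
    """3-to-8 decoder: a 1-to-2 decoder picks which 2-to-4 decoder is enabled."""
    low_enable, high_enable = decode_1_to_2(a2)
    return decode_2_to_4(a1, a0, low_enable) + decode_2_to_4(a1, a0, high_enable)
```

The top address bit selects one of two 2-to-4 decoders and the remaining bits are shared between them. The same trick applied once more gives a 4-to-16 decoder.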

Most of the above circuits are fairly boring in their implementation. The logic gates are similar in design to any RTL circuit you can find online. The bus circuit is just a bunch of pull-ups and a few buffered transistors driving LEDs. The bus writing circuit just converts 8 signals from 0V/5V signals to open-collector outputs. The clock circuit, however, is a bit more interesting. At its core it consists of two components: power detection and a clock generator. The clock generator is essentially an astable multivibrator buffered through some signal-shaping inverting transistors, then finally led through a 2-bit selection switch and an output amplifier. This lets me choose either the automatic clock or manual clock pulses that are input by a switch that's debounced using a capacitor. The power detection circuit uses a Zener diode with a breakdown voltage of 4.7V, hooked up in reverse. This drives a transistor that switches on to indicate that power is good enough to use, which in turn drives an RC circuit to generate a reset pulse. The output of that is buffered, and also goes to the clock's amplifier circuit. This allows the reset logic to disable the clock while the reset line is held high or when the voltage is too low. This lets the CPU auto-reset its registers on power-up, ensuring stable boot every time power is turned on. Finally, a manual reset button is wired in as a logical OR to the reset pulse circuit. That way, if I decide to reset the CPU I can press the button and reset will be asserted while clock is disabled.

Oscilloscope showing power (purple), reset pulse (blue) and system clock (yellow).
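For a rough feel for what sets the automatic clock speed, the textbook approximation for a classic two-transistor astable multivibrator can be written as a short Python function. The component values below are made up for illustration and are not taken from the MiniDragon schematic, and the buffering stages described above are ignored.

```python
import math


def astable_frequency_hz(r1_ohms, c1_farads, r2_ohms, c2_farads):
    """Approximate frequency of a two-transistor astable multivibrator.

    Each half-period is roughly ln(2) * R * C for that side's base
    resistor and coupling capacitor.
    """
    period = math.log(2) * (r1_ohms * c1_farads + r2_ohms * c2_farads)
    return 1.0 / period


# Hypothetical values: 47K base resistors and 10nF coupling capacitors.
print(round(astable_frequency_hz(47e3, 10e-9, 47e3, 10e-9)))
```

Swapping one of the resistors for a potentiometer is one common way to get an adjustable speed, since the frequency falls as the resistance rises.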

I still have a ton of stuff to lay out and get to fabrication. I have the parts on hand to build the flags circuitry and one of the 8-bit registers (either the A register or the D register). I'm waiting on additional parts to come in to build the microcode counter circuitry. I've standardized on a 1x1 foot ABS plastic base for various logical components of the CPU but haven't permanently attached any components yet. I am also waiting on some simple bus/power/control line breakout boards that will let me run various connections to the edge of the panels for easier and more modular assembly. I need to design some jumpered 8-bit ROM boards so that I can program the microcode instructions in, and I need to start laying out the actual instructions one at a time. I also need to assemble 18 more 4-bit register boards to have enough registers for the whole CPU. Then, once all of that is brought up and verified, it will be time to start work designing, laying out and fabricating the ALU itself!

Transistor CPU Project

January 4th 2020, 4:20:15 pm

For about a decade and a half, I've wanted to design and build my own CPU from some sort of discrete components. This has become fairly standard in the hobby world and is completely obsolete with the existence of FPGAs, tons of cheap and available processors and even some microcontrollers costing as little as three cents USD. Nonetheless, I wanted to design a CPU myself mostly as an opportunity to learn and give myself a large project as a challenge.

A few weeks ago I was feeling super under the weather so I came home from work to rest up and started binge watching Ben Eater's Channel on YouTube. Side note, I love his channel. He does such an amazing job breaking things down to easy, well paced chunks that so many people can understand. The channel is like junk food to me and I love picking random things to watch. Anyway, six or seven videos into his 8-bit CPU build I started asking myself why I hadn't ever gotten around to designing and building my own CPU. So, I grabbed a set of resistors and some 2N2222 transistors I had laying around and just started playing with BJT logic gate circuits I could find on Google.

Simple circuits that I threw on paper after testing.

I didn't want to just take somebody's word for it online, so when I built the circuits I took a lot of measurements using my oscilloscope to verify the design. For each simple circuit I built, I measured propagation time when the input went from low to high and from high to low as well as the current drawn by the circuit when running at 5V in various scenarios. Getting the worst-case propagation delay for each circuit allows me to figure out what the maximum clock speed of any CPU I build will be. I worked my way up from a simple NOT gate with a single transistor as well as a buffer, to NAND, NOR, AND and OR gates, and finally an SR latch. Once I had those parts built and verified, I would sketch up schematics for them with various notes on their propagation delay and current requirements. Ignore the obvious schematic errors below, I haven't done pen and paper logic design in years and completely forgot that I was drawing XOR gates instead of NOR gates.

Additional circuits that I tested and measured.
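The reasoning from worst-case propagation delay to maximum clock speed can be sketched like this. The delay numbers in the usage note are placeholders, not my actual measurements, and the function names are made up for this example.

```python
def max_clock_hz(path_delays_ns):
    """Upper bound on clock frequency for one path of chained gates.

    path_delays_ns: worst-case propagation delay of each gate on the
    path, in nanoseconds. The signal must settle within one clock period.
    """
    total_ns = sum(path_delays_ns)
    return 1e9 / total_ns


def slowest_path_clock_hz(paths):
    """The CPU can only run as fast as its slowest path allows."""
    return min(max_clock_hz(path) for path in paths)
```

For example, a hypothetical path through three gates with worst-case delays of 250ns, 250ns and 500ns needs 1000ns to settle, so `max_clock_hz([250, 250, 500])` gives an upper bound of 1MHz. Any real design would also want some margin on top of that.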

With an SR latch you can build all other types of latches and flip-flops. A CPU needs registers, and I want to build the core of the CPU entirely from discrete transistors and resistors, so I needed to build and test a D flip-flop. This meant adding an enable line and tying that enable to an edge detection circuit and verifying that I could "clock in" a 1 or a 0. This worked as expected, though I had to play around with the resistor and capacitor values for the edge detection circuit. It still doesn't work quite right depending on the speed of the clock, and it's affected by the circuit that drives it so I think I'll have to buffer it in a future redesign. Later, I added a second un-clocked enable and an asynchronous reset input, both of which will be necessary to use this as a single bit in an upcoming register. The enable will act as a chip select, allowing CPU control logic to dictate whether a particular register should store a value present on its input at the next clock pulse or retain the existing value. The asynchronous reset will allow a power on reset circuit to reset all registers to zero when the CPU is powered on for the first time.

D flip-flop with a clock and data input, a buffered output with a 10K load and an indicator LED.

If you have a D flip-flop and you have access to the inverted output, you can feed that back into the data input in order to make a T flip-flop. This type of circuit is great for chaining together to make counter circuits or clock dividers. I verified that the theory also worked on my D flip-flop circuit. I have several more circuits that I have to build and verify before I could theoretically put together a CPU of any sort. I need xor gates. I also need some way of selecting one of multiple inputs to drive a bus. Both of these could be handled by the circuits I've already built at the cost of additional propagation delay as well as higher part count. However, I want to keep the part count low so I need to build simpler circuits. Currently I have plans to lay out several busses for the CPU core and I've decided to go with an open collector design instead of tri-state output. This is because of the increased complexity involved in producing a tri-state output (multiple transistors, diodes and an inversion required) versus an open-collector (a single pull-down transistor on an active-high bus).
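The D-to-T flip-flop trick described above, and the way T flip-flops chain into a counter, can be modeled in a few lines of Python. This is a logic-level sketch only, with class and function names made up for this example. It ignores edge detection and propagation delay entirely.

```python
class DFlipFlop:
    """D flip-flop with normal and inverted outputs."""

    def __init__(self):
        self.q = 0

    @property
    def q_inv(self):
        return self.q ^ 1

    def clock(self, d):
        """Latch the data input on a clock edge."""
        self.q = d & 1


class TFlipFlop:
    """A D flip-flop whose inverted output is fed back into its data input."""

    def __init__(self):
        self.dff = DFlipFlop()

    @property
    def q(self):
        return self.dff.q

    def clock(self):
        self.dff.clock(self.dff.q_inv)


def ripple_count(stages, pulses):
    """Clock the first stage `pulses` times, rippling toggles upward.

    Each later stage is clocked when the stage before it falls from 1 to 0.
    Returns the counter value with stage 0 as the least significant bit.
    """
    for _ in range(pulses):
        for stage in stages:
            stage.clock()
            if stage.q == 1:  # no 1 -> 0 transition, so the ripple stops
                break
    return sum(stage.q << i for i, stage in enumerate(stages))
```

Each stage toggles at half the rate of the one before it, which is why the same chain works as both a binary counter and a clock divider.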

In order to make the CPU core more modular and thus easier to build, I've decided to go with a microcoded architecture. This will let me prototype the CPU using an EEPROM to hold the microcodes and very quickly swap things out if I don't like how it works. The final CPU will use combinatorial logic to decode instructions and a diode matrix board per opcode to store the control signals at each step in the CPU's execution. I'll use a series of D flip-flops as a counter to control which microcode to select given a decoded instruction. This design also allows me to reduce parts in several critical areas of the CPU since I can reuse expensive parts such as an adder circuit to drive both the ALU and the program counter. This comes at the cost of slower instruction throughput as only part of each instruction will be executed every clock cycle. I could have made the trade-off to have more complicated logic circuitry, but when I'm looking at hand-soldering each transistor I would prefer to keep things simple and slow.
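To make the microcode idea a bit more concrete, here is a minimal Python sketch of a step counter selecting a set of control signals for a decoded instruction. The opcode names and control signal names are invented for this example and do not reflect the real instruction set or microcode.

```python
# Hypothetical control signal names and microcode steps, for illustration only.
FETCH = [
    {"IP_TO_ADDR", "MEM_TO_DATA", "IR_LOAD"},
]

MICROCODE = {
    "LOAD_IMM": [{"IP_TO_ADDR", "MEM_TO_DATA", "A_LOAD"}, {"IP_INCREMENT"}],
    "NOP": [{"IP_INCREMENT"}],
}


def run_instruction(opcode):
    """Yield (step, control signals) for each microcode step of one opcode.

    The step number plays the role of the flip-flop counter: together with
    the decoded opcode it selects which row of control signals is asserted.
    """
    steps = FETCH + MICROCODE[opcode]
    for counter, signals in enumerate(steps):
        yield counter, sorted(signals)
```

In hardware terms, each entry in a list corresponds to one row of a diode matrix (or one EEPROM word while prototyping), and swapping out the table changes the instruction's behavior without touching the rest of the CPU.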

With most of the basic theory out of the way and verified in-circuit, I got to work thinking about an instruction set. I took a lot of inspiration from the PDP-8, another transistor computer, as well as Ben Eater's simple 8-bit computer. Ben Eater's computer is more of a learning CPU since it only has 16 bytes of memory available. While it is Turing complete, it is extremely limited. I want to keep my CPU simple so that it's humanly possible to design, wire up, debug and code for. However, I do want to be able to write "useful" software for it. I'd like it to be capable of interfacing with external devices, possibly through serial or keyboard and VGA. I'd like to be able to code simple games or productivity software for it. And finally, I'd like it to be self hosting which means making it powerful enough to code an assembler that runs in a boot ROM. This necessitates a von Neumann architecture. It also necessitates having access to a decent amount of memory and external hardware registers.

I settled on a hybrid 8-bit CPU design which allows for software access to a 16-bit address space of RAM/ROM/external hardware registers. Staying true to its inspiration, I have a simple 8-bit, accumulator-based CPU and software can only interact with memory or this single register. However, several more support registers that aren't directly software-accessible will be 16-bit to enable full access to program and data memory and external hardware. I wanted to keep instruction decoding simple, so all instructions are 8-bit as well with no variable instruction width support. Most instructions deal with loading/storing or manipulating the accumulator in some manner, with a few instructions able to interact with special registers. Instead of conditional jumps, I'm going with a skip next instruction opcode which will allow any supported instruction to be made conditional. I don't currently have absolute jumps, call or return support or a stack, but I have plenty of space reserved in the opcode space to add these in a future revision. Several of the registers are write-only or cannot be directly read or written from software which means this CPU cannot be multi-threaded. I think that's okay though, given that the CPU is probably going to run on a clock in the kHz range and would struggle with even the simplest of multi-threaded code.

Snapshot of a Google Sheets document outlining my current instruction support.
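The skip next instruction idea can be shown with a toy interpreter. The mnemonics `SKIPIFZ`, `ADDI` and `HALT` are placeholders invented for this sketch and are not the real opcode names. The only point is that a single skip opcode makes whatever instruction follows it conditional.

```python
def run(program, a=0):
    """Run a toy program and return the final accumulator value."""
    ip = 0
    zero_flag = a == 0
    while ip < len(program):
        op, arg = program[ip]
        ip += 1
        if op == "SKIPIFZ":      # skip the next instruction when ZF is set
            if zero_flag:
                ip += 1
        elif op == "ADDI":       # add an immediate to the accumulator
            a = (a + arg) & 0xFF
            zero_flag = a == 0
        elif op == "HALT":
            break
    return a


# Hypothetical program: the ADDI 5 only runs when the accumulator is non-zero.
PROGRAM = [("ADDI", 0), ("SKIPIFZ", None), ("ADDI", 5), ("ADDI", 1)]
```

Because all instructions are the same width, skipping is just one extra increment of the instruction pointer, which keeps the control logic simple.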

The full list of busses and their design is as follows:

  • 16-bit data bus. This is the primary bus used for moving data between registers and memory. It is 16 bits in order to allow the instruction pointer register to interact with the ALU.
  • 16-bit address bus. This is the bus that feeds the address circuitry for main RAM. It is separate from the data bus to remove the need for a dedicated memory address register.
  • 16-bit ALU source bus. This bus feeds one input to the ALU.

The full list of registers and their capabilities is as follows:

  • 16-bit instruction pointer register (IP), holding the address of the current instruction in memory. Its value can be placed on the address bus or ALU input bus and it can read from the data bus. Software cannot directly set this, but a jump relative to immediate instruction allows it to be indirectly updated.
  • 8-bit instruction register (IR), holding the current instruction that was fetched from memory. It is write-only: it can load a value from the data bus and it feeds the microcode decoding logic, but it cannot output to any bus. Software cannot directly set this, but memory is modifiable so self-modifying code is possible.
  • 8-bit accumulator register (A), holding the current accumulated result. It can read from and write to the data bus, and it can output to the ALU input bus. Many instructions available to software can directly manipulate this register.
  • 8-bit ALU temporary register (B), holding a temporary value from the bus. It can read from and write to the data bus. Its output is also hardcoded to the second input of the ALU. Software has no capability to modify this register and it is used by various microcodes to accomplish virtually all CPU operations.
  • 8-bit memory page register (P) and 8-bit memory cell register (C), together holding a 16-bit address. The P and C registers can individually read from the data bus, and the combined PC value can be output to the address bus or the ALU input bus. Software can write to the P and C registers from the A register and can use the combined PC register contents to load from and store to memory, but it cannot directly read from either register.
  • 2-bit flags register, containing a carry flag (CF) and a zero flag (ZF). Software can directly set or clear the carry flag using a pair of instructions, and both carry and zero flags are set appropriately when carrying out any ALU-based operation which sources from and stores to the A register. It is not directly readable by software but there exist skip instructions that allow software to conditionally execute a particular instruction if either CF or ZF is set or cleared.

Aside from registers and busses, a few more pieces of hardware will exist to make the CPU core:

  • An ALU, which performs operations against the B register and either the A register (sign-extended from 8 bits to 16 bits), the IP register or the PC virtual register. All operations except for add operate only on the low 8 bits of the ALU bus. Add works on all 16 bits of the ALU bus and a sign-extended version of the B register. This allows software to request that the PC be incremented or decremented and it allows the CPU to use the ALU to increment the IP as well as perform both conditional and unconditional jumps.
  • A zero generator which outputs all zeros to the data bus. This is for pre-loading the B register for certain ALU operations. We also need a -1 value but given that we are using an open collector bus design, we can simply turn off all outputs and the bus will read all 1's which is equivalent to a -1 in two's complement.
  • Some sort of ROM and some sort of RAM. Given that it's infeasible to build an SRAM circuit of any usable size out of discrete transistors and core memory is far past obsolete and difficult to obtain, I'll probably use a standard EEPROM and SRAM chip for this.
  • Combinatorial decoding logic, feeding diode ROM select boards and sourcing from the IR. This is the heart of the control circuitry which will generate the control signals which feed the various register enable input and bus output inputs.
  • Flags register combinatorial logic, feeding the data bus with either a 0 or 1 value given particular opcodes in the IR and current values in the CF and ZF registers. This allows us to preload a 1 or a 0 into the B register and implement a conditional skip.
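The open collector trick for getting a -1 for free, along with the sign extension mentioned for the ALU inputs, can be sketched in Python. This models an 8-bit slice of a pulled-up bus as the bitwise AND of every enabled driver. The function names are made up for this example.

```python
BUS_MASK = 0xFF


def bus_value(driver_outputs):
    """Resolve an 8-bit pulled-up, open-collector bus.

    Each driver can only pull lines low, so the bus is the bitwise AND of
    every driver. With no drivers enabled the pull-ups win and every line
    reads 1.
    """
    value = BUS_MASK
    for output in driver_outputs:
        value &= output & BUS_MASK
    return value


def to_signed(value, bits=8):
    """Interpret an unsigned bus value as two's complement."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def sign_extend_8_to_16(value):
    """Sign-extend an 8-bit value to 16 bits."""
    return to_signed(value & 0xFF) & 0xFFFF
```

With no drivers enabled, `bus_value([])` is 0xFF, which `to_signed` reads as -1. Sign-extending 0xFF gives 0xFFFF, so adding it to a 16-bit register decrements that register by one.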

Given that it's going to be rather expensive to prototype in terms of space, time and actual components, I went ahead and wrote a microcode simulator for the CPU design. This was super useful when I was laying out the supported instructions because I was able to test out the actual capabilities of such a CPU by writing miniature programs. Using this simulator I realized that I could do away with several opcodes such as shift right, and could cleverly manipulate the IP register using the ALU to implement standard instruction advancing, relative jumps and conditional execution. During the development of the simulator I ended up also writing a simple assembler and disassembler in Python which will be super useful for writing code that runs on the real hardware before I get the on-target assembler off the ground. If you're interested in playing with it, I threw it up on my website. It also serves as the master documentation for microcodes since it fully simulates the various busses and registers.

I still have a lot of work to do on virtually all of the CPU pieces before I have anything resembling a real CPU. However, given the layout and simplicity of the busses and control signals, I should be able to piece-wise assemble and test the CPU part by part. The next big thing I need to do is test circuitry that will allow me to assert on a bus so that I can start building the bus itself and then send out some boards to be fabricated. I think I'll start with register read/write and a bus and build out the various pieces from there. The most complicated part is going to be the ALU but even that can be built function-by-function until I have a fully functioning ALU circuit. And finally, once I get this particular version of the CPU up and running I'll jump in and see if I can't get absolute jump, call and return instructions and a real stack implemented. Stay tuned for updates to this project!
