Blog

A place where I can ramble about my projects.

Teaching Naomi to Auto-Set the EEPROM

October 1st 2021, 12:27:54 pm

In a few of my games downstairs I run Naomi hardware. Instead of loading the games from the original GD-ROM I netboot them. For that I run a small web server which monitors the games and lets guests choose which game is loaded on what cabinet. It's pretty slick, and it's all solid-state and future-proof. However, the settings for each game are reset to the defaults every time a new game is loaded. That means I have to go into the operator menu, enable free-play, modify game settings such as lives and difficulty, calibrate joysticks and the works. That's annoying and not very guest-friendly.

My solution to this problem was originally to create a series of patches which forced the Naomi to put games into free-play and silent attract mode. This was good enough, since both my guests and I could load a game and play it without getting the keys, going through the test menu and setting everything up. However, I figured I could do one better! So I worked out the format of the EEPROM that saves system and game settings, figured out how to read from and write to it, and then wrote a program that could change my settings for me. I made it such that the program could be loaded before any game so that it had a chance to customize the settings for me before running the game. I also made a utility that I can run on my server which lets me choose custom game options and then saves those options into a small program it attaches to the game ROM, which then gets netbooted. The end result is that I can pick custom game and system settings, load the game onto my arcade cabinet, and when it finishes booting up it has all of the settings I want already configured!

The whole project is available at https://github.com/DragonMinded/netboot for download if you want to do the same thing to your games. Right now there is only a console UI for changing or viewing settings, and I have settings definitions for the system settings and Marvel Vs. Capcom 2. I would like to integrate it with my ROM configuration screen on the web server so that I can edit the settings on the fly without messing with ROM files directly. I would also like to get more definition files worked out so that I can edit the settings for other games I play, such as Ikaruga and Monkey Ball. Hopefully that's all coming soon! For now you can check out my thread on it on the Arcade-Projects forums: https://www.arcade-projects.com/threads/netboot-naomi-with-eeprom-presets.18977/

Adding Free Play to Vs. Tetris

June 20th 2021, 6:20:41 pm

About a week ago I bought a Vs. Tetris kit for my Vs. DualSystem on a whim to replace Vs. Super Mario Bros. When it arrived, I swapped the ROMs and PPU and booted it up only to find that there was no free play option in the game. That presented a problem since the coin slots and service buttons aren't hooked up in my cabinet. It was completely gutted when I got it and I never hooked them up. I also prefer not to have to coin up games before playing them. So I looked online to see if anyone had made a free play mod for the game. I found nothing but other people complaining about the same thing so I decided to add it myself!

The general theory of adding free play to an arcade game that doesn't have it is simple. You figure out what routines handle coin checks and either modify them to require zero coins or trick them into thinking there are always coins inserted into the game. Nice-to-haves include displaying "FREE PLAY" instead of a coin counter on the screen, making sure the game still runs the attract sequence when it thinks it's coined up, and allowing the modification to be enabled or disabled, usually with an unused DIP switch. The older the game, the easier it usually is to find and fix up the routines. This game came out in 1988 and runs on modified NES hardware, so I knew I wasn't going to be dealing with threads, obfuscation or compression schemes. With that in mind I set to work.

Knowing nothing about the game's internals and only a little bit about the NES layout in general, I booted the game up in MAME with the debugger enabled. I also pulled up a copy of the source to the MAME driver so I could get an overview of the Vs. DualSystem memory layout. It's often difficult to tell what the entrypoint of a ROM-based game is from MAME source, so the easiest thing for me to do is to single step once in the MAME debugger and see what address it starts at. The source code will at least tell you where in memory the ROMs will be located, so it's fairly easy to figure out what offset in the ROMs is the start of execution. With the start address and the ROM locations understood, I combined the ROMs into one large file and imported them into a new Ghidra project. Ghidra can load and decompile raw 6502 binaries as long as you know what address to start with, so I headed to the entrypoint offset and told Ghidra to start decompiling.

Ghidra can be a bit fussy with global variable references pointing at memory that's outside of the ROM region. For that reason, I find it is often convenient to add more of the memory map for whatever system I'm analyzing. So, I opened the Memory Map window and added the system RAM as well as the coin counter and DIP switch registers. I then started poking around to see if any obvious code popped out to me. The game initialization and main loop were pretty obvious but aside from that I couldn't see anything clearly handling coins. So, I headed back to MAME.

MAME's cheat engine is a fantastic way to find memory addresses of interest even when you aren't trying to cheat. I wanted to isolate any memory addresses that held coin counts so I could examine code that read and wrote coin values. So I initialized the cheat engine with the cheatinit debugger command, then incremented the coin counter by pressing the service credit button. I then ran cheatnext increase,1 to tell MAME that I expected the memory I cared about to have gone up by 1. It narrowed down to a few hundred entries. So, I ran cheatnext equal without adding another coin to tell MAME that I didn't modify anything and to narrow down the memory further by keeping only the memory locations that hadn't changed. After a few iterations of adding coins and narrowing down memory I had three different addresses to look at, which was good enough. I popped open three different memory viewer windows in the MAME debugger, set each one to look at a different address and tried modifying them to see what happened. The first one immediately reset itself back to the old value whenever I changed it. The second, when changed, caused the first one to mirror it. The credits display on the screen also immediately updated to display the new value I typed. The third one had no discernible effect.
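Condensed, the whole narrowing session is just these two debugger commands alternated with in-game actions (parenthesized):

cheatinit
(press the service credit button to add one coin)
cheatnext increase,1
(touch nothing)
cheatnext equal
(repeat the coin and command pairs until only a few addresses remain)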

Intuition told me the second one was probably what I cared about. So I set a memory watchpoint on the address and then performed a few actions in-game, such as adding a coin, starting a game, losing a game, etc. Basically I was allowing the game to naturally increment and decrement the coin counter and noting down the memory addresses of the code doing the actual incrementing and decrementing. I navigated to those addresses in Ghidra to check out the functions further. I'll spare the boring details of examining all of the functions I found. I ended up with a few functions that all did something with the coin count memory address. One was in the main video update loop and read the coin value to display it on the screen. This one turned out to be what was responsible for writing the coin value in the first address that I found. One was just a simple increment function for when a coin was inserted. One checked how many coins were in the machine and used that to determine whether to go back to the attract sequence or the start screen at the end of a game. And one looked at the game mode selection that a user made (1P, 2P cooperative, 2P competitive) and subtracted the right number of coins (1, 2 or 2 coins respectively) when a game was started.
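Setting the watchpoint is a one-liner in the MAME debugger. Substituting the coin count address that the cheat search turned up for the placeholder, it looks like this:

wpset <addr>,1,rw

That halts execution into the debugger on any read or write of that byte, pointing right at the code doing the actual incrementing and decrementing.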

With that, I had almost everything I needed in order to do the hack. Some experimenting with the "CREDITS X" display function showed me that it was writing characters to a buffer that included the X and Y position of each character and that the game only allowed a maximum of 8 characters to be written. Instead of trying to figure out how to enlarge that buffer, I decided that my free play display would print "FREEPLAY" instead of "FREE PLAY". Given that each character is accompanied by an X position I could go back and adjust the display to include a space, but it didn't occur to me to do that at the time. In order to modify the minimum amount of game code, I also decided to go the route of forcing the coin value to 2 (so both 1P and 2P games would work) instead of attempting to modify the various coin functions to accept 0 coins. I also decided on using "DIP 5" as the freeplay/coin mode switch as it was unused according to MAME source, various online documentation and a cursory search of code that accessed the DIP switch registers.

The first function that I wrote was the function that would display "FREEPLAY" on the screen. I started with that because it was almost a carbon copy of the "CREDITS X" display function. I just had to change the memory addresses for the text and X offsets, get rid of the bit that copied the credits amount and adjust the X offsets to center the display horizontally. Also, since this function was called in the main display loop for every frame it was a good place to stick a bit of code that would force the credit count to 2. I also wrote a small function that would load the DIP switch register, check if DIP 5 was set or cleared and then jump to my freeplay display function or the original credit display function depending on the result. Both of these functions were stuck near the end of the last ROM in a large chunk of blank space. I located where in the main thread the original credit display function was called using Ghidra and then changed it to instead call my new function that checked DIP switches and jumped to the appropriate display.

After confirming that swapping DIP 5 to "on" in MAME got the word "FREEPLAY" displayed and let me start a 1P or 2P game, I got started on the second half of the hack. When you get a game over, the game checks to see if you added another credit. If you did, it goes right back to the start game screen where you can select the game type. If you did not, it goes back to the attract sequence. Since this is an arcade game running on a CRT I didn't want any unnecessary burn-in, and I wanted it to feel like it was always intended to have free play. So I went back to the function that looked at the credit count to determine whether to go back to the attract sequence or not. It was using a 3-byte 6502 instruction to load the coin count into the A register. Luckily, unconditional jumps are also 3 bytes! So I wrote another small function that checked whether DIP 5 was set or cleared and then either loaded the coin count into the A register or set the A register to zero coins. Basically, when the game is in free play mode, I lie to the function that there are no credits in this one scenario so that I can guarantee the game goes back to attract mode. In coin mode, the game will do what it was originally doing. I then replaced the original credit load with a jump to that function and tested it again in MAME.
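As an aside, the size math here is standard 6502 encoding; both instructions occupy exactly three bytes, which is what makes the in-place swap work:

LDA $xxxx → AD ll hh (3 bytes: load A from an absolute address)
JMP $xxxx → 4C ll hh (3 bytes: unconditional jump to an absolute address)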

With all that tested, it was time to burn ROMs and test it in my actual cabinet. The original ROM chips for the game had no seal over the erase windows and the game is extremely common so I wasn't worked up about reusing the ROMs. So I popped them into my UV eraser and rewrote them with the patched ROMs that I tested in MAME and popped them back into the cabinet. Success! With DIP 5 turned on, the game is now in free play mode and I can play the game!

If you want to apply the patches to your own copy of the game or you're just curious and want to see the source code, I put it all on GitHub here. The code was so simple and 6502 is so easy to work with that I did not bother setting up a 6502 assembler. Instead, I used this 6502 reference and hand-assembled the instructions that I needed. I'm probably going to go back sometime this week and fix the spacing oversight on the "FREEPLAY" display function, but aside from that all is done!

Alpha Blending in Software

June 14th 2021, 8:45:56 am

For the past couple of months I've been reverse engineering Konami's TXP2/IFS container formats and AFP animation format out of curiosity. The latter is a Flash-based animation file format that appears to have been forked and extended by Konami starting around 15 years ago. It retains the same general concepts as SWF. Animation files are composed of a list of tags, some of which are acted on each frame in order to place, update and remove objects from the animation canvas. Bytecode can be executed that has access to the placed objects and the current execution engine. Color blending and placed-object masking are available. The root animation is known as a clip, and an animation can include additional clips, which are embedded animations complete with their own set of tags. These are treated the same as any other placed object and can be transformed and color blended at will, as well as contain their own embedded clips. Animations can import and export tags from other animations, so AFP files can be used as libraries or embed animations from other files in their own animations.

The full format is in use in The*BishiBashi, which implements all levels and most of the menus as AFP files with a ton of bytecode for the level logic, and many AFP files acting as libraries for things such as displaying ready/go/finish animations and common functions. If we fast-forward to current games, many of the original features have been stubbed out and no longer have code backing them. The format has been extended to allow for 3D transforms and a camera system instead of SWF's simple affine transforms. As of writing this blog post, AFP files appear to be used in virtually all Konami games for animations in menus, character displays, background videos and the like. Bytecode use is limited to setting properties on placed clips such as looping animations, requesting masks and other simple playback features.
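To make the tag/clip structure concrete, here is a toy model of the frame loop in Python. Every name in it is invented for illustration; it sketches the concepts described above, not the actual AFP structures:

# Toy model of an SWF-style tag/clip frame loop. All names here are
# invented for illustration and do not mirror the real AFP format.
class Clip:
    def __init__(self, frames):
        self.frames = frames   # one list of tags per frame
        self.placed = {}       # depth -> {"obj": ..., "transform": ...}
        self.frame = 0

    def advance(self):
        # Act on this frame's tags to place, update and remove objects.
        for tag in self.frames[self.frame]:
            if tag["op"] == "place":
                self.placed[tag["depth"]] = {"obj": tag["obj"],
                                             "transform": tag.get("transform")}
            elif tag["op"] == "update":
                self.placed[tag["depth"]]["transform"] = tag["transform"]
            elif tag["op"] == "remove":
                self.placed.pop(tag["depth"], None)
        # Embedded clips are placed objects like any other and advance
        # their own tag streams each frame.
        for entry in self.placed.values():
            if isinstance(entry["obj"], Clip):
                entry["obj"].advance()
        self.frame = (self.frame + 1) % len(self.frames)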

What started as a dive into the way The*BishiBashi stored and executed level data turned into a full-on AFP rendering engine. This is, of course, written in Python 3 with some equivalent C++ code that can be loaded in for performance-critical sections such as blending pixels. My goal wasn't to provide a real-time viewer of files, but to provide as accurate a documentation of the format as possible. Because of this I implemented all of the pixel blending in software instead of relying on GPU functions. If you are curious, all of the code that implements the pixel blending discussed here can be found in the blend directory on GitHub. I maintain two equivalent blending engines, one in pure Python 3 and one in C++. You can examine whichever one is closer to your preferred language. AFP, much like SWF, supports several different blending modes which are effectively identical to the ones found in Adobe products such as Photoshop. While additive, subtractive and other blending modes are interesting, they do not contain nearly the number of gotchas that "normal" blending has. So, I'm focusing on the normal blending mode, which works out to alpha blending a source pixel onto a destination pixel.

Let's start out with the simplest iteration of alpha blending. An RGB pixel is essentially a group of 3 integers that represent the intensity of light (or brightness) for the red, green and blue that get mixed together to form the final color. Many software packages choose to use 8-bit pixels, meaning there are 2^8 possible numbers that can be stored for each color, from 0 through 255. A 0 in a particular color's integer bucket means there is absolutely none of that color mixed into the final color, and 255 means that as much of that color as possible is mixed into the final color. Any number in between corresponds to a brightness that is proportional to the number itself. If you divide each number by the maximum possible number (255) you can visualize the color as a percentage instead. So, an RGB color of "128, 0, 0" can be thought of as "50% of possible red, 0% of possible green, 0% of possible blue".

With RGB you can represent every single color that is possible to display on a computer screen. RGB colors are missing something, however. There's no way to represent how see-through a pixel is. This means that if all you have is RGB you can accurately store an image, but not how it would interact with another image. Blending is boring in this case because each image is fully opaque. If you had an image already placed on a canvas (the destination) and wanted to blend a new image with it (the source), you would simply replace each pixel in the destination with the corresponding pixel from the source. We can fix this by adding a new integer, "alpha", to the RGB pixel, creating an RGBA color instead. The alpha number works very similarly to the red, green and blue numbers. It can hold any number from 0 through 255. However, instead of representing the brightness of a particular color, it represents the amount by which the RGBA pixel modifies a destination pixel when it is blended. It means nothing on its own, but it allows us to store how each pixel should interact with another image if blended.

For ease of discussion, let's assign a variable to each part of the source and destination RGBA colors:

  • Sr = The red component of the source image.
  • Sg = The green component of the source image.
  • Sb = The blue component of the source image.
  • Sa = The alpha component of the source image.
  • Dr = The red component of the destination image.
  • Dg = The green component of the destination image.
  • Db = The blue component of the destination image.
  • Da = The alpha component of the destination image.

Now blending a source image with a destination image can get interesting. Your source image can have some fully or partially transparent pixels and some fully opaque pixels. You can now represent things like stained glass windows, sunglasses, chain link fences and any other thing in the real world that allows part or all of the thing behind it to be visible. We can come up with some simple code to blend each primary color in each pixel based on the source image's alpha which dictates how much of the source and destination color we mix together. That code looks like the following:

source_percent = Sa / 255
source_remainder = 1 - source_percent

Dr = (Sr * source_percent) + (Dr * source_remainder)
Dg = (Sg * source_percent) + (Dg * source_remainder)
Db = (Sb * source_percent) + (Db * source_remainder)

Effectively, the code is figuring out what ratio of each color to include when mixing. If you had a source RGBA color of "255, 0, 0, 64", that works out to a source percent of 0.25 (25% of the final color should be the source) and a source remainder of 0.75 (75% of the final color should be the original destination). If you were blending a source image representing red stained glass onto your destination you would expect the resulting image to be tinted red. That's exactly what the code ends up doing! If you work out the math for a source alpha of 0 (completely transparent), you can see that the color component equations simplify to leaving the destination colors unchanged. If you set the source alpha to 255 (completely opaque), you can see that the equations simplify to setting the destination colors equal to the source colors. Anything in between and your destination pixel ends up being a mix of the source and destination colors with the appropriate ratio.
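Here is the same logic as a small runnable Python function. This is just my restating of the pseudocode above, not the renderer's actual code:

# Simple alpha blend of a source RGBA pixel onto an opaque destination
# RGB pixel, exactly as in the pseudocode above.
def blend_simple(src, dst):
    sr, sg, sb, sa = src
    dr, dg, db = dst
    source_percent = sa / 255
    source_remainder = 1 - source_percent
    return (
        round(sr * source_percent + dr * source_remainder),
        round(sg * source_percent + dg * source_remainder),
        round(sb * source_percent + db * source_remainder),
    )

# Red stained glass (alpha 64) over a white pixel gives a red tint.
print(blend_simple((255, 0, 0, 64), (255, 255, 255)))  # (255, 191, 191)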

You might notice that the above code does not handle one particular thing. It does not update (or even use) the destination alpha. This is because we assumed that the destination canvas was fully opaque. That makes sense in many cases because usually the destination canvas is an image that we are going to display on a computer screen or print out later. That means we can consider the destination to be the final image, and thus we can assume that the alpha component for each pixel in the destination is 255 (fully opaque). If all you want to do is alpha blend a source image onto a destination and then look at it, you're done! However, what if you want part or all of the destination canvas to be transparent? What if the final canvas is meant to have transparency so it can be placed onto another canvas? In that case, we need to update the code a little bit:

source_percent = Sa / 255
destination_percent = Da / 255
source_remainder = 1 - source_percent

Dr = (Sr * source_percent) + ((Dr * destination_percent) * source_remainder)
Dg = (Sg * source_percent) + ((Dg * destination_percent) * source_remainder)
Db = (Sb * source_percent) + ((Db * destination_percent) * source_remainder)
Da = (255 * source_percent) + ((255 * destination_percent) * source_remainder)

Okay, this got a little complicated! Let's dissect the new code a little bit. First, you'll notice the introduction of destination percent. I placed parentheses around where it's used so it's easier to see that we are now doing the same thing to the destination colors as we are to the source colors! Now, both the source and destination colors are being scaled down by their respective alpha percentages. Essentially, we are using the alpha component to figure out how much of each color should be mixed into the final color for both the source and destination image now. If you work out the math for a destination alpha of 255, you'll see that this code simplifies down to the original code for the three color components! We're still computing a ratio of the two colors based on the source alpha since it is the one being blended onto the destination. We just added scaling the destination color components by the destination alpha. The second addition is updating the destination alpha. We treat it almost the same as the color components, except that we already have a percentage so we multiply by the maximum number (255) instead of the alpha component itself. Again, if you plug in 255 for the destination alpha, you'll see that the equation turns into "Da = 255", which matches the assumption we had previously made! And again, if you have a source pixel that's fully opaque (alpha component of 255) or fully transparent (alpha component of 0), the equations simplify to setting the destination RGBA components equal to either the source or destination as you would expect. Cool!

But wait! The above code has a subtle bug. Imagine we are blending a source pixel with an alpha component of 64 onto a destination pixel with an alpha component of 128. The source percent works out to 0.25 and the destination percent works out to 0.50. That means we are blending 25% of the source pixel and 50% of the destination pixel together. To do this, we scale the source and destination color components by 0.25 and 0.50 respectively before blending them together. We then compute the new alpha component, which works out to 159. That means nothing by itself, but remember, the whole point of keeping around a destination alpha is because we want a new image suitable for blending with another canvas. The RGB components have already been scaled down by their respective alpha components when we mixed the colors. But we also computed a new alpha component that was not 255. When we blend the image on this canvas with another image, we will multiply these RGB color components by the new alpha percentage that we computed (which works out to about 0.623), meaning we will have scaled our colors down twice! Our colors will end up darker than we wanted because of this! This is easiest to see if you assume a source pixel that's fully transparent (alpha component of 0) and a destination pixel that's partially transparent. The destination RGB components get multiplied by the destination percent, and the destination alpha gets left alone, meaning we just accidentally darkened the colors.
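If you want to double-check those numbers, the arithmetic is quick:

# Verifying the example above: source alpha 64 over destination alpha 128.
source_percent = 64 / 255        # ≈ 0.25
destination_percent = 128 / 255  # ≈ 0.50
new_alpha = (255 * source_percent) + (255 * destination_percent) * (1 - source_percent)
print(new_alpha)                 # ≈ 159.9, truncated to 159 as an 8-bit value
print(159 / 255)                 # ≈ 0.623, the new alpha percentage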

The key to understanding this bug is this: in our original code, we assumed the final alpha component was always 255. We didn't premultiply the destination color by its alpha percentage. We only took the ratio of the two colors, computed by figuring out the alpha percentage of the source. In these new equations we are still computing the ratio based on the source alpha, but we are also scaling the destination by its alpha percentage as well. You'll notice that when the destination alpha is 255, these equations work out. It's no coincidence that the bug does not appear in these scenarios! That's because the ratio of colors we compute is not out of 255, but out of the final alpha component! It's only when the final alpha percentage is 1.0 that we have actually calculated the colors correctly. The fix, then, is to perform the inverse scaling on the destination RGB components so that when we later scale based on the alpha component we computed, we get the correct colors:

source_percent = Sa / 255
destination_percent = Da / 255
source_remainder = 1 - source_percent
final_percent = source_percent + destination_percent * source_remainder

Dr = ((Sr * source_percent) + ((Dr * destination_percent) * source_remainder)) / final_percent
Dg = ((Sg * source_percent) + ((Dg * destination_percent) * source_remainder)) / final_percent
Db = ((Sb * source_percent) + ((Db * destination_percent) * source_remainder)) / final_percent
Da = 255 * final_percent

Finally! You'll note that the computed destination RGBA pixel, if used as a source image in this same code in the future, gets multiplied by that same final percent. So we've effectively un-scaled the colors so they can be correctly scaled again in the future. The destination alpha component equation is factored out but remains the same. We were always calculating the alpha component correctly; we just weren't taking into consideration that the RGB components were going to be scaled by the alpha component when the final image was blended onto a new canvas. If we take the previous example of a source pixel that is fully transparent and a destination pixel that is partially transparent, you can see that final percent reduces to the destination percent, meaning that the color component equations do indeed reduce to "Dr = Dr, Dg = Dg, Db = Db, and Da = Da". Aside from some clamping to make sure color components always stay in the range of 0 through 255, this is the code that appears in the AFP renderer. It was necessary in order to create PNG and WEBP files that could be placed on top of other graphics after they were rendered.
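For completeness, here is the final version as runnable Python with the clamping mentioned above. The early return for a fully transparent result is my own guard against dividing by zero; the real code in the blend directory may handle that case differently:

# Full RGBA-over-RGBA "normal" blend using the final equations above.
def blend(src, dst):
    sr, sg, sb, sa = src
    dr, dg, db, da = dst
    source_percent = sa / 255
    destination_percent = da / 255
    source_remainder = 1 - source_percent
    final_percent = source_percent + destination_percent * source_remainder
    if final_percent == 0:
        # Both pixels fully transparent; nothing to blend (my assumption).
        return (0, 0, 0, 0)
    def clamp(value):
        return max(0, min(255, round(value)))
    return (
        clamp((sr * source_percent + dr * destination_percent * source_remainder) / final_percent),
        clamp((sg * source_percent + dg * destination_percent * source_remainder) / final_percent),
        clamp((sb * source_percent + db * destination_percent * source_remainder) / final_percent),
        clamp(255 * final_percent),
    )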

Mirror Git to Mercurial

June 17th 2020, 2:53:35 pm

Tired of Git absolutely destroying your work due to its garbage interface and dozens of built-in foot-guns? Need to use it to publish your code to GitHub? I keep a dual-repo of everything I work on locally so that I can do my real work in Mercurial but mirror manually to Git for publishing. It's a bit of a pain in the ass, but it's less of a pain in the ass than having to use Git in any way other than adding commits and pushing to GitHub. If you find yourself in a situation where you want to dual-repo an existing personal project, here is a handy set of command-line instructions that you can run on Linux to make a commit-accurate (but not time-accurate) mirror of your Git repo in Mercurial. Note that this is for personal projects, so it will assume you're the committer on everything. But maybe it will help you like it helped me?

  • List out all commits from HEAD (make sure to "git checkout trunk" or similar first).
git rev-list HEAD --reverse > /tmp/hgimport

  • Init mercurial
hg init

  • Copy gitignore, format it for mercurial syntax, ignore git repo
echo -e "syntax: glob\n\n.git/\n.gitignore\n.gitattributes" | cat - .gitignore > .hgignore

  • Go through commit list, check out commit, clean the repo (delete removed/untracked files), add/remove those files, commit them with the git commit message.
for I in $(cat /tmp/hgimport) ; do git checkout $I ; git clean -d -fx --dry-run . | sed 's/Would remove //' | grep -v "\.hg" | xargs rm ; hg addremove . ; hg commit -m "$(git log --format=%B -n 1 .)" ; done

  • Clean up temp export
rm /tmp/hgimport

Once you run the above steps, you should be able to re-run "git checkout trunk" or similar and see that your Mercurial repository is clean. You will have untracked .hg/ and .hgignore files in your Git repo, but it's easy enough to add them to your .gitignore in a commit. Once this is done, you can continue where you left off working in Mercurial. My preferred workflow is to get things to a point where I'm ready to publish, then one-by-one check out the Mercurial commits and commit them to Git so the history matches. It's not perfect, but it is virtually free of foot-guns and I haven't lost work yet in 3 years of doing this. Can't say the same about just using Git to do regular stuff.

MiniDragon Takes its First Steps!

May 1st 2020, 6:19:36 pm

Last time I posted about the MiniDragon CPU project I had assembled several components, sent out the heart of the instruction decoder to fabrication and was just beginning to assemble the (at the time) 14 additional register boards to complete the PC, IP and A registers. I had done some initial optimization of the existing stdlib and had a pretty good idea of how I wanted to lay out the boards physically. I had not yet tested any of the boards with each other, although I was relatively confident in their operations. A lot has changed since then! First and foremost, I have assembled enough of MiniDragon to successfully execute my first instruction! Last night, after debugging the last known issue with the boards, I was able to step through the microcodes for LOADI and set the A register to the sign-extended immediate value stored in the lower 6 bits of the LOADI opcode. This works both in manual clocking mode where I can single step with a pushbutton and in automatic mode where the clock generation circuit runs the CPU at a predetermined speed.

Problems with the Instruction Decoder

The instruction decoder is essentially a pull-down bus with 32 parallel control signals. There is a distributor which provides 1K pull-ups and a pair of 32-pin connectors designed to plug into the ROM boards as well as feed the various control signals for the CPU. The ROM boards consist of a series of jumpers allowing me to set which control signals should be pulled low when the ROM is active. There are two variants of the ROM board: a 4-position board which has two bits of addressing and an enable input, and a 1-position mini-board which has only an enable input. When a ROM board is enabled, it will select the correct set of jumpers and pull the correct bits of the bus low to set the control signals for a particular microcode. When it is not enabled, the open-collector transistors providing the pull-down effect will be deactivated and thus high impedance. This is the most elegant solution I could find for tying a large number of ROM boards together.


Initial bring-up of the first ROM board connected to the control signal distributor board.

Now, when I measured the control signal outputs, everything seemed acceptable. The control signals illuminated their respective indicator LEDs and output ~4.7V according to my oscilloscope. When off, they appeared close enough to 0V to cut it. So, I assembled enough ROM boards to code for a LOADI instruction, including a mini-board that stored the "load next instruction from memory, placing it into the instruction register" microcode. In isolation, everything appeared to work fine. When I connected the instruction decoder board to the microcode counter board, I was able to successfully step through the various microcodes, seeing the control signals change for every step. So I hastily assembled and connected the rest of the components necessary for the LOADI to work (instruction register, A register, immediate register).


The boards I assembled in one marathon assembly in order to integration test a single instruction.

You know what comes next. Nothing worked. Nothing. Somehow the microcode counter wasn't counting up, the instruction register wasn't loading the value from the bus, and it seemed like I completely fried something. The only thing that appeared to work was the SRAM emulator circuit which was faithfully outputting an 8-bit binary value onto the data bus that I had entered on a bank of DIP switches. I know better than to just throw everything together and hope for the best but I'll admit that I was very excited. So, it was back to the drawing board and I was quite disappointed. I put the project down for a night and gave it some thought.

The next morning, I started by taking some measurements on the control signals. I found it a bit suspicious that when I connected the instruction register to the control signal distributor, the indicator LED for that signal got quite dim. So, I poked at it with an oscilloscope and found that the control signal line was sitting at around 3V. Unplugging one of the two 4-bit registers bumped it up to almost 4V, and unplugging both brought it back to the expected 4.7V. However, when buffering through a pair of NOT gates, the voltage for a logic high only dropped by a few tenths of a volt. So I went back to the schematics and realized that I had a pair of design flaws.

The inputs for all of my gates go through 10K current limiting resistors before hitting their respective transistors. For almost all circuits, there is only one resistor/transistor pair that inputs connect to. However, the 4-bit registers are effectively 4 identical copies of a 1-bit register laid out on a single board. I had connected both the EN and RST lines directly to the respective 4 bits. This meant that while the data inputs and clock input could be seen as costing 1 fanout to a driving circuit, the EN and RST lines cost 4 fanout. So, connecting two registers was a cost of 8 fanout. Basically, connecting a single register board was equivalent to trying to drive four NOT gates at once. Most of my circuits included an emitter-follower buffer on the output stage, ensuring that there is no current limiting and effectively making the fanout a function of the input impedance (normally 10K per circuit) and the 2N2222A's maximum current. So this flaw would not be fatal in and of itself. However, the control signal distributor did not include any output conditioning. It included only a 1K pull-up per signal, allowing that to drive the logic level to 1 when a ROM wasn't setting it to 0.

In the control signal distributor the 1K resistor acts as a current limiter and makes the output voltage susceptible to the number of gates connected to that circuit. To understand why, Ohm's law can be used. If we look at the voltage at the control signal pin when only one circuit is connected, we have 5V feeding a 1K resistor, the control signal pin and then a 10K resistor feeding the base of a 2N2222A. While there is a voltage drop across the 2N2222A, it doesn't matter when trying to understand the problem. So, we have a simple voltage divider with the voltage at the control signal pin equivalent to 5V * (10K/(10K + 1K)), or around 4.5V. That should work. Okay, so let's connect two circuits. Ohm's law also allows us to calculate the effective resistance of multiple resistors in parallel, which in this case is 5K. So, the control signal voltage is now equal to 5V * (5K/(5K + 1K)), or around 4.1V. It's easy to see the pattern here: the more inputs you connect, the lower the effective resistance on the low side of the 1K resistor and therefore the lower the voltage. At a certain point, it's no longer enough to be seen as a logic 1.

So, with the existing design of the control signal distributor there was no way to reliably drive more than 1-2 circuits on a single control line, and an 8-bit register was effectively 8 circuits from the perspective of the instruction decoder. Fortunately, I know how to solve this! Most of my circuits already included emitter-follower buffers to drive their outputs. These buffers pull down to GND instead of up to 5V, meaning they can source as much current as downstream circuits demand up until the transistor burns out. For the 2N2222A, you can provide about 800mA of current before the transistor dies. With 10K current limiting resistors on all the inputs of my circuits, that fanout works out to ~1600 circuits. For my purposes, that's effectively infinite. So, I needed to design a buffer circuit that could be plugged into the control signals distributor that provided an emitter-follower buffer per bit. I looked at my schematics and realized the clock and reset circuit also suffered from the same flaw on its outputs. So, I designed an 8-bit buffer for the control signals and a 2-bit buffer for the clock and reset circuit.


Emitter-follower buffer for the clock and reset lines.

It happens that the 2-to-4 decoder also suffers from this particular flaw. I cheaped out on the design, dropping the output buffer as well as the indicator LEDs. So, instead of trying to patch existing circuits I redesigned it to include both the standard buffering and indicator LEDs that are available on all other circuits. Since designing it I've found that the LEDs are an amazing debugging aid, so making larger (and more expensive) boards which contain them always seems to be worth it. That wasn't as clear to me when I laid out the original version of the board a few months ago. Also, remember in a previous blog post where the original AND and OR gates I designed didn't work in some cases? That turns out to be the same root cause, so if I'd taken a bit more time to understand why they failed the way they did instead of just shrugging and throwing the output stage back on them, I might not have had to fix so much. Oh well, what's done is done and I have a much more solid understanding of just why everything is actually working together.

Additional Issues

Of course, in any engineering project bigger than a small program or circuit there will be unforeseen problems. So it's no surprise that when I took the time to bring up the components slowly and in a more organized fashion I found several more issues with my overall design. The first related to the data inputs of the various registers. The data bus works very similarly to the control signals distributor. There is a central backbone component providing 1K pull-ups and a bunch of circuits hanging off of that which either read from the bus, write to the bus or both. They work under the principle that when a component isn't reading it doesn't consume any current from the bus and thus doesn't affect its voltage level. That turned out to not be the case for the 4-bit registers. This is due to the inverter connected to the reset input of the internal SR-latch for each bit. So, if I connected enough registers, the voltage on the data bus would sag until other registers only saw 0's on the bus. I didn't want to sacrifice the elegant design of having one 8-bit bus connection per register, so I instead sacrificed on the current consumption of the bus circuits. Remembering the formula for voltage dividers above, we can generalize it for X registers connected like so: 5V * ((10K / X)/((10K / X) + 1K)). As X increases, the dominant factor in the equation becomes the 1K resistor. So I swapped out the 1K pull-ups for 100Ω pull-ups instead. If we assume 5 registers connected, we go from 3.33V for a logic 1 up to 4.76V, which is more than adequate.
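Plugging the numbers from this and the earlier divider discussion into a quick script makes both the problem and the fix easy to verify. This is back-of-the-envelope math only, ignoring transistor drops:

# Voltage on a bus line: 5V through a pull-up into X parallel 10K inputs.
def bus_voltage(inputs, pullup_ohms):
    r_load = 10_000 / inputs                 # X 10K input resistors in parallel
    return 5.0 * r_load / (r_load + pullup_ohms)

print(bus_voltage(1, 1_000))   # ≈4.5V: one circuit on a 1K pull-up
print(bus_voltage(2, 1_000))   # ≈4.17V: two circuits, already sagging
print(bus_voltage(5, 1_000))   # ≈3.33V: five registers, no longer a solid 1
print(bus_voltage(5, 100))     # ≈4.76V: the same five registers on a 100Ω pull-up

# Fanout of a buffered output: the 2N2222A's ~800mA limit divided by the
# 0.5mA each 10K input draws works out to ~1600 circuits.
print(0.8 / (5.0 / 10_000))    # 1600.0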


Partially re-connected components as I brought up each piece methodically.

The second problem was somewhat surprising, but easy to account for. There's a large amount of voltage drop across the feeder wires from the bench power supply to the circuits themselves. This is obviously current-dependent since wires have parasitic resistance. So, the more circuits I plugged in, the more the voltage sagged at the inputs for each circuit. Remember that the power-on reset circuitry in the clock generator looks for a voltage of around 4.7V and holds the system in reset if it drops below that amount. This allows the system to self-reset when being powered for the first time but also means that we're fairly sensitive to voltage drops. Once I figured this out it was easy enough to put a probe from the oscilloscope on the power inputs for a random circuit and adjust the voltage of the bench power supply until the oscilloscope read 5V. Once I did this, the circuits that were behaving erratically when everything was connected started behaving correctly.

The third had to do with the way the instruction decoder ROM boards pulled relevant control signals to 0. I noticed that when the ROM boards were active, they would only pull the control signals to ~1V. This was good enough before the emitter-follower buffers, but now that those were in place lots of circuits were seeing all control signals as active all the time. This turned out to be the current limiting resistor feeding the open-collector pull-down transistors. In the case of the 4-position ROM boards, I replaced the resistors with bridges as they were fed by the internal demux circuit which was already current limited. For the mini-ROM boards I first tried a bridge but burned out the enable transistor on the load instruction ROM. Replacing the bridge with a 1K resistor instead of the original 10K resistor did the trick.

Finally, I had a phantom problem with the clock. I noticed that when additional circuits were connected, the clock would start behaving more and more erratically. It got to the point where just connecting the clock to the general purpose registers was enough to break synchronization for the whole CPU. I spent a bunch of time thinking on this one before starting to take a bunch of measurements. However, they all ended up being unnecessary as the problem turned out to be the 2-position DIP switch on the clock circuit itself. This switch allows you to select between manual and automatic clocking and must have been shorting in some cases, leading to erratic clock behavior. After flipping the DIP switches a few times, the problem went away!


Current state of MiniDragon, with A register holding the binary value 00010111 which was set via a LOADI instruction.

New Physical Layout

I wasn't quite happy with the layout from last time as it stretched some control signals 4+ feet and took up a lot of room physically. So, I came up with a revised 3x3 layout instead of a 2x5 layout. This new version puts the busses much closer to their intended components and also takes into account power circuitry. In order for this to work, the circuits will have to be stacked higher, but that's no big problem. My partner also pointed out that she thinks it would look cooler with clear acrylic instead of black ABS, so I have a few 12"x12" sheets coming to try that out. I agree with her since it's a bit disappointing to work hard on LEDs and circuit aesthetics only to have everything obscured by stacks of additional boards. So, we'll see how that looks!

I've done lots of little tweaks to the schematics in order to make sure they stay up-to-date with the actual built MiniDragon. There haven't been many design changes except to the instruction decoder itself, which needed updates to reflect a new opcode layout as well as the buffer boards. I also switched the MCODE_RST control signal to being active low instead of active high. I was a bit worried that it was possible for a ROM board to be deselected before another ROM board was activated. If you remember, control signals are default high with pull-down jumpers. That means if there were variable delays in the enable lines for various microcode ROM boards it would be possible for the MCODE_RST line to pulse high briefly. This is tied to the RST inputs of the flip-flops on the microcode counter board, which are async. So, in order to head off any potential problems I switched the polarity of the signal. That way, it's active low and default high, meaning it will never be activated by mistake. All other control signals are acted upon during clock change so this isn't a problem for any other circuits. As a bonus, this also saves jumpers as only one is needed on the final microcode.


The new layout, with bold highlighting completed boards.

Software Side

My progress on MiniDragon has not been limited to hardware. When I needed a break from assembly or I was mulling over problems I tackled lots of TODOs on the software side of things. As I've been coding the stdlib I've run into a few annoying patterns. One of them was that you had to stick temporary values in memory in order to perform math against them. The accumulator was just that: an accumulator. So, more complicated algorithms wasted a lot of their time shuffling values around in memory to get their job done. I took a look at the instruction set and realized that the "no variable width instructions" ship had sailed with the introduction of LNGJUMP and decided to make LOADI work the same way. So, I moved it into the stack operations opcode range and updated the microcode so that LOADI loads an 8-bit immediate value from the next position in memory after the opcode. This meant that a lot of operations in the CPU were much faster now because setting an arbitrary value in A didn't involve lots of shifting and adding. It also meant that I freed up 64 opcodes for use with register math!

I realized that I had several spare register boards lying around that I could populate and bring up, so I introduced two temporary registers: U and V. I added additional opcodes to work with these registers. They both hold an 8-bit value, and they can be used as the second math source for all ALU operations. This means that you can manipulate the accumulator using the value in memory that PC points to or the value in U or V. Additionally, instructions to load/store the value of memory at PC into U/V have been added, instructions to move values between A, U and V have been added and instructions to swap the value of A and U, A and V or U and V have been added. I took advantage of these instructions in a few functions, most notably the umult function which is both 20% faster and slightly smaller in the bootROM. Similar optimizations have been made to several other functions. Notably, atoi is almost 50% faster than it was when I originally coded it.

Nothing comes for free, however. If you remember in the last blog post, I had three control signals to spare for future additions. In order to have a working U and V register, I need four control signals (read bus to U, read bus to V, write U to bus, write V to bus). So, I had to take a look and find another group of signals that would never be active at once. I settled on the control signals that feed various bus outputs to the top half of the data bus. It shouldn't be possible to have more than one signal for data bus writing active at once because then two circuits would fight over what value to place on the bus. So, I took three signals (ALU_OUT, A_HIGH_OUT and D_HIGH_OUT) and connected them through a 2-to-4 decoder so they could take up only two bits on the ROM boards. As a bonus, the default value (11 in binary) is mapped to nothing, meaning that for these signals, leaving jumpers out defaults to no signals active. Much like the MCODE_RST negation above, this saves a lot of jumpers in practice.

In order to make the U/V registers work with math, a new chunk of the opcode space has been carved out for arithmetic that uses two bits as the operand input. This also left room for modifications to SHL/SHR. As of the last blog post, shifting left or right always shifted a 0 into the unoccupied spot, and always shifted the lost bit into the carry flag. This meant that in order to shift a 16/32-bit number you would have to do a fair amount of juggling. Given that shifts do not take a second operand, I had room in the opcode layout for variations on shifting. So, I've added rotate left/right and rotate-with-carry left/right. The former, known as ROL/ROR, shifts as expected, but sets the shifted-in bit to the bit that was shifted out, allowing a barrel rotate. The latter, known as RCL/RCR, does similar, but the bit which is rotated in comes from the carry. Both set carry to the bit that was shifted out. RCL/RCR are especially useful in umult as they allow me to multiply/divide the two operands by two as per the Russian peasant algorithm far more efficiently.
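To pin down those semantics, here is each rotate modeled on an 8-bit value in Python. This is my own model of the description above, not code from the project:

# 8-bit rotate semantics as described above. Each returns (result, carry_out),
# and the carry out is always the bit that was shifted out.
def rol(v):
    out = (v >> 7) & 1                     # bit shifted out the top...
    return ((v << 1) & 0xFF) | out, out    # ...wraps around into bit 0

def ror(v):
    out = v & 1                            # bit shifted out the bottom...
    return (v >> 1) | (out << 7), out      # ...wraps around into bit 7

def rcl(v, carry):
    out = (v >> 7) & 1                     # the carry flag rotates into bit 0,
    return ((v << 1) & 0xFF) | carry, out  # which chains multi-byte shifts

def rcr(v, carry):
    out = v & 1                            # the carry flag rotates into bit 7
    return (v >> 1) | (carry << 7), out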

The udiv function is now more correct than it was before. Previously, it could only divide unsigned integers that were in the range of 0-127. While it was performing unsigned math, it could not function against numbers that appeared signed on account of how it determined that it was finished. Given the speed-ups afforded by the U/V registers, I was able to switch it over to using ucmp and lift this restriction. It is still twice as slow as it used to be, but it no longer has any caveats. There is also a working implementation of umult16 and umult32 meaning that while MiniDragon technically only has an 8-bit adder, it can successfully multiply two 32-bit integers as long as their result fits in 32 bits.

The assembler/disassembler I've written for the project has its limitations, one of which prevents having separate opcodes decode to the same mnemonic. I work around that by having plenty of duplicated instruction names which include the operand in the name, such as LOADA to load the A register and LOADU to load the U register. This looks ugly in practice, so I sugar over it with macros, which the assembler has supported from the beginning. So, under the hood there is a LOADA, LOADU and LOADV instruction, and then there is a LOAD macro which takes a single parameter and emits the correct mnemonic. This allows me to type things such as LOAD A instead of LOADA, making the assembly much prettier in my opinion. If I were truly ambitious, I could integrate with an existing table assembler, but I don't see the benefit at the moment.
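A minimal sketch of what that macro sugar amounts to. The mnemonics are from this post, but the code itself is invented for illustration and is not the actual assembler:

# Sketch only: expanding a LOAD macro to the real underlying mnemonic.
MNEMONICS = {
    ("LOAD", "A"): "LOADA",
    ("LOAD", "U"): "LOADU",
    ("LOAD", "V"): "LOADV",
}

def expand(macro, operand):
    # "LOAD A" in the source assembles as if you had typed "LOADA".
    return MNEMONICS[(macro, operand)]

print(expand("LOAD", "U"))   # LOADU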

Next Steps

With the successful integration of the currently built boards and the execution of the (outdated) LOADI instruction I now have a lot more confidence in my direction. I still need to design and lay out the ALU and SRAM interface circuits. However, the design of the rest of the components is solidified and tested. All I have to do to complete the general purpose registers and the majority of the instruction decoder is solder about a billion components to way too many circuits, bring them up and then build the various boards with them. I also need to finish up several functions for the standard library and choose an 8-bit serial chip to use for IO routines. A friend of mine has graciously volunteered a serial terminal that I can use to interface with MiniDragon some day when it is complete. So hopefully next time I write an entry I'll have things further along!
