My dream computer (Part 1: Processor architecture)

Preamble

For multiple years now I have had a vision for how a perfect operating system should be designed. The specifics transformed slightly over the years but have generally stood the same. However I have never actually gotten to implement this design that envisioned, because of some or another problem that was hard to solve at the time. I have still learned many things with my failed attempts and considered more radical ideas too. In my continuing efforts in implementing my ideas some day I want to write my most important thoughts on this topic in the next few days because I realized that a problem of such great complexity is hard to solve just within my head. These notes should be helpful for me but may also interest others and so are published here.

During my time of failing to implement the thing I found myself questioning lower-level things a lot and so reevaluated these too and came to the conclusion that these should be redesigned aswell. For my overview I will therefore start from the bottom.

In the rest of this article I am also just writing what I think is true or close to the truth even for topics that I know less about. So view the text below as pure speculation. I expect some things to be wrong. In that case, please tell me.

Content

My goal is it to design a Computer system (You should know what that means to you.). The heart of the physical part of this machine is a Processor, the remaining parts store data or interface with the outside world or user and are of less interest to me. Software is also important and I will write more about this in the next posts.

For performing the actual computations many design principles have been found in the past: You could use a mechanical contraption, DNA/Protein's biochemical reactions or anything really. Electric circuits however are most useful and widely deployed today and since I do not have the expertise, I am not reconsidering at this level.

With electronic semiconductor logic gates you can arrange circuits in form of a silicon die that perform most arithmetic and logic operations. The arrangement of those gates on such a die is immutable however, meaning it is not possible to be reprogrammed in that way. It is therefore required to build a circuit that is general-purpose and can be programmed with data ("software").

This problem has been approached from multiple sides before. An FPGA is where relatively small clusters of gates can be arranged to circuits by switching connecting gates (that's a simplification). [Write how "traditional" CPUs work] In the end there are many ways to achieve the required semi-turing-completeness required for a computer. The performance of these does differ though. From observing optimization in current CPUs it seems that the useful computation per space per time matters most. The minimum time interval for computations (cpu clock) does primarily depend on the size of the die (which is constraint by the manufacturing technology which I consider fixed) and distance of components on the die. A major part in optimizing computational density of a processor is keeping all parts as busy with computation as possible.

I see a tradeoff between having small reconfigurable parts (like FPGAs) where the overhead of reconfiguration is large (as in: many "useless" gates for programming) and big parts where you lack the flexibility optimize the software. In an unoptimized basic CPU for example, only one part would active at a time. (The ADD instruction only performs an addition while other destinct components like the multiplier, logic unit or memory are idle). Modern CPU design avoid this to some extent with pipelining, speculative execution and simultaneous multithreading which however introduce new complexity and also overhead for this coordination. Also the program execution timing varies at runtime which introduces additional problems. The concept of Very Long Instruction Word architectures appeals to me. It describes architectures where an instruction specifies the configuration of most (usually all) components of the processor at once.

[Skipping a few steps because its already 4 am...] Next I am describing my design: Firstly the processor is made up of a number of modules that have inputs, outputs and parameters. These modules' inputs are then connected with data multiplexers to allow abitrary cross-module data flow. Parameters of modules are bits that configure the modules operating within one clock cycle of the processor. These parameters also include an output index for each input of the module to obtain the data for this cycle from. The processor operates in clock cycles where each cycle a new instruction is loaded and executed. This module system is very flexible; Here are a few examples of useful modules for a processor:

Add/Sub unit: Two inputs and one output. A flag to switch subtract mode.
Literal loader: One output set to a value set by 32 or so parameter bits. Essentially providing a way to introduce new constants from program code
Memory loader: One input (address) and an output that takes on the value of the RAM at the provided address.
Flow controller: Two inputs, a condition and a program counter value, that set the PC when a condition is met, otherwise increments the PC as normal.
Memory storer: (left as an exercise to the reader)
Logic unit
Mul/Div unit
Floating-point unit

To balance the load between modules, the most busy units can be contained multiple times within the processor, allowing for example to perform two addition per cycle.

Additionally, since typical RAM is rather slow, it is also possible to introduce a latency to some modules like that memory loader that need main memory where there is internal uninterupted pipelining for memory fetches such that, for example memory fetches arrive with 4 cycles delay and jumps happen 4 cycles after beeing issued. This always assumes the worst case memory latency and can not utilise a cache. You could however reimplement a cache manually be adding secondary "fast" memory that is local but smaller where you load data into with software.

Still current superscalar CPUs can be smarter about module load balancing and memory caching because of their runtime knowledge, that a compiler does not have. In general, writing a good compiler for this architecture (assuming a traditional language) is relatively hard.

I am already visualizing code for such a processor in my head in a table with time and module axis. For some later post or development I might generate such diagrams from code to think about about. (will probably look super cool)

End

Thats it for today. The writing of this article has been motivated by me "participating" in Macoy Madson's Simple Useful System Challenge. As usual I am always happy to receive feedback about what I am writing here. Send your thoughts! (any method is fine)

Next article will be more about my ideas about software. The overview is:

Compiler/Language design
- Memory safety
- Type systems
- Representing hardware execution/performance characteristics
- Memory management
- Data flow oriented programming
Software design
- Permission checks at compile-time
- Cooperative multithreading asserted by compiler
- No driver-program boundary
- Global Safety

Article written by metamuffin, text licenced under CC BY-ND 4.0, non-trivial code blocks under GPL-3.0-only except where indicated otherwise

metamuffin's personal website