Modifications and tweaks needed to make the fpga 1130 a good clock sync implementation

The clocks in an 1130 are not used to gate flipflops or to control the timing of signal changes, but are referenced in logic to determine actions to take. I think of them more like imprecise, informal, gentleman's agreement states rather than concrete clocks. They form a hierarchy of three levels.

At the bottom is the 280 nanosecond oscillator cycle, approximately 3.57 MHz. The real 1130 ran at 3.64 MHz for the fastest models, corresponding to a 275 ns cycle, but with a 20 ns fpga clock the closest I could come is 280 ns, thus I am running at 98.2% of an IBM 1130's speed. Oscillator drift in a real 1130 could be larger than this differential; consequently, I am willing to consider this a direct match.
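The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch of the calculation; the constant names are made up for illustration, and the 20 ns and 275 ns values are the ones quoted above.

```python
# Figures taken from the text; the names are illustrative only.
FPGA_PERIOD_NS = 20    # fpga clock period
REAL_CYCLE_NS = 275    # oscillator cycle of the fastest real 1130

# Nearest whole number of fpga cycles to one real oscillator cycle.
cycles = round(REAL_CYCLE_NS / FPGA_PERIOD_NS)   # 13.75 rounds to 14
emulated_ns = cycles * FPGA_PERIOD_NS            # 280 ns

# 14 is even, so it splits into equal on and off halves of 7 cycles.
half_cycles = cycles // 2

freq_mhz = 1000 / emulated_ns                    # about 3.57 MHz
speed_ratio = REAL_CYCLE_NS / emulated_ns        # about 0.982

print(f"{cycles} fpga cycles = {emulated_ns} ns, "
      f"{freq_mhz:.2f} MHz, {speed_ratio:.1%} of real speed")
```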

The oscillator turns on for 7 fpga cycles, a duration of 140 ns, and turns off for another 7 fpga cycles. The 1130 uses the oscillator to drive two signals, Phase A and Phase B, that alternate and are of the same 140 nanosecond length. They are not exactly aligned, but for normal system operation we can think of Phase A as the on state of the oscillator and Phase B as the off state. When single-cycle stepping the 1130 from the operator's console, pushing down on the Program Start key causes Phase A to be true as long as the key is pressed. When the key is released or not pushed, Phase B is true. Otherwise, while the processor is running, we see Phase A and Phase B alternating at the basic 1130 clock rate of 280 nanoseconds per cycle.

The next level up in the clock hierarchy covers the basic timing for core storage, which they call a memory cycle. The faster version of core storage on an 1130 has a cycle time of 2.2 microseconds, which is eight of the basic machine cycles. This is tracked with the T-clock states, with one memory cycle involving sequentially states T0, T1, T2, T3, T4, T5, T6 and ending on T7. Each of the T-clock states spans a Phase A - Phase B pair of low level clock states.
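As a rough illustration of how the two lower clock levels nest, the following Python sketch maps an fpga tick number to the phase and T-clock state of a free-running machine. It assumes 7 fpga cycles per phase as described above, ignores single-cycle mode and any extended T7, and the function name is hypothetical rather than anything from the actual fpga source.

```python
FPGA_NS = 20      # fpga clock period, from the text
HALF = 7          # fpga cycles per phase, from the text
T_STATES = 8      # T0 through T7

def clock_state(tick):
    """Return (phase, T state) for a given fpga tick of a free-running clock.

    Simplification: T7 is never extended and the machine is never stopped.
    """
    osc = tick % (2 * HALF)
    phase = "A" if osc < HALF else "B"
    t_state = (tick // (2 * HALF)) % T_STATES
    return phase, f"T{t_state}"

# One full memory cycle at the emulated rate: 8 * 14 * 20 ns.
memory_cycle_ns = T_STATES * 2 * HALF * FPGA_NS

for tick in (0, 7, 14, 111, 112):
    print(tick, clock_state(tick))
print("memory cycle:", memory_cycle_ns, "ns")
```

At the emulated 280 ns oscillator cycle, the eight T states add up to 2240 ns, slightly longer than the 2.2 microseconds of the real machine.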

Core storage is read by flipping all bits of the memory word off and detecting, for each bit, whether that act produces a pulse; a pulse means the bit was previously on. It takes about 1.1 microseconds to first stabilize the X and Y lines that intersect at the chosen memory word and flip the cores to zero orientation. Memory is a stack of these X-Y planes, one plane for each of the 16 bits in a word, plus two planes for parity bits. Each plane has a sensing line that will detect the pulse if a core flips from on to off, or that sees no pulse if that core was already off when we began the read.

Since reading is a destructive process, yet we usually want the value in the memory word to remain, the second half of the memory cycle involves setting the appropriate bits of the word back to a one state. For another 1.1 microseconds, the X and Y lines are stabilized and then the memory flips the cores in the opposite direction, to the orientation that represents a 1. Any bits of that word that had been zero when we read the location are left as zero by driving a counter-signal that cancels out the write pulse. The counter-signal is applied to the planes that did not detect a pulse during the read, keeping them from flipping on, while the other planes, which have no counter-signal, will have the core flip to its on state.

The memory begins addressing the X and Y location as T0 starts, with some timers controlling the pulses that flip the word off. This happens sometime around T2 of the memory cycle, but not in any precise synchronization. As the sense amplifiers detect pulses, they send a signal that flips the corresponding bit of the 1130's B register on, the whole register having been reset to zeroes at the start of T0.

The values in the B register determine the counter-signals for the rewriting phase of memory. Any bit that is off in the B register will have a counter-signal applied to its assigned core plane, while the bits that are on in the B register are allowed to flip on as core is rewritten. The rewriting of core starts at T4, using the same address as was set up at T0 for the reading. If the B register has not been modified by the processor, it still contains the data pattern that was read in the earlier part of the memory cycle, which is rewritten, completing around T6 more or less. Thus, by the end of a memory cycle, normally the content of a word is the same value as had been there prior to the cycle. However, if the processor changes the B register before T4, that new value is what is rewritten to the memory word, replacing the previous value we read in T0-T2.
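The read and rewrite halves of a memory cycle can be modeled in a few lines of Python. This is a behavioral sketch only: it leaves out the parity planes and all analog timing, and the function and parameter names are hypothetical.

```python
WORD_MASK = 0xFFFF  # 16 data bits; parity planes are not modeled

def memory_cycle(core, address, new_b=None):
    """Model one destructive read followed by the rewrite.

    core   : list of 16-bit words standing in for core storage
    new_b  : value the processor loads into B before T4, or None
    Returns the B register contents at the end of the cycle.
    """
    # Read half: every core of the word is driven to 0. Planes whose
    # core flipped produce a sense pulse, which sets the matching B bit.
    b_register = 0                    # B is reset at the start of T0
    sensed = core[address]
    core[address] = 0                 # the read has destroyed the word
    b_register |= sensed

    # The processor may replace B before the rewrite begins at T4.
    if new_b is not None:
        b_register = new_b & WORD_MASK

    # Rewrite half: the write current tries to set every bit to 1;
    # planes whose B bit is 0 get a counter-signal and stay at 0.
    inhibit = ~b_register & WORD_MASK
    core[address] = WORD_MASK & ~inhibit
    return b_register

core = [0x0000, 0x1234]
print(hex(memory_cycle(core, 1)), hex(core[1]))          # plain read
print(hex(memory_cycle(core, 1, 0xBEEF)), hex(core[1]))  # store
```

The same cycle serves as both a fetch and a store; the only difference is whether B is changed between the two halves.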

The topmost level of the clock hierarchy represents the stages of executing an instruction. Not every instruction goes through all the stages; most only use a few and many complete in just the first stage. These stages represent memory cycles, because each of the stages begins with a complete T-clock sequence from T0 to T7. Some of the stages may need to last longer than the 2.2 microsecond memory cycle, which is supported by extending or repeating the stage T7 as long as necessary.

Adding or subtracting takes a variable amount of time on the 1130, depending on the exact values of the two numbers involved. Carries from a bit position are saved in the 1130 D register, then those carry bits are themselves added to (or subtracted from) the partial sum in the A register until we have no more carries. An example is adding 32767 and 1. Those two values are 0111 1111 1111 1111 and 0000 0000 0000 0001, so that the first T cycle of adding produces a partial sum of 0111 1111 1111 1110 in A and the value 0000 0000 0000 0010 in D to track the carry that happened in the low order position.

The second T cycle will add the partial sum and carries to produce a new partial sum of 0111 1111 1111 1100 and a new carry value of 0000 0000 0000 0100 in D. The third cycle produces 0111 1111 1111 1000 and a carry of 0000 0000 0000 1000. You can see how the carries are rippling right to left, one bit position per cycle, so that the penultimate cycle has a partial sum of 0000 0000 0000 0000 and a carry value in D of 1000 0000 0000 0000. The final addition gives us an A register with 1000 0000 0000 0000 and a zeroed out D register, indicating our addition is complete. With addition starting in T4 of a cycle, typically, there would be T5, T6 and then many T7 cycles until the D register gets to all zeroes.
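The 32767 + 1 example can be reproduced with a short Python sketch of this add-until-no-carries scheme. It assumes each cycle forms the partial sum as a bitwise exclusive-or and the carries as a bitwise and shifted one position toward the high-order bit; how the real machine first loads the operands into A and D is not modeled, and the function name is hypothetical.

```python
MASK = 0xFFFF  # 16-bit registers

def add_cycles(a_reg, d_reg):
    """Repeat partial-sum/carry cycles until D is zero.

    Returns the final A value and the (A, D) pair after each cycle.
    """
    steps = []
    while d_reg != 0:
        partial = (a_reg ^ d_reg) & MASK         # sum without carries
        carries = ((a_reg & d_reg) << 1) & MASK  # carries move up one bit
        a_reg, d_reg = partial, carries
        steps.append((a_reg, d_reg))
    return a_reg, steps

result, steps = add_cycles(32767, 1)
for a, d in steps:
    print(f"A={a:016b}  D={d:016b}")
print("cycles needed:", len(steps))
```

For this worst-case pair the loop runs 16 times, while adding two numbers that generate no carries finishes in a single cycle, which is why the number of T7 repetitions varies with the data.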

Thus, the advance from one high level instruction stage to another is triggered by signal pulses, not by a fixed relationship to the T clock steps. When the instruction stage is complete, we let the T clock advance from T7 to the T0 state, beginning the next memory cycle. A T0 pulse is produced that moves the instruction clock to its next stage, determined by some logical conditions and the just ended stage.

The stages of an instruction always begin with an I1 stage and can contain I2, IX, IA, E1, E2 and E3 stages depending on the particular instruction being executed. Instructions can be one or two words long, indicated by the F flag in bit 5 of the first word. If F is on and we are moving out of I1 stage, then an I2 stage is triggered. If bit 5 is off at the end of I1, I2 is skipped and we determine which of the later stages we move into.

Index registers (there are three in the 1130) are just memory locations 1, 2 and 3, thus we need a memory cycle if we have to read and use the value of an index register. This is indicated by the two T flags in bits 6 and 7 of the first instruction word. If T is nonzero, and we are moving from I1 or I2, we move into an IX cycle to read the index register.

Each of these I type stages builds up the effective address for the instruction, using the value of the second word of memory from an I2 cycle, the value in the index register from an IX cycle, or the remaining bits of the first instruction word for a single word instruction. It may involve yet another memory reference if we have requested indirect addressing by setting on bit 8 of the first instruction word. Bit 8 is the indirect addressing flag; once we are moving out of the last earlier I cycle (I1, I2 or IX depending on whether the instruction has nonzero F and/or T) and we have bit 8 on, we move to an IA cycle. That memory cycle takes the effective address we have created so far and reads the memory word at that address. The resulting contents of memory become the new and final effective address.

For each of these I stages, if the addition being done to update the effective address requires more than one T7 cycle to complete, we take additional T7 cycles until the D register becomes zero.

Some instructions can complete entirely in the I stages. When an instruction has completed, the T0 SP that is produced at the start of the new memory cycle is the End Op type, which means that the previous instruction was completed. That makes this new memory cycle an I1 instruction stage. If, on the other hand, there are more I cycles to process for indexes, indirect addresses, etc., or the instruction also requires some memory accesses for execution, the T0 SP is the Not End Op type. That advances us to another of the instruction stages rather than resetting us back to I1.
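The sequencing of the I stages can be sketched in Python directly from the flag bits named above. This follows the description literally and is not taken from the ALDs: it assumes IBM bit numbering with bit 0 as the most significant bit of the 16-bit word, it does not model the E stages (which depend on the opcode), and any further conditions the real machine places on bit 8 are left out. The function name is hypothetical.

```python
def i_stages(first_word):
    """List the I stages an instruction passes through, per the text.

    Bit 0 is taken as the most significant bit of the 16-bit word,
    so bit n sits at shift position 15 - n.
    """
    f_flag = (first_word >> 10) & 0x1    # bit 5: two-word instruction
    t_flags = (first_word >> 8) & 0x3    # bits 6-7: index register number
    ia_flag = (first_word >> 7) & 0x1    # bit 8: indirect addressing

    stages = ["I1"]                      # every instruction starts in I1
    if f_flag:
        stages.append("I2")              # fetch the second word
    if t_flags:
        stages.append("IX")              # read index register from core
    if ia_flag:
        stages.append("IA")              # read the indirect address
    return stages

print(i_stages(0x0000))                  # short, no index, no indirect
print(i_stages(0x0400 | 0x0200 | 0x0080))
```

Each entry in the returned list corresponds to one memory cycle, so the list length gives the minimum number of T0 to T7 sequences spent before any execute stage.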

The basic clock circuitry of the 1130 produces a T clock advance pulse when beginning the next Phase A as long as the processor is not stopped or in some kind of single cycle mode. The advance pulse flips the T clock registers to the next T clock state, e.g. all but T3 might be off, then with the pulse we turn off T3 and turn on T4. I wanted the Phase A, Phase B and T clock signals aligned to the same fpga clock cycles.

The 1130 design has an edge triggered gate that detects the rise of Phase A and emits a quick pulse that will cause the pulse triggered flipflops of the T clock to advance. This is only allowed if all the right conditions exist for moving to a new T clock step. If we are adding and still have carry values in D, then the arithmetic control signal is still on. This blocks the clock advance so that we stay in T7. Similarly, shift operations might need more cycles since shifting moves the register one bit over per cycle. If the number of positions we want to shift is big enough, we need extended T7 cycles, so the shift control signal is another factor that will block the advance from T7 to T0.

There are other complexities, but the core problem to consider is that if we use an edge triggered gate to see that Phase A started, the output of that gate is 20 ns after the start of Phase A. That delayed signal is the trigger for the T clock flipflops, which are now synchronous so they don't change state for another 20 ns. That puts the T clock signals 40 ns, or about 2/7 of the way, past the start/stop of the Phase A and B signals.

To resolve this, I created a new signal that anticipates the oscillator switching from Phase B to Phase A by one fpga cycle. Because of the way that I toggle Phase A and B on and off, I am generating a 20 ns pulse to make the phase change on the next cycle. This signal is therefore 20 ns early and is used instead of the actual Phase A or B. I mirror all the logic that decides whether to emit a T clock advance pulse, but base it on the early pulse instead of the real Phase A signal.

Now, since that early signal is already a pulse, I didn't need to use an edge triggered gate. To preserve the mapping of my logic to the 1130 ALDs, I left the code calling the edge triggered gate component but added a new parameter to that component called passthru. With the passthru parameter, the gate doesn't delay or do clock syncing, it just implements pure combinatorial logic. This provides the clock advancing pulse just ahead of the start of Phase A as the trigger to the T clock flipflops. Voila, they now switch coincident with the basic clock starting Phase A.
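The difference between the two schemes can be seen in a tick-level Python model. This is a simplified sketch, not the fpga source: it assumes the edge triggered gate behaves as a registered edge detector whose output appears one fpga cycle after Phase A rises, that the T clock flipflops change at the clock edge following the pulse, and that the early pulse is high during the last fpga cycle of Phase B. All names are hypothetical.

```python
FPGA_NS = 20
HALF = 7
PERIOD = 2 * HALF

def t_clock_lag_ns(use_early_pulse, ticks=3 * PERIOD):
    """Return how long after a Phase A rise the T clock changes state."""
    phase_a = [(t % PERIOD) < HALF for t in range(ticks)]
    rise = PERIOD                      # Phase A turns on again at tick 14

    for t in range(3, ticks):
        if use_early_pulse:
            # Passthru mode: the pulse is combinatorial and is high during
            # the last tick of Phase B (tick t - 1 as seen from tick t).
            pulse_prev = (t - 1) % PERIOD == PERIOD - 1
        else:
            # Registered edge detector: its output during tick t - 1
            # reflects a rise between ticks t - 3 and t - 2.
            pulse_prev = phase_a[t - 2] and not phase_a[t - 3]
        if pulse_prev:
            # The T flipflops sampled the pulse and change state at tick t.
            return (t - rise) * FPGA_NS
    return None

print("edge triggered gate:", t_clock_lag_ns(False), "ns late")
print("early pulse, passthru:", t_clock_lag_ns(True), "ns late")
```

Under these assumptions the edge detector path reproduces the 40 ns skew, and the early pulse path brings the T clock change onto the same fpga clock edge as the start of Phase A.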

There were a few related signals, such as a pulse that was intended to fire at the start of Phase A which I generate at the correct time thru judicious use of early and on-time pulse signals and passthru mode.

The next set of issues to address were pulses that are produced by the 1130 when certain signals shut off, such as the end of T3, but that are gated by a signal from T3 that won't appear until the start of T4, or correspondingly some signal that is triggered by the start of a cycle but is gated by a signal from the prior cycle which is being shut off. The exact timing of the decision, either just before or just after the clock change, makes the difference between successful operation and failure.

In some cases, I had to convert logic to mix the prior clock state and some condition, firing off the result pulse at the start of the next cycle. Where IBM may have used the start of T2 cycle to accomplish something, I may need to gate T1 with the clock advance pulse that happens just before the move to the next clock state. That would give me a pulse just as we are moving from T1 into T2. If that set of logic fires off my usual edge triggered gate, whose output is delayed one fpga cycle, I get a pulse right at the beginning of T2 just as would occur on the original 1130.

A few registers get reset signals that are generated while the storage select signal is active. On the physical 1130, that signal exists from T0 thru the end of T1, but my clock sync version was beginning it later in T0 due to the delay of extra fpga cycles and was extending it well into T2. Since that kept some registers in reset mode until after they were being set by other pulses, I had to accelerate the start of the reset signal and shorten its duration to end promptly with T1's end.

I need to look at every place where an SP pulse (sample pulse, the activation of some action) is emitted and determine what timing constraint may pertain to it. Then, I have to look at how it is being generated, adjust it if it depended on signal values lingering beyond their legitimate end on the fpga, and test.

Another area where I had to adjust timing was the combinatorial latches used in the 1130. When a combinatorial loop was used for a latch that was set by one async signal and reset by another, I wished to convert this to clock synchronous behavior. I chose to do that by creating a 'latchgate' component that takes the then current signal value at a clock edge and sustains the emission of that signal value until the next clock edge. Thus, any signal change that flows through this component is quantized in time to only change states at clock edges.

Any glitchy behavior settles out before the next clock tick when a signal is passed through my gate. If a false set pulse arrives for a short interval, it has not yet made it through my gate to the rest of the latch circuits. If that glitch disappears and the set signal is back to its off state before the next clock tick occurs, the instantaneous false triggering is never seen. The latch will only fire if the set signal persists long enough to make it through the next clock edge.
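The quantizing effect of such a component can be illustrated with a small Python model. It is only a sketch of the sample-and-hold behavior described above, with the input given as a list of values at a finer time resolution than the clock; the function and parameter names are hypothetical and the real component is fpga logic, not software.

```python
def latchgate(samples, steps_per_clock):
    """Sample the input at each clock edge and hold it until the next.

    samples         : input values at fine-grained time steps
    steps_per_clock : number of time steps between clock edges;
                      edges fall on indices 0, steps_per_clock, ...
    Returns the output waveform at the same resolution as the input.
    """
    held = 0
    out = []
    for i, value in enumerate(samples):
        if i % steps_per_clock == 0:   # clock edge: capture the input
            held = value
        out.append(held)               # between edges: hold steady
    return out

# A glitch that comes and goes between two clock edges never gets out.
glitch = [0, 0, 1, 0, 0, 0, 0, 0]
print(latchgate(glitch, 4))

# A set signal that is still on at the next edge does pass through.
real = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(latchgate(real, 4))
```

The cost of this filtering is that a genuine change is seen by the downstream latch logic only from the following clock edge onward.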

I had assumed that the contemporaneous designs by IBM used similar approaches to how they designed the 1130, as they were built from the same new logic technology and tools, but that does not seem to be true. At least two of the 360 series, the model 20 and model 30, take a very different tack fundamentally.

Lawrence Wilkinson, who has a parallel project to this one but recreating a 360/30 machine, identifies the basic approach for that machine as using almost all gates and flipflops as level sensitive devices. Pulses are not used to trigger changes of state or other actions. Flipflops are changed while a clock phase is active - with the clock divided into four phases - then held steady for the other phases. This holds the output steady while it is mixed in combinatorial logic and allowed to settle before the next stage of logic relies on the value, since it won't look at the inputs until its clock phase is on.

That 360/30 logic behavior is substantially clock synchronous, free of the 'flip now' change of gates in the 1130. The 1130 is riddled with pulses freefloating between clock ticks that sequence events within one clock cycle and basic clock phase. Although the 1130 has two clock phases, A and B, multiple steps take place inside one phase.

My scans of the 360/20 ALDs show very few edge triggered gates and almost no single shot pulse generators. It does not use the 1130 style design approach and seems more like the 360/30 approach described by Lawrence. I haven't looked at other models yet, so it is possible that some use the same pulse and edge trigger fundamentals as the 1130, or at least use it in portions.

I could imagine that the 360/30 approach may be part of the design strategy for microcode driven machines, which is a key difference between 1130 and the 360 line. If behavior is defined (and updated) by microcode rather than fixed hardware timing circuits, it may not make sense to try to control timing of pulses and async behavior.

On the other hand, use of async actions allows more decisions and actions to take place in a clock phase, thus for a given clock speed it may be possible to increase the work done over the level sensitive, clock phase controlled approach of the 360/30. This may be used in high end machines where performance objectives were more readily achieved by applying the async approach of 1130. This is all speculation right now, but if I get some spare time I may research anything I can find on the high end models to see whether this does occur.
