************************************************************************
*
* FFT for Interactive C
* version 1.1
*
* Public functions:
*
*   int power_spectrum(char *data)... accepts pointer to 512-byte array,
*     the first 256 bytes of which represents signed 8-bit real-valued
*     data; performs simple power-spectrum estimation, without windowing
*     or binning; on return, array holds unsigned 8-bit spectral compo-
*     nents; return value is the index to the peak frequency
*
*   int fft(char *data)... accepts pointer to 512-byte array, the first
*     half of which represents signed 8-bit real components, the second
*     half of which represents imaginary components; performs in-place
*     FFT; return value is number of times data had to be divided by 2
*
*   int inverse_fft(char *data)... same as fft() but array contains
*     complex-valued inverse transform
*
* The function argument is the address of the first byte of the array.
* Because of the nonstandard way IC handles arrays, you'll need to call
* using a form such as power_spectrum((int)&data[0]).
*
************************************************************************
*
* Version history:
*
*   1.0  7/31/01  George Musser (georgejr@musser.com)
*     Adapted from fft.c11 by
*
*       Ron Williams
*       Department of Chemistry
*       Ohio University
*       Athens, OH 45701
*
*     This, in turn, was a modification of the 6800 FFT presented by
*
*       Richard Lord
*       Byte Magazine, pp. 108-119
*       February 1979
*
*   1.1  8/4/01  gsm
*     power_spectrum() now returns peak frequency
*
************************************************************************
*
* Williams's FFT algorithm is written in ROMable code for the HC11.  It
* uses a sine look-up table for speed and stores its dynamic variables
* in the machine stack.  The user passes the address of a 512-byte array
* of 8-bit signed numbers, with the real components in the first 256
* bytes and the imaginary components in the second.  power_spectrum()
* zeroes out the imaginary components.
*
* The fft() and inverse_fft() return value is a normalization exponent
* equal to the number of times the data was divided by two to keep it
* within 8-bit range during the transform.  The power_spectrum() return
* value is the index of the array component with the maximum absolute
* value.  This is easily related to the peak frequency.  For data at
* timesteps of
*
*   0, dt, ..., 127 dt, 128 dt, 129 dt, ..., 255 dt
*
* the values represent the power at periods of
*
*   DC, 256 dt, ..., 256/127 dt, 2 dt, -256/127 dt, ..., -256 dt
*
* For real-valued data, the transform components at positive and nega-
* tive periods are complex conjugates, so the power depends only on
* the absolute value of the period.  Accordingly, the return value is
* the absolute value of the array index.  Array component 128, which
* corresponds to the Nyquist frequency, is a special case.
*
* Thus a power_spectrum() return value of 0 indicates that the DC
* component dominates, a return value of 1 indicates that a frequency
* of 1/256th of the sampling frequency dominates, and a return value
* of 128 indicates that the Nyquist frequency dominates.
*
* To ensure that the determination of the peak is as accurate as possi-
* ble, the user may want to apply a window function to the data before
* passing it to power_spectrum().  I haven't implemented windowing here,
* because, in fixed-point math, it would destroy precision and create
* spurious harmonics.  The calling routine can minimize the loss of
* precision by scaling the data to fill the full 8-bit range -- an
* operation that is easier to do in Interactive C.
*
* According to Williams's benchmarking, the computation takes 350
* milliseconds -- pretty fast, but remember that it hogs the processor,
* since IC will not preempt a machine-language routine.  I have not done
* a great deal of testing, apart from some simple cases.
*
* Clever packing and unpacking of the data array could double the power-
* spectrum computation, at the price of added complexity.

************************************************************************
*
* offsets for local variables in FFT routine
*

REAL		EQU	0
TOP		EQU	2	; gsm-added
SIGN		EQU	4	; gsm-added
CELNM		EQU	5
CELCT		EQU	6
PAIRNM	EQU	7
CELDIS	EQU	8
DELTA		EQU	9
SCLFCT	EQU	$0A
COSA		EQU	$0B
SINA		EQU	$0C
SINPT		EQU	$0D
REAL1		EQU	$0F
REAL2		EQU	$11
TREAL		EQU	$13
TIMAG		EQU	$14
TMP		EQU	$15
TMP2		EQU	$16


        ORG     MAIN_START

************************************************************************
*
* power_spectrum
*
* I split the original routine into two pieces: the actual FFT and the
* power-spectrum calculation.  That way, we can call the FFT directly if
* we have complex-valued data (as we might if using the FFT for filter-
* ing or waveform synthesis).
*
* The argument -- the address of the real-valued data array -- is passed
* in the accumulator.
*

subroutine_power_spectrum

* clear imaginary part of data array
*
* slight optimization saves 1019 cycles (gsm):
*   + use STAA instead of CLR, saving 2 cycles per iteration
*   + use X rather than Y, saving 2 cycles per iteration

	ADDD	#$100		; get address of imaginary part [4]
	XGDX			; load data address into X [3]
	CLRA			; A register is zero [2]
	CLRB			; B register holds count (0=256) [2]
zero	STAA	$FF,X		; [4]
	DEX			; [3]
	DECB			; [2]
	BNE	zero		; [3]

* do the FFT

	PSHX			; save the data address
	XGDX			; put data address back in accumulator
	BSR	subroutine_fft
	PULX			; grab the data address

* calculate sum of absolute values
*
* formally, we should be calculating the sum of squares, but this is
* close enough for government work
*
* also, remember where the maximum value is, at a cost of 7437 cycles;
* the B register is the maximum value, and the location of that value
* is stored temporarily on the stack (gsm-added 8/4/01)
*
* slight optimization saves 2046 cycles (gsm):
*   + use X rather than Y, saving 3 cycles per iteration
*   + use Y for the byte counter, saving 5 cycles per iteration

smsq	LDY	#0		; clear byte counter
	CLRB			; initialize maximum value [2]
	PSHY			; initialize index [5]

sum	LDAA	0,X		; get real component
	BPL	sm1		; force positive
	NEGA
	BVC	sm1		; watch for $80
	CLRA			;   which is really 0
sm1	STAA	0,X		; store real component [4]
	INX			; get imaginary component
	LDAA	$FF,X
	DEX
	BPL	sm2		; force positive
	NEGA
	BVC	sm2		; watch for $80 again
	CLRA
sm2	ADDA	0,X		; compute sum [4]
	STAA	0,X		; save back in real part of array

	CBA			; is this the new maximum? [2]
	BLS	sm3		; no [3]
	TAB			; yes [2]
	INS			; ...so pop old index [3]
	INS
	PSHY			; ...and push new one [5]

sm3	INX			; inc X for next round
	INY			; done when byte counter reaches 256
	CPY	#$100		; [5]
	BNE	sum

* done

	PULX			; move index to accumulator
	XGDX
	CMPB	#$80		; take absolute value of LSB
	BLS	smex
	NEGB
smex	RTS

************************************************************************
*
* fft
*
* Here's the main body of the FFT.  The argument -- the address of the
* complex-valued data array -- is passed in the accumulator.  The
* direction of the transform (0 for forward, 1 for reverse) is passed
* in Y.
*
* fixed local-memory management, which didn't allow room on the stack
* for nested calls or interrupt service (gsm)
*

* public entry points, which set direction of transform

subroutine_fft

	LDY	#0
	BRA	fftc11

subroutine_inverse_fft

	LDY	#1

* initialize local variables

fftc11
	TSX			; top of stack for frame pointer
	XGDX			;     to be placed in X
	SUBD	#$17		; subtract offset to make room
	XGDX			; X now has frame pointer
	TXS			; relocate stack underneath frame (gsm-added)
	STD	REAL,X	; save data address (passed in accumulator)
	ADDD	#$1FF		; needed in scale subroutine (gsm-added)
	STD	TOP,X
	XGDY			; store direction of transform
	STAB	SIGN,X
	CLR	SCLFCT,X	; zero out the scale factor

* do bit sorting
*
* added swapping of the imaginary part, too, in case the data is complex-
* valued (gsm)
*
* slight optimization saves ~29000 cycles (gsm):
*   + expand rev1 loop and use A instead of TMP, saving 72 cycles
*   + store B in TMP2 rather than on the stack, saving 2 cycles
*   + use both index pointers during byte swap, saving 40 cycles

	LDAB	#$FE		; setup start for bit reversal

revbit
	STAB	TMP2,X	; save copy of address [4]

	RORB			; rotate B right - bit 1 to carry
	ROLA			; rotate left - carry bit in
	RORB			; rotate B right - bit 2 to carry
	ROLA			; rotate left - carry bit in
	RORB			; rotate B right - bit 3 to carry
	ROLA			; rotate left - carry bit in
	RORB			; rotate B right - bit 4 to carry
	ROLA			; rotate left - carry bit in
	RORB			; rotate B right - bit 5 to carry
	ROLA			; rotate left - carry bit in
	RORB			; rotate B right - bit 6 to carry
	ROLA			; rotate left - carry bit in
	RORB			; rotate B right - bit 7 to carry
	ROLA			; rotate left - carry bit in
	RORB			; rotate B right - bit 8 to carry
	ROLA			; rotate left - carry bit in

	LDAB	TMP2,X	; retrieve unbitreversed address [4]
	CBA			; make sure we only swap once [2]
	BHS	noswap

swap	PSHX			; save frame pointer
	LDY	REAL,X	; load base address
	LDX	REAL,X
	ABY			; add unbitreversed address
	TAB			; add bitreversed address
	ABX
	LDAA	0,Y		; swap real components
	LDAB	0,X
	STAA	0,X
	STAB	0,Y
	INX			; prepare for 256-byte offset
	INY
	LDAA	$FF,Y		; swap imaginary components
	LDAB	$FF,X
	STAA	$FF,X
	STAB	$FF,Y
	PULX			; restore frame pointer
	LDAB	TMP2,X	; get current address back

noswap
	DECB			; decrement address
	BNE	revbit	; do next if not done

* special case of first pass of FFT
*
* added imaginary part, for which the Danielson-Lanczos formula is the
* same as for the positive component (gsm)
*
* slight optimization saves 1658 cycles (gsm):
*   + use X instead of Y, netting 5x2 cycles per iteration
*   + use Y instead of TMP, netting 3 cycles per iteration

	JSR	scale
	PSHX			; save frame pointer [4]
	LDX	REAL,X	; set up data pointer [5]
	LDY	#128		; get number of cells [4]
fpss	LDAA	0,X		; get rm [4]
	LDAB	1,X		; get rn [4]
	PSHA			; make copy
	ABA			; rm'=rm+rn
	STAA	0,X		; save back in data array [4]
	PULA			; get rm again
	SBA			; rn'=rm-rn
	STAA	1,X		; put away [4]
	INX			; point to next pair [3]
	INX			; [3]

	LDAA	$FE,X		; get im [4]
	LDAB	$FF,X		; get in [4]
	PSHA			; make copy
	ABA			; im'=im+in
	STAA	$FE,X		; save back in data array [4]
	PULA			; get im again
	SBA			; in'=im-in
	STAA	$FF,X		; put away [4]

	DEY			; decrement # cells [4]
	BNE	fpss		; go back if not done
	PULX			; restore frame pointer [5]

* now, the FFT proper for passes 2 through N
*
* for inverse transforms, it should be enough just to negate the sine value

four	LDAA	#64		; # of cells is now 64
	STAA	CELNM,X	; store
	STAA	DELTA,X	; so is delta
	LDAA	#02		; number of pairs is 2
	STAA	PAIRNM,X
	STAA	CELDIS,X	; so is distance between
npass	JSR	scale		; check for over-range
	LDAA	CELNM,X	; get current cell #
	STAA	CELCT,X	; store at cell counter
	LDY	REAL,X
	STY	REAL1,X	; get copy of data
ncell	LDY	#sintab	; get address of sines
	STY	SINPT,X	; save copy
	LDAA	PAIRNM,X	; get current pairnm
np1	PSHA			; save pair counter
	LDAB	64,Y		; get sine
	LDAA	SIGN,X	; negate if inverse transform (gsm-added)
	BEQ	ncos
	NEGB
ncos	LDAA	0,Y		; get cosine
	STAA	COSA,X	; save copy
	STAB	SINA,X	; ditto
	LDY	REAL1,X	; point to top of data
	LDAB	CELDIS,X	; get current offset
	ABY			; add to y for current 
	STY	REAL2,X	; copy it
	LDAA	0,Y		; get data point rn
	PSHA			; copy it
	LDAB	COSA,X	; get cosine
	JSR	smul		; rn*cos(a)
	STAA	TREAL,X
	PULA			; get copy of rn
	LDAB	SINA,X	; get sin(a)
	JSR	smul		; rn*sin(a)
	STAA	TIMAG,X	; store imaginary tmp
	INY
	LDAA	$FF,Y		; get imaginary data
	PSHA			; save it
	LDAB	SINA,X	; get sin(a)
	JSR	smul		; in*sin(a)
	ADDA	TREAL,X
	STAA	TREAL,X	; tr=rn*cos+in*sin
	PULA			; get data back
	LDAB	COSA,X	; get cosine
	JSR	smul		; in*cos(a)
	SUBA	TIMAG,X	; ti=in*cos-rn*sin
	STAA	TIMAG,X
	LDY	REAL1,X
	LDAA	0,Y		; get rm 
	TAB			; save a copy
	ADDA	TREAL,X	; rm'=rm+tr
	STAA	0,Y		; store new rm
	SUBB	TREAL,X	; rn'=rm-tr
	LDY	REAL2,X
	STAB	0,Y		; store new rn
	LDY	REAL1,X
	INY
	STY	REAL1,X	; save real1 for nxt
	LDAA	$FF,Y		; get im
	TAB			; save copy
	ADDA	TIMAG,X	; im'=im+ti
	STAA	$FF,Y		; put back in array
	LDY	REAL2,X
	INY
	SUBB	TIMAG,X	; in'=im-ti
	STAB	$FF,Y		; put back in array
	LDY	SINPT,X
	LDAB	DELTA,X	; increment sine pntr
	ABY
	STY	SINPT,X	; save away
	PULA
	DECA			; dec pair counter
	BEQ	ar1		; gsm-added
	JMP	np1		; gsm-added
ar1	LDY	REAL1,X
	LDAB	CELDIS,X
	ABY
	STY	REAL1,X
	DEC	CELCT,X
	BEQ	ar3
	JMP	ncell
ar3	LSR	CELNM,X	; half cells
	BEQ	done		; done when all cells
	ASL	PAIRNM,X	; double pairs
	ASL	CELDIS,X	; twice as far apart
	LSR	DELTA,X	; delta is half
	JMP	npass		; one more time!

* return to calling program

done	LDAB	SCLFCT,X	; scale factor is return value (in accumulator)
	CLRA			; zero MSB for IC's benefit (gsm-added)
	XGDX			; pop off all the local variables (gsm-added)
	ADDD	#$17
	XGDX
	TXS
	RTS

************************************************************************
*
* subroutine for rescaling out-of-range data
*
* fixed array overflow, incorrect branches; moved top-of-data calculation
* to main FFT routine (gsm)
*
* made out-of-range detection a subroutine, at a cost of 13 cycles (gsm)
*
* slight optimization saves 3584 cycles in each loop (gsm):
*   + reverse roles of X and Y, saving 3 cycles per iteration
*   + replaced ADDA #$80, LSRA, SUBA #$40 sequence with ASRA, saving 4
*     cycles per iteration
*

scale	XGDX			; move frame pointer to accumulator [3]
	XGDY			; thence to Y [4]

* first, check whether any value lies outside the range (-64,64]

scdow	LDAA	#$C0		; -64
	LDAB	#$40		; +64
	BSR	range		; sets carry if out of range
	BCC	sexit		; if not, check whether we need to scale up

* divide the whole array by 2

scl	INC	SCLFCT,Y	; keep track of scaling [7]
	LDX	TOP,Y		; reset pointer [6]
scl1	LDAA	0,X		; get data [4]
	ASRA			; divide by two [2]
	STAA	0,X		; store away [4]
	DEX			; bump pointer [3]
	CPX	REAL,Y	; done when both [7]
	BHS	scl1		;   imag and real done (gsm: was BNE)

sexit	XGDY			; put frame pointer [4]
	XGDX			;   back into X [3]
	RTS

#if 1==0
* check whether any value lies outside the range (-32,32]
*
* this code, as it stands, doesn't do any good -- multiplying everything
* by 2 won't increase the precision of subsequent calcuations; but it
* may provide inspiration for a cleverer rescaling algorithm someday
*

scup	LDAA	#$E0		; -32
	LDAB	#$20		; +32
	BSR	range		; sets carry if out of range
	BCS	sexit		; if set, all is well

* multiply the whole array by 2

sc2	DEC	SCLFCT,Y	; keep track of scaling [7]
	LDX	TOP,Y		; reset pointer [6]
sc21	LDAA	0,X		; get data [4]
	LSLA			; multiply by two [2]
	STAA	0,X		; store away [4]
	DEX			; bump pointer [3]
	CPX	REAL,Y	; done when both [7]
	BHS	sc21		;   imag and real done
	BRA	scup		; keep scaling up until data is in midrange
#endif

************************************************************************
*
* subroutine for checking whether any data is out of range: if any value
* is outside the range (A,B], set the carry bit
*
* extracted from scale subroutine (gsm)
*
* slight optimization saves 1536 cycles (gsm):
*   + reverse roles of X and Y, saving 3 cycles per iteration
*

range	LDX	TOP,Y		; start at top of data [6]
rtop	CMPA	0,X		; check for minimum [4]
	BLO	rnxt		; if less negative than A, don't fix
	CMPB	0,X		; check for maximum [4]
	BLO	rexit		; if > B, go fix it
rnxt	DEX			; bump pointer [3]
	CPX	REAL,Y	; done when both [7]
	BHS	rtop		;   imag and real done (gsm: was BNE)
	CLC
rexit	RTS

************************************************************************
*
* subroutine for signed 8-bit multiplication
*
* A <- A*B
* B represents number from -1 to 1, so only hi byte counts.
*

smul	STAA	TMP,X		; copy multiplier
	STAB	TMP2,X	; ditto multiplicand

* get absolute values

	TSTA			; check sign of multiplier
	BPL	sk1		; skip negation
	NEGA
	BVS	sko		; check for $80
	BEQ	sko		; check for zero
sk1	TSTB			; check multiplier sign
	BPL	sk2
	NEGB
	BVS	sko		; check for $80
	BEQ	sko

* do the multiplication

sk2	MUL			; do multiplication
	ADCA	#0		; 8 bit conversion
	ASLA			; and correct for sine (so that 127=1.0)
	LDAB	TMP2,X	; get original multiplicand
	EORB	TMP,X		; check sign of result
	BPL	out
	NEGA			; result is negative
out	RTS
sko	CLRA			; return zero to main
	RTS

************************************************************************
*
* now for the sine lookup table
*

sintab            
 FCB  127, 127, 127, 127, 126, 126, 126, 125, 125, 124
 FCB  123, 122, 122, 121, 120, 118, 117, 116, 115, 113
 FCB  112, 111, 109, 107, 106, 104, 102, 100,  98,  96
 FCB   94,  92,  90,  88,  85,  83,  81,  78,  76,  73
 FCB   71,  68,  65,  63,  60,  57,  54,  51,  49,  46
 FCB   43,  40,  37,  34,  31,  28,  25,  22,  19,  16
 FCB   12,   9,   6,   3,   0,  -3,  -6,  -9, -12, -16
 FCB  -19, -22, -25, -28, -31, -34, -37, -40, -43, -46
 FCB  -49, -51, -54, -57, -60, -63, -65, -68, -71, -73
 FCB  -76, -78, -81, -83, -85, -88, -90, -92, -94, -96
 FCB  -98,-100,-102,-104,-106,-107,-109,-111,-112,-113
 FCB -115,-116,-117,-118,-120,-121,-122,-122,-123,-124
 FCB -125,-125,-126,-126,-126,-127,-127,-127,-127,-127
 FCB -127,-127,-126,-126,-126,-125,-125,-124,-123,-122
 FCB -122,-121,-120,-118,-117,-116,-115,-113,-112,-111
 FCB -109,-107,-106,-104,-102,-100, -98, -96, -94, -92
 FCB  -90, -88, -85, -83, -81, -78, -76, -73, -71, -68
 FCB  -65, -63, -60, -57, -54, -51, -49, -46, -43, -40
 FCB  -37, -34, -31, -28, -25, -22, -19, -16, -12,  -9
 FCB   -6,  -3,   0,   3,   6,   9,  12,  16,  19,  22
 FCB   25,  28,  31,  34,  37,  40,  43,  46,  49,  51
 FCB   54,  57,  60,  63,  65,  68,  71,  73,  76,  78
 FCB   81,  83,  85,  88,  90,  92,  94,  96,  98, 100
 FCB  102, 104, 106, 107, 109, 111, 112, 113, 115, 116
 FCB  117, 118, 120, 121, 122, 122, 123, 124, 125, 125
 FCB  126, 126, 126, 127, 127, 127