Discussion:
[avr-gcc-list] AVR byte swap optimization
Shaun Jackman
19 years ago
Permalink
The following macro expands to some rather frightful code on the AVR:

#define BSWAP_16(x) \
((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))

uint16_t bswap_16(uint16_t x)
{
0: 9c 01 movw r18, r24
2: 89 2f mov r24, r25
4: 99 27 eor r25, r25
6: 32 2f mov r19, r18
8: 22 27 eor r18, r18
return BSWAP_16(x);
}
a: 82 2b or r24, r18
c: 93 2b or r25, r19
e: 08 95 ret

Ideally, this macro would expand to three mov instructions and a ret.
Is there anything I can do to help GCC along here? I'm using GCC 4.1.0
with -O2.

I won't bother to show bswap_32 here, which produces a real disaster!
Think 47 instructions, for what should be 6.

Cheers,
Shaun

$ avr-gcc --version |head -1
avr-gcc (GCC) 4.1.0
Eric Weddington
19 years ago
Permalink
You could always equate the macro to some inline assembly like what is done
for a number of avr-libc macros.

Eric Weddington

> -----Original Message-----
> From:
> avr-gcc-list-bounces+eweddington=***@nongnu.org
> [mailto:avr-gcc-list-bounces+eweddington=***@nongnu.
> org] On Behalf Of Shaun Jackman
> Sent: Friday, November 17, 2006 4:31 PM
> To: avr-gcc-***@nongnu.org; ***@gcc.gnu.org
> Subject: [avr-gcc-list] AVR byte swap optimization
>
> The following macro expands to some rather frightful code on the AVR:
>
> #define BSWAP_16(x) \
> ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))
>
> uint16_t bswap_16(uint16_t x)
> {
> 0: 9c 01 movw r18, r24
> 2: 89 2f mov r24, r25
> 4: 99 27 eor r25, r25
> 6: 32 2f mov r19, r18
> 8: 22 27 eor r18, r18
> return BSWAP_16(x);
> }
> a: 82 2b or r24, r18
> c: 93 2b or r25, r19
> e: 08 95 ret
>
> Ideally, this macro would expand to three mov instructions and a ret.
> Is there anything I can do to help GCC along here? I'm using GCC 4.1.0
> with -O2.
>
> I won't bother to show bswap_32 here, which produces a real disaster!
> Think 47 instructions, for what should be 6.
>
> Cheers,
> Shaun
>
> $ avr-gcc --version |head -1
> avr-gcc (GCC) 4.1.0
>
>
> _______________________________________________
> AVR-GCC-list mailing list
> AVR-GCC-***@nongnu.org
> http://lists.nongnu.org/mailman/listinfo/avr-gcc-list
>
Larry Barello
19 years ago
Permalink
Although a bit wordy, the following produces good code under a lot of
different circumstances. It allows the compiler to shove things around &
optimize.


static inline uint16_t ByteSwap(uint16_t data)
{
union byteswap
{
uint16_t word;
struct
{
uint8_t low;
uint8_t high;
};
} foo;
foo.word = data;

uint8_t t = foo.high;
foo.high = foo.low;
foo.low = t;
return foo.word;
}


| -----Original Message-----
| From: avr-gcc-list-bounces+larry=***@nongnu.org [mailto:avr-gcc-
| list-bounces+larry=***@nongnu.org] On Behalf Of Shaun Jackman
| Sent: Friday, November 17, 2006 3:31 PM
| To: avr-gcc-***@nongnu.org; ***@gcc.gnu.org
| Subject: [avr-gcc-list] AVR byte swap optimization
|
| The following macro expands to some rather frightful code on the AVR:
|
| #define BSWAP_16(x) \
| ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))
|
| uint16_t bswap_16(uint16_t x)
| {
| 0: 9c 01 movw r18, r24
| 2: 89 2f mov r24, r25
| 4: 99 27 eor r25, r25
| 6: 32 2f mov r19, r18
| 8: 22 27 eor r18, r18
| return BSWAP_16(x);
| }
| a: 82 2b or r24, r18
| c: 93 2b or r25, r19
| e: 08 95 ret
|
| Ideally, this macro would expand to three mov instructions and a ret.
| Is there anything I can do to help GCC along here? I'm using GCC 4.1.0
| with -O2.
|
| I won't bother to show bswap_32 here, which produces a real disaster!
| Think 47 instructions, for what should be 6.
|
| Cheers,
| Shaun
|
| $ avr-gcc --version |head -1
| avr-gcc (GCC) 4.1.0
|
|
| _______________________________________________
| AVR-GCC-list mailing list
| AVR-GCC-***@nongnu.org
| http://lists.nongnu.org/mailman/listinfo/avr-gcc-list
Rask Ingemann Lambertsen
19 years ago
Permalink
On Fri, Nov 17, 2006 at 04:30:53PM -0700, Shaun Jackman wrote:
> The following macro expands to some rather frightful code on the AVR:
>
> #define BSWAP_16(x) \
> ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))
[snip]
>
> Ideally, this macro would expand to three mov instructions and a ret.
> Is there anything I can do to help GCC along here? I'm using GCC 4.1.0
> with -O2.

Sure. Add a HImode rotate left by eight instruction to the AVR backend.
For example (untested, and I know very little about the AVR):

(define_insn_and_split "*rotlhi3_const8"
[(set (match_operand:HI 0 "register_operand" "=r")
(rotate:HI (match_operand:HI 1 "register_operand" "0")
(const_int 8)))]
"reload_completed"
[(set (match_dup 2) (xor:QI (match_dup 2) (match_dup 3)))
(set (match_dup 3) (xor:QI (match_dup 3) (match_dup 2)))
(set (match_dup 2) (xor:QI (match_dup 2) (match_dup 3)))]
{
operands[2] = simplify_gen_subreg (QImode, operands[0], HImode, 0);
operands[3] = simplify_gen_subreg (QImode, operands[0], HImode, 1);
}

If that isn't enough, try adding an expander also:

(define_expand "rotlhi3"
[(set (match_operand:HI 0 "register_operand")
(rotate:HI (match_operand:HI 1 "register_operand")
(match_operand:HI 2 "const_int_operand")))]
""
{
if (INTVAL (operands[2]) != 8)
FAIL;
})

--
Rask Ingemann Lambertsen
Paul Brook
19 years ago
Permalink
> Ideally, this macro would expand to three mov instructions and a ret.
> Is there anything I can do to help GCC along here? I'm using GCC 4.1.0
> with -O2.
>
> I won't bother to show bswap_32 here, which produces a real disaster!
> Think 47 instructions, for what should be 6.

Use gcc head, __builtin_bswap and make sure the AVR backend implements the
bswap rtl patterns.

Future versions of gcc may also be able to recognise these idioms without
using the builtin, but AFAIK that's not been implemented yet.

Paul
Eric Weddington
19 years ago
Permalink
> -----Original Message-----
> From:
> avr-gcc-list-bounces+eweddington=***@nongnu.org
> [mailto:avr-gcc-list-bounces+eweddington=***@nongnu.
> org] On Behalf Of Paul Brook
> Sent: Saturday, November 18, 2006 9:46 AM
> To: ***@gcc.gnu.org; Shaun Jackman
> Cc: avr-gcc-***@nongnu.org
> Subject: [avr-gcc-list] Re: AVR byte swap optimization
>
> > Ideally, this macro would expand to three mov instructions
> and a ret.
> > Is there anything I can do to help GCC along here? I'm
> using GCC 4.1.0
> > with -O2.
> >
> > I won't bother to show bswap_32 here, which produces a real
> disaster!
> > Think 47 instructions, for what should be 6.
>
> Use gcc head, __builtin_bswap and make sure the AVR backend
> implements the
> bswap rtl patterns.

There's the problem. You can't just glibly say "make sure the AVR backend
implements the bswap rtl patterns". There are precious few volunteers who
are familiar enough with gcc internals and the avr port in particular to go
do just that. AFAIK, there is no bswap rtl pattern in the avr port, at least
there doesn't seem to be in 4.1.1.


> Future versions of gcc may also be able to recognise these
> idioms without
> using the builtin, but AFAIK that's not been implemented yet.

Plus there is a long lead time between when it is implemented on HEAD, then
branched, released from a branch, and then when it shows up in binary
distributions.

Eric Weddington
Steven Bosscher
19 years ago
Permalink
On 11/19/06, Eric Weddington <***@cso.atmel.com> wrote:
> > Use gcc head, __builtin_bswap and make sure the AVR backend
> > implements the
> > bswap rtl patterns.
>
> There's the problem. You can't just glibly say "make sure the AVR backend
> implements the bswap rtl patterns". There are precious few volunteers who
> are familiar enough with gcc internals and the avr port in particular to go
> do just that. AFAIK, there is no bswap rtl pattern in the avr port, at least
> there doesn't seem to be in 4.1.1.

Why is that a problem?
Do you have a different solution in mind?

> > Future versions of gcc may also be able to recognise these
> > idioms without
> > using the builtin, but AFAIK that's not been implemented yet.
>
> Plus there is a long lead time between when it is implemented on HEAD, then
> branched, released from a branch, and then when it shows up in binary
> distributions.

That happens with all improvements that are implemented between
releases, so I don't see your point.

Gr.
Steven
Eric Weddington
19 years ago
Permalink
> -----Original Message-----
> From: Steven Bosscher [mailto:***@gmail.com]
> Sent: Sunday, November 19, 2006 3:55 AM
> To: Eric Weddington
> Cc: Paul Brook; ***@gcc.gnu.org; Shaun Jackman;
> avr-gcc-***@nongnu.org
> Subject: Re: [avr-gcc-list] Re: AVR byte swap optimization
>
> On 11/19/06, Eric Weddington <***@cso.atmel.com> wrote:
> > > Use gcc head, __builtin_bswap and make sure the AVR backend
> > > implements the
> > > bswap rtl patterns.
> >
> > There's the problem. You can't just glibly say "make sure
> the AVR backend
> > implements the bswap rtl patterns". There are precious few
> volunteers who
> > are familiar enough with gcc internals and the avr port in
> particular to go
> > do just that. AFAIK, there is no bswap rtl pattern in the
> avr port, at least
> > there doesn't seem to be in 4.1.1.
>
> Why is that a problem?
> Do you have a different solution in mind?

>
> > > Future versions of gcc may also be able to recognise these
> > > idioms without
> > > using the builtin, but AFAIK that's not been implemented yet.
> >
> > Plus there is a long lead time between when it is
> implemented on HEAD, then
> > branched, released from a branch, and then when it shows up
> in binary
> > distributions.
>
> That happens with all improvements that are implemented between
> releases, so I don't see your point.

The original message went to the avr-gcc-list mailing list. I wasn't aware
that it also went to the gcc mailing list. There are different sets of
assumptions based on the audience.

A lot depends on where the OP is coming from, in wanting help:
Does the OP want a pre-built toolchain?
Does the OP build the toolchain from source?
Is the OP familiar with even patching the toolchain?

There are a number of people on the avr-gcc-list that when they say they
want help, they want the final, pre-built toolchain from a distributor to
fix their problem. They don't want to build the toolchain from source, or
they don't even know how to. Yes there are also people who can and will
build the toolchain from source but they are a small minority.

The point that I was making was to the original OP: yes, it would be great
to get it fixed, permanently. Historically there has never been enough
volunteers with the knowledge, capability, and the time on the avr port.

The audience on the gcc list is a completely different sort with different
assumptions. My apologies for annoying the people on the gcc list.

And yes I do have a different solution in mind:
Shaun, could you open a GCC bug report on this issue? Rask, could you then
post your implementation to that GCC bug report as a patch? Steven and Paul,
could you eyeball Rask's implementation and see if it is reasonable
impelementation? Rask, do you have FSF paperwork in place? If not, then
could somebody else, with FSF assignments in place, create a patch?

After all, you guys have a process to follow, right? ;-)

Eric Weddington
'Rask Ingemann Lambertsen'
19 years ago
Permalink
On Sun, Nov 19, 2006 at 08:31:22AM -0700, Eric Weddington wrote:

> Rask, do you have FSF paperwork in place?

Yes, it was completed recently.

--
Rask Ingemann Lambertsen
Rask Ingemann Lambertsen
19 years ago
Permalink
On Fri, Nov 17, 2006 at 04:30:53PM -0700, Shaun Jackman wrote:
> Is there anything I can do to help GCC along here? I'm using GCC 4.1.0
> with -O2.
>
> I won't bother to show bswap_32 here, which produces a real disaster!
> Think 47 instructions, for what should be 6.

You may get better code if you write it something like this:

#include <string.h>

uint32_t bswap_32 (uint32_t x)
{
unsigned char c[4], temp;

memcpy (c, &x, 4);
temp = c[0];
c[0] = c[3];
c[3] = temp;
temp = c[1];
c[1] = c[2];
c[2] = temp;
memcpy (&x, c, 4);

return (x);
}

It isn't only on the AVR that bswap_32() is nontrivial to get right.
These two versions would rule on the i386 if GCC would be just a little bit
smarter:

#include <string.h>
#define BSWAP_16(x) \
((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))

uint32_t bswap_32_a (uint32_t y)
{
uint16_t d[2];
uint32_t x = y;

memcpy (d, &x, sizeof (d));
d[0] = BSWAP_16 (d[0]);
memcpy (&x, d, sizeof (x));
x = ((x >> 16) & 0xffff) | ((x & 0xffff) << 16);
memcpy (d, &x, sizeof (d));
d[0] = BSWAP_16 (d[0]);
memcpy (&x, d, sizeof (x));

return (x);
}
/*
bswap_32_a:
subl $16, %esp
movl 20(%esp), %eax
movl %eax, 12(%esp)
rolw $8, 12(%esp)
roll $16, 12(%esp)
rolw $8, 12(%esp)
movl 12(%esp), %eax
addl $16, %esp
ret
*/
uint32_t bswap_32_b (uint32_t y)
{
union { uint16_t d[2]; uint32_t x; } t;
t.x = y;

t.d[0] = BSWAP_16 (t.d[0]);
t.x = ((t.x >> 16) & 0xffff) | ((t.x & 0xffff) << 16);
t.d[0] = BSWAP_16 (t.d[0]);

return (t.x);
}
/*
bswap_32_b:
movl 4(%esp), %edx
movl %edx, %eax
rolw $8, %ax
movw %ax, %dx
movl %edx, %eax
roll $16, %eax
movl %eax, %edx
rolw $8, %ax
movw %ax, %dx
movl %edx, %eax
ret
*/

--
Rask Ingemann Lambertsen
Eric Christopher
19 years ago
Permalink
>
> It isn't only on the AVR that bswap_32() is nontrivial to get
> right.
> These two versions would rule on the i386 if GCC would be just a
> little bit
> smarter:
>

I prefer the single instruction bswap that we now generate for
__builtin_bswap[32,64] myself...

-eric
Rask Ingemann Lambertsen
19 years ago
Permalink
On Sun, Nov 19, 2006 at 02:46:44PM -0800, Eric Christopher wrote:

> >These two versions would rule on the i386 if GCC would be just a
> >little bit smarter:
>
> I prefer the single instruction bswap that we now generate for
> __builtin_bswap[32,64] myself...

But that's only available starting with the i486. We should still produce
good code for an i386. On the m68k, we also want something with three
rotates, say

rol.w $8, d0
swap.l d0
rol.w $8, d0

--
Rask Ingemann Lambertsen
Eric Christopher
19 years ago
Permalink
On Nov 19, 2006, at 10:18 PM, Rask Ingemann Lambertsen wrote:

> On Sun, Nov 19, 2006 at 02:46:44PM -0800, Eric Christopher wrote:
>
>>> These two versions would rule on the i386 if GCC would be just a
>>> little bit smarter:
>>
>> I prefer the single instruction bswap that we now generate for
>> __builtin_bswap[32,64] myself...
>
> But that's only available starting with the i486. We should still
> produce
> good code for an i386. On the m68k, we also want something with three
> rotates, say
>

I suppose, not much is being shipped that doesn't have the i486
instructions
these days.

> rol.w $8, d0
> swap.l d0
> rol.w $8, d0
>

Well, there's a bswap opcode now...

-eric
Denis Vlasenko
19 years ago
Permalink
On Saturday 18 November 2006 00:30, Shaun Jackman wrote:
> The following macro expands to some rather frightful code on the AVR:
>
> #define BSWAP_16(x) \
> ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))

Sometimes gcc is generating better code if you cast
values instead of masking. Try:

( (uint8_t)((x) >> 8) | ((uint8_t)(x)) << 8 )

--
vda
Shaun Jackman
19 years ago
Permalink
On 11/26/06, Denis Vlasenko <***@googlemail.com> wrote:
> On Saturday 18 November 2006 00:30, Shaun Jackman wrote:
> > The following macro expands to some rather frightful code on the AVR:
> >
> > #define BSWAP_16(x) \
> > ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))
>
> Sometimes gcc is generating better code if you cast
> values instead of masking. Try:
>
> ( (uint8_t)((x) >> 8) | ((uint8_t)(x)) << 8 )

Your suggestion seemed like a good one and gave me some hope.
Unfortunately, it produces identical results to my BSWAP_16 macro. I
also tried the following:

uint8_t a = x >> 8, b = x;
return b << 8 | a;

Different register allocations this time, but identical instructions.

Cheers,
Shaun
Anton Erasmus
19 years ago
Permalink
On 18 Dec 2006 at 11:28, Shaun Jackman wrote:

> On 11/26/06, Denis Vlasenko <***@googlemail.com> wrote:
> > On Saturday 18 November 2006 00:30, Shaun Jackman wrote:
> > > The following macro expands to some rather frightful code on the AVR:
> > >
> > > #define BSWAP_16(x) \
> > > ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))
> >
> > Sometimes gcc is generating better code if you cast
> > values instead of masking. Try:
> >
> > ( (uint8_t)((x) >> 8) | ((uint8_t)(x)) << 8 )
>
> Your suggestion seemed like a good one and gave me some hope.
> Unfortunately, it produces identical results to my BSWAP_16 macro. I
> also tried the following:
>
> uint8_t a = x >> 8, b = x;
> return b << 8 | a;
>
> Different register allocations this time, but identical instructions.
>

Hi,

Not a macro, but the following seems to generate reasonable code.

typedef struct
{
unsigned char h;
unsigned char l;
}B2;


typedef union
{
B2 x;
unsigned short us;
}BU;



inline unsigned short byteswap(unsigned short xx)
{
BU tmp;
unsigned char tmp1;

tmp.us=xx;
tmp1=tmp.x.h;
tmp.x.h=tmp.x.l;
tmp.x.l=tmp1;
return(tmp.us);
}

volatile unsigned short gX;

void test(void)
{
gX=byteswap(gX);

}


Regards
Anton Erasmus

--
A J Erasmus
David VanHorn
19 years ago
Permalink
>
> tmp.us=xx;
> tmp1=tmp.x.h;
> tmp.x.h=tmp.x.l;
> tmp.x.l=tmp1;



Am I missing something here?
Why not pop to assembler, push the high, push the low, pop the high, pop the
low?
Shaun Jackman
19 years ago
Permalink
On 12/18/06, David VanHorn <***@microbrix.com> wrote:
> Am I missing something here?
> Why not pop to assembler, push the high, push the low, pop the high, pop the
> low?

* Inline assembler cannot be used at compile time, for example to
initialize a static variable.

* If the swap function is called on a constant, the compiler cannot
remove the inline assembler. In general, any inline assembler tends to
handcuff the optimizer to some degree.

* Push and pop take two cycles since they access memory. Three mov
instructions are faster than one push and one pop.

Cheers,
Shaun
Shaun Jackman
19 years ago
Permalink
On 12/18/06, David VanHorn <***@microbrix.com> wrote:
> > * If the swap function is called on a constant, the compiler cannot
> > remove the inline assembler. In general, any inline assembler tends to
> > handcuff the optimizer to some degree.
>
> I'm new to C, but why would you want to swap a constant?
> CAN you swap a constant? Seems an inconstant constant :)

Quite! My usual case for this is when the constant is defined in host
order, but I'm comparing against a network order value. For example,

if (incoming == htons(EXPECTED))

Here, incoming is a 16-bit network order value and htons (a.k.a.
byteswap_16) is taking the 16-bit host order constant EXPECTED. If
htons is implemented as a mathematical expression, the compiler can
optimise it away to nothing.

Cheers,
Shaun
Shaun Jackman
19 years ago
Permalink
On 12/18/06, Anton Erasmus <***@sentechsa.com> wrote:
> Hi,
>
> Not a macro, but the following seems to generate reasonable code.
...

Thanks Anton,

I came to the same conclusion.

Cheers,
Shaun

static inline uint16_t bswap_16_inline(uint16_t x)
{
union {
uint16_t x;
struct {
uint8_t a;
uint8_t b;
} s;
} in, out;
in.x = x;
out.s.a = in.s.b;
out.s.b = in.s.a;
return out.x;
}

static inline uint32_t bswap_32_inline(uint32_t x)
{
union {
uint32_t x;
struct {
uint8_t a;
uint8_t b;
uint8_t c;
uint8_t d;
} s;
} in, out;
in.x = x;
out.s.a = in.s.d;
out.s.b = in.s.c;
out.s.c = in.s.b;
out.s.d = in.s.a;
return out.x;
}
Nils Springob
19 years ago
Permalink
Hi,

and it is possible to use an anonymous struct:

static inline uint16_t bswap_16_inline(uint16_t x)
{
union {
uint16_t x;
struct {
uint8_t a, b;
};
} in, out;
in.x = x;
out.a = in.b;
out.b = in.a;
return out.x;
}

Regards,

Nils
Shaun Jackman
19 years ago
Permalink
On 12/18/06, Nils Springob <***@nicai-systems.de> wrote:
> Hi,
>
> and it is possible to use an anonymous struct:

True. However, unnamed struct/union fields are an extension of the C
language provided by GCC and not strictly portable. It is a nice
feature though. It would be nice if it crept its way into a future C
standard.

Cheers,
Shaun
Loading...