The main difference is that SSE is a SIMD instruction set: each instruction performs the same operation on multiple (packed) values. Most of the SSE2/3 set operates on 4x32-bit or 2x64-bit values per register, integer or floating point, but it also includes 8-bit and 16-bit integer operations. I'm not really familiar enough with the later versions, but they're more specialised towards codecs afaik.
In general, these instructions are useful for computationally intensive work such as vector/matrix maths, more complex mathematical functions, or even just fast memory copying.
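To make "the same action on multiple packed values" concrete, here is a minimal sketch in C using SSE2 intrinsics. The function name `add_sat_u8x16` is made up for illustration; the intrinsics come from `<emmintrin.h>`, and `_mm_adds_epu8` corresponds to the `paddusb` instruction. The scalar branch is only there so the sketch still builds on non-x86 targets.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: out[i] = min(a[i] + b[i], 255) for 16 bytes at once. */
void add_sat_u8x16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* one unaligned 16-byte load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);  /* one unaligned 16-byte load */
    __m128i vr = _mm_adds_epu8(va, vb);                /* paddusb: 16 saturating adds */
    _mm_storeu_si128((__m128i *)out, vr);              /* one unaligned 16-byte store */
}
#else
/* Scalar fallback with the same result, one byte at a time. */
void add_sat_u8x16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    for (int i = 0; i < 16; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
#endif
```

The SSE2 branch does with two loads, one add and one store what the scalar loop does with 32 loads, 16 adds, 16 clamps and 16 stores.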
Below is an example from my OS's software graphics driver for alpha blending 4 pixels per loop iteration using fixed-point integer operations:
pxor xmm0,xmm0
movdqa xmm10,dqword [const_00010101000101010001010100010101]
movdqa xmm11,dqword [const_00000001000100010000000100010001]
movdqa xmm12,dqword [const_FF00FF00FF00FF00FF00FF00FF00FF00]
movdqa xmm13,dqword [const_00FFFFFF00FFFFFF00FFFFFF00FFFFFF]
movdqa xmm14,dqword [const_0000000000000000FFFFFFFFFFFFFFFF]
movdqa xmm15,dqword [const_FFFFFFFFFFFFFFFF0000000000000000]
mov rax,16
xor rcx,rcx
mov edx,dword [buffer_size] ; movzx has no 32-bit source form; a 32-bit mov zero-extends into rdx
shr rdx,4
mov rsi,qword [plSrc]
mov rdi,qword [plDst]
; F E D C B A 9 8 7 6 5 4 3 2 1 0
mov rsi,[rsi+LAYER.pBuffer] ; A3 R3 G3 B3 A2 R2 G2 B2 A1 R1 G1 B1 A0 R0 G0 B0
mov rdi,[rdi+LAYER.pBuffer] ; A7 R7 G7 B7 A6 R6 G6 B6 A5 R5 G5 B5 A4 R4 G4 B4
align 4
.loop:
movdqa xmm1,dqword [rsi+rcx] ; xmm1 = A3 R3 G3 B3 A2 R2 G2 B2 A1 R1 G1 B1 A0 R0 G0 B0
movdqa xmm2,xmm1 ; xmm2 = A3 R3 G3 B3 A2 R2 G2 B2 A1 R1 G1 B1 A0 R0 G0 B0
punpcklbw xmm1,xmm0 ; xmm1 = 00 A1 00 R1 00 G1 00 B1 00 A0 00 R0 00 G0 00 B0
punpckhbw xmm2,xmm0 ; xmm2 = 00 A3 00 R3 00 G3 00 B3 00 A2 00 R2 00 G2 00 B2
pshuflw xmm3,xmm1,const_3_3_3_3 ; xmm3 = 00 A1 00 R1 00 G1 00 B1 00 A0 00 A0 00 A0 00 A0
pand xmm3,xmm14 ; xmm3 = 00 00 00 00 00 00 00 00 00 A0 00 A0 00 A0 00 A0
pshufhw xmm8,xmm1,const_3_3_3_3 ; xmm8 = 00 A1 00 A1 00 A1 00 A1 00 A0 00 R0 00 G0 00 B0
pand xmm8,xmm15 ; xmm8 = 00 A1 00 A1 00 A1 00 A1 00 00 00 00 00 00 00 00
por xmm3,xmm8 ; xmm3 = 00 A1 00 A1 00 A1 00 A1 00 A0 00 A0 00 A0 00 A0
pshuflw xmm4,xmm2,const_3_3_3_3 ; xmm4 = 00 A3 00 R3 00 G3 00 B3 00 A2 00 A2 00 A2 00 A2
pand xmm4,xmm14 ; xmm4 = 00 00 00 00 00 00 00 00 00 A2 00 A2 00 A2 00 A2
pshufhw xmm8,xmm2,const_3_3_3_3 ; xmm8 = 00 A3 00 A3 00 A3 00 A3 00 A2 00 R2 00 G2 00 B2
pand xmm8,xmm15 ; xmm8 = 00 A3 00 A3 00 A3 00 A3 00 00 00 00 00 00 00 00
por xmm4,xmm8 ; xmm4 = 00 A3 00 A3 00 A3 00 A3 00 A2 00 A2 00 A2 00 A2
psllw xmm3,0x8 ; xmm3 = A1 00 A1 00 A1 00 A1 00 A0 00 A0 00 A0 00 A0 00
psllw xmm4,0x8 ; xmm4 = A3 00 A3 00 A3 00 A3 00 A2 00 A2 00 A2 00 A2 00
movdqa xmm5,xmm3 ; xmm5 = A1 00 A1 00 A1 00 A1 00 A0 00 A0 00 A0 00 A0 00
movdqa xmm6,xmm4 ; xmm6 = A3 00 A3 00 A3 00 A3 00 A2 00 A2 00 A2 00 A2 00
pxor xmm5,xmm12 ; xmm5 = V1 00 V1 00 V1 00 V1 00 V0 00 V0 00 V0 00 V0 00
pxor xmm6,xmm12 ; xmm6 = V3 00 V3 00 V3 00 V3 00 V2 00 V2 00 V2 00 V2 00
movdqa xmm8,dqword [rdi+rcx] ; xmm8 = 00 R7 G7 B7 00 R6 G6 B6 00 R5 G5 B5 00 R4 G4 B4
movdqa xmm9,xmm8 ; xmm9 = 00 R7 G7 B7 00 R6 G6 B6 00 R5 G5 B5 00 R4 G4 B4
punpcklbw xmm8,xmm0 ; xmm8 = 00 00 00 R5 00 G5 00 B5 00 00 00 R4 00 G4 00 B4
paddusw xmm8,xmm11 ; xmm8 = 00 00 00 R5 00 G5 00 B5 00 00 00 R4 00 G4 00 B4
punpckhbw xmm9,xmm0 ; xmm9 = 00 00 00 R7 00 G7 00 B7 00 00 00 R6 00 G6 00 B6
paddusw xmm9,xmm11 ; xmm9 = 00 00 00 R7 00 G7 00 B7 00 00 00 R6 00 G6 00 B6
pmulhuw xmm1,xmm3 ; xmm1 = 00 II 00 X1 00 X1 00 X1 00 II 00 X0 00 X0 00 X0
pmulhuw xmm2,xmm4 ; xmm2 = 00 II 00 X3 00 X3 00 X3 00 II 00 X2 00 X2 00 X2
pmulhuw xmm8,xmm5 ; xmm8 = 00 II 00 Y1 00 Y1 00 Y1 00 II 00 Y0 00 Y0 00 Y0
pmulhuw xmm9,xmm6 ; xmm9 = 00 II 00 Y3 00 Y3 00 Y3 00 II 00 Y2 00 Y2 00 Y2
packuswb xmm1,xmm2 ; xmm1 = II X3 X3 X3 II X2 X2 X2 II X1 X1 X1 II X0 X0 X0
packuswb xmm8,xmm9 ; xmm8 = II Y3 Y3 Y3 II Y2 Y2 Y2 II Y1 Y1 Y1 II Y0 Y0 Y0
paddusb xmm1,xmm8 ; xmm1 = II Z3 Z3 Z3 II Z2 Z2 Z2 II Z1 Z1 Z1 II Z0 Z0 Z0
paddusb xmm1,xmm10 ; xmm1 = II Z3 Z3 Z3 II Z2 Z2 Z2 II Z1 Z1 Z1 II Z0 Z0 Z0
pand xmm1,xmm13 ; xmm1 = 00 Z3 Z3 Z3 00 Z2 Z2 Z2 00 Z1 Z1 Z1 00 Z0 Z0 Z0
movdqa dqword [rdi+rcx],xmm1 ; store the 4 blended pixels back to the destination buffer
add rcx,rax
cmp rcx,rdx
jb .loop
This is not fully optimised, but it shows how much can be done per iteration with 2 loads, 1 store, 4 multiplies and some addition/packing; with general-purpose registers (GPRs) the same 4 pixels would take something like 24 loads, 12 stores and 24 multiplies, though with less packing.
The key points are that memory traffic is reduced by performing fewer, larger accesses, and that the computational path is dramatically shorter: the operations inside each packed instruction are independent of one another and don't contend [as often] for a shared resource (in this case the multiplier), so they can occur simultaneously.
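For comparison, the GPR approach can be sketched as a scalar C routine. This is my reading of the arithmetic in the listing, not the driver's actual code: the name `blend_scalar` is hypothetical, and it assumes 4 bytes per pixel in memory order B, G, R, A, which is what the register comments suggest.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a small non-negative value to the 0..255 byte range. */
static uint8_t sat_u8(unsigned v)
{
    return (uint8_t)(v > 255u ? 255u : v);
}

/*
 * Hypothetical scalar reference for the blend, one channel at a time.
 * Assumes 4 bytes per pixel in memory order B, G, R, A.
 */
void blend_scalar(const uint8_t *src, uint8_t *dst, size_t pixel_count)
{
    for (size_t p = 0; p < pixel_count; p++) {
        const uint8_t *s = src + 4 * p;
        uint8_t *d = dst + 4 * p;
        unsigned a = s[3];        /* source alpha */
        unsigned inv = 255u - a;  /* inverse alpha, the "V" values in the listing */

        for (int c = 0; c < 3; c++) {
            unsigned x = (s[c] * a) >> 8;           /* source term (pmulhuw) */
            unsigned y = ((d[c] + 1u) * inv) >> 8;  /* destination term (paddusw, pmulhuw) */
            d[c] = sat_u8(sat_u8(x + y) + 1u);      /* the two saturating byte adds */
        }
        d[3] = 0;  /* the listing masks the alpha byte to zero */
    }
}
```

Leaving alpha handling aside, each pixel costs two byte loads, two multiplies and one store for each of its three colour channels, which is where a figure like 24 loads, 24 multiplies and 12 stores for 4 pixels comes from. A compiler may well vectorise a loop like this on its own, but written out this way it shows the work the SSE loop folds into a handful of packed instructions.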