This blog post is a supplement to the paper, “‘Style’ Transfer for Musical Audio Using Multiple Time-Frequency Representations”.

Experiment 4.1: Musical Texture Generation

Figure 4: Columns 1 and 3: comparison of inter-onset lengths distribution and KL divergence from the source distribution for a texture generation example as the effective receptive field increases in time. Columns 2 and 4: mean local autocorrelation plots showing the increase in hierarchical rhythmic structure of the audio without any significant increase in the maximum cross-correlation value.

Example 1
Source
Textures
Example 2
Source
Textures

Experiment 4.2: Testing Key Invariance

Figure 5: Comparison of the error with different content-based representations for a task where the content and style audio is exactly the same except for key. The x-axis represents varying semi-tone offsets in musical representation. The first point on the left of 0 semi-tone offset represents a trivial problem where both content and style are exactly the same signals. We plot the error in the log-magnitude STFT representations to show the overall signal error.

Source
STFT Reconstructions
Mel Reconstructions
CQT Reconstructions
Mel + CQT Reconstructions

Experiment 4.3: Comparison of examples with best implementation.

The table below gives which content and style representation combinations work best for different types of examples, with a corresponding style transfer pair in the below examples that works best with that combination of loss terms

Example 1
Content
Style
Results
Example 2
Content
Style
Results
Example 3
Content
Style
Result
Example 4
Content
Style
Result
Example 5
Content
Style
Result
Example 6
Content
Style
Result (Content - Mel+CQT, Style - Mel (2 Resid) + CQT (3 resid)
Result (Content - STFT, Style - Mel (2 Resid) + CQT (3 resid)
Example 7
Content
Style
Result