Duplicate code in software projects is generally accepted to be a bad idea.  The Linux kernel is a very high quality piece of software, but it does contain some duplicated code.

Why a minimum segment size of 42 lines?

It seemed like a good place to start.  The threshold can be raised or lowered when rerunning the tools.

What tool do you use to do the kernel analysis?

I use a combination of Red Hill's Simian tool and a small bit of C code to generate the HTML files for the subpages.

Why does your webpage suck? 
Do you know your webpage sucks?

I suck at webpage design.  It will hopefully improve over time.  If you want to help it not suck, please let me know.

Why don't you use PMD's CPD, Comparator, Sim or some other tool?

PMD's CPD wasn't able to handle all of the Linux source code on my home system.  I tried several other tools, all with varying levels of success.  Simian just works for me.  The Linux kernel is also a large chunk of code, which stresses a lot of tools.  Your mileage may vary.

Where can I get the source code for your analyzer?


What command-line options do you use for Simian?

# java -mx400m -jar simian.jar "-recurse=*.c" -ignoreSubtypeNames -threshold=42

I keep running into memory errors when I run Simian.

The -mx400m option above sets the Java heap to 400 MB.  If Simian still runs out of memory on your source tree, increase that value (for example, -mx800m).
What is next for the project?

There are a couple of heuristics for preprocessing the source code to do some normalization.  Examples include: breaking each statement onto its own line, renaming all identifiers to a single placeholder, sorting input files so the order of statements matters less, and the like.  This would introduce a lot of false positives, but it might also let the tool catch code that has been slightly changed.
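The identifier-renaming idea can be sketched in a one-line sed pass.  This is a hypothetical illustration, not the project's actual preprocessor; note that it crudely renames C keywords as well as identifiers, which is one source of the false positives mentioned above.

```shell
# Replace every C-style identifier (including, crudely, keywords) with
# the placeholder "id", then collapse runs of whitespace, so code that
# differs only in names and spacing normalizes to identical text.
normalize() {
    sed -E 's/[A-Za-z_][A-Za-z0-9_]*/id/g; s/[[:space:]]+/ /g'
}

echo 'int foo = bar + baz;' | normalize   # -> id id = id + id;
echo 'int a   = b   + c;'   | normalize   # same normalized form
```

A clone detector run over the normalized text would then flag both lines as duplicates of each other, even though their identifiers differ.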

I'm also working on automating more of it, so it will be able to update itself when a new kernel release comes out without requiring manual work on my part.
