FAQs:
Why?
Duplicate code in software projects is generally accepted to be a bad
idea. The Linux kernel is a very high quality piece of software,
but it does contain some duplicated code.
Why a minimum segment size of 42 lines?
This seemed like a good place to start. The threshold can be increased or
decreased when rerunning the tools.
What tool do you use for the kernel analysis?
I use a combination of RedHill's Simian tool and a small bit of C code
that generates the HTML files for the subpages.
Why does your webpage
suck?
Do you know your webpage
sucks?
I suck at webpage design. It will hopefully improve
over time. If you want to help it not suck please let me know.
Why don't you use PMD's CPD,
Comparator, Sim or some other tool?
PMD wasn't able to handle all of the Linux source code on my home
system. I tried several other tools all with varying levels of
success. Simian just works for me. The Linux kernel is also
a large chunk of code which stresses a lot of tools. Your mileage
may vary.
Where can I get the source
code for your analyzer?
Here.
What Command Line options do
you use for Simian?
I keep running into memory
errors when I run Simian?
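I run it with the command below. The -mx400m option sets the maximum Java
heap size to 400 MB, which is the part that addresses the memory errors.
# java -mx400m -jar simian.jar "-recurse=*.c" -ignoreSubtypeNames -threshold=42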
What is next for the project?
There are a couple of heuristics for preprocessing the source
code to normalize it before comparison. Examples include breaking each
statement onto its own line, renaming all identifiers to the same name,
and sorting the input files so that the sequence of events isn't as
important. This would introduce a lot of false positives but might
allow the tool to handle slightly changed code.
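To make two of those heuristics concrete, here is a minimal sketch of what such a preprocessing pass might look like. It is not the project's code: it is written in Python for brevity, and the function name normalize, the ID placeholder, and the keyword list are all assumptions made up for illustration. It is deliberately naive: it does not understand string literals or for (;;) headers.

```python
import re

# Illustrative subset of C keywords that should survive renaming (not complete).
C_KEYWORDS = {
    "if", "else", "for", "while", "do", "return", "break", "continue",
    "switch", "case", "default", "goto", "sizeof", "struct", "union",
    "enum", "typedef", "static", "const", "void", "char", "short", "int",
    "long", "unsigned", "signed", "float", "double",
}

# Whole-word identifiers only, so the "x10" in "0x10" is left alone.
IDENT = re.compile(r"\b[A-Za-z_]\w*\b")


def normalize(source):
    """Return a list of normalized lines for a chunk of C source (hypothetical helper)."""
    # Drop block and line comments (naive: ignores markers inside string literals).
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)
    source = re.sub(r"//[^\n]*", " ", source)
    # Rename every non-keyword identifier to the same placeholder.
    source = IDENT.sub(
        lambda m: m.group(0) if m.group(0) in C_KEYWORDS else "ID", source
    )
    # Break after ';', '{' and '}' so each statement lands on its own line.
    source = re.sub(r"([;{}])", r"\1\n", source)
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in source.splitlines())
    return [line for line in lines if line]


# Two fragments that differ only in names, comments, and layout.
a = "int total = count + 1; /* sum */ return total;"
b = "int sum = n + 1;\nreturn sum;  // done"
print(normalize(a) == normalize(b))
```

After a pass like this, the two fragments compare equal line by line, which is the "slightly changed code" case. It also shows where the false positives come from: any two statements with the same shape now look identical, whatever they operate on.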
I'm also working on automating more of it so it will be able to update
itself when a new release comes out without requiring manual work on my
part.