Home | History | Annotate | Download | only in sort
History log of /src/usr.bin/sort/fsort.c
RevisionDateAuthorComments
 1.47  05-Feb-2010  enami Don't touch past the end of allocated region. It results segmentation
violation.
 1.46  06-Nov-2009  joerg Retire __SCCSID. It has only archeological value now. Also retire lint
conditional around __RCSID, lint can handle that fine.
 1.45  09-Oct-2009  dsl If anyone is stupid enough to feed records longer than 8MB into sort, don't
sit in an infinite loop, instead eat memory until we have read 8 records.
 1.44  09-Oct-2009  dsl Don't give merge an empty file when we detect EOF with nothing in our
buffer.
 1.43  28-Sep-2009  dsl Fix borked fix for sort relying on realloc() changing the buffer end.
Sorts of more than 8MB data now probably work again.
 1.42  26-Sep-2009  dsl Move all the fopen() calls out of the record read routines into the callers.
Split the merge sort so that fsort() can pass the 'FILE *' of the temporary
files to be merged into the merge code.
Don't rely on realloc() not moving the end address of a buffer!
Rework merge sort so that it sorts pointers to 'struct mfile' and only
copies about sort record descriptors.
No functional change intended.
 1.41  10-Sep-2009  dsl Save length of key instead of relying of the weight of the record sep.
This frees a byte value to use for 'end of key' (to correctly sort
short keys) while still having a weight assigned to the field sep.
(Unless -t is given, the field sep is in the field data.)
Do reverse sorts by writing the output file in reverse order (rather
than reversing the sort - apart from merges).
All key compares are now unweighted.
For 'sort -u' mark duplicates keys during the sort and don't write
to the output.
Use -S to mean a posix sort - where equal keys are sorted using the
raw record (rather than being kept in the original order).
For 'sort -f' (no keys) generate a key of the folded data (as for -n
-i and -d), simplifies the code and allows a 'posix' sort.
 1.40  05-Sep-2009  dsl Now we have our own radix_sort() change the interface so that we pass
an array of 'RECHEADER *' and remove all the crappy stuff that backed up
by REC_DATA_OFFSET (etc).
Also change radix_sort() to return the number of elements, soon to be used
to drop duplicate keys (for sort -u).
 1.39  22-Aug-2009  dsl Rework the way sort generates sort keys:
- If we generate a key, it is always sortable using memcmp()
- If we are sorting the whole record, then a weight-table must be used
during compares.
- Major surgery to encoding of numbers to ensure unique keys for equal
numeric values. Reverse numerics are handled by inverting the sign.
- Case folding (-f) is handled when the sort keys are generated. No other
code has to care at all.
- Key uniqueness (-u) is done during merge for large datasets. It only
has to be done when writing the output file for small files.
Since the file is in key order this is simple!
Probably fixes all of: PR/27257 PR/25551 PR/22182 PR/31095 PR/30504
PR/36816 PR/37860 PR/39308
Also PR/18614 should no longer die, but a little more work needs to be
done on the merging for very large files.
 1.38  20-Aug-2009  dsl Delete more unwanted/unused cruft.
Simplify logic for reading input records.
Do a merge sort whenever we have 16 partial sorted blocks.
The patient is breathing, but still carrying a lot of extra weight.
 1.37  18-Aug-2009  dsl The code that attempted to sort large files by sorting each chunk by the
first key byte and writing to a temp file, then sorting the records from
each temp file that had the same first key byte (and repeating for upto
4 key bytes) was a nice idea, but completely doomed to failure.
Eg PR/9308 where a 70MB file has all but one record the same and short keys.
Not only does the code not work, it is rather guaranteed to be slow.
Instead always use a merge sort for fully sorted chunk of records (each
temporary file contains one lot of sorted records).
The -H option already did this, so just rip out all the code and variables
that can't be used when -H was specified.
Further cleanup to come ...
 1.36  16-Aug-2009  dsl 'depth' is used for the number of bytes into the key that the pointers
reference, when we want to find the record header put the larger value
into 'hdr_off' to avoid any confusion that the code might be changing
'depth'!
There is now no need to save the original value as 'odepth' in append.c.
All an a vague attempt to make this code slightly readable.
 1.35  16-Aug-2009  dsl Replace all uses of sizeof(TRECHEADER) with REC_DATA_OFFSET - which
is defined as offsetof(RECHEADER, data). Delete TRECHEADER.
 1.34  15-Aug-2009  dsl linebuf and linebuf_size are only used inside seq() - which also not
only has its own static variable, but will also extend the buffer.
Remove linebuf/size and change seq() to use a private, locally managed
buffer.
 1.33  15-Aug-2009  dsl Ansify.
I'm looking at fixing the 'sort -n' fubars, but this code is an
inpeneterable mess - which needs some fixing first!
 1.32  28-Apr-2008  martin branches: 1.32.6; 1.32.12;
Remove clause 3 and 4 from TNF licenses
 1.31  10-Jun-2005  jmc branches: 1.31.20;
Init some variables the compiler is complaining about and mark w. XXGCC as it
affects only m68k compilers.
 1.30  15-Feb-2004  jdolecek fix -Wunitialized warnings
 1.29  18-Oct-2003  itojun KNF (mostly whitespace)
 1.28  17-Oct-2003  enami Test the value returned by realloc() rather than anything else.
 1.27  16-Oct-2003  itojun safer use of realloc
 1.26  07-Aug-2003  jdolecek add TNF copyright
 1.25  07-Aug-2003  agc Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22365, verified by myself.
 1.24  24-Dec-2002  jdolecek improve previous slightly - need >= (not just >) in CHECKFSTACK()
 1.23  24-Dec-2002  jdolecek make sure we don't attempt to write past end of fstack[], error out instead

this fixes second part ('tmpdir get smashed') of bin/18614 by Michael Graff
 1.22  10-Oct-2002  jdolecek g/c extern reference to toutpath
 1.21  30-Sep-2002  enami Use the right file to output merge result.
 1.20  15-May-2001  jdolecek branches: 1.20.2;
Only try to copy the extra incomplete record data if there is anything
actually read already. Albeit it's not damaging to copy zero data
for bufend == crec->data case, the buffer end could also be between
memory position 'crec' and 'crec->data'. Thus, we could end up with
negative 'bufend - crec->data' value, and obvious havoc.

This change fixes lib/12673, though the problem was masked and no longer
repeatable with the provided example after the recent buffer size bump.
The change was tested with the buffer size change backed off, and really
fixes the problem in the PR.
 1.19  15-May-2001  jdolecek fsort(): rearrange the push code to reduce one level of intendation,
free keylist, buffer on end of work; no functional changes
 1.18  14-May-2001  jdolecek Bump the initial record buffer size to 1MB and allow it to grow to 8MB,
if needed and record count is within bounds (<MAXNUM), rather than
sorting the input by 64KB chunks. This cuts the number of needed
temporary files considerably (and improves performance, too).
Slighly adjust some #defines, mostly to power of 2 values.

This addresses bin/12673 and bin/12614, as well as complains from other
people.
 1.17  20-Feb-2001  jdolecek fsort(): don't call append() with zero nelem
This fixes the 'sort -f /dev/null' coredump reported on current-users.
 1.16  19-Feb-2001  jdolecek Pull up various cosmetic (mostly whitespace) changes from OpenBSD.
This is primarily to ease syncing the two versions.
 1.15  19-Feb-2001  jdolecek oops - wrong file, backoff local test change
 1.14  19-Feb-2001  jdolecek enterkey():
* move the test for keybuf size before keypos[-1] assignment, "just in case"
* move the keypos assignment to improve readability
 1.13  19-Feb-2001  jdolecek cosmetic changes - make keylist[] static and remove extern definition
in fsort.h, move macro SALIGN() from sort.h to fsort.c
 1.12  05-Feb-2001  itojun make sure to initialize malloc'ed region. PR 12138. found by malloc.conf=AJ
 1.11  19-Jan-2001  jdolecek use MERGE_FNUM instead of magic value 16
 1.10  18-Jan-2001  jdolecek keep bumping the record buffer up to 8 records - this is to avoid making
excessive number of temporary files for oversized records; the way the
buffer is enlarged is now also safer

initialize 'bufsize' statically, so that the value can be safely used
in e.g. msort.c:fmerge()
 1.9  13-Jan-2001  itojun fix few confusing indentation. XXX still broken
 1.8  11-Jan-2001  jdolecek general cleanup of file list passing:
* get rid of union f_handle, replace by passing explicit int parameter
and (new) struct filelist
* add new typedefs gen_func_t and put_func_t and use where appropriate
 1.7  08-Jan-2001  jdolecek by default, use stable sort
add -S flag to switch to non-stable sort; for GNU sort compatibility,
provide -s flag too
 1.6  17-Oct-2000  jdolecek fix bugs caused by implicit assumption that 'length' and
'offset' members of struct recheader/trecheader are shorts - they are size_t
now
this makes sort pass all tests in TEST/stests again after my last change

other misc cosmetic changes
 1.5  16-Oct-2000  jdolecek enlarge line buffer as necessary, so that it's possible
to process lines longer than 65522 characters
constify, rename MAXLLEN to DEFLLEN
 1.4  15-Oct-2000  jdolecek don't use register declarations
 1.3  07-Oct-2000  bjh21 Two classes of changes from the initial OpenBSD commit of this sort(1):
FILE * variables are called "fp" rather than "fd".
Better (safer) temporary-file handling.
 1.2  07-Oct-2000  bjh21 Hit sort(1) with a hammer till it compiles.
Also add RCSIDs.
 1.1  07-Oct-2000  bjh21 branches: 1.1.1;
Initial revision
 1.1.1.1  07-Oct-2000  bjh21 4.4BSD-Lite2 contrib/sort
 1.20.2.1  01-Oct-2002  lukem Pull up revision 1.21 (requested by enami in ticket #883):
Use the right file to output merge result.
 1.31.20.1  18-May-2008  yamt sync with head.
 1.32.12.2  20-May-2011  matt bring matt-nb5-mips64 up to date with netbsd-5-1-RELEASE
 1.32.12.1  21-Apr-2010  matt sync to netbsd-5
 1.32.6.2  29-Jun-2010  riz Pull up following revision(s) (requested by dholland in ticket #1420):
usr.bin/sort/sort.h: revision 1.31
usr.bin/sort/sort.c: revision 1.58
usr.bin/sort/fsort.c: revision 1.47
usr.bin/sort/msort.c: revision 1.30
Don't touch past the end of allocated region. It results segmentation
violation.
 1.32.6.1  14-Oct-2009  sborrill Pull up the following revisions(s) (requested by dsl in ticket #1084):
usr.bin/sort/Makefile: revision 1.6-1.8
usr.bin/sort/append.c: revision 1.15-1.22
usr.bin/sort/fields.c: revision 1.20-1.30
usr.bin/sort/files.c: revision 1.27-1.40
usr.bin/sort/fsort.c: revision 1.33-1.45
usr.bin/sort/fsort.h: revision 1.14-1.17
usr.bin/sort/init.c: revision 1.19-1.23
usr.bin/sort/msort.c: revision 1.19-1.28
usr.bin/sort/radix_sort.c: revision 1.1-1.4
usr.bin/sort/sort.1: revision 1.27-1.29
usr.bin/sort/sort.c: revision 1.47-1.56
usr.bin/sort/sort.h: revision 1.20-1.30
usr.bin/sort/tmp.c: revision 1.14-1.15

Only use radix sort for in-memory sort, always merge temporary files.
Use a local radixsort() function so we can pass record length.
Avoid use of weight tables for key compares.
Fix generation of keys for numbers, negate value for reverse sort.
Write file in reverse-key order for 'sort -n'.
'sort -S' now does a posix sort (sort matching keys by record data).
Ensure merge sort doesn't have too many temporary files open.
Fixes: PR#18614 PR#27257 PR#25551 PR#22182 PR#31095 PR#30504 PR#36816
PR#37860 PR#39308 PR#42094

RSS XML Feed