Huge text handling #3121
| This actually looks good; I only skimmed through it quickly. One concern: I think we should log at WARN instead of FINE if a file is skipped because of limits (unless FINE is printed to the log file by default, but then I'd still like to see those as WARN on the console too). | 
| What happens if the index is created with particular limits and then the limits are changed? | 
| 
 No file is skipped. It is still included but under the  | 
| 
 I tried to describe that above, but to clarify: You can change  If you change  
 | 
| sorry, I meant "trimmed" down, not skipped | 
| @tarzanek , that's done. | 
| 
 I suppose it would be straight-forward to store a value for uncompressed size in the  | 
| 
 Oh but that would mean decompressing entirely. Probably not a good idea. | 
    | Just rebased on master since this needed revision to accommodate the  | 
| I will take a look; also needs rebase. | 
    | Just trivial conflicts upon rebase | 
    | Just rebasing for trivial conflicts related to R analyzer and then again after parallel detection merged | 
        
          
    | Rebased for trivial conflict in search.jsp | 
Also, move some logic properly to AnalyzerGuru that had crept into IndexDatabase.
    | Rebased for PageConfig.java re-lo, and git automatic-merge took care of it | 
Hello,
Please consider for integration this patch to add Huge Text file handling.
Indexer and Configuration get two new settings, hugeTextThresholdBytes (default 1_000_000) and hugeTextLimitCharacters (default 5_000_000). The threshold determines when OpenGrok will override a PLAIN genre file as a hugetext DATA file instead. The character limit determines how much to read and index for hugetext (with contextless truncation); the limit may be zero.
hugeTextThresholdBytes is checked for applicable files with each run, while no state for hugeTextLimitCharacters is stored. Changing hugeTextLimitCharacters after indexing would require touching affected source code files to revise the index.
For affected gzip and bzip2 files, changes to either hugeTextThresholdBytes or hugeTextLimitCharacters would require touching affected compressed files to revise the index.
Thank you.
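For readers who want a concrete picture of the two settings, here is a minimal sketch of how a byte threshold and a character limit could interact. It is not the code in this patch: the class HugeTextSketch, its Genre enum, and the methods classify and readLimited are hypothetical names invented for illustration, and comparing with >= at the threshold is an assumption, since the description does not say whether the boundary is inclusive.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical illustration only; these are not OpenGrok's actual classes. */
final class HugeTextSketch {
    /** Simplified stand-in for the genres mentioned in the description. */
    enum Genre { PLAIN, DATA }

    // Defaults quoted in the pull-request description.
    static final long DEFAULT_THRESHOLD_BYTES = 1_000_000L;
    static final int DEFAULT_LIMIT_CHARACTERS = 5_000_000;

    /**
     * Treats a PLAIN file as huge-text DATA once its size reaches the
     * threshold. Using >= here is an assumption about the boundary.
     */
    static Genre classify(Genre detected, long sizeBytes, long thresholdBytes) {
        if (detected == Genre.PLAIN && sizeBytes >= thresholdBytes) {
            return Genre.DATA;
        }
        return detected;
    }

    /**
     * Reads at most limitChars characters and stops there, ignoring line or
     * token boundaries (contextless truncation). A limit of zero reads nothing.
     */
    static String readLimited(Reader in, int limitChars) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        try {
            while (sb.length() < limitChars) {
                int want = Math.min(buf.length, limitChars - sb.length());
                int n = in.read(buf, 0, want);
                if (n < 0) {
                    break; // end of input before the limit was reached
                }
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

In this sketch the classification depends only on the file's current size, which mirrors the statement that the threshold is re-checked on each run, while the truncated text carries no record of the limit that produced it.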